Single executable on all CHPC platforms

Advances in compiler technology and MPI libraries are starting to allow building single executables optimized for multiple CPU architectures and running over multiple networks. The document below summarizes how to achieve this on CHPC machines. Please note that not all compilers and MPIs allow this; therefore, we detail how to do this for those that work and what problems to expect with those that do not.

Short summary

In this document we show how to build a single executable optimized for multiple CPU architectures that runs in parallel over multiple network types. This should be beneficial for both CHPC staff and for users who need to run their applications optimally on all CHPC clusters.

Moreover, we evaluate the common Application Binary Interface (ABI) for MPI as implemented in several MPI distributions and show how a single executable can be run using several different MPI distributions without the need to recompile.

The result is a single parallel executable that can run optimally on all CHPC clusters and with three of the four MPI distributions that we support.

 

General remarks

CPU architectures change from generation to generation, affecting data/instruction processing and adding or modifying CPU instructions. A common recent trend has been improving the vectorization capabilities of the CPUs. As of mid 2015, CHPC runs three generations of Intel CPUs -- Nehalem, SandyBridge and Haswell -- each of which brings an incremental improvement in vectorization processing power that is significant enough to be worth harnessing optimally.

The two commercial compilers that CHPC licenses, Intel and PGI, both support building multiple optimized code paths for different CPU architectures into a single executable, alleviating the need to build separate executables for each CPU type.  The open source GNU compiler does not currently support this option.

Parallel programs add another complexity factor in the form of the network interface over which the parallel program (usually using an MPI library) runs. Most CHPC clusters feature a high performance InfiniBand network, but some Lonepeak cluster nodes, as well as user desktops, have only the slower Ethernet network, which in the past required separate MPI builds for each network.

Most current MPIs allow multiple network channels to be included in a single MPI build, which then makes it possible to run an executable built with such an MPI on several different networks.

As such, we are getting to a point where it is possible to build a single high performance executable that runs optimally on many CPU architectures and over many networks.

That said, due to the simplicity of build and deployment, as well as good performance, we recommend the Intel compiler and Intel MPI as the first choice for building user applications, and we plan to build most of the applications that we support in this manner.

Multi-architecture CPU optimization

Intel compilers

Intel calls this approach automatic CPU dispatch, and it is invoked with the -ax compiler flag. The current highest CPU architecture at CHPC includes the AVX2 vectorization instructions and is targeted with -axCORE-AVX2, which builds code that, according to the manual, "May generate Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2), Intel(R) AVX, SSE4.2, SSE4.1, SSE3, SSE2, SSE, and SSSE3 instructions for Intel(R) processors". What that really means, and is not documented clearly, is that the compiler produces two binary paths, one for AVX2 compatible CPUs and the other for a generic x86 CPU. So the code will run optimally using AVX2 on an AVX2 CPU, but run suboptimally using only SSE vectorization on any other CPU (including those that have AVX and higher SSE vectorization instructions).

In order to build an executable that vectorizes optimally on all CHPC clusters, that is, on the Nehalem, SandyBridge and Haswell generations of Intel Xeon CPUs, we need to add specifications for those particular architectures, i.e. -axCORE-AVX2,AVX,SSE4.2. To verify that this is indeed the case, one can enable detailed compiler reporting via the flags -diag-enable=all -qopt-report and then examine the compiler report *.optrpt files, one for each source file. There we should see the following sections reporting optimizations for the given architectures:

Begin optimization report for: main(int, char **) [core_4th_gen_avx]
Begin optimization report for: main(int, char **) [core_2nd_gen_avx]
Begin optimization report for: main(int, char **) [core_i7_sse4_2]
Begin optimization report for: main(int, char **) [generic]
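
These report sections can be produced with an invocation along the following lines; this is just an illustration, and the report file naming may differ slightly between compiler versions:

icc -O3 -ipo -axCORE-AVX2,AVX,SSE4.2 -diag-enable=all -qopt-report hello_ser.c
grep "Begin optimization report" *.optrpt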

Note that using the single optimization option -fast does not build multiple target executables; to produce highly optimized code, add the flags -O3 -ipo, e.g.

icc -O3 -ipo -axCORE-AVX2,AVX,SSE4.2 hello_ser.c -vec-report
hello_ser.c(8): (col. 5) remark: LOOP WAS VECTORIZED
hello_ser.c(4): (col. 1) remark: main has been targeted for automatic cpu dispatch

Below are results of a series of HPL benchmark runs compiled with the Intel 15.0.1 compiler, MKL 11.2 and Intel MPI 5.0.1, comparing the -ax option with the -x option, which optimizes for a specified CPU target only. We ran this benchmark on 4 or 5 different nodes, reporting the average with standard deviation, since there were small variations from run to run due to system noise. The values are in GFlops per node.

                        -axCORE-AVX2,AVX,SSE4.2   -xSSE4.2         -xAVX            -xCORE-AVX2      Speedup vs. Ember   Cores/node   Core increase
Ember Westmere          119.85+/-1.34             120.23+/-0.59                                      1.00                12           1.00
Kingspeak Sandybridge   311.63+/-3.56                              310.73+/-3.58                     2.60                16           1.33
Kingspeak Haswell       765.90+/-6.48                                               765.48+/-7.49    6.39                24           2.00

 

From the table above, it is evident that the automatic CPU dispatch (-ax) flag produced CPU specific optimized code and that the single executable runs optimally on all three platforms. Since most of this benchmark's runtime is spent in the LAPACK routines, which are part of the MKL library, it does not necessarily show the power of compiler optimization for each CPU architecture. Nevertheless, the code runs optimally on all the platforms without giving the illegal instruction error that would occur if we optimized only for the latest CPU architecture and ran on an earlier one.

It is also worth noting the increase in GFlops performance across the three generations: per-core performance doubles with the AVX instruction set (a 2 vs. 4 double precision wide vector unit) and becomes ~3x faster with AVX2 (which adds the fused multiply-add instruction, i.e. vector multiplication and addition in a single instruction).

PGI compilers

PGI compilers call this the unified binary. It is achieved by bundling the different CPU architecture names into the -tp compiler flag. For the three CHPC CPU architectures, this corresponds to -tp=nehalem,sandybridge,haswell.

Unfortunately, there are several issues with this in our environment. The first issue stems from the lack of Haswell architecture support in the GNU binutils shipped with the RHEL 6.6 OS that we run on our clusters. To overcome this, the path to updated binutils, /uufs/chpc.utah.edu/sys/installdir/binutils/2.25, must be added to the PATH variable. For this reason, it is easier to just build for Nehalem and Sandybridge, unless one wants to use the Haswell based owner-guest nodes on kingspeak (the nodes that have 24 cores).

Notice that using the single optimization option -fastsse works fine, e.g.

pgcc -fastsse -tp=nehalem-64,sandybridge-64 -Minfo=unified,vect hello_ser.c
main:
      4, PGI Unified Binary version for -tp=sandybridge-64
      8, Generated vector sse code for the loop
main:
      4, PGI Unified Binary version for -tp=nehalem-64
      8, Generated vector sse code for the loop
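
If the Haswell target is also needed, a build along the following lines should work once the updated binutils are first in the PATH; this is a sketch, and the exact -tp architecture spelling may differ between PGI versions:

setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"
pgcc -fastsse -tp=nehalem-64,sandybridge-64,haswell-64 -Minfo=unified,vect hello_ser.c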

The other problems stem from difficulty in building MPI distributions with the unified binary approach, which we'll detail below.

It is possible to work around all these issues, however, it makes PGI builds a little more cumbersome.

GNU compilers

GNU compilers (as far as we know) do not allow multiple code paths in a single executable, so one has to build optimized code for each CPU architecture separately. Also note that the gcc 4.4 series shipped with RHEL6 is relatively old; it only includes flags up to -mavx for generation of AVX code and lacks -march definitions for CPUs newer than core2, which dates back to ~2006.

We have installed newer versions of the GNU compilers in the application tree, with the gcc 4.9 series (accessible by module load gcc/4.9.2) being the most recent. This 4.9.2 version includes -march up to the latest Haswell (-march=haswell); therefore, to build for the nodes with Haswell processors, one should:

module load gcc/4.9.2
setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"
gcc -march=haswell hello_ser.c

 

MPI multiple network interfaces support

Intel MPI

Intel MPI supports multiple networks. To use it, load the modules:

module load intel impi

The choice of network is made with the I_MPI_FABRICS variable, whose allowed values are {shm, dapl, tcp, tmi, ofa}. While the best available fabric should be selected by default, we have seen cases where an application segfaults at startup when the fabric is not specified. Therefore we recommend specifying it explicitly:

For InfiniBand -- mpirun -genv I_MPI_FABRICS shm:ofa -np 2 ./latbw_impi

For Ethernet -- mpirun -genv I_MPI_FABRICS shm:tcp -np 2 ./latbw_impi

The dapl fabric is another way to run over InfiniBand and may be worth trying, since our latency/bandwidth tests show performance competitive with ofa. The default is shm:dapl, which means shared memory for intra-node communication and DAPL for inter-node. From our tests the ofa and dapl performance is almost the same, but there are some differences in synthetic benchmarks which may have one performing better than the other in certain scenarios, so it may be worth testing your application with both.
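
For example, to try the DAPL fabric explicitly:

mpirun -genv I_MPI_FABRICS shm:dapl -np 2 ./latbw_impi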

Here is a table of latency/bandwidth measured on the ash (Ash), ember (EM) and kingspeak (KP) clusters. Similar performance can be expected on other clusters.

       Ash Latency [us]   Ash Bandwidth [MB/s]   EM Latency [us]   EM Bandwidth [MB/s]   KP Latency [us]   KP Bandwidth [MB/s]
ib     1.70               3367.049               1.68              3376.022              1.20              6191.268
dapl                                                                                     1.12              6241.217
tcp    13.20              1540.396               14.60             1748.446              13.20             2421.430

Notice that tcp ran over the InfiniBand using IPoIB, therefore we see fairly decent latencies and bandwidths.

Note that the lonepeak cluster has QLogic InfiniBand, which for best performance uses a different network protocol called TMI. Using -genv I_MPI_FABRICS ofa will work, but runs at a potentially lower speed (latency ~10 us) compared to -genv I_MPI_FABRICS tmi (latency ~2 us), that is:

/uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/bin64/mpirun -genv I_MPI_FABRICS shm:tmi -np 2 -machinefile nodefile ./latbw_impi

Another trick with Intel MPI is that it allows for easy selection of the communication route in the mpirun process launcher.

The simplest mpirun launch, using Slurm by default, is:
mpirun -genv I_MPI_FABRICS ofa -np 2 ./latbw_impi

This will use the default host names (e.g. kp001,...). The ofa and dapl fabrics will use InfiniBand; tcp will run over the Ethernet.

A different way of launching can be chosen with the -bootstrap option. There are many different choices for this option; of value at CHPC are slurm (the default if nothing is specified) and ssh. To use ssh over multiple nodes, one has to prepare a host file (as we were used to in the PBS world) and feed it to mpirun:
mpirun -genv I_MPI_FABRICS tcp -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi

Now, the fun part is that, depending on how we name the nodes in the host file, tcp will run over the corresponding network. If we use host names such as kp001,..., it will run over the Ethernet; if we use kp001.ipoib,..., it will run over the InfiniBand using IPoIB.
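
As an illustration (the host names are just examples), the same mpirun command can be pointed at either network purely by editing the host file:

cat nodefile                # Ethernet host names
kp001
kp002
mpirun -genv I_MPI_FABRICS tcp -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi

cat nodefile                # IPoIB host names -- tcp now runs over InfiniBand
kp001.ipoib
kp002.ipoib
mpirun -genv I_MPI_FABRICS tcp -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi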

Due to the simplicity of build and deployment, as well as good performance, we recommend using Intel MPI in user applications and plan to build most of the applications that we support with it.

Also note that, thanks to the common Application Binary Interface (ABI) between Intel MPI (>= 5.0), MPICH (>= 3.1) and MVAPICH2 (>= 2.0), one can build a dynamic executable (the default on our systems) with one MPI and run it with another. See details at https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications.

MPICH

MPICH's most commonly used nemesis channel supports several subchannels (called netmods), including tcp, ib and mxm. The latter two use InfiniBand, with ib running over OFED and mxm over the Mellanox MXM protocol. Up to version 3.1.2, we found that the ib netmod worked with the Intel and GNU compilers; however, as of version 3.1.4 it does not compile. With PGI it does not compile even in version 3.1.2.

We therefore build MPICH with mxm and tcp, with tcp being the first choice. Please be careful with the PGI MPICH build, as we have seen some issues with it, potentially stemming from the fact that we run a relatively old OFED stack (to be updated during a future downtime). For this reason we don't recommend using MPICH on the InfiniBand clusters for production. Since the tcp netmod is the default, to run over InfiniBand set the environment variable MPICH_NEMESIS_NETMOD=mxm, e.g.
mpirun -genv MPICH_NEMESIS_NETMOD mxm -np 2 ./a.out

For a single workstation (running an MPI program with multiple processes on a single node), the network is not accessed, and therefore the program works even if IB is not present on the system.

Also note that, unlike Intel MPI or MVAPICH2, MPICH does not set task affinity to the CPUs (i.e., bind tasks to CPUs for better performance) by default. To enable task affinity, use the -bind-to flag, e.g. to bind each MPI task to a core, -bind-to core.
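
For example, to combine the MXM netmod with core binding in a single run:

mpirun -genv MPICH_NEMESIS_NETMOD mxm -bind-to core -np 2 ./a.out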

It is possible, however, to run the MPICH built executable on the clusters with InfiniBand, taking advantage of the common ABI with Intel MPI. Simply add the Intel MPI to your environment, either with
module load impi
or
source /uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/bin64/mpivars.[csh,sh]
and then run with Intel MPI as
mpirun -np 2 ./a.out

OpenMPI

OpenMPI has provided multi-network support for a while. As part of the Slurm deployment, version 1.8.4 has been built with all the CHPC supported compilers (GNU, Intel, PGI). Under Slurm, one just needs to module load compiler openmpi, followed by mpirun, without the need for host specification.

To use Ethernet, add the --mca btl tcp,self flag to mpirun; the default is to use InfiniBand.

mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # uses Ethernet
mpirun -np $SLURM_NTASKS $EXE # uses InfiniBand
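
If you prefer to be explicit about the InfiniBand transport, selecting the corresponding BTLs directly should also work (this is a sketch; openib is OpenMPI's InfiniBand BTL, while sm and self handle intra-node and self communication):

mpirun --mca btl openib,sm,self -np $SLURM_NTASKS $EXE # uses InfiniBand explicitly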

Here's a table of latency/bandwidth on ember and kingspeak. Similar performance can be expected on other clusters.

       EM Latency [us]   EM Bandwidth [MB/s]   KP Latency [us]   KP Bandwidth [MB/s]
ib     1.486             3348.124              2.962             6081.060
tcp    15.189            229.600               18.405            230.317

The TCP bandwidth is far below the InfiniBand numbers, which indicates that OpenMPI's tcp transport chose the Ethernet network to run over.

Since OpenMPI does not yet support the common MPI ABI, its executables cannot be exchanged with Intel MPI, MVAPICH2 or MPICH.

Multi-architecture CPU optimization with multi-network MPI

Intel MPI

Combining Intel MPI's multi-network support with the automatic CPU dispatch of the Intel compilers is quite straightforward and works well. During compilation, all one needs is to include the -axCORE-AVX2 flag (plus -O3 -ip for good optimization), and then at runtime select the appropriate network fabric via the I_MPI_FABRICS environment variable.
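
As a minimal sketch of the whole workflow (the source file my_app.c is a placeholder; the full dispatch flag from the compiler section is used so that the older CPUs also get optimized code paths):

module load intel impi
mpiicc -O3 -ip -axCORE-AVX2,AVX,SSE4.2 my_app.c -o my_app
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS ./my_app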

Intel MPI is thus a good choice for a single executable that runs optimized over all CHPC Linux machines. As such, most of the applications that we support are built that way.

Note that there is also the mpitune utility, which allows one to run multiple scenarios automatically and come up with the best Intel MPI runtime parameters (all MPIs feature a number of internal switches that adjust communication parameters based on message sizes, task counts, etc.). Details on the mpitune utility are at Intel's mpitune page.

MVAPICH2

The default network channel of MVAPICH2 is mrail, which only supports InfiniBand but is supposed to provide better performance than the nemesis channel inherited from MPICH (which includes the tcp, ib and mxm netmods). In reality, however, the LAMMPS benchmarks below show MPICH with mxm performing comparably.

MVAPICH2's good performance on InfiniBand makes it a good choice for a program that will run on the InfiniBand clusters. With the Intel or PGI compilers, you can build an executable that runs optimally on all three CPU architectures that our clusters have.

Note that you can still run MVAPICH2 executables on Ethernet-only clusters or on single desktops using the Intel MPI or MPICH libraries thanks to the common ABI. In case the program complains about a missing libpmi.so, set
setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/pkg/slurm/std/lib:$LD_LIBRARY_PATH" for tcsh
or export LD_LIBRARY_PATH="/uufs/chpc.utah.edu/sys/pkg/slurm/std/lib:$LD_LIBRARY_PATH" for bash,
and then run the executable as if using Intel MPI or MPICH.

MPICH

MPICH with multi-network (over MXM) and multi-architecture support works well only with the Intel compilers; GNU does not support multi-architecture builds, and PGI has problems running unified binary-built MPICH on Sandybridge and newer CPUs. Therefore we only provide MPICH built with the Intel compiler and the -axCORE-AVX2 flag, along with PGI and GNU MPICH builds using the lowest common denominator optimization, which is the Nehalem architecture. Since there should be minimal vectorization potential inside the MPI library, this should not affect performance radically. You can then build PGI unified binary applications on top of the Nehalem-optimized MPI.

OpenMPI

OpenMPI with multi-network and multi-architecture support works well with the Intel compilers. GNU does not support multi-architecture builds, and PGI has problems running unified binary-built OpenMPI on Sandybridge and newer CPUs, which is why we have built OpenMPI optimized for Nehalem, as in the case of MPICH. Build the PGI unified binary on top of the Nehalem-optimized MPI.

LAMMPS benchmark results

Below are benchmark results of the LAMMPS molecular dynamics code for jobs run on four Kingspeak Haswell and four SandyBridge nodes, using the Intel compilers and Intel MPI (-axCORE-AVX2) to build the code and running with different Intel MPI runtime options, as well as with MPICH and MVAPICH2. Total runtime and communication time in seconds are reported in the tables below; lower numbers are better.

The runs were performed as follows:

  • IMPI default
    module load intel impi
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -np $SLURM_NTASKS $EXE < in.spce
  • IMPI shm:dapl
    module load intel impi
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS $EXE < in.spce
  • IMPI shm:ofa
    module load intel impi
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS $EXE < in.spce
  • IMPI srun
    module load intel impi
    setenv I_MPI_PMI_LIBRARY /uufs/kingspeak.peaks/sys/pkg/slurm/std/lib/libpmi.so
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    srun -n $SLURM_NTASKS $EXE < in.spce
  • MPICH default
    module load intel mpich2
    setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/installdir/mpich/3.1.4i/lib:$LD_LIBRARY_PATH"
    setenv MPICH_NEMESIS_NETMOD mxm
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -np $SLURM_NTASKS $EXE < in.spce
  • MPICH affinity
    module load intel mpich2
    setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/installdir/mpich/3.1.4i/lib:$LD_LIBRARY_PATH"
    setenv MPICH_NEMESIS_NETMOD mxm
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -bind-to core -np $SLURM_NTASKS $EXE < in.spce
  • MVAPICH2 default
    module load intel mvapich2
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    srun -n $SLURM_NTASKS $EXE < in.spce

 

LAMMPS LJ
                      96 procs, Haswell       64 procs, Sandybridge
                      Total      Comm         Total      Comm
IMPI default          41.02      10.46        73.61      23.40      core affinity
IMPI shm:dapl         40.79      10.25        73.55      23.31      core affinity
IMPI shm:ofa          42.79      13.70        70.61      22.94      core affinity
IMPI srun             64.15      30.48        91.71      38.14      no affinity
MPICH mxm             64.93      31.07        89.07      36.36      no affinity
MPICH mxm bind core   45.66      14.63        74.96      24.59      core affinity
MVAPICH2 default      42.61      12.30        74.67      25.19      core affinity

LAMMPS SPCE
                      96 procs, Haswell       64 procs, Sandybridge
                      Total      Comm         Total      Comm
IMPI default          60.39      2.61         76.91      3.71       core affinity
IMPI shm:dapl         60.16      2.60         76.78      3.72       core affinity
IMPI shm:ofa          60.68      2.77         74.60      3.41       core affinity
IMPI srun             84.37      4.14         100.87     4.76       no affinity
MPICH mxm             85.82      5.17         106.45     5.48       no affinity
MPICH mxm bind core   60.77      3.69         75.94      4.18       core affinity
MVAPICH2 default      60.58      3.03         78.09      4.44       core affinity

 

 There are several observations that can be made:

  • Some MPI distributions set process affinity by default (Intel MPI, MVAPICH2, OpenMPI), while others do not (MPICH). Slurm is currently not set up for task affinity, and as such srun with Intel MPI does not perform well (this is to be retested once we roll out Slurm affinity to the clusters). MVAPICH2 overrides the Slurm affinity settings and sets its own.
  • Using the process affinity is a good idea for performance reasons.
  • Intel MPI's ofa and dapl fabrics perform comparably; in some cases one is slightly better than the other, and vice versa.
  • The MPICH mxm netmod is close to competitive with Intel MPI; it may improve with the MPICH 3.2 release (an mxm contribution from Mellanox) and/or a future update of our Mellanox OFED stack.
  • MVAPICH2 is not as fast as we had hoped; Intel MPI is faster for LAMMPS.

 

Recommendations for different scenarios

To summarize the material presented above, here are our recommendations for optimal performance for several common scenarios.

For an optimized application running on all CHPC Linux machines

Use the Intel compilers with Intel MPI and the automatic CPU dispatch -axCORE-AVX2,AVX,SSE4.2 compiler flag. It is simple and it works. Test the dapl and ofa fabrics with the I_MPI_FABRICS option to mpirun and pick the one that performs better.
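
A minimal Slurm batch script following this recommendation might look like the following sketch (node and task counts, wall time and the executable name are placeholders):

#!/bin/tcsh
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --time=1:00:00
module load intel impi
mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS ./my_app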

Alternatively, all the other MPI distributions (MVAPICH2, MPICH, OpenMPI) work with the Intel compiler and the -axCORE-AVX2,AVX,SSE4.2 option as well. If you use MVAPICH2, you can run the same executable on the Ethernet clusters and desktops using MPICH or Intel MPI thanks to the common ABI (see the Intel MPI section).

For an optimized application using GNU compilers

Since GNU does not allow multi-architecture optimization, you have to build separate executables for Ember, the Kingspeak SandyBridge nodes (16, 20 cores) and the Kingspeak Haswell nodes (24 cores). Also, the OS-shipped GNU version 4.4.4 includes vectorization flags only up to AVX (-msse4.2 for Ember, -mavx for the Kingspeak SandyBridge nodes) and no AVX2 (Kingspeak Haswell nodes). For better optimization opportunities, consider using GNU 4.9.2 (available by loading module gcc/4.9.2). This version includes the -march flag with which one can specify the given CPU architecture (-march=westmere, -march=sandybridge, -march=haswell), though for the Haswell optimization one also needs the newer GNU binutils (/uufs/chpc.utah.edu/sys/installdir/binutils/2.25), since the ones shipped with RHEL6 don't support AVX2 instructions.
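
A sketch of the three per-architecture builds with gcc 4.9.2 (my_app.c and the output names are placeholders):

module load gcc/4.9.2
setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"   # needed for -march=haswell
gcc -O3 -march=westmere    my_app.c -o my_app.westmere      # Ember
gcc -O3 -march=sandybridge my_app.c -o my_app.sandybridge   # Kingspeak SandyBridge nodes
gcc -O3 -march=haswell     my_app.c -o my_app.haswell       # Kingspeak Haswell nodes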

As for which MPI to use on the clusters, use the MVAPICH2 built in the chpc.utah.edu branch, which was built for the lowest common denominator CPU (Ember); however, we don't expect the MPI library to require much optimization.

If you want an executable that runs both over InfiniBand and over Ethernet, try MPICH or OpenMPI; however, be aware that the MPICH InfiniBand channel (ib) is not as optimized as MVAPICH2. You can also build with MPICH but run with Intel MPI or MVAPICH2 over the InfiniBand thanks to the common ABI (see the Intel MPI discussion above).

OpenMPI seems to do its own optimizations, so it may be a better pick. However, be aware that we have seen problems with OpenMPI when doing communication from multi-threaded MPI tasks.

For an optimized application using PGI compilers

The PGI unified binary works fairly well for applications; however, we did not have much success building the MPI libraries with it, so the MPIs in the chpc.utah.edu branch have been built for the lowest common denominator (Ember). We don't expect any performance impact from using these MPI builds.

To build an application that will run optimized on all CHPC InfiniBand clusters, use MVAPICH2 with the -tp=nehalem,sandybridge,haswell compiler flag. Note that PGI requires the updated binutils (/uufs/chpc.utah.edu/sys/installdir/binutils/2.25) to build for the Haswell architecture.
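
A sketch of such a build, assuming the MVAPICH2 mpicc wrapper and module names analogous to those used elsewhere in this document (my_app.c is a placeholder):

module load pgi mvapich2
setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"   # updated binutils for the Haswell target
mpicc -fastsse -tp=nehalem,sandybridge,haswell my_app.c -o my_app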

OpenMPI would be another good choice.

Try to avoid MPICH (except for the tcp channel), since its ib channel does not build with PGI and does not seem to be supported by the developers.