Single executable on all CHPC platforms

Advances in compiler technology and MPI libraries are starting to allow building single executables optimized for multiple CPU architectures and running over multiple networks. The document below summarizes how to achieve this on CHPC machines. Please note that not all compilers and MPIs allow this; therefore, we detail how to do this for those that work and what problems to expect with those that do not.

Short summary

In this document we show how to build a single executable optimized for multiple CPU architectures that runs in parallel over multiple network types. This should be beneficial for both CHPC staff and for users who need to run their applications optimally on all CHPC clusters.

Moreover, we evaluate the common Application Binary Interface (ABI) for MPI as implemented in several MPI distributions and show how a single executable can be run using several different MPI distributions without the need to recompile.

The result is a single parallel executable that can run optimally on all CHPC clusters and with three of the four MPI distributions that we support.

 

General remarks

CPU architectures change from generation to generation, affecting data/instruction processing and adding or modifying CPU instructions. A common recent trend has been improving the vectorization capabilities of the CPUs. As of mid 2015, CHPC runs three generations of Intel CPUs -- Nehalem, SandyBridge and Haswell -- each of which brings an incremental improvement in vectorization processing power that is significant enough to be worth harnessing optimally.

The two commercial compilers that CHPC licenses, Intel and PGI, both support building multiple optimized code paths for different CPU architectures into a single executable, alleviating the need to build separate executables for each CPU type.  The open source GNU compiler does not currently support this option.

Parallel programs add another complexity factor in the form of the network interface over which the parallel program (usually using an MPI library) runs. Most CHPC clusters feature a high performance InfiniBand network, but some Lonepeak cluster nodes, as well as user desktops, have only the slower Ethernet network, which in the past required separate MPI builds for each network.

Most current MPIs allow multiple network channels to be included in a single MPI build, which then makes it possible to run an executable built with such an MPI on several different networks.

As such, we are getting to a point where it is possible to build a single high performance executable that runs optimally on many CPU architectures and over many networks.

That said, due to the simplicity of build and deployment, as well as good performance, we recommend the Intel compiler and Intel MPI as the first choice for building user applications, and we plan to build most of the applications that we support in this manner.

Multi-architecture CPU optimization

Intel compilers

Intel calls this approach automatic CPU dispatch, and it is invoked with the -ax compiler flag. The current highest CPU architecture at CHPC includes the AVX2 vectorization instructions and is targeted with -axCORE-AVX2, which builds code that, according to the manual, "May generate Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2), Intel(R) AVX, SSE4.2, SSE4.1, SSE3, SSE2, SSE, and SSSE3 instructions for Intel(R) processors". What that really means, and is not documented clearly, is that the compiler produces two binary paths, one for AVX2 compatible CPUs and the other for a generic x86 CPU. So the code will run optimally using AVX2 on an AVX2 CPU, but run suboptimally using only SSE vectorization on any other CPU (including those that have AVX and higher SSE vectorization instructions).

In order to build an executable that vectorizes optimally on all CHPC clusters, that is, on the Nehalem, SandyBridge and Haswell generations of Intel Xeon CPUs, we need to add specifications for those particular architectures, i.e. -axCORE-AVX2,AVX,SSE4.2. To verify that this is indeed the case, one can enable detailed compiler reporting via the flags -diag-enable=all -qopt-report and then examine the compiler report *.optrpt files, one for each source file. There we should see the following sections reporting optimizations for the given architectures:

Begin optimization report for: main(int, char **) [core_4th_gen_avx]
Begin optimization report for: main(int, char **) [core_2nd_gen_avx]
Begin optimization report for: main(int, char **) [core_i7_sse4_2]
Begin optimization report for: main(int, char **) [generic]
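
These report sections can be produced with an invocation along the following lines; this is just an illustration, and the report file naming may differ slightly between compiler versions:

icc -O3 -ipo -axCORE-AVX2,AVX,SSE4.2 -diag-enable=all -qopt-report hello_ser.c
grep "Begin optimization report" *.optrpt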

Note that using the single optimization option -fast does not build multiple target executables; to produce highly optimized code, add the flags -O3 -ipo, e.g.

icc -O3 -ipo -axCORE-AVX2,AVX,SSE4.2 hello_ser.c -vec-report
hello_ser.c(8): (col. 5) remark: LOOP WAS VECTORIZED
hello_ser.c(4): (col. 1) remark: main has been targeted for automatic cpu dispatch

Below are results of a series of HPL benchmark runs compiled with the Intel 15.0.1 compiler, MKL 11.2 and Intel MPI 5.0.1, comparing the -ax option with the -x option, which optimizes for a specified CPU target only. We ran this benchmark on 4 or 5 different nodes, reporting the average with standard deviation, since there were small variations from run to run due to system noise. The values are in GFlops per node.

                        -axCORE-AVX2,AVX,SSE4.2   -xSSE4.2         -xAVX            -xCORE-AVX2      Speedup vs. Ember   Cores/node   Core increase
Ember Westmere          119.85+/-1.34             120.23+/-0.59                                      1.00                12           1.00
Kingspeak Sandybridge   311.63+/-3.56                              310.73+/-3.58                     2.60                16           1.33
Kingspeak Haswell       765.90+/-6.48                                               765.48+/-7.49    6.39                24           2.00

 

From the table above, it is evident that the automatic CPU dispatch (-ax) flag produced CPU specific optimized code and that the single executable runs optimally on all three platforms. Since most of this benchmark's runtime is spent in the LAPACK routines, which are part of the MKL library, it does not necessarily show the power of compiler optimization for each CPU architecture. Nevertheless, the code runs optimally on all the platforms without giving the illegal instruction error that would occur if we optimized only for the latest CPU architecture and ran on an earlier one.

It is also worth noting the increase in GFlops performance across the three generations: per-core performance doubles with the AVX instruction set (a 2 vs. 4 double precision wide vector unit) and becomes ~3x faster with AVX2 (which adds the fused multiply-add instruction, i.e. vector multiplication and addition in a single instruction).

PGI compilers

PGI compilers call this the unified binary. It is achieved by bundling the different CPU architecture names into the -tp compiler flag. For the three CHPC CPU architectures, this corresponds to -tp=nehalem,sandybridge,haswell.

Unfortunately, there are several issues with this in our environment. The first issue stems from the lack of Haswell architecture support in the GNU binutils shipped with the RHEL 6.6 OS that we run on our clusters. To overcome this, the path to updated binutils, /uufs/chpc.utah.edu/sys/installdir/binutils/2.25, must be added to the PATH variable. For this reason, it is easier to just build for Nehalem and Sandybridge, unless one wants to use the Haswell based owner-guest nodes on kingspeak (the nodes that have 24 cores).

Notice that using the single optimization option -fastsse works fine, e.g.

pgcc -fastsse -tp=nehalem-64,sandybridge-64 -Minfo=unified,vect hello_ser.c
main:
      4, PGI Unified Binary version for -tp=sandybridge-64
      8, Generated vector sse code for the loop
main:
      4, PGI Unified Binary version for -tp=nehalem-64
      8, Generated vector sse code for the loop
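
If the Haswell target is also needed, a build along the following lines should work once the updated binutils are first in the PATH; this is a sketch, and the exact -tp architecture spelling may differ between PGI versions:

setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"
pgcc -fastsse -tp=nehalem-64,sandybridge-64,haswell-64 -Minfo=unified,vect hello_ser.c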

The other problems stem from difficulty in building MPI distributions with the unified binary approach, which we'll detail below.

It is possible to work around all these issues, however, it makes PGI builds a little more cumbersome.

GNU compilers

GNU compilers (as far as we know) do not allow multiple code paths in a single executable, so one has to build optimized code for each CPU architecture separately. Also note that the gcc 4.4 series shipped with RHEL6 is relatively old; it only includes flags up to -mavx for generation of AVX code and lacks -march definitions for CPUs newer than core2, which dates back to ~2006.

We have installed newer versions of the GNU compilers in the application tree, with the gcc 4.9 series (accessible by module load gcc/4.9.2) being the most recent. This 4.9.2 version includes -march up to the latest Haswell (-march=haswell); therefore, to build for the nodes with Haswell processors, one should:

module load gcc/4.9.2
setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"
gcc -march=haswell hello_ser.c

 

MPI multiple network interfaces support

Intel MPI

Intel MPI supports multiple networks. To use it, load the modules:

module load intel impi

The choice of network is made with the I_MPI_FABRICS variable, whose allowed values are {shm, dapl, tcp, tmi, ofa}. While the best available fabric should be selected by default, we have seen cases where an application segfaults at startup when the fabric is not specified. Therefore we recommend specifying it explicitly:

For InfiniBand -- mpirun -genv I_MPI_FABRICS shm:ofa -np 2 ./latbw_impi

For Ethernet -- mpirun -genv I_MPI_FABRICS shm:tcp -np 2 ./latbw_impi

The dapl fabric is another way to run over InfiniBand and may be worth trying, since our latency/bandwidth tests show performance competitive with ofa. The default is shm:dapl, which means shared memory for intra-node communication and DAPL for inter-node. From our tests the ofa and dapl performance is almost the same, but there are some differences in synthetic benchmarks which may have one performing better than the other in certain scenarios, so it may be worth testing your application with both.
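
For example, to try the DAPL fabric explicitly:

mpirun -genv I_MPI_FABRICS shm:dapl -np 2 ./latbw_impi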

Here is a table of latency/bandwidth measured on the ash (Ash), ember (EM) and kingspeak (KP) clusters. Similar performance can be expected on other clusters.

       Ash Latency [us]   Ash Bandwidth [MB/s]   EM Latency [us]   EM Bandwidth [MB/s]   KP Latency [us]   KP Bandwidth [MB/s]
ib     1.70               3367.049               1.68              3376.022              1.20              6191.268
dapl                                                                                     1.12              6241.217
tcp    13.20              1540.396               14.60             1748.446              13.20             2421.430

Notice that tcp ran over the InfiniBand using IPoIB, therefore we see fairly decent latencies and bandwidths.

Note that the lonepeak cluster has QLogic InfiniBand, which for best performance uses a different network protocol called TMI. Using -genv I_MPI_FABRICS ofa will work, but runs at a potentially lower speed (latency ~10 us) compared to -genv I_MPI_FABRICS tmi (latency ~2 us), that is:

/uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/bin64/mpirun -genv I_MPI_FABRICS shm:tmi -np 2 -machinefile nodefile ./latbw_impi

Another trick with Intel MPI is that it allows for easy selection of the communication route in the mpirun process launcher.

The simplest mpirun launch, using Slurm by default, is:
mpirun -genv I_MPI_FABRICS ofa -np 2 ./latbw_impi

This will use the default host names (e.g. kp001,...). The ofa and dapl fabrics will use InfiniBand; tcp will run over the Ethernet.

A different way of launching can be chosen with the -bootstrap option. There are many different choices for this option; of value at CHPC are slurm (the default if nothing is specified) and ssh. To use ssh over multiple nodes, one has to prepare a host file (as we were used to in the PBS world) and feed it to mpirun:
mpirun -genv I_MPI_FABRICS tcp -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi

Now, the fun part is that, depending on how we name the nodes in the host file, tcp will run over the corresponding network. If we use host names such as kp001,..., it will run over the Ethernet; if we use kp001.ipoib,..., it will run over the InfiniBand using IPoIB.
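
As an illustration (the host names are just examples), the same mpirun command can be pointed at either network purely by editing the host file:

cat nodefile                # Ethernet host names
kp001
kp002
mpirun -genv I_MPI_FABRICS tcp -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi

cat nodefile                # IPoIB host names -- tcp now runs over InfiniBand
kp001.ipoib
kp002.ipoib
mpirun -genv I_MPI_FABRICS tcp -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi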

Due to the simplicity of build and deployment, as well as good performance, we recommend using Intel MPI in user applications and plan to build most of the applications that we support with it.

Also note that, thanks to the common Application Binary Interface (ABI) between Intel MPI (>= 5.0), MPICH (>= 3.1) and MVAPICH2 (>= 2.0), one can build a dynamic executable (the default on our systems) with one MPI and run it with another. See details at https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications.

MPICH

MPICH's most commonly used nemesis channel supports several subchannels (called netmods), including tcp, ib and mxm. The latter two use InfiniBand, with ib running over OFED and mxm over the Mellanox MXM protocol. Up to version 3.1.2, we found that the ib netmod worked with the Intel and GNU compilers; however, as of version 3.1.4 it does not compile. With PGI it does not compile even in version 3.1.2.

We therefore build MPICH with mxm and tcp, with tcp being the first choice. Please be careful with the PGI MPICH build, as we have seen some issues with it, potentially stemming from the fact that we run a relatively old OFED stack (to be updated during a future downtime). For this reason we don't recommend using MPICH on the InfiniBand clusters for production. Since the tcp netmod is the default, to run over InfiniBand set the environment variable MPICH_NEMESIS_NETMOD=mxm, e.g.
mpirun -genv MPICH_NEMESIS_NETMOD mxm -np 2 ./a.out

For a single workstation (running an MPI program with multiple processes on a single node), the network is not accessed, and therefore the program works even if IB is not present on the system.

Also note that, unlike Intel MPI or MVAPICH2, MPICH does not set task affinity to the CPUs (i.e., bind tasks to CPUs for better performance) by default. To enable task affinity, use the -bind-to flag, e.g. to bind each MPI task to a core, -bind-to core.
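
For example, to combine the MXM netmod with core binding in a single run:

mpirun -genv MPICH_NEMESIS_NETMOD mxm -bind-to core -np 2 ./a.out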

It is possible, however, to run the MPICH built executable on the clusters with InfiniBand, taking advantage of the common ABI with Intel MPI. Simply add the Intel MPI to your environment, either with
module load impi
or
source /uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/bin64/mpivars.[csh,sh]
and then run with Intel MPI as
mpirun -np 2 ./a.out

OpenMPI

OpenMPI has provided multi-network support for a while. As part of the Slurm deployment, version 1.8.4 has been built with all the CHPC supported compilers (GNU, Intel, PGI). Under Slurm, one just needs to module load compiler openmpi, followed by mpirun, without the need for host specification.

To use Ethernet, add the --mca btl tcp,self flag to mpirun; the default is to use InfiniBand.

mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # uses Ethernet
mpirun -np $SLURM_NTASKS $EXE # uses InfiniBand
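
If you prefer to be explicit about the InfiniBand transport, selecting the corresponding BTLs directly should also work (this is a sketch; openib is OpenMPI's InfiniBand BTL, while sm and self handle intra-node and self communication):

mpirun --mca btl openib,sm,self -np $SLURM_NTASKS $EXE # uses InfiniBand explicitly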

Here's a table of latency/bandwidth on ember and kingspeak. Similar performance can be expected on other clusters.

       EM Latency [us]   EM Bandwidth [MB/s]   KP Latency [us]   KP Bandwidth [MB/s]
ib     1.486             3348.124              2.962             6081.060
tcp    15.189            229.600               18.405            230.317

The TCP bandwidth is far below the InfiniBand numbers, which indicates that OpenMPI's tcp transport chose the Ethernet network to run over.

Since OpenMPI does not yet support the common MPI ABI, its executables cannot be exchanged with Intel MPI, MVAPICH2 or MPICH.

Multi-architecture CPU optimization with multi-network MPI

Intel MPI

Combining Intel MPI's multi-network support with the automatic CPU dispatch of the Intel compilers is quite straightforward and works well. During compilation, all one needs is to include the -axCORE-AVX2 flag (plus -O3 -ip for good optimization), and then at runtime select the appropriate network fabric via the I_MPI_FABRICS environment variable.
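
As a minimal sketch of the whole workflow (the source file my_app.c is a placeholder; the full dispatch flag from the compiler section is used so that the older CPUs also get optimized code paths):

module load intel impi
mpiicc -O3 -ip -axCORE-AVX2,AVX,SSE4.2 my_app.c -o my_app
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS ./my_app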

Intel MPI is thus a good choice for a single executable that runs optimized over all CHPC Linux machines. As such, most of the applications that we support are built that way.

Note that there is also the mpitune utility, which allows one to run multiple scenarios automatically and come up with the best Intel MPI runtime parameters (all MPIs feature a number of internal switches that adjust communication parameters based on message sizes, task counts, etc.). Details on the mpitune utility are at Intel's mpitune page.

MVAPICH2

The default network channel of MVAPICH2 is mrail, which only supports InfiniBand but is supposed to provide better performance than the nemesis channel inherited from MPICH (which includes the tcp, ib and mxm netmods). In reality, however, the LAMMPS benchmarks below show MPICH with mxm performing comparably.

MVAPICH2's good performance on InfiniBand makes it a good choice for a program that will run on the InfiniBand clusters. With the Intel or PGI compilers, you can build an executable that runs optimally on all three CPU architectures that our clusters have.

Note that you can still run MVAPICH2 executables on Ethernet-only clusters or on single desktops using the Intel MPI or MPICH libraries thanks to the common ABI. In case the program complains about a missing libpmi.so, set
setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/pkg/slurm/std/lib:$LD_LIBRARY_PATH" for tcsh
or export LD_LIBRARY_PATH="/uufs/chpc.utah.edu/sys/pkg/slurm/std/lib:$LD_LIBRARY_PATH" for bash,
and then run the executable as if using Intel MPI or MPICH.

MPICH

MPICH with multi-network (over MXM) and multi-architecture support works well only with the Intel compilers; GNU does not support multi-architecture builds, and PGI has problems running unified binary-built MPICH on Sandybridge and newer CPUs. Therefore we only provide MPICH built with the Intel compiler and the -axCORE-AVX2 flag, along with PGI and GNU MPICH builds using the lowest common denominator optimization, which is the Nehalem architecture. Since there should be minimal vectorization potential inside the MPI library, this should not affect performance radically. You can then build PGI unified binary applications on top of the Nehalem-optimized MPI.

OpenMPI

OpenMPI with multi-network and multi-architecture support works well with the Intel compilers. GNU does not support multi-architecture builds, and PGI has problems running unified binary-built OpenMPI on Sandybridge and newer CPUs, which is why we have built OpenMPI optimized for Nehalem, as in the case of MPICH. Build the PGI unified binary on top of the Nehalem-optimized MPI.

LAMMPS benchmark results

Below are benchmark results of the LAMMPS molecular dynamics code for jobs run on four Kingspeak Haswell and four SandyBridge nodes, using the Intel compilers and Intel MPI (-axCORE-AVX2) to build the code and running with different Intel MPI runtime options, as well as with MPICH and MVAPICH2. Total runtime and communication time in seconds are reported in the tables below; lower numbers are better.

The runs were performed as follows:

  • IMPI default
    module load intel impi
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -np $SLURM_NTASKS $EXE < in.spce
  • IMPI shm:dapl
    module load intel impi
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS $EXE < in.spce
  • IMPI shm:ofa
    module load intel impi
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS $EXE < in.spce
  • IMPI srun
    module load intel impi
    setenv I_MPI_PMI_LIBRARY /uufs/kingspeak.peaks/sys/pkg/slurm/std/lib/libpmi.so
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    srun -n $SLURM_NTASKS $EXE < in.spce
  • MPICH default
    module load intel mpich2
    setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/installdir/mpich/3.1.4i/lib:$LD_LIBRARY_PATH"
    setenv MPICH_NEMESIS_NETMOD mxm
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -np $SLURM_NTASKS $EXE < in.spce
  • MPICH affinity
    module load intel mpich2
    setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/installdir/mpich/3.1.4i/lib:$LD_LIBRARY_PATH"
    setenv MPICH_NEMESIS_NETMOD mxm
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    mpirun -bind-to core -np $SLURM_NTASKS $EXE < in.spce
  • MVAPICH2 default
    module load intel mvapich2
    set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
    srun -n $SLURM_NTASKS $EXE < in.spce

 

LAMMPS LJ
                      96 procs, Haswell       64 procs, Sandybridge
                      Total      Comm         Total      Comm
IMPI default          41.02      10.46        73.61      23.40      core affinity
IMPI shm:dapl         40.79      10.25        73.55      23.31      core affinity
IMPI shm:ofa          42.79      13.70        70.61      22.94      core affinity
IMPI srun             64.15      30.48        91.71      38.14      no affinity
MPICH mxm             64.93      31.07        89.07      36.36      no affinity
MPICH mxm bind core   45.66      14.63        74.96      24.59      core affinity
MVAPICH2 default      42.61      12.30        74.67      25.19      core affinity

LAMMPS SPCE
                      96 procs, Haswell       64 procs, Sandybridge
                      Total      Comm         Total      Comm
IMPI default          60.39      2.61         76.91      3.71       core affinity
IMPI shm:dapl         60.16      2.60         76.78      3.72       core affinity
IMPI shm:ofa          60.68      2.77         74.60      3.41       core affinity
IMPI srun             84.37      4.14         100.87     4.76       no affinity
MPICH mxm             85.82      5.17         106.45     5.48       no affinity
MPICH mxm bind core   60.77      3.69         75.94      4.18       core affinity
MVAPICH2 default      60.58      3.03         78.09      4.44       core affinity

 

 There are several observations that can be made:

  • Some MPI distributions set process affinity by default (Intel MPI, MVAPICH2, OpenMPI), while others do not (MPICH). Slurm is currently not set up for task affinity, and as such srun with Intel MPI does not perform well (this is to be retested once we roll out Slurm affinity to the clusters). MVAPICH2 overrides the Slurm affinity settings and sets its own.
  • Using the process affinity is a good idea for performance reasons.
  • Intel MPI's ofa and dapl fabrics perform comparably; in some cases one is slightly better than the other, and vice versa.
  • The MPICH mxm netmod is close to competitive with Intel MPI; it may improve with the MPICH 3.2 release (an mxm contribution from Mellanox) and/or a future update of our Mellanox OFED stack.
  • MVAPICH2 is not as fast as we had hoped; Intel MPI is faster for LAMMPS.

 

Recommendations for different scenarios

To summarize the material presented above, here are our recommendations for optimal performance for several common scenarios.

For an optimized application running on all CHPC Linux machines

Use the Intel compilers with Intel MPI and the automatic CPU dispatch -axCORE-AVX2,AVX,SSE4.2 compiler flag. It is simple and it works. Test the dapl and ofa fabrics with the I_MPI_FABRICS option to mpirun and pick the one that performs better.
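
A minimal Slurm batch script following this recommendation might look like the following sketch (node and task counts, wall time and the executable name are placeholders):

#!/bin/tcsh
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --time=1:00:00
module load intel impi
mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS ./my_app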

Alternatively, all the other MPI distributions (MVAPICH2, MPICH, OpenMPI) work with the Intel compiler and the -axCORE-AVX2,AVX,SSE4.2 option as well. If you use MVAPICH2, you can run the same executable on the Ethernet clusters and desktops using MPICH or Intel MPI thanks to the common ABI (see the Intel MPI section).

For an optimized application using GNU compilers

Since GNU does not allow multi-architecture optimization, you have to build separate executables for Ember, the Kingspeak SandyBridge nodes (16, 20 cores) and the Kingspeak Haswell nodes (24 cores). Also, the OS-shipped GNU version 4.4.4 includes vectorization flags only up to AVX (-msse4.2 for Ember, -mavx for the Kingspeak SandyBridge nodes) and no AVX2 (Kingspeak Haswell nodes). For better optimization opportunities, consider using GNU 4.9.2 (available by loading module gcc/4.9.2). This version includes the -march flag with which one can specify the given CPU architecture (-march=westmere, -march=sandybridge, -march=haswell), though for the Haswell optimization one also needs the newer GNU binutils (/uufs/chpc.utah.edu/sys/installdir/binutils/2.25), since the ones shipped with RHEL6 don't support AVX2 instructions.
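
A sketch of the three per-architecture builds with gcc 4.9.2 (my_app.c and the output names are placeholders):

module load gcc/4.9.2
setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"   # needed for -march=haswell
gcc -O3 -march=westmere    my_app.c -o my_app.westmere      # Ember
gcc -O3 -march=sandybridge my_app.c -o my_app.sandybridge   # Kingspeak SandyBridge nodes
gcc -O3 -march=haswell     my_app.c -o my_app.haswell       # Kingspeak Haswell nodes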

As for which MPI to use on the clusters, use the MVAPICH2 built in the chpc.utah.edu branch, which was built for the lowest common denominator CPU (Ember); however, we don't expect the MPI library to require much optimization.

If you want an executable that runs both over InfiniBand and over Ethernet, try MPICH or OpenMPI; however, be aware that the MPICH InfiniBand channel (ib) is not as optimized as MVAPICH2. You can also build with MPICH but run with Intel MPI or MVAPICH2 over the InfiniBand thanks to the common ABI (see the Intel MPI discussion above).

OpenMPI seems to do its own optimizations, so it may be a better pick. However, be aware that we have seen problems with OpenMPI when doing communication from multi-threaded MPI tasks.

For an optimized application using PGI compilers

The PGI unified binary works fairly well for applications; however, we did not have much success building the MPI libraries with it, so the MPIs in the chpc.utah.edu branch have been built for the lowest common denominator (Ember). We don't expect any performance impact from using these MPI builds.

To build an application that will run optimized on all CHPC InfiniBand clusters, use MVAPICH2 with the -tp=nehalem,sandybridge,haswell compiler flag. Note that PGI requires the updated binutils (/uufs/chpc.utah.edu/sys/installdir/binutils/2.25) to build for the Haswell architecture.
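
A sketch of such a build, assuming the MVAPICH2 mpicc wrapper and module names analogous to those used elsewhere in this document (my_app.c is a placeholder):

module load pgi mvapich2
setenv PATH "/uufs/chpc.utah.edu/sys/installdir/binutils/2.25/bin:$PATH"   # updated binutils for the Haswell target
mpicc -fastsse -tp=nehalem,sandybridge,haswell my_app.c -o my_app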

OpenMPI would be another good choice.

Try to avoid MPICH (except for the tcp channel), since its ib channel does not build with PGI and does not seem to be supported by the developers.