MPI

Message Passing Interface (MPI) is the principal method of performing parallel computations on all CHPC clusters. Its main component is a standardized library that enables communication between processes in distributed-memory environments. Numerous MPI distributions are available, so CHPC supports only some of them, namely those we believe are best suited for a particular system.

More information: MPI standard page.

The CHPC clusters utilize two types of network interconnects, Ethernet and InfiniBand. Except for some nodes on Lonepeak, all clusters have InfiniBand and users should use it since it is much faster than Ethernet.

We provide a number of MPI distributions with InfiniBand support: MVAPICH2, OpenMPI, MPICH2 and Intel MPI. All have fairly similar performance; however, MVAPICH2, MPICH2 and Intel MPI seem to be more stable with multi-threaded programs. For each of these MPI distributions there is a general build that works on all CHPC clusters, as well as cluster-specific builds of OpenMPI and MVAPICH2 that target the cluster-specific CPUs for optimal performance. The CPU optimizations in the MPI calls should have only a minor effect on performance, which is why we are moving towards the general builds. On top of that, the Intel compiler allows for multiple CPU target optimizations in one library, which is also how we build the MPIs. Builds with the GNU and PGI compilers are optimized for the lowest common denominator, which is the Ember cluster.

The general builds of OpenMPI, MPICH and Intel MPI also support multiple network interfaces in a single build; usage is described on this page.

More information is available from the developers of each MPI distribution.

Sourcing MPI and Compiling

Before performing any work with MPI, users need to source the MPI distribution appropriate for their needs. Each of the distributions has its pros and cons. Intel MPI has good performance and very flexible usage, but it is a commercial product that we have to license. MVAPICH2 is optimized for InfiniBand, but it does not provide flexible process/core affinity in multi-threaded environments. MPICH is more of a reference platform and its InfiniBand support is not well aligned with CHPC's drivers, though its feature set is the same as that of Intel MPI and MVAPICH2 (both of which are based on MPICH). Finally, OpenMPI is quite flexible, but we have seen its performance to be slightly below Intel MPI and MVAPICH2. If you have any doubts about which MPI to use, contact us.

To set up a general MPI package, use the module command as:

module load <compiler> <MPI distro>

where

<MPI distro> = mvapich2, openmpi, mpich2, impi
<compiler> = gcc (GNU), intel (Intel), pgi (Portland Group)

Example 1. If you were running a program on Ember that was compiled with the Intel compilers and uses MVAPICH2:

module load intel mvapich2

Example 2. If you were running a program on Kingspeak that was compiled with the PGI compilers and uses OpenMPI:

module load pgi openmpi

The CHPC keeps older versions of each MPI distribution; however, backwards compatibility is sometimes compromised by network driver and compiler upgrades. When in doubt, please use the latest versions of the compilers and MPI distributions as obtained with the module load command. The older versions can be found in the respective directory for each distribution, or loaded with the module command by explicitly specifying the compiler and MPI version:

/uufs/chpc.utah.edu/sys/installdir/<MPI distro>/<version><g,i,p>

/uufs/<cluster>/sys/pkg/<MPI distro>/<version><g,i,p>

Different versions are indicated by version numbers and by a tag for the compiler that was used (g, i, p for GNU, Intel and PGI, respectively). If no compiler tag is given, assume that it is GNU.
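
For example, an older combination can be loaded by giving explicit versions with the module command; the versions below are only placeholders, so run module avail to see what is actually installed:

module load <compiler>/<version> <MPI distro>/<version>
# hypothetical example only; check "module avail" for the real version names:
# module load intel/2017 mvapich2/2.2i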

Compiling

Compiling with MPI is quite straightforward. Below is a list of MPI compiler commands with their equivalent standard compiler commands:

Language          MPI Command         Standard Commands
C                 mpicc               icc, pgcc
C++               mpicxx              icpc, pgCC
Fortran 77/90     mpif90, mpif77      ifort, pgf90, pgf77
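
For example, compiling a C and a Fortran MPI source file with the wrappers looks as follows (the file names are just placeholders):

# compile a C MPI code
mpicc -O2 mycode.c -o mycode_c
# compile a Fortran 90 MPI code
mpif90 -O2 mycode.f90 -o mycode_f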

When you compile, make sure you record which version of MPI you used. The std builds are periodically updated, and programs that depend on them may break after an update.

Note that Intel MPI supplies separate compiler commands (wrappers) for the Intel compilers, in a form of mpiicc, mpiicpc and mpiifort. Using mpicc, mpicxx and mpif90 will call the GNU compilers.
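
To check which underlying compiler and flags a wrapper invokes, the MPICH-based wrappers (MVAPICH2, MPICH, Intel MPI) generally accept a -show option, while OpenMPI's wrappers use --showme; for example:

mpicc -show      # MPICH-based wrappers: print the underlying compile command
mpicc --showme   # the OpenMPI equivalent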

Running with MPI

General MPI running information

IMPORTANT: Before running a job, make sure that your environment is using the same MPI package that was used to build your executable or list it explicitly in your SLURM script.

The mpirun command launches the parallel job. For help with mpirun, please consult the man pages (man mpirun) or run mpirun --help. The most important parameter is the number of MPI processes (-np).

To run on Kingspeak, Ember, Lonepeak, or Tangent:

mpirun -np $SLURM_NTASKS ./program

The $SLURM_NTASKS variable corresponds to the SLURM task count requested with the #SBATCH -n option.
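
As an illustration, a minimal SLURM batch script for a single-threaded MPI run could look like the sketch below; the account and partition names are placeholders that must be replaced with your own.

#!/bin/tcsh
#SBATCH -n 16               # number of MPI tasks
#SBATCH -t 01:00:00         # wall time
#SBATCH -A my-account       # placeholder account name
#SBATCH -p kingspeak        # placeholder partition name

module load intel mvapich2
mpirun -np $SLURM_NTASKS ./program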

Multi-threaded MPI

For optimal performance, especially in the case of multi-threaded parallel programs, additional settings are needed. Specifically, the environment variable OMP_NUM_THREADS (the number of threads to parallelize over) needs to be set. When running multi-threaded jobs, make sure to also link multi-threaded libraries (e.g. MKL, FFTW), and vice versa, link single-threaded libraries to single-threaded MPI programs.

The OMP_NUM_THREADS count can be calculated automatically by utilizing SLURM-provided variables, assuming that all nodes have the same CPU core count. This can prevent accidental over- or under-subscription when the node or task count in the SLURM script changes:

# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS

mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -genv MV2_ENABLE_AFFINITY 0 -np $SLURM_NTASKS ./program

Task/thread affinity

In the NUMA (Non-Uniform Memory Access) architecture, which is present on all CHPC clusters, it is often advantageous to pin MPI tasks and/or OpenMP threads to the CPU sockets and cores. We have seen up to 60% performance degradation in high memory bandwidth codes when process/thread affinity is not enforced. Pinning prevents the processes and threads from migrating to CPUs that have a more distant path to their data in memory. Most commonly we set an MPI task to be pinned to a CPU socket, with its OpenMP threads allowed to migrate over that socket's cores. All MPIs except MPICH automatically bind MPI tasks to CPUs, but the behavior and adjustment options depend on the MPI distribution. We detail this in the relevant MPI sections below.

Running MVAPICH2 programs

MVAPICH2 by default binds MPI tasks to cores, so for a single-threaded MPI program the optimal binding of one MPI task per CPU core is achieved by plainly running:

mpirun -np $SLURM_NTASKS ./program

For multi-threaded parallel programs, we need to disable the task-to-core affinity by setting MV2_ENABLE_AFFINITY=0. This also means that we need to pin the tasks manually, which can be done with the Intel compilers using KMP_AFFINITY=verbose,granularity=core,compact,1,0. To run multi-threaded MVAPICH2 code compiled with the Intel compilers:

module load intel mvapich2
# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS

mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -genv MV2_ENABLE_AFFINITY 0 -genv KMP_AFFINITY verbose,granularity=core,compact,1,0 -np $SLURM_NTASKS ./program

For other compilers, the suggestions listed in the MVAPICH2 user's guide don't seem to be appropriate for multi-threaded programs. However, we have found that using MPICH's process affinity options will do the trick (as MVAPICH2 is derived from MPICH). For example, on a 16-core, 2-socket cluster node, running 2 tasks with 8 threads each:

mpirun -genv MV2_ENABLE_AFFINITY 0 -bind-to numa -map-by numa -genv OMP_NUM_THREADS 8 -np 2 ./myprogram
taskset -cp 8700
pid 8700's current affinity list: 0-7,16-23
taskset -cp 8701
pid 8701's current affinity list: 8-15,24-31

Running OpenMPI programs

Generally, our tests show that on InfiniBand, OpenMPI performance is slightly below that of MVAPICH2. Nevertheless, OpenMPI has a number of appealing features that have led us to provide it to CHPC users. Again, see the OpenMPI man pages for details.

Running OpenMPI programs is straightforward, and the same on all clusters:

mpirun -np $SLURM_NTASKS $WORKDIR/program.exe

The mpirun flags for multi-threaded process distribution and binding to the CPU sockets are -map-by socket -bind-to socket.

To run an OpenMPI program multithreaded:

mpirun -np $SLURM_NTASKS -map-by socket -bind-to socket $WORKDIR/program.exe

OpenMPI will automatically select the optimal network interface. To force it to use Ethernet, use the --mca btl tcp,self mpirun flag. Network configuration runtime flags are detailed here.
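
For example, to force an OpenMPI run over Ethernet:

mpirun --mca btl tcp,self -np $SLURM_NTASKS $WORKDIR/program.exe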

 

Running MPICH programs

MPICH (formerly referred to as MPICH2) is an open source implementation developed at Argonne National Laboratory. Its newer versions support both Ethernet and InfiniBand, although we do not provide cluster-specific MPICH builds, mainly because MVAPICH2, which is derived from MPICH, provides additional performance tweaks. MPICH should only be used for debugging on interactive nodes, single-node runs and embarrassingly parallel problems, as its InfiniBand build does not ideally match our drivers.

mpirun -np $SLURM_NTASKS ./program.exe

Since by default MPICH does not bind tasks to CPUs, use the -bind-to core option to bind tasks to cores (equivalent to MV2_ENABLE_AFFINITY=1) in the case of a single-threaded program. For multi-threaded programs, one can use -bind-to numa -map-by numa, with details on the -bind-to option obtained by running mpirun -bind-to -help, or by consulting the Hydra process manager help page. The multi-threaded process/thread affinity seems to work quite well with MPICH; for example, on a 16-core Kingspeak node with the following core-memory mapping:

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
mpirun -bind-to numa -map-by numa -genv OMP_NUM_THREADS 8 -np 2 ./myprogram
taskset -cp 8595
pid 8595's current affinity list: 8-15,24-31
mpirun -bind-to core:4 -map-by numa -genv OMP_NUM_THREADS 4 -np 4 ./myprogram
taskset -cp 9549
pid 9549's current affinity list: 0-3,16-19

Notice that the binding is also correctly assigned to a subset of the CPU socket cores when we use 4 tasks on 2 sockets. Intel MPI is also capable of this; MVAPICH2 (unless using MPICH's flags) and OpenMPI don't seem to have an easy way to do it.

Running Intel MPI programs

Intel MPI is a high performance MPI library which runs on many different network interfaces. Apart from its runtime flexibility, it also integrates with other Intel tools (compilers, performance tools). For a quick introduction to Intel MPI, see the Getting Started guide, https://software.intel.com/en-us/get-started-with-mpi-for-linux.

Intel MPI by default works with whatever interface it finds on the machine at runtime. To use it, run module load impi.

For best performance we recommend using the Intel compilers along with Intel MPI, so to build, use the Intel compiler wrapper calls mpiicc, mpiicpc, mpiifort.

For example:

mpiicc code.c -o executable

Network selection with Intel MPI

Since Intel MPI is designed to run on multiple network interfaces, one only needs to build a single executable, which should be able to run on all CHPC clusters. Combining this with the Intel compiler's automatic CPU dispatch flag (-axCORE-AVX2,AVX,SSE4.2) allows a single executable to be built for all the clusters. The network interface selection is controlled with the I_MPI_FABRICS environment variable. The default should be the fastest network, in our case InfiniBand. We can verify the network selection by running the Intel MPI benchmark and looking at the time it takes to send a message from one node to another:

srun -n 2 -N 2 -A mygroup -p ember --pty /bin/tcsh -l

mpirun -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.74 0.00

It takes 1.74 microseconds to send a message there and back, which is typical for an InfiniBand network.

Intel MPI provides two different MPI fabrics for InfiniBand, one based on the Open Fabrics Enterprise Distribution (OFED) and the other on the Direct Access Programming Library (DAPL), denoted by ofa and dapl, respectively. Moreover, one can also specify the intra-node communication, of which the fastest should be shared memory (shm). According to our observations, the default fabric is shm:dapl, which can be confirmed by setting the environment variable I_MPI_DEBUG to 2 or larger, e.g.
mpirun -genv I_MPI_DEBUG 2 -np 2 /uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/intel64/bin/IMB-MPI1
...
[0] MPI startup(): shm and dapl data transfer modes
...

The performance of OFED and DAPL is comparable, but it may be worthwhile to test both to see if your particular application gets a boost from one fabric or the other.
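
For example, to test one fabric against the other, the fabric can be selected explicitly at runtime (keeping shared memory for the intra-node communication):

# run over DAPL
mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS ./program
# run over OFA (OFED)
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS ./program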

If we'd like to use the Ethernet network instead (which, except for Lonepeak, is not recommended for production due to its slower communication speed), we set I_MPI_FABRICS to tcp and get:

mpirun -genv I_MPI_FABRICS tcp -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 18.56 0.00

Notice that the latency on Ethernet is about 10x larger than on InfiniBand.

Single and multi-threaded process/thread affinity

Intel MPI pins processes and threads to sockets by default, so no additional runtime options should be needed unless the process/thread mapping needs to be different. If that is the case, consult the OpenMP interoperability guide. For the common default pinning:

mpirun -genv OMP_NUM_THREADS 8 -np 2 ./myprog
taskset -cp 10085
pid 10085's current affinity list: 0-7,16-23
mpirun -genv OMP_NUM_THREADS 4 -np 4 ./myprog
taskset -cp 9119
pid 9119's current affinity list: 0-3,16-19
Common MPI ABI

As of Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level, using a common Application Binary Interface (ABI). In practice this means that one can build an application with MPICH but run it using the Intel MPI libraries, thus taking advantage of the Intel MPI functionality. See details at https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications.
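
As a sketch of what the common ABI allows, assuming the module names used earlier on this page, one could build against MPICH and later launch the same binary with the Intel MPI runtime:

# build against MPICH
module load gcc mpich2
mpicc mycode.c -o myprogram
# run the same binary with the Intel MPI runtime instead
module unload mpich2
module load impi
mpirun -np $SLURM_NTASKS ./myprogram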