MPI

Message Passing Interface (MPI) is the principal method of performing parallel computations on all CHPC clusters. Its main component is a standardized library that enables communication between processes in distributed-memory environments. Numerous MPI distributions are available, so CHPC supports only some of them, namely those we believe are best suited for a particular system.

More information: MPI standard page.

The CHPC clusters utilize two types of network interconnects, Ethernet and InfiniBand. Except for some nodes on Lonepeak, all clusters have InfiniBand and users should use it since it is much faster than Ethernet.

We provide a number of MPI distributions with InfiniBand support: MVAPICH2, OpenMPI, MPICH2 and Intel MPI. All have fairly similar performance; however, MVAPICH2, MPICH2 and Intel MPI seem to be more stable with multi-threaded programs. For each of these MPI distributions there is a general build that works on all CHPC clusters, as well as cluster-specific builds of OpenMPI and MVAPICH2 that target the cluster-specific CPUs for optimal performance. The CPU optimizations in the MPI calls should have only a minor effect on performance, which is why we are moving towards the general builds. On top of that, the Intel compiler allows for multiple CPU target optimizations in one library, which is also how we build the MPIs. Builds with the GNU and PGI compilers are optimized for the lowest common denominator, which is the Ember cluster.

The general builds of OpenMPI, MPICH and Intel MPI also support multiple network interfaces in a single build; usage is described on this page.

More information is available from the developers of each MPI distribution.

Sourcing MPI and Compiling

Before performing any work with MPI, users need to source the MPI distribution appropriate for their needs. Each of the distributions has its pros and cons. Intel MPI has good performance and very flexible usage, but it is a commercial product that we have to license. MVAPICH2 is optimized for InfiniBand, but it does not provide flexible process/core affinity in multi-threaded environments. MPICH is more of a reference platform and its InfiniBand support is not well aligned with CHPC's drivers, though its feature set is the same as that of Intel MPI and MVAPICH2 (both of which are based on MPICH). Finally, OpenMPI is quite flexible, but we have seen its performance to be slightly below Intel MPI and MVAPICH2. If you have any doubts about which MPI to use, contact us.

To set up a general MPI package, use the module command as:

module load <compiler> <MPI distro>

where

<MPI distro> = mvapich2, openmpi, mpich2, impi
<compiler> = gcc (GNU), intel (Intel), pgi (Portland Group)

Example 1. If you were running a program on Ember that was compiled with the Intel compilers and uses MVAPICH2:

module load intel mvapich2

Example 2. If you were running a program on Kingspeak that was compiled with the PGI compilers and uses OpenMPI:

module load pgi openmpi

The CHPC keeps older versions of each MPI distribution; however, backwards compatibility is sometimes compromised by network driver and compiler upgrades. When in doubt, please use the latest versions of the compilers and MPI distributions as obtained with the module load command. The older versions can be found in the respective directory for each distribution, or loaded with the module command by explicitly specifying the compiler and MPI version:

/uufs/chpc.utah.edu/sys/installdir/<MPI distro>/<version><g,i,p>

/uufs/<cluster>/sys/pkg/<MPI distro>/<version><g,i,p>

Different versions are indicated by version numbers and by a tag for the compiler that was used (g, i, p for GNU, Intel and PGI, respectively). If no compiler tag is given, assume that it is GNU.
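
For example, an older combination can be loaded by giving explicit versions with the module command; the versions below are only placeholders, so run module avail to see what is actually installed:

module load <compiler>/<version> <MPI distro>/<version>
# hypothetical example only; check "module avail" for the real version names:
# module load intel/2017 mvapich2/2.2i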

Compiling

Compiling with MPI is quite straightforward. Below is a list of MPI compiler commands with their equivalent standard compiler commands:

Language          MPI Command         Standard Commands
C                 mpicc               icc, pgcc
C++               mpicxx              icpc, pgCC
Fortran 77/90     mpif90, mpif77      ifort, pgf90, pgf77
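
For example, compiling a C and a Fortran MPI source file with the wrappers looks as follows (the file names are just placeholders):

# compile a C MPI code
mpicc -O2 mycode.c -o mycode_c
# compile a Fortran 90 MPI code
mpif90 -O2 mycode.f90 -o mycode_f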

When you compile, make sure you record which version of MPI you used. The std builds are periodically updated, and programs that depend on them may break after an update.

Note that Intel MPI supplies separate compiler commands (wrappers) for the Intel compilers, in a form of mpiicc, mpiicpc and mpiifort. Using mpicc, mpicxx and mpif90 will call the GNU compilers.
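
To check which underlying compiler and flags a wrapper invokes, the MPICH-based wrappers (MVAPICH2, MPICH, Intel MPI) generally accept a -show option, while OpenMPI's wrappers use --showme; for example:

mpicc -show      # MPICH-based wrappers: print the underlying compile command
mpicc --showme   # the OpenMPI equivalent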

Running with MPI

General MPI running information

IMPORTANT: Before running a job, make sure that your environment is using the same MPI package that was used to build your executable or list it explicitly in your SLURM script.

The mpirun command launches the parallel job. For help with mpirun, please consult the man pages (man mpirun) or run mpirun --help. The most important parameter is the number of MPI processes (-np).

To run on Kingspeak, Ember, Lonepeak, or Tangent:

mpirun -np $SLURM_NTASKS ./program

The $SLURM_NTASKS variable corresponds to the SLURM task count requested with the #SBATCH -n option.
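
As an illustration, a minimal SLURM batch script for a single-threaded MPI run could look like the sketch below; the account and partition names are placeholders that must be replaced with your own.

#!/bin/tcsh
#SBATCH -n 16               # number of MPI tasks
#SBATCH -t 01:00:00         # wall time
#SBATCH -A my-account       # placeholder account name
#SBATCH -p kingspeak        # placeholder partition name

module load intel mvapich2
mpirun -np $SLURM_NTASKS ./program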

Multi-threaded MPI

For optimal performance, especially in the case of multi-threaded parallel programs, additional settings are needed. Specifically, the environment variable OMP_NUM_THREADS (the number of threads to parallelize over) needs to be set. When running multi-threaded jobs, make sure to also link multi-threaded libraries (e.g. MKL, FFTW), and vice versa, link single-threaded libraries to single-threaded MPI programs.

The OMP_NUM_THREADS count can be calculated automatically by utilizing SLURM-provided variables, assuming that all nodes have the same CPU core count. This can prevent accidental over- or under-subscription when the node or task count in the SLURM script changes:

# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS

mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -genv MV2_ENABLE_AFFINITY 0 -np $SLURM_NTASKS ./program

Task/thread affinity

In the NUMA (Non-Uniform Memory Access) architecture, which is present on all CHPC clusters, it is often advantageous to pin MPI tasks and/or OpenMP threads to the CPU sockets and cores. We have seen up to 60% performance degradation in high memory bandwidth codes when process/thread affinity is not enforced. Pinning prevents the processes and threads from migrating to CPUs that have a more distant path to their data in memory. Most commonly we set an MPI task to be pinned to a CPU socket, with its OpenMP threads allowed to migrate over that socket's cores. All MPIs except MPICH automatically bind MPI tasks to CPUs, but the behavior and adjustment options depend on the MPI distribution. We detail this in the relevant MPI sections below.

Running MVAPICH2 programs

MVAPICH2 by default binds MPI tasks to cores, so for a single-threaded MPI program the optimal binding of one MPI task per CPU core is achieved by plainly running:

mpirun -np $SLURM_NTASKS ./program

For multi-threaded parallel programs, we need to disable the task-to-core affinity by setting MV2_ENABLE_AFFINITY=0. This also means that we need to pin the tasks manually, which can be done with the Intel compilers using KMP_AFFINITY=verbose,granularity=core,compact,1,0. To run multi-threaded MVAPICH2 code compiled with the Intel compilers:

module load intel mvapich2
# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS

mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -genv MV2_ENABLE_AFFINITY 0 -genv KMP_AFFINITY verbose,granularity=core,compact,1,0 -np $SLURM_NTASKS ./program

For other compilers, the suggestions listed in the MVAPICH2 user's guide don't seem to be appropriate for multi-threaded programs. However, we have found that using MPICH's process affinity options will do the trick (as MVAPICH2 is derived from MPICH). For example, on a 16-core, 2-socket cluster node, running 2 tasks with 8 threads each:

mpirun -genv MV2_ENABLE_AFFINITY 0 -bind-to numa -map-by numa -genv OMP_NUM_THREADS 8 -np 2 ./myprogram
taskset -cp 8700
pid 8700's current affinity list: 0-7,16-23
taskset -cp 8701
pid 8701's current affinity list: 8-15,24-31

Running OpenMPI programs

Generally, our tests show that on InfiniBand, OpenMPI performance is slightly below that of MVAPICH2. Nevertheless, OpenMPI has a number of appealing features that have led us to provide it to CHPC users. Again, see the OpenMPI man pages for details.

Running OpenMPI programs is straightforward, and the same on all clusters:

mpirun -np $SLURM_NTASKS $WORKDIR/program.exe

The mpirun flags for multi-threaded process distribution and binding to the CPU sockets are -map-by socket -bind-to socket.

To run an OpenMPI program multithreaded:

mpirun -np $SLURM_NTASKS -map-by socket -bind-to socket $WORKDIR/program.exe

OpenMPI will automatically select the optimal network interface. To force it to use Ethernet, use the --mca btl tcp,self mpirun flag. Network configuration runtime flags are detailed here.
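
For example, to force an OpenMPI run over Ethernet:

mpirun --mca btl tcp,self -np $SLURM_NTASKS $WORKDIR/program.exe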

 

Running MPICH programs

MPICH (formerly referred to as MPICH2) is an open source implementation developed at Argonne National Laboratory. Its newer versions support both Ethernet and InfiniBand, although we do not provide cluster-specific MPICH builds, mainly because MVAPICH2, which is derived from MPICH, provides additional performance tweaks. MPICH should only be used for debugging on interactive nodes, single-node runs and embarrassingly parallel problems, as its InfiniBand build does not ideally match our drivers.

mpirun -np $SLURM_NTASKS ./program.exe

Since by default MPICH does not bind tasks to CPUs, use the -bind-to core option to bind tasks to cores (equivalent to MV2_ENABLE_AFFINITY=1) in the case of a single-threaded program. For multi-threaded programs, one can use -bind-to numa -map-by numa, with details on the -bind-to option obtained by running mpirun -bind-to -help, or by consulting the Hydra process manager help page. The multi-threaded process/thread affinity seems to work quite well with MPICH; for example, on a 16-core Kingspeak node with the following core-memory mapping:

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
mpirun -bind-to numa -map-by numa -genv OMP_NUM_THREADS 8 -np 2 ./myprogram
taskset -cp 8595
pid 8595's current affinity list: 8-15,24-31
mpirun -bind-to core:4 -map-by numa -genv OMP_NUM_THREADS 4 -np 4 ./myprogram
taskset -cp 9549
pid 9549's current affinity list: 0-3,16-19

Notice that the binding is also correctly assigned to a subset of the CPU socket cores when we use 4 tasks on 2 sockets. Intel MPI is also capable of this; MVAPICH2 (unless using MPICH's flags) and OpenMPI don't seem to have an easy way to do it.

Running Intel MPI programs

Intel MPI is a high performance MPI library which runs on many different network interfaces. Apart from its runtime flexibility, it also integrates with other Intel tools (compilers, performance tools). For a quick introduction to Intel MPI, see the Getting Started guide, https://software.intel.com/en-us/get-started-with-mpi-for-linux.

Intel MPI by default works with whatever interface it finds on the machine at runtime. To use it, run module load impi.

For best performance we recommend using the Intel compilers along with Intel MPI, so to build, use the Intel compiler wrapper calls mpiicc, mpiicpc, mpiifort.

For example:

mpiicc code.c -o executable

Network selection with Intel MPI

Since Intel MPI is designed to run on multiple network interfaces, one only needs to build a single executable, which should be able to run on all CHPC clusters. Combining this with the Intel compiler's automatic CPU dispatch flag (-axCORE-AVX2,AVX,SSE4.2) allows a single executable to be built for all the clusters. The network interface selection is controlled with the I_MPI_FABRICS environment variable. The default should be the fastest network, in our case InfiniBand. We can verify the network selection by running the Intel MPI benchmark and looking at the time it takes to send a message from one node to another:

srun -n 2 -N 2 -A mygroup -p ember --pty /bin/tcsh -l

mpirun -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.74 0.00

It takes 1.74 microseconds to send a message there and back, which is typical for an InfiniBand network.

Intel MPI provides two different MPI fabrics for InfiniBand, one based on the Open Fabrics Enterprise Distribution (OFED) and the other on the Direct Access Programming Library (DAPL), denoted by ofa and dapl, respectively. Moreover, one can also specify the intra-node communication, of which the fastest should be shared memory (shm). According to our observations, the default fabric is shm:dapl, which can be confirmed by setting the environment variable I_MPI_DEBUG to 2 or larger, e.g.
mpirun -genv I_MPI_DEBUG 2 -np 2 /uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/intel64/bin/IMB-MPI1
...
[0] MPI startup(): shm and dapl data transfer modes
...

The performance of OFED and DAPL is comparable, but it may be worthwhile to test both to see if your particular application gets a boost from one fabric or the other.
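
For example, to test one fabric against the other, the fabric can be selected explicitly at runtime (keeping shared memory for the intra-node communication):

# run over DAPL
mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS ./program
# run over OFA (OFED)
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS ./program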

If we'd like to use the Ethernet network instead (which, except for Lonepeak, is not recommended for production due to its slower communication speed), we set I_MPI_FABRICS to tcp and get:

mpirun -genv I_MPI_FABRICS tcp -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 18.56 0.00

Notice that the latency on Ethernet is about 10x larger than on InfiniBand.

Single and multi-threaded process/thread affinity

Intel MPI pins processes and threads to sockets by default, so no additional runtime options should be needed unless the process/thread mapping needs to be different. If that is the case, consult the OpenMP interoperability guide. For the common default pinning:

mpirun -genv OMP_NUM_THREADS 8 -np 2 ./myprog
taskset -cp 10085
pid 10085's current affinity list: 0-7,16-23
mpirun -genv OMP_NUM_THREADS 4 -np 4 ./myprog
taskset -cp 9119
pid 9119's current affinity list: 0-3,16-19
Common MPI ABI

As of Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level, using a common Application Binary Interface (ABI). In practice this means that one can build an application with MPICH but run it using the Intel MPI libraries, thus taking advantage of the Intel MPI functionality. See details at https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications.
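
As a sketch of what the common ABI allows, assuming the module names used earlier on this page, one could build against MPICH and later launch the same binary with the Intel MPI runtime:

# build against MPICH
module load gcc mpich2
mpicc mycode.c -o myprogram
# run the same binary with the Intel MPI runtime instead
module unload mpich2
module load impi
mpirun -np $SLURM_NTASKS ./myprogram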