Message Passing Interface
Message Passing Interface (MPI) is the principal method of performing parallel computations on all CHPC clusters. Its main component is a standardized library that enables communication between processors in distributed processor environments.
There are numerous MPI distributions available; CHPC therefore supports only some of them, those we believe are best suited for a particular system.
More information: MPI standard page.
The CHPC clusters utilize two types of network interconnects, Ethernet and InfiniBand. Except for some nodes on Lonepeak, all clusters have InfiniBand and users should use it since it is much faster than Ethernet.
We provide a number of MPI distributions with InfiniBand support: MVAPICH2, OpenMPI, and Intel MPI. All of these MPI implementations also support multiple network interfaces in a single build; usage is described on this page. Historically we have also supported MPICH; however, at present we are having problems with MPICH's support for InfiniBand that will require further analysis.
More information from the developers of each MPI distribution can be found here:
Before performing any work with MPI, users need to initialize the environment for the MPI distribution appropriate for their needs. Each of the distributions has its pros and cons. Intel MPI has good performance and very flexible usage, but it is a commercial product that we have to license. MVAPICH2 is optimized for InfiniBand, but it does not provide flexible process/core affinity in multi-threaded environments. MPICH is more of a reference platform, with InfiniBand support through the relatively recent libfabric interface. Its feature set is the same as that of Intel MPI and MVAPICH2 (both of which are based on MPICH). Intel MPI, MVAPICH2 and MPICH can also be freely interchanged thanks to their common Application Binary Interface (ABI), whose main advantage is that separate binaries do not need to be built for each distribution.
Finally, OpenMPI is quite flexible, and on InfiniBand we see better performance than with Intel MPI and MVAPICH2. However, it is not ABI compatible with the other MPIs that we provide. If you have any doubts about which MPI to use, contact us.
Note that in the past we provided separate MPI builds for different clusters. Since most MPIs now provide flexible interfaces for multiple networks, we provide a single build for all clusters and allow the default network interface to be changed at runtime.
To set up a general MPI package, use the module command as:
module load <compiler> <MPI distro>
where
<MPI distro> = mvapich2, openmpi, mpich, intel-oneapi-mpi
<compiler> = gcc (GNU), intel (Intel), pgi (Portland Group)
Example 1. If you were running a program that was compiled with the Intel compilers and uses mvapich2:
module load intel mvapich2
Example 2. If you were running a program that was compiled with the NVHPC compilers and uses OpenMPI:
module load nvhpc openmpi
The CHPC keeps older versions of each MPI distribution, however, the backwards compatibility
is sometimes compromised due to network driver and compiler upgrades. When in doubt,
please, use the latest versions of compilers and MPI distributions as obtained with
the module load command. These older versions can be found with the module spider <MPI distro>
command.
Compiling with MPI is quite straightforward. Below is a list of the MPI compiler wrapper commands along with the standard compiler commands they call:

Language | MPI Command | Standard Commands
---|---|---
C | mpicc | gcc, icc
C++ | mpicxx | g++, icpc
Fortran 77/90 | mpif77, mpif90 | gfortran, ifort
When you compile, make sure you record which version of MPI you used. The std builds are periodically updated, and programs that depend on them will sometimes break.
Note that Intel MPI supplies separate compiler commands (wrappers) for the Intel compilers, in the form of mpiicc, mpiicpc and mpiifort. Using mpicc, mpicxx and mpif90 will call the GNU compilers.
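For example, a minimal sketch of building a hypothetical source file hello.c (the file name is a placeholder) with either toolchain:
# GNU compilers with MVAPICH2
module load gcc mvapich2
mpicc hello.c -o hello
# or, in a fresh shell, Intel compilers with Intel MPI (note the Intel-specific wrapper)
module load intel intel-oneapi-mpi
mpiicc hello.c -o hello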
The mpirun command launches the parallel job. For help with mpirun, please consult the man page (man mpirun) or run mpirun --help. The most important parameter is the number of MPI processes, specified with -np.
To run on a cluster, or on a CHPC supported Linux desktop:
mpirun -np $SLURM_NTASKS ./program
The $SLURM_NTASKS variable corresponds to the SLURM task count requested with the #SBATCH -n option.
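Putting this together, a minimal SLURM batch script sketch in tcsh (the account, partition, task count and program names are placeholders):
#!/bin/tcsh
#SBATCH -n 16            # number of MPI tasks (placeholder)
#SBATCH -t 1:00:00       # wall time
#SBATCH -A mygroup       # account - placeholder, use your own
#SBATCH -p kingspeak     # partition - placeholder, use your own
module load intel mvapich2
mpirun -np $SLURM_NTASKS ./program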
For optimal performance, especially in the case of multi-threaded parallel programs, additional settings are needed. Specifically, the environment variable OMP_NUM_THREADS (the number of threads to parallelize over) must be set. When running multi-threaded jobs, make sure to also link multi-threaded libraries (e.g. MKL, FFTW), and, vice versa, link single-threaded libraries to single-threaded MPI programs.
The OMP_NUM_THREADS count can be calculated automatically from SLURM-provided variables, assuming that all nodes have the same CPU core count. This prevents accidental over- or under-subscription when the node or task count in the SLURM script changes:
# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS
mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -genv MV2_ENABLE_AFFINITY 0 -np $SLURM_NTASKS ./program
On the NUMA (Non-Uniform Memory Access) architecture present on all CHPC clusters, it is often advantageous to pin MPI tasks and/or OpenMP threads to CPU sockets and cores. We have seen up to 60% performance degradation in codes with high memory bandwidth demands when process/thread affinity is not enforced. Pinning prevents processes and threads from migrating to CPUs that have a more distant path to their data in memory. Most commonly we pin each MPI task to a CPU socket, with its OpenMP threads allowed to migrate over that socket's cores. All MPIs except for MPICH automatically bind MPI tasks to CPUs, but the behavior and adjustment options depend on the MPI distribution. We describe MPI task pinning in the relevant MPI section below; for more details on the problem see our blog post, which provides a general solution using a shell script that pins both tasks and threads to cores.
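As a quick sanity check, one can inspect the affinity masks of the running MPI tasks on a compute node, in the same way as the taskset examples shown below, e.g. (the executable name "program" is a placeholder):
# print the CPU affinity list of every running copy of the program
pgrep program | xargs -n 1 taskset -cp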
MVAPICH2 by default binds MPI tasks to cores, so the optimal binding for a single-threaded MPI program, one MPI task per CPU core, is achieved by plainly running:
mpirun -np $SLURM_NTASKS ./program
For multi-threaded parallel programs, we need to disable the task-to-core affinity by setting MV2_ENABLE_AFFINITY=0. This also means that we need to pin the tasks manually, which can be done with the Intel compilers by setting KMP_AFFINITY=verbose,granularity=core,compact,1,0. To run multi-threaded MVAPICH2 code compiled with the Intel compilers:
module load intel mvapich2
# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS
mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -genv MV2_ENABLE_AFFINITY 0 -genv KMP_AFFINITY verbose,granularity=core,compact,1,0 -np $SLURM_NTASKS ./program
For other compilers, the suggestions listed in the MVAPICH2 user's guide don't seem to be appropriate for multi-threaded programs. However, we have found that using MPICH's process affinity options will do the trick (as MVAPICH2 is derived from MPICH). For example, on a 16-core, 2-socket cluster node, running 2 tasks with 8 threads each:
mpirun -genv MV2_ENABLE_AFFINITY 0 -bind-to numa -map-by numa -genv OMP_NUM_THREADS 8 -np 2 ./myprogram
taskset -cp 8700
pid 8700's current affinity list: 0-7,16-23
taskset -cp 8701
pid 8701's current affinity list: 8-15,24-31
Generally, our tests show that over InfiniBand, OpenMPI performance is similar to that of MVAPICH2. Although OpenMPI is not binary compatible with the other MPI distributions that we offer, it has some appealing features and as of July 2022 seems to be more stable than Intel MPI, especially on the AMD platforms. Again, see the OpenMPI man pages for details.
Running OpenMPI programs is straightforward, and the same on all clusters:
mpirun -np $SLURM_NTASKS $WORKDIR/program.exe
The mpirun flags for multi-threaded process distribution and binding to the CPU sockets are -map-by socket -bind-to socket.
To run an OpenMPI program multithreaded:
mpirun -np $SLURM_NTASKS -map-by socket -bind-to socket $WORKDIR/program.exe
OpenMPI will automatically select the optimal network interface. To force it to use Ethernet, use the --mca btl tcp,self mpirun flag. Network configuration runtime flags are detailed here.
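For example, a sketch of forcing the Ethernet (TCP) transport for the same run as above:
# force the TCP transport instead of InfiniBand
mpirun --mca btl tcp,self -np $SLURM_NTASKS $WORKDIR/program.exe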
MPICH (formerly referred to as MPICH2) is an open source implementation developed at Argonne National Laboratory. Its newer versions support both Ethernet and InfiniBand, although we do not provide cluster-specific MPICH builds, mainly because MVAPICH2, which is derived from MPICH, provides additional performance tweaks. MPICH should only be used for debugging on interactive nodes, single node runs, and embarrassingly parallel problems, as its InfiniBand build does not ideally match our drivers. To run an MPICH program:
mpirun -np $SLURM_NTASKS ./program.exe
Since by default MPICH does not bind tasks to CPUs, use the -bind-to core option to bind tasks to cores (equivalent to MV2_ENABLE_AFFINITY=1) in the case of a single-threaded program. For multi-threaded programs, one can use -bind-to numa -map-by numa, with details on the -bind-to option obtained by running mpirun -bind-to -help, or by consulting the Hydra process manager help page. The multi-threaded process/thread affinity seems to work quite well with MPICH. For example, on a 16-core Kingspeak node with core-memory mapping:
numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
mpirun -bind-to numa -map-by numa -genv OMP_NUM_THREADS 8 -np 2 ./myprogram
taskset -cp 8595
pid 8595's current affinity list: 8-15,24-31
mpirun -bind-to core:4 -map-by numa -genv OMP_NUM_THREADS 4 -np 4 ./myprogram
taskset -cp 9549
pid 9549's current affinity list: 0-3,16-19
Notice that the binding is also correctly assigned to a subset of CPU socket cores when we use 4 tasks on 2 sockets. Intel MPI is also capable of this; MVAPICH2 (unless using MPICH's flags) and OpenMPI do not seem to have an easy way to do this.
MPICH by default uses the slower Ethernet network for communication. To take advantage of InfiniBand, set the environment variable MPICH_NEMESIS_NETMOD=ofi. However, please note that based on our tests MPICH's InfiniBand support does not seem to be highly optimized, so you may get better performance with Intel MPI or MVAPICH2.
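For example, in a tcsh job script this could look like (reusing the program name from the example above):
# switch MPICH's network module to libfabric (InfiniBand)
setenv MPICH_NEMESIS_NETMOD ofi
mpirun -np $SLURM_NTASKS ./program.exe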
Note that all the examples above only pin MPI tasks to cores, allowing the OpenMP threads to float freely across the task's cores. Sometimes it is advantageous to also pin the threads, which is described here.
Intel MPI is a high performance MPI library which runs on many different network interfaces. Apart from its runtime flexibility, it also integrates with other Intel tools (compilers, performance tools). For a quick introduction to Intel MPI, see the Getting Started guide, https://software.intel.com/en-us/get-started-with-mpi-for-linux.
Intel MPI by default works with whatever network interface it finds on the machine at runtime. To use it, run module load intel-oneapi-mpi.
For best performance we recommend using the Intel compilers along with Intel MPI (IMPI), so, to build, use the Intel compiler wrapper commands mpiicc, mpiicpc, and mpiifort.
For example:
mpiicc code.c -o executable
Since IMPI is designed to run on multiple network interfaces, a single executable should be able to run on all CHPC clusters. Combining this with the Intel compiler's automatic CPU dispatch flag (-axCORE-AVX512,CORE-AVX2,AVX,SSE4.2) allows one executable to be built for all the clusters. The network interface selection is controlled with the I_MPI_FABRICS environment variable. The default should be the fastest network, in our case InfiniBand.
We can verify the network selection by running the Intel MPI benchmark and looking at the time it takes to send a message from one node to another:
srun -n 2 -N 2 -A mygroup -p ember --pty /bin/tcsh -l
mpirun -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.74 0.00
It takes about 1.74 microseconds to send a message there and back, which is typical for an InfiniBand network.
Intel MPI provides two different MPI fabrics for InfiniBand, one based on the Open Fabrics Enterprise Distribution (OFED) and the other on the Direct Access Programming Library (DAPL), denoted ofa and dapl, respectively. Moreover, one can also specify the intra-node communication, of which the fastest should be shared memory (shm). According to our observations, the default fabric is shm:dapl, which can be confirmed by setting the environment variable I_MPI_DEBUG to 2 or higher, e.g.:
mpirun -genv I_MPI_DEBUG 2 -np 2 /uufs/chpc.utah.edu/sys/pkg/intel/ics/impi/std/intel64/bin/IMB-MPI1
...
[0] MPI startup(): shm and dapl data transfer modes
...
The performance of OFED and DAPL is comparable, but it may be worthwhile to test both to see if your particular application gets a boost from one fabric or the other.
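One way to run such a comparison, reusing the benchmark binary from the example above, is a sketch like:
# compare OFED and DAPL fabrics with the ping-pong benchmark
mpirun -genv I_MPI_FABRICS shm:ofa -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1
mpirun -genv I_MPI_FABRICS shm:dapl -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1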
If we would like to use the Ethernet network instead (not recommended for production due to its slower communication speed, except on Lonepeak), we choose I_MPI_FABRICS tcp and get:
mpirun -genv I_MPI_FABRICS tcp -np 2 /uufs/chpc.utah.edu/sys/installdir/intel/impi/std/intel64/bin/IMB-MPI1
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 18.56 0.00
Notice that the latency on the Ethernet is about 10x larger than on the InfiniBand.
Intel MPI pins processes and threads to sockets by default, so no additional runtime options should be needed unless the process/thread mapping needs to be different. If that is the case, consult the OpenMP interoperability guide. With the common default pinning we get:
mpirun -genv OMP_NUM_THREADS 8 -np 2 ./myprog
taskset -cp 10085
pid 10085's current affinity list: 0-7,16-23
mpirun -genv OMP_NUM_THREADS 4 -np 4 ./myprog
taskset -cp 9119
pid 9119's current affinity list: 0-3,16-19
As of Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level through a common Application Binary Interface (ABI). In practice this means that one can build an application with MPICH but run it using the Intel MPI libraries, thus taking advantage of the Intel MPI functionality. See details at https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications.
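As a sketch of what this interchange could look like in practice (the source file and program names are placeholders, and the exact module swap may vary with your environment):
# build against MPICH
module load gcc mpich
mpicc program.c -o program
# run the same binary with the Intel MPI runtime instead
module unload mpich
module load intel-oneapi-mpi
mpirun -np $SLURM_NTASKS ./program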