Intel Parallel Studio XE

CHPC has a license for Intel Parallel Studio XE Cluster Edition (IPS-CE) for two concurrent users. The IPS-CE consists of the following applications:

All Intel tools have excellent documentation. We recommend following the tutorials to learn how to use the tools, and consulting the documentation for details. Most of the tools work well out of the box; however, a few have peculiarities in our local installation, which we try to make known in this document.

Intel Compilers

The Intel compilers are described on our compiler page. They provide a wealth of code optimization options and generally produce the fastest code on the Intel CPU platforms.

Intel MKL

MKL is a high-performance math library containing full BLAS, LAPACK, ScaLAPACK, transforms and more. For details on using MKL, see our math libraries page.

Intel Advisor XE

Intel Advisor XE is a thread and vectorization prototyping tool. Vectorization support was added in the 2016 version and can be very helpful in detecting loops that could take advantage of vectorization with simple code changes, potentially doubling to quadrupling performance on AVX and AVX2 capable CPUs (CHPC's Kingspeak cluster).

Start with loading the module - module load advisorxe. Then launch the tool GUI with advixe-gui.

Then either use your own serial code or get examples from /uufs/

The thread prototyping process generally involves four steps:

  1. Creating a project and surveying (timing) the target code
  2. Annotating the target code to tell Advisor what sections to parallelize
  3. Analyzing the annotated code to predict parallel performance and potential parallel problems
  4. Adding parallel framework (OpenMP, TBB, Cilk+) to the code based on feedback from step 3

The vectorization profiling involves the following:

  1. Target survey to explore where to add vectorization or threading
  2. Find trip counts to see how many iterations each loop executes
  3. Check data dependencies in the loop and use the Advisor hints to fix them
  4. Check memory accesses to identify and fix complex data access patterns
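The four vectorization steps above can also be driven from the command line, which is convenient inside a batch job. The sketch below assumes the advixe-cl command line tool that ships with Advisor XE and its survey, tripcounts, dependencies and map analysis types. The project directory and executable name are hypothetical placeholders, and the exact options should be checked against the user's guide for the installed version. The run helper prints each command and executes it only where the tool is available.

```shell
#!/bin/bash
# Sketch: the four vectorization profiling steps via the Advisor command line.
# PROJECT_DIR and APP are hypothetical placeholders, not CHPC defaults.
PROJECT_DIR=./advi_project
APP=./myExecutable

# Print a command, and run it only if its program is available here.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

run module load advisorxe

# 1. survey, 2. trip counts, 3. data dependencies, 4. memory access patterns.
# Assumption: the last two analyses act on loops selected after the survey.
for analysis in survey tripcounts dependencies map; do
    run advixe-cl -collect "$analysis" -project-dir "$PROJECT_DIR" -- "$APP"
done
```

The collected project can then be opened in advixe-gui to read the recommendations for each loop.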

Some useful documentation and tutorials:

User's guide -

Tutorials -

Intel Inspector XE

Intel Inspector XE is a memory and thread error debugging tool. More info is at

Start with loading the module - module load inspectorxe.

Then either use your own serial code or get examples from /uufs/

Inspector has two major workflows, one geared toward memory inspection and the other toward thread inspection and debugging. The debugging process involves compiling the code with the -g flag, creating and populating a project, selecting the type of debugging (memory or thread), running the code inside the Inspector tool and, when done, going over the report that Inspector provides. Be aware that Inspector may report false positives, which can be turned off for future analyses using suppressions. See the user's guide and tutorials for details on how to use the tool.

Some useful documentation and tutorials:

User's guide -

Tutorials -

Support -

Intel VTune Amplifier XE

VTune is an advanced performance profiler. Its main appeal is integrated code performance measurement and evaluation and support for multithreading on both CPUs and accelerators (GPUs, Intel Xeon Phis).

For CPU based profiling on CHPC Linux systems, VTune is located at /uufs/

To profile an application in VTune GUI, do the following:

1. Source the VTune environment: module load vtune
2. Start the VTune GUI: amplxe-gui
3. Follow the GUI instructions to start a new sampling experiment, run it and then visualize the results.

To profile a distributed parallel application (e.g. MPI), one has to use the VTune command line interface. This can be done either in a SLURM job script or in an interactive job. Inside the script or the interactive job, do the following:

1. Source the VTune environment: module load vtune
2. Source the appropriate compiler and MPI, e.g. module load intel impi
3. Run the VTune command line command, e.g. mpirun -np $SLURM_NTASKS amplxe-cl -collect hotspots -result-dir /path/to/directory/with/VTune/result myExecutable. Note that we are explicitly stating where to put the results, as the VTune default results directory name can cause problems during the job launch.
4. To analyze the results, we recommend using the VTune GUI: on a cluster interactive node, start amplxe-gui, and then use the "Open Result" option in the main GUI window to find the directory with the result obtained above.
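The command line steps above could be combined into a SLURM job script along the following lines. This is a minimal sketch: the resource requests, the result directory name and ./myExecutable are hypothetical placeholders, and the run_if helper is only there so that the script prints the commands without running them on a machine where the tools are missing.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
# Sketch of a SLURM job that profiles an MPI program with VTune.
# The resource requests, RESULT_DIR and ./myExecutable are hypothetical
# placeholders; add the account and partition options that your group uses.

# Explicit result directory, since the default name can break the job launch.
RESULT_DIR=${RESULT_DIR:-$PWD/vtune_hotspots}
# SLURM sets SLURM_NTASKS inside a job; fall back to 4 for a dry run.
NTASKS=${SLURM_NTASKS:-4}

# Print a command; execute it only if the tool named first is available.
run_if() {
    tool=$1
    shift
    echo "+ $*"
    if command -v "$tool" >/dev/null 2>&1; then
        "$@"
    fi
}

run_if module module load vtune
run_if module module load intel impi
run_if amplxe-cl mpirun -np "$NTASKS" amplxe-cl -collect hotspots \
    -result-dir "$RESULT_DIR" ./myExecutable
```

Keeping the result directory in a variable makes it easy to point the GUI's "Open Result" dialog at the same location afterwards.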

We also have an experimental setup of VTune on Intel Xeon Phi demo machines, whose use is described on our Wiki page.

Some useful documentation and tutorials:

User's guide -

Tutorials -

Support -

Intel MPI

Intel MPI is a high-performance MPI library that runs on many different network interfaces. The main reason for having IMPI, though, is its seamless integration with ITAC and its features. It is generally slightly slower than the top-choice MPIs that we use on the clusters. Still, there may be applications in which IMPI outperforms our other MPIs, so we recommend including IMPI in performance testing before deciding which MPI to use for production runs. For a quick introduction to Intel MPI, see the Getting Started guide.

By default, Intel MPI works with whatever interface it finds on the machine at runtime. To use it, module load impi.

For best performance we recommend using the Intel compilers along with IMPI. To build, use the Intel compiler wrappers mpiicc, mpiicpc and mpiifort.

For example

mpiicc code.c -o executable

Since IMPI is designed to run on multiple network interfaces, one needs to build only a single executable, which should be able to run on all CHPC clusters. Combining this with the Intel compiler's automatic CPU dispatch flag (-axCORE-AVX2,AVX,SSE4.2) allows building a single executable that is also optimized for the CPUs of all the clusters. The network interface selection is controlled with the I_MPI_FABRICS environment variable. The default should be the fastest network, in our case InfiniBand. We can verify the network selection by running the Intel MPI benchmark and looking at the time it takes to send a message from one node to another:

srun -n 2 -N 2 -A mygroup -p ember --pty /bin/tcsh -l

mpirun -np 2 /uufs/

# Benchmarking PingPong
# #processes = 2
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.74 0.00

It takes 1.74 microseconds to send a message from one node to the other, which is typical for an InfiniBand network.

Intel MPI provides two different MPI fabrics for InfiniBand, one based on the Open Fabrics Enterprise Distribution (OFED) and the other on the Direct Access Programming Library (DAPL), denoted by ofa and dapl, respectively. Moreover, one can also specify the fabric for intra-node communication, of which the fastest should be shared memory (shm). According to our observations, the default fabric is shm:dapl, which can be confirmed by setting the environment variable I_MPI_DEBUG to 2 or higher, e.g.
mpirun -genv I_MPI_DEBUG 2 -np 2 /uufs/
[0] MPI startup(): shm and dapl data transfer modes

The performance of OFED and DAPL is comparable, but it may be worthwhile to test both to see if your particular application gets a boost from one fabric or the other.
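One way to run such a comparison is to loop over the two fabric settings and launch the same program with each. The sketch below assumes the shm:dapl and shm:ofa values of I_MPI_FABRICS described above; APP is a hypothetical placeholder for your own executable or benchmark, and the presence of mpiicc is used only as a marker that Intel MPI is loaded, so that the commands are merely printed elsewhere.

```shell
#!/bin/bash
# Sketch: launch the same run with both InfiniBand fabrics.
# APP is a hypothetical placeholder for your own executable or benchmark.
APP=${APP:-./myExecutable}

# Print a command; execute it only if the tool named first is available.
run_if() {
    tool=$1
    shift
    echo "+ $*"
    if command -v "$tool" >/dev/null 2>&1; then
        "$@"
    fi
}

run_if module module load intel impi

for fabric in shm:dapl shm:ofa; do
    echo "== I_MPI_FABRICS=$fabric =="
    # I_MPI_DEBUG 2 makes the startup message report the fabrics in use.
    run_if mpiicc mpirun -genv I_MPI_FABRICS "$fabric" -genv I_MPI_DEBUG 2 \
        -np 2 "$APP"
done
```

Comparing the timings of the two runs, together with the startup message that names the data transfer modes, shows which fabric suits the application better.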

If we'd like to use the Ethernet network instead (not recommended for production, except on Lonepeak, due to its slower communication speed), we set I_MPI_FABRICS to tcp and get:

mpirun -genv I_MPI_FABRICS tcp -np 2 /uufs/

# Benchmarking PingPong
# #processes = 2
#bytes #repetitions t[usec] Mbytes/sec
0 1000 18.56 0.00

Notice that the latency on the Ethernet is about 10x larger than on the InfiniBand.
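The comparison can be reproduced from saved benchmark output with a few lines of shell. The sketch below is only an illustration: it feeds in the two zero-byte PingPong rows shown above, and the function name is hypothetical.

```shell
#!/bin/bash
# Sketch: compare zero-byte PingPong latencies from two benchmark outputs.
# The data rows are the ones shown above; the function name is hypothetical.

# Print the t[usec] column of the first row whose #bytes column is 0.
zero_byte_latency() {
    awk 'NF == 4 && $1 == "0" { print $3; exit }'
}

ib=$(printf '%s\n' '#bytes #repetitions t[usec] Mbytes/sec' \
    '0 1000 1.74 0.00' | zero_byte_latency)
eth=$(printf '%s\n' '#bytes #repetitions t[usec] Mbytes/sec' \
    '0 1000 18.56 0.00' | zero_byte_latency)

# Ratio of Ethernet to InfiniBand latency, rounded to one decimal place.
ratio=$(awk -v a="$eth" -v b="$ib" 'BEGIN { printf "%.1f", a / b }')
echo "InfiniBand: $ib us, Ethernet: $eth us, ratio: $ratio"
```

For the numbers above, 18.56 / 1.74 gives a ratio of roughly 10.7. In practice the printf lines would be replaced by the saved output of the two mpirun commands.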

As of Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level, using a common Application Binary Interface (ABI). In practice this means that one can build the application with MPICH but run it using the Intel MPI libraries, thus taking advantage of the Intel MPI functionality. See details about this at

Intel Trace Analyzer and Collector

ITAC can be used for MPI code checking and for profiling. To use it, module load itac.

MPI/OpenMP profiling

It is best to run ITAC with the Intel compilers and Intel MPI, since that way one can take advantage of their interoperability. That is, also load the compiler and MPI modules:

module load intel impi itac

On existing code that was built with IMPI, just run

mpirun -trace -n 4 ./a.out

This will produce a set of trace files a.out.stf*, which are then loaded to the Trace Analyzer as

traceanalyzer a.out.stf &

To take advantage of additional profiling features, compile with the -trace flag:

mpiicc -trace code.c

The ITAC reference guide is a good resource for more detailed information about other ways to invoke tracing, instrumentation, etc. A good tutorial on how to use ITAC to profile an MPI code is named Detecting and Removing Unnecessary Serialization.

MPI correctness check

To run the correctness checker, the easiest way is to compile with the -check_mpi flag:

mpiicc -check_mpi code.c

and then plainly run as

mpirun -check -n 4 ./a.out

If the executable was built with another MPI of the MPICH2 family, one can specifically invoke the checker library with

mpirun -genv LD_PRELOAD -genv VT_CHECK_TRACING on -n 4 ./a.out

At least this is what the manual says, but it seems to just create the trace file. So, the safest way is to use -check_mpi during compilation. The way to tell that MPI checking is enabled is that the program writes out a lot of output describing what it is doing at runtime.


Once the program is done, if there is no MPI error, it'll say:

[0] INFO: Error checking completed without finding any problems.

We recommend that anyone who is developing an MPI program run it through the MPI checker before putting the program to serious use. It can help to uncover hidden problems that could be hard to locate during normal runtime.
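When the checked run is part of a batch job or a test suite, the success message quoted above can be searched for automatically. The following is a minimal sketch of that idea; the function name and the temporary log file are hypothetical, and the demonstration writes the success line itself instead of running an MPI program.

```shell
#!/bin/bash
# Sketch: decide from a saved run log whether the MPI checker reported success.
# The function name and the log file are hypothetical.

# Return 0 if the log contains the checker's success line, non-zero otherwise.
mpi_check_passed() {
    grep -q 'INFO: Error checking completed without finding any problems\.' "$1"
}

# Self-contained demonstration using the success line quoted above.
log=$(mktemp)
echo '[0] INFO: Error checking completed without finding any problems.' > "$log"

if mpi_check_passed "$log"; then
    result="passed"
else
    result="failed or checker not enabled"
fi
echo "MPI correctness check: $result"
rm -f "$log"
```

In real use, redirect the output of the checked run to a file and pass that file name to the function. A missing success line means either that the checker found a problem or that checking was not enabled, so the full log still needs to be read in that case.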

Last Updated: 5/16/17