CHPC has a limited number of GPU nodes on ember, kingspeak and notchpeak, as well as a standalone OpenPower server.
|Nodes||GPU type||GPU count||CPU core count||Host Memory||Compute capability|
|kp297 and kp298||GeForce TitanX||8||12||64GB||5.0|
|kp299 and kp300||Nvidia K80||8||12||64GB||3.5|
|p8.chpc.utah.edu||Nvidia K80||2||16 Power8||256 GB||3.5|
*The crypsoparc nodes are standalone servers owned by individual research groups. They are included here only for documentation purposes.
**Note that the P100 GPU nodes are owned by the School of Computing (SOC). Users outside the SOC can access them only in guest mode, in the same way that the non-GPU owner nodes are accessed.
Notchpeak has three nodes with three Tesla V100 cards each.
The V100 is of the Volta architecture, released in late 2017. Each GPU has 16 GB of global memory. Peak double precision performance of a V100 is 7 TFlops. These nodes have two 16-core Intel Skylake generation CPUs (Gold 6130 @ 2.10GHz) and 192 GB of RAM. We recommend using CUDA >= 9.1, which supports the 7.x compute capability.
The K80 is of the Kepler architecture, released in late 2014. Each K80 card consists of two GPUs, each GPU having 12 GB of global memory. Thus the K80 nodes will show 8 total GPUs available. Peak double precision performance of a single K80 card is 1864 GFlops. The K80 nodes have two 6-core Intel Haswell generation CPUs and 64 GB of RAM.
The GeForce TitanX is of the next generation Maxwell architecture, and also has 12 GB of global memory per card. An important difference from the K80 is that it does not have very good double precision performance (max ca. 200 GFlops), but it has great single precision speed (~ 7 TFlops). The TitanX nodes should be used for either single precision, or mixed single-double precision GPU codes. The TitanX nodes have two 6-core Intel Haswell generation CPUs and 64 GB of host RAM.
The Tesla P100 is of the Pascal architecture, and has 16 GB of global memory per card. Each GPU card contains 56 multiprocessors with 64 cores each (3584 cores in total). ECC support is (currently) disabled. The system interface is a PCIe Gen3 bus. The double-precision performance per card is 4.7 TFlops, the single-precision performance is 9.3 TFlops and the half-precision performance is 18.7 TFlops. The P100 nodes each have two 14-core Intel Broadwell processors (E5-2680 v4 running @ 2.4 GHz) and 256 GB of RAM.
The Ember cluster has eleven nodes which have two Tesla M2090 cards each. The M2090 is of the Fermi architecture (compute capability 2.0) that was released in 2011. Each card has 6 GB of global memory. Although relatively old, each card has a peak double precision floating point performance of 666 GFlops, still making it a good performer. The GPU nodes have two 6-core Intel Westmere generation CPUs and 24 GB of host RAM.
The use of the GPU nodes does not affect your allocation (i.e. their usage does not
count against any allocation your group may have). However, we have restricted them
to users with a special GPU account, which needs to be requested. Please e-mail firstname.lastname@example.org to request the GPU account. The GPU partition and account on ember are named
ember-gpu. On kingspeak the partition and account names are
kingspeak-gpu for the nodes containing K80 and the GeForce TitanX cards. On notchpeak the partition
and account are
The nodes with NVIDIA P100 cards are owner nodes. Therefore, the partition name is
kingspeak-gpu-guest and the account name is
owner-gpu-guest. Jobs on the NVIDIA P100 nodes may be subject to preemption.
GPUs have to be requested via a list of generic consumable resources (a.k.a. gres), for example
#SBATCH --gres=gpu:k80:8. The gres notation is a colon-separated list of
resource_type:resource_name:resource_count. In our case, the
resource_type is always gpu, the
resource_name identifies the GPU type (for example k80, titanx, p100 or v100), and the
resource_count is the number of GPUs requested per node (1 to 8). Note that without a line such as
#SBATCH --gres=gpu:k80:8, your job will not be assigned any GPUs.
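To illustrate the notation, the following minimal sketch assembles a gres option from its three parts. The helper name make_gres is hypothetical and exists only for this example; the resource names are the ones used in the commands on this page.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): build a --gres option
# from resource_type, resource_name and resource_count.
make_gres() {
    local type="$1" name="$2" count="$3"
    printf -- '--gres=%s:%s:%s\n' "$type" "$name" "$count"
}

# All eight K80 GPUs of a node
make_gres gpu k80 8
# A single TitanX GPU, leaving the remaining GPUs to other jobs
make_gres gpu titanx 1
```

The printed strings can be placed on an #SBATCH line or passed to salloc and srun on the command line.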
Some programs are serial, or able to run only on a single GPU; other jobs perform better on a single or small number of GPUs and therefore cannot efficiently make use of all of the GPUs on a single node. Therefore, in order to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions. This allows multiple jobs to run on the same node, each job being assigned specific resources (number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum available on each node. Note that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system. A job's performance can therefore be affected by other jobs running on the node at the same time. If you are benchmarking, request the entire node even if your job will only make use of part of it.
Node sharing is used whenever a job requests fewer than the full number of GPUs, cores and/or memory of a node; it can be based on any one of these resources or on all three. By default, each job gets 2 GB of memory per core requested (the lowest common denominator
among our cluster nodes). Therefore, to request a different amount of memory, you must use the
--mem flag. To request exclusive use of the node, use
When node sharing is on (the default unless the full number of GPUs, cores or memory is requested), the SLURM scheduler automatically sets task-to-core affinity, mapping one task per physical core. To find what cores are bound to the job's tasks, run:
Below is a list of useful job modifiers:
|#SBATCH --gres=gpu:k80:1||request one K80 GPU|
|#SBATCH --mem=4G||request 4 GB of RAM|
request all memory of the node; this option also
|#SBATCH --ntasks=1||requests 1 core|
|#SBATCH --mem=0||request all cores of the node
An example script that would request two Ember nodes with 2xM2090 GPUs, including all cores and all memory, running one GPU per MPI task, would look like this:
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
To request all 8 K80 GPUs on a kingspeak node, again using one GPU per MPI task, we would do:
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
As an example, using the script below will get four GPUs, four CPU cores, and 8GB of memory. The remaining GPUs, CPUs, and memory will then be accessible for other jobs.
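A minimal sketch of such a shared-node script follows. The choice of the K80 nodes on kingspeak and the wall time are assumptions made for illustration, and the summary variable is a hypothetical name used only to report what the scheduler assigned.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --gres=gpu:k80:4
#SBATCH --partition=kingspeak-gpu
#SBATCH --account=kingspeak-gpu
# Wall time is an assumed value; adjust to your job
#SBATCH --time=1:00:00

# ... prepare scratch directory, etc

# Report the resources assigned to this job. Slurm sets SLURM_NTASKS and,
# for GPU gres requests, CUDA_VISIBLE_DEVICES; outside a job they are unset.
summary="tasks=${SLURM_NTASKS:-unset} gpus=${CUDA_VISIBLE_DEVICES:-unset}"
echo "$summary"

# Launch the program here, one MPI task per GPU, e.g.
# mpirun -np $SLURM_NTASKS myprogram.exe
```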
The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.
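A minimal sketch of this request is shown here, using the guest partition and account named above. The wall time is an assumed value, and job_info is a hypothetical variable name.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=14
#SBATCH --mem=100G
#SBATCH --gres=gpu:p100:1
#SBATCH --partition=kingspeak-gpu-guest
#SBATCH --account=owner-gpu-guest
# Wall time is an assumed value; adjust to your job
#SBATCH --time=1:00:00

# ... prepare scratch directory, etc

# Record where the job ran; handy when a guest job is preempted and resubmitted
job_info="job=${SLURM_JOB_ID:-unset} host=$(hostname)"
echo "$job_info"

# Launch your GPU program here
```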
To run a parallel interactive job with MPI, do not use the usual
srun command, as this does not work correctly with the gres. Instead, use the
salloc command, e.g.
salloc -n 1 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:1
This will allocate the resources to the job, but keeps the prompt on the interactive node. You can then use srun or mpirun commands to launch the calculation on the allocated compute node resources.
For serial jobs, utilizing one or more GPUs, srun is functional, e.g.
srun -n 1 -N 1 -A owner-gpu-guest -p kingspeak-gpu-guest --gres=gpu:p100:1 --pty /bin/bash -l
On all GPU nodes Nvidia CUDA, PGI CUDA Fortran and the OpenACC compilers are installed.
The default CUDA is 8.0, which at the time of writing is the most recent. The default CUDA
installation is located at
/usr/local/cuda, and can also be set up by loading the CUDA module,
module load cuda. The PGI compilers come with their own, fairly recent CUDA, which is set up
by loading the PGI module,
module load pgi.
We recommend getting an interactive session on a compute node in order to compile CUDA code. In a pinch, any interactive node should work, as CUDA is installed on the interactive nodes as well. The PGI compilers come with their own CUDA, so compiling anywhere you can load the PGI module should work.
To compile CUDA code so that it runs on all five types of GPUs that we have (M2090, K80, TitanX, P100 and V100), use
the following compiler flags:
-gencode arch=compute_20,code=sm_20 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_52,code=sm_52
-gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70
For more information on the CUDA compilation and linking flags, please have a look at http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
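Because the flag list is long, it can be convenient to generate it in a build script. The sketch below builds it from the list of compute capabilities; the variable name GENCODE_FLAGS and the file names myprogram and myprogram.cu are placeholders. As far as we know, a single CUDA release may not accept every target in the list (Fermi sm_20 targets were dropped in CUDA 9, while sm_70 requires CUDA 9 or newer), so trim the list to match the CUDA version you load.

```shell
#!/bin/bash
# Compute capabilities targeted by the flags above:
# 20 (M2090), 37 (K80), 52 (TitanX), 60 (P100), 70 (V100)
GENCODE_FLAGS=""
for cc in 20 37 52 60 70; do
    GENCODE_FLAGS="$GENCODE_FLAGS -gencode arch=compute_${cc},code=sm_${cc}"
done
# Drop the leading space left by the first loop iteration
GENCODE_FLAGS="${GENCODE_FLAGS# }"

# Print the resulting compile command (placeholder file names)
echo "nvcc $GENCODE_FLAGS -o myprogram myprogram.cu"
```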
The PGI compilers specify the GPU architecture with the
-ta=tesla flag. If no further option is specified, the flag will generate code for all available
compute capabilities (at the time of writing cc20, cc30, cc35, cc50 and cc60). To
be specific for each GPU,
-ta=tesla:cc20 can be used for the M2090,
-ta=tesla:cc35 for the K80,
-ta=tesla:cc50 for the TitanX and
-ta=tesla:cc60 for the P100. To enable OpenACC, use the
-acc flag. More information on OpenACC can be obtained at http://www.openacc.org.
Good tutorials on GPU programming are available at the CUDA Education and Training site from Nvidia.
When running GPU code, it is worth checking the resources that the program is
using, to ensure that the GPU is well utilized. For that, one can run the
nvidia-smi command, and watch the memory and GPU utilization.
nvidia-smi is also useful to query and set various features of the GPU; see
nvidia-smi --help for all the options that the command takes. For example,
nvidia-smi -L lists the GPU card properties. On the TitanX nodes:
Titan Card: GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)
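To watch utilization while a job runs, nvidia-smi can also print selected fields in CSV form. The sketch below wraps such a query in a small function; the function name gpu_usage is hypothetical, and the check for nvidia-smi is there only so that the snippet fails cleanly on a machine without the Nvidia driver.

```shell
#!/bin/bash
# Hypothetical wrapper: print per-GPU utilization and memory use as CSV.
gpu_usage() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv
    else
        echo "nvidia-smi not found; run this on a GPU node" >&2
        return 1
    fi
}

# One-off report; do not abort the calling script if no GPU is present
gpu_usage || true
```

Adding -l 5 to the nvidia-smi command repeats the query every five seconds until interrupted.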
Nvidia's CUDA distribution includes a terminal debugger named
cuda-gdb. Its operation is similar to the GNU
gdb debugger. For details, see the cuda-gdb documentation.
For out of bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.
The Totalview debugger that we used to license and the DDT debugger that we currently license also support CUDA and OpenACC debugging. Because of their user-friendly graphical interfaces, we recommend them for GPU debugging. For information on how to use DDT or Totalview, see our debugging page.
Profiling can be very useful in finding GPU code performance problems, for example
inefficient GPU utilization, use of shared memory, etc. Nvidia CUDA provides both a
command line profiler (
nvprof) and a visual profiler (
nvvp). More information is in the CUDA profilers documentation.
We have the following GPU codes installed:
|Code name||Module name||Prerequisite modules||Sample batch script(s) location||Other notes|
|/uufs/chpc.utah.edu/sys/installdir/vasp/examples||Per group license, let us know if you need access|
|adapt CPU script|
|LAMMPS||lammps/10Aug15||intel/2016.0.109 impi/22.214.171.124||adapt CPU script|
If there is any other GPU code that you would like us to install, please let us know.
Some commercial programs that we have installed, such as Matlab, also have GPU support. Either try them or contact us.