GPUs and Accelerators at CHPC

The CHPC has a limited number of cluster compute nodes with GPUs. The GPU devices are found on the Kingspeak, Notchpeak and Redwood (Protected Environment, PE) clusters. This document describes the hardware, as well as access to and usage of these resources.

 

Below are descriptions of the GPU nodes on notchpeak, kingspeak, and redwood (PE), as well as of the different GPU devices available at CHPC.

 

*Note that while notch[081-082,308-309] are general GPU nodes, they are not part of the notchpeak-gpu partition discussed below; they are in the notchpeak-shared-short partition, which has special queue constraints.

Note that for the nodes owned by research groups, CHPC users who do not belong to the owning group can use those devices if and only if they use the SLURM partition notchpeak-gpu-guest.

Table 1. GPU nodes at CHPC

Nodes | GPU Type | # GPU Devices per Node | GPU Global Memory (GB) | Compute Capability | Gres flag/Access Code | General Node?
notch[001-003] | Tesla V100 | 3 | 16 | 7.0 | gpu:v100:[1-3] | Y
notch004 | RTX 2080 Ti | 2 | 11 | 7.5 | gpu:2080ti:[1-2] | Y
notch004 | Tesla P40 | 1 | 24 | 6.1 | gpu:p40:1 | Y
notch055 | Titan V | 4 | 12 | 7.0 | gpu:titanv:[1-4] | N
notch060 | GTX 1080 Ti | 8 | 11 | 6.1 | gpu:1080ti:[1-8] | N
notch[081-082] | GTX 1080 Ti | 2 | 11 | 6.1 | gpu:1080ti:[1-2] | Y*
notch[083-084] | RTX 2080 Ti | 4 | 11 | 7.5 | gpu:2080ti:[1-4] | N
notch085 | RTX 2080 Ti | 4 | 11 | 7.5 | gpu:2080ti:[1-4] | N
notch[086-088] | RTX 2080 Ti | 4 | 11 | 7.5 | gpu:2080ti:[1-4] | Y
notch089 | RTX 2080 Ti | 4 | 11 | 7.5 | gpu:2080ti:[1-4] | N
notch102 | RTX 3090 | 2 | 24 | 8.6 | gpu:3090:[1-2] | N
notch103 | RTX 2080 Ti | 4 | 11 | 7.5 | gpu:2080ti:[1-4] | N
notch103 | Tesla A40 | 2 | 48 | 8.6 | gpu:a40:[1-2] | N
notch136 | RTX 2080 Ti | 8 | 11 | 7.5 | gpu:2080ti:[1-8] | N
notch[168-169] | RTX 2080 Ti | 8 | 11 | 7.5 | gpu:2080ti:[1-8] | N
notch204 | Tesla V100 | 2 | 16 | 7.0 | gpu:v100:[1-2] | N
notch215 | RTX 2080 Ti | 8 | 11 | 7.5 | gpu:2080ti:[1-8] | N
notch271 | RTX 2080 Ti | 8 | 11 | 7.5 | gpu:2080ti:[1-8] | Y
notch293 | Tesla A100 | 4 | 40 | 8.0 | gpu:a100:[1-4] | Y
notch293 | RTX 3090 | 4 | 24 | 8.6 | gpu:3090:[1-4] | Y
notch294 | RTX 2080 Ti | 4 | 11 | 7.5 | gpu:2080ti:[1-4] | N
notch299 | RTX 3090 | 8 | 24 | 8.6 | gpu:3090:[1-8] | N
notch300 | Tesla T4 | 1 | 16 | 7.5 | gpu:t4:1 | N
notch[308-309] | Tesla T4 | 2 | 16 | 7.5 | gpu:t4:[1-2] | Y*
notch328 | RTX 3090 | 8 | 24 | 8.6 | gpu:3090:[1-8] | Y
notch329 | Tesla A40 | 4 | 48 | 8.6 | gpu:a40:[1-4] | N
notch330 | Tesla A100 | 8 | 40 | 8.0 | gpu:a100:[1-8] | N
notch343 | RTX A6000 | 1 | 48 | 8.6 | gpu:a6000:1 | N
notch347 | Tesla A100 | 2 | 80 | 8.0 | gpu:a100:[1-2] | N
notch348 | Tesla A100 | 1 | 80 | 8.0 | gpu:a100:1 | N
notch[367-368] | RTX A6000 | 8 | 48 | 8.6 | gpu:a6000:[1-8] | N
notch[369-370] | Tesla A100 | 2 | 80 | 8.0 | gpu:a100:[1-2] | N
kp[297-298] | GTX Titan X | 8 | 12 | 5.2 | gpu:titanx:[1-8] | N
kp[359-362] | Tesla P100 | 2 | 16 | 6.0 | gpu:p100:[1-2] | N
rw[085-086] | GTX 1080 Ti | 4 | 11 | 6.1 | gpu:1080ti:[1-4] | Y
rw183 | Tesla A100 | 8 | 40 | 8.0 | gpu:a100:[1-8] | Y
rw188 | Tesla A40 | 4 | 48 | 8.6 | gpu:a40:[1-4] | N
rw[189-190] | Tesla A30 | 2 | 24 | 8.0 | gpu:a30:[1-2] | N
lp[234-254] | GTX 1080 Ti | 8 | 11 | 6.1 | gpu:1080ti:[1-8] | Y

  • Tesla P40 device (Pascal generation). It contains 24 GB of global memory (GDDR5X) with a memory bandwidth of 346 GB/s. Its single-precision performance is 12 TFlops; its double-precision performance is 0.35 TFlops.
  • Titan V devices (Volta generation). Each device has 5120 CUDA cores and 12 GB of global memory (HBM2). Its memory bandwidth is 652.8 GB/s. Each device has the following peak performance: 6.9 TFlops (double precision), 13.8 TFlops (single precision), 27.6 TFlops (half precision) and 110 TFlops (tensor performance, deep learning).
  • GTX 1080 Ti devices (Pascal generation). Each device contains 11 GB of global memory (GDDR5X) with a memory bandwidth of 484 GB/s. It also contains 3584 CUDA cores. Each GPU card has a single-precision performance of 10.6 TFlops. The double-precision performance is only 0.33 TFlops.
  • RTX 2080 Ti devices (Turing generation). Each GPU device has 11 GB of global memory (GDDR6) with a memory bandwidth of 616 GB/s. It also contains 4352 CUDA cores. The single-precision performance of each GPU device is 11.75 TFlops. Its double-precision performance is 0.37 TFlops.
  • GeForce TitanX devices (Maxwell generation). Each of these GPU devices has 12 GB of global memory. Each device has great single-precision performance (~7 TFlops) but performs rather poorly in double precision (max. ca. 200 GFlops). Therefore, the TitanX nodes should be used for either single-precision or mixed single-double precision GPU codes.
  • Tesla P100 devices (Pascal generation). Each device has 16 GB of global memory. Each GPU card contains 56 multiprocessors with 64 cores each (3584 CUDA cores in total). ECC support is (currently) disabled. The system interface is a PCIe Gen3 bus. The double-precision performance per card is 4.7 TFlops. The single-precision performance is 9.3 TFlops and the half-precision performance is 18.7 TFlops.
  • Tesla V100 (Volta generation). Each GPU has 16 GB of global memory. The peak double-precision performance of a Tesla V100 is 7 TFlops; its peak single-precision performance is 14 TFlops. Its memory bandwidth is 900 GB/s. It also contains 5120 CUDA cores.
  • Tesla A100 (Ampere generation). Each GPU has either 40 or 80 GB of global memory and a peak double-precision performance of 9.7 TFlops. The total memory bandwidth is 1555 GB/s. There are 6912 CUDA cores.
  • RTX 3090 (Ampere generation). Each GPU has 24 GB of global memory, 10,496 CUDA cores and 328 Tensor cores. The peak single-precision performance is 35.7 TFlops. The memory bandwidth is 936 GB/s.
  • Tesla T4 (Turing generation). Each GPU has 16 GB of global memory, with 2,560 CUDA cores and 320 Tensor cores. The peak single-precision performance is 8.1 TFlops, with 65 TFlops peak mixed-precision performance. The memory bandwidth is 320 GB/s.
  • Tesla A40 (Ampere generation). Each GPU has 48 GB of global memory. The memory bandwidth is 696 GB/s. There are 10,752 CUDA cores, 84 2nd-generation Ray Tracing cores, and 336 3rd-generation Tensor cores.
  • Tesla A30 (Ampere generation). Each GPU has 24 GB of global memory and a peak double-precision performance of 5.2 TFlops. The memory bandwidth is 933 GB/s.
  • RTX A6000 (Ampere generation). Each GPU has 48 GB of global memory and 10,752 CUDA cores. The memory bandwidth is 768 GB/s.

The node notch060 has 16 physical CPU cores (dual Intel Xeon Silver 4110 CPU @ 2.10 GHz) and 96 GB memory.

The nodes notch[001-004, 055, 083-084] each have 32 physical CPU cores (dual Intel Xeon Gold 6130 CPU @ 2.10 GHz) and 192 GB memory.

The nodes notch[081-082] each have 64 physical CPU cores (dual AMD EPYC 7601 @ 2.2 GHz) and 512 GB memory.

The node notch085 has 32 physical CPU cores (dual Intel Xeon Silver 4216 CPU @ 2.10 GHz) and 384 GB memory.

The nodes notch[086-089, 136, 204, 215, 271, 243] each have 40 physical cores (dual Intel Xeon Gold 6230 CPU @ 2.10 GHz) and 192 GB memory.

The node notch102 has 40 physical CPU cores (dual Intel Xeon Gold 6230 CPU @ 2.10GHz) and 192 GB memory.

The node notch103 has 40 physical CPU cores (dual Intel Xeon Gold 6230 CPU @ 2.10GHz) and 384 GB memory.

The nodes notch[168, 169] each have 40 physical CPU cores (dual Intel Xeon Gold 6230 CPU @ 2.10 GHz) and 384 GB memory.

The node notch293 has 64 physical cores (dual AMD 7502 CPU @ 2.5 GHz) and 512 GB memory.

The node notch299 has 32 physical cores (single AMD 7502P 32-Core Processor) and 256 GB memory.

The node notch300 has 52 physical cores (dual Intel Xeon Gold 6230R CPU @ 2.10 GHz) and 768 GB memory.

The nodes notch[308-309] each have 52 physical cores (dual Intel Xeon Gold 6230R CPU @ 2.10 GHz) and 512 GB memory.

The node notch328 has 48 physical cores (dual AMD 7413 CPU @ 2.65 GHz) and 256 GB memory.

The node notch329 has 64 physical cores (dual AMD 7543 CPU @ 2.8 GHz) and 512 GB memory.

The node notch330 has 128 physical cores (dual AMD 7713 CPU @ 2.0 GHz) and 1024 GB memory.

The node notch343 has 56 physical cores (dual Intel Xeon Gold 6330 CPU @ 2.0 GHz) and 256 GB memory.

The nodes notch[347-348] each have 56 physical cores (dual Intel Xeon Gold 6330 CPU @ 2.0 GHz) and 1024 GB memory.

The nodes notch[367-370] each have 64 physical cores (dual AMD 7513 @ 2.6 GHz) and 256 GB memory.

Note that the owner GPU nodes are accessible to all CHPC users through the SLURM partition notchpeak-gpu-guest, and only while the owner group leaves the nodes idle; guest jobs are preemptible by SLURM jobs from the owner group.

Also note that the nodes notch[081-082,308-309] are part of the notchpeak-shared-short partition, which is open to all users but has constraints on the wall time, cores, memory and GPUs allowed per user.

 

The GPU nodes kp297 and kp298 each have two 6-core Intel Haswell generation CPUs (2.4 GHz) and 64 GB memory.

The remaining GPU nodes, kp[359-362], each have two 14-core Intel Broadwell processors (E5-2680 v4 running @ 2.4 GHz) and 256 GB RAM. These nodes are owned by the School of Computing (SOC). Therefore, CHPC users outside SOC need to use the SLURM kingspeak-gpu-guest partition if they want to access the GPUs of the kp[359-362] nodes.

Redwood contains two general compute nodes with GPU devices. Both compute nodes have 32 physical CPU cores (Intel Xeon Gold 6130 CPU @ 2.10 GHz) and 192 GB memory.

There are also owner GPU nodes:

rw183 has 128 physical cores (dual AMD 7713 CPU @ 2.0 GHz) and 1024 GB memory.

rw188 has 56 physical cores (dual Intel Xeon Gold 6330 CPU @ 2.0 GHz) and 512 GB memory.

rw[189-190] each have 56 physical cores (dual Intel Xeon Gold 6330 CPU @ 2.0 GHz) and 1024 GB memory.

The use of the GPU nodes does not affect your allocation (i.e. their usage does not count against any allocation your group may have). However, access to the GPU nodes is restricted to users who have a special GPU account, which needs to be requested. Please e-mail helpdesk@chpc.utah.edu to do so.

There are 2 categories of GPU nodes:

  • General nodes (labeled Y in the last column of Table 1), i.e. nodes owned by the CHPC. The corresponding SLURM partition and account settings are:
    • --account=$(clustername)-gpu
    • --partition=$(clustername)-gpu
    where $(clustername) stands for either notchpeak, kingspeak, or redwood.
  • Owner nodes (non-general nodes), i.e. nodes owned by a research group (labeled N in the last column of Table 1).
    • Members of the groups who own GPU nodes are given a specific SLURM account and partition to access these nodes. For example, the members of the group who own the nodes kp[359-362] need to use the following settings:
      • --account=soc-gpu-kp
      • --partition=soc-gpu-kp
    • Users outside the group can also use these devices. The corresponding account and partition names are:
      • --account=owner-gpu-guest
      • --partition=$(clustername)-gpu-guest
      where $(clustername) stands for either notchpeak or kingspeak. Note that jobs by users outside the group may be subject to preemption.
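
The naming rules above are regular enough to assemble mechanically. The following is a minimal sketch of a shell helper that prints the matching --account and --partition flags; the function name gpu_account_flags and its "general"/"guest" argument are hypothetical conveniences for illustration, not a CHPC-provided tool. Group-specific owner accounts (such as soc-gpu-kp) are not covered, since they are assigned per group.

```shell
# Hypothetical helper (not a CHPC-provided tool): print the --account and
# --partition flags for a cluster, following the naming rules listed above.
#   $1: cluster name (notchpeak, kingspeak, or redwood)
#   $2: "general" (default) for CHPC-owned nodes, "guest" for owner nodes
gpu_account_flags() {
    local cluster="$1" mode="${2:-general}"
    case "$mode" in
        general)
            case "$cluster" in
                notchpeak|kingspeak|redwood)
                    echo "--account=${cluster}-gpu --partition=${cluster}-gpu" ;;
                *)
                    echo "unknown cluster: $cluster" >&2; return 1 ;;
            esac ;;
        guest)
            # The text names guest partitions only for notchpeak and kingspeak.
            case "$cluster" in
                notchpeak|kingspeak)
                    echo "--account=owner-gpu-guest --partition=${cluster}-gpu-guest" ;;
                *)
                    echo "no guest partition listed for: $cluster" >&2; return 1 ;;
            esac ;;
        *)
            echo "mode must be 'general' or 'guest'" >&2; return 1 ;;
    esac
}
```

Assuming the helper is defined in the current shell, it could be combined with salloc, for example: salloc -n 1 -N 1 $(gpu_account_flags kingspeak guest) --gres=gpu:p100:1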

If one wants to access the GPU devices on a node one must specify the generic consumable resources flag (a.k.a. gres flag). The gres flag has the following syntax:

--gres=$(resource_type)[:$(resource_name):$(resource_count)]
where:

  • $(resource_type) is always the string gpu for the GPU devices.
  • $(resource_name) is a string which describes the type of the requested GPU(s), e.g. 1080ti, titanv, 2080ti, etc.
  • $(resource_count) is the number of GPU devices of the type $(resource_name) that are requested.
    Its value is an integer in the closed interval [1, max. number of devices on one node].
  • The [ ] denotes optional parameters. That is, --gres=gpu is enough to request a single GPU of any type (the default count is 1). To request more than one GPU of any type, add the $(resource_count), e.g. --gres=gpu:2.

The gres flag attached to each type of node can be found in the second-to-last column of Table 1.
For example, the flag --gres=gpu:titanx:5 must be used to request 5 GTX Titan X devices that can only be satisfied by the nodes kp297 and kp298. 

Note that if you do not specify the gres flag, your job will run on a GPU node (presuming you use the correct combination of the --partition and --account flags), but it will not have access to the node's GPUs.
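
To make the gres syntax concrete, here is a small sketch that builds the flag from an optional GPU type and an optional count. The function name build_gres is hypothetical and chosen only for illustration; the type strings are the ones listed in the gres column of Table 1. When a type is given without a count, the sketch writes the default count of 1 explicitly.

```shell
# Hypothetical helper: build a --gres flag from an optional GPU type and count.
#   build_gres               -> any single GPU
#   build_gres "" 2          -> two GPUs of any type
#   build_gres titanx 5      -> five GTX Titan X GPUs
build_gres() {
    local name="${1:-}" count="${2:-}"
    # Reject anything that is not a positive integer when a count is given.
    case "$count" in
        '') ;;
        *[!0-9]*|0*) echo "count must be a positive integer" >&2; return 1 ;;
    esac
    if [ -n "$name" ]; then
        echo "--gres=gpu:${name}:${count:-1}"
    elif [ -n "$count" ]; then
        echo "--gres=gpu:${count}"
    else
        echo "--gres=gpu"
    fi
}
```

The helper only checks that the count is a positive integer; it does not know how many devices a given node type holds, so the upper bound from Table 1 still has to be respected by the caller.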

Some programs are serial, or able to run only on a single GPU; other jobs perform better on one or a small number of GPUs and therefore cannot efficiently make use of all of the GPUs on a single node. Therefore, in order to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions. This allows multiple jobs to run on the same node, each job being assigned specific resources (number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum available on each node. It should be noted that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system. Therefore, a job's performance can be affected by the other job(s) running on the node at the same time. If you are doing benchmarking, you will want to request the entire node even if your job will only make use of part of it.

Node sharing can be accessed by requesting fewer than the full number of GPUs, cores and/or memory on a node; sharing can be based on any one of these resources or on all three. By default, each job gets 2 GB of memory per core requested (the lowest common denominator among our cluster nodes); to request a different amount, you must use the --mem flag. To request exclusive use of the node, use --mem=0.
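
As a worked example of the default memory rule, the sketch below computes the memory a job receives when --mem is not given, and decides whether an explicit --mem flag is needed for a desired amount. It assumes the 2 GB-per-core default stated above; the function names default_mem_gb and mem_flag are hypothetical.

```shell
# Default memory (in GB) for a shared-node job that does not pass --mem,
# assuming the 2 GB-per-core default described above.
default_mem_gb() {
    local cores="$1"
    echo $(( cores * 2 ))
}

# Print the --mem flag needed to get wanted_gb of memory for a job with the
# given number of cores, or an empty line if the default already matches.
mem_flag() {
    local cores="$1" wanted_gb="$2"
    if [ "$wanted_gb" -eq "$(default_mem_gb "$cores")" ]; then
        echo ""
    else
        echo "--mem=${wanted_gb}G"
    fi
}
```

For instance, a job with 4 cores gets 8 GB by default, so asking for 8 GB needs no flag, whereas a 14-core job that needs 100 GB must pass --mem=100G.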

When node sharing is on (the default unless the job asks for the full number of GPUs, cores or memory), the SLURM scheduler automatically sets task-to-core affinity, mapping one task per physical core. To find which cores are bound to the job's tasks, run:

cat /cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus

Below is a list of useful job options:

Option | Explanation
#SBATCH --gres=gpu:1080ti:1 | request one 1080ti GPU
#SBATCH --mem=4G | request 4 GB of RAM
#SBATCH --mem=0 | request all memory of the node; this option also ensures the node is in exclusive use by the job
#SBATCH --ntasks=1 | request 1 task, mapping it to 1 CPU core

An example script that would request two Notchpeak nodes with 2xM2090 GPUs, including all cores and all memory, running one GPU per MPI task, would look like this:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:m2090:2
#SBATCH --time=1:00:00
# ... prepare scratch directory, etc.
mpirun -np $SLURM_NTASKS myprogram.exe

To request all eight 3090 GPUs on notch328, again using one GPU per MPI task, we would do:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:3090:8
#SBATCH --time=1:00:00
# ... prepare scratch directory, etc.
mpirun -np $SLURM_NTASKS myprogram.exe

As an example, the script below will get four GPUs, four CPU cores, and 8 GB of memory (four cores at the default 2 GB per core). The remaining GPUs, CPUs, and memory will then be accessible to other jobs.

#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:titanx:4
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu

The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.

#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=14
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=100GB
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest
 

To run an interactive job, add the --gres=gpu option to the salloc command, e.g.

salloc -n 2 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:2

This will allocate the resources to the job, namely two tasks and two GPUs. To run a parallel job, use the srun or mpirun commands to launch the calculation on the allocated compute node resources. To specify more memory than the default 2 GB per task, use the --mem option.

For serial, non-MPI jobs, utilizing one or more GPUs, ask for one node, e.g.

salloc -n 1 -N 1 -A owner-gpu-guest -p kingspeak-gpu-guest --gres=gpu:p100:1

This will allocate the resources to the job, namely one core (task) and one GPU. To run the job, use the srun command to launch the calculation on the allocated compute node resources.

Note - April 2021: Due to a change in how SLURM allocates GPUs to a job, the nvidia-smi command now shows only the GPUs that have been assigned to your job, instead of all of the GPUs on the node.

Note - April 2022: You can also check on the utilization of the GPUs assigned to a batch job from an interactive node with the following command and the job number:

srun --pty --jobid XXXX nvidia-smi

This will return the nvidia-smi results for the GPU(s) assigned to your job.

As of early 2021, all CHPC GPUs are from Nvidia, and as such we offer Nvidia programming tools, including CUDA, PGI CUDA Fortran and the OpenACC compilers. The latest programming tools are all included in the Nvidia HPC SDK, available via module load nvhpc; it includes both the former Nvidia CUDA compilers and PGI compilers, along with the Nvidia GPU libraries, debugger and profilers.

Alternatively, one can explicitly load a CUDA version with module load cuda/version. Run module spider cuda to see which versions are available. The deprecated PGI compilers, which come with their own CUDA, can be set up by loading the PGI module, module load pgi.

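To compile CUDA code so that it runs on several types of GPUs, pass one -gencode flag per target compute capability (the compute capability of each device is listed in Table 1). For example, the following compiler flags cover five compute capabilities: -gencode arch=compute_20,code=sm_20 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70. Note that the set of accepted architectures depends on the CUDA version; recent CUDA releases no longer accept the oldest targets, such as sm_20. For more info on the CUDA compilation and linking flags, please have a look at http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.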
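
Since the -gencode flags follow a fixed pattern, they can be generated from a list of compute capabilities. The sketch below is one way to do that; the function name gencode_flags is hypothetical, and the capability values passed to it are meant to be taken from the Compute Capability column of Table 1. Whether a particular target is accepted depends on the CUDA version that is loaded (for example, sm_86 requires CUDA 11.1 or newer).

```shell
# Hypothetical helper: turn compute capabilities such as "7.0 8.6" into the
# corresponding nvcc -gencode flags.
gencode_flags() {
    local flags="" cc
    for cc in "$@"; do
        # Drop the dot: 7.5 becomes 75, matching the compute_XX/sm_XX naming.
        cc=$(printf '%s' "$cc" | tr -d '.')
        flags="${flags:+$flags }-gencode arch=compute_${cc},code=sm_${cc}"
    done
    printf '%s\n' "$flags"
}
```

Assuming a source file named myprogram.cu (a placeholder), the flags could be used as: nvcc $(gencode_flags 6.0 7.0 7.5 8.0 8.6) -o myprogram myprogram.cu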

The Nvidia HPC compilers (formerly PGI compilers) specify the GPU architecture with the -ta=tesla flag. If no further option is specified, the flag will generate code for all available compute capabilities (at the time of writing cc20, cc30, cc35, cc50 and cc60). To be specific for each GPU, -ta=tesla:cc20 can be used for the M2090, -ta=tesla:cc50 for the TitanX and -ta=tesla:cc60 for the P100. To invoke OpenACC, use the -acc flag. More information on OpenACC can be obtained at http://www.openacc.org.

Good tutorials on GPU programming are available at the CUDA Education and Training site from Nvidia.

When running GPU code, it is worth checking the resources that the program is using, to ensure that the GPU is well utilized. For that, one can run the nvidia-smi command and watch the memory and GPU utilization. nvidia-smi is also useful for querying and setting various features of the GPU; see nvidia-smi --help for all the options that the command takes. For example, nvidia-smi -L lists the GPU cards. On the TitanX nodes:

 Titan Card: GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)
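
For a quick scripted check of utilization, nvidia-smi can print selected fields as CSV, which is easy to filter. The sketch below flags GPUs whose utilization is below a threshold; the function name low_util_gpus and the default threshold of 10 percent are arbitrary choices for illustration. It reads lines of the form "index, utilization" from standard input.

```shell
# Hypothetical helper: report GPUs whose utilization (percent) is below a
# threshold. Expects input lines like "0, 95", as printed by
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# Typical use on a GPU node:
#   nvidia-smi --query-gpu=index,utilization.gpu \
#       --format=csv,noheader,nounits | low_util_gpus 20
low_util_gpus() {
    local threshold="${1:-10}"
    awk -F', *' -v t="$threshold" \
        '$2 + 0 < t { print "GPU " $1 " is at " $2 "%" }'
}
```

A single sample can be misleading, since utilization fluctuates over the course of a run; it is worth repeating the check a few times before concluding that a GPU is idle.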

Nvidia HPC SDK bundles Math libraries which can be used to offload computation to the GPUs and Communication libraries which can be used for fast communication between multiple GPUs. The Math libraries are located at $NVROOT/math_libs and the communication libraries are at $NVROOT/comm_libs.

To compile and link with a library such as cuBLAS, include the following parameter in the compilation line: -I$NVROOT/math_libs/include, and these parameters in the link line: -L$NVROOT/math_libs/lib64 -Wl,-rpath=$NVROOT/math_libs/lib64 -lcublas.

Nvidia HPC SDK and CUDA distributions include a terminal debugger named cuda-gdb. Its operation is similar to the GNU gdb debugger. For details, see the cuda-gdb documentation.

For out-of-bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.

The Totalview debugger that we used to license and the DDT debugger that we currently license also support CUDA and OpenACC debugging. Because of their user-friendly graphical interfaces, we recommend them for GPU debugging. For information on how to use DDT or Totalview, see our debugging page.

Profiling can be very useful in finding GPU code performance problems, for example inefficient GPU utilization, use of shared memory, etc. Nvidia CUDA provides a visual profiler called Nsight Systems (nsight-sys). There are also an older, deprecated command-line profiler (nvprof) and visual profiler (nvvp). More information is in the CUDA profilers documentation.

NOTE that the usage of GPU hardware counters is restricted due to a security problem in the Nvidia driver, as detailed in this Nvidia post. The restriction shows up as the following error:

$ nvprof -m all ./my-gpu-program
==4983== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link
for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM

In this case, please open a support ticket asking for a reservation on a GPU node. Our admins will enable hardware counter profiling for all users inside this reservation and explain how to use it.

 

 

We have the following GPU codes installed:

Code name | Module name | Prerequisite modules | Sample batch script(s) location | Other notes
HOOMD | hoomd | gcc/4.8.5, mpich2/3.2.g | /uufs/chpc.utah.edu/sys/installdir/hoomd/2.0.0g-[sp,dp]/examples/ |
VASP | vasp | intel, impi, cuda/7.5 | /uufs/chpc.utah.edu/sys/installdir/vasp/examples | Per group license, let us know if you need access
AMBER | amber-cuda | gcc/4.4.7, mvapich2/2.1.g | adapt CPU script |
LAMMPS | lammps/10Aug15 | intel/2016.0.109, impi/5.1.1.109 | adapt CPU script |

If there is any other GPU code that you would like us to install, please let us know.

Some commercial programs that we have installed, such as Matlab, also have GPU support. Either try them or contact us.

Last Updated: 8/26/22