Notchpeak User Guide - Center for High Performance Computing

Hardware Overview

The Notchpeak cluster began development in January of 2018 and is operated in a condominium-style fashion, containing both CHPC-owned nodes and nodes owned by different research groups. Users can access all nodes on notchpeak with their allocation and have guest access to the owner nodes in a preemptable fashion.

The General CHPC nodes include:

9 GPU nodes each with 32 (Intel XeonSP Skylake) or 40 (Intel XeonSP Cascadelake) cores, 192GB memory, and a mix of P40, V100, A100, and RTX2080Ti GPUs. These nodes are in the notchpeak-gpu partition and are described in more detail on our GPU & Accelerators page.
25 dual socket nodes (Intel XeonSP Skylake) with 32 cores each
- 4 nodes with 96 GB memory
- 19 nodes with 192 GB memory
- 2 nodes with 768 GB memory
1 dual socket node (Intel XeonSP Skylake) with 36 cores, 768 GB memory
7 dual socket nodes (Intel XeonSP Cascadelake) with 40 cores, 192 GB memory
32 single socket AMD Rome nodes with 64 cores, 256 GB memory -- see details on AMD nodes below
2 dual socket AMD Naples nodes, each with 64 cores, 512 GB memory. These are in a special short partition, notchpeak-shared-short. See below for details.

Other information on notchpeak's hardware and cluster configuration:

Mellanox EDR Infiniband interconnect
2 general interactive nodes

Important Differences from other CHPC Clusters

The Skylake and Cascadelake processors offer AVX-512 support. See our page on Single Executables for all CHPC Platforms for details on building applications to take advantage of this feature.
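To confirm that the node you are on actually supports AVX-512, you can look for the avx512f CPU flag. The following is a minimal sketch, assuming a Linux node that exposes CPU flags in /proc/cpuinfo; the helper name has_avx512 is hypothetical and not part of any CHPC tooling.

```shell
# Return success if the given cpuinfo file lists the avx512f (AVX-512 Foundation) flag.
# Defaults to /proc/cpuinfo, which is an assumption about the node's operating system.
has_avx512() {
    grep -qw avx512f "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_avx512; then
    echo "AVX-512 available on $(hostname)"
else
    echo "no AVX-512 on $(hostname)"
fi
```

Running this at the start of a job script records in the job output which kind of node the job landed on.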

Notchpeak Usage

CHPC resources are available to faculty, students under faculty supervision, and researchers from any Utah institution of higher education. Users can request accounts for CHPC computer systems by filling out an account request form.

The notchpeak cluster requires an allocation for jobs running in a non-preemptive state. Users who need an allocation may apply for wall clock hours per quarter by sending a brief proposal using the allocation form.

Notchpeak Access and Environment

The notchpeak cluster can be accessed via ssh (secure shell) at the following address:

notchpeak.chpc.utah.edu

The above address will randomly assign you to one of the two CHPC login nodes, notchpeak1.chpc.utah.edu or notchpeak2.chpc.utah.edu. Alternatively, you can specify a login node directly. This method is preferred because it enables our User Services team to resolve login issues more quickly.

All CHPC machines mount the same user home directories. This means that the user files on notchpeak will be exactly the same as the ones on other CHPC general environment clusters (kingspeak, lonepeak, ash). The advantage is that users do not need to copy files between machines.

Notchpeak compute nodes mount the following scratch file systems:

/scratch/general/nfs1
/scratch/general/vast

These scratch file systems are automatically scrubbed of files that have not been accessed for 60 days.
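To see which of your scratch files are getting close to the scrub window, you can search by last access time. The following is a sketch, assuming GNU find and a per-user directory under the scratch file system; the helper name list_stale_files and the 45-day warning threshold are illustrative choices, not CHPC settings.

```shell
# List files under a directory whose last access is older than a number of days.
# Usage: list_stale_files DIR [DAYS]   (DAYS defaults to 45, an arbitrary warning threshold)
list_stale_files() {
    local dir="$1" days="${2:-45}"
    find "$dir" -type f -atime +"$days" -print
}

# Assumed layout: one directory per user under the scratch file system
scratch_dir="/scratch/general/vast/${USER:-unknown}"
if [ -d "$scratch_dir" ]; then
    list_stale_files "$scratch_dir" 45
fi
```

Any files listed that are still needed should be copied to your home or group space before they are scrubbed.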

Your environment is set up through the use of modules. Please see the User Environment section of the General Cluster Information page for details on setting up your environment for batch and other applications.

Using the Batch System on Notchpeak

The batch implementation on notchpeak is Slurm.

Creation of a Batch Script on the Notchpeak Cluster

A shell script is a bundle of shell commands that are fed one after another to a shell (bash, tcsh, etc.). Each command starts once the previous one has finished, and this continues until the complete sequence of commands has been executed or the shell exits early on an error (by default most shells continue past a failed command unless told otherwise, for example with set -e in bash). A batch script is a shell script that defines the tasks a particular job has to execute on a cluster.
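Whether a script actually stops at the first failing command depends on the shell's settings. As a small illustration in bash (csh and tcsh differ in the details), the sketch below runs the same two commands with and without set -e; the variable names are only for this example.

```shell
# Default behaviour: the failing `false` does not stop the script
default_out=$(bash -c 'false; echo "still running"')

# With `set -e` the script exits at the first failing command,
# so the echo is never reached and nothing is printed
strict_out=$(bash -c 'set -e; false; echo "not reached"' || true)

echo "default: '${default_out}'"
echo "set -e:  '${strict_out}'"
```

In a batch script this matters because a failed copy or cd early in the script can otherwise go unnoticed until the main program runs in the wrong place.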

Below is an example batch script for running in Slurm on the notchpeak cluster. The lines at the top of the file all begin with #SBATCH; the shell interprets them as comments, but Slurm reads them as options.

#!/bin/csh

#SBATCH --time=1:00:00 # walltime, abbreviated by -t
#SBATCH --nodes=2 # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH --ntasks=64 # number of MPI tasks, abbreviated by -n
# additional information for allocated clusters
#SBATCH --account=baggins # account - abbreviated by -A
#SBATCH --partition=notchpeak # partition, abbreviated by -p
#
# set data and working directories

#create working directory environmental variable that points to directory data is housed in
setenv WORKDIR $HOME/mydata

#set scratch directory for holding temporary input/output files created during the job

setenv SCRDIR /scratch/general/vast/$USER/$SLURM_JOB_ID
mkdir -p $SCRDIR

#copy input files over to scratch directory, then move into the scratch directory
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR

# load appropriate modules, in this case Intel compilers, MPICH2
module load intel mpich2

# for MPICH2 over Ethernet, set communication method to TCP - this setting was written for general lonepeak nodes; notchpeak nodes use an Infiniband interconnect
# see above for network interface selection options for other MPI distributions
setenv MPICH_NEMESIS_NETMOD tcp

# run the program
# see above for other MPI distributions
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out
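
Users whose shell is bash can write the same job with export in place of setenv. The following sketch writes a bash version of the script above to slurmjob.script and checks its syntax. The account baggins, the mydata directory, and my_mpi_program are placeholders carried over from the example above, and the Ethernet-specific MPICH2 setting is left out here.

```shell
# Write a bash version of the example job to slurmjob.script.
# The quoted 'EOF' keeps $HOME, $USER, etc. unexpanded until the job runs.
cat > slurmjob.script <<'EOF'
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH -o slurm-%j.out-%N
#SBATCH --ntasks=64
#SBATCH --account=baggins
#SBATCH --partition=notchpeak

# directory that holds the input data
export WORKDIR="$HOME/mydata"

# per-job scratch directory for temporary input/output files
export SCRDIR="/scratch/general/vast/$USER/$SLURM_JOB_ID"
mkdir -p "$SCRDIR"

# copy input files to scratch, then work from there
cp -r "$WORKDIR"/* "$SCRDIR"
cd "$SCRDIR" || exit 1

# load Intel compilers and MPICH2, as in the csh example
module load intel mpich2

# run the program
mpirun -np "$SLURM_NTASKS" my_mpi_program > my_program.out
EOF

# check the generated script for syntax errors without running it
bash -n slurmjob.script && echo "slurmjob.script written"
```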

For more details and example scripts please see our Slurm documentation. Also, to help with specifying your job and instructions in your Slurm script, please review CHPC Policy 2.1.6 notchpeak Job Scheduling Policy.

Job Submission on Notchpeak

In order to submit a job on notchpeak, you must first log in to a notchpeak interactive node.

To submit a script named slurmjob.script, just type:

sbatch slurmjob.script

If you are logged into another general environment cluster, but would like to submit your job to notchpeak, you may do so with a variation of the previous command:

sbatch -M notchpeak slurmjob.script
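
If you want to keep the job ID for later commands, sbatch can print it in a machine-readable form with the --parsable option. The following is a sketch; the helper name job_id_from_parsable is hypothetical, and the submission only runs where sbatch and slurmjob.script are present.

```shell
# `sbatch --parsable` prints "jobid" or "jobid;cluster"; keep only the job ID
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

# Only attempt a submission where sbatch and the script are available
if command -v sbatch >/dev/null 2>&1 && [ -f slurmjob.script ]; then
    submitted=$(sbatch --parsable -M notchpeak slurmjob.script)
    echo "Submitted job $(job_id_from_parsable "$submitted")"
fi
```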

Checking the Status of your Job in Slurm

To check the status of your job, use the "squeue" command:

squeue --me

The above will filter the squeue output for only your jobs.

Notchpeak-shared-short nodes

Two AMD processor (Epyc 7601, Naples) based nodes are available as compute nodes on the notchpeak cluster within the notchpeak-shared-short partition. Each node has 64 physical cores and 512GB of memory.

Instead of adding these nodes to the general notchpeak partition, we are using them as a “test or debug” queue with a shorter maximum wall time, allowing users to access computational resources for small, short jobs, testing code, or debugging code with little to no wait time in the Slurm queue.

To maximize throughput of short jobs and provide access to all users, these nodes have been placed in a separate partition with node sharing enabled. Use of these nodes is subject to the following limits:

Maximum wall time is 8 hours
Maximum running jobs per user is 2
Maximum cores per user is 32
Maximum memory per user is 128 GB
Maximum cores per job is 16
Maximum memory per job is 128 GB

These nodes are available for use by all users, regardless of access to a general allocation. The use of these nodes will not count against any allocation. To use these nodes, set both the partition and the account to notchpeak-shared-short. Because node sharing is enabled, users must specify the number of cores and the amount of memory; see https://www.chpc.utah.edu/documentation/software/node-sharing.php for additional details.
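
As an illustration of these settings, the following is a sketch of a bash job script for the notchpeak-shared-short partition. The time, core, and memory values are arbitrary choices that stay within the limits listed above, and the echo line stands in for your own commands.

```shell
#!/bin/bash
#SBATCH --partition=notchpeak-shared-short
#SBATCH --account=notchpeak-shared-short
#SBATCH --time=2:00:00   # example value; the partition maximum is 8 hours
#SBATCH --ntasks=4       # example value; at most 16 cores per job
#SBATCH --mem=16G        # example value; at most 128 GB per job

# Outside of Slurm these variables are unset, so fall back to placeholders
ntasks="${SLURM_NTASKS:-4}"
echo "Job ${SLURM_JOB_ID:-none} running with ${ntasks} task(s)"
```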

For information on compiling on the clusters at CHPC, please see our Programming Guide.

Notchpeak AMD Rome Nodes

32 AMD Rome (7702P processor) based nodes are available as compute nodes within the general notchpeak partition (the Slurm partition must be set to notchpeak or notchpeak-shared to use these nodes). Each node has 64 physical cores and 256 GB of memory. An evaluation of the AMD Rome processors is available; it includes benchmarks of their performance relative to several generations of Intel processors.

Some notes on usage:

As we have a mix of Intel and AMD processors on notchpeak, we have added additional terms (rom for AMD Rome, skl for Intel Skylake, csl for Intel Cascadelake) to the feature list for the notchpeak compute nodes. For completeness we have also added the feature npl for the AMD Naples processor. These terms, along with the Slurm constraint flag, can be used to target either the AMD processor nodes (#SBATCH -C rom) or the Intel-based nodes (#SBATCH -C "skl|csl") for a given job. If you do not set this constraint flag, the job will be eligible to run on any of the notchpeak general partition nodes, based on your other SBATCH options. One easy way to get the complete list of features is to use the command scontrol show node notchXXX, substituting 'notchXXX' with the name of the node of interest.
The AMD Rome based nodes do not have AVX-512 support. AVX-512 support is available on all the Intel-based nodes on notchpeak.
We have found that for codes that make use of the MKL libraries, increased performance is often obtained if the MKL_DEBUG_CPU_TYPE environment variable is set to 5. You should test your code to see if this improves performance. This variable can be set in your batch script by:
- Tcsh: setenv MKL_DEBUG_CPU_TYPE 5
- Bash: export MKL_DEBUG_CPU_TYPE=5
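
To pull just the feature list out of the scontrol output mentioned above, the output can be filtered for the AvailableFeatures field. The following is a sketch, assuming a Slurm version that reports features under that field name; the helper name extract_features and the node name notch001 are hypothetical.

```shell
# Print the value of the AvailableFeatures field from `scontrol show node` output on stdin
extract_features() {
    sed -n 's/.*AvailableFeatures=\([^ ]*\).*/\1/p'
}

NODE_NAME="notch001"   # hypothetical node name; substitute the node of interest

# Only query Slurm where scontrol is available
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node "$NODE_NAME" | extract_features
fi
```

The resulting comma-separated list shows which of the rom, skl, csl, or npl terms can be used with the -C flag for that node.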

As these nodes each have 64 cores, node sharing becomes even more important for jobs that will not efficiently use all cores. These nodes are part of two partitions, notchpeak and notchpeak-shared:
- The notchpeak partition reserves the whole node for your job, even if your job requires only part of the node.
- The notchpeak-shared partition shares nodes among jobs, splitting each node by the CPU count requested in the #SBATCH --ntasks directive.
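
Putting these notes together, the following is a sketch of a bash job script that requests part of an AMD Rome node through the notchpeak-shared partition. The account baggins is the placeholder used in the batch script example earlier in this guide; the core, memory, and time values are arbitrary, and the --mem line assumes that memory should be requested explicitly on shared nodes.

```shell
#!/bin/bash
#SBATCH --partition=notchpeak-shared
#SBATCH --account=baggins   # placeholder account name
#SBATCH -C rom              # target the AMD Rome nodes
#SBATCH --ntasks=8          # example value: 8 of the 64 cores on a node
#SBATCH --mem=32G           # example value (assumption: request memory explicitly)
#SBATCH --time=1:00:00

# MKL setting suggested above for AMD nodes; test whether it helps your code
export MKL_DEBUG_CPU_TYPE=5

# Outside of Slurm this variable is unset, so fall back to a placeholder
ntasks="${SLURM_NTASKS:-8}"
echo "Running with ${ntasks} task(s), MKL_DEBUG_CPU_TYPE=${MKL_DEBUG_CPU_TYPE}"
```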