Notchpeak User Guide
Notchpeak, started in January 2018, is operated in a condominium fashion with some general CHPC nodes on which groups can obtain an allocation, along with additional nodes owned by different research groups. All users have guest access to the owner nodes in a preemptable fashion.
The General CHPC nodes include:
- 9 GPU nodes each with 32 (Intel XeonSP Skylake) or 40 (Intel XeonSP Cascadelake) cores, 192GB memory, and a mix of P40, V100, A100, and RTX2080Ti GPUs. Note that these nodes are in the notchpeak-gpu partition and are described in more detail on our GPU & Accelerators page.
- 25 dual socket nodes (Intel XeonSP Skylake) with 32 cores each:
  - 4 nodes with 96 GB memory
  - 19 nodes with 192 GB memory
  - 2 nodes with 768 GB memory
- 1 dual socket node (Intel XeonSP Skylake) with 36 cores, 768 GB memory
- 7 dual socket nodes (Intel XeonSP Cascadelake) with 40 cores, 192 GB memory
- 32 single socket AMD Rome nodes with 64 cores, 256 GB memory -- see details on AMD nodes below
- 2 dual socket AMD Naples nodes, each with 64 cores, 512 GB memory. Note that these are in a special short partition, notchpeak-shared-short; see below for details.
Other information on notchpeak's hardware and cluster configuration:
- Mellanox EDR Infiniband interconnect
- 2 general interactive nodes
The Skylake and Cascadelake processors offer AVX-512 support. See our page on Single Executables for all CHPC Platforms for details on building applications to take advantage of this feature.
Added April 2019: Two new AMD processor (Epyc 7601, Naples) based nodes are now available as compute nodes on notchpeak. Each node has 64 physical cores and 512 GB of memory.
Instead of adding these nodes to the general notchpeak partition, we are using them to explore having a “test or debug” queue, with a shorter maximum wall time. We are doing this as we have had several requests for a test queue, and the arrival of these nodes has given us an opportunity to see if there will be sufficient usage of this queue.
These nodes are available to all users, regardless of whether they have access to a general allocation; use of these nodes does not count against any allocation. To use them, set both the partition and the account to notchpeak-shared-short. Because node sharing is in use, users MUST specify the number of cores and the amount of memory; see https://www.chpc.utah.edu/documentation/software/node-sharing.php for additional details.
To maximize throughput of short jobs and provide access to all users, these nodes have been placed in a separate partition with node sharing enabled. Use of these nodes is limited:
- Maximum wall time is 8 hours
- Maximum jobs in the queue per user is 10
- Maximum running jobs per user is 2
- Maximum cores per user is 16
- Maximum memory per user is 128GB
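As a sketch of how these settings fit together, the header of a job script for the notchpeak-shared-short partition might look like the following. The core count, memory, and wall time are illustrative values chosen to fall within the limits above, and the use of bash is an assumption; adjust all of them to your own job.

```shell
#!/bin/bash
# Both the partition and the account are set to notchpeak-shared-short.
#SBATCH --partition=notchpeak-shared-short
#SBATCH --account=notchpeak-shared-short
#SBATCH --time=1:00:00   # illustrative; must not exceed the 8 hour maximum
#SBATCH --ntasks=4       # illustrative; at most 16 cores per user
#SBATCH --mem=16G        # illustrative; at most 128GB per user

# SLURM_NTASKS is set by Slurm inside a job; fall back to 1 when run outside Slurm.
ntasks=${SLURM_NTASKS:-1}
echo "Running with $ntasks task(s)"
```

Because node sharing is enabled on this partition, the core count and memory request must both be present.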
Added January 2020: 32 new AMD Rome (7702P processors) based nodes are now available as compute nodes as part of the general notchpeak partition (partition set to notchpeak or notchpeak-shared). Each node has 64 physical cores and 256 GB of memory. There is an evaluation of the new AMD Rome processors; this includes benchmarking of the performance relative to a number of generations of the Intel processors.
Some notes on usage:
- As we now have a mix of Intel and AMD processors on notchpeak, we have added additional
terms (rom for AMD Rome, skl for Intel SkyLake, csl for Intel CascadeLake)
to the feature list for the notchpeak compute nodes. For completeness we have also
added the feature npl for the AMD Naples processors. These terms, along with the SLURM constraint
flag, can be used to target either the AMD processor nodes (
#SBATCH -C rom) or the Intel based nodes (
#SBATCH -C "skl|csl") for a given job. If you do not use this constraint flag, the job will be eligible to run on any of the notchpeak general partition nodes, based on your other SBATCH options. Note that one easy way to get the complete list of features is to use the command
scontrol show node notchXXX, substituting in the name of the node of interest.
- From the testing we have completed, our existing application builds should run on the AMD based nodes.
- The AMD Rome based nodes DO NOT have AVX-512 support (which is available on all the Intel based nodes on notchpeak).
- We have found that for codes that make use of the MKL libraries, increased performance
is often obtained if the MKL_DEBUG_CPU_TYPE environment variable is set to 5. You
should test your code to see if this improves performance. This variable can be set
in your batch script by:
  - Tcsh: setenv MKL_DEBUG_CPU_TYPE 5
  - Bash: export MKL_DEBUG_CPU_TYPE=5
- As these nodes each have 64 cores, the use of node sharing for jobs that will not efficiently use all cores becomes even more important. As with all of the other nodes on the clusters, each node is part of two partitions. For all general partition nodes of notchpeak the two partitions are notchpeak and notchpeak-shared.
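The usage notes above can be combined in a single job script. The following is a minimal sketch in bash, not a recommended configuration: the account name, task count, memory, and wall time are placeholders, and show_features is a hypothetical helper that filters the output of scontrol show node.

```shell
#!/bin/bash
#SBATCH --partition=notchpeak-shared  # shared partition: request only the cores the job needs
#SBATCH --account=baggins             # placeholder account name; use your own
#SBATCH --constraint=rom              # long form of -C; use "skl|csl" for the Intel nodes
#SBATCH --ntasks=8                    # placeholder core count
#SBATCH --mem=32G                     # placeholder memory request
#SBATCH --time=2:00:00                # placeholder wall time

# May improve MKL performance on the AMD nodes; test this with your own code.
export MKL_DEBUG_CPU_TYPE=5

# Hypothetical helper: print the feature list of one node.
# Pass a real node name in place of notchXXX.
show_features() {
    scontrol show node "$1" | grep -o 'AvailableFeatures=[^ ]*'
}
# Usage: show_features notchXXX
```

Leaving out the constraint line makes the job eligible for any general partition node, Intel or AMD.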
CHPC resources are available to qualified faculty, students (under faculty supervision), and researchers from any Utah institution of higher education. Users can request accounts for CHPC computer systems by filling out the account request form.
Users requiring priority on their jobs may apply for an allocation of wall clock hours per quarter. Users will need to send a brief proposal using the allocation form.
The notchpeak cluster can be accessed via ssh (secure shell) at the following address:
All CHPC machines mount the same user home directories. This means that the user files on notchpeak will be exactly the same as the ones on other CHPC clusters. The advantage is obvious: users do not need to copy files between machines.
Notchpeak compute nodes mount the following scratch file systems:
As a reminder, the non-restricted scratch file systems are automatically scrubbed of files that have not been accessed for 60 days.
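Since scrubbing is based on access time, it can be useful to check which files are approaching the 60-day limit. The following is a minimal sketch using the standard find command; the function name list_stale_files, the 50-day default, and the directory argument are illustrative assumptions, not CHPC tooling.

```shell
#!/bin/bash
# List regular files under a directory that have not been accessed recently.
#   $1: directory to scan (for example, your own scratch directory)
#   $2: age threshold in days (defaults to 50, an arbitrary early-warning value)
# find ignores fractional days, so -atime +50 matches files last accessed
# at least 51 days ago.
list_stale_files() {
    local dir="$1"
    local days="${2:-50}"
    find "$dir" -type f -atime +"$days" -print
}
# Usage: list_stale_files "$SCRDIR" 50
```

Files reported by this check should be copied elsewhere if they are still needed.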
Your environment is set up through the use of modules. Please see the User Environment section of the General Cluster Information page for details on setting up your environment for batch and other applications.
The batch implementation on notchpeak is Slurm.
The creation of a batch script on the Notchpeak cluster
A shell script is a bundle of shell commands which are fed one after another to a shell
(bash, tcsh, ...). As soon as the first command has successfully finished, the second command is
executed. This process continues until either an error occurs or the complete array
of individual shell commands has been executed. A batch script is a shell script which
defines the tasks a particular job has to execute on a cluster.
An example batch script for running under Slurm on the notchpeak cluster is shown below. The lines at the top of the file all begin with #SBATCH; the shell interprets them as comments, but they give options to Slurm.
Example Slurm Script for notchpeak:
#!/bin/tcsh
#SBATCH --time=1:00:00 # walltime, abbreviated by -t
#SBATCH --nodes=2 # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH --ntasks=64 # number of MPI tasks, abbreviated by -n
# additional information for allocated clusters
#SBATCH --account=baggins # account - abbreviated by -A
#SBATCH --partition=notchpeak # partition, abbreviated by -p
#
# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/notchpeak/serial/UNID/myscratch
mkdir -p $SCRDIR
cp -r $WORKDIR/* $SCRDIR
# load appropriate modules, in this case Intel compilers, MPICH2
module load intel mpich2
# for MPICH2 over Ethernet, set communication method to TCP - for general lonepeak nodes
# see above for network interface selection options for other MPI distributions
setenv MPICH_NEMESIS_NETMOD tcp
# run the program
# see above for other MPI distributions
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out
For more details and example scripts please see our Slurm documentation. Also, to help with specifying your job and instructions in your Slurm script, please review CHPC Policy 2.1.6 notchpeak Job Scheduling Policy.
Job Submission on Notchpeak
To submit a job on notchpeak, you must first log in to a notchpeak interactive node. Note that this is a change from the way job submission has worked in the past on our other clusters, where you could submit from any interactive node to any cluster.
To submit a script named slurmjob.script, just type:
To check the status of your job, use the "squeue" command
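A minimal sketch of the two steps above, wrapped in a shell function. It assumes the standard Slurm commands sbatch and squeue are available, as they would be on an interactive node; the function name submit_and_check is hypothetical. The --parsable option makes sbatch print only the job ID (followed by a semicolon and the cluster name, if one is reported).

```shell
#!/bin/bash
# Submit a batch script, then show the queue entry for the resulting job.
#   $1: path to the batch script, for example slurmjob.script
submit_and_check() {
    local script="$1"
    local jobid

    # sbatch is only expected to exist on cluster nodes.
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; log in to an interactive node first" >&2
        return 1
    fi

    # Capture the job ID; stop here if the submission was rejected.
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}   # drop the ";cluster" suffix if present

    echo "Submitted job $jobid"
    squeue -j "$jobid"
}
# Usage: submit_and_check slurmjob.script
```

To list all of your own jobs rather than a single one, squeue -u "$USER" can be used instead.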
For information on compiling on the clusters at CHPC, please see our Programming Guide.