We have added 32 AMD Rome based nodes to the general partition of the notchpeak cluster.
Each node has a single 64-core processor (7702P, https://www.amd.com/en/products/cpu/amd-epyc-7702p, 2.0 GHz) and 256 GB of memory. We have posted an evaluation of the new AMD Rome processors, which includes benchmarks of their performance relative to several generations of Intel processors. This takes the number of nodes in the general partition of notchpeak from 33 to 65, with the total core count going from 1,116 to 3,164 (32 nodes × 64 cores = 2,048 additional cores). As was mentioned in the last CHPC newsletter, these nodes were purchased as an ember replacement, and will provide savings of both space and power consumption in the Downtown Data Center. These new nodes replace the core count of all of ember, both the general and owner nodes, while providing much more computational capacity per core (see the evaluation for details).
A few details:
- As we now have a mix of Intel and AMD processors on notchpeak, we have added additional terms to the feature list for the notchpeak compute nodes: rom for the AMD Rome generation processors, skl for Intel SkyLake, csl for Intel CascadeLake, and npl for the AMD Naples generation processors (note that the AMD Naples processors are only on the two nodes of the notchpeak-shared-short partition). These terms, along with the SLURM constraint flag, can be used to target either the AMD processor nodes (#SBATCH -C rom) or the Intel based nodes (#SBATCH -C "skl|csl") for a given job. If you do not use this constraint flag, the job will be eligible to run on any of the notchpeak general partition nodes, based on your other SBATCH options.
- From the testing we have completed, our existing application builds should run on the AMD based nodes.
- The AMD Rome nodes DO NOT have AVX512 support (which is available on all the Intel processor based nodes on notchpeak).
- We have found that for codes that make use of the MKL libraries, increased performance
is often obtained if the MKL_DEBUG_CPU_TYPE environment variable is set to 5. You
should test your code to see if this improves performance. This variable can be set
in your batch script by:
  - Tcsh: setenv MKL_DEBUG_CPU_TYPE 5
  - Bash: export MKL_DEBUG_CPU_TYPE=5
- As these nodes each have 64 cores, the use of node sharing for jobs that will not efficiently use all cores becomes even more important to ensure the efficient use of the CHPC general resources. As with all of the other nodes on the clusters, each node is part of two partitions. For the general partition nodes of notchpeak, the two partitions are notchpeak, to be used when the job needs the entire node, and notchpeak-shared, for jobs that do not need the entire node in terms of both cores and memory.
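As an illustration of how the details above fit together, here is a minimal sketch of a bash batch script that requests part of an AMD Rome node through the notchpeak-shared partition. The task count, memory, and walltime are illustrative assumptions, not recommendations, and my_program is a hypothetical executable name; adjust all of them (and add your account, if needed) for your own work.

```shell
#!/bin/bash
#SBATCH --partition=notchpeak-shared   # shared partition: job does not need the entire node
#SBATCH --constraint=rom               # long form of -C; use "skl|csl" to target the Intel nodes
#SBATCH --ntasks=8                     # illustrative: 8 of the 64 cores on a Rome node
#SBATCH --mem=32G                      # illustrative: a portion of the 256 GB of memory
#SBATCH --time=01:00:00                # illustrative walltime

# Often improves MKL performance on the AMD Rome nodes; test with your own code.
export MKL_DEBUG_CPU_TYPE=5

# Report whether this node exposes AVX512 (the Rome nodes do not have it).
if grep -q avx512 /proc/cpuinfo 2>/dev/null; then
    echo "AVX512: available"
else
    echo "AVX512: not available"
fi

# Launch your application here, for example:
# srun ./my_program    # hypothetical executable name
```

For a job that needs the entire node, the partition line would instead name notchpeak, and the core and memory requests would cover the whole node.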
After users have access to the nodes for a couple of weeks, we will be retiring the ember cluster and also moving the kingspeak cluster off allocation, provided that there are no major issues reported. We will send another announcement when these changes are made.
Please take the time to test your workload on these new nodes. If you have any issues running on the new nodes, or have any other questions or concerns, please let us know via our ticketing system (firstname.lastname@example.org).