Apt Cluster Hardware (General) Overview
- 128 Dell PowerEdge R320 nodes, each with a single Intel Xeon E5-2450 processor (8 cores, 2.1 GHz), 16 GB of memory, and four 500 GB hard drives
- 64 Dell PowerEdge C6220 nodes, each with dual Intel Xeon E5-2650 v2 processors (8 cores each, 2.6 GHz) for a total of 16 cores, 64 GB of memory, and two 1 TB hard drives
NOTE: Currently, Tangent uses only the 64 C6220 nodes.
NFS Home Directory
Your home directory, an NFS-mounted file system, is one choice for I/O. Of the available options, this space typically has the slowest I/O performance. It is visible to all nodes on the clusters through an auto-mounting system.
NFS Scratch
Tangent has access to another NFS file system, /scratch/kingspeak/serial, which has 175 TB of disk capacity. It is attached to the InfiniBand network to obtain a larger potential network bandwidth. This space is visible (with read and write access) on all Tangent interactive and compute nodes. However, it is still a shared resource and may therefore perform more slowly under significant user load. Users should test their applications' performance to see whether they experience any unexpected performance issues within this space. This file system has a scrub policy: files older than 60 days are deleted.
Parallel Scratch
Tangent also has access to a second scratch file system, /scratch/general/lustre, which has a capacity of 700 TB. This space is visible (with read and write access) on all Tangent interactive and compute nodes. However, it is still a shared resource and may therefore perform more slowly under significant user load. Users should test their applications' performance to see whether they experience any unexpected performance issues within this space. This file system has a scrub policy: files older than 60 days are deleted.
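Because both scratch file systems delete old files, it can help to check which files are approaching the limit. The following is a minimal sketch, not a CHPC tool: the function name list_old_files is made up for illustration, and it assumes that file age is judged by modification time, which the policy description above does not state.

```shell
# List regular files under directory $1 that were last modified more
# than $2 days ago (default 60). find counts age in whole 24-hour
# periods, so "+60" selects files that are at least 61 days old.
# Assumption: age is measured by modification time.
list_old_files() {
    find "$1" -type f -mtime +"${2:-60}" -print
}
```

For example, `list_old_files /scratch/general/lustre/$USER` would list the candidates in a per-user directory, assuming such a directory exists for your account.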
Local Disk
The local scratch space is storage unique to each individual node. Because the hardware is deprovisioned after every job, no data left in this space can be retrieved after the experiment has completed.
On Tangent, the set of nodes available to run jobs varies over time, depending on the other users of the Apt hardware. To see the available nodes, run the sinfo command. Below is typical output of this command:
$ sinfo -l
Mon Nov 17 11:17:01 2014
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT SHARE GROUPS NODES STATE NODELIST
tangent* up 3-00:00:00 1-64 no NO all 2 draining* tp[048-049]
tangent* up 3-00:00:00 1-64 no NO all 3 allocated# tp[010,024-025]
tangent* up 3-00:00:00 1-64 no NO all 21 drained tp[008,012,018-020,023,027-030,033-037,045-047,050,057,059]
tangent* up 3-00:00:00 1-64 no NO all 33 idle~ tp[001-007,011,013-017,026,038-044,051-056,058,060-064]
tangent* up 3-00:00:00 1-64 no NO all 5 allocated tp[009,021-022,031-032]
Possible states reported for Tangent nodes are allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk, respectively. Note that the suffix "*" identifies nodes that are presently not responding, the suffix "#" indicates that the node is being powered up and provisioned, and the suffix "~" indicates that the node is in the powered-down state. See the end of the "man sinfo" page for the meaning of each state.
In this particular example, 33 nodes are idle, i.e. powered down and thus available for a new job. The drained nodes are usually those used by other Apt experiments outside of Tangent. The allocated nodes, and possibly also some of the drained ones, are generally those used by Tangent jobs.
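The node counts discussed above can also be totalled automatically. The sketch below is illustrative only: the function name count_nodes_in_state is hypothetical, and it assumes the column layout of the sinfo -l sample shown above (NODES in the eighth column, STATE in the ninth), which may differ between Slurm versions.

```shell
# Read `sinfo -l` output on stdin and sum the NODES column (field 8)
# for rows whose STATE (field 9) begins with the prefix given as $1.
# Prefix matching means "idle" also counts "idle~" and "allocated"
# also counts "allocated#". Header lines never match a state prefix.
count_nodes_in_state() {
    awk -v state="$1" 'index($9, state) == 1 { total += $8 }
                       END { print total + 0 }'
}
```

On the cluster it would be used as `sinfo -l | count_nodes_in_state idle`.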
The squeue command lists the active job queue on Tangent (the subset of Apt currently used for HPC jobs). For the sinfo example above, the corresponding squeue output is:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
373 tangent 74836 u0101881 PD 0:00 52 (Resources)
396 tangent g09.slur u0028729 R 1:43 2 tp[024-025]
397 tangent tcsh u0101881 CF 0:02 2 tp[046-047]
395 tangent amber14. u0028729 CG 0:00 2 tp[048-049]
394 tangent nwchem.s u0028729 R 8:03 2 tp[009-010]
393 tangent g09.slur u0028729 R 8:13 2 tp[031-032]
392 tangent tcsh u0101881 R 14:55 2 tp[021-022]
The ST column describes the state of the job. See man squeue for details on each state. In our case, R stands for a running job, CF for configuring (provisioning and startup of the job), PD for pending due to insufficient available resources, and CG for completing (finalizing the job).
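To get a quick summary of the queue, the ST column can be tallied. This is a rough sketch under stated assumptions: count_jobs_by_state is a made-up helper name, and it relies on the default column layout shown above with ST as the fifth field, so a job name containing spaces would throw it off.

```shell
# Read squeue output on stdin and print one "STATE COUNT" line per job
# state code found in the ST column (field 5), sorted by state code.
# The header row (first field "JOBID") is skipped.
count_jobs_by_state() {
    awk '$1 != "JOBID" && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}
```

On the cluster it would be used as `squeue | count_jobs_by_state`.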
NOTE – we will add to this section as we get questions from users
- My job is taking a long time to start.
Because the nodes have to be requested from the Apt framework, booted, and configured at each job start, job startup takes longer than users may be used to from other clusters. It is not unusual for larger jobs (more than 12 nodes) to take 15-30 minutes to start. Occasionally there may also be a problem with the nodes that Apt makes available to the job, in which case the whole job needs to restart, which further delays the job start. We recommend monitoring the job startup with the squeue command, along with monitoring the output of the calculation in whatever directory it is being run. If the job is in the running state (R) but no output has been written for a while, it is possible that one of the nodes is in a bad state and is blocking the whole job. If this happens, you'll have to delete the job (scancel) and start it again.
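The check for stalled output described above can be sketched as a small helper. The name recently_written and the 30-minute threshold in the usage note are illustrative choices, not CHPC recommendations, and the -mmin test assumes a find implementation that supports it (GNU and BSD find do).

```shell
# Succeed (exit status 0) if $1 is a regular file modified less than
# $2 minutes ago, or a directory containing such a file. Fail if
# nothing qualifies or if $1 does not exist.
recently_written() {
    [ -n "$(find "$1" -type f -mmin -"$2" 2>/dev/null)" ]
}
```

If squeue shows the job in state R but `recently_written my_job_output 30` fails, the job may be blocked and is a candidate for scancel and resubmission. Here my_job_output stands for whatever output file your job writes.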
To set up an HPC job on the Apt hardware, users log in through the Tangent cluster interactive node.
All CHPC machines mount the same user home directories. This means that the user files on Tangent are exactly the same as those on other CHPC clusters. The advantage is obvious: users do not need to copy files between machines. However, users must make sure that they run the correct executables. CHPC-maintained applications with executables suitable for use on all clusters are kept in /uufs/chpc.utah.edu/sys/pkg, whereas cluster-specific executables for Tangent (MPI-based applications built using the MPI optimized for the specific cluster's InfiniBand) are found in /uufs/tangent.peaks/sys/pkg.
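One way to confirm which of the two trees an executable comes from is to classify its full path. The helper below is a hypothetical sketch (pkg_tree_of is not a CHPC command); it only compares the path against the two directory prefixes named above.

```shell
# Print "tangent" if path $1 lies in the Tangent-specific package tree,
# "generic" if it lies in the tree shared by all clusters, and "other"
# for any other location.
pkg_tree_of() {
    case "$1" in
        /uufs/tangent.peaks/sys/pkg/*) echo "tangent" ;;
        /uufs/chpc.utah.edu/sys/pkg/*) echo "generic" ;;
        *) echo "other" ;;
    esac
}
```

For example, `pkg_tree_of "$(command -v mpirun)"` reports which tree the mpirun found first in your PATH belongs to.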
The batch implementation on this system is Slurm (Simple Linux Utility for Resource Management).
Information about Slurm is available at http://slurm.schedmd.com/documentation.html. To assist users in making the transition to Slurm, CHPC is developing a Slurm wiki page (work in progress), which will include common Slurm commands and variables, a sample Slurm batch script, and a translation guide for common commands and environment variables of the two batch systems.
Tangent jobs have a hard walltime limit of 72 hours.
Runs in the batch system generally pass through the following steps:
The creation of a batch script on the Tangent cluster
To submit a job on Tangent, first log in to the Tangent interactive node. The job is then submitted with the Slurm sbatch command. For example, to submit a script named pbsjob, just type:
sbatch pbsjob
Checking the status of your job
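One way to follow a specific job is to keep the job ID that sbatch reports at submission and pass it to squeue. The sketch below assumes sbatch prints its usual confirmation line of the form "Submitted batch job" followed by the ID; the helper name job_id_from_sbatch is made up for illustration.

```shell
# Read sbatch's output on stdin and print the job ID from the first
# line that starts with "Submitted batch job ". Prints nothing if no
# such line is present.
job_id_from_sbatch() {
    awk '/^Submitted batch job / { print $4; exit }'
}
```

A possible use is `jobid=$(sbatch pbsjob | job_id_from_sbatch)` followed by `squeue -j "$jobid"` to show only that job.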
Slurm provides the typical job options, such as the requested number of nodes or processors and the walltime. Below is a sample job script:
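#!/bin/bash
#SBATCH -N 4
#SBATCH -n 8
#SBATCH -J my_job
#SBATCH -o my_job_output

# DATADIR, SCRDIR and EXE are placeholders; set them for your own job
# copy the input data to the scratch directory
cp -r $DATADIR/* $SCRDIR
# a way to get the list of hosts allocated to the job
srun hostname -s | sort -u > nodefile.$SLURM_JOBID
# run 8 MPI processes, each with 8 OpenMP threads
mpirun -genv OMP_NUM_THREADS 8 -np 8 $EXE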
The #SBATCH lines denote the Slurm options. Note that nodes and processes are requested differently than in Moab: we request the number of nodes with the -N flag and the total number of tasks with the -n flag. The system will allocate -N nodes with n/N tasks per node (the -n value divided by the -N value). In this example, we are requesting 4 nodes and 8 tasks, so we'll be running 2 MPI processes per node. To fully utilize the 16 cores per node, we'll be running 8 OpenMP threads per process.
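The arithmetic behind this layout can be written out as a short helper. This is only a sketch: job_layout is a hypothetical name, and it assumes that tasks divide evenly across nodes, that there is at least one task per node, and that cores divide evenly among the tasks on a node.

```shell
# Print "<tasks per node> <threads per task>" for a job that spreads
# $2 tasks evenly over $1 nodes, each node having $3 cores.
# Integer division is used, so the inputs should divide evenly and
# there must be at least one task per node.
job_layout() {
    nodes=$1
    tasks=$2
    cores=$3
    tasks_per_node=$((tasks / nodes))
    threads_per_task=$((cores / tasks_per_node))
    echo "$tasks_per_node $threads_per_task"
}
```

For the example above, `job_layout 4 8 16` prints `2 8`: two MPI processes per node, each with 8 OpenMP threads.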
Also note that we are using the MPICH2 distribution from the generic CHPC program branch; version 3.1.2 has been built with InfiniBand network support that is appropriate for the Tangent cluster. In case you need to run over the Ethernet network, add MPICH_NEMESIS_NETMOD=tcp to the mpirun flags.
Finally, MPICH2 supports Slurm process-to-task mapping, which allows us to run without the -machinefile flag that tells mpirun to run on the hosts listed in a host file. If the host file is needed, the srun hostname line in the script above produces it.
Interactive jobs are best started with the srun command. For example, to get 2 nodes with 4 tasks:
srun --pty -t 1:00:00 --partition=tangent --account=youraccount -n 4 -N 2 /bin/tcsh
The important flags are --pty, which requests an interactive terminal, and the final argument, /bin/tcsh, which is the shell to run (append -l to start it as a login shell). If you prefer bash, replace /bin/tcsh with /bin/bash.
Slurm User Commands
Please see the slurm documentation wiki page for more information.
For information on application development and compiling on the clusters at CHPC, please see our Programming Guide.