Tangent User Guide

Background on the Apt Project

In Fall 2013, NSF funded a collaboration led by Rob Ricci of the School of Computing’s Flux group, in conjunction with the Center for High Performance Computing, to develop an “adaptable profile-driven testbed”. This testbed will allow computer research teams to use the same resources for different missions, e.g., network experiments, high performance computing experiments, and security experiments. The Flux group’s emphasis is on creating a low-barrier way to build highly reproducible experiments. CHPC’s emphasis is on creating an environment to roll out HPC images on demand, to scale the images dynamically, and to support multiple images with different HPC and security contexts.

The Apt project is a three-year project which consists of a hardware foundation, the Apt cluster, and a testbed control system built upon systems developed previously by the Flux group for the Emulab and GENI projects. The testbed control system will allow researchers either to use established images or to create new ones for their experiments. Researchers can use one or more of these images simultaneously, along with their respective network characterizations, to create a “profile”, which can be saved and used to repeat experiments or to share with other researchers.

The Flux Research Group started building Emulab in 1999 as a testbed for their own research in operating systems and distributed systems, and subsequently made this tool available to others. Emulab has grown to about 5,000 users worldwide, and about 50 other sites have built testbeds using the open-source software. As part of NSF's GENI project, the Flux group focused on expanding the scope of Emulab to federate with other types of testbeds. Apt represents a different widening of the scope: expanding the environment to HPC as well as to other areas of computer science (CS) through Apt's "on demand" profiles.

Software is being developed to allow this hardware to be dynamically provisioned to meet the needs of the researchers.   The user defines a profile which includes all the information needed to run an experiment, including the description of the resources, both hardware and software, that will be used in the experiment, providing a mechanism to enable repeatable research. 

The hardware specification of the profile includes information on the properties of the nodes, the storage, and the network. The software environment of the profile consists of the operating system, and can include additional software packages, data files, etc., needed for the experiment. The Apt project includes a number of standard profiles; others will be defined and shared by users of the resource.

Researchers use either an application programming interface (API) or a web interface to configure the profile, then create an image of this profile, and define the experiment. The experiment belongs to the experimenter for the specified duration.  When the experiment is complete, the Apt software de-provisions the hardware, and makes it available for future requests.

For more information on the project see http://www.flux.utah.edu/project/apt.  

CHPC is establishing a traditional HPC profile as a cluster called Tangent, to launch jobs on the C6220 nodes, which are the same hardware as the 16 core nodes of CHPC’s Kingspeak cluster. This profile will have CHPC applications available and will mount current CHPC file systems. From the user perspective, access to this resource is obtained via a login to an interactive node for the Tangent cluster. The Tangent interactive nodes are local to CHPC and allow users to submit batch jobs that will spin up dynamic HPC images on the Apt hardware.

 Apt Cluster Hardware (General) Overview

  • 128 Dell PowerEdge R320 nodes, each with a single Intel Xeon E5-2450 processor (8 cores, 2.1 GHz), 16 GB memory, and four 500 GB hard drives
  • 64 Dell PowerEdge C6220 nodes, each with dual Intel Xeon E5-2650 v2 processors (8 cores each, 2.6 GHz) for a total of 16 cores, 64 GB memory, and two 1 TB hard drives

NOTE: Currently, Tangent is only using the 64 C6220 nodes.

NFS Home Directory

Your home directory, which is an NFS mounted file system, is one choice for I/O. Of the available options, this space typically has the slowest I/O performance. It is visible to all nodes on the clusters through an auto-mounting system.

NFS Scratch (/scratch/kingspeak/serial)

Tangent has access to another NFS file system, /scratch/kingspeak/serial, with a disk capacity of 175 TB. It is attached to the InfiniBand network to obtain a larger potential network bandwidth. This space is visible (with read and write access) on all Tangent interactive and compute nodes. However, it is still a shared resource and may therefore perform more slowly under significant user load. Users should test their application's performance to see whether they experience any unexpected performance issues within this space. Under this file system's scrub policy, files older than 60 days are deleted.

Parallel Scratch  (/scratch/general/lustre)

Tangent also has access to a second scratch file system, /scratch/general/lustre, with a capacity of 700 TB. This space is visible (with read and write access) on all Tangent interactive and compute nodes. However, it is still a shared resource and may therefore perform more slowly under significant user load. Users should test their application's performance to see whether they experience any unexpected performance issues within this space. Under this file system's scrub policy, files older than 60 days are deleted.
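Because both scratch file systems delete files older than 60 days, it can help to list the files that are approaching that age. The following is a minimal sketch, not a CHPC-provided tool: the function name files_older_than is hypothetical, and it assumes that file age is judged by modification time, which the scrub policy above does not specify.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory whose last
# modification is more than a given number of days in the past.
# Assumption: the scrub policy is based on modification time.
#   $1 = directory to search
#   $2 = age threshold in days
files_older_than() {
    find "$1" -type f -mtime +"$2" -print
}

# Example usage (the per-user directory layout follows the sample job script):
#   files_older_than /scratch/kingspeak/serial/u0123456 50
```

Choosing a threshold somewhat below 60 days, as in the usage comment, leaves time to copy the listed files elsewhere.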

 

Local Disk (/scratch/local)

The local scratch space is a storage space unique to each individual node. As the hardware is deprovisioned after every job, data left in this space cannot be retrieved after the experiment has completed.

Important Differences from Other CHPC Clusters 

On Tangent, the set of nodes available to run jobs varies from moment to moment, depending on the other users of the Apt hardware. To see the available nodes, run the sinfo command. Below is typical output of this command:

$ sinfo -l
Mon Nov 17 11:17:01 2014
PARTITION  AVAIL  TIMELIMIT   JOB_SIZE  ROOT  SHARE  GROUPS  NODES  STATE       NODELIST
tangent*   up     3-00:00:00  1-64      no    NO     all         2  draining*   tp[048-049]
tangent*   up     3-00:00:00  1-64      no    NO     all         3  allocated#  tp[010,024-025]
tangent*   up     3-00:00:00  1-64      no    NO     all        21  drained     tp[008,012,018-020,023,027-030,033-037,045-047,050,057,059]
tangent*   up     3-00:00:00  1-64      no    NO     all        33  idle~       tp[001-007,011,013-017,026,038-044,051-056,058,060-064]
tangent*   up     3-00:00:00  1-64      no    NO     all         5  allocated   tp[009,021-022,031-032]

The possible node states are allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown. Their abbreviated forms are alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk, respectively. Note that the suffix "*" identifies nodes that are presently not responding, the suffix "#" indicates that the node is being powered up and provisioned, and the suffix "~" indicates that the node is in the powered-down state. See the end of the sinfo man page ("man sinfo") for the meaning of each state.

In this particular example, we see that 33 nodes are idle, i.e. powered down, and thus available for a new job. Drained nodes are usually those used by other Apt experiments outside of Tangent. Allocated nodes, and possibly also some drained ones, are generally in use by Tangent jobs.
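The per-state totals discussed above can be tallied with a short script. This is a minimal sketch under stated assumptions: the function name nodes_by_state is hypothetical, and it expects input lines of the form "count state", as produced by sinfo with the -h (no header) and -o "%D %T" (node count and state) options.

```shell
#!/bin/sh
# Hypothetical helper: sum node counts per state.
# Reads lines of the form "<count> <state>" on stdin, for example from:
#   sinfo -h -p tangent -o "%D %T"
# The suffixes *, #, and ~ are stripped, so "idle~" is counted as "idle"
# and "allocated#" is counted together with "allocated".
nodes_by_state() {
    awk '{
        state = $2
        sub(/[*#~]+$/, "", state)
        count[state] += $1
    }
    END {
        for (s in count) print s, count[s]
    }' | sort
}

# Example usage on a cluster with Slurm installed:
#   sinfo -h -p tangent -o "%D %T" | nodes_by_state
```

Stripping the suffixes merges the powered-up and powered-down variants of a state; drop the sub() call if they should be reported separately.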

The squeue command lists the jobs currently in the queue on Tangent (the subset of Apt currently used for HPC jobs). For the sinfo example above, the corresponding squeue output is:

$ squeue
JOBID  PARTITION  NAME      USER      ST  TIME   NODES  NODELIST(REASON)
373    tangent    74836     u0101881  PD  0:00   52     (Resources)
396    tangent    g09.slur  u0028729  R   1:43   2      tp[024-025]
397    tangent    tcsh      u0101881  CF  0:02   2      tp[046-047]
395    tangent    amber14.  u0028729  CG  0:00   2      tp[048-049]
394    tangent    nwchem.s  u0028729  R   8:03   2      tp[009-010]
393    tangent    g09.slur  u0028729  R   8:13   2      tp[031-032]
392    tangent    tcsh      u0101881  R   14:55  2      tp[021-022]

The ST column gives the state of each job. See man squeue for details on each state. In this example, R stands for a running job, CF for a job that is configuring (provisioning, starting up), PD for a job that is pending due to insufficient available resources, and CG for a job that is completing (finalizing).
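As an illustration of the state codes above, the sketch below turns the compact codes into the descriptions used in this guide. The function names are hypothetical, only the four codes discussed here are covered, and the input is assumed to be "jobid state" pairs as printed by squeue with the -h and -o "%i %t" options.

```shell
#!/bin/sh
# Hypothetical helper: map a compact squeue state code to a description.
# Only the codes discussed in this guide are handled explicitly.
describe_job_state() {
    case "$1" in
        PD) echo "pending" ;;
        CF) echo "configuring" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        *)  echo "other ($1)" ;;
    esac
}

# Reads "<jobid> <state>" lines on stdin and prints "<jobid> <description>".
annotate_jobs() {
    while read -r jobid state; do
        echo "$jobid $(describe_job_state "$state")"
    done
}

# Example usage on a cluster with Slurm installed:
#   squeue -h -u "$USER" -o "%i %t" | annotate_jobs
```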

 

FAQ Section – NEW!

NOTE – we will add to this section as we get questions from users

  1. My job is taking a long time to start.

Because the nodes have to be requested from the Apt framework, booted, and configured at each job start, job startup takes longer than users may be used to from other clusters. It is not unusual for larger jobs (more than 12 nodes) to take 15-30 minutes to start. Occasionally there is also a problem with the nodes that Apt makes available to the job, in which case the whole job needs to restart, which further delays the job start. We recommend monitoring the job startup with the squeue command, along with monitoring the output of the calculation in whatever directory it is being run. If the job is in the running state (R) but no output has been written for a while, it is possible that one of the nodes is in a bad state and is blocking the whole job. If this happens, you will have to delete the job (scancel) and start it again.
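One way to check whether a running job has stopped writing output is to look at the age of its output file. The following is a minimal sketch: the function name output_is_stale and the 30 minute threshold are assumptions chosen for illustration, and it relies on the -mmin test of find, which is available in GNU and BSD find but is not part of POSIX.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given file exists and
# was last modified more than the given number of minutes ago.
# A missing file is reported as "not stale".
#   $1 = output file of the job
#   $2 = age threshold in minutes
output_is_stale() {
    [ -n "$(find "$1" -maxdepth 0 -mmin +"$2" 2>/dev/null)" ]
}

# Example usage, with the output file name from the sample job script:
#   if output_is_stale my_job_output 30; then
#       echo "No output for 30 minutes; consider scancel and resubmitting."
#   fi
```

The sketch only reports the condition; whether to cancel the job is left to the user, since some applications legitimately write output infrequently.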

Tangent Access

In order to set up an HPC job on the Apt hardware, users log in to the Tangent cluster interactive node:

  • tangent.chpc.utah.edu

All CHPC machines mount the same user home directories. This means that the user files on Tangent will be exactly the same as the ones on other CHPC clusters. The advantage is obvious: users do not need to copy files between machines. However, users must make sure that they run the correct executables. CHPC-maintained applications with executables suitable for use on all clusters are kept in /uufs/chpc.utah.edu/sys/pkg, whereas cluster-specific executables for Tangent (MPI-based applications built using the MPI optimized for the specific cluster's InfiniBand) are found in /uufs/tangent.peaks/sys/pkg.

Using the Batch System

The batch implementation on this system is Slurm  – Simple Linux Utility for Resource Management

Information about Slurm is available at http://slurm.schedmd.com/documentation.html. To assist users in making the transition to Slurm, CHPC is developing a Slurm wiki page (work in progress), which will include common Slurm commands and variables, a sample Slurm batch script, and a translation guide for common commands and environment variables of the two batch systems.

Tangent jobs have a hard walltime limit of 72 hours.
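A requested walltime can be checked against this limit before submission. The sketch below is an illustration only: the function names are hypothetical, and it handles just two of the time formats Slurm accepts, HH:MM:SS and D-HH:MM:SS.

```shell
#!/bin/sh
# Hypothetical helper: convert a walltime string to seconds.
# Supported formats (a subset of what Slurm accepts):
#   HH:MM:SS      e.g. 36:00:00
#   D-HH:MM:SS    e.g. 3-00:00:00
# Prints nothing for any other format.
walltime_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 3 { print ($1 * 3600) + ($2 * 60) + $3 }
        NF == 4 { print ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4 }'
}

# Succeeds if the walltime is in a supported format and at most 72 hours.
within_limit() {
    secs=$(walltime_to_seconds "$1")
    [ -n "$secs" ] && [ "$secs" -le $((72 * 3600)) ]
}

# Example usage:
#   within_limit 36:00:00 && echo "walltime is within the 72 hour limit"
```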

Runs in the batch system generally pass through the following steps:

  1. The creation of a batch script
  2. The job's submission to the batch system
  3. Checking the job's status

The creation of a batch script on the Tangent cluster

A sample batch script is given in the Slurm Batch Script Options section below.

Job Submission on Tangent

To submit a job on Tangent, first log in to the Tangent interactive node. Job submission is then done with the Slurm sbatch command.

For example, to submit a script named script.slurm, just type:

sbatch script.slurm
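
On success, sbatch prints a confirmation line of the form "Submitted batch job" followed by the job ID. When scripting submissions, that ID can be captured for later use with squeue or scancel. The following is a minimal sketch; the function name job_id_from_sbatch is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read sbatch's confirmation line on stdin and print
# only the numeric job ID. Prints nothing if the line does not match.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Example usage on a cluster with Slurm installed:
#   jobid=$(sbatch script.slurm | job_id_from_sbatch)
#   squeue -j "$jobid"
```

Alternatively, sbatch has a --parsable option that prints the job ID without the surrounding text.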


Checking the status of your job

To check the status of your job, use the squeue command.

  • squeue

Slurm Batch Script Options

Typical job options, such as the requested number of nodes and tasks, walltime, etc., are available in Slurm. Below is a sample job script:

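#!/bin/csh
# Request 36 hours of walltime on the tangent partition.
#SBATCH --time=36:00:00
#SBATCH --partition=tangent
#SBATCH --account=youraccount
# 4 nodes and 8 MPI tasks in total, i.e. 2 tasks per node.
#SBATCH -N 4
#SBATCH -n 8
# Job name, output file, and e-mail notifications.
#SBATCH -J my_job
#SBATCH -o my_job_output
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@utah.edu

# Executable, input data, and scratch run directory.
set EXE="/uufs/chpc.utah.edu/common/home/u0123456/my_program"
set DATADIR="/uufs/chpc.utah.edu/common/home/u0123456/my_data"
set SCRDIR="/scratch/kingspeak/serial/u0123456/my_run_data"

# Create the scratch directory if needed, stage the input data, and run from scratch.
mkdir -p $SCRDIR
cp -r $DATADIR/* $SCRDIR
cd $SCRDIR

# a way to get the list of hosts allocated to the job
srun hostname -s | sort -u > nodefile.$SLURM_JOBID

# Set up the MPICH2 environment, then start 8 MPI processes with 8 OpenMP threads each.
source /uufs/chpc.utah.edu/sys/pkg/mpich2/3.1.2/etc/mpich2.csh
mpirun -genv OMP_NUM_THREADS 8 -np 8 $EXE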

 

Lines beginning with #SBATCH pass options to Slurm. Note that requesting nodes and processes differs from Moab: we request the number of nodes with the -N flag and the total number of tasks with the -n flag. The system will allocate -N nodes with -n/-N tasks per node. In this example, we are requesting 4 nodes and 8 tasks, so we will be running 2 MPI processes per node. To fully utilize the 16 cores per node, we run 8 OpenMP threads per process.
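
The arithmetic behind these numbers can be written out as a small script. This is a sketch for illustration only: the function name job_layout is hypothetical, and it assumes that the task count divides evenly among the nodes and that the cores divide evenly among the tasks on a node.

```shell
#!/bin/sh
# Hypothetical helper: compute tasks per node and OpenMP threads per task.
# Assumes the divisions are exact (no remainder).
#   $1 = number of nodes (-N)
#   $2 = total number of tasks (-n)
#   $3 = cores per node
job_layout() {
    nodes=$1
    tasks=$2
    cores_per_node=$3
    tasks_per_node=$((tasks / nodes))
    threads_per_task=$((cores_per_node / tasks_per_node))
    echo "tasks_per_node=$tasks_per_node threads_per_task=$threads_per_task"
}

# The example from the text: 4 nodes, 8 tasks, 16 cores per node.
job_layout 4 8 16
```

For the values in the sample script this prints tasks_per_node=2 threads_per_task=8, matching the -np 8 and OMP_NUM_THREADS 8 settings passed to mpirun.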

Also note that we are using the MPICH2 distribution from the generic CHPC program branch - the version 3.1.2 has been built with InfiniBand network support that is appropriate for the Tangent cluster. In case you need to run using the Ethernet network, add -genv MPICH_NEMESIS_NETMOD=tcp to the mpirun flags.

Finally, MPICH2 supports Slurm process-to-task mapping, which allows us to run without the -machinefile flag that tells mpirun to run on the hosts listed in a host file. If a host file is needed, the srun hostname line in the script above produces it.
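
If the host file should also record how many processes to start on each node, the per-task host names can be counted instead of deduplicated. The sketch below is a variation on the script's approach, not part of the original script: the function name hosts_with_counts is hypothetical, and the host:count line format is the convention we understand MPICH's process manager to accept in a machine file, so check it against your MPI's documentation.

```shell
#!/bin/sh
# Hypothetical helper: read one host name per task on stdin (as printed by
# "srun hostname -s") and print one "host:count" line per distinct host.
hosts_with_counts() {
    sort | uniq -c | awk '{ print $2 ":" $1 }'
}

# Example usage inside a job (requires Slurm):
#   srun hostname -s | hosts_with_counts > nodefile.$SLURM_JOBID
```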

Interactive jobs

Interactive jobs are best started with the srun command. For example, to get 2 nodes with 4 tasks:

srun --pty -t 1:00:00 --partition=tangent --account=youraccount -n 4 -N 2 /bin/tcsh -l

The important parts are the --pty flag, which requests an interactive terminal, and /bin/tcsh -l, which is the shell to run. If you prefer bash, replace /bin/tcsh with /bin/bash.

Slurm User Commands

Please see the Slurm documentation wiki page for more information.

For information on application development and compiling on the clusters at CHPC, please see  our Programming Guide.  

Last Updated: 10/3/16