SLURM Scheduler

SLURM is a scalable open-source scheduler used on a number of world-class clusters. In an effort to align CHPC with XSEDE and other national computing resources, CHPC has switched its clusters from the PBS scheduler to SLURM. There are several short training videos about Slurm and concepts like batch scripts and interactive jobs.

  • Lonepeak was migrated to SLURM on March 10, 2015.
  • Kingspeak, Ember, Ash and Apexarch were migrated on April 2, 2015.

About Slurm

Slurm (Simple Linux Utility for Resource Management) is used for managing job scheduling on clusters. It was originally created at the Livermore Computing Center and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.

Using Slurm

There is a hard limit of 72 hours for jobs on general cluster nodes and 14 days on owner cluster nodes.
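
For example, the respective maximums would be requested as follows (a minimal sketch; HH:MM:SS and D-HH:MM:SS are standard Slurm time formats):

#SBATCH --time=72:00:00    # 72 hours, the general node limit
#SBATCH --time=14-00:00:00 # 14 days (D-HH:MM:SS), the owner node limit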

You may submit jobs to the batch system in two ways: 

  • Submitting a script 
  • Submitting an interactive job

Submitting a script to Slurm:

The creation of a batch script

To create a batch script, use your favorite text editor to create a file that contains both instructions to SLURM and instructions on how to run your job. All instructions to SLURM are prefaced by #SBATCH. It is necessary to specify both the partition and the account in your jobs on all clusters EXCEPT tangent.

#SBATCH --account=<youraccount>

#SBATCH --partition=<yourpartition>

Accounts: Your account is usually your Unix group name, typically your PI's last name. If your group has owner nodes, the account is usually <unix_group>-<cluster_abbreviation> (where the cluster abbreviation is kp, lp, em, or ash). There is also the owner-guest account; all users have access to this account to run on the owner nodes when they are idle. Jobs run as owner-guest are preemptable. Note that on the ash cluster, the owner-guest account is called smithp-guest.

Partitions: Partitions are cluster, cluster-freecycle, pi-cl, and cluster-guest, where cluster is the full name of the cluster and cl is the abbreviated form (kingspeak and kp, ember and em, ash and ash, lonepeak and lp, apexarch and aa).

Examples

In the examples below, we will suppose your PI is Frodo Baggins, who has owner nodes on kingspeak (and not on ember):

  • General user  example on lonepeak (no allocation required)
    #SBATCH --account=baggins
    #SBATCH --partition=lonepeak
  • General user on ember with allocation (Frodo still has allocation available on ember): 
    #SBATCH --account=baggins
    #SBATCH --partition=ember
  • General user on ember without allocation (Frodo has run out of allocation):
    #SBATCH --account=baggins
    #SBATCH --partition=ember-freecycle
  • To run on Frodo's owner nodes on kingspeak
    #SBATCH --account=baggins-kp
    #SBATCH --partition=baggins-kp
  • To run as owner-guest on ember:
    #SBATCH --account=owner-guest
    #SBATCH --partition=ember-guest
  • To run as owner-guest on ash:
    #SBATCH --account=smithp-guest
    #SBATCH --partition=ash-guest
  • To access ember GPU nodes (need to request addition to account)
    #SBATCH --account=ember-gpu
    #SBATCH --partition=ember-gpu
  • To access kingspeak GPU nodes (need to request addition to account)
    #SBATCH --account=kingspeak-gpu
    #SBATCH --partition=kingspeak-gpu

For more examples of SLURM job scripts see the CHPC MyJobs templates.

IMPORTANT: The biggest change in moving from Torque/Moab to Slurm comes when you are out of allocation. At that point you will no longer have access to the cluster partition and will have to manually change your scripts to use the cluster-freecycle partition.

Features: Features are extensions that allow for finer-grained specification of resources. We use features for the core count per node, which allows a job to obtain nodes of a uniform core count on clusters that have nodes with several different core counts. Features are requested with the --constraint or -C flag (described in the table below), and the core count is denoted as c#, i.e. c8, c12, c16, c20, c24. Features can be combined with logical operators, such as | for or and & for and. For example, to request 16- or 20-core nodes, do -C "c16|c20".

Features, requested with the constraint flag #SBATCH -C, can also be used to target specific owner nodes when running as owner-guest. Using the 'si' alias for sinfo given below, there is a column NODES(A/I/O/T) that gives the number of nodes that are allocated/idle/offline/total, along with a column FEATURES, which gives the values that can be used as a constraint. To target a specific group of owner nodes, use the name given in this feature column.

Another feature that we list for each node is its owner, either chpc or the group/center name. This can be used to target specific group nodes that have low use as owner-guest, in order to reduce the chances of being preempted. For example, to target nodes owned by group "ucgd", we can do -A owner-guest -p kingspeak-guest -C "ucgd".
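
In a batch script the same request would look like the following (a minimal sketch; ucgd is just the example group above):

#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --constraint="ucgd"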

Reservations: Upon request we can create reservations for users to guarantee node availability. Reservations are requested with the --reservation flag (abbreviated as -R) followed by the reservation name, which consists of a user name followed by a number, e.g. u0123456_1. Thus to use an existing reservation in a job script, include #SBATCH --reservation=u0123456_1 .

For policies regarding reservations see the Batch Policies document.

Sample MPI job Slurm script

#!/bin/csh
#SBATCH --time=1:00:00 # walltime, abbreviated by -t
#SBATCH --nodes=2      # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N # name of the stderr, using job and first node values
#SBATCH --ntasks=16 # number of MPI tasks, abbreviated by -n
# additional information for allocated clusters
#SBATCH --account=baggins # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p
#
# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/kingspeak/serial/UNID/myscratch
mkdir -p $SCRDIR
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR
#
# load appropriate modules, in this case Intel compilers, MPICH2
module load intel mpich2
# for MPICH2 over Ethernet, set communication method to TCP
# see below for network interface selection options for different MPI distributions
setenv MPICH_NEMESIS_NETMOD tcp
# run the program
# see below for ways to do this for different MPI distributions
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

The #SBATCH lines denote the SLURM flags; the rest of the script gives the instructions on how to run your job. Note that we are using the SLURM built-in $SLURM_NTASKS variable to denote the number of MPI tasks to run. In the case of a plain MPI job, this number should equal the number of nodes ($SLURM_NNODES) times the number of cores per node.

Also note that some packages have not been built with the MPI distributions that support Slurm, in which case you will need to specify the hosts to run on via the machinefile flag of the mpirun command of the appropriate MPI distribution. Please see the package help page for details and the appropriate script. Additional information on creating a machinefile is also given in the table of SLURM environment variables below.
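
As a minimal sketch (assuming an MPICH-style mpirun that accepts a -machinefile flag; other distributions use e.g. --hostfile), the machinefile can be generated inside the job and passed to mpirun:

srun hostname | sort > nodefile.$SLURM_JOBID            # one line per task/core
mpirun -np $SLURM_NTASKS -machinefile nodefile.$SLURM_JOBID my_mpi_program > my_program.out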

For mixed MPI/OpenMP runs, you can either hard-code OMP_NUM_THREADS in the script or use logic like that below to derive it from the Slurm job information. When requesting resources, ask for the number of MPI tasks and the number of nodes to run on, not for the total number of cores the MPI+OpenMP tasks will use.

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -C "c12" # we want to run on uniform core count nodes

# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS
# set thread affinity to CPU socket
setenv KMP_AFFINITY verbose,granularity=core,compact,1,0

mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

Alternatively, the SLURM option -c (or --cpus-per-task) can be used, like:

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6

setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

Note that if you use this on a cluster with nodes of varying core counts (kingspeak and ash), SLURM is free to pick any node, so the job's nodes may be undersubscribed (e.g. on ash, the above options would fully subscribe the 12-core nodes but undersubscribe the 20- or 24-core nodes).
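
If full node utilization matters, one option (a sketch, combining -c with the core count feature described earlier) is to constrain the job to a single core count so the task and thread counts match the hardware:

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6
#SBATCH -C "c12" # 4 tasks x 6 threads = 24 cores, fully subscribing two 12-core nodes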

Job Submission using SLURM

In order to submit a job, one first has to log in to an interactive node. The job submission is then done with the sbatch command.

For example, to submit a script named script.slurm just type:

  • sbatch script.slurm

IMPORTANT: sbatch by default passes all environment variables to the compute node, which differs from the behavior in PBS (which started with a clean shell). If you need to start with a clean environment, you will need to use the following directive in your batch script:

  • #SBATCH --export=NONE

This will still execute the .bashrc/.tcshrc scripts, but any changes you make in your interactive environment will not be present in the compute session. As an additional precaution, if you are using modules, you should run module purge to guarantee a fresh environment.
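
Put together, the start of a job script that guarantees a fresh environment might look like this (a minimal sketch; the module names are just examples):

#!/bin/tcsh
#SBATCH --export=NONE    # do not inherit the submission environment
# ... other #SBATCH directives ...
module purge             # drop any modules picked up from startup scripts
module load intel mpich2 # load only the modules this job needs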

 Checking the status of your job

To check the status of your job, use the squeue command.

  • squeue

The most common arguments are -u u0123456, for listing only user u0123456's jobs, and -j job#, for listing the job specified by the job number. Adding -l (for "long" output) gives more details.
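
For example (u0123456 and the job number are placeholders):

squeue -u u0123456 -l # long listing of user u0123456's jobs
squeue -j 123456      # status of the job with job number 123456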

Alternatively, from the account perspective, one can use the sacct command. This command accesses the accounting database and can give useful information about current and past job resource usage.
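
A minimal sketch of such a query (the field list is only an illustration; sacct --helpformat lists all available fields):

sacct -j 123456 --format=JobID,JobName,Partition,Elapsed,State # one running or finished job
sacct -u u0123456 --starttime=2017-09-01                       # all of a user's jobs since a date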

 Interactive batch jobs

In order to launch an interactive session on a compute node do:

srun --time=1:00:00 --ntasks 2 --nodes=1 --account=chpc --partition=ember --pty /bin/tcsh -l

The important flags are --pty, denoting an interactive terminal, and /bin/tcsh -l, the shell to run; if you prefer bash, replace /bin/tcsh with /bin/bash (the trailing -l starts the shell as a login shell). As in a batch script, -n specifies tasks and -N specifies nodes. The srun option --label prepends the remote task id to each line of stdout/stderr. Note that the order of the command is important; the "--pty /bin/tcsh -l" has to be at the end.

The srun flags can be abbreviated as:

srun -t 1:00:00 -n 2 -N 1 -A chpc -p ember --pty /bin/tcsh -l

The srun command by default passes all environment variables of the parent shell, therefore the X window connection is preserved as well, allowing graphical (GUI-based) applications to run inside the interactive job.
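
For example, after connecting to the cluster with X forwarding enabled (e.g. ssh -Y), a short interactive job can be used to verify that graphics work (xterm is just a convenient test program):

srun -t 0:30:00 -n 1 -N 1 -A chpc -p ember --pty /bin/tcsh -l
xterm & # should open a window on your local display if the X connection was passed through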

Running MPI jobs

One option is to produce the hostfile and feed it directly to the mpirun command of the appropriate MPI distribution. The disadvantage of this approach is that it does not integrate with SLURM and as such it does not provide advanced features such as task affinity, accounting, etc.

Another option is to use process manager built into SLURM and launch the MPI executable through srun command. How to do this for various MPI distributions is described at http://slurm.schedmd.com/mpi_guide.html. Some MPI distributions' mpirun commands integrate with Slurm and thus it is more convenient to use them instead of srun.

For the MPI distributions at CHPC, the following works (assuming an MPI program internally threaded with OpenMP).

Intel MPI
module load [intel,gcc] impi
# for a cluster with Ethernet only, set network fabrics to TCP
setenv I_MPI_FABRICS shm:tcp
# for a cluster with InfiniBand, set network fabrics to OFA
setenv I_MPI_FABRICS shm:ofa
# on lonepeak owner nodes, use the TMI interface (InfiniPath)
setenv I_MPI_FABRICS shm:tmi

# IMPI option 1 - launch with PMI library - currently not using task affinity, use mpirun instead
setenv I_MPI_PMI_LIBRARY /uufs/CLUSTER.peaks/sys/pkg/slurm/std/lib/libpmi.so
#srun -n $SLURM_NTASKS $EXE >& run1.out
# IMPI option 2 - bootstrap
mpirun -bootstrap slurm -np $SLURM_NTASKS $EXE  >& run1.out
MPICH2

Launch the MPICH2 jobs with mpiexec as explained in http://slurm.schedmd.com/mpi_guide.html#mpich2. That is:

module load [intel,gcc,pgi] mpich2
setenv MPICH_NEMESIS_NETMOD mxm # default is Ethernet, choose mxm for InfiniBand
mpirun -np $SLURM_NTASKS $EXE
OpenMPI

Use the mpirun command from the OpenMPI distribution. There's no need to specify the hostfile as OpenMPI communicates with Slurm in that regard. To run:

module load [intel,gcc,pgi] openmpi
mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # in case of Ethernet network cluster, such as general lonepeak nodes.
mpirun -np $SLURM_NTASKS $EXE # in case of InfiniBand network clusters

Note that OpenMPI supports multiple network interfaces and as such it allows for single MPI executable across all CHPC clusters, including the InfiniPath network on lonepeak.

MVAPICH2

An MVAPICH2 executable can be launched with the mpirun command (preferably) or with srun, in which case one needs to use the --mpi=none flag. To run multi-threaded code, make sure to set OMP_NUM_THREADS and MV2_ENABLE_AFFINITY=0 (so that the MPI tasks don't get locked to a single core) before calling srun.

module load [intel,gcc,pgi] mvapich2
setenv OMP_NUM_THREADS 6 # optional number of OpenMP threads
setenv MV2_ENABLE_AFFINITY 0 # disable process affinity - only for multi-threaded programs
mpirun -np $SLURM_NTASKS $EXE # mpirun is recommended
srun -n $SLURM_NTASKS --mpi=none $EXE # srun is optional

Running multiple serial calculations within one job

Please, see a page dedicated to running multiple serial jobs for details.

Multiple jobs using job arrays

Job arrays enable quick submission of many jobs that differ from each other only by some index. In this case Slurm provides the environment variable SLURM_ARRAY_TASK_ID, which serves as the differentiator between the jobs. For example, if our program takes input data input.dat, we can run it with 30 different input files named input[1-30].dat using the following script, named myrun.slr:

#!/bin/tcsh
#SBATCH -J myprog # A single job name for the array
#SBATCH -n 1 # Number of tasks
#SBATCH -N 1 # All tasks on one machine
#SBATCH -p CLUSTER # Partition on some cluster
#SBATCH -A chpc # General CHPC account
#SBATCH -t 0-2:00 # 2 hours (D-HH:MM)
#SBATCH -o myprog%A%a.out # Standard output
#SBATCH -e myprog%A%a.err # Standard error

./myprogram input$SLURM_ARRAY_TASK_ID.dat 

We then use the --array  parameter to run this script:

sbatch --array=1-30 myrun.slr

Apart from SLURM_ARRAY_TASK_ID, which is an environment variable unique to each job in the array, notice also %A and %a, which represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique file names.

You can also limit the number of jobs that run simultaneously to "n" by adding %n to the end of the array range:

sbatch --array=1-30%5 myrun.slr

Please be aware that since we don't allow multiple jobs to share nodes, submitting serial jobs this way will waste a lot of resources, since each job will use only one task (core) on a cluster node. So in our environment job arrays should only be used for mass submission of programs that are parallelized either with shared memory (e.g. OpenMP), using all the CPU cores on one node, or with distributed parallelism (e.g. MPI) and able to run on one or more nodes. For multiple serial job submissions, see the running multiple serial jobs page.

Automatic restarting of preemptable jobs

The owner-guest or freecycle queues tend to have quicker turnaround than the general queues. However, guest jobs may get preempted. If one's job is checkpointed (e.g. by saving particle positions and velocities in dynamics simulations, or property values and gradients in minimizations), one can automatically restart a preempted job following this strategy:

  1. Right at the beginning of the job script, submit a new job with a dependency on the current job. This ensures that the new job becomes eligible to run only after the current job is preempted (or finishes). Save the new job submission output into a file; this file contains the job ID of the new job, which we save into the environment variable NEWJOB:
    sbatch -d afterany:$SLURM_JOBID run_ash.slr >& newjob.txt
    set NEWJOB=`cat newjob.txt |cut -f 4 -d " "`
  2. In the simulation output, include a file that lists the last checkpointed iteration, time step, or other measure of the simulation progress. In the example below, we have a file called inv.append which, among other things, contains one line per simulation iteration.
  3. In the job script, extract the iteration number from this file and put it into the simulation input file (here called inpt.m). This input file will be used when the simulation is restarted. Since the output file does not exist at the very start of the simulation, the first job will not append to the input file and will thus begin from the start.
    set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter |tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
    if ($ITER != "") then
    echo "restart=$ITER;" >> inpt.m
    endif
  4. Run the simulation. If the job gets preempted, the current job ends here. If it runs to completion, then at the end of the job script delete the new job, identified by the environment variable NEWJOB, that was submitted when this job started:
    scancel $NEWJOB

In summary, the whole SLURM script (called run_ash.slr) would look like this:

#SBATCH all necessary job settings (partition, walltime, nodes, tasks)
#SBATCH -A owner-guest

# submit a new job dependent on the finish of the current job
sbatch -d afterany:$SLURM_JOBID run_ash.slr >& newjob.txt
# get this new job job number
set NEWJOB=`cat newjob.txt |cut -f 4 -d " "`
# figure out from where to restart
set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter |tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
if ($ITER != "") then
echo "restart=$ITER;" >> inpt.m
endif

# copy input files to scratch
# run simulation
# copy results out of the scratch

# delete the job if the simulation finished
scancel $NEWJOB

Handy Slurm Information

Slurm User Commands

 Slurm Command  What it does
 sinfo  reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.
 squeue  reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
 sbatch  is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
 scancel  is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
 sacct  is used to report job or job step accounting information about active or completed jobs.
 srun  is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.

Useful Slurm aliases

Bash to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si="sinfo -o \"%20P %5D %14F %8z %10m %10d %11l %16f %N\""
alias si2="sinfo -o \"%20P %5D %6t %8z %10m %10d %11l %16f %N\""
alias sq="squeue -o \"%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""

Tcsh to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si 'sinfo -o "%20P %5D %14F %8z %10m %11l %16f %N"'
alias si2 'sinfo -o "%20P %5D %6t %8z %10m %10d %11l %N"'
alias sq 'squeue -o "%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R"'

sview GUI tool

sview is a graphical user interface to view and modify Slurm state. Run it by typing sview. It is useful for viewing partition and node characteristics and information on jobs. Right-clicking on a job, node or partition allows you to perform actions on it; use this carefully so as not to accidentally modify or remove your job.

 sview

Moab/PBS to Slurm translation

Moab/PBS to Slurm commands

Action  Moab/Torque  Slurm
Job Submission msub/qsub sbatch
Job deletion canceljob/qdel scancel
List all jobs in queue showq/qstat squeue
List all nodes   sinfo
Show information about nodes mdiag -n/pbsnodes scontrol show nodes 
Job start time showstart squeue --start
Job information checkjob scontrol show job <jobid>
Reservation information showres scontrol show res (this option shows details) or sinfo -T

 Moab/PBS to Slurm environmental variables

Description  Moab/Torque  Slurm
 Job ID  $PBS_JOBID $SLURM_JOBID
 node list  $PBS_NODEFILE

 To generate a listing of 1 node per line:
 srun hostname | sort -u > nodefile.$SLURM_JOBID

 To generate a listing of 1 core per line:
 srun hostname | sort > nodefile.$SLURM_JOBID

submit directory $PBS_O_WORKDIR $SLURM_SUBMIT_DIR
number of nodes   $SLURM_NNODES
number of processors (tasks)   $SLURM_NTASKS ($SLURM_NPROCS for backward compatibility)

Moab/PBS to Slurm job script modifiers

 

Description  Moab/Torque  Slurm
Walltime #PBS -l walltime=1:00:00 #SBATCH -t 1:00:00
Process count

#PBS -l nodes=2:ppn=12

#SBATCH -n 24 ( or --ntasks=24)
#SBATCH -N 2 (or --nodes=2)

For threaded MPI jobs, use number of MPI tasks for --ntasks,
not number of cores. See the example script above for how
to figure out number of threads per MPI task

Memory #PBS -l nodes=2:ppn=12:m24576

#SBATCH --mem=24576

it is also possible to specify memory per task with --mem-per-cpu.

Mail options #PBS -m abe

#SBATCH --mail-type=FAIL,BEGIN,END 
there are other options such as REQUEUE, TIME_LIMIT_90. ...

Mail user #PBS -M user@mail.com  #SBATCH --mail-user=user@mail.com
Job name and
STDOUT/STDERR
#PBS -N myjob

#SBATCH -o myjob.out
#SBATCH -e myjob.err

Account #PBS -A owner-guest
optional in Torque/Moab

#SBATCH -A owner-guest (or --account=owner-guest)
required in Slurm

Dependency #PBS -W depend=afterok:12345
run after job 12345 finishes correctly

#SBATCH -d afterok:12345 or --dependency=afterok:12345
similarly to Moab, other modifiers include after, afterany, afternotok.
Please note that if job runs out of walltime, this does not constitute OK exit. To start a job after specified job finished use afterany.
For details on job exit codes see http://slurm.schedmd.com/job_exit_code.html

Reservation #PBS -l advres=u0123456_1

#SBATCH -R u0123456_1 or --reservation=u0123456_1

Partition No direct equivalent

#SBATCH -p lonepeak (or --partition=lonepeak)

Propagate all environment
variables from terminal
#PBS -V  All environment variables are propagated by default, except for modules
which are purged at a job start to prevent possible inconsistencies.
One can either load the needed modules in the job script,
or have them in their .custom.[sh,csh] file.
Propagate specific
environment variable
#PBS -v myvar #SBATCH --export=myvar
use with caution as this will export ONLY variable myvar

Target specific owner
nodes as guest

#PBS -l nodes=1:ppn=24:ucgd -A owner-guest #SBATCH -A owner-guest -p kingspeak-guest -C "ucgd"

Target specific nodes  

  #SBATCH -w em001,em002 or --nodelist=em001,em002

 Information about job priority

Note that this applies to the general resources and not to owner resources on the clusters. 

The first and most significant portion of a job's priority is based on the account being used and whether it has allocation or not. Jobs run with allocation have a base priority of 1000000. Jobs without have a base priority of 1.

To this, there are additional values added for:

(1) Age (time a job spends in the queue) -- The "Age" contribution grows roughly linearly with queue time until it hits a cap, which limits how much extra priority a job can accrue while waiting in the queue.

(2) Fairshare (how much of the system you have used recently) -- Fairshare is a factor based on the historical usage of a user. All else being equal, the user that has used the system less recently gets a priority bonus over the user that has used it more recently. This factor behaves more exponentially than the other two.

(3) JobSize (how many nodes/cores your job is requesting) -- Job size is again a linear value based on the number of resources requested. It is fixed at submit time according to the requested resources.

At any point you can run 'sprio' to see the current priority, as well as its source in terms of the three components mentioned above, for all idle jobs in the queue on a cluster.
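
For example (u0123456 is a placeholder UNID):

sprio -l          # long listing with the individual priority components for all pending jobs
sprio -u u0123456 # priorities of pending jobs belonging to u0123456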

How to determine which Slurm accounts you are in

In order to see which accounts and partitions you can use do:

sacctmgr -p show assoc user=<UNID> 
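
The -p flag produces parsable (pipe-delimited) output. To trim it down to the relevant columns, a format list can be added (a sketch; u0123456 is a placeholder UNID):

sacctmgr -p show assoc user=u0123456 format=account,partition,qos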

 

Other good sources of information

Last Updated: 9/15/17