Running independent serial calculations

Some data analyses require running many similar, independent serial calculations. Because of the logistics of setting up these calculations, and because of CHPC cluster scheduling policies, it is advisable not to run each calculation as a separate job. Depending on the characteristics of the calculations, the best strategy for packing them into jobs differs. This page lists and details the available strategies.

If the calculations take about the same time, there are several ways to pack them into a single job. If the job runs on a single node, we can combine execution in the background with the wait statement. On multiple nodes, a similar effect can be achieved with GNU Parallel; however, distributing the calculations is easier with SLURM's srun --multi-prog option.

If the calculations have variable runtimes, they need to be scheduled inside the job in order to use all available CPU cores on the allocated nodes efficiently. We have developed a mini-scheduler, called submit, for this purpose.

If the number of calculations is larger (> ~100), we recommend either splitting the srun --multi-prog run into multiple jobs, or chaining the calculations one after another using submit. The reason is that larger jobs take longer to allocate and, when using the owner-guest queue, are more likely to be preempted.

Finally, if there are a lot of serial calculations and they are not very short, the Open Science Grid (OSG) may be the best choice due to the vast amount of resources the OSG provides.

Below is a summary of the best strategies for the different types of serial calculations:

  • Single calculation runtime < 15 min, about the same runtime: srun --multi-prog for < ~100 calculations; shell script loop (only on a single node); submit for chaining > ~100 calculations
  • Single calculation runtime < 15 min, variable runtime: submit
  • Single calculation runtime > 15 min, about the same runtime: srun --multi-prog, multiple jobs if > ~100 calculations; Open Science Grid
  • Single calculation runtime > 15 min, variable runtime: submit; Open Science Grid

If the calculation is thread-parallelized, it is also possible to use SLURM job arrays to submit multiple jobs at once, each occupying one full node.
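Such an array job can be sketched as follows (the partition, account, array size, and program path are placeholders, not a tested recipe):

```shell
#!/bin/bash
#SBATCH -N 1                 # one full node per array task
#SBATCH -t 1:00:00
#SBATCH -p CLUSTER           # partition name (placeholder)
#SBATCH -A chpc              # account name (placeholder)
#SBATCH --array=0-9          # ten independent jobs (hypothetical count)

# each array task receives a unique SLURM_ARRAY_TASK_ID to differentiate the runs
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
/path_to/myprogram $SLURM_ARRAY_TASK_ID
```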

Independent serial calculations with similar run time

Multiple calculations in the background

This is the simplest way to run multiple programs within a single job; however, it works only on a single node.

If we only run on a single node, the process is very simple, e.g. in bash:

#!/bin/bash
for (( i=0; i < $SLURM_NTASKS ; i++ )); do
  /path_to/myprogram $i &
done
wait

We differentiate between the calculations inside myprogram through the loop index $i. The & puts each process in the background, allowing all $SLURM_NTASKS of them to be launched. The wait statement causes the script to wait until all of the background processes finish.

The advantage of this approach is its simplicity; the drawbacks are that it works only on a single node, and only for calculations that take roughly the same amount of time.

srun multiple program configuration

The --multi-prog option allows each task in the job to be assigned a different executable and arguments. This makes it possible to differentiate the serial runs from each other and run them inside a single parallel SLURM job. This is our preferred way to launch independent serial calculations that take about the same time. A basic SLURM job script can look like this:

#!/bin/sh
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -t 1-03:00:00 # 1 day and 3 hours
#SBATCH -p CLUSTER # partition name
#SBATCH -A chpc # account name
#SBATCH -J my_job_name # job name

srun --multi-prog my.conf

Here we submit a job on one node with 16 tasks, and then run srun with the --multi-prog option, followed by a configuration file for the multiple programs. This file has the following three fields per line, separated by spaces:

  • task number
  • executable file
  • arguments to the executable file

The executable's arguments may include the expression "%t", which gets replaced by the task number, and "%o", which gets replaced by the task's offset within its task range.
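For instance, a my.conf matching the 16-task job above might look like the following (the worker.sh script and input file names are hypothetical):

```
# task range   executable    arguments (%t is replaced by the task number)
0-15           ./worker.sh   input-%t.dat
```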

Please note that if the executable is not in the default PATH (as defined when a new shell is opened), the full path to the executable has to be specified. The same is true if the executable is a script that in turn calls a program. Due to our modules setup, running such a script resets the module environment, so program modules need to be loaded again inside the script.

For example, to run the quantum chemistry program Mopac, we have a mopac.conf as follows:

0-11 ./example.sh %t

where the example.sh script contains:

#!/bin/tcsh
module load mopac
mopac example-$1.mop

A complete example for running multiple serial R simulations using --multi-prog is described on our R documentation page.

We have also developed a simpler multiple-serial-program launch script, which can be obtained here. This script runs as many serial tasks as specified on the #SBATCH -n line. Each task uses one entry from the WORKDIR and PROGRAM arrays, copies data from its WORKDIR to a unique scratch directory, and runs its PROGRAM, which can be the same or different for each serial task.
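The core idea of that launch script can be sketched in plain bash (the directory names and array contents below are hypothetical, and echo stands in for the real programs):

```shell
#!/bin/bash
# Sketch: one scratch directory per serial task, each task driven by its
# own WORKDIR and PROGRAM array entry, all run in the background.
WORKDIR=(run0 run1 run2 run3)          # input directories (hypothetical)
PROGRAM=(prog_a prog_a prog_b prog_b)  # same or different program per task

SCR=$(mktemp -d)                       # stand-in for the per-job scratch space
for (( i=0; i<${#WORKDIR[@]}; i++ )); do
  mkdir -p "$SCR/task$i"
  # cp -r "${WORKDIR[$i]}"/. "$SCR/task$i"   # copy input data to scratch
  ( cd "$SCR/task$i" && echo "running ${PROGRAM[$i]}" > log ) &  # background task
done
wait                                   # block until all tasks finish
```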

Independent serial calculations with variable run time

submit mini-scheduler

The submit program runs many serial calculations inside a parallel cluster job using a master-worker model. It is a simple MPI-based scheduler that reads a list of calculations to do from a file, one per line, and runs them in parallel, keeping as many calculations running as there are parallel tasks. Once a calculation finishes, the worker asks for another one, and this repeats until all calculations are done.

This is our preferred way to run independent serial calculations that may take different amounts of time to finish, as long as there are many more calculations than job tasks, since chaining the calculations one after another fills the resources better. If the runtime of each calculation is roughly known, listing the calculations in descending order of runtime, longest first, provides the best packing onto the job tasks.
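As a sketch of this longest-first ordering, a job.list can be assembled from runtime estimates like this (the estimates and the sleep commands are hypothetical stand-ins for real calculations):

```shell
#!/bin/bash
# Pair each command with an estimated runtime in seconds, sort descending,
# and prepend the job count to form a job.list. Runtimes are hypothetical.
cat > estimates.txt <<'EOF'
30 /bin/sleep 30
120 /bin/sleep 120
60 /bin/sleep 60
EOF
sort -rn estimates.txt | cut -d' ' -f2- > cmds.txt  # commands, longest first
{ wc -l < cmds.txt; cat cmds.txt; } > job.list      # first line: job count
```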

For the basic documentation, an example, and the source code, see the submit GitHub page.

The submit program reads an input file called job.list, whose syntax is as follows:

  • first line - the number of serial jobs to run
  • other lines - the command line for each serial job (including program arguments). Make sure there is only a single space between the program arguments - more than a single space will break the command line.

For example (for testing purpose), you can make job.list as:

4
/bin/hostname -i
/bin/hostname -i
/bin/hostname -i
/bin/hostname -i

This will run 4 serial jobs, each executing the hostname command, which returns the name of the node it ran on.

NOTE - since submit launches the items in job.list directly, it does not use the shell environment. Therefore we need to specify the full path to the command, or run a shell script (with the full path to the shell script in job.list), where the shell script initializes a new shell session with the user's default environment.
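One way to handle this is a small wrapper script, referenced by its full path in job.list, that starts a login shell before running the real program. A hedged sketch (the wrapper path and module name are hypothetical, and hostname stands in for the actual application):

```shell
#!/bin/bash
# Generate a wrapper that job.list can reference by full path. The "-l"
# flag starts a login shell, so the user's default environment (and the
# modules setup) is initialized before the program runs.
cat > /tmp/wrapper.sh <<'EOF'
#!/bin/bash -l
# module load mopac          # load any needed modules here (hypothetical)
hostname -s                  # stand-in for the real program
EOF
chmod +x /tmp/wrapper.sh
/tmp/wrapper.sh
```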

The differentiation between different calculations can be built into the job.list through program arguments, as shown in the example below.

A complete example using SLURM and a set of serial R calculations, similar to the srun --multi-prog example shown above, can be found on the submit github page or at /uufs/chpc.utah.edu/sys/installdir/submit/std/examples/R.

Last Updated: 9/6/17