This is the simplest way to run multiple programs within a single job; however, it works only on a single node. In bash, the process is very simple:
for (( i=0; i < $SLURM_NTASKS ; i++ )); do
/path_to/myprogram $i &
done
wait
We differentiate between the calculations inside of myprogram through the loop index $i. The & puts each process in the background, thus allowing all $SLURM_NTASKS of them to launch. The wait statement causes the script to wait until all of the background processes finish.
The advantage of this approach is the simplicity, the drawback is that it only works on a single node, and for calculations that roughly take the same amount of time.
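Putting the pieces above together, a minimal runnable sketch could look like the following. The workload here is just a placeholder echo; each background process works on its own index, so the outputs do not collide. SLURM_NTASKS is defaulted so the sketch can also be tried outside a Slurm job.

```shell
#!/bin/bash
#SBATCH -n 4
#SBATCH -N 1
# Default SLURM_NTASKS to 4 so the sketch also runs outside a Slurm job.
: "${SLURM_NTASKS:=4}"

for (( i=0; i < SLURM_NTASKS; i++ )); do
    # Placeholder workload: each process writes to its own indexed file.
    echo "task $i on $(hostname)" > "task_${i}.out" &
done
wait    # block until every background process has finished
```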
The srun --multi-prog option allows each task in the job to be assigned a different program and arguments. This makes it possible to differentiate serial runs from each other and run them inside a single parallel Slurm job. This is our preferred way to launch independent serial calculations that take about the same time. A basic Slurm job script can look like this:
#!/bin/bash
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -t 1-03:00:00 # 1 day and 3 hours
#SBATCH -p CLUSTER # partition name
#SBATCH -A chpc # account name
#SBATCH -J my_job_name # job name
srun --multi-prog my.conf
Here we submit a job on one node with 16 tasks, and then run srun with the --multi-prog option, which is followed by the configuration file for the multiple programs. This file has the following three fields per line, separated by spaces:
- task number
- executable file
- arguments to the executable file
The executable arguments may be augmented by the expression "%t", which gets replaced by the task number, and "%o", which gets replaced by the task's offset within its task range.
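For illustration, a hypothetical configuration file combining these fields might look like the following (the program names and arguments are placeholders):

```
# task range   executable            arguments
0-11           /path_to/myprogram    input_%t.dat
12-15          /path_to/postprocess  --part %o
```

Here task 13, for example, would run /path_to/postprocess with --part 1, since its offset within the 12-15 range is 1.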
Please note that if the executable is not in the default PATH (as defined when a new shell is opened), the full path to this executable has to be specified. The same is true if the executable is a script that then calls a program. Due to our modules setup, running such a script resets the module environment, so program modules need to be loaded again inside the script.
For example, to run the quantum chemistry program Mopac, we can have mopac.conf as follows:
0-11 ./example.sh %t
Where the example.sh script loads the module and then invokes the program (the input file naming below is only illustrative):
#!/bin/bash
module load mopac
mopac input$1.mop
A complete example for running multiple serial R simulations using --multi-prog is described on our R documentation page.
We have also developed a simpler multiple serial program launch script, which can be obtained here. This script runs as many serial tasks as specified on the #SBATCH -n line. Each serial task uses one entry from the WORKDIR and PROGRAM arrays listed in the script, copies data from WORKDIR to a unique scratch directory, and runs PROGRAM, which can be the same or different for each serial task.
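The actual script is available at the link above; a minimal sketch of the idea, with placeholder paths and programs, could look like this. On the cluster, SCRATCH_BASE would point at a scratch file system; here it defaults to a local directory so the sketch can be tried anywhere.

```shell
#!/bin/bash
#SBATCH -n 2
# Hypothetical sketch, not the actual CHPC script: one WORKDIR/PROGRAM pair per task.
SCRATCH_BASE="${SCRATCH_BASE:-$PWD/scratch_test/${SLURM_JOB_ID:-0}}"

mkdir -p run0 run1                       # placeholder input directories for this sketch
WORKDIR=( "$PWD/run0" "$PWD/run1" )      # one input directory per serial task
PROGRAM=( /bin/hostname /bin/hostname )  # can be the same or different per task

for (( i=0; i < ${#WORKDIR[@]}; i++ )); do
    SCR="$SCRATCH_BASE/$i"               # unique scratch directory per task
    mkdir -p "$SCR"
    cp -r "${WORKDIR[$i]}/." "$SCR"      # stage the task's input data to scratch
    ( cd "$SCR" && "${PROGRAM[$i]}" > output.log ) &   # run in the background
done
wait                                     # wait for all serial tasks to finish
```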
Independent serial calculations with variable run time
The submit program allows many serial calculations to be run inside a parallel cluster job using a master-worker model. The program is a simple MPI-based scheduler which reads a list of calculations to do from a file, one per line, and runs them in parallel, launching as many calculations as there are parallel tasks. Once a calculation finishes, the worker asks for another one; this repeats until all calculations are done.
This is our preferred way to run independent serial calculations that may take different amounts of time to finish, as long as there are many more calculations than job tasks, since this allows the calculations to be chained one after another and fills the resources better. If one roughly knows the runtime of each calculation, listing them in descending order of runtime, the longest first, will provide the best packing of the calculations onto the job tasks.
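For example, if rough runtime estimates are available, a longest-first job.list can be generated with standard tools. The runtimes.txt format below is our own invention for this sketch, not something submit itself uses: each line holds an estimated runtime followed by the command.

```shell
# Hypothetical input: runtimes.txt lists "<estimated_seconds> <command...>" per line.
cat > runtimes.txt <<'EOF'
10 /bin/sleep 10
30 /bin/sleep 30
20 /bin/sleep 20
EOF

# Sort longest-first by the estimate, then strip the estimate column.
sort -rn runtimes.txt | cut -d' ' -f2- > cmds.txt
# job.list: first line is the number of jobs, then one command per line.
wc -l < cmds.txt > job.list
cat cmds.txt >> job.list
```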
For the basic documentation, example and source code, see the submit GitHub page.
The submit program reads an input file called job.list, whose syntax is as follows:
- first line - number of serial jobs to run
- other lines - command line for each serial job (including program arguments). Make sure there is only a single space between the program arguments - more than a single space will break the command line.
For example (for testing purposes), you can make job.list as:
4
/bin/hostname
/bin/hostname
/bin/hostname
/bin/hostname
This will run 4 serial jobs, each executing the hostname command, which returns the name of the node the command ran on.
NOTE - since submit launches the items in job.list directly, it does not source the user environment. Therefore we need to specify the full path to the command, or run a shell script (with the full path to the shell script in job.list), where the shell script initializes a new shell session with the user's default environment.
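Such a wrapper script could be sketched as follows. The "bash -l" shebang starts a login shell, so the user's default environment is initialized; on the cluster you would load the needed modules inside the wrapper (e.g. "module load mopac"). The workload here is a placeholder echo.

```shell
# Sketch: create a hypothetical wrapper script for job.list entries.
cat > wrapper.sh <<'EOF'
#!/bin/bash -l
# -l starts a login shell, initializing the user's default environment.
# $1 differentiates the calculations; program and file names are placeholders.
echo "running calculation $1 on $(hostname)" > "calc_$1.log"
EOF
chmod +x wrapper.sh
./wrapper.sh 3     # a job.list line would read: /full/path/wrapper.sh 3
```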
The differentiation between different calculations can be built into the job.list through program arguments, as shown in the example below.
A complete example using Slurm and a set of serial R calculations, similar to the srun --multi-prog example shown above, can be found on the submit GitHub page or at /uufs/chpc.utah.edu/sys/installdir/submit/std/examples/R.