
A batch system monitors and controls the jobs running on a system. It enforces limits on runtime (walltime) and on the number of jobs running at one time (both in total and per user). To run a job, the batch system allocates the resources requested in the batch script, sets up an environment for the job (which involves running the user's .cshrc and .login files), and then runs the job in that environment. In this environment, standard output and standard error are redirected into files in the working directory that is current when the executable is actually run.

PBS (Torque distribution) is used on all systems at CHPC. On most CHPC systems the Maui scheduler is used. For details, see the PBS page.

A scheduler works with the batch system to increase throughput and enforce policies on the system.

CHPC uses Maui on all of its platforms.

To submit a job to a batch system, you must first write a batch script. The script contains batch system directives as well as normal Unix commands, one of which runs your job. Once you have a script, you submit the job by submitting the script to the batch system.

To submit a script to the batch system, use the PBS command "qsub". The syntax is "qsub scriptname". Certain command line options can also be used. For a more detailed description, as well as other PBS commands, see the PBS page.
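
As a rough illustration of the description above, a minimal batch script might look like the following sketch. The job name, the resource request, and the echo line are placeholders rather than CHPC recommendations. The choice of bash is an assumption, and the fallback to the current directory is there only so the script can also be tried outside the batch system.

```shell
#!/bin/bash
#PBS -S /bin/bash
#PBS -N example_job
#PBS -l nodes=1:ppn=1,walltime=00:10:00

# PBS_O_WORKDIR is set by PBS to the directory qsub was run from.
# The fallback to $PWD is only so this sketch also runs outside PBS.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# Placeholder: replace this line with the command that runs your program.
echo "Job running in $workdir"
```

If this were saved as example.pbs (a hypothetical file name), it would be submitted with "qsub example.pbs".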

You may use the PBS/Torque command "qstat", but the Maui/Moab command "showq" is preferable. For more information on "showq", see the Maui User Commands.

A list of PBS commands can be found at the bottom of the PBS commands page.

Sample PBS scripts as well as general PBS information can be found on the PBS page. From that page are links to the specific platform user guides which have sample scripts.

Your home directory space is visible to each node of the system. However, if you read or write large data files in your home directory, the data has to pass over the network to or from the file server. This slows access considerably and creates a lot of network congestion.

CHPC has several /scratch filesystems. The fastest is /tmp. The reason is simple: the /tmp filesystem is local to each node, so there is no delay for network access. The disadvantage is that you cannot check on your output during your run. Also, if your job does not complete normally, you may not be able to retrieve the files from this local space.

The other /scratch spaces available are /scratch/serial, /scratch/serial-old, /scratch/da, and /scratch/mm. The first two are available on all of the arches clusters. /scratch/da is only available on delicatearch, and /scratch/mm is only available on marchingmen.

The scratch directory is a filesystem space for writing temporary files. Scratch is generally not a safe place to keep anything long term. Any file with a modification date more than 30 days old is removed from scratch. Scratch is not backed up.

Scripts can only be submitted from one of the interactive nodes. There are 2 interactive nodes for each cluster, but you may submit jobs to any cluster from any interactive node by using the full path to the qsub command.

Also, the script should probably be submitted from, and live somewhere in, your home directory structure. It should not live in /scratch, since one of the last commands in the script itself is usually to delete your temporary /scratch subdirectory. Scratch is not a safe place to keep anything long term. Any file with a modification date more than 30 days old can be removed from scratch. Scratch is not backed up.
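
The pattern described above (work in a temporary scratch subdirectory, copy results home, then delete the subdirectory) could be sketched as follows. The function name, the SCRATCH_ROOT variable, the file names input.dat and output.dat, and the tr command standing in for a real program are all hypothetical; SCRATCH_ROOT would be set to a scratch filesystem available on your cluster.

```shell
#!/bin/bash
# Sketch only: SCRATCH_ROOT and the file names are hypothetical.
# Set SCRATCH_ROOT to a scratch filesystem on your cluster
# (for example /scratch/serial); /tmp is used if it is unset.
run_in_scratch() {
    local home_dir="$1"    # directory in your home space holding input.dat
    local scratch_root="${SCRATCH_ROOT:-/tmp}"
    local scratch_dir
    scratch_dir="$(mktemp -d "$scratch_root/job.XXXXXX")" || return 1

    # Stage the input into the temporary scratch subdirectory.
    cp "$home_dir/input.dat" "$scratch_dir/" || return 1

    # Stand-in for the real program: reads input.dat, writes output.dat.
    ( cd "$scratch_dir" && tr 'a-z' 'A-Z' < input.dat > output.dat ) || return 1

    # Copy results home BEFORE removing the scratch subdirectory.
    cp "$scratch_dir/output.dat" "$home_dir/" || return 1
    rm -rf "$scratch_dir"
}
```

If any step fails, the function returns early and leaves the scratch subdirectory in place so the files can still be inspected.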

All scripts are submitted to the system by typing the command "qsub scriptname". This submits to the queue of the cluster associated with the interactive node you are on. If you want to submit to a different cluster than the one you are logged into, you may use the full path to the qsub command. For example, if you are logged into a marchingmen node and wish to submit to sanddunearch, instead of "qsub scriptname" you would enter "/uufs/sanddunearch.arches/sys/bin/qsub scriptname".
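
A small helper can build that path from a cluster name. This is a sketch that assumes other clusters follow the same directory layout as the sanddunearch example; only the sanddunearch path is confirmed above, and the function name is made up for illustration.

```shell
# Build the full path to a cluster's qsub, following the pattern of the
# sanddunearch example. Assumption: other clusters use the same layout.
qsub_for_cluster() {
    printf '/uufs/%s.arches/sys/bin/qsub\n' "$1"
}

# Example use (myscript is a placeholder script name):
#   "$(qsub_for_cluster sanddunearch)" myscript
```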

The policy is that there is no limit to the number of jobs a user can have submitted to the queue, but a user cannot use more than half of the available resources on a platform at a time. Also, if you have more than 5 jobs waiting in the queue, the first 5 will be eligible for scheduling and the rest will be blocked. As jobs move through the queue, the blocked jobs will move to the eligible queue.
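
To make the eligible/blocked split concrete, here is a small arithmetic sketch. The function name is hypothetical and the code only illustrates the rule of 5 stated above; it does not query the scheduler.

```shell
# Split a user's waiting jobs into eligible and blocked counts,
# using the limit of 5 eligible jobs described above.
queue_split() {
    local waiting="$1" limit=5 eligible
    if [ "$waiting" -gt "$limit" ]; then
        eligible="$limit"
    else
        eligible="$waiting"
    fi
    echo "eligible=$eligible blocked=$(( waiting - eligible ))"
}

queue_split 8    # prints: eligible=5 blocked=3
```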

To run a job that is longer than is allowed in the standard queues, you must get special access. To request access, send email to the CHPC director explaining the purpose of your request and justifying it by explaining, for example, why it is impossible to checkpoint your job.

You can use up to half of the available resources at a given time, either as one job or as several smaller jobs. If you need more, send email to the CHPC director explaining the purpose and scope of your request.

Queueing systems generally try to minimize the average wait time for everybody. If the system cannot acquire the resources necessary to run your job, but does have the resources to run another (probably smaller or shorter) job that is later in the queue, it may run that job in the interest of overall throughput. The Maui Scheduler is set up so that reservations are made for jobs waiting in the queue. It takes information such as QOS, walltime, and nodes, and places a reservation on the nodes it expects to be released earliest. However, jobs sometimes finish at a time the Scheduler did not anticipate, so other jobs can be placed on those nodes. All jobs are subject to strict policies before they enter the system, and fairness is always applied. For more information, read the Maui Scheduler User's Guide.

You should NEVER try to run your job in the background (by putting a "&" character after the line that runs your executable in your batch script) on the batch systems. If you do so, the batch system may lose track of some of your processes, which may result in processes being killed, a loss of data, and/or your script doing its final cleanup before your job actually completes.

The only exception to this rule is to use the "wait" command after the process(es) have been put in the background. The "wait" command makes the script wait until all the background processes have finished before proceeding.
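
A minimal sketch of that exception is shown below. The sleep and echo commands stand in for real executables, and the temporary output directory is only for illustration.

```shell
#!/bin/bash
# Sketch: two background tasks followed by "wait".
# The sleep/echo commands stand in for real executables.
outdir="$(mktemp -d)"

( sleep 1; echo "task 1 done" > "$outdir/task1.out" ) &
( sleep 1; echo "task 2 done" > "$outdir/task2.out" ) &

# Without this line the script could reach its cleanup step
# while the tasks are still running.
wait

cat "$outdir/task1.out" "$outdir/task2.out"
```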

First consult the PBS page or the individual systems' user guides. Then consult all the other batch questions on this page.

If you still cannot get your script to work, email the user services staff with either the script included in your email or a path to it (make sure the permissions are set so that the user services staff can get to it). Any other information about your jobs is also extremely helpful for us.

Your job's priority is greatly reduced. The QOS (Quality of Service) is changed to 0, which means your job will only run if the resources are not in use by another user with an allocation. This QOS is sometimes referred to as "freecycle".

Most likely you have a non-#PBS line in your script that precedes other #PBS lines. This is not allowed in a PBS script. You cannot have anything, not even blank lines, preceding the #PBS lines in your script. The only possible exception is comment lines beginning with #.
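
One way to spot this problem is a small check like the sketch below. It is not a CHPC or PBS tool; the function name is hypothetical, and it simply applies the rule above: any line that does not start with "#" and appears before a #PBS line is reported.

```shell
# Report the first line that is not a comment (does not start with "#")
# and appears before a #PBS directive. Returns 0 if the script is clean.
check_pbs_order() {
    awk '
        /^#PBS/ {
            if (bad) {
                printf "line %d precedes the #PBS directive on line %d\n", bad, NR
                found = 1
                exit
            }
            next
        }
        /^#/ { next }              # other comment lines are allowed
        { if (!bad) bad = NR }     # blank or command line: remember the first one
        END { exit found ? 1 : 0 }
    ' "$1"
}
```

Running "check_pbs_order scriptname" before "qsub scriptname" prints nothing when the #PBS lines come first.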

Last Modified: November 08, 2013 @ 10:10:49