CHPC Software: Moab Scheduler

The Moab Scheduler is a software tool designed to allow flexible and near-optimal scheduling in a multi-resource high-performance computing environment. Jobs are screened through a large number of administrator-configurable fairness policies and then prioritized based on factors such as job queue time, job expansion factor, and user CPU usage history. The Moab Scheduler maximizes control of system resources by allowing certain nodes to be reserved for specific users, groups, accounts, or projects and minimizes the amount of wasted computer resources by permitting anticipated downtime to be scheduled.

The Moab Scheduler is also an analysis and research tool. It collects a large number of statistics and can analyze this information to determine the effectiveness of the scheduling approach being used and the utilization of the available resources.

The Moab Scheduler was designed to provide the information and control needed to manage large systems efficiently. It operates in an "information-rich" environment, providing the support required by administrators and needed by users. The Moab Scheduler supplies system administrators with the parameters necessary to make well-founded system and job management decisions. Its classless, single-queue environment allows any job to run on any set of nodes whose configuration meets its requirements. The Moab Scheduler provides an extensive set of dynamically reconfigurable parameters to control job queues and various aspects of the running workload.

Moab is part of the batch system at CHPC. It works in conjunction with PBS (Portable Batch System). For details on using batch on a particular CHPC platform, please refer to the User's Guide for that platform.

The Moab Scheduler has approximately 25 commands that allow system administrators to obtain the information needed to solve specific problems, make specific decisions, and fine-tune parameters to increase utilization and throughput. The following set of commands is designed for users to take advantage of the Moab Scheduler's functionality:

showq

Usage

showq [ -r | -i ]

Purpose

showq is closely related to the PBS "qstat" command and is the recommended way to view the queue under Moab. Running this command displays jobs that are active (running), eligible (idle), and blocked (non-queued).

[u0108240@updraft1:~]$ /uufs/tunnelarch.arches/sys/bin/showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

64661              u0320277    Running     4     2:10:53  Sat Feb 21 12:55:54
64662              u0320277    Running     4     8:01:01  Sat Feb 21 18:46:02
64666              u0320277    Running     2    10:13:51  Sat Feb 21 20:58:52
64668              u0320277    Running     4  1:10:42:21  Sun Feb 22 21:27:22
64670              u0320277    Running     2  1:11:08:40  Sun Feb 22 21:53:41
64671              u0320277    Running     2  1:11:08:52  Sun Feb 22 21:53:53
64717              u0180209    Running    22  1:15:05:14  Thu Feb 26 01:50:15
64718              u0180209    Running    22  1:15:05:17  Thu Feb 26 01:50:18
64719              u0180209    Running    22  1:19:46:53  Thu Feb 26 06:31:54
64720              u0180209    Running    22  1:21:19:29  Thu Feb 26 08:04:30
64672              u0320277    Running     2  1:23:08:22  Mon Feb 23 09:53:23
64673              u0320277    Running     2  1:23:08:22  Mon Feb 23 09:53:23
64689              u0320277    Running     2  3:00:19:14  Tue Feb 24 11:04:15
64697              u0399234    Running     2  3:05:12:30  Tue Feb 24 15:57:31
64698              u0399234    Running     2  3:05:21:06  Tue Feb 24 16:06:07
64699              u0399234    Running     2  3:05:25:05  Tue Feb 24 16:10:06
64710              u0320277    Running     2  3:09:16:23  Tue Feb 24 20:01:24
64711              u0320277    Running     2  3:09:18:25  Tue Feb 24 20:03:26
64701              u0399234    Running     2  4:15:05:17  Thu Feb 26 01:50:18

19 active jobs          124 of 124 processors in use by local jobs (100.00%)
                          62 of 62 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

46389              u0646299       Idle     1    20:00:00  Tue Dec  2 21:34:55
64727              u0320277       Idle     4  5:00:00:00  Wed Feb 25 09:18:05
64728              u0320277       Idle     4  5:00:00:00  Wed Feb 25 09:18:05
64729              u0320277       Idle     2  5:00:00:00  Wed Feb 25 09:18:16
64730              u0320277       Idle     2  5:00:00:00  Wed Feb 25 09:18:16
64754              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:28
64755              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:28
64756              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29
64731              u0399234       Idle     2  5:00:00:00  Wed Feb 25 11:35:11
64732              u0399234       Idle     2  5:00:00:00  Wed Feb 25 11:37:18
64733              u0033399       Idle    50  5:00:00:00  Wed Feb 25 12:25:26
64757              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29
64758              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29
64694              u0530689       Idle     4  5:00:00:00  Mon Feb 23 14:48:43
64695              u0530689       Idle     4  5:00:00:00  Mon Feb 23 15:00:24
64704              u0530689       Idle     4  5:00:00:00  Mon Feb 23 16:08:41
64705              u0530689       Idle     4  5:00:00:00  Mon Feb 23 16:11:29

17 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

64759              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29
64760              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29
64761              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29
64762              u0565819       Idle     2    23:59:00  Wed Feb 25 13:49:29

4 blocked jobs   

Total jobs:  40

Statistics are not accurate.

Active jobs are those that are running or starting and consuming CPU resources. Displayed are the job name, the job's owner, and the job state. Also displayed are the number of processors allocated to the job, the amount of time remaining until the job completes (given in [DD:]HH:MM:SS notation), and the time the job started. All active jobs are sorted in "Earliest Completion Time First" order.

Idle jobs are those that are queued and eligible to be scheduled. They are all in the Idle job state and do not violate any fairness policies or have any job holds in place. The jobs in the Idle section display the same information as the Active Jobs section, except that the wall clock limit (WCLIMIT) is shown rather than the time REMAINING, and the job QUEUETIME is shown rather than the job STARTTIME. The jobs in this section are ordered by job priority. Jobs in this queue are considered eligible for both scheduling and backfilling.

Blocked (non-queued) jobs are those that are ineligible to be run or queued. Jobs listed here can be in a number of states for the following reasons:

Job State    Reason
Idle         Job violates a fairness policy.
UserHold     PBS User Hold is in place.
SystemHold   PBS System Hold is in place.
BatchHold    A Moab Scheduler Batch Hold is in place (used when the job cannot be run because the requested resources are not available in the system or because PBS has repeatedly failed in attempts to start the job).
Deferred     A Moab Scheduler Defer Hold is in place (a temporary hold used when a job has been unable to start after a specified number of attempts. This hold is automatically removed after a short period of time).

A summary of the job status is provided at the end of the output. The fields in the output are as follows:

Field       Description
Jobname     Name of the job that has been submitted or is waiting to run.
Username    Name of the user whose job is running or idle.
State       State of the job, either "Running" or "Idle".
Proc        Number of processors in use or requested.
Remaining   Time remaining until the job reaches its wall clock limit, in DD:HH:MM:SS notation. The remaining time displayed may not always equal the actual job time remaining; it is based on the user's wall clock limit.
WCLimit     Wall clock limit specified for the job, in DD:HH:MM:SS notation.
Starttime   Date and time the job started.
Queuetime   Date and time the job entered the queue.

An asterisk at the end of the job name indicates that the job has a reservation; thus, the job cannot be preempted under most circumstances.
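
The -r and -i options shown in the usage line restrict the display to a single class of jobs. As a brief sketch (the exact columns printed depend on the Moab version installed):

$ showq -r    # list only active (running) jobs, with extended detail
$ showq -i    # list only eligible (idle) jobs, with extended detail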

showbf

Usage

Each argument in the showbf usage is described in the Parameters and Arguments table below.

showbf [ -A ] | [ -a ACCOUNT ] | [ -g GROUP ] | [ -u USER ] | [ -m '[ MEMCMP ] MEMORY' ] | [ -n NODECOUNT ] | [ -d HH:MM:SS ] | [ -f FEATURE ] | [ -q QOS ] | [ NOT IN USE -c CLASS ]

Purpose

showbf is a tool used with the Moab Scheduler to simulate a specific job against the current job database. The showbf command simulates the job just as if a user were submitting an actual job. CHPC encourages users to try showbf before submitting their jobs with qsub. The tool queries the job database to find out what is running and how many nodes are open. Based on the preset QOS and priorities, the job database returns information telling the user the likelihood that his/her job will be able to run.

This command can be used by any user to find out how many nodes are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. The key to this tool is that users must specify all the known variables when using showbf in order to receive valuable information. For example, let's say user shaq in group laker wants to submit a job that requires 8 nodes, 128 MB of RAM or more, a duration of 1 hour, and no partition specification. showbf would then query the system and print:

$ showbf -u shaq -g laker -n 8 -m '>=128' -d 1:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun  7 16:06:34

 42 procs available with no timelimit

$ 

If shaq instead wanted exactly 256 MB of RAM, 3 nodes, and a duration of 1 hour, he would simply type and view:

$ showbf -n 3 -m '=256' -d 01:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun  7 16:14:12

  3 procs available with no timelimit

$ 

Parameters and Arguments

Parameter/Argument   Description
-A    Show backfill information for all users, groups, and accounts.
-a    Show backfill information only for the specified account.
-g    Show backfill information only for the specified group.
-u    Show backfill information only for the specified user.
-m    Allows the user to specify the memory requirements for the backfill nodes of interest. It is important to note that if the optional MEMCMP and MEMORY parameters are used, they MUST be enclosed in single quotes ('). For example, enter showbf -m '==256' to request nodes with exactly 256 MB of memory. Valid operators for MEMCMP (memory comparison) are >, >=, ==, <=, and <.
-n    Show backfill information for a specified number of nodes.
-d    Show backfill information for a specified duration, given in DD:HH:MM:SS notation (days:hours:minutes:seconds).
-f    Show backfill information only for nodes which possess a specified feature, such as processor speed. Processor speed is indicated by 's950' for a processor running at 950 MHz.
-q    Show backfill information for nodes which can be accessed by a job with a certain QOS. QOS values are specific to users; if you do not know what your QOS is, do not use this option.
-c    (NOT IN USE) Show backfill information for nodes which support the class feature.
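
For example, to ask about backfill availability on nodes with the 's950' feature for a two-hour duration, a user could run the command below. This is only an illustrative sketch; the feature names available depend on the cluster, and the output will resemble the backfill window listings shown above.

$ showbf -f s950 -d 2:00:00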

showstart

Usage

showstart [ JOBNAME ]

Purpose

The showstart command shows the earliest time a specified job can start, taking into account the requested resources, system downtime, reservations, and so on. The value given ignores jobs with higher priorities that do not have reservations. This command gives the best estimate of the job's start time if the job were next in priority behind the jobs that currently hold reservations. To illustrate, let's say a user wanted to see when his/her job "4010.icebox.icebox" is expected to run:

$ showstart 4010.icebox.icebox

job 4010.icebox.icebox requires 18 nodes for 2:00:00:00

Earliest start is in       1:01:03:45 on Wed Jun 14 11:48:09

Earliest Completion is in  3:01:03:45 on Fri Jun 16 11:48:09

Best Partition: DEFAULT

$

The output of this command informs the user how many nodes the job requires and what its WCLimit is. In addition, showstart shows the earliest start and completion times for the job. The user must keep in mind that showstart analyzes the job only as if it were the next job in the queue. Ignore the partition output, since partitions are not relevant on CHPC systems.

Fair Share policies used by the Moab Scheduler define the percentage of CPU cycles available to particular groups and users. Fair Share is set up so that users and groups are given target percentages of the CPU cycles they are able to use. When a user's recent usage is below their target percentage, fair share increases their priority; if their usage is above the target, their priority in the queue decreases.

For lack of a better example, the concept of priority in fair share can be illustrated with a metaphor. Suppose you have a dart board. On the dart board are rings which represent allocations of CPU cycles. Your goal is to hit the middle ring; however, depending on how well you throw your dart, you may end up hitting the outer ring or the inner circle. Hitting the outer ring is like using more CPU cycles than your target; hitting the inner circle represents using less than your allocated CPU cycles.

If your dart hits right on the middle ring, you have used exactly the amount of CPU cycles you were allocated. If you hit the inner circle, it is like getting another chance to throw again, or in the scheduler's case, a higher priority on your jobs. If you hit the outer ring, in the Moab Scheduler's terms, you will get a lower priority on your jobs.

Users, in some cases, can be part of certain groups. For example, user malone has a fair share target of 20% and user stockton has a fair share target of 7.5%. Both of these users are in the group jazz; therefore, the sum of their target percentages is directly related to the group's target.

Fair Share would see this:

  • User malone: 20%
  • User stockton: 7.5%
  • Group jazz: 30%

The targets of malone and stockton sum to 27.5%, but the target for the users in the jazz group is 30%. Because the group collectively has not reached its target percentage, either the group's priority will rise, or either user can use 2.5% more CPU cycles and remain at regular priority.

The use of reservations allows a job to maintain a guaranteed start time. Reservations enable other jobs to use the reserved nodes as long as they do not delay the reserving job's start time. Moab's backfill selects the best combination of jobs in the schedule and places them without delaying any jobs already scheduled. The best schedule placement, which Moab chooses, varies from workload to workload and can only be determined through simulation.

The Moab Scheduler has the capability to schedule jobs on different partitions, a partition being defined as a group of nodes with a common high-performance switch, a common file space, and a common user space. Users do not specify partitions, but they are considered when Moab schedules jobs.

Quality of Service, or QOS, is the mechanism CHPC uses to assign policies to users on its machines. All users receive a predetermined QOS number, which is assigned behind the scenes. QOS is divided into groups:

  • QOS 0: Out of Allocations
  • QOS 1: Normal Run
  • High Priority Run
    • QOS 10: Voth Users
    • QOS 20: Schuster Users
    • QOS 30: Simmons Users
    • QOS 40: Steenburgh Users

Unless you are a user from one of these specific groups, you should always run your job under a QOS of 1. It must be emphasized that users carrying one of the high-priority QOS values cannot use those QOS numbers when trying to run on a machine other than their own group's. If you are a user of Voth's nodes, a QOS of 10 will only get you a job on the Voth nodes. If you are a user with a QOS of 10 and are trying to get a job on any of the Simmons nodes, you will remain in the queue. Never use a QOS of 0 unless you are totally out of allocation time. If a user still has time left in their allocation and specifies QOS 0, the job's run time is still deducted from the bank.
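
As a rough sketch of how a QOS might be requested in a PBS batch script, the lines below use the qos resource keyword; this exact keyword is an assumption here, so check the User's Guide for your cluster before relying on it:

#PBS -l nodes=2,walltime=1:00:00   # resource request for the job
#PBS -l qos=1                      # assumed syntax for requesting the normal-run QOS (QOS 1)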

For more information relating to the policies associated with each QOS, refer to the IA-32 Cluster (icebox) User Guide.

Gold is a CPU allocation bank which uses a relational database to store and retrieve prior transaction history and the current state of the bank. Gold works with Moab to control and manage CPU resources allocated to projects and users. After scheduling a job, Moab informs Gold about the job that was scheduled. Gold records that information in the allocation records, and the job is then run. Gold is set up so it is able to provide feedback to users and administrators about usage time and account balances.
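
Users can typically query Gold for the current state of their allocation. The sketch below uses the gbalance command that ships with Gold; the exact options and any CHPC wrapper paths are assumptions, so consult CHPC documentation for the supported form:

$ gbalance -u u0123456    # remaining allocation for a hypothetical user
$ gbalance -p myproject   # remaining allocation for a hypothetical project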

CHPC has now implemented Moab on Sierra and ICE Box. Each node, or set of nodes, carries specific policies which affect users running jobs. The individual nodes also carry different hardware specifications, which can affect a job's speed and the size of job it can handle. The Local Updraft Cluster (updraft) Configuration and Local Arches Clusters Configuration pages can be helpful when running commands such as "showbf", so that you know which machines you intend to run your job on.

Last Modified: February 26, 2009 @ 10:47:44