CHPC Software: Moab Scheduler
Moab Scheduler was designed to provide the information and control needed to efficiently manage large systems. It operates in an "information-rich" environment, giving the support required by administrators and needed by users. Moab Scheduler supplies system administrators with the parameters necessary to make well-founded system and job management decisions. Its classless, single-queue environment allows any job to run on any set of nodes whose configuration meets its requirements. Moab Scheduler also provides an extensive set of dynamically reconfigurable parameters to control job queues and various aspects of the running workload.
Moab is part of the batch system at CHPC. It works in conjunction with PBS (Portable Batch System). For details on using batch on a particular CHPC platform, please refer to the platform User's Guides.
The Moab Scheduler has approximately 25 commands that allow system administrators to obtain the information needed to solve specific problems, make specific decisions, and fine-tune parameters to increase utilization and throughput. The following commands are designed to let users take advantage of the Moab Scheduler's functionality:
showq [ -r | -i ]
showq is the Moab counterpart to the PBS qstat command, and is the recommended way to view the queue. Running this command displays active (running), idle (eligible), and non-queued (blocked) jobs.
[u0108240@updraft1:~]$ /uufs/tunnelarch.arches/sys/bin/showq

active jobs------------------------
JOBID  USERNAME STATE   PROCS  REMAINING            STARTTIME
64661  u0320277 Running     4     2:10:53  Sat Feb 21 12:55:54
64662  u0320277 Running     4     8:01:01  Sat Feb 21 18:46:02
64666  u0320277 Running     2    10:13:51  Sat Feb 21 20:58:52
64668  u0320277 Running     4  1:10:42:21  Sun Feb 22 21:27:22
64670  u0320277 Running     2  1:11:08:40  Sun Feb 22 21:53:41
64671  u0320277 Running     2  1:11:08:52  Sun Feb 22 21:53:53
64717  u0180209 Running    22  1:15:05:14  Thu Feb 26 01:50:15
64718  u0180209 Running    22  1:15:05:17  Thu Feb 26 01:50:18
64719  u0180209 Running    22  1:19:46:53  Thu Feb 26 06:31:54
64720  u0180209 Running    22  1:21:19:29  Thu Feb 26 08:04:30
64672  u0320277 Running     2  1:23:08:22  Mon Feb 23 09:53:23
64673  u0320277 Running     2  1:23:08:22  Mon Feb 23 09:53:23
64689  u0320277 Running     2  3:00:19:14  Tue Feb 24 11:04:15
64697  u0399234 Running     2  3:05:12:30  Tue Feb 24 15:57:31
64698  u0399234 Running     2  3:05:21:06  Tue Feb 24 16:06:07
64699  u0399234 Running     2  3:05:25:05  Tue Feb 24 16:10:06
64710  u0320277 Running     2  3:09:16:23  Tue Feb 24 20:01:24
64711  u0320277 Running     2  3:09:18:25  Tue Feb 24 20:03:26
64701  u0399234 Running     2  4:15:05:17  Thu Feb 26 01:50:18

19 active jobs   124 of 124 processors in use by local jobs (100.00%)
                 62 of 62 nodes active (100.00%)

eligible jobs----------------------
JOBID  USERNAME STATE   PROCS  WCLIMIT              QUEUETIME
46389  u0646299 Idle        1    20:00:00  Tue Dec  2 21:34:55
64727  u0320277 Idle        4  5:00:00:00  Wed Feb 25 09:18:05
64728  u0320277 Idle        4  5:00:00:00  Wed Feb 25 09:18:05
64729  u0320277 Idle        2  5:00:00:00  Wed Feb 25 09:18:16
64730  u0320277 Idle        2  5:00:00:00  Wed Feb 25 09:18:16
64754  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:28
64755  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:28
64756  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29
64731  u0399234 Idle        2  5:00:00:00  Wed Feb 25 11:35:11
64732  u0399234 Idle        2  5:00:00:00  Wed Feb 25 11:37:18
64733  u0033399 Idle       50  5:00:00:00  Wed Feb 25 12:25:26
64757  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29
64758  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29
64694  u0530689 Idle        4  5:00:00:00  Mon Feb 23 14:48:43
64695  u0530689 Idle        4  5:00:00:00  Mon Feb 23 15:00:24
64704  u0530689 Idle        4  5:00:00:00  Mon Feb 23 16:08:41
64705  u0530689 Idle        4  5:00:00:00  Mon Feb 23 16:11:29

17 eligible jobs

blocked jobs-----------------------
JOBID  USERNAME STATE   PROCS  WCLIMIT              QUEUETIME
64759  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29
64760  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29
64761  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29
64762  u0565819 Idle        2    23:59:00  Wed Feb 25 13:49:29

4 blocked jobs

Total jobs: 40
Note: the statistics in the sample output above are not current.
Active jobs are those that are running or starting and consuming CPU resources. Displayed are the job ID, the job's owner, and the job state. Also displayed are the number of processors allocated to the job, the amount of time remaining until the job completes (given in [DD:]HH:MM:SS notation), and the time the job started. All active jobs are sorted in "Earliest Completion Time First" order.
Idle jobs are those that are queued and eligible to be scheduled. They are all in the Idle job state, do not violate any fairness policies, and do not have any job holds in place. The jobs in the Idle section display the same information as the Active Jobs section, except that the wall clock limit (WCLIMIT) is shown rather than the time REMAINING, and the job QUEUETIME is displayed rather than the job STARTTIME. The jobs in this section are ordered by job priority. Jobs in this queue are considered eligible for both scheduling and backfilling.
Non-Queued jobs are those that are ineligible to be run or queued. Jobs listed here could be in a number of states for the following reasons:
|Idle||Job violates a fairness policy.|
|UserHold||PBS User Hold is in place.|
|SystemHold||PBS System Hold is in place.|
|BatchHold||A Moab Scheduler Batch Hold is in place (used when the job cannot be run because the requested resources are not available in the system or because PBS has repeatedly failed in attempts to start the job).|
|Deferred||A Moab Scheduler Defer Hold is in place (a temporary hold used when a job has been unable to start after a specified number of attempts. This hold is automatically removed after a short period of time).|
A summary of the job status is provided at the end of the output. The fields in the output are as follows:
|Jobname||Name of the job that has been submitted or is waiting to run.|
|Username||Name of the user whose job is running or idle.|
|State||State of Job. Either "Running" or "Idle".|
|Proc||Number of processors being used or number of processors being requested.|
|Remaining||Time the job has until it reaches its wall clock limit, specified in DD:HH:MM:SS notation. The remaining time displayed may not always equal the actual job time remaining; the displayed time is based on the user's wall clock limit.|
|WCLimit||Wall clock limit specified for job. Time specified in DD:HH:MM:SS notation.|
|Starttime||Date and time when job started.|
|Queuetime||Date and time the job entered the queue.|
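The [DD:]HH:MM:SS notation used in the REMAINING and WCLIMIT columns can be converted to seconds for scripting against showq output. The following is a small Python sketch; the helper name is ours and is not part of Moab:

```python
# Hypothetical helper (not part of Moab): convert the [DD:]HH:MM:SS
# notation used in showq's REMAINING and WCLIMIT columns into seconds.
def to_seconds(stamp):
    # Split on ':' and weight the fields from least significant (seconds)
    # to most significant (days); the days field is optional.
    parts = [int(p) for p in stamp.split(":")]
    weights = [1, 60, 3600, 86400]
    return sum(p * w for p, w in zip(reversed(parts), weights))

print(to_seconds("2:10:53"))     # HH:MM:SS form, e.g. job 64661's REMAINING
print(to_seconds("1:10:42:21"))  # DD:HH:MM:SS form, e.g. job 64668's REMAINING
```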
An asterisk at the end of the job name indicates that the job has a reservation, thus the job cannot be preempted under most circumstances.
showbf [ ARGUMENTS ]
showbf is a Moab Scheduler tool that simulates a specific job against the immediate job queue, just as if the user had submitted an actual job. CHPC encourages users to try showbf before submitting their jobs with qsub. The tool queries the job database to find out what is running and how many nodes are open; depending on the preset QOS and priorities, the job database sends back information telling the user how likely the job is to run.
This command can be used by any user to find out how many nodes are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. The key to this tool is that users must specify all the known variables in order to receive valuable information from showbf. For example, suppose user shaq in group laker wants to submit a job that requires 8 nodes, at least 128 MB of RAM, a duration of 1 hour, and no partition preference. showbf would then query the system and print:
$ showbf -u shaq -g laker -n 8 -m '>=128' -d 1:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun 7 16:06:34

42 procs available with no timelimit
$
If shaq instead wanted exactly 256 MB of RAM, 3 nodes, and a duration of 1 hour, he would simply type and view:
$ showbf -n 3 -m '=256' -d 01:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun 7 16:14:12

3 procs available with no timelimit
$
Parameters and Arguments
|-A||Show backfill information for all users, groups, and accounts.|
|-a||Show backfill information only for specified accounts.|
|-g||Show backfill information only for specified group.|
|-u||Show backfill information only for specified user.|
|-m||Allows the user to specify the memory requirements for the backfill nodes of interest. It is important to note that if the optional MEMCMP and MEMORY parameters are used, they MUST be enclosed in single quotes ('). For example, enter showbf -m '==256' to request nodes with exactly 256 MB memory. Valid signs used with MEMCMP (memory comparison) are >, >=, ==, <=, and <.|
|-n||Show backfill information for a specified number of nodes.|
|-d||Show backfill information for a specified duration. The number specified with the duration must have a preceding plus (+) sign. Specified in DD:HH:MM:SS notation, indicating days:hours:minutes:seconds.|
|-f||Show backfill information only for nodes which possess a specified feature, such as processor speed. Processor speed is indicated by 's950' for a processor that is running at 950 MHz.|
|-q||Show backfill information for nodes which can be accessed by a job with a certain QOS. QOS values are specific to users; if you do not know your specific QOS, do not use this option.|
|-c||(NOT IN USE) Show backfill information for nodes which support the class feature.|
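The MEMCMP comparison accepted by -m can be illustrated with a short Python sketch. This is our own illustration of how such an expression could be evaluated against a node's memory, not Moab's actual source:

```python
import operator
import re

# Map each MEMCMP sign listed above to a comparison function.
OPS = {">": operator.gt, ">=": operator.ge, "==": operator.eq,
       "<=": operator.le, "<": operator.lt}

def memory_matches(memcmp_expr, node_mb):
    """Return True if a node with node_mb MB of memory satisfies the
    MEMCMP expression, e.g. '>=128' as passed to `showbf -m '>=128'`."""
    # Longer operators (>=, <=, ==) must be tried before > and <.
    m = re.match(r"(>=|<=|==|>|<)(\d+)$", memcmp_expr)
    if not m:
        raise ValueError("bad MEMCMP expression: " + memcmp_expr)
    op, value = OPS[m.group(1)], int(m.group(2))
    return op(node_mb, value)

print(memory_matches(">=128", 256))  # True: a 256 MB node satisfies '>=128'
print(memory_matches("==256", 128))  # False: not exactly 256 MB
```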
showstart [ JOBNAME ]
The showstart command shows the earliest time a specified job can start, taking into account the requested resources, system downtime, reservations, and so on. The value given ignores jobs with higher priorities that do not have reservations. This command gives the best estimate for the job's start time if this job were next in priority behind the jobs which currently hold reservations. For example, suppose a user wanted to see when job "4010.icebox.icebox" was expected to run:
$ showstart 4010.icebox.icebox
job 4010.icebox.icebox requires 18 nodes for 2:00:00:00
Earliest start is in      1:01:03:45 on Wed Jun 14 11:48:09
Earliest Completion is in 3:01:03:45 on Fri Jun 16 11:48:09
Best Partition: DEFAULT
$
The output of this command informs the user how many nodes the job requires and what its WCLimit is. In addition, showstart reports the earliest start and completion times for the job. The user must keep in mind that showstart is only analyzing the job as if it were the next job in the queue. Ignore the partition output; partitions are irrelevant on the CHPC systems.
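The relationship between the two estimates is simple arithmetic: the earliest completion time is the earliest start time plus the job's wall clock limit. This Python sketch verifies that against the values in the example output above:

```python
def to_seconds(stamp):
    # Convert [DD:]HH:MM:SS notation to seconds.
    parts = [int(p) for p in stamp.split(":")]
    return sum(p * w for p, w in zip(reversed(parts), [1, 60, 3600, 86400]))

start = to_seconds("1:01:03:45")       # earliest start, relative to now
wclimit = to_seconds("2:00:00:00")     # the job's requested wall clock limit
completion = to_seconds("3:01:03:45")  # earliest completion, relative to now

print(start + wclimit == completion)   # True
```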
Fair Share policies used by the Moab Scheduler indicate the percentages of CPU cycles available to particular groups and users. Fair Share is set up so that users and groups are given target percentages of the CPU cycles they may consume. When a user's recent usage is lower than their target percentage, fair share increases their priority; if their usage is over the target, their priority in the queue decreases.
For lack of a better example, the concept of priority in fair share can be displayed using a metaphor. Suppose you have a dart board. On the dart board are rings which represent allocations of CPU cycles. Your goal is to hit the middle ring; however, depending on how well you throw your dart, you may end up hitting the outer ring or the inner circle. If you hit the outer ring, it is similar to using more CPU cycles than your target; hitting the inner circle expresses using less than your allocated CPU cycles.
If your dart hits right on the middle ring, you have used exactly the amount of CPU cycles you were allocated. If you hit the inner circle, it is like getting another chance to throw, or in the scheduler's case a higher priority on your jobs. If you hit the outer ring, in Moab Scheduler's terms you will get a lower priority on your jobs.
Users, in some cases, can be part of certain groups. For example, user malone has a fair share target of 20% and user stockton has a fair share target of 7.5%. Both of these users are in the group jazz; therefore, the users' target percentages are directly related to the group's target.
Fair Share would see this:
- User malone: 20%
- User stockton: 7.5%
- Group jazz: 30%
The targets of malone and stockton sum to 27.5%, but the target for the jazz group as a whole is 30%. This means that as long as the group collectively has not used its target percentage, either the group's priority will rise, or either user can consume up to 2.5% more CPU cycles and remain at regular priority.
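The priority effect described above can be sketched in a few lines of Python. The adjustment formula here is our own simplification for illustration, not Moab's actual fair share calculation; the targets come from the malone/stockton example:

```python
# Illustrative sketch of fair share: usage below target raises priority,
# usage above target lowers it. The linear formula is an assumption.
def fairshare_adjustment(target_pct, used_pct, weight=100):
    # Positive result -> priority boost; negative -> priority penalty.
    return weight * (target_pct - used_pct) / 100.0

# malone's target is 20%; suppose he has recently used only 12% of cycles.
print(fairshare_adjustment(20.0, 12.0) > 0)   # True: priority rises
# stockton's target is 7.5%; suppose he has used 10%.
print(fairshare_adjustment(7.5, 10.0) < 0)    # True: priority falls
```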
The use of reservations allows a job to maintain a guaranteed start time. Reservations enable other jobs to use these nodes as long as they do not delay this job's start time. The Moab backfill selects the best combinations of jobs in the schedule and places them without delaying any jobs already scheduled. The best schedule placement, which Moab chooses, varies from workload to workload and can only be determined through simulation.
Moab Scheduler has the capability to schedule jobs on different partitions; a partition is defined as a group of nodes with a common high performance switch, a common file space, and a common user space. Users do not specify partitions, but they are considered when Moab schedules jobs.
Quality of Service or QOS is the resource CHPC uses to define users certain policies on machines. All users receive a predetermined number for QOS which is done behind the scenes. QOS is divided into groups:
- QOS 0: Out of Allocations
- QOS 1: Normal Run
- High Priority Run
- QOS 10: Voth Users
- QOS 20: Schuster Users
- QOS 30: Simmons Users
- QOS 40: Steenburgh Users
Unless you are a user from a specific group, you should always run your job under a QOS of 1. It must be emphasized that users carrying one of the high priority QOS values cannot use those QOS numbers when trying to run on a machine other than their own group's. If you are a user from Voth's nodes, the QOS of 10 will only get you a job on the Voth nodes; if you try to run under any of the Simmons nodes with a QOS of 10, your job will remain in the queue. Never use the QOS of 0 unless you are totally out of allocation time: if a user still has time left in their allocation but specifies QOS 0, the job's run time is still deducted from the bank.
For more information on the policies associated with each QOS, see the IA-32 Cluster (icebox) User Guide.
Gold is a CPU allocation bank which uses a relational database to store and retrieve the prior transaction history and current state of the bank. Gold works with Moab to control and manage CPU resources allocated to projects or users. After scheduling a job, Moab informs Gold about the job; Gold records that information in the allocation records, and the job then runs. Gold is set up to provide feedback to users and administrators about usage time and account balances.
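The bookkeeping Gold performs can be pictured with a small Python sketch. This is a hypothetical model of the debit-and-history idea described above, not Gold's real schema or API, and the class and method names are ours:

```python
# Hypothetical model of an allocation bank: each scheduled job debits
# processor-seconds from the balance, and every transaction is recorded
# so usage and balances can be reported back to users and administrators.
class AllocationBank:
    def __init__(self, balance_proc_seconds):
        self.balance = balance_proc_seconds
        self.history = []  # list of (job_id, cost) transactions

    def charge(self, job_id, procs, seconds):
        # Cost is processors multiplied by wall clock seconds used.
        cost = procs * seconds
        self.balance -= cost
        self.history.append((job_id, cost))
        return self.balance

bank = AllocationBank(1_000_000)
bank.charge("64661", procs=4, seconds=3600)  # a 4-processor, 1-hour job
print(bank.balance)  # 985600
```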
CHPC has now implemented Moab on Sierra and ICE Box. Each node, or set of nodes, carries specific policies which affect users running jobs. The individual nodes also have different hardware specifications which can affect a job's speed and scalability. The Local Updraft Cluster (updraft) Configuration and Local Arches Clusters Configuration pages can be helpful when running commands such as "showbf", so that you know which machines you intend to run your job on.