CHPC Software: Maui Scheduler
Updated: July 15, 2004
The Maui Scheduler is a software tool designed to allow flexible and near-optimal scheduling in a multi-resource high-performance computing environment. Jobs are screened through a large number of administrator-configurable fairness policies and then prioritized based on factors such as job queue time, job expansion factor, and user CPU usage history. The Maui Scheduler maximizes control of system resources by allowing certain nodes to be reserved for specific users, groups, accounts, or projects and minimizes the amount of wasted computer resources by permitting anticipated downtime to be scheduled.
The Maui Scheduler is also an analysis and research tool. It collects a large number of statistics and can analyze them to determine the effectiveness of the scheduling approach in use and the utilization of the available resources.
Maui Scheduler was designed to provide the information and control needed to efficiently manage large systems. It operates in an "information-rich" environment that gives administrators and users the support they need. Maui Scheduler supplies system administrators with the parameters necessary to make well-founded system and job management decisions. Its classless, single queue environment allows any job to run on any set of nodes whose configuration meets its requirements. The Maui Scheduler provides an extensive set of dynamically reconfigurable parameters to control job queues and various aspects of the running workload.
Maui is part of the batch system at CHPC. It works in conjunction with PBS (Portable Batch System). For details on using batch on a particular CHPC platform, please refer to the User's Guides.
The Maui Scheduler has approximately 25 commands that allow system administrators to obtain the information needed to solve specific problems, make specific decisions, and fine-tune parameters to increase utilization and throughput. The following commands are intended for users who want to make use of the Maui Scheduler's functionality:
Currently Maui Scheduler is installed only on the Sierra and ICE Box clusters. These commands are located in "/uufs/sierra/sys/bin" for Sierra and "/uufs/icebox/sys/bin" for ICE Box.
showq
Usage
showq [ -r | -i ]
Purpose
showq is closely related to "qstat" and is the recommended command of the two. Running it displays active (running) jobs, idle jobs, and non-queued jobs.
$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
3994.icebox.icebox detar Running 32 0:56:57 Mon Jun 12 15:56:10
3996.icebox.icebox chmmbr Running 30 1:02:48:34 Tue Jun 13 05:47:47
3944.icebox.icebox chmriu Running 8 2:01:30:09 Mon Jun 12 10:29:22
3945.icebox.icebox peutch Running 4 2:01:31:19 Mon Jun 12 10:30:32
3952.icebox.icebox chmale Running 1 2:01:56:02 Mon Jun 12 11:55:15
3953.icebox.icebox chmale Running 1 2:01:56:03 Mon Jun 12 11:55:16
3954.icebox.icebox chmale Running 1 2:01:56:07 Mon Jun 12 11:55:20
3957.icebox.icebox chmale Running 1 2:01:56:10 Mon Jun 12 11:55:23
3956.icebox.icebox chmale Running 1 2:01:56:11 Mon Jun 12 11:55:24
3955.icebox.icebox chmale Running 1 2:01:56:12 Mon Jun 12 11:55:25
3966.icebox.icebox chmale Running 1 2:01:56:12 Mon Jun 12 11:55:25
3960.icebox.icebox chmale Running 1 2:01:56:14 Mon Jun 12 11:55:27
3961.icebox.icebox chmale Running 1 2:01:56:17 Mon Jun 12 11:55:30
38 Active Jobs 128 of 178 Processors Active (71.91%)
132 of 158 Nodes Active (83.54%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
4003.icebox.icebox peutch Idle 4 3:00:00:00 Tue Jun 13 08:45:02
1 Idle Job
NON-QUEUED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
3995.icebox.icebox chmmbr Idle 20 1:06:00:00 Mon Jun 12 16:13:47
Total Jobs: 40 Active Jobs: 38 Idle Jobs: 1 Non-Queued Jobs: 1
$
Statistics are not accurate.
Active jobs are those that are running or starting and consuming CPU resources. Displayed are the job name, the job's owner, and the job state. Also displayed are the number of processors allocated to the job, the amount of time remaining until the job completes (given in DD:HH:MM:SS notation, with the days field omitted when it is zero), and the time the job started. All active jobs are sorted in "Earliest Completion Time First" order.
Idle jobs are those that are queued and eligible to be scheduled. They are all in the Idle job state and do not violate any fairness policies or have any job holds in place. The Idle Jobs section displays the same information as the Active Jobs section, except that the wall clock limit (WCLIMIT) is shown rather than job time REMAINING, and job QUEUETIME is shown rather than job STARTTIME. The jobs in this section are ordered by job priority. Jobs in this queue are considered eligible for both scheduling and backfilling.
Non-Queued jobs are those that are ineligible to be run or queued. Jobs listed here could be in a number of states for the following reasons:
| Job State | Reason |
|---|---|
| Idle | Job violates a fairness policy. |
| UserHold | PBS User Hold is in place. |
| SystemHold | PBS System Hold is in place. |
| BatchHold | A Maui Scheduler Batch Hold is in place (used when the job cannot be run because the requested resources are not available in the system or because PBS has repeatedly failed in attempts to start the job). |
| Deferred | A Maui Scheduler Defer Hold is in place (a temporary hold used when a job has been unable to start after a specified number of attempts. This hold is automatically removed after a short period of time). |
| NotQueued | Job is in the PBS state NQ (indicating the job's controlling scheduling daemon is unavailable). |
A summary of the job status is provided at the end of the output. The fields in the output are as follows:
| Field | Description |
|---|---|
| Jobname | Name of the job that is running or waiting to run. |
| Username | Name of the user whose job is running or idle. |
| State | State of the job. Either "Running" or "Idle". |
| Proc | Number of processors being used or number of processors being requested. |
| Remaining | Time left until the job reaches its wall clock limit. Time specified in DD:HH:MM:SS notation. The remaining time displayed may not always equal the actual job time remaining, because it is based on the user's wall clock limit. |
| WCLimit | Wall clock limit specified for job. Time specified in DD:HH:MM:SS notation. |
| Starttime | Date and time when job started. |
| Queuetime | Date and time job entered in database. |
An asterisk at the end of the job name indicates that the job has a reservation, so the job cannot be preempted under most circumstances.
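The REMAINING and WCLIMIT columns can be converted to seconds for comparison or sorting. The following is a minimal sketch, assuming the days field is simply omitted when it is zero (as in the sample output above); the function name `parse_maui_time` is hypothetical and not part of Maui.

```python
def parse_maui_time(text: str) -> int:
    """Convert a [DD:]HH:MM:SS string, as printed by showq, into seconds."""
    parts = [int(p) for p in text.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected time format: {text!r}")
    # Pad on the left so that missing leading fields (days, hours) count as zero.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(parse_maui_time("0:56:57"))     # REMAINING of job 3994 -> 3417
print(parse_maui_time("3:00:00:00"))  # WCLIMIT of job 4003  -> 259200
```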
showbf
Usage
showbf [ -A ] | [ -a ACCOUNT ] | [ -g GROUP ] | [ -u USER ] |
       [ -m '[ MEMCMP ] MEMORY' ] | [ -n NODECOUNT ] | [ -d HH:MM:SS ] |
       [ -f FEATURE ] | [ -q QOS ] | [ -c CLASS ]   (-c is NOT IN USE)
Purpose
showbf is a tool used with Maui Scheduler to simulate a specific job against the current job database. The showbf command treats the request just as if a user were submitting an actual job. CHPC encourages users to try showbf before submitting their jobs with qsub. The tool asks the job database what is running and how many nodes are open. Depending on the preset QOS and priorities, the job database reports how likely it is that the job can run.
This command can be used by any user to find out how many nodes are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. The key to this tool is that users must specify all the known variables when using showbf in order to receive useful information. For example, let's say user shaq in group laker wants to submit a job that requires 8 nodes, at least 128 MB of RAM, a duration of 1 hour, and no partition specification. showbf would then query the system and print:
$ showbf -u shaq -g laker -n 8 -m '>=128' -d 1:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun 7 16:06:34
42 procs available with no timelimit
$
If shaq instead wanted to run a job with exactly 256 MB of RAM on 3 nodes for a duration of 1 hour, he would type:
$ showbf -n 3 -m '=256' -d 01:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun 7 16:14:12
3 procs available with no timelimit
$
Parameters and Arguments
| Parameter/Argument | Description |
|---|---|
| -A | Show backfill information for all users, groups, and accounts. |
| -a | Show backfill information only for specified accounts. |
| -g | Show backfill information only for specified group. |
| -u | Show backfill information only for specified user. |
| -m | Allows the user to specify the memory requirements for the backfill nodes of interest. It is important to note that if the optional MEMCMP and MEMORY parameters are used, they MUST be enclosed in single quotes ('). For example, enter showbf -m '==256' to request nodes with exactly 256 MB memory. Valid signs used with MEMCMP (memory comparison) are >, >=, ==, <=, and <. |
| -n | Show backfill information for a specified number of nodes. |
| -d | Show backfill information for a specified duration. The number specified with the duration must have a preceding plus (+) sign. Specified in DD:HH:MM:SS notation, indicating days:hours:minutes:seconds. |
| -f | Show backfill information only for nodes which possess a specified feature, such as processor speed. Processor speed is indicated by 's950' for a processor that is running at 950 MHz. |
| -q | Show backfill information for nodes which can be accessed by a job with a certain QOS. QOS values are specific to users; if you do not know what your QOS is, do not use this option. |
| -c | (NOT IN USE) Show backfill information for nodes which support the class feature. |
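To make the MEMCMP comparison concrete, the sketch below applies a memory specification such as '>=128' to a list of nodes. It only illustrates the comparison semantics and is not how showbf is implemented. The node names and memory sizes are hypothetical, and treating a bare number as ">=" is an assumption, since this page does not state the default.

```python
import operator
import re

# Comparison signs listed for MEMCMP in the table above.
MEMCMP_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
}


def parse_memory_spec(spec: str):
    """Split a spec such as '>=128' into (comparison function, megabytes)."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)?\s*(\d+)\s*", spec)
    if match is None:
        raise ValueError(f"bad memory spec: {spec!r}")
    sign, amount = match.groups()
    # Assumption: a bare number means ">="; the page does not document a default.
    return MEMCMP_OPS[sign or ">="], int(amount)


def matching_nodes(node_memory_mb: dict, spec: str) -> list:
    """Return the sorted names of nodes whose memory satisfies the spec."""
    compare, amount = parse_memory_spec(spec)
    return sorted(name for name, mem in node_memory_mb.items() if compare(mem, amount))


nodes = {"node01": 128, "node02": 256, "node03": 512}  # hypothetical inventory
print(matching_nodes(nodes, ">=256"))  # ['node02', 'node03']
print(matching_nodes(nodes, "==256"))  # ['node02']
```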
showstart
Usage
showstart [ JOBNAME ]
Purpose
The showstart command shows the earliest time a specified job can start, taking into account the requested resources, system downtime, reservations, and so on. The value given ignores jobs with higher priorities that do not have reservations. This command gives the best estimate for the job's start time if this job were next in priority behind the jobs which currently hold reservations. To illustrate, suppose a user wants to see when job "4010.icebox.icebox" is expected to start:
$ showstart 4010.icebox.icebox
job 4010.icebox.icebox requires 18 nodes for 2:00:00:00
Earliest start is in 1:01:03:45 on Wed Jun 14 11:48:09
Earliest Completion is in 3:01:03:45 on Fri Jun 16 11:48:09
Best Partition: DEFAULT
$
The output of this command tells the user how many nodes the job requires and what its WCLimit is. In addition, showstart reports the earliest start and completion times for the job. Keep in mind that showstart analyzes the job only as if it were the next job in the queue. Ignore the partition output, because partitions are irrelevant on the CHPC systems.
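The two estimates in the showstart output are related by the wall clock limit: the earliest completion is the earliest start plus the requested duration. A minimal sketch of that arithmetic, using the values from the example above (the helper names are hypothetical):

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse a [DD:]HH:MM:SS string into a timedelta."""
    parts = [int(p) for p in text.split(":")]
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta back into D:HH:MM:SS notation."""
    total = int(delta.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}:{hours:02d}:{minutes:02d}:{seconds:02d}"


wait = parse_duration("1:01:03:45")     # "Earliest start is in ..."
wclimit = parse_duration("2:00:00:00")  # "requires 18 nodes for ..."
print(format_duration(wait + wclimit))  # 3:01:03:45, the earliest completion
```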
Fair Share policies used by the Maui Scheduler specify the percentage of CPU cycles that particular groups and users are targeted to receive. When a user's usage is below their target percentage, fair share increases their priority. When a user's usage is above their target percentage, their priority in the queue decreases.
The concept of priority in fair share can be illustrated with a metaphor. Suppose you have a dart board whose rings represent allocations of CPU cycles. Your goal is to hit the middle ring; however, depending on how well you throw your dart, you may end up hitting the outer ring or the inner circle. Hitting the outer ring is like using more CPU cycles than your target; hitting the inner circle is like using fewer than your allocated CPU cycles.
If your dart lands right on the middle ring, you have used exactly the amount of CPU cycles you were allocated. If you hit the inner circle, it is like getting another turn: in the scheduler's terms, your jobs receive a higher priority. If you hit the outer ring, your jobs receive a lower priority.
Users can also belong to groups that have their own targets. For example, user malone has a fair share target of 20% and user stockton has a fair share target of 7.5%. Both users are in the group jazz, so their usage also counts toward the group's target.
Fair Share would see this:
- User malone: 20%
- User stockton: 7.5%
- Group jazz: 30%
The targets of malone and stockton sum to 27.5%, but the target for the jazz group is 30%. This means that, as long as the group collectively has not used its target percentage, either the group's priority will rise or either user can use 2.5% more CPU cycles and remain at regular priority.
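The fair share rule and the group headroom in this example can be sketched as follows. This is only an illustration of the comparison described above, not Maui's actual priority calculation; the function name is hypothetical and the real scheduler weighs several other factors.

```python
def fairshare_direction(usage_pct: float, target_pct: float) -> str:
    """Say which way fair share pushes priority for a given usage and target."""
    if usage_pct < target_pct:
        return "raise"      # under target: priority increases
    if usage_pct > target_pct:
        return "lower"      # over target: priority decreases
    return "unchanged"


user_targets = {"malone": 20.0, "stockton": 7.5}
group_target = 30.0  # target for group jazz

# Headroom left in the group target beyond the sum of the user targets.
headroom = group_target - sum(user_targets.values())
print(headroom)                         # 2.5
print(fairshare_direction(15.0, 20.0))  # raise
print(fairshare_direction(25.0, 20.0))  # lower
```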
The use of reservations allows a job to maintain a guaranteed start time. Reservations still let other jobs use the reserved nodes as long as they do not delay the reserved job's start time. The Maui backfill selects the best combination of jobs and places them in the schedule without delaying any job already scheduled. The best schedule placement, which Maui chooses, varies from workload to workload and can only be determined through simulation.
Maui Scheduler has the capability to schedule jobs on different partitions; a partition is defined as a group of nodes with a common high performance switch, a common file space, and a common user space. Users do not specify partitions, but they are considered when Maui schedules the jobs.
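The condition under which a job can use reserved nodes can be sketched as a simple check. This is a deliberately simplified model, assuming a single upcoming reservation that will need all currently idle nodes; Maui's real backfill considers many jobs and reservations at once. The function and parameter names are hypothetical.

```python
def can_backfill(job_nodes: int, job_seconds: int,
                 idle_nodes: int, seconds_until_reservation: int) -> bool:
    """True if the job fits on idle nodes and ends before the reservation starts."""
    fits_on_idle_nodes = job_nodes <= idle_nodes
    ends_before_reservation = job_seconds <= seconds_until_reservation
    return fits_on_idle_nodes and ends_before_reservation


# 8 nodes are idle and the reserved job starts in 2 hours.
print(can_backfill(4, 3600, 8, 7200))    # True: a 1-hour job finishes in time
print(can_backfill(4, 10800, 8, 7200))   # False: a 3-hour job would delay the reservation
print(can_backfill(16, 3600, 8, 7200))   # False: not enough idle nodes
```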
Quality of Service (QOS) is the mechanism CHPC uses to apply specific policies to users on its machines. Every user is assigned a QOS number, which is done behind the scenes. QOS values are divided into groups:
- QOS 0: Out of Allocations
- QOS 1: Normal Run
- High Priority Run:
  - QOS 10: Voth Users
  - QOS 20: Schuster Users
  - QOS 30: Simmons Users
  - QOS 40: Steenburgh Users
Unless you are a user from a specific group, you should always run your job under a QOS of 1. Note that users with one of the high priority QOS values cannot use that QOS when running on nodes other than their own group's. If you are a user from Voth's nodes, the QOS of 10 will only get you a job on the Voth nodes. If you have a QOS of 10 and try to run a job on any of the Simmons nodes, the job will remain in the queue. Never use the QOS of 0 unless you are completely out of allocation time. If a user still has time left in their allocation and specifies QOS 0, the job's run time is still deducted from the bank.
For more information on the policies associated with each QOS, see the IA-32 Cluster (icebox) User Guide.
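The rule for high priority QOS values can be sketched as a lookup. This only restates the rule above and is not a CHPC or Maui interface; the lowercase owner labels and the function name are hypothetical, and the sketch deliberately says nothing about QOS 0 or 1, whose node access is not described here.

```python
# High priority QOS numbers and the group whose nodes they apply to (from the list above).
HIGH_PRIORITY_QOS = {10: "voth", 20: "schuster", 30: "simmons", 40: "steenburgh"}


def high_priority_qos_usable(qos: int, node_owner: str) -> bool:
    """True if a high priority QOS can start jobs on nodes owned by node_owner."""
    if qos not in HIGH_PRIORITY_QOS:
        raise ValueError(f"QOS {qos} is not a high priority QOS")
    return HIGH_PRIORITY_QOS[qos] == node_owner


print(high_priority_qos_usable(10, "voth"))     # True: runs on the Voth nodes
print(high_priority_qos_usable(10, "simmons"))  # False: job remains in the queue
```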
Qbank is a CPU allocations bank that uses a relational database to store and retrieve the transaction history and current state of the bank. Qbank works with Maui to control and manage CPU resources allocated to projects or users. After scheduling a job, Maui informs Qbank about it. Qbank puts that information into the allocation records and then runs the specified job. Qbank is set up to provide feedback to users and administrators about usage time and account balances.
CHPC has now implemented Maui on Sierra and ICE Box. Each node, or set of nodes, carries specific policies that affect users when running jobs. The individual nodes also have different hardware specifications, which can affect a job's speed and the size of job they can support. The Local Opteron Cluster (sierra) Configuration, Local Compaq Sierra Configuration, and Local IA-32 Cluster (icebox) Configuration pages can be helpful when running commands such as "showbf", because they describe the machines on which you intend to run your job.

