CHPC Software: Maui Scheduler

Updated: July 15, 2004

The Maui Scheduler is a software tool designed to allow flexible and near-optimal scheduling in a multi-resource high-performance computing environment. Jobs are screened through a large number of administrator-configurable fairness policies and then prioritized based on factors such as job queue time, job expansion factor, and user CPU usage history. The Maui Scheduler maximizes control of system resources by allowing certain nodes to be reserved for specific users, groups, accounts, or projects and minimizes the amount of wasted computer resources by permitting anticipated downtime to be scheduled.

The Maui Scheduler is also an analysis and research tool. It collects a large number of statistics and can analyze this information to determine the effectiveness of the scheduling approach in use and the utilization of the available resources.

The Maui Scheduler was designed to provide the information and control needed to efficiently manage large systems. It operates in an "information-rich" environment, giving administrators the support they require and users the support they need. It supplies system administrators with the parameters necessary to make well-founded system and job management decisions. Its classless, single-queue environment allows any job to run on any set of nodes whose configuration meets the job's requirements. The Maui Scheduler also provides an extensive set of dynamically reconfigurable parameters to control job queues and various aspects of the running workload.

Maui is part of the batch system at CHPC. It works in conjunction with PBS (the Portable Batch System). For details on using batch on a particular CHPC platform, please refer to the User's Guides.

The Maui Scheduler has approximately 25 commands that allow system administrators to obtain the information needed to solve specific problems or make specific decisions, and to fine-tune parameters to increase utilization and throughput. The following commands are intended for users who want to take advantage of the Maui Scheduler's functionality:

Currently the Maui Scheduler is installed only on the Sierra and ICE Box clusters. The commands below are located in "/uufs/sierra/sys/bin" on Sierra and "/uufs/icebox/sys/bin" on ICE Box.
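
Depending on how your shell is configured, these directories may not be on your search path. A minimal sketch for a bash-style shell, assuming the ICE Box location (use the Sierra path on Sierra):

$ export PATH=/uufs/icebox/sys/bin:$PATH    # make showq, showbf, and showstart available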

showq

Usage

showq [ -r | -i ]

Purpose

showq is Maui's counterpart to the PBS "qstat" command and is the recommended way to view the queue. Running this command displays active (running), idle, and non-queued jobs.

$ showq
ACTIVE JOBS--------------------
           JOBNAME USERNAME      STATE  PROC   REMAINING            STARTTIME

3994.icebox.icebox    detar    Running    32     0:56:57  Mon Jun 12 15:56:10
3996.icebox.icebox   chmmbr    Running    30  1:02:48:34  Tue Jun 13 05:47:47
3944.icebox.icebox   chmriu    Running     8  2:01:30:09  Mon Jun 12 10:29:22
3945.icebox.icebox   peutch    Running     4  2:01:31:19  Mon Jun 12 10:30:32
3952.icebox.icebox   chmale    Running     1  2:01:56:02  Mon Jun 12 11:55:15
3953.icebox.icebox   chmale    Running     1  2:01:56:03  Mon Jun 12 11:55:16
3954.icebox.icebox   chmale    Running     1  2:01:56:07  Mon Jun 12 11:55:20
3957.icebox.icebox   chmale    Running     1  2:01:56:10  Mon Jun 12 11:55:23
3956.icebox.icebox   chmale    Running     1  2:01:56:11  Mon Jun 12 11:55:24
3955.icebox.icebox   chmale    Running     1  2:01:56:12  Mon Jun 12 11:55:25
3966.icebox.icebox   chmale    Running     1  2:01:56:12  Mon Jun 12 11:55:25
3960.icebox.icebox   chmale    Running     1  2:01:56:14  Mon Jun 12 11:55:27
3961.icebox.icebox   chmale    Running     1  2:01:56:17  Mon Jun 12 11:55:30


    38 Active Jobs     128 of  178 Processors Active (71.91%)
                       132 of  158 Nodes Active      (83.54%)

IDLE JOBS----------------------
           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

4003.icebox.icebox   peutch       Idle     4  3:00:00:00  Tue Jun 13 08:45:02

1 Idle Job 

NON-QUEUED JOBS----------------
           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

3995.icebox.icebox   chmmbr       Idle    20  1:06:00:00  Mon Jun 12 16:13:47

Total Jobs: 40   Active Jobs: 38   Idle Jobs: 1   Non-Queued Jobs: 1
$ 

(The job listing above has been truncated, so the summary statistics do not correspond exactly to the jobs shown.)

Active jobs are those that are running or starting and consuming CPU resources. Displayed are the job name, the job's owner, and the job state. Also displayed are the number of processors allocated to the job, the amount of time remaining until the job completes (in DD:HH:MM:SS notation, with the days field shown only when needed), and the time the job started. All active jobs are sorted in "Earliest Completion Time First" order.

Idle jobs are those that are queued and eligible to be scheduled. They are all in the Idle job state and do not violate any fairness policies or have any job holds in place. The Idle section displays the same information as the Active Jobs section, except that the wall clock limit (WCLIMIT) is shown rather than the time REMAINING, and the job QUEUETIME is displayed rather than the job STARTTIME. The jobs in this section are ordered by job priority. Jobs in this queue are considered eligible for both scheduling and backfilling.

Non-queued jobs are those that are ineligible to be run or queued. Jobs listed here can be in a number of states for the following reasons:

Job State    Reason
Idle         Job violates a fairness policy.
UserHold     A PBS User Hold is in place.
SystemHold   A PBS System Hold is in place.
BatchHold    A Maui Scheduler Batch Hold is in place (used when the job cannot be run because the requested resources are not available in the system, or because PBS has repeatedly failed in attempts to start the job).
Deferred     A Maui Scheduler Defer Hold is in place (a temporary hold used when a job has been unable to start after a specified number of attempts; this hold is automatically removed after a short period of time).
NotQueued    Job is in the PBS state NQ (indicating the job's controlling scheduling daemon is unavailable).

A summary of the job status is provided at the end of the output. The fields in the output are as follows:

Field      Description
Jobname    Name of the job that has been submitted or is waiting to be submitted.
Username   Name of the user whose job is running or idle.
State      State of the job: either "Running" or "Idle".
Proc       Number of processors in use or being requested.
Remaining  Time the job has until it reaches its wall clock limit, in DD:HH:MM:SS notation. The remaining time displayed may not always equal the actual time remaining, because it is based on the user's wall clock limit.
WCLimit    Wall clock limit specified for the job, in DD:HH:MM:SS notation.
Starttime  Date and time the job started.
Queuetime  Date and time the job was entered into the database.

An asterisk at the end of the job name indicates that the job has a reservation, thus the job cannot be preempted under most circumstances.
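
The -r and -i flags shown in the usage line narrow the listing to a single section; a minimal sketch, with the output (which follows the same column layout as above) omitted:

$ showq -r    # list only the active (running) jobs
$ showq -i    # list only the idle (eligible) jobs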

showbf

Usage

showbf [ -A ] | [ -a ACCOUNT ] | [ -g GROUP ] | [ -u USER ] | [ -m '[ MEMCMP ] MEMORY' ] | [ -n NODECOUNT ] | [ -d HH:MM:SS ] | [ -f FEATURE ] | [ -q QOS ] | [ -c CLASS (NOT IN USE) ]

Purpose

showbf is a tool used with the Maui Scheduler to simulate a specific job against the current job database. The showbf command simulates the job just as if the user were submitting an actual job, and CHPC encourages users to run showbf before submitting their jobs with qsub. The tool queries the job database to find out what is running and how many nodes are open; based on the preset QOS and priorities, it reports back the likelihood that the job will be able to run.

This command can be used by any user to find out how many nodes are available for immediate use on the system. Users can then submit jobs that meet these criteria and thus obtain quick job turnaround times. The key to this tool is that users must specify all the known variables to receive useful information. For example, suppose user shaq in group laker wants to submit a job that requires 8 nodes, at least 128 MB of RAM, a duration of 1 hour, and no particular partition. showbf would then query the system and print:

$ showbf -u shaq -g laker -n 8 -m '>=128' -d 1:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun  7 16:06:34

 42 procs available with no timelimit

$ 

If shaq instead wanted exactly 256 MB of RAM, 3 nodes, and a duration of 1 hour, he would simply type and see:

$ showbf -n 3 -m '=256' -d 01:00:00
backfill window (user: 'shaq' group: 'laker' partition: ALL) Wed Jun  7 16:14:12

  3 procs available with no timelimit

$ 

Parameters and Arguments

Parameter/Argument       Description
-A                       Show backfill information for all users, groups, and accounts.
-a ACCOUNT               Show backfill information only for the specified account.
-g GROUP                 Show backfill information only for the specified group.
-u USER                  Show backfill information only for the specified user.
-m '[ MEMCMP ] MEMORY'   Specify the memory requirements for the backfill nodes of interest. If the optional MEMCMP and MEMORY parameters are used, they MUST be enclosed in single quotes ('). For example, enter showbf -m '==256' to request nodes with exactly 256 MB of memory. Valid MEMCMP (memory comparison) operators are >, >=, ==, <=, and <.
-n NODECOUNT             Show backfill information for a specified number of nodes.
-d HH:MM:SS              Show backfill information for a specified duration, given in DD:HH:MM:SS (days:hours:minutes:seconds) notation.
-f FEATURE               Show backfill information only for nodes that possess a specified feature, such as processor speed; for example, 's950' indicates a processor running at 950 MHz.
-q QOS                   Show backfill information for nodes that can be accessed by a job with a certain QOS. QOS values are specific to users; if you do not know your QOS, do not use this option.
-c CLASS                 (NOT IN USE) Show backfill information for nodes that support the class feature.
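
As an illustration of combining these options, the hypothetical query below asks for four nodes with 950 MHz processors, at least 256 MB of memory, and a two-hour duration. The feature name and values are examples only; check the cluster configuration pages for the features actually defined on each system.

$ showbf -n 4 -f s950 -m '>=256' -d 2:00:00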

showstart

Usage

showstart [ JOBNAME ]

Purpose

The showstart command shows the earliest time a specified job can start, taking into account the requested resources, system downtime, reservations, and so on. The estimate ignores higher-priority jobs that do not have reservations; in other words, it is the best estimate of the job's start time if the job were next in priority behind the jobs that currently hold reservations. As an example, suppose a user wanted to see when his/her job "4010.icebox.icebox" is expected to run:

$ showstart 4010.icebox.icebox

job 4010.icebox.icebox requires 18 nodes for 2:00:00:00

Earliest start is in       1:01:03:45 on Wed Jun 14 11:48:09

Earliest Completion is in  3:01:03:45 on Fri Jun 16 11:48:09

Best Partition: DEFAULT

$

The output of this command tells the user how many nodes the job requires and what its WCLimit is. In addition, showstart reports the earliest start and completion times for the job. Keep in mind that showstart analyzes the job as if it were the next job in the queue. Ignore the partition output; partitions are not relevant on CHPC systems.
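
Taken together, these commands support a simple routine: check what is free, submit the job, then track it. A sketch of that routine is shown below, assuming a hypothetical PBS script named myjob.pbs that asks for 8 nodes for one hour (substitute the job name qsub returns in the showstart line):

$ showbf -n 8 -d 1:00:00          # is there room to start such a job right away?
$ qsub myjob.pbs                  # submit the job through PBS
$ showstart 4010.icebox.icebox    # estimate the earliest start time for your job
$ showq                           # watch the job move from the idle list to the active list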

Fair Share policies used by the Maui Scheduler indicate the percentage of CPU cycles available to particular groups and users. Fair Share is set up so that users and groups are given target percentages of the CPU cycles they are able to use. When a user's usage falls below their target percentage, Fair Share increases their priority; when their usage exceeds the target, their priority in the queue decreases.

For lack of a better example, the concept of priority in fair share can be displayed using a metaphor. Suppose you have a dart board. On the dart board are rings which represent allocations of CPU cycles. Your goal is to hit the middle ring; however, depending on how well you throw your dart, you may end up hitting the outer ring or the inner circle. If you hit the outer ring, it is similar to using more CPU cycles than your target; hitting the inner circle expresses using less than your allocated CPU cycles.

If your dart hits the middle ring, you have used exactly the CPU cycles you were allocated. If you hit the inner circle, it is like getting another turn, or in the scheduler's case, a higher priority on your jobs. If you hit the outer ring, in Maui Scheduler terms, you get a lower priority on your jobs.

Users, in some cases, are part of certain groups. For example, user malone has a fair share target of 20% and user stockton has a fair share target of 7.5%. Both of these users are in the group jazz, so the sum of their target percentages is directly related to the group's target.

Fair Share would see this:

  • User malone: 20%
  • User stockton: 7.5%
  • Group jazz: 30%

The sum of malone's and stockton's targets is 27.5%, but the target for the jazz group as a whole is 30%. Because the group collectively has not reached its target percentage, either the group's priority will rise or either user can consume 2.5% more CPU cycles and remain at regular priority.

The use of reservations allows a job to maintain a guaranteed start time. Reservations enable other jobs to use the reserved nodes as long as they do not delay this job's start time. The Maui backfill mechanism selects the best combination of jobs in the schedule and places them without delaying any previously scheduled jobs. The best schedule placement, which Maui chooses, varies from workload to workload and can only be determined through simulation.

The Maui Scheduler has the capability to schedule jobs on different partitions, a partition being defined as a group of nodes with a common high-performance switch, a common file space, and a common user space. Users do not specify partitions, but partitions are considered when Maui schedules the jobs.

Quality of Service (QOS) is the mechanism CHPC uses to assign certain policies to users on each machine. Every user receives a predetermined QOS number, which is assigned behind the scenes. QOS is divided into the following groups:

  • QOS 0: Out of Allocations
  • QOS 1: Normal Run
  • High Priority Run
    • QOS 10: Voth Users
    • QOS 20: Schuster Users
    • QOS 30: Simmons Users
    • QOS 40: Steenburgh Users

Unless you are a user from one of the specific groups above, you should always run your job under a QOS of 1. It must be emphasized that users carrying one of the high-priority QOS values cannot use those QOS numbers when trying to run on a machine other than their own group's. If you are a user of Voth's nodes, a QOS of 10 will only get you a job on the Voth nodes; if you have a QOS of 10 and try to run a job on any of the Simmons nodes, you will remain in the queue. Never use a QOS of 0 unless you are completely out of allocation time. If a user who still has time left in their allocation specifies QOS 0, the job's run time is still deducted from the bank.
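
If a job needs to request a particular QOS explicitly, Maui's PBS interface generally accepts it as a resource-manager extension on qsub's -W option. The line below is an assumption about the syntax rather than a confirmed CHPC convention, and myjob.pbs is a hypothetical script name; confirm the exact form against the User Guide referenced below.

$ qsub -W x=QOS:1 myjob.pbs    # assumed syntax: request the normal-run QOS (QOS 1)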

For more information about the policies associated with each QOS, refer to the IA-32 Cluster (icebox) User Guide.

Qbank is a CPU allocation bank that uses a relational database to store and retrieve prior transaction history and the current state of the bank. Qbank works with Maui to control and manage the CPU resources allocated to projects and users. After scheduling a job, Maui informs Qbank about the job; Qbank records that information in the allocation records, and the specified job then runs. Qbank is also set up to provide feedback to users and administrators about usage time and account balances.

CHPC has now implemented Maui on Sierra and ICE Box. Each node, or set of nodes, carries specific policies that affect users when running jobs. The individual nodes also have different hardware specifications, which can affect a job's speed and the size of job they can accommodate. The Local Opteron Cluster (sierra) Configuration, Local Compaq Sierra Configuration, and Local IA-32 Cluster (icebox) Configuration pages can be helpful when running commands such as "showbf", so that you know which machines you intend to run your job on.

Last Modified: October 06, 2008 @ 21:07:47