CHPC Implements Maui Scheduler on Icebox and SP

by Julia Harrison and Brian Haymore, CHPC Staff

Over the past several months CHPC has been implementing the Maui Scheduler on the Beowulf cluster (icebox) and the IBM SP. We are also working toward implementing it on the SGI Origin2000 (raptor) and the SGI PowerChallenge (inca/maya).

The Maui Scheduler was developed at UNM's Maui High Performance Computing Center (MHPCC) to improve the throughput of their systems. CHPC is working directly with the main developer to implement our requirements.

The Maui Scheduler works along with PBS to optimize the throughput of the system, enforce policies, and ensure fairness in job priority and scheduling. Maui also provides additional commands that let you check on your jobs.

A strict first-in-first-out (FIFO) scheduler wastes valuable cycles. Our goal is to maximize system utilization within policy constraints.

BACKFILL

Sometimes Maui will start a job before yours even though that job was submitted after yours. This phenomenon is called backfill. When your job is submitted, Maui creates a reservation for it based on the currently queued jobs and the requirements of your job. If it determines that it can fit another job in before yours without delaying the start time of your reservation, it will go ahead and begin that other job. For example, if your 16-node job must wait two hours for nodes to free up, a 4-node job behind it that needs only one hour can be started immediately on idle nodes without delaying your reservation.

MAUI COMMANDS

Maui provides some commands that will help you see the state of the system. One helpful command is showq. It is similar to the PBS qstat command (which will still be available) and displays the status of all jobs, with somewhat more information than the qstat output, so we believe you will come to prefer it. The output of showq is divided into three categories: Running jobs, listed in order of first to complete; Idle jobs, listed in the order they will next run; and Non-queued jobs, which Maui does not believe can currently run.

Another helpful command is showbf, which has several flags you will want to use:

  • -a ACCOUNT (availability for a particular account)
  • -d DD:HH:MM:SS (duration for which the nodes are needed)
  • -f FEATURE (node feature, such as processor type)
  • -n #nodes (number of nodes)
  • -q qos# (quality of service level)
  • -u user (availability for a particular user)

Using this command will help you choose which features and which QOS to request to get a job to run in the desired fashion. The output lists the number of nodes available based on the options specified.
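
For instance, to check how many nodes with 550 MHz processors are free for a 15-minute job at QOS 1, you might run something like the line below. This is a sketch: the s550 feature name mirrors the s350/m256 feature naming used in the sample PBS script later in this article, and the exact feature names defined on the system are an assumption.

showbf -f s550 -n 8 -d 00:00:15:00 -q 1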

ICEBOX NODE LAYOUT

  • Voth:
    • 24: PII 350MHz 256MB
    • 24: PII 400MHz 256MB
    • 24: PIII 550MHz 256MB
  • Simons:
    • 4: K7 950 MHz 1024MB
  • Schuster:
    • 8: SMP PII 450 MHz 512MB
    • 11: SMP PIII 450 MHz 512MB
    • 1: SMP 500 MHz 512MB
  • CHPC:
    • 36: PII 350 MHz 256MB
    • 28: K7 950 MHz: 18 with 256MB, 10 with 1024MB

POLICY ENFORCEMENT AND QOS

The equipment at CHPC has been purchased through several different mechanisms. Many of the nodes of the SP and icebox were purchased by particular user groups. CHPC maintains the equipment, and in return our general community is allowed to run jobs on it under certain policies. We use Maui to enforce these policies.

Jobs belonging to members of these user groups are given priority on their group's equipment, and no time limits are enforced. Other users are limited to a 24-hour walltime on that equipment, and their jobs will only be dispatched on those resources when they are not in use by members of the ownership group.

We control these policies within Maui through Quality of Service (QOS) levels. Most of the time you will not have to worry about QOS because a default will be assigned. The QOS definitions are:

QOS 0: Assigned to your job when you are out of allocations (equivalent to the free_cycle queue). The job is given a -100000 priority and will therefore only dispatch when there are free cycles.

QOS 1: Normal run, default for most jobs.

QOS 10 and higher: High-priority jobs; the default for jobs from members of ownership groups (currently voth 10, schuster 20, and simons 30). These jobs are restricted to the nodes of the corresponding ownership group and are given a +500000 priority, so they jump to the top of the queue. No resource limits.

NODE OWNERSHIP POLICIES

Voth Nodes: +500000 priority and no resource limits for voth users; non-voth users are limited to jobs of less than 24 hours walltime; QOS 10, 1, and 0 allowed.

Schuster Nodes: +500000 priority and no resource limits for schuster users; non-schuster users are limited to jobs of less than 24 hours walltime; QOS 20, 1, and 0 allowed.

Simons Nodes: +500000 priority and no resource limits for simons users; non-simons users are limited to jobs of less than 24 hours walltime; QOS 30, 1, and 0 allowed.

CHPC Nodes: Normal usage policies: a 72-hour walltime limit, a 32-processor limit per job, and a load limit per user of 32 processors x 72 hours (2,304 processor-hours). QOS 1 and 0 allowed.

SP NODE LAYOUT

  • Voth:
    • 4: Power2 SC 120MHz 1024MB
    • 8: Power2 SC 160MHz 1024MB
    • 2: SMP Power3 332 MHz 1024MB
  • CHPC:
    • 55: Power2 SC 120MHz: 51 with 128MB, 2 with 256MB, 2 with 512MB

PBS SCRIPT AND QOS

Most users will never have to worry about changing the QOS on their jobs; a default QOS of 1 is usually assigned. Users in the ownership groups have their high-priority QOS as the default, which limits them to their own nodes. All users, even those in ownership groups, who have exhausted their service unit allocations have a default QOS of 0.

If users in the ownership groups wish to access nodes outside their ownership, they need to add a directive to their PBS script that changes the QOS of the job to 1; the regular limits then apply. The directive is

#PBS -W qos=1

Please note that if you set the QOS to an unauthorized value, or to a value inconsistent with the nodes requested, Maui will not be able to satisfy the requirements and will defer your job. This includes trying to set the QOS to 0 while you still have an allocation. Even if your job were to run under this QOS, the SUs would still count against your allocation.

Example PBS Script (icebox):

# Request 4 nodes with the s350 (350 MHz) and m256 (256 MB) features
# for 15 minutes of walltime
#PBS -l nodes=4:s350:m256,walltime=00:15:00
# Run at the normal priority, QOS 1
#PBS -W qos=1
# Mail this address when the job aborts, begins, and ends
#PBS -M email@host.utah.edu
#PBS -m abe
# Submit to the icebox queue
#PBS -q icebox@icebox.icebox

#Change to working directory
cd /uufs/icebox/common/home/LOGIN/Workingdir
#Execute job
mpirun -np 4 -nolocal -machinefile $PBS_NODEFILE myexecutable
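
Once the script is saved (here as job.pbs, an illustrative name), it is submitted with the standard PBS qsub command, and showq or qstat can then be used to watch its progress:

qsub job.pbs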

EXAMPLES (ICEBOX)

Won't work: a voth user requests 8 450 MHz processors (default qos=10) for 00:15:00 walltime. QOS 10 restricts the job to voth nodes, and voth owns no 450 MHz nodes.
Fix: change the request to 550 MHz processors, or change to qos=1.

Won't work: a schuster user requests 16 350 MHz processors (default qos=20) for 16:00:00. QOS 20 restricts the job to schuster nodes, which are all 450 or 500 MHz.
Fix: change the request to 450 MHz processors, or change to qos=1.

Won't work: a general CHPC user requests 8 550 MHz processors (default qos=1) for 72:00:00. The 550 MHz nodes belong to voth, and non-voth jobs there are limited to 24 hours.
Fix: change the walltime to less than 24:00:00.
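
As a concrete sketch of the fix for the first example, the voth user's request line would become the following (again assuming an s550 feature name for the 550 MHz nodes):

#PBS -l nodes=8:s550,walltime=00:15:00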

To be Hacked or NOT to be Hacked

by Jonzy, CHPC Staff

With the growth of the Internet, an unwanted side effect is the growth in the number of "Hackers" [1] and in the number of machines being brought on line for the first time. We, as administrators and/or users of these computers, cannot control these "Hackers" directly, but we can make our systems difficult to hack. It is unfortunate that operating system vendors release their products with very little default security in place; instead, their concern is for a "User Friendly" [2] product.

Granted, there is a trade-off between "Security" and "Ease of Use"; however, there is a happy medium. When installing your operating system, take the extra time to ensure unwanted features are turned off.

In particular, turn off unknown daemons [3]. Don't think "Plug and Play" and assume your environment, as provided by the vendor, is ready to go. Look at your system and determine what programs are running and why. If you see a running program and don't know what it is for, take the time to learn its function or turn it off. More often than not you can disable a program with no side effects. If your change causes a problem, re-enable the program. There are also tools available for many platforms that will monitor your system for changes to needed programs, such as Tripwire for Unix or Regmon for NT.
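
As a minimal sketch on a typical Unix system of this era (paths and service names vary by vendor, so treat them as assumptions), you might inventory the running programs and disable an unneeded inetd-managed service:

# List all running processes and investigate any unfamiliar daemons
ps -ef | more
# Comment out unneeded services (for example finger or talk) in
# inetd's configuration, then signal inetd to re-read it
vi /etc/inetd.conf
kill -HUP <pid-of-inetd>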

Don't assume that your initial steps to lock down your computer are enough. You still need to keep on top of newly released patches, and more importantly, you will need to monitor your system logs on a daily basis. It is better to take a "Proactive" stance than a "Reactive" one. It takes longer to back up and restore than it does to do your initial configuration and daily monitoring.

For assistance in configuring your System, and various Security issues, please see the Institutional Security Office (ISO) security pages at: http://www.uusec.utah.edu/security/

Notes:

[1] Hackers - a term once used to identify computer programmers who hacked code to make computer use better, but which now describes users who attempt to break, or "hack", into a computer system.

[2] User Friendly - Ease of use; a side effect of "Plug and Play".

[3] daemons - programs that run in the background, such as a Mail Transport Agent, a Web Server, ftp, talk, finger, etc.

Scratch Space on CHPC Systems

by Julia Harrison, CHPC Staff

We are working toward a consistent paradigm for scratch space on all of our systems, but because the nature of each platform is slightly different, this may not be a practical goal. Each system currently handles scratch space differently.

Scratch space on CHPC systems is not billed for the way your home directory space is. Our policy is to remove any files older than 7 days on local scratch space to keep the space available for other users. We are still evaluating a policy for global, shared scratch, but we reserve the right to clean up files older than 7 days if it becomes full. We never back up any scratch space.

This space is considered volatile and should be used only for temporary files. Any data you need to keep permanently should be copied to your home directory after the completion of your job.

Please do not write files directly in the root of the scratch filesystems; instead, create a subdirectory (most people use their userid or their job name) where you will write your files, as sketched below.
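
For example, near the top of a job script you might create and move into a personal subdirectory, then copy anything you need to keep back home at the end. This is a sketch: the exact scratch path differs per system, as described below, and results.dat is an illustrative file name.

# Create a per-user subdirectory in scratch and work there
mkdir -p /scratch/$USER
cd /scratch/$USER
# ... run the job ...
# Copy permanent results back to home directory space
cp results.dat $HOME/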

SGI POWER CHALLENGE (INCA/MAYA)

The path to scratch space on both inca and maya is /scratch, but the two filesystems are not shared. When testing your jobs on inca, use inca's /scratch; when running your jobs on maya, use maya's /scratch. For your convenience, we have mounted maya's /scratch read-only on inca; the path on inca to maya's scratch space is /hosts/maya/scratch.

IBM SP

Each batch node of the IBM SP has a small local scratch of about 700 Mbytes. Its path is /scratch on each node.

In addition, we have an NFS-mounted filesystem, /global/sp, of approximately 115 Gigabytes. This filesystem is visible and writable from all of the nodes, including the interactive nodes. Writing to this space is slower than writing to the local /scratch, but it has the advantage of being shared and visible everywhere.
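
A common pattern on the SP, sketched below on the assumption that your working files fit within a node's 700 Mbyte local scratch, is to do heavy I/O in the fast node-local /scratch and copy anything you want to keep to the shared /global/sp at the end of the job (file names are illustrative):

# Run in the fast node-local scratch
mkdir -p /scratch/$USER
cd /scratch/$USER
myexecutable > output.log
# Copy results to the shared filesystem, visible from all nodes
mkdir -p /global/sp/$USER
cp output.log /global/sp/$USER/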

SGI ORIGIN 2000 (RAPTOR)

Raptor has the simplest scratch scheme: there is one scratch filesystem (/scratch) of about 72 Gigabytes.

BEOWULF CLUSTER (ICEBOX)

Icebox, like the SP, has two levels of scratch space. On each node there is a local scratch filesystem (the fastest), unique to that node. Its path is /scratch/local, and its size varies from 8 to 32 Gigabytes.

Recently we added PVFS (Parallel Virtual File System), a scratch space similar in function to the SP's global scratch but with better performance than regular NFS-mounted disk space. This global scratch space is visible to all compute nodes and the interactive node. The mount point on all nodes is /scratch/global, and it is about 128 Gigabytes.

Initially when we installed PVFS we had some technical problems which have since been solved with an updated kernel. It is now open and ready for all to use.
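
The same local-versus-global trade-off applies on icebox as on the SP. As a rough sketch (directory names are illustrative), node-private temporary files belong in the fast local scratch, while files that every node must see belong on PVFS:

# Node-private temporary files: fastest I/O
mkdir -p /scratch/local/$USER
# Files that all compute nodes and the interactive node must see
mkdir -p /scratch/global/$USER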

Update on Chemistry Software

by Anita Orendt, CHPC Staff

Gaussian 98: Gaussian 98 is now available on the Icebox cluster. The Gaussian information on the CHPC web site (http://www.chpc.utah.edu/software/docs/g98.html) has been updated to include a sample PBS script for running Gaussian on icebox. If you have any questions or problems please feel free to contact me.

Information on timings for a sample job (a 163 basis function DFT optimization run on all CHPC platforms with 1, 2, 4, and 8 processors) has also been added to the above web page. I plan to add timing/scaling information for jobs of different types and sizes to help users make the best use of the available computing resources.

In addition, the A.9 revision was installed on maya and on raptor on July 13th, 2000. Installation on icebox and the IBM SP should be complete by the time this newsletter is out.

NWChem: The NWChem computational chemistry package, developed by the High Performance Computational Chemistry Group of the Environmental Molecular Science Laboratory (EMSL) at Pacific Northwest National Laboratory, has been installed, and the quantum portion of the package has been tested on both raptor and icebox. Testing of the molecular mechanics/dynamics portion, as well as installation on the IBM SP, is underway. This program was written to take advantage of massively parallel computer systems, with the target for highest efficiency being in the 64-128 processor range.

I have been working with the 3.3.1 release; however, an update to version 4.0 is scheduled for the end of July. In this update the performance of some modules has been improved, and new functionality has been added. I will make a presentation on the capabilities and use of this package in September. Users can also obtain information at http://www.emsl.pnl.gov:2080/docs/nwchem/nwchem.html, where there are manuals for users and developers as well as a useful tutorial.

Molecular Simulations Software: As this newsletter is being prepared, Molecular Simulations Inc. (MSI) expects to ship evaluation copies of its new PC-based Materials Studio 1.0 to all of its current customers in July. The evaluation period will last 60 days and is currently scheduled to end 28 August 2000. Information on this new product can be obtained from MSI's web page at http://www.msi.com.

Please contact me if you need any information about accessing the evaluation copy and/or have any comments about the product.

The CHPC Fileserver

The CHPC fileserver is a six-processor IBM S80 dedicated to serving files to the HPC systems and the INSCC building occupants. Disk space is served primarily via NFS. All served storage space is kept on RAID 0+1 (striped and mirrored) disk arrays for faster access and higher availability.

Number of processors             6          PowerPC_RS64-III (64-bit)
Amount of cache                  8 MB       Fast access level 2 cache
Amount of memory                 8 GB       RAM
File services network bandwidth  310 Mbps   (2) 155 Mbps ATM adapters
Maximum subnet bandwidth         155 Mbps   OC3 via ATM
Number of hard disks             88         56 Fibre Channel, 28 LVD SCSI, 4 SCSI
Total disk space                 1.5 TB     Served space is striped and mirrored

Some Recent Statistics:

  • Max number of NFS calls (per day) = 8,870,649
  • Max number of NFS version 2 reads (per day) = 3,322
  • Max number of NFS version 2 writes (per day) = 42,023
  • Max number of NFS version 3 reads (per day) = 1,578,815
  • Max number of NFS version 3 writes (per day) = 1,842,904
  • (Please note these max numbers are the maximum actually handled, not the maximum the system can handle.)
  • % of NFS version 2 calls that are reads (over the period) = 1
  • % of NFS version 2 calls that are writes (over the period) = 46
  • % of NFS version 3 calls that are reads (over the period) = 22
  • % of NFS version 3 calls that are writes (over the period) = 27

You can see that writes dominate the NFS calls, which favors a RAID 0+1 disk configuration like ours over RAID 5: RAID 5 must read and update parity on every small write, while RAID 0+1 simply writes the data to both mirrors. Also, please note that over the 2-3 day period in which we collected this data there was less NFS traffic than usual, so the percentages are more meaningful than the maximum numbers.
