PBS Batch Queuing System on Raptor

by Lloyd Caldwell, Sr. Systems Programmer

Introduction:

We are changing batch systems on raptor, the Onyx2/Origin2000 system. The old batch system was named DQS, Distributed Queuing System. The new batch system is named PBS, Portable Batch System. This batch system was written as a joint project between the Numerical Aerospace Simulation (NAS) Systems Division of NASA Ames Research Center and the National Energy Research Supercomputer Center (NERSC) of Lawrence Livermore National Laboratory. Watch the "message of the day" and check our web site: www.chpc.utah.edu for exact dates for this change.

While there are many reasons for this change, the most basic are that DQS is buggy and hard to modify for our needs. PBS was designed to be customizable; specifically, the scheduler, which picks the next job to run, is a custom-written application. There is also a much larger team dedicated to PBS development than to DQS (recalcitrant part-timers), so bug fixes are quickly available.

CHPC is also planning to use PBS on the PowerCHALLENGE cluster and the AIX cluster. It may be implemented on the SP systems at some future time.

Goals for Raptor's batch system:

Raptor presents a unique challenge for configuring and using a batch system. We have special hardware, InfiniteReality2 Graphics Pipes, that is typically used in an interactive manner and often requires specific cpus for optimum performance. Raptor is often involved in running demonstration codes for industry, government and research visitors. It has a large, active user population who require cpu/memory access with as much performance as possible.

Utilization on raptor is categorized as follows, prioritized from high to low:

  • Demonstrations (D)
  • Graphics Intensive Processing (GI)
  • "Classic" batch: non-interactive cpu and/or memory intensive processing (CB)
  • Development, Debugging, Testing (DDT)
  • "Free" Cycle (FC): "a cpu cycle is a terrible thing to waste."

Example: Assume the system is full of CB processes (i.e. all cpus are in use) and an individual submits a GI process. The CB processes should be "stopped" until enough resources are available to satisfy the GI process, which is then dispatched. When the GI process completes, the "stopped" CB processes should be "unstopped". This behavior is not currently achievable, but CHPC thinks it can be implemented eventually.

Finally, any category should be able to get as much or as little of the machine as required, in a fair, priority-enforced manner.

Note: There is NO batch system that does all of this; we are always looking. PBS comes the closest of all the batch systems currently available, whether commercial, shareware, public domain or otherwise. We will be modifying the SCHED and MOM processes (see PBS architecture below) to get closer to the ideal we want for our batch environment.

PBS architecture:

The pieces of PBS are: the SERVER, the SCHED (scheduler), MOM, user commands (qsub, qstat, etc.), operator commands and administrator commands.

See diagram

Note: PBS can be run on multiple hosts across a network. This diagram demonstrates our implementation on a single host.

PBS assigns roles to individuals; Users, Operators and Administrators are the roles implemented. All roles interact with PBS by sending commands to the SERVER which determines authorization and services or forwards the requests to the appropriate destination(s).

The flow of a batch job through the system is as follows. For simplicity, I will assume the details of a PBS script are understood (see batch scripts, below) and that there are no errors or misconfigurations in the system.

  1. qsub, CLIENT -> SERVER: The user has developed, debugged and thoroughly tested a code that is to be submitted to the batch system. The user knows what resources the code needs: the number of cpus and the time required to complete the run. A script (ascii text) file is created that runs the code and has embedded in it the resource requirements for this run, OR the resource requirements are specified on the qsub command line and the code invocation command is given on stdin (see the brief example after this list). The user issues the qsub command. The qsub command contacts the SERVER on behalf of the user and sends the script file contents to the SERVER, which enters this script job into a queue.
  2. SERVER <-> SCHEDuler <-> MOM: Some time later the SERVER contacts the SCHEDuler and tells it to run a scheduling cycle. The SCHEDuler contacts MOM (machine oriented mini-server) and requests resource utilization information; MOM queries the local kernel and its running jobs and returns this information to the SCHEDuler. The SCHEDuler asks the SERVER for the status of queued jobs, which the SERVER supplies. The SCHEDuler then makes a policy decision, selecting a job to run, and tells the SERVER which job that is. User-specified criteria such as number-of-cpus, job-duration, memory-requirements, etc. can be used by the scheduler in making its policy decision.
  3. SERVER -> MOM: The SERVER sends the job to MOM to be run.
  4. MOM builds an execution environment, prepares the input and output requirements and starts the job. MOM monitors the job and responds to status requests (qstat, etc...).
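
To make step 1 concrete, here is a minimal sketch of the two submission forms; the script name, program name and resource values are placeholders, not CHPC defaults:

# Submit a prepared script file:
qsub myscript.pbs

# Or give the resources on the command line and the commands on stdin:
echo "./a.out" | qsub -l ncpus=1,walltime=0:30:00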

Things you need to be aware of when using a batch system:

Login script issues: PBS creates a login environment for your job that is as much like a normal interactive login session as possible. Since this isn't a real login session, i.e. there is no keyboard, mouse or display, interactive commands during the login phase will cause errors and abort the job run.

This means things like mail, news, stty, and their friends should NOT be invoked from your login scripts when the login is performed by the PBS batch system.

To let you distinguish between normal logins and batch logins, PBS sets the environment variable PBS_ENVIRONMENT to either PBS_BATCH or PBS_INTERACTIVE.

Bourne and Korn shell users can accomplish this by modifying their .profile with the following:

if [ -z "$PBS_ENVIRONMENT" ]
then
   # do interactive commands
else
   # do batch specific commands
fi

Csh users can use the following in their .cshrc and/or .login:

if ( $?PBS_ENVIRONMENT ) then
   # do batch commands
else
   # do interactive commands
endif

Logout script issues:

If your selected shell is the csh and you have a ~/.logout script, you should place the following at the beginning and end of the file.

# First line of csh ~/.logout
set EXITVAL = $status

# Last line of csh ~/.logout
exit $EXITVAL

Batch jobs can be linked by dependencies. A job's exit value is available for deciding which jobs to initiate after the parent job ends, and these C shell statements preserve the exit value from a batch job.
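
For example, a dependent job can be submitted so that it starts only if its parent finishes successfully. This is a sketch only; the job identifier and script names are hypothetical, and the exact depend syntax may vary between PBS releases:

# Submit the parent job; qsub prints its job identifier (e.g. 123.raptor)
qsub first_job.pbs

# Submit a job that runs only after job 123.raptor exits with status 0
qsub -W depend=afterok:123.raptor second_job.pbs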

Checkpoint/Restart or Crash and Burn:

Scenario - Your job has been running for 3 days and has only minutes to go when the machine crashes ("What, a machine crash? Never."). Your SUs have been consumed, you have a paper to publish tomorrow, and now you are MAD. Sorry, but you only have yourself to blame.

If you had created your own checkpoint and restart routines your code could be "restarted" somewhere near the end of your job's run.

A checkpoint routine should be able to write out the intermediate values (matrices, arrays, lists, maps, etc...) of your calculation and any control variables (loop counters) to a file. A restart routine should be able to read in what the checkpoint routine wrote out.

As your code starts up, it should test for the existence of checkpoint files and, if they exist, run the restart routine.
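
The same test can also be made from the batch script itself. The Bourne-shell sketch below assumes a hypothetical program a.out that accepts a -restart option and writes its state to a file named checkpoint.dat; adapt the names to your own code:

#PBS -l ncpus=1
#PBS -l walltime=24:00:00
# directory holding the code and its checkpoint file
cd $HOME/myrun
if [ -f checkpoint.dat ]
then
   # a previous run left a checkpoint, so resume from it
   ./a.out -restart checkpoint.dat
else
   # no checkpoint found, start from the beginning
   ./a.out
fi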

Some operating systems provide checkpoint/restart services by the kernel. Kernel checkpoint/restart services can only work when your code is written so that ALL required resources are local to the kernel your code is running on.

Things that are NOT operating system kernel checkpoint-safe (this list is illustrative, not exhaustive): network socket connections, X terminals and X11 client sessions, devices like tape drives and cdroms, files opened with setuid credentials, System V semaphores and messages, and open directories.

Note: Files open through NFS are considered "local" so can successfully be checkpointed and restarted, assuming the file server is still up and the files exist.

At this time, IRIX 6.4 and higher are the only operating systems CHPC runs that provide checkpoint/restart. See `man cpr' and the Insight book `SGI Admin Checkpoint and Restart Operation Guide' for more information.

PBS by default assumes your batch job is rerunnable. This means that if the system goes down (for regular maintenance or demos, say) before the running batch job completes, the job will be restarted, that is, STARTED again automatically. This may or may not be the correct behavior for your job; if your code is checkpoint/restart capable, it is exactly right. You can control this rerun behavior with the `-r y|n' option to the qsub and qalter commands.
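
For example (the script name and job identifier are placeholders):

# Submit a job that should NOT be restarted automatically after a crash
qsub -r n myscript.pbs

# Change your mind about a job that is still queued
qalter -r y 123.raptor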

How to get jobs into the Batch system:

Once you have some code or a program that you wish to run and that will use significant cpu time (more than 15 minutes), you will need to submit it to PBS. The qsub command is your gateway to batch heaven. To access the qsub command you will need /usr/local/chpc/bin in your PATH. You may also want to add /usr/local/chpc/man to your MANPATH environment variable. Qsub has lots of arguments; some of the more useful ones are:

  • -N job_name (give the job a name, viewable with qstat)
  • -l ncpus=X,walltime=23:30:00 (resources required)
  • -e path (write standard error to this file)
  • -o path (write standard out to this file)
  • -j oe (join standard out and error into out file)
  • -r y|n (mark job rerunnable|not-rerunnable)
  • -v varA=x,varB=y (set these environment variables in batch environment)
  • -V (import ALL environment variables into batch environment)
  • -S shell_path (program that runs script, defaults to login shell)
  • -c n (do not checkpoint job)
  • -q dest (send job to dest queue)
  • -m abe (send mail on a[bort], b[eginning] and e[xit] of job)
  • -M user@host (where to send email notices)

Qsub expects the last argument to be the name of a script file. If that parameter isn't supplied, it will read commands to be run from standard input until it detects an end-of-file (usually Control-D). The commands are then sent to the server. To make the command line shorter, all PBS qsub options can be placed inside the script file. Each line in the script file that contains a qsub option should be prefixed with the 4 character string #PBS.

# Stupid example start
#PBS -l ncpus=1
#PBS -l walltime=1:00
#PBS -q cpu_q
#
/usr/bin/printenv
# Stupid example end

If you saved the above example in a file, you would submit the script with the command line:

qsub script_file

For an IRIX 6.X system the following resources MUST be specified:

  • ncpus: how many cpus does your program require
  • walltime: how many hours of wall clock time does it require

Note: 72 hours is the maximum allowed walltime without special dispensation. You may specify as short a time as you like. There are other resources that can be requested, but we are only "managing" ncpus and walltime for now.
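
For example, a request for four cpus and 48 hours of wall clock time can be made either on the qsub command line or inside the script (the script name is a placeholder):

# On the command line:
qsub -l ncpus=4,walltime=48:00:00 myscript.pbs

# Or embedded in the script:
#PBS -l ncpus=4
#PBS -l walltime=48:00:00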

PBS Commands:

PBS attempts to be a POSIX 1003.2d Batch compliant system.

  • User Commands: qalter, qdel, qhold, qmove, qmsg, qorder, qrerun, qrls, qselect, qsig, qstat, qsub, xpbs
  • Operator Commands: pbs_server, pbs_mom, qdisable, qenable, qrun, qstart, qstop, qterm
  • Administrator Commands: pbs_sched, pbsnodes, qmgr

PBS raptor batch configuration:

Raptor is configured with 4 queues:

  • cpu_q: batch queue
  • fc_q: free cycle queue
  • r_q: restricted non-graphics pipe queue
  • gp_q: graphics pipe queue

Access to the r_q (restricted) and gp_q (graphics pipe) queues is based on access control lists (acls); only authorized users may use these queues. The cpu_q is the normal batch queue. If your Service Unit allocation runs out, you will only be able to submit jobs to the fc_q. The fc_q will only be enabled to run jobs when there is low usage on raptor.
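
For example, to send a job to the free cycle queue (the script name is a placeholder):

qsub -q fc_q myscript.pbs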

PBS batch policies:

The maximum wall clock runtime is 72 hours in the cpu_q and fc_q queues. No user may have more than one (1) job running at a time. No user should have more than two (2) jobs queued at any time.

References:

  1. http://pbs.mrj.com
  2. http://science.nas.nasa.gov/Software/PBS/pbshome.html

A Visual Tribute to Computer Graphics Laboratories: 1971-1998

by Robert McDermott, Historic Laboratories Exhibit Chair

Last January I was asked to chair an exhibit for the SIGGRAPH 98 conference to recognize computer graphics laboratories over the past 25 years. SIGGRAPH is the international conference for computer graphics. The week-long conference has extensive course offerings on Monday and Tuesday, papers and panels on Wednesday, Thursday and Friday, and a very large hardware and software exhibition on Tuesday, Wednesday and Thursday. The conference usually attracts about 25,000 attendees; this year's 25th-anniversary conference attracted nearly 40,000.

In collaboration with designer Jeff Calendar of Q Ltd. of Ann Arbor, Michigan, I decided to make large-format images to represent each of the laboratories. I assembled a jury to choose the 12 laboratories and worked with people from each laboratory to select 12 of its images. The images came in a variety of forms: anonymous ftp addresses and web sites for electronic images; slides, prints and Polaroids; and references to figures in publications. The plotted images ranged in size from 18" x 20" to 24" x 20".

When CHPC acquired an ENCAD 60" plotter we were able to produce the 144 images for the exhibit. Jimmy Miklavcic, my colleague and CHPC Multimedia Specialist, focused on plotting the images. Jimmy had chosen the ENCAD plotter and a Canon 800 color printer, both hosted by a robust HP Kayak XA PC with an Intel Pentium II processor running EDOX control software. We previewed images on the Canon printer, which has also produced papers for publication when using 20-weight paper. The ENCAD is a 300-dpi ink plotter that plots on both its forward and backward passes over the paper. It has vats of ink that can be refilled during the plotting process. The ENCAD has already been used to produce a dozen posters that average 4' x 4' in size. For this multi-image plot, the EDOX software has a nesting feature that allows images to be printed side by side across the 60" width of paper. The combination of the Canon printer, the ENCAD plotter, the robust PC, and the EDOX control software provided a powerful environment for producing this large quantity of images.

The SIGGRAPH exhibit was entitled: "A Visual Tribute to Computer Graphics Laboratories: 1971-1998".

An exhibits chair statement: The first 25 SIGGRAPH conferences have both inspired and been supported by the evolution of computer graphics imagery. Computer graphics has progressed from its early years of visual existence proofs by engineers, scientists, and mathematicians to today's personal visual expressions by designers, architects, and artists. In 1998, the technology serves a very broad spectrum of users, from researchers who produce meaningful images to gain insight into their abstract worlds, to fine artists who produce art that reveals nothing of the computer's key role in supporting their creativity.

This exhibit provides a visual impression of contributions from computer graphics laboratories covering more than 25 years. Criteria for inclusion in this tribute are a balance of substance underlying the imagery and striking visual impact. In addition, there is a balance of long-term sustained contributions with short-term focused contributions. Robin Forrest, Jim Blinn, and Pat Hanrahan, acting as the jury, were asked to select laboratories based on those criteria.

Twelve laboratories are presented here:

  • The University of Utah
  • New York Institute of Technology
  • Cornell University
  • Brown University
  • The University of North Carolina at Chapel Hill
  • California Institute of Technology
  • Pixar Animation Studios
  • Silicon Graphics, Inc.
  • Stanford University
  • Carnegie Mellon University
  • University of Washington
  • Microsoft Research

Laboratory principals provided images and names to represent their organizations. The result is a richness of imagery that summarizes some of the key developments in computer graphics.

Welcome, and enjoy!

  • Historic Laboratories Exhibit Chair
  • Robert McDermott
  • Center for High Performance Computing
  • University of Utah
  • mcdermott@chpc.utah.edu

The exhibit had a very prominent location in the Orlando Convention Center, which hosted SIGGRAPH 98. There was a very positive response to the exhibit from many of the attendees. The University of Utah part of the exhibit has been returned and can be seen in the Merrill Engineering Building, on the 3rd floor in the north hall toward the east end.

Photographs taken at SIGGRAPH conference by Jimmy Miklavcic.

ASPIRE: Astrophysics Science Project Integrating Research and Education

by Todd VanderVeen, CHPC Applications Programmer

Web site: http://sunshine.chpc.utah.edu

ASPIRE is a web-based education project being developed by the University of Utah's cosmic ray physics group. The group operates the High Resolution Fly's Eye detector, a unique fluorescence-based detector used to study cosmic rays at the highest end of their energy spectrum. When approached by the National Science Foundation (NSF) to provide more information about and public access to their research, Professor Gene Loh proposed something beyond the creation of just another website. Taking note of the heavy investment the Utah public school system has made in building a strong network infrastructure, and of recent advances in distributed computing, namely Sun Microsystems' Java language, he proposed using the NSF's extensive astrophysics research programs to enrich science education at the secondary level.

The aim is to create a series of on-line interactive physics labs available to anyone with access to the Internet. Working in accordance with the curriculum established by the Utah State Office of Education, which derives from the Project 2061 National Science Standards, the ASPIRE team is developing lesson plans and labs that give students a solid grounding in general physics as well as exposure to the results and practice of fundamental scientific research. As such, the project represents a low-maintenance, highly distributable educational resource. Working under a grant renewed by the NSF, the project now has a dedicated web server donated by Sun and maintained by the Center for High Performance Computing (CHPC).

Development for the project began with a workshop in the summer of '97 for a small group of public school teachers and the training of undergraduate physics students in the Java programming language. A second workshop was held this June, with participants spending a week in various sessions with university professors from the cosmic ray group, exploring topics in astrophysics, visiting the detector at Dugway Proving Grounds, and becoming familiar with the project's plans and resources. In July, a selected group of teachers will return to join those already established with the project to spend a week writing lessons/labs for the coming year's production cycle. They will be joined by the project's manager, members of the cosmic ray faculty, the programming team, a graphic artist, and a specialist from the University's education department.

The long-term goal is to gain increasing support from the State's educational system. Ideally, secondary school teachers would take on the principal role of curriculum development, with the ASPIRE team and university research groups providing them with training, content information, and the expertise for computer-based lesson development. The project has received a Resolution of Endorsement from the Governor's Science Council and will be pursuing support from the Governor's Office to purchase computer and projection equipment to pilot the program in two districts this coming year. Members of the ASPIRE team will also be traveling to a number of regional teachers' conventions in Seattle, Boston, and San Antonio to promote their work and explore avenues for expansion into other states. Plans are also being made to hold next year's workshop outside of Utah to help facilitate the project's growth.

Symposium on Adaptive Methods for PDEs

by Michael A. Pernice, CHPC Staff Scientist, Numerical Methods

Numerical simulation is a useful tool for many areas in computational science and engineering, such as numerical weather prediction, propagation of seismic waves, biofluid dynamics, turbulent reacting flow, and electro-cardiodynamics. Complex physical phenomena often possess features that span a wide range of spatial and temporal scales. This can make it difficult to obtain accurate numerical solutions, and computations that are under-resolved can even exhibit spurious features. While it is possible to resolve these phenomena by increasing the number of grid points, global grid refinement can quickly lead to problems that are intractable, even on the largest available computing facilities, and especially for three-dimensional problems that involve complex physics. One way to achieve the needed resolution is to refine the computational mesh locally, in only those regions where enhanced resolution is required.

The University of Utah hosted a Symposium on Adaptive Methods for Partial Differential Equations June 22-24, 1998 at the Intermountain Networking and Scientific Computing Center (INSCC). The symposium was co-organized by Chris Johnson, Department of Computer Science; Philip J. Smith, Department of Chemical and Fuels Engineering; Aaron Fogelson, Department of Mathematics; and this author. Through the generous support of the Department of Energy, the Center for High Performance Computing, and the Center for the Simulation of Accidental Fires and Explosions, the symposium brought together an international collection of producers and consumers of adaptive solution technology in a focused environment. The overall themes of the symposium were algorithms, tools, and applications. The symposium featured 20 presentations on methods for both Cartesian and triangulated grids and on the demanding applications being undertaken by partners in DoE's Accelerated Strategic Computing Initiative (ASCI). The symposium also provided extensive opportunities for informal discussions on the current state of the field.

While the idea of concentrating computational effort where it is needed through local grid refinement sounds attractive, realizing this objective in automatic, efficient, and robust software presents significant analytic, conceptual, and logistical challenges. Automatically refining a grid to resolve fine scale features of a solution requires reliable estimation of local errors. Presentations and discussion at the symposium revealed that analytically-based error estimators are known only for a small class of problems, and that even within this class a poorly chosen discretization will fool the best estimator. Local grid refinement requires the capability to both add and delete mesh points, which means that data structures must be dynamic and flexible, and that memory utilization must be carefully managed. Introducing parallelism further complicates matters. Moreover, locally refined grids can be organized in a hierarchy, and efficient algorithms for solving problems on such grids exploit this hierarchical description and are multilevel in nature. Several different software environments that provide these capabilities were described. While still evolving, these frameworks reflect an interesting trend in general scientific and engineering software, away from procedure-based libraries and towards object-based toolkits.

Overall, the symposium revealed that, while significant progress in the development and use of adaptive methods has occurred, much work remains to be done. By all accounts, the symposium successfully provided a useful forum for presentation and discussion of the state of the art.
