2.1 General HPC Cluster Policies

2.1.1 Cluster Interactive Node Policy

  1. The interactive nodes are the front-end systems that provide access to the HPC clusters. Each cluster has a set of interactive nodes associated with it. For example, if you ssh to "ember.chpc.utah.edu", you will actually be connected to one of the ember interactive nodes, such as ember1 or ember2. For ash, the interactive nodes accessed via "ash.chpc.utah.edu" are restricted to the groups that have an allocation to run on this cluster. The interactive nodes for guests are ash5 and ash6, which can be reached either by specifying the specific node or by using "ash-guest.chpc.utah.edu".
  2. Interactive nodes are your interface with the computational nodes and are where you interact with the batch system. Please see our User Guides for details. Processes run directly on these nodes should be limited to tasks such as editing, data transfer and management, data analysis, compiling codes, and debugging, as long as they are not resource intensive (memory, CPU, network, and/or i/o). Any resource-intensive work must be run on the compute nodes through the batch system (a sketch of such a batch script follows this list).
  3. Any process that is consuming extensive resources on the interactive node may be killed, especially when it begins to impact other users on that node.
    1. CHPC will usually allow up to 15 minutes CPU time before considering killing a process, unless the resource usage is impacting other users. If the process is creating significant problems on the system, the process will be killed immediately and the user will be contacted via email.
    2. Owners of the process are notified via a tty message (if possible) and an email message is sent when the process is killed.
    3. Repeated abuse of interactive nodes may result in notification of your PI and potentially locking your account.
  4. The scratch spaces that are visible on the compute nodes of the clusters are mounted on the interactive nodes.
    1. Use the interactive nodes to migrate your data to one of the scratch spaces, and run your i/o intensive batch work from the scratch space, NOT from your home directory or group space. Note: an i/o intensive process could involve either excessive MB/second or excessive i/o operations/second.
    2. For storage policies, please see 3.1 File Storage Policies.
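
As an illustration of items 2 and 4 above, here is a minimal sketch of a batch script that stages data to a scratch space and runs the i/o intensive work there. The directive style follows the #PBS examples used elsewhere in this document; the scratch path, allocation account, input file, and program name are placeholders, not actual CHPC values.

      #PBS -A <your-allocation>
      #PBS -l nodes=1:ppn=12,walltime=24:00:00

      # Placeholder scratch location -- use one of the CHPC scratch spaces visible
      # on both the interactive and compute nodes (see 3.1 File Storage Policies).
      SCRDIR=/path/to/scratch/$USER/$PBS_JOBID
      mkdir -p $SCRDIR

      # Stage input from home to scratch, run on scratch, then copy results back.
      cp $HOME/project/input.dat $SCRDIR/
      cd $SCRDIR
      ./my_program input.dat > output.log
      cp output.log $HOME/project/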

2.1.2 Batch Policies (all clusters)

General Queuing Policies

  • Users who have exhausted their allocation will not have their jobs dispatched unless there are free cycles available on the system. Preemption rules are determined on each cluster; see the cluster scheduling policies in the sections below.
  • Special access to a long QOS, which exceeds the MAX walltime limit, is granted on a case-by-case basis. Please send any request for access to this QOS to issues@chpc.utah.edu with an explanation of why the job cannot be run under the regular limits (e.g., with checkpointing, restarts, or more nodes). There is a limit of two nodes running with this QOS at any given time.
  • Each of the computational clusters has its own set of scheduling policies pertaining to job limits, access and priorities. Please see the appropriate policy for the details of any particular cluster.

Reservations

Users may request to reserve nodes for special circumstances. The request must come from the PI/Faculty advisor of the user's research group, and the group's allocation must be sufficient to cover the duration of the reservation. A reservation may be shared by multiple users. The maximum number of nodes allowed for a reservation is half the number of general nodes on the requested cluster, and the maximum duration is two weeks. The PI/Faculty advisor should send a request to issues@chpc.utah.edu with the following information (an example request follows the list):

  • Which cluster
  • Number of Nodes/Cores
  • Starting date and time (please submit the request at least one MAX walltime for that cluster in advance)
  • Duration
  • User or Users on the reservation
  • Any special requirements (longer MAX walltime for example)
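
For illustration, a request might look like the following; every value here is hypothetical and should be replaced with your group's actual details.

      Cluster:                 kingspeak
      Number of Nodes/Cores:   8 nodes / 128 cores
      Starting date and time:  <at least one MAX walltime in the future>
      Duration:                5 days
      Users:                   u0123456, u0654321
      Special requirements:    none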

Owner-Guest

Owner-guest access is enabled on owner nodes on ember, kingspeak, and lonepeak. Jobs run in this manner are preemptable, and you will need to specify a special account to use it. Jobs using these preemption accounts will not count against your group's allocation. Jobs run in this manner should not use /scratch/local, as this scratch space will not be cleaned when the job is preempted, nor is it accessible to the user running the guest job to retrieve any needed files.

To specify owner-guest, add #PBS -A owner-guest to your script, or add -A owner-guest as a flag to the srun command. Your job will be preempted if a job comes in from a user in the group that owns the node(s) your job received.
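
For example, a guest job script might contain the lines below; the node count, core count, and walltime are placeholders, and the srun form assumes the rest of your usual submission options.

      #PBS -A owner-guest
      #PBS -l nodes=2:ppn=12,walltime=12:00:00

or, on the command line:

      srun -A owner-guest ...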

You can also target a specific owner group by using the PI name in your resource specification: #PBS -l nodes=4:ppn=16:PIname
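
For example, to run as a guest only on a particular owner group's nodes (the name "piname" below is hypothetical; substitute an owner feature listed in the partition tables or reported by the script described in the next paragraph):

      #PBS -A owner-guest
      #PBS -l nodes=4:ppn=16:piname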

There is a script available at /uufs/chpc.utah.edu/sys/pkg/chpcscripts/std/bin/owner_guest_guide.sh that lists the owner nodes and ranks the different owner groups by their relative usage, normalized on a per-node basis. Do not focus on the actual numbers; lower numbers simply indicate the groups that have been using their nodes the least over the last two weeks. The script also shows the current number of down or idle nodes. Please run this script for more details.

2.1.3 Ember Job Scheduling Policy

Job Control

Jobs will be controlled through the batch system using SLURM.
  1. Node sharing. No node sharing.
  2. Allocations. Allocations will be handled through the regular CHPC allocation committee. Allocations on owner nodes will be at the direction of node owners.
  3. Best effort to allocate nodes of same CPU speed.
  4. Max time limit for jobs is 72 hours.
  5. Jobs are scheduled in order of their current priority, highest first. We do have backfill enabled.
  6. Fairshare boost in priority at user level. Minimal boost to help users who haven't been running recently. Our Fairshare window is two weeks.
  7. Short jobs are given a boost based on the length of time they have been in the queue relative to the wall time they have requested.
  8. Reward for parallelism. Set at the global level.
  9. Partition settings
      Partition Name  Access  Accounts  Node/core count  Memory (MB)  Features  Node specification
      ember  all  <pi>  73/876  24576  chpc, general, c12  em[019-022,075-142,144]
      ember-gpu  by request (GPUs)  ember-gpu  11/132 + 2 GPUs per node  49152  chpc, c12, gpu, nv2090  em[001-008,010-012]
      ember-freecycle  all  all    24576  chpc, c12  em[019-022,075-142,144]
      ember-guest  all  all  sum of owners  see owner nodes    em[019-022,075-142,144]
      hci-em  restricted  kaplan-em, hci-em, owner-guest  14/168  24576  hci, c12  em[013-014,059,070]
      bolton-em  restricted  bolton-em, owner-guest  12/144  24576  bolton, c12  em[015-018,051-058]
      yandell-em  restricted  yandell-em, owner-guest  10/120  24576  yandell, c12  em[023-032]
      zpu-em  restricted  zpu-em, owner-guest  4/48  49152  zpu, c12  em[033-036]
      avey-em  restricted  avey-em, owner-guest  2/24  24576  avey, c12  em[071-072]
      gregg-em  restricted  gregg-em, owner-guest  2/24  24576  gregg, c12  em[073-074]
      facelli-em  restricted  facelli-em, owner-guest  1/12  196608  facelli, c12  em395
      arup-em  restricted  arup-em, voelk-em, owner-guest  14/168  24576  arup, c12  em[037-070]
      usu-em  restricted  usu-em, usumae-em, usupych-em, owner-guest  18/576  262144  usu-em, c32  em[145-162]
      Total      161/2292
  10. QOS Settings. The majority of a job's priority is based on a quality of service definition, or QOS. The following QOSs are defined:
     QOS  Priority  Preempts  Preempt Mode  Flags  GrpNodes  MaxWall
     ember  1000  ember-freecycle   cancel    73  3-00:00:00 (3 days)
     ember-freecycle  1    cancel  NoReserve   73  3-00:00:00
     ember-guest  1    cancel  NoReserve  73  3-00:00:00
     ember-long  1000  ember-guest  cancel    73  14-00:00:00 (14 days)
     bolton-em  1000  ember-guest  cancel    12  14-00:00:00
     hci-em  1000  ember-guest, hci-em-low   cancel    14  14-00:00:00
     hci-em-low   1000  ember-guest  cancel    ??  14-00:00:00
     zpu-em  1000  ember-guest  cancel    4  14-00:00:00
     gregg-em  1000  ember-guest  cancel    2  14-00:00:00
     avey-em  1000  ember-guest  cancel    2  14-00:00:00
     yandell-em  1000  ember-guest  cancel    10  14-00:00:00
     arup-em  1000  ember-guest  cancel    14  14-00:00:00
     facelli-em  1000  ember-guest  cancel    1  14-00:00:00
     ember-gpu  1000  ember-guest  cancel    6  1-00:00:00 (1 day)
  11. Interactive nodes. For general use there are two interactive nodes, ember1.chpc.utah.edu and ember2.chpc.utah.edu; access either via ember.chpc.utah.edu. There are also owner interactive nodes that are restricted to the owner group. A sketch of SLURM commands for inspecting the partition and QOS settings above follows this list.
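
The partition and QOS values listed above can be checked from an interactive node with standard SLURM query commands. A brief sketch follows; the exact fields available depend on the installed SLURM version.

      sinfo -p ember                  # node counts and states for the general ember partition
      sinfo -p ember-guest            # owner nodes available to guest jobs
      sacctmgr show qos format=Name,Priority,MaxWall,GrpNodes | grep ember   # QOS limits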

2.1.4 Kingspeak Job Scheduling Policy

Job Control

Jobs will be controlled through the batch system using SLURM.
  1. Node sharing. No node sharing.
  2. Allocations. Allocations will be handled through the regular CHPC allocation committee. Allocations on owner nodes will be at the direction of node owners.
  3. Best effort to allocate nodes of same CPU speed
  4. Max time limit for jobs is 72 hours (on general nodes)
  5. Jobs are scheduled in order of their current priority, highest first. We do have backfill enabled.
  6. Fairshare boost in priority at user level. Minimal boost to help users who haven't been running recently. Our Fairshare window is two weeks.
  7. Short jobs are given a boost based on the length of time they have been in the queue relative to the wall time they have requested.
  8. Reward for parallelism. Set at the global level.
  9. Partitions
       Partition Name  Access  Accounts  Node/core count  Memory (MB)  Features  Node specification
       kingspeak  all  <pi>  32/512; 12/240; 4/80  65536, 393216  chpc, general, c16, c20, hadoop  kp[001-032,110-111,158-167,196-199]
       kingspeak-freecycle  all  all  32/512; 12/240; 4/80  65536, 393216  chpc, general, c16, c20  kp[001-032,110-111,158-167,196-199]
       kingspeak-guest  all  all  sum of owners  see owner nodes    all owner nodes
       lin-kp  restricted  lin-kp, owner-guest  15/240  65536  lin, c16  kp[033-047]
       zpu-kp  restricted  zpu-kp, owner-guest  7/112  65536  zpu, c16  kp[048-054,272-279]
       facelli-kp  restricted  facelli-kp, owner-guest  6/96  65536  facelli, c16  kp[055-060]
       frost-kp  restricted  frost-kp, owner-guest  3/48  49152  frost, c16  kp[061-063]
       steele-kp  restricted  steele-kp, owner-guest  20/320  32768  steele, c16  kp[064-083]
       hci-kp  restricted  hci-kp, kaplan-kp, owner-guest  4/64, 4/80, 6/144  65536  hci, c16, c20, c24  kp[084-087,148-151,280-285]
       molinero-kp  restricted  molinero-kp, owner-guest  4/64, 8/160, 8/224  65536  molinero, c16, c20, c28  kp[088-091,140-147,318-325]
       bedrov-kp  restricted  bedrov-kp, owner-guest  4/64  65536  bedrov, c16  kp[092-095]
       calaf-kp  restricted  calaf-kp, owner-guest  4/80  131072  calaf, c20  kp[096-099]
       strong-kp  restricted  strong-kp, kochanski-kp, strong-kochan-kp, owner-guest  4/80, 2/56  65536  strong, c20  kp[101-104,336-337]
       daq-kp  restricted  avey-kp, owner-guest  1/20  65536  avey, c20  kp105
       wjohnson-kp  restricted  wjohnson-kp, owner-guest  3/60  65536  wjohnson, c20  kp[106-108]
       sdss-kp  restricted  sdss-kp, owner-guest  28/448  65536  sdss, c16  kp[112-139]
       varley-kp  restricted  varley-kp, owner-guest  2/40, 2/56  131072  varley, c20  kp[152-153,302-303]
       gertz-kp  restricted  gertz-kp, owner-guest  2/40, 2/56  131072  gertz, c20  kp[154-155,304-305]
       sigman-kp  restricted  sigman-kp, owner-guest  1/20, 1/28  65536  sigman, c20, c28  kp156, kp338
       lebohec-kp  restricted  lebohec-kp, owner-guest  1/20  65536  lebohec, c20  kp157
       ucgd-kp  restricted  ucgd-kp, owner-guest  28/560, 28/672  131072  ucgd, c20, c24  kp[168-195,200-227]
       tavtigian-kp  restricted  tavtigian-kp, owner-guest  1/24  131072  tavtigian, c24  kp228
       mason-kp  restricted  mason-kp, owner-guest  5/120  131072  mason, c24  kp[229-233]
       arup-kp  restricted  arup-kp, owner-guest  24/192  131072  arup, c24  kp[246-253]
       gruenwald-kp  restricted  gruenwald-kp, owner-guest  14/336, 10/280  65536  gruenwald, c24, c28  kp[258-271,308-317]
       mansfield-kp  restricted  mansfield-kp, owner-guest  4/96  131072  mansfield, c24  kp[234-237]
       quinlan-kp  restricted  quinlan-kp, owner-guest  8/128  131072  quinlan, c16  kp[238-245]
       usumae-kp  restricted  usumae-kp, owner-guest  10/240  131072  usumae, c24  kp[254-257,286-291]
       kapheim-kp  restricted  kapheim-kp, owner-guest  1/32  1000000  kapheim, c32  kp292
       gompert-kp  restricted  gompert-kp, owner-guest  1/32, 1/28  1000000, 512000  gompert, c32, c28  kp[293,301]
       mah-kp  restricted  mah-kp, owner-guest  1/24  65536  mah, c24  kp294
       emcore-kp  restricted  emcore-kp, owner-guest  2/48  131072  belnap, c24  kp[295-296]
       soc-kp  restricted  soc-kp, sundar-kp, stustman-kp, ganesh-kp, venkatasu-kp, balasubramonian-kp, owner-guest  12/336  131072  soc, c28  kp[306-307,326-335]
       kingspeak-gpu  restricted  kingspeak-gpu  4/48, plus 8 GPUs each  65536  chpc, c12, tesla, geforce, titan  kp[297-300]
       Total      333/6764
  10. Job priorities

    The majority of a job's priority is based on a quality of service definition, or QOS. The following QOSs are defined:

     QOS  Priority  Preempts  Preempt Mode  Flags  GrpNodes  MaxWall
     kingspeak  1000  kingspeak-freecycle   cancel    48  3-00:00:00 (3 days) 
     kingspeak-freecycle   1    cancel  NoReserve   48  3-00:00:00 
     kingspeak-guest  1     cancel  NoReserve  161  3-00:00:00
     kingspeak-long  1000  kingspeak-freecycle  cancel    48  14-00:00:00 (14 days) 
     lin-kp  1000  kingspeak-guest  cancel    15  14-00:00:00
     zpu-kp  1000  kingspeak-guest  cancel    15  14-00:00:00
     facelli-kp  1000  kingspeak-guest  cancel     6  14-00:00:00
     frost-kp  1000  kingspeak-guest  cancel    3  14-00:00:00
     steele-kp  1000  kingspeak-guest  cancel    20  14-00:00:00
     hci-kp  1000  kingspeak-guest  cancel    14  14-00:00:00
     molinero-kp  1000  kingspeak-guest  cancel    12   14-00:00:00
     bedrov-kp  1000  kingspeak-guest  cancel    4  14-00:00:00
     calaf-kp  1000  kingspeak-guest  cancel    4  14-00:00:00
     sdss-kp-fast  1000  kingspeak-guest  cancel    4  14-00:00:00
     sdss-kp  1000  kingspeak-guest  cancel    28  14-00:00:00
     strong-kp  1000  kingspeak-guest  cancel    4  14-00:00:00
     daq-kp  1000  kingspeak-guest  cancel    1  14-00:00:00
     wjohnson-kp  1000  kingspeak-guest  cancel    3  14-00:00:00
     varley-kp  1000  kingspeak-guest  cancel    2  14-00:00:00
     gertz-kp  1000  kingspeak-guest  cancel    2  14-00:00:00
     sigman-kp  1000  kingspeak-guest  cancel    1  14-00:00:00
     lebohec-kp  1000  kingspeak-guest  cancel    1  14-00:00:00
     ucgd-kp  1000  kingspeak-guest  cancel    56  14-00:00:00
     tavtigian-kp  1000  kingspeak-guest  cancel    1  14-00:00:00
     mason-kp  1000  kingspeak-guest  cancel    5  14-00:00:00
     mansfield-kp  1000  kingspeak-guest  cancel    4  14-00:00:00
     quinlan-kp  1000  kingspeak-guest  cancel    8  14-00:00:00
     arup-kp  1000  kingspeak-guest  cancel    8  14-00:00:00
     gruenwald-kp  1000  kingspeak-guest  cancel    14  14-00:00:00
     usumae-kp  1000  kingspeak-guest  cancel    10  14-00:00:00
     kapheim-kp  1000  kingspeak-guest  cancel    1  28-00:00:00
     gompert-kp  1000  kingspeak-guest  cancel    1  28-00:00:00
     mah-kp  1000  kingspeak-guest  cancel    1  14-00:00:00
  11. Interactive nodes. For general use there are two interactive nodes, kingspeak1.chpc.utah.edu and kingspeak2.chpc.utah.edu; access either via kingspeak.chpc.utah.edu. There are also owner interactive nodes that are restricted to the owner group. A sketch of commands for checking job priority and fairshare follows this list.
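
Items 5 through 7 above describe how priority is computed; a sketch of SLURM commands for inspecting these factors from an interactive node follows (output columns vary with the SLURM version):

      sprio -u $USER            # per-job priority components (age, fairshare, QOS, job size)
      sshare -u $USER           # your current fairshare standing
      squeue -u $USER --start   # estimated start times for your pending jobs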

2.1.5 Lonepeak Job Scheduling Policy

Job Control

    Jobs will be controlled through the batch system using Slurm.
    1. Node sharing. No node sharing.
    2. Allocations. This cluster is completely unallocated.  All jobs on the general resources will run without allocation and without preemption.
    3. Max time limit for jobs is 72 hours.
    4. Fairshare will be turned on.
    5. Jobs are scheduled in order of their current priority, highest first.
    6. Reservations are allowed for users who show a need for the large memory available on these nodes. Reservations will be for a maximum window of two weeks, and a maximum of 50% of the general nodes on the cluster may be reserved at any given time. Reservations are made on a first-come, first-served basis. Up to 96 hours may be needed for the reservation to start. A sketch of commands for checking and using a granted reservation follows this list.
    7. Partitions
       Partition Name  Access  Accounts  Node/core count  Memory  Node specification
       lonepeak  all  <pi>  8/96, 8/160  96GB/node, 256GB/node  lp[001-008], lp[009-016]
       lonepeak-freecycle  not in use  not in use  8/96, 8/160  96GB/node, 256GB/node  lp[001-008], lp[009-016]
       lonepeak-guest  all  owner-guest  84/672  16GB/node  lp[017-100]
       cheatham-lp  cheatham  cheatham-lp  84/672  16GB/node  lp[017-100]
       marth-lp  marth  marth-lp  20/160, 7/112  64GB/node  lp[101-102, 104-111, 114, 117, 120-123, 127, 131-134, 137, 139-141, 143]
       Total   141/1376
    8. QOS Settings:

      The majority of a job's priority will be set based on a quality of service definition or QOS.

       QOS  Priority  Preempts  Preempt Mode  Flags  GrpNodes  MaxWall
       lonepeak  1000    cancel    16  3-00:00:00 (3 days)
       lonepeak-freecycle  not in use    cancel  NoReserve  16  3-00:00:00
       cheatham-lp  1000  lonepeak-guest  cancel    84  3-00:00:00
       lonepeak-guest  1    cancel  NoReserve  84  3-00:00:00
       lonepeak-long  1000    cancel    ?  14-00:00:00 (14 days)
       marth-lp  1000  lonepeak-guest  cancel    27  14-00:00:00
    9. Interactive nodes. For general use there are two interactive nodes, lonepeak1.chpc.utah.edu and lonepeak2.chpc.utah.edu. Access either via lonepeak.chpc.utah.edu.  There are also owner interactive nodes that are restricted to the owner group.
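
Once a reservation has been created for your group (see item 6 above and the Reservations section in 2.1.2), it can be inspected and used as sketched below; the reservation name is a placeholder that CHPC will supply when the reservation is set up.

      scontrol show reservation                  # list active reservations, their nodes, and time windows
      srun --reservation=<reservation-name> ...  # run work inside your reservation (other options as usual)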

Last Updated: 10/3/16