
 /scratch/general/lustre down

Date Posted: November 25th, 2019

6:02 PM November 25th:

The /scratch/general/lustre file system is back in service.  Both last night's outage and today's were related to a very high load on the file system.

We are continuing to analyze the usage to determine whether the high load came from individual workloads or users, or whether it can only be attributed to high cumulative use by many users.  In the meantime, users should distribute their workload among the different scratch file systems to minimize the impact on the lustre file system.

Should you see further issues, please send a report to helpdesk@chpc.utah.edu.


3:20 PM November 25th:

Unfortunately, the /scratch/general/lustre file system went down again.  

Until the issue with this file system is resolved, please make use of either /scratch/general/nfs1 or /scratch/kingspeak/serial.

If you have any PENDING jobs that will try to use the lustre scratch file system, please cancel them and resubmit them so that they use one of the other two scratch file systems.
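
For job scripts that hard-code the lustre path, the edit needed before resubmitting could be sketched roughly as below. This is a minimal illustration, not a CHPC-provided tool: the names `retarget_scratch` and `retarget_file`, and the choice of /scratch/general/nfs1 as the default replacement, are assumptions made for the example. It only rewrites the script text; cancelling and resubmitting the job is still done with the scheduler's own commands.

```python
from pathlib import Path
from typing import Tuple

LUSTRE = "/scratch/general/lustre"
# Assumption for illustration: default to nfs1; /scratch/kingspeak/serial also works.
DEFAULT_TARGET = "/scratch/general/nfs1"


def retarget_scratch(script_text: str, target: str = DEFAULT_TARGET) -> Tuple[str, int]:
    """Return (new_text, count) with every lustre scratch path replaced by target."""
    count = script_text.count(LUSTRE)
    return script_text.replace(LUSTRE, target), count


def retarget_file(path: str, target: str = DEFAULT_TARGET) -> int:
    """Rewrite a job script in place, keeping a .bak copy; return replacements made."""
    script = Path(path)
    original = script.read_text()
    updated, count = retarget_scratch(original, target)
    if count:
        # Keep the untouched script next to the edited one.
        script.with_name(script.name + ".bak").write_text(original)
        script.write_text(updated)
    return count
```

Any data the job expects to find under the old path would still need to be staged on the new file system separately.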


2:06 PM November 25th:

The lustre scratch file system is back in service.  At this time we do not have any details to share on the root cause of the issue.

The majority of nodes impacted by the scratch file system outage have been cleared and returned to service.  We will watch for any additional nodes that need attention as jobs impacted by this outage end.  If you have a job using the lustre scratch file system that is still running, you may want to check whether it recovered on its own; if not, you should cancel and resubmit the job.

Additional notes about the scratch file systems

  • Two of the three scratch file systems are nearly full. Here is a recent listing of capacity, used space, free space, and percentage used for the three scratch options:
         /scratch/general/nfs1        595T   41T  555T   7%
         /scratch/kingspeak/serial    175T  164T   12T  94%
         /scratch/general/lustre      696T  598T   92T  87%
  • Comparing the two NFS scratch file systems, /scratch/kingspeak/serial and /scratch/general/nfs1: /scratch/general/nfs1 is our newest scratch file system, runs on a newer generation of storage hardware than /scratch/kingspeak/serial, and has free capacity.
  • Remember that the scratch file systems are shared resources and are not meant for permanent storage.

8:08 AM November 25th:

Sometime yesterday evening /scratch/general/lustre started to have issues.  As a result, users will not be able to access files on this space.  In addition, jobs that were making use of this space are leaving nodes in a 'kill task failed' bad state, stuck in file IO to the lustre scratch file space.  These nodes need to be cleared manually.

Until the issues with this scratch space are resolved, please make use of one of the other scratch file systems, /scratch/general/nfs1 or /scratch/kingspeak/serial, for any jobs you submit. Please also note that /scratch/kingspeak/serial is 94% full, so users should check their usage of this space and clean up as much as possible.
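
As one possible way to check usage before cleaning up, the sketch below totals file sizes under each immediate subdirectory of a given directory. It is an illustrative assumption rather than a CHPC utility: the function name `usage_by_subdir` is hypothetical, and the totals are apparent file sizes in bytes, which can differ from the space actually charged on disk.

```python
import os
from typing import Dict


def usage_by_subdir(root: str) -> Dict[str, int]:
    """Return total bytes of files under each immediate subdirectory of root."""
    totals: Dict[str, int] = {}
    with os.scandir(root) as entries:
        for entry in entries:
            # Only descend into real directories, not symlinks to directories.
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        # lstat so a symlink counts as the link, not its target.
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            totals[entry.name] = total
    return totals
```

Calling it on your own directory under /scratch/kingspeak/serial and sorting the result by size shows which subdirectories to clean up first. Walking a very large tree adds metadata load of its own, so it is best run sparingly on a shared file system.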

We will send out an update when the lustre scratch file system is back in service.

Last Updated: 6/11/21