CHPC OUTAGE: No Remote Access to CHPC Resources - UPDATED
Date Published: August 23, 2023
8/24/2023 at 7:30pm
The /scratch/ucgd/lustre is back online, mounted on all redwood compute and interactive nodes.
In addition, rw[079,106,114,119,171,020,022,132,202] have been returned to service. The redwood nodes that remain down are rw[010,187], which have memory issues, and rw[072,134,183].
8/23/2023 at 5:45pm
Redwood is back in service:
- All of the interactive nodes are up.
- /scratch/ucgd/lustre is still down. DDN Support has been engaged to work on the issues; for now the mount of this space has been removed from all nodes.
- There are a number of compute nodes that are still being worked on:
- rw[010,106,114,119,171,187,020,022,132,202] - memory issues are being reported, so we are doing additional testing overnight.
- rw[079,072,134,183] - these nodes are either not responding or have other issues that will require further work to diagnose.
8/23/2023 at 10:00am
Most, but not all, of the CHPC resources are once again accessible. The notable resources that are not ready for use are the redwood cluster, including the interactive nodes, and the PE /scratch/ucgd/lustre file system.
At this time, CHPC staff are identifying any additional resources that are not accessible and working to bring them back online. Once this process is complete, we will send out a notification listing the resources that need additional work.
If you notice any other CHPC resource that is not accessible, please send a report to email@example.com.
8/22/2023 at 6:45pm
At about 3pm there was a widespread disruption of campus IT services that is being attributed to humidity issues in the datacenter. You can monitor the current status at https://uofu.status.io/. At this time there is no estimated time for resolution.
These issues resulted in an outage of remote access to CHPC resources. The outage will continue until the campus-level event has been addressed. CHPC staff have been actively working to identify the impact on CHPC hardware; so far we have found issues with some network equipment in the PE, and we are working with support to get those addressed.
Our current view is that once the campus issues are fully resolved, the general environment should be in good shape, but restoring the PE depends on addressing the issues we have found.