This downtime will be limited in scope. It will affect the Windows compute servers and select portions of the general environment HPC compute clusters. No virtual machines will be affected.
Specifically, the planned work includes:
- Applying system updates to all Windows compute servers in the general and protected environments (kachina and narwhal).
- Applying OS updates to the cluster interactive nodes (including the meteo, atmos, wx and frisco nodes) and the resource managers, along with the compute nodes of notchpeak. Among other items, the Lustre clients and the CUDA drivers will be updated.
To prepare for this downtime, there is a reservation on notchpeak to drain the cluster of running jobs before the downtime begins. On the remaining clusters, the scheduler will be paused so that no new jobs start while the work on the resource managers is being completed.
On the HPC side, we will update the resource managers first, so that the remaining clusters can resume scheduling jobs as soon as possible, followed by the interactive nodes and then the notchpeak cluster.
Provided that no issues arise after the updates, updates to the remaining cluster compute nodes in the general environment will be scheduled and completed in a rolling fashion in subsequent weeks.
During the downtime, we will make announcements as the systems are brought back into service.