Cluster connection issues - Center for High Performance Computing

January 23rd, 2:27 p.m.

The connectivity problems observed yesterday were traced to the networking switch that supports core cluster services. The temporary fix made last night, relocating these services to a different switch, is still in place, and we are no longer seeing the issues. If you notice any problems with connections to the clusters, please report them via issues@chpc.utah.edu.

The permanent fix is to update and then test the networking switch and, if necessary, replace it. We plan to do this over the next few days. Once we are comfortable that the switch is functioning properly, we will move the services back. At this point we do not expect this change to cause any additional outages. If that changes, we will let you know.

January 22nd, 10:37 p.m.

CHPC sysadmins have determined the cause of the connection failures and put a "band-aid" fix in place. A permanent fix will need to be evaluated more thoroughly over the next few days.

As of now, the clusters should be functional, with the above caveat. If you notice anything out of order, please let us know.

We will notify users once a permanent fix has been decided on.

January 22nd, 5:45 p.m.

On the afternoon of January 22nd, we observed an issue with the Kingspeak resource manager, and since then we have observed transient connectivity issues on all general environment cluster nodes. These are occurring often enough that users will see problems when using the clusters, both with sessions on the interactive nodes and with jobs on the compute nodes.

We are continuing to diagnose the problem and will post additional updates when we know more or are able to resolve the issue.

Cluster connection issues January 22nd