July 10, 2014 - Issue with some lonepeak, kingspeak, and ember compute nodes

Posted: July 10, 2014

- UPDATE -- 1pm: All affected nodes are back in service. In addition, the problems accessing file systems from the protected environment (swasey) have been resolved. There were two issues:

  • Power was lost to the racks housing the lonepeak nodes as well as the protected environment file system due to issues with the PDU.
  • There are problems with the cooling fan wall that led to the observed temperature fluctuations; CHPC and the datacenter staff are continuing to work on this issue.

ORIGINAL MESSAGE

Late last night a number of compute nodes went down. This included all of lonepeak and about 30 kingspeak nodes. In addition, the GPU nodes on ember are down. You can see the affected nodes by running ‘pbsnodes -nl’ on each of the affected clusters. There have also been reports of file system access problems in the protected environment.

CHPC staff are at the datacenter working to determine the cause and find a solution so that the nodes can be returned to service. The outage is believed to be heat related, as temperature fluctuations in the datacenter started late yesterday. We will provide more information when it is available.