2013 CHPC Downtimes and History

Komas Datacenter Downtime: Tuesday March 12, 2013 for critical service to the cooling tower

Posted: March 8, 2013

Event date: March 12, 2013

Duration: Clusters in the Komas Datacenter will be down from 7:00 a.m. until about 5:00 p.m.

Systems Affected/Downtime Timelines:

All clusters in the Komas Datacenter, including Ember, Updraft and Sanddunearch, will be down from 7 a.m. until about 5 p.m. Scratch space will remain up unless the temperature gets too high during this maintenance, in which case we will need to shut down these servers as well.

The folks who service our cooling system (CMMS) in the Komas Datacenter have notified us that the cooling system is in critical need of service, and they are very concerned about the current situation. It is believed that if we see temperatures above 60 to 65 degrees, we may have an outage on the cooling system.

We have set reservations to drain the queues on the clusters, and we expect to make it to Tuesday morning for a graceful shutdown. However, there is a chance that the cooling will fail in the meantime, in which case we will have an emergency downtime before the scheduled time.

CMMS will take the coolers offline at 8 a.m. Tuesday morning, so we need to shut the clusters down at 7 a.m. They expect to be finished by 2:00 p.m., so we can begin bringing the clusters back up as soon as that is complete. It usually takes 2-3 hours for the clusters to be up and made available to users.

Please let us know if you have any questions by sending email to issues@chpc.utah.edu.


CHPC unexpected outage: Home directories for a subset of users, 2/25/2013

Posted: February 25, 2013

Duration: Estimated to last until sometime on 2/28/2013; individual file systems may be available sooner

Dear HPC Users,

Update: As of 4 p.m. on 2/27, we are finding it very difficult to estimate how long the restore will take. We have come up with a way to make file systems available as they complete, and we will notify individual groups when their space is ready. At this point, we expect the restore to continue running at least through the night and partway into tomorrow, 2/28.

Update: The restore continues to run. Our best guess is that file services will be restored for those affected sometime tomorrow morning, 2/27/2013. Thank you again for your patience.

The group home directories listed below are currently down and will remain down for an extended period of time. We have had a disk failure, and the built-in redundancy measures we had in place also failed to work as expected. We are currently restoring from backup, but we will not be able to bring anything back online until the restore has completed, which we expect to take a full day or more. We sincerely apologize for the inconvenience and will keep you posted as we progress and as time estimates improve.

[baron-home]
[cheatham-home]
[cliu-home]
[garrett-home]
[gregg-home]
[horel-home]
[jenkins-home]
[jiang-home]
[krueger-home]
[lin-home]
[mace-home]
[paegle-home]
[perry-home]
[reichler-home]
[smithp-home]
[steele-home]
[steenburgh-home]
[strong-home]
[whiteman-home]
[yandell-home]
[zipser-home]
[zhdanov-home]
[zpu-home]


CHPC Major Downtime: Tuesday, January 15th, 2013, beginning at 7:00 a.m.; end time unknown

Posted: January 8, 2013

Event date: January 15, 2013

Duration: From 7 a.m. on January 15th, clusters will be down most of the day. For other services, see below.

Systems Affected/Downtime Timelines: During this downtime, maintenance will be performed in the datacenters, requiring many systems to be down most of the day. Tentative timeline:

  • HPC Clusters: beginning at 7:00 a.m. and lasting most of the day
  • File Servers: CHPCFS will remain up. Redbutte will also stay up, but a number of groups will have outages while their home directories are migrated from oquirrh to redbutte. The affected spaces are:
    • baron-home
    • cheatham-home
    • cliu-home
    • garrett-home
    • gregg-home
    • horel-home
    • jenkins-home
    • jiang-home
    • krueger-home
    • lin-home
    • mace-home
    • paegle-home
    • perry-home
    • reichler-home
    • smithp-home
    • steele-home
    • steenburgh-home
    • strong-home
    • whiteman-home
    • yandell-home
    • zhdanov-home
    • zipser-home
    • zpu-home
  • Network outages: No outage
  • Virtual Machines: No outage
  • Software License Server: 15-minute outage sometime between 8 and 10 a.m.
