2008 CHPC Downtimes and History

CHPC DOWNTIME: December 16 starting at 8am - COMPLETED by 5 p.m.

Posted: December 10, 2008

Event date: December 16, 2008

Duration: Most of the day

Systems Affected/Downtime Timelines:

  • 8:00 a.m. - 10:30 a.m. - intermittent networking outages in INSCC

  • 8:00 a.m. - back up about 3:20 p.m. - Desktops mounting CHPCFS filesystems

  • 8:00 a.m. - back up by 5 p.m. - All Clusters (arches, telluride, updraft, meteo nodes)

Arches Downtime Duration:

This downtime was for maintenance of the cooling system in the Komas datacenter, and therefore required all clusters housed in the data center to be down from 8am until about 5pm. CHPC will take advantage of this downtime to do a number of additional tasks, including work on the network in the morning and on file servers and clusters for most of the day.

All file systems served from CHPCFS will be unavailable for a good part of the day. This includes HPC home directory space as well as departmental file systems supported by CHPC. We will work to get things online as soon as possible.


CHPC Unplanned Outage: CHPCFS fileserver, October 9th, 2008 from 11:30 a.m. until approximately 6:00 p.m.

Posted: October 9, 2008

Duration: October 9th, 2008 from 11:30 a.m. until approximately 6:00 p.m.

Systems Affected/Downtime Timelines:

All systems mounting file systems from CHPCFS. Outage began about 11:30 a.m. and we plan to have most file systems mounted and stable by 6:00 p.m.

Arches Downtime Duration:

We are continuing to see problems with the file server as we work to correct the hardware failure of yesterday (10/8) morning. In order to bring all file systems back to a stable state, we have decided to take down the file server temporarily. We sincerely apologize for the inconvenience.


CHPC Unplanned outage: CHPCFS file server, October 8th 7:20 a.m. until about Noon

Posted: October 9, 2008

Duration: October 8th 7:20 a.m. until about Noon

Systems Affected/Downtime Timelines: All systems mounting file systems from CHPCFS. Outage began about 7:30 a.m. and most file systems were available by Noon.

Arches Downtime Duration:

The CHPCFS file server is offline which affects all systems mounting file systems from this server.

During a planned service outage from 7:30-8 a.m. today we had a storage controller failure resulting in a delay in getting the fileserver back up. We are taking action to get the bulk of CHPCFS up as soon as possible but a few groups may take longer. We will send out more information as it becomes available. We will contact the individual groups of the affected file systems with more details and the recovery time frame as more is known about the failure.

We apologize for the inconvenience and are working hard to correct the problem.


CHPC Major downtime: from 9 a.m. 9/27 (Sat) until sometime 9/28, (Sun)

Posted: September 19, 2008

Duration: From 9 a.m. Saturday 9/27 until sometime Sunday 9/28

Systems Affected/Downtime Timelines:

  • All equipment in the SSB machine room, including all file servers supported by CHPC: starting at 9 a.m. Saturday

  • The Arches clusters, including telluride, and the meteorology compute servers: beginning at 9 a.m. Saturday

  • The network equipment serving INSCC and the clusters: intermittently unavailable in the late evening on Saturday

Arches Downtime Duration:

There will be some major work on the electrical equipment serving the SSB machine room from Saturday 9/27 until Sunday 9/28. CHPC will take advantage of this downtime to do some system and network administration (originally planned for Oct. 14). At 9:00 a.m. on Saturday, September 27th, access to desktops mounting CHPC-supported file systems will be unavailable and the clusters will be taken down. Jobs running on the clusters will have been drained from the queues. The networks will be unavailable for periods in the late evening on Saturday. We expect everything to be back online sometime on Sunday 9/28.


**MAJOR CHPC DOWNTIME** July 15th 2008 from 8:00 a.m. until approximately 9:00 p.m.

Posted: July 3, 2008

Duration: 8:00 a.m. until about 9:00 p.m.

Systems Affected/Downtime Timelines: INSCC Networks: 8:00 until 8:30 a.m. INSCC Desktops: 8:00 a.m. until approximately 2:00 p.m. HPC Clusters (Arches, Telluride): 8:00 a.m. until approximately 9:00 p.m.

Arches Downtime Duration:

Maintenance will be performed on the coolers in the Komas data center, requiring the clusters to be powered off. CHPC will also be performing system maintenance on some networking equipment. While the INSCC networks technically are not going down, because of core server changes (NIS, DNS) it may appear to users that the network in INSCC is not functional between 8:00 and 8:30 a.m. Desktops maintained by CHPC (mounting CHPC file servers) will be affected from about 8:00 a.m. until approximately 2:00 p.m. while maintenance is performed on several service machines as well as home directory filesystems. Once the cooling maintenance is complete, CHPC will perform system maintenance on the Arches and Telluride clusters. We expect to have the HPC systems up and scheduling jobs by approximately 9:00 p.m.

Downtime Summary:
  • INSCC Networks: 8:00 until 8:30 a.m.
  • INSCC Desktops: 8:00 a.m. until approximately 2:00 p.m.
  • HPC Clusters (Arches, Telluride): 8:00 a.m. until approximately 9:00 p.m.

CHPC Fileserver downtime: June 10th, 2008 5-7 p.m.

Posted: June 5, 2008

Arches Downtime Duration:

All file systems served off the CHPCFS file server will be down for maintenance for 2 hours on Tuesday, June 10th, 2008 from 5 until 7 p.m.

These filesystems include:
  • Default home directories for HPC users
  • Nearline - (voth)
  • BMI
  • Meteorology (new non-iGrid spaces)
  • BIO Cheatham
  • INSCC home directories
  • CHPC staff

The arches clusters will continue to run jobs, but the schedulers and resource managers will be shut down during the outage. This means you will not be able to run commands such as showq, qstat, qsub etc.

If your home directory on your desktop is one of those affected, you may want to shut down your desktop prior to the downtime and/or reboot it after the downtime.


Unscheduled Network Outage of arches clusters and telluride - 4/7/2008

Posted: April 7, 2008

Duration: Unknown

Systems Affected/Downtime Timelines: Connectivity of all of the arches clusters and telluride.

Arches Downtime Duration: We've had a switch go down which affects connectivity to all of the arches clusters and telluride. Our staff are working on the problem. We'll send an update when we know more.


CHPC DOWNTIME: Starts Tuesday March 18, 2008 at 5PM

Posted: March 12, 2008

Arches Downtime Duration:

SCOPE: HPC(Arches), Network, desktop access to filesystems

DURATION:

Network access to INSCC should be restored at about 10pm on March 18th

Desktop access to fileservers (home directory access) will be restored during the morning of March 19th - a message will be sent to users when the systems are up and ready to be used

Arches will be back up sometime later in the day on March 19th - again a message will be sent when it is ready for use. Reservations have been set so that any job that cannot finish before 5pm on March 18th will not be started. Jobs waiting in the queue will be started once the downtime has finished.
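
The effect of the reservation on job start can be illustrated with a minimal sketch. This is not the scheduler's implementation, only the decision rule the notice describes: a job starts only if its requested walltime ends by the start of the downtime. The function name, the sample times, and the choice to allow a job ending exactly at 5pm are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Start of the downtime, taken from the notice: 5pm on March 18, 2008.
DOWNTIME_START = datetime(2008, 3, 18, 17, 0)


def can_start_before_downtime(now, walltime, downtime_start=DOWNTIME_START):
    """Return True if a job started at `now` with the requested `walltime`
    would end no later than `downtime_start` (hypothetical helper)."""
    return now + walltime <= downtime_start


# Hypothetical example: it is 9am on the day of the downtime.
now = datetime(2008, 3, 18, 9, 0)
print(can_start_before_downtime(now, timedelta(hours=6)))   # True: would end at 3pm
print(can_start_before_downtime(now, timedelta(hours=12)))  # False: would end at 9pm
```

Under this rule, requesting a shorter walltime is what allows a job to start in the hours leading up to the downtime; a job that does not fit simply stays queued until the downtime ends.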


CHPC Major Downtime: 3/18/2008

Posted: February 27, 2008

Duration: From 5 p.m. Tuesday 3/18 until 5 p.m. 3/19

Systems Affected/Downtime Timelines: All CHPC networks, arches, fileservers, desktops

Arches Downtime Duration:

Major CHPC Downtime

Core Infrastructure down at 5 p.m. on 3/18, back up by Midnight, including: fileservers etc., SSB and INSCC dependencies.

Arches Cluster down 5 p.m. 3/18 until 5 p.m. 3/19


CHPC Batch Systems Paused: Tuesday February 5, 2008

Posted: February 4, 2008

Duration: Downtime starts at 4pm and will last about an hour.

Arches Downtime Duration: Systems affected:

All of Arches and any computation cluster under batch control

The clusters impacted by this are: sanddunearch; delicatearch; marchingmen; tunnelarch; landscapearch and telluride.

We will be pausing the moab schedulers on all CHPC computational clusters under batch control, tomorrow, Tuesday February 5th, 2008 at 4:00 p.m., for about an hour. This is to perform system maintenance on one of our administrative systems.

Scope: No new jobs will be started during this period of time. You may still queue jobs and look at the queues, and running jobs will continue to run.

Please let us know if you have any questions.


CHPC DOWNTIME: Thursday January 3, 2008

Posted: December 19, 2007

Event date: January 3, 2008

Duration: Downtime starts at 3pm and will last until sometime early morning on January 4, 2008

Arches Downtime Duration:

Systems affected:

All of Arches and CHPC/INSCC Network

After this downtime all users will be using the campus uNID and password for authentication on all HPC systems (and other Linux systems administered by CHPC). Windows users will use the uNID and current password for authentication.

Arches:

All clusters will be down from 3pm to allow for updates to the OS and for the other changes outlined below. The Batch Queues will be drained of all running jobs. Reservations are in place so that jobs will not be started if they will not finish before the start of the downtime. Jobs that are queued but not running will be started after the downtime ends. The one exception to this is if you are being moved to using your UNID for authentication during this downtime (see below); in this case any queued jobs you have will need to be deleted. The clusters will be down until sometime the following morning.

**MIGRATION TO NEW FILESERVER: Some CHPC Users, those on the CHPC owned home directory filesystems i.e., those with home directories /uufs/inscc.utah.edu/common/home/USERID - will be migrated to a new, larger fileserver during this downtime. If you are one of these users your new home directory path will be /uufs/chpc.utah.edu/common/home/UNID

**CHANGE TO UNID: All CHPC users who are not already using their UNID as the CHPC login will be switched to doing so. If you do not have a UNID you will need to get one BEFORE this downtime. All University of Utah students and employees automatically have a UNID. If you are not part of the University of Utah, you need to fill out a Person of Interest (PoI) form to be assigned a UNID. This form can be found at http://www.hr.utah.edu/forms/lib/u-affiliate-poi-form.pdf.
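
For users whose scripts contain hard-coded home directory paths, the path change described above can be sketched as a simple prefix rewrite. This is an illustrative sketch only, not a CHPC-provided tool: the function name is hypothetical, and the login `jdoe` and UNID `u0123456` are made-up placeholders. Only the two path prefixes come from the notice.

```python
# Path prefixes taken from the migration notice.
OLD_PREFIX = "/uufs/inscc.utah.edu/common/home/"
NEW_PREFIX = "/uufs/chpc.utah.edu/common/home/"


def migrate_home_path(path, old_userid, unid):
    """Rewrite `path` from the old home directory of `old_userid` to the
    new home directory of `unid`. Paths outside that home directory are
    returned unchanged (hypothetical helper)."""
    old_home = OLD_PREFIX + old_userid
    # Match the home directory itself or anything beneath it, but not
    # another user whose login merely starts with the same characters.
    if path == old_home or path.startswith(old_home + "/"):
        return NEW_PREFIX + unid + path[len(old_home):]
    return path


# Placeholder login and UNID, for illustration only.
print(migrate_home_path(
    "/uufs/inscc.utah.edu/common/home/jdoe/run/input.dat", "jdoe", "u0123456"))
```

Paths written relative to `$HOME` need no change, since the login environment is expected to point at the new location after the migration.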

Network Outage:

All networking in CHPC/INSCC will be down from about 5-7pm
