2005 CHPC Downtimes and History

Arches Clusters Down: Thursday December 29th about 2 p.m. (unexpected downtime)

Posted: December 29, 2005

updated 4 p.m. 12/29/05

Systems affected: Arches Clusters

Date: Thursday December 29th, 2005

Duration: Began in the afternoon

Scope: One of the central administration servers for the Arches clusters (NorthWindow) has gone down. CHPC staff are currently working on the problem, and until it is resolved the Arches clusters are not available. It is not yet clear whether we will need to reboot all of the clusters, but we will keep you posted as we learn more.

About 4 p.m., CHPC staff reported that a controller had failed in our NFS root fileserver, NorthWindow, and that it had been replaced with a spare we keep on hand. We were able to bring the machine back up and onto the network.

In an attempt to save running jobs, CHPC staff are working through each node, checking and restarting services.

PVFS2 /scratch/global servers had to be restarted, and we are currently working through the clusters.




CHPC Downtime: Tuesday December 27th 8 a.m. until about Noon

Posted: December 16, 2005

CHPC Arches Downtime: Tuesday December 27th, 8 a.m. until about Noon

Systems affected: Arches Clusters

Date: Tuesday December 27th, 2005

Duration: 8 a.m. until approximately Noon

Scope: Systems maintenance of the Arches clusters. We will be draining the queues in anticipation of this downtime. The changes will include:

  1. Upgrading PBS/Torque to address some issues that surfaced after our 12/9/05 downtime.
  2. Updating the Myrinet drivers on delicatearch and landscapearch. Please note that Myrinet users will need to recompile. More details to follow.



Arches Cluster back up and scheduling jobs

Posted: December 9, 2005


The Arches clusters are back up and scheduling jobs as of about 8:30 p.m. (December 9th, 2005).

We urge users who use Myrinet to recompile their programs, as we have upgraded the Myrinet GM drivers. Also, PVFS2 is still finishing its check, so I/O to it may be slightly slower until it is done sometime later tonight.

As always, e-mail problems@chpc.utah.edu if you experience difficulties.



CHPC System/Network downtime: Friday December 9th, 2005

Posted: December 9, 2005


Systems affected: All Arches clusters and switches. Software updates.

Date: Friday December 9th, 2005 at 8:00 a.m.

Duration: Until sometime Saturday December 10th

Scope: Systems maintenance of the Arches clusters, including upgrading the firmware on the nests and upgrading PVFS. Maintenance on the Arches routers.

Important Notes:

  1. All /scratch space will be scrubbed: files older than two weeks will be purged. Please move important data before this downtime (one way to find files at risk is sketched after this list).
  2. PVFS2 will be upgraded to 1.3.2; anyone using MPI-IO will want to recompile their programs.
  3. GM will be upgraded to 2.0.23; Myrinet users will need to recompile their programs.
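
A minimal sketch of one way to list files that fall inside the purge window before the downtime. This is an illustration, not a CHPC tool: the directory path is a placeholder, and the assumption that the scrub keys off file modification time is ours, not a statement of the scrub's actual criterion.

    #!/usr/bin/env python
    # Minimal sketch: list files under a scratch directory whose modification
    # time is older than the two-week purge window, so they can be copied
    # elsewhere before the downtime.
    import os
    import time

    SCRATCH_DIR = "/scratch/global/your_username"  # hypothetical path
    CUTOFF = time.time() - 14 * 24 * 60 * 60       # two weeks ago

    for dirpath, dirnames, filenames in os.walk(SCRATCH_DIR):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < CUTOFF:
                    print(path)
            except OSError:
                # File may have vanished between listing and stat; skip it.
                pass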


CHPC System/Network downtime: Friday December 9th, 2005

Posted: November 30, 2005


Systems affected: All Arches clusters and switches. Software updates.

Date: Friday December 9th, 2005 at 8:00 a.m.

Duration: Until sometime Saturday December 10th

Scope: Systems maintenance of the Arches clusters, including upgrading the firmware on the nests and upgrading PVFS. Maintenance on the Arches routers.

Important Notes:

  1. All /scratch space will be scrubbed: files older than two weeks will be purged. Please move important data before this downtime.
  2. PVFS2 will be upgraded to 1.3.2; anyone using MPI-IO will want to recompile their programs.
  3. GM will be upgraded to 2.0.23; Myrinet users will need to recompile their programs.


Tunnelarch downtime: Sunday 11/13 - Sunday 11/20/2005

Posted: November 8, 2005


Systems affected: Tunnelarch

Date: Beginning Noon, Sunday, November 13th, 2005.

Duration: Until Noon, Sunday November 20th, 2005

Scope: The tunnelarch downtime (from Sunday, November 13th at Noon until the following Sunday, November 20th at Noon) is not just a demonstration; it will be used to solve a very significant bioinformatics problem. Details below.

mpiBLAST on the GreenGene Distributed Supercomputer: Sequencing the NT Database Against the NT Database (An NT-Complete Problem)

Abstract: The Basic Local Alignment Search Tool (BLAST) allows bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence.

Our open-source parallel BLAST --- mpiBLAST --- decreases the search time of a 300-kB query from 24 hours to 4 minutes on a 128-processor cluster. It also allows larger query files to be compared, something which is infeasible with the current BLAST. Consequently, we propose to compare the largest query available, the entire NT database, against the largest database available, the entire NT database. The result of this comparison will provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database. We estimate that the experiment will generate 100 TB of output to StorCloud.

Chair/Speaker Details:

Martin Swany (Chair)
University of Delaware

Wu Feng
Los Alamos National Laboratory

Mark Gardner
Los Alamos National Laboratory

Srinidhi Varadarajan
Virginia Tech

Jeff Crowder
Virginia Tech

Julio Facelli
University of Utah

Jeremy Archuleta
Los Alamos National Laboratory / University of Utah

Xiaosong Ma
North Carolina State University

Heshan Lin
Los Alamos National Laboratory / North Carolina State University

Venkatram Vishwanath
Los Alamos National Laboratory / University of Illinois at Chicago

Pavan Balaji
The Ohio State University



/scratch/parallel back: November 3rd, 2005

Posted: November 3, 2005


Systems affected: /scratch/parallel on arches clusters

Date: Thursday November 3rd, 2005

Duration: Back by about 1:45 p.m. on 11/03/2005

Scope: The update of /scratch/parallel was successful. Most of the problems we had encountered since the last upgrade are gone, including the crippling slowness of cp, vi, ...

One remaining problem we noticed is that dates and permissions on files created by MPI-IO are not right, but they do not seem to pose a problem in production runs.

Please start using /scratch/parallel again if you were using it before, and report any problems you see to us.



/scratch/parallel downtime: November 2nd, 2005

Posted: November 2, 2005


Systems affected: /scratch/parallel on arches clusters

Date: Thursday November 3rd, 2005

Duration: Unknown. Estimate a few hours.

Scope: We will take /scratch/parallel down tomorrow to apply some critical patches that fix problems which have made its usage difficult lately. We will NOT erase the data; only the filesystem will be inaccessible during the downtime. We don't have an estimate for the duration of the downtime, but if all goes well it should not take more than several hours.



CHPC /scratch/parallel downtime: Thursday October 20th, 2005

Posted: October 12, 2005

CHPC /scratch/parallel downtime: Thursday October 20th, 2005. We will upgrade the PVFS2 file system that runs /scratch/parallel. The time and duration of the downtime are to be determined.

Systems affected: Arches clusters that mount /scratch/parallel. This file system will not be available during the downtime, and all user files on the file system will be erased.

Details: The PVFS2 installation running /scratch/parallel will be upgraded to fix several bugs that were preventing some users from using the file system efficiently.
We appeal to all users currently using /scratch/parallel to:
1. Copy all important files off the file system before the downtime. The complete file system will be wiped out during the upgrade.
2. Not submit any jobs on Arches that could use /scratch/parallel during the downtime window. It would be best to stop submitting such jobs right now, and to check all queued jobs a few days before the downtime to make sure they are not set to use /scratch/parallel. Any jobs that use this file system during the downtime will crash and prolong the downtime, due to the need to clean the affected nodes. We will not give any time refund for time lost due to this.



CHPC Network/System downtime: Thursday September 29th, 2005

Posted: September 14, 2005

updated: September 27th, 2005

updated: September 28th, 2005

CHPC Network/System downtime: Thursday September 29th, 2005. Arches clusters down at 5:00 p.m. (Komas). Some home directories will be unavailable. Expect systems to be available sometime the morning of September 30th.

Systems affected: All systems in the Komas machine room, including all Arches clusters and fileserv2, beginning at 5:00 p.m. Home directories for many users will be unavailable.

Date: Thursday September 29th, 2005

Duration: Undetermined. We will begin servicing systems in the Komas machine room at 5:00 p.m. Home directories should be available in a few hours. The Arches clusters will be available sometime the next morning.

Scope: The Arches clusters and fileserv2 will go down at 5:00 p.m. for maintenance and for the cleaning and maintenance of the coolers in the Komas machine room. Power maintenance in the SSB machine room will not require systems to go down.



SSB Machine Room Power Outage this Saturday (July 30th 7am-10am)

Posted: July 27, 2005


There will be a power outage affecting CHPC's SSB machine room this Saturday, July 30th from 7am to 10am.

The clusters housed in the room, that is, Icebox, Sierra, and Slickrock, will be shut off, as well as most of the file servers.

We have created reservations on the Arches clusters to drain them as well, since jobs would fail while home directories housed on the downed fileservers are unavailable.

We will also take this opportunity to do some maintenance on fileserv2 and on the new landscapearch nodes, which we expect to be done by the time the power is back up.

We plan to have Arches resume scheduling soon after the power is back up and the fileservers are booted, while the three clusters housed in SSB may take a few more hours to come up.

As always, we are sorry for the inconvenience; don't hesitate to contact us with any questions.



Icebox down, 9 pm Wednesday, July 13th, 2005

Posted: July 14, 2005


On Wednesday night, the cooler located in the center of our SSB machine room failed and shut down. This caused a large imbalance of cooling in the data center, with the east half of the room reaching temperatures high enough for ICEBOX hardware to fail due to overheating.
CHPC staff attempted to restart the compressors on the failed unit, without success.
Our Liebert service provider, "Mountain Valley", was consulted that night and was scheduled to arrive between 8 and 9 a.m. on Thursday to work on the failed Liebert cooler.
Until then, ICEBOX will remain down; it will be brought online and diagnosed when cooling is restored. Availability will depend on the timeframe of the cooler repair and the amount of damage caused by the overheating.
We apologize for any inconvenience due to this failure.



Major CHPC Network/System Downtime, 5 pm Thursday, July 7th, 2005

Posted: June 23, 2005


Systems affected: All systems in the Komas machine room and some fileservers. Home directories for many users will be unavailable.

Date: Thursday July 7th, 2005

Duration: Undetermined. Will begin at 5 p.m.

Scope: Repair of coolers in the Komas machine room. All Arches clusters will be down. System maintenance on fileserv2 (home directories in /uufs/inscc.utah.edu/common/home) and several of the /scratch filesystems on Arches.



Major CHPC Network/System Downtime, 8 am - 5 pm Saturday, May 21st, 2005

Posted: May 10, 2005


Systems affected: All routing in INSCC. Systems in the INSCC machine room will be shut down. All HPC systems, including the Arches clusters, Sierra, and Icebox.

NOTE: *** If you have equipment in the INSCC machine room, you will need to make sure it is down before 8 a.m. the morning of May 21st. ***

Date: Saturday May 21st, 2005

Duration: 8 a.m. until 5 p.m.

Scope: The coolers in the INSCC machine room will be repaired. CHPC will take advantage of this downtime to upgrade a fileserver and some of the Icebox nodes.



CHPC Downtime - Arches Clusters (emergency)

Posted: May 7, 2005

updated: May 7th, 2005 (about 4pm)
updated: May 7th, 2005 (about 1pm)

CHPC Downtime: All Arches Clusters, Saturday, May 7th, 2005, from 10:00 a.m. until about 4:00 p.m.

Systems affected: All Arches clusters; Komas machine room cooling failed due to a power problem.

About 10:00 a.m. there was a power problem in the Komas machine room, causing the coolers to fail. We shut down the Arches clusters as the temperature was getting dangerously high. CHPC staff are on site and working to get everything back online.

The coolers came back online and were cooling the machine room by about 1:00 p.m. As soon as the room was sufficiently cooled, CHPC staff began to bring the Arches clusters back online. All went smoothly, and the clusters were back and scheduling jobs by 4:00 p.m.



Sierra cluster not fully functional

Posted: May 2, 2005

updated May 3rd, 2005

Sierra cluster not fully functional (back by about 4pm)

The Sierra cluster had some trouble the afternoon of May 2nd. CHPC staff looked into the problem, and the cluster was returned to normal operation about 4:00 p.m. We apologize for the inconvenience.



Power down in Komas building (4/20/05 from 4:30-6:00pm): Arches cluster down (until approx. 8:30pm), all routing for INSCC down.

Posted: April 20, 2005


The power dropped in the Komas building in Research Park about 4:30 p.m. on April 20th, 2005. This power outage took down all of the Arches clusters, as well as all routing for the INSCC building. A few weeks ago, CHPC network staff had to move all of the INSCC routing to Komas as a workaround for a code bug; the fix for that bug is still outstanding. All routing between networks for the Komas cluster machine room, SSB machine room, INSCC machine room, and the INSCC building was non-functional until power returned about 6:00 p.m. CHPC network and systems staff were on site, working with electricians and UP&L to restore power as soon as possible. The Arches clusters were back scheduling jobs around 8:30 p.m.



HPC home directory file server crashes

Posted: April 9, 2005

updated: April 11th, 2005

HPC home directory file server crashes three times over the April 9-10th, 2005 weekend

fileserv2, which hosts users' home directories, crashed three times over the April 9-10th, 2005 weekend. As a result, jobs belonging to users whose home directories are hosted on this server (i.e., those who don't have their own departmental fileservers) and that wrote there either crashed or spent a lot of time waiting on I/O, and may have run out of walltime as a result. If a job was writing data to the scratch servers and was supposed to copy it back to the home directory at the end, the data may not have been copied. Some jobs that tried to start during the fileserver downtime could not find their data and may have sent a lot of mail to the user's e-mail box. If a write was in progress at the time of a crash, the data may be corrupt. Please check the results you obtained during the weekend with extra care.



Arches Cluster unscheduled downtime

Posted: March 31, 2005

Arches Cluster unscheduled downtime Thursday, March 31st, 2005 about 5:00 p.m.

At approximately 10:30 p.m. Thursday, we started the schedulers again after bringing Arches back up. TS Electric and Randy Green reported that UP&L had suffered a power bump on the Hogle Zoo substation. This triggered our flywheel and UPS and resulted in the loss of power to the bulk of the Arches compute nodes. TS Electric gave us the green light to bring the system back up at 7:30 p.m. The system came back up without any real problems. All jobs that were running were lost. Jobs that were idle in the queues have started to run, and the system is open for regular use at this time. Thanks for your patience.



Arches Cluster Back Online

Posted: March 23, 2005

Arches Cluster Back Online Wednesday, March 23rd, 2005 about 6:00 p.m.

The Arches clusters are now open to users and scheduling jobs after the extended downtime. Two changes of note were made during this downtime:

  • All nodes/servers are running the newer kernel, 2.6.11.
  • GM was upgraded to 2.0.19 on delicatearch and landscapearch.

Thanks for your patience on this extended downtime. We had some unexpected delays and apologize for the inconvenience. Please let us know if you run into problems by sending a report to problems@chpc.utah.edu.



CHPC Downtime - KOMAS Machine room (extended)

Posted: March 21, 2005

updated: March 23rd, 2005

CHPC Downtime: Monday, March 21st, 2005 at 10:00 p.m.

Systems affected: All Arches clusters; critical repair of the Komas machine room cooling system.

Date: Monday March 21st - Wednesday March 23rd (extended from 3/22)

Duration: Availability expected sometime Wednesday March 23rd, after replacement and testing of the cooling system. (Our systems staff ran into unexpected problems during the nfsroot image upgrade, and as a result the downtime was extended beyond the original estimate.)

Details: A critical repair of the cooling system will take place beginning at 10:00 p.m. tonight, Monday March 21st. CHPC will take advantage of this opportunity and move up most of the maintenance planned for the March 31st scheduled downtime, which has been cancelled. We apologize for any inconvenience.



Sierra Cluster rebooted, approx 3:45 pm March 16th, 2005

Posted: March 16, 2005


On the afternoon of March 16, 2005, the Sierra cluster had some system problems that caused some jobs to die. We rebooted the system about 3:45 p.m.



Downtime: marchingmen cluster, Tuesday March 8th, 2005 from 8am

Posted: March 2, 2005

updated: March 9, 2005
updated: March 8, 2005

Downtime: marchingmen cluster, Tuesday March 8th, 2005 from 8am
Resumed user access and scheduling jobs about 9am the morning of March 9th, 2005.

Details: The kernel was upgraded on the compute nodes, and another scratch filesystem was added that is visible only to the marchingmen compute and interactive nodes. The new scratch space is available at /scratch/mm. Additional scratch space was added to delicatearch at the same time, which did not require a downtime; this space is available at /scratch/da.



Icebox outage, morning of February 25, 2005

Posted: February 25, 2005

updated: March 1, 2005

Icebox outage, morning of February 25, 2005

On the morning of February 25, 2005, icebox, the IA-32 cluster, had some major system problems that required us to take it offline. Our systems staff worked to solve this recurring problem, and the system was opened to users again the morning of March 1st, 2005. We apologize for any inconvenience.



Icebox outage, morning of February 24, 2005

Posted: February 24, 2005


On the morning of February 24, 2005, icebox, the IA-32 cluster, had some major system problems that required us to take it offline. It was back scheduling jobs by about 11:40 a.m. We apologize for any inconvenience.



Icebox outage, morning of February 23, 2005

Posted: February 23, 2005


On the morning of February 23, 2005, icebox, the IA-32 cluster, had some major system problems that required us to reboot most of the system. Icebox returned to scheduling by about 11 a.m. that morning. We apologize for any inconvenience.



Major CHPC Downtime: From 5 pm on 2/3/05 until about 4 am on 2/4/05.

Posted: January 27, 2005

updated February 4th, 2005
re-posted: January 12, 2005

Major CHPC Downtime - All HPC systems and INSCC networking, from 5:00 pm 2/3/05 until about 4 am Friday 2/4/05.

Systems affected: All networking in the INSCC building, SSB machine room and Komas machine room. All of the HPC systems including the Arches Clusters: marchingmen, delicatearch, tunnelarch and landscapearch; icebox and sierra. There will be a router upgrade at this time as well.

Date: Beginning 5:00 pm on Thursday February 3rd, 2005

Duration: Until about 4 am on Friday February 4th, 2005

Details: There will be a power outage at Komas. All networking will be down for the INSCC building, SSB machine room, and Komas machine room. Home directories on HPC systems with a path of /uufs/inscc.utah.edu/common/home are planned to be moved to a new server with policy changes; see Migration of /uufs/inscc.utah.edu/common/home to new fileserver.



Major CHPC Downtime: February 3rd, 2005 from 5 pm - about 4am February 4th, 2005. All systems and networks.

Posted: January 12, 2005

Major CHPC Downtime - All HPC systems and INSCC networking, from 5:00 pm 2/3/05 until about 4:00 am 2/4/05.

Systems affected: All networking in the INSCC building, SSB machine room and Komas machine room. All of the HPC systems including the Arches Clusters: marchingmen, delicatearch, tunnelarch and landscapearch; icebox and sierra.

Date: Beginning 5:00 pm on Thursday February 3rd, 2005

Duration: Lasted until about 4:00 am on Friday February 4th, 2005

Details: There will be a power outage at Komas. All networking will be down for the INSCC building, SSB machine room and Komas machine room.



Unscheduled Downtime: Arches Clusters

Posted: January 8, 2005

Unscheduled Downtime - Arches Clusters, from approximately 10:00 am Saturday 1/8/05 until approximately 6:00 pm Sunday 1/9/05

Due to a Cooler Failure in the Komas Machine Room

Arches Clusters DOWNED about 10:00 am Saturday January 8th, 2005

Arches Clusters UP about 6:00 pm Sunday January 9th, 2005

Systems affected: All of the Arches Clusters including: marchingmen, delicatearch, tunnelarch and landscapearch

Date: Beginning 10:00 am on Saturday 1/8/05

Duration: Until 6:00 pm on Sunday 1/9/05.

Details: The machine room at Komas was dangerously overheating. CHPC staff shut down all systems in the machine room to prevent equipment damage. The cooler was repaired, and the room was returned to normal operating temperatures.
