Current information on Hardware Failure of Feb 25th and the Current Status of the Restoration

Posted: March 8, 2013

On Monday, February 25, CHPC experienced a major file system failure, which impacted the home directories for about 275 of our users. The initial report listed the groups involved. Below we provide additional information about this event and the current status of the file restoration.

CHPC is still working with the hardware vendor (HP) to determine the cause of the failure. So far we know that it was not a single failure, but a combination of failures in the controller and the disk. The analysis of the failure is ongoing.

Please note that the damaged equipment is not in service; the restorations are being performed onto replacement hardware. Also, all restored files are coming from the backup tapes, not from the damaged hardware, so the integrity of the restored files has not been in question.

Here is an overview of the restoration process to date:

The majority of the users had their home directories back online by Saturday, March 2nd. These file systems were restored from the last full backup, which was started on Friday, February 22 and was still running when the disk failure occurred. In reviewing the logs of the restoration, we noticed POTENTIAL missing files in some of these home directories; these users were notified of this fact on March 2nd (more below).

There was a subset of 40 users whose home directories were not restored at all on the initial attempt. Most of these cases have been traced to the directories not having been backed up in the last full backup (the one running the weekend the failure occurred). All of these users were contacted over the weekend via e-mail. These users had a “new” home directory created with the standard CHPC dot files; their restores were done to a different location and rsynced over when complete. These restores were done up through the last incremental backup, started on February 21. This process was completed on March 7th, and each user was notified when the restore of their home directory was completed.

As mentioned above, during this restoration process some files were flagged with a “did not restore” message in the logs, and we now have a list of these files. Most of the users impacted by the failure are not on this list. Some who are on this list were notified over the weekend that their file restore may not be complete; others noticed that files were missing and have let us know. We are now focusing on restoring these files on a case-by-case basis. Anyone who discovers that they have missing file(s) should let CHPC staff know by opening an issue report.

Finally, for all involved, we are pulling and storing all tapes that hold the data lost in the hardware failure, so that we will have them in case any user needs us to look for a missing file in the future. Backups will continue, starting this weekend, on alternate tapes.