2005 CHPC News Announcements
Arches Clusters Down: Thursday December 29th about 2 p.m. (unexpected downtime)
Posted: December 29, 2005
updated 4 p.m. 12/29/05
Arches Clusters Down: Thursday December 29th about 2 p.m. (unexpected downtime)
Systems affected: Arches Clusters
Date: Thursday December 29th, 2005
Duration: Began in the afternoon
Scope: One of the central administration servers for the arches clusters (north window) has gone down. CHPC staff is currently working on this problem and until it is resolved the arches clusters are not available. It is not clear at this point if we will need to re-boot all of the clusters but as we learn more, we will keep you posted.
About 4 p.m. CHPC staff reported that a controller had failed in our NFS root fileserver NorthWindow and that we replaced it with a spare we keep on hand. We were able to bring the machine back up and onto the network.
In an attempt to save running jobs, CHPC staff are working through the nodes one by one, checking and restarting services on each.
PVFS2 /scratch/global servers had to be restarted, and we are currently working through the clusters.
Myrinet drivers upgraded on Arches (from GM to MX)
Posted: December 29, 2005
Myrinet drivers upgraded on Arches (from GM to MX)
As mentioned in a separate e-mail, we will upgrade the Myrinet drivers on the Arches clusters (Delicatearch and Landscapearch) from the GM to the MX library on Tuesday.
This is a somewhat more significant upgrade than the usual GM version upgrade we occasionally perform. MX is a completely new driver and also requires a new MPI distribution. This means that users will not only have to recompile their MPI codes that use Myrinet, but also change their Makefiles and similar build files to reflect the location of this new MPI distribution.
The MPI is called MPICH-MX. For use with GNU and Pathscale compilers, use
/uufs/$(CLUSTER)/sys/pkg/mpich-mx/std where $(CLUSTER) is either delicatearch.arches, or landscapearch.arches.
For PGI compilers, use
/uufs/$(CLUSTER)/sys/pkg/mpich-mx/std_pgi
So, for example, to use GNU gcc with MPICH-MX, call
/uufs/$(CLUSTER)/sys/pkg/mpich-mx/std/bin/mpicc
For those users that add their own library paths to their Makefiles, adding this as a single line to LDFLAGS or similar will link MPICH-MX and the MX library:
-L/uufs/$(CLUSTER)/sys/pkg/mpich-mx/std/lib -lmpich -L/uufs/$(CLUSTER)/sys/pkg/mx-2g/std/lib64 -lmyriexpress -Wl,-rpath=/uufs/$(CLUSTER)/sys/pkg/mx-2g/std/lib64 -lpthread
Similarly, add header file search path to CFLAGS or FFLAGS as:
-I/uufs/$(CLUSTER)/sys/pkg/mpich-mx/std/include
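The include and link flags above can be collected in shell variables before being pasted into a Makefile. The following is a minimal sketch, not a tested build recipe: the variable names MPICH_MX and MX and the source file hello.c are illustrative, the cluster is assumed to be delicatearch, and the script only prints the compile line rather than running it.

```shell
#!/bin/sh
# Assumed cluster; use landscapearch.arches on the other Myrinet cluster.
CLUSTER=delicatearch.arches

# Install locations quoted in the announcement (GNU/Pathscale build).
MPICH_MX=/uufs/$CLUSTER/sys/pkg/mpich-mx/std
MX=/uufs/$CLUSTER/sys/pkg/mx-2g/std

# Header search path for CFLAGS or FFLAGS.
CFLAGS="-I$MPICH_MX/include"

# Library flags for LDFLAGS: MPICH-MX, the MX library, and pthreads.
LDFLAGS="-L$MPICH_MX/lib -lmpich -L$MX/lib64 -lmyriexpress -Wl,-rpath=$MX/lib64 -lpthread"

# Dry run: show the command instead of executing it (hello.c is hypothetical).
echo "gcc $CFLAGS -o hello hello.c $LDFLAGS"
```

For the PGI build, the same sketch would presumably point MPICH_MX at the std_pgi directory instead.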
To run the code, use the command mpirun.ch_mx. The syntax is the same as GM's mpirun.ch_gm, although not all flags of mpirun.ch_gm are implemented yet. If you use some of the more exotic flags, you may want to run "/uufs/$(CLUSTER)/sys/pkg/mpich-mx/std/bin/mpirun.ch_mx --help" to check whether the feature you used with GM is available. In general, I don't expect anybody to get stuck here. Bottom line: run as you did before, except for changing "gm" to "mx" in mpirun.
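Switching the launcher name in existing job scripts can be done with a one-line text substitution. This is a small sketch under assumptions: the function name gm_to_mx and the sample launch line are made up for illustration, and the -np flag is assumed to carry over from mpirun.ch_gm; check the --help output for anything beyond the basics.

```shell
#!/bin/sh
# Rewrite GM launcher calls to the MX launcher in text read from stdin.
gm_to_mx() {
    sed 's/mpirun\.ch_gm/mpirun.ch_mx/g'
}

# Example: a launch line as it might appear in a job script (hypothetical).
echo 'mpirun.ch_gm -np 4 ./my_program' | gm_to_mx
# prints: mpirun.ch_mx -np 4 ./my_program
```

To convert a whole script, the function can be fed a file, for example gm_to_mx < old.pbs > new.pbs, where both file names are placeholders.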
We have already built the MX library and MPICH-MX and encourage users to start rebuilding their codes now. Unfortunately, there is no way to test the codes you build before the upgrade, but MPICH-MX itself has been tested.
We've built several commonly used programs with MPICH-MX, tested them, and found that most of them improve in performance by 5-10%. This is mainly due to message latency being cut roughly in half. There are some improvements that can be done with higher bandwidth, so if you experience slower performance, contact us for suggestions on improvement.
The apps we tested include Amber, NAMD, DLPOLY, DLEVB and VASP. Users of those are encouraged to contact me for correct build flags.
CHPC Downtime Arches Downtime: Tuesday December 27th 8 a.m. until about Noon
Posted: December 16, 2005
CHPC Downtime Arches Downtime: Tuesday December 27th 8 a.m. until about Noon
Systems affected: Arches Clusters
Date: Tuesday December 27th, 2005
Duration: 8 a.m. until approximately Noon
Scope: Systems maintenance of the arches clusters. We will be draining the queues in anticipation of this downtime. The changes will include:
- Upgrading PBS/Torque to address some issues which surfaced after our 12/9/05 downtime.
- Update of the myrinet drivers on delicatearch and landscapearch. Please note that myrinet users will need to recompile. More details to follow.
Arches Cluster back up and scheduling jobs
Posted: December 9, 2005
Arches Cluster back up and scheduling jobs
Arches clusters are back up and scheduling jobs as of about 8:30 p.m. (December 9th, 2005).
We urge users to recompile their programs if they use Myrinet, as we have upgraded the Myrinet GM drivers. Also, PVFS2 is still finishing up its check, so I/O to it may be slightly slower until it is done sometime later tonight.
As always, e-mail problems@chpc.utah.edu if you experience difficulties.
CHPC System/Network downtime: Friday December 9th, 2005
Posted: December 9, 2005
CHPC Network/System downtime: Friday December 9th, 2005
Systems affected: All Arches clusters and switches. Software updates.
Date: Friday December 9th, 2005 at 8:00 a.m.
Duration: Until sometime Saturday December 10th
Scope: Systems maintenance of the arches clusters including upgrading the firmware on nests, upgrading PVFS. Maintenance on arches routers.
Important Notes:
- All /scratch space will be scrubbed. Files older than two weeks will be purged. Please move important data before this downtime.
- PVFS2 will be upgraded to 1.3.2 - anyone using MPI-IO will want to recompile their programs.
- GM will be upgraded to 2.0.23 - users of myrinet will need to recompile their programs.
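To see which files fall under the two-week purge rule, find can list them by age. This is a sketch only: it assumes the purge goes by modification time (the announcement does not say), the function name list_old_files is made up, and the demonstration runs in a temporary directory rather than in /scratch.

```shell
#!/bin/sh
# List regular files under a directory not modified in the last 14 days.
# -mtime +13 matches files whose age in whole days is 14 or more.
list_old_files() {
    find "$1" -type f -mtime +13 -print
}

# Demonstration in a throwaway directory.
demo=$(mktemp -d)
touch "$demo/new.dat"
touch -d '20 days ago' "$demo/old.dat"   # GNU touch syntax
list_old_files "$demo"                   # lists only old.dat
rm -rf "$demo"
```

On the clusters the function would be pointed at your own scratch directory; the exact path layout there is not stated in this announcement.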
Upgrades on Arches
Posted: November 30, 2005
Upgrades on Arches
November and December are historically rich in upgrades and this year is no different.
First is the upgrade of the Pathscale compilers to ver. 2.3. This includes a new autoparallelization feature, better optimization selection with the pathopt2 feature, and a few other useful things.
PGI will make their upgrade in the middle of December, which should also be interesting since they claim improved support for dual-core CPUs.
We have also upgraded MPICH2 to version 1.0.3, so far just with the GNU/Pathscale compilers. We will do the PGI build when PGI gets the upgrade. This version has numerous improvements in performance and functionality.
Finally, we upgraded Totalview to ver. 7.1. The improvements here are in memory and MPI debugging.
All were relatively minor upgrades, so most users should not even notice them, but it may be a good idea to recompile your code if you use Pathscale or MPICH2.
As always, let us know if you have any problems.
CHPC System/Network downtime: Friday December 9th, 2005
Posted: November 30, 2005
CHPC Network/System downtime: Friday December 9th, 2005
Systems affected: All Arches clusters and switches. Software updates.
Date: Friday December 9th, 2005 at 8:00 a.m.
Duration: Until sometime Saturday December 10th
Scope: Systems maintenance of the arches clusters including upgrading the firmware on nests, upgrading PVFS. Maintenance on arches routers.
Important Notes:
- All /scratch space will be scrubbed. Files older than two weeks will be purged. Please move important data before this downtime.
- PVFS2 will be upgraded to 1.3.2 - anyone using MPI-IO will want to recompile their programs.
- GM will be upgraded to 2.0.23 - users of myrinet will need to recompile their programs.
/scratch/serial is nearly full
Posted: November 24, 2005
/scratch/serial is nearly full
/scratch/serial is full. Jobs are dying when they try to write to this space. Please check your usage of space on this filesystem and free up as much as you can. You should also consider using the /scratch/parallel space instead: it has over 13 TB of space, and all tests indicate that this scratch system is stable and performs well.
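One way to find what is taking up the space is to rank the entries in a directory by size. The sketch below is illustrative: the function name largest_entries is made up, and the demonstration uses a temporary directory instead of a real scratch path.

```shell
#!/bin/sh
# Print the ten largest entries (sizes in KB) directly under a directory.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n 10
}

# Demonstration in a throwaway directory.
demo=$(mktemp -d)
dd if=/dev/zero of="$demo/big.dat" bs=1024 count=200 2>/dev/null
echo "small" > "$demo/small.dat"
largest_entries "$demo"
rm -rf "$demo"
```

Running df -h on the filesystem itself shows the overall fill level, which is the figure quoted in these announcements.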
CHPC Presentation
Posted: November 15, 2005
Chemistry Packages at CHPC
Thursday, November 17th, 2005 at 1:30 p.m. in the INSCC Auditorium
Presenter: Anita Orendt
This talk will focus on the various computational software packages and tools that are available on CHPC computer systems – Gaussian, Gaussview, NWChem, ECCE, Amber, Molpro, Babel, Dock and Autodock. The talk is an overview and will present information on the capabilities of these packages, along with details on how users can access the various programs at CHPC and where they can get more information on these packages. This talk will serve as a precursor to the next talk in the series that focuses on Using Gaussian and GaussView.
For more information about this and other CHPC presentations, please see:
/scratch/serial is nearly full (91%)
Posted: November 12, 2005
/scratch/serial is nearly full (91%)
The /scratch/serial filesystem is over 91% full. Please delete or copy off any files you can to help free up space.
/scratch/serial is nearly full
Posted: November 9, 2005
/scratch/serial is nearly full
The /scratch/serial filesystem is at 99% of its capacity. Please take a minute to look at your files on this system and clean up this space as soon as possible. Thank you,
Allocation Requests Due December 1st, 2005
Posted: November 8, 2005
Allocation Requests Due December 1st, 2005
Proposals and allocation requests for computer time on the Arches Opteron Cluster are due by December 1st, 2005. We must have this information if you wish to be considered for an allocation of time for the Winter 2006 calendar quarter and/or the subsequent three quarters. If you already have an award for Winter 2006, you do not need to re-apply unless you wish to request a different amount from what you were awarded.
- Information on the allocation process and the relevant forms are located on the Allocation Policies and Allocation Form pages.
****** Please use this form when sending in your request, as only those requests following this format will be considered.
- You may request computer time for up to four quarters.
- Winter Quarter (Jan-Mar) allocations go into effect on January 1, 2006.
- Only faculty members can request additional computer time for themselves and those working with them. Please consolidate all projects onto one proposal to be listed under the requesting faculty member.
- Send your proposal and relevant information to the attention of:
Janet Ellingson, 405 INSCC
Fax: 585-5366, e-mail: chpc-admin@utah.edu, tel: 585-3791
CHPC Presentation
Posted: November 8, 2005
Fast Parallel I/O at the CHPC
Thursday, November 10th, 2005 at 1:30 p.m. in the INSCC Auditorium
Presenter: Martin Cuma
In this talk we explain how to perform fast parallel I/O operations on the CHPC computers. It should be beneficial for all users who are interested in speeding up their parallel applications via faster file operations. First, we describe in detail PVFS2 (Parallel Virtual File System 2), installed on the Arches. Then we go over several examples of how to perform parallel I/O on this file system, in particular the MPI-I/O extension to the MPI standard and native PVFS function calls. We then describe how to compile and run MPI-I/O applications on both PVFS (Arches) and the Compaq Sierra's AdvFS. We conclude the talk with an insight into some more advanced aspects of MPI-I/O.
For more information about this and other CHPC presentations, please see:
Tunnelarch downtime Sunday 11/13-11/20/2005
Posted: November 8, 2005
Tunnelarch downtime Sunday 11/13-11/20/2005
Systems affected: Tunnelarch
Date: Beginning Noon, Sunday, November 13th, 2005.
Duration: Until Noon, Sunday November 20th, 2005
Scope: You should also know that the tunnelarch downtime (from Sunday, November 13th at Noon until the following Sunday, November 20th at Noon) is not just a demonstration. This is to solve a very significant bioinformatics problem. Details below.
mpiBLAST on the GreenGene Distributed Supercomputer: Sequencing the NT Database Against the NT Database (An NT-Complete Problem)
Abstract: The Basic Local Alignment Search Tool (BLAST) allows bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence.
Our open-source parallel BLAST --- mpiBLAST --- decreases the search time of a 300-kB query from 24 hours to 4 minutes on a 128-processor cluster. It also allows larger query files to be compared, something which is infeasible with the current BLAST. Consequently, we propose to compare the largest query available, the entire NT database, against the largest database available, the entire NT database. The result of this comparison will provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database. We estimate that the experiment will generate 100 TB of output to StorCloud.
Chair/Speaker Details:
Martin Swany (Chair)
University of Delaware
Wu Feng
Los Alamos National Laboratory
Mark Gardner
Los Alamos National Laboratory
Srinidhi Varadarajan
Virginia Tech
Jeff Crowder
Virginia Tech
Julio Facelli
University of Utah
Jeremy Archuleta
Los Alamos National Laboratory / University of Utah
Xiaosong Ma
North Carolina State University
Heshan Lin
Los Alamos National Laboratory / North Carolina State University
Venkatram Vishwanath
Los Alamos National Laboratory / University of Illinois at Chicago
Pavan Balaji
The Ohio State University
/scratch/parallel back: November 3rd, 2005
Posted: November 3, 2005
/scratch/parallel back: November 3rd, 2005
Systems affected: /scratch/parallel on arches clusters
Date: Thursday November 3rd, 2005
Duration: Back by about 1:45 p.m. on 11/03/2005
Scope: The update of /scratch/parallel was successful. Most of the problems that we have encountered since the last upgrade are gone, including the crippling slowness of cp, vi,...
One last problem that we noticed is that dates and permissions on files created by MPI-IO are not right, but, they don't seem to pose a problem in production runs.
Please, start using /scratch/parallel again if you've been using it before and report to us any problems you may see.
/scratch/parallel downtime: November 2nd, 2005
Posted: November 2, 2005
/scratch/parallel downtime: November 2nd, 2005
Systems affected: /scratch/parallel on arches clusters
Date: Thursday November 3rd, 2005
Duration: Unknown. Estimate a few hours.
Scope: We will take /scratch/parallel down tomorrow to apply some critical patches that fix problems which have made its usage difficult lately. We will NOT erase the data; the filesystem will only be inaccessible during the downtime. We don't have an estimate for the duration of the downtime, but if all goes well it should not take more than several hours.
CHPC Presentation
Posted: October 26, 2005
Mathematical Libraries at CHPC
Thursday, October 27th, 2005 at 1:30 p.m. in the INSCC Auditorium
Presenter: Martin Cuma
In this talk we introduce users to the mathematical libraries that are installed on the CHPC systems, which are designed to ease programming and speed up scientific applications. First, we will talk about BLAS, which is a standardized library of Basic Linear Algebra Subroutines, and present a few examples. Then we briefly focus on other libraries that are in use, including freeware ACML, LAPACK, ScaLAPACK, PETSc and FFTW, and commercial NAG and custom libraries from Compaq.
For more information about this and other CHPC presentations, please see:
CHPC /scratch/parallel problems after update
Posted: October 25, 2005
The /scratch/parallel file system upgrade was successful in terms of fixing known problems; however, several new ones have shown up.
First of all, we have found that copying or moving files out of /scratch/parallel is very slow. Also, trying to view or edit files in /scratch/parallel with vi or less is very slow. We are in touch with the PVFS2 developers to try to fix this problem. In the meantime, a good workaround for the copy and move problem is to use scp or tar, e.g., to move the file test.file from /scratch/parallel to $HOME/data, do:
scp /scratch/parallel/$USER/test.file $HOME/data/
or
tar -cpf - test.file |tar -xvpf - -C $HOME/data/
Note that the former needs a password, so it can't be used in a PBS script. Overall, we recommend commenting out the part of the script that copies files from /scratch/parallel at the end of the run and doing the copy manually. Copying files TO /scratch/parallel works well, so initial staging of input files before the run should pose no problems.
As for the slowness of vi and less, we recommend editing these files in your home directory space and then copying them to /scratch/parallel prior to the run.
There is one other problem that we have noticed: occasional corruption of the time stamps of files. This, however, seems to happen only shortly after a file is created, and so far we haven't noticed any detrimental effect on performance or usability.
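The tar workaround can be wrapped in a small helper for manual use after a run. This is a sketch under assumptions: the function name tar_copy and the file names are made up, and the demonstration copies between two temporary directories rather than from /scratch/parallel.

```shell
#!/bin/sh
# Copy one file between directories through a tar pipe instead of cp/mv.
# $1 = source directory, $2 = file name, $3 = destination directory
tar_copy() {
    ( cd "$1" && tar -cpf - "$2" ) | tar -xpf - -C "$3"
}

# Demonstration with throwaway directories.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "results" > "$src/test.file"
tar_copy "$src" test.file "$dst"
cat "$dst/test.file"        # prints: results
rm -rf "$src" "$dst"
```

The subshell keeps the cd from changing the caller's working directory, and tar's -p option preserves file permissions on extraction.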
CHPC /scratch/parallel downtime: Thursday October 20th, 2005
Posted: October 12, 2005
CHPC /scratch/parallel downtime: Thursday October 20th, 2005, we will upgrade PVFS2 file system that runs /scratch/parallel. Time and duration of the downtime is to be determined.
Systems affected: Arches clusters that mount /scratch/parallel. This file system will not be available during the downtime and all user files that have been on the file system will be erased.
Details:
PVFS2 that is running /scratch/parallel will be upgraded to fix several bugs that were preventing some users from using the file system efficiently.
We appeal to all users that are using /scratch/parallel now to:
1. Copy all their important files off the file system before the downtime. The complete file system will be wiped out during the upgrade.
2. Do not submit any jobs on Arches that could use /scratch/parallel during the downtime window. It would be best to stop submitting such jobs right now, and to check all queued jobs a few days before the downtime to make sure that they are not set to use /scratch/parallel. All jobs that use this file system during the downtime will crash and will prolong the downtime, because the affected nodes will need to be cleaned. We will not refund any time lost due to this.
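A quick way to check job scripts for references to this file system is to search them with grep. The sketch below assumes the scripts sit in one directory and end in .pbs, which is only a convention chosen for the example; the function name scripts_using_parallel is also made up. It inspects scripts on disk, not copies already held by the batch system.

```shell
#!/bin/sh
# Print the names of job scripts in a directory that mention /scratch/parallel.
scripts_using_parallel() {
    grep -l '/scratch/parallel' "$1"/*.pbs 2>/dev/null
}

# Demonstration with two hypothetical job scripts.
demo=$(mktemp -d)
echo 'cd /scratch/parallel/$USER/run1' > "$demo/job_a.pbs"
echo 'cd /scratch/serial/$USER/run2' > "$demo/job_b.pbs"
scripts_using_parallel "$demo"      # lists only job_a.pbs
rm -rf "$demo"
```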
CHPC Network/System downtime: Thursday September 29th, 2005
Posted: September 14, 2005
updated: September 27th, 2005
updated: September 28th, 2005
CHPC Network/System downtime: Thursday September 29th, 2005, Arches cluster down at 5:00 p.m. (Komas). Some home directories will be unavailable. Expect systems to be available sometime the morning of September 30th.
Systems affected: All systems in the Komas machine room, including all Arches clusters and fileserv2, beginning at 5:00 p.m. Home directories for many users will be unavailable.
Date: Thursday September 29th, 2005
Duration: Undetermined. We will begin servicing systems in the Komas machine room at 5:00 p.m. Home directories should be available in a few hours. The Arches clusters will be available sometime the next morning.
Scope: The Arches clusters and fileserv2 will go down at 5:00 p.m. for maintenance and the cleaning and maintenance of coolers in the Komas Machine room. Power maintenance of the SSB Machine room will not require systems to go down.
PVFS (/scratch/parallel) file system Open for General Use
Posted: August 8, 2005
PVFS (/scratch/parallel) file system Open for General Use
We would like to invite users to start using our new large parallel file system running PVFS2 (http://www.pvfs.org/pvfs2/). It consists of 12 I/O nodes with a total available space of about 13.5 TB.
Apart from its large size, it also has much better performance than the existing /scratch/serial file servers, especially when doing concurrent I/O from multiple compute nodes. In our tests, we have achieved an aggregate bandwidth of 1350 MB/s reading/writing from 16 compute nodes; the theoretical peak is 1500 MB/s, while the peak of a /scratch/serial type server is 125 MB/s.
This file system is mounted on three Arches clusters, delicatearch, marchingmen and tunnelarch, as /scratch/parallel. Most users will use it similarly to /scratch/serial to do standard UNIX I/O. Those interested in parallel I/O (either PVFS native or MPI-IO), please contact me directly.
Note that we have tested this file system extensively but have not yet had many users running on it at the same time, so consider the file system to be in "release candidate" status and report to us any problems you may encounter with it. We will try to respond promptly and alleviate the problem.
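The bandwidth figures quoted here can be put in proportion with a few lines of shell arithmetic. This is only a worked calculation on the quoted numbers; the last line is an observation, not something the announcement states.

```shell
#!/bin/sh
measured=1350   # MB/s, aggregate measured from 16 compute nodes
peak=1500       # MB/s, theoretical peak of /scratch/parallel
serial=125      # MB/s, peak of a /scratch/serial type server

# Measured bandwidth as a percentage of the theoretical peak.
echo "fraction of peak: $((measured * 100 / peak))%"
# prints: fraction of peak: 90%

# Ratio of measured parallel bandwidth to one serial server's peak.
ratio=$(awk -v m="$measured" -v s="$serial" 'BEGIN { printf "%.1f", m / s }')
echo "ratio to one serial server: ${ratio}x"
# prints: ratio to one serial server: 10.8x

# Observation only: 12 I/O nodes at 125 MB/s each would give the quoted peak.
echo "12 x 125 = $((12 * 125)) MB/s"
```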
As always, let us know if you experience any problems.
Software upgrades on Arches
Posted: July 29, 2005
Software upgrades on Arches
We have upgraded the Pathscale compilers to ver. 2.2, Totalview to ver. 7.0 and the ACML library to ver. 2.6. All should be smooth upgrades, except for the probable need to relink codes that use the ACML library.
As always, let us know if you experience any problems.
SSB machine room power outage Saturday, July 30th 7am-10am
Posted: July 27, 2005
SSB machine room power outage Saturday, July 30th 7am-10am
There will be a power outage affecting CHPC's SSB machine room this
Saturday, July 30th from 7am to 10am.
Clusters housed in the room, that is Icebox, Sierra and Slickrock will be
shut off, as well as most of the file servers.
We have created reservations on the Arches clusters to drain them as well
since the jobs would fail when home directories housed on the downed
fileservers are not available.
We will also take this opportunity to do some maintenance on fileserv2 and
on new landscapearch nodes which we expect to be done by the time the power
is back up.
We plan to have the Arches resuming scheduling soon after the power is back
up and the fileservers are booted, while the three clusters housed in SSB
may take a few more hours to come up.
SSB Machine Room Power Outage this Saturday (July 30th 7am-10am)
Posted: July 27, 2005
SSB Machine Room Power Outage this Saturday (July 30th 7am-10am)
There will be a power outage affecting CHPC's SSB machine room this Saturday, July 30th from 7am to 10am.
Clusters housed in the room, that is Icebox, Sierra and Slickrock will be shut off, as well as most of the file servers.
We have created reservations on the Arches clusters to drain them as well since the jobs would fail when home directories housed on the downed fileservers are not available. We will also take this opportunity to do some maintenance on fileserv2 and on new landscapearch nodes which we expect to be done by the time the power is back up.
We plan to have the Arches resuming scheduling soon after the power is back up and the fileservers are booted, while the three clusters housed in SSB may take a few more hours to come up.
As always, we are sorry for the inconvenience. Don't hesitate to contact us with any questions.
Icebox back online.
Posted: July 18, 2005
Icebox back online.
Icebox has been brought back up and is accepting jobs.
Icebox down, 9 pm Wednesday, July 13th, 2005
Posted: July 14, 2005
Icebox down, 9 pm Wednesday, July 13th, 2005
On Wednesday night, the cooler located in the center of our SSB machine room failed and shut down. This caused a large imbalance of cooling in the data center, with the east half of the room reaching temperatures high enough to make ICEBOX hardware fail due to overheating.
CHPC staff attempted to restart the compressors on the failed unit, without success.
Our Liebert service provider, "Mountain Valley", was consulted that night and was scheduled to arrive between 8 and 9 am on Thursday to work on the failed Liebert cooler.
Because of the failed Liebert cooler, ICEBOX will remain down, and will be brought online and diagnosed when the cooling is restored. The availability will depend on the timeframe of the cooler repair and the amount of damage caused by the overheating.
We apologize for any inconvenience due to this failure.
Use of Multiple nodes for gaussian jobs
Posted: July 5, 2005
Use of Multiple nodes for gaussian jobs
All Gaussian Users,
Some of you have been getting messages from CHPC userservices that your Gaussian jobs were not making use of all nodes in a multiple node run. To help clarify the parallel nature of Gaussian, the following information has been added to CHPC’s Gaussian 03 web page:
Parallel Gaussian:
The general rule is that SCF/DFT/MP2 energies and optimizations as well as SCF/DFT frequencies scale well when run on multiple nodes. In the case of DFT jobs, once there are more than about 100 atoms for pure DFTs or 150 atoms for hybrid methods, the default algorithm that is used (FMM, the Fast Multipole Method) does not run in parallel. In this case you can either run on one node or use Int=FMMNAtoms=n to turn off the use of the FMM, where n is the number of atoms.
For all other types of jobs you will want to only use one node to make efficient usage of your allocation. To run non-restartable jobs that will take longer than the queue time limits on tunnelarch, you can request access to the long queue on a case by case basis in order to be able to complete these jobs.
As always, when running a new type of job, it is best to first test the parallel behavior of the runs by running a test job on both 1 and 2 nodes and looking for the timing differences before submitting a job that uses 4, 8 or even more nodes. The timing differences are easiest to see if you use the p printing option (replace the # at the beginning of the keyword line in the input file with #p).
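The 1-node versus 2-node test described above can be prepared with a small helper. The following is only a sketch: the function names add_p_option and speedup are hypothetical, and it assumes the keyword line of the input file starts with "# " (a "#" followed by a space). The wall times passed to speedup are whatever you read from the two test outputs.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Gaussian or CHPC tools.

# Read a Gaussian input file on stdin and switch the keyword line from "# " to "#p ".
# Lines that already start with "#p" do not match and are left unchanged.
add_p_option() {
    sed 's/^# /#p /'
}

# Print the ratio of two wall times (1-node time divided by 2-node time).
# A value close to 2 suggests the job type scales; a value close to 1 suggests
# that it should be run on a single node.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Hypothetical usage:
#   add_p_option < test.com > test_p.com
#   speedup 1200 600
```

Run the modified input once on 1 node and once on 2 nodes, then compare the reported times before requesting more nodes.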
If you have any questions about any of the above, or any questions regarding Gaussian 03 calculations, please feel free to contact me.
Anita
We have upgraded all the MPI distros on Arches
Posted: June 27, 2005
We have upgraded all the MPI distros on Arches, that is:
MPICH-GM to ver. 1.2.6..14b
MPICH to ver. 1.2.7
MPICH2 to ver. 1.0.2
Upgrade on Icebox will follow soon.
All are minor upgrades, but at least in the case of MPICH2 I recommend recompiling from source.
For details on how to use all of these, consult our instructions page: http://www.chpc.utah.edu/docs/manuals/software/mpi.html
I would also like to take this opportunity to stress to users to use only MPICH-GM on Delicatearch to justify the purchase of the expensive Myrinet network hardware, and to shift their programs from MPICH to MPICH2. Version 1.2.7 of MPICH is likely to be the last one as the developers are abandoning it in favor of MPICH2.
As always, please, report any problems or questions to problems@chpc.utah.edu
Major CHPC Network/System Downtime, 5 pm Thursday, July 7th, 2005
Posted: June 23, 2005
Major CHPC Network/System Downtime, 5 pm Thursday, July 7th, 2005
Systems affected: All systems in the Komas machine room, and some fileservers. Home directories for many users will be unavailable.
Date: Thursday July 7th, 2005
Duration: Undetermined. Will begin 5 pm.
Scope: Repair of coolers in the Komas Machine room. All Arches clusters will be down. System maintenance on fileserv2 (home directories in /uufs/inscc.utah.edu/common/home), and several of the /scratch filesystems on Arches.
CHPC email server upgrade: Things to Know
Posted: May 31, 2005
CHPC email server upgrade: Things to Know
To access previously created Email folders one needs to 'subscribe' to them. This is not required for newly created folders, only previously created folders.
With most IMAP clients this is done either by right clicking on any mail folder and then clicking 'subscribe', or by clicking on 'file' in the menu bar and then 'subscribe'. Then select the box for the folder/subfolders you wish to view and hit 'subscribe'.
For https://webmail.chpc.utah.edu/ simply click on 'folders' then click on the ones you want to 'subscribe' to (hold down the 'Ctrl' key to select multiple folders) then click on subscribe.
The new mail server also supports secure incoming IMAPS and will soon support SSL authentication for outgoing SMTP.
Pine Users
- Your inbox messages are probably not in order. To fix this, just type "$d" to sort it again.
- Go into your settings (from the main menu type "sc" [for setup, config]) and verify:
  - smtp-server is "smtp.inscc.utah.edu"
  - inbox-path is "{mail.inscc.utah.edu/ssl/novalidate-cert}INBOX"
- If you use ssh-tunneling for remote access, leave your smtp-server set to "localhost" . Change your inbox-path as above (the encrypted access is more secure than an ssh tunnel).
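For Pine users who prefer to check the settings from the shell, the verification step above could be sketched as follows. This assumes the settings are stored as variable=value lines in a configuration file such as ~/.pinerc. The function name check_pinerc is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes Pine keeps its settings as "variable=value" lines.

# Succeed only if both settings from the announcement are present in the given file.
check_pinerc() {
    rc=$1
    grep -Fxq 'smtp-server=smtp.inscc.utah.edu' "$rc" &&
    grep -Fxq 'inbox-path={mail.inscc.utah.edu/ssl/novalidate-cert}INBOX' "$rc"
}

# Hypothetical usage:
#   check_pinerc "$HOME/.pinerc" && echo "Pine settings match the new server"
```

Users who keep smtp-server set to "localhost" for ssh tunneling would need to adapt the first check accordingly.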
Myrinet installed on Icebox, May 17th, 2005
Posted: May 17, 2005
Myrinet installed on Icebox, May 17th, 2005
We have put Myrinet (a faster network interconnect) on the first 42 nodes of Icebox (ib001-ib042). We did not have much chance to test it as all the nodes are loaded almost constantly, but short tests work and we would like to encourage users to give it a try. It should give a good performance boost to programs with heavy communication that use more than one node.
In order to request the Myrinet nodes, add the keyword myrinet to your PBS node specification line, e.g.:
#PBS -l nodes=4:ppn=2:myrinet,walltime=24:00:00
We have built MPICH-GM just with the PGI compilers (C, C++, F77/90/95). The build is located in
/uufs/icebox/sys/pkg/mpich-gm/std.
That is, to compile, use:
/uufs/icebox/sys/pkg/mpich-gm/std/bin/mpiXXX, where XXX stands for cc, cxx, f77, f90.
To run, do:
/uufs/icebox/sys/pkg/mpich-gm/std/bin/mpirun.ch_gm -np $NODES -machinefile $PBS_NODEFILE ./executable
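Put together, the pieces above might look like the following job script. It is a sketch under stated assumptions: the program is called ./executable as in the example above, it was compiled with the MPICH-GM wrappers, and PBS writes one line per requested processor into $PBS_NODEFILE, so counting its lines gives the process count. The helper name count_procs is hypothetical.

```shell
#!/bin/sh
#PBS -l nodes=4:ppn=2:myrinet,walltime=24:00:00
# Sketch of a job script for the Myrinet nodes on Icebox.

# Location of the MPICH-GM build given in the announcement.
MPICH_GM=/uufs/icebox/sys/pkg/mpich-gm/std

# Count the entries in a PBS machine file (assumed: one line per processor).
count_procs() {
    wc -l < "$1" | tr -d ' '
}

# Launch only when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NODES=$(count_procs "$PBS_NODEFILE")
    "$MPICH_GM/bin/mpirun.ch_gm" -np "$NODES" -machinefile "$PBS_NODEFILE" ./executable
fi
```

Outside of a PBS job the script does nothing, which makes it safe to check for syntax errors before submitting.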
Major CHPC Network/System Downtime, 8 am - 5 pm Saturday, May 21st, 2005
Posted: May 10, 2005
Major CHPC Network/System Downtime: 8 am - 5 pm Saturday, May 21st, 2005
Systems affected: All routing in INSCC. Systems in the INSCC Machine room shutdown. All HPC systems including Arches Clusters, Sierra and Icebox.
NOTE: *** If you have equipment in the INSCC Machine room, you will need to make sure it is down before 8 a.m. the morning of May 21st.***
Date: Saturday May 21st, 2005
Duration: 8 a.m. until 5 p.m.
Scope: The coolers in the Machine room in INSCC were repaired. CHPC will take advantage of this downtime to upgrade a fileserver and some of the icebox nodes.
CHPC Downtime - Arches Clusters (emergency)
Posted: May 7, 2005
updated: May 7th, 2005 (about 4pm)
updated: May 7th, 2005 (about 1pm)
CHPC Downtime:
All Arches Clusters: Saturday, May 7th, 2005 at 10:00 a.m. until about 4:00 p.m.
Systems affected: All Arches clusters - KOMAS machine room cooling failed due to a power problem.
About 10:00 a.m. there was a power problem in the Komas machine room, causing the coolers to fail. We shut down the Arches clusters as the temperature was getting dangerously high. CHPC staff are on site and working to get everything back online.
The coolers came back online and were cooling the machine room by about 1:00 p.m. As soon as the room was sufficiently cooled, CHPC staff began to bring the Arches clusters back online. All went smoothly and the clusters were back and scheduling jobs by 4:00 p.m.
Allocation Requests Due June 1st, 2005
Posted: May 5, 2005
Allocation Requests Due June 1st, 2005
Proposals and allocation requests for computer time on the Arches Opteron Cluster are due by June 1st, 2005. We must have this information if you wish to be considered for an allocation of time for the Summer 2005 calendar quarter and/or subsequent three quarters. This is for additional computer time above the default amount given for the quarter. If you already have an award for Summer 2005, you do not need to re-apply unless you wish to request a different amount from what you were awarded.
- Information on the allocation process and relevant forms is located on Allocation Policies and Allocation Form.
****** Please use this form when sending in your request, as only those requests following this format will be considered.
- You may request computer time for up to four quarters.
- Summer Quarter (Jul-Sep) allocations go into effect on July 1, 2005.
- Only faculty members can request additional computer time for themselves and those working with them. Please consolidate all projects onto one proposal to be listed under the requesting faculty member.
- Send your proposal and relevant information to the attention of:
Victoria Volcik, 405 INSCC
Fax: 585-5366, e-mail: admin@chpc.utah.edu, tel: 585-3791
Sierra cluster not fully functional
Posted: May 2, 2005
updated May 3rd, 2005
Sierra cluster not fully functional (back by about 4pm)
The sierra cluster had some troubles the afternoon of May 2nd. CHPC staff looked into the problem and it was returned to normal operation about 4:00 p.m. We apologize for the inconvenience.
Changes to NBO version in Gaussian installation
Posted: April 27, 2005
Changes to NBO version in Gaussian installation
Note to all users of Gaussian on arches cluster: The standard installation of NBO within Gaussian03 has been changed from the 3.1 that is shipped from Gaussian to the newest 5.0 version of NBO available. For more information on the added features of this newest version of NBO see http://www.chem.wisc.edu/~nbo5/
Power down in Komas building (4/20/05 from 4:30-6:00pm): Arches cluster down (until approx. 8:30pm), all routing for INSCC down.
Posted: April 20, 2005
Power down in Komas building (4/20/05 from 4:30-6:00pm): Arches cluster down (until approx. 8:30pm), all routing for INSCC down.
The power dropped in the Komas building in Research Park about 4:30pm on April 20th, 2005. This power outage dropped all of the Arches clusters. It also dropped all the routing for the INSCC building: a few weeks ago, CHPC Network staff had to move all of the INSCC routing to Komas as a workaround for a code bug, and the fix for that bug is still outstanding. All routing between networks for the Komas cluster machine room, SSB machine room, INSCC machine room and INSCC building was non-functional until the power returned about 6:00 pm. CHPC Network and Systems staff were on site and working with electricians and UP&L to return power as soon as possible. The Arches clusters were back scheduling jobs around 8:30 pm.
We have upgraded Pathscale compilers on Arches and Icebox to version 2.1.
Posted: April 19, 2005
We have upgraded Pathscale compilers on Arches and Icebox to version 2.1.
Among the most noticeable improvements:
- OpenMP 2.0 support in Fortran and C and limited support in C++
- Improved performance with changes in scalar and vector math library routines
- New options that help debugging
  - -trapuv sets uninitialized variables to NaN, which helps to crash the code when the value is used
  - -Wuninitialized gives a warning when an uninitialized variable is used (though it also gives warnings in conditional statements)
Totalview has been upgraded on Arches to version 6.8.
This version mainly increases the number of supported platforms and compilers and also has some improvements in the MPI debugging interface.
As always, let us know if you experience any problems.
CHPC Presentation Series
Posted: April 11, 2005
Parallel Performance Analysis with TAU
Thursday, April 13th, 2006 at 1:30 p.m. in the INSCC Auditorium
Presenter: Martin Cuma
TAU (Tuning and Analysis Utilities, http://www.cs.uoregon.edu/research/tau/home.php) is a profiling and tracing toolkit for performance analysis of serial and parallel programs. In this talk, we will introduce TAU as a new and flexible tool for tracing parallel programs on the CHPC Arches clusters. We will detail the small changes necessary to turn on the tracing and then explain how to visualize the trace files in the Vampir trace viewer. We will conclude with some specific examples and a glimpse of other features that TAU provides.
For more information about this and other CHPC presentations, please see:
HPC home directory file server crashes
Posted: April 9, 2005
updated: April 11th, 2005
HPC home directory file server crashes three times over the April 9-10th, 2005 weekend
fileserv2, which hosts users' home directories, crashed three times over the April 9-10th, 2005 weekend. As a result of the downtime, jobs of those people whose home directories are hosted by this server (i.e. those who don't have their own departmental fileservers) and that write there have either crashed or spent a lot of time waiting for I/O, and may have run out of walltime as a result. If a job was writing data to the scratch servers and was supposed to copy it back to the home directory at the end, the data may not have been copied. Some jobs that were trying to start during the fileserver downtime could not find their data and may have sent lots of spam to the user's e-mail box. If there was a write at the time of a crash, the data may be corrupt. Please check with extra care any results that you obtained during the weekend.
Moab Configuration Changes on Arches
Posted: April 6, 2005
Moab Configuration Changes on Arches
CHPC staff have made several changes to the Moab scheduler configuration on delicatearch, marchingmen and tunnelarch. These changes are an experiment to see if they improve throughput. To summarize, the changes include:
- Turning off the "fairshare". The effect of this is that it will be much more "fifo-ish" (first-in-first-out) than before.
- The priority reward for parallelism has been turned off.
- We've changed the backfill method from "FIRSTFIT" to "BESTFIT". This means that very small, very short jobs should get through more quickly and that backfill will be less "fifo-ish".
- We will no longer scrub /tmp on nodes during the job epilogue. This means it is very important to clean up after your job ends, removing any scratch files. This is easy to do by adding a line to the end of your PBS script like:
cp * $HOME/working_directory && cd .. && rm -rf /tmp/$PBS_JOBID
This should also be done on any scratch filesystem you may use (/scratch/serial, /scratch/da, /scratch/mm). If you are writing large scratch files to /tmp and your job crashes before it gets to this cleanup stage, let us know so we can clean up the nodes.
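A slightly more defensive variant of the cleanup line above is sketched below. It assumes the job works in /tmp/$PBS_JOBID and that results belong in $HOME/working_directory, as in the example. The function name save_and_clean is hypothetical, and cp -R is used instead of a plain cp so that subdirectories are copied as well.

```shell
#!/bin/sh
# Sketch only: copy results home, then remove the per-job scratch directory.

# Copy everything from the scratch directory to the destination and remove the
# scratch directory afterwards. The removal is skipped if the copy fails, so
# results are not lost when the home file server is unavailable.
save_and_clean() {
    scratch=$1
    dest=$2
    mkdir -p "$dest" &&
    cp -R "$scratch"/. "$dest"/ &&
    rm -rf "$scratch"
}

# Run only inside a PBS job.
if [ -n "${PBS_JOBID:-}" ]; then
    SCRATCH=/tmp/$PBS_JOBID
    mkdir -p "$SCRATCH" || exit 1
    cd "$SCRATCH" || exit 1
    # ... run your program here (placeholder) ...
    cd /tmp && save_and_clean "$SCRATCH" "$HOME/working_directory"
fi
```

If the copy fails, the scratch directory stays on the node, so you would still need to clean it up by hand or let CHPC staff know.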
CHPC Presentation
Posted: April 5, 2005
Chemistry Packages at CHPC
Thursday, April 7th, 2005 at 1:30 p.m. in the INSCC Auditorium
Presenter: Anita Orendt
This talk will focus on the computational chemistry software packages - Gaussian, Amber, NWChem, Molpro, Amica, Babel, GaussView, ECCE - that are available on CHPC computer systems. The talk will be an overview of the packages and their capabilities, and will focus on details of how users can access the installations at CHPC. This talk is the precursor for a second talk scheduled for April 21st that will focus on the use of Gaussian 03 and GaussView.
For more information about this and other CHPC presentations, please see:
All machines at Komas down at 5pm, up at 10.30pm, 31st March
Posted: March 31, 2005
All machines at Komas down at 5pm, up at 10.30pm, 31st March
Shortly before 5pm, March 31st, all connectivity to machines housed in the Komas machine room (Arches clusters) was dropped. The TS Electric people & Randy Green reported that UP&L had suffered a power bump on the Hogle Zoo substation. This resulted in our flywheel and UPS triggering and the loss of power to the bulk of the Arches compute nodes. TS Electric gave us a green light to bring the system back up at 7:30pm. The system came back up without any real problems and resumed scheduling at about 10.30pm. All jobs that were running were lost. Jobs that were in the idle state in the queues have started to run and the system is open for regular use.
Arches Cluster unscheduled downtime
Posted: March 31, 2005
Arches Cluster unscheduled downtime Thursday, March 31st, 2005 about 5:00 p.m.
At approx 10:30pm Thursday we started up the schedulers again after bringing Arches back up. The TS Electric people & Randy Green reported that UP&L had suffered a power bump on the Hogle Zoo substation. This resulted in our flywheel and UPS triggering and the loss of power to the bulk of the Arches compute nodes. TS Electric gave us a green light to bring the system back up at 7:30pm. The system came back up without any real problems. All jobs that were running were lost. Jobs that were in the idle state in the queues have started to run and the system is open for regular use at this time. Thanks for your patience.
PGI compilers upgraded to version 6.0
Posted: March 30, 2005
PGI compilers upgraded to version 6.0
Among new features that should be beneficial to our users are:
- Fortran 95 support - new compiler pgf95, which supersedes pgf90 (pgf90 retained for backward compatibility).
- improved performance
- large array support (> 2 GB)
For more details see release notes: http://www.pgroup.com/doc/pgicdkrn.pdf
We have done some quick tests that indicate that this version's libraries are compatible with the previous version, 5.2, so we did not recompile any of the packages that are built on top of the PGI compilers (MPICH,...). This also means that recompiling codes is not necessary, but I'd still recommend it given the claimed performance improvements.
Please let us know if you experience a problem with this, or anything else.
Arches Cluster Back Online
Posted: March 23, 2005
Arches Cluster Back Online Wednesday, March 23rd, 2005 about 6:00 p.m.
The arches clusters are now open to users and scheduling jobs after the extended downtime. Two changes of note which were made during this downtime are:
- All nodes/servers are running the newer kernel, 2.6.11.
- GM was upgraded to 2.0.19 on delicatearch and landscapearch.
Thanks for your patience on this extended downtime. We had some unexpected delays and apologize for the inconvenience. Please let us know if you run into problems by sending a report to problems@chpc.utah.edu.
CHPC Downtime - KOMAS Machine room (extended)
Posted: March 21, 2005
updated: March 23rd, 2005
CHPC Downtime: Monday, March 21st, 2005 at 10:00 p.m.
Systems affected: All Arches clusters - KOMAS machine room critical repair of cooling system.
Date: Monday March 21st - Wednesday March 23rd (extended from 3/22)
Duration: Availability expected sometime Wednesday March 23rd after replacement and testing of cooling system. (Our systems staff ran into unexpected problems during the nfsroot image upgrade and as a result the downtime was extended beyond the original estimate.)
Details: A critical repair of the cooling system will take place beginning at 10:00 pm tonight, Monday March 21st. CHPC will take advantage of this opportunity and move up most of the maintenance planned for the March 31st scheduled downtime which has been cancelled. We apologize for any inconvenience.
Seminar: "Some Key Ideas in High Performance Computing"
Posted: March 20, 2005
Seminar: "Some Key Ideas in High Performance Computing"
Wednesday, March 30th, 2005 at 4:15 p.m. LCR MEB 3147
School of Computing
Presenter: Rob Leland
Deputy Director
Computers, Computation, Informatics and Mathematics
Sandia National Laboratories
High performance computing promises to transform the scientific and engineering disciplines. The speaker will reflect on several key ideas that have emerged over the past two decades in the evolution of distributed memory high performance computing. These will be illustrated with examples in geometry and meshing, load balancing, linear solvers and the design and construction of a new supercomputing system, Red Storm. Discussion on the history and future of high performance computing will be invited.
CHPC Presentation
Posted: March 20, 2005
Introduction to Parallel Computing
Thursday, March 31st, 2005 at 1:30 p.m. in the INSCC Auditorium
Presenter: Martin Cuma
In this talk, we will first discuss various parallel architectures and note which ones are represented at the CHPC, in particular, shared and distributed memory parallel computers. A very short introduction to two programming solutions for these machines, MPI and OpenMP, will then be given, followed by instructions on how to compile, run, debug and profile parallel applications on the CHPC parallel computers. Although this talk is directed more towards those starting to explore parallel programming, more experienced users can gain from the second half of the talk, which will provide details on software development tools available at the CHPC.
For more information about this and other CHPC presentations, please see:
CHPC Presentation
Posted: March 20, 2005
Overview of CHPC
Thursday, March 24th, 2005 at 1:30 p.m. in the INSCC Auditorium
This presentation gives users new to CHPC, or interested in High Performance Computing an overview of the resources available at CHPC, and the policies and procedures to access these resources.
Topics covered will include:
- The platforms available
- Filesystems
- Access and security
- An overview of the batch system and policies
All welcome!
For more information about this and other CHPC presentations, please see:
Sierra Cluster rebooted, approx 3:45 pm March 16th, 2005
Posted: March 16, 2005
Sierra Cluster rebooted, approx 3:45 pm March 16th, 2005
On the afternoon of March 16, 2005, the sierra cluster had some system problems causing some jobs to die. We rebooted the system about 3:45 pm.
MPICH2 updated to version 1.0.1
Posted: March 15, 2005
MPICH2 updated to version 1.0.1
MPICH2 was updated to version 1.0.1. We did not have much chance to test it because all the machines are quite loaded, but we don't expect any problems since it's a minor update. Please let us know of any problems.
CHPC Presentation Series starts March 24th, 2005
Posted: March 10, 2005
CHPC Presentation Series starts March 24th, 2005
CHPC is happy to announce that a selection of our regular Fall series will be presented this Spring.
All presentations will be held on Thursdays in the INSCC Auditorium at 1:30 p.m.
- 3/24 Overview of CHPC
- 3/31 Introduction to Parallel Computing
- 4/7 Chemistry packages at CHPC
- 4/14 *No presentation*
- 4/21 Using Gaussian 03 and Gaussview
- 4/28 Introduction to programming with MPI
- 5/5 Debugging with Totalview
New Scratch space available on marchingmen and delicatearch
Posted: March 9, 2005
New Scratch space available on marchingmen and delicatearch
We have installed two new NFS scratch file servers, /scratch/mm and /scratch/da. The former is mounted on all MM (marchingmen) compute nodes, the latter on all DA (delicatearch) compute nodes. We encourage users to start using the new file servers for their jobs on MM and DA. Both run an updated kernel that fixes many NFS-related bugs that were causing failures in /scratch/serial.
MM also has an updated kernel on the compute nodes, while DA does not. We have not thoroughly tested how the new /scratch/da server interacts with the older kernel on the DA compute nodes, so we would appreciate users reporting anything odd they observe with this setup. Still, since /scratch/da is running a newer kernel, from a primitive statistical standpoint, one could speculate on a 50% reduction of the problems that were plaguing /scratch/serial.
/scratch/serial was not updated. It is still the only scratch file server available to TA and LA. Those who never experienced any problem with it may keep using it, but we recommend that those who did use MM or DA over the next couple of weeks.
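For job scripts that are shared between clusters, the mapping described above can be written down as a small helper. This is only a sketch: the function name pick_scratch and the short cluster labels (mm, da, ta, la) passed as its argument are hypothetical conventions, not something the scheduler provides.

```shell
#!/bin/sh
# Sketch only: map a cluster label to the scratch file server recommended above.

pick_scratch() {
    case "$1" in
        mm)    echo /scratch/mm ;;      # marchingmen: new server, updated compute node kernel
        da)    echo /scratch/da ;;      # delicatearch: new server, older compute node kernel
        ta|la) echo /scratch/serial ;;  # tunnelarch, landscapearch: only /scratch/serial is available
        *)     echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# Hypothetical usage:
#   SCRATCH=$(pick_scratch mm)/$USER/$PBS_JOBID
```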
Downtime: marchingmen cluster, Tuesday March 8th, 2005 from 8am
Posted: March 2, 2005
updated: March 9, 2005
updated: March 8, 2005
Downtime: marchingmen cluster, Tuesday March 8th, 2005 from 8am
Resumed user access and scheduling jobs about 9am the morning of March 9th, 2005.
Details: The kernel was upgraded on the compute nodes and another scratch filesystem added which will be visible only to the marchingmen compute and interactive nodes. The new scratch space is available at /scratch/mm. Additional scratch space was added to delicatearch at the same time which did not require a downtime. This space is available at /scratch/da
Icebox outage, morning of February 25, 2005
Posted: February 25, 2005
updated: March 1, 2005
Icebox outage, morning of February 25, 2005
On the morning of February 25, 2005, icebox, the IA-32 cluster, had some major system problems which required us to take it offline. Our systems staff worked to solve this recurring problem and the system was opened for users again the morning of March 1st, 2005. We apologize for any inconvenience.
Icebox outage, morning of February 24, 2005
Posted: February 24, 2005
Icebox outage, morning of February 24, 2005
On the morning of February 24, 2005, icebox, the IA-32 cluster, had some major system problems which required us to take it offline. It was back scheduling jobs by about 11:40 am. We apologize for any inconvenience.
Icebox outage, morning of February 23, 2005
Posted: February 23, 2005
Icebox outage, morning of February 23, 2005
On the morning of February 23, 2005, icebox, the IA-32 cluster, had some major system problems that required us to reboot most of the system. Icebox returned to scheduling by about 11 am that morning. We apologize for any inconvenience.
Totalview debugger upgrade on Arches clusters
Posted: February 23, 2005
Totalview debugger upgrade on Arches clusters
We have upgraded the Totalview debugger to version 6.7 on the Arches clusters. The main improvement is extended memory debugging. For details, see http://www.etnus.com/TotalView/Latest_Release.html
Note that we have discontinued license renewal for Totalview on Icebox, so the latest release there remains 6.6. Since the 4-user license we have there (for ia32) is hardly used now, we may consider sharing it if someone affiliated with CHPC with a Linux ia32 desktop is interested. In that case, please contact me.
Matlab installed on Arches and Icebox
Posted: February 8, 2005
Matlab installed on Arches and Icebox
We've obtained a full license for Matlab, and it is now installed on both Arches and Icebox. For details on how to use it, please see: http://www.chpc.utah.edu/docs/manuals/software/matlab.html
Please note there that we do not recommend running Matlab on the interactive nodes; instead, submit an interactive PBS job and run Matlab on the compute nodes. For details, see the aforementioned webpage.
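As a rough illustration of the recommendation above, the sketch below assembles a `qsub` command line for an interactive PBS job. The helper name `interactive_job_cmd` is hypothetical, and the node count, processors per node and walltime are placeholder assumptions; the Matlab webpage referenced above remains the authoritative source for the settings to use.

```shell
#!/bin/bash
# Build (but do not run) a qsub command line for an interactive PBS job.
# Node count, processors per node, and walltime are illustrative assumptions.
interactive_job_cmd() {
    local nodes="${1:-1}" ppn="${2:-2}" walltime="${3:-01:00:00}"
    echo "qsub -I -l nodes=${nodes}:ppn=${ppn},walltime=${walltime}"
}

# Print the command; run it by hand, then start matlab at the compute-node prompt.
interactive_job_cmd 1 2 02:00:00
```

Once the scheduler grants the interactive job, the shell prompt is on a compute node and Matlab can be started there rather than on the interactive node.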
As always, please, let us know if you experience any problems.
Icebox Available February 2nd, 2005
Posted: February 2, 2005
Icebox Available February 2nd, 2005
The structure and administration of the system are very similar to the arches configuration. Example scripts from arches should work with minor modifications. To get the proper paths, you will want to get newer versions of the CHPC startup files. These can be found on our web page:
- chpc.tcshrc (for csh and tcsh)
- chpc.bashrc (for bash)
Major CHPC Downtime: From 5 pm on 2/3/05 until about 4 am on 2/4/05.
Posted: January 27, 2005
updated February 4th, 2005
re-posted: January 12, 2005
Major CHPC Downtime - All HPC systems and INSCC Networking From 5:00 pm 2/3/05 - about 4 am Friday 2/4/05.
Systems affected: All networking in the INSCC building, SSB machine room and Komas machine room. All of the HPC systems including the Arches Clusters: marchingmen, delicatearch, tunnelarch and landscapearch; icebox and sierra. There will be a router upgrade at this time as well.
Date: Beginning 5:00 pm on Thursday February 3rd, 2005
Duration: Until about 4 am on Friday February 4th, 2005
Details: There will be a power outage at Komas. All networking will be down for the INSCC building, SSB machine room and Komas machine room. Home directories on HPC systems with a path of /uufs/inscc.utah.edu/common/home are planned to be moved to a new server with policy changes. See Migration of /uufs/inscc.utah.edu/common/home to new fileserver.
Migration of /uufs/inscc.utah.edu/common/home to new fileserver February 3rd, 2005
Posted: January 26, 2005
Migration of /uufs/inscc.utah.edu/common/home to new fileserver February 3rd, 2005
We are planning to migrate to a new fileserver for home directories on our HPC systems during the downtime February 3rd, 2005. This will affect you only if your home directory is currently being served on the HPC systems (arches clusters, icebox and sierra) out of the /uufs/inscc.utah.edu/common/home filesystem.
We will be making the following changes to our policies for this space:
- We will no longer back up this data. Any critical data in this space should be moved to your home department for permanent storage.
- We will be putting quotas on the usage of the new filesystem. The default quota will be 1 GB. If you need a larger quota, please send a request to problems@chpc.utah.edu.
- We will discontinue charging for the disk usage in this space.
We plan on leaving the data on the current fileserver for a few weeks in the event the new fileserver has any problems. If you have any questions or concerns about these changes, please let us know.
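To see where a home directory stands relative to the 1 GB default quota mentioned above, something like the following could be used. This is only a sketch: `check_quota` is a hypothetical helper built on `du`, not a CHPC-provided tool, and the way the new fileserver actually accounts for usage may differ.

```shell
#!/bin/bash
# Hypothetical helper: compare a directory's disk usage with a quota in MB.
# Prints the usage and returns non-zero when the directory exceeds the quota.
check_quota() {
    local dir="$1" quota_mb="${2:-1024}"   # 1024 MB = the 1 GB default quota
    local used_kb
    used_kb=$(du -sk "$dir" | cut -f1) || return 2
    echo "used: $((used_kb / 1024)) MB of ${quota_mb} MB"
    [ "$used_kb" -le $((quota_mb * 1024)) ]
}

# Check the directory given as the first argument (default: current directory).
check_quota "${1:-$PWD}" 1024 || echo "over quota: move data or request an increase"
```

Running it with a home directory as the argument before the migration gives an early indication of whether data needs to be moved or a larger quota requested.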
Skyline Arch Available
Posted: January 21, 2005
Skyline Arch Available
The visualization portion of the Arches cluster, "Skyline Arch", located in 294 INSCC, is now available to our user community. This resource consists of 10 dual-Opteron nodes with Nvidia Quadro FX 3000 graphics cards driving 18 Sanyo LCD projectors to create a stereo 3 x 3 tiled display. The tiling application we are using is Chromium. It intercepts OpenGL calls from an application and distributes them to the appropriate display. This wall can potentially be used with any OpenGL visualization application.
Currently the OpenGL application must understand "quad buffered stereo" in order to use the stereo abilities of the wall. We are working on some Chromium options that would allow us to force the OpenGL application to create stereo images. VMD (Visual Molecular Dynamics) is the only OpenGL application we have thoroughly tested with the wall so far. In the works are Paraview (a volume visualization application), Mercury and Arima. Also available is NPB (NCSA Pixel Blaster), which allows the viewing of pre-rendered movies, such as .avi files or a sequence of images.
If you would like to take advantage of this resource, please contact Sam Liston (stliston@chpc.utah.edu).
IA-32 Cluster (icebox) to be available soon
Posted: January 19, 2005
updated: January 28, 2005
IA-32 Cluster (icebox) to be available soon
Icebox, the IA-32 cluster, will be available for use again sometime the week of January 31st, 2005, but there will be some significant changes in the configuration:
- Only approximately 100 nodes will be brought online
- All of the 100 or so nodes will be dual-processor
- The notion of "ownership" of nodes will no longer apply
- There will be no allocations
The motivation for these changes is that all of the nodes are getting older and are out of warranty. We want to bring up the system to get any useful life out of the better nodes, but without a lot of complexity. Please let us know if you have questions about this.
Major CHPC Downtime: February 3rd, 2005 from 5 pm - midnight. All systems and networks.
Posted: January 12, 2005
Major CHPC Downtime - All HPC systems and INSCC Networking From 5:00 pm - midnight Thursday 2/3/05.
Systems affected: All networking in the INSCC building, SSB machine room and Komas machine room. All of the HPC systems including the Arches Clusters: marchingmen, delicatearch, tunnelarch and landscapearch; icebox and sierra. There will be a router upgrade at this time as well.
Date: Beginning 5:00 pm on Thursday February 3rd, 2005
Duration: Approximately 7 hours
Details: There will be a power outage at Komas. All networking will be down for the INSCC building, SSB machine room and Komas machine room.
Major CHPC Downtime: February 3rd, 2005 from 5 pm - about 4am February 4th, 2005. All systems and networks.
Posted: January 12, 2005
Major CHPC Downtime - All HPC systems and INSCC Networking From 5:00 pm 2/3/05 - 4:00 am 2/4/05.
Systems affected: All networking in the INSCC building, SSB machine room and Komas machine room. All of the HPC systems including the Arches Clusters: marchingmen, delicatearch, tunnelarch and landscapearch; icebox and sierra.
Date: Beginning 5:00 pm on Thursday February 3rd, 2005
Duration: Lasted until about 4:00 am on Friday February 4th, 2005
Details: There will be a power outage at Komas. All networking will be down for the INSCC building, SSB machine room and Komas machine room.
Pathscale compilers upgraded to version 2.0 on Arches clusters
Posted: January 11, 2005
Pathscale compilers upgraded to version 2.0 on Arches clusters
We have upgraded the Pathscale compilers to version 2.0 on all Arches clusters. The upgrade includes OpenMP Fortran support and the Pathdb debugger. For details, see http://www.pathscale.com/pr_011105.html
We also upgraded the Intel Trace Collector library (an MPI profiling tool) on DelicateArch to version 5.0. For details on how to use this tool, see http://www.chpc.utah.edu/docs/manuals/software/vampir.html
As always, please let us know if you encounter any problems.
Unscheduled Downtime: Arches Clusters
Posted: January 8, 2005
Unscheduled Downtime - Arches Clusters From approximately 10:00 am Saturday 1/8/05 until approximately 6:00 pm Sunday 1/9/05
Due to a Cooler Failure in the Komas Machine Room
Arches Clusters DOWNED about 10:00 am Saturday January 8th, 2005
Arches Clusters UP about 6:00 pm Sunday January 9th, 2005
Systems affected: All of the Arches Clusters including: marchingmen, delicatearch, tunnelarch and landscapearch
Date: Beginning 10:00 am on Saturday 1/8/05
Duration: Until 6:00 pm on Sunday 1/9/05.
Details: The machine room at Komas was dangerously overheating. CHPC staff shut down all systems in the machine room to prevent equipment damage. The cooler was repaired and the room was returned to normal operating temperatures.
MPICH2 1.0 installed on Arches
Posted: January 5, 2005
MPICH2 1.0 installed on Arches
We have installed the new full release of MPICH2 on all Arches clusters. This release has full MPI-2 support. Apart from this, it includes some optimizations for global communication operations and a separate module for shared memory communication with vastly improved performance on SMP nodes compared to MPICH 1.x.

