Arches Metacluster User Guide

Arches Metacluster Overview

  • Landscape Arch (restricted)
    • Limited availability
  • Tunnel Arch (48 nodes, 96 procs)
    • "data mining cluster" for jobs requiring large memory.
    • 1.4 and 1.8 GHz Opteron processors
    • 4 Gbytes memory per node
    • Gigabit Ethernet interconnect
  • Marching Men (164 nodes, 328 procs)
    • "cycle farm" for serial and smaller parallel jobs which do not require a high speed interconnect.
    • 1.4 and 1.8 GHz Opteron processors
    • 2 Gbytes memory per node
    • Gigabit Ethernet interconnect
  • Delicate Arch (256 nodes, 512 procs)
    • "parallel cluster" for highly parallel jobs requiring high speed interconnect.
    • 1.4 GHz Opteron processors
    • 2 Gbytes memory per node
    • Both Myrinet and Gigabit Ethernet interconnects
  • Sanddune Arch (156 nodes, 312 procs, 624 cores)
    • "parallel cluster" for highly parallel jobs requiring high speed interconnect.
    • 2.4 GHz dual-core Opteron processors
    • 8 Gbytes memory per node (2 Gbytes per processor core)
    • Both Infiniband and Gigabit Ethernet interconnects

NFS Home Directory

Your home directory is an NFS-mounted file system and is one choice for I/O. Of the available storage options, it generally has the slowest I/O performance. This space is visible to all of the nodes of the clusters through an auto-mounting system.

NFS Scratch (/scratch/serial)

NFS-mounted file system. This space is visible to all Arches nodes and can be accessed with the path /scratch/serial. Each user is responsible for creating directories and cleaning up after their jobs. This file system is not backed up.

Local Disk (/tmp)

Local Scratch space is the storage space unique to each individual node. Local Scratch is cleaned aggressively and is not supported by CHPC. It can be accessed on each node through "/tmp". This space will be the fastest, but not necessarily the largest. Users should use this space at their own risk.

PVFS

PVFS, or Global Scratch, is also available on the Arches clusters. This space is reachable from any Arches node through "/scratch/parallel". PVFS (Parallel Virtual File System) provides high performance parallel file access from the compute nodes and is thus the file system of choice for applications that do heavy I/O from multiple nodes at the same time.

When running jobs, it is a good idea to move data from one storage system to another as the job progresses. For example, a job that isn't too large and doesn't need much time on the node can do its work in "/scratch/serial" and then have the batch job copy its output to the user's home directory.

It is also important to keep in mind that ALL users must remove excess files on their own. Preferably this is done by the user's batch job once the computation has finished. Leaving files in any "/scratch/" space creates an impediment to other users who are trying to run their own jobs. Simply delete all extra files from any space other than your home directory when they are no longer in use.
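The flow described above (stage input into scratch, compute there, copy the results home, then clean up) can be sketched as a small shell function. This is only an illustration, not a CHPC-provided script: the function name run_in_scratch and its arguments are hypothetical, and the scratch root is passed in so the same pattern can be tried with /scratch/serial or /tmp.

```shell
# Hypothetical helper: run a command inside a private scratch directory,
# copy everything found there back to a results directory, then clean up.
# Usage: run_in_scratch SCRATCH_ROOT INPUT_DIR RESULT_DIR COMMAND [ARGS...]
run_in_scratch() {
    scratch_root=$1; input_dir=$2; result_dir=$3
    shift 3

    # One directory per user and job so concurrent jobs do not collide.
    work_dir="$scratch_root/${USER:-unknown}/${PBS_JOBID:-manual.$$}"
    mkdir -p "$work_dir" "$result_dir" || return 1

    # Stage the input files into scratch, then run the command there.
    cp -r "$input_dir"/. "$work_dir"/ || return 1
    ( cd "$work_dir" && "$@" )
    status=$?

    # Copy the scratch contents (inputs and outputs) home and remove the
    # scratch directory, even if the command failed.
    cp -r "$work_dir"/. "$result_dir"/
    rm -rf "$work_dir"
    return $status
}
```

Inside a batch script this might be called as run_in_scratch /scratch/serial $HOME/working_directory/input $HOME/working_directory/output $HOME/working_directory/a.out, where the input and output directory names are placeholders.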

CHPC resources are available to qualified faculty, students (under faculty supervision), and researchers from any Utah institution of higher education. Users can request accounts for CHPC computer systems by filling out an account request form. This can be found by following the link below or by coming into Room 405, INSCC Building. (Phone 581-5253)

Users requiring more than the default service units (SU) per quarter need to send a brief proposal, using the allocation form, which is available as either:

  • Web version: Allocation form.
  • Hardcopy: from our main office, 405 INSCC, 581-5253.

The Arches clusters can be accessed via ssh (secure shell) at the following addresses:

  • delicatearch.chpc.utah.edu - parallel cluster (Myrinet interconnect)
  • sanddunearch.chpc.utah.edu - parallel cluster (InfiniBand interconnect)
  • marchingmen.chpc.utah.edu - cycle farm (Gigabit Ethernet interconnect)
  • tunnelarch.chpc.utah.edu - data mining cluster
  • landscapearch.chpc.utah.edu - PI owned cluster

Telnet facilities have been disabled and ssh (secure shell) will be required for remote login. SSH is available on all CHPC systems and is a highly recommended replacement for telnet.

All CHPC machines mount the same user home directories. That means that user files visible on one cluster will be the same on the other clusters. While this has the obvious benefit of not having to copy files between machines, users must be aware of this fact and make sure that they run the correct executables for each cluster platform (for example, a Myrinet MPI executable will not work on Marchingmen, which does not have Myrinet).

Another complication associated with a single home directory across all systems is shell initialization scripts (which run at each login and set up the environment, paths, ...). The environment, and especially the paths to applications, vary between clusters.
CHPC created a login script that determines which machine is being logged into and performs machine-specific initialization. The second goal of this script is to let users turn initialization on or off for specific packages installed on the cluster, e.g. switch between different MPI distributions or initialize variables for Totalview, Gaussian, ...


Default .tcshrc login script for CHPC systems
Default .bashrc login script for CHPC systems

Note that those using the tcsh shell need the .tcshrc file, while those using bash need .bashrc. In the case of bash, users should also create a .bash_profile file. The easiest way is to copy /etc/skel/.bash_profile into the user's home directory.

The first part of each script determines which machine is being logged into, based on the machine's operating system, IP address or the variable UUFSCELL defined at the system level. The list of CHPC Linux machine addresses is retrieved from the CHPC webserver upon each login and is stored in the file linuxips.csh or linuxips.sh. If the webserver is down, there is a timeout of about a minute, after which the script either uses the IP address file saved from previous sessions or, if none is available, issues a warning.
The script then finds the IP address of the machine and does host-specific initialization.

Below is an example of tcsh initialization on tunnelarch. It works similarly for all other Arches clusters. The bash syntax is similar but slightly different. Lines starting with # are comments. One can turn specific package initializations off or on by adding or removing the comment character at the start of a line containing a source command. Do not comment out lines that don't start with source.

else if ($UUFSCELL == "tunnelarch.arches") then
# Commenting/uncommenting source lines below will disable/enable specified packages
# stacksize by default is very small, which causes programs with large static data to segfault
limit stacksize unlimited

# default path addon
setenv PATH "/uufs/arches/sys/bin:/uufs/$UUFSCELL/sys/bin:$PATH"
setenv MANPATH "/uufs/$UUFSCELL/sys/man:$MANPATH"
...
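For bash users, the same host-specific section might look roughly like the sketch below. It is an illustration only and not the contents of CHPC's .bashrc: the function name arches_init is hypothetical, and only the tunnelarch branch from the tcsh excerpt above is mirrored.

```shell
# Hypothetical bash counterpart of the tcsh excerpt (not CHPC's actual .bashrc).
arches_init() {
    case "$UUFSCELL" in
        tunnelarch.arches)
            # Equivalent of tcsh "limit stacksize unlimited"; the error is
            # ignored in case the hard limit does not allow raising it.
            ulimit -s unlimited 2>/dev/null || true
            # default path addon
            export PATH="/uufs/arches/sys/bin:/uufs/$UUFSCELL/sys/bin:$PATH"
            export MANPATH="/uufs/$UUFSCELL/sys/man:${MANPATH:-}"
            ;;
        *)
            # Other clusters would get their own branches here.
            :
            ;;
    esac
}
```

As in the tcsh version, package initializations would be added to a branch as source lines that can be commented out individually.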

After the numerous host-specific initialization sections, the last section of the script does a global initialization, that is the same for each machine. Here one can for example set various command aliases, prompt format,...

If the user also mounts the CHPC home directory on their own desktop (most people do), we recommend setting the variable MYMACHINE to that machine's IP address. This address can be found by issuing the command hostname -i. For example:


hostname -i
123.456.78.90

we change the MYMACHINE line in .tcshrc to:

set MYMACHINE="123.456.78.90"

Then look for the $MYIP == $MYMACHINE line in the script, and add selected customizations there.

The batch implementation on all CHPC systems includes PBS, a resource manager, and a scheduler. The scheduler on Arches is the Maui scheduler.

Any process which runs for more than 15 minutes will need to be run through the batch system.

There are three steps to running in batch:

  1. Create a batch script file.
  2. Submit the script to the batch system.
  3. Check on your job.

Example Bash Script for Arches

Note that shell programming is exactly like running commands in the shell. You simply write into the file the commands you would like to run the same way you would write them interactively.

The following is an example script for running in PBS on Arches. The lines at top of the file all begin with #PBS which are seen as comments to the shell, but give options to PBS and Maui. Please see the options below for the available flags.

Note that in the example below we don't specify a queue; the default is the queue of the cluster the user is logged into. The queue names are the same as those of the clusters, e.g. delicatearch, sanddunearch, marchingmen, tunnelarch, landscapearch.

In this example the job would be limited to 4 CPUs and 1 hour of walltime. The node specification can be tricky: "nodes=4" means 4 "tasks" and will (given the current Arches configuration) assign your job to two dual-processor nodes. On sanddunearch, whose nodes have four cores, your job will be assigned to just one node. An equivalent way of asking for 4 tasks is nodes=2:ppn=2, which explicitly asks for 2 dual-processor nodes (ppn means processors per node), or, for sanddunearch, nodes=1:ppn=4, which asks for four cores on one sanddunearch node. See Arches cluster configuration for information on available speeds and memory.
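The arithmetic behind the explicit form can be sketched as a small shell helper. The function name node_spec is hypothetical; it simply rounds the task count up to whole nodes, so a task count that is not a multiple of ppn requests slightly more processors than tasks.

```shell
# Hypothetical helper: build a "nodes=N:ppn=P" request for a given number of
# tasks and processors per node (2 on dual-processor nodes, 4 on sanddunearch).
node_spec() {
    tasks=$1; ppn=$2
    # Round up so a partially filled last node is still requested.
    nodes=$(( (tasks + ppn - 1) / ppn ))
    echo "nodes=${nodes}:ppn=${ppn}"
}

node_spec 4 2   # prints nodes=2:ppn=2
node_spec 4 4   # prints nodes=1:ppn=4
```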

The PBS "-l" option (2nd line of the example) tells PBS and Maui what resources you need to run your job. You will need to ask for resources that are consistent with CHPC policies and based on what is available.

Regarding wall time, we suggest using the maximum, 72 hours, for the first run to get an idea of how long the run will take, and then specifying a wall time 10-15% larger than the actual run time. Specifying too short a time risks having the job killed before it finishes because it ran over the wall time. Specifying too long a time, on the other hand, may result in a longer wait in the queue because the job does not fit into smaller free time windows.
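As a rough illustration of the 10-15% margin, the sketch below pads a measured run time and prints it in the HH:MM:SS form used by the walltime resource. The function name padded_walltime is hypothetical, and the default margin of 15% is just the upper end of the range suggested above.

```shell
# Hypothetical helper: pad a measured run time (in seconds) by a percentage
# and print it as HH:MM:SS for use in "#PBS -l walltime=...".
padded_walltime() {
    seconds=$1; percent=${2:-15}
    # Add the margin, rounding it up to a whole second.
    total=$(( seconds + (seconds * percent + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

padded_walltime 3600      # a 1 hour run with a 15% margin: prints 01:09:00
padded_walltime 36000 10  # a 10 hour run with a 10% margin: prints 11:00:00
```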

You will need to change "userid" to your own userid, and also the #PBS -M line to your own email address. In the cp commands also change the working_directory to your path.

In the case of any unscheduled downtime (such as power outages, instances where the cooling fails and the systems are taken down quickly) any jobs actively running in the batch queue will be requeued and restarted from the beginning of the job. Note that this will not happen for scheduled downtimes as the queues are drained before the system is taken down.

If this requeueing/restarting from the beginning behavior is NOT acceptable (e.g. if you want to check it first before restarting), you can add the following option to your PBS script:
#PBS -r n

Finally, in this example we suggest using the global scratch space /scratch/serial, which is visible to all the nodes. However, it is a shared resource prone to load bottlenecks, and it is not backed up. We suggest users copy important data to their home directories after the executable finishes.
A better performing Parallel Virtual File System (PVFS) is being set up to improve parallel I/O from the nodes.

Another option is to use the home directory (which is NFS mounted, so I/O from the nodes will be quite slow) or local scratch /tmp, which is fast but is not shared with other nodes and is purged periodically. Also, CHPC will not retrieve files left on /tmp after the job is finished.

Example PBS Script for Arches:

#PBS -S /bin/bash
#PBS -l nodes=4,walltime=1:00:00
#PBS -m abe
#PBS -M username@your.address.here
#PBS -N jobname

# Create a per-job scratch directory on the shared scratch file system
mkdir -p /scratch/serial-old/$USER/$PBS_JOBID

# Change to working directory
cd /scratch/serial-old/$USER/$PBS_JOBID

# Copy data files to the scratch directory
cp $HOME/working_directory/data_files /scratch/serial-old/$USER/$PBS_JOBID

# Execute serial or mpi job
# include /uufs/arches/sys/mpich/bin in your path environment variable
# note that for delicatearch or sanddunearch the MPI listed below does
# not use the fast network interconnects
/uufs/arches/sys/mpich/bin/mpirun -np 4 -machinefile $PBS_NODEFILE $HOME/working_directory/a.out > outputfile
 
# Copy files back home and cleanup
cp * $HOME/working_directory && cd .. && rm -rf /scratch/serial-old/$USER/$PBS_JOBID
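The example script hard-codes -np 4 to match nodes=4. A minimal sketch of deriving the process count from the node file instead is shown below; it assumes the file named by $PBS_NODEFILE contains one line per allocated processor, and the function name count_tasks is hypothetical.

```shell
# Hypothetical helper: count the tasks PBS assigned to the job.
# Assumes the node file has one line per allocated processor.
count_tasks() {
    # grep -c . counts the non-empty lines of the file
    grep -c . "$1"
}

# Inside a batch job one might then write (illustrative only):
#   NP=$(count_tasks "$PBS_NODEFILE")
#   /uufs/arches/sys/mpich/bin/mpirun -np "$NP" -machinefile "$PBS_NODEFILE" $HOME/working_directory/a.out > outputfile
```

With this approach the -np value follows the -l nodes request, so only one line needs editing when the job size changes.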

PBS uses your default shell so there usually isn't the need to specify which shell to use.

Note that we are running the executable from the working_directory, but reading the input files and writing the output files in /scratch/serial. /scratch/serial is recommended as it is shared among the nodes, but it is not especially fast as it is a single system.

Job Submission on Arches

Submit your job using the "qsub" command in PBS or the "runjob" command in Maui. See the PBS commands below for additional PBS commands.

For example, to submit a script file named "pbsjob", type

qsub pbsjob

PBS sets and expects a number of variables in a PBS script. For information on these variables and necessities, enter:

man qsub

Checking On Your Job

To check if your job is queued or running, use the "showq" command in Maui.

showq

See the Maui commands below for additional Maui commands.
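Submission and a first status check can be combined. The sketch below relies on qsub printing the new job identifier on standard output and passes it to qstat; the function name submit_and_check is hypothetical, and the exact form of the identifier depends on the site.

```shell
# Hypothetical helper: submit a script and immediately show its status.
# qsub prints the identifier of the new job on standard output.
submit_and_check() {
    jobid=$(qsub "$1") || return 1
    echo "Submitted job: $jobid"
    qstat "$jobid"
}

# Example (illustrative only):
#   submit_and_check pbsjob
```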

Maui on Arches

The Maui scheduler uses information from your script to schedule your job. For detailed information please see the Maui web page. On Arches, different clusters have different maximum wall times. The limit on Delicatearch is 24 hours (24:00:00), on Sanddunearch and Marchingmen 72 hours (72:00:00), and on Tunnelarch 5 days (120:00:00).
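For scripts that generate job files, the limits quoted above can be kept in one place. The sketch below uses a hypothetical function name, max_walltime, and simply restates the values from this guide; check current CHPC policy before relying on them.

```shell
# Maximum wall times as stated in this guide (queue names match cluster names).
max_walltime() {
    case "$1" in
        delicatearch)             echo "24:00:00" ;;
        sanddunearch|marchingmen) echo "72:00:00" ;;
        tunnelarch)               echo "120:00:00" ;;
        *) echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

max_walltime tunnelarch   # prints 120:00:00
```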

Users log into an interactive "front end node" and develop their code from this machine using PBS. On Arches there are three locations for storage. Home directory space is common to all nodes of Arches (and other CHPC systems). Scratch space also has a common name on all nodes, however a physical scratch disk is local on each machine. Each storage area has different environment variables which make it suitable for different situations.

Using PBS we are able to manage and control the use of the system to allow for fair usage of all resources. Moreover, with PBS and MPI users do not need specific knowledge of the compute nodes.

PBS Batch Script Options

  • -a date_time.  Declares the time after which the job is eligible for execution. The date_time element is in the form: [[[[CC]YY]MM]DD]hhmm[.S].
  • -e path.  Defines the path to be used for the standard error stream of the batch job. The path is of the form: [hostname:]path_name.
  • -h.  Specifies that a user hold will be applied to the job at submission time.
  • -I.  Declares that the job is to be run "interactively". The job will be queued and scheduled as PBS batch job, but when executed the standard input, output, and error streams of the job will be connected through qsub to the terminal session in which qsub is running.
  • -j join.  Declares if the standard error stream of the job will be merged with the standard output stream. The join argument is one of the following:
    • oe: Both streams are merged into standard output.
    • eo: Both streams are merged into standard error.
    • n: The two streams are kept separate (default).
  • -l resource_list.  Defines the resources that are required by the job and establishes a limit on the amount of resources that can be consumed. Users will want to specify the walltime resource, and if they wish to run a parallel job, the ncpus resource.
  • -m mail_options.  Conditions under which the server will send a mail message about the job. The options are:
    • n: No mail ever sent
    • a (default): When the job aborts
    • b: When the job begins
    • e: When the job ends
  • -M user_list.  Declares the list of e-mail addresses to whom mail is sent. If unspecified it defaults to userid@host from where the job was submitted. You will most likely want to set this option.
  • -N name.  Declares a name for the job.
  • -o path.  Defines the path to be used for the standard output. [hostname:]path_name.
  • -q destination.  The destination is the queue.
  • -S path_list.  Declares the shell that interprets the job script. If not specified it will use the user's login shell.
  • -v variable_list.  Expands the list of environment variables which are exported to the job. The variable list is a comma-separated list of strings of the form variable or variable=value.
  • -V.  Declares that all environment variables in the qsub command's environment are to be exported to the batch job.
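As an illustration of how several of these options combine, here is a sketch of a job script header. The job name, log file name, queue, resource request and e-mail address are placeholders chosen for the example, not recommended settings.

```shell
#PBS -S /bin/bash
#PBS -N jobname
#PBS -q tunnelarch
#PBS -l nodes=2:ppn=2,walltime=2:00:00
#PBS -j oe
#PBS -o jobname.log
#PBS -m ae
#PBS -M username@your.address.here

# The #PBS lines above are comments to the shell; the job body starts here.
# PBS_JOBID is set by PBS inside a job; the fallback covers runs outside PBS.
JOB_LABEL=${PBS_JOBID:-interactive}
echo "Job $JOB_LABEL started on $(uname -n)"
```

With -j oe, standard error is merged into standard output, so only the -o path is needed.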

PBS User Commands

For any of the commands listed below you may do a "man command" for syntax and detailed information.

Frequently used PBS user commands:

  • qsub. Submits a job to the PBS queuing system. Please see PBS Batch Script Options above.
  • qdel. Deletes a PBS job from the queue.
  • qstat. Shows status of PBS batch jobs.
  • xpbs. X interface for PBS users.

Less Frequently-Used PBS User Commands:

  • qalter. Modifies the attributes of a job.
  • qhold. Requests that the PBS server place a hold on a job.
  • qmove. Removes a job from the queue in which it resides and places the job in another queue.
  • qmsg. Sends a message to a PBS batch job. To send a message to a job is to write a message string into one or more of the job's output files.
  • qorder. Exchanges the order of two PBS batch jobs within a queue.
  • qrerun. Reruns a PBS batch job.
  • qrls. Releases a hold on a PBS batch job.
  • qselect. Lists the job identifier of those jobs which meet certain selection criteria.
  • qsig. Requests that a signal be sent to the session leader of a batch job.

Maui Scheduler User Commands

  • showq - displays jobs which are running, active, idling and non-queued.
  • showbf - shows resources available for backfill.
  • showstart - shows the estimated start time of a job.
  • checkjob - displays status of a job.
  • showres - shows active reservations.

Each command accepts the -h flag, which displays help.

Maui commands are located in "/uufs/cluster.arches/sys/bin". Please see the Maui Scheduler documentation for more information.

C/C++

Updated: March 29, 2006

The Arches metacluster offers several compilers. The GNU Compiler Suite includes ANSI C, C++ and Fortran 77 compilers. The current version is 3.4.4, which ships with RedHat EL 4, the distribution run on the system.

In addition to GNU compilers, we offer three commercial compiler suites. The Pathscale compilers generally provide superior performance. They include C, C++ and Fortran 77/90/95. An advantage is full interoperability with GNU compilers, including g77.

The Portland Group Compiler Suite is another good compiler distribution, which we have seen to perform better in some cases than Pathscale. It should also interoperate with GNU, though we have seen problems with execution of some Fortran codes that were linking g77 compiled libraries.

Finally, we also offer a non-commercial license for the Intel compilers for the EM64T platform, which work well on x86-64 Opterons. From our testing, this compiler performs just slightly slower than Pathscale and a bit faster than PGI.
This compiler is at present included just as a convenience to the users, with no library support, but we would be happy to assist with library installation if such a need arises.

GNU compilers

The GNU distribution is located in the default area, that is, compilers in /usr/bin, libraries in /usr/lib or /usr/lib64, header files in /usr/include, and so on. The user should not need to do anything other than invoke the compiler by its name, e.g.:

gcc source.c -o executable

Pathscale compilers

The latest version of the Pathscale compilers is located at /uufs/arches/sys/pkg/pscale/std

To find the compiler version, use the --version flag, e.g. pathcc --version.

In order to use the compilers, users have to source a shell script that defines paths and some other environment variables.

  • source /uufs/arches/sys/pkg/pscale/std/etc/pscale.csh  (for csh/tcsh)
  • source /uufs/arches/sys/pkg/pscale/std/etc/pscale.sh  (for ksh/bash)

The compilers are invoked as pathcc, pathCC and pathf90 for C, C++ and F90, respectively. For a list of available flags, use the man pages (e.g. man pathcc).

Aggressive optimization is achieved with -O3 -OPT:Ofast. Further performance gain can be achieved with interprocedural analysis, invoked with the -ipa flag; however, there are some limitations on its usage. Contact CHPC staff if you run into problems.

For more information on the compiler, visit the Pathscale EKOPath site.

Documentation, including whitepapers, is at the Pathscale documentation site.

PGI compilers

The latest version of the Portland Group compilers is located at /uufs/arches/sys/pkg/pgi/std

To find the compiler version, use the -V flag, e.g. pgcc -V.

In order to use the compilers, users have to source a shell script that defines paths and some other environment variables.

  • source /uufs/arches/sys/pkg/pgi/std/etc/pgi.csh  (for csh/tcsh)
  • source /uufs/arches/sys/pkg/pgi/std/etc/pgi.sh  (for ksh/bash)

The compilers are invoked as pgcc, pgCC, pgf77 and pgf90 for C, C++, F77 and F90, respectively. For a list of available flags, use the man pages (e.g. man pgcc).

We generally recommend flag -fastsse for good performance.

For more information on the compiler, visit Portland Group website.

Documentation including user's guide, language reference, etc. can be found here.

Intel compilers

The latest version of Intel C compilers is located at /uufs/arches/sys/pkg/intel/icc/std

To find the compiler version, use the -v flag, e.g. icc -v.

In order to use the compilers, users have to source a shell script that defines paths and some other environment variables.

  • source /uufs/arches/sys/pkg/intel/icc/std/bin/iccvars.csh  (for csh/tcsh)
  • source /uufs/arches/sys/pkg/intel/icc/std/bin/iccvars.sh  (for ksh/bash)

The compilers are invoked as icc, icpc and ifort for C, C++ and F90, respectively. For a list of available flags, use the man pages (e.g. man icc).

We generally recommend flag -O3 for good performance.

For more information on the compiler, visit Intel C++ compiler website.

Documentation including user's guide, language reference, etc. can be found here.
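The optimization flags recommended in the sections above can be gathered in one place, for instance for use in build scripts. The sketch below only restates the flags quoted in this guide; the function name recommended_flags and the suite keywords are hypothetical.

```shell
# Recommended optimization flags per compiler suite, as quoted in this guide.
recommended_flags() {
    case "$1" in
        pathscale) echo "-O3 -OPT:Ofast" ;;
        pgi)       echo "-fastsse" ;;
        intel)     echo "-O3" ;;
        gnu)       echo "" ;;   # this guide gives no specific recommendation
        *)         echo "unknown compiler suite: $1" >&2; return 1 ;;
    esac
}

# Example for bash users (illustrative only):
#   source /uufs/arches/sys/pkg/pscale/std/etc/pscale.sh
#   pathcc $(recommended_flags pathscale) source.c -o executable
```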

Fortran

Updated: March 29th, 2006

The Arches metacluster offers several compilers. The GNU Compiler Suite includes ANSI C, C++ and Fortran 77 compilers. The current version is 3.4.4, which ships with RedHat EL 4, the distribution run on the system.

In addition to GNU compilers, we offer three commercial compiler suites. The Pathscale compilers generally provide superior performance. They include C, C++ and Fortran 77/90/95. An advantage is full interoperability with GNU compilers, including g77.

The Portland Group Compiler Suite is another good compiler distribution, which we have seen to perform better in some cases than Pathscale. It should also interoperate with GNU, though we have seen problems with execution of some Fortran codes that were linking g77 compiled libraries.

Finally, we also offer a non-commercial license for the Intel compilers for the EM64T platform, which work well on x86-64 Opterons. From our testing, this compiler performs just slightly slower than Pathscale and a bit faster than PGI.
This compiler is at present included just as a convenience to the users, with no library support, but we would be happy to assist with library installation if such a need arises.

GNU compilers

The GNU distribution is located in the default area, that is, compilers in /usr/bin, libraries in /usr/lib or /usr/lib64, header files in /usr/include, and so on. The user should not need to do anything other than invoke the compiler by its name, e.g.:

g77 source.f -o executable

Pathscale compilers

The latest version of the Pathscale compilers is located at /uufs/arches/sys/pkg/pscale/std

In order to use the compilers, users have to source a shell script that defines paths and some other environment variables.

  • source /uufs/arches/sys/pkg/pscale/std/etc/pscale.csh  (for csh/tcsh)
  • source /uufs/arches/sys/pkg/pscale/std/etc/pscale.sh  (for ksh/bash)

The compilers are invoked as pathcc, pathCC and pathf90 for C, C++ and F90, respectively. For a list of available flags, use the man pages (e.g. man pathf90).

Aggressive optimization is achieved with -O3 -OPT:Ofast. Further performance gain can be achieved with interprocedural analysis, invoked with the -ipa flag; however, there are some limitations on its usage. Contact CHPC staff if you run into problems.

For more information on the compiler, visit the Pathscale EKOPath site.

Documentation, including whitepapers, is at the Pathscale documentation site.

Portland Group compilers

The latest version of the Portland Group compilers is located at /uufs/arches/sys/pkg/pgi/std

In order to use the compilers, users have to source a shell script that defines paths and some other environment variables.

  • source /uufs/arches/sys/pkg/pgi/std/etc/pgi.csh  (for csh/tcsh)
  • source /uufs/arches/sys/pkg/pgi/std/etc/pgi.sh  (for ksh/bash)

The compilers are invoked as pgcc, pgCC, pgf77 and pgf90 for C, C++, F77 and F90, respectively. For a list of available flags, use the man pages (e.g. man pgf90).

We generally recommend flag -fastsse for good performance.

For more information on the compiler, visit Portland Group website.

Documentation including user's guide, language reference, etc. can be found here.

Intel compilers

The latest version of the Intel Fortran compilers is located at /uufs/arches/sys/pkg/intel/ifort/std

To find the compiler version, use the -v flag, e.g. ifort -v.

In order to use the compilers, users have to source a shell script that defines paths and some other environment variables.

  • source /uufs/arches/sys/pkg/intel/ifort/std/bin/ifortvars.csh  (for csh/tcsh)
  • source /uufs/arches/sys/pkg/intel/ifort/std/bin/ifortvars.sh  (for ksh/bash)

The compilers are invoked as icc, icpc and ifort for C, C++ and F90, respectively. For a list of available flags, use the man pages (e.g. man ifort).

We generally recommend flag -O3 for good performance.

For more information on the compiler, visit the Intel Fortran compiler website.

Documentation including user's guide, language reference, etc. can be found here.

The Data Display Debugger (ddd) is a graphical interface which supports multiple debuggers, including the standard GNU debugger, gdb. With ddd one can attach to running processes, set conditional break points, manipulate the data of executing processes, view source code, assembly code, registers, threads, and signal states. Man pages for both ddd and gdb are available. For more information visit the following URL:

http://www.gnu.org/software/ddd/

In addition the Portland Group includes a debugger, pgdbg, along with their compiler suite.

Totalview, a de facto industry standard debugger, supports both serial and parallel debugging. For details on how to use Totalview, refer to CHPC's Totalview page.

For serial profiling, there is GNU gprof and Portland Group pgprof. For parallel profiling, there are MPICH bundled upshot and jumpshot and, the recommended commercial product, Vampir. For details on how to use Vampir, refer to CHPC's Vampir/Vampirtrace Profiler webpage.

As Arches is a distributed memory parallel system, message passing is the way to communicate between the processes of a parallel program. Message Passing Interface (MPI) is the prevalent communication system, and several versions of MPI are installed on Arches. For instructions on how to use MPI, refer to CHPC's MPI webpage and the Introduction to programming with MPI tutorial presentation.

All Arches nodes are dual processor, which means that shared memory programming can be used on these nodes to save some time on the MPI message overhead. OpenMP is emerging as the major industry standard for shared memory programming, and is supported by the Portland Group compilers with the command line flag -mp. More information on OpenMP can be found in the Introduction to programming with OpenMP tutorial presentation.

Parallel debugging "will be" supported by the Totalview Debugger. For details on how to use Totalview, refer to CHPC's Totalview page.

Portland Group's debugger, pgdbg, also supports parallel debugging, though it's much simpler and more cumbersome to use than Totalview.

A profiler capable of timing MPI communication functions (which are often a bottleneck in a parallel program) is Vampir. We will most likely obtain it. For details on how to use Vampir, refer to CHPC's Vampir/Vampirtrace Profiler webpage.

NFS Issues

To achieve optimal performance on the cluster, a user needs to be mindful of a few things. First, the NFS server that provides home directory space can be oversaturated by excessive use. This results in slower file access and longer run times. The local scratch on each node should be used to reduce the need to access home directory space. NFS reads are much more efficient than NFS writes, so reading from home directory space and writing to local scratch is a good base design. Once a job is completed, gather what is stored in local scratch back to the user's home directory. This procedure has one other advantage. The Ethernet connection is used not only for NFS but also for MPI communications. If this network connection is flooded with too many NFS calls, the efficiency of a parallel job is greatly reduced. Therefore, unless it is necessary, keep the traffic off the network. This leaves the maximum network resources for MPI work.

TCP/IP Issues

In the current implementation, MPI calls are sent over TCP/IP. The large overhead of TCP/IP slows down MPI work. Moreover, its slow startup nature prevents low latency and high bandwidth for small MPI messages, i.e. messages of 256 bytes or less. To work around this problem, we recommend consolidating many small MPI buffers into fewer, larger ones. Again, this is a temporary state, as software and hardware are advancing and better solutions are near at hand.

Linear algebra subroutines

Updated: July 1, 2004

There are several different BLAS library versions on Arches, which are summarized below, in the order of our preference.

ATLAS

Automatically Tuned Linear Algebra Software (ATLAS) is an open source library aimed at providing a portable performance solution. It provides full BLAS and certain LAPACK routines, which are tuned to the computer platform at compilation time. This is the BLAS-compatible library that we recommend using. ATLAS has been optimized for the Opteron platform and, in our tests, achieves the best performance in a set of BLAS operations. The library is located at /uufs/$CLUSTER/sys/pkg/atlas/std/lib.

The current version is 3.7.2.

Compilation instructions:

Fortran

GNU Fortran:

g77 source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lf77blas -latlas

To include also the LAPACK subset in ATLAS:

g77 source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -llapack -lcblas -lf77blas -latlas

PGI Fortran:

pgf90 (or pgf77) source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lpgf90blas (or -lpgf77blas) -latlas

To include also the LAPACK subset in ATLAS:

pgf90 (or pgf77) source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lpgf90lapack (or -lpgf77lapack) -lcblas -lpgf90blas (or -lpgf77blas) -latlas

Pathscale Fortran:

pathf90 source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lpathf90blas -latlas

To include also the LAPACK subset in ATLAS:

pathf90 source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lpathf90lapack -lcblas -lpathf90blas -latlas

C/C++

GNU C/C++:

gcc (or g++) source_name.c -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lcblas -latlas

PGI C/C++:

pgcc (or pgCC) source_name.c -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lcblas -latlas

Pathscale C/C++:

pathcc (or pathCC) source_name.c -o executable_name -L/uufs/$CLUSTER/sys/pkg/atlas/std/lib -lcblas -latlas

AMD Core Math Library (ACML)

Since 2003, AMD has been actively developing a mathematical library targeted at its AthlonXP and Opteron processors. As of this writing, it includes complete, optimized BLAS, LAPACK and fast Fourier transform routines. For more information, consult the AMD ACML webpage. From our tests and AMD's presentation, BLAS speed is almost as good as that of ATLAS; the FFT routines are several percent slower than FFTW, but keep improving with new releases. The latest available version, 2.5, is located at /uufs/$CLUSTER/sys/pkg/acml/std/lib.

Compilation instructions:

Fortran

GNU Fortran:

g77 source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/acml/std/gnu64/lib -lacml

PGI Fortran:

pgf90 (or pgf77) source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/acml/std/pgi64/lib -lacml

Pathscale Fortran:

pathf90 -ff2c-abi /uufs/$CLUSTER/sys/pkg/acml/std/gnu64/lib/acml-2.5 source_name.f -o executable_name -L/uufs/$CLUSTER/sys/pkg/acml/std/gnu64/lib -lacml -lg2c

C/C++

GNU C/C++:

gcc (or g++) source_name.c -o executable_name -L/uufs/$CLUSTER/sys/pkg/acml/std/gnu64/lib -lacml

PGI C/C++:

pgcc (or pgCC) source_name.c -o executable_name -L/uufs/$CLUSTER/sys/pkg/acml/std/pgi64/lib -lacml

Pathscale C/C++:

pathcc (or pathCC) source_name.c -o executable_name -L/uufs/$CLUSTER/sys/pkg/acml/std/gnu64/lib -lacml

GOTO library

The GOTO library by Kazushige Goto is another high performance implementation of BLAS and partial LAPACK, whose performance is comparable with that of ATLAS. AMD experts claim that GOTO's BLAS3 performance (DGEMM in particular) is slightly better than that of ATLAS. For more information, consult the GOTO webpage. The latest GOTO library is located in /uufs/$CLUSTER/sys/pkg/goto/std/lib.

Compilation instructions:

Fortran

GNU Fortran:

g77 source_name.f -o executable_name -Wl,-rpath=/uufs/$CLUSTER/sys/pkg/goto/std/lib /uufs/$CLUSTER/sys/pkg/goto/std/lib/libgoto.so

PGI Fortran:

pgf90 (or pgf77) source_name.f -o executable_name -Wl,-rpath=/uufs/$CLUSTER/sys/pkg/goto/std/lib /uufs/$CLUSTER/sys/pkg/goto/std/lib/libgoto.so

Pathscale Fortran:

pathf90 source_name.f -o executable_name -Wl,-rpath=/uufs/$CLUSTER/sys/pkg/goto/std/lib /uufs/$CLUSTER/sys/pkg/goto/std/lib/libgoto.so

C/C++

GNU C/C++:

gcc source_name.c -o executable_name -Wl,-rpath=/uufs/$CLUSTER/sys/pkg/goto/std/lib /uufs/$CLUSTER/sys/pkg/goto/std/lib/libgoto.so

PGI C/C++:

pgcc (or pgCC) source_name.c -o executable_name -Wl,-rpath=/uufs/$CLUSTER/sys/pkg/goto/std/lib /uufs/$CLUSTER/sys/pkg/goto/std/lib/libgoto.so

Pathscale C/C++:

pathcc source_name.c -o executable_name -Wl,-rpath=/uufs/$CLUSTER/sys/pkg/goto/std/lib /uufs/$CLUSTER/sys/pkg/goto/std/lib/libgoto.so

Portland Group BLAS

Portland Group ships its own version of BLAS library with its compilers. This is the BLAS that will get linked to your source when you use PG compilers with -lblas option. We discourage using it since it is not highly optimized. The libraries are located at $PGI/linux86/lib.

Compilation instructions (for reference only):

PGI Fortran

pgf90 (or pgf77) source_name.f -o executable_name -lblas

PGI C/C++

pgcc (or pgCC) source_name.c -o executable_name -lblas