Origin 2000 from SGI

by Lloyd M Caldwell

Introduction:

Silicon Graphics Inc. has recently introduced a family of distributed shared memory multiprocessor systems called the Origin family. The Center for High Performance Computing (CHPC) and Scientific Computing and Imaging (SCI) have joined forces with SGI to place an Origin2000/Onyx2 InfiniteReality system on the UofU campus.

The system contains 60 CPUs, 7,680 Megabytes of memory, 11 Gigabytes of disk space, 8 InfiniteReality Graphics Pipes, and eight 24-inch high-resolution monitors. It is housed in six towers in the machine room in the Merrill Engineering Building (MEB). The Mac Lab in MEB is to be converted to a graphics lab for some of the 24-inch high-resolution monitors.

The Origin2000 is a scalable distributed shared memory architecture that provides a follow-on to the SGI PowerChallenge class of symmetric multiprocessing systems. The Onyx2 InfiniteReality Graphics Pipes are high-performance visualization graphics engines.

The Onyx2 InfiniteReality system will be the subject of a future article.

Hardware:

The Origin2000 is built out of processing nodes linked together by an interconnection fabric. Each processing node contains one or two processors (two in our case), a portion of shared memory (ours have 256 Megabytes), a directory for cache coherence and two interfaces: one connects to the I/O devices and the other connects to the interconnection fabric (CrayLink Interconnect).

Origin2000 Components:

  • Processor: The Origin2000 processor is the MIPS R10000, a 64-bit superscalar processor. This CPU supports dynamic scheduling, a large memory address space, and heavy overlapping of memory transactions (up to twelve per processor).
  • Memory: Each node board adds an independent bank of memory to the system, up to 4 Gigabytes per node board. Up to 64 nodes can be configured per system, which yields a maximum of 256 Gigabytes of system memory.
  • I/O Controllers: Fast, Wide SCSI, Fiberchannel, 100BASE-Tx, ATM and HIPPI-Serial high-speed I/O interfaces are supported.
  • Hub: This ASIC (Application Specific Integrated Circuit) is the distributed shared memory controller. It provides cache-coherent access to all memory for the processors and I/O devices.
  • Directory Memory: Controlled by the Hub, it provides information about the cache status of memory within its node. This status information provides scalable cache coherence and migration of data to the most frequently referencing node.
  • CrayLink Interconnect: High-speed links and routers that tie the hubs together to make a single system image, providing low-latency, scalable-bandwidth, modular, and fault-tolerant operation.

Origin2000 Features:

  • Distributed shared-memory (DSM) and I/O: Memory is physically distributed throughout the system for fast processor access. Page migration hardware moves data into memory closer to a processor that is frequently using it, thus reducing memory latency. I/O devices are also distributed among the nodes but are universally accessible to all processors and other I/O devices in the system.
  • Directory-based cache coherence: Shared-bus symmetric multiprocessing systems typically use a snoopy protocol to provide cache coherence, which requires broadcasting every cache-line invalidation to every CPU in the system whether or not it holds a copy of the cache line. In contrast, the Origin2000's hardware directory is distributed among the nodes of the system (along with main memory). Cache coherence is applied across the entire system, and point-to-point messages are sent only to the CPUs that are actually using the cache line (see the sketch after this list). This significantly reduces the amount of cache-coherence traffic that passes through the system.
  • Page migration and replication: As discussed above.
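
To make the contrast concrete, here is a small C sketch of the directory idea: a per-cache-line bit vector of sharing nodes lets a write send invalidations only to the nodes that actually hold a copy. The node count, the bit-vector format, and the send_invalidate() routine are illustrative assumptions for this sketch, not the actual Hub protocol.

/* Conceptual sketch of directory-based invalidation (not the actual
 * Hub protocol).  Each cache line's directory entry records which
 * nodes hold a copy; a write invalidates only those nodes. */
#include <stdio.h>

#define MAX_NODES 64                  /* illustrative system size */

typedef struct {
    unsigned long long sharers;       /* bit i set => node i caches the line */
} dir_entry;

/* Stand-in for a point-to-point invalidate message over the interconnect. */
static void send_invalidate(int node, unsigned long line)
{
    printf("invalidate line 0x%lx at node %d\n", line, node);
}

/* On a write, message only the nodes whose bit is set in the directory,
 * instead of broadcasting to every CPU as a snoopy bus would. */
static void write_miss(dir_entry *d, unsigned long line, int writer)
{
    int node;
    for (node = 0; node < MAX_NODES; node++)
        if (node != writer && (d->sharers & (1ULL << node)))
            send_invalidate(node, line);
    d->sharers = 1ULL << writer;      /* the writer now holds the only copy */
}

int main(void)
{
    dir_entry d = { (1ULL << 3) | (1ULL << 17) };  /* nodes 3 and 17 share the line */
    write_miss(&d, 0x1000, 0);                     /* node 0 writes */
    return 0;
}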

Memory Hierarchy:

The Origin2000 implements a hierarchical memory access structure, otherwise known as NUMA (Non-Uniform Memory Access). From lowest latency to highest, we have:

  • Processor registers
  • Cache: These are primary and secondary caches on the processors themselves.
  • Home Memory: This includes the node memory and directory memory of the local processor. The access is local if the address of the memory reference is to an address on the same node as the processor.
  • Remote Cache: Remote nodes may hold copies of a given memory block. If the requesting processor is writing, these copies must be invalidated. If the processor is reading, this level exists when another processor holds the most up-to-date copy of the requested location.

Note: data exists only in either local or remote memory; copies of the data can exist in various processor caches. Keeping the copies consistent is the job of the logic in the various hubs.

System Bandwidth:

For discussing switched interconnection fabric bandwidth, the following definitions will be used:

  • Peak Bandwidth: the clock rate at the interface times the data width of the interface. A theoretical number, or, the manufacturer's specification of the "speed" you will never see :(.
  • Sustained Bandwidth: a best-case figure; subtract the packet header and any other overhead from the peak bandwidth. Does not take into account contention and other variable effects.
  • Bisection Bandwidth: divide the interconnection fabric in half and measure the data rate across this divide. A useful estimate of bandwidth when data is not optimally placed.

The following tables indicate some of these "performance" figures at the interconnection fabric interfaces of the Hub; a small worked example follows them.

TABLE 1.

  Interface at Hub          Half/Full Duplex   Peak Bandwidth   Sustained Bandwidth
                                               (per second)     (per second)
  Processor                 Unidirectional     800MB            -
  Memory                    Unidirectional     800MB            -
  Interconnection Fabric    full               1.6GB            1.25GB
                            half               800MB            711MB

TABLE 2.

  System Size          Bisection Bandwidth
  (number of CPUs)     Sustained/Peak (per second)
  8                    1.28GB / 1.6GB
  16                   2.56GB / 3.2GB
  32                   5.12GB / 6.4GB
  64                   10.2GB / 12.8GB
  128                  20.5GB / 25.6GB
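
As a rough illustration of how these figures relate, the short C program below applies the definitions above to an assumed link: a 400 MHz clock moving 2 bytes per clock in each direction, with 128-byte payloads and a 16-byte header. Those overhead numbers are made-up examples, not SGI specifications, so the sustained result only approximates the 1.25GB entry in Table 1.

/* Illustrative bandwidth arithmetic.  The clock rate, link width, and
 * packet overhead below are assumed example values, not SGI specifications. */
#include <stdio.h>

static double peak_bw(double clock_hz, double bytes_per_clock)
{
    return clock_hz * bytes_per_clock;             /* definition of "peak" */
}

static double sustained_bw(double peak, double payload, double overhead)
{
    return peak * payload / (payload + overhead);  /* best case, no contention */
}

int main(void)
{
    double peak_half = peak_bw(400e6, 2.0);        /* one direction only      */
    double peak_full = 2.0 * peak_half;            /* both directions at once */
    double sust_full = sustained_bw(peak_full, 128.0, 16.0);

    printf("peak (half duplex):      %.0f MB/s\n", peak_half / 1e6);
    printf("peak (full duplex):      %.0f MB/s\n", peak_full / 1e6);
    printf("sustained (full duplex): %.0f MB/s\n", sust_full / 1e6);
    return 0;
}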

Programming Model:

Virtual Address Space:

The MIPS family of 64-bit processors provides a single uniform virtual address space for user processes. The Origin2000 uses the R10000 processor, which defines a 2^44-byte (16 terabyte, TB) user-addressable virtual address space. The virtual address is decoded as follows (a small decoding sketch in C follows the list):

63:62 | 61:59 | 58:57 | 56:44 | 43:40 | 39:00
  • 63:62 =
    00 user space in kernel mode
    01 user space in supervisor mode
    10 physical address space
    11 kernel space in kernel mode
  • 61:59 =
    000 reserved
    001 reserved
    010 uncached
    011 cacheable, noncoherent
    100 cacheable, coherent exclusive
    101 cacheable, coherent exclusive on write
    110 reserved
    111 uncached accelerated
  • 58:57 = Uncached attribute, selects among four uncached spaces
  • 56:44 = not translated by the TLB (Translation Lookaside Buffer)
  • 39:00 = index to physical space when 63:62 = 10
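
As a concrete illustration of the layout above, the C function below simply extracts those bit fields from a 64-bit virtual address; it is a sketch of the table, not the processor's actual translation logic.

/* Sketch: pull the R10000 virtual-address fields listed above out of a
 * 64-bit value.  This only extracts bit fields; it is not the TLB logic. */
#include <stdio.h>
#include <inttypes.h>

static void decode_va(uint64_t va)
{
    unsigned space   = (unsigned)((va >> 62) & 0x3);   /* 63:62 address space      */
    unsigned cache   = (unsigned)((va >> 59) & 0x7);   /* 61:59 cache algorithm    */
    unsigned uncattr = (unsigned)((va >> 57) & 0x3);   /* 58:57 uncached attribute */
    uint64_t physidx = va & 0xFFFFFFFFFFULL;            /* 39:00 physical index     */

    printf("space=%u cache=%u uncached_attr=%u", space, cache, uncattr);
    if (space == 0x2)                                   /* 10 = physical address space */
        printf(" physical_index=0x%010" PRIx64, physidx);
    printf("\n");
}

int main(void)
{
    decode_va(0x9000000001234567ULL);   /* example physical-space (10) address */
    return 0;
}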

Physical Address Space:

To any processor, main memory appears as a single address space containing many individually addressable blocks (pages). Each node is allocated a static portion of this address space, typically 4GB. Secondary cache lines are fixed in size at 32 words (128 bytes). Memory page sizes are multiples of 4KB, usually 16KB.
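
Under the static layout just described, finding the home node of a physical address is a single shift. The sketch below assumes a simple contiguous 4GB-per-node map; the actual Origin2000 address map may interleave or reserve ranges differently.

/* Sketch: with a static 4GB (2^32-byte) slice of physical address space per
 * node, the home node of an address is just its high-order bits.  This
 * assumes a simple contiguous layout, which may differ from the real map. */
#include <stdio.h>
#include <inttypes.h>

#define NODE_SHIFT 32                   /* log2(4GB) */

static unsigned home_node(uint64_t paddr)
{
    return (unsigned)(paddr >> NODE_SHIFT);
}

int main(void)
{
    uint64_t paddr = (5ULL << NODE_SHIFT) + 0x1000;   /* an address in node 5's slice */
    printf("physical 0x%" PRIx64 " is homed on node %u\n", paddr, home_node(paddr));
    return 0;
}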

Conclusion:

Luckily, none of this is visible to the programmer. Write your parallel code and see how it performs. If more performance is expected or needed, use the tool dplace(1), which allows the programmer to specify the mapping of the application onto the system in a simple high-level script language. A memory access analysis tool, dprof(1), can be used to understand the memory reference patterns of an application; dprof(1) can also automatically generate the dplace input script. For another way to map your application optimally onto the Origin2000 hardware, the MIPSpro Fortran 77 Programmer's Guide (document number 007-2161-004) has more information on how to access the new features of the system.

Finite Temperature Quantum Dynamics on the IBM SP

by August Calhoun, Marc Pavese, and Gregory A. Voth of the Department of Chemistry and Henry Eyring Center for Theoretical Chemistry, University of Utah, Salt Lake City, UT 84112

Introduction

Understanding condensed matter dynamics is a fundamental goal of theoretical chemistry. For example, such an understanding is extremely relevant to chemical reactivity, because the vast majority of industrial and biological chemical reactions are condensed phase processes. Naturally, the subject has received a great deal of attention from computational investigators as well. Largely by necessity, the computer simulation of condensed phase phenomena has mostly focused on the dynamics generated via classical mechanical equations of motion [i.e., molecular dynamics (MD) simulation]. Classical mechanics is a reasonable approximation for many condensed phase systems, but there are notable failures of this approach (see e.g., Refs. 1-5).

A more correct methodology would be to apply the laws of quantum mechanics to the condensed phase. However, quantum dynamical calculations have traditionally been limited to very few degrees of freedom, directly as a result of the difficulty of solving the time dependent Schroedinger equation. As an alternative, the centroid molecular dynamics method (CMD) has been developed to calculate the approximate time evolution of many-body quantum mechanical systems, such as those encountered in condensed matter simulations.[6-9] The method, CMD, is derived from the Feynman path integral formulation of quantum statistical mechanics,[10] and it focuses on the calculation of the ensemble-averaged response of the system as described by linear response theory (i.e., time correlation functions). The CMD method also has natural features that lend it to parallel computation, and we have exploited those features to develop a highly parallel algorithm for doing finite temperature quantum dynamics.[5]

Centroid Molecular Dynamics

CMD is a powerful method because it is analogous to classical MD, which is a well developed and widely applied technique.[11] In the path integral formulation of quantum statistical mechanics, a quantum particle can be represented by a chain of quasiparticles ("beads") q(t), each at a different point in "imaginary time," t, coupled together via nearest neighbor harmonic springs.[12] The center of mass of these beads is the path centroid, and the position of the centroid is denoted here by R. CMD allows one to calculate approximate quantum dynamics within the well understood MD framework, based upon the equation:
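
In standard centroid molecular dynamics notation, with M denoting the particle mass, the equation reads (the form below is inferred from the description in the following paragraph rather than reproduced from the original figure):

\[
  M \, \ddot{R}(t) \;=\; f\bigl(R(t)\bigr)
\]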

This equation of motion is very much like Newton's second law, which governs the motion of particles within classical mechanics, but the force is not the normal classical force. The central quantity within CMD is the force on the centroid of each quantum particle, f(R), i.e., the centroid force. The centroid force is a quantum ensemble averaged quantity, which must be recalculated at each timestep of the simulation. The motion of the beads is propagated via path integral molecular dynamics or path integral Monte Carlo (see, e.g., Ref. 12) with the centroids of the chains constrained to be fixed. Thus, the centroid force can be evaluated, which then determines the motion of the centroids. The trajectory of the centroids of the quantum particles, which can be calculated by the above equation of motion, provides a well defined approximation for the quantum time correlation functions of the system.[9] The computational challenge is to calculate the centroid force in a rapid and efficient manner during the CMD simulation.

FIG. 1. A plot of the speedup versus the number of nodes on the IBM SP at the Cornell Theory Center for the liquid para-hydrogen CMD simulation. The algorithm scales in a near linear fashion on a parallel architecture.

We have parallelized the calculation of the centroid force in a multi-tiered fashion.[5] First, since the centroid force is an ensemble averaged property, we evaluate multiple trajectories of the beads in parallel, and average the results to calculate the centroid force. The second level of parallelization is in the calculation of the force and potential energy for the beads, which is necessary to evaluate the trajectories of the beads (even classically, the force and potential calculation is 99% of the computational effort, as is well known for MD and Monte Carlo[11]). The force and potential evaluation for the beads can be carried out at each point in imaginary time t in parallel. The only communication between the particles at different points in imaginary time is through the nearest neighbor harmonic springs mentioned above.
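
Schematically, that two-level averaging has the structure of the C sketch below. It is a serial illustration only: the outer loop over ensemble replicas and the inner loop over imaginary-time beads are the two loops that get distributed across processors, and evaluate_bead_force() and the sizes used here are hypothetical stand-ins for the real force routine.

/* Schematic sketch of the two-level averaging behind the centroid force:
 * an outer loop over ensemble replicas and an inner loop over imaginary-time
 * beads, both of which can be distributed across processors.  The force
 * routine and the sizes below are illustrative stand-ins, not the real code. */
#include <stdio.h>

#define N_REPLICAS  8     /* independent bead trajectories to average       */
#define N_BEADS    16     /* points in imaginary time per quantum particle  */

/* Toy stand-in for the expensive per-bead force evaluation. */
static double evaluate_bead_force(int replica, int bead)
{
    return -0.1 * bead + 0.01 * replica;          /* placeholder numbers only */
}

static double centroid_force(void)
{
    double sum = 0.0;
    for (int r = 0; r < N_REPLICAS; r++)          /* level 1: ensemble of replicas */
        for (int b = 0; b < N_BEADS; b++)         /* level 2: imaginary-time beads */
            sum += evaluate_bead_force(r, b);
    return sum / (N_REPLICAS * N_BEADS);          /* average -> centroid force     */
}

int main(void)
{
    printf("estimated centroid force: %f\n", centroid_force());
    return 0;
}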

FIG. 2. The phonon spectrum of pure solid para-hydrogen as computed from the Fourier transform of the velocity autocorrelation function. The agreement with the experimental IR and neutron scattering results is quite good. Classical mechanics is completely incorrect.

Our parallel CMD code has been run on 56 nodes of the CHPC SP (at approximately 5.5 GFLOPS), and could easily be scaled to run on many more. A plot of the speedup versus the number of nodes on the IBM SP (Fig. 1) shows that our calculation scales quite well on a parallel architecture.

Applications to Realistic Systems

We tested our method on low temperature liquid para-hydrogen.[5] The structure and dynamics of low temperature hydrogen, as calculated from classical MD, are in significant disagreement with experiment. Also, this system offers a good test of CMD implemented on the IBM SP, because the highly quantum nature of low temperature para-hydrogen requires considerable averaging to calculate the centroid force. We have calculated the velocity autocorrelation functions and self-diffusion constants for liquid para-hydrogen at 25K and 14K. In contrast to classical MD, the CMD results are in quite good agreement with experiment (1.54 versus 1.6 Å²/ps at 25K and 0.32 versus 0.4 Å²/ps at 14K).[5,13] We have also recently computed the phonon spectrum of solid para-hydrogen at 5K (see Fig. 2). The positions of the peaks agree quite well with experimental measurements.[14,15] The classical mechanical results are rather poor.

Another application of the CMD method has been to study the quantum dynamics of an excess proton in water.[2,4] The rate of proton diffusion in water is anomalously high relative to similarly sized ions in water, and the mechanism of this diffusion has been the subject of much study. The light mass of the proton, as well as the high frequency intramolecular motion of water molecules, suggests that classical mechanics will provide an insufficient description of the system. The dynamics of a proton in water have been calculated with CMD, and it is indeed the case that the quantum effects are large. The results of the calculation show that an excess proton is quite delocalized in space and spends very little time associated with one water molecule, suggesting that H5O2+ may be the dominant species in acidic solutions, rather than H3O+. We have also been able to explore the quantum dynamical motions that are likely to promote proton diffusion in water.[4]

We are presently studying the properties of impurity-doped hydrogen, as well as extending our model for proton transfer in water to provide a more realistic representation of the motion of the proton between several water molecules. The parallel implementation of the CMD method described here should bring the quantum dynamics of many condensed matter systems within the reach of computer simulation.

Bibliography

  1. A. Wallqvist and B. J. Berne, Chem. Phys. Lett. 117, 214 (1985).
  2. J. Lobaugh and G. A. Voth, J. Chem. Phys. 106, 2400 (1997).
  3. J. B. Straus, A. Calhoun, and G. A. Voth, J. Chem. Phys. 102, 529 (1995).
  4. J. Lobaugh and G. A. Voth, J. Chem. Phys. 104, 2056 (1996).
  5. A. Calhoun, M. Pavese, and G. A. Voth, Chem. Phys. Lett. 262, 415 (1996).
  6. J. Cao and G. A. Voth, J. Chem. Phys. 99, 10070 (1993).
  7. J. Cao and G. A. Voth, J. Chem. Phys. 100, 5106 (1994).
  8. J. Cao and G. A. Voth, J. Chem. Phys. 101, 6157 (1994).
  9. J. Cao and G. A. Voth, J. Chem. Phys. 101, 6168 (1994).
  10. R. P. Feynman and A. R. Hibbs, Quantum Mechanics and Path Integrals (McGraw-Hill Publishing Company, New York, 1965).
  11. M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids (Oxford University Press, New York, 1987).
  12. B. J. Berne and D. Thirumalai, Ann. Rev. Phys. Chem. 37, 401 (1986).
  13. G. J. Martyna and J. Cao, J. Chem. Phys. 104, 2028 (1996).
  14. H. P. Gush, W. F. J. Hare, E. J. Allin, and H. L. Welsh, Can. J. Phys. 38, 176 (1960).
  15. M. Nielsen, Phys. Rev. B 7, 1626 (1973).

Get it on the Web!!

by Ludovic Milin, CHPC Consultant

CHPC is in the process of putting together a collection of links to the webpages of the research groups using our facilities. If you are one of those and your group has a homepage, please send us the URL.

If your research group does not have a web presence but would like to have one, CHPC can provide you with space on its webserver as well as some help and expertise to start building a homepage, including formatting images and creating animations, movies, etc.

To submit a URL or for more information regarding these services, send mail to consult@usi.utah.edu or call 1-4439.

CHPC organizes a UofU research exhibit at Supercomputing '97

by Stefano Foresti, CHPC Staff Scientist

The Center for High Performance Computing is organizing an exhibit at Supercomputing '97, the largest convention on high performance computing, to be held at the San Jose Convention Center, San Jose, CA, USA, on November 15-21, 1997.

Research exhibits at SC97 provide an opportunity to demonstrate new and innovative research results.

The CHPC is planning to exhibit relevant research in high performance computing and visualization at the University of Utah.

CHPC is soliciting research groups to participate in this exhibit and to contribute to the success of the proposal.

The SC97 committee encourages research exhibitors to submit multimedia proposals of their work, including, where appropriate, short video clips suitable for access over the WWW. References to additional information available on the WWW are also appropriate.

The deadline for proposals to SC97 is August 8, 1997. If your group is involved in high performance computing and visualization and you have appropriate material, please contact Dr. Stefano Foresti (stefano@usi.utah.edu or 581-3173), who is coordinating the research exhibit proposal.

More information on SC97 and research exhibits can be found at http://www.supercomp.org/sc97.

During the week of June 19, a workshop was held in the Physics Department. A group of Physics Department faculty and staff, along with upper elementary through high school science educators, has met periodically over the last several months. Led by Professor Eugene Loh, the group is developing World Wide Web-based lessons in the field of cosmic rays and astrophysics.

This summer immersion session began with some talks on fundamental research and education, and the "New Paradigm" for science. Other talks included "Physics on the Web", "Science Topics Useful for Illustrating Science Concepts", "Java and its Capabilities", and "Utah Science Core Definition".

On Tuesday, the group went out to Dugway for a "Show and Tell" session.

The rest of the week, working groups got together to begin defining lessons for the Web. Over the next few months these lessons will begin to take shape.

CHPC is helping with this project by administering and maintaining the web server and providing technical support. We are also doing some programming work along with the teachers and physics folks to animate these lessons using the Java programming language from Sun.

The hope is that scientific facts and discoveries will enliven the lessons, and that learning about cosmic rays and astrophysics will come as a natural by-product.

This project is funded by the National Science Foundation and Sun Microsystems, Inc.
