## RandomAccess

## GUPS (Giga Updates Per Second)

Random memory performance often maps directly to application performance and application development time. A small percentage of random memory accesses (cache misses) in an application can significantly affect its overall performance. Current processor trends favor longer cache lines and stride-1 performance at the expense of random-stride access performance. As random memory access becomes increasingly expensive relative to processor operations, the need arises for a benchmark that discriminates on the basis of random memory access performance.
The effects of even a small percentage of random memory accesses are
illustrated in Figures 1 and 2, where outputs from the MAPS (Memory
Access Patterns) architecture profiling routine show measured memory
bandwidth for stride-1 load, random memory load, and various
percentages of random memory access for a Compaq Alphaserver (Lemieux
at the Pittsburgh Supercomputing Center) and an IBM SP-3 (Blue Horizon
at the San Diego Supercomputing Center (SDSC)) respectively. These
charts are from A. Snavely et al. at the SDSC Performance Modeling
and Characterization (PMaC) Laboratory.

Figure 1. MAPS on a Compaq Alphaserver
Figure 2. MAPS on an IBM SP-3
GUPS (Giga UPdates per Second) is a measurement that profiles the memory architecture of a system and is a measure of performance similar to MFLOPS. The RandomAccess test is part of the HPC Challenge benchmark developed for the HPCS program. The test is intended to exercise the GUPS capability of a system, much as the LINPACK benchmark is intended to exercise the MFLOPS capability of a computer. In each case, we would expect these benchmarks to achieve close to the "peak" capability of the memory system. The similarity between RandomAccess and LINPACK is limited to the fact that both benchmarks attempt to measure a peak system capability.

We are interested in knowing the GUPS performance of both entire systems and system subcomponents -- e.g., the GUPS rating of a distributed memory multiprocessor, the GUPS rating of an SMP node, and the GUPS rating of a single processor. While FLOPS typically scale with processor count, a similar scaling may not always occur for GUPS.

GUPS is calculated by identifying the number of memory locations that can be randomly updated in one second, divided by 1 billion (1e9). The term "randomly" means that there is little relationship between one address to be updated and the next, except that they occur in the space of one half of the total system memory. An update is a read-modify-write operation on a table of 64-bit words: an address is generated, the value at that address is read from memory, modified by an integer operation (add, and, or, xor) with a literal value, and the new value is written back to memory.
Select the memory size to be the largest power of two that is less than or equal to half of total memory.

When implementing a benchmark that measures GUPS on a distributed memory multiprocessor system, it may be necessary to constrain how far ahead in the random address stream each process is permitted to "look ahead". Likewise, it may be necessary to constrain the number of update messages that can be stored before processing, to permit multi-level parallelism on systems that support such a paradigm. The limits on "look ahead" and "stored updates" are imposed to ensure that the benchmark meets its intent of profiling memory architecture without inducing significant artificial data locality. For the purpose of measuring GUPS, we stipulate that each process may look ahead no more than 1024 random address stream samples, with the same limit on the number of update messages stored before processing.

## Basic RandomAccess Benchmark Definition
Let T[ ] be a table of size 2^n.
Let {A_i} be a stream of 64-bit integers of length 2^(n+2) generated by the primitive polynomial over GF(2)[1], X^63 + X^2 + X + 1.

For each a_i, perform the update:

    T[a_i<63, 64-n>] = T[a_i<63, 64-n>] + a_i

where:

- + denotes addition in GF(2), i.e., exclusive "or"
- a_i<j, k> denotes the sequence of bits j through k within a_i, e.g., a_i<63, 64-n> are the highest n bits

The parameter n is defined such that the table T[ ] occupies approximately half of system memory: 2^n is the largest power-of-two table size that fits in half of memory.

Constraints on the look-ahead and storage before processing on distributed memory multi-processor systems are limited to 1024 per process. Operations performed in parallel (either due to look-ahead or process parallelization) are allowed to be "unlocked" and therefore may result in a small percentage of updates being left unperformed due to race conditions. (If two parallel operations read the same location, update the value, and then each store the resulting value, one update is lost.) This percentage may not exceed 1%.

The basic RandomAccess definition is described graphically in Figure 3.

Figure 3. RandomAccess Description
## HPC Challenge RandomAccess Tests

RandomAccess will be run in three variants:

- Single process -- T[ ] is sized to half of the process' memory
- All processes perform identically to the single-process run with no interactions (i.e., embarrassingly parallel)
- All processes cooperate to solve a larger problem
  - The table T[ ] is distributed over all processes and uses half (rounded down) of all system memory
  - Each process calculates a portion of the address stream {A_i}, starting at equally spaced points along the cycle of the stream defined by the polynomial above -- a process may not calculate values outside of its range
We are supplying two versions of RandomAccess in the HPC Challenge Benchmark code distribution:

- A local version that runs on a single process. Figures 4 and 5 describe two potential implementations:
  - Sequential RandomAccess (Figure 4)
  - Vector or multi-threaded RandomAccess (Figure 5)
- A global version that uses MPI-1 and is mandatory to run. Figures 6 and 7 describe the implementations for *power of two* and *non-power of two* numbers of processors:
  - MPIRandomAccess Power of Two Case (Figure 6)
  - MPIRandomAccess Non-Power of Two Case (Figure 7)
Figure 4. Sequential RandomAccess Implementation
Figure 5. Vector or Multi-Threaded RandomAccess Implementation
Figure 6. MPIRandomAccess Power of Two Case Implementation
Figure 7. MPIRandomAccess Non-Power of Two Case Implementation
## The Supplied MPI-1 RandomAccess Code
The supplied MPI-1 code generates the input stream {A_i}, with each process producing its own portion of the stream.
If the number of processors is equal to a power of two, then the
global table can be distributed equally over the processors. In
addition, the processor number can be determined from that portion of
the input stream that identifies the address into the global table by
masking off log2(P) bits, where P is the number of processors.

If the number of processors is not a power of two, the global table cannot be distributed equally between processors. In the MPI-1 implementation provided, the workloads are balanced as closely as possible: the largest difference in the number of elements of T[ ] between processors is one. The number of values in the input stream generated by each processor is proportional to the number of global table entries on that processor.

The MPI-1 version of RandomAccess treats the case where the number of processors is a power of two as a special case, because of the significant simplifications possible: processor location and local offset can be determined by applying masks to the input stream values. The non-power-of-two case uses an integer division to determine the processor location, which costs more machine cycles than the bit-masking operations.

## Optimizing RandomAccess

While it will be mandatory to run the HPC Challenge Benchmark code distribution -- including the MPI-1 RandomAccess version -- we encourage users to modify any of the kernels to optimize them for their architecture, software, middleware, and interprocessor features. The supplied UPC codes may serve as the basis for architecture-dependent optimized RandomAccess runs. Potential modifications to RandomAccess include, but are not limited to:

- Developing an MPI-1 version that exploits the look-ahead and storage constraints of the specification
- Developing a hybrid MPI/OpenMP version
- Developing an MPI-2 version
- Developing a version that uses other single-sided communications like SHMEM
- Developing a Co-Array FORTRAN (CAF) version
- Developing a version that takes advantage of architecture features like E-registers, etc.
There are many other modifications possible. It is important that modified codes remain true to the intent of the basic requirements of the RandomAccess benchmark description: (1) the size of the table T[ ] must be a power of two and approximately half the global memory, and (2) the look-ahead and storage constraints must be followed. Specifically, attempts to improve memory access characteristics by reordering the address stream -- so that all updates are "local" -- violate the rules of the RandomAccess benchmark.

We would appreciate it if users would share their optimized codes so that we can evaluate the innovation expressed in the implementation and determine whether the version maintains the spirit of the RandomAccess benchmark. It is not mandatory to provide a version of the code, although we will require at least a description of the modifications to post the benchmark performance times. We also encourage users to estimate the staff hours required to produce the optimized version of RandomAccess. This helps gauge how hard it was to get better performance; even order-of-magnitude estimates would be useful in examining tradeoffs between productivity and performance. We hope that active involvement by the HPC community will produce highly optimized versions of RandomAccess, similar to the highly optimized versions of High Performance LINPACK (HPL).

## Contacts

RandomAccess is one of the DARPA HPCS Discrete Math Benchmarks. It was initially developed by David Koester and Bob Lucas.

[1] GF(2) (Galois Field of order 2): the elements of GF(2) can be represented using the integers 0 and 1, i.e., binary operands.