Base Results
Optimized Results
Base and Optimized
Base Results
Optimized Results
Base and Optimized
Base Results
Optimized Results
Base and Optimized
Base Results
Optimized Results
Base and Optimized
Base Results
Optimized Results
Base and Optimized
Base Results
Optimized Results
Base and Optimized






RandomAccess

GUPS (Giga Updates Per Second)

Random memory performance often maps directly to application performance and application development time. A small percentage of random memory accesses (cache misses) in an application can significantly affect the overall performance of that application. Current processor trends favor longer cache lines and stride-1 performance at the expense of random stride access performance. As random memory access becomes increasingly more expensive relative to processor operations, the need arises for a benchmark that discriminates on the basis of random memory access performance.

The effects of even a small percentage of random memory accesses are illustrated in Figures 1 and 2, where outputs from the MAPS (Memory Access Patterns) architecture profiling routine show measured memory bandwidth for stride-1 load, random memory load, and various percentages of random memory access for a Compaq Alphaserver (Lemieux at the Pittsburgh Supercomputing Center) and an IBM SP-3 (Blue Horizon at the San Diego Supercomputing Center (SDSC)) respectively. These charts are from A. Snavely, et. al., at the SDSC Performance Modeling and Characterization (PMaC) Laboratory.


Figure 1. MAPS on a Compaq Alphaserver

Figure 2. MAPS on an IBM SP-3

GUPS (Giga UPdates per Second) is a measurement that profiles the memory architecture of a system and is a measure of performance similar to MFLOPS. The RandomAccess test is part of the HPC Challenge benchmark developed for the HPCS program. The test intended to exercise the GUPS capability of a system, much like the LINPACK benchmark is intended to exercise the MFLOPS capability of a computer. In each case, we would expect these benchmarks to achieve close to the "peak" capability of the memory system. The extent of the similarities between RandomAccess and LINPACK are limited to both benchmarks attempting to calculate a peak system capability. We are interested in knowing the GUPS performance of both entire systems and system subcomponents -- e.g., the GUPS rating of a distributed memory multiprocessor the GUPS rating of an SMP node, and the GUPS rating of a single processor. While there is typically a scaling of FLOPS with processor count, a similar phenomenon may not always occur for GUPS.

GUPS is calculated by identifying the number of memory locations that can be randomly updated in one second, divided by 1 billion (1e9). The term "randomly" means that there is little relationship between one address to be updated and the next, except that they occur in the space of 1/2 the total system memory. An update is a read-modify-write operation on a table of 64-bit words. An address is generated, the value at that address read from memory, modified by an integer operation (add, and, or, xor) with a literal value, and that new value is written back to memory.

Select the memory size to be the power of two such that 2n is less than or equal to half of the total memory. Each process operates on its own address stream, and the single table may be distributed among processes. The distribution of memory to processes is left to the implementer. A uniform data distribution may help balance the workload, while non-uniform data distributions may simplify the calculations that identify processor location by eliminating the requirement for integer divides. A small (less than 1%) percentage of missed updates are permitted.

When implementing a benchmark that measures GUPS on a distributed memory multiprocessor system, it may be required to define constraints as to how far in the random address stream each process is permitted to "look ahead". Likewise, it may be required to define a constraint as to the number of update messages that can be stored before processing to permit multi-level parallelism for those systems that support such a paradigm. The limits on "look ahead" and "stored updates" are being implemented to assure that the benchmark meets the intent to profile memory architecture and not induce significant artificial data locality. For the purpose of measuring GUPS, we will stipulate that each process is permitted to look ahead no more than 1024 random address stream samples with the same number of update messages stored before processing.

Basic RandomAccess Benchmark Definition

Let T[ ] be a table of size 2n.

Let {Ai} be a stream of 64-bit integers of length 2n+2 generated by the primitive polynomial over GF(2) [1], X2 + X+1.

For each ai, set T[ai <63, 64-n>] = T[ai <63, 64-n>] + ai

Where

+
denotes addition in GF(2) i.e. exclusive "or"
ai<j, i>
denotes the sequence of bits within ai e.g. <63, 64-n> are the highest n bits

The parameter n is defined such that:
n is the largest power of 2 that is less than or equal to half of main memory.

Constraints on the look ahead and storage before processing on distributed memory multi-processor systems is limited to 1024 per process. Operations performed in parallel (either due to look ahead or process parallelization) are allowed to be "unlocked" and therefore may result in a small percentage of updates being left unperformed due to race conditions. (If two parallel operations read the same location, update the value, and then each store the resulting values -- one update is lost.) This percentage may not exceed 1%. The basic RandomAccess definition is described graphically in Figure 3.


Figure 3. RandomAccess Description

HPC Challenge RandomAccess Tests

RandomAccess will be run in three variants:

  1. Single process -- T[ ] is sized to half of process' memory
  2. All processes perform identical to single process with no interactions (i.e., embarrassingly parallel)
  3. All processes cooperate to solve a larger problem
    1. The table T[ ] is distributed over all processes and uses half (rounded down) of all system memory
    2. Each process calculates a portion of the address stream {Ai} starting at equal spacing along the cycle of the stream defined by the above polynomial -- a process may not calculate values outside of its range

We are supplying two versions of RandomAccess in the HPC Challenge Benchmark code distribution:

  1. A local version that can be run either on a single process. Figures 4 and 5 describe two potential implementations:
    • Sequential RandomAccess (Figure 4)
    • Vector or multi-threaded RandomAccess (Figure 5)
  2. A global version that runs using MPI-1 that is mandatory to run. Figures 6 and 7 describe the implementations for power of two and non-power of two numbers of processors.
    • MPIRandomAccess Power of Two Case (Figure 6)
    • MPIRandomAccess Non-Power of Two Case (Figure 7)


Figure 4. Sequential RandomAccess Implementation

Figure 5. Vector or Multi-Threaded RandomAccess Implementation

Figure 6. MPIRandomAccess Power of Two Case Implementation

Figure 7. MPIRandomAccess Non-Power of Two Case Implementation

The Supplied MPI-1 RandomAccess Code

The supplied MPI-1 code generates the input stream {Ai} on all processors and the global table has been distributed as uniformly as possible to balance the workload and minimize any Amdahl fraction. This code does exploit "look-ahead". Addresses are sent to the appropriate processor where the table entry resides as soon as each address is calculated. Updates are performed as addresses are received. Each message is limited to a single 64 bit long integer containing element ai from {Ai}. Local offsets for T[ ] are extracted by the destination processor.

If the number of processors is equal to a power of two, then the global table can be distributed equally over the processors. In addition, the processor number can be determined from that portion of the input stream that identifies the address into the global table by masking off log2(p) bits in the address -- ai <63, (64-log2(p)), >.

If the number of processors is not equal to a power of two, then the global table cannot be equally distributed between processors. In the MPI-1 implementation provided, there has been an attempt to minimize the differences in workloads and the largest difference in elements of T[ ] is one. The number of values in the input stream generated by each processor will be related to the number of global table entries on each processor.

The MPI-1 version of RandomAccess treats the potential instance where the number of processors is a power of two as a special case, because of the significant simplifications possible because processor location and local offset can be determined by applying masks to the input stream values. The non power of two case uses an integer division to determine the processor location. The integer division will be more costly in terms of machine cycles to perform than the bit masking operations.

Optimizing RandomAccess

While it will be mandatory to run the HPC Challenge Benchmark code distribution -- including the MPI-1 RandomAccess version -- we encourage users to modify any of the kernels to optimize them for your architecture, software, middleware, and interprocessor features. The supplied UPC codes may be the basis for architecture-dependent optimized RandomAccess runs.  Potential modifications to RandomAccess include, but are not limited to:

  • Developing an MPI-1 version that exploits the look-ahead and storage constraints of the specification
  • Developing a hybrid MPI/OpenMP version
  • Developing an MPI-2 version
  • Developing a version that uses other single-sided communications like SHMEM
  • Developing a Co-Array FORTRAN (CAF) version
  • Developing a version that takes advantage of architecture features like E-registers, etc.

There are many other modifications possible. It is important that modified codes remain true to the intent of the basic requirements of the RandomAccess benchmark description with (1) size of the table T[ ] being a power of two and approximately half the global memory and with (2) look-ahead and storage constraints being followed. Specifically, attempts to improve memory access characteristics by reordering the address stream -- so that all updates are "local" -- violates the rules of the RandomAccess benchmark.

We would appreciate if users would share their optimized codes so that we can evaluate the innovation expressed in the implementation and we can determine whether the version of the code maintains the spirit of the RandomAccess benchmark. It is not mandatory to provide a version of the code, although we will require at least a description of the modifications to post the benchmark performance times. We would also encourage users to estimate the staff hours required to make the changes to the optimized version of RandomAccess. This helps gauge the "how hard was it to get better performance". Even order of magnitude measures would be useful to help us examine issues relating to tradeoffs in productivity and performance. We hope that active involvement by the HPC community will develop highly optimized versions of RandomAccess similar to the highly optimized version of High Performance LINPACK (HPL).

Contacts

RandomAccess is one of the DARPA HPCS Discrete Math Benchmarks. It was initially developed by David Koester and Bob Lucas.



[1] GF(2) (Galois Field of order 2) The elements of GF(2) can be represented using the integers 0 and 1, i.e., binary operands.