The HPC Challenge benchmark consists at this time of 7
benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff
Latency/Bandwidth. HPL is the Linpack TPP benchmark. The test stresses the floating point performance of a system. STREAM is a benchmark that measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from the multiprocessor’s memory. Latency/Bandwidth
measures (as the name suggests) latency and bandwidth of
communication patterns of increasing complexity between
as many nodes as is time-wise feasible.
Latency/Bandwidth measures latency (time required to send an 8-byte message
from one node to another) and bandwidth (message size divided by the time it
takes to transmit a 2,000,000 byte message) of network communication using
basic MPI routines. The measurement is done during non-simultaneous
(ping-pong benchmark) and simultaneous communication (random and natural ring
pattern) and therefore covers two extreme levels of contention (no
contention, and contention caused by each process communicating with a
randomly chosen neighbor in parallel) that might occur in a real application.
Download the benchmark software, link in MPI and the BLAS, adjust the input file, and run the MPI program on your parallel system. See the README.txt file in the
benchmark distribution archive for the details.
A “base run” is defined as compiling and running the supplied program along with a version of MPI and the BLAS. No changes to the source code are allowed for the base run. For the base run, the MPI and the BLAS must be the ones in common use on the system. A user may adjust the input to the benchmark to accommodate their system.
Tflop/s is a rate of execution: trillions (ten to the
12th power) of floating point operations per second.
Whenever this term is used it will refer to 64-bit floating
point operations and the operations will be either addition
or multiplication (a “fused” multiply/add is counted as two
floating point operations). GB/s stands for Giga (ten to the 9th power) bytes per
second and is a unit of bandwidth - a rate of transfer of
data between the processor and memory and also over the
network. Two types of measurements may be reported for
network bandwidth: per CPU and accumulated (for all nodes).
Gup/s is short for Giga updates per second. An update
is the basic operation performed by the RandomAccess benchmark:
read an integer from memory, change it in the processor, and
write it back to memory. The location of the read and the write is
the same, but each time it is selected at random. Therefore,
there is no relation between Gup/s and GB/s because the
latter implicitly refers to contiguous transfers. Such
transfers may benefit from prefetching while Gup/s transfers
The term usec is a common abbreviation of micro (ten
to the -6th power) seconds and is used to measure latency
The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate; it is a sort of "speed of light" for a given computer. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle, which gives a theoretical peak performance of 6 GFlop/s.
Why are my performance results below the theoretical peak?
The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for this benchmark suite should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.
Why are the performance results for my computer different than some other machine with the same characteristics?
There are many reasons why your results may vary from results recorded on other machines. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc. can affect the performance even when the processors are the same.
There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such a case, everything is usually not quite working right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities and one is likely to be willing to empirically determine the best set of parameters. In any case, one can always follow the advice provided in the HPL tuning section.
For HPL input, what problem size (matrix dimension N) should I use?
In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 MB of memory on each, this corresponds to 1 GB (2^30 bytes) total, i.e., about 134 M double precision (8-byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80% of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
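The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of HPL: the function name hpl_problem_size and its parameters are hypothetical, and rounding N down to a multiple of the block size NB is an added assumption rather than something the text requires.

```python
import math

def hpl_problem_size(total_memory_bytes, fraction=0.8, nb=None):
    """Estimate the HPL matrix dimension N for a given amount of memory.

    fraction: share of memory given to the matrix (0.8 is the rule of thumb).
    nb: optional block size; if given, N is rounded down to a multiple of it
        (an assumption made for this sketch, not an HPL requirement).
    """
    # The N x N coefficient matrix holds 8-byte double precision elements.
    elements = fraction * total_memory_bytes / 8
    n = int(math.sqrt(elements))
    if nb:
        n -= n % nb
    return n

# 4 nodes with 256 MB each, taken as 2**30 bytes in total.
total = 4 * 256 * 2**20
print(hpl_problem_size(total, fraction=1.0))  # 11585, the upper bound
print(hpl_problem_size(total))                # 10362, using 80% of memory
print(hpl_problem_size(total, nb=128))        # 10240, multiple of NB=128
```

When several processes share a node, the same calculation applies to the total memory actually available to all processes taking part in the run.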
How does the benchmark store data if multiple values of N were chosen?
The benchmark code processes each value of N in turn. For the first value of N, the memory is allocated, matrix data is generated, the linear system is solved and timed, the solution is verified, and the memory is deallocated.
HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.
For HPL what process grid ratio P x Q should I use?
This depends on the physical interconnection network you have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL are strongly limited and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...
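As a rough illustration of the 1:k guideline for a mesh or a switch, the sketch below lists the factorizations P x Q of a process count with P <= Q <= k*P. The helper process_grids is hypothetical and is not part of HPL; it only enumerates candidates, and the best grid still has to be found by measurement.

```python
import math

def process_grids(nprocs, max_ratio=3):
    """Return all (P, Q) with P * Q == nprocs and P <= Q <= max_ratio * P."""
    grids = []
    for p in range(1, math.isqrt(nprocs) + 1):
        if nprocs % p == 0:
            q = nprocs // p
            if q <= max_ratio * p:
                grids.append((p, q))
    return grids

print(process_grids(10))  # [(2, 5)]
print(process_grids(16))  # [(4, 4)]
print(process_grids(24))  # [(3, 8), (4, 6)]
```

For the flat grids suggested for a simple Ethernet network (such as 1 x 8), the ratio limit has to be relaxed, for example with max_ratio=nprocs.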
HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software works on one node and for large problem sizes, one can usually achieve pretty good performance on a single processor as well. For small problem sizes however, the overhead due to message-passing, local indexing and so on can be significant.
Certainly. There is always room for performance improvements (unless you've reached the theoretical peak of your machine). Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulations of the classic ones are potential winners.
You need to modify the input data file HPL.dat. This file should reside in the same directory as the executable hpl/bin//xhpl. An example HPL.dat file is provided by default but is not optimal for any practical system. This file contains information about the problem sizes, machine configuration, and algorithm features to be used by the executable.
The ping pong benchmark is executed on two processes.
From the client process a message (ping) is sent to the server process
and then bounced back to the client (pong). MPI standard blocking send
and receive is used.
The ping-pong patterns are done in a loop.
To obtain the communication time of one message,
the total communication time is measured on the client process
and divided by twice the loop length.
Additional startup latencies are masked out by starting the measurement
after one non-measured ping-pong.
The benchmark in hpcc uses 8 byte messages and loop length = 8 for
benchmarking the communication latency.
The benchmark is repeated 5 times and the shortest latency is reported.
To measure the communication bandwidth, 2,000,000 byte messages with
loop length 1 are repeated twice.
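The timing arithmetic described above can be sketched as follows. This is not the hpcc source code: the function names are hypothetical and the timings in the usage lines are made-up values chosen only to show the calculation.

```python
def one_way_time(total_time, loop_length):
    """Time of a single message: a loop of ping-pongs sends 2 * loop_length messages."""
    return total_time / (2 * loop_length)

def reported_latency(total_times, loop_length=8):
    """Shortest one-way time over the repetitions of the latency loop."""
    return min(one_way_time(t, loop_length) for t in total_times)

def bandwidth(message_bytes, one_way_seconds):
    """Message size divided by the time needed to transmit it, in bytes/s."""
    return message_bytes / one_way_seconds

# Hypothetical client-side timings (seconds) for 5 repetitions of the
# 8-byte latency loop with loop length 8.
latency_runs = [48e-6, 40e-6, 44e-6, 41e-6, 52e-6]
print(reported_latency(latency_runs))

# Hypothetical timing for one 2,000,000-byte ping-pong (loop length 1).
print(bandwidth(2_000_000, one_way_time(0.004, loop_length=1)))
```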
How is ping pong measured on more than 2 processors?
The ping-pong benchmark reports the maximum latency and minimum bandwidth
for a number of non-simultaneous ping-pong tests.
The ping-pongs are performed between as many as possible
(there is an upper bound on the time it takes to complete this test)
distinct pairs of processors.
Which parallel communication pattern is used in the random and natural ring benchmark?
For measuring latency and bandwidth of parallel communication,
all processes are arranged in a ring topology
and each process sends and receives a message from its left and its right
neighbor in parallel.
Two types of rings are reported:
a naturally ordered ring (i.e., ordered by the process ranks in
and the geometric mean of ten different randomly chosen process orderings in the ring.
The communication is implemented
(a) with MPI standard non-blocking receive and send, and
(b) with two calls to MPI_Sendrecv for both directions in the ring.
The faster of the two measurements is always used.
For benchmarking latency and bandwidth, 8 byte and 2,000,000 byte long
messages are used.
With this type of parallel communication, the bandwidth per process is
defined as the total amount of message data divided by the number of processes
and by the maximal time needed in all processes.
This part of the benchmark is based on patterns studied in the effective
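A minimal sketch of the two calculations described in this answer, the bandwidth per process and the geometric mean over several random rings, is given below. The function names and all input values are hypothetical; they illustrate the definitions and are not taken from the benchmark source.

```python
import math

def ring_bandwidth_per_process(total_bytes, nprocs, per_process_times):
    """Total message data divided by the number of processes and by the
    maximal time needed in all processes (bytes/s)."""
    return total_bytes / (nprocs * max(per_process_times))

def geometric_mean(values):
    """Geometric mean, as used to combine the randomly ordered rings."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical ring of 4 processes moving 8,000,000 bytes in total;
# the slowest process needs 1.0 s, so it determines the result.
print(ring_bandwidth_per_process(8_000_000, 4, [0.5, 1.0, 0.25, 0.75]))

# Hypothetical per-process bandwidths (bytes/s) from a few random rings.
print(geometric_mean([2.0e6, 1.8e6, 2.2e6]))
```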
How does the parallel ring bandwidth relate to vendors' values?
Vendors often report a duplex network bandwidth
(per CPU or accumulated)
by counting each message twice, i.e., as incoming and outgoing
data. Their parallel bandwidth values are therefore twice
the values reported by this ring benchmark (because in this hpcc
benchmark, each transferred message is counted only once).
How do I change data size (matrix and vector dimensions) for the tests?
Only HPL and PTRANS matrix sizes can be changed directly in the hpccinf.txt or hpccmemf.txt input files. The remaining tests use the size of the largest HPL matrix to adjust the size of their input data. For example, in a sequential run, if the size of the HPL matrix is 1 GiB then each of the three vectors used by STREAM Triad will be 0.333 GiB, PTRANS matrix will be 0.5 GiB, the FFT vector size will be 125 MiB, and each of the three matrix sizes in DGEMM will be 333 MiB.
To summarize what the above papers say: the dimensions Px and Py of the virtual process grid for PTRANS have to have a small GCD (Greatest Common Divisor) and a small LCM (Least Common Multiple) to achieve good performance. The number of steps to do the transpose is LCM(Px,Py)/GCD(Px,Py), and the number of communicating pairs is GCD(Px,Py).
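The two formulas can be evaluated for a candidate grid with a short Python sketch. The helper name ptrans_schedule is hypothetical; it only applies the LCM/GCD expressions quoted above and does not model PTRANS itself.

```python
import math

def ptrans_schedule(px, py):
    """Apply the formulas above to a Px x Py virtual process grid."""
    g = math.gcd(px, py)
    lcm = px * py // g
    return {"steps": lcm // g, "pairs": g}

print(ptrans_schedule(4, 4))  # {'steps': 1, 'pairs': 4}
print(ptrans_schedule(2, 4))  # {'steps': 2, 'pairs': 2}
print(ptrans_schedule(3, 4))  # {'steps': 12, 'pairs': 1}
```

Comparing a few grids this way shows how quickly the step count grows when Px and Py share no common factor.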
HPCC has not been designed for running individual tests. Quite the opposite. It's a harness that ties multiple tests together. Having said that, it is possible to comment out calls to individual tests in src/hpcc.c.
Is there an easier way to specify input configuration?
Yes, there is. The HPCC code looks for the file "hpccmemf.txt". It is very minimalistic and allows for a quick specification of the input parameters. It takes only a single line that specifies the amount of memory for the run. The amount of memory can be specified per thread, per MPI process, or as the total memory for the entire machine. For example, if HPCC should use 1048576 bytes (1 MiB) per thread, "hpccmemf.txt" should contain the line "Thread=1". If 2 MiB should be allocated per MPI process, the single line in the file should read: "Process=2". And finally, if the total memory used should be 3 MiB, then the single line should be: "Total=3".
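As a small convenience, the single line can be generated by a script. The sketch below assumes only what is stated above: one line of the form Scope=value, where the scope is Thread, Process or Total and the value is a whole number of MiB. The helper name hpccmemf_line is hypothetical.

```python
def hpccmemf_line(scope, mebibytes):
    """Build the single line for hpccmemf.txt, e.g. 'Thread=1'."""
    if scope not in ("Thread", "Process", "Total"):
        raise ValueError("scope must be Thread, Process or Total")
    if not isinstance(mebibytes, int) or mebibytes <= 0:
        raise ValueError("memory must be a positive whole number of MiB")
    return f"{scope}={mebibytes}"

print(hpccmemf_line("Thread", 1))  # Thread=1
print(hpccmemf_line("Total", 3))   # Total=3

# Writing the file in the run directory would then be:
# with open("hpccmemf.txt", "w") as f:
#     f.write(hpccmemf_line("Total", 3) + "\n")
```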
See the Overview section, which contains the rules. In particular, no code changes are allowed but use of general purpose libraries is allowed through compiler directives and options as well as linker flags.
Can I use my own libraries that implement portions of HPCC?
Yes, provided you do not violate the
For base runs, the code cannot be changed, but you can
use optimized libraries to speed up some sections of the
code. These libraries have to be generally available
on your system for others to use.
For optimized runs, you can replace some portions of
the benchmark with your proprietary code.
First name of the person that submits the results. The name is kept private and is only stored internally. It is only used in correspondence with the submitter (if at all) and for the award announcements.
Last name of the person that submits the results. The name is kept private and is only stored internally. It is only used in correspondence with the submitter (if at all) and for the award announcements.
Email address is used for the submission confirmation. It is not possible to submit an entry without a valid e-mail address. This email is not used for any other purposes and is not listed in the publicly accessible documents.
Why do I need to submit my institution or affiliation?
For each result submission, we collect and report either the institution that owns the machine or the one with which the machine is affiliated. This helps in determining the ownership of the machine and may serve as an outreach opportunity. Example entry: University of Tennessee.
The "Theoretical peak" field should contain the computational rate (even if only theoretical) of all the processors/cores used by the benchmark expressed in Tflop/s (trillion floating-point operations per second or 10^12 flop/s). Typically, it is a product of the number of floating-point operations per cycle, clock frequency, and the number of processors/cores. For multi-core chips, a common practice is to refer to a single core as a processor, so the theoretical peak should be multiplied by the number of cores. The table below gives the number of floating-point operations per cycle for common processors/cores. Please keep in mind that recent processors from AMD and Intel utilize frequency scaling. The nominal frequency from the specification might be lower than the maximum frequency that the processor might be able to use under some circumstances. For the theoretical peak, the maximum frequency should be used.
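The product described above can be written as a short Python sketch. The function name and parameters are hypothetical. The single-core figures (4 floating-point operations per cycle at 1.5 GHz) are the Itanium 2 values quoted earlier in this FAQ, while the 64-core machine is an invented example.

```python
def theoretical_peak_tflops(flops_per_cycle, max_frequency_ghz, cores):
    """Theoretical peak in Tflop/s.

    flops_per_cycle * GHz gives Gflop/s per core; multiplying by the
    number of cores and dividing by 1000 converts to Tflop/s.
    """
    return flops_per_cycle * max_frequency_ghz * cores / 1000.0

# One Itanium 2 core at 1.5 GHz: 6 GFlop/s, i.e. 0.006 Tflop/s.
print(theoretical_peak_tflops(4, 1.5, 1))

# Hypothetical 64-core system built from the same cores.
print(theoretical_peak_tflops(4, 1.5, 64))
```

For processors with frequency scaling, max_frequency_ghz should be the maximum frequency rather than the nominal one.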
What is the relation of this benchmark to the Linpack benchmark?
The Linpack benchmark called the Highly Parallel Computing Benchmark can be found in Table 3 of the Linpack Benchmark Report (PDF). This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. HPL is the benchmark used for the TOP500 report.
What does the HPC Challenge Benchmark have to do with the Top500?
The HPC Challenge Benchmark is an attempt to broaden the scope of benchmarking high performance systems. The Top500 uses one metric, the Linpack Benchmark (HPL), to rank the 500 fastest computer systems in use today. The HPC Challenge does not produce a ranking for systems, but provides a set of metrics for evaluations and comparisons.
The Top500 lists the 500 fastest computer systems being used today. The collection was started in June 1993 and has been updated every 6 months since then. The report lists the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as a performance measure in ranking the computers.
To be listed on the Top500 list you have to run the software that can be found at http://www.netlib.org/benchmark/hpl/ and the performance of the benchmark run must be within the range of the 500 fastest computers for that period of time.
What is the relation of this benchmark to the STREAM benchmark?
The version of STREAM in HPCC has been modified so that the
source and destination arrays are allocated from the heap with
a dynamic size instead of having static storage with a
constant size. This allows the size of the arrays to be scaled
appropriately according to the memory size (derived from the
HPL parameters). From the compiler's standpoint, this removes
information about pointer aliasing, alignment, and data size
- all of which might be crucial for efficient code generation.
Optimized run of HPCC is expected to deal with these
What is the relation of this benchmark to the Effective Communication Bandwidth (b_eff) benchmark?
In this benchmark, latency and bandwidth are measured
mainly with three communication patterns (ping-pong, random
ring, natural ring) and two message sizes (8 byte for
latency and 2,000,000 bytes for bandwidth measurements)
and these different results are reported independently.
The buffer memory is always reused in a loop of measurements.
The goal of b_eff is to compute an average bandwidth value that represents several ring patterns (sequentially and
randomly ordered) and 21 different message sizes. Memory
reuse is prohibited.