The HPC Challenge benchmark currently consists of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff Latency/Bandwidth. HPL is the Linpack TPP benchmark; it stresses the floating point performance of a system. STREAM measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from a multiprocessor’s memory. Latency/Bandwidth measures (as the name suggests) latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.
What is the license for the HPCC code?
All components of the code are distributed under the original BSD license.
Where do I send my suggestions or questions about HPCC?
Please send your questions to the HPCC mailing list:
The address has been slightly scrambled to mislead spam
crawlers. Please edit it accordingly in your email client.
HPL measures the floating point execution rate for solving a system of linear equations.
What does DGEMM measure?
DGEMM measures the floating point execution rate for double precision real matrix-matrix multiplication.
What does STREAM measure?
STREAM is a benchmark that measures sustainable memory bandwidth (in GB/s).
What does PTRANS measure?
PTRANS measures the rate of transfer for large arrays of data from a multiprocessor’s memory.
What does RandomAccess measure?
RandomAccess measures the rate of random updates of memory.
What does FFT measure?
FFT measures the floating point rate of execution of a double precision complex one-dimensional Discrete Fourier Transform (DFT).
What does Latency/Bandwidth measure?
Latency/Bandwidth measures latency (the time required to send an 8-byte message from one node to another) and bandwidth (message size divided by the time it takes to transmit a 2,000,000 byte message) of network communication using basic MPI routines. The measurement is done during non-simultaneous communication (ping-pong benchmark) and simultaneous communication (random and natural ring patterns). It therefore covers two extreme levels of contention that might occur in a real application: no contention, and contention caused by each process communicating with a randomly chosen neighbor in parallel.
Where can I get the benchmark program?
You can download the benchmark program from the download section of the main website: http://icl.cs.utk.edu/hpcc/
What do I have to do to run the benchmark?
Download the benchmark software, link in MPI and the BLAS, adjust the input file, and run the MPI program on your parallel system. See the
README.txt file in the
benchmark distribution archive for the details.
The user must supply a machine-specific version of MPI and the BLAS.
What is a “base run”?
A “base run” is defined as compiling and running the supplied program along with a version of MPI and the BLAS. No changes to the source code are allowed for the base run. For the base run, the MPI and the BLAS must be the ones in common use on the system. A user may adjust the input to the benchmark to accommodate their system.
What is an optimized run?
An optimized run is where the user is allowed to change sections of the benchmark to provide optimized or tuned components.
How are the benchmark results verified?
The program harness contains checks on the results for each test. These tests must be satisfied before the results can be submitted.
What timing function is used in the benchmark run?
We are using the timing function provided by MPI, MPI_Wtime.
What are Tflop/s, GB/s, Gup/s and usec?
Tflop/s is a rate of execution: a trillion (ten to the
12th power) floating point operations per second.
Whenever this term is used it will refer to 64-bit floating
point operations and the operations will be either addition
or multiplication (a “fused” multiply/add is counted as two
floating point operations).
GB/s stands for Giga (ten to the 9th power) bytes per second and is a unit of bandwidth - a rate of transfer of data between the processor and memory and also over the network. Two types of measurements may be reported for network bandwidth: per CPU and accumulated (for all nodes).
Gup/s is short for Giga updates per second. An update is the basic operation performed by the RandomAccess benchmark: read an integer from memory, change it in the processor, and write it back to memory. The location of the read and the write is the same, but each time it is selected at random. Therefore, there is no relation between Gup/s and GB/s because the latter implicitly refers to contiguous transfers. Such transfers may benefit from prefetching while Gup/s transfers will not.
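
As a rough illustration of the update operation (not the HPCC implementation), the following Python sketch reads a table entry at a pseudo-random location, changes it, and writes it back, and converts an update count and a time into Gup/s. The function names `random_updates` and `gups`, the use of XOR as the "change" step, and the use of Python's `random` module to pick locations are assumptions made for this example.

```python
import random

def random_updates(table, num_updates, seed=0):
    """Apply num_updates read-modify-write updates at random table locations."""
    rng = random.Random(seed)
    size = len(table)
    for _ in range(num_updates):
        value = rng.getrandbits(64)
        index = value % size      # the location is chosen at random each time
        table[index] ^= value     # read, change, and write back to the same location
    return table

def gups(num_updates, seconds):
    """Rate in Giga (1e9) updates per second."""
    return num_updates / seconds / 1e9
```

Because consecutive indices are unrelated, each update touches an unpredictable location, which is why prefetching does not help this access pattern.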
The term usec is a common abbreviation of micro (ten to the -6th power) seconds and is used to measure latency of communication.
The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate, a sort of "speed of light" for a given computer. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle, for a theoretical peak performance of 6 GFlop/s.
Why are my performance results below the theoretical peak?
The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for this benchmark suite should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.
Why are the performance results for my computer different from those of another machine with the same characteristics?
There are many reasons why your results may vary from results recorded on other machines. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc. can affect the performance even when the processors are the same.
Where can I get additional information on the benchmark?
For additional information on the benchmark see: http://icl.cs.utk.edu/hpcc/.
Will the benchmark change over time?
We are planning to make additions to the benchmark collection over time. As we gain experience with the collections we are planning to expand the components in the benchmark.
Why so many input options in the input file?
There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such a case, everything is usually not quite working right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities and one is likely to be willing to empirically determine the best set of parameters. In any case, one can always follow the advice provided in the HPL tuning section.
Is there a mailing list for asking questions
In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 MB of memory on each, this corresponds to 1 GB total, i.e., about 134 M double precision (8 byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80 % of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
How does the benchmark store data if multiple values of N were chosen?
The benchmark code processes each value of N in turn. For the first value of N, the memory is allocated, matrix data is generated, the linear system is solved and timed, the solution is verified, and the memory is deallocated.
For HPL input what block size NB should I use?
HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.
For HPL what process grid ratio P x Q should I use?
This depends on the physical interconnection network you have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL are strongly limited and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...
What result is reported for multiple values of HPL parameters?
The best performance is reported out of all HPL runs.
For HPL what about the one processor case?
HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software works on one node and for large problem sizes, one can usually achieve pretty good performance on a single processor as well. For small problem sizes however, the overhead due to message-passing, local indexing and so on can be significant.
Can HPL be outperformed?
Certainly. There is always room for performance improvements (unless you've reached the theoretical peak of your machine). Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulations of the classic ones are potential winners.
How do I tune HPL?
You need to modify the input data file HPL.dat. This file should reside in the same directory as the executable hpl/bin/
Most likely yes. The only difference is the TOPdir variable. It should be set to the string '../../..'.
Can I extrapolate, interpolate, or use a similar computer's benchmark run for my computer's results?
No, a benchmark run must be submitted for each computer that is to be entered in the HPCchallenge list, even if the machines have identical processors and interconnection network.
What does ping pong benchmark mean?
The ping pong benchmark is executed on two processes. From the client process a message (ping) is sent to the server process and then bounced back to the client (pong). MPI standard blocking send and receive is used. The ping-pong patterns are done in a loop. To obtain the communication time of one message, the total communication time is measured on the client process and divided by twice the loop length. Additional startup latencies are masked out by starting the measurement after one non-measured ping-pong. The benchmark in hpcc uses 8 byte messages and loop length = 8 for benchmarking the communication latency. The benchmark is repeated 5 times and the shortest latency is reported. To measure the communication bandwidth, 2,000,000 byte messages with loop length 1 are repeated twice.
How is ping pong measured on more than 2 processors?
The ping-pong benchmark reports the maximum latency and minimum bandwidth for a number of non-simultaneous ping-pong tests. The ping-pongs are performed between as many distinct pairs of processors as possible (there is an upper bound on the time it takes to complete this test).
Which parallel communication pattern is used in the random and natural ring benchmark?
For measuring latency and bandwidth of parallel communication,
all processes are arranged in a ring topology
and each process sends and receives a message from its left and its right
neighbor in parallel.
Two types of rings are reported:
a naturally ordered ring (i.e., ordered by the process ranks in
and the geometric mean of ten different randomly chosen process orderings in the ring.
The communication is implemented
(a) with MPI standard non-blocking receive and send, and
(b) with two calls to MPI_Sendrecv for both directions in the ring.
The faster of the two measurements is always used. For benchmarking latency and bandwidth, 8 byte and 2,000,000 byte long messages are used.
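
The ring arrangement described above can be sketched as follows. This is a minimal Python illustration, not code from HPCC: the helper names `natural_ring`, `random_ring`, `ring_neighbors`, and `geometric_mean`, and the use of Python's `random` module to choose an ordering, are assumptions made for this example.

```python
import math
import random

def natural_ring(num_procs):
    """Naturally ordered ring: processes appear in order of their rank."""
    return list(range(num_procs))

def random_ring(num_procs, seed):
    """Randomly ordered ring: a random permutation of the process ranks."""
    ordering = list(range(num_procs))
    random.Random(seed).shuffle(ordering)
    return ordering

def ring_neighbors(ordering, rank):
    """Return the (left, right) neighbors of `rank` in the given ring ordering."""
    pos = ordering.index(rank)
    n = len(ordering)
    return ordering[(pos - 1) % n], ordering[(pos + 1) % n]

def geometric_mean(values):
    """Geometric mean, used here to combine results from several random rings."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For example, `ring_neighbors(natural_ring(4), 0)` returns `(3, 1)`: the ring wraps around, so rank 0 exchanges messages with the last rank on one side and rank 1 on the other.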
With this type of parallel communication, the bandwidth per process is defined as the total amount of message data divided by the number of processes and the maximal time needed in all processes. This part of the benchmark is based on patterns studied in the effective bandwidth communication benchmark b_eff [Rab01, Kon1].
How does the parallel ring bandwidth relate to vendors' values?
Vendors often report a duplex network bandwidth (per CPU or accumulated) by counting each message twice, i.e., as both incoming and outgoing data. They are therefore reporting parallel bandwidth values that are twice the values reported by this ring benchmark, because in this hpcc benchmark each transferred message is counted only once.
How do I change data size (matrix and vector dimensions) for the tests?
Only HPL and PTRANS matrix sizes can be changed directly in the hpccinf.txt or hpccmemf.txt input files. The remaining tests use the size of the largest HPL matrix to adjust the size of their input data. For example, in a sequential run, if the size of the HPL matrix is 1 GiB then each of the three vectors used by STREAM Triad will be 0.333 GiB, the PTRANS matrix will be 0.5 GiB, the FFT vector size will be 125 MiB, and each of the three matrix sizes in DGEMM will be 333 MiB.
What algorithm is used for transposing the matrix in PTRANS?
The detailed description of the matrix transposition algorithm used by PTRANS is available as LAPACK Working Note No. 65.
To summarize what the above papers say: the dimensions Px and Py of the virtual process grid for PTRANS have to have small GCD (Greatest Common Divisor) and small LCM (Least Common Multiple) to achieve good performance. The number of steps to do the transpose is LCM(Px,Py)/GCD(Px,Py), and the number of communicating pairs is GCD(Px,Py).
How do I run individual tests in HPCC?
HPCC has not been designed for running individual tests. Quite the opposite: it is a harness that ties multiple tests together. Having said that, it is possible to comment out calls to individual tests in src/hpcc.c.
How is the HPCC output file translated into the information posted on the HPCC website?
The key=value pairs in the hpccoutf.txt file generated from a run of HPCC are mapped to the following columns in the results tables featured on the HPCC website:
HPL_Tflops --- G-HPL
PTRANS_GBs --- G-PTRANS
MPIRandomAccess_GUPs --- G-RandomAccess
MPIFFT_Gflops --- G-FFT
StarSTREAM_Triad * CommWorldProcs --- EP-STREAM Sys
StarSTREAM_Triad --- EP-STREAM Triad
StarDGEMM_Gflops --- EP-DGEMM
RandomlyOrderedRingBandwidth_GBytes --- RandomRing Bandwidth
RandomlyOrderedRingLatency_usec --- RandomRing Latency
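
The mapping above can be illustrated with a short Python sketch. It assumes only that the output contains lines of the form key=value. The helper names `parse_summary` and `website_columns` are hypothetical and are not part of HPCC, and any values passed in are the caller's own.

```python
# Output keys and the website columns they map to (taken from the list above).
COLUMN_MAP = {
    "HPL_Tflops": "G-HPL",
    "PTRANS_GBs": "G-PTRANS",
    "MPIRandomAccess_GUPs": "G-RandomAccess",
    "MPIFFT_Gflops": "G-FFT",
    "StarSTREAM_Triad": "EP-STREAM Triad",
    "StarDGEMM_Gflops": "EP-DGEMM",
    "RandomlyOrderedRingBandwidth_GBytes": "RandomRing Bandwidth",
    "RandomlyOrderedRingLatency_usec": "RandomRing Latency",
}

def parse_summary(text):
    """Collect key=value pairs whose value parses as a number."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        try:
            pairs[key.strip()] = float(value)
        except ValueError:
            pass  # skip non-numeric values
    return pairs

def website_columns(pairs):
    """Translate parsed pairs into the website column names."""
    columns = {COLUMN_MAP[k]: v for k, v in pairs.items() if k in COLUMN_MAP}
    # EP-STREAM Sys is derived: per-process Triad times the number of processes.
    if "StarSTREAM_Triad" in pairs and "CommWorldProcs" in pairs:
        columns["EP-STREAM Sys"] = pairs["StarSTREAM_Triad"] * pairs["CommWorldProcs"]
    return columns
```

Only EP-STREAM Sys needs a computation; every other column is a direct copy of one output value.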
For a single system, a user must provide a “base run” and may provide an optimized run.
What are the rules for making a base run?
See the Overview section, which contains the rules.
Can I run each benchmark in turn and concatenate the outputs?
No. To have the results accepted into the HPCchallenge you must make a base run of all the benchmark components in one run.
What am I allowed to modify?
For an optimized run the benchmark “harness” must be run. A user is allowed to replace the following routines only:
HPL_pdtrsv()(factorization and substitution functions)
No. The Strassen algorithm is not allowed as it changes the operation count for the factorization algorithm.
Can I change the output from the run?
No modification to the output is allowed.
What accuracy is acceptable?
The software is self checking so that if something is incorrect with the numerical results the output should signal the problem.
Can I use my own libraries that implement portions of HPCC?
Yes, provided you do not violate the rules. For base runs, the code cannot be changed but you can use optimized libraries to speed up some sections of the code. These libraries have to be generally available on your system for others to use. For optimized runs, you can replace some portions of the benchmark with your proprietary code.
First name of the person that submits the results. The name is kept private; it is only stored internally. It is only used in correspondence with the submitter (if at all) and we also use it for the award announcements.
Why do I need to submit my last name?
Last name of the person that submits the results. The name is kept private; it is only stored internally. It is only used in correspondence with the submitter (if at all) and we also use it for the award announcements.
Why do I need to submit my email?
Email address is used for the submission confirmation. It is not possible to submit an entry without a valid e-mail address. This email is not used for any other purposes and is not listed in the publicly accessible documents.
Why do I need to submit my machine location?
The location is used to report the country (and state) where the benchmarked machine is located.
Why do I need to submit my machine city?
The city where the benchmarked machine is located. This field is optional but very useful for differentiating locations within a single country.
Why do I need to submit my institution or affiliation?
For each result submission, we collect and report either the institution that owns the machine or the institution with which the machine is affiliated. This helps in determining the ownership of the machine and may serve as an outreach opportunity. Example entry: University of Tennessee.
Why do I need to submit my institution or affiliation URL?
The URL is meant to give the opportunity to obtain more information about the institution or affiliation of the site where the computer is located.
What should I input in the field named "Manufacturer/Integrator"?
This field should contain the name of the company (or institution) that assembled the machine.
What should I input in the field named "System name"?
This field is for the name that describes the machine as a whole. Most commonly this would be the model name and number. Example: X1, SP2, Origin 2000, SuperDome.
What should I input in the field named
The "Theoretical peak" field should contain the computational rate (even if only theoretical) of all the processors/cores used by the benchmark expressed in Tflop/s (trillion floating-point operations per second or 10^12 flop/s). Typically, it is a product of the number of floating-point operations per cycle, clock frequency, and the number of processors/cores. For multi-core chips, a common practice is to refer to a single core as a processor, so the theoretical peak should be multiplied by the number of cores. The table below gives the number of floating-point operations per cycle for common processors/cores. Please keep in mind that recent processors from AMD and Intel utilize frequency scaling. The nominal frequency from the specification might be lower than the maximum frequency that the processor might be able to use under some circumstances. For the theoretical peak, the maximum frequency should be used.
|Processor/core|Floating-point operations per cycle|
|---|---|
|AMD Bulldozer 42xx, 62xx|4|
|AMD Piledriver 63xx|8|
|Cray X1, X1E, X2|16|
|Fujitsu SPARC64 V, V+, VI|2|
|G5 (IBM PowerPC 970)|4|
|IBM PowerPC 440|4|
|Intel Itanium 2/III/4|4|
|Intel Xeon Nehalem, Westmere|4|
|Intel Xeon Sandy Bridge|8|
|Intel Xeon Ivy Bridge|8|
|Intel Xeon Haswell|?|
|Intel Xeon Broadwell|?|
|Sun UltraSPARC I-V, T1 (Niagara), Rock|2|
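
The product described above can be written out as a short Python sketch. The function name `theoretical_peak_tflops` is an assumption made for this example. The per-cycle count would be taken from the table, and the clock rate should be the maximum frequency.

```python
def theoretical_peak_tflops(flops_per_cycle, clock_ghz, num_cores):
    """Theoretical peak in Tflop/s.

    flops_per_cycle: floating-point operations per cycle for one core
    clock_ghz:       maximum clock frequency in GHz (1e9 cycles per second)
    num_cores:       number of processors/cores used by the benchmark
    """
    gflops = flops_per_cycle * clock_ghz * num_cores  # Gflop/s for all cores
    return gflops / 1000.0                            # 1 Tflop/s = 1000 Gflop/s
```

With the Itanium 2 figures quoted earlier in this FAQ (4 operations per cycle at 1.5 GHz), a single processor gives 0.006 Tflop/s, i.e. 6 GFlop/s.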
Describes related benchmarks.
What is the relation of this benchmark to the Linpack benchmark?
The Linpack benchmark called the Highly Parallel Computing Benchmark can be found in Table 3 of the Linpack Benchmark Report (PDF). This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. HPL is the benchmark used for the TOP500 report.
What does the HPC Challenge Benchmark have to do with the Top500?
The HPC Challenge Benchmark is an attempt to broaden the scope of benchmarking high performance systems. The Top500 uses one metric, the Linpack Benchmark (HPL), to rank the 500 fastest computer systems in use today. The HPC Challenge does not produce a ranking for systems, but provides a set of metrics for evaluations and comparisons.
What is the Top500?
The Top500 lists the 500 fastest computer systems being used today. The collection was started in June 1993 and has been updated twice a year since then. The report lists the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as the performance measure in ranking the computers.
How can I get my computer listed on the Top500?
To be listed on the Top500 list you have to run the software that can be found at http://www.netlib.org/benchmark/hpl/ and the performance of the benchmark run must be within the range of the 500 fastest computers for that period of time.
Can I use HPCC for Top500 submission?
Yes.
Where can I get a copy of the Top500 report?
The Top500 reports are maintained at http://www.top500.org/.
Where can I get the software to generate performance results for the Top500?
There is software available that has been optimized and that many people use to generate the Top500 performance results. This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. A copy of that software can be downloaded from: http://www.netlib.org/benchmark/hpl/. In order to run this you will need MPI and an optimized version of the BLAS. For MPI you can see: http://www-unix.mcs.anl.gov/mpi/ and for the BLAS see: http://www.netlib.org/atlas/ .
What is the relation of this benchmark to the STREAM benchmark?
The version of STREAM in HPCC has been modified so that the source and destination arrays are allocated from the heap with a dynamic size instead of having static storage with a compile-time constant size. This allows the size of the arrays to be scaled appropriately according to the memory size (derived from the HPL parameters). From the compiler standpoint, it removes information about pointer aliasing, alignment, and data size, all of which might be crucial for efficient code generation. An optimized run of HPCC is expected to deal with these issues.
What is the relation of this benchmark to the Effective Communication Bandwidth (b_eff) benchmark?
In this benchmark, latency and bandwidth are measured mainly with three communication patterns (ping-pong, random ring, natural ring) and two message sizes (8 byte for latency and 2,000,000 bytes for bandwidth measurements) and these different results are reported independently. The buffer memory is always reused in a loop of measurements. The goal of b_eff is to compute an average bandwidth value that represents several ring patterns (sequentially and randomly ordered) and 21 different message sizes. Memory reuse is prohibited.
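
To make the two message sizes concrete, here is a minimal Python sketch of how a latency figure and a bandwidth figure can be derived from timed ping-pong loops. It follows the description of the ping-pong benchmark earlier in this FAQ (total time divided by twice the loop length). The function names are assumptions made for this example and the code performs no communication itself; it only converts a measured time into the reported units.

```python
def one_way_time(total_seconds, loop_length):
    """Time for one message: each loop iteration carries a ping and a pong."""
    return total_seconds / (2 * loop_length)

def latency_usec(total_seconds, loop_length=8):
    """Latency in microseconds from a timed loop of small (8 byte) messages."""
    return one_way_time(total_seconds, loop_length) * 1e6

def bandwidth_gbytes(total_seconds, message_bytes=2_000_000, loop_length=1):
    """Bandwidth in GB/s: message size divided by the time for one message."""
    return message_bytes / one_way_time(total_seconds, loop_length) / 1e9
```

For instance, a loop of 8 ping-pongs that takes 80 microseconds in total corresponds to a latency of 5 usec, and a single 2,000,000 byte ping-pong that takes 4 milliseconds corresponds to 1 GB/s.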