HPL FAQ

General questions

What is HPL?

HPL is a portable implementation of the Linpack benchmark.

What is the license for the HPL code?

All components of the code are distributed under the original BSD license.

Where do I send my suggestions or questions about HPL?

Please send your questions to dongarra @ eecs.utk.edu.

What does HPL measure?

HPL measures the floating point execution rate for solving a system of linear equations.

What software in addition to the benchmark program is needed?

The user must supply a machine-specific version of MPI and the BLAS.

What is Gflop/s?

Gflop/s is a rate of execution: billions (ten to the 9th power) of floating point operations per second. Whenever this term is used, it refers to 64-bit floating point operations, and the operations are either additions or multiplications (a "fused" multiply/add is counted as two floating point operations).
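As a rough illustration of how such a rate is obtained from a run, the sketch below divides an operation count by the wall-clock time. The operation count 2/3 N^3 + 3/2 N^2 for solving an N x N system is not stated in this FAQ and is an assumption here; the function name and the sample timing are hypothetical.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Return the execution rate in Gflop/s for an N x N solve.

    Assumption: the solve is credited with 2/3 N^3 + 3/2 N^2
    floating point operations (additions and multiplications).
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9  # 1 Gflop/s = 10^9 operations per second


# Hypothetical run: N = 10000 solved in 120 seconds.
print(f"{hpl_gflops(10000, 120.0):.2f} Gflop/s")
```

Halving the time for the same N doubles the reported rate, since the credited operation count depends only on N.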

What is the theoretical peak performance?

The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate; it is a sort of "speed of light" for a given computer. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle, for a theoretical peak performance of 6 Gflop/s.
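The paper computation above can be written out as a short sketch. The function name is hypothetical, and the cores parameter is an assumption added for multi-processor machines; the FAQ example covers a single processor only.

```python
def theoretical_peak_gflops(clock_ghz: float, flops_per_cycle: int, cores: int = 1) -> float:
    """Peak rate = clock rate x operations per cycle x number of processors.

    `cores` is an assumption for illustration; with cores=1 this is
    exactly the single-processor computation from the text.
    """
    return clock_ghz * flops_per_cycle * cores


# Intel Itanium 2 at 1.5 GHz, 4 floating point operations per cycle.
print(theoretical_peak_gflops(1.5, 4))  # 6.0
```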

Why are my performance results below the theoretical peak?

The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for this benchmark suite should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.

Why are the performance results for my computer different from those of another machine with the same characteristics?

There are many reasons why your results may vary from results recorded on other machines. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc. can affect the performance even when the processors are the same.

What about the one processor case?

HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software also works on one node, and for large problem sizes one can usually achieve good performance on a single processor as well. For small problem sizes, however, the overhead due to message-passing, local indexing and so on can be significant.

Can HPL be outperformed?

Certainly. There is always room for performance improvements (unless you've reached the theoretical peak of your machine). Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulations of the classic ones are potential winners.

How do I tune HPL?

You need to modify the input data file HPL.dat. This file should reside in the same directory as the executable hpl/bin//xhpl. An example HPL.dat file is provided by default but is not optimal for any practical system. This file contains information about the problem sizes, machine configuration, and algorithm features to be used by the executable.

What is the relation of this benchmark to the Linpack benchmark?

The Linpack benchmark called the Highly Parallel Computing Benchmark can be found in Table 3 of the Linpack Benchmark Report (PDF). This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. HPL is the benchmark often used for the TOP500 report.

Input file

What problem size (matrix dimension N) should I use?

To find the best performance of your system, aim for the largest problem size that fits in memory. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 MB of memory on each, this corresponds to 1 GB total, i.e., about 134 million (2^27) double precision (8 byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80% of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
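The arithmetic above can be sketched as follows. This is a minimal sketch, assuming that 256 MB means 256 * 2^20 bytes and that the 80% rule of thumb applies to the memory occupied by the matrix; the function name and parameters are hypothetical.

```python
import math


def suggest_problem_size(nodes: int, bytes_per_node: int, fraction: float = 0.8) -> int:
    """Largest N such that an N x N matrix of 8-byte doubles
    fits in `fraction` of the total memory across all nodes."""
    total_bytes = nodes * bytes_per_node
    elements = fraction * total_bytes / 8  # 8 bytes per double precision element
    return int(math.sqrt(elements))


mem_per_node = 256 * 2**20  # 256 MB per node, taken as 2^20 bytes per MB

# Using all of the memory reproduces the 11585 figure from the text.
print(suggest_problem_size(4, mem_per_node, fraction=1.0))

# Using 80% of the memory gives a slightly smaller, safer starting point.
print(suggest_problem_size(4, mem_per_node))
```

When several processes share a node, the same computation applies with the memory available to each process in place of the memory per node.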

What block size NB should I use?

HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.
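One way to turn this advice into a list of values to try is sketched below: it enumerates the multiples of a known blocking factor that fall in the [32 .. 256] interval. The function name is hypothetical, and the blocking factor of the matrix-multiply routine is assumed to be known (44 is the example value from the text).

```python
def candidate_block_sizes(blas_block: int, low: int = 32, high: int = 256) -> list[int]:
    """Multiples of the matrix-multiply blocking factor within [low, high]."""
    return [
        k * blas_block
        for k in range(1, high // blas_block + 1)
        if low <= k * blas_block <= high
    ]


print(candidate_block_sizes(44))  # [44, 88, 132, 176, 220]
```

Each candidate still has to be timed on the target system, since the best value depends on its computation / communication performance ratio.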

What process grid (P x Q) should I use?

This depends on the physical interconnection network you have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL are strongly limited, and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...
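For a given number of processes, the grids matching the 1:k guideline can be listed with a short sketch like the one below. The function name and the max_ratio parameter are hypothetical; raising max_ratio admits the flatter grids suggested for a simple Ethernet network.

```python
import math


def candidate_grids(nprocs: int, max_ratio: float = 3.0) -> list[tuple[int, int]]:
    """All (P, Q) with P * Q == nprocs, P <= Q and Q / P <= max_ratio."""
    grids = []
    for p in range(1, math.isqrt(nprocs) + 1):
        if nprocs % p == 0:
            q = nprocs // p
            if q / p <= max_ratio:
                grids.append((p, q))
    return grids


print(candidate_grids(24))               # [(3, 8), (4, 6)]
print(candidate_grids(8, max_ratio=8))   # [(1, 8), (2, 4)]
```

A prime process count such as 7 yields no grid under the default ratio, which is a hint that a different process count may suit the guideline better.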