HPL FAQ
General questions
What is HPL?
HPL is a portable implementation of the Linpack benchmark.
What is the license for the HPL code?
All components of the code are distributed under the original
BSD license.
Where do I send my suggestions or questions about HPCC?
Please send your questions to dongarra @ eecs.utk.edu.
What does HPL measure?
HPL measures the floating point execution rate for solving a system of linear equations.
What software in addition to the benchmark program is needed?
What is Gflop/s?
Gflop/s is a rate of execution: one billion (ten to the 9th power) floating point operations per second.
Whenever this term is used it refers to 64-bit floating
point operations and the operations are either addition
or multiplication (a “fused” multiply/add is counted as two
floating point operations).
What is the theoretical peak performance?
The theoretical peak is based not on an actual
performance from a benchmark run, but on a paper computation to
determine the theoretical peak rate of execution of floating point
operations for the machine. This is the number manufacturers often
cite; it represents an upper bound on performance. That is, the
manufacturer guarantees that programs will not exceed this rate, sort of
a "speed of light" for a given computer. The theoretical peak
performance is determined by counting the number of floating-point
additions and multiplications (in full precision) that can be completed
during a period of time, usually the cycle time of the machine. For
example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point
operations per cycle, for a theoretical peak performance of 6 Gflop/s.
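The paper computation described above can be sketched in a few lines of Python. This is only an illustration and is not part of HPL; the function name `theoretical_peak_gflops` and the `processors` parameter are hypothetical, introduced here to show how the per-processor figure would scale.

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, processors=1):
    """Return the theoretical peak rate in Gflop/s.

    clock_ghz       -- clock rate in GHz (billions of cycles per second)
    flops_per_cycle -- 64-bit additions/multiplications completed per cycle
                       (a fused multiply/add counts as two)
    processors      -- number of processors (hypothetical parameter)
    """
    return clock_ghz * flops_per_cycle * processors


# The Itanium 2 example from the text: 1.5 GHz x 4 operations per cycle.
print(theoretical_peak_gflops(1.5, 4))  # 6.0
```

A measured HPL result divided by this number gives the fraction of peak actually achieved.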
Why are my performance results below the theoretical peak?
The performance of a computer is a complicated issue, a
function of many interrelated quantities. These quantities include the
application, the algorithm, the size of the problem, the high-level
language, the implementation, the human level of effort used to
optimize the program, the compiler's ability to optimize, the age of
the compiler, the operating system, the architecture of the computer,
and the hardware characteristics. The results presented for this
benchmark suite should not be extolled as measures of total system
performance (unless enough analysis has been performed to indicate a
reliable correlation of the benchmarks to the workload of interest)
but, rather, as reference points for further evaluations.
Why are the performance results for my computer different from those of another machine with the same characteristics?
There are many reasons why your results may vary from
results recorded on other machines. Issues such as load on the
system, accuracy of the clock, compiler options, version of the
compiler, size of cache, bandwidth from memory, amount of memory, etc.
can affect the performance even when the processors are the same.
What about the one processor case?
HPL has been designed to perform well for large problem
sizes on hundreds of nodes and more. The software works on one node and
for large problem sizes, one can usually achieve pretty good
performance on a single processor as well. For small problem sizes,
however, the overhead due to message-passing, local indexing and so on
can be significant.
Can HPL be outperformed?
Certainly. There is always room for performance
improvements (unless you've reached the theoretical peak of your
machine). Specific knowledge about a particular system is always a
source of performance gains. Even from a generic point of view, better
algorithms or more efficient formulations of the classic ones are
potential winners.
How do I tune HPL?
What is the relation of this benchmark to the Linpack benchmark?
The Linpack benchmark called the Highly Parallel Computing Benchmark can be found in Table 3 of the Linpack Benchmark Report (PDF). This benchmark attempts to measure the best performance of a machine in
solving a system of equations. The problem size and software can be
chosen to produce the best performance. HPL is the benchmark often used for the TOP500 report.
Input file
What problem size (matrix dimension N) should I use?
In order to find out the best performance of your system,
the largest problem size fitting in memory is what you should aim for.
The amount of memory used by HPL is essentially the size of the
coefficient matrix. So for example, if you have 4 nodes with 256 MB of
memory on each, this corresponds to 1 GB total, i.e., 128 M double
precision (8 bytes) elements. The square root of that number is 11585.
One definitely needs to leave some memory for the OS as well as for
other things, so a problem size of 10000 is likely to fit. As a rule of
thumb, 80 % of the total amount of memory is a good guess. If the
problem size you pick is too large, swapping will occur, and the
performance will drop. If multiple processes are spawned on each node
(say you have 2 processors per node), what counts is the available
amount of memory to each process.
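The arithmetic above can be reproduced with a short Python sketch. It is not part of HPL; the function name `max_problem_size` and its parameters are hypothetical, and it assumes 1 MB means 2^20 bytes, which is what makes the square root come out to 11585.

```python
import math

BYTES_PER_DOUBLE = 8  # one double precision matrix element


def max_problem_size(nodes, bytes_per_node, fraction=1.0):
    """Largest N such that an N x N matrix of doubles fits in the
    given fraction of the total memory across all nodes."""
    usable_bytes = nodes * bytes_per_node * fraction
    elements = int(usable_bytes / BYTES_PER_DOUBLE)
    return math.isqrt(elements)


MB = 1024 * 1024  # assumption: binary megabytes

print(max_problem_size(4, 256 * MB))       # 11585 (all memory, as in the text)
print(max_problem_size(4, 256 * MB, 0.8))  # 10362 (80 % rule of thumb)
```

The second value is only an upper estimate; a somewhat smaller N, such as the 10000 suggested above, leaves more room for the OS. When several processes share a node, the same computation applies with the memory available to each process.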
What block size NB should I use?
HPL uses the block size NB for the data distribution as
well as for the computational granularity. From a data distribution
point of view, the smaller NB, the better the load balance. You
definitely want to stay away from very large values of NB. From a
computation point of view, too small a value of NB may limit the
computational performance by a large factor because almost no data
reuse will occur in the highest level of the memory hierarchy. The
number of messages will also increase. Efficient matrix-multiply
routines are often internally blocked. Small multiples of this blocking
factor are likely to be good block sizes for HPL. The bottom line is
that "good" block sizes are almost always in the [32 .. 256] interval.
The best values depend on the computation / communication performance
ratio of your system. To a much lesser extent, the problem size matters
as well. Say, for example, you empirically found that 44 was a good
block size with respect to performance. 88 or 132 are likely to give
slightly better results for large problem sizes because of a slightly
higher flop rate.
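One way to build a list of block sizes worth timing is to take small multiples of a known good blocking factor and keep those inside the [32 .. 256] interval. The Python sketch below does only that; the function name `candidate_block_sizes` and the `max_multiple` cutoff are hypothetical, and the resulting values still have to be measured on the actual system.

```python
def candidate_block_sizes(base, low=32, high=256, max_multiple=4):
    """Small multiples of `base` that fall inside [low, high]."""
    return [k * base
            for k in range(1, max_multiple + 1)
            if low <= k * base <= high]


# Starting from the empirically good block size of 44 used in the text.
print(candidate_block_sizes(44))  # [44, 88, 132, 176]
```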
What process grid (P x Q) should I use?
This depends on the physical interconnection network you
have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in
[1..3]. In other words, P and Q should be approximately equal, with Q
slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4
x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network,
there is only one wire through which all the messages are exchanged. On
such a network, the performance and scalability of HPL are strongly
limited and very flat process grids are likely to be the best choices:
1 x 4, 1 x 8, 2 x 4 ...
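The candidate grids for a given number of processes can be enumerated with a short Python sketch. It is not part of HPL; the function name `process_grids` and the `max_ratio` parameter are hypothetical. It lists every factorization P x Q with P no larger than Q and keeps those whose Q/P ratio does not exceed `max_ratio`.

```python
import math


def process_grids(nprocs, max_ratio=3):
    """Return (P, Q) pairs with P * Q == nprocs, P <= Q and Q <= max_ratio * P."""
    grids = []
    for p in range(1, math.isqrt(nprocs) + 1):
        if nprocs % p == 0:
            q = nprocs // p
            if q <= max_ratio * p:
                grids.append((p, q))
    return grids


# Mesh or switch: keep the ratio within 1:3.
print(process_grids(24))              # [(3, 8), (4, 6)]

# Simple Ethernet: allow very flat grids by raising the ratio limit.
print(process_grids(8, max_ratio=8))  # [(1, 8), (2, 4)]
```

Raising `max_ratio` to the process count admits the 1 x Q grids suggested above for a simple Ethernet network.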