HPC Challenge Benchmark Record

System Information
Affiliation:   Argonne Leadership Computing Facility/Argonne National Laboratory   URL:   https://www.alcf.anl.gov
Location:   USA, Illinois, Argonne   System Use:   Research
System Manufacturer:   IBM   System Name:   Blue Gene/Q (MIRA)
Interconnect Manufacturer:   IBM   Interconnect Type:   BGQ 5D TORUS
Operating System:   Linux CNK   MPI:   MPICH2 version 1.5
MPI Wtick:   0.000000000625   BLAS:   ESSL 5.1
Language:   C   Compiler:   IBM XL C/C++ for Blue Gene, V12.1
Compiler Flags:   -O5 -qstrict -qarch=qp -qtune=qp -qipa -qsimd=auto -qthreaded -qsmp=omp -qmaxmem=-1   Processor Type:   IBM PowerPC A2
Processor Speed:   1.6 GHz   Total Processors:   49152
Processors Entered:   49152   Processors determined:   3145728
Cores per chip:   16   HPL Processes:   196608
MPI Processes:   196608   Threads Entered:   16
Threads determined:   16   FLOPs per cycle:  
Theoretical peak:   10066 TFlop/s   Total memory:   786432 GiB
FFT library:   ESSL 5.1  
Explain Optimizations:

GUPS: Special low-overhead runtime exposing the interconnect atomic RDMA support.
EP-STREAM: use of QPX vector intrinsics.
G-FFT: 2D decomposition based on FFTE (IOPT=-1/+1 data distribution), INPUT(NX,NY/NPUY,NZ/NPUZ) -> OUTPUT(NX,NY/NPUY,NZ/NPUZ). Custom low-overhead communication runtime; optimized alltoall implementation using one-sided RDMA and fine-grained transfers; custom rank placement to maximize interconnect throughput in the three alltoall communication phases; intra-node FFT using the IBM ESSL 5.1 implementation.
HPL: Moderately tuned base implementation. The rank placement chosen to optimize FFT reduced broadcast performance in HPL. See the note on MPI below.
Note: MPI performance is reduced because interconnect resources are allocated to the custom one-sided transfer runtime. A special memory allocator is used across all benchmarks to avoid fragmentation.
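
The system fields above can be cross-checked with a little arithmetic. The sketch below recomputes the process layout and the theoretical peak from the recorded values. It rests on two assumptions: that "Total Processors" counts 16-core chips (one per node), and that each core retires 8 FLOPs per cycle. The second value is not in the record (the "FLOPs per cycle" field is blank); it is used here only because it corresponds to a 4-wide fused multiply-add unit and reproduces the recorded peak.

```python
# Cross-check of the System Information fields (values copied from the record).
CHIPS = 49152               # "Total Processors", read here as 16-core chips
CORES_PER_CHIP = 16
CLOCK_HZ = 1.6e9            # "Processor Speed: 1.6 GHz"
MPI_PROCESSES = 196608
THREADS_PER_PROCESS = 16    # "Threads determined"
FLOPS_PER_CYCLE = 8         # ASSUMPTION: the record leaves this field blank


def theoretical_peak_tflops(chips, cores_per_chip, clock_hz, flops_per_cycle):
    """Peak rate in TFlop/s: cores x clock x FLOPs per cycle per core."""
    return chips * cores_per_chip * clock_hz * flops_per_cycle / 1e12


ranks_per_chip = MPI_PROCESSES // CHIPS
total_threads = MPI_PROCESSES * THREADS_PER_PROCESS
peak_tflops = theoretical_peak_tflops(
    CHIPS, CORES_PER_CHIP, CLOCK_HZ, FLOPS_PER_CYCLE
)

print(f"MPI ranks per chip: {ranks_per_chip}")
print(f"Total threads:      {total_threads}")
print(f"Theoretical peak:   {peak_tflops:.1f} TFlop/s")
```

Under these assumptions the thread count equals the "Processors determined" field (3145728), and the peak agrees with the recorded 10066 TFlop/s.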

HPL
HPL:   5709.28 Tflop/s   HPL time:   58951.7 seconds
HPL eps:   1.11022e-16   HPL Rnorm1:   0.00000912409
HPL Anorm1:   1992920   HPL AnormI:   1993150
HPL Xnorm1:   47008600   HPL XnormI:   38.8314
HPL N:   7962624   HPL NB:   384
HPL NProw:   256   HPL NPcol:   768
HPL depth:   0   HPL NBdiv:   2
HPL NBmin:   8   HPL CPfact:   C
HPL CRfact:   R   HPL CPtop:   5
HPL order:   R
HPL dMach EPS:   1.110223e-16   HPL sMach EPS:   0.00000005960464
HPL dMach sfMin:   2.225074e-308   HPL sMach sfMin:   1.175494e-38
HPL dMach Base:   2   HPL sMach Base:   2
HPL dMach Prec:   2.220446e-16   HPL sMach Prec:   0.0000001192093
HPL dMach mLen:   53   HPL sMach mLen:   24
HPL dMach Rnd:   1   HPL sMach Rnd:   1
HPL dMach eMin:   -1021   HPL sMach eMin:   -125
HPL dMach rMin:   2.225074e-308   HPL sMach rMin:   1.175494e-38
HPL dMach eMax:   1024   HPL sMach eMax:   128
HPL dMach rMax:   1.797693e308   HPL sMach rMax:   3.402823e38
dweps:   1.110223e-16   sweps:   0.00000005960464
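
The HPL figures above are mutually consistent, which the following sketch illustrates. It uses the standard HPL operation count of 2/3 N^3 + 3/2 N^2, treats the recorded HPL time as seconds, and compares the result with the theoretical peak and total memory listed under System Information. The variable names are illustrative and not part of the benchmark output.

```python
# Consistency check of the HPL section (values copied from the record).
N = 7962624                  # "HPL N"
HPL_TIME_S = 58951.7         # "HPL time", taken to be seconds
NPROW, NPCOL = 256, 768      # process grid
PEAK_TFLOPS = 10066.0        # "Theoretical peak"
TOTAL_MEMORY_GIB = 786432    # "Total memory"


def hpl_flops(n):
    """Operation count HPL uses for an LU solve of order n."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


rate_tflops = hpl_flops(N) / HPL_TIME_S / 1e12
efficiency = rate_tflops / PEAK_TFLOPS
grid_size = NPROW * NPCOL
matrix_gib = 8 * N * N / 2**30     # 8-byte doubles, N x N matrix

print(f"HPL rate:         {rate_tflops:.1f} TFlop/s")
print(f"Fraction of peak: {efficiency:.1%}")
print(f"Process grid:     {grid_size} ranks")
print(f"Matrix size:      {matrix_gib:.0f} GiB "
      f"({matrix_gib / TOTAL_MEMORY_GIB:.1%} of total memory)")
```

The recomputed rate agrees with the recorded 5709.28 Tflop/s to within rounding, and the process grid accounts for all 196608 HPL processes.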

PTRANS
PTRANS:   12751.8 GB/s   PTRANS time:   9.15461 seconds
PTRANS residual:   0   PTRANS N:   3981312
PTRANS NB:   384   PTRANS NProw:   256
PTRANS NPcol:   768

STREAM
S-STREAM Copy:   27.7467 GB/s   S-STREAM Scale:   27.7209 GB/s
S-STREAM Add:   28.2254 GB/s   S-STREAM Triad:   28.2654 GB/s
EP-STREAM Copy:   6.93412 GB/s   EP-STREAM Scale:   6.90049 GB/s
EP-STREAM Add:   7.15537 GB/s   EP-STREAM Triad:   7.25217 GB/s
STREAM Vector Size:   107495424   STREAM Threads:   16
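
EP-STREAM values are per-process averages, so they need to be scaled before they can be compared with the single-process numbers. The sketch below does this for Triad, assuming the 196608 MPI processes are spread evenly over the 49152 chips listed under System Information (4 per chip); that even placement is an assumption, not something the record states.

```python
# Scaling the per-process EP-STREAM Triad figure (values copied from the record).
EP_TRIAD_GBS = 7.25217       # "EP-STREAM Triad", per MPI process
S_TRIAD_GBS = 28.2654        # "S-STREAM Triad", one process running alone
MPI_PROCESSES = 196608
CHIPS = 49152                # "Total Processors"

ranks_per_chip = MPI_PROCESSES // CHIPS           # ASSUMPTION: even placement
per_chip_gbs = EP_TRIAD_GBS * ranks_per_chip      # bandwidth of one chip
aggregate_pbs = EP_TRIAD_GBS * MPI_PROCESSES / 1e6  # whole system, in PB/s

print(f"Triad per chip, all ranks busy: {per_chip_gbs:.2f} GB/s")
print(f"Triad, single process:          {S_TRIAD_GBS:.2f} GB/s")
print(f"Aggregate Triad:                {aggregate_pbs:.3f} PB/s")
```

Under that assumption, four concurrent processes together reach about the same bandwidth as one process running alone, which suggests that a single 16-thread process already comes close to saturating the chip's memory bandwidth.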

RandomAccess
S-RandomAccess:   0.00745259 Gup/s   EP-RandomAccess:   0.00731611 Gup/s
G-RandomAccess:   417.788 Gup/s   G-RandomAccess N:   35184372088832
G-RandomAccess time:   336.864 seconds   G-RandomAccess Check Time:   1572.59 seconds
G-RandomAccess Errors:   11   G-RandomAccess Errors Fraction:   0.000000000000312639
G-RandomAccess TimeBound:   -1   G-RandomAccess ExeUpdates:   140737488355328
RandomAccess N:   268435456
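
The G-RandomAccess fields above are related by simple ratios, which the following sketch recomputes: the number of executed updates is four times the table size, the rate is updates divided by time, and the error fraction is errors divided by the table size. The 1% pass threshold in the last line is the tolerance as I understand the HPCC rules to define it and should be treated as an assumption.

```python
# Relations between the G-RandomAccess fields (values copied from the record).
TABLE_SIZE = 35184372088832        # "G-RandomAccess N" (2**45 entries)
EXE_UPDATES = 140737488355328      # "G-RandomAccess ExeUpdates"
TIME_S = 336.864                   # "G-RandomAccess time"
ERRORS = 11                        # "G-RandomAccess Errors"
MAX_ERROR_FRACTION = 0.01          # ASSUMPTION: HPCC pass threshold of 1%

expected_updates = 4 * TABLE_SIZE
gups = EXE_UPDATES / TIME_S / 1e9
error_fraction = ERRORS / TABLE_SIZE
passed = error_fraction <= MAX_ERROR_FRACTION

print(f"Updates match 4 x N: {expected_updates == EXE_UPDATES}")
print(f"Rate:                {gups:.3f} Gup/s")
print(f"Error fraction:      {error_fraction:.6e}")
print(f"Within tolerance:    {passed}")
```

The recomputed rate and error fraction agree with the recorded 417.788 Gup/s and 0.000000000000312639 to within rounding of the reported time.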

FFT
S-FFT:   2.87548 GFlop/s   EP-FFT:   2.57356 GFlop/s
MPIFFT:   226101 GFlop/s   MPIFFT N:   7421703487488
MPIFFT Max Error:   0.00000000000000266916   MPIFFT time0:   0 seconds
MPIFFT time1:   0 seconds   MPIFFT time2:   0 seconds
MPIFFT time3:   0 seconds   MPIFFT time4:   0 seconds
MPIFFT time5:   0 seconds   MPIFFT time6:   0 seconds
FFTEnblk:   16   FFTEnp:   8
FFTEl2size:   1048576

DGEMM
S-DGEMM:   35.576 GFlop/s   EP-DGEMM:   29.2575 GFlop/s
DGEMM N:   10367

RandomRing Latency/Bandwidth
RandomRing Latency:   4.36215 usec   RandomRing Bandwidth:   0.0725169 GB/s

NaturalRing Latency/Bandwidth
NaturalRing Latency:   4.37494 usec   NaturalRing Bandwidth:   1.02355 GB/s

PingPong Latency/Bandwidth
Maximum PingPong Latency:   5.01555 usec   Maximum PingPong Bandwidth:   1.76433 GB/s
Minimum PingPong Latency:   3.00984 usec   Minimum PingPong Bandwidth:   1.50529 GB/s
Average PingPong Latency:   3.99506 usec   Average PingPong Bandwidth:   1.6365 GB/s

Size of Data Types
char:   1 byte   short:   2 bytes
int:   4 bytes   long:   8 bytes
void ptr:   8 bytes   float:   4 bytes
double:   8 bytes   size_t:   8 bytes
s64Int:   8 bytes   u64Int:   8 bytes

OpenMP
M OpenMP:   201107   OpenMP Num Threads:   16
OpenMP Num Procs:   16   OpenMP Max Threads:   16

Memory
MemProc:   -1   MemSpec:   -1
MemVal:   -1

CPS
CPS_HPCC_FFT_235:   0   CPS_HPCC_FFTW_ESTIMATE:   0
CPS_HPCC_MEMALLCTR:   1   CPS_HPL_USE_GETPROCESSTIMES:   0
CPS_RA_SANDIA_NOPT:   0   CPS_RA_SANDIA_OPT2:   0


Version: 1.4.2.f - Run Type: opt - Parent ID: 540
Created: 2014-10-26 - Exported: Wed Jun 28 03:15:54 2017