HPC Challenge Benchmark Record

System Information
Affiliation:   Argonne National Lab - LCF   URL:   www.alcf.anl.gov
Location:   USA, Illinois, Argonne   System Use:   Government
System Manufacturer:   IBM   System Name:   Blue Gene/P
Interconnect Manufacturer:   IBM   Interconnect Type:   Torus
Operating System:   Blue Gene CNK   MPI:   MPICH 2
MPI Wtick:   0.000000001176471   BLAS:   ESSL 4.3
Language:   C   Compiler:   IBM XL C/C++ 9.00
Compiler Flags:   -DHPCC_MEMALLCTR -g -O3 -qhot -qsmp=omp -qmaxmem=-1 -DBGPOPT   Processor Type:   PowerPC 450
Processor Speed:   0.85 GHz   Total Processors:   40960
Processors Entered:   32768   Processors determined:   131072
Cores per chip:   4   HPL Processes:   32768
MPI Processes:   32768   Threads Entered:   4
Threads determined:   4   FLOPs per cycle:  
Theoretical peak:   557 TFlop/s   Total memory:   81920 GiB
FFT library:    
Explain Optimizations:

The RandomAccess algorithm is similar in principle to the algorithm
submitted for HPCC in 2005: it employs software routing and
aggregation on a 3D-torus topology, routing every update along the three dimensions in dimension-ordered fashion and ensuring that the application holds no more than 1024 in-flight updates at any point in time. The significant change for Blue Gene/P is as follows: since each MPI node has 4 cores at its service, the functionality is distributed across them, with core 0 generating the updates and routing them along the X dimension, core 1 receiving from X and routing along Y, core 2 receiving from Y and routing along Z, and core 3 receiving from Z and applying the updates to the local table. The code bypasses MPI and directly uses the lower-level DMA-based SPI communication layer (available as a standard software library of Blue Gene/P).

The IBM Blue Gene/P system supports direct use of the messaging DMA hardware in parallel with MPI for application messaging. To enable this direct-use mode, an initialization call that sets up the DMA fifos must be executed before MPI_Init is invoked. The optimized HPCC code therefore introduces a function, dma_init, which is called just before MPI_Init for this purpose. This mechanism was put in place to support special messaging situations and is used in a number of production codes, including QCD. It is also well documented in the Blue Gene redbook.

For MPIFFT we changed the parallel 1D FFT from the nine-step to the basic six-step algorithm, and then modified the parallelized FFT code under the HPCC_pzfft1d function to fit the Blue Gene system. We also modified all the functions in "fft235.c" to SIMDize the radix-2, -3, -4, -5, and -8 FFT routines, using intrinsic functions of the IBM XL C compiler to generate the appropriate double-FPU instructions.

HPL:   173.362 Tflop/s   HPL time:   57080.7 seconds
HPL eps:   1.11022e-16   HPL Rnorm1:   0.00000115715
HPL Anorm1:   615309   HPL AnormI:   615249
HPL Xnorm1:   4574500   HPL XnormI:   11.9858
HPL N:   2457601   HPL NB:   120
HPL NProw:   128   HPL NPcol:   256
HPL depth:   1   HPL NBdiv:   6
HPL NBmin:   6   HPL CPfact:   C
HPL CRfact:   R   HPL CPtop:   3
HPL order:   R
HPL dMach EPS:   1.110223e-16   HPL sMach EPS:   0.00000005960464
HPL dMach sfMin:   0   HPL sMach sfMin:   1.175494e-38
HPL dMach Base:   2   HPL sMach Base:   2
HPL dMach Prec:   2.220446e-16   HPL sMach Prec:   0.0000001192093
HPL dMach mLen:   53   HPL sMach mLen:   24
HPL dMach Rnd:   1   HPL sMach Rnd:   1
HPL dMach eMin:   -1021   HPL sMach eMin:   -125
HPL dMach rMin:   0   HPL sMach rMin:   1.175494e-38
HPL dMach eMax:   1024   HPL sMach eMax:   128
HPL dMach rMax:   1.797693e308   HPL sMach rMax:   3.402823e38
dweps:   1.110223e-16   sweps:   0.00000005960464

PTRANS:   625.204 GB/s   PTRANS time:   18.948 seconds
PTRANS residual:   0   PTRANS N:   1228800
PTRANS NB:   120   PTRANS NProw:   128
PTRANS NPcol:   256

S-STREAM Copy:   5.43815 GB/s   S-STREAM Scale:   3.62631 GB/s
S-STREAM Add:   3.97957 GB/s   S-STREAM Triad:   3.97997 GB/s
EP-STREAM Copy:   5.43754 GB/s   EP-STREAM Scale:   3.6263 GB/s
EP-STREAM Add:   3.9796 GB/s   EP-STREAM Triad:   3.97996 GB/s
STREAM Vector Size:   61440050   STREAM Threads:   4

S-RandomAccess:   0.0096932 Gup/s   EP-RandomAccess:   0.00969341 Gup/s
G-RandomAccess:   103.18 Gup/s   G-RandomAccess N:   4398046511104
G-RandomAccess time:   170.5 seconds   G-RandomAccess Check Time:   1009.14 seconds
G-RandomAccess Errors:   0   G-RandomAccess Errors Fraction:   0
G-RandomAccess TimeBound:   -1   G-RandomAccess ExeUpdates:   17592186044416
RandomAccess N:   134217728

S-FFT:   1.21389 GFlop/s   EP-FFT:   1.21354 GFlop/s
MPIFFT:   5079.59 GFlop/s   MPIFFT N:   549755813888
MPIFFT Max Error:   0.0000000000000024651   MPIFFT time0:   0.397244 seconds
MPIFFT time1:   4.26304 seconds   MPIFFT time2:   2.08924 seconds
MPIFFT time3:   5.30936 seconds   MPIFFT time4:   3.88742 seconds
MPIFFT time5:   4.96885 seconds   MPIFFT time6:   0.189394 seconds
FFTEnblk:   16   FFTEnp:   8
FFTEl2size:   1048576

S-DGEMM:   9.67524 GFlop/s   EP-DGEMM:   9.67646 GFlop/s
DGEMM N:   7837

RandomRing Latency/Bandwidth
RandomRing Latency:   6.23889 usec   RandomRing Bandwidth:   0.0219922 GB/s

NaturalRing Latency/Bandwidth
NaturalRing Latency:   4.85518 usec   NaturalRing Bandwidth:   0.743607 GB/s

PingPong Latency/Bandwidth
Maximum PingPong Latency:   6.61654 usec   Maximum PingPong Bandwidth:   0.385704 GB/s
Minimum PingPong Latency:   3.58265 usec   Minimum PingPong Bandwidth:   0.379582 GB/s
Average PingPong Latency:   5.06575 usec   Average PingPong Bandwidth:   0.385048 GB/s

Size of Data Types
char:   1 byte     short:   2 bytes
int:   4 bytes   long:   4 bytes
void ptr:   4 bytes   float:   4 bytes
double:   8 bytes   size_t:   4 bytes
s64Int:   8 bytes   u64Int:   8 bytes

M OpenMP:   200505   OpenMP Num Threads:   4
OpenMP Num Procs:   4   OpenMP Max Threads:   4

MemProc:   -1   MemSpec:   -1
MemVal:   -1


Version: 1.2.0.f - Run Type: opt - Parent ID: 317
Created: 2008-11-17 - Exported: Sun Jun 25 01:26:52 2017