Sheet1: Created Manufacturer System Processor Type Processor Speed
(GHz) Total Memory
(GiB) Interconnect Interconnect Manufacturer Affiliation URL Country City
System Use OS BLAS FFT Library Language Compiler Compiler Flags
Version Major
Version Minor
Version Micro Version Release Run Type Optimization MPI Processor Count Total from Form Derived Cores per Chip
HPL Processes Threads from Code
MPI Processes FLOPs per cycle Theoretical Peak
(TFlop/s)
HPL
(TFlop/s)
per System per Process
per Thread
PTRANS
(GB/s) Single STREAM Copy
(GB/s) Single STREAM Scale
(GB/s) Single STREAM Add
(GB/s) Single STREAM Triad
(GB/s) EP STREAM Copy
(GB/s) EP STREAM Scale
(GB/s) EP STREAM Add
(GB/s) EP STREAM Triad
(GB/s) Single RandomAccess
(GUP/s) EP RandomAccess
(GUP/s) Global RandomAccess
(GUP/s)
RandomRing Latency
(microsecond) Bandwidth
(GB/s) NaturalRing PingPong Latency
(microsecond) Maximum Minimum Average PingPong Bandwidth
(GB/s) Single DGEMM
(GFlop/s) EP DGEMM
(GFlop/s) Single FFT
(GFlop/s) EP FFT
(GFlop/s) Global FFT
(GFlop/s) sizeof char sizeof short
sizeof int sizeof long sizeof void_ptr sizeof float
sizeof double
sizeof size_t
sizeof s64Int
sizeof u64Int MPI Wtick HPL time HPL EPS
HPL RnormI
HPL Anorm1
HPL AnormI
HPL Xnorm1
HPL XnormI HPL N HPL NB HPL nprow HPL npcol HPL depth HPL nbdiv HPL nbmin
HPL cpfact
HPL crfact HPL ctop HPL order
HPL dMach EPS HPL dMach sfMin HPL dMach Base HPL dMach Prec HPL dMach mLen
HPL dMach Rnd HPL dMach eMin HPL dMach rMin HPL dMach eMax HPL dMach rMax
HPL sMach EPS HPL sMach sfMin HPL sMach Base HPL sMach Prec HPL sMach mLen
HPL sMach Rnd HPL sMach eMin HPL sMach rMin HPL sMach eMax HPL sMach rMax dweps sweps DGEMM N PTRANS time PTRANS residual PTRANS n PTRANS nb PTRANS nprow PTRANS npcol MPI Random Access N MPI Random Access time MPI Random Access Check Time MPI Random Access Errors MPI Random Access Errors Fraction MPI Random Access ExeUpdates MPI Random Access Timebound Random Access N STREAM Vector Size STREAM Threads FFT N MPI FFT N MPI FFT maxErr
MPI FFT time0
MPI FFT time1
MPI FFT time2
MPI FFT time3
MPI FFT time4
MPI FFT time5
MPI FFT time6 FFT nBlk FFT np FFT L2 size M OpenMP OpenMP Num Threads OpenMP Max Threads OpenMP Num Procs Mem Proc Mem Spec Mem Val CPS HPCC FFT 235 CPS HPCC FFTW ESTIMATE CPS HPCC MEMALLCTR CPS HPL USE GET PROCESS TIMES CPS RA SANDIA NOPT CPS RA SANDIA OPT2 CPS USING FFTW ID ParentID Intel Intel Endeavor cluster Intel Xeon Gold 6148 (SMT OFF, TURBO OFF,192 GB RAM DDR4 2666 12*16GB, ECC) Intel OPA 100Gbps Intel Corporation http://www.intel.com/ USA, New Mexico
Rio Rancho Vendor Oracle Linux Server 7.3 Intel MKL 2017.2.174 C Intel Composer 2017u4 -O2 -xCORE2-AVX512 -ip -ansi-alias -fno-alias -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DUSING_FFTW -DHPCC_FFT_235 (-opt-streaming-stores always for stream.c) f base Intel MPI 2018.1.163 R IBM System_X3550_M4 Intel Xeon EM64T
infiniband University of Bordj Bou Arreridj www.univ-bba.dz Algeria Bordj Bou Arreridj Research RHEL 6.5
blas-3.2.1 gcc 4.4.7-4 -o5 IBM MPI 09.01.03.00 Intel Xeon E5-2697 v4 (SMT OFF, TURBO OFF,128 GB RAM DDR4 2400 8*16GB, ECC) Red Hat EL 7.2 Intel MKL11.3.0 Intel Composer 2016 -O2 -xCORE2-AVX2 -ip -ansi-alias -fno-alias -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DUSING_FFTW -DHPCC_FFT_235 (-opt-streaming-stores always for stream.c) Intel MPI 5.1.2 Fujitsu
K computer Fujitsu SPARC64 VIIIfx Tofu Interconnect RIKEN Advanced Institute for Computational Science http://www.aics.riken.jp/index_e.html Japan Kobe, Hyogo Linux Parallelnavi Technical Computing Language V1.0L20 FFTE-6.1 alpha -Kfast,parallel,openmp,ocl,preex (and "-mlcmain=main -SSL2BLAMP" for LINKFLAGS, "-Xg" for C files, "-Fwide" for Fortran files) opt
G-HPL: adopted pipelined broadcast communication and promoted fine-grain overlapped data transfer
G-RandomAccess: replaced MPI function calls with RDMA function calls, introduced 3-way data sorting and transfer to improve load imbalance, overlapped data transfer with data sorting, transformed data-sorting loop to use conditional move CPU instructions and applied loop distribution to table-updating loop
S&EP-RandomAccess: applied loop distribution to table-updating loop and increased iterations of inner loops
G-FFT: used the latest version of FFTE library
EP-STREAM: inserted compiler directives to use a CPU feature of fast cache line allocation
L seagate camaro intel seagate test lab Malaysia seremban SL 6
GOTOBLAS 1.13 gcc MPICH2 1.2.1 Intel Xeon E5-2697 v3 (SMT OFF, TURBO OFF, PC4-17000R-15, ECC) FDR Infiniband (48 Mellanox MSX6025F-1BFR switches, Mellanox MCX353A-FCAT adapters on nodes) Mellanox Red Hat EL 6.5 Intel MKL 11.2 Intel Composer XE 2015.0.064 -O3 -xCORE-AVX2 -ip -ansi-alias -fno-alias -qopt-subscript-in-range -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DHPCC_FFT_235(-opt-streaming-stores always for stream.c) Intel MPI 5.0.0.028 https://launchpad.net/mpi4you mpi4you
AMD/FX8150 AsRock XFast LAN AsROCK
kurrola.dy.fi http://kurrola.dy.fi Finland Lappeenranta Academic Ubuntu 14.04 ( 3.13.0-39-generic ) acml-6.0.6.13-gfortran64 gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) -fPIC -m64 -pthread -O3 -march=bdver1 mpich-3.0.4 Blue Gene/Q (MIRA) IBM PowerPC A2 BGQ 5D TORUS Argonne Leadership Computing Facility/Argonne National Laboratory https://www.alcf.anl.gov
USA, Illinois Argonne Linux CNK ESSL 5.1 IBM XL C/C++ for Blue Gene, V12.1 -O5 -qstrict -qarch=qp -qtune=qp -qipa -qsimd=auto -qthreaded -qsmp=omp -qmaxmem=-1
GUPS: Special low-overhead runtime exposing the interconnect atomic RDMA support.
EP-STREAM: use of QPX vector intrinsics.
G-FFT: (2D-decomposition, FFTE (IOPT=-1/+1 data distribution) INPUT(NX,NY/NPUY,NZ/NPUZ) -> OUTPUT(NX,NY/NPUY,NZ/NPUZ)). custom low-overhead communication runtime, optimized alltoall implementation using one-sided RDMA and fine-grained transfers, custom rank placement to maximize interconnect throughput in the three alltoall communication phases, intra-node FFT using the IBM ESSL 5.1 implementation.
HPL: Base implementation that is moderately tuned. The rank placement to optimize FFT reduced the performance of broadcast in HPL. See note on MPI below.
Note: MPI performance is affected due to the allocation of the interconnect resources for the custom one-sided transfer runtime. Special memory allocator used across all benchmarks to avoid fragmentation. MPICH2 version 1.5 -O5 -qstrict -qarch=qp -qtune=qp -qipa -qsimd=auto -qthreaded -qsmp=omp Cray Cray XE6 AMD Opteron Gemini Indiana University http://www.iu.edu/ USA, Indiana Bloomington SUSE Linux Enterprise Server 11 Cray Scientific Libraries 12.1.01 Cray Compiling Environment 8.1.9 -O3 -DRA_SANDIA_OPT2 -hlist=m -DLONG_IS_64BITS -hrestrict=a -h pragma=omp Cray MPICH 2 Xeon E5-2697 v2 (SMT OFF, Turbo OFF, DDR3-1600) http://www.intel.com Red Hat EL 6.4 Intel MKL 11.1 Intel Composer XE 2013.4.183 -O3 -xAVX -ip -ansi-alias -fno-alias -opt-subscript-in-range -DUSING_FFTW -DMKL_INT=long -L$MKL/lib/intel64 -I$MKL/include/fftw -L$MPI/lib64 -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DHPCC_FFT_235 (and "-opt-streaming-stores always" for stream.c) Intel MPI 4.1.3.045 MPI4YOU AMD/FX-8150 AsRock Ubuntu 12.4 LTS ACML version 5.2.0 (mp) gcc 4.6 -fPIC -m64 -pthread -O3 -march=barcelona mpich 3.0.1 Ubuntu 12.4 ACML version 5.2.0 Xeon E5-2670 (SMT OFF, Turbo OFF, DDR3-1600) USA, Washington DuPont Red Hat EL 6.1 Intel MKL 11.x Intel Compiler 12.x -O3 -xAVX -ip -ansi-alias -fno-alias -opt-subscript-in-range -DUSING_FFTW -DMKL_INT=long -L$MKL/lib/intel64 -I$MKL/include/fftw -L$MPI/lib64 -DLONG_IS_64BITS -DRA_SANDIA_OPT2 (and "-opt-str<