All optimizations to routines were done in accordance with HPCC rules
Customized G-RandomAccess and EP-RandomAccess: Use IBM hardware support.
Customized G-PTRANS: Bypasses MPI and uses IBM Hub Chip interconnect
Customized G-FFT: Uses IBM Hub Chip interconnect during transpose phase
Customized EP-Stream: Uses explicit parallelization with pthreads (see the sketch after this group of notes)
Customized G-HPL: Uses OpenMP during LASWP phase
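A minimal sketch of the explicit pthreads parallelization mentioned in the EP-Stream note above, applied to a STREAM-style triad loop. The thread count, array names, and static partitioning are illustrative assumptions, not the submitted code.

#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4
#define N        (1 << 24)

static double *a, *b, *c;
static const double scalar = 3.0;

struct range { long lo, hi; };

static void *triad_worker(void *arg)
{
    struct range *r = arg;
    for (long i = r->lo; i < r->hi; i++)
        a[i] = b[i] + scalar * c[i];      /* STREAM triad kernel */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range part[NTHREADS];

    a = malloc(N * sizeof *a);
    b = malloc(N * sizeof *b);
    c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    long chunk = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        part[t].lo = t * chunk;
        part[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, triad_worker, &part[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    free(a); free(b); free(c);
    return 0;
}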
G-HPL: adopted pipelined broadcast communication and promoted fine-grain overlapped data transfer
G-RandomAccess: replaced MPI function calls with RDMA function calls, overlapped data transfer with data sorting, transformed data-sorting loop to use conditional move CPU instructions and applied loop distribution to table-updating loop (see the sketch after this group of notes)
S&EP-RandomAccess: applied loop distribution to table-updating loop and increased iterations of inner loops
G-FFT: used the latest version of FFTE library
EP-STREAM: inserted compiler directives to use a CPU feature of fast cache line allocation
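A hedged sketch of the two loop transformations named in the RandomAccess notes above: writing the feedback step as a conditional expression that compilers typically lower to a conditional-move instruction, and distributing the combined generate-and-update loop into two separate loops. POLY is the RandomAccess polynomial constant; everything else is illustrative, not the submitted code.

#include <stdint.h>

#define POLY 0x0000000000000007ULL

/* Loop distribution: the "generate" and "table update" work is split into
 * two loops so each can be scheduled/pipelined independently.  The feedback
 * step uses a conditional expression rather than a branch, which typically
 * maps to a conditional-move instruction. */
void update_table(uint64_t *table, uint64_t mask, uint64_t ran,
                  long n, uint64_t *stream)
{
    long i;
    for (i = 0; i < n; i++) {                      /* loop 1: generate updates */
        ran = (ran << 1) ^ (((int64_t)ran < 0) ? POLY : 0);
        stream[i] = ran;
    }
    for (i = 0; i < n; i++)                        /* loop 2: apply updates    */
        table[stream[i] & mask] ^= stream[i];
}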
G-HPL: adopted pipelined broadcast communication and promoted fine-grain overlapped data transfer
G-RandomAccess: replaced MPI function calls with RDMA function calls and overlapped data transfer with data sorting
G-FFT: used the latest version of FFTE library
EP-STREAM: inserted compiler directives to use a CPU feature of fast cache line allocation
Extend loop length (SN/EP-RA), vectorization (G-RA), Use ASL(SN/EP-FFT), Use FFTE 4.0 vector algorithm and ASL (G-FFT)
Extend loop length (SN/EP-RA), vectorization (G-RA), Use ASL(SN/EP-FFT), Use FFTE 4.0 vector algorithm and ASL (G-FFT)
Extend loop length (SN/EP-RA), vectorization (G-RA), Use ASL(SN/EP-FFT), Use FFTE 4.0 vector algorithm and ASL (G-FFT)
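A hedged sketch of the "extend loop length" idea mentioned above, assuming the common vector-machine technique of collapsing a loop nest with a short inner trip count into one long loop so the hardware sees a longer vector length. The routine and arrays are purely illustrative, not the submitted code.

void scale_long_loop(double *a, const double *b, double s, int rows, int cols)
{
    /* Original form (short vectors):
     *   for (i = 0; i < rows; i++)
     *       for (j = 0; j < cols; j++)
     *           a[i*cols + j] = s * b[i*cols + j];
     */
    long n = (long)rows * cols;
    for (long k = 0; k < n; k++)        /* one long, fully vectorizable loop */
        a[k] = s * b[k];
}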
HPL, FFT and RandomAccess were modified to utilize Blue Gene-specific intrinsics and OpenMP, and a custom RandomAccess algorithm was used to take advantage of the network topology of the system.
No optimizations other than those included in the ESSL math libraries were used.
Extend loop length (SN/EP-RA), vectorization (G-RA), Use ASL(SN/EP-FFT), Use FFTE 4.0 vector algorithm and ASL (G-FFT)
hpcc-1.0.0 using Sandia's MPIRandomAccess_opt.c
Modified HPL broadcast algorithms
Modified communication phase of Sandia's AnyNodesMPIRandomAccessUpdate to coordinate updates between ranks on-node, using MPI. No changes were made to lookahead, nor to the generation or application of updates.
Modified HPL broadcast algorithms
Used intercept routines for fftw_create_plan, fftw_one, fftw_mpi_create_plan, fftw_mpi_local_sizes, fftw_mpi_destroy_plan, fftw_mpi to interface with Cray-modified FFTW 3.2alpha3
Modified communication phase of Sandia's AnyNodesMPIRandomAccessUpdate to coordinate updates between ranks on-node, using MPI. No changes were made to lookahead, nor to the generation or application of updates.
Modified HPL broadcast algorithms
Used intercept routines for fftw_create_plan, fftw_one, fftw_mpi_create_plan, fftw_mpi_local_sizes, fftw_mpi_destroy_plan, fftw_mpi to interface with Cray-modified FFTW 3.2alpha3 (enhancements to FFTW's Alltoall algorithms)
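An illustrative sketch of the intercept-routine idea described above: provide replacement definitions of the FFTW 2 entry points the benchmark calls and forward them to another FFT library. The backend_plan_1d/backend_execute functions are made-up placeholders for the Cray-modified FFTW 3.2alpha3 interface, and the plan/argument types are simplified relative to the real FFTW 2 prototypes.

#include <complex.h>
#include <stdlib.h>

typedef struct { void *impl; } intercept_plan;

/* Placeholders standing in for the Cray-modified backend; not a real API. */
extern void *backend_plan_1d(int n, int sign);
extern void  backend_execute(void *impl, double complex *in, double complex *out);

/* Replacement definitions of the FFTW 2 entry points; they simply forward
 * to the backend at link time. */
intercept_plan *fftw_create_plan(int n, int dir, int flags)
{
    intercept_plan *p = malloc(sizeof *p);
    (void)flags;                       /* planner flags ignored in this sketch */
    p->impl = backend_plan_1d(n, dir);
    return p;
}

void fftw_one(intercept_plan *p, double complex *in, double complex *out)
{
    backend_execute(p->impl, in, out);
}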
The RandomAccess algorithm is similar in principle to the algorithm submitted for HPCC in 2005: it employs software routing and aggregation on a 3D-torus topology, routing every update along the three dimensions in a dimension-ordered manner and ensuring that the application never holds more than 1024 updates at any point in time. The significant changes for Blue Gene/P are as follows. Since each MPI node has 4 cores at its service, the functionality is distributed across them: core 0 generates the updates and routes them along the X dimension, core 1 receives from X and routes along Y, core 2 receives from Y and routes along Z, and core 3 receives from Z and applies the updates to the local table. The code bypasses MPI and directly uses the lower-level DMA-based SPI communication layer (available as a standard software library of Blue Gene/P).
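An illustrative sketch of the per-core pipeline described above, not the actual DMA/SPI code. The helpers generate_update(), send_along_dim(), recv_from_dim(), and TABLE_MASK are hypothetical placeholders; aggregation, the 1024-outstanding-update limit, routing decisions, and termination handling are all omitted.

#include <stdint.h>

#define TABLE_MASK 0xFFFFFFULL               /* assumed local table size - 1 */

extern uint64_t generate_update(void);               /* placeholder */
extern void     send_along_dim(char dim, uint64_t u); /* placeholder */
extern uint64_t recv_from_dim(char dim);              /* placeholder */

void random_access_core(int core_id, uint64_t *local_table, long n_updates)
{
    for (long i = 0; i < n_updates; i++) {
        switch (core_id) {
        case 0:                        /* generate updates, route along X */
            send_along_dim('X', generate_update());
            break;
        case 1:                        /* receive from X, route along Y   */
            send_along_dim('Y', recv_from_dim('X'));
            break;
        case 2:                        /* receive from Y, route along Z   */
            send_along_dim('Z', recv_from_dim('Y'));
            break;
        case 3: {                      /* receive from Z, update table    */
            uint64_t u = recv_from_dim('Z');
            local_table[u & TABLE_MASK] ^= u;
            break;
        }
        }
    }
}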
The IBM Blue Gene/P system supports direct use of the messaging DMA hardware in parallel with MPI for application messaging. To enable this direct-use mode for DMA, an initialization call that sets up the DMA FIFOs must be executed before invoking MPI_Init. For this purpose the optimized HPCC code introduces a function, dma_init, which is invoked just before MPI_Init. This mechanism exists to support special messaging situations, is used in a number of production codes including QCD, and is well documented in the Blue Gene Redbook.
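A minimal sketch of the initialization order described above. dma_init is the function the optimized HPCC code introduces; its exact signature is assumed here (no arguments), and the DMA FIFO setup it performs is not shown.

#include <mpi.h>

extern void dma_init(void);      /* sets up the DMA FIFOs (assumed signature) */

int main(int argc, char **argv)
{
    dma_init();                  /* must run before MPI_Init */
    MPI_Init(&argc, &argv);
    /* ... benchmark ... */
    MPI_Finalize();
    return 0;
}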
For MPIFFT we changed the parallel 1D FFT algorithm from the 9-step FFT to the basic 6-step algorithm. We then modified the parallelized FFT code under the HPCC_pzfft1d function to fit the Blue Gene system, and modified all functions in "fft235.c" to SIMDize the radix-2, 3, 4, 5 and 8 FFT routines using intrinsic functions of the IBM XLC compiler to generate the appropriate double-FPU instructions.
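A minimal serial sketch of the six-step 1-D FFT structure referred to above, for n = n1*n2. fft_small() is a hypothetical radix kernel standing in for the SIMDized routines in fft235.c; the parallel data layout and communication of HPCC_pzfft1d are not reproduced.

#include <complex.h>

extern void fft_small(double complex *x, int len);   /* placeholder kernel */

static void transpose(const double complex *a, double complex *b,
                      int rows, int cols)
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            b[j * rows + i] = a[i * cols + j];
}

void fft_six_step(double complex *x, double complex *work, int n1, int n2)
{
    const double pi = 3.14159265358979323846;
    int n = n1 * n2;

    transpose(x, work, n1, n2);                 /* step 1: transpose          */
    for (int j = 0; j < n2; j++)                /* step 2: n2 FFTs of size n1 */
        fft_small(work + j * n1, n1);
    for (int j = 0; j < n2; j++)                /* step 3: twiddle factors    */
        for (int k = 0; k < n1; k++)
            work[j * n1 + k] *=
                cexp(-2.0 * pi * I * (double)j * (double)k / (double)n);
    transpose(work, x, n2, n1);                 /* step 4: transpose          */
    for (int i = 0; i < n1; i++)                /* step 5: n1 FFTs of size n2 */
        fft_small(x + i * n2, n2);
    transpose(x, work, n1, n2);                 /* step 6: final transpose;
                                                   result is left in work     */
}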
Extend loop length, add some compiler directives (SN/EP RA), vectorization (HPL,G-RA), add MPI_Barrier(PTRANS)
Extend loop length, add some compiler directives (SN/EP RA), vectorization (HPL,G-RA), add MPI_Barrier(PTRANS)
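The notes above only say that an MPI_Barrier was added to PTRANS; where it was placed is not stated. The sketch below shows the usual pattern of synchronizing ranks around a timed region so that skew between ranks is not charged to the measurement; pdtrans_work() is a hypothetical placeholder for the transpose itself.

#include <mpi.h>

extern void pdtrans_work(void);       /* placeholder for the transpose phase */

double timed_transpose(MPI_Comm comm)
{
    MPI_Barrier(comm);                /* added barrier: all ranks start together */
    double t0 = MPI_Wtime();
    pdtrans_work();
    MPI_Barrier(comm);                /* all ranks finished before stopping clock */
    return MPI_Wtime() - t0;
}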