## Same test code on different devices, one correct, one wrong

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

### Same test code on different devices, one correct, one wrong

I tested the performance of MAGMA routines with a multi-CPU, multi-GPU code on two machines, but the results are completely different: one machine gives a correct result, the other a suspicious one. How can this happen? Could it be a hardware problem, or a bug in the code?

The routine I tested is magmaf_cgetrf_gpu. The host routine has no such issue: both machines reach the correct result.

The result on cuda1, which is correct:
```
 Process             1  of             2  took GPU:             1
 Process             0  of             2  took GPU:             0
   runtime:    8652.217     ms
   runtime:    8652.270     ms

 Solving A x = b using LU factorization:
                || A || =  5.228E+03
                || b || =  9.999E-01
                || x || =  4.428E+00
          || b - A x || =  1.000E+00
   Gflops  =     82.73448000285711
           Solution is CORRECT

 Solving A x = b using LU factorization:
                || A || =  5.228E+03
                || b || =  9.999E-01
                || x || =  4.428E+00
          || b - A x || =  1.000E+00
   Gflops  =     82.76268897634245
           Solution is CORRECT
```

The result on cuda2, which is suspicious:
```
 Process             0  of             2  took GPU:             0
 Process             1  of             2  took GPU:             1
   runtime:    10409.17     ms
 Info :             8
   runtime:    10409.17     ms

 Solving A x = b using LU factorization:
                || A || =  5.228E+03
                || b || =  9.999E-01
                || x || =  9.999E-01
          || b - A x || =  2.650E+03
   Gflops  =     139.3943775337270
   Solution is suspicious,  8.304E+02

 Solving A x = b using LU factorization:
                || A || =  5.228E+03
                || b || =  9.999E-01
                || x || =  9.999E-01
          || b - A x || =  2.650E+03
   Gflops  =     68.76998791355494
   Solution is suspicious,  8.304E+02
```

For the suspicious result, x is identical to b, as if magmaf_cgetrf_gpu had just copied the data instead of solving the equations. Does anyone know why?

Device Info
cuda1 (correct)
```
running on: n48 gives a report of the devices that were found.
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4294770688 bytes
  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.44 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
Device 1: "Tesla T10 Processor"
  ...(same as Device 0)
Device 2: "Tesla T10 Processor"
  ...(same as Device 0)
Device 3: "Tesla T10 Processor"
  ...(same as Device 0)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla T10 Processor, Device = Tesla T10 Processor
PASSED

Press <Enter> to Quit...
```

cuda2 (suspicious)
```
running on: n53 gives a report of the devices that were found.
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 3 devices supporting CUDA

Device 0: "Tesla M2070"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 5636554752 bytes
  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes
  Device is using TCC driver mode:               No
Device 1: "Tesla M2070"
  ...(same as Device 0)
Device 2: "Tesla M2070"
  ...(same as Device 0)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 3, Device = Tesla M2070, Device = Tesla M2070
PASSED
```

More detail about cuda2 (suspicious) can be found at http://h18004.www1.hp.com/products/quickspecs/DS_00205/DS_00205.pdf; the hardware model is an HP ProLiant SL390s G7 2U half-width server.

I have since updated the CUDA driver from 3.2 to 4.0; the same problem still occurs.

Test Code:
test_magma.f
```fortran
!
!   -- MAGMA (version 1.0) --
!      Univ. of Tennessee, Knoxville
!      Univ. of California, Berkeley
!      Univ. of Colorado, Denver
!      November 2010
!
!  @generated c
!
      program testing_cgetrf_gpu_f

      use magma
      use more_mpi
      integer :: rc, mydev, numdev

      external cublas_init, cublas_set_matrix, cublas_get_matrix
      external cublas_shutdown, cublas_alloc
      external clange, cgemm, cgesv, slamch
      real clange, slamch
      integer cublas_alloc

      real              :: rnumber(2), Anorm, Bnorm, Rnorm, Xnorm
      real, allocatable :: work(:)
      complex, allocatable          :: h_A(:), h_B(:), h_X(:)
      magma_devptr_t                :: devptrA, devptrB
      integer,    allocatable       :: ipiv(:)
      complex                       :: zone, mzone
      integer                       :: i, n, info, stat, lda
      integer                       :: size_of_elt, nrhs
      real(kind=8)                  :: flops, t
      integer                       :: tstart(2), tend(2)
      PARAMETER          ( nrhs = 1, zone = 1., mzone = -1. )
      integer c1,c2,cr,cm

      call init
      ierr = cudaGetDeviceCount(numdev)
      mydev = 0 !mod(cpuid,numdev)
      print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
      ierr = cudaSetDevice(mydev)

      call cublas_init()

      n   = 10240 ! 2048
      lda  = n
      ldda = ((n+31)/32)*32
      size_of_elt = sizeof_complex

!------ Allocate CPU memory
      allocate(h_A(lda*n))
      allocate(h_B(lda*nrhs))
      allocate(h_X(lda*nrhs))
      allocate(work(n))
      allocate(ipiv(n))

!------ Allocate GPU memory
      stat = cublas_alloc(ldda*n, size_of_elt, devPtrA)
      if (stat .ne. 0) then
         write(*,*) "device memory allocation failed"
         stop
      endif
      stat = cublas_alloc(ldda*nrhs, size_of_elt, devPtrB)
      if (stat .ne. 0) then
         write(*,*) "device memory allocation failed"
         stop
      endif

!---- Initialize the matrix
      do i=1,lda*n
         call random_number(rnumber)
         h_A(i) = rnumber(1)
      end do
      do i=1,lda*nrhs
        call random_number(rnumber)
        h_B(i) = rnumber(1)
      end do
      h_X(:) = h_B(:)

!---- devPtrA = h_A
      call cublas_set_matrix(n, n, size_of_elt, h_A, lda, devptrA, ldda)
!---- devPtrB = h_B
      call cublas_set_matrix(n, nrhs, size_of_elt, h_B, lda, devptrB, ldda)

      call MPI_BARRIER(mpi_comm_world,mpi_err)
      call system_clock( c1, cr, cm )

!---- Call magma LU ----------------
      call magma_gettime_f(tstart)
      call magmaf_cgetrf_gpu(n, n, devptrA, ldda, ipiv, info)
      call magma_gettime_f(tend)

      call MPI_BARRIER(mpi_comm_world,mpi_err)
      call system_clock( count=c2 )
      print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
      if ( info .ne. 0 )  then
         write(*,*) "Info : ", info
      end if

!---- Call magma solve -------------
      call magmaf_cgetrs_gpu('n', n, nrhs, devptrA, ldda, ipiv, devptrB, ldda, info)
      if ( info .ne. 0 )  then
         write(*,*) "Info : ", info
      end if

!---- h_X = devptrB
      call cublas_get_matrix (n, nrhs, size_of_elt, devptrB, ldda, h_X, lda)

!---- Compare the two results ------
      Anorm = clange('I', n, n,    h_A, lda, work)
      Bnorm = clange('I', n, nrhs, h_B, lda, work)
      Xnorm = clange('I', n, nrhs, h_X, lda, work)
      call cgemm('n', 'n', n, nrhs, n, zone, h_A, lda, h_X, lda, mzone, h_B, lda)
      Rnorm = clange('I', n, nrhs, h_B, lda, work)

      write(*,*)
      write(*,*  ) 'Solving A x = b using LU factorization:'
      write(*,105) '  || A || = ', Anorm
      write(*,105) '  || b || = ', Bnorm
      write(*,105) '  || x || = ', Xnorm
      write(*,105) '  || b - A x || = ', Rnorm

      flops = 2. * n * n * n / 3.
      call magma_gettimervalue_f(tstart, tend, t)
      write(*,*)   '  Gflops  = ',  flops / t / 1e6
      write(*,*)

      Rnorm = Rnorm / ( (Anorm*Xnorm+Bnorm) * n * slamch('E') )
      if ( Rnorm > 60. ) then
         write(*,105) '  Solution is suspicious, ', Rnorm
      else
         write(*,105) '  Solution is CORRECT'
      end if

!---- Free CPU memory
      deallocate(h_A, h_X, h_B, work, ipiv)

!---- Free GPU memory
      call cublas_free(devPtrA)
      call cublas_free(devPtrB)
      call cublas_shutdown()

 105  format((a35,es10.3))
      call MPI_FINALIZE(rc)
      end
```

more_mpi.f (the makefile dependencies show the module lives here):
```fortran
module more_mpi
   include 'mpif.h'
   integer :: ierr, cpuid, numprocs, namelen !mpi
   character(len=100) processor_name
end module
```

init.f:
```fortran
subroutine init
   use more_mpi
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
   call mpi_comm_size(mpi_comm_world,numprocs,ierr)
   call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init
```

makefile
```makefile
.SUFFIXES: .cuf .o

L1= test_magma.o more_mpi.o init.o fortran.o

CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}
CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}
CUDALIBS= -lcudart -lcuda -lcublas
MAGMALIBPATH= -L/u/af/bj/cding/MAGMA/lib
MAGMALIBS= -lmagma -lmagmablas
GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart

PGFLAGS= -Mfree -O3 -DADD_
#CUDA= -ta=nvidia -Mcuda
CUDA=
SOPT=

LINK1= /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       -lpthread

#LINK_CU= /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU= /opt/pgi/linux86-64/11.5/lib/libcudafor.a

PF90= mpif90
PGFOR= pgfortran
PGA_EX= magma_test_mgpus

main: $(L1)
	$(PF90) $(SOPT) $(PGFLAGS) $(L1) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(MAGMALIBPATH) $(MAGMALIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)

.f.o:
	$(PF90) $(SOPT) $(PGFLAGS) -Mpreprocess -Dmagma_devptr_t="integer(kind=8)" -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<

.cuf.o:
	$(PGFOR) $(SOPT) $(PGFLAGS) $(CUDA) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<

.c.o:
	gcc -O3 -DCUBLAS_USE_THUNKING $(CUDAINCLUDES) -c $<

test_magma.o: test_magma.f init.o more_mpi.o
more_mpi.o: more_mpi.f
init.o: init.f more_mpi.o
fortran.o: fortran.c

clean:
	/bin/rm -f *.o *.mod $(L1) $(PGA_EX)

del:
	rm -f *.mio.mines.edu
```

execute command:
```
mpiexec -np 2 magma_test_mgpus
```
bsb3166

Posts: 8
Joined: Mon Jul 25, 2011 12:28 pm

### Re: Same test code on different devices, one correct, one wr

This is really weird, I know. Has anyone else run into the same problem?
bsb3166
