Run time on the CPU is greater than on the GPU

Posted: Wed Sep 06, 2017 8:28 am
by Ham
Hey all,

When I measured the run time on the CPU and on the GPU with CPU_TIME(), I found that the run time on the CPU is greater than on the GPU. I use the Fortran interface with the GMRES solver, without the preconditioner, because with the preconditioner it takes more time.
What could cause this?
I run my program on an HP machine with an Intel Xeon E3-1240 v3 (3.4 GHz), using gfortran 4.9.
Below are the run times for the first 6 of 100 iterations:

Code: Select all
iter   CPU run time /s            |   GPU run time /s
  1    4.0000000000000036E-003    |   0.35599999999999987
  2    0.0000000000000000         |   1.6000000000000014E-002
  3    4.0000000000000036E-003    |   1.9999999999999574E-002
  4    0.0000000000000000         |   1.6000000000000014E-002
  5    4.0000000000000036E-003    |   2.0000000000003126E-002
  6    0.0000000000000000         |   1.6000000000001791E-002
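
For reference, here is roughly how I time each iteration (a simplified sketch; do_solve() is only a placeholder standing in for my actual MAGMA GMRES call):

Code: Select all
program time_iter
    implicit none
    double precision :: t0, t1

    call cpu_time( t0 )
    call do_solve()        ! placeholder for the actual GMRES solve
    call cpu_time( t1 )
    print *, 'run time /s: ', t1 - t0

contains

    ! stand-in so the sketch compiles; the real code calls MAGMA here
    subroutine do_solve()
    end subroutine

end program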


Thank you in advance.

Re: Run time on the CPU is greater than on the GPU

Posted: Wed Sep 06, 2017 9:58 am
by mgates3
Can you be more specific about what you are doing? For instance:
  • How big is the problem (rows/cols, number of nonzeros)?
  • What format is your matrix (CSR, ...)?
  • Are you including time to transfer the matrix to the GPU, or are you transferring the matrix before?
  • What MAGMA functions are you calling?
  • What model GPU are you using?

I suggest using magmaf_wtime, omp_get_wtime, or MPI_Wtime instead of cpu_time(). Fortran's cpu_time() measures CPU time used by the process (i.e., time the process spends working, similar to getrusage), not elapsed wall-clock time. Notably, cpu_time() will not reflect time spent on the GPU. We always use wall time.
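
If you don't want to link MAGMA just for a timer, omp_get_wtime works the same way; a minimal sketch (assuming gfortran with -fopenmp):

Code: Select all
program wtime_demo
    use omp_lib
    implicit none
    double precision :: w0, w1

    w0 = omp_get_wtime()
    call sleep( 1 )                 ! stand-in for the code being timed
    w1 = omp_get_wtime()
    print '(a,f8.6)', 'elapsed wall time = ', w1 - w0
end program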

Here's a simple test (see code below). When timing sleep(1), cpu_time() measures almost no time, since the process isn't doing any work, but magmaf_wtime() measures the expected 1 second of elapsed wall time. When timing gemm() with 1 thread, cpu_time() and magmaf_wtime() measure similar times.
Code: Select all
prompt> setenv OMP_NUM_THREADS 1
prompt> setenv VECLIB_MAXIMUM_THREADS 1
prompt> ./time
sleep(1)
cpu_time     = 0.000139
magmaf_wtime = 1.003619

gemm()
cpu_time     = 0.067450
magmaf_wtime = 0.067607


But when timing gemm() with 2 threads, cpu_time() measures the time spent working in both threads, so it is roughly double the wall-clock elapsed time that magmaf_wtime() measures (below, 0.081534 ≈ 2 × 0.041313).
Code: Select all
prompt> setenv OMP_NUM_THREADS 2
prompt> setenv VECLIB_MAXIMUM_THREADS 2
prompt> ./time
sleep(1)
cpu_time     = 0.000143
magmaf_wtime = 1.005181

gemm()
cpu_time     = 0.081534
magmaf_wtime = 0.041313


Code: Select all
program main
    use magma
    implicit none

    double precision :: start, start2, t, t2
    double precision :: A(1000,1000), B(1000,1000), C(1000,1000)
    integer :: n
    double precision :: alpha, beta

    n = 1000
    alpha = 1.0d0
    beta  = 2.0d0
    ! initialize operands so dgemm does not read uninitialized memory
    A = 1.0d0
    B = 2.0d0
    C = 0.0d0

    ! time sleep(1): the process does no work, so cpu_time() records
    ! almost nothing, while magmaf_wtime() records the elapsed second
    call cpu_time( start )
    call magmaf_wtime( start2 )
    print '(a)', 'sleep(1)'
    call sleep( 1 )
    call cpu_time( t )
    call magmaf_wtime( t2 )
    t  = t  - start
    t2 = t2 - start2
    print '(a,f8.6)', 'cpu_time     = ', t
    print '(a,f8.6)', 'magmaf_wtime = ', t2
    print '()'

    ! time dgemm: CPU-bound work, so both timers agree with one thread;
    ! with multiple threads, cpu_time() sums the time over all threads
    call cpu_time( start )
    call magmaf_wtime( start2 )
    print '(a)', 'gemm()'
    call dgemm( "n", "n", n, n, n, alpha, A, n, B, n, beta, C, n )
    call cpu_time( t )
    call magmaf_wtime( t2 )
    t  = t  - start
    t2 = t2 - start2
    print '(a,f8.6)', 'cpu_time     = ', t
    print '(a,f8.6)', 'magmaf_wtime = ', t2
    print '()'
end program


On macOS, compiled with:
Code: Select all
gfortran -Wall -I /opt/magma/include -o time time.f90 -L /opt/magma/lib -Wl,-rpath,/opt/magma/lib -lmagma -framework Accelerate
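
On Linux, the equivalent should look something like this (assuming MAGMA under /opt/magma and OpenBLAS providing BLAS; adjust paths and the BLAS library to your install):

Code: Select all
gfortran -Wall -I /opt/magma/include -o time time.f90 -L /opt/magma/lib -Wl,-rpath,/opt/magma/lib -lmagma -lopenblas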


-mark