GPU interface to dgetrf with streams


Postby jah87 » Thu Jun 13, 2013 5:53 pm

I've made some modifications to src/dgetrf_gpu.cpp so that the explicit H2D and D2H data transfers are performed asynchronously on streams. With this change, the routine should be able to execute asynchronously across concurrently executing OpenMP threads (my matrices are moderately small, about 160x160 per thread).
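
For context, the per-thread pattern I'm using looks roughly like this (a sketch with placeholder names, not the actual patch): each OpenMP thread owns its own stream and pinned workspace, so its H2D/D2H copies can be queued independently of the other threads.
Code: Select all
#include <cuda_runtime.h>

void factor_per_thread( int n, int lda, int nthreads )
{
    #pragma omp parallel num_threads(nthreads)
    {
        cudaStream_t stream;
        double *h_A, *d_A;

        cudaStreamCreate( &stream );
        cudaMallocHost( (void**)&h_A, lda*n*sizeof(double) );  /* pinned host buffer */
        cudaMalloc    ( (void**)&d_A, lda*n*sizeof(double) );

        /* ... fill h_A with this thread's matrix ... */
        cudaMemcpyAsync( d_A, h_A, lda*n*sizeof(double),
                         cudaMemcpyHostToDevice, stream );      /* async H2D */
        /* ... call the stream-aware dgetrf_gpu variant here ... */
        cudaMemcpyAsync( h_A, d_A, lda*n*sizeof(double),
                         cudaMemcpyDeviceToHost, stream );      /* async D2H */

        cudaStreamSynchronize( stream );  /* wait only for this thread's stream */
        cudaFreeHost( h_A );  cudaFree( d_A );  cudaStreamDestroy( stream );
    }
}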

The problem is that there are still some synchronous data transfers being performed, as revealed by the CUDA command-line profiler (COMPUTE_PROFILE=1):
Code: Select all
method=[ _Z14dtranspose3_32PdiPKdiiiii ] gputime=[ 7.168 ] cputime=[ 17.000 ] occupancy=[ 0.625 ]
method=[ _Z13dtranspose_32PdiPKdi ] gputime=[ 5.440 ] cputime=[ 8.000 ] occupancy=[ 0.625 ]
method=[ memcpyDtoHasync ] gputime=[ 13.856 ] cputime=[ 6.000 ]
method=[ memcpyHtoD ] gputime=[ 5.504 ] cputime=[ 10.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 12.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.408 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 1.152 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 10.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ _Z13dlaswp_kernel15dlaswp_params_t ] gputime=[ 32.448 ] cputime=[ 14.000 ] occupancy=[ 0.094 ]


The overhead from this one portion of the routine dominates the cost in terms of CPU time and prevents the other threads from executing asynchronously. I'm fairly certain the source of these transfers is the call to magmablas_dpermute_long2s:
Code: Select all
lapackf77_dgetrf( &rows, &nb, work, &lddwork, ipiv+i*nb, &iinfo);
if ( (*info == 0) && (iinfo > 0) )
    *info = iinfo + i*nb;

magmablas_dpermute_long2s( n, dAT, lddat, ipiv, nb, i*nb, stream );
magma_dsetmatrix_async( m-i*nb, nb, work, lddwork, dAP, maxm, stream );


The source for magmablas_dpermute_long2s:
Code: Select all
extern "C" void
magmablas_dpermute_long2s( magma_int_t n, double *dAT, magma_int_t ldda,
                           magma_int_t *ipiv, magma_int_t nb, magma_int_t ind,
                           magma_stream_t stream )
{
    for( int k = 0; k < nb; k += MAX_PIVOTS ) {
        int npivots = min( MAX_PIVOTS, nb-k );
        // fields are:             dAT  n  lda  j0       npivots
        dlaswp_params_t params = { dAT, n, ldda, ind + k, npivots };
        for( int j = 0; j < npivots; ++j ) {
            params.ipiv[j] = ipiv[ind + k + j] - k - 1;
            ipiv[ind + k + j] += ind;
        }
        dlaswps( params, stream );
    }
}
extern "C" void dlaswps( dlaswp_params_t &params, magma_stream_t stream )
{
    int blocks = (params.n + NTHREADS - 1) / NTHREADS;
    dlaswp_kernel<<< blocks, NTHREADS, 0, stream >>>( params );
}


I have attempted to replace the dlaswp call in magmablas_dpermute_long2 with the magmablas_dlaswp2 routine, which uses a device copy of the pivots (ipiv), but I have not been able to implement this successfully, if it is even possible.

Does anyone have some suggestions on how to proceed, given that I need each CPU thread to be able to call dgetrf_gpu completely asynchronously from the other CPU threads?

Re: GPU interface to dgetrf with streams

Postby Stan Tomov » Thu Jun 13, 2013 11:18 pm

The current code uses stream 0 for the GPU BLAS, which does not allow concurrent BLAS execution on the GPU from different threads.

Regarding the communications, I don't think magmablas_dpermute_long2s uses synchronous transfers. The routine has no explicit communication; only the kernel arguments get sent implicitly by CUDA. In particular, we pack a number of pivots in a structure, and this structure is passed to the kernel. Like any other argument, it is transferred from the CPU to the GPU asynchronously by CUDA: I think that when a kernel is launched from the CPU, the CUDA compiler inserts code that prepares the arguments (possibly packing them into some contiguous data), queues the task for execution, and exits. The call is therefore asynchronous - control returns "immediately" to the calling thread, and a CUDA background thread makes sure the arguments for the queued task are sent to the GPU and execution is started there.
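
As an illustration (this is not the MAGMA code, just a minimal example), the pivots travel as an ordinary by-value kernel argument, so the launch only enqueues work on the stream and returns:
Code: Select all
#include <cuda_runtime.h>

struct small_params { int npivots; int ipiv[32]; };

__global__ void apply_pivots( double *dAT, int ldda, small_params p )
{
    /* ... use p.ipiv to swap rows of dAT ... */
}

void enqueue_pivots( double *dAT, int ldda, small_params p, cudaStream_t stream )
{
    /* the struct is copied as part of the launch command itself; control
       returns to the CPU right away and no explicit memcpy of p is issued */
    apply_pivots<<< 1, 64, 0, stream >>>( dAT, ldda, p );
}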

We wrote magmablas_dlaswp2 so that there is only one kernel call. The pivots, however, are assumed to already be on the GPU, so you have to copy them from the CPU to the GPU first. This is possible to do (we have done it, but have not released a version with it yet).

I will be interested to see some performance results using this approach. Thanks.

Re: GPU interface to dgetrf with streams

Postby jah87 » Fri Jun 14, 2013 9:44 am

Stan,

Thanks for your reply. I'll have to keep this brief, as I wrote a very detailed response only to lose it when I attempted to submit it.

In the implementation I'm describing, I have replaced the call to magma_dgetrf_gpu with magma_dgetrf_gpu_v2, which accepts a unique CUBLAS handle and its associated CUDA stream as arguments from each calling CPU thread. Within the routine, all magmablas routines are replaced with variants that accept the stream as an argument and perform their GPU operations on that stream. All CUBLAS routines are replaced with their _v2 equivalents and passed the necessary handle, so they should all execute asynchronously. All memory transfers are replaced with their _async equivalents as well.
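
The per-thread setup is essentially the following (sketch with placeholder names):
Code: Select all
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Called once per CPU thread: every CUBLAS _v2 call made through this handle,
   and every _async copy on this stream, is queued on the same non-default stream. */
void setup_thread_gpu( cublasHandle_t *handle, cudaStream_t *stream )
{
    cudaStreamCreate( stream );
    cublasCreate( handle );
    cublasSetStream( *handle, *stream );
}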

Perhaps the arguments implicitly sent by CUDA are only transferred asynchronously under certain conditions. From what I could gather, it seems that a memory transfer can only be performed asynchronously if the host data is allocated in pinned memory. Maybe something like the following could work:

Code: Select all
cudaMemcpyAsync( params_dev, &params, sizeof(dlaswp_params_t), cudaMemcpyHostToDevice, stream );
dlaswp_kernel<<< blocks, NTHREADS, 0, stream >>>( params_dev );   // kernel would need to take a device pointer instead of the by-value struct
cudaMemcpyAsync( &params, params_dev, sizeof(dlaswp_params_t), cudaMemcpyDeviceToHost, stream );


I'm a bit of a novice with C, so forgive my ignorance, but I do not know how to allocate the structure as pinned memory, since there is no explicit "malloc". Is it just that whatever is passed to the structure must be pinned (in this case ipiv)?
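
If pinning does matter, I suppose the structure could be allocated with cudaMallocHost instead of on the stack; something like this sketch (dlaswp_params_t as defined in the MAGMA source above):
Code: Select all
#include <cuda_runtime.h>

/* Allocate a dlaswp_params_t in pinned (page-locked) host memory rather than
   on the stack, so an asynchronous copy from it can truly be asynchronous.
   Free it later with cudaFreeHost. */
dlaswp_params_t* alloc_pinned_params( void )
{
    dlaswp_params_t *params = NULL;
    cudaMallocHost( (void**)&params, sizeof(dlaswp_params_t) );
    return params;
}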

I am very interested in the use of the magmablas_dlaswp2 routine. At this point, it may just be best to wait for this implementation, or develop my own. Do you know when/if this is planned to be released?

As far as performance is concerned, I am seeing significant improvement in my code using non-default streams, but I am waiting for a fully functional implementation before doing any extensive performance testing.

Austin

Re: GPU interface to dgetrf with streams

Postby Stan Tomov » Fri Jun 14, 2013 1:46 pm

Austin,
I see. This sounds good. We have been asked by users to provide this type of stream interface, so any experimental performance results would be very useful for us.

I can check with NVIDIA developers whether routine arguments are always sent asynchronously or whether there are cases in which they are sent synchronously.

To use magmablas_dlaswp2 you can allocate ipiv in pinned memory on the CPU and precede the magmablas_dlaswp2 call with an asynchronous CPU-to-GPU copy of the corresponding pivot indices from ipiv (on the CPU) to some dipiv (on the GPU).
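
Roughly like this (a sketch with placeholder names; the exact magmablas_dlaswp2 argument list depends on the MAGMA version, so the call itself is only indicated in a comment):
Code: Select all
#include <cuda_runtime.h>
#include "magma.h"

/* Copy the pivots produced by the CPU panel factorization to the GPU
   asynchronously so that a dlaswp2-style kernel can read them from device
   memory; ipiv_pinned must live in pinned memory (e.g., cudaMallocHost). */
void send_pivots( const magma_int_t *ipiv_pinned, magma_int_t *dipiv,
                  magma_int_t nb, cudaStream_t stream )
{
    cudaMemcpyAsync( dipiv, ipiv_pinned, nb*sizeof(magma_int_t),
                     cudaMemcpyHostToDevice, stream );
    /* magmablas_dlaswp2( ..., dipiv, ..., stream );  one kernel, pivots on GPU */
}
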
Regarding putting arguments (in this case the structure with the pivots) in pinned memory, I don't think it matters, because CUDA does not use that memory directly to send the data to the GPU - the arguments get copied into an intermediate buffer (which I assume is allocated in pinned memory when CUDA is initialized).
Stan

Re: GPU interface to dgetrf with streams

Postby jah87 » Fri Jun 21, 2013 7:05 pm

At this point, I have a working modification of dgetrf_gpu.cpp that uses streams to allow asynchronous execution between OpenMP threads (aka multicore). Each thread creates a unique context (aka handle) and associates its stream with that handle. It is undoubtedly a naive use of streams, and I have not yet optimized the magmablas routines to account for the change in the availability of GPU resources. As a first attempt at further optimization, I also implemented a version in which each thread spawns two streams and uses them to overlap some of the communication and computation on each CPU core.
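
The two-stream version is basically double buffering; a rough sketch of the idea (placeholder names, not the actual code):
Code: Select all
#include <cuda_runtime.h>

/* While one stream computes on the current panel, the other stream copies the
   next panel to the device. Swapping the roles of the two streams every
   iteration keeps each panel's copy and its compute on the same stream, so no
   extra events are needed. */
void process_panels( double *h_panel[], double *d_panel[2], size_t bytes,
                     int npanels, cudaStream_t s0, cudaStream_t s1 )
{
    cudaMemcpyAsync( d_panel[0], h_panel[0], bytes,
                     cudaMemcpyHostToDevice, s0 );          /* first panel on s0 */

    for (int p = 0; p < npanels; ++p) {
        cudaStream_t comp = (p % 2 == 0) ? s0 : s1;  /* stream that copied panel p */
        cudaStream_t copy = (p % 2 == 0) ? s1 : s0;

        if (p + 1 < npanels)                         /* prefetch the next panel */
            cudaMemcpyAsync( d_panel[(p+1) % 2], h_panel[p+1], bytes,
                             cudaMemcpyHostToDevice, copy );

        /* ... launch the computation for panel p on `comp`, reading d_panel[p % 2] ... */
        (void)comp;
    }
    cudaStreamSynchronize( s0 );
    cudaStreamSynchronize( s1 );
}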

As a quick test of performance, I modified the testing_dgetrf_gpu routine to parallelize the inner "niter" loop with OpenMP. Allocating/setting device or pinned memory forces a device synchronization, so I moved these operations outside the dgetrf_gpu routine itself and pass the dAT, dAP, ipiv, and work variables in as arguments. Even with my naive implementations, the desired effect is achieved:
(Tests performed with AMD "Interlagos" Opteron 6274 + NVIDIA Tesla K20X Kepler GK110 with one thread per FPU. The problem size was chosen to be 150 to mirror the application. A block size of 128 was used.)
There are diminishing returns as I increase the number of threads, which I hope to remedy, perhaps through batching or something of the sort.
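
The modified driver looks roughly like this (placeholder names; the stream-aware factorization call is only indicated in a comment), with the results following:
Code: Select all
#include <cuda_runtime.h>

/* Rough outline: the niter iterations are spread over the OpenMP threads, and
   each thread creates its stream and allocates its device/pinned buffers once,
   outside the timed region, since those allocations synchronize the device. */
void run_test( int niter )
{
    #pragma omp parallel
    {
        cudaStream_t stream;
        cudaStreamCreate( &stream );
        /* ... allocate this thread's dAT, dAP (device), work (pinned), ipiv ... */

        #pragma omp for
        for (int i = 0; i < niter; ++i) {
            /* stream-aware factorization; the buffers are passed in rather than
               allocated inside:
               magma_dgetrf_gpu_v2( M, N, dAT, ldda, ipiv, dAP, work, stream, &info ); */
        }

        cudaStreamSynchronize( stream );
        /* ... free the per-thread buffers ... */
        cudaStreamDestroy( stream );
    }
}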

No streams:
Code: Select all
Number of threads: 1
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      3.05 ( 0.0007)      1.70 ( 0.0013)   2.50e-18
...
  150   150      5.09 ( 0.0004)      2.03 ( 0.0011)   2.87e-18
  150   150      5.07 ( 0.0004)      2.02 ( 0.0011)   2.53e-18
      niter   =======  =======    =======  =======
         16      4.80 ( 0.0075)      1.97 ( 0.0182)

Number of threads: 2
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.70 ( 0.0008)      0.81 ( 0.0028)   2.50e-18
...
  150   150      4.86 ( 0.0005)      1.88 ( 0.0012)   2.57e-18
  150   150      4.93 ( 0.0005)      2.02 ( 0.0011)   2.57e-18
      niter   =======  =======    =======  =======
         16      8.77 ( 0.0041)      2.79 ( 0.0129)

Number of threads: 4
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.77 ( 0.0008)      0.47 ( 0.0048)   2.50e-18
...
  150   150      5.04 ( 0.0004)      0.66 ( 0.0034)   2.71e-18
  150   150      4.85 ( 0.0005)      0.70 ( 0.0032)   2.71e-18
      niter   =======  =======    =======  =======
         16     15.64 ( 0.0023)      1.98 ( 0.0181)

Number of threads: 8
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.99 ( 0.0007)      0.29 ( 0.0076)   2.50e-18
...
  150   150      4.78 ( 0.0005)      0.23 ( 0.0098)   2.66e-18
  150   150      4.76 ( 0.0005)      0.27 ( 0.0082)   2.66e-18
      niter   =======  =======    =======  =======
         16     23.28 ( 0.0015)      1.73 ( 0.0207)


One stream per thread:
Code: Select all
Number of threads: 1
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      3.02 ( 0.0007)      2.12 ( 0.0011)   2.50e-18
...
  150   150      5.11 ( 0.0004)      2.25 ( 0.0010)   2.87e-18
  150   150      5.07 ( 0.0004)      2.26 ( 0.0010)   2.53e-18
      niter   =======  =======    =======  =======
         16      4.83 ( 0.0074)      2.24 ( 0.0160)

Number of threads: 2
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.69 ( 0.0008)      1.82 ( 0.0012)   2.50e-18
...
  150   150      5.09 ( 0.0004)      2.20 ( 0.0010)   2.57e-18
  150   150      5.08 ( 0.0004)      1.86 ( 0.0012)   2.57e-18
      niter   =======  =======    =======  =======
         16      9.10 ( 0.0039)      3.81 ( 0.0094)

Number of threads: 4
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.59 ( 0.0009)      1.65 ( 0.0014)   2.50e-18
...
  150   150      4.83 ( 0.0005)      1.84 ( 0.0012)   2.71e-18
  150   150      4.84 ( 0.0005)      1.56 ( 0.0014)   2.71e-18
      niter   =======  =======    =======  =======
         16     15.42 ( 0.0023)      5.24 ( 0.0068)

Number of threads: 8
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.01 ( 0.0011)      1.47 ( 0.0015)   2.50e-18
...
  150   150      4.70 ( 0.0005)      0.93 ( 0.0024)   2.66e-18
  150   150      4.62 ( 0.0005)      0.79 ( 0.0028)   2.66e-18
      niter   =======  =======    =======  =======
         16     21.95 ( 0.0016)      5.31 ( 0.0067)


Two streams per thread:
Code: Select all
Number of threads: 1
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.99 ( 0.0007)      2.17 ( 0.0010)   2.50e-18
...
  150   150      5.09 ( 0.0004)      2.36 ( 0.0009)   2.87e-18
  150   150      5.08 ( 0.0004)      2.35 ( 0.0010)   2.53e-18
      niter   =======  =======    =======  =======
         16      4.80 ( 0.0075)      2.32 ( 0.0155)

Number of threads: 2
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.70 ( 0.0008)      1.89 ( 0.0012)   2.50e-18
...
  150   150      5.10 ( 0.0004)      2.05 ( 0.0011)   2.57e-18
  150   150      4.97 ( 0.0005)      2.01 ( 0.0011)   2.57e-18
      niter   =======  =======    =======  =======
         16      8.97 ( 0.0040)      3.91 ( 0.0092)

Number of threads: 4
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.55 ( 0.0009)      1.47 ( 0.0015)   2.50e-18
...
  150   150      5.08 ( 0.0004)      1.61 ( 0.0014)   2.71e-18
  150   150      4.80 ( 0.0005)      1.30 ( 0.0017)   2.71e-18
      niter   =======  =======    =======  =======
         16     15.75 ( 0.0023)      5.00 ( 0.0072)

Number of threads: 8
  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
  150   150      2.10 ( 0.0011)      1.07 ( 0.0021)   2.50e-18
...
  150   150      4.73 ( 0.0005)      0.91 ( 0.0025)   2.66e-18
  150   150      4.80 ( 0.0005)      0.73 ( 0.0031)   2.66e-18
      niter   =======  =======    =======  =======
         16     22.66 ( 0.0016)      5.55 ( 0.0065)


Summary:
Code: Select all
Size   Iterations   Threads   CPU   No streams 1 stream/thread 2 streams/thread   Best   Improvement
==================================================================================
150   16   1   1   0.410   0.464   0.483   2 streams/thread   1.178
150   16   2   1   0.318   0.419   0.436   2 streams/thread   1.370
150   16   4   1   0.127   0.340   0.317   1 stream/thread   2.684
150   16   8   1   0.074   0.242   0.245   2 streams/thread   3.296


Other sizes:
Code: Select all
Size   Iterations   Threads   CPU   No streams 1 stream/thread 2 streams/thread   Best   Improvement
==================================================================================
256   16   1   1   0.400   0.917   0.958   2 streams/thread   2.394
256   16   2   1   0.342   0.839   0.852   2 streams/thread   2.495
256   16   4   1   0.277   0.624   0.699   2 streams/thread   2.522
256   16   8   1   0.163   0.396   0.439   2 streams/thread   2.698
384   16   1   1   0.814   1.442   1.518   2 streams/thread   1.864
384   16   2   1   0.790   1.283   1.373   2 streams/thread   1.738
384   16   4   1   0.600   1.038   1.149   2 streams/thread   1.916
384   16   8   1   0.232   0.661   0.657   1 stream/thread   2.853
512   16   1   1   1.367   1.942   2.037   2 streams/thread   1.490
512   16   2   1   1.308   1.808   1.871   2 streams/thread   1.431
512   16   4   1   0.950   1.509   1.586   2 streams/thread   1.669
512   16   8   1   0.468   0.915   0.910   1 stream/thread   1.954
640   16   1   1   1.903   2.555   2.710   2 streams/thread   1.424
640   16   2   1   1.636   2.320   2.454   2 streams/thread   1.500
640   16   4   1   1.027   1.977   2.069   2 streams/thread   2.014
640   16   8   1   0.478   1.136   1.224   2 streams/thread   2.563
768   16   1   1   2.506   3.149   3.322   2 streams/thread   1.325
768   16   2   1   2.120   2.975   3.063   2 streams/thread   1.445
768   16   4   1   1.473   2.520   2.544   2 streams/thread   1.727
768   16   8   1   0.661   1.462   1.573   2 streams/thread   2.379
896   16   1   1   3.145   3.779   4.000   2 streams/thread   1.272
896   16   2   1   2.633   3.398   3.649   2 streams/thread   1.386
896   16   4   1   1.805   2.889   3.028   2 streams/thread   1.677
896   16   8   1   0.766   1.635   1.769   2 streams/thread   2.309
1024   16   1   1   3.604   4.119   4.308   2 streams/thread   1.195
1024   16   2   1   3.182   3.898   3.993   2 streams/thread   1.255
1024   16   4   1   2.359   3.305   3.394   2 streams/thread   1.439
1024   16   8   1   1.105   2.124   2.306   2 streams/thread   2.086

