The problem is that there are still some synchronous data transfers being performed, as revealed by the CUDA command-line profiler (COMPUTE_PROFILE=1):
Code:
method=[ _Z14dtranspose3_32PdiPKdiiiii ] gputime=[ 7.168 ] cputime=[ 17.000 ] occupancy=[ 0.625 ]
method=[ _Z13dtranspose_32PdiPKdi ] gputime=[ 5.440 ] cputime=[ 8.000 ] occupancy=[ 0.625 ]
method=[ memcpyDtoHasync ] gputime=[ 13.856 ] cputime=[ 6.000 ]
method=[ memcpyHtoD ] gputime=[ 5.504 ] cputime=[ 10.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 12.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.408 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 9.000 ]
method=[ memcpyHtoD ] gputime=[ 1.152 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 0.992 ] cputime=[ 8.000 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 10.000 ]
method=[ memcpyHtoD ] gputime=[ 0.960 ] cputime=[ 8.000 ]
method=[ _Z13dlaswp_kernel15dlaswp_params_t ] gputime=[ 32.448 ] cputime=[ 14.000 ] occupancy=[ 0.094 ]
The overhead from this one portion of the routine dominates the CPU-time cost and prevents the work on the other CPU threads from proceeding asynchronously. I'm fairly certain the source of these transfers is the call to magmablas_dpermute_long2s:
Code:
// factor the current panel on the CPU
lapackf77_dgetrf( &rows, &nb, work, &lddwork, ipiv+i*nb, &iinfo);
if ( (*info == 0) && (iinfo > 0) )
    *info = iinfo + i*nb;
// apply the resulting row interchanges to dAT on the GPU, then send the
// factored panel back to the device asynchronously on this thread's stream
magmablas_dpermute_long2s( n, dAT, lddat, ipiv, nb, i*nb, stream );
magma_dsetmatrix_async( m-i*nb, nb, work, lddwork, dAP, maxm, stream );
The source for magmablas_dpermute_long2s:
Code:
extern "C" void
magmablas_dpermute_long2s( magma_int_t n, double *dAT, magma_int_t ldda,
magma_int_t *ipiv, magma_int_t nb, magma_int_t ind,
magma_stream_t stream )
{
for( int k = 0; k < nb; k += MAX_PIVOTS ) {
int npivots = min( MAX_PIVOTS, nb-k );
// fields are: dAT n lda j0 npivots
dlaswp_params_t params = { dAT, n, ldda, ind + k, npivots };
for( int j = 0; j < npivots; ++j ) {
params.ipiv[j] = ipiv[ind + k + j] - k - 1;
ipiv[ind + k + j] += ind;
}
dlaswps( params, stream );
}
}
extern "C" void dlaswps( dlaswp_params_t ¶ms, magma_stream_t stream )
{
int blocks = (params.n + NTHREADS - 1) / NTHREADS;
dlaswp_kernel<<< blocks, NTHREADS, 0, stream >>>( params );
}
I have attempted to replace the call to dlaswp in magmablas_dpermute_long2 with the magmablas_dlaswp2 routine, which works from a device copy of the pivots (ipiv), but I have not been able to implement this successfully, if it is even possible. The direction I was attempting looks roughly like the sketch below.
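To be clear, this is only a sketch, not the actual magmablas_dlaswp2 interface: dlaswp2_kernel and magmablas_dpermute_long2_async are names I made up, and h_ipiv (pinned, from cudaMallocHost) and d_ipiv are staging/device buffers of at least nb entries that I assume each CPU thread allocates once up front. I'm also assuming that magma_stream_t is just a cudaStream_t (the existing kernel launch suggests it is), that the same headers and macros as the existing dlaswp file (magma.h, NTHREADS) are available, and that dAT is laid out the way I read dlaswp_kernel: each thread owns one row of dAT, and column j starts at dAT + j*ldda. If the real layout differs, the addressing below is wrong.
Code:
// one thread per row of dAT; each thread applies the nb row interchanges for
// its row sequentially, reading the pivots from device memory instead of from
// a launch-time parameter struct
__global__ void dlaswp2_kernel( double *dAT, magma_int_t n, magma_int_t ldda,
                                magma_int_t ind, magma_int_t nb,
                                const magma_int_t *d_ipiv )
{
    int tid = threadIdx.x + blockDim.x*blockIdx.x;
    if ( tid < n ) {
        double *A1 = dAT + tid + (size_t)ind*ldda;            // (row tid, column ind)
        for( int j = 0; j < nb; ++j ) {
            int p = (int)d_ipiv[j] - 1;                       // pivots are 1-based, panel-relative
            double *A2 = dAT + tid + (size_t)(ind + p)*ldda;  // column to swap with
            double tmp = *A1;  *A1 = *A2;  *A2 = tmp;
            A1 += ldda;                                       // next column of the panel
        }
    }
}

extern "C" void
magmablas_dpermute_long2_async( magma_int_t n, double *dAT, magma_int_t ldda,
                                magma_int_t *ipiv, magma_int_t nb, magma_int_t ind,
                                magma_int_t *d_ipiv, magma_int_t *h_ipiv,
                                magma_stream_t stream )
{
    // copy this panel's raw pivots into the pinned staging buffer and, as in
    // the original routine, convert the host ipiv entries from panel-relative
    // to global (1-based) row indices
    for( int j = 0; j < nb; ++j ) {
        h_ipiv[j]      = ipiv[ind + j];
        ipiv[ind + j] += ind;
    }

    // asynchronous HtoD copy of the pivots on the same stream as the kernel;
    // h_ipiv must not be reused before this copy completes (double-buffer it
    // or wait on an event if consecutive panels can overlap)
    cudaMemcpyAsync( d_ipiv, h_ipiv, nb*sizeof(magma_int_t),
                     cudaMemcpyHostToDevice, stream );

    // stream ordering guarantees the kernel sees the staged pivots
    int blocks = (n + NTHREADS - 1) / NTHREADS;
    dlaswp2_kernel<<< blocks, NTHREADS, 0, stream >>>( dAT, n, ldda, ind, nb, d_ipiv );
}
The call in dgetrf_gpu would then become magmablas_dpermute_long2_async( n, dAT, lddat, ipiv, nb, i*nb, d_ipiv, h_ipiv, stream ), with the two pivot buffers owned by the calling CPU thread, so nothing in the swap step should block the host.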
Does anyone have suggestions on how to proceed, given that I need each CPU thread to be able to call dgetrf_gpu completely asynchronously with respect to the other CPU threads?