I recently moved from MAGMA 0.2 to 1.0 RC3 and have noticed that, in certain cases, the GPU LU decomposition (device interface) provided by magma_cgetrf_gpu() now uses about twice as much device memory as before.

Upon investigation of cgetrf_gpu.cpp I found the following (from line 138).

```
if ((m == n) && (m % 32 == 0) && (ldda%32 == 0))
    magmablas_cinplace_transpose( dAT, ldda, lddat );
else {
    if ( CUBLAS_STATUS_SUCCESS != cublasAlloc(maxm*maxn, sizeof(cuFloatComplex), (void**)&dAT) ) {
        cublasFree( dAP );
        return MAGMA_ERR_CUBLASALLOC;
    }
    magmablas_ctranspose2( dAT, lddat, dA, ldda, m, n );
}
```

It seems that if the matrix is not square, or if its size or its leading dimension (ldda) is not a multiple of 32, then additional device memory equal to the matrix size (padded to a multiple of 32) is allocated to hold an out-of-place transpose. I am sure that this padding helps performance, but it does make the interaction with my existing code a little difficult. Has anyone else come across this and found an easy fix?
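To make the memory cost concrete, here is a rough sketch of how the extra allocation could be estimated on the host. It mirrors the condition quoted above; the helper names are hypothetical (they are not part of MAGMA), and I am assuming that maxm and maxn are simply m and n rounded up to the next multiple of 32, which I have not verified against the rest of cgetrf_gpu.cpp.

```c
#include <stddef.h>

/* Hypothetical helper: round x up to the next multiple of 32.
 * Assumption: this is how maxm and maxn are derived from m and n. */
static size_t round_up_32(size_t x)
{
    return (x + 31) / 32 * 32;
}

/* Returns 1 when the quoted condition selects the in-place transpose,
 * i.e. no separate dAT buffer is allocated. */
static int uses_inplace_transpose(size_t m, size_t n, size_t ldda)
{
    return (m == n) && (m % 32 == 0) && (ldda % 32 == 0);
}

/* Estimated extra device bytes for the dAT workspace. */
static size_t extra_dat_bytes(size_t m, size_t n, size_t ldda)
{
    /* A cuFloatComplex holds two single-precision floats. */
    const size_t elem_size = 2 * sizeof(float);

    if (uses_inplace_transpose(m, n, ldda))
        return 0;
    return round_up_32(m) * round_up_32(n) * elem_size;
}
```

Under these assumptions, a 1024 x 1024 matrix with ldda = 1024 needs no extra buffer, while a 1000 x 1000 matrix with ldda = 1000 would need a 1024 x 1024 workspace, roughly the size of the matrix itself, which would be consistent with the factor of 2 I am seeing.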

If not, I will have a look at what is required and submit a patch for square matrices whose size is not a multiple of 32.

Thanks, and keep up the good work.