The solver first transposes the matrix in GPU memory. The CUDA kernel that performs the transpose uses a block size of 32, and we ask the user to allocate a slightly larger (padded) matrix so that we do not have to code the transpose for general matrix sizes. This will most probably change in future releases. When the next panel has to be processed, it is first transposed (to move it back to the standard data layout that LAPACK expects), then sent to the CPU and factored there using LAPACK. The GPU work space needed for this and other operations must be allocated by the user and passed as a single pointer.
int dlda = (N/32)*32;          /* round N down to a multiple of 32 ...            */
if (dlda < N) dlda += 32;      /* ... then up, so dlda >= N and dlda % 32 == 0    */
cublasSetMatrix( N, N, sizeof(float), A, N, d_A, dlda );   /* host A (lda = N) -> device d_A (lda = dlda) */
Here we just make the device leading dimension of d_A divisible by 32 (and at least N). d_A is where the matrix is copied and then transposed in place; the rest of the user-supplied memory is used as workspace. So, to answer your question, we do the "padding" only for the transpose operation, not for BLAS, and in future releases we will remove the need for the "padding" in the transpose operation.
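If it helps, here is a minimal sketch of how the user-side allocation and copy could look. Only the padding of the leading dimension and the cublasSetMatrix call come from the snippet above; the extra workspace size (one panel of 32 columns here) and the function/variable names are assumptions for illustration, so check the library documentation for the actual workspace requirement.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>   /* legacy CUBLAS API, matches the cublasSetMatrix() call above */

/* Pad the leading dimension up to a multiple of 32, allocate the padded matrix
   plus (assumed) workspace in one piece, and copy the matrix to the GPU. */
float *copy_to_gpu(const float *A, int N, int *dlda_out)
{
    int dlda = (N / 32) * 32;
    if (dlda < N) dlda += 32;                              /* dlda >= N, divisible by 32 */

    size_t bytes = (size_t)dlda * N  * sizeof(float)       /* padded matrix              */
                 + (size_t)dlda * 32 * sizeof(float);      /* assumed extra workspace    */

    float *d_A = NULL;
    if (cudaMalloc((void **)&d_A, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return NULL;
    }

    /* Host A has lda = N; device d_A has lda = dlda. */
    cublasSetMatrix(N, N, sizeof(float), A, N, d_A, dlda);

    *dlda_out = dlda;
    return d_A;
}

The single pointer returned here would then be handed to the solver, which uses the region past the padded matrix as its workspace.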