If you factor with the CPU interface magma_dgetrf, it usually isn't worthwhile to copy the factored matrix back to the GPU just to call the GPU interface magma_dgetrs_gpu. The copy takes about as long as the solve itself, so just use LAPACK's dgetrs. So there are generally two options:
1) Use magma_dgetrf (CPU interface) and LAPACK's dgetrs.
2) Use magma_dgetrf_gpu and magma_dgetrs_gpu (both GPU interfaces). Currently, for matrices that take up more than about half of the GPU's memory, the dimensions (m, n, and ldda) must each be a multiple of 32. This can be accomplished by padding the matrix with a small identity block, such as:
    A2 = [ A  0 ]
         [ 0  I ]
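To make the padding concrete, here is a minimal sketch in plain C for a square, column-major matrix. The helper names roundup32 and pad_with_identity are made up for this example and are not part of MAGMA or LAPACK. It only builds A2 on the CPU; copying it to the GPU and calling the MAGMA routines is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: round n up to the next multiple of 32. */
static int roundup32(int n)
{
    return ((n + 31) / 32) * 32;
}

/* Hypothetical helper: build A2 = [ A 0 ; 0 I ] for an n x n column-major
 * matrix A with leading dimension lda. The result is np x np with leading
 * dimension np, where np = roundup32(n). Returns NULL if allocation fails.
 * The caller frees the result. */
static double *pad_with_identity(const double *A, int n, int lda, int *np_out)
{
    assert(n > 0 && lda >= n);
    int np = roundup32(n);

    /* calloc zero-fills, which gives the two off-diagonal zero blocks. */
    double *A2 = calloc((size_t)np * (size_t)np, sizeof(double));
    if (A2 == NULL)
        return NULL;

    /* Copy A into the top-left n x n block, column by column. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            A2[i + (size_t)j * np] = A[i + (size_t)j * lda];

    /* Put ones on the diagonal of the bottom-right block. */
    for (int k = n; k < np; ++k)
        A2[k + (size_t)k * np] = 1.0;

    *np_out = np;
    return A2;
}
```

Because A2 is block diagonal, if you also pad the right-hand side with zeros, the first n entries of the padded solution are the solution of the original system and the remaining entries are zero.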
The lda for any routine is the leading dimension of the matrix that you give that routine. For instance, if m=1000 and you allocate the matrix A on the CPU with lda=1000, then call the CPU interface with A and lda. If you allocate dA on the GPU with ldda=1024, then call the GPU interface with dA and ldda. For performance reasons, we nearly always round the ldda on the GPU up to a multiple of 32. This aligns memory reads, making them much faster.
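As a small illustration of the rounding and of what the leading dimension does to indexing, here is a sketch in plain C. The names round_up and colmajor_index are invented for this example and are not MAGMA functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round x up to the next multiple of `multiple`. */
static int round_up(int x, int multiple)
{
    assert(x >= 0 && multiple > 0);
    return ((x + multiple - 1) / multiple) * multiple;
}

/* Hypothetical helper: offset of element (i, j) in a column-major matrix
 * with leading dimension ld. Consecutive columns start ld elements apart,
 * regardless of how many rows m the matrix actually has. */
static size_t colmajor_index(int i, int j, int ld)
{
    assert(i >= 0 && j >= 0 && ld > i);
    return (size_t)i + (size_t)j * (size_t)ld;
}
```

With m=1000, round_up(1000, 32) gives ldda=1024. On the CPU with lda=1000, element (0, 1) sits at offset 1000; on the GPU with ldda=1024, the same element sits at offset 1024, so every column starts on a multiple of 32 elements. The 24 extra rows in each column are padding that the routines never touch as long as you pass m=1000.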