This is not surprising. A multi-GPU implementation has overheads, such as copying pieces of the matrix to each of the two GPUs. Only when the matrix is sufficiently large do the computational savings outweigh these overheads. Performance also depends heavily on the specific CPU, GPU, and the LAPACK and BLAS implementations you are using. You can try tuning the block size (NB) in control/get_nb.cpp.