The nthread option is not used for all routines. The best way to see which options are used by any particular tester is to look in the testing code (e.g., testing_dgeqrf.cpp). For some option xyz, look for opts.xyz.
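For example, a quick way to list the opts.* fields a tester reads is to grep its source. The stand-in file created below is only for illustration; in a real MAGMA checkout, point grep at testing/testing_dgeqrf.cpp (or any other tester) directly:

```shell
# Create a stand-in tester source just for this demo; in a real MAGMA
# checkout, grep testing/testing_dgeqrf.cpp instead.
printf 'opts.nb\nopts.nthread\n' > /tmp/tester_sample.cpp
# Extract the distinct opts.* fields the file references.
grep -o 'opts\.[a-z_]*' /tmp/tester_sample.cpp | sort -u
```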
In MAGMA, most CPU threading is left to the LAPACK and BLAS libraries. (Exceptions are some eigenvalue routines, where MAGMA uses OpenMP.) For instance, with Intel's MKL, set the $MKL_NUM_THREADS environment variable. Some libraries use OpenMP, so setting $OMP_NUM_THREADS works. If you are using ATLAS, the number of threads is fixed at compile time (see the ATLAS FAQ linked below), and you need to link with the threaded ATLAS libraries (-lptcblas -lptf77blas) instead of the serial ones (-lcblas -lf77blas).
http://math-atlas.sourceforge.net/faq.html#tnum
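For example, in a bash shell (the variable names are the standard MKL and OpenMP ones; the tester path depends on your build and is shown commented out):

```shell
# Tell the BLAS backend how many CPU threads to use.
export MKL_NUM_THREADS=8    # Intel MKL
export OMP_NUM_THREADS=8    # OpenMP-based BLAS, e.g., OpenBLAS
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./testing_dgeqrf          # then run a tester as usual
```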
In general, the CPU interface transfers the matrix to the GPU, does the computation, then transfers the result back. It attempts to hide the transfers by overlapping them with part of the computation. The GPU interface doesn't have to transfer the entire matrix, so it can sometimes be a bit faster. The computation itself is generally exactly the same.
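As a sketch of the difference for LU factorization (the calls follow MAGMA's documented naming; shown for illustration, not compiled here):

```
// CPU interface: A lives in host memory; MAGMA copies it to the GPU,
// factors it, and copies the result back.
magma_dgetrf( m, n, A, lda, ipiv, &info );

// GPU interface: dA already lives in device memory (leading dimension
// ldda), so the whole-matrix round trip is avoided.
magma_dgetrf_gpu( m, n, dA, ldda, ipiv, &info );
```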
MAGMA uses both the CPU and the GPU simultaneously (in general, depending on the algorithm). For instance, for QR, it does the panel factorization on the CPU while doing the previous trailing matrix update on the GPU. We are working on dynamic scheduling to better utilize all the CPU cores.
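In pseudocode, the hybrid look-ahead pattern for QR looks roughly like this (a sketch of the idea, not MAGMA's actual code):

```
for each panel j = 0, nb, 2*nb, ...:
    copy panel A[:, j : j+nb] from GPU to CPU
    CPU:  factor the panel with LAPACK dgeqrf      // small, latency-bound
    copy the factored panel back to the GPU
    GPU:  update the trailing matrix A[:, j+nb :]  // large, bandwidth-bound
    // the next panel's columns are updated first, so the CPU can start
    // factoring them while the GPU finishes the rest (look-ahead)
```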
Since you have multiple GPUs, you may want to use the multi-GPU routines, denoted with _mgpu or _m.
./testing_dgeqrf_mgpu --ngpu 2
If you set $MAGMA_NUM_GPUS, some CPU interfaces will also use multiple GPUs.
setenv MAGMA_NUM_GPUS 2
Interfaces that do this are: geev_m, gehrd_m, geqrf, gesv, getrf, posv, potrf.
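For example, in bash (setenv above is the csh equivalent; the tester names are from MAGMA's testing directory and shown commented out):

```shell
# Let CPU-interface routines (getrf, potrf, ...) spread work over 2 GPUs.
export MAGMA_NUM_GPUS=2
echo "MAGMA_NUM_GPUS=$MAGMA_NUM_GPUS"
# ./testing_dgetrf                   # CPU interface, now multi-GPU
# ./testing_dgeqrf_mgpu --ngpu 2     # explicit multi-GPU interface
```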