Yes, all of these functions use the GPU; the difference is only in the interface. Routine magma_sgetrf takes the input matrix and produces the result in CPU memory (as shown in testing_sgetrf.cpp), while magma_sgetrf_gpu assumes that the input matrix and the output factorization are in GPU memory (as shown in testing_sgetrf_gpu.cpp).
Routine magma_sgetrf_gpu2 is another version of LU with the GPU interface. It was coded to facilitate the solve phase, more precisely the pivoting. We use it internally for the mixed-precision iterative refinement solvers (see testing_dsgesv_gpu.cpp). Its interface differs from LAPACK's, which is why we didn't include it in the Users' guide. It has one more argument related to pivoting that facilitates parallel pivoting. The LAPACK-compliant pivot vector is sequential in nature, in the sense that the interchanges have to be applied in order: it records that row i of the matrix was interchanged with row IPIV(i). The new indexing records that row i of the matrix was moved to row IPIV(i), so the reordering (needed, e.g., in the solvers) can be processed in parallel.
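
To illustrate the two pivoting conventions, here is a minimal Python sketch. It is not MAGMA code: the function names are hypothetical, the indices are 0-based (LAPACK's IPIV is 1-based), and the exact layout of the extra argument in magma_sgetrf_gpu2 is not assumed here. It only shows why an "interchanged with" list must be applied in order, while a "moved to" list lets every row be placed independently.

```python
def apply_sequential_swaps(rows, ipiv):
    """LAPACK-style pivots: step i swaps row i with row ipiv[i], in order."""
    rows = list(rows)
    for i, p in enumerate(ipiv):
        # Each swap acts on the result of the previous ones, so order matters.
        rows[i], rows[p] = rows[p], rows[i]
    return rows


def swaps_to_destinations(ipiv, n):
    """Convert a swap sequence into dest, where original row i ends up at dest[i]."""
    at = list(range(n))  # at[pos] = index of the original row currently at pos
    for i, p in enumerate(ipiv):
        at[i], at[p] = at[p], at[i]
    dest = [0] * n
    for pos, orig in enumerate(at):
        dest[orig] = pos
    return dest


def apply_destinations(rows, dest):
    """'Moved to' pivots: every iteration is independent, so it could run in parallel."""
    out = [None] * len(rows)
    for i, d in enumerate(dest):
        out[d] = rows[i]
    return out


rows = ["r0", "r1", "r2", "r3"]
ipiv = [2, 2, 3, 3]                      # hypothetical swap sequence
dest = swaps_to_destinations(ipiv, len(rows))
print(apply_sequential_swaps(rows, ipiv))
print(apply_destinations(rows, dest))    # same row order as the line above
```

In the second form no iteration reads or writes a row that another iteration touches, which is what makes it suitable for one-thread-per-row reordering on the GPU.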