We are happy to hear you use and like the MAGMA package.
1) ...in testing_sgesv_gpu.cpp file:
magma_sgetrf_gpu(&N, &N, d_A, &dlda, IPIV, h_work_M_S, INFO);
magma_sgetrs_gpu("N", N, NRHS, d_A, dlda, IPIV, d_B, LDB, INFO, h_work_M_S);
I think this is equivalent to magma_sgesv_gpu, but there is no call to sgesv_gpu...
We are going to fix this. There is no particular reason for the omission: the pair of calls to magma_sgetrf_gpu and magma_sgetrs_gpu should be grouped into a single sgesv_gpu routine, which has not been added yet simply because we overlooked it.
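To illustrate the grouping described above, here is a minimal CPU-only sketch of the factor-then-solve pattern. It is not MAGMA code: the names lu_factor, lu_solve and gesv are hypothetical stand-ins for the getrf/getrs/gesv roles, and the column-major layout and the LAPACK-style pivot and info conventions are assumptions made for the example.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "getrf" step: in-place LU factorization with
// partial pivoting on a column-major n x n matrix with leading dimension lda.
// Returns 0 on success, or k+1 if the pivot in column k is exactly zero.
int lu_factor(int n, std::vector<float>& A, int lda, std::vector<int>& ipiv) {
    ipiv.assign(n, 0);
    for (int k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude in column k as the pivot.
        int p = k;
        float best = std::fabs(A[k + k * lda]);
        for (int i = k + 1; i < n; ++i) {
            float v = std::fabs(A[i + k * lda]);
            if (v > best) { best = v; p = i; }
        }
        ipiv[k] = p;
        if (best == 0.0f) return k + 1;  // singular: no usable pivot
        if (p != k)
            for (int j = 0; j < n; ++j) std::swap(A[k + j * lda], A[p + j * lda]);
        // Store the multipliers (L) below the diagonal and update the trailing block.
        for (int i = k + 1; i < n; ++i) {
            A[i + k * lda] /= A[k + k * lda];
            float lik = A[i + k * lda];
            for (int j = k + 1; j < n; ++j) A[i + j * lda] -= lik * A[k + j * lda];
        }
    }
    return 0;
}

// Hypothetical stand-in for the "getrs" step (no transpose): solves A X = B
// using the factors from lu_factor; B (n x nrhs, column-major) is overwritten by X.
void lu_solve(int n, int nrhs, const std::vector<float>& LU, int lda,
              const std::vector<int>& ipiv, std::vector<float>& B, int ldb) {
    for (int c = 0; c < nrhs; ++c) {
        float* b = &B[c * ldb];
        for (int k = 0; k < n; ++k) std::swap(b[k], b[ipiv[k]]);  // apply row swaps
        for (int k = 0; k < n; ++k)                               // L y = P b (unit diagonal)
            for (int i = k + 1; i < n; ++i) b[i] -= LU[i + k * lda] * b[k];
        for (int k = n - 1; k >= 0; --k) {                        // U x = y
            b[k] /= LU[k + k * lda];
            for (int i = 0; i < k; ++i) b[i] -= LU[i + k * lda] * b[k];
        }
    }
}

// The grouped driver: factor, then solve only if the factorization succeeded.
int gesv(int n, int nrhs, std::vector<float>& A, int lda, std::vector<int>& ipiv,
         std::vector<float>& B, int ldb) {
    int info = lu_factor(n, A, lda, ipiv);
    if (info == 0) lu_solve(n, nrhs, A, lda, ipiv, B, ldb);
    return info;
}
```

A caller would pass the matrix and right-hand sides once to gesv and check the returned info, instead of issuing the two calls by hand. A GPU version would presumably wrap the two existing GPU routines in the same way.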
2) The testing_sgesv_gpu.cpp tester does not give CPU results; it gives only GPU results.
There is no particular reason for this one either. We will make the tester uniform with the others.
3) My code has several calls to Cublas_sgemm. May I substitute it by MAGMA_sgemm to get better
Yes, especially for sizes that are not multiples of 32. magma_sgemm uses a combination of kernels and may need some additional tuning, but in general it should be faster for sizes that are not multiples of the algorithm's internal blocking sizes (e.g. 32).
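As a rough aid for deciding which sgemm call sites to benchmark first, the sketch below picks out the problem shapes where some dimension is not a multiple of the blocking size. The GemmShape type and both helper functions are hypothetical, and the default of 32 is only the example value mentioned above; the actual internal blocking sizes may differ.

```cpp
#include <vector>

// Hypothetical description of one sgemm call: C (m x n) = A (m x k) * B (k x n).
struct GemmShape {
    int m, n, k;
};

// True when every dimension is a multiple of the assumed blocking size.
bool is_block_aligned(const GemmShape& s, int block = 32) {
    return s.m % block == 0 && s.n % block == 0 && s.k % block == 0;
}

// Returns the shapes with at least one dimension that is not a multiple of
// `block`, i.e. the cases where a gain from switching is most likely.
std::vector<GemmShape> unaligned_shapes(const std::vector<GemmShape>& shapes,
                                        int block = 32) {
    std::vector<GemmShape> out;
    for (const GemmShape& s : shapes)
        if (!is_block_aligned(s, block)) out.push_back(s);
    return out;
}
```

This only ranks candidates; whether a given call actually runs faster still has to be measured on the target GPU.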