It's not surprising that the CUBLAS gemm now beats the MAGMA BLAS gemm. Our gemm was developed some time ago, while NVIDIA has continued to optimize theirs. In fact, NVIDIA has used portions of the MAGMA BLAS gemm in implementing their own. The matrix dimensions also have a large impact. In particular, for sgemm the MAGMA BLAS kernel uses block sizes of 96 for M, 96 for N, and 16 for K (per the comments in sgemm_fermi.cu). Your M=20 fills only a fraction of a 96-row block, so much of each block goes unused, which gives poor performance. Try comparing some other sizes, particularly ones that are multiples of 96. You could also try writing an sgemm with a custom blocking size that fits your problem, based on the code in sgemm_fermi64.cu.
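
To get a rough feel for how much the block size matters, here is a small back-of-the-envelope sketch in Python. It is not taken from the MAGMA sources. It assumes a simplified model in which every thread block does the work of a full 96x96 output tile no matter how many of its rows and columns are in bounds; the real kernel may behave differently. The helper name `tile_utilization` and the N=960 used below are made up for illustration.

```python
import math

def tile_utilization(m, n, blk_m=96, blk_n=96):
    """Fraction of the tiled (padded) output area that holds real entries of C.

    Simplified model: the M x N output is covered by blk_m x blk_n tiles,
    and each tile costs the same whether or not it is completely filled.
    """
    tiles_m = math.ceil(m / blk_m)   # tiles needed along M
    tiles_n = math.ceil(n / blk_n)   # tiles needed along N
    padded_area = (tiles_m * blk_m) * (tiles_n * blk_n)
    return (m * n) / padded_area

if __name__ == "__main__":
    # N = 960 is a hypothetical value, chosen as a multiple of 96
    # so that only the effect of M shows up.
    for m in (20, 96, 100, 192):
        print(f"M={m:4d}  utilization={tile_utilization(m, 960):.3f}")
```

Under this model, M=20 uses only 20/96 of each tile (about 21%), while M=96 or M=192 uses all of it. A size just above a multiple, such as M=100, drops back to roughly half, which is why comparing at multiples of 96 gives a fairer picture.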