I am using the source code for the SGEMM function (in sgemm_fermi.cu) in a project (I will need to modify it in the future to compute a slightly different result), but its performance is significantly worse than the SGEMM in CUBLAS. I was hoping the magmablas performance would be the same, since its source code is available for modification and I didn't want to write the function from scratch.
I have attached a screenshot of the profiler output for my project, showing the time taken by the compiled magmablas sgemm kernel vs the cublas kernel for the same matrix multiplication. Here fermiSgemm_v2_kernel_NN is the magmablas kernel; it takes 18 us compared to 11 us for the cublas kernel.
I have also tried the alternative versions in sgemm_fermi64.cu (14 us) and sgemm_fermi80.cu (16 us); while they are better, they are still slower than the cublas kernel (11 us).
Is this expected? Should I be using special compilation options for the .cu file, or a different version of the sgemm...cu kernel files?
The matrix sizes are 262144 x 244 and 20 x 224, giving a 20 x 262144 result. This is running on a Tesla C2070, with the compile architecture options set to sm_20 and compute_20.
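For reference, the build command I am using is essentially the following (the include path is a placeholder for my setup; I'm not sure if any further flags are needed to match the cublas build):

```shell
# Sketch of my nvcc invocation for the MAGMA kernel file.
# -gencode emits SASS for sm_20 and PTX for compute_20 (Tesla C2070 / Fermi);
# the include path below is a placeholder for wherever MAGMA is installed.
nvcc -O3 -gencode arch=compute_20,code=sm_20 \
     -I/path/to/magma/include \
     -c sgemm_fermi.cu -o sgemm_fermi.o
```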