Performance of SGEMM Magma vs CUBLAS (toolkit 4.2)

Post by darthbaker » Tue Jun 05, 2012 9:29 pm

Hi,

I am using the source code for the SGEMM function (in sgemm_fermi.cu) in a project, because I will need to modify it in future to compute a slightly different result. However, this function is significantly slower than the SGEMM in cublas. I was hoping the magmablas performance would be the same, since the source code is available for modification and I didn't want to have to write the function from scratch.

I have attached a screenshot of the profiler output for my project, comparing the execution time of the compiled sgemm function with that of the cublas kernel for the same matrix multiplication. Here fermiSgemm_v2_kernel_NN is the magmablas kernel, which takes 18us compared to 11us for the cublas kernel.

I have also tried the alternative versions in sgemm_fermi64.cu (14us) and sgemm_fermi80.cu (16us). While they are better, they are still slower than the cublas kernel (11us).

Is this expected, or should I be using some special compilation options for the .cu file? Or should I be using a different version of the sgemm...cu kernel files?

The matrix sizes are 262144 x 244 and 20 x 224 to give a 20 x 262144 result. This is running on a Tesla C2070, with compile architecture options set to sm_20 and compute_20.

Thanks,

[Attachment: Performance.jpg, profiler timeline]

Re: Performance of SGEMM Magma vs CUBLAS (toolkit 4.2)

Post by mgates3 » Wed Jun 06, 2012 5:19 pm

It's not surprising that the CUBLAS gemm is now beating the MAGMA BLAS gemm. Our gemm was developed some time ago, while NVIDIA continues to optimize their gemm. In fact, NVIDIA has used portions of the MAGMA BLAS gemm in implementing their gemm. The matrix dimensions also have a large impact. In particular, for sgemm the MAGMA BLAS kernel uses blocking of 96 for M, 96 for N, and 16 for K (from comments in sgemm_fermi.cu). Since your M=20 is only a fraction of the 96 block size, most of each block is wasted, which will produce poor results. Try comparing some different sizes, particularly ones that are multiples of 96. You could also try writing an sgemm with a custom blocking size that fits your problem, based on the code in sgemm_fermi64.cu.

-mark

Re: Performance of SGEMM Magma vs CUBLAS (toolkit 4.2)

Post by darthbaker » Wed Jun 06, 2012 5:41 pm

Hi Mark,

Thanks for that information. In my case M=20 is not fixed and can vary with each subsequent iteration. I will test other sizes to see how they perform, and investigate a custom block size to see if that helps.

Good to know that I hadn't done anything majorly wrong in extracting the sgemm source.

Thanks,

