Gemm shared memory allocation

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
Post Reply
Posts: 1
Joined: Thu Jan 12, 2017 4:16 pm

Gemm shared memory allocation

Post by vasilas » Thu Jan 12, 2017 4:27 pm

I have some issues understanding the following segment from the gemm_template_device_tn kernel ( gemm_template_device.c file):

__shared__ T sA[BLK_K][BLK_M+1]; // +1 only required if A is transposed
__shared__ T sB[BLK_N][BLK_K+1]; // +1 always required

Since i checked that every single kernel in the file contains the +1 offset and by removing it makes no difference (granted not out of bounds indexing) noticed a big performance penalty when running it. Compiling the MAGMA source with nvcc and -arch=sm_20 parameter had got the following output:

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 56 registers, 12800 bytes smem, 64 bytes cmem[0]

Whereas compiling the modified code (remove +1 offset) i got:

16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Used 63 registers, 12288 bytes smem, 64 bytes cmem[0]

Could you please give us some insight regarding what does +1 really do, in the code?
Thank you for your time !

Posts: 842
Joined: Fri Jan 06, 2012 2:13 pm

Re: Gemm shared memory allocation

Post by mgates3 » Thu Jan 19, 2017 1:04 am

Skewing the array for efficient column-major access.

Let's say, hypothetically, that you're accessing a float[32][32] array.
If one warp hits one row, you have no bank conflicts.
But if one warp hits one column you have complete serialization.
All threads hit the same shared memory bank, since banks are 32-way interleaved.

Now, if you skew the array by 1, your threads are reading: 0, 33, 66, ...,
which means banks: 0%32, 33%32, 66%32, which means banks: 1, 2, 3, ..., so you have no conflicts.
Makes sense?

The impact depends on the shape of your thread block and may not always be very dramatic.
May also diminish on newer generations of cards, as Nvidia tries to improve the situation (more memory ports).

[posted on behalf of Jakub]

Post Reply