Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
2 posts • Page 1 of 1
Calling cudaMalloc properly aligns the beginning of the floating point data allocated for fully coalescent accesses (at the beginning of the data). The starting address for the cards before Fermi had to be aligned at 16*sizeof(type). In order for this to hold for columns after the first, one has to ensure that the lda is divisible by 16. We make this requirement a little stronger (divisibility by 32) in anticipation of hardware changes that may require it.