lda is the leading dimension of the matrix. An m x n matrix may be a submatrix of a larger lda x n matrix in memory. For example, using Matlab notation,
A = [
11, 12, 13
21, 22, 23
31, 32, 33
41, 42, 43
]
has lda=4.
A2 = A( 3:4, 2:3 )
A2 = [
32, 33
42, 43
]
In this case, A2 is a 2x2 sub-matrix of A, so its leading dimension (lda) is still 4. (A2 is literally a sub-matrix of A, viewing the same memory, not a copy of a sub-matrix of A.)
We prefix matrices with "d" to mean on the device, so dA is the matrix A on the GPU device. The leading dimension for dA is then ldda, to distinguish it from lda for A. dAT is dA transposed, to be in row-major order instead of column-major order.
((m + 31) / 32) * 32
is a cryptic way of rounding m up to the next multiple of 32. Remember that the floor is taken in integer division, so read this as
floor( (m + 31) / 32 ) * 32
The +31 is to force rounding up, so in effect it is
ceil( m / 32 ) * 32
The ldda is rounded up because the GPU reads data most efficiently when each column starts on a 32-word boundary.
nb is the block size. That is, we process nb columns of A at a time.