I'd like to report the following bug (MAGMA 1.3.0 and all previous version) concerning the implementation
of the general dgemm for FERMI cards.
Consider the following case where matrices A and B equal
- Code: Select all
101. 104. 107. nan
102. 105. 108. nan
103. 106. 109. nan
nan nan nan nan
and C, alpha and beta are arbitrary (not containing any nan's).
Computing C = alpha*op(A)*op(B) + beta*C for the submatrix A (1:3,1:3) and B(1:3,1:3)
yields a result containing nan's.
Bug fix
The first problem is the following computation of the matrix size that is used for the texture mapping of the arrays A and B.
In the function magmablas_dgemm_fermi(...) replace
- Code: Select all
size_t sizeA = (size_t) lda * (size_t) (!TransA ? k : m);
size_t sizeB = (size_t) ldb * (size_t) (!TransB ? n : k);
with
- Code: Select all
size_t sizeA = (size_t) (lda * ((!TransA ? k : m) - 1) + (!TransA ? m : k) );
size_t sizeB = (size_t) (ldb * ((!TransB ? n : k) - 1) + (!TransB ? k : n) );
This fixes the bug for C = alpha*A*B^T + beta*C.
For C = alpha*A^T*B + beta*C the access to A and B must be protected
with a conditional load.
In fermiDgemm_v2_kernel_TN( ...) replace
- Code: Select all
int tll = tx2;
#pragma unroll
for(int y=0; y<4; y++)
Abs[ty2*4+y][tx2] = (tll<k)* fetch_x_A(trackA + y*lda);
#pragma unroll
for(int y=0; y<4; y++)
Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B( trackB + y*ldb );
__syncthreads();
double Axs[4];
double Bxp[4];
double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};
int k1;
for(k1=0; k1<(k-16); k1+=16)
{
tll +=16;
B += 16;
A += 16 ;
trackA+=16 ;
trackB+=16;
#pragma unroll
for(int y=0; y<4; y++)
xxA[y] = (tll<k)* fetch_x_A(trackA + y*lda);
#pragma unroll
for(int y=0; y<4; y++)
xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);
with
- Code: Select all
int tll = tx2;
// +++ CHANGED +++
#pragma unroll
for(int y=0; y<4; y++)
Abs[ty2*4+y][tx2] = (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;
// +++ CHANGED +++
#pragma unroll
for(int y=0; y<4; y++)
Bb[tx2][ty2*4+y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;
__syncthreads();
double Axs[4];
double Bxp[4];
double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};
int k1;
for(k1=0; k1<(k-16); k1+=16)
{
tll +=16;
B += 16;
A += 16 ;
trackA+=16 ;
trackB+=16;
// +++ CHANGED +++
#pragma unroll
for(int y=0; y<4; y++)
xxA[y] = (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;
// +++ CHANGED +++
#pragma unroll
for(int y=0; y<4; y++)
xxB[y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;
For C = alpha*A*B + beta*C we must protect the access to B.
In fermiDgemm_v2_kernel_NN( ...) replace
- Code: Select all
int tll = tx2;
#pragma unroll
for(int y=0; y<4; y++)
Abs[ty2*4+y][tx2] = (tll<k)* fetch_x_A(trackA + y*lda);
#pragma unroll
for(int y=0; y<4; y++)
Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B( trackB + y*ldb );
__syncthreads();
double Axs[4];
double Bxp[4];
double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};
int k1;
for(k1=0; k1<(k-16); k1+=16)
{
tll +=16;
B += 16;
A += 16 ;
trackA+=16 ;
trackB+=16;
#pragma unroll
for(int y=0; y<4; y++)
xxA[y] = (tll<k)* fetch_x_A(trackA + y*lda);
#pragma unroll
for(int y=0; y<4; y++)
xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);
with
- Code: Select all
int tll = tx2;
#pragma unroll
for(int y=0; y<4; y++)
Abs[tx2+ y*16][ty2] = fetch_x_A(trackA + y*16) ;
//+++ CHANGED +++
#pragma unroll
for(int y=0; y<4; y++)
Bb[tx2][ty2*4+y] = (tll<k) ? fetch_x_B( trackB + y * ldb) : 0.0;
__syncthreads();
double Axs[4];
double Bxp[4];
double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};
int k1;
for(k1=0; k1<(k-16); k1+=16)
{
tll+=16;
A += lda *16 ;
B += 16;
trackA += 16*lda ;
trackB += 16;
#pragma unroll
for( int y=0; y<4; y++)
xxA[y] = fetch_x_A(trackA + y*16);
//+++ CHANGED +++
#pragma unroll
for( int y=0; y<4; y++)
xxB[y] = (tll<k) ? fetch_x_B( trackB + y*ldb) : 0.0;
For the case C = alpha*A^T*B^T + beta*C the access to A must be protected.
This can be done in analogy to the case C=A^T*B + beta*C.
It is very likely (I did not check) that the implementation of SGEMM,CGEMM and ZGEMM contain this bug
also. For software maintenance reasons it might be better do replace this check with the pointer redirection
approach to load the last row or column if the thread block gets out of bounds.
The nice feature of the texture mapping to return zero values outside the registered array only
works for the case of trailing nan's and only in the case of C = alpha*A*B^T+beta*C.
To explore the read only cache in TESLA K20 cards one has to make sure to only
access valid positions within the submatrix.
I do not think that a lot of people ever ran into this problem. We only found it by introducing checkword areas before and after
the data parts. The 32 bit integer checkwords that we used accidentaly yielded an nan value if interpreted as 64 bit
floating point value.
Cheers Eric
