## NaN bug in MAGMAblas dgemm

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
eric_heinemeyer
Posts: 1
Joined: Thu Feb 28, 2013 10:51 am

### NaN bug in MAGMAblas dgemm

Hello,

I'd like to report the following bug (present in MAGMA 1.3.0 and all previous versions) in the implementation
of the general dgemm for Fermi cards.

Consider the following case where matrices A and B equal


``````
101.  104.  107.  nan
102.  105.  108.  nan
103.  106.  109.  nan
nan   nan   nan   nan
``````
and C, alpha and beta are arbitrary (not containing any NaNs).
Computing C = alpha*op(A)*op(B) + beta*C for the submatrices A(1:3,1:3) and B(1:3,1:3)
yields a result containing NaNs.

Bug fix
The first problem is the computation of the matrix sizes used for the texture mapping of the arrays A and B.

In the function magmablas_dgemm_fermi(...), replace


``````
size_t sizeA = (size_t) lda * (size_t) (!TransA ? k : m);
size_t sizeB = (size_t) ldb * (size_t) (!TransB ? n : k);
``````
with


``````
size_t sizeA = (size_t) (lda * ((!TransA ? k : m) - 1) + (!TransA ? m : k));
size_t sizeB = (size_t) (ldb * ((!TransB ? n : k) - 1) + (!TransB ? k : n));
``````
This fixes the bug for C = alpha*A*B^T + beta*C.

For C = alpha*A^T*B + beta*C, the access to both A and B must be protected.
In fermiDgemm_v2_kernel_TN(...), replace


``````
    int tll = tx2;

#pragma unroll
    for(int y=0; y<4; y++)
        Abs[ty2*4+y][tx2] = (tll<k)* fetch_x_A(trackA + y*lda);

#pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);

    double Axs;
    double Bxp;

    double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll    += 16;
        B      += 16;
        A      += 16;
        trackA += 16;
        trackB += 16;

#pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] = (tll<k)* fetch_x_A(trackA + y*lda);

#pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);
``````
with


``````
    int tll = tx2;

    // +++ CHANGED +++
#pragma unroll
    for(int y=0; y<4; y++)
        Abs[ty2*4+y][tx2] = (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;

    // +++ CHANGED +++
#pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;

    double Axs;
    double Bxp;

    double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll    += 16;
        B      += 16;
        A      += 16;
        trackA += 16;
        trackB += 16;

        // +++ CHANGED +++
#pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] = (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;

        // +++ CHANGED +++
#pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;
``````
For C = alpha*A*B + beta*C, we must protect the access to B.
In fermiDgemm_v2_kernel_NN(...), replace


``````
    int tll = tx2;

#pragma unroll
    for(int y=0; y<4; y++)
        Abs[ty2*4+y][tx2] = (tll<k)* fetch_x_A(trackA + y*lda);

#pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);

    double Axs;
    double Bxp;

    double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll    += 16;
        B      += 16;
        A      += 16;
        trackA += 16;
        trackB += 16;

#pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] = (tll<k)* fetch_x_A(trackA + y*lda);

#pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);
``````
with


``````
    int tll = tx2;

#pragma unroll
    for(int y=0; y<4; y++)
        Abs[tx2 + y*16][ty2] = fetch_x_A(trackA + y*16);

    // +++ CHANGED +++
#pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;

    double Axs;
    double Bxp;

    double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll    += 16;
        A      += lda*16;
        B      += 16;
        trackA += 16*lda;
        trackB += 16;

#pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] = fetch_x_A(trackA + y*16);

        // +++ CHANGED +++
#pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;
``````
For the case C = alpha*A^T*B^T + beta*C, the access to A must be protected.
This can be done in analogy to the case C = alpha*A^T*B + beta*C.

It is very likely (I did not check) that the implementations of SGEMM, CGEMM and ZGEMM contain this bug
as well. For software maintenance reasons it might be better to replace this check with the pointer redirection
approach, i.e. loading the last row or column whenever the thread block runs out of bounds.
The nice feature of the texture mapping, returning zero for reads outside the registered array, only
helps with trailing NaNs, and only in the case C = alpha*A*B^T + beta*C.

To exploit the read-only cache on Tesla K20 cards, one has to make sure to
access only valid positions within the submatrix.

I do not think that many people ever ran into this problem. We only found it by introducing checkword areas before and after
the data parts: the 32-bit integer checkwords that we used accidentally yield a NaN value when interpreted as a 64-bit
floating-point value.

Cheers Eric

Stan Tomov
Posts: 279
Joined: Fri Aug 21, 2009 10:39 pm

### Re: NaN bug in MAGMAblas dgemm

Hi Eric,
Thanks for the report. Indeed, we made the wrong assumption that certain out-of-bounds values would be multiplied by 0 and would therefore be irrelevant to the result. This caused the bug: when the out-of-bounds values are NaN, the NaNs propagate into the result. The other precisions have the same issue. We will fix it for the next release.
Stan

fletchjp
Posts: 203
Joined: Mon Dec 27, 2010 7:29 pm

### Re: NaN bug in MAGMAblas dgemm

I wonder if this report could be the reason for the other bug I have just reported for 1.3.0 with DSGESV: a NaN is picked up from unset memory that is not actually used in the calculation. That would explain why the memory error happens only sometimes.

John