Page 1 of 1

### NaN bug in MAGMAblas dgemm

Posted: Thu Feb 28, 2013 11:41 am
Hello,

I'd like to report the following bug (MAGMA 1.3.0 and all previous version) concerning the implementation
of the general dgemm for FERMI cards.

Consider the following case where matrices A and B equal
Code: Select all
` 101. 104. 107. nan 102. 105. 108. nan 103. 106. 109. nan nan   nan  nan  nan`

and C, alpha and beta are arbitrary (not containing any nan's).
Computing C = alpha*op(A)*op(B) + beta*C for the submatrix A (1:3,1:3) and B(1:3,1:3)
yields a result containing nan's.

Bug fix
The first problem is the following computation of the matrix size that is used for the texture mapping of the arrays A and B.

In the function magmablas_dgemm_fermi(...) replace
Code: Select all
`    size_t sizeA = (size_t) lda * (size_t) (!TransA ? k : m);    size_t sizeB = (size_t) ldb * (size_t) (!TransB ? n : k);`

with
Code: Select all
`    size_t sizeA = (size_t) (lda * ((!TransA ? k : m) - 1) + (!TransA ? m : k) );    size_t sizeB = (size_t) (ldb * ((!TransB ? n : k) - 1) + (!TransB ? k : n) );`

This fixes the bug for C = alpha*A*B^T + beta*C.

For C = alpha*A^T*B + beta*C the access to A and B must be protected
In fermiDgemm_v2_kernel_TN( ...) replace
Code: Select all
`    int tll = tx2;    #pragma unroll    for(int y=0; y<4; y++)        Abs[ty2*4+y][tx2] =  (tll<k)*  fetch_x_A(trackA + y*lda);    #pragma unroll    for(int y=0; y<4; y++)        Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B( trackB + y*ldb );    __syncthreads();    double Axs[4];    double Bxp[4];    double Cb[16] = {0,0,0,0,    0,0,0,0, 0,0,0,0, 0,0,0,0};    int k1;    for(k1=0; k1<(k-16); k1+=16)    {        tll +=16;        B += 16;        A += 16  ;        trackA+=16 ;        trackB+=16;        #pragma unroll        for(int y=0; y<4; y++)            xxA[y] =  (tll<k)*  fetch_x_A(trackA + y*lda);        #pragma unroll        for(int y=0; y<4; y++)            xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb); `

with
Code: Select all
`       int tll = tx2;    // +++ CHANGED +++    #pragma unroll    for(int y=0; y<4; y++)        Abs[ty2*4+y][tx2] =  (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;    // +++ CHANGED +++    #pragma unroll    for(int y=0; y<4; y++)        Bb[tx2][ty2*4+y]  =  (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;    __syncthreads();    double Axs[4];    double Bxp[4];    double Cb[16] = {0,0,0,0,    0,0,0,0, 0,0,0,0, 0,0,0,0};    int k1;    for(k1=0; k1<(k-16); k1+=16)    {        tll +=16;        B += 16;        A += 16  ;        trackA+=16 ;        trackB+=16;        // +++ CHANGED +++        #pragma unroll        for(int y=0; y<4; y++)            xxA[y] = (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;        // +++ CHANGED +++        #pragma unroll        for(int y=0; y<4; y++)            xxB[y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0; `

For C = alpha*A*B + beta*C we must protect the access to B.
In fermiDgemm_v2_kernel_NN( ...) replace
Code: Select all
`    int tll = tx2;    #pragma unroll    for(int y=0; y<4; y++)        Abs[ty2*4+y][tx2] =  (tll<k)*  fetch_x_A(trackA + y*lda);    #pragma unroll    for(int y=0; y<4; y++)        Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B( trackB + y*ldb );    __syncthreads();    double Axs[4];    double Bxp[4];    double Cb[16] = {0,0,0,0,    0,0,0,0, 0,0,0,0, 0,0,0,0};    int k1;    for(k1=0; k1<(k-16); k1+=16)    {        tll +=16;        B += 16;        A += 16  ;        trackA+=16 ;        trackB+=16;        #pragma unroll        for(int y=0; y<4; y++)            xxA[y] =  (tll<k)*  fetch_x_A(trackA + y*lda);        #pragma unroll        for(int y=0; y<4; y++)            xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);`

with
Code: Select all
`    int tll = tx2;    #pragma unroll    for(int y=0; y<4; y++)        Abs[tx2+ y*16][ty2] = fetch_x_A(trackA + y*16) ;    //+++ CHANGED  +++    #pragma unroll    for(int y=0; y<4; y++)        Bb[tx2][ty2*4+y] = (tll<k) ? fetch_x_B( trackB + y * ldb) : 0.0;     __syncthreads();    double Axs[4];    double Bxp[4];    double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};    int k1;    for(k1=0; k1<(k-16); k1+=16)    {        tll+=16;        A += lda *16  ;        B += 16;        trackA += 16*lda ;        trackB += 16;        #pragma unroll        for( int y=0; y<4; y++)            xxA[y] =  fetch_x_A(trackA + y*16);        //+++ CHANGED  +++        #pragma unroll        for( int y=0; y<4; y++)            xxB[y] = (tll<k) ? fetch_x_B( trackB + y*ldb) : 0.0;`

For the case C = alpha*A^T*B^T + beta*C the access to A must be protected.
This can be done in analogy to the case C=A^T*B + beta*C.

It is very likely (I did not check) that the implementation of SGEMM,CGEMM and ZGEMM contain this bug
also. For software maintenance reasons it might be better do replace this check with the pointer redirection
approach to load the last row or column if the thread block gets out of bounds.
The nice feature of the texture mapping to return zero values outside the registered array only
works for the case of trailing nan's and only in the case of C = alpha*A*B^T+beta*C.

To explore the read only cache in TESLA K20 cards one has to make sure to only
access valid positions within the submatrix.

I do not think that a lot of people ever ran into this problem. We only found it by introducing checkword areas before and after
the data parts. The 32 bit integer checkwords that we used accidentaly yielded an nan value if interpreted as 64 bit
floating point value.

Cheers Eric

### Re: NaN bug in MAGMAblas dgemm

Posted: Thu Feb 28, 2013 4:08 pm
Hi Eric,
Thanks for the report. Indeed we had a wrong assumption that certain out of bound values will be multiplied by 0 and therefore would be irrelevant for the result - which resulted in this bug when out of bound values are nan, and then nans propagated in the result. The other precisions have the same issue. We will fix it for the next release.
Stan

### Re: NaN bug in MAGMAblas dgemm

Posted: Tue Apr 02, 2013 1:39 pm
I wonder if this report could be the reason for the other bug which I have just reported for 1.3.0 with DSGESV - that a NaN is picked up from unset memory not actually used in the calculation. That would explain why the memory error happens only sometimes.

John