Page 1 of 1

NaN bug in MAGMAblas dgemm

PostPosted: Thu Feb 28, 2013 11:41 am
by eric_heinemeyer
Hello,

I'd like to report the following bug (MAGMA 1.3.0 and all previous version) concerning the implementation
of the general dgemm for FERMI cards.

Consider the following case where matrices A and B equal
Code: Select all
 101. 104. 107. nan
 102. 105. 108. nan
 103. 106. 109. nan
 nan   nan  nan  nan

and C, alpha and beta are arbitrary (not containing any nan's).
Computing C = alpha*op(A)*op(B) + beta*C for the submatrix A (1:3,1:3) and B(1:3,1:3)
yields a result containing nan's.

Bug fix
The first problem is the following computation of the matrix size that is used for the texture mapping of the arrays A and B.

In the function magmablas_dgemm_fermi(...) replace
Code: Select all
    size_t sizeA = (size_t) lda * (size_t) (!TransA ? k : m);
    size_t sizeB = (size_t) ldb * (size_t) (!TransB ? n : k);

with
Code: Select all
    size_t sizeA = (size_t) (lda * ((!TransA ? k : m) - 1) + (!TransA ? m : k) );
    size_t sizeB = (size_t) (ldb * ((!TransB ? n : k) - 1) + (!TransB ? k : n) );


This fixes the bug for C = alpha*A*B^T + beta*C.

For C = alpha*A^T*B + beta*C the access to A and B must be protected
with a conditional load.
In fermiDgemm_v2_kernel_TN( ...) replace
Code: Select all
    int tll = tx2;

    #pragma unroll
    for(int y=0; y<4; y++)
        Abs[ty2*4+y][tx2] =  (tll<k)*  fetch_x_A(trackA + y*lda);

    #pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B( trackB + y*ldb );

    __syncthreads();

    double Axs[4];
    double Bxp[4];

    double Cb[16] = {0,0,0,0,    0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll +=16;
        B += 16;
        A += 16  ;
        trackA+=16 ;
        trackB+=16;

        #pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] =  (tll<k)*  fetch_x_A(trackA + y*lda);

        #pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);
 

with
Code: Select all
 
     int tll = tx2;

    // +++ CHANGED +++
    #pragma unroll
    for(int y=0; y<4; y++)
        Abs[ty2*4+y][tx2] =  (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;

    // +++ CHANGED +++
    #pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y]  =  (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;

    __syncthreads();

    double Axs[4];
    double Bxp[4];

    double Cb[16] = {0,0,0,0,    0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll +=16;
        B += 16;
        A += 16  ;
        trackA+=16 ;
        trackB+=16;

        // +++ CHANGED +++
        #pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] = (tll<k) ? fetch_x_A(trackA + y*lda) : 0.0;

        // +++ CHANGED +++
        #pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = (tll<k) ? fetch_x_B(trackB + y*ldb) : 0.0;


For C = alpha*A*B + beta*C we must protect the access to B.
In fermiDgemm_v2_kernel_NN( ...) replace
Code: Select all
    int tll = tx2;
    #pragma unroll
    for(int y=0; y<4; y++)
        Abs[ty2*4+y][tx2] =  (tll<k)*  fetch_x_A(trackA + y*lda);

    #pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = /* (tll<k)* */ fetch_x_B( trackB + y*ldb );

    __syncthreads();

    double Axs[4];
    double Bxp[4];

    double Cb[16] = {0,0,0,0,    0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll +=16;
        B += 16;
        A += 16  ;
        trackA+=16 ;
        trackB+=16;

        #pragma unroll
        for(int y=0; y<4; y++)
            xxA[y] =  (tll<k)*  fetch_x_A(trackA + y*lda);

        #pragma unroll
        for(int y=0; y<4; y++)
            xxB[y] = /* (tll<k)* */ fetch_x_B(trackB + y*ldb);

with
Code: Select all
    int tll = tx2;
    #pragma unroll
    for(int y=0; y<4; y++)
        Abs[tx2+ y*16][ty2] = fetch_x_A(trackA + y*16) ;

    //+++ CHANGED  +++
    #pragma unroll
    for(int y=0; y<4; y++)
        Bb[tx2][ty2*4+y] = (tll<k) ? fetch_x_B( trackB + y * ldb) : 0.0;

    __syncthreads();

    double Axs[4];
    double Bxp[4];

    double Cb[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};

    int k1;
    for(k1=0; k1<(k-16); k1+=16)
    {
        tll+=16;
        A += lda *16  ;
        B += 16;
        trackA += 16*lda ;
        trackB += 16;

        #pragma unroll
        for( int y=0; y<4; y++)
            xxA[y] =  fetch_x_A(trackA + y*16);

        //+++ CHANGED  +++
        #pragma unroll
        for( int y=0; y<4; y++)
            xxB[y] = (tll<k) ? fetch_x_B( trackB + y*ldb) : 0.0;


For the case C = alpha*A^T*B^T + beta*C the access to A must be protected.
This can be done in analogy to the case C=A^T*B + beta*C.

It is very likely (I did not check) that the implementation of SGEMM,CGEMM and ZGEMM contain this bug
also. For software maintenance reasons it might be better do replace this check with the pointer redirection
approach to load the last row or column if the thread block gets out of bounds.
The nice feature of the texture mapping to return zero values outside the registered array only
works for the case of trailing nan's and only in the case of C = alpha*A*B^T+beta*C.

To explore the read only cache in TESLA K20 cards one has to make sure to only
access valid positions within the submatrix.

I do not think that a lot of people ever ran into this problem. We only found it by introducing checkword areas before and after
the data parts. The 32 bit integer checkwords that we used accidentaly yielded an nan value if interpreted as 64 bit
floating point value.

Cheers Eric

Re: NaN bug in MAGMAblas dgemm

PostPosted: Thu Feb 28, 2013 4:08 pm
by Stan Tomov
Hi Eric,
Thanks for the report. Indeed we had a wrong assumption that certain out of bound values will be multiplied by 0 and therefore would be irrelevant for the result - which resulted in this bug when out of bound values are nan, and then nans propagated in the result. The other precisions have the same issue. We will fix it for the next release.
Stan

Re: NaN bug in MAGMAblas dgemm

PostPosted: Tue Apr 02, 2013 1:39 pm
by fletchjp
I wonder if this report could be the reason for the other bug which I have just reported for 1.3.0 with DSGESV - that a NaN is picked up from unset memory not actually used in the calculation. That would explain why the memory error happens only sometimes.

John