Bug in ZINPLACE_TRANSPOSE?
Bug in ZINPLACE_TRANSPOSE?
Hi,
I have seen strange error with zgetrf_gpu (wrong answer or segfault with memory calls after the routine) on Cray XK (Fermi+). The error occurs when (1) M=N, (2) M is multiple of 32, and (3) LDA is mulitple of 32, but not equal to M. I thought this could be a bug in zinplace_tranpsose. So I changed the source to call magmablas_ztransepose2, then the problem is fixed. I think zinplace_transpose source might have some bugs, but I haven't figured it yet.
It will be great, if you can help us.
Thanks,
Keita
I have seen strange error with zgetrf_gpu (wrong answer or segfault with memory calls after the routine) on Cray XK (Fermi+). The error occurs when (1) M=N, (2) M is multiple of 32, and (3) LDA is mulitple of 32, but not equal to M. I thought this could be a bug in zinplace_tranpsose. So I changed the source to call magmablas_ztransepose2, then the problem is fixed. I think zinplace_transpose source might have some bugs, but I haven't figured it yet.
It will be great, if you can help us.
Thanks,
Keita

 Posts: 271
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Bug in ZINPLACE_TRANSPOSE?
Hi Keita,
Thank you for this bug report. We were able to reproduce it and to fix the bug. The fix will be in the next release (beginning of May). It is for a case where we do an inplace matrix transpose. There are two calls to the transpose routine (in file zgetrf_gpu.cpp). The first one is:
and must be replaced by
The second call must be replaced by
Please let us know if this worked for your tests. Thanks again.
Stan
Thank you for this bug report. We were able to reproduce it and to fix the bug. The fix will be in the next release (beginning of May). It is for a case where we do an inplace matrix transpose. There are two calls to the transpose routine (in file zgetrf_gpu.cpp). The first one is:
Code: Select all
if ((m == n) && (m % 32 == 0) && (ldda%32 == 0))
magmablas_zinplace_transpose( dAT, ldda, lddat );
and must be replaced by
Code: Select all
if ((m == n) && (m % 32 == 0) && (ldda%32 == 0)){
lddat = ldda;
magmablas_zinplace_transpose( dAT, ldda, m);
}
Code: Select all
magmablas_zinplace_transpose( dAT, lddat, m );
Stan
Re: Bug in ZINPLACE_TRANSPOSE?
Stan,
I can see the same problem with sgetrf_gpu, dgetrf_gpu and cgetrf_gpu.
Thanks,
Keita
I can see the same problem with sgetrf_gpu, dgetrf_gpu and cgetrf_gpu.
Thanks,
Keita

 Posts: 271
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Bug in ZINPLACE_TRANSPOSE?
Keita,
Yes, we fixed the other versions as well. Actually, we generate them from the double complex version. I will go through the other LU versions as well, in particular the CPU interface ones, to see if they also need this fix.
Stan
Yes, we fixed the other versions as well. Actually, we generate them from the double complex version. I will go through the other LU versions as well, in particular the CPU interface ones, to see if they also need this fix.
Stan
Re: Bug in ZINPLACE_TRANSPOSE?
I am still seeing wrong answer if A and X are allocated in contiguous memory as described below. I suspect that zinplace_tranpose is making outofbound memory access when M!=LDA. In my application, I finally get the correct answer after changing the source to apply zinplace_transpose for LDA=M only.
====Error case description==
(Definition of BIGA and A in Fortran90 notation.)
LDA = M + 32; (M is divisible by 32).
BIGA = Double Complex Array of (1:LDA,1:M)
A = BIGA(33:lda,33:lda)
In my application code:
magma_zgetrf_gpu ( m,m, &A[32*lda+32], lda, ipvt, info);
After zgetrf call, the elements in A(1:32,33:M) are polluted.
====Error case description==
(Definition of BIGA and A in Fortran90 notation.)
LDA = M + 32; (M is divisible by 32).
BIGA = Double Complex Array of (1:LDA,1:M)
A = BIGA(33:lda,33:lda)
In my application code:
magma_zgetrf_gpu ( m,m, &A[32*lda+32], lda, ipvt, info);
After zgetrf call, the elements in A(1:32,33:M) are polluted.
Re: Bug in ZINPLACE_TRANSPOSE?
It appears transpose routine looks OK. I am also looking into magmablas_zpermute_long2 to find any outofbound memory access.
Just for clarification. The error occurs when:
M=3584
LDA=3616
Input matrix starts at A(32,32).
Just for clarification. The error occurs when:
M=3584
LDA=3616
Input matrix starts at A(32,32).
keitat wrote:I am still seeing wrong answer if A and X are allocated in contiguous memory as described below. I suspect that zinplace_tranpose is making outofbound memory access when M!=LDA. In my application, I finally get the correct answer after changing the source to apply zinplace_transpose for LDA=M only.
====Error case description==
(Definition of BIGA and A in Fortran90 notation.)
LDA = M + 32; (M is divisible by 32).
BIGA = Double Complex Array of (1:LDA,1:M)
A = BIGA(33:lda,33:lda)
In my application code:
magma_zgetrf_gpu ( m,m, &A[32*lda+32], lda, ipvt, info);
After zgetrf call, the elements in A(1:32,33:M) are polluted.

 Posts: 271
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Bug in ZINPLACE_TRANSPOSE?
Keita,
I managed to reproduce this bug as well. Thanks for reporting it. It was in magmablas_zpermute_long2. Now it is fixed in the SVN and we will make it available soon. Meanwhile, the fix is to call zpermute as
The added argument changes the implementation of zpemute in magmablas/zpermutev2.cu (the other precisions are similar) as follows:
The bug was that in this particular case the code was permuting user data outside of the submatrix that is being factorized. We didn't have this issue in the CPU interface code and for this case we were only checking correctness for the factorization.
Stan
I managed to reproduce this bug as well. Thanks for reporting it. It was in magmablas_zpermute_long2. Now it is fixed in the SVN and we will make it available soon. Meanwhile, the fix is to call zpermute as
Code: Select all
magmablas_zpermute_long2( n, dAT, lddat, ipiv, nb, i*nb );
Code: Select all
extern "C" void
magmablas_zpermute_long2( int n, cuDoubleComplex *dAT, int lda, int *ipiv, int nb, int ind )
{
int k;
for( k = 0; k < nbBLOCK_SIZE; k += BLOCK_SIZE )
{
//zlaswp_params_t params = { dAT, lda, lda, ind + k };
zlaswp_params_t2 params = { dAT, n, lda, ind + k, BLOCK_SIZE };
...
}
Stan