dtrsm bug in magma 1.1

Postby rundel » Fri Mar 09, 2012 8:52 pm

I've run into a strange issue with MAGMA that I can't explain and have never seen before. As best I can describe it: if I write code that uses only magma_dtrsm, I get the correct result. However, if the same code includes any reference to another MAGMA function (even one that is never called), calls to magma_dtrsm return a zero-filled matrix.

To illustrate, I've created a GitHub gist (https://gist.github.com/2009450) with minimal sample code that exhibits the issue on my system. If I comment out the line containing magma_dpotrs_gpu, I get the correct result from magma_dtrsm; if the line is left uncommented, I get a zero-filled matrix. In both cases the function containing the magma_dpotrs_gpu call is never called, so I have no idea why it should affect magma_dtrsm. Both v1 and v2 of cublasDtrsm work without issue in the exact same code. The issue is not unique to dpotrs: magma_dgetrs_gpu, magma_dgeqrs_gpu, and magma_dgesv_gpu also seem to trigger it (other functions may as well; I have not tested more widely). It also appears to affect higher-level functions that depend on dtrsm.

I am working on Ubuntu 11.10 with a GeForce GTX 460, NVIDIA drivers 295.20, MAGMA 1.1, and ATLAS 3.8.4. If I have left out any other helpful information, please let me know.
rundel
 
Posts: 5
Joined: Fri Mar 09, 2012 8:29 pm

Re: dtrsm bug in magma 1.1

Postby rundel » Fri Mar 16, 2012 3:41 pm

Has anyone been able to replicate this issue? I am still banging my head against the wall and have not made any progress on resolving it. The only glimmer of light is that I sometimes get the correct result from magmablas_dtrsm if I link to magma before magmablas, but even then it will sometimes produce a zero-filled matrix.

Obviously this inconsistency should not be happening, but I still cannot tell whether the issue is with my setup/environment or with MAGMA.

Re: dtrsm bug in magma 1.1

Postby mgates3 » Mon Mar 19, 2012 4:33 pm

I couldn't exactly replicate the issue, but it seems there is a problem. Running memcheck finds an invalid memory reference, which indicates a bug in MAGMA. We'll look into it. Otherwise, use the cublas dtrsm.
-mark

> cuda-memcheck ./dtrsm
========= CUDA-MEMCHECK

1.000000 0.000000 0.000000
-0.400000 1.000000 0.000000
-0.080000 -0.300000 1.000000

1.000000 0.000000 0.000000
-0.400000 1.000000 0.000000
-0.080000 -0.300000 1.000000
========= Invalid __global__ read of size 8
========= at 0x00000600 in diag_dtrtri_kernel_upper
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x3c00200168 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x00001618 in triple_dgemm_update_16_R
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x3c002002e8 is out of bounds
=========
========= ERROR SUMMARY: 2 errors
mgates3
 
Posts: 428
Joined: Fri Jan 06, 2012 2:13 pm

Re: dtrsm bug in magma 1.1

Postby rundel » Thu Mar 22, 2012 11:19 am

Running cuda-memcheck on my system results in the following:

>cuda-memcheck ./dtrsm
========= CUDA-MEMCHECK

0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000

1.000000 0.000000 0.000000
-0.400000 1.000000 0.000000
-0.080000 -0.300000 1.000000
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (9,0,0) in block (0,0,0)
========= Address 0x200300048 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (10,0,0) in block (0,0,0)
========= Address 0x200300050 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (11,0,0) in block (0,0,0)
========= Address 0x200300058 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (12,0,0) in block (0,0,0)
========= Address 0x200300060 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (13,0,0) in block (0,0,0)
========= Address 0x200300068 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (14,0,0) in block (0,0,0)
========= Address 0x200300070 is out of bounds
=========
========= Invalid __global__ read of size 8
========= at 0x000000d0 in diag_dtrtri_kernel_upper
========= by thread (15,0,0) in block (0,0,0)
========= Address 0x200300078 is out of bounds
=========
========= ERROR SUMMARY: 7 errors

Re: dtrsm bug in magma 1.1

Postby brom » Thu Mar 22, 2012 3:23 pm

I've found that almost every MAGMA BLAS kernel takes liberties with reading out-of-bounds GPU memory when loading into shared memory buffers.

On some systems this is OK. However, on others (like yours) it causes errors.

Typically the values read this way are never used, so it's not an obvious error, but it is an error nonetheless (as you can see).
brom
 
Posts: 18
Joined: Tue Jan 25, 2011 8:20 pm

Re: dtrsm bug in magma 1.1

Postby jgpallero » Fri Mar 23, 2012 7:25 am

brom wrote: I've found that almost every MAGMA BLAS kernel takes liberties with reading out-of-bounds GPU memory when loading into shared memory buffers.

On some systems this is OK. However, on others (like yours) it causes errors.

Typically the values read this way are never used, so it's not an obvious error, but it is an error nonetheless (as you can see).

Mmmm... Could this be an explanation for viewtopic.php?f=2&t=429?
jgpallero
 
Posts: 29
Joined: Tue Nov 15, 2011 12:38 pm

Re: dtrsm bug in magma 1.1

Postby rundel » Fri Mar 30, 2012 10:57 am

The errors I was seeing with cuda-memcheck seem to be innocuous. diag_dtrtri_kernel_upper and diag_dtrtri_kernel_lower do read from undefined device memory when the data is smaller than the block size, but the values read are zeroed when this happens. It seems the same result could be achieved with a ternary operator, so that cuda-memcheck has nothing to flag, but I don't know whether that has implications for branch divergence.
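To make the ternary idea concrete, here is a minimal host-side sketch in plain C rather than the actual kernel. The name load_tile_guarded, the TILE size of 16, and the column-major layout are my assumptions for illustration and are not taken from the MAGMA source. The loop over (tx, j) stands in for the threads of one block filling a shared memory buffer.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16  /* assumed block size; not taken from the MAGMA source */

/* Host-side emulation of one thread block filling a TILE x TILE buffer
 * (standing in for shared memory) from the n x n column-major matrix A.
 * Each (tx, j) pair plays the role of one thread's load.
 * load_tile_guarded is a hypothetical name used only for this sketch. */
void load_tile_guarded(const double *A, int n, int lda, double *tile)
{
    assert(n >= 0 && n <= TILE && lda >= n);
    for (int j = 0; j < TILE; ++j) {
        for (int tx = 0; tx < TILE; ++tx) {
            /* The ternary keeps every read inside A; the padding is set
             * to zero directly instead of being read and then zeroed. */
            tile[j * TILE + tx] =
                (tx < n && j < n) ? A[(size_t)j * lda + tx] : 0.0;
        }
    }
}
```

In a real kernel the same expression would index with threadIdx.x in place of tx. Compilers often lower a simple select like this to a predicated instruction rather than a branch, but that is something to confirm in the generated code, not something this sketch shows.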

After poking around a little more, it seems the actual issue I am observing may depend on dtrsm's calls to magmablas_dgemm. If I comment out #define cublasDgemm magmablas_dgemm, so that cuBLAS's dgemm is used instead, I get the proper result. I'll keep experimenting to see exactly where the calculation goes wrong, but since the Fermi implementation of dgemm uses textures, I cannot use cuda-gdb to debug the kernel, which makes things much more complicated. Does anyone have advice on debugging under these circumstances?

Re: dtrsm bug in magma 1.1

Postby rundel » Fri Mar 30, 2012 2:40 pm

I can confirm that the issue I am having is due to a problem with magmablas_dgemm. Specifically, if I examine the matrices going into and coming out of the dgemm call on line 1994 of dtrm_tesla.cu (cublasDgemm ('N', 'N', MM, N, MM, alpha, d_dinvA+i*NB, NB, b+i, ldb, 0.0, d_x+i, M);), the inputs are identical in both cases, but the output is correct with cublas_dgemm and a matrix filled with zeros with magmablas_dgemm.

Edit: Building with Tesla as the target GPU gives a functioning dgemm. My understanding was that a GTX 460 should be Fermi capable. Any ideas on why the Fermi build might not be working?


Results look like the following:

magmablas_dgemm:
d_dinvA+i*NB:
1.000000 -0.400000 -0.080000
0.000000 1.000000 -0.300000
0.000000 0.000000 1.000000

b+i:
1.000000 0.000000 0.000000
0.000000 1.000000 0.000000
0.000000 0.000000 1.000000

Dgemm 1:
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000


cublas_dgemm:
d_dinvA+i*NB:
1.000000 -0.400000 -0.080000
0.000000 1.000000 -0.300000
0.000000 0.000000 1.000000

b+i:
1.000000 0.000000 0.000000
0.000000 1.000000 0.000000
0.000000 0.000000 1.000000

Dgemm 1:
1.000000 -0.400000 -0.080000
0.000000 1.000000 -0.300000
0.000000 0.000000 1.000000
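
For anyone who wants to check the expected product on the host, below is a minimal sketch of a plain reference dgemm in C that a GPU result can be compared against after copying it back. The helper names ref_dgemm_nn and max_abs_diff are hypothetical, and the sketch assumes column-major storage and the 'N', 'N' (no transpose) case used in the call above.

```c
#include <assert.h>

/* Naive reference for C = alpha*A*B + beta*C with no transposes.
 * All matrices are column-major: A is m x k, B is k x n, C is m x n.
 * ref_dgemm_nn is a hypothetical helper, not part of MAGMA or cuBLAS. */
void ref_dgemm_nn(int m, int n, int k, double alpha,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double beta, double *C, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i + p * lda] * B[p + j * ldb];
            /* With beta == 0 the old contents of C are overwritten
             * without being read, following the BLAS convention. */
            C[i + j * ldc] = (beta == 0.0)
                ? alpha * sum
                : alpha * sum + beta * C[i + j * ldc];
        }
    }
}

/* Largest absolute element-wise difference between two m x n
 * column-major matrices that share the leading dimension ld. */
double max_abs_diff(int m, int n, const double *X, const double *Y, int ld)
{
    double worst = 0.0;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double d = X[i + j * ld] - Y[i + j * ld];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

With the inputs printed above, b+i is the identity, so the reference product equals d_dinvA+i*NB whichever way the printout maps to storage. A zero-filled GPU result then shows a max_abs_diff of 1.0 against the reference.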