testing_zgesvd segmentation fault in V1.0 RC3 with mkl10.3

Postby addee » Wed Jan 26, 2011 9:29 pm

hi,

I compiled MAGMA V1.0 RC3 with make.inc.shared (with the MKL path fixed accordingly). Other test drivers such as testing_sgeqrf work fine. However, testing_zgesvd crashes with a segmentation fault:

$>./testing_zgesvd
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
Usage:
testing_zgesvd -M 1024 -N 1024

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================
Segmentation fault


The backtrace from gdb is:
#0 0x00007ffff4ad8d2b in mkl_lapack_zlacpy ()
from /opt/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_core.so
#1 0x00007ffff5e8caf8 in zlacpy_ ()
from /opt/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_intel_lp64.so
#2 0x00000000004011a0 in main ()


testing_zgesvd is commented out in testing/Makefile by default. Is it a working version?
Do you have any insight into this? Thanks!


PS: My environment is:
Ubuntu 10.10 with MKL 10.3, GCC 4.4.5
Dual-socket Intel i7 Xeon
Two GTX 470

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby Stan Tomov » Thu Jan 27, 2011 9:47 pm

Hi,
I was wondering if the problem is due to a memory limitation (in which case we have forgotten to check the result of a GPU memory allocation somewhere). Can you check whether it works for a fixed, smaller problem size, e.g.,
./testing_zgesvd -M 1024 -N 1024
Thanks,
Stan

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby brom » Fri Jan 28, 2011 12:43 pm

This segfaults for me too using ATLAS (I still can't compile with MKL).

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby addee » Sun Jan 30, 2011 4:39 pm

Stan Tomov wrote:Hi,
I was wondering if the problem is due to a memory limitation (in which case we have forgotten to check the result of a GPU memory allocation somewhere). Can you check whether it works for a fixed, smaller problem size, e.g.,
./testing_zgesvd -M 1024 -N 1024
Thanks,
Stan


Thanks, Stan. Yes, it seems to be a memory limitation, since a smaller matrix (for example 1024*1024) works.
Is there a rule of thumb for how large a matrix MAGMA's zgesvd can handle? Is the entire matrix shipped to device memory? And how much extra workspace is needed on the device?
For example, the 8064*8064 double complex matrix in testing_zgesvd is 992 MB, which fails on my GTX 470 with 1280 MB of GPU memory.
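
The 992 MB figure can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it counts the input matrix alone and assumes the whole matrix sits in device memory at once, which the thread does not confirm. Workspace requirements are not modelled because they are not stated here. The helper name `matrix_mib` is made up for illustration.

```python
# Bytes per element for the four standard LAPACK precisions:
# s = single real, d = double real, c = single complex, z = double complex.
BYTES_PER_ELEMENT = {"s": 4, "d": 8, "c": 8, "z": 16}


def matrix_mib(rows, cols, precision="z"):
    """Size of a dense rows x cols matrix in MiB (2**20 bytes).

    Hypothetical helper: it counts only the matrix itself, not any
    workspace the SVD routine may allocate in addition.
    """
    return rows * cols * BYTES_PER_ELEMENT[precision] / 2**20


size_mib = matrix_mib(8064, 8064, "z")
gpu_mib = 1279.7  # memory per GTX 470 as reported by the test driver

print(f"8064 x 8064 double complex: {size_mib:.2f} MiB")  # 992.25 MiB
print(f"remaining for everything else: {gpu_mib - size_mib:.2f} MiB")
print(f"1024 x 1024 double complex: {matrix_mib(1024, 1024, 'z'):.2f} MiB")
```

So the matrix alone takes roughly three quarters of the card, leaving under 300 MiB for any workspace and other allocations, while the 1024*1024 case needs only 16 MiB.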


Another question: is sgesvd a working version?
I also tried using sgesvd in testing_zgesvd, with the variables changed to float and "z" replaced by "s" in the relevant LAPACK function names. sgesvd produces different errors for different matrix sizes. I inserted printf checkpoints after every major function call.

10*10: segfaults when releasing the memory. Also, the error of 1.0 is too large.
$>./testing_sgesvd -M 10 -N 10
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
testing_sgesvd -M 10 -N 10

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.

check point: passed first magma_sgesvd.

check point: passed h_R=h_A.

check point: passed second magma_sgesvd.

check point: passed lapackf77_sgesvd.
10 0.00 0.00 1.000000e+00
Segmentation fault



100*100: segfaults when calling magma_sgesvd
$>./testing_sgesvd -M 100 -N 100
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

testing_sgesvd -M 100 -N 100

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.
Segmentation fault




1000*1000: the first call to magma_sgesvd looks good, but a "can not bind to texture" error is printed repeatedly during the second call to magma_sgesvd


$>./testing_sgesvd -M 1000 -N 1000
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

testing_sgesvd -M 1000 -N 1000

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.

check point: passed first magma_sgesvd.

check point: passed h_R=h_A.
can not bind to texture
can not bind to texture
..........(thousands of lines of "can not bind to texture")
can not bind to texture
can not bind to texture

check point: passed second magma_sgesvd.

check point: passed lapackf77_sgesvd.
1000 4.36 6.70 nan
Segmentation fault


Thank you.

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby fletchjp » Mon Jan 31, 2011 6:42 am

addee

The single precision equivalent of zgesvd would be cgesvd. Have you tried that?

I am interested in the 'can not bind to texture' messages. I tried researching it on Google but only found references to my own messages on this list!

If anyone knows more about it please post something.

Best wishes

John

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby addee » Mon Jan 31, 2011 4:33 pm

fletchjp wrote:The single precision equivalent of zgesvd would be cgesvd. Have you tried that?

Hi John, I didn't try cgesvd because I need to handle matrices with real entries, so I used sgesvd.
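
For reference, the first letter of a LAPACK routine name encodes both the field (real or complex) and the precision, which is why cgesvd and sgesvd are both "single precision" but are not interchangeable. A minimal sketch of that naming convention follows; the `variant` helper is hypothetical and only rewrites the name, it does not call any library.

```python
# Standard LAPACK naming: the leading letter gives (field, precision).
LAPACK_PREFIX = {
    "s": ("real", "single"),
    "d": ("real", "double"),
    "c": ("complex", "single"),
    "z": ("complex", "double"),
}


def variant(routine, field, precision):
    """Return the name of `routine` for another field/precision.

    Hypothetical helper for illustration: it swaps the leading type
    letter and leaves the rest of the routine name unchanged.
    """
    if routine[0] not in LAPACK_PREFIX:
        raise ValueError(f"{routine!r} does not start with s, d, c or z")
    for prefix, key in LAPACK_PREFIX.items():
        if key == (field, precision):
            return prefix + routine[1:]
    raise ValueError(f"no LAPACK prefix for {field} {precision}")


print(variant("zgesvd", "complex", "single"))  # cgesvd
print(variant("zgesvd", "real", "single"))     # sgesvd
print(variant("zgesvd", "real", "double"))     # dgesvd
```

So zgesvd to cgesvd changes only the precision, while zgesvd to sgesvd changes both the precision and the field.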

The "can not bind to texture" message does not come from the MAGMA code: I grepped for the string in the MAGMA src folder and did not find it. I think the error is more likely reported by cuBLAS.

Has anyone used magma_sgesvd successfully? The usage comment in sgesvd.cpp is the same as in zgesvd and says that the input matrix A is a COMPLEX*16 array. I wonder if it's auto-generated code. (No offense ;) )

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby brom » Mon Jan 31, 2011 5:10 pm

"can not bind to texture" is a CUDA error that happens when, well, a texture can't be bound, usually due to hardware limitations.

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby fletchjp » Mon Jan 31, 2011 6:25 pm

brom wrote:"can not bind to texture" is a CUDA error that happens when, well, a texture can't be bound, usually due to hardware limitations.


Is there any CUDA or other NVIDIA documentation on this error message and the context which might cause it? I have cases which sometimes give it and sometimes not, and I suspect that memory on the CPU or GPU is getting into an inconsistent state.

Does anyone know of any NVIDIA tools to help with this sort of problem? I have been using cuda-memcheck but it does not seem to be finding the problems.

I am working on Ubuntu Linux 10.04 (64 bit).

Thanks

John

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Postby Stan Tomov » Mon Jan 31, 2011 6:40 pm

Yes, the code is generated for the different precisions starting from double complex. We are still fixing this routine in real arithmetic.
Stan
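
To illustrate how a generated single-precision file can still describe A as COMPLEX*16, here is a toy sketch of substitution-based precision generation. It is not MAGMA's actual generator: the substitution table, the `generate_single_real` helper, and the one-argument prototype in the template are all simplified placeholders made up for this example.

```python
import re

# Hypothetical substitution table: double-complex token -> single-real token.
# Note that it has no entry for the Fortran-style type name used in comments.
Z_TO_S = [
    ("magma_zgesvd", "magma_sgesvd"),
    ("lapackf77_zlacpy", "lapackf77_slacpy"),
    ("cuDoubleComplex", "float"),
]


def generate_single_real(source):
    """Rewrite a double-complex source string by whole-word substitution."""
    for old, new in Z_TO_S:
        source = re.sub(r"\b" + re.escape(old) + r"\b", new, source)
    return source


# Placeholder template; the prototype is deliberately simplified and is
# not the real magma_zgesvd signature.
template = """\
/* A  (input) COMPLEX*16 array, dimension (LDA,N) */
int magma_zgesvd(cuDoubleComplex *A);
"""

generated = generate_single_real(template)
print(generated)
```

In this sketch the function name and argument type are converted, but the comment still says COMPLEX*16 because nothing in the table matches it, which is consistent with what was observed in sgesvd.cpp.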