magma_sgeqrf_gpu seg fault

Open discussion for MAGMA


Postby justin » Mon Oct 11, 2010 4:00 pm

I'm having a problem with 'magma_sgeqrf_gpu' giving a segmentation fault for sizes > 2048. The problem is most easily reproduced with the included testing/testing_sgeqrf_gpu program by running:

>>./testing_sgeqrf_gpu -N 2048

I am using a Tesla C1060 with the CUDA 3.1 driver and toolkit. Has anyone else experienced this, or does anyone have any ideas?

Thanks,
- Justin Voo

Re: magma_sgeqrf_gpu seg fault

Postby lferraro » Mon Oct 25, 2010 9:04 am

All of the following tests segfault on our system, even with small sizes:
* testing_cgeqrf, testing_cgeqrf_gpu
* testing_cpotrf, testing_cpotrf_gpu
* testing_dgelqf, testing_dgeqlf
* testing_dgeqrf, testing_dgeqrf_gpu
* testing_dgeqrs_gpu
* testing_dpotrf, testing_dpotrf_gpu
* testing_dsgeqrsv_gpu
* testing_dsposv
* testing_sgelqf, testing_sgeqlf
* testing_sgeqrf, testing_sgeqrf_gpu
* testing_sgeqrs_gpu
* testing_spotrf, testing_spotrf_gpu
* testing_zgeqrf, testing_zgeqrf_gpu

Running under gdb, the segfault always happens in cudaMemcpy2DAsync().
Here is a sample backtrace from testing_cgeqrf:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaae2631e8 in pthread_attr_setdetachstate ()
from /usr/lib64/libcuda.so.1
(gdb) where
#0 0x00002aaaae2631e8 in pthread_attr_setdetachstate ()
from /usr/lib64/libcuda.so.1
#1 0x00002aaaae2880c2 in pthread_attr_setdetachstate ()
from /usr/lib64/libcuda.so.1
#2 0x00002aaaae250be2 in pthread_attr_setdetachstate ()
from /usr/lib64/libcuda.so.1
#3 0x00002aaaae23ac57 in pthread_attr_setdetachstate ()
from /usr/lib64/libcuda.so.1
#4 0x00002aaaae2df20b in pthread_attr_setdetachstate ()
from /usr/lib64/libcuda.so.1
#5 0x00002aaaadf90771 in ?? ()
from /caspur/local/apps/cuda/current/lib64/libcudart.so.3
#6 0x00002aaaadf802e6 in cudaMemcpy2DAsync ()
from /caspur/local/apps/cuda/current/lib64/libcudart.so.3
#7 0x0000000000407539 in magma_cgeqrf ()
#8 0x0000000000406b49 in main (argc=10183344, argv=0x7fff026fbd90)
at testing_cgeqrf.cpp:109
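One possible cause worth ruling out (an assumption, not a diagnosis of this report; a driver or toolkit problem is equally plausible) is that the pitch/width/height arguments passed to cudaMemcpy2DAsync describe a region that does not lie entirely inside the host buffer. The C sketch below does that bounds arithmetic on the host without calling CUDA. The helpers copy2d_in_bounds and panel_in_bounds are hypothetical and not part of MAGMA or CUDA, and the mapping of a column-major panel onto a 2D copy (width = m*elem, height = n, pitch = lda*elem) is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of MAGMA or CUDA.
 * Returns 1 if a 2D copy that reads `height` rows of `width` bytes, with
 * consecutive rows `pitch` bytes apart (the source layout read by
 * cudaMemcpy2DAsync), stays inside a buffer of `buf_bytes` bytes. */
static int copy2d_in_bounds(size_t buf_bytes, size_t pitch,
                            size_t width, size_t height)
{
    if (width > pitch)              /* cudaMemcpy2D* requires width <= pitch */
        return 0;
    if (width == 0 || height == 0)  /* nothing is read */
        return 1;
    /* pitch >= width > 0 here, so the division is safe; guard overflow. */
    if (height - 1 > (SIZE_MAX - width) / pitch)
        return 0;
    /* One past the last byte read: (height - 1) * pitch + width. */
    return (height - 1) * pitch + width <= buf_bytes;
}

/* Hypothetical helper: checks an m-by-n panel starting at A(i, j) of a
 * column-major host matrix with leading dimension lda (in elements) and
 * ncols allocated columns. Assumed mapping: each matrix column is one
 * "row" of the 2D copy, so width = m*elem, height = n, pitch = lda*elem. */
static int panel_in_bounds(size_t lda, size_t ncols, size_t elem,
                           size_t i, size_t j, size_t m, size_t n)
{
    assert(elem > 0);
    if (i + m > lda || j + n > ncols)       /* panel must fit in the matrix */
        return 0;
    size_t buf_bytes = lda * ncols * elem;
    size_t offset = (i + j * lda) * elem;   /* byte offset of A(i, j) */
    return copy2d_in_bounds(buf_bytes - offset, lda * elem, m * elem, n);
}
```

For example, panel_in_bounds(2048, 2048, sizeof(float), 1024, 1024, 1024, 1024) is in bounds, while passing the pitch in elements instead of bytes (copy2d_in_bounds(buf, 2048, 2048 * 4, 2048)) fails the width <= pitch requirement. If every copy checks out, the crash is more likely inside the driver or runtime than in the arguments.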

I am using the Intel compiler (intel/11.1.064) with CUDA 3.1 on Linux 2.6.18 x86_64, Intel Xeon X5650:
device 0: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory
device 1: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory
$> cat /proc/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 256.40 Wed Jul 7 12:44:03 PDT 2010
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)

Can you help us? Am I missing something?

