64-bit integer in MAGMA

Open discussion for MAGMA

Re: 64-bit integer in MAGMA

Postby evanlezar » Fri Sep 07, 2012 10:03 am

Continuing the discussion here, although there are now some other issues:

I am currently running a large random single-precision real matrix problem (M=71710, N=71710, LDA=71712), and the computation proceeds until the program segfaults in a call to cudaStreamCreate():

#0  0x00007fffea9382c9 in ?? () from /usr/lib64/libcuda.so
#1  0x00007fffea938575 in ?? () from /usr/lib64/libcuda.so
#2  0x00007fffea921a13 in ?? () from /usr/lib64/libcuda.so
#3  0x00007fffea903336 in ?? () from /usr/lib64/libcuda.so
#4  0x00007fffea8eafe4 in ?? () from /usr/lib64/libcuda.so
#5  0x00007ffff76113eb in ?? () from /home/lezar/feko.LINUX_EM64T/bin/libcudart.so.4
#6  0x00007ffff764a772 in cudaStreamCreate () from /home/lezar/feko.LINUX_EM64T/bin/libcudart.so.4
#7  0x0000000003b0ccb4 in magma_queue_create (queuePtr=0x7fffffff7a38) at ../../interface_cuda/interface.cpp:99
#8  0x0000000003b1b887 in magmablas_sgetmatrix_transpose_mgpu (num_gpus=1, stream0=0x7fffffff7c68, dat=0x7fffffff7c48, ldda=14976,
    ha=0x7ffbdc3cc018, lda=71712, dB=0x7fffffff7c08, lddb=71712, m=71710, n=14976, nb=128)
    at ../../magmablas/sgetmatrix_transpose_mgpu.cu:53
#9  0x0000000003b06402 in magma_sgetrf3_ooc (num_gpus0=1, m=71710, n=71710, a=0x7ff9dc224018, lda=71712, ipiv=0x1028aad8,
    info=0x7fffffff8178) at ../../src/sgetrf3_ooc.cpp:331
#10 0x0000000003b035e5 in magma_sgetrf (m=71710, n=71710, a=0x7ff9dc224018, lda=71712, ipiv=0x1028aad8, info=0x7fffffff8178)
    at ../../src/sgetrf.cpp:154


The failing call is the second magma_queue_create call inside the loop over devices in magmablas_sgetmatrix_transpose_mgpu (I am only using one device).

The call is:
magma_queue_create( &stream[d][1] );


None of the check_error calls in the other stream-handling functions report an error. Could a memory leak in the CUDA library be surfacing here? In your experience, what is the limit on the number of streams that can be used with one device?

I have added debug output to the MAGMA queue creation and destruction functions and get the following output:

@@ created stream 270863920 num_streams=1
@@ created stream 271280752 num_streams=2
@@ created stream 271280784 num_streams=3
@@ created stream 271280816 num_streams=4
@@ destroying stream 271280784  num_streams=3
@@ destroying stream 271280816  num_streams=2
@@ created stream 271280816 num_streams=3
@@ created stream 271280784 num_streams=4
@@ destroying stream 271280816  num_streams=3
@@ destroying stream 271280784  num_streams=2

in magmablas_sgetmatrix_transpose_mgpu
before: stream0=1 stream1=70091132
    @@ created stream 271280784 num_streams=3
    @@ created stream 271289056 num_streams=4
after: stream0=271280784 stream1=271289056
    @@ destroying stream 271280784  num_streams=3
    @@ destroying stream 271289056  num_streams=2

@@ created stream 271289056 num_streams=3
@@ created stream 271280784 num_streams=4
@@ destroying stream 271289056  num_streams=3
@@ destroying stream 271280784  num_streams=2
@@ created stream 271280784 num_streams=3
@@ created stream 271289024 num_streams=4
@@ destroying stream 271280784  num_streams=3
@@ destroying stream 271289024  num_streams=2

in magmablas_sgetmatrix_transpose_mgpu
before: stream0=1 stream1=70091132
    @@ created stream 271289024 num_streams=3
    @@ created stream 271280816 num_streams=4
after: stream0=271289024 stream1=271280816
    @@ destroying stream 271289024  num_streams=3
    @@ destroying stream 271280816  num_streams=2

@@ created stream 271280816 num_streams=3
@@ created stream 271289024 num_streams=4
@@ destroying stream 271280816  num_streams=3
@@ destroying stream 271289024  num_streams=2
@@ created stream 271289024 num_streams=3
@@ created stream 271280816 num_streams=4
@@ destroying stream 271289024  num_streams=3
@@ destroying stream 271280816  num_streams=2

in magmablas_sgetmatrix_transpose_mgpu
before: stream0=1 stream1=70091132
    @@ created stream 271280816 num_streams=3
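In this log, every call into magmablas_sgetmatrix_transpose_mgpu creates two streams and destroys the same two handles, and num_streams always drops back to 2, so the trace itself shows no leaked MAGMA streams up to the crash (it cannot rule out leaks inside the CUDA driver). For a much longer run, the bookkeeping can be checked mechanically by replaying the debug lines. The sketch below assumes exactly the line format printed above; replay_stream_log is a hypothetical helper, not part of MAGMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_LIVE_STREAMS 64

/* Replay "@@ created stream <id> ..." and "@@ destroying stream <id> ..." lines
 * (the debug format shown above) and return how many stream handles are still
 * alive at the end. Returns -1 if a handle is destroyed without having been
 * created, or if more than MAX_LIVE_STREAMS handles are alive at once.
 * Hypothetical log-analysis helper; not part of MAGMA. */
static int replay_stream_log(const char *const *lines, size_t count)
{
    unsigned long live[MAX_LIVE_STREAMS];
    int nlive = 0;

    assert(lines != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        unsigned long id;
        /* The leading space in each format skips any indentation. */
        if (sscanf(lines[i], " @@ created stream %lu", &id) == 1) {
            if (nlive == MAX_LIVE_STREAMS)
                return -1;
            live[nlive++] = id;
        } else if (sscanf(lines[i], " @@ destroying stream %lu", &id) == 1) {
            int found = -1;
            for (int k = 0; k < nlive; ++k) {
                if (live[k] == id) { found = k; break; }
            }
            if (found < 0)
                return -1;               /* destroy without a matching create */
            live[found] = live[--nlive]; /* drop it by moving the last entry in */
        }
        /* Any other line ("before:", "after:", function names) is ignored. */
    }
    return nlive;
}
```

Fed the whole log above, it should report 3 live handles, matching the final num_streams=3 just before the crash.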


Let me know if you need further information.
evanlezar
 
Posts: 33
Joined: Tue Aug 25, 2009 7:20 pm
Location: Stellenbosch, South Africa

Re: 64-bit integer in MAGMA

Postby mgates3 » Fri Sep 07, 2012 3:43 pm

We've used at least 16 streams before. (CUDA lets you create more than 16, but according to NVIDIA's presentation it effectively uses only 16 at a time.)

To speed up debugging, you could try commenting out the magma_sgemm and other MAGMA BLAS calls. That eliminates the computation but still performs the same memory and stream allocations. Of course, if the error is a bad memory reference in a BLAS call, this will hide the error (which would itself be a clue to which kernel is stepping on memory).

I still plan to look into the problem.
-mark
mgates3
 
Posts: 442
Joined: Fri Jan 06, 2012 2:13 pm

Re: 64-bit integer in MAGMA

Postby evanlezar » Tue Sep 11, 2012 5:07 am

Thanks Mark,

After disabling the sgemm and strsm calls, I get a launch failure in magmablas_spermute_long3 instead. I know that this kernel is supposed to swap the columns of the transposed matrix, but its exact workings are tricky to piece together. Is there a paper or other document that describes the algorithm used here, so that I can track down the exact source of the error?

Regards
Evan
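For reference while reading the kernel, the operation described above can be written as a plain CPU loop. This is only a sketch of the semantics, not the algorithm used inside magmablas_spermute_long3. It assumes LAPACK-style 1-based pivots (row k was interchanged with row ipiv[k]-1, applied in increasing k) and that A is held transposed, so that row r of A is the contiguous column r of AT with leading dimension ldat. The function name swap_rows_in_transposed is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Apply the row interchanges recorded in ipiv[k1..k2-1] to A, where A is stored
 * transposed as AT (column r of AT holds row r of A, leading dimension ldat).
 * ipiv uses LAPACK's 1-based convention: row k was swapped with row ipiv[k]-1.
 * Swaps are applied in increasing k, as LAPACK's laswp does; each row swap of A
 * becomes a swap of two length-n columns of AT, where n is the number of
 * columns of A held in AT. Hypothetical reference helper, not MAGMA code. */
static void swap_rows_in_transposed(float *AT, size_t ldat, size_t n,
                                    int k1, int k2, const int *ipiv)
{
    assert(n <= ldat);
    for (int k = k1; k < k2; ++k) {
        int p = ipiv[k] - 1;          /* 0-based partner row */
        assert(p >= 0);
        if (p == k)
            continue;                 /* row stays in place */
        float *colk = AT + (size_t)k * ldat;
        float *colp = AT + (size_t)p * ldat;
        for (size_t i = 0; i < n; ++i) {
            float tmp = colk[i];
            colk[i] = colp[i];
            colp[i] = tmp;
        }
    }
}
```

Comparing a slice of the GPU result against this loop on a small case is one way to tell whether the failure comes from the permutation logic or from the addressing.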

Re: 64-bit integer in MAGMA

Postby evanlezar » Wed Sep 12, 2012 11:57 am

Some more from my side:

I have run my test case through cuda-memcheck and get a number of messages like the following:
========= Invalid __global__ write of size 4
=========     at 0x00000d10 in ../../magmablas/stranspose-v2.cu:50:stranspose3_32
=========     by thread (0,5,0) in block (13,1,0)
=========     Address 0x20682d280 is out of bounds
=========


Running cuda-gdb with memcheck enabled shows that the error occurs on the following line:
[Launch of CUDA Kernel 1 (stranspose3_32<<<(2241,4,1),(32,8,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 1, grid 1, block (6,1,0), thread (0,1,0), device 0, sm 9, warp 1, lane 0]
0x000000000bc69410 in stranspose3_32<<<(2241,4,1),(32,8,1)>>> (m32=2, n32=0, __val_paramB=0x205008000, ldb=14976,
    __val_paramA=0x200a00000, lda=71712, m=71710, n=128) at stranspose-v2.cu:50
50                      B[0*ldb] = a[inx][iny+0];

(cuda-gdb) print B
$1 = (float * @generic) 0x205b0ea80
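The launch configuration shown by cuda-gdb, stranspose3_32<<<(2241,4,1),(32,8,1)>>> with m=71710 and n=128, is consistent with one 32x8 thread block per 32x32 tile of A: ceil(71710/32) = 2241 tiles down the rows and 128/32 = 4 across the columns. The sketch below is a CPU emulation of that kind of tiled transpose with edge guards, assuming the common pattern where each of the 8 thread rows handles 4 rows of its tile. It is not MAGMA's kernel, and tiles_for and transpose_tiled are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32, THREAD_ROWS = 8 };  /* a 32x8 thread block covers a 32x32 tile */

/* Number of TILE-wide tiles needed to cover len entries (ceiling division). */
static int tiles_for(int len)
{
    return (len + TILE - 1) / TILE;
}

/* CPU emulation of a tiled out-of-place transpose B = A^T.
 * A is m x n, column-major, leading dimension lda; B is n x m, leading dimension ldb.
 * Each (bx, by) plays one thread block; each (tx, ty) plays one thread, which
 * handles tile rows ty, ty+8, ty+16 and ty+24. Offsets are formed in size_t so
 * they stay 64-bit on 64-bit builds. Hypothetical sketch, not MAGMA's kernel. */
static void transpose_tiled(int m, int n, const float *A, int lda,
                            float *B, int ldb)
{
    assert(lda >= m && ldb >= n);
    for (int by = 0; by < tiles_for(n); ++by)
        for (int bx = 0; bx < tiles_for(m); ++bx)
            for (int ty = 0; ty < THREAD_ROWS; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    for (int k = 0; k < TILE; k += THREAD_ROWS) {
                        int i = bx * TILE + tx;      /* row of A    */
                        int j = by * TILE + ty + k;  /* column of A */
                        if (i < m && j < n)          /* skip the partial edge tile */
                            B[j + (size_t)i * ldb] = A[i + (size_t)j * lda];
                    }
}
```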


The problem is that, as far as I can tell, the matrix B was allocated with 14976*71712*sizeof(float) bytes, so the valid address range should end at 0x1205d4800.
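The size of that buffer is worth computing explicitly: 14976*71712 floats is 4,295,835,648 bytes, which is 868,352 bytes more than 4 GiB (2^32 bytes), so some byte offsets inside it need more than 32 bits. A small check, computed in 64-bit arithmetic (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define FOUR_GIB (UINT64_C(1) << 32)   /* 2^32 bytes */

/* Bytes needed for an ld x ncols single-precision buffer, in 64-bit arithmetic. */
static uint64_t float_buffer_bytes(uint64_t ld, uint64_t ncols)
{
    /* Guard against overflowing even the 64-bit product. */
    assert(ncols == 0 || ld <= UINT64_MAX / ncols / sizeof(float));
    return ld * ncols * sizeof(float);
}

/* Nonzero if some byte offset inside the buffer needs more than 32 bits. */
static int needs_more_than_32bit_offsets(uint64_t ld, uint64_t ncols)
{
    return float_buffer_bytes(ld, ncols) > FOUR_GIB;
}
```

With 71680 instead of 71712 columns the same buffer would be 4,293,918,720 bytes and would fit below the 4 GiB boundary.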

Re: 64-bit integer in MAGMA

Postby evanlezar » Thu Sep 13, 2012 9:32 am

More from my investigation:

When I use the MAGMA_NGR_NB environment variable to ensure that the memory for dAT (which I have moved to a separate allocation) does not exceed 4 GB, the cuda-memcheck errors disappear. Could 32-bit pointers be used somewhere internally? I compile with the -m 64 nvcc flag, so I would not expect this to be the case.
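If 32-bit arithmetic is the cause, it is more likely an offset or size held in a 32-bit type than a pointer, since -m 64 makes device pointers 64-bit as well. For the dAT shape in this thread (leading dimension 14976, 71712 columns), the element index of the last entry, about 1.07e9, still fits in a signed 32-bit int, but its byte offset, about 4.30e9, does not fit in 32 bits. The sketch below, with hypothetical helper names, shows what squeezing such a byte offset through a 32-bit type would do: an offset just past 4 GiB wraps to a small value.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of element (row, col) of a column-major float matrix with leading
 * dimension ld, computed in 64-bit arithmetic. */
static uint64_t byte_offset64(uint64_t row, uint64_t col, uint64_t ld)
{
    assert(row < ld);
    return (row + col * ld) * sizeof(float);
}

/* The same offset after passing through a 32-bit unsigned type, as a
 * hypothetical 32-bit code path would see it: only the low 32 bits survive. */
static uint32_t byte_offset32(uint64_t row, uint64_t col, uint64_t ld)
{
    return (uint32_t)byte_offset64(row, col, ld);
}
```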
