!!!! device memory allocation error

Postby fdahm » Tue Aug 30, 2011 11:34 am

I had been using MAGMA 1.0 RC5 for a couple of months.
Now that I've switched to the final release, I get "!!!! device memory allocation error (magma_zhetrd)" as soon as I call magmaf_zheevd.
(I checked: I call cublasInit properly beforehand, and I use very small matrix sizes.)
The test programs also fail:
testing_zheevd ==> !!!! device memory allocation error (magma_zhetrd)
testing_zheevd_gpu ==> !!!! device memory allocation error (magma_zheevd_gpu)
testing_zhegvd ==> !!!! device memory allocation error (magma_zheevd_gpu)
but the tests work for magma_zhetrd and zhetrd_gpu.


It also seems that on my Tesla T10 based cluster, zpotrf_gpu doesn't work either (the only time I tested it with RC5, I got incorrect results):
magma-1.0_cuda3.2/testing/testing_zpotrf_gpu
device 0: Tesla T10 Processor, 1440.0 MHz clock, 4095.8 MB memory
device 1: Tesla T10 Processor, 1440.0 MHz clock, 4095.8 MB memory
Usage:
testing_zpotrf_gpu -N 1024
N CPU GFlop/s GPU GFlop/s ||R||_F / ||A||_F
========================================================
Argument 6 of magma_zpotrf had an illegal value.
1024 9.16 159422.58 2.685846e+01
fdahm
 
Posts: 8
Joined: Tue Feb 15, 2011 10:04 am

Re: !!!! device memory allocation error

Postby fdahm » Wed Aug 31, 2011 8:56 am

I've partially solved the problem.
I think these allocation errors are related to a known bug between ICC and CUBLAS.

The CUDA 4 release notes say:
* (Linux) There is a known bug in ICC with respect to passing 16-byte aligned types by value to GCC-built code such as the CUDA Toolkit libraries e.g. CUBLAS. At this time, passing a double2 or cuDoubleComplex or any other 16-byte aligned type by value to GCC-built code from ICC-built code will pass incorrect data. Intel has been informed of this bug. As a workaround, a GCC-built wrapper function that accepts the data by reference from the ICC-built code can be linked with the ICC-built code; the GCC-built wrapper can then, in turn, pass the data by value to the CUDA Toolkit libraries.
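
The workaround described in the release notes could look roughly like the sketch below. It is only an illustration of the pattern, not code from MAGMA or CUBLAS: `zcomplex_t`, `lib_scale_by_value` and `scale_by_ref` are hypothetical names, and the by-value routine is a local stand-in so the example compiles without the CUDA headers. In a real build the wrapper would live in its own file compiled with GCC and would call the actual CUBLAS routine instead.

```c
#include <assert.h>

/* Hypothetical stand-in for cuDoubleComplex: a 16-byte aligned pair of
   doubles, declared locally so the sketch builds without CUDA headers. */
typedef struct { double x, y; } __attribute__((aligned(16))) zcomplex_t;

/* Hypothetical stand-in for a GCC-built library routine that takes
   16-byte aligned arguments BY VALUE. It computes out = alpha * v. */
static void lib_scale_by_value(zcomplex_t alpha, zcomplex_t v, zcomplex_t *out)
{
    out->x = alpha.x * v.x - alpha.y * v.y;  /* real part */
    out->y = alpha.x * v.y + alpha.y * v.x;  /* imaginary part */
}

/* Wrapper meant to be compiled with GCC. ICC-built code calls this and
   passes only pointers, so no 16-byte aligned value crosses the ICC/GCC
   boundary; the by-value call happens entirely inside GCC-built code. */
void scale_by_ref(const zcomplex_t *alpha, const zcomplex_t *v, zcomplex_t *out)
{
    assert(alpha != 0 && v != 0 && out != 0);
    lib_scale_by_value(*alpha, *v, out);
}
```

The ICC-built caller would then use `scale_by_ref(&alpha, &v, &result);` rather than calling the by-value routine directly. The same idea should apply to scalar arguments such as alpha and beta in the double-complex BLAS calls, with one such wrapper per routine.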


Were the MAGMA developers aware of this bug?
Since I build MAGMA with ICC and MKL (to get the highest hybrid performance), it's not surprising that I get errors.
The error I mentioned in the first post occurs in every double-complex routine with CUDA 3.2, but only in zgemm with CUDA 4.

Has anyone encountered a similar issue and found a workaround?
fdahm
 
Posts: 8
Joined: Tue Feb 15, 2011 10:04 am

