Nan problems with dgetrf_gpu on RC3

Open discussion for MAGMA

Nan problems with dgetrf_gpu on RC3

Postby fletchjp » Sat Jan 22, 2011 8:02 am

I am running with RC3 the same tests that I ran with RC2.

System: Nehalem CPU with 8 Gbytes main memory, GTX 460 with 2 Gbytes memory, MAGMA 1.0.0 RC3 out of the box and GotoBLAS2 compiled for CORE2 (see other threads).

I get the following output from a run of testing_dgetrf_gpu:

It is crashing on exit, which I have also seen with RC2. I hope this helps figure out what is happening. There are references in the traceback to /usr/lib/atlas, which I did not think I was using. I have attached my make.inc in case it holds a clue.

If I next run testing_zgetrf_gpu and then testing_dgetrf_gpu, both run without error. I have attached the outputs below.

Best wishes

John

Code: Select all
fletcher@fletcher-desktop:~/magma_1.0.0-rc3/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   16.48          21.56         nan
 1920  1920   25.32          43.90         nan
 3072  3072   25.87          60.96         nan
 4032  4032   26.53          65.27         nan
 4992  4992   25.64          67.47         nan
 5952  5952   26.42          68.89         nan
 7104  7104   26.90          69.78         nan
 8064  8064   27.04          70.76         nan
 9024  9024   27.39          71.40         nan
 9984  9984   26.71          71.57         nan
*** glibc detected *** ./testing_dgetrf_gpu: munmap_chunk(): invalid pointer: 0x00007fdc337b0010 ***
======= Backtrace: =========
/lib/libc.so.6(+0x775b6)[0x7fdc743785b6]
./testing_dgetrf_gpu[0x402670]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7fdc7431fc4d]
./testing_dgetrf_gpu[0x401b59]
======= Memory map: ========
00400000-0053a000 r-xp 00000000 08:01 10750245                           /home/fletcher/magma_1.0.0-rc3/testing/testing_dgetrf_gpu
00739000-0073a000 r--p 00139000 08:01 10750245                           /home/fletcher/magma_1.0.0-rc3/testing/testing_dgetrf_gpu
0073a000-0073b000 rw-p 0013a000 08:01 10750245                           /home/fletcher/magma_1.0.0-rc3/testing/testing_dgetrf_gpu
00ba4000-02fbc000 rw-p 00000000 00:00 0                                  [heap]
7fdc03f30000-7fdc337b0000 rw-s 1d2ad8000 00:05 4755                      /dev/nvidia0
7fdc337b0000-7fdc63031000 rw-p 00000000 00:00 0
7fdc63031000-7fdc65031000 rw-p 00000000 00:00 0
7fdc67033000-7fdc69033000 rw-p 00000000 00:00 0
7fdc6b031000-7fdc6d031000 rw-p 00000000 00:00 0
7fdc6e82c000-7fdc6ec2e000 rw-s 1d757e000 00:05 4755                      /dev/nvidia0
7fdc6ec2e000-7fdc6f030000 rw-s 1d7967000 00:05 4755                      /dev/nvidia0
7fdc6f030000-7fdc71030000 rw-p 00000000 00:00 0
7fdc7110f000-7fdc7142c000 rw-p 00000000 00:00 0
7fdc7142c000-7fdc7152c000 rw-s 1d785f000 00:05 4755                      /dev/nvidia0
7fdc7152c000-7fdc7162c000 rw-s 1e675b000 00:05 4755                      /dev/nvidia0
7fdc7162c000-7fdc7172c000 rw-s 1e665b000 00:05 4755                      /dev/nvidia0
7fdc7172c000-7fdc7182c000 rw-s 1e6557000 00:05 4755                      /dev/nvidia0
7fdc7182c000-7fdc7182d000 ---p 00000000 00:00 0
7fdc7182d000-7fdc7202d000 rwxp 00000000 00:00 0
7fdc7202d000-7fdc7202e000 ---p 00000000 00:00 0
7fdc7202e000-7fdc7282e000 rwxp 00000000 00:00 0
7fdc7282e000-7fdc7282f000 ---p 00000000 00:00 0
7fdc7282f000-7fdc7302f000 rwxp 00000000 00:00 0
7fdc7302f000-7fdc737c0000 r-xp 00000000 08:01 54401328                   /usr/lib/atlas/libblas.so.3gf.0
7fdc737c0000-7fdc739bf000 ---p 00791000 08:01 54401328                   /usr/lib/atlas/libblas.so.3gf.0
7fdc739bf000-7fdc739c4000 r--p 00790000 08:01 54401328                   /usr/lib/atlas/libblas.so.3gf.0
7fdc739c4000-7fdc739ca000 rw-p 00795000 08:01 54401328                   /usr/lib/atlas/libblas.so.3gf.0
7fdc739ca000-7fdc739d1000 r-xp 00000000 08:01 28181369                   /lib/librt-2.11.1.so
7fdc739d1000-7fdc73bd0000 ---p 00007000 08:01 28181369                   /lib/librt-2.11.1.so
7fdc73bd0000-7fdc73bd1000 r--p 00006000 08:01 28181369                   /lib/librt-2.11.1.so
7fdc73bd1000-7fdc73bd2000 rw-p 00007000 08:01 28181369                   /lib/librt-2.11.1.so
7fdc73bd2000-7fdc73bd4000 r-xp 00000000 08:01 28181708                   /lib/libdl-2.11.1.so
7fdc73bd4000-7fdc73dd4000 ---p 00002000 08:01 28181708                   /lib/libdl-2.11.1.so
7fdc73dd4000-7fdc73dd5000 r--p 00002000 08:01 28181708                   /lib/libdl-2.11.1.so
7fdc73dd5000-7fdc73dd6000 rw-p 00003000 08:01 28181708                   /lib/libdl-2.11.1.so
7fdc73dd6000-7fdc73dec000 r-xp 00000000 08:01 28180674                   /lib/libz.so.1.2.3.3
7fdc73dec000-7fdc73feb000 ---p 00016000 08:01 28180674                   /lib/libz.so.1.2.3.3
7fdc73feb000-7fdc73fec000 r--p 00015000 08:01 28180674                   /lib/libz.so.1.2.3.3
7fdc73fec000-7fdc73fed000 rw-p 00016000 08:01 28180674                   /lib/libz.so.1.2.3.3
7fdc73fed000-7fdc740e3000 r-xp 00000000 08:01 50335407                   /usr/lib/libstdc++.so.6.0.13
7fdc740e3000-7fdc742e3000 ---p 000f6000 08:01 50335407                   /usr/lib/libstdc++.so.6.0.13
7fdc742e3000-7fdc742ea000 r--p 000f6000 08:01 50335407                   /usr/lib/libstdc++.so.6.0.13
7fdc742ea000-7fdc742ec000 rw-p 000fd000 08:01 50335407                   /usr/lib/libstdc++.so.6.0.13
7fdc742ec000-7fdc74301000 rw-p 00000000 00:00 0
7fdc74301000-7fdc7447b000 r-xp 00000000 08:01 28181446                   /lib/libc-2.11.1.so
7fdc7447b000-7fdc7467a000 ---p 0017a000 08:01 28181446                   /lib/libc-2.11.1.so
7fdc7467a000-7fdc7467e000 r--p 00179000 08:01 28181446                   /lib/libc-2.11.1.so
7fdc7467e000-7fdc7467f000 rw-p 0017d000 08:01 28181446                   /lib/libc-2.11.1.so
7fdc7467f000-7fdc74684000 rw-p 00000000 00:00 0
7fdc74684000-7fdc7469a000 r-xp 00000000 08:01 28180559                   /lib/libgcc_s.so.1
7fdc7469a000-7fdc74899000 ---p 00016000 08:01 28180559                   /lib/libgcc_s.so.1
7fdc74899000-7fdc7489a000 r--p 00015000 08:01 28180559                   /lib/libgcc_s.so.1
7fdc7489a000-7fdc7489b000 rw-p 00016000 08:01 28180559                   /lib/libgcc_s.so.1
7fdc7489b000-7fdc7491d000 r-xp 00000000 08:01 28180728                   /lib/libm-2.11.1.so
7fdc7491d000-7fdc74b1c000 ---p 00082000 08:01 28180728                   /lib/libm-2.11.1.so
7fdc74b1c000-7fdc74b1d000 r--p 00081000 08:01 28180728                   /lib/libm-2.11.1.so
7fdc74b1d000-7fdc74b1e000 rw-p 00082000 08:01 28180728                   /lib/libm-2.11.1.so
7fdc74b1e000-7fdc74c09000 r-xp 00000000 08:01 50339064                   /usr/lib/libgfortran.so.3.0.0
7fdc74c09000-7fdc74e08000 ---p 000eb000 08:01 50339064                   /usr/lib/libgfortran.so.3.0.0
7fdc74e08000-7fdc74e09000 r--p 000ea000 08:01 50339064                   /usr/lib/libgfortran.so.3.0.0
7fdc74e09000-7fdc74e0a000 rw-p 000eb000 08:01 50339064                   /usr/lib/libgfortran.so.3.0.0
7fdc74e0a000-7fdc74e0b000 rw-p 00000000 00:00 0
7fdc74e0b000-7fdc75700000 r-xp 00000000 08:01 54401329                   /usr/lib/atlas/liblapack.so.3gf.0
7fdc75700000-7fdc758ff000 ---p 008f5000 08:01 54401329                   /usr/lib/atlas/liblapack.so.3gf.0
7fdc758ff000-7fdc75900000 r--p 008f4000 08:01 54401329                   /usr/lib/atlas/liblapack.so.3gf.0
7fdc75900000-7fdc75905000 rw-p 008f5000 08:01 54401329                   /usr/lib/atlas/liblapack.so.3gf.0
7fdc75905000-7fdc75a12000 rw-p 00000000 00:00 0
Aborted


Code: Select all
#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 1.0) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of   Colorado, Denver
#      November 2010
#//////////////////////////////////////////////////////////////////////////////

#
# GPU_TARGET specifies for which GPU you want to compile MAGMA
#      0: Tesla family
#      1: Fermi Family
#
GPU_TARGET = 1

CUDADIR=/usr/local/cuda

CC        = gcc
NVCC=$(CUDADIR)/bin/nvcc
#NVCC      = nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

OPTS      = -O3 -DADD_
NVOPTS    = --compiler-options -fno-strict-aliasing -DUNIX -O3 -DADD_
LDOPTS    = -fPIC -z muldefs

# using GotoBLAS
LIB       = -lgoto2  -lpthread -lcublas -lcudart -llapack -lm
# using default BLAS (single thread)
#LIB       = -lblas  -lpthread -lcublas -lcudart -llapack -lm

LIBDIR    = -L/home/fletcher/GotoBLAS2 -L/usr/lib64 -L$(CUDADIR)/lib64
INC       = -I$(CUDADIR)/include

LIBMAGMA     = ../lib/libmagma.a
LIBMAGMABLAS = ../lib/libmagmablas.a


Code: Select all
fletcher@fletcher-desktop:~/magma_1.0.0-rc3/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   20.35          45.96         1.102403e-17
 1920  1920   27.31          59.75         1.096587e-17
 3072  3072   27.25          63.34         1.075028e-17
 4032  4032   27.74          67.39         1.033353e-17
 4992  4992   24.41          68.40         1.044090e-17
 5952  5952   27.30          69.05         1.025062e-17
 7104  7104   27.62          69.62         1.020955e-17
 8064  8064   27.39          69.86         1.004068e-17
 9024  9024   27.27          70.14         9.916281e-18
 9984  9984   27.79          70.31         9.820221e-18

fletcher@fletcher-desktop:~/magma_1.0.0-rc3/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   21.77          21.14         4.197521e-18
 1920  1920   25.42          43.69         3.620278e-18
 3072  3072   25.91          60.80         4.114900e-18
 4032  4032   26.32          65.09         3.825857e-18
 4992  4992   26.30          67.30         3.645565e-18
 5952  5952   26.45          68.77         3.493297e-18
 7104  7104   26.68          69.71         3.407056e-18
 8064  8064   26.76          70.75         2.707749e-18
 9024  9024   26.36          71.40         2.627284e-18
 9984  9984   26.34          71.50         2.535688e-18

Re: Nan problems with dgetrf_gpu on RC3

Postby Stan Tomov » Thu Jan 27, 2011 10:59 pm

John,
Did you have any luck figuring out the problem? I haven't been able to reproduce it on our systems. Is it possible that you have multiple versions of CUDA installed, or something like that? I have seen problems when the CUDA runtime in use and nvcc are not the same version. I assume you also have the newest drivers. After you get correct results, what happens next? Can you get back into a state where you see NaNs again?
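For instance, comparing the toolkit version that nvcc reports against the driver and runtime versions that deviceQuery prints should show whether there is a mismatch, e.g. (paths will differ on your system):
Code: Select all
nvcc --version
./deviceQuery     # from the NVIDIA GPU Computing SDK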
Regards,
Stan

Re: Nan problems with dgetrf_gpu on RC3

Postby katayama » Fri Jan 28, 2011 8:23 am

Dear Stan,

I just want to say that I get the nans as well.

Nobu

[katayama@lb01 testing]$ cat nohup.out

### (./testing_dgetrf)

device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_dgetrf_gpu -M 1024 -N 1024



M N CPU GFlop/s GPU GFlop/s ||PA-LU||/(||A||*N)
============================================================
1024 1024 11.90 18.49 4.608794e-18
2048 2048 20.80 48.66 3.851219e-18
3072 3072 34.30 79.42 4.202012e-18
4032 4032 43.08 111.00 4.039879e-18
5184 5184 50.34 141.83 3.818907e-18
6016 6016 53.28 158.79 3.696577e-18
7040 7040 57.82 180.07 nan
8064 8064 60.00 190.80 nan
9088 9088 61.79 203.67 nan
10112 10112 63.84 212.26 nan

###(./testing_dgetrf_gpu)

device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_dgetrf_gpu -M 1024 -N 1024



M N CPU GFlop/s GPU GFlop/s ||PA-LU||/(||A||*N)
============================================================
960 960 13.87 12.38 nan
1920 1920 23.51 45.45 nan
3072 3072 36.28 102.85 nan
4032 4032 40.92 148.02 nan
4992 4992 46.48 182.26 nan
5952 5952 51.53 190.45 nan
7104 7104 56.60 210.39 nan
8064 8064 59.51 223.53 nan
9024 9024 61.47 234.86 nan
9984 9984 62.17 242.12 nan


[katayama@lb01 magma]$ ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
/lbhome/katayama/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "Tesla C2050"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817720320 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = Tesla C2050


PASSED

Press <Enter> to Quit...
-----------------------------------------------------------

[katayama@lb01 magma]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221
[katayama@lb01 magma]$

Re: Nan problems with dgetrf_gpu on RC3

Postby fletchjp » Fri Jan 28, 2011 12:31 pm

Stan

Thank you for your comments.

When I first got my computer, I installed the CUDA toolkit 3.2.12 for 64-bit Linux and Ubuntu 10.4.
I now have 3.2.16 but have not yet installed it.

I am using drivers 260.19.12 and could update to 260.19.26.

I also installed CULA 2.1, which at that time needed CUDA 3.1; I thought I had installed that too, but I cannot find any trace of the files.
I now have a copy of CULA R10 (they have changed their release numbering) but have not installed it, as I have not explored CULA since I started using MAGMA.

I have downloaded the files which come with "CUDA by Example", compiled enum_gpu with the addition of a CUDA version query (the lines I added are sketched after the output below), and get the following output:

Code: Select all
------------------------
CUDA VERSION INFORMATION
cudaDriverGetVersion returns 3020
------------------------
   --- General Information for device 0 ---
Name:  GeForce GTX 460
Compute capability:  2.1
Clock rate:  1400000
Device copy overlap:  Enabled
Kernel execution timeout :  Enabled
   --- Memory Information for device 0 ---
Total global mem:  2146631680
Total constant Mem:  65536
Max mem pitch:  2147483647
Texture Alignment:  512
   --- MP Information for device 0 ---
Multiprocessor count:  7
Shared mem per mp:  49152
Registers per mp:  32768
Threads in warp:  32
Max threads per block:  1024
Max thread dimensions:  (1024, 1024, 64)
Max grid dimensions:  (65535, 65535, 1)
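
The version query I added to enum_gpu was roughly along these lines (reconstructed from memory, so the exact code may differ slightly):

Code: Select all
// added near the top of main() in enum_gpu.cu from "CUDA by Example"
int driverVersion = 0;
cudaDriverGetVersion( &driverVersion );
printf( "------------------------\n" );
printf( "CUDA VERSION INFORMATION\n" );
printf( "cudaDriverGetVersion returns %d\n", driverVersion );
printf( "------------------------\n" );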


I notice that my GTX 460 returns a compute capability of 2.1. Is that significant?

I have found some discussion of the GTX 460 here:
http://forums.nvidia.com/index.php?showtopic=191665

There is the suggestion that compilation should be done with -arch=sm_21 for this card.

What I can do is work through a series of upgrades of the drivers and CUDA and see if anything makes a difference.

Should I set something in the MAGMA makefile for the different compute capability?
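
For example, I am guessing this would mean changing the NVOPTS line in my make.inc to something like the following (only a guess on my part; I do not know whether MAGMA needs anything else changed):

Code: Select all
NVOPTS    = -arch=sm_21 --compiler-options -fno-strict-aliasing -DUNIX -O3 -DADD_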

Thanks again

John

Re: Nan problems with dgetrf_gpu on RC3

Postby fletchjp » Fri Jan 28, 2011 5:11 pm

katayama wrote:Dear Stan,

I just want to say that I get the nans as well.

Nobu


I want to comment separately on the results which Nobu reports. They are similar to ones I have seen, where the NaN results start part way through a run and continue into the next run.

These results are for a different type of GPU from mine: a C2050, which has compute capability 2.0 rather than the 2.1 of my card. So those variables seem to be ruled out. I am very grateful to have this report, as I was planning to try the examples on other hardware to see whether the problems would go away on other systems.

I have another reason to say thank you: my strategy in getting this computer was to buy a cheap entry-level system for development and evaluation ahead of a future system based on e.g. a C2050, so data on comparative performance is going to be very helpful.

I have in the meantime updated to CUDA 3.2.16 and now compile with sm_21; neither change seems to make a difference. I am going to try updating the driver. I have seen reports of a beta of a 270-series driver, but I have not yet seen a download.

Just for completeness, here is the output from deviceQuery for my GTX 460:

Code: Select all
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 460"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 2146631680 bytes
  Multiprocessors x Cores/MP = Cores:            7 (MP) x 48 (Cores/MP) = 336 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 460


Thank you again.

John

Re: Nan problems with dgetrf_gpu on RC3

Postby Stan Tomov » Fri Jan 28, 2011 7:09 pm

We know there is a problem with the MAGMA TRSM and we are working on it. The problem shows up on certain systems, sometimes working and sometimes not, so it is difficult to debug. I would be curious whether this is the problem you are seeing. To test, just comment out the redefinition
Code: Select all
#define cublasDtrsm magmablas_dtrsm

in file dgetrf_gpu.cpp, and recompile (which will result in using cublasDtrsm). Thank you for your help on this.
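That is, the line in dgetrf_gpu.cpp would simply become:
Code: Select all
// #define cublasDtrsm magmablas_dtrsm   /* commented out, so cublasDtrsm is used */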
Regards,
Stan

Re: Nan problems with dgetrf_gpu on RC3

Postby katayama » Sat Jan 29, 2011 7:38 am

Dear Stan,

Here is the result. Sometimes the first 1024x1024 case also gets NaN; other times it is OK.
I also tried on a GTX 580 and show the result at the end.

[katayama@lb01 magma_1.0.0-rc3]$ nm src/dgetrf_gpu.o
0000000000000000 r .LC0
0000000000000008 r .LC1
U __gxx_personality_v0
U cuCtxSynchronize
U cublasAlloc
U cublasDtrsm
U cublasFree
U cublasGetMatrix
U cublasSetMatrix
U cudaFreeHost
U cudaMallocHost
U dgetrf_
U free
0000000000000000 T magma_dgetrf_gpu
U magma_get_dgetrf_nb
U magmablas_dgemm
U magmablas_dinplace_transpose
U magmablas_dpermute_long2
U magmablas_dtranspose
U magmablas_dtranspose2
U malloc
[katayama@lb01 magma_1.0.0-rc3]$ ./testing/testing_dgetrf
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_dgetrf_gpu -M 1024 -N 1024



M N CPU GFlop/s GPU GFlop/s ||PA-LU||/(||A||*N)
============================================================
1024 1024 13.90 18.63 4.608794e-18
2048 2048 26.45 59.09 nan
...

GTX 580

device 0: GeForce GTX 580, 1544.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 580, 1544.0 MHz clock, 1535.7 MB memory

Usage:
testing_dgetrf_gpu -M 1024 -N 1024



M N CPU GFlop/s GPU GFlop/s ||PA-LU||/(||A||*N)
============================================================
1024 1024 8.37 24.78 nan
2048 2048 9.23 64.08 nan
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
can not bind to texture
...
can not bind to texture
can not bind to texture
3072 3072 9.99 113.53 nan
Argument 7 of dgetrf had an illegal value.
4032 4032 10.48 4369075.54 1.767382e-01
Argument 7 of dgetrf had an illegal value.
5184 5184 10.61 8442055.79 1.767285e-01
Argument 7 of dgetrf had an illegal value.
6016 6016 10.66 13194271.24 1.767584e-01
Argument 7 of dgetrf had an illegal value.
7040 7040 10.73 21144030.40 1.767586e-01
Argument 7 of dgetrf had an illegal value.
8064 8064 10.76 34955853.68 1.767457e-01
Argument 7 of dgetrf had an illegal value.
9088 9088 10.81 50035455.80 1.767460e-01
Argument 7 of dgetrf had an illegal value.
10112 10112 10.83 57438947.12 1.767458e-01


It may already have been reported, but running the full set of tests on the C2050, here are the ones where I get NaNs:


+ ./testing_zcgeqrsv_gpu
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_zcgeqrsv_gpu -M 1024 -N 1024 -nrhs 1

Epsilon(double): 1.110223e-16
Epsilon(single): 5.960464e-08



CPU GFlop/s G P U GFlop/s
N DP DP SP MP ||b-Ax||/||A|| NumIter
=======================================================================
1024 19.21 35.89 58.02 45.49 1.236591e-15 2
2048 17.44 82.47 41.82 46.11 nan 0
3072 33.04 143.99 56.81 55.90 nan 0
4032 31.62 191.28 76.34 75.75 nan 0
5184 38.49 98.16 101.27 100.74 nan 0
6016 41.77 114.29 116.06 118.67 nan 0
7040 45.79 135.69 139.29 136.83 nan 0
7520 47.42 145.20 149.34 147.99 nan 0


+ ./testing_dorgqr_gpu
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_dorgqr_gpu -M 1024 -N 1024 -K 1024


M N CPU GFlop/s GPU GFlop/s ||R|| / ||A||
=======================================================
1024 1024 43.2 28.9 4.90e-16
2048 2048 108.6 114.5 5.52e-16
3072 3072 90.0 101.3 5.27e-16
4032 4032 132.5 139.2 nan
5184 5184 171.1 179.0 nan
6016 6016 192.5 197.9 nan
7040 7040 212.7 219.7 nan
8064 8064 225.6 233.2 nan
9088 9088 235.5 241.9 nan
9984 9984 242.2 249.4 nan
+ ./testing_dgeqrs_gpu
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_dgeqrs_gpu -nrhs 3 -M 1024 -N 1024


||b-Ax|| / (N||A||)
M N CPU GFlop/s GPU GFlop/s CPU GPU
============================================================
1024 1024 12.4 9.0 3.43e-18 nan
2048 2048 26.0 19.7 7.03e-19 nan
3072 3072 18.6 28.3 nan nan
4032 4032 24.8 38.4 nan nan
5184 5184 30.4 50.7 nan nan
6016 6016 37.1 61.1 nan nan
7040 7040 44.2 72.1 nan nan
8064 8064 51.1 82.9 nan nan
9088 9088 56.3 93.9 nan nan
10112 10112 61.3 104.8 nan nan

+ ./testing_sgeqrf_gpu
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_sgeqrf_gpu -M 1024 -N 1024



M N CPU GFlop/s GPU GFlop/s ||R||_F / ||A||_F
==========================================================
1024 1024 18.78 18.94 1.084622e-06
2048 2048 37.49 21.58 nan
3072 3072 50.45 32.74 nan
4032 4032 70.38 43.81 nan
5184 5184 85.75 57.37 nan
6016 6016 96.24 66.79 nan
7040 7040 105.65 77.67 nan
8064 8064 112.26 89.49 nan
9088 9088 119.61 100.71 nan
9984 9984 122.49 111.05 nan
+ ./testing_sorgqr_gpu
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_sorgqr_gpu -M 1024 -N 1024 -K 1024


M N CPU GFlop/s GPU GFlop/s ||R|| / ||A||
=======================================================
1024 1024 55.2 42.5 nan
2048 2048 89.1 92.1 nan
3072 3072 149.2 158.9 nan
4032 4032 219.0 220.4 nan
5184 5184 293.5 298.7 nan
6016 6016 343.1 352.2 nan
7040 7040 392.0 405.5 nan
8064 8064 430.1 445.0 nan
9088 9088 459.0 476.3 nan
9984 9984 479.6 497.7 nan

Re: Nan problems with dgetrf_gpu on RC3

Postby fletchjp » Sun Jan 30, 2011 3:29 pm

Stan Tomov wrote:We know there is a problem with the MAGMA TRSM and we are working on it. The problem shows up on certain systems, sometimes working and sometimes not, so it is difficult to debug. I would be curious whether this is the problem you are seeing. To test, just comment out the redefinition
Code: Select all
#define cublasDtrsm magmablas_dtrsm

in file dgetrf_gpu.cpp, and recompile (which will result in using cublasDtrsm). Thank you for your help on this.
Regards,
Stan


Stan

I have done some tests and find that changing to cublasDtrsm does remove the problem with dgetrf_gpu, and also with dgetrf. I have made the same change in dgetrs_gpu as well. I notice that dtrsm is also used in dgesv_gpu (a wrapper on the others) and in some dpo routines (dposv_gpu, dpotrf_gpu, dpotrs_gpu and dpotrf).

I notice another thing: when I am using GotoBLAS2 I can delete -llapack from the link line in make.inc.
The implication, for me, is that GotoBLAS2 contains enough of LAPACK to cover these routines, so I don't need one of the other LAPACK libraries supplied by the Ubuntu 10.4 installation (which include an ATLAS LAPACK as an alternative). I was getting references to ATLAS routines in the failure traceback, which I did not think I was linking. I will try it like that and see if that reduces the problems.
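
For reference, that would make the LIB line in my make.inc look something like this (I still need to confirm that everything links and runs cleanly this way):

Code: Select all
# using GotoBLAS2, relying on it for the LAPACK routines as well
LIB       = -lgoto2  -lpthread -lcublas -lcudart -lm

Running ldd on the test binaries should then show GotoBLAS2 rather than the ATLAS libraries.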

Thanks again

John

Re: Nan problems with dgetrf_gpu on RC3

Postby katayama » Mon Jan 31, 2011 8:35 am

Dear Stan/John

In my previous post, dated Jan. 29, I made a mistake (see the line marked <<<<<<<<<<<<<<<< below): I commented out the #define statement in dgetrf_gpu and then ran testing_dgetrf, not testing_dgetrf_gpu. (I had only checked the usage line, which says it is testing_dgetrf_gpu...)

I checked it again and found that, as John says, with cublasDtrsm, dgetrf runs fine without NaNs.

I also did not explain well: everything below this testing_dgetrf run was with the original code, not with the #define statement commented out.

Best

Nobu

<<<original post>>>

[katayama@lb01 magma_1.0.0-rc3]$ nm src/dgetrf_gpu.o
0000000000000000 r .LC0
0000000000000008 r .LC1
U __gxx_personality_v0
U cuCtxSynchronize
U cublasAlloc
U cublasDtrsm
U cublasFree
U cublasGetMatrix
U cublasSetMatrix
U cudaFreeHost
U cudaMallocHost
U dgetrf_
U free
0000000000000000 T magma_dgetrf_gpu
U magma_get_dgetrf_nb
U magmablas_dgemm
U magmablas_dinplace_transpose
U magmablas_dpermute_long2
U magmablas_dtranspose
U magmablas_dtranspose2
U malloc
[katayama@lb01 magma_1.0.0-rc3]$ ./testing/testing_dgetrf <<<<<<<<<<<<<<
device 0: Tesla C2050, 1147.0 MHz clock, 2687.2 MB memory

Usage:
testing_dgetrf_gpu -M 1024 -N 1024

Re: Nan problems with dgetrf_gpu on RC3

Postby fletchjp » Mon Jan 31, 2011 9:31 am

I had noticed that testing_dgetrf reports itself as testing_dgetrf_gpu.

John
