NaN errors with dpotrf and dpotrf_gpu

Open discussion for MAGMA

Postby fletchjp » Tue Dec 28, 2010 7:08 pm

I am getting NaN values for the norm result with testing_dpotrf and testing_dpotrf_gpu.

The results are inconsistent and confusing, as follows:

1. spotrf, cpotrf and zpotrf do not show the problem.
2. dpotrf shows it sometimes, and whether it appears depends on what has run beforehand; if I have just run zpotrf, I get good values (see the repro sketch after this list).
3. Everything worked yesterday when I ran some initial tests and switched from a single-threaded BLAS to GotoBLAS. I have the output files to prove it!
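
For concreteness, this is the kind of sequence that triggers and then masks the problem. The sizes and flags here are just examples; each tester prints its exact usage line.

Code: Select all
cd magma_1.0.0-rc2/testing
./testing_dpotrf -N 1024     # norm result sometimes comes out as NaN
./testing_zpotrf -N 1024     # run the double-complex tester...
./testing_dpotrf -N 1024     # ...and now dpotrf reports good values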

I have explored a number of things but cannot get consistent behaviour.

I saw a note that the NVIDIA drivers may need updating. I am currently using 260.19.21 and have 260.19.26 available.

I have installed MAGMA 1.0 RC2 on a system running Ubuntu Linux 10.04 (64-bit) with CUDA 3.2 installed.
I have an 8 core CPU and 8 Gbytes of main memory. The GPU reports as follows:
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory
I am using gcc and gfortran and have GotoBLAS2 installed.

Is there a command I can run to restore the GPU to a consistent start state?

I would welcome any advice.

John

Re: NaN errors with dpotrf and dpotrf_gpu

Postby Stan Tomov » Sun Jan 02, 2011 10:09 pm

John,

Thanks for the bug report. We have not been able to reproduce it so far (we tried a GTX 480 and a C2050), and I was wondering whether you still get the error. All four precisions are generated from the double-complex source, so if some of them work, the problem is most likely in the BLAS used. In double precision we use magmablas_dtrsm and magmablas_dgemm. If you comment out the redefinitions at the beginning of the source files, i.e.,

#define cublasDgemm magmablas_dgemm
#define cublasDtrsm magmablas_dtrsm

and recompile, you will be using CUBLAS instead. I expect that would work. If it does, can you please check whether the problem comes from magmablas_dgemm or magmablas_dtrsm (or both) by trying the different combinations of these redefinitions? The sketch below spells the combinations out.
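
A minimal sketch of the toggles (the exact location of these lines in your MAGMA tree may differ):

Code: Select all
/* At the top of the dpotrf_gpu source file, these two lines select
 * MAGMA BLAS; commenting out one, the other, or both (then recompiling)
 * gives the four combinations to test:
 *
 *   both present       -> MAGMA dgemm + MAGMA dtrsm (shipped default)
 *   dgemm line only    -> MAGMA dgemm + CUBLAS dtrsm
 *   dtrsm line only    -> CUBLAS dgemm + MAGMA dtrsm
 *   both commented out -> pure CUBLAS
 */
#define cublasDgemm magmablas_dgemm
#define cublasDtrsm magmablas_dtrsm

Rerunning testing_dpotrf_gpu after each variant should show which substitution makes the NaNs appear.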

Can you also please post your make.inc file?

Thanks,
Stan

Re: NaN errors with dpotrf and dpotrf_gpu

Postby fletchjp » Mon Jan 03, 2011 7:50 am

Stan

Thank you for this and the other replies. I have commented out both MAGMA BLAS redefinitions in dpotrf and dpotrf_gpu and rerun the tests. The NaNs are replaced by the expected small values. I don't regard this as conclusive, because the problem has been intermittent.

Here is my make.inc file. I am currently using GotoBLAS2 on the Ubuntu 10.04 (64-bit) system described in my earlier post.

Code: Select all
#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 1.0) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of   Colorado, Denver
#      November 2010
#//////////////////////////////////////////////////////////////////////////////

#
# GPU_TARGET specifies for which GPU you want to compile MAGMA
#      0: Tesla family
#      1: Fermi Family
#
GPU_TARGET = 1

CUDADIR   = /usr/local/cuda

CC        = gcc
NVCC      = $(CUDADIR)/bin/nvcc
#NVCC     = nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

OPTS      = -O3 -DADD_
NVOPTS    = --compiler-options -fno-strict-aliasing -DUNIX -O3 -DADD_
LDOPTS    = -fPIC -z muldefs

# using GotoBLAS
LIB       = -lgoto2  -lpthread -lcublas -lcudart -llapack -lm
# using default BLAS (single thread)
#LIB       = -lblas  -lpthread -lcublas -lcudart -llapack -lm

LIBDIR    = -L/home/fletcher/GotoBLAS2 -L/usr/lib64 -L$(CUDADIR)/lib64
INC       = -I$(CUDADIR)/include

LIBMAGMA     = ../lib/libmagma.a
LIBMAGMABLAS = ../lib/libmagmablas.a



I have also had problems with DGESV, which uses only dtrsm and not dgemm. Using the MAGMA routine, testing_dgesv_gpu gives this:


Code: Select all
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgesv_gpu -nrhs 100 -N 1024



  N     NRHS       GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024   100              24.05        8.352689e-01
 2048   100              45.53        8.454415e-01
 3072   100              61.00        8.421578e-01
 4032   100              65.12        8.142203e-01
 5184   100              67.79        8.306369e-01
 6016   100              68.90        8.343519e-01
 7040   100              69.89        8.141526e-01
 8064   100              70.88        8.183255e-01
 9088   100              71.54        8.071067e-01
10112   100              71.91        0.000000e+00


testing_zgesv_gpu runs O.K.

With CUBLAS I get this instead, at lower GPU GFlop/s. I have cases where I get correct answers at the lower GFlop/s and wrong answers at the higher GFlop/s, which suggests the intermittent behaviour corresponds to which BLAS routine ends up being used. I don't know how that could come about.


Code: Select all
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgesv_gpu -nrhs 100 -N 1024



  N     NRHS       GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024   100              15.33        1.398313e-15
 2048   100              32.11        2.007455e-15
 3072   100              44.85        2.586575e-15
 4032   100              50.08        2.092208e-14
 5184   100              55.22        1.029670e-14
 6016   100              57.58        8.200146e-15
 7040   100              59.72        1.682541e-14
 8064   100              61.52        6.188573e-15
 9088   100              63.10        1.849722e-15
10112   100              64.16        3.223983e-15


After switching back to magmablas_dtrsm and recompiling, testing_dgesv_gpu now runs correctly, with somewhat different GFlop/s values:

Code: Select all
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgesv_gpu -nrhs 100 -N 1024



  N     NRHS       GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024   100              18.94        4.088725e-15
 2048   100              38.98        3.078874e-15
 3072   100              53.45        3.859138e-15
 4032   100              58.62        2.800186e-14
 5184   100              62.43        1.239449e-14
 6016   100              64.10        1.297417e-14
 7040   100              65.76        2.029449e-14
 8064   100              67.16        6.810131e-15
 9088   100              68.26        2.045831e-15
10112   100              68.89        4.308972e-15


I am sorry this is rather long. I hope it helps. I am very interested in getting good answers from DGESV.

Thanks and Happy New Year

John

Re: NaN errors with dpotrf and dpotrf_gpu

Postby fletchjp » Tue Jan 04, 2011 6:56 pm

Stan

I am seeing similar problems with dgetrf and dgetrf_gpu.

With dgetrf_gpu it is similar, with NaN appearing in the result, but the problem goes away if I run zgetrf_gpu and then run dgetrf_gpu again (see the sequence sketched below).
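
Concretely, the sequence is (sizes are illustrative; the testers print their own usage lines):

Code: Select all
./testing_dgetrf_gpu -M 1024 -N 1024   # result shows nan
./testing_zgetrf_gpu -M 1024 -N 1024   # run the double-complex tester
./testing_dgetrf_gpu -M 1024 -N 1024   # now the residual is small again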

With dgetrf I have seen a different problem, with the message

Code: Select all
can not bind to texture


repeated many times, followed by an error in parameter 7.

Unfortunately, I cannot reproduce this at the moment, even after shutting down and powering off, and the output scrolled off the top of the screen before I could capture it.

In both cases I have modified the functions to use CUBLAS, and the error has not been seen with those versions.

Incidentally, I think it would help to indicate, for some of the tests, how much memory is needed to run them. testing_zgetrf_gpu needed most of the 8 GB on my machine, and I think most of the 2 GB on the GPU. As I also use the GPU for screen output and run the system monitor, I notice that the display update slows down while the GPU is working on the numerical tasks. GotoBLAS normally runs four of my eight cores at 100% when working hard; only while building GotoBLAS did I see more of them in use.
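
For a rough sense of scale (my own arithmetic, not a figure from the MAGMA documentation): at N = 10112, the largest size in these runs, a single double-complex matrix is 10112 x 10112 x 16 bytes, roughly 1.6 GB, which on its own nearly fills the 2 GB card; the tester also keeps CPU-side copies for the residual check, which multiplies that total on the host.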

Best wishes

John

Re: NaN errors with dpotrf and dpotrf_gpu

Postby Stan Tomov » Wed Jan 05, 2011 1:39 am

John,

Thanks again for reporting these observations.

Now I can confirm that something weird may be happening. I had previously run experiments on a GTX 480 (the closest we have to your card) and, as I indicated earlier in this thread, could not reproduce any of the problems you mentioned. After your latest posts I went back to that machine and could reproduce the problems you mention, and many others! I requested exclusive use of the machine immediately after it was rebooted. I saw similar problems on the first several runs, but then it settled into some stable state and now everything works.

I'll do some more experimentation and probably talk to the NVIDIA people to see whether they have a suggestion as to what may cause this type of behavior. On our side, our system administrator mentioned that he has been getting frequent requests to reboot the machine because the numerical libraries we use run into some problem with the hardware.

Otherwise, I didn't see any problems in your make.inc file. I also compiled with GotoBLAS and have it running right now. For example, testing dgesv_gpu I get this:
Code: Select all
tomov:ig /mnt/scratch/tomov/sc_release/testing> ./testing_dgesv_gpu                            <- 12:09AM
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.6 MB memory
device 1: Tesla C870, 1350.0 MHz clock, 1535.8 MB memory
device 2: Tesla C870, 1350.0 MHz clock, 1535.8 MB memory

Usage:
  testing_dgesv_gpu -nrhs 100 -N 1024

  N     NRHS       GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024   100              11.44        2.959912e-15
 2048   100              42.81        2.718611e-15
 3072   100              72.24        4.025854e-15
 4032   100              90.23        2.707239e-14
 5184   100             102.98        1.153798e-14
 6016   100             109.03        1.001725e-14
 7040   100             115.95        1.922823e-14
 8064   100             122.28        6.464400e-15
 9088   100             126.38        1.934203e-15
10112   100             129.89        3.381447e-15

This run uses MAGMA BLAS and 6 GotoBLAS threads (AMD Opteron 8439 SE); our system has 8 sockets of 6-core AMD Opterons. In general I would recommend running MAGMA 1.0 with as many BLAS threads as there are cores on one socket, e.g. as sketched below.
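
One way to pin that down with GotoBLAS2 (GOTO_NUM_THREADS is GotoBLAS's thread-count variable; on your 8-core machine the right count may differ):

Code: Select all
# limit the CPU BLAS to one socket's worth of threads, then run a tester
export GOTO_NUM_THREADS=6
./testing_dgesv_gpu -nrhs 100 -N 1024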

By the way, the GTX cards are very fast in single-precision arithmetic. While on the machine I played with some of the testers and was quite impressed by the single-precision complex performance, e.g.:
Code: Select all
tomov:ig /mnt/scratch/tomov/sc_release/testing> ./testing_cgeqrf_gpu
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.6 MB memory
device 1: Tesla C870, 1350.0 MHz clock, 1535.8 MB memory
device 2: Tesla C870, 1350.0 MHz clock, 1535.8 MB memory

Usage:
  testing_cgeqrf_gpu -M 1024 -N 1024

  M     N   CPU GFlop/s   GPU GFlop/s    ||R||_F / ||A||_F
==========================================================
 1024  1024   58.72         119.39        1.377159e-06
 2048  2048   65.61         230.87        1.838887e-06
 3072  3072   83.48         378.93        2.220599e-06
 4032  4032   88.20         495.47        2.574741e-06
 5184  5184   87.81         586.26        2.979100e-06
 6016  6016   90.36         657.67        3.179357e-06
 7040  7040   90.47         717.10        3.293436e-06
 8064  8064   92.54         760.13        3.393870e-06
 9088  9088   92.49         792.80        3.489388e-06
 9984  9984   92.99         814.09        3.586298e-06


Regards,
Stan

Re: NaN errors with dpotrf and dpotrf_gpu

Postby fletchjp » Wed Jan 05, 2011 4:59 am

Stan

Thank you for all the replies.

Here is the strange output from dgetrf, which I can get when starting from cold in the morning. I could not get it to repeat last night.

Is there a software tool I could run to see what is happening on the GPU? I usually run with the System Monitor on the main display and use NX to run the tests from another computer. I have only ever seen the strange behaviour with the d versions, never with s, c, or z.

Code: Select all
fletcher@fletcher-desktop:~/magma_1.0.0-rc2/testing$ ./testing_dgetrf
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
can not bind to texture
[... "can not bind to texture" repeated 54 times in total ...]
 1024  1024   20.90          46.07         nan
Argument 7 of dgetrf had an illegal value.
 2048  2048   33.02         520411.45         1.766772e-01
Argument 7 of dgetrf had an illegal value.
 3072  3072   35.84         1756603.11         1.767735e-01
Argument 7 of dgetrf had an illegal value.
 4032  4032   37.05         3971886.55         1.767382e-01
Argument 7 of dgetrf had an illegal value.
 5184  5184   37.86         8442055.40         1.767285e-01
Argument 7 of dgetrf had an illegal value.
 6016  6016   39.56         13194270.78         1.767584e-01
Argument 7 of dgetrf had an illegal value.
 7040  7040   39.49         19382027.38         1.767586e-01
Argument 7 of dgetrf had an illegal value.
 8064  8064   38.59         31778048.19         1.767457e-01
Argument 7 of dgetrf had an illegal value.
 9088  9088   37.90         45486777.31         1.767460e-01
Argument 7 of dgetrf had an illegal value.
10112 10112   38.19         62660668.82         1.767458e-01


Best wishes

John

Re: NaN errors with dpotrf and dpotrf_gpu

Postby emilb » Sun Feb 10, 2013 12:19 pm

I have some information that may shed some light on this. I'm working with a quantum molecular dynamics code called RMG that does subspace diagonalizations. It works well with LAPACK and ScaLAPACK, but I ran into the same sort of problems described here when I tried to use MAGMA: wrong, inconsistent, or error results, including illegal-parameter messages from DLASCL and texture errors. I had been doing my tests on my workstation, which has a GTX 560 card, but since that does not support ECC, I decided to try it on Blue Waters, which has K20X cards, in order to rule out memory errors.

Well, things were different. Still not right, but I noticed something interesting. RMG uses some random starting vectors. On Blue Waters the initial iteration with magma_dpotrf_gpu differed by a large amount from the LAPACK version, but the results started to converge after a few more iterations. So I then tried a starting position that did not include random vectors, and I got identical results from MAGMA and LAPACK. In the random case the matrix passed to dpotrf will not be diagonally dominant (that is, |a_ii| >= sum over j != i of |a_ij| does not hold for every row), while in the second case it will be. It appears to me that magma_dpotrf_gpu is not handling things correctly when the matrix is not well conditioned. I'll run some additional tests to try to confirm this. A sketch of the kind of matrix involved is below.
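
To make the distinction concrete, here is a minimal sketch (my own construction, not RMG's code) of an SPD Gram matrix built from random vectors, which is generally far from diagonally dominant. It factors the matrix with LAPACK's dpotrf_; factoring the same array with magma_dpotrf_gpu (after copying it to the GPU) and comparing the factors would test the hypothesis.

Code: Select all
/* gram_test.c: build A = V^T V from random vectors (SPD but not
 * diagonally dominant) and factor it with LAPACK's dpotrf_.
 * Compile: gcc -std=c99 gram_test.c -o gram_test -llapack -lblas -lm */
#include <stdio.h>
#include <stdlib.h>

extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);

int main(void)
{
    const int n = 512;
    double *V = malloc((size_t)n * n * sizeof *V);
    double *A = malloc((size_t)n * n * sizeof *A);
    int info;

    srand(1);
    for (int i = 0; i < n * n; i++)      /* random "starting vectors" */
        V[i] = 2.0 * rand() / RAND_MAX - 1.0;

    /* A(i,j) = sum_k V(k,i) * V(k,j), column-major storage */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += V[k + (size_t)i * n] * V[k + (size_t)j * n];
            A[i + (size_t)j * n] = s;
        }

    dpotrf_("L", &n, A, &n, &info);      /* compare magma_dpotrf_gpu here */
    printf("dpotrf info = %d\n", info);  /* info = 0 means success */

    free(V);
    free(A);
    return 0;
}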

Re: NaN errors with dpotrf and dpotrf_gpu

Postby fletchjp » Tue Apr 02, 2013 2:00 pm

Thank you. I have not been active here for a long time and am just restarting. I can see some light being shed on the reasons for the strange inconsistencies. If they are specific to cheaper hardware such as mine, that is worth knowing.

John

Re: NaN errors with dpotrf and dpotrf_gpu

Postby nashp » Tue May 14, 2013 7:41 am

Hi All,

I just came across this thread as I was about to post something similar myself. I experienced some NaN errors when using magmablas_dgemm; switching to cublasDgemm rectified the issue. The data, the arguments, and its position in the code were the same for both. I am using a Tesla C2075 and the shared version of the magma(blas) libraries (1.3). I'm not sure if this helps, but I could also try to reproduce the error if that would be of more use.

Peter

Re: NaN errors with dpotrf and dpotrf_gpu

Postby fletchjp » Thu Sep 12, 2013 12:03 pm

It is interesting that the last post concerns work on a Tesla, i.e. not one of the cheaper cards.
