magma_cgesv stability

Postby mh1 » Fri Mar 15, 2013 5:10 pm

I have downloaded, installed, and integrated MAGMA 1.3 into my application. I am currently focusing on magma_cgesv.

MAGMA 1.3
CUDA 5.0
Ubuntu 11.04 and 12.04

I have run several tests on a 580 card (GPU_TARGET = Fermi). One test solves a linear system of size 9072x9072; the results appear to be in line with my MKL CPU-based results, as expected.
I ran the same test on a K10 card (GPU_TARGET = Fermi). The results again appear to be in line with my CPU-based results.
I ran the same test on the same K10 card (GPU_TARGET = Kepler). The data is way off from the MKL CPU results, though not NaNs.

I ran another test with the 580 card (GPU_TARGET = Fermi) on a 35790x35790 system. The accuracy is way off compared to the MKL CPU results.
I ran the same test on a K10 card (GPU_TARGET = Fermi). The results again appear way off from the CPU-based results. Moreover, the K10 and 580 results, both being wrong, are also very dissimilar.
I ran the same test on a K10 card (GPU_TARGET = Kepler). I receive NaNs as output.

Conclusion: MAGMA appears to give different results depending on the GPU_TARGET specified, and in some cases it produces NaNs.

My make.inc is below. I only change GPU_TARGET for each test. Is there something I have not properly configured that would cause these accuracy problems?

------------------------------ make.inc -------------------------------------------------
# GPU_TARGET specifies for which GPU you want to compile MAGMA:
GPU_TARGET = Fermi
#GPU_TARGET = Kepler
#GPU_TARGET = Tesla

CC = icc
NVCC = nvcc
FORT = gfortran -DCUBLAS_GFORTRAN

ARCH = ar
ARCHFLAGS = cr
RANLIB = ranlib

# CUDA 5 needs -DCUBLAS_GFORTRAN or appropriate setting to compile cuda/src/fortran.c bindings
OPTS = -O3 -DADD_ -DCUBLAS_GFORTRAN
F77OPTS = -O3 -DADD_
FOPTS = -O3 -DADD_
NVOPTS = -O3 -DADD_ --compiler-options -fno-strict-aliasing -DUNIX
LDOPTS = -fPIC -Xlinker -zmuldefs

# old MKL
#LIB = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -lguide -lpthread -lcublas -lcudart -lm

# see MKL Link Advisor at http://software.intel.com/sites/products/mkl/
# icc with MKL 10.3
LIB = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lcublas -lcudart -lm -openmp

# define these in your environment
MKLROOT = /opt/intel/mkl
CUDADIR = /usr/local/cuda
#CUDADIR = /usr/local/cuda-5.0

LIBDIR = -L$(MKLROOT)/lib/intel64 \
-L$(CUDADIR)/lib64
INC = -I$(CUDADIR)/include
------------------------------ make.inc -------------------------------------------------

Re: magma_cgesv stability

Postby mgates3 » Tue Mar 19, 2013 1:50 pm

Your make.inc looks okay to me. The 9k problem probably fits in your GPU memory, while the large 35k system will run non-GPU-resident, so the algorithm is significantly different. With GPU_TARGET=Fermi, MAGMA will use the MAGMA BLAS GEMM, whereas with GPU_TARGET=Kepler, it should be using CUBLAS GEMM.
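For scale, a 9072x9072 single-complex matrix is about 9072² × 8 bytes ≈ 0.66 GB, while 35790x35790 is about 10.2 GB. A rough host-side check of whether a given N fits in the currently free GPU memory (back-of-the-envelope only, not MAGMA's internal decision logic) could look like this:

Code: Select all
// Rough check: does an N x N single-complex matrix fit in the free GPU memory?
// (Estimate only; MAGMA also needs workspace beyond the matrix itself.)
#include <cuda_runtime.h>
#include <cstdio>

bool fits_on_gpu(int n)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    size_t need = (size_t)n * n * 2 * sizeof(float);   // cuFloatComplex = 2 floats
    printf("need %.1f MB, free %.1f MB\n", need / 1e6, free_bytes / 1e6);
    return need < free_bytes;
}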

By results, I assume you mean the solution X. The LU factors are usually different for different implementations, because different block sizes are used and different pivots get selected. To verify the LU factors, you have to check the backward error, ||A - LU|| / ||A||.
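For reference, here is a rough sketch (not the testers' exact code) of the kind of residual check the testers print, ||B - AX|| / (||A||*||X||), computed on the host with MKL's CBLAS/LAPACKE; the helper name and the choice of the one-norm are just for illustration:

Code: Select all
#include <mkl_cblas.h>
#include <mkl_lapacke.h>
#include <cuComplex.h>
#include <vector>

// Compute || B - A*X || / ( ||A|| * ||X|| ) in the one-norm, column-major.
float residual_cgesv(int n, int nrhs,
                     const cuFloatComplex* A, int lda,   // original matrix
                     const cuFloatComplex* B, int ldb,   // original right-hand sides
                     const cuFloatComplex* X)            // solution, leading dimension ldb
{
    // R = B, then R = B - A*X via cgemm.
    std::vector<cuFloatComplex> R(B, B + (size_t)ldb * nrhs);
    const float one[2]  = {  1.0f, 0.0f };
    const float mone[2] = { -1.0f, 0.0f };
    cblas_cgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, nrhs, n, mone, A, lda, X, ldb, one, R.data(), ldb);

    // cuFloatComplex and lapack_complex_float have the same layout (two floats).
    float normA = LAPACKE_clange(LAPACK_COL_MAJOR, '1', n, n,
                                 (const lapack_complex_float*) A, lda);
    float normX = LAPACKE_clange(LAPACK_COL_MAJOR, '1', n, nrhs,
                                 (const lapack_complex_float*) X, ldb);
    float normR = LAPACKE_clange(LAPACK_COL_MAJOR, '1', n, nrhs,
                                 (const lapack_complex_float*) R.data(), ldb);

    return normR / (normA * normX);
}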

Can you post a snippet of your code where you call magma, and what are all the parameters (e.g., n, nrhs, lda, ldb)? Did you check the info value?

Do you see these same issues when using the magma testers? For example:

Code: Select all
GPU_TARGET=Fermi, on Fermi card
romulus ~/magma-1.3.0-fermi/testing> ./testing_cgesv -N 9072 -R 1
MAGMA 1.3.0
device 0: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9072      1    509.33 (   3.91)   1.78e-06


GPU_TARGET=Fermi, on Kepler card
bunsen ~/magma-1.3.0-fermi/testing> ./testing_cgesv -N 9072 -R 1
MAGMA 1.3.0
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9072      1    887.86 (   2.24)   1.72e-06


GPU_TARGET=Kepler, on Kepler card
bunsen ~/magma-1.3.0-kepler/testing> ./testing_cgesv -N 9072 -R 1
MAGMA 1.3.0
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9072      1    885.16 (   2.25)   1.72e-06


While the performance for this cgesv appears the same for different GPU_TARGET, the performance for dgesv is different:

Code: Select all
GPU_TARGET=Fermi, on Kepler card
bunsen ~/magma-1.3.0-fermi/testing> ./testing_dgesv -N 9072 -R 1
MAGMA 1.3.0
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9072      1    296.72 (   1.68)   1.80e-15


GPU_TARGET=Kepler, on Kepler card
bunsen ~/magma-1.3.0-kepler/testing> ./testing_dgesv -N 9072 -R 1
MAGMA 1.3.0
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9072      1    369.89 (   1.35)   1.64e-15

Re: magma_cgesv stability

Postby mh1 » Wed Mar 20, 2013 4:12 pm

Thank you for your reply. Some answers.

[1] Yes, I am referring to the solution from cgesv. The tests were run by solving a linear system generated from my application, not the testing_cgesv.cpp test. I have plotting tools that take the solution from cgesv and apply it to application-specific data. In some cases this solution yields the expected results, but when GPU RAM is exhausted, or when targeting Kepler, the solution is very different from the equivalent MKL CPU cgesv calls. See more below.

[2] I do receive the following info value when running on a card whose RAM is exhausted. Is this a fatal error, or should MAGMA attempt to solve the system on the CPU? I noticed your tests did not include one in which the GPU RAM is exhausted. Can you please confirm you see this problem as well?
#define MAGMA_ERR_DEVICE_ALLOC -113

[3] I built a custom test harness with the essential content below. This should demonstrate how I call cgesv. See [4] below too.
Code: Select all
  // Includes needed for this snippet: <cstdlib>, <cstring>, <ctime>,
  // <cuComplex.h>, and "magma.h".
  magma_int_t N = 5000;
  magma_int_t NRHS = 2;
  magma_int_t lda = N;
  magma_int_t ldb = N;
  magma_int_t* ipiv = new magma_int_t[N];
  magma_int_t info;

  srand(time(NULL));

  // Fill A (column-major, leading dimension lda) with random complex entries.
  cuFloatComplex* A = new cuFloatComplex[N*N];

  for (magma_int_t iRow = 0; iRow < N; iRow++)
  {
    for (magma_int_t jCol = 0; jCol < N; jCol++)
    {
      float re = ((float)rand())/((float)RAND_MAX);
      float im = ((float)rand())/((float)RAND_MAX);

      A[jCol*lda + iRow].x = re;
      A[jCol*lda + iRow].y = im;
    }
  }

  // Fill B (column-major, leading dimension ldb) with random right-hand sides.
  cuFloatComplex* b = new cuFloatComplex[N * NRHS];

  for (magma_int_t iRow = 0; iRow < N; iRow++)
  {
    for (magma_int_t jCol = 0; jCol < NRHS; jCol++)
    {
      float re = ((float)rand())/((float)RAND_MAX);
      float im = ((float)rand())/((float)RAND_MAX);

      b[jCol*ldb + iRow].x = re;
      b[jCol*ldb + iRow].y = im;
    }
  }

  // Work on copies: luA is overwritten by the LU factors, x by the solution.
  cuFloatComplex* luA = new cuFloatComplex[N*N];
  memcpy(luA, A, N * N * sizeof(cuFloatComplex));

  cuFloatComplex* x = new cuFloatComplex[N * NRHS];
  memcpy(x, b, N * NRHS * sizeof(cuFloatComplex));

  magma_cgesv(N, NRHS, luA, lda, ipiv, x, ldb, &info);
  // info should be checked here; see [2] above.


[4] I had several linking problems while building the MAGMA tests; testing_cgesv.cpp did not build, for example. I posted my make.inc before, and here are the linking errors I received at the end. I would like to get past these linking problems so that we can have a common test harness in the discussion. Can you help with these link errors?

Code: Select all
gfortran -DCUBLAS_GFORTRAN -fPIC -Xlinker -zmuldefs -DGPUSHMEM=200 testing_zgetrf_gpu_f.o fortran.o -o testing_zgetrf_gpu_f \
   libtest.a lin/liblapacktest.a -L../lib -lmagma -lmagmablas -lmagma \
   -L/opt/intel/mkl/lib/intel64 -L/usr/local/cuda/lib64 \
   -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lcublas -lcudart -lm -openmp
../lib/libmagma.a(zgeqrf3_gpu.o): In function `magma_zgeqrf3_gpu':
zgeqrf3_gpu.cpp:(.text+0x31b): undefined reference to `_intel_fast_memset'
../lib/libmagma.a(zgeqrf_gpu.o): In function `magma_zgeqrf_gpu':
zgeqrf_gpu.cpp:(.text+0x34b): undefined reference to `_intel_fast_memset'
../lib/libmagma.a(auxiliary.o): In function `sp_cat':
auxiliary.cpp:(.text+0x50b): undefined reference to `_intel_fast_memset'
../lib/libmagma.a(xerbla.o): In function `magma_xerbla':
xerbla.cpp:(.text+0x2b): undefined reference to `__intel_sse2_strlen'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_ok_to_fork'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_end_single'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_ordered'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_for_static_init_8'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `omp_get_thread_num'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_barrier'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `omp_get_num_threads'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `omp_get_num_procs'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_dispatch_next_4'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_end_reduce_nowait'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_critical'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_dispatch_fini_8'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_atomic_cmplx8_add'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_atomic_float4_add'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_serialized_parallel'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_end_critical'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_dispatch_init_8'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `ompc_set_nested'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_for_static_init_8u'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `omp_get_nested'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_dispatch_fini_4'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `omp_in_parallel'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_push_num_threads'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_reduce_nowait'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `omp_get_max_threads'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_for_static_init_4'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_end_serialized_parallel'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_flush'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_single'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_dispatch_next_8'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_atomic_float8_add'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_end_master'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_dispatch_init_4'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_global_thread_num'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_end_ordered'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_fork_call'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_atomic_fixed8_add'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_for_static_fini'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_master'
/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so: undefined reference to `__kmpc_atomic_cmplx4_add'
collect2: ld returned 1 exit status
make[1]: *** [testing_zgetrf_gpu_f] Error 1
make[1]: Leaving directory `/home/mh1/Development/GPU/MAGMA1.3/magma-1.3.0_Fermi/testing'
make: *** [test] Error 2

Re: magma_cgesv stability

Postby mgates3 » Thu Mar 21, 2013 1:17 pm

Error -113 is out-of-memory on the GPU. The factorization has NOT been computed. This is odd, though, because it should switch to a non-GPU-resident algorithm. In other words, it should still use the GPU to solve it, just cycling parts of the matrix through the available GPU memory (and return info=0). I have solved problems up to N=100,000 before, which uses the non-GPU-resident algorithm. I will have to investigate some more why that might be failing.
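In the meantime, your application should at least check info after the call. Below is a rough sketch (one possible workaround, not something magma_cgesv does for you) that detects MAGMA_ERR_DEVICE_ALLOC and redoes the solve on the CPU with MKL's LAPACKE_cgesv, using the variable names from your snippet:

Code: Select all
magma_cgesv(N, NRHS, luA, lda, ipiv, x, ldb, &info);

if (info == MAGMA_ERR_DEVICE_ALLOC) {
    // The factorization was not computed. Restore luA and x from the saved
    // copies of A and b (in case MAGMA touched them before failing), then
    // solve entirely on the CPU.
    memcpy(luA, A, (size_t)lda * N    * sizeof(cuFloatComplex));
    memcpy(x,   b, (size_t)ldb * NRHS * sizeof(cuFloatComplex));
    info = LAPACKE_cgesv(LAPACK_COL_MAJOR, N, NRHS,
                         (lapack_complex_float*) luA, lda,
                         (lapack_int*) ipiv,
                         (lapack_complex_float*) x, ldb);
}
if (info != 0) {
    fprintf(stderr, "cgesv failed, info = %d\n", (int) info);
}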

On a cursory view, your code looks fine. One minor comment, though: it should allocate lda*N instead of N*N. Since you set lda = N, these are the same, but if you later change lda, it will create problems. The memory copies should also be lda*N, and likewise with ldb.
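For example, a minimal sketch of that sizing, using your variable names:

Code: Select all
// Size everything by the leading dimensions so changing lda/ldb later stays safe.
cuFloatComplex* A   = new cuFloatComplex[(size_t)lda * N];
cuFloatComplex* b   = new cuFloatComplex[(size_t)ldb * NRHS];
cuFloatComplex* luA = new cuFloatComplex[(size_t)lda * N];
cuFloatComplex* x   = new cuFloatComplex[(size_t)ldb * NRHS];

memcpy(luA, A, (size_t)lda * N    * sizeof(cuFloatComplex));
memcpy(x,   b, (size_t)ldb * NRHS * sizeof(cuFloatComplex));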

That linking error occurs because you are using gfortran instead of ifort. Add -liomp5 to LIB for the __kmpc errors and -lirc for the _intel_fast_memset errors. But since you have icc, I suggest using ifort instead. Those are just the couple of Fortran testers, anyhow. Did all the C testers compile okay?
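In make.inc that would look something like this (your existing LIB line with the two libraries appended):

Code: Select all
LIB = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread \
      -lcublas -lcudart -lm -openmp -liomp5 -lirc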

-mark

Re: magma_cgesv stability

Postby mh1 » Thu Mar 21, 2013 3:20 pm

Your linking suggestions are correct, and I have now successfully compiled testing_cgesv.cpp. I am now revisiting the original issues using the as-delivered MAGMA testing_cgesv.cpp. The original issues remain, but they are now demonstrated with the MAGMA-delivered examples.

ISSUE1: MAGMA (GPU_TARGET = Fermi) cgesv fails if GPU RAM is exhausted. Here is the test I ran that demonstrates this issue. Notice where the failures (-113) take place (at N = 10000 and 11000, but not at the neighboring sizes) as the GPU RAM limit is approached. Can you please confirm the failure or suggest a fix?

Code: Select all
$ ./testing_cgesv -N 9000 -N 10000 -N 10112 -N 11000 -N 11136
MAGMA 1.3.0
device 0: GeForce GTX 580, 1544.0 MHz clock, 1535.7 MB memory, capability 2.0
device 1: GeForce GT 430, 1400.0 MHz clock, 1023.6 MB memory, capability 2.1

Usage: ./testing_cgesv -N <matrix size> -R <right hand sides>
  -N can be repeated up to 10 times

    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9000    100    292.16 (   6.88)   1.24e-06
magma_cgesv returned error -113.
10000    100   2565.84 (   1.07)   7.38e-01
10112    100    326.44 (   8.70)   1.78e-06
magma_cgesv returned error -113.
11000    100   4120.02 (   0.88)   7.53e-01
11136    100    399.42 (   9.47)   1.85e-06


ISSUE2: MAGMA (GPU_TARGET = Kepler) cgesv is failing in general, regardless of RAM issues. Can you please confirm the failure or suggest a fix?

Code: Select all
$ ./testing_cgesv
MAGMA 1.3.0
device 0: Tesla K10.G2.8GB, 745.0 MHz clock, 3583.6 MB memory, capability 3.0
device 1: Tesla K10.G2.8GB, 745.0 MHz clock, 3583.6 MB memory, capability 3.0

Usage: ./testing_cgesv -N <matrix size> -R <right hand sides>
  -N can be repeated up to 10 times

    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 1024    100      5.39 (   0.69)        nan
magma_cgesv returned error 1474.
 2048    100     66.18 (   0.40)   7.70e-01
 3072    100     99.51 (   0.85)        nan
 4032    100    189.13 (   0.99)        nan
 5184    100    272.26 (   1.44)        nan
 6016    100    320.84 (   1.90)        nan
 7040    100    432.45 (   2.24)        nan
 8064    100    470.32 (   3.08)        nan
 9088    100    543.88 (   3.80)        nan
10112    100    543.33 (   5.23)        nan

Re: magma_cgesv stability

Postby mgates3 » Thu Mar 21, 2013 5:03 pm

For issue 2, in Makefile.internal, try changing -arch sm_35 to -arch sm_30, and re-compiling at least the magmablas directory, e.g., from the magma-1.3.0 directory:

Code: Select all
cd magmablas
make clean
cd ..
make

It was compiling only for architecture 3.5 (e.g., Kepler K20) instead of 3.0 (e.g., Kepler K10). I've fixed this for the next release.
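If it helps to locate the flag, something like this from the magma-1.3.0 directory (the exact line it reports may differ in your copy):

Code: Select all
grep -n "sm_35" Makefile.internal
# edit that line to use -arch sm_30, then rebuild as above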

-mark

Re: magma_cgesv stability

Postby mh1 » Thu Mar 21, 2013 6:51 pm

That does indeed fix Issue2 (sm_35 vs sm_30). I have both cards, and I have verified the fix on K10 for now. I will follow up for K20 if the error appears in that test.

Were you able to replicate Issue1?

Thank you for your help.

Re: magma_cgesv stability

Postby mh1 » Wed Mar 27, 2013 9:31 am

Hello,

It appears MAGMA does not solve linear systems (via cgesv) that do not fit entirely within GPU RAM, contrary to the documentation.

I was hoping someone from the MAGMA team could confirm the experiment below which demonstrates MAGMA cgesv does not solve beyond GPU RAM.

This was one feature that made MAGMA attractive. I want to make sure it is indeed a MAGMA problem and not something I have done.

Thank you for your time.

Code: Select all
MAGMA 1.3.0
device 0: GeForce GTX 580, 1544.0 MHz clock, 1535.7 MB memory, capability 2.0
device 1: GeForce GT 430, 1400.0 MHz clock, 1023.6 MB memory, capability 2.1

Usage: ./testing_cgesv -N <matrix size> -R <right hand sides>
  -N can be repeated up to 10 times

    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
 9000    100    292.16 (   6.88)   1.24e-06
magma_cgesv returned error -113.
10000    100   2565.84 (   1.07)   7.38e-01
10112    100    326.44 (   8.70)   1.78e-06
magma_cgesv returned error -113.
11000    100   4120.02 (   0.88)   7.53e-01
11136    100    399.42 (   9.47)   1.85e-06

Re: magma_cgesv stability

Postby mgates3 » Wed Mar 27, 2013 5:36 pm

As I said before, determining when & why this is occurring will take some investigation. Since we do not have the exact GPU card that you are using, we cannot exactly reproduce your results. I have successfully solved matrices larger than the GPU memory (below). Notice even in your own results that some of the matrices larger than 10000 are solved correctly, so it is an intermittent error, probably depending on the matrix size, GPU memory size, and what is currently in the GPU memory.

Code: Select all
> ./testing_cgesv -N 500 -N 17000 -N 18000 -N 19000
MAGMA 1.3.0
device 0: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
Usage: ./testing_cgesv [options] [-h|--help]

    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
  500      1     10.22 (   0.03)   1.99e-07
17000      1    659.61 (  19.87)   2.32e-06
18000      1    665.99 (  23.36)   2.15e-06
19000      1    644.43 (  28.39)   5.45e-05
> ./testing_cgesv -N 20000 -N 25000
MAGMA 1.3.0
device 0: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
Usage: ./testing_cgesv [options] [-h|--help]

    N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
20000      1    517.65 (  41.22)   1.62e-05
25000      1    523.51 (  79.60)   5.37e-05


-mark

Re: magma_cgesv stability

Postby mh1 » Thu Mar 28, 2013 11:40 am

OK, I misunderstood the diagnosis of the problem. I am able to demonstrate the GPU RAM failure on all my cards (430, 580, K10).

Thank you again for your help.
