Crash in CGESV

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

Postby mh1 » Tue Jun 02, 2015 12:07 pm

Platform : Ubuntu 14.04, Intel 15.0 (w/ MKL), driver 346.46, CUDA 7.0, 3x680.

MAGMA crashed on a linear system Ax = b w/ these dimensions : A is 51210x51210 and b is 51210x2

Here is the error received :

Code: Select all
CUBLAS error: memory mapping error (11) in magmablas_csetmatrix_transpose_mgpu at csetmatrix_transpose_mgpu.cu:58
CUDA runtime error: an illegal memory access was encountered (77) in magma_cgetrf_m at cgetrf_m.cpp:225
CUDA runtime error: an illegal memory access was encountered (77) in magma_cgetrf_m at cgetrf_m.cpp:226


I suppose my first question is: is there a problem size limitation for MAGMA? I have physical problems that can go into the 1M+ range, but I don't expect to use MAGMA on such problems. In fact, I am initially targeting 50Kx50K sized systems, but MAGMA is performing so well that I would be interested in taking it higher. So how high can I go? And any ideas on the crash?
mh1
 
Posts: 30
Joined: Thu Mar 14, 2013 4:24 pm

Re: Crash in CGESV

Postby hsahasra » Tue Jun 02, 2015 12:41 pm

Which GPU do you have? How big a problem you can solve depends on the GPU and its memory.
hsahasra
 
Posts: 32
Joined: Mon Jun 24, 2013 3:40 pm

Re: Crash in CGESV

Postby mh1 » Tue Jun 02, 2015 1:12 pm

Platform : Ubuntu 14.04, Intel 15.0 (w/ MKL), driver 346.46, CUDA 7.0, 3x680. Specifically GeForce GTX 680 (3 of them). I actually have many GPUs but this is the machine it crashed on.

Why does GPU memory matter? My understanding is that MAGMA can solve beyond the extents of GPU memory by transferring parts of the problem to the CPU (as needed). In fact, I am already solving many problems beyond the extents of GPU memory, but this one in particular crashed. It seems I should instead be bound by CPU memory.

Re: Crash in CGESV

Postby mgates3 » Wed Jun 03, 2015 10:44 am

Did you compile with LP64 (32-bit integers, 64-bit pointers) or ILP64 (64-bit integers and pointers)? Usually we use LP64, but for very large systems you need to use ILP64. The problem is that 47000 * 47000 > 2**31, so offsets into the matrix overflow a 32-bit int. See make.inc.mkl-icc-ilp64.
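
To make the overflow concrete, here is a minimal sketch of the arithmetic. It is not MAGMA code: `element_count` and `offsets_fit_in_int32` are hypothetical helper names, and the only assumption is column-major storage, where the largest element offset is lda * n - 1.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of MAGMA): number of elements in an
// lda-by-n column-major matrix, computed in 64-bit arithmetic so the
// product itself cannot overflow for the sizes discussed here.
inline std::int64_t element_count(std::int64_t lda, std::int64_t n) {
    return lda * n;
}

// True when the largest element offset (lda * n - 1) still fits in a
// signed 32-bit int, i.e. when an LP64 build can address the whole matrix.
inline bool offsets_fit_in_int32(std::int64_t lda, std::int64_t n) {
    return element_count(lda, n) - 1 <= std::numeric_limits<std::int32_t>::max();
}
```

With these helpers, N = 51210 from the original report gives 2,622,464,100 elements, which is past the signed 32-bit limit of 2,147,483,647. The crossover for a square matrix is between N = 46340 and N = 46341.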

Yes, MAGMA's CPU memory interface (magma_cgesv) should automatically use an out-of-GPU-memory algorithm for large systems. I have solved systems as large as 100k x 100k on even older Fermi cards with 2687.4 MB of memory.

-mark
mgates3
 
Posts: 750
Joined: Fri Jan 06, 2012 2:13 pm

Re: Crash in CGESV

Postby mh1 » Wed Jun 03, 2015 4:34 pm

Hi Mark,

I initially compiled with this :

Code: Select all
LIB       = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lcublas -lcudart -lstdc++ -lm


I got the error noted above. I then took your suggestion and compiled with the ilp64 make.inc you recommend. Now I get a crash, and nothing dumped to standard out or standard error.

BTW, I also tried a slightly smaller problem : 41316x41316. It too fails with ilp64 (double free or corruption (!prev): 0x00000000056e3060 ***), but I have successfully solved it with an lp64 compile.

Any other ideas?

Re: Crash in CGESV

Postby mgates3 » Wed Jun 03, 2015 5:02 pm

When you say that you compiled with ILP64, did you do a 'make clean' first, or otherwise start from a fresh MAGMA source without any previously compiled .o, .a, or .so files?

And your CFLAGS and NVCCFLAGS have -DMKL_ILP64, and Fortran has -fdefault-integer-8, as shown in make.inc.mkl-icc-ilp64? Changing LIB alone is not sufficient.

Are your tests using MAGMA's testers, or your own code calling MAGMA? Your own code would also need to have -DMKL_ILP64 set when it is compiled. All the code has to agree that magma_int_t is 64-bit when it is compiled. If using MAGMA's testers, please include the complete input & output, as shown below, which greatly aids in understanding the issue you see.

After you compiled with ILP64, did small problems work, say N=1000? If nothing works, then it's likely that something didn't get recompiled to use ILP64.

Here are results on our machine (2x8 core Sandy Bridge, K40c). This matrix is 18.6 GB, while the card has 11.2 GB.

Code: Select all
bunsen magma-trunk/testing> ./testing_cgesv -N 50000
MAGMA 1.6.2 svn compiled for CUDA capability >= 3.5
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.3, MKL threads 16.
device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]

ngpu 1
    N  NRHS   CPU Gflop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
50000     1     ---   (  ---  )   2362.76 ( 141.09)   2.10e-10   ok
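
The 18.6 GB figure for this matrix can be reproduced with a short calculation. The sketch below is only an illustration: `cmatrix_gib` is a hypothetical helper, and it assumes 8 bytes per single-precision complex element (two 4-byte floats) and reports binary gigabytes (2**30 bytes).

```cpp
#include <cstdint>

// Hypothetical helper (not part of MAGMA): storage needed for an n-by-n
// single-precision complex matrix, in GiB (2^30 bytes).
// Assumes 8 bytes per element: a 4-byte real part plus a 4-byte imaginary part.
inline double cmatrix_gib(std::int64_t n) {
    const double bytes = static_cast<double>(n) * static_cast<double>(n) * 8.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For N = 50000 this gives roughly 18.6, and for the N = 51210 case from the original report roughly 19.5, so both matrices are well beyond the memory of a single card and rely on the out-of-GPU-memory path.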


-mark

Re: Crash in CGESV

Postby mh1 » Thu Oct 22, 2015 3:07 pm

I have made some more progress on this problem but I am still seeing issues. Here is what has been done so far :

[1] I use the make.inc below to compile my ILP64 MAGMA library. BTW, should I set -DMAGMA_ILP64 somewhere? I see it in a note but I don't see it set. The documentation suggests it should be set, but looking in magma_types.h it appears that setting MKL_ILP64 is enough to change magma_int_t. Just making sure.

Code: Select all
#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 1.6.1) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of Colorado, Denver
#      @date January 2015
#//////////////////////////////////////////////////////////////////////////////

# GPU_TARGET contains one or more of Tesla, Fermi, or Kepler,
# to specify for which GPUs you want to compile MAGMA:
#     Tesla  - NVIDIA compute capability 1.x cards (no longer supported in CUDA 6.5)
#     Fermi  - NVIDIA compute capability 2.x cards
#     Kepler - NVIDIA compute capability 3.x cards
# The default is "Fermi Kepler".
# See http://developer.nvidia.com/cuda-gpus
#
#GPU_TARGET ?= Fermi Kepler

CC        = icc
CXX       = icpc
NVCC      = nvcc
FORT      = ifort

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

# Use -fPIC to make shared (.so) and static (.a) library;
# can be commented out if making only static library.
FPIC      = -fPIC

CFLAGS    = -O3 $(FPIC) -DADD_ -Wall -openmp -DMAGMA_SETAFFINITY -DMAGMA_WITH_MKL
FFLAGS    = -O3 $(FPIC) -DADD_ -warn all -warn nounused
F90FLAGS  = -O3 $(FPIC) -DADD_ -warn all -warn nounused
NVCCFLAGS = -O3         -DADD_       -Xcompiler "-fno-strict-aliasing $(FPIC)"
LDFLAGS   =     $(FPIC)              -openmp

# Defining MAGMA_ILP64 or MKL_ILP64 changes magma_int_t to int64_t in include/magma_types.h
CFLAGS    += -DMKL_ILP64
FFLAGS    += -fdefault-integer-8
F90FLAGS  += -fdefault-integer-8
NVCCFLAGS += -DMKL_ILP64

# Options to do extra checks for non-standard things like variable length arrays;
# it is safe to disable all these
CFLAGS   += -pedantic -Wno-long-long
#CFLAGS   += -Werror  # uncomment to ensure all warnings are dealt with
CXXFLAGS := $(CFLAGS) -std=c++98
CFLAGS   += -std=c99

# IMPORTANT: this link line is for 64-bit int !!!!
# For regular 64-bit builds using 64-bit pointers and 32-bit int,
# use the lp64 library, not the ilp64 library. See make.inc.mkl-gcc or make.inc.mkl-icc.
# see MKL Link Advisor at http://software.intel.com/sites/products/mkl/
# icc with MKL 10.3, Intel threads, 64-bit int
# note -DMAGMA_ILP64 or -DMKL_ILP64, and -fdefault-integer-8 in FFLAGS above
LIB       = -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lpthread -lcublas -lcudart -lstdc++ -lm

# define library directories preferably in your environment, or here.
# for MKL run, e.g.: source /opt/intel/composerxe/mkl/bin/mklvars.sh intel64
#MKLROOT ?= /opt/intel/composerxe/mkl
#CUDADIR ?= /usr/local/cuda
-include make.check-mkl
-include make.check-cuda

LIBDIR    = -L$(CUDADIR)/lib64 \
            -L$(MKLROOT)/lib/intel64

INC       = -I$(CUDADIR)/include \
            -I$(MKLROOT)/include


[2] I compile my application with -DMKL_ILP64. I also link with -lmkl_intel_ilp64 and the ILP64 based MAGMA shared object from [1].

[3] I call MAGMA in my application like so :

Code: Select all
  magma_int_t N = A.numRows;
  magma_int_t NRHS = b.numColumns;
  magma_int_t lda = A.leadingDimension;
  magma_int_t ldb = b.leadingDimension;
  magma_int_t* ipiv = new magma_int_t[A.numRows];
  magma_int_t info;

  int sts = magma_cgesv(N, NRHS, (magmaFloatComplex *)A.data, lda, ipiv, (magmaFloatComplex *)b.data, ldb, &info);


I then ran the following tests :

Test 1 : Solve an application problem with 41,316 unknowns using :
(a) MKL library <Success>
(b) MAGMA with one 690 GPU device <Success>
(c) MAGMA with two 690 GPU devices <Failure with message "CUDA runtime error: invalid device pointer (17) in magma_cgetrf_m at cgetrf_m.cpp:348">

Test 2 : Solve an application problem with 51,210 unknowns using :
(a) MKL library <Success>
(b) MAGMA with one 690 GPU device <Success>
(c) MAGMA with two 690 GPU devices <Failure with message "Floating point exception ..." at command line (not in standard error)>

Test 3 : Solve an application problem with 15,406 unknowns using :
(a) MKL library <Success>
(b) MAGMA with one 690 GPU device <Success>
(c) MAGMA with two 690 GPU devices <Success>

Test 4 : Ran the cgesv ILP64 based tester with 51,210 <Success>

Code: Select all
mark.horn@sid ~/Development/MAGMA/magma-1.6.2/testing $ ./testing_cgesv -N 51210
MAGMA 1.6.2  compiled for CUDA capability >= 3.0
CUDA runtime 7000, driver 7000. OpenMP threads 12. MKL 11.2.0, MKL threads 6.
ndevices 2
device 0: GeForce GTX 690, 1019.5 MHz clock, 2047.7 MB memory, capability 3.0
device 1: GeForce GTX 690, 1019.5 MHz clock, 2047.8 MB memory, capability 3.0
Usage: ./testing_cgesv [options] [-h|--help]

ngpu 1
    N  NRHS   CPU Gflop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
51210     1     ---   (  ---  )    963.27 ( 371.80)   1.99e-10   ok


BTW, I inject these environment variables at run-time to control multi-GPUs :
MAGMA_NUM_GPUS=2
CUDA_VISIBLE_DEVICES=0,1
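
When comparing single-GPU and multi-GPU runs, it can help to log the values the process actually sees for these two variables. The following is a minimal sketch using only the C++ standard library; `env_as_long` and `env_or_unset` are hypothetical helper names, and this is not how MAGMA itself reads the variables.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: read an integer environment variable, returning
// `fallback` when it is unset, empty, or not a clean base-10 integer.
inline long env_as_long(const char* name, long fallback) {
    const char* text = std::getenv(name);
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    return (*end == '\0') ? value : fallback;  // reject trailing characters
}

// Hypothetical helper: raw value of a variable, or "(unset)" if absent.
inline std::string env_or_unset(const char* name) {
    const char* text = std::getenv(name);
    return text ? std::string(text) : std::string("(unset)");
}
```

An application could print `env_as_long("MAGMA_NUM_GPUS", 1)` and `env_or_unset("CUDA_VISIBLE_DEVICES")` next to each test result, so that a failure report records the device configuration it ran with.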

So it appears my application is compiled correctly with ILP64 and handles the 51,210 problem using MKL with no problems. ILP64 MAGMA also appears to work in my application with a single device, but not (in general) with multiple devices. Based on the tests above, I do not see a clear indication of what is causing the failures, and they appear to fail for different reasons. I can get multiple GPUs to work (with stability) in my application if I do an LP64 compile, but that is limited in size as discussed.

Any other ideas? Am I compiling correctly?

Re: Crash in CGESV

Postby mgates3 » Fri Oct 23, 2015 2:15 pm

Yes, we recognize either MKL_ILP64 or MAGMA_ILP64. It's here:

include/magma_types.h:#if defined(MAGMA_ILP64) || defined(MKL_ILP64)
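
For anyone who wants to confirm which integer width their own translation units are compiled with, here is a minimal sketch of the same macro test. It uses a stand-in type so it compiles without MAGMA installed: `sketch_int_t` and `sketch_int_bits` are hypothetical names, and the two typedef branches are an assumption about what the header selects, not a copy of it.

```cpp
#include <cstdint>

// Stand-in for the macro test quoted above. sketch_int_t is a hypothetical
// name; the real type is magma_int_t, declared in include/magma_types.h.
#if defined(MAGMA_ILP64) || defined(MKL_ILP64)
typedef std::int64_t sketch_int_t;  // ILP64 build: 64-bit integers
#else
typedef int sketch_int_t;           // LP64 build: plain int
#endif

// Width, in bits, of the integer type this translation unit was compiled with.
inline int sketch_int_bits() {
    return static_cast<int>(sizeof(sketch_int_t) * 8);
}
```

Compiling this with and without -DMKL_ILP64 should change the reported width. In an application that includes the MAGMA headers and is built as C++11 or later, a check along the lines of `static_assert(sizeof(magma_int_t) == 8, "expected ILP64")` would catch a missing flag at compile time.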

Everything looks fine to me. I'll run some tests to check things out. I know I've run a single GPU up to N=100,000, but I don't think I've tried such large runs with multiple GPUs. There may be some place where we accidentally use a 32-bit int instead of a 64-bit magma_int_t for an offset in the multi-GPU code.

-mark

Re: Crash in CGESV

Postby mh1 » Fri Oct 23, 2015 4:41 pm

I have attempted the 51,210-unknown test on a box with 4 K10s. Runs with 2, 4, and 6 GPU devices from the set completed. The 8 GPU device run failed with a seg fault (the third different error I have captured in the multi-GPU tests).

BTW, the box with the 690 (2 GPUs) is a box that I use for other activities like browsing web, Matlab, etc ... it is my developer workstation. The 4xK10 box sits in a rack with no video display attached. The LP64 compiled version of my application (plus the LP64 version of MAGMA) didn't show any problems running on my developer workstation so I expected ILP64 to do the same. I don't know if this is part of the problem - just mentioning in case it might be.

So I think something is wrong. I don't know what else to do on my end. The compilation appears to be correct. MKL ILP64 isn't giving any problems. Sometimes ILP64 MAGMA succeeds with multiple GPUs and/or matrices of more than 2**31 elements, but most times it fails. I am not sure which is the issue - multi-GPU with ILP64, more than 2**31 elements with ILP64 - or both. If you think there is something I can do I may be able to spend a bit more time on it, but most of the time I had dedicated to this has been eaten up with test failures.

Any help you can provide is greatly appreciated. At the moment we have to limit MAGMA to LP64 compiles and limited problem sizes until we can determine stability problems with ILP64 MAGMA.

Re: Crash in CGESV

Postby mgates3 » Sat Oct 24, 2015 6:47 am

Thanks Mark. It does seem there is a bug in that code with respect to 64-bit ints. We will continue to investigate. Might take a bit to get to it, due to upcoming conferences.
-mark
