## Dense LUD seg fault

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

### Dense LUD seg fault

Hello,

I am experimenting with some large matrix sizes that fit into CPU memory, but not in GPU memory for LU factorization.

I am using the 'magma_dgetrf_m' routine. http://icl.cs.utk.edu/projectsfiles/mag ... de2af56314

In the documentation it says that the matrix may exceed the GPU memory and must be in CPU memory. The problem size I am using is 64,000 unknowns (~30 GB) running it on a machine with ~192 GB of CPU RAM and a Tesla K40 (12 GB GDDR5). Below is the code that is called to execute the example. The initMatrix function initializes a matrix that is non-zero and diagonally dominant.

```cpp
int matrixSize = 64000;
double *matrix = new double[matrixSize * matrixSize];
initMatrix(matrix, matrixSize, matrixSize);

magma_init();
magma_setdevice(1);

int *piv = new int[matrixSize];
int info;
magma_dgetrf_m(1, matrixSize, matrixSize, matrix, matrixSize, piv, &info);
```

Am I missing something here or are there memory concerns with matrices that do not fit into GPU memory? I am able to get the correct solution when using smaller problem sizes that will fit into the Tesla memory (i.e. 32,000 unknowns).
tblattner

Posts: 8
Joined: Tue Aug 09, 2016 4:38 pm

### Re: Dense LUD seg fault

My first inclination is that you may need to use an ILP64 LAPACK library. If you can solve a problem of size around 45000 (15.1 GiB), but it fails for 47000 (16.5 GiB), that is very likely the problem. The cutoff is about sqrt(2^31) = 46340, where the offset m*n (or lda*n) no longer fits into 32 bits.

You can use magma_dgetrf. It will automatically use magma_dgetrf_m if needed.

What BLAS and LAPACK library are you using?

If you have Intel MKL, see the make.inc.mkl-icc-ilp64 (or gcc) configuration. It adds the -DMKL_ILP64 to make magma_int_t a long long (64-bit), and links with the ilp64 version of the library.

-mark
mgates3

Posts: 734
Joined: Fri Jan 06, 2012 2:13 pm

### Re: Dense LUD seg fault

Incidentally, even your "new double[ matrixSize * matrixSize ]" could fail, since matrixSize * matrixSize will overflow. You need to cast it to size_t before the multiply: new double[ size_t(matrixSize) * matrixSize ].

Thanks for including sample code.

-mark
mgates3


### Re: Dense LUD seg fault

> mgates3 wrote:
> My first inclination is that you may need to use an ILP64 LAPACK library. If you can solve a problem of size around 45000 (15.1 GiB), but it fails for 47000 (16.5 GiB), that is very likely the problem. The cutoff is about sqrt(2^31) = 46340, where the offset m*n (or lda*n) no longer fits into 32 bits.
>
> You can use magma_dgetrf. It will automatically use magma_dgetrf_m if needed.
>
> What BLAS and LAPACK library are you using?
>
> If you have Intel MKL, see the make.inc.mkl-icc-ilp64 (or gcc) configuration. It adds the -DMKL_ILP64 to make magma_int_t a long long (64-bit), and links with the ilp64 version of the library.
>
> -mark

I am going to be checking performance going from 1 to 4 GPUs, so I want to control the number of GPUs used when calling dgetrf.

The BLAS library is OpenBLAS.
The LAPACK library is compiled from the netlib source.

I have another test case that uses LAPACKE, linked against the netlib library and OpenBLAS. Source below:

```cpp
long matrixSize = 64000;
double *matrix = new double[matrixSize*matrixSize];
initMatrix(matrix, matrixSize, matrixSize);

int *piv = new int[matrixSize];
LAPACKE_dgetrf(LAPACK_COL_MAJOR, matrixSize, matrixSize, matrix, matrixSize, piv);
```

Using LAPACKE works for 64,000 and even 100,000 unknowns.

> mgates3 wrote:
> Incidentally, even your "new double[ matrixSize * matrixSize ]" could fail, since matrixSize * matrixSize will overflow. You need to cast it to size_t before the multiply: new double[ size_t(matrixSize) * matrixSize ].
>
> Thanks for including sample code.
>
> -mark

I checked my source and I actually use long. My mistake; I'm not sure why I wrote int as the type for matrixSize in the post. Although long is 64-bit on my architecture, I should probably use long long to guarantee a 64-bit type on other machines.

I'm trying to avoid proprietary libraries like MKL, but will give it a go and see if it will work. Is there a way to have magma_int_t become a long long without MKL?
tblattner


### Re: Dense LUD seg fault

Also found this in the netlib lapack package: http://www.netlib.org/lapack/lapacke.html#_integers

I'm going to see if the suggestion to redefine lapack_int might help.

Edit:
Checked the magma_types.h and it looks like there is a way to turn on ILP64 by defining MAGMA_ILP64. Will have a go with this.
tblattner


### Re: Dense LUD seg fault

Managed to get the program to run without seg faulting, but at the tail end of the computation, I'm getting these errors: http://pastebin.com/9LM8xAhR

Updated source:

```cpp
magma_int_t matrixSize = 64000;
double *matrix = new double[matrixSize * matrixSize];
initMatrix(matrix, matrixSize, matrixSize);

magma_int_t *piv = new magma_int_t[matrixSize];
magma_int_t info;
magma_dgetrf_m(1, matrixSize, matrixSize, matrix, matrixSize, piv, &info);
```

Tried with 32,000 unknowns and get the following error:
munmap_chunk(): invalid pointer: 0x00007f20b7cf8010

I compiled MAGMA with the make.inc.openblas template and added -DMAGMA_ILP64.

OpenBLAS is installed from http://www.openblas.net/, which includes LAPACKE.

OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)
OS ... Linux
Architecture ... x86_64
BINARY ... 64bit
C compiler ... GCC (command line : gcc)
Fortran compiler ... GFORTRAN (command line : gfortran)
tblattner


### Re: Dense LUD seg fault

Decided to add a full source code example plus the make.inc I used when compiling MAGMA:

http://pastebin.com/X2HcWFwv

OpenBLAS:
Ran make && make install prefix=...

Magma make.inc:
http://pastebin.com/hqL9tGNQ

To compile:
/usr/bin/c++ -I/path/to/openblas -I/usr/local/cuda-7.5/include -Wall -Wextra -Wno-unused-parameter -Wno-reorder -std=c++11 -Wl,--no-as-needed -lpthread -g -o test-magma.cpp.o -c test-magma.cpp

/usr/bin/c++ -Wall -Wextra -Wno-unused-parameter -Wno-reorder -std=c++11 -Wl,--no-as-needed -lpthread -g test-magma.cpp.o -o test-magma -L/path/to/openblas/lib -rdynamic -lpthread /usr/local/cuda/lib64/libcudart_static.a -lpthread -ldl -lrt -lmagma -lopenblas /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcublas_device.a /usr/local/cuda/lib64/libcudart_static.a -lcuda /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcublas_device.a -lcuda -Wl,-rpath,/path/to/openblas/lib:/usr/local/cuda/lib64

./test-magma to run.

Errors when running:
http://pastebin.com/UwkB28W2
tblattner


### Re: Dense LUD seg fault

Is OpenBLAS compiled with ILP64? Usually BLAS libraries are compiled with 32-bit int. MKL is an exception in that it provides both LP64 (32-bit int, 64-bit long/pointer) and ILP64 (64-bit int/long/pointer) versions. Incidentally, MKL is now freely available under a community license.

Do small problems like N=10, N=100, and N=1000 work?

-mark
mgates3


### Re: Dense LUD seg fault

> mgates3 wrote:
> Is OpenBLAS compiled with ILP64? Usually BLAS libraries are compiled with 32-bit int. MKL is an exception in that it provides both LP64 (32-bit int, 64-bit long/pointer) and ILP64 (64-bit int/long/pointer) versions. Incidentally, MKL is now freely available under a community license.
>
> Do small problems like N=10, N=100, and N=1000 work?
>
> -mark

I googled for OpenBLAS ILP64 but came up short. I peeked around the Makefiles and it looks like there is an option to enable ILP64.

So I recompiled OpenBLAS with 'INTERFACE64=1 make' and that seems to have enabled ILP64 for OpenBLAS and LAPACK.

Then in the test source code added the following modifications:
(1) added 'typedef long long int lapack_int' to force lapack to use ILP64
(2) added -DMAGMA_ILP64 directive when compiling the test case.

And now no more error!

Seems like this did the trick! I was able to run the 64,000 with no more issues!

I am curious how my LAPACKE_dgetrf call succeeded with 64,000 and 100,000 unknowns. Will now make sure to add the above modifications for my future benchmarks.

Thank you for all the help!

Also thanks for the tip on the MKL community license, I will install that next and compare.
tblattner
