cgesv

Open discussion for MAGMA

Post by mgates3 » Mon May 12, 2014 12:52 pm

Submitted by Gabriella Ceci:

Hi all,

I'm currently testing MAGMA (v1.3 and v1.4.1) by running the cgesv and
cgesv_gpu testers (available in the testing directory), and I have some
questions for the MAGMA experts.

D1)
In both versions, the cgesv tester lets me change the number of GPUs with
"--ngpu x" and the environment variable MAGMA_NUM_GPUS.
That option does not work for cgesv_gpu. Is there a way to choose the
number of GPUs for testing_cgesv_gpu?
By default, testing_cgesv_gpu runs on one GPU.


D2)
From the following results, could you explain why I cannot see any CPU
usage? In both cases, the CPU GFlop/s and time columns are empty ("---"):

$ ./testing_cgesv_gpu --ngpu 1 -N 26000
MAGMA 1.4.1 , compiled for CUDA capability >= 3.0
device 0: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 1: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 2: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 3: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 4: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 5: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 6: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 7: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_cgesv_gpu [options] [-h|--help]

   N  NRHS   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| /N*||A||*||X||
================================================================================
26000     1     ---   (  ---  )   1658.28 (  28.27)   9.95e-11


$ ./testing_cgesv --ngpu 1 -N 26000
MAGMA 1.4.1 , compiled for CUDA capability >= 3.0
device 0: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 1: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 2: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 3: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 4: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 5: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 6: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 7: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]

ngpu 1
   N  NRHS   CPU Gflop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
26000     1     ---   (  ---  )   1380.40 (  33.96)   9.95e-11


Could you please explain the difference between these two codes,
testing_cgesv and testing_cgesv_gpu, and suggest which one is better for
evaluating system performance?
I'm asking because I expect that cgesv performs "something" on the host
and then on the device, while cgesv_gpu performs "something" only on the
device.


D3)
How is the memory managed in MAGMA?
For example, the following case crashes (on 1, 2 and 3 GPUs):

$ ./testing_cgesv --ngpu 1 -N 48000
MAGMA 1.4.1 , compiled for CUDA capability >= 3.0
device 0: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 1: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 2: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 3: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 4: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 5: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 6: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 7: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]

ngpu 1
   N  NRHS   CPU Gflop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
!!!! magma_malloc_cpu failed for: h_A


while -N 46000 gives good results.
My question is:
a matrix size of N=46000 implies 2,116,000,000 matrix entries. Each entry
is a single-precision complex number (8 bytes), so in total the matrix
should take around 16.9 GB (46000*46000*8 = 16.928*10^9 bytes, about 15.8 GiB).
How is the matrix allocated if each GPU has only around 5 GB of memory?

Moreover, the largest matrix size that cgesv_gpu accepts is N=26000 (for higher values of N the code crashes).
Why is there such a big gap (46000 vs. 26000)?


D4)
Running the cgesv tester with MAGMA v1.3, I get the following failures before the result:

$ export MAGMA_NUM_GPUS=8
$ ./testing_cgesv --ngpu 8 -N 46000
MAGMA 1.3.0
device 0: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 1: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 2: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 3: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 4: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 5: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 6: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
device 7: Tesla K20Xm, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]

   N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
===========================================================
CUBLAS error: memory mapping error (11) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:274
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:329
CUBLAS error: memory mapping error (11) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:376
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:337
** On entry to CTRSM  parameter number 5 had an illegal value
** On entry to CGEMM  parameter number 3 had an illegal value
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:337
[...]
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:337
** On entry to CTRSM  parameter number 5 had an illegal value
** On entry to CGEMM  parameter number 3 had an illegal value
** On entry to CTRSM  parameter number 5 had an illegal value
** On entry to CGEMM  parameter number 3 had an illegal value
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:251
CUBLAS error: memory mapping error (11) in magma_cgetrf2_mgpu at cgetrf2_mgpu.cpp:274
[...]
CUBLAS error: memory mapping error (11) in magmablas_cgetmatrix_transpose_mgpu at cgetmatrix_transpose_mgpu.cu:61
[...]
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf_m at cgetrf_m.cpp:346
[...]
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf_m at cgetrf_m.cpp:374
CUDA runtime error: unspecified launch failure (4) in magma_cgetrf_m at cgetrf_m.cpp:375
46000      1   12903.60 (  20.12)       -nan


Do you have any idea what is going on?


Any help is very welcome.
Thanks a lot in advance, regards :)
mgates3
 
Posts: 442
Joined: Fri Jan 06, 2012 2:13 pm

Re: cgesv

Post by mgates3 » Mon May 12, 2014 1:45 pm

First, what BLAS and LAPACK are you using, e.g., MKL or ACML or ATLAS? What OS and CPUs are you using?

D1)
--ngpu is an option to the tester. It sets environment variable MAGMA_NUM_GPUS for that run. The environment variable is used inside magma_cgesv, so use that when using MAGMA inside an application.

Both cgesv and cgesv_gpu are hybrid codes which use both the CPU and the GPU. The difference is where the input & output matrix is stored. For cgesv, all data is on the CPU. For cgesv_gpu, the matrix dA and right-hand-sides dB are on the GPU. A result of this is that cgesv_gpu always runs on one GPU, whereas cgesv can distribute the data to multiple GPUs.

D2)
The "CPU Gflop/s" is the speed for LAPACK's cgesv (e.g., from MKL), running solely on the CPUs, while the "GPU Gflop/s" is the speed for MAGMA's hybrid cgesv running on the combined CPUs + GPUs. You can run LAPACK using the -l or --lapack flag. We don't run LAPACK by default because it is comparatively slow.

We don't separately measure how fast the CPU portion and the GPU portion of magma_cgesv are.

magma_cgesv_gpu will tend to be a bit higher performance, because it doesn't have to transfer the entire matrix to the GPU at the beginning, and transfer the result back to the CPU at the end. But it doesn't support multiple GPUs, nor out-of-GPU-core where the matrix exceeds the GPU's memory.
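As a rough cross-check of the two rates in the question, the sketch below recomputes Gflop/s from N and the reported time. It assumes the usual leading-order operation count for a complex LU solve, about (8/3) N^3 real flops for the factorization plus 8 N^2 per right-hand side; the tester's exact formula may include lower-order terms, and the function name cgesv_gflops is made up for this example.

```python
def cgesv_gflops(n, seconds, nrhs=1):
    """Leading-order Gflop/s estimate for a complex LU solve (assumed count)."""
    # LU needs about n^3/3 multiplies and n^3/3 adds. A complex multiply
    # costs 6 real flops and a complex add costs 2, giving (8/3) n^3.
    factor_flops = (8.0 / 3.0) * n**3
    # Forward and back substitution: about 8 n^2 real flops per right-hand side.
    solve_flops = 8.0 * n**2 * nrhs
    return (factor_flops + solve_flops) / seconds / 1e9


# Times taken from the two runs posted above (N = 26000, NRHS = 1).
print("cgesv_gpu: %.1f Gflop/s" % cgesv_gflops(26000, 28.27))
print("cgesv:     %.1f Gflop/s" % cgesv_gflops(26000, 33.96))
```

Both estimates come out within about 1% of the reported 1658.28 and 1380.40. The flop count is the same in both runs, so the gap is entirely the extra 5.7 seconds, which is consistent with the transfer cost described above.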

D3)
magma_cgesv tries to be smart about memory. If you use one GPU and the matrix fits on one GPU, it uses magma_cgetrf_gpu and magma_cgetrs_gpu (essentially magma_cgesv_gpu). If you request multiple GPUs or the matrix does not fit on one GPU, it uses the multi-GPU, out-of-GPU-core magma_cgetrf. This distributes the matrix across the GPUs, and can cycle portions of the matrix through the GPUs if it doesn't fit in GPU memory.
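The selection described above can be pictured with a small Python sketch. This is only an illustration of the rule as stated in this post, not MAGMA's actual code: the function name is made up, the fit test ignores workspace and padding, and it assumes the "MB" in the tester's device listing means 2^20 bytes.

```python
def choose_cgesv_path(n, ngpu, gpu_mem_bytes, elem_bytes=8):
    """Pick the code path the way the post describes it (illustrative only)."""
    matrix_bytes = n * n * elem_bytes   # single-precision complex = 8 bytes
    if ngpu == 1 and matrix_bytes <= gpu_mem_bytes:
        # One GPU and the matrix fits: the cgetrf_gpu + cgetrs_gpu path.
        return "single-GPU"
    # Several GPUs requested, or the matrix does not fit on one GPU.
    return "multi-GPU / out-of-GPU-core"


K20X_BYTES = 5759.6 * 2**20   # from the device listing, assuming MB = 2^20 bytes

for n, ngpu in [(26000, 1), (46000, 1), (26000, 2)]:
    print(n, ngpu, choose_cgesv_path(n, ngpu, K20X_BYTES))
```

With these assumptions, N = 26000 on one GPU takes the single-GPU path, while N = 46000 or any multi-GPU request falls through to the out-of-GPU-core path.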

For your N = 48000 matrix, I think you are running into limits of 32-bit integers.
46000 * 46000 = 2116000000, which is < 2^31 = 2147483648.
48000 * 48000 = 2304000000, which is > 2^31.
Check what lda*N is; it overflows and becomes a negative number, which causes malloc to fail.

lda*N = -1990967296
testing_cgesv(5509,0xac8442c0) malloc: *** mmap(size=1252130816) failed (error code=12)
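The wraparound is easy to reproduce. Python integers do not overflow, so this sketch emulates a signed 32-bit integer by hand; the helper wrap_int32 is written for this example and assumes lda = N, which holds here if lda is N rounded up to a multiple of 32.

```python
def wrap_int32(x):
    """Reduce x to the value a signed 32-bit two's-complement integer would hold."""
    return (x + 2**31) % 2**32 - 2**31


for n in (46000, 48000):
    lda = n                  # assumption: no extra padding for these sizes
    exact = lda * n          # what a 64-bit magma_int_t would hold
    wrapped = wrap_int32(exact)
    status = "ok" if wrapped == exact else "overflow"
    print("N=%d  lda*N=%d  as int32=%d  %s" % (n, exact, wrapped, status))
```

For N = 48000 the wrapped value is -1990967296, the same number shown above, while N = 46000 still fits.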

If you have MKL, the easiest fix is to recompile MAGMA using make.inc.mkl-ilp64, which uses 64-bit magma_int_t and links with ILP64 versions of MKL, instead of the LP64 version.

For magma_cgesv_gpu, the entire matrix must fit in one GPU's memory.
26000 * 26000 * 8 bytes is about 5.4 * 10^9 bytes, or around 5 GiB.
Larger than this, malloc_gpu will fail on your GPUs, which report 5759.6 MB of memory each.
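For reference, the matrix sizes discussed in this thread work out as follows. This is plain arithmetic in Python; it counts only the N x N matrix itself (no right-hand sides, pivots, or workspace) and again assumes the tester's "MB" means 2^20 bytes.

```python
def matrix_gib(n, elem_bytes=8):
    """Size of an n x n single-precision complex matrix in GiB."""
    return n * n * elem_bytes / 2**30


gpu_gib = 5759.6 / 1024   # one K20Xm, assuming the listing is in units of 2^20 bytes

for n in (26000, 46000):
    size = matrix_gib(n)
    verdict = "fits on one GPU" if size <= gpu_gib else "exceeds one GPU"
    print("N=%d: %.2f GiB, %s (%.2f GiB available)" % (n, size, verdict, gpu_gib))
```

So at N = 46000 the matrix is roughly three times larger than one GPU's memory. In testing_cgesv it lives in host memory (the h_A allocation seen in the error message), and the out-of-GPU-core path moves pieces of it through the GPUs.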

D4)
I'm not sure what the problem is when you run with ngpu=8 and N=46000. If you can link with ILP64 as described above, can you replicate this issue? Does it occur with MAGMA 1.4.1 or 1.4.2beta, or only with older 1.3 versions? We can investigate some, but we don't have ready access to a machine with 8 Kepler GPUs.

Hopefully I've answered all your questions, or at least pointed you in the right direction. Feel free to respond if you need further clarification.

-mark