multiple gpu
Hello,
I have a general question regarding how the problem size is limited by the memory available on the GPU. This is about the generalized eigenvalue problem, say magma_dsygvdx, where both eigenvalues and eigenvectors are needed. If I have one GPU with 16 GB of memory, up to what matrix size in AX = (lambda)BX can MAGMA solve? Can multiple GPUs help in solving larger matrices?
thanks,
boreas
Re: multiple gpu
By the way, is the multi-GPU version of the routine (_m) adaptive to the device count? I set ngpu=2 and ran on a machine with only one GPU, but it ran successfully without error or warning. Thanks,
boreas
Re: multiple gpu
Based on tracing the memory usage, magma_dsygvdx() requires about 2.5 n^2 doubles of GPU memory. At various times it needs to hold the matrices A and B, the eigenvectors, plus workspace.
Yes, the multi-GPU version magma_dsygvdx_m() can handle larger matrices. Oddly, even with a single GPU, it appears to require less, about 1.5 n^2 doubles.
Below is a table of actual time and memory usage, on a K20m and a 2x10-core Haswell (E5-2650 v3).
-mark
Code: Select all
      n |                  dsygvdx                    |          dsygvdx_m( ngpu = 1 )
        | time (s)   GPU memory (MiB and n^2 doubles) | time (s)   GPU memory (MiB and n^2 doubles)
--------+---------------------------------------------+------------------------------------------
1000 | 0.1181 35.1 MiB 4.60 n^2 doubles | 0.1280 23.7 MiB 3.11 n^2 doubles
2000 | 0.4438 115.9 MiB 3.80 n^2 doubles | 0.4319 55.1 MiB 1.80 n^2 doubles
3000 | 1.4149 249.8 MiB 3.64 n^2 doubles | 1.5135 110.6 MiB 1.61 n^2 doubles
4000 | 2.6409 436.8 MiB 3.58 n^2 doubles | 2.8571 196.7 MiB 1.61 n^2 doubles
5000 | 5.1591 679.5 MiB 3.56 n^2 doubles | 5.3543 293.7 MiB 1.54 n^2 doubles
6000 | 9.1811 698.0 MiB 2.54 n^2 doubles | 8.4615 419.6 MiB 1.53 n^2 doubles
7000 | 13.5773 946.1 MiB 2.53 n^2 doubles | 12.6406 568.4 MiB 1.52 n^2 doubles
8000 | 18.9661 1232.3 MiB 2.52 n^2 doubles | 17.2574 740.1 MiB 1.52 n^2 doubles
9000 | 24.2933 1558.7 MiB 2.52 n^2 doubles | 23.6092 934.7 MiB 1.51 n^2 doubles
10000 | 32.4998 1921.2 MiB 2.52 n^2 doubles | 30.8866 1152.1 MiB 1.51 n^2 doubles
11000 | 41.4299 2321.7 MiB 2.51 n^2 doubles | 38.9421 1392.5 MiB 1.51 n^2 doubles
12000 | 51.6525 2760.2 MiB 2.51 n^2 doubles | 49.7362 1655.7 MiB 1.51 n^2 doubles
13000 | 62.0538 3240.0 MiB 2.51 n^2 doubles | 60.6915 1941.8 MiB 1.51 n^2 doubles
14000 | 74.6738 3754.8 MiB 2.51 n^2 doubles | 73.3559 2250.9 MiB 1.51 n^2 doubles
15000 | 89.2196 4307.6 MiB 2.51 n^2 doubles | 90.1193 2582.8 MiB 1.50 n^2 doubles
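As a rough rule of thumb, you can back these factors out into a maximum matrix size for a given card. Below is a minimal C sketch of that arithmetic; the 2.5 and 1.5 factors are the empirical numbers from the table above (not guaranteed bounds), and the 16 GiB budget is just the card size from the question.
Code: Select all
/* Rough estimate of the largest n for dsygvdx / dsygvdx_m on one GPU,
 * assuming the empirical factors above: ~2.5 n^2 doubles for dsygvdx
 * and ~1.5 n^2 doubles for dsygvdx_m( ngpu=1 ).  Not a guaranteed bound. */
#include <math.h>
#include <stdio.h>

int main( void )
{
    double gpu_mem = 16.0 * 1024 * 1024 * 1024;   /* 16 GiB card, in bytes */

    /* factor * n^2 * sizeof(double) <= gpu_mem
       =>  n <= sqrt( gpu_mem / (factor * 8) )    */
    long n_dsygvdx   = (long) sqrt( gpu_mem / (2.5 * 8) );   /* ~29308 */
    long n_dsygvdx_m = (long) sqrt( gpu_mem / (1.5 * 8) );   /* ~37837 */

    printf( "estimated max n, dsygvdx:            %ld\n", n_dsygvdx );
    printf( "estimated max n, dsygvdx_m (ngpu=1): %ld\n", n_dsygvdx_m );
    return 0;
}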
Re: multiple gpu
Also, I'm surprised that it would work with ngpu = 2 if you have only 1 GPU. Are you doing that in your own code, or using the MAGMA tester? The MAGMA tester shouldn't allow it.
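If it's your own code, one defensive option (just a sketch using the CUDA runtime, not something MAGMA does for you) is to clamp the requested ngpu to the number of devices actually present before calling the _m routine:
Code: Select all
/* Sketch: cap the requested ngpu at the actual device count
 * before passing it to a multi-GPU _m routine. */
#include <stdio.h>
#include <cuda_runtime.h>

int choose_ngpu( int ngpu_requested )
{
    int ndevices = 0;
    cudaGetDeviceCount( &ndevices );

    if ( ngpu_requested > ndevices ) {
        printf( "only %d GPU(s) present; reducing ngpu from %d to %d\n",
                ndevices, ngpu_requested, ndevices );
        return ndevices;
    }
    return ngpu_requested;
}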
-mark
Re: multiple gpu
Hello Mark,
Would you elaborate on how much memory is required on the CPU side and on the GPU side? The comments say:
lwork INTEGER
The length of the array WORK.
- If N <= 1, LWORK >= 1.
- If JOBZ = MagmaNoVec and N > 1, LWORK >= 2*N + N*NB.
- If JOBZ = MagmaVec and N > 1, LWORK >= max( 2*N + N*NB, 1 + 6*N + 2*N**2 ).
NB can be obtained through magma_get_dsytrd_nb(N).
This is for CPU memory, right? But it does not say how much GPU memory is needed.
Also, I guess the solvable problem size is also bounded by magma_int_t. If it is 32-bit, the maximum solvable problem is less than 30,000 in my experience. What do you think?
thanks,
Re: multiple gpu
Yes, the workspaces are on the CPU. It internally allocates GPU workspace.
For all MAGMA routines, magma_int_t needs to be able to address the entire matrix, so if magma_int_t is a signed 32-bit integer, the maximum size is bounded by 2**31 entries, or
n = int( sqrt( 2**31 ) / 32 ) * 32 = 46336
for a square matrix. I rounded down to a multiple of 32 to account for ldda, which is typically a multiple of 32 on the GPU.
For sygvdx in particular, it allocates a workspace of size about 3/2*n*n, so that would limit it to n = 37824. However, when I ran it, I was only able to successfully run up to n = 32000. I'm not sure why the discrepancy. There might be another workspace somewhere of size 2*n*n, which would limit it to n = 32768.
The easiest solution is to switch to compiling with ILP64 when solving large systems.
Code: Select all
./testing_ssygvdx -n 1234,25000:50000:1000 -JV -c
% MAGMA 2.3.0 svn compiled for CUDA capability >= 3.5, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 9020, driver 9020. OpenMP threads 20. MKL 2018.0.1, MKL threads 20.
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16160.5 MiB memory, capability 7.0
% Wed Oct 17 10:30:26 2018
% Usage: ./testing_ssygvdx [options] [-h|--help]
% itype = 1, jobz = Vectors needed, uplo = Lower, ngpu = 1
% N M GPU Time (sec) |AZ-BZD| |D - D_magma|
%======================================================
25000 25000 44.2406 2.36e-10 7.02e-10 ok
26000 26000 48.9626 2.39e-10 7.75e-10 ok
27000 27000 54.2609 2.34e-10 5.16e-10 ok
28000 28000 57.3955 2.46e-10 5.47e-10 ok
29000 29000 63.4750 2.34e-10 5.70e-10 ok
30000 30000 67.1582 2.30e-10 5.46e-10 ok
31000 31000 73.2874 2.39e-10 8.14e-10 ok
32000 32000 80.7398 2.26e-10 4.87e-10 ok
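For reference, here is a small C sketch that puts the formulas above into code: the LWORK bound quoted from the header comments (using the nb query magma_get_dsytrd_nb that those comments mention), and the 32-bit magma_int_t limit. Treat it as a sketch of the documented requirements, not an exhaustive accounting of what the routine uses.
Code: Select all
/* Sketch: documented CPU workspace (LWORK) for magma_dsygvdx with
 * JOBZ = MagmaVec, plus the largest n whose n^2 entries still fit
 * in a signed 32-bit magma_int_t.  GPU workspace is allocated
 * internally by the routine and is not covered here. */
#include <math.h>
#include <stdio.h>
#include <magma_v2.h>

int main( void )
{
    magma_init();

    magma_int_t n  = 10000;
    magma_int_t nb = magma_get_dsytrd_nb( n );

    /* JOBZ = MagmaVec, N > 1:  LWORK >= max( 2*N + N*NB, 1 + 6*N + 2*N^2 ) */
    long long w1    = 2LL*n + (long long) n * nb;
    long long w2    = 1 + 6LL*n + 2LL*n*n;
    long long lwork = (w1 > w2 ? w1 : w2);
    printf( "n = %lld: LWORK >= %lld doubles = %.1f MiB of CPU memory\n",
            (long long) n, lwork, lwork * 8.0 / (1024.0 * 1024.0) );

    /* 32-bit limit: n^2 <= 2^31, rounded down to a multiple of 32 for ldda */
    long n_max = (long)( sqrt( 2147483648.0 ) / 32 ) * 32;   /* = 46336 */
    printf( "32-bit magma_int_t bound: n <= %ld\n", n_max );

    magma_finalize();
    return 0;
}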
-mark
Re: multiple gpu
Thanks. Assuming 2*n^2 is needed by sygvdx, is the same amount needed on both the CPU side and the GPU side?
Is it possible to mix MAGMA ILP64 with MKL LP64? Thank you.
Re: multiple gpu
A cursory look at the sygvdx and related codes didn't show CPU allocations, so the only one would be the workspace passed to sygvdx, which is indeed O( 2n^2 ).
No, if MAGMA is ILP64, then it requires MKL to be ILP64. Consider info or the ipiv vector in getrf — MAGMA and MKL need to agree whether they are 32-bit or 64-bit. Also, similar issues may strike MKL if only LP64 is used, i.e., it could fail for large matrices due to overflow in indexing.
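To illustrate the ipiv point, here is a sketch using the standard MAGMA CPU interface for getrf and the LAPACKE interface from MKL; the pivot array written by one library is read element-by-element by the other, so the two integer widths have to match.
Code: Select all
/* Sketch: why MAGMA and MKL must agree on integer width.  MAGMA writes
 * ipiv as magma_int_t; MKL's LAPACKE reads it as lapack_int (MKL_INT).
 * If one is ILP64 (64-bit ints) and the other LP64 (32-bit ints),
 * the same buffer is interpreted with the wrong element size. */
#include <stdlib.h>
#include <magma_v2.h>   /* magma_int_t: 32- or 64-bit, depending on the MAGMA build  */
#include <mkl.h>        /* lapack_int:  32- or 64-bit, depending on the MKL library  */

void solve_with_mixed_libs( magma_int_t n, double *A, magma_int_t lda,
                            double *B, magma_int_t ldb )
{
    magma_int_t  info;
    magma_int_t *ipiv = (magma_int_t*) malloc( n * sizeof(magma_int_t) );

    /* MAGMA fills ipiv with n pivots of width sizeof(magma_int_t) ... */
    magma_dgetrf( n, n, A, lda, ipiv, &info );

    /* ... and MKL reads n pivots of width sizeof(lapack_int).  This is
     * only correct when both libraries are LP64 or both are ILP64. */
    LAPACKE_dgetrs( LAPACK_COL_MAJOR, 'N', n, 1, A, lda,
                    (const lapack_int*) ipiv, B, ldb );

    free( ipiv );
}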
-mark
Re: multiple gpu
Thanks so much for the reply. The 2*N^2 workspace really limits the capacity of an LP64 build, and there are obstacles to using ILP64 in our environment. MKL has other algorithms that need a much smaller workspace. Will MAGMA consider implementing them, for example MRRR? Just curious.
Re: multiple gpu
MAGMA has [cz]heevr (complex MRRR eigensolvers). It has just never been ported to the real case ([sd]syevr), nor put into the generalized problem ([cz]hegvr / [sd]sygvr). In fact, LAPACK doesn't appear to have [cz]hegvr / [sd]sygvr. So there's a possibility, but we aren't currently actively working on these eigenvalue codes.
-mark