Create distributed matrix on gpus with no cpu to gpu copy
Create distributed matrix on gpus with no cpu to gpu copy
Hi,
I want to use _mgpu functions in magma as we need to work with matrix sizes of N=40,000 and more whose memory can't fit in a single gpu.
Lets say I am working with k gpus, and I am able to independently compute N/k different column blocks of the matrix directly on k gpus.
Next, lets say I want to use magma_int_t magma_dpotrf_mgpu( magma_int_t ngpu, magma_uplo_t uplo, magma_int_t n, magmaDouble_ptr d_lA[], magma_int_t ldda, magma_int_t * info).
I am wondering if there a way to use the distributed matrix memory which is already on the gpus in the magmaDouble_ptr argument without the need to copy the memory to cpu?
Currently, if I understand correctly the magmaDouble_ptr is obtained using magma_dsetmatrix_1D_col_bcyclic(...), which uses the full matrix memory already present on the cpu.
I wouldn't like to copy to cpu and again back to gpu as this will involve additional costs and the full matrix memory might be beyond the cpu memory as well.
Thanks a lot in advance,
Best,
Sambit
I want to use _mgpu functions in magma as we need to work with matrix sizes of N=40,000 and more whose memory can't fit in a single gpu.
Lets say I am working with k gpus, and I am able to independently compute N/k different column blocks of the matrix directly on k gpus.
Next, lets say I want to use magma_int_t magma_dpotrf_mgpu( magma_int_t ngpu, magma_uplo_t uplo, magma_int_t n, magmaDouble_ptr d_lA[], magma_int_t ldda, magma_int_t * info).
I am wondering if there a way to use the distributed matrix memory which is already on the gpus in the magmaDouble_ptr argument without the need to copy the memory to cpu?
Currently, if I understand correctly the magmaDouble_ptr is obtained using magma_dsetmatrix_1D_col_bcyclic(...), which uses the full matrix memory already present on the cpu.
I wouldn't like to copy to cpu and again back to gpu as this will involve additional costs and the full matrix memory might be beyond the cpu memory as well.
Thanks a lot in advance,
Best,
Sambit

 Posts: 276
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Create distributed matrix on gpus with no cpu to gpu cop
The magma_dpotrf_mgpu function requires that the matrix is distributed among the GPUs in 1D block cyclic way, where nb is obtained by magma_get_dpotrf_nb(n). The magma_dsetmatrix_1D_col_bcyclic function is just one example on how one can get to this distribution starting from CPU memory. If you already had it in the required form, you can use magma_dpotrf_mgpu directly, otherwise some modifications would be required. For example, if you had it in 1D block cyclic but nb is something else, you can modify line 89 in magma_dpotrf_mgpu to take the correct nb.
We are thinking to update the auxiliary functions for these types of routines, so we are interested in what formats may be of interest, e.g., how do you have the matrix distributed and how do you describe the distribution? We can provide directly support for different distributions or at least auxiliary routines that move efficiently data from some userspecified format to 1D block cyclic that is taken by the routine, compute, and redistributed back to the userspecified format.
Stan
We are thinking to update the auxiliary functions for these types of routines, so we are interested in what formats may be of interest, e.g., how do you have the matrix distributed and how do you describe the distribution? We can provide directly support for different distributions or at least auxiliary routines that move efficiently data from some userspecified format to 1D block cyclic that is taken by the routine, compute, and redistributed back to the userspecified format.
Stan
Re: Create distributed matrix on gpus with no cpu to gpu cop
Hi Stan,
Thanks for your reply.
I think I can work with the nb provided by magma_get_dpotrf_nb(n).
I have to gather the array of pointers to these memory blocks in different gpus and communicate them to the gpu which
will subsequently call magma_dpotrf_mgpu. I am very new to cuda programming after a little bit of digging I found
that this requires cuda unified memory across multiple gpus using functions like cudaDeviceEnablePeerAccess and cudaMemsetAsync .
Just an aside question Are there an _mgpu functions for a) symmetric eigendecomposition and b) gemm or any plans to do so in the future. I couldn't find any in the documentation.
Best,
Sambit
Thanks for your reply.
I think I can work with the nb provided by magma_get_dpotrf_nb(n).
This is what I am really confused about. Let's say I have the nb blocks of memory across n gpus, but it seems somehowhow do you have the matrix distributed and how do you describe the distribution?
I have to gather the array of pointers to these memory blocks in different gpus and communicate them to the gpu which
will subsequently call magma_dpotrf_mgpu. I am very new to cuda programming after a little bit of digging I found
that this requires cuda unified memory across multiple gpus using functions like cudaDeviceEnablePeerAccess and cudaMemsetAsync .
Thanks, this would be very helpful indeed. I am thinking if an auxiliary function can be written which takes nb memory blocks in different gpus which are allocated normally using cudaMalloc (without unified memory) and copy them to a cuda unified memory format, which can then be used for any _mgpu functions.We can provide directly support for different distributions or at least auxiliary routines that move efficiently data from some userspecified format to 1D block cyclic that is taken by the routine, compute, and redistributed back to the userspecified format.
Just an aside question Are there an _mgpu functions for a) symmetric eigendecomposition and b) gemm or any plans to do so in the future. I couldn't find any in the documentation.
Best,
Sambit
Re: Create distributed matrix on gpus with no cpu to gpu cop
Sorry, the docs are a little confusing about what is on the CPU and what is on the GPU. The array of pointers, d_lA, is itself on the CPU, e.g.,
(With CUDA, magmaDouble_ptr is just an alias for double*, needed for OpenCL compatibility.)
The pointers in d_lA point to data that is allocated on the multiple GPUs, e.g.,
See the code in magma/testing/testing_dpotrf_mgpu.cpp for an example of setting up the matrices. The CPU calls magma_dpotrf_mgpu, not the GPU. There is no need for unified memory.
Presently, the multiGPU eigenvalue codes (e.g., dsyevd_m) are all CPU interfaces where the input and output matrices are on the CPU.
We're also working on a multiGPU gemm that has a CPU interface, similar to cublasXt but optimized for certain gemm shapes. A multiGPU interface for gemm is a bit more complicated because good distributions of A, B, and C depend on their relative sizes.
mark
Code: Select all
double* d_lA[ MagmaMaxGPUs ];
The pointers in d_lA point to data that is allocated on the multiple GPUs, e.g.,
Code: Select all
for (int dev = 0; dev < ngpu; ++dev) {
magma_setdevice( dev );
magma_dmalloc( &d_lA[ dev ], max_size );
}
Presently, the multiGPU eigenvalue codes (e.g., dsyevd_m) are all CPU interfaces where the input and output matrices are on the CPU.
We're also working on a multiGPU gemm that has a CPU interface, similar to cublasXt but optimized for certain gemm shapes. A multiGPU interface for gemm is a bit more complicated because good distributions of A, B, and C depend on their relative sizes.
mark
Re: Create distributed matrix on gpus with no cpu to gpu cop
Hi Mark,
I my last post I had missed asking about one more function potri where a out of memory multigpu implementation would be very beneficial in our application.
What we are trying to do is orthogonalize N column vectors of a X matrix (size M x N) where N can be 40,000 and more. This is the most computationally expensive step in our electronicstructure code when N is very large.
Our orthogonalization algorithm is follows:
Step1) Compute overlap matrix S=X^{T} * X (don't need out of gpu memory implementation here as we plan to do this in a blocked manner on the gpus)
Step 2) Compute cholesky factorization of S=L*L^{T} (This can be done out of gpu memory using magma_dpotrf_mgpu)
Step 3) Compute L^{1} (out of gpu memory for potri?)
Step 4) Compute X=X*L^{1} (don't need out of gpu memory implementation here as we plan to do this in a blocked manner on the gpus)
I am trying to find a path to do all the above four steps without copying data back to cpu.
I am wondering if this is possible currently using MAGMA.
Thank you,
Sambit
Thanks for the clarification and directing me to the appropriate example code.See the code in magma/testing/testing_dpotrf_mgpu.cpp for an example of setting up the matrices. The CPU calls magma_dpotrf_mgpu, not the GPU. There is no need for unified memory.
I think for now I can then use cublasXt for out of gpu memory matrixmatrix multiplication.We're also working on a multiGPU gemm that has a CPU interface, similar to cublasXt but optimized for certain gemm shapes. A multiGPU interface for gemm is a bit more complicated because good distributions of A, B, and C depend on their relative sizes.
I my last post I had missed asking about one more function potri where a out of memory multigpu implementation would be very beneficial in our application.
What we are trying to do is orthogonalize N column vectors of a X matrix (size M x N) where N can be 40,000 and more. This is the most computationally expensive step in our electronicstructure code when N is very large.
Our orthogonalization algorithm is follows:
Step1) Compute overlap matrix S=X^{T} * X (don't need out of gpu memory implementation here as we plan to do this in a blocked manner on the gpus)
Step 2) Compute cholesky factorization of S=L*L^{T} (This can be done out of gpu memory using magma_dpotrf_mgpu)
Step 3) Compute L^{1} (out of gpu memory for potri?)
Step 4) Compute X=X*L^{1} (don't need out of gpu memory implementation here as we plan to do this in a blocked manner on the gpus)
I am trying to find a path to do all the above four steps without copying data back to cpu.
I am wondering if this is possible currently using MAGMA.
Thank you,
Sambit
Re: Create distributed matrix on gpus with no cpu to gpu cop
I don't see a reason to create an explicit inverse here. Inverses are usually slower and less accurate than doing a solve. Just do a triangular solve (trsm) with L on the right. I.e., solve
Xhat L = X
for Xhat, instead of Xhat = X L^{1}. MAGMA has trsm_m (CPU interface for multiGPU), but no outofgpumemory version.
Note the accuracy and orthogonality of the Cholesky QR method may be worse than doing an actual QR factorization (geqrf). See Matlab example below. Ichitaro Yamazaki also has a paper about Cholesky QR.
mark
Xhat L = X
for Xhat, instead of Xhat = X L^{1}. MAGMA has trsm_m (CPU interface for multiGPU), but no outofgpumemory version.
Note the accuracy and orthogonality of the Cholesky QR method may be worse than doing an actual QR factorization (geqrf). See Matlab example below. Ichitaro Yamazaki also has a paper about Cholesky QR.
Code: Select all
>> X = rand( m, n );
>> S = X' * X;
>> L = chol(S);
>> Xhat = X / L; % triangular solve (trsm) with L on right
>> norm( Xhat' * Xhat  eye(n, n) )
ans =
1.4121e14
>> [Q, R] = qr( X, 0 );
>> norm( Q' * Q  eye(n, n) )
ans =
1.1641e15

 Posts: 276
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Create distributed matrix on gpus with no cpu to gpu cop
The routine described is actually in MAGMA, called dgegqr_gpu version 4, along with a few other versions also described there. You can test them and see how they are called, e.g., with
The implementation itself is very simple using the magma building blocks:
Note that this case was targeting tall and skinny matrices so we explicitly copy the overlap matrix to the CPU and do Cholesky there. As N is large in your case you can call directly magma_dpotrf_gpu, i.e., something like this:
Stan
Code: Select all
./testing_dgegqr_gpu version 4 N 10000,64 c
Code: Select all
// ================== Cholesky QR ================================
magma_dgemm( MagmaConjTrans, MagmaNoTrans, n, n, m, c_one,
dA, ldda, dA, ldda, c_zero, dwork, n, queue );
magma_dgetmatrix( n, n, dwork, n, work, n, queue );
lapackf77_dpotrf( "u", &n, work, &n, info );
magma_dsetmatrix( n, n, work, n, dwork, n, queue );
magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
m, n, c_one, dwork, n, dA, ldda, queue );
// ================== end of ikind == 4 ================================
Code: Select all
magma_dgemm( MagmaConjTrans, MagmaNoTrans, n, n, m, c_one,
dA, ldda, dA, ldda, c_zero, dwork, n, queue );
magma_dpotrf_gpu(MagmaUpper, n, dwork, n, &info);
magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
m, n, c_one, dwork, n, dA, ldda, queue );
Re: Create distributed matrix on gpus with no cpu to gpu cop
Hi Mark,
The parallel layout of X comes from the finite element discretization partitioning. We could possibly copy X into a ScaLAPACK block cyclic layout and do triangular solve,
but I am not sure if that would be efficient.
The reason we use Cholesky QR is because its prefactor is much lower than actual QR in a parallel implementation.
Sambit
We do not use triangular solve currently as our X matrix is a parallel distributed memory (M_global x N) where M_global can go beyond 1e+6 degrees of freedom.I don't see a reason to create an explicit inverse here. Inverses are usually slower and less accurate than doing a solve. Just do a triangular solve (trsm) with L on the right. I.e., solve
The parallel layout of X comes from the finite element discretization partitioning. We could possibly copy X into a ScaLAPACK block cyclic layout and do triangular solve,
but I am not sure if that would be efficient.
We have benchmarked the accuracy and robustness of the Cholesky QR in our electronic structure calculations against the actual QR.Note the accuracy and orthogonality of the Cholesky QR method may be worse than doing an actual QR factorization (geqrf). See Matlab example below. Ichitaro Yamazaki also has a paper about Cholesky QR.
The reason we use Cholesky QR is because its prefactor is much lower than actual QR in a parallel implementation.
Sambit
Re: Create distributed matrix on gpus with no cpu to gpu cop
You could simply compute trsm in chunks (potentially in parallel). E.g., in Matlab:
I guess it depends on your distribution. It seems trsm (X / L) wouldn't be harder than trmm (X * L^{1}).
mark
Code: Select all
% compute nb rows at a time
nb = 100;
for i = 1 : nb : m
ib = min( nb, m  i + 1 );
Xhat( i:i+ib1, : ) = X( i:i+ib1, : ) / L;
end
mark
Last edited by mgates3 on Tue Oct 16, 2018 3:58 pm, edited 1 time in total.
Reason: gemm => trmm for L^{1}
Reason: gemm => trmm for L^{1}
Re: Create distributed matrix on gpus with no cpu to gpu cop
Thank you for your suggestions.
I have enough information now to get started with something similar to Stan's suggestion:
Best,
Sambit
I have enough information now to get started with something similar to Stan's suggestion:
Code: Select all
magma_dgemm( MagmaConjTrans, MagmaNoTrans, n, n, m, c_one,
dA, ldda, dA, ldda, c_zero, dwork, n, queue );
magma_dpotrf_gpu(MagmaUpper, n, dwork, n, &info);
magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
m, n, c_one, dwork, n, dA, ldda, queue );
Thanks, I will think about this approach as well.I guess it depends on your distribution. It seems trsm (X / L) wouldn't be harder than trmm (X * L^{1}).
Best,
Sambit