CUDA BLAS XT support

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

CUDA BLAS XT support

Postby jcbowden12 » Thu Jun 09, 2016 10:30 pm

Hi MAGMA devs,

I'm interested in whether magma_dgemm() and the associated functions will be ported to use the multi-GPU cuBLAS-XT routines?

It would be good to have a hardware-agnostic interface to the latest NVIDIA libraries. Are there issues in implementing them?

Thanks,
Josh.
jcbowden12
 
Posts: 14
Joined: Tue Mar 06, 2012 2:36 am

Re: CUDA BLAS XT support

Postby jcbowden12 » Fri Jun 17, 2016 1:22 am

Having a better look at the MAGMA functions, it looks like they expect the arrays to already be on the GPU, so the cublasXt<t>gemm functions will not be a simple drop-in replacement.

Re: CUDA BLAS XT support

Postby mgates3 » Sun Jun 19, 2016 1:42 pm

Right, magma_dgemm is simply a wrapper around cublasDgemm, which takes matrices on the GPU. (Or, around clblasDgemm in clMAGMA.)

We don't have any immediate plans to wrap the cublas Xt functions, but it's helpful to hear if there is interest in that.

-mark
mgates3
 
Posts: 734
Joined: Fri Jan 06, 2012 2:13 pm

Re: CUDA BLAS XT support

Postby jcbowden12 » Thu Jul 07, 2016 9:17 pm

Hi Mark,

Do you think many of the MAGMA routines could be updated to use cuBLAS-XT? If the MAGMA code relies mostly on Level 3 BLAS for its performance, then cuBLAS-XT should help. We could possibly get away with very small GPU memories, since cuBLAS-XT splits an input matrix into smaller chunks, and it would give multi-GPU capability all in one change.

I'd be really interested in your (and other people's) thoughts on this, as having large problems fail due to GPU memory capacity is a painful issue when using MAGMA.

Regards,
Josh

Re: CUDA BLAS XT support

Postby mgates3 » Thu Jul 21, 2016 1:08 pm

There may be places where cuBLAS-Xt could be useful. Are there specific problems that you are interested in?

LU and Cholesky have out-of-GPU-memory, multi-GPU implementations already.
QR has a multi-GPU implementation, though not yet out-of-GPU-memory.
geev has a multi-GPU implementation. It is not out-of-GPU-memory, but it relies on BLAS-2, which cuBLAS-Xt does not appear to support. As BLAS-2 is memory bound, an out-of-GPU-memory version would likely be slower than the CPU-only version.
syevd has multi-GPU implementations. They are not out-of-GPU-memory, though. The classical version relies on BLAS-2, so has the same issues as geev. The 2-stage version would be a better candidate for an out-of-GPU-memory implementation.
gesvd does not yet have a multi-GPU implementation. The current implementation relies on BLAS-2, so has the same issues as geev. A 2-stage version is in progress.

-mark

Re: CUDA BLAS XT support

Postby jcbowden12 » Mon Oct 17, 2016 8:00 pm

Hi Mark, thanks for the reply and analysis (and sorry for the late reply).

I am interested in the eigenvalue problems, although I have been working to get HiPLARb working with newer versions of MAGMA, which covers most of the MAGMA library's functions.

As the eigenvalue problems seem to rely on Level 2 BLAS, I guess cuBLAS-XT does not help so much. New hardware technology (NVLink) may help with the data-movement bottleneck for these memory-bound applications, though?

The 2-stage syevd algorithm is great. Good to hear an SVD version is on its way. One thing, though: I have had trouble with the work size being passed in as a 32-bit integer. Is there any chance this could be an explicit 64-bit int type? R uses a 32-bit int interface to BLAS/LAPACK, so I use the LP64 interface, and forcing the work size to 32 bits dramatically reduces the size of problems the code can handle. (I'll start another thread for this...)

Regards,
Josh.
