Hi,
I am taking great interest in the MAGMA project, as it looks very promising. I was just wondering whether there are plans to support multiGPU configurations (if available)? This is the next logical step, and could unlock another level of performance.
FYI, my problem is that of a multiGPU multigrid solver. On the coarsest level, the density of the resulting linear system means that using an LU factorization is much more efficient than an iterative indirect approach. My problem is spread across 4 GPUs (probably more in the future), so the vector to which I wish to apply the factorization is spread over these GPUs. Before I stumbled across MAGMA, I assumed the optimum approach would be just to reassemble the vector in full on the CPU and do the LU factorization there. My problem has O(5000) degrees of freedom.
Cheers,
Mike.
MultiGPU support in the future?

Re: MultiGPU support in the future?
Thanks for your interest in MAGMA!
We are planning to support multiGPU, along with multicore+multiGPUs. Here are some slides summarizing MAGMA's current work and near future plans.
The effect of a multiGPU implementation would be observed only on large problems, though. If you have, for example, 4 cards, each with 240 cores (960 in total), scalability can be expected only for large problems. I guess the ideal for a coarse problem of size O(5000) (as part of a multigrid solve) would be to overlap the coarse LU factorization with the initialization of your multigrid structures (e.g. on the finer levels) and the computation up to the point where you need the first coarse-level solve, and then reuse the factorization in all iterations. The factorization can be done with the help of GPUs, but the triangular solves with the L and U factors (at every iteration) may indeed make sense to do on the CPU, as you originally thought (moreover, current CUBLAS triangular solves are slow). Have you thought about making the coarse solve an iterative solver with, for example, block Jacobi preconditioning, where every GPU does the LU for the part of the matrix associated with it and uses it as a preconditioner? I hope these random thoughts are helpful.
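To make the block Jacobi idea concrete, here is a minimal pure-Python sketch. It is not MAGMA or CUBLAS code: the names (lu_factor, lu_solve, block_jacobi_setup, richardson), the small 4x4 complex test matrix, and the use of a plain preconditioned Richardson iteration are all assumptions made for illustration. In practice one would more likely wrap the preconditioner in a Krylov method, and each diagonal block would live on its own GPU. Each block is factored once at setup and the factors are reused in every iteration.

```python
def lu_factor(a):
    """LU with partial pivoting on a copy of a (P*A = L*U)."""
    n = len(a)
    lu = [list(row) for row in a]
    piv = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        piv[k] = p
        lu[k], lu[p] = lu[p], lu[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b using the factors from lu_factor."""
    n = len(lu)
    x = list(b)
    for k in range(n):  # apply the recorded row swaps to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):  # forward substitution (unit lower L)
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution (upper U)
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def block_jacobi_setup(a, block_size):
    """Factor each diagonal block once (one block per hypothetical GPU)."""
    factors = []
    for s in range(0, len(a), block_size):
        e = min(s + block_size, len(a))
        block = [row[s:e] for row in a[s:e]]
        factors.append((s, e, *lu_factor(block)))
    return factors


def apply_preconditioner(factors, r):
    """z = M^{-1} r, where M is the block diagonal of A."""
    z = [0] * len(r)
    for s, e, lu, piv in factors:
        z[s:e] = lu_solve(lu, piv, r[s:e])
    return z


def richardson(a, b, factors, tol=1e-12, max_iter=200):
    """Preconditioned Richardson iteration: x += M^{-1} (b - A x)."""
    x = [0] * len(b)
    for it in range(max_iter):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]
        if max(abs(ri) for ri in r) < tol:
            return x, it
        z = apply_preconditioner(factors, r)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x, max_iter


# Illustrative complex system: strong diagonal blocks, weak coupling.
A = [[4 + 1j, 1, 0.1, 0],
     [1, 3, 0, 0.1j],
     [0.1, 0, 5, 1 - 1j],
     [0, 0.1, 1 + 1j, 4]]
b = [1, 2j, 3, 4 - 1j]

factors = block_jacobi_setup(A, block_size=2)  # done once, at setup
x, iterations = richardson(A, b, factors)      # reuses the factors
print("iterations:", iterations)
```

The iteration only converges quickly when the coupling between blocks is weak relative to the blocks themselves, as in this made-up matrix; whether that holds for a given coarse-grid operator has to be checked case by case.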
Regards,
Stan
Re: MultiGPU support in the future?
Hi Stan
Thanks for the responses and thoughts.
I would certainly do the LU factorization during the setup of the algorithm, so the cost of this isn't a problem since it can be amortized. The original fine grid problem is >= 10^8 complex degrees of freedom (FYI: the fine grid matrix is a discretization of the Dirac operator from quantum chromodynamics). I don't think (correct me if I'm wrong) that complex support is there yet in MAGMA. Presumably this will only increase the number of flops/s since complex arithmetic is more compute intensive?
Your suggestion of doing a block solve on each GPU is something I'd considered. However, unless I'm mistaken, this wouldn't get round the fact that the triangular solve would still be slow, since it would be done on the GPU. Or are you suggesting having each GPU do its associated LU factorization, and then having the CPU do the triangular solves for each block?
How slow is slow for the triangular solves? Is this something that will be sped up? I of course understand that triangular solves are not so easy to run efficiently on a massively threaded platform.
Cheers.

Re: MultiGPU support in the future?
Hi Mike,
CUBLAS is improving the triangular solves, but the last time I checked (version 2.1) cublasStrsm / cublasDtrsm was getting to about 0.24 / 0.09 GFlop/s on a GTX 280 for matrices of size 14,000 / 7,000. We mentioned in an article (page 10) that this can be improved to about 14 / 6.7 GFlop/s. We will include certain BLAS (techniques like in the article, and as given in these MAGMA roadmap slides) in the next MAGMA release by November 14.
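As a rough back-of-the-envelope illustration of what these rates could mean for a coarse problem of size O(5000), the sketch below estimates the time for one forward plus one back substitution with a single right-hand side. It assumes about 2*n^2 real flops for the pair of solves, and it assumes the double-precision rates quoted above (measured at size 7,000) carry over to n = 5000 with one right-hand side, which may well not hold. The function names are made up for this example, and complex arithmetic would raise the flop count further.

```python
def substitution_flops(n, nrhs=1):
    """Approximate real flops for one forward plus one back substitution."""
    return 2 * n * n * nrhs


def seconds(flops, gflops_per_s):
    """Time to execute `flops` at a sustained rate given in GFlop/s."""
    return flops / (gflops_per_s * 1e9)


n = 5000                             # coarse-grid size from the first post
work = substitution_flops(n)

t_cublas = seconds(work, 0.09)       # cublasDtrsm rate quoted for CUBLAS 2.1
t_improved = seconds(work, 6.7)      # improved double-precision rate quoted

print(f"flops per coarse solve: {work:.2e}")
print(f"at 0.09 GFlop/s: {t_cublas * 1e3:.1f} ms")
print(f"at 6.7 GFlop/s:  {t_improved * 1e3:.1f} ms")
print(f"ratio: {t_cublas / t_improved:.1f}x")
```

Since this cost is paid at every multigrid iteration, the gap between the two rates matters much more here than it does for the one-off factorization.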
We have the MAGMA routines for complex but not all the complex BLAS that is needed. NVIDIA is working on completing their complex BLAS, and when that is done we will also release the MAGMA routines in complex arithmetic. As you mention, complex is more compute intensive, but its effect cannot be seen yet in the current CUBLAS implementation. For example, on a GTX 280, sgemm runs at up to ~375 GFlop/s and cgemm at up to ~292 GFlop/s.
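To spell out why complex arithmetic is more compute intensive, the following sketch compares flops per byte for square sgemm and cgemm. It uses the usual counting convention of 2 real flops per real multiply-add and 8 per complex multiply-add, counts only one pass over A, B and C for the memory traffic, and assumes the cgemm rate quoted above was measured with that 8-flop convention. The function name and the size n = 4096 are arbitrary choices for the example.

```python
def gemm_stats(n, is_complex, bytes_per_real=4):
    """Return (real flops, flops per byte) for an n x n single-precision gemm."""
    flops_per_multiply_add = 8 if is_complex else 2
    reals_per_entry = 2 if is_complex else 1
    flops = flops_per_multiply_add * n ** 3
    # One pass over A, B and C; a deliberately simple traffic model.
    bytes_moved = 3 * n * n * reals_per_entry * bytes_per_real
    return flops, flops / bytes_moved


n = 4096
s_flops, s_intensity = gemm_stats(n, is_complex=False)
c_flops, c_intensity = gemm_stats(n, is_complex=True)

# Run times implied by the peak rates quoted for the GTX 280.
s_time = s_flops / 375e9
c_time = c_flops / 292e9

print(f"sgemm: {s_intensity:.0f} flops/byte, {s_time:.2f} s")
print(f"cgemm: {c_intensity:.0f} flops/byte, {c_time:.2f} s")
print(f"intensity ratio (cgemm/sgemm): {c_intensity / s_intensity:.1f}")
```

Under these assumptions cgemm does twice as many flops per byte as sgemm, so one might expect it to reach a GFlop/s rate at least as high; the lower quoted figure is what "its effect cannot be seen yet" refers to.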