IDR performance?

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

Re: IDR performance?

Postby bnas » Wed Jun 22, 2016 3:39 pm

Hey again,

Thanks for your info. To set the record straight: turning the smoothing option back on improved the solving time rather than decreasing it, my bad. The poor IDR performance (not converging) with the simple didr.cpp that I mentioned was actually due to smoothing being off. So in my case, smoothing has a GREAT beneficial impact on my convergence behavior. But yes, smoothing=0/1 should, imo, be handled via opts rather than at compile time; the impact of omega's value is trivial compared to that, and my investigation showed really no effect, since it's only a starting value I suppose, right?

Regarding your question, I work in the field of CFD, so my matrices are mainly products of Navier-Stokes discretizations with finite elements, k-epsilon turbulence, etc. Of course they are sparse matrices, with different characteristics per generating equation. I am exploring MAGMA solver/preconditioner performance, especially your _merge implementations, on various matrices/sizes. So far they seem very promising :) There are some sparse solvers I had never heard about, like bombardment, though I am nowhere near an LA expert, so I might give those a try too!
bnas
 
Posts: 8
Joined: Tue Jun 14, 2016 4:06 am

Re: IDR performance?

Postby hartwig anzt » Wed Jun 22, 2016 4:09 pm

>> There are some sparse solvers I never heard about, though I am nowhere near a LA expert, like bombardment etc, so I might give those a try too!

Bombardment is not a Krylov solver by itself: it combines a number of Krylov solvers in an interleaved fashion, i.e. QMR, CGS, BiCGSTAB. The idea is: if I have no idea which solver to use, I run a bunch of them.
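As a rough CPU-side sketch of that idea (using SciPy's QMR, CGS, and BiCGSTAB rather than MAGMA, and running the solvers one after another instead of truly interleaving their iterations):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import qmr, cgs, bicgstab

# Small 1D Poisson test matrix as a stand-in for a CFD system.
n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Run several Krylov solvers on the same system and keep whichever
# converged with the smallest residual; info == 0 means converged.
solvers = {"QMR": qmr, "CGS": cgs, "BiCGSTAB": bicgstab}
best = None
for name, solve in solvers.items():
    x, info = solve(A, b, maxiter=200)
    res = np.linalg.norm(b - A @ x)
    if info == 0 and (best is None or res < best[2]):
        best = (name, x, res)

name, x, res = best
print(f"winner: {name}, residual {res:.2e}")
```

A real bombardment implementation interleaves the solvers' iterations so the whole run stops as soon as any one of them converges; this sequential version only illustrates the "try several, keep the winner" idea.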

Did you look into preconditioning? Does ILU give you good benefits?

Thanks, Hartwig
hartwig anzt
 
Posts: 76
Joined: Tue Sep 02, 2014 5:44 pm

Re: IDR performance?

Postby bnas » Thu Jun 23, 2016 2:56 am

Yes, I have been looking into preconditioning; actually, most of my systems do not converge unless properly preconditioned. Block Jacobi and ILU have given me the best results so far, depending on matrix characteristics. I will run quite a few tests for all possible solver/preconditioner combinations, and I can share the results if you want :)
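For reference, a minimal CPU sketch of the ILU idea with SciPy (not the MAGMA API): spilu builds approximate LU factors of A, which are then wrapped as a preconditioner for GMRES.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, LinearOperator, gmres

# Small 1D Poisson matrix as a stand-in for a CFD system (CSC for spilu).
n = 400
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4)                  # incomplete LU factors of A
M = LinearOperator(A.shape, matvec=ilu.solve)  # applies M ~= A^{-1}

x, info = gmres(A, b, M=M, maxiter=200)        # info == 0: converged
print("converged:", info == 0, "residual:", np.linalg.norm(b - A @ x))
```

The preconditioner trades a cheap, approximate solve per iteration for a large cut in the iteration count; on badly conditioned systems the unpreconditioned solver may not converge at all, which matches the behavior described above.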

Since my matrices are explicitly available at runtime, my next step would be to try to spread the too-large-to-fit matrices across more than one GPU, since the typical 2-4 GB of device RAM is never enough, though I really don't know how that can be implemented in MAGMA. I understand there is OpenMP/pthreads support and that some functions have multi-GPU equivalents. Have you ever played around with more than one GPU? My feeling is that since you are spawning different streams for some of your implementations, you are quite close to going multi-GPU :)
bnas
 

Re: IDR performance?

Postby hartwig anzt » Thu Jun 23, 2016 8:18 am

Yes, we actually have some multi-GPU code that is not released. It looks like most people only use one GPU, maybe because most GPUs nowadays have a good amount of memory; the Tesla line typically has 6-12 GB.

Are the matrices confidential? Otherwise, I would appreciate it if you could provide me with one example matrix; then I can also take a look at what works. In particular, I am working on some new preconditioning techniques that may work very well.

Also, do the systems arise in a sequence, or are they individual systems? If they arise in a sequence, you may have a look into updating an existing ILU preconditioner instead of always generating a new one: http://www.netlib.org/utk/people/JackDo ... LU_GPU.pdf
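For illustration only (this is not the method from the linked paper, which updates the ILU factors themselves): a much simpler stand-in for the same motivation is to freeze one factorization and reuse it across a sequence of slowly changing systems, refactoring only when the solver starts to struggle. A SciPy sketch:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, LinearOperator, gmres

n = 400
A0 = sp.diags([-1, 2.0, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A0)                                 # factor the first matrix once
M = LinearOperator(A0.shape, matvec=ilu.solve)  # frozen preconditioner

infos = []
for step in range(3):                           # sequence of nearby systems
    A = A0 + 0.01 * step * sp.eye(n, format="csc")  # small drift per step
    x, info = gmres(A, b, M=M, maxiter=200)     # reuse the old factors
    infos.append(info)
    print("step", step, "converged:", info == 0)
```

As long as each system stays close to the one that was factored, the stale preconditioner remains effective and the (expensive) factorization cost is amortized over the whole sequence.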
hartwig anzt
 

Re: IDR performance?

Postby bnas » Thu Jun 23, 2016 1:57 pm

Hmm, about the matrices I will need to ask and get back to you on that one :) If allowed, I'll be more than happy to provide them.

I think they could arise in a sequence, so I'll have a look at your suggestion! Regarding multi-GPU: for some matrix-free implementations, or for matrix-dependent cases with the domain decomposed over multiple devices (imagine a cubic domain split into 8 parts, each handled by a single GPU, with only the boundaries communicating), multi-GPU can actually multiply the performance; plus, not everyone can afford a Tesla for the single-GPU cases :) The unreleased multi-GPU code you mention, is it closed source, or is there just not enough interest from the community? I suppose I could do the partitioning myself and use standard single-GPU MAGMA plus OpenMP/MPI, handling the GPU-to-GPU communication myself.
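The boundary-exchange pattern described above can be sketched on the CPU with NumPy; the two array halves stand in for two GPUs, and only the one-cell halo crosses the "device boundary" after each Jacobi sweep (illustrative code for the pattern, not MAGMA's multi-GPU API):

```python
import numpy as np

# 1D Poisson problem -u'' = 1 on (0,1), u(0) = u(1) = 0, on n interior points.
n = 64
h2 = 1.0 / (n + 1) ** 2
u = np.zeros(n + 2)
f = np.ones(n + 2)

# Split the grid into two overlapping subdomains (stand-ins for two GPUs);
# the one-cell overlap is the "halo" that must be communicated.
mid = (n + 2) // 2
left = u[:mid + 1].copy()
right = u[mid - 1:].copy()
f_l = f[:mid + 1]
f_r = f[mid - 1:]

for _ in range(20000):
    # independent Jacobi sweeps, one per subdomain (one per GPU)
    left[1:-1] = 0.5 * (left[:-2] + left[2:] + h2 * f_l[1:-1])
    right[1:-1] = 0.5 * (right[:-2] + right[2:] + h2 * f_r[1:-1])
    # halo exchange: only the boundary cells cross the device boundary
    left[-1] = right[1]
    right[0] = left[-2]

u = np.concatenate([left[:-1], right[1:]])  # stitch the solution back together
```

The point is the communication volume: each sweep touches every cell of its subdomain, but only the halo cells (one value per boundary here, one face per cube in the 3D case above) need to travel between devices.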
bnas
 
