Batched dpotri?

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

Batched dpotri?

Post by rene » Sat Aug 18, 2018 3:57 pm

Hi all,

There are placeholders in the Doxygen documentation for a batched Cholesky inverse as well as for some of the required helper functions. Unfortunately, it appears that no actual implementation exists. Are there any plans to add these in the near future?

I need the inverse of several million small (5x5) SPD matrices via their Cholesky factorization. As a workaround I'm currently calling magma_dposv_batched() with an identity matrix for B, but I suspect this is much less efficient than an explicit dpotri would be, if only because of the extra memory traffic.
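In case it helps to see it, here is a condensed sketch of that workaround (not my literal code; the pointer-array setup, allocation and error handling are omitted, and the magma_dposv_batched() argument order is what I see in the headers here, so please check it against your MAGMA version):

/* Condensed sketch of the workaround: solve A_i * X_i = I for every 5x5 SPD
 * matrix in the batch, so that each X_i = A_i^{-1} on exit.  dA_array and
 * dB_array are device arrays of batchCount pointers to the 5x5 matrices;
 * every B_i is assumed to have been pre-filled with a 5x5 identity. */
#include "magma_v2.h"

void invert_spd_batch(double **dA_array, double **dB_array,
                      magma_int_t *dinfo_array, magma_int_t batchCount,
                      magma_queue_t queue)
{
    const magma_int_t n = 5, nrhs = 5, ldda = 5, lddb = 5;

    /* Cholesky-factorize each A_i and solve against the identity in one call;
     * each B_i then holds A_i^{-1}. */
    magma_dposv_batched(MagmaLower, n, nrhs,
                        dA_array, ldda,
                        dB_array, lddb,
                        dinfo_array, batchCount, queue);
}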

Cheers,

Rene


Re: Batched dpotri?

Post by rene » Wed Sep 19, 2018 8:42 am

Hi again,

For the past couple of weeks I've been using a custom implementation based on LAPACK's dpotri, translated into C++ on top of my own fixed-size matrix library. It's a naive CUDA kernel that processes one 5x5 matrix per thread entirely in registers; I experimented with shared-memory prefetching, but that wasn't any faster. I've also applied some generic CUDA tricks to several MAGMA routines to speed things up. On two million random 5x5 matrices, magma_dpotrf_batched() now takes about 12 ms, plus about 10 ms for my own dpotri kernel, on my test hardware (Titan X Pascal). That compares well with the 90-100 ms I was previously seeing when using magma_dposv_batched() to do the same thing.
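To give an idea of the approach, here is a rough, simplified sketch of the kernel (not my actual code, which is generated from the fixed-size matrix library; it assumes the batch is stored contiguously in column-major order with stride N*N, and that the lower triangle of each matrix holds the factor produced by magma_dpotrf_batched() with MagmaLower):

// One thread takes the lower Cholesky factor L of one NxN matrix and forms
// A^{-1} = L^{-T} * L^{-1} entirely in registers, i.e. the dtrtri + dlauum
// steps that dpotri performs.
#include <cuda_runtime.h>

template <int N>
__global__ void potri_lower_batched(double *dA, int batchCount)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batchCount) return;

    double *Ai = dA + (size_t)idx * N * N;

    // Load the lower triangle (the Cholesky factor L) into registers.
    double a[N][N];
    for (int j = 0; j < N; ++j)
        for (int i = j; i < N; ++i)
            a[i][j] = Ai[i + j * N];

    // Invert L in place (unblocked, like dtrtri for a lower-triangular matrix).
    for (int j = 0; j < N; ++j) {
        a[j][j] = 1.0 / a[j][j];
        for (int i = j + 1; i < N; ++i) {
            double s = 0.0;
            for (int k = j; k < i; ++k) s += a[i][k] * a[k][j];
            a[i][j] = -s / a[i][i];
        }
    }

    // A^{-1} = L^{-T} * L^{-1} (the dlauum step); only the lower triangle is
    // needed since the result is symmetric.
    double inv[N][N];
    for (int j = 0; j < N; ++j)
        for (int i = j; i < N; ++i) {
            double s = 0.0;
            for (int k = i; k < N; ++k) s += a[k][i] * a[k][j];
            inv[i][j] = s;
        }

    // Write the lower triangle of the inverse back over the factor.
    for (int j = 0; j < N; ++j)
        for (int i = j; i < N; ++i)
            Ai[i + j * N] = inv[i][j];
}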

Is this something the MAGMA authors would be interested in? The dpotri implementation is probably only useful to people working in C++, since the matrix dimension is a template parameter that has to be known in the calling code at compile time.
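For example, the launch of the sketch above looks roughly like this in my case, with the dimension fixed at compile time:

// N = 5 must be known at compile time; dA points at the contiguous batch
// whose matrices were factorized by magma_dpotrf_batched() (via its pointer array).
const int batchCount = 2 * 1000 * 1000;
const int threads = 128;
const int blocks  = (batchCount + threads - 1) / threads;
potri_lower_batched<5><<<blocks, threads>>>(dA, batchCount);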

The more generic CUDA tricks, on the other hand, could lead to speedups across the board. Is that something you think could be turned into an academic publication, even though the contribution is technical rather than algorithmic? I typically publish on real-time 3D computer vision for robotics and would appreciate any guidance you might have, or any opportunity to collaborate on a paper. If you think publication is realistic, I could devote more time to it, run benchmarks, and submit patches.

Cheers,

Rene


Re: Batched dpotri?

Post by Stan Tomov » Wed Sep 19, 2018 10:09 am

Hi Rene,
This sounds very good! We are interested in any improvements to these kernels. I will contact you with the procedure for contributing them - we will have some software engineering requirements, and we will have to see whether the code can be generalized (and tuned easily) for other sizes and precisions.
Stan
