Page 1 of 1

Block Cholesky algorithm (xPOTRF)

PostPosted: Mon May 27, 2019 1:17 pm
by pa33
I am trying to understand the block Cholesky (Level 3 BLAS) algorithm as implemented in DPOTRF. The explanation given here: http://www.netlib.org/utk/papers/factor/node9.html is quite clear and states that there are 3 steps:

1. DPOTF2 (compute L11)
2. DTRSM (compute L22)
3. DSYRK (update A22)

That website appears to be based on the 1996 publication:

Choi, J., Dongarra, J. J., Ostrouchov, L. S., Petitet, A. P., Walker, D. W., & Whaley, R. C. (1996). Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5(3), 173-184.

However, the actual implementation of DPOTRF uses a seemingly different approach:

1. DSYRK
2. DPOTF2
3. DGEMM
4. DTRSM

I was unable to find an explanation of these steps. But I found an old LAPACK version of DPOTRF from 1993 which predates that 1996 publication and is still based on the 4 steps above. Does anyone know of a reference describing the block Cholesky algorithm in DPOTRF, and whether these 4 steps are more efficient than the 3 step algorithm in the 1996 publication?

Re: Block Cholesky algorithm (xPOTRF)

PostPosted: Mon May 27, 2019 1:34 pm
by admin
Hi. The code in LAPACK/SRC is the left looking variant. The right looking version is at: LAPACK/SRC/VARIANTS/cholesky/RL. Cheers, Julien Langou.

Re: Block Cholesky algorithm (xPOTRF)

PostPosted: Mon May 27, 2019 11:54 pm
by pa33
Thank you, this is quite helpful. I found a publication (Anderson and Dongarra, 1990) which compares the different variants of the block (Level 3) Cholesky, and they find that they all have similar performance, so it doesn't seem to make much difference which one is used.

Re: Block Cholesky algorithm (xPOTRF)

PostPosted: Tue May 28, 2019 9:28 am
by Julien Langou
Thank you, this is quite helpful. I found a publication (Anderson and Dongarra, 1990) which compares the different variants of the block (Level 3) Cholesky, and they find that they all have similar performance, so it doesn't seem to make much difference which one is used.


The left-looking variant has less ``write`` (and more ``read``) than the right-looking variant. The left-looking variant was chosen for LAPACK. The right-looking is more parallel. The critical path length of the right looking variant is O(n) (where the input matrix is n-by-n tiles), while the critical path length for the left looking variant is O(n^2). The right-looking variant was chosen for ScaLAPACK. Using a multithreaded BLAS library on a multicore architecture, you would see a much better performance from the right-looking variant than the left-looking variant. A good (the best?) variant is the recursive variant.