Page 1 of 1

ScaLAPACK band matrix or dense matrix LU?

PostPosted: Tue Feb 21, 2006 4:12 pm
by mrkkbb

I have a very large general matrix (N = 87040) that has an upper band and a lower band of equal size (bw = 8703). Would this still be considered a "narrow" band? A compact block-column storage would save me a lot of space, but will I pay for it by using the divide and conquer algorithms in the LU for narrow band matrices in comparison to the dense matrix LU?

One other question, is the 1D block-column distribution described in the ScaLAPACK user guide intended to be cyclic? For example, if I have 121 processors for my matrix (see above) of NxN, do I use a block size equal to 720, such that the last processor has 640 columns. Or should I be using a smaller block of 200 and cycle through my 1D processor grid?

Thanks in advance

PostPosted: Sun Feb 26, 2006 4:57 pm
by Julien Langou
this is two good questions.
when should you swicth from the banded format to the dense format?

For the LU factorization of a dense n-by-n matrix, you have

The cost for the LU factorization of an n-by-n matrix with lower bandwidth p
and upper bandwidth q is

(Formula for the FLOPS is a upper bound of the number of FLOPS, it is tight
only when p<<n ans q<<n, and has a correct magnitude when p~n or q~m.)
This is for sequential algorithm, but parallel algorithms should have the same order of

So in your case (bandwith = n/10), it means that you will need 5 times less
memory and 33 times less Flops. I guess it's a good bet to go with the banded

If you want to give a try to PGBSV (LU factorization and solve for ScaLAPACK of
a banded matrix), you should definetly have a look at
LAPACK/ScaLAPACK development forum, post 24: PDGBTRF/S Problems
Stanimire gives an example code, and I think also Stanimire and Ashton
figured out a problem in the workspace calculation. (No patch made though...)

Can I use 1D block-column distribution (not cyclic) for an LU factorization?

Yes you can, but you are not going to obtain any performance out of it... (or
very poor): you certainly do not want to do it.

I understand the point. You want to have an easy mapping of your matrix on your
processors and do not want to bother with the cyclic distribution.

The problem with such a distribution is that basically, during the
factorization, while you are progressing in the columns of your matrix, once
you are done with the first processor, it is not used anymore. Once you are
done with second processor, it is not used anymore, etc... So for load
balancing sake, you really want to distribute your matrix in a column block
cyclic fashion.

Then sure you can go with 1D block-column cyclic distribution, ScaLAPACK will
give you the correct answer, but, once more, the performance are not going to be that good. In the
ScaLAPACK LU-factorization design, the panel fatorization is a blocking part of
the algorithm done by a column of processors. All the other processors are
waiting for this operation to be finished to perform their own work. So you
certainly want to have some parallelism in this part of the algorithm as

That's a quick explanataion of why you really need 2D block cyclic distribution
for your matrix to have good performance on an LU factorization.
ScaLAPACK Users' Guide and the LFC project recommend a grid-shape with a ratio 1:4,

So, with 121 processors, you might want to try 4-by-30 grid-shape with block size 64 (?) for example (and leave one processor bailing).


Banded matrices

PostPosted: Mon Mar 06, 2006 5:18 pm
by rvdg
The fundamental problem with your matrix is that it is too wide for ScaLAPACK's narrow band routine, yet too narrow for the wide band case...

To understand the issues, may I suggest you read
Thuan D. Cao, John F. Hall, and Robert A. van de Geijn, "Parallel Cholesky Factorization of a Block Tridiagonal Matrix," 4ths Workshop on High Performance Scientific and Engineering Computing with Applications (HPSECA-02) , In Proceedings of the International Conference on Parallel Processing 2002 (ICPP-02) .
available from