These are two good questions.
When should you switch from the banded format to the dense format?
For the LU factorization of a dense n-by-n matrix, you have Flops=2/3*n^3.
The cost for the LU factorization of an n-by-n matrix with lower bandwidth p
and upper bandwidth q is Flops=2*n*p*q.
(The banded formula is an upper bound on the number of flops; it is tight
only when p<<n and q<<n, and has the correct order of magnitude when p~n or q~n.)
This is for the sequential algorithm, but parallel algorithms should have the same order of magnitude.
So in your case (bandwidth = n/10, i.e. p = q = n/10), it means that you will need about 5 times less
memory and 33 times fewer flops. I guess it's a good bet to go with the banded format.
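To check those two numbers, here is a small Python sketch that just plugs p = q = n/10 into the formulas above. The function names and n = 10000 are made up for illustration. It assumes band storage of (p+q+1)*n entries, which is what gives the factor of 5; LAPACK's pivoted band format (DGBTRF) needs 2*p+q+1 rows, so the saving there is smaller.

```python
def dense_lu_flops(n):
    """Flop count for LU of a dense n-by-n matrix: 2/3 * n^3."""
    return 2.0 / 3.0 * n**3


def banded_lu_flops(n, p, q):
    """Upper bound for LU of a banded matrix (lower bw p, upper bw q): 2*n*p*q."""
    return 2.0 * n * p * q


def compare(n, p, q):
    """Return (flop ratio, memory ratio, memory ratio with pivoting fill-in)."""
    flop_ratio = dense_lu_flops(n) / banded_lu_flops(n, p, q)
    dense_mem = n * n
    band_mem = (p + q + 1) * n            # band only (assumption behind the "5x")
    pivot_band_mem = (2 * p + q + 1) * n  # LAPACK-style band storage with fill-in rows
    return flop_ratio, dense_mem / band_mem, dense_mem / pivot_band_mem


n = 10_000          # illustrative size, not from the original question
p = q = n // 10
flop_ratio, mem_ratio, pivot_mem_ratio = compare(n, p, q)
print(f"flops: {flop_ratio:.1f}x fewer")
print(f"memory (band only): {mem_ratio:.1f}x less")
print(f"memory (with pivoting rows): {pivot_mem_ratio:.1f}x less")
```

The flop ratio comes out at 100/3, about 33, independent of n; the memory ratios are just under 5 and about 3.3.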
If you want to give a try to PGBSV (LU factorization and solve for ScaLAPACK of
a banded matrix), you should definitely have a look at the LAPACK/ScaLAPACK development forum, post 24: "PDGBTRF/S Problems".
Stanimire gives an example code there, and I think Stanimire and Ashton also
figured out a problem in the workspace calculation. (No patch made though...)
Can I use 1D block-column distribution (not cyclic) for an LU factorization?
Yes you can, but you are not going to get any performance out of it (or
very poor performance): you certainly do not want to do it.
I understand the point. You want to have an easy mapping of your matrix on your
processors and do not want to bother with the cyclic distribution.
The problem with such a distribution is that, during the
factorization, as you progress through the columns of your matrix, once
you are done with the first processor's columns, that processor is not used anymore. Once you are
done with the second processor's columns, it is not used anymore, etc. So for load
balancing's sake, you really want to distribute your matrix in a column block cyclic fashion.
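To make the load-balancing argument concrete, here is a rough Python sketch that counts how many processes still own columns at or beyond the current column k. The sizes (n = 1200, 4 processes, block width 64) are made up for illustration, and the plain block distribution is modelled as the special case of one big block per process.

```python
def column_owner(j, nb, nprocs):
    """Process owning global column j in a 1D block-cyclic layout of width nb."""
    return (j // nb) % nprocs


def active_processes(n, nb, nprocs, k):
    """Processes that still own at least one column j >= k (the trailing matrix)."""
    return {column_owner(j, nb, nprocs) for j in range(k, n)}


n, nprocs = 1200, 4          # illustrative sizes
block_nb = n // nprocs       # plain block distribution: one block of 300 columns each
cyclic_nb = 64               # block-cyclic distribution

for k in (0, n // 4, n // 2, 3 * n // 4):
    n_block = len(active_processes(n, block_nb, nprocs, k))
    n_cyclic = len(active_processes(n, cyclic_nb, nprocs, k))
    print(f"k={k:4d}: block -> {n_block} active, block-cyclic -> {n_cyclic} active")
```

With the plain block layout the number of active processes drops 4, 3, 2, 1 as k advances, while the cyclic layout keeps all 4 busy at each of these steps.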
Then sure, you can go with a 1D block-column cyclic distribution, and ScaLAPACK will
give you the correct answer, but, once more, the performance is not going to be that good. In the
ScaLAPACK LU-factorization design, the panel factorization is a blocking part of
the algorithm, done by a column of processors. All the other processors are
waiting for this operation to be finished to perform their own work. So you
certainly want to have some parallelism in this part of the algorithm as well.
That's a quick explanation of why you really need a 2D block cyclic distribution
for your matrix to have good performance on an LU factorization.
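Here is a minimal Python sketch of the 2D block-cyclic mapping, to show how many processes share one panel. It assumes the first block sits on process (0, 0), and the names and sizes (512 rows, 64-by-64 blocks) are illustrative only, not ScaLAPACK API calls.

```python
def owner_2d(i, j, mb, nb, nprow, npcol):
    """Grid coordinates of the process owning global entry (i, j).

    Assumes a 2D block-cyclic layout with mb-by-nb blocks whose first
    block is stored on process (0, 0).
    """
    return ((i // mb) % nprow, (j // nb) % npcol)


def panel_owners(n, j, mb, nb, nprow, npcol):
    """Processes holding some part of global column j, i.e. sharing that panel."""
    return {owner_2d(i, j, mb, nb, nprow, npcol) for i in range(n)}


n, mb, nb = 512, 64, 64   # illustrative sizes

# 1D block-column cyclic: a 1-by-120 grid, the whole panel sits on one process.
print(sorted(panel_owners(n, 0, mb, nb, 1, 120)))

# 2D block cyclic: a 4-by-30 grid, the panel is spread over a column of 4 processes.
print(sorted(panel_owners(n, 0, mb, nb, 4, 30)))
```

On the 1-by-120 grid the panel is factored by a single process while the other 119 wait; on the 4-by-30 grid four processes work on it together.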
The ScaLAPACK Users' Guide and the LFC project recommend a grid-shape with a ratio 1:4.
So, with 121 processors, you might want to try a 4-by-30 grid-shape with block size 64 (?) for example (and leave one processor idle).
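If you want to look at the alternatives, this small Python sketch lists the grids that fit in a given number of processors along with how many are left idle. The helper name and the limit on the number of process rows are just assumptions for illustration; it does not say which shape is fastest, so you would still have to time a few of them.

```python
def candidate_grids(nprocs, max_nprow):
    """List (nprow, npcol, idle) for grids with nprow <= npcol using at most nprocs."""
    grids = []
    for nprow in range(1, max_nprow + 1):
        npcol = nprocs // nprow      # widest grid that fits with this many rows
        if npcol < nprow:
            break                    # keep the grid no taller than it is wide
        idle = nprocs - nprow * npcol
        grids.append((nprow, npcol, idle))
    return grids


for nprow, npcol, idle in candidate_grids(121, 11):
    print(f"{nprow:2d}-by-{npcol:<3d} grid, {idle} idle")
```

For 121 processors this includes the 4-by-30 grid with one processor idle, as well as the square 11-by-11 grid with none idle.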