Hello,

These are two good questions.

When should you switch from the banded format to the dense format?

For the LU factorization of a dense n-by-n matrix, you have

Flops=2/3*n^3

Storage=n^2

The cost for the LU factorization of an n-by-n matrix with lower bandwidth p and upper bandwidth q is

Flops=2*n*p*q

Storage=n*(p+q)

(The formula for the flops is an upper bound on the number of flops; it is tight only when p<<n and q<<n, and has the correct order of magnitude when p~n or q~n.)

This is for the sequential algorithm, but parallel algorithms should have the same order of magnitude.

So in your case (bandwidth = n/10, i.e. p = q = n/10), the banded format needs about 5 times less memory and about 33 times fewer flops. I guess it's a good bet to go with the banded solver.
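To check these ratios for other bandwidths, here is a minimal sketch in Python that just evaluates the two cost formulas above. The function names and the matrix size n = 10000 are only for illustration, and the banded flop count is the upper bound mentioned above, so take the output as an order of magnitude.

```python
def dense_lu_cost(n):
    """Leading-order cost of dense LU: (flops, storage in matrix entries)."""
    return 2.0 / 3.0 * n**3, float(n**2)


def banded_lu_cost(n, p, q):
    """Upper-bound cost of banded LU with lower bandwidth p, upper bandwidth q."""
    return 2.0 * n * p * q, float(n * (p + q))


def dense_over_banded(n, p, q):
    """Return (flop ratio, storage ratio) of dense cost over banded cost."""
    dense_flops, dense_storage = dense_lu_cost(n)
    band_flops, band_storage = banded_lu_cost(n, p, q)
    return dense_flops / band_flops, dense_storage / band_storage


if __name__ == "__main__":
    n = 10000  # hypothetical matrix size, not taken from the question
    flop_ratio, storage_ratio = dense_over_banded(n, n // 10, n // 10)
    print(f"flops: {flop_ratio:.1f}x, storage: {storage_ratio:.1f}x")
```

With p = q = n/10 the ratios do not depend on n, which is why the "5 times" and "33 times" figures hold for any matrix size.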

If you want to give PGBSV a try (the ScaLAPACK LU factorization and solve for a banded matrix), you should definitely have a look at

LAPACK/ScaLAPACK development forum, post 24: PDGBTRF/S Problems

Stanimire gives an example code there, and I think Stanimire and Ashton also figured out a problem in the workspace calculation. (No patch has been made, though...)

Can I use 1D block-column distribution (not cyclic) for an LU factorization?

Yes you can, but you are going to get very poor performance out of it, if any: you certainly do not want to do it.

I understand the point. You want to have an easy mapping of your matrix on your

processors and do not want to bother with the cyclic distribution.

The problem with such a distribution is that, as the factorization progresses through the columns of your matrix, once you are done with the first processor's columns, that processor is not used anymore. Once you are done with the second processor's columns, it is not used anymore either, and so on. So for the sake of load balancing, you really want to distribute your matrix in a column block cyclic fashion.
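To make the load-balancing argument concrete, here is a small sketch in Python. It is not ScaLAPACK code; it is a toy model that assumes a process is "active" at step k if it still owns at least one column block with index >= k. The sizes (16 column blocks, 4 processes) and the function names are made up for illustration.

```python
def owner_block(jb, nblocks, nprocs):
    """1D block (non-cyclic): each process owns one contiguous run of column blocks."""
    per_proc = -(-nblocks // nprocs)  # ceiling division
    return jb // per_proc


def owner_block_cyclic(jb, nprocs):
    """1D block-cyclic: column blocks are dealt out round-robin."""
    return jb % nprocs


def active_counts(nblocks, nprocs, owner):
    """For each step k, count the processes that still own a column block >= k."""
    return [len({owner(jb) for jb in range(k, nblocks)}) for k in range(nblocks)]


if __name__ == "__main__":
    nblocks, nprocs = 16, 4  # hypothetical sizes
    block = active_counts(nblocks, nprocs, lambda jb: owner_block(jb, nblocks, nprocs))
    cyclic = active_counts(nblocks, nprocs, lambda jb: owner_block_cyclic(jb, nprocs))
    print("block :", block)
    print("cyclic:", cyclic)
```

In this toy model the plain block distribution loses a process after every quarter of the factorization, while the cyclic one keeps all four processes busy until the last few column blocks.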

Then sure, you can go with a 1D block-column cyclic distribution and ScaLAPACK will give you the correct answer but, once more, the performance is not going to be that good. In the ScaLAPACK LU factorization design, the panel factorization is a blocking part of the algorithm, done by a column of processors (with a 1D column distribution, that column is a single processor). All the other processors wait for this operation to finish before performing their own work. So you certainly want to have some parallelism in this part of the algorithm as well.

That's a quick explanation of why you really need a 2D block cyclic distribution of your matrix to get good performance from an LU factorization.

The ScaLAPACK Users' Guide and the LFC project recommend a grid shape with a ratio of 1:4. So, with 121 processors, you might want to try, for example, a 4-by-30 grid shape with block size 64 (?) (and leave one processor idle).
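If it helps to see what a 2D block cyclic distribution does to your matrix, here is a minimal sketch in Python of the usual mapping from a global entry to the process that owns it. It assumes 0-based indices, square 64-by-64 blocks, and that the first block lives on process (0, 0); the function name owner_2d is hypothetical and is not a ScaLAPACK routine.

```python
def owner_2d(i, j, mb, nb, nprow, npcol):
    """Process coordinates (prow, pcol) owning global entry (i, j), 0-based.

    Assumes the first block of the matrix lives on process (0, 0).
    """
    return (i // mb) % nprow, (j // nb) % npcol


if __name__ == "__main__":
    nprow, npcol, nb = 4, 30, 64  # grid shape and block size suggested above
    print("processes used:", nprow * npcol)
    for i, j in [(0, 0), (64, 0), (0, 64), (300, 2000)]:
        print((i, j), "->", owner_2d(i, j, nb, nb, nprow, npcol))
```

Both the row and the column index wrap around the grid, so every process keeps a share of the trailing matrix until late in the factorization, and each panel is spread over the 4 processes of one process column.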

Julien