ScaLAPACK band matrix or dense matrix LU?


ScaLAPACK band matrix or dense matrix LU?

Postby mrkkbb » Tue Feb 21, 2006 4:12 pm

Hello,

I have a very large general matrix (N = 87040) that has an upper band and a lower band of equal size (bw = 8703). Would this still be considered a "narrow" band? A compact block-column storage would save me a lot of space, but will I pay for it by using the divide and conquer algorithms in the LU for narrow band matrices in comparison to the dense matrix LU?

One other question: is the 1D block-column distribution described in the ScaLAPACK Users' Guide intended to be cyclic? For example, if I have 121 processors for my N x N matrix (see above), do I use a block size of 720, so that the last processor gets 640 columns? Or should I use a smaller block size of 200 and cycle through my 1D processor grid?

Thanks in advance
Cheers
Rob
mrkkbb
 
Posts: 1
Joined: Tue Feb 21, 2006 3:52 pm

Postby Julien Langou » Sun Feb 26, 2006 4:57 pm

Hello,
These are two good questions.
When should you switch from the banded format to the dense format?


For the LU factorization of a dense n-by-n matrix, you have
Flops=2/3*n^3
Storage=n^2


The cost for the LU factorization of an n-by-n matrix with lower bandwidth p
and upper bandwidth q is
Flops=2*n*p*q
Storage=n*(p+q)

(The flop formula is an upper bound on the number of flops; it is tight
only when p << n and q << n, and has the correct order of magnitude when p ~ n or q ~ n.)
This is for the sequential algorithm, but the parallel algorithms should have the same order of
magnitude.


So in your case (bandwidth = n/10), the banded format needs 5 times less
memory and about 33 times fewer flops. I guess it's a good bet to go with the banded
solver.
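
A quick back-of-the-envelope check (mine, not from ScaLAPACK; it just plugs Rob's n = 87040 and p = q = 8703 into the formulas above) reproduces those ratios:

```python
# Dense vs. banded LU cost estimates, using the sizes from the original post.
n = 87040          # matrix order
p = q = 8703       # lower / upper bandwidth

dense_flops = (2.0 / 3.0) * n**3
dense_storage = n**2

band_flops = 2.0 * n * p * q        # upper bound, tight only when p, q << n
band_storage = n * (p + q)

print(f"storage ratio (dense/banded): {dense_storage / band_storage:.1f}")  # ~5.0
print(f"flop ratio    (dense/banded): {dense_flops / band_flops:.1f}")      # ~33.3
```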

If you want to give PGBSV (ScaLAPACK's LU factorization and solve for
a banded matrix) a try, you should definitely have a look at
LAPACK/ScaLAPACK development forum, post 24: PDGBTRF/S Problems.
Stanimire gives example code there, and I think Stanimire and Ashton
also figured out a problem in the workspace calculation. (No patch was made, though...)


Can I use 1D block-column distribution (not cyclic) for an LU factorization?


Yes, you can, but you are going to get little or no performance out of it: you certainly do not want to do it.

I understand the point: you want an easy mapping of your matrix onto your
processors and do not want to bother with a cyclic distribution.

The problem with such a distribution is that, during the factorization, as you
progress through the columns of your matrix, once you are done with the first
processor it is not used anymore; once you are done with the second processor,
it is not used anymore, and so on. So for the sake of load balancing, you really
want to distribute your matrix in a column block-cyclic fashion.
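
To make the load-balancing point concrete, here is a small Python sketch (mine, not ScaLAPACK code) that contrasts a plain 1D block column distribution with a 1D block-cyclic one; the cyclic owner formula follows the usual ScaLAPACK convention of giving block k to process k mod nprocs:

```python
def owner_block(j, n, nprocs):
    """Process owning global column j under a plain 1D block distribution."""
    block = -(-n // nprocs)            # ceil(n / nprocs) columns per process
    return j // block

def owner_block_cyclic(j, nb, nprocs):
    """Process owning global column j under a 1D block-cyclic distribution
    with block size nb."""
    return (j // nb) % nprocs

n, nprocs, nb = 87040, 121, 200
done = n // 10                         # pretend 10% of the columns are factored
active_block  = {owner_block(j, n, nprocs) for j in range(done, n)}
active_cyclic = {owner_block_cyclic(j, nb, nprocs) for j in range(done, n)}
print(len(active_block), "of", nprocs, "processes still have work (block)")   # 109
print(len(active_cyclic), "of", nprocs, "processes still have work (cyclic)") # 121
```

With the plain block layout the first dozen processes are already idle after only 10% of the factorization; with the cyclic layout every process keeps columns until the very end.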

Then, sure, you can go with a 1D block-column cyclic distribution; ScaLAPACK will
give you the correct answer, but, once more, the performance is not going to be that good. In the
ScaLAPACK LU-factorization design, the panel factorization is a blocking part of
the algorithm performed by a single column of processors. All the other processors
wait for this operation to finish before doing their own work. So you
certainly want to have some parallelism in this part of the algorithm as
well.

That's a quick explanation of why you really need a 2D block-cyclic distribution
of your matrix to get good performance out of an LU factorization.
The ScaLAPACK Users' Guide and the LFC project recommend a grid shape with a ratio of about 1:4.

So, with 121 processors, you might want to try a 4-by-30 grid with a block size of, say, 64 (and leave one processor idle).
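
For what it's worth, here is a Python transcription of the NUMROC logic (the ScaLAPACK tool routine that computes how many rows or columns of a block-cyclically distributed matrix land on a given process). Assuming the 4-by-30 grid and block size 64 suggested above, it shows the size of the local piece held by process (0,0):

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows/columns of an n-sized dimension owned by process iproc
    when distributed block-cyclically with block size nb over nprocs processes
    (0-based transcription of ScaLAPACK's NUMROC)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                  # number of full blocks
    num = (nblocks // nprocs) * nb
    extrablks = nblocks % nprocs
    if mydist < extrablks:
        num += nb                      # one extra full block
    elif mydist == extrablks:
        num += n % nb                  # the trailing partial block
    return num

n, nb = 87040, 64
pr, pc = 4, 30                         # 4-by-30 grid, 120 of the 121 processes
local_rows = numroc(n, nb, 0, 0, pr)   # 21760
local_cols = numroc(n, nb, 0, 0, pc)   # 2944
print(f"process (0,0) holds a {local_rows} x {local_cols} local block")
```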

Julien
Julien Langou
 
Posts: 733
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Banded matrices

Postby rvdg » Mon Mar 06, 2006 5:18 pm

The fundamental problem with your matrix is that it is too wide for ScaLAPACK's narrow band routine, yet too narrow for the wide band case...

To understand the issues, may I suggest you read:
Thuan D. Cao, John F. Hall, and Robert A. van de Geijn, "Parallel Cholesky Factorization of a Block Tridiagonal Matrix," 4th Workshop on High Performance Scientific and Engineering Computing with Applications (HPSECA-02), in Proceedings of the International Conference on Parallel Processing 2002 (ICPP-02),
available from http://www.cs.utexas.edu/users/plapack/new/pubs.html
rvdg
 
Posts: 8
Joined: Mon Feb 13, 2006 7:56 pm

