I have written a simple program to factor and solve a banded matrix in ScaLAPACK using PDGBTSV and PDGBTRS. Basically, I invoke PDGBTSV, and then use the output from the internal call of PDGBTRF for additional calls to PDGBTRS. I have run the program with 1, 2, 4, and 8 processes (everything regarding the matrix and rhs kept the same) and find that the solve (PDGBTRS) portion does get faster as I increase use more processors(scales approx. as P^-.63), but the factorization step (PGBTRF called from PDGBSV) scales as P^3.0 ! I would like to run with more processors, but I am concerned about the factorization scaling...With one process, for this problem, the PDGBSV call takes about 2.6 seconds, but for 8 processes on the same matrix/rhs, I did the factorization once and it took nearly 40 minutes!

My sample problem as N=40000 unknowns, with BWL and BWU both equal to 200 (this is emulating part of a much larger code, and I am considering PDGBSV to obtain the factorization and then repeatedly apply PDGBTRS to get an important quantity, on the order of several million times). I can live with a costly factorization b/c I only do that part once, but other parts of the code need to use a larger number of processors.

My actual question is:

Is the scaling I'm seeing inherent to PDGBTRF or can I fix this by changing how I compile my code, or something else I haven't thought of? If you need more information about the environment I am using the code in, or the code itself, I'd be happy to provide them. What have others with more experience found when using PDGBTRF?

Thank you,

Steve