A quick run of inverting a 10K x 10K matrix
with a 10x10 processor grid and a 256x256 block size takes about the same time (5% extra) as on a 5x5 grid with a 64x64 block size.
with a 5x5 processor grid and a 256x256 block size takes about 3/5 of the time as on a 5x5 grid with a 64x64 block size.
with a 5x5 processor grid and a 1024x1024 block size takes about 7/8 of the time as on a 5x5 grid with a 64x64 block size.
Yep, that's it; you need to play around a little here. It seems to me that 64x64 on a 10x10 grid might be worth an attempt.
Would a block size of 1024x1024 be absurd for this problem?
On a 5x5 grid with a 10Kx10K matrix and 1Kx1K blocks, each processor ends up with only two blocks in each dimension. Once half of the matrix is processed, the MPI nodes start to become useless one after the other. Hopefully 7/8 of the work is in the first half! (The computation is still very badly load balanced, though...) You could say that either you have too many processors or not a large enough matrix. Still better than 64x64.
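The 7/8 figure comes from the cube-law cost of the factorization: the work remaining after eliminating the first half of the columns scales like the cube of what is left. A minimal sketch, assuming a plain O(n^3) cost model for LU-based inversion (the helper name is mine, not from any library):

```python
# Sketch, assuming an O(n^3) cost model: the work remaining after
# eliminating k of n columns scales like (n - k)**3, so the fraction
# of total work completed by column k is 1 - ((n - k) / n)**3.
def completed_fraction(k, n):
    return 1 - ((n - k) / n) ** 3

# After processing the first half of a 10K x 10K matrix:
print(completed_fraction(5_000, 10_000))  # -> 0.875, i.e. 7/8 of the work
```

So even though processors start going idle halfway through, only 1/8 of the flops are left by then, which is why the run is "still better than 64x64".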
So 64x64: too much communication. 1024x1024: not enough load balance. 256x256: seems about right.
I wonder if you could elaborate on the last point, that inverting a 10Kx10K matrix would not "scale well" on a 10x10 processor grid given a block size of 64x64. My understanding is that there will be a 1Kx1K sub-matrix associated with each processor. Given a block size of 64x64, that is roughly 16x16 = 256 blocks per processor. The sub-matrix seems to be well loaded, i.e. there is a good number of blocks contained in it. Would you mind elaborating?
There are two trade-offs.
One, you want lots of blocks on each process so that the computation is well balanced. (You are totally correct.) (We sometimes call this the granularity.)
Two, you want large blocks. Small blocks mean more communication. Much more, at all levels: at the sequential level you do not get good performance, and at the parallel level you have more communication.
So now, 256x256 seems reasonable for getting good sequential performance and hiding the latency of the network. But with a matrix of size 10Kx10K, you end up with only about four blocks per process in each dimension. It's really tough to get very good performance on a 10Kx10K matrix with 100 processors, period, no matter what block size you take.
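The trade-off can be made concrete with a little arithmetic. This is an illustrative sketch of per-process block counts for a 10Kx10K matrix on a 10x10 grid under a 2-D block-cyclic layout (the helper is mine, not ScaLAPACK's actual distribution code):

```python
import math

# Sketch: per-dimension block counts for a 2-D block-cyclic layout of an
# n x n matrix on a p x p process grid with nb x nb blocks (illustrative
# arithmetic only, not a real distribution routine).
def blocks_per_process(n, p, nb):
    total = math.ceil(n / nb)     # block rows (or columns) in the matrix
    return math.ceil(total / p)   # block rows (or columns) per process

for nb in (64, 256, 1024):
    per_dim = blocks_per_process(10_000, 10, nb)
    print(f"{nb}x{nb}: ~{per_dim}x{per_dim} = {per_dim * per_dim} blocks per process")
```

With 64x64 blocks each process holds ~256 blocks (great balance, lots of small messages); with 256x256 only ~16; with 1024x1024 just one, so most processes go idle almost immediately.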