Selection on the block size on the performance

Postby sket16 » Tue Jul 05, 2011 5:52 pm

Dear Forum Users,

I am in the middle of tuning my C++ code to call ScaLAPACK and run efficiently on a cluster. I need to compute the inverse of a 10K x 10K matrix.

The ScaLAPACK Users' Guide (SLUG) suggests a block size of 64x64 and a processor grid of 10 x 10. The execution time to invert the matrix with PZGETRF + PZGETRI on the 10 x 10 grid is about 1.5 times the execution time on a 5 x 5 grid.

My question is: is a 64x64 block size still a good choice, given that the SLUG was written ten years ago? The SLUG suggests that the block size depends on the size at which local matrix multiplication becomes efficient. How could this number be identified? Is there some test script that could be run on a processor to determine the optimal block size?

Thank you for reading the post. Any comment is welcome.

Kevin
sket16
 
Posts: 8
Joined: Tue May 17, 2011 4:10 pm

Re: Selection on the block size on the performance

Postby Julien Langou » Wed Jul 06, 2011 4:23 am

Hi,

There is no support for autotuning in ScaLAPACK at this point. So what you have done is what you should have done. If using 25 procs is faster than 100, then you should use 25. In terms of things to try: 64x64 seems reasonable even for current architectures. But you might want to play with this. Increase up to 200x200, for example. The shape of the grid is important too. You have been using square grids. What about trying 2x50, for example?
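Since there is no autotuner, the search over grid shapes and block sizes has to be enumerated by hand. A minimal Python sketch of that enumeration follows; the helper names `grid_shapes` and `sweep` and the particular list of block sizes are assumptions made for illustration, not part of ScaLAPACK.

```python
from itertools import product

def grid_shapes(nprocs):
    """Return every (rows, cols) process grid with rows * cols == nprocs."""
    return [(r, nprocs // r) for r in range(1, nprocs + 1) if nprocs % r == 0]

def sweep(nprocs_options, block_sizes):
    """List every (nprow, npcol, nb) combination that could be timed."""
    return [(r, c, nb)
            for p in nprocs_options
            for (r, c), nb in product(grid_shapes(p), block_sizes)]

# Hypothetical candidates: 25 or 100 processes, a few block sizes around 64..256.
configs = sweep([25, 100], [64, 128, 200, 256])
for nprow, npcol, nb in configs:
    print(f"grid {nprow}x{npcol}, block {nb}x{nb}")
```

Each printed line is one run to time; in practice the list would be trimmed to the grid shapes the inversion code actually supports.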

Yep, the material in the SLUG is 10 years old. Maybe we should revise it a little ...

Julien
( Note: for a 10K-by-10K matrix, it's tough to scale on 100 procs ... )
Julien Langou
 
Posts: 735
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Re: Selection on the block size on the performance

Postby sket16 » Wed Jul 06, 2011 2:13 pm

Hi Julien,

Thank you very much for the reply. The SLUG is pleasant to read. So far, nothing in it has made me question its relevance to today's processors, except the block size.

I have only had the matrix inversion working on a square grid. I vaguely remember a post in the forum about getting the matrix inversion to work on a rectangular grid, which requires modifying the calculation of lwork, written in FORTRAN. I have little knowledge of FORTRAN, so I am trying to understand the effect of the block size before the effect of the processor grid.

A quick run of inverting a 10K x 10K matrix:
on 10x10 processors with a block size of 256x256 takes about the same time (5% extra) as on 5x5 with a block size of 64x64.
on 5x5 processors with a block size of 256x256 takes about 3/5 of the time taken on 5x5 with a block size of 64x64.
on 5x5 processors with a block size of 1024x1024 takes about 7/8 of the time taken on 5x5 with a block size of 64x64.

Would it be absurd to consider a block size of 1024x1024 for this problem?

I wonder if you could elaborate on the last point, that inverting a 10K x 10K matrix would not "scale well" on a 10x10 processor grid given a block size of 64x64. My understanding is that there will be a 1K x 1K submatrix associated with each processor. Given a block size of 64x64, there are roughly 15x15=225 blocks for each processor. The submatrix seems to be well loaded, i.e. it contains a good number of blocks. Would you mind elaborating?


Thank you.

Kevin

Re: Selection on the block size on the performance

Postby Julien Langou » Wed Jul 06, 2011 4:00 pm

sket16 wrote:
A quick run of inverting a 10K x 10K matrix:
on 10x10 processors with a block size of 256x256 takes about the same time (5% extra) as on 5x5 with a block size of 64x64.
on 5x5 processors with a block size of 256x256 takes about 3/5 of the time taken on 5x5 with a block size of 64x64.
on 5x5 processors with a block size of 1024x1024 takes about 7/8 of the time taken on 5x5 with a block size of 64x64.


Yep, that's it, you need to play a little here. Seems to me that 64x64 on the 10x10 grid might be worth an attempt.

sket16 wrote:
Would it be absurd to consider a block size of 1024x1024 for this problem?


On a 5x5 grid, with a 10Kx10K matrix and 1Kx1K blocks, each processor ends up with only two blocks in each dimension. Once half the matrix is processed, MPI processes start to become useless one after the other. Hopefully 7/8 of the work is in the first half! (The computation is very badly load balanced though ...) You can say that either you have too many processors or not a large enough matrix. Still better than 64x64.

So 64x64: too much communication. 1024x1024: not enough load balance. 256x256: seems about right.

sket16 wrote:
I wonder if you could elaborate on the last point, that inverting a 10K x 10K matrix would not "scale well" on a 10x10 processor grid given a block size of 64x64. My understanding is that there will be a 1K x 1K submatrix associated with each processor. Given a block size of 64x64, there are roughly 15x15=225 blocks for each processor. The submatrix seems to be well loaded, i.e. it contains a good number of blocks. Would you mind elaborating?
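A rough way to see this idling is to count, at each block step of the factorization, how many processes still own part of the trailing submatrix under a 2D block-cyclic layout. The Python sketch below is for illustration only: the function name `active_processes` is made up here, and "active" is simplified to mean "owns at least one trailing block", which ignores how unevenly the remaining work is spread.

```python
def active_processes(n, nb, nprow, npcol):
    """For each block step k, count processes owning part of the trailing submatrix.

    Block row i lives on process row i % nprow and block column j on process
    column j % npcol (block-cyclic layout, source process 0 assumed).
    """
    nblk = (n + nb - 1) // nb  # number of block rows/columns, last one may be partial
    counts = []
    for k in range(nblk):
        rows = {i % nprow for i in range(k, nblk)}
        cols = {j % npcol for j in range(k, nblk)}
        counts.append(len(rows) * len(cols))
    return counts

counts_1024 = active_processes(10000, 1024, 5, 5)
counts_64 = active_processes(10000, 64, 5, 5)
print("nb=1024:", counts_1024)
print("nb=64: all 25 active for", counts_64.count(25), "of", len(counts_64), "steps")
```

Under this simplified model, 1024x1024 blocks keep all 25 processes busy for the first six of ten block steps and then drop to 16, 9, 4 and 1, while 64x64 blocks keep all 25 busy for 153 of 157 steps.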


There are two competing requirements.
One, you want lots of blocks on each process so that the computation is well balanced. (You are totally correct.) (We sometimes call this the granularity.)
Two, you want large blocks. Small blocks mean more communication. Much more. At all levels. At the sequential level you do not get good performance. At the parallel level you have more communication.

So now, 256x256 seems reasonable in order to have good sequential performance and hide the latency of the network. But with a 10Kx10K matrix on 100 processors, you end up with only about 4 blocks per process in each dimension. It's really tough to get very good performance on a 10Kx10K matrix with 100 processors. Period. No matter what block size you take.
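The block counts quoted in this thread can be checked with a small Python port of the arithmetic in ScaLAPACK's NUMROC tool routine, which returns how many rows or columns of a block-cyclically distributed dimension land on a given process. This is a sketch for illustration; `blocks_per_process` is a hypothetical helper, and the source process is assumed to be 0.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows (or columns) of an n-long, nb-blocked dimension owned by process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of full blocks
    local = (nblocks // nprocs) * nb       # every process gets at least this much
    extra = nblocks % nprocs               # leftover full blocks
    if mydist < extra:
        local += nb                        # one more full block
    elif mydist == extra:
        local += n % nb                    # the trailing partial block
    return local

def blocks_per_process(n, nb, nprocs):
    """Min and max number of (possibly partial) blocks per process along one dimension."""
    counts = [-(-numroc(n, nb, p, 0, nprocs) // nb) for p in range(nprocs)]
    return min(counts), max(counts)

for nprocs, nb in [(10, 64), (10, 256), (5, 64), (5, 256), (5, 1024)]:
    lo, hi = blocks_per_process(10000, nb, nprocs)
    print(f"{nprocs} procs per dimension, nb={nb}: {lo} to {hi} blocks per process")
```

For the 10K matrix, a 10x10 grid with 256x256 blocks gives 4 blocks per process in each dimension and 15 or 16 with 64x64 blocks; a 5x5 grid with 1024x1024 blocks gives 2.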

Re: Selection on the block size on the performance

Postby sket16 » Thu Jul 07, 2011 11:20 am

Hi Julien,

With 10x10 processors and a block size of 64x64, the inversion takes about 60% longer than with 5x5 processors and a block size of 64x64.

It seems to me that a block size of 256x256 yields good performance on the cluster.
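To put the reported timings side by side, the Python sketch below collects the relative run times quoted in this thread (normalised to the 5x5 grid with 64x64 blocks) and also multiplies each by the process count as a crude measure of total processor time. The dictionary layout and the name `core_time` are illustrative choices, and the numbers are only the approximate ratios given in the posts.

```python
# (nprow, npcol, nb) -> run time relative to the 5x5 grid with 64x64 blocks
relative_time = {
    (5, 5, 64): 1.0,
    (5, 5, 256): 3 / 5,
    (5, 5, 1024): 7 / 8,
    (10, 10, 64): 1.6,
    (10, 10, 256): 1.05,
}

def core_time(config, t):
    """Relative run time multiplied by the number of processes used."""
    nprow, npcol, _ = config
    return nprow * npcol * t

# Fastest configuration first.
ranked = sorted(relative_time.items(), key=lambda kv: kv[1])
for (nprow, npcol, nb), t in ranked:
    print(f"{nprow}x{npcol} grid, nb={nb}: time {t:.3f}, "
          f"processes x time {core_time((nprow, npcol, nb), t):.1f}")
```

The 5x5 grid with 256x256 blocks comes out first on both measures.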

Thank you for the feedback.

Kevin

Re: Selection on the block size on the performance

Postby sket16 » Wed Jul 13, 2011 9:46 am

I just realized that a 256x256 block corresponds to about 1MB in memory, which "accidentally" is the size of the L3 cache of the cores. Could this be just a coincidence? Are there other hardware considerations that may affect the block size?
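The 1MB figure follows from the element size: PZGETRF works on double-precision complex data, which takes 16 bytes per element. A minimal Python sketch of the arithmetic is below; the helper name `block_bytes` and the dtype labels are assumptions for illustration.

```python
# Bytes per matrix element for the usual ScaLAPACK precisions.
BYTES_PER_ELEMENT = {"float32": 4, "float64": 8, "complex64": 8, "complex128": 16}

def block_bytes(nb, dtype="complex128"):
    """Memory footprint in bytes of one nb x nb block."""
    return nb * nb * BYTES_PER_ELEMENT[dtype]

for nb in (64, 256, 1024):
    print(f"nb={nb}: {block_bytes(nb) / 2**20} MiB per block")
```

A 256x256 double-complex block is exactly 1 MiB, a 64x64 block is 64 KiB, and a 1024x1024 block is 16 MiB.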


Kevin