
PDGEMM: Problems with extremely non square block sizes

Posted: Tue Jun 05, 2012 4:20 pm
by rreddy2001
A user is attempting a matrix multiply C = AB using PDGEMM in his application on an SGI ICE system. Apparently the same code runs fine on an IBM system.

The one unusual thing the user is doing is that the matrices and blocking factors are *extremely* non-square. To isolate the problem, I modified the PDGEMM example code from netlib to test this call independently of the rest of the application, using the data sizes that the user needs.

When I run some weak scaling tests the code crashes with:

Stack trace terminated abnormally.
*** glibc detected *** ./pblas: free(): invalid next size (normal): 0x00000000d7ccef60 ***
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Processor grid: 3 x 16, Col major order

Matrix A: 75,000,000 x 16, block-cyclic, block size: 25,000,000 x 1
Matrix B: 16 x 16, block-cyclic, block size: 1 x 1
Matrix C: 75,000,000 x 16, block-cyclic, block size: 25,000,000 x 1
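
For readers who want to check the layout above, here is a minimal sketch in Python of the local array sizes each process should own. The `numroc` function is a re-implementation of the logic of ScaLAPACK's NUMROC tool routine, not a call into the library. The `local_shape` helper, the `matrices` table, and the assumption that the distribution starts at process row 0 and column 0 are illustrative and not taken from the original test program.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows (or columns) of a block-cyclically distributed
    dimension of global size n, block size nb, owned by process iproc."""
    # Distance of this process from the process holding the first block.
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # blocks every process receives
    extra = nblocks % nprocs               # leftover whole blocks
    if mydist < extra:
        count += nb                        # one extra whole block
    elif mydist == extra:
        count += n % nb                    # the trailing partial block
    return count


NPROW, NPCOL = 3, 16                       # process grid from the post

# name: (global rows, global cols, row block size, col block size)
matrices = {
    "A": (75_000_000, 16, 25_000_000, 1),
    "B": (16, 16, 1, 1),
    "C": (75_000_000, 16, 25_000_000, 1),
}


def local_shape(name, myrow, mycol):
    """Local (rows, cols) on process (myrow, mycol).
    Assumes the first block lives on process row 0, column 0."""
    m, n, mb, nb = matrices[name]
    return (numroc(m, mb, myrow, 0, NPROW),
            numroc(n, nb, mycol, 0, NPCOL))


if __name__ == "__main__":
    for name in matrices:
        rows, cols = local_shape(name, 0, 0)
        # 8 bytes per double-precision element
        print(name, rows, "x", cols, "local,", rows * cols * 8, "bytes")
```

Under these assumptions every process holds a 25,000,000 x 1 local piece of A and of C (200,000,000 bytes each in double precision), and B is split into local pieces of 6 x 1 or 5 x 1. Comparing these numbers with the local leading dimensions and allocation sizes in the test program is one way to rule out a descriptor or allocation mismatch; it does not by itself explain the crash.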

The code crashes with the above error message when run with these parameters.

A weakly scaled-down version, obtained by changing the "16"s above to "8" or "4", runs to completion. (This reduces the number of matrix columns together with the number of processors in the column direction, so the per-process memory footprint stays the same, ignoring the change in the size of matrix B, which is very small.)

The code was verified to produce correct results with a very small test case.

Can anyone suggest what may be wrong? I believe square blocking factors are generally recommended. Could the extreme shape of the matrices and the blocking factors cause these problems? Are there any other suggestions for working around this problem?

The test case was created by downloading the PDGEMM example program and modifying the values to match what the user needs in his application. I can provide the code if it is going to help.

Thank you!