performance/scalability for smaller dense linear solver

Open forum for general discussions relating to PLASMA.

Postby cn_cologne » Mon Aug 01, 2011 7:14 am

Hi,

I am writing a critical real-time application in which I need to solve
small dense unsymmetric linear systems of sizes 100x100 to 300x300 as
fast as possible. Using LAPACK's DGETRF with Intel's MKL BLAS showed
poor performance and scalability on a multi-core system. Can PLASMA
deliver better performance, i.e. better scalability, for problems this
small? Thanks for any help.

carsten
cn_cologne
 
Posts: 5
Joined: Mon Aug 01, 2011 5:17 am

Re: performance/scalability for smaller dense linear solver

Postby admin » Mon Aug 01, 2011 9:30 am

I seriously doubt it.
I think the problem size is simply too small.
But give it a shot.
Say, for the 300x300 problem, set the tile size to something small, e.g. 60 or 50.
Make sure to use PLASMA with static scheduling, not dynamic:

PLASMA_Set(PLASMA_SCHEDULING_MODE, PLASMA_STATIC_SCHEDULING);

Use only one socket, i.e., 4 to 6 cores.
Let us know what happens.
Good luck,
Jakub
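
To make the tile-size suggestion concrete, here is a small self-contained sketch of how many tiles a given tile size produces. It does not call PLASMA at all; the function names tiles_per_dim and total_tiles are made up for illustration, and the only assumption is that an n x n matrix is cut into square nb x nb tiles with a possibly partial last tile.

```c
#include <assert.h>

/* Number of tiles along one dimension of an n x n matrix cut into
   nb x nb tiles; the last tile may be only partially filled. */
int tiles_per_dim(int n, int nb)
{
    assert(n > 0 && nb > 0);
    return (n + nb - 1) / nb;   /* integer ceiling of n / nb */
}

/* Total number of tiles in the square tile grid. */
int total_tiles(int n, int nb)
{
    int t = tiles_per_dim(n, nb);
    return t * t;
}
```

For the 300x300 case, a tile size of 60 gives a 5x5 grid (25 tiles) and a tile size of 50 gives a 6x6 grid (36 tiles), so there is only a small pool of work to spread over 4 to 6 cores.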
admin
Site Admin
 
Posts: 79
Joined: Wed May 13, 2009 1:27 pm

Re: performance/scalability for smaller dense linear solver

Postby cn_cologne » Mon Aug 01, 2011 4:45 pm

Hi Jakub,

Thanks for the info. I'll give it a try and keep you
updated.

carsten


Re: performance/scalability for smaller dense linear solver

Postby cn_cologne » Tue Aug 23, 2011 12:12 pm

Hi Jakub,

I finally have a time slot to do some testing with PLASMA. I compared the execution
time for dense systems of different sizes with different solvers. The solvers are based
on LAPACK, the C++ Eigen library, a simple C++ Gaussian elimination algorithm with
partial pivoting, and PLASMA. The PLASMA code looks basically as follows:

PLASMA_Init(4);
PLASMA_Set(PLASMA_SCHEDULING_MODE, PLASMA_STATIC_SCHEDULING);
....
PLASMA_Alloc_Workspace_sgetrf_incpiv(...)
....
PLASMA_sgetrf_incpiv(.....)
PLASMA_sgetrs_incpiv(...)

The code is compiled with gcc 4.5.1 and the native optimization option -march=native.
The binary is then executed on an AMD Phenom II X6 system with six cores.
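
For readers who want a baseline of the "simple Gaussian elimination with partial pivoting" kind, a minimal sketch in plain C follows. This is not the code used for the measurements in this post (that was C++ and is not shown); the function name gewpp_solve, the row-major layout and the in-place interface are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value for float, written out to avoid depending on libm. */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* Solve A x = b by Gaussian elimination with partial pivoting.
   A is n x n in row-major order and is overwritten; b is overwritten
   with the solution x. Returns 0 on success, -1 on a zero pivot. */
int gewpp_solve(int n, float *A, float *b)
{
    assert(n > 0 && A != NULL && b != NULL);

    for (int k = 0; k < n; ++k) {
        /* Pick the row with the largest entry in column k as pivot row. */
        int p = k;
        float maxabs = absf(A[k * n + k]);
        for (int i = k + 1; i < n; ++i) {
            float v = absf(A[i * n + k]);
            if (v > maxabs) { maxabs = v; p = i; }
        }
        if (maxabs == 0.0f) return -1;   /* matrix is singular */

        /* Swap rows k and p; columns left of k are already zero. */
        if (p != k) {
            for (int j = k; j < n; ++j) {
                float t = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = t;
            }
            float t = b[k]; b[k] = b[p]; b[p] = t;
        }

        /* Eliminate column k from all rows below the pivot row. */
        for (int i = k + 1; i < n; ++i) {
            float m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }

    /* Back substitution on the upper triangular system. */
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j)
            s -= A[i * n + j] * b[j];
        b[i] = s / A[i * n + i];
    }
    return 0;
}
```

A loop like this is single-threaded and unblocked, which is one plausible reason such a solver can stay competitive at 64x64 and fall behind as the matrix grows.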

For single precision the execution times were:

Solved random matrix 64x64 256 times with LAPACK_LU in 1.500000e-02 sec. with rel_err = 1.561094e-05
Solved random matrix 64x64 256 times with Eigen_LU in 2.100000e-02 sec. with rel_err = 1.349157e-05
Solved random matrix 64x64 256 times with GEWPP in 1.700000e-02 sec. with rel_err = 1.560395e-05
Solved random matrix 64x64 256 times with PLASMA_solver in 1.770000e-01 sec. with rel_err = 1.561094e-05
Solved random matrix 128x128 128 times with LAPACK_LU in 5.900000e-02 sec. with rel_err = 6.562295e-02
Solved random matrix 128x128 128 times with Eigen_LU in 5.300000e-02 sec. with rel_err = 1.037400e-01
Solved random matrix 128x128 128 times with GEWPP in 5.600000e-02 sec. with rel_err = 6.562294e-02
Solved random matrix 128x128 128 times with PLASMA_solver in 1.410000e-01 sec. with rel_err = 6.562295e-02
Solved random matrix 256x256 64 times with LAPACK_LU in 1.720000e-01 sec. with rel_err = 1.361465e-02
Solved random matrix 256x256 64 times with Eigen_LU in 1.320000e-01 sec. with rel_err = 7.396387e-03
Solved random matrix 256x256 64 times with GEWPP in 2.390000e-01 sec. with rel_err = 1.361464e-02
Solved random matrix 256x256 64 times with PLASMA_solver in 2.530000e-01 sec. with rel_err = 1.361466e-02
Solved random matrix 512x512 1 times with LAPACK_LU in 2.100000e-02 sec. with rel_err = 8.295681e-01
Solved random matrix 512x512 1 times with Eigen_LU in 1.500000e-02 sec. with rel_err = 4.666448e+00
Solved random matrix 512x512 1 times with GEWPP in 3.100000e-02 sec. with rel_err = 8.295683e-01
Solved random matrix 512x512 1 times with PLASMA_solver in 1.600000e-02 sec. with rel_err = 8.295681e-01
Solved random matrix 1024x1024 1 times with LAPACK_LU in 1.550000e-01 sec. with rel_err = 1.845431e-01
Solved random matrix 1024x1024 1 times with Eigen_LU in 1.290000e-01 sec. with rel_err = 6.221509e-02
Solved random matrix 1024x1024 1 times with GEWPP in 3.990000e-01 sec. with rel_err = 1.845433e-01
Solved random matrix 1024x1024 1 times with PLASMA_solver in 7.700000e-02 sec. with rel_err = 1.845431e-01
Solved random matrix 2048x2048 1 times with LAPACK_LU in 1.231000e+00 sec. with rel_err = 2.718243e-01
Solved random matrix 2048x2048 1 times with Eigen_LU in 8.650000e-01 sec. with rel_err = 1.101133e-01
Solved random matrix 2048x2048 1 times with GEWPP in 3.751000e+00 sec. with rel_err = 2.718244e-01
Solved random matrix 2048x2048 1 times with PLASMA_solver in 4.620000e-01 sec. with rel_err = 2.718245e-01

I would guess the poor results for the smaller systems come from the overhead of PLASMA's scheduler and the copying
into the tile-based format. This effect is probably amplified because the smaller systems are solved several times. What
puzzles me is that even for larger systems, where the overhead should be small, the performance does not
scale well. Any idea what I might be doing wrong?
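
Since the small systems are solved many times per measurement, it helps to normalize the numbers above before comparing them. The following is a small sketch of that arithmetic; the helper names seconds_per_solve and speedup are invented for this example and are not part of any of the libraries tested.

```c
#include <assert.h>

/* Average time of a single solve when the same system size was
   solved `repetitions` times in `total_seconds`. */
double seconds_per_solve(double total_seconds, int repetitions)
{
    assert(repetitions > 0);
    return total_seconds / (double)repetitions;
}

/* Speedup of a solver taking t seconds relative to a reference
   solver taking t_ref seconds for the same work. */
double speedup(double t_ref, double t)
{
    assert(t > 0.0);
    return t_ref / t;
}
```

With the figures listed above, the 64x64 PLASMA run (1.77e-01 s for 256 solves) comes to roughly 0.7 ms per solve, and at 2048x2048 the ratio of the LAPACK time to the PLASMA time (1.231 s vs. 0.462 s) is roughly 2.7 with PLASMA_Init(4).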

thanks,
Carsten

Re: performance/scalability for smaller dense linear solver

Postby cn_cologne » Thu Aug 25, 2011 2:44 am

I performed some additional tests using the provided timing test code time_sgetrf_incpiv.c. To
get comparable results I made an additional version in which I replaced PLASMA_sgetrf_incpiv(..)
with the original LAPACK routine sgetrf_(...). Generally, I think it would be helpful to also have
the original LAPACK routines in the timing code so the performance can be compared directly. The
results on my system were:

1. PLASMA_sgetrf_incpiv(..)

./time_sgetrf_incpiv --n_range=256:2048:256 --threads=4 --atun --niter=10
#    N  NRHS  threads  seconds  Gflop/s  Deviation
   256     1        4    0.005     2.98       0.76
   512     1        4    0.012     7.73       0.10
   768     1        4    0.031     9.67       0.04
  1024     1        4    0.063    11.39       0.10
  1280     1        4    0.122    11.42       0.04
  1536     1        4    0.203    11.91       0.04
  1792     1        4    0.323    11.85       0.03
  2048     1        4    0.466    12.28       0.02

2. sgetrf_(...)

./time_sgetrf_incpiv --n_range=256:2048:256 --threads=4 --atun --niter=10
#    N  NRHS  threads  seconds  Gflop/s  Deviation
   256     1        4    0.003     3.87       0.06
   512     1        4    0.021     4.29       0.05
   768     1        4    0.062     4.84       0.06
  1024     1        4    0.152     4.72       0.04
  1280     1        4    0.268     5.21       0.05
  1536     1        4    0.470     5.14       0.05
  1792     1        4    0.726     5.28       0.02
  2048     1        4    1.292     4.43       0.01

These performance results are more or less consistent with the results in
the previous post. Jakub, any hints to speed things up for the smaller problems?
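
As a cross-check of the Gflop/s columns, the rate can be approximated from N and the elapsed time. The sketch below assumes the leading-order operation count of 2/3 N^3 for an LU factorization; the timing code may use a more exact count, so small differences are expected. The names lu_flops and gflops are made up for the example.

```c
#include <assert.h>

/* Leading-order floating point operation count of an LU
   factorization of an n x n matrix: (2/3) * n^3. */
double lu_flops(int n)
{
    assert(n > 0);
    double d = (double)n;   /* avoid int overflow in n*n*n */
    return (2.0 / 3.0) * d * d * d;
}

/* Achieved rate in Gflop/s for one factorization taking `seconds`. */
double gflops(int n, double seconds)
{
    assert(seconds > 0.0);
    return lu_flops(n) / seconds / 1.0e9;
}
```

For N = 2048 this gives about 12.3 Gflop/s at 0.466 s and about 4.4 Gflop/s at 1.292 s, in line with the last rows of the two tables. For the smallest sizes the printed seconds are rounded too coarsely to reproduce the rate this way.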

thanks,
carsten

