No improvement for PLASMA's dgemm

Post by Sebastiaan » Thu Jun 10, 2010 10:36 am

Dear forum,

I have been trying PLASMA and compared its dgemm with the regular MKL BLAS dgemm. Unfortunately, PLASMA is slower than the reference.

To test, I adapted testing/testing_dgemm.c and put a timer around the BLAS calls:
Code:
    /* PLASMA DGEMM */
    start = clock();
    PLASMA_dgemm(PlasmaNoTrans, PlasmaNoTrans, M, N, K, alpha, A, LDA, B, LDB, beta, Cfinal, LDC);
    end = clock();
    printf("\n\nPLASMA took %f seconds.\n",  ((double) end - (double) start)/CLOCKS_PER_SEC);

and similarly for the reference:
Code:
    start = clock();
    CORE_dgemm(transA, transB, M, N, K, (alpha), A, LDA, B, LDB, (beta), Cref, LDC);
    end = clock();
    printf("\n\nCORE took %f seconds.\n", ((double) end - (double) start)/CLOCKS_PER_SEC);

(where start and end are of type clock_t from time.h).

The system runs Linux with the Intel compilers version 11.1 and MKL 10.0.5. The machine is an 8-core Xeon (X5570 @ 2.93GHz).

To test the CORE_blas performance, I called:
Code:
./testing_dgemm_time 8 1 1 5000 5000 5000 5000 5000 5000


For the PLASMA test I set:
Code:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export GOTO_NUM_THREADS=1

before calling the program.

Timings were 23.69 seconds for the CORE BLAS and 36.18 seconds for the PLASMA BLAS. How can it be that the CORE BLAS outperforms the PLASMA BLAS?
Sebastiaan
 
Posts: 4
Joined: Thu Jun 10, 2010 10:23 am

Re: No improvement for PLASMA's dgemm

Post by mateo70 » Thu Jun 10, 2010 4:10 pm

Hi,

That is not the right way to measure the time. With clock() you get user + system CPU time summed over all 8 threads, so you have to divide by 8 to get an approximation of the elapsed time.
We use this function to measure the time:

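Code:
#include <stddef.h>    /* NULL */
#include <sys/time.h>  /* gettimeofday, struct timeval */

/* Wall-clock (elapsed) time in seconds. */
double wall_clock(void){
    struct timeval tp;
    gettimeofday( &tp, NULL );
    return tp.tv_sec + 1e-6 * tp.tv_usec;
}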


With the gettimeofday-based timer around PLASMA_dgemm, I get:
Code:
% ./time_dgemm --n_range=5000:5000:500
#   N NRHS threads seconds   Gflop/s Deviation
 5000    1     8     5.829     42.89      0.00


And with your clock()-based timer:
Code:
% ./time_dgemm --n_range=5000:5000:500
#   N NRHS threads seconds   Gflop/s Deviation
 5000    1     8    43.090      5.80      0.00


By the way, the sequential CORE_dgemm takes less than 8 times as long as PLASMA_dgemm on 8 threads, because PLASMA_dgemm has to convert the matrices to block data layout before applying the gemm, and then convert the result back to LAPACK layout before returning it to you.

Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: No improvement for PLASMA's dgemm

Post by Sebastiaan » Thu Jun 10, 2010 4:27 pm

Thank you, that explains a lot. I also noticed that the same multiplication takes ~3 seconds in Matlab, and the PLASMA time comes pretty close to that when I divide it by 8.

But for PLASMA: is it possible to separate the multiplication from the conversion to block layout?
Sebastiaan
 
Posts: 4
Joined: Thu Jun 10, 2010 10:23 am

Re: No improvement for PLASMA's dgemm

Post by mateo70 » Fri Jun 11, 2010 10:07 am

Sure you can, you need to use the expert interface:

PLASMA_Lapack_to_Tile() for each of the 3 matrices A, B and C.
PLASMA_dgemm_Tile() for the multiplication itself.
PLASMA_Tile_to_Lapack() to convert the result C back again.

Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: No improvement for PLASMA's dgemm

Post by Sebastiaan » Fri Jun 11, 2010 2:52 pm

Thank you!
Sebastiaan
 
Posts: 4
Joined: Thu Jun 10, 2010 10:23 am

Re: No improvement for PLASMA's dgemm

Post by luiceur » Fri Apr 19, 2013 9:02 am

Well, that is far from what I am getting:

./time_dgemm --threads=24 --n_range=1000:100000:1000
#
# PLASMA 2.4.6, ./time_dgemm
# Nb threads: 24
# NB: 128
# IB: 32
# eps: 1.110223e-16
#
#    M      N  K/NRHS  seconds  Gflop/s  Deviation
  1000   1000       1    0.012     0.17       0.00
  2000   2000       1    0.021     0.39       0.00
  3000   3000       1    0.036     0.49       0.00
  4000   4000       1    0.060     0.53       0.00
  5000   5000       1    0.092     0.54       0.00
  6000   6000       1    0.124     0.58       0.00
  7000   7000       1    0.169     0.58       0.00
  8000   8000       1    0.218     0.59       0.00
  9000   9000       1    0.271     0.60       0.00
 10000  10000       1    0.337     0.59       0.00
 11000  11000       1    0.415     0.58       0.00
 12000  12000       1    0.490     0.59       0.00
 13000  13000       1    0.578     0.58       0.00
 14000  14000       1    0.669     0.59       0.00
 15000  15000       1    0.777     0.58       0.00
 16000  16000       1    0.433     1.18       0.00
 17000  17000       1    0.982     0.59       0.00
 18000  18000       1    1.090     0.59       0.00
 19000  19000       1    1.213     0.60       0.00
 20000  20000       1    1.352     0.59       0.00
 21000  21000       1    1.526     0.58       0.00
 22000  22000       1    1.615     0.60       0.00
 23000  23000       1    1.759     0.60       0.00
 24000  24000       1    1.903     0.61       0.00
 25000  25000       1    2.155     0.58       0.00
 26000  26000       1    2.251     0.60       0.0
luiceur
 
Posts: 3
Joined: Fri Apr 19, 2013 4:52 am

Re: No improvement for PLASMA's dgemm

Post by admin » Mon Apr 22, 2013 11:16 am

The command line interface of time_dgemm is not the most intuitive.
Your K dimension was set to 1 in all cases.
You need to use the --nrhs=X option to set it to a more reasonable size.
Also, if your system is a NUMA system, using numactl --interleave=all usually improves performance.
Try the following call:

numactl --interleave=all ./time_dgemm --threads=24 --n_range=10000:10000:10000 --nrhs=10000
admin
Site Admin
 
Posts: 79
Joined: Wed May 13, 2009 1:27 pm
