I have been trying PLASMA and compared its dgemm against the plain MKL BLAS dgemm. Unfortunately, PLASMA is slower than the reference.
To test this, I adapted testing/testing_dgemm.c and put a timer around the BLAS calls:
/* PLASMA DGEMM */
start = clock();
PLASMA_dgemm(PlasmaNoTrans, PlasmaNoTrans, M, N, K, alpha, A, LDA, B, LDB, beta, Cfinal, LDC);
end = clock();
printf("\n\nPLASMA took %f seconds.\n", ((double) end - (double) start)/CLOCKS_PER_SEC);
and similarly for the reference call:
start = clock();
CORE_dgemm(transA, transB, M, N, K, (alpha), A, LDA, B, LDB, (beta), Cref, LDC);
end = clock();
printf("\n\nCORE took %f seconds.\n", ((double) end - (double) start)/CLOCKS_PER_SEC);
(where start and end are of type clock_t from time.h).
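One caveat about this kind of timer: clock() reports the processor time used by the program, and on Linux that adds up the time of all threads in the process, so it need not match the elapsed time when a library runs multi-threaded. Below is a minimal sketch of a wall-clock variant next to the clock() one, assuming POSIX clock_gettime is available (older glibc versions may need -lrt when linking). The helper names wall_seconds, cpu_seconds, measure and placeholder_work are made up for illustration and are not part of PLASMA or MKL.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime from <time.h> */
#endif
#include <stdio.h>
#include <time.h>

/* Elapsed (wall-clock) seconds read from a monotonic POSIX clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

/* Processor seconds used by the whole process, as reported by clock(). */
static double cpu_seconds(void)
{
    return (double) clock() / CLOCKS_PER_SEC;
}

/* Placeholder workload (assumption): stands in for the dgemm call. */
static void placeholder_work(void)
{
    volatile double sum = 0.0;
    long i;
    for (i = 0; i < 1000000L; i++)
        sum += (double) i * 0.5;
}

/* Hypothetical helper: run fn once and record both kinds of time. */
static void measure(void (*fn)(void), double *wall, double *cpu)
{
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    fn();
    *cpu  = cpu_seconds()  - c0;
    *wall = wall_seconds() - w0;
}

/* Print both measurements with a label, in the style of the printf above. */
static void report(const char *label, void (*fn)(void))
{
    double wall, cpu;
    measure(fn, &wall, &cpu);
    printf("%s: wall %f seconds, cpu %f seconds.\n", label, wall, cpu);
}
```

In the test program, the body of placeholder_work would be replaced by the PLASMA_dgemm or CORE_dgemm call, and report("PLASMA", ...) would print both numbers side by side.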
The system is Linux with the Intel compilers version 11.1 and MKL 10.0.5. The machine is an 8-core Xeon (X5570 @ 2.93GHz).
To test the CORE_blas performance, I called:
./testing_dgemm_time 8 1 1 5000 5000 5000 5000 5000 5000
For the PLASMA test I set:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export GOTO_NUM_THREADS=1
before calling the program.
Timings were 23.69 seconds for the CORE BLAS and 36.18 seconds for the PLASMA BLAS. How can it be that the CORE BLAS outperforms the PLASMA BLAS?
