Hello again:
Attached is a simple test program (please uncompress it) that tries to clarify the behavior of a program mixing PLASMA and a threaded BLAS. It performs a Cholesky factorization via PLASMA and calls cblas_dgemm with dummy matrices. It also runs a #pragma omp parallel section. I use cblas_dgemm only as an example; any other CBLAS or LAPACK function not present in PLASMA could be used instead. The program sets the number of threads to 1 before initializing PLASMA and to 4 after PLASMA finalization. I'm using PLASMA 2.4.5 on a Debian GNU/Linux system with GCC 4.7.1. The processor is a 4-core Intel Core i5 2500 at 3.30 GHz, and I use 3000x3000 matrices.
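For readers without the attachment, here is a rough sketch of what the OpenMP part of such a test might look like. It is not the attached program: the helper name hello_from_threads is made up for illustration, and the _OPENMP guards are only there so the file also builds without -fopenmp.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not taken from the attached program):
 * request `requested` threads, run one parallel region in which every
 * thread prints a greeting, and return the size of the thread team. */
int hello_from_threads(int requested)
{
    int team_size = 1;              /* value kept when built without OpenMP */

#ifdef _OPENMP
    omp_set_num_threads(requested); /* applies to later parallel regions */
#else
    (void)requested;                /* avoid an unused-parameter warning */
#endif

#pragma omp parallel
    {
        int id = 0;
#ifdef _OPENMP
        id = omp_get_thread_num();
        /* exactly one thread records how many threads are in the team */
#pragma omp single
        team_size = omp_get_num_threads();
#endif
        printf("Hello from OpenMP thread number %d\n", id);
    }
    return team_size;
}
```

Calling hello_from_threads(4) on a 4-core machine would normally print four greeting lines. Their order is not fixed, because the threads run concurrently, which is why the "Hello" lines appear in a different order in each listing.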
In my tests I linked the program against OpenBLAS (the continuation of GotoBLAS, https://github.com/xianyi/OpenBLAS). One important feature of [Open|Goto]BLAS is that it can be compiled with the flag USE_OPENMP=0 or USE_OPENMP=1 (see Makefile.rule in the distribution), the latter so that the library can be used inside an OpenMP program. Here are the results.
In my first test I linked the program with OpenBLAS compiled with the option USE_OPENMP=0. If I manually set [GOTO|OMP|MKL]_NUM_THREADS to 1, the results are:
OMP_NUM_THREADS=1
GOTO_NUM_THREADS=1
MKL_NUM_THREADS=1
PLASMA_ executed in 0.276148 seconds
cblas_dgemm executed in 1.990249 seconds
Hello from OpenMP thread number 0
Hello from OpenMP thread number 3
Hello from OpenMP thread number 2
Hello from OpenMP thread number 1
In this case the omp_set_num_threads() calls inside the program do not affect the OpenBLAS functions because of USE_OPENMP=0, so the environment variables control the number of threads in both PLASMA_dpotrf and cblas_dgemm.
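The first three lines of each listing echo the thread-count environment variables. A minimal sketch of how a test program could print them, assuming plain getenv(); the helper names format_env_var and print_thread_env are hypothetical and not taken from the attachment:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write "NAME=value" into buf, or "NAME=(unset)"
 * when the variable is not defined. Returns the snprintf() result. */
int format_env_var(const char *name, char *buf, size_t size)
{
    const char *value = getenv(name);
    return snprintf(buf, size, "%s=%s", name, value ? value : "(unset)");
}

/* Print the three variables shown at the top of each listing. */
void print_thread_env(void)
{
    static const char *names[] = {
        "OMP_NUM_THREADS", "GOTO_NUM_THREADS", "MKL_NUM_THREADS"
    };
    char line[128];

    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i) {
        format_env_var(names[i], line, sizeof line);
        puts(line);
    }
}
```

Printing the variables next to the timings makes it easier to see which setting each run actually used.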
If I set the environment variables to 4 the results are:
OMP_NUM_THREADS=4
GOTO_NUM_THREADS=4
MKL_NUM_THREADS=4
PLASMA_ executed in 21.861205 seconds
cblas_dgemm executed in 0.566869 seconds
Hello from OpenMP thread number 3
Hello from OpenMP thread number 2
Hello from OpenMP thread number 0
Hello from OpenMP thread number 1
We can see that the PLASMA function is badly affected (21.86 s versus 0.28 s), whereas cblas_dgemm uses the 4 cores and its speedup with respect to the previous test is between 3x and 4x.
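To put numbers on this, the ratios can be computed directly from the timings reported above. This is only arithmetic on the printed values; the function names time_ratio and print_first_test_ratios are made up for illustration:

```c
#include <stdio.h>

/* Ratio between two wall-clock times given in seconds. */
double time_ratio(double numerator, double denominator)
{
    return numerator / denominator;
}

/* Timings copied from the two USE_OPENMP=0 runs reported in this post. */
void print_first_test_ratios(void)
{
    const double dgemm_1_thread   = 1.990249;   /* *_NUM_THREADS=1 */
    const double dgemm_4_threads  = 0.566869;   /* *_NUM_THREADS=4 */
    const double plasma_1_thread  = 0.276148;   /* *_NUM_THREADS=1 */
    const double plasma_4_threads = 21.861205;  /* *_NUM_THREADS=4 */

    printf("cblas_dgemm speedup with 4 threads: %.2fx\n",
           time_ratio(dgemm_1_thread, dgemm_4_threads));
    printf("PLASMA slowdown with 4 BLAS threads: %.1fx\n",
           time_ratio(plasma_4_threads, plasma_1_thread));
}
```

The dgemm speedup works out to roughly 3.5x, while the PLASMA factorization becomes roughly 79x slower when the BLAS underneath it is also allowed 4 threads.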
In the second test I used OpenBLAS compiled with USE_OPENMP=1. The results are:
OMP_NUM_THREADS=4
GOTO_NUM_THREADS=4
MKL_NUM_THREADS=4
PLASMA_ executed in 0.165748 seconds
cblas_dgemm executed in 1.987422 seconds
Hello from OpenMP thread number 0
Hello from OpenMP thread number 1
Hello from OpenMP thread number 2
Hello from OpenMP thread number 3
In this case OpenBLAS apparently honors the OpenMP omp_set_num_threads() function. omp_set_num_threads(1) was called prior to PLASMA_Init, and apparently the OpenBLAS used by PLASMA now uses only one thread. PLASMA_dpotrf is faster than in the first test with OpenBLAS USE_OPENMP=0 (0.166 s versus 0.276 s), so it seems clear that PLASMA's internal CBLAS calls are using only 1 thread. However, after PLASMA_Finalize, omp_set_num_threads(4) is called, yet cblas_dgemm() shows times as if it were using only one thread (1.99 s, as in the first test with the variables set to 1).
So my question is: could this be some buggy behavior in PLASMA_Finalize()? The OpenBLAS library compiled with USE_OPENMP=1 apparently detects omp_set_num_threads() correctly, as we can see from the PLASMA_dpotrf behavior. But I'm confused...
Cheers