Control number of threads in a mix program PLASMA+BLAS

Postby jgpallero » Mon Jul 30, 2012 6:39 am

Hello:

Imagine I have a program in which I perform a LAPACK operation with PLASMA and afterwards I need to call a BLAS function (DGEMM from GotoBLAS, for example). Before running the program I set:

GOTO_NUM_THREADS=1
OMP_NUM_THREADS=1

These settings make PLASMA run efficiently, but the DGEMM function will then use only one thread. How can I set the maximum number of threads from inside the program (between the PLASMA call and the BLAS call) so that DGEMM also runs efficiently?

Cheers

NOTE: GotoBLAS has been discontinued at TACC, and for the past few months the project has been continued under the name OpenBLAS (https://github.com/xianyi/OpenBLAS). The PLASMA manual still refers to the library as Goto, so I think this should be changed to OpenBLAS.
jgpallero
 
Posts: 29
Joined: Sat Jul 28, 2012 12:12 pm

Re: Control number of threads in a mix program PLASMA+BLAS

Postby dobson156 » Mon Jul 30, 2012 9:28 am

I have no experience with GotoBLAS but you can try this until someone with more experience comes along.

You should finalize PLASMA once you have finished with it; this will remove any affinity the threads have with the cores. Then you can use the OpenMP runtime library function to set the number of threads you require.

Code: Select all
PLASMA_Finalize(); //finalize plasma
omp_set_num_threads(8); //or however many
//use Goto blas


Here is where my knowledge of GotoBLAS gets a bit threadbare:
According to this (http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq/), `GOTO_NUM_THREADS` is the same as `OMP_NUM_THREADS` except for priority, so calling the OpenMP runtime library function should overrule them both (it changes the corresponding ICV, i.e. internal control variable).

Don't forget to reverse the procedure if you wish to use PLASMA again (re-initialize PLASMA and set the number of threads back to 1).

It is also probably worthwhile to read this thread and to make sure you are using the latest version of PLASMA.
http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=118

Depending on your problem size, initialising and finalising PLASMA might carry a significant cost. It is probably easier to stay inside PLASMA for all your functions where possible (there is a PLASMA_dgemm, along with other level 3 BLAS routines), especially if you convert your data to tile-major layout.

NOTE: admins feel free to delete this if you feel any of it is inaccurate.
dobson156
 
Posts: 11
Joined: Thu Jun 21, 2012 12:50 pm

Re: Control number of threads in a mix program PLASMA+BLAS

Postby mateo70 » Mon Jul 30, 2012 11:28 am

Hello,

Thanks for the note about GotoBLAS. We know about OpenBLAS but forgot to update the documentation. We will change that for the next release.
Concerning the number-of-threads problem, I would recommend looking at the other threads that already discuss it. The main idea is that PLASMA binds its own threads to cores to get better performance. If OpenMP, Goto, or any other library creates threads after the PLASMA_Init call, they inherit the properties of the main thread (bound to core 0). The solution is therefore to create any other threads before calling PLASMA_Init, for example by making an OpenMP call that uses the maximal number of threads. Those threads stay alive in the background while you call PLASMA with omp_set_num_threads(1), and they are reused when you call omp_set_num_threads(X) after the PLASMA call.
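
A minimal sketch of that ordering, under assumptions: warm_up_openmp_pool() is a hypothetical helper name (not a PLASMA or OpenMP function), the PLASMA and BLAS calls appear only as comments because they need the real libraries, and whether the worker threads really stay alive between parallel regions depends on the OpenMP runtime in use.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: runs an empty parallel region with `nthreads` threads
 * so that the OpenMP runtime creates its worker threads right now, before
 * any other library has bound the main thread to a core.
 * Returns how many threads took part (1 when built without OpenMP). */
int warm_up_openmp_pool(int nthreads)
{
    int started = 0;

    if (nthreads < 1)
        nthreads = 1;
#ifdef _OPENMP
    omp_set_num_threads(nthreads);
    #pragma omp parallel reduction(+:started)
#endif
    {
        started += 1;   /* each participating thread counts itself */
    }
    return started;
}

/* Intended call order in the application (library calls left as comments):
 *
 *   warm_up_openmp_pool(4);     // create the threads before PLASMA_Init
 *   omp_set_num_threads(1);     // sequential BLAS inside PLASMA
 *   PLASMA_Init(0);
 *   ... PLASMA computations ...
 *   omp_set_num_threads(4);     // hopefully reuses the threads created above
 *   ... threaded BLAS call, e.g. cblas_dgemm ...
 */
```

The return value is only there to check that the region really ran in parallel; it is not needed for the warm-up itself.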

In your case I would still recommend using the PLASMA level 3 BLAS, as you can interleave those functions with the previous computations thanks to the asynchronous interface.

Cheers,
Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: Control number of threads in a mix program PLASMA+BLAS

Postby jgpallero » Mon Jul 30, 2012 12:28 pm

Hello:

I'm trying to do:
Code: Select all
omp_set_num_threads(1);
PLASMA_Init(0);
//some PLASMA computations...
PLASMA_Finalize();
omp_set_num_threads(4);
//other parallel computations


But PLASMA behaves as if omp_set_num_threads(1) had not been called. I have not defined GOTO_NUM_THREADS, OMP_NUM_THREADS, or MKL_NUM_THREADS, so the system will use the maximum number of threads permitted. I think omp_set_num_threads(1); should have priority over the environment variables. Is this assumption correct? Is my code wrong?

I'm using the latest PLASMA version, 2.4.5, under Debian GNU/Linux with GCC 4.7.1.

Re: Control number of threads in a mix program PLASMA+BLAS

Postby jgpallero » Tue Jul 31, 2012 11:33 am

Hello again:

Attached is a simple test program (uncompress it, please) that tries to clarify the behavior of a mixed PLASMA + threaded BLAS program. It performs a Cholesky factorization via PLASMA and calls cblas_dgemm with dummy matrices. It also runs a #pragma omp parallel section. I use cblas_dgemm only as an example; any other CBLAS or LAPACK function not present in PLASMA could be used. The program sets the number of threads to 1 before initializing PLASMA and to 4 after finalizing PLASMA. I'm using PLASMA 2.4.5 on a Debian GNU/Linux system with GCC 4.7.1. The processor is a 4-core Intel Core i5 2500 at 3.30 GHz, and I use 3000x3000 matrices.

In my tests I linked the program against OpenBLAS (the continuation of GotoBLAS, https://github.com/xianyi/OpenBLAS). One important point about [Open|Goto]BLAS is that it can be compiled with the flag USE_OPENMP=0 or USE_OPENMP=1 (see Makefile.rule in the distribution); the latter is meant for using the library in an OpenMP program. The results follow.

In my first test I linked the program with OpenBLAS compiled with the option USE_OPENMP=0. If I manually set [GOTO|OMP|MKL]_NUM_THREADS to 1, the results are:
Code: Select all
OMP_NUM_THREADS=1
GOTO_NUM_THREADS=1
MKL_NUM_THREADS=1
PLASMA_ executed in 0.276148 seconds
cblas_dgemm executed in 1.990249 seconds
Hello from OpenMP thread number 0
Hello from OpenMP thread number 3
Hello from OpenMP thread number 2
Hello from OpenMP thread number 1

In this case the omp_set_num_threads() calls inside the program do not affect the OpenBLAS functions because of USE_OPENMP=0, so the environment variables control the number of threads in both PLASMA_dpotrf and cblas_dgemm.
If I set the environment variables to 4, the results are:
Code: Select all
OMP_NUM_THREADS=4
GOTO_NUM_THREADS=4
MKL_NUM_THREADS=4
PLASMA_ executed in 21.861205 seconds
cblas_dgemm executed in 0.566869 seconds
Hello from OpenMP thread number 3
Hello from OpenMP thread number 2
Hello from OpenMP thread number 0
Hello from OpenMP thread number 1

We can see that the PLASMA function is badly affected (21.86 s instead of 0.28 s), whereas cblas_dgemm uses the 4 cores and its speedup with respect to the previous test is between 3x and 4x.
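
For reference, the ratios behind these two observations can be worked out from the timings printed above. A small sketch (time_ratio and print_ratios are just illustrative helper names):

```c
#include <stdio.h>

/* Ratio between a slower and a faster wall-clock time ("times faster"). */
double time_ratio(double t_slow, double t_fast)
{
    return t_slow / t_fast;
}

/* Prints the ratios for the timings reported in the two runs above. */
void print_ratios(void)
{
    /* cblas_dgemm: 1 thread (1.990249 s) versus 4 threads (0.566869 s) */
    printf("dgemm speedup with 4 threads: %.2fx\n",
           time_ratio(1.990249, 0.566869));
    /* PLASMA: BLAS set to 4 threads (21.861205 s) versus 1 thread (0.276148 s) */
    printf("PLASMA slowdown with 4 BLAS threads: %.1fx\n",
           time_ratio(21.861205, 0.276148));
}
```

That gives roughly 3.5x for cblas_dgemm on 4 cores, and a slowdown of roughly 79x for the PLASMA call when the underlying BLAS is also allowed 4 threads.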

In the second test I used OpenBLAS compiled with USE_OPENMP=1. The results are:
Code: Select all
OMP_NUM_THREADS=4
GOTO_NUM_THREADS=4
MKL_NUM_THREADS=4
PLASMA_ executed in 0.165748 seconds
cblas_dgemm executed in 1.987422 seconds
Hello from OpenMP thread number 0
Hello from OpenMP thread number 1
Hello from OpenMP thread number 2
Hello from OpenMP thread number 3

In this case OpenBLAS apparently honors the OpenMP omp_set_num_threads() function. omp_set_num_threads(1) was called before PLASMA_Init, and apparently the OpenBLAS used by PLASMA now runs with only one thread. The PLASMA_dpotrf time is better than in the first test with OpenBLAS USE_OPENMP=0, so it is clear that PLASMA's internal CBLAS calls are using only 1 thread. However, after PLASMA_Finalize the function omp_set_num_threads(4) is called, yet cblas_dgemm() shows times as if it were using only one thread (as in the first test).

So my question is: could PLASMA_Finalize() have some buggy behavior? Apparently the OpenBLAS library compiled with USE_OPENMP=1 correctly picks up omp_set_num_threads(), as the PLASMA_dpotrf behavior shows. But I'm confused...

Cheers
Attachments
test_thread_plasma.c.tar.gz
gcc -fopenmp -O2 test_thread_plasma.c -o test_thread_plasma -lplasma -lquark -lcoreblas -lpthread -lopenblas -llapack -llapacke

Re: Control number of threads in a mix program PLASMA+BLAS

Postby jgpallero » Thu Aug 02, 2012 6:22 am

Hello again:

I managed to write a pair of functions that solve the problem of setting the correct number of threads before and after using PLASMA, without explicitly setting the environment variables ([OMP|GOTO|MKL]_NUM_THREADS) to 1. First of all, I have two macros to initialize and finalize PLASMA:
Code: Select all
#define MY_PLASMA_INIT() \
MyplasmaSetNumThreadsToOne(); \
PLASMA_Init(0)

and
Code: Select all
#define MY_PLASMA_SHUTDOWN() \
PLASMA_Finalize(); \
MyplasmaRestoreNumThreads()


Setting and restoring the number of threads is done by two functions. The first one is void MyplasmaSetNumThreadsToOne(void):
Code: Select all
void MyplasmaSetNumThreadsToOne(void)
{
    omp_set_num_threads(1);
    goto_set_num_threads(1);
    mkl_set_num_threads(1);
    return;
}

The function sets the number of threads to one for the generic OpenMP environment (omp_set_num_threads), for the OpenBLAS/GotoBLAS library (goto_set_num_threads) and for Intel MKL (mkl_set_num_threads). OpenBLAS/GotoBLAS and MKL can select the number of threads dynamically, so we should use their functions to set them to one. I don't know whether ACML or SUNPERF have equivalent functions or can use the generic OpenMP function. Does anyone know? In the case of ATLAS, the non-threaded version should be used with PLASMA, because this library cannot select the number of threads at run time. My function could be improved by using preprocessor defines to decide whether to include the MKL or OpenBLAS calls.
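
The preprocessor idea mentioned at the end of the paragraph above could look like the following sketch. HAVE_OPENBLAS and HAVE_MKL are hypothetical macros that the build system would have to define (no library provides them), the function name is changed to make clear it is a variant, and it returns a count only so the result can be checked. The goto_set_num_threads prototype is declared by hand here on the assumption that it matches the one used in the code above.

```c
/* _OPENMP is defined by the compiler itself when OpenMP is enabled.
 * HAVE_OPENBLAS and HAVE_MKL are hypothetical build-time switches. */
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef HAVE_OPENBLAS
void goto_set_num_threads(int num_threads);  /* assumed OpenBLAS/GotoBLAS entry point */
#endif
#ifdef HAVE_MKL
#include <mkl.h>
#endif

/* Sets every threading layer that was compiled in to one thread.
 * Returns the number of layers that were actually configured. */
int MyplasmaSetNumThreadsToOnePortable(void)
{
    int configured = 0;

#ifdef _OPENMP
    omp_set_num_threads(1);
    configured++;
#endif
#ifdef HAVE_OPENBLAS
    goto_set_num_threads(1);
    configured++;
#endif
#ifdef HAVE_MKL
    mkl_set_num_threads(1);
    configured++;
#endif
    return configured;
}
```

Built with, for example, -fopenmp -DHAVE_OPENBLAS, only the OpenMP and OpenBLAS calls are compiled, so the program no longer needs MKL at link time.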

The second function is void MyplasmaRestoreNumThreads(void) and restores the number of threads after PLASMA finalization:
Code: Select all
void MyplasmaRestoreNumThreads(void)
{
    //global maximum number of threads
    int hilosMax=omp_get_num_procs();
    int defOMP=0,defOPEN=0,defMKL=0;
    char* valOMP=NULL;
    char* valOPEN=NULL;
    char* valMKL=NULL;
    int varOMP=0,varOPEN=0,varMKL=0;

    //try to extract the environment variables
    valOMP = getenv("OMP_NUM_THREADS");
    valOPEN = getenv("GOTO_NUM_THREADS");
    valMKL = getenv("MKL_NUM_THREADS");
    //test if environment variables exist
    if(valOMP!=NULL)
    {
        defOMP = 1;
    }
    if(valOPEN!=NULL)
    {
        defOPEN = 1;
    }
    if(valMKL!=NULL)
    {
        defMKL = 1;
    }
    //generic OpenMP (a thread count must be strictly positive, and atoi
    //returns 0 for an empty or non-numeric value, hence the >0 checks)
    if(defOMP)
    {
        varOMP = atoi(valOMP);
        if(varOMP>0)
        {
            omp_set_num_threads(varOMP);
        }
        else
        {
            omp_set_num_threads(hilosMax);
        }
    }
    else
    {
        omp_set_num_threads(hilosMax);
    }
    //for OpenBLAS: GOTO_NUM_THREADS if defined, else the OMP_NUM_THREADS
    //value, else the number of processors
    if(defOPEN)
    {
        varOPEN = atoi(valOPEN);
        if(varOPEN>0)
        {
            goto_set_num_threads(varOPEN);
        }
        else
        {
            goto_set_num_threads(hilosMax);
        }
    }
    else
    {
        //varOMP is still 0 here when OMP_NUM_THREADS is not defined
        if(varOMP>0)
        {
            goto_set_num_threads(varOMP);
        }
        else
        {
            goto_set_num_threads(hilosMax);
        }
    }
    //for MKL: same fallback chain as for OpenBLAS
    if(defMKL)
    {
        varMKL = atoi(valMKL);
        if(varMKL>0)
        {
            mkl_set_num_threads(varMKL);
        }
        else
        {
            mkl_set_num_threads(hilosMax);
        }
    }
    else
    {
        if(varOMP>0)
        {
            mkl_set_num_threads(varOMP);
        }
        else
        {
            mkl_set_num_threads(hilosMax);
        }
    }

    return;
}

The function checks whether the environment variables exist and, if so, applies the values stored in them. If they are not defined, it sets the maximum number of threads available on the machine.
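
The restore function above repeats the same "read variable, validate, fall back" pattern three times. A possible way to factor it out is sketched below; parse_thread_count and threads_from_env are hypothetical helper names, and strtol is swapped in for atoi so that an empty or non-numeric value can be told apart from a real number. A thread count has to be strictly positive, so the sketch rejects 0 as well.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parses a thread count from `text`. Returns `fallback` when the text is
 * NULL, empty, not a number, out of range, or not strictly positive.
 * Trailing characters (such as the ",2" in a nested list "4,2") are ignored. */
int parse_thread_count(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    if (text == NULL)
        return fallback;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno == ERANGE || value < 1 || value > INT_MAX)
        return fallback;

    return (int)value;
}

/* Same as above, but reads the text from environment variable `name`. */
int threads_from_env(const char *name, int fallback)
{
    return parse_thread_count(getenv(name), fallback);
}

/* Possible use inside the restore function (library calls as comments):
 *
 *   int nOMP = threads_from_env("OMP_NUM_THREADS", omp_get_num_procs());
 *   omp_set_num_threads(nOMP);
 *   goto_set_num_threads(threads_from_env("GOTO_NUM_THREADS", nOMP));
 *   mkl_set_num_threads(threads_from_env("MKL_NUM_THREADS", nOMP));
 */
```

Passing the OpenMP result as the fallback for the other two keeps the same priority order as the original function: the library-specific variable first, then OMP_NUM_THREADS, then the number of processors.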

Cheers

