PLASMA 2.4.0 and OpenMP compatibility issue?

Open forum for general discussions relating to PLASMA.

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Tue Jul 05, 2011 3:29 pm

Mathieu,

Correct. I'm not using BLAS, LAPACK or ATLAS anywhere else in this application.

Stuart
Last edited by srdegraaf on Tue Jul 05, 2011 3:34 pm, edited 1 time in total.
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Tue Jul 05, 2011 3:31 pm

Mathieu,

Thanks for your continuing support. I hope you can duplicate the problem.

Stuart
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Wed Jul 06, 2011 6:20 am

Stuart,

I cannot reproduce the problem: the part of the code using OpenMP prints the correct number of threads and apparently uses the correct number of threads.
I added some prints in the for loop to see which thread executes each iteration of the loop, and all threads appear.

Mathieu
mateo70
 
Posts: 95
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Wed Jul 06, 2011 9:01 am

Mathieu,

I was afraid you were going to say that... Thanks for testing! I tried stripping some of the unused baggage out of my makefile (GSL FFTs, Toeplitz and filtering stuff that supports other code in my build tree) just in case one of these unused things was causing the problem, but it made no difference. I'll keep looking at my code/makefiles and try different problem sizes to see if I can figure out what is happening. Out of curiosity, are you running Linux and/or gcc? If so, what versions? Any reason to think it would matter?

Thanks,

Stuart
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Wed Jul 06, 2011 9:17 am

Stuart,

I'm running Linux (Ubuntu 11.04) with gcc/gfortran (4.5.2) and GotoBLAS2, but I don't think it makes a difference. I will take another look at a similar problem I had before with MKL, to see if I still have it and whether it is somehow related.

Mathieu
mateo70
 
Posts: 95
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby uhle89 » Wed Jul 06, 2011 10:59 am

Hi to all,
we can confirm the behaviour found by srdegraaf, since I also combine OpenMP and PLASMA 2.4.0.
(Fedora12+atlas+gcc-4.4.4+OMP3.0)
We use both within a field simulation program employing boundary elements.
In our case in a single process
1.) runs the OpenMP accellerated calculation of the elements of the dense matrix
2.) runs the LU-factorization by means of PLASMA_dgetrf() and evtl. PLASMA_dgetrs()
3.) runs the OpenMP accellerated evaluation of the solution.

My observation: 1)+2) run in parrallel, 3) in just one thread.
If I choose LAPACK for 2.) 1)+3) run in parallel.

I have put omp_get_num_procs() before and after 2.) and surprisingly
I get 16 before and 1 after. Thus PLASMA eats cpus ;)
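
A minimal standalone reproducer for this check might look like the sketch below (illustrative only: it assumes the PLASMA 2.4.0 headers and libraries are available, and the thread count of 4 is arbitrary):

Code: Select all
#include <stdio.h>
#include <omp.h>
#include <plasma.h>

int main(void)
{
  /* Processor count as OpenMP sees it, before PLASMA touches thread affinity */
  printf("omp_get_num_procs() before: %d\n",omp_get_num_procs());

  PLASMA_Init(4);      /* spawns and pins the PLASMA worker threads */
  PLASMA_Finalize();   /* shuts them down again */

  /* With the bug present, this prints 1 instead of the real processor count */
  printf("omp_get_num_procs() after:  %d\n",omp_get_num_procs());
  return 0;
}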
regards, Stephan
uhle89
 
Posts: 6
Joined: Wed Jul 06, 2011 10:30 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Wed Jul 06, 2011 5:18 pm

Mathieu,

I ran an experiment comparing the OpenMP thread info printed out when using LAPACK (serial) versus PLASMA (parallel) for the initial linear algebra part of my algorithm. Within my #pragma omp parallel for loop I printed out several variables: the loop index (nx), the operating system's notion of the current thread (OSthread) as reported by syscall(SYS_gettid), OpenMP's notion of the current thread (OMPthread) as reported by omp_get_thread_num(), as well as OpenMP's report of the number of threads (OMPnumthreads) and the maximum available number of threads (OMPmaxthreads) at each point in the loop. Specifically, I was looking to see whether there was a difference in the behaviour of the OSthread and OMPthread variables.

What I expected to see with PLASMA (where OpenMP fails) was that there would be 24 different values of OMPthread in use, but only one value of OSthread. Further, since things are apparently running "sequentially", I expected to see the print statements come out in some kind of sequential order. This did NOT happen. I see 24 different OSthread values and a "random" interleaving of the print statements. All of this suggests, to me, that parallelism is actually happening, as it should. (Below are snippets of the printouts for both the PLASMA and LAPACK "experiments".) HOWEVER, THESE PRINTOUT LINES DO NOT COME OUT IN A STEADY STREAM. INSTEAD, THEY COME IN BURSTS, AS IF THE LINUX SCHEDULER IS ONLY ALLOWING ONE PHYSICAL CORE TO BE USED AT A TIME. OpenMP and the OS both seem to think that 24 cores/threads are available, yet gkrellm/top shows that only one is actually being used. Interestingly, the core that is busy does not seem to hop around amongst the CPUs shown in the gkrellm display as it often does when running a single-threaded application.

When I do this using LAPACK (where OpenMP works), while I also see 24 values of OSthread and OMPthread being used (also shown below) and "random" ordering of the lines, the printout comes out in a steady stream, and gkrellm/top shows that all 24 cores/threads are being used fully.

My knowledge of how all this works is (obviously) limited. However, I've shown these results to a colleague who is quite knowledgeable in these matters, and he suspects that the PLASMA_Finalize() routine is somehow causing the Linux scheduler to restrict the number of available physical cores to one. (God knows how.)
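
One way to test that suspicion directly, without relying on gkrellm/top, would be a small Linux-specific helper along these lines (a sketch, not part of the program above; the PLASMA calls are elided):

Code: Select all
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
  Print how many CPUs the calling thread is currently allowed to run on.
  Calling this just before PLASMA_Init() and just after PLASMA_Finalize()
  would show whether the affinity mask has been narrowed to a single core.
*/
static void print_allowed_cpus(const char *when)
{
  cpu_set_t mask;
  if(sched_getaffinity(0,sizeof(mask),&mask)==0)
    fprintf(stderr,"%s: allowed CPUs = %d\n",when,CPU_COUNT(&mask));
}

int main(void)
{
  print_allowed_cpus("at startup");
  /* ... PLASMA_Init(nthreads); PLASMA_dgels(...); PLASMA_Finalize(); ... */
  print_allowed_cpus("after PLASMA_Finalize()");
  return 0;
}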

I hope these clues, together with the corroborating "testimony" of Stephan/uhle89, help you to discover the underlying problem. Is it possible that this only happens in conjunction with using ATLAS CBLAS? (Again, God knows why.) You weren't able to duplicate the problem, but then perhaps you didn't use ATLAS.
For what it's worth, I'm using gcc/gfortran version 4.5.1, not quite as new as yours.

I appreciate your efforts to track down this insidious/subtle "bug". I suspect that it will be to many people's benefit.

Thanks,
Stuart

Below are the printouts mentioned above:

Code: Select all
Using 3686 of 4096 control points with quality 2.000000 to 0.145472
Setup and solve TPS equations
PLASMA_Init: 0
Just before plasma omp_get_num_threads() yields: 1
PLASMA_dgels: 0
Just after plasma omp_get_num_threads() yields: 1
PLASMA_Finalize: 0
Normalized bending energy: 1.213144e-01
Fit error energy: 2.759992e+00
Interpolate z surface using TPS splines
PLASMA nx: 43 OSthread: 19814 OMPthread: 1 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 86 OSthread: 19815 OMPthread: 2 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 129 OSthread: 19816 OMPthread: 3 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 172 OSthread: 19817 OMPthread: 4 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 215 OSthread: 19818 OMPthread: 5 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 301 OSthread: 19820 OMPthread: 7 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 344 OSthread: 19821 OMPthread: 8 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 387 OSthread: 19822 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 258 OSthread: 19819 OMPthread: 6 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 430 OSthread: 19823 OMPthread: 10 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 473 OSthread: 19824 OMPthread: 11 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 559 OSthread: 19826 OMPthread: 13 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 516 OSthread: 19825 OMPthread: 12 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 645 OSthread: 19828 OMPthread: 15 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 602 OSthread: 19827 OMPthread: 14 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 731 OSthread: 19830 OMPthread: 17 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 688 OSthread: 19829 OMPthread: 16 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 774 OSthread: 19831 OMPthread: 18 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 817 OSthread: 19832 OMPthread: 19 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 860 OSthread: 19833 OMPthread: 20 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 903 OSthread: 19834 OMPthread: 21 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 989 OSthread: 19836 OMPthread: 23 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 946 OSthread: 19835 OMPthread: 22 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 0 OSthread: 19798 OMPthread: 0 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 44 OSthread: 19814 OMPthread: 1 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 689 OSthread: 19829 OMPthread: 16 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 603 OSthread: 19827 OMPthread: 14 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 388 OSthread: 19822 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 474 OSthread: 19824 OMPthread: 11 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 904 OSthread: 19834 OMPthread: 21 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 87 OSthread: 19815 OMPthread: 2 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 302 OSthread: 19820 OMPthread: 7 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 173 OSthread: 19817 OMPthread: 4 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 130 OSthread: 19816 OMPthread: 3 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 775 OSthread: 19831 OMPthread: 18 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 517 OSthread: 19825 OMPthread: 12 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 560 OSthread: 19826 OMPthread: 13 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 646 OSthread: 19828 OMPthread: 15 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 947 OSthread: 19835 OMPthread: 22 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 732 OSthread: 19830 OMPthread: 17 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 259 OSthread: 19819 OMPthread: 6 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 861 OSthread: 19833 OMPthread: 20 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 1 OSthread: 19798 OMPthread: 0 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 431 OSthread: 19823 OMPthread: 10 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 345 OSthread: 19821 OMPthread: 8 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 990 OSthread: 19836 OMPthread: 23 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 216 OSthread: 19818 OMPthread: 5 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 818 OSthread: 19832 OMPthread: 19 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 561 OSthread: 19826 OMPthread: 13 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 647 OSthread: 19828 OMPthread: 15 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 518 OSthread: 19825 OMPthread: 12 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 733 OSthread: 19830 OMPthread: 17 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 260 OSthread: 19819 OMPthread: 6 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 475 OSthread: 19824 OMPthread: 11 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 389 OSthread: 19822 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 346 OSthread: 19821 OMPthread: 8 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 862 OSthread: 19833 OMPthread: 20 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 432 OSthread: 19823 OMPthread: 10 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 690 OSthread: 19829 OMPthread: 16 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 217 OSthread: 19818 OMPthread: 5 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 819 OSthread: 19832 OMPthread: 19 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 604 OSthread: 19827 OMPthread: 14 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 905 OSthread: 19834 OMPthread: 21 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 303 OSthread: 19820 OMPthread: 7 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 131 OSthread: 19816 OMPthread: 3 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 776 OSthread: 19831 OMPthread: 18 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 174 OSthread: 19817 OMPthread: 4 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 948 OSthread: 19835 OMPthread: 22 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 88 OSthread: 19815 OMPthread: 2 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 45 OSthread: 19814 OMPthread: 1 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 2 OSthread: 19798 OMPthread: 0 OMPnumthreads: 24 OMPmaxthreads 24
PLASMA nx: 991 OSthread: 19836 OMPthread: 23 OMPnumthreads: 24 OMPmaxthreads 24
... on and on ...


Code: Select all
Using 3686 of 4096 control points with quality 2.000000 to 0.145472
Setup and solve TPS equations
Normalized bending energy: 1.213144e-01
Fit error energy: 2.759992e+00
Interpolate z surface using TPS splines
LAPACK nx: 989 OSthread: 19777 OMPthread: 23 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 817 OSthread: 19773 OMPthread: 19 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 602 OSthread: 19768 OMPthread: 14 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 43 OSthread: 19755 OMPthread: 1 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 688 OSthread: 19770 OMPthread: 16 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 645 OSthread: 19769 OMPthread: 15 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 258 OSthread: 19760 OMPthread: 6 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 430 OSthread: 19764 OMPthread: 10 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 860 OSthread: 19774 OMPthread: 20 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 344 OSthread: 19762 OMPthread: 8 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 774 OSthread: 19772 OMPthread: 18 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 86 OSthread: 19756 OMPthread: 2 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 215 OSthread: 19759 OMPthread: 5 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 559 OSthread: 19767 OMPthread: 13 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 516 OSthread: 19766 OMPthread: 12 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 731 OSthread: 19771 OMPthread: 17 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 129 OSthread: 19757 OMPthread: 3 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 387 OSthread: 19763 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 172 OSthread: 19758 OMPthread: 4 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 301 OSthread: 19761 OMPthread: 7 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 903 OSthread: 19775 OMPthread: 21 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 946 OSthread: 19776 OMPthread: 22 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 473 OSthread: 19765 OMPthread: 11 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 0 OSthread: 19745 OMPthread: 0 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 388 OSthread: 19763 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 732 OSthread: 19771 OMPthread: 17 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 259 OSthread: 19760 OMPthread: 6 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 87 OSthread: 19756 OMPthread: 2 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 775 OSthread: 19772 OMPthread: 18 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 345 OSthread: 19762 OMPthread: 8 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 560 OSthread: 19767 OMPthread: 13 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 517 OSthread: 19766 OMPthread: 12 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 818 OSthread: 19773 OMPthread: 19 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 302 OSthread: 19761 OMPthread: 7 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 431 OSthread: 19764 OMPthread: 10 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 603 OSthread: 19768 OMPthread: 14 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 861 OSthread: 19774 OMPthread: 20 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 216 OSthread: 19759 OMPthread: 5 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 689 OSthread: 19770 OMPthread: 16 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 44 OSthread: 19755 OMPthread: 1 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 947 OSthread: 19776 OMPthread: 22 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 173 OSthread: 19758 OMPthread: 4 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 130 OSthread: 19757 OMPthread: 3 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 646 OSthread: 19769 OMPthread: 15 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 990 OSthread: 19777 OMPthread: 23 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 474 OSthread: 19765 OMPthread: 11 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 904 OSthread: 19775 OMPthread: 21 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 389 OSthread: 19763 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 733 OSthread: 19771 OMPthread: 17 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 260 OSthread: 19760 OMPthread: 6 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 303 OSthread: 19761 OMPthread: 7 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 346 OSthread: 19762 OMPthread: 8 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 88 OSthread: 19756 OMPthread: 2 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 432 OSthread: 19764 OMPthread: 10 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 819 OSthread: 19773 OMPthread: 19 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 518 OSthread: 19766 OMPthread: 12 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 561 OSthread: 19767 OMPthread: 13 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 604 OSthread: 19768 OMPthread: 14 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 862 OSthread: 19774 OMPthread: 20 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 217 OSthread: 19759 OMPthread: 5 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 948 OSthread: 19776 OMPthread: 22 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 174 OSthread: 19758 OMPthread: 4 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 390 OSthread: 19763 OMPthread: 9 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 131 OSthread: 19757 OMPthread: 3 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 1 OSthread: 19745 OMPthread: 0 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 45 OSthread: 19755 OMPthread: 1 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 690 OSthread: 19770 OMPthread: 16 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 991 OSthread: 19777 OMPthread: 23 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 647 OSthread: 19769 OMPthread: 15 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 475 OSthread: 19765 OMPthread: 11 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 905 OSthread: 19775 OMPthread: 21 OMPnumthreads: 24 OMPmaxthreads 24
LAPACK nx: 776 OSthread: 19772 OMPthread: 18 OMPnumthreads: 24 OMPmaxthreads 24
... on and on ...


For completeness, here's the code fragment that generated the above prints (the only difference between the two runs is the PLASMA vs. LAPACK identifier):

Code: Select all
void tpssurf(numpoints,nsampx,nsampy,nsampz,ctlpoints,tpsvec,rescelldted)
int numpoints,nsampx,nsampy,nsampz;
CTLPOINT *ctlpoints;
MUDS_DOUBLE *tpsvec;
MUDS_DOUBLE **rescelldted;
{
  /*
    Evaluate the thin-plate spline z surface function for all image sample locations.

    Could potentially speed this up by using spline to perform SOME
    interpolation and zero-padded FFTs to perform the rest.
  */

  MUDS_DOUBLE x,y,dxi,dyi,r2i,basis;
  int nx,ny,i;
  int ompnumthreads,ompmaxthreads,ompthread;
  pid_t osthread;

  ompmaxthreads=omp_get_max_threads();
  /*omp_set_num_threads(ompmaxthreads);*/

  #pragma omp parallel for \
  private(nx,ny,i,x,y,dxi,dyi,r2i,basis,ompnumthreads,ompthread,osthread) \
  shared(nsampx,nsampy,numpoints,tpsvec,rescelldted,ompmaxthreads)
  for(nx=0;nx<nsampx;nx++){
    ompnumthreads=omp_get_num_threads();
    ompthread=omp_get_thread_num();
    osthread=(pid_t) syscall(SYS_gettid);
    /* osthread=gettid();  doesn't seem to work/compile */
    fprintf(stderr,"PLASMA nx: %d OSthread: %d OMPthread: %d OMPnumthreads: %d OMPmaxthreads %d\n",nx,osthread,ompthread,ompnumthreads,ompmaxthreads);
    x=(MUDS_DOUBLE)nx/(MUDS_DOUBLE)nsampx;
    for(ny=0;ny<nsampy;ny++){
      y=(MUDS_DOUBLE)ny/(MUDS_DOUBLE)nsampy;
      rescelldted[nx][ny]=(MUDS_DOUBLE)0.;
      /*Bending/perturbation/warping part*/
      for(i=0;i<numpoints;i++){
        dxi=x-ctlpoints[i].x;
        dyi=y-ctlpoints[i].y;
        r2i=dxi*dxi+dyi*dyi;
        if(r2i==(MUDS_DOUBLE)0.) basis=(MUDS_DOUBLE)0.;
        else basis=r2i*log(r2i);
        rescelldted[nx][ny]+=tpsvec[i]*basis;
      }
      rescelldted[nx][ny]*=nsampz; /*don't want normalized units*/
      /*Bi-linear (affine) part*/
      rescelldted[nx][ny]+=(tpsvec[numpoints]+tpsvec[numpoints+1]*x+tpsvec[numpoints+2]*y)*nsampz; /*don't want normalized units*/
    }
  }
}
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Wed Jul 06, 2011 6:50 pm

Mathieu & Stephan,

I confirm Stephan's observation that omp_get_num_procs() returns a different value before PLASMA_Init() and after PLASMA_Finalize(). This doesn't seem right, but is consistent with the behaviour I described in the last post. The following output was produced by the subsequent code:

Just before PLASMA_Init(12) omp_get_num_procs() yields: 24
PLASMA_Init: 0
PLASMA_dgels: 0
PLASMA_Finalize: 0
Just after PLASMA_Finalize() omp_get_num_procs() yields: 1

Code: Select all
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <plasma.h>
#include <cblas.h>
#include <lapacke.h>
#include <core_blas.h>
#include <omp.h>
#include "muds_types.h"
#include "muds_complex.h"
#include "muds_allocation.h"

void dsolveviaqr_plasma(nthreads,rows,cols,a,rhs,soln)
int nthreads,rows,cols;
MUDS_DOUBLE **a,*rhs,*soln;
{

  /*
    Interface to the PLASMA double-precision QR (least-squares) solver.
    Assumes a full-rank matrix with rows>=columns.
  */

  MUDS_DOUBLE **atrans,**rhstrans,*work;
  /* char normal='N';  normal, i.e. no transpose */
  int i,j,info,one=1;
  extern int PLASMA_dgels();

  fprintf(stderr,"Just before PLASMA_Init(%d) omp_get_num_procs() yields: %d\n",nthreads,omp_get_num
_procs());
  info=PLASMA_Init(nthreads);
  fprintf(stderr,"PLASMA_Init: %d\n",info);

  /*Allocate work space*/
  matrix(atrans,MUDS_DOUBLE,cols,rows);
  matrix(rhstrans,MUDS_DOUBLE,1,rows);
  PLASMA_Alloc_Workspace_dgels(rows,cols,&work);

  /*Copy A into transposed array for FORTRAN; QR destroys this array*/
  for(i=0;i<rows;i++){
    for(j=0;j<cols;j++){
      atrans[j][i]=a[i][j];
    }
  }

  /*Copy rhs into transposed array for FORTRAN; overwritten by solution*/
  for(i=0;i<rows;i++){
    rhstrans[0][i]=rhs[i];
  }

  /*Do it*/
  info=PLASMA_dgels(PlasmaNoTrans,rows,cols,one,&atrans[0][0],rows,&work[0],&rhstrans[0][0],rows);
  fprintf(stderr,"PLASMA_dgels: %d\n",info);
  if(info<0) fprintf (stderr,"QR argument %d bad\n",-info);

  /*Recover solution part, which is transposed*/
  for(i=0;i<rows;i++){
    soln[i]=rhstrans[0][i];
  }

  /*Deallocate work space*/
  freematrix(atrans);
  freematrix(rhstrans);
  free(work);

  info=PLASMA_Finalize();
  fprintf(stderr,"PLASMA_Finalize: %d\n",info);
  fprintf(stderr,"Just after PLASMA_Finalize() omp_get_num_procs() yields: %d\n",omp_get_num_procs()
);

  return;
}


Thanks,

Stuart
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Thu Jul 07, 2011 4:43 am

Hello,

Stuart, thanks for the hint; I found the problem.

The problem is that PLASMA binds all the threads it uses, including the master thread. When you enter the next OpenMP section, the threads created by the master thread inherit that binding, so they are all pinned to the same core and all run on core 0.
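
The mechanism is easy to demonstrate in isolation with plain pthreads. The sketch below (illustrative, not PLASMA code; Linux-specific, compile with -pthread) pins the calling thread to core 0, the way PLASMA's affinity code effectively does to the master thread, and then shows that a newly created thread inherits the single-core mask:

Code: Select all
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
  cpu_set_t mask;
  (void)arg;
  /* A new thread starts with a copy of its creator's affinity mask */
  pthread_getaffinity_np(pthread_self(),sizeof(mask),&mask);
  printf("worker may run on %d CPU(s)\n",CPU_COUNT(&mask));
  return NULL;
}

int main(void)
{
  cpu_set_t mask;
  pthread_t tid;

  /* Pin the master thread to core 0, as PLASMA does with its rank 0 */
  CPU_ZERO(&mask);
  CPU_SET(0,&mask);
  sched_setaffinity(0,sizeof(mask),&mask);

  /* Every thread created from now on inherits the one-core mask, which
     is what happens to the OpenMP workers created after PLASMA ran */
  pthread_create(&tid,NULL,worker,NULL);
  pthread_join(tid,NULL);
  return 0;
}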

Here is a patch, IF you are using hwloc. I'm in meetings today, but tomorrow I will also fix the problem for the case where you are not using hwloc, and generate a new release.

Code: Select all
Index: control.c
===================================================================
--- control.c   (revision 2590)
+++ control.c   (working copy)
@@ -342,5 +342,11 @@
         plasma_fatal_error("PLASMA_Finalize", "plasma_context_remove() failed");
         return status;
     }
+
+    /* Restore the concurrency */
+    /* actually it's really bad, we should set the concurrency only
+     * if it's not already done, and restore it only if we had changed it */
+    pthread_setconcurrency( 0 );
+
     return PLASMA_SUCCESS;
 }
Index: control.h
===================================================================
--- control.h   (revision 2590)
+++ control.h   (working copy)
@@ -30,6 +30,7 @@
 void  plasma_barrier(plasma_context_t *plasma);
 void *plasma_parallel_section(void *plasma);
 int   plasma_setaffinity(int rank);
+int   plasma_unsetaffinity();
 int   plasma_yield();
 void  plasma_topology_init();
 void  plasma_topology_finalize();
Index: plasmaos-hwloc.c
===================================================================
--- plasmaos-hwloc.c   (revision 2590)
+++ plasmaos-hwloc.c   (working copy)
@@ -48,6 +48,8 @@
 
 void plasma_topology_finalize(){
 
+    plasma_unsetaffinity();
+       
     pthread_mutex_lock(&mutextopo);
     plasma_nbr--;
     if ((topo_initialized ==1) && (plasma_nbr == 0)) {
@@ -66,7 +68,7 @@
  If there are multiple instances of PLASMA then affinity will be wrong: all ranks 0
  will be pinned to core 0.
 
- Also, affinity is not resotred when PLASMA_Finalize() is called.
+ Also, affinity is not restored when PLASMA_Finalize() is called, but is removed.
  */
 int plasma_setaffinity(int rank) {
     hwloc_obj_t      obj;      /* Hwloc object    */
@@ -117,6 +119,57 @@
     return PLASMA_SUCCESS;
 }
 
+/**
+ This routine will unset the affinity set by a previous call to
+ plasma_setaffinity.
+ */
+int plasma_unsetaffinity() {
+    hwloc_obj_t      obj;      /* Hwloc object    */
+    hwloc_cpuset_t   cpuset;   /* HwLoc cpuset    */
+   
+    if (!topo_initialized) {
+        plasma_error("plasma_unsetaffinity", "Topology not initialized");
+        return PLASMA_ERR_UNEXPECTED;
+    }
+
+    /* Get last one.  */
+    obj = hwloc_get_obj_by_type(plasma_topology, HWLOC_OBJ_MACHINE, 0);
+    if (!obj) {
+        plasma_warning("plasma_unsetaffinity", "Could not get object");
+        return PLASMA_ERR_UNEXPECTED;
+    }
+   
+    /* Get a copy of its cpuset that we may modify.  */
+    /* Get only one logical processor (in case the core is SMT/hyperthreaded).  */
+#if !defined(HAVE_HWLOC_BITMAP)
+    cpuset = hwloc_cpuset_dup(obj->cpuset);
+#else
+    cpuset = hwloc_bitmap_dup(obj->cpuset);
+#endif
+   
+    /* And try to bind ourself there.  */
+    if (hwloc_set_cpubind(plasma_topology, cpuset, HWLOC_CPUBIND_THREAD)) {
+        char *str = NULL;
+#if !defined(HAVE_HWLOC_BITMAP)
+        hwloc_cpuset_asprintf(&str, obj->cpuset);
+#else
+        hwloc_bitmap_asprintf(&str, obj->cpuset);
+#endif
+        plasma_warning("plasma_unsetaffinity", "Could not bind to the whole machine");
+        printf("Couldn't bind to cpuset %s\n", str);
+        free(str);
+        return PLASMA_ERR_UNEXPECTED;
+    }
+   
+    /* Free our cpuset copy */
+#if !defined(HAVE_HWLOC_BITMAP)
+    hwloc_cpuset_free(cpuset);
+#else
+    hwloc_bitmap_free(cpuset);
+#endif
+    return PLASMA_SUCCESS;
+}
+
 int plasma_getnuma_size() {
     hwloc_cpuset_t   cpuset;   /* HwLoc cpuset    */
     hwloc_obj_t      obj;


Mathieu
mateo70
 
Posts: 95
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby uhle89 » Thu Jul 07, 2011 5:51 am

Hi Mathieu and Stuart,
here are further observations and reasoning:
1) I've replaced libcblas.so with libgslcblas.so, another BLAS lib I had hanging around.
Apart from the fact that PLASMA is now 4x slower (which proves that the different lib is used),
the number of procs is still affected by the PLASMA usage.

2) In our case I can recover OpenMP multithreading after PLASMA usage if I just avoid
evaluating the result of omp_get_num_procs().

Diving into our code, I found a forced thread-number limitation (useful when there are only a few iterations):
Code: Select all
     
       const int numIterPerT=MIN_NUM_ITER;  //min. number of iterations per thread (>=200 is reasonable)
       const int cpus=omp_get_num_procs();
       int numThreads=numIter/numIterPerT>=cpus ? cpus : numIter/numIterPerT;
       if(numThreads==0) numThreads=1;
#pragma omp parallel for reduction(+:pot,Ex,Ey,Ez) schedule(static,numIterPerT) num_threads(numThreads)

It is obvious that I have limited the thread number to 1, since omp_get_num_procs()==1.
If I just remove the num_threads() clause, the loop runs within a team of 16 threads @ 16 CPUs!

I understand that, in contrast to my findings, Stuart has 24 threads running @ 1 CPU.

3) If I re-insert the above clause as
Code: Select all
#pragma omp parallel for reduction(+:pot,Ex,Ey,Ez) schedule(static,numIterPerT) num_threads(4)

The loop runs with 4 threads on 4 CPUs, as one would expect.

However, if execution reaches the next parallel OMP loop WITHOUT any thread-number limitation,
that section is executed by 16 threads but on only 4 CPUs! I guess this is the point where OMP cannot figure out how many
(new) LWPs have to be forked; it just uses the number of LWPs that are still there.
(I've checked that with LAPACK only, the LWP number is 4 and 16, respectively; this is OK, since LWPs are killed if not necessary in a parallel section/loop.)

My conclusion is that Stuart's situation differs with respect to the number of LWPs forked before the first
PLASMA call. Stuart, could you please check whether executing a loop with 24 threads @ 24 CPUs before the first PLASMA call
can restore the execution to 24 @ 24 after PLASMA usage?

If I'm not wrong, a more precise description of the bug is
"omp_get_num_procs() always returns 1 after PLASMA usage" in conjunction with
"New LWPs are not/cannot be forked by OMP (if necessary) after PLASMA usage."

This also means that the actual number of threads created by OMP is correct after PLASMA usage.
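
Until the fix lands, one defensive workaround for code like the num_threads() computation above (a sketch, not from Stephan's code) is to query the processor count once, before the first PLASMA call, and reuse the cached value afterwards:

Code: Select all
#include <omp.h>

/*
  Return the processor count as seen before any PLASMA call. The first
  call must happen before PLASMA_Init(); later calls return the cached
  value even after PLASMA has narrowed the affinity mask.
*/
static int cached_num_procs(void)
{
  static int procs=0;
  if(procs==0) procs=omp_get_num_procs();
  return procs;
}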

Stephan
uhle89
 
Posts: 6
Joined: Wed Jul 06, 2011 10:30 am
