PLASMA 2.4.0 and OpenMP compatibility issue?

Open forum for general discussions relating to PLASMA.

PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Sat Jul 02, 2011 11:17 am

Dear PLASMA gurus,

First, let me say that I greatly appreciate your efforts. Fast numerical linear algebra is the foundation upon which all modern digital signal processing is built! Please keep up the great work! I'm trying to use PLASMA 2.4.0 (rather than LAPACK) to evaluate thin-plate splines (TPS) in a digital elevation map extraction algorithm I've developed for synthetic aperture radar. I think I'm running into an apparent clash between PLASMA and OpenMP. I'm running on Fedora 14, using gcc 4.5.1.

There are two computationally intensive parts to my problem: 1) perform a QR decomposition (dgels or PLASMA_dgels) to solve for the TPS coefficients; 2) use the TPS coefficients to evaluate/interpolate a surface. Both parts scream out to be parallelized. The matrices involved are on the order of 32K x 32K (or larger), and I'm seeing a roughly 12x speedup of the first part using PLASMA_dgels() rather than dgels() on my 12-core Westmere workstation. This is wonderful! It works, and is accurate!

Unfortunately, what isn't so wonderful is that I use OpenMP parallelization for the second part of the problem. When I use LAPACK, the OpenMP part flies, i.e. I get the anticipated 12x speedup of the second part. When I use PLASMA, for some reason the program refuses to make use of more than one core! I can't figure out why, and would appreciate your help!

I built PLASMA 2.4.0 using the following options:


./setup.py --blaslib=/usr/lib64/atlas-sse2/libf77blas.so --cblaslib=/usr/lib64/atlas-sse2/libcblas.so \
--lapacklib=/usr/lib64/atlas-sse2/liblapack.so --lapclib=/usr/lib64/atlas-sse2/libclapack.so


As you can see, I'm using atlas-sse2 BLAS (single-threaded), so I do NOT use "export OMP_NUM_THREADS=1" (or the MKL or GOTO equivalents) during the build. In fact, these variables are not set. (Note: using --notesting doesn't prevent the installer from downloading its own version of the netlib lapack C interface; I guess it is still required to build liblapacke.a and/or libcoreblas.a.)

Thinking that PLASMA_dgels() had somehow (via something in the libraries) constrained my OpenMP number of threads to 1, I added the following statements to print and set the relevant number-of-threads variables in the OpenMP portion of the code:


Code: Select all
void tpssurf(numpoints,nsampx,nsampy,nsampz,ctlpoints,tpsvec,rescelldted)
int numpoints,nsampx,nsampy,nsampz;
CTLPOINT *ctlpoints;
double *tpsvec;
double **rescelldted;
{
  /*
    Evaluate the thin-plate spline z surface function for all image sample locations.
  */

  MUDS_DOUBLE x,y,dxi,dyi,r2i,basis;
  int nx,ny,i,nthreads,maxthreads;

  /*Earlier plasma usage somehow sets the number of OpenMP threads to one*/
  /*Hopefully this overcomes the problem. IT DIDN"T; NOW WHAT?*/
  maxthreads=omp_get_max_threads();
  omp_set_num_threads(maxthreads);

  #pragma omp parallel for \
  private(nx,ny,i,x,y,dxi,dyi,r2i,basis,nthreads) \
  shared(nsampx,nsampy,numpoints,tpsvec,rescelldted,maxthreads)
  for(nx=0;nx<nsampx;nx++){
    if(nx==0){
      nthreads=omp_get_num_threads();
      fprintf(stderr,"OMP threads used for tpssurf: %d of %d\n",nthreads,maxthreads);
    }
    x=(double)nx/(double)nsampx;
    for(ny=0;ny<nsampy;ny++){
      y=(double)ny/(double)nsampy;
      rescelldted[nx][ny]=(double)0.;
      /*Bending/perturbation/warping part*/
      for(i=0;i<numpoints;i++){
        dxi=x-ctlpoints[i].x;
        dyi=y-ctlpoints[i].y;
        r2i=dxi*dxi+dyi*dyi;
        if(r2i==(double)0.) basis=(double)0.;
        else basis=r2i*log(r2i);
        rescelldted[nx][ny]+=tpsvec[i]*basis;
      }
      rescelldted[nx][ny]*=nsampz; /*don't want normalized units*/
      /*Bi-linear (affine) part*/
      rescelldted[nx][ny]+=(tpsvec[numpoints]+tpsvec[numpoints+1]*x+tpsvec[numpoints+2]*y)*nsampz; /*don't want normalized units*/
    }
  }
}



When I run the program, it prints out:

OMP threads used for tpssurf: 24 of 24

This would seem to be good. It's what I'm supposed to see! (The machine is hyperthreaded so I see 24, not the 12 actual cores; yes, I know I should divide by 2.) However, the load average (gkrellm/top) is only ~100%, not the ~1200% I've come to love, during the execution of this portion of the program. It is only using one core. What could/should I do to make it use all 12 again?

For completeness, below is the earlier part of the code that utilizes PLASMA, and my makefile. I believe that the MUDS/SPLASH and GSL FFT stuff is not relevant to the problem. The only difference between this and the code and makefile that behave correctly is the use of PLASMA (rather than LAPACK) for the first part of the algorithm.


Code: Select all
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include "muds_types.h"
#include "muds_allocation.h"

#include <plasma.h>
#include <cblas.h>
#include <lapacke.h>
#include <core_blas.h>

void dsolveviaqr_plasma(nthreads,rows,cols,a,rhs,soln)
int nthreads,rows,cols;
double **a,*rhs,*soln;
{

  /*
    Interface to FORTRAN LAPACK complex QR decomposition. Assumes full-rank
    matrix with rows>=columns.
  */

 double **atrans,**rhstrans,*work;
  /*char normal='N'; /*normal, i.e. no transpose*/
  int i,j,info,one=1,worksize;
  extern int PLASMA_dgels();

  info=PLASMA_Init(nthreads);
  fprintf(stderr,"PLASMA_Init: %d\n",info);

  /*Allocate work space*/
  matrix(atrans,double,cols,rows);
  matrix(rhstrans,double,1,rows);
  worksize=rows*cols; /*generous? not clear what opt block size is*/
  vector(work,double,worksize);
  /*PLASMA_Alloc_Workspace_dgels(rows,cols,work); /*don't use courtesy rtn*/

  /*Copy A into transposed array for FORTRAN; QR destroys this array*/
  for(i=0;i<rows;i++){
    for(j=0;j<cols;j++){
      atrans[j][i]=a[i][j];
    }
  }

  /*Copy rhs into transposed array for FORTRAN; overwritten by solution*/
  for(i=0;i<rows;i++){
    rhstrans[0][i]=rhs[i];
  }

  /*Do it*/
  /*dgels(&normal,&rows,&cols,&one,&atrans[0][0],&rows,&rhstrans[0][0],&rows,&work[0],&worksize,&info); /*lapack arguments differ slightly*/
  info=PLASMA_dgels(PlasmaNoTrans,rows,cols,one,&atrans[0][0],rows,&work[0],&rhstrans[0][0],rows); /*plasma scalars by value, return*/
  fprintf(stderr,"PLASMA_dgels: %d\n",info);
  if(info<0) fprintf (stderr,"QR argument %d bad\n",-info);

  /*Recover solution part, which is transposed*/
  for(i=0;i<rows;i++){
    soln[i]=rhstrans[0][i];
  }

  /*Deallocate work space*/
  freematrix(atrans);
  freematrix(rhstrans);
  freevector(work);
  /*free(work); /*don't use plasma courtesy rtn*/
  info=PLASMA_Finalize();
  fprintf(stderr,"PLASMA_Finalize: %d\n",info);

  return;
}



Code: Select all
MUDS = ../../muds
SPLASH = ../../splash
MUDS_CORE = $(MUDS)/muds_core
HEADERS = $(MUDS_CORE)/muds_types.h $(MUDS_CORE)/mudsio_prototypes.h $(MUDS_CORE)/muds_allocation.h $(MUDS_CORE)/muds_complex.h

ARCH = lnx64
BIN = ../bin$(ARCH)

PLASMA = /home/work/plasma-installer_2.4.0/install
INCLUDES = -I. -I$(MUDS_CORE)

LIBMUDS = $(MUDS)/obj$(ARCH)/libmuds.a
LIBMUDSMATH = $(MUDS)/obj$(ARCH)/libmudsmath.a
LIBMUDSMATHPLASMA = $(MUDS)/obj$(ARCH)/dqroverdet_plasma.o
LIBTOEP = $(SPLASH)/obj$(ARCH)/dtoep.o
LIBFILT = $(SPLASH)/obj$(ARCH)/dwindowstuff.o $(SPLASH)/obj$(ARCH)/dppfilt.o $(SPLASH)/obj$(ARCH)/resample.o
LIBFFT = $(SPLASH)/obj$(ARCH)/mtdfft2d.o $(SPLASH)/obj$(ARCH)/mtdfft3d.o $(SPLASH)/obj$(ARCH)/mtdfft2dwrapper.o \
$(SPLASH)/obj$(ARCH)/mtdfft3dwrapper.o $(SPLASH)/obj$(ARCH)/dfft2d.o $(SPLASH)/obj$(ARCH)/dfft3d.o $(SPLASH)/obj$(ARCH)/dfft1d.o

LIBGSL = -L/usr/lib64 -lgsl -L/usr/lib64/atlas -lcblas
LIBLAPACK = -L/usr/lib/atlas-sse2 -llapack -lcblas
LIBPLASMA = $(PLASMA)/lib/libplasma.a $(PLASMA)/lib/libcoreblas.a $(PLASMA)/lib/liblapacke.a $(PLASMA)/lib/libquark.a

LIBMATHEMATICS = $(LIBMUDSMATH) $(LIBFFT) $(LIBGSL) $(LIBLAPACK) $(LIBTOEP) $(LIBFILT) $(LIBMUDSMATHPLASMA) $(LIBPLASMA) -lc -lm
BASIC = $(LIBMUDS) $(LIBMATHEMATICS)

DEPMATHEMATICS = $(LIBMUDSMATH) $(LIBFFT) $(LIBTOEP) $(LIBFILT) $(LIBMUDSMATHPLASMA)
DEPBASIC = $(LIBMUDS) $(DEPMATHEMATICS)

CC = gcc -m64
IFLAGS = $(INCLUDES)
DFLAGS = -g -o
LFLAGS = -funroll-all-loops -O3 -o
CFLAGS = -funroll-all-loops -O3 -c

all : \
mt_inferdtedp

mt_inferdtedp : mt_inferdtedp.c $(HEADERS) $(DEPBASIC)
   $(CC) $(IFLAGS) $(LFLAGS) $(BIN)/mt_inferdtedp mt_inferdtedp.c $(BASIC) -fopenmp
   strip $(BIN)/mt_inferdtedp

clean :
   rm *.o



Any idea what could be going wrong? I have no clue, and would appreciate any help you can provide!
By the way, everything behaves exactly the same with PLASMA 2.3.1.

Thanks,

Stuart DeGraaf
Last edited by srdegraaf on Sat Jul 02, 2011 4:39 pm, edited 1 time in total.
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Tue Jul 05, 2011 3:26 am

Hello,

Try to put the plasma_init and plasma_finalize just around the dgels call, if it's not already done this way. It should kill the PLASMA threadsafter the call and perturb less your OpenMP application.
I will try to find out where it comes from.

Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Tue Jul 05, 2011 3:41 am

Ok, sorry I havn't seen the code using PLASMA the first time. So, just a few comment concerning the code:
PLASMA_Alloc_Workspace_dgels allocate a workspace of a really specific size, so you should use it instead of using your own allocation because it hides the change in some of our structure from one version to another.

And the exact size is :
if NT = N / NB + ( N%NB ) ? 1 : 0
and MT = M / MB + ( M%MB ) ? 1 : 0
So the size is NT*NB * MT*IB*x where NB is you block size; IB the inner block size and x equal 1 for the general case or 2 if you want to use a tree for the reduction.
So normally, your size should be fine in most of the cases but will be wrong for some of them and it's quite dangerous to use it this way :).

Otherwise I don't see anything wrong in your use of PLASMA.

Just to be sure, when you do a ldd on your executable, is it linked to the correct libraries ?

Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Tue Jul 05, 2011 4:03 am

Concerning the OMP_NUM_THREADS set to 1. You should check the environment but normally it's done only by the installer to run the timing. You can unset them just after. I will add a warning for that.

And thanks for the LAPACKE, I will change the message because we always need it.

Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Tue Jul 05, 2011 12:02 pm

Mathieu

Thanks for your responses. I was also concerned about the use of my own allocation routine for workspace, and replaced it with the plasma one, as you suggested. Unfortunately, it made no difference. My OpenMP code still uses only one thread/core after using PLASMA_dgels(). I left the location of the PLASMA_init() and PLASMA_Finalize() as originally shown, since that matches their use in example_dgels.c and encloses the other PLASMA routines - specifically PLASMA_Alloc_Workspace_dgels() and free(). I will also try putting these directly adjacent to the PLASMA_dgels() call, but am not optimistic about success.

Your other suggestion, about using ldd yielded some surprises, at least to me:

Code: Select all
[srd@clipper backproject]$ ldd ../binlnx64/mt_inferdtedp
   linux-vdso.so.1 =>  (0x00007fff9c16f000)
   libgsl.so.0 => /usr/lib64/libgsl.so.0 (0x00000031f5800000)
   libcblas.so.3 => /usr/lib64/atlas/libcblas.so.3 (0x00000031f4800000)
   liblapack.so.3 => /usr/lib64/atlas/liblapack.so.3 (0x00000031f6c00000)
   libc.so.6 => /lib64/libc.so.6 (0x0000003655a00000)
   libm.so.6 => /lib64/libm.so.6 (0x0000003656a00000)
   libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00000031f9400000)
   libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003655e00000)
   libgslcblas.so.0 => /usr/lib64/libgslcblas.so.0 (0x00000031f5e00000)
   libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00000031f7600000)
   libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f9704b4d000)
   libf77blas.so.3 => /usr/lib64/atlas/libf77blas.so.3 (0x00007f9704930000)
   /lib64/ld-linux-x86-64.so.2 (0x0000003655600000)
   librt.so.1 => /lib64/librt.so.1 (0x0000003656600000)
   libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003657a00000)


The first surprise is that there is no reference to any of the plasma libraries that are in my makefile:
/home/work/plasma-installer_2.4.0/install/libplasma.a, libcoreblas.a, liblapacke.a or libquark.a.
Yet the plasma stuff is clearly working - I get the dramatic multi-core speedup with PLASMA_dgels() - and the initialization, finalization, and allocation all occur.

The second surprise is that ldd -u shows that none of the BLAS (or LAPACK) libraries are being used. (Some of the other libraries not being used are not a surprise to me - I'm linking with them to get make to work, rather than because I actually need them here.)

Code: Select all
[srd@clipper backproject]$ ldd -u ../binlnx64/mt_inferdtedp
Unused direct dependencies:
   
   /usr/lib64/libgsl.so.0                      (expected this to be unused)
   /usr/lib64/atlas/libcblas.so.3           (I thought this was crucial)
   /usr/lib64/atlas/liblapack.so.3         (not sure if this actually gets used by plasma or not - apparently not)
   /lib64/libm.so.6                             (I guess this is just for completeness)
   /usr/lib64/libgomp.so.1                   (don't know what this is or where it came from)


Does this suggest anything to you?

Thanks for your help!

Stuart
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Tue Jul 05, 2011 12:34 pm

Ok , so this is my mistake. PLASMA is compiled as a static library so you won't see it in ldd. And it's normal all these libraries appear in the ldd -u, it's because they are used by another library like plasma or else and not directly by your program.

For you information, the libgomp is the OpenMP library :). and everything looks fine in the links.

So, I checked every lines of PLASMA code and I cannot find any setenv except in the installer. Which means that, as soon as you have unset the variable after the installation, it shouldn't appear again. So "maybe" the ATLAS library is setting it to 1. I suggest you add a omp_get_num_threads before and after the call to the function using PLASMA and see if the value has changed between.

I look forward to your answer.
Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Tue Jul 05, 2011 1:49 pm

Mathieu,

Old business: if I put the PLASMA_Init() and PLASMA_Finalize() calls adjacent to the PLASMA_dgels() call (so that the PLASMA_Alloc_Workspace_dgels() and free() calls are outside, then I get the following error:
PLASMA FATAL ERROR: plasma_alloc_ibnb(): PLASMA not initialized
so I put them back in their original position.

New business: I made the change you suggested (omp_get_numthreads()) to my code (inside initialization/finalization):

Code: Select all
/*Do it*/
  fprintf(stderr,"Just before plasma omp_get_num_threads() yields: %d\n",omp_get_num_threads());
  info=PLASMA_Init(nthreads);
   /*dgels(&normal,&rows,&cols,&one,&atrans[0][0],&rows,&rhstrans[0][0],&rows,&work[0],&worksize,&inf
o); /*lapack arguments differ slightly*/
  info=PLASMA_dgels(PlasmaNoTrans,rows,cols,one,&atrans[0][0],rows,&work[0],&rhstrans[0][0],rows); /
*plasma scalars by value, return*/
  fprintf(stderr,"PLASMA_dgels: %d\n",info);
  if(info<0) fprintf (stderr,"QR argument %d bad\n",-info);
   fprintf(stderr,"Just after plasma omp_get_num_threads() yields: %d\n",omp_get_num_threads());


with the following unsatisfying result:

Just before plasma omp_get_num_threads() yields: 1
PLASMA_dgels: 0
Just after plasma omp_get_num_threads() yields: 1

Both are as the OpenMP documentation says they should be. The number of threads doesn't show up as higher until you're INSIDE one of the
#pragma parallel for
parts of the code, which is how I checked the value initially. When I dump the variable there, it shows up correctly as 24. omp_get_max_threads() also returns 24 after the plasma usage. Here's what the printout of those variables are:

OMP threads used for tpssurf: 24 of 24

Both PLASMA_Init() and PLASMA_Finalize() return 0. Is there any chance that the finalization stuff in plasma-installer_2.4.0/build/plasma_2.4.0/control/control.c isn't doing something that it should be doing? Or that there is some SUBTLE way in which the apparently correct OpenMP variables could be being ignored because of something else that plasma has done?

My OMP_NUM_THREADS variable is definitely NOT set (to 1 or anything else).

Thanks,

Stuart
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby ltaief » Tue Jul 05, 2011 1:58 pm

Stuart,
You mentioned in your first email that you

"built PLASMA 2.4.0 using the following options:
./setup.py --blaslib=/usr/lib64/atlas-sse2/libf77blas.so --cblaslib=/usr/lib64/atlas-sse2/libcblas.so \
--lapacklib=/usr/lib64/atlas-sse2/liblapack.so --lapclib=/usr/lib64/atlas-sse2/libclapack.so
As you can see, I'm using atlas-sse2 BLAS (single-threaded)"

My guess would be that you would have to build plasma with multi threaded atlas, set OMP_NUM_THREADS=1
whenever plasma is running... Then, once plasma completed, just unleash the horses
by setting OMP_NUM_THREADS to whatever is available out there ! ;)

Hope this helps?!

Regards,
hatem
ltaief
 
Posts: 3
Joined: Fri Dec 11, 2009 11:32 am

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby mateo70 » Tue Jul 05, 2011 2:23 pm

I seriously doubt about Hatem told you except if you are using atlas multi threaded somewhere else. And it doesn't look like you are using blas somewhere else than in PLASMA, am I wrong ?

So I cannot do it right now, but I will do an example on my own mixing plasma and openMP code and let you know if I have the same problem. ( will be later in the evening or tomorrow)

Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: PLASMA 2.4.0 and OpenMP compatibility issue?

Postby srdegraaf » Tue Jul 05, 2011 3:27 pm

Hatem,

Your suggestion sounded promising, and I tried it with enthusiasm! Unfortunately, it did not fix the problem. Things look the same as in the past:

Using 33177 of 36864 control points with quality 2.000000 to 0.122309
Setup and solve TPS equations
PLASMA_Init: 0
Just before plasma omp_get_num_threads() yields: 1
PLASMA_dgels: 0
Just after plasma omp_get_num_threads() yields: 1
PLASMA_Finalize: 0
Normalized bending energy: 1.245033e+00
Fit error energy: 2.813520e+01
Interpolate z surface using TPS splines
OMP threads used for tpssurf: 24 of 24

Despite the last "claim", the OMP part of the code is still running on ONE core/thread.

Thanks for your (continuing?) help!

Stuart
srdegraaf
 
Posts: 12
Joined: Sat Jul 02, 2011 12:00 am

Next

Return to User discussion

Who is online

Users browsing this forum: No registered users and 0 guests

cron