First, let me say that I greatly appreciate your efforts. Fast numerical linear algebra is the foundation upon which all modern digital signal processing is built! Please keep up the great work! I'm trying to use PLASMA 2.4.0 (rather than LAPACK) to evaluate thin-plate splines (TPS) in a digital elevation map extraction algorithm I've developed for synthetic aperture radar. I think I'm running into an apparent clash between PLASMA and OpenMP. I'm running on Fedora 14, using gcc 4.5.1.
There are two computationally intensive parts to my problem: 1) perform a QR decomposition (dgels or PLASMA_dgels) to solve for the TPS coefficients; 2) use the TPS coefficients to evaluate/interpolate a surface. Both parts scream out to be parallelized. The matrices involved are on the order of 32K x 32K (or larger), and I'm seeing a roughly 12x speedup of the first part using PLASMA_dgels() rather than dgels() on my 12-core Westmere workstation. This is wonderful! It works, and is accurate!
Unfortunately, what isn't so wonderful is that I use OpenMP parallelization for the second part of the problem. When I use LAPACK, the OpenMP part flies, i.e. I get the anticipated 12x speedup of the second part. When I use PLASMA, for some reason the program refuses to make use of more than one core! I can't figure out why, and would appreciate your help!
I built PLASMA 2.4.0 using the following options:
./setup.py --blaslib=/usr/lib64/atlas-sse2/libf77blas.so --cblaslib=/usr/lib64/atlas-sse2/libcblas.so \
--lapacklib=/usr/lib64/atlas-sse2/liblapack.so --lapclib=/usr/lib64/atlas-sse2/libclapack.so
As you can see, I'm using atlas-sse2 BLAS (single-threaded), so I do NOT use "export OMP_NUM_THREADS=1" (or the MKL or GOTO equivalents) during the build. In fact, these variables are not set. (Note: using --notesting doesn't prevent the installer from downloading its own version of the netlib lapack C interface; I guess it is still required to build liblapacke.a and/or libcoreblas.a.)
Thinking that PLASMA_dgels() had somehow (via something in the libraries) constrained my OpenMP number of threads to 1, I added the following statements to print and set the relevant number-of-threads variables in the OpenMP portion of the code:
- Code: Select all
void tpssurf(numpoints,nsampx,nsampy,nsampz,ctlpoints,tpsvec,rescelldted)
int numpoints,nsampx,nsampy,nsampz;
CTLPOINT *ctlpoints;
double *tpsvec;
double **rescelldted;
{
/*
Evaluate the thin-plate spline z surface function for all image sample locations.
*/
MUDS_DOUBLE x,y,dxi,dyi,r2i,basis;
int nx,ny,i,nthreads,maxthreads;
/*Earlier plasma usage somehow sets the number of OpenMP threads to one*/
/*Hopefully this overcomes the problem. IT DIDN"T; NOW WHAT?*/
maxthreads=omp_get_max_threads();
omp_set_num_threads(maxthreads);
#pragma omp parallel for \
private(nx,ny,i,x,y,dxi,dyi,r2i,basis,nthreads) \
shared(nsampx,nsampy,numpoints,tpsvec,rescelldted,maxthreads)
for(nx=0;nx<nsampx;nx++){
if(nx==0){
nthreads=omp_get_num_threads();
fprintf(stderr,"OMP threads used for tpssurf: %d of %d\n",nthreads,maxthreads);
}
x=(double)nx/(double)nsampx;
for(ny=0;ny<nsampy;ny++){
y=(double)ny/(double)nsampy;
rescelldted[nx][ny]=(double)0.;
/*Bending/perturbation/warping part*/
for(i=0;i<numpoints;i++){
dxi=x-ctlpoints[i].x;
dyi=y-ctlpoints[i].y;
r2i=dxi*dxi+dyi*dyi;
if(r2i==(double)0.) basis=(double)0.;
else basis=r2i*log(r2i);
rescelldted[nx][ny]+=tpsvec[i]*basis;
}
rescelldted[nx][ny]*=nsampz; /*don't want normalized units*/
/*Bi-linear (affine) part*/
rescelldted[nx][ny]+=(tpsvec[numpoints]+tpsvec[numpoints+1]*x+tpsvec[numpoints+2]*y)*nsampz; /*don't want normalized units*/
}
}
}
When I run the program, it prints out:
OMP threads used for tpssurf: 24 of 24
This would seem to be good. It's what I'm supposed to see! (The machine is hyperthreaded so I see 24, not the 12 actual cores; yes, I know I should divide by 2.) However, the load average (gkrellm/top) is only ~100%, not the ~1200% I've come to love, during the execution of this portion of the program. It is only using one core. What could/should I do to make it use all 12 again?
For completeness, below is the earlier part of the code that utilizes PLASMA, and my makefile. I believe that the MUDS/SPLASH and GSL FFT stuff is not relevant to the problem. The only difference between this and the code and makefile that behave correctly is the use of PLASMA (rather than LAPACK) for the first part of the algorithm.
- Code: Select all
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include "muds_types.h"
#include "muds_allocation.h"
#include <plasma.h>
#include <cblas.h>
#include <lapacke.h>
#include <core_blas.h>
void dsolveviaqr_plasma(nthreads,rows,cols,a,rhs,soln)
int nthreads,rows,cols;
double **a,*rhs,*soln;
{
/*
Interface to FORTRAN LAPACK complex QR decomposition. Assumes full-rank
matrix with rows>=columns.
*/
double **atrans,**rhstrans,*work;
/*char normal='N'; /*normal, i.e. no transpose*/
int i,j,info,one=1,worksize;
extern int PLASMA_dgels();
info=PLASMA_Init(nthreads);
fprintf(stderr,"PLASMA_Init: %d\n",info);
/*Allocate work space*/
matrix(atrans,double,cols,rows);
matrix(rhstrans,double,1,rows);
worksize=rows*cols; /*generous? not clear what opt block size is*/
vector(work,double,worksize);
/*PLASMA_Alloc_Workspace_dgels(rows,cols,work); /*don't use courtesy rtn*/
/*Copy A into transposed array for FORTRAN; QR destroys this array*/
for(i=0;i<rows;i++){
for(j=0;j<cols;j++){
atrans[j][i]=a[i][j];
}
}
/*Copy rhs into transposed array for FORTRAN; overwritten by solution*/
for(i=0;i<rows;i++){
rhstrans[0][i]=rhs[i];
}
/*Do it*/
/*dgels(&normal,&rows,&cols,&one,&atrans[0][0],&rows,&rhstrans[0][0],&rows,&work[0],&worksize,&info); /*lapack arguments differ slightly*/
info=PLASMA_dgels(PlasmaNoTrans,rows,cols,one,&atrans[0][0],rows,&work[0],&rhstrans[0][0],rows); /*plasma scalars by value, return*/
fprintf(stderr,"PLASMA_dgels: %d\n",info);
if(info<0) fprintf (stderr,"QR argument %d bad\n",-info);
/*Recover solution part, which is transposed*/
for(i=0;i<rows;i++){
soln[i]=rhstrans[0][i];
}
/*Deallocate work space*/
freematrix(atrans);
freematrix(rhstrans);
freevector(work);
/*free(work); /*don't use plasma courtesy rtn*/
info=PLASMA_Finalize();
fprintf(stderr,"PLASMA_Finalize: %d\n",info);
return;
}
- Code: Select all
MUDS = ../../muds
SPLASH = ../../splash
MUDS_CORE = $(MUDS)/muds_core
HEADERS = $(MUDS_CORE)/muds_types.h $(MUDS_CORE)/mudsio_prototypes.h $(MUDS_CORE)/muds_allocation.h $(MUDS_CORE)/muds_complex.h
ARCH = lnx64
BIN = ../bin$(ARCH)
PLASMA = /home/work/plasma-installer_2.4.0/install
INCLUDES = -I. -I$(MUDS_CORE)
LIBMUDS = $(MUDS)/obj$(ARCH)/libmuds.a
LIBMUDSMATH = $(MUDS)/obj$(ARCH)/libmudsmath.a
LIBMUDSMATHPLASMA = $(MUDS)/obj$(ARCH)/dqroverdet_plasma.o
LIBTOEP = $(SPLASH)/obj$(ARCH)/dtoep.o
LIBFILT = $(SPLASH)/obj$(ARCH)/dwindowstuff.o $(SPLASH)/obj$(ARCH)/dppfilt.o $(SPLASH)/obj$(ARCH)/resample.o
LIBFFT = $(SPLASH)/obj$(ARCH)/mtdfft2d.o $(SPLASH)/obj$(ARCH)/mtdfft3d.o $(SPLASH)/obj$(ARCH)/mtdfft2dwrapper.o \
$(SPLASH)/obj$(ARCH)/mtdfft3dwrapper.o $(SPLASH)/obj$(ARCH)/dfft2d.o $(SPLASH)/obj$(ARCH)/dfft3d.o $(SPLASH)/obj$(ARCH)/dfft1d.o
LIBGSL = -L/usr/lib64 -lgsl -L/usr/lib64/atlas -lcblas
LIBLAPACK = -L/usr/lib/atlas-sse2 -llapack -lcblas
LIBPLASMA = $(PLASMA)/lib/libplasma.a $(PLASMA)/lib/libcoreblas.a $(PLASMA)/lib/liblapacke.a $(PLASMA)/lib/libquark.a
LIBMATHEMATICS = $(LIBMUDSMATH) $(LIBFFT) $(LIBGSL) $(LIBLAPACK) $(LIBTOEP) $(LIBFILT) $(LIBMUDSMATHPLASMA) $(LIBPLASMA) -lc -lm
BASIC = $(LIBMUDS) $(LIBMATHEMATICS)
DEPMATHEMATICS = $(LIBMUDSMATH) $(LIBFFT) $(LIBTOEP) $(LIBFILT) $(LIBMUDSMATHPLASMA)
DEPBASIC = $(LIBMUDS) $(DEPMATHEMATICS)
CC = gcc -m64
IFLAGS = $(INCLUDES)
DFLAGS = -g -o
LFLAGS = -funroll-all-loops -O3 -o
CFLAGS = -funroll-all-loops -O3 -c
all : \
mt_inferdtedp
mt_inferdtedp : mt_inferdtedp.c $(HEADERS) $(DEPBASIC)
$(CC) $(IFLAGS) $(LFLAGS) $(BIN)/mt_inferdtedp mt_inferdtedp.c $(BASIC) -fopenmp
strip $(BIN)/mt_inferdtedp
clean :
rm *.o
Any idea what could be going wrong? I have no clue, and would appreciate any help you can provide!
By the way, everything behaves exactly the same with PLASMA 2.3.1.
Thanks,
Stuart DeGraaf
