Magma 1.0/RC2 and Matlab

Open discussion for MAGMA

Postby Boxed Cylon » Tue Dec 14, 2010 4:20 am

I've been able to compile and run the tests for the new RC2 release on my openSUSE 11.1 machine. I had to modify the make.inc file to read as follows (perhaps with more standard definitions than in the tarball of the distribution):
Code: Select all
#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 1.0) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of Colorado, Denver
#      November 2010
#//////////////////////////////////////////////////////////////////////////////

#
# GPU_TARGET specifies for which GPU you want to compile MAGMA
#      0: Tesla family
#      1: Fermi Family
#
GPU_TARGET = 1

CC        = gcc
NVCC      = nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

OPTS      = -O3 -DADD_ -fPIC
NVOPTS    = --compiler-options -fno-strict-aliasing -DUNIX -O3 -DADD_ -Xcompiler "-fPIC -D_GNU_SOURCE -pthread -fexceptions -m64"

LDOPTS    = -fPIC -z muldefs

LIB       = -lmkl_gf_lp64 -lmkl_intel_thread -lmkl_core -lguide -lpthread -lcublas -lm

CUDADIR   = /usr/local/cuda

LIBDIR    = -L/opt/intel/Compiler/11.0/074/mkl/lib/em64t/  \
            -L$(CUDADIR)/lib64
INC       = -I$(CUDADIR)/include

LIBMAGMA     = ../lib/libmagma.a
LIBMAGMABLAS = ../lib/libmagmablas.a


As you can see, this uses the Intel MKL.

./testing_sgemm gives:

Code: Select all
Usage:
  testing_sgemm [-NN|NT|TN|TT] [-N 1024]

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.7 MB memory
device 1: GeForce 8400 GS, 1400.0 MHz clock, 511.7 MB memory

Testing TRANSA = N  TRANSB = N
    N     MAGMA GFLop/s    CUBLAS GFlop/s       error
========================================================
 1024       675.10           635.91         0.000000e+00
 2048       774.11           765.39         0.000000e+00
 3072       837.44           831.38         0.000000e+00
 4096       831.06           802.14         0.000000e+00
 5120       827.31           822.87         0.000000e+00
 6144       847.68           840.57         0.000000e+00
 7168       843.59           820.54         0.000000e+00
 8192       837.13           833.47         0.000000e+00



The various -fPIC flags and the -Xcompiler "-fPIC -D_GNU_SOURCE -pthread -fexceptions -m64" option are there so that I could compile a small test code for sgesv as a matlab mex file. The code is:

Code: Select all
#include "mex.h"
#include "cuda.h"
#include "cublas.h"
#include "magma.h"

#include "sys/time.h"

void mexFunction( int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])

{
      int I,L;
      int Ic,Lc;
      int dims0[2];

      // INPUT VARIABLES   %%%%%%%%%%%%%%%%%%%%%%%%%
      // A is dimensioned LXL
      // B is dimensioned LXI
      float *A,*B;
 
      // OUTPUT VARIABLE, X=A\B   %%%%%%%%%%%%%%%%%%
      float *X;

      // CUDA/GPU VARIABLES %%%%%%%%%%%%%%%%%%%%%%%%
      float *ga, *gb;
      int  *ipiv;
      int info;

      if (nrhs != 2) {
         mexErrMsgTxt("gpu_sgesv_magma requires 2 input arguments");
      } else if (nlhs != 1) {
         mexErrMsgTxt("gpu_sgesv_magma requires 1 output argument");
      }

      if ( !mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]) ) {
           mexErrMsgTxt("Input arrays must be single precision.");
      }
 
// %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Single-precision input arrays */
// Dimensions, and then array data
      L = mxGetN(prhs[0]);
      I = mxGetN(prhs[1]);
      printf("L = %i\n",L);
      printf("I = %i\n",I);
      A =   (float*) mxGetData(prhs[0]);
      B =   (float*) mxGetData(prhs[1]);

// Left hand side matrix set up    (the solution) 
      dims0[0]=L;
      dims0[1]=I;
      plhs[0] = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
      X = (float*) mxGetData(plhs[0]);

      // Make modulo 96 dimensions  - speeds up the sgemm calculations significantly
      // Just used as an example here.
     //  Ic=I+(96-I%96);
      // Lc=L+(96-L%96);
       Ic=I;
       Lc=L;

     // cuInit( 0 );
      cublasInit();

      cublasAlloc (Lc*Lc, sizeof(float), (void**)&ga);
      cudaMemset(ga,0,Lc*Lc*sizeof(float));  /* zero these since we've padded them */
      /* BUG: this copies Lc x Lc elements from A, which holds only L x L,
         so it over-reads the host buffer; the corrected version later in
         this thread copies L x L instead. */
      cublasSetMatrix (Lc, Lc, sizeof(float), A, L, ga, Lc);

      cublasAlloc (L*I, sizeof(float), (void**)&gb);
      cudaMemset(gb,0,L*I*sizeof(float));
      /* BUG: gb was allocated with leading dimension L, but Lc is passed
         as its leading dimension here and to magma_sgesv_gpu below. */
      cublasSetMatrix (L, I, sizeof(float), B, L, (void*)gb, Lc);
     
      printf("Set A,B\n");

    // Allocate for ipiv - a working matrix used by sgesv, and ignored here.
    //  ipiv = ( int *) malloc ( sizeof (int) * L ) ;
      ipiv = ( int *) mxCalloc (L,sizeof (int));

      printf("%i, %i\n",L,Lc);

    // Ready to go...
    // First numbers L, I pertain only to the non-padded sections of the arrays.

      printf("Ready for sgesv...\n");

      magma_sgesv_gpu( L,   I,  ga,  Lc,  ipiv,  gb,  Lc, &info);

      printf("Done with sgesvs.\n");

    // Get the solution off the GPU
      cublasGetMatrix (L, I, sizeof(float), gb, Lc, X, L);
    // X has the solution we need; now back to matlab after a bit of clean up.

    // Print the first three elements of the first row (debugging)
      printf("X-top = %e %e %e\n",X[0],X[L],X[L+L]); 
    // Print the  last three elements of the  last row (debugging)
      printf("X-bottom = %e %e %e\n",X[L*(I-2)-1],X[L*(I-1)-1],X[L*I-1]); 

    // Clear the variables to avoid GPU memory leak (and GPU crash!)
       mxFree(ipiv);
       cublasFree (ga);
       cublasFree (gb);
       cublasShutdown(); 

}


The idea is that with this mex file compiled, a simple call "[X]= gpu_sgesv_magma(A,B);" in matlab will give the solution X=A\B calculated on the GPU.

While this compiles OK, alas it causes matlab to crash during the call to the magma_sgesv_gpu routine. I've not been able to get it to work, despite trying various MKL/BLAS combinations. I suspect that the MKL or BLAS calls are the problem - stepping on matlab's own routines or memory space - but I really don't know. It would be nice to get MAGMA going in matlab, so consider this some feedback and a feature request. (The same set of procedures worked fine with the more primitive sgemm routine.)

The relevant lines in my Makefile for the matlab routine are:

Code: Select all
BLASHOME    = /opt/intel/Compiler/11.0/074/mkl/lib/em64t/

INCLUDELIB  = -L$(CUDAHOME) -L$(MAGMAHOME) -L$(BLASHOME)  -lmkl_gf_lp64 -lmkl_intel_thread -lmkl_core -lguide -lmagma -lmagmablas -lcudart -lcuda -lcublas -Wl,-rpath,$(CUDAHOME)



then with
Code: Select all
export LD_LIBRARY_PATH="/opt/intel/Compiler/11.0/074/mkl/lib/em64t/":$LD_LIBRARY_PATH

set before starting matlab.

Thanks, and thanks for the new release!
Boxed Cylon
 
Posts: 27
Joined: Sat Nov 21, 2009 6:03 pm

Re: Magma 1.0/RC2 and Matlab

Postby Boxed Cylon » Tue Dec 14, 2010 9:23 am

Tracing back into the code to find the location of the crash, I've identified this line:

lapackf77_sgetrf( &rows, &nb, work, &lddwork, ipiv+i*nb, &iinfo);

of src/sgetrf_gpu.cpp (near line 275) as the source of the crash. I think this is the first time a lapackf77 routine is called.

Google searches suggest that calling lapack or blas routines from matlab mex files is tricky - no luck yet.

Re: Magma 1.0/RC2 and Matlab

Postby Boxed Cylon » Thu Dec 16, 2010 5:31 am

Ahhhh, I love the smell of victory... I've got the magma sgesv working in matlab now.

The trick seems to be to statically link the blas/lapack libraries. The line in the Makefile that worked for me is:

Code: Select all
INCLUDELIB  = -L$(CUDAHOME) -L$(MAGMAHOME) -L$(BLASHOME) -lmagma -lmagmablas /opt/intel/Compiler/11.0/074/mkl/lib/em64t/libmkl_lapack.a  /opt/intel/Compiler/11.0/074/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/Compiler/11.0/074/mkl/lib/em64t/libmkl_intel_thread.a /opt/intel/Compiler/11.0/074/mkl/lib/em64t/libmkl_core.a -lguide  -lcuda -lcublas -Wl,-rpath,$(CUDAHOME)


AND THE ORDER OF THE LIBRARIES IS IMPORTANT: -lmagma and -lmagmablas have to come first, it seems. (This is standard static-linking behavior: the linker scans archives left to right and extracts only symbols that are still unresolved, so the BLAS/LAPACK archives must come after the libraries that reference them.)

I also needed to find a libguide.so that would work. I set my LD_LIBRARY_PATH to aim at a version of this library from an older version of matlab. Presumably with a more generic blas/lapack than the intel MKL version, this would not be needed.

It's also helpful to use a bug-free program (I call it gpu_sgesv_magma.cu) - not guaranteed bug-free, but at least it works, with the bonus that the answers returned seem to be correct:

Code: Select all
#include "mex.h"
#include "cuda.h"
#include "cublas.h"
#include "magma.h"

#include "sys/time.h"

// magma_sgesv_gpu is declared in magma.h, so no local prototype is needed.

void mexFunction( int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])

{
      int I,L;
      int Ic,Lc;
      int dims0[2];

      // INPUT VARIABLES   %%%%%%%%%%%%%%%%%%%%%%%%%
      // A is dimensioned LXL
      // B is dimensioned LXI
      float *A,*B;
 
      // OUTPUT VARIABLE, X=A\B   %%%%%%%%%%%%%%%%%%
      float *X;

      // CUDA/GPU VARIABLES %%%%%%%%%%%%%%%%%%%%%%%%
      float *ga, *gb;
       int  *ipiv;
       int info;

      if (nrhs != 2) {
         mexErrMsgTxt("gpu_sgesv_magma requires 2 input arguments");
      } else if (nlhs != 1) {
         mexErrMsgTxt("gpu_sgesv_magma requires 1 output argument");
      }

      if ( !mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]) ) {
           mexErrMsgTxt("Input arrays must be single precision.");
      }
 
// %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Single-precision input arrays */
// Dimensions, and then array data
      L = mxGetN(prhs[0]);
      I = mxGetN(prhs[1]);

      A =   (float*) mxGetData(prhs[0]);
      B =   (float*) mxGetData(prhs[1]);

// Left hand side matrix set up    (the solution) 
      dims0[0]=L;
      dims0[1]=I;
      plhs[0] = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
      X = (float*) mxGetData(plhs[0]);

      // from magma testing sgesv_gpu...
       Lc = ((L+31)/32)*32;       

      cublasInit();

      cublasAlloc (Lc*Lc, sizeof(float), (void**)&ga);
      cudaMemset(ga,0,Lc*Lc*4);  /* zero these since we've padded them */
      cublasSetMatrix (L, L, sizeof(float), A, L, (void*)ga, Lc);

      cublasAlloc (Lc*I, sizeof(float), (void**)&gb);
      cudaMemset(gb,0,Lc*I*4);
      cublasSetMatrix (L, I, sizeof(float), B, L, (void*)gb, Lc);
     

    // Allocate for ipiv - a working matrix used by sgesv, and ignored here.
      ipiv = ( int *) mxCalloc (L,sizeof (int));

    // Ready to go...
    // First numbers L, I pertain only to the non-padded sections of the arrays.

      magma_sgesv_gpu( L,   I,  ga,  Lc,  ipiv,  gb,  Lc, &info);

     if (info < 0)
            printf("Argument %d of magma_sgesv had an illegal value.\n", -info);

    // Get the solution off the GPU
      cublasGetMatrix (L, I, sizeof(float), gb, Lc, X, L);
    // X has the solution we need; now back to matlab after a bit of clean up.

    // Print the first three elements of the first row (debugging)
      printf("X-top = %e %e %e\n",X[0],X[L],X[L+L]); 
    // Print the  last three elements of the  last row (debugging)
      printf("X-bottom = %e %e %e\n",X[L*(I-2)-1],X[L*(I-1)-1],X[L*I-1]); 

    // Clear the variables to avoid GPU memory leak (and GPU crash!)
       mxFree(ipiv);
       cublasFree (ga);
       cublasFree (gb);
       cublasShutdown(); 

}


I've compared this routine with an equivalent routine that calls CULA's version of sgesv - the matlab script runs sgesv over a set of matrices sized N x 5000, where N runs from 10 to 5000. CULA's version runs just a shade faster: for the largest matrices, MAGMA gives me a 17.31-fold speedup over my (single) CPU, whereas CULA gives me a 17.92-fold speedup.

Also:
Code: Select all
INCLUDELIB  = -L$(CUDAHOME) -L$(MAGMAHOME) -L$(BLASHOME) -lmagma -lmagmablas  /usr/lib64/liblapack_pic.a /usr/lib64/libblas_pic.a -lgfortran -lcuda -lcublas -Wl,-rpath,$(CUDAHOME)

works for me - perhaps a more convenient alternative to the Intel libraries.

Or another alternative:
Code: Select all
INCLUDELIB  = -L$(CUDAHOME) -L$(MAGMAHOME) -L$(BLASHOME) -lmagma -lmagmablas  /opt/acml4.4.0/gfortran64/lib/libacml.a /opt/acml4.4.0/gfortran64/lib/libacml_mv.a -lgfortran -lcuda -lcublas -Wl,-rpath,$(CUDAHOME)

which works well and is faster - the precise speedup depends quite a bit on the BLAS/LAPACK used! The initial Intel libraries above seem to be the fastest.

Re: Magma 1.0/RC2 and Matlab

Postby Boxed Cylon » Tue Dec 28, 2010 2:11 am

I was able to get magma routines to compile and run with Intel's MKL version 10.2.5.035, but a curious trick was needed. The three libraries required are libmkl_intel_lp64.a, libmkl_intel_thread.a, and libmkl_core.a. I could not get my matlab routine to run properly with these: it would compile, but then crash with unresolved symbols. I tried various orderings of the three, all to no avail. It seems likely that these libraries have circular dependencies - that is, this one depends on that one, while that one depends on this one. The trick was to make a new file in the Intel library directory (/opt/intel/mkl/10.2.5.035/lib/em64t/ in my case) called "libmkl_em64t.a", although the precise name is likely irrelevant so long as it starts with "lib" and ends with ".a". The file consists of the single line:

Code: Select all
GROUP (libmkl_intel_lp64.a libmkl_intel_thread.a libmkl_core.a)


which, when linked as $(BLASHOME)/libmkl_em64t.a, apparently causes the linker to make several passes through this list and resolve the circular dependencies. No more unresolved symbols, yay!
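
If creating an extra file in the MKL directory is inconvenient, GNU ld's grouping flags achieve the same repeated-scan behavior directly on the link line. A sketch of the equivalent Makefile line, assuming GNU ld:

```make
INCLUDELIB  = -L$(CUDAHOME) -L$(MAGMAHOME) -lmagma -lmagmablas \
              -Wl,--start-group \
              $(BLASHOME)/libmkl_intel_lp64.a \
              $(BLASHOME)/libmkl_intel_thread.a \
              $(BLASHOME)/libmkl_core.a \
              -Wl,--end-group \
              -liomp5 -lcuda -lcublas -lcudart -Wl,-rpath,$(CUDAHOME)
```

Everything between --start-group and --end-group is re-scanned until no new undefined symbols can be resolved, which handles the circular dependency without the helper archive.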

Here are the relevant lines in my make file, should it prove useful to anyone:

Code: Select all
BLASHOME    = /opt/intel/mkl/10.2.5.035/lib/em64t
INCLUDELIB  = -L$(CUDAHOME) -L$(MAGMAHOME) -L$(BLASHOME) -lmagma -lmagmablas $(BLASHOME)/libmkl_em64t.a  -liomp5  -lcuda -lcublas -lcudart -Wl,-rpath,$(CUDAHOME)

Re: Magma 1.0/RC2 and Matlab

Postby Boxed Cylon » Wed Dec 29, 2010 7:51 am

I've been learning some things about matlab, external routines and OMP this evening.

On the mail list announcing Magma 1.0 ( http://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=100 ), I noted that the performance of MAGMA is greatly influenced by the version of BLAS employed, for array dimensions larger than 1000 or so. I found GotoBLAS2 to be the fastest of all by 5-10%.

When one builds a mex file (such as those listed above in this thread) that calls an external BLAS (from somewhere within a called MAGMA routine), the CPU parallelization of that call has to be configured before matlab is started. That is, the environment variable OMP_NUM_THREADS (set with "export OMP_NUM_THREADS=4", say) must be set before matlab is launched. Matlab normally sets the number of processors for OMP using the utility maxNumCompThreads (called within matlab as ">> maxNumCompThreads(4)", say), but this does not work for external BLAS libraries. So, to squeeze another 10+% of performance out of a matlab mex file that calls a MAGMA routine backed by a parallelized BLAS, set the OMP_NUM_THREADS variable before starting matlab. This allows the BLAS employed by MAGMA to put the host's multiple processors to optimal use.

This also means that the external BLAS can run in SMP mode, while matlab runs in single processor mode ( maxNumCompThreads(1) ).
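
The resulting launch sequence might look like this (a sketch; the thread count and library path are examples taken from earlier in this thread, not prescriptions):

```shell
# Thread count for the external BLAS that MAGMA calls on the host.
# maxNumCompThreads inside matlab does not reach these libraries,
# so this must be set before matlab starts.
export OMP_NUM_THREADS=4

# Make the MKL shared libraries findable (path from the posts above).
export LD_LIBRARY_PATH="/opt/intel/Compiler/11.0/074/mkl/lib/em64t/":$LD_LIBRARY_PATH

# Start matlab from this shell so it inherits both settings:
# matlab -nodesktop
```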

Another quirk is that sometimes the matlab libraries - those that came with the matlab installation - are incompatible with the equivalent libraries associated with the compiler used to compile the mex file; examples may be libgfortran and libgomp. To ensure that the compatible libraries are used, rather than matlab's, use the environment variable LD_PRELOAD, e.g.:
Code: Select all
export LD_PRELOAD=/usr/lib64/libgfortran.so.1:/usr/lib64/libgomp.so.1

before starting matlab. With this, the mex routine will use the proper libraries, and the OMP parallelization may work properly (if it didn't before). I needed to do this on an Ubuntu distribution (for a non-MAGMA/CUDA application), but not on a Suse 11.1 distribution. I could not begin to explain that...

POSTSCRIPT: If you want to use GotoBLAS2 as the default BLAS for matlab, rather than the MKL that comes with matlab, you need to set the LD_PRELOAD variable as above and also set the usual variable, e.g.,
Code: Select all
export BLAS_VERSION=/PATH/TO/GotoBLAS2/libgoto2_barcelonap-r1.13.so

With these shell variables set, matlab should use GotoBLAS2 and perform calculations on matrices and vectors just a little bit faster.
Last edited by Boxed Cylon on Wed Dec 29, 2010 10:29 pm, edited 1 time in total.

Re: Magma 1.0/RC2 and Matlab

Postby fletchjp » Wed Dec 29, 2010 11:48 am

Boxed Cylon wrote:I've been learning some things about matlab, external routines and OMP this evening.

(large portion cut)

Another quirk is that sometimes the matlab libraries - those that came with the matlab installation - are incompatible with the
equivalent libraries associated with the compiler used to compile the mex file. Examples may be libgfortran and libgomp. To ensure that the compatible libraries are used, rather than matlab's libraries, use the environment variable LD_PRELOAD, e.g.,:
Code: Select all
export LD_PRELOAD=/usr/lib64/libgfortran.so.1:/usr/lib64/libgomp.so.1

before starting matlab. With this, the mex routine will use the proper libraries, and the OMP parallelization may work properly (if it didn't before). I needed to do this with an Ubuntu distribution (for a non-MAGMA/CUDA application), but not for a Suse 11.1 distribution. I could not begin to explain that...


I may be able to shed some light on this for you. I discovered recently that 64-bit Ubuntu (which follows Debian policy) has a different arrangement of 32-bit and 64-bit libraries from the one followed by Fedora and Suse (with which I have been more familiar). When I started to use Ubuntu I found some things stopped working. The reason is this.

Fedora and Suse: 32-bit libraries in the folder lib, with 64-bit libraries in lib64.
Ubuntu and Debian: 32-bit libraries in the folder lib32, with 64-bit libraries in lib, and lib64 as a link to lib.

There are more details on this here: http://c2.com/cgi/wiki?UbuntuLinux

I hope this helps.

John
fletchjp
 
Posts: 170
Joined: Mon Dec 27, 2010 7:29 pm

Re: Magma 1.0/RC2 and Matlab

Postby lascott » Thu Mar 29, 2012 6:13 am

Wow. Thanks for putting in the effort and leaving your notes here. Two comments:

1) Have you tried the dll approach rather than the older mex?
http://www.mathworks.co.uk/help/techdoc ... 43202.html

2) For the moderators: Did you know that using the forum search for the word matlab claims that the word is too common to search?
lascott
 
Posts: 2
Joined: Thu Mar 29, 2012 4:43 am

Re: Magma 1.0/RC2 and Matlab

Postby fletchjp » Wed May 02, 2012 6:24 am

Sorry, I don't use matlab. The library issues came up in a different context.

John

