ATLAS, BLAS, GotoBLAS No Speed Ups

Postby nunoxic » Thu Jun 16, 2011 9:35 am

I have GotoBLAS2 installed, as well as Intel Fortran with MKL.
I am trying to speed up a code that consists mostly of matrix-vector multiplications and dot products.

As a test, I wrote a simple program that multiplies a random, dense (non-zero, non-sparse) 15000x15000 matrix by a 15000x1 vector:
Code:
call system_clock(t1)
b = matmul(a,x)                 ! Fortran intrinsic
call system_clock(t2)
print*,"MATMUL  :",(t2-t1)      ! raw clock ticks; divide by count_rate for seconds

call system_clock(t1)
b = matmult(n,a,x)              ! OpenMP workshare version (below)
call system_clock(t2)
print*,"MATMULT :",(t2-t1)

call system_clock(t1)
! beta = 0 so that dgemv overwrites b rather than accumulating into it
call dgemv ( 'N', n, n, 1.0D+00, a, n, x, 1, 0.0D+00, b, 1 )
call system_clock(t2)
print*,"DGEMV   :",(t2-t1)


where
Code:
FUNCTION MATMULT(dim,a,x)
  INTEGER, INTENT(IN) :: dim
  double precision, DIMENSION(dim,dim), INTENT(IN) :: a
  double precision, DIMENSION(dim), INTENT(IN) :: x
  double precision, DIMENSION(dim) :: MATMULT

!$omp parallel
!$omp workshare
  MATMULT(1:dim) = matmul( a(1:dim,1:dim), x(1:dim) )
!$omp end workshare
!$omp end parallel

END FUNCTION MATMULT
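
(For reference, a minimal self-contained driver along these lines reproduces the test. This is a sketch, not the exact file I ran: the random fill and the conversion of system_clock ticks to seconds via count_rate are additions.)
Code:
program bench_gemv
  implicit none
  integer, parameter :: n = 15000
  double precision, allocatable :: a(:,:), x(:), b(:)
  integer :: t1, t2, rate

  allocate(a(n,n), x(n), b(n))
  call random_number(a)                ! dense random matrix, entries in (0,1)
  call random_number(x)

  call system_clock(count_rate=rate)   ! clock ticks per second

  call system_clock(t1)
  b = matmul(a, x)                     ! Fortran intrinsic
  call system_clock(t2)
  print *, "MATMUL :", dble(t2 - t1) / dble(rate), "s"

  call system_clock(t1)
  ! beta = 0 so dgemv overwrites b instead of accumulating into it
  call dgemv('N', n, n, 1.0d0, a, n, x, 1, 0.0d0, b, 1)
  call system_clock(t2)
  print *, "DGEMV  :", dble(t2 - t1) / dble(rate), "s"

  deallocate(a, x, b)
end program bench_gemv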



I am compiling and running as follows (without changing the code at all):
Code:
gfortran matrix_multi.f90 -lblas && ./a.out             # reference BLAS, no OpenMP parallelization
gfortran matrix_multi.f90 -lblas -fopenmp && ./a.out    # reference BLAS + OpenMP
gfortran matrix_multi.f90 -lf77blas -latlas && ./a.out  # ATLAS
gfortran matrix_multi.f90 -lgoto2 -lpthread && ./a.out  # GotoBLAS2
gfortran matrix_multi.f90 -L$MKLROOT/lib/ia32 -lmkl_blas95 -Wl,--start-group -lmkl_gf -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -m32 -fopenmp && ./a.out   # MKL, from the link-line advisor

But for some reason, the intrinsic MATMUL is almost always the fastest, and even when it isn't, it is very close to the fastest version.
Only with the MKL build is DGEMV consistently faster than the others.

Am I doing something wrong? I am completely confused.
I don't know much about installing these libraries; I was barely able to install the packages from Synaptic and link them.
I am a complete n00b.

I have already read the related threads, including the one comparing LAPACK with Mathematica and MATLAB, but none of them answers my question.


Thanks
nunoxic
 
Posts: 2
Joined: Thu Jun 16, 2011 8:41 am

Re: ATLAS, BLAS, GotoBLAS No Speed Ups

Postby admin » Thu Jun 16, 2011 12:38 pm

Hi,
How many cores does your machine have?
Did you set OMP_NUM_THREADS, MKL_NUM_THREADS, and GOTO_NUM_THREADS?
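
(As a quick check that the OpenMP runtime actually sees your setting, something like this sketch works; omp_get_max_threads and omp_get_num_threads come from the standard omp_lib module:)
Code:
program check_threads
  use omp_lib          ! standard OpenMP module
  implicit none
  ! Should print the value of OMP_NUM_THREADS (or the core count by default)
  print *, "max threads:", omp_get_max_threads()
!$omp parallel
!$omp single
  print *, "threads in this parallel region:", omp_get_num_threads()
!$omp end single
!$omp end parallel
end program check_threads

Compile with gfortran -fopenmp; if it does not print 4, the environment variable is not reaching your program.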
admin
Site Admin
 
Posts: 499
Joined: Wed Dec 08, 2004 7:07 pm

Re: ATLAS, BLAS, GotoBLAS No Speed Ups

Postby nunoxic » Fri Jun 17, 2011 2:14 am

admin wrote: Hi, how many cores do you have on your machine?

I have a Core2 Quad: 4 cores, 4 threads.
I am running Ubuntu on a machine with 8 GB of RAM, but I (unknowingly) installed 32-bit Ubuntu, so only 3 GB are visible. That is not a problem here, though, since the matrix fits in RAM.

admin wrote: Did you set OMP_NUM_THREADS, MKL_NUM_THREADS and GOTO_NUM_THREADS?

I had set
export OMP_NUM_THREADS=4
before running, but not the other two, and then compiled with gfortran matrix_multi.f90 -lblas -fopenmp.

After reading your suggestion, I used
Code:
export MKL_NUM_THREADS=4
gfortran matrix_multi.f90 -L$MKLROOT/lib/ia32 -lmkl_blas95 -Wl,--start-group -lmkl_gf -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -lpthread && ./a.out

and
Code:
export GOTO_NUM_THREADS=4
gfortran matrix_multi.f90 -lgoto2 -lpthread && ./a.out

The problem persists: DGEMV is barely better than the intrinsic MATMUL.

Also, in the system monitor only one CPU was at 100%; the rest stayed at about 30% at most, if that helps.
I also tried 2 and 8 threads instead of 4, with no improvement.

Thanks
nunoxic
 
Posts: 2
Joined: Thu Jun 16, 2011 8:41 am

Re: ATLAS, BLAS, GotoBLAS No Speed Ups

Postby Julien Langou » Fri Jun 17, 2011 12:09 pm

nunoxic wrote: But for some reason, the intrinsic MATMUL is almost always the fastest, and even when it isn't, it is very close to the fastest version. Only with the MKL build is DGEMV consistently faster than the others. Am I doing something wrong? I am completely confused.


Your results make sense to me. Your parallelization of GEMV looks fine, and there is not much more you can do.
GEMV moves n^2 data for only 2n^2 floating-point operations, so the operation is essentially limited by memory bandwidth, and using more cores does not give you more bandwidth. It is hard to do better than what OpenMP already does.
For n = 15000, the matrix alone is 15000^2 x 8 bytes, about 1.8 GB per multiply, so at a typical few GB/s of sustained bandwidth a single GEMV takes a few tenths of a second no matter how many cores you use.
Given your measured time (in seconds), you can compute the effective bandwidth you are achieving as
( n^2 * 8 * 1e-6 ) / time   (in MB/sec)
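
(A small helper along these lines would do the arithmetic; the names are mine, and time_sec is assumed to be the measured time already converted to seconds:)
Code:
! Effective bandwidth of one n-by-n GEMV, given the measured time in seconds.
! The matrix traffic (n*n doubles, 8 bytes each) dominates; vector traffic is negligible.
function effective_bw_mb(n, time_sec) result(bw)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: time_sec
  double precision :: bw   ! MB/s
  bw = (dble(n) * dble(n) * 8.0d0 * 1.0d-6) / time_sec
end function effective_bw_mb

For example, n = 15000 and a time of 0.5 s gives 1800 MB / 0.5 s = 3600 MB/s.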

Julien.
Julien Langou
 
Posts: 734
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

