
Why "magma_dsyevd" performs better with parallel MKL?

PostPosted: Thu Jun 30, 2011 9:05 am
by xinwu
Hi, everyone!

I succeeded in compiling MAGMA. While testing, though, I noticed that the parallel-linked "testing_dsyevd" binary is faster than the sequentially linked one, even though the run is on the GPU. Why is that? Does "magma_dsyevd" run part of the computation on the CPU?

Code: Select all
#
# this is a sequential linked binary
#
./testing_dsyevd -N 4000
device 0: Tesla C2070, 1147.0 MHz clock, 5375.2 MB memory
  testing_dsyevd -N 4000



  N     CPU Time(s)    GPU Time(s)     ||R||_F / ||A||_F
==========================================================
 4000      29.51          11.62         4.113991e-16 2.838989e-13
#
# this is a parallel linked binary
#
./testing_dsyevd -N 4000
device 0: Tesla C2070, 1147.0 MHz clock, 5375.2 MB memory
  testing_dsyevd -N 4000



  N     CPU Time(s)    GPU Time(s)     ||R||_F / ||A||_F
==========================================================
 4000       9.60           7.45         2.607371e-16 4.292615e-13



The parallel link line was:
Code: Select all
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread


The sequential link line was:
Code: Select all
-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
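
For reference, a quick way to check which MKL threading layer a binary actually picked up is to query MKL at runtime. This is just a minimal sketch (assuming mkl.h from the same MKL installation); with -lmkl_sequential it reports 1 thread, with -lmkl_intel_thread it reports the number of cores (or whatever MKL_NUM_THREADS / OMP_NUM_THREADS is set to):

Code: Select all
/* check_mkl_threads.c -- minimal sketch: query the MKL threading layer */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* number of threads MKL will use for the CPU portions of the run */
    printf("MKL will use up to %d thread(s)\n", mkl_get_max_threads());

    /* optionally pin the thread count for a fair comparison between
       the two link lines (a no-op with the sequential layer) */
    mkl_set_num_threads(4);
    printf("after mkl_set_num_threads(4): %d thread(s)\n",
           mkl_get_max_threads());
    return 0;
}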

Re: Why "magma_dsyevd" performs better with parallel MKL?

PostPosted: Thu Jun 30, 2011 9:55 am
by xinwu
I took a look at the source code, and I now understand: "magma_dsyevd" is a hybrid routine that runs on both the CPU and the GPU, so the linker options affect its performance.

Re: Why "magma_dsyevd" performs better with parallel MKL?

PostPosted: Mon Jul 04, 2011 3:07 pm
by Stan Tomov
Hi,
Actually, most of the MAGMA algorithms are hybrid.
In particular, for the dsyevd algorithm, the most time-consuming part is the reduction to tridiagonal form (dsytrd). dsytrd becomes memory bound for large matrices (e.g., above ~2048), so MAGMA's dsytrd calls the CPU dsytrd (e.g., from MKL) for small matrices and switches to the hybrid code for larger ones.
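The idea, as a very rough sketch (not the actual MAGMA source; hybrid_dsytrd below is a hypothetical placeholder for the GPU-assisted reduction, MKL's LAPACKE interface is assumed to be available, and ~2048 is the approximate crossover mentioned above):

Code: Select all
#include <lapacke.h>

enum { CROSSOVER_N = 2048 };   /* approximate switch point */

/* hypothetical placeholder for the GPU-assisted (hybrid) reduction */
int hybrid_dsytrd(int n, double *a, int lda,
                  double *d, double *e, double *tau);

/* reduce a symmetric matrix to tridiagonal form, picking the code path
   by problem size */
int reduce_to_tridiagonal(int n, double *a, int lda,
                          double *d, double *e, double *tau)
{
    if (n < CROSSOVER_N) {
        /* small matrix: call the CPU dsytrd (e.g., from MKL) directly */
        return (int) LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L',
                                    n, a, lda, d, e, tau);
    }
    /* large matrix: switch to the hybrid CPU+GPU code path */
    return hybrid_dsytrd(n, a, lda, d, e, tau);
}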
Stan