QR factorization on multiple GPUs

Open discussion for MAGMA

Post by gadou » Mon May 12, 2014 12:49 pm

Hi,

I am using MAGMA to do QR factorization on multiple GPUs. The problem I am facing is that factorizing on 4 GPUs takes the same time as factorizing on 1 GPU. I am using a system with 4 Tesla GPUs. Any hints about what I may be doing wrong?
I am using the function "magma_dgeqrf2_mgpu". I also got the same performance when I ran the example in the testing folder, "testing_dgeqrf_mgpu.cpp".

Thanks
gadou
 
Posts: 7
Joined: Mon May 12, 2014 12:38 pm

Re: QR factorization on multiple GPUs

Post by mgates3 » Mon May 12, 2014 5:44 pm

Can you post the complete input & output of the tester, including what the environment variables OMP_NUM_THREADS, MKL_NUM_THREADS, and MAGMA_NUM_GPUS are set to, if you set any of those? For instance, see below.

Also, your make.inc file would be helpful, and any environment variables you set for that, such as GPU_TARGET.

Code:
romulus ~/magma-trunk-fermi> ls -l make.inc
lrwxrwxrwx    make.inc -> make.inc.mkl-ilp64

romulus ~/magma-trunk-fermi> echo $GPU_TARGET
Fermi

romulus ~/magma-trunk-fermi/testing> echo $OMP_NUM_THREADS
8

romulus ~/magma-trunk-fermi/testing> echo $MAGMA_NUM_GPUS
MAGMA_NUM_GPUS: Undefined variable.

romulus ~/magma-trunk-fermi/testing> ./testing_dgeqrf_mgpu --ngpu 1
MAGMA 1.4.0 svn compiled for CUDA capability >= 2.0
CUDA runtime 6000, driver 6000. OpenMP threads 8. MKL 11.1.0, MKL threads 8.
device 0: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 1: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
Usage: ./testing_dgeqrf_mgpu [options] [-h|--help]

ngpu 1
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||R||_F /(M*||A||_F)
==========================================================================
 1088  1088     ---   (  ---  )     43.42 (   0.04)     --- 
 2112  2112     ---   (  ---  )     94.92 (   0.13)     --- 
 3136  3136     ---   (  ---  )    137.60 (   0.30)     --- 
 4160  4160     ---   (  ---  )    167.53 (   0.57)     --- 
 5184  5184     ---   (  ---  )    200.67 (   0.93)     --- 
 6208  6208     ---   (  ---  )    217.93 (   1.46)     --- 
 7232  7232     ---   (  ---  )    234.47 (   2.15)     --- 
 8256  8256     ---   (  ---  )    241.81 (   3.10)     --- 
 9280  9280     ---   (  ---  )    251.19 (   4.24)     --- 
10304 10304     ---   (  ---  )    256.47 (   5.69)     --- 

romulus ~/magma-trunk-fermi/testing> ./testing_dgeqrf_mgpu --ngpu 2
MAGMA 1.4.0 svn compiled for CUDA capability >= 2.0
CUDA runtime 6000, driver 6000. OpenMP threads 8. MKL 11.1.0, MKL threads 8.
device 0: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 1: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
Usage: ./testing_dgeqrf_mgpu [options] [-h|--help]

ngpu 2
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||R||_F /(M*||A||_F)
==========================================================================
 1088  1088     ---   (  ---  )     38.71 (   0.04)     --- 
 2112  2112     ---   (  ---  )     86.91 (   0.14)     --- 
 3136  3136     ---   (  ---  )    135.04 (   0.30)     --- 
 4160  4160     ---   (  ---  )    156.39 (   0.61)     --- 
 5184  5184     ---   (  ---  )    208.53 (   0.89)     --- 
 6208  6208     ---   (  ---  )    244.51 (   1.30)     --- 
 7232  7232     ---   (  ---  )    283.23 (   1.78)     --- 
 8256  8256     ---   (  ---  )    310.16 (   2.42)     --- 
 9280  9280     ---   (  ---  )    349.18 (   3.05)     --- 
10304 10304     ---   (  ---  )    366.47 (   3.98)     --- 


Note that using 2 GPUs is faster for large matrices, but not for smaller ones (N < 5000).
-mark
mgates3
 
Posts: 442
Joined: Fri Jan 06, 2012 2:13 pm
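
As a rough cross-check of the note in the post above, the speedup can be computed directly from the times the tester printed. This is a minimal sketch in Python: the timings are copied from the two runs above, and the helper names `speedup` and `first_size_with_gain` are only illustrative. The printed times are rounded to 0.01 s, so the ratios for the smallest sizes are coarse.

```python
# (N, seconds on 1 GPU, seconds on 2 GPUs), copied from the tester output above.
TIMINGS = [
    (1088, 0.04, 0.04),
    (2112, 0.13, 0.14),
    (3136, 0.30, 0.30),
    (4160, 0.57, 0.61),
    (5184, 0.93, 0.89),
    (6208, 1.46, 1.30),
    (7232, 2.15, 1.78),
    (8256, 3.10, 2.42),
    (9280, 4.24, 3.05),
    (10304, 5.69, 3.98),
]


def speedup(t_one_gpu, t_multi_gpu):
    """Ratio of elapsed times; a value above 1 means the multi-GPU run was faster."""
    return t_one_gpu / t_multi_gpu


def first_size_with_gain(timings):
    """Smallest N whose 2-GPU time is strictly lower than the 1-GPU time, or None."""
    for n, t1, t2 in timings:
        if t2 < t1:
            return n
    return None


if __name__ == "__main__":
    for n, t1, t2 in TIMINGS:
        print(f"N = {n:5d}  speedup = {speedup(t1, t2):.2f}x")
    print("first size where 2 GPUs win:", first_size_with_gain(TIMINGS))
```

With these numbers, 2 GPUs first beat 1 GPU at N = 5184, and the gain at N = 10304 is about 1.4x, well short of 2x.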

Re: QR factorization on multiple GPUs

Post by gadou » Tue May 13, 2014 11:29 am

I am using the Ubuntu BLAS and LAPACK (liblapack-dev).

I just ran the test case with the default input; I only changed ngpu to 2 and 4. This is the complete output:

Thanks

Code:
MAGMA 1.4.1 , compiled for CUDA capability >= 1.0
device 0: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 1: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 2: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 3: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
Usage: ./testing_dgeqrf_mgpu [options] [-h|--help]

ngpu 1
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||R||_F /(M*||A||_F)
==========================================================================
 1088  1088     ---   (  ---  )     17.82 (   0.10)     ---
 2112  2112     ---   (  ---  )     35.42 (   0.35)     ---
 3136  3136     ---   (  ---  )     52.42 (   0.78)     ---
 4160  4160     ---   (  ---  )     36.88 (   2.60)     ---
 5184  5184     ---   (  ---  )     45.14 (   4.12)     ---
 6208  6208     ---   (  ---  )     53.44 (   5.97)     ---
 7232  7232     ---   (  ---  )     61.31 (   8.23)     ---
 8256  8256     ---   (  ---  )     69.05 (  10.87)     ---
 9280  9280     ---   (  ---  )     77.31 (  13.78)     ---
10304 10304     ---   (  ---  )     85.12 (  17.14)     ---


MAGMA 1.4.1 , compiled for CUDA capability >= 1.0
device 0: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 1: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 2: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
device 3: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory, capability 2.0
Usage: ./testing_dgeqrf_mgpu [options] [-h|--help]

ngpu 4
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||R||_F /(M*||A||_F)
==========================================================================
 1088  1088     ---   (  ---  )     16.06 (   0.11)     ---
 2112  2112     ---   (  ---  )     34.11 (   0.37)     ---
 3136  3136     ---   (  ---  )     50.69 (   0.81)     ---
 4160  4160     ---   (  ---  )     35.83 (   2.68)     ---
 5184  5184     ---   (  ---  )     44.48 (   4.18)     ---
 6208  6208     ---   (  ---  )     52.23 (   6.11)     ---
 7232  7232     ---   (  ---  )     60.71 (   8.31)     ---
 8256  8256     ---   (  ---  )     68.41 (  10.97)     ---
 9280  9280     ---   (  ---  )     76.51 (  13.93)     ---
10304 10304     ---   (  ---  )     84.31 (  17.30)     ---
Last edited by gadou on Tue May 13, 2014 12:02 pm, edited 1 time in total.
gadou
 
Posts: 7
Joined: Mon May 12, 2014 12:38 pm

Re: QR factorization on multiple GPUs

Post by mgates3 » Tue May 13, 2014 6:32 pm

If by "ubuntu blas" you mean the libblas3 Ubuntu package, that will be exceedingly slow. Try libopenblas-base, an optimized BLAS (linear algebra) library based on GotoBLAS2, or ATLAS (libatlas3-base), another optimized BLAS library. Ideally, you would use a multi-threaded BLAS library.

Again, it would be helpful to have your make.inc file and your complete input (command line and relevant environment variables), as well as the output, as I showed.

-mark
mgates3
 
Posts: 442
Joined: Fri Jan 06, 2012 2:13 pm
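
To put the gap between the two machines in numbers, the GFlop/s columns from the runs posted so far can be compared directly. The sketch below is in Python and uses only figures copied from the tester outputs above; the list and function names are illustrative. Both the S2050 and the C2050 devices report capability 2.0 with the same clock and memory, so the 1-GPU rates would be expected to be broadly similar if the GPUs were the limiting factor.

```python
SIZES = [1088, 2112, 3136, 4160, 5184, 6208, 7232, 8256, 9280, 10304]

# GPU GFlop/s columns copied from the posted tester outputs.
C2050_1GPU = [17.82, 35.42, 52.42, 36.88, 45.14, 53.44, 61.31, 69.05, 77.31, 85.12]
C2050_4GPU = [16.06, 34.11, 50.69, 35.83, 44.48, 52.23, 60.71, 68.41, 76.51, 84.31]
S2050_1GPU = [43.42, 94.92, 137.60, 167.53, 200.67, 217.93, 234.47, 241.81, 251.19, 256.47]


def ratios(numerators, denominators):
    """Element-wise ratio of two equally long lists of rates."""
    return [a / b for a, b in zip(numerators, denominators)]


# Below 1 means 4 GPUs were slower than 1 GPU on the C2050 machine.
scaling = ratios(C2050_4GPU, C2050_1GPU)

# How many times faster the S2050 machine was with a single GPU.
gap = ratios(S2050_1GPU, C2050_1GPU)

if __name__ == "__main__":
    print("    N   4GPU/1GPU   S2050/C2050")
    for n, s, g in zip(SIZES, scaling, gap):
        print(f"{n:5d}   {s:9.2f}   {g:11.2f}")
```

In this data the 4-GPU run never exceeds the 1-GPU run, and the 1-GPU rate is roughly 2.4x to 4.5x below the earlier S2050 run. That would be consistent with a bottleneck outside the GPUs, such as the CPU BLAS discussed above, though the numbers alone do not prove it.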

Re: QR factorization on multiple GPUs

Post by gadou » Wed May 14, 2014 2:52 pm

Hi, I installed ATLAS and used it, but the results do not seem to change.
Here are the contents of my make.inc file:

Code:
#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 1.4.1) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of Colorado, Denver
#      December 2013
#//////////////////////////////////////////////////////////////////////////////

# GPU_TARGET contains one or more of Tesla, Fermi, or Kepler,
# to specify for which GPUs you want to compile MAGMA:
#     Tesla  - NVIDIA compute capability 1.x cards
#     Fermi  - NVIDIA compute capability 2.x cards
#     Kepler - NVIDIA compute capability 3.x cards
# The default is all, "Tesla Fermi Kepler".
# See http://developer.nvidia.com/cuda-gpus
#
#GPU_TARGET ?= Tesla Fermi Kepler

CC        = gcc
NVCC      = nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

OPTS      = -O3 -DADD_ -fopenmp -DMAGMA_SETAFFINITY
F77OPTS   = -O3 -DADD_
FOPTS     = -O3 -DADD_ -x f95-cpp-input
NVOPTS    = -O3 -DADD_ -Xcompiler -fno-strict-aliasing
LDOPTS    = -fopenmp

# Depending on how ATLAS and LAPACK were compiled, you may need one or more of:
# -lifcore -ldl -lf2c -lgfortran
LIB       = -llapack -lf77blas -latlas -lcblas -lcublas -lcudart -lstdc++ -lm -lgfortran #-lifcore

# define library directories here or in your environment
LAPACKDIR = /usr/lib/lapack
ATLASDIR = /usr/lib/atlas-base
CUDADIR   = /usr/local/cuda
-include make.check-atlas
-include make.check-cuda

LIBDIR    = -L$(LAPACKDIR) \
            -L$(ATLASDIR)/lib \
            -L$(CUDADIR)/lib64

INC       = -I$(CUDADIR)/include


Here are the output results:

Code:
gpus5:148% ./testing_dgeqrf_mgpu --ngpu 2
MAGMA 1.4.1 , compiled for CUDA capability >= 1.0
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 1: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
Usage: ./testing_dgeqrf_mgpu [options] [-h|--help]

ngpu 2
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||R||_F /(M*||A||_F)
==========================================================================
 1088  1088     ---   (  ---  )     16.12 (   0.11)     --- 
 2112  2112     ---   (  ---  )     31.96 (   0.39)     --- 
 3136  3136     ---   (  ---  )     24.41 (   1.69)     --- 
 4160  4160     ---   (  ---  )     32.32 (   2.97)     --- 
 5184  5184     ---   (  ---  )     40.14 (   4.63)     --- 
 6208  6208     ---   (  ---  )     47.54 (   6.71)     --- 
 7232  7232     ---   (  ---  )     54.71 (   9.22)     --- 
 8256  8256     ---   (  ---  )     61.67 (  12.17)     --- 
 9280  9280     ---   (  ---  )     67.69 (  15.74)     --- 
10304 10304     ---   (  ---  )     47.02 (  31.03)     --- 
gpus5:149% ./testing_dgeqrf_mgpu --ngpu 1
MAGMA 1.4.1 , compiled for CUDA capability >= 1.0
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 1: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
Usage: ./testing_dgeqrf_mgpu [options] [-h|--help]

ngpu 1
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||R||_F /(M*||A||_F)
==========================================================================
 1088  1088     ---   (  ---  )     16.71 (   0.10)     --- 
 2112  2112     ---   (  ---  )     32.29 (   0.39)     --- 
 3136  3136     ---   (  ---  )     24.51 (   1.68)     --- 
 4160  4160     ---   (  ---  )     32.40 (   2.96)     --- 
 5184  5184     ---   (  ---  )     40.22 (   4.62)     --- 
 6208  6208     ---   (  ---  )     47.64 (   6.70)     --- 
 7232  7232     ---   (  ---  )     54.87 (   9.19)     --- 
 8256  8256     ---   (  ---  )     61.49 (  12.20)     --- 
 9280  9280     ---   (  ---  )     67.81 (  15.72)     --- 
10304 10304     ---   (  ---  )     46.86 (  31.13)     --- 
gadou
 
Posts: 7
Joined: Mon May 12, 2014 12:38 pm
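
The GFlop/s column in these tables can be sanity-checked from N and the elapsed time. The sketch below, in Python, assumes the standard leading-order flop count for Householder QR of an M-by-N matrix with M >= N, namely 2MN^2 - (2/3)N^3. The tester's exact formula is not shown in this thread and likely includes lower-order terms, so small differences are expected. The function names are illustrative.

```python
def qr_flops(m, n):
    """Leading-order flop count for QR of an m-by-n matrix (assumes m >= n)."""
    return 2.0 * m * n * n - (2.0 / 3.0) * n ** 3


def gflops(m, n, seconds):
    """Achieved rate in GFlop/s for a factorization that took `seconds`."""
    return qr_flops(m, n) / seconds / 1e9


# (N, seconds, reported GFlop/s) from the last two rows of the 1-GPU K20c run above.
ROWS = [
    (9280, 15.72, 67.81),
    (10304, 31.13, 46.86),
]

if __name__ == "__main__":
    for n, secs, reported in ROWS:
        print(f"N = {n:5d}  computed = {gflops(n, n, secs):6.2f}  reported = {reported:6.2f}")

    # Compare growth in work with growth in time between the two sizes.
    work_ratio = qr_flops(10304, 10304) / qr_flops(9280, 9280)
    time_ratio = 31.13 / 15.72
    print(f"work grows {work_ratio:.2f}x, time grows {time_ratio:.2f}x")
```

For the 1-GPU K20c run, the flop count grows by about 1.37x between N = 9280 and N = 10304, but the time roughly doubles (15.72 s to 31.13 s), which is why the reported rate drops from 67.81 to 46.86 GFlop/s at the largest size.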

