dot slower in gpu

Posted: Thu Nov 09, 2017 2:58 pm
by thanasis_giannis
I found that magma_?dot() seems to be a lot slower than the CPU code. Can anyone tell me why this is happening?

Thank you!

Re: dot slower in gpu

Posted: Thu Nov 09, 2017 4:22 pm
by mgates3
Please be more specific about what CPU and GPU you are using, what CPU and GPU software, and what size and precision vectors. Specific output from a tester would be helpful.

For small to modest size vectors, I would expect the CPU to be faster -- especially if the vectors are in cache memory.
For large vectors, say several times the size of cache, I would expect the GPU with its faster memory to be faster.

Currently, MAGMA does not have a specific dot tester, since we use cuBLAS dot, but recent revisions available from Bitbucket do include an axpy tester, which should give similar performance to dot, and exemplifies this crossover.

bunsen magma/testing> ./testing_daxpy -n 100 -n 1000 -n 10000 -n 100000 -n 1000000
% MAGMA 2.2.0 svn compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 9000. OpenMP threads 1. MKL 11.3.3, MKL threads 1. 
% device 0: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% Thu Nov  9 15:17:12 2017
% Usage: ./testing_daxpy [options] [-h|--help]

%     M   cnt     cuBLAS Gflop/s (ms)       CPU Gflop/s (ms)  cuBLAS error
    100   100      0.0294 (   0.6809)      0.4877 (   0.0410)    0.00e+00   ok
   1000   100      0.3199 (   0.6251)      2.4105 (   0.0830)    0.00e+00   ok
  10000   100      2.8896 (   0.6921)      1.7243 (   1.1599)    0.00e+00   ok
 100000   100     11.1063 (   1.8008)      1.3741 (  14.5550)    0.00e+00   ok
1000000   100     15.3857 (  12.9991)      1.3988 ( 142.9799)    0.00e+00   ok
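
A note on reading the table: the reported figures are consistent with counting 2*M flops per axpy call and dividing by the total time for all cnt calls, so the (ms) column appears to be the time for 100 calls rather than one. The C sketch below reproduces that arithmetic. The function names are hypothetical and are not part of the MAGMA tester; the flop count and the meaning of the time column are assumptions inferred from the numbers.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: axpy (y = alpha*x + y) is counted as 2 flops per element,
   and total_ms is the time for all cnt calls together. */
double axpy_gflops(long n, long cnt, double total_ms)
{
    assert(n > 0 && cnt > 0 && total_ms > 0.0);
    double flops = 2.0 * (double)n * (double)cnt;
    return flops / (total_ms * 1e-3) / 1e9;
}

/* Average time of a single call, in microseconds. */
double per_call_us(long cnt, double total_ms)
{
    assert(cnt > 0);
    return total_ms * 1e3 / (double)cnt;
}

/* Print the derived numbers for one row of the table. */
void print_row(long n, long cnt, double gpu_ms, double cpu_ms)
{
    printf("n=%8ld  GPU %8.4f Gflop/s (%9.3f us/call)  "
           "CPU %8.4f Gflop/s (%9.3f us/call)\n",
           n,
           axpy_gflops(n, cnt, gpu_ms), per_call_us(cnt, gpu_ms),
           axpy_gflops(n, cnt, cpu_ms), per_call_us(cnt, cpu_ms));
}
```

Calling print_row(10000, 100, 0.6921, 1.1599) gives roughly 6.9 us per GPU call and 11.6 us per CPU call. The GPU total stays between 0.6 and 0.7 ms for M = 100 to 10000, about 6 to 7 us per call regardless of size, which suggests that a fixed per-call cost dominates in that range.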

Re: dot slower in gpu

Posted: Thu Nov 09, 2017 4:56 pm
by thanasis_giannis
I am using an i7-3770K CPU @ 3.50GHz with a Tesla K40c on Ubuntu Linux. The vectors have 10000 elements, and I am not timing data transfers. I am using magma_ddot as part of a bigger code, and I see that the CPU code is about 5 times faster than the GPU code.

dot is called many times and the cumulative times are

gpu: 7.19 seconds
cpu: 1.30 seconds

I know I shouldn't expect good performance because dot doesn't involve much computation, but being 5 times slower seems strange to me.

Re: dot slower in gpu

Posted: Sat Nov 11, 2017 5:13 am
by thanasis_giannis
Actually, I timed the specific function (dot) every time it was called within a certain function, and the CPU is way faster than the GPU.

Re: dot slower in gpu

Posted: Sat Nov 11, 2017 9:45 am
by thanasis_giannis
So I measured dot in a separate file, and it does start to perform well for vectors of 50 000 elements. That's why I get bad timings: I didn't test my code with 50 000 elements.

Re: dot slower in gpu

Posted: Mon Nov 13, 2017 4:07 am
by mgates3
A vector of length 10000 double precision values takes only 78 KiB. If you are calling dot many times on the same vector, then it will stay in L2 or even L1 cache, so the CPU will be quite fast.
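
The 78 KiB figure is 10000 elements times 8 bytes, which is 80000 bytes or about 78.1 KiB. A dot product on two distinct vectors touches twice that, about 156 KiB. A minimal C sketch of the arithmetic follows; the helper names are hypothetical, and the cache size passed in is whatever your CPU actually has (the 256 KiB used in the note after the code is an assumed per-core L2 size, so check your processor's specifications).

```c
#include <assert.h>
#include <stddef.h>

/* Memory footprint in KiB of n_vectors vectors of n_elems elements each. */
double footprint_kib(size_t n_elems, size_t elem_bytes, size_t n_vectors)
{
    assert(elem_bytes > 0);
    return (double)(n_elems * elem_bytes * n_vectors) / 1024.0;
}

/* Returns 1 if the footprint fits in a cache of cache_kib KiB, else 0.
   This ignores other data competing for the same cache. */
int fits_in_cache(double footprint, double cache_kib)
{
    return footprint <= cache_kib;
}
```

For example, footprint_kib(10000, sizeof(double), 2) is 156.25, and fits_in_cache(156.25, 256.0) returns 1 under the assumed 256 KiB cache.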

On the other hand, the GPU has a hard time parallelizing a modest size vector like that. If it uses one thread block, it is limited to one SMX out of 15 SMX. If it uses 15 thread blocks, each thread block has only 667 elements to reduce, then it has to synchronize the thread blocks somehow and do another reduction.
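
The two-stage scheme described above can be sketched on the CPU in plain C. This is only an illustration of the structure (per-block partial sums, then a second reduction over the partials). It is not how cuBLAS actually implements ddot, and dot_two_stage is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates a blocked dot product: nblocks "thread blocks" each reduce one
   contiguous chunk into partial[b], then the partials are reduced.
   partial must have room for nblocks doubles. */
double dot_two_stage(const double *x, const double *y, size_t n,
                     size_t nblocks, double *partial)
{
    assert(nblocks > 0);
    /* ceil(n / nblocks): 667 for n = 10000 and nblocks = 15 */
    size_t chunk = (n + nblocks - 1) / nblocks;

    /* Stage 1: on a GPU these iterations would run concurrently,
       one per thread block. */
    for (size_t b = 0; b < nblocks; ++b) {
        size_t begin = b * chunk;
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        double s = 0.0;
        for (size_t i = begin; i < end; ++i)
            s += x[i] * y[i];
        partial[b] = s;
    }

    /* Stage 2: can only start once every block has written its partial sum;
       on a GPU this is the synchronization point. */
    double total = 0.0;
    for (size_t b = 0; b < nblocks; ++b)
        total += partial[b];
    return total;
}
```

With n = 10000 and nblocks = 15, the first 14 blocks each reduce 667 elements and the last one reduces the remaining 662, so each block has very little work relative to the cost of coordinating the second stage.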

Note that magma_ddot is simply a wrapper around cublasDdot, used for portability to other platforms.