dot slower in gpu
- Posts: 9
- Joined: Thu Aug 24, 2017 7:35 am
So I found out that magma_?dot() seems to be a lot slower than the CPU code. Can anyone tell me why this is happening?
Thank you!
Re: dot slower in gpu
Please be more specific about what CPU and GPU you are using, what CPU and GPU software, and what size and precision vectors. Specific output from a tester would be helpful.
For small to modest size vectors, I would expect the CPU to be faster -- especially if the vectors are in cache memory.
For large vectors, say several times the size of cache, I would expect the GPU with its faster memory to be faster.
Currently, MAGMA does not have a specific dot tester, since we use cuBLAS dot, but recent revisions available from Bitbucket do include an axpy tester, which should give similar performance to dot, and exemplifies this crossover.
-mark
Code:
bunsen magma/testing> ./testing_daxpy -n 100 -n 1000 -n 10000 -n 100000 -n 1000000
% MAGMA 2.2.0 svn compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 9000. OpenMP threads 1. MKL 11.3.3, MKL threads 1.
% device 0: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% Thu Nov 9 15:17:12 2017
% Usage: ./testing_daxpy [options] [-h|--help]
% M cnt cuBLAS Gflop/s (ms) CPU Gflop/s (ms) cuBLAS error
%===========================================================================
100 100 0.0294 ( 0.6809) 0.4877 ( 0.0410) 0.00e+00 ok
1000 100 0.3199 ( 0.6251) 2.4105 ( 0.0830) 0.00e+00 ok
10000 100 2.8896 ( 0.6921) 1.7243 ( 1.1599) 0.00e+00 ok
100000 100 11.1063 ( 1.8008) 1.3741 ( 14.5550) 0.00e+00 ok
1000000 100 15.3857 ( 12.9991) 1.3988 ( 142.9799) 0.00e+00 ok
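As a rough way to reproduce the same crossover for dot itself, one could time cublasDdot (which magma_ddot wraps) directly against a plain CPU loop. Below is a minimal sketch, assuming only CUDA and cuBLAS are installed; the file name, sizes, and repetition count are arbitrary.
Code:
// ddot_bench.cu -- crossover sketch: cublasDdot vs. a simple CPU dot loop.
// One plausible build line: nvcc ddot_bench.cu -lcublas -o ddot_bench
#include <cstdio>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static double cpu_dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const int reps = 100;
    const int sizes[] = {100, 1000, 10000, 100000, 1000000};
    for (int n : sizes) {
        std::vector<double> x(n, 1.0), y(n, 2.0);
        double *dx, *dy;
        cudaMalloc(&dx, n * sizeof(double));
        cudaMalloc(&dy, n * sizeof(double));
        cudaMemcpy(dx, x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y.data(), n * sizeof(double), cudaMemcpyHostToDevice);

        double r = 0.0;
        cublasDdot(handle, n, dx, 1, dy, 1, &r);      // warm-up call

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i)
            cublasDdot(handle, n, dx, 1, dy, 1, &r);  // blocks: result returned to host
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i)
            r = cpu_dot(n, x.data(), y.data());       // (a real benchmark would keep the
        auto t2 = std::chrono::steady_clock::now();   //  compiler from hoisting this call)

        double gpu_ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
        double cpu_ms = std::chrono::duration<double, std::milli>(t2 - t1).count() / reps;
        printf("n=%8d  gpu %9.4f ms  cpu %9.4f ms  (r=%g)\n", n, gpu_ms, cpu_ms, r);

        cudaFree(dx); cudaFree(dy);
    }
    cublasDestroy(handle);
    return 0;
}
With the default host pointer mode, each cublasDdot call blocks until the scalar result is back on the host, so wall-clock timing around the loop needs no extra synchronization.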
- Posts: 9
- Joined: Thu Aug 24, 2017 7:35 am
Re: dot slower in gpu
I am using an i7-3770K CPU @ 3.50GHz with a Tesla K40c, on Linux (Ubuntu). The vectors have 10000 elements and I am not timing data transfers. I am using magma_ddot as part of a bigger code, and I see that the CPU code is 5 times faster than the GPU code.
dot is called many times, and the cumulative times are:
gpu: 7.19 seconds
cpu: 1.30 seconds
I know I shouldn't expect good performance, because dot doesn't do many computations, but being 5 times slower seems strange to me.
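One way to collect such cumulative timings is sketched below, assuming the MAGMA 2.x interface, where magma_ddot takes a magma_queue_t and returns the result on the host, and using magma_sync_wtime so the queue is drained before each timestamp; the helper name is illustrative.
Code:
// Sketch: accumulate per-call time for magma_ddot over many calls.
// Assumes MAGMA 2.x; dx and dy already reside on the device.
#include <magma_v2.h>

double time_ddot(magma_int_t n, magmaDouble_const_ptr dx, magmaDouble_const_ptr dy,
                 magma_queue_t queue, int calls)
{
    double total = 0.0, result = 0.0;
    for (int i = 0; i < calls; ++i) {
        real_Double_t t = magma_sync_wtime(queue);    // drain the queue, then timestamp
        result = magma_ddot(n, dx, 1, dy, 1, queue);  // blocking: scalar comes back to host
        total += magma_sync_wtime(queue) - t;
    }
    (void) result;  // use the value in real code so the calls are not optimized away
    return total;   // cumulative seconds, comparable to the numbers above
}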
- Posts: 9
- Joined: Thu Aug 24, 2017 7:35 am
Re: dot slower in gpu
Actually, I timed the specific function (dot) every time it was called in a certain function, and on the CPU it is way faster than on the GPU.
- Posts: 9
- Joined: Thu Aug 24, 2017 7:35 am
Re: dot slower in gpu
So I measured dot in a separate file, and indeed it starts to show good performance for vectors of 50 000 elements. That's why I was getting bad timings: I hadn't tested my code with vectors of 50 000 elements.
Re: dot slower in gpu
A vector of 10000 double-precision values takes only 78 KiB. If you are calling dot many times on the same vector, then it will stay in L2 or even L1 cache, so the CPU will be quite fast.
On the other hand, the GPU has a hard time parallelizing a modest-size vector like that. If it uses one thread block, it is limited to one SMX out of 15. If it uses 15 thread blocks, each thread block has only about 667 elements to reduce, and then it has to synchronize the thread blocks somehow and do another reduction; see the sketch after this post.
Note that magma_ddot is simply a wrapper around cublasDdot, used for portability to other platforms.
-mark
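To make the thread-block argument above concrete, here is a simplified two-stage dot product; a sketch only, not MAGMA's or cuBLAS's actual implementation. Each of 15 blocks reduces its slice of the vectors to one partial sum, and a second reduction (here on the host) combines the 15 partials.
Code:
// Simplified two-stage dot product, illustrating the reduction structure
// described above (not MAGMA's or cuBLAS's actual implementation).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void dot_partial(int n, const double *x, const double *y, double *partial)
{
    __shared__ double sdata[256];          // one slot per thread in the block
    int tid = threadIdx.x;
    double s = 0.0;
    // Grid-stride loop: with 15 blocks and n = 10000, each block covers ~667 elements.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        s += x[i] * y[i];
    sdata[tid] = s;
    __syncthreads();
    // Tree reduction within the block (blockDim.x must be a power of two here).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];   // one partial sum per block
}

int main()
{
    const int n = 10000, threads = 256, blocks = 15;   // ~one block per SMX on a K40
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double *dx, *dy, *dpartial;
    cudaMalloc(&dx, n * sizeof(double));
    cudaMalloc(&dy, n * sizeof(double));
    cudaMalloc(&dpartial, blocks * sizeof(double));
    cudaMemcpy(dx, x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    dot_partial<<<blocks, threads>>>(n, dx, dy, dpartial);

    // Second stage: only 15 partials remain, so reduce them on the host.
    double partial[blocks], dot = 0.0;
    cudaMemcpy(partial, dpartial, blocks * sizeof(double), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b) dot += partial[b];
    printf("dot = %g (expected %g)\n", dot, 2.0 * n);

    cudaFree(dx); cudaFree(dy); cudaFree(dpartial);
    return 0;
}
At n = 10000, the kernel launch, the block synchronization, and the final reduction can easily dominate the small amount of useful arithmetic, which is consistent with the crossover appearing only around 50 000 elements.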