magma_dsytrd questions

Open discussion for MAGMA

Re: magma_dsytrd questions

Postby mtacconi » Wed Feb 23, 2011 11:32 am

brom wrote:The paper also claims a worse GPU (their GTX280 vs my C2050) and a better CPU (their Xeon vs my Desktop PC). So I'm not sure what I'm doing wrong! Perhaps the numbers in the paper were theoretical throughput but mistakenly presented as observed results?


Probably in the paper they compare 1 CPU core/thread against 1 GPU. It is a common practice: most GPGPU work compares GPU performance against single-core/single-thread CPU performance.
Anyway, are you using a multithreaded LAPACK library on the CPU side? This could explain the discrepancy you see in the CPU performance.
mtacconi
 
Posts: 11
Joined: Tue Dec 07, 2010 4:21 am

Re: magma_dsytrd questions

Postby brom » Wed Feb 23, 2011 11:40 am

mtacconi wrote: [...] probably in the paper they compare 1 CPU core/thread against 1 GPU. [...] Anyway, are you using a multithreaded LAPACK library CPU side?


The paper claims they are using "MKL's parallel BLAS" with "MKL 10.0". I'm using MKL as well. I'd think that hindering the multi-core BLAS would hurt the GPU performance as well considering MAGMA is a hybrid implementation.
brom
 
Posts: 18
Joined: Tue Jan 25, 2011 8:20 pm

Re: magma_dsytrd questions

Postby mtacconi » Wed Feb 23, 2011 12:20 pm

brom wrote:
mtacconi wrote: [...] probably in the paper they compare 1 CPU core/thread against 1 GPU. [...] Anyway, are you using a multithreaded LAPACK library CPU side?


The paper claims they are using "MKL's parallel BLAS" with "MKL 10.0". I'm using MKL as well. I'd think that hindering the multi-core BLAS would hurt the GPU performance as well considering MAGMA is a hybrid implementation.


I see your point, but some time ago I tried the Hessenberg reduction routine dgehrd from MAGMA 0.2, and I was able to see the claimed 20x speedup only against a single core/thread. Here are the speedups I recorded:

MKL_NUM_THREADS=1
  dim  speedup
  512  0.852941
 1024  1.96906
 2048  6.05904
 4096  10.0686
 8192  16.1507
12288  19.3209

MKL_NUM_THREADS=6
  dim  speedup
  512  0.512821
 1024  1.04185
 2048  3.52746
 4096  7.532
 8192  8.00784
11264  9.74508
12288  9.00145



From these results I concluded at the time that they are comparing against 1 CPU thread. I have to say I didn't investigate any further, because I was, and still am, mainly interested in the tridiagonalization routine and the Hermitian eigenproblem.
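As a rough cross-check of the two tables above, the ratio of the 1-thread speedup to the 6-thread speedup gives an estimate of how much the CPU-only run gained from 6 MKL threads. This is only a sketch: it assumes "speedup" means CPU time divided by hybrid MAGMA time, and that the hybrid time is the same in both runs, which may not hold since MAGMA also uses the CPU BLAS. The function name is made up for illustration.

```python
# Speedups copied from the two tables above, keyed by matrix dimension.
speedup_1_thread = {512: 0.852941, 1024: 1.96906, 2048: 6.05904,
                    4096: 10.0686, 8192: 16.1507, 12288: 19.3209}
speedup_6_threads = {512: 0.512821, 1024: 1.04185, 2048: 3.52746,
                     4096: 7.532, 8192: 8.00784, 11264: 9.74508,
                     12288: 9.00145}


def implied_cpu_scaling(s1, s6):
    """Return s1[n] / s6[n] for every dimension n present in both runs.

    ASSUMPTION: speedup = t_cpu / t_hybrid and t_hybrid is identical in
    both runs, so the ratio equals t_cpu(1 thread) / t_cpu(6 threads).
    """
    return {n: s1[n] / s6[n] for n in sorted(s1) if n in s6}


scaling = implied_cpu_scaling(speedup_1_thread, speedup_6_threads)
for n, ratio in scaling.items():
    print(f"{n:6d}  implied 6-thread CPU gain: {ratio:.2f}x")
```

Under those assumptions the implied gain from 6 threads stays well below 6x at every size, which fits the observation that the headline speedup only appears against a single thread.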

The testing code is written in F95 plus some F2003 extensions (the iso_c_binding module); PM me if interested.
mtacconi
 
Posts: 11
Joined: Tue Dec 07, 2010 4:21 am

Re: magma_dsytrd questions

Postby Stan Tomov » Wed Feb 23, 2011 2:07 pm

Hello,

Thanks for reporting and trying to figure out the reasons for these discrepancies. They are due to the GPU BLAS implementation used: in the paper we used customized BLAS kernels that are not yet in the release. The high-level algorithms, though, as described in the paper, are in the release.

Talking specifically about the SSYTRD routine, its performance critically depends on the speed of SSYMV (as 50% of the flops are in SYMV). Theoretically SSYMV can run at up to 142 GFlop/s on a GTX280 (bus speed 142 GB/s), so if this were available, the SSYTRD from MAGMA 1.0 RC3 would run asymptotically at a speed above 142 GFlop/s. In reality, though, this SYMV performance is not possible. CUBLAS SSYMV runs at below 10 GFlop/s, and as a result the MAGMA SSYTRD, using CUBLAS SSYMV, runs at about that speed as well. The paper used an SSYMV kernel running at up to ~80 GFlop/s, and so the MAGMA SSYTRD using that kernel goes to about that speed. Although this may sound impressive, there is obviously a lot of room for improvement. Indeed, we developed another SSYMV (shortly after the paper was submitted) that reached a little above 100 GFlop/s, and along with other optimizations the SSYTRD actually reached close to 120 GFlop/s.
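The bandwidth argument above can be sketched numerically. The model below is an assumption, not something stated in the post: SYMV performs about 2n^2 flops while reading only one triangle of the matrix (about n^2/2 elements), and SYTRD time is split into a SYMV part and "everything else" by flop fraction. Both function names are hypothetical.

```python
def symv_bandwidth_bound_gflops(bandwidth_gb_s, bytes_per_element):
    """Upper bound on SYMV speed when limited only by memory bandwidth.

    ASSUMPTION: ~2*n^2 flops per call and only one triangle of the
    matrix (~n^2/2 elements) is read from GPU memory.
    """
    flops_per_byte = 2.0 / (0.5 * bytes_per_element)
    return bandwidth_gb_s * flops_per_byte


def sytrd_rate_gflops(symv_gflops, other_gflops, symv_fraction=0.5):
    """Overall rate when a fraction of the flops runs at the SYMV speed
    and the rest at other_gflops (a simple two-phase time model)."""
    time_per_gflop = (symv_fraction / symv_gflops
                      + (1.0 - symv_fraction) / other_gflops)
    return 1.0 / time_per_gflop


# Single precision (4 bytes) on a 142 GB/s bus, as quoted in the post.
print(symv_bandwidth_bound_gflops(142.0, 4))
# Limiting case: everything except SYMV is infinitely fast.
print(sytrd_rate_gflops(10.0, float("inf")))
```

With these assumptions the single-precision bound matches the 142 GFlop/s figure, and the two-phase model caps SYTRD at twice the SYMV speed, so a SYMV running below 10 GFlop/s keeps the whole reduction under 20 GFlop/s no matter how fast the remaining kernels are.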

The development of BLAS consumes a lot of effort, especially with GPU changes coming frequently. For example, on Fermi we had to redesign some BLAS algorithms. Level 2 BLAS on Fermi is also slow, as the bus bandwidth was not increased while ECC was added, further reducing the bandwidth available to users. Therefore we may even consider dropping MAGMA BLAS support from MAGMA. The CUBLAS GEMM is based on the MAGMA GEMM, so similarly, we will be happy to provide highly optimized MAGMA BLAS to NVIDIA to be incorporated and maintained in CUBLAS.

Stan
Stan Tomov
 
Posts: 250
Joined: Fri Aug 21, 2009 10:39 pm

Re: magma_dsytrd questions

Postby brom » Wed Feb 23, 2011 5:35 pm

So MAGMA's SSYMV and SGEMV aren't the ones from the paper?
brom
 
Posts: 18
Joined: Tue Jan 25, 2011 8:20 pm

Re: magma_dsytrd questions

Postby Stan Tomov » Wed Feb 23, 2011 5:48 pm

That's correct. The currently released MAGMA 1.0 RC3 SSYMV allocates and frees work memory inside the kernel, just to be compliant with the BLAS standard. This by itself reduces performance by about 20 GFlop/s (compared to an expert interface passing the work space from outside the routine). What is released is still much faster than CUBLAS SSYMV, but it is not the 100 GFlop/s kernel.
Stan
Stan Tomov
 
Posts: 250
Joined: Fri Aug 21, 2009 10:39 pm

Re: magma_dsytrd questions

Postby brom » Wed Feb 23, 2011 6:58 pm

OK, thanks. That explains why the GPU performance is so low. Do you think these routines will be in RC4?

Is it also safe to assume the numbers from the paper are from only single-threaded LAPACK? That seems to be the conclusion we have found here.
brom
 
Posts: 18
Joined: Tue Jan 25, 2011 8:20 pm

Re: magma_dsytrd questions

Postby mtacconi » Thu Feb 24, 2011 1:05 pm

Following Stan's suggestion, I modified the dsytrd routine to call either magma_dsymv or the
"expert driver" magma_dsymv6_fermi, which needs some basic GPU memory management.
Here are the results:

MKL_NUM_THREADS=1

cublasDsymv:
    M      N   CPU GFlop/s   CPU etime   GPU GFlop/s   GPU etime
================================================================
 1024   1024          2.55      563.33          7.27      197.51
 2048   2048          6.89     1664.27         12.80      895.77
 3072   3072          5.91     6546.29         15.97     2423.07
 4032   4032          5.77    15163.73         17.20     5084.59
 5184   5184          6.16    30185.97         19.04     9761.71
 6016   6016          6.32    45922.65         19.64    14787.11

magma_dsymv_fermi:
    M      N   CPU GFlop/s   CPU etime   GPU GFlop/s   GPU etime
================================================================
 1024   1024          5.25      273.40          6.97      205.79
 2048   2048          6.40     1790.65         10.35     1108.28
 3072   3072          4.99     7758.61         15.13     2556.72
 4032   4032          5.02    17412.83         18.82     4647.89
 5184   5184          5.35    34762.57         22.30     8332.76
 6016   6016          5.56    52197.23         24.08    12062.48
 7040   7040          5.67    82059.43         25.75    18074.68

magma_dsymv6_fermi:
    M      N   CPU GFlop/s   CPU etime   GPU GFlop/s   GPU etime
================================================================
 1024   1024          6.90      207.88          7.07      203.16
 2048   2048          6.47     1772.58         15.09      760.21
 3072   3072          5.78     6688.94         19.93     1941.46
 4032   4032          5.80    15082.18         23.01     3800.43
 5184   5184          5.83    31871.25         26.35     7053.29
 6016   6016          5.95    48819.00         27.76    10462.53
 7040   7040          5.97    77925.12         29.28    15896.75
 8064   8064          5.96   117431.18         29.87     23415.7


In this way it is actually possible to squeeze some more GFlop/s out of the current MAGMA release:
from a somewhat disappointing 19 GFlop/s for the dsytrd that uses cublasDsymv to a more comfortable sustained 29 GFlop/s by using magma_dsymv6_fermi :)
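For anyone wanting to sanity-check the tables, the GFlop/s columns can be approximately reproduced from the matrix size and the elapsed time. The sketch below assumes the leading-order flop count of 4n^3/3 for the symmetric tridiagonal reduction and assumes the etime columns are in milliseconds; neither is stated in the post, but both are consistent with the reported numbers. The helper name is made up.

```python
def sytrd_gflops(n, elapsed_ms):
    """Approximate GFlop/s of an n-by-n symmetric tridiagonalization.

    ASSUMPTIONS: flop count ~ 4*n^3/3 (leading-order term only) and
    elapsed time given in milliseconds.
    """
    flops = 4.0 * n ** 3 / 3.0
    return flops / (elapsed_ms * 1e-3) / 1e9


# (n, GPU etime, reported GPU GFlop/s) taken from the tables above.
samples = [
    (1024, 197.51, 7.27),     # cublasDsymv
    (6016, 14787.11, 19.64),  # cublasDsymv
    (8064, 23415.7, 29.87),   # magma_dsymv6_fermi
]

for n, etime, reported in samples:
    print(f"n={n}: computed {sytrd_gflops(n, etime):.2f}, reported {reported}")
```

The computed rates agree with the reported ones to within about one percent, which supports reading the etime columns as milliseconds.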

Looking forward to seeing the optimized dsymv at work!
mtacconi
 
Posts: 11
Joined: Tue Dec 07, 2010 4:21 am

Re: magma_dsytrd questions

Postby jeremiahpalmer » Wed Mar 02, 2011 4:52 pm

Stan, I have read a little bit about fusing operations together - creating a "BLAS 2.5". Is it possible to incorporate some of those BLAS 2.5 ideas into the magma_dsytrd?

BTW, I noticed the fancy magma_dsymv's performance right from the start. Is there any discussion between the MAGMA folks and nVidia about incorporating the much faster magma_dsymv into CUBLAS?

Thanks again,
Jeremiah
jeremiahpalmer
 
Posts: 58
Joined: Fri Jan 28, 2011 12:46 pm

Re: magma_dsytrd questions

Postby jeremiahpalmer » Tue Mar 08, 2011 6:51 pm

Does anyone know much about fusing operations in magma_dsytrd?

Also, what about CUBLAS incorporating MAGMA's dsymv?
jeremiahpalmer
 
Posts: 58
Joined: Fri Jan 28, 2011 12:46 pm
