Timing of dgetrs_gpu and zgetrs_gpu

Open discussion for MAGMA

Post by fletchjp » Wed Mar 23, 2011 7:14 pm

I have been exploring why my FORTRAN routine using zgetrs_gpu is so slow and I have some interesting results.

I decided to start in C++. I adapted testing_zgetrf_gpu.cpp so that, after the factorization, it goes on to call magma_zgetrs_gpu, and I timed that solve step. I did the same for testing_dgetrf_gpu.cpp.

I have also put in comparative calls to the LAPACK routines lapackf77_zgetrs and lapackf77_dgetrs (which I had to add to the MAGMA headers as they were not there).

Bear in mind that I am using gotoblas2 and running 4 cores on my CPU.

In each case I have wrapped the call in timer calls and report the elapsed time, so these values are times, not flop rates.

Code:
        start = get_current_time();
        lapackf77_dgetrs(trans_str, &N, &NRHS, h_A, &lda, ipiv, h_X, &ldx, &info );
        end = get_current_time();

        h_time = GetTimerValue(start, end);
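
The same wrap-and-time pattern can be sketched in portable C++ as follows. This is only an illustration: `time_ms` is a hypothetical helper that is not part of MAGMA, and it uses `std::chrono` rather than MAGMA's `get_current_time` / `GetTimerValue`.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not a MAGMA function): run `work` once and return
// the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();   // the call being timed, e.g. a getrs call
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

A call such as `time_ms([&] { lapackf77_dgetrs(trans_str, &N, &NRHS, h_A, &lda, ipiv, h_X, &ldx, &info); })` would then play the role of the three timing lines above. For a GPU routine, the device work would need to have completed before the second clock reading for the number to be meaningful.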


Results for dgetrs. The two extra numbers at the end of each row are the LAPACK time and then the MAGMA time.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   19.47          25.11   4.197521e-18        0.735       4.960
 1920  1920   24.91          47.15   3.660122e-18        3.775      13.188
 3072  3072   26.02          60.89   4.107697e-18        8.827      25.595
 4032  4032   25.71          64.79   3.820998e-18       16.558      40.142
 4992  4992   25.91          66.93   3.624484e-18       21.497      58.968
 5952  5952   26.84          68.55   3.530351e-18       31.591      81.068
 7104  7104   26.46          69.52   3.407946e-18       43.230     110.173
 8064  8064   26.12          70.65   2.741031e-18       79.834     137.785
 9024  9024   26.50          71.26   2.611909e-18       71.255     170.356
 9984  9984   26.48          71.38   2.544773e-18       82.859     206.296


I have run this twice as I was puzzled by the lapack value at 9024 which is below the previous case.

Results for zgetrs.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   22.91          45.56    1.102403e-17        2.076      22.102
 1920  1920   26.98          59.43    1.111488e-17        8.005      81.726
 3072  3072   26.91          62.95    1.082018e-17       20.065     204.777
 4032  4032   27.67          67.07    1.066401e-17       42.218     348.620
 4992  4992   27.81          68.36    1.039162e-17       44.185     531.328
 5952  5952   27.36          68.98    1.034474e-17       67.249     751.652
 7104  7104   27.52          69.56    1.008222e-17       86.253    1066.505
 8064  8064   26.89          69.80    1.010409e-17      177.652    1378.809
 9024  9024   26.75          70.10    9.962978e-18      138.896    1720.921


This case is the one of interest in my other work. Again there is something odd about the 9024 result on lapack.

I have also done this including the transfer times, but they do not make much difference. For zgetrs the MAGMA routine is about 10 times slower than the LAPACK routine, which explains my problems with the case I have been working on.

zgetrs spends most of its time in ztrsm, which I think is a CUBLAS routine, whereas for double precision you have written your own dtrsm.

The double precision results are not too bad, but the double complex ones are amazingly unhelpful.

Is there something in your work programme on this? I think there should be a warning somewhere.

I will continue to look at zero copy to see where it can help, but in terms of my other problem I am back to what I called strategy 1, use MAGMA for zgetrf but not for zgetrs, unless I am missing something here.

Best wishes

John

P.S. Modified codes available on request.
fletchjp
 
Posts: 175
Joined: Mon Dec 27, 2010 7:29 pm

Re: Timing of dgetrs_gpu and zgetrs_gpu

Post by fletchjp » Sat Mar 26, 2011 1:26 pm

I have continued this work as follows. I have converted the timings to GFlops for calculating the right hand side values.
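The post does not say which operation count was used for the conversion. A minimal sketch, assuming the usual LAPACK-style count for getrs (n*n multiplications and n*(n-1) additions per right-hand side, with a complex multiplication counted as 6 real flops and a complex addition as 2) and assuming the timer values are in milliseconds, could look like this. All function names are illustrative.

```cpp
#include <cassert>

// Assumed operation count for a real getrs solve with nrhs right-hand sides:
// n*n multiplications plus n*(n-1) additions per right-hand side.
double getrs_flops_real(double n, double nrhs) {
    return nrhs * n * n + nrhs * n * (n - 1.0);
}

// Assumed count for the complex case: a complex multiplication is taken
// as 6 real flops and a complex addition as 2.
double getrs_flops_complex(double n, double nrhs) {
    return 6.0 * nrhs * n * n + 2.0 * nrhs * n * (n - 1.0);
}

// Convert an operation count and an elapsed time in milliseconds to GFlop/s.
double gflops(double flops, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return flops / (elapsed_ms * 1.0e-3) / 1.0e9;
}
```

Under these assumptions, `gflops(getrs_flops_real(9984, 1), 82.859)` comes out at roughly 2.4, which is in line with the LAPACK values for one right-hand side reported in the tables that follow.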

I will report some values below but first the general conclusions.

1. The MAGMA dtrsm is much better than the CUBLAS dtrsm in all cases.
2. There is not yet a MAGMA ztrsm so CUBLAS ztrsm is used.
3. All the routines perform better when there are more right hand sides to be solved.

In the problem I am working with, the case is complex and the right hand sides are defined one at a time, so I am in the worst of the worst cases, for which LAPACK on the CPU is the quicker solution, unless I can find some way to reorganise the calculations.

I am wondering whether the one RHS case needs to be specially handled via ztrsv/dtrsv. This turns out to be a VERY good thought - see the end for the comparative results using CUBLAS dtrsv.

Please would you add dtrsv and ztrsv as well as ztrsm to your todo list.
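The idea of special-casing one right hand side can be sketched on the CPU as below. This is not the MAGMA or CUBLAS code: `solve_single` stands in for the trsv-style path, `solve_block` for the trsm-style path, and `getrs_sketch` is a hypothetical no-transpose driver that assumes the column-major LU factors and 1-based pivots that getrf produces.

```cpp
#include <cstddef>
#include <utility>

// Column-major element offset, as used by LAPACK-style routines.
inline std::size_t idx(int row, int col, int ld) {
    return static_cast<std::size_t>(col) * ld + row;
}

// trsv-style path: solve L*U*x = b in place for a single vector.
// L (unit lower) and U (upper) are packed together in `lu`.
void solve_single(int n, const double* lu, int lda, double* x) {
    for (int j = 0; j < n; ++j)               // forward substitution with L
        for (int i = j + 1; i < n; ++i)
            x[i] -= lu[idx(i, j, lda)] * x[j];
    for (int j = n - 1; j >= 0; --j) {        // back substitution with U
        x[j] /= lu[idx(j, j, lda)];
        for (int i = 0; i < j; ++i)
            x[i] -= lu[idx(i, j, lda)] * x[j];
    }
}

// trsm-style path: the same two solves applied to a block of nrhs columns.
void solve_block(int n, int nrhs, const double* lu, int lda, double* b, int ldb) {
    for (int j = 0; j < n; ++j)
        for (int i = j + 1; i < n; ++i)
            for (int k = 0; k < nrhs; ++k)
                b[idx(i, k, ldb)] -= lu[idx(i, j, lda)] * b[idx(j, k, ldb)];
    for (int j = n - 1; j >= 0; --j)
        for (int k = 0; k < nrhs; ++k) {
            b[idx(j, k, ldb)] /= lu[idx(j, j, lda)];
            for (int i = 0; i < j; ++i)
                b[idx(i, k, ldb)] -= lu[idx(i, j, lda)] * b[idx(j, k, ldb)];
        }
}

// Hypothetical driver: apply the row interchanges, then pick the
// single-vector path when there is exactly one right hand side.
void getrs_sketch(int n, int nrhs, const double* lu, int lda,
                  const int* ipiv, double* b, int ldb) {
    for (int i = 0; i < n; ++i) {
        const int p = ipiv[i] - 1;            // pivots are 1-based
        if (p != i)
            for (int k = 0; k < nrhs; ++k)
                std::swap(b[idx(i, k, ldb)], b[idx(p, k, ldb)]);
    }
    if (nrhs == 1)
        solve_single(n, lu, lda, b);
    else
        solve_block(n, nrhs, lu, lda, b, ldb);
}
```

On the GPU the two branches would call a trsv and a trsm routine respectively. The sketch only shows where the `nrhs == 1` test would sit and says nothing about the speed of either path.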

Now some results. These show that the MAGMA routine is fine for the case of many righthand sides.

dgetrs_gpu using CUBLAS dtrsm, 1 righthandside

My GFlop/s values are the last two columns, for LAPACK and MAGMA.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   21.79          23.17   4.197521e-18        2.602       0.205
 1920  1920   25.02          47.14   3.660122e-18        1.932       0.243
 3072  3072   25.91          60.92   4.107697e-18        2.078       0.259
 4032  4032   26.30          64.81   3.820998e-18        1.972       0.246
 4992  4992   26.23          66.88   3.624484e-18        2.324       0.264
 5952  5952   26.09          68.56   3.530351e-18        2.361       0.264
 7104  7104   26.23          69.50   3.407946e-18        2.399       0.260
 8064  8064   27.37          70.68   2.741031e-18        1.630       0.264
 9024  9024   27.03          71.17   2.611909e-18        2.268       0.264
 9984  9984   27.28          71.37   2.544773e-18        2.447       0.264


Not impressive. I think those are the lowest GPU results I have seen.

dgetrs using MAGMA dtrsm, 1 righthand side.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   13.37          25.09   4.197521e-18        0.265       0.371
 1920  1920   25.23          47.02   3.660122e-18        1.972       0.595
 3072  3072   25.87          60.91   4.107697e-18        2.130       0.693
 4032  4032   26.24          64.88   3.820998e-18        1.971       0.810
 4992  4992   26.27          67.18   3.624484e-18        2.318       0.865
 5952  5952   26.70          68.53   3.530351e-18        2.364       0.877
 7104  7104   26.33          69.49   3.407946e-18        2.396       0.916
 8064  8064   26.74          70.66   2.741031e-18        1.626       0.946
 9024  9024   27.11          71.27   2.611909e-18        2.272       0.959
 9984  9984   27.19          71.35   2.544773e-18        2.423       0.977


Better, but the GPU is still slower than the LAPACK routine.

Now the same with 10 right hand sides.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024

NRHS = 10


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   21.75          21.19   4.197521e-18        9.418       3.626
 1920  1920   24.74          44.59   3.621168e-18        9.330       5.856
 3072  3072   25.40          60.30   3.975995e-18       10.428       7.106
 4032  4032   25.73          64.48   3.854173e-18        9.867       7.736
 4992  4992   25.75          66.90   3.633974e-18        7.814       8.296
 5952  5952   25.87          68.14   3.520094e-18        7.308       8.885
 7104  7104   26.77          69.39   3.385858e-18        7.452       9.111
 8064  8064   27.15          70.58   2.681532e-18        7.214       9.427
 9024  9024   27.44          71.20   2.611202e-18        8.469       9.485
 9984  9984   27.31          71.31   2.547179e-18        8.258       9.685


Now the GPU overtakes LAPACK at larger sizes.

And 100 righthand sides:

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024

NRHS = 100


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   21.80          22.49   4.197521e-18       24.915      24.488
 1920  1920   24.69          44.67   3.626896e-18       23.971      35.903
 3072  3072   25.84          59.92   4.114784e-18       24.510      42.520
 4032  4032   25.71          64.56   3.783869e-18       22.867      43.994
 4992  4992   26.54          66.96   3.609227e-18       22.782      46.268
 5952  5952   26.34          68.29   3.495318e-18       23.051      48.125
 7104  7104   26.70          69.44   3.364697e-18       23.063      48.983
 8064  8064   26.75          70.43   2.694515e-18       23.345      49.752
 9024  9024   27.13          71.10   2.623542e-18       23.510      50.089
 9984  9984   26.74          71.30   2.543488e-18       23.445      50.929


The GPU is now better, and approaching the speeds for the dgetrf part.

Now 1000 righthandsides.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024

NRHS = 1000


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   19.64          21.08   4.197521e-18       29.570      46.513
 1920  1920   25.32          44.24   3.578105e-18       28.758      55.446
 3072  3072   25.81          60.59   4.063590e-18       27.939      60.757
 4032  4032   26.07          64.59   3.827668e-18       27.835      61.852
 4992  4992   26.26          66.65   3.594493e-18       28.636      63.287
 5952  5952   27.21          68.34   3.467570e-18       28.204      64.415
 7104  7104   26.60          69.40   3.376634e-18       28.439      65.292
 8064  8064   26.56          70.51   2.703956e-18       28.427      66.031
 9024  9024   27.64          71.10   2.607001e-18       28.215      66.455
 9984  9984   27.23          71.29   2.531860e-18       28.037      66.766


I don't think it is going to get better. The MAGMA version is now outperforming my 4 core LAPACK blas.

Here are the results for one righthand side, now using CUBLAS dtrsv in place of MAGMA dtrsm.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dgetrf_gpu -M 1024 -N 1024

NRHS = 1


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   21.80          25.10   4.197521e-18        2.566       0.520
 1920  1920   24.96          46.79   3.660122e-18        1.945       0.812
 3072  3072   26.38          60.82   4.107697e-18        2.141       1.287
 4032  4032   27.03          64.69   3.820998e-18        2.056       1.577
 4992  4992   26.45          66.94   3.624484e-18        2.314       1.959
 5952  5952   26.15          68.51   3.530351e-18        2.367       2.177
 7104  7104   26.67          69.48   3.407946e-18        2.395       2.622
 8064  8064   27.32          70.61   2.741031e-18        1.670       2.899
 9024  9024   26.76          71.13   2.611909e-18        2.295       3.147
 9984  9984   27.65          71.33   2.544773e-18        2.412       3.344


This is a big improvement on the previous results.

I am doing the same for the complex version and will add the results later.

John

Re: Timing of dgetrs_gpu and zgetrs_gpu

Post by fletchjp » Sun Mar 27, 2011 6:48 am

Results for the same comparison on zgetrs_gpu

Unchanged routine:

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   23.98          46.15    1.102403e-17        1.191       0.333
 1920  1920   26.32          59.13    1.111488e-17        3.676       0.361
 3072  3072   26.57          62.95    1.082018e-17        4.366       0.369
 4032  4032   28.25          67.20    1.066401e-17        3.049       0.373
 4992  4992   27.50          68.34    1.039162e-17        4.517       0.375
 5952  5952   27.91          68.99    1.034474e-17        4.266       0.377
 7104  7104   28.57          69.53    1.008222e-17        4.659       0.378
 8064  8064   28.72          69.81    1.010409e-17        2.924       0.377
 9024  9024   28.68          70.10    9.962978e-18        4.715       0.378
 9984  9984   28.40          70.24    1.002366e-17        4.743       0.379


This is clearly limited by something.

With 10 righthandsides:

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024

NRHS = 10


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   23.65          47.05    1.102403e-17        4.203       2.972
 1920  1920   26.81          59.18    1.106640e-17       12.798       3.162
 3072  3072   27.03          63.07    1.063146e-17       12.685       3.260
 4032  4032   27.26          67.14    1.048040e-17       11.310       3.293
 4992  4992   27.66          68.30    1.039150e-17       11.177       3.311
 5952  5952   28.01          69.00    1.018645e-17       10.932       3.323
 7104  7104   27.96          69.56    1.016385e-17       10.872       3.334
 8064  8064   28.35          69.80    1.001095e-17       10.827       3.333
 9024  9024   27.77          70.08    9.938399e-18       10.726       3.348
 9984  9984   27.84          70.25    9.879407e-18       12.103       3.355


With 100 righthandsides:

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024

NRHS = 100


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   23.95          45.73    1.102403e-17       27.501      20.346
 1920  1920   26.78          59.29    1.108298e-17       26.878      21.859
 3072  3072   27.38          63.02    1.081718e-17       25.915      22.551
 4032  4032   27.50          67.17    1.067282e-17       25.913      22.860
 4992  4992   27.88          68.31    1.023099e-17       26.557      22.917
 5952  5952   28.56          68.97    1.026714e-17       26.893      23.121
 7104  7104   28.44          69.54    1.016349e-17       26.413      23.168
 8064  8064   28.02          69.78    1.006641e-17       25.373      23.078
 9024  9024   27.69          70.09    9.909059e-18       25.824      23.277
 9984  9984   27.82          70.28    9.917569e-18       25.997      23.163



With 1000 righthandsides:

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024

NRHS = 1000


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   23.37          46.94    1.102403e-17       30.029      34.842
 1920  1920   27.45          59.08    1.093841e-17       29.500      37.468
 3072  3072   27.58          63.05    1.075738e-17       29.423      38.531
 4032  4032   28.61          67.15    1.053005e-17       29.204      38.967
 4992  4992   28.07          68.31    1.029715e-17       29.066      39.195
 5952  5952   28.67          68.98    1.032660e-17       28.336      39.447
 7104  7104   28.33          69.53    1.695558e-18       27.936      44.262


Cases beyond this size fail for lack of GPU memory.

Finally, the modified version using ztrsv for 1 righthandside.

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_zgetrf_gpu -M 1024 -N 1024

NRHS = 1


  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  960   960   23.98          45.78    1.102403e-17        3.436       1.323
 1920  1920   26.48          59.25    1.111488e-17        3.835       2.488
 3072  3072   27.08          62.92    1.082018e-17        4.202       3.648
 4032  4032   28.40          67.17    1.066401e-17        3.317       4.855
 4992  4992   28.42          68.34    1.039162e-17        4.613       5.758
 5952  5952   28.30          68.98    1.034474e-17        4.330       6.711
 7104  7104   28.34          69.55    1.008222e-17        4.660       7.788
 8064  8064   28.12          69.77    1.010409e-17        3.004       8.499
 9024  9024   27.96          70.12    9.962978e-18        4.685       9.132
 9984  9984   27.96          70.28    1.002366e-17        4.657       9.979


I have sent a copy of the modified zgetrs_gpu.cpp to Stan.

I hope this helps.

John

P.S. There are lots of other places where calls to Xtrsv can replace Xtrsm when the nrhs==1. There is almost always a gain. The only exception I have found is sgemv where the MAGMA strsm is so good that CUBLAS strsv cannot better it. The best gains are on complex cases, where trsm is still CUBLAS.

Re: Timing of dgetrs_gpu and zgetrs_gpu

Post by fletchjp » Wed Apr 06, 2011 4:00 am

Have you picked up on these ideas for better performance with one RHS?

John

Re: Timing of dgetrs_gpu and zgetrs_gpu

Post by mateo70 » Wed Apr 06, 2011 2:43 pm

Yes, we discussed this with Stan, but apparently the version currently in the code was the one giving the best performance.
I will add the test case for one RHS, though.

Mathieu
mateo70
 
Posts: 41
Joined: Tue Mar 08, 2011 12:38 pm

Re: Timing of dgetrs_gpu and zgetrs_gpu

Post by fletchjp » Wed Apr 06, 2011 4:27 pm

Thank you.

I will check this out when I get the new release, as it is central to one of my applications. My case is complex, so I expect it will show a big gain if you have done ztrsm, as we are still using the CUBLAS one for that in RC4.

John

Re: Timing of dgetrs_gpu and zgetrs_gpu

Post by mateo70 » Thu Apr 14, 2011 1:22 am

I just added the switch between the two functions, trsm and trsv.
We still have to implement a MAGMA version.

Mathieu

