I decided to start in C++. I adapted testing_zgetrf_gpu.cpp so that, after the factorisation, it goes on to call magma_zgetrs_gpu, and I timed the back substitution. I did the same for testing_dgetrf_gpu.cpp.
For comparison I also added calls to the LAPACK routines lapackf77_zgetrs and lapackf77_dgetrs, which I had to add to the MAGMA headers as they were not there.
Bear in mind that I am using gotoblas2 and running 4 cores on my CPU.
In each case I have wrapped the call in timer calls and report the elapsed time, so the extra figures are times, not flop rates:
start = get_current_time();
lapackf77_dgetrs(trans_str, &N, &NRHS, h_A, &lda, ipiv, h_X, &ldx, &info );
end = get_current_time();
h_time = GetTimerValue(start, end);
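For anyone who wants to reproduce the same wrap-and-time pattern outside the MAGMA testing framework, here is a minimal sketch in standard C++. The helper name time_ms is my own invention for illustration and is not part of MAGMA; it assumes wall-clock milliseconds are the quantity of interest.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not a MAGMA function): runs `call` exactly once and
// returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& call) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(call)();  // the routine being timed
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

The LAPACK call above would then be wrapped as a lambda, for example time_ms([&] { lapackf77_dgetrs(trans_str, &N, &NRHS, h_A, &lda, ipiv, h_X, &ldx, &info); }). When timing a GPU routine this way, the timer only measures what has completed by the time the call returns, so a device synchronisation inside the lambda may be needed.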
Results for dgetrs. The two extra numbers at the end of each row, which the column header does not label, are the getrs times: the LAPACK value first and then the MAGMA value.
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory
Usage:
testing_dgetrf_gpu -M 1024 -N 1024
M N CPU GFlop/s GPU GFlop/s ||PA-LU||/(||A||*N)
============================================================
960 960 19.47 25.11 4.197521e-18 0.735 4.960
1920 1920 24.91 47.15 3.660122e-18 3.775 13.188
3072 3072 26.02 60.89 4.107697e-18 8.827 25.595
4032 4032 25.71 64.79 3.820998e-18 16.558 40.142
4992 4992 25.91 66.93 3.624484e-18 21.497 58.968
5952 5952 26.84 68.55 3.530351e-18 31.591 81.068
7104 7104 26.46 69.52 3.407946e-18 43.230 110.173
8064 8064 26.12 70.65 2.741031e-18 79.834 137.785
9024 9024 26.50 71.26 2.611909e-18 71.255 170.356
9984 9984 26.48 71.38 2.544773e-18 82.859 206.296
I have run this twice, as I was puzzled by the LAPACK value at 9024 (71.255), which is below the value for the previous case (79.834 at 8064).
Results for zgetrs.
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory
Usage:
testing_zgetrf_gpu -M 1024 -N 1024
M N CPU GFlop/s GPU GFlop/s ||PA-LU||/(||A||*N)
============================================================
960 960 22.91 45.56 1.102403e-17 2.076 22.102
1920 1920 26.98 59.43 1.111488e-17 8.005 81.726
3072 3072 26.91 62.95 1.082018e-17 20.065 204.777
4032 4032 27.67 67.07 1.066401e-17 42.218 348.620
4992 4992 27.81 68.36 1.039162e-17 44.185 531.328
5952 5952 27.36 68.98 1.034474e-17 67.249 751.652
7104 7104 27.52 69.56 1.008222e-17 86.253 1066.505
8064 8064 26.89 69.80 1.010409e-17 177.652 1378.809
9024 9024 26.75 70.10 9.962978e-18 138.896 1720.921
This case is the one of interest in my other work. Again there is something odd about the LAPACK result at 9024: 138.896, which is below the 177.652 at 8064.
I have also done this including the transfer times, but they do not make much difference. The MAGMA routine is about 10 times slower than the LAPACK routine (roughly 8 to 12 times across the sizes above), which explains my problems with the case I have been working on.
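As a quick check on that factor, here is a small sketch that computes the MAGMA/LAPACK time ratio for each row of the zgetrs table. The timings are copied from the output above; the struct and function names are made up for this illustration.

```cpp
#include <algorithm>
#include <vector>

// One row of the zgetrs results: matrix size and the two getrs timings.
struct Timing {
    int n;
    double lapack;  // LAPACK zgetrs time from the table
    double magma;   // MAGMA zgetrs time from the table
};

// Timings copied from the testing_zgetrf_gpu output above.
std::vector<Timing> zgetrs_timings() {
    return {
        {960, 2.076, 22.102},     {1920, 8.005, 81.726},
        {3072, 20.065, 204.777},  {4032, 42.218, 348.620},
        {4992, 44.185, 531.328},  {5952, 67.249, 751.652},
        {7104, 86.253, 1066.505}, {8064, 177.652, 1378.809},
        {9024, 138.896, 1720.921},
    };
}

// How many times slower MAGMA is than LAPACK for one row.
double slowdown(const Timing& t) { return t.magma / t.lapack; }

struct Range {
    double min;
    double max;
};

// Smallest and largest slowdown over all rows ({0, 0} if there are none).
Range slowdown_range(const std::vector<Timing>& rows) {
    if (rows.empty()) return {0.0, 0.0};
    Range r{slowdown(rows.front()), slowdown(rows.front())};
    for (const Timing& t : rows) {
        r.min = std::min(r.min, slowdown(t));
        r.max = std::max(r.max, slowdown(t));
    }
    return r;
}
```

The smallest ratio comes from the 8064 row, where the LAPACK time is unusually high, and the largest from the 9024 row.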
zgetrs spends most of its time in ztrsm, which I think is a CUBLAS routine, whereas you have written your own dtrsm.
The double precision results are not too bad, but the double complex ones are amazingly unhelpful.
Is there something in your work programme on this? I think there should be a warning somewhere.
I will continue to look at zero copy to see where it can help, but for my other problem I am back to what I called strategy 1: use MAGMA for zgetrf but not for zgetrs, unless I am missing something here.
Best wishes
John
P.S. Modified codes available on request.