MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Open discussion for MAGMA

Postby JulienECE » Tue Mar 16, 2010 10:55 am

I have benchmarked the subroutine sgetrs, which solves a linear system after an LU factorization (sgetrf). I used this hardware:
an NVIDIA GeForce 9600M GPU (16 cores, 1250 MHz, 512 MB DRAM) and an Intel Core 2 Duo CPU (3.06 GHz, 4 GB DDR3).

The results (Time is in milliseconds):
[image: sgetrs benchmark chart]

I don't understand why the MAGMA subroutine takes more time than the LAPACK (ACML) subroutine.


Has anyone tried to use this subroutine?

This is the benchmark of the subroutine SGETRF (MAGMA & LAPACK):
[image: SGETRF benchmark chart]

Thank you for your answer !
JulienECE
 
Posts: 4
Joined: Tue Mar 16, 2010 7:30 am

Re: MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Postby Stan Tomov » Thu Mar 18, 2010 12:45 pm

Hello,
Your benchmark must be similar to testing_sgesv_gpu from the MAGMA distribution. Do you get lower than expected performance with testing_sgesv_gpu as well? If yes, the reason may be that the matrix that you factor does not start at an address divisible by 16*sizeof(float). If no, you probably lose performance in sgetrs. MAGMA's sgetrs uses magma_strsm (not cublasStrsm) and a parallel implementation of slaswp.
Regards,
Stan
Stan Tomov
 
Posts: 247
Joined: Fri Aug 21, 2009 10:39 pm

Re: MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Postby JulienECE » Thu Mar 18, 2010 1:27 pm

These are almost the same results I get with Vasily Volkov's sgetrf routine. So the LU factorization from MAGMA is very powerful!
I need to factorize the matrix once and then solve my linear system N times (N = 1000).

Which function provides the best performance for solving a linear system? magma_sgetrs_gpu? Because it seems too slow!

Julien


    N    GPU GFlop/s    || b-Ax || / ||A||    Time (ms)
========================================================
1024 14.84 2.513783e-07 48.384000
2048 24.54 2.111665e-06 233.674000
3072 33.97 5.364181e-06 569.561000
4032 36.58 6.626522e-07 1195.526000
5184 37.98 6.161365e-07 2446.879000
6016 38.34 2.017927e-06 3787.724000
7040 38.91 1.475847e-06 5980.986000
8064 39.38 3.561859e-06 8880.406000
9088 39.50 1.063919e-06 12673.880000
10112 39.91 8.097373e-07 17275.719000

Re: MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Postby JulienECE » Thu Mar 18, 2010 2:42 pm

Does anyone have an example of how to use the routine magma_sgetrs_gpu("N", N, NRHS, d_A, dlda, IPIV, d_B, LDB, INFO, h_work_M_S)?

Especially, the sizes of the arrays?

Thanks a lot!

Julien

Re: MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Postby Stan Tomov » Thu Mar 18, 2010 4:56 pm

The example is in testing_sgesv_gpu.cpp. I see you gave the performance of testing_sgesv_gpu, and it seems good as it goes up to 39.91 GFlop/s for magma_sgetrf_gpu followed by magma_sgetrs_gpu (with 1 RHS), vs 40.17 GFlop/s for just the factorization. Do you mean it gets slow when you do 1000 solves? Do you have the 1000 RHSs at once, or do you get them (and solve) one by one in some iterative process? Also, I see that the current MAGMA sgetrs_gpu distribution does the following:
 if (notran) {
      /* Solve A * X = B. */
      cublasGetMatrix( n, nrhs, sizeof(float), b, n, hwork, n );
      int k1 = 1;
      int k2 = n;
      int k3 = 1;
      slaswp_( &nrhs, hwork, &ldb, &k1, &k2, ipiv, &k3 );
      cublasSetMatrix( n, nrhs, sizeof(float), hwork, n, b, n );

      magmablas_strsm( 'L', 'L', 'N', 'U', n, nrhs, 1.0, a, lda, b, ldb );
      magmablas_strsm( 'L', 'U', 'N', 'N', n, nrhs, 1.0, a, lda, b, ldb );
    } else {
  . . .

If you have your RHSs on the CPU, you can do the slaswp at once on the CPU and send the data only once to the GPU for the triangular solves. We are preparing the new release, and one of the changes is to do the slaswp directly on the GPU (so it is much faster and there are no data copies). Actually, this is already done in the current distribution in the mixed precision iterative refinement LU, where we factor once in single precision and use the factorization for many triangular solves in an iterative process (entirely on the GPU, including the swapping).

Re: MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Postby JulienECE » Fri Mar 19, 2010 2:29 pm

Hi Stan,

Do you have the 1000 RHSs at once, or do you get them (and solve) one by one in some iterative process?


It's an iterative process; in other words, for Imax = 1000 and N = 10112, the global time is 1000 * 184.706 ms ≈ 184 s.
But in reality, to increase the accuracy of my calculations I have to maximize the size of A. And actually, I don't know the maximum size of A, and so of the system (with NRHS = 1), that I can allocate on a GTX 295 with almost 1792 MB GDDR3.

If you have your RHSs on the CPU you can do the slaswp at once on the CPU and send the data only once to the GPU for the triangular solves.


As I said, I have to compute the LU factorization just once and then iterate the linear-system solve N times. So I can do slaswp just once, after the LU factorization.
I benchmarked the slaswp CPU routine together with the data copies.

      start = get_current_time();
      cublasGetMatrix( N, NRHS, sizeof(float), B, N, h_work_M_S, N );
      int k1 = 1;
      int k2 = N;
      int k3 = 1;
      slaswp_( &NRHS, h_work_M_S, &LDB, &k1, &k2, IPIV, &k3 );
      cublasSetMatrix( N, NRHS, sizeof(float), h_work_M_S, N, B, N );
      end = get_current_time();

    N      GPU GFlop/s     time (ms)
========================================
 1024        16695.93      0.043000
 2048        95583.53      0.060000
 3072       268697.59      0.072000
 4032       546642.44      0.080000
 5184       978208.38      0.095000
 6016      1344698.75      0.108000
 7040      1972103.62      0.118000
 8064      2649402.25      0.132000
 9088      3405177.00      0.147000
10112      1318399.62      0.523000


So this has no significant influence on the global speed of the routine: for N = 1024, 0.043 ms out of 4.059 ms.


And my last question: can you explain to me what the hwork array is?
I know only this, from the MAGMA guide:
HWORK (workspace) REAL array, dimension N*NRHS



PS:
Performance of SGETRS_GPU with NRHS = 1:
    N    GPU GFlop/s    || b-Ax || / ||A||    Time (ms)
=======================================================
 1024       176.87         2.513783e-07        4.059000
 2048       517.23         2.111665e-06       11.088000
 3072       911.57         5.364181e-06       21.223000
 4032      1282.67         6.626522e-07       34.094000
 5184      1748.31         6.161365e-07       53.154000
 6016      2092.16         2.017927e-06       69.415000
 7040      2494.92         1.475847e-06       93.273000
 8064      2919.91         3.561859e-06      119.771000
 9088      3324.24         1.063919e-06      150.579000
10112      3733.08         8.097373e-07      184.706000

Re: MAGMA_SGETRS_GPU less powerful than SGETRS (ACML) !!!

Postby Stan Tomov » Sat Mar 20, 2010 9:03 pm

Argument hwork in magma_sgetrs_gpu is workspace in CPU memory. If you want to solve 1 RHS, hwork should point to at least N single-precision floating-point numbers. Can you try sgetrs_gpu on problems of sizes divisible by 32? In MAGMA 0.2 we fall back to cublas strsm if N is not divisible by 32, which is slower. If this is the problem, I can send you a pre-release version that works for all sizes. On a GTX 280, for example, I solve a problem of size 9000 in 11 ms (vs 152 ms in your case, according to the table that you provide).

