trsv_kernel-1 running in a single threadblock?

Postby lanbanger » Wed Dec 05, 2012 12:44 pm

Hi, is there any particular reason why trsv_kernel-1 (and also trsv_kernel-0) runs in a single thread block of 512 threads? It accounts for around 45-50% of my program's runtime, so I wondered whether this is a mistake. I'm seeing it in MAGMA 1.1, although I'm about to recompile with 1.3 to see if the result is the same.

EDIT: still the same with MAGMA 1.3
lanbanger
 
Posts: 9
Joined: Wed Feb 16, 2011 8:36 am

Re: trsv_kernel-1 running in a single threadblock?

Postby mgates3 » Wed Dec 05, 2012 3:03 pm

Can you give a little context here: is this trsv part of a larger function such as gesv or getrs? What matrix dimensions are you using? MAGMA uses the underlying CUBLAS trsv from NVIDIA; we do not implement our own trsv.
-mark
mgates3
 
Posts: 329
Joined: Fri Jan 06, 2012 2:13 pm

Re: trsv_kernel-1 running in a single threadblock?

Postby lanbanger » Wed Dec 05, 2012 5:16 pm

Sure. I presume it's being called by spotrs. The matrix dimensions are 3072 x 3072.

I see this in ComputeProf:

Grid size    Thread block size
[1 1 1]      [512 1 1]
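
For context, a Cholesky solve such as spotrs amounts to two triangular solves with the factor: a forward substitution followed by a backward substitution. The following is a minimal pure-Python sketch of that structure, assuming a lower-triangular factor L with A = L L^T. It is not MAGMA or CUBLAS code, and the function names are illustrative only. Each unknown depends on all the previous ones, which limits how much of a single-RHS solve can run in parallel; the thread does not confirm whether this is why the CUBLAS kernel uses one thread block.

```python
def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L (the operation a trsv performs)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        # y[i] needs every earlier y[j], so the rows form a dependency chain.
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def backward_substitution_transposed(L, y):
    """Solve L^T x = y, reading L by columns so no explicit transpose is built."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[j][i] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def potrs_single_rhs(L, b):
    """Solve A x = b given the Cholesky factor L of A (hypothetical helper)."""
    return backward_substitution_transposed(L, forward_substitution(L, b))


# Small check: L = [[2, 0], [1, 3]] gives A = L L^T = [[4, 2], [2, 10]].
L = [[2.0, 0.0], [1.0, 3.0]]
A = [[4.0, 2.0], [2.0, 10.0]]
b = [8.0, 22.0]
x = potrs_single_rhs(L, b)

# Residual of A x - b, to confirm the two solves together invert A.
residual = [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]
```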
lanbanger
 
Posts: 9
Joined: Wed Feb 16, 2011 8:36 am

Re: trsv_kernel-1 running in a single threadblock?

Postby lanbanger » Thu Dec 06, 2012 5:02 am

Ah, it seems that the triangular solve in CUBLAS is not particularly efficient and is beaten by MKL in many instances. However, the author of these slides proposes a new solve method that reportedly far outperforms both MKL and CUBLAS: http://www.oerc.ox.ac.uk/downloads/pres ... /dtrsv.pdf
lanbanger
 
Posts: 9
Joined: Wed Feb 16, 2011 8:36 am

Re: trsv_kernel-1 running in a single threadblock?

Postby mgates3 » Mon Dec 10, 2012 7:19 pm

I did some testing and found some interesting things. First, CUBLAS trsv appears to be much slower than CUBLAS trsm for a single RHS, which seems odd to me. It's really easy to always use trsm in src/dpotrs_gpu.cpp. In the results below, CUBLAS trsv was slower than MKL at every size, whereas CUBLAS trsm is faster than MKL from about N = 4032 upward (at N = 3072 the two are roughly equal, and MKL is faster at smaller sizes).

In terms of flops/sec, we don't expect potrs with a single RHS to reach a high rate, since it is a memory-bound operation. On the other hand, it shouldn't take very long, since it's only O(n^2) operations, whereas potrf is O(n^3) operations. I put timers in src/dposv.cpp and src/dposv_gpu.cpp. Below are results on two systems.
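
To make the O(n^2) versus O(n^3) point concrete, here is a small sketch using the standard leading-order operation counts (about n^3/3 flops for potrf and about 2 n^2 flops per right-hand side for potrs). The byte count assumes double precision and that each of the two triangular solves reads the stored triangle once; that is a simplification for illustration, not a measured figure.

```python
def potrf_flops(n):
    # Leading-order flop count for a Cholesky factorization.
    return n**3 / 3.0


def potrs_flops(n, nrhs=1):
    # Two triangular solves, about n^2 flops each per right-hand side.
    return 2.0 * n**2 * nrhs


n = 3072  # the matrix size mentioned earlier in the thread

# How many times more arithmetic the factorization does than the solve (n / 6).
work_ratio = potrf_flops(n) / potrs_flops(n)

# Assumed traffic: the n(n+1)/2 stored entries are read once per solve,
# at 8 bytes per double.
bytes_read = 2 * (n * (n + 1) // 2) * 8

# Arithmetic intensity of the single-RHS solve, in flops per byte.
flops_per_byte = potrs_flops(n) / bytes_read

print(f"potrf / potrs work ratio: {work_ratio:.0f}")
print(f"potrs arithmetic intensity: {flops_per_byte:.2f} flop/byte")
```

With only about a quarter of a flop per byte read, the solve is limited by memory bandwidth rather than arithmetic throughput, so a low GFlop/s figure for potrs alone is expected.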

-mark

The first system has two 8-core Intel CPUs @ 2.60 GHz (Sandy Bridge) and a Kepler GPU (Tesla K20c).

using CUBLAS trsv
bunsen ~/magma-1.2.1/testing> ./testing_dposv_gpu -R 1
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
N NRHS GPU GFlop/s (sec) ||B - AX|| / ||A||*||X||
===========================================================
1024 1 15.87 ( 0.02) 2.68e-15 magma_dpotrf_gpu 0.015412 sec, magma_dpotrs_gpu 0.007290 sec
2048 1 68.20 ( 0.04) 4.14e-15 magma_dpotrf_gpu 0.020505 sec, magma_dpotrs_gpu 0.021615 sec
3072 1 130.99 ( 0.07) 4.75e-15 magma_dpotrf_gpu 0.044176 sec, magma_dpotrs_gpu 0.029752 sec
4032 1 176.66 ( 0.12) 5.56e-15 magma_dpotrf_gpu 0.080770 sec, magma_dpotrs_gpu 0.043116 sec
5184 1 274.02 ( 0.17) 6.35e-15 magma_dpotrf_gpu 0.116402 sec, magma_dpotrs_gpu 0.053283 sec
6016 1 308.93 ( 0.24) 7.47e-15 magma_dpotrf_gpu 0.170020 sec, magma_dpotrs_gpu 0.065176 sec
7040 1 350.66 ( 0.33) 8.27e-15 magma_dpotrf_gpu 0.257436 sec, magma_dpotrs_gpu 0.074562 sec
8064 1 377.43 ( 0.46) 7.67e-15 magma_dpotrf_gpu 0.374665 sec, magma_dpotrs_gpu 0.088853 sec
9088 1 407.53 ( 0.61) 9.13e-15 magma_dpotrf_gpu 0.516033 sec, magma_dpotrs_gpu 0.098385 sec
10112 1 426.90 ( 0.81) 9.04e-15 magma_dpotrf_gpu 0.694545 sec, magma_dpotrs_gpu 0.113372 sec


using CUBLAS trsm -- why is this faster?
bunsen ~/magma-1.2.1/testing> ./testing_dposv_gpu -R 1
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
N NRHS GPU GFlop/s (sec) ||B - AX|| / ||A||*||X||
===========================================================
1024 1 18.51 ( 0.02) 1.91e-15 magma_dpotrf_gpu 0.016973 sec, magma_dpotrs_gpu 0.002487 sec
2048 1 114.73 ( 0.03) 3.67e-15 magma_dpotrf_gpu 0.020575 sec, magma_dpotrs_gpu 0.004457 sec
3072 1 187.42 ( 0.05) 3.76e-15 magma_dpotrf_gpu 0.044199 sec, magma_dpotrs_gpu 0.007469 sec
4032 1 240.13 ( 0.09) 5.39e-15 magma_dpotrf_gpu 0.080702 sec, magma_dpotrs_gpu 0.010437 sec
5184 1 355.74 ( 0.13) 5.75e-15 magma_dpotrf_gpu 0.116275 sec, magma_dpotrs_gpu 0.014432 sec
6016 1 387.25 ( 0.19) 6.32e-15 magma_dpotrf_gpu 0.169981 sec, magma_dpotrs_gpu 0.017648 sec
7040 1 416.37 ( 0.28) 7.61e-15 magma_dpotrf_gpu 0.257439 sec, magma_dpotrs_gpu 0.022162 sec
8064 1 434.77 ( 0.40) 8.27e-15 magma_dpotrf_gpu 0.375145 sec, magma_dpotrs_gpu 0.027247 sec
9088 1 452.60 ( 0.55) 8.76e-15 magma_dpotrf_gpu 0.520771 sec, magma_dpotrs_gpu 0.032458 sec
10112 1 470.91 ( 0.73) 7.83e-15 magma_dpotrf_gpu 0.694460 sec, magma_dpotrs_gpu 0.037953 sec


using CPU interface and MKL trsv
bunsen ~/magma-1.2.1/testing> ./testing_dposv -R 1
device 0: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
N NRHS GPU GFlop/s (sec) ||B - AX|| / ||A||*||X||
===========================================================
1024 1 17.47 ( 0.02) 2.86e-15 magma_dpotrf 0.019951 sec, mkl dpotrs 0.000650 sec
2048 1 72.90 ( 0.04) 4.36e-15 magma_dpotrf 0.036304 sec, mkl dpotrs 0.003085 sec
3072 1 116.57 ( 0.08) 6.34e-15 magma_dpotrf 0.075577 sec, mkl dpotrs 0.007495 sec
4032 1 152.18 ( 0.14) 6.79e-15 magma_dpotrf 0.131757 sec, mkl dpotrs 0.012051 sec
5184 1 219.13 ( 0.21) 7.45e-15 magma_dpotrf 0.191553 sec, mkl dpotrs 0.020641 sec
6016 1 244.31 ( 0.30) 9.17e-15 magma_dpotrf 0.270517 sec, mkl dpotrs 0.026894 sec
7040 1 269.84 ( 0.43) 9.98e-15 magma_dpotrf 0.394030 sec, mkl dpotrs 0.037406 sec
8064 1 290.71 ( 0.60) 9.93e-15 magma_dpotrf 0.549841 sec, mkl dpotrs 0.051965 sec
9088 1 310.89 ( 0.81) 9.91e-15 magma_dpotrf 0.741516 sec, mkl dpotrs 0.063902 sec
10112 1 324.01 ( 1.06) 1.19e-14 magma_dpotrf 0.986795 sec, mkl dpotrs 0.077698 sec


--------------------------------------------------------------------------------
The second system has two 6-core Intel X5660 CPUs @ 2.80 GHz (Westmere) and a Fermi M2090 GPU.

using CUBLAS trsv
[kid100 testing]$ ./testing_dposv_gpu -R 1
device 0: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
N NRHS GPU GFlop/s (sec) ||B - AX|| / ||A||*||X||
===========================================================
1024 1 13.48 ( 0.03) 2.69e-15 magma_dpotrf_gpu 0.019563 sec, magma_dpotrs_gpu 0.007127 sec
2048 1 71.69 ( 0.04) 4.56e-15 magma_dpotrf_gpu 0.017485 sec, magma_dpotrs_gpu 0.022553 sec
3072 1 130.23 ( 0.07) 4.74e-15 magma_dpotrf_gpu 0.044351 sec, magma_dpotrs_gpu 0.029984 sec
4032 1 164.51 ( 0.13) 5.46e-15 magma_dpotrf_gpu 0.088299 sec, magma_dpotrs_gpu 0.044709 sec
5184 1 202.43 ( 0.23) 6.35e-15 magma_dpotrf_gpu 0.175451 sec, magma_dpotrs_gpu 0.054228 sec
6016 1 230.68 ( 0.32) 7.30e-15 magma_dpotrf_gpu 0.247839 sec, magma_dpotrs_gpu 0.067125 sec
7040 1 256.18 ( 0.45) 8.35e-15 magma_dpotrf_gpu 0.378191 sec, magma_dpotrs_gpu 0.076222 sec
8064 1 272.77 ( 0.64) 7.67e-15 magma_dpotrf_gpu 0.549292 sec, magma_dpotrs_gpu 0.092066 sec
9088 1 290.11 ( 0.86) 9.22e-15 magma_dpotrf_gpu 0.761855 sec, magma_dpotrs_gpu 0.101211 sec
10112 1 301.16 ( 1.15) 9.22e-15 magma_dpotrf_gpu 1.028292 sec, magma_dpotrs_gpu 0.116934 sec

using CUBLAS trsm -- why is this faster?
[kid100 testing]$ ./testing_dposv_gpu -R 1
device 0: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
N NRHS GPU GFlop/s (sec) ||B - AX|| / ||A||*||X||
===========================================================
1024 1 7.86 ( 0.05) 1.91e-15 magma_dpotrf_gpu 0.043242 sec, magma_dpotrs_gpu 0.002603 sec
2048 1 125.35 ( 0.02) 3.76e-15 magma_dpotrf_gpu 0.017599 sec, magma_dpotrs_gpu 0.005279 sec
3072 1 183.21 ( 0.05) 3.86e-15 magma_dpotrf_gpu 0.043173 sec, magma_dpotrs_gpu 0.009655 sec
4032 1 217.45 ( 0.10) 5.39e-15 magma_dpotrf_gpu 0.086841 sec, magma_dpotrs_gpu 0.013775 sec
5184 1 240.05 ( 0.19) 5.96e-15 magma_dpotrf_gpu 0.173617 sec, magma_dpotrs_gpu 0.020065 sec
6016 1 269.88 ( 0.27) 6.32e-15 magma_dpotrf_gpu 0.244861 sec, magma_dpotrs_gpu 0.024345 sec
7040 1 286.95 ( 0.41) 7.87e-15 magma_dpotrf_gpu 0.374864 sec, magma_dpotrs_gpu 0.030832 sec
8064 1 299.70 ( 0.58) 8.27e-15 magma_dpotrf_gpu 0.545473 sec, magma_dpotrs_gpu 0.038238 sec
9088 1 311.89 ( 0.80) 8.76e-15 magma_dpotrf_gpu 0.757409 sec, magma_dpotrs_gpu 0.045400 sec
10112 1 320.72 ( 1.08) 7.65e-15 magma_dpotrf_gpu 1.021452 sec, magma_dpotrs_gpu 0.053942 sec

using CPU interface and MKL trsv
[kid100 testing]$ ./testing_dposv -R 1
device 0: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
N NRHS GPU GFlop/s (sec) ||B - AX|| / ||A||*||X||
===========================================================
1024 1 6.08 ( 0.06) 2.86e-15 magma_dpotrf 0.058080 sec, mkl dpotrs 0.001077 sec
2048 1 80.20 ( 0.04) 4.27e-15 magma_dpotrf 0.031970 sec, mkl dpotrs 0.003843 sec
3072 1 118.25 ( 0.08) 6.25e-15 magma_dpotrf 0.072358 sec, mkl dpotrs 0.009539 sec
4032 1 138.11 ( 0.16) 6.69e-15 magma_dpotrf 0.142748 sec, mkl dpotrs 0.015731 sec
5184 1 159.74 ( 0.29) 7.62e-15 magma_dpotrf 0.265158 sec, mkl dpotrs 0.025957 sec
6016 1 191.09 ( 0.38) 9.17e-15 magma_dpotrf 0.345530 sec, mkl dpotrs 0.034733 sec
7040 1 210.11 ( 0.55) 1.02e-14 magma_dpotrf 0.507374 sec, mkl dpotrs 0.046734 sec
8064 1 224.90 ( 0.78) 9.93e-15 magma_dpotrf 0.713862 sec, mkl dpotrs 0.064060 sec
9088 1 238.54 ( 1.05) 9.91e-15 magma_dpotrf 0.967993 sec, mkl dpotrs 0.081729 sec
10112 1 249.70 ( 1.38) 1.22e-14 magma_dpotrf 1.280745 sec, mkl dpotrs 0.100541 sec
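
The solve-time columns above can be compared directly. The sketch below copies the potrs timings for N = 3072 and N = 10112 from the six tables and computes how much slower the trsv and MKL variants are relative to the trsm variant. The dictionary layout and variable names are illustrative only, and the comparison takes the timings at face value.

```python
# potrs times in seconds, copied from the tables above.
timings = {
    ("K20c", 3072):   {"trsv": 0.029752, "trsm": 0.007469, "mkl": 0.007495},
    ("K20c", 10112):  {"trsv": 0.113372, "trsm": 0.037953, "mkl": 0.077698},
    ("M2090", 3072):  {"trsv": 0.029984, "trsm": 0.009655, "mkl": 0.009539},
    ("M2090", 10112): {"trsv": 0.116934, "trsm": 0.053942, "mkl": 0.100541},
}

# Ratio > 1 means the trsm-based solve is faster than the alternative.
trsv_over_trsm = {k: t["trsv"] / t["trsm"] for k, t in timings.items()}
mkl_over_trsm = {k: t["mkl"] / t["trsm"] for k, t in timings.items()}

for key in timings:
    gpu, n = key
    print(f"{gpu:6s} N={n:5d}  trsv/trsm = {trsv_over_trsm[key]:.2f}  "
          f"mkl/trsm = {mkl_over_trsm[key]:.2f}")
```

On these numbers, switching from trsv to trsm cuts the solve time by a factor of roughly 2 to 4, while the advantage of trsm over MKL only appears at the larger size.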
mgates3
 
Posts: 329
Joined: Fri Jan 06, 2012 2:13 pm

