Multiple stream

Postby eyalhir74 » Sun Mar 17, 2013 8:06 am

Hi,
I'm using CUDA 5 with MAGMA on a K20 GPU.
I have a 1024x1024 matrix on which I'm calling magma_zgesv_gpu, and I'm trying to see if I can speed things up by using multiple streams.
I have a couple of questions:
- How do I know whether the GPU is already fully occupied with this setup, such that additional streams will not help?
- I'm doing the following:
Code:
      // Stream 0: upload A and B asynchronously, wait for the copies to finish, then solve.
      magmablasSetKernelStream(streams[0]);
      magma_zsetmatrix_async(N, N, (cuDoubleComplex *)A, N, magmaWrapper.d_pA, N, streams[0]);
      magma_zsetmatrix_async(N, N, (cuDoubleComplex *)B, N, magmaWrapper.d_pB, N, streams[0]);
      magma_queue_sync(streams[0]);
      magmaWrapper.SolveAXequB(N, true, magmaWrapper.d_pA, magmaWrapper.d_pB, magmaWrapper.d_pS);

      // Stream 1: same sequence on a second set of device buffers.
      magmablasSetKernelStream(streams[1]);
      magma_zsetmatrix_async(N, N, (cuDoubleComplex *)A, N, magmaWrapper2.d_pA, N, streams[1]);
      magma_zsetmatrix_async(N, N, (cuDoubleComplex *)B, N, magmaWrapper2.d_pB, N, streams[1]);
      magma_queue_sync(streams[1]);
      magmaWrapper2.SolveAXequB(N, true, magmaWrapper2.d_pA, magmaWrapper2.d_pB, magmaWrapper2.d_pS);

      // Stream 2: same sequence on a third set of device buffers.
      magmablasSetKernelStream(streams[2]);
      magma_zsetmatrix_async(N, N, (cuDoubleComplex *)A, N, magmaWrapper3.d_pA, N, streams[2]);
      magma_zsetmatrix_async(N, N, (cuDoubleComplex *)B, N, magmaWrapper3.d_pB, N, streams[2]);
      magma_queue_sync(streams[2]);
      magmaWrapper3.SolveAXequB(N, true, magmaWrapper3.d_pA, magmaWrapper3.d_pB, magmaWrapper3.d_pS);

I don't see any wall-clock-time improvement when using 3 streams instead of 1; the 3-stream run takes 3 times as long as the single-stream run.

Furthermore, in nvprof I see 4 streams (even though I opened only 3), as if there were an additional stream synchronizing all the memcopies issued by the magma_zgesv_gpu call.


Thanks
eyalhir74

Re: Multiple stream

Postby mgates3 » Sun Mar 17, 2013 4:16 pm

magma_zgesv is a synchronous function, as are most magma functions, whereas magmablas functions are asynchronous. (I assume magma_zgesv is what you call inside magmaWrapper.SolveAXequB.) So it won't currently run multiple solves in parallel. Since part of the factorization happens in CPU code, you would actually need multiple CPU threads to even possibly run multiple solvers in parallel. Even that is not yet supported.
-mark

Re: Multiple stream

Postby eyalhir74 » Sun Mar 17, 2013 11:37 pm

Thanks Mark for your prompt answer.

Yes, indeed, I call magma_zgesv inside the SolveAXequB function.
Is changing zgesv so that it can run concurrently (CPU and GPU) something that is planned?
How hard do you think it would be for me to do this myself?
And maybe a more basic/important question: do you think it would benefit performance to do so?
That is, if I were able to run multiple solvers on a K20 card with 1024x1024 matrices, would I see
performance gains over serial runs?

Also, will I be able to run 2 solvers if I use, for example, a two-GPU system? Or would the CPU part
again prevent the code from truly running in parallel and solving two sets of equations at the same time?

thanks a lot.
eyalhir74

Re: Multiple stream

Postby mgates3 » Mon Mar 18, 2013 3:05 pm

I'm not sure whether or when support for running multiple parallel solves will be added. Partly this is because a large enough matrix will completely occupy the GPU, so there would be no performance benefit. For smaller matrices, there may be some benefit, but it may need a more specialized interface to pipeline the operations effectively.

For multiple GPUs, if you call magma_zgesv from separate CPU threads, they ought to run in parallel. But I've never tried it. There may also be issues with multi-threaded BLAS on the CPU side. For instance, if you have 12 cores and set MKL number of threads to 12, then will the multiple panel factorizations conflict with each other and over-subscribe the CPU cores?
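To make that concrete, here is a rough, untested sketch of calling the solver from two CPU threads, one per GPU. The function name and buffer setup are made up purely for illustration (MAGMA 1.3-style signatures assumed; initialization and error handling omitted):
Code:
// Illustrative only: one OpenMP thread per GPU, each owning its own buffers
// and calling the synchronous solver independently.
#include <stdlib.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <omp.h>
#include "magma.h"

void solve_on_two_gpus(int n, cuDoubleComplex *A[2], cuDoubleComplex *B[2])
{
    #pragma omp parallel num_threads(2)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);              // bind this thread to its own GPU

        cuDoubleComplex *dA, *dB;
        magma_int_t *ipiv = (magma_int_t*) malloc(n * sizeof(magma_int_t));
        magma_int_t info;
        cudaMalloc((void**)&dA, (size_t)n * n * sizeof(cuDoubleComplex));
        cudaMalloc((void**)&dB, (size_t)n     * sizeof(cuDoubleComplex));

        // Upload this thread's system and solve. The two solves can overlap
        // because they run on different devices; capping the BLAS thread
        // count per thread is where the over-subscription question comes in.
        magma_zsetmatrix(n, n, A[dev], n, dA, n);
        magma_zsetmatrix(n, 1, B[dev], n, dB, n);
        magma_zgesv_gpu(n, 1, dA, n, ipiv, dB, n, &info);
        magma_zgetmatrix(n, 1, dB, n, B[dev], n);

        cudaFree(dA);  cudaFree(dB);  free(ipiv);
    }
}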

So, basically, it's on our agenda to look at it, but we don't as yet have all the answers.
-mark

Re: Multiple stream

Postby eyalhir74 » Mon Mar 18, 2013 4:58 pm

Thanks again Mark.
What would you consider a big enough matrix to occupy the GPU? What size?

thanks
eyal

Re: Multiple stream

Postby mgates3 » Tue Mar 19, 2013 4:07 pm

I did some tests running multiple GEMMs in parallel, varying m and n, with fixed k=128. This is the kind of GEMM that occurs in factorizations.

For sgemm, percentage of gemm peak performance (538 Gflop/s) on S2050:
Code:
          nstream=1    nstream=2    nstream=3
n=512     53%          71%          79%
n=1024    82%          89%          93%
n=2048    94%          97%          98%


For zgemm, percentage of gemm peak performance (327 Gflop/s) on S2050:
Code:
          nstream=1    nstream=2    nstream=3
n=512     87%          92%          94%
n=1024    96%          98%          98%
n=2048    98%          99%          99%


As you can see, for small n like 512, doing multiple gemms in parallel using streams does improve the overall performance. But as n increases, a single gemm quickly attains nearly the full peak speed, especially for double-complex. So I would not expect significant improvements from using streams to solve multiple problems in parallel for n > 1000, and even less so for double and double-complex.
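For reference, a rough sketch of this kind of multi-stream GEMM test, written directly against the cuBLAS v2 API (so illustrative only, not our actual benchmark harness; error checking omitted). The Gflop/s figure uses the standard 2*m*n*k flop count for sgemm (8*m*n*k for zgemm).
Code:
// Time nstream concurrent sgemms, each m x n x k on its own stream.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void)
{
    const int m = 1024, n = 1024, k = 128, nstream = 3;
    const float alpha = 1.f, beta = 0.f;
    float *dA, *dB, *dC[3];
    cudaMalloc((void**)&dA, sizeof(float)*m*k);   // contents uninitialized:
    cudaMalloc((void**)&dB, sizeof(float)*k*n);   // fine for timing only

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaStream_t streams[3];
    for (int s = 0; s < nstream; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void**)&dC[s], sizeof(float)*m*n);  // separate C per stream
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int s = 0; s < nstream; ++s) {
        cublasSetStream(handle, streams[s]);      // route this gemm to stream s
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, m, dB, k, &beta, dC[s], m);
    }
    cudaEventRecord(stop);     // legacy NULL stream waits for the other streams
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f Gflop/s aggregate\n", nstream * 2.0*m*n*k / (ms*1e-3) / 1e9);
    return 0;
}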

-mark

Re: Multiple stream

Postby eyalhir74 » Wed Mar 20, 2013 4:05 am

Hi Mark,
Thanks a lot for the detailed benchmark.
I do have another follow-up question, though.

I'm using the magma_zgesv_gpu function. As far as I understand, it first calls magma_zgetrf_gpu (the LU factorization) and then magma_zgetrs_gpu, which in turn calls some version of zgemm.
If I run the code for one call in one stream under the profiler, I see the attached image:
[Attachment: MagmaOut.jpg, magma_zgesv_gpu profiler output]


As far as I understand, the left side (with the memcopies) is the magma_zgetrf_gpu code, which has many host-to-device and device-to-host copies and a lot of work on the CPU, all happening synchronously. Then come the synchronized memcopies, and then on the right the magma_zgetrs_gpu code, which looks like it utilizes the GPU better (at least from how it looks in the profiler: no memcopies in the middle, and the kernels seem to run one right after the other).
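For reference, a minimal sketch of those two phases called explicitly (MAGMA 1.3-style signatures assumed; dA and dB are already on the GPU, and allocation/error handling are omitted, so treat this as illustrative):
Code:
magma_int_t info;

// Phase 1: LU factorization with partial pivoting; the panel factorization
// runs on the CPU, which is the host activity visible on the left.
magma_zgetrf_gpu(n, n, dA, ldda, ipiv, &info);

// Phase 2: forward/back triangular solves using the factors; mostly
// gemm-like GPU work, which is the dense kernel region on the right.
// Factoring once also lets you reuse dA for several right-hand sides.
magma_zgetrs_gpu('N', n, nrhs, dA, ldda, ipiv, dB, lddb, &info);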

Now I have the following questions (these should be read more as thoughts, and certainly not as complaints :) ):
- I guess that if you ran the measurements from your previous post on the whole process (zgetrf_gpu plus zgetrs_gpu), we would see that the single-stream performance drops significantly relative to peak, and hence multiple streams (though not possible today, per your earlier explanations) might boost performance.
- Since the code synchronizes the device, it will probably also be hard to run "other" (MAGMA/custom) kernels while the MAGMA code is running, to achieve concurrency with other tasks.
- The kernels, for both zgetrf and zgetrs, seem to use a low number of blocks in each launch. This might also indicate low utilization of the GPU (at least for the LU part, though the solver itself also uses a small grid).
- Kepler, my target GPU, seems ideal for this flow; it would be nice if we could use streams and/or dynamic parallelism to speed it up.

Thanks a lot,
Eyal

Re: Multiple stream

Postby eyalhir74 » Wed Mar 20, 2013 7:31 am

Hi Mark,
Further to my previous post, I ran the testing_sgetrf_gpu app from MAGMA's testing folder on the K20.

Here's the result I got for one binary running:
Code:
./a.ksh
MAGMA 1.3.0
device 0: Tesla K20, 705.5 MHz clock, 5119.8 MB memory, capability 3.5
device 1: Tesla C2075, 1147.0 MHz clock, 6143.2 MB memory, capability 2.0

Usage: ./testing_sgetrf_gpu -N <m,n> -c
  -N can be repeated up to 10 times. If only m is given, then m=n.
  -c or setting $MAGMA_TESTINGS_CHECK runs LAPACK and checks result.

  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
 1024  1024     ---   (  ---  )     52.42 (   0.01)     --- 
 2048  2048     ---   (  ---  )    217.08 (   0.03)     --- 
 3072  3072     ---   (  ---  )    331.25 (   0.06)     --- 
 4032  4032     ---   (  ---  )    467.83 (   0.09)     --- 
 5184  5184     ---   (  ---  )    606.83 (   0.15)     --- 
 6016  6016     ---   (  ---  )    686.22 (   0.21)     --- 
 7040  7040     ---   (  ---  )    763.72 (   0.30)     --- 
 8064  8064     ---   (  ---  )    828.78 (   0.42)     --- 
 9088  9088     ---   (  ---  )    892.32 (   0.56)     --- 
10112 10112     ---   (  ---  )    934.60 (   0.74)     --- 

real   0m8.80s
user   0m11.71s
sys   0m1.18s


And here's the output for 4 apps running at the same time:
Code:
./a.ksh & ./a.ksh & ./a.ksh & ./a.ksh &
[2] 8027
[3] 8028
[4] 8029
[5] 8030
MAGMA 1.3.0
device 0: Tesla K20, 705.5 MHz clock, 5119.8 MB memory, capability 3.5
device 1: Tesla C2075, 1147.0 MHz clock, 6143.2 MB memory, capability 2.0

Usage: ./testing_sgetrf_gpu -N <m,n> -c
  -N can be repeated up to 10 times. If only m is given, then m=n.
  -c or setting $MAGMA_TESTINGS_CHECK runs LAPACK and checks result.

(Each of the four processes printed this banner and table header; the
duplicates are omitted here, and the result rows below are interleaved
in arrival order.)

  M     N     CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||PA-LU||/(||A||*N)
=========================================================================
 1024  1024     ---   (  ---  )      2.24 (   0.32)     ---
 1024  1024     ---   (  ---  )      2.85 (   0.25)     --- 
 1024  1024     ---   (  ---  )     18.38 (   0.04)     --- 
 2048  2048     ---   (  ---  )    180.75 (   0.03)     --- 
 1024  1024     ---   (  ---  )      7.51 (   0.10)     --- 
 2048  2048     ---   (  ---  )    111.59 (   0.05)     --- 
 2048  2048     ---   (  ---  )     54.39 (   0.11)     --- 
 2048  2048     ---   (  ---  )    112.02 (   0.05)     --- 
 3072  3072     ---   (  ---  )     96.48 (   0.20)     --- 
 3072  3072     ---   (  ---  )     97.00 (   0.20)     --- 
 3072  3072     ---   (  ---  )     70.94 (   0.27)     --- 
 3072  3072     ---   (  ---  )     80.63 (   0.24)     --- 
 4032  4032     ---   (  ---  )     84.24 (   0.52)     --- 
 4032  4032     ---   (  ---  )     96.26 (   0.45)     --- 
 4032  4032     ---   (  ---  )    100.82 (   0.43)     --- 
 4032  4032     ---   (  ---  )     83.40 (   0.52)     --- 
 5184  5184     ---   (  ---  )    120.07 (   0.77)     --- 
 5184  5184     ---   (  ---  )    125.13 (   0.74)     --- 
 5184  5184     ---   (  ---  )    112.15 (   0.83)     --- 
 5184  5184     ---   (  ---  )    109.41 (   0.85)     --- 
 6016  6016     ---   (  ---  )    132.24 (   1.10)     --- 
 6016  6016     ---   (  ---  )    134.45 (   1.08)     --- 
 6016  6016     ---   (  ---  )    138.44 (   1.05)     --- 
 6016  6016     ---   (  ---  )    120.78 (   1.20)     --- 
 7040  7040     ---   (  ---  )    163.58 (   1.42)     --- 
 7040  7040     ---   (  ---  )    163.02 (   1.43)     --- 
 7040  7040     ---   (  ---  )    169.24 (   1.37)     --- 
 7040  7040     ---   (  ---  )    147.59 (   1.58)     --- 
 8064  8064     ---   (  ---  )    202.38 (   1.73)     --- 
 8064  8064     ---   (  ---  )    188.68 (   1.85)     --- 
 8064  8064     ---   (  ---  )    182.99 (   1.91)     --- 
 8064  8064     ---   (  ---  )    181.91 (   1.92)     --- 
 9088  9088     ---   (  ---  )    219.06 (   2.28)     --- 
 9088  9088     ---   (  ---  )    214.32 (   2.33)     --- 
 9088  9088     ---   (  ---  )    218.98 (   2.28)     --- 
 9088  9088     ---   (  ---  )    208.27 (   2.40)     --- 
10112 10112     ---   (  ---  )    230.16 (   2.99)     --- 
10112 10112     ---   (  ---  )    224.46 (   3.07)     --- 
10112 10112     ---   (  ---  )    209.83 (   3.28)     --- 
10112 10112     ---   (  ---  )    209.82 (   3.28)     --- 

real   0m20.70s
user   0m25.29s
sys   0m2.66s

real   0m21.19s
user   0m25.39s
sys   0m2.84s

real   0m21.19s
user   0m25.02s
sys   0m2.78s

real   0m21.19s
user   0m25.46s
sys   0m2.64s

[2]   Done                    ./a.ksh
[3]   Done                    ./a.ksh
[4]-  Done                    ./a.ksh
[5]+  Done                    ./a.ksh



It seems that 4 binaries running at the same time took only ~2.5 times as long as one binary (about 21 s versus 8.8 s), rather than the 4x that full serialization would give.

Eyal

