Eigen MAGMA backend implementation project

Postby bravegag » Thu Jul 25, 2013 5:01 am

Hello,

I have "created a fork" of Eigen 3.2 and incorporated some (small) progress I have preparing a MAGMA backend to best exploit GPU & CPU. This is an alternative to using MKL which indirectly uses MKL because MAGMA does use MKL in the back. I have been testing it using MAGMA 1.4.0-beta2 and so far all my project tests pass without having to change our Eigen-based code base which is great! Anyone who wants to contribute please contact me to bravegag@hotmail.com or via the GitHub account below.

The code base is available here:
https://github.com/bravegag/eigen-magma

The first working port is GeneralMatrixMatrix_MAGMA.h, which goes through the MAGMA API but in reality invokes CUBLAS, which is slightly faster:
https://github.com/bravegag/eigen-magma ... ix_MAGMA.h
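
In essence, the port boils down to the following round trip. This is a simplified sketch against the MAGMA 1.x API, not the literal backend code; the function name is illustrative, error handling is stripped, and it assumes magma_init() was called once at startup:

    #include <magma.h>

    // Sketch of the GEMM backend core: upload A and B, run the
    // CUBLAS-backed magma_dgemm on the Device, download C.
    void gemm_via_magma(int m, int n, int k,
                        const double* hA, const double* hB, double* hC)
    {
        magmaDouble_ptr dA, dB, dC;
        magma_dmalloc(&dA, (size_t)m * k);
        magma_dmalloc(&dB, (size_t)k * n);
        magma_dmalloc(&dC, (size_t)m * n);

        // Host -> Device (column-major, leading dimension = number of rows)
        magma_dsetmatrix(m, k, hA, m, dA, m);
        magma_dsetmatrix(k, n, hB, k, dB, k);

        // C = 1.0 * A * B + 0.0 * C; magma_dgemm wraps cublasDgemm
        magma_dgemm(MagmaNoTrans, MagmaNoTrans, m, n, k,
                    1.0, dA, m, dB, k, 0.0, dC, m);

        // Device -> Host
        magma_dgetmatrix(m, n, dC, m, hC, m);

        magma_free(dA); magma_free(dB); magma_free(dC);
    }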

Another partial implementation (currently a work in progress) is ColPivHouseholderQR_MAGMA.h; it is still missing the macros enabling the float and complex types:
https://github.com/bravegag/eigen-magma ... QR_MAGMA.h

I have been adding implementations prioritizing the functions we use as part of our project.

The remaining *_MAGMA.h implementations are simply mock copies of their MKL counterparts with some basic pre-processing changes, i.e. an MKL -> MAGMA replacement.

Best regards,
Giovanni

Re: Eigen MAGMA backend implementation project

Postby bravegag » Thu Jul 25, 2013 1:13 pm

I have created a simple benchmark project to check the implementation:
https://github.com/bravegag/eigen-magma-benchmark

I have added documentation, including the Gflop/s results for DGEMM and DGEQP3, the routines implemented so far.

Re: Eigen MAGMA backend implementation project

Postby bravegag » Thu Aug 08, 2013 10:03 am

Quick update: I have additionally implemented:

- dgemv (matrix vector multiplication)
- dtrsm (triangular matrix solver)
- dpotrf (Cholesky decomposition)

The results are very disappointing. Unless I have bugs (e.g. copying more memory between Host and Device than needed), MAGMA underperforms in all three cases once the memory transfers are taken into account; see:
https://github.com/bravegag/eigen-magma-benchmark

The Cholesky decomposition result is the most surprising because Eigen beats both MKL and MAGMA (see the Gflop/s):
https://raw.github.com/bravegag/eigen-m ... gflops.png

If anyone is willing to donate a code review, I will be more than happy ;)

Best regards,
Giovanni

PS: I think I will include the MAGMA multi-GPU (mgpu) version of Cholesky to see how it compares running on 2x nVidia GTX Titan. Note that, unlike the MAGMA testing implementations, my benchmark Gflop/s figures include the memory transfer times, which is what is effectively relevant for my purposes (getting the problems solved asap).

Re: Eigen MAGMA backend implementation project

Postby mgates3 » Fri Aug 09, 2013 12:34 pm

First, on this page
https://github.com/bravegag/eigen-magma-benchmark
the images appear broken for me. I have to click on each one to see the image in a separate window.

For Cholesky, where does your matrix reside, on the CPU or the GPU, and which magma routine do you call to do the factorization? The performance there (250 Gflop/s) seems very low. We easily achieve 600 Gflop/s on a Kepler K20c (705 MHz). What performance do you get using the magma testers, testing_dpotrf and testing_dpotrf_gpu?

Yes, increasing the MKL num threads should improve the MAGMA performance. I usually use one socket, say MKL_NUM_THREADS=8. Also, for multi-threaded code, using numactl --interleave=all can have a huge impact on MKL performance. But surely Cholesky was run multi-threaded, to achieve 400 Gflop/s?

For the dgemm, dgemv, dtrsm, are you calling cublas routines, or magmablas routines? Currently, I generally recommend calling cublas routines, as particularly their gemm is optimized for newer generation Nvidia cards (Keplers, etc.). In which case the graphs should be labelled "cublas" instead of "magma". You can of course use the magma wrappers; magma_dgemm is a wrapper around cublasDgemm, while magmablas_dgemm is our own kernel.

Counting the data transfer time is unfair for BLAS operations (gemm, gemv, trsm, etc.) -- one should not transfer matrices to the GPU to do only a single BLAS operation and then transfer the results back. That is generally a losing strategy. Also, often data transfers can be done asynchronously while other computation is done on the GPU. Perhaps it would be best to plot performance both with and without data transfer time, to emphasize that data transfers should be avoided or overlapped as much as possible.
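
For example, something along these lines overlaps the upload of the next operand with a GEMM already running on the GPU (a schematic cuBLAS sketch, not MAGMA code; the names are illustrative and error checks are omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Overlap a Host -> Device copy with compute using two CUDA streams.
    // hNext must be pinned (cudaMallocHost) for the copy to be truly async.
    void overlap_example(int n, const double* hNext, double* dNext,
                         const double* dA, const double* dB, double* dC)
    {
        cudaStream_t copyStream, computeStream;
        cudaStreamCreate(&copyStream);
        cudaStreamCreate(&computeStream);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetStream(handle, computeStream);

        // Upload the *next* matrix while the current GEMM runs.
        cudaMemcpyAsync(dNext, hNext, sizeof(double) * n * n,
                        cudaMemcpyHostToDevice, copyStream);

        const double one = 1.0, zero = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dA, n, dB, n, &zero, dC, n);

        cudaStreamSynchronize(copyStream);     // transfer done
        cudaStreamSynchronize(computeStream);  // GEMM done

        cublasDestroy(handle);
        cudaStreamDestroy(copyStream);
        cudaStreamDestroy(computeStream);
    }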

Can you be a bit more specific about the matrix sizes? Are these all square (n x n) matrices? For trsm, how many RHS are you solving?

-mark

Re: Eigen MAGMA backend implementation project

Postby bravegag » Mon Aug 12, 2013 4:40 am

Hi Mark,

Thank you very much for your response! Please find my comments below:

mgates3 wrote:First, on this page
https://github.com/bravegag/eigen-magma-benchmark
the images appear broken for me. I have to click on each one to see the image in a separate window.

Thank you. I have corrected that. Now all the images are embedded.

mgates3 wrote:For Cholesky, where does your matrix reside, on the CPU or the GPU, and which magma routine do you call to do the factorization? The performance there (250 Gflop/s) seems very low. We easily achieve 600 Gflop/s on a Kepler K20c (705 MHz). What performance do you get using the magma testers, testing_dpotrf and testing_dpotrf_gpu?

First the code is here for reference:
https://github.com/bravegag/eigen-magma ... LT_MAGMA.h

The matrix is passed to the method from Eigen and resides on the Host, in unpinned Host memory. There I use the function magma_?potrf_gpu, so the matrix is copied from Host to Device and the resulting L matrix is copied back from Device to Host. The copying times are accounted for in the benchmark. The benchmark was obtained using MKL_NUM_THREADS=1, but increasing this doesn't make a big difference. With N=10k I get about 200 Gflop/s, and testing with two GPU cards I reach 300 Gflop/s; note that I have two nVidia GTX Titans and not a Tesla card. I can afford 3x GTX Titan cards for the price of one Tesla, but maybe I will switch to Teslas later. Also, please note that I account for the memory transfer times whereas the MAGMA testing benchmarks don't; for me it is important to know the overall performance and to answer whether it makes sense to use MAGMA in each case.
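
Schematically, the wrapper's round trip looks like this (a simplified sketch against the MAGMA 1.x-style API; the function name is mine and error handling is stripped):

    #include <magma.h>

    // Host-resident matrix: upload, factor on the GPU, download.
    // Both transfers are included in my benchmark timings.
    void cholesky_via_magma(int n, double* hA /* n x n, column-major */)
    {
        magmaDouble_ptr dA;
        magma_dmalloc(&dA, (size_t)n * n);

        magma_dsetmatrix(n, n, hA, n, dA, n);    // Host -> Device

        magma_int_t info = 0;
        magma_dpotrf_gpu(MagmaLower, n, dA, n, &info);

        magma_dgetmatrix(n, n, dA, n, hA, n);    // Device -> Host
        magma_free(dA);
    }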

I get the following for N=10k:
testing_dpotrf: 177.82 Gflop/s
testing_dpotrf_gpu: 190.83 Gflop/s
testing_dpotrf_gpu with --ngpu 2: 300 Gflop/s

I have two Xeon E5-2690 CPUs; MKL with MKL_NUM_THREADS=16, the total number of cores in this machine, reaches 300 Gflop/s.

mgates3 wrote:Yes, increasing the MKL num threads should improve the MAGMA performance. I usually use one socket, say MKL_NUM_THREADS=8. Also, for multi-threaded code, using numactl --interleave=all can have a huge impact on MKL performance. But surely Cholesky was run multi-threaded, to achieve 400 Gflop/s?

I could not find any --interleave=all option anywhere in the MAGMA installation. Where do you specify that in the testing programs' arguments? [UPDATE: I found out how to use numactl --interleave=all in the Intel knowledge base, thank you. I will check whether the benchmarks improve.] I ran my benchmarks using MKL_NUM_THREADS=1 because I am more interested in exploiting intra-parallelism at the GPU level and leaving the cores free for inter-parallelism, i.e. a MapReduce algorithm that invokes my processes, which in turn use Eigen MAGMA. If I use the same cores for inter- and intra-parallelism, performance will not be great. For very big problem sizes it may make sense to put all available resources into solving one single problem; however, I am now facing a huge grid search (2+ million evaluations) over the parameters of an ML problem and need the cores for the inter-parallelism with MapReduce.

mgates3 wrote:For the dgemm, dgemv, dtrsm, are you calling cublas routines, or magmablas routines? Currently, I generally recommend calling cublas routines, as particularly their gemm is optimized for newer generation Nvidia cards (Keplers, etc.). In which case the graphs should be labelled "cublas" instead of "magma". You can of course use the magma wrappers; magma_dgemm is a wrapper around cublasDgemm, while magmablas_dgemm is our own kernel.

Yes, you are right, I use CUBLAS in these cases, but through the MAGMA API. I will correct the labels.

mgates3 wrote:Counting the data transfer time is unfair for BLAS operations (gemm, gemv, trsm, etc.) -- one should not transfer matrices to the GPU to do only a single BLAS operation and then transfer the results back. That is generally a losing strategy. Also, often data transfers can be done asynchronously while other computation is done on the GPU. Perhaps it would be best to plot performance both with and without data transfer time, to emphasize that data transfers should be avoided or overlapped as much as possible.

I can do that, plotting both with and without accounting for the transfers. I hear you about the losing strategy, but this is a simple way to take advantage of MAGMA and speed up Eigen code easily, which is the case when using ?gemm or ?geqp3. A better strategy would be to integrate MAGMA deeper into the Eigen framework and cache matrices on the Device for multiple uses; given that Eigen uses expression templates, I believe it would be possible to do so. A relatively cheap improvement would be to allocate all Host memory in Eigen as pinned memory, which would speed up the transfers and improve the results. Of course, changing to a more powerful Tesla K20 card may also improve the results drastically; I have been playing with my two nVidia GTX Titans so far.
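
For the pinned-memory idea, MAGMA already exposes pinned Host allocations, so the change would look roughly like this (a sketch; wiring it into Eigen's allocator globally would of course need more plumbing):

    #include <magma.h>

    // Allocate the Host-side matrix as pinned (page-locked) memory so
    // Host <-> Device copies run at full speed and can be asynchronous.
    double* alloc_pinned_matrix(int n)
    {
        double* hA = 0;
        if (magma_dmalloc_pinned(&hA, (size_t)n * n) != MAGMA_SUCCESS)
            return 0;              // caller can fall back to pageable memory
        return hA;                 // release later with magma_free_pinned(hA)
    }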

mgates3 wrote:Can you be a bit more specific about the matrix sizes? Are these all square (n x n) matrices? For trsm, how many RHS are you solving?

For ?trsm I use an RHS matrix of size N too, i.e. N right-hand sides.

Many thanks again for your help!

Best regards,
Giovanni

Re: Eigen MAGMA backend implementation project

Postby bravegag » Mon Aug 12, 2013 8:39 am

Bug fix related to the Cholesky decomposition: the benchmark input matrix A was not SPD. This has been fixed and the results are now correct. Now MAGMA shines, reaching over 120 Gflop/s:
https://github.com/bravegag/eigen-magma-benchmark
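
For reference, a standard way to generate such an SPD test input in Eigen is along these lines (a sketch; the exact benchmark code is in the repository):

    #include <Eigen/Dense>

    // M * M^T is symmetric positive semi-definite; adding n on the
    // diagonal makes the matrix safely positive definite.
    Eigen::MatrixXd makeSpd(int n)
    {
        Eigen::MatrixXd M = Eigen::MatrixXd::Random(n, n);
        return M * M.transpose() + n * Eigen::MatrixXd::Identity(n, n);
    }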

Re: Eigen MAGMA backend implementation project

Postby bravegag » Tue Aug 13, 2013 10:55 am

After integrating the magma_?geqrf3_gpu implementation I get very disappointing benchmark results. This was very surprising to me given the excellent results of magma_?geqp3_gpu.

Eigen MAGMA integration:
https://github.com/bravegag/eigen-magma ... QR_MAGMA.h

Benchmark result:
https://raw.github.com/bravegag/eigen-m ... gflops.png

The MAGMA testing_dgeqrf_gpu, which invokes magma_dgeqrf3_gpu, produced very good results, topping out at 193.88 Gflop/s for N=10k, so the memory transfer costs may really be slowing this one down.

Re: Eigen MAGMA backend implementation project

Postby mgates3 » Tue Aug 13, 2013 4:55 pm

You could also try magma_dpotrf and magma_dgeqrf, that is, the CPU interfaces, since your matrix is on the CPU. However, your matrix is not pinned, so I don't know if that will improve results or not. Pinning memory can also be a slow operation.
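
The CPU interface just takes the Host pointer directly and manages the Device workspace and transfers internally, e.g. (a sketch, MAGMA 1.x-style API):

    #include <magma.h>

    // CPU interface: pass the Host-resident matrix; MAGMA handles the
    // GPU workspace and the Host <-> Device transfers itself.
    void cholesky_cpu_interface(int n, double* hA /* n x n, column-major */)
    {
        magma_int_t info = 0;
        magma_dpotrf(MagmaLower, n, hA, n, &info);
    }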
-mark

Re: Eigen MAGMA backend implementation project

Postby bravegag » Wed Aug 14, 2013 8:46 am

Hi Mark,

Thank you very much for your feedback and help.

I have already tried all the versions, and the GPU versions performed best in my benchmarks, even with having to copy unpinned memory between Host and Device. It would be great to have the same benchmarks executed on a Tesla K20 card before asking my department to buy them :)

One question: I have 2x Xeon E5-2690 with a total of 16 cores available, and the gain I get from raising MKL_NUM_THREADS from 1 to 16 is very marginal for the MAGMA Host versions, e.g. testing_dgeqrf improves by a small margin and the same goes for testing_dgesvd and so on. MKL on its own, in contrast, shows a significant performance increase when enabling multiple threads and using "numactl --interleave=all".

Best regards,
Giovanni Azua

Re: Eigen MAGMA backend implementation project

Postby mgates3 » Wed Aug 14, 2013 3:17 pm

We use MKL for factoring the panel on the CPU. I'm not sure how parallel MKL has made this. Most of MKL's performance increase with multiple threads is in updating the trailing matrix, which we do on the GPU, so we benefit much less from multiple threads. Also, once the panel is faster than the trailing matrix update, making it any faster won't help -- the CPU is waiting for the GPU.

The big exception is the eigenvalue and SVD routines, where we do the Hessenberg, tri-, or bi-diagonal reduction with the GPU, but then call LAPACK's eigensolver directly, so MAGMA potentially benefits much more from MKL's multithreading. I haven't benchmarked them with different MKL thread counts, though.

-mark
