Hi Mark,
Thank you very much for your response! Please find my comments below:
Thank you. I have corrected that. Now all the images are embedded.
mgates3 wrote:For Cholesky, where does your matrix reside, on the CPU or the GPU, and which magma routine do you call to do the factorization? The performance there (250 Gflop/s) seems very low. We easily achieve 600 Gflop/s on a Kepler K20c (705 MHz). What performance do you get using the magma testers, testing_dpotrf and testing_dpotrf_gpu?
First, the code is here for reference:
https://github.com/bravegag/eigen-magma ... LT_MAGMA.h

The matrix is passed to the method from Eigen and it resides on the Host, in unpinned Host memory. There I use the function magma_?potrf_gpu, so the matrix is copied from the Host to the Device and the resulting L matrix is copied back from the Device to the Host. The copying times are accounted for in the benchmark. The benchmark was obtained using MKL_NUM_THREADS=1, but increasing this doesn't make a big difference. With N=10k I get about 200 Gflop/s, and testing with two GPU cards I reach 300 Gflop/s; note, however, that I have two NVIDIA GTX Titan cards and not a Tesla card. I can afford three GTX Titan cards for the price of one Tesla, but maybe I will switch to Teslas later. Also, please note that I account for the memory transfer times whereas the MAGMA testing benchmarks don't; for me it is important to know the overall performance and to answer whether it makes sense to use MAGMA in each case.
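To make the flow concrete, here is a minimal sketch of that pattern (assuming the MAGMA 2.x API, where queues are explicit; the function name is illustrative, and error checks plus magma_init()/magma_finalize() are omitted):

#include <magma_v2.h>

// Factor an n x n SPD matrix held in unpinned host memory, counting the
// transfers, as in the benchmark above.
void cholesky_on_gpu(double* hA, magma_int_t n, magma_int_t lda)
{
    magma_queue_t queue;
    magma_queue_create(0, &queue);               // queue on device 0

    magma_int_t ldda = ((n + 31) / 32) * 32;     // padded leading dimension
    magmaDouble_ptr dA;
    magma_dmalloc(&dA, size_t(ldda) * n);

    magma_dsetmatrix(n, n, hA, lda, dA, ldda, queue);   // Host -> Device

    magma_int_t info;
    magma_dpotrf_gpu(MagmaLower, n, dA, ldda, &info);   // Cholesky on the GPU

    magma_dgetmatrix(n, n, dA, ldda, hA, lda, queue);   // Device -> Host (L)

    magma_free(dA);
    magma_queue_destroy(queue);
}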
I get the following for N=10k:
testing_dpotrf: 177.82 Gflop/s
testing_dpotrf_gpu: 190.83 Gflop/s
testing_dpotrf_gpu --ngpu 2: 300 Gflop/s
I have two Xeon E5-2690 CPUs; MKL with MKL_NUM_THREADS=16 (the total number of cores in this machine) reaches 300 Gflop/s.
mgates3 wrote:Yes, increasing the MKL num threads should improve the MAGMA performance. I usually use one socket, say MKL_NUM_THREADS=8. Also, for multi-threaded code, using numactl --interleave=all can have a huge impact on MKL performance. But surely Cholesky was run multi-threaded, to achieve 400 Gflop/s?
I could not find any --interleave=all option anywhere in the MAGMA installation. Where do you specify that in the testing programs' arguments? [UPDATE: I found out how to use --interleave=all from the Intel knowledge base, thank you; it is an option of the standalone numactl utility that prefixes the command, not a MAGMA flag. I will check whether the benchmarks improve.] I ran my benchmarks using MKL_NUM_THREADS=1 because I am more interested in exploiting intra-parallelism at the GPU level and leaving the cores free for inter-parallelism, i.e. a MapReduce algorithm that invokes my processes, which in turn use Eigen MAGMA. If I use the same cores for inter- and intra-parallelism, performance will not be great. For very big problem sizes it may make sense to put all available resources into solving one single problem. However, I am now facing a huge grid search (over 2 million evaluations) of parameters for an ML problem, and I need the cores for the inter-parallelism with MapReduce.
mgates3 wrote:For the dgemm, dgemv, dtrsm, are you calling cublas routines, or magmablas routines? Currently, I generally recommend calling cublas routines, as particularly their gemm is optimized for newer generation Nvidia cards (Keplers, etc.). In which case the graphs should be labelled "cublas" instead of "magma". You can of course use the magma wrappers; magma_dgemm is a wrapper around cublasDgemm, while magmablas_dgemm is our own kernel.
Yes, you are right: I use CUBLAS in these cases, but through the MAGMA API. I will correct the labels.
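For reference, the two entry points share the same signature and differ only in the backend; a minimal sketch, assuming the MAGMA 2.x API (older releases omit the queue argument) and device matrices dA, dB, dC:

#include <magma_v2.h>

// magma_dgemm forwards to cublasDgemm; magmablas_dgemm runs MAGMA's own kernel.
void gemm_both_ways(magma_int_t n,
                    magmaDouble_const_ptr dA, magmaDouble_const_ptr dB,
                    magmaDouble_ptr dC, magma_queue_t queue)
{
    const double alpha = 1.0, beta = 0.0;

    // cuBLAS-backed path -- what my "magma" gemm benchmarks actually measure
    magma_dgemm(MagmaNoTrans, MagmaNoTrans, n, n, n,
                alpha, dA, n, dB, n, beta, dC, n, queue);

    // MAGMA's own kernel, same arguments
    magmablas_dgemm(MagmaNoTrans, MagmaNoTrans, n, n, n,
                    alpha, dA, n, dB, n, beta, dC, n, queue);
}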
mgates3 wrote:Counting the data transfer time is unfair for BLAS operations (gemm, gemv, trsm, etc.) -- one should not transfer matrices to the GPU to do only a single BLAS operation and then transfer the results back. That is generally a losing strategy. Also, often data transfers can be done asynchronously while other computation is done on the GPU. Perhaps it would be best to plot performance both with and without data transfer time, to emphasize that data transfers should be avoided or overlapped as much as possible.
I can do that, plotting both with and without accounting for the transfers. I hear you about the losing strategy, but this is a simple solution to take advantage of MAGMA and speed up Eigen code easily, which is the case when using ?gemm or ?geqp3. A better strategy would be to integrate MAGMA more deeply into the Eigen framework and cache matrices on the Device for multiple uses; given that Eigen uses expression templates, I believe it would be possible to do so. A relatively cheap improvement would be to allocate all Host memory in Eigen as pinned memory, which would speed up the transfers and improve the results. Of course, changing to a more powerful Tesla K20 card may also improve the results drastically. I have been playing with my two NVIDIA GTX Titans so far.
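To illustrate the pinned-memory idea, a minimal sketch (again assuming the MAGMA 2.x API): page-locked host buffers speed up the copies, and magma_dsetmatrix_async can overlap them with GPU work already queued:

#include <magma_v2.h>

void overlapped_transfer(magma_int_t n, magmaDouble_ptr dA, magma_int_t ldda,
                         magma_queue_t queue)
{
    // Pinned (page-locked) host buffer instead of a regular allocation
    double* hA = nullptr;
    magma_dmalloc_pinned(&hA, size_t(ldda) * n);

    // ... fill hA on the CPU ...

    // Asynchronous Host -> Device copy: returns immediately, so kernels
    // already queued on 'queue' can overlap with the transfer.
    magma_dsetmatrix_async(n, n, hA, ldda, dA, ldda, queue);

    magma_queue_sync(queue);   // wait before touching hA again
    magma_free_pinned(hA);
}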
mgates3 wrote:Can you be a bit more specific about the matrix sizes? Are these all square (n x n) matrices? For trsm, how many RHS are you solving?
For ?trsm I use an RHS matrix of size N x N too, i.e. N right-hand sides.
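Concretely, the benchmarked call looks like this minimal sketch (MAGMA 2.x API assumed; magma_dtrsm forwards to cublasDtrsm), with the triangular matrix and the RHS both on the device:

#include <magma_v2.h>

// Left-sided lower-triangular solve A * X = B with an n x n RHS matrix B,
// i.e. n right-hand sides; X overwrites dB.
void trsm_square_rhs(magma_int_t n, magmaDouble_const_ptr dA,
                     magmaDouble_ptr dB, magma_queue_t queue)
{
    magma_dtrsm(MagmaLeft, MagmaLower, MagmaNoTrans, MagmaNonUnit,
                n, n, 1.0, dA, n, dB, n, queue);
}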
Many thanks again for your help!
Best regards,
Giovanni