For the CPU part of my code I am using the goto2 BLAS compiled for CORE2. It consistently uses four cores, as I can see by monitoring the CPU cores on my computer. I am running this on an 8-core CPU (I cannot use the 8-core version of goto2 BLAS because it has a bug).
When I run a program that calls magma_zgetrf_gpu, I have to call cublas_set_matrix to transfer the data and cublas_get_matrix to get it back afterwards. What I observe is that during these calls three of the four cores show a sharp drop in usage, and one additional core shows a peak. I don't know what use the cublas routines make of BLAS calls, but it looks very much as though these calls are not using the four cores available to the program through gotoblas.
I think these calls are in fact what is giving me poor performance, particularly when I want to do repeated back substitution using magma_zgetrs_gpu, where the performance is much worse than doing the work on the CPU. The matrix size is about 4000.
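To put rough numbers on this, here is a minimal Python sketch comparing the size of one matrix copy with the arithmetic in one back substitution. The bandwidth value is a hypothetical placeholder, not something I measured, and the flop count is the usual leading-order estimate for two complex triangular solves, so treat the output as an order-of-magnitude guide only.

```python
# Rough cost model. The bandwidth figure is an assumed placeholder, not a measurement.
BYTES_PER_COMPLEX16 = 16  # one double-precision complex number


def matrix_bytes(n):
    """Size in bytes of an n x n double-complex matrix."""
    return n * n * BYTES_PER_COMPLEX16


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Time to move n_bytes at the given sustained bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


def zgetrs_flops(n, nrhs=1):
    """Approximate real flops for the two complex triangular solves:
    about n*n complex multiplies (6 flops each) and n*n complex adds
    (2 flops each) per right-hand side."""
    return 8 * n * n * nrhs


n = 4000
assumed_bandwidth = 3.0e9  # bytes/s, hypothetical host-to-device rate

size = matrix_bytes(n)
copy_ms = transfer_seconds(size, assumed_bandwidth) * 1e3
print(f"matrix size: {size / 1e6:.0f} MB")
print(f"one-way copy at assumed rate: {copy_ms:.1f} ms")
print(f"back substitution work, 1 RHS: {zgetrs_flops(n) / 1e6:.0f} MFlop")
```

If the whole matrix is copied for every solve, the copy moves far more data than the solve has arithmetic to hide it behind. Whether that is what happens in my program depends on where the factored matrix lives between calls, which this sketch does not model.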
The following data for zgetrf_gpu show that I should get a considerable speedup, but I am not seeing it because of the overhead of the data transfers.
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory
testing_zgetrf_gpu -M 1024 -N 1024
   M      N    CPU GFlop/s   GPU GFlop/s   ||PA-LU||/(||A||*N)
 960    960       19.62         47.00         1.102403e-17
1920   1920       26.94         59.55         1.096587e-17
3072   3072       27.30         63.06         1.075028e-17
4032   4032       27.76         67.16         1.033353e-17
4992   4992       27.93         68.29         1.044090e-17
5952   5952       27.96         68.98         1.025062e-17
7104   7104       28.13         69.48         1.020955e-17
8064   8064       27.92         69.82         1.004068e-17
9024   9024       27.68         70.09         9.916281e-18
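
For the row closest to my matrix size (N = 4032), the rates above can be converted into wall-clock times. This is a minimal sketch: it uses the leading-order flop count for a complex LU factorization, (8/3) N^3, which may differ slightly from the exact count the tester uses, and it assumes the GPU figure does not include the host-device copies.

```python
def zgetrf_flops(n):
    """Leading-order real flop count for complex LU of an n x n matrix:
    about n^3/3 complex multiplies (6 flops each) and n^3/3 complex
    adds (2 flops each)."""
    return (8.0 / 3.0) * n ** 3


def seconds_at(gflops_rate, flops):
    """Run time in seconds for `flops` operations at `gflops_rate` GFlop/s."""
    return flops / (gflops_rate * 1e9)


n = 4032
cpu_rate, gpu_rate = 27.76, 67.16  # GFlop/s, taken from the N = 4032 row

work = zgetrf_flops(n)
cpu_t = seconds_at(cpu_rate, work)
gpu_t = seconds_at(gpu_rate, work)

print(f"CPU factorization: {cpu_t:.2f} s")
print(f"GPU factorization: {gpu_t:.2f} s")
print(f"speedup from the rates alone: {gpu_rate / cpu_rate:.2f}x")
# Time the transfers would have to consume to cancel the GPU advantage.
print(f"transfer budget before the gain is lost: {cpu_t - gpu_t:.2f} s")
```

The last line gives the total time the set/get calls would need to take for the GPU path to be no faster than the CPU path, which can be compared against timings of those calls taken separately.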