Thanks for trying out MAGMA and for the input. The GTX 285 results look impressive!
The GPU interface routines show better GPU performance than their CPU interface counterparts (e.g., sgetrf vs. sgetrf_gpu). Is that because the time to exchange data between CPU memory and GPU memory is larger in the CPU interface than in the GPU interface?
Briefly, yes. As most of the computation is done on the GPU, to minimize communication the matrix to be factored has to reside mostly in GPU memory. In the CPU interface the matrix starts on the CPU and the result is expected back on the CPU, so there is an overhead of copying the original matrix to the GPU and bringing the result back to the CPU. For some algorithms, QR for example, we can better intermix computation and communication and hide some of this overhead.
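To make that overhead concrete, here is a small stand-alone timing sketch. It is plain CUDA runtime code, not part of MAGMA, and the matrix order N is an arbitrary placeholder; it only measures the host-to-device and device-to-host copies that the CPU interface has to perform around the factorization itself:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 8192;                      /* matrix order (placeholder)     */
    size_t bytes = (size_t)N * N * sizeof(float);

    float *hA = (float*)malloc(bytes);       /* matrix in CPU memory           */
    float *dA;                               /* matrix in GPU memory           */
    cudaMalloc((void**)&dA, bytes);          /* contents do not matter here    */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    /* CPU interface: copy the matrix to the GPU ...                           */
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    /* ... the factorization itself would run here ...                         */
    /* ... and the result is brought back to the CPU.                          */
    cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host<->device transfer overhead: %.1f ms for %.1f MB moved\n",
           ms, 2.0 * bytes / 1e6);

    cudaFree(dA);
    free(hA);
    return 0;
}

In the GPU interface (sgetrf_gpu) the matrix is already in GPU memory, so these two copies simply do not appear in the timed region.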
Also, when MAGMA computes on the CPU and the GPU at the same time, can you explain how the data is divided between them?
There are of course variations for the different algorithms, but in general, if we look at
Figure 4 and the notation there, the panel A1 has to be factored and A2 updated. For the one-sided factorizations currently in MAGMA, no data other than A1 is needed in order to factor it, so A1 is sent to the CPU and factored there. This is overlapped with updating A2 (from previous iterations) on the GPU; a rough sketch of this pattern is given after the references below. More on this can be found in
Tomov, S., Dongarra, J., Baboulin, M.
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems, LAPACK Working Note 210, October 17, 2008.
for the one-sided factorizations and in
Tomov, S., Dongarra, J.
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing, LAPACK Working Note 219, May 24, 2009.
for the two-sided factorizations.
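As an aside, the following is a much-simplified sketch of the overlap pattern mentioned above. It is not the MAGMA source: pivot swaps, triangular solves, and the look-ahead bookkeeping are omitted, and all names, arguments, and leading dimensions are placeholders. It only shows the basic idea of factoring the panel with LAPACK on the CPU while a CUBLAS GEMM updates A2 on the GPU:

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* LAPACK panel factorization on the CPU */
extern void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info);

void hybrid_step(cublasHandle_t handle, cudaStream_t update_stream,
                 float *dPanel, int lddp,      /* current panel A1, on the GPU */
                 float *dL, float *dU,         /* blocks from the previous step */
                 float *dA2, int ldda,         /* trailing matrix A2, on the GPU */
                 float *hPanel, int ldhp,      /* host buffer for the panel     */
                 int m, int n2, int nb, int *ipiv)
{
    int info;
    const float one = 1.0f, minus_one = -1.0f;

    /* 1. Bring the current panel A1 (m x nb) from GPU memory to the CPU.   */
    cublasGetMatrix(m, nb, sizeof(float), dPanel, lddp, hPanel, ldhp);

    /* 2. Launch the update of A2 (left over from the previous iteration)
     *    on the GPU: A2 <- A2 - L * U.  The call returns immediately and
     *    the GEMM runs asynchronously on update_stream.                    */
    cublasSetStream(handle, update_stream);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n2, nb,
                &minus_one, dL, ldda, dU, ldda, &one, dA2, ldda);

    /* 3. While the GPU is busy with A2, factor the panel on the CPU.       */
    sgetrf_(&m, &nb, hPanel, &ldhp, ipiv, &info);

    /* 4. Send the factored panel back to the GPU for subsequent updates.   */
    cublasSetMatrix(m, nb, sizeof(float), hPanel, ldhp, dPanel, lddp);
}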
Regards,
Stan Tomov