Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
I'm looking at magma_dgeqrf_gpu (dgeqrf_gpu.cpp), for example, and see host allocations and data exchange between CPU and GPU. Can't we allocate all the memory, including input, output, and work, and call a sequence of GPU kernels to operate on it? Without calling CPU LAPACK functions?
This is possible, and people have tried it, but it is in general slower. The computations/tasks that are offloaded to the CPU are small and cannot be executed efficiently in parallel on the GPU. Running these small tasks on the CPU instead lets them overlap with more efficient work (e.g., Level 3 BLAS) on the GPU. Asymptotically, for large matrices, the execution of the small tasks on the CPU gets totally overlapped by work on the GPU, and as a result the overall algorithm runs at the speed at which the GPU can execute the Level 3 BLAS.
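To make the split concrete, here is a minimal NumPy/SciPy sketch of a blocked QR factorization, the algorithm behind dgeqrf. This is not MAGMA code; it just shows which step is the small panel factorization (the part MAGMA sends to CPU LAPACK) and which step is the large Level 3 BLAS trailing-matrix update (the part that stays on the GPU). The block size `nb` and the explicit accumulation of Q are choices made here for clarity.

```python
import numpy as np
from scipy.linalg import qr

def blocked_qr(A, nb=32):
    """Blocked QR: factor a narrow panel, then update the trailing
    matrix with a matrix-matrix product (Level 3 BLAS).  In MAGMA's
    hybrid dgeqrf the panel step runs on the CPU while the GPU does
    the trailing update, and the two overlap."""
    m, n = A.shape
    R = A.astype(float, copy=True)
    Q = np.eye(m)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # Panel factorization: tall and narrow, little parallelism,
        # so MAGMA offloads this piece to the CPU (LAPACK dgeqrf).
        Qp, Rp = qr(R[j:, j:j + jb], mode="full")
        R[j:, j:j + jb] = Rp
        # Trailing-matrix update: one big GEMM-like operation,
        # exactly the work that runs efficiently on the GPU.
        R[j:, j + jb:] = Qp.T @ R[j:, j + jb:]
        Q[:, j:] = Q[:, j:] @ Qp
    return Q, R
```

In MAGMA the panel for step j+1 is factored on the CPU while the GPU is still applying the update for step j (a "look-ahead"), which is why the CPU work disappears from the critical path for large matrices.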