Improving GPU utilisation

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

Post by jlangworthy » Thu Sep 20, 2018 2:47 pm

I need to perform repeated OLS-type calculations, along the lines of the pseudo code below. My arrays are small (say 400x20), but they are used roughly 350 times, so I was hoping for a speedup relative to the CPU on account of the limited number of CPU->GPU transfers. This seems not to be the case, which is rather upsetting: GPU utilisation is only around 5% and CPU around 50%. Profiling the code shows 96% of the time is spent in sgels, so CPU->GPU transfer is not the problem.

I have tried using lots of OMP threads, since I have plenty of GPU memory, but it makes no difference to GPU utilisation. I have 4 GPUs, so even if I only match the CPU, it's still 4x faster!

My algorithm works along the lines of what is listed below (please forgive the python-like syntax):

Allocate various bits of GPU memory for each omp thread

Lots of omp threads
....Send m-by-n matrix A and m-by-1 vector b to the GPU.
....for i in m/8 : m — work on A[0:m-i,:] and b[0:m-i] (mainly 3x sgels, some dgemm and dgemv, and various GPU->GPU slacpy and slaset); the scalar result stays on the GPU
....Send the vector of calculated scalar values back to the CPU.
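For concreteness, my per-thread loop is roughly the following in plain NumPy (a CPU stand-in for the MAGMA calls; the scalar kept per iteration, here the residual norm, is a placeholder for the real quantity, and the shapes are just my typical sizes):

```python
import numpy as np

def shrinking_ols(A, b):
    """Solve a sequence of shrinking least-squares problems.

    Each np.linalg.lstsq call stands in for the sgels call on the GPU;
    the scalar collected per iteration (the residual norm) is a
    hypothetical placeholder for the real per-iteration result.
    """
    m, n = A.shape
    results = []
    for i in range(m // 8, m):
        rows = m - i                 # the active window of rows shrinks
        if rows < n:                 # need at least n rows for OLS
            break
        x = np.linalg.lstsq(A[:rows, :], b[:rows], rcond=None)[0]
        results.append(np.linalg.norm(A[:rows, :] @ x - b[:rows]))
    return np.array(results)

rng = np.random.default_rng(0)
A = rng.standard_normal((400, 20)).astype(np.float32)
b = rng.standard_normal(400).astype(np.float32)
scalars = shrinking_ols(A, b)
```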

I am not sure where the bottleneck is. Any help would be greatly appreciated. Apologies if the above is not clear; I was trying to keep it brief!



Re: Improving GPU utilisation

Post by mgates3 » Thu Sep 20, 2018 3:18 pm

My guess is the matrix is too small to make use of the hybrid code in MAGMA. Most likely, the entire computation is being done on the CPU. You can try comparing time for magma_sgeqrf (hybrid CPU-GPU) vs. magma_sgeqrf_batched (GPU-only) with a batch count of 1. If the batched routine is a significant improvement, that indicates you need a GPU-only version of gels, which is basically geqrf + ormqr + trsm.
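The gels decomposition I mean (QR factorisation, apply Q^T to the right-hand side, triangular back-solve) can be sketched in NumPy as below. A GPU-only MAGMA version would chain the corresponding device-side routines, but the algebra is the same:

```python
import numpy as np

def gels_via_qr(A, b):
    """Least-squares solve expressed as geqrf + ormqr + trsm."""
    Q, R = np.linalg.qr(A)        # geqrf: factor A = Q R (reduced form)
    y = Q.T @ b                   # ormqr: apply Q^T to the RHS
    return np.linalg.solve(R, y)  # trsm: back-solve R x = Q^T b

rng = np.random.default_rng(1)
A = rng.standard_normal((400, 20))
b = rng.standard_normal(400)
x = gels_via_qr(A, b)

# Agrees with the LAPACK least-squares driver:
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
```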

Possibly the matrix is even too small to make efficient use of the GPU, as only one of the GPU's multiprocessors will be active and the others idle. But if you have a lot of identical size matrices at the same time, batching them together allows you to occupy all the GPU's multiprocessors, which should be a big win.
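As a CPU-side illustration of why batching helps (MAGMA's *_batched routines play the analogous role on the GPU), NumPy can solve a whole stack of systems in one broadcast call. Normal equations are used here only to keep the sketch short; the sizes are the poster's:

```python
import numpy as np

# 350 independent 400x20 least-squares problems, stacked on a batch axis.
rng = np.random.default_rng(2)
A = rng.standard_normal((350, 400, 20))
b = rng.standard_normal((350, 400))

# Batched normal equations: one broadcast call factors all 350 systems,
# rather than 350 launches each touching a single multiprocessor.
AtA = np.einsum('bij,bik->bjk', A, A)   # (350, 20, 20), A^T A per batch
Atb = np.einsum('bij,bi->bj', A, b)     # (350, 20),     A^T b per batch
x = np.linalg.solve(AtA, Atb)           # (350, 20), all solved at once
```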

