Hello

I have a problem where I have to make 9 different sgemv_batched calls on completely different data, save for the batch of A arrays, which is really the same matrix over and over. So I thought I could parallelize them by creating 9 different queues and assigning one batched sgemv to each queue. However, the total time is still the sum of the times of the individual batches. I'm using magma_v2 and declare the queues like so:

int device = 0;

magma_queue_t queue;

magma_queue_create(device, &queue);

So my question is: is it impossible to launch all those batched sgemvs simultaneously, because of the function or something else I am unaware of, or am I making a mistake in my execution (in which case I will post my full code)?

(Btw I built MAGMA with sequential MKL, not sure if that has anything to do with it)

The A matrix is the same 128x128 matrix, x and y are vectors of 128 components, and the batchCount is around 16000.

Also, a slightly different question: do 3-4 milliseconds sound OK for each batch on a GTX 970?

Any help would be greatly appreciated

Cheers

## Multiple queues and sgemv_batched

### Re: Multiple queues and sgemv_batched

Hi,

If you create different queues and launch a different sgemv_batched on each one, you are telling the GPU that whenever it has a slot available for work, it can launch work from queue 2, 3, 4, etc. Now two points:

1- If you dispatch them over 9 queues that can run in parallel, why not make your batchcount larger, submit everything in 1 queue, and let MAGMA handle it? (It amounts to the same thing.)

2- If you want to stay with your decision and use many queues, that is OK. If the GPU has free slots, meaning SMXs without work, the Nvidia scheduler will start launching work from queues 2, 3, 4, etc. But if your batch (the number of matrices in 1 call) is large enough to fill the GPU, it will behave as if you had launched the 9 calls to sgemv_batched sequentially in the same queue (so you will get the same time). Since you have 16000 matrices of size 128x128, I believe each batch is large enough to fill the GPU, so the timings for 9 sequential calls, 9 calls spread over 9 queues, or 1 call to sgemv_batched containing all of them will be very close to each other.

Note that the currently released version of sgemv_batched supports a maximum of 65000 matrices as batchcount, so you need at least 3 batched calls for your 9*16000 matrices. Between 3 and 9 calls it is not going to matter, since a batchcount of 16000 is large enough.
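As a small illustration of the "at least 3 batched calls" count, here is a minimal sketch in plain C. The helper name `batched_calls_needed` is hypothetical (it is not a MAGMA function), and the limit of 65000 is simply the figure quoted above, passed in as a parameter rather than queried from the library.

```c
#include <assert.h>

/* Hypothetical helper, not part of MAGMA: number of batched calls needed
 * to cover `total` matrices when a single call accepts at most
 * `max_per_call` of them. This is a ceiling division. */
long batched_calls_needed(long total, long max_per_call)
{
    assert(total >= 0 && max_per_call > 0);
    return (total + max_per_call - 1) / max_per_call;
}
```

With the numbers from this thread, `batched_calls_needed(9 * 16000, 65000)` gives 3, while a single batch of 16000 fits in 1 call.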

For performance:

Here is a way to verify your timing numbers.

The peak performance in Gflop/s of a GEMV (matrix-vector product) is the memory bandwidth of the hardware in GB/s divided by 2 for single precision (sgemv) and by 4 for double precision (dgemv).

The performance of GEMV is the number of operations / time = 2n^2 / time.

For your case it is 16000*2*(128^2) / (3 to 4 * 10^-3 sec) = about 130-175 Gflop/s.

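According to Nvidia https://www.geforce.com/hardware/deskto ... ifications, I think your hardware, the GTX 970, has a memory bandwidth of 224 GB/sec, meaning your sgemv is really reaching the peak, since the peak sgemv will be around 224/2 = 112 Gflop/s. (Your numbers can come out above that estimate, likely because A is the same matrix in every batch entry and so can be reused from cache instead of being re-read from memory each time.)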
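The check above can be sketched in plain C as follows. Both function names are hypothetical helpers (nothing here comes from MAGMA), and the peak estimate assumes each matrix element is read from memory exactly once and used for 2 flops, which is where the "bandwidth divided by 2" and "divided by 4" rules come from.

```c
#include <assert.h>

/* Hypothetical helper: achieved rate in Gflop/s for `batch` GEMVs on
 * n x n matrices, counting 2*n*n flops per GEMV. */
double gemv_batched_gflops(long batch, long n, double seconds)
{
    assert(batch > 0 && n > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)batch;
    return flops / seconds / 1e9;
}

/* Hypothetical helper: bandwidth-bound GEMV estimate in Gflop/s.
 * Assumes every matrix element (bytes_per_elem bytes) is read once and
 * contributes 2 flops, so the rate is bandwidth * 2 / bytes_per_elem:
 * bandwidth/2 for float (4 bytes), bandwidth/4 for double (8 bytes). */
double gemv_peak_gflops(double bandwidth_gb_per_s, int bytes_per_elem)
{
    assert(bandwidth_gb_per_s > 0.0 && bytes_per_elem > 0);
    return bandwidth_gb_per_s * 2.0 / (double)bytes_per_elem;
}
```

For the figures in this thread, `gemv_peak_gflops(224.0, 4)` is 112 and `gemv_peak_gflops(224.0, 8)` is 56, while `gemv_batched_gflops(16000, 128, 3e-3)` and `gemv_batched_gflops(16000, 128, 4e-3)` come out at roughly 175 and 131 Gflop/s.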

MKL is not important for the MAGMA BLAS routines; it is useful for the MAGMA LAPACK routines.

Azzam

### Re: Multiple queues and sgemv_batched

So it really doesn't matter for these sizes. Thank you for your thorough reply, really helps to know specific numbers.