Hello,

I have been using MAGMA (BLAS) and have hit a bottleneck because some operations are performed on the CPU. Basically, I perform some operations via MAGMA, bring the matrices to the host, send them back to the device, and so forth. I have two options to speed up my code: either use the pthreads library, or perform the operations on the GPU (they are simple comparisons/operations, extremely well suited to the CUDA framework). My question is whether I can access the arrays created by MAGMA routines from a CUDA kernel, perform some operations on the GPU, and then either call the MAGMA routines again or download the data to the host and launch the routine there, thus avoiding the overhead of many small operations and/or the device-host communication.

I am using C, and MAGMA compiled with BLAS. The pseudocode is:

Set up the matrices on the CPU.

For each iteration:

1. Matrix multiplications via MAGMA.
2. Download the result to the host.
3. Check which coefficients are positive and which are negative.
4. Depending on the result, multiply each column of the matrix by a scalar (a different scalar per column).

As you can see, I keep downloading everything to the host after the matrix multiplications, but I know that the other operations are simple enough to be completely suitable for a GPU and CUDA. I would be happy either with a new matrix created to hold the new coefficients, or with the original matrix modified in place on the GPU. I don't know whether I can access the coefficients from CUDA kernels through the pointers I hold on the host, or how they behave.

I am using double precision routines.

Hope my explanation is not a mess. Thanks for your time!

## MAGMA routines and CUDA kernels

### Re: MAGMA routines and CUDA kernels

I'm not sure what you mean by "the arrays created by MAGMA routines". Do you mean arrays allocated by, say, magma_dmalloc? Yes, that's just a chunk of memory on the GPU, so you can process it equally well with MAGMA, cuBLAS, and your own custom CUDA kernels.

It sounds like checking the coefficients and multiplying columns by a scalar would be a relatively easy custom CUDA kernel to write. From that description, it doesn't seem to fit any routines that we already have available in MAGMA.
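As a rough illustration, such a kernel could look like the sketch below. This is not MAGMA API: the kernel name, the sign test (here, the sign of each column's first coefficient), and the two scalars are all placeholder assumptions; the only thing taken from MAGMA's conventions is column-major storage with a leading dimension `ldda`.

```cuda
// Hypothetical sketch: scale each column of an m-by-n column-major
// double-precision matrix dA (leading dimension ldda) by one of two
// scalars, chosen from the sign of that column's first coefficient.
__global__ void scale_columns(int m, int n, double *dA, int ldda,
                              double pos_scale, double neg_scale)
{
    int j = blockIdx.x;                 // one thread block per column
    if (j >= n) return;

    // pick the per-column scalar; the criterion here is illustrative
    double s = (dA[j * ldda] >= 0.0) ? pos_scale : neg_scale;

    // threads of the block stride down the column
    for (int i = threadIdx.x; i < m; i += blockDim.x)
        dA[i + j * ldda] *= s;
}

// launch: one block per column, e.g. 256 threads per block,
// optionally on a specific CUDA stream:
//   scale_columns<<< n, 256, 0, stream >>>(m, n, dA, ldda, 2.0, 0.5);
```

Since the matrix stays in device memory, this replaces the download, the host-side check, and the upload with a single kernel launch.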

-mark

### Re: MAGMA routines and CUDA kernels

Yes, I want to access the numbers stored in GPU memory via pointers. Say I have a magma_dmalloc pointer "p" where I store the result of a matrix multiplication. I want to check the coefficients of "p" and apply some operations to them given some criteria, and then repeat the MAGMA routines.

Also, should I be worried about asynchronous execution? I don't want to call the MAGMA routines before all the operations from the CUDA kernels have finished.

Finally, can I access "p" as if it had been allocated by cudaMalloc?

Thanks for the super fast response!

### Re: MAGMA routines and CUDA kernels

Yes, magma_dmalloc is just a wrapper around cudaMalloc. It is type-safe (you don't need to use sizeof(double) as you do with cudaMalloc), but otherwise nothing special going on.
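In other words, the pointer you get back is an ordinary device pointer. A minimal sketch (the kernel `scale_columns` and the matrix dimensions are placeholder assumptions; `magma_dmalloc`, `MAGMA_SUCCESS`, `magma_dsetmatrix`, and `magma_free` are real MAGMA calls):

```cuda
// Sketch: memory from magma_dmalloc is plain device memory, so the same
// pointer can be passed to MAGMA routines and to custom CUDA kernels.
double *dA;
magma_int_t m = 1024, n = 1024, ldda = m;

// allocates m*n doubles on the device; note: no sizeof(double) needed
if (magma_dmalloc(&dA, (size_t)ldda * n) != MAGMA_SUCCESS) {
    fprintf(stderr, "device allocation failed\n");
    return;
}

// ... fill dA, e.g. with magma_dsetmatrix or as the output of a
//     MAGMA matrix multiplication ...

// hand the same pointer to a custom kernel -- no host round trip
scale_columns<<< n, 256 >>>(m, n, dA, ldda, 2.0, 0.5);

magma_free(dA);
```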

If you call asynchronous MAGMA routines that take a magma_queue, use the stream from the magma_queue to call CUDA functions to have them execute on the same stream. (See magma_queue_get_cuda_stream.) magma_queue is just a simple struct wrapping a CUDA stream and cuBLAS handle. Or you can explicitly synchronize after the MAGMA function using magma_queue_sync.

If you call MAGMA routines that don't take a stream, those are generally synchronous — they don't return until the computation is done.
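Putting the two paragraphs above together, the ordering could be handled like this sketch (the kernel `scale_columns`, the dimensions, and the leading dimensions are placeholder assumptions; the queue and gemm calls follow MAGMA 2.x signatures):

```cuda
// Sketch: launch a custom kernel on the same stream a MAGMA queue uses,
// so it is ordered after the preceding magma_dgemm on that queue.
magma_queue_t queue;
magma_queue_create(0, &queue);              // queue on device 0

// asynchronous MAGMA call on the queue: C = alpha*A*B + beta*C
magma_dgemm(MagmaNoTrans, MagmaNoTrans, m, n, k,
            alpha, dA, ldda, dB, lddb, beta, dC, lddc, queue);

// run the custom kernel on the queue's stream: it starts only after
// the gemm on that stream has finished
cudaStream_t stream = magma_queue_get_cuda_stream(queue);
scale_columns<<< n, 256, 0, stream >>>(m, n, dC, lddc, 2.0, 0.5);

magma_queue_sync(queue);                    // wait for gemm + kernel
magma_queue_destroy(queue);
```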

-mark

Last edited by mgates3 on Thu Nov 07, 2019 4:04 pm, edited 1 time in total. *Reason: clarify "magma_dmalloc", not "magma_malloc", is typesafe.*

### Re: MAGMA routines and CUDA kernels

Thanks for the help, I really appreciate it; you are very kind. I will experiment with the CUDA kernels!