## Allocation and Execution on a single GPU

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
pedro_diniz
Posts: 2
Joined: Mon Dec 03, 2018 11:37 pm

### Allocation and Execution on a single GPU

All,

I've just begun to use MAGMA, so I have a few simple questions.

Consider the loop below:

```fortran
do i = 1, N
   call F(...)
   call G(...)
enddo
```

where G calls a MAGMA function (an eigensolver) and uses a private matrix S.
In addition, all inputs and outputs of G are disjoint, i.e., the inputs and
outputs are slices of a cube.

One approach would be to have the matrix S allocated on a GPU and batch the
various executions of G for each of the various iterations of the do loop.

I'm having a hard time figuring out how to do the following:

1. controlling the execution of the eigensolver on a single GPU since my system has two GPUs
2. allocating S on the GPU.

I could possibly drop down to C and use queues and all that, but I'm wondering if there is an easier
route, programming-wise.

Best,

Pedro

mgates3
Posts: 842
Joined: Fri Jan 06, 2012 2:13 pm

### Re: Allocation and Execution on a single GPU

2. For allocating matrices on the GPU from Fortran, the latest release (MAGMA 2.5 rc 1) provides magmaf_[sdcz]malloc (note the f for Fortran in the magmaf prefix).
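Under the assumption that you are on the MAGMA 2.5 Fortran interface, a minimal allocation sketch might look like the following. The exact argument lists and the kind of the device-pointer integer are assumptions here; check the declarations shipped in MAGMA's Fortran interface files before relying on them.

```fortran
program gpu_alloc_sketch
   implicit none
   ! Device pointer: MAGMA's Fortran interface stores device addresses
   ! in an integer wide enough to hold a pointer (typically kind 8).
   integer(kind=8) :: dS
   integer :: n

   n = 1000
   call magmaf_init()

   ! Allocate an n-by-n double-precision matrix S in GPU memory.
   call magmaf_dmalloc( dS, n*n )

   ! ... copy data to the device, run solvers, copy results back ...

   call magmaf_free( dS )
   call magmaf_finalize()
end program gpu_alloc_sketch
```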

Otherwise, you can use the (very old) cuBLAS Fortran interface that comes in CUDA/src/fortran.c. It has cublas_alloc, cublas_free, cublas_set_matrix, etc. See for example, magma/example/example_f.F90 and magma/example/Makefile.

1. As for using multiple GPUs, there are a couple of choices. You can use the multi-GPU MAGMA functions such as dsyevd_m or dsyevdx_2stage_m, where MAGMA internally uses multiple GPUs to solve a single problem. This is best if your problem size is large, say 20,000 x 20,000. Otherwise, you can create two threads (using OpenMP or pthreads, for example) and run a different problem on each GPU, in parallel, using the single-GPU MAGMA functions such as dsyevd or dsyevdx_2stage. Each thread should set its device using magmaf_setdevice( dev ), available in MAGMA 2.5; CUDA maintains the current device per thread. Prior to that release, you might have to write your own Fortran wrapper to call cudaSetDevice, since it isn't in the cuBLAS fortran.c.
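The two-thread approach can be sketched with OpenMP roughly as below. F and G are the placeholders from your original loop, and the thread-to-device mapping is just one reasonable choice; magmaf_setdevice is per the MAGMA 2.5 interface described above.

```fortran
! Sketch: run independent problems on two GPUs with OpenMP.
subroutine run_on_two_gpus( N )
   use omp_lib
   implicit none
   integer, intent(in) :: N
   integer :: i, dev

!$omp parallel do num_threads(2) private(dev) schedule(static)
   do i = 1, N
      ! Bind this thread to one of the two GPUs. CUDA tracks the
      ! current device per thread, so this is safe inside the region.
      dev = mod( omp_get_thread_num(), 2 )
      call magmaf_setdevice( dev )

      ! call F(...)   placeholder: setup for iteration i
      ! call G(...)   placeholder: single-GPU solver, e.g. dsyevd
   enddo
!$omp end parallel do
end subroutine run_on_two_gpus
```

With schedule(static) each thread gets a contiguous block of iterations, so each GPU sees a steady stream of problems rather than the two threads contending for the same device.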

Note that dsyevd takes the matrix (A) in CPU memory, while dsyevd_gpu takes the matrix (dA) in GPU device memory. The "d" prefix usually denotes GPU device memory.
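For instance, a CPU-memory call through the Fortran interface might look like the line below. The argument list mirrors LAPACK's dsyevd plus the trailing info; treat this exact wrapper signature as an assumption and verify it against the interface in your MAGMA release.

```fortran
! Sketch: eigenvalues/eigenvectors of symmetric A held in CPU memory.
! Workspace sizes follow the usual LAPACK convention (query with lwork = -1).
call magmaf_dsyevd( 'V', 'L', n, A, lda, w,              &
                    work, lwork, iwork, liwork, info )
```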

Only the MAGMA BLAS functions that directly launch a GPU kernel (gemm, symm, lacpy, transpose, ...) take a queue. Most of the higher level LAPACK-like routines such as syevd do not take a queue but internally allocate their queues.
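If you do end up calling the queue-taking BLAS-level routines, the queue handling from Fortran is small. A hedged sketch follows; the routine names are per the MAGMA 2.x Fortran interface and worth double-checking against your release.

```fortran
! Sketch: create a queue on device 0, use it, destroy it.
integer(kind=8) :: queue   ! opaque queue handle (pointer-sized integer)
integer :: dev

dev = 0
call magmaf_queue_create( dev, queue )

! ... pass `queue` to MAGMA BLAS calls such as gemm on device matrices ...

call magmaf_queue_destroy( queue )
```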

-mark
Last edited by mgates3 on Tue Dec 04, 2018 2:29 pm, edited 1 time in total.
Reason: clarify