papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby nahla » Mon Feb 29, 2016 2:31 pm

Hello everyone,

I'm doing some research on the QR factorization and its implementation on GPUs for my thesis. I want to consider MAGMA's dgeqrf_gpu / dgeqrf routines from the latest version 2.0.1 and apply one of them to factorize big dense matrices with more rows than columns.
My first question is: What is the difference between these two routines? Does dgeqrf_gpu run entirely on the GPU?

Furthermore, and even more important, I would like to understand how the QR factorization is implemented and what the GPU-CPU communication looks like. Is there any detailed documentation, or are there papers on this topic? So far I have only found explanations of the dgeqrf routine from version 1.0.0 or 1.1.0. Does anybody know what has changed since then?
Maybe there is a paper proposing how to improve an earlier version which is now implemented in the current version?

I hope that you can help me and I'm looking forward to trying out some calculations on the GPU using MAGMA.

Thanks,
nahla
nahla
 
Posts: 5
Joined: Mon Feb 29, 2016 1:48 pm

Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby mgates3 » Tue Mar 01, 2016 5:26 pm

For magma_*geqrf_gpu, the matrix dA is already in GPU memory. This avoids allocating GPU memory and copying the matrix over.
For magma_*geqrf, the matrix A is in CPU memory.
Both are hybrid routines, using the CPU and GPU together. In both cases, the panel is factored on the CPU by calling LAPACK geqrf, and the trailing matrix is updated on the GPU using gemm (inside larfb_gpu).
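To make the hybrid scheme concrete, here is a minimal NumPy sketch of the blocked QR idea described above: factor one panel at a time (the CPU's job in MAGMA), then apply that panel's reflectors to the trailing matrix (the GPU's job, via larfb/gemm). This is an illustration, not MAGMA code; for clarity it builds Q explicitly instead of using LAPACK's compact Householder (WY) storage.

```python
import numpy as np

def blocked_qr(A, nb=32):
    """Sketch of the blocked QR scheme behind LAPACK/MAGMA geqrf.

    In MAGMA the panel factorization runs on the CPU (LAPACK dgeqrf)
    and the trailing-matrix update runs on the GPU (larfb -> gemm).
    Here both run sequentially on the CPU for illustration.
    """
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # Panel factorization ("CPU" step in MAGMA).
        Qp, Rp = np.linalg.qr(R[j:, j:j + jb], mode="complete")
        R[j:, j:j + jb] = Rp
        # Trailing-matrix update ("GPU" step in MAGMA).
        R[j:, j + jb:] = Qp.T @ R[j:, j + jb:]
        # Accumulate Q (geqrf itself keeps reflectors implicitly).
        Q[:, j:] = Q[:, j:] @ Qp
    return Q, R
```

With a tall matrix (more rows than columns, as in the original question), Q @ R reproduces A and R is upper triangular.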

-mark
mgates3
 
Posts: 738
Joined: Fri Jan 06, 2012 2:13 pm

Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby Linuxboy » Wed Mar 02, 2016 6:00 am

Hi Mark! What about the LU decomposition routines magma_dgetrf and magma_dgetrf_gpu (magma_dgetrf_mgpu)?
Linuxboy
 
Posts: 15
Joined: Tue Nov 29, 2011 9:24 pm

Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby nahla » Wed Mar 02, 2016 7:32 am

Hi mark,

Thanks a lot so far. Do you also know of any papers describing the latest magma_dgeqrf version?

nahla
nahla
 
Posts: 5
Joined: Mon Feb 29, 2016 1:48 pm

Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby mgates3 » Thu Mar 10, 2016 10:05 am

Papers are available on our website:
http://icl.cs.utk.edu/magma/pubs/

-mark
mgates3
 
Posts: 738
Joined: Fri Jan 06, 2012 2:13 pm

Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby nahla » Mon Mar 14, 2016 6:46 am

Hey Mark,

Yes, I have seen them already, of course. They were not exactly what I was looking for.

-nahla
nahla
 
Posts: 5
Joined: Mon Feb 29, 2016 1:48 pm

Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?

Postby haidar » Fri Mar 25, 2016 2:19 pm

Hi Nahla,
All the current MAGMA routines are hybrid, meaning they use both the CPU and the GPU.
Overall, Cholesky, LU, and QR follow the LAPACK style of factorization: a panel factorization followed by an update of the trailing matrix.
A general overview is given in:
https://www.google.com/url?sa=t&rct=j&q ... 8183,d.dmo
or
http://ieeexplore.ieee.org/xpl/articleD ... er=6877282
A brief description:
The CPU is used to factorize the panel, while the GPU performs the update.
In MAGMA we implement what we call a lookahead panel, meaning that the trailing matrix update is split into two portions:
the next panel (portion 1) is updated first so it can be sent to the CPU to be factorized, while the update of the remaining trailing matrix (portion 2) continues on the GPU.
This way, while the GPU is updating portion 2, the CPU is performing the factorization of the next panel and sending the data back to the GPU, which hides the cost of the panel factorization.
As a consequence, the performance of LU/QR will be close to the performance of the update (which is mostly some kind of GEMM kernel).
The algorithm looks like this:

for step = 1; step < N; step += nb
    0 - send the panel of step to the CPU
    1 - factorize the panel of step (on CPU)
    2 - send the factorized panel to the GPU
    3 - update the panel of step+1 (on GPU)
    4 - update the remaining trailing matrix (on GPU)

Note that steps 0, 1, 2, and 3 go to one stream, while step 4 is on another stream, so that the two are parallel and overlapped. There are also dependencies that must be satisfied.
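The loop above can be mirrored in a short NumPy sketch. This is only an illustration of the lookahead split, not MAGMA code: everything runs sequentially here, but the trailing update is divided exactly as described, with the next panel (portion 1) updated before the rest (portion 2), which in the real hybrid code is what lets the CPU's panel factorization overlap with the GPU's larger update.

```python
import numpy as np

def qr_with_lookahead(A, nb=4):
    """Sequential sketch of the lookahead QR loop.

    Portion 1 (the next panel) is updated first; in MAGMA it would
    then be sent to the CPU for factorization while the GPU, on a
    separate stream, keeps updating portion 2 (the rest).
    """
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # 1 - factorize the panel of this step (CPU in MAGMA)
        Qp, Rp = np.linalg.qr(R[j:, j:j + jb], mode="complete")
        R[j:, j:j + jb] = Rp
        Q[:, j:] = Q[:, j:] @ Qp
        # 3 - update the NEXT panel first (portion 1; it would now
        #     be sent to the CPU for the following step)
        nxt = min(j + jb + nb, n)
        R[j:, j + jb:nxt] = Qp.T @ R[j:, j + jb:nxt]
        # 4 - update the remaining trailing matrix (portion 2),
        #     overlapped with the CPU's next panel in the real code
        R[j:, nxt:] = Qp.T @ R[j:, nxt:]
    return Q, R
```

The result is numerically the same as an unsplit blocked QR; only the order of the update, and hence the possible overlap, changes.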


What is your interest?
We also have native code that runs only on the GPU, but it is not released yet.
Thanks
Azzam
haidar
 
Posts: 18
Joined: Fri Sep 19, 2014 3:43 pm

