Different results for DGEQRF_GPU and DGEQRF2_GPU
Different results for DGEQRF_GPU and DGEQRF2_GPU
For my application I need to extract both Q and R matrices from a QR decomposition.
It seems that the right routines for doing this in MAGMA are magma_dgeqrf_gpu and magma_dorgqr_gpu.
However, the R matrix computed by magma_dgeqrf_gpu is not the same as returned by magma_dgeqrf2_gpu
and LAPACK. Does magma_dgeqrf_gpu use a different scheme to store R?
Also, which is the purpose of magma_dgeqrf3_gpu? The documentation talks about some parts of R stored
separately, but this is not quite clear to me.
Thanks for your wonderful library, its source code is an excellent way of learning how to use the GPU.
Regards
It seems that the right routines for doing this in MAGMA are magma_dgeqrf_gpu and magma_dorgqr_gpu.
However, the R matrix computed by magma_dgeqrf_gpu is not the same as returned by magma_dgeqrf2_gpu
and LAPACK. Does magma_dgeqrf_gpu use a different scheme to store R?
Also, which is the purpose of magma_dgeqrf3_gpu? The documentation talks about some parts of R stored
separately, but this is not quite clear to me.
Thanks for your wonderful library, its source code is an excellent way of learning how to use the GPU.
Regards

 Posts: 264
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
Hello,
We are happy you find MAGMA useful for your research!
The reason for the different versions (which indeed can make use confusing!) is performance considerations. If you need to explicitly generate both Q and R you must use version 3. As an example of how to use it you can look in files dgels3_gpu.cpp and dgeqrs3_gpu.cpp. For example, you may have to call
The reason for this is that to make operations involving Q fast we put zeroes in the upper triangular parts of the panels and ones on the diagonals (and the Householder vectors are stored below, as in LAPACK). This destroys R though and therefore we had to store it separately (in dT). To regenerate R we provide the magmablas_dswapdblk routine (see above). Examples on how to avoid the above copies if you just need to solve a least squares problem is given in dgeqrs3_gpu.cpp (there we have multiplication with Q' and solve using R, where we first generate R in place, use it to solve, and move back the data to the original stage for direct application/use of Q if needed).
Stan
We are happy you find MAGMA useful for your research!
The reason for the different versions (which indeed can make use confusing!) is performance considerations. If you need to explicitly generate both Q and R you must use version 3. As an example of how to use it you can look in files dgels3_gpu.cpp and dgeqrs3_gpu.cpp. For example, you may have to call
Code: Select all
// get nb  in all functions has to be the same
int nb = magma_get_dgeqrf_nb(m);
// Do the QR factorization
magma_dgeqrf3_gpu( m, n, dA, ldda, tau, dT, info );
// copy dA to dQ to generate Q in dQ
cudaMemcpy2D(dQ, m*sizeof(double), dA, ldda*sizeof(double), m*sizeof(double), n, cudaMemcpyDeviceToDevice);
// copy dA to dR to generate R in dR
cudaMemcpy2D(dR, m*sizeof(double), dA, ldda*sizeof(double), m*sizeof(double), n, cudaMemcpyDeviceToDevice);
// generate Q
magma_dorgqr_gpu(m, m, min(m, n), dQ, lda, tau, dT, nb, &info);
// generate R
magmablas_dswapdblk(min(m, n), nb, dR, lda, 1, dT+min(m,n)*nb, nb, 0);
Stan
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
Is there is a simpler way to read off just the diagonal elements of the R matrix that does not require creating a second copy of the dA matrix (require at least an additional dR matrix, the Q matrix can be calculate in the location of dA since I don't wish to keep this information)?
Also, I'm attempting to compile my code and keep getting an error that there is an undefined reference to magma_stream in libmagmablas.a. Any thoughts?
Thanks,
B
Also, I'm attempting to compile my code and keep getting an error that there is an undefined reference to magma_stream in libmagmablas.a. Any thoughts?
Thanks,
B

 Posts: 264
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
You can generate Q at the place of dA but you will loose R, unless you copy it somewhere else first. If you need just the diagonal elements of R you can get them as the diagonals of the nb x nb matrices starting from dT+min(m,n)*nb. magma_stream is defined as
in file control/common.cpp. This file gets compiled and added to libmagma.a. The magma_stream variable is used in magmablas. It is declared in common_magma.h as
and common_magma.h is included in the magmablas routines that use magma_stream.
Code: Select all
cudaStream_t magma_stream = 0;
Code: Select all
extern cudaStream_t magma_stream;
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
Stan,
From looking at the code magmablas_dswapdblk swaps values from dR and dT. It also seems to me from looking at the code and your reply that should I compute Q prior to grabbing R from dT I would destroy the information regarding R that is stored in dT, but that is precisely what is done in the code example above. Additionally, since magmablas_dswapdblk swaps values if I were to call this prior to computing Q I would destroy any information in dT that I would need to compute Q. Therefore, shouldn't it be dT that is temporarily copied, and not dA?
Lastly, I still don't understand why I continue to get an error stating that magma_stream is undefined. I've stripped testing_sgeqrf_gpu.cpp and testing_sorgqr.cpp to their respective bare minimums, and it seems the only header file that is necessary to execute magma_sgeqrf_gpu and magma_sorgqr_gpu is magma.h Additionally, I am linking all the same libraries when I try to compile my code as when I compile the two testing files. The testing files compile and run, but I continue to receive the same error.
As always, your help is much appreciated.
Thank you,
B
From looking at the code magmablas_dswapdblk swaps values from dR and dT. It also seems to me from looking at the code and your reply that should I compute Q prior to grabbing R from dT I would destroy the information regarding R that is stored in dT, but that is precisely what is done in the code example above. Additionally, since magmablas_dswapdblk swaps values if I were to call this prior to computing Q I would destroy any information in dT that I would need to compute Q. Therefore, shouldn't it be dT that is temporarily copied, and not dA?
Lastly, I still don't understand why I continue to get an error stating that magma_stream is undefined. I've stripped testing_sgeqrf_gpu.cpp and testing_sorgqr.cpp to their respective bare minimums, and it seems the only header file that is necessary to execute magma_sgeqrf_gpu and magma_sorgqr_gpu is magma.h Additionally, I am linking all the same libraries when I try to compile my code as when I compile the two testing files. The testing files compile and run, but I continue to receive the same error.
As always, your help is much appreciated.
Thank you,
B
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
Specifically the lack of definition in magma_stream comes up in magmablas_ssetdiag1subdiag0, szero_32x32_block, szero_nbxnb_block, magmablas_slaset.
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
@stan,
Thank you very much for your help, it works wonderfully.
@maranhao
I had the same problem, the linker is not able to solve the circular dependencies between magmablas and magma.
Try to link using something like this: L$(MAGMAPATH)/lib lmagma lmagmablas lmagma
Thank you very much for your help, it works wonderfully.
@maranhao
I had the same problem, the linker is not able to solve the circular dependencies between magmablas and magma.
Try to link using something like this: L$(MAGMAPATH)/lib lmagma lmagmablas lmagma

 Posts: 264
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
The magma_stream variable is indeed not defined in the routines mentioned but is defiled in libmagma.a. Try the suggestion from Andreas of listing lmagma twice during the linking.
You can try out the suggestion about the copies; I am not sure I understand what you need to keep and generate. The example in the code is working, but it may have copies that you would not need in your specific case; it is just to show how to use the routines. Another example is solving the least squares problem Q R x = b (see files dgels3_gpu.cpp and dgeqrs3_gpu.cpp) without generating Q (only the application of Q^T) and generating R in order to do a triangular solve.
You can try out the suggestion about the copies; I am not sure I understand what you need to keep and generate. The example in the code is working, but it may have copies that you would not need in your specific case; it is just to show how to use the routines. Another example is solving the least squares problem Q R x = b (see files dgels3_gpu.cpp and dgeqrs3_gpu.cpp) without generating Q (only the application of Q^T) and generating R in order to do a triangular solve.
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
@andres,
Thanks, that did the trick.
@stan
Thanks for all your help. I was able to figure out how to read off just the diagonal. I was wondering how you guys would like me to cite the use of MAGMA in a publication.
Thanks, that did the trick.
@stan
Thanks for all your help. I was able to figure out how to read off just the diagonal. I was wondering how you guys would like me to cite the use of MAGMA in a publication.

 Posts: 264
 Joined: Fri Aug 21, 2009 10:39 pm
Re: Different results for DGEQRF_GPU and DGEQRF2_GPU
I am glad the issue got resolved!
Here are a few publications that we use for the onesided factorizations and solvers, the twosided factorizations and solvers, and MAGMA BLAS.
S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing , Vol. 36, No 5&6, pp. 232240 (2010).
S. Tomov, R. Nath, and J. Dongarra, Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPUbased computing, Parallel Computing, vol. 36, number 12, December 2010, pp 645 – 654.
R. Nath, S. Tomov, and J. Dongarra, An Improved MAGMA GEMM for Fermi GPUs, International Journal of High Performance Computing Applications, volume 24, number 4, 2010, pp 511515, ISSN 10943420.
Here is also a site with the MAGMArelated publications.
Here are a few publications that we use for the onesided factorizations and solvers, the twosided factorizations and solvers, and MAGMA BLAS.
S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing , Vol. 36, No 5&6, pp. 232240 (2010).
S. Tomov, R. Nath, and J. Dongarra, Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPUbased computing, Parallel Computing, vol. 36, number 12, December 2010, pp 645 – 654.
R. Nath, S. Tomov, and J. Dongarra, An Improved MAGMA GEMM for Fermi GPUs, International Journal of High Performance Computing Applications, volume 24, number 4, 2010, pp 511515, ISSN 10943420.
Here is also a site with the MAGMArelated publications.