
Increased GPU memory requirement with MAGMA 1.0

PostPosted: Mon Feb 07, 2011 10:22 am
by evanlezar
Hi,

I have recently made the move from MAGMA 0.2 to 1.0 RC3, and have noticed that in certain cases the device memory used by the GPU LU decomposition (device interface), such as that provided by magma_cgetrf_gpu(), has increased by about a factor of 2.

Upon investigating cgetrf_gpu.cpp, I found the following (from line 138).
Code: Select all
if ((m == n) && (m % 32 == 0) && (ldda % 32 == 0))
    magmablas_cinplace_transpose( dAT, ldda, lddat );
else {
    if ( CUBLAS_STATUS_SUCCESS != cublasAlloc(maxm*maxn, sizeof(cuFloatComplex), (void**)&dAT) ) {
        cublasFree( dAP );
        return MAGMA_ERR_CUBLASALLOC;
    }
    magmablas_ctranspose2( dAT, lddat, dA, ldda, m, n );
}


It seems that if the matrix is not square, or its size is not a multiple of 32, then additional device memory equal to the matrix size (padded to 32) is allocated. I am sure that this padding helps performance, but it does make the interaction with my existing code a little difficult. Has anyone else come across this, and is there an easy fix?

If not, I will have a look at what is required and submit a patch for square matrices whose sizes are not a multiple of 32.
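For reference, here is a rough sketch of the extra allocation implied by the snippet above, assuming (as it appears from cgetrf_gpu.cpp) that maxm and maxn are m and n rounded up to the next multiple of 32:
Code: Select all
#include <cstdio>

// Rough estimate of the dAT work buffer allocated on the non-in-place path.
// Assumption: maxm and maxn are m and n rounded up to the next multiple of 32.
static long long extra_getrf_gpu_bytes(int m, int n, int elem_size)
{
    int maxm = ((m + 31) / 32) * 32;   // rows padded to a multiple of 32
    int maxn = ((n + 31) / 32) * 32;   // columns padded to a multiple of 32
    return 1LL * maxm * maxn * elem_size;
}

int main()
{
    // A 5000 x 5000 single-complex matrix (8 bytes per element): the extra
    // buffer is about as large as the matrix itself, i.e. memory use doubles.
    std::printf("extra MB: %lld\n", extra_getrf_gpu_bytes(5000, 5000, 8) / (1024*1024));
    return 0;
}

For a large matrix this work buffer is roughly the size of the matrix itself, which matches the factor-of-2 increase I am seeing.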

Thanks, and keep up the good work.

Re: Increased GPU memory requirement with MAGMA 1.0

PostPosted: Mon Feb 14, 2011 5:32 pm
by Stan Tomov
Yes, this is done for performance reasons. The algorithm is faster if we move the matrix to row-major data format. Ideally we want to do this in place, but only the case
Code: Select all
((m == n) && (m % 32 == 0) && (ldda % 32 == 0))

will guarantee that we do not overwrite user data. If your matrices are such that no data would be lost in an in-place transposition, you can modify the code to always go through the in-place path so that you don't use the extra memory. You can also see in magma_cgetrf how we relax the assumption for the memory allocation.
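For example, something along the following lines (an untested sketch; it assumes the transpose kernels handle sizes that are not a multiple of 32, and still requires a square matrix with ldda a multiple of 32):
Code: Select all
/* Untested sketch: relax the test so the in-place transpose is always used
   when it cannot overwrite user data. The (m % 32 == 0) requirement on the
   matrix size itself is dropped; we assume the caller guarantees a square
   matrix stored with ldda a multiple of 32. */
if ((m == n) && (ldda % 32 == 0))
    magmablas_cinplace_transpose( dAT, ldda, lddat );
else {
    if ( CUBLAS_STATUS_SUCCESS != cublasAlloc(maxm*maxn, sizeof(cuFloatComplex), (void**)&dAT) ) {
        cublasFree( dAP );
        return MAGMA_ERR_CUBLASALLOC;
    }
    magmablas_ctranspose2( dAT, lddat, dA, ldda, m, n );
}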

Re: Increased GPU memory requirement with MAGMA 1.0

PostPosted: Thu Feb 24, 2011 4:51 am
by evanlezar
Stan,

Thanks for the reply. Sorry that it has taken me so long to respond.

I understand fully that the in-place transpose is done for performance reasons. I just have a few quick questions.

I have looked at the magma_cgetrf routine and see that there is no %32 check there, so as long as enough device memory has been allocated for the matrix, the in-place transpose should function as required? This assumes that the entire matrix is being factorised, and not just a submatrix.

I assume then that the transpose kernel is also robust enough to handle matrices whose sizes are not a multiple of 32?

I will play around with the code a little more and see what I can come up with.

Thanks again for your help.

Re: Increased GPU memory requirement with MAGMA 1.0

PostPosted: Fri Feb 25, 2011 9:45 am
by evanlezar
Just a note: I have implemented a set of magma_Xgetrf_gpu_square routines, which work much like the standard magma_Xgetrf_gpu routines but do not require that the matrix sizes be a multiple of 32. What is required is that the specified leading dimension be a multiple of 32 and that the matrix be square.
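For illustration, intended usage is as follows (a sketch only; the argument list is assumed to mirror that of magma_cgetrf_gpu):
Code: Select all
// Sketch only: the signature is assumed to mirror magma_cgetrf_gpu.
// Preconditions: the matrix is square and ldda is a multiple of 32.
magma_int_t m    = 1000;                  // size is NOT a multiple of 32
magma_int_t ldda = ((m + 31)/32)*32;      // leading dimension padded to 32
magma_int_t info;
magma_int_t *ipiv = new magma_int_t[m];   // pivot indices on the host
cuFloatComplex *hA = new cuFloatComplex[m*m];  // host matrix, column-major
cuFloatComplex *dA;                       // device matrix, ldda x m

// ... fill hA ...
cublasAlloc( ldda*m, sizeof(cuFloatComplex), (void**)&dA );
cublasSetMatrix( m, m, sizeof(cuFloatComplex), hA, m, dA, ldda );

// factorise on the device with no extra transpose buffer allocated
magma_cgetrf_gpu_square( m, m, dA, ldda, ipiv, &info );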

Is anyone else interested in these routines? What would be the best way of submitting a patch to MAGMA?

Thanks.

Re: Increased GPU memory requirement with MAGMA 1.0

PostPosted: Fri Feb 25, 2011 11:10 am
by fletchjp
I want to comment on a few points that are relevant to choosing the layout for best performance.

I have an existing program in FORTRAN which calculates the matrix I want to factorise one row at a time. I have already decided to store these rows as columns, for efficiency of memory access when doing the BLAS operations that generate each row.

I can then use DGETRF to do the factorisation, and tell DGETRS that the matrix is transposed to get the answers.
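By way of a sketch, this is the pattern I have in mind, written against the Fortran LAPACK interface from C++ for concreteness (assumptions: the usual underscore-suffixed Fortran symbol names, and a single right-hand side):
Code: Select all
// Sketch of the DGETRF/DGETRS pattern described above.
extern "C" {
    void dgetrf_(int *m, int *n, double *A, int *lda, int *ipiv, int *info);
    void dgetrs_(const char *trans, int *n, int *nrhs, double *A, int *lda,
                 int *ipiv, double *B, int *ldb, int *info);
}

// A holds my matrix row by row, i.e. it is the column-major storage of A^T.
void factor_and_solve(int n, double *A, double *b)
{
    int *ipiv = new int[n];
    int info, nrhs = 1;

    dgetrf_(&n, &n, A, &n, ipiv, &info);   // factorise the stored A^T
    dgetrs_("T", &n, &nrhs, A, &n, ipiv,   // TRANS = 'T': solve A*x = b
            b, &n, &info);                 // using the factors of A^T
    delete[] ipiv;
}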

This thread suggests that DGETRF needs to transpose as well. As my matrix is in row-major format anyway, could I tell it that and save it the bother?

Do I make sense?

John