Increased GPU memory requirement with MAGMA 1.0

Open discussion for MAGMA

Postby evanlezar » Mon Feb 07, 2011 10:22 am

Hi,

I have recently moved from MAGMA 0.2 to 1.0 RC3, and have noticed that in certain cases the GPU LU decomposition (device interface), such as that provided by magma_cgetrf_gpu(), uses about twice as much device memory as before.

Upon investigating cgetrf_gpu.cpp, I found the following (starting at line 138).
Code: Select all
if ((m == n) && (m % 32 == 0) && (ldda % 32 == 0))
    magmablas_cinplace_transpose( dAT, ldda, lddat );
else {
    if ( CUBLAS_STATUS_SUCCESS != cublasAlloc(maxm*maxn, sizeof(cuFloatComplex), (void**)&dAT) ) {
        cublasFree( dAP );
        return MAGMA_ERR_CUBLASALLOC;
    }
    magmablas_ctranspose2( dAT, lddat, dA, ldda, m, n );
}


It seems that if the matrix is not square, or its dimensions are not a multiple of 32, then additional memory equal to the matrix size (padded to a multiple of 32) is allocated on the device. I am sure that this padding helps performance, but it does make the interaction with my existing code a little difficult. Has anyone else come across this, and is there an easy fix?

If not, I will have a look at what is required and submit a patch for square matrices whose dimensions are not a multiple of 32.

Thanks, and keep up the good work.
evanlezar
 
Posts: 33
Joined: Tue Aug 25, 2009 7:20 pm
Location: Stellenbosch, South Africa

Re: Increased GPU memory requirement with MAGMA 1.0

Postby Stan Tomov » Mon Feb 14, 2011 5:32 pm

Yes, this is done for performance reasons. The algorithm is faster if we move the matrix to row-major data format. Ideally we want to do this in place, but only the case
Code: Select all
((m == n) && (m % 32 == 0) && (ldda%32 == 0))

will guarantee that we do not overwrite user data. If your matrices are such that no data would be lost in an in-place transposition, you can modify the code to always take the in-place path so that you don't use extra memory. You can also see in magma_cgetrf how we relax the assumption for the memory allocation.
Stan Tomov
 
Posts: 251
Joined: Fri Aug 21, 2009 10:39 pm

Re: Increased GPU memory requirement with MAGMA 1.0

Postby evanlezar » Thu Feb 24, 2011 4:51 am

Stan,

Thanks for the reply. Sorry that it has taken me so long to respond.

I understand fully that the in-place transpose is done for performance reasons. I just have a few quick questions.

I have looked at the magma_cgetrf routine and see that there is no %32 check there, so as long as enough device memory has been allocated for the matrix, the in-place transpose should function as required? This assumes that the entire matrix is being factorised, and not just a submatrix.

I also assume that the transpose kernel is robust enough to handle matrices whose dimensions are not a multiple of 32?

I will play around with the code a little more and see what I can come up with.

Thanks again for your help.
evanlezar

Re: Increased GPU memory requirement with MAGMA 1.0

Postby evanlezar » Fri Feb 25, 2011 9:45 am

Just a note: I have implemented a set of magma_Xgetrf_gpu_square routines, which work much like the standard magma_Xgetrf_gpu routines but do not require that the matrix sizes be a multiple of 32. What is required is that the specified leading dimension be a multiple of 32 and that the matrix be square.

Is anyone else interested in these routines? What would be the best way to submit a patch to MAGMA?

Thanks.
evanlezar

Re: Increased GPU memory requirement with MAGMA 1.0

Postby fletchjp » Fri Feb 25, 2011 11:10 am

I want to raise a few points that bear on the choice for best performance.

I have an existing FORTRAN program that calculates the matrix I want to factorise one row at a time. I have already decided to store these rows as columns, for efficiency of memory access when doing the BLAS operations that generate each row.

I can then use DGETRF to do the factorisation, and then tell DGETRS that the matrix is transposed to get the answers.

This thread suggests that DGETRF needs to transpose as well. Since my matrix is effectively in row-major format already, could I tell it that and save it the bother?

Do I make sense?

John
fletchjp
 
Posts: 175
Joined: Mon Dec 27, 2010 7:29 pm

