Page 1 of 1

### Inplace matrix multiplication in BLAS level 3

Posted: Sun Jan 01, 2017 7:22 pm
I have a large B matrix of size 10Million x 1000 and I want to compute the result of its multiplication with a square 1000 by 1000 matrix in place. Currently the level 3 blas provides the *trmm functions which allow us to overwrite the input matrix during multiplication i.e. B := alpha*op( A )*B , where A must be triangular.
but there is no such level 3 functionality when the input matrix is a general square matrix instead of triangular.

Now to do my inplace matrix multiplication, I can either do 10Million level 2 BLAS calls to matrix vector multiplication, or
I can do a LU decomposition of A and then two level 3 BLAS calls to trmm. I was interested in discussing if there was a third solution that I had overlooked.

### Re: Inplace matrix multiplication in BLAS level 3

Posted: Mon Jan 02, 2017 7:13 am
Hello.

Indeed, there is `no way` to do in-place matrix-matrix multiplication of the type A = A * B with A m-by-n and B n-by-n (for example) using a transformation (reordering of operations) of the standard algorithm.

If you are trying to minimize the memory requirement and use a transformation (reordering of operations) of the standard algorithm, then the best you can do is use the row-version of matrix-matrix multiplication ( for i=1:m, A(i,1:n) = A(i,1:n)*B(1:n,1:n); end; ) This requires a buffer of size n and regular copies from buffer to matrix and this can be done using GEMV. (As you wrote.)

Indeed another way to do in-place matrix-matrix multiplication is to do an LU factorization of B and then rely on the fact that TRMM can be done (and is done) in place. So a factorization of the type B=B1*B2 with B1 and B2 triangular and then A = ( A * B1 ) * B2. As far as stability, clearly we will want to use pivoting during the LU factorization and this is not a problem. We can also use a QR factorization and then rely on ORMQR (which can be done, and is done, in place) and TRMM.

Note that as far as stability goes, this kind of algorithm is normwise stable, as opposed to componentwise stable for matrix-matrix multiplication. So it is less stable than the standard algorithm. (Kind of the same `weakness` as Strassen's algorithm has for example.)

Note that B will either have to be overwritten by L and U, or we will need a buffer of size n^2 for B (in general n is not too big so not a big deal), or we can consider overwrite by L and U and reconstruct in-place in output (which sounds complicated and an overkill but possible with something like LAUUM).

I do not know many references for doing an LU factorization for enabling in-place matrix-matrix multiplication. One reference is my PhD thesis. . See Section 2.4.3 (The LU–matrix–matrix product) on page 111. Let me know if there is a better reference.

Best wishes,
Julien.