by **Julien Langou** » Tue Sep 01, 2015 9:26 pm

Quick explanation of why there is no subroutine in the BLAS so that x <- A*x

for a dense matrix A.

Observation: We cannot do the operation x <- A*x in place. We have to copy x in a temporaray

vector y and then perform x <- A*y. This is because, once we have computed the output value x(i) (for any i), we cannot reuse the input value x(i)

to compute the other output x(j) (for j not i). So we cannot do x <- A*x in place.

So, this leaves us to two choices to design a BLAS subroutine. (1) Either stick

with the idea that we need to create a subroutine such that x <- A*x. This

means that the BLAS will have to have a routine that allocates a temporary

workspace for y, copies y <- x, then calls the existing and useful x <- A*y,

then frees y. (2) Or we focus on performing x <- A*y and leave the tasks of

allocation, copy, free to user. The choice was (2) over (1) since it is easy

enough to get (1) from (2). The design choice for BLAS was to only implement

(2) and leave (1) to the user.

Comment #1: If A is triangular, then it is possible to implement x <- A*x in

place. If A is upper triangular, you first compute x(1) using x(1), x(2) . . .

x(n). And this is OK since x(1) is not needed to compute x(2) . . . x(n). Then

you compute x(2), then x(3), etc. This explains why the TRMV routine is by

design in-place.

Comment #2: This explanation about GEMV and TRMV holds for GEMM and TRMM. There

is no subroutine to perform A <- A*B in the BLAS. There is only GEMM of the

type A <- B*C. But there is TRMM such that A <- A*B when B is triangular or

A <- B*A when B is triangular. TRMM is in place by design.

Comment #3: If you really really are short in memory and have to implement say

A <- A*B where A is m-by-n, B is n-by-n, and n << m, you can perform the LU

factorization of B: B = P*L*U and then compute A <- (A*P)*L)*U). This way you

do not need to allocate an extra m-by-n workspace for a temporary C. You can

actually do A <- A*B in place. This is an extreme trick. I did use it a few

times in some iterative solvers where memory was scarce. For example, n is of

the order of 10, and m is of the order of 10^9, so the trade off is either

perform the LU factorization of 10-by-10 matrix B or allocate a 10^9-by-10

temporary array. Yes you may worry a little about stability with LU, so you can

use a more stable factorization for B like the QR factorization. That would

work similarly.

Julien