### cblas_gemv / cblas_gemm much faster with vector of 0s

Posted: Mon Jun 27, 2016 1:11 pm
I saw a question on Stack Overflow:

http://stackoverflow.com/questions/3804 ... rix-values

It shows that in a vector-matrix multiplication, gemv/gemm runs much faster when the vector contains many zeros, and the result is still correct.
So far I can only reproduce this with the libblas-dev package on Ubuntu 14.04.

Could you help explain why?
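For reference, the effect can be reproduced with a sketch along these lines (numpy is used here only as a convenient front end to whatever BLAS it is linked against; whether the speedup appears depends on that BLAS build, so the timings are illustrative and only correctness is asserted):

```python
import time
import numpy as np

m = 2000
rng = np.random.default_rng(0)

A_dense = rng.random((1, m))          # dense 1-by-m vector
A_sparse = A_dense.copy()
A_sparse[0, : int(0.95 * m)] = 0.0    # same vector, but mostly zeros
B = np.eye(m)                         # m-by-m identity

def timed_dot(A, B, reps=20):
    t0 = time.perf_counter()
    for _ in range(reps):
        C = A @ B                     # dispatches to the linked BLAS
    return C, time.perf_counter() - t0

C_dense, t_dense = timed_dot(A_dense, B)
C_sparse, t_sparse = timed_dot(A_sparse, B)

# The result is exact either way: C = A * I = A.
assert np.array_equal(C_dense, A_dense)
assert np.array_equal(C_sparse, A_sparse)
print(f"dense: {t_dense:.4f}s, sparse: {t_sparse:.4f}s")
```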

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Posted: Tue Jun 28, 2016 1:29 am
Hi Usman,

I have no idea. You multiply a 1-by-m vector A by the m-by-m identity B and you want to compute C = AB. (So you get C = A. Fine.)
When A is a sparse vector, you observe that the code runs significantly faster than when A is a dense vector.
OK. No idea why on my end.

Cheers,
Julien.

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Posted: Tue Jun 28, 2016 1:44 am
Jim Demmel says:
The reference BLAS (which may or may not be how your BLAS is implemented)
sometimes checks for zero entries in the inputs to avoid unnecessary arithmetic
(look at sger.f). I could imagine that someone optimizing the BLAS might
do this for other BLAS implementations as well. For example, it only costs
O(n^2) to count the number of zeros in an n-by-n matrix, which is cheap
compared to the O(n^3) cost of matrix multiplication, and if the number of
zeros is large enough, they might use a "sparse matrix multiply" algorithm.
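The guard Demmel points at in sger.f can be sketched in a few lines; this is my pure-Python transcription of the pattern, not the actual Fortran:

```python
import numpy as np

def ger_with_zero_check(alpha, x, y, A):
    """Rank-1 update A += alpha * outer(x, y), skipping zero columns.

    Mirrors the guard in the reference-BLAS sger.f: when y[j] is zero,
    the entire column update is skipped, saving a full pass over x.
    """
    m, n = A.shape
    for j in range(n):
        if y[j] != 0.0:               # the zero check from sger.f
            temp = alpha * y[j]
            for i in range(m):
                A[i, j] += x[i] * temp
    return A

rng = np.random.default_rng(1)
x = rng.random(4)
y = np.array([0.0, 2.0, 0.0, -1.0])  # half the columns get skipped
A = np.zeros((4, 4))
ger_with_zero_check(1.0, x, y, A)

# Same answer as the unguarded update, with less work done.
assert np.allclose(A, np.outer(x, y))
```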

Interestingly, this means that the BLAS do not propagate exceptions
consistently: e.g. some might multiply 0*inf and get a NaN, and others
may skip it.
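That inconsistency is easy to demonstrate: IEEE 754 defines 0 * inf as NaN, so a zero-skipping BLAS gives a different answer than a naive one whenever an inf sits opposite a zero. A small sketch (plain Python arithmetic stands in for the two BLAS behaviors):

```python
import math

x = [0.0, 1.0]
col = [math.inf, 2.0]   # inf sits opposite the zero entry of x

# A naive dot product multiplies everything: 0.0 * inf -> NaN,
# and NaN poisons the whole sum.
naive = sum(xi * ci for xi, ci in zip(x, col))

# A zero-skipping dot product never touches the inf.
skipping = sum(xi * ci for xi, ci in zip(x, col) if xi != 0.0)

assert math.isnan(naive)   # the exception propagated
assert skipping == 2.0     # the exception was silently skipped
```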

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Posted: Tue Jun 28, 2016 2:54 am
Thank you for the reply. An answer to the original question confirms that such an optimization exists in that particular version of BLAS.

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Posted: Thu Jul 14, 2016 1:23 am
When you optimize gemv and gemm, different techniques apply:
* For the matrix-matrix operation (gemm), you use blocked algorithms. Block sizes depend on cache sizes.
* For the matrix-vector product (gemv), you use so-called fused Level 1 operations (e.g. fused dot products or fused axpy).