## cblas_gemv / cblas_gemm much faster with vector of 0s

### cblas_gemv / cblas_gemm much faster with vector of 0s

I saw a question on Stack Overflow:

http://stackoverflow.com/questions/3804 ... rix-values

It shows that in a vector-matrix multiplication, gemv/gemm runs much faster when the vector contains a lot of 0s, and the result is still correct.
So far I have only been able to reproduce this with the libblas-dev package on Ubuntu 14.04.
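
A minimal sketch of the kind of timing comparison involved (the sizes, matrix contents, and timing harness here are illustrative, not the exact code from the question):

```c
/* Times cblas_dgemv with a dense vector versus a mostly-zero vector.
 * Build with e.g.: gcc repro.c -o repro -lblas
 * The matrix contents do not matter for the timing question. */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double time_gemv(const double *A, const double *x, double *y, int n) {
    clock_t t0 = clock();
    /* y = A^T * x, i.e. the row-vector-times-matrix product */
    cblas_dgemv(CblasRowMajor, CblasTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    const int n = 4000;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (size_t i = 0; i < (size_t)n * n; ++i) A[i] = 1.0;
    for (int i = 0; i < n; ++i) x[i] = 1.0;                        /* dense vector */
    printf("dense:  %.3f s\n", time_gemv(A, x, y, n));
    for (int i = 0; i < n; ++i) x[i] = (i % 100 == 0) ? 1.0 : 0.0; /* mostly zeros */
    printf("sparse: %.3f s\n", time_gemv(A, x, y, n));
    free(A); free(x); free(y);
    return 0;
}
```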

Could you help explain why?
kangshiyin

Posts: 2
Joined: Mon Jun 27, 2016 1:00 pm

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Hi Usman,

I have no idea. You multiply a 1-by-m vector A by the m-by-m identity B to compute C = AB. (So you get C = A. Fine.)
Then, when A is a sparse vector, you observe that the code runs significantly faster than when A is a dense vector.
OK. No idea why on my end.

Cheers,
Julien.
Julien Langou

Posts: 828
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Jim Demmel says:
The reference BLAS (which may or may not be how your BLAS is implemented)
sometimes check for zero entries in the inputs to avoid unnecessary arithmetic
(look at sger.f). I could imagine that someone optimizing the BLAS might
do this in other implementations as well. For example, it costs only
O(n^2) to count the number of zeros in an n-by-n matrix, which is cheap
compared to the O(n^3) cost of matrix multiplication, and if the number of
zeros is large enough, they might use a "sparse matrix multiply" algorithm.
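
For illustration, here is roughly what that zero test looks like, transcribed into C from the reference column-major gemv (the routine name and unit strides are mine):

```c
/* Mirrors the zero tests in the reference BLAS (e.g. sger.f, or the
 * non-transposed branch of the reference gemv): if an entry of x is
 * zero, the whole column pass over A is skipped.
 * Column-major storage, unit strides; computes y = A*x. */
void gemv_skip_zeros(int m, int n, const double *A, int lda,
                     const double *x, double *y) {
    for (int i = 0; i < m; ++i) y[i] = 0.0;
    for (int j = 0; j < n; ++j) {
        if (x[j] != 0.0) {            /* the guard: skip zero entries of x */
            double temp = x[j];
            for (int i = 0; i < m; ++i)
                y[i] += temp * A[i + (size_t)j * lda];
        }
    }
}
```

With a mostly-zero x, almost all of the inner loops (and the corresponding memory traffic over A) are skipped, which accounts for the observed speedup.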

Interestingly, this means that the BLAS do not propagate exceptions
consistently: e.g., some might multiply 0*inf and get a NaN, and others
may skip it.
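
A two-line check makes the point, assuming IEEE 754 arithmetic:

```c
/* 0 * inf is NaN under IEEE 754, so a kernel that skips zero entries
 * never generates that NaN, while an unguarded kernel does. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = INFINITY, x = 0.0;
    printf("unguarded: 0 * inf = %f\n", x * a);                  /* nan */
    printf("guarded:           = %f\n", x != 0.0 ? x * a : 0.0); /* 0.0 */
    return 0;
}
```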
Julien Langou

Posts: 828
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Thank you for the reply. An answer to the original question confirms that such an optimization exists in that particular BLAS version.
kangshiyin

Posts: 2
Joined: Mon Jun 27, 2016 1:00 pm

### Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Different techniques apply when you optimize gemv and gemm:
* For the matrix-matrix product you use blocked algorithms, with block sizes that depend on the cache sizes.
* For the matrix-vector product you use so-called fused Level 1 operations (e.g. fused dot products or fused axpys); a sketch of the fused-axpy idea follows below.
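
Here is a minimal sketch of the fused-axpy idea, assuming column-major storage and unit strides (the routine name and the fusion width of two columns are illustrative):

```c
/* Fused-axpy gemv: processing two columns of A per pass halves the
 * number of times y is read and written, which is the point of fusing
 * Level 1 kernels. Accumulates into y; the caller initializes y.
 * Column-major storage, unit strides. */
void gemv_fused_axpy2(int m, int n, const double *A, int lda,
                      const double *x, double *y) {
    int j = 0;
    for (; j + 1 < n; j += 2) {              /* two fused axpys per pass */
        double x0 = x[j], x1 = x[j + 1];
        const double *a0 = A + (size_t)j * lda;
        const double *a1 = A + (size_t)(j + 1) * lda;
        for (int i = 0; i < m; ++i)
            y[i] += x0 * a0[i] + x1 * a1[i]; /* y touched once, not twice */
    }
    for (; j < n; ++j)                       /* remainder column, if n is odd */
        for (int i = 0; i < m; ++i)
            y[i] += x[j] * A[i + (size_t)j * lda];
}
```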
ChrisBragg

Posts: 2
Joined: Wed Jul 13, 2016 5:05 am
