banded matrix-matrix multiply

Open discussion regarding features, bugs, issues, vendors, etc.

banded matrix-matrix multiply

Dear Forum:

I am investigating the application of finite difference methods to simulating optical propagation in the atmosphere. The fundamental step is calculation of a 2D Laplacian, and this can be cast as a banded matrix-matrix multiply. Here the banded matrix contains the coefficients of the finite difference scheme for the second derivative, and is real, symmetric and 13-diagonal. The matrix is the spatial field of the wavefront, and is full and complex. I have been unable to find a BLAS/LAPACK routine for such a banded matrix-matrix multiply and so have segmented the problem into a banded matrix-vector multiply, and loop over columns or rows of the field matrix to get my two derivatives. For this I am using CHBMV, even though my banded matrix is real. I have also coded the problem writing out all the multiplies and adds of the finite difference scheme. My BLAS/LAPACK routines come from AMD's ACML library, and are, they claim, optimized for the Athlon64 architecture on which the problem is running. I compile both codes using the PathScale f90 compiler set to aggressive optimization (-O3), and to my surprise, my hand coded program beats the CHBMV version about two to one. I am running on a 2.2 GHz machine using SuSE 9.3. The execution rate of the faster code is 500-600 Mflops, although this number is based on my own hand operations count because I have not been able to get either PAPI or AMD's CodeAnalyst working on my machine.

My questions:

1. Is there a banded matrix-matrix multiply routine out there? Real? Complex? Mixed-type? I am assuming GEMM (and related) is faster than GEMV (and related) inside a loop.

2. Am I doing wasted computation with CHBMV? A real-complex multiply is two real multiplies, but a complex-complex multiply uses two multiplies and an add for each component of the complex number.

3. Any rules of thumb for what the maximum flop rate should be for a matrix-matrix multiply as a percentage of machine speed? 50%, 80%? I seem to recall reading the slides from a presentation that Prof. Dongarra posted on the web about memory hierarchy effects on execution, and came away with the figure of 70-80%.

4. Is it possible to allocate memory for matrices and vectors in the calling program in such a way that when passed to BLAS/LAPACK routines those routines are effectively crippled?

All this obsession with speed comes from comparisons with FFT methods on the same problem. For periodic spatial domains perpendicular to the wave propagation direction, an FFT calculation of the Laplacian runs at 1.1-1.2 Gflop on my 2.2 GHz machine. (I am using op counts from Brigham's book on FFT for this number). I would like to equal or best this execution rate for my finite difference code.

Thanks for any light anyone can shed on these questions.
lrkeefe

Posts: 4
Joined: Mon Sep 26, 2005 4:52 pm
Location: 6655 Palomino Circle, West Linn, OR 97068-2505

Laurence, your banded matrix is a big gap between some of the bands, so the lapack routine (which is for a "dense" band) will be very inefficient.

How did you code your matrix-vector mulitply? It might be worth looking at the templates book.

I can think of various ways of coding your particular algorithm, but the short answer is that your algorithm is not in blas/lapack (afaik). The sparse-times-dense product is part of the proposed "sparse blas", but I don't know if there are industrial-grade implementations of that.

If I had to code your algorithm efficiently, I'd divide the Laplacian in small blocks, and mulitply them against the dense matrix. That way you'd use Blas3 operations, which are very fast.

Victor.
Victor Eijkhout

Posts: 4
Joined: Thu Dec 09, 2004 12:19 am
Location: Austin, TX

Victor:

Thanks for the advice, and the reference to the templates book. I will look at that.

I am confused by your statement about there being a big gap between some of the bands. The finite-difference coefficient matrix in this case has no gaps between the diagonals, they are the six diagonals above and below the main diagonal.

The matrix-vector product is not so much coded as adpated to the input requirements of CHBMV. The preface to that routine contains code for packing the storage of the banded matrix and I do that. Otherwise, I simply create a temporary vector from columns or rows of the field matrix and call the routine.

At this point I would be interested in seeing "research" grade algorithms for banded -matrix matrix multiply. Where is this development work occurring?
lrkeefe

Posts: 4
Joined: Mon Sep 26, 2005 4:52 pm
Location: 6655 Palomino Circle, West Linn, OR 97068-2505

lrkeefe wrote: I am confused by your statement about there being a big gap between some of the bands. The finite-difference coefficient matrix in this case has no gaps between the diagonals, they are the six diagonals above and below the main diagonal.

Are you confusing storage and the actual matrix? If you have n points per line in your domain, then the diagonals above the main are at distance 1,2,3,n-1,n,n+1,2n, if I guess correctly the discretization stencil you're using. You may be storing them contiguously, but if you'd make a picture of the matrix, there would be big gaps.

Or maybe I'm misunderstanding you completely.

At this point I would be interested in seeing "research" grade algorithms for banded -matrix matrix multiply. Where is this development work occurring?

I don't think there is much development. Mail Iain Duff (I.Duff@rl.ac.uk)who was involved in sparse blas, and ask him. Even then, he probably only knows about sparse-times-dense.

Victor.
Victor Eijkhout

Posts: 4
Joined: Thu Dec 09, 2004 12:19 am
Location: Austin, TX

Victor:

Thanks for Iain Duff's address, I will contact him.
With regards to the diagonals, this is a central differencing scheme, and all the coefficient diagonals are concentrated around the main diagonal, there are none in the upper right or lower left corners. If you read the matrix elements by rows then, yes, after crossing through the coefficient diagonals there are a whole lot of zeros at the end of that line and at the beginning of the next line before you run into the coefficient diagonals again. Is this what you mean? I have no idea how these diagonals are stored after being packed prior to the CHBMV call.
lrkeefe

Posts: 4
Joined: Mon Sep 26, 2005 4:52 pm
Location: 6655 Palomino Circle, West Linn, OR 97068-2505