Dear Forum:
I am investigating the application of finite difference methods to simulating optical propagation in the atmosphere. The fundamental step is calculation of a 2D Laplacian, and this can be cast as a banded matrix-matrix multiply. Here the banded matrix contains the coefficients of the finite difference scheme for the second derivative, and is real, symmetric and 13-diagonal. The matrix is the spatial field of the wavefront, and is full and complex. I have been unable to find a BLAS/LAPACK routine for such a banded matrix-matrix multiply and so have segmented the problem into a banded matrix-vector multiply, and loop over columns or rows of the field matrix to get my two derivatives. For this I am using CHBMV, even though my banded matrix is real. I have also coded the problem writing out all the multiplies and adds of the finite difference scheme. My BLAS/LAPACK routines come from AMD's ACML library, and are, they claim, optimized for the Athlon64 architecture on which the problem is running. I compile both codes using the PathScale f90 compiler set to aggressive optimization (-O3), and to my surprise, my hand coded program beats the CHBMV version about two to one. I am running on a 2.2 GHz machine using SuSE 9.3. The execution rate of the faster code is 500-600 Mflops, although this number is based on my own hand operations count because I have not been able to get either PAPI or AMD's CodeAnalyst working on my machine.
My questions:
1. Is there a banded matrix-matrix multiply routine out there? Real? Complex? Mixed-type? I am assuming GEMM (and related) is faster than GEMV (and related) inside a loop.
2. Am I doing wasted computation with CHBMV? A real-complex multiply is two real multiplies, but a complex-complex multiply uses two multiplies and an add for each component of the complex number.
3. Any rules of thumb for what the maximum flop rate should be for a matrix-matrix multiply as a percentage of machine speed? 50%, 80%? I seem to recall reading the slides from a presentation that Prof. Dongarra posted on the web about memory hierarchy effects on execution, and came away with the figure of 70-80%.
4. Is it possible to allocate memory for matrices and vectors in the calling program in such a way that when passed to BLAS/LAPACK routines those routines are effectively crippled?
All this obsession with speed comes from comparisons with FFT methods on the same problem. For periodic spatial domains perpendicular to the wave propagation direction, an FFT calculation of the Laplacian runs at 1.1-1.2 Gflop on my 2.2 GHz machine. (I am using op counts from Brigham's book on FFT for this number). I would like to equal or best this execution rate for my finite difference code.
Thanks for any light anyone can shed on these questions.

