Hello Tim,
Well, I think you are stucked. This is a BLAS problem and has been
reported several times. There is no BLAS routine to exploit the symmetry
in the output H when H is computed as
H < M^H * h * M
with h Hermitian.
Assume M is nbym and h is nbyn. Then your optimal number of FLOPS is
to compute H is 2mn^2. If you also want to compute Q then you end up
with 3mn^2 operations.
You can not get neither one nor the other with BLAS right now. BLAS
enables you to get 4mn^2.
This is a known problem and bothers lots of people.
(e.g. The folks working on LDL^T for example)
Here are two tricks.
[1] There are two conditions for the first trick.
If
(a) n << m
(b) h is Hermitian positive definite
then
(i) Cholesky of h (POTRF = n^3/3): r < chol (h), you get r such
that h = r^H*r and r upper triangular,
(ii) apply r to M from the left Q < r*M, mn^2
(iii) apply SYRK: H < Q^H Q > mn^2
(iv) apply Q < r^H * Q (this is mn^2)
total: 3mn^2 + n^3/3 so 3mn^2 id n<<m.
Well ok condition (a) and (b) might be really too restrictive but if you
have both conditions I encourage you to try.
In LDL^T, h is indefinite so we are stucked.
[2] If you really want your FLOPS back when you perform
H < M^H * Q
without losing too much efficiency then you will need to break the
operations by yourself in blocks of size NB. (NB to be tuned). For each
NB, you compute only a column (or row) of H. and you adjust the length to
take into account the symmetry. This is not perfect FLOPS saving but
roughly you get where you want. If NB = 1, you have perfect FLOPS saving
(mn^2 for this operation) but you have bad performance. If NB = N, well
you have no FLOPS saving (back to 2mn^2). Using a hand made block
algorithm with NB = 64 (e.g.) enables to get the sought FLOPS saving and
efficiency.
This is the way we (LAPACK) do in for example LDL^T.
Best way would have had to have the BLAS routine for this operation, sure.
Best wishes,
Julien.
On Thu, 28 Jun 2007, Professor Boykin wrote:
Dear LAPACK folks,
We need to compute a matrix product of the form (hc=hermitian conjugate):
H = hc(M)*Q, Q = h*M
where H and h are hermitian, Q is already computed (it's needed for
another stage). As you can see we really need only the full upper
(or lower) triangle of the product matrix, H. Looking through the
BLAS routines we have not found any special routine which would be faster
than the general matrix multiply of hc(M) and Q. Is there some multiply
routine where we could specify only compute the full upper (or lower)
triangle of the product (so as to speed things up by a bit less than a
factor of 2)?
Thanks,
Tim Boykin
*************************************************************************
Timothy Boykin, Ph.D.
Professor (*)
Department of Electrical & Computer Engineering
The University of Alabama in Hunstville
Huntsville, Alabama 35899
USA
(*) effective Aug. 2007; presently "Associate Professor"...
email: boykin@Domain.Removed
Voice: (256) 8246269
FAX: (256) 8246803 (9AM5PM Central Time)
(256) 8243482 (5PM9AM Central Time)
*************************************************************************
"You know, Foley, I have only one eye. I have a right to be blind
sometimes. I really do not see the signal."
ViceAdmiral Lord Horatio Nelson (with telescope held to his bad eye) to
Captain Foley (on board Nelson's flagship HMS Elephant at the Battle of
Copenhagen, 2 April 1801) explaining why he is ignoring Admiral Sir Hyde
Parker's signal to discontinue action. Parker assumed that Nelson would
disregard the order if he (Nelson) could continue the attack and indeed
supported Nelson's squadron when he saw that Nelson was pressing forward.
*************************************************************************
_______________________________________________
Lapack mailing list
Lapack@Domain.Removed
http://lists.cs.utk.edu/listinfo/lapack
