I don't think (correct me if I'm wrong) that complex support is in MAGMA yet. Presumably it will only increase the number of flops/s, since complex arithmetic is more compute-intensive?
We have the MAGMA routines for complex arithmetic, but not all of the complex BLAS they need. NVIDIA is working on completing their complex BLAS, and when that is done we will also release the MAGMA routines in complex arithmetic. As you mention, complex is more compute-intensive, but its effect cannot be seen yet in the current CUBLAS implementation. For example, on a GTX 280, sgemm runs at up to ~375 GFlop/s and cgemm at up to ~292 GFlop/s.
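To make the "more compute-intensive" point concrete, here is a minimal sketch of the conventional flop counts for a matrix multiply in real and complex single precision. It assumes the usual LAPACK-style convention (a complex multiply counts as 6 real flops, a complex add as 2); whether the GFlop/s figures quoted above were computed with exactly this convention is an assumption, and the helper names and the problem size are made up for illustration.

```python
def gemm_flops(m, n, k, complex_arith=False):
    """Conventional flop count for C = A*B, with A m-by-k and B k-by-n."""
    # Real: one multiply plus one add per inner-product term -> 2 flops.
    # Complex: 6 flops for the multiply plus 2 for the add -> 8 flops.
    per_term = 8 if complex_arith else 2
    return per_term * m * n * k


def gemm_min_bytes(m, n, k, complex_arith=False):
    """Lower bound on traffic: read A and B once, write C once (single precision)."""
    elem = 8 if complex_arith else 4  # bytes per single-precision element
    return elem * (m * k + k * n + m * n)


def gflops(flops, seconds):
    """Convert a flop count and a measured time into a GFlop/s rate."""
    return flops / seconds / 1e9


n = 4096  # hypothetical square problem size, not taken from the post
for name, cplx in (("sgemm", False), ("cgemm", True)):
    f = gemm_flops(n, n, n, cplx)
    b = gemm_min_bytes(n, n, n, cplx)
    print(f"{name}: {f:.3e} flops, {f / b:.1f} flops per byte moved")
```

Under this convention cgemm does four times the flops of sgemm on twice the data, so it has twice the arithmetic intensity, which is why one might expect it to reach a higher flop rate once the implementation is tuned.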
CUBLAS is improving the triangular solves, but the last time I checked (version 2.1) cublasStrsm / cublasDtrsm were reaching only about 0.24 / 0.09 GFlop/s on a GTX 280 for matrices of size 14,000 / 7,000. We mentioned in an article (page 10) that this can be improved to about 14 / 6.7 GFlop/s. We will include certain BLAS routines (using techniques like those in the article and in these MAGMA roadmap slides) in the next MAGMA release, by November 14.
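As a rough illustration of what those triangular-solve rates mean in practice, the sketch below turns the quoted numbers into speedup factors and an estimated solve time. It assumes the conventional real flop count of m*m*n for a left-sided solve with an m-by-m triangular matrix and n right-hand sides; the number of right-hand sides is not stated above, so the value used here is purely hypothetical, as are the helper names.

```python
def trsm_flops(m, n):
    """Conventional real flop count for op(A) X = B, A m-by-m triangular, B m-by-n."""
    return m * m * n


# Rates quoted above, in GFlop/s on a GTX 280.
rates = {
    "strsm": {"cublas_2_1": 0.24, "improved": 14.0},
    "dtrsm": {"cublas_2_1": 0.09, "improved": 6.7},
}


def speedup(entry):
    """Ratio of the improved rate to the CUBLAS 2.1 rate."""
    return entry["improved"] / entry["cublas_2_1"]


def solve_seconds(m, n, gflops_rate):
    """Estimated time for one solve at a sustained rate given in GFlop/s."""
    return trsm_flops(m, n) / (gflops_rate * 1e9)


for name, entry in rates.items():
    print(f"{name}: about {speedup(entry):.0f}x faster")

# Hypothetical case: 14,000 x 14,000 triangular matrix, 1,000 right-hand sides.
m, nrhs = 14000, 1000
for label, rate in rates["strsm"].items():
    print(f"strsm at {rate} GFlop/s ({label}): {solve_seconds(m, nrhs, rate):.1f} s")
```

The quoted figures work out to roughly a 58x improvement in single precision and roughly 74x in double precision.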