Thanks for reporting this and for trying to figure out the reasons for these discrepancies. They are due to the GPU BLAS implementation used: in the paper we used customized BLAS kernels that are not yet in the release. The high-level algorithms, though, as described in the paper, are in the release.
Speaking specifically about the SSYTRD routine, its performance critically depends on the speed of SSYMV (as 50% of the flops are in SYMV). Theoretically, SSYMV can run at up to 142 GFlop/s on a GTX280 (memory bandwidth 142 GB/s), so if that were achievable, the SSYTRD from MAGMA 1.0 RC3 would asymptotically run at a speed above 142 GFlop/s. In reality, though, this SSYMV performance is not attainable. CUBLAS SSYMV runs at below 10 GFlop/s, and as a result MAGMA SSYTRD, when using CUBLAS SSYMV, runs at about that speed as well. The paper used an SSYMV kernel running at up to ~80 GFlop/s, so the MAGMA SSYTRD using that kernel reaches about that speed. Although this may sound impressive, there is obviously a lot of room for improvement. Indeed, shortly after the paper was submitted we developed another SSYMV that reached a little above 100 GFlop/s, and along with other optimizations SSYTRD actually got close to 120 GFlop/s.
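The bandwidth argument above can be sketched in a few lines. This is just a back-of-the-envelope model, not MAGMA code: the 142 GB/s figure and the kernel speeds come from the discussion, while the ~1 flop/byte intensity of SSYMV and the rate assumed for the non-SYMV half of SSYTRD are illustrative assumptions.

```python
def ssymv_roofline(bandwidth_gbs):
    """Bandwidth bound for SSYMV: ~2*n^2 flops while reading ~n^2/2
    matrix elements (exploiting symmetry) at 4 bytes each = 2*n^2 bytes,
    i.e. roughly 1 flop per byte moved. The bound is then just the
    memory bandwidth in GB/s expressed as GFlop/s."""
    flops_per_byte = 1.0  # assumed arithmetic intensity for SSYMV
    return bandwidth_gbs * flops_per_byte  # GFlop/s

def ssytrd_estimate(symv_gflops, rest_gflops):
    """Half of the SSYTRD flops are in SYMV, half elsewhere; combining
    the two rates time-weighted gives a harmonic mean. rest_gflops is a
    placeholder for the (much faster) non-SYMV part."""
    return 2.0 / (1.0 / symv_gflops + 1.0 / rest_gflops)

print(ssymv_roofline(142))        # SSYMV bound on a GTX280: 142 GFlop/s
print(ssytrd_estimate(10, 400))   # SSYTRD with a ~10 GFlop/s SSYMV
print(ssytrd_estimate(80, 400))   # SSYTRD with the paper's ~80 GFlop/s kernel
```

The model shows the qualitative point: as long as SSYMV is far slower than the rest, the SSYMV rate dominates the overall SSYTRD speed, so speeding up SSYMV translates almost directly into SSYTRD speedup.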
The development of BLAS consumes a lot of effort, especially with GPU architectures changing frequently. For example, on Fermi we had to redesign some BLAS algorithms. Level 2 BLAS on Fermi is also slow, as the bus bandwidth was not increased while ECC was added, further reducing the bandwidth available to users. Therefore we may even consider dropping MAGMA BLAS support from MAGMA. The CUBLAS GEMM is based on the MAGMA GEMM, so similarly, we would be happy to provide highly optimized MAGMA BLAS to NVIDIA to be incorporated and maintained in CUBLAS.