Yes, the kernel is memory bound.
An n x n sgemv reads n^2 * sizeof(float) = 4 n^2 bytes of matrix data (the vectors add only O(n) traffic) and performs 2 n^2 flops, i.e. only 0.5 flops per byte.
This means that if, for example, the memory bandwidth is 140 GB/s (as on the GTX280), the
theoretical peak for sgemv (due to the memory speed limitation) is 140 * 0.5 = 70 GFlop/s
(assuming we do the computations for "free"). The sgemv achieves up to 66 GFlop/s on the
GTX280, i.e. about 94% of that bound, which is very good.
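
To make the arithmetic easy to redo for other cards, here is a minimal sketch of the same estimate in Python. The function name and its parameters are made up for illustration; the only inputs are the numbers quoted above (140 GB/s, 4-byte floats, 66 GFlop/s measured), and it assumes the matrix is read exactly once and the vector traffic is negligible.

```python
def sgemv_bandwidth_bound_gflops(bandwidth_gb_s, bytes_per_element=4):
    """Upper bound on sgemv throughput when limited only by memory bandwidth.

    Assumes each matrix element is read once and contributes 2 flops
    (one multiply, one add); vector traffic is ignored.
    """
    flops_per_byte = 2.0 / bytes_per_element  # 0.5 for single precision
    return bandwidth_gb_s * flops_per_byte


# Numbers quoted in the post for the GTX280.
bound = sgemv_bandwidth_bound_gflops(140.0)
measured = 66.0
efficiency = measured / bound

print(f"bandwidth bound: {bound:.1f} GFlop/s")
print(f"fraction of bound achieved: {efficiency:.0%}")
```

Plugging in another card's bandwidth gives the corresponding ceiling. For double precision (dgemv), passing bytes_per_element=8 halves the flops-per-byte ratio and therefore the bound.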