I am using the MAGMA BLAS function sgemv in my application, running it on a square 4096x4096 matrix multiplied by a 4096x1 vector. The occupancy is only 50%, but the bandwidth is pretty good at 80-85 GB/s on my Tesla card (C1060).
I then increased the block size to 128 and recompiled my code against this version. The occupancy increased to 100%, but performance did not change: both the bandwidth and the execution time stayed the same.
Does this mean that my kernel is bandwidth bound?
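For reference, here is a rough sketch of how an effective bandwidth figure for this problem size can be estimated and compared against the card's peak. It assumes single precision, that the matrix and vector are each read once and the result written once, and a theoretical peak of about 102 GB/s for the C1060. The function name and the 0.8 ms kernel time are made up for illustration and are not my measured values.

```python
def sgemv_effective_bandwidth_gbs(m, n, seconds, bytes_per_element=4):
    """Estimate effective bandwidth (GB/s) for y = A*x, with A of size m x n."""
    # Minimum memory traffic: read A (m*n elements), read x (n), write y (m).
    # If y is also read (beta != 0), add another m elements.
    bytes_moved = bytes_per_element * (m * n + n + m)
    return bytes_moved / seconds / 1e9


# Theoretical peak memory bandwidth of the Tesla C1060 in GB/s (spec-sheet value).
C1060_PEAK_GBS = 102.0

# Hypothetical kernel time in seconds, chosen only for illustration.
elapsed = 0.8e-3

bw = sgemv_effective_bandwidth_gbs(4096, 4096, elapsed)
print(f"effective bandwidth: {bw:.1f} GB/s")
print(f"fraction of peak:    {bw / C1060_PEAK_GBS:.0%}")
```

If the effective bandwidth is already a large fraction of the peak, higher occupancy has little room to help, since the kernel is mostly waiting on memory traffic rather than on arithmetic.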
Please don't mind if my question seems like a rookie one. This is the first time I am doing this kind of analysis.
Thanks and regards