Hello,
I have been trying to find a parallel (GPU + CUDA) tiled SGEMM for multiplying large matrices that would otherwise not fit in GPU memory.
A friend of mine said he once saw a MAGMA version of tiled SGEMM, but I can't find it.
Does such an implementation exist?
Thanks for your time...