I'm using CUDA 5 with MAGMA on a K20 GPU.
I have a 1024x1024 matrix that I'm calling magma_zgesv_gpu on. I'm trying to see whether I can improve performance by using multiple streams.
I guess I have a couple of questions:
- How can I tell whether the GPU is already fully busy with this setup, so that additional streams will not help?
- I'm doing the following:
magmablasSetKernelStream(streams[0]);
magma_zsetmatrix_async(N, N, (cuDoubleComplex *)A, N, magmaWrapper.d_pA, N, streams[0]);
magma_zsetmatrix_async(N, N, (cuDoubleComplex *)B, N, magmaWrapper.d_pB, N, streams[0]);
magma_queue_sync(streams[0]);
magmaWrapper.SolveAXequB(N, true, magmaWrapper.d_pA, magmaWrapper.d_pB, magmaWrapper.d_pS);
magmablasSetKernelStream(streams[1]);
magma_zsetmatrix_async(N, N, (cuDoubleComplex *)A, N, magmaWrapper2.d_pA, N, streams[1]);
magma_zsetmatrix_async(N, N, (cuDoubleComplex *)B, N, magmaWrapper2.d_pB, N, streams[1]);
magma_queue_sync(streams[1]);
magmaWrapper2.SolveAXequB(N, true, magmaWrapper2.d_pA, magmaWrapper2.d_pB, magmaWrapper2.d_pS);
magmablasSetKernelStream(streams[2]);
magma_zsetmatrix_async(N, N, (cuDoubleComplex *)A, N, magmaWrapper3.d_pA, N, streams[2]);
magma_zsetmatrix_async(N, N, (cuDoubleComplex *)B, N, magmaWrapper3.d_pB, N, streams[2]);
magma_queue_sync(streams[2]);
magmaWrapper3.SolveAXequB(N, true, magmaWrapper3.d_pA, magmaWrapper3.d_pB, magmaWrapper3.d_pS);
I don't see any wall-clock improvement from using 3 streams instead of 1: running the three solves on 3 streams takes 3 times as long as running one solve on one stream.
Furthermore, with nvprof I see 4 streams (even though I created only 3), as if there were an additional stream synchronizing all the memcopies caused by the magma_zgesv_gpu call.
Thanks