The domain I am working in is possibly unusual in that it requires the eigendecomposition of hundreds of small matrices (e.g. 5x5). All of these matrices are available at the same time, having been produced by a sequence of CUDA 4.0 kernels.
Each of these decompositions is far too small to keep the device busy on its own. In a similar situation with CUBLAS, one would create a stream per matrix operation and call cublasSetStream before each operation; since up to 16 streams can execute at the same time, the device is better utilized. (Hopefully a future hardware release will allow 64 CUDA streams to execute at the same time.)
What does one do in this situation with Magma?
I imagine that the hybrid nature of Magma diminishes or eliminates the utility of cublasSetStream.
Also, suppose the CPU supports only 4 parallel hardware threads.
