1) The legacy CUBLAS library prevents concurrent execution between multiple CPU threads.
2) Conditional statements following the call to cublasIdamax create an implicit synchronization between the host and device.
3) Changing the device cache configuration (cudaDeviceSetCacheConfig) forces a global device synchronization, and consequently CPU thread synchronization.
With these points in mind, is there any work being done to implement context switching/multiple streams in these routines and/or in MAGMA as a whole? Many of the techniques are already available in the CUDA sample cdpLUDecomposition, which uses CUDA Dynamic Parallelism to perform a right-looking Level 3 BLAS version of LU decomposition with partial pivoting entirely on the device. Something akin to this example, with the ability to perform the factorization in a single kernel launch, would be very beneficial to me and, I'd wager, to many others.
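To make points 1 and 2 concrete, here is a minimal sketch of what I have in mind, using the handle-based cuBLAS v2 API rather than the legacy one. Each CPU thread would own its own handle and stream, and the pointer mode is set to device so that cublasIdamax writes the pivot index to device memory instead of blocking the host. The helper name `pivot_index_async` and the -1 error convention are my own placeholders, not anything from MAGMA, and this is only an illustration under those assumptions, not a proposed implementation.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not part of MAGMA or cuBLAS): returns the 1-based index
// of the entry with the largest absolute value, or -1 if any CUDA/cuBLAS call fails.
int pivot_index_async(const std::vector<double>& host_x) {
    const int n = static_cast<int>(host_x.size());
    if (n <= 0) return -1;

    cudaStream_t stream = nullptr;
    cublasHandle_t handle = nullptr;
    double* d_x = nullptr;
    int* d_pivot = nullptr;
    int pivot = -1;

    // One stream and one handle per calling CPU thread, so threads do not
    // share cuBLAS state the way they do with the legacy API.
    bool ok = cudaStreamCreate(&stream) == cudaSuccess;
    ok = ok && cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;
    ok = ok && cublasSetStream(handle, stream) == CUBLAS_STATUS_SUCCESS;
    // Device pointer mode: the scalar result stays on the GPU, so the
    // cublasIdamax call itself does not have to wait for the result.
    ok = ok && cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE)
                   == CUBLAS_STATUS_SUCCESS;
    ok = ok && cudaMalloc(reinterpret_cast<void**>(&d_x), n * sizeof(double))
                   == cudaSuccess;
    ok = ok && cudaMalloc(reinterpret_cast<void**>(&d_pivot), sizeof(int))
                   == cudaSuccess;

    // Copy from pageable host memory; pinned memory would be needed for this
    // copy to overlap fully with other work.
    ok = ok && cudaMemcpyAsync(d_x, host_x.data(), n * sizeof(double),
                               cudaMemcpyHostToDevice, stream) == cudaSuccess;

    // Result is written to d_pivot (device memory), enqueued on `stream`.
    ok = ok && cublasIdamax(handle, n, d_x, 1, d_pivot) == CUBLAS_STATUS_SUCCESS;

    // Only when the host actually needs the index do we copy it back and
    // synchronize, and then only with this stream, not the whole device.
    ok = ok && cudaMemcpyAsync(&pivot, d_pivot, sizeof(int),
                               cudaMemcpyDeviceToHost, stream) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;

    if (d_pivot) cudaFree(d_pivot);
    if (d_x) cudaFree(d_x);
    if (handle) cublasDestroy(handle);
    if (stream) cudaStreamDestroy(stream);

    return ok ? pivot : -1;
}
```

A host-side branch on the pivot still needs the final synchronization, so this only defers and narrows the wait to one stream. Avoiding it entirely would mean consuming the pivot on the device, which is essentially what the Dynamic Parallelism sample does.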