Hi,
I will check, because I haven't worked on this paper, but I think the performance difference you see on these routines is due to the cublasAlloc call that we added inside the function to make it user-friendly. When the workspace is allocated once, outside the function, performance can change a lot. That's why in the next version we will add an interface where the user can provide the workspace.
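
To illustrate the idea, here is a minimal sketch in plain C. It is only an analogy: malloc stands in for cublasAlloc so that it runs without a GPU, and the names scale_with_workspace and scale_convenient are hypothetical, not the library's actual interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical routine: computes y = alpha * x through a scratch buffer.
   The caller owns `work`, which must hold at least n doubles.
   Returns 0 on success, -1 if no workspace was provided. */
int scale_with_workspace(size_t n, double alpha, const double *x,
                         double *y, double *work)
{
    assert(x != NULL && y != NULL);
    if (n == 0) return 0;
    if (work == NULL) return -1;
    for (size_t i = 0; i < n; ++i) work[i] = alpha * x[i];
    for (size_t i = 0; i < n; ++i) y[i] = work[i];
    return 0;
}

/* Hypothetical convenience wrapper: allocates and frees the workspace
   on every call (malloc here plays the role of cublasAlloc). */
int scale_convenient(size_t n, double alpha, const double *x, double *y)
{
    if (n == 0) return 0;
    double *work = (double *)malloc(n * sizeof *work);
    if (work == NULL) return -1;
    int status = scale_with_workspace(n, alpha, x, y, work);
    free(work);
    return status;
}
```

If the routine is called many times in a loop, the caller can allocate work once before the loop, pass it to scale_with_workspace on each iteration, and free it afterwards, so the allocation cost is paid only once.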
Mathieu
