I am calculating values on the GPU that form one row of a matrix. At the moment I copy them back to the matrix on the CPU one row at a time and then copy the whole matrix back to the GPU. This is clearly wasteful.
The device pointers are defined as in testing_dgetrf_gpu_f.f in RC4:
real, dimension(4) :: devptrA, devptrB
My code to transfer one row looks like this (I am storing the transpose so that the elements of a row are adjacent in memory):
call cublas_get_matrix(n, 1, size_of_elt, devptrD, n,
$ G(1,jrow),n)
G is an array on the CPU. Later, the whole matrix is copied back to the GPU:
!---- devPtrA = G
call cublas_set_matrix(n, n, size_of_elt, G, ldda, devptrA, ldda)
What I would like to do is something like this:
call cublas_dcopy(n,devptrD,1,devptrXXX,1)
where devptrXXX needs to point to the correct location in devptrA. I have been looking around for an example of this and cannot find one.
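To make the intent concrete, here is a minimal sketch of the address arithmetic I have in mind. It is plain Fortran with no GPU calls, and it assumes a device pointer can be treated as a 64-bit integer byte address, which I am not sure holds for the RC4 pointer type. The names base_addr and col_addr and all numeric values are made up for illustration.

```fortran
      program offset_sketch
      implicit none
      ! Bytes per double precision element (size_of_elt in my code).
      integer, parameter :: size_of_elt = 8
      integer :: ldda, jrow
      ! ASSUMPTION: a device address fits in a 64-bit integer.
      integer(kind=8) :: base_addr, col_addr

      ldda = 1024            ! hypothetical leading dimension
      jrow = 3               ! hypothetical row index, 1-based
      base_addr = 4096_8     ! made-up stand-in for devptrA

      ! Column jrow of the stored transpose starts (jrow-1)*ldda
      ! elements past the base; scale by the element size in bytes.
      col_addr = int(jrow-1, 8) * int(ldda, 8) * size_of_elt
      col_addr = base_addr + col_addr

      print *, 'byte address of column start:', col_addr
      end program offset_sketch
```

If that assumption is valid, col_addr is the kind of value I would want to pass as devptrXXX in the cublas_dcopy call above.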
If I can crack this I can save two complete matrix transfers and the memory of the array on the CPU.
It would also help to have some explanation of the design decision to change the type of these pointers between RC3 and RC4.
Please help if you can.
Thanks
John