I am calculating some values on the GPU which form one row of the matrix. At the moment I copy them back a row at a time to the matrix on the CPU and then copy the whole matrix back to the GPU. This is clearly wasteful.

The device pointers are defined as in testing_dgetrf_gpu_f.f in RC4:

` real, dimension(4) :: devptrA, devptrB`

Each row is copied back to the CPU with:

```
      call cublas_get_matrix(n, 1, size_of_elt, devptrD, n,
     $     G(1,jrow), n)
```

The complete matrix is then copied back to the GPU with:

```
!---- devPtrA = G
call cublas_set_matrix(n, n, size_of_elt, G, ldda, devptrA, ldda)
```

What I would like to do instead is copy each row into place directly on the GPU, with something like:

` call cublas_dcopy(n,devptrD,1,devptrXXX,1)`

If I can crack this I can save two complete matrix transfers and the memory of the array on the CPU.

It would also help to have an explanation of the design decision to change the type of these pointers from RC3 to RC4.

Please help if you can.

Thanks

John