Create distributed matrix on gpus with no cpu to gpu copy
I want to use _mgpu functions in magma as we need to work with matrix sizes of N=40,000 and more whose memory can't fit in a single gpu.
Lets say I am working with k gpus, and I am able to independently compute N/k different column blocks of the matrix directly on k gpus.
Next, lets say I want to use magma_int_t magma_dpotrf_mgpu( magma_int_t ngpu, magma_uplo_t uplo, magma_int_t n, magmaDouble_ptr d_lA, magma_int_t ldda, magma_int_t * info).
I am wondering if there a way to use the distributed matrix memory which is already on the gpus in the magmaDouble_ptr argument without the need to copy the memory to cpu?
Currently, if I understand correctly the magmaDouble_ptr is obtained using magma_dsetmatrix_1D_col_bcyclic(...), which uses the full matrix memory already present on the cpu.
I wouldn't like to copy to cpu and again back to gpu as this will involve additional costs and the full matrix memory might be beyond the cpu memory as well.
Thanks a lot in advance,