
Performance of data transfers to GPU

PostPosted: Wed Mar 16, 2011 7:33 pm
by fletchjp
I am working on some code using magma_zgetrf_gpu called from FORTRAN.
For the CPU part of my code I am using GotoBLAS2 compiled for the CORE2 target, which consistently uses four cores, as I can see by monitoring the CPU cores on my computer. I am running this on an 8-core CPU (I cannot use the 8-core build of GotoBLAS2 because it has a bug).

When I run a program that calls magma_zgetrf_gpu, I have to call cublas_set_matrix to transfer the data to the GPU and cublas_get_matrix to get it back afterwards. What I observe is that during these calls three of the four cores show a sharp drop in usage, and an extra core shows a peak. I don't know what use the CUBLAS routines make of BLAS calls, but it looks very much as though they are not using the four cores available to the program through GotoBLAS.

I think these calls are in fact what is giving me poor performance, particularly when I want to do repeated back substitutions using magma_zgetrs_gpu, where the performance is much worse than doing the work on the CPU. The matrix size is about 4000.

The following data for zgetrf_gpu shows that I should get a considerable speed-up, but I am not seeing it because of the overhead of the data transfers.

Code: Select all
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_zgetrf_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

  testing_zgetrf_gpu -M 1024 -N 1024

  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
  960   960   19.62          47.00         1.102403e-17
 1920  1920   26.94          59.55         1.096587e-17
 3072  3072   27.30          63.06         1.075028e-17
 4032  4032   27.76          67.16         1.033353e-17
 4992  4992   27.93          68.29         1.044090e-17
 5952  5952   27.96          68.98         1.025062e-17
 7104  7104   28.13          69.48         1.020955e-17
 8064  8064   27.92          69.82         1.004068e-17
 9024  9024   27.68          70.09         9.916281e-18


Re: Performance of data transfers to GPU

PostPosted: Fri Mar 18, 2011 8:21 am
by fletchjp
I have done some more work to confirm the previous analysis. I have three strategies.

0. Use ZGETRF to factorise the matrix and then repeatedly call ZGETRS to solve for successive right hand sides which are made available one at a time in a loop.

1. Transfer the matrix to the GPU. Use magma_zgetrf_gpu to factorise the matrix. Transfer the factorised matrix back to the CPU and use ZGETRS as in strategy 0.

2. Transfer the matrix to the GPU. Use magma_zgetrf_gpu to factorise the matrix. Transfer successive vectors to the GPU and use magma_zgetrs_gpu to back substitute, returning a vector to the CPU in each case so that the next vector can be calculated.
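In outline, the three strategies look like this (schematic pseudocode only: the routine names are those used above, but the argument lists are sketched, not the real interfaces):

```
! Strategy 0: everything on the CPU
call ZGETRF(A)                      ! factorise once
do while (more right-hand sides)
    call ZGETRS(A, b)               ! back substitute for each new b
end do

! Strategy 1: factorise on the GPU, solve on the CPU
call cublas_set_matrix(A -> dA)     ! one large transfer
call magma_zgetrf_gpu(dA)
call cublas_get_matrix(dA -> A)     ! one large transfer
do while (more right-hand sides)
    call ZGETRS(A, b)
end do

! Strategy 2: factorise and solve on the GPU
call cublas_set_matrix(A -> dA)     ! one large transfer
call magma_zgetrf_gpu(dA)
do while (more right-hand sides)
    call cublas_set_vector(b -> db) ! small transfer every iteration
    call magma_zgetrs_gpu(dA, db)
    call cublas_get_vector(db -> b) ! small transfer every iteration
end do
```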

For a matrix size of 4000, strategies 0 and 1 take about the same time, and I expect strategy 1 to be faster as the problem size gets bigger. Strategy 2 takes four times as long, and the extra time is taken up by the transfers to and from the GPU, because of the setup time of repeated small transfers. This far outweighs any gain from using magma_zgetrs_gpu. The only way around this would be to move the other parts of the loop onto the GPU as well, which in my case means a lot of work on another library. In any case, for strategies 0 and 1 this is not the largest part of the time, so that does not seem a good use of effort. I would not expect this performance to improve with problem size, as the vector transfers will remain relatively small and the per-transfer overhead will remain.

I am sharing this so that others can get a feel for where it is most useful to contemplate the use of a GPU.

The conclusion I am drawing is that the overhead on transfers is such that methods should be favoured where the transfers can be few and large, rather than many and small.

I would be grateful for any comments from others' experience, or suggestions on how to tackle such problems.

I have looked for the source code of cublas_get_vector and other similar routines, but it does not seem to be publicly available. If it is, please point this out to me.

I hope this helps.


It turns out that the analysis here was mistaken. The problem was in fact that MAGMA ZGETRS is very slow for a single right-hand side. This can be remedied by using ztrsv instead of ztrsm for this case. I don't want anyone to be misled: data transfer speeds are not the problem here, so most of what follows is not the solution to this problem.

John (26th March)

Re: Performance of data transfers to GPU

PostPosted: Mon Mar 21, 2011 7:42 am
by fletchjp
I have tried a strategy 3, in which I used pinned memory for the buffer on the CPU. This was quite difficult to sort out for the FORTRAN interface; I will post the code later. After all that work, it made very little difference to the time taken compared with strategy 2.

I have found a reference which discusses other ways of doing the transfer. In particular, there is a way of mapping CPU memory into the GPU address space which is said to avoid the setup times, and it is the setup time that is killing this problem for me. Has any work been done on this for MAGMA?

This is mentioned by Lawlor in his paper on cudaMPI: ... I_2009.pdf (Section III C)



Re: Performance of data transfers to GPU

PostPosted: Mon Mar 21, 2011 12:14 pm
by mateo70

Just to know: do you really need to do all the solves separately? Can't you just call getrs_gpu with all the RHS at once?

Otherwise these results look interesting. I will try to fix that for the next release, since we have a function to do the swaps directly on the GPU, without the need to transfer the RHS to the CPU, do the swap, and transfer it back to the GPU.


Re: Performance of data transfers to GPU

PostPosted: Mon Mar 21, 2011 12:54 pm
by fletchjp
Thank you for the comments.

The reason I transfer the vectors one at a time is that they are calculated sequentially, and each depends on the previous results; there is no way around that. I am using ARPACK to solve a generalised eigenvalue problem: I call its routine and it does repeated call-backs with a vector, which I need to treat as a RHS for my factorisation, feeding the results back to it. The ARPACK part of the calculation is not very time-demanding, so I don't want to move ARPACK to the GPU. I did ask Stan a long time ago whether you were looking at ARPACK and he said no, which I think makes sense. So if I am to use the GPU for the back substitution, I need to have the vector on the GPU and then pass it back to the CPU for ARPACK to work again. This happens several times per iteration. The problem is that the times are dominated by the setup time for each transfer, as documented by Lawlor (see above).

I think the best bet is to use CPU memory mapped into the GPU, as described in the Dr Dobbs part 12 article (see my comments above). If I use what I think is pinned AND mapped memory, then both the CPU and the GPU can read and write it, and there is no overhead for setting up the transfer, which is the time problem at the moment. The Dr Dobbs article is one of a series and seems to be talking about things in a way I have not seen anywhere else.

The fallback is to use what I call strategy 1 above and only use the GPU for the factorisation and not for the back substitution. That is the best at the moment but feels like giving in too easily.

It is all good learning as far as I am concerned, to find out what works best.


Re: Performance of data transfers to GPU

PostPosted: Mon Mar 21, 2011 5:07 pm
by Stan Tomov

Thanks for these posts. We will try the new ways to communicate between the CPU and the GPU. So far we have tried to avoid this type of communication when possible, but there are cases where it wasn't possible and the algorithms would benefit from it (e.g., the two-sided factorizations with application to eigen/singular-value problems).

The closest we get to your pattern of computation is the mixed-precision solvers, but there we manage to do the entire computation on the GPU and avoid transferring any vectors. As you point out, communications are expensive: the latency for a CPU-GPU data transfer is about 11 microseconds, so to transfer a vector of 1,000 double-precision numbers (8,000 bytes) over a PCIe link at 5.5 GB/s (5,500 bytes/us) takes 11 us + 8,000 / 5,500 us = 11 + 1.45 us. Most of the time (the 11 us) in this example is latency, so it would indeed be very good if it could be avoided.

Please post your findings.


Re: Performance of data transfers to GPU

PostPosted: Mon Mar 21, 2011 6:07 pm
by fletchjp

Thank you for that. My main code is in FORTRAN, so I had to do some fiddling to find out how to write an interface to declare pinned memory from FORTRAN. I managed to sort that out, but it gave no advantage. I now need to find out how to do mapped memory; I think I will try it in C++ first and then in FORTRAN.

Code for pinned memory: C code with the same headers as the first C code I posted in the FORTRAN pointers thread.

Code: Select all
/* I cannot get the linker to link these directly from FORTRAN, but this works;
   I don't know why. */
/* This is passed a C pointer type and does not need indirection on the call. */
void cuda_alloc_host_(devptr_t *hostPtr, const int *n, const int *size)
{
  cudaMallocHost((void **)hostPtr, (*n) * (*size));
}

/* Matching free for cudaMallocHost */
void cuda_free_host_(devptr_t *hostPtr)
{
  cudaFreeHost((void *)(*hostPtr));
}



Code: Select all
       use iso_c_binding
! see
      type(C_PTR) :: cptr_buff
      complex*16, dimension(:), pointer :: h_buff


Code: Select all
! see
c I cannot get the next call to link
c         call cudaMallocHost(cptr_buff, ldda*size_of_elt)
c so I have used my own routine instead.
         call cuda_alloc_host(cptr_buff,ldda,size_of_elt)
         call c_f_pointer(cptr_buff,h_buff,(/ ldda /) )

Usage example:

Code: Select all
       CALL ZCOPY(N, BWORK, 1, h_buff, 1)
       call cublas_set_vector(n,size_of_elt,h_buff,1,devptrB,1)

Freeing up: note that this uses the C pointer (cptr_buff), not the FORTRAN name.

Code: Select all
      call cuda_free_host(cptr_buff)

The information for this came from here: page 85


Re: Performance of data transfers to GPU

PostPosted: Mon Mar 21, 2011 7:27 pm
by fletchjp

There is an example of the use of mapped memory on the Dr Dobbs Part 12 page I referred to before.

The code is called and I have it working on my device (GTX 460)

The crucial tests are that the device can map host memory and that the CUDA runtime is version 2.2 or later.

Code: Select all
   /* from CUDA by Example, pages 28-32 */
    int count;
    cudaDeviceProp prop;
    cudaGetDeviceCount(&count);   /* count must be set before the loop */
    for (int i = 0; i < count; i++) {
      cudaGetDeviceProperties(&prop, i);
      if (prop.canMapHostMemory)
        printf("Device %d can map host memory\n", i);
      else
        printf("Device %d cannot map host memory\n", i);
    }
There is another example on the same web page which shows how to send with one array and get back in another to be sure of data consistency.

I now have versions of this for double precision and double-precision complex. I had warnings about demoting to float until I added -arch=sm_20 to my nvcc command line in the makefile.

I have not tried to apply this to my problem yet.


Re: Performance of data transfers to GPU

PostPosted: Tue Mar 22, 2011 4:29 am
by fletchjp
There is a section called Zero-Copy Host Memory in CUDA by Example, starting on page 214. It is in the chapter on multiple GPUs, but it is not really about multiple GPUs at all.

I will start some experiments to see where I can get with all this.

See my experiments on the timing of zgetrs_gpu: I now find that it is the routine itself which is slow.


Re: Performance of data transfers to GPU

PostPosted: Fri Mar 25, 2011 4:02 am
by danilo
Hi John, I definitely suggest you to use pinned memory for such situations. I had a similar problem in conjugate gradient algorithm, where for each iteration (until convergence) you need to invoke gpu kernels and transfer data from cpu to gpu and vice-versa. Using cudamallochost I obtained good improvements in performance and the overhead was drastically reduced.