asynchronous magmablas_sgemm calls

Open discussion for MAGMA

Postby luiceur » Wed Nov 07, 2012 7:53 am

I am trying to execute two totally independent matrix calculations at the same time using magmablas_sgemm. I am thinking of using CUDA streams. However, for that to work I believe the magmablas_sgemm calls must be asynchronous; otherwise control won't return to the CPU, so it won't be able to launch the second magmablas_sgemm. Am I right? Would it be possible?

Best,
Luis
luiceur
 
Posts: 26
Joined: Tue Jul 10, 2012 4:38 am

Re: asynchronous magmablas_sgemm calls

Postby mgates3 » Wed Nov 07, 2012 10:57 am

Both cublas and magmablas gemm calls are asynchronous. Actually, nearly all cublas and magmablas functions are asynchronous (but not higher level magma algorithms such as getrf). With recent performance improvements in CUDA 5.0, I actually recommend using cublas gemm.

Yes, you can use streams to execute multiple gemms simultaneously. This is helpful for small gemms -- specifically, where the output matrix (C) is small. For large gemms, each gemm will basically fill up the whole GPU, so there is no benefit to attempting to execute gemms simultaneously.

-mark
mgates3
 
Posts: 330
Joined: Fri Jan 06, 2012 2:13 pm

Re: asynchronous magmablas_sgemm calls

Postby luiceur » Thu Nov 08, 2012 5:27 am

Hi Mark,

Thanks for the info, very useful.
If it is possible to execute gemm with streams, how can I indicate which stream sgemm should be executed on? The definition of magmablas_sgemm does not have any parameter for streams.
I have to say that in our case, with matrices of 9000x24000 float elements, magma behaves similarly to cublas, achieving similar speedups.

Luis

Re: asynchronous magmablas_sgemm calls

Postby mgates3 » Thu Nov 08, 2012 12:51 pm

With the new cublas interface (cublas_v2.h), you set the stream on the cublas handle using cublasSetStream( ), then pass the handle to the cublas gemm function.
With the old cublas interface (cublas.h), you set the stream globally using cublasSetKernelStream( ). Subsequent cublas calls use that stream.
With magma, you set the stream globally using magmablasSetKernelStream( ). Subsequent magmablas and cublas calls use that stream.

-mark

Re: asynchronous magmablas_sgemm calls

Postby luiceur » Fri Nov 09, 2012 7:26 am

I have managed to make them run on different streams by calling magmablasSetKernelStream, but I have not managed to make them run simultaneously, i.e. they don't overlap. Ideally I would like them to overlap in time, as they don't depend on each other. Do you have any ideas how I could do that?

Code:
  cudaStream_t stream[2];
  for (int i = 0; i < 2; i++) {
    cudaStreamCreate( &stream[i] );
  }

  int lda = iSize;
  int ldb = iSize;
  int ldc = kSize;

  // queue the first gemm on stream[0]
  magmablasSetKernelStream( stream[0] );
  magmablas_sgemm( transA, transB, kSize, kSize, iSize, alpha, d_Xv, lda, d_Xv, ldb, beta, d_AN, ldc );

  // queue the second, independent gemm on stream[1]
  magmablasSetKernelStream( stream[1] );
  magmablas_sgemm( transA, transB, kSize, kSize, iSize, alpha, d_X, lda, d_X, ldb, beta, d_A, ldc );

  // wait for both gemms to finish
  for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize( stream[i] );
  }

  // go back to the default stream before destroying the ones created above
  magmablasSetKernelStream( NULL );
  for (int i = 0; i < 2; i++) {
    cudaStreamDestroy( stream[i] );
  }
Attachment: Nsight.jpg

Re: asynchronous magmablas_sgemm calls

Postby mgates3 » Fri Nov 09, 2012 11:01 am

Each of your gemms looks like it fills up the whole GPU for over 1 second, which I would expect for a 9000x24000 matrix. So I would not expect them to overlap. If they did overlap, the total time would not decrease at all. That is, both gemms would get half the GPU for twice as long, together taking 2 seconds. In other words, for this problem size, I do not see any advantage to overlapping the gemms, nor any way to force them to overlap.

-mark
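
The argument above can be put into a small back-of-the-envelope model. This is only a sketch under stated assumptions, not a measurement and not anything taken from MAGMA or cuBLAS: the function names are hypothetical, and `u1` and `u2` are an assumed "fraction of the GPU that each gemm occupies when it runs alone" (1.0 for a large gemm that fills the device, much less for a gemm with a small output matrix). The model also assumes an ideal scheduler with no overhead.

```c
#include <assert.h>

/* Total time when the two gemms run one after the other. */
double sequential_time(double t1, double t2)
{
    return t1 + t2;
}

/* Idealized lower bound on the total time when the two gemms may overlap.
 * t1, t2: run time of each gemm when it has the GPU to itself (seconds).
 * u1, u2: ASSUMED fraction of the GPU each gemm occupies when alone, in (0, 1].
 */
double concurrent_time(double t1, double u1, double t2, double u2)
{
    assert(u1 > 0.0 && u1 <= 1.0 && u2 > 0.0 && u2 <= 1.0);

    /* Total work expressed in "whole-GPU seconds". */
    double work = t1 * u1 + t2 * u2;

    /* Neither gemm can finish sooner than it would when running alone. */
    double longest = (t1 > t2) ? t1 : t2;

    return (work > longest) ? work : longest;
}
```

With two gemms that each fill the GPU for 1 second (`u1 = u2 = 1.0`), the model gives 2 seconds whether or not they overlap, which matches the reasoning in the post. With two small gemms that each occupy a quarter of the GPU, it gives 1 second instead of 2, which is the case where streams help.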

Re: asynchronous magmablas_sgemm calls

Postby luiceur » Mon Nov 19, 2012 10:13 am

What if I use multiple GPUs? In theory, using 2 GPUs would divide the time by two; is there any reason I could not take advantage of that? Could magmablas_sgemm access another GPU's addresses, as would happen with a CUDA kernel?
Cheers,
luiceur
 

Re: asynchronous magmablas_sgemm calls

Postby mgates3 » Mon Nov 19, 2012 12:00 pm

Yes, you can divide the matrix in half and compute the gemm in parallel. For instance,
A [B0 B1] = [C0 C1]
becomes
A B0 = C0 and A B1 = C1.
Note while B and C are split, the matrix A must be duplicated on the two GPUs. Or you can split the other way,
[ A0 ] B = [ C0 ]
[ A1 ]     [ C1 ]
For most linear algebra algorithms, distributing the matrix in a block column cyclic fashion, or sometimes block row cyclic, is efficient for a small number of GPUs.

You can use cublas gemm or magmablas gemm to achieve this, in both cases using cudaSetDevice to switch between GPUs. For example,
cudaSetDevice( 0 );
cublasSgemm( handle0, CUBLAS_OP_N, CUBLAS_OP_N, m, n0, k, alpha, A0, lda, B0, ldb, beta, C0, ldc );
cudaSetDevice( 1 );
cublasSgemm( handle1, CUBLAS_OP_N, CUBLAS_OP_N, m, n1, k, alpha, A1, lda, B1, ldb, beta, C1, ldc );
Here I assumed B and C were split; A is duplicated in A0 and A1 on GPU 0 and 1 respectively.
-mark
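
To see that the column split gives the same answer as a single gemm, here is a minimal CPU-only sketch in plain C. It is not the MAGMA or cuBLAS API: `ref_sgemm` and `split_sgemm` are hypothetical helper names, the matrices are assumed to be column-major with no transposes, and both halves run one after the other on the host. On two GPUs each half would instead be issued after its own cudaSetDevice call, as in the post above.

```c
#include <assert.h>
#include <stddef.h>

/* Reference column-major SGEMM without transposes: C = alpha*A*B + beta*C.
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc). */
void ref_sgemm(int m, int n, int k, float alpha,
               const float *A, int lda,
               const float *B, int ldb,
               float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            }
            C[i + (size_t)j * ldc] = alpha * sum + beta * C[i + (size_t)j * ldc];
        }
    }
}

/* Column split: A [B0 B1] = [C0 C1].
 * B0 and C0 are the first n0 columns, B1 and C1 the remaining n - n0.
 * Both halves read the whole of A, which is why A has to be duplicated
 * when the halves live on different GPUs. */
void split_sgemm(int m, int n, int n0, int k, float alpha,
                 const float *A, int lda,
                 const float *B, int ldb,
                 float beta, float *C, int ldc)
{
    assert(n0 >= 0 && n0 <= n);
    int n1 = n - n0;

    /* A * B0 = C0 */
    ref_sgemm(m, n0, k, alpha, A, lda, B, ldb, beta, C, ldc);

    /* A * B1 = C1: in column-major storage, column n0 starts n0*ld elements in. */
    ref_sgemm(m, n1, k, alpha, A, lda,
              B + (size_t)n0 * ldb, ldb,
              beta, C + (size_t)n0 * ldc, ldc);
}
```

The row split works the same way, except that the offsets are applied to A and C along the row index and B is the matrix that gets duplicated.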

Re: asynchronous magmablas_sgemm calls

Postby luiceur » Tue Nov 20, 2012 6:53 am

Thanks Mark for your help. Just to clarify, because magmablas_gemm and cublas_gemm are both async, calling them:
Code:
cudaSetDevice( 0 );
cublasSgemm( handle0, CUBLAS_OP_N, CUBLAS_OP_N, m, n0, k, alpha, A0, lda, B0, ldb, beta, C0, ldc );
cudaSetDevice( 1 );
cublasSgemm( handle1, CUBLAS_OP_N, CUBLAS_OP_N, m, n1, k, alpha, A1, lda, B1, ldb, beta, C1, ldc );

will execute them in parallel, am I right? Or would I need to create different CPU threads to set device 0 and device 1 in order to execute them in parallel?

Cheers and thanks a lot!
luiceur
 

Re: asynchronous magmablas_sgemm calls

Postby mgates3 » Tue Nov 20, 2012 11:35 am

That's correct, since they are async, coding as in my previous post should execute them in parallel on two GPUs. All of our multi-GPU codes are written this way.
-mark

