time measurements in MAGMA


Postby Mikky » Thu Apr 21, 2011 9:48 am

1) What is the timer resolution used in the testing programs (from the MAGMA testing directory)
for Linux/x86-64?

I obtained some GFLOPS values (which may or may not be meaningful) for the C2050 from a testing_dgemm run
where small 32 x 32 matrices were used. But such a dgemm call takes only about 1e-5 s on
one Nehalem/2.7 GHz core.

2) Do I understand correctly that the timings are start/stop measurements on the x86 host side, i.e.
that (for example) the runtimes printed by testing_dgemm (and hence the performance figures themselves) include PCIe transfer delays?

3) Do I need to perform MAGMA performance tuning (based on src/get_nb.cpp)
after installing 1.0.0-rc4 for the NVIDIA C2050?

Re: time measurements in MAGMA

Postby fletchjp » Fri Apr 22, 2011 4:31 am

I found this paper by Orion Lawlor very helpful for understanding the timing of data transfers to a GPU:

http://lawlor.cs.uaf.edu/~olawlor/paper ... I_2009.pdf

There is a set-up time for each transfer, and it dominates for small transfers.
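As a rough, illustrative calculation (assuming typical PCIe 2.0 x16 figures of roughly 10 us per-transfer latency and about 5 GB/s sustained bandwidth, not measured values): a 32 x 32 double matrix is 8 KB, so the payload itself takes about 8192 / 5e9 s = 1.6 us, and the total is roughly 10 + 1.6 = 11.6 us. The fixed set-up latency dominates.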

I hope this helps.

John

Re: time measurements in MAGMA

Postby Mikky » Fri Apr 22, 2011 2:18 pm

Thanks for the reference, it's very helpful!

But my question was a bit more basic: what does the get_current_time call measure, and therefore what does the difference
between the end and start times include (does it include PCIe transfer time, as I believe)? And what about the resolution of get_current_time?

Re: time measurements in MAGMA

Postby Stan Tomov » Sat Apr 23, 2011 11:05 pm

The function get_current_time calls gettimeofday, so the resolution is a microsecond. Before calling gettimeofday there is a call to cudaThreadSynchronize() to make sure previous GPU tasks have completed. Thus one can measure the time of a particular GPU kernel by surrounding it with calls to get_current_time. If there are data-transferring functions between two get_current_time calls, the measured time will include the transfer time.
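
For reference, here is a minimal sketch of what that pattern boils down to, based on the description above; get_current_time itself lives in MAGMA's testing utilities, so the helper name below is just illustrative (cudaThreadSynchronize was the current API in the CUDA 3.x era):

[code]
#include <sys/time.h>
#include <cuda_runtime.h>

/* Illustrative equivalent of MAGMA's get_current_time: drain the GPU
 * queue, then read the microsecond-resolution wall clock. */
static double wall_time_sync(void)
{
    struct timeval t;
    cudaThreadSynchronize();   /* wait until previous GPU tasks complete */
    gettimeofday(&t, NULL);    /* 1 us resolution */
    return t.tv_sec + 1e-6 * t.tv_usec;
}

/* Usage: any kernels or transfers issued between the two calls are
 * included in the measured interval.
 *
 *   double t0 = wall_time_sync();
 *   ... launch kernel and/or copy data ...
 *   double t1 = wall_time_sync();
 *   elapsed seconds = t1 - t0
 */
[/code]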

Re: time measurements in MAGMA

Postby Mikky » Mon Apr 25, 2011 12:11 pm

OK.
I looked at the MAGMA testing source again. It looks like magma_dgemm itself includes host-device data exchanges (right?).

There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls perform array allocation in GPU global memory and are therefore negligible in the total execution time, even for 32x32 matrices?

I.e., is it correct to compare GPU vs. CPU directly by comparing the testing_dgemm execution time with the usual dgemm
execution time?

To be more exact: instead of the execution time I use the testing_dgemm GFLOPS value, etc.

Re: time measurements in MAGMA

Postby Stan Tomov » Thu Apr 28, 2011 2:06 pm

Mikky wrote: It looks like magma_dgemm itself includes host-device data exchanges (right?)

No. We measure the time for dgemm on the GPU, i.e., we assume the data and the result are in GPU memory.

Mikky wrote: There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls perform array allocation in GPU global memory and are therefore negligible in the total execution time, even for 32x32 matrices?

This call does not allocate memory; the allocation happens earlier. It only sets the matrix values in GPU memory by copying them from CPU memory. The transfer of a 32x32 matrix will take a significant fraction of the magma_dgemm execution time.

Mikky wrote: Is it correct to compare GPU vs. CPU directly by comparing the testing_dgemm execution time with the usual dgemm execution time?

That depends on what you need to accelerate. If you have the matrix on the CPU, want the result on the CPU as well, and want to check whether you can accelerate this using a GPU, you must modify the testing_dgemm code to include the memory transfers. The current MAGMA GEMM is an optimized implementation of DGEMM for the GPU, where the inputs and the output are on the GPU. A CPU-interface GEMM must be hybrid, taking into account transfer times and the computational power of the CPU and the GPU; see, e.g.,

Massimiliano Fatica. 2009. Accelerating linpack with CUDA on heterogenous clusters. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2). ACM, New York, NY, USA, 46-51.
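
As a hedged sketch of what such a CPU-interface measurement might look like (the exact magmablas_dgemm prototype and the wall_time_sync helper from the earlier post are assumptions; check them against your installed MAGMA 1.0 headers):

[code]
#include <stdio.h>
#include <cublas.h>   /* legacy CUBLAS API (CUDA 3.x era) */

/* Hypothetical sketch: time an n-by-n DGEMM from the CPU's point of view,
 * i.e. including host->device and device->host transfers.  Assumes
 * column-major doubles and device buffers dA, dB, dC already allocated
 * (e.g. with cublasAlloc), plus the wall_time_sync() helper above. */
static void time_dgemm_cpu_interface(int n,
                                     const double *A, const double *B, double *C,
                                     double *dA, double *dB, double *dC)
{
    double t0 = wall_time_sync();
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);   /* A -> GPU */
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);   /* B -> GPU */
    magmablas_dgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);
    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);   /* C -> CPU */
    double t1 = wall_time_sync();
    printf("CPU-interface GFLOP/s: %.2f\n", 2.0*n*n*n / (t1 - t0) / 1e9);
}
[/code]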

