It looks like magma_dgemm itself includes host-device data exchanges (right?).
No. We measure the time for dgemm on the GPU, i.e., we assume the input data and the result reside in GPU memory.
There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls allocate arrays in GPU global memory and are therefore negligible for the total execution time, even for 32x32 matrices?
These calls do not allocate memory; the allocation happens earlier. They only set the matrix values in GPU memory by copying them from CPU memory. For a 32x32 matrix, the transfer takes a significant amount of time compared to the magma_dgemm execution itself.
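To make the point about small matrices concrete, here is a minimal cost-model sketch in C. It is not MAGMA code. The struct gpu_model and all function names are hypothetical, and the model (a fixed latency per copy plus bytes divided by bandwidth, with three matrices copied in and one copied out) is a simplifying assumption; only the 2n^3 flop count and the 8n^2 bytes per double-precision matrix are standard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical machine parameters; replace them with measured values. */
typedef struct {
    double latency_s;      /* fixed cost of one host-device copy, in seconds */
    double bandwidth_Bps;  /* host-device bandwidth, in bytes per second */
    double gpu_flops;      /* sustained GPU DGEMM rate, in flop per second */
} gpu_model;

/* Size in bytes of one n x n double-precision matrix. */
size_t matrix_bytes(size_t n)
{
    return n * n * sizeof(double);
}

/* Flop count of C = alpha*A*B + beta*C for square n x n matrices. */
double dgemm_flops(size_t n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Assumed cost of copying one n x n matrix: latency plus bytes/bandwidth. */
double copy_time(const gpu_model *m, size_t n)
{
    assert(m->bandwidth_Bps > 0.0);
    return m->latency_s + (double)matrix_bytes(n) / m->bandwidth_Bps;
}

/* Assumed total transfer cost: A, B and C go to the GPU, C comes back. */
double transfer_time(const gpu_model *m, size_t n)
{
    return 4.0 * copy_time(m, n);
}

/* Estimated time of the GEMM itself once the data is in GPU memory. */
double compute_time(const gpu_model *m, size_t n)
{
    assert(m->gpu_flops > 0.0);
    return dgemm_flops(n) / m->gpu_flops;
}
```

With made-up parameters such as 10 microseconds of latency, 6 GB/s of bandwidth and 300 Gflop/s, transfer_time for n = 32 comes out far larger than compute_time, while for large n the 2n^3 compute term eventually dominates the 8n^2 transfer term.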
Is it correct to compare GPU vs CPU directly by comparing the testing_dgemm execution time with the usual dgemm execution time?
It depends on what you need to accelerate. If you have the matrices on the CPU, want the result on the CPU as well, and want to check whether a GPU can accelerate this, you must modify the testing_dgemm code to include the memory transfers in the timing. The current MAGMA GEMM is an optimized implementation of DGEMM for the GPU, where the inputs and the output are in GPU memory. A GEMM with a CPU interface must be hybrid, taking into account the transfer times and the computational power of both the CPU and the GPU, e.g., see
Massimiliano Fatica. 2009. Accelerating linpack with CUDA on heterogenous clusters. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2). ACM, New York, NY, USA, 46-51.
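As a rough illustration of what "hybrid" can mean here, the sketch below splits the columns of B and C between the CPU and the GPU in proportion to their sustained DGEMM rates, so that both finish at about the same time. This is a simple load-balancing assumption, not code from MAGMA or from the paper above; the function name gpu_columns and its parameters are hypothetical, and transfer times are deliberately ignored to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of columns (out of n) of B and C to assign to the GPU, given the
 * sustained DGEMM rates of the CPU and the GPU in flop per second.
 * The remaining n - gpu_columns(...) columns stay on the CPU.
 * Assumption: transfer times are ignored.
 */
size_t gpu_columns(size_t n, double cpu_flops, double gpu_flops)
{
    assert(cpu_flops > 0.0 && gpu_flops > 0.0);
    double gpu_share = gpu_flops / (cpu_flops + gpu_flops);
    /* Round to the nearest whole column. */
    return (size_t)(gpu_share * (double)n + 0.5);
}
```

For example, with a CPU sustaining 100 Gflop/s and a GPU sustaining 300 Gflop/s, the GPU would receive three quarters of the columns. A real hybrid GEMM would also have to account for the transfer cost, which shifts the split toward the CPU for small matrices.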