The copy from GPU to main memory can be either synchronous or asynchronous, depending on which version of the routine you call. If synchronous, you don't need to do anything special: the copy executes after the gemm call and blocks the CPU until it finishes. If asynchronous, call cudaStreamSynchronize on the stream (or cudaDeviceSynchronize) before using the result, to ensure the copy has finished. All CUDA kernel launches are asynchronous with respect to the host, not just MAGMA's. This lets you do useful work on the CPU while the GPU is busy.
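If the only operation you do on the GPU is a single gemm, you may not see any performance improvement, because you also pay for the data transfers. Whether it pays off depends on the matrix size: the transfers move O(n^2) data while gemm does O(n^3) work, so larger matrices amortize the transfer cost better.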
Finally, in general I would recommend using the cuBLAS gemm, as NVIDIA continues to optimize it for new architectures. MAGMA really focuses on the higher-level routines like getrf (LU factorization).