How to use magma with GPU and multi-core CPU systems

Posted: Wed Oct 11, 2017 9:10 am
by ianmcc
Hi, I am in the process of porting a large computational physics code to use CUDA. It is basically an eigensolver, where the matrix-vector multiply is represented as a succession of smaller matrix-matrix multiplies (dgemm) of mixed sizes (it depends on the problem, but typically one iteration involves a bunch of multiplies of various sizes, up to a few hundred by a few hundred). I think I've got this part mostly under control, using cuBLAS with one stream per matrix and events to manage dependencies, so that the matrix multiplies are effectively replaced by non-blocking cuBLAS calls. The existing code is largely unmodified, but now runs asynchronously.

But I also need some LAPACK functions, for dense eigensolvers and SVD. I want to do this in a similar way, effectively with an asynchronous call that simply schedules the work, and then wait for some event later on, at a point that will be something like merging and sorting all of the eigenvalues or singular values from a bunch of matrices. This is a potential bottleneck.

I understand that most MAGMA functions are hybrid CPU/GPU, but I don't yet understand how this actually works. Do they just run in one CPU thread (i.e., the caller), in which case I guess I should arrange for some kind of thread pool to process a bunch of matrices in different CPU threads, or do they use multiple threads (OpenMP?)? If so, how do I control the number of threads? If I call some MAGMA driver asynchronously (i.e., in another thread via a future or some other mechanism), how do I tell when it has finished running? Is it when the call returns, or do I need to synchronize the CUDA stream?

Re: How to use magma with GPU and multi-core CPU systems

Posted: Fri Oct 13, 2017 7:47 am
by ianmcc
I still haven't been able to find any documentation that answers my questions, so I've been trying to go through the source code. I think I've answered some of my questions, but in other ways I'm even more confused now.

The LAPACK functions don't take a queue parameter, so it looks like I need to synchronize the stream associated with the matrix prior to the call (or hack the sources to pass a preexisting queue to the function, which I'm tempted to do). There is some thread management code and a thread_queue, but as far as I can tell it is only used by the magma_Xtrevc3_mt() function.

There is also magma_set_lapack_numthreads(), which applies if MAGMA is using an OpenMP or otherwise threaded CPU-based LAPACK function, or OpenMP directly (only in the Xlaex3 function, as far as I can tell).

I have some concerns about stream synchronization, as I'm not sure how this works in the MAGMA LAPACK functions. For example, take zheevd_gpu(). This constructs a queue and uses it for some BLAS calls, for example:

magma_queue_t queue;
magma_device_t cdev;
magma_getdevice( &cdev );
magma_queue_create( cdev, &queue );
// ...
magmablas_zlascl( uplo, 0, 0, 1., sigma, n, n, dA, ldda, queue, info );

This should run asynchronously, using the newly created queue. But a bit later, it calls magma_zhetrd_gpu(). This is a LAPACK function that constructs another queue and calls some more BLAS functions. How is the CUDA stream used by zlascl() synchronized with respect to the stream used by magma_zhetrd_gpu()? I can't see any explicit synchronization, so maybe there is some implicit synchronization, e.g., does magma_queue_destroy() synchronize the stream? I don't see how. I think early versions of CUDA synchronized the stream when a stream was destroyed, but the current documentation says "In case the device is still doing work in the stream stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with stream will be released automatically once the device has completed all work in stream."

Some of the LAPACK functions also allocate GPU memory internally with magma_Xmalloc(), which calls cudaMalloc() internally. cudaMalloc() synchronizes all streams running on the GPU, so this is a possible way that the streams above could end up synchronized, although I can't see where the allocation happens if that is the case. Anyway, it would be really nice if there were a way to avoid these device synchronization points, e.g., by requiring the caller to supply the memory buffer, or by using a sub-allocator. I'll be running the codes on a V100 machine with many CPU cores, and I'm hoping to get at least a few simultaneous streams throughout the code.
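
To make the sub-allocator idea concrete, here is a minimal sketch of what I have in mind. It is not part of MAGMA: the class name BumpArena and its methods are hypothetical. It treats the base pointer as an opaque address, so in a real code the buffer would presumably come from a single cudaMalloc() at startup, and the 256-byte default alignment is only an assumption chosen to be conservative for device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump (arena) sub-allocator, not a MAGMA or CUDA API.
// It hands out aligned slices of one buffer that the caller obtained up
// front, so no allocation call is issued while work is in flight.
// The buffer is never dereferenced here; `base` is only used as an address.
class BumpArena {
public:
    BumpArena(void* base, std::size_t capacity)
        : base_(reinterpret_cast<std::uintptr_t>(base)),
          capacity_(capacity),
          used_(0) {}

    // Returns an aligned pointer inside the buffer, or nullptr if the
    // request does not fit. `alignment` must be a power of two.
    void* allocate(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        // Round the next free address up to the requested alignment.
        const std::uintptr_t start = (base_ + used_ + mask) & ~mask;
        const std::size_t offset = static_cast<std::size_t>(start - base_);
        if (offset > capacity_ || bytes > capacity_ - offset) {
            return nullptr;
        }
        used_ = offset + bytes;
        return reinterpret_cast<void*>(start);
    }

    // Releases every slice at once. The caller is responsible for making
    // sure all work that uses the slices has completed before calling this.
    void reset() { used_ = 0; }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return used_; }

private:
    std::uintptr_t base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

Workspace requests would then go through arena.allocate() instead of an allocation call, with one reset() once the whole batch has finished. Individual slices cannot be freed, which seems acceptable for per-iteration scratch space.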

Re: How to use magma with GPU and multi-core CPU systems

Posted: Sat Oct 14, 2017 3:09 am
by mgates3
Yes, MAGMA functions for dense matrices are generally hybrid CPU/GPU functions. In most cases they use a single main CPU thread and rely on multi-threaded BLAS on the CPU, e.g., Intel MKL or OpenBLAS. So just set the OMP_NUM_THREADS environment variable as appropriate. E.g.,

./testing_dgesvd -n 10000

If you do async operations before calling another magma function, it's probably a good idea to sync the queue before, to ensure that it finishes. E.g.

Code:
    magmablas_zlascl( ..., queue, info );
    magma_queue_sync( queue );
    magma_zhetrd_gpu( ... );

As you note, memory allocation and deallocation within magma_zhetrd_gpu would possibly sync implicitly. In most cases, hybrid magma_* functions are synchronous, due to data transfers between the CPU and GPU, while magmablas_* functions that take a queue are async.
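
Since the hybrid functions are synchronous, wrapping one in a future should be enough to know when it has finished: the result is ready when the call returns. A rough sketch of that pattern follows, assuming one task per matrix. It contains no MAGMA calls: blocking_solver_stub is a hypothetical stand-in for a blocking driver, and solve_all_and_merge is an illustrative name, not an existing API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for a blocking hybrid driver (NOT a MAGMA call).
// The input is the diagonal of a diagonal matrix, whose eigenvalues are
// simply its diagonal entries, so they are returned unchanged. The point
// is only that the result is complete once the function returns.
std::vector<double> blocking_solver_stub(std::vector<double> diagonal) {
    return diagonal;
}

// Launches one asynchronous task per problem, waits for all of them, then
// merges and sorts the results into a single ascending list.
std::vector<double> solve_all_and_merge(
        const std::vector<std::vector<double>>& problems) {
    std::vector<std::future<std::vector<double>>> pending;
    pending.reserve(problems.size());
    std::size_t total = 0;
    for (const auto& p : problems) {
        total += p.size();
        // std::launch::async forces the call to run in its own thread.
        pending.push_back(
            std::async(std::launch::async, blocking_solver_stub, p));
    }

    std::vector<double> merged;
    merged.reserve(total);
    for (auto& f : pending) {
        // get() blocks until that task's call has returned.
        std::vector<double> values = f.get();
        merged.insert(merged.end(), values.begin(), values.end());
    }
    assert(merged.size() == total);

    std::sort(merged.begin(), merged.end());
    return merged;
}
```

This starts one thread per matrix, which is only reasonable for a handful of problems. A real code would probably cap the number of concurrent tasks, e.g. with a thread pool.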

Not sure how well running multiple simultaneous factorizations will work.