It looks like the dsygvd code needs workspace for the matrix A on both the CPU and the GPU, while matrix B is needed only on the GPU. Since all the underlying routines called from dsygvd have GPU interfaces, it should be straightforward to implement a GPU interface: add dA, ldda as arguments, and change the arguments B, ldb (on the CPU) to dB, lddb (on the GPU). We'll keep it in mind for future releases, but if you need it now, hopefully that gives you some pointers about what to modify.
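To make the argument change concrete, here is a rough sketch of what such a prototype could look like. Nothing in it is an existing MAGMA routine: the name sketch_dsygvd_gpu, the stand-in integer type, and the two helpers are made up for illustration, and the argument order is only a guess modeled on the description above. Padding the leading dimension to a multiple of 32 is also an assumption, not something dsygvd is known to require.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in type so the sketch compiles without MAGMA headers.
   Real code would include magma_v2.h and use magma_int_t,
   magma_vec_t and magma_uplo_t instead. */
typedef int sketch_int_t;

/* HYPOTHETICAL GPU-interface prototype (not part of MAGMA).
   Relative to the CPU interface: dA, ldda are added, A, lda stay as
   CPU workspace, and B, ldb are replaced by dB, lddb. */
sketch_int_t sketch_dsygvd_gpu(
    sketch_int_t itype, char jobz, char uplo, sketch_int_t n,
    double *dA, sketch_int_t ldda,   /* new: A resident on the GPU      */
    double *A,  sketch_int_t lda,    /* CPU workspace for A             */
    double *dB, sketch_int_t lddb,   /* was B, ldb on the CPU           */
    double *w,                       /* eigenvalues, on the CPU         */
    double *work,  sketch_int_t lwork,
    sketch_int_t *iwork, sketch_int_t liwork,
    sketch_int_t *info);

/* Round a leading dimension up to a multiple of 32 (assumed padding). */
sketch_int_t padded_ld(sketch_int_t n)
{
    assert(n >= 0);
    return ((n + 31) / 32) * 32;
}

/* Bytes the caller would allocate once on the GPU for dA and dB
   together, if both use the padded leading dimension. */
size_t device_bytes(sketch_int_t n)
{
    size_t ld = (size_t) padded_ld(n);
    return (size_t) 2 * ld * (size_t) n * sizeof(double);
}
```

With an interface along these lines, the caller would allocate dA and dB once before the MD loop (device_bytes(n) in total under the assumptions above) and reuse them every step, so only the copies remain inside the loop.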
You said a few ms are spent each MD step on allocating and copying the matrix. How long does the dsygvd call itself take? That is, what percentage of the time is spent on allocating memory? I would be a little surprised if allocation were a large overhead, but copying the A and B matrices from the CPU to the GPU might be expensive. However, unless you can generate the matrices on the GPU, or overlap the copy with other computation, you have to pay that cost at some point. (I assume this is just the alloc and copy in dsygvd itself. The underlying routines also do some allocs and copies, but smaller ones.)
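One way to answer the percentage question is to time the three phases separately and compare them. The sketch below is only an illustration: time_phase and overhead_percent are hypothetical helper names, and the phase callbacks are placeholders for your own alloc, copy, and dsygvd calls. Since GPU operations can return before the work has finished, each callback should synchronize with the device before returning, otherwise the wall-clock numbers will be misleading.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, using the C11 timespec_get call. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double) ts.tv_sec + 1e-9 * (double) ts.tv_nsec;
}

/* A phase is any callback, e.g. "allocate", "copy A and B", "solve".
   The callback must synchronize with the GPU before it returns. */
typedef void (*phase_fn)(void *ctx);

/* Run one phase and return its elapsed wall-clock time in seconds. */
double time_phase(phase_fn fn, void *ctx)
{
    double t0 = now_seconds();
    fn(ctx);
    return now_seconds() - t0;
}

/* Share of the total time (in percent) that goes to allocation and
   copying rather than to the eigensolver itself. */
double overhead_percent(double t_alloc, double t_copy, double t_solve)
{
    double total = t_alloc + t_copy + t_solve;
    assert(total > 0.0);
    return 100.0 * (t_alloc + t_copy) / total;
}
```

For example, if allocation takes 0.5 ms, the copies take 1.5 ms, and the solve takes 8 ms, overhead_percent(0.5, 1.5, 8.0) gives 20, meaning a fifth of the time goes to setup. Averaging over many MD steps would give steadier numbers than a single measurement.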