### *_gpu versions of the gesvd routines

Posted:

**Wed Jul 17, 2019 2:59 am**I have some code which is performing an svd calculation followed by a least squares calculation. I don't want to have to transfer the arrays twice. Is there some reason why the svd routines don't offer *_gpu versions so that I can call two functions sequentially, prefaced by just one device->gpu copy?

For context, the code I'm working with looks like this (in python, using scikit-cuda) ... I'm trying to reduce overheads on this and ultimately, I would like to use Cuda streams to hide the latency all together (since this code gets called repeatedly):

s = np.zeros(min(m, n), np.float32)

u = np.zeros((m, m), np.float32)

vh = np.zeros((n, n), np.float32)

# Compute singular-value decomposition

a1_wrk = np.copy(a1, order='F')

lwork = magma.magma_sgesvd_buffersize('A', 'A', m, n, a1_wrk.ctypes.data, m, s.ctypes.data, u.ctypes.data, m,vh.ctypes.data, n)

hwork = np.zeros(lwork, np.float32, order='F')

magma.magma_sgesvd('A', 'A', m, n, a1_wrk.ctypes.data, m, s.ctypes.data, u.ctypes.data, m, vh.ctypes.data, n, hwork.ctypes.data, lwork)

# Note, the use of i>10^-3 here; this is meant to select

# values that are effectively non-zero. Results will depend

# somewhat on the choice for this value

rank = sum(1 for i in s if i > 1e-3)

# Query and allocate optimal workspace

# n.b.: - the result for 'x' gets written to the input vector for b, so we just label b->x

# - assume magma variables lda=ldb=m throughout here

x = np.copy(b1, order='F')

lwork = magma.magma_sgels_buffersize('n', m, n, nrhs, a1_wrk.ctypes.data, m, x.ctypes.data, m)

hwork = np.zeros(lwork, np.float32)

# Run solver

a1_wrk = np.copy(a1, order='F')

a1_gpu = gpuarray.to_gpu(a1_wrk)

x_gpu = gpuarray.to_gpu(x)

magma.magma_sgels_gpu('n', m, n, nrhs, a1_gpu.gpudata, m, x_gpu.gpudata, m, hwork.ctypes.data, lwork)

x = x_gpu.get()

For context, the code I'm working with looks like this (in python, using scikit-cuda) ... I'm trying to reduce overheads on this and ultimately, I would like to use Cuda streams to hide the latency all together (since this code gets called repeatedly):

s = np.zeros(min(m, n), np.float32)

u = np.zeros((m, m), np.float32)

vh = np.zeros((n, n), np.float32)

# Compute singular-value decomposition

a1_wrk = np.copy(a1, order='F')

lwork = magma.magma_sgesvd_buffersize('A', 'A', m, n, a1_wrk.ctypes.data, m, s.ctypes.data, u.ctypes.data, m,vh.ctypes.data, n)

hwork = np.zeros(lwork, np.float32, order='F')

magma.magma_sgesvd('A', 'A', m, n, a1_wrk.ctypes.data, m, s.ctypes.data, u.ctypes.data, m, vh.ctypes.data, n, hwork.ctypes.data, lwork)

# Note, the use of i>10^-3 here; this is meant to select

# values that are effectively non-zero. Results will depend

# somewhat on the choice for this value

rank = sum(1 for i in s if i > 1e-3)

# Query and allocate optimal workspace

# n.b.: - the result for 'x' gets written to the input vector for b, so we just label b->x

# - assume magma variables lda=ldb=m throughout here

x = np.copy(b1, order='F')

lwork = magma.magma_sgels_buffersize('n', m, n, nrhs, a1_wrk.ctypes.data, m, x.ctypes.data, m)

hwork = np.zeros(lwork, np.float32)

# Run solver

a1_wrk = np.copy(a1, order='F')

a1_gpu = gpuarray.to_gpu(a1_wrk)

x_gpu = gpuarray.to_gpu(x)

magma.magma_sgels_gpu('n', m, n, nrhs, a1_gpu.gpudata, m, x_gpu.gpudata, m, hwork.ctypes.data, lwork)

x = x_gpu.get()