So I have an N x N matrix A on device memory where I need update only the on-diagonal elements of A (e.g. elements of row 0 col 0, row 1 col1, row2 col2.....) before feeding into magma sgesv solver. Currently I have a simple kernel to do this:

- Code: Select all
`__global__ void d_update_matrix(float * A, float * beta, int total, int offset)`

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i<total)

{

//only update the on-diagonal elements for A

A[i*offset] += beta[i];

}

}

where offset = N+1.

How would these indices for the on-diagonal elements correspond in d_A of size ((N+32)*(N+32) + 32*maxnb + lwork+2*maxnb*maxnb)? I tried to update the on-diagonal elements my matrix A then copy to d_A using cudamemcpyDeviceToDevice but it does not seem to work. So I would like to copy the elements of A to d_A and update d_A directly but i cannot figure out how the indexing would correspond in d_A. When I print out ((N+32)*(N+32) + 32*maxnb + lwork+2*maxnb*maxnb) it gives me 25937, where N = 73.

I am in the process of comparing the performance of MAGMA, CULA (commercial), and CUSP (sparse matrix solver being developed by people from NVIDIA). So far MAGMA seems to have the best results, but if I can figure this out I would be able to present complete results.

Thanks!