## Problem using magma_queue's

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
Noran
Posts: 27
Joined: Fri Sep 26, 2014 3:47 am

### Problem using magma_queue's

Hi @all,
i were very happy when I saw, that the cudaStreams have been implemented. But I've got the following problem: We have a linear system A * AB = B, where A has 400 x 400 Elements and B 400 x 10. Both are sparse matrices and we want to calculate AB. I decided to test your gmres algorithm with this Problem. If I don't call magma_queue_create all is fine. But when I call it, gmres needs more than 30s and produces a huge error in the result Matrix within the range of e^-1 and e^0. But it doesn't throw any error, either in the magma_queue_create nor in magma_sgmres. I think i'm doing something wrong but i don't know what. Can somebody help me, please? A short example code stands below.

Thank you in advance and greetings
Noran

Code: Select all

``````int
myFunction(	magma_int_t rowsMatrixA,
magma_int_t colsMatrixA,
magma_index_t *rowPointerA,
magma_index_t *colIndA,
float *valA,
magma_int_t colsMatrixB,
float *valB,
float *valAB)
{
int nonZeroElementsA = rowPointerA[rowsMatrixA];

int elementsB = colsMatrixA * colsMatrixB;
int elementsAB = rowsMatrixA * colsMatrixB;
magma_s_sparse_matrix d_A;
magma_index_t *d_rowPointerA;
magma_index_t *d_colIndA;
magma_s_vector d_b;
magma_s_vector d_ab;
magma_s_solver_par solverPar;

DEBUG_LOG("initializing magma_queues");
magma_queue_t queue;
magma_queue_create(&queue);

DEBUG_LOG("Allocating GPU Memory...");
float *d_valA;
float *d_valB;
float *d_valAB;

magma_index_malloc (&d_rowPointerA, rowsMatrixA + 1);
magma_index_malloc (&d_colIndA, nonZeroElementsA);
magma_smalloc (&d_valA, nonZeroElementsA);
magma_smalloc (&d_valB, elementsB);
magma_smalloc (&d_valAB, elementsAB);

magma_index_setvector(	rowsMatrixA + 1,
rowPointerA,
1,
d_rowPointerA,
1);
magma_index_setvector(nonZeroElementsA, colIndA, 1, d_colIndA, 1);
magma_ssetvector(nonZeroElementsA, valA, 1, d_valA, 1);
magma_ssetmatrix(	colsMatrixA,
colsMatrixB,
valB,
colsMatrixA,
d_valB,
colsMatrixA);

magma_ssetmatrix(	rowsMatrixA,
colsMatrixB,
valAB,
rowsMatrixA,
d_valAB,
rowsMatrixA);

DEBUG_LOG("Initializing matrix and vectors");
d_A.col = d_colIndA;
d_A.row = d_rowPointerA;
d_A.val = d_valA;
d_A.memory_location = Magma_DEV;
d_A.storage_type = Magma_CSR;
d_A.nnz = nonZeroElementsA;
d_A.num_rows = rowsMatrixA;
d_A.num_cols = colsMatrixA;

d_b.memory_location = Magma_DEV;
d_b.nnz = colsMatrixA;
d_b.num_rows = colsMatrixA;
d_b.val = d_valB;

d_ab.memory_location = Magma_DEV;
d_ab.nnz = rowsMatrixA;
d_ab.num_rows = rowsMatrixA;
d_ab.val = d_valAB;

DEBUG_LOG("Initializing solver_par");
solverPar.solver = Magma_GMRES;
solverPar.epsilon = 10e-8;
solverPar.maxiter = 1000;
solverPar.restart = 40;
solverPar.version = 0;
solverPar.verbose = 0;
solverPar.num_eigenvalues = 0;
solverPar.res_vec = (double*) (malloc (sizeof(double)));

DEBUG_LOG("Running solver for each column of B...");
for (int col = 0; col < colsMatrixB; col++)
{
magma_sgmres (d_A, d_b, &d_ab, &solverPar, queue);
d_b.val = d_b.val + colsMatrixA;
d_ab.val = d_ab.val + rowsMatrixA;
}

magma_sgetmatrix(	rowsMatrixA,
colsMatrixB,
d_valAB,
rowsMatrixA,
valAB,
rowsMatrixA);

DEBUG_LOG("Destroy magma_queue_t queue");
cudaStreamDestroy (queue);*/

DEBUG_LOG("Free memory");
magma_free(d_valA);
magma_free(d_valB);
magma_free(d_valAB);
magma_free(d_colIndA);
magma_free(d_rowPointerA);

return error;
}
``````

hartwig anzt
Posts: 90
Joined: Tue Sep 02, 2014 5:44 pm

### Re: Problem using magma_queue's

Noran,

I am not sure I understand your problem, I assume you want to solve AX=B where A is 400x400 and B 400x10.

I looked into your code. If GMRES is converging very slow, maybe use a preconditioner? another restart parameter? Or (in case SPD) CG instead of GMRES? I modified your code a bit, as there are functions that make the coding more conveniant, see below. This at least works for me...

Hartwig

int
myFunction(
magma_int_t rowsMatrixA,
magma_int_t colsMatrixA,
magma_index_t *rowPointerA,
magma_index_t *colIndA,
float *valA,
magma_int_t colsMatrixB,
float *valB,
float *valAB )
{
// create queue
magma_queue_t queue;
magma_queue_create(&queue);

float zero, one;

zero = 0.0;
one = 1.0;

magma_s_sparse_matrix A, dA;
magma_s_vector b, db, x, dx;

// set the matrix as MAGMA type
magma_scsrset( rowsMatrixA, colsMatrixA,
rowPointerA, colIndA, valA,
&A, queue );

// set the RHS as MAGMA type
magma_svset( rowsMatrixA, colsMatrixB,
valb, &b, queue );

// copy the matrix to the GPU
magma_s_mtransfer( A, &dA, Magma_CPU, Magma_DEV, queue );

// copy vector to GPU
magma_s_vtransfer( b, &db, Magma_CPU, Magma_DEV, queue );

// init the initial guess for x as zero
magma_s_vinit( &dx, Magma_DEV, dA.num_cols, zero, queue );

// set solver parameters
magma_sopts zopts;
zopts.solver_par.solver = Magma_GMRES;
zopts.solver_par.epsilon = 10e-8;
zopts.solver_par.maxiter = 1000;
zopts.solver_par.restart = 40;
zopts.solver_par.version = 0;
zopts.solver_par.verbose = 0;
zopts.solver_par.num_eigenvalues = 0;
zopts.solver_par.eigenvectors = NULL;
zopts.solver_par.eigenvalues = NULL;

// initialize solver setup
magma_ssolverinfo_init( &zopts.solver_par, &zopts.precond_par, queue );

// solve on the GPU
magma_sgmres (dA, db, &dx, &zopts.solver_par, queue);

// show solver info
magma_ssolverinfo( &zopts.solver_par, &zopts.precond_par, queue );

// copy solution back to CPU
magma_s_vtransfer( dx, &x, Magma_DEV, Magma_CPU, queue );

// pass back results
magma_svget( x, &xrows, &xcols, &valAB, queue );

// free memory
magma_s_mfree( &A, queue );
magma_s_mfree( &dA, queue );
magma_s_vfree( &dx, queue );
magma_s_vfree( &x, queue );
magma_s_vfree( &db, queue );
magma_s_vfree( &b, queue );

// free solverinfo
magma_ssolverinfo_free( &zopts.solver_par, &zopts.precond_par, queue );

// destroy queue
magma_queue_destroy( queue );

return error;
}

Noran
Posts: 27
Joined: Fri Sep 26, 2014 3:47 am

### Re: Problem using magma_queue's

Hi Hartwig,
thank you for you're fast answer. I see the changes and will try it tomorow.
Our goal ist, to parallelize the calculation of 10 times matrix vector A * ab_i = b_i. Where A ist every time the same Matrix 400x400 And b_i are columns from the matrix B and ab_i are columns from the matrix b. For that i found your magma_queue respective cudaStreams. The cruel thing is, that when i run it serial on the queue 0 it looks good: less than 200ms and residuals (A*ab_i - b_i) below e^-6. But if i run it in that for-loop and different queues (or one initialized queue) i got a slowmotion with 30 seconds and residuals beyond good and evil.

Greetz from good old Germany
Noran

Noran
Posts: 27
Joined: Fri Sep 26, 2014 3:47 am

### Re: Problem using magma_queue's

Hi Hartwig,
all worked well, with some small changes:
1. I put the transferring, calculation and backtransferring in a for loop, to get the full Matrix.
2. for the matrices i changed from magma_s_vector to magma_s_vector[colsMatrixB]
3. I need the result in the CPU memory at the address float* valAB, therefore i changed from magma_s_vtransfer to a little change (see below). With tha code the result is copied directly from the GPU to the place i get from the calling procedure. Btw. our aim is it to call the code from a python function.
4. create a varoius number of queues

After that i got to the point i wanted to reach :)

Greetz and a really big THANK YOU for your help
Noran

at point 3

Code: Select all

``````DEBUG_LOG("Get Matrix from GPU");
ab[i].memory_location = Magma_CPU;
ab[i].num_rows = d_ab[i].num_rows;
ab[i].num_cols = d_ab[i].num_cols;
ab[i].nnz = d_ab[i].nnz;
ab[i].major = d_ab[i].major;
ab[i].val = valAB + i*ab[i].num_rows;

magma_sgetvector(	d_ab[i].num_rows,
d_ab[i].val,
1,
ab[i].val,
1);
``````

hartwig anzt
Posts: 90
Joined: Tue Sep 02, 2014 5:44 pm

### Re: Problem using magma_queue's

Noran,

very good to hear it's working now. Maybe think about using a preconditioned GMRES to accelerate ( i.e. ILU-GMRES)

Hartwig