Page 1 of 1

Problem using magma_queue's

Posted: Tue Nov 25, 2014 7:36 am
by Noran
Hi @all,
i were very happy when I saw, that the cudaStreams have been implemented. But I've got the following problem: We have a linear system A * AB = B, where A has 400 x 400 Elements and B 400 x 10. Both are sparse matrices and we want to calculate AB. I decided to test your gmres algorithm with this Problem. If I don't call magma_queue_create all is fine. But when I call it, gmres needs more than 30s and produces a huge error in the result Matrix within the range of e^-1 and e^0. But it doesn't throw any error, either in the magma_queue_create nor in magma_sgmres. I think i'm doing something wrong but i don't know what. Can somebody help me, please? A short example code stands below.

Thank you in advance and greetings

Code: Select all

myFunction(	magma_int_t rowsMatrixA,
				magma_int_t colsMatrixA,
				magma_index_t *rowPointerA,
				magma_index_t *colIndA,
				float *valA,
				magma_int_t colsMatrixB,
				float *valB,
				float *valAB)
	int nonZeroElementsA = rowPointerA[rowsMatrixA];

	int elementsB = colsMatrixA * colsMatrixB;
	int elementsAB = rowsMatrixA * colsMatrixB;
	magma_s_sparse_matrix d_A;
	magma_index_t *d_rowPointerA;
	magma_index_t *d_colIndA;
	magma_s_vector d_b;
	magma_s_vector d_ab;
	magma_s_solver_par solverPar;

	DEBUG_LOG("initializing magma_queues");
	magma_queue_t queue;

	DEBUG_LOG("Allocating GPU Memory...");
	float *d_valA;
	float *d_valB;
	float *d_valAB;

	magma_index_malloc (&d_rowPointerA, rowsMatrixA + 1);
	magma_index_malloc (&d_colIndA, nonZeroElementsA);
	magma_smalloc (&d_valA, nonZeroElementsA);
	magma_smalloc (&d_valB, elementsB);
	magma_smalloc (&d_valAB, elementsAB);
	magma_index_setvector(	rowsMatrixA + 1,
	magma_index_setvector(nonZeroElementsA, colIndA, 1, d_colIndA, 1);
	magma_ssetvector(nonZeroElementsA, valA, 1, d_valA, 1);
	magma_ssetmatrix(	colsMatrixA,

	magma_ssetmatrix(	rowsMatrixA,

	DEBUG_LOG("Initializing matrix and vectors");
	d_A.col = d_colIndA;
	d_A.row = d_rowPointerA;
	d_A.val = d_valA;
	d_A.memory_location = Magma_DEV;
	d_A.storage_type = Magma_CSR;
	d_A.nnz = nonZeroElementsA;
	d_A.num_rows = rowsMatrixA;
	d_A.num_cols = colsMatrixA;

	d_b.memory_location = Magma_DEV;
	d_b.nnz = colsMatrixA;
	d_b.num_rows = colsMatrixA;
	d_b.val = d_valB;

	d_ab.memory_location = Magma_DEV;
	d_ab.nnz = rowsMatrixA;
	d_ab.num_rows = rowsMatrixA;
	d_ab.val = d_valAB;

	DEBUG_LOG("Initializing solver_par");
	solverPar.solver = Magma_GMRES;
	solverPar.epsilon = 10e-8;
	solverPar.maxiter = 1000;
	solverPar.restart = 40;
	solverPar.version = 0;
	solverPar.verbose = 0;
	solverPar.num_eigenvalues = 0;
	solverPar.res_vec = (double*) (malloc (sizeof(double)));

	DEBUG_LOG("Running solver for each column of B...");
	for (int col = 0; col < colsMatrixB; col++)
		magma_sgmres (d_A, d_b, &d_ab, &solverPar, queue);
		d_b.val = d_b.val + colsMatrixA;
		d_ab.val = d_ab.val + rowsMatrixA;

	magma_sgetmatrix(	rowsMatrixA,

	DEBUG_LOG("Destroy magma_queue_t queue");
	cudaStreamDestroy (queue);*/

	DEBUG_LOG("Free memory");

	return error;

Re: Problem using magma_queue's

Posted: Tue Nov 25, 2014 12:16 pm
by hartwig anzt

I am not sure I understand your problem, I assume you want to solve AX=B where A is 400x400 and B 400x10.

I looked into your code. If GMRES is converging very slow, maybe use a preconditioner? another restart parameter? Or (in case SPD) CG instead of GMRES? I modified your code a bit, as there are functions that make the coding more conveniant, see below. This at least works for me...


magma_int_t rowsMatrixA,
magma_int_t colsMatrixA,
magma_index_t *rowPointerA,
magma_index_t *colIndA,
float *valA,
magma_int_t colsMatrixB,
float *valB,
float *valAB )
// create queue
magma_queue_t queue;

float zero, one;

zero = 0.0;
one = 1.0;

magma_s_sparse_matrix A, dA;
magma_s_vector b, db, x, dx;

// set the matrix as MAGMA type
magma_scsrset( rowsMatrixA, colsMatrixA,
rowPointerA, colIndA, valA,
&A, queue );

// set the RHS as MAGMA type
magma_svset( rowsMatrixA, colsMatrixB,
valb, &b, queue );

// copy the matrix to the GPU
magma_s_mtransfer( A, &dA, Magma_CPU, Magma_DEV, queue );

// copy vector to GPU
magma_s_vtransfer( b, &db, Magma_CPU, Magma_DEV, queue );

// init the initial guess for x as zero
magma_s_vinit( &dx, Magma_DEV, dA.num_cols, zero, queue );

// set solver parameters
magma_sopts zopts;
zopts.solver_par.solver = Magma_GMRES;
zopts.solver_par.epsilon = 10e-8;
zopts.solver_par.maxiter = 1000;
zopts.solver_par.restart = 40;
zopts.solver_par.version = 0;
zopts.solver_par.verbose = 0;
zopts.solver_par.num_eigenvalues = 0;
zopts.solver_par.eigenvectors = NULL;
zopts.solver_par.eigenvalues = NULL;

// initialize solver setup
magma_ssolverinfo_init( &zopts.solver_par, &zopts.precond_par, queue );

// solve on the GPU
magma_sgmres (dA, db, &dx, &zopts.solver_par, queue);

// show solver info
magma_ssolverinfo( &zopts.solver_par, &zopts.precond_par, queue );

// copy solution back to CPU
magma_s_vtransfer( dx, &x, Magma_DEV, Magma_CPU, queue );

// pass back results
magma_svget( x, &xrows, &xcols, &valAB, queue );

// free memory
magma_s_mfree( &A, queue );
magma_s_mfree( &dA, queue );
magma_s_vfree( &dx, queue );
magma_s_vfree( &x, queue );
magma_s_vfree( &db, queue );
magma_s_vfree( &b, queue );

// free solverinfo
magma_ssolverinfo_free( &zopts.solver_par, &zopts.precond_par, queue );

// destroy queue
magma_queue_destroy( queue );

return error;

Re: Problem using magma_queue's

Posted: Tue Nov 25, 2014 3:30 pm
by Noran
Hi Hartwig,
thank you for you're fast answer. I see the changes and will try it tomorow.
Our goal ist, to parallelize the calculation of 10 times matrix vector A * ab_i = b_i. Where A ist every time the same Matrix 400x400 And b_i are columns from the matrix B and ab_i are columns from the matrix b. For that i found your magma_queue respective cudaStreams. The cruel thing is, that when i run it serial on the queue 0 it looks good: less than 200ms and residuals (A*ab_i - b_i) below e^-6. But if i run it in that for-loop and different queues (or one initialized queue) i got a slowmotion with 30 seconds and residuals beyond good and evil.

Greetz from good old Germany

Re: Problem using magma_queue's

Posted: Wed Nov 26, 2014 4:53 am
by Noran
Hi Hartwig,
all worked well, with some small changes:
1. I put the transferring, calculation and backtransferring in a for loop, to get the full Matrix.
2. for the matrices i changed from magma_s_vector to magma_s_vector[colsMatrixB]
3. I need the result in the CPU memory at the address float* valAB, therefore i changed from magma_s_vtransfer to a little change (see below). With tha code the result is copied directly from the GPU to the place i get from the calling procedure. Btw. our aim is it to call the code from a python function.
4. create a varoius number of queues

After that i got to the point i wanted to reach :)

Greetz and a really big THANK YOU for your help

at point 3

Code: Select all

DEBUG_LOG("Get Matrix from GPU");
		ab[i].memory_location = Magma_CPU;
		ab[i].num_rows = d_ab[i].num_rows;
		ab[i].num_cols = d_ab[i].num_cols;
		ab[i].nnz = d_ab[i].nnz;
		ab[i].major = d_ab[i].major;
		ab[i].val = valAB + i*ab[i].num_rows;

		magma_sgetvector(	d_ab[i].num_rows,

Re: Problem using magma_queue's

Posted: Wed Nov 26, 2014 9:31 am
by hartwig anzt

very good to hear it's working now. Maybe think about using a preconditioned GMRES to accelerate ( i.e. ILU-GMRES)