IDR performance?

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

IDR performance?

Postby bnas » Tue Jun 14, 2016 4:14 am

Hey all,

Awesome job :) I have recently tested the latest MAGMA IDR (all three versions of it). I am solving a square non-SPD system with a Jacobi preconditioner. The same system is solved very fast by MAGMA's BiCGSTAB with the same preconditioner, but the IDR implementation is extremely slow in comparison (about 10 times slower). Preliminary profiling showed that there is a lot of HtoD copying going on; I suppose it's a hybrid CPU/GPU version? It's a shame, because MAGMA implements the biortho IDR variant (the latest version of IDR, which shows great potential) that also exists in NVIDIA's AMGX. Am I doing something wrong, or is it truly a CPU/GPU hybrid solver?
bnas
 
Posts: 8
Joined: Tue Jun 14, 2016 4:06 am

Re: IDR performance?

Postby hartwig anzt » Tue Jun 14, 2016 11:11 am

The IDR in AMGX is surely very optimized. However, I am surprised that the MAGMA version is that slow, in particular since it has been shown to run close to the memory bandwidth limit: http://hpc.sagepub.com/content/early/20 ... 4.abstract
In any case, there should not be much GPU-host communication.

Can you please provide me with some more details? What kind of problem are you targeting? What are the size and number of nonzeros? What shadow space dimension are you using in IDR? Ideally, you could provide me with the problem itself, so I can take a closer look.

Thanks, Hartwig
hartwig anzt
 
Posts: 75
Joined: Tue Sep 02, 2014 5:44 pm

Re: IDR performance?

Postby bnas » Wed Jun 15, 2016 3:13 am

First of all, thanks for the prompt reply. I am using a non-SPD matrix with M: 314728, N: 314728, nnz: 2134992. I am invoking IDR through the
Code: Select all
magma_d_solver( )
call with
Code: Select all
zopts.solver_par.solver = Magma_PIDRMERGE;
zopts.precond_par.solver = Magma_JACOBI;
zopts.solver_par.maxiter = 100;
zopts.solver_par.rtol = 1e-8;   


Since I think the shadow space dimension s can only be set at compile time (meaning it's a hardcoded parameter and not a function argument), I recompiled didr_strms.cpp with s=4, and I have also tried the initial default s=1. I wouldn't be posting if I didn't see those weird HtoD copies during profiling. I am attaching my whole implementation so you can check for possible errors on my side.
Code: Select all
// includes, system
// omitted for space
// includes, project
int main( int argc, char** argv )
{
    magma_init();
    magma_int_t info = 0;
    magma_dopts zopts;
    magma_queue_t queue = NULL;
    magma_queue_create( 0, &queue );

    double one = MAGMA_D_MAKE(1.0, 0.0);

    magma_d_matrix x={Magma_DENSE}, b_h={Magma_DENSE}, b={Magma_DENSE};
    magma_d_matrix A={Magma_CSR}, A_d={Magma_CSR};
    int i = 1;

    // read the system matrix (Matrix Market file) from the first argument
    magma_d_csr_mtx( &A, argv[i], queue );

    printf( "\n%% matrix info: %d-by-%d with %d nonzeros\n\n",
            int(A.num_rows), int(A.num_cols), int(A.nnz) );

    printf("matrixinfo = [ \n");
    printf("%%   size   (m x n)     ||   nonzeros (nnz)   ||   nnz/m \n");
    printf("%%=============================================================%%\n");
    printf("  %8d  %8d      %10d        %10d\n",
            int(A.num_rows), int(A.num_cols), int(A.nnz), int(A.nnz/A.num_rows) );
    printf("%%=============================================================%%\n");
    printf("];\n");

    // read the right-hand side from the second argument, initialize x on the
    // device, and transfer b and A to the device
    magma_dvread( &b_h, A.num_cols, argv[i+1], queue );
    magma_dvinit( &x, Magma_DEV, A.num_cols, 1, one, queue );
    magma_d_vtransfer( b_h, &b, Magma_CPU, Magma_DEV, queue );
    magma_dmtransfer( A, &A_d, Magma_CPU, Magma_DEV, queue );

    // solver configuration: preconditioned IDR (merged kernels) with Jacobi
    zopts.solver_par.solver  = Magma_PIDRMERGE;
    zopts.precond_par.solver = Magma_JACOBI;
    zopts.solver_par.maxiter = 150;
    zopts.solver_par.rtol    = 1e-8;

    magma_dsolverinfo_init( &zopts.solver_par, &zopts.precond_par, queue );
    magma_d_precondsetup( A_d, b, &zopts.solver_par, &zopts.precond_par, queue );
    info = magma_d_solver( A_d, b, &x, &zopts, queue );
    magma_dsolverinfo( &zopts.solver_par, &zopts.precond_par, queue );

    std::cout << "\n" << std::endl;
    std::cout << info;
    std::cout << "\n" << std::endl;

    magma_finalize();
    return 0;
}


So, running from bash, the first argument is matrix A and the second is vector b. I apply the preconditioner with
Code: Select all
magma_d_precondsetup( A_d, b, &zopts.solver_par, &zopts.precond_par, queue );
Since I am very new to MAGMA, could you pinpoint any errors on my side? I feel that the preconditioning should have an apply step as well, after the precondsetup, and I also think I should be using the idr_strms() function directly instead of the wrapped magma_d_solver(). Could you suggest a better way? Thanks a lot!
bnas
 
Posts: 8
Joined: Tue Jun 14, 2016 4:06 am

Re: IDR performance?

Postby hartwig anzt » Wed Jun 15, 2016 9:10 am

Thanks for the details!

First of all, the shadow space dimension can be controlled via the parameter ''solver_par.restart'', so there is no need to recompile the library; in fact, you should not have to touch the source files at all. In your run file, if you want a shadow space dimension of 4, use
zopts.solver_par.restart = 4;

Second, please use the routine [z|c|d|s]pidr_merge.cpp, as this should give you overall good performance. The _strms.cpp version may, in some configurations, suffer from handling too many streams for a small problem. But the solver wrapper should already choose the right solver. In other words: if you keep all source files as in the release and add ''zopts.solver_par.restart = 4;'' to your code, you should see it running much faster.

An easy way to check whether you are getting the performance you should is to run the tester that ships with the release:

./testing_dsolver --solver PIDR --precond JACOBI --restart 4 --maxiter 150 ~/path_to_matrix.mtx

I hope this helps!

Hartwig
hartwig anzt
 
Posts: 75
Joined: Tue Sep 02, 2014 5:44 pm

Re: IDR performance?

Postby bnas » Wed Jun 15, 2016 9:45 am

Thanks a lot for your reply! I will look into it and get back to you. By the way, is there any way to make ./testing_dsolver read a b vector, without messing with the source files to add a second argument?
bnas
 
Posts: 8
Joined: Tue Jun 14, 2016 4:06 am

Re: IDR performance?

Postby hartwig anzt » Wed Jun 15, 2016 9:46 am

No - but I will keep this in mind for the future. You can always add a line to the tester for reading in a vector.
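Off the top of my head, such an addition could look like this (reusing the calls from your code above; here path_to_b stands for however you pass the file name to the tester):
Code: Select all
// read a right-hand side from file instead of using the default b
magma_d_matrix b_h = {Magma_DENSE}, b = {Magma_DENSE};
magma_dvread( &b_h, A.num_cols, path_to_b, queue );         // read on the host
magma_d_vtransfer( b_h, &b, Magma_CPU, Magma_DEV, queue );  // copy to the device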
hartwig anzt
 
Posts: 75
Joined: Tue Sep 02, 2014 5:44 pm

Re: IDR performance?

Postby bnas » Tue Jun 21, 2016 7:13 am

Hey,

I am happy to say that I have been able to verify much faster results when reading in the proper b vector. I was comparing AMGX IDR with a Jacobi preconditioner and a proper b vector against MAGMA's IDR with a Jacobi preconditioner but with b = 1, which is hardly a fair comparison. At the moment, for most of my matrix sizes (though these are still initial findings), MAGMA's IDR (pidr_strms() in particular, since that is what is selected when you choose solver_par.solver = Magma_PIDRMERGE) is 10-30% slower than AMGX's IDR, but that's to be expected. In other words: DAMN GOOD JOB, guys!

I have two questions though:

1) How come choosing solver_par.solver = Magma_PIDR (which corresponds to pidr(), not pidr_merge() or pidr_strms()) fails to solve my matrix with the Jacobi preconditioner within 1000 iterations, whereas solver_par.solver = Magma_PIDRMERGE (which corresponds to pidr_strms()) succeeds, and actually does very well compared to AMGX? Aren't pidr() and pidr_strms()/pidr_merge() equivalent in terms of implementation and results, with the latter two just using fused/overlapped kernels?

2) Do you think reimplementing MAGMA's pidr_merge() or pidr_strms() using native cuSPARSE/cuBLAS/Thrust would be worth the trouble performance-wise? I mean actually converting the MAGMA calls to "native" CUDA calls and, where those are not available, writing my own CUDA kernels, i.e. combining the best of both worlds for optimal results.
bnas
 
Posts: 8
Joined: Tue Jun 14, 2016 4:06 am

Re: IDR performance?

Postby hartwig anzt » Tue Jun 21, 2016 10:31 am

Hey!

Thanks for keeping me posted and for letting me know about your experience! To answer your two questions:

1) This is what I would call a bug. Sorry! I will look into this - or just remove it as an option, as the other one is faster anyway...

2) Although I don't want to stop you from looking into this, I doubt that this effort will get you far. This is pretty much what we do: we use the CUDA and cuSPARSE calls where possible (for the BLAS etc.), so for these functions you should already get cuSPARSE performance. On top of that, we write ''fused'' kernels whenever we can outperform the respective sequence of CUDA calls. Whatever might be better in the AMGX implementation is a secret to me. What might be useful is to check different SpMV kernels in MAGMA, e.g. --format SELLP --blocksize 32 when running ./testing_dsolver. If you have a ''thick'' matrix, you may also use a larger alignment (e.g. --format SELLP --blocksize 8 --alignment 4).
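For example, starting from the tester call above, such runs could look like this (the matrix path is just a placeholder):
Code: Select all
./testing_dsolver --solver PIDR --precond JACOBI --restart 4 --maxiter 150 --format SELLP --blocksize 32 ~/path_to_matrix.mtx
./testing_dsolver --solver PIDR --precond JACOBI --restart 4 --maxiter 150 --format SELLP --blocksize 8 --alignment 4 ~/path_to_matrix.mtx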

Let me know if I can help further!

Hartwig
hartwig anzt
 
Posts: 75
Joined: Tue Sep 02, 2014 5:44 pm

Re: IDR performance?

Postby bnas » Wed Jun 22, 2016 4:58 am

Thanks once again for your reply. I will look into the blocksize/alignment settings and get back to you. Here is a small compilation of my current issues:

1) Where can I see all the available options for the SELLP alignment/blocksize? I know I can look through the source code, but is there an easier way?

2) Another issue I found is that changing smoothing from 1 to 0 (I think the default is 1) in didr_strms.cpp gave me quite a speedup in the solve time (around 30-40%). Shouldn't that be handled by the solver options instead of the actual source file?
Speaking of source-code parameters, what about the initial om = 0.7?

3) I know it's quite complex, but can we expect a solid algebraic multigrid solver soonish?


PS: If you think we should move this to private messages, to save bandwidth and forum space, that's fine by me, but I think my issues/findings might save some people some time in the long run :)
bnas
 
Posts: 8
Joined: Tue Jun 14, 2016 4:06 am

Re: IDR performance?

Postby hartwig anzt » Wed Jun 22, 2016 8:52 am

Hey!

I think the forum is the right place, as others may benefit from the discussion. And I truly welcome any feedback!

1) Yes, I do need to improve the documentation. I can tell you:
available blocksizes are 2, 4, 8, 16, 32, 64, 128, 256
available alignments are 1, 4, 8, 16, 32
but a combination may fail if the product exceeds the range of valid thread block sizes (e.g. 256x32 = 8192).
In general, you should try to keep both numbers small, e.g. 32x1 or, for thick matrices, 32x4/32x8.
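If you drive MAGMA from your own code rather than through the tester, applying one of these combinations is just a format conversion before the device transfer. From memory (please double-check against the actual source, the field names may differ), something like this should do it:
Code: Select all
// sketch: convert the host CSR matrix to SELL-P with blocksize 32, alignment 1
magma_d_matrix A_sellp = {Magma_SELLP};
A_sellp.blocksize = 32;
A_sellp.alignment = 1;
magma_dmconvert( A, &A_sellp, Magma_CSR, Magma_SELLP, queue );
// transfer the SELL-P matrix instead of the CSR matrix
magma_dmtransfer( A_sellp, &A_d, Magma_CPU, Magma_DEV, queue );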

2) I agree, smoothing is an option I should add to the solver options to avoid recompiling. I set it to 1 as it gives nicer convergence, which is important for many problems. There may be problems, though, where it is faster without. The omega of 0.7 is chosen empirically, as it works well for many problems. You can certainly play around with this too, but it really blows up the search space...
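For reference, the two values discussed here sit near the top of didr_strms.cpp roughly like this (from memory, so the exact declarations in the release may differ):
Code: Select all
// hard-coded IDR parameters in didr_strms.cpp (sketch, see note above)
magma_int_t smoothing = 1;   // 1 = residual smoothing (default); 0 = off, 30-40% faster for your problem
double      om = 0.7;        // initial omega, chosen empirically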

3) I am currently focusing more on efficient ILU / ILUT preconditioning. Algebraic multigrid is not very high on my list, sorry.

Finally, a question from my side: What kind of problems are you working on? Where do they come from, and what are the typical characteristics?

Thanks, Hartwig
hartwig anzt
 
Posts: 75
Joined: Tue Sep 02, 2014 5:44 pm
