IDR performance?
Hey all,
Awesome job :) I have recently tested the latest MAGMA IDR (all three versions of it). I am solving a system with a square non-SPD matrix and a JACOBI preconditioner. The same system is solved very fast by MAGMA's bicgstab with the same preconditioner, but the IDR implementation is extremely slow in comparison (about 10 times slower). A preliminary profiling run showed a lot of HtoD copying going on, so I suppose it is a hybrid CPU/GPU version? That would be a shame, because MAGMA has implemented the biortho IDR (the latest IDR variant, which shows great potential) that also exists in NVIDIA's AMGX. Am I doing something wrong, or is it truly a CPU/GPU hybrid solver?
Re: IDR performance?
The IDR in AMGX sure is very optimized. However, I am surprised that the MAGMA version is that slow, in particular as it has been shown to run close to the memory bandwidth limit: http://hpc.sagepub.com/content/early/20 ... 4.abstract
In particular, there should not be much GPU-host communication.
Can you please provide me with some more details? What kind of problem are you targeting? What is the size/nonzeros? What is the shadow space dimension you are using in IDR? In the best case, you can provide me with the problem, then I can take a closer look.
Thanks, Hartwig
Re: IDR performance?
First of all, thanks for the prompt reply. I am using a non-SPD matrix with M: 314728, N: 314728, nnz: 2134992. I am invoking IDR through the call
Code: Select all
magma_d_solver( )
with the options
Code: Select all
zopts.solver_par.solver = Magma_PIDRMERGE;
zopts.precond_par.solver = Magma_JACOBI;
zopts.solver_par.maxiter = 100;
zopts.solver_par.rtol = 1e-8;
Since I think the shadow space dimension s can only be set before compile time (meaning it is a hardcoded parameter and not a function argument), I recompiled didr_strms.cpp with s=4, and I have also tried the initial default s=1. I wouldn't be posting if I hadn't seen those weird HtoD copies during profiling. I am attaching my whole implementation so you can check for possible errors on my side.
Code: Select all
// includes, system
// omitted for space
// includes, project
int main( int argc, char** argv)
{
magma_init();
magma_int_t info = 0;
magma_dopts zopts;
//magma_d_solver_par sopts;
magma_queue_t queue=NULL;
magma_queue_create( 0, &queue );
//Chronometry
real_Double_t tempo1, tempo2, t_transfer = 0.0;
double one = MAGMA_D_MAKE(1.0, 0.0);
double zero = MAGMA_D_MAKE(0.0, 0.0);
magma_int_t err;
//double *a=NULL, *d_a=NULL;
magma_d_matrix x={Magma_DENSE}, x_h={Magma_DENSE}, b_h={Magma_DENSE}, b={Magma_DENSE};
magma_d_matrix A={Magma_CSR}, A_d={Magma_CSR};
//magma_d_vector b, d_x;
int i=1;
magma_d_csr_mtx( &A, argv[i], queue );
printf( "\n%% matrix info: %d-by-%d with %d nonzeros\n\n",
int(A.num_rows), int(A.num_cols), int(A.nnz) );
printf("matrixinfo = [ \n");
printf("%% size (m x n) || nonzeros (nnz) || nnz/m \n");
printf("%%======================================================="
"======%%\n");
printf(" %8d %8d %10d %10d\n",
int(A.num_rows), int(A.num_cols), int(A.nnz), int(A.nnz/A.num_rows) );
printf("%%======================================================="
"======%%\n");
printf("];\n");
magma_dvread( &b_h, A.num_cols, argv[i+1], queue );
magma_dvinit( &x, Magma_DEV, A.num_cols, 1, one, queue );
magma_d_vtransfer(b_h, &b, Magma_CPU, Magma_DEV, queue);
magma_dmtransfer( A, &A_d, Magma_CPU, Magma_DEV, queue );
zopts.solver_par.solver = Magma_PIDRMERGE;
zopts.precond_par.solver = Magma_JACOBI;
zopts.solver_par.maxiter = 150;
zopts.solver_par.rtol = 1e-8;
magma_dsolverinfo_init( &zopts.solver_par, &zopts.precond_par, queue );
//magma_dmscale( &A, zopts.scaling, queue );
magma_d_precondsetup( A_d, b, &zopts.solver_par, &zopts.precond_par, queue );
info=magma_d_solver( A_d, b, &x, &zopts, queue );
magma_dsolverinfo( &zopts.solver_par, &zopts.precond_par, queue );
std::cout<<"\n"<<std::endl;
std::cout<<info;
std::cout<<"\n"<<std::endl;
// free matrices/vectors and the queue before shutting MAGMA down
magma_dmfree( &x, queue );
magma_dmfree( &b, queue );
magma_dmfree( &b_h, queue );
magma_dmfree( &A_d, queue );
magma_dmfree( &A, queue );
magma_queue_destroy( queue );
magma_finalize();
return 0;
}
So running from bash, the first argument is matrix A and the second is vector b. I apply the preconditioner with
Code: Select all
magma_d_precondsetup( A_d, b, &zopts.solver_par, &zopts.precond_par, queue );
Since I am very new to MAGMA, could you pinpoint any errors on my side? I feel that the preconditioning should have an apply step as well, after the precondsetup, plus I think I should be using the idr_strms() function directly instead of the wrapped d_solve. Could you provide me with a better way? Thanks a lot!
Re: IDR performance?
Thanks for the details!
First of all, the shadow space dimension can be controlled via the parameter ''solver_par.restart'', so there is no need to recompile the library; in fact, you should not even have to touch the source files. In your runfile, if you want a shadow space dimension of 4, use
zopts.solver_par.restart = 4;
Second, please use the routine [z|c|d|s]pidr_merge.cpp, as this should give you overall good performance. The _strms.cpp version may in some configurations suffer from handling too many streams for a small problem. But the solver wrapper should already choose the right solver. So in other words: if you keep all source files as in the release and add ''zopts.solver_par.restart = 4;'' to your code, you should see the code running much faster.
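For reference, the options block from your runfile would then look roughly like this (a sketch based on the code you posted; I have only added the restart line):
Code: Select all
// options block from the runfile above, with the shadow space set via restart
zopts.solver_par.solver  = Magma_PIDRMERGE;  // wrapper selects the fused PIDR variant
zopts.precond_par.solver = Magma_JACOBI;     // Jacobi preconditioner
zopts.solver_par.restart = 4;                // shadow space dimension s = 4
zopts.solver_par.maxiter = 150;
zopts.solver_par.rtol    = 1e-8;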
An easy check to see whether you are getting the performance you are supposed to get is to use the tester the release ships with:
./testing_dsolver --solver PIDR --precond JACOBI --restart 4 --maxiter 150 ~/path_to_matrix.mtx
I hope this helps!
Hartwig
Re: IDR performance?
Thanks a lot for your reply! I will look into it and get back to you. By the way, is there any way to make ./testing_dsolver read a b vector, without messing with the source files and adding a second argument?
Re: IDR performance?
No - but I will keep this in mind for the future. You can always add a line in the tester for reading in a vector, though.
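For example, a minimal sketch of such an addition, reusing the two calls from your own code (the file name is just a placeholder):
Code: Select all
// hypothetical addition to the tester: read the right-hand side from a file
// instead of initializing it to ones
magma_d_matrix b_h = {Magma_DENSE}, b = {Magma_DENSE};
magma_dvread( &b_h, A.num_cols, "/path/to/b.mtx", queue );   // read b on the host
magma_d_vtransfer( b_h, &b, Magma_CPU, Magma_DEV, queue );   // copy b to the device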
Re: IDR performance?
Hey,
I am happy to say that I have been able to verify much faster results when reading in the proper B vector. I was comparing AMGX IDR with Jacobi preconditioner and a proper B vector against MAGMA's IDR with Jacobi preconditioner but with a B = 1 vector. Hardly a fair comparison. I'd say that at the moment, for most of my matrix sizes (though these are still initial findings), MAGMA's IDR (pidr_strms() in particular, since that is what is selected when you choose solver_par.solver = Magma_PIDR) is 10-30% slower than AMGX's IDR, but that's to be expected. In other words: DAMN GOOD JOB guys!
I have two questions though:
1) How come choosing solver_par.solver = Magma_PIDR (which corresponds to pidr(), not pidr_merge() or pidr_strms()) fails to solve my matrix with the Jacobi preconditioner within 1000 iterations, whereas solver_par.solver = Magma_PIDRMERGE (which corresponds to pidr_strms()) succeeds (and actually does very well compared to AMGX)? Aren't pidr() and pidr_strms()/pidr_merge() equivalent in terms of implementation and results, with the latter just being kernel-fused/overlapped?
2) Do you think implementing MAGMA's pidr_merge() or pidr_strms() in native cuSPARSE/cuBLAS/Thrust would be worth the trouble performance-wise? I mean actually converting the MAGMA calls to "native" CUDA calls and, where not available, writing my own CUDA kernels, i.e. combining the best of both worlds for optimal results.
Re: IDR performance?
Hey!
Thanks for keeping me posted and for letting me know your experience! To answer your two questions:
1) This is what I would call a bug. Sorry! I will look into this - or just remove it as an option, as the other one is faster anyway...
2) Although I don't want to stop you from looking into this, I doubt that this effort will get you far. This is pretty much what we do: we use the CUDA and cuSPARSE calls where possible (for the BLAS etc.), so for these functions you should already get the cuSPARSE performance. On top of that, we write ''fused'' kernels whenever we outperform the respective sequence of CUDA calls. Whatever might be better in the AMGX implementation is a secret to me. What might be useful is to check different SpMV kernels in MAGMA, e.g. --format SELLP --blocksize 32 (when running ./testing_dsolver). If you have a ''thick'' matrix, you may also use a larger alignment (e.g. --format SELLP --blocksize 8 --alignment 4).
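If you want to try a different SpMV format inside your own runfile rather than through the tester flags, a rough sketch could be converting the matrix to SELL-P before transferring it to the device. I am writing this from memory, so please verify against the tester source; in particular, the assumption that the conversion picks up blocksize/alignment from the fields of the target matrix:
Code: Select all
// sketch (untested): convert the CSR host matrix to SELL-P before moving it to the GPU
magma_d_matrix A_sellp = {Magma_CSR};
A_sellp.blocksize = 32;    // assumption: the conversion reads blocksize/alignment
A_sellp.alignment = 1;     // from the target matrix; please double-check
magma_dmconvert( A, &A_sellp, Magma_CSR, Magma_SELLP, queue );
magma_dmtransfer( A_sellp, &A_d, Magma_CPU, Magma_DEV, queue );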
Let me know if I can help further!
Hartwig
Re: IDR performance?
Thanks once again for your reply. I will look into the block size/alignment settings and get back to you. Here is a small compilation of my current issues:
1) Where can I see all the available alignment/block size options for SELLP? I could look through the source code, but is there an easier way?
2) Another issue I found is that changing smoothing from 1 to 0 (I think the default is 1) in didr_strms.cpp gave me quite a speedup in the solve time (around 30-40%). Shouldn't that be handled by the solver options instead of the actual source file? Speaking of source code parameters, what about the initial om=0.7?
3) I know it's quite complex, but can we expect a solid algebraic multigrid solver soonish?
PS: If you think we should move this to private messages, to save bandwidth and forum space, that is fine by me, but I think my issues/findings might save some people some time in the long run :)
Re: IDR performance?
Hey!
I think the forum is the right place, as others may benefit from the discussion. And I truly welcome any feedback!
1) Yes, I do need to improve the documentation. I can tell you:
available blocksizes are 2,4,8,16,32,64,128,256
available alignments are 1,4,8,16,32
but the combination may fail if the product exceeds the range of valid thread block sizes, e.g. 256x32 = 8192
In general, you should try keeping both numbers small, e.g. 32x1 or - for thick matrices - 32x4/32x8
2) I agree, smoothing is an option I should add to the solver options to avoid recompiling. I set it to 1 as it gives nicer convergence, which is important for many problems. There may be problems, though, where it is faster without. The omega of 0.7 is chosen empirically, as it works well for many problems. You can certainly play around with this too - but that really blows up the search space...
3) I am currently more focusing on efficient ILU / ILUT preconditioning. Algebraic Multigrid is not very high on my list, sorry.
Finally, a question from my side: What kind of problems are you working on? Where do they come from, and what are the typical characteristics?
Thanks, Hartwig