LAPACK and ScaLAPACK Survey Results - ordered by question
Additional Information
Question #6. Additional Comments/Suggestions?

| Responses |
|---|
| In benchmarking my application that needs eigenpairs of a double precision symmetric matrix, I found that some architectures seem to have much greater overhead associated with function calls. In particular, I was comparing the different nodes on cheetah.ccs.ornl.gov that handle batch jobs (p690 1.3 GHz) and interactive jobs (p655 1.7 GHz). I found that the speed was worse (factor of 1.8) on the batch node than one would expect by scaling to the difference in clock speed (factor of 1.3). I eventually convinced myself that the difference was caused by a huge number of function calls in inner loops, particularly the functions ROTG and DLARTG (e.g. 7 million calls per call to DSTEQR working with a 3750x3750 matrix). I found two ways to solve the problem. One was to recompile Lapack and my application using the highest optimization level (not recommended by the ORNL support folks because it takes so long for typically little benefit), which allows inlining of functions that exist in different source files. The other way was to put copies of ROTG and DLARTG in the source file that was generating so many calls. This allowed a lower optimization level to do the inlining.
I don't know if there is any good solution other than recommending the use of a certain optimization level when compiling libraries and applications. Perhaps a preprocessor that would insert copies of functions into source files where they are called in inner loops? | |
I'm not a primary customer and more weight should be given to the comments from apps users. Mainly I'm filling this in so that the Lapack team will know they have yet another customer. :-) | | Would be nice if the technology in HPL (look ahead, recursive options) can be merged back into PZGETRF/PZGETRS. Currently HPL is only in double in C so double complex is not available. HPL works on rhs but conceptually an extra pass over L should make it compatible with PxGETRF.
General symmetric linear solver (LTL') (perhaps in packed storage as well) would be nice.
Since performance of PBLAS is crucial for scalapack, it would be nice if there are parameters to tune PBLAS, especially on triangular solvers.
More examples or tutorials (like PETSc) to help new users to use scalapack.
Better interfaces for scalapack with other iterative linear solvers such as PETSc or object-oriented CCA technology. | | Lapack is one of the best things that has ever happened to me. Every one of the lapack workers that I've interacted with has bent over backwards to be helpful. This survey is yet another example of how the lapack people care about being useful to their community. If you ever need someone to write glowing letters of support for grant applications (particularly someone from a US National Lab other than ORNL), don't hesitate to contact me at the address above. | | I am currently not using ScaLAPACK but I probably should. I would like to spend some time thinking about how to best integrate with C++.
I also would like to discuss whether the way to use BLAS and LAPACK from C++ can be standardised. Several people have asked me for help, and I'm offering the interfaces we developed in the psimag toolkit. It would be nice if we could decide on properly supporting one model. | | I hope that whatever changes are made to "modernize" lapack are made in a manner that augments the current basic procedural functionality rather than replacing it. | | Additional sparse/banded matrix routines would be great, but
I realize that opens up a new can of worms. | | Could you please install PVM? I really need that for my research. Thank you very much! | | general question 8 is a bit unclearly stated. | | The most frequent difficulty that my users encounter while linking their application to my optimization solvers and LAPACK involves language interoperability. Despite our best efforts to help them configure things properly, calling the Fortran routines from C or C++ continues to frustrate users who are uninterested in these subtleties of computer science.
I would also like to see the BLAS1 include y = alpha y + x, w = alpha x + beta y, and BLAS2 include alpha = x' A x where x is a column vector and A is a symmetric matrix.
| | First of all, many thanks for developing LAPACK. It is very important to our group
(computational nuclear theory).
It would be nice to get a genuine Fortran 95 /2003 LAPACK library WITHOUT
interface calls to Fortran 77 version of LAPACK. | | Thank you for your work on:
lapack
atlas
scalapack
pblas... | | thank you | | Lapack needs sparse linear algebra routines and search/sort routines. | | Currently I am working with a vectorization specialist at a supercomputer center to try to port ScaLapack to my codes. The learning curve for getting it running seems to be fairly significant. Our major bottleneck is communications between nodes to set-up the original matrix and to get the results once the calculation is completed. It doesn't seem like this should be so difficult to handle, but we are unable to improve upon it at the moment.
Being able to use ScaLapack in the same user friendly way as Lapack is used would be a great advance. It currently doesn't seem to be there. | | We at the Seminar for Applied Mathematics at the
ETH Zurich are entrenched LAPACK fans!
Thanks for the great work. Keep it up! | | Re ScaLAPACK: while I am not now using it my need for the functionality it provides has risen sharply in the last several months and I expect I will be diving in to it soon. I expect to use it via a matlab interface and will need to locate a suitable interface or roll my own. | | Online lapack, scalapack guides are useful, most of the time, I just google the routine I need
cholesky update missing in scalapack
large (banded) matrices require huge memory for linear solution.
superlu_dist is nice and should be incorporated
petsc is nice and should be better incorporated | | Few people seem to know this trick: if one has matlab installed on a system, one can link to the MathWorks-provided lapack libraries (called something like libmwlapack.a) which are highly tuned for a given architecture. Some operations are 3x faster than building lapack from source with the highest optimizations. I don't know if the MathWorks discourages its customers from doing this but this kind of tip in my opinion belongs in a lapack FAQ or related document. | | Calculation of gradient information of objectives involving log det of matrix valued function and constraints involving solution of linear system are much faster with explicit inverse of symmetric positive definite matrix. These codes involve element
by element multiplications of inverse with derivatives of matrix elements.
Many vendor implementations multi-thread Cholesky DPOTRF, but not DPOTRI explicit inverse using this Cholesky, producing a bottleneck on local 2- and 4-way SMP nodes. ATLAS does include multithreaded DPOTRI. | | Would very much like to see the RRR tridiagonal eigensolver in ScaLAPACK.
Thanks for providing the survey! | | Kindly circulate the comments that I had sent to Jim and Jack earlier. | | LAPACK is great for not very large systems; the same cannot be said about ScaLAPACK for very large systems. For very large systems,
alternative approaches shall be taken instead of direct extension
of LAPACK. The size matters here. | | Would really like an FFT package in ScaLAPACK, as it is in PESSL. In fact, we are having to use FFTW 2.x in our porting of our image processing software from an IBM P4 system to the Cray XD1 system because we used PESSL's FFT but ScaLAPACK doesn't have it. We are a bit concerned about using FFTW 2.x because it isn't the latest version, but the latest version doesn't have distributed memory FFTs. It would be nice to have it in ScaLAPACK. | | LAPACK is excellent software.
We also use a multiprecision build of CLAPACK, built in C++ using customized class for 'double' say, overloaded arithmetic operators, and runtime-determined substitutes for LAPACK machine-constants routines. Building this requires some "clean up" of the CLAPACK sources, which makes getting updates/fixes more involved. It would be great if this could be done more easily, without any need for minor editing, eg. use of temp vars and no calls involving explicit float args like foo(...,1.0,...) or comparisons like bar > 1.0 . The ability for anyone to be able to build easily a quad- or multiprecision (GnuMP based, say) version of LAPACK might be generally well received. It's not immediately clear that all LAPACK routines (eg, SVD) will remain robust (eg. convergence) when treated in this way. Of course, multiprecision LAPACK brings with it many involved questions about how to best leverage double-precision solutions as initial candidate solutions for arbitrary high precision computation.
| | need more user friendliness | | We mostly use the out-of-core version of scalapack. Since it's only a prototype code, documentation is limited. I would like to see an official out-of-core version of scalapack. Also, partial factorization for both in-core and out-of-core scalapack would be nice. | | I think one of the frustrating parts of the LAPACK libraries is
the build and patch system. I think moving to a modern
revision control system (such as subversion), along with
a good set of open-source development tools accessible
through the web, introducing supported language bindings
(e.g., python), and finally a more reasonable build system is critical to the next generation of users.
| | 1. A quad-precision version of LAPACK would be very useful.
2. After using the matrix solvers in LAPACK for many years, I am convinced there is a deep problem that exhibits itself in the SVD routines. | | I like Lapack, thank you. | | Functionalities for symmetric indefinite matrices
are missing in ScaLAPACK. | | The documentation (man-pages) of LAPACK could be kept more up-to-date wrt. the specification of workspace sizes. We have observed problems in the past due to this information being outdated in some cases.
Another problem concerns error handling: we'd rather not have LAPACK call the standard XERBLA-routine, which terminates the program. Unfortunately, one cannot replace it when using shared libraries. It would be great if there were e.g. some kind of global switch that forced the standard XERBLA to do nothing so that the called LAPACK-function can return the error code in INFO to the application, where it can be handled in a specific way. | | Scalability and performance of the Scalapack routines are very impressive compared to other libraries but still not easy to use according to the data distribution driven by the application algorithms.
Many thanks and wish you all the best. | | I would be very interested in using the MRRR (Multiple
Relatively Robust Representations) eigensolver algorithm.
I would be very grateful if you could feedback to me any
plans you have to incorporate this into future
releases of Scalapack, as this could influence the direction
of my future research. | | 1. I think that sca/lapack should offer support for solving systems with Toeplitz matrices. Recently I have developed some algorithms for solving linear systems with banded triangular Toeplitz matrices (both versions using OpenMP, mpi and Level 2 & 3 BLAS routines). Please let me know if you think that they could be useful. Also see:
P. Stpiczynski: Numerical evaluation of linear recurrences on high performance computers and clusters of workstations, In: Proceedings of PARELEC 2004, IEEE Computer Society Press, 2004, 200-205.
P. Stpiczynski: Solving linear recurrence systems using level 2 and 3 BLAS routines, Lecture Notes in Computer Science 3019 (2004) 1059-1066
2. Support for vector processing could be improved in the case of multiple right hand side vectors (instead of repeating a simpler solver for one right hand side vector).
3. Recently I have developed a triangular matrix solver which uses an alternative data distribution (P. Stpiczynski: Parallel Cholesky factorization on orthogonal multiprocessors, Parallel Computing 18 (1992) 213-219), which is faster than the original scalapack routine. I believe that this idea can be applied to produce faster Cholesky factorization. Currently I'm working on it. | | Please continue to develop these packages as they are of immense value to the research that we perform. |
|
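
The first response in the table attributes its slowdown to millions of calls to ROTG and DLARTG, the routines that generate plane (Givens) rotations. For readers unfamiliar with them, here is a minimal sketch of the computation each call performs. It is a simplified illustration only: the function names `givens` and `apply_rotation` are hypothetical, and the sketch does not reproduce the scaling safeguards or sign conventions of the actual LAPACK routine.

```python
import math

def givens(f, g):
    """Return (c, s, r) such that  c*f + s*g = r  and  -s*f + c*g = 0.

    Simplified illustration; not a port of DLARTG.
    """
    if g == 0.0:
        return 1.0, 0.0, f
    if f == 0.0:
        return 0.0, 1.0, g
    r = math.hypot(f, g)          # sqrt(f*f + g*g) without intermediate overflow
    return f / r, g / r, r

def apply_rotation(c, s, x, y):
    """Apply the rotation [[c, s], [-s, c]] to the pair (x, y)."""
    return c * x + s * y, -s * x + c * y

c, s, r = givens(3.0, 4.0)
top, bottom = apply_rotation(c, s, 3.0, 4.0)
print(c, s, r, top, bottom)
```

The work per call is a handful of floating-point operations, which is why call overhead can dominate when the routine is invoked from an inner loop and is not inlined.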
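
One response above asks for three additional kernels: y = alpha y + x, w = alpha x + beta y, and alpha = x' A x for symmetric A. The existing Level 1 AXPY routine computes y = alpha x + y, which scales x rather than y. The sketch below spells out the requested operations in plain Python to make their semantics concrete. The function names are hypothetical and are not part of any BLAS interface.

```python
def scale_y_add_x(alpha, x, y):
    """Return alpha*y + x (the requested variant, which scales y instead of x)."""
    return [alpha * yi + xi for xi, yi in zip(x, y)]

def axpby_into_w(alpha, x, beta, y):
    """Return w = alpha*x + beta*y without overwriting x or y."""
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]

def quadratic_form(x, A):
    """Return the scalar x' A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(x[i] * sum(A[i][j] * x[j] for j in range(n)) for i in range(n))

x = [1.0, 2.0]
y = [3.0, 4.0]
A = [[2.0, 1.0],
     [1.0, 3.0]]          # symmetric

new_y = scale_y_add_x(2.0, x, y)
w = axpby_into_w(2.0, x, -1.0, y)
q = quadratic_form(x, A)
print(new_y, w, q)
```

With existing routines, the quadratic form would typically take a symmetric matrix-vector product followed by a dot product, which is presumably the two-step pattern the respondent would like fused.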
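
The comment above about gradients of log-det objectives relies on the identity d/dt log det A(t) = trace(A^{-1} dA/dt). For symmetric A, this equals the sum of the element-by-element products of the inverse with the derivative of the matrix, which is why an explicit inverse is useful there. The following is a small sketch on a 2x2 example, checked against a finite difference. The matrix A(t) and all function names are made up for illustration.

```python
import math

def A(t):
    """A hypothetical symmetric positive definite matrix depending on t."""
    return [[2.0 + t, t],
            [t, 3.0]]

def dA_dt(t):
    """Elementwise derivative of A(t) with respect to t."""
    return [[1.0, 1.0],
            [1.0, 0.0]]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Explicit inverse of a 2x2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

def logdet_gradient(t):
    """Sum of elementwise products of inv(A) and dA/dt (valid for symmetric A)."""
    Ainv, D = inv2(A(t)), dA_dt(t)
    return sum(Ainv[i][j] * D[i][j] for i in range(2) for j in range(2))

t0, h = 0.5, 1e-6
analytic = logdet_gradient(t0)
numeric = (math.log(det2(A(t0 + h))) - math.log(det2(A(t0 - h)))) / (2 * h)
print(analytic, numeric)
```

In a real application the inverse would come from a Cholesky factorization followed by an explicit inversion (the DPOTRF and DPOTRI pair the respondent mentions) rather than a closed-form 2x2 formula.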