LAPACK and ScaLAPACK Survey Results - ordered by question
General
Question #1. Is dense linear algebra a performance bottleneck in your applications?

| Response | Count | Percent |
|---|---|---|
| Yes | 157 | 76% |
| No | 48 | 23% |

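
The Percent columns in these results do not always sum to 100%. The published figures are consistent with each share being truncated to a whole percent rather than rounded. That is an inference from the counts, not something the survey states. A minimal Python sketch (the function name `truncated_percents` is illustrative) reproduces the Question #1 values:

```python
def truncated_percents(counts):
    """Return each count's share of the total, truncated to a whole percent."""
    total = sum(counts.values())
    # Integer division drops the fractional part instead of rounding it.
    return {label: (100 * n) // total for label, n in counts.items()}


question_1 = {"Yes": 157, "No": 48}
print(truncated_percents(question_1))  # {'Yes': 76, 'No': 23}
```

The same calculation matches the multiple-choice questions if the total is taken over all selections rather than over respondents.
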
Question #2. How often do your applications use the arithmetic precisions listed below:
Question #3. What dense matrix sizes are most important or time-consuming for your application?

| Response | Count | Percent |
|---|---|---|
| 1,000s X 1,000s | 56 | 26% |
| 10,000s X 10,000s | 50 | 23% |
| 100,000s X 100,000s | 27 | 12% |
| 100s X 100s | 20 | 9% |
| 1,000,000 or more X 1,000,000 or more | 16 | 7% |
| 10s X 10s | 9 | 4% |
| 1,000s X 100s | 5 | 2% |
| 10,000s | 4 | 1% |
| 100,000s X 1,000,000 or more | 3 | 1% |
| 100s X 10,000s | 3 | 1% |
| 1,000,000 or more | 3 | 1% |
| 1,000,000 or more X 100s | 2 | 0% |
| 100,000s | 2 | 0% |
| 100s | 2 | 0% |
| 1,000s | 2 | 0% |
| 10,000s X 100,000s | 2 | 0% |
| 1,000s X 10s | 1 | 0% |
| 100s X 1,000s | 1 | 0% |
| 10,000s X 1,000,000 or more | 1 | 0% |
| 100,000s X 1,000s | 1 | 0% |
| 100s X 10s | 1 | 0% |
| 10,000s X 1,000s | 1 | 0% |
| 100,000s X 10,000s | 1 | 0% |

Question #4. Does your application come close to, or run out of memory on important problems?

| Response | Count | Percent |
|---|---|---|
| Yes | 150 | 69% |
| No | 66 | 30% |

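
Questions #3 and #4 are related: a dense matrix stored in full needs rows × columns × bytes-per-element of memory. The sketch below is a rough illustration only. It assumes 8-byte double-precision reals, a single full copy of the matrix, and the lower bound of each square size bucket; the helper name `dense_bytes` is made up for this example.

```python
def dense_bytes(rows, cols, bytes_per_element=8):
    """Memory needed to hold one full dense matrix, in bytes."""
    return rows * cols * bytes_per_element


# Lower bound of each square size bucket from Question #3.
for n in (1_000, 10_000, 100_000, 1_000_000):
    gib = dense_bytes(n, n) / 2**30
    print(f"{n:>9,} x {n:<9,} doubles: {gib:12,.2f} GiB")
```

Under these assumptions a 10,000 X 10,000 matrix takes 800 MB and a 100,000 X 100,000 matrix takes 80 GB, before counting workspace or additional copies.
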
Question #5. Number of processors used for your application:
Question #6. Which architectures do you use or intend to use in the next three years?

| Response | Count | Percent |
|---|---|---|
| Distributed-memory | 148 | 20% |
| Multi-core-thread | 133 | 18% |
| Sequential | 124 | 17% |
| Hybrid-shared | 98 | 13% |
| Distributed-shared-memory | 71 | 9% |
| Symmetric-multi-procs | 55 | 7% |
| Vector-computers | 55 | 7% |
| Widely-distributed | 32 | 4% |

Question #7. Do you use any sequential or parallel dense linear algebra packages other than LAPACK or ScaLAPACK?

| Response | Count | Percent |
|---|---|---|
| no | 16 | 22% |
| PLAPACK | 14 | 19% |
| ESSL | 3 | 4% |
| BLAS | 2 | 2% |
| PESSL | 1 | 1% |
| linpack | 1 | 1% |
| scilib | 1 | 1% |
| Paradiso | 1 | 1% |
| CLAPACK | 1 | 1% |
| ESSL, CXML (LAPACK-based) | 1 | 1% |
| PLAPACK, Pliris | 1 | 1% |
| IBM ESSL | 1 | 1% |
| IBM's ESSL | 1 | 1% |
| Steven Kenny | 1 | 1% |
| NAG | 1 | 1% |
| super_lu_dist, petsc | 1 | 1% |
| mumps | 1 | 1% |
| Peigs | 1 | 1% |
| ESSL, NAG | 1 | 1% |
| NAG, IMSL, ESSL (IBM), ASL (NEC) | 1 | 1% |
| PETSc | 1 | 1% |
| Yes | 1 | 1% |
| PLAPACK, Trilinos, Parallel ESSL | 1 | 1% |
| Compact Numerical Methods / Pascal | 1 | 1% |
| none | 1 | 1% |
| Lee Samuel Finn | 1 | 1% |
| Arpack | 1 | 1% |
| ARPACK, BLZPACK | 1 | 1% |
| perhaps the 'parallel relatively robust representation' code. It has less memory requirements which favours a small cluster. | 1 | 1% |
| ARPACK and PARPACK | 1 | 1% |
| BNCpack, MPIBNCpack | 1 | 1% |
| fftw | 1 | 1% |
| ATLAS, in-house modular LA packages for rational (not float) problems | 1 | 1% |
| BLAS of course | 1 | 1% |
| ATLAS | 1 | 1% |
| some OpenMP blas | 1 | 1% |
| amd acml library, Intel mkl library | 1 | 1% |
| umfpack | 1 | 1% |
| BLAS, linpack, eispack (legacy code) | 1 | 1% |
| FLAME, PLAPACK | 1 | 1% |

Question #8. Please rank how the following features would be useful to your current or planned applications?
Question #9. Do your applications solve linear algebra problems of the following types?

| Response | Count | Percent |
|---|---|---|
| General linear systems | 144 | 19% |
| Linear positive definite systems | 115 | 15% |
| Symmetric eigenvalue | 110 | 14% |
| SVDs | 84 | 11% |
| Generalized eigenvalue | 79 | 10% |
| Least-square problems | 74 | 9% |
| Banded linear systems | 72 | 9% |
| Non-symmetric eigenvalue | 55 | 7% |
| cholesky updating | 1 | 0% |
| Symmetric complex | 1 | 0% |
| product problems (Schur | 1 | 0% |
| singular linear systems | 1 | 0% |
| Linear indefinite systems | 1 | 0% |
| complex symmetric linear systems | 1 | 0% |
| sparse matrices | 1 | 0% |
| Symmetric indefinite [G A; A' 0] structure | 1 | 0% |
| general complex inverse and determinant | 1 | 0% |
| control theory | 1 | 0% |
| matrix exponential | 1 | 0% |
| gen.Schur) | 1 | 0% |

LAPACK Usage
Question #1. Do you use LAPACK (or a vendor version of LAPACK)?

| Response | Count | Percent |
|---|---|---|
| Yes | 198 | 92% |
| No | 15 | 7% |

Question #2. If you do not use LAPACK, why?

| Response | Count | Percent |
|---|---|---|
| Use another pkg | 12 | 50% |
| Not solving linear algebra | 8 | 33% |
| Cost of learning | 4 | 16% |

Question #3. If using another package, which one(s)?

| Response | Count | Percent |
|---|---|---|
| FLAME | 3 | 20% |
| ESSL | 2 | 13% |
| PLAPACK | 2 | 13% |
| CLAPACK | 1 | 6% |
| scalapack | 1 | 6% |
| matlab | 1 | 6% |
| NAG, IMSL, ESSL (IBM), ASL (NEC) | 1 | 6% |
| PARPACK | 1 | 6% |
| I usually write my own solvers. | 1 | 6% |
| ESSL for sequential work (though I do also use LAPACK when a customer requires it). | 1 | 6% |
| Also Custom | 1 | 6% |

Question #4. If you use LAPACK, do you use a vendor's version or one obtained directly from Netlib?

| Response | Count | Percent |
|---|---|---|
| Netlib | 133 | 48% |
| Vendor | 119 | 43% |
| Other | 20 | 7% |

Question #5. If you have used both a vendor's version of LAPACK and Netlib's, how do the two versions compare?

| Responses |
|---|
| vendor lib decreases our runtime by 10%-20% (of calculations taking hours to days). |
| vendor's version may only have LU solver, not the full library |
| netlib's version is less convenient to use since it does not have Makefiles. |
| Very similar in performance provided you use an optimised BLAS |
| Roughly equal, given well optimized BLAS (e.g. Atlas) |
| Netlib's version provides the source code, very helpful in porting. I use vendor's LAPACK functions whenever possible. |
| This depends on the architecture/compiler: The AMD (ACML) libraries perform very well, though interacting with the 'fastsse' options on the portland group compiler can cause rare failures. So most ScaLapack/Lapack libraries are a mixture of 'pre' and 'locally' compiled. At NERSC, linking to the 'essl' library (for DGEMM) has proved substantially faster than a user compiled version. |
| I usually try to use the vendor's version first and if it is not available, costs too much, or for some other reason too much work, use Netlib's version next. |
| Mostly interchangeable. However, on occasion, the flexibility of configuring through ATLAS has been a key enabler. |
| Generally more optimized, although I have not separated out the BLAS performance (actually use of vendor LAPACK over reference netlib version is often a matter of convenience when using vendor BLAS). |
| very similar |
| No real differences. |
| Similar. Usually performance depends on BLAS, for which the Goto BLAS performs on par with vendor BLAS. |
| vendor version is substantially faster. |
| Vendor supplied typically factor 2 faster. |
| well. |
| Vendor faster but Vendor is incomplete or undefined API (Lapack 1? 2? 3?) |
| On Opterons with Portland group compilers under Linux, Netlib version is at least order of magnitude worse. It's useful for testing on a PC though. |
| Presume vendor version provides better performance. |
| I use MKL on Windows machines and ATLAS on Linux machines. Both perform better than reference implementations. |
| the intel MKL libraries are much easier to use than ATLAS and LAPACK. despite what others say, I do not find ATLAS to be much faster than MKL blas (perhaps the difference is not sufficient to motivate me to change). |
| Vendor's usually does a better job of using cache |
| Prefer vendor's version. |
| I have run them on different architectures and so have not had the opportunity to do a very direct comparison. |
| Never compared. |
| Have not compared. Usually I have used the vendor's version assuming they had optimized blas, and maybe lapack. I also assume the vendor is more knowledgeable (and has more time) to experiment with the different compiler options. Have not actually verified my assumptions, I could be wrong. |
| Vendor's version somewhat faster |
| Vendor version is more convenient and sometimes optimized. However, my main reason for using a vendor supplied version is convenience in the cases where I don't need portability. |
| Sometimes the vendor version performs better. |
| Similar - Netlib self compiled version is often as good as/better than vendor supplied versions |
| Similar |
| The overall size of LAPACK library is big for some small platforms. |
| Vendor versions tend to be minimally tuned, but better tuned than netlib. |
| Vendor code is faster, but not more than 25% |
| Sun Performance Library (Solaris 8, 9) had too many bugs. We were on the phone for half an hour or more each time we called in bug reports to Sun and finally decided that their implementation was worthless; we stopped using it and reverted to netlib sources. |
| Vendor version is far better (faster). |
| Tuning netlib version can be annoying. |
| Often e.g. on IBM SP's, the vendor version is better |
| no tests performed |
| Vendor better on Solaris |
| LAPACK is comfortable in some sense because I have been using it (off-and-on) for so many years. However, the differences are all in performance, which is important, but my problems with LAPACK/ScaLAPACK all stem from the interface and that, of course, does not change. |
| They are more or less the same, I just don't have to compile / keep manually track of updates with Debian's lapack. |
| Vendor is much faster |
| Vendor versions are *usually* faster, but not always. |
| vendor version is faster |
| Vendor's version (MLIB) optimized much better and thus faster |
| Vendor version is more prone to errors but is usually faster speedwise. Having the netlib version available is a good check on the stability of the vendor version (especially the test suite) |
| SGI computer has its own version of BLAS and LAPACK. It is faster. |
| Vendor releases seem only marginally optimized over reasonable compilation of Netlib's code. ATLAS libraries are critical for any real performance bottleneck. |
| Basically the same |
| I was never able to get the Netlib version to pass all its own tests. I have not submitted the vendor package to the same tests! |
| I use ESSL from IBM on p690. At link step, I ask first for ESSL, next for LAPack for completeness. (ESSL does NOT contain all LAPack subroutines) |
| Vendor's version is used for performance and netlib's one for portability. |
| vendor's is faster |
| Vendor version is much faster |
| Comparison is difficult because I only use Netlib LAPACK on architectures for which I don't have access to a vendor version. |
| Higher speed for the vendor's version as expected. Same robustness (keeping in mind that I compile the Netlib version with medium optimization flags). |
| The vendor version is faster, and I use it because it is free in this case (AMD's ACML). |
| I use MKL provided by Intel. Level 2 BLAS routines from MKL are faster than |
| Generally the vendor versions are faster, showing considerable speedup over the Netlib versions. |
| vendor's version of LAPACK much better tuned to their platforms |

Question #6. Do your applications make direct LAPACK calls?

| Response | Count | Percent |
|---|---|---|
| Yes | 181 | 84% |
| No | 32 | 15% |

Question #7. Do your applications use libraries which depend on LAPACK?

| Response | Count | Percent |
|---|---|---|
| Yes | 136 | 65% |
| No | 73 | 34% |

Question #8. Do your applications use a higher-level interface to LAPACK?

| Response | Count | Percent |
|---|---|---|
| Yes | 55 | 27% |
| No | 148 | 72% |

Question #9. If you answered yes above, which higher-level interfaces do you use?

| Response | Count | Percent |
|---|---|---|
| Matlab | 19 | 35% |
| R | 3 | 5% |
| SciPy | 2 | 3% |
| python | 2 | 3% |
| Some Matlab routines | 1 | 1% |
| the matrix template library (MTL) | 1 | 1% |
| in-house custom written | 1 | 1% |
| CLAPACK | 1 | 1% |
| Matlab, NAG | 1 | 1% |
| Scilab | 1 | 1% |
| numarray | 1 | 1% |
| Matlab mex-functions | 1 | 1% |
| FLAME interfaces | 1 | 1% |
| Mathematica, Apple's "Accelerate" framework | 1 | 1% |
| SciPy/Python, Boost ublas-lapack | 1 | 1% |
| PETSc | 1 | 1% |
| Matlab, Lapack95 | 1 | 1% |
| Matlab, IDL | 1 | 1% |
| FLAME | 1 | 1% |
| C++ bindings from boost-sandbox | 1 | 1% |
| Trilinos, my own c++ templated lapack wrappers | 1 | 1% |
| python, matlab, octave | 1 | 1% |
| MPL (our own multi precision matlab clone) and Matlab | 1 | 1% |
| LACAML | 1 | 1% |
| Matlab, PETSc | 1 | 1% |
| MTJ | 1 | 1% |
| MATLAB matrix ops | 1 | 1% |
| in-house C++ layer | 1 | 1% |
| Octave | 1 | 1% |
| python/scipy | 1 | 1% |
| LAPACK90 | 1 | 1% |

Question #10. Is the LAPACK procedure interface a barrier to more extensive use?

| Response | Count | Percent |
|---|---|---|
| Yes | 37 | 18% |
| No | 159 | 81% |

Question #11. From which languages do you call LAPACK routines?

| Response | Count | Percent |
|---|---|---|
| Fortran 90/95 | 126 | 25% |
| Fortran 77 | 107 | 21% |
| C | 104 | 21% |
| C++ | 75 | 15% |
| Matlab | 34 | 6% |
| Python | 16 | 3% |
| Octave | 10 | 2% |
| Fortran 2003 | 5 | 1% |
| Java | 4 | 0% |
| R | 3 | 0% |
| Scilab | 1 | 0% |
| Mathematica | 1 | 0% |
| C# | 1 | 0% |
| tcl | 1 | 0% |
| IDL | 1 | 0% |
| VB | 1 | 0% |
| OCaml | 1 | 0% |

Question #12. Please describe any tools or helper functions that you frequently implement to assist your applications in using LAPACK.

| Responses |
|---|
| CLAPACK |
| C++ interfaces |
| Numeric Python |
| 1. Perhaps an architecture independent diagnostic, that would interrogate a particular system for different L1 and L2 cache boundaries and suggest optimum matrix size. 2. DGEMM_HALF ... i.e. knowing my matrix solution will be symmetric in advance, I would like a half matrix multiply that scales as well as a vendor supplied DGEMM. |
| We have built a C++ interface for BLAS and LAPACK that uses the ability to overload arguments and removes the need to encode precision into the name of the method/subroutine. We have also defined the simplest possible matrix concept and use generic linear algebra operations that interface to LAPACK and BLAS. |
| None. |
| I usually wrap the f77 calls in f90 shells that allocate workspace memory as required. |
| `#if SP2` / `#define C2FCALL(x) x` / `#else` / `#define C2FCALL(x) x##_` / `#endif` |
| I mostly use lapack via Scipy, and I find the wrapped interface satisfactory. I'm sure it could be improved, but I have never felt that part to be a development bottleneck for me. |
| wrapper routines that dynamically allocate the work spaces |
| I typically write "wrapper functions" that call LAPACK while hiding many of its arguments from the nonlinear solver or optimization solver. When using C++, these functions inherit the interface from an abstract matrix class. |
| In my C codes, I always have to make a wrapper to hold dummy integers, etc. for the options I don't use. |
| Compact wrappers |
| sparse routines and search routines have been written. |
| Class wrappers in C++ |
| boost python bindings, f2py |
| C/C++ wrappers |
| Our helper functions abstract from indices, much like FLAME does |
| I/O Stuff for printf-debugging of small examples |
| NETLIB |
| C headers |
| OO wrapper |
| C++ class package for matrices. |
| Home grown wrappers for LAPACK that allocated work space. |
| explicit prototypes |
| Wrappers that handle memory allocation. |
| self-developed tools |
| none |
| Getting LAPACK templated is one of the greatest obstacles. We are currently trying to use a modified f2c to do the job of converting the hard coded types 'double' and 'float' into our own MpIeee class. |
| Functions to display matrix elements to check correctness |
| matrix plotters (ascii, graphical, etc) |
| my own c++ templated lapack wrappers |
| PMD (Parallel multi-domain Decomposition), see : http://www.idris.fr/data/publications/PMD |
| I frequently write wrapper routines to make repeated calls from C more straightforward. I often do not need access to the full range of input arguments, so I write the C wrapper to supply the "extraneous" arguments. |
| I have not implemented such tools. I think that the use of lapack routines is quite simple. |
| plane rotations, reflections, Toeplitz and Cauchy solvers, updating and downdating, |

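
Several of the responses above describe wrappers that allocate workspace on the caller's behalf. Many LAPACK routines that take `WORK`/`LWORK` arguments support a workspace query: a call with `LWORK = -1` returns the recommended size in `WORK(1)` without doing the computation. The sketch below illustrates that query, allocate, call pattern in Python. `mock_factor` is a hypothetical stand-in written for this example, not a LAPACK routine, and its workspace requirement of `4 * n` is invented.

```python
def mock_factor(n, work, lwork):
    """Hypothetical stand-in for a LAPACK-style routine (not a real LAPACK call).

    Follows the LWORK convention: lwork == -1 is a workspace query that
    stores the preferred workspace size in work[0] and does nothing else.
    Returns an INFO-style status code (0 means success).
    """
    needed = 4 * n  # made-up workspace requirement for the illustration
    if lwork == -1:
        work[0] = float(needed)
        return 0
    if lwork < needed:
        return -3  # the third argument (lwork) is too small
    # ... a real routine would do the numerical work here ...
    return 0


def factor(n):
    """Wrapper that hides the workspace arguments from the caller."""
    query = [0.0]
    info = mock_factor(n, query, -1)      # step 1: ask for the size
    if info != 0:
        raise RuntimeError(f"workspace query failed, info={info}")
    lwork = int(query[0])
    work = [0.0] * lwork                  # step 2: allocate
    info = mock_factor(n, work, lwork)    # step 3: the real call
    if info != 0:
        raise RuntimeError(f"mock_factor failed, info={info}")
    return lwork


print(factor(100))  # 400
```

The caller of `factor` never sees `work` or `lwork`, which is the convenience respondents report building by hand in Fortran 90, C, and C++.
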
Question #13. How could the LAPACK interface be improved to feel more natural to your application and implementation language?

| Responses |
|---|
| There could be a facility to interrupt a Lapack call. This can be very useful in a multithreaded application, or one that makes use of asynchronous communication. I hacked this feature into Lapack3 in a fairly inelegant way. I identified several points in the Lapack code to call an external function, which was part of my application that handled communication events. These points were chosen so they were called fairly often during the course of solving my symmetric eigenvalue problem, but not in inner loops to avoid a performance hit. Because the Lapack functions have no parameters for such a purpose, I used "LDA" as a flag to indicate a situation where I need Lapack to abort the current operation, as opposed to just continuing on where it left off. In my application, this depended on the nature of the communication event. |
| Object orientation |
| see the draft LAPACK bindings of the C++ boost project |
| Fortran 90/95 interfaces would be more natural. Overload so that have one routine name for all data types |
| Better integration with Python |
| Short of doing a full thing with a generic linear algebra library (modeled after blitz++ or MTL), I favor the pragmatic and simple approach I described in item 12. |
| I actually find the procedural interface as it stands quite natural, but then I'm one of the f77 dinosaurs. |
| Automatic memory allocation for temporary/work arrays would be a nice option. I understand why LAPACK does the whole bit about 'leading dimension of array...' but it's awkward and clumsy. Would be nicer to use f90 interfaces that know what shape the array is, or a simple C++ array type that does as well. (But *not* a complicated C++ array type like Blitz supplies; don't make LAPACK itself dependent on anything!) |
| I don't see any obvious/major weakness to the current LAPACK (used under F90). |
| - removing the work spaces - call-by-value interfaces for C |
| Make a C++ framework for it so I don't have to write my own. |
| Some examples that call LAPACK from C would be helpful. |
| I hate to call the routine sometimes with underscore on some systems like linux. |
| Should have automatic workspace allocation, and would be very useful to have a "higher level" interface that, given the sort of problem to solve (e.g. symmetric eigenvalue problem) detects the type of data and the structure of the matrix, so that the user does not have to rewrite large portions of code while trying different algorithms and modifying existing code (e.g. when the underlying data changes from real to complex). |
| automatic interface. |
| It would be nice to have multiple c wrappers for LAPACK that can customize calls to many-option functions. |
| Definitely oppose to any f90 interface. f77/C as is makes it very easy to use with any applications. |
| Simple consistent interface such as outputs first, followed by inputs. |
| give full C or C++ version of lapack |
| generic programming would make lapack more accessible from C++ |
| More comprehensive wrapper interface for C++ that does not sacrifice performance (e.g. user keeps control over memory allocation) |
| Fortran 90/95/2003 interfaces should become standard. |
| Provision of standard C/C++ wrappers |
| The symmetric packed routines should be improved. |
| Provide interface for row-major storage. |
| remove user supplied workspace; do it automatically. More obvious (longer) routine names, although this might break f77 applications. |
| Make it more like FLAME |
| Hide work arrays, leading dimensions, matrix dimensions, and possibly other matrix information (symmetric, triangular) beneath an abstraction layer (a la FLAME objects). Do not require user to provide workspace. |
| provide C headers + assistance tool for linking (linking of Fortran and C can be a bit of a pain in the backside, finding all libraries that have to be linked in) |
| More optional arguments and thus shorter method signatures. |
| Getting rid of some of the explicit indexing would be nice. Being able to shift the borders of the "matrix of interest" inside the larger array on the machine would be nice. ScaLAPACK makes this situation worse, which is unavoidable I suppose. A very simple change would be slightly less cryptic identifier names (indirect feedback from anyone I help use LAPACK for the first, and often last, time). |
| Object oriented framework for C++. |
| Heavy use of C++ |
| simplifying the indexing |
| We have C++ application codes. We would like to have matrix 'objects' which would bundle size information, and possibly have various format/layout options (e.g. packed symmetric, etc.). |
| Automated memory allocation. |
| with a matlab wrapper |
| It's a pretty good interface as is. |
| It would be slightly nicer to have actual routines to query accuracy of eigenproblem results as error bounds from the routines such as zgeevx (that calculate eigen-condition numbers). Right now, the LUG shows (eg. lug/node100.html) code fragments to obtain such accuracy results based on working precision details and the computed condition numbers. It'd be slightly nicer to have these code fragments be actually implemented in some routine -- the driver, say, or other. |
| Using lapack in C++ makes more sense if the library is templated. Allowing to use your own datatypes in the algorithms and hence enabling multi-precision. |
| Consistent and intuitive ordering of function parameters/argument |
| possibility to have more info about lower level steps in higher level routines |
| provide standardized c++ templated lapack wrappers |
| Since we predominantly use OCaml as implementation language, we are not particularly dependent on the LAPACK-interface itself, because the intermediate LACAML-interface makes calling LAPACK very easy without losing functionality or efficiency. I assume that this is also mostly true for other people using different implementation languages. Therefore, I'd propose to keep LAPACK as efficient as possible, i.e. there is no need to waste implementation effort on making it more convenient. Higher-level interfaces and implementation languages are more suitable for this purpose. |
| With fortran 95 and even more with Fortran 2003 many things could be improved using the new functionalities as MODULE, genericity, optional arguments, dynamic allocation of Fortran 2003 and its object oriented programming and C-Fortran compatibility. |
| C++ wrapper |
| A C interface would make LAPACK feel more natural in C. It might also be nice to have set of functions which have a restricted set of arguments that work for most "simple" problem. FFTW supplies such function and I find them quite handy. |
| A simplified interface for Fortran90 etc. that allows for simple quick use of the routines. |
| I think it is not necessary to make such improvements. |

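
A recurring request above is one routine name for all data types. LAPACK encodes the data type in the first letter of each routine name: `S` and `D` for single and double precision real, `C` and `Z` for single and double precision complex. A generic front end therefore has to map the argument type to that prefix. The sketch below shows only that mapping; `routine_name` and `PREFIXES` are illustrative names, not part of any existing binding, and nothing here calls LAPACK.

```python
# Type prefix used at the start of LAPACK routine names.
PREFIXES = {
    ("real", "single"): "s",
    ("real", "double"): "d",
    ("complex", "single"): "c",
    ("complex", "double"): "z",
}


def routine_name(base, sample, precision="double"):
    """Pick the precision-specific name for a generic routine such as 'gesv'.

    `sample` is any element of the matrix; its Python type decides
    between the real and complex variants.
    """
    kind = "complex" if isinstance(sample, complex) else "real"
    return PREFIXES[(kind, precision)] + base


print(routine_name("gesv", 1.0))            # dgesv
print(routine_name("gesv", 1.0 + 2.0j))     # zgesv
print(routine_name("syev", 1.0, "single"))  # ssyev
```

A real front end would need more than a prefix swap, because the rest of the name sometimes changes between the real and complex cases (for example `dsyev` for real symmetric matrices versus `zheev` for complex Hermitian ones). This sketch ignores that.
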
Question #14. If you have installed LAPACK yourself, how could the installation process be improved?

| Responses |
|---|
| Have the default make target *not* run the test phase. |
| would be nice if the testing and verification process is quicker |
| use a simple configure script |
| It could not be improved |
| This is a minor issue: complete the LAPACK implementation in ATLAS so that optimized BLAS and LAPACK can be installed in one package. |
| The testing process is too slow. It's hard to believe the functionality couldn't be tested with quicker-running problems. Perhaps have the option of the long test if you want it. A typical scenario is that I get onto some new machine and find I have to install LAPACK, then I can kiss productive time goodbye waiting for the tests to run. |
| More automatic configuration, and some basic tuning. |
| Place in autoconfig construct as a package in itself. Place in autoconfig construct with scalapack, lapack, blas: `./configure --with[out]-(scalapack,lapack,blas) --with-blas=ibm_essl` |
| If you can have binary version for all the systems, it will be great. |
| not much, current method is fairly straightforward |
| I have not done this recently. Several years ago, I had difficulty using the compiler that came with my linux distribution (I think either atlas or lapack required an older version). |
| Does it use GNU autoconf and automake? |
| rpm, if possible |
| it's simple enough |
| LAPACK always seem to compile fine, but testing routines often fail for non-obvious reasons. |
| 1. I always compile and create the libraries manually. Provide a script or makefile that does just that. The official installation process is confusing and gets stuck. 2. provide standard ./configure make make install but make sure it actually works on (nearly) everything |
| The LAPACK installation process itself isn't bad once you've done it a few times. What would be really nice is if it followed the GNU style "./configure ; make ; make check ; make install" sequence. I install tons of software for the team I support and really appreciate those tools that use familiar and consistent configure scripts & build env variables (CFLAGS, etc). Another pain--and this isn't LAPACK's fault--is building an ATLAS-enhanced LAPACK. Sure, it is easy once you've done it a few times but the process is really non-standard. |
| It is ok. |
| Precompiled downloads work for me |
| LAPACK currently has no configuration support. Manually editing makefiles is fine for most wonks like us, but it's not very "professional" for an NSF project funded at LAPACK's level. Autoconf configure scripts would be very slick and help provide built-in support for building LAPACK on various frequently-used architectures. |
| dlamch tuning could be simplified? |
| I have not seen any problem with the installation process |
| configure script rather than multiple makefiles + user choice |
| It should build shared libraries, not just static. |
| Unless the architecture is new, there seem to be few problems. Even with ScaLAPACK, I have encountered few problems in this respect. It was very well done. |
| Many recent compilers seem too clever for automatic detection of parameters like machine epsilon to work. It would be useful to have a replacement that uses Fortran 90 intrinsics instead. |
| it always worked fine for me |
| Make the default install double & double complex only. Improve the test suite (it fails eg on some compilers when attempting to discern the machine accuracy/tolerances). |
| Export .exp files for use with MSVC++ on Windows would be slightly nice to have, though this is a minor issue. |
| No problems with installation... |
| Been a long time since I did this - Think it's OK as is. |
| ok as is |
| autoconfig!! Manually tweaking makefile options and manually installing the library is a pain. This would also allow for more reasonable shared library support. Also, go ahead and make a LAPACK 3.1 or 3.0.1 (or whatever you'd call it) release. Downloading the lapack-3.0 sources and then manually copying all of the patch files on top of the originals is a pain. |
| configure / make |
| Installation is not a problem. |
| see Scalapack comments below. |
| I feel the installation process was extremely simple. I can't suggest any improvements. |
| Maybe an ATLAS-like procedure that reduced the optimization levels only on routines that tend to fail on a given arch/compiler combination and optimized the rest. |
| I know just enough about the install process to get into trouble. I originally tried to install just the single and double precision routines, but the install would not complete. It was easier to just default the install to create the full library with all four precisions/kinds |
| One could consider to provide makefiles for some popular platforms and compilers like Intel compilers. |

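
Several responses above mention that detecting machine parameters such as machine epsilon at run time (the job of `dlamch`) can be defeated by optimizing compilers, and one suggests asking the language for the value instead, as Fortran 90's `EPSILON` intrinsic does. The contrast can be sketched in Python, purely as an analogy. `probe_epsilon` is an illustrative name that mimics the probing approach, while `sys.float_info.epsilon` plays the role of the intrinsic. This is not how `dlamch` itself is implemented.

```python
import sys


def probe_epsilon():
    """Run-time probe: halve eps until 1 + eps/2 can no longer be told apart from 1."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


probed = probe_epsilon()
declared = sys.float_info.epsilon  # value reported by the platform itself

# On platforms whose Python floats are IEEE 754 doubles the two should agree.
print(probed == declared)
```

The probe depends on every intermediate result being rounded to working precision, which is exactly what aggressive compiler optimization can break in compiled code. A declared constant has no such dependency.
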
Question #15. How frequently do you refer to the LAPACK Users Guide?

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 7 | 3% |
| Frequently | 53 | 26% |
| Sometimes | 91 | 45% |
| Rarely | 40 | 20% |
| Never | 8 | 4% |

Question #16. What information in the LAPACK guide is hard to find or is missing, if any? | Responses |
|---|
| More examples would be nice (more examples for scalapack would be nice as well) | | N/A | | Information about sparse matrices could be better documented. | | Performance information is difficult to find. For instance, it took me
a while to confirm my observation that the Cholesky factorization using packed symmetric format were much slower than Cholesky factorizations using full dense format. | | Information is not missing; however, it can be difficult to find on netlib, and more examples are needed. Some time the documentation is difficult to understand (you can't assume that even the most
hard core of us know the nomenclature). | | It is difficult to find the names of the high-level functions. | | I would like to see a list of NAG routines which have corresponding LAPACK
equivalents. | | it's so hard to read, i would like to see examples. | | Sometimes hard to find the right function name for a given computation. | | It is ok. | | Syntax of routines, meanings and orders of parameters | | Not everybody nowadays knows what LDA is - this scares off
some especially young users of LAPACK | | The information is relevant and useful | | algorithmic aspects and details | | a precise description of the algorithm | | It's not clear from the section on storage schemes how non-square packed storage works (the examples are square, only). One has to figure out how it might work, eg for non-square packed band storage.
The use of m,n,k,etc in the few topmost SVD routines is slightly confusing. It's not immediately clear how to implement it to produce a "non-full-span" U (for U*S*V^T=A say) so as to be most efficient when solving least-squares with m>>n .
In the absence of infomation in the docs about whether the QR implementations (with or without column pivoting) are rank-revealing it becomes more prudent to always have to do SVD for least-squares problems with might be rank-deficient. It might be useful if the docs said more on this. But that's asking for more mathematical education in the docs, which I realize is a big request.
Links in the LUG on netlib from the instances of LAPACK function names to their nearby source locations on netlib would be very useful, since the specification and calling sequence infomation is in comments in the individual routines' source files.
NB.The LUG section Specifications of Routines (lug/node149.html), as it appears on netlib at least, is empty except for a brief note. Thus the Individual Routines sources' comments appear to be the specs.
Details of the encoding of the results of the Bunch-Kaufman-Parlett decomposition of symmetric indefinite matrices seems missing in the guide. This makes it near difficult to extract and form the individual factors computed by, say, dsytrf. I realize that this request is inefficient and unwise and unnecessary when solving systems, etc, but sometimes users just really want to get their hands on the explicit matrix factors and are willing to lose the efficient encoding which the dsycon, dsytrs, etc understand. Using the details as supplied in the comments in, say, dsytrf's source, is involved.
The approximation of condition numbers (Higham's modification of Hager's method) can be inaccurate. The documentation isn't very clear on how inaccurate it might be.
| | explicit examples | | Program samples would be nice. Especially from other languages. | | Always found what I need. | | See Scalapack comments below. | | The guide does not seem to include a discussion of the individual functions and their arguments. I often have to go to the Fortran source to find out which arguments a function takes. |
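
One response above notes that the storage-scheme examples are all square, which leaves the non-square band case unclear. As a rough illustration only, the sketch below applies the general band convention described in the LAPACK Users' Guide, AB(ku+1+i-j, j) = A(i,j) for max(1, j-ku) <= i <= min(m, j+kl), to a 3-by-5 matrix. The helper name `dense_to_band` and the 0-based indexing are assumptions made for this example and are not part of LAPACK.

```python
def dense_to_band(a, kl, ku):
    """Pack an m-by-n matrix with kl sub- and ku super-diagonals into
    LAPACK-style general band storage (hypothetical helper, 0-based).

    Entry a[i][j] lands in ab[ku + i - j][j]; unused slots stay None.
    """
    m = len(a)
    n = len(a[0]) if m else 0
    ab = [[None] * n for _ in range(kl + ku + 1)]
    for j in range(n):
        # Rows of column j that lie inside the band and inside the matrix.
        for i in range(max(0, j - ku), min(m - 1, j + kl) + 1):
            ab[ku + i - j][j] = a[i][j]
    return ab


# A 3-by-5 (non-square) matrix with one sub- and one super-diagonal.
a = [
    [11, 12, 0, 0, 0],
    [21, 22, 23, 0, 0],
    [0, 32, 33, 34, 0],
]
ab = dense_to_band(a, kl=1, ku=1)
for row in ab:
    print(row)
```

The band array always has kl+ku+1 rows and n columns, whatever m is; columns whose band rows fall outside the matrix simply leave their slots unused.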
ScaLAPACK Usage
Question #1. Do you use ScaLAPACK (or a vendor version of ScaLAPACK)?

| Response | Count | Percent |
|---|---|---|
| Yes | 93 | 45% |
| No | 113 | 54% |

Question #2. If you do not use ScaLAPACK, why?

| Response | Count | Percent |
|---|---|---|
| Not solving linear algebra | 32 | 41% |
| Cost of learning | 26 | 33% |
| Use another pkg | 19 | 24% |

Question #3. If using another package, which one(s)?

| Response | Count | Percent |
|---|---|---|
| PLAPACK | 10 | 43% |
| Matlab | 1 | 4% |
| MPIBNCpack | 1 | 4% |
| self written | 1 | 4% |
| PESSL (IBM) | 1 | 4% |
| LAPACK -- matrices are local to node | 1 | 4% |
| (we use coarse-grain parallelization above the numerical layer) | 1 | 4% |
| PETSc | 1 | 4% |
| LAPACK - SCALAPACK does not support SVD | 1 | 4% |
| LAPACK | 1 | 4% |
| I usually write my own solvers. | 1 | 4% |
| symmlq | 1 | 4% |
| SuperLU | 1 | 4% |
| Typically, I use pESSL, but use ScaLAPACK or PLAPACK when users have the need (prefer PLAPACK, especially to build a custom solver/routine). | 1 | 4% |

Question #4. If you use ScaLAPACK, do you use a vendor's version or one obtained directly from Netlib?

| Response | Count | Percent |
|---|---|---|
| Netlib | 46 | 48% |
| Vendor | 44 | 46% |
| Other | 4 | 4% |

Question #5. If you have used both a vendor's version of ScaLAPACK and Netlib's, how do the two versions compare?

| Responses |
|---|
| Vendor supplied is faster, but more buggy. |
| Vendor's version is definitely faster (on Compaq and IBM) |
| ScaLAPACK is comfortable in some sense because I have been using it (off-and-on) for so many years. However, the differences are all in performance, which is important, but my problems with LAPACK/ScaLAPACK all stem from the interface and that, of course, does not change. [Yes, I copied my response from the LAPACK response above ... but it does apply] |
| They are more or less the same, I just don't have to compile / keep manually track of updates with Debian's lapack. |
| Both - but I haven't compared their performances because other issues come into play, such as the version of mpich used to compile them, etc. |
| vendor's version is faster |
| Same as before : P-ESSL instead of ESSL |
| As for LAPACK, vendor's version is used to reach optimal performance and netlib's one for portability. |
| Vendor implementations are faster. I have also found them to be more robust. We have had cases where the Netlib version fails but the vendor version works. |
| Comparable |
| They are essentially identical |

Question #6. Do your applications make direct ScaLAPACK calls?

| Response | Count | Percent |
|---|---|---|
| Yes | 85 | 74% |
| No | 29 | 25% |

Question #7. Do your applications use libraries which depend on ScaLAPACK?

| Response | Count | Percent |
|---|---|---|
| Yes | 37 | 34% |
| No | 70 | 65% |

Question #8. Do your applications use a higher-level interface to ScaLAPACK?

| Response | Count | Percent |
|---|---|---|
| Yes | 7 | 6% |
| No | 98 | 93% |

Question #9. If you answered yes above, which higher-level interfaces do you use?

| Response | Count | Percent |
|---|---|---|
| writing our own | 1 | 25% |
| I will use Python | 1 | 25% |
| module at PSC | 1 | 25% |
| mumps | 1 | 25% |

Question #10. Is the ScaLAPACK procedure interface a barrier to more extensive use?

| Response | Count | Percent |
|---|---|---|
| Yes | 39 | 39% |
| No | 61 | 61% |

Question #11. From which languages do you call ScaLAPACK routines?

| Response | Count | Percent |
|---|---|---|
| C | 37 | 38% |
| Fortran 90/95 | 26 | 26% |
| Fortran 77 | 22 | 22% |
| C++ | 12 | 12% |

Question #12. Please describe any tools or helper functions that you frequently implement to assist your applications in using ScaLAPACK.

| Responses |
|---|
| None, although I'd like some |
| Global Arrays |
| None. |
| I typically write "wrapper functions" that call LAPACK while hiding many of its arguments from the nonlinear solver or optimization solver. |
| None |
| MPI |
| We often diagonalize matrices of size about 1000~10000, or even larger. Storage of the matrices (and of the eigenvectors after the diagonalization) is usually split into stripes of the matrices. Before calling ScaLAPACK we have to transform the stripe distribution to the block cyclic distribution, and transform back from BCD to the stripe distribution after diagonalization. Is it possible for ScaLAPACK to make these processes automatic? |
| boost python bindings |
| own layer |
| na |
| Data distribution : the user should just provide pieces of matrices and set of indices, and the distribution should be automatic. Some comment should then be provided on possible performance degradation for some parameter values. |
| Lots of routines to change "shifts" into indexed calls. They are very small helper functions. Also, often I find that I have trouble making ScaLAPACK give me the data distribution I need. I'm fairly familiar with that part of ScaLAPACK, but perhaps this is a shortcoming on my part (maybe it could be done within ScaLAPACK more efficiently than I do it myself, but I have never felt I had the time to work it through; it's faster just to write my own). |
| OO wrapper; see above... |
| different data distribution models |
| Getting the matrix into a 2-D block cyclic distribution is non-trivial. |
| MPI |
| PMD (Parallel multi-domain Decomposition), see : http://www.idris.fr/data/publications/PMD |
| PBLAS redistribution routines: pdelset, pdelget, indxl2g etc. |
| BLACS_gridmap; MPI_wtime (among many other MPI subroutines) |

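
Several responses above describe hand-written helpers for converting between global indices and block-cyclic local indices. The sketch below shows the index arithmetic for one dimension, assuming 0-based indices and that the first block lives on process 0. The function names are made up for this illustration; ScaLAPACK's own tool routines (such as indxl2g, mentioned above) use 1-based indices and take additional arguments, so this is a simplified model and not a drop-in replacement.

```python
def owner(g, nb, nprocs):
    """Process that owns global index g (0-based, first block on process 0)."""
    return (g // nb) % nprocs


def global_to_local(g, nb, nprocs):
    """Local index of global index g on its owning process."""
    return (g // (nb * nprocs)) * nb + g % nb


def local_to_global(l, proc, nb, nprocs):
    """Global index of local index l held by process proc."""
    return (l // nb) * nb * nprocs + proc * nb + l % nb


def local_count(n, proc, nb, nprocs):
    """Number of the n global indices that process proc holds."""
    return sum(1 for g in range(n) if owner(g, nb, nprocs) == proc)


# Ten indices, blocks of two, dealt out cyclically to three processes.
n, nb, nprocs = 10, 2, 3
layout = [owner(g, nb, nprocs) for g in range(n)]
counts = [local_count(n, p, nb, nprocs) for p in range(nprocs)]
print(layout)
print(counts)
```

A 2-D block-cyclic layout applies the same mapping independently to row indices (over process rows) and column indices (over process columns).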
Question #13. How could the ScaLAPACK interface be improved to feel more natural to your application and implementation language? | Responses |
|---|
| Would be nice if there are tools or examples to help in setting up the matrix, distributing data, reading/writing the matrix | | A consistent interface for the QR routines with pivoting would be most useful. The public version of the QR routines seems to fail for one-column matrices.
| |
Same as LAPACK: F90/95 interfaces, overload
GET RID OF BLACS: Use MPI instead
Memory for diagonalizers often seems an awful lot, and sometimes stops jobs that I think should run from running. | | Better interface to Blacs. The redist utils are a great start, but they're poorly documented, and frequently when the system administrators install scalapack they don't even know that they should also build redist. | | 64-bit integer arguments are needed (Fortran Integer*8 or C/C++ long long). | | A more extensive, and *universal*, pblas library. Portability of code is of paramount concern (for maintainability), and many of the support routines are not universal, and therefore the underlying distributed data structures need to be deciphered and the necessary functions hand-coded. Very ppu (personal processing unit) time-consuming. | | It is very inconvenient to prepare the data for ScaLAPACK. Should be more natural, at least as natural as lapack. | | Some examples that call ScaLAPACK from C would be helpful. | | The required data layout was quite difficult to implement. | | The same: f77/C interfaces are sufficient. | | 1) There should be a function call to generate a global matrix descriptor using either a BLACS context or an MPI communicator.
2) Another function should generate a BLACS gridmap from an MPI communicator and vice versa.
3) Maybe the BLACS context should be taken out of the global matrix descriptor | | It is hard to sort out precisely how to split up and send a matrix to different processors. For example, in a situation where the matrix can be stored on one node, but ScaLapack is being used for speed-up of the linear algebra operations, there is no simple way to send the matrix to the nodes, perform the computation and get the matrix sent back. Having a routine to do that would be very useful, especially if it would automatically determine the optimal number of processors for the ScaLapack routines. | | object oriented
| | na | | The symmetric packed | | Automated workspace, more obvious routine names. | | The interface used by PLAPACK, which allows submatrices to be submitted in a transparent fashion, is far superior to the ScaLAPACK interface. | | Same as LAPACK: desperately needed abstraction from details, memory allocation, etc. | | Getting matrix packings right is quite annoying. | | Give a set of routines to help distribute the data.
The packed format is for us a key feature. | | The matrix layout and communication seems difficult. | | The problem is that once you've used ScaLAPACK for a while, you get used to the shortcomings of the package. Often people who have used other packages (Trilinos subsets, PLAPACK, Global Array, etc.) are not willing to get used to things. Any step towards an interface like these would be helpful, especially to new users. | | see above | | Heavy use of C++ | | The data distribution is difficult. PLAPACK allows this to happen in
a much more natural way where the user does not have to worry about
placing the data on the nodes (in the case of clusters), PLAPACK does it
in the background for you.
| | Some object-orientation in terms of matrix types (maybe bundling the array descriptors with the matrix, etc.) would be useful from a C++ application's perspective.
| | similar to the call in serial jobs | | It's not the interface per se that causes me problems, it's the functionality. I need
a tridiagonal or band diagonal matrix solver which will allow for a two-D data
distribution and dedicated IO nodes. The ScaLAPACK-based IBM library PESSL
only allows for a 1D data decomposition, as (I think) does ScaLAPACK itself.
Several years ago I got a 2D data decomposition to work on a Hitachi system
by passing MPI communicators to the BLACS grid initialization routines. It would
be great if this were standardized across platforms. | | The key problem of using ScaLAPACK (as with other similar libraries) is data distribution. Block cyclic distribution is usually hard to stick to during a calculation. Perhaps enriching the types of data distribution, or letting users define/describe their own data distribution, could be a more natural way (but at what cost?). | | Get rid of the dependence on BLACS. BLACS contexts in particular
are unwieldy and difficult to use and understand. | | I think that the use of block cyclic distribution of dense matrices is
a little bit complicated. Thus it would be nice to find routines that help to distribute matrices. | | The biggest problem is the errors that are given when the workspace is too small. The message that comes from ScaLAPACK is often incorrect and says that the problem is due to an incorrect argument to a routine. I would also like to use ScaLAPACK in a way that allowed several parallel diagonalisations to be carried out in parallel. This is mentioned in the BLACS documentation but does not seem to work. | | Make it easier to use subsets of MPI_world in parallel LAPACK operations. |
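
Many of the responses above ask for a routine that takes a matrix held on one node, splits it into 2-D block-cyclic pieces, and later reassembles it. The following is a minimal serial sketch of that bookkeeping, assuming 0-based indices, a process grid whose first block sits on process (0, 0), and matrices stored as Python lists of rows. The names `scatter_block_cyclic` and `gather_block_cyclic` are hypothetical; no communication is performed, and real ScaLAPACK local arrays are column-major Fortran arrays described by array descriptors.

```python
def scatter_block_cyclic(a, mb, nb, nprow, npcol):
    """Split a dense matrix (list of rows) into per-process local pieces
    using a 2-D block-cyclic layout on an nprow-by-npcol grid."""
    m, n = len(a), len(a[0])
    local = {}
    for pr in range(nprow):
        # Global rows owned by process row pr, in local order.
        rows = [i for i in range(m) if (i // mb) % nprow == pr]
        for pc in range(npcol):
            # Global columns owned by process column pc, in local order.
            cols = [j for j in range(n) if (j // nb) % npcol == pc]
            local[(pr, pc)] = [[a[i][j] for j in cols] for i in rows]
    return local


def gather_block_cyclic(local, m, n, mb, nb, nprow, npcol):
    """Inverse of scatter_block_cyclic: rebuild the m-by-n global matrix."""
    a = [[None] * n for _ in range(m)]
    for (pr, pc), piece in local.items():
        rows = [i for i in range(m) if (i // mb) % nprow == pr]
        cols = [j for j in range(n) if (j // nb) % npcol == pc]
        for li, i in enumerate(rows):
            for lj, j in enumerate(cols):
                a[i][j] = piece[li][lj]
    return a


# 4-by-6 matrix, 1-by-2 blocks, 2-by-2 process grid.
a = [[10 * i + j for j in range(6)] for i in range(4)]
pieces = scatter_block_cyclic(a, mb=1, nb=2, nprow=2, npcol=2)
restored = gather_block_cyclic(pieces, 4, 6, 1, 2, 2, 2)
print(pieces[(0, 0)])
```

In a parallel code each `pieces[(pr, pc)]` would be sent to the process at grid position (pr, pc); the sketch only shows which entries go where.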
Question #14. If you have installed ScaLAPACK yourself, how could the installation process be improved? | Responses |
|---|
| Installation can be complicated, involving BLACS, PBLAS, MPI, and interfacing C and Fortran 77/90 compilers. A common problem is 1 versus 2 underscores in symbol names. Perhaps there can be C interface stubs to accommodate both versions. | | Definitely can be improved compared to GNU and R packages or even PETSc which requires more work than the other two. | | It's pretty good, although I recall having to go through a lot of configuration tweaking to make it work. Autoconf might help here.
| |
It is still not at the
1. configure (compiler inquiries)
2. make
3. make install
stage. What would be useful is a script that searches for
dependent libraries and then gives you the option
'Cannot find BLACS library'
1. Enter directory path
or
2. Enter a directory path where I can install my own.
| | It will be very helpful to make the distribution of matrix in columns instead of blocks. | | The distribution should contain BLACS. | | The installation is TERRIBLE beyond description and requires major work. Test it on a variety of machines before release!! And with various installations of MPI on the same machine!
There is a mixture of "mpi.h" and angle-bracket mpi.h includes in BLACS that makes installation fail if there is a /usr/include/mpi.h which is different from the one specified in Bmake.inc
Please provide the standard ./configure, make, make install
| | na | | ok | | I have not met any problem with the installation. | | Please dump the makefiles and provide a simple compile script. I have had to get help from our unix experts in the university to get it to compile on our unix cluster using makefiles. No one in the university has been able to get it to compile on a PC win32/win64 cluster which does not use makefiles. Has anyone ever done this? If the answer is yes please email me details m.e.honnor@durham.ac.uk, thanks | | Again, very professional, painless installation. Now, sometimes we hit bugs when the test suites are run, but the distribution of bugs includes ScaLAPACK, other install libraries, etc. | | Did not get it to run, stopped bothering, handpicked some other
routines from netlib. | | Our users report difficulties in getting compatible and efficient
versions of ScaLAPACK/BLACS/MPI running correctly on their computers. | | Maybe a configure-like installation could let people avoid configuring the Bmake.inc file by hand. This means automatic configuration of the C-Fortran calling interface and the MPI implementation, for example. | | Clearer options for compiling Single/Double/Real/Complex libraries only.
Better and clearer validation routines. | | The collection of makefiles could be extended. | | The interaction between BLACS, PBLAS and ScaLAPACK. The installation process is made difficult by the fact that it seems necessary to compile several times to get the underscores correct. | | Updated for new architecture and Fortran90 |
Question #15. How frequently do you refer to the ScaLAPACK Users Guide?

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 10 | 11% |
| Frequently | 20 | 22% |
| Sometimes | 36 | 40% |
| Rarely | 18 | 20% |
| Never | 6 | 6% |

Question #16. What information in the ScaLAPACK guide is hard to find or is missing, if any?
Targeted Environment Specifics
Question #1. Under which operating system environments do your applications run?

| Response | Count | Percent |
|---|---|---|
| Linux | 190 | 29% |
| AIX | 85 | 13% |
| Solaris | 55 | 8% |
| Windows (other) | 48 | 7% |
| IRIX | 48 | 7% |
| Windows (cygwin) | 43 | 6% |
| HP/UX | 34 | 5% |
| Mac OS X | 34 | 5% |
| Tru64 | 34 | 5% |
| Unicos | 29 | 4% |
| BSD | 17 | 2% |
| MS Visual C++ | 1 | 0% |
| NEC | 1 | 0% |
| via SGI O3K & Cray XT3 | 1 | 0% |
| Unicos/mp | 1 | 0% |
| mingw | 1 | 0% |
| vpp5000 specific | 1 | 0% |
| Windows XP | 1 | 0% |
| SUPER-UX | 1 | 0% |
| MAC OS X | 1 | 0% |
| Other (Undisclosed) | 1 | 0% |
| xt3 catamount | 1 | 0% |
| Mac OSX | 1 | 0% |
| Linux on all of (x86-64 x86-32 IA-64) | 1 | 0% |
| MacOS X | 1 | 0% |
| RT | 1 | 0% |
| opteron-based systems | 1 | 0% |
| OS X | 1 | 0% |

Question #2. If your applications run in a shared-memory environment, which styles of parallelism do they employ?

| Response | Count | Percent |
|---|---|---|
| Platform threading | 27 | 45% |
| Multiple | 19 | 31% |
| Concurrent programs | 12 | 20% |
| Components | 2 | 3% |

Question #2a. Please specify any particular libraries or frameworks used.

| Response | Count | Percent |
|---|---|---|
| openmp | 9 | 32% |
| MPI | 4 | 14% |
| pthread | 2 | 7% |
| java.util.concurrent | 1 | 3% |
| essl, sunperf, sgimath | 1 | 3% |
| MPI implementation | 1 | 3% |
| I use OpenMP. | 1 | 3% |
| OpenMP, MPI | 1 | 3% |
| IBM smp essl | 1 | 3% |
| R and R contributed | 1 | 3% |
| BRL-CAD | 1 | 3% |
| Don't know | 1 | 3% |
| MPI, pthreads, OpenMP | 1 | 3% |
| OMP | 1 | 3% |
| mpi using shared memory for shared objects | 1 | 3% |
| Global Arrays (that uses System V shared memory) | 1 | 3% |

Question #3. If your applications run in a distributed-memory environment, which styles of parallelism do they employ?

| Response | Count | Percent |
|---|---|---|
| Message passing | 134 | 99% |
| Widely-distributed | 1 | 0% |

Question #3a. Please specify any particular libraries or frameworks used.

| Response | Count | Percent |
|---|---|---|
| MPI | 5 | 33% |
| PLAPACK | 2 | 13% |
| Trilinos | 1 | 6% |
| communication over Unix sockets | 1 | 6% |
| Direct calls to the BLACS | 1 | 6% |
| I need PVM. | 1 | 6% |
| petsc | 1 | 6% |
| Global Arrays | 1 | 6% |
| HDF5 | 1 | 6% |
| mobile object library, Clam (from William and Mary) | 1 | 6% |

Additional Information
Question #5. Description of related activities? | Responses |
|---|
| programmer and maintainer of ACESII computational chemistry package | | Compact storage for scalapack.
Out of core extension for scalapack. | | Computational condensed matter physics | |
| | Working on density matrix techniques as eigensolver replacements in quantum chemistry. This relies heavily on (sca)lapack. | |
Electron-impact excitation/ionization of atoms
for modelling of fusion diagnostics and experiments.
From a computational perspective, it involves the repeated
diagonalisation (ie 40-70 times) of symmetric
matrices in excess of 50,000 in which ALL eigenvectors
and ALL eigenvalues are required. The physics of electron
scattering of relativistic targets will drive the size
of these matrices upward by at least a factor 5 in coming
years.
I need efficient I/O and stable diagonalisation. I hope that
the next generation of ScaLAPACK will not be as demanding
on memory requirements.
I.e. a 40K matrix is the maximum I can diagonalise
over 25 Opteron processors each with 2 GB of RAM.
| | Development of the computational chemistry software NWChem
http://www.emsl.pnl.gov/docs/nwchem | | I have been working on parallel sparse solvers such as incomplete factorization preconditioners for iterative methods. In such preconditioners I sometimes exploit dense matrix computation to achieve better performance, but the cost of communication is typically more important for the method of interest. | | We make limited use of LAPACK in the GYRO code. The
dominant matrix structure is sparse, so for that we
use UMFPACK. | | electronic structure calculation | | Research on Computational Condensed Matter Physics.
| | Numerical optimization solvers that I develop and distribute target applications in nonlinear and semidefinite programming. I use LAPACK
to factor dense matrices that are usually the Schur complements of
much larger indefinite matrices. Parallel versions of these solvers use ScaLAPACK. Users of our software must also link to these packages. | | Computational nuclear theory (static Hartree-Fock, time-dependent Hartree-Fock,
Hartree-Fock-Bogoliubov, Dirac equation, ...) | | function parameter fitting
principal component analysis
structural superposition | | Condensed matter theory; scientific code developments | | quantum chemistry code development (q-chem program) | | We are an Electronic Structure Physics group of Northwestern University engaging in the numerical calculations and simulations of materials.
| | Computational Electromagnetics | | quadratic and nonlinear programming | | statistical computations | | Research on statistical machine learning. | | Finite Element methods in CFD. http://www.cimec.org.ar/petscfem | | Numerical methods for Atmospheric Dynamics
| | Time series data analysis; deconvolution | | Numerical simulation of reactive flows (CFD and Bifurcation Theory) | | Library routine development, User collaborations (consultant work) | | Optimization, approximation and cubature, distribution of points on manifolds, mathematical finance | | Application of Multiple Precision Numerical Computation | | biostatistics bioinformatics | | We write code for quantum mechanics, which does repeated generalised eigensolving. Eigenvectors from one iteration are generally good initial guesses to the next, but we can't make any use of this in ScaLAPACK. For large systems our matrices get sparse and so we are looking into using PARPACK or something along those lines for those. Our biggest problem with ScaLAPACK is memory, not speed. | | Dense library development | | Numerical simulation of fluid flow in oil reservoirs. | | PhD-student in wavelet methods, I only use LAPACK for routines that I do not feel like implementing myself. Running time is not really an issue. The large-scale matrix operations are so sparse and specific that I implement them myself. | | CAE consultancy shop, with emphasis on CFD. | | BEM application developer | | I am mostly talking about my experience working with application groups. | | We use ScaLAPACK in a boundary element method application. We also have finite element method applications which use sparse solvers (both iterative and direct) which in turn use BLAS and LAPACK heavily.
| | Image processing, estimation theory | | Condensed matter physics, high-temperature superconductivity | | development of software for large-scale numerical optimization, semidefinite programming, sparse and dense | | Not sure what you're asking - if you're asking about the nature of the application
it's fluid dynamics in the interior of the Sun. | | Finite element application | | Finite element application | | Our goal is to incorporate the lapack routines in our own multi precision environment MPL. | | FEM/MOM formulation for electromagnetic code | | I work on the development and implementation of multilevel iterative algorithms on various parallel machines. The underlying linear systems are large and sparse, and so LAPACK routines are typically used for various local (serial) computations in a distributed memory environment. Increasingly we are using shared memory nodes within large clusters, however, hybrid memory models have yet to pay off. This may change as OpenMP and other threading capabilities improve. | | Use D H Bailey's MP and Yozo's QD - for extra precision. | | PDE's in control, boundary control, conservation laws | | Sparse linear algebra | | CFD Applications | | FEM, sparse generalized eigenvalue problems | | optimization | | Financial industry, statistics, data mining, machine learning. | | Computing science in fluid dynamics and heat transfer.
Cluster and grid computing. Great use of MPI-2, Atlas, Lapack, Scalapack and Spools from Fortran and C. | | Atomic, Molecular and Optical Physics, Computational Chemistry. | | I work on the numerical investigation of geophysical fluid instabilities and predictability. I am especially interested in the relationship between ensemble forecasting and geophysical fluid instabilities. | | Computational Fluid Dynamics, solution of ODEs/PDEs | | I use lapack routines to develop applications from the area of signal processing. Esp. I'm interested in recursive filters. | | Signal processing | | Materials modelling
ab-initio modelling (density functional theory) | | Most of my current work is with sparse matrices coming from finite element discretizations. | | Quantum-chemistry. Iterative eigenvalue solves. | | Our main CFD code does not use LAPACK. A stand-alone application of ours does, however. |
Question #6. Additional Comments/Suggestions? | Responses |
|---|
| In benchmarking my application that needs eigenpairs of a double precision symmetric matrix, I found that some architectures seem to have much greater overhead associated with function calls. In particular, I was comparing the different nodes on cheetah.ccs.ornl.gov that handle batch jobs (p690 1.3ghz) and interactive jobs (p655 1.7 ghz). I found that the speed was worse (factor of 1.8) on the batch node than one would expect by scaling to the difference in clock speed (factor of 1.3). I eventually convinced myself that the difference was caused by a huge number of function calls in inner loops, particularly the functions ROTG and DLARTG (e.g. 7 million calls per call to DSTEQR working with a 3750x3750 matrix). I found two ways to solve the problem. One was to recompile Lapack and my application using the highest optimization level (not recommended by the ORNL support folks because it takes so long for typically little benefit), which allows inlining of functions that exist in different source files. The other way was to put copies of ROTG and DLARTG in the source file that was generating so many calls. This allowed a lower optimization level to do the inlining.
I don't know if there is any good solution other than recommending the use of a certain optimization level when compiling libraries and applications. Perhaps a preprocessor that would insert copies of functions into source files where they are called in inner loops? | |
I'm not a primary customer and more weight should be given to the comments from apps users. Mainly I'm filling this in so that the Lapack team will know they have yet another customer. :-) | | Would be nice if the technology in HPL (look ahead, recursive options) can be merged back into PZGETRF/PZGETRS. Currently HPL is only in double in C so double complex is not available. HPL works on rhs but conceptually an extra pass over L should make it compatible with PxGETRF.
General symmetric linear solver (LTL') (perhaps in packed storage as well) would be nice.
Since performance of PBLAS is crucial for scalapack, it would be nice if there are parameters to tune PBLAS, especially on triangular solvers.
More examples or tutorials (like PETSc) to help new users to use scalapack.
Better interfaces for scalapack with other iterative linear solvers such as PETSc or Object oriented CCA technology. | | Lapack is one of the best things that has ever happened to me. Every one of the lapack workers that I've interacted with has bent over backwards to be helpful. This survey is yet another example of how the lapack people care about being useful to their community. If you ever need someone to write glowing letters of support for grant applications (particularly someone from a US National Lab other than ORNL), don't hesitate to contact me at the address above. | | I am currently not using ScaLAPACK but I probably should. I would like to spend some time thinking about how to best integrate with C++.
I would also like to discuss whether the way to use BLAS and LAPACK from C++ can be standardised. Several people have asked me for help, and I'm offering the interfaces we developed in the psimag toolkit. It would be nice if we could decide on properly supporting one model. | | I hope that whatever changes are made to "modernize" lapack are made in a manner that augments the current basic procedural functionality rather than replacing it. | | Additional sparse/banded matrix routines would be great, but
I realize that opens up a new can of worms. | | Could you please install PVM? I really need that for my research. Thank you very much! | | general question 8 is a bit unclearly stated. | | The most frequent difficulty that my users encounter while linking their application to my optimization solvers and LAPACK involves language interoperability. Despite our best efforts to help them configure things properly, calling the Fortran routines from C or C++ continues to frustrate users who are uninterested in these subtleties of computer science.
I would also like to see the BLAS1 include y = alpha y + x, w = alpha x + beta y, and BLAS2 include alpha = x' A x where x is a column vector and A is a symmetric matrix.
| | First of all, many thanks for developing LAPACK. It is very important to our group
(computational nuclear theory).
It would be nice to get a genuine Fortran 95 /2003 LAPACK library WITHOUT
interface calls to Fortran 77 version of LAPACK. | | Thank you for your work on:
lapack
atlas
scalapack
pblas... | | thank you | | Lapack needs sparse linear algebra routines and search/sort routines. | | Currently I am working with a vectorization specialist at a supercomputer center to try to port ScaLapack to my codes. The learning curve for getting it running seems to be fairly significant. Our major bottleneck is communications between nodes to set-up the original matrix and to get the results once the calculation is completed. It doesn't seem like this should be so difficult to handle, but we are unable to improve upon it at the moment.
Being able to use ScaLapack in the same user friendly way as Lapack is used would be a great advance. It currently doesn't seem to be there. | | We at the Seminar for Applied Mathematics at the
ETH Zurich are entrenched LAPACK fans!
Thanks for the great work. Keep it up! | | Re ScaLAPACK: while I am not now using it my need for the functionality it provides has risen sharply in the last several months and I expect I will be diving in to it soon. I expect to use it via a matlab interface and will need to locate a suitable interface or roll my own. | | Online lapack, scalapack guides are useful, most of the time, I just google the routine I need
cholesky update missing in scalapack
large (banded) matrices require huge memory for linear solution.
superlu_dist is nice and should be incorporated
petsc is nice and should be better incorporated | | Few people seem to know this trick: if one has matlab installed on a system, one can link to the MathWorks-provided lapack libraries (called something like libmwlapack.a) which are highly tuned for a given architecture. Some operations are 3x faster than building lapack from source with the highest optimizations. I don't know if the MathWorks discourages its customers from doing this but this kind of tip in my opinion belongs in a lapack FAQ or related document. | | Calculation of gradient information of objectives involving log det of matrix valued function and constraints involving solution of linear system are much faster with explicit inverse of symmetric positive definite matrix. These codes involve element
by element multiplications of inverse with derivatives of matrix elements.
Many vendor implementations multi-thread Cholesky DPOTRF, but not DPOTRI explicit inverse using this Cholesky, producing a bottleneck on local 2 and 4 way SMP nodes. ATLAS does include multithreaded DPOTRI. | | Would very much like to see the RRR tridiagonal eigensolver in ScaLAPACK.
Thanks for providing the survey! | | Kindly circulate the comments that I had sent to Jim and Jack earlier. | | LAPACK is great for not very large systems, the same cannot be said about ScaLAPACK for very large systems. For very large systems,
alternative approaches should be taken instead of a direct extension
of LAPACK. The size matters here. | | Would really like an FFT package in ScaLAPACK, as it is in PESSL. In fact, we are having to use FFTW 2.x in our porting of our image processing software from an IBM P4 system to the Cray XD1 system because we used PESSL's FFT but ScaLAPACK doesn't have it. We are a bit concerned about using FFTW 2.x because it isn't the latest version, but the latest version doesn't have distributed memory FFTs. It would be nice to have it in ScaLAPACK. | | LAPACK is excellent software.
We also use a multiprecision build of CLAPACK, built in C++ using a customized class for 'double', say, with overloaded arithmetic operators and runtime-determined substitutes for the LAPACK machine-constants routines. Building this requires some "clean up" of the CLAPACK sources, which makes getting updates/fixes more involved. It would be great if this could be done more easily, without any need for minor editing, e.g. by using temp vars and avoiding calls involving explicit float args like foo(...,1.0,...) or comparisons like bar > 1.0 . The ability for anyone to easily build a quad- or multiprecision (GnuMP based, say) version of LAPACK might be generally well received. It's not immediately clear that all LAPACK routines (e.g. SVD) will remain robust (e.g. in convergence) when treated in this way. Of course, multiprecision LAPACK brings with it many involved questions about how best to leverage double-precision solutions as initial candidate solutions for arbitrarily high precision computation.
| | Need more user-friendliness. | | We are mostly using the out-of-core version of ScaLAPACK. Since it's only a prototype code, documentation is limited. I would like to see an official out-of-core version of ScaLAPACK. Also, partial factorization for both in-core and out-of-core ScaLAPACK would be nice. | | I think one of the frustrating parts of the LAPACK libraries is
the build and patch system. I think moving to a modern
revision control system (such as subversion), along with
a good set of open-source development tools accessible
through the web, introducing supported language bindings
(e.g., python), and finally a more reasonable build system is critical to the next generation of users.
| | 1. A quad-precision version of LAPACK would be very useful.
2. After using the matrix solvers in LAPACK for many years, I am convinced there is a deep problem that exhibits itself in the SVD routines. | | I like Lapack, thank you. | | Functionalities for symmetric indefinite matrices
are missing in ScaLAPACK. | | The documentation (man-pages) of LAPACK could be kept more up-to-date wrt. the specification of workspace sizes. We have observed problems in the past due to this information being outdated in some cases.
Another problem concerns error handling: we'd rather not have LAPACK call the standard XERBLA routine, which terminates the program. Unfortunately, one cannot replace it when using shared libraries. It would be great if there were e.g. some kind of global switch that forced the standard XERBLA to do nothing, so that the called LAPACK function can return the error code in INFO to the application, where it can be handled in a specific way. | | Scalability and performance of the ScaLAPACK routines are very impressive compared to other libraries, but the routines are still not easy to use given the data distributions dictated by the application algorithms.
Many thanks and wish you all the best. | | I would be very interested in using the MRRR (Multiple
Relatively Robust Representations) eigensolver algorithm.
I would be very grateful if you could feedback to me any
plans you have to incorporate this into future
releases of Scalapack, as this could influence the direction
of my future research. | | 1. I think that Sca/LAPACK should offer support for solving systems with Toeplitz matrices. Recently I have developed some algorithms for solving linear systems with banded triangular Toeplitz matrices (both versions using OpenMP, MPI and Level 2 & 3 BLAS routines). Please let me know if you think that they could be useful. Also see:
P. Stpiczynski: Numerical evaluation of linear recurrences on high performance computers and clusters of workstations, In: Proceedings of PARELEC 2004, IEEE Computer Society Press, 2004, 200-205.
P. Stpiczynski: Solving linear recurrence systems using level 2 and 3 BLAS routines, Lecture Notes in Computer Science 3019 (2004) 1059-1066.
2. Support for vector processing could be improved in the case of multiple right-hand-side vectors (instead of repeating a simpler solver for one right-hand-side vector).
3. Recently I have developed a triangular matrix solver which uses an alternative data distribution (P. Stpiczynski: Parallel Cholesky factorization on orthogonal multiprocessors, Parallel Computing 18 (1992) 213-219), which is faster than the original ScaLAPACK routine. I believe that this idea can be applied to produce a faster Cholesky factorization. Currently I'm working on it. | | Please continue to develop these packages as they are of immense value to the research that we perform. |
Question #7. Use DOE-lab resources? | Response | Count | Percent |
|---|
| Yes | 30 | 46% | | | No | 34 | 53% | |
Question #8. Use HPCS resources? | Response | Count | Percent |
|---|
| Yes | 4 | 80% | | | No | 1 | 20% | |
|