### General

Question #1. Is dense linear algebra a performance bottleneck in your applications?

| Response | Count | Percent |
|---|---|---|
| Yes | 157 | 76% |
| No | 48 | 23% |

Question #2. How often do your applications use the arithmetic precisions listed below:

a. Single precision:

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 17 | 8% |
| Frequently | 22 | 10% |
| Sometimes | 39 | 19% |
| Rarely | 56 | 27% |
| Never | 67 | 33% |

b. Double precision:

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 185 | 86% |
| Frequently | 23 | 10% |
| Sometimes | 4 | 1% |
| Rarely | 3 | 1% |
| Never | 0 | 0% |

c. More than double precision:

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 9 | 4% |
| Frequently | 15 | 7% |
| Sometimes | 24 | 11% |
| Rarely | 63 | 31% |
| Never | 91 | 45% |

d. Complex single precision:

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 5 | 2% |
| Frequently | 17 | 8% |
| Sometimes | 34 | 17% |
| Rarely | 45 | 22% |
| Never | 98 | 49% |

e. Complex double precision:

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 80 | 38% |
| Frequently | 29 | 14% |
| Sometimes | 36 | 17% |
| Rarely | 17 | 8% |
| Never | 45 | 21% |

f. Complex, more than double precision:

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 4 | 2% |
| Frequently | 12 | 6% |
| Sometimes | 16 | 8% |
| Rarely | 41 | 20% |
| Never | 124 | 62% |

Question #3. What dense matrix sizes are most important or time-consuming for your application?

| Response | Count | Percent |
|---|---|---|
| 1,000s X 1,000s | 56 | 26% |
| 10,000s X 10,000s | 50 | 23% |
| 100,000s X 100,000s | 27 | 12% |
| 100s X 100s | 20 | 9% |
| 1,000,000 or more X 1,000,000 or more | 16 | 7% |
| 10s X 10s | 9 | 4% |
| 1,000s X 100s | 5 | 2% |
| 10,000s | 4 | 1% |
| 100,000s X 1,000,000 or more | 3 | 1% |
| 100s X 10,000s | 3 | 1% |
| 1,000,000 or more | 3 | 1% |
| 1,000,000 or more X 100s | 2 | 0% |
| 100,000s | 2 | 0% |
| 100s | 2 | 0% |
| 1,000s | 2 | 0% |
| 10,000s X 100,000s | 2 | 0% |
| 1,000s X 10s | 1 | 0% |
| 100s X 1,000s | 1 | 0% |
| 10,000s X 1,000,000 or more | 1 | 0% |
| 100,000s X 1,000s | 1 | 0% |
| 100s X 10s | 1 | 0% |
| 10,000s X 1,000s | 1 | 0% |
| 100,000s X 10,000s | 1 | 0% |

Question #4. Does your application come close to, or run out of, memory on important problems?

| Response | Count | Percent |
|---|---|---|
| Yes | 150 | 69% |
| No | 66 | 30% |

Question #5. Number of processors used for your application:

a. SMP:

| Response | Count | Percent |
|---|---|---|
| Less than 10 | 114 | 59% |
| More than 10 | 57 | 29% |
| More than 100 | 14 | 7% |
| More than 1,000 | 4 | 2% |
| More than 10,000 | 2 | 1% |

b. Distributed shared-memory:

| Response | Count | Percent |
|---|---|---|
| Less than 10 | 80 | 44% |
| More than 10 | 49 | 27% |
| More than 100 | 37 | 20% |
| More than 1,000 | 11 | 6% |
| More than 10,000 | 1 | 0% |

c. Distributed memory:

| Response | Count | Percent |
|---|---|---|
| Less than 10 | 53 | 28% |
| More than 10 | 51 | 27% |
| More than 100 | 54 | 28% |
| More than 1,000 | 25 | 13% |
| More than 10,000 | 5 | 2% |

Question #6. Which architectures do you use or intend to use in the next three years?

| Response | Count | Percent |
|---|---|---|
| Distributed-memory | 148 | 20% |
| Sequential | 124 | 17% |
| Hybrid-shared | 98 | 13% |
| Distributed-shared-memory | 71 | 9% |
| Symmetric-multi-procs | 55 | 7% |
| Vector-computers | 55 | 7% |
| Widely-distributed | 32 | 4% |

Question #7. Do you use any sequential or parallel dense linear algebra packages other than LAPACK or ScaLAPACK?

| Response | Count | Percent |
|---|---|---|
| no | 16 | 22% |
| PLAPACK | 14 | 19% |
| ESSL | 3 | 4% |
| BLAS | 2 | 2% |
| PESSL | 1 | 1% |
| linpack | 1 | 1% |
| scilib | 1 | 1% |
| CLAPACK | 1 | 1% |
| ESSL, CXML (LAPACK-based) | 1 | 1% |
| PLAPACK, Pliris | 1 | 1% |
| IBM ESSL | 1 | 1% |
| IBM's ESSL | 1 | 1% |
| Steven Kenny | 1 | 1% |
| NAG | 1 | 1% |
| super_lu_dist, petsc | 1 | 1% |
| mumps | 1 | 1% |
| Peigs | 1 | 1% |
| ESSL, NAG | 1 | 1% |
| NAG, IMSL, ESSL (IBM), ASL (NEC) | 1 | 1% |
| PETSc | 1 | 1% |
| Yes | 1 | 1% |
| PLAPACK, Trilinos, Parallel ESSL | 1 | 1% |
| Compact Numerical Methods / Pascal | 1 | 1% |
| none | 1 | 1% |
| Lee Samuel Finn | 1 | 1% |
| Arpack | 1 | 1% |
| ARPACK, BLZPACK | 1 | 1% |
| perhaps the 'parallel relatively robust representation' code. It has lower memory requirements, which favours a small cluster. | 1 | 1% |
| ARPACK and PARPACK | 1 | 1% |
| BNCpack, MPIBNCpack | 1 | 1% |
| fftw | 1 | 1% |
| ATLAS, in-house modular LA packages for rational (not float) problems | 1 | 1% |
| BLAS of course | 1 | 1% |
| ATLAS | 1 | 1% |
| some OpenMP blas | 1 | 1% |
| amd acml library, Intel mkl library | 1 | 1% |
| umfpack | 1 | 1% |
| BLAS, linpack, eispack (legacy code) | 1 | 1% |
| FLAME, PLAPACK | 1 | 1% |

Question #8. Please rank how useful the following features would be to your current or planned applications.

a. User defined matrix types:

| Response | Count | Percent |
|---|---|---|
| Very useful | 61 | 29% |
| Somewhat useful | 89 | 42% |
| Not useful | 57 | 27% |

b. Using optional arguments in the language interface:

| Response | Count | Percent |
|---|---|---|
| Very useful | 43 | 20% |
| Somewhat useful | 112 | 53% |
| Not useful | 54 | 25% |

c. Automatic memory allocation of the work space:

| Response | Count | Percent |
|---|---|---|
| Very useful | 134 | 63% |
| Somewhat useful | 54 | 25% |
| Not useful | 22 | 10% |

d. More complicated matrix data structures:

| Response | Count | Percent |
|---|---|---|
| Very useful | 70 | 33% |
| Somewhat useful | 102 | 49% |
| Not useful | 36 | 17% |

Question #9. Do your applications solve linear algebra problems of the following types?

| Response | Count | Percent |
|---|---|---|
| General linear systems | 144 | 19% |
| Linear positive definite systems | 115 | 15% |
| Symmetric eigenvalue | 110 | 14% |
| SVDs | 84 | 11% |
| Generalized eigenvalue | 79 | 10% |
| Least-squares problems | 74 | 9% |
| Banded linear systems | 72 | 9% |
| Non-symmetric eigenvalue | 55 | 7% |
| Cholesky updating | 1 | 0% |
| Symmetric complex | 1 | 0% |
| product problems (Schur, gen. Schur) | 1 | 0% |
| singular linear systems | 1 | 0% |
| Linear indefinite systems | 1 | 0% |
| complex symmetric linear systems | 1 | 0% |
| sparse matrices | 1 | 0% |
| Symmetric indefinite [G A; A' 0] structure | 1 | 0% |
| general complex inverse and determinant | 1 | 0% |
| control theory | 1 | 0% |
| matrix exponential | 1 | 0% |

### LAPACK Usage

Question #1. Do you use LAPACK (or a vendor version of LAPACK)?

| Response | Count | Percent |
|---|---|---|
| Yes | 198 | 92% |
| No | 15 | 7% |

Question #2. If you do not use LAPACK, why?

| Response | Count | Percent |
|---|---|---|
| Use another pkg | 12 | 50% |
| Not solving linear algebra | 8 | 33% |
| Cost of learning | 4 | 16% |

Question #3. If using another package, which one(s)?

| Response | Count | Percent |
|---|---|---|
| FLAME | 3 | 20% |
| ESSL | 2 | 13% |
| PLAPACK | 2 | 13% |
| CLAPACK | 1 | 6% |
| scalapack | 1 | 6% |
| matlab | 1 | 6% |
| NAG, IMSL, ESSL (IBM), ASL (NEC) | 1 | 6% |
| PARPACK | 1 | 6% |
| I usually write my own solvers. | 1 | 6% |
| ESSL for sequential work (though I do also use LAPACK when a customer requires it). | 1 | 6% |
| Also Custom | 1 | 6% |

Question #4. If you use LAPACK, do you use a vendor's version or one obtained directly from Netlib?

| Response | Count | Percent |
|---|---|---|
| Netlib | 133 | 48% |
| Vendor | 119 | 43% |
| Other | 20 | 7% |

Question #5. If you have used both a vendor's version of LAPACK and Netlib's, how do the two versions compare?

Responses
vendor lib decreases our runtime by 10%-20% (of calculations taking hours to days).
vendor's version may only have LU solver, not the full library
netlib's version is less convenient to use since it does not have Makefiles.
Very similar in performance provided you use an optimised BLAS
Roughly equal, given well optimized BLAS (e.g. Atlas)
Netlib's version provides the source code, very helpful in porting. I use vendor's LAPACK functions whenever possible.
This depends on the architecture/compiler: the AMD (ACML) libraries perform very well, though interacting with the 'fastsse' options on the Portland Group compiler can cause rare failures, so most ScaLAPACK/LAPACK libraries are a mixture of 'pre' and 'locally' compiled. At NERSC, linking to the 'essl' library (for DGEMM) has proved substantially faster than a user-compiled version.
I usually try to use the vendor's version first and if it is not available, costs too much, or for some other reason too much work, use Netlib's version next.
Mostly interchangeable. However, on occasion, the flexibility of configuring through ATLAS has been a key enabler.
Generally more optimized, although I have not separated out the BLAS performance (actually use of vendor LAPACK over reference netlib version is often a matter of convenience when using vendor BLAS).
very similar
No real differences.
Similar. Usually performance depends on BLAS, for which the Goto BLAS performs on par with vendor BLAS.
vendor version is substantially faster.
Vendor supplied typically factor 2 faster.
well.
Vendor is faster, but the vendor API is incomplete or undefined (LAPACK 1? 2? 3?)
On Opterons with Portland group compilers under Linux, Netlib version is at least order of magnitude worse. It's useful for testing on a PC though.
Presume vendor version provides better performance.
I use MKL on Windows machines and ATLAS on Linux machines. Both perform better than reference implementations.
The Intel MKL libraries are much easier to use than ATLAS and LAPACK. Despite what others say, I do not find ATLAS to be much faster than MKL BLAS (perhaps the difference is not sufficient to motivate me to change).
Vendor's usually does a better job of using cache
Prefer vendor's version.
I have run them on different architectures and so have not had the opportunity to do a very direct comparison.
Never compared.
Have not compared. Usually I have used the vendor's version assuming they had optimized BLAS, and maybe LAPACK. I also assume the vendor is more knowledgeable (and has more time) to experiment with the different compiler options. Have not actually verified my assumptions; I could be wrong.
Vendor's version somewhat faster
Vendor version is more convenient and sometimes optimized. However, my main reason for using a vendor supplied version is convenience in the cases where I don't need portability.
Sometimes the vendor version performs better.
Similar - Netlib self compiled version is often as good as/better than vendor supplied versions
Similar
The overall size of LAPACK library is big for some small platforms.
Vendor versions tend to be minimally tuned, but better tuned than netlib.
Vendor code is faster, but not more than 25%
Sun Performance Library (Solaris 8, 9) had too many bugs. We were on the phone for half an hour or more each time we called in bug reports to Sun and finally decided that their implementation was worthless; we stopped using it and reverted to netlib sources.
Vendor version is far better (faster).
Tuning netlib version can be annoying.
Often, e.g. on IBM SPs, the vendor version is better
no tests performed
Vendor better on Solaris
LAPACK is comfortable in some sense because I have been using it (off-and-on) for so many years. However, the differences are all in performance, which is important, but my problems with LAPACK/ScaLAPACK all stem from the interface and that, of course, does not change.
They are more or less the same; I just don't have to compile or manually keep track of updates with Debian's lapack.
Vendor is much faster
Vendor versions are *usually* faster, but not always.
vendor version is faster
Vendor's version (MLIB) is optimized much better and thus faster
Vendor version is more prone to errors but is usually faster. Having the netlib version available is a good check on the stability of the vendor version (especially the test suite).
The SGI computer has its own version of BLAS and LAPACK. It is faster.
Vendor releases seem only marginally optimized over a reasonable compilation of netlib's code. ATLAS libraries are critical for any real performance bottleneck.
Basically the same
I was never able to get the Netlib version to pass all its own tests. I have not submitted the vendor package to the same tests!
I use ESSL from IBM on a p690. At the link step, I ask first for ESSL, then for LAPACK for completeness. (ESSL does NOT contain all LAPACK subroutines.)
Vendor's version is used for performance and netlib's one for portability.
vendor's is faster
Vendor version is much faster
Comparison is difficult because I only use Netlib LAPACK on architectures for which I don't have access to a vendor version.
Higher speed for the vendor's version as expected. Same robustness (keeping in mind that I compile the Netlib version with medium optimization flags).
The vendor version is faster, and I use it because it is free in this case (AMD's ACML).
I use MKL provided by Intel. Level 2 BLAS routines from MKL are faster than
Generally the vendor versions are faster, showing considerable speedup over the Netlib versions.
vendor's version of LAPACK much better tuned to their platforms

Question #6. Do your applications make direct LAPACK calls?

| Response | Count | Percent |
|---|---|---|
| Yes | 181 | 84% |
| No | 32 | 15% |

Question #7. Do your applications use libraries which depend on LAPACK?

| Response | Count | Percent |
|---|---|---|
| Yes | 136 | 65% |
| No | 73 | 34% |

Question #8. Do your applications use a higher-level interface to LAPACK?

| Response | Count | Percent |
|---|---|---|
| Yes | 55 | 27% |
| No | 148 | 72% |

Question #9. If you answered yes above, which higher-level interfaces do you use?

| Response | Count | Percent |
|---|---|---|
| Matlab | 19 | 35% |
| R | 3 | 5% |
| SciPy | 2 | 3% |
| python | 2 | 3% |
| Some Matlab routines | 1 | 1% |
| the matrix template library (MTL) | 1 | 1% |
| in-house custom written | 1 | 1% |
| CLAPACK | 1 | 1% |
| Matlab, NAG | 1 | 1% |
| Scilab | 1 | 1% |
| numarray | 1 | 1% |
| Matlab mex-functions | 1 | 1% |
| FLAME interfaces | 1 | 1% |
| Mathematica, Apple's "Accelerate" framework | 1 | 1% |
| SciPy/Python, Boost ublas-lapack | 1 | 1% |
| PETSc | 1 | 1% |
| Matlab, Lapack95 | 1 | 1% |
| Matlab, IDL | 1 | 1% |
| FLAME | 1 | 1% |
| C++ bindings from boost-sandbox | 1 | 1% |
| Trilinos, my own C++ templated LAPACK wrappers | 1 | 1% |
| python, matlab, octave | 1 | 1% |
| MPL (our own multi-precision Matlab clone) and Matlab | 1 | 1% |
| LACAML | 1 | 1% |
| Matlab, PETSc | 1 | 1% |
| MTJ | 1 | 1% |
| MATLAB matrix ops | 1 | 1% |
| in-house C++ layer | 1 | 1% |
| Octave | 1 | 1% |
| python/scipy | 1 | 1% |
| LAPACK90 | 1 | 1% |

Question #10. Is the LAPACK procedure interface a barrier to more extensive use?

| Response | Count | Percent |
|---|---|---|
| Yes | 37 | 18% |
| No | 159 | 81% |

Question #11. From which languages do you call LAPACK routines?

| Response | Count | Percent |
|---|---|---|
| Fortran 90/95 | 126 | 25% |
| Fortran 77 | 107 | 21% |
| C | 104 | 21% |
| C++ | 75 | 15% |
| Matlab | 34 | 6% |
| Python | 16 | 3% |
| Octave | 10 | 2% |
| Fortran 2003 | 5 | 1% |
| Java | 4 | 0% |
| R | 3 | 0% |
| Scilab | 1 | 0% |
| Mathematica | 1 | 0% |
| C# | 1 | 0% |
| tcl | 1 | 0% |
| IDL | 1 | 0% |
| VB | 1 | 0% |
| OCaml | 1 | 0% |

Question #12. Please describe any tools or helper functions that you frequently implement to assist your applications in using LAPACK.

Responses
CLAPACK
C++ interfaces
Numeric Python
1. Perhaps an architecture-independent diagnostic that would interrogate a particular system for different L1 and L2 cache boundaries and suggest an optimum matrix size. 2. DGEMM_HALF, i.e., knowing my matrix solution will be symmetric in advance, I would like a half matrix multiply that scales as well as a vendor-supplied DGEMM.
We have built a C++ interface for BLAS and LAPACK that uses the ability to overload arguments and removes the need to encode precision into the name of the method/subroutine. We have also defined the simplest possible matrix concept and use generic linear algebra operations that interface to LAPACK and BLAS.
None.
I usually wrap the f77 calls in f90 shells that allocate workspace memory as required.
```c
#if SP2
#define C2FCALL(x) x
#else
#define C2FCALL(x) x##_
#endif
```
I mostly use lapack via Scipy, and I find the wrapped interface satisfactory. I'm sure it could be improved, but I have never felt that part to be a development bottleneck for me.
wrapper routines that dynamically allocate the work spaces
I typically write "wrapper functions" that call LAPACK while hiding many of its arguments from the nonlinear solver or optimization solver. When using C++, these functions inherit the interface from an abstract matrix class.
In my C codes, I always have to make a wrapper to hold dummy integers, etc. for the options I don't use.
Compact wrappers
sparse routines and search routines have been written.
Class wrappers in C++
boost python bindings f2py
C/C++ wrappers
Our helper functions abstract from indices, much like FLAME does
I/O Stuff for printf-debugging of small examples
NETLIB
OO wrapper
C++ class package for matrices.
Home-grown wrappers for LAPACK that allocate work space.
explicit prototypes
Wrappers that handle memory allocation.
self-developed tools
none
Getting LAPACK templated is one of the greatest obstacles. We are currently trying to use a modified f2c to do the job of converting the hard coded types 'double' and 'float' into our own MpIeee class.
Functions to display matrix elements to check correctness
matrix plotters (ascii, graphical, etc)
my own c++ templated lapack wrappers
PMD (Parallel multi-domain Decomposition), see : http://www.idris.fr/data/publications/PMD
I frequently write wrapper routines to make repeated calls from C more straightforward. I often do not need access to the full range of input arguments, so I write the C wrapper to supply the "extraneous" arguments.
I have not implemented such tools. I think that the use of lapack routines is quite simple.
plane rotations, reflections, Toeplitz and Cauchy solvers, updating and downdating,

Question #13. How could the LAPACK interface be improved to feel more natural to your application and implementation language?

Responses
There could be a facility to interrupt a Lapack call. This can be very useful in a multithreaded application, or one that makes use of asynchronous communication. I hacked this feature into Lapack3 in a fairly inelegant way. I identified several points in the Lapack code to call an external function, which was part of my application that handled communication events. These points were chosen so they were called fairly often during the course of solving my symmetric eigenvalue problem, but not in inner loops to avoid a performance hit. Because the Lapack functions have no parameters for such a purpose, I used "LDA" as a flag to indicate a situation where I need Lapack to abort the current operation, as opposed to just continuing on where it left off. In my application, this depended on the nature of the communication event.
Object orientation
see the draft LAPACK bindings of the C++ boost project
Fortran 90/95 interfaces would be more natural Overload so that have one routine name for all data types
Better integration with Python
Short of doing a full thing with a generic linear algebra library (modeled after blitz++ or MTL), I favor the pragmatic and simple approach I described in item 12.
I actually find the procedural interface as it stands quite natural, but then I'm one of the f77 dinosaurs.
Automatic memory allocation for temporary/work arrays would be a nice option. I understand why LAPACK does the whole bit about 'leading dimension of array...' but it's awkward and clumsy. Would be nicer to use f90 interfaces that know what shape the array is, or a simple C++ array type that does as well. (But *not* a complicated C++ array type like Blitz supplies; don't make LAPACK itself dependent on anything!)
I don't see any obvious/major weakness to the current LAPACK (used under F90).
- removing the work spaces - call-by-value interfaces for C
Make a C++ framework for it so I don't have to write my own.
Some examples that call LAPACK from C would be helpful.
I hate to call the routine sometimes with underscore on some systems like linux.
Should have automatic workspace allocation, and it would be very useful to have a "higher-level" interface that, given the sort of problem to solve (e.g. a symmetric eigenvalue problem), detects the type of data and the structure of the matrix, so that the user does not have to rewrite large portions of code while trying different algorithms and modifying existing code (e.g. when the underlying data changes from real to complex).
automatic interface.
It would be nice to have multiple c wrappers for LAPACK that can customize calls to many-option functions.
Definitely opposed to any f90 interface. f77/C as is makes it very easy to use with any application.
Simple consistent interface such as outputs first, followed by inputs.
give full C or C++ version of lapack
generic programming would make lapack more accessible from C++
More comprehensive wrapper interface for C++ that does not sacrifice performance (e.g. user keeps control over memory allocation)
Fortran 90/95/2003 interfaces should become standard.
Provision of standard C/C++ wrappers
The symmetric packed routines should be improved.
Provide interface for row-major storage.
remove user supplied workspace; do it automatically. More obvious (longer) routine names, although this might break f77 applications.
Make it more like FLAME
Hide work arrays, leading dimensions, matrix dimensions, and possibly other matrix information (symmetric, triangular) beneath an abstraction layer (a la FLAME objects). Do not require user to provide workspace.
provide C headers + assistance tool for linking (linking of Fortran and C can be a bit of a pain in the backside, finding all libraries that have to be linked in)
More optional arguments and thus shorter method signatures.
Getting rid of some of the explicit indexing would be nice. Being able to shift the borders of the "matrix of interest" inside the larger array on the machine would be nice. ScaLAPACK makes this situation worse, which is unavoidable I suppose. A very simple change would be slightly less cryptic identifier names (indirect feedback from anyone I help use LAPACK for the first, and often last, time).
Object oriented framework for C++.
Heavy use of C++
simplifying the indexing
We have C++ application codes. We would like to have matrix 'objects' which would bundle size information, and possibly have various format/layout options (e.g. packed symmetric, etc.).
Automated memory allocation.
With a Matlab wrapper
It's a pretty good interface as is.
It would be slightly nicer to have actual routines to query accuracy of eigenproblem results as error bounds from the routines such as zgeevx (that calculate eigen-condition numbers). Right now, the LUG shows (eg. lug/node100.html) code fragments to obtain such accuracy results based on working precision details and the computed condition numbers. It'd be slightly nicer to have these code fragments be actually implemented in some routine -- the driver, say, or other.
Using LAPACK in C++ makes more sense if the library is templated, allowing you to use your own datatypes in the algorithms and hence enabling multi-precision.
Consistent and intuitive ordering of function parameters/arguments
Provide standardized C++ templated LAPACK wrappers
Since we predominantly use OCaml as implementation language, we are not particularly dependent on the LAPACK interface itself, because the intermediate LACAML interface makes calling LAPACK very easy without losing functionality or efficiency. I assume that this is also mostly true for other people using different implementation languages. Therefore, I'd propose to keep LAPACK as efficient as possible, i.e. there is no need to waste implementation effort on making it more convenient. Higher-level interfaces and implementation languages are more suitable for this purpose.
With Fortran 95, and even more with Fortran 2003, many things could be improved using the new functionalities such as MODULE, genericity, optional arguments, dynamic allocation, and Fortran 2003's object-oriented programming and C-Fortran compatibility.
C++ wrapper
A C interface would make LAPACK feel more natural in C. It might also be nice to have set of functions which have a restricted set of arguments that work for most "simple" problem. FFTW supplies such function and I find them quite handy.
A simplified interface for Fortran90 etc. that allows for simple quick use of the routines.
I think it is not necessary to make such improvements.

Question #14. If you have installed LAPACK yourself, how could the installation process be improved?

Responses
Have the default make target *not* run the test phase.
would be nice if the testing and verification process is quicker
use a simple configure script
It could not be improved
This is a minor issue: complete the LAPACK implementation in ATLAS so that optimized BLAS and LAPACK can be installed in one package.
The testing process is too slow. It's hard to believe the functionality couldn't be tested with quicker-running problems. Perhaps have the option of the long test if you want it. A typical scenario is that I get onto some new machine and find I have to install LAPACK; then I can kiss productive time goodbye waiting for the tests to run.
More automatic configuration, and some basic tuning.
Place in an autoconf construct as a package in itself, or in an autoconf construct together with scalapack, lapack, and blas: ./configure --with[out]-(scalapack,lapack,blas) --with-blas=ibm_essl
If you can have binary version for all the systems, it will be great.
not much, current method is fairly straightforward
I have not done this recently. Several years ago, I had difficulty using the compiler that came with my linux distribution (I think either atlas or lapack required an older version).
Does it use GNU autoconf and automake?
rpm, if possible
it's simple enough
LAPACK always seem to compile fine, but testing routines often fail for non-obvious reasons.
1. I always compile and create the libraries manually. Provide a script or makefile that does just that; the official installation process is confusing and gets stuck. 2. Provide the standard ./configure; make; make install, but make sure it actually works on (nearly) everything.
The LAPACK installation process itself isn't bad once you've done it a few times. What would be really nice is if it followed the GNU style "./configure ; make ; make check ; make install" sequence. I install tons of software for the team I support and really appreciate those tools that use familiar and consistent configure scripts & build env variables (CFLAGS, etc). Another pain--and this isn't LAPACK's fault--is building an ATLAS-enhanced LAPACK. Sure, it is easy once you've done it a few times, but the process is really non-standard.
It is ok.
LAPACK currently has no configuration support. Manually editing makefiles is fine for most wonks like us, but it's not very "professional" for an NSF project funded at LAPACK's level. Autoconf configure scripts would be very slick and help provide built-in support for building LAPACK on various frequently-used architectures.
dlamch tuning could be simplified?
I have not seen any problem with the installation process
configure script rather than multiple makefiles + user choice
It should build shared libraries, not just static.
Unless the architecture is new, there seem to be few problems. Even with ScaLAPACK, I have encountered few problems in this respect. It was very well done.
Many recent compilers seem too clever for automatic detection of parameters like machine epsilon to work. It would be useful to have a replacement that uses Fortran 90 intrinsics instead.
it always worked fine for me
Make the default install double & double complex only. Improve the test suite (it fails, e.g., on some compilers when attempting to discern the machine accuracy/tolerances).
Export .exp files for use with MSVC++ on Windows would be slightly nice to have, though this is a minor issue.
No problems with installation...
Been a long time since I did this - Think it's OK as is.
ok as is
autoconfig!! Manually tweaking makefile options and manually installing the library is a pain. This would also allow for more reasonable shared library support. Also, go ahead and make a LAPACK 3.1 or 3.0.1 (or whatever you'd call it) release. Downloading the lapack-3.0 sources and then manually copying all of the patch files on top of the originals is a pain.
configure / make
Installation is not a problem.
I feel the installation process was extremely simple. I can't suggest any improvements.
Maybe an ATLAS-like procedure that reduced the optimization levels only on routines that tend to fail on a given arch/compiler combination and optimized the rest.
I know just enough about the install process to get into trouble. I originally tried to install just the single and double precision routines, but the install would not complete. It was easier to just default the install to create the full library with all four precisions/kinds
One could consider to provide makefiles for some popular platforms and compilers like Intel compilers.

Question #15. How frequently do you refer to the LAPACK Users Guide?

| Response | Count | Percent |
|---|---|---|
| Very Frequently | 7 | 3% |
| Frequently | 53 | 26% |
| Sometimes | 91 | 45% |
| Rarely | 40 | 20% |
| Never | 8 | 4% |

Question #16. What information in the LAPACK guide is hard to find or is missing, if any?

Responses
More examples would be nice (more examples for scalapack would be nice as well)
N/A
Information about sparse matrices could be better documented.
Performance information is difficult to find. For instance, it took me a while to confirm my observation that the Cholesky factorization using packed symmetric format was much slower than Cholesky factorizations using full dense format.
Information is not missing; however, it can be difficult to find on netlib, and more examples are needed. Sometimes the documentation is difficult to understand (you can't assume that even the most hard-core of us know the nomenclature).
It is difficult to find the names of the high-level functions.
I would like to see a list of NAG routines which have corresponding LAPACK equivalents.
It's so hard to read; I would like to see examples.
Sometimes hard to find the right function name for a given computation.
It is ok.
Syntax of routines, meanings and orders of parameters
Not everybody nowadays knows what LDA is - this scares off some especially young users of LAPACK
The information is relevant and useful
algorithmic aspects and details
a precise description of the algorithm
It's not clear from the section on storage schemes how non-square packed storage works (the examples are square only); one has to figure out how it might work, e.g. for non-square packed band storage.

The use of m, n, k, etc. in the few topmost SVD routines is slightly confusing. It's not immediately clear how to implement it to produce a "non-full-span" U (for U*S*V^T = A, say) so as to be most efficient when solving least-squares with m >> n. In the absence of information in the docs about whether the QR implementations (with or without column pivoting) are rank-revealing, it becomes more prudent to always do SVD for least-squares problems which might be rank-deficient. It might be useful if the docs said more on this, but that's asking for more mathematical education in the docs, which I realize is a big request.

Links in the LUG on netlib from the instances of LAPACK function names to their nearby source locations on netlib would be very useful, since the specification and calling-sequence information is in comments in the individual routines' source files. NB: the LUG section Specifications of Routines (lug/node149.html), as it appears on netlib at least, is empty except for a brief note; thus the individual routines' source comments appear to be the specs.

Details of the encoding of the results of the Bunch-Kaufman-Parlett decomposition of symmetric indefinite matrices seem missing in the guide. This makes it quite difficult to extract and form the individual factors computed by, say, dsytrf. I realize that this request is inefficient, unwise, and unnecessary when solving systems, etc., but sometimes users just really want to get their hands on the explicit matrix factors and are willing to lose the efficient encoding which dsycon, dsytrs, etc. understand. Using the details as supplied in the comments in, say, dsytrf's source is involved.

The approximation of condition numbers (Higham's modification of Hager's method) can be inaccurate. The documentation isn't very clear on how inaccurate it might be.
explicit examples
Program samples would be nice. Especially from other languages.
Always found what I need.
The guide does not seem to include a discussion of the individual functions and their arguments. I often have to go to the Fortran source to find out which arguments a function takes.

### ScaLAPACK Usage

Question #1. Do you use ScaLAPACK (or a vendor version of ScaLAPACK)?

| Response | Count | Percent |
|---|---|---|
| Yes | 93 | 45% |
| No | 113 | 54% |

Question #2. If you do not use ScaLAPACK, why?

| Response | Count | Percent |
|---|---|---|
| Not solving linear algebra | 32 | 41% |
| Cost of learning | 26 | 33% |
| Use another pkg | 19 | 24% |

Question #3. If using another package, which one(s)?

| Response | Count | Percent |
|---|---|---|
| PLAPACK | 10 | 43% |
| Matlab | 1 | 4% |
| MPIBNCpack | 1 | 4% |
| self written | 1 | 4% |
| PESSL (IBM) | 1 | 4% |
| LAPACK -- matrices are local to node | 1 | 4% |
| (we use coarse-grain parallelization above the numerical layer) | 1 | 4% |
| PETSc | 1 | 4% |
| LAPACK - SCALAPACK does not support SVD | 1 | 4% |
| LAPACK | 1 | 4% |
| I usually write my own solvers. | 1 | 4% |
| symmlq | 1 | 4% |
| SuperLU | 1 | 4% |
| Typically, I use pESSL, but use ScaLAPACK or PLAPACK when users have the need (prefer PLAPACK, especially to build a custom solver/routine). | 1 | 4% |

Question #4. If you use ScaLAPACK, do you use a vendor's version or one obtained directly from Netlib?

Response Count Percent
Netlib 46 48%
Vendor 44 46%
Other 4 4%

Question #5. If you have used both a vendor's version of ScaLAPACK and Netlib's, how do the two versions compare?

Responses
Vendor supplied is faster, but more buggy.
Vendor's version is definitely faster (on Compaq and IBM)
ScaLAPACK is comfortable in some sense because I have been using it (off-and-on) for so many years. However, the differences are all in performance, which is important, but my problems with LAPACK/ScaLAPACK all stem from the interface and that, of course, does not change. [Yes, I copied my response from the LAPACK response above ... but it does apply]
They are more or less the same; I just don't have to compile / manually keep track of updates with Debian's lapack.
Both - but I haven't compared their performances because other issues come into play, such as the version of mpich used to compile them, etc.
vendor's version is faster
Same as before : P-ESSL instead of ESSL
As for LAPACK, vendor's version is used to reach optimal performance and netlib's one for portability.
Vendor implementations are faster. I have also found them to be more robust. We have had cases where the Netlib version fails but the vendor version works.
Comparable
They are essentially identical

Question #6. Do your applications make direct ScaLAPACK calls?

Response Count Percent
Yes 85 74%
No 29 25%

Question #7. Do your applications use libraries which depend on ScaLAPACK?

Response Count Percent
Yes 37 34%
No 70 65%

Question #8. Do your applications use a higher-level interface to ScaLAPACK?

Response Count Percent
Yes 7 6%
No 98 93%

Question #9. If you answered yes above, which higher-level interfaces do you use?

Response Count Percent
writing our own 1 25%
I will use Python 1 25%
module at PSC 1 25%
mumps 1 25%

Question #10. Is the ScaLAPACK procedure interface a barrier to more extensive use?

Response Count Percent
Yes 39 39%
No 61 61%

Question #11. From which languages do you call ScaLAPACK routines?

Response Count Percent
C 37 38%
Fortran 90/95 26 26%
Fortran 77 22 22%
C++ 12 12%

Question #12. Please describe any tools or helper functions that you frequently implement to assist your applications in using ScaLAPACK?

Responses
None, although I'd like some
Global Arrays
None.
I typically write "wrapper functions" that call LAPACK while hiding many of its arguments from the nonlinear solver or optimization solver.
None
MPI
We often diagonalize matrices of about the size 1000~10000, or even larger. Storage of the matrices (and of the eigenvectors after the diagonalization) is usually split into stripes of the matrices. Before calling ScaLAPACK we have to transform the stripe distribution to the block cyclic distribution, and transform back from BCD to the stripe distribution after diagonalization. Is it possible for ScaLAPACK to make these processes automatic?
boost python bindings
own layer
na
Data distribution : the user should just provide pieces of matrices and set of indices, and the distribution should be automatic. Some comment should then be provided on possible performance degradation for some parameter values.
Lots of routines to change "shifts" into indexed calls. They are very small helper functions. Also, often I find that I have trouble making ScaLAPACK give me the data distribution I need. I'm fairly familiar with that part of ScaLAPACK, but perhaps this is a shortcoming on my part (maybe it could be done within ScaLAPACK more efficiently than I do it myself, but I have never felt I had the time to work it through; it's faster just to write my own).
OO wrapper; see above...
different data distribution models
Getting the matrix into a 2-D block cyclic distribution is non-trivial.
MPI
PMD (Parallel multi-domain Decomposition), see : http://www.idris.fr/data/publications/PMD
PBLAS redistribution routines: pdelset, pdelget, indxl2g etc.
BLACS_gridmap MPI_wtime (among many other MPI subroutines)
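Several of the helpers mentioned above (pdelset, indxl2g, etc.) implement block-cyclic index arithmetic. A minimal sketch of the 1-D local-to-global mapping, 0-based and assuming the distribution starts at process 0 (ScaLAPACK's own INDXL2G is 1-based and also takes a source-process offset):

```python
def indxl2g(l, nb, p, nprocs):
    """Global index of local entry l on process p, for a 1-D
    block-cyclic distribution with block size nb over nprocs processes
    (0-based; source process assumed to be 0)."""
    return ((l // nb) * nprocs + p) * nb + (l % nb)

# With nb=2 and 2 processes, process 0 owns global indices 0,1,4,5,...
# and process 1 owns 2,3,6,7,...
owned_by_p1 = [indxl2g(l, 2, 1, 2) for l in range(4)]   # → [2, 3, 6, 7]
```

The local-to-global map is the core of the small "shift to indexed call" helpers that respondents describe writing by hand.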

Question #13. How could the ScaLAPACK interface be improved to feel more natural to your application and implementation language?

Responses
Would be nice if there were tools or examples to help in setting up the matrix, distributing data, and reading/writing the matrix.
A consistent interface for the QR routines with pivoting would be most useful. The public version of QR routines seem to fail for one-column matrices.
Same as LAPACK: F90/95 interfaces, overloading. Get rid of BLACS: use MPI instead. Memory for diagonalizers often seems an awful lot, and sometimes stops jobs that I think should run from running.
Better interface to Blacs. The redist utils are a great start, but they're poorly documented, and frequently when the system administrators install scalapack they don't even know that they should also build redist.
64-bit integer arguments are needed (Fortran Integer*8 or C/C++ long long).
A more extensive, and *universal*, pblas library. Portability of code is of paramount concern (for maintainability), and many of the support routines are not universal, and therefore the underlying distributed data structures need to be deciphered and the necessary functions hand-coded. Very ppu (personal processing unit) time-consuming.
It is very inconvenient to prepare the data for ScaLAPACK. Should be more natural, at least as natural as lapack.
Some examples that call ScaLAPACK from C would be helpful.
The required data layout was quite difficult to implement.
The same: f77/C interfaces are sufficient.
1) There should be a functional call to generate a global matrix descriptor using either a BLACS context or an MPI communicator. 2) Another function should generate a BLACS gridmap from an MPI communicator and vice versa. 3) Maybe the BLACS context should be taken out of the global matrix descriptor
It is hard to sort out precisely how to split up and send a matrix to different processors. For example, in a situation where the matrix can be stored on one node, but ScaLapack is being used for speed-up of the linear algebra operations, there is no simple way to send the matrix to the nodes, perform the computation and get the matrix sent back. Having a routine to do that would be very useful, especially if it would automatically determine the optimal number of processors for the ScaLapack routines.
object oriented
na
The symmetric packed
Automated workspace, more obvious routine names.
The interface used by PLAPACK, which allows submatrices to be submitted in a transparent fashion, is far superior to the ScaLAPACK interface.
Same as LAPACK: desperately needed abstraction from details, memory allocation, etc.
Getting matrix packings right is quite annoying.
Give a set of routines to help distribute the data. The packed format is for us a key feature.
The matrix layout and communication seems difficult.
The problem is that once you've used ScaLAPACK for a while, you get used to the shortcomings of the package. Often people who have used other packages (Trilinos subsets, PLAPACK, Global Array, etc.) are not willing to get used to things. Any step towards an interface like these would be helpful, especially to new users.
see above
Heavy use of C++
The data distribution is difficult. PLAPACK allows this to happen in a much more natural way where the user does not have to worry about placing the data on the nodes (in the case of clusters), PLAPACK does it in the background for you.
Some object-orientation in terms of matrix types (maybe bundling the array descriptors with the matrix, etc.) would be useful from a C++ application's perspective.
similar to the call in serial jobs
It's not the interface per se that causes me problems, it's the functionality. I need a tridiagonal or band diagonal matrix solver which will allow for a two-D data distribution and dedicated IO nodes. The ScaLAPACK-based IBM library PESSL only allows for a 1D data decomposition, as (I think) does ScaLAPACK itself. Several years ago I got a 2D data decomposition to work on a Hitachi system by passing MPI communicators to the BLACS grid initialization routines. It would be great if this were standardized across platforms.
The key problem of using ScaLAPACK (as with other similar libraries) is data distribution. Block cyclic distribution is usually hard to stick to during a calculation. Perhaps enriching the types of data distribution, or letting users define/describe their own data distribution, could be a more natural way (but at what cost?).
Get rid of the dependence on BLACS. BLACS contexts in particular are unwieldy and difficult to use and understand.
I think that the use of block cyclic distribution of dense matrices is a little bit complicated. Thus it would be nice to find routines that assist in distributing matrices.
The most major problem is the errors that are given when the workspace is too small. The message that comes from ScaLAPACK is often incorrect and says that the problem is due to an incorrect argument to a routine. I would also like to use ScaLAPACK in a way that allowed several parallel diagonalisations to be carried out in parallel. This is mentioned in the BLACS documentation but does not seem to work.
Make it easier to use subsets of MPI_world in parallel LAPACK operations.
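Many of the responses above concern getting data into the 2-D block-cyclic layout that ScaLAPACK's array descriptors encode. A toy model of that layout (the helper name and shapes are illustrative, not part of ScaLAPACK): extract the local piece owned by process (p, q) in a Pr x Pc grid:

```python
import numpy as np

def local_part(A, mb, nb, p, q, Pr, Pc):
    """Local submatrix of global A owned by grid process (p, q) under a
    2-D block-cyclic distribution with mb x nb blocks (0-based indices,
    source process (0, 0))."""
    rows = [i for i in range(A.shape[0]) if (i // mb) % Pr == p]
    cols = [j for j in range(A.shape[1]) if (j // nb) % Pc == q]
    return A[np.ix_(rows, cols)]

# Every matrix entry lands on exactly one process, so the local sizes
# over the whole grid sum to the global size.
A = np.arange(36.).reshape(6, 6)
total = sum(local_part(A, 2, 2, p, q, 2, 2).size
            for p in range(2) for q in range(2))
```

Utilities like this are essentially what respondents mean by "routines to help distribute the data": the hard part in practice is doing the same mapping efficiently with message passing rather than with a global array in hand.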

Question #14. If you have installed ScaLAPACK yourself, how could the installation process be improved?

Responses
Installation can be complicated, involving BLACS, PBLAS, MPI, and interfacing C and Fortran 77/90 compilers. A common problem is 1 versus 2 underscores in symbol names. Perhaps there could be C interface stubs to accommodate both versions.
Definitely can be improved compared to GNU and R packages or even PETSc which requires more work than the other two.
It's pretty good, although I recall having to go through a lot of configuration tweaking to make it work. Autoconf might help here.
It is still not at the "1. configure (compiler inquiries), 2. make, 3. make install" stage. What would be useful is a script that searches for dependent libraries and then, on "Cannot find BLACS library", gives you the option to 1. enter a directory path or 2. enter a directory path where it can install its own.
It would be very helpful to make the distribution of the matrix in columns instead of blocks.
The distribution should contain BLACS.
The installation is TERRIBLE beyond description and requires major work. Test it on a variety of machines before release!! And with various installations of MPI on the same machine! There is a mixture of "mpi.h" and <mpi.h> includes in BLACS that makes installation fail if there is a /usr/include/mpi.h which is different from the one specified in Bmake.inc. Please provide the standard ./configure; make; make install.
na
ok
I have not met any problem with the installation.
Please dump the makefiles and provide a simple compile script. I have had to get help from our unix experts in the university to get it to compile on our unix cluster using makefiles. No one in the university has been able to get it to compile on a PC win32/win64 cluster which does not use makefiles. Has anyone ever done this? If the answer is yes please email me details m.e.honnor@durham.ac.uk, thanks
Again, very professional, painless installation. Now, sometimes we hit bugs when the test suites are run, but the distribution of bugs includes ScaLAPACK, other install libraries, etc.
Did not get it to run, stopped bothering, handpicked some other routines from netlib.
Our users do report to us difficulties in getting compatible and efficient versions of ScaLAPACK/BLACS/MPI running correctly on their computers.
Maybe using a configure-like installation could let people avoid configuring the Bmake.inc file by hand. This would mean automatic configuration of the C-Fortran calling interface and of the MPI implementation, for example.
Clearer options for compiling Single/Double/Real/Complex libraries only. Better and clearer validation routines.
The collection of makefiles could be extended.
The interaction between BLACS, PBLAS and ScaLAPACK. The installation process is made difficult by the fact that it seems necessary to compile several times to get the underscores correct.
Updated for new architecture and Fortran90
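The 1-vs-2-underscore problem mentioned above comes from differing Fortran name-mangling conventions across compilers. A hedged sketch of probing a loaded library for whichever variant it actually exports (the helper name is illustrative; a real build would normally settle this at configure time instead):

```python
def find_fortran_symbol(lib, name):
    """Return the first mangled variant of Fortran routine `name` that
    `lib` (e.g. a ctypes.CDLL handle) actually exports, trying the
    common compiler conventions in turn."""
    for variant in (name + "_", name + "__", name, name.upper()):
        if hasattr(lib, variant):
            return variant
    return None

# Example against a stand-in "library" exporting gfortran-style names:
class FakeBlas:
    dgemm_ = staticmethod(lambda *args: None)

assert find_fortran_symbol(FakeBlas, "dgemm") == "dgemm_"
```

The same probe-all-variants idea is what the suggested C interface stubs would bake in at compile time.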

Question #15. How frequently do you refer to the ScaLAPACK Users Guide?

Response Count Percent
Very Frequently 10 11%
Frequently 20 22%
Sometimes 36 40%
Rarely 18 20%
Never 6 6%

Question #16. What information in the ScaLAPACK guide is hard to find or is missing, if any?

Responses
Details in the code such as how the pivot vector is distributed and used, layout of data in distributed band solver, details about the QR factorization.
A clear and concise definition of the API would be most welcome. The explanations in the comments are commonly found to be confusing among would-be users.
redist!! Plus, information *about* redistributing matrices with/without the redist utilities. Scalapack is of no use to anyone if people can't get their matrices distributed on the computers.
Google . Works quite well, actually.
How to simultaneously call different Scalapack functions using disjoint packs of processors.
Hard to sort out how to generate/store matrices on the different nodes.
The online guide is extremely confusing. For example the critical information that all matrices in operation must have same processor grid is not there. The comments in source are cryptic. It is unclear what is input and what is output.
Ok.
I think that the guide should be refreshed: new architectures, new processor counts, new grids should be provided.
A simple compile script.
I want to know how to build ScaLAPACK as a shared library. Maybe this is not possible. I have not tried it myself. But it would be good to know.
examples
A clear description of the procedures and their arguments with lots of examples.
More information on the underlying algorithms used in the various routines would be useful.
Improvement to block decomposition methods
How to use BLACS_gridmap

### Targeted Environment Specifics

Question #1. Under which operating system environments do your applications run?

Response Count Percent
Linux 190 29%
AIX 85 13%
Solaris 55 8%
Windows (other) 48 7%
IRIX 48 7%
Windows (cygwin) 43 6%
HP/UX 34 5%
Mac OS X 34 5%
Tru64 34 5%
Unicos 29 4%
BSD 17 2%
MS Visual C++ 1 0%
NEC 1 0%
via SGI O3K & Cray XT3 1 0%
Unicos/mp 1 0%
mingw 1 0%
vpp5000 specific 1 0%
Windows XP 1 0%
SUPER-UX 1 0%
MAC OS X 1 0%
Other (Undisclosed) 1 0%
xt3 catamount 1 0%
Mac OSX 1 0%
Linux on all of (x86-64 x86-32 IA-64) 1 0%
MacOS X 1 0%
RT 1 0%
opteron-based systems 1 0%
OS X 1 0%

Question #2. If your applications run in a shared-memory environment, which styles of parallelism do they employ?

Response Count Percent
Multiple 19 31%
Concurrent programs 12 20%
Components 2 3%

Question #2a. Please specify any particular libraries or frameworks used?

Response Count Percent
openmp 9 32%
MPI 4 14%
java.util.concurrent 1 3%
essl, sunperf, sgimath 1 3%
MPI implementation 1 3%
I use OpenMP. 1 3%
OpenMP, MPI 1 3%
IBM smp essl 1 3%
R and R contributed 1 3%
Don't know 1 3%
OMP 1 3%
mpi using shared memory for shared objects 1 3%
Global Arrays (that uses System V shared memory) 1 3%

Question #3. If your applications run in a distributed-memory environment, which styles of parallelism do they employ?

Response Count Percent
Message passing 134 99%
Widely-distributed 1 0%

Question #3a. Please specify any particular libraries or frameworks used?

Response Count Percent
MPI 5 33%
PLAPACK 2 13%
Trilinos 1 6%
communication over Unix sockets 1 6%
Direct calls to the BLACS 1 6%
I need PVM. 1 6%
petsc 1 6%
Global Arrays 1 6%
HDF5 1 6%
mobile object library, Clam (from William and Mary) 1 6%

Question #5. Description of related activities?

Responses
programmer and maintainer of ACESII computational chemistry package
Compact storage for scalapack. Out of core extension for scalapack.
Computational condensed matter physics
Working on density matrix techniques as eigensolver replacements in quantum chemistry. This relies heavily on (sca)lapack.
Electron-impact excitation/ionization of atoms for modelling of fusion diagnostics and experiments. From a computational perspective, it involves the repeated diagonalisation (i.e. 40-70 times) of symmetric matrices in excess of 50,000 in which ALL eigenvectors and ALL eigenvalues are required. The physics of electron scattering of relativistic targets will drive the size of these matrices upward by at least a factor of 5 in coming years. I need efficient I/O and stable diagonalisation. I hope that the next generation of ScaLAPACK will not be as demanding on memory requirements, i.e. a 40K matrix is the maximum I can diagonalise over 25 Opteron processors, each with 2 GB of RAM.
Development of the computational chemistry software NWChem http://www.emsl.pnl.gov/docs/nwchem
I have been working on parallel sparse solvers such as incomplete factorization preconditioners for iterative methods. In such preconditioners I sometimes exploit dense matrix computation to achieve better performance, but the cost of communication is typically more important for the method of interest.
We make limited use of LAPACK in the GYRO code. The dominant matrix structure is sparse, so for that we use UMFPACK.
electronic structure calculation
Reseach on Computational Condensed Matter Physics.
Numerical optimization solvers that I develop and distribute target applications in nonlinear and semidefinite programming. I use LAPACK to factor dense matrices that are usually the Schur complements of much larger indefinite matrices. Parallel versions of these solvers use ScaLAPACK. Users of our software must also link to these packages.
Computational nuclear theory (static Hartree-Fock, time-dependent Hartree-Fock, Hartree-Fock-Bogoliubov, Dirac equation, ...)
Function parameter fitting; principal component analysis; structural superposition.
Condensed matter theory; scientific code developments
quantum chemistry code development (q-chem program)
We are an Electronic Structure Physics group of Northwestern University engaging in the numerical calculations and simulations of materials.
Computational Electromagnetics
statistical computations
Research on statistical machine learning.
Finite Element methods in CFD. http://www.cimec.org.ar/petscfem
Numerical methods for Atmospheric Dynamics
Time series data analysis; deconvolution
Numerical simulation of reactive flows (CFD and Bifurcation Theory)
Library routine development, User collaborations (consultant work)
Optimization, approximation and cubature, distribution of points on manifolds, mathematical finance
Application of Multiple Precision Numerical Computation
biostatistics bioinformatics
We write code for quantum mechanics, which does repeated generalised eigensolving. Eigenvectors from one iteration are generally good initial guesses to the next, but we can't make any use of this in ScaLAPACK. For large systems our matrices get sparse and so we are looking into using PARPACK or something along those lines for those. Our biggest problem with ScaLAPACK is memory, not speed.
Dense library development
Numerical simulation of fluid flow in oil reservoirs.
PhD-student in wavelet methods, I only use LAPACK for routines that I do not feel like implementing myself. Running time is not really an issue. The large-scale matrix operations are so sparse and specific that I implement them myself.
CAE consultancy shop, with emphasis on CFD.
BEM application developer
I am mostly talking about my experience working with application groups.
We use ScaLAPACK in a boundary element method application. We also have finite element method applications which use sparse solvers (both iterative and direct) which in turn use BLAS and LAPACK heavily.
Image processing, estimation theory
Condensed matter physics, high-temperature superconductivity
development of software for large-scale numerical optimization, semidefinite programming, sparse and dense
Not sure what you're asking - if you're asking about the nature of the application it's fluid dynamics in the interior of the Sun.
Finite element application
Our goal is to incorporate the lapack routines in our own multi precision environment MPL.
FEM/MOM formulation for electromagnetic code
I work on the development and implementation of multilevel iterative algorithms on various parallel machines. The underlying linear systems are large and sparse, and so LAPACK routines are typically used for various local (serial) computations in a distributed memory environment. Increasingly we are using shared memory nodes within large clusters; however, hybrid memory models have yet to pay off. This may change as OpenMP and other threading capabilities improve.
Use D H Bailey's MP and Yozo's QD - for extra precision.
PDE's in control, boundary control, conservation laws
Sparse linear algebra
CFD Applications
FEM, sparse generalized eigenvalue problems
optimization
Financial industry, statistics, data mining, machine learning.
Computing science in fluid dynamics and heat transfer. Cluster and grid computing. Great use of MPI-2, ATLAS, LAPACK, ScaLAPACK and SPOOLES from Fortran and C.
Atomic, Molecular and Optical Physics, Computational Chemistry.
I work on the numerical investigation of geophysical fluid instabilities and predictability. I am especially interested in the relationship between ensemble forecasting and geophysical fluid instabilities.
Computational Fluid Dynamics, solution of ODEs/PDEs
I use lapack routines to develop applications in the area of signal processing; especially, I'm interested in recursive filters.
Signal processing
Materials modelling ab-initio modelling (density functional theory)
Most of my current work is with sparse matrices coming from finite element discretizations.
Quantum-chemistry. Iterative eigenvalue solves.
Our main CFD code does not use LAPACK. A stand-alone applications of ours does, however.

Responses
In benchmarking my application that needs eigenpairs of a double precision symmetric matrix, I found that some architectures seem to have much greater overhead associated with function calls. In particular, I was comparing the different nodes on cheetah.ccs.ornl.gov that handle batch jobs (p690 1.3ghz) and interactive jobs (p655 1.7 ghz). I found that the speed was worse (factor of 1.8) on the batch node than one would expect by scaling to the difference in clock speed (factor of 1.3). I eventually convinced myself that the difference was caused by a huge number of function calls in inner loops, particularly the functions ROTG and DLARTG (e.g. 7 million calls per call to DSTEQR working with a 3750x3750 matrix). I found two ways to solve the problem. One was to recompile Lapack and my application using the highest optimization level (not recommended by the ORNL support folks because it takes so long for typically little benefit), which allows inlining of functions that exist in different source files. The other way was to put copies of ROTG and DLARTG in the source file that was generating so many calls. This allowed a lower optimization level to do the inlining. I don't know if there is any good solution other than recommending the use of a certain optimization level when compiling libraries and applications. Perhaps a preprocessor that would insert copies of functions into source files where they are called in inner loops?
I'm not a primary customer and more weight should be given to the comments from apps users. Mainly I'm filling this in so that the Lapack team will know they have yet another customer. :-)
Would be nice if the technology in HPL (look ahead, recursive options) can be merged back into PZGETRF/PZGETRS. Currently HPL is only in double in C so double complex is not available. HPL works on rhs but conceptually an extra pass over L should make it compatible with PxGETRF. General symmetric linear solver (LTL') (perhaps in packed storage as well) would be nice. Since performance of PBLAS is crucial for scalapack, it would be nice if there are parameters to tune PBLAS, especially on triangular solvers. More examples or tutorials (like PETSc) to help new users to use scalapack. Better interfaces for scalapack with other iterative linear solvers such as PETSc or Object oriented CCA technology.
Lapack is one of the best things that has ever happened to me. Everyone of the lapack workers that I've interacted with has bent over backwards to be helpful. This survey is yet another example of how the lapack people care about being useful to their community. If you ever need someone to write glowing letters of support for grant applications (particularly someone from a US National Lab other than ORNL), don't hesitate to contact me at the address above.
I am currently not using ScaLAPACK but I probably should. I would like to spend some time thinking about how to best integrate with C++. I also would like to discuss whether the way to use BLAS and LAPACK from C++ can be standardised. Several people have asked me for help, and I'm offering the interfaces we developed in the psimag toolkit. It would be nice if we could decide on properly supporting one model.
I hope that, whatever changes are made to "modernize" lapack, that they be made in the manner to augment the current basic procedural functionality rather than replace it.
Additional sparse/banded matrix routines would be great, but I realize that opens up a new can of worms.
Could you please install PVM? I really need that for my research. Thank you very much!
general question 8 is a bit unclearly stated.
The most frequent difficulty that my users encounter while linking their application to my optimization solvers and LAPACK involves language interoperability. Despite our best efforts to help them configure things properly, calling the Fortran routines from C or C++ continues to frustrate users who are uninterested in these subtleties of computer science. I would also like to see the BLAS1 include y = alpha y + x, w = alpha x + beta y, and BLAS2 include alpha = x' A x where x is a column vector and A is a symmetric matrix.
First of all, many thanks for developing LAPACK. It is very important to our group (computational nuclear theory). It would be nice to get a genuine Fortran 95 /2003 LAPACK library WITHOUT interface calls to Fortran 77 version of LAPACK.
Thank you for your work on: lapack atlas scalapack pblas...
thank you
Lapack needs sparse linear algebra routines and search/sort routines.
Currently I am working with a vectorization specialist at a supercomputer center to try to port ScaLapack to my codes. The learning curve for getting it running seems to be fairly significant. Our major bottleneck is communications between nodes to set-up the original matrix and to get the results once the calculation is completed. It doesn't seem like this should be so difficult to handle, but we are unable to improve upon it at the moment. Being able to use ScaLapack in the same user friendly way as Lapack is used would be a great advance. It currently doesn't seem to be there.
We at the Seminar for Applied Mathematics at the ETH Zurich are entrenched LAPACK fans! Thanks for the great work. Keep it up!
Re ScaLAPACK: while I am not now using it my need for the functionality it provides has risen sharply in the last several months and I expect I will be diving in to it soon. I expect to use it via a matlab interface and will need to locate a suitable interface or roll my own.
Online lapack/scalapack guides are useful; most of the time I just google the routine I need. Cholesky update is missing in scalapack. Large (banded) matrices require huge memory for linear solution. superlu_dist is nice and should be incorporated; petsc is nice and should be better incorporated.
Few people seem to know this trick: if one has matlab installed on a system, one can link to the MathWorks-provided lapack libraries (called something like libmwlapack.a) which are highly tuned for a given architecture. Some operations are 3x faster than building lapack from source with the highest optimizations. I don't know if the MathWorks discourages its customers from doing this but this kind of tip in my opinion belongs in a lapack FAQ or related document.
Calculation of gradient information of objectives involving log det of matrix-valued functions, and of constraints involving solution of linear systems, is much faster with an explicit inverse of the symmetric positive definite matrix. These codes involve element-by-element multiplications of the inverse with derivatives of matrix elements. Many vendor implementations multi-thread Cholesky DPOTRF, but not the DPOTRI explicit inverse using this Cholesky, producing a bottleneck on local 2- and 4-way SMP nodes. ATLAS does include multithreaded DPOTRI.
Would very much like to see the RRR tridiagonal eigensolver in ScaLAPACK. Thanks for providing the survey!
Kindly circulate the comments that I had sent to Jim and Jack earlier.
LAPACK is great for not very large systems; the same cannot be said about ScaLAPACK for very large systems. For very large systems, alternative approaches should be taken instead of direct extension of LAPACK. The size matters here.
Would really like an FFT package in ScaLAPACK, as it is in PESSL. In fact, we are having to use FFTW 2.x in our porting of our image processing software from an IBM P4 system to the Cray XD1 system because we used PESSL's FFT but ScaLAPACK doesn't have it. We are a bit concerned about using FFTW 2.x because it isn't the latest version, but the latest version doesn't have distributed memory FFTs. It would be nice to have it in ScaLAPACK.
LAPACK is excellent software. We also use a multiprecision build of CLAPACK, built in C++ using a customized class for 'double', say, overloaded arithmetic operators, and runtime-determined substitutes for LAPACK machine-constants routines. Building this requires some "clean up" of the CLAPACK sources, which makes getting updates/fixes more involved. It would be great if this could be done more easily, without any need for minor editing, e.g. use of temp vars and no calls involving explicit float args like foo(...,1.0,...) or comparisons like bar > 1.0. The ability for anyone to be able to build easily a quad- or multiprecision (GnuMP-based, say) version of LAPACK might be generally well received. It's not immediately clear that all LAPACK routines (e.g. SVD) will remain robust (e.g. convergence) when treated in this way. Of course, multiprecision LAPACK brings with it many involved questions about how to best leverage double-precision solutions as initial candidate solutions for arbitrary high-precision computation.
Needs more user-friendliness.
We are mostly using the out-of-core version of scalapack. Since it's only a prototype code, documentation is limited. I would like to see an official out-of-core version of scalapack. Also, partial factorization for both in-core and out-of-core scalapack would be nice.
I think one of the frustrating parts of the LAPACK libraries is the build and patch system. I think moving to a modern revision control system (such as subversion), along with a good set of open-source development tools accessible through the web, introducing supported language bindings (e.g., python), and finally a more reasonable build system is critical to the next generation of users.
1. A quad-precision version of LAPACK would be very useful. 2. After using the matrix solvers in LAPACK for many years, I am convinced there is a deep problem that exhibits itself in the SVD routines.
I like Lapack, thank you.
Functionalities for symmetric indefinite matrices are missing in ScaLAPACK.
The documentation (man pages) of LAPACK could be kept more up to date with respect to the specification of workspace sizes; we have observed problems in the past because this information was outdated in some cases. Another problem concerns error handling: we would rather not have LAPACK call the standard XERBLA routine, which terminates the program. Unfortunately, one cannot replace it when using shared libraries. It would be great if there were, e.g., some kind of global switch that forced the standard XERBLA to do nothing, so that the called LAPACK function can return the error code in INFO to the application, where it can be handled in a specific way.
Scalability and performance of the ScaLAPACK routines are very impressive compared to other libraries, but they are still not easy to use when the data distribution is driven by the application's algorithms. Many thanks and wish you all the best.
I would be very interested in using the MRRR (Multiple Relatively Robust Representations) eigensolver algorithm. I would be very grateful if you could feed back to me any plans you have to incorporate this into future releases of ScaLAPACK, as this could influence the direction of my future research.
1. I think that Sca/LAPACK should offer support for solving systems with Toeplitz matrices. Recently I have developed some algorithms for solving linear systems with banded triangular Toeplitz matrices (in versions using OpenMP, MPI, and Level 2 & 3 BLAS routines). Please let me know if you think that they could be useful. Also see: P. Stpiczynski: Numerical evaluation of linear recurrences on high performance computers and clusters of workstations, in: Proceedings of PARELEC 2004, IEEE Computer Society Press, 2004, 200-205; P. Stpiczynski: Solving linear recurrence systems using level 2 and 3 BLAS routines, Lecture Notes in Computer Science 3019 (2004) 1059-1066. 2. Support for vector processing could be improved in the case of multiple right-hand-side vectors (instead of repeating a simpler solver for each right-hand-side vector). 3. Recently I have developed a triangular matrix solver which uses an alternative data distribution (P. Stpiczynski: Parallel Cholesky factorization on orthogonal multiprocessors, Parallel Computing 18 (1992) 213-219) and which is faster than the original ScaLAPACK routine. I believe that this idea can be applied to produce a faster Cholesky factorization. Currently I'm working on it.
Please continue to develop these packages as they are of immense value to the research that we perform.

Question #7. Use DOE-lab resources?

Response Count Percent
Yes 30 46%
No 34 53%

Question #8. Use HPCS resources?

Response Count Percent
Yes 4 80%
No 1 20%
