PLASMA RELEASE NOTES
__________________________________________________________________
Summary of Current Features
* Solution of standard and generalized dense symmetric eigenvalue
problems in real space and complex space, via a two-stage
tridiagonal reduction that has been shown to be up to 10 times
faster than the standard tridiagonalization. Both eigenvalues and
eigenvectors are now supported. The overall speedup of the two-stage
dense symmetric eigenvalue algorithm varies between 2 (when both
eigenvalues and eigenvectors are needed) and 10 (when only
eigenvalues are needed).
The new routines are: plasma_zheev, plasma_zheevd, plasma_zheevr,
plasma_zhegvd, plasma_zhegv, plasma_zhetrd. More details about the
technique can be found in:
+ H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra. Solving the
Generalized Symmetric Eigenvalue Problem using Tile Algorithms
on Multicore Architectures. Advances in Parallel Computing
Volume 22, 2012.
+ A. Haidar, H. Ltaief and J. Dongarra. Parallel Memory-Aware
Fine-Grained Reduction to Condensed Forms for Symmetric
Eigenvalue Problems. International Conference for High
Performance Computing, Networking, Storage and Analysis,
IEEE-SC 2011.
* Solution of the dense Singular Value Decomposition in real space
and complex space, via a two-stage bidiagonal reduction that has
been shown to be up to 10 times faster than the standard
bidiagonalization. Both singular values and singular vectors are
now supported. The overall speedup of the two-stage dense Singular
Value Decomposition algorithm varies between 2 (when both singular
values and singular vectors are needed) and 10 (when only singular
values are needed).
The new routines are: plasma_zgesvd, plasma_zgesdd, plasma_zgebrd.
More details about the technique can be found in:
+ A. Haidar, H. Ltaief, P. Luszczek and J. Dongarra. A
Comprehensive Study of Task Coalescing for Selecting
Parallelism Granularity in a Two-Stage Bidiagonal Reduction.
IEEE IPDPS 2012.
+ A. Haidar, P. Luszczek, J. Kurzak and J. Dongarra. An Improved
Parallel Singular Value Algorithm and Its Implementation for
Multicore Hardware. International Conference for High
Performance Computing, Networking, Storage and Analysis,
IEEE-SC 2013.
* Solution of dense systems of linear equations and least squares
problems in real space and complex space, using single precision
and double precision, via the Cholesky, LU, QR and LQ
factorizations
* Solution of dense linear systems of equations in real space and
complex space using the mixed-precision algorithm based on the
Cholesky, LU, QR and LQ factorizations
* Multiple implementations of the LU factorization algorithm: partial
pivoting based on a recursive parallel panel, tournament pivoting
(with LU partial pivoting inside each tournament), and incremental
pivoting.
* Tree-based QR and LQ factorizations and Q matrix generation and
application (“tall and skinny”)
* Tree-based bidiagonal reduction (“tall and skinny”)
* Explicit matrix inversion based on Cholesky factorization
(symmetric positive definite)
* Parallel and cache-efficient in-place layout translations
(Gustavson et al.)
* Complete set of Level 3 BLAS routines for matrices stored in tile
layout
* Simple LAPACK-like interface for greater productivity and advanced
(tile) interface for full control and maximum performance; routines
for conversion between LAPACK matrix layout and PLASMA’s tile
layout
* Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and
dynamically scheduled versions of all computational routines
(alongside statically scheduled ones)
* Asynchronous interface for launching dynamically scheduled routines
in a non-blocking mode. Sequence and request constructs for
controlling progress and checking errors
* Automatic handling of workspace allocation; a set of auxiliary
functions to assist the user with workspace allocation
* A simple set of "sanity" tests for all numerical routines including
Level 3 BLAS routines for matrices in tile layout
* An advanced testing suite for exhaustive numerical testing of all
the routines in all precisions (based on the testing suite of the
LAPACK library)
* Basic timing suite for most of the routines in all precisions
* Thread safety
* Support for Make and CMake build systems
* LAPACK-style comments in the source code using the Doxygen system
* Native support for Microsoft Windows using WinThreads through a
thin OS interaction layer
* Installer capable of downloading from Netlib and installing missing
components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACKE
C API)
* Extensive documentation including Installation Guide, Users' Guide,
Reference Manual and an HTML code browser, a guide on running
PLASMA with the TAU package, Contributors' Guide, a README and
Release Notes.
* A comprehensive set of usage examples
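To illustrate the conversion between LAPACK matrix layout and tile layout listed above, here is a minimal Python sketch. It is not PLASMA code. The function names, the square tile size nb, and the choice of storing tiles column by column with column-major elements inside each tile are assumptions made for this example only.

```python
def lapack_to_tiles(a, m, n, lda, nb):
    """Copy an m-by-n column-major matrix into nb-by-nb tiles.

    a   : flat list holding the matrix column by column
    lda : leading dimension of a (lda >= m)
    nb  : tile size (square tiles are an assumption of this sketch)
    Returns (tiles, mt, nt); tiles are ordered column by column.
    """
    if nb <= 0 or lda < m:
        raise ValueError("need nb > 0 and lda >= m")
    mt = (m + nb - 1) // nb  # number of tile rows (ceiling division)
    nt = (n + nb - 1) // nb  # number of tile columns
    tiles = []
    for tj in range(nt):
        for ti in range(mt):
            rows = min(nb, m - ti * nb)  # edge tiles may be smaller
            cols = min(nb, n - tj * nb)
            tile = [a[(tj * nb + j) * lda + (ti * nb + i)]
                    for j in range(cols) for i in range(rows)]
            tiles.append(tile)
    return tiles, mt, nt


def tiles_to_lapack(tiles, m, n, nb):
    """Inverse conversion: rebuild a column-major flat list with lda = m."""
    mt = (m + nb - 1) // nb
    a = [None] * (m * n)
    for idx, tile in enumerate(tiles):
        tj, ti = divmod(idx, mt)  # tiles were appended column by column
        rows = min(nb, m - ti * nb)
        cols = min(nb, n - tj * nb)
        for j in range(cols):
            for i in range(rows):
                a[(tj * nb + j) * m + (ti * nb + i)] = tile[j * rows + i]
    return a
```

Each tile ends up contiguous in memory, which is the property that tile algorithms rely on. This sketch copies out of place, whereas the layout translations listed above work in place.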
__________________________________________________________________
New Features by Release
2.5.2, September, 2013
* Add -m and -n options to timing routines to define matrix size
without using ranges
* Fix a minor bug that appears when combining multi-threaded tasks
with thread masks in QUARK. Previously, the thread mask was not
respected when the tasks of the multi-threaded task were being
assigned to threads.
* Fix illegal division by 0 that occurred when the matrix size was
smaller than the tile size during in-place layout translation. See
[1]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374
* Fix the QUARK_REGION bug that was limiting performance of QR/LQ
factorization in the last release.
* Fix illegal division by 0 when the first NUMA node detected by
hwloc is empty. Thanks to Jim for those two bug reports, see
[2]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680.
* Fix the integer size that was causing an overflow in the tile
pointer computation. Thanks to SGI for the bug report.
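The two arithmetic pitfalls behind the division-by-0 and overflow fixes above can be reproduced with plain integers. The Python sketch below is illustrative only: the helper names and the tile offset formula (full square tiles stored column by column) are assumptions, and the release notes do not say where exactly the faulty expressions were in PLASMA.

```python
INT32_MAX = 2**31 - 1


def tile_count(n, nb):
    """Number of nb-sized tiles needed to cover n rows or columns."""
    if nb <= 0:
        raise ValueError("tile size must be positive")
    return (n + nb - 1) // nb  # ceiling division: at least 1 when n > 0


def tile_offset(ti, tj, mt, nb):
    """Element offset of tile (ti, tj), assuming full nb-by-nb tiles
    stored column by column (an assumption of this sketch)."""
    return (tj * mt + ti) * nb * nb


def fits_in_int32(value):
    """True if value is representable as a signed 32-bit integer."""
    return -INT32_MAX - 1 <= value <= INT32_MAX


# A matrix smaller than the tile: floor division yields 0 tiles, which
# is dangerous if that count is later used as a divisor.
floor_tiles = 100 // 200
ceil_tiles = tile_count(100, 200)

# A 50000 x 50000 matrix with 200 x 200 tiles: the offset of the last
# tile no longer fits in a signed 32-bit integer.
mt = tile_count(50000, 200)
last_offset = tile_offset(mt - 1, mt - 1, mt, 200)
```

In C, offsets of this kind need a 64-bit type such as size_t or int64_t once the matrix holds more than 2^31 - 1 elements.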
2.5.1, July, 2013
* Add LU factorization with tournament pivoting:
PLASMA_[sdcz]getrf_tntpiv[_Tile[_Async]]. Each tournament is based
on the classical partial pivoting algorithm. The size of each
subdomain involved in the tournament can be set through a call to
"PLASMA_Set( PLASMA_TNTPIVOTING_SIZE, nt );". The default is 4. See
LAWN 226.
* Add LU factorization with no pivoting:
PLASMA_[sdcz]getrf_nopiv[_Tile[_Async]]. WARNING: the matrix has to
be diagonally dominant, otherwise the result might be wrong.
* Add the rank-revealing QR factorization routine:
PLASMA_[sdcz]geqp3[_Tile[_Async]].
* Fix many comments in the Doxygen documentation
* Complete documentation on DAG and execution traces generation
* Add the dense Hermitian eigenvalue problem routines. Note that
these routines require a multithreaded BLAS. The user must tell
PLASMA which multithreaded BLAS library is being used by adding
-DPLASMA_WITH_XXX to the compilation flags. The currently
supported libraries are selected with -DPLASMA_WITH_MKL or
-DPLASMA_WITH_ACML. It is easy to add more libraries; please
contact the PLASMA team if you require additional libraries to be
supported.
+ PLASMA_[sdch]hetrd: computes the tridiagonal reduction of a
dense Hermitian matrix using the 2-stage algorithm A = QTQ^H.
It can also generate the complex matrix Q with orthonormal
columns used to reduce the matrix A to tridiagonal form. This
function is similar to the ZHETRD routine combined with the
ZUNGQR routine (when Q is generated) of LAPACK.
+ PLASMA_[sdch]heev: computes all eigenvalues and, optionally,
eigenvectors of a complex Hermitian matrix A. This function is
similar to the ZHEEV routine of LAPACK.
+ PLASMA_[sdch]heevd: computes all eigenvalues and, optionally,
eigenvectors of a complex Hermitian matrix A. If eigenvectors
are desired, it uses a divide and conquer algorithm. This
function is similar to the ZHEEVD routine of LAPACK.
+ PLASMA_[sdch]heevr: computes selected eigenvalues and,
optionally, eigenvectors of a complex Hermitian matrix A.
Eigenvalues and eigenvectors can be selected by specifying
either a range of values or a range of indices for the desired
eigenvalues. Whenever possible, ZHEEVR calls ZSTEMR to compute
the eigenspectrum using Multiple Relatively Robust
Representations (MRRR). This function is similar to the ZHEEVR
routine of LAPACK.
+ PLASMA_[sdch]hegv: computes all the eigenvalues and,
optionally, the eigenvectors of a complex generalized
Hermitian-definite eigenproblem, of the form:
A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x.
Here A and B are assumed to be Hermitian and B is also positive
definite. This function uses the QR algorithm, and is similar
to the ZHEGV routine of LAPACK.
+ PLASMA_[sdch]hegvd: computes all the eigenvalues and,
optionally, the eigenvectors of a complex generalized
Hermitian-definite eigenproblem, of the form:
A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x.
Here A and B are assumed to be Hermitian and B is also positive
definite. If eigenvectors are desired, it uses a divide and
conquer algorithm, and is similar to the ZHEGVD routine of
LAPACK.
* Add the singular value decomposition (SVD) routines. Note that
these routines require a multithreaded BLAS. The user must tell
PLASMA which multithreaded BLAS library is being used by adding
-DPLASMA_WITH_XXX to the compilation flags. The currently
supported libraries are selected with -DPLASMA_WITH_MKL or
-DPLASMA_WITH_ACML. It is easy to add more libraries; please
contact the PLASMA team if you require additional libraries to be
supported.
+ PLASMA_[sdch]gebrd: computes the bidiagonal reduction of a
dense general matrix using the 2-stage algorithm A = QBP^H. It
can also generate the complex matrices Q and P^H with
orthonormal columns used to reduce the matrix A to bidiagonal
form. This function is similar to the ZGEBRD routine combined
with the ZUNGBR routine (when Q is generated) of LAPACK.
+ PLASMA_[sdch]gesvd: computes the singular value decomposition
(SVD) of a complex matrix A, optionally computing the left
and/or right singular vectors. The SVD is written A = U *
SIGMA * conjugate-transpose(V). This routine uses the implicit
zero-shift QR algorithm and is similar to the ZGESVD routine of
LAPACK.
+ PLASMA_[sdch]gesdd: computes the singular value decomposition
(SVD) of a complex matrix A, optionally computing the left
and/or right singular vectors. The SVD is written A = U *
SIGMA * conjugate-transpose(V). This routine uses the divide
and conquer algorithm and is similar to the ZGESDD routine of
LAPACK.
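The warning attached to PLASMA_[sdcz]getrf_nopiv in this release can be checked before calling the routine. The Python sketch below tests strict diagonal dominance by rows, a standard sufficient condition for LU factorization without pivoting to be safe. The function name and the list-of-rows matrix representation are assumptions of this example, and PLASMA itself does not perform this check as far as these notes say.

```python
def is_strictly_diagonally_dominant(a):
    """Return True if the square matrix a (a list of rows) satisfies
    |a[i][i]| > sum of |a[i][j]| over j != i, for every row i.
    Works for real and complex entries through abs()."""
    n = len(a)
    for i, row in enumerate(a):
        if len(row) != n:
            raise ValueError("matrix must be square")
        off_diagonal = sum(abs(x) for j, x in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True


safe = is_strictly_diagonally_dominant([[4, 1, 2],
                                        [1, 5, 3],
                                        [0, 2, 3]])
unsafe = is_strictly_diagonally_dominant([[1, 2],
                                          [3, 4]])
```

Dominance by columns is an equally valid criterion; checking it amounts to running the same function on the transposed matrix.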
2.5.0, November, 2012
* Introduce condition estimators for General and Positive Definite
cases (xGECON, xPOCON)
* Fix recurring issue with the LAPACK release number in
plasma-installer
* Fix out-of-order computations in QR/LQ factorization that were
causing numerical issues with dynamic scheduling
* Fix many comments in the Doxygen documentation
* Correct some performance issues with in-place layout translation
2.4.6, August 20th, 2012
* Add eigenvector support in eigensolvers for symmetric/Hermitian
problems and generalized problems.
* Add support of Frobenius norm.
* Release the precision generation script used to generate the
precision s, d and c from z, as well as ds from zc
* Add all Fortran90 wrappers for mixed-precision routines.
* Add all Fortran90 wrappers to tile interface and asynchronous
interface. Thanks to NAG for providing those wrappers.
* Add 4 examples with Fortran90 interface.
* Add support for all computational functions in F77 wrappers.
* Fix memory leaks related to fake dependencies in dynamically
scheduled algorithms.
* Fix interface issues in eigensolver routines.
* Fix the returned info in the PLASMA_zgetrf function.
* Fix bug with matrices of size 0.
* WARNING: all LAPACK interfaces having a T or L argument for QR or
LU factorization have been changed to take a descriptor. The
workspace allocation has been changed to match those requirements,
and all functions PLASMA_Alloc_Workspace_XXXXX_Tile are now
deprecated; users are encouraged to move to the
PLASMA_Alloc_Workspace_XXXXX version.
2.4.5, November 22nd, 2011
* Add LU inversion functions: PLASMA_zgetri, PLASMA_zgetri_Tile and
PLASMA_zgetri_Tile_Async using the recursive parallel panel
implementation of LU factorization.
* The Householder reduction trees for QR and LQ factorizations now
work in the general case, and not only on matrices where M is a
multiple of MB.
* Matrix generation has been changed in all timing, testing and
example files to use a parallel initialization, which gives a
better distribution of the data on the architecture, especially for
the Tile interface. “numactl” is not required anymore.
* Timing routines can now generate DAGs with the --dag option, and
traces with --trace option if EZTRACE is present.
2.4.2, September 14th, 2011
* New version of QUARK that removes active waiting and allows the
user to bind tasks to a set of cores.
* Installer: Fix compatibility issues between plasma-installer and
PGI compiler reported on Kraken by A. Bouteiller.
* Fix one memory leak with hwloc.
* Introduce a new kernel for the recursive LU operation on tile
layout which reduces cache misses.
* Fix several bugs and introduce new features thanks to people from
Fujitsu and NAG:
+ The new LU factorization with partial pivoting introduced in
release 2.4 now works on rectangular matrices.
+ Add missing functions to the Fortran 77 interface.
+ Add a new Fortran 90 interface to all LAPACK and Tile
interfaces. Asynchronous interface and mixed-precision
routines are not available yet.
+ Fix argument order in header files to match the implementation.
2.4.1, July 8th, 2011
* Fix bug with Fujitsu compiler reported on the forum
([3]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108)
* Unbind threads in PLASMA_Finalize to avoid binding problems in
OpenMP sections that follow PLASMA calls (still possible on Mac and
AIX without hwloc). A better fix is to create the OpenMP threads in
the user code before any call to PLASMA, using a dummy parallel
section.
2.4.0, June 6th, 2011
* Tree-based QR and LQ factorizations: routines for application of
the Q matrix support all combinations of input parameters:
Left/Right, NoTrans/Trans/ConjTrans
* Symmetric Eigenvalue Problem using: tile reduction to
band-tridiagonal form, reduction to “standard” tridiagonal form by
bulge chasing, finding eigenvalues using the QR algorithm
(eigenvectors currently not supported)
* Singular Value Decomposition using: tile reduction to
band-bidiagonal form, reduction to “standard” bidiagonal form by
bulge chasing, finding singular values using the QR algorithm
(singular vectors currently not supported)
* Gaussian Elimination with partial pivoting (as opposed to the
incremental pivoting in the tile LU factorization) and parallel
panel (using QUARK extensions for nested parallelism). WARNING:
Following the integration of this new feature, the interface to
call LU factorization has changed. PLASMA_zgetrf now follows the
LAPACK interface and corresponds to the new partial pivoting. The
old interface for LU factorization with incremental pivoting has
been renamed PLASMA_zgetrf_incpiv.
2.3.1, November 30th, 2010
* Add functions to generate random matrices (plrnt, plghe and plgsy)
⇒ fix the problem with time_zpotri_tile.c reported by Katayama on
the forum
([4]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59)
* Fix a deadlock in norm computations with static scheduling
* Installer: fix the LAPACK version when libtmg is the only library
to be installed. Thanks to Henc.
([5]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60)
2.3.0, November 15th, 2010
* Parallel and cache-efficient in-place layout translations
(Gustavson et al.)
* Tree-based QR factorization and Q matrix generation (“tall and
skinny”)
* Explicit matrix inversion based on Cholesky factorization
(symmetric positive definite)
* Replacement of LAPACK C Wrapper with LAPACKE C API by Intel
2.2.0, July 9th, 2010
* Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and
dynamically scheduled versions of all computational routines
(alongside statically scheduled ones)
* Asynchronous interface for launching dynamically scheduled routines
in a non-blocking mode. Sequence and request constructs for
controlling progress and checking errors
* Removal of CBLAS and pieces of LAPACK from PLASMA’s source tree.
BLAS, CBLAS, LAPACK and Netlib LAPACK C Wrapper become PLASMA’s
software dependencies required prior to the installation of PLASMA
* Installer capable of downloading from Netlib and installing missing
components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACK
C Wrapper)
* Complete set of Level 3 BLAS routines for matrices stored in tile
layout
2.1.0, November 15th, 2009
* Native support for Microsoft Windows using WinThreads
* Support for Make and CMake build systems
* Performance-optimized mixed-precision routine for the solution of
linear systems of equations using the LU factorization
* Initial timing code (PLASMA_dgesv only)
* Release notes
2.0.0, July 4th, 2009
* Support for real and complex arithmetic in single and double
precision
* Generation and application of the Q matrix from the QR and LQ
factorizations
* Prototype of mixed-precision routine for the solution of linear
systems of equations using the LU factorization (not optimized for
performance)
* Simple interface and native interface
* Major code cleanup and restructuring
* Redesigned workspace allocation
* LAPACK testing
* Examples
* Thread safety
* Python installer
* Documentation: Installation Guide, Users' Guide with routine
reference and an HTML code browser, a guide on running PLASMA with
the TAU package, initial draft of Contributors' Guide, a README
file and a LICENSE file
1.0.0, November 15th, 2008
* Double precision routines for the solution of linear systems of
equations and least squares problems using Cholesky, LU, QR and LQ
factorizations
__________________________________________________________________
Last updated 2013-09-16 11:43:30 CEST
References
1. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374
2. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680
3. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108
4. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59
5. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60