Summary of Current Features

Solution of standard and generalized dense symmetric eigenvalue problems in real space and complex space, via a two-stage tridiagonal reduction that has been shown to be up to 10 times faster than the standard tridiagonalization. Both eigenvalues and eigenvectors are now supported. The overall speedup of the two-stage dense symmetric eigenvalue algorithm varies between two (when both eigenvalues and eigenvectors are needed) and 10 (when only eigenvalues are needed). The new routines are: plasma_zheev, plasma_zheevd, plasma_zheevr, plasma_zhegvd, plasma_zhegv, plasma_zhetrd. More details about the technique can be found in:

H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra. Solving the Generalized Symmetric Eigenvalue Problem using Tile Algorithms on Multicore Architectures. Advances in Parallel Computing Volume 22, 2012.

A. Haidar, H. Ltaief and J. Dongarra. Parallel Memory-Aware Fine-Grained Reduction to Condensed Forms for Symmetric Eigenvalue Problems. International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2011.


Solution of the dense Singular Value Decomposition in real space and complex space, via a two-stage bidiagonal reduction that has been shown to be up to 10 times faster than the standard bidiagonalization. Both singular values and singular vectors are now supported. The overall speedup of the two-stage dense Singular Value Decomposition algorithm varies between two (when singular vectors are also needed) and 10 (when only singular values are needed). The new routines are: plasma_zgesvd, plasma_zgesdd, plasma_zgebrd. More details about the technique can be found in:

A. Haidar, H. Ltaief, P. Luszczek and J. Dongarra. A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction. IEEE IPDPS 2012.

A. Haidar, P. Luszczek, J. Kurzak and J. Dongarra. An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware. International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013.


Solution of dense systems of linear equations and least squares problems in real space and complex space, using single precision and double precision, via the Cholesky, LU, QR and LQ factorizations

Solution of dense linear systems of equations in real space and complex space using the mixed-precision algorithm based on the Cholesky, LU, QR and LQ factorizations

Multiple implementations of the LU factorization algorithm: partial pivoting based on a recursive parallel panel, tournament pivoting based on LU with partial pivoting, and incremental pivoting.

Tree-based QR and LQ factorizations and Q matrix generation and application (“tall and skinny”)

Tree-based bidiagonal reduction (“tall and skinny”)

Explicit matrix inversion based on Cholesky factorization (symmetric positive definite)

Parallel and cache-efficient in-place layout translations (Gustavson et al.)

Complete set of Level 3 BLAS routines for matrices stored in tile layout

Simple LAPACK-like interface for greater productivity and advanced (tile) interface for full control and maximum performance; routines for conversion between LAPACK matrix layout and PLASMA’s tile layout

Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and dynamically scheduled versions of all computational routines (alongside statically scheduled ones)

Asynchronous interface for launching dynamically scheduled routines in a non-blocking mode. Sequence and request constructs for controlling progress and checking errors

Automatic handling of workspace allocation; a set of auxiliary functions to assist the user with workspace allocation

A simple set of "sanity" tests for all numerical routines including Level 3 BLAS routines for matrices in tile layout

An advanced testing suite for exhaustive numerical testing of all the routines in all precisions (based on the testing suite of the LAPACK library)

Basic timing suite for most of the routines in all precisions

Thread safety

Support for Make and CMake build systems

LAPACK-style comments in the source code using the Doxygen system

Native support for Microsoft Windows using WinThreads through a thin OS interaction layer

Installer capable of downloading from Netlib and installing missing components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACKE C API)

Extensive documentation including Installation Guide, Users' Guide, Reference Manual and an HTML code browser, a guide on running PLASMA with the TAU package, Contributors' Guide, a README and Release Notes.

A comprehensive set of usage examples
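
The mixed-precision solvers listed among the features above rest on a general idea: factorize and solve in the cheaper single precision, then recover double-precision accuracy by iterative refinement on the residual. The following is a minimal, language-neutral sketch of that idea in Python, not PLASMA's implementation. All function names are hypothetical, only the LU variant is shown, and single precision is emulated by rounding through the `struct` module.

```python
import struct


def to_single(x):
    """Round a double-precision float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_single(a):
    """LU with partial pivoting; every intermediate result is rounded to single."""
    n = len(a)
    lu = [[to_single(v) for v in row] for row in a]
    perm = list(range(n))  # perm[k] = original index of the row now at position k
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_single(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_single(lu[i][j] - to_single(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_single(lu, perm, b):
    """Forward and backward substitution, again rounded to single precision."""
    n = len(lu)
    y = [to_single(b[perm[i]]) for i in range(n)]
    for i in range(n):  # unit lower triangular solve
        for j in range(i):
            y[i] = to_single(y[i] - to_single(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # upper triangular solve
        for j in range(i + 1, n):
            y[i] = to_single(y[i] - to_single(lu[i][j] * y[j]))
        y[i] = to_single(y[i] / lu[i][i])
    return y


def solve_mixed(a, b, max_iter=30, tol=1e-14):
    """Solve a x = b: single-precision LU plus double-precision refinement."""
    n = len(a)
    lu, perm = lu_factor_single(a)
    x = lu_solve_single(lu, perm, b)
    for _ in range(max_iter):
        # Residual is computed in double precision with the original matrix.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve_single(lu, perm, r)  # correction from the cheap factors
        x = [x[i] + d[i] for i in range(n)]
    return x
```

The factorization, which dominates the cost, is done once in low precision; each refinement step only needs a matrix-vector product and two triangular solves. The scheme converges only when the matrix is not too ill-conditioned for single precision.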
New Features by Release
2.6.0, November, 2013

libcoreblas has been made fully independent. All dependencies on libplasma and libquark have been removed. A pkg-config file has been added to ease compilation of projects using the standalone coreblas library.

New routines PLASMA_[sdcz]pltmg[_Tile[_Async]], for PLasma Test Matrices Generation, have been added to create special test matrices from the Matlab gallery. This includes Cauchy, Circulant, Fiedler, Foster, Hadamard, Hankel, Householder and many other matrices.

Add norm computation for triangular matrices: PLASMA_[sdcz]lantr[_Tile[_Async]], and dependent kernels.

Doxygen documentation of coreblas kernels has been updated.

Fix a problem, reported by J. Dobson from NAG, with the thread settings modification made in the singular value and eigenvalue routines when MKL is used.
2.5.2, September, 2013

Add m and n options to timing routines to define matrix size without using ranges

Fix a minor bug that appears when combining multithreaded tasks with thread masks in QUARK. Previously, the thread mask was not respected when the tasks of the multithreaded task were being assigned to threads.

Fix illegal division by 0 that occurred when the matrix size was smaller than the tile size during in-place layout translation. See http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374

Fix the QUARK_REGION bug that was limiting performance of QR/LQ factorization in the last release.

Fix illegal division by 0 when the first NUMA node detected by hwloc is empty. Thanks to Jim for those two bug reports, see http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680.

Fix the integer size that was causing overflow in the computation of tile pointers. Thanks to SGI for the bug report.
2.5.1, July, 2013

Add LU factorization with tournament pivoting. Each tournament is based on the classical partial pivoting algorithm. PLASMA_[sdcz]getrf_tntpiv[_Tile[_Async]]. The size of each subdomain involved in the tournament can be set through the call to "PLASMA_Set( PLASMA_TNTPIVOTING_SIZE, nt );". The default is 4. See LAWN 226.

Add LU factorization with no pivoting: PLASMA_[sdcz]getrf_nopiv[_Tile[_Async]]. WARNING: your matrix has to be diagonally dominant to use it, or the result might be wrong.

Add rank-revealing QR routine: PLASMA_[sdcz]geqp3[_Tile[_Async]].

Fix many comments in the Doxygen documentation

Complete documentation on DAG and execution traces generation

Add the dense Hermitian eigenvalue problem routines. Note that these routines require a multithreaded BLAS. The user must tell PLASMA that a multithreaded BLAS library is being used, and which one, by adding -DPLASMA_WITH_XXX to the compilation flags. The currently supported options are -DPLASMA_WITH_MKL and -DPLASMA_WITH_ACML, but it is easy to add more libraries; please contact the PLASMA team if you require additional libraries to be supported.

1. PLASMA_[sdch]hetrd: computes the tridiagonal reduction of a dense Hermitian matrix using the 2-stage algorithm A = QTQ^H. It can also generate the complex matrix Q with orthonormal columns used to reduce the matrix A to tridiagonal form. This function is similar to the ZHETRD routine combined with the ZUNGQR routine (when Q is generated) of LAPACK.
2. PLASMA_[sdch]heev: computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. This function is similar to the ZHEEV routine of LAPACK.
3. PLASMA_[sdch]heevd: computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. If eigenvectors are desired, it uses a divide and conquer algorithm. This function is similar to the ZHEEVD routine of LAPACK.
4. PLASMA_[sdch]heevr: computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Whenever possible, ZHEEVR calls ZSTEMR to compute the eigenspectrum using Relatively Robust Representations (MRRR). This function is similar to the ZHEEVR routine of LAPACK.
5. PLASMA_[sdch]hegv: computes all the eigenvalues and, optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem of the form A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x. Here A and B are assumed to be Hermitian and B is also positive definite. This function uses the QR algorithm, and is similar to the ZHEGV routine of LAPACK.
6. PLASMA_[sdch]hegvd: computes all the eigenvalues and, optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem of the form A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x. Here A and B are assumed to be Hermitian and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm, and is similar to the ZHEGVD routine of LAPACK.

Add the singular value decomposition (SVD) routines. Note that these routines require a multithreaded BLAS. The user must tell PLASMA that a multithreaded BLAS library is being used, and which one, by adding -DPLASMA_WITH_XXX to the compilation flags. The currently supported options are -DPLASMA_WITH_MKL and -DPLASMA_WITH_ACML, but it is easy to add more libraries; please contact the PLASMA team if you require additional libraries to be supported.

1. PLASMA_[sdch]gebrd: computes the bidiagonal reduction of a dense general matrix using the 2-stage algorithm A = QBP^H. It can also generate the complex matrices Q and P^H with orthonormal columns used to reduce the matrix A to bidiagonal form. This function is similar to the ZGEBRD routine combined with the ZUNGBR routine (when Q is generated) of LAPACK.
2. PLASMA_[sdch]gesvd: computes the singular value decomposition (SVD) of a complex matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U * SIGMA * conjugate-transpose(V). This routine uses the implicit zero-shift QR algorithm and is similar to the ZGESVD routine of LAPACK.
3. PLASMA_[sdch]gesdd: computes the singular value decomposition (SVD) of a complex matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U * SIGMA * conjugate-transpose(V). This routine uses the divide and conquer algorithm and is similar to the ZGESDD routine of LAPACK.
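
The warning attached to the no-pivoting LU factorization in this release says the matrix must be diagonally dominant. A caller could check that property before choosing the routine. The sketch below is a standalone Python illustration and not part of PLASMA. The function name is hypothetical, and it tests strict row diagonal dominance, which is one common reading of the requirement.

```python
def is_strictly_diagonally_dominant(a):
    """Return True if |a[i][i]| > sum of |a[i][j]| for j != i, in every row.

    `a` is a square matrix given as a list of rows; entries may be real or
    complex, since abs() handles both.
    """
    for i, row in enumerate(a):
        off_diagonal = sum(abs(v) for j, v in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True
```

The check costs O(n^2) operations, which is negligible next to the O(n^3) factorization. A matrix that fails it is not necessarily unsafe for LU without pivoting, but a pivoted variant is then the more cautious choice.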
2.5.0, November, 2012

Introduce condition estimators for General and Positive Definite cases (xGECON, xPOCON)

Fix recurring issue with the LAPACK release number in plasma-installer

Fix out-of-order computations in QR/LQ factorization that were causing numerical issues with dynamic scheduling

Fix many comments in the Doxygen documentation

Correct some performance issues with in-place layout translation
2.4.6, August 20th, 2012

Add eigenvector support in eigensolvers for symmetric/Hermitian problems and generalized problems.

Add support for the Frobenius norm.

Release the precision generation script used to generate the precision s, d and c from z, as well as ds from zc

Add all Fortran90 wrappers for mixed precision routines.

Add all Fortran90 wrappers to tile interface and asynchronous interface. Thanks to NAG for providing those wrappers.

Add 4 examples with Fortran90 interface.

Add support for all computational functions in F77 wrappers.

Fix memory leaks related to fake dependencies in dynamically scheduled algorithms.

Fix interface issues in eigensolver routines.

Fix returned info in the PLASMA_zgetrf function

Fix bug with matrices of size 0.

WARNING: all LAPACK interfaces having a T or L argument for QR or LU factorization have been changed to take a descriptor. The workspace allocation has been changed to match those requirements and all functions PLASMA_Alloc_Workspace_XXXXX_Tile are now deprecated and users are encouraged to move to the PLASMA_Alloc_Workspace_XXXXX version.
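
For readers unfamiliar with the Frobenius norm added in this release: it is the square root of the sum of the squared magnitudes of all matrix entries. The sketch below is a standalone Python illustration with a hypothetical function name, not PLASMA code. It uses the scaled sum-of-squares update familiar from LAPACK-style norm routines, so that very large or very small entries do not overflow or underflow when squared.

```python
import math


def frobenius_norm(a):
    """Frobenius norm of a matrix given as a list of rows (real or complex).

    Keeps the running sum as scale**2 * sumsq, where scale is the largest
    magnitude seen so far, so no entry is ever squared at full size.
    """
    scale = 0.0
    sumsq = 1.0
    for row in a:
        for v in row:
            absv = abs(v)
            if absv == 0.0:
                continue
            if scale < absv:
                # New largest entry: rescale the accumulated sum to it.
                sumsq = 1.0 + sumsq * (scale / absv) ** 2
                scale = absv
            else:
                sumsq += (absv / scale) ** 2
    return scale * math.sqrt(sumsq)
```

A naive `sqrt(sum(x*x))` would overflow for entries around 1e200 even though the norm itself is representable; the scaled form avoids that at the cost of one division per entry.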
2.4.5, November 22nd, 2011

Add LU inversion functions: PLASMA_zgetri, PLASMA_zgetri_Tile and PLASMA_zgetri_Tile_Async using the recursive parallel panel implementation of LU factorization.

The Householder reduction trees for QR and LQ factorizations can now work on general cases and not only on matrices with M a multiple of MB.

Matrix generation has been changed in all timing, testing and example files to use a parallel initialization, which gives a better distribution of the data on the architecture, especially for the Tile interface. “numactl” is no longer required.

Timing routines can now generate DAGs with the dag option, and traces with the trace option if EZTRACE is present.
2.4.2, September 14th, 2011

New version of QUARK that removes active waiting and allows the user to bind tasks to a set of cores.

Installer: Fix compatibility issues between plasma-installer and PGI compiler reported on Kraken by A. Bouteiller.

Fix one memory leak with hwloc.

Introduce a new kernel for the recursive LU operation on tile layout which reduces cache misses.

Fix several bugs and introduce new features thanks to people from Fujitsu and NAG:

The new LU factorization with partial pivoting introduced in release 2.4 is now working on rectangular matrices.

Add missing functions to Fortran 77 interface.

Add a new Fortran 90 interface to all LAPACK and Tile interfaces. Asynchronous interface and mixed precision routines are not available yet.

Fix argument order in header files to match the implementation.

2.4.1, July 8th, 2011

Fix bug with Fujitsu compiler reported on the forum (http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108)

Unbind threads in PLASMA_Finalize to avoid binding problems in OpenMP sections that follow PLASMA calls (still possible on Mac and AIX without hwloc). A better fix is to create the OpenMP threads in the user code before any call to PLASMA, by means of a fake parallel section.
2.4.0, June 6th, 2011

Tree-based QR and LQ factorizations: routines for application of the Q matrix support all combinations of input parameters: Left/Right, NoTrans/Trans/ConjTrans

Symmetric Eigenvalue Problem using: tile reduction to band-tridiagonal form, reduction to "standard" tridiagonal form by bulge chasing, finding eigenvalues using the QR algorithm (eigenvectors currently not supported)

Singular Value Decomposition using: tile reduction to band-bidiagonal form, reduction to “standard” bidiagonal form by bulge chasing, finding singular values using the QR algorithm (singular vectors currently not supported)

Gaussian Elimination with partial pivoting (as opposed to the incremental pivoting in the tile LU factorization) and parallel panel (using QUARK extensions for nested parallelism). WARNING: Following the integration of this new feature, the interface to call LU factorization has changed. Now, PLASMA_zgetrf follows the LAPACK interface and corresponds to the new partial pivoting. The old interface for LU factorization with incremental pivoting has been renamed PLASMA_zgetrf_incpiv.
2.3.1, November 30th, 2010

Add functions to generate random matrices (plrnt, plghe and plgsy) ⇒ fix the problem with time_zpotri_tile.c reported by Katayama on the forum (http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59)

Fix a deadlock in norm computations with static scheduling

Installer: fix the LAPACK version when libtmg is the only library to be installed. Thanks to Henc. (http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60)
2.3.0, November 15th, 2010

Parallel and cache-efficient in-place layout translations (Gustavson et al.)

Tree-based QR factorization and Q matrix generation (“tall and skinny”)

Explicit matrix inversion based on Cholesky factorization (symmetric positive definite)

Replacement of LAPACK C Wrapper with LAPACKE C API by Intel
2.2.0, July 9th, 2010

Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and dynamically scheduled versions of all computational routines (alongside statically scheduled ones)

Asynchronous interface for launching dynamically scheduled routines in a non-blocking mode. Sequence and request constructs for controlling progress and checking errors

Removal of CBLAS and pieces of LAPACK from PLASMA’s source tree. BLAS, CBLAS, LAPACK and Netlib LAPACK C Wrapper become PLASMA’s software dependencies required prior to the installation of PLASMA

Installer capable of downloading from Netlib and installing missing components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACK C Wrapper)

Complete set of Level 3 BLAS routines for matrices stored in tile layout
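
Several entries in these notes refer to the tile layout, as opposed to the LAPACK (column-major) layout. The sketch below illustrates the difference in plain Python and is not PLASMA code. The function names are hypothetical, and it assumes tiles of size nb-by-nb that are ordered column by column, each stored contiguously in column-major order, with smaller tiles at the bottom and right edges. The exact layout used by PLASMA may differ in details such as padding.

```python
def lapack_to_tile(a, m, n, lda, nb):
    """Split an m-by-n column-major matrix into contiguous tiles.

    `a` is a flat list with leading dimension lda (element (i, j) lives at
    a[i + j * lda]). Returns the list of tiles in column-major tile order.
    """
    mt = (m + nb - 1) // nb  # number of tile rows (ceiling division)
    nt = (n + nb - 1) // nb  # number of tile columns
    tiles = []
    for tj in range(nt):
        for ti in range(mt):
            rows = min(nb, m - ti * nb)  # edge tiles may be smaller
            cols = min(nb, n - tj * nb)
            tiles.append([a[(ti * nb + i) + (tj * nb + j) * lda]
                          for j in range(cols) for i in range(rows)])
    return tiles


def tile_to_lapack(tiles, m, n, lda, nb):
    """Inverse of lapack_to_tile: rebuild the flat column-major array."""
    mt = (m + nb - 1) // nb
    a = [0.0] * (lda * n)
    for idx, tile in enumerate(tiles):
        tj, ti = divmod(idx, mt)
        rows = min(nb, m - ti * nb)
        cols = min(nb, n - tj * nb)
        for j in range(cols):
            for i in range(rows):
                a[(ti * nb + i) + (tj * nb + j) * lda] = tile[i + j * rows]
    return a
```

Because each tile is contiguous, a kernel working on one tile touches a compact block of memory, which is the motivation for the layout. This sketch copies out of place; the in-place translations mentioned in these notes are considerably more involved.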
2.1.0, November 15th, 2009

Native support for Microsoft Windows using WinThreads

Support for Make and CMake build systems

Performance-optimized mixed-precision routine for the solution of linear systems of equations using the LU factorization

Initial timing code (PLASMA_dgesv only)

Release notes
2.0.0, July 4th, 2009

Support for real and complex arithmetic in single and double precision

Generation and application of the Q matrix from the QR and LQ factorizations

Prototype of mixed-precision routine for the solution of linear systems of equations using the LU factorization (not optimized for performance)

Simple interface and native interface

Major code cleanup and restructuring

Redesigned workspace allocation

LAPACK testing

Examples

Thread safety

Python installer

Documentation: Installation Guide, Users' Guide with routine reference and an HTML code browser, a guide on running PLASMA with the TAU package, initial draft of Contributors' Guide, a README file and a LICENSE file
1.0.0, November 15th, 2008

Double precision routines for the solution of linear systems of equations and least squares problems using Cholesky, LU, QR and LQ factorizations