MAGMA
2.3.0
Matrix Algebra for GPU and Multicore Architectures

The interface for MAGMA is similar to LAPACK, to facilitate porting existing codes.
Many routines have the same base names and the same arguments as LAPACK. In some cases, MAGMA needs larger workspaces or some additional arguments in order to implement an efficient algorithm.
There are several classes of routines in MAGMA:
A brief summary of routines is given here. Full descriptions of individual routines are given in the Modules section.
Driver & computational routines have a magma_
prefix. These are generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the matrix starts and ends, not where the computation is done.
Suffix  Example  Description 

none  magma_dgetrf  hybrid CPU/GPU routine where the matrix is initially in CPU host memory. 
_m  magma_dgetrf_m  hybrid CPU/multipleGPU routine where the matrix is initially in CPU host memory. 
_gpu  magma_dgetrf_gpu  hybrid CPU/GPU routine where the matrix is initially in GPU device memory. 
_mgpu  magma_dgetrf_mgpu  hybrid CPU/multipleGPU routine where the matrix is distributed across multiple GPUs' device memories. 
In general, MAGMA follows LAPACK's naming conventions. The base name of each routine has a one letter precision (occasionally two letters), two letter matrix type, and usually a 23 letter routine name. For example, DGETRF is D (doubleprecision), GE (general matrix), TRF (triangular factorization).
Precision  Description 

s  single real precision (float) 
d  double real precision (double) 
c  singlecomplex precision (magmaFloatComplex) 
z  doublecomplex precision (magmaDoubleComplex) 
sc  singlecomplex input with single precision result (e.g., scnrm2) 
dz  doublecomplex input with double precision result (e.g., dznrm2) 
ds  mixedprecision algorithm (double and single, e.g., dsgesv) 
zc  mixedprecision algorithm (doublecomplex and singlecomplex, e.g., zcgesv) 
Matrix type  Description 

ge  general matrix 
sy  symmetric matrix, can be real or complex 
he  Hermitian (complex) matrix 
po  positive definite, symmetric (real) or Hermitian (complex) matrix 
tr  triangular matrix 
or  orthogonal (real) matrix 
un  unitary (complex) matrix 
Driver routines solve an entire problem.
Name  Description 

gesv  solve linear system, AX = B, A is general (nonsymmetric) 
posv  solve linear system, AX = B, A is symmetric/Hermitian positive definite 
hesv  solve linear system, AX = B, A is symmetric indefinite 
gels  least squares solve, AX = B, A is rectangular 
geev  nonsymmetric eigenvalue solver, AX = X Lambda 
syev/heev  symmetric eigenvalue solver, AX = X Lambda 
syevd/heevd  symmetric eigenvalue solver, AX = X Lambda, using divide & conquer 
sygvd/hegvd  symmetric generalized eigenvalue solver, AX = BX Lambda 
gesvd  singular value decomposition (SVD), A = U Sigma V^H 
gesdd  singular value decomposition (SVD), A = U Sigma V^H, using divide & conquer 
Computational routines solve one piece of a problem. Typically, driver routines call several computational routines to solve the entire problem. Here, curly braces { } group similar routines. Starred * routines are not yet implemented in MAGMA.
Name  Description 

: Triangular factorizations :  Description 
getrf, potrf, hetrf  triangular factorization (LU, Cholesky, Indefinite) 
getrs, potrs, hetrs  triangular forward and back solve 
getri, potri  triangular inverse 
getf2, potf2  triangular panel factorization (BLAS2) 
. Orthogonal factorizations  Description 
ge{qrf, qlf, lqf, rqf*}  QR, QL, LQ, RQ factorization 
geqp3  QR with column pivoting (BLAS3) 
or{mqr, mql, mlq, mrq*}  multiply by Q after factorization (real) 
un{mqr, mql, mlq, mrq*}  multiply by Q after factorization (complex) 
or{gqr, gql*, glq*, grq*}  generate Q after factorization (real) 
un{gqr, gql*, glq*, grq*}  generate Q after factorization (complex) 
. Eigenvalue & SVD  Description 
gehrd  Hessenberg reduction (in geev) 
sytrd/hetrd  tridiagonal reduction (in syev, heev) 
gebrd  bidiagonal reduction (in gesvd) 
There are many other computational routines that are mostly internal to MAGMA and LAPACK, and not commonly called by end users.
BLAS routines follow a similar naming scheme: precision, matrix type (for level 2 & 3), routine name. For BLAS routines, the magma_ prefix indicates a wrapper around CUBLAS (e.g., magma_zgemm calls cublasZgemm), while the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zgemm). All MAGMA BLAS routines are GPU native and take the matrix in GPU memory. The descriptions here are simplified, omitting scalars (alpha & beta) and transposes.
These do O(n) operations on O(n) data and are memorybound.
Name  Description 

copy  copy vector, y = x 
scal  scale vector, y = alpha*y 
swap  swap two vectors, y <—> x 
axpy  y = alpha*x + y 
nrm2  vector 2norm 
amax  vector maxnorm 
asum  vector onenorm 
dot  dot product (real), x^T y 
dotu  dot product (complex), unconjugated, x^T y 
dotc  dot product (complex), conjugated, x^H y 
These do O(n^2) operations on O(n^2) data and are memorybound.
Name  Description 

gemv  general matrixvector product, y = A*x 
symv, hemv  symmetric/Hermitian matrixvector product, y = A*x 
syr, her  symmetric/Hermitian rank1 update, A = A + x*x^H 
syr2, her2  symmetric/Hermitian rank2 update, A = A + x*y^H + y*x^H 
trmv  triangular matrixvector product, y = A*x 
trsv  triangular solve, one righthand side (RHS), solve Ax = b 
These do O(n^3) operations on O(n^2) data and are computebound. Level 3 BLAS are significantly more efficient than the memorybound level 1 and level 2 BLAS.
Name  Description 

gemm  general matrixmatrix multiply, C = C + A*B 
symm, hemm  symmetric/Hermitian matrixmatrix multiply, C = C + A*B, A is symmetric 
syrk, herk  symmetric/Hermitian rankk update, C = C + A*A^H, C is symmetric 
syr2k, her2k  symmetric/Hermitian rank2k update, C = C + A*B^H + B*A^H, C is symmetric 
trmm  triangular matrixmatrix multiply, B = A*B or B*A, A is triangular 
trsm  triangular solve, multiple RHS, solve A*X = B or X*A = B, A is triangular 
Additional BLASlike routines, many originally defined in LAPACK. These follow a similar naming scheme: precision, then "la", then the routine name. MAGMA implements these common ones on the GPU, plus adds a few such as symmetrize and transpose.
For auxiliary routines, the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines are GPU native and take the matrix in GPU memory.
Name  Description 

geadd  add general matrices (like axpy), B = alpha*A + B 
laswp  swap rows (in getrf) 
laset  set matrix to constant 
lacpy  copy matrix 
lascl  scale matrix 
lange  norm, general matrix 
lansy/lanhe  norm, symmetric/Hermitian matrix 
lantr  norm, triangular matrix 
lag2  convert general matrix from one precision to another (e.g., dlag2s is double to single) 
lat2  convert triangular matrix from one precision to another 
larf  apply Householder elementary reflector 
larfg  generate Householder elementary reflector 
larfb  apply block Householder elementary reflector 
larft  form T for block Householder elementary reflector 
symmetrize  copy lower triangle to upper triangle, or viceversa 
transpose  transpose matrix 
MAGMA can use regular CPU memory allocated with malloc or new, but it may achieve better performance using aligned and, especially, pinned memory. There are typed versions of these (e.g., magma_zmalloc) that avoid the need to cast and use sizeof, and untyped versions (e.g., magma_malloc) that are more flexible but require a (void**) cast and multiplying the number of elements by sizeof.
Name  Description 

magma_malloc_cpu()  allocate CPU memory that is aligned for better performance & reproducibility 
magma_malloc_pinned()  allocate CPU memory that is pinned (pagelocked) 
magma_malloc()  allocate GPU memory 
magma_free_cpu()  free CPU memory allocated with magma_malloc_cpu 
magma_free_pinned()  free CPU memory allocated with magma_malloc_pinned 
magma_free()  free GPU memory allocated with magma_malloc 
where * is one of the four precisions, s d c z, or i for magma_int_t, or none for an untyped version.
The name of communication routines is from the CPU's point of view.
Name  Description 

setmatrix  send matrix to GPU 
setvector  send vector to GPU 
getmatrix  get matrix from GPU 
getvector  get vector from GPU 
MAGMA uses magma_int_t for integers. Normally, this is mapped to the C/C++ int type. Most systems today use the LP64 convention, meaning long and pointers are 64bit, while int is 32bit.
MAGMA also supports the ILP64 convention as an alternative, where int, long, and pointers are all 64bit. To use this, we typedef magma_int_t to be long long. To use ILP64, define MAGMA_ILP64 or MKL_ILP64 when compiling, and link with an ILP64 BLAS and LAPACK library; see make.inc.mklilp64 for an example.
MAGMA supports complex numbers. Unfortunately, there is not a single standard for how to implement complex numbers in C/C++. Fortunately, most implementations are identical on a binary level, so you can freely cast from one to another. The MAGMA types are: magmaFloatComplex, which in CUDA MAGMA is a typedef of cuFloatComplex, and magmaDoubleComplex, which in CUDA MAGMA is a typedef of cuDoubleComplex.
For C, we provide macros to manipulate complex numbers. For C++ support, include the magma_operators.h header, which provides overloaded C++ operators and functions.
C macro  C++ operator  Description 

c = MAGMA_*_MAKE(r,i)  create complex number from real & imaginary parts  
r = MAGMA_*_REAL(a)  r = real(a)  return real part 
i = MAGMA_*_IMAG(a)  i = imag(a)  return imaginary part 
c = MAGMA_*_NEGATE(a)  c = a;  negate 
c = MAGMA_*_ADD(a,b)  c = a + b;  add 
c = MAGMA_*_SUB(a,b)  c = a  b;  subtract 
c = MAGMA_*_MUL(a,b)  c = a * b;  multiply 
c = MAGMA_*_DIV(a,b)  c = a / b;  divide 
c = MAGMA_*_CNJG(a)  c = conj(a)  conjugate 
r = MAGMA_*_ABS(a)  r = fabs(a)  2norm, sqrt( real(a)^2 + imag(a)^2 ) 
r = MAGMA_*_ABS1(a)  r = abs1(a)  1norm, abs(real(a)) + abs(imag(a)) 
. Constants  Description  
c = MAGMA_*_ZERO  zero  
c = MAGMA_*_ONE  one  
c = MAGMA_*_NAN  notanumber (e.g., 0/0)  
c = MAGMA_*_INF  infinity (e.g., 1/0, overflow) 
where * is one of the four precisions, S D C Z.
Here are general guidelines for variable names; there are of course exceptions to these.
Typically, the order of arguments is:
LAPACK and MAGMA use columnmajor matrices. For matrix X with dimension (lda,n), element X(i, j) is X[ i + j*lda ]. For symmetric, Hermitian, and triangular matrices, only the lower or upper triangle is accessed, as specified by the uplo argument; the other triangle is ignored.
lda is the leading dimension of matrix A; similarly ldb for B, ldda for dA, etc. It should immediately follow the matrix pointer in the argument list. The leading dimension can be the number of rows, or if A is a submatrix of a larger parent matrix, lda is the leading dimension (e.g., rows) of the parent matrix.
On the GPU, it is often beneficial to round the leading dimension up to a multiple of 32, to provide better performance. This aligns memory reads so they are coalesced. This is provided by the magma_roundup function:
ldda = magma_roundup( m, 32 );
The formula ((m + 31)/32)*32 also works, relying on floored integer division, but the roundup function is clearer to use.
On the CPU, it is often beneficial to ensure that the leading dimension is not a multiple of the page size (often 4 KiB), to minimize TLB misses.
For vectors, incx is the increment or stride between elements of vector x. In all cases, incx != 0. In most cases, if incx < 0, then the vector is indexed in reverse order, for instance, using Matlab notation,
incx = 1 means x( 1 : 1 : n ) incx = 2 means x( 1 : 2 : 2*n1 )
while
incx = 1 means x( n : 1 : 1 ) incx = 2 means x( 2*n1 : 2 : 1 )
For several routines (amax, amin, asum, nrm2, scal), the order is irrelevant, so negative incx are not allowed; incx > 0.
MAGMA defines a few constant parameters, such as MagmaTrans, MagmaNoTrans
, that are equivalent of CBLAS and LAPACK parameters. The naming and numbering of these parameters follow that of CBLAS from Netlib and the C Interface to LAPACK from Netlib, and PLASMA.
MAGMA includes functions, lapack_xyz_const()
, which take MAGMA's integer constants and return LAPACK's string constants, where xyz
is a MAGMA type such as uplo
, trans
, etc. From the standpoint of LAPACK, only the first letter of each string is significant. Nevertheless, the functions return meaningful strings, such as "No transpose", "Transpose", "Upper", "Lower", etc. Similarly, there are functions to go from MAGMA's integer constants to CBLAS, OpenCL's clBLAS, and CUDA's cuBLAS integer constants.
There are also functions, magma_xyz_const()
, to go in the opposite direction, from LAPACK's string constants to MAGMA's integer constants.
The most common constants are those defined for BLAS routines:
enum { MagmaNoTrans, MagmaTrans, MagmaConjTrans } magma_order_t
Whether a matrix is not transposed, transposed, or conjugatetransposed. For a real matrix, Trans and ConjTrans have the same meaning.
enum { MagmaLower, MagmaUpper, MagmaFull } magma_uplo_t
Whether the lower or upper triangle of a matrix is given, or the full matrix.
enum { MagmaLeft, MagmaRight } magma_side_t
Whether the matrix is on the left or right.
enum { MagmaUnit, MagmaNonUnit } magma_diag_t
Whether the diagonal is assumed to be unit (all ones) or not.
Additional constants for specific routines are defined in the documentation for the routines.
Because MAGMA, CBLAS, LAPACK, CUBLAS, and clBlas use potentially different constants, converters between them are provided.
These convert LAPACK constants to MAGMA constants. Note that the meaning of LAPACK constants depends on the context: 'N' can mean False, NoTrans, NonUnit, NoVec, etc. Here, curly braces { } group similar constants.
.  Function  .  Description 

magma_bool_t  magma_bool_const  ( character )  Map 'N', 'Y' to MagmaTrue, MagmaFalse 
magma_order_t  magma_order_const  ( character )  Map 'R', 'C' to MagmaRowMajor, MagmaColMajor 
magma_trans_t  magma_trans_const  ( character )  Map 'N', 'T', 'C' to MagmaNoTrans, MagmaTrans, MagmaConjTrans 
magma_uplo_t  magma_uplo_const  ( character )  Map 'L', 'U' to MagmaLower, MagmaUpper 
magma_diag_t  magma_diag_const  ( character )  Map 'N', 'U' to MagmaNonUnit, MagmaUnit 
magma_side_t  magma_side_const  ( character )  Map 'L', 'R' to MagmaLeft, MagmaRight 
magma_norm_t  magma_norm_const  ( character )  Map 'O', '1', '2', 'F', 'E', 'I', 'M' to Magma{One, Two, Frobenius, Inf, Max}Norm 
magma_dist_t  magma_dist_const  ( character )  Map 'U', 'S', 'N' to MagmaDist{Uniform, Symmetric, Normal} 
magma_vec_t  magma_vec_const  ( character )  Map 'V', 'N', 'I', 'A', 'S', 'O' to MagmaVec, Magma{No, I, All, Some, Overwrite}Vec 
magma_range_t  magma_range_const  ( character )  Map 'A', 'V', 'I' to MagmaRange{All, V, I} 
magma_vect_t  magma_vect_const  ( character )  Map 'Q', 'P' to MagmaQ, MagmaP 
magma_direct_t  magma_direct_const  ( character )  Map 'F', 'B' to MagmaForward, MagmaBackward 
magma_storev_t  magma_storev_const  ( character )  Map 'C', 'R' to MagmaColumnwise, MagmaRowwise 
These do the inverse map, converting MAGMA to LAPACK constants. From the standpoint of LAPACK, only the first letter of each string is significant. Nevertheless, the functions return meaningful strings, such as "No transpose", "Transpose". Substitute lapacke
for lapack
to get version that returns single char instead of string (const char*).
.  Function  .  Description 

const char*  lapack_bool_const  ( magma_bool_t )  Inverse of magma_bool_const() 
const char*  lapack_order_const  ( magma_order_t )  Inverse of magma_order_const() 
const char*  lapack_trans_const  ( magma_trans_t )  Inverse of magma_trans_const() 
const char*  lapack_uplo_const  ( magma_uplo_t )  Inverse of magma_uplo_const() 
const char*  lapack_diag_const  ( magma_diag_t )  Inverse of magma_diag_const() 
const char*  lapack_side_const  ( magma_side_t )  Inverse of magma_side_const() 
const char*  lapack_norm_const  ( magma_norm_t )  Inverse of magma_norm_const() 
const char*  lapack_dist_const  ( magma_dist_t )  Inverse of magma_dist_const() 
const char*  lapack_vec_const  ( magma_vec_t )  Inverse of magma_vec_const() 
const char*  lapack_range_const  ( magma_range_t )  Inverse of magma_range_const() 
const char*  lapack_vect_const  ( magma_vect_t )  Inverse of magma_vect_const() 
const char*  lapack_direct_const  ( magma_direct_t )  Inverse of magma_direct_const() 
const char*  lapack_storev_const  ( magma_storev_t )  Inverse of magma_storev_const() 
const char*  lapack_const  ( constant )  Map any MAGMA constant, Magma*, to an LAPACK string constant 
char  lapacke_const  ( constant )  Map any MAGMA constant, Magma*, to an LAPACKE character 
To convert MAGMA to Nvidia's CUBLAS constants:
.  Function  .  Description 

cublasOperation_t  cublas_trans_const  ( trans )  Map MagmaNoTrans, MagmaTrans, MagmaConjTrans to CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C 
cublasFillMode_t  cublas_uplo_const  ( uplo )  Map MagmaLower, MagmaUpper to CUBLAS_FILL_MODE_LOWER, CUBLAS_FILL_MODE_UPPER 
cublasDiagType_t  cublas_diag_const  ( diag )  Map MagmaNonUnit, MagmaUnit to CUBLAS_DIAG_NON_UNIT, CUBLAS_DIAG_UNIT 
cublasSideMode_t  cublas_side_const  ( side )  Map MagmaLeft, MagmaRight to CUBLAS_SIDE_LEFT, CUBLAS_SIDE_Right 
To convert MAGMA to AMD's clBlas constants:
.  Function  .  Description 

clblasOrder  clblas_order_const  ( order )  Map MagmaRowMajor, MagmaColMajor to clAmdBlasRowMajor, clAmdBlasColumnMajor 
clblasTranspose  clblas_trans_const  ( trans )  Map MagmaNoTrans, MagmaTrans, MagmaConjTrans to clAmdBlasNoTrans, clAmdBlasTrans, clAmdBlasConjTrans 
clblasUplo  clblas_uplo_const  ( uplo )  Map MagmaLower, MagmaUpper to clAmdBlasLower, clAmdBlasUpper 
clblasDiag  clblas_diag_const  ( diag )  Map MagmaNonUnit, MagmaUnit to clAmdBlasNonUnit, clAmdBlasUnit 
clblasSide  clblas_side_const  ( side )  Map MagmaLeft, MagmaRight to clAmdBlasLeft, clAmdBlasRight 
To convert MAGMA to CBLAS constants:
.  Function  .  Description 

enum CBLAS_ORDER  cblas_order_const  ( order )  Map MagmaRowMajor, MagmaColMajor to CblasRowMajor, CblasColMajor 
enum CBLAS_TRANSPOSE  cblas_trans_const  ( trans )  Map MagmaNoTrans, MagmaTrans, MagmaConjTrans to CblasNoTrans, CblasTrans, CblasConjTrans 
enum CBLAS_UPLO  cblas_uplo_const  ( uplo )  Map MagmaLower, MagmaUpper to CblasLower, CblasUpper 
enum CBLAS_DIAG  cblas_diag_const  ( diag )  Map MagmaNonUnit, MagmaUnit to CblasNonUnit, CblasUnit 
enum CBLAS_SIDE  cblas_side_const  ( side )  Map MagmaLeft, MagmaRight to CblasLeft, CblasRight 
Driver and computational routines, and a few BLAS/auxiliary routines, currently return errors both as a return value and in the info argument. The return value and info should always be identical. In general, the meaning is as given in this table. Predefined error codes are large negative numbers. Using the symbolic constants below is preferred, but the numeric values can be found in include/magma_types.h.
Info  Description 

info = 0 (MAGMA_SUCCESS)  Successful exit 
info < 0, but small  For info = i, the ith argument had an illegal value 
info > 0  Functionspecific error such as singular matrix 
MAGMA_ERR_DEVICE_ALLOC  Could not allocate GPU device memory 
MAGMA_ERR_HOST_ALLOC  Could not allocate CPU host memory 
MAGMA_ERR_ILLEGAL_VALUE  An argument had an illegal value (deprecated; instead it should return i to say the ith argument was bad) 
MAGMA_ERR_INVALID_PTR  Can't free pointer 
MAGMA_ERR_NOT_IMPLEMENTED  Function or option not implemented 
MAGMA_ERR_NOT_SUPPORTED  Function or option not supported on the current architecture 
magma_xerbla() is called to report errors (mostly bad arguments) to the user.
magma_strerror() returns string description of an error code.
The onesided LU, Cholesky, and QR factorizations form a basis for solving linear systems. A general recommendation is to use LU for general nbyn matrices, Cholesky for symmetric/Hermitian positive definite (SPD) matrices, and QR for solving least squares problems,
min  A x  b 
for general mbyn, m > n matrices.
We use hybrid algorithms where the computation is split between the GPU and and the CPU. In general for the onesided factorizations, the panels are factored on the CPU and the trailing submatrix updates on the GPU. Lookahead techniques are used to overlap the CPU and GPU work and some communications.
In both the CPU and GPU interfaces the matrix to be factored resides in the GPU memory, and CPUGPU transfers are associated only with the panels. The resulting matrix is accumulated (on the CPU or GPU according to the interface) along the computation, as a byproduct of the algorithm, vs. sending the the entire matrix when needed. In the CPU interface, the original transfer of the matrix to the GPU is overlapped with the factorization of the first panel. In this sense the CPU and GPU interfaces, although similar, are not derivatives of each other as they have different communication patterns.
Although the solution step has O(n) times less floating point operations than the factorization, it is still very important to optimize it. Solving a triangular system of equations can be very slow because the computation is bandwidth limited and naturally not parallel. Various approaches have been proposed in the past. We use an approach where diagonal blocks of A are explicitly inverted and used in a block algorithm. This results in a high performance, numerically stable algorithm, especially when used with triangular matrices coming from numerically stable factorization algorithms (e.g., as in LAPACK and MAGMA).
For instances when the GPU's single precision performance is much higher than its double precision performance, MAGMA provides a second set of solvers, based on the mixed precision iterative refinement technique. The solvers are based again on correspondingly the LU, QR, and Cholesky factorizations, and are designed to solve linear problems in double precision accuracy but at a speed that is characteristic for the much faster single precision computations. The idea is to use single precision for the bulk of the computation, namely the factorization step, and than use that factorization as a preconditioner in a simple iterative refinement process in double precision arithmetic. This often results in the desired high performance and high accuracy solvers.
As the onesided matrix factorizations are the basis for various linear solvers, the twosided matrix factorizations are the basis for eigensolvers, and therefore form an important class of dense linear algebra routines. The twosided factorizations have been traditionally more difficult to achieve high performance. The reason is that the twosided factorizations involve large matrixvector products which are memory bound, and as the gap between compute and communication power increases exponentially, these memory bound operations become an increasingly more difficult to handle bottleneck. GPUs though offer an attractive possibility to accelerate them. Indeed, having a high bandwidth (e.g. 10 times larger than current CPU bus bandwidths), GPUs can accelerate matrixvector products significantly (10 to 30 times). Here, the panel factorization itself is hybrid, while the trailing matrix update is performed on the GPU.