Hi Bobby,
When I was working on this I wrote level three versions of the solve. Well, it
forms L, does the solve, and then restores A. Even this crude approach was many
times faster than the level 2 I recall. Although I never understood why the
factorization routine couldn't just return L is the first place.
Anyway the code below in case of interest.
Craig
SUBROUTINE L3_ROOKDSYTRS( UPLO, N, NRHS, A, LDA, IPIV, B, LDB,
$ WORK, INFO )
*
* Craig Lucas
* The Numerical Algorithms Group.
* October 2008
*
*  LAPACK routine (version 3.1) 
* Univ. of Tennessee, Univ. of California Berkeley and NAG Ltd..
* November 2006
*
* .. Scalar Arguments ..
INTEGER INFO, LDA, LDB, N, NRHS
CHARACTER UPLO
* ..
* .. Array Arguments ..
DOUBLE PRECISION A( LDA, * ), B( LDB, * ), WORK( * )
INTEGER IPIV( * )
* ..
*
* Purpose
* =======
*
* rookDSYTRS solves a system of linear equations A*X = B with a real
* symmetric matrix A using the factorization A = U*D*U**T or
* A = L*D*L**T computed by rookDSYTRF or fastbpDSYTRF. This is a
* Level 3 BLAS version.
*
* Arguments
* =========
*
* UPLO (input) CHARACTER*1
* Specifies whether the details of the factorization are stored
* as an upper or lower triangular matrix.
* = 'U': Upper triangular, form is A = U*D*U**T;
* = 'L': Lower triangular, form is A = L*D*L**T.
*
* ** ONLY UPLO = 'L' IS IMPLEMENTED **
*
* N (input) INTEGER
* The order of the matrix A. N >= 0.
*
* NRHS (input) INTEGER
* The number of right hand sides, i.e., the number of columns
* of the matrix B. NRHS >= 0.
*
* A (input) DOUBLE PRECISION array, dimension (LDA,N)
* The block diagonal matrix D and the multipliers used to
* obtain the factor U or L as computed by rookSSYTRF or
* fastbpSSYTRF.
*
* LDA (input) INTEGER
* The leading dimension of the array A. LDA >= max(1,N).
*
* IPIV (input) INTEGER array, dimension (N)
* Details of the interchanges and the block structure of D
* as determined by rookDSYTRF or fastbpDSYTRF.
*
* B (input/output) DOUBLE PRECISION array, dimension (LDB,NRHS)
* On entry, the right hand side matrix B.
* On exit, the solution matrix X.
*
* LDB (input) INTEGER
* The leading dimension of the array B. LDB >= max(1,N).
*
* WORK (input) DOUBLE PRECISION array, dimension (N)
* Workspace.
*
* INFO (output) INTEGER
* = 0: successful exit
* < 0: if INFO = i, the ith argument had an illegal value
*
* =====================================================================
*
* .. Parameters ..
DOUBLE PRECISION ONE
PARAMETER ( ONE = 1.0D+0 )
* ..
* .. Local Scalars ..
DOUBLE PRECISION AK, AKM1, AKM1K, BK, BKM1, DENOM
INTEGER J, K, KP
LOGICAL UPPER
* ..
* .. External Functions ..
LOGICAL LSAME
EXTERNAL LSAME
* ..
* .. External Subroutines ..
EXTERNAL DCOPY, DSCAL, DSWAP, DTRSM, XERBLA
* ..
* .. Intrinsic Functions ..
INTRINSIC MAX
* ..
*
* .. Executable Statements ..
INFO = 0
UPPER = LSAME( UPLO, 'U' )
IF( .NOT.UPPER .AND. .NOT.LSAME( UPLO, 'L' ) ) THEN
INFO = 1
ELSE IF( N.LT.0 ) THEN
INFO = 2
ELSE IF( NRHS.LT.0 ) THEN
INFO = 3
ELSE IF( LDA.LT.MAX( 1, N ) ) THEN
INFO = 5
ELSE IF( LDB.LT.MAX( 1, N ) ) THEN
INFO = 8
END IF
IF( INFO.NE.0 ) THEN
*
CALL XERBLA( 'l3_rookDSYTRS', INFO )
*
RETURN
END IF
*
* Quick return if possible
*
IF( N.EQ.0 .OR. NRHS.EQ.0 )
$ RETURN
*
IF( UPPER ) THEN
*
RETURN
*
ELSE
*
* Solve A*X = B, where A = L*D*L'.
*
* First copy out lower diagonal to workspace, and form L in
* lower triangle of A. Zero D( 2, 1). Permute B.
*
* K is the main loop index, increasing from 1 to N in steps of
* 1 or 2, depending on the size of the diagonal blocks.
*
CALL DCOPY( N1, A( 2, 1 ), LDA+1, WORK( 1 ), 1 )
K = 1
10 CONTINUE
*
* If K > N, exit from loop.
*
IF( K.GT.N )
$ GO TO 20
*
IF( IPIV(K).GT.0 ) THEN
*
* 1 x 1 diagonal block
*
KP = IPIV( K )
IF( KP.NE.K ) THEN
CALL DSWAP( K1, A( K, 1 ), LDA, A( KP, 1 ), LDA )
CALL DSWAP( NRHS, B( K, 1 ), LDB, B( KP, 1 ), LDB )
END IF
K = K + 1
ELSE
*
* 2 x 2 diagonal block
*
KP = IPIV( K )
IF( KP.NE.K ) THEN
CALL DSWAP( K1, A( K, 1 ), LDA, A( KP, 1 ), LDA )
CALL DSWAP( NRHS, B( K, 1 ), LDB, B( KP, 1 ), LDB )
END IF
KP = IPIV( K+1 )
IF( KP.NE.K+1 ) THEN
CALL DSWAP( K1, A( K+1, 1 ), LDA, A( KP, 1 ), LDA )
CALL DSWAP( NRHS, B( K+1, 1 ), LDB, B( KP, 1 ), LDB )
END IF
*
A( K+1, K ) = 0
K = K + 2
END IF
*
GO TO 10
20 CONTINUE
*
* Next solve L*D*X = B, overwriting B with X.
*
CALL DTRSM( 'Left', 'Lower', 'No transpose', 'Unit', N, NRHS,
$ ONE, A, LDA, B, LDB )
*
* Multiply by the inverse of the diagonal blocks.
*
K = 1
30 CONTINUE
*
* If K > N, exit from loop.
*
IF( K.GT.N )
$ GO TO 50
IF( IPIV( K ).GT.0 ) THEN
*
* 1 x 1 diagonal block
* Interchange rows K and IPIV(K).
*
CALL DSCAL( NRHS, ONE / A( K, K ), B( K, 1 ), LDB )
K = K + 1
*
ELSE
*
* 2 x 2 diagonal block
* Interchange rows K and IPIV(K) THEN K1 and IPIV(K1)
*
AKM1K = WORK( K )
AKM1 = A( K, K ) / AKM1K
AK = A( K+1, K+1 ) / AKM1K
DENOM = AKM1*AK  ONE
DO 40 J = 1, NRHS
BKM1 = B( K, J ) / AKM1K
BK = B( K+1, J ) / AKM1K
B( K, J ) = ( AK*BKM1BK ) / DENOM
B( K+1, J ) = ( AKM1*BKBKM1 ) / DENOM
40 CONTINUE
K = K + 2
END IF
*
GO TO 30
50 CONTINUE
*
* Next solve L'*X = B, overwriting B with X.
*
CALL DTRSM( 'Left', 'L', 'Transpose', 'Unit', N, NRHS, ONE,
$ A, LDA, B, LDB )
*
* Restore A to input state. Permute B.
*
* K is the main loop index, decreasing from N to 1 in steps of
* 1 or 2, depending on the size of the diagonal blocks.
*
K = N
60 CONTINUE
*
* If K < 1, exit from loop.
*
IF( K.LT.1 )
$ GO TO 70
*
IF( IPIV( K ).GT.0 ) THEN
*
* 1 x 1 diagonal block
* Interchange rows K and IPIV(K).
*
KP = IPIV( K )
IF( KP.NE.K ) THEN
CALL DSWAP( K1, A( K, 1 ), LDA, A( KP, 1 ), LDA )
CALL DSWAP( NRHS, B( K, 1 ), LDB, B( KP, 1 ), LDB )
END IF
K = K  1
ELSE
*
* 2 x 2 diagonal block
* Interchange rows K and IPIV(K) THEN K1 and IPIV(K1)
*
KP = IPIV( K )
IF( KP.NE.K ) THEN
CALL DSWAP( K2, A( K, 1 ), LDA, A( KP, 1 ), LDA )
CALL DSWAP( NRHS, B( K, 1 ), LDB, B( KP, 1 ), LDB )
END IF
KP = IPIV( K1 )
IF( KP.NE.K1 ) THEN
CALL DSWAP( K2, A( K1, 1 ), LDA, A( KP, 1 ), LDA )
CALL DSWAP( NRHS, B( K1, 1 ), LDB, B( KP, 1 ), LDB )
END IF
K = K  2
END IF
*
GO TO 60
70 CONTINUE
CALL DCOPY( N1, WORK( 1 ), 1, A( 2, 1 ), LDA+1 )
END IF
*
RETURN
*
* End of rookDSYTRS
*
END
From: lapackbounces@Domain.Removed [mailto:lapackbounces@Domain.Removed] On
Behalf Of James Demmel
Sent: 13 November 2014 15:01
To: Bobby Cheng; lapack@Domain.Removed
Cc: beboppers@Domain.Removed; Pat Quillen; Duncan Po; Penny Anderson
Subject: Re: [Lapack] LAPACK 3.5  xSYSV_ROOK
The fastest alternative that we have explored is based on Aasen's method,
which factors A into P*L*T*L^T*P^T, where P is a permutation, L is lower
triangular, and T is tridiagonal. There is a paper (which won a prize at
IPDPS'13) on this, see
http://www.eecs.berkeley.edu/~odedsc/papers/block_aasen.pdf
These more recent versions modify Aasen's original factorization by going in
steps from A to narrower band and then solving the narrower band problem.
So the output factorization differs from BunchKaufman or rook, but this
seems necessary to attain high performance. How would such a change
affect the Matlab interface and its users?
Jim
On 11/12/14, 11:38 AM, Bobby Cheng wrote:
Hi LAPACK team,
It is great to see that LDL^T with rook pivoting made it into LAPACK 3.5.
I observed that xSYTRI_ROOK and xSYTRS_ROOK use only level 2 BLAS similar to
xSYTRI and xSYTRS. However, I have not found the level three BLAS counterpart
of xSYTRI2 and xSYTRS2. This is an issue for MATLAB, as customers, not
surprisingly, have found the performance of xSYTRI not acceptable for large
matrices. This will be the same for xSYTRI_ROOK. I am wondering what your plan
is for applying what is done in xSYTRI2 to rook pivoting.
Regards,
Bobby Cheng
_______________________________________________
Lapack mailing list
Lapack@Domain.Removed<mailto:Lapack@Domain.Removed>
http://lists.eecs.utk.edu/mailman/listinfo/lapack
________________________________________________________________________
This email has been scanned for all viruses by Star.
________________________________________________________________________
 next part 
An HTML attachment was scrubbed...
URL:
<http://lists.eecs.utk.edu/mailman/private/lapack/attachments/20141113/c7cf543c/attachment0001.html>
