I have a piece of rather efficient code to do a certain matrix manipulation. However, I am now facing a demand to make things even faster than they already are. The code I have is efficient, but it does not use any BLAS and/or LAPACK tricks, and thus I think (or maybe it is more "hope"?) that the speed may be improved by incorporating some BLAS and/or LAPACK subroutines, or at least some of the "smart tricks" BLAS and LAPACK use.

So let me explain what I am doing (source code is included below as well).

I am working with Normal Equations (nice symmetric positive definite matrices). For efficiency reasons, and in order to keep the normal equation matrix small, I eliminate certain parameters from the Normal Equation system before solving it. The parameters I eliminate are what I call "epoch parameters", and they are eliminated from the main matrix after the processing of an epoch is finished. I typically process 24 hours of data at a 5-minute sampling rate, resulting in 288 epochs to be processed.

So for some theoretical background, let us use as notation for the Normal Equation system:

- Code: Select all
`A x = b`

Where A is the normal equation matrix, x the vector of unknown parameters, and b the RHS (Right Hand Side) of the system. A is always symmetric and positive definite.

Then what I do for the elimination of the epoch parameters is as follows. We divide the main matrix A into four parts:

- A11 : the main matrix excluding all the non-epoch parameters

A12 : the correlation matrix between the main matrix and the epoch parameter matrix (at a certain epoch)

A21 : the transposed of A12

A22 : the matrix of the epoch parameters (at a certain epoch)

We may then write the normal equation system in the form:

- Code: Select all

```
| A11 A12 |   | x1 |   | b1 |
| A21 A22 | * | x2 | = | b2 |
```

Where x1 are the "main" parameters, x2 the "epoch" parameters for the current epoch and b1 and b2 the "right hand side" (b) of the normal equation system.

At every epoch we then eliminate the epoch parameters by performing the following operations:

- Code: Select all

```
A11~ = A11 - A12 A22^{-1} A21
b1~  = b1  - A12 A22^{-1} b2
```

The reduced normal equation system, i.e., after elimination of the epoch parameters, is then:

- Code: Select all
`A11~ x1 = b1~`

After processing all epochs and eliminating all epoch parameters this is the system we solve to get the parameters we are estimating.
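As a quick sanity check of the algebra above, here is a small sketch in Python/NumPy (random test matrices, purely illustrative, not our production code): eliminating x2 via the reduced system must reproduce the same x1 as solving the full system.

```python
import numpy as np

rng = np.random.default_rng(42)
n1, n2 = 6, 3

# Random symmetric positive definite normal matrix A and RHS b
M = rng.standard_normal((n1 + n2, n1 + n2))
A = M @ M.T + (n1 + n2) * np.eye(n1 + n2)
b = rng.standard_normal(n1 + n2)

A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

# Eliminate the epoch parameters x2:
#   A11~ = A11 - A12 A22^{-1} A21,  b1~ = b1 - A12 A22^{-1} b2
A22inv = np.linalg.inv(A22)
A11r = A11 - A12 @ A22inv @ A21
b1r = b1 - A12 @ A22inv @ b2

# The reduced system yields the same x1 as the full system
x_full = np.linalg.solve(A, b)
x1 = np.linalg.solve(A11r, b1r)
assert np.allclose(x1, x_full[:n1])
```

The same check works with several epochs eliminated one after the other, since each elimination is just a repetition of this step.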

So far so good and this is all simple enough (I hope).

However, let us now have a look at the size of the problems I am working with.

The typical size of x1 is around 10000 parameters (A11 thus being a 10000 x 10000 matrix), whereas the size of x2 is around 150 parameters (A22 being a 150 x 150 matrix and consequently A12 being a 10000 x 150 matrix). Doing a pre-elimination as described above (and remember I have to do it 288 times per 24-hour solution) took too much time. So I invested a lot of time and came up with something significantly faster. The main time reduction was obtained by exploiting the fact that A12 (and its transpose A21) have a large number of zeros (~80%). So the code I wrote first makes an index of the non-zero elements in A12 and then only loops over these non-zero elements. It also makes use of the fact that the multiplication:

- Code: Select all
`A12 A22^{-1}`

is used twice, so this is stored in a "help" array. The symmetry of the matrices is also exploited: only the upper triangle of A11 is computed.
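To illustrate the two tricks just described (the non-zero index of A12 and the row-wise "help" array, updating only the upper triangle), here is a small sketch in Python/NumPy mirroring the loop structure of the Fortran routine; the sizes and random test data are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 8, 3

# a22 already inverted (symmetric positive definite) and a sparse a12 (~80% zeros)
M = rng.standard_normal((n2, n2))
a22inv = np.linalg.inv(M @ M.T + n2 * np.eye(n2))
a12 = rng.standard_normal((n1, n2)) * (rng.random((n1, n2)) < 0.2)
a11 = 10.0 * np.eye(n1)

# Index the non-zero columns of each row of a12 once, up front
nz = [np.flatnonzero(a12[i]) for i in range(n1)]
rows = [i for i in range(n1) if nz[i].size > 0]

a11r = a11.copy()
for i in rows:
    # help = (row i of a12) * a22^{-1}, touching only the non-zero entries
    help_ = a12[i, nz[i]] @ a22inv[nz[i], :]
    for k in rows:
        if k > i:
            continue  # upper triangle only: element a11r[k, i] with k <= i
        a11r[k, i] -= help_[nz[k]] @ a12[k, nz[k]]

# Check the upper triangle against the dense formula a11 - a12 a22^{-1} a21
ref = a11 - a12 @ a22inv @ a12.T
iu = np.triu_indices(n1)
assert np.allclose(a11r[iu], ref[iu])
```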

For many years (since 2005) this implementation has been good enough for us. However, it has now become the bottleneck in our computations, and I am therefore interested in getting into contact with the BLAS and/or LAPACK experts to discuss how my subroutine may be further improved! This part of the process consumes approximately 50% of all the CPU time we need for a 24-hour solution!

Below is the source code I developed for this. It is pretty efficient, but it uses no BLAS/LAPACK tricks/features (except that a22 is actually inverted using LAPACK). I would appreciate any hints, tips, or pointers on how to speed up this process. My aim is to improve the speed by a factor of 2, which I think should be possible given the improvements normally achievable with BLAS/LAPACK. I have time available to devote to this, so any help would be greatly appreciated!

Cheers,

Tim

ps. As a "fun fact" I would like to say that the very first question in this forum was actually from me and also about this topic! So this is an "oldie" (or at least I am)!

- Code: Select all

```fortran
!! Very efficient pre-elimination of a normal equation system of the form
!!
!!   | a11, a12 |   | x1 |   | b1 |
!!   |          | * |    | = |    | = b
!!   | a21, a22 |   | x2 |   | b2 |
!!
!! When calling the subroutine a22 must already be inverted and a full matrix!
!! The normal matrix a11 must be symmetric or upper triangle on input
!! The normal matrix a11 is upper triangle on output
!!
!! On output a11 = a11 - a12 a22 a21
!! On output b1  = b1  - a12 a22 b2
!!
SUBROUTINE PreElMat ( a11, a21, a22, b1, b2 )
!---Declaration of arguments----------------------------------------------------
  DOUBLE PRECISION, DIMENSION(:,:), INTENT(INOUT) :: a11
  DOUBLE PRECISION, DIMENSION(:,:), INTENT(IN)    :: a21
  DOUBLE PRECISION, DIMENSION(:,:), INTENT(IN)    :: a22
  DOUBLE PRECISION, DIMENSION(:),   INTENT(INOUT) :: b1
  DOUBLE PRECISION, DIMENSION(:),   INTENT(IN)    :: b2
  ! Remark: a22 must already be inverted
!---End of declaration of arguments---------------------------------------------
!---Declaration of local variables----------------------------------------------
  INTEGER :: i,j,k,l,ir,kc,kr,lc,nrow,ntot,n1,n2,nonzero,allocstat
  !
  INTEGER,          DIMENSION(:),   POINTER :: indrow
  INTEGER,          DIMENSION(:,:), POINTER :: indcol
  INTEGER,          DIMENSION(:),   POINTER :: ncol
  DOUBLE PRECISION, DIMENSION(:),   POINTER :: help
!---End of declaration of local variables---------------------------------------
!---Executable code-------------------------------------------------------------
  ! Get matrix size
  n1 = SIZE(a21,2)
  n2 = SIZE(a21,1)

  ! Initialize pointers
  indrow => NULL()
  indcol => NULL()
  ncol   => NULL()
  help   => NULL()
  !
  ALLOCATE ( indrow(n1), STAT=allocStat )
  IF ( allocStat /= 0 ) THEN
    CALL NCer_Set ('PreElMat: allocating indrow.')
  END IF
  !
  ALLOCATE ( indcol(n1,n2), STAT=allocStat )
  IF ( allocStat /= 0 ) THEN
    CALL NCer_Set ('PreElMat: allocating indcol.')
  END IF
  !
  ALLOCATE ( ncol(n1), STAT=allocStat )
  IF ( allocStat /= 0 ) THEN
    CALL NCer_Set ('PreElMat: allocating ncol.')
  END IF
  !
  ALLOCATE ( help(n2), STAT=allocStat )
  IF ( allocStat /= 0 ) THEN
    CALL NCer_Set ('PreElMat: allocating help.')
  END IF
  !
  ! Analyze matrix a21
  ! ------------------
  ! nrow        : Number of effective lines in matrix a21
  ! indrow(i)   : Row number of the i-th non-empty line of matrix a21
  ! ncol(i)     : Number of non-zero elements in line indrow(i) of matrix a21
  ! indcol(i,j) : Column number of the j-th non-zero element in line indrow(i) of matrix a21
  !
  nrow = 0
  ntot = n1
  DO k = 1, ntot
    i = k
    nonzero = 0
    DO j = 1, n2
      IF (a21(j,i) /= 0.D0) THEN
        nonzero = nonzero + 1
        IF (nonzero == 1) THEN
          nrow = nrow + 1
          indrow(nrow) = i
        END IF
        ncol(nrow) = nonzero
        indcol(nrow,nonzero) = j
      END IF
    END DO
  END DO
  !
  ! Grand loop over all rows in a11
  ! -------------------------------
  DO ir = 1, nrow
    i = indrow(ir)
    !
    ! Step 1 compute: help = a12 * a22
    ! --------------------------------
    DO j = 1, n2
      help(j) = 0.D0
      DO kc = 1, ncol(ir)
        k = indcol(ir, kc)
        help(j) = help(j) + a21(k,i) * a22(k,j)
      END DO
    END DO
    !
    ! Step 2a reduce A: a11 = a11 - a12 * a22 * a21
    !                       = a11 - help * a21
    ! ---------------------------------------------
    DO kr = 1, ir  ! upper triangle only
      k = indrow(kr)
      DO lc = 1, ncol(kr)
        l = indcol(kr,lc)
        a11(k,i) = a11(k,i) - help(l) * a21(l,k)
      END DO
    END DO
    !
    ! Step 2b reduce B: b1 = b1 - a12 * a22 * b2
    !                      = b1 - help * b2
    ! ------------------------------------------
    b1(i) = b1(i) - DOT_PRODUCT( help, b2 )
  END DO

  DEALLOCATE(indrow, indcol, ncol, help, STAT = allocStat)
  IF ( allocStat /= 0 ) THEN
    CALL NCer_Set ('PreElMat: deallocating matrices.')
  END IF

END SUBROUTINE PreElMat
```
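One BLAS-flavoured direction that may be worth discussing (a sketch only, in Python/NumPy as a stand-in for the corresponding Fortran BLAS calls): since a22^{-1} is symmetric positive definite, it could be factored once per epoch as C C^T (DPOTRF), so that a12 a22^{-1} a21 = (a12 C)(a12 C)^T. The matrix update then becomes one DTRMM/DGEMM plus one rank-n2 symmetric update (DSYRK), which touches only one triangle of a11:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 10, 4

# a22 already inverted on entry, symmetric positive definite
M = rng.standard_normal((n2, n2))
a22inv = np.linalg.inv(M @ M.T + n2 * np.eye(n2))
a12 = rng.standard_normal((n1, n2))
a11 = 50.0 * np.eye(n1)
b1, b2 = rng.standard_normal(n1), rng.standard_normal(n2)

# Factor a22^{-1} = C C^T once per epoch (DPOTRF); then
#   a12 a22^{-1} a21 = (a12 C)(a12 C)^T
C = np.linalg.cholesky(a22inv)   # lower triangular
W = a12 @ C                      # one DGEMM (or DTRMM, since C is triangular)

# Rank-n2 symmetric update; in Fortran this would be a single DSYRK call,
# which computes only one triangle of a11
a11r = a11 - W @ W.T
b1r = b1 - (a12 @ a22inv) @ b2   # help = a12 a22^{-1} reused, as in the code

# Same result as the explicit formula a11 - a12 a22^{-1} a21
assert np.allclose(a11r, a11 - a12 @ a22inv @ a12.T)
```

Whether this would beat the sparse loops probably depends on the actual sparsity of A12: the dense DSYRK ignores the zeros, but runs at full BLAS-3 speed.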