## Need for Speed

Open discussion regarding features, bugs, issues, vendors, etc.

### Need for Speed

Dear Forum,

I have a piece of rather efficient code to do a certain matrix manipulation. However, I am now facing a demand to make things even faster then they already are. The piece of code I have is efficient but it does not use any BLAS and/or LAPACK tricks and thus I think (or maybe it is more "hope"?) that the speed may be improved by incorporating some BLAS and/or LAPACK subroutines or at least some of the "smart tricks" BLAS and LAPACK use.

So let me explain what I am doing (source code is included below as well).
I am working with Normal Equations (nice symmetric positive definitive matrices). For efficiency reasons, and in order to keep the normal equation matrix small, I eliminate certain parameters from the Normal Equation system before solving it. The parameters I eliminate are what I call "epoch parameters" and they are eliminated from the main matrix after the processing of an epoch is finished. I typically process 24 hours of data at a 5 minute sampling rate thus resulting in 288 epochs to be processed.

So for some theorethical background, let us use as notation for the Normal Equation system:
Code: Select all
` A x = b`

Where A is the normal equation matrix, x the vector of unknown parameters, and b the RHS (Right Hand Side) of the system. A is always symmetrical and positive definitive.

Then what I do for the elimination of the epoch parameters is as follows. We divide the main matrix A into four parts:
A11 : the main matrix excluding all the non-epoch parameters
A12 : the correlation matrix between the main matrix and the epoch parameter matrix (at a certain epoch)
A21 : the transposed of A12
A22 : the matrix of the epoch parameters (at a certain epoch)

We may then write the normal equation system in the form:
Code: Select all
` A11  A12 |   x1 |  =  b1 | A21  A22 |   x2 |     b2 |`

Where x1 are the "main" parameters, x2 the "epoch" parameters for the current epoch and b1 and b2 the "right hand side" (b) of the normal equation system.

At every epoch we then eliminate the epoch parameters by performing the following operations:
Code: Select all
` A11~ = A11 - A12 A22^{-1} A21 b1~  = b1  - A12 A22^{-1} b2`

Where the reduced normal equation system, i.e., after elimination of the epoch parameters is then:
Code: Select all
` A11~ x1 = b1~ `

After processing all epochs and eliminating all epoch parameters this is the system we solve to get the parameters we are estimating.

So far so good and this is all simple enough (I hope).
However, now lets have a look at the size of the problems I am working with.
The typical size of x1 is around 10000 parameters (A11 being thus a 10000 x 10000 matrix) whereas the size of x2 is around 150 parameters (A22 being a 150 x 150 matrix and consequently A12 being a 10000 x 150 matrix). Doing a preelimination as described above (and remember I have to do it 288 times per 24 hour solution) took too much time. So I invested a lot of time and came up with something which is significantly faster. The main time reduction was obtained by exploiting the fact that A12 (and its transposed A21) have a large number of zeros (~80%). So the code I wrote first makes an index of the non-zero elements in A12 and then only loops over these non-zero elements. It also makes use of the fact that the multiplication:
Code: Select all
` A12 A22^{-1}`

is used twice so this is stored in a "help" array. Also the fact that we are dealing with symmetrical matrices is used in that we only do the upper part of the matrix.

For many years (since 2005) this implementation has been good enough for us. However, it has now become the "bottleneck" in our computations and I am therefore interested in getting into contact with the BLAS and/or LAPACK experts and discuss on how my subroutine may be further improved! This part of the process consumes approximately 50% of all the CPU time we need for a 24 hour solution!

Below the source code I developed for this which is pretty efficient but it uses no BLAS/LAPACk tricks/featrues (except that a22 is actually inverted using LAPACK). I would appreciate any hints, tips, pointers in how to speed up this process. My aim is to improve the speed by a factor of 2 which I think should be possible having seen the normal improvements achievable with BLAS/LAPACK. I have time available to devote to this so any help would be greatly appreciated!

Cheers,
Tim

ps. As a "fun fact" I would like to say that the very first question in this forum was actually from me and also about this topic! So this is an "oldie" (or at least I am)!

Code: Select all
`!! Very efficient pre-elimination of a normal equation system of the form!!!!   | a11, a12 |   x1 |    b1 |!!   |          | *    | =     | = b !!   | a21, a22 |   x2 |    b2 |!!!! When calling the subroutine a22 must already be inverted and full matrix!!! The normal matrix a11 must be symmetric or upper triangle on input!! The normal matrix a11 is upper triangle on output!!!! On output a11 = a11 - a12 a22 a21!! On output b1  = b1  - a12 a22 b2!!SUBROUTINE PreElMat ( a11, a21, a22, b1, b2 )!---Declaration of arguments----------------------------------------------------    DOUBLE PRECISION, DIMENSION(:,:), INTENT(INOUT)        :: a11    DOUBLE PRECISION, DIMENSION(:,:), INTENT(IN)           :: a21    DOUBLE PRECISION, DIMENSION(:,:), INTENT(IN)           :: a22    DOUBLE PRECISION, DIMENSION(:),   INTENT(INOUT)        :: b1    DOUBLE PRECISION, DIMENSION(:),   INTENT(IN)           :: b2!   Remark: a22 must already be inverted!---End of declaration of arguments---------------------------------------------!---Declaration of local variables----------------------------------------------    INTEGER :: i,j,k,l,ir,kc,kr,lc,nrow,ntot,n1,n2,nonzero,allocstat!    INTEGER         , DIMENSION (:),   POINTER :: indrow    INTEGER         , DIMENSION (:,:), POINTER :: indcol    INTEGER         , DIMENSION (:),   POINTER :: ncol    DOUBLE PRECISION, DIMENSION (:),   POINTER :: help!---End of declaration of local variables---------------------------------------!---Executable code-------------------------------------------------------------!   Get matrix size    n1 = SIZE(a21,2)    n2 = SIZE(a21,1)!   Initialize pointers    indrow => NULL()    indcol => NULL()    ncol   => NULL()    help   => NULL()!    ALLOCATE ( indrow(n1), STAT=allocStat )    IF ( allocStat /= 0 ) THEN      CALL NCer_Set ('PreElMat: allocating indrow.')    END IF!    ALLOCATE ( indcol(n1,n2), STAT=allocStat )    IF ( allocStat /= 0 ) THEN      CALL NCer_Set ('PreElMat: allocating indcol.')    END IF!    ALLOCATE ( ncol(n1), STAT=allocStat )    IF ( allocStat /= 0 ) THEN      CALL NCer_Set ('PreElMat: allocating ncol.')    END IF!    ALLOCATE ( help(n2), STAT=allocStat )    IF ( allocStat /= 0 ) THEN      CALL NCer_Set ('PreElMat: allocating help.')    END IF!! Analyze matrix a21! ------------------! nrow        : Number of effective lines in matrix a21! indrow(i)   : Row number of the i-th non empty line of matrix a21! ncol(i)     : Number of non-zero elements in line ncol(i) of matrix a21! indcol(i,j) : Column number of the j-th non-zero element in line ncol(i) of matrix a21!    nrow = 0    ntot = n1    DO k = 1, ntot      i = k      nonzero = 0      DO j = 1, n2        IF (a21(j,i) /= 0.D0) THEN          nonzero = nonzero + 1          IF (nonzero == 1) THEN            nrow = nrow + 1            indrow(nrow) = i          END IF          ncol(nrow) = nonzero          indcol(nrow,nonzero) = j        END IF      END DO    END DO!! Grand loop over all rows in a11! -------------------------------    DO ir = 1, nrow      i = indrow(ir)!! Step 1 compute: help = a12 * a22! --------------------------------      DO j = 1, n2        help(j) = 0.D0        DO kc = 1, ncol(ir)          k = indcol(ir, kc)          help(j) = help(j) + a21(k,i) * a22(k,j)        END DO      END DO!! Step 2a reduce A: a11 = a11 - a12 * a22 * a21!                       = a11 - help      * a21! ---------------------------------------------      DO kr = 1, ir   ! upper        k = indrow(kr)        DO lc = 1, ncol(kr)          l = indcol(kr,lc)          a11(k,i) = a11(k,i) - help(l) * a21(l,k)        END DO      END DO!! Step 2b reduce B: b1 = b1 - a12 * a22 * b2 !                      = b1 - help      * b2! ------------------------------------------      b1(i) = b1(i) - DOT_PRODUCT( help, b2 )    END DO    DEALLOCATE(indrow, indcol, ncol, help, STAT = allocStat)    IF ( allocStat /= 0 ) THEN      CALL NCer_Set ('PreElMat: deallocating matrices.')    END IFEND SUBROUTINE PreElMat`
springinhetveld

Posts: 4
Joined: Mon Oct 31, 2011 5:05 am

### Re: Need for Speed

Hi Forum,

It is a bit disappointing that I received no replies but I am not giving up ;)

I have been working on this quite a bit and got some interesting results and issues. For instance a performance issue with the DGEMM from the intel MKL, see my post in the Intel MKL forum. http://software.intel.com/en-us/forums/showthread.php?t=101253

The good thing was that this issue made me write a small test program which people may download and play around with. You will find this code and corresponding make scripts for gfortran and ifort (source make.gfortran and source make.ifort) on my ftp at: ftp://ftp.positim.com (web87f2 / ifort12)

The input (binary) data is a bit large but very representative for what I am trying to do.

In this program there are (currently) 5 methods (subroutines) to do the "thing" I want to speed up.
ReduceNormalEquation0: Simplest (and slowest) method using MATMUL
ReduceNormalEquation2: Simple but much faster because using DGEMM (this is the one that shows the performance issue of the ifort mkl DGEMM)
ReduceNormalEquation3: Like "2" but now using the fact that we have symmetrical matrices so one DGEMM is replaced by DSYMM
ReduceNormalEquation4: A LAPACK based approach which for the ifort shows some promise but is still slow. Based on what I found/learned/understood from:http://www.ii.metu.edu.tr/coursewebsites/quda/web_sitesi/1288489.htm
ReduceNormalEquation: My own smart approach which so far has remained unbeatable..... But this is the one I want to improve with a factor of at least 2.

The current performance table looks something like:
Version gfortran ifort
0 19.00 25.70
2 0.45 3.60
3 0.40 3.60
4 9.30 1.80 (but the results for the RHS are not correct, doing something wrong).
Mine 0.10 0.08

For your information this "pre-elimination" (as I call it) is also called schur-complement, Gauss-Jordan elimination, or also block-elimination. So if you are familiar with any of those terms please have a look at my problem and see if you can help!

My approach is based on the sparseness of the A12 matrix. I know that my (resulting) A11 matrix has about 20% non-zeros and I thought A12 was similar. However, turns out that A12 has ~1% non-zero. So that is really sparse!! So maybe I should be looking at some of the sparse packages? Is there an easy to use sparse DGEMM version? That might really do the trick for me. Especially considering that the dense DGEMM is only a factor of 4-5 slower (gfortran case). So a sparse version could really be performing much better, right?

Hoping for some good hints and tips!

Cheers,
Tim
springinhetveld

Posts: 4
Joined: Mon Oct 31, 2011 5:05 am

### Re: Need for Speed

Hi Tim,

Sorry for not coming back to you earlier, I did not see your first post.
All you are doing is interesting. I am giving you some pointers here.

1) If matrix A11 is SPD and 20% sparse, it might be interesting to use a sparse direct solver to factor it. 20% is at the border between dense and sparse as far as I understand so I am not sure you will get any speedup unfortunately. For the Schur complement update, with 1% sparsity, you really want to use your sparse code. DSYMM is performing 100x more flops, so also your code as a lots of indirection, a factor of 100 is hard to lose entirely ...

2) The website you point to (short paper on how to do Schur complement update) is correct, basically instead of computing the inverse of A22 (A22inv) and multiply by it (as you were doing), it would be better to leave A22 in factor form. So A22 would be the Cholesky factor of A22 (A22chol) as opposed to its inverse (A22inv), so A22 would simply be the output of DPOTRF. Then you can use the Cholesky factor with some TRSM. Since A22 = A22chol * A22chol^T, (Be careful here, I am assuming lower storage ..., would be A22 = A22chol^T * A22chol for upper !), A22inv = A22chol^(-T) * A22chol^(-1), so instead of GEMM with A22inv, you can do two TRSM. (Note: the cost of two TRSM is the cost of one GEMM). So I guess this is what you are trying to do with Method #4. You know ... I think you are fine with multiplication by A22inv ... This is OK. (You lose on DPOTRI but GEMM is more efficient than TRSM.)

3) If you want to speedup your application, one obvious way to go is to parallelize the code. This is not mentioned in your post so I mention it. It is fairly current nowadays to have a few cores available, and, besides the analysis of the sparsity of A12, your "Reduce Normal Equation" is easily parallelizable. You could use a few Open MP pragma a be done wit it.

4) You might want to give a shot at MUMPS, this is a sparse direct package in F90, ( http://graal.ens-lyon.fr/MUMPS/ ), they handle SPD matrices and (I think) they even handle the (sparse) Schur complement update for you and the package is multithreaded.

Cheers,
Julien.
Julien Langou

Posts: 827
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

### Re: Need for Speed

Dear Tim,
well, you can rely on some smart tricks which are basically the ideas at the base of sparse factorization algorithms. I'm assuming here that your A11 matrix is full.
Assume you collect the row indices of all the non-empty rows of A21 in an array called idx and call B a submatrix made of only those rows of A21.
In Fortran90 (or more recent) you can do this easily

Code: Select all
`k =0do i=1, size(A21,1)   if(maxval(A21(i,:)) .gt. 0.d0) then      k = k+1      idx(k)=i   end ifend doallocate(B(k,size(A21,2)))B(1:k,:) = A21(idx(1:k),:)`

Now you can compute the Schur complement of this matrix

Code: Select all
`| A22 B'|| B   0 |`

Lets call it S. This can be done easily and efficiently using Level-3 BLAS routines (i.e., dense matrix-matrix products and dense triangular system solves).
Once you've done this, the only thing left to do is what is commonly called expand-add. Fortran90 is your friend, again

Code: Select all
`A11(idx,idx) = A11(idx,idx) - S(idx,idx)`

Now you're good to go. Let me ask you a question: what changes and what stays the same from one iteration to another? Is A11 always the same? Are the epoch variables in epoch i depending on what happens in epoch i-1 ? I suppose yes (otherwise it would be too simple). I would not expect any great improvement BTW because your A22 matrix is very small but a factor 2 should be at reach with the little trick above.

PS
you want to check the indices in the pseudo-code above as I might have messed up. BTW I hope you got the idea.

Regards
Alfredo
buttari

Posts: 51
Joined: Tue Jul 11, 2006 2:11 pm

### Re: Need for Speed

Tim,
I forgot to comment on sparse solvers, which is quite bad from one of the MUMPS software developers :-)

MUMPS has a Schur complement feature which allows you to compute the Schur complement of a sparse matrix. The Schur complement, however, is always returned as a dense matrix so you can't really take advantage of the sparsity of A11 but, unless memory it's a problem for you, there's not much to gain there since you only do sums on A11.

Now, if your A22 matrix is dense, MUMPS won't do anything different than what I proposed above so better to do it by hand as you'll spare some other operations done internally by MUMPS.

If A22 is sparse as well, then MUMPS will do (potentially many) fewer floating point operations to compute the Schur complement. Then the method I proposed above stays pretty much the same except that you use MUMPS to compute the Schur complement S (obviously the corresponding matrix made by assembling A22 and B must be assembled and passed to MUMPS in the COO sparse format).

Another option would be to pass the whole A matrix to MUMPS in sparse format and tell MUMPS only to eliminate the variables in A22 (the MUMPS interface for doing this is straightforward); the problem is that the Schur complement will be returned in dense format and then you would have to sparsify it again for the subsequent iteration.

regards,
alfredo
buttari

Posts: 51
Joined: Tue Jul 11, 2006 2:11 pm

### Re: Need for Speed

@Julien
Good to hear from you again!

1)Regarding A11 and sparseness. At some point I did use the "umfpack" package. But turned out that LAPACK was faster. A sparse expert I talked to (Tim Davis, author of umfpack) told me that real sparse matrices are <5% non-zero and thus I gave up on that and just used LAPACK. However, I overlooked the fact that the A12 is really sparse (<1%). So there is some potential.

2) Clearly parallelizing is the way to go! This is in fact why I started to look into this. We are planning to start running this using a GPU and thus I figured when moving to a "standard" package I can save myself the trouble of learning how to write parallel code myself. But if you say it is easy to do for my own code then maybe I should have a look into that option. However, I first wanted to find out which approach is actually the "fastest" and then in step 2 look into how to parallelize that on a GPU.

@Alfred
Thanks for your hints and tips!
1) Regarding A11 it is about 20% non-zero which to my understanding is considered "full".
What you are proposing for the matrix B is actually what my approach "ReduceNormalEquation" (PreElMat in my original post) does (all be it more in F77 then in F90 style). My routine makes an index of the non-zero elements in A21 and then makes the Schur complement in two steps using the "help" matrix as defined in my original post. Your approach seems to only take advantage of completely missing rows whereas mine disregards all the zeros. But then yours is able to use BLAS for the MM which mine is not... Since there are many zero rows (I believe typically 1000 out of 6000 are non-zero rows) this might actually work. Should speed up the DGEMM by a factor of at least 6 (or 6x6?) I guess. Interesting will have to check that out!

2) The example I made is very simple version of the real life thing. We basically process 24 hours of data in 5 minute sampling. We go through the data sequentially and if an epoch (5 min interval) is processed this normal equation reduction takes place. We have many (well 100 - 300) parameters that are epoch dependent and those are dropped after processing the epoch. So A11 is our main matrix which changes (a bit) with every observation that gets processed (roughly we process 1million obs per 24 hour interval). A21 and A22 are the epoch dependent parts from which we have 288 instances (24 hours at 5 min interval). The example I made just has A21 and A22 from one epoch but uses it 288 (just to simplify the example) times. If we would not do this pre-elimination our main matrix would grow by 288x100 (to 300)= 28800 parameters (or even the 3 fold of that).

3) The epoch parameters have so far been treated as completely independent. Which indeed keeps things simple. Why do you write "that would be too simple"?

4) Will look into MUMPS if your proposal 1) does not work!

@Both.
Many thanks for your comments. Has really motivated me to dig further into this!
I will keep you informed of my progress.

Cheers,
Tim
springinhetveld

Posts: 4
Joined: Mon Oct 31, 2011 5:05 am

### Re: Need for Speed

NB: regarding parallelization, I was more thinking about simply using a multithreaded implementation on a multicore machine (using OpenMP or pthreads) as opposed to going to GPU.
Julien Langou

Posts: 827
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

### Re: Need for Speed

Tim,
well if all epochs are independent then your loop (1 to 288 over all the epochs) can be trivially parallelized with a simple openmp parallel do loop construct. You just have to make sure that there are no conflicts when the contributions (all the Schur complements) are assembled; this problem can be easily solved using some redundant memory which, in your case, should be relatively small if the structure of the A21 term is always the same (basically you have to allocate one S matrix on each thread and then do the assemblies A11+S at the end of the parallel loop).
I say that it would be too simple because it rarely happens in practice to have such a good luck.

reards
alfredo
buttari

Posts: 51
Joined: Tue Jul 11, 2006 2:11 pm

### Re: Need for Speed

Alfredo,
Just implemented your scheme and it performs well but it is a bit slower then my implementation.
mine: 0.10 / 0.08 (gfortran / ifort)
yours: 0.15 / 0.20 (gfortran / ifort)

But the big advantage with your approach is that it makes full use of DSYMM/DGEMM/DGEMV and thus it can very easily be parallelized.

Parallelizing over the 288 epochs would mean that we would have to have all 288 a21 and a22 matrices in memory.... That is a bit too much (~4GB) as we run many (or at least several) of these jobs at the same time normally.

@Julien
Regarding GPU. To my understanding there are now some BLAS/LAPACK libraries available for the GPUs. So I imagine (I am an optimist) that it would just be a matter of linking my source with such a library and "off it goes". But that may be a to simplistic view. Of course, multi-core may do the trick as well...

Cheers,
Tim
springinhetveld

Posts: 4
Joined: Mon Oct 31, 2011 5:05 am

### Re: Need for Speed

Regarding GPU. To my understanding there are now some BLAS/LAPACK libraries available for the GPUs.

Definetely, please consider MAGMA http://icl.cs.utk.edu/magma/ ;).

So I imagine (I am an optimist) that it would just be a matter of linking my source with such a library and "off it goes".

Pretty much. This means that you need to stay dense. You cannot do sparse computation on the GPU. Libraries do not do it yet and that would be a real challenge. But GPU are so fast that maybe being dense is not a concern.

Of course, multi-core may do the trick as well...

If you want to do it yourself and keep advantage of sparsity, that would be the way to go as far as I see it.

Cheers,
Julien.
Julien Langou

Posts: 827
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

### Re: Need for Speed

springinhetveld wrote:Alfredo,
Just implemented your scheme and it performs well but it is a bit slower then my implementation.
mine: 0.10 / 0.08 (gfortran / ifort)
yours: 0.15 / 0.20 (gfortran / ifort)

But the big advantage with your approach is that it makes full use of DSYMM/DGEMM/DGEMV and thus it can very easily be parallelized.

Parallelizing over the 288 epochs would mean that we would have to have all 288 a21 and a22 matrices in memory.... That is a bit too much (~4GB) as we run many (or at least several) of these jobs at the same time normally.

@Julien
Regarding GPU. To my understanding there are now some BLAS/LAPACK libraries available for the GPUs. So I imagine (I am an optimist) that it would just be a matter of linking my source with such a library and "off it goes". But that may be a to simplistic view. Of course, multi-core may do the trick as well...

Cheers,
Tim

Tim,
well you won't be able to compute at the same time more epochs than the number of cores available on your machine. Thus if you have an 8-core machine you will only have 8 A21 and 8 A22 in memory at the same time. So, unless you have a 288-core machine this shouldn't be a major problem. Moreover, if you can store A21 in sparse format then this won't be a problem at all.
The other option is obviously to have a sequential loop but use parallel blas inside but I would expect a smaller speedup wrt the other approach.

ciao
Alfredo

Alfredo
buttari

Posts: 51
Joined: Tue Jul 11, 2006 2:11 pm