## DGEQRF performance on block p-cyclic matrix

Post here if you have a question about LAPACK or ScaLAPACK algorithm or data format

### DGEQRF performance on block p-cyclic matrix

Hello, first time poster here.

My work includes finding the QR decomposition of a p-cyclic matrix M. It is square, L blocks high, L blocks wide, and each block is N by N. Each block column has two nonzero blocks on and directly below the diagonal, like so
Code: Select all
`M11               M1LM21 M22    M32 M33        ... ...            MLL-1 MLL`

I am using a Block Orthogonal Factorization method to find the QR but want to compare it to DGEQRF in terms of time and speed. The following code shows this:
Code: Select all
`double * getBlockStart(double * A, int N, int L, int i, int j){        return A + N * L * N * (j-1) + N * (i-1);}void bruteQR(double * M, int N){   double * tau = new double[N];           double ini;        int lwork = -1, info;   dgeqrf_(&N, &N, M, &N, tau, &ini, &lwork, &info);   lwork = ini;   double * work = new double[lwork];   dgeqrf_(&N, &N, M, &N, tau, work, &lwork, &info);      free(work);   free(tau);}void bsof(double * M, int N, int L, double * taus){        int NL = N*L, N2 = 2 * N;        double ini;        int lwork = -1, info;        //Obtain the optimal value for lwork, does no work        dgeqrf_(&N2, &N, getBlockStart(M, N, L, 1, 1), &NL, taus, &ini, &lwork, &info);        lwork = ini;        double * work = new double[lwork];        for(int i = 1; i <= L-2; i++)        {                dgeqrf_(&N2, &N, getBlockStart(M, N, L, i, i), &NL, taus+((i-1)*N), work, &lwork, &info);                dormqr_("L","T",&N2,&N,&N,getBlockStart(M,N,L,i,i),&NL,taus+((i-1)*N),getBlockStart(M,N,L,i,i+1),&NL,work,&lwork,&info);                dormqr_("L","T",&N2,&N,&N,getBlockStart(M,N,L,i,i),&NL,taus+((i-1)*N),getBlockStart(M,N,L,i,L),&NL,work,&lwork,&info);        }        free(work);        lwork = -1;        dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), &ini, &lwork, &info);        lwork = ini;        double * work2 = new double[lwork];        dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), work2, &lwork, &info);        free(work2);}`

Now my problem is that although BSOF always has a better timing than DGEQRF, as L increases and the amount of zeroes in M increases, DGEQRF gets much much faster in terms of GFlop/s. Below are results from a test where the size of M is constant at 10,000 and L grows as N decreases. If DGEQRF were unaffected by structure, its speed and execution time would be the same for each test, but it is not. So my question is why is DGEQRF going so fast? My theory is that there is some heuristic which skips some amount of flops when it sees some formation of zeroes, making my flop count incorrect and leading to a bad GFlop rate. But I have no idea how/where this is being done.

Code: Select all
`N      L   QRF t   QRF GF   BSOF t   BSOF GF2500   4   5.97   223.53   5.93   108.951250   8   3.24   411.16   2.11   94.881000   10   2.82   473.48   1.45   92.05500   20   2.06   647.74   0.48   74.51400   25   1.89   704.28   0.37   63.74250   40   1.69   787.52   0.17   53.88200   50   1.6   832.28   0.13   44.94125   80   1.51   884.11   0.09   26.06100   100   1.45   917.95   0.08   20.1280   125   1.38   967.95   0.07   14.7850   200   1.35   989.34   0.05   8.0840   250   1.31   1014.52   0.04   6.4825   400   1.3   1022.98   0.03   3.1620   500   1.29   1035.39   0.03   2.0510   1000   1.28   1044.39   0.01   1.238   1250   1.27   1046.45   0.01   0.924   2500   1.25   1064.49   0.01   0.28`

Thanks!
dontpullajeff

Posts: 1
Joined: Tue Nov 13, 2012 1:46 pm

### Re: DGEQRF performance on block p-cyclic matrix

Due to the large size of your matrices, DGEQRF is using the blocked code path, where instead of computing the QR factorization one column at a time, it computes a block of columns at a time. The steps in the block factorization can be seen in http://www.netlib.org/lapack/double/dgeqrf.f, and there are 3 main steps: (1) compute a QR factorization of the block of columns (tall skinny matrix), with Q represented as single column Householder reflections, (2) Combine the individual reflectors into a block reflector with a diagonal block T (in routine DLARFT). The details of the algorithm are not at all obvious from the code, and there does not seem to be much literature on the subject (I myself am interested in knowing more about it, but I have found this paper: http://www.sciencedirect.com/science/article/pii/0045782576900323. Finally (3), the block reflector is applied.

Throughout this process, there are many optimizations performed which exploit the zero structure of the intermediate factors. In DLARFT, the code checks to see if any Tau's are zero (the scale factor of the Householder reflector), and if so, saves some time on filling in T. This will occur when the diagonal blocks are low rank, but only saves a small amount of time since the size of the diagonal block is relatively small. In DLARFB, the code checks for reflectors with trailing zeros, and saves time there. This happens if your matrix was roughly upper triangular to begin with so that the panel QR in step (1) has trailing zeros, which is true for you, thus saving a lot of time. So in this sense, what DGEQRF is doing is not a whole lot different from your BSOF routine, except it has to explicitly check for zero structure. In the limit of large block sizes, the trailing zero estimate returned by ILADLR becomes exactly correct and makes the amount of work between the two methods roughly the same.
CyLith

Posts: 41
Joined: Sun Feb 08, 2009 7:23 am
Location: Stanford, CA

### Re: DGEQRF performance on block p-cyclic matrix

Hi,

here is a link to (publicly available) paper by Schreiber and Van Loan ('87)
regarding the theory of blocked I - V*T*V**T representation of QR factorization:

http://ecommons.library.cornell.edu/handle/1813/6704

Hope this helps.

Regards!
--
Pawel Tomulik
ptomulik

Posts: 1
Joined: Wed Mar 20, 2013 1:08 pm