DGEQRF performance on block p-cyclic matrix

Post here if you have a question about a LAPACK or ScaLAPACK algorithm or data format

DGEQRF performance on block p-cyclic matrix

Postby dontpullajeff » Tue Nov 13, 2012 2:11 pm

Hello, first-time poster here.

My work includes finding the QR decomposition of a p-cyclic matrix M. It is square, L blocks high and L blocks wide, and each block is N by N. Each block column has two nonzero blocks, on and directly below the diagonal, except the last, whose second nonzero block M1L sits in the top corner, like so:
Code: Select all
M11                  M1L
M21  M22
     M32  M33
          ...  ...
               ML,L-1  MLL
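For concreteness, a matrix with this sparsity pattern can be assembled in column-major (Fortran) order, which is what the LAPACK calls below assume; this is only a sketch with placeholder block values, not the matrix I actually use:

```cpp
#include <cstddef>
#include <vector>

// Fill an (N*L) x (N*L) column-major matrix with the block p-cyclic
// pattern: nonzero blocks (j,j) and (j+1,j) in block column j, plus the
// corner block (1,L). The value 1.0 just marks a nonzero entry.
std::vector<double> makePCyclic(int N, int L)
{
    const std::size_t n = std::size_t(N) * L;
    std::vector<double> M(n * n, 0.0);
    auto fillBlock = [&](int bi, int bj) {       // 1-based block indices
        for (int c = 0; c < N; ++c)
            for (int r = 0; r < N; ++r)
                M[(std::size_t(bj - 1) * N + c) * n + (bi - 1) * N + r] = 1.0;
    };
    for (int j = 1; j <= L; ++j) {
        fillBlock(j, j);                         // diagonal block
        if (j < L) fillBlock(j + 1, j);          // subdiagonal block
    }
    fillBlock(1, L);                             // corner block M1L
    return M;
}
```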

I am using a Block Orthogonal Factorization (BSOF) method to find the QR, but I want to compare it against plain DGEQRF in both execution time and flop rate. The following code shows both routines:
Code: Select all
double * getBlockStart(double * A, int N, int L, int i, int j)
{
        // Column-major storage with leading dimension N*L: block (i,j)
        // starts at column (j-1)*N, row (i-1)*N (1-based block indices).
        return A + N * L * N * (j-1) + N * (i-1);
}

void bruteQR(double * M, int N)
{
        double * tau = new double[N];
        double ini;
        int lwork = -1, info;
        // Workspace query: with lwork = -1, DGEQRF returns the optimal size in ini.
        dgeqrf_(&N, &N, M, &N, tau, &ini, &lwork, &info);
        lwork = (int)ini;
        double * work = new double[lwork];
        dgeqrf_(&N, &N, M, &N, tau, work, &lwork, &info);
        delete[] work;
        delete[] tau;
}

void bsof(double * M, int N, int L, double * taus)
{
        int NL = N*L, N2 = 2 * N;
        double ini;
        int lwork = -1, info;
        // Workspace query only; performs no factorization.
        dgeqrf_(&N2, &N, getBlockStart(M, N, L, 1, 1), &NL, taus, &ini, &lwork, &info);
        lwork = (int)ini;
        double * work = new double[lwork];

        // Factor the first L-2 panels, each 2N x N (diagonal plus subdiagonal block).
        for(int i = 1; i <= L-2; i++)
                dgeqrf_(&N2, &N, getBlockStart(M, N, L, i, i), &NL, taus+((i-1)*N), work, &lwork, &info);

        // The trailing 2N x 2N corner is factored as one dense block.
        lwork = -1;
        dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), &ini, &lwork, &info);
        lwork = (int)ini;
        double * work2 = new double[lwork];
        dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), work2, &lwork, &info);
        delete[] work2;
        delete[] work;
}


Now my problem is this: although BSOF always beats DGEQRF in execution time, as L increases (and with it the fraction of zeros in M), DGEQRF gets much faster in terms of GFlop/s. Below are results from a test where the order of M is held constant at 10,000 while L grows and N shrinks. If DGEQRF were unaffected by the matrix structure, its time and flop rate would be the same in every row, but they are not. So why is DGEQRF going so fast? My theory is that some heuristic skips flops when it detects a pattern of zeros, making my flop count for DGEQRF too high and its GFlop rate misleading, but I have no idea how or where this is done.

Code: Select all
N     L     QRF t(s)  QRF GF/s  BSOF t(s)  BSOF GF/s
2500  4     5.97      223.53    5.93       108.95
1250  8     3.24      411.16    2.11       94.88
1000  10    2.82      473.48    1.45       92.05
500   20    2.06      647.74    0.48       74.51
400   25    1.89      704.28    0.37       63.74
250   40    1.69      787.52    0.17       53.88
200   50    1.60      832.28    0.13       44.94
125   80    1.51      884.11    0.09       26.06
100   100   1.45      917.95    0.08       20.12
80    125   1.38      967.95    0.07       14.78
50    200   1.35      989.34    0.05       8.08
40    250   1.31      1014.52   0.04       6.48
25    400   1.30      1022.98   0.03       3.16
20    500   1.29      1035.39   0.03       2.05
10    1000  1.28      1044.39   0.01       1.23
8     1250  1.27      1046.45   0.01       0.92
4     2500  1.25      1064.49   0.01       0.28
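For reference, the GFlop/s columns come from flop counts along these lines; this sketch uses the standard Householder QR count 2n^2(m - n/3) for an m x n factorization, applied to the whole matrix for DGEQRF and per panel for BSOF (the function names are just for illustration):

```cpp
// Standard Householder QR flop count for an m x n factorization (m >= n).
double qrFlops(double m, double n) { return 2.0 * n * n * (m - n / 3.0); }

// Dense DGEQRF applied to the full (N*L) x (N*L) matrix, zeros and all.
double denseFlops(int N, int L)
{
    double n = double(N) * L;
    return qrFlops(n, n);
}

// BSOF: L-2 panels of size 2N x N, then one trailing 2N x 2N factorization.
double bsofFlops(int N, int L)
{
    return (L - 2) * qrFlops(2.0 * N, N) + qrFlops(2.0 * N, 2.0 * N);
}
```

For the N = 500, L = 20 row, for example, the dense count is over a hundred times the BSOF count, which is why the BSOF GFlop/s figures look so small despite the much shorter run times.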

Posts: 1
Joined: Tue Nov 13, 2012 1:46 pm

Re: DGEQRF performance on block p-cyclic matrix

Postby CyLith » Thu Nov 15, 2012 2:32 am

Due to the large size of your matrices, DGEQRF is using the blocked code path: instead of computing the QR factorization one column at a time, it computes a block of columns at a time. The steps can be seen in http://www.netlib.org/lapack/double/dgeqrf.f, and there are three main ones: (1) compute a QR factorization of the block of columns (a tall skinny panel), with Q represented as single-column Householder reflectors; (2) combine the individual reflectors into a block reflector with a triangular factor T (in routine DLARFT); the details of that algorithm are not at all obvious from the code, and there does not seem to be much literature on the subject (I am interested in knowing more about it myself, but I have found this paper: http://www.sciencedirect.com/science/article/pii/0045782576900323). Finally, (3) the block reflector is applied to the trailing matrix.

Throughout this process there are many optimizations that exploit the zero structure of the intermediate factors. In DLARFT, the code checks whether any of the tau values (the scale factors of the Householder reflectors) are zero and, if so, saves some time filling in T. This occurs when the diagonal blocks are low rank, but it saves only a little, since the diagonal block is relatively small. In DLARFB, the code checks for reflectors with trailing zeros, via ILADLR, and saves time there. That happens when the matrix is roughly upper triangular to begin with, so that the panel QR in step (1) produces reflectors with trailing zeros; this is true for your matrix, and it saves a lot of time. So in this sense DGEQRF is not doing anything very different from your BSOF routine, except that it has to discover the zero structure through explicit checks. In the limit of large block sizes, the trailing-zero count returned by ILADLR becomes exact and makes the amount of work in the two methods roughly the same.
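The trailing-zero check is easy to picture: ILADLR reports the last row of the panel that contains a nonzero, and DLARFB restricts the block-reflector update to that many rows. A rough sketch of the idea in C, not the actual LAPACK source:

```cpp
// Sketch of ILADLR's job: return the last nonzero row (1-based) of an
// m x n column-major matrix A with leading dimension lda. DLARFB can then
// apply the block reflector to rows 1..result only, which is where the
// savings on mostly-triangular panels come from.
int lastNonzeroRow(const double * A, int m, int n, int lda)
{
    for (int r = m - 1; r >= 0; --r)      // scan rows bottom-up
        for (int c = 0; c < n; ++c)
            if (A[c * lda + r] != 0.0)
                return r + 1;             // 1-based, as in LAPACK
    return 0;                             // all-zero matrix
}
```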
Posts: 41
Joined: Sun Feb 08, 2009 7:23 am
Location: Stanford, CA

Re: DGEQRF performance on block p-cyclic matrix

Postby ptomulik » Wed Mar 20, 2013 1:17 pm


Here is a link to the (publicly available) paper by Schreiber and Van Loan ('87)
on the theory of the blocked I - V*T*V**T representation of the QR factorization:


Hope this helps.
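To make the representation concrete: for two reflectors H_i = I - tau_i*v_i*v_i**T, the product H1*H2 equals I - V*T*V**T with V = [v1 v2] and upper-triangular T given by T(1,1) = tau1, T(2,2) = tau2, T(1,2) = -tau1*tau2*(v1**T*v2). A small numerical check of this identity (arbitrary vectors and taus, not taken from a real factorization):

```cpp
#include <cmath>

// Verify H1*H2 == I - V*T*V^T for two 3-vectors, elementwise.
bool checkWY(const double v1[3], const double v2[3], double t1, double t2)
{
    double H1[3][3], H2[3][3], P[3][3];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double eye = (i == j) ? 1.0 : 0.0;
            H1[i][j] = eye - t1 * v1[i] * v1[j];   // H1 = I - tau1*v1*v1^T
            H2[i][j] = eye - t2 * v2[i] * v2[j];   // H2 = I - tau2*v2*v2^T
        }
    for (int i = 0; i < 3; ++i)                    // P = H1 * H2
        for (int j = 0; j < 3; ++j) {
            P[i][j] = 0.0;
            for (int k = 0; k < 3; ++k) P[i][j] += H1[i][k] * H2[k][j];
        }

    double dot = 0.0;
    for (int i = 0; i < 3; ++i) dot += v1[i] * v2[i];
    const double t12 = -t1 * t2 * dot;             // T(1,2) from the recurrence

    for (int i = 0; i < 3; ++i)                    // compare with I - V*T*V^T
        for (int j = 0; j < 3; ++j) {
            double eye = (i == j) ? 1.0 : 0.0;
            double wy = eye - t1 * v1[i] * v1[j] - t12 * v1[i] * v2[j]
                      - t2 * v2[i] * v2[j];
            if (std::fabs(P[i][j] - wy) > 1e-12) return false;
        }
    return true;
}
```

DLARFT builds T one column at a time with exactly this kind of recurrence, which is why a zero tau lets it skip a column.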

Pawel Tomulik
Posts: 1
Joined: Wed Mar 20, 2013 1:08 pm
