My work includes finding the QR decomposition of a p-cyclic matrix M. It is square, L blocks high, L blocks wide, and each block is N by N. Each block column has two nonzero blocks on and directly below the diagonal, like so

- Code: Select all
`M11 M1L`

M21 M22

M32 M33

... ...

MLL-1 MLL

I am using a Block Orthogonal Factorization method to find the QR but want to compare it to DGEQRF in terms of time and speed. The following code shows this:

- Code: Select all
`double * getBlockStart(double * A, int N, int L, int i, int j)`

{

return A + N * L * N * (j-1) + N * (i-1);

}

void bruteQR(double * M, int N)

{

double * tau = new double[N];

double ini;

int lwork = -1, info;

dgeqrf_(&N, &N, M, &N, tau, &ini, &lwork, &info);

lwork = ini;

double * work = new double[lwork];

dgeqrf_(&N, &N, M, &N, tau, work, &lwork, &info);

free(work);

free(tau);

}

void bsof(double * M, int N, int L, double * taus)

{

int NL = N*L, N2 = 2 * N;

double ini;

int lwork = -1, info;

//Obtain the optimal value for lwork, does no work

dgeqrf_(&N2, &N, getBlockStart(M, N, L, 1, 1), &NL, taus, &ini, &lwork, &info);

lwork = ini;

double * work = new double[lwork];

for(int i = 1; i <= L-2; i++)

{

dgeqrf_(&N2, &N, getBlockStart(M, N, L, i, i), &NL, taus+((i-1)*N), work, &lwork, &info);

dormqr_("L","T",&N2,&N,&N,getBlockStart(M,N,L,i,i),&NL,taus+((i-1)*N),getBlockStart(M,N,L,i,i+1),&NL,work,&lwork,&info);

dormqr_("L","T",&N2,&N,&N,getBlockStart(M,N,L,i,i),&NL,taus+((i-1)*N),getBlockStart(M,N,L,i,L),&NL,work,&lwork,&info);

}

free(work);

lwork = -1;

dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), &ini, &lwork, &info);

lwork = ini;

double * work2 = new double[lwork];

dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), work2, &lwork, &info);

free(work2);

}

Now my problem is that although BSOF always has a better timing than DGEQRF, as L increases and the amount of zeroes in M increases, DGEQRF gets much much faster in terms of GFlop/s. Below are results from a test where the size of M is constant at 10,000 and L grows as N decreases. If DGEQRF were unaffected by structure, its speed and execution time would be the same for each test, but it is not. So my question is why is DGEQRF going so fast? My theory is that there is some heuristic which skips some amount of flops when it sees some formation of zeroes, making my flop count incorrect and leading to a bad GFlop rate. But I have no idea how/where this is being done.

- Code: Select all
`N L QRF t QRF GF BSOF t BSOF GF`

2500 4 5.97 223.53 5.93 108.95

1250 8 3.24 411.16 2.11 94.88

1000 10 2.82 473.48 1.45 92.05

500 20 2.06 647.74 0.48 74.51

400 25 1.89 704.28 0.37 63.74

250 40 1.69 787.52 0.17 53.88

200 50 1.6 832.28 0.13 44.94

125 80 1.51 884.11 0.09 26.06

100 100 1.45 917.95 0.08 20.12

80 125 1.38 967.95 0.07 14.78

50 200 1.35 989.34 0.05 8.08

40 250 1.31 1014.52 0.04 6.48

25 400 1.3 1022.98 0.03 3.16

20 500 1.29 1035.39 0.03 2.05

10 1000 1.28 1044.39 0.01 1.23

8 1250 1.27 1046.45 0.01 0.92

4 2500 1.25 1064.49 0.01 0.28

Thanks!