Hi Patrick,
I see. Thanks for the explanation.
Question: how do you form the normal equations? You say: "construct a few thousand rows at a time and fold them into the normal equations matrix A^T A." How do you fold? Do you use SYRK or GEMM?
Here is another way to go. You could use the TS (Triangle on top of Square QR) kernel in LAPACK to do an out-of-core tall-and-skinny QR factorization.
The name of the subroutine is DTPQRT (introduced in LAPACK 3.4).
Note: The advantage of QR factorization is that QR factorization is more stable than normal equations since it does not square the condition number. You seem to struggle with positive definiteness of the normal equations, as you know, this indicates that the normal equations are numerically ill-conditioned.
So here what you could do. It requires to get your hands dirty but since you are already forming the normal equations by hand and constructing the matrix along the way.
Well, I think you have done the hardest
So: Split your matrix A in rows so that a block of rows fits in memory. So A is M-by-N. And you partition with A1 of size N-by-N, the first N rows of A, then A2 of size M2-by-N, the next M2 rows of A, etc. (Same as what you do for forming the normal equations.)
Make sure that N-by-N (for R) and Mi-by-N (for Ai) fit in main memory. (And there are some other smaller array.) (This is same as what you do for forming the normal equations.)
Catch: we want M1 >= N. This will not work if M1 < N. For good use of space, it is better to take M1 = N. So I suggest M1 = N. Then you can have M2, M3, M4 anything you want.
Then
- Code: Select all
// DO: Bring A1 from disk to main memory or construct A1, A1 is first N rows of A
DGEQRF( N, N, A1, LDA1 ... )
// DO: Keep A1. It contains the matrix "R". (Not the final one, but we will update in there.)
// DO: Bring A2 from disk to main memory or construct A2, A2 is next M2 rows of A
DTPQRF( M2, N, 0, NB, A1, LDA1, A2, LDA2, … )
// DO: You can now discard A2.
// DO: Bring A3 from disk to main memory or construct A3, A3 is next M3 rows of A
DTPQRF( M3, N, 0, NB, A1, LDA1, A3, LDA3, … )
// DO: You can now discard A3.
// Etc.
Note: NB is the algorithmic block size of the algorithm. We decided to have it part of the interface so you (the user) need to decide what you want NB to be. (Maybe take NB=100 to start and you can tune later.) In LAPACK, in general, NB is set by the ILAENV( ) function. So this kind of interface slightly differs from what is standard in LAPACK. We also decided to export the "T matrix" to the user, see the T, LDT arguments in the interface. This is useful for performing efficiently the updates. (This avoids to have to recompute T.) So the interface is a little different. You will not really need T but you need to give it. It will be a workspace that DTPQRT needs.
At the end of this algorithm, A1 should contain the R factor of the QR factorization. This is called "tall-skinny flat-tree QR factorization".
The number of FLOPS will be ( 2 * M * N^2 - 2/3 N^3 ) so about 2 * M * N^2 if M >> N. The cost for normal equations is ( M * N^2 + 1/3 N^3 ) so about M * N^2 if M >> N.
So this method will be double the number of FLOPS if if M >> N. (Note that if M=N, this is the same number of FLOPS.)
To deal with your right-hand side b, (because at the end of the day, you do not want a QR factorization but you want to solve a least squares problem,) there are various ways. I think the easiest is to append b to A and so you would always work on an M-by-(N+1) matrix all the time. I hope this makes sense. Pseudo-code below.
- Code: Select all
// DO: construct A1 and b1 so that [A1 b1] is N-by(N+1) matrix. Call [A1 b1] "A1b1".
DGEQRF( N, N+1, A1b1, LDA1 ... )
// DO: Keep A1b1. It contains the matrix "Rb". (Not the final one, but we will update in there.)
// DO: construct A2 and b2 so that [A2 b2] is M2-by(N+1) matrix. Call [A2 b2] "A2b2".
DTPQRF( M2, N+1, 0, NB, A1b1, LDA1, A2b2, LDA2, … )
// DO: You can now discard A2b2.
// DO: construct A3 and b3 so that [A3 b3] is M3-by(N+1) matrix. Call [A3 b3] "A3b3".
DTPQRF( M3, N+1, 0, NB, A1b1, LDA1, A3b3, LDA3, … )
// DO: You can now discard A3b3.
// Etc.
// SOLVE A1 x = b1 (where A1 is upper triangular)
DTRSV( 'U', 'N', 'N', N, A1b1(1,1), LDA1, A1b1(1,N+1), 1)
[/code]
The solution x is in A1b1(1:N,N+1).
I think this is worth a shot.
If I understand what you are doing, this is minimal change and you have already done the hardest.
Let us know if that turns out to work. (Or not.)
Cheers,
Julien.