Hi,
I'm trying to write a basic sequential program which perform a tiled LU factorization
for pedagogical pupose. I took a look at the following tiled algorithm in some papers :
(n x n blocks matrix)
for (k=0; k<n; k++)
___dgetrf(A(k,k), L(k,k), U(k,k), P(k,k))
___for (j=k+1; j<n; j++)
______dgessm(A(k,j), L(k,k), P(k,k), U(k,j))
___for (i=k+1; i<n; i++)
______dtstrf(U(k,k), A(i,k), P(i,k))
______for (j=k+1; j<n; j++)
_________dssssm(U(k,j), A(i,j), L(i,k), P(i,k))
I tried to implement this algo using the 'dgetrf', 'dgessm', 'dtstrf' and 'dssssm'
implemented in the 'core_blas' section of the plasma library, however after many tries,
I obtained incorrect results. I suspect i'm passing wrong arguments since some arguments
are not detailed in the documentation or perhaps wrong matrix storage format (BDL).
If we take a concrete and simple example.
We have a 16x16 matrix (n=16):
double * a = {
1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241
2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242
3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243
4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244
5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245
6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246
7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247
8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248
9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249
10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250
11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251
12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252
13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253
14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254
15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255,
16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256 }
If we divide this matrix into 2x2 blocks using the CCRB layout, we obtains the following 4 blocks:
double * a00 = {
1, 17, 33, 49, 65, 81, 97, 113,
2, 18, 34, 50, 66, 82, 98, 114,
3, 19, 35, 51, 67, 83, 99, 115,
4, 20, 36, 52, 68, 84, 100, 116
5, 21, 37, 53, 69, 85, 101, 117
6, 22, 38, 54, 70, 86, 102, 118
7, 23, 39, 55, 71, 87, 103, 119
8, 24, 40, 56, 72, 88, 104, 120 }
double * a10 = {
9, 25, 41, 57, 73, 89, 105, 121
10, 26, 42, 58, 74, 90, 106, 122,
11, 27, 43, 59, 75, 91, 107, 123,
12, 28, 44, 60, 76, 92, 108, 124,
13, 29, 45, 61, 77, 93, 109, 125,
14, 30, 46, 62, 78, 94, 110, 126,
15, 31, 47, 63, 79, 95, 111, 127,
16, 32, 48, 64, 80, 96, 112, 128 }
double * a01 = {
129, 145, 161, 177, 193, 209, 225, 241
130, 146, 162, 178, 194, 210, 226, 242
131, 147, 163, 179, 195, 211, 227, 243
132, 148, 164, 180, 196, 212, 228, 244
133, 149, 165, 181, 197, 213, 229, 245
134, 150, 166, 182, 198, 214, 230, 246
135, 151, 167, 183, 199, 215, 231, 247
136, 152, 168, 184, 200, 216, 232, 248 }
double * a11 = {
137, 153, 169, 185, 201, 217, 233, 249
138, 154, 170, 186, 202, 218, 234, 250
139, 155, 171, 187, 203, 219, 235, 251
140, 156, 172, 188, 204, 220, 236, 252
141, 157, 173, 189, 205, 221, 237, 253
142, 158, 174, 190, 206, 222, 238, 254
143, 159, 175, 191, 207, 223, 239, 255
144, 160, 176, 192, 208, 224, 240, 256 }
And the 4 pivoting vectors :
int * p00 = { 1, 2, 3, 4, 5, 6, 7, 8 };
int * p10 = { 1, 2, 3, 4, 5, 6, 7, 8 };
int * p01 = { 1, 2, 3, 4, 5, 6, 7, 8 };
int * p11 = { 1, 2, 3, 4, 5, 6, 7, 8 };
If we apply the non-tiled lapack "dgetrf" to the complete matrix 'a', we obtains:
double * a = {
241.00, 0, 0.27, 0.14, 0.67, 0.40, 0.60, 0.54, 0.07, 0.20, 0.34, 0.73, 0.47, 0.87, 0.93, 0.80,
242.00, 1.00, 0.73, 0.87, 0.33, 0.60, 0.40, 0.47, 0.93, 0.80, 0.67, 0.27, 0.53, 0.13, 0.07, 0.20,
243.00, 1.99, 0, 0, 0, 0, 0, 0, 0, 0.50, 0, 0, 0, 0, 0, 0,
244.00, 2.99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
245.00, 3.98, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0.50, 0, 0, 0,
246.00, 4.98, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
247.00, 5.98, 0, 0, 0, 0, 0, 0, 0, 0, 0.50, 0, 0, 0, 0, 0,
248.00, 6.97, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
249.00, 7.97, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
250, 8.96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
251.00, 9.96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
252.00, 10.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
253.00, 11.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
254.00, 12.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
255.00, 13.94, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
256.00, 14.94, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
Now, if we apply the tiled algorithm, we will perform the following operations:
- Algorithm:
dgetrf(a00, p00);
dgessm(a01, a00, p00);
dtstrf(a00, a10, p10);
dssssm(a01, a11, a10, p10);
dgetrf(a11, p11);
- implementation using the 'core_blas' library functions:
int n = 16;
int ib = 8; // does 'ib' affect the results ?
int k = n; // what's the 'K' parameter used in 'core_dgessm' and 'core_dssssm' ?
double * work = { ... }; // what's 'work' (used in 'dssssm') ? and what's it's dimensions ?
core_dgetrf(n, n, a00, n, p00, &info);
core_dgessm(n, n, k, ib, p00, a00, n, a00, n); // what's the K parameter ?
core_dtstrf(n, n, ib, n, a00, n, a10, n, a10, n, p10, work, ld_work, &info); // what's 'work' ? and what's it's dimensions ?
core_dssssm(n, n, n, n, k, ib, a01, n, a11, n, a10, n, a10, n, p10); // what's 'L2' parameter ?
core_dgetrf(n, n, a11, n, p11, &info);
Calling these functions whith these paramemters gives wrong results (different from lapack 'dgetrf' results), i tried
to change them but in a 'blind' way since i didn't figure out what's some parameters are for.
What's the correct arguments for my situation ? at least for this simple example ?
is there a step before which i should backup a tile then restore it ?
Finally and simply, what's wrong in this implementation ?
Thank you in advance for your responses.
