## About Tiled LU Factorization

Open forum for general discussions relating to PLASMA.

### About Tiled LU Factorization

Hi,

I'm trying to write a basic sequential program which perform a tiled LU factorization
for pedagogical pupose. I took a look at the following tiled algorithm in some papers :

(n x n blocks matrix)

for (k=0; k<n; k++)
___dgetrf(A(k,k), L(k,k), U(k,k), P(k,k))
___for (j=k+1; j<n; j++)
______dgessm(A(k,j), L(k,k), P(k,k), U(k,j))
___for (i=k+1; i<n; i++)
______dtstrf(U(k,k), A(i,k), P(i,k))
______for (j=k+1; j<n; j++)
_________dssssm(U(k,j), A(i,j), L(i,k), P(i,k))

I tried to implement this algo using the 'dgetrf', 'dgessm', 'dtstrf' and 'dssssm'
implemented in the 'core_blas' section of the plasma library, however after many tries,
I obtained incorrect results. I suspect i'm passing wrong arguments since some arguments
are not detailed in the documentation or perhaps wrong matrix storage format (BDL).

If we take a concrete and simple example.

We have a 16x16 matrix (n=16):

double * a = {
1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241
2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242
3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243
4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244
5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245
6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246
7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247
8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248
9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249
10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250
11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251
12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252
13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253
14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254
15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255,
16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256 }

If we divide this matrix into 2x2 blocks using the CCRB layout, we obtains the following 4 blocks:

double * a00 = {
1, 17, 33, 49, 65, 81, 97, 113,
2, 18, 34, 50, 66, 82, 98, 114,
3, 19, 35, 51, 67, 83, 99, 115,
4, 20, 36, 52, 68, 84, 100, 116
5, 21, 37, 53, 69, 85, 101, 117
6, 22, 38, 54, 70, 86, 102, 118
7, 23, 39, 55, 71, 87, 103, 119
8, 24, 40, 56, 72, 88, 104, 120 }

double * a10 = {
9, 25, 41, 57, 73, 89, 105, 121
10, 26, 42, 58, 74, 90, 106, 122,
11, 27, 43, 59, 75, 91, 107, 123,
12, 28, 44, 60, 76, 92, 108, 124,
13, 29, 45, 61, 77, 93, 109, 125,
14, 30, 46, 62, 78, 94, 110, 126,
15, 31, 47, 63, 79, 95, 111, 127,
16, 32, 48, 64, 80, 96, 112, 128 }

double * a01 = {
129, 145, 161, 177, 193, 209, 225, 241
130, 146, 162, 178, 194, 210, 226, 242
131, 147, 163, 179, 195, 211, 227, 243
132, 148, 164, 180, 196, 212, 228, 244
133, 149, 165, 181, 197, 213, 229, 245
134, 150, 166, 182, 198, 214, 230, 246
135, 151, 167, 183, 199, 215, 231, 247
136, 152, 168, 184, 200, 216, 232, 248 }

double * a11 = {
137, 153, 169, 185, 201, 217, 233, 249
138, 154, 170, 186, 202, 218, 234, 250
139, 155, 171, 187, 203, 219, 235, 251
140, 156, 172, 188, 204, 220, 236, 252
141, 157, 173, 189, 205, 221, 237, 253
142, 158, 174, 190, 206, 222, 238, 254
143, 159, 175, 191, 207, 223, 239, 255
144, 160, 176, 192, 208, 224, 240, 256 }

And the 4 pivoting vectors :

int * p00 = { 1, 2, 3, 4, 5, 6, 7, 8 };
int * p10 = { 1, 2, 3, 4, 5, 6, 7, 8 };
int * p01 = { 1, 2, 3, 4, 5, 6, 7, 8 };
int * p11 = { 1, 2, 3, 4, 5, 6, 7, 8 };

If we apply the non-tiled lapack "dgetrf" to the complete matrix 'a', we obtains:

double * a = {
241.00, 0, 0.27, 0.14, 0.67, 0.40, 0.60, 0.54, 0.07, 0.20, 0.34, 0.73, 0.47, 0.87, 0.93, 0.80,
242.00, 1.00, 0.73, 0.87, 0.33, 0.60, 0.40, 0.47, 0.93, 0.80, 0.67, 0.27, 0.53, 0.13, 0.07, 0.20,
243.00, 1.99, 0, 0, 0, 0, 0, 0, 0, 0.50, 0, 0, 0, 0, 0, 0,
244.00, 2.99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
245.00, 3.98, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0.50, 0, 0, 0,
246.00, 4.98, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
247.00, 5.98, 0, 0, 0, 0, 0, 0, 0, 0, 0.50, 0, 0, 0, 0, 0,
248.00, 6.97, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
249.00, 7.97, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
250, 8.96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
251.00, 9.96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
252.00, 10.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
253.00, 11.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
254.00, 12.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
255.00, 13.94, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
256.00, 14.94, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }

Now, if we apply the tiled algorithm, we will perform the following operations:

- Algorithm:

dgetrf(a00, p00);
dgessm(a01, a00, p00);
dtstrf(a00, a10, p10);
dssssm(a01, a11, a10, p10);
dgetrf(a11, p11);

- implementation using the 'core_blas' library functions:

int n = 16;
int ib = 8; // does 'ib' affect the results ?
int k = n; // what's the 'K' parameter used in 'core_dgessm' and 'core_dssssm' ?
double * work = { ... }; // what's 'work' (used in 'dssssm') ? and what's it's dimensions ?

core_dgetrf(n, n, a00, n, p00, &info);
core_dgessm(n, n, k, ib, p00, a00, n, a00, n); // what's the K parameter ?
core_dtstrf(n, n, ib, n, a00, n, a10, n, a10, n, p10, work, ld_work, &info); // what's 'work' ? and what's it's dimensions ?
core_dssssm(n, n, n, n, k, ib, a01, n, a11, n, a10, n, a10, n, p10); // what's 'L2' parameter ?
core_dgetrf(n, n, a11, n, p11, &info);

Calling these functions whith these paramemters gives wrong results (different from lapack 'dgetrf' results), i tried
to change them but in a 'blind' way since i didn't figure out what's some parameters are for.

What's the correct arguments for my situation ? at least for this simple example ?
is there a step before which i should backup a tile then restore it ?
Finally and simply, what's wrong in this implementation ?

Thank you in advance for your responses.
collider

Posts: 2
Joined: Thu Nov 29, 2012 6:42 am

### Re: About Tiled LU Factorization

It's all correct.
Make sure that your matrix is really laid out in tiles.
If you're not sure that you did it right, there is a set of PLASMA function to do the conversion back and forth.
Now, here is the catch.
PLASMA tile LU does not perform the same factorization as LAPACK LU.
The L and U factors are different, and so is the pivot vector.
So, comparing the factorization against the LAPACK factorization is not the right correctness test.
Try solving a linear system instead, and look if you are getting the same solution vector or small enough residual.
Just as a side note, PLASMA also has a fully LAPACK-compliant LU factorization.
This is actually what you will get if you call PLASMA_dgetrf.
Hope it helps.
Jakub

Posts: 79
Joined: Wed May 13, 2009 1:27 pm

### Re: About Tiled LU Factorization

First of all, thank you for your quick response. I will try to verify my code through linear systems resolution.
However, i have 2 little questions:

1) According to a paper, dssssm( Ukj, Aij, Lik, Pik ) performs: | Ukj, Aij | , Lik, Pik --> | Ukj, Aij | = Lik^-1 Pik | Ukj, Aij |
However, the 'core_dssssm' routine take as arguments A1, A2, L1 and L2. I understand what the first 3 are for:
A1 -> Ukj
A2 -> Aij
L1 -> Lik
But, what's L2 for ? we are supposed to use only L(i,k) , right ?
L2 -> ?

2) 'double * work' is used in 'core_dtstrf()'. in the documentation, it's defined as in/out but without any description.
What should we put/get in 'work' array, what's its dimensions and what is it used for ?

Thank You
collider

Posts: 2
Joined: Thu Nov 29, 2012 6:42 am

### Re: About Tiled LU Factorization

It's like that, because the tile LU produces auxiliary data in the course the factorization.
This is what is stored in the workspace and this is why there is L1 and L2.
This data has to be passed from the factorization routine to the solve routine.
You don't have to put anything in the workspace, just provide the space.
PLASMA provides routines for workspace allocation.
Look at:
/examples/example_dgesv.c
/examples/example_dgetrs.c
Jakub

Posts: 79
Joined: Wed May 13, 2009 1:27 pm

### Re: About Tiled LU Factorization

Valgrind indicates one block leak on example_dgesv.c
It´s concern PLASMA_Alloc_Workspace.

Example without Workspace are well.

openwiki

Posts: 2
Joined: Mon Mar 11, 2013 8:21 pm

### Re: About Tiled LU Factorization

Example_dgesv.c has one block memory leak in
PLASMA_Alloc_Workspace
Example without Workspace are quiet good ?