## Tiled QR operation count

Open forum for general discussions relating to PLASMA.

### Tiled QR operation count

Hi
I need to know the FLOPS count for TILED QR factorization. Below are the figures that I got from your two papers but they are slightly different. Could anyone please guide me which one of these shall I use as a measure of total number of operations for the PLASMA library routines.

DGEQT2: 2b^3
DLARFB: 3b^3
DTSQT2: 10/ 3b^3
DSSRFB: 5b^3

Reference: Parallel Tiled QR Factorization for Multicore Architectures --LAPACK Working Note # 190

SGEQRT: 4/3b^3
SLARFB: 2b^3
STSQRT: 2b^3
SSSRFB: 4b^3

Ref: QR Factorization for the CELL Processor -- LAPACK Working Note 201

Regards,
Gul
gul

Posts: 3
Joined: Mon Apr 18, 2011 5:46 pm

### Re: Tiled QR operation count

Hello Gul,

SGEQRT: 4/3b^3; SLARFB: 2b^3; STSQRT: 2b^3; SSSRFB: 4b^3
(as in LAWN 201).

Historical note: in LAWN 190, we did not use inner blocking. (So you should not the number in LAWN 190 any longer!) Since LAWN 191, we are using inner blocking. The flops count for these kernels with inner blocking are smaller. (So better.) The numbers above assume that b >> i where b is the tile size and i is the inner blocking parameter. (So there are some i b^2 terms that have been removed.)

If you want to take the inner block size (i) into account:
Code: Select all
`GEQRT: 4/3*b^3 + 2* b^2*i - 1/3*b*i^2ORMQRT: 2*b^3+3*b^2*i TSQRT: 2*b^3-1/2*b*i^2+3/2*b^2*i-b*i+b^2TSMQRT: 4*b^3 + b^2*i + 2*b^2`

Note
1) For i << b, you find again the numbers in LAWN 191, 201. For i = b, you find (!!almost!!) the numbers in LAWN 190. I am not going to understand the why I do not get exactly the number in 190.
2) So I am using PLASMA naming convention in the last part.

Cheers,
Julien.
Julien Langou

Posts: 4
Joined: Mon Apr 18, 2011 8:35 pm

### Re: Tiled QR operation count

Thank you very much for the answer.
I'll be grateful to get comments on my following two assumptions (my understanding so far)

1) We can use the same Flop count (SGEQRT: 4/3b^3; SLARFB: 2b^3; STSQRT: 2b^3; SSSRFB: 4b^3) for Tiled LU factorization as well.

2) The tiled QR factorization is represented as DAG, where nodes are the above mentioned kernels and edges are the dependencies between these kernels. The compute cost of the nodes is the flop count (mentioned above) and the communication volume between these kernels is O((N/NB)*(N/NB)*NB). I would like to know more about the communication cost estimation. e.g.
DGEQRT --> DORMQR (communication volume would be ? ). Assuming these are scheduled on different nodes.
DGEQRT --> DTSQRT
DTSQRT --> DSSMQR
DSSMQR --> DSSMQR, DTSQRT.

Any pointers to help me ascertain the correct communication cost would be highly appreciated.

Thanks
Gul
gul

Posts: 3
Joined: Mon Apr 18, 2011 5:46 pm

### Re: Tiled QR operation count

Code: Select all
`1) We can use the same Flop count (SGEQRT: 4/3b^3; SLARFB: 2b^3; STSQRT: 2b^3; SSSRFB: 4b^3) for Tiled LU factorization as well.`

No, you need to divide by a factor of two. (And the names of the kernels are changed.) The factor of two is the only interest in using LU. Otherwise nobody would bother with LU. I do no bother much with LU personnally.

Code: Select all
`2) The tiled QR factorization is represented as DAG, where nodes are the above mentioned kernels and edges are the dependencies between these kernels. The compute cost of the nodes is the flop count (mentioned above) and the communication volume between these kernels is O((N/NB)*(N/NB)*NB). I would like to know more about the communication cost estimation. e.g.DGEQRT --> DORMQR (communication volume would be ? ). Assuming these are scheduled on different nodes.DGEQRT --> DTSQRTDTSQRT --> DSSMQRDSSMQR --> DSSMQR, DTSQRT.`

OK I think I understand the question ...

So
For DGEQRT --> DORMQR, you need to send the "V" matrix (Householder reflectors) so that'll be (NB-1)*NB/2 (a lower triangular matrix without the diagonal).
For DTSQRT --> DSSMQR, you need to send the "V" matrix (Householder reflectors) so that'll be NB*NB.
For DGEQRT --> DTSQRT, you need to send the upper triangular matrix R, so that'll be (NB+1)*NB/2.
For DSSMQR --> DSSMQR, DTSQRT, you need to send a full tile that corresponds to a piece of R that is in going the process of being updated.

Well, need a picture on that one. Depends on your data distribution obviously. But if it's a kind of 2D block cyclic then something like this.

Julien.
Julien Langou

Posts: 4
Joined: Mon Apr 18, 2011 8:35 pm

### Re: Tiled QR operation count

Thank you Julien, It was so nice of you.

Regards.
/Gul
gul

Posts: 3
Joined: Mon Apr 18, 2011 5:46 pm