- Code: Select all
`1) We can use the same Flop count (SGEQRT: 4/3b^3; SLARFB: 2b^3; STSQRT: 2b^3; SSSRFB: 4b^3) for Tiled LU factorization as well.`

No, you need to divide by a factor of two. (And the names of the kernels are changed.) The factor of two is the only interest in using LU. Otherwise nobody would bother with LU. I do no bother much with LU personnally.

- Code: Select all
`2) The tiled QR factorization is represented as DAG, where nodes are the above mentioned kernels and edges are the dependencies between these kernels. The compute cost of the nodes is the flop count (mentioned above) and the communication volume between these kernels is O((N/NB)*(N/NB)*NB). I would like to know more about the communication cost estimation. e.g.`

DGEQRT --> DORMQR (communication volume would be ? ). Assuming these are scheduled on different nodes.

DGEQRT --> DTSQRT

DTSQRT --> DSSMQR

DSSMQR --> DSSMQR, DTSQRT.

OK I think I understand the question ...

So

For DGEQRT --> DORMQR, you need to send the "V" matrix (Householder reflectors) so that'll be (NB-1)*NB/2 (a lower triangular matrix without the diagonal).

For DTSQRT --> DSSMQR, you need to send the "V" matrix (Householder reflectors) so that'll be NB*NB.

For DGEQRT --> DTSQRT, you need to send the upper triangular matrix R, so that'll be (NB+1)*NB/2.

For DSSMQR --> DSSMQR, DTSQRT, you need to send a full tile that corresponds to a piece of R that is in going the process of being updated.

Well, need a picture on that one. Depends on your data distribution obviously. But if it's a kind of 2D block cyclic then something like this.

I cannot help you much more than that at this point.

Julien.