Hi,
The tiled LU and QR algorithms perform more floating-point operations than the standard LAPACK implementations, and inner blocking is used to minimise this overhead [1]. As I understand it, this means I cannot use the LAPACK flop counts from http://www.netlib.org/lapack/lawns/lawn41.ps for these operations. If that is correct, could anyone please tell me the total number of operations (flop count) for each of the block operations below?
DGETRF
DGESSM
DTSTRF
DSSSSM
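For context, here is a minimal sketch of the kind of count I mean. It tallies the divisions, multiplications and additions of a textbook unblocked, right-looking LU elimination on an m x n block, ignoring pivot search and row swaps. It is not the PLASMA kernel code and does not account for the extra flops introduced by tiling or inner blocking. The function name `lu_flops` is my own, used only for illustration.

```python
def lu_flops(m, n):
    """Count flops of a textbook unblocked LU elimination on an m x n block.

    Assumption: right-looking elimination, pivot search and row swaps
    are not counted as floating-point operations.
    """
    divs = muls = adds = 0
    for k in range(min(m, n)):
        rows_below = m - k - 1   # entries below the pivot in column k
        cols_right = n - k - 1   # columns to the right of column k
        divs += rows_below                 # scale column k by the pivot
        muls += rows_below * cols_right    # rank-1 update: multiplications
        adds += rows_below * cols_right    # rank-1 update: subtractions
    return divs + muls + adds


if __name__ == "__main__":
    for b in (2, 4, 200):
        print(b, lu_flops(b, b))
```

For a square n x n block this count agrees with the closed form (4n^3 - 3n^2 - n)/6, i.e. roughly 2n^3/3. What I am looking for is the equivalent count for each of the four tile kernels listed above.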
Thank you very much.
[1] Buttari et al., "A class of parallel tiled linear algebra algorithms for multicore architectures".
