The tiled LU and QR algorithms cost more floating-point operations than the standard LAPACK implementations, and inner blocking is used to minimise this overhead. This means that I cannot use the LAPACK flop counts from http://www.netlib.org/lapack/lawns/lawn41.ps for these operations. If this is true, could anyone please tell me how to measure the total number of operations (flop count) for each of the block operations below?
Thank you very much.
A class of parallel tiled linear algebra algorithms for multicore architectures, by Buttari et al.