I have implemented the procedure you described.
For the tiled storage, an additional (n/2)*(nb-2) numbers have to be stored, where n is the matrix dimension and nb is the tile width.
The extra space is needed because L11 and L22 have to be separated by a diagonal of width nb-2: both of those matrices have a meaningful diagonal, and this diagonal has to coincide with the diagonal of a tile in the PLASMA tiled format.
In practice, this additional space is negligible compared with the memory needed for the matrix itself.
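To put the overhead in perspective, here is a rough sketch that evaluates the (n/2)*(nb-2) formula and compares it with the size of a full n-by-n matrix. The function names are made up for illustration, the tile width nb = 200 is only an assumed value (I did not state the one I used above), n is assumed to be even, and comparing against a full n*n array is also an assumption about how the matrix is stored.

```python
def extra_elements(n: int, nb: int) -> int:
    """Additional numbers stored for the tiled layout: (n/2)*(nb-2).

    Assumes n is even, so n // 2 is exact.
    """
    return (n // 2) * (nb - 2)


def overhead_ratio(n: int, nb: int) -> float:
    """Extra storage relative to a full n-by-n matrix (assumed baseline)."""
    return extra_elements(n, nb) / (n * n)


if __name__ == "__main__":
    nb = 200  # assumed tile width, for illustration only
    for n in (3200, 16000):
        extra = extra_elements(n, nb)
        ratio = overhead_ratio(n, nb)
        print(f"n = {n}: {extra} extra numbers, {ratio:.2%} of n*n")
```

Under these assumptions the ratio simplifies to (nb-2)/(2n), so the relative overhead shrinks as the matrix grows.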
This implementation takes 10% more time for a very small matrix with n = 3,200.
For n = 16,000 there is no difference in time.
I tested it on a machine with 8 cores.
Thanks for the suggestion!!