First off, let me say I enjoyed the NVIDIA webinar yesterday. It got me motivated to try out PLASMA/MAGMA.
Two issues thus far...
1) I have PLASMA compiled on 3 machines: a Core i5 running Ubuntu 10.04, a Core i7 running RHEL 5, and a dual quad-core Xeon running RHEL 5. I built PLASMA using gfortran/gcc 4.4 with the -O3 flag, linking in ATLAS BLAS (unthreaded). My main interest is in sgesv performance (called from Fortran 90). When I compare the wall-clock time of PLASMA's sgesv against the threaded sgesv from ATLAS, ATLAS usually wins by a factor of 2x, sometimes even 3x. I noticed in the webinar and in the user guide that comparisons are usually done on many-core machines. Is this performance expected in my case? If not, what tricks can I play to get PLASMA running optimally? I have played with setting the tile size and inner block sizes to no avail.
2) I understand (I think) that PLASMA operates on tiled matrices. Would it be better to directly build the matrices using tiled storage? Does it cause a performance drag if I fill my matrix in the "normal" way as compared to directly building a tiled matrix? I ask this because I notice a couple of things. One is that the memory usage goes way up upon entry to plasma_sgesv (presumably due to the plasma_shared_alloc calls). The other is that I notice there are then two calls to plasma_lapack_to_tile and plasma_tile_to_lapack. Is this where my speed hit is coming from in question 1?
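To make concrete what I mean by building the matrix directly in tiled storage, here is a minimal sketch of how I picture the layout conversion. This is not PLASMA's code: the function names colmajor_to_tiles and tile_index are made up for illustration, and I am assuming square nb-by-nb tiles, each stored contiguously in column-major order, with the tiles themselves ordered column by column, and a matrix size that is a multiple of nb.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PLASMA.
 * Copies an n-by-n column-major matrix `a` (leading dimension lda) into
 * `t`, where each nb-by-nb tile is contiguous and stored column-major,
 * and tiles are ordered column by column. Assumes n is a multiple of nb. */
void colmajor_to_tiles(const float *a, size_t lda, size_t n, size_t nb,
                       float *t)
{
    assert(nb > 0 && n % nb == 0);
    size_t nt = n / nb;                       /* tiles per row/column */
    for (size_t tj = 0; tj < nt; tj++) {      /* tile column */
        for (size_t ti = 0; ti < nt; ti++) {  /* tile row */
            float *tile = t + (tj * nt + ti) * nb * nb;
            for (size_t j = 0; j < nb; j++)
                for (size_t i = 0; i < nb; i++)
                    tile[j * nb + i] = a[(tj * nb + j) * lda + (ti * nb + i)];
        }
    }
}

/* Hypothetical helper: offset of element (i, j) in the tiled array,
 * so a matrix could be filled in tiled form without the extra copy. */
size_t tile_index(size_t i, size_t j, size_t n, size_t nb)
{
    assert(nb > 0 && n % nb == 0);
    size_t nt = n / nb;
    return ((j / nb) * nt + (i / nb)) * nb * nb + (j % nb) * nb + (i % nb);
}
```

If that picture is right, I could write element (i, j) straight to t[tile_index(i, j, n, nb)] while assembling the matrix and skip the conversion copy, which is what I am asking about.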
Thanks.
