Here are some run times (wall-clock seconds):
5000x5000 matrix solve via sgesv on a Core i7:
ATLAS, 1 thread - 5.5 s
ATLAS, auto threads - 2.12 s
PLASMA+ATLAS with optimal block size, 1 thread - 15.27 s
PLASMA+ATLAS with optimal block size, 4 threads - 3.93 s
10000x10000:
ATLAS, auto threads - 12.92 s
PLASMA+ATLAS with optimal block size, 4 threads - 30.78 s
MKL, auto threads - 8.62 s
PLASMA+MKL (MKL_NUM_THREADS=1), 4 threads - 12.07 s
Perhaps I'm just not going to see big gains from PLASMA unless I run on a many-core machine. That would match the various graphs I've seen, which always show machines with 16-64 cores rather than 4-8.
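
To make the comparison easier to read, here is a small sketch that only divides the timings listed above by one another; nothing in it is newly measured. The variable names are made up for illustration. The thread count picked in "auto" mode isn't recorded above, so per-thread efficiency is computed only for the 4-thread PLASMA run.

```python
# Wall-clock times in seconds, copied from the runs listed above.
# All names below are illustrative, not part of any library.
t5k = {
    "atlas_1": 5.5,          # ATLAS, 1 thread
    "atlas_auto": 2.12,      # ATLAS, auto threads
    "plasma_atlas_1": 15.27, # PLASMA+ATLAS, 1 thread
    "plasma_atlas_4": 3.93,  # PLASMA+ATLAS, 4 threads
}
t10k = {
    "atlas_auto": 12.92,     # ATLAS, auto threads
    "plasma_atlas_4": 30.78, # PLASMA+ATLAS, 4 threads
    "mkl_auto": 8.62,        # MKL, auto threads
    "plasma_mkl_4": 12.07,   # PLASMA+MKL, 4 threads
}


def ratio(slow, fast):
    """How many times longer `slow` takes than `fast`."""
    return slow / fast


# 5000x5000: how well each library scales from 1 thread to its threaded run.
atlas_scaling = ratio(t5k["atlas_1"], t5k["atlas_auto"])
plasma_scaling = ratio(t5k["plasma_atlas_1"], t5k["plasma_atlas_4"])
plasma_efficiency = plasma_scaling / 4  # fraction of ideal 4x speedup

# How far the threaded PLASMA runs trail the plain threaded library.
plasma_vs_atlas_5k = ratio(t5k["plasma_atlas_4"], t5k["atlas_auto"])
plasma_vs_atlas_10k = ratio(t10k["plasma_atlas_4"], t10k["atlas_auto"])
plasma_vs_mkl_10k = ratio(t10k["plasma_mkl_4"], t10k["mkl_auto"])

print(f"ATLAS 1 thread -> auto, 5000:        {atlas_scaling:.2f}x")
print(f"PLASMA+ATLAS 1 -> 4 threads, 5000:   {plasma_scaling:.2f}x "
      f"({plasma_efficiency:.0%} of ideal)")
print(f"PLASMA+ATLAS / ATLAS auto, 5000:     {plasma_vs_atlas_5k:.2f}x slower")
print(f"PLASMA+ATLAS / ATLAS auto, 10000:    {plasma_vs_atlas_10k:.2f}x slower")
print(f"PLASMA+MKL / MKL auto, 10000:        {plasma_vs_mkl_10k:.2f}x slower")
```

Going by these ratios, PLASMA itself scales close to linearly from 1 to 4 threads on the 5000x5000 case; the gap to plain ATLAS comes from the much slower single-thread starting point (15.27 s versus 5.5 s).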
