I have installed PLASMA and MAGMA on the Intel MIC. MAGMA performs very well, better than MKL, but PLASMA is relatively poor and significantly slower than MKL. PLASMA is built explicitly for native use on the MIC itself, not for offloading from the host. I believe I have linked PLASMA against the MKL BLAS and have set MKL_NUM_THREADS and OMP_NUM_THREADS to 1, but I have not tried any changes to the blocking sizes. I know that for MAGMA, setting the thread affinity with KMP_AFFINITY is crucial for best performance (a factor of 2 difference), so I wondered whether the same might be true of PLASMA. However, some aspect of the PLASMA implementation seems to result in KMP_AFFINITY being ignored: there is no output of thread/node binding even when verbose is specified. I know there is some reference to thread affinity within the code, specifically the sched_setaffinity function in plasmaos.c. However, I do not know the implications of this, nor whether it would be possible, or even desirable, to enable KMP_AFFINITY.
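One check I could run to see where the PLASMA threads actually end up is to read each thread's allowed-CPU list from /proc while the program is running. The following is only a rough sketch of that idea, not anything taken from PLASMA. It assumes that the Linux on the card exposes /proc/<pid>/task/<tid>/status with a Cpus_allowed_list field and that a Python 3 interpreter is available there; the function names parse_cpus_allowed and thread_affinities are my own.

```python
import os
import sys


def parse_cpus_allowed(status_text):
    """Return the Cpus_allowed_list value from the text of a /proc status file, or None."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return line.split(":", 1)[1].strip()
    return None


def thread_affinities(pid):
    """Map each thread ID of `pid` to its allowed-CPU list, read from /proc (Linux only)."""
    task_dir = "/proc/%d/task" % pid
    result = {}
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "status")) as f:
                result[int(tid)] = parse_cpus_allowed(f.read())
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return result


if __name__ == "__main__":
    # Pass the PID of the running PLASMA program; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for tid, cpus in thread_affinities(pid).items():
        print("tid %d -> cpus %s" % (tid, cpus))
```

If every thread reports a single CPU, then something has pinned the threads, which would be consistent with the sched_setaffinity call. If every thread reports the full range of cores, then no binding is in effect at all. Without Python, the same field can be read by hand from the status files.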
Given that MAGMA works well and is designed (in part) for the MIC, I do not intend to pursue this much further, but it would be useful to identify the reason for PLASMA's poor performance and whether enabling KMP_AFFINITY would help. Any thoughts on this would be appreciated.