First, why doesn't it show up on your curve for one core? You said you had the same problem there?
Second, you should not use several threads/cores for problem sizes this small. They correspond to what we use for the tile size, so I have no good explanation. Maybe in one case all the tasks (there are only 3 or 4) are executed by the same core while in the other case they are distributed over several cores, or maybe it appears just as you go past the NB parameter, which creates new tasks.
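To illustrate that last point, here is a rough sketch in C of how the tile count jumps when the matrix size goes just past NB. The function names are made up for the illustration (they are not PLASMA routines), and the real number of tasks depends on the algorithm; this only counts tiles for a square N-by-N matrix.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PLASMA:
   number of tiles along one dimension for size n and tile size nb. */
size_t tiles_per_dim(size_t n, size_t nb)
{
    return (n + nb - 1) / nb;   /* ceiling of n / nb */
}

/* Hypothetical helper: total number of tiles in a square n-by-n matrix. */
size_t tile_count(size_t n, size_t nb)
{
    size_t t = tiles_per_dim(n, nb);
    return t * t;
}
```

With NB = 120, a matrix of size 120 fits in a single tile, while size 121 already needs 2 tiles per dimension, so 4 tiles in total, each of which generates its own work.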
If you use the default parameters, NB is set to 120, so the copy from LAPACK to tile layout and from tile layout back to LAPACK is done with M=LDA. Maybe MKL is a little slower with a leading dimension greater than M.
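In case the role of the leading dimension is not clear, here is a minimal sketch in C of a column-major block copy. It is only an illustration: copy_block is a made-up name, not the conversion routine PLASMA actually uses. When LDA equals M the columns sit back to back in memory; when LDA is greater than M each column is followed by LDA - M entries that are skipped.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PLASMA routine: copy an m-by-n column-major
   block stored with leading dimension lda (lda >= m) into a contiguous
   m-by-n buffer (leading dimension m). */
void copy_block(const double *a, size_t lda, double *tile, size_t m, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        /* column j starts at a + j*lda in the source, tile + j*m in the copy */
        memcpy(tile + j * m, a + j * lda, m * sizeof(double));
    }
}
```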
Finally, if you want to do timing, you should use the timing directory, which is more complete and up to date. It will also let you change parameters more easily.