I am profiling the LU benchmark from the NPB suite using two profiling tools with PAPI.
Although I have a lot of experience with this benchmark and these profiling tools, this time I profiled it on a processor that I had not used before. I measured the following hardware counters: PAPI_TOT_INS, PAPI_FP_OPS, PAPI_L2_TCM and PAPI_RES_STL.
I have access to the following machines:
1) Two identical machines, each with 4 AMD Opteron 6136 processors (2.4 GHz, 8 cores per processor)
2) One machine with 2 Intel X5650 processors (3.07 GHz)
I wanted to build a profiling example that compares the completed instructions per function when running the LU benchmark with 2 to 32 processes on the AMD machine.
My problem is that, for two specific functions of the LU benchmark, the total number of completed instructions changes a lot between 8 and 16 processes.
Total completed instructions, summed across all processes, for the RHS function of the LU benchmark:
2 processes: 2.21e11
4 processes: 2.21e11
8 processes: 2.79e11
16 processes: 4.43e11
32 processes: 4.37e11
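To put numbers on the variation, here is a minimal Python sketch that normalizes the totals above to the 2-process run. The names `tot_ins` and `relative_to_baseline` are just illustrative, not part of any profiling tool; the values are the ones listed above.

```python
# Summed PAPI_TOT_INS for the RHS function, copied from the list above.
# Keys are process counts, values are total completed instructions.
tot_ins = {2: 2.21e11, 4: 2.21e11, 8: 2.79e11, 16: 4.43e11, 32: 4.37e11}


def relative_to_baseline(counts, baseline=2):
    """Return each total divided by the total of the baseline run."""
    base = counts[baseline]
    return {procs: counts[procs] / base for procs in sorted(counts)}


ratios = relative_to_baseline(tot_ins)
for procs, ratio in ratios.items():
    print(f"{procs:>2} processes: {tot_ins[procs]:.2e} instructions, "
          f"{ratio:.2f}x the 2-process total")
```

If the extra instructions came only from the work being split differently, I would expect these ratios to stay close to 1; on the AMD machine they do not.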
The jump from 8 to 16 processes is large: the 16-process total is about twice the 2-process total. For the record, I have run the LU benchmark on several different clusters and never saw this with other, older AMD Opteron processors. I also tried the Intel machine and the problem does not appear there: the total completed instructions stay almost the same. I see the issue on both AMD machines, which are identical, and with both profiling tools, so it does not seem to be a problem with the tools. Is there something I don't know about this specific AMD processor? Can anyone suggest a reason for these results or offer any advice?