Hello All,
We are experimenting with our application kernels on POWER8 and looking at various hardware counters.
For a set of application kernels, we would like to measure memory bandwidth.
On BG-Q, I use PAPI to measure application bandwidth as:
Average memory bandwidth = (PEVT_L2_FETCH_LINE + PEVT_L2_STORE_LINE) * 128 bytes / elapsed_time.
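To make the arithmetic concrete, here is a minimal Python sketch of the BG-Q formula above. The function name and the counter values are placeholders I made up for illustration, not measurements, and the 128-byte line size is simply the factor used in the formula.

```python
LINE_BYTES = 128  # bytes per L2 line, the factor used in the formula above


def bgq_bandwidth(l2_fetch_lines, l2_store_lines, elapsed_seconds):
    """Average memory bandwidth in bytes/second from the two L2 line counts."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (l2_fetch_lines + l2_store_lines) * LINE_BYTES / elapsed_seconds


if __name__ == "__main__":
    # Placeholder counts (PEVT_L2_FETCH_LINE, PEVT_L2_STORE_LINE) and time
    fetch, store, seconds = 3_000_000, 1_000_000, 0.5
    print(f"{bgq_bandwidth(fetch, store, seconds) / 1e9:.3f} GB/s")
```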
On POWER8, we are measuring bandwidth as:
(PM_L3_PREF_ALL + PM_L3_CO_MEM + PM_DATA_ALL_FROM_LMEM + PM_DATA_ALL_FROM_DMEM + PM_DATA_ALL_FROM_RMEM + PM_DATA_ALL_FROM_LL4 + PM_DATA_ALL_FROM_DL4 + PM_DATA_ALL_FROM_RL4) * 128 bytes / elapsed_time
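Spelled out as a small Python sketch, the POWER8 calculation we are doing looks roughly like this. The event names are the ones in the formula above; the function name, the dictionary layout, and the counts are hypothetical, and whether this set of events is the right one is exactly what I am asking about.

```python
LINE_BYTES = 128  # bytes per cache line, the factor used in the formula above

# Events summed in the formula above
POWER8_EVENTS = (
    "PM_L3_PREF_ALL",
    "PM_L3_CO_MEM",
    "PM_DATA_ALL_FROM_LMEM",
    "PM_DATA_ALL_FROM_DMEM",
    "PM_DATA_ALL_FROM_RMEM",
    "PM_DATA_ALL_FROM_LL4",
    "PM_DATA_ALL_FROM_DL4",
    "PM_DATA_ALL_FROM_RL4",
)


def power8_bandwidth(counts, elapsed_seconds):
    """Bytes/second from a dict mapping event name to its measured count."""
    missing = [e for e in POWER8_EVENTS if e not in counts]
    if missing:
        raise KeyError(f"missing counter values: {missing}")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_lines = sum(counts[e] for e in POWER8_EVENTS)
    return total_lines * LINE_BYTES / elapsed_seconds


if __name__ == "__main__":
    # Placeholder counts, not measurements
    counts = {event: 1000 for event in POWER8_EVENTS}
    print(power8_bandwidth(counts, 1.0), "bytes/s")
```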
Is this formula correct? When we apply it to the STREAM benchmark, we see 15-20% higher bandwidth than the benchmark itself reports (this is with a single thread).
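For clarity, the percentage above is the counter-derived bandwidth relative to the STREAM-reported one. A minimal sketch of that comparison, with a hypothetical function name and made-up numbers:

```python
def relative_excess_percent(counter_bw, stream_bw):
    """How much higher the counter-derived bandwidth is, as a percentage of STREAM's."""
    if stream_bw <= 0:
        raise ValueError("stream_bw must be positive")
    return (counter_bw - stream_bw) / stream_bw * 100.0


if __name__ == "__main__":
    # Placeholder bandwidths in GB/s, not measurements
    print(f"{relative_excess_percent(11.5, 10.0):.1f}% higher than STREAM")
```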
What about multi-threaded applications on POWER8? Are these counters shared, or does every thread need to measure them separately? (It’s easy on BG-Q because the L2 counters are shared; I am not sure about POWER8.)
If someone could provide some pointers, it would be a great help!
Thanks!