I'm trying to determine whether contention for the single shared FPU in an AMD Bulldozer compute unit actually occurs in FP-heavy codes. The code I'm working on is instrumented to count many PAPI events, and I wondered whether there is an event, or a combination of events, that could be used to detect the two cores of a compute unit alternating access to the shared FPU.
This is motivated by some performance measurements of our code using the PAPI_DP_OPS event, where I noticed huge disparities between PAPI_DP_OPS and PAPI_FP_OPS on the Bulldozer. For this particular code the two counts should differ by around 10%, but the number of ops counted by PAPI_DP_OPS was nearly 3x the number counted by PAPI_FP_OPS. Could this be related to the Bulldozer splitting 256-bit AVX instructions?
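
For reference, the comparison described above can be sketched roughly as follows. This is only an illustration: the function names, the 20% tolerance, and the sample counts are hypothetical, and the counts are not measured values. It assumes the two totals have already been read from the PAPI counters for the same code region.

```python
def ops_ratio(dp_ops: int, fp_ops: int) -> float:
    """Return the PAPI_DP_OPS total divided by the PAPI_FP_OPS total."""
    if fp_ops <= 0:
        raise ValueError("fp_ops must be positive")
    return dp_ops / fp_ops


def is_suspicious(dp_ops: int, fp_ops: int, tolerance: float = 0.20) -> bool:
    """True when the two counts differ by more than `tolerance`.

    The default of 0.20 is an assumed margin: twice the roughly 10%
    difference expected for this code.
    """
    return abs(ops_ratio(dp_ops, fp_ops) - 1.0) > tolerance


# Hypothetical counts that only mirror the roughly 3x disparity described above.
print(ops_ratio(3_000_000, 1_000_000))
print(is_suspicious(3_000_000, 1_000_000))
```

A ratio near 1.1 would match what I expect for this code; a ratio near 3 is the disparity I am asking about.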