What you're running into is a fundamental limit of the Opteron hardware. The events you're monitoring measure shared resources at the chip level, even though you're using a core-level counter interface to measure them. For most events, each core has complete and exclusive control of its 4 counters. For events that monitor shared resources (L3 cache, HyperTransport), however, all the cores on a chip share a single set of 4 chip-level counters, even though the core-level programming looks identical. Think of it as having 4 remote controls for the same TV set. The AMD BKDG says that, by convention, only one core per chip should program these events; otherwise core 2 (for example) could overwrite the settings of core 1. Pinning threads to cores works in your favor, since thread migration won't be an issue, but you still need to be aware of this contention. Further, PAPI counts events only while a thread is active, so even with the above constraints met, the counter values will be a lower bound on the actual number of events.
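To make the remote-control problem concrete, here is a toy Python model of it. It is only a sketch: it does not touch PAPI or the real hardware, the class and function names are hypothetical, the event names are made-up labels, and picking the lowest-numbered core as the one allowed to program shared events is my assumption (the convention only asks for one core per chip, not a particular one).

```python
# Toy model only: no PAPI calls, no real hardware access.
# All names (Chip, designated_core, program_shared) and the event
# labels are hypothetical and exist purely for illustration.

class Chip:
    """One chip with a single bank of 4 counters for shared-resource events."""

    def __init__(self, num_slots=4):
        # Each slot holds (core, event) for whoever programmed it last.
        self.slots = [None] * num_slots

    def program(self, core, slot, event):
        # Nothing arbitrates between cores here: the last writer wins.
        self.slots[slot] = (core, event)

    def event_in(self, slot):
        entry = self.slots[slot]
        return None if entry is None else entry[1]


def designated_core(cores_on_chip):
    """Assumed policy: the lowest-numbered core owns the shared counters."""
    return min(cores_on_chip)


def program_shared(chip, cores_on_chip, core, slot, event):
    """Program a shared event only if `core` is the chip's designated core."""
    if core != designated_core(cores_on_chip):
        return False  # refuse: another core owns the shared counters
    chip.program(core, slot, event)
    return True


cores = [0, 1, 2, 3]

# Without the convention: core 2 silently replaces core 1's event.
chip = Chip()
chip.program(1, 0, "L3_MISSES")
chip.program(2, 0, "HT_LINK_TRAFFIC")
clobbered = chip.event_in(0)

# With the convention: only core 0 is allowed to program the slot.
chip2 = Chip()
ok0 = program_shared(chip2, cores, 0, 0, "L3_MISSES")
ok2 = program_shared(chip2, cores, 2, 0, "HT_LINK_TRAFFIC")

print(clobbered, ok0, ok2, chip2.event_in(0))
```

In the first case core 1 would go on reading slot 0 believing it still counts its own event. In a real measurement, the equivalent of the second case is to add the L3 and HyperTransport events to the event set of just one pinned thread per chip and leave them out of the others.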
I hope this sheds at least some light on what you're seeing.