Yes, maintaining counts per thread is one thing perf_counters does. It can also do multiplexing of events onto counters, so that if you have more events than counters, you can get decent approximations of counts.
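To make the "decent approximations" part concrete, here is a minimal sketch of the usual extrapolation for a multiplexed event. It assumes the kernel reports, alongside the raw count, how long the event was enabled and how long it actually occupied a hardware counter. The function name and the parameter names `time_enabled` and `time_running` are my own choices for illustration, not something taken from this thread.

```python
def scale_count(raw_count: int, time_enabled: int, time_running: int) -> float:
    """Estimate the count an event would have reached had it been on a
    hardware counter for the whole time it was enabled.

    Assumption: the event rate while the event was scheduled out is
    roughly the same as while it was scheduled in.
    """
    if time_running == 0:
        # The event never got a counter, so there is nothing to extrapolate from.
        return 0.0
    return raw_count * time_enabled / time_running


# An event that was on a counter for half of its enabled time:
estimate = scale_count(raw_count=1000, time_enabled=200, time_running=100)
print(estimate)
```

The estimate is only as good as the assumption that the workload behaves similarly while the event is off the counter, which is why these are approximations rather than exact counts.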
As for performance, I would bet that perf_counters is as fast as perfctrs when self-monitoring using the mmap'd memory mechanism. I don't know how perfctrs could be faster on remote monitoring (or perhaps perfctrs doesn't support remote monitoring?). When doing remote monitoring, you somehow need to get the counter values from other threads, and that requires at least some sort of IPI mechanism, which is pretty heavyweight.