Does PAPI report per-process values?

Post by cincaipatron » Wed Oct 07, 2009 4:47 am

I have a clarifying (aka stupid) question:

I have a program that calls the PAPI API to measure the requests between a particular CPU core and memory. The counter (on a Barcelona processor) is CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE. Does the value reported by PAPI come only from the PAPI-instrumented program, or is it the cumulative count of all memory accesses issued by the core?

Re: Does PAPI report per-process values?

Post by Dan Terpstra » Wed Oct 07, 2009 1:36 pm

In general, PAPI measures and reports event counts on a per-process or per-thread basis. The counter state is saved and restored at each context switch, so you only see events related to your process. However, on multicore Opterons there is an exception to this rule. Some events count chip-wide activity, like cache events in the shared L3 cache on Shanghai and Istanbul. These events are counted on a set of shared shadow counters that don't get properly saved and restored at context switch. I don't recall whether the event you're measuring, CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE, is one of these maverick events or not. I suspect it is not, but I haven't confirmed that.
Caveat emptor!
Dan Terpstra
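
To make the save/restore idea concrete, here is a minimal toy model in Python. It is not PAPI or kernel code: the class names, the `run_slice` helper, and the event numbers are all hypothetical and chosen only for illustration. It shows how crediting each process with the counter delta over its own time slices yields per-process totals, while a raw read of an unvirtualized (shared) counter would include everyone's events.

```python
class HardwareCounter:
    """A free-running hardware event counter (toy model, not a real PMU)."""

    def __init__(self):
        self.value = 0

    def count(self, n):
        self.value += n


class Process:
    def __init__(self, name):
        self.name = name
        self.virtual_count = 0  # per-process total kept by the software layer


def run_slice(proc, hw, events):
    """Schedule proc, let it generate events, then switch it out."""
    start = hw.value                         # switch-in: remember the raw value
    hw.count(events)                         # the process runs and causes events
    proc.virtual_count += hw.value - start   # switch-out: credit only the delta


hw = HardwareCounter()
a, b = Process("a"), Process("b")

# Hypothetical schedule: (process, events generated during that time slice).
for proc, events in [(a, 100), (b, 40), (a, 25), (b, 5)]:
    run_slice(proc, hw, events)

print("virtualized count for a:", a.virtual_count)
print("virtualized count for b:", b.virtual_count)
print("raw shared counter     :", hw.value)
```

In this sketch, a process reading the raw counter directly would see the combined activity of both processes, which is roughly the situation described above for the chip-wide events.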

Re: Does PAPI report per-process values?

Post by jdmccalpin » Thu Feb 04, 2010 4:25 pm

There are a number of issues here:
    * CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE is a new event and I don't know if it has been verified by AMD to give correct answers.
    * The detailed behavior may depend on which interface layer you are using underneath PAPI.
    I use perfctr, which attempts to report "per-process" values by saving/restoring counters on context switches.
    * The default behavior of PAPI is to request that perfctr only accumulate counts while running user-space code, so any events that happen while the kernel is operating on behalf of your process will not be counted.
    A common example of missing counts is in code that touches memory for the first time -- the kernel does a lot of work to find and clear a page of physical memory and map it into your virtual address space. Any DRAM accesses that happen during this period will not normally be counted. (PAPI can, however, request that counts be accumulated in both user-space and kernel-space code: see the PAPI_get_domain and PAPI_set_domain functions.)
    * DRAM accesses may be surprisingly asynchronous with respect to the underlying user-space code. For example, a victim block can be sent to memory at almost any time after the store that dirtied the cache line -- any time meaning anywhere between a few cycles and days/weeks/months. It all depends on how long it is before that L3 cache block is chosen as a victim for replacement. If you run a cache-contained code that does not cause any L3 cache misses, the dirty data could stay in the L2 or L3 for a long time.

    * A context switch will wait for all active instructions to complete before switching, so reads that go to remote memory should be counted along with the process that generated them, but I am not sure that you can depend on this for reads generated by prefetches (either software prefetches or prefetches from the core hardware prefetcher).
    * DRAM accesses due to the memory controller/DRAM controller prefetcher should not be counted by this event, since they are not explicitly associated with a CPU.
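
The user-space versus kernel-space point in the list above can be sketched with a small toy model. This is plain Python, not the PAPI API: the `Domain` flag is only loosely modeled on the idea of a counting domain, and the trace and its numbers are invented to mimic a first-touch page fault where the kernel does most of the DRAM traffic.

```python
from enum import Flag, auto


class Domain(Flag):
    """Hypothetical counting domains (illustrative, not PAPI constants)."""
    USER = auto()
    KERNEL = auto()
    ALL = USER | KERNEL


def count_events(trace, domain):
    """Sum only the events that occurred in a mode covered by `domain`."""
    return sum(n for mode, n in trace if mode & domain)


# Invented trace of (mode, DRAM requests): a user store touches a new page,
# the kernel finds and clears a physical page, then user code continues.
trace = [
    (Domain.USER, 10),
    (Domain.KERNEL, 512),
    (Domain.USER, 10),
]

user_only = count_events(trace, Domain.USER)
user_and_kernel = count_events(trace, Domain.ALL)

print("user-space only :", user_only)
print("user + kernel   :", user_and_kernel)
```

With a user-only domain, the kernel's page-clearing traffic in this made-up trace simply never shows up in the total, which is the kind of undercount described for first-touch memory.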

So the short answer is: PAPI attempts to report per-process values, but there are a number of ways in which these might vary from what you would expect.
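
As one illustration of how the counts can differ from expectations, here is a toy write-back cache model of the asynchronous victim writeback case. Everything in it is an assumption made for the sketch: a single shared LRU cache of four lines, two processes named "A" and "B", and DRAM writes charged to whoever is running when the eviction happens. Real hardware is far more complicated.

```python
from collections import OrderedDict


class WriteBackCache:
    """Tiny LRU write-back cache: DRAM writes happen only on eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> dirty flag

    def store(self, addr):
        """Write to addr; return how many DRAM writebacks this store caused."""
        writebacks = 0
        if addr in self.lines:
            self.lines.move_to_end(addr)           # hit: refresh LRU position
        elif len(self.lines) >= self.capacity:
            _, dirty = self.lines.popitem(last=False)  # evict the LRU victim
            writebacks += int(dirty)
        self.lines[addr] = True                    # the line is now dirty
        return writebacks


cache = WriteBackCache(capacity=4)
dram_writes = {"A": 0, "B": 0}

# Process A dirties four lines that all fit in the cache: no DRAM writes yet.
for addr in range(4):
    dram_writes["A"] += cache.store(addr)

# Process B later stores to four other lines, evicting A's dirty data.
for addr in range(100, 104):
    dram_writes["B"] += cache.store(addr)

print(dram_writes)
```

In this model A's stores cause no DRAM writes while A is running, and the writebacks of A's data land in B's time slice instead, so a per-process count would attribute them to B.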
