NUMA events on AMD failing

Postby amerritt » Wed Jun 23, 2010 4:28 pm


I have a new AMD Magny-Cours processor and am encountering problems using PAPI while pinning my test program to certain domains. Using numactl I pin the program to execute on one domain and force it to allocate memory on another, then tell PAPI to measure the CPU to DRAM requests for a specific NUMA domain.

This is my application, and its output when I run it with numactl:
Code:
$ cat sandbox.c
#include <stdio.h>
#include <papi.h>
#include <stdlib.h> // exit
#include <string.h>
#include <errno.h>

enum {
   //EVENT      = PAPI_TOT_INS,      // preset; total instructions
   //EVENT      = PAPI_TOT_CYC,      // preset; total cycles

   //EVENT      = 0x40040062,      // native; CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE, all
   //EVENT      = 0x40002062,      // native; CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE, to n3
   //EVENT      = 0x40040068,      // native; CPU_COMMAND_LATENCY_TO_TARGET_NODE_0_3_4_7, all
   EVENT      = 0x40040061,      // native; MEMORY_CONTROLLER_REQUESTS, all

   DATA_SIZE   = (128 << 20)      // bytes to allocate and write
};

#define PAPI_WRAP(x)                           \
   do {                              \
      int error = (x);                     \
      if( error != PAPI_OK )                     \
      {                           \
         fprintf( stderr, "papi error: %s\n", PAPI_strerror(error) );   \
         if( error == PAPI_ESYS ) /* manual says to check errno */    \
            fprintf( stderr, "  errno: %s\n", strerror(errno) );   \
         exit(error);                     \
      }                           \
   } while( 0 )

typedef int      perfctr_eventset_t;
typedef long long   perfctr_values_t;
typedef int      papi_eventcode_t;

int main( void )
{
   perfctr_eventset_t   set = PAPI_NULL;
   perfctr_values_t   read[1];

   PAPI_set_debug( PAPI_VERB_ESTOP ); // make papi handle the errors and stop the program

   if( PAPI_library_init( PAPI_VER_CURRENT ) != PAPI_VER_CURRENT )
   {
      fprintf( stderr, "error: library init\n" );
      return -1;
   }
   printf( "papi initialized\n" );

   printf( "testing for event\n" );
   PAPI_WRAP( PAPI_query_event( EVENT ) );
   printf( "library says we're okay\n" );

   PAPI_WRAP( PAPI_create_eventset( &set ) );
   printf( "eventset initialized\n" );

   PAPI_WRAP( PAPI_add_event( set, EVENT ) );
   printf( "event added to set\n" );

   PAPI_WRAP( PAPI_start( set ) ); // clears counter

   unsigned long *data = malloc( DATA_SIZE );
   //memset( data, 0xDEADBEEF, DATA_SIZE );
   unsigned long i;
   for( i = 0; i < (DATA_SIZE/sizeof(unsigned long)); i++ )
      data[ i ] = 0xdeadbeefdeadbeef;

   PAPI_WRAP( PAPI_stop( set, read ) );

   printf( "value of counter: %lld\n", *read );

   PAPI_WRAP( PAPI_cleanup_eventset( set ) );
   PAPI_WRAP( PAPI_destroy_eventset( &set ) );

   return 0;
}

$ gcc -O0 -Wall -ggdb -D_GNU_SOURCE  sandbox.c -o sandbox -lpapi

$ numactl --cpubind=1 --membind=2 ./sandbox
papi initialized
testing for event
library says we're okay
eventset initialized
event added to set
PAPI Error: vperfctr_control() returned < 0.
papi error: PAPI_ESYS
  errno: Invalid argument

$ numactl --cpubind=0 --membind=3 ./sandbox
papi initialized
testing for event
library says we're okay
eventset initialized
event added to set
value of counter: 239222667

My workstation has 4 NUMA domains, 0-3. Each domain has 6 CPU cores and one memory controller. This program initializes PAPI with a single event to count and runs with only one thread. It then allocates some memory, writes to every word of it, and reads the counter. I determined that I can only pin it to CPU cores 0 and 1, which reside on NUMA domains 0 and 3, respectively. The same problem occurs with other NUMA-related events, such as CPU_COMMAND_LATENCY_TO_TARGET_NODE_0_3_4_7 and MEMORY_CONTROLLER_REQUESTS. Binding to NUMA domains 1 and 2 fails consistently with many of these events, no matter which core I select or which NUMA domain's memory I allocate from. Preset events work as expected, and other non-NUMA events work fine as well.
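
The core-to-domain mapping mentioned above can also be checked from inside a program. The following is only a minimal sketch, assuming the usual Linux sysfs layout in which /sys/devices/system/node/nodeN/cpulist holds a comma-separated list of CPU ranges (for example "0-5,12-17", a made-up value). The names cpulist_contains and node_of_cpu are illustrative and are not part of PAPI, perfctr, or libnuma.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if 'cpu' appears in a sysfs-style cpulist string such as
 * "0-5,12-17" (optionally ending in a newline), otherwise 0.
 * Illustrative helper; malformed input is treated as "not found". */
static int cpulist_contains( const char *list, int cpu )
{
   const char *p = list;
   while( *p != '\0' && *p != '\n' )
   {
      char *end;
      long lo = strtol( p, &end, 10 );
      long hi = lo;
      if( end == p )
         return 0;            /* no digits where a number was expected */
      if( *end == '-' )       /* a range "lo-hi" */
      {
         p = end + 1;
         hi = strtol( p, &end, 10 );
         if( end == p )
            return 0;
      }
      if( cpu >= lo && cpu <= hi )
         return 1;
      p = ( *end == ',' ) ? end + 1 : end;
   }
   return 0;
}

/* Return the first node in [0, max_nodes) whose cpulist contains 'cpu',
 * or -1 if none does (or if sysfs is not available). */
static int node_of_cpu( int cpu, int max_nodes )
{
   int node;
   for( node = 0; node < max_nodes; node++ )
   {
      char path[64], line[4096];
      FILE *f;
      int found = 0;
      snprintf( path, sizeof(path),
                "/sys/devices/system/node/node%d/cpulist", node );
      f = fopen( path, "r" );
      if( f == NULL )
         continue;            /* node absent on this machine */
      if( fgets( line, sizeof(line), f ) != NULL )
         found = cpulist_contains( line, cpu );
      fclose( f );
      if( found )
         return node;
   }
   return -1;
}
```

On a machine like the one described here, calling node_of_cpu( 1, 4 ) would report which of the four domains core 1 belongs to, according to the kernel's view.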

Here is more information about my system:
Code:
$ papi_version
PAPI Version:

$ uname -r

$ getenforce # SELinux

$ cat /sys/devices/system/node/node[0-3]/cpulist

$ lsmod | grep perfctr
perfctr               141112  0

$ perfex -i
PerfCtr Info:
abi_version      0x05020501
driver_version      2.6.41
cpu_type      19 (AMD Family 10h)
cpu_features      0x7 (rdpmc,rdtsc,pcint)
cpu_khz         2199988
tsc_to_cpu_mult      1
cpu_nrctrs      4
cpus         [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23], total: 24
cpus_forbidden      [], total: 0

$ pwd
$ grep PERFCTR .config

ACPI is indeed enabled in my kernel's configuration file, so the SRAT table is available and the kernel should be able to build an accurate map of the physical NUMA layout. I'm not sure what else could be causing this. Counters are available on each core, correct? Not on just two cores in separate domains or something funky?

Any help is appreciated; thank you in advance for suggestions.


Re: NUMA events on AMD failing

Postby James Ralph » Thu Jun 24, 2010 12:07 pm


I think this might be an issue with perfctr.
I seem to recall that perfctr limits north bridge events on AMD chips
to the first core on a chip (but then it let you run on core 1?!).

--- perfctr-2.6.39 RELEASENOTES----------
- x86.c: update AMD multicore detection to match the documentation
and actually work on current processors, set up cpumask of all
core0 CPUs, detect RevE processors, update p6_like_check_control()
to allow per-thread sessions to use AMD NB events on post-RevE
processors but limit them to core0 CPUs

Is it possible for you to try running on a 2.6.32 or newer kernel
using the kernel perf_events interface?
I was unable to recreate the problem on our Istanbul-based machine
running perf_events.

James Ralph

Re: NUMA events on AMD failing

Postby amerritt » Thu Jun 24, 2010 6:41 pm

You are indeed correct; I managed to figure out what the problem is. When the perfctr kernel module loads, it restricts NorthBridge event monitoring to what it calls "core0" of each processor. perfctr does this because the AMD BKDG states that these events can only be monitored by a single core on a die; perfctr just chooses the one with the smallest core ID.

Code:
papi-4.1.0/src/perfctr-2.6.x/linux/drivers/perfctr/x86.c::static void __init amd_mc_init_cpu( void *data)

The CPUID instruction gives core identifiers that enumerate through all cores in a package, not in a die. Extra information is needed to identify which die the core is located on, but until AMD releases updated BKDG and CPUID specs this functionality is missing.

Magny-Cours is actually two processors within one package (socket G34), each a separate die with a dedicated memory controller. Each die can have one core labeled as a "core0", but because the AMD CPUID spec is old and does not include information for determining which NUMA node a core is located on, perfctr is forced to assume each processor package has only one die and thus one NUMA domain. So perfctr thinks that socket 0 has 12 cores, 1 memory controller, and 1 NUMA domain, when it actually has 2 dies, 2x6 cores, 2 memory controllers, and 2 NUMA domains.
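
To make the effect concrete, here is a minimal sketch of choosing "core0" per package versus per die. It is not perfctr's code and not the change described in this post. The contiguous CPU numbering (6 cores per die, 2 dies per package, 24 CPUs) and the names die_of, package_of, and mark_core0 are assumptions for illustration only; the real Linux CPU numbering on a given machine may differ.

```c
/* Hypothetical topology: 24 CPUs, numbered contiguously by die. */
enum { NCPUS = 24, CORES_PER_DIE = 6, DIES_PER_PACKAGE = 2 };

/* Die (NUMA domain) index of a CPU under the assumed numbering. */
static int die_of( int cpu )
{
   return cpu / CORES_PER_DIE;
}

/* Package (socket) index of a CPU under the assumed numbering. */
static int package_of( int cpu )
{
   return die_of( cpu ) / DIES_PER_PACKAGE;
}

/* Mark the lowest-numbered CPU of each group as "core0".
 * group_of maps a CPU to its group (package or die).
 * Returns how many core0 CPUs were marked. */
static int mark_core0( int (*group_of)( int ), int is_core0[NCPUS] )
{
   int seen[NCPUS] = { 0 };   /* group indices are always < NCPUS here */
   int cpu, count = 0;
   for( cpu = 0; cpu < NCPUS; cpu++ )
   {
      int g = group_of( cpu );
      is_core0[cpu] = !seen[g];
      if( !seen[g] )
      {
         seen[g] = 1;
         count++;
      }
   }
   return count;
}
```

With this assumed layout, grouping by package marks only CPUs 0 and 12 as core0, while grouping by die marks CPUs 0, 6, 12, and 18, so every NUMA domain gets one core that is allowed to monitor NorthBridge events.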

I updated perfctr to account for this in my experiments. Thank you for your input :)
