PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by Dmitry » Tue Feb 09, 2010 12:29 pm

Good Afternoon,

The native event L1D_CACHE_LD.MESI doesn't work properly on Intel Core i7 (Nehalem).
The result is ten times higher than I expected.

What should the result be for the following code (compiled without any optimization)?
Code:
double aArray[1000];
double aSumm = 0.0;
for(int aI=0; aI<1000;aI++)
{
 aSumm += aArray[aI];
}

The counter returns ~10 000 L1D cache loads.

Thank you
Dmitry
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by jdmccalpin » Tue Feb 09, 2010 1:31 pm

It is very difficult to determine what the "correct" answer should be in a case that works on uninitialized data.
In this case the answer will depend on the operating system's policy for handling read-only access to uninitialized pages. Standard Linux versions, for example, behave differently from most other operating systems.

One approach that might help:
* initialize the data by storing to it (a small array like this will then be in the cache)
* flush the cache(s) by initializing another (much larger) array sized to be larger than the L3 cache size
* then go back and read your small aArray[]

Be aware that different chips count Data Cache load misses differently. On Family 10h Opterons, for example, if the data is brought into the Data Cache by the hardware prefetcher before the demand load is executed, then the demand load is not counted as a miss. I don't (yet) know how Intel handles this case with Core i7.
jdmccalpin
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by Dan Terpstra » Tue Feb 09, 2010 5:47 pm

If you assume that you are starting with a cold cache, then you would expect the number of L1 cache loads in this example (no reuse) to be a function of the cache line size, the data element size, and the number of elements. My Nehalem reports L1D cache lines of 64 bytes each. That holds 8 doubles at 64 bits each. So each L1D load should bring in 8 data elements and the maximum total loads should be (data size)/8, possibly reduced by prefetching.
However, when I run the example loop from the original post with arrays of various sizes, from 1000 to 800000 elements, I converge on 7 L1D load events per data element. Not quite as high as reported in the original post, but still higher than my prediction by more than a factor of 50.
What am I missing, John?
Dan Terpstra
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by jdmccalpin » Wed Feb 10, 2010 2:45 pm

According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
(http://www.intel.com/Assets/PDF/manual/248966.pdf), page B-53:

30. Load Rate: L1D_CACHE_LD.MESI / CPU_CLK_UNHALTED.CORE
One memory read operation can be served by a core each cycle. A high “Load Rate”
indicates that execution may be bound by memory read operations.

This suggests that the L1D_CACHE_LD.MESI counter measures load accesses, not load misses.
I don't know about Core i7 in particular, but many microarchitectures will repeatedly retry loads that have previously missed in the cache until they finally succeed.
If this is the case here, then I would expect the value to be approximately 1 increment per program load for data in the L1 Data Cache, then increasing systematically as the latency to load the Data Cache line increases. It may be interesting to replace the summation kernel with a pointer-chasing kernel (using a random or non-prefetchable stride) to ensure that there is only one load instruction outstanding at any time.
jdmccalpin
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by Dmitry » Wed Feb 17, 2010 6:00 am

The following code was compiled with the option -O0 (no optimization; XMM registers were used for the addition).
Code:
double* aDataA; // array of size 4096*4096
double aTemp;
int aLoop;
flushCache();
PAPI_COUNTERS_START();
for (aLoop = 0; aLoop < NUMBER_OF_LOOPS; aLoop++)
{
    // loop body: calculation_DirectAccess or calculation_RandomAccess
}
PAPI_COUNTERS_READ();

////////////////////////////////////////////////
*calculation_DirectAccess:
direct array access:
aTemp += aDataA[0];

aDataC[NUMBER_OF_ELEMENTS-1] = 2.0*aDataA[NUMBER_OF_ELEMENTS-1];

////////////////////////////////////////////
*calculation_RandomAccess:
random array access:
aTemp += aDataA[ random ];
… NUMBER_OF_ELEMENTS-1
aTemp += aDataA[ random ];




The Results:
NUMBER_OF_LOOPS = 1000
NUMBER_OF_ELEMENTS = 1000

calculation_DirectAccess:
L1_CACHE_DCA = 3 002 302
L2_CACHE_DCA = 280
TOTAL_CYCLES = 9 875 479
TIME sec = 0.002850

calculation_RandomAccess:
L1_CACHE_DCA = 4 499 964
L2_CACHE_DCA = 1 869 159
TOTAL_CYCLES = 33 107 889
TIME sec = 0.009553


The Questions:
1. What is the unit of the counters L1_CACHE_DCA and L2_CACHE_DCA?
2. Are the cache access counters nested (see the next question)?
3. Where does the rest of the time go in the case of "calculation_RandomAccess"?
TOTAL_CYCLES >> L1_CACHE_DCA + L2_CACHE_DCA

Does anybody know the answers?
Dmitry
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by vweaver1 » Wed Feb 17, 2010 4:10 pm

Which exact processor is the one causing trouble?

There is an erratum, at least for the 55xx version of Nehalem:

(from the Intel Xeon Processor 5500 Series Specification Update, Erratum AAK119)
Code:
Problem:     The performance monitor events DCACHE_CACHE_LD (Event 40H) and
             DCACHE_CACHE_ST (Event 41h) count cacheable loads and stores that hit the L1
             cache. Due to this erratum, in addition to counting the completed loads and stores, the
             counter will incorrectly count speculative loads and stores that were aborted prior to
             completion.
Implication: The performance monitor events DCACHE_CACHE_LD and DCACHE_CACHE_ST may
             reflect a count higher than the actual number of events.

Although on code like this I wouldn't think the speculative loads/stores would be that high, and definitely not a factor of 10.
vweaver1
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by Dmitry » Fri Feb 19, 2010 11:33 am

factor 3...
Dmitry
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by vweaver1 » Wed Feb 24, 2010 11:32 am

Which compiler are you using to compile your code?

I've found on our machines here that, for some reason, code compiled with gcc 4.1.x generates 3 dcache accesses per load when using SSE to load and add an fp double, whereas code compiled with gcc 4.3.x does the expected 1 dcache access.

gcc 4.1 generates:
Code:
  401e18:       f3 0f 10 44 24 0c       movss  0xc(%rsp),%xmm0
  401e1e:       f3 0f 58 04 82          addss  (%rdx,%rax,4),%xmm0
  401e23:       48 83 c0 01             add    $0x1,%rax
  401e27:       48 3d e8 03 00 00       cmp    $0x3e8,%rax
  401e2d:       f3 0f 11 44 24 0c       movss  %xmm0,0xc(%rsp)
  401e33:       75 e3                   jne    401e18 <main+0x398>


gcc 4.3 generates:
Code:
  402138:       f3 0f 58 00             addss  (%rax),%xmm0
  40213c:       48 83 c0 04             add    $0x4,%rax
  402140:       48 39 d0                cmp    %rdx,%rax
  402143:       75 f3                   jne    402138 <main+0x328>
vweaver1
 

Re: PAPI_3.7.1 L1D_CACHE_LD doesn't work properly Intel Core i7

Post by Jamie Han » Sat Mar 13, 2010 10:54 am

I've had the same issue with L1D_CACHE_LD.MESI not working properly. In my case I flushed the cache by initializing a larger array, and when I went back and read my small array, the result was OK.

Best
Chris
Jamie Han
 

