The following code was compiled with option -O0 (no optimization; the generated code additionally uses XMM registers).
double* aDataA; // array of size 4096*4096
double aTemp;
flushCache();
PAPI_COUNTERS_START();
for(aLoop=0;aLoop<NUMBER_OF_LOOPS;aLoop++)
{
* calculation_DirectAccess or calculation_RandomAccess
}
PAPI_COUNTERS_READ();
////////////////////////////////////////////////
*calculation_DirectAccess:
direct array access:
aTemp += aDataA[0];
…
aTemp += aDataA[NUMBER_OF_ELEMENTS-1];
////////////////////////////////////////////
*calculation_RandomAccess:
random array access:
aTemp += aDataA[ random ];
… NUMBER_OF_ELEMENTS-1
aTemp += aDataA[ random ];
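For anyone who wants to reproduce the two access patterns without the PAPI macros, here is a minimal C sketch. It is not the original test code: the original unrolls the statements with constant indices, while this version uses loops, and the helper names (`sum_direct`, `sum_random`, `fill_random_indices`), the use of `rand()`, and the index range are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sequential pass: reads aData[0], aData[1], ..., aData[n-1] in order. */
double sum_direct(const double *aData, size_t n)
{
    double aTemp = 0.0;
    for (size_t i = 0; i < n; i++)
        aTemp += aData[i];
    return aTemp;
}

/* Random pass: n reads at the precomputed positions idx[0..n-1].
 * The indices are generated up front so that the random number
 * generator itself is not part of the measured loop. */
double sum_random(const double *aData, const size_t *idx, size_t n)
{
    double aTemp = 0.0;
    for (size_t i = 0; i < n; i++)
        aTemp += aData[idx[i]];
    return aTemp;
}

/* Fill idx with `count` pseudo-random indices in [0, range).
 * rand() is used only for illustration; its quality and range
 * (RAND_MAX) are implementation-defined. */
void fill_random_indices(size_t *idx, size_t count, size_t range, unsigned seed)
{
    assert(range > 0);
    srand(seed);
    for (size_t i = 0; i < count; i++)
        idx[i] = (size_t)rand() % range;
}
```

To mimic the measurement, one would call `sum_direct` or `sum_random` NUMBER_OF_LOOPS times between starting and reading the counters, with `n` set to NUMBER_OF_ELEMENTS.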
The Results:
NUMBER_OF_LOOPS = 1000
NUMBER_OF_ELEMENTS = 1000
calculation_DirectAccess:
L1_CACHE_DCA = 3 002 302
L2_CACHE_DCA = 280
TOTAL_CYCLES = 9 875 479
TIME sec = 0.002850
calculation_RandomAccess:
L1_CACHE_DCA = 4 499 964
L2_CACHE_DCA = 1 869 159
TOTAL_CYCLES = 33 107 889
TIME sec = 0.009553
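A few ratios derived from the numbers above may help frame the questions. Using the small helpers below (hypothetical names, plain arithmetic only), the direct case comes out at roughly 3.29 cycles per counted access and the random case at roughly 5.20, and both runs imply a clock of about 3.47 GHz. Adding the two counters together assumes they are not nested, which is exactly what is unclear here, so treat the per-access figures as a rough guide only.

```c
#include <assert.h>

/* Cycles per counted data-cache access, taking the sum of the
 * L1 and L2 access counters as the denominator. */
double cycles_per_access(double total_cycles, double l1_dca, double l2_dca)
{
    assert(l1_dca + l2_dca > 0.0);
    return total_cycles / (l1_dca + l2_dca);
}

/* Clock frequency in Hz implied by a cycle count and a wall time. */
double implied_hz(double total_cycles, double seconds)
{
    assert(seconds > 0.0);
    return total_cycles / seconds;
}
```

For example, `cycles_per_access(33107889.0, 4499964.0, 1869159.0)` gives the figure for the random case, and `implied_hz(9875479.0, 0.002850)` gives the implied clock for the direct case.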
The Questions:
1. What is the unit of the counters L1_CACHE_DCA and L2_CACHE_DCA?
2. Are the cache access counters nested (see the next question)?
3. Where does the rest of the time go in the "calculation_RandomAccess" case?
TOTAL_CYCLES >> L1_CACHE_DCA + L2_CACHE_DCA
Does anybody know the answers?