The following code was compiled with the option -O0 (no optimization; the generated code also used XMM registers).

`double* aDataA; // array of size 4096*4096`

double aTemp;

flushCache();

PAPI_COUNTERS_START();

for(aLoop=0;aLoop<NUMBER_OF_LOOPS;aLoop++)

{

* calculation_DirectAccess or calculation_RandomAccess

}

PAPI_COUNTERS_READ();

*calculation_DirectAccess:

direct array access:

aTemp += aDataA[0];

…

aTemp += aDataA[NUMBER_OF_ELEMENTS-1];

*calculation_RandomAccess:

random array access:

aTemp += aDataA[ random ];

… NUMBER_OF_ELEMENTS-1

aTemp += aDataA[ random ];

The Results:

NUMBER_OF_LOOPS = 1000

NUMBER_OF_ELEMENTS = 1000

calculation_DirectAccess:

L1_CACHE_DCA = 3 002 302

L2_CACHE_DCA = 280

TOTAL_CYCLES = 9 875 479

TIME sec = 0.002850

calculation_RandomAccess:

L1_CACHE_DCA = 4 499 964

L2_CACHE_DCA = 1 869 159

TOTAL_CYCLES = 33 107 889

TIME sec = 0.009553

The Questions:

1. What is the unit of the counters L1_CACHE_DCA and L2_CACHE_DCA?

2. Are the cache access counters nested (see the next question)?

3. Where does the rest of the time go in the case of "calculation_RandomAccess"?

TOTAL_CYCLES is much larger than L1_CACHE_DCA + L2_CACHE_DCA.

Does anybody know the answers?