I am playing around with PAPI to take some performance measurements, specifically TOT_CYC, TOT_INS, and L2_DCM.
However, I am not quite sure why the results vary. I tried 4 scenarios and compared the results:
Scenario 1:
PAPI_start()
for i 0 to 500
    call func()
end for
PAPI_stop()
Results: TOT_CYC = 23330897, TOT_INS = 79542, L2_DCM = 779
Scenario 2:
for i 0 to 500
    PAPI_start()
    call func()
    PAPI_stop()
end for
sum_the_values()
Results: TOT_CYC = 443303, TOT_INS = 146898, L2_DCM = 830
Scenario 3:
void func()
{
    PAPI_start()
    do_something()
    PAPI_stop()
}
int main()
{
    for i 0 to 500
        call func()
    end for
    sum_the_results()
}
Results: TOT_CYC = 171056, TOT_INS = 128893, L2_DCM = 445
Questions:
1. Scenario 1 seems to consume a LOT of CYC, yet its TOT_INS is much lower than in Scenario 2 or 3. Why would that be?
2. In Scenarios 2 and 3, the 1st iteration alone takes an abnormally large number of CYC and INS. Is there a specific reason? For example, in Scenario 3 the 1st iteration took approx. 13,000 CYC and 4,200 INS, while the rest of the iterations were fairly constant at around 450 CYC and 250 INS.
3. What's the overhead of the PAPI_start() and PAPI_stop() calls? Also, does PAPI_stop() count the CYC and INS that it uses itself?
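To make questions 2 and 3 concrete, here is a minimal C sketch of how the per-iteration numbers could be collected and split into the first iteration versus the rest. It is only an illustration under assumptions: `iter_summary`, `summarize`, `sample_cycles` and the `USE_PAPI` macro are names made up for this sketch, not part of PAPI. The PAPI part is compiled only with `-DUSE_PAPI -lpapi`, and it counts `PAPI_TOT_CYC` alone to keep things short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of per-iteration counts for ONE event
   (e.g. PAPI_TOT_CYC): first iteration vs. the remaining ones. */
typedef struct {
    long long first;      /* count from iteration 0, minus overhead     */
    long long warm_total; /* sum over iterations 1..n-1, minus overhead */
    long long warm_mean;  /* integer mean of iterations 1..n-1          */
} iter_summary;

/* 'overhead' is an estimate of what an empty start/stop pair costs;
   pass 0 to leave the samples uncorrected. */
iter_summary summarize(const long long *samples, size_t n, long long overhead)
{
    assert(samples != NULL && n > 0);
    iter_summary s = {0, 0, 0};
    s.first = samples[0] - overhead;
    for (size_t i = 1; i < n; i++)
        s.warm_total += samples[i] - overhead;
    if (n > 1)
        s.warm_mean = s.warm_total / (long long)(n - 1);
    return s;
}

#ifdef USE_PAPI
#include <papi.h>

/* Store the PAPI_TOT_CYC count of each call to fn in samples[i].
   With fn == NULL nothing runs between start and stop, so the samples
   approximate the cost of the start/stop pair itself.
   Returns 0 on success, -1 on any PAPI error. */
int sample_cycles(void (*fn)(void), long long *samples, size_t n)
{
    int es = PAPI_NULL;
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return -1;
    if (PAPI_create_eventset(&es) != PAPI_OK) return -1;
    if (PAPI_add_event(es, PAPI_TOT_CYC) != PAPI_OK) return -1;
    for (size_t i = 0; i < n; i++) {
        if (PAPI_reset(es) != PAPI_OK) return -1;   /* defensive zeroing */
        if (PAPI_start(es) != PAPI_OK) return -1;
        if (fn) fn();
        if (PAPI_stop(es, &samples[i]) != PAPI_OK) return -1;
    }
    return 0;
}
#endif
```

The idea would be to call `sample_cycles(NULL, ...)` once to estimate the start/stop cost, then `sample_cycles(func, ...)` for the real run, and pass both to `summarize`. Whether subtracting a fixed overhead is accurate enough is exactly what question 3 is asking.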
thanks,
J.Joba