Total L2 misses for multithreaded programs on multiple cores

Postby sijusamuel » Wed Nov 18, 2009 11:41 pm

How can I obtain the total L2 cache misses for a program in which program A calls program B,
where A runs in a single thread and B runs on multiple threads, each on a different core of a multicore machine?

All the threads are joined in B, which then returns the result to A.

Are the following steps correct?

1. Create and start the event (PAPI_L2_DCM) in A before the call to B.
2. Read the event in A after the call to B.
Does it make any difference whether B is threaded (or running on different cores) or not?


Machine : Intel(R) Core(TM) i7 CPU using pcl counters


Thanks,
Siju Samuel

Re: Total L2 misses for multithreaded programs on multiple cores

Postby sijusamuel » Wed Dec 23, 2009 3:52 pm

For the above scenario, I wrote a sample program: main creates N threads, each of which runs the function 'threadprog' in parallel. Both main
and 'threadprog' do some floating-point computation and use PAPI to report the counts for each thread. Thread affinity is set
so that each thread runs on its own core.

Please reply if something is wrong. Thanks.


/* PROG NAME : multiThreadmultliCoreAffinity.c
* Main : Creates threads that run the function 'threadprog'. threadprog does
* some floating-point computation based on its input, a thread identifier.
* Each thread is given a thread affinity so that, where possible, it
* runs on its own core.
*
* Main: start tracking PAPI events
* do its own floating-point computation
* create the threads
* read the PAPI events
* End
*
* Change the value of NUM_CORE to match the target system
*/
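#define _GNU_SOURCE /* needed for pthread_attr_setaffinity_np */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h> /* calloc, free */
#include <string.h> /* strerror */
#include <unistd.h> /* sleep */
#include <sys/types.h>
#include <sys/wait.h>
#include <papi.h>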

#define INDEX 10
#define NUM_EVENTS 3
#define NUM_CORE 4


void *threadprog(void *inp) ;


char format_OUT[] = { " ______________inmain : flpins: %lld cycles: %lld L2 miss : %lld \n" };
char format_OUT2[] = { " ______________inthreadprod : IDENTIFIER %d flpins: %lld cycles: %lld L2 miss : %lld \n" };

int main(int argc, char **argv)
{
int i;
pid_t childpid;
int numOfThreads = NUM_CORE;
float x[INDEX];
float sum = 0.0;
int identifier[numOfThreads];

//------------for papi ------------
int retval;
int Events[3] = { PAPI_FP_INS, PAPI_TOT_CYC, PAPI_L2_DCM };
long_long values[3];
int EventSet=PAPI_NULL;
//------------for papi end------------


pthread_t tid;
// int error;
long unsigned int error, CPU;
pthread_t *ptid;
pthread_attr_t myptthreadAttr[numOfThreads];

/* Initialize the Vector arrays for floating point computation */
for (i = 0; i < INDEX ; i++) {
x[i] = i*1.0;
}
/* Create thread identifier 0.. numberOfThread */
for (i = 0; i< numOfThreads; i++)
{
identifier[i]=i;
}

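//------------for papi ------------
if ((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT)
printf("Error ___________________01\n");

/* Enable PAPI thread support before any thread is created */
if ((retval = PAPI_thread_init((unsigned long (*)(void)) pthread_self)) != PAPI_OK)
printf("Error ___________________01 PAPI_thread_init %d\n", retval);
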

if ((retval = PAPI_create_eventset(&EventSet)) != PAPI_OK)
printf("Error ___________________02\n");


if (PAPI_add_events(EventSet, Events, NUM_EVENTS) != PAPI_OK)
printf("not added all events_________03 \n");

if ((retval = PAPI_start(EventSet)) != PAPI_OK)
printf("Error ____________________04\n");


//some Floating point computation in MAIN
for (i = 0; i < INDEX; i++)
sum = sum + x[i]*2; // 2* INDEX flop

if ((retval = PAPI_read(EventSet,values)) != PAPI_OK)
printf("Error ___________________05\n");
printf(format_OUT, values[0], values[1], values[2] ) ;
//------------for papi ------------


// allocate space for the thread IDs
if( (ptid = (pthread_t *) calloc(numOfThreads,sizeof(pthread_t))) == NULL) {
perror("Failed to allocate space for thread IDs");
return 1;
}

// Create threads
for(i=0; i< numOfThreads; i++)
{
//
if(error = pthread_attr_init(myptthreadAttr+i)){ // initialize attributes
fprintf(stderr, "Failed to init thread: %s\n", strerror(error));
}
CPU = 1 << (i%8);

if(error = pthread_attr_setaffinity_np(
myptthreadAttr+i, sizeof (long unsigned int), &CPU)){
fprintf(stderr, "Failed to set affinity on thread: %s\n", strerror(error));
}

//passing &identifier[i] to function : threadprog
if(error = pthread_create((ptid+i), myptthreadAttr+i, //Second param -> affinity
threadprog, (identifier+i))){
fprintf(stderr, "Failed to create thread: %s\n", strerror(error));
// return 1;
}
}


// join all threads / wait until all threads have completed
printf("just checking how many times ----1\n");
sleep(10);
for (i =0; i < numOfThreads; i++)
{
if(error = pthread_join(ptid[i],NULL))
fprintf(stderr, "Failed to join thread %d: %s\n", i, strerror(error));
}

free(ptid);


//------------for papi ------------

if ((retval = PAPI_read(EventSet,values)) != PAPI_OK)
printf("Error ___________________05\n");
printf(format_OUT, values[0], values[1], values[2] ) ;

if ((retval = PAPI_stop(EventSet,values)) != PAPI_OK)
printf("Error ___________________10 PAPI_stop %d \n", retval);

// if ((retval = PAPI_destroy_eventset(&EventSet)) != PAPI_OK)
// printf("Error ___________________11 PAPI_destroy_eventset %d \n", retval);
//------------for papi ------------

printf("End _of _ Main \n");
}



/* This runs in its own thread (created from main).
* Based on the input (a dummy thread identifier that represents the thread),
* it does some floating-point computation and returns. It creates, reads and
* stops its own PAPI events.
*/

void *threadprog(void *inp) {
int INDX = 100, i;
int MAXINDX = 100000;
double locsum = 0.0;
float y[MAXINDX];
int identifier;

//------------for papi ------------
int retval;
int Events[3] = { PAPI_FP_INS, PAPI_TOT_CYC, PAPI_L2_DCM };
long_long values[3];
int EventSet=PAPI_NULL;
//------------for papi end ------------


/* Get thread identifier */

identifier = *((int *)inp);
printf("ThreadIdentfier-------------------------------------------------%d\n", identifier);
/* Initialize the Matrix arrays */
for (i = 0; i < MAXINDX ; i++)
y[i] = i*1.0;
//------------for papi ------------
/* PAPI is already initialized in main; this thread only registers itself */
if ((retval = PAPI_register_thread()) != PAPI_OK)
printf("Error ___________________01 PAPI_register_thread %d\n", retval);


if ((retval = PAPI_create_eventset(&EventSet)) != PAPI_OK)
printf("Error ___________________02\n");


if (PAPI_add_events(EventSet, Events, NUM_EVENTS) != PAPI_OK)
printf("not added all events_________03 \n");

if ((retval = PAPI_start(EventSet)) != PAPI_OK)
printf("Error ____________________04\n");
//------------for papi end ------------


//perform different number of float operation for each thread (used identifier)
for (i = 0; i < INDX*(identifier+1) ; i++)
locsum = locsum + y[i]+identifier;

//------------for papi ------------
if ((retval = PAPI_read(EventSet,values)) != PAPI_OK)
printf("Error ___________________05\n");

printf(format_OUT2, identifier, values[0], values[1], values[2] ) ;

if ((retval = PAPI_stop(EventSet,values)) != PAPI_OK)
printf("Error ___________________10 PAPI_stop %d \n", retval);

// if ((retval = PAPI_destroy_eventset(&EventSet)) != PAPI_OK)
// printf("Error ___________________11 PAPI_destroy_eventset %d \n", retval);
//------------for papi ------------
printf("sum %e\n ", locsum); // printing to avoid any optimization of sum computation
return NULL;
}


==============================================================
Makefile
==============================================================
PAPI_DIR=/usr/local/papi-3.7.1

CFLAGS=-I${PAPI_DIR}/include -pthread
LDLIBS=${PAPI_DIR}/lib/libpapi.a



RCWdir = ./
CC = gcc
CCFLAGS = -g
IFLAGS = -I$(RCWdir) -I${PAPI_DIR}/include

threadex : threadex.o threadsubpg.o
$(CC) $(CCFLAGS) -o $@ threadex.o threadsubpg.o -lpthread $(LDLIBS)
threadex.o : $(RCWdir)/threadex.h threadex.c
$(CC) $(CCFLAGS) $(IFLAGS) -c threadex.c
threadsubpg.o : $(RCWdir)/threadex.h threadsubpg.c
$(CC) $(CCFLAGS) $(IFLAGS) -c threadsubpg.c
clean:
rm -f makechild.o makelib.o makechildparallel.o makethread.o makethreadpgreturn.o

multiThreadmultliCoreAffinity :multiThreadmultliCoreAffinity.c
==============================================================
Output:
=========================================================
ssamuel@etl-corei74:~/PAPIEX3/papi-3.7.1/src/siju$ ./multiThreadmultliCoreAffinity
______________inmain : flpins: 20 cycles: 2579 L2 miss : 2
ThreadIdentfier-------------------------------------------------0
ThreadIdentfier-------------------------------------------------1
just checking how many times ----1
ThreadIdentfier-------------------------------------------------3
ThreadIdentfier-------------------------------------------------2
______________inthreadprod : IDENTIFIER 0 flpins: 300 cycles: 2304 L2 miss : 4
sum 4.950000e+03
______________inthreadprod : IDENTIFIER 1 flpins: 600 cycles: 3378 L2 miss : 11
sum 2.010000e+04
______________inthreadprod : IDENTIFIER 2 flpins: 900 cycles: 4545 L2 miss : 9
______________inthreadprod : IDENTIFIER 3 flpins: 1200 cycles: 5906 L2 miss : 41
sum 4.545000e+04
sum 8.100000e+04
______________inmain : flpins: 20 cycles: 80797 L2 miss : 814
End _of _ Main
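
The run above prints one count per thread, but not the total that the original question asks for. Main's flpins stays at 20 while the threads report 300 to 1200, which suggests that each event set only sees the thread that started it. Assuming that holds, a minimal sketch of the missing step is to let every thread store its L2 count in its own slot and add the slots to main's count after the joins. The helper name total_l2_misses is hypothetical and not part of PAPI.

```c
#include <stddef.h>

/* Hypothetical helper (not a PAPI call): add the L2 miss count read in main
 * to the counts that each worker thread stored in its own array slot.
 * Call it only after every pthread_join() has returned, so that all slots
 * hold their final values and no locking is needed. */
long long total_l2_misses(long long main_misses,
                          const long long *thread_misses,
                          size_t num_threads)
{
    long long total = main_misses;
    size_t i;

    for (i = 0; i < num_threads; i++)
        total += thread_misses[i];
    return total;
}
```

With the numbers printed above (814 in main and 4, 11, 9, 41 in the threads) this gives 879. Main's 814 also covers thread creation and the sleep(10), so it is an upper bound for main's own work rather than an exact figure.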






Thanks,
sijusamuel