1.0 INTRODUCTION

PB is a set of programs that instrument a code and then create timing
and performance files at run time. Additional scripts analyze the
resulting data. PB is useful both for analyzing code and for predicting
performance on new machines. For code analysis, PB reports the time and
flops required by each subroutine.

Tools are provided to ease the instrumentation of Fortran 77 and C
programs. These PB scripts modify the user code to call timing
routines, which in turn call PAPI. PAPI is a library developed at the
University of Tennessee, Knoxville (UTK). Its purpose is to provide a
cross-platform analysis tool that reports results in a consistent
format. PAPI returns values for numerous events, including
floating-point operations performed and data transfer bandwidths.

See the INSTALL file for instructions on installing PB. If PAPI is
installed, PB can count hardware events such as floating-point
operations (flops). If PAPI is not installed, PB can still measure the
time spent in routines. If flop counts are available for the exact same
run on another platform, the two sets of performance files can be
combined to make performance predictions: the predicted flop rates are
obtained by combining the flop counts from one machine with the times
from the other.

2.0 INSTRUMENTING THE CODE

Instrumenting the user code is straightforward and usually requires
only a small amount of editing time. The process is mostly automated;
only a verification of the result is required. For Fortran 77 and C
programs, respectively, the utilities "addpb" and "cpb" place calls to
the PB routines inside subroutines. A few details require editing the
main program by hand, however.

2.1 Instrumenting Fortran 77 programs

The main tool for instrumenting Fortran programs is the Perl script
"addpb".
This script currently requires each subroutine to have a return
statement and does not support some Fortran 90 syntax, so be warned.

To use addpb, change to each source directory and enter the command
"addpb". This backs up each source file into a *.bak file, places a
call PBSTART('routinename') before the first executable statement of
each subroutine, and places a call PBEND('routinename') before each
return statement. You can use addpb to instrument the whole code, but
if some subroutines are very short and called frequently, it may be
best not to instrument them.

Next, you need to add a few calls manually. When the parallel code
runs, each processor writes a separate *.pb file. Here is an example
(instrumenting the LESLIE3D combustion code). Immediately after the
program gets the processor's MPI rank with a call to "mpi_comm_rank",
create a string containing a unique file stem name and pass it to
PBINIT, which initializes the PB library. The second argument to PBINIT
is an event type as defined by PAPI. In this case we measure the number
of actual floating-point instructions (usually of chief importance),
but you may be interested in other PAPI "events".

      call mpi_comm_rank(MPI_COMM_WORLD, iam, ierr)
      pbfile(1:9) = 'leslie000'
      write(pbfile(7:9),'(I3.3)') iam
      call PBINIT(pbfile,'PAPI_FP_INS')
      call PBSTART('leslie3d')
      ! rest of main program
      call PBEND('leslie3d')
      call PBREPORT()
      end

You must add the call to PBREPORT to actually write the PB information
to the *.pb files (you will notice that the files contain zero bytes
until PBREPORT is called at the end of your run).

To instrument an important section of code within a subroutine, bracket
the block of code as follows:

      call PBSTART('routinename:chunk_name')
      { block of code to instrument }
      call PBEND('routinename:chunk_name')
To instrument a message-passing section of code to measure the Mbytes
rate:

      integer pbytes
      call PBSTART('mpi_wrapper:section')
      { Communication calls }
      pbytes = { calculate actual bytes transferred }
      call PBCOMM(pbytes)
      call PBEND('mpi_wrapper:section')

To instrument the Mbytes rate of an I/O transfer:

      integer pbytes
      call PBSTART('io_wrapper:section')
      { I/O calls }
      pbytes = { calculate actual bytes transferred }
      call PBIO(pbytes)
      call PBEND('io_wrapper:section')

2.2 Instrumenting C programs

The main tool for instrumenting C programs is the C program "cpb",
which currently instruments only one file at a time.

      cpb file.c > new-file.c

The main program needs to be modified by hand:

      #include "pbUser.h"

      int main(int argc, char *argv[])
      {
        static void *pbptr = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &iam);
      #ifdef USE_PB
        /* one file stem per MPI rank, e.g. {program name}000 */
        sprintf(pbfile,"{program name}%03d",iam);
        PBinit(pbfile);
        PBstart("{program name}",&pbptr);
      #endif
        /* ... rest of main ... */
        ENDPB(charge);
        REPORTPB();
      }

The cpb utility places macros named STARTPB and ENDPB. However, the
STARTPB macro can only be used as the first statement in a routine,
which is not possible in the main program. So, wrap the direct calls to
the PB library in an ifdef on USE_PB. If USE_PB is defined, the code
uses the PB library. Otherwise the calls are not compiled in.

As in Fortran programs, we can measure communication with:

      int pbytes = 0;
      {
        STARTPB(ReadInputVariables);
        pbytes = { calculate number of bytes to message-pass }
        COMMBytesPB(pbytes);
        ENDPB(ReadInputVariables);
      }

The same mechanism can be used to measure I/O transfers:

      int pbytes = 0;
      {
        STARTPB(ReadInputVariables);
        pbytes = { calculate number of bytes transferred }
        IOBytesPB(pbytes);
        ENDPB(ReadInputVariables);
      }

3.0 COMPILING, LINKING AND RUNNING

The next step is to compile and link the program, and then run it on
the target platform. This will create several files with the extension
".pb".
You will have one *.pb file for each processor the program ran on.

3.1 Viewing PerfBench Analysis Reports using "perf"

The Perl script "perf" creates parallel profile reports. It is the main
tool most users of PerfBench will use. Here is a brief description of
its switches:

      perf [-ded] [-s type] [-p type] [-v view] [-tol percent]
           [-flops pb_dir] file.pb ...

      -s type        sort by type, which must be one of the following:
                     wall, calls, cpu, flops, rate, comm, io, name
                     (default: wall)
      -p type        set the type of percentage (default: wall)
      -v view        set the view type: avg, min, or max (default: avg)
      -ded           assume a dedicated machine and use wall time
                     instead of cpu time for the flops rate
      -tol percent   ignore routines/groups that take less than
                     percent of the cpu time
      -flops pb_dir  take flop counts from the pb files in directory
                     pb_dir

3.2 Example output from "perf"

The example output from "perf" below shows three tables. The first
shows the time spent in each routine alone; time spent in instrumented
child routines is not included. The second table gives an example of
how communication rates are reported. The third table shows the time
spent in each routine including its child routines.

Exclusive Routine Performance: No instrumented child routines included
Sorted by wall, Shown in avg mode:

  np  calls      cpu        wall  flops  flops_rate   % wall  name
   1      2  0.26999     0.26426  4e+07      148.15   49.960  work:block1
   1      2  0.25999     0.26407  4e+07      153.85   49.924  work:block2
   1      1        0  0.00057404    327           0    0.109  test1
   1      2        0  3.3109e-05     94           0    0.006  work
   1      1        0  4.2163e-06     50           0    0.001  comm
   1      2        0           0     38           0    0.000  comm:mpi

Comm Performance
Sorted by wall, Shown in avg mode:

  np  calls  cpu  wall  comm  comm_rate  % wall  name
   1      2    0     0  2000          0   0.000  comm:mpi

Inclusive Routine Performance: all child routines included
Sorted by wall_Ttl, Shown in avg mode:

  np  calls  cpu_Ttl    wall_Ttl   flops_Ttl  flops_Ttl_rate   % wall  name
   1      1     0.53     0.52897  8.0001e+07          150.95  100.000  test1
   1      2  0.52999     0.52838       8e+07          150.95   99.894  work
   1      2  0.26999     0.26426       4e+07          148.15   49.960  work:block1
   1      2  0.25999     0.26407       4e+07          153.85   49.924  work:block2
   1      1        0  9.2231e-06          88               0    0.002  comm
   1      2        0           0          38               0    0.000  comm:mpi

The instrumented program is:

      program test1
      real sum
      call PBINIT('test','PAPI_FP_INS')
      call PBSTART('test1')
      sum = 0.0
      call work(sum)
      call work(sum)
      call comm()
      print *,'sum: ',sum
      call PBEND('test1')
      call PBREPORT()
      end

      subroutine work(sum)
      real a(1000)
      real b(1000)
      real sum,sum1,sum2
      integer i,j
      call PBSTART('work')
      do i = 1,1000
         a(i) = 1.0
         b(i) = 2.0
      enddo
      call PBSTART('work:block1')
      sum1 = 0.0
      do j = 1,10000
         do i = 1,1000
            sum1 = sum1 + a(i)*b(i)
         enddo
      enddo
      sum = sum + sum1
      call PBEND('work:block1')
      call PBSTART('work:block2')
      sum2 = 0.0
      do j = 1,10000
         do i = 1,1000
            sum2 = sum2 + a(i)*b(i)
         enddo
      enddo
      sum = sum + sum2
      call PBEND('work:block2')
      call PBEND('work')
      return
      end

      subroutine comm()
      ! This is a fake comm program
      call PBSTART('comm')
      call PBSTART('comm:mpi')
      call PBCOMM(1000)   ! sending 1000 bytes;
      ! call MPI_send()
      call PBEND('comm:mpi')
      call PBSTART('comm:mpi')
      call PBCOMM(1000)   ! receiving 1000 bytes;
      ! call MPI_recv
      call PBEND('comm:mpi')
      call PBEND('comm')
      return
      end