SCALASCA is a performance analysis tool being developed with the goal of making the optimization of parallel applications on large-scale systems both more effective and more efficient. It can automatically detect inefficient program behavior and highlight opportunities for performance improvement. SCALASCA builds on the idea of searching event traces of parallel applications for execution patterns indicating inefficient behavior. During the search process, SCALASCA classifies detected pattern instances by category and quantifies their significance for every program phase and system resource involved. The results are made available to the user in a flexible graphical user interface, where they can be investigated on varying levels of granularity. A distinctive feature of SCALASCA in comparison to its predecessor KOJAK is that it achieves a higher degree of scalability by analyzing the trace data in parallel. Currently, SCALASCA is restricted to MPI-1 applications, but support for further programming models will be added in the future.
The overall analysis workflow of SCALASCA is depicted in Figure 1. To generate trace files, it is necessary to instrument the application before running it, that is, to insert extra code that performs the event measurements. This extra code consists of calls to the measurement library that is linked to the application and that includes the logic needed to generate the trace files. During execution of the instrumented executable on the parallel machine, each application process generates one trace file containing the process-local events. After the application has terminated, the local trace files are analyzed in parallel. The analyzer, which is an MPI application in its own right, is executed on as many CPUs as the target application. This allows the user to run it after the target application within a single batch job, which avoids additional waiting time in the batch queue. The analysis produces a single result file, which can be viewed using the graphical browser shown in Figure 3.
Figure 1: SCALASCA analysis workflow.
In the following, we highlight the different aspects of SCALASCA in more detail:
Whereas MPI functions are always instrumented automatically using interposition wrappers based on the PMPI profiling interface, user code can be instrumented in several ways:
- Manually by inserting directives that are preprocessed by OPARI
- Automatically by letting the compiler do the job (supported only by some compilers)
- Automatically using TAU
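The interposition idea behind all of these options can be illustrated with a small sketch. With PMPI, a wrapper library defines an MPI function under its usual name, records events, and forwards the call to the corresponding `PMPI_`-prefixed entry point. The Python analogue below is not SCALASCA code: the `instrument` decorator, the in-memory `trace` list, and the `solve` and `main` functions are hypothetical names chosen for illustration.

```python
import functools
import time

# Hypothetical in-memory event buffer; a real measurement library would
# write to a per-process trace file instead.
trace = []

def instrument(func):
    """Wrap func so that ENTER and EXIT events are recorded around each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace.append(("ENTER", func.__name__, time.perf_counter()))
        try:
            # Forward to the original function, as a PMPI wrapper forwards
            # to the PMPI_-prefixed entry point.
            return func(*args, **kwargs)
        finally:
            # Record EXIT even if the wrapped function raises.
            trace.append(("EXIT", func.__name__, time.perf_counter()))
    return wrapper

@instrument
def solve(n):
    """Hypothetical user function standing in for application code."""
    return sum(i * i for i in range(n))

@instrument
def main():
    return solve(10)

result = main()
```

After the call, `trace` holds the properly nested sequence ENTER main, ENTER solve, EXIT solve, EXIT main, each with a timestamp. The application code itself is unchanged apart from the wrapping.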
A schematic view of SCALASCA's measurement system, which is divided into three layers, is shown in Figure 2. At the top, event adapters for user-specified annotations, compiler-generated function instrumentation, MPI library instrumentation, OpenMP instrumentation, and instrumentation for partitioned global address space languages accept events from the instrumented application and transfer them to the common runtime management layer through a uniform event interface. The common runtime management layer handles measurement acquisition for processes and threads, attributes the measurements to events, and determines which output back-end they should be directed to, depending on the runtime measurement configuration. The user can choose between a runtime summary or profile (not yet available) and two different trace formats (EPILOG and OTF). Platform-specific timers and metric acquisition, along with runtime configuration and experiment archive management, are provided as common utility modules to all three layers.
Figure 2: SCALASCA measurement system.
On-the-fly compression and decompression reduce trace file size, with the additional benefit of shorter file reading and writing times (despite the extra processing overhead).
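As a rough illustration of on-the-fly compression, the following sketch compresses event records while they are written and decompresses them while they are read back. It uses Python's standard `gzip` module. The record layout in `RECORD` and the helper names `write_trace` and `read_trace` are assumptions made for this example and do not reflect the EPILOG or OTF formats.

```python
import gzip
import io
import struct

# Hypothetical fixed-size record: event type (1 byte), region id (4 bytes),
# timestamp (8-byte float). This is not the EPILOG or OTF layout.
RECORD = struct.Struct("<BId")

def write_trace(events):
    """Compress event records as they are written; return the compressed bytes."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as out:
        for event in events:
            out.write(RECORD.pack(*event))
    return buffer.getvalue()

def read_trace(data):
    """Decompress the stream and yield one record tuple at a time."""
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as inp:
        while True:
            chunk = inp.read(RECORD.size)
            if not chunk:
                break
            yield RECORD.unpack(chunk)

# Alternating enter/exit events for a single hypothetical region with id 7.
events = [(i % 2, 7, i * 0.5) for i in range(1000)]
compressed = write_trace(events)
restored = list(read_trace(compressed))
```

Because event streams tend to be highly regular, the compressed buffer here is considerably smaller than the raw records, so less data has to pass through the file system.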
The analyzer automatically transforms the traces, which must be in the EPILOG format, into a compact call-path profile that includes the execution time penalties caused by the different patterns. However, instead of sequentially analyzing a single and potentially large global trace file, SCALASCA analyzes multiple local trace files in parallel based on the same parallel programming paradigm as the one used by the application under investigation. The parallel analyzer uses a distributed memory approach, where each process reads only the local trace data that were recorded for the corresponding process of the target application. This addresses scalability specifically with respect to larger numbers of processes.
The actual analysis is accomplished by performing a parallel replay of the application's communication behavior. The central idea behind this approach is to analyze a communication operation using an operation of the same type. For example, to analyze a point-to-point message, the event data necessary to analyze this communication is also exchanged in point-to-point mode between the corresponding analysis processes. To do this, the new analysis traverses local traces in parallel and meets at the synchronization points of the target application by re-enacting the original communication. The current version of SCALASCA supports all but one rarely significant MPI-1 pattern offered by KOJAK. An illustrated list of patterns with detailed descriptions can be found here.
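The replay idea can be sketched for a single pattern. In the simplified model below, each analysis thread traverses one local trace. At a send event it forwards the send's enter timestamp to the peer, and at a receive event it obtains that timestamp and charges the receiver with late-sender waiting time whenever the receive was entered before the send. Threads and queues stand in for MPI processes and point-to-point messages. The trace layout and all names are hypothetical, and the waiting-time formula is a simplified reading of the late-sender pattern, not SCALASCA's implementation.

```python
import queue
import threading

# Hypothetical local traces, one per process: (event, peer rank, enter time).
traces = {
    0: [("SEND", 1, 5.0), ("RECV", 1, 6.0)],
    1: [("RECV", 0, 2.0), ("SEND", 0, 9.0)],
}

# One channel per (source, destination) pair stands in for MPI
# point-to-point messaging between analysis processes.
channels = {(s, d): queue.Queue() for s in traces for d in traces if s != d}

# Accumulated late-sender waiting time per receiving process.
late_sender = {rank: 0.0 for rank in traces}

def replay(rank):
    """Traverse the local trace and re-enact each communication event."""
    for event, peer, enter in traces[rank]:
        if event == "SEND":
            # Re-enact the send: ship the local timestamp to the receiver.
            channels[(rank, peer)].put(enter)
        elif event == "RECV":
            # Re-enact the receive: obtain the sender's timestamp.
            send_enter = channels[(peer, rank)].get()
            # The receiver waited if it entered the receive before the send began.
            late_sender[rank] += max(0.0, send_enter - enter)

threads = [threading.Thread(target=replay, args=(rank,)) for rank in traces]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each analysis thread touches only its own trace and its own entry in `late_sender`. The only data exchanged are the timestamps that travel along the same paths as the original messages.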
The parallel analyzer itself is implemented on top of an abstraction layer that offers basic functionality to access event-trace data more easily. Because the amount of memory available to an application on a parallel computer typically scales with the number of processors used, the entire event trace is held in main memory, thus yielding performance-transparent access to individual events. In addition, the interface provides a global view of static program entities referenced by the events, such as code regions or communicators, and of the call tree.
The call-path profile with the analysis results can be viewed using a graphical browser, which is shown in Figure 3. The results are displayed along the following dimensions: (i) performance problem, (ii) call path, and (iii) system resource. Each dimension is represented as a tree browser that can be collapsed or expanded to achieve the desired level of granularity or specialization. The tree browsers are coupled such that the penalty caused by a particular performance problem can be broken down by call path and process or thread. The performance penalty caused by a pattern is shown both as a number and as a colored icon, which makes it easier to identify hotspots. A topological display maps the performance behavior onto virtual or physical process topologies. For example, Figure 3 shows the distribution of late-sender waiting times across the two-dimensional virtual process topology of SWEEP3D.
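The coupling between the three dimensions can be pictured as a mapping from (performance problem, call path, process) triples to severity values, which the browser aggregates or restricts depending on what is selected. The sketch below is a minimal model of that idea. The severity values, the call-path strings, and the `breakdown` helper are hypothetical and are not taken from a real SCALASCA result file.

```python
from collections import defaultdict

# Hypothetical severities in seconds, keyed by (pattern, call path, process rank).
severity = {
    ("Late Sender", "main/sweep/MPI_Recv", 0): 0.5,
    ("Late Sender", "main/sweep/MPI_Recv", 1): 1.25,
    ("Late Sender", "main/exchange/MPI_Recv", 1): 0.75,
    ("Wait at Barrier", "main/MPI_Barrier", 0): 0.25,
}

AXES = ("pattern", "callpath", "process")

def breakdown(cube, axis, **selected):
    """Sum severities along one axis, restricted to selected values on the others."""
    totals = defaultdict(float)
    for key, value in cube.items():
        entry = dict(zip(AXES, key))
        if all(entry[name] == wanted for name, wanted in selected.items()):
            totals[entry[axis]] += value
    return dict(totals)

# Total severity per performance problem (all trees collapsed).
by_pattern = breakdown(severity, "pattern")

# Selecting a pattern breaks its penalty down by call path, then by process.
by_callpath = breakdown(severity, "callpath", pattern="Late Sender")
by_process = breakdown(severity, "process", pattern="Late Sender")
```

Selecting a node in one tree corresponds to passing a keyword filter, and collapsing a tree corresponds to summing over that axis.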
Figure 3: SCALASCA result presentation.