Counting Floating Point Operations on Intel Sandy Bridge and Ivy Bridge
Intel's new Sandy Bridge and Ivy Bridge CPU architectures provide a rich computing environment and a comprehensive performance monitoring unit with which to measure performance. These processors support 11 hardware performance counters per core: 3 fixed counters for core cycles, reference cycles, and core instructions executed, plus 8 programmable counters with minimal restrictions. That's the good news. The bad news shows up when you use these counters in real situations. Most environments run with hyperthreading enabled, which allows each core to run two simultaneous interleaved threads in the hope of keeping the functional units filled to higher capacity. Those 8 programmable counters suddenly turn into 4, since each thread must maintain its own hardware counters. Further, most environments also run with a non-maskable interrupt (NMI) timer active. This can be implemented in a variety of ways, but cannot be guaranteed NOT to use one of the four remaining counters. That leaves 3 per thread. This means that PAPI is only guaranteed 3 programmable counters at any given time, in addition to the 3 fixed counters mentioned earlier. The corollary is that any single PAPI derived event can consist of at most 3 programmable terms if it is to be counted reliably. This is enough for most, but not all, situations.
Floating Point Flavors
Sandy Bridge and Ivy Bridge introduce a new set of more powerful AVX assembly instructions. These are vector floating point instructions that operate on up to 256 bits of information at a time: 4 simultaneous double precision operations, or 8 parallel single precision operations. You can't guarantee that all 256 bits are always in use, so counting floating point operations can be a bit tricky. Because of this and the need for backwards compatibility, these chips continue to support earlier floating point instructions and hardware as well, including 128-bit SSE instructions, MMX instructions, and even the venerable x87 instructions, each in both single and double precision versions. That makes 8 different flavors of floating point, and raises the potential need for as many as 8 events to count them all.
Sandy Bridge Floating Point Events
For the last several generations, one of the performance events provided by Intel to count floating point instructions has been called FP_COMP_OPS_EXE. This event name is generally associated with one or more umasks, or attributes, to further define what kinds of floating point instructions are being counted. For Sandy Bridge, the available attributes include the following:
X87
SSE_FP_PACKED_DOUBLE
SSE_FP_SCALAR_SINGLE
SSE_PACKED_SINGLE
SSE_SCALAR_DOUBLE
Although in theory it should be possible to combine all five of these attributes in a single event to count all variations of x87 and SSE floating point instructions, in practice these attributes are found to interact with each other in non-linear ways and must be empirically tested before they can be combined in a single counter. Further, the PACKED versions of these instructions represent more than one floating point operation each, and so can't simply be added to produce a meaningful result.
Intel engineers have verified that variations of this event count speculatively, leading to variable amounts of overcounting, depending on the algorithm. Further, as is discussed later in this article, speculative retries during resource stalls are also counted. Knowing this, it may be possible to use the excess counts as a way to monitor resource inefficiency.
To make matters more confusing, it appears that combining multiple attributes in a single counter produces a result that resembles total cycles more than combined floating point operations.
Sandy Bridge and AVX
You may have noticed that the event attributes shown above don't reference AVX instructions. That requires a separate event in another counter. The name of this event is SIMD_FP_256, and it supports two attributes: PACKED_SINGLE and PACKED_DOUBLE. As in the case of FP_COMP_OPS_EXE, these two attributes cannot be combined in practice without silently producing anomalous results.
Counter to the situation with FP_COMP_OPS_EXE, SIMD_FP_256 counts instructions retired rather than speculative instructions executed. That's a good thing, but overcounts are still observed, because this event also counts AVX operations that are not floating point, such as register loads and stores, and various logical operations. Since such data movement operations will generally be proportional to actual work, for a given algorithm, these counts, while theoretically inaccurate, should still prove to be useful as a measure of relative code performance.
The above discussion also does not mention MMX. There are no events available on Sandy Bridge that reference MMX. One can assume that MMX operations are being processed through SSE instructions and are counted as such.
Ivy Bridge Floating Point Events
In short, there are none. Neither FP_COMP_OPS_EXE nor SIMD_FP_256 is available on Ivy Bridge. Don't blame us; complain to Intel. Rumour has it that these events may still exist but are not exposed through the documentation. This has not been confirmed and is left as an exercise for the reader.
Counting Floating Point Events on Sandy Bridge
In order to develop a feel for counting floating point events on the Sandy Bridge architecture, we present a series of tables below that collect a number of different events from several different computational kernels, including a multiply-add, a simple matrix multiply, and optimized GEMMs for both single and double precision. We also show results from several events with multiple attributes. Results with an error of < 5% are shown in green; errors < 15% are in orange; errors > 15% are red. Results that look suspiciously similar to PAPI_TOT_CYC are shown in blue.
Counting Basic Arithmetic
The table above illustrates unoptimized arithmetic operations. There is apparently no use of packed SSE instructions, and no evidence of x87 or AVX instructions. All the operations counted here are scalar. The double precision counts are within 15% of the theoretically expected value, while one single precision count deviates by almost 35% and the other is high by about 3.5%. All attempts at combining more than one unit mask, or attribute, resulted in counts that look surprisingly similar to cycle counts. This was also true for unreported attribute combinations, suggesting that attribute bits cannot be combined.
Counting Optimized GEMMs on Sandy Bridge
This table shows a pattern similar to the one in the table above. Packed single and double precision counts show up in the right places and quantities for both the SSE optimized and AVX optimized GEMMs. There are a small number of scalar and packed SSE operations that show up in the SGEMM case, possibly a result of incomplete AVX packing. There are also a very small number of x87 instructions that are counted in each case. Since these are negligible, they are ignored. As in the previous table, events with multiple attributes produce counts that are surprisingly similar to the equivalent cycle count.
PAPI Preset Definitions for Sandy Bridge
From the observations in the previous two tables, it becomes clear that no single definition can encompass all variations of floating point operations on Sandy Bridge. The table below defines PAPI Preset event definitions that encompass a range of cases with reasonable predictability while remaining within the constraint of using three counters or fewer. PAPI_FP_INS and _OPS are defined identically to include scalar operations only. This is a significant deviation from traditional definitions of these events, because all packed instructions are ignored. PAPI_SP_OPS and _DP_OPS count single and double precision events respectively. They each consist of three terms including scalar SSE, packed SSE, and packed AVX, with terms appropriately scaled to represent operations rather than instructions. PAPI_VEC_SP and _DP count vector instructions in single and double precision using appropriately scaled SSE and AVX instructions.
The table below shows that in all cases where values are reported, the numbers deviate positively from theory by varying magnitudes. The majority of counts are high by < 15%, which could be attributable to speculative execution. The deviations between measured FP_INS and FP_OPS offer an indication of run-to-run variability, ranging from 0.2% to 8 or 9%. Highly optimized operations, such as the GEMMs, actually show the best accuracy for both SSE and AVX versions, with deviation from theoretical on the order of 1 to 2%.
AVX and Cache
John McCalpin at TACC has observed that in general Intel performance counters increment at instruction issue unless the event name specifies "retired". This can lead to overcounting if an instruction is reissued, for example, while waiting for a cache miss to be satisfied. One way to test this hypothesis would be to explicitly write code to load the SSE or AVX registers before performing arithmetic operations. If the hypothesis is correct, the overcounting should be significantly reduced. Specifically in the case of AVX floating point instructions, it appears that overcounts can be explained by this instruction re-issue phenomenon. John has done some tests with the STREAM benchmark suggesting a strong correlation between overcounting and average cache latency. This also suggests an explanation for the relatively small error in AVX DGEMM and SGEMM results, since these algorithms have been optimized to minimize cache misses, and thus retries.
Sandy Bridge and Ivy Bridge are powerful new processors in the Intel lineage. Both offer a wealth of opportunities for performance measurement. However, measuring the traditional standby floating point metric must be done with care. You can't do it at all on Ivy Bridge. On Sandy Bridge, be forewarned that although accurate measurements can be made, particularly for highly optimized code, no single PAPI metric is likely to capture all floating point operations. Remember the error bars. Some measurements will be less accurate than others, and the errors will almost always be positive (overcounting) due to speculative execution. Since speculation is likely to be proportional to the amount of floating point work done, even these inaccurate measurements should provide insight when used within the same codes.
If these numbers inspire or challenge you to make more detailed observations with this hardware, please share your conclusions with us. We'd be happy to add further insight into the above report.