The BEAST Challenge

Accelerators are designed for very high computing power and very high energy efficiency. Usually, this is accomplished by devoting much more area to floating point logic, and much less area to instruction control logic. Typically, accelerators require some form of instruction-level vectorization, and some form of thread-level parallelization. Most of the time, accelerators utilize static pipelines with no out-of-order execution (OOE) capabilities and no branch prediction.

Usually, this forces a style of programming where unrolling is taken to the extreme to expose instruction-level parallelism to the fullest. Frequently, corner cases are ignored, loop boundaries are fixed, and entire loop nests are eliminated by complete unrolling. It is often much more efficient to compute redundant results and discard them than to take a branch.
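The idea of fixed loop boundaries and discarded redundant work can be illustrated with a small sketch. It is not BEAST code: the function name `axpy_padded` and the tile width `TILE = 4` are assumptions chosen for illustration, and Python only shows the structure here, since on a real accelerator each unrolled statement would map to a vector lane or a thread.

```python
TILE = 4  # hypothetical fixed tile width; the loop body is unrolled to match it


def axpy_padded(a, x, y):
    """Compute a*x[i] + y[i] with a fixed-size, fully unrolled loop body.

    The inputs are padded to a multiple of TILE so the body never tests
    whether an index is in range. Padding lanes do redundant work whose
    results are thrown away at the end.
    """
    n = len(x)
    padded = -(-n // TILE) * TILE  # round n up to a multiple of TILE
    xp = list(x) + [0.0] * (padded - n)
    yp = list(y) + [0.0] * (padded - n)
    out = [0.0] * padded

    for i in range(0, padded, TILE):
        # Unrolled by hand: four independent statements, no boundary branch.
        out[i] = a * xp[i] + yp[i]
        out[i + 1] = a * xp[i + 1] + yp[i + 1]
        out[i + 2] = a * xp[i + 2] + yp[i + 2]
        out[i + 3] = a * xp[i + 3] + yp[i + 3]

    return out[:n]  # discard the redundant padding results
```

For an input of five elements, the second tile computes three values that are never used; the cost of that extra work is accepted in exchange for a branch-free body.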

Tuning kernels for accelerators is a challenge on its own, as the programmer faces a typical conundrum of multi-variable optimization. Not only is the number of parameters large, and therefore so is the resulting search space, but the parameters are also usually linked by counterintuitive relationships, i.e., a seemingly beneficial setting for one prevents a reasonable setting for another.

The BEAST Advantage

BEAST thrives in a large parameter space, not limited by any artificial constraints that could potentially hinder the search for the optimal solution. A large space is dealt with efficiently by powerful pruning, using a set of derived metrics, stemming from the harsh realities of the accelerator architecture. Inferior and invalid kernels are easily weeded out and discarded.
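As a rough sketch of what pruning by derived metrics might look like, the example below enumerates a small parameter space and discards configurations that violate hardware limits. Everything in it is an assumption made for illustration: the parameter names (`dim_x`, `dim_y`, `tile`), the candidate values, the shared-memory formula, and the limits, which are modeled on values typical of CUDA GPUs rather than taken from BEAST.

```python
from itertools import product

# Assumed hardware limits, typical of many CUDA GPUs (illustrative only).
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32
MAX_SHARED_BYTES = 48 * 1024


def candidates():
    """Enumerate the raw, unconstrained parameter space."""
    dims = [1, 2, 4, 8, 16, 32, 64]
    tiles = [8, 16, 32, 64, 128]
    for dim_x, dim_y, tile in product(dims, dims, tiles):
        yield {"dim_x": dim_x, "dim_y": dim_y, "tile": tile}


def derived_metrics(cfg):
    """Compute metrics that follow from the parameters, not the parameters themselves."""
    threads = cfg["dim_x"] * cfg["dim_y"]
    # Hypothetical kernel: two square tiles of 8-byte values in shared memory.
    shared_bytes = 2 * cfg["tile"] * cfg["tile"] * 8
    return threads, shared_bytes


def is_valid(cfg):
    """Reject configurations that cannot run or would run poorly."""
    threads, shared_bytes = derived_metrics(cfg)
    return (
        threads <= MAX_THREADS_PER_BLOCK
        and threads % WARP_SIZE == 0          # avoid partially filled warps
        and shared_bytes <= MAX_SHARED_BYTES
        and cfg["tile"] % cfg["dim_x"] == 0   # tile must divide evenly among threads
        and cfg["tile"] % cfg["dim_y"] == 0
    )


def prune():
    """Return only the configurations worth generating and benchmarking."""
    return [cfg for cfg in candidates() if is_valid(cfg)]
```

The constraints are cheap to evaluate, so invalid points are removed before any kernel is generated or timed; only the survivors of `prune()` would go on to benchmarking.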

BEAST provides a scalable platform for conducting large benchmarking sweeps through sizable sets of kernels, using potentially massive, multi-node, multi-accelerator systems. BEAST allows for exhaustive profiling, harvesting of a huge amount of performance data, and reduction of that information to a comprehensible set, through a collection of robust machine learning techniques.

Finally, BEAST descends to aggressive instruction-level optimizations, where it applies powerful, potentially correctness-violating optimizations, and subsequently passes the kernels through user-supplied code for validation. Unlike with classic compiler optimizations, here BEAST can exercise the freedom of discarding a multitude of irrelevant corner cases.
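The optimize-then-validate pattern can be sketched as follows. This is not BEAST's interface: `validate`, `reference_sum`, and `unrolled_sum` are hypothetical names, and the tolerances are arbitrary. The optimized variant deliberately ignores a corner case (lengths that are not a multiple of four), and the user-supplied check decides whether that is acceptable for the inputs that matter.

```python
import math


def reference_sum(xs):
    """User-supplied reference: slow but trusted."""
    return math.fsum(xs)


def unrolled_sum(xs):
    """Aggressive variant: assumes len(xs) is a multiple of 4.

    Any tail elements are silently ignored, which is incorrect in general
    but harmless if the caller only ever passes padded inputs.
    """
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, len(xs) - 3, 4):
        s0 += xs[i]
        s1 += xs[i + 1]
        s2 += xs[i + 2]
        s3 += xs[i + 3]
    return (s0 + s1) + (s2 + s3)


def validate(candidate, reference, test_inputs, rel_tol=1e-9, abs_tol=1e-12):
    """Accept the candidate only if it matches the reference on every test input."""
    return all(
        math.isclose(candidate(xs), reference(xs), rel_tol=rel_tol, abs_tol=abs_tol)
        for xs in test_inputs
    )
```

With test inputs whose lengths are multiples of four, `validate(unrolled_sum, reference_sum, inputs)` accepts the kernel; adding a five-element input makes it fail. The choice of test inputs is how the developer states which corner cases are relevant.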

The BEAST Difference

BEAST adopts the glass box principle, as opposed to the black box principle. BEAST's objective is to produce the fastest possible kernels through a transparent process relying strongly on a healthy feedback loop with the developer. This allows for incremental refinement of the code, resulting not only in the fastest code, but also in greater developer insight into the performance characteristics of the kernel being optimized.

Sep 22 2017