Wide Postfix Primitive Applications
In this post, we implement and discuss our own take on an idea presented in the Reduceron project’s Memo 38. There, Matthew Naylor suggests that (binary) primitive applications could be transformed to postfix position and evaluated with the help of a primitive stack. This allows wider, flatter applications via Reverse Polish notation. While this was suggested as an interesting idea, it was not publicly implemented or analysed in anger — that’s what we offer in this memo.
The 2012 Reduceron’s Primitive Evaluation
As per Section 5.2 of The Reduceron reconfigured and re-evaluated, the authors implement quite a neat approach to evaluating primitives.
The set of primitive operations is extended with a flipped version of each operation (where p* denotes the flipped version of p). During compilation, every binary primitive application is transformed into infix form. Given a primitive operation, p, and two integer-typed arguments, m and n, we apply one of the following rules:
$$ \begin{array}{r l} p\ m\ n\ &\rightarrow\ m\ p\ n \\ p\ m\ n\ &\rightarrow\ n\ p{*}\ m \end{array} $$
We are free to choose either rule, but placing an argument which we expect to already be fully evaluated in the final position gives a slight bonus.
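To make the translation concrete, here is a minimal Haskell sketch. The Exp type, flipPrim and toInfix below are illustrative stand-ins, not the compiler’s actual representation.

```haskell
-- Toy expression type for illustration only; not the real compiler AST.
data Exp
  = Num Int         -- integer literal
  | Prim String     -- primitive operation, e.g. "+"
  | Var String
  | Ap Exp [Exp]    -- (possibly flattened) application node
  deriving Show

-- p* : the flipped version of a primitive (the representation here is made up).
flipPrim :: String -> String
flipPrim p = p ++ "*"

-- Translate a saturated binary primitive application into infix form,
-- preferring the rule that leaves an already-evaluated argument last.
toInfix :: Exp -> Exp
toInfix (Ap (Prim p) [m, n])
  | evaluated m = Ap n [Prim (flipPrim p), m]   -- p m n  ->  n p* m
  | otherwise   = Ap m [Prim p, n]              -- p m n  ->  m p n
  where
    evaluated (Num _) = True
    evaluated _       = False
toInfix e = e
```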
At run time, we have two reduction rules for primitive operations. Just in terms of the main stack, these rules are:
$$ \begin{array}{r r} (\mathtt{INT}\ m: \mathtt{PRI}\ p: \mathtt{INT}\ n: \bar{s})\ \rightarrow & (\mathtt{alu}\ p\ m\ n : \bar{s}) \\ (\mathtt{INT}\ m: \mathtt{PRI}\ p: x: \bar{s})\ \rightarrow & (x: \mathtt{PRI}\ p{*}: \mathtt{INT}\ m: \bar{s}) \end{array} $$
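As a purely illustrative Haskell sketch of these two rules (the Atom type and toy alu below are assumptions, not the hardware’s actual representation):

```haskell
-- Atoms on the main stack; OTHER stands in for the remaining atom kinds (AP, FUN, ARG, ...).
data Atom = INT Int | PRI String | OTHER String
  deriving Show

-- A toy ALU. Flipped primitives (written p*) take their operands swapped.
alu :: String -> Int -> Int -> Atom
alu "+"  m n = INT (m + n)
alu "-"  m n = INT (m - n)
alu "+*" m n = INT (n + m)
alu "-*" m n = INT (n - m)
alu p    _ _ = error ("unhandled primitive: " ++ p)

-- One primitive-reduction step of the 2012 infix scheme, over the main stack alone.
stepInfix :: [Atom] -> Maybe [Atom]
stepInfix (INT m : PRI p : INT n : s) = Just (alu p m n : s)                   -- both arguments evaluated
stepInfix (INT m : PRI p : x : s)     = Just (x : PRI (p ++ "*") : INT m : s)  -- swap to evaluate the second
stepInfix _                           = Nothing                                -- not a primitive redex
```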
The first rule reduces a primitive application once both arguments are evaluated, and the second swaps the order of the two arguments in order to evaluate the second argument. As long as we have hardware support for swapping arguments to the ALU, this is already quite a nice implementation. We can flatten the first argument of any binary primitive application, but not the second, since we need to be able to atomically swap positions once the first has been evaluated. This flattening opportunity is important as it allows us to make better use of our entire SpineLen and ApLen parameters. Since it would be quite cheap to add another stack to the hardware architecture, let’s consider a reduction scheme which allows us to flatten both arguments.
A Postfix Primitive Scheme
Instead of translating binary primitive applications into infix position, let’s try postfix. To facilitate this, we also introduce a new primitive stack to hold evaluated arguments to our postfix primitive operations.
Our compile-time translation becomes one of:
$$ \begin{array}{r l} p\ m\ n\ &\rightarrow\ m\ n\ p\\ p\ m\ n\ &\rightarrow\ n\ m\ p{*} \end{array} $$
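At compile time this is a small change to the earlier toInfix sketch (again purely illustrative, reusing the toy Exp type and flipPrim defined there):

```haskell
-- Postfix translation of a saturated binary primitive application.
-- Reuses the toy Exp type and flipPrim from the infix sketch above.
toPostfix :: Exp -> Exp
toPostfix (Ap (Prim p) [m, n]) = Ap m [n, Prim p]   -- p m n  ->  m n p
-- the alternative rule would be  Ap n [m, Prim (flipPrim p)],  i.e. p m n -> n m p*
toPostfix e = e
```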
The run time reduction rules are now relative to the main stack and the primitive stack:
$$ \begin{array}{r l c r l} (\mathtt{INT}\ m: \mathtt{INT}\ n: \mathtt{PRI}\ p: rest) &, (\bar{i}) &\rightarrow& (\mathtt{alu}\ p\ m\ n : rest)& , (\bar{i}) \\ (\mathtt{INT}\ n: \mathtt{PRI}\ p: rest) &, (m : \bar{i}) &\rightarrow& (\mathtt{alu}\ p\ m\ n : rest)& , (\bar{i}) \\ (\mathtt{INT}\ m: x : \mathtt{PRI}\ p: rest) &, (\bar{i}) &\rightarrow& (x: \mathtt{PRI}\ p: rest) &, (m : \bar{i}) \end{array} $$
The first rule directly evaluates primitive applications with both arguments already evaluated. The second evaluates a primitive application whose first argument has already been placed on the stack of primitives. The third moves a primitive application’s first argument to the stack of primitives in the case that we still need to evaluate the second.
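Under the same assumptions as the infix step sketch above (the toy Atom type and alu), the three rules might look like a step over a pair of stacks:

```haskell
-- One primitive-reduction step of the postfix scheme, over the main stack
-- and the new primitive stack of already-evaluated integer arguments.
stepPostfix :: ([Atom], [Int]) -> Maybe ([Atom], [Int])
stepPostfix (INT m : INT n : PRI p : rest, is)
  = Just (alu p m n : rest, is)                -- rule 1: both arguments evaluated
stepPostfix (INT n : PRI p : rest, m : is)
  = Just (alu p m n : rest, is)                -- rule 2: first argument taken from the primitive stack
stepPostfix (INT m : x : PRI p : rest, is)
  = Just (x : PRI p : rest, m : is)            -- rule 3: park the first argument, evaluate the second
stepPostfix _ = Nothing                        -- not a primitive redex
```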
Clearly we have increased the complexity of the reduction rules here. The only thing this buys us is the opportunity for flattening both argument subexpressions. We can gain longer spinal applications and wider heap applications. Using an example from here, we can rewrite the deep and narrow expression:
(+) (f x) (g y)
into a single flat and wide expression:
(+) f x g y
This lets us skip the cost of unwinding the (f x)
and (g y)
subexpressions
from the heap. In general, this is a nice win. However, we are now faced with
two options during compilation. Do we flatten or preserve these subexpressions?
While it looks appealing to always flatten them (aiming to side-step the cost of
unwind cycles), there are two scenarios where this could be detrimental:
- Edge cases of template splitting. If flattening causes us to spill over into one extra template where the original approach would not, we damage the code size. The required clock cycles are equivalent, but we are less efficient with code size. Often the opposite will be true: a longer spine might pack more efficiently instead.
- The interaction with the PRS strategy can be difficult. If we flatten an argument onto the spine, we lose the ability to apply PRS to the primitive application. If we are certain that the application will not be a successful PRS candidate, flattening will never damage the #cycles; for a successful PRS candidate, however, it might. For our dynamic PRS system, this proves challenging to reconcile. A static analysis for PRS would enable us to make the right choice here (a rough sketch of one possible policy follows this list).
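To make the trade-off concrete, here is a purely hypothetical sketch of a conservative flattening policy. None of these names (ArgInfo, shouldFlatten) exist in the compiler, and the PRS judgement would itself require the static analysis discussed above.

```haskell
-- Hypothetical summary of what we would need to know about an argument
-- subexpression before flattening it onto the spine.
data ArgInfo = ArgInfo
  { extraSpineSlots  :: Int   -- additional spine slots the flattened argument occupies
  , likelyPrsSuccess :: Bool  -- would dynamic PRS plausibly evaluate it in place?
  }

-- Conservative policy: only flatten when it can neither spill the template
-- nor clobber a PRS opportunity.
shouldFlatten :: Int -> Int -> ArgInfo -> Bool
shouldFlatten spineLen usedSlots arg =
     not (likelyPrsSuccess arg)
  && usedSlots + extraSpineSlots arg <= spineLen
```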
The postfix approach seems like the Right Thing to do, but it does raise these new compile-time questions. Let’s look at the (nonoptimal) results for our benchmarks if we take the easy way out: always flatten primitive applications in the spine.
Results
Since there is some tension with the PRS scheme, we present results both with and without PRS enabled. To start, let’s look at the results for reduction run time. Without PRS, we always have a non-negative impact relative to our previous optimisations. We show modest improvements for some benchmarks; second to Fib, CountDown shows a ≈3.7% improvement. Once PRS is enabled, our improvements are slightly reduced. This was expected, since without careful static analysis we can clobber the opportunity for successful PRS candidates. KnuthBendix is the only benchmark which is negatively impacted by the postfix primitive scheme with PRS enabled.
In general, we also see slight improvements in the number of triggered GC events, since we need fewer heap applications per template. Again, we have a non-negative impact without PRS, but KnuthBendix remains problematic once PRS is enabled.
Finally, let’s consider the code size of each benchmark. Although the effects are largely positive, we can see another impact of blindly choosing to flatten spinal primitive applications. Both Adjoxo and Clausify grow slightly in code size! This is not entangled with the PRS scheme and could be resolved with a fairly simple compile-time check. The effect does not influence the number of reduction cycles directly, but it does positively impact #GC events. There is a balance to be found between time spent performing GC and overall code size, although we do not currently consider this in the compiler.
Overall, this scheme feels like the Right Thing to do, giving us the opportunity to build wider and shallower applications. However, we’ve seen that this new choice can negatively impact us without some further static analysis. Perhaps this is enough to justify investigation of a static PRS scheme.