Wide Postfix Primitive Applications
In this post, we implement and discuss our own take on an idea presented in the Reduceron project’s Memo 38. There, Matthew Naylor suggests that (binary) primitive applications could be transformed to postfix position and evaluated with the help of a primitive stack. This allows wider, flatter applications via Reverse Polish notation. While this was suggested as an interesting idea, it was not publicly implemented or analysed in anger — that’s what we offer in this memo.
The 2012 Reduceron’s Primitive Evaluation
As per Section 5.2 of The Reduceron reconfigured and re-evaluated, the authors implement quite a neat approach to evaluating primitives.
The set of primitive operations is extended with a flipped version of each operation (where p* denotes the flipped version of p). During compilation, every binary primitive application is transformed into infix form. Given a primitive operation, p, and two integer-typed arguments, m and n, we apply one of the following rules:
$$ \begin{array}{r l} p\ m\ n\ &\rightarrow\ m\ p\ n \\ p\ m\ n\ &\rightarrow\ n\ p{*}\ m \end{array} $$
We are free to choose either rule, but placing an argument which we expect to already be fully evaluated in the final position gives a slight bonus.
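To make the translation concrete, here is a minimal Haskell sketch. The Exp type, flipPrim and toInfix below are illustrative stand-ins, not the compiler’s actual representation.

```haskell
-- Toy expression type for illustration only; not the real compiler AST.
data Exp
  = Num Int         -- integer literal
  | Prim String     -- primitive operation, e.g. "+"
  | Var String
  | Ap Exp [Exp]    -- (possibly flattened) application node
  deriving Show

-- p* : the flipped version of a primitive (the representation here is made up).
flipPrim :: String -> String
flipPrim p = p ++ "*"

-- Translate a saturated binary primitive application into infix form,
-- preferring the rule that leaves an already-evaluated argument last.
toInfix :: Exp -> Exp
toInfix (Ap (Prim p) [m, n])
  | evaluated m = Ap n [Prim (flipPrim p), m]   -- p m n  ->  n p* m
  | otherwise   = Ap m [Prim p, n]              -- p m n  ->  m p n
  where
    evaluated (Num _) = True
    evaluated _       = False
toInfix e = e
```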
At run time, we have two reduction rules for primitive operations. Just in terms of the main stack, these rules are:
$$ \begin{array}{r r} (\mathtt{INT}\ m: \mathtt{PRI}\ p: \mathtt{INT}\ n: \bar{s})\ \rightarrow & (\mathtt{alu}\ p\ m\ n : \bar{s}) \\ (\mathtt{INT}\ m: \mathtt{PRI}\ p: x: \bar{s})\ \rightarrow & (x: \mathtt{PRI}\ p{*}: \mathtt{INT}\ m: \bar{s}) \end{array} $$
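As a purely illustrative Haskell sketch of these two rules (the Atom type and toy alu below are assumptions, not the hardware’s actual representation):

```haskell
-- Atoms on the main stack; OTHER stands in for the remaining atom kinds (AP, FUN, ARG, ...).
data Atom = INT Int | PRI String | OTHER String
  deriving Show

-- A toy ALU. Flipped primitives (written p*) take their operands swapped.
alu :: String -> Int -> Int -> Atom
alu "+"  m n = INT (m + n)
alu "-"  m n = INT (m - n)
alu "+*" m n = INT (n + m)
alu "-*" m n = INT (n - m)
alu p    _ _ = error ("unhandled primitive: " ++ p)

-- One primitive-reduction step of the 2012 infix scheme, over the main stack alone.
stepInfix :: [Atom] -> Maybe [Atom]
stepInfix (INT m : PRI p : INT n : s) = Just (alu p m n : s)                   -- both arguments evaluated
stepInfix (INT m : PRI p : x : s)     = Just (x : PRI (p ++ "*") : INT m : s)  -- swap to evaluate the second
stepInfix _                           = Nothing                                -- not a primitive redex
```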
The first rule reduces a primitive application once both arguments are evaluated, and the second swaps the order of the two arguments in order to evaluate the second argument. As long as we have hardware support for swapping arguments to the ALU, this is already quite a nice implementation. We can flatten the first argument of any binary primitive application, but not the second, since we need to be able to atomically swap positions once the first has been evaluated. This flattening opportunity is important as it allows us to make better use of our entire SpineLen and ApLen parameters. Since it would be quite cheap to add another stack to the hardware architecture, let’s consider a reduction scheme which allows us to flatten both arguments.
A Postfix Primitive Scheme
Instead of translating binary primitive applications into infix position, let’s try postfix. To facilitate this, we also introduce a new primitive stack to hold evaluated arguments to our postfix primitive operations.
Our compile-time translation becomes one of:
$$ \begin{array}{r l} p\ m\ n\ &\rightarrow\ m\ n\ p\\ p\ m\ n\ &\rightarrow\ n\ m\ p{*} \end{array} $$
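At compile time this is a small change to the earlier toInfix sketch (again purely illustrative, reusing the toy Exp type and flipPrim defined there):

```haskell
-- Postfix translation of a saturated binary primitive application.
-- Reuses the toy Exp type and flipPrim from the infix sketch above.
toPostfix :: Exp -> Exp
toPostfix (Ap (Prim p) [m, n]) = Ap m [n, Prim p]   -- p m n  ->  m n p
-- the alternative rule would be  Ap n [m, Prim (flipPrim p)],  i.e. p m n -> n m p*
toPostfix e = e
```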
The run time reduction rules are now relative to the main stack and the primitive stack:
$$ \begin{array}{r l c r l} (\mathtt{INT}\ m: \mathtt{INT}\ n: \mathtt{PRI}\ p: rest) &, (\bar{i}) &\rightarrow& (\mathtt{alu}\ p\ m\ n : rest)& , (\bar{i}) \\ (\mathtt{INT}\ n: \mathtt{PRI}\ p: rest) &, (m : \bar{i}) &\rightarrow& (\mathtt{alu}\ p\ m\ n : rest)& , (\bar{i}) \\ (\mathtt{INT}\ m: x : \mathtt{PRI}\ p: rest) &, (\bar{i}) &\rightarrow& (x: \mathtt{PRI}\ p: rest) &, (m : \bar{i}) \end{array} $$
The first rule directly evaluates primitive applications with both arguments already evaluated. The second evaluates a primitive application whose first argument has already been placed on the stack of primitives. The third moves a primitive application’s first argument to the stack of primitives in the case that we still need to evaluate the second.
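Under the same assumptions as the infix step sketch above (the toy Atom type and alu), the three rules might look like a step over a pair of stacks:

```haskell
-- One primitive-reduction step of the postfix scheme, over the main stack
-- and the new primitive stack of already-evaluated integer arguments.
stepPostfix :: ([Atom], [Int]) -> Maybe ([Atom], [Int])
stepPostfix (INT m : INT n : PRI p : rest, is)
  = Just (alu p m n : rest, is)                -- rule 1: both arguments evaluated
stepPostfix (INT n : PRI p : rest, m : is)
  = Just (alu p m n : rest, is)                -- rule 2: first argument taken from the primitive stack
stepPostfix (INT m : x : PRI p : rest, is)
  = Just (x : PRI p : rest, m : is)            -- rule 3: park the first argument, evaluate the second
stepPostfix _ = Nothing                        -- not a primitive redex
```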
Clearly we have increased the complexity of the reduction rules here. The only thing this buys us is the opportunity for flattening both argument subexpressions. We can gain longer spinal applications and wider heap applications. Using an example from here, we can rewrite the deep and narrow expression:
(+) (f x) (g y)
into a single flat and wide expression:
(+) f x g y
This lets us skip the cost of unwinding the (f x)
and (g y)
subexpressions
from the heap. In general, this is a nice win. However, we are now faced with
two options during compilation. Do we flatten or preserve these subexpressions?
While it looks appealing to always flatten them (aiming to side-step the cost of
unwind cycles), there are two scenarios where this could be detrimental:
- Edge cases of template splitting. If flattening causes us to spill over into one extra template where the original approach would not, we damage the code size. The required clock cycles are equivalent, but we are less efficient with code size. Often the opposite will be true: a longer spine might pack more efficiently instead.
- The interaction with the PRS strategy can be difficult. If we flatten an argument onto the spine, we lose the ability to apply PRS to the primitive application. If we are certain that the application will not be a successful PRS candidate, flattening will never damage the #cycles; for a successful PRS candidate, however, it might. For our dynamic PRS system, this proves challenging to reconcile. A static analysis for PRS would enable us to make the right choice here (a rough sketch of one possible policy follows this list).
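To make the trade-off concrete, here is a purely hypothetical sketch of a conservative flattening policy. None of these names (ArgInfo, shouldFlatten) exist in the compiler, and the PRS judgement would itself require the static analysis discussed above.

```haskell
-- Hypothetical summary of what we would need to know about an argument
-- subexpression before flattening it onto the spine.
data ArgInfo = ArgInfo
  { extraSpineSlots  :: Int   -- additional spine slots the flattened argument occupies
  , likelyPrsSuccess :: Bool  -- would dynamic PRS plausibly evaluate it in place?
  }

-- Conservative policy: only flatten when it can neither spill the template
-- nor clobber a PRS opportunity.
shouldFlatten :: Int -> Int -> ArgInfo -> Bool
shouldFlatten spineLen usedSlots arg =
     not (likelyPrsSuccess arg)
  && usedSlots + extraSpineSlots arg <= spineLen
```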
The postfix approach seems like the Right Thing to do, but it does raise these new compile-time questions. Let’s look at the (nonoptimal) results for our benchmarks if we take the easy way out: always flatten primitive applications in the spine.
Results
Since there is some tension with the PRS scheme, we present results both with and without PRS enabled. To start, let’s look at the results for reduction run time. Without PRS, we always have a non-negative impact relative to our previous optimisations. We show modest improvements for some benchmarks; second to Fib, CountDown shows a ≈3.7% improvement. Once PRS is enabled, our improvements are slightly reduced. This was expected, since without careful static analysis we can clobber the opportunity for successful PRS candidates. KnuthBendix is the only benchmark which is negatively impacted by the postfix primitive scheme with PRS enabled.
In general, we also see slight improvements in the number of triggered GC events, since we need fewer heap applications per template. Again, we have a non-negative impact without PRS, but KnuthBendix remains problematic once PRS is enabled.
Finally, let’s consider the code size of each benchmark. Although the effects are largely positive, we can see another impact of blindly choosing to flatten spinal primitive applications. Both Adjoxo and Clausify grow slightly in code size! This is not entangled with the PRS scheme and could be resolved with a fairly simple compile-time check. The effect does not influence the number of reduction cycles directly, but it does positively impact #GC events. There is a balance to be found between time spent performing GC and overall code size, although we do not currently consider this in the compiler.
Overall, this scheme feels like the Right Thing to do, giving us the opportunity to build wider and shallower applications. However, we’ve seen that this new choice can negatively impact us without some further static analysis. Perhaps this is enough to justify investigation of a static PRS scheme.