Relaxing Reduceron's Template Constraints
Our own work towards HAFLANG is heavily inspired by the Reduceron project. The Reduceron is an FPGA-based template instantiation architecture for a small, lazy functional language called F-lite. An F-lite program is a set of supercombinators, each of which gets translated to a set of templates — the Reduceron's native representation of a program. This process remains quite true to the user's original source: we are evaluating the functional program as directly as is feasible.
Consider a toy example:
```
foo x y = let { a = h y }
          in f (g x a) a;
```
We could directly draw the supercombinator `foo` as a graph. The (simplified) translation from this graph to Reduceron templates is a two-step process:
- We divide the graph into two categories: the (pink) spine down the left side, and a set of non-spinal subexpressions (blue) branching towards the right.
- We map the spine and subexpressions to a number of templates, splitting up any constructs which are too large for the hardware.
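To make the later constraint discussion concrete, here is a minimal Haskell sketch of the template representation. The types are deliberately simplified and the field names are ours; the real F-lite compiler and Reduceron encode atoms more compactly (with arities and sharing flags, for instance).

```haskell
-- A simplified model of Reduceron templates (illustrative only).
data Atom
  = FUN String   -- reference to another template (supercombinator)
  | ARG Int      -- argument of the current supercombinator
  | VAR Int      -- reference to an application on the heap
  | INT Int      -- primitive integer
  | CON Int Int  -- constructor: arity and tag
  deriving (Show, Eq)

type App = [Atom]  -- a flat application of atoms

data Template = Template
  { alts  :: [String]  -- case-alternative table (ignored in this memo)
  , spine :: App       -- pushed onto the main stack at instantiation
  , heap  :: [App]     -- applications allocated at instantiation
  } deriving (Show, Eq)
```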
When evaluated, the spine application is pushed directly onto the main stack, while the non-spinal subexpressions are allocated on the heap. Note that the hardware architecture necessarily has fixed dimensions. The three most important dimensions, illustrated above, are:
- `SpineLen`: Maximum number of atoms per spinal application. Dictates how parallel our main stack implementation needs to be.
- `ApLen`: Maximum number of atoms per application on the heap. Dictates how wide our heap memory needs to be.
- `MaxAps`: Maximum number of applications we can allocate on the heap simultaneously. Dictates the number of ports our heap memory needs to have.
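For the sketches in this memo we can treat these dimensions as plain constants. The values below are purely illustrative, not the Reduceron's actual parameters.

```haskell
-- Illustrative dimensions (not the real Reduceron's values).
spineLen, apLen, maxAps :: Int
spineLen = 6  -- atoms per spinal application
apLen    = 4  -- atoms per heap application
maxAps   = 2  -- heap allocations per template
```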
Assuming our example graph obeys all of these fixed dimensions, we can quite directly write out a corresponding template with one spinal application and two heap applications. (The Alts field is for handling case expressions and will be explored in future memos.)
$$ \textit{\#0}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{FUN f}, & \texttt{VAR 1}, & \texttt{VAR 0} \end{array}]\\\\ \text{Heap} & \left\{ \begin{array}{l} \texttt{VAR 0} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{FUN h}, & \texttt{ARG y} \end{array}]\\ \texttt{VAR 1} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{FUN g}, & \texttt{ARG x}, & \texttt{VAR 0} \end{array}]\\ \end{array}\right. \end{array} \right. $$
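In the sketched types from earlier, the same template can be written down directly. We assume here that arguments are numbered left to right, so `x` becomes `ARG 0` and `y` becomes `ARG 1`.

```haskell
-- Template #0 for foo, in our simplified model.
fooTemplate :: Template
fooTemplate = Template
  { alts  = []
  , spine = [FUN "f", VAR 1, VAR 0]
  , heap  = [ [FUN "h", ARG 1]         -- VAR 0: a = h y
            , [FUN "g", ARG 0, VAR 0]  -- VAR 1: g x a
            ]
  }
```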
This is the happy path for the translation to templates. There are, however, some constraints on the Reduceron templates which are overly strict. We want to see if we can relax these constraints with a modest increase in hardware complexity.
The Constraints
1. The refersCheck
The head of any template's spine will direct which reduction rule is used directly after template instantiation. If the head of the spine is `VAR x`, our next step is to unwind/dereference `x` from the heap memory. Below, we show an example template representing `sumDoubles x y = (x+x) + (y+y)`. Note in particular that the `VAR` in head position refers to an application which is being allocated by the same template.
$$ \textit{sumDoubles}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{VAR 0}, & \texttt{(+)}, & \texttt{VAR 1} \end{array}]\\\\ \text{Heap} & \left\{ \begin{array}{l} \texttt{VAR 0} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG x}, & \texttt{(+)}, & \texttt{ARG x} \end{array}]\\ \texttt{VAR 1} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG y}, & \texttt{(+)}, & \texttt{ARG y} \end{array}]\\ \end{array}\right. \end{array} \right. $$
There is one subtle problem here. Since the template is responsible for instantiating `VAR 0` and allocating it to the heap, we might try to prefetch `VAR 0` for unwinding before it has been committed to memory!
To avoid this conflict, the F-lite compiler checks for this condition and splits the offending template in two: the heap applications are allocated by the first template, and the original spine appears in the second. This incurs a cost in both code size (+1 template) and run time (+1 cycle for instantiation).
$$ \begin{array}{l} \textit{\#0}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{FUN \#1} \end{array}]\\\\ \text{Heap} & \left\{ \begin{array}{l} \texttt{VAR 0} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG x}, & \texttt{(+)}, & \texttt{ARG x} \end{array}]\\ \texttt{VAR 1} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG y}, & \texttt{(+)}, & \texttt{ARG y} \end{array}]\\ \end{array}\right. \end{array} \right. \\~\\ \textit{\#1}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{VAR 0}, & \texttt{(+)}, & \texttt{VAR 1} \end{array}]\\\\ \text{Heap} & \left\{\ \right. \end{array} \right. \end{array} $$
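In our simplified model, the check and split might look as follows. The function names are ours, and we assume `VAR` indices below `length (heap t)` name the template's own allocations; the F-lite compiler's actual `refersCheck` differs in detail.

```haskell
-- Does the spine head refer to an application allocated by this
-- same template? (Hypothetical naming and VAR-index convention.)
headRefersToOwnHeap :: Template -> Bool
headRefersToOwnHeap t = case spine t of
  VAR n : _ -> n < length (heap t)
  _         -> False

-- Split a problematic template: allocate the heap applications in the
-- first template, then jump to a second holding the original spine.
splitRefers :: (String, Template) -> [(String, Template)]
splitRefers (name, t)
  | headRefersToOwnHeap t =
      [ (name,  t { spine = [FUN name'] })
      , (name', t { heap  = [] })
      ]
  | otherwise = [(name, t)]
  where name' = name ++ "'"
```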
Instead, we propose a simple forwarding mechanism to support the original problematic template. Here an application instantiated during the previous cycle is available directly, without taking a round-trip to the heap memory.
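As a first approximation, the forwarding path behaves like a small associative lookup in front of the heap. This is only a behavioural sketch with names of our own choosing; the hardware implements it as a handful of registers and comparators.

```haskell
type HeapAddr = Int

-- Fetch an application, preferring the forwarding registers (which
-- hold last cycle's heap writes) over a heap round trip.
fetchApp :: [(HeapAddr, App)]  -- forwarding registers
         -> (HeapAddr -> App)  -- heap read (costs a cycle)
         -> HeapAddr
         -> App
fetchApp fwd readHeap a = case lookup a fwd of
  Just app -> app         -- forwarded: written just last cycle
  Nothing  -> readHeap a  -- miss: go to the heap memory
```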
2. Conservative heap activity
There is one more constraint, this time appeasing the limits of the heap memory: we need to ensure we have enough physical ports to dispatch all heap transactions needed in a single cycle. Although the compiler does have a `MaxAps` parameter (the number of heap applications per template), this is not the full picture. If the top of the stack happens to be a `VAR`, the original architecture will attempt to prefetch it from the heap. We run into difficulty when we need this prefetching but our template also tries to instantiate `MaxAps` applications to the heap.
The F-lite compiler eliminates this possibility quite conservatively. Unless we are certain that a template's spine is a single `FUN` atom (a pattern introduced by the template splitting process), we restrict the number of new heap applications to `MaxAps - 1`.
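As a sketch, the conservative rule amounts to the following allocation limit per template (again using our simplified model and illustrative `maxAps`):

```haskell
-- Conservative limit: only a single-FUN spine is known never to
-- need the spine-head prefetch port.
allocLimit :: Template -> Int
allocLimit t = case spine t of
  [FUN _] -> maxAps      -- no prefetch possible; use every port
  _       -> maxAps - 1  -- reserve one port for a possible prefetch
```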
There is scope for weakening this restriction further with just static analysis (sketched after this list):
- The spine can be any length; only the first atom is important.
- The head position could be any atom which, when instantiated, is a non-`VAR`. This includes `INT`, `CON`, and `FUN`. Any `ARG`s are excluded since we do not know what they will resolve to until run-time.
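That relaxed static rule could be sketched like so: the prefetch port only needs reserving when the spine head might resolve to a `VAR`.

```haskell
-- Relaxed static limit: FUN, INT and CON heads can never become a
-- VAR after instantiation, so no prefetch port is needed for them.
allocLimit' :: Template -> Int
allocLimit' t = case spine t of
  FUN _   : _ -> maxAps
  INT _   : _ -> maxAps
  CON _ _ : _ -> maxAps
  _           -> maxAps - 1  -- ARG or VAR head: stay conservative
```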
We choose to go one step further and make this choice dynamic, deferring it until run-time. The static version will always need to be conservative since we cannot statically determine what an `ARG` atom will become after instantiation. We opt to always allow templates to allocate `MaxAps` heap nodes. If the head position of the spine resolves to a `VAR` (which is not in our new forwarding registers!), we incur a 1-cycle bubble while waiting on the heap memory.
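The dynamic policy then reduces to a run-time check on the resolved spine head. In this sketch we reuse `VAR` to carry a resolved heap address after instantiation; the names are ours.

```haskell
-- Stall only when the resolved head is a VAR that misses the
-- forwarding registers; everything else proceeds immediately.
needsBubble :: [(HeapAddr, App)] -> Atom -> Bool
needsBubble fwd (VAR addr) = case lookup addr fwd of
  Nothing -> True   -- one-cycle bubble: wait on the heap memory
  Just _  -> False  -- hit in the forwarding registers
needsBubble _ _ = False
```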
3. Spine splitting
Finally, the F-lite compiler takes a simple approach to splitting spinal applications which are longer than `SpineLen`. The bottom `SpineLen - 1` atoms remain in the spine and the rest are relocated to (possibly several) heap applications. This means that for any set of split templates, only the final template actually stores useful spine data. This effect was also demonstrated back in the split version of `sumDoubles`.
If we could more intelligently divide long spinal applications between templates, we would win on two fronts:
- We often get a smaller code size. Previously unused spines in split templates are better utilised, and fewer heap applications are needed since none are required for the overspill of long spines.
- We often need fewer cycles per reduction. We avoid the overhead of unwinding the auxiliary applications generated during spine splitting.
The first point is not always true — if there are sufficient free heap allocations in a template, the original method could pack more efficiently into a single template. However, the results later in this memo suggest the new approach offers a good advantage on average. Ideally, our compiler would be aware of the remaining template resources for each supercombinator and could make an informed decision about which method to use depending on context.
Implementing the incremental construction of a spine in the compiler is non-trivial, since elements will be moved on (or popped from) the stack incrementally, impacting the structure and reach of the next portion of the spine. To side-step this complexity, we make a small addition to the hardware architecture: we freeze a copy of the top `SpineLen` elements of the stack whenever we begin the instantiation of a set of linked templates. All instantiation is then performed against this frozen copy, while we are free to clobber the arguments on the live stack.
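Behaviourally, the frozen copy acts like a snapshot register file. A minimal sketch, assuming a simple list model of the stack and our own naming:

```haskell
data EvalState = EvalState
  { liveStack :: [Atom]  -- may be popped or clobbered mid-instantiation
  , frozenTop :: [Atom]  -- snapshot of the top spineLen elements
  }

-- Taken once, when instantiation of a linked template group begins.
freezeSpine :: EvalState -> EvalState
freezeSpine s = s { frozenTop = take spineLen (liveStack s) }

-- ARG n resolves against the snapshot, so later templates in the group
-- see the original arguments even after the live stack has changed.
resolveArg :: EvalState -> Int -> Atom
resolveArg s n = frozenTop s !! n
```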
Results
We have implemented all three changes to the compilation of templates and the emulated Reduceron hardware. Below we show results normalised against the original implementation. There are three distinct metrics of interest here:
- Code size
- Reduction run time (in cycles, not including GC)
- Number of GC events triggered (a proxy for heap allocations)
First of all, we do see modest improvements to the run time of all benchmarks except `Fib` and `Queens`. This effect is due to two factors: more efficient template packing leads to fewer cycles spent on instantiation, and better use of spinal applications avoids the overhead of unnecessary dereferencing.
Next, we highlight the first of these factors in isolation by comparing the number of templates required for each benchmark. For this metric, we also see modest improvements nearly across the board. The only outlier is `Fib`, which has an unrepresentatively small code size.
Finally, we demonstrate that all benchmarks which have “hot” supercombinators (templates with spines longer than `SpineLen`) also benefit from reduced heap activity. Since our long spines no longer spill over into the heap, we have fewer allocations and fewer GC events are triggered.
These three tweaks give us modest improvements across the board with only cheap additions to the hardware architecture.
Cheers,
Craig