Relaxing Reduceron's Template Constraints
Our own work towards HAFLANG is heavily inspired by the Reduceron project. The Reduceron is an FPGA-based template instantiation architecture for a small, lazy functional language called F-lite. An F-lite program is a set of supercombinators, each of which gets translated to a set of templates — the Reduceron's native representation of a program. This process remains quite true to the user's original source: we are evaluating the functional program as directly as is feasible.
Consider a toy example:
```
foo x y = let { a = h y }
          in f (g x a) a;
```
We could directly draw the supercombinator `foo` as a graph. The (simplified) translation from this graph to Reduceron templates is a two-step process:
- We divide the graph into two categories: the (pink) spine down the left side, and a set of non-spinal subexpressions (blue) branching towards the right.
- We map the spine and subexpressions to a number of templates, splitting up any constructs which are too large for the hardware.
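To make the later constraint discussion concrete, here is a minimal Haskell sketch of the template representation. The types are deliberately simplified and the field names are ours; the real F-lite compiler and Reduceron encode atoms more compactly (with arities and sharing flags, for instance).

```haskell
-- A simplified model of Reduceron templates (illustrative only).
data Atom
  = FUN String   -- reference to another template (supercombinator)
  | ARG Int      -- argument of the current supercombinator
  | VAR Int      -- reference to an application on the heap
  | INT Int      -- primitive integer
  | CON Int Int  -- constructor: arity and tag
  deriving (Show, Eq)

type App = [Atom]  -- a flat application of atoms

data Template = Template
  { alts  :: [String]  -- case-alternative table (ignored in this memo)
  , spine :: App       -- pushed onto the main stack at instantiation
  , heap  :: [App]     -- applications allocated at instantiation
  } deriving (Show, Eq)
```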
When evaluated, the spine application is pushed directly onto the main stack, while the non-spinal subexpressions are allocated on the heap. Note that the hardware architecture necessarily has fixed dimensions. The three most important dimensions, illustrated above, are:
- `SpineLen`: Maximum number of atoms per spinal application. Dictates how parallel our main stack implementation needs to be.
- `ApLen`: Maximum number of atoms per application on the heap. Dictates how wide our heap memory needs to be.
- `MaxAps`: Maximum number of applications we can allocate on the heap simultaneously. Dictates the number of ports our heap memory needs to have.
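For the sketches in this memo we can treat these dimensions as plain constants. The values below are purely illustrative, not the Reduceron's actual parameters.

```haskell
-- Illustrative dimensions (not the real Reduceron's values).
spineLen, apLen, maxAps :: Int
spineLen = 6  -- atoms per spinal application
apLen    = 4  -- atoms per heap application
maxAps   = 2  -- heap allocations per template
```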
Assuming our example graph obeys all of these fixed dimensions, we can quite directly write out a corresponding template with one spinal application and two heap applications. (The Alts field is for handling case expressions and will be explored in future memos.)
$$ \textit{\#0}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{FUN f}, & \texttt{VAR 1}, & \texttt{VAR 0} \end{array}]\\\\ \text{Heap} & \left\{ \begin{array}{l} \texttt{VAR 0} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{FUN h}, & \texttt{ARG y} \end{array}]\\ \texttt{VAR 1} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{FUN g}, & \texttt{ARG x}, & \texttt{VAR 0} \end{array}]\\ \end{array}\right. \end{array} \right. $$
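In the sketched types from earlier, the same template can be written down directly. We assume here that arguments are numbered left to right, so `x` becomes `ARG 0` and `y` becomes `ARG 1`.

```haskell
-- Template #0 for foo, in our simplified model.
fooTemplate :: Template
fooTemplate = Template
  { alts  = []
  , spine = [FUN "f", VAR 1, VAR 0]
  , heap  = [ [FUN "h", ARG 1]         -- VAR 0: a = h y
            , [FUN "g", ARG 0, VAR 0]  -- VAR 1: g x a
            ]
  }
```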
This is the happy path for the translation to templates. There are, however, some constraints on the Reduceron templates which are overly strict. We want to see if we can relax these constraints with a modest increase in hardware complexity.
The Constraints
1. The refersCheck
The head of any template's spine will direct which reduction rule is used directly after template instantiation. If the head of the spine is `VAR x`, our next step is to unwind/dereference `x` from the heap memory. Below, we show an example template representing `sumDoubles x y = (x+x) + (y+y)`. Note in particular that the `VAR` in head position refers to an application which is being allocated by the same template.
$$ \textit{sumDoubles}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{VAR 0}, & \texttt{(+)}, & \texttt{VAR 1} \end{array}]\\\\ \text{Heap} & \left\{ \begin{array}{l} \texttt{VAR 0} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG x}, & \texttt{(+)}, & \texttt{ARG x} \end{array}]\\ \texttt{VAR 1} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG y}, & \texttt{(+)}, & \texttt{ARG y} \end{array}]\\ \end{array}\right. \end{array} \right. $$
There is one subtle problem here. Since the template is responsible for instantiating `VAR 0` and allocating it to the heap, we might try to prefetch `VAR 0` for unwinding before it has been committed to memory!
To avoid this conflict, the F-lite compiler checks for this condition and splits the offending template in two: the heap applications are allocated by the first template, and the original spine appears in the second. This incurs a cost in both code size (+1 template) and run time (+1 cycle for instantiation).
$$ \begin{array}{l} \textit{\#0}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{FUN \#1} \end{array}]\\\\ \text{Heap} & \left\{ \begin{array}{l} \texttt{VAR 0} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG x}, & \texttt{(+)}, & \texttt{ARG x} \end{array}]\\ \texttt{VAR 1} \mapsto \texttt{App}\ [\begin{array}{c c c c} \texttt{ARG y}, & \texttt{(+)}, & \texttt{ARG y} \end{array}]\\ \end{array}\right. \end{array} \right. \\~\\ \textit{\#1}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{VAR 0}, & \texttt{(+)}, & \texttt{VAR 1} \end{array}]\\\\ \text{Heap} & \left\{\ \right. \end{array} \right. \end{array} $$
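In our simplified model, the check and split might look as follows. The function names are ours, and we assume `VAR` indices below `length (heap t)` name the template's own allocations; the F-lite compiler's actual `refersCheck` differs in detail.

```haskell
-- Does the spine head refer to an application allocated by this
-- same template? (Hypothetical naming and VAR-index convention.)
headRefersToOwnHeap :: Template -> Bool
headRefersToOwnHeap t = case spine t of
  VAR n : _ -> n < length (heap t)
  _         -> False

-- Split a problematic template: allocate the heap applications in the
-- first template, then jump to a second holding the original spine.
splitRefers :: (String, Template) -> [(String, Template)]
splitRefers (name, t)
  | headRefersToOwnHeap t =
      [ (name,  t { spine = [FUN name'] })
      , (name', t { heap  = [] })
      ]
  | otherwise = [(name, t)]
  where name' = name ++ "'"
```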
Instead, we propose a simple forwarding mechanism to support the original problematic template. Here an application instantiated during the previous cycle is available directly, without taking a round-trip to the heap memory.
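As a first approximation, the forwarding path behaves like a small associative lookup in front of the heap. This is only a behavioural sketch with names of our own choosing; the hardware implements it as a handful of registers and comparators.

```haskell
type HeapAddr = Int

-- Fetch an application, preferring the forwarding registers (which
-- hold last cycle's heap writes) over a heap round trip.
fetchApp :: [(HeapAddr, App)]  -- forwarding registers
         -> (HeapAddr -> App)  -- heap read (costs a cycle)
         -> HeapAddr
         -> App
fetchApp fwd readHeap a = case lookup a fwd of
  Just app -> app         -- forwarded: written just last cycle
  Nothing  -> readHeap a  -- miss: go to the heap memory
```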
2. Conservative heap activity
There is one more constraint, this time appeasing the limits of the heap memory: we need to ensure we have enough physical ports to dispatch all heap transactions needed in a single cycle. Although the compiler does have a `MaxAps` parameter (the number of heap applications per template), this is not the full picture. If the top of the stack happens to be a `VAR`, the original architecture will attempt to prefetch it from the heap. We run into difficulty when we need this prefetching but our template also tries to instantiate `MaxAps` applications to the heap.
The F-lite compiler eliminates this possibility quite conservatively. Unless we are certain that a template's spine is a single `FUN` atom (a pattern introduced by the template splitting process), we restrict the number of new heap applications to `MaxAps - 1`.
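As a sketch, the conservative rule amounts to the following allocation limit per template (again using our simplified model and illustrative `maxAps`):

```haskell
-- Conservative limit: only a single-FUN spine is known never to
-- need the spine-head prefetch port.
allocLimit :: Template -> Int
allocLimit t = case spine t of
  [FUN _] -> maxAps      -- no prefetch possible; use every port
  _       -> maxAps - 1  -- reserve one port for a possible prefetch
```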
There is scope for weakening this restriction further with just static analysis (sketched after this list):
- The spine can be any length; only the first atom is important.
- The head position could be any atom which, when instantiated, is a non-`VAR`. This includes `INT`, `CON`, and `FUN`. Any `ARG`s are excluded since we do not know what they will resolve to until run-time.
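That relaxed static rule could be sketched like so: the prefetch port only needs reserving when the spine head might resolve to a `VAR`.

```haskell
-- Relaxed static limit: FUN, INT and CON heads can never become a
-- VAR after instantiation, so no prefetch port is needed for them.
allocLimit' :: Template -> Int
allocLimit' t = case spine t of
  FUN _   : _ -> maxAps
  INT _   : _ -> maxAps
  CON _ _ : _ -> maxAps
  _           -> maxAps - 1  -- ARG or VAR head: stay conservative
```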
We choose to go one step further and make this choice dynamic, deferring it until run-time. The static version will always need to be conservative since we cannot statically determine what an `ARG` atom will become after instantiation. We opt to always allow templates to allocate `MaxAps` heap nodes. If the head position of the spine resolves to a `VAR` (which is not in our new forwarding registers!), we incur a 1-cycle bubble while waiting on the heap memory.
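The dynamic policy then reduces to a run-time check on the resolved spine head. In this sketch we reuse `VAR` to carry a resolved heap address after instantiation; the names are ours.

```haskell
-- Stall only when the resolved head is a VAR that misses the
-- forwarding registers; everything else proceeds immediately.
needsBubble :: [(HeapAddr, App)] -> Atom -> Bool
needsBubble fwd (VAR addr) = case lookup addr fwd of
  Nothing -> True   -- one-cycle bubble: wait on the heap memory
  Just _  -> False  -- hit in the forwarding registers
needsBubble _ _ = False
```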
3. Spine splitting
Finally, the F-lite compiler takes a simple approach to splitting spinal applications which are longer than `SpineLen`. The bottom `SpineLen - 1` atoms remain in the spine and the rest are relocated to (possibly several) heap applications. This means that for any set of split templates, only the final template actually stores useful spine data. This effect was also demonstrated back in the split version of `sumDoubles`.
If we could more intelligently divide long spinal applications between templates, we would win on two fronts:
- We often get a smaller code size. Previously unused spines in split templates are better utilised, and fewer heap applications are needed since none are required for the overspill of long spines.
- We often need fewer cycles per reduction. We avoid the overhead of unwinding the auxiliary applications generated during spine splitting.
The first point is not always true — if there are sufficient free heap allocations in a template, the original method could pack more efficiently into a single template. However, the results later in this memo suggest the new approach offers a good advantage on average. Ideally, our compiler would be aware of the remaining template resources for each supercombinator and could make an informed decision about which method to use depending on context.
Implementing the incremental construction of a spine in the compiler is non-trivial, since elements will be moved on (or popped from) the stack incrementally, impacting the structure and reach of the next portion of the spine. To side-step this complexity, we make a small addition to the hardware architecture: we freeze a copy of the top `SpineLen` elements of the stack whenever we begin the instantiation of a set of linked templates. All instantiation is then performed against this frozen copy, while we are free to clobber the arguments on the live stack.
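Behaviourally, the frozen copy acts like a snapshot register file. A minimal sketch, assuming a simple list model of the stack and our own naming:

```haskell
data EvalState = EvalState
  { liveStack :: [Atom]  -- may be popped or clobbered mid-instantiation
  , frozenTop :: [Atom]  -- snapshot of the top spineLen elements
  }

-- Taken once, when instantiation of a linked template group begins.
freezeSpine :: EvalState -> EvalState
freezeSpine s = s { frozenTop = take spineLen (liveStack s) }

-- ARG n resolves against the snapshot, so later templates in the group
-- see the original arguments even after the live stack has changed.
resolveArg :: EvalState -> Int -> Atom
resolveArg s n = frozenTop s !! n
```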
Results
We have implemented all three changes to the compilation of templates and the emulated Reduceron hardware. Below we show results normalised against the original implementation. There are three distinct metrics of interest here:
- Code size
- Reduction run time (in cycles, not including GC)
- Number of GC events triggered (a proxy for heap allocations)
First of all, we do see modest improvements to the run time of all benchmarks except `Fib` and `Queens`. This effect is due to two factors: more efficient template packing leads to fewer cycles spent on instantiation, and better use of spinal applications avoids the overhead of unnecessary dereferencing.
Next, we highlight the first of these factors in isolation by comparing the number of templates required for each benchmark. For this metric, we also see modest improvements nearly across the board. The only outlier is `Fib`, which has an unrepresentatively small code size.
Finally, we demonstrate that all benchmarks which have “hot” supercombinators (templates with spines longer than `SpineLen`) also benefit from reduced heap activity. Since our long spines no longer spill over into the heap, we have fewer allocations and fewer GC events are triggered.
These three tweaks give us modest improvements across the board with only cheap additions to the hardware architecture.
Cheers,
Craig