HAFLANG

Inlining simple case alternatives

By Craig Ramsay

In this post, we try to reduce the code size of compiled F-lite programs, since we are currently limited by the memory resources of our Reduceron-inspired processor architecture.

Every case expression in an F-lite program gives rise to multiple templates:

  • At least one to evaluate the case scrutinee, and…
  • At least one per case alternative

Each alternative gets its own template, and the alternatives of a given case expression must occupy a contiguous region of template memory, since case expressions are compiled down to a case table. A case table is represented as a single offset into template memory; once the scrutinee has been evaluated to WHNF, its constructor tag is added to that offset to select the alternative's template. This keeps case tables very small in the template format, but it forces every alternative to occupy at least one (quite large!) template, regardless of how simple it is.
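
In other words, once the scrutinee is in WHNF, the reducer finds the chosen alternative's template at an address of the form (symbol names here are purely illustrative):

$$ \mathit{altTemplate} = \mathit{tableOffset} + \mathit{tag}(\mathit{scrutinee}) $$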

Our proposal is to encode trivial case alternatives inline within the case tables themselves, reducing overall code size. To keep the case tables small enough to be appealing, we only offer this special case for two-way choices.

For this proposal, we explore inlining alternatives which are trivial — just a single atom and a number of elements to pop off the stack. This does not gain us any run-time performance directly, but we can substantially reduce the total number of templates required for a program. We will see scenarios with up to a 43% improvement in template count.

Original case handling

Let’s recap how case expressions are handled in Reduceron by looking at a small example. We define a binary and function and use it in a specialised fold over lists of booleans, called ands.

and False x = False;
and True  x = x;

ands Nil = True;
ands (Cons x xs) = and x (ands xs);

The F-lite compiler will desugar the pattern matching to case expressions on the RHS. It then transforms the case expressions of the form

$$\mathtt{case}\ e\ \mathtt{of}\ \lbrace C_0\ \overrightarrow{x_0} \to e_0\ ;\ \ldots\ ;\ C_m\ \overrightarrow{x_m} \to e_m \rbrace$$

into a series of templates, where each case alternative is lifted out into its own template. More formally, the case expression above becomes:

$$ \begin{array}{l} e\ \langle \mathtt{FUN}\ \mathit{alt}_0\ ,\ \ldots\ ,\ \mathtt{FUN}\ \mathit{alt}_m \rangle\ \overrightarrow{v} \end{array} $$

assuming that v⃗ includes all free variables of the case alternatives, and that each new function altᵢ implements the corresponding original alternative, taking the extra v⃗ arguments. The case table itself is stored in the Lut field of a template when the case sits in spinal position, or is attributed to a particular heap application otherwise. After inlining, our full program compiles down to a set of 5 templates:

$$
\begin{array}{l}
\textit{ands\#0}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [\texttt{FUN ands\#3}] \\
  \text{Spine} & [\texttt{ARG xs},\ \texttt{INT 0}] \\
  \text{Heap}  & \{\}
\end{array}\ \right] \\[1.5em]
\textit{ands\#1}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [] \\
  \text{Spine} & [\texttt{CON 0 0}] \\
  \text{Heap}  & \{\}
\end{array}\ \right] \\[1.5em]
\textit{ands\#2}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [\texttt{FUN ands\#3}] \\
  \text{Spine} & [\texttt{ARG x},\ \texttt{INT 0}] \\
  \text{Heap}  & \{\}
\end{array}\ \right] \\[1.5em]
\textit{ands\#3}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [\texttt{FUN ands\#1}] \\
  \text{Spine} & [\texttt{ARG x},\ \texttt{ARG xs}] \\
  \text{Heap}  & \{\}
\end{array}\ \right] \\[1.5em]
\textit{ands\#4}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [] \\
  \text{Spine} & [\texttt{CON 0 1}] \\
  \text{Heap}  & \{\}
\end{array}\ \right]
\end{array}
$$

Note that the order of the alternatives is tied to the alphabetical ordering of the constructor names being scrutinised. For lists we expect Cons then Nil, and for booleans we expect False then True. Each of these templates does very little work compared to the computational power available in a single template, so we propose a different compilation scheme.

A new case table representation

Originally, case tables were compiled down to a single integer: the offset into the template memory for the base of the table. We extend this to optionally be a binary choice between two stripped-down atoms, represented as:

data LUT = LOffset Int         -- offset of the first alternative's template, as before
         | LInline (Alt,Alt)   -- a two-way case with both alternatives encoded in the table

data Alt = AFun TmplAddr       -- jump to a full template, as before
         | AInt Pops Val       -- pop some stack elements, then return an integer literal
         | AArg Pops Index     -- pop some stack elements, then return one of the arguments
         | ACon Pops Arity Tag -- pop some stack elements, then return a constructor
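
As a small, hypothetical example (the template address 9 is made up, and constructor tags follow the alphabetical numbering described above, so False is tag 0): a two-way case over booleans whose False alternative just pops one stack element and returns False, but whose True alternative still needs a full template, could be stored as:

LInline (ACon 1 0 0, AFun 9)   -- False: pop 1, return CON 0 0; True: jump to template 9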

One aspect vital to the success of this inlining technique is minimising the number of bits needed to represent an Alt. For our default parameters (SpineLen=6 and TmplAddr=10 bits) we can encode an Alt in 11 bits, with the only significant sacrifice being the range of representable integers. This feels like a reasonable trade-off since we anticipate that trivial integer alternatives will usually be small values in a recursion’s base case. Our 11-bit encoding is:

$$ \begin{array}{r l l l l} \mathtt{AFun\ a} \mapsto & 1 & a_9\ a_8\ a_7\ a_6 & a_5\ a_4\ a_3\ a_2\ a_1\ a_0 \\ \mathtt{AInt\ p\ v} \mapsto & 01 & p_2\ p_1\ p_0 & v_5\ v_4\ v_3\ v_2\ v_1\ v_0 \\ \mathtt{AArg\ p\ i} \mapsto & 001 & p_2\ p_1\ p_0 & i_4\ i_3\ i_2\ i_1\ i_0 \\ \mathtt{ACon\ p\ a\ t} \mapsto & 0001 & p_2\ p_1\ p_0 & a_2\ a_1\ a_0 & t_0 \end{array} $$
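
To make the layout concrete, here is a minimal Haskell sketch of the encoder implied by this table. It is not the actual compiler code: the type synonyms, the most-significant-bit-first ordering, and the function names are assumptions layered on top of the table above; Alt is the type declared earlier.

import Data.Bits (testBit)

-- Assumed field widths for the default parameters above.
type TmplAddr = Int  -- 10 bits
type Pops     = Int  -- 3 bits
type Val      = Int  -- 6 bits
type Index    = Int  -- 5 bits
type Arity    = Int  -- 3 bits
type Tag      = Int  -- 1 bit

-- The low n bits of x, most-significant bit first.
bitsOf :: Int -> Int -> [Bool]
bitsOf n x = [ testBit x i | i <- [n-1, n-2 .. 0] ]

-- Encode an Alt into its 11-bit representation (tag bits first).
encodeAlt :: Alt -> [Bool]
encodeAlt (AFun a)     = [True]                   ++ bitsOf 10 a
encodeAlt (AInt p v)   = [False,True]             ++ bitsOf 3 p ++ bitsOf 6 v
encodeAlt (AArg p i)   = [False,False,True]       ++ bitsOf 3 p ++ bitsOf 5 i
encodeAlt (ACon p a t) = [False,False,False,True] ++ bitsOf 3 p ++ bitsOf 3 a ++ bitsOf 1 t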

For our default parameters, adding the new case alternative encoding to the spinal application and to any CASE heap applications increases the template size from 470 bits to 493 bits, an increase of 4.9%. For this to be worthwhile, we need the number of templates per program to fall, on average, by well over 5%.
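
Concretely, total code size is the number of templates multiplied by the template size, so the break-even point is a reduction in template count of

$$ 1 - \frac{470}{493} = \frac{23}{493} \approx 4.7\% $$

and anything beyond that is a net saving.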

With these new case alternative constructs, we can compile our example program much more compactly. We follow the same process as before, creating a template for each alternative. Then, after the usual inlining passes, we identify the two-way case tables with at least one trivial alternative and encode them with the new LInline construct.
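
A minimal Haskell sketch of that selection step might look like the following. The lookup and triviality test are hypothetical stand-ins for the real compiler's machinery (only LUT, Alt and TmplAddr come from above), and we assume the table is already known to be two-way, with its alternatives at consecutive template addresses:

import Data.Maybe (fromMaybe, isJust)

-- Rewrite a two-way case table, inlining whichever alternatives are trivial.
inlineTwoWay :: (TmplAddr -> tmpl)   -- fetch a lifted alternative's template
             -> (tmpl -> Maybe Alt)  -- Just an inlinable Alt if that template is trivial
             -> LUT -> LUT
inlineTwoWay lookupTmpl asTrivialAlt lut@(LOffset base)
  | isJust alt0 || isJust alt1 =
      LInline ( fromMaybe (AFun base)       alt0
              , fromMaybe (AFun (base + 1)) alt1 )
  | otherwise = lut
  where
    alt0 = asTrivialAlt (lookupTmpl base)
    alt1 = asTrivialAlt (lookupTmpl (base + 1))
inlineTwoWay _ _ lut = lut

Applied to our example program, this leaves just three templates: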

$$
\begin{array}{l}
\textit{ands\#0}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [\texttt{LInline (AFun ands\#2, ACon 1 0 True)}] \\
  \text{Spine} & [\texttt{ARG xs},\ \texttt{INT 0}] \\
  \text{Heap}  & \{\}
\end{array}\ \right] \\[1.5em]
\textit{ands\#1}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [\texttt{LInline (AFun ands\#2, ACon 1 0 True)}] \\
  \text{Spine} & [\texttt{ARG x},\ \texttt{INT 0}] \\
  \text{Heap}  & \{\}
\end{array}\ \right] \\[1.5em]
\textit{ands\#2}\ \left[\ \begin{array}{r l}
  \text{Alts}  & [\texttt{LInline (ACon 1 0 False, AFun ands\#1)}] \\
  \text{Spine} & [\texttt{ARG x},\ \texttt{ARG xs}] \\
  \text{Heap}  & \{\}
\end{array}\ \right]
\end{array}
$$

Results

Let’s take a look at how this change impacts the code size for each of our benchmarks, accounting for the increased template size.

The results are encouraging! No benchmark grows overall, despite the increased template size. OrdList, an example with many two-way case expressions with trivial alternatives, shrinks from 18330 bits to just 10846 bits, a reduction of roughly 41%! Inlining trivial case alternatives is an effective way to reduce the memory requirements of our reduction core, which is currently the most limiting factor of our implementation.