Inlining simple case alternatives
In this post, we try to reduce the code size of compiled F-lite programs, since we are currently limited by the memory resources of our Reduceron-inspired processor architecture.
Every case expression in an F-lite program generates multiple templates:
- At least one to evaluate the case scrutinee, and…
- At least one per case alternative
The alternatives must be mapped to a contiguous region of template memory, since case expressions are compiled down to a case table. A case table is represented as a single offset into template memory; the alternative to run is selected by adding the tag of the case scrutinee, once evaluated to WHNF, to that offset. This keeps the case table very small in the template format, but it requires each alternative to occupy at least one (quite large!) template regardless of how simple it is.
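As a minimal sketch of that lookup (the naming here is ours, not the compiler's):

altTemplate :: Int  -- base address of the case table in template memory
            -> Int  -- constructor tag of the scrutinee, once in WHNF
            -> Int  -- address of the template implementing that alternative
altTemplate base tag = base + tag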
Our proposal is to encode trivial case alternatives inline within the case table itself, minimising overall code size. To keep the case tables small enough to be appealing, we only offer this special case for two-way choices.
Here, a trivial alternative is one whose body is just a single atom plus a number of elements to pop off the stack. Inlining these does not gain us any run-time performance directly, but it can substantially reduce the total number of templates required for a program. We will see scenarios with up to a 43% improvement in template count.
Original case handling
Let’s recap how case expressions are handled in Reduceron by looking at a small example. We define a binary and function and use it in a specialised fold over lists of booleans, called ands.
and False x = False;
and True x = x;
ands Nil = True;
ands (Cons x xs) = and x (ands xs);
The F-lite compiler will desugar the pattern matching to case expressions on the RHS. It then transforms case expressions of the form
$$\mathtt{case}\ e\ \mathtt{of}\ \lbrace C_0\ \overrightarrow{x_0}\ ;\ \ldots\ ;\ C_m\ \overrightarrow{x_m} \rbrace$$
into a series of templates, where each case alternative is lifted out into its own template. More formally, the case expression becomes:
$$ \begin{array}{l} e\ \langle \mathtt{FUN}\ \mathit{alt}_0\ ,\ \ldots\ ,\ \mathtt{FUN}\ \mathit{alt}_m \rangle\ \overrightarrow{v} \end{array} $$
assuming that v⃗ includes all free variables in any case
alternative and each new function alti implements the original
alternative with the extra v⃗ arguments. These case tables
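Concretely, for our example this lifting gives something like the following. This is only a sketch in the notation above rather than valid F-lite source, and the names of the lifted alternative functions are our own; in the Reduceron scheme the scrutinised constructor supplies its own fields (such as x and xs below) to the alternative it selects.

and b x = b ⟨FUN and#False, FUN and#True⟩ x;
and#False x = False;
and#True x = x;

ands xs = xs ⟨FUN ands#Cons, FUN ands#Nil⟩;
ands#Cons x xs = and x (ands xs);
ands#Nil = True;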
These case tables are lifted out to the Lut field of a template when in spinal position, or are attributed to a particular heap application. Our full program compiles down to a set of 5 templates after inlining:
$$ \begin{array}{l} \textit{sums#0}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [\texttt{FUN sums#3}] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{ARG xs}, & \texttt{INT 0} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \\ \textit{sums#1}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{CON 0 0} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \\ \textit{sums#2}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [\texttt{FUN sums#3}] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{ARG x}, & \texttt{INT 0} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \\ \textit{sums#3}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [\texttt{FUN sums#1}] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{ARG x}, & \texttt{ARG xs} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \\ \textit{sums#4} \left[\ \ \begin{array}{r l} \text{Alts}\ & [] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{CON 1} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \end{array} $$
Note that the order of the alternatives is tied to the alphabetic ordering of the constructor names being scrutinised: for lists we expect Cons then Nil, and for booleans we expect False then True. Each of these templates does very little work compared to the maximum computational power of a single template. Instead, we offer a different compilation scheme.
A new case table representation
Originally, case tables were compiled down to a single integer: the offset into the template memory for the base of the table. We extend this to optionally be a binary choice between two stripped-down atoms, represented as:
data LUT = LOffset Int          -- base offset into template memory (the original scheme)
         | LInline (Alt, Alt)   -- a two-way choice encoded directly in the case table

data Alt = AFun TmplAddr        -- non-trivial alternative: address of its full template
         | AInt Pops Val        -- trivial alternative: a small integer literal, plus a pop count
         | AArg Pops Index      -- trivial alternative: an argument reference, plus a pop count
         | ACon Pops Arity Tag  -- trivial alternative: a constructor, plus a pop count
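The field types are not spelled out in this post. Assuming they are simple synonyms for Int, and that the first Alt of an LInline pair answers tag 0 and the second tag 1 (matching the alphabetical ordering seen earlier), a rough sketch of how a LUT is consumed once the scrutinee's tag is known might look like:

type TmplAddr = Int   -- address in template memory
type Pops     = Int   -- number of stack elements to pop
type Val      = Int   -- small integer literal
type Index    = Int   -- argument index
type Arity    = Int   -- constructor arity
type Tag      = Int   -- constructor tag

-- Resolve a case table against the scrutinee's tag: the offset form behaves
-- as before (base + tag addresses a full template), while the inline form
-- yields a stripped-down Alt directly, with no extra template fetch.
resolveLut :: LUT -> Tag -> Either Alt TmplAddr
resolveLut (LOffset base)     tag = Right (base + tag)
resolveLut (LInline (a0, a1)) tag = Left (if tag == 0 then a0 else a1)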
One vital aspect for the success of the inlining technique is minimising the number of bits needed to represent an Alt. For our default parameters (SpineLen=6 and TmplAddr=10 bits) we can encode an Alt in 11 bits, at the cost of substantially limiting the range of representable integers. This feels like a reasonable trade-off, since we anticipate that trivial integer alternatives will usually be small values in a recursion's base case. Our 11-bit encoding is:
$$ \begin{array}{r l l l l} \mathtt{AFun\ a} \mapsto & 1 & a_9\ a_8\ a_7\ a_6 & a_5\ a_4\ a_3\ a_2\ a_1\ a_0 \\ \mathtt{AInt\ p\ v} \mapsto & 01 & p_2\ p_1\ p_0 & v_5\ v_4\ v_3\ v_2\ v_1\ v_0 \\ \mathtt{AArg\ p\ i} \mapsto & 001 & p_2\ p_1\ p_0 & i_4\ i_3\ i_2\ i_1\ i_0 \\ \mathtt{ACon\ p\ a\ t} \mapsto & 0001 & p_2\ p_1\ p_0 & a_2\ a_1\ a_0 & t_0 \end{array} $$
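As a sanity check of this layout, here is a hedged sketch of an encoder. Only the prefixes and field widths come from the table above; the function names, the packing into a Word16, and the Int field synonyms are our own choices.

import Data.Bits ((.&.), (.|.), shiftL)
import Data.Word (Word16)

-- Place a prefix's single 1 bit at the given position (10, 9, 8 or 7).
mark :: Int -> Word16
mark pos = shiftL 1 pos

-- Place the low w bits of x so that its least significant bit sits at pos.
field :: Int -> Int -> Int -> Word16
field x w pos = shiftL (fromIntegral (x .&. (shiftL 1 w - 1))) pos

-- Pack an Alt into the low 11 bits of a Word16, most significant bit first,
-- mirroring the rows of the encoding table.
encodeAlt :: Alt -> Word16
encodeAlt (AFun a)     = mark 10 .|. field a 10 0
encodeAlt (AInt p v)   = mark  9 .|. field p 3 6 .|. field v 6 0
encodeAlt (AArg p i)   = mark  8 .|. field p 3 5 .|. field i 5 0
encodeAlt (ACon p a t) = mark  7 .|. field p 3 4 .|. field a 3 1 .|. field t 1 0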
For our default parameters, including the new case alternative encoding for the spinal application and any CASE heap applications increases the template size from 470 bits to 493 bits, an increase of 4.9%. For this to be worthwhile, we would need the number of templates to fall, on average, by well over 5%.
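To make the break-even point explicit: if a program needs t templates under the old encoding and t′ under the new one, inlining only pays off in total bits when
$$ 493 \cdot t' < 470 \cdot t \quad\Longleftrightarrow\quad \frac{t'}{t} < \frac{470}{493} \approx 0.953, $$
i.e. a reduction in template count of at least roughly 4.7%.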
With these new case alternative constructs, we can compile our example program quite simply. We follow the same process, creating templates for each alternative. After inlining and the other usual transformations, we identify which two-way cases have at least one trivial alternative and encode them with the new LInline construct.
$$ \begin{array}{l} \textit{sums#0}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [\texttt{LInline (AFun sums#2, ACon 1 0 True)}] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{ARG xs}, & \texttt{INT 0} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \\ \textit{sums#1}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [\texttt{LInline (AFun sums#2, ACon 1 0 True)}] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{ARG xs}, & \texttt{INT 0} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \\ \textit{sums#2}\ \left[\ \ \begin{array}{r l} \text{Alts}\ & [\texttt{LInline (ACon 1 0 False, AFun sums#1)}] \\\\ \text{Spine} & [\begin{array}{c c c c c c} \texttt{ARG x}, & \texttt{ARG xs} \end{array}]\\\\ \text{Heap} & \left\{\right. \end{array} \right. \end{array} $$
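The selection step just described could look roughly like the following sketch. The Atom type and the shape of the inputs are our own simplification for illustration: we assume each candidate alternative is known by its template address, its pop count, and its spine atoms.

-- Simplified view of an alternative's compiled form (assumed for this sketch).
data Atom = FunAtom TmplAddr | IntAtom Val | ArgAtom Index | ConAtom Arity Tag

-- An alternative is trivial when its spine is a single pushable atom;
-- otherwise we keep a pointer to its full template.
toAlt :: TmplAddr -> Pops -> [Atom] -> Alt
toAlt _    pops [IntAtom v]   = AInt pops v
toAlt _    pops [ArgAtom i]   = AArg pops i
toAlt _    pops [ConAtom a t] = ACon pops a t
toAlt addr _    _             = AFun addr

-- Two-way case tables with at least one trivial side become LInline; anything
-- else keeps the offset form (here, the address of the first alternative,
-- assuming the pair is laid out contiguously).
inlineLut :: (TmplAddr, Pops, [Atom]) -> (TmplAddr, Pops, [Atom]) -> LUT
inlineLut (addr0, p0, s0) (addr1, p1, s1)
  | trivial a0 || trivial a1 = LInline (a0, a1)
  | otherwise                = LOffset addr0
  where
    a0 = toAlt addr0 p0 s0
    a1 = toAlt addr1 p1 s1
    trivial (AFun _) = False
    trivial _        = True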
Results
Let’s take a look at how this change impacts the code size for each of our benchmarks, accounting for the increased template size.
That’s encouraging! We never see a negative impact due to the increased template size. OrdList, an example with many two-way case expressions with trivial alternatives, shrinks from 18330 bits to just 10846! Inlining trivial case alternatives seems to work towards minimising the memory requirements of our reduction core, currently the most limiting factor of our implementation.