RFC-0039: Parser as composed Frame state machines

Summary

framec’s parser — the stage that turns a token stream into the system syntax tree — can be re-expressed as a composition of Frame state machines: a backbone (a coordinating machine whose states are the grammar’s major nonterminals) that delegates focused sub-problems to specialist oracle machines (sub-machines it consults for one bounded answer). Both terms are defined under Terminology. This extends the dogfooding discipline of RFC-0035 from scanning into parsing.

The organizing idea is stronger than “we can write the parser in Frame.” It is that the class of machine a sub-problem needs is a formal statement of how hard that sub-problem is. When a flat finite-state machine — often a single self-looping state — solves a piece of the parser, that machine is a constructive proof the piece is regular: no stack, no unbounded memory, no hidden recursion, no subtle cases. The places that genuinely need more — a stack to balance delimiters — are exactly the places that use push$ / pop$. The architecture therefore makes each component’s complexity visible and labeled, certifies that the simple parts are provably simple, and scales cleanly as the language matures: a construct that grows harder adopts a richer machine, and that adoption is itself the signal that its complexity rose.

Terminology

This RFC uses three terms beyond everyday programming vocabulary. The first two were coined here and are now promoted to the glossary (backbone, oracle); the third is standard computer-science theory, glossed for convenience.

  • Backbone — the top-level parser machine. Its states correspond to the grammar’s major nonterminals (a section, a state, a handler); it drives the parse forward and hands focused sub-problems off to oracles. Called a backbone because it is the structural spine the specialists attach to.
  • Oracle (specialist) — a small, focused machine the backbone consults to answer one bounded question, or to parse one sub-structure — e.g. “does a parameter list open here?” or “capture this balanced expression.” It returns a result (a verdict), and the backbone transitions on that verdict without needing to know how the oracle reached it. The name is borrowed from the theoretical-computer-science oracle machine: a machine that may consult a sub-machine treated as a black box. Concretely, an oracle is just another Frame system the backbone instantiates and calls — exactly as today’s state-variable parser instantiates the balanced-expression scanner.
  • Machine class / the hierarchy — the Chomsky hierarchy orders machines by power. A finite-state machine (recognizing a regular grammar) keeps no memory beyond its current state. A pushdown machine (a context-free grammar) adds a stack — what you need to balance arbitrarily nested delimiters. Richer classes add more. Choosing the weakest machine that solves a problem places an upper bound on how complex the problem is; that bound is the proof this RFC leans on.

Motivation

Frame spans the hierarchy. A plain Frame machine — states and transitions, no stack — is a finite automaton. Add the runtime state stack via push$ / pop$ and it is a pushdown automaton: push$ saves the current compartment (an activation record), -> pop$ restores it. So a single language expresses both “this is regular” and “this is context-free,” and which one you wrote down is a claim about the problem.

Choosing the weakest sufficient machine is a proof. In the Chomsky hierarchy, regular ⊂ context-free ⊂ recursively-enumerable. If a problem can be solved by a finite-state machine, it is regular — and exhibiting that machine proves it, the way a short program proves a task is computable. A solution written as an FSM certifies, checkably, that the problem it owns “has no subtleties”: nothing it must remember beyond its current state, nothing it must recurse over. That certificate is worth as much as the code.

The dogfooding discipline (RFC-0035). When a part of the compiler is a state machine, framec writes it as a Frame system and generates the Rust. Scanning is already dogfooded across every target. The parser is the largest front-end stage still written as conventional code — and it is a particularly clear case, for a reason this RFC makes precise.

Why now. A recent change made the domain-section parser stop hand-rolling its own balanced-expression scan and instead reuse the shared scanner machine the state-variable grammar already uses — fixing a multi-line-literal parsing bug as a direct consequence. That is concrete evidence the leaf-oracle pattern pays off. This RFC generalizes it.

The principle: machine class documents complexity

Read the parser through the hierarchy and its structure becomes self-describing:

  • Regular (finite-state) — the backbone. The descent system → section* , machine → state*, state → member* is repetition without self-nesting. Each such nonterminal is a self-loop: a state that consumes one item and transitions to itself until the section ends. A self-loop is the most legible structure in the language — it says, at a glance, “a flat list of these, nothing nested.” That it suffices is the proof that the backbone grammar is regular.

  • Context-free (pushdown) — the leaves. The one genuinely context-free job in framec’s parsing is balancing delimiters — matching ()[]{} inside an expression or initializer that may nest arbitrarily. That, and only that, is where a stack is required, and it lives in the scanner oracles (e.g. the balanced-expression scanner the domain and state-variable parsers already share), which use the runtime stack for exactly this.

The payoff is that the architecture draws the complexity boundary on the page. A reader sees finite-state machines for the regular skeleton and stack-using machines for the context-free leaves, and needs no further argument about which parts are simple — the machine class is the argument. “This part has no hidden state” stops being a comment and becomes a property you can read off the diagram.

The design

Two shapes realize this; framec’s grammar makes the first the natural one.

Composed flat machines (the backbone today)

A small tree of per-nonterminal Frame systems, each a 2–3 state self-loop, each driving a shared token cursor (peek / advance / check + the native-mode switch), composed by instantiation — the proven pattern where the state-variable parser already news up the balanced-expression scanner.

@@system MachineParser {                  // machine → state*
    interface: parse()
    machine:
        $States {
            $>() {
                if self.cur.at_section_end() { -> $Done }
                else {
                    var st = StateParser::new(self.cur)   // delegate one state
                    st.parse()
                    self.states.push(st.result)
                    -> $States                            // self-loop = "a flat list"
                }
            }
        }
        $Done { }
    domain: cur: TokenCursor = ...  states: Vec<StateAst> = Vec::new()
}
@@system StateParser {                     // state → header + member*
    interface: parse()
    machine:
        $Header { parse() { /* name, optional parent, optional params via an oracle */ -> $Body } }
        $Body {
            $>() {
                if self.cur.check(RBrace) { -> $Done }
                else { /* dispatch one member to its oracle/child */ -> $Body }   // self-loop
            }
        }
        $Done { }
}

Each machine is flat because its grammar is flat; the nesting (machine → state → handler) is carried by composition between machines. The balanced-delimiter leaves are the only components that reach for push$ / pop$. The shape of the system is the shape of the complexity.

Single stack machine (reserved for genuine recursion)

A construct that is actually self-recursive — where a nonterminal contains itself to unbounded depth — is written as one machine that uses push$ to descend and -> pop$ to ascend. framec’s grammar has no such construct today (state parentage is a reference, not textual nesting; expressions are carried as opaque native code). So this shape is held in reserve — and the day a construct needs it is the day the language gains a provably context-free piece.

Evolution: scaling by machine class as the language matures

This is where composition earns its keep over the long run. Frame and framec will grow — and each new construct slots into the architecture at the machine class its complexity demands, with no change to the surrounding pattern:

  • A new flat section or member kind (more interface forms, new handler shapes, additional declarations) is another self-loop, or one more arm in an existing dispatch. It stays regular; the proof of simplicity is preserved for free.
  • A construct that introduces real nesting — should framec ever parse expressions (expr → expr op expr, nested calls and parentheses), nested type syntax, or textually nested states — gets a machine that uses push$ / pop$. Adopting the stack is not a workaround; it is the honest, visible declaration that this construct is context-free, recorded in the one place a reader already looks. The complexity is localized to that component; the backbone stays regular.
  • A construct that needs richer recovery or unbounded lookahead would, by the same logic, declare itself by the machinery it requires — and would be a signal worth scrutinizing in review, because most of a language should stay in the lower classes.

The architecture thus turns the Chomsky hierarchy into a maintenance invariant: components advertise their complexity by the machine they are, new complexity is contained rather than diffused, and “most of this compiler is finite-state” remains a property you can verify by looking — not a hope. As the language matures, the diagram of the parser doubles as a map of where its genuine complexity lives.

Recommendation

The key words MUST, SHOULD, MAY are to be interpreted as in RFC 2119.

  1. framec SHOULD converge the parser’s remaining inline sub-scanners (attribute headers, state-variable declarations, context constructs, call-argument lists) onto the existing dogfooded scanner machines — behavior-preserving. This realizes the leaf-oracle layer and removes the class of duplication that recently produced a parsing bug.
  2. framec SHOULD express the parser backbone as composed flat Frame machines (the self-loop design above), establishing the pattern on one section first (the machine → state → handler subtree) so the idiom, the token-cursor contract, and the oracle interface are settled before the remaining sections follow. Each step is gated on byte-output parity against the differential matrix and fuzz corpus.
  3. New parser components MUST be written at the weakest machine class that solves them — a flat machine where the sub-grammar is regular, a stack-using machine where it is genuinely context-free — so the machine class continues to document complexity truthfully.

The first section is the proof of the pattern; the remainder is mechanical repetition of a now-legible idiom. The target is a major or minor release, never a patch.

Spike validation

A throwaway spike expressed the machine → state → handler subtree as a single flat Frame backbone (states $States, $StateHeader, $StateBody, $Done), compiled it to Rust, and ran it over a representative token stream. It parsed the structure correctly using only self-loops and no push$ / pop$ — confirming, executably, that this part of the grammar is regular. framec’s codegen handled transitions inside native match arms cleanly, and the parse()-drives-a-transition-chain shape worked end to end.

The spike deliberately stubbed one thing, and that is where the real design effort lay: it consumed a pre-tokenized stream with handler bodies already collapsed to a single token. The full implementation had to settle the token-cursor contract over the real lexer and the native-mode body handoff — how the backbone drives the tokenizer and delegates { … } bodies to the body-capture oracle. Both are now resolved in the shipped implementation (see below); the spike is superseded by the real backbone and was not committed.

Implementation status

Both halves shipped on branch rfc-0039-parser-fsm, in parity-gated increments. Each preserved byte-identical output against the per-backend snapshot tests (which compile the generated code), backed by differential and absolute unit tests, with clippy + cargo fmt --all clean at every step.

B-half — the parser’s entire token-level outer grammar runs as one Frame backbone (SystemBackbone), all flat self-loops, no push$/pop$:

Level States Commit
Sentinel + section loop $Start, $Sections, $Done (Stage 0/1 — earlier on branch)
Machine state loop $Machine (Stage 1)
State header + body $StateHeader, $StateBody c0e5823
Interface method loop $Interface 97109be
Actions / operations loops $Actions, $Operations a645f14

Each terminal delegates to an existing pub(crate) oracle on Parser (take_state_header, parse_interface_method, parse_action, parse_operation, parse_state_var_decl, parse_{enter,exit,event}_handler). The recursive Parser::parse_system is retained as the differential-parity reference — a set of tests parse the same source both ways and assert an identical SystemAst.

A-half — every hand-rolled scanner in the parser delegates to a dogfooded FSM:

Was Now Commit
parse_domain inline @@[…] walk AttributeScannerFsm (scan_attribute) c92dc26
segmenter identify_pragma (129 lines) AttributeScannerFsm (scan_attribute) b143f6e
call_args find_matching_close_paren / find_top_level_comma_or_end ExprScannerFsm cdb10b1

ExprScannerFsm was generalized into the single balanced-delimiter primitive via configurable terminator flags (stop_{semicolon,newline,comma,close_paren}) defaulting to the original ;/\n behavior, so its assignment-RHS consumers (domain = init, @@:return, $.var) are byte-unchanged.

Remaining: the recursive Parser::parse_system reference is intentionally kept until the change soaks across the 17-language matrix + fuzz corpus, then it can be retired. The domain: section’s outer line structure (indentation, comment-lines, field-name layout) stays a byte-walk by design — it is domain-specific layout, not a balanced-delimiter or attribute scan, so it is outside both the backbone (token-loop) and leaf-oracle (delimiter-scan) tracks. If desired, the backbone / oracle terms can be promoted into glossary.md.

Every new component is written at the weakest sufficient machine class (Recommendation 3): flat where the sub-grammar is regular, stack-using only where it is genuinely context-free — and nothing in framec’s outer grammar turned out to need a stack.

A consequence of B1: conversion had to be top-down

Because a backbone owns the parser (B1 — it can’t borrow it; Frame domain fields hold no lifetimes), only one backbone can hold the parser at a time, and a backbone cannot be invoked from inside a &mut self parser method (you can’t move self out of a borrow). So the migration could not start at an inner procedure (e.g. the machine loop); it started at the outermost loop — the system-section loop — invoked from the entry point that owns the parser by value, and proceeded downward. Nested grammar levels became more states of the one backbone system (distinct state-types + self-loops — no push$/pop$, since the nesting is fixed-depth), not separately-owning sub-machines. This is exactly how $Sections$Machine$StateHeader/$StateBody are organized: one SystemBackbone system, one owned parser domain field, many states.

Examples

Illustrative — schematic Frame, not compiled. The self-loop as a readable certificate of “a flat list, nothing nested”:

$Members {
    $>() {
        if self.cur.at_end() { -> $Done }
        else { self.collect_one(); -> $Members }   // regular: one state, no stack
    }
}

The reserved stack shape, for a construct that is actually recursive:

$Expr { open_group() { push$ -> $Expr }            // descend into nesting
        close_group() { -> pop$ } }                 // ascend — context-free, declared

Alternatives

  • Keep the backbone as conventional recursive descent. Workable, and what ships today. It does not make the complexity boundary legible the way composed machines do, and it leaves the leaf-scanner duplication in place. Adopting the oracle layer (recommendation 1) is valuable independently and SHOULD happen regardless.
  • One machine per grammar production, no conventional code anywhere (“machines all the way down”). Rejected: it fragments at scale and obscures rather than clarifies. The recommended design is compositional — machines where they make structure clear, conventional code at the leaves (opaque native bodies) where there is no grammar to model.
  • A single stack machine for the whole parser. Rejected as the default: framec’s backbone is regular, so modelling it with an explicit stack would hide its simplicity behind machinery it does not need — the opposite of the goal. The stack shape is reserved for genuinely recursive constructs.

Migration

Internal architecture only — no user-visible syntax or behavior change. Source-additive. Each step is behavior- and byte-output-preserving, validated by the existing differential matrix, snapshot tests, and fuzz corpus.

Unresolved questions

  • Resolved. The oracle interface contract: a backbone owns the Parser (and through it the lexer) as a domain field and calls peek/advance/check plus the take_*/parse_* oracle methods directly from its handler actions; the small bounded lookahead framec needs is served by the lexer cursor, with no restore machinery required. Settled by the system-section backbone and unchanged for every level below it.
  • Resolved. The backbone and oracle (specialist) terms are promoted into the glossary.
  • Resolved. The leaf-oracle convergence is tracked here (A-half above), not folded into RFC-0035.
  • Open. When to retire the retained recursive Parser::parse_system reference — after the change soaks across the matrix + fuzz corpus.

References