RFC-0029: Fuzz infrastructure status, deferred work catalog, and next-priority recommendations

Status: Draft (Status report + forward-looking catalog)
Author: Mark Truluck <mark.truluck@cogiton.com>
Created: 2026-05-18
Closes (supersedes): Roadmap #172 — “Fuzz runtime exec pipeline (new infrastructure)”

Framing (read this first). Roadmap #172 framed the fuzz harness as “produces .frm source + compiles via framec; doesn’t run the generated programs.” That framing is months stale: the harness already executes generated programs across all 17 backends in two complementary modes. This RFC documents what’s actually built, what’s been deliberately deferred, and what remains genuinely actionable. It does not propose new infrastructure; instead it converts #172 into an honest catalog so future work picks the right sub-task at the right time. RFC-0028 followed the same pattern for the in-process API claim.

Summary

The fuzz harness in framec-test-env/fuzz/ is a mature, multi-phase runtime-execution corpus:

35,740 generated cases across 20+ phase directories
14 phase generators (gen_perm.py, gen_persist.py, etc.) producing per-phase Frame source
Two execution modes:
- Differential trace (diff_harness/run_fuzz.py, phases 2-7): transpile + compile + run each case on every applicable backend, byte-diff the trace stream against the Python oracle. Format contract in TRACE_FORMAT.md. Defects surface as runtime divergence.
- Assertion-driven (run_<phase>.sh, phases 8-14, 21, 24): each case ships its own PASS/FAIL driver; the runner compiles + executes via the per-language toolchain (python3, node, cargo build, javac+java, kotlinc, gcc/g++, erlc+erl, etc.) and greps stdout for PASS: / FAIL:.
Meta-runner (run_all.sh) with --tier=smoke|core|full / --tag=<comma-list> / --lang=<name> flag contract, parallel-by-lang.
17-backend coverage for most phases (some phases skip inapplicable backends — e.g. Go for async).
DEFECTS.md — closed-defect log; D1 through D18 found and fixed by the fuzz harness in 2026-04 / 2026-05.
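The two execution modes above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in diff_harness/run_fuzz.py or the run_<phase>.sh runners: the function names, the sample trace lines, and the rule that a FAIL: marker wins over a PASS: marker are all assumptions. The real trace line format is defined in TRACE_FORMAT.md.

```python
from typing import Dict, Optional


def first_divergence(oracle: bytes, candidate: bytes) -> Optional[int]:
    """Return the 0-based index of the first differing line, or None if byte-identical."""
    # Split on b"\n" only, so a stray "\r" or a missing final newline still counts as a diff.
    oracle_lines = oracle.split(b"\n")
    candidate_lines = candidate.split(b"\n")
    for index, (expected, actual) in enumerate(zip(oracle_lines, candidate_lines)):
        if expected != actual:
            return index
    if len(oracle_lines) != len(candidate_lines):
        return min(len(oracle_lines), len(candidate_lines))
    return None


def differential_verdicts(traces: Dict[str, bytes], oracle: str = "python_3") -> Dict[str, Optional[int]]:
    """Differential mode: compare every backend's trace against the oracle backend's trace."""
    reference = traces[oracle]
    return {
        backend: first_divergence(reference, trace)
        for backend, trace in traces.items()
        if backend != oracle
    }


def assertion_verdict(stdout: str) -> str:
    """Assertion mode: scan driver stdout for PASS: / FAIL: markers (assumed: FAIL wins)."""
    lines = stdout.splitlines()
    if any("FAIL:" in line for line in lines):
        return "fail"
    if any("PASS:" in line for line in lines):
        return "pass"
    return "no-verdict"


# Hypothetical trace content; the real format lives in TRACE_FORMAT.md.
traces = {
    "python_3": b"enter A\nevent go\nenter B\n",
    "javascript": b"enter A\nevent go\nenter B\n",
    "typescript": b"enter A\nevent go\nenter C\n",
}
print(differential_verdicts(traces))  # {'javascript': None, 'typescript': 2}
print(assertion_verdict("PASS: case_0001\n"))  # pass
```

The differential check needs no per-case expectations because the oracle supplies them. The assertion check only knows what its driver chose to assert.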

The roadmap entry’s framing (“major new infra; multi-week”) was correct when it was written, months ago. As of 2026-05-18, the infrastructure exists, has caught 18 distinct framec defects in production, and runs as part of the developer workflow (not yet in CI; see § “CI integration question” below).

What the harness covers today

Per FUZZ_PLAN.md § Phases 1-10 status (2026-04-28):

Phase	What it exercises	Cases	Backends	Status
1	Trace harness foundation	—	—	shipped
2	`@@persist` save/load	1,377	17	clean
3	`@@:self` recursive dispatch	2,646	16 + Erlang 54 of 162	clean
4	HSM parent/child dispatch	1,377	17	clean
5	Operations (interface call shapes)	459	17	clean
6	Async dispatch	220	11 wired	clean
7	Multi-system	168	14	clean
8	Negative — E-code triggering	18	17	clean
9	Nested syntax (curated regression)	85	17	clean
10	Expression cross-product	7,820	17	clean (D1/D2/D3 closed)
11	Stmt-pair	1,700	17	clean
12	Ctrl-flow	4,800	17	clean
13	Shadow (var scoping)	1,700	17	clean
14	HSM-cross	1,360	17	clean
15 (wave 3)	State-args typed (int/str/bool/float)	7 patterns × 17	17	clean — D4-D8 fixed
17 (wave 2)	Multievent + self-call mid-handler	2,040	17	clean
18 (wave 1)	Stress / boundary (N=10/N=100)	102	17	clean
19 (wave 4)	Push/pop depth-3 + HSM forward	2,380	17	clean
20 (wave 1)	const + `@@:system.state`	480	16 (Erlang skipped)	clean
21 (wave 1)	Arithmetic edges	680	17	clean
24 (wave 7)	Persist × { state-args, enter-args, HSM, push, multi-event, async, multi-system }	75 + per-wave	17 incremental	clean — D16/D17/D18 fixed

Totals: ~30,000 case-runs passing across the 17-backend matrix. 18 framec defects (D1-D18) caught by the harness and fixed in framepiler over April-May 2026.

What’s been deliberately deferred

Per FUZZ_PLAN.md § Remaining roadmap (2026-04-29):

Phase 22 — Error / panic recovery: SKIP (documented per-backend semantics, not framec bugs)

The behavior of generated code when a handler raises (Rust panic, Java exception, Python raise) is per-target-language semantics — not a framec correctness question. A per-language guide section covering panic propagation through dispatch is more useful than a fuzz corpus. Not a framec defect surface.

Phase 23 — True concurrency: SKIP (out-of-scope by design)

Frame is single-threaded by spec — a single system instance dispatches events serially. Multi-threaded access is the user application’s responsibility (lock externally). Not a framec concern.
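A minimal sketch of what “lock externally” could look like on the application side, in Python. The Counter class is a hypothetical stand-in for a generated system, not framec output, and LockedSystem is an illustrative wrapper name rather than anything Frame provides.

```python
import threading


class Counter:
    """Hypothetical stand-in for a generated system; not framec output."""

    def __init__(self):
        self.count = 0

    def tick(self):
        # Read-modify-write: unsafe if two threads interleave here.
        current = self.count
        self.count = current + 1


class LockedSystem:
    """Application-side wrapper that serializes every call into one system instance."""

    def __init__(self, system):
        self._system = system
        self._lock = threading.Lock()

    def call(self, method_name, *args):
        # Only one thread at a time may dispatch into the wrapped system.
        with self._lock:
            return getattr(self._system, method_name)(*args)


def hammer(locked, n_threads=4, n_events=1000):
    """Send n_events tick() calls from each of n_threads threads."""
    def worker():
        for _ in range(n_events):
            locked.call("tick")

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


system = Counter()
hammer(LockedSystem(system))
print(system.count)  # 4000
```

Because the lock lives outside the system, the system itself still sees strictly serial dispatch, which is the only mode the spec promises.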

Phase 18 wave 2 (N≥1000 + wide domain + deep HSM): DEFERRED

The wave 1 endurance tests at N=10/N=100 cover the basic axes; N≥1000 would require language-native loop emission in the generators, which is significant generator work for marginal additional defect yield. Re-evaluate if/when a long-run defect is suspected.
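To illustrate why N≥1000 implies generator work, here is a hedged sketch of two ways a generator could emit a driver for one target. It assumes the wave 1 drivers repeat the call N times in the emitted source, which is what the need for “loop emission” suggests but is not confirmed here. The function names and the `system.tick()` call shape are hypothetical.

```python
def emit_unrolled(event: str, n: int) -> str:
    """Emit one call per line: target-agnostic in shape, but the source grows linearly with n."""
    return "\n".join(f"system.{event}()" for _ in range(n)) + "\n"


def emit_python_loop(event: str, n: int) -> str:
    """Emit a native loop. This covers the Python target only; every other
    backend would need its own loop syntax emitted by the generator."""
    return f"for _ in range({n}):\n    system.{event}()\n"


unrolled = emit_unrolled("tick", 1000)
looped = emit_python_loop("tick", 1000)
print(len(unrolled.splitlines()))  # 1000
print(len(looped.splitlines()))    # 2
```

The unrolled form is practical at N=10 or N=100. The loop form keeps the emitted source small but has to be written once per backend, which is the cost this deferral weighs against the expected defect yield.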

Phase 20 wave 2 (const-as-transition-arg, const-from-system-param, `@@:system.state` in conditions, Erlang atom normalization): DEFERRED

Wave 1 (480 cases) covered the basic axes clean. Wave 2 axes are plausibly defect-bearing but speculatively so. Land when a real const-related issue surfaces.

Phase 21 wave 2 (type coercion): DEFERRED — target-side behavior

Wave 1 (680 cases) covered arithmetic edges. Wave 2 (int ↔ str / float, signed ↔ unsigned) would document per-target coercion semantics, not framec correctness. Same reasoning as Phase 22.

Persist × { list-typed state-args }: DEFERRED

Phase 15 wave 3 typed state-args list candidates were noted with “unclear value-density.” Could be opened if a list-typed-persist defect appears.

Architectural — nested system instance auto-rehydrate on JSON persist: OUT OF SCOPE

Per Phase 24 wave 7 finding (2026-04-30): JSON-based persist backends lose class info when a nested system is saved as a domain field. Primitive round-trip works, but method invocation post-restore does not. Auto-rehydration would need either a Frame-level type registry pass or per-backend custom serializers. Documented limitation; test 82 deliberately tests only primitive round-trip. Not a fuzz problem to solve.
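The limitation can be reproduced in plain Python without any generated code. In the sketch below, Inner and Outer are hypothetical stand-ins for a nested system and its owner, and the save/restore functions only approximate what a JSON-based persist path does. The small registry at the end shows the general shape of the “type registry” option, not a proposed design.

```python
import json


class Inner:
    """Hypothetical nested system; stands in for a generated class."""

    def __init__(self, level=0):
        self.level = level

    def bump(self):
        self.level += 1
        return self.level


class Outer:
    def __init__(self):
        self.inner = Inner(level=3)  # nested system stored as a domain field


def save(outer) -> str:
    # Only the field data is serialized; the Inner class identity is not recorded.
    return json.dumps({"inner": vars(outer.inner)})


def restore(blob: str) -> Outer:
    outer = Outer.__new__(Outer)  # bypass __init__, as a restore path would
    outer.inner = json.loads(blob)["inner"]  # comes back as a plain dict
    return outer


restored = restore(save(Outer()))
print(restored.inner["level"])          # 3
print(hasattr(restored.inner, "bump"))  # False

# One possible rehydration shape: a name-to-class registry (illustrative only).
TYPE_REGISTRY = {"Inner": Inner}


def rehydrate(type_name: str, fields: dict):
    cls = TYPE_REGISTRY[type_name]
    obj = cls.__new__(cls)
    obj.__dict__.update(fields)
    return obj


print(rehydrate("Inner", restored.inner).bump())  # 4
```

The primitive value survives the round trip, but the methods do not. That is the boundary test 82 is described as respecting.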

Genuinely-remaining executable work

These are the items where future work would have measurable defect-finding value:

A. Persist cross-product completion: extend test 80-87 patterns to remaining 14 backends

Phase 24 (gen_persist_x.py) shipped wave 1+2 for 3 backends (python_3, javascript, typescript) covering 5 patterns: state-args, enter-args, HSM-state-args, push-state-args, multi-event-cross-save. Wave 3+ would extend to the remaining 14 backends. Per the “Wave 3+ candidates” note: Rust, Java, Erlang are highest priority because they received the most invasive parts of the persist bulk fix and are most likely to harbor remaining serialization defects.

Effort: ~2-3 hours per backend (each has its own JSON serialization shape — record builds for Erlang, derive(Serialize) for Rust, Jackson for Java, etc.) × 14 backends = ~30-40 hours, realistically a 1-2 week part-time effort.
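As a quick check on the estimate above, the per-backend range multiplies out as follows (the helper name is illustrative):

```python
def effort_hours(per_backend_low: int, per_backend_high: int, backends: int):
    """Multiply a per-backend hour range by the number of backends."""
    return per_backend_low * backends, per_backend_high * backends


low, high = effort_hours(2, 3, 14)
print(low, high)  # 28 42
```

The exact product is 28 to 42 hours, which the text rounds to ~30-40.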

Defect yield: Probably 3-5 framec defects per the historical ratio (D4-D8 + D16-D18 all came from persist cross-product runs).

Trigger to start: a real persist-related defect that the existing 3-backend coverage didn’t catch. No urgency today.

B. Phase 3 (selfcall) Erlang completion: 54 of 162 cases

Erlang’s selfcall coverage is partial because some @@:self shapes produced erlc-incompatible code prior to Wave-1 phase 17 fixes (D1). The remaining 108 cases may now compile cleanly.

Effort: ~1-2 hours to re-run the deferred subset, identify which now pass, and any remaining failures.

Defect yield: Low if the D1 fix was comprehensive; higher if there’s a different class of issue lurking.

Trigger to start: bandwidth, or specific Erlang selfcall issue surfacing.

C. CI integration — fuzz smoke as a gate

RESOLVED 2026-05-18 by RFC-0031. This RFC originally argued the trigger to start was “a defect that gets through the matrix but that fuzz would have caught — none known today; this is preventative.” Within days of writing that, the RFC-0030 fuzz repair surfaced five such defects (Rust clone, C++ factory bypass, E606 assignment-form gap, Erlang split_inline_arm × 2). All would have shipped to users if the matrix had stayed the only signal. The preventative argument is no longer hypothetical.

RFC-0031 Layer 2 codifies fuzz-smoke (plus matrix + framec-local checks) as a pre-merge gate on every push. Implementation tracked as roadmap #437.

(Original analysis retained below for context.)

The fuzz harness today runs developer-side. CI runs only the matrix (5,455 fixtures, 143s wall). run_all.sh --tier=smoke takes ~2 min per FUZZ_PLAN.md § Wall clock budget targets — plausible CI gate.

Effort: ~2-4 hours: Docker image with framec + all 17 toolchains for fuzz exec (mostly already in the matrix image), GitHub Actions workflow, smoke-tier wiring.

Defect yield: ~~Indirect — wouldn’t find new bugs~~ — revised 2026-05-18: surfaces real defects on every breaking RFC, prevents the corpus-rot pattern that this RFC documented as already in progress.

D. Diagnostic-axis-by-axis test discipline (process improvement)

Per FUZZ_PLAN.md § Diagnostic-axis lesson: Phase 19 wave 3 confirmed that isolated single-axis tests catch contract bugs that cross-product tests miss. The Erlang frame_stack defect took until wave 3 to surface because waves 1+2 always exercised push/pop combined with other state shapes; the bug only triggered when push/pop was tested in isolation.

Action: when adding new waves to existing phases, include at least one test per invariant the runtime promises to preserve, exercising ONLY that invariant. Document in FUZZ_PLAN.md so future wave authors follow the pattern.
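One way to picture the discipline: alongside the cross-product cases, a wave also generates cases that vary exactly one axis while every other axis stays at a minimal baseline. The sketch below uses hypothetical axis names and values and is not taken from any gen_*.py script.

```python
from itertools import product

# Hypothetical axes for a wave; every value here is a "non-trivial" shape.
WAVE_AXES = {
    "push_pop": ["depth_1", "depth_3"],
    "state_args": ["int", "str"],
    "hsm": ["parent_child", "forward"],
}

# Hypothetical minimal setting for each axis (feature not exercised at all).
BASELINE = {"push_pop": "none", "state_args": "none", "hsm": "flat"}


def cross_product_cases(axes):
    """Every combination of non-trivial values: each axis is always combined with the others."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[name] for name in names))]


def single_axis_cases(axes, baseline):
    """One case per (axis, value), with all other axes held at the baseline."""
    cases = []
    for axis, values in axes.items():
        for value in values:
            case = dict(baseline)
            case[axis] = value
            cases.append(case)
    return cases


combined = cross_product_cases(WAVE_AXES)
isolated = single_axis_cases(WAVE_AXES, BASELINE)
print(len(combined))  # 8
print(len(isolated))  # 6
print(any(case in combined for case in isolated))  # False
```

None of the isolated cases appear in the cross-product, which is how a bug that only triggers in isolation can survive several combined waves.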

Effort: process discipline, not new code. Add a paragraph to FUZZ_PLAN.md. ~15 min.

E. Per-language guides for fuzz patterns (TODO section in FUZZ_PLAN.md)

There’s a TODO section for per-language idiomatic-Frame guides. This is documentation work, not test infrastructure.

Effort: ~2-4 hours per language × 17 = 1-2 weeks if pursued.

Defect yield: zero — pure documentation.

Trigger: external contributor confusion, or onboarding cost becoming a real friction point.

CI integration question

RESOLVED 2026-05-18 by RFC-0031. Condition (a) of the original recommendation below (“a real regression slips past the matrix that fuzz-smoke would have caught”) was satisfied during the very next session. The RFC-0030 fuzz repair surfaced 5 framec defects (Rust clone, C++ factory bypass, E606 assignment-form gap, Erlang split_inline_arm × 2). None were caught by the matrix; all were caught by the fuzz harness once repaired. The “not urgent” framing turned out to be wrong.

RFC-0031 codifies fuzz-smoke as a pre-merge CI gate (Layer 2) alongside the matrix and framec-local checks. Implementation tracked as roadmap #437.

(Original question and analysis retained below for context. The “Recommendation: defer” was a mistake — the defects existed at the time; nobody had run the harness to notice.)

Today’s split:

Matrix (make test, 143s wall, 5,455 fixtures): runs on every commit via framec pre-commit hook (doc validator) and manual make test. Not in CI workflow per .github/. Catches fixture-corpus regressions.
Fuzz (run_all.sh, ~2 min smoke / ~45 min full): developer-side only. Not in CI. Catches structurally-generated edge cases the matrix doesn’t.

Question: should run_all.sh --tier=smoke run as a CI gate?

Arguments for:

2-minute budget is reasonable for a PR check
18 historical defects all caught by fuzz first — high signal
Catches regressions the matrix corpus misses by construction (the matrix is hand-written; fuzz is generated)

Arguments against:

Docker setup is heavy for CI (17 toolchain images, ~5-15 GB)
Smoke tier overlaps significantly with matrix coverage
Developers run it locally anyway when touching codegen

~~Recommendation~~ Mistaken recommendation: defer until either (a) a real regression slips past the matrix that fuzz-smoke would have caught, or (b) CI infrastructure is being re-architected for other reasons. Not urgent.

The mistake was treating (a) as a hypothetical-future condition rather than a present one. The defects were already in the code the day this RFC was written — fuzz just wasn’t being run to find them. RFC-0031 fixes the pattern by making fuzz-smoke a hard pre-merge gate, eliminating the “nobody ran the harness” failure mode entirely.

What this RFC recommends

Close roadmap #172 as superseded by this RFC. The “major new infrastructure” framing was correct months ago when it was written; it’s stale now. The infrastructure exists.
Don’t open #172 sub-tasks proactively. Each deferred item (A through E) has a clear “trigger to start” condition. Wait for the trigger.
Treat fuzz as developer-side preventative infrastructure for now. Run ./run_all.sh --tier=smoke locally before any codegen-touching commit; run --tier=full before any major codegen change (RFC milestone, backend rewrite, etc.).
Adopt diagnostic-axis-by-axis discipline (Item D above) for future fuzz wave authoring. Document the pattern in FUZZ_PLAN.md to make it a standing process rule.
If a real defect-density argument emerges for one of the deferred items (A: persist × 14 remaining backends; B: Erlang selfcall remainder), open a dedicated roadmap task with that item’s effort + yield estimate, not “complete #172.”
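For the local-run recommendation above, a small helper can keep invocations consistent with the run_all.sh flag contract (--tier=smoke|core|full, --tag=<comma-list>, --lang=<name>). This is a sketch under assumptions: the helper names are hypothetical, it only builds the argument list rather than running anything, and whether the runner accepts all three flags together is not confirmed here.

```python
from typing import List, Optional, Sequence

TIERS = ("smoke", "core", "full")


def run_all_argv(tier: str, tags: Sequence[str] = (), lang: Optional[str] = None) -> List[str]:
    """Build an argv list for run_all.sh following the documented flag shapes."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    argv = ["./run_all.sh", f"--tier={tier}"]
    if tags:
        argv.append("--tag=" + ",".join(tags))
    if lang is not None:
        argv.append(f"--lang={lang}")
    return argv


def tier_for_change(major_codegen_change: bool) -> str:
    """Smoke before any codegen-touching commit; full before a major codegen change."""
    return "full" if major_codegen_change else "smoke"


print(run_all_argv(tier_for_change(False)))
# ['./run_all.sh', '--tier=smoke']
print(run_all_argv(tier_for_change(True), lang="python_3"))
# ['./run_all.sh', '--tier=full', '--lang=python_3']
```

The resulting list can be handed to `subprocess.run` from the fuzz directory if a scripted wrapper is wanted.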

Drawbacks

Closing #172 may look like work was punted. Same caveat as RFC-0028: the work was done, just unlabeled. Pattern of stale roadmap entries has happened ≥5 times in the 2026-05-17/18 session (#402, #406, #435, #403, #171, #172). This is a memory-drift issue, not real backlog.
No active push to extend fuzz coverage. Per the deferral rationale, this is correct — speculative coverage extension has lower defect yield than reacting to defect surfaces.
CI not gated by fuzz. Accepted when this RFC was first drafted; since superseded by RFC-0031, which makes fuzz-smoke a pre-merge gate (implementation tracked as roadmap #437).

Unresolved questions

Is the Phase 22 / Phase 23 “skip” stable, or worth revisiting as Frame matures? Both are documented as out-of-scope today (Phase 22 = target-side semantics; Phase 23 = single-threaded by design). If Frame ever introduces explicit concurrency primitives, Phase 23 needs revisiting. Today: stay skipped.
Should the differential-trace harness (phases 2-7) absorb the assertion-driven phases (8-14)? Differential trace is strictly more thorough — byte-diff against Python oracle catches semantic divergence the assertion driver doesn’t. But the assertion driver was easier to write per-phase. Migration would be ~1-2 days per phase. Not urgent — both modes work, the duplication isn’t expensive.
Should cases_* directories be checked into git, or generated on-demand? Currently 35,740 cases × ~17 backend variants are committed. They’re regenerable from the gen_*.py scripts. Pros for committing: stable regression corpus, no generation-time. Cons: large repo, every git checkout touches many files. Net: current “commit them” is fine; flag if repo size becomes a problem.

References

framec-test-env/fuzz/FUZZ_PLAN.md — authoritative phase catalog + remaining-roadmap detail
framec-test-env/fuzz/TEST_INFRA_ROADMAP.md — runner contract (tier + tag + lang flags) + wall-clock budgets
framec-test-env/fuzz/TRACE_FORMAT.md — differential-trace line format contract (the “byte-identical stdout” spec)
framec-test-env/fuzz/DEFECTS.md — closed-defect log (D1-D18, 18 framec defects caught by fuzz)
framec-test-env/fuzz/diff_harness/run_fuzz.py — phase 2-7 differential trace runner
framec-test-env/fuzz/run_all.sh — meta-runner entry point
Roadmap #172 (closed by this RFC) — original framing as “major new infra”
RFC-0028 — parallel closure of the in-process-API stale-framing pattern