RFC-0012: Persist Stress Testing — Coverage, Limits, and Hash-Based Validation

Prompt Engineer: Mark Truluck mark@frame-lang.org
Status: Amendment shipped (Phases A–B + D, 2026-05-02 → 2026-05-03). Phase C (schema versioning + @@[migrate]) deferred to roadmap.
Created: 2026-05-01
Last design discussion: 2026-05-01

Status

In progress. This RFC captures design discussions held during the nested-system persistence rollout. The rollout itself shipped (5-deep nested persist works on all 17 backends — C and Erlang reached parity in the 2026-05-02 wave 8 push). The RFC tracks both the test discipline needed for the persist contract and the contract decisions themselves as they land.

Decided / in implementation: the quiescent contract for save_state (see “Resolved decisions” below). Hard cut, no soft warning. Implementation tracked as a separate work stream.

Still open: the rest of the questions in “Open contract questions” plus the test discipline build-out.

Resolved decisions

Quiescent contract for `save_state` (decided 2026-05-01)

save_state requires the system to be quiescent — defined as _context_stack empty (no event being dispatched, no handler in flight, no pending return).

Why: mid-event saves are inherently undefined today. Three concrete ambiguities:

Pending transition. Frame queues -> $New until the handler returns. Mid-handler, __compartment.state is still $Old. A snapshot here loses the queued transition; restore resumes from $Old with no record of the intent.
Partial @@:return. Some backends store the return value on the top context frame. Mid-handler save would persist a return value that no caller will ever consume.
@@:data["key"] per-call data. Lives on the context frame. Saving it captures partial intermediate state; restoring it makes no sense without restoring the call that owned it.

The quiescent contract eliminates the entire class of ambiguity.

Orthogonal: a non-empty _state_stack (push/pop) is fine while quiescent. _state_stack is the modal stack (push$/pop$); _context_stack is the call stack. Saving between interface calls after push$ has built a multi-level stack is normal, supported behavior. Quiescence only forbids saving mid-call.

Edge cases:

Constructor $>(). Initial-enter context is pushed and popped before constructor returns. Quiescent at user-visible time. ✓
Restore + __skipInitialEnter. Bypasses $>() entirely; no context ever pushed. Quiescent at end of restore. ✓
Async (Java CompletableFuture, Rust Future, etc.). Handler awaits → suspended → context still on stack. Save from a separate task during the await is non-quiescent and correctly errors.
Pop-with-transition. Both queued — mid-handler state. Quiescent check fires. ✓
Multi-system cross-call. Outer.tick() calls Inner.cycle(). Each system has its own _context_stack. Outer is non-quiescent (mid-tick) even when Inner is between its own calls. Saving Outer here correctly errors.

Error contract:

Code: E700
Name: system not quiescent
Per-backend mechanism:
- Native exceptions (Java/Kotlin/C#/Swift/Python/Ruby/PHP/Dart/JS/TS/Lua): throw FrameNotQuiescent (or per-backend convention) with code=E700, message="system not quiescent".
- Rust: return Result<String, FrameError>; FrameError::NotQuiescent.
- Go: return (string, error).
- C: return NULL and set an errno-equivalent / write to the error out-parameter (per existing C persist convention).
- Erlang: persist is sidecar — return {error, system_not_quiescent}.
- GDScript: return empty + push to error queue (per GDScript convention).
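
The guard and the Python-side error can be sketched as follows. This is a minimal, hand-written stand-in rather than generated code: FrameNotQuiescent, the E700 code, and the message come from the contract above, while Counter, push_mode, tick_and_save, and try_mid_event_save are hypothetical names used only for illustration.

```python
import json


class FrameNotQuiescent(RuntimeError):
    """Assumed Python error type for E700 (name per the contract above)."""

    code = "E700"

    def __init__(self):
        super().__init__("system not quiescent")


class Counter:
    """Hand-written stand-in for a generated Frame system (illustrative only)."""

    def __init__(self):
        self._context_stack = []   # call stack: one frame per in-flight event
        self._state_stack = []     # modal stack built by push$ / pop$
        self.state = "$Idle"
        self.count = 0

    def save_state(self):
        # The quiescent guard: reject any save while an event is in flight.
        if self._context_stack:
            raise FrameNotQuiescent()
        return json.dumps({
            "state": self.state,
            "_state_stack": list(self._state_stack),
            "count": self.count,
        })

    def push_mode(self):
        # Interface call that leaves a non-empty modal stack behind.
        self._context_stack.append({"event": "push_mode"})
        try:
            self._state_stack.append(self.state)
            self.state = "$Paused"
        finally:
            self._context_stack.pop()

    def tick_and_save(self):
        # Handler that (incorrectly) tries to snapshot itself mid-event.
        self._context_stack.append({"event": "tick_and_save"})
        try:
            self.count += 1
            return self.save_state()
        finally:
            self._context_stack.pop()


def try_mid_event_save(counter):
    """Return the error code raised by a mid-handler save, or None."""
    try:
        counter.tick_and_save()
    except FrameNotQuiescent as err:
        return err.code
    return None
```

Saving after push_mode() succeeds even though _state_stack is non-empty, while tick_and_save() fails with E700 because its own context frame is still on _context_stack.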

Backwards compatibility: hard cut. Pre-1.0, persist still being hardened. Mid-event save was always undefined — promoting it to a hard error is a stricter contract, not a behavior change for any correctly-written user code.

Implementation cost:

Codegen: ~3 lines per backend at start of save_state. ~45 lines total + per-backend error type definitions.
Tests: new test 88 exercising the error path. Mechanism varies per backend (probe via self.save_state() from inside a handler where Frame syntax allows it; manual context-stack injection otherwise). Skip backends where neither path works cleanly; document per-backend coverage.
Docs: persist-contract section in frame_runtime.md; per-language guide updates for the error type and idiomatic handling.

Test design (test 88):

Frame source has a handler that attempts self.save_state() from inside the handler body.
Native test calls the handler and asserts the runtime raises E700 / system not quiescent (per-backend error mechanism).
Backends where self.save_state() from a handler isn’t expressible via Frame’s escape hatches: skip with documented reason. Defer until those backends grow the syntax (or skip permanently).

This decision closes open contract question 4 below.

Background

Frame’s @@persist mechanism serializes a state machine snapshot to a JSON string (Python uses pickle) and rebuilds an instance from that string on load. Through the work in framepiler commits 5849b9a3 and earlier, persist now works across all 17 backends for primitive domain vars, state-args, push/pop, HSM, async, multi-event sequences, and (post cafdec8) nested system instances 5-deep.

Test corpus today:

tests/common/positive/primary/56_persist_* through 83_persist_5deep.*
Per-feature waves documented in framec-test-env/fuzz/FUZZ_PLAN.md
DEFECTS.md captures D1–D18, all closed

What we don’t have:

Property-based testing of the round-trip invariant
Adversarial input testing (corrupted JSON, wrong-class restore)
Schema-evolution testing (snapshot format compatibility)
A spec for cyclic / shared-reference graphs
A spec for mid-event save semantics

Hand-written tests catch known shapes. They cannot catch the “unknown unknown” — corner cases we didn’t think to write a test for. The question this RFC answers: how far can we push coverage, and what’s the cleanest mechanism to do so?

Discussion summary

This RFC captures a design discussion. Preserved verbatim where the rationale matters; condensed elsewhere.

Graph topologies — what shapes do we test?

Test 83 covers a linear chain (L1→L2→L3→L4→L5). That’s the simplest non-trivial topology. Untouched shapes:

Branching — one parent holds multiple distinct children (a: @@A() ; b: @@B()). Most generators assume one child slot. Likely to surface bugs in any backend that hand-rolls the field iteration.
Same-type siblings — a: @@Counter() ; b: @@Counter() — two instances of the same class as separate fields. Should round-trip independently with no aliasing.
Tree / fan-out — N children, each with subtrees. Stresses framework recursion limits.
Diamond / shared reference — A holds B and C; B and C both hold the same D. After restore, are B’s d and C’s d the same logical instance, or separate copies? Today: separate copies (JSON has no reference semantics). Worth a deliberate test that asserts the expected answer.
Cycle / back-reference — A holds B holds A. Naive serializers infinite-loop; pickle handles it via memoization; JSON-based backends will stack-overflow today. Worth testing the failure mode (graceful error vs. hang vs. crash).
Self-reference — A holds A directly. Sharper version of cycle.
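
The diamond case can be illustrated with plain JSON, independent of any Frame backend. In this sketch, dicts stand in for system instances and the field names (b, c, d, count) are made up; it only shows why a JSON round trip yields separate copies of a shared child.

```python
import json

# Plain dicts stand in for system instances; the field names are made up.
d = {"count": 0}
b = {"d": d}
c = {"d": d}
a = {"b": b, "c": c}

# Before the round trip, B and C hold the very same D.
shared_before = a["b"]["d"] is a["c"]["d"]

# JSON has no reference semantics, so D is written twice and read back twice.
restored = json.loads(json.dumps(a))
shared_after = restored["b"]["d"] is restored["c"]["d"]

# Mutating one restored copy no longer affects the other.
restored["b"]["d"]["count"] += 1
counts_after = (restored["b"]["d"]["count"], restored["c"]["d"]["count"])
```

A deliberate diamond test would assert exactly this divergence, so that a later move to memoization shows up as an intentional contract change.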

Cross-products with other Frame features

Persist × N is where bugs concentrate. Combinations done in existing waves: state-args (D5–D11, D15), HSM (test 57, 60), push/pop (test 58, 60), async (test 81), multi-event (test 59).

Untouched in combination with nested systems:

Nested systems × HSM
Nested systems × push/pop
Nested systems × async
Nested systems × state-args holding nested systems ($Active(c: Counter) — pass a system instance as a state-arg)

The 4-way cross (persist × HSM × nested × push/pop) is the maximum-density spot. Past 4-way crosses found the highest defect density (D16, D17, Wave 1 Phase 14 Erlang defects).

Tricky data inside nested systems

Test 83’s L1–L5 each have a single int and a child field. Real systems carry richer state. Risks:

Nested level with collection state-args (Map<String, List<int>>) — exercises type-ignorant persist and nested-system round-trip simultaneously
Nested level with its own HSM — tests that compartment-chain serialization is per-system, not system-global
Nested level with state-vars that change pre/post-save
Nested level with operations (which mutate but aren’t event-driven)

Identity & invariants

What gets preserved by restore that we might not realize:

__compartment chain (current state hierarchy)
_state_stack (push history)
HSM parent_compartment back-pointers
Nested children, recursively

What should NOT be preserved (negative invariants):

_context_stack (per-event scratchpad — would leak request data across persist boundaries)
Pending event queue (if any)
Any closure references / live timer handles

A test discipline that explicitly asserts negative invariants is just as valuable as one that asserts positives. We don’t have any negative-invariant tests today.
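
A negative-invariant check could be as small as a recursive key scan over the parsed snapshot. The sketch below assumes a JSON snapshot; _context_stack is the name used in this RFC, while _event_queue and _timers are hypothetical key names standing in for the other items on the list.

```python
import json

# _context_stack is from this RFC; the other two names are assumptions.
FORBIDDEN_KEYS = {"_context_stack", "_event_queue", "_timers"}


def find_forbidden(node, path="$"):
    """Return the JSON paths of every forbidden key, at any nesting depth."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                hits.append(child)
            hits.extend(find_forbidden(value, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits.extend(find_forbidden(value, f"{path}[{i}]"))
    return hits


# Two hand-written snapshots: one clean, one leaking per-event data in a child.
clean = json.dumps({"_compartment": {"state": "$A"}, "_state_stack": [], "child": {"n": 1}})
leaky = json.dumps({"_compartment": {"state": "$A"}, "child": {"_context_stack": [{"event": "tick"}]}})
```

Scanning recursively matters for nested systems: a leak in a child snapshot is reported with its path, here $.child._context_stack.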

Failure / adversarial input

For a real production-grade contract, we should test:

Corrupted JSON — does each backend throw a typed exception, or panic / segfault?
Truncated input — same question
Wrong-class snapshot — load a Container blob into Counter. Today: undefined behavior. Should reject with a clear error.
Schema evolution — user added a domain field after the snapshot was written. Does the field default? Does old code reject? Does newer code load the old snapshot?

Open contract questions

Before we write tests, we need answers (or deliberate “undefined”) for:

Cycles: graceful error or memoize and preserve sharing? Discussion expanded below under “Cycles in the persist graph.” Recommendation: Option A (E702 detect+error). Decision deferred pending customer feedback.
Shared references: lock in “duplicate on restore” or treat as a future feature (memoization-based preservation)? Effectively answered by the cycles decision: Option A keeps “duplicate on restore.” Option B would preserve sharing as a side effect.
Schema evolution: in scope for the test suite, or save for a later production-readiness milestone?
_context_stack mid-event: should saving during a handler be disallowed (throw), or do we promise something about what’s captured? RESOLVED (2026-05-01) — see “Resolved decisions / Quiescent contract” above. Mid-event save is a hard error: E700 / system not quiescent.
Adversarial input contract: typed exception (named what?), generic panic, or silent garbage? Discussion expanded below under “Adversarial input — threat model and proposed contract.” Decision deferred pending threat-model selection.
Concurrent save during async-await: ~~undefined? rejected? captured-at-suspension?~~ Effectively resolved by the quiescent contract: a concurrent save during an await is non-quiescent (the handler context is still on the stack) and errors with E700. Concurrent save from another system in the same process is still untested; that’s a separate concern.

These questions gate the test design. Without answers, tests can’t assert anything meaningful — there’s no contract to compare against.

Cycles in the persist graph

(Expanded from open question 1, decision deferred pending customer feedback.)

What “cycle” means

The persist graph is rooted at the system you call save_state on. Each @@SystemName field is an edge to a child instance. A cycle is when traversal returns to a previously-visited instance:

Self-reference (1-cycle): A holds an A (self.peer = some_a).
Mutual (2-cycle): A holds B; B holds A.
Longer: A→B→C→A.

Construction path

Static @@SystemName() initializers cannot form cycles. A’s @@B() constructs B; if B has @@A(), B’s construction triggers another A construction, infinitely. The program crashes during construction, before persist enters the picture. framec doesn’t currently catch this — could add a static cycle check (E430-class) but it’s a separate concern.
Runtime mutation can form cycles. A handler with a system-typed parameter that does self.x = arg lets users wire a.set_b(b); b.set_a(a). Real cycles, real risk.

What each backend does today

Backend	Behavior on cyclic save
14 JSON backends + C	Stack overflow — `save_state` recurses indefinitely
Python (pickle)	Round-trips correctly with shared identity preserved — pickle’s memo table
Erlang	gen_statem call chain deadlocks (timeout)

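Python is the only backend that survives a cyclic save. Every other backend fails hard: a stack overflow on the JSON backends and C, a deadlock (timeout) on Erlang.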

Three options

Option A — Detect and error (E702). Each save_state maintains a thread-local in-flight set; recursion into an already-visited instance throws E702: cycle detected in persist graph. RAII / try-finally cleans up.

Pros: simple, fast, uniform across backends, matches E700 philosophy, ~150 LOC codegen + ~600 LOC tests, ~1.5 days.
Cons: regresses Python (loses pickle’s cycle handling). Cycle-using code has to be rewritten (store IDs instead of object references).

Option B — Memoize (preserve sharing). Each instance gets a unique ID at save time; repeat visits write {"_ref": N} instead of duplicating state. Restore is two-pass: allocate, then wire.

Pros: cyclic graphs round-trip; shared references preserve identity post-restore (real feature beyond cycles).
Cons: per-backend complexity is significant — two-pass save + two-pass restore + per-backend identity hashing. Wire format gets ID space (harder schema migration). ~600-800 LOC, ~5-7 days.
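
For comparison, a toy version of Option B is sketched below. It uses a flat object table in which every edge is a {"_ref": N} entry, which is a simplification of the inline layout described above; Node, save_graph, restore_graph, and the "root"/"objects" keys are all hypothetical.

```python
import json


class Node:
    """Stand-in for a system with one system-typed field."""

    def __init__(self, name):
        self.name = name
        self.peer = None


def save_graph(root):
    """Assign an ID to each reachable instance, then emit records with _ref edges."""
    ids = {}      # id(obj) -> N
    order = []    # instances in ID order
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in ids:
            continue               # repeat visit: already has an ID
        ids[id(obj)] = len(order)
        order.append(obj)
        if obj.peer is not None:
            stack.append(obj.peer)
    records = [
        {"name": o.name, "peer": None if o.peer is None else {"_ref": ids[id(o.peer)]}}
        for o in order
    ]
    return json.dumps({"root": 0, "objects": records})


def restore_graph(blob):
    """Two-pass restore: allocate every instance, then wire the references."""
    data = json.loads(blob)
    nodes = [Node(rec["name"]) for rec in data["objects"]]   # pass 1: allocate
    for node, rec in zip(nodes, data["objects"]):            # pass 2: wire
        if rec["peer"] is not None:
            node.peer = nodes[rec["peer"]["_ref"]]
    return nodes[data["root"]]
```

Even in this reduced form the wire format gains an ID space and restore needs two passes, which is the complexity the cons above refer to.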

Option C — Document as undefined. Zero work; stack overflow remains the failure mode.

Recommendation

Option A, three reasons:

Cost/value alignment. Cycles aren’t a feature most users want; B’s complexity buys a niche capability.
Sharing-preservation isn’t free even with B. Once you commit to B, you’re partway down the road to “Frame persist is a full object-graph serializer” — a much bigger commitment.
The Python regression is the right call. Currently pickle handles cycles silently; if a user moves their app from Python to Java, the cycle becomes a stack overflow with no warning. Option A makes the contract uniform: cycles ALWAYS error.

Tests for Option A (test 89)

Frame source: two systems where the user constructs a cycle via runtime mutation (a.set_peer(b); b.set_peer(a)). Driver calls save_state, expects E702. Per-backend error mechanism (throw / panic / abort / push_error) follows E700 conventions.

Open questions

Option A or B? Recommend A.
Python policy: lose pickle cycle support (uniform contract) or keep as documented Python-only behavior?
Add the static @@SystemName() initializer cycle check?
Make “shared references duplicate on restore” semantics explicit in docs?

Python: switch from pickle to JSON-based persist

(Discussion piece, decision deferred pending customer feedback. Tightly coupled to the cycles question above and the adversarial-input section below.)

Current state

Python uses pickle.dumps/loads (line 1336 of interface_gen.rs — a 2-line implementation). Every other backend uses JSON via the language’s idiomatic library.

The case for switching

Closes the highest-severity adversarial-input item. pickle.loads on attacker-controlled input is RCE. JSON is data-only: json.loads never executes code from the input, and malformed JSON is rejected cleanly with json.JSONDecodeError.
Cross-backend wire format becomes viable. RFC-0012’s “cross-backend Wire Format” item moves from “deferred / 1-2 weeks” to “already done.” Save on Python, restore on JS.
Uniform contract. E700 / E701 / E702 map cleanly across all 17 backends without Python-specific exceptions.
Debuggability. print(o.save_state()) shows readable JSON, not opaque pickle bytes.
Test 86 byte-canonical idempotence becomes valid for Python. Currently skipped because pickle bytes aren’t JSON-comparable.

The case against

Loses pickle’s “any object” capability. Pickle preserves arbitrary Python objects (custom classes, lambdas, etc.). JSON handles int / float / bool / str / None / list / dict only. In practice, Frame domain types track what other backends accept (primitives + nested systems), so Python users who already wanted portability are unaffected. Custom-class domain fields are uncommon.
Loses pickle’s cycle support. Pickle’s memo table preserves cyclic graphs. JSON-based Python would crash on cycles like the other 14 backends. If Option A from the cycles section ships, this becomes uniform — Python aligns with everyone else.
Breaking change. Existing pickle blobs become unreadable. Hard cut, no auto-migration. Same precedent as E700.
Codegen complexity goes from 2 lines to ~80. Mirrors what the other JSON backends already do.

Implementation sketch

Direct port of JS saveState/restoreState to Python:

def save_state(self):
    if self._context_stack:
        raise RuntimeError("E700: system not quiescent")
    import json

    def ser_comp(c):
        if not c: return None
        return {
            "state": c.state,
            "state_args": list(c.state_args),
            "state_vars": dict(c.state_vars),
            "enter_args": list(c.enter_args),
            "exit_args": list(c.exit_args),
            "forward_event": c.forward_event,
            "parent_compartment": ser_comp(c.parent_compartment),
        }

    j = {
        "_compartment": ser_comp(self.__compartment),
        "_state_stack": [ser_comp(c) for c in self._state_stack],
    }
    # per-domain-field handling — recurse for nested @@SystemName
    # ...
    return json.dumps(j)

@staticmethod
def restore_state(json_str):
    import json
    j = json.loads(json_str)
    cls = <SystemName>
    cls.__skipInitialEnter = True
    instance = cls()
    cls.__skipInitialEnter = False
    instance.__compartment = deser_comp(j["_compartment"])
    # ...
    return instance

__skipInitialEnter is the same static-flag pattern used by Java and C# today.

Migration path

Pre-1.0: hard cut. Document loudly. Existing pickle blobs become unreadable; users discard or re-create.
Bundle with cycles work (Option A E702): single matrix run, single test rollout, single user-facing migration.
Optional flag (@@persist(format=pickle)) if customer feedback shows real demand for arbitrary-object preservation. Default to JSON. Costs ~50 LOC to keep both code paths.

Effort estimate

Codegen switch: ~1 day.
Cycles work bundled (Option A): +~1.5 days.
Per-language guide + RFC + matrix updates: ~0.5 day.
Total: ~3 days for the bundled wave.

Open questions

Hard cut, or @@persist(format=pickle) opt-in?
Bundle with cycles, or separate?
Re-enable test 86 byte-canonical idempotence for Python during the migration?

Adversarial input — threat model and proposed contract

(Expanded from open question 5, decision deferred pending threat-model selection.)

What “adversarial input” means

Calling restore_state with a JSON blob that’s malformed, corrupted, malicious, or just wrong. Concrete shapes:

Truncated JSON. Blob cut mid-document. Parser fails fast.
Type mismatches. _compartment.state should be a string, blob has 42. Restore uses wrong type — fails or silently corrupts depending on backend.
Missing required fields. No _compartment key. Restore NPEs trying to access it.
Wrong-class blob. Saved Outer, restored as Foo. Field shapes mismatch.
Unknown extra fields. Blob from a future framec version with new fields. Forwards-compat: should be ignored.
State name not in topology. state: "$Bogus" references a state that doesn’t exist in this system. Already produces RestoreError per the existing topology-validation pass.
Numeric overflow. i32 field with value 2^33. Backend parser truncates or errors.
Maliciously-crafted blob. Billion-laughs equivalent (deeply nested arrays designed to OOM), excessive nesting that overflows the parser’s recursion, gigantic strings.
Pickle-specific (Python only). Unpickling can invoke arbitrary callables named by an object’s __reduce__ method, so a crafted pickle blob runs arbitrary code on pickle.loads. This is a documented Python vulnerability; the pickle docs explicitly warn to only unpickle data you trust. Python is the only Frame backend currently exposed to this.
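
The difference between the pickle item and the JSON items can be shown with a deliberately harmless payload. In this sketch Payload and json_rejects are hypothetical names, and the builtin list stands in for whatever callable an attacker would actually name.

```python
import json
import pickle


class Payload:
    """Crafted object: unpickling it calls whatever __reduce__ names."""

    def __reduce__(self):
        # Benign stand-in. A hostile blob would name a dangerous callable here.
        return (list, ("abc",))


blob = pickle.dumps(Payload())
unpickled = pickle.loads(blob)   # calls list("abc"); no Payload is ever rebuilt


def json_rejects(text):
    """True when json.loads refuses the input."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return True
    return False
```

pickle.loads returns the result of the named call rather than a Payload instance, which is the mechanism behind the RCE risk; json.loads, by contrast, either returns plain data or raises.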

Three threat models, three different scopes

The work required depends entirely on what users do with restore_state:

(A) Local file save/load — game state, editor sessions, crash recovery to disk. Threat: filesystem corruption. Rare. Need: define a clear error so users can fall back to defaults. No security work.

(B) Network/database/cookie persistence — session state over the wire, multi-tenant systems where one tenant’s blob might be loaded by another’s code. Threat: attacker controls the blob. DoS via OOM, parse errors, RCE on Python pickle. Need: hardened input validation, typed errors per failure mode, switch Python off pickle (security-critical for this threat model), defense against depth bombs. Significant work (3–5 days plus security review).

(C) Process snapshot for crash recovery — same-process save/restore for fault tolerance. Threat: filesystem corruption, not adversarial. Need: robust error handling, no security hardening.

Current state per backend

restore_state behavior under adversarial input today:

Backend	Behavior
Python (pickle)	RCE risk on malicious input. Major hole if used over the wire.
Java/Kotlin (Jackson)	Throws Jackson exceptions on parse errors; type coercion silently does wrong things on mismatches.
Rust (serde derive)	Strict — fails fast on missing fields, type mismatches.
Go (encoding/json)	Silently ignores unknown fields, errors on type mismatches.
C++ (nlohmann)	Parses leniently; typed access throws on mismatch.
Lua (cjson)	`error()` on malformed input.
C# (System.Text.Json)	Throws on parse errors; lenient on type.
Swift (Codable)	Throws DecodingError on shape mismatch.
PHP (json_decode)	Returns null on parse failure; type coercion silent.
Ruby (JSON)	Throws JSON::ParserError.
Dart (jsonDecode)	Throws FormatException.
JS/TS	JSON.parse throws SyntaxError.
GDScript (var_to_bytes)	Returns null/empty on bad input.
C (cJSON)	Returns NULL on parse failure; manual checks needed.
Erlang (sidecar)	Throws erlang:error on bad term.

No uniform contract. Failure modes range from “throws clear typed error” to “silently corrupts” to “executes arbitrary code.”

Proposed contract — `E701: corrupted snapshot`

Mirror the E700 pattern. Spec says restore_state should fail with E701: corrupted snapshot on any of:

Parse failure (malformed JSON / pickle / etc.)
Missing required structural field (_compartment, _state_stack)
Type mismatch on a structural field (state name not a string, state_stack not an array, etc.)
State name not in _HSM_CHAIN (already raises RestoreError today; subsume into E701 or keep separate code per the topology question — pick one)
Wrong-class blob (no system-name marker; debatable whether framec embeds one — discussed below)

Per-backend mechanism follows E700 conventions:

Throw on JVM/dynamic langs/C++/Dart.
Panic on Rust/Go.
Abort on C/Swift.
Empty return + error queue on GDScript.
Erlang: {error, corrupted_snapshot} tuple or erlang:error.

What E701 does NOT cover:

Numeric overflow within valid JSON (out-of-spec but parseable).
Forwards-compatible unknown fields (silently ignored, not an error).
Adversarial DoS (depth bombs, gigantic strings) — separate hardening pass under threat model B.
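
A restore prologue implementing this list might look like the sketch below. It is illustrative only: FrameCorruptedSnapshot, parse_snapshot, error_code, and KNOWN_STATES are hypothetical names, and folding the state-name check into E701 is an assumption, since that choice is still open.

```python
import json


class FrameCorruptedSnapshot(ValueError):
    """Hypothetical Python error type for E701."""

    code = "E701"

    def __init__(self, detail):
        super().__init__(f"corrupted snapshot: {detail}")


KNOWN_STATES = {"$Idle", "$Active"}   # stand-in for the system's topology


def parse_snapshot(blob):
    """Parse and structurally validate a snapshot, raising E701 on any failure."""
    try:
        data = json.loads(blob)
    except ValueError as err:          # JSONDecodeError is a ValueError subclass
        raise FrameCorruptedSnapshot("parse failure") from err
    if not isinstance(data, dict):
        raise FrameCorruptedSnapshot("top level is not an object")
    for field in ("_compartment", "_state_stack"):
        if field not in data:
            raise FrameCorruptedSnapshot(f"missing {field}")
    comp = data["_compartment"]
    if not isinstance(comp, dict) or not isinstance(comp.get("state"), str):
        raise FrameCorruptedSnapshot("_compartment.state is not a string")
    if not isinstance(data["_state_stack"], list):
        raise FrameCorruptedSnapshot("_state_stack is not an array")
    if comp["state"] not in KNOWN_STATES:
        # Assumption: topology failures are folded into E701.
        raise FrameCorruptedSnapshot(f"unknown state {comp['state']}")
    return data   # unknown extra fields are deliberately left alone


def error_code(blob):
    """Return the E-code raised for a blob, or None if it validates."""
    try:
        parse_snapshot(blob)
    except FrameCorruptedSnapshot as err:
        return err.code
    return None
```

Unknown extra keys pass through untouched, matching the forwards-compatibility rule above, and nothing here defends against depth bombs or oversized strings.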

What about the Python pickle problem?

For threat model B, pickle is non-negotiably a problem. Options:

Replace pickle with JSON for Python persist. Match the other backends. Loses pickle’s “preserves arbitrary Python objects” property — domain fields would need explicit JSON serialization rules like the typed backends. Significant codegen change.
Add a @@persist(safe) opt-in mode that uses JSON. Default stays pickle for backward compat.
Document only. Add a security warning to the Python guide and persist docs: “do not unpickle untrusted blobs.” No code change.

Option (3) is fine for threat model A or C. (1) or (2) only needed if Frame officially supports B.

Embedded class marker for wrong-class detection

Today’s blob has no “this was saved from class X” marker. Restoring an Outer blob into Foo.restore_state produces undefined behavior (probably parse error or silent garbage, depending on field overlap). A 1-line fix: include "_system": "Outer" in the saved JSON, validate on restore. Would close one E701 case cleanly.
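
The marker check itself is small. The sketch below assumes the "_system" key proposed above; WrongSystemSnapshot, save_with_marker, and check_marker are hypothetical helper names, and the blob is assumed to have already parsed as a JSON object.

```python
import json


class WrongSystemSnapshot(ValueError):
    """Hypothetical error for a snapshot saved from a different system."""


def save_with_marker(system_name, body):
    """Serialize a snapshot body with the system name embedded."""
    return json.dumps({"_system": system_name, **body})


def check_marker(blob, expected):
    """Return the parsed snapshot, or raise if it came from another system."""
    data = json.loads(blob)
    found = data.get("_system")
    if found != expected:
        raise WrongSystemSnapshot(
            f"E701: corrupted snapshot (saved from {found!r}, restoring as {expected!r})"
        )
    return data
```

A blob with no marker at all is rejected the same way, since data.get returns None for the missing key.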

Recommended path

The minimum work that closes the contract gap:

Document the threat model. One paragraph in frame_runtime.md: “restore_state assumes trusted input. Untrusted-source blobs need separate validation. Python pickle is especially dangerous for untrusted input — switch to a JSON-based approach if needed.” Effort: 30 min.
Define E701: corrupted snapshot with the same per-backend mechanism table as E700. Codegen wraps each backend’s parse-and-validate path so failures convert to E701. Effort: ~1 day.
Add embedded class marker ("_system": "<SystemName>"). Validate in restore_state prologue. Effort: ~2 hours across 15 backends.
Test 89 — adversarial input smoke. Per backend, ~5 cases: truncated, wrong type on structural field, missing field, wrong-class blob, state-name-not-in-topology. Verify each produces E701, not crash/UB. Effort: ~1 day.

Total: ~2.5 days. Closes question 5 for threat models A and C.

Defer until production use case appears:

Pickle replacement (1 or 2 above).
Depth-bomb/string-bomb hardening.
Numeric-range validation on domain fields.

These are threat-model-B work. Build them when someone needs B; don’t build speculative security infrastructure.

Decision needed from review

Pick the default threat model. (A) seems most defensible for current Frame; (B) requires a security commitment.
Confirm E701 as the error code or pick a different number.
Decide whether RestoreError (existing topology-validation error) merges into E701 or stays as a sibling code.
Confirm minimum viable scope: 1+2+3+4 above, or smaller?

Theoretical limits

We pushed on “what’s the theoretical best for coverage?” Three strata of difficulty:

Solvable mechanically (with engineering work)

Item	Why tractable
Linear / tree / branching topologies	Just more domain fields. Existing Option A handles them.
Cycle detection + graceful error	Visited-set during serialize. ~30 LOC per backend.
Shared-reference preservation	Memoization (`{"__id": 42, "data": ...}` + `{"__ref": 42}`). Pickle does this; JSON-based backends just need an ID table. ~100 LOC per backend.
Cross-products (HSM, push/pop, async × nested)	Mechanical extension of existing waves.
Failure modes (corrupted/truncated input)	Wrap each backend’s parse call. ~10 LOC per backend.
Wrong-class snapshot rejection	Embed `__sys: "L1"` marker; check on restore. ~5 LOC per backend.
Additive schema evolution	Already mostly works — JSON ignores missing/extra keys. Just needs explicit testing.
Mid-event-handler save rejection	Check `_context_stack.empty()` at save entry; throw if not. Few LOC.
Negative invariants (`_context_stack` not in snapshot)	Diff the JSON; assert keys absent. Unit test.

All tractable. None require new theory — just engineering.

Tractable but require user-written code

Item	What’s needed from user
Semantic schema evolution (int → string field; system split/merge)	User-written migration function. Framework can route old → new via versioned `@@migrate` block.
Domain constraints / invariants	User-asserts post-restore. Framework can call a `validate()` hook.
Concurrent multi-thread save	User-supplied locking. Frame doesn’t enter the SMP/threading domain.
Custom type handling (in C, Rust, etc.)	User-supplied pack/unpack. Already established as the Frame contract.

These are tractable in the sense that the framework can facilitate them, but they fundamentally require user input. No automated tool can generate a migration from “old schema” to “new schema” without knowing the user’s intent.

Genuinely intractable

Item	Why
Continuation-style save (snapshot mid-`await`, resume from exact suspension point)	Requires first-class continuations or stackful coroutines at the language runtime level. Possible in Scheme, Smalltalk, some pickle subsets. Not possible for Rust async, JS Promise, Java CompletableFuture, etc. — their async types are not serializable. The achievable answer is “saves happen between events, not during” — a contract, not a test.
Universal observational equivalence proof	Rice’s theorem. You can test event sequences (sample-based confidence); you cannot prove equivalence for all inputs.
Auto-migration of arbitrary semantic changes	The user’s intent is not in the schema. Framework can detect the diff but can’t infer what to do with it.
Save during true concurrency without user-supplied isolation	Framework can’t know which threads access which fields.

Where the practical ceiling sits

For Frame’s persistence as a whole, “excellent coverage” is achievable up to and including the boundary between framework and user concerns:

All graph topologies (chain, branch, tree, diamond, cycle-detected) — mechanical
All Frame feature × persist crosses (HSM, push/pop, async, multi-event, state-args, etc.) — mechanical
Sharing/identity preservation if specced — mechanical via memoization
Adversarial input contract (typed exceptions, schema validation) — mechanical
Round-trip property assertion at scale via fuzzer — mechanical

What pushes us to the real ceiling: property-based testing.

Hash-based round-trip validation

Hash equality is the cleanest invariant for fuzz-scale testing. The pattern:

snap = instance.save_state()
h1 = hash(canonical(snap))
restored = Class.restore_state(snap)
h2 = hash(canonical(restored.save_state()))
assert h1 == h2     # round-trip preserves state

If the two hashes match, the canonical serialized representation is identical before and after the round trip. Strong invariant. Cheap (milliseconds per cycle). Easy to fan out across thousands of generated cases.

Subtleties

Canonical form is essential. Cannot hash raw save_state() output directly:
- JSON object key order varies (some serializers don’t preserve insertion order)
- Float representation (1.0 vs 1 vs 1.00) varies by backend
- Whitespace varies
- Pickle: object identity creates different byte sequences for the same logical value
Fix: normalize before hashing. Sort keys lexicographically; format floats with fixed precision; strip whitespace. ~20 LOC of canonicalization. SHA-256 the result.
Hash captures state, not behavior. A bug that drops a state field would change the hash — caught. A bug that subtly changes behavior without changing state (extremely rare for Frame since handlers are class methods, not closures) wouldn’t be caught by hash alone. Property-based event-replay testing complements it.
_context_stack should be excluded. It’s per-event scratchpad. If you save mid-event the hash will mismatch, but that’s “don’t do that” not “broken” — the contract should forbid mid-event save. The canonicalizer should drop _context_stack (or save_state should reject if it’s non-empty).
Per-backend hash, not cross-backend. Hash equality after round-trip on the same backend is the realistic test. Python’s hash will differ from Java’s hash for the same Frame system because each emits its own JSON shape. Same-backend round-trip is what we care about for “did persist work.”
What it catches that observational testing misses:
- Field reordering bugs that don’t affect tested behavior
- Compartment chain corruption that isn’t exercised by your test events
- Push/pop stack drift in branches your events didn’t reach
- Nested-system state preserved at one level but truncated at another
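
One possible canonicalizer and hash helper for JSON snapshots is sketched below. The function names are hypothetical, and the float policy (integral floats become ints, all others are rounded to six decimal places) is one assumption among several workable ones.

```python
import hashlib
import json


def normalize(node):
    """Recursively normalize numbers so equal values serialize identically."""
    if isinstance(node, float):
        if node.is_integer():
            return int(node)          # 1.0 and 1.00 collapse to 1
        return round(node, 6)         # fixed precision for everything else
    if isinstance(node, dict):
        return {key: normalize(value) for key, value in node.items()}
    if isinstance(node, list):
        return [normalize(value) for value in node]
    return node


def canonical(snapshot):
    """Parse a JSON snapshot and re-emit it with sorted keys and no whitespace."""
    tree = normalize(json.loads(snapshot))
    return json.dumps(tree, sort_keys=True, separators=(",", ":")).encode("utf-8")


def snapshot_hash(snapshot):
    """SHA-256 hex digest of the canonical form."""
    return hashlib.sha256(canonical(snapshot)).hexdigest()
```

This version does not drop _context_stack: with the quiescent contract in place, a snapshot containing it is better reported as an invariant violation than silently normalized away. It also does not apply to pickle output, which would need its own comparison strategy.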

Cross-backend gold-standard variant

Define a “Frame Wire Format”: a backend-agnostic canonical JSON shape with explicit key ordering, normalized floats, and a version stamp. Each backend emits it. The hash of the wire format then matches across backends, e.g. the Python and Java hashes are equal for the same logical state.

Enables: saving in Python, restoring in Java, and verifying equivalence. Real engineering investment (~few days per backend), real payoff for serialization-format compatibility. Skip unless cross-backend persist is a stated goal.

Effort to add

Canonicalizer: ~30 LOC of test-harness code (one normalizer, parses each backend’s JSON output)
Hash helper: 5 LOC (SHA-256 of canonical bytes)
Property test: 20 LOC fuzz loop generating random states + asserting h1 == h2
Wire it into a per-backend test runner: ~2 hours

Total: less than a day to add to existing test infrastructure.
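A rough shape for the property test, assuming a toy system in place of a generated one. ToySystem, random_system, and check_round_trip are hypothetical names for illustration; a real runner would call the backend's save/load and the full canonicalizer.

```python
import hashlib
import json
import random


class ToySystem:
    """Hypothetical stand-in for a generated persist system."""

    def __init__(self):
        self.state = "$S0"
        self.n = 0
        self.state_stack = []

    def save(self) -> str:
        return json.dumps(
            {"state": self.state, "n": self.n, "state_stack": self.state_stack}
        )

    def load(self, data: str) -> None:
        fields = json.loads(data)
        self.state = fields["state"]
        self.n = fields["n"]
        self.state_stack = list(fields["state_stack"])


def snapshot_hash(data: str) -> str:
    canonical = json.dumps(json.loads(data), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def random_system(rng: random.Random) -> ToySystem:
    system = ToySystem()
    system.state = rng.choice(["$S0", "$S1", "$S2"])
    system.n = rng.randint(-1000, 1000)
    system.state_stack = [
        rng.choice(["$S0", "$S1"]) for _ in range(rng.randint(0, 4))
    ]
    return system


def check_round_trip(iterations: int = 500, seed: int = 0) -> int:
    """Assert h1 == h2 across save -> load -> save for random states."""
    rng = random.Random(seed)
    for _ in range(iterations):
        original = random_system(rng)
        snapshot = original.save()
        h1 = snapshot_hash(snapshot)
        restored = ToySystem()
        restored.load(snapshot)
        h2 = snapshot_hash(restored.save())
        assert h1 == h2, "snapshot hash changed across save -> load -> save"
    return iterations
```

Seeding the generator keeps failures reproducible: a failing seed can be promoted to a regression anchor.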

Property-based event-replay testing

The complementary invariant is behavioral: hash equality checks that the state survived the round trip, while observational equivalence checks that the restored system responds to events exactly as the original does:

events = generate_random_event_sequence(N)
snapshot = save(instance)                    # snapshot before the original consumes any events
b1 = run_events(instance, events)
b2 = run_events(restore(snapshot), events)
assert b1 == b2

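If b1 == b2 for thousands of random event sequences, persist is correct for that system with high empirical confidence. This is statistical evidence over the sampled sequences, not a proof or a cryptographic guarantee.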
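The replay check can be made concrete with a toy system. The sketch below assumes a hypothetical two-state Toggle class and takes the snapshot between events (quiescent), after a random prefix has driven the system into an arbitrary reachable state.

```python
import json
import random


class Toggle:
    """Hypothetical two-state system: bump() only counts while in $On."""

    def __init__(self):
        self.state = "$Off"
        self.n = 0

    def dispatch(self, event: str):
        if event == "toggle":
            self.state = "$On" if self.state == "$Off" else "$Off"
            return self.state
        if event == "bump":
            if self.state == "$On":
                self.n += 1
            return self.n
        raise ValueError(f"unknown event {event!r}")

    def save(self) -> str:
        return json.dumps({"state": self.state, "n": self.n})

    @classmethod
    def restore(cls, data: str) -> "Toggle":
        system = cls()
        fields = json.loads(data)
        system.state, system.n = fields["state"], fields["n"]
        return system


def run_events(system, events):
    """Dispatch each event and record what the caller observes."""
    return [system.dispatch(event) for event in events]


def replay_check(trials: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(trials):
        system = Toggle()
        # Random prefix: reach an arbitrary state before saving.
        run_events(system, rng.choices(["toggle", "bump"], k=rng.randint(0, 10)))
        snapshot = system.save()  # taken between events, so quiescent
        events = rng.choices(["toggle", "bump"], k=rng.randint(1, 20))
        b1 = run_events(system, events)
        b2 = run_events(Toggle.restore(snapshot), events)
        assert b1 == b2, (snapshot, events)
```

The snapshot is taken before the original instance consumes the test events; saving afterwards would compare two different starting states.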

Combined with hash testing:

Hash: cheap, instant, covers state preservation
Behavior: expensive, covers continued operation post-restore

A fuzz harness that does both:

Generates a Frame system per axis spec (depth × branching × HSM × push/pop × async)
Generates a random event sequence
Asserts hash equality after save→restore (cheap)
Asserts behavior(events on restored) == behavior(events on original) (more expensive)
Asserts no invariant violation (_context_stack empty in snapshot, etc.)
Mutates the saved snapshot adversarially and asserts the right failure mode

Run for an hour per backend. If nothing breaks, you’re at the practical ceiling.
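The adversarial-mutation tier might look like the sketch below. It assumes a typed-error contract with a hypothetical PersistFormatError (the actual name is still an open question) and a toy loader that validates before touching live state.

```python
import json

KNOWN_STATES = {"$S0", "$S1"}


class PersistFormatError(Exception):
    """Hypothetical typed error; the real name is an open contract question."""


def load_snapshot(data: str) -> dict:
    """Toy loader: validate everything, then return the accepted fields."""
    try:
        fields = json.loads(data)
    except json.JSONDecodeError as exc:
        raise PersistFormatError(f"not valid JSON: {exc}") from exc
    if not isinstance(fields, dict):
        raise PersistFormatError("snapshot must be a JSON object")
    state = fields.get("state")
    if not isinstance(state, str) or state not in KNOWN_STATES:
        raise PersistFormatError(f"unknown state: {state!r}")
    n = fields.get("n")
    if isinstance(n, bool) or not isinstance(n, int):
        raise PersistFormatError("field 'n' must be an integer")
    if fields.get("_context_stack"):
        raise PersistFormatError("snapshot was taken mid-event")
    return {"state": state, "n": n}


def mutations(valid: str):
    """Yield (label, corrupted snapshot) pairs derived from a valid one."""
    fields = json.loads(valid)
    yield "truncated", valid[: len(valid) // 2]
    yield "not an object", json.dumps([fields])
    yield "missing field", json.dumps({k: v for k, v in fields.items() if k != "n"})
    yield "wrong type", json.dumps({**fields, "n": "two"})
    yield "unknown state", json.dumps({**fields, "state": "$Nope"})
    yield "mid-event", json.dumps({**fields, "_context_stack": [{"event": "bump"}]})


def check_adversarial(valid: str) -> int:
    """Every mutation must be rejected with the typed error, nothing else."""
    assert load_snapshot(valid) == json.loads(valid)
    rejected = 0
    for label, corrupted in mutations(valid):
        try:
            load_snapshot(corrupted)
        except PersistFormatError:
            rejected += 1
        else:
            raise AssertionError(f"mutation accepted: {label}")
    return rejected
```

Any other exception type (KeyError, TypeError) escaping load_snapshot propagates out of check_adversarial, which is the point: an untyped crash counts as a failure of the contract.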

Effort: ~3–5 days of test infrastructure. Pays off forever. Most hand-coded tests (including 83) become “regression anchors” for specific known cases; the fuzzer covers unknown unknowns.

Recommended path forward

Ranked. Each step gates the next.

Step 1: Spec the contract (~1 day)

Pick answers for the six open questions in §Open contract questions. Without these, tests can’t assert anything meaningful. Document in a Frame contract doc (docs/persist-contract.md or similar).

Step 2: Hash-based round-trip testing (~1 day)

Add canonicalizer + hash assertion to the test runner. Wire it into every existing persist test in the matrix as a sanity check (should be all-pass; if any flag, that’s a real defect).

Step 3: Hand-cataloged graph topology tests (~2 days)

Write ~30 tests covering:

Branching (1, 2, 5 child fields)
Same-type siblings
Diamond
Cycle (assert spec’d failure mode)
Self-reference (assert spec’d failure mode)
Tree fan-out

Each test runs hash-based round-trip assertion. Regression anchors.
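For the diamond, cycle, and self-reference cases, a minimal sketch of the assertions. It assumes the "graceful error" cycle policy (still an open question) with a hypothetical PersistCycleError, and the duplicate-on-shared-reference behavior listed as current. Node is a stand-in for a system holding child systems in fields.

```python
class PersistCycleError(Exception):
    """Hypothetical error for the 'graceful error' cycle policy."""


class Node:
    """Stand-in for a system instance that holds child systems."""

    def __init__(self, name):
        self.name = name
        self.children = []


def serialize(node, _active=None):
    """Serialize depth-first; duplicate shared children, reject cycles."""
    active = set() if _active is None else _active
    if id(node) in active:
        raise PersistCycleError(f"cycle through {node.name}")
    active.add(id(node))
    try:
        return {
            "name": node.name,
            "children": [serialize(child, active) for child in node.children],
        }
    finally:
        # Leaving the node: it may legally appear again on a sibling path.
        active.discard(id(node))


def diamond():
    a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
    a.children = [b, c]
    b.children = [d]
    c.children = [d]
    return a


def cycle():
    a, b = Node("a"), Node("b")
    a.children = [b]
    b.children = [a]
    return a
```

Tracking only the nodes on the current path (rather than every node seen) is what lets the diamond pass while the cycle fails: d is visited twice but is never its own ancestor.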

Step 4: Property-based fuzzer (~3–5 days)

Build the fuzz harness:

Frame system generator (parameterized by axis spec)
Event sequence generator
Hash + behavior + invariant assertions
Adversarial mutation tier

Run continuously; treat as fuzz tier (long runs, occasional new defect surfaces).

Step 5: Schema-evolution test suite (optional, ~3 days)

If schema evolution is in scope (per Step 1’s contract decision):

Snapshot v1 + framework v2 → assert tolerated
Deleted field + old snapshot → assert tolerated
Renamed field → assert user-written migration path works
Type change → assert user-written migration path works

Step 6: Cross-backend Wire Format (optional, ~1–2 weeks)

Only if “save in Python, restore in Java” is a goal. Define canonical JSON; each backend emits it; round-trip assertion across backends.

Total to “theoretical best” (excluding optional steps): ~2 weeks of focused work. After that, marginal coverage gains become rapidly more expensive for diminishing returns.

Drawbacks / alternatives

Drawback: contract-first work delays test value

Steps 1 and 2 don’t add tests for ~2 days. If you’d rather see results sooner, swap order: write hash-based assertion first, discover the contract gaps as they manifest. Risk: some tests will need to be rewritten once the contract is settled.

Alternative: behavior testing only, skip hash

Behavior testing is sufficient for correctness. Hash is an optimization for fuzz-scale testing. If we’re not building a fuzzer, hash-based testing buys less. Recommendation against: hash testing is cheap enough that it’s worth doing even for the hand-cataloged test tier.

Alternative: cross-backend Wire Format first

If the strategic goal is “save anywhere, restore anywhere,” start with Wire Format. But this is significant engineering for a use case that may not be on the near-term roadmap. Default: defer.

Alternative: skip property-based testing entirely

Hand-cataloged tests + hash assertion catch most of the bug density (roughly 95%, an estimate rather than a measurement). Property-based testing catches the long tail. If budget is tight, skip the fuzzer and accept that some corner cases will surface as production bugs. Recommendation: don’t skip; the fuzzer is the difference between “we tested the known cases” and “we tested arbitrary cases.”

Open questions for review

Before implementation:

Cycle policy: graceful error, or memoize and preserve?
Shared-reference policy: duplicate (current), or memoize and preserve?
Schema evolution scope: in-suite, or production-readiness milestone?
Mid-event save: forbid (throw at save call), or capture _context_stack and document? (Resolved 2026-05-01: forbid; see “Resolved decisions”.)
Adversarial input contract: typed exception (named what? PersistFormatError / PersistVersionError / PersistSchemaError?), or generic?
Concurrent save semantics: undefined and documented, or single-threaded contract enforced by lock check?
Test infra investment: hash + cataloged only (~1 week), or full property-based fuzzer (~2 weeks)?
Cross-backend Wire Format: in scope, deferred, or out of scope?

Implementation status

Not started. RFC parked pending review of open questions.

The actual implementation work is well-scoped (~1–2 weeks depending on scope answers above), but should not begin until the contract questions are settled. Otherwise tests will assert behaviors that need to be rewritten when the contract is set.

References

Test 83 5-deep nested persist: framepiler cafdec8, test_env ec179fbf
Memory: type_ignorant_persist_2026_04_30.md
DEFECTS.md (closed): D1–D18
FUZZ_PLAN.md (Phase 24, waves 1–7)

Amendment 2026-05-02: `@@[save]` / `@@[load]` operation attributes

Motivation

The status-quo persist contract emits static func restore_state(data) -> Self on every backend, mutates a class-static __skipInitialEnter flag around .new(), and re-uses that flag in the constructor’s initial-enter path to skip the normal lifecycle. This works on every backend whose static-method scope can resolve the script’s own class identifier — but it doesn’t work on GDScript, where a script’s static function cannot resolve its own class_name (empirically verified against Godot 4.6.2).

We considered eight candidate fixes (A–H) when investigating this. A (class_name declaration) was the natural first attempt and doesn’t actually work — Godot’s static funcs cannot see their own class even with class_name. Every other option either requires per-target divergence in the public contract, hardcoded resource paths, or doesn’t address the architectural cost: __skipInitialEnter is a class-static race window, and embedding the class identifier into a static method body is a fragile coupling between codegen and target scoping rules.

Design

Four attributes replace the existing contract:

@@[persist(<FormatType>)] — system-level. Declares the system participates in persistence and selects the wire format (e.g. JSON). Format names are opaque strings plumbed through to per-backend ser/deser implementations; Frame doesn’t validate the name beyond syntactic well-formedness. Default when omitted: JSON.
@@[save] — operation attribute. Marks the operation Frame should fill in as the save entry point. Signature: (): <FormatType>. The operation has no body in source — Frame generates the body based on the format. Regular instance method. Caller invokes it as inst.<op_name>() and gets the serialized payload.
@@[load] — operation attribute. Marks the operation Frame should fill in as the load entry point. Signature: (data: <FormatType>). No body in source. Regular instance method. Caller invokes it on an existing instance to overwrite the compartment with the persisted state.
@@[no_persist] — domain field attribute. Marks a field as transient. The save body skips it; the load body leaves it at its default initializer value. Used for fields that hold external resources (sockets, file handles, UI references) that can’t be serialized and must be wired by the host after construction.

Example:

@@[persist(JSON)]
@@system Foo {
    interface:
        bump()
        get_n(): int

    operations:
        @@[save]   pickle(): JSON
        @@[load]   unpickle(data: JSON)

    machine:
        $S0 {
            bump() { self.n = self.n + 1 }
            get_n(): int { @@:(self.n) }
        }
    domain:
        n: int = 0
}

User code — uniform across all 17 backends, two-step pattern:

foo = Foo()                       # $S0 enter fires (idempotent for typical systems)
foo.bump(); foo.bump()
data = foo.pickle()               # @@[save] op, body framework-generated

foo2 = Foo()                      # construct fresh; $S0 enter fires
foo2.unpickle(data)               # @@[load] op overwrites compartment with snapshot
assert foo2.get_n() == 2

# GDScript
var foo = Foo.new()
foo.bump(); foo.bump()
var data = foo.pickle()

var foo2 = Foo.new()
foo2.unpickle(data)

// Java / C#-style typed backends
Foo foo = new Foo();
JSON data = foo.pickle();

Foo foo2 = new Foo();
foo2.unpickle(data);

Every backend uses the same shape: regular instance methods. No static-method-on-its-own-class scoping issue. GDScript fix is structural — the bug class can’t recur because there’s no static method to resolve.
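To make the shape concrete, here is a hand-written Python analog of the example system. It is not framec output; the JSON layout, the _NO_PERSIST set, and the socket field (standing in for a @@[no_persist] domain field) are all assumptions for illustration.

```python
import json


class Foo:
    """Hand-written analog of the example system; not generated code."""

    _NO_PERSIST = {"socket"}  # hypothetical field marked @@[no_persist]

    def __init__(self):
        self.state = "$S0"
        self.n = 0
        self.socket = None  # external resource, wired by the host

    def bump(self) -> None:
        self.n += 1

    def get_n(self) -> int:
        return self.n

    def pickle(self) -> str:
        """Save entry point: a regular instance method."""
        domain = {
            name: value
            for name, value in vars(self).items()
            if name != "state" and name not in self._NO_PERSIST
        }
        return json.dumps({"state": self.state, "domain": domain})

    def unpickle(self, data: str) -> None:
        """Load entry point: overwrite this instance in place."""
        snapshot = json.loads(data)
        self.state = snapshot["state"]
        for name, value in snapshot["domain"].items():
            if name not in self._NO_PERSIST:
                setattr(self, name, value)
```

Because unpickle mutates an existing instance, no static method has to name its own class, and transient fields simply keep whatever the constructor gave them.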

`$S0` enter on restore — known semantics

Calling Foo() followed by foo.unpickle(data) fires $S0’s >() enter handler once before unpickle overwrites the compartment with the persisted state. For typical persist systems (whose $S0 enter just initializes domain defaults), this is invisible — the defaults get overwritten immediately.

For systems with externally observable side effects in $S0 enter (e.g., a print(...), network handshake, file open), those effects fire once on every restore. Workarounds:

Make $S0 enter idempotent / pure (best practice anyway).
Gate side effects on a domain flag that the load body can clear.
Move the side effect to a non-$S0 state and transition there manually after load.

This is documented as a contract limitation rather than worked around in codegen. An earlier draft proposed special “no-init constructor” syntax (@@Foo.unpickle(data)) to bypass $S0 enter on restore, but the per-backend lowering (constructor overload + tag-dispatched ctor + factory function) was complexity we deemed not worth paying for the narrow case of “user has observable side effects in $S0 enter.” The two-step pattern is uniformly simple and covers the common case.

Pre / post hooks

Not provided. The user wraps inst.<save_op>() with whatever they want in caller code, and similarly arranges any post-load wiring after the load call returns. If they need the post-load wiring to be guaranteed (e.g., reconnect a socket every time), they declare a regular operations: method and call it explicitly:

foo2 = Foo()
foo2.unpickle(data)
foo2.reconnect()                 # regular operation, user's responsibility

Earlier drafts added @@[before_save] / @@[after_save] / @@[before_load] / @@[after_load] attributes to provide bracketing hooks, but every real use case for those collapses into “user code in the calling function” except post-restore wiring — and even that is reasonably the user’s responsibility, since Frame can’t know which external resources their app uses.

If real demand surfaces for post-restore wiring as a Frame primitive (rather than an app concern), a future @@[on_load] attribute on a regular operation can be added without breaking the four-attribute contract. (That attribute has since shipped as Phase D; see “Phasing”.)

Validator rules

@@[save] and @@[load] valid only on operations of @@[persist] systems. Otherwise E801 (attribute at wrong position).
@@[no_persist] valid only on domain fields of @@[persist] systems. Otherwise E801.
At most one @@[save] and one @@[load] per system. Otherwise E810 (proposed: duplicate persist operation).
Save op signature: zero parameters, return type matches the format type from @@[persist(<Format>)]. Otherwise E811 (proposed: persist save signature mismatch).
Load op signature: one parameter typed as the format, no return type. Otherwise E812 (proposed: persist load signature mismatch).
Operations with @@[save] / @@[load] must have no body in source — Frame generates it. A user-provided body is E813 (proposed: persist op body is framework-generated).
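The operation-level rules above can be sketched as a small checker. The Operation and System dataclasses are a hypothetical AST shape, not framec's, and the sketch omits the @@[no_persist] field rule. Error codes are the ones listed above, several of which are still marked proposed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Operation:
    name: str
    attrs: List[str]
    params: List[str] = field(default_factory=list)  # parameter type names
    return_type: Optional[str] = None
    has_body: bool = False


@dataclass
class System:
    name: str
    persist_format: Optional[str]  # None when @@[persist] is absent
    operations: List[Operation]


def validate(system: System) -> List[str]:
    errors: List[str] = []
    for attr in ("save", "load"):
        tagged = [op for op in system.operations if attr in op.attrs]
        if not tagged:
            continue
        if system.persist_format is None:
            errors.append("E801")  # attribute outside an @@[persist] system
            continue
        if len(tagged) > 1:
            errors.append("E810")  # duplicate persist operation
        for op in tagged:
            if op.has_body:
                errors.append("E813")  # body is framework-generated
            if attr == "save":
                if op.params or op.return_type != system.persist_format:
                    errors.append("E811")
            elif op.params != [system.persist_format] or op.return_type is not None:
                errors.append("E812")
    return errors
```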

Migration

Pre-1.0 hard cut, RFC-0013 wave 1+2 playbook. Frame source on the existing contract (no @@[save]/@@[load] ops, magic save_state/restore_state interface) becomes invalid; framec emits E814 (proposed: bare-form persist contract is no longer accepted — declare @@[save] and @@[load] operations).

Test corpus migration: every @@[persist] system declares the two operations; drivers update from Foo.restore_state(data) (static) to foo = Foo(); foo.unpickle(data) (two-step). Mechanical sed; the operation names are conventionally save_state / restore_state unless users want different names.

Phasing

Phase A ✅ (2026-05-02): Parser + validator for the four attributes. GDScript codegen end-to-end (proves the design). Test fixture + matrix verification GDScript-only. Closed the GDScript bug; unblocked frame-arcade scoreboard.
Phase B1 ✅ (2026-05-02): All 17 backend codegens accept the new contract additively. Legacy contract preserved everywhere for backwards compatibility (matrix proof: 4,275 / 4,275 passing). Per-backend changes:
- Family 1 (dynamic): Python, JS, TS, Ruby, Lua, PHP, Dart, GDScript — target = self/this/$this; load body drops construction-bypass, mutates self in place.
- Family 2 (typed JVM/Swift): Java, Kotlin, C#, Swift — legacy RuntimeHelpers.GetUninitializedObject / ReflectionClass::newInstanceWithoutConstructor stays under legacy; new contract drops the bypass entirely.
- Family 3 (systems): Rust, C++ — Rust uses struct-literal bypass under legacy, direct self.X = ... under new; C++ similar with (*this).X = ....
- Family 4 (factory shape): Go, C, Erlang — Go: receiver method (new) vs package-level Restore<Sys> (legacy); C: <Sys>_load_op(<Sys>* self, json) (new) vs <Sys>* <Sys>_restore_state(json) (legacy); Erlang: design exclusion — gen_statem Pid model means load is always a factory, just renamed under user’s @@[save]/@@[load].
Phase B2 ✅ (2026-05-02): Canonical end-to-end test 93_persist_save_load_contract ported to all 17 backends. Frame source declares operations: @@[save] / @@[load]; driver creates instance, mutates, saves, creates fresh instance, loads snapshot, asserts state. Surfaced + fixed 3 codegen bugs:
- Rust + Erlang duplicate operations (system_codegen.rs skip not propagated to rust_system.rs / erlang_system.rs)
- Rust load-param type ignored user declaration (fixed via new SystemAst::load_op_param_type() helper)
- Go data collision with user’s load param
Phase B3 ✅ (2026-05-03): Hard-cut E814 shipped. Bare @@[persist] now errors out; every persist system must declare @@[save] and @@[load] ops. The full legacy fixture migration (~425 fixtures across 17 backends + linux - demos + erlang multi) landed in test_env commits 54f11d7d, bcaa5e0d, 4e487f40, d627359d, b3dd4cdc. Matrix 4,275 / 4,275 across 17 backends.
Phase B4 ✅ (2026-05-02, this section): Documentation — RFC-0012 status, frame_runtime.md, per-language guides.
Phase C (deferred to roadmap): schema versioning + @@[migrate] operation chain. See “Future roadmap” below.
Phase D ✅ (2026-05-03, framepiler a61390e): @@[on_load] post-load hook. Fifth attribute. Marks an operation that fires automatically after restore_state populates self, so user code can re-establish derived state, fire watchers, validate invariants. AST helper SystemAst::on_load_op_name(); validator recognizes the attribute (E810 enforces at-most-one); codegen appends target.<name>() (per-language form) to each backend’s restore body via interface_gen::on_load_call helper. Test fixture: 95_persist_on_load_hook.fpy. Wired in 14 backends (Erlang’s gen_statem dispatch deferred — separate codegen).

Phase A alone closed the GDScript bug. Phase B1+B2 made the contract usable on every backend. Phase B3 hard-cut shipped 2026-05-03 once the legacy fixture migration completed.

Retired by RFC-0015 (framepiler 66c9573, 2026-05-04). See rfc-0015.md for the lifecycle attribute design that supersedes this.

Future roadmap (post-Phase B)

The four-attribute contract above covers Frame’s current target use cases (game save/restore, app state, web session). For Frame to expand into adjacent use cases (long-lived state, workflow orchestration), additional surfaces are needed. Recorded here as deferred work, not in scope for the GDScript-bug-driven amendment.

Survey: how comparable systems handle persistence

Honest comparison of Frame’s persist scope vs. nearby systems we’d plausibly be measured against:

System	State model	Persistence	Schema evolution	Concurrency
Airflow	DAG of tasks; queued/running/success/failed states	External metadata DB; per-row, per-task-instance	Versioned DAG code; older runs locked to historical DAG	DB row locks
AWS Step Functions	JSON state machine	Internal AWS-managed; every transition durable	Versioned state machine ARNs	Per-execution; AWS-handled
K8s operators	Reconciliation loop on CRDs	etcd via API; spec/status separation	Versioned APIs (v1alpha/beta/v1); conversion webhooks	Optimistic via resourceVersion
Terraform	Declarative resource graph	tfstate JSON; remote backend optional	`terraform state mv`; provider versioning	State locks (S3+DynamoDB)
Erlang OTP	Actor + supervisor tree	mnesia / DETS / external	Hot-code-loading + state migration callbacks	Per-process mailbox
Akka	Actor + persistence	Event sourcing log + snapshots	Schema evolution via event adapters	Per-actor mailbox
Hibernate / JPA	POJO entities	DB rows; lazy/eager loading	`@Version` + Liquibase/Flyway migrations	DB transaction isolation

Use-case alignment for the four-attribute contract:

Use case	Covered?
Game save/restore (frame-arcade)	✅
Mobile/desktop app state restoration	✅
Web session state (server-side)	✅
Embedded device state across firmware updates	⚠️ — needs schema versioning
Workflow orchestration (Airflow-style)	❌ — needs WAL + observable transitions
Distributed state machines	❌ — concurrency / leader election out of scope
Long-lived business processes (Step Functions Wait, weeks/months)	❌ — needs durable wait + versioning
Infrastructure state (Terraform-style)	❌ — needs locking + versioning
Event-sourcing actor (Akka-style)	⚠️ — Frame snapshots, not event-sourced

The first three are realistic Frame use cases today. The two marked ⚠️ are partial fits: embedded device state becomes realistic with schema versioning (Phase C below), and event-sourcing actors get snapshots but no event log. The four marked ❌ are out of scope; they’d require Frame to grow new surfaces (write-ahead logging, distributed locking, durable timers) that shouldn’t be baked into core persist.

Roadmap item 1: schema versioning + `@@[migrate]` (Phase C)

Long-lived state outlives code revisions. Adding a domain field, renaming a state, restructuring HSM hierarchy — every such change breaks old snapshots. Comparable systems all version their state representations.

Proposed extension:

@@[persist(JSON, version=2)]
@@system Foo {
    operations:
        @@[save]                save(): JSON
        @@[load]                load(data: JSON)

        @@[migrate(from=1, to=2)]
        v1_to_v2(old: JSON): JSON   # body: user transforms old shape to new
}

On load(data), framework reads version field from the payload. If mismatched, walks the chain of @@[migrate] ops to forward-migrate from data["version"] to the current. Fail loudly if no chain exists (E815 proposed). Each migration op transforms the payload one version forward; the framework chains them.
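The chain walk can be sketched as follows, assuming a decoded JSON payload with a top-level "version" field. The two migration functions, the field names they touch, and MigrationError (standing in for the proposed E815) are hypothetical.

```python
CURRENT_VERSION = 3


class MigrationError(Exception):
    """Stand-in for the proposed E815 (no migration chain)."""


def v1_to_v2(old: dict) -> dict:
    new = dict(old)
    new["count"] = new.pop("n")  # hypothetical rename: n -> count
    return new


def v2_to_v3(old: dict) -> dict:
    new = dict(old)
    new.setdefault("label", "")  # hypothetical added field with a default
    return new


# One-step migrations keyed by their `from` version (to == from + 1).
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}


def migrate(payload: dict, target: int = CURRENT_VERSION) -> dict:
    """Walk the chain one version at a time until the payload is current."""
    version = payload.get("version")
    if type(version) is not int or version > target:
        raise MigrationError(f"unsupported snapshot version: {version!r}")
    while version < target:
        step = MIGRATIONS.get(version)
        if step is None:
            raise MigrationError(f"no migration from version {version}")
        payload = step(payload)
        version += 1
        payload["version"] = version  # stamped by the framework, not the user
    return payload
```

Keying the table by the `from` version makes the one-step rule structural: a gap in the chain shows up as a missing key and fails loudly instead of silently skipping a transform.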

Validator rules (additional):

@@[migrate] valid only on @@[persist] system operations.
from and to must be integer literals; to == from + 1 (one-step migrations).
Migration chain from any version value present in test snapshots to the current version must be complete (validator can detect gaps given a manifest, or report at load time).

Implementation note: the version field is embedded in the save payload by the framework, not user-provided. Format-specific (JSON: top-level "version" field; Protobuf: a reserved tag).

When to ship: when a real customer hits a breaking schema change. Not needed for game/session use cases that are inherently single-version.

Roadmap item 2: framework boundaries documented in `frame_runtime.md`

Set explicit expectations:

Frame’s persistence is point-in-time snapshot. It does not provide:

Write-ahead logging — auto-save-on-transition is not built in. Every save is user-triggered.

Distributed locking / leader election — single-instance only. Coordination across processes is the host’s responsibility.

Long-lived dehydrated waits — Frame is synchronous. Wait-then-resume across hours/days needs an external scheduler that holds snapshots and reconstitutes the system on the trigger.

Event sourcing — only state snapshots, not transition history. The save reflects “current state,” not “how we got here.”

If your use case needs these, layer them above Frame:

Persist the snapshot to a durable store (file, database, S3).

Coordinate snapshot timing in your host app.

For distributed state, use a coordinator (etcd, ZooKeeper, Raft).

For event sourcing, log every event externally; replay through Frame’s normal dispatch on restore.

Frame’s @@[persist] is the right tool for: game saves, mobile app state, server-side sessions, embedded device state, single-instance workflows. It is the wrong tool for: workflow orchestration platforms, infrastructure-as-code state, distributed consensus, long-running multi-day business processes.
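As an illustration of layering event sourcing above Frame, the sketch below keeps the event log in the host and rebuilds the system by replaying it through normal dispatch. Counter and LoggedSystem are hypothetical host-side classes; a real log would live in a durable store rather than a list.

```python
class Counter:
    """Hypothetical stand-in for a Frame system with a dispatch entry point."""

    def __init__(self):
        self.n = 0

    def dispatch(self, event: str) -> None:
        if event == "bump":
            self.n += 1
        elif event == "reset":
            self.n = 0
        else:
            raise ValueError(f"unknown event {event!r}")


class LoggedSystem:
    """Host-side wrapper: records accepted events, replays them on restore."""

    def __init__(self, factory, log=None):
        self.log = list(log or [])
        self.system = factory()
        for event in self.log:
            self.system.dispatch(event)  # replay through normal dispatch

    def dispatch(self, event: str) -> None:
        # Dispatch first so rejected events never reach the log.
        self.system.dispatch(event)
        self.log.append(event)
```

Logging after dispatch is one possible ordering; a write-ahead design would log first and needs a policy for events that then fail.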

Land this section as part of Phase B’s frame_runtime.md updates. Zero implementation cost; high value in setting user expectations correctly.

Roadmap item 3: `@@[on_load]` post-load wiring hook (Phase D) — SHIPPED 2026-05-03

@@[on_load] is an operation attribute that fires automatically after restore_state populates self, before any user-triggered event can dispatch. The user writes the body; framec emits a call to it at the end of the framework-managed restore body.

@@[persist]
@@system Counter {
    operations:
        @@[save]
        save_state(): bytes {}

        @@[load]
        restore_state(data: bytes) {}

        @@[on_load]
        rebuild_derived() {
            # called automatically after restore_state body completes,
            # before any user-triggered event can dispatch
            self.doubled = self.n * 2
            self.was_restored = true
        }
    ...
}

At-most-one per system (E810); requires @@[persist] (E801). Wired in 14 backends (Erlang’s gen_statem dispatch deferred). Test fixture: 95_persist_on_load_hook.fpy. framepiler a61390e.

Roadmap item 4: pluggable serializer registry

Today, the format token (JSON, Protobuf, etc.) is a string matched against per-backend hardcoded ser/deser implementations. Future: allow users to register custom serializers per format token, analogous to serde’s Serialize / Deserialize derive macros or Akka’s serializer config.

Defer until customer use case (e.g., encrypted-at-rest snapshots, custom binary format for embedded targets).

Roadmap item 5: incremental / differential save

For large systems where full snapshot is expensive, support a “what changed since last save” mode. Akin to Terraform’s plan-then-apply or Airflow’s per-row updates. Useful for:

Systems with large domain state (>1MB serialized).
High-frequency saves (every event).

Defer indefinitely: current Frame use cases are well within full-snapshot perf budgets.

Roadmap item 6: durable write-ahead-logging mode

For workflow-orchestration use cases where every state transition must be durable before the action is taken (Step Functions / Airflow contract). Would require:

Auto-save-on-transition wired into Frame’s dispatch loop.
A user-provided durable-write callback (or built-in support for common stores: SQLite, Postgres, file).
Recovery semantics: restart resumes from last durable transition.

This is a significant scope expansion — effectively Frame would become a workflow engine, competing with the systems in the survey table above. Defer until product direction explicitly aims here.

Open questions (current four-attribute design)

Default operation names when user wants the simplest possible declaration? Could allow @@[save] / @@[load] with no user-named operation and Frame auto-creates save_state / restore_state operations. Reduces boilerplate to one attribute on the system. Tradeoff: implicit operation generation conflicts with Frame’s “everything in interface: / operations: is user-declared” principle.
Format negotiation when the user-named load op is invoked with data that was saved under a different format? Currently the format is system-static, so this can’t happen unless the same system declaration changes formats across binary versions. Per roadmap item 1, this is the schema-versioning problem; deferred.
@@[no_persist] interaction with state vars / enter-args / state-args? These are compartment fields, not domain fields. The attribute is currently scoped to domain fields only; if users want transient state vars the recommended pattern is to lift them to domain with @@[no_persist]. Could be revisited.
