RFC-0012: Persist Stress Testing — Coverage, Limits, and Hash-Based Validation
Prompt Engineer: Mark Truluck mark@frame-lang.org
Status: Amendment shipped (Phases A–B + D, 2026-05-02 → 2026-05-03). Phase C (schema versioning + @@[migrate]) deferred to roadmap.
Created: 2026-05-01
Last design discussion: 2026-05-01
Status
In progress. This RFC captures design discussions held during the nested-system persistence rollout. The rollout itself shipped (5-deep nested persist works on all 17 backends — C and Erlang reached parity in the 2026-05-02 wave 8 push). The RFC tracks both the test discipline needed for the persist contract and the contract decisions themselves as they land.
Decided / in implementation: the quiescent contract for
save_state (see “Resolved decisions” below). Hard cut, no soft
warning. Implementation tracked as a separate work stream.
Still open: the rest of the questions in “Open contract questions” plus the test discipline build-out.
Resolved decisions
Quiescent contract for save_state (decided 2026-05-01)
save_state requires the system to be quiescent — defined as
_context_stack empty (no event being dispatched, no handler in
flight, no pending return).
Why: mid-event saves are inherently undefined today. Three concrete ambiguities:
- Pending transition. Frame queues -> $New until the handler returns. Mid-handler, __compartment.state is still $Old. A snapshot here loses the queued transition; restore resumes from $Old with no record of the intent.
- Partial @@:return. Some backends store the return value on the top context frame. Mid-handler save would persist a return value that no caller will ever consume.
- @@:data["key"] per-call data. Lives on the context frame. Saving it captures partial intermediate state; restoring it makes no sense without restoring the call that owned it.
The quiescent contract eliminates the entire class of ambiguity.
Orthogonal — _state_stack (push/pop) is fine during quiescent.
_state_stack is the modal stack (push$/pop$); _context_stack is
the call stack. Save between interface calls when push$ has built
a multi-level stack is normal supported behavior. Quiescent only
forbids mid-call.
Edge cases:
- Constructor $>(). Initial-enter context is pushed and popped before constructor returns. Quiescent at user-visible time. ✓
- Restore + __skipInitialEnter. Bypasses $>() entirely; no context ever pushed. Quiescent at end of restore. ✓
- Async (Java CompletableFuture, Rust Future, etc.). Handler awaits → suspended → context still on stack. Save from a separate task during the await is non-quiescent and correctly errors.
- Pop-with-transition. Both queued: mid-handler state. Quiescent check fires. ✓
- Multi-system cross-call. Outer.tick() calls Inner.cycle(). Each system has its own _context_stack. Outer is non-quiescent (mid-tick) even when Inner is between its own calls. Saving Outer here correctly errors.
Error contract:
- Code: E700
- Name: system not quiescent
- Per-backend mechanism:
  - Native exceptions (Java/Kotlin/C#/Swift/Python/Ruby/PHP/Dart/JS/TS/Lua): throw FrameNotQuiescent (or per-backend convention) with code=E700, message="system not quiescent".
  - Rust: return Result<String, FrameError>; FrameError::NotQuiescent.
  - Go: return (string, error).
  - C: return null + sets errno-equivalent / writes to error out-parameter (per existing C persist convention).
  - Erlang: persist is sidecar; return {error, system_not_quiescent}.
  - GDScript: return empty + push to error queue (per GDScript convention).
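As a rough illustration of the contract on an exception-based backend, the sketch below models it in Python. The class Toy, the tick handler, and the exception's constructor shape are hypothetical stand-ins written by hand; the generated code and the per-backend error type may look different.

```python
import json


class FrameNotQuiescent(Exception):
    """Hypothetical error type carrying the E700 code from the contract above."""

    def __init__(self):
        super().__init__("E700: system not quiescent")
        self.code = "E700"


class Toy:
    """Hand-written stand-in for a generated system; not framec output."""

    def __init__(self):
        self.state = "$Idle"
        self._state_stack = []    # modal stack (push$/pop$); may be non-empty at save
        self._context_stack = []  # call stack; must be empty at save time

    def save_state(self):
        if self._context_stack:   # an event is still being dispatched
            raise FrameNotQuiescent()
        return json.dumps({"state": self.state, "_state_stack": self._state_stack})

    def tick(self, save_inside=False):
        """Interface call: push a context, run the handler body, pop the context."""
        self._context_stack.append({"event": "tick"})
        try:
            if save_inside:
                return self.save_state()  # mid-handler save: raises E700
            self.state = "$Running"
        finally:
            self._context_stack.pop()


toy = Toy()
toy.tick()
snapshot = toy.save_state()       # between interface calls: allowed

try:
    toy.tick(save_inside=True)
    mid_event_error = None
except FrameNotQuiescent as exc:
    mid_event_error = exc.code
```

The check reads only _context_stack, so a non-empty _state_stack built by push$ would not trip it, which matches the "orthogonal" note above.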
Backwards compatibility: hard cut. Pre-1.0, persist still being hardened. Mid-event save was always undefined — promoting it to a hard error is a stricter contract, not a behavior change for any correctly-written user code.
Implementation cost:
- Codegen: ~3 lines per backend at start of save_state. ~45 lines total + per-backend error type definitions.
- Tests: new test 88 exercising the error path. Mechanism varies per backend (probe via self.save_state() from inside a handler where Frame syntax allows it; manual context-stack injection otherwise). Skip backends where neither path works cleanly; document per-backend coverage.
- Docs: persist-contract section in frame_runtime.md; per-language guide updates for the error type and idiomatic handling.
Test design (test 88):
- Frame source has a handler that attempts self.save_state() from inside the handler body.
- Native test calls the handler and asserts the runtime raises E700 / system not quiescent (per-backend error mechanism).
- Backends where self.save_state() from a handler isn’t expressible via Frame’s escape hatches: skip with documented reason. Defer until those backends grow the syntax (or skip permanently).
This decision closes open contract question 4 below.
Background
Frame’s @@persist mechanism serializes a state machine snapshot
to a JSON string (Python uses pickle), and rebuilds an instance
from that string on load. Through the work in framepiler commits
5849b9a3 and earlier, persist now works across all 17 backends
for primitive domain vars, state-args, push/pop, HSM, async,
multi-event sequences, and (post cafdec8) nested system
instances 5-deep.
Test corpus today:
- tests/common/positive/primary/56_persist_* through 83_persist_5deep.*
- Per-feature waves documented in framec-test-env/fuzz/FUZZ_PLAN.md
- DEFECTS.md captures D1–D18, all closed
What we don’t have:
- Property-based testing of the round-trip invariant
- Adversarial input testing (corrupted JSON, wrong-class restore)
- Schema-evolution testing (snapshot format compatibility)
- A spec for cyclic / shared-reference graphs
- A spec for mid-event save semantics
Hand-written tests catch known shapes. They cannot catch the “unknown unknown” — corner cases we didn’t think to write a test for. The question this RFC answers: how far can we push coverage, and what’s the cleanest mechanism to do so?
Discussion summary
This RFC captures a design discussion. Preserved verbatim where the rationale matters; condensed elsewhere.
Graph topologies — what shapes do we test?
Test 83 covers a linear chain (L1→L2→L3→L4→L5). That’s the simplest non-trivial topology. Untouched shapes:
- Branching: one parent holds multiple distinct children (a: @@A() ; b: @@B()). Most generators assume one child slot. Likely to surface bugs in any backend that hand-rolls the field iteration.
- Same-type siblings (a: @@Counter() ; b: @@Counter()): two instances of the same class as separate fields. Should round-trip independently with no aliasing.
- Tree / fan-out: N children, each with subtrees. Stresses framework recursion limits.
- Diamond / shared reference: A holds B and C; B and C both hold the same D. After restore, are B’s d and C’s d the same logical instance, or separate copies? Today: separate copies (JSON has no reference semantics). Worth a deliberate test that asserts the expected answer.
- Cycle / back-reference: A holds B holds A. Naive serializers infinite-loop; pickle handles it via memoization; JSON-based backends will stack-overflow today. Worth testing the failure mode (graceful error vs. hang vs. crash).
- Self-reference: A holds A directly. Sharper version of cycle.
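The diamond and cycle cases can be observed directly with Python's standard library, using plain dicts in place of Frame systems. This is only an illustration of the underlying serializer behavior, not of generated code: the stdlib JSON encoder happens to detect circular references and raise, whereas a hand-rolled recursive serializer (as described above) would recurse until the stack overflows.

```python
import json
import pickle

# Diamond: b and c both hold the same d object.
d = {"count": 1}
root = {"b": {"d": d}, "c": {"d": d}}

via_json = json.loads(json.dumps(root))
via_pickle = pickle.loads(pickle.dumps(root))

# JSON has no reference semantics, so d comes back as two separate copies.
json_shares_d = via_json["b"]["d"] is via_json["c"]["d"]
# pickle's memo table restores one shared object.
pickle_shares_d = via_pickle["b"]["d"] is via_pickle["c"]["d"]

# Cycle: a holds b, b holds a.
a = {"name": "a"}
b = {"name": "b", "peer": a}
a["peer"] = b

try:
    json.dumps(a)
    json_cycle_result = "ok"
except ValueError:
    # The stdlib encoder checks for circular references by default.
    json_cycle_result = "ValueError"

cyclic_copy = pickle.loads(pickle.dumps(a))
pickle_cycle_ok = cyclic_copy["peer"]["peer"] is cyclic_copy
```

A deliberate diamond test would assert whichever of the two sharing outcomes the contract settles on, so that a change in behavior is caught rather than silently accepted.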
Cross-products with other Frame features
Persist × N is where bugs concentrate. Combinations done in existing waves: state-args (D5–D11, D15), HSM (test 57, 60), push/pop (test 58, 60), async (test 81), multi-event (test 59).
Untouched in combination with nested systems:
- Nested systems × HSM
- Nested systems × push/pop
- Nested systems × async
- Nested systems × state-args holding nested systems ($Active(c: Counter): pass a system instance as a state-arg)
The 4-way cross (persist × HSM × nested × push/pop) is the maximum-density spot. Past 4-way crosses found the highest defect density (D16, D17, Wave 1 Phase 14 Erlang defects).
Tricky data inside nested systems
Test 83’s L1–L5 each have a single int and a child field.
Real systems carry richer state. Risks:
- Nested level with collection state-args (Map<String, List<int>>): exercises type-ignorant persist and nested-system round-trip simultaneously
- Nested level with its own HSM: tests that compartment-chain serialization is per-system, not system-global
- Nested level with state-vars that change pre/post-save
- Nested level with operations (which mutate but aren’t event-driven)
Identity & invariants
What gets preserved by restore that we might not realize:
- __compartment chain (current state hierarchy)
- _state_stack (push history)
- HSM parent_compartment back-pointers
- Nested children, recursively

What should NOT be preserved (negative invariants):
- _context_stack (per-event scratchpad; would leak request data across persist boundaries)
- Pending event queue (if any)
- Any closure references / live timer handles
A test discipline that explicitly asserts negative invariants is just as valuable as one that asserts positives. We don’t have any negative-invariant tests today.
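A negative-invariant test can be as small as a recursive key scan over the parsed snapshot. The sketch below assumes a JSON snapshot and assumes the forbidden data would appear under a key literally named _context_stack; the key set and the sample blobs are illustrative, not taken from any backend's real output.

```python
import json

FORBIDDEN_KEYS = {"_context_stack"}  # assumed key names; extend per backend


def forbidden_keys_in(node, path="$"):
    """Return the JSON paths of every forbidden key found anywhere in the snapshot."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                hits.append(child)
            hits.extend(forbidden_keys_in(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(forbidden_keys_in(value, f"{path}[{index}]"))
    return hits


# Hand-written sample snapshots: one clean, one leaking context in a nested child.
clean = json.dumps({
    "_compartment": {"state": "$A"},
    "_state_stack": [],
    "child": {"_compartment": {"state": "$B"}},
})
leaky = json.dumps({
    "_compartment": {"state": "$A"},
    "child": {"_context_stack": [{"event": "tick"}]},
})

clean_hits = forbidden_keys_in(json.loads(clean))
leaky_hits = forbidden_keys_in(json.loads(leaky))
```

Because the scan recurses into nested children, it also covers the case where a top-level system is clean but a nested one leaks.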
Failure / adversarial input
For a real production-grade contract, we should test:
- Corrupted JSON — does each backend throw a typed exception, or panic / segfault?
- Truncated input — same question
- Wrong-class snapshot: load a Container blob into Counter. Today: undefined behavior. Should reject with a clear error.
- Schema evolution: user added a domain field after the snapshot was written. Does the field default? Does old code reject? Does newer code load the old snapshot?
Open contract questions
Before we write tests, we need answers (or deliberate “undefined”) for:
1. Cycles: graceful error or memoize and preserve sharing? Discussion expanded below under “Cycles in the persist graph.” Recommendation: Option A (E702 detect+error). Decision deferred pending customer feedback.
2. Shared references: lock in “duplicate on restore” or treat as a future feature (memoization-based preservation)? Effectively answered by the cycles decision: Option A keeps “duplicate on restore.” Option B would preserve sharing as a side effect.
3. Schema evolution: in scope for the test suite, or save for a later production-readiness milestone?
4. _context_stack mid-event: should saving during a handler be disallowed (throw), or do we promise something about what’s captured? RESOLVED (2026-05-01): see “Resolved decisions / Quiescent contract” above. Mid-event save is a hard error: E700 / system not quiescent.
5. Adversarial input contract: typed exception (named what?), generic panic, or silent garbage? Discussion expanded below under “Adversarial input — threat model and proposed contract.” Decision deferred pending threat-model selection.
6. Concurrent save during async-await: undefined? rejected? captured-at-suspension? Effectively resolved by the quiescent contract: concurrent save during await is non-quiescent (handler context still on stack). Errors with E700. Concurrent save from another system in the same process is still untested; that’s a separate concern.
These questions gate the test design. Without answers, tests can’t assert anything meaningful — there’s no contract to compare against.
Cycles in the persist graph
(Expanded from open question 1, decision deferred pending customer feedback.)
What “cycle” means
The persist graph is rooted at the system you call save_state
on. Each @@SystemName field is an edge to a child instance. A
cycle is when traversal returns to a previously-visited instance:
- Self-reference (1-cycle): A holds an A (self.peer = some_a).
- Mutual (2-cycle): A holds B; B holds A.
- Longer: A→B→C→A.
Construction path
- Static @@SystemName() initializers cannot form cycles. A’s @@B() constructs B; if B has @@A(), B’s construction triggers another A construction, infinitely. The program crashes during construction, before persist enters the picture. framec doesn’t currently catch this; it could add a static cycle check (E430-class), but that’s a separate concern.
- Runtime mutation can form cycles. A handler with a system-typed parameter that does self.x = arg lets users wire a.set_b(b); b.set_a(a). Real cycles, real risk.
What each backend does today
| Backend | Behavior on cyclic save |
|---|---|
| 14 JSON backends + C | Stack overflow — save_state recurses indefinitely |
| Python (pickle) | Round-trips correctly with shared identity preserved — pickle’s memo table |
| Erlang | gen_statem call chain deadlocks (timeout) |
Python is the only backend that survives a cyclic save. Every other backend fails hard: a stack overflow, or on Erlang a deadlock that ends in a timeout.
Three options
Option A — Detect and error (E702). Each save_state
maintains a thread-local in-flight set; recursion into an
already-visited instance throws E702: cycle detected in persist
graph. RAII / try-finally cleans up.
- Pros: simple, fast, uniform across backends, matches E700 philosophy, ~150 LOC codegen + ~600 LOC tests, ~1.5 days.
- Cons: regresses Python (loses pickle’s cycle handling). Cycle-using code has to be rewritten (store IDs instead of object references).
Option B — Memoize (preserve sharing). Each instance gets a
unique ID at save time; repeat visits write {"_ref": N}
instead of duplicating state. Restore is two-pass: allocate,
then wire.
- Pros: cyclic graphs round-trip; shared references preserve identity post-restore (real feature beyond cycles).
- Cons: per-backend complexity is significant — two-pass save + two-pass restore + per-backend identity hashing. Wire format gets ID space (harder schema migration). ~600-800 LOC, ~5-7 days.
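For comparison, Option B can be sketched on the same toy shape. This assumes a single peer field, an "_id" key for first visits (an assumption; the text above only names "_ref"), and none of the compartment or state-stack data a real snapshot carries, so it understates the real per-backend cost.

```python
import json


class Node:
    """Stand-in for a generated system with one optional system-typed field."""

    def __init__(self, name):
        self.name = name
        self.peer = None


def save_graph(root):
    """Emit each instance once; repeat visits become {"_ref": N}."""
    ids = {}  # id(instance) -> snapshot-local ID

    def visit(node):
        if node is None:
            return None
        if id(node) in ids:
            return {"_ref": ids[id(node)]}
        ids[id(node)] = len(ids)
        return {"_id": ids[id(node)], "name": node.name, "peer": visit(node.peer)}

    return json.dumps(visit(root))


def restore_graph(blob):
    """Two passes: allocate every instance by ID, then wire the peer edges."""
    tree = json.loads(blob)
    by_id = {}

    def allocate(entry):
        if entry is None or "_ref" in entry:
            return
        by_id[entry["_id"]] = Node(entry["name"])
        allocate(entry["peer"])

    def wire(entry):
        if entry is None or "_ref" in entry:
            return
        peer = entry["peer"]
        if peer is not None:
            key = peer["_ref"] if "_ref" in peer else peer["_id"]
            by_id[entry["_id"]].peer = by_id[key]
        wire(peer)

    allocate(tree)
    wire(tree)
    return by_id[tree["_id"]] if tree is not None else None


a, b = Node("a"), Node("b")
a.peer, b.peer = b, a             # mutual 2-cycle
restored = restore_graph(save_graph(a))
```

Even in this reduced form the restore side needs an ID table and a second pass, which is the complexity the cons above refer to.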
Option C — Document as undefined. Zero work; stack overflow remains the failure mode.
Recommendation
Option A, three reasons:
- Cost/value alignment. Cycles aren’t a feature most users want; B’s complexity buys a niche capability.
- Sharing-preservation isn’t free even with B. Once you commit to B, you’re partway down the road to “Frame persist is a full object-graph serializer” — a much bigger commitment.
- The Python regression is the right call. Currently pickle handles cycles silently; if a user moves their app from Python to Java, the cycle becomes a stack overflow with no warning. Option A makes the contract uniform: cycles ALWAYS error.
Tests for Option A (test 89)
Frame source: two systems where the user constructs a cycle via
runtime mutation (a.set_peer(b); b.set_peer(a)). Driver calls
save_state, expects E702. Per-backend error mechanism (throw /
panic / abort / push_error) follows E700 conventions.
Open questions
- Option A or B? Recommend A.
- Python policy: lose pickle cycle support (uniform contract) or keep as documented Python-only behavior?
- Add the static @@SystemName() initializer cycle check?
- Make “shared references duplicate on restore” semantics explicit in docs?
Python: switch from pickle to JSON-based persist
(Discussion piece, decision deferred pending customer feedback. Tightly coupled to the cycles question above and the adversarial-input section below.)
Current state
Python uses pickle.dumps/loads (line 1336 of interface_gen.rs
— a 2-line implementation). Every other backend uses JSON via the
language’s idiomatic library.
The case for switching
- Closes the highest-severity adversarial-input item. pickle.loads on attacker-controlled input is RCE. JSON is data-only; the worst an attacker does is craft malformed JSON, which json.loads rejects cleanly.
- Cross-backend wire format becomes viable. RFC-0012’s “cross-backend Wire Format” item moves from “deferred / 1-2 weeks” to “already done.” Save on Python, restore on JS.
- Uniform contract. E700 / E701 / E702 map cleanly across all 17 backends without Python-specific exceptions.
- Debuggability. print(o.save_state()) shows readable JSON, not opaque pickle bytes.
- Test 86 byte-canonical idempotence becomes valid for Python. Currently skipped because pickle bytes aren’t JSON-comparable.
The case against
- Loses pickle’s “any object” capability. Pickle preserves most Python objects (custom class instances, tuples, sets, etc.), though not lambdas. JSON handles int / float / bool / str / None / list / dict only. In practice, Frame domain types track what other backends accept (primitives + nested systems), so Python users who already wanted portability are unaffected. Custom-class domain fields are uncommon.
- Loses pickle’s cycle support. Pickle’s memo table preserves cyclic graphs. JSON-based Python would crash on cycles like the other 14 backends. If Option A from the cycles section ships, this becomes uniform: Python aligns with everyone else.
- Breaking change. Existing pickle blobs become unreadable. Hard cut, no auto-migration. Same precedent as E700.
- Codegen complexity goes from 2 lines to ~80. Mirrors what the other JSON backends already do.
Implementation sketch
Direct port of JS saveState/restoreState to Python:

    def save_state(self):
        # Quiescent contract: refuse to snapshot mid-event (E700).
        if self._context_stack:
            raise RuntimeError("E700: system not quiescent")
        import json

        def ser_comp(c):
            # Serialize one compartment, recursing up the parent chain.
            if not c:
                return None
            return {
                "state": c.state,
                "state_args": list(c.state_args),
                "state_vars": dict(c.state_vars),
                "enter_args": list(c.enter_args),
                "exit_args": list(c.exit_args),
                "forward_event": c.forward_event,
                "parent_compartment": ser_comp(c.parent_compartment),
            }

        j = {
            "_compartment": ser_comp(self.__compartment),
            "_state_stack": [ser_comp(c) for c in self._state_stack],
        }
        # per-domain-field handling: recurse for nested @@SystemName
        # ...
        return json.dumps(j)

    @staticmethod
    def restore_state(json_str):
        import json
        j = json.loads(json_str)
        cls = <SystemName>  # template placeholder; codegen substitutes the system class
        cls.__skipInitialEnter = True
        try:
            instance = cls()
        finally:
            # Reset the flag even if construction raises.
            cls.__skipInitialEnter = False
        # deser_comp (not shown) is the inverse of ser_comp.
        instance.__compartment = deser_comp(j["_compartment"])
        # ...
        return instance

__skipInitialEnter is the same static-flag pattern used by Java
and C# today.
Migration path
- Pre-1.0: hard cut. Document loudly. Existing pickle blobs become unreadable; users discard or re-create.
- Bundle with cycles work (Option A E702): single matrix run, single test rollout, single user-facing migration.
- Optional flag (@@persist(format=pickle)) if customer feedback shows real demand for arbitrary-object preservation. Default to JSON. Costs ~50 LOC to keep both code paths.
Effort estimate
- Codegen switch: ~1 day.
- Cycles work bundled (Option A): +~1.5 days.
- Per-language guide + RFC + matrix updates: ~0.5 day.
- Total: ~3 days for the bundled wave.
Open questions
- Hard cut, or @@persist(format=pickle) opt-in?
- Bundle with cycles, or separate?
- Re-enable test 86 byte-canonical idempotence for Python during the migration?
Adversarial input — threat model and proposed contract
(Expanded from open question 5, decision deferred pending threat-model selection.)
What “adversarial input” means
Calling restore_state with a JSON blob that’s malformed,
corrupted, malicious, or just wrong. Concrete shapes:
- Truncated JSON. Blob cut mid-document. Parser fails fast.
- Type mismatches. _compartment.state should be a string, blob has 42. Restore uses wrong type; fails or silently corrupts depending on backend.
- Missing required fields. No _compartment key. Restore NPEs trying to access it.
- Wrong-class blob. Saved Outer, restored as Foo. Field shapes mismatch.
- Unknown extra fields. Blob from a future framec version with new fields. Forwards-compat: should be ignored.
- State name not in topology. state: "$Bogus" references a state that doesn’t exist in this system. Already produces RestoreError per the existing topology-validation pass.
- Numeric overflow. i32 field with value 2^33. Backend parser truncates or errors.
- Maliciously-crafted blob. Billion-laughs equivalent (deeply nested arrays designed to OOM), excessive nesting that overflows the parser’s recursion, gigantic strings.
- Pickle-specific (Python only). Pickle deserializes class instantiations including __reduce__ methods. A crafted pickle blob runs arbitrary code on pickle.loads. This is a documented Python vulnerability; the pickle docs explicitly warn against unpickling untrusted data. The only Frame backend currently exposed to this is Python.
Three threat models, three different scopes
The work required depends entirely on what users do with
restore_state:
(A) Local file save/load — game state, editor sessions, crash recovery to disk. Threat: filesystem corruption. Rare. Need: define a clear error so users can fall back to defaults. No security work.
(B) Network/database/cookie persistence — session state over the wire, multi-tenant systems where one tenant’s blob might be loaded by another’s code. Threat: attacker controls the blob. DoS via OOM, parse errors, RCE on Python pickle. Need: hardened input validation, typed errors per failure mode, switch Python off pickle (security-critical for this threat model), defense against depth bombs. Significant work (3–5 days plus security review).
(C) Process snapshot for crash recovery — same-process save/restore for fault tolerance. Threat: filesystem corruption, not adversarial. Need: robust error handling, no security hardening.
Current state per backend
restore_state behavior under adversarial input today:
| Backend | Behavior |
|---|---|
| Python (pickle) | RCE risk on malicious input. Major hole if used over the wire. |
| Java/Kotlin (Jackson) | Throws Jackson exceptions on parse errors; type coercion silently does wrong things on mismatches. |
| Rust (serde derive) | Strict — fails fast on missing fields, type mismatches. |
| Go (encoding/json) | Silently ignores unknown fields, errors on type mismatches. |
| C++ (nlohmann) | Parses leniently; typed access throws on mismatch. |
| Lua (cjson) | error() on malformed input. |
| C# (System.Text.Json) | Throws on parse errors; lenient on type. |
| Swift (Codable) | Throws DecodingError on shape mismatch. |
| PHP (json_decode) | Returns null on parse failure; type coercion silent. |
| Ruby (JSON) | Throws JSON::ParserError. |
| Dart (jsonDecode) | Throws FormatException. |
| JS/TS | JSON.parse throws SyntaxError. |
| GDScript (var_to_bytes) | Returns null/empty on bad input. |
| C (cJSON) | Returns NULL on parse failure; manual checks needed. |
| Erlang (sidecar) | Throws erlang:error on bad term. |
No uniform contract. Failure modes range from “throws clear typed error” to “silently corrupts” to “executes arbitrary code.”
Proposed contract — E701: corrupted snapshot
Mirror the E700 pattern. Spec says restore_state should fail
with E701: corrupted snapshot on any of:
- Parse failure (malformed JSON / pickle / etc.)
- Missing required structural field (_compartment, _state_stack)
- Type mismatch on a structural field (state name not a string, state_stack not an array, etc.)
- State name not in _HSM_CHAIN (already raises RestoreError today; subsume into E701 or keep separate code per the topology question; pick one)
- Wrong-class blob (no system-name marker; debatable whether framec embeds one; discussed below)
Per-backend mechanism follows E700 conventions:
- Throw on JVM/dynamic langs/C++/Dart.
- Panic on Rust/Go.
- Abort on C/Swift.
- Empty return + error queue on GDScript.
- Erlang: {error, corrupted_snapshot} tuple or erlang:error.
What E701 does NOT cover:
- Numeric overflow within valid JSON (out-of-spec but parseable).
- Forwards-compatible unknown fields (silently ignored, not an error).
- Adversarial DoS (depth bombs, gigantic strings) — separate hardening pass under threat model B.
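One possible shape for the parse-and-validate wrapper on the Python side, assuming a JSON-based snapshot. The CorruptedSnapshot type, the validate_snapshot helper, and the KNOWN_STATES set are hypothetical, and folding the state-name check into E701 is just one of the two outcomes still under discussion.

```python
import json


class CorruptedSnapshot(Exception):
    """Hypothetical error type for E701."""

    def __init__(self, detail):
        super().__init__(f"E701: corrupted snapshot ({detail})")
        self.code = "E701"


KNOWN_STATES = {"$Idle", "$Running"}  # stand-in for the system's topology


def validate_snapshot(blob):
    """Parse a snapshot and check structural fields, converting every failure to E701."""
    try:
        doc = json.loads(blob)
    except (TypeError, ValueError) as exc:  # json.JSONDecodeError subclasses ValueError
        raise CorruptedSnapshot("parse failure") from exc
    if not isinstance(doc, dict):
        raise CorruptedSnapshot("top level is not an object")
    for field in ("_compartment", "_state_stack"):
        if field not in doc:
            raise CorruptedSnapshot(f"missing {field}")
    compartment = doc["_compartment"]
    if not isinstance(compartment, dict) or not isinstance(compartment.get("state"), str):
        raise CorruptedSnapshot("state name is not a string")
    if not isinstance(doc["_state_stack"], list):
        raise CorruptedSnapshot("_state_stack is not an array")
    if compartment["state"] not in KNOWN_STATES:
        raise CorruptedSnapshot("state name not in topology")
    return doc  # unknown extra fields are left alone (forward compatibility)


def code_for(blob):
    """Return the error code a blob produces, or None if it validates."""
    try:
        validate_snapshot(blob)
        return None
    except CorruptedSnapshot as exc:
        return exc.code


good = '{"_compartment": {"state": "$Idle"}, "_state_stack": [], "future_field": 1}'
bad_blobs = [
    good[:20],                                                    # truncated
    '{"_state_stack": []}',                                       # missing field
    '{"_compartment": {"state": 42}, "_state_stack": []}',        # type mismatch
    '{"_compartment": {"state": "$Bogus"}, "_state_stack": []}',  # unknown state
]
results = [code_for(blob) for blob in bad_blobs]
```

Numeric overflow and depth bombs are deliberately not handled here, in line with the exclusions listed above.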
What about the Python pickle problem?
For threat model B, pickle is non-negotiably a problem. Options:
1. Replace pickle with JSON for Python persist. Match the other backends. Loses pickle’s “preserves arbitrary Python objects” property; domain fields would need explicit JSON serialization rules like the typed backends. Significant codegen change.
2. Add a @@persist(safe) opt-in mode that uses JSON. Default stays pickle for backward compat.
3. Document only. Add a security warning to the Python guide and persist docs: “do not unpickle untrusted blobs.” No code change.
Option (3) is fine for threat model A or C. (1) or (2) only needed if Frame officially supports B.
Embedded class marker for wrong-class detection
Today’s blob has no “this was saved from class X” marker.
Restoring an Outer blob into Foo.restore_state produces
undefined behavior (probably parse error or silent garbage,
depending on field overlap). A 1-line fix: include
"_system": "Outer" in the saved JSON, validate on restore.
Would close one E701 case cleanly.
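The marker idea is small enough to show end to end. In this sketch the stamp and check_marker helpers and the WrongSystem type are hypothetical; only the "_system" key comes from the proposal above.

```python
import json


class WrongSystem(Exception):
    """Hypothetical error; under the proposal this would surface as E701."""


def stamp(system_name, fields):
    """Save side: add the marker next to the existing snapshot fields."""
    return json.dumps({"_system": system_name, **fields})


def check_marker(blob, expected):
    """Restore prologue: reject blobs saved from a different system."""
    doc = json.loads(blob)
    found = doc.get("_system") if isinstance(doc, dict) else None
    if found != expected:
        raise WrongSystem(f"expected {expected!r}, found {found!r}")
    return doc


outer_blob = stamp("Outer", {"_compartment": {"state": "$Idle"}, "_state_stack": []})
accepted = check_marker(outer_blob, "Outer")

try:
    check_marker(outer_blob, "Foo")
    rejected = False
except WrongSystem:
    rejected = True
```

A blob written before the marker existed has no "_system" key and is rejected by this check, so the marker would need to ship in the same hard cut as the other contract changes or tolerate a missing key for a transition period.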
Recommended path
The minimum work that closes the contract gap:
1. Document the threat model. One paragraph in frame_runtime.md: “restore_state assumes trusted input. Untrusted-source blobs need separate validation. Python pickle is especially dangerous for untrusted input; switch to a JSON-based approach if needed.” Effort: 30 min.
2. Define E701: corrupted snapshot with the same per-backend mechanism table as E700. Codegen wraps each backend’s parse-and-validate path so failures convert to E701. Effort: ~1 day.
3. Add embedded class marker ("_system": "<SystemName>"). Validate in restore_state prologue. Effort: ~2 hours across 15 backends.
4. Test 89: adversarial input smoke. Per backend, ~5 cases: truncated, wrong type on structural field, missing field, wrong-class blob, state-name-not-in-topology. Verify each produces E701, not crash/UB. Effort: ~1 day.
Total: ~2.5 days. Closes question 5 for threat models A and C.
Defer until production use case appears:
- Pickle replacement (1 or 2 above).
- Depth-bomb/string-bomb hardening.
- Numeric-range validation on domain fields.
These are threat-model-B work. Build them when someone needs B; don’t build speculative security infrastructure.
Decision needed from review
- Pick the default threat model. (A) seems most defensible for current Frame; (B) requires a security commitment.
- Confirm E701 as the error code or pick a different number.
- Decide whether RestoreError (existing topology-validation error) merges into E701 or stays as a sibling code.
- Confirm minimum viable scope: 1+2+3+4 above, or smaller?
Theoretical limits
We pushed on “what’s the theoretical best for coverage?” Three strata of difficulty:
Solvable mechanically (with engineering work)
| Item | Why tractable |
|---|---|
| Linear / tree / branching topologies | Just more domain fields. Existing Option A handles them. |
| Cycle detection + graceful error | Visited-set during serialize. ~30 LOC per backend. |
| Shared-reference preservation | Memoization ({"__id": 42, "data": ...} + {"__ref": 42}). Pickle does this; JSON-based backends just need an ID table. ~100 LOC per backend. |
| Cross-products (HSM, push/pop, async × nested) | Mechanical extension of existing waves. |
| Failure modes (corrupted/truncated input) | Wrap each backend’s parse call. ~10 LOC per backend. |
| Wrong-class snapshot rejection | Embed __sys: "L1" marker; check on restore. ~5 LOC per backend. |
| Additive schema evolution | Already mostly works — JSON ignores missing/extra keys. Just needs explicit testing. |
| Mid-event-handler save rejection | Check _context_stack.empty() at save entry; throw if not. Few LOC. |
| Negative invariants (_context_stack not in snapshot) | Diff the JSON; assert keys absent. Unit test. |
All tractable. None require new theory — just engineering.
Tractable but require user-written code
| Item | What’s needed from user |
|---|---|
| Semantic schema evolution (int → string field; system split/merge) | User-written migration function. Framework can route old → new via versioned @@migrate block. |
| Domain constraints / invariants | User-asserts post-restore. Framework can call a validate() hook. |
| Concurrent multi-thread save | User-supplied locking. Frame doesn’t enter the SMP/threading domain. |
| Custom type handling (in C, Rust, etc.) | User-supplied pack/unpack. Already established as the Frame contract. |
These are tractable in the sense the framework can facilitate them, but they fundamentally require user input. No automated machine generates a migration from “old schema” to “new schema” without knowing the user’s intent.
Genuinely intractable
| Item | Why |
|---|---|
| Continuation-style save (snapshot mid-await, resume from exact suspension point) | Requires first-class continuations or stackful coroutines at the language runtime level. Possible in Scheme, Smalltalk, some pickle subsets. Not possible for Rust async, JS Promise, Java CompletableFuture, etc.: their async types are not serializable. The achievable answer is “saves happen between events, not during”, which is a contract, not a test. |
| Universal observational equivalence proof | Rice’s theorem. You can test event sequences (sample-based confidence); you cannot prove equivalence for all inputs. |
| Auto-migration of arbitrary semantic changes | The user’s intent is not in the schema. Framework can detect the diff but can’t infer what to do with it. |
| Save during true concurrency without user-supplied isolation | Framework can’t know which threads access which fields. |
Where the practical ceiling sits
For Frame’s persistence as a whole, “excellent coverage” is achievable up to and including the boundary between framework and user concerns:
- All graph topologies (chain, branch, tree, diamond, cycle-detected) — mechanical
- All Frame feature × persist crosses (HSM, push/pop, async, multi-event, state-args, etc.) — mechanical
- Sharing/identity preservation if specced — mechanical via memoization
- Adversarial input contract (typed exceptions, schema validation) — mechanical
- Round-trip property assertion at scale via fuzzer — mechanical
What pushes us to the real ceiling: property-based testing.
Hash-based round-trip validation
Hash equality is the cleanest invariant for fuzz-scale testing. The pattern:
h1 = hash(canonical(instance.save_state()))
snap = instance.save_state()
restored = Class.restore_state(snap)
h2 = hash(canonical(restored.save_state()))
assert h1 == h2  # round-trip preserves state
If the two hashes match, the serializable representation is bit-identical pre/post round-trip. Strong invariant. Cheap (milliseconds per cycle). Easy to fan out across thousands of generated cases.
Subtleties
- Canonical form is essential. Cannot hash raw save_state() output directly:
  - JSON object key order varies (some serializers don’t preserve insertion order)
  - Float representation (1.0 vs 1 vs 1.00) varies by backend
  - Whitespace varies
  - Pickle: object identity creates different byte sequences for the same logical value

  Fix: normalize before hashing. Sort keys lexicographically; format floats with fixed precision; strip whitespace. ~20 LOC of canonicalization. SHA-256 the result.
- Hash captures state, not behavior. A bug that drops a state field would change the hash and be caught. A bug that subtly changes behavior without changing state (extremely rare for Frame since handlers are class methods, not closures) wouldn’t be caught by hash alone. Property-based event-replay testing complements it.
- _context_stack should be excluded. It’s per-event scratchpad. If you save mid-event the hash will mismatch, but that’s “don’t do that” not “broken”: the contract should forbid mid-event save. The canonicalizer should drop _context_stack (or save_state should reject if it’s non-empty).
- Per-backend hash, not cross-backend. Hash equality after round-trip on the same backend is the realistic test. Python’s hash will differ from Java’s hash for the same Frame system because each emits its own JSON shape. Same-backend round-trip is what we care about for “did persist work.”
- What it catches that observational testing misses:
  - Field reordering bugs that don’t affect tested behavior
  - Compartment chain corruption that isn’t exercised by your test events
  - Push/pop stack drift in branches your events didn’t reach
  - Nested-system state preserved at one level but truncated at another
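The canonicalize-then-hash step might look like the following in a Python test harness. The function names and the nine-digit float rounding are assumptions chosen for illustration; the sample snapshots are hand-written rather than real backend output.

```python
import hashlib
import json


def normalize(node):
    """Recursively normalise numbers so 1, 1.0 and 1.00 compare equal."""
    if isinstance(node, float):
        # Integral floats collapse to ints; others are rounded to a fixed precision.
        return int(node) if node.is_integer() else round(node, 9)
    if isinstance(node, dict):
        return {key: normalize(value) for key, value in node.items()}
    if isinstance(node, list):
        return [normalize(value) for value in node]
    return node


def canonical(snapshot):
    """Parse, normalise, and re-emit with sorted keys and no whitespace."""
    doc = normalize(json.loads(snapshot))
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))


def snapshot_hash(snapshot):
    """SHA-256 over the canonical bytes."""
    return hashlib.sha256(canonical(snapshot).encode("utf-8")).hexdigest()


# Same logical state, different key order, float spelling, and whitespace.
first = '{"count": 1.0, "_compartment": {"state": "$A"}}'
second = '{ "_compartment": {"state": "$A"},\n  "count": 1 }'
# A genuinely different state.
third = '{"count": 2, "_compartment": {"state": "$A"}}'

same = snapshot_hash(first) == snapshot_hash(second)
different = snapshot_hash(first) != snapshot_hash(third)
```

In the round-trip test itself, snapshot_hash would be applied to instance.save_state() before the round trip and to restored.save_state() after it.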
Cross-backend gold-standard variant
Define a “Frame Wire Format” — backend-agnostic canonical JSON
shape with explicit key ordering, normalized floats, version
stamp. Each backend emits it. Then hash(WireFormat) ==
hash(WireFormat) across Python and Java for the same logical
state.
Enables: saving in Python, restoring in Java, and verifying equivalence. Real engineering investment (~few days per backend), real payoff for serialization-format compatibility. Skip unless cross-backend persist is a stated goal.
Effort to add
- Canonicalizer: ~30 LOC of test-harness code (one normalizer, parses each backend’s JSON output)
- Hash helper: 5 LOC (SHA-256 of canonical bytes)
- Property test: 20 LOC fuzz loop generating random states + asserting h1 == h2
- Wire it into a per-backend test runner: ~2 hours
Total: less than a day to add to existing test infrastructure.
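The canonicalizer and hash helper listed above could look roughly like the following in a Python test harness. This is a minimal sketch under stated assumptions, not the shipped harness: the names canonicalize and snapshot_hash, the six-decimal rounding, and the premise that each backend's snapshot parses as JSON are all illustrative choices.

```python
import hashlib
import json


def _normalize(value, drop_keys):
    """Recursively normalize a parsed snapshot (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: _normalize(v, drop_keys)
                for k, v in value.items() if k not in drop_keys}
    if isinstance(value, list):
        return [_normalize(v, drop_keys) for v in value]
    # bool is a subclass of int in Python, so keep it out of the numeric branch.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Every number becomes a rounded float so 1, 1.0 and 1.00 agree.
        return round(float(value), 6)
    return value


def canonicalize(snapshot_json, drop_keys=("_context_stack",)):
    """Return canonical bytes: sorted keys, uniform numbers, no whitespace."""
    normalized = _normalize(json.loads(snapshot_json), set(drop_keys))
    text = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def snapshot_hash(snapshot_json):
    """SHA-256 hex digest of the canonical form."""
    return hashlib.sha256(canonicalize(snapshot_json)).hexdigest()


# Same logical state, different key order, whitespace and number format.
a = '{"n": 1, "state": "$S0", "_context_stack": []}'
b = '{ "state": "$S0",   "n": 1.00 }'
assert snapshot_hash(a) == snapshot_hash(b)
assert snapshot_hash(a) != snapshot_hash('{"n": 2, "state": "$S0"}')
```

One caveat of this particular sketch: converting every number to a float loses precision for integers above 2^53, so a real canonicalizer would likely need to treat large integers separately.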
Property-based event-replay testing
The strictly stronger invariant — strictly because hash equality is the necessary condition, observational equivalence is the sufficient one:
events = generate_random_event_sequence(N)
restored = restore(save(instance))  # snapshot before either copy sees the events
b1 = run_events(instance, events)
b2 = run_events(restored, events)
assert b1 == b2
If b1 == b2 for thousands of random event sequences, persist is
correct for that system with high statistical confidence.
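A runnable sketch of this property in Python follows. The Counter class is a hand-written stand-in for a generated Frame system, and check_replay_equivalence is a hypothetical harness name; only the save/restore round-trip idea comes from the text above.

```python
import json
import random


class Counter:
    """Hand-written stand-in for a generated Frame system (hypothetical)."""

    def __init__(self):
        self.state = "Idle"
        self.n = 0

    def dispatch(self, event):
        """Handle one event and return an observable output."""
        if event == "toggle":
            self.state = "Active" if self.state == "Idle" else "Idle"
        elif event == "bump" and self.state == "Active":
            self.n += 1
        return (self.state, self.n)

    def save_state(self):
        return json.dumps({"state": self.state, "n": self.n})

    @classmethod
    def restore_state(cls, data):
        inst = cls()
        snapshot = json.loads(data)
        inst.state, inst.n = snapshot["state"], snapshot["n"]
        return inst


def run_events(instance, events):
    return [instance.dispatch(e) for e in events]


def check_replay_equivalence(trials=1000, seed=0):
    rng = random.Random(seed)  # seeded so any failure is reproducible
    alphabet = ["toggle", "bump"]
    for _ in range(trials):
        # A random prefix drives the system into an arbitrary state.
        original = Counter()
        run_events(original, rng.choices(alphabet, k=rng.randint(0, 20)))
        restored = Counter.restore_state(original.save_state())
        # The same suffix runs on both; outputs must match event for event.
        suffix = rng.choices(alphabet, k=rng.randint(1, 20))
        assert run_events(original, suffix) == run_events(restored, suffix)
    return trials


check_replay_equivalence()
```

Taking the snapshot before replaying the suffix matters: running the events on the original first and saving afterwards would compare two different starting states.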
Combined with hash testing:
- Hash: cheap, instant, covers state preservation
- Behavior: expensive, covers continued operation post-restore
A fuzz harness that does both:
- Generates a Frame system per axis spec (depth × branching × HSM × push/pop × async)
- Generates a random event sequence
- Asserts hash equality after save→restore (cheap)
- Asserts behavior(events on restored) == behavior(events on original) (more expensive)
- Asserts no invariant violation (_context_stack empty in snapshot, etc.)
- Mutates the saved snapshot adversarially and asserts the right failure mode
Run for an hour per backend. If nothing breaks, you’re at the practical ceiling.
Effort: ~3–5 days of test infrastructure. Pays off forever. Most hand-coded tests (including test 83) become “regression anchors” for specific known cases; the fuzzer covers unknown unknowns.
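The adversarial-mutation step of such a harness might be sketched as below. Everything here is an assumption for illustration: load_snapshot is a hypothetical strict loader, the two-field schema is invented, and PersistFormatError is only one of the candidate exception names raised in the open questions, not a settled API.

```python
import json
import random


class PersistFormatError(Exception):
    """Candidate name only; the RFC has not settled the exception type."""


REQUIRED = {"state": str, "n": int}  # hypothetical snapshot schema


def load_snapshot(data):
    """Strict loader: any malformed input raises PersistFormatError."""
    try:
        snapshot = json.loads(data)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        raise PersistFormatError(str(exc)) from exc
    if not isinstance(snapshot, dict):
        raise PersistFormatError("snapshot must be a JSON object")
    for key, expected in REQUIRED.items():
        value = snapshot.get(key)
        if not isinstance(value, expected) or isinstance(value, bool):
            raise PersistFormatError(f"bad or missing field: {key}")
    return snapshot


def mutate(data, rng):
    """Apply one random corruption: truncate, drop a field, or retype one."""
    choice = rng.randrange(3)
    if choice == 0:
        return data[: rng.randrange(len(data))]
    snapshot = json.loads(data)
    key = rng.choice(sorted(snapshot))
    if choice == 1:
        del snapshot[key]
    else:
        snapshot[key] = [snapshot[key]]  # wrap in a list to change its type
    return json.dumps(snapshot)


def fuzz_loader(trials=500, seed=0):
    rng = random.Random(seed)
    good = json.dumps({"state": "$S0", "n": 2})
    outcomes = {"rejected": 0, "accepted": 0}
    for _ in range(trials):
        try:
            load_snapshot(mutate(good, rng))
            outcomes["accepted"] += 1
        except PersistFormatError:
            outcomes["rejected"] += 1
        # Any other exception type propagates and fails the run.
    return outcomes
```

The point of the design is the last comment: a corrupted snapshot should fail through the one typed exception, so anything else (a KeyError, a crash, a silent accept) counts as a defect.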
Recommended path forward
Ranked. Each step gates the next.
Step 1: Spec the contract (~1 day)
Pick answers for the six open questions in §Open contract
questions. Without these, tests can’t assert anything meaningful.
Document in a Frame contract doc (docs/persist-contract.md or
similar).
Step 2: Hash-based round-trip testing (~1 day)
Add canonicalizer + hash assertion to the test runner. Wire it into every existing persist test in the matrix as a sanity check (should be all-pass; if any flag, that’s a real defect).
Step 3: Hand-cataloged graph topology tests (~2 days)
Write ~30 tests covering:
- Branching (1, 2, 5 child fields)
- Same-type siblings
- Diamond
- Cycle (assert spec’d failure mode)
- Self-reference (assert spec’d failure mode)
- Tree fan-out
Each test runs hash-based round-trip assertion. Regression anchors.
Step 4: Property-based fuzzer (~3–5 days)
Build the fuzz harness:
- Frame system generator (parameterized by axis spec)
- Event sequence generator
- Hash + behavior + invariant assertions
- Adversarial mutation tier
Run continuously; treat as fuzz tier (long runs, occasional new defect surfaces).
Step 5: Schema-evolution test suite (optional, ~3 days)
If schema evolution is in scope (per Step 1’s contract decision):
- Snapshot v1 + framework v2 → assert tolerated
- Deleted field + old snapshot → assert tolerated
- Renamed field → assert user-written migration path works
- Type change → assert user-written migration path works
Step 6: Cross-backend Wire Format (optional, ~1–2 weeks)
Only if “save in Python, restore in Java” is a goal. Define canonical JSON; each backend emits it; round-trip assertion across backends.
Total to “theoretical best” (excluding optional steps): ~2 weeks of focused work. After that, marginal coverage gains become rapidly more expensive for diminishing returns.
Drawbacks / alternatives
Drawback: contract-first work delays test value
Steps 1 and 2 don’t add tests for ~2 days. If you’d rather see results sooner, swap order: write hash-based assertion first, discover the contract gaps as they manifest. Risk: some tests will need to be rewritten once the contract is settled.
Alternative: behavior testing only, skip hash
Behavior testing is sufficient for correctness. Hash is an optimization for fuzz-scale testing. If we’re not building a fuzzer, hash-based testing buys less. Recommendation against: hash testing is cheap enough that it’s worth doing even for the hand-cataloged test tier.
Alternative: cross-backend Wire Format first
If the strategic goal is “save anywhere, restore anywhere,” start with Wire Format. But this is significant engineering for a use case that may not be on the near-term roadmap. Default: defer.
Alternative: skip property-based testing entirely
Hand-cataloged tests plus hash assertions catch an estimated ~95% of bugs. Property-based testing catches the long tail. If budget is tight, skip the fuzzer and accept that some corner cases will surface as production bugs. Recommendation: don’t skip; the fuzzer is the difference between “we tested the known cases” and “we tested arbitrary cases.”
Open questions for review
Before implementation:
- Cycle policy: graceful error, or memoize and preserve?
- Shared-reference policy: duplicate (current), or memoize and preserve?
- Schema evolution scope: in-suite, or production-readiness milestone?
- Mid-event save: forbid (throw at save call), or capture _context_stack and document?
- Adversarial input contract: typed exception (named what? PersistFormatError / PersistVersionError / PersistSchemaError?), or generic.
- Concurrent save semantics: undefined, document; or single-threaded contract enforced by lock check.
- Test infra investment: hash + cataloged only (~1 week), or full property-based fuzzer (~2 weeks)?
- Cross-backend Wire Format: in scope, deferred, or out of scope?
Implementation status
Not started. RFC parked pending review of open questions.
The actual implementation work is well-scoped (~1–2 weeks depending on scope answers above), but should not begin until the contract questions are settled. Otherwise tests will assert behaviors that need to be rewritten when the contract is set.
References
- Test 83 5-deep nested persist: framepiler cafdec8, test_env ec179fbf
- Memory: type_ignorant_persist_2026_04_30.md
- DEFECTS.md (closed): D1–D18
- FUZZ_PLAN.md (Phase 24, waves 1–7)
Amendment 2026-05-02: @@[save] / @@[load] operation attributes
Motivation
The status-quo persist contract emits static func restore_state(data)
-> Self on every backend, mutates a class-static __skipInitialEnter
flag around .new(), and re-uses that flag in the constructor’s
initial-enter path to skip the normal lifecycle. This works on every
backend whose static-method scope can resolve the script’s own class
identifier — but it doesn’t work on GDScript, where a script’s static
function cannot resolve its own class_name (empirically verified
against Godot 4.6.2).
We considered eight candidate fixes (A–H) when investigating this. A (class_name declaration)
was the natural first attempt and doesn’t actually work — Godot’s
static funcs cannot see their own class even with class_name. Every
other option either requires per-target divergence in the public
contract, hardcoded resource paths, or doesn’t address the
architectural cost: __skipInitialEnter is a class-static race
window, and embedding the class identifier into a static method body
is a fragile coupling between codegen and target scoping rules.
Design
Four attributes replace the existing contract:
- @@[persist(<FormatType>)]: system-level. Declares the system participates in persistence and selects the wire format (e.g. JSON). Format names are opaque strings plumbed through to per-backend ser/deser implementations; Frame doesn’t validate the name beyond syntactic well-formedness. Default when omitted: JSON.
- @@[save]: operation attribute. Marks the operation Frame should fill in as the save entry point. Signature: (): <FormatType>. The operation has no body in source; Frame generates the body based on the format. Regular instance method. Caller invokes it as inst.<op_name>() and gets the serialized payload.
- @@[load]: operation attribute. Marks the operation Frame should fill in as the load entry point. Signature: (data: <FormatType>). No body in source. Regular instance method. Caller invokes it on an existing instance to overwrite the compartment with the persisted state.
- @@[no_persist]: domain field attribute. Marks a field as transient. The save body skips it; the load body leaves it at its default initializer value. Used for fields that hold external resources (sockets, file handles, UI references) that can’t be serialized and must be wired by the host after construction.
Example:
@@[persist(JSON)]
@@system Foo {
interface:
bump()
get_n(): int
operations:
@@[save] pickle(): JSON
@@[load] unpickle(data: JSON)
machine:
$S0 {
bump() { self.n = self.n + 1 }
get_n(): int { @@:(self.n) }
}
domain:
n: int = 0
}
User code — uniform across all 17 backends, two-step pattern:
foo = Foo() # $S0 enter fires (idempotent for typical systems)
foo.bump(); foo.bump()
data = foo.pickle() # @@[save] op, body framework-generated
foo2 = Foo() # construct fresh; $S0 enter fires
foo2.unpickle(data) # @@[load] op overwrites compartment with snapshot
assert foo2.get_n() == 2
var foo = Foo.new()
foo.bump(); foo.bump()
var data = foo.pickle()
var foo2 = Foo.new()
foo2.unpickle(data)
Foo foo = new Foo();
JSON data = foo.pickle();
Foo foo2 = new Foo();
foo2.unpickle(data);
Every backend uses the same shape: regular instance methods. No static-method-on-its-own-class scoping issue. GDScript fix is structural — the bug class can’t recur because there’s no static method to resolve.
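To make the instance-method shape concrete, here is a hand-written Python approximation of what the generated bodies for Foo might do. It is not framec output: the _NO_PERSIST set, the socket field, and the snapshot layout are invented for illustration. Only the pickle / unpickle operation names, the n field, and the quiescent rule on _context_stack come from this RFC.

```python
import json


class Foo:
    _NO_PERSIST = {"socket"}  # fields marked @@[no_persist] (hypothetical)

    def __init__(self):
        self._state = "$S0"
        self._context_stack = []  # non-empty only while an event is in flight
        self.n = 0
        self.socket = None        # transient; the host wires it after load

    def bump(self):
        self.n += 1

    def get_n(self):
        return self.n

    def _domain(self):
        # Domain fields are the public attributes in this sketch.
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}

    def pickle(self):
        """@@[save]: serialize state plus persistent domain fields."""
        if self._context_stack:
            raise RuntimeError("save requires a quiescent system")
        domain = {k: v for k, v in self._domain().items()
                  if k not in self._NO_PERSIST}
        return json.dumps({"state": self._state, "domain": domain})

    def unpickle(self, data):
        """@@[load]: overwrite this instance in place; no static factory."""
        snapshot = json.loads(data)
        self._state = snapshot["state"]
        for name, value in snapshot["domain"].items():
            if name not in self._NO_PERSIST:
                setattr(self, name, value)


foo = Foo()
foo.bump(); foo.bump()
foo.socket = object()  # not serializable, and never serialized
foo2 = Foo()
foo2.unpickle(foo.pickle())
assert foo2.get_n() == 2 and foo2.socket is None
```

Because unpickle mutates an existing instance, nothing in the body needs to name the class itself, which is the property the GDScript fix relies on.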
$S0 enter on restore — known semantics
Calling Foo() followed by foo.unpickle(data) fires $S0’s >()
enter handler once before unpickle overwrites the compartment with
the persisted state. For typical persist systems (whose $S0 enter
just initializes domain defaults), this is invisible — the defaults
get overwritten immediately.
For systems with externally observable side effects in $S0 enter
(e.g., a print(...), network handshake, file open), those effects
fire once on every restore. Workarounds:
- Make $S0 enter idempotent / pure (best practice anyway).
- Gate side effects on a domain flag that the load body can clear.
- Move the side effect to a non-$S0 state and transition there manually after load.
This is documented as a contract limitation rather than worked around
in codegen. An earlier draft proposed special “no-init constructor”
syntax (@@Foo.unpickle(data)) to bypass $S0 enter on restore, but
the per-backend lowering (constructor overload + tag-dispatched ctor +
factory function) was complexity we deemed not worth paying for the
narrow case of “user has observable side effects in $S0 enter.”
The two-step pattern is uniformly simple and covers the common case.
Pre / post hooks
Not provided. The user wraps inst.<save_op>() with whatever they
want in caller code, and similarly arranges any post-load wiring after
the load call returns. If they need the post-load wiring to
be guaranteed (e.g., reconnect a socket every time), they declare a
regular operations: method and call it explicitly:
foo2 = Foo()
foo2.unpickle(data)
foo2.reconnect() # regular operation, user's responsibility
Earlier drafts added @@[before_save] / @@[after_save] /
@@[before_load] / @@[after_load] attributes to provide bracketing
hooks, but every real use case for those collapses into “user code in
the calling function” except post-restore wiring — and even that is
reasonably the user’s responsibility, since Frame can’t know which
external resources their app uses.
If real demand surfaces for post-restore wiring as a Frame primitive
(rather than an app concern), a future @@[on_load] attribute on a
regular operation can be added without breaking the four-attribute
contract.
Validator rules
- @@[save] and @@[load] valid only on operations of @@[persist] systems. Otherwise E801 (attribute at wrong position).
- @@[no_persist] valid only on domain fields of @@[persist] systems. Otherwise E801.
- At most one @@[save] and one @@[load] per system. Otherwise E810 (proposed: duplicate persist operation).
- Save op signature: zero parameters, return type matches the format type from @@[persist(<Format>)]. Otherwise E811 (proposed: persist save signature mismatch).
- Load op signature: one parameter typed as the format, no return type. Otherwise E812 (proposed: persist load signature mismatch).
- Operations with @@[save] / @@[load] must have no body in source; Frame generates it. A user-provided body is E813 (proposed: persist op body is framework-generated).
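As a rough illustration of how the save/load rules compose, the checks can be modeled in a few lines of Python. The Op record and validate_persist_ops function are hypothetical and much simpler than framec's real AST and validator; the sketch covers only E801 and E810 through E813 for operations, not the @@[no_persist] field rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Op:
    """Hypothetical, simplified view of a parsed operation."""
    name: str
    attrs: Set[str]
    params: List[str]              # parameter type names
    returns: Optional[str] = None
    has_body: bool = False


def validate_persist_ops(ops, persist_format):
    """Return sorted error codes; persist_format is None without @@[persist]."""
    errors = set()
    for attr, bad_sig in (("save", "E811"), ("load", "E812")):
        marked = [op for op in ops if attr in op.attrs]
        if marked and persist_format is None:
            errors.add("E801")     # attribute on a non-persist system
            continue
        if len(marked) > 1:
            errors.add("E810")     # duplicate persist operation
        for op in marked:
            if attr == "save":
                ok = op.params == [] and op.returns == persist_format
            else:
                ok = op.params == [persist_format] and op.returns is None
            if not ok:
                errors.add(bad_sig)
            if op.has_body:
                errors.add("E813")  # body must be framework-generated
    return sorted(errors)


good = [Op("pickle", {"save"}, [], "JSON"), Op("unpickle", {"load"}, ["JSON"])]
assert validate_persist_ops(good, "JSON") == []
```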
Migration
Pre-1.0 hard cut, RFC-0013 wave 1+2 playbook. Frame source on the
existing contract (no @@[save]/@@[load] ops, magic
save_state/restore_state interface) becomes invalid; framec emits
E814 (proposed: bare-form persist contract is no longer accepted —
declare @@[save] and @@[load] operations).
Test corpus migration: every @@[persist] system declares the two
operations; drivers update from Foo.restore_state(data) (static) to
foo = Foo(); foo.unpickle(data) (two-step). Mechanical sed; the
operation names are conventionally save_state / restore_state
unless users want different names.
Phasing
- Phase A ✅ (2026-05-02): Parser + validator for the four attributes. GDScript codegen end-to-end (proves the design). Test fixture + matrix verification GDScript-only. Closed the GDScript bug; unblocked frame-arcade scoreboard.
- Phase B1 ✅ (2026-05-02): All 17 backend codegens accept the new contract additively. Legacy contract preserved everywhere for backwards compatibility (matrix proof: 4,275 / 4,275 passing). Per-backend changes:
  - Family 1 (dynamic): Python, JS, TS, Ruby, Lua, PHP, Dart, GDScript. target = self/this/$this; load body drops construction-bypass, mutates self in place.
  - Family 2 (typed JVM/Swift): Java, Kotlin, C#, Swift. Legacy RuntimeHelpers.GetUninitializedObject / ReflectionClass::newInstanceWithoutConstructor stays under legacy; new contract drops the bypass entirely.
  - Family 3 (systems): Rust, C++. Rust uses struct-literal bypass under legacy, direct self.X = ... under new; C++ similar with (*this).X = ....
  - Family 4 (factory shape): Go, C, Erlang. Go: receiver method (new) vs package-level Restore<Sys> (legacy); C: <Sys>_load_op(<Sys>* self, json) (new) vs <Sys>* <Sys>_restore_state(json) (legacy); Erlang: design exclusion, since the gen_statem Pid model means load is always a factory, just renamed under user’s @@[save] / @@[load].
- Phase B2 ✅ (2026-05-02): Canonical end-to-end test 93_persist_save_load_contract ported to all 17 backends. Frame source declares operations: @@[save] / @@[load]; driver creates instance, mutates, saves, creates fresh instance, loads snapshot, asserts state. Surfaced + fixed 3 codegen bugs:
  - Rust + Erlang duplicate operations (system_codegen.rs skip not propagated to rust_system.rs / erlang_system.rs)
  - Rust load-param type ignored user declaration (fixed via new SystemAst::load_op_param_type() helper)
  - Go data collision with user’s load param
- Phase B3 ✅ (2026-05-03): Hard-cut E814 shipped. Bare @@[persist] now errors out; every persist system must declare @@[save] and @@[load] ops. The full legacy fixture migration (~425 fixtures across 17 backends + linux-demos + erlang multi) landed in test_env commits 54f11d7d, bcaa5e0d, 4e487f40, d627359d, b3dd4cdc. Matrix 4,275 / 4,275 across 17 backends.
- Phase B4 ✅ (2026-05-02, this section): Documentation: RFC-0012 status, frame_runtime.md, per-language guides.
- Phase C (deferred to roadmap): schema versioning + @@[migrate] operation chain. See “Future roadmap” below.
- Phase D ✅ (2026-05-03, framepiler a61390e): @@[on_load] post-load hook. Fifth attribute. Marks an operation that fires automatically after restore_state populates self, so user code can re-establish derived state, fire watchers, validate invariants. AST helper SystemAst::on_load_op_name(); validator recognizes the attribute (E810 enforces at-most-one); codegen appends target.<name>() (per-language form) to each backend’s restore body via interface_gen::on_load_call helper. Test fixture: 95_persist_on_load_hook.fpy. Wired in 14 backends (Erlang’s gen_statem dispatch deferred; separate codegen).
Phase A alone closed the GDScript bug. Phase B1+B2 made the contract usable on every backend. Phase B3 hard-cut shipped 2026-05-03 once the legacy fixture migration completed.
Retired by RFC-0015 (framepiler 66c9573, 2026-05-04). See rfc-0015.md for the lifecycle attribute design that supersedes this.
Future roadmap (post-Phase B)
The four-attribute contract above covers Frame’s current target use cases (game save/restore, app state, web session). For Frame to expand into adjacent use cases (long-lived state, workflow orchestration), additional surfaces are needed. Recorded here as deferred work, not in scope for the GDScript-bug-driven amendment.
Survey: how comparable systems handle persistence
Honest comparison of Frame’s persist scope vs. nearby systems we’d plausibly be measured against:
| System | State model | Persistence | Schema evolution | Concurrency |
|---|---|---|---|---|
| Airflow | DAG of tasks; queued/running/success/failed states | External metadata DB; per-row, per-task-instance | Versioned DAG code; older runs locked to historical DAG | DB row locks |
| AWS Step Functions | JSON state machine | Internal AWS-managed; every transition durable | Versioned state machine ARNs | Per-execution; AWS-handled |
| K8s operators | Reconciliation loop on CRDs | etcd via API; spec/status separation | Versioned APIs (v1alpha/beta/v1); conversion webhooks | Optimistic via resourceVersion |
| Terraform | Declarative resource graph | tfstate JSON; remote backend optional | terraform state mv; provider versioning | State locks (S3+DynamoDB) |
| Erlang OTP | Actor + supervisor tree | mnesia / DETS / external | Hot-code-loading + state migration callbacks | Per-process mailbox |
| Akka | Actor + persistence | Event sourcing log + snapshots | Schema evolution via event adapters | Per-actor mailbox |
| Hibernate / JPA | POJO entities | DB rows; lazy/eager loading | @Version + Liquibase/Flyway migrations | DB transaction isolation |
Use-case alignment for the four-attribute contract:
| Use case | Covered? |
|---|---|
| Game save/restore (frame-arcade) | ✅ |
| Mobile/desktop app state restoration | ✅ |
| Web session state (server-side) | ✅ |
| Embedded device state across firmware updates | ⚠️ — needs schema versioning |
| Workflow orchestration (Airflow-style) | ❌ — needs WAL + observable transitions |
| Distributed state machines | ❌ — concurrency / leader election out of scope |
| Long-lived business processes (Step Functions Wait, weeks/months) | ❌ — needs durable wait + versioning |
| Infrastructure state (Terraform-style) | ❌ — needs locking + versioning |
| Event-sourcing actor (Akka-style) | ⚠️ — Frame snapshots, not event-sourced |
The first three are realistic Frame use cases today. The two ⚠️ rows are partial: embedded device state becomes viable with schema versioning (Phase C below), and event-sourcing actors get state snapshots only, not an event log. The four ❌ rows are out of scope; they’d require Frame to grow new surfaces (write-ahead logging, distributed locking, durable timers) that shouldn’t be baked into core persist.
Roadmap item 1: schema versioning + @@[migrate] (Phase C)
Long-lived state outlives code revisions. Adding a domain field, renaming a state, restructuring HSM hierarchy — every such change breaks old snapshots. Comparable systems all version their state representations.
Proposed extension:
@@[persist(JSON, version=2)]
@@system Foo {
operations:
@@[save] save(): JSON
@@[load] load(data: JSON)
@@[migrate(from=1, to=2)]
v1_to_v2(old: JSON): JSON # body: user transforms old shape to new
}
On load(data), framework reads version field from the payload. If
mismatched, walks the chain of @@[migrate] ops to forward-migrate
from data["version"] to the current. Fail loudly if no chain exists
(E815 proposed). Each migration op transforms the payload one
version forward; the framework chains them.
Validator rules (additional):
- @@[migrate] valid only on @@[persist] system operations.
- from and to must be integer literals; to == from + 1 (one-step migrations).
- Migration chain from any version value present in test snapshots to the current version must be complete (validator can detect gaps given a manifest, or report at load time).
Implementation note: the version field is embedded in the save payload
by the framework, not user-provided. Format-specific (JSON: top-level
"version" field; Protobuf: a reserved tag).
When to ship: when a real customer hits a breaking schema change. Not needed for game/session use cases that are inherently single-version.
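The chain walk proposed for load can be sketched in Python as follows. All names here (MIGRATIONS, migrate_payload, MigrationChainError, the example field renames) are hypothetical; the proposal itself only specifies one-step migrations, a framework-embedded version field, and a loud failure when the chain has a gap. Whether that failure surfaces as the proposed E815 or as a runtime exception is not settled, so the exception below is an assumption.

```python
CURRENT_VERSION = 3


class MigrationChainError(Exception):
    """Hypothetical runtime failure for a missing migration step."""


def v1_to_v2(old):
    new = dict(old)
    new["count"] = new.pop("n")   # example change: field renamed n -> count
    return new


def v2_to_v3(old):
    new = dict(old)
    new.setdefault("label", "")   # example change: field added with a default
    return new


# Keyed by `from`; each op moves the payload exactly one version forward.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}


def migrate_payload(payload, current=CURRENT_VERSION):
    version = payload["version"]
    if version > current:
        raise MigrationChainError(f"snapshot v{version} is newer than v{current}")
    while version < current:
        step = MIGRATIONS.get(version)
        if step is None:
            raise MigrationChainError(f"no migration from v{version}")
        payload = step(payload)
        version += 1
        payload["version"] = version  # the framework restamps, not the user op
    return payload


assert migrate_payload({"version": 1, "n": 5}) == {
    "version": 3, "count": 5, "label": ""}
```

Restricting each op to a single version step keeps the table a simple map from version to function, at the cost of running several ops for very old snapshots.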
Roadmap item 2: framework boundaries documented in frame_runtime.md
Set explicit expectations:
Frame’s persistence is point-in-time snapshot. It does not provide:
- Write-ahead logging: auto-save-on-transition is not built in. Every save is user-triggered.
- Distributed locking / leader election: single-instance only. Coordination across processes is the host’s responsibility.
- Long-lived dehydrated waits: Frame is synchronous. Wait-then-resume across hours/days needs an external scheduler that holds snapshots and reconstitutes the system on the trigger.
- Event sourcing: only state snapshots, not transition history. The save reflects “current state,” not “how we got here.”
If your use case needs these, layer them above Frame:
- Persist the snapshot to a durable store (file, database, S3).
- Coordinate snapshot timing in your host app.
- For distributed state, use a coordinator (etcd, ZooKeeper, Raft).
- For event sourcing, log every event externally; replay through Frame’s normal dispatch on restore.
Frame’s @@[persist] is the right tool for: game saves, mobile app state, server-side sessions, embedded device state, single-instance workflows. It is the wrong tool for: workflow orchestration platforms, infrastructure-as-code state, distributed consensus, long-running multi-day business processes.
Land this section as part of Phase B’s frame_runtime.md updates.
Zero implementation cost; high value in setting user expectations
correctly.
Roadmap item 3: @@[on_load] post-load wiring hook (Phase D) — SHIPPED 2026-05-03
@@[on_load] is an operation attribute that fires automatically
after restore_state populates self, before any user-triggered
event can dispatch. The user writes the body; framec emits a
call to it at the end of the framework-managed restore body.
@@[persist]
@@system Counter {
operations:
@@[save]
save_state(): bytes {}
@@[load]
restore_state(data: bytes) {}
@@[on_load]
rebuild_derived() {
# called automatically after restore_state body completes,
# before any user-triggered event can dispatch
self.doubled = self.n * 2
self.was_restored = true
}
...
}
At-most-one per system (E810); requires @@[persist] (E801).
Wired in 14 backends (Erlang’s gen_statem dispatch deferred).
Test fixture: 95_persist_on_load_hook.fpy. framepiler a61390e.
Roadmap item 4: pluggable serializer registry
Today, the format token (JSON, Protobuf, etc.) is a string
matched against per-backend hardcoded ser/deser implementations.
Future: allow users to register custom serializers per format token,
analogous to serde’s Serialize / Deserialize derive macros or
Akka’s serializer config.
Defer until a customer use case arises (e.g., encrypted-at-rest snapshots, custom binary format for embedded targets).
Roadmap item 5: incremental / differential save
For large systems where full snapshot is expensive, support a “what changed since last save” mode. Akin to Terraform’s plan-then-apply or Airflow’s per-row updates. Useful for:
- Systems with large domain state (>1MB serialized).
- High-frequency saves (every event).
Defer indefinitely; current Frame use cases are well within full-snapshot perf budgets.
Roadmap item 6: durable write-ahead-logging mode
For workflow-orchestration use cases where every state transition must be durable before the action is taken (Step Functions / Airflow contract). Would require:
- Auto-save-on-transition wired into Frame’s dispatch loop.
- A user-provided durable-write callback (or built-in support for common stores: SQLite, Postgres, file).
- Recovery semantics: restart resumes from last durable transition.
This is a significant scope expansion — effectively Frame would become a workflow engine, competing with the systems in the survey table above. Defer until product direction explicitly aims here.
Open questions (current four-attribute design)
- Default operation names when user wants the simplest possible declaration? Could allow @@[save] / @@[load] with no user-named operation and Frame auto-creates save_state / restore_state operations. Reduces boilerplate to one attribute on the system. Tradeoff: implicit operation generation conflicts with Frame’s “everything in interface: / operations: is user-declared” principle.
- Format negotiation when the user-named load op is invoked with data that was saved under a different format? Currently the format is system-static, so this can’t happen unless the same system declaration changes formats across binary versions. Per roadmap item 1, this is the schema-versioning problem; deferred.
- @@[no_persist] interaction with state vars / enter-args / state-args? These are compartment fields, not domain fields. The attribute is currently scoped to domain fields only; if users want transient state vars the recommended pattern is to lift them to domain with @@[no_persist]. Could be revisited.