Partition + heal — VOPR-style liveness after a fault lifts¶
What this teaches: the difference between random message loss (self-heals via retries) and a sustained partition or crash (only recoverable after the fault ends). FaultSchedule expresses the latter; a goal checks whether the system converges once the fault lifts.
The question: does it make progress?¶
The FoundationDB/TigerBeetle school of simulation testing asks not just "is it safe?" but "does it converge once things settle?" A system can be perfectly safe — never serving wrong data — but permanently wedged, serving no data at all.
Random per-message loss (NetworkModel(loss=0.3)) cannot express this question: an unlucky sequence of drops eventually heals by itself as retries pile up. A partition or crash is different — the system can only recover after the fault ends.
The three scenarios¶
1. Partition + fire-and-forget → WEDGED¶
A FaultSchedule(partitions=(Partition(frozenset({("s", "r")}), start=0, end=10),)) blocks all s → r messages for ticks 0–10. The fire-and-forget source sends once at startup. The single push is eaten by the partition. After tick 10 the link heals, but no message was ever queued for replay — the agent never converges.
2. Partition + per-tick re-push → CONVERGES¶
With repush=True, the source re-sends every 2 ticks until acked. After the partition heals (tick 10), the next re-push lands. The agent applies it and acks. Goal reached.
3. Cold restart → LOST ON RESTART¶
A FaultSchedule(crashes=(Crash(node="r", start=3, end=8, lose_state=True),)) crashes and cold-restarts the replica between ticks 3 and 8. If the source already acked and stopped retrying, the applied value is gone — the replica starts from 0 and never hears the value again.
Source¶
"""VOPR-style liveness: does the system converge once a fault HEALS?
The FoundationDB/TigerBeetle simulation-testing question isn't just "is it safe?" but "does it make
progress?" -- after the network heals, does the replica actually catch up? musil's `simulate` injects
a *sustained, heal-able* fault from a `FaultSchedule` and then asserts convergence once it lifts. A
plain random-loss `NetworkModel` can't express this: independent per-message loss self-heals via
retries. A partition or a crash is different -- the system can only recover after it ends.
A source pushes a value to a replica; the goal is `applied == desired` once things settle.
* PARTITION the s->r link for a window, then heal:
- fire-and-forget (one push, no retry) -> the partition eats the only push -> WEDGED forever.
- per-tick re-push -> the first push after the heal lands -> CONVERGES.
* CRASH the replica with a cold restart (it loses state):
- with no re-push, an already-acked value is gone after restart -> LOST ON RESTART.
Each failure is a `kind="liveness"` report carrying a reproducing seed.
Run me: python examples/partition_heal.py
"""
from __future__ import annotations
from collections.abc import Callable, Mapping
from dataclasses import dataclass
from musil.sim import BaseNode, Context, Crash, FaultSchedule, Node, Partition, SimReport, simulate
class Source(BaseNode):
"""Pushes the desired value to the replica; with ``repush`` it re-sends on a timer until acked."""
def __init__(self, *, repush: bool) -> None:
self.desired = 1
self._repush = repush
self._acked = False
def on_start(self, ctx: Context) -> None:
ctx.send("r", self.desired)
if self._repush:
ctx.set_timer("push", delay=2)
def on_message(self, ctx: Context, src: str, payload: object) -> None:
if payload == "ack":
self._acked = True
def on_timer(self, ctx: Context, name: str) -> None:
if self._repush and not self._acked:
ctx.send("r", self.desired)
ctx.set_timer("push", delay=2)
class Replica(BaseNode):
"""Applies the largest value it has seen and acks. A cold restart wipes ``applied`` back to 0."""
def __init__(self) -> None:
self.applied = 0
def on_message(self, ctx: Context, src: str, payload: object) -> None:
assert isinstance(payload, int)
self.applied = max(self.applied, payload)
ctx.send(src, "ack")
@dataclass(frozen=True)
class World:
desired: int
applied: int
def _snapshot(nodes: Mapping[str, Node]) -> World:
s, r = nodes["s"], nodes["r"]
assert isinstance(s, Source) and isinstance(r, Replica)
return World(desired=s.desired, applied=r.applied)
def _factory(repush: bool) -> Callable[[], Mapping[str, Node]]:
def make() -> Mapping[str, Node]:
return {"s": Source(repush=repush), "r": Replica()}
return make
def _converges(repush: bool, faults: FaultSchedule) -> SimReport[World]:
return simulate(
_factory(repush=repush),
seeds=range(50),
snapshot=_snapshot,
faults=faults,
goal=lambda w: w.applied == w.desired,
goal_name="applied==desired",
)
def main() -> int:
print("VOPR-style liveness -- converge after the fault heals?\n")
partition = FaultSchedule(partitions=(Partition(frozenset({("s", "r")}), start=0, end=10),))
faf = _converges(repush=False, faults=partition)
assert not faf.ok and faf.failure is not None
print(f" partition + fire-and-forget : WEDGED (kind={faf.failure.kind}, seed={faf.failure.seed})")
rep = _converges(repush=True, faults=partition)
assert rep.ok, str(rep)
print(f" partition + per-tick re-push: CONVERGES ({rep.runs} seeds, all green)")
crash = FaultSchedule(crashes=(Crash(node="r", start=3, end=8, lose_state=True),))
lost = _converges(repush=False, faults=crash)
assert not lost.ok and lost.failure is not None
print(f" crash (cold restart) : LOST ON RESTART (kind={lost.failure.kind}, seed={lost.failure.seed})")
print("\nOnly the retrying system survives a heal; a cold restart loses an un-retried write.")
return 0
if __name__ == "__main__":
raise SystemExit(main())
Run it¶
See also¶
- Reconciliation under loss — the same convergence question with random loss and a model-backed refinement check.
- K8s scheduler — open-system model checking where the platform can evict nodes, not just drop messages.