Multi-environment composition — K8s + AWS + etcd simultaneously¶
What this teaches: check_open accepts any number of EnvironmentSpec objects and merges their behaviors and guarantees into a single BFS. The checker explores every interleaving of platform faults across all three environments at once.
The scenario¶
A service writes config to etcd and data to S3, while running inside a Kubernetes cluster. Three external systems can fault independently:
| Environment | Fault modeled |
|---|---|
| K8s | Pod eviction (service goes to "evicted") |
| AWS S3 | Request throttling (in-flight writes cancelled, service goes to "throttled") |
| etcd | Leader change (in-flight reads cancelled, service goes to "no-leader") |
Each fault is expressed as an EnvironmentSpec using the pre-built factories from musil.environments:
k8s_env = pod_eviction(can_evict=_can_evict, on_evict=_on_evict)
aws_env = aws_api_faults(can_throttle=_can_throttle, on_throttle=_on_throttle)
etcd_env = cluster_faults(can_leader_change=_can_leader_change, on_leader_change=_on_leader_change)
result = check_open(service, k8s_env, aws_env, etcd_env)
What the BFS explores¶
The combined BFS explores every ordering of:
- the service's own actions (start-s3-write, complete-s3-write, start-etcd-read, complete-etcd-read, recover)
- K8s evicting the pod
- AWS throttling an in-flight write
- etcd triggering a leader election during an in-flight read
At no point does the checker assume faults happen in any particular order or at any particular time. If a fault + partial-write + recovery sequence can violate an invariant, the BFS finds it.
The invariants¶
"not-permanently-failed": s.status != "running" and s.retries >= 5
"no-pending-after-eviction": s.status == "evicted" and (s.s3_writes_pending > 0 or ...)
With retry-with-backoff, both pass. The service recovers from each fault type within the retry budget.
Residual proof obligations¶
The pre-built environments ship with named Assumption objects citing the actual source (Kubernetes docs, AWS SLA, the Raft paper). Those that are "unverified" surface in result.unverified_assumptions — a machine-readable list of what the BFS cannot prove on its own.
Source¶
"""Multi-environment open-system verification example.
This example verifies a simple distributed service against three simultaneous
adversarial environments:
1. K8s pod eviction (scheduler can evict the pod at any time)
2. AWS throttling (the downstream S3 write may be throttled)
3. etcd leader change (the config store may be mid-election)
The service model implements retry-with-backoff for each fault type.
Without retries the model fails the invariant (stale config or missed writes).
With retries it passes — and the output lists the residual proof obligations.
Run::
uv run python examples/open_system_composed.py
"""
from __future__ import annotations
from dataclasses import dataclass, replace
from musil import Action, Model, check_open
from musil.environments.aws import aws_api_faults
from musil.environments.etcd import cluster_faults
from musil.environments.k8s import pod_eviction
# ---------------------------------------------------------------------------
# State
# ---------------------------------------------------------------------------
@dataclass(frozen=True)
class ServiceState:
"""State of a service that writes config to etcd and data to S3."""
status: str = "running" # running | evicted | throttled | no-leader | recovering
s3_writes_pending: int = 0
etcd_reads_pending: int = 0
retries: int = 0
# ---------------------------------------------------------------------------
# System actions
# ---------------------------------------------------------------------------
def start_s3_write(s: ServiceState) -> ServiceState:
return replace(s, s3_writes_pending=s.s3_writes_pending + 1)
def complete_s3_write(s: ServiceState) -> ServiceState:
return replace(s, s3_writes_pending=s.s3_writes_pending - 1)
def start_etcd_read(s: ServiceState) -> ServiceState:
return replace(s, etcd_reads_pending=s.etcd_reads_pending + 1)
def complete_etcd_read(s: ServiceState) -> ServiceState:
return replace(s, etcd_reads_pending=s.etcd_reads_pending - 1)
def recover(s: ServiceState) -> ServiceState:
return replace(s, status="running", retries=s.retries + 1)
# ---------------------------------------------------------------------------
# System model
# ---------------------------------------------------------------------------
service = Model(
init=ServiceState(),
actions=[
Action("start-s3-write",
lambda s: s.status == "running" and s.s3_writes_pending < 2,
start_s3_write),
Action("complete-s3-write",
lambda s: s.s3_writes_pending > 0 and s.status == "running",
complete_s3_write),
Action("start-etcd-read",
lambda s: s.status == "running" and s.etcd_reads_pending < 2,
start_etcd_read),
Action("complete-etcd-read",
lambda s: s.etcd_reads_pending > 0 and s.status == "running",
complete_etcd_read),
Action("recover",
lambda s: s.status != "running" and s.retries < 5,
recover),
],
invariants={
"not-permanently-failed": lambda s: not (s.status != "running" and s.retries >= 5),
"no-pending-after-eviction": lambda s: not (
s.status == "evicted" and (s.s3_writes_pending > 0 or s.etcd_reads_pending > 0)
),
},
)
# ---------------------------------------------------------------------------
# Action predicates and transitions (named functions required for pyright strict)
# ---------------------------------------------------------------------------
def _can_evict(s: ServiceState) -> bool:
return s.status == "running"
def _on_evict(s: ServiceState) -> ServiceState:
return replace(s, status="evicted", s3_writes_pending=0, etcd_reads_pending=0)
def _can_throttle(s: ServiceState) -> bool:
return s.s3_writes_pending > 0 and s.status == "running"
def _on_throttle(s: ServiceState) -> ServiceState:
return replace(s, status="throttled", s3_writes_pending=0)
def _can_leader_change(s: ServiceState) -> bool:
return s.etcd_reads_pending > 0 and s.status == "running"
def _on_leader_change(s: ServiceState) -> ServiceState:
return replace(s, status="no-leader", etcd_reads_pending=0)
# ---------------------------------------------------------------------------
# Environment specs
# ---------------------------------------------------------------------------
k8s_env = pod_eviction(can_evict=_can_evict, on_evict=_on_evict)
aws_env = aws_api_faults(can_throttle=_can_throttle, on_throttle=_on_throttle)
etcd_env = cluster_faults(can_leader_change=_can_leader_change, on_leader_change=_on_leader_change)
# ---------------------------------------------------------------------------
# Verify
# ---------------------------------------------------------------------------
def main() -> None:
result = check_open(service, k8s_env, aws_env, etcd_env)
if result.ok:
print("PASS Service invariants hold under K8s + AWS + etcd faults.")
else:
print("FAIL Invariant violated!")
r = result.result
if r.trace:
for step in r.trace:
print(f" {step}")
if result.unverified_assumptions:
print("\nResidual proof obligations (must be verified externally):")
for a in result.unverified_assumptions:
tag = f"[{a.status}]"
print(f" {tag:12} {a.name}")
print(f" {a.description}")
if a.source:
print(f" Source: {a.source}")
else:
print("\nNo residual proof obligations.")
if __name__ == "__main__":
main()
Run it¶
See also¶
- K8s scheduler — a standalone K8s example that shows the fragile vs. resilient contrast in detail.
- Pre-built environments — full reference for
musil.environments.k8s,.aws,.etcd,.linux,.sel4. - Open systems (conceptual guide) — the theory behind composing multiple
EnvironmentSpecobjects.