Skip to content

K8s scheduler — open-system verification against pod eviction

What this teaches: a closed model always passes because it controls everything. check_open adds an adversarial EnvironmentSpec whose behaviors the BFS explores exhaustively — exposing failures that only appear when the platform acts, not just the service.

Closed vs. open systems

A closed model is self-contained — every state transition is under the model's control. In isolation, the service is correct by construction: it starts, runs, degrades, and recovers.

An open model admits external actors. A Kubernetes cluster can evict a running pod at any moment — after a node goes unhealthy, during a rolling update, or simply because resource pressure triggers the eviction controller. The service does not control when this happens.

check_open(model, k8s_env) merges the environment's behaviors (the eviction action) into the BFS. Every possible eviction timing is explored — including mid-startup, mid-recovery, and the very first tick.

Two versions of the environment

Fragile — eviction silently kills the service:

def evict(s): return replace(s, status="unresponsive")

The availability invariant (status != "unresponsive") is violated the moment K8s fires. musil finds it immediately.

Resilient — eviction triggers a controlled restart:

def evict(s): return replace(s, status="starting", restarts=s.restarts + 1)

The service re-enters the startup lifecycle. musil verifies the invariant holds across all reachable states, including repeated evictions up to max_restarts.

Residual proof obligations

Even the resilient service has assumptions it can't verify internally:

[unverified] node-capacity
             At least one schedulable node is always available after eviction
[unverified] eviction-rate-bounded
             Eviction rate is bounded so max_restarts is not exceeded in practice

These surface as unverified_assumptions in the OpenResult. They are now explicit proof obligations — not silent hypotheses buried in documentation.

Source

"""Open-system model checking: a distributed service against K8s-like platform faults.

A *closed* system has no external components — the model controls everything. This example shows
the *open* system problem: a distributed service running inside a Kubernetes cluster where the
platform can evict pods or restrict network access at any moment.

``check`` on the closed model always passes (the service is correct in isolation). ``check_open``
adds a K8s ``EnvironmentSpec`` whose behaviors model platform faults. Without resilience, a pod
eviction moves the service to an "unresponsive" state — a silent violation of the availability
invariant. With resilience (the service restarts and re-registers), the same faults are survivable.

  python examples/k8s_scheduler.py
"""

from __future__ import annotations

from dataclasses import dataclass, replace

from musil import Action, Assumption, EnvironmentSpec, Model, check, check_open


@dataclass(frozen=True)
class ServiceState:
    """State of a distributed service."""

    status: str          # "starting", "healthy", "degraded", "unresponsive"
    restarts: int = 0
    max_restarts: int = 3


def _service_model() -> Model[ServiceState]:
    """Service lifecycle: start → healthy → (degraded → healthy) loop.
    The system model is the same in both cases; resilience lives in the EnvironmentSpec.
    """
    def do_start(s: ServiceState) -> ServiceState:
        return replace(s, status="healthy")

    def do_degrade(s: ServiceState) -> ServiceState:
        return replace(s, status="degraded")

    def do_recover(s: ServiceState) -> ServiceState:
        return replace(s, status="healthy")

    def can_start(s: ServiceState) -> bool:
        return s.status == "starting"

    def can_degrade(s: ServiceState) -> bool:
        return s.status == "healthy"

    def can_recover(s: ServiceState) -> bool:
        return s.status == "degraded"

    def valid_status(s: ServiceState) -> bool:
        return s.status in {"starting", "healthy", "degraded", "unresponsive"}

    def availability_invariant(s: ServiceState) -> bool:
        # The core guarantee: the service must never reach "unresponsive" (silent failure).
        return s.status != "unresponsive"

    def restarts_bounded(s: ServiceState) -> bool:
        return s.restarts <= s.max_restarts

    return Model(
        init=ServiceState("starting"),
        actions=[
            Action("start",   can_start,   do_start),
            Action("degrade", can_degrade, do_degrade),
            Action("recover", can_recover, do_recover),
        ],
        invariants={
            "valid-status":          valid_status,
            "always-responsive":     availability_invariant,  # the key safety property
            "restarts-bounded":      restarts_bounded,
        },
    )


def _k8s_env(resilient: bool) -> EnvironmentSpec[ServiceState]:
    """K8s platform: can evict a running pod at any time.

    Without resilience: eviction silently moves the service to "unresponsive" — violates
    the availability invariant.

    With resilience: eviction triggers a controlled restart (increments restart counter)
    and re-enters "starting" — the service recovers. Fails only if restarts are exhausted.
    """
    if resilient:
        def evict(s: ServiceState) -> ServiceState:
            # Resilient: pod eviction triggers a restart (re-enters starting lifecycle).
            return replace(s, status="starting", restarts=s.restarts + 1)
    else:
        def evict(s: ServiceState) -> ServiceState:
            # Fragile: pod eviction silently kills the service → unresponsive (violates availability).
            return replace(s, status="unresponsive")

    def can_evict(s: ServiceState) -> bool:
        # K8s can only evict while the service has restart budget remaining.
        # Once max_restarts is exhausted, the cluster stops scheduling the pod (operator alert).
        return s.status in {"healthy", "degraded"} and s.restarts < s.max_restarts

    return EnvironmentSpec[ServiceState](
        name="k8s",
        behaviors=[
            Action("k8s:evict-pod", can_evict, evict),
        ],
        guarantees={
            # K8s doesn't corrupt the restart counter (it only increments on eviction)
            "restarts-non-negative": lambda s: s.restarts >= 0,
        },
        assumptions={
            "node-capacity": Assumption(
                name="node-capacity",
                description="At least one schedulable node is always available after eviction",
                status="unverified",
                source="Kubernetes docs: Pod lifecycle",
            ),
            "eviction-rate-bounded": Assumption(
                name="eviction-rate-bounded",
                description="Eviction rate is bounded so max_restarts is not exceeded in practice",
                status="unverified",
                source="cluster SLA",
            ),
        },
    )


def main() -> int:
    print("Open-system model checking: distributed service vs K8s pod eviction\n")

    model = _service_model()

    # 1. Closed model: no environment. Always passes.
    closed = check(model)
    print(f"  Closed model (no faults):               {closed}")
    assert closed.ok

    # 2. Open model, fragile: eviction silently kills the service → availability invariant violated.
    fragile = check_open(model, _k8s_env(resilient=False))
    print(f"  Open model, fragile (vs k8s eviction):  {fragile.result}")
    assert not fragile.ok
    assert fragile.result.invariant == "always-responsive"

    # 3. Open model, resilient: eviction triggers a restart. Service recovers.
    resilient = check_open(model, _k8s_env(resilient=True))
    print(f"  Open model, resilient (vs k8s eviction): {resilient.result}")
    assert resilient.ok, str(resilient)

    # 4. Show residual proof obligations.
    print(f"\n  Residual proof obligations ({len(resilient.unverified_assumptions)} unverified):")
    for a in resilient.unverified_assumptions:
        print(f"    [{a.status}] {a.name}")
        print(f"             {a.description}")

    print(
        "\nKey insight: the same service model fails under K8s eviction without restart logic.\n"
        "check_open explores EVERY possible eviction timing (the BFS is adversarial), so\n"
        "the availability invariant is verified over all reachable states — not just tested."
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Run it

python examples/k8s_scheduler.py

See also