A textbook of concurrency and distributed systems¶
The examples in musil are not a grab-bag. Concurrency and distributed systems have a canon — a
fixed set of classic problems, and almost every one is bounded, state-machine-shaped, and driven by
ordering or failure rather than by input. That is exactly musil's diet. So examples/ is an
executable textbook: each chapter is a classic problem modelled as states + guarded transitions +
invariants, where musil finds the classic bug and verifies the classic fix.
Each chapter does double duty — it is documentation (read it to learn the problem) and a test of
musil itself (each classic exercises a different feature: deadlock, fairness, channels, composition,
crash-recovery). Every completed chapter has its own page with a full walkthrough and the embedded
source; run any example directly with python examples/<name>.py.
When musil is the right tool¶
A model checker earns its keep when correctness depends on what happened and in what order, not on what the input was. The dividing line:
Is the bug a function of the input, or a function of history and ordering? Input / pure logic → a decision procedure (SAT / SMT / BDD). History / ordering / concurrency / failure → a model checker.
A stateless rule table ("does this packet match?") is the first kind — the wrong tool, because the question is quantified over inputs, not over reachable states. A protocol that can race, reorder, drop, partition, or wedge is the second kind — musil's home.
How a chapter is built¶
Every chapter follows the same four beats:
- Design — describe the protocol as a small state machine: a frozen dataclass plus guarded actions.
- Model the error — inject the fault (a race via interleaving, or loss / crash / partition). musil returns the shortest counterexample: the exact sequence that breaks the invariant.
- Prevent — change the design until the invariant holds across every reachable state.
- Guard — keep the model (and, for real code,
simulate/check_refinement) as a permanent regression test.
Finite-state machines¶
The simplest models: a single state field, a handful of actions, one invariant.
| Example | Bug / lesson | musil feature |
|---|---|---|
| Traffic light | Minimal FSM; confirm the three-state cycle is safe | check, to_dot |
| Service lifecycle | Build the model from the code's own transition table — zero drift | status_field_actions, declared_states |
Shared-memory concurrency¶
Two or more actors share mutable state. The bug lives in the interleaving — a specific ordering that no single-threaded test can find.
| Example | Bug / lesson | musil feature |
|---|---|---|
| Lost update | Non-atomic read-modify-write; one increment silently lost | check, interleaving |
| Producer–consumer | Check-then-act race overflows a bounded buffer; atomic fix passes | check, check_liveness |
| Mutual exclusion (Peterson) | Exhaustive proof that Peterson's algorithm is correct | check, invariant |
| Readers–writers | Safety holds for both policies; reader-preference starves the writer | check, check_liveness, fairness |
| Dining philosophers | Circular-wait deadlock; resource-ordering rule eliminates it | check, deadlock |
| Bank transfer | Lock-ordering deadlock; conservation invariant holds through the fix | check, invariant |
Distributed protocols¶
Nodes that communicate over a network. Correctness depends on message ordering, loss, and failure.
| Example | Bug / lesson | musil feature |
|---|---|---|
| Reconciliation under loss | Fire-and-forget fails on a lossy link; re-push converges — verified by simulation + refinement | simulate, check_refinement, check_liveness |
| Partition + heal | A sustained partition wedges fire-and-forget forever; cold restart loses un-retried writes | simulate, FaultSchedule, Partition, Crash |
Open systems¶
Real systems interact with external actors — platforms, clouds, humans — that can behave
adversarially. check_open and EnvironmentSpec extend the BFS to include those behaviors.
| Example | Bug / lesson | musil feature |
|---|---|---|
| K8s scheduler | Pod eviction violates availability unless the service restarts; residual obligations surfaced | check_open, EnvironmentSpec, Assumption |
| Byzantine service | Wrong answers and wire corruption; validation and mutation hooks catch both | AdversarialNode, NetworkModel.mutate |
| Multi-environment composition | K8s + AWS + etcd faults composed into a single BFS | check_open with multiple envs |
| TCAS II | Four scenarios: correct coordination, Überlingen (pilot deviates), post-fix, Tenerife (v7.1 reversal) | check_open, EnvironmentSpec, Assumption |
On the roadmap¶
The chapters still to write — each forces a musil feature the examples above don't yet exercise.
Agreement & coordination¶
| Problem | What it teaches | musil feature |
|---|---|---|
| Two generals | agreement is impossible over a lossy link | liveness fails under loss |
| Two-/three-phase commit | atomic commit; the coordinator-failure blocking problem | channels, crash |
| Leader election | at most one leader, and eventually a leader | safety + liveness |
| Single-decree consensus (Paxos/Raft) | never decide two different values | composition, safety |
| Distributed lock + fencing | an expired lock that still acts | channels, safety |
Replication & consistency¶
| Problem | What it teaches | musil feature |
|---|---|---|
| Quorum (Dynamo) | read-after-write holds iff W + R > N | composition |
| Primary–backup | ack-before-replicate durability under failover | crash, refinement |
| CRDTs (G-counter, OR-set, LWW-register) | convergence / strong eventual consistency | composition |
Messaging patterns¶
| Problem | What it teaches | musil feature |
|---|---|---|
| At-least-once queue + visibility timeout | the double-process race; exactly-once is a fiction | channels, timers, simulation |
| Outbox / dual-write | making a db-write and a publish atomic | crash |
| Idempotent receiver | dedup makes redelivery safe | channels, duplication |
| Failure detector / heartbeat | dead-vs-slow under partition | timers, partition |
State machines & coherence¶
| Problem | What it teaches | musil feature |
|---|---|---|
| TCP handshake / teardown | half-open connections, simultaneous close | safety |
| Token ring | exactly one token; token loss and regeneration | safety, liveness |
| Cache coherence (MESI) | no two caches hold one line in Modified | composition, safety |
The payoff¶
The canon is also the feature-coverage matrix. Each classic forces a different part of musil into existence and hardens it: dining philosophers and bank transfer drive deadlock; readers–writers and leader election drive fairness / liveness; two-phase commit and broadcast drive channels; Paxos and replication drive composition; durability and primary–backup drive crash-recovery; split-brain and the failure detector drive partition. So "implement the textbook" writes the documentation and exhaustively exercises the tool — the same keystrokes.