Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Formal Verification

In distributed systems, testing and code reviews are rarely enough. The state space of a system with NN nodes, MM messages, and asynchronous delays is astronomically large. Formal Verification allows us to mathematically prove that our algorithms are correct before we write a single line of production code. Think of it this way: testing tells you “the specific scenarios I checked work correctly.” Formal verification tells you “every possible scenario works correctly, including the ones I never thought of.” For a 3-node system with message reordering and failures, the number of possible interleavings can be in the billions. No amount of unit testing or integration testing will cover them all. TLA+ and model checkers explore that entire space exhaustively.
Staff+ Context: Principal engineers at companies like AWS, MongoDB, and Microsoft use formal methods (specifically TLA+) to find design flaws in core protocols (like Paxos or Raft) that would have taken years to manifest in production.

Why Formal Methods?

Most distributed systems bugs are Heisenbugs—non-deterministic race conditions that depend on a specific, rare interleaving of messages and failures.
  1. State Space Explosion: Even a simple 3-node system can have millions of possible states when network partitions and message reordering are involved.
  2. Mental Models are Fallible: Humans are notoriously bad at reasoning about concurrency. We tend to think “happily path-first.”
  3. Documentation as Code: A formal specification is an unambiguous description of how the system should behave.

TLA+ and PlusCal: The Language of Systems

TLA+ (Temporal Logic of Actions) is the industry standard for specifying and verifying concurrent and distributed systems. It was created by Leslie Lamport (who also created Paxos and LaTeX).

1. The Core Logic

TLA+ uses First-Order Logic and Set Theory to describe the state space.
  • Sets: {1, 2, 3}
  • Quantifiers: \A x \in S : P(x) (For all x in S, P is true) and \E x \in S : P(x) (There exists an x in S such that P is true).
  • Functions: [n \in Nodes |-> "follower"] (A mapping of nodes to their status).

2. Modeling the Network

In TLA+, we don’t use “network sockets.” Instead, we model the network as a Shared Set of Messages. This is one of the most elegant aspects of TLA+ — the network is just a set, and all the complexity of real networking (reordering, duplication, delay) emerges naturally from nondeterministic set operations.
  • Sending: msgs' = msgs \cup {new_msg} — adds a message to the “wire” (it does not get removed when read, modeling the possibility of duplicate delivery)
  • Receiving: \E m \in msgs : Receive(m) — any message in the set can be received at any time, modeling arbitrary reordering
  • Nondeterminism: Since TLC explores all paths, it will naturally try receiving messages in different orders, modeling reordering and delays. This is where the power lies — the model checker considers orderings you would never think to test.
  • Packet Loss: We can model loss by simply not requiring a message to be processed, or by an explicit “Drop” action.

3. PlusCal Deep Dive: Modeling 2-Phase Commit (2PC)

PlusCal is a “pseudocode” language that translates into TLA+. It allows you to write algorithms that look like code but are checked like math.

2PC Specification

This specification models the classic Two-Phase Commit protocol. Even in this simple example, notice how TLA+ forces you to be precise about every state transition — there is no room for hand-waving like “then the coordinator sends a message.”
--algorithm TwoPhaseCommit
variables 
    (* Initial state: all resource managers are working, TM is idle, no messages sent *)
    rmState = [rm \in RMs |-> "working"],  (* Each RM starts in "working" state *)
    tmState = "init",                       (* Transaction Manager starts idle *)
    msgs = {};                              (* Network is empty -- no messages in flight *)

define
    (* SAFETY INVARIANT: No RM can commit while another aborts. 
       This is the core guarantee of 2PC. If TLC finds a state where
       this is violated, we have a bug in our protocol design. *)
    Consistent == 
        \A r1, r2 \in RMs : 
            ~ (rmState[r1] = "committed" /\ rmState[r2] = "aborted")
end define;

process TransactionManager = "TM"
begin
    TM_Start:
        tmState := "preparing";
        msgs := msgs \cup {[type |-> "prepare"]};
    TM_Wait:
        if \A rm \in RMs : rmState[rm] = "prepared" then
            tmState := "committing";
            msgs := msgs \cup {[type |-> "commit"]};
        else
            tmState := "aborting";
            msgs := msgs \cup {[type |-> "abort"]};
        end if;
end process;

process ResourceManager \in RMs
begin
    RM_Work:
        either 
            rmState[self] := "prepared";
        or 
            rmState[self] := "aborted";
        end either;
end process;

The “Staff” Challenge: Identifying the Flaw

When you run this through TLC, it passes the Consistent invariant. However, if you add a Liveness check (<> (tmState = "committed" \/ tmState = "aborting")), TLC will find a counter-example where the TM crashes during TM_Wait, leaving all RMs blocked forever. This is the classic “blocking” problem of 2PC that leads us to 3PC or Paxos.
Why this matters in practice: This is not a theoretical curiosity. The blocking problem of 2PC is the reason that distributed databases like Spanner use Paxos internally to replicate the transaction coordinator itself. Without that, a single coordinator crash can leave distributed transactions hanging indefinitely, holding locks across multiple shards. If someone in an interview asks “why not just use 2PC?”, this blocking scenario is the answer.

4. The “Design-Verify-Implement” Workflow

For high-stakes distributed systems, the workflow is:
  1. Design: Draft the algorithm on a whiteboard.
  2. Verify: Write a TLA+ spec. Run TLC to find the “edge of the edge” cases.
  3. Implement: Translate the verified spec into Go/Rust/Java.
  4. Test: Use Deterministic Simulation Testing (see Fault Tolerance chapter) to ensure the implementation matches the spec.

Staff Tip: Model Refinement

Advanced users use Refinement Mappings. You write a “High-Level Spec” (e.g., “A set that supports Add/Remove”) and a “Low-Level Spec” (e.g., “A distributed Raft-based replicated log”). You then prove that every step the low-level system takes is a valid step in the high-level spec. This is the gold standard for correctness.

The TLC Model Checker

The TLC Model Checker is the tool that executes your specification.
  1. It starts from the Init state.
  2. It exhaustively explores every possible path (interleaving of actions).
  3. If it finds a state where an Invariant is violated, it provides a Counter-example (Error Trace).

The Power of the Error Trace

An error trace shows you the exact sequence of steps (e.g., Node 1 crashes \rightarrow Message 2 delayed \rightarrow Node 3 times out) that leads to the failure. This is often a scenario the designer never considered.

Formal Methods at Scale

CompanyUsage
AWSUsed TLA+ to verify S3’s replication logic and DynamoDB’s consistency.
MongoDBVerified the Raft-based replication protocol.
MicrosoftUsed for verifying cache coherence protocols and Azure services.

Staff+ Deep Dive: Safety vs. Liveness

In interviews and design docs, you must distinguish between these two properties:

1. Safety (Always P)

A safety property is a guarantee that “nothing bad happens.”
  • Example: “No two nodes ever hold the same lock.”
  • Verification: If a bad state exists, TLC will find a finite trace to it.

2. Liveness (Eventually P)

A liveness property is a guarantee that “something good eventually happens.”
  • Example: “If a client sends a request, it will eventually receive a response.”
  • Verification: This requires proving the system doesn’t get stuck in an infinite loop or deadlock (Deadlock checking is a subset of liveness).

Next Steps

Formal verification is a steep learning curve but a superpower for distributed systems architects. The ROI is highest for correctness-critical protocols (consensus, replication, transaction coordination) where a bug found in production could cause data loss or corruption. For CRUD application logic, the cost of formal verification usually outweighs the benefit — use property-based testing instead.

Learn TLA+

Leslie Lamport’s official TLA+ resources.

Learn TLA (Hillel Wayne)

The best practical guide for engineers.