Formal Verification

In distributed systems, testing and code reviews are rarely enough. The state space of a system with N nodes, M messages, and asynchronous delays is astronomically large. Formal verification allows us to mathematically prove that our algorithms are correct before we write a single line of production code.
Staff+ Context: Principal engineers at companies like AWS, MongoDB, and Microsoft use formal methods (specifically TLA+) to find design flaws in core protocols (like Paxos or Raft) that would have taken years to manifest in production.

Why Formal Methods?

Most distributed systems bugs are Heisenbugs—non-deterministic race conditions that depend on a specific, rare interleaving of messages and failures.
  1. State Space Explosion: Even a simple 3-node system can have millions of possible states when network partitions and message reordering are involved.
  2. Mental Models are Fallible: Humans are notoriously bad at reasoning about concurrency. We tend to reason "happy path first."
  3. Documentation as Code: A formal specification is an unambiguous description of how the system should behave.
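The scale of point 1 is easy to underestimate. As a rough back-of-the-envelope (not a TLA+ computation), the number of distinct interleavings of independent per-node event sequences is a multinomial coefficient, and it grows explosively:

```python
from math import factorial

def interleavings(per_node: list[int]) -> int:
    """Number of distinct global orderings of per-node event sequences.

    Each node executes its own events in order, but events from
    different nodes may interleave arbitrarily, so the count is the
    multinomial coefficient (sum k_i)! / prod(k_i!).
    """
    total = factorial(sum(per_node))
    for k in per_node:
        total //= factorial(k)
    return total

# 3 nodes with just 5 local steps each: 756,756 orderings, and this
# ignores crashes, message loss, and retries entirely.
print(interleavings([5, 5, 5]))  # → 756756
```

This is why "we tested it for a week" says very little about the interleavings you never exercised.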

TLA+ and PlusCal: The Language of Systems

TLA+ (Temporal Logic of Actions) is the industry standard for specifying and verifying concurrent and distributed systems. It was created by Leslie Lamport (who also created Paxos and LaTeX).

1. The Core Logic

TLA+ uses First-Order Logic and Set Theory to describe the state space.
  • Sets: {1, 2, 3}
  • Quantifiers: \A x \in S : P(x) (For all x in S, P is true) and \E x \in S : P(x) (There exists an x in S such that P is true).
  • Functions: [n \in Nodes |-> "follower"] (A mapping of nodes to their status).
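These constructs map closely onto Python builtins, which can be a handy way to sanity-check a formula's meaning before running it through TLC (an informal analogy, not TLA+ semantics):

```python
# TLA+ set-theoretic constructs and their rough Python analogues.

S = {1, 2, 3}                       # Sets: {1, 2, 3}

# \A x \in S : x > 0   (universal quantifier)
assert all(x > 0 for x in S)

# \E x \in S : x > 2   (existential quantifier)
assert any(x > 2 for x in S)

Nodes = {"n1", "n2", "n3"}
# [n \in Nodes |-> "follower"]   (a function mapping each node to a status)
status = {n: "follower" for n in Nodes}
assert status["n1"] == "follower"
```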

2. Modeling the Network

In TLA+, we don’t use “network sockets.” Instead, we model the network as a Shared Set of Messages.
  • Sending: msgs' = msgs \cup {new_msg}
  • Receiving: \E m \in msgs : Receive(m)
  • Nondeterminism: Since the TLC model checker explores all paths, it will naturally try receiving messages in different orders, modeling reordering and delays.
  • Packet Loss: We can model loss by simply not requiring a message to be processed, or by an explicit “Drop” action.
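The same set-based network model can be mimicked in a few lines of Python (an illustrative sketch, not full TLA+ semantics):

```python
# Toy version of the TLA+ network model: the "network" is just a set of
# messages, and nondeterminism = enumerating every possible choice.

def send(msgs: frozenset, m) -> frozenset:
    # msgs' = msgs \cup {new_msg}
    return msgs | {m}

def receive_choices(msgs: frozenset):
    # \E m \in msgs : Receive(m) -- a checker tries *every* m, which
    # models arbitrary reordering and delay for free.
    for m in msgs:
        yield m, msgs  # messages stay in the set: duplication allowed
                       # (use msgs - {m} for exactly-once delivery)

net = send(send(frozenset(), "prepare"), "commit")
# A checker would branch on each possible "next delivered" message:
orders = {m for m, _ in receive_choices(net)}
print(sorted(orders))  # → ['commit', 'prepare']
```

Packet loss falls out naturally: a message sitting in the set forever, never chosen for delivery, is a lost message.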

3. PlusCal Deep Dive: Modeling 2-Phase Commit (2PC)

PlusCal is a “pseudocode” language that translates into TLA+. It allows you to write algorithms that look like code but are checked like math.

2PC Specification

(* Assumes a CONSTANT RMs, the set of resource managers, is declared in the enclosing module. *)
--algorithm TwoPhaseCommit
variables 
    rmState = [rm \in RMs |-> "working"],
    tmState = "init",
    msgs = {};

define
    (* Safety: If any RM commits, no other RM can abort *)
    Consistent == 
        \A r1, r2 \in RMs : 
            ~ (rmState[r1] = "committed" /\ rmState[r2] = "aborted")
end define;

process TransactionManager = "TM"
begin
    TM_Start:
        tmState := "preparing";
        msgs := msgs \cup {[type |-> "prepare"]};
    TM_Wait:
        if \A rm \in RMs : rmState[rm] = "prepared" then
            tmState := "committing";
            msgs := msgs \cup {[type |-> "commit"]};
        else
            tmState := "aborting";
            msgs := msgs \cup {[type |-> "abort"]};
        end if;
end process;

process ResourceManager \in RMs
begin
    RM_Work:
        either 
            rmState[self] := "prepared";
        or 
            rmState[self] := "aborted";
        end either;
    RM_Decide:
        (* Wait for the TM's decision; without this step no RM ever
           commits and the Consistent invariant holds vacuously. *)
        either
            await [type |-> "commit"] \in msgs;
            rmState[self] := "committed";
        or
            await [type |-> "abort"] \in msgs;
            rmState[self] := "aborted";
        end either;
end process;

The “Staff” Challenge: Identifying the Flaw

When you run this through TLC, the Consistent invariant holds. But safety is only half the story. Extend the model with a TM crash action (the TM halts during TM_Wait, never sending commit or abort) and add a liveness property such as <>(\A rm \in RMs : rmState[rm] \in {"committed", "aborted"}), and TLC finds a counter-example in which every RM that voted "prepared" is blocked forever, awaiting a decision that never arrives. This is the classic "blocking" problem of 2PC that leads us to 3PC or Paxos.

4. The “Design-Verify-Implement” Workflow

For high-stakes distributed systems, the workflow is:
  1. Design: Draft the algorithm on a whiteboard.
  2. Verify: Write a TLA+ spec. Run TLC to find the “edge of the edge” cases.
  3. Implement: Translate the verified spec into Go/Rust/Java.
  4. Test: Use Deterministic Simulation Testing (see Fault Tolerance chapter) to ensure the implementation matches the spec.
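Step 4 can be sketched in miniature as trace validation: replay the implementation's state log and check that every consecutive pair of states is an allowed transition of the spec. The one-counter "spec" below is an invented toy, not TLA+ itself:

```python
# Trace validation sketch: does a recorded implementation trace stay
# inside the spec's next-state relation? (Hypothetical counter spec.)

def next_states(state: int) -> set[int]:
    # Spec: the counter may only increment by one, up to a cap of 3.
    return {state + 1} if state < 3 else set()

def conforms(trace: list[int]) -> bool:
    # Every step the implementation took must be a step the spec allows.
    return all(b in next_states(a) for a, b in zip(trace, trace[1:]))

assert conforms([0, 1, 2, 3])   # a legal execution
assert not conforms([0, 2])     # implementation skipped a state
```

In practice the "trace" comes from a deterministic simulator's event log, and `next_states` is mechanically derived from the verified spec.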

Staff Tip: Model Refinement

Advanced users use Refinement Mappings. You write a “High-Level Spec” (e.g., “A set that supports Add/Remove”) and a “Low-Level Spec” (e.g., “A distributed Raft-based replicated log”). You then prove that every step the low-level system takes is a valid step in the high-level spec. This is the gold standard for correctness.
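In miniature, a refinement mapping looks like this (a Python stand-in with an invented log-to-set abstraction function, not TLA+'s refinement machinery):

```python
# High-level spec: a set supporting Add/Remove.
# Low-level "system": an append-only operation log.
# The abstraction function folds the log back into the abstract set;
# refinement holds if every log append corresponds to a legal
# high-level step (or a stutter, where the abstract state is unchanged).

def abstraction(log):
    s = set()
    for op, x in log:
        s.add(x) if op == "add" else s.discard(x)
    return s

def refines(log) -> bool:
    for i in range(len(log)):
        before = abstraction(log[:i])
        after = abstraction(log[:i + 1])
        _, x = log[i]
        legal = (after == before | {x}      # high-level Add(x)
                 or after == before - {x}   # high-level Remove(x)
                 or after == before)        # stutter
        if not legal:
            return False
    return True

assert refines([("add", 1), ("add", 2), ("remove", 1)])
```

The real thing is proved inside TLA+ (the low-level spec implements the high-level spec under the mapping), but the shape of the argument is the same.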

The TLC Model Checker

The TLC Model Checker is the tool that executes your specification.
  1. It starts from the Init state.
  2. It exhaustively explores every possible path (interleaving of actions).
  3. If it finds a state where an Invariant is violated, it provides a Counter-example (Error Trace).

The Power of the Error Trace

An error trace shows you the exact sequence of steps (e.g., Node 1 crashes → Message 2 is delayed → Node 3 times out) that leads to the failure. This is often a scenario the designer never considered.
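The loop TLC runs can be sketched in miniature: a breadth-first search over reachable states that returns the shortest path to any invariant violation. The broken "lock" below is an invented toy spec, and this is a sketch of the idea, not TLC's actual implementation:

```python
from collections import deque

# Toy explicit-state model checker. The "spec" is a broken lock where
# two processes can both enter the critical section.

def next_states(state):
    # state: a tuple of per-process phases, toggling idle <-> critical.
    for i, phase in enumerate(state):
        nxt = {"idle": "critical", "critical": "idle"}[phase]
        yield state[:i] + (nxt,) + state[i + 1:]

def invariant(state):
    return state.count("critical") <= 1  # mutual exclusion

def check(init):
    frontier = deque([(init, [init])])
    seen = {init}
    while frontier:
        state, trace = frontier.popleft()
        if not invariant(state):
            return trace  # shortest error trace to the bad state
        for s in next_states(state):
            if s not in seen:
                seen.add(s)
                frontier.append((s, trace + [s]))
    return None  # invariant holds in every reachable state

trace = check(("idle", "idle"))
print(trace)  # three steps ending in ("critical", "critical")
```

Because the search is breadth-first, the counter-example is minimal, which is exactly what makes real TLC error traces readable.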

Formal Methods at Scale

Company   | Usage
----------|-------------------------------------------------------------------
AWS       | Used TLA+ to verify S3's replication logic and DynamoDB's consistency.
MongoDB   | Verified the Raft-based replication protocol.
Microsoft | Used for verifying cache coherence protocols and Azure services.

Staff+ Deep Dive: Safety vs. Liveness

In interviews and design docs, you must distinguish between these two properties:

1. Safety (Always P)

A safety property is a guarantee that “nothing bad happens.”
  • Example: “No two nodes ever hold the same lock.”
  • Verification: If a bad state exists, TLC will find a finite trace to it.

2. Liveness (Eventually P)

A liveness property is a guarantee that “something good eventually happens.”
  • Example: “If a client sends a request, it will eventually receive a response.”
  • Verification: This requires checking infinite behaviors under fairness assumptions, proving the system cannot get stuck in a loop or stutter forever without progress. (TLC's deadlock check, which flags states with no enabled successor, is the simplest special case.)
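The checker-level difference can be made concrete on a tiny hand-built state graph (an assumed example, not TLA+ syntax). A safety check is a predicate over individual reachable states; the simplest liveness-style check, deadlock detection, is a property about what the system can still do:

```python
# Safety vs. deadlock on a toy state graph.

graph = {                # request -> retry loop, plus a stuck state
    "init": ["working"],
    "working": ["working", "stuck"],
    "stuck": [],         # deadlock: no enabled action
}

def reachable(start):
    seen, stack = {start}, [start]
    while stack:
        for s in graph[stack.pop()]:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return seen

# Safety: "nothing bad happens" -- check each reachable state in isolation.
assert all(s != "crashed" for s in reachable("init"))

# Deadlock: "something good can still happen" fails in a state
# from which no action is enabled.
deadlocked = [s for s in reachable("init") if not graph[s]]
assert deadlocked == ["stuck"]
```

Full liveness checking is harder still: it must rule out infinite loops that avoid the good state forever, not just dead ends, which is why it needs fairness assumptions.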

Next Steps

Formal verification has a steep learning curve, but it is a superpower for distributed systems architects.