Formal Verification
Why Formal Methods?
TLA+ and PlusCal: The Language of Systems
1. The Core Logic
2. Modeling the Network
3. PlusCal Deep Dive: Modeling 2-Phase Commit (2PC)
2PC Specification
The “Staff” Challenge: Identifying the Flaw
4. The “Design-Verify-Implement” Workflow
Staff Tip: Model Refinement
The TLC Model Checker
The Power of the Error Trace
Formal Methods at Scale
Staff+ Deep Dive: Safety vs. Liveness
1. Safety (Always P)
2. Liveness (Eventually P)
Next Steps

Formal Verification

In distributed systems, testing and code reviews are rarely enough. The state space of a system with

N

nodes,

M

messages, and asynchronous delays is astronomically large. Formal Verification allows us to mathematically prove that our algorithms are correct before we write a single line of production code.

Staff+ Context: Principal engineers at companies like AWS, MongoDB, and Microsoft use formal methods (specifically TLA+) to find design flaws in core protocols (like Paxos or Raft) that would have taken years to manifest in production.

Why Formal Methods?

Most distributed systems bugs are Heisenbugs—non-deterministic race conditions that depend on a specific, rare interleaving of messages and failures.

State Space Explosion: Even a simple 3-node system can have millions of possible states when network partitions and message reordering are involved.
Mental Models are Fallible: Humans are notoriously bad at reasoning about concurrency. We tend to think “happily path-first.”
Documentation as Code: A formal specification is an unambiguous description of how the system should behave.

TLA+ and PlusCal: The Language of Systems

TLA+ (Temporal Logic of Actions) is the industry standard for specifying and verifying concurrent and distributed systems. It was created by Leslie Lamport (who also created Paxos and LaTeX).

1. The Core Logic

TLA+ uses First-Order Logic and Set Theory to describe the state space.

Sets: {1, 2, 3}
Quantifiers: \A x \in S : P(x) (For all x in S, P is true) and \E x \in S : P(x) (There exists an x in S such that P is true).
Functions: [n \in Nodes |-> "follower"] (A mapping of nodes to their status).

2. Modeling the Network

In TLA+, we don’t use “network sockets.” Instead, we model the network as a Shared Set of Messages.

Sending: msgs' = msgs \cup {new_msg}
Receiving: \E m \in msgs : Receive(m)
Nondeterminism: Since TLC explores all paths, it will naturally try receiving messages in different orders, modeling reordering and delays.
Packet Loss: We can model loss by simply not requiring a message to be processed, or by an explicit “Drop” action.

3. PlusCal Deep Dive: Modeling 2-Phase Commit (2PC)

PlusCal is a “pseudocode” language that translates into TLA+. It allows you to write algorithms that look like code but are checked like math.

2PC Specification

--algorithm TwoPhaseCommit
variables 
    rmState = [rm \in RMs |-> "working"],
    tmState = "init",
    msgs = {};

define
    (* Safety: If any RM commits, no other RM can abort *)
    Consistent == 
        \A r1, r2 \in RMs : 
            ~ (rmState[r1] = "committed" /\ rmState[r2] = "aborted")
end define;

process TransactionManager = "TM"
begin
    TM_Start:
        tmState := "preparing";
        msgs := msgs \cup {[type |-> "prepare"]};
    TM_Wait:
        if \A rm \in RMs : rmState[rm] = "prepared" then
            tmState := "committing";
            msgs := msgs \cup {[type |-> "commit"]};
        else
            tmState := "aborting";
            msgs := msgs \cup {[type |-> "abort"]};
        end if;
end process;

process ResourceManager \in RMs
begin
    RM_Work:
        either 
            rmState[self] := "prepared";
        or 
            rmState[self] := "aborted";
        end either;
end process;

The “Staff” Challenge: Identifying the Flaw

When you run this through TLC, it passes the Consistent invariant. However, if you add a Liveness check (<> (tmState = "committed" \/ tmState = "aborting")), TLC will find a counter-example where the TM crashes during TM_Wait, leaving all RMs blocked forever. This is the classic “blocking” problem of 2PC that leads us to 3PC or Paxos.

4. The “Design-Verify-Implement” Workflow

For high-stakes distributed systems, the workflow is:

Design: Draft the algorithm on a whiteboard.
Verify: Write a TLA+ spec. Run TLC to find the “edge of the edge” cases.
Implement: Translate the verified spec into Go/Rust/Java.
Test: Use Deterministic Simulation Testing (see Fault Tolerance chapter) to ensure the implementation matches the spec.

Advanced users use Refinement Mappings. You write a “High-Level Spec” (e.g., “A set that supports Add/Remove”) and a “Low-Level Spec” (e.g., “A distributed Raft-based replicated log”). You then prove that every step the low-level system takes is a valid step in the high-level spec. This is the gold standard for correctness.

The TLC Model Checker

The TLC Model Checker is the tool that executes your specification.

It starts from the Init state.
It exhaustively explores every possible path (interleaving of actions).
If it finds a state where an Invariant is violated, it provides a Counter-example (Error Trace).

The Power of the Error Trace

An error trace shows you the exact sequence of steps (e.g., Node 1 crashes

\rightarrow

Message 2 delayed

\rightarrow

Node 3 times out) that leads to the failure. This is often a scenario the designer never considered.

Formal Methods at Scale

Company	Usage
AWS	Used TLA+ to verify S3’s replication logic and DynamoDB’s consistency.
MongoDB	Verified the Raft-based replication protocol.
Microsoft	Used for verifying cache coherence protocols and Azure services.

Staff+ Deep Dive: Safety vs. Liveness

In interviews and design docs, you must distinguish between these two properties:

1. Safety (Always P)

A safety property is a guarantee that “nothing bad happens.”

Example: “No two nodes ever hold the same lock.”
Verification: If a bad state exists, TLC will find a finite trace to it.

2. Liveness (Eventually P)

A liveness property is a guarantee that “something good eventually happens.”

Example: “If a client sends a request, it will eventually receive a response.”
Verification: This requires proving the system doesn’t get stuck in an infinite loop or deadlock (Deadlock checking is a subset of liveness).

Next Steps

Formal verification is a steep learning curve but a superpower for distributed systems architects.

Learn TLA+

Leslie Lamport’s official TLA+ resources.

Learn TLA (Hillel Wayne)

The best practical guide for engineers.

Gossip Protocols Transactions

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Formal Verification

​Why Formal Methods?

​TLA+ and PlusCal: The Language of Systems

​1. The Core Logic

​2. Modeling the Network

​3. PlusCal Deep Dive: Modeling 2-Phase Commit (2PC)

​2PC Specification

​The “Staff” Challenge: Identifying the Flaw

​4. The “Design-Verify-Implement” Workflow

​Staff Tip: Model Refinement

​The TLC Model Checker

​The Power of the Error Trace

​Formal Methods at Scale

​Staff+ Deep Dive: Safety vs. Liveness

​1. Safety (Always P)

​2. Liveness (Eventually P)

​Next Steps

Learn TLA+

Learn TLA (Hillel Wayne)

Formal Verification

Why Formal Methods?

TLA+ and PlusCal: The Language of Systems

1. The Core Logic

2. Modeling the Network

3. PlusCal Deep Dive: Modeling 2-Phase Commit (2PC)

2PC Specification

The “Staff” Challenge: Identifying the Flaw

4. The “Design-Verify-Implement” Workflow

Staff Tip: Model Refinement

The TLC Model Checker

The Power of the Error Trace

Formal Methods at Scale

Staff+ Deep Dive: Safety vs. Liveness

1. Safety (Always P)

2. Liveness (Eventually P)

Next Steps