
The Cassandra Paper & Core Architecture

Module Duration: 3-4 hours
Learning Style: Theory + Historical Context + Architectural Foundations
Outcome: Deep understanding of WHY Cassandra was designed the way it is

Why Start with the Paper?

Most Cassandra courses jump straight into CQL syntax and cluster setup. We’re taking a different approach. Understanding the original research paper gives you:
  • Conceptual clarity: Know WHY design decisions were made
  • Troubleshooting intuition: Predict behavior based on first principles
  • Interview advantage: Explain trade-offs confidently
  • Architectural thinking: Apply distributed systems principles broadly
We’ll break down the paper in plain language - no PhD required. Focus on understanding the concepts and the “why” behind them.

The Facebook Problem (2007-2008)

The Challenge: Inbox Search at Scale

In 2007, Facebook was growing explosively:
  • Tens of millions of active users (passing 100 million in 2008)
  • Billions of messages accumulating in user inboxes
  • Users wanted to search their entire message history
  • Searches needed to be fast (< 100ms)
The problem: How do you build a system that can:
  1. Ingest millions of writes per second (new messages)
  2. Search across billions of messages instantly
  3. Scale horizontally as Facebook grows
  4. Never go down (high availability)
  5. Work across multiple datacenters (Facebook was global)

Why Existing Solutions Failed

Option 1: Sharded MySQL

The Problem:
  • Sharding messages across MySQL servers was complex
  • JOIN-heavy queries too slow at scale
  • Write bottlenecks (single-master replication)
  • Hard to scale horizontally
Facebook’s Experience: Already using thousands of MySQL servers, but message search would require 10x more complexity.

Option 2: Amazon Dynamo (2007)

The Good:
  • Excellent availability and partition tolerance (AP in CAP theorem)
  • Linear scalability
  • Multi-datacenter support
The Problem:
  • Key-value only - no structured queries
  • No column-based data model
  • Limited query flexibility
What Facebook Liked: Dynamo’s peer-to-peer architecture, consistent hashing, and gossip protocol.

Option 3: Google Bigtable (2006)

The Good:
  • Column-family data model (flexible schema)
  • Fast writes and structured data
  • Proven at massive scale
The Problem:
  • Built on centralized infrastructure (GFS master, Chubby lock service)
  • Not designed for multi-datacenter replication
  • Centralized metadata = potential single point of failure and scaling bottleneck
What Facebook Liked: The column-family storage model and fast write path.

The Insight

Facebook engineers Avinash Lakshman (formerly of Amazon Dynamo team) and Prashant Malik realized: Why not combine the best of both?
  • Take Dynamo’s distribution model (no single point of failure, multi-DC replication)
  • Take Bigtable’s data model (column families, structured queries)
  • Add Facebook-specific optimizations (tunable consistency, efficient range queries)
This became Apache Cassandra.

The Cassandra Paper

Full Citation: Avinash Lakshman and Prashant Malik. “Cassandra - A Decentralized Structured Storage System.” ACM SIGOPS Operating Systems Review, 2010.
First Presented: Facebook Engineering blog, 2008
Open Sourced: 2008
Apache Incubator: 2009
Apache Top-Level Project: 2010
The paper is surprisingly readable compared to most academic papers. It’s only 6 pages and focuses on practical engineering trade-offs rather than theoretical proofs.

What Makes This Paper Special

Unlike many research papers, the Cassandra paper:
  • Describes a production system serving real users (not a prototype)
  • Focuses on engineering trade-offs (not just novel algorithms)
  • Explains why design choices were made for Facebook’s workload
  • Includes real performance numbers from production

Core Design Principles

Principle 1: No Single Point of Failure

Traditional Master-Slave Architecture:
┌─────────┐
│  Master │  ← Single point of failure
│ (Writes)│  ← Bottleneck for metadata
└────┬────┘

  ┌──┴──┬──────┐
  ▼     ▼      ▼
┌───┐ ┌───┐ ┌───┐
│S1 │ │S2 │ │S3 │  Slaves (read replicas)
└───┘ └───┘ └───┘

Problem: If master fails, writes stop!
Cassandra’s Peer-to-Peer Architecture:
      ┌───┐
   ┌─▶│ A │◀─┐
   │  └───┘  │
   │         │
┌──┴──┐   ┌──┴──┐
│  D  │   │  B  │  ← All nodes are equal
└──┬──┘   └──┬──┘  ← Any node can serve reads/writes
   │         │     ← No single point of failure
   │  ┌───┐  │
   └─▶│ C │◀─┘
      └───┘

Every node can accept writes and reads!
Key Insight: Unlike systems with a master node, every Cassandra node is identical. This means:
  • No bottlenecks
  • No single point of failure
  • Linear scalability (add nodes = add capacity)

Principle 2: Tunable Consistency

Most databases force you to choose:
  • Strong Consistency (CP in CAP theorem) → Slower, but always correct
  • Eventual Consistency (AP in CAP theorem) → Faster, but may be stale
Cassandra’s Innovation: Choose per-query!
-- Consistency is chosen per statement. In cqlsh you set it with the
-- CONSISTENCY command (the old CQL 2 "USING CONSISTENCY" clause was
-- removed in CQL 3); drivers set it per query/statement.

-- Strong consistency (wait for a majority of replicas)
CONSISTENCY QUORUM;
SELECT * FROM messages WHERE user_id = 123;

-- Fast reads (one replica responds; may be slightly stale)
CONSISTENCY ONE;
SELECT * FROM messages WHERE user_id = 123;

-- Critical writes (wait for all replicas)
CONSISTENCY ALL;
INSERT INTO payments (user_id, amount) VALUES (123, 99.99);
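If you are using a driver rather than cqlsh, the consistency level is attached to each statement. Here is a minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, and table names are placeholders taken from the examples above:

# pip install cassandra-driver
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # placeholder contact point
session = cluster.connect("my_app")     # placeholder keyspace

# Strong read: wait for a majority of replicas
read_quorum = SimpleStatement(
    "SELECT * FROM messages WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(read_quorum, [123])

# Critical write: wait for every replica to acknowledge
write_all = SimpleStatement(
    "INSERT INTO payments (user_id, amount) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ALL,
)
session.execute(write_all, [123, 99.99])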
Real-World Example (Netflix):
  • User profiles: Strong consistency (QUORUM) - must be accurate
  • Video thumbnails: Weak consistency (ONE) - stale data acceptable
  • Billing data: Strongest consistency (ALL) - money must be exact

Principle 3: Always Writable

Traditional Databases: If you can’t reach a majority of replicas → Write fails
Cassandra: If even one node is reachable → Write can still succeed (at low consistency levels such as ONE or ANY)
How? Hinted Handoff:
User writes message to Node A

Normal case:
Node A → Replicate to B, C (success)

Network partition (B and C unreachable):
Node A → Store "hint" locally

      Later, when B and C come back:

Node A → Replay hints to B and C
Trade-off:
  • ✅ High availability (writes never fail due to network issues)
  • ⚠️ Temporary inconsistency (hints must be replayed)
When this matters: Global applications where network partitions happen across datacenters

Principle 4: Scale Linearly

Goal: 2x nodes = 2x throughput
How Cassandra Achieves This:

1. Consistent Hashing

Data distributed evenly across nodes using a hash ring:
     hash(0)

   ┌───▼───┐
┌─▶│   A   │──┐
│  └───────┘  │
│             ▼
│  ┌───────┐
│  │   B   │  Each node owns a range
│  └───────┘  of hash values
│      │
│      ▼
│  ┌───────┐
└──│   C   │
   └───────┘
Adding a node? Only affects neighbors - no full reshuffle!

2. No Centralized Metadata

Unlike HDFS (centralized NameNode), every Cassandra node knows the full cluster topology via the gossip protocol. No metadata bottleneck = scales to thousands of nodes.

3. Data Locality

Clients can send a query to any node (the coordinator), which knows where the data lives and routes the request directly to the owning replicas. No extra routing hops = low latency at any scale.
Proven at Scale:
  • Apple: 75,000+ nodes
  • Netflix: Hundreds of nodes across 3 AWS regions
  • Discord: Trillions of messages

Cassandra Architecture Deep Dive

The Ring Topology

Cassandra organizes nodes in a ring using consistent hashing:
                  Token Range: 0

                    ┌───▼───┐
                    │ Node A│
                    │ Token │
                    │ Range │
                    │ 0-25  │
                    └───┬───┘

     Token Range: 75    │         Token Range: 25
            │           │              │
        ┌───▼───┐       │          ┌───▼───┐
        │ Node D│       │          │ Node B│
        │ Token │       │          │ Token │
        │ Range │       │          │ Range │
        │ 76-99 │       │          │ 26-50 │
        └───┬───┘       │          └───┬───┘
            │           │              │
            │           │              │
            │       ┌───▼───┐          │
            │       │ Node C│          │
            └───────│ Token │──────────┘
                    │ Range │
                    │ 51-75 │
                    └───────┘

                  Token Range: 50
How Data is Placed:

1. Hash the Partition Key

hash("user123") = 42
# Hash function: Murmur3 (default) or MD5
# Output: Integer in range [0, 2^63]

2. Find Owning Node

Hash value 42 falls in range [26-50]
→ Data goes to Node B (primary replica)

3. Replicate to Successors

Replication Factor = 3
→ Primary: Node B
→ Replica 1: Node C (next in ring)
→ Replica 2: Node D (next after C)
Adding a Node (Elasticity):
Before (4 nodes):
A [0-25] → B [26-50] → C [51-75] → D [76-99]

Add Node E at token 63:
A [0-25] → B [26-50] → C [51-62] → E [63-75] → D [76-99]

                        Only C and E exchange data!
                        Other nodes unaffected
Why This Matters:
  • Adding nodes doesn’t cause full data reshuffling
  • Removal/failure similarly localized
  • Scales to thousands of nodes efficiently
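To make token-based placement concrete, here is a small illustrative Python sketch of the 4-node ring above with RF=3. It uses MD5 and a 0-99 token space purely to match the diagram - not Cassandra's real Murmur3 partitioner or vnode logic:

import bisect
import hashlib

# Toy ring matching the diagrams: each node owns tokens up to and
# including its own token (0-99 token space for illustration only).
RING = [(25, "A"), (50, "B"), (75, "C"), (99, "D")]  # (token, node), sorted
TOKENS = [t for t, _ in RING]
RF = 3  # replication factor

def token_for(partition_key: str) -> int:
    """Hash a partition key into the toy 0-99 token space."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % 100

def replicas_for(partition_key: str) -> list[str]:
    """Primary replica owns the token; the next RF-1 ring successors hold copies."""
    token = token_for(partition_key)
    start = bisect.bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(RF)]

if __name__ == "__main__":
    for key in ["user123", "alice", "sensor-42"]:
        print(key, "→ token", token_for(key), "→ replicas", replicas_for(key))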

The Write Path (Why Writes Are Fast)

Cassandra is heavily optimized for writes. Here’s why.
Write Path Flow:
┌──────────────────────────────────────────────────────┐
│ 1. Client sends write to any node (Coordinator)     │
└──────────────────────┬───────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│ 2. Coordinator determines replicas (using ring)      │
│    Example: RF=3 → Nodes B, C, D                     │
└──────────────────────┬───────────────────────────────┘

        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌────────┐     ┌────────┐     ┌────────┐
   │ Node B │     │ Node C │     │ Node D │
   └────┬───┘     └────┬───┘     └────┬───┘
        │              │              │
        │ 3. Each replica does:       │
        │    a) Append to CommitLog (disk, sequential)
        │    b) Write to MemTable (memory)
        │    c) Return ACK
        │              │              │
        └──────────────┼──────────────┘


┌──────────────────────────────────────────────────────┐
│ 4. Coordinator waits for consistency level           │
│    - ONE: Wait for 1 ACK (fastest)                   │
│    - QUORUM: Wait for 2/3 ACKs (balanced)            │
│    - ALL: Wait for 3/3 ACKs (slowest, most consistent)│
└──────────────────────┬───────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│ 5. Return success to client                          │
└──────────────────────────────────────────────────────┘

Later (asynchronously):
   MemTable full → Flush to SSTable (sorted, immutable file on disk)
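Steps 3a-3c are the core of the LSM write path. The following toy Python sketch shows that shape - sequential commit-log appends, an in-memory memtable, and flushes to sorted, immutable files - while glossing over fsync policy, the real SSTable format, and compression:

import json
import os
import time

class TinyLSM:
    """Toy write path: append to a commit log, buffer in a memtable,
    flush sorted runs to immutable 'sstable' files when the buffer is full."""

    def __init__(self, data_dir: str, memtable_limit: int = 4):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.commitlog = open(os.path.join(data_dir, "commitlog.txt"), "a")
        self.memtable: dict[str, tuple[float, str]] = {}  # key -> (timestamp, value)
        self.memtable_limit = memtable_limit
        self.sstable_count = 0

    def write(self, key: str, value: str) -> None:
        record = {"key": key, "value": value, "ts": time.time()}
        # 1) Durability: sequential append to the commit log
        self.commitlog.write(json.dumps(record) + "\n")
        self.commitlog.flush()
        # 2) Speed: update the in-memory memtable (last write wins)
        self.memtable[key] = (record["ts"], value)
        # 3) Later: flush to an immutable, sorted on-disk file
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        self.sstable_count += 1
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count}.txt")
        with open(path, "w") as sstable:
            for key in sorted(self.memtable):            # sorted by key, like an SSTable
                ts, value = self.memtable[key]
                sstable.write(f"{key}\t{ts}\t{value}\n")
        self.memtable.clear()

if __name__ == "__main__":
    db = TinyLSM("/tmp/tiny_lsm")
    for i in range(10):
        db.write(f"user{i % 5}", f"message-{i}")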
Why So Fast?

Sequential Disk I/O

CommitLog is append-only. Sequential writes are 100x faster than random writes on spinning disks.

In-Memory Writes

The MemTable lives in RAM. A write is acknowledged once it is in the CommitLog and MemTable; the expensive flush to SSTables happens asynchronously later.

No Read-Before-Write

Unlike B-trees, Cassandra doesn’t read existing data before writing. Just append and sort later.

Batched Disk Flushes

MemTable accumulates writes in memory, then flushes in one large sequential write (efficient).
Real Numbers:
  • Write latency: 1-2ms (p99) for local writes
  • Write throughput: 100,000+ writes/sec per node (on modern hardware)
Trade-off:
  • Fast writes come at cost of read complexity (data spread across multiple SSTables)
  • Compaction needed to merge SSTables (covered in Module 3)

The Read Path (More Complex)

Reads are more complex than writes because the data for a row may be scattered across the MemTable and several SSTables.
Read Path Flow:
┌──────────────────────────────────────────────────────┐
│ 1. Client sends read to coordinator                  │
│    SELECT * FROM messages WHERE user_id = 123        │
└──────────────────────┬───────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│ 2. Coordinator determines replicas (Nodes B, C, D)   │
│    Consistency = QUORUM → Query 2 replicas           │
└──────────────────────┬───────────────────────────────┘

                ┌──────┴──────┐
                ▼             ▼
           ┌────────┐    ┌────────┐
           │ Node B │    │ Node C │
           └────┬───┘    └────┬───┘
                │             │
     3. Each node checks:     │
        a) MemTable (memory)  │
        b) Bloom Filters (is data in SSTable?)
        c) SSTables (disk, may be multiple)
                │             │
                └──────┬──────┘


┌──────────────────────────────────────────────────────┐
│ 4. Coordinator compares timestamps                   │
│    - If data matches → Return to client              │
│    - If mismatch → Trigger read repair               │
└──────────────────────┬───────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│ 5. Return most recent data to client                 │
└──────────────────────────────────────────────────────┘
Read Optimizations:
Bloom Filters
Problem: Checking every SSTable on disk is expensive.
Solution: Bloom filter (a probabilistic data structure - see the sketch after this list):
  • In-memory bitmap for each SSTable
  • Can say “definitely NOT in this SSTable” (avoid disk read)
  • Or “maybe in this SSTable” (check disk)
Impact: Reduces disk I/O by 90%+ for negative lookups
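Here is the promised sketch: a toy bloom filter in Python. Cassandra's real implementation is tuned and stored off-heap, but the key property is the same - it can say "definitely not here" with zero false negatives:

import hashlib

class TinyBloomFilter:
    """Toy bloom filter: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive several bit positions from one hash digest (illustrative only)
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False -> key is definitely NOT in the SSTable (skip the disk read)
        # True  -> key MAY be in the SSTable (go check the disk)
        return all(self.bits[pos] for pos in self._positions(key))

if __name__ == "__main__":
    bf = TinyBloomFilter()
    bf.add("user123")
    print(bf.might_contain("user123"))   # True (really present)
    print(bf.might_contain("user999"))   # Usually False -> skip this SSTable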
Partition Index
Problem: SSTables are large (GBs); scanning them is slow.
Solution: Partition index (in-memory):
  • Maps partition keys → byte offset in SSTable
  • Jump directly to data location
Impact: O(1) lookup instead of full table scan
Row Cache
Problem: Hot partitions are read repeatedly.
Solution: Cache frequently accessed rows in memory:
  • Configurable per table
  • Bypass MemTable + SSTable checks
Impact: Sub-millisecond latency for cached reads
Trade-off: Uses extra memory (can add GC/memory pressure if oversized)
Why Reads Are Slower Than Writes:
  • Must check multiple SSTables (writes just append)
  • May require network requests to replicas (for consistency)
  • Disk reads are random I/O (slower than sequential writes)
Mitigation:
  • Compaction merges SSTables (fewer files to check)
  • Bloom filters skip empty SSTables
  • Partition caches help for hot data
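Putting the read path together, here is a tiny illustrative Python sketch of a single-node read: check the memtable, then each SSTable that might contain the key, and keep the cell with the highest write timestamp (last write wins). The data is made up for illustration:

from typing import Optional

# Each 'store' maps key -> (write_timestamp, value). Newest timestamp wins.
memtable = {"alice": (105, "Hello")}
sstables = [                       # ordered newest-first, as flushed over time
    {"alice": (100, "Hi"), "bob": (90, "Hey")},
    {"alice": (50, "Old hello"), "carol": (40, "Yo")},
]

def read(key: str) -> Optional[str]:
    candidates = []
    if key in memtable:                  # 1) in-memory memtable
        candidates.append(memtable[key])
    for sstable in sstables:             # 2) every SSTable that might hold the key
        if key in sstable:               #    (bloom filters let Cassandra skip most)
            candidates.append(sstable[key])
    if not candidates:
        return None
    # 3) Reconcile: the cell with the highest write timestamp wins
    _, value = max(candidates)
    return value

print(read("alice"))   # "Hello" (memtable version is newest)
print(read("bob"))     # "Hey"
print(read("dave"))    # None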

Data Model: Column Families

Cassandra uses a wide-column data model inspired by Bigtable:

Traditional RDBMS vs Cassandra

Relational (MySQL):
-- Fixed schema
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    email VARCHAR(100)
);

-- Must pre-define all columns
Cassandra:
-- Flexible, sparse schema
CREATE TABLE users (
    id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
    -- New columns can be added cheaply later with ALTER TABLE,
    -- and unset columns take no storage space per row
);

-- Different rows can populate different columns
INSERT INTO users (id, name) VALUES (...);
INSERT INTO users (id, name, email) VALUES (...);

The Power of Clustering Columns

Cassandra’s killer feature: Clustering columns for efficient range queries:
CREATE TABLE messages (
    user_id UUID,
    timestamp TIMESTAMP,
    message_id UUID,
    body TEXT,
    PRIMARY KEY (user_id, timestamp)
    --           ^^^^^^^^  ^^^^^^^^^
    --           Partition  Clustering
    --           Key        Column
) WITH CLUSTERING ORDER BY (timestamp DESC);
How This Works:

1. Partition Key Determines Node

user_id = "alice" → hash("alice") = 42 → Node B
All of Alice’s messages stored on same node(s)!

2. Clustering Column Determines Sort Order

Within Node B's partition for Alice:
[
  {timestamp: 2024-01-15 10:30, message: "Hello"},
  {timestamp: 2024-01-15 10:25, message: "Hi"},
  {timestamp: 2024-01-15 10:20, message: "Hey"}
]
↑ Sorted by timestamp DESC on disk

3. Efficient Range Queries

-- Get Alice's last 10 messages (single partition, sequential read!)
SELECT * FROM messages
WHERE user_id = 'alice'
LIMIT 10;

-- O(1) partition lookup + sequential read (FAST)
Why This Matters:
  • Time-series data: Messages, logs, events naturally ordered by time
  • Single partition read: No scatter-gather across nodes
  • Sequential disk I/O: Much faster than random reads
Constraint:
  • ⚠️ Must always query by partition key (can’t query all messages across all users efficiently)
  • This is the price of scalability: Trade flexibility for performance
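As a sketch of how an application reads such a table, here is a hedged example with the DataStax Python driver (cassandra-driver) using a prepared statement; the connection details are placeholders and the table is the messages example above:

import uuid
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # placeholder contact point
session = cluster.connect("my_app")     # placeholder keyspace

# Single-partition read: one user's messages, newest first (clustering order),
# bounded by a time range on the clustering column.
select_recent = session.prepare(
    "SELECT timestamp, body FROM messages "
    "WHERE user_id = ? AND timestamp > ? LIMIT 10"
)

user_id = uuid.uuid4()                  # placeholder; use a real user's id
cutoff = datetime.now(timezone.utc) - timedelta(days=1)
for row in session.execute(select_recent, [user_id, cutoff]):
    print(row.timestamp, row.body)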

Replication & Consistency

Replication Strategies

SimpleStrategy (single datacenter):
CREATE KEYSPACE my_app WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};
  • Replicas placed on consecutive nodes in ring
  • Suitable for development/single-DC deployments
NetworkTopologyStrategy (production):
CREATE KEYSPACE my_app WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,  -- 3 replicas in datacenter 1
    'dc2': 2   -- 2 replicas in datacenter 2
};
  • Datacenter-aware: Replicas spread across racks and DCs
  • Multi-region: Survive entire datacenter failures
  • Performance: Serve reads locally in each region

Consistency Levels (Per-Query Tuning)

Replication Factor = 3 (Nodes A, B, C)

┌────────────┬────────────┬───────────────┬──────────────┐
│ Level      │ Writes     │ Reads         │ Trade-off    │
├────────────┼────────────┼───────────────┼──────────────┤
│ ONE        │ 1 ACK      │ 1 response    │ Fastest,     │
│            │            │               │ least        │
│            │            │               │ consistent   │
├────────────┼────────────┼───────────────┼──────────────┤
│ QUORUM     │ 2/3 ACKs   │ 2/3 responses │ Balanced     │
│            │            │               │ (most common)│
├────────────┼────────────┼───────────────┼──────────────┤
│ ALL        │ 3/3 ACKs   │ 3/3 responses │ Slowest,     │
│            │            │               │ most         │
│            │            │               │ consistent   │
└────────────┴────────────┴───────────────┴──────────────┘
Strong Consistency Formula:
R + W > RF  (Read + Write replicas > Replication Factor)

Examples:
- R=QUORUM (2), W=QUORUM (2), RF=3 → 2+2 > 3 ✓ (Strong)
- R=ONE (1), W=ALL (3), RF=3     → 1+3 > 3 ✓ (Strong)
- R=ONE (1), W=ONE (1), RF=3     → 1+1 > 3 ✗ (Eventual)
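The same rule as a few lines of Python you can use to sanity-check a configuration (QUORUM means floor(RF/2) + 1 replicas):

def quorum(rf: int) -> int:
    """Number of replicas a QUORUM read/write must touch."""
    return rf // 2 + 1

def is_strongly_consistent(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """Strong (read-your-writes) consistency requires R + W > RF."""
    return read_replicas + write_replicas > rf

RF = 3
print(is_strongly_consistent(quorum(RF), quorum(RF), RF))  # QUORUM + QUORUM -> True
print(is_strongly_consistent(1, RF, RF))                   # ONE + ALL      -> True
print(is_strongly_consistent(1, 1, RF))                    # ONE + ONE      -> False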
Real-World Example (E-commerce):
-- (shown as cqlsh CONSISTENCY commands; drivers set consistency per statement)

-- Product catalog (read-heavy, eventual consistency OK)
CONSISTENCY ONE;
SELECT * FROM products WHERE id = 123;

-- Shopping cart (needs consistency)
CONSISTENCY QUORUM;
SELECT * FROM carts WHERE user_id = 456;

-- Payment processing (critical!)
CONSISTENCY ALL;
INSERT INTO orders (id, total, status) VALUES (...);

Gossip Protocol (Failure Detection)

How does Cassandra detect node failures without a master? Gossip Protocol (inspired by Dynamo):

1. Periodic Communication

Every second, each node:
1. Picks 1-3 random nodes
2. Exchanges cluster state:
   - Which nodes are alive?
   - What are their token ranges?
   - What's their load?

2. Failure Detection

Node A stops receiving gossip heartbeats from Node C
→ A's failure detector (phi accrual) marks C as "down"
→ A stops sending requests to C
→ Writes destined for C become hints (hinted handoff)

3. State Propagation

Node A learns C is down
→ Gossips to B, D, E
→ B gossips to F, G
→ Entire cluster knows within seconds

(Exponential spread, like real gossip!)
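A toy Python simulation of that exponential spread, assuming each informed node relays to a few random peers per round; it only illustrates the "reaches everyone in roughly O(log N) rounds" behavior, not Cassandra's actual gossip messages:

import random

def gossip_rounds(num_nodes: int = 1000, fanout: int = 3, seed: int = 42) -> int:
    """Return how many rounds it takes for one rumor to reach every node."""
    random.seed(seed)
    informed = {0}                      # node 0 learns "C is down"
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        newly_informed = set()
        for node in informed:
            # Each informed node gossips to `fanout` random peers this round
            peers = random.sample(range(num_nodes), fanout)
            newly_informed.update(peers)
        informed |= newly_informed
    return rounds

print(gossip_rounds(100))     # a handful of rounds
print(gossip_rounds(10000))   # only a few more, despite 100x the nodes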
Why Gossip?
  • Decentralized: No single coordinator
  • Scalable: Information reaches the whole cluster in O(log N) gossip rounds
  • Fault-tolerant: Works even if many nodes fail
  • Eventually consistent: Cluster state converges
Tuning Parameters:
# cassandra.yaml
phi_convict_threshold: 8  # Lower = more sensitive to failures
                          # Higher = more tolerant of network delays

CAP Theorem and Cassandra’s Choice

Quick CAP Recap

CAP Theorem: Distributed systems can provide at most 2 of 3:
  • Consistency: All nodes see the same data
  • Availability: System responds to requests
  • Partition Tolerance: Works despite network failures
In practice: Network partitions happen, so choose CP or AP.

Cassandra’s Flexibility

Cassandra is AP by default, but can be operated in a CP fashion.
AP Configuration (High Availability):
-- Reads and writes succeed even if some replicas are unreachable
-- (set in cqlsh or on the driver statement)
CONSISTENCY ONE;

-- System stays available, but may be temporarily inconsistent
Use case: Social media feeds, recommendations, caches
CP Configuration (Strong Consistency):
-- Reads and writes fail if a quorum of replicas is unreachable
CONSISTENCY QUORUM;

-- Sacrifices availability for consistency (requests may fail during a partition)
Use case: Financial transactions, inventory management
The Innovation: You choose per query, not per cluster!

Key Insights from the Paper

1. Always Writable (Hinted Handoff)

Problem: Node B is down, but client tries to write to partition owned by B. Solution:
Client → Coordinator (Node A)

Node A checks: "B is down, C and D are alive"

Node A writes to C and D (replicas)

Node A stores "hint" for B locally:
   "When B comes back, replay this write"

Later: B comes online

Node A replays hints to B
Trade-off:
  • ✅ Writes never fail due to node failures
  • ⚠️ Temporary inconsistency until hints replayed
  • ⚠️ Hints can build up (configure max_hint_window)
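An illustrative Python sketch of the coordinator-side bookkeeping; real hints are persisted to disk and expire after max_hint_window, but the store-and-replay idea is the same:

import time
from collections import defaultdict

class Coordinator:
    """Toy coordinator: write to live replicas, store hints for dead ones."""

    def __init__(self, replicas: dict[str, bool]):
        self.replicas = replicas                  # node name -> is it up?
        self.hints = defaultdict(list)            # node name -> queued writes

    def write(self, key: str, value: str) -> None:
        record = (time.time(), key, value)
        for node, is_up in self.replicas.items():
            if is_up:
                self._send(node, record)          # normal replication
            else:
                self.hints[node].append(record)   # store a hint locally instead

    def node_recovered(self, node: str) -> None:
        """When a replica comes back, replay everything it missed."""
        self.replicas[node] = True
        for record in self.hints.pop(node, []):
            self._send(node, record)

    def _send(self, node: str, record) -> None:
        print(f"deliver {record[1]}={record[2]} to {node}")

coord = Coordinator({"B": True, "C": True, "D": False})   # D is down
coord.write("alice", "hello")      # B and C get it now; a hint is stored for D
coord.node_recovered("D")          # D comes back and the hint is replayed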

2. Read Repair (Eventual Consistency)

Problem: Replicas may have different data (due to failures, network delays). Solution:
Client reads at QUORUM (2/3 replicas)

Node A: {user: "alice", age: 30, updated: T1}
Node B: {user: "alice", age: 30, updated: T1}
Node C: {user: "alice", age: 25, updated: T0}  ← Stale!

Coordinator detects mismatch:
   → Return newest (age: 30) to client
   → Background: Send newest to Node C (read repair)
   → Eventually: All replicas consistent
When it runs:
  • Foreground: During reads at QUORUM/ALL, mismatches are repaired before returning to the client
  • Background: Older versions also ran probabilistic read repair on a fraction of reads (read_repair_chance, removed in Cassandra 4.0)
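The reconciliation itself is simply "highest write timestamp wins, then push that version to any stale replica". A minimal illustrative sketch in Python:

# Replica responses for one row: replica name -> (write_timestamp, row)
responses = {
    "A": (105, {"user": "alice", "age": 30}),
    "B": (105, {"user": "alice", "age": 30}),
    "C": (100, {"user": "alice", "age": 25}),   # stale
}

def read_repair(responses: dict) -> dict:
    # Pick the most recently written version (last write wins)
    newest_ts, newest_row = max(responses.values(), key=lambda pair: pair[0])
    # Push the newest version to any replica holding an older one
    for replica, (ts, _) in responses.items():
        if ts < newest_ts:
            print(f"repairing replica {replica} with {newest_row}")
    return newest_row                            # this is what the client gets

print(read_repair(responses))   # {'user': 'alice', 'age': 30}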

3. Anti-Entropy Repair (Scheduled Maintenance)

Problem: Read repair only fixes accessed data. What about cold data? Solution: Nodetool repair (Merkle tree comparison):
# Compare data across replicas using hash trees
nodetool repair

# Cassandra:
# 1. Builds Merkle trees for each replica
# 2. Compares hashes (efficient, no full data transfer)
# 3. Identifies differences
# 4. Streams missing/different data
Best Practice: Run repair at least once within every gc_grace_seconds window (default 10 days) to prevent:
  • Deleted data from resurrecting (tombstones)
  • Replicas diverging permanently
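A simplified Python sketch of the Merkle-tree idea: hash rows into a tree, compare the roots, and stream only the ranges whose hashes differ. Real repair operates on token ranges and streams SSTable data, but the comparison principle is the same:

import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_tree(rows: list[str]) -> list[list[str]]:
    """Build levels of hashes: leaves are row hashes, root is levels[-1][0]."""
    level = [h(row) for row in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate last hash if odd count
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_ranges(rows_a: list[str], rows_b: list[str]) -> list[int]:
    """Return leaf indexes whose data differs (ranges that need streaming)."""
    tree_a, tree_b = merkle_tree(rows_a), merkle_tree(rows_b)
    if tree_a[-1] == tree_b[-1]:
        return []                       # roots match: replicas already in sync
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

replica_a = ["row0", "row1", "row2", "row3"]
replica_b = ["row0", "row1", "row2-stale", "row3"]
print(diff_ranges(replica_a, replica_b))   # [2] -> only that range is streamed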

Compaction (LSM Tree Maintenance)

Cassandra uses Log-Structured Merge (LSM) trees, which require compaction.
Why Compaction?
Without compaction:
MemTable flushes → SSTable1 (10 MB)
MemTable flushes → SSTable2 (10 MB)
MemTable flushes → SSTable3 (10 MB)
...
MemTable flushes → SSTable100 (10 MB)

Problem: Read must check 100+ SSTables (SLOW!)
With compaction:
Background process merges SSTables:
SSTable1 + SSTable2 + SSTable3 → SSTable_merged (30 MB)

Benefits:
- Fewer files to check (faster reads)
- Remove deleted data (tombstones)
- Reclaim disk space
Compaction Strategies (covered deeply in Module 3):
  • SizeTieredCompactionStrategy (STCS): Default, good for writes
  • LeveledCompactionStrategy (LCS): Better for reads, more I/O
  • TimeWindowCompactionStrategy (TWCS): Optimized for time-series
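A minimal illustrative Python sketch of the core merge that every compaction strategy performs: keep the newest version of each key and purge tombstones older than the GC grace cutoff. The strategies above differ in which SSTables they pick and when, not in this merge:

TOMBSTONE = object()   # marker for a deleted value

# Each SSTable: key -> (write_timestamp, value)
sstable_1 = {"alice": (100, "v1"), "bob": (90, "v1")}
sstable_2 = {"alice": (120, "v2"), "carol": (110, TOMBSTONE)}   # carol was deleted
sstable_3 = {"bob": (130, "v2"), "dave": (95, "v1")}

def compact(*sstables, gc_before: int = 115):
    """Merge SSTables: newest version per key wins; old tombstones are purged."""
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Drop tombstones older than the GC grace cutoff (data is really gone now)
    return {
        key: cell
        for key, cell in merged.items()
        if not (cell[1] is TOMBSTONE and cell[0] < gc_before)
    }

print(compact(sstable_1, sstable_2, sstable_3))
# {'alice': (120, 'v2'), 'bob': (130, 'v2'), 'dave': (95, 'v1')}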

Real-World Design Patterns

Pattern 1: Time-Series Data (IoT Sensors)

CREATE TABLE sensor_data (
    sensor_id UUID,
    timestamp TIMESTAMP,
    temperature DECIMAL,
    humidity DECIMAL,
    PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND COMPACTION = {'class': 'TimeWindowCompactionStrategy'};

-- Efficient queries:
-- "Get last 24 hours of data for sensor X"
-- (compute the cutoff timestamp in the application; Cassandra 4.0+
--  also supports timestamp arithmetic such as currentTimestamp() - 24h)
SELECT * FROM sensor_data
WHERE sensor_id = ?
  AND timestamp > ?;
Why Cassandra?
  • Millions of writes/sec (new sensor readings)
  • Time-based queries (last N hours)
  • Data naturally partitioned by sensor

Pattern 2: User Activity Streams (Social Media)

CREATE TABLE user_timeline (
    user_id UUID,
    post_timestamp TIMESTAMP,
    post_id UUID,
    author_id UUID,
    content TEXT,
    PRIMARY KEY (user_id, post_timestamp, post_id)
) WITH CLUSTERING ORDER BY (post_timestamp DESC);

-- Efficient queries:
-- "Get Alice's feed (last 50 posts)"
SELECT * FROM user_timeline
WHERE user_id = 'alice'
LIMIT 50;
Why Cassandra?
  • Personalized feeds per user (partition key = user)
  • Chronological order (clustering by timestamp)
  • Scales to billions of users

Pattern 3: Distributed Counters

CREATE TABLE page_views (
    page_id TEXT PRIMARY KEY,
    views COUNTER
);

-- Increment counter (no read-before-write!)
UPDATE page_views
SET views = views + 1
WHERE page_id = '/home';
Why Cassandra?
  • Distributed counting without coordination
  • High-throughput increments
  • Eventually consistent totals

When NOT to Use Cassandra

Be honest about trade-offs: Avoid Cassandra for:
Multi-partition ACID transactions
Why: Cassandra only supports single-partition lightweight transactions (slow, Paxos-based).
What it can’t do: Multi-partition transactions (e.g., transfer money between two accounts).
Alternative: Use PostgreSQL, CockroachDB, or Spanner for strong ACID needs.
Relational queries with JOINs
Why: No JOIN support. Cassandra requires denormalization (store data multiple ways).
What it can’t do: SELECT * FROM users JOIN orders ON users.id = orders.user_id
Alternative: Use a relational database or pre-compute joins (materialized views).
Ad-hoc queries on arbitrary columns
Why: Must query by partition key (primary key). No efficient full table scans.
What it can’t do: SELECT * FROM users WHERE email = 'alice@example.com' (unless email is the partition key)
Alternative: Use Elasticsearch for search, or design tables for known query patterns.
Small datasets
Why: Cassandra’s operational complexity is only justified at scale.
What it can’t do: Justify its overhead for < 100 GB datasets.
Alternative: Use PostgreSQL or MySQL for small-to-medium data.

Key Takeaways

Before moving to hands-on modules, internalize these principles:

Peer-to-Peer = No SPOF

Every node is equal. No master means no bottleneck, no single point of failure. This is Cassandra’s superpower.

Tunable Consistency

Choose per query: QUORUM for critical data, ONE for speed. Few databases give you this per-query control.

Write-Optimized (LSM)

Sequential writes to CommitLog + MemTable make writes blazingly fast. Compaction is the price.

Query-Driven Modeling

Design tables for your queries, not normalization. Embrace denormalization for performance.

Interview Preparation

Understanding the paper gives you a huge advantage:
Architecture Questions:
  • “Why is Cassandra called a peer-to-peer system?”
    • Answer: No master node, all nodes equal, gossip for coordination (contrast with HDFS NameNode)
  • “Explain Cassandra’s write path”
    • Answer: CommitLog (sequential disk) → MemTable (memory) → SSTable (async flush)
  • “How does Cassandra handle node failures?”
    • Answer: Hinted handoff (temporary), read repair (on reads), anti-entropy repair (scheduled)
Trade-off Questions:
  • “Why are writes faster than reads in Cassandra?”
    • Answer: Writes append to log, reads must check multiple SSTables + bloom filters
  • “What’s the downside of eventual consistency?”
    • Answer: Stale reads possible, application must tolerate (use QUORUM for important data)
Design Questions:
  • “Design a messaging system like WhatsApp”
    • Answer: Partition by user_id, cluster by timestamp, denormalize for sender/receiver views

Primary Source

The Cassandra Paper (2010)
  • Full PDF
  • Read: Sections 1-3 and 5 (Introduction, Related Work, Data Model, System Architecture)
  • Skim: Section 4 (API) and the implementation details at the end of Section 5
  • Focus: Section 6 (Practical Experiences - real production numbers from Facebook)

Foundational Papers (Context)

  1. Dynamo: Amazon’s Highly Available Key-value Store (2007)
    • Full PDF
    • Understand: Consistent hashing, vector clocks, gossip protocol
  2. Bigtable: A Distributed Storage System for Structured Data (2006)
    • Full PDF
    • Understand: Column-family model, LSM trees, tablets

Cassandra-Specific Resources

  • “Cassandra: The Definitive Guide” (O’Reilly) - Chapter 2 (Cassandra’s approach)
  • DataStax Academy - Free course: “DS101: Introduction to Apache Cassandra”

Practical Exercise

Before moving to implementation:

1. Read the Paper

Download the Cassandra paper and read sections 1-3 (30-45 min). As you read, note:
  • Which design choices surprise you?
  • How do Facebook’s requirements drive architecture?

2. Compare with Other Systems

Create a comparison table:
┌──────────────┬───────────┬──────────┬───────────┐
│ Feature      │ Cassandra │ MySQL    │ MongoDB   │
├──────────────┼───────────┼──────────┼───────────┤
│ Consistency  │ Tunable   │ Strong   │ Eventual  │
│ Architecture │ P2P       │ Master   │ Primary+  │
│              │           │ -Slave   │ Secondary │
│ Scaling      │ Horizontal│ Vertical │ Horizontal│
│ Writes       │ Fast (LSM)│ Slow(B+) │ Fast      │
│ JOINs        │ No        │ Yes      │ Limited   │
└──────────────┴───────────┴──────────┴───────────┘

3. Thought Experiments

Consider these scenarios:
  • Two datacenters lose connectivity (network partition) - what happens to reads/writes?
  • A node fails during a QUORUM write (RF=3) - does the write succeed?
  • You need to query users by email (not primary key) - what are your options?

4. Whiteboard Exercise

Draw the architecture:
  1. Cassandra ring with 4 nodes, show token ranges
  2. Write path: Client → Coordinator → Replicas (CommitLog, MemTable)
  3. Read path: Client → Coordinator → Replicas (MemTable + SSTables)
Explain each step to yourself or a peer.

What’s Next?

Now that you understand the why behind Cassandra’s design, let’s get hands-on with data modeling and CQL.

Module 2: Data Modeling & CQL Mastery

Learn query-driven data modeling, primary keys, and write your first CQL queries
Pro Tip: The concepts from this module (ring topology, consistency levels, write/read paths) will be referenced constantly. Bookmark this page!

Fun Facts

Named after the Greek mythological prophet Cassandra, who could see the future but was cursed to never be believed.
The metaphor: Cassandra (the database) can “predict” where data lives (consistent hashing) and handle future failures (replication), but requires trust in its eventual consistency model.
Also: The authors wanted a name suggesting “seeing the future” (forecasting scale).
When Cassandra was built:
  • 150+ billion messages stored
  • 15 TB of data
  • 1 billion writes/day
  • 50 GB new data daily
Today, Facebook (Meta) has long since moved its messaging storage to other systems (notably HBase) and built RocksDB, an embedded LSM-tree store (not a Cassandra fork), for similar workloads - but Cassandra lives on at Netflix, Apple, Discord, and thousands more.
There’s a fork: DataStax Enterprise (DSE) vs. Apache Cassandra
  • Apache Cassandra: Open source, community-driven
  • DataStax Enterprise: Commercial version with analytics (Spark integration), search, graph features
This course focuses on open-source Apache Cassandra (works for both).

Ready to dive into data modeling? Let’s learn how to think in Cassandra’s query-driven paradigm.

Next: Data Modeling & CQL

Master the art of designing schemas for Cassandra’s distributed architecture