The Cassandra Paper & Core Architecture
Module Duration: 3-4 hours
Learning Style: Theory + Historical Context + Architectural Foundations
Outcome: Deep understanding of WHY Cassandra was designed the way it is
Why Start with the Paper?
Most Cassandra courses jump straight into CQL syntax and cluster setup. We're taking a different approach. Understanding the original research paper gives you:
- Conceptual clarity: Know WHY design decisions were made
- Troubleshooting intuition: Predict behavior based on first principles
- Interview advantage: Explain trade-offs confidently
- Architectural thinking: Apply distributed systems principles broadly
We’ll break down the paper in plain language - no PhD required. Focus on understanding the concepts and the “why” behind them.
The Facebook Problem (2007-2008)
The Challenge: Inbox Search at Scale
In 2007-2008, Facebook was growing explosively:
- Roughly 100 million users (and growing fast)
- Billions of messages sent daily
- Users wanted to search their entire message history
- Searches needed to be fast (< 100 ms)
The new system had to:
- Ingest millions of writes per second (new messages)
- Search across billions of messages instantly
- Scale horizontally as Facebook grows
- Never go down (high availability)
- Work across multiple datacenters (Facebook was global)
Why Existing Solutions Failed
MySQL (Traditional RDBMS)
The Problem:
- Sharding messages across MySQL servers was complex
- JOIN-heavy queries too slow at scale
- Write bottlenecks (single-master replication)
- Hard to scale horizontally
Amazon Dynamo
The Good:
- Excellent availability and partition tolerance (AP in CAP theorem)
- Linear scalability
- Multi-datacenter support
The Bad:
- Key-value only - no structured queries
- No column-based data model
- Limited query flexibility
Google Bigtable
The Good:
- Column-family data model (flexible schema)
- Fast writes and structured data
- Proven at massive scale
The Bad:
- Single-master architecture (GFS + Chubby)
- Not designed for multi-datacenter
- Centralized metadata server = single point of failure
The Insight
Facebook engineers Avinash Lakshman (formerly of Amazon's Dynamo team) and Prashant Malik realized: why not combine the best of both?
- Take Dynamo's distribution model (no single point of failure, multi-DC replication)
- Take Bigtable’s data model (column families, structured queries)
- Add Facebook-specific optimizations (tunable consistency, efficient range queries)
The Cassandra Paper
Full Citation: Avinash Lakshman and Prashant Malik. "Cassandra - A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review 44(2), 2010.
- First Presented: Facebook Engineering blog, 2008
- Open Sourced: 2008
- Apache Incubator: 2009
- Apache Top-Level Project: 2010
What Makes This Paper Special
Unlike many research papers, the Cassandra paper:
- Describes a production system serving real users (not a prototype)
- Focuses on engineering trade-offs (not just novel algorithms)
- Explains why design choices were made for Facebook’s workload
- Includes real performance numbers from production
Core Design Principles
Principle 1: No Single Point of Failure
Traditional master-slave architectures route every write through a single master - a bottleneck and a single point of failure. Cassandra is peer-to-peer instead: every node is equal, which gives you:
- No bottlenecks
- No single point of failure
- Linear scalability (add nodes = add capacity)
Principle 2: Tunable Consistency
Most databases force you to choose:
- Strong Consistency (CP in CAP theorem) → Slower, but always correct
- Eventual Consistency (AP in CAP theorem) → Faster, but may be stale
Cassandra lets you tune consistency per query. A single application might use (see the driver sketch after this list):
- User profiles: Strong consistency (QUORUM) - must be accurate
- Video thumbnails: Weak consistency (ONE) - stale data acceptable
- Billing data: Strongest consistency (ALL) - money must be exact
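A minimal sketch of per-query tuning using the DataStax Python driver (the contact point, keyspace, and table names here are hypothetical, and a locally running node is assumed):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumes a node reachable on localhost and a keyspace named "app".
session = Cluster(["127.0.0.1"]).connect("app")

# Critical data: a majority of replicas must agree before we trust the read.
profile_query = SimpleStatement(
    "SELECT * FROM user_profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

# Cosmetic data: one replica is enough; lowest latency, stale data acceptable.
thumbnail_query = SimpleStatement(
    "SELECT url FROM thumbnails WHERE video_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

session.execute(profile_query, ["user-42"])
session.execute(thumbnail_query, ["video-7"])
```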
Principle 3: Always Writable
Traditional Databases: If you can't reach a majority of replicas → write fails.
Cassandra: If even one node is reachable → write succeeds.
How? Hinted Handoff: when a replica is down, the coordinator stores a "hint" locally and replays the write once the replica recovers.
- ✅ High availability (writes never fail due to network issues)
- ⚠️ Temporary inconsistency (hints must be replayed)
Principle 4: Scale Linearly
Goal: 2x nodes = 2x throughput.
How Cassandra Achieves This:
Consistent Hashing
Data is distributed evenly across nodes using a hash ring. Adding a node? Only its neighbors are affected - no full reshuffle (see the sketch below).
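To see why only neighbors are affected, here is a toy hash ring in plain Python (node names and key counts are arbitrary, and real Cassandra smooths ownership further with virtual nodes):

```python
import bisect
import hashlib

def token(name: str) -> int:
    # Map a key or node name onto the ring's integer token space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Sorted node tokens; each node owns the arc ending at its token.
        self.tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        # Walk clockwise from the key's position to the next node token.
        points = [t for t, _ in self.tokens]
        i = bisect.bisect_right(points, token(key)) % len(self.tokens)
        return self.tokens[i][1]

keys = [f"user:{i}" for i in range(10_000)]
old_ring = Ring(["n1", "n2", "n3"])
new_ring = Ring(["n1", "n2", "n3", "n4"])  # add one node
moved = sum(old_ring.owner(k) != new_ring.owner(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 25%, not 100%
```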
No Centralized Metadata
Unlike HDFS (centralized NameNode), every Cassandra node knows the full cluster topology via the gossip protocol. No metadata bottleneck = scales to thousands of nodes. Real-world scale:
- Apple: 75,000+ nodes
- Netflix: Hundreds of nodes across 3 AWS regions
- Discord: Trillions of messages
Cassandra Architecture Deep Dive
The Ring Topology
Cassandra organizes nodes in a ring using consistent hashing:
- Adding nodes doesn't cause full data reshuffling
- Removal/failure similarly localized
- Scales to thousands of nodes efficiently
The Write Path (Why Writes Are Fast)
Cassandra is optimized for writes. Here's why:
Write Path Flow:
Sequential Disk I/O
CommitLog is append-only. Sequential writes are 100x faster than random writes on spinning disks.
In-Memory Writes
MemTable is in RAM. Write completes after memory write (async flush to disk later).
No Read-Before-Write
Unlike B-trees, Cassandra doesn’t read existing data before writing. Just append and sort later.
Batched Disk Flushes
MemTable accumulates writes in memory, then flushes in one large sequential write (efficient).
Performance:
- Write latency: 1-2 ms (p99) for local writes
- Write throughput: 100,000+ writes/sec per node (on modern hardware)
Trade-offs (see the toy model below):
- Fast writes come at the cost of read complexity (data spread across multiple SSTables)
- Compaction needed to merge SSTables (covered in Module 3)
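To make the flow concrete, here is a deliberately tiny Python model of the path above - not Cassandra's code, just the LSM shape it describes (append-only log, in-memory table, sorted immutable flushes):

```python
import json

class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.commitlog = open("commitlog.jsonl", "a")  # sequential, append-only
        self.memtable = {}    # in RAM; last write wins
        self.sstables = []    # immutable sorted runs, newest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # 1. Durability: append to the commit log (sequential disk I/O).
        self.commitlog.write(json.dumps([key, value]) + "\n")
        # 2. Speed: update the memtable; no read-before-write anywhere.
        self.memtable[key] = value
        # 3. Batched flush: one large sequential write once the memtable fills.
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Reads are the complex side: memtable first, then SSTables newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.sstables:
            if key in sst:
                return sst[key]
        return None
```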
The Read Path (More Complex)
Reads are more complex than writes because data may be scattered across the MemTable and multiple SSTables.
Read Path Flow:
Bloom Filters
Problem: Checking every SSTable is expensive.
Solution: Bloom filter (probabilistic data structure; minimal sketch after this list):
- In-memory bitmap for each SSTable
- Can say “definitely NOT in this SSTable” (avoid disk read)
- Or “maybe in this SSTable” (check disk)
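A minimal Bloom filter in Python (bit count and hash count are arbitrary; Cassandra's implementation differs in detail, but the one-sided guarantee - no false negatives - is the same):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a big integer used as a bitmap

    def _positions(self, key: str):
        # Derive k independent bit positions from the key.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False -> key is DEFINITELY absent: skip this SSTable entirely.
        # True  -> key MAY be present: pay the disk read to check.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True (always: no false negatives)
print(bf.might_contain("user:99"))   # almost certainly False
```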
Partition Index
Problem: SSTables are large (GBs); scanning them is slow.
Solution: Partition index (in-memory):
- Maps partition keys → byte offset in SSTable
- Jump directly to data location
Row Cache (Optional)
Problem: Hot partitions are read repeatedly.
Solution: Cache frequently accessed rows in memory:
- Configurable per table
- Bypass MemTable + SSTable checks
Why reads are slower than writes:
- Must check multiple SSTables (writes just append)
- May require network requests to replicas (for consistency)
- Disk reads are random I/O (slower than sequential writes)
How Cassandra mitigates this:
- Compaction merges SSTables (fewer files to check)
- Bloom filters skip empty SSTables
- Partition caches help for hot data
Data Model: Column Families
Cassandra uses a wide-column data model inspired by Bigtable.
Traditional RDBMS vs Cassandra
A relational database (MySQL) normalizes data into many tables and JOINs them at query time. Cassandra denormalizes into wide rows addressed by a partition key - no JOINs, and each query targets one partition.
The Power of Clustering Columns
Cassandra's killer feature: clustering columns for efficient range queries (see the sketch after this list).
Why This Matters:
- Time-series data: Messages, logs, events naturally ordered by time
- Single partition read: No scatter-gather across nodes
- Sequential disk I/O: Much faster than random reads
- ⚠️ Must always query by partition key (can’t query all messages across all users efficiently)
- This is the price of scalability: Trade flexibility for performance
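A sketch of what this looks like with the Python driver; the messages schema is hypothetical, the point is the PRIMARY KEY ((partition), clustering...) shape:

```python
import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # hypothetical keyspace

# Partition by user; cluster by time so one user's messages live together
# on disk, pre-sorted newest-first, on a single replica set.
session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        user_id    text,
        sent_at    timestamp,
        message_id uuid,
        body       text,
        PRIMARY KEY ((user_id), sent_at, message_id)
    ) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC)
""")

# A time-range query within one partition: a single sequential read on
# one replica set - no scatter-gather across the cluster.
rows = session.execute(
    "SELECT sent_at, body FROM messages WHERE user_id = %s AND sent_at > %s",
    ["alice", datetime.datetime(2024, 1, 1)],
)
```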
Replication & Consistency
Replication Strategies
SimpleStrategy (single datacenter):
- Replicas placed on consecutive nodes in the ring
- Suitable for development/single-DC deployments
NetworkTopologyStrategy (production; example keyspace below):
- Datacenter-aware: Replicas spread across racks and DCs
- Multi-region: Survive entire datacenter failures
- Performance: Serve reads locally in each region
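As a sketch, a production-style multi-DC keyspace could be declared like this (the datacenter names must match what your snitch reports; 'us_east' and 'eu_west' are placeholders):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Three replicas in each datacenter: survives a full DC outage and lets
# each region serve reads locally (e.g., with LOCAL_QUORUM).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    }
""")
```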
Consistency Levels (Per-Query Tuning)
Every read and write names how many replica acknowledgments it requires:
- ANY: Accepted by any node, even as just a hint (writes only) - maximum availability
- ONE: One replica responds - fastest, may accept/return stale data
- LOCAL_QUORUM: Majority of replicas in the local datacenter - the multi-DC workhorse
- QUORUM: Majority of all replicas - strong consistency when reads and writes both use it
- ALL: Every replica responds - strongest consistency, lowest availability
Gossip Protocol (Failure Detection)
How does Cassandra detect node failures without a master? With a gossip protocol (inspired by Dynamo): every second, each node exchanges state with a few random peers, so membership and health information spreads epidemically (simulated below).
Why Gossip?
- Decentralized: No single coordinator
- Scalable: O(log N) messages to reach all nodes
- Fault-tolerant: Works even if many nodes fail
- Eventually consistent: Cluster state converges
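A back-of-the-envelope simulation of why epidemic spread converges in O(log N) rounds (cluster size and fanout are arbitrary):

```python
import random

def rounds_to_converge(n_nodes=1000, fanout=3, seed=1):
    # One node learns something (e.g., "node 7 is down"); each round,
    # every node informed at the start of the round tells `fanout` random peers.
    random.seed(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        for _ in range(len(informed)):
            informed.update(random.randrange(n_nodes) for _ in range(fanout))
        rounds += 1
    return rounds

print(rounds_to_converge())  # single digits for 1,000 nodes
```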
CAP Theorem and Cassandra’s Choice
Quick CAP Recap
CAP Theorem: Distributed systems can provide at most 2 of 3:
- Consistency: All nodes see the same data
- Availability: System responds to requests
- Partition Tolerance: Works despite network failures
Cassandra’s Flexibility
Cassandra is AP by default, but can be pushed toward CP per query:
- AP configuration (high availability): write and read at ONE - always responsive, possibly stale
- CP configuration (strong consistency): write and read at QUORUM - with RF=3 that is W + R = 2 + 2 > 3 replicas, so every read overlaps the latest acknowledged write, at the price of failing when a majority is unreachable
Key Insights from the Paper
1. Always Writable (Hinted Handoff)
Problem: Node B is down, but a client writes to a partition owned by B.
Solution: The coordinator stores a hint locally and replays the write to B when it recovers.
- ✅ Writes never fail due to node failures
- ⚠️ Temporary inconsistency until hints replayed
- ⚠️ Hints can build up (configure max_hint_window)
2. Read Repair (Eventual Consistency)
Problem: Replicas may hold different data (due to failures, network delays).
Solution: Read repair - when a read touches multiple replicas and their data disagrees, the newest value wins and stale replicas are updated:
- Foreground: During reads at QUORUM/ALL (before returning to client)
- Background: Probabilistic read repair (10% of ONE reads, configurable)
3. Anti-Entropy Repair (Scheduled Maintenance)
Problem: Read repair only fixes accessed data. What about cold data?
Solution: nodetool repair (Merkle tree comparison; sketch after this list). Run it at least once every gc_grace_seconds (default 10 days) to prevent:
- Deleted data from resurrecting (tombstones)
- Replicas diverging permanently
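A compressed sketch of the Merkle-tree idea in Python (row contents are made up): replicas compare a handful of hashes instead of all their data, and only stream ranges whose hashes differ:

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(rows):
    # Hash rows, then repeatedly hash pairs until one root remains.
    level = [sha(r) for r in rows]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1:v1", b"row2:v1", b"row3:v1", b"row4:v1"]
replica_b = [b"row1:v1", b"row2:v2", b"row3:v1", b"row4:v1"]  # row2 diverged

# Equal roots => the whole range is in sync, nothing to stream.
# Unequal roots => recurse into children and stream only the rows that differ.
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> drill down
```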
Compaction (LSM Tree Maintenance)
Cassandra uses Log-Structured Merge (LSM) trees, which require compaction.
Why Compaction? Every MemTable flush creates a new immutable SSTable; without merging, reads would have to check ever more files, and deleted or overwritten data would never be reclaimed. Strategies (toy merge sketch after this list):
- SizeTieredCompactionStrategy (STCS): Default, good for writes
- LeveledCompactionStrategy (LCS): Better for reads, more I/O
- TimeWindowCompactionStrategy (TWCS): Optimized for time-series
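In LSM terms, compaction is a merge of sorted runs in which newer values shadow older ones and expired tombstones drop out. A toy version (tombstones modeled as None; real compaction also honors gc_grace_seconds):

```python
def compact(sstables_newest_first):
    merged = {}
    # Apply oldest first so newer SSTables overwrite older values.
    for sst in reversed(sstables_newest_first):
        merged.update(sst)
    # Purge tombstones (simplified: real Cassandra keeps them for
    # gc_grace_seconds so lagging replicas still learn about the delete).
    merged = {k: v for k, v in merged.items() if v is not None}
    return dict(sorted(merged.items()))

old = {"a": 1, "b": 2, "c": 3}
mid = {"b": 20, "d": 4}
new = {"a": None}  # tombstone: "a" was deleted
print(compact([new, mid, old]))  # {'b': 20, 'c': 3, 'd': 4} - one file, no garbage
```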
Real-World Design Patterns
Pattern 1: Time-Series Data (IoT Sensors)
- Millions of writes/sec (new sensor readings)
- Time-based queries (last N hours)
- Data naturally partitioned by sensor
Pattern 2: User Activity Streams (Social Media)
- Personalized feeds per user (partition key = user)
- Chronological order (clustering by timestamp)
- Scales to billions of users
Pattern 3: Distributed Counters
Counter columns give you (see the sketch after this list):
- Distributed counting without coordination
- High-throughput increments
- Eventually consistent totals
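A sketch of this pattern with the Python driver (keyspace and table names are hypothetical):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # hypothetical keyspace

# A counter table may contain only counter columns beside the primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        page_id text PRIMARY KEY,
        views   counter
    )
""")

# Increments are commutative, so replicas can apply them in any order
# without coordination; the total converges eventually.
session.execute(
    "UPDATE page_views SET views = views + 1 WHERE page_id = %s",
    ["/home"],
)
```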
When NOT to Use Cassandra
Be honest about trade-offs. ❌ Avoid Cassandra for:
ACID Transactions Across Rows
Why: Cassandra only supports single-partition lightweight transactions (slow, Paxos-based).
What it can't do: Multi-partition transactions (e.g., transfer money between two accounts).
Alternative: Use PostgreSQL, CockroachDB, or Spanner for strong ACID needs.
Complex JOINs
Why: No JOIN support. Cassandra requires denormalization (store data multiple ways).
What it can't do: SELECT * FROM users JOIN orders ON users.id = orders.user_id
Alternative: Use relational databases or pre-compute joins (materialized views).
Ad-hoc Queries
Why: Must query by partition key (primary key). No full table scans.
What it can't do: SELECT * FROM users WHERE email = 'alice@example.com' (unless email is the partition key)
Alternative: Use Elasticsearch for search, or design tables for known query patterns.
Small Datasets
Why: Cassandra's operational complexity is only justified at scale.
What it can't do: Efficiently manage < 100 GB datasets (the overhead isn't worth it).
Alternative: Use PostgreSQL or MySQL for small-to-medium data.
Key Takeaways
Before moving to hands-on modules, internalize these principles:
Peer-to-Peer = No SPOF
Every node is equal. No master means no bottleneck, no single point of failure. This is Cassandra’s superpower.
Tunable Consistency
Choose per query: QUORUM for critical data, ONE for speed. Few databases offer this per-query flexibility.
Write-Optimized (LSM)
Sequential writes to CommitLog + MemTable make writes blazingly fast. Compaction is the price.
Query-Driven Modeling
Design tables for your queries, not normalization. Embrace denormalization for performance.
Interview Preparation
Understanding the paper gives you a huge advantage.
Common Interview Questions
Architecture Questions:
- "Why is Cassandra called a peer-to-peer system?"
  - Answer: No master node, all nodes equal, gossip for coordination (contrast with HDFS NameNode)
- "Explain Cassandra's write path"
  - Answer: CommitLog (sequential disk) → MemTable (memory) → SSTable (async flush)
- "How does Cassandra handle node failures?"
  - Answer: Hinted handoff (temporary), read repair (on reads), anti-entropy repair (scheduled)
- "Why are writes faster than reads in Cassandra?"
  - Answer: Writes append to a log; reads must check multiple SSTables + bloom filters
- "What's the downside of eventual consistency?"
  - Answer: Stale reads are possible; the application must tolerate them (use QUORUM for important data)
- "Design a messaging system like WhatsApp"
  - Answer: Partition by user_id, cluster by timestamp, denormalize for sender/receiver views
Recommended Reading
Primary Source
The Cassandra Paper (2010)
- Full PDF
- Read: Sections 1-3 (Introduction, Data Model, System Architecture)
- Skim: Section 4 (Implementation)
- Focus: Section 5 (Experiences, real production numbers from Facebook)
Foundational Papers (Context)
- Dynamo: Amazon's Highly Available Key-value Store (2007)
  - Full PDF
  - Understand: Consistent hashing, vector clocks, gossip protocol
- Bigtable: A Distributed Storage System for Structured Data (2006)
  - Full PDF
  - Understand: Column-family model, LSM trees, tablets
Cassandra-Specific Resources
- “Cassandra: The Definitive Guide” (O’Reilly) - Chapter 2 (Cassandra’s approach)
- DataStax Academy - Free course: “DS101: Introduction to Apache Cassandra”
Practical Exercise
Before moving to implementation:
Read the Paper
Download the Cassandra paper and read sections 1-3 (30-45 min). As you read, note:
- Which design choices surprise you?
- How do Facebook’s requirements drive architecture?
Thought Experiments
Consider these scenarios:
- Two datacenters lose connectivity (network partition) - what happens to reads/writes?
- A node fails during a QUORUM write (RF=3) - does the write succeed?
- You need to query users by email (not primary key) - what are your options?
What’s Next?
Now that you understand the why behind Cassandra's design, let's get hands-on with data modeling and CQL.
Module 2: Data Modeling & CQL Mastery
Learn query-driven data modeling, primary keys, and write your first CQL queries
Pro Tip: The concepts from this module (ring topology, consistency levels, write/read paths) will be referenced constantly. Bookmark this page!
Fun Facts
Why the Name 'Cassandra'?
Named after the Greek mythological prophet Cassandra, who could see the future but was cursed to never be believed.
The metaphor: Cassandra (the database) can "predict" where data lives (consistent hashing) and handle future failures (replication), but requires trust in its eventual consistency model.
Also: The authors wanted a name suggesting "seeing the future" (forecasting scale).
Facebook's Scale (2008)
When Cassandra was built:
- 150+ billion messages stored
- 15 TB of data
- 1 billion writes/day
- 50 GB new data daily
Cassandra vs. Cassandra
There are two main flavors: Apache Cassandra vs. DataStax Enterprise (DSE)
- Apache Cassandra: Open source, community-driven
- DataStax Enterprise: Commercial version with analytics (Spark integration), search, graph features
Ready to dive into data modeling? Let’s learn how to think in Cassandra’s query-driven paradigm.
Next: Data Modeling & CQL
Master the art of designing schemas for Cassandra’s distributed architecture