The Cassandra Paper & Core Architecture
Module Duration: 3-4 hours
Learning Style: Theory + Historical Context + Architectural Foundations
Outcome: Deep understanding of WHY Cassandra was designed the way it is
Why Start with the Paper?
Most Cassandra courses jump straight into CQL syntax and cluster setup. We're taking a different approach. Understanding the original research paper gives you:
- Conceptual clarity: Know WHY design decisions were made
- Troubleshooting intuition: Predict behavior based on first principles
- Interview advantage: Explain trade-offs confidently
- Architectural thinking: Apply distributed systems principles broadly
We’ll break down the paper in plain language - no PhD required. Focus on understanding the concepts and the “why” behind them.
The Facebook Problem (2007-2008)
The Challenge: Inbox Search at Scale
In 2007-2008, Facebook was growing explosively:
- Roughly 100 million users (and growing fast)
- Billions of messages sent daily
- Users wanted to search their entire message history
- Searches needed to be fast (< 100 ms)
The new system had to:
- Ingest millions of writes per second (new messages)
- Search across billions of messages instantly
- Scale horizontally as Facebook grows
- Never go down (high availability)
- Work across multiple datacenters (Facebook was global)
Why Existing Solutions Failed
MySQL (Traditional RDBMS)
The Problem:
- Sharding messages across MySQL servers was complex
- JOIN-heavy queries too slow at scale
- Write bottlenecks (single-master replication)
- Hard to scale horizontally
Amazon Dynamo
The Good:
- Excellent availability and partition tolerance (AP in CAP theorem)
- Linear scalability
- Multi-datacenter support
The Bad:
- Key-value only - no structured queries
- No column-based data model
- Limited query flexibility
Google Bigtable
The Good:
- Column-family data model (flexible schema)
- Fast writes and structured data
- Proven at massive scale
The Bad:
- Single-master architecture (GFS + Chubby)
- Not designed for multi-datacenter
- Centralized metadata server = single point of failure
The Insight
Facebook engineers Avinash Lakshman (formerly of Amazon's Dynamo team) and Prashant Malik realized: why not combine the best of both?
- Take Dynamo's distribution model (no single point of failure, multi-DC replication)
- Take Bigtable’s data model (column families, structured queries)
- Add Facebook-specific optimizations (tunable consistency, efficient range queries)
The Cassandra Paper
Full Citation: Avinash Lakshman and Prashant Malik. "Cassandra - A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review 44(2), 2010.
- First Presented: Facebook Engineering blog, 2008
- Open Sourced: 2008
- Apache Incubator: 2009
- Apache Top-Level Project: 2010
What Makes This Paper Special
Unlike many research papers, the Cassandra paper:
- Describes a production system serving real users (not a prototype)
- Focuses on engineering trade-offs (not just novel algorithms)
- Explains why design choices were made for Facebook’s workload
- Includes real performance numbers from production
Core Design Principles
Principle 1: No Single Point of Failure
Traditional master-slave architectures route every write through a single master - a bottleneck and a single point of failure. Cassandra is peer-to-peer instead: every node is equal, which gives you:
- No bottlenecks
- No single point of failure
- Linear scalability (add nodes = add capacity)
Principle 2: Tunable Consistency
Most databases force you to choose:
- Strong Consistency (CP in CAP theorem) → Slower, but always correct
- Eventual Consistency (AP in CAP theorem) → Faster, but may be stale
Cassandra lets you tune consistency per query. A single application might use (see the driver sketch after this list):
- User profiles: Strong consistency (QUORUM) - must be accurate
- Video thumbnails: Weak consistency (ONE) - stale data acceptable
- Billing data: Strongest consistency (ALL) - money must be exact
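A minimal sketch of per-query tuning using the DataStax Python driver (the contact point, keyspace, and table names here are hypothetical, and a locally running node is assumed):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumes a node reachable on localhost and a keyspace named "app".
session = Cluster(["127.0.0.1"]).connect("app")

# Critical data: a majority of replicas must agree before we trust the read.
profile_query = SimpleStatement(
    "SELECT * FROM user_profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

# Cosmetic data: one replica is enough; lowest latency, stale data acceptable.
thumbnail_query = SimpleStatement(
    "SELECT url FROM thumbnails WHERE video_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

session.execute(profile_query, ["user-42"])
session.execute(thumbnail_query, ["video-7"])
```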
Principle 3: Always Writable
Traditional Databases: If you can't reach a majority of replicas → write fails.
Cassandra: If even one node is reachable → write succeeds.
How? Hinted Handoff: when a replica is down, the coordinator stores a "hint" locally and replays the write once the replica recovers.
- ✅ High availability (writes never fail due to network issues)
- ⚠️ Temporary inconsistency (hints must be replayed)
Principle 4: Scale Linearly
Goal: 2x nodes = 2x throughput.
How Cassandra Achieves This:
Consistent Hashing
Data is distributed evenly across nodes using a hash ring. Adding a node? Only its neighbors are affected - no full reshuffle (see the sketch below).
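To see why only neighbors are affected, here is a toy hash ring in plain Python (node names and key counts are arbitrary, and real Cassandra smooths ownership further with virtual nodes):

```python
import bisect
import hashlib

def token(name: str) -> int:
    # Map a key or node name onto the ring's integer token space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Sorted node tokens; each node owns the arc ending at its token.
        self.tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        # Walk clockwise from the key's position to the next node token.
        points = [t for t, _ in self.tokens]
        i = bisect.bisect_right(points, token(key)) % len(self.tokens)
        return self.tokens[i][1]

keys = [f"user:{i}" for i in range(10_000)]
old_ring = Ring(["n1", "n2", "n3"])
new_ring = Ring(["n1", "n2", "n3", "n4"])  # add one node
moved = sum(old_ring.owner(k) != new_ring.owner(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 25%, not 100%
```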
No Centralized Metadata
Unlike HDFS (centralized NameNode), every Cassandra node knows the full cluster topology via the gossip protocol. No metadata bottleneck = scales to thousands of nodes. Real-world scale:
- Apple: 75,000+ nodes
- Netflix: Hundreds of nodes across 3 AWS regions
- Discord: Trillions of messages
Cassandra Architecture Deep Dive
The Ring Topology
Cassandra organizes nodes in a ring using consistent hashing:
- Adding nodes doesn't cause full data reshuffling
- Removal/failure similarly localized
- Scales to thousands of nodes efficiently
The Write Path (Why Writes Are Fast)
Cassandra is optimized for writes. Here's why:
Write Path Flow:
Sequential Disk I/O
CommitLog is append-only. Sequential writes are 100x faster than random writes on spinning disks.
In-Memory Writes
MemTable is in RAM. Write completes after memory write (async flush to disk later).
No Read-Before-Write
Unlike B-trees, Cassandra doesn’t read existing data before writing. Just append and sort later.
Batched Disk Flushes
MemTable accumulates writes in memory, then flushes in one large sequential write (efficient).
Performance:
- Write latency: 1-2 ms (p99) for local writes
- Write throughput: 100,000+ writes/sec per node (on modern hardware)
Trade-offs (see the toy model below):
- Fast writes come at the cost of read complexity (data spread across multiple SSTables)
- Compaction needed to merge SSTables (covered in Module 3)
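To make the flow concrete, here is a deliberately tiny Python model of the path above - not Cassandra's code, just the LSM shape it describes (append-only log, in-memory table, sorted immutable flushes):

```python
import json

class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.commitlog = open("commitlog.jsonl", "a")  # sequential, append-only
        self.memtable = {}    # in RAM; last write wins
        self.sstables = []    # immutable sorted runs, newest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # 1. Durability: append to the commit log (sequential disk I/O).
        self.commitlog.write(json.dumps([key, value]) + "\n")
        # 2. Speed: update the memtable; no read-before-write anywhere.
        self.memtable[key] = value
        # 3. Batched flush: one large sequential write once the memtable fills.
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Reads are the complex side: memtable first, then SSTables newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.sstables:
            if key in sst:
                return sst[key]
        return None
```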
The Read Path (More Complex)
Reads are more complex than writes because data may be scattered across the MemTable and multiple SSTables.
Read Path Flow:
Bloom Filters
Problem: Checking every SSTable is expensive.
Solution: Bloom filter (probabilistic data structure; minimal sketch after this list):
- In-memory bitmap for each SSTable
- Can say “definitely NOT in this SSTable” (avoid disk read)
- Or “maybe in this SSTable” (check disk)
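A minimal Bloom filter in Python (bit count and hash count are arbitrary; Cassandra's implementation differs in detail, but the one-sided guarantee - no false negatives - is the same):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a big integer used as a bitmap

    def _positions(self, key: str):
        # Derive k independent bit positions from the key.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False -> key is DEFINITELY absent: skip this SSTable entirely.
        # True  -> key MAY be present: pay the disk read to check.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True (always: no false negatives)
print(bf.might_contain("user:99"))   # almost certainly False
```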
Partition Index
Problem: SSTables are large (GBs); scanning them is slow.
Solution: Partition index (in-memory):
- Maps partition keys → byte offset in SSTable
- Jump directly to data location
Row Cache (Optional)
Problem: Hot partitions are read repeatedly.
Solution: Cache frequently accessed rows in memory:
- Configurable per table
- Bypass MemTable + SSTable checks
Why reads are slower than writes:
- Must check multiple SSTables (writes just append)
- May require network requests to replicas (for consistency)
- Disk reads are random I/O (slower than sequential writes)
How Cassandra mitigates this:
- Compaction merges SSTables (fewer files to check)
- Bloom filters skip empty SSTables
- Partition caches help for hot data
Data Model: Column Families
Cassandra uses a wide-column data model inspired by Bigtable.
Traditional RDBMS vs Cassandra
A relational database (MySQL) normalizes data into many tables and JOINs them at query time. Cassandra denormalizes into wide rows addressed by a partition key - no JOINs, and each query targets one partition.
The Power of Clustering Columns
Cassandra's killer feature: clustering columns for efficient range queries (see the sketch after this list).
Why This Matters:
- Time-series data: Messages, logs, events naturally ordered by time
- Single partition read: No scatter-gather across nodes
- Sequential disk I/O: Much faster than random reads
- ⚠️ Must always query by partition key (can’t query all messages across all users efficiently)
- This is the price of scalability: Trade flexibility for performance
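A sketch of what this looks like with the Python driver; the messages schema is hypothetical, the point is the PRIMARY KEY ((partition), clustering...) shape:

```python
import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # hypothetical keyspace

# Partition by user; cluster by time so one user's messages live together
# on disk, pre-sorted newest-first, on a single replica set.
session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        user_id    text,
        sent_at    timestamp,
        message_id uuid,
        body       text,
        PRIMARY KEY ((user_id), sent_at, message_id)
    ) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC)
""")

# A time-range query within one partition: a single sequential read on
# one replica set - no scatter-gather across the cluster.
rows = session.execute(
    "SELECT sent_at, body FROM messages WHERE user_id = %s AND sent_at > %s",
    ["alice", datetime.datetime(2024, 1, 1)],
)
```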
Replication & Consistency
Replication Strategies
SimpleStrategy (single datacenter):
- Replicas placed on consecutive nodes in the ring
- Suitable for development/single-DC deployments
NetworkTopologyStrategy (production; example keyspace below):
- Datacenter-aware: Replicas spread across racks and DCs
- Multi-region: Survive entire datacenter failures
- Performance: Serve reads locally in each region
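As a sketch, a production-style multi-DC keyspace could be declared like this (the datacenter names must match what your snitch reports; 'us_east' and 'eu_west' are placeholders):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Three replicas in each datacenter: survives a full DC outage and lets
# each region serve reads locally (e.g., with LOCAL_QUORUM).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    }
""")
```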
Consistency Levels (Per-Query Tuning)
Every read and write names how many replica acknowledgments it requires:
- ANY: Accepted by any node, even as just a hint (writes only) - maximum availability
- ONE: One replica responds - fastest, may accept/return stale data
- LOCAL_QUORUM: Majority of replicas in the local datacenter - the multi-DC workhorse
- QUORUM: Majority of all replicas - strong consistency when reads and writes both use it
- ALL: Every replica responds - strongest consistency, lowest availability
Gossip Protocol (Failure Detection)
How does Cassandra detect node failures without a master? With a gossip protocol (inspired by Dynamo): every second, each node exchanges state with a few random peers, so membership and health information spreads epidemically (simulated below).
Why Gossip?
- Decentralized: No single coordinator
- Scalable: O(log N) messages to reach all nodes
- Fault-tolerant: Works even if many nodes fail
- Eventually consistent: Cluster state converges
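A back-of-the-envelope simulation of why epidemic spread converges in O(log N) rounds (cluster size and fanout are arbitrary):

```python
import random

def rounds_to_converge(n_nodes=1000, fanout=3, seed=1):
    # One node learns something (e.g., "node 7 is down"); each round,
    # every node informed at the start of the round tells `fanout` random peers.
    random.seed(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        for _ in range(len(informed)):
            informed.update(random.randrange(n_nodes) for _ in range(fanout))
        rounds += 1
    return rounds

print(rounds_to_converge())  # single digits for 1,000 nodes
```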
CAP Theorem and Cassandra’s Choice
Quick CAP Recap
CAP Theorem: Distributed systems can provide at most 2 of 3:
- Consistency: All nodes see the same data
- Availability: System responds to requests
- Partition Tolerance: Works despite network failures
Cassandra’s Flexibility
Cassandra is AP by default, but can be pushed toward CP per query:
- AP configuration (high availability): write and read at ONE - always responsive, possibly stale
- CP configuration (strong consistency): write and read at QUORUM - with RF=3 that is W + R = 2 + 2 > 3 replicas, so every read overlaps the latest acknowledged write, at the price of failing when a majority is unreachable
Key Insights from the Paper
1. Always Writable (Hinted Handoff)
Problem: Node B is down, but a client writes to a partition owned by B.
Solution: The coordinator stores a hint locally and replays the write to B when it recovers.
- ✅ Writes never fail due to node failures
- ⚠️ Temporary inconsistency until hints replayed
- ⚠️ Hints can build up (configure max_hint_window)
2. Read Repair (Eventual Consistency)
Problem: Replicas may hold different data (due to failures, network delays).
Solution: Read repair - when a read touches multiple replicas and their data disagrees, the newest value wins and stale replicas are updated:
- Foreground: During reads at QUORUM/ALL (before returning to client)
- Background: Probabilistic read repair (10% of ONE reads, configurable)
3. Anti-Entropy Repair (Scheduled Maintenance)
Problem: Read repair only fixes accessed data. What about cold data?
Solution: nodetool repair (Merkle tree comparison; sketch after this list). Run it at least once every gc_grace_seconds (default 10 days) to prevent:
- Deleted data from resurrecting (tombstones)
- Replicas diverging permanently
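A compressed sketch of the Merkle-tree idea in Python (row contents are made up): replicas compare a handful of hashes instead of all their data, and only stream ranges whose hashes differ:

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(rows):
    # Hash rows, then repeatedly hash pairs until one root remains.
    level = [sha(r) for r in rows]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1:v1", b"row2:v1", b"row3:v1", b"row4:v1"]
replica_b = [b"row1:v1", b"row2:v2", b"row3:v1", b"row4:v1"]  # row2 diverged

# Equal roots => the whole range is in sync, nothing to stream.
# Unequal roots => recurse into children and stream only the rows that differ.
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> drill down
```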
Compaction (LSM Tree Maintenance)
Cassandra uses Log-Structured Merge (LSM) trees, which require compaction.
Why Compaction? Every MemTable flush creates a new immutable SSTable; without merging, reads would have to check ever more files, and deleted or overwritten data would never be reclaimed. Strategies (toy merge sketch after this list):
- SizeTieredCompactionStrategy (STCS): Default, good for writes
- LeveledCompactionStrategy (LCS): Better for reads, more I/O
- TimeWindowCompactionStrategy (TWCS): Optimized for time-series
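In LSM terms, compaction is a merge of sorted runs in which newer values shadow older ones and expired tombstones drop out. A toy version (tombstones modeled as None; real compaction also honors gc_grace_seconds):

```python
def compact(sstables_newest_first):
    merged = {}
    # Apply oldest first so newer SSTables overwrite older values.
    for sst in reversed(sstables_newest_first):
        merged.update(sst)
    # Purge tombstones (simplified: real Cassandra keeps them for
    # gc_grace_seconds so lagging replicas still learn about the delete).
    merged = {k: v for k, v in merged.items() if v is not None}
    return dict(sorted(merged.items()))

old = {"a": 1, "b": 2, "c": 3}
mid = {"b": 20, "d": 4}
new = {"a": None}  # tombstone: "a" was deleted
print(compact([new, mid, old]))  # {'b': 20, 'c': 3, 'd': 4} - one file, no garbage
```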
Real-World Design Patterns
Pattern 1: Time-Series Data (IoT Sensors)
- Millions of writes/sec (new sensor readings)
- Time-based queries (last N hours)
- Data naturally partitioned by sensor
Pattern 2: User Activity Streams (Social Media)
- Personalized feeds per user (partition key = user)
- Chronological order (clustering by timestamp)
- Scales to billions of users
Pattern 3: Distributed Counters
Counter columns give you (see the sketch after this list):
- Distributed counting without coordination
- High-throughput increments
- Eventually consistent totals
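A sketch of this pattern with the Python driver (keyspace and table names are hypothetical):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # hypothetical keyspace

# A counter table may contain only counter columns beside the primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        page_id text PRIMARY KEY,
        views   counter
    )
""")

# Increments are commutative, so replicas can apply them in any order
# without coordination; the total converges eventually.
session.execute(
    "UPDATE page_views SET views = views + 1 WHERE page_id = %s",
    ["/home"],
)
```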
When NOT to Use Cassandra
Be honest about trade-offs. ❌ Avoid Cassandra for:
ACID Transactions Across Rows
Why: Cassandra only supports single-partition lightweight transactions (slow, Paxos-based).
What it can't do: Multi-partition transactions (e.g., transfer money between two accounts).
Alternative: Use PostgreSQL, CockroachDB, or Spanner for strong ACID needs.
Complex JOINs
Why: No JOIN support. Cassandra requires denormalization (store data multiple ways).
What it can't do: SELECT * FROM users JOIN orders ON users.id = orders.user_id
Alternative: Use relational databases or pre-compute joins (materialized views).
Ad-hoc Queries
Why: Must query by partition key (primary key). No full table scans.
What it can't do: SELECT * FROM users WHERE email = 'alice@example.com' (unless email is the partition key)
Alternative: Use Elasticsearch for search, or design tables for known query patterns.
Small Datasets
Why: Cassandra's operational complexity is only justified at scale.
What it can't do: Efficiently manage < 100 GB datasets (the overhead isn't worth it).
Alternative: Use PostgreSQL or MySQL for small-to-medium data.
Key Takeaways
Before moving to hands-on modules, internalize these principles:
Peer-to-Peer = No SPOF
Every node is equal. No master means no bottleneck, no single point of failure. This is Cassandra’s superpower.
Tunable Consistency
Choose per query: QUORUM for critical data, ONE for speed. Few databases offer this per-query flexibility.
Write-Optimized (LSM)
Sequential writes to CommitLog + MemTable make writes blazingly fast. Compaction is the price.
Query-Driven Modeling
Design tables for your queries, not normalization. Embrace denormalization for performance.
Interview Preparation
Understanding the paper gives you a huge advantage.
Common Interview Questions
Architecture Questions:
- "Why is Cassandra called a peer-to-peer system?"
  - Answer: No master node, all nodes equal, gossip for coordination (contrast with HDFS NameNode)
- "Explain Cassandra's write path"
  - Answer: CommitLog (sequential disk) → MemTable (memory) → SSTable (async flush)
- "How does Cassandra handle node failures?"
  - Answer: Hinted handoff (temporary), read repair (on reads), anti-entropy repair (scheduled)
- "Why are writes faster than reads in Cassandra?"
  - Answer: Writes append to a log; reads must check multiple SSTables + bloom filters
- "What's the downside of eventual consistency?"
  - Answer: Stale reads are possible; the application must tolerate them (use QUORUM for important data)
- "Design a messaging system like WhatsApp"
  - Answer: Partition by user_id, cluster by timestamp, denormalize for sender/receiver views
Recommended Reading
Primary Source
The Cassandra Paper (2010)
- Full PDF
- Read: Sections 1-3 (Introduction, Data Model, System Architecture)
- Skim: Section 4 (Implementation)
- Focus: Section 5 (Experiences, real production numbers from Facebook)
Foundational Papers (Context)
- Dynamo: Amazon's Highly Available Key-value Store (2007)
  - Full PDF
  - Understand: Consistent hashing, vector clocks, gossip protocol
- Bigtable: A Distributed Storage System for Structured Data (2006)
  - Full PDF
  - Understand: Column-family model, LSM trees, tablets
Cassandra-Specific Resources
- “Cassandra: The Definitive Guide” (O’Reilly) - Chapter 2 (Cassandra’s approach)
- DataStax Academy - Free course: “DS101: Introduction to Apache Cassandra”
Practical Exercise
Before moving to implementation:
Read the Paper
Download the Cassandra paper and read sections 1-3 (30-45 min). As you read, note:
- Which design choices surprise you?
- How do Facebook’s requirements drive architecture?
Thought Experiments
Consider these scenarios:
- Two datacenters lose connectivity (network partition) - what happens to reads/writes?
- A node fails during a QUORUM write (RF=3) - does the write succeed?
- You need to query users by email (not primary key) - what are your options?
What’s Next?
Now that you understand the why behind Cassandra's design, let's get hands-on with data modeling and CQL.
Module 2: Data Modeling & CQL Mastery
Learn query-driven data modeling, primary keys, and write your first CQL queries
Pro Tip: The concepts from this module (ring topology, consistency levels, write/read paths) will be referenced constantly. Bookmark this page!
Fun Facts
Why the Name 'Cassandra'?
Named after the Greek mythological prophet Cassandra, who could see the future but was cursed to never be believed.
The metaphor: Cassandra (the database) can "predict" where data lives (consistent hashing) and handle future failures (replication), but requires trust in its eventual consistency model.
Also: The authors wanted a name suggesting "seeing the future" (forecasting scale).
Facebook's Scale (2008)
When Cassandra was built:
- 150+ billion messages stored
- 15 TB of data
- 1 billion writes/day
- 50 GB new data daily
Cassandra vs. Cassandra
There are two main flavors: Apache Cassandra vs. DataStax Enterprise (DSE)
- Apache Cassandra: Open source, community-driven
- DataStax Enterprise: Commercial version with analytics (Spark integration), search, graph features
Ready to dive into data modeling? Let’s learn how to think in Cassandra’s query-driven paradigm.
Next: Data Modeling & CQL
Master the art of designing schemas for Cassandra’s distributed architecture