The Core Problem: How do you distribute data evenly across N nodes, and minimize data movement when nodes are added or removed?

Naive Approach (Hash Modulo):
# Bad: Rehashes most data on cluster changes
node = hash(key) % num_nodes

# Example with 4 nodes:
hash("user123") = 42
42 % 4 = 2 → Node 2

# Add 5th node:
42 % 5 = 2 → Still Node 2 (lucky!)

# But:
hash("user456") = 88
88 % 4 = 0 → Node 0
88 % 5 = 3 → Node 3 (MOVED!)

# On average: adding 1 node rehashes N/(N+1) of the data
# 4 → 5 nodes: 80% of data moves! Unacceptable!
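To make the 80% figure concrete, here is a minimal, self-contained Python sketch (not Cassandra code; the key names and the MD5-based hash are illustrative assumptions) that counts how many keys change nodes when a 4-node cluster grows to 5 under modulo placement:

# Sketch: fraction of keys that move when growing 4 -> 5 nodes under
# hash-modulo placement. hashlib keeps results stable across runs
# (Python's built-in hash() is salted per process).
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_nodes

keys = [f"user{i}" for i in range(100_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.1%} of keys moved")  # close to N/(N+1) = 80%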
Consistent Hashing Solution:
1. Create a "hash space" (ring): 0 to 2^64 - 1

2. Each node gets a position on the ring:
   Node A: hash("A") = 0
   Node B: hash("B") = 4611686018427387904 (2^62)
   Node C: hash("C") = 9223372036854775808 (2^63)
   Node D: hash("D") = 13835058055282163712 (3 * 2^62)

3. Visualize as a ring:

            0 (Node A)
            ↓
      ┌──────────────┐
      │              │
  D ←─┤              ├─→ B
      │              │
      └──────────────┘
            ↑
            C (2^63)

4. Data placement:
   hash("user123") = 42
   → Falls between A and B → Goes to B

   hash("user456") = 2^63 + 100
   → Falls between C and D → Goes to D

5. Adding Node E at position 2^62 + 1000:
   - Only keys between B and E move
   - Rest of data unchanged!
   - On average: only 1/(N+1) of data moves
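The placement rule above ("walk clockwise to the next node") is easy to express in code. Below is a minimal consistent-hash ring sketch in Python; the MD5-based 64-bit hash and node names are assumptions for illustration, not Cassandra's Murmur3 partitioner:

# Minimal consistent-hash ring: sorted node tokens, lookup = first token
# clockwise from the key's position (wrapping around at the end).
import bisect
import hashlib

def h64(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**64)

class Ring:
    def __init__(self, nodes):
        pairs = sorted((h64(n), n) for n in nodes)
        self.tokens = [t for t, _ in pairs]
        self.owners = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        i = bisect.bisect_right(self.tokens, h64(key)) % len(self.tokens)
        return self.owners[i]

ring = Ring(["A", "B", "C", "D"])
print(ring.node_for("user123"))  # whichever node's token follows the key's hash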
Real Numbers:
4 nodes → 5 nodes:
- Hash modulo: 80% data movement
- Consistent hashing: 20% data movement
- 4x better!

100 nodes → 101 nodes:
- Hash modulo: 99% data movement
- Consistent hashing: ~1% data movement
- 99x better!
Classic Cassandra (pre-1.2):

Each node gets 1 position on the ring:
Node A: token = 0
Node B: token = 2^62
Node C: token = 2^63
Node D: token = 3 * 2^62

Issues:
1. Uneven data distribution if nodes aren't perfectly spaced
2. Adding a node requires manual token assignment (error-prone)
3. Replacing a failed node: must use the exact same token
4. Hot spots if data access is uneven
Vnodes Solution (Default Since Cassandra 1.2):
Each physical node owns MULTIPLE positions on the ring
Default: 256 vnodes per node (num_tokens = 256)

Node A gets 256 random tokens:
  vnode_1:   hash("A_1")   = 123456
  vnode_2:   hash("A_2")   = 9876543
  ...
  vnode_256: hash("A_256") = 5555555

Node B gets 256 random tokens:
  vnode_1: hash("B_1") = 234567
  ...

Ring looks like:
[A_1][B_5][C_2][D_1][A_2][B_6][C_3][D_2][A_3]...

Benefits:
1. Automatic even distribution
2. Adding a node: shares load with ALL existing nodes
3. Removing a node: load redistributed to ALL nodes
4. No manual token assignment
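Here is a sketch of the same ring with vnodes (illustrative; real Cassandra assigns random tokens rather than hashing labels like "A_1", so treat the labels and hash as assumptions): each physical node hashes many labels onto the ring, and the resulting positions interleave.

# Vnode sketch: 256 positions per physical node; positions from different
# nodes end up interleaved around the ring.
import hashlib

def h64(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**64)

def vnode_ring(nodes, num_tokens=256):
    return sorted((h64(f"{node}_{i}"), node) for node in nodes for i in range(num_tokens))

ring = vnode_ring(["A", "B", "C", "D"])
print([owner for _, owner in ring[:12]])  # owners interleave, e.g. C, A, D, A, B, ...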
Vnode Distribution Example:
# 4 nodes, 256 vnodes each = 1024 vnodes total
# Each node should own ~25% of the ring

Without vnodes (manual tokens):
Node A: 0 to 2^62 (25%... if perfectly calculated)

With vnodes (automatic):
Node A: vnode_1 (0.1%) + vnode_2 (0.1%) + ... + vnode_256 (0.1%)
      = 25.3% (statistically approaches 25%)

Variance:
- Manual: high (depends on human token choice)
- Vnodes: low (law of large numbers)
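The variance claim can be checked with a quick simulation (illustrative; the hashing and key names are assumptions): compare the fraction of keys each node owns with 1 token per node versus 256.

# Ownership spread with 1 token per node vs. 256 vnodes per node.
import bisect
import hashlib
from collections import Counter

def h64(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**64)

def ownership(nodes, num_tokens, num_keys=100_000):
    pairs = sorted((h64(f"{n}_{i}"), n) for n in nodes for i in range(num_tokens))
    tokens = [t for t, _ in pairs]
    owners = [n for _, n in pairs]
    counts = Counter(
        owners[bisect.bisect_right(tokens, h64(f"user{k}")) % len(tokens)]
        for k in range(num_keys)
    )
    return {n: round(counts[n] / num_keys, 3) for n in nodes}

print(ownership(["A", "B", "C", "D"], num_tokens=1))    # often far from 0.25 each
print(ownership(["A", "B", "C", "D"], num_tokens=256))  # all close to 0.25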
Streaming on Node Addition:
Adding Node E with 256 vnodes:

Before:
Ring has 1024 vnodes (256 × 4 nodes)

After:
Ring has 1280 vnodes (256 × 5 nodes)
Each node should own 1280 / 5 = 256 vnodes

Node E receives ~64 vnodes' worth of data from each existing node:
- From A: vnodes that now belong to E's token ranges
- From B: vnodes that now belong to E's token ranges
- From C: vnodes that now belong to E's token ranges
- From D: vnodes that now belong to E's token ranges

Total data streamed: ~20% of the cluster (evenly from all nodes)

Time: depends on data size
- 100 GB cluster: ~10 minutes (with a 10 Gbps network)
- 10 TB cluster: ~3 hours
Configuring Vnodes:
# cassandra.yaml

# Default (recommended for most clusters)
num_tokens: 256

# Fewer vnodes (better for very large clusters, 100+ nodes)
num_tokens: 128

# Single token (legacy, not recommended)
num_tokens: 1

# Auto-calculate based on cluster size (Cassandra 4.0+)
allocate_tokens_for_local_replication_factor: 3
Replication Factor (RF): Number of copies of each data partition.

Common Configurations:
RF = 1 (Development only, never production)
- Single copy
- No fault tolerance
- Fast writes
- Use case: throwaway dev environment

RF = 2 (Rare, not recommended)
- Two copies
- Can survive 1 node failure
- But: can't achieve quorum with 1 node down
- Use case: non-critical data, tight budget

RF = 3 (Production standard)
- Three copies
- Can survive 2 node failures
- QUORUM = 2 nodes (majority)
- Use case: most production deployments

RF = 5+ (Critical data)
- Five+ copies
- Can survive 4+ node failures
- Higher storage cost
- Use case: financial systems, regulatory requirements
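The quorum arithmetic behind these recommendations is simple: a quorum is a majority of replicas, floor(RF/2) + 1. A tiny illustrative sketch:

# Quorum = majority of replicas.
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (1, 2, 3, 5):
    # Failures tolerated at QUORUM = rf - quorum(rf); note RF=2 tolerates none.
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {rf - quorum(rf)} node(s) down")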
How Replication Works:
Example: RF = 3, 6-node cluster

Write to key "user123":
hash("user123") = 42

Step 1: Find the primary node
Ring position 42 falls on Node B
→ Node B is the PRIMARY replica

Step 2: Find replica nodes (walk clockwise on the ring)
Next vnode after 42: Node D
Next vnode after D: Node F
→ Replicas: B (primary), D, F

Step 3: Coordinator sends the write to B, D, F
All three nodes write to:
1. CommitLog (disk, sequential)
2. MemTable (memory)

Step 4: Wait for consistency-level ACKs
- CL=ONE:    wait for 1 ACK (B or D or F)
- CL=QUORUM: wait for 2 ACKs (any 2 of B, D, F)
- CL=ALL:    wait for all 3 ACKs
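Here is a sketch of the clockwise replica walk under a SimpleStrategy-like rule (illustrative; the hashing and node names are assumptions, and vnode-level details are omitted): start at the first position after the key's token and keep walking, skipping nodes already chosen, until RF distinct nodes are collected.

# Pick RF distinct replicas by walking the ring clockwise from the key.
import bisect
import hashlib

def h64(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**64)

def replicas(nodes, key, rf=3):
    pairs = sorted((h64(n), n) for n in nodes)
    tokens = [t for t, _ in pairs]
    owners = [n for _, n in pairs]
    start = bisect.bisect_right(tokens, h64(key))
    chosen = []
    for step in range(len(owners)):
        node = owners[(start + step) % len(owners)]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen

print(replicas(["A", "B", "C", "D", "E", "F"], "user123"))  # 3 distinct replica nodes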
1. SimpleStrategy (Development):

CREATE KEYSPACE my_keyspace WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

How it works:
- Places replicas on consecutive nodes in the ring
- No datacenter awareness
- Simple, but not rack-aware

Ring with RF=3:
[A][B][C][D][E][F]
 ↑  ↑  ↑
 Primary, Replica1, Replica2 for partition P1

Good for: development, single-DC deployments
Bad for: production (no rack awareness)
2. NetworkTopologyStrategy (Production):
CREATE KEYSPACE my_keyspace WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3,  -- 3 replicas in datacenter 1
  'dc2': 2   -- 2 replicas in datacenter 2
};

How it works:
- Datacenter-aware
- Rack-aware within each DC
- Places replicas in different racks for fault tolerance

Example topology:
DC1:
  Rack A: [Node1, Node2]
  Rack B: [Node3, Node4]
  Rack C: [Node5, Node6]
DC2:
  Rack A: [Node7, Node8]
  Rack B: [Node9, Node10]

Partition P1 with RF=3 in DC1:
- Primary:  Node1 (Rack A)
- Replica1: Node3 (Rack B) ← Different rack!
- Replica2: Node5 (Rack C) ← Different rack!

Benefits:
- Survives an entire rack failure
- Survives an entire datacenter failure (with multi-DC RF)
- Localized reads per DC
Defining Datacenter & Rack:
# cassandra-rackdc.properties

# AWS deployment
dc=us-east-1
rack=us-east-1a

# Another node in the same DC, different AZ
dc=us-east-1
rack=us-east-1b

# Node in a different region
dc=eu-west-1
rack=eu-west-1a
What is a Snitch?
Cassandra component that determines network topology (DC and rack of each node).

Common Snitches:
SimpleSnitch (Default, single DC):
endpoint_snitch: SimpleSnitch
# All nodes in same DC and rack
# No network topology awareness
# Use only for development
GossipingPropertyFileSnitch (Most common):
endpoint_snitch: GossipingPropertyFileSnitch
# Reads cassandra-rackdc.properties
# Nodes gossip their DC/rack info
# Best for most deployments
Ec2Snitch (AWS):
endpoint_snitch: Ec2Snitch
# Auto-detects AWS region as DC
# Auto-detects availability zone as rack
# No configuration needed
# Use for AWS deployments
GoogleCloudSnitch (GCP):
endpoint_snitch: GoogleCloudSnitch
# Auto-detects GCP region as DC
# Auto-detects GCP zone as rack
Why Snitches Matter:
Without a proper snitch:
- All nodes treated as same rack
- Replicas might land on the same rack
- Rack failure = data loss

With a proper snitch:
- Replicas spread across racks
- Fault-tolerant to rack failures
- Optimized network traffic (local reads)
CommitLog (Write-Ahead Log):

Purpose: Ensure no write is lost, even on a node crash.

How It Works:
Write flow:
1. Client sends write
2. Coordinator receives it
3. BEFORE acknowledging:
   a) Append to CommitLog (sequential disk write)
   b) Write to MemTable (memory)
4. Acknowledge to client
5. Later: MemTable flushed to SSTable (async)

Crash scenario:
Node crashes after step 4, before the MemTable flush
→ On restart: replay the CommitLog
→ Reconstruct the MemTable
→ No data loss!
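A minimal write-ahead-log sketch of that flow (illustrative; the JSON-lines file and the LOG_PATH name are assumptions, not Cassandra's CommitLog format): append and fsync before touching the in-memory table, so a crash before a flush can be repaired by replaying the log.

# Append-only log + in-memory table; replay() rebuilds state after a crash.
import json
import os

LOG_PATH = "commitlog.jsonl"   # hypothetical file name
memtable = {}

def write(key, value):
    record = json.dumps({"key": key, "value": value})
    with open(LOG_PATH, "a") as log:
        log.write(record + "\n")
        log.flush()
        os.fsync(log.fileno())   # durable on disk before acknowledging
    memtable[key] = value        # fast in-memory write

def replay():
    # On restart, rebuild the in-memory table from the log.
    table = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                rec = json.loads(line)
                table[rec["key"]] = rec["value"]
    return table

write("user123", {"name": "Ada"})
print(replay())  # {'user123': {'name': 'Ada'}} even if the process had crashed after write()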
CommitLog Structure:
CommitLog file format (simplified):

[Header: Version, Compression, Checksum]
[Segment 1]
 ├─ [Mutation 1: keyspace, table, key, columns, timestamp]
 ├─ [Mutation 2: ...]
 └─ [Mutation N: ...]
[Segment 2]
 └─ ...

Properties:
- Append-only (sequential writes = fast)
- Fixed-size segments (default 32 MB)
- Rotates to a new segment when full
- Deleted after the MemTable is flushed
CommitLog Configuration:
# cassandra.yaml

# Sync mode (durability vs. performance trade-off)
commitlog_sync: periodic               # Default: fsync every N ms
commitlog_sync_period_in_ms: 10000     # 10 seconds

# OR
commitlog_sync: batch                  # Fsync before acknowledging writes
commitlog_sync_batch_window_in_ms: 2   # Group commits within this window

# Segment size
commitlog_segment_size_in_mb: 32

# Compression (save disk, cost CPU)
commitlog_compression:
  - class_name: LZ4Compressor
Sync Modes Comparison:
periodic (default):
- Batches writes for 10 seconds
- Flushes to disk together
- Pros: better throughput (batched I/O)
- Cons: up to 10 s of data loss on crash
- Use: most applications (acceptable loss)

batch:
- Flushes before acknowledging
- Pros: no data loss (durable)
- Cons: slower (more disk I/O)
- Use: financial systems, critical data

Benchmarks:
- periodic: 50,000 writes/sec
- batch:    10,000 writes/sec
- 5x throughput difference
MemTable Configuration:

-- Per-table configuration
CREATE TABLE users (
  id UUID PRIMARY KEY,
  name TEXT,
  email TEXT
) WITH memtable_flush_period_in_ms = 3600000;  -- Flush hourly

# Global configuration (cassandra.yaml)
memtable_allocation_type: heap_buffers   # or offheap_buffers
memtable_heap_space_in_mb: 2048          # Max heap for MemTables
memtable_offheap_space_in_mb: 2048       # Max off-heap
Flush Triggers:
MemTable flushes when ANY condition is met:

1. Size threshold:
   MemTable size > memtable_heap_space / num_tables
2. Time threshold:
   Time since last flush > memtable_flush_period_in_ms
3. CommitLog pressure:
   CommitLog size > threshold → flush the oldest MemTable
4. Manual trigger:
   nodetool flush keyspace table
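Expressed as a single predicate (the thresholds below are made-up placeholders, not Cassandra defaults):

# Flush decision: any one condition is enough.
def should_flush(memtable_bytes, seconds_since_flush, commitlog_bytes,
                 heap_limit=2 * 1024**3, flush_period_s=3600,
                 commitlog_limit=8 * 1024**3, manual=False):
    return (memtable_bytes > heap_limit
            or seconds_since_flush > flush_period_s
            or commitlog_bytes > commitlog_limit
            or manual)

print(should_flush(3 * 1024**3, 120, 1 * 1024**3))  # True: size threshold exceeded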
Index.db - Partition Index:

Maps partition keys → byte offset in Data.db

user123 → offset: 0
user456 → offset: 1024
user789 → offset: 2048

Allows a direct jump to the partition in Data.db
Summary.db - Partition Summary (in-memory):
Sparse index of Index.db
Every 128th partition key:

user000 → Index.db offset: 0
user128 → Index.db offset: 16384
user256 → Index.db offset: 32768

Reduces Index.db lookups (saves disk I/O)
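Here is a sketch of how the sparse summary narrows the search (the offsets and the 128-key stride are illustrative): binary-search the in-memory summary to find which block of Index.db to read, and the entry found in that block then gives the partition's offset in Data.db.

# In-memory sparse summary: every 128th key of Index.db and its offset.
import bisect

summary_keys = ["user000", "user128", "user256"]
summary_offsets = [0, 16384, 32768]

def index_block_for(key: str) -> int:
    # Last summary entry whose key is <= the lookup key tells us which
    # slice of Index.db to read from disk.
    block = max(bisect.bisect_right(summary_keys, key) - 1, 0)
    return summary_offsets[block]

print(index_block_for("user200"))  # 16384: scan Index.db starting at this offset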
Filter.db - Bloom Filter:
Probabilistic data structure

Can answer: "Is user999 in this SSTable?"
- Definitely NOT (100% accurate)         → Skip the SSTable
- Probably YES (false positive possible) → Check the SSTable

Benefits:
- In-memory (a few KB per SSTable)
- Saves disk reads for missing data
- 90%+ of SSTables skipped on average
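A tiny Bloom filter sketch (illustrative, not Cassandra's implementation; the bit count and hash count are arbitrary assumptions): k hash functions set k bits per key, and a lookup that finds any of those bits unset proves the key is absent.

# Minimal Bloom filter: add() sets k bits, might_contain() checks them.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user123")
print(bf.might_contain("user123"))  # True
print(bf.might_contain("user999"))  # almost certainly False -> skip this SSTable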
SizeTieredCompactionStrategy (STCS):

The default compaction strategy, optimized for writes.

How It Works:
1. Group SSTables by similar size
2. When 4 SSTables of similar size exist → merge them
3. Result: 1 larger SSTable

Example:
Time 0:  [10MB] [10MB] [10MB] [10MB]   ← 4 similar-sized
Time 1:  → Compact → [40MB]
Time 2:  [40MB] [10MB] [10MB] [10MB] [10MB]
Time 3:  the four 10MB → Compact → [40MB]
Time 4:  [40MB] [40MB]                 ← 2 similar-sized
         Wait for 2 more 40MB SSTables...
Time 10: [40MB] [40MB] [40MB] [40MB]
Time 11: → Compact → [160MB]

Eventually: [640MB] (from many 10MB SSTables)
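Here is a sketch of the size-tiered bucketing rule (the within-50%-of-average grouping and the thresholds are simplified assumptions, not Cassandra's exact logic): group SSTables of similar size and report buckets big enough to compact.

# Bucket SSTable sizes; a bucket with >= min_threshold tables is a
# compaction candidate.
def compaction_candidates(sizes_mb, min_threshold=4, max_threshold=32):
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if 0.5 * avg <= size <= 1.5 * avg:   # "similar size"
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return [b[:max_threshold] for b in buckets if len(b) >= min_threshold]

print(compaction_candidates([10, 10, 10, 10, 40, 40]))  # [[10, 10, 10, 10]]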
Configuration:
CREATE TABLE users (
  id UUID PRIMARY KEY,
  name TEXT
) WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,   -- Min SSTables to compact
  'max_threshold': 32   -- Max SSTables in one compaction
};
Pros:
Fast writes (no immediate compaction)
Simple algorithm
Good for write-heavy workloads
Cons:
Reads can be slow (many SSTables)
Space amplification (need 2x space during compaction)
LeveledCompactionStrategy (LCS):

Organizes SSTables into levels:

Level 0: [10MB] [10MB] [10MB] [10MB]       (new SSTables)
Level 1: [10MB] [10MB] [10MB] ... [10MB]   (max 100MB total)
Level 2: [100MB] [100MB] ... [100MB]       (max 1GB total)
Level 3: [1GB] [1GB] ...                   (max 10GB total)

Rules:
1. Each level is 10x larger than the previous one
2. Within a level, SSTables don't overlap
3. When a level is full → compact into the next level

Example:
Level 0 fills with 4 × 10MB SSTables
→ Compact into Level 1 (merge with overlapping SSTables)
→ Result: 10MB SSTable in Level 1 (no overlap with others)

Key insight: non-overlapping SSTables
→ A query reads at most 1 SSTable per level (max N levels)
→ Predictable read performance
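Why this bounds reads, as a sketch (the key ranges below are made up): within a level the SSTables cover disjoint key ranges, so a point lookup touches at most one SSTable per level.

# At most one SSTable per level needs checking for a point lookup.
levels = {
    1: [("a", "f"), ("g", "m"), ("n", "z")],              # disjoint ranges
    2: [("a", "c"), ("d", "h"), ("i", "p"), ("q", "z")],
}

def sstables_to_check(key: str):
    hits = []
    for level, ranges in levels.items():
        for lo, hi in ranges:
            if lo <= key <= hi:
                hits.append((level, (lo, hi)))
                break                                     # only one can match per level
    return hits

print(sstables_to_check("k"))  # [(1, ('g', 'm')), (2, ('i', 'p'))]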
Configuration:
CREATE TABLE users (
  id UUID PRIMARY KEY,
  name TEXT
) WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160   -- Target SSTable size
};
TimeWindowCompactionStrategy (TWCS):

1. Group SSTables by time window (e.g., 1 day)
2. Compact SSTables within the same time window
3. NEVER compact across time windows
4. Expired windows are deleted entirely (TTL)

Example with 1-day windows:

Day 1: [SSTable_2024-01-01_1] [SSTable_2024-01-01_2]
       → Compact → [SSTable_2024-01-01_merged]

Day 2: [SSTable_2024-01-02_1] [SSTable_2024-01-02_2]
       → Compact → [SSTable_2024-01-02_merged]

Day 30 (with TTL = 30 days):
Delete [SSTable_2024-01-01_merged] entirely
→ Efficient expiration (delete the file, no tombstones!)
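A sketch of the windowing rule (the dates, table names, and TTL check are illustrative assumptions): bucket SSTables by the day they cover, compact only within a bucket, and drop whole buckets once everything in them has expired.

# Group SSTables by time window; expired windows are dropped as whole files.
from collections import defaultdict
from datetime import date, timedelta

sstables = [
    {"name": "sst1", "day": date(2024, 1, 1)},
    {"name": "sst2", "day": date(2024, 1, 1)},
    {"name": "sst3", "day": date(2024, 1, 2)},
]

def window_buckets(tables):
    buckets = defaultdict(list)
    for t in tables:
        buckets[t["day"]].append(t["name"])   # never mix windows
    return dict(buckets)

def expired_windows(tables, today, ttl_days=30):
    return sorted({t["day"] for t in tables if today - t["day"] > timedelta(days=ttl_days)})

print(window_buckets(sstables))                     # one bucket per day
print(expired_windows(sstables, date(2024, 2, 5)))  # both windows past TTL -> delete whole files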
# On each node, check ring status
nodetool ring

# Output (simplified):
Address       Rack   Status  Token                 Owns
192.168.1.1   rack1  Up      -9223372036854775808  33.3%
192.168.1.2   rack1  Up      -3074457345618258603  33.3%
192.168.1.3   rack1  Up       3074457345618258602  33.4%

# Explanation:
# - Token range: -2^63 to 2^63-1
# - Each node owns ~33% of the ring
# - Vnodes distribute evenly
Task: Insert data and verify distribution:
CREATE KEYSPACE test WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 2
};

CREATE TABLE test.users (
  id UUID PRIMARY KEY,
  name TEXT
);

-- Insert 1000 users
-- Then check distribution:
# On each node
nodetool tablestats test.users

# Check "Number of partitions" per node
# Should be roughly equal
Pro Tip: Revisit this module when troubleshooting production issues. Understanding the internals helps diagnose: “Why is this query slow?” → Check: Bloom filters, compaction strategy, SSTables per read.