> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Cluster Operations & Multi-DC Replication > Master gossip protocol, failure detection, repair mechanisms, and multi-datacenter replication strategies # Cluster Operations & Multi-DC Replication **Module Duration**: 8-10 hours **Learning Style**: Deep Technical + Hands-On Operations + Production Scenarios **Outcome**: Understand how Cassandra maintains cluster health, detects failures, and operates across datacenters ## Introduction: The Challenge of Distributed Coordination In a distributed database like Cassandra with potentially hundreds of nodes spread across multiple datacenters, how do nodes: * Discover each other and know who's in the cluster? * Detect when a node fails or becomes unreachable? * Ensure data remains consistent across replicas? * Handle network partitions gracefully? * Replicate data across geographic regions? **The Cassandra Answer**: A combination of **gossip protocol** (inspired by epidemiology), **Phi Accrual Failure Detection**, **hinted handoff**, **read repair**, **anti-entropy repair**, and sophisticated multi-DC replication strategies. This module explores each of these mechanisms in depth. *** ## Part 1: The Gossip Protocol ### What is Gossip? Gossip is a **peer-to-peer communication protocol** inspired by how rumors spread in social networks or diseases spread through populations. In Cassandra, nodes exchange information about themselves and other nodes they know about. **Key Characteristics**: * **Decentralized**: No master node orchestrating communication * **Eventually Consistent**: Information propagates gradually * **Fault Tolerant**: Works even when some nodes are down * **Scalable**: Overhead doesn't increase linearly with cluster size ### The Gossip Mechanism Every **1 second**, each node: 1. **Increments its heartbeat counter** (local version number) 2. **Selects 1-3 random nodes** to gossip with 3. **Sends a GossipDigestSyn** message containing: * Endpoint states for all known nodes * Generation numbers (when node was started) * Heartbeat versions (how recently we heard from each node) 4. **Receives GossipDigestAck** with updates 5. **Applies newer information** to its local state ### Gossip Data Structures ```java theme={null} // Core gossip state for a single endpoint class EndpointState { // When this node was started (prevents confusion with restarts) int generation; // Version number incremented on every gossip round int heartbeatVersion; // Application states (node metadata) Map applicationStates; } enum ApplicationState { STATUS, // NORMAL, LEAVING, JOINING, REMOVING DC, // Datacenter name RACK, // Rack name SCHEMA, // Schema version UUID LOAD, // Disk usage SEVERITY, // Node health (0.0-1.0) TOKENS, // Token ranges owned INTERNAL_IP, // Internal network address RPC_ADDRESS, // Client-facing address RELEASE_VERSION // Cassandra version } class VersionedValue { String value; int version; // Incremented when value changes } ``` ### Gossip Round Example Let's trace a gossip round between three nodes: **Initial State**: ``` Node A knows: A: gen=1, heartbeat=100, STATUS=NORMAL B: gen=1, heartbeat=95, STATUS=NORMAL C: gen=1, heartbeat=90, STATUS=NORMAL Node B knows: A: gen=1, heartbeat=98, STATUS=NORMAL B: gen=1, heartbeat=100, STATUS=NORMAL C: gen=1, heartbeat=92, STATUS=NORMAL Node C knows: A: gen=1, heartbeat=97, STATUS=NORMAL B: gen=1, heartbeat=93, STATUS=NORMAL C: gen=1, heartbeat=100, STATUS=NORMAL ``` **Step 1**: Node A increments its heartbeat and picks B to gossip with ``` A → B: GossipDigestSyn { A: gen=1, maxVersion=101, // A's own heartbeat B: gen=1, maxVersion=95, // What A knows about B C: gen=1, maxVersion=90 // What A knows about C } ``` **Step 2**: B compares with its state and responds ``` B → A: GossipDigestAck { // B has newer info about itself B: gen=1, heartbeat=100, STATUS=NORMAL // B has newer info about C C: gen=1, heartbeat=92, STATUS=NORMAL } ``` **Step 3**: A updates its state ``` Node A now knows: A: gen=1, heartbeat=101, STATUS=NORMAL B: gen=1, heartbeat=100, STATUS=NORMAL // UPDATED C: gen=1, heartbeat=92, STATUS=NORMAL // UPDATED ``` **Step 4**: A sends GossipDigestAck2 with its newer info ``` A → B: GossipDigestAck2 { A: gen=1, heartbeat=101, STATUS=NORMAL } ``` **Step 5**: B updates its state ``` Node B now knows: A: gen=1, heartbeat=101, STATUS=NORMAL // UPDATED B: gen=1, heartbeat=100, STATUS=NORMAL C: gen=1, heartbeat=92, STATUS=NORMAL ``` **Result**: After just one round, nodes A and B have converged on the latest information about A and B, and A learned newer info about C. ### Gossip Propagation Speed How quickly does information spread through gossip? **Mathematical Model**: * Cluster size: N nodes * Gossip fanout: f nodes per round (typically 3) * Gossip interval: 1 second The number of nodes that know about a new piece of information grows exponentially: ``` Round 0: 1 node knows Round 1: 1 + f nodes know (≈4) Round 2: (1+f) × f nodes know (≈16) Round 3: ((1+f) × f) × f nodes know (≈64) ``` **General Formula**: After r rounds, approximately min(f^r, N) nodes know the information. **Example**: In a 1000-node cluster with fanout=3: * Round 1: \~4 nodes know * Round 2: \~16 nodes know * Round 3: \~64 nodes know * Round 4: \~256 nodes know * Round 5: \~1000 nodes know (full propagation) So information propagates to all 1000 nodes in about **5 seconds** (log₃(1000) rounds). ### Seed Nodes **Problem**: When a new node joins, how does it know who to gossip with initially? **Solution**: **Seed nodes** - well-known nodes that new nodes contact first. ```yaml theme={null} # cassandra.yaml seed_provider: - class_name: org.apache.cassandra.locator.SimpleSeedProvider parameters: - seeds: "10.0.1.10,10.0.1.11,10.0.1.12" ``` **Important Characteristics**: * Seeds are **not special** after bootstrap - just initial contact points * Every datacenter should have at least one seed * A node should **not** list itself as a seed * Seeds don't form a special "seed cluster" - they're regular nodes * Seeds just need to be **reliable** and **long-lived** **Common Misconception**: Seeds are masters or leaders. **False!** They're just stable contact points for gossip initialization. ### Monitoring Gossip You can observe gossip in action: ```bash theme={null} # View current gossip state nodetool gossipinfo # Example output: /10.0.1.11 generation:1638360000 heartbeat:123456 STATUS:NORMAL,-9223372036854775808 DC:datacenter1 RACK:rack1 SCHEMA:e84b6a60-1f3e-11ec-9621-0242ac130002 LOAD:2.5e+11 TOKENS:,,... /10.0.1.12 generation:1638359000 heartbeat:123400 STATUS:NORMAL,0 DC:datacenter1 RACK:rack1 # ... more states ``` ```bash theme={null} # Check if gossip is running nodetool statusgossip # Disable gossip (dangerous!) nodetool disablegossip # Re-enable gossip nodetool enablegossip ``` *** ## Part 2: Failure Detection ### The Challenge How do you determine if a node is down vs. just slow or experiencing network delays? **Naive Approach**: Use a fixed timeout (e.g., "no response in 5 seconds = dead") **Problems**: * Too short: false positives during network hiccups * Too long: slow to detect real failures * Network latency varies over time * Different nodes have different performance characteristics ### Phi Accrual Failure Detector Cassandra uses the **Phi (Φ) Accrual Failure Detector**, which provides a **suspicion level** rather than a binary up/down decision. **Key Idea**: Instead of "is this node down?", ask "how confident am I that this node is down?" **The Algorithm**: 1. **Track Arrival Times**: Record when gossip heartbeats arrive from each node 2. **Build a Statistical Model**: Use a sliding window (default: last 1000 heartbeats) to model the expected inter-arrival time 3. **Calculate Phi (Φ)**: When a heartbeat is late, calculate how suspicious this is: ``` Φ(t) = -log₁₀(P(T > t)) ``` Where: * `t` = time since last heartbeat * `P(T > t)` = probability that inter-arrival time exceeds `t` * Φ = suspicion level 4. **Make Decision**: If Φ exceeds threshold (default: 8), mark node as down ### Understanding Phi Values | Phi (Φ) | Probability of Mistake | Meaning | | ------- | ---------------------- | ----------------------------------- | | 1 | 10% | Slightly suspicious | | 2 | 1% | Moderately suspicious | | 3 | 0.1% | Very suspicious | | 5 | 0.001% | Almost certainly down | | 8 | 0.00001% | Definitely down (default threshold) | | 10 | 0.0000001% | Absolutely dead | **Phi = 8** means: "There's a 0.00001% chance I'm wrong about this node being down" ### Phi Calculation Example Let's trace a failure detection scenario: **Setup**: * Node A monitoring Node B * Historical heartbeat intervals: 1.0s, 1.1s, 0.9s, 1.2s, 1.0s (mean ≈ 1.04s, σ ≈ 0.11s) **Timeline**: ``` t=0s: Heartbeat received Phi = 0 (just heard from node) t=1s: No heartbeat yet Expected: ~1.04s Actual: 1.0s (slightly early) Phi ≈ 0.5 (not suspicious) t=2s: Still no heartbeat Expected: ~1.04s Actual: 2.0s (very late) Phi ≈ 3.2 (getting suspicious) t=3s: Still no heartbeat Actual: 3.0s (extremely late) Phi ≈ 6.8 (very suspicious) t=4s: Still no heartbeat Actual: 4.0s Phi ≈ 9.2 (exceeds threshold of 8) ACTION: Mark Node B as DOWN ``` ### Implementation Details ```java theme={null} class FailureDetector { // Sliding window of arrival intervals (milliseconds) private final BoundedStatsDeque arrivalIntervals; // Default phi threshold private static final double PHI_THRESHOLD = 8.0; // Calculate current phi value for an endpoint public double phi(InetAddress endpoint) { long now = System.currentTimeMillis(); long lastHeartbeat = getLastHeartbeatTime(endpoint); long timeSinceHeartbeat = now - lastHeartbeat; // Get statistics from arrival interval history double mean = arrivalIntervals.mean(); double stddev = arrivalIntervals.stddev(); // Calculate probability using normal distribution double prob = 1.0 - probabilityOfInterval(timeSinceHeartbeat, mean, stddev); // Phi = -log10(probability) return -Math.log10(Math.max(prob, 1e-10)); } public boolean isAlive(InetAddress endpoint) { return phi(endpoint) < PHI_THRESHOLD; } } ``` ### Configuring Failure Detection ```yaml theme={null} # cassandra.yaml # Phi threshold (default: 8) # Lower = more sensitive (faster detection, more false positives) # Higher = less sensitive (slower detection, fewer false positives) phi_convict_threshold: 8 # How long to wait before removing a dead node from gossip failure_detector_timeout_in_ms: 60000 # 60 seconds ``` **Tuning Advice**: * **High-latency networks** (cross-region): Increase to 10-12 * **Low-latency networks** (single DC): Can decrease to 6-7 * **Flaky networks**: Increase to avoid false positives * **Mission-critical availability**: Decrease for faster failover ### Observing Failure Detection ```bash theme={null} # Check failure detector thresholds nodetool failuredetector # Example output: Endpoint Phi /10.0.1.11 0.23 /10.0.1.12 0.18 /10.0.1.13 12.45 # DOWN! /10.0.1.14 0.31 ``` *** ## Part 3: Hinted Handoff (Detailed) We introduced hinted handoff in the write path module. Let's dive deeper into its implementation and edge cases. ### The Detailed Write Flow with Hints ``` Client writes with CL=QUORUM (RF=3) ↓ Coordinator determines replicas: [A, B, C] ↓ Coordinator sends write to A, B, C ↓ Results: • Node A: Success (50ms) • Node B: Success (45ms) • Node C: Timeout (down) ↓ QUORUM achieved (2/3 responses) ↓ Coordinator stores HINT for Node C ↓ Respond SUCCESS to client ``` ### Hint Storage Format Hints are stored locally on the coordinator node in a special system table: ```cql theme={null} CREATE TABLE system.hints ( target_id uuid, -- UUID of down node hint_id timeuuid, -- When hint was created message_version int, mutation blob, -- The actual write mutation PRIMARY KEY ((target_id), hint_id) ) ``` **Example Hint**: ``` target_id: 550e8400-e29b-41d4-a716-446655440000 (Node C's UUID) hint_id: d2177dd0-eaa2-11de-a572-001b779c76e3 (2023-11-15 10:23:45) mutation: [binary data representing: INSERT INTO users (id, name) VALUES (42, 'Alice')] ``` ### Hint Replay The coordinator continuously tries to replay hints: ```java theme={null} class HintedHandoffManager { // Check for hints every 10 minutes (default) private static final int HINT_SCAN_INTERVAL_MS = 600_000; public void scheduleHintDelivery() { while (true) { Thread.sleep(HINT_SCAN_INTERVAL_MS); for (UUID targetNode : getAllTargetNodes()) { if (isNodeAlive(targetNode)) { deliverHints(targetNode); } } } } private void deliverHints(UUID targetNode) { // Read hints for this target from local storage List hints = readHintsFor(targetNode); // Replay them in order for (Hint hint : hints) { try { sendMutationToNode(targetNode, hint.mutation); deleteHint(hint); // Success! Remove hint } catch (Exception e) { // Node went down again, try later break; } } } } ``` ### Hint Lifecycle Timeline ``` t=0s: Node C goes down - Coordinator starts storing hints for C t=10s: Hint #1 stored (INSERT INTO users...) t=15s: Hint #2 stored (UPDATE users...) t=20s: Hint #3 stored (DELETE FROM users...) t=300s (5 min): Node C comes back online - Gossip detects C is up - C marked as ALIVE in failure detector t=600s (10 min): Next hint delivery scan - Coordinator sees C is alive - Begins replaying hints #1, #2, #3 - Each mutation sent to C - Hints deleted after successful delivery t=610s: All hints delivered, C is fully caught up ``` ### Hint Configuration ```yaml theme={null} # cassandra.yaml # How long to store hints (default: 3 hours) max_hint_window_in_ms: 10800000 # Hint storage directory hints_directory: /var/lib/cassandra/hints # Max size of hints on disk (per node) max_hints_file_size_in_mb: 128 # Max total size of all hints max_hints_size_per_host_in_mb: 1024 # Throttle hint delivery speed (MB/s per delivery thread) hints_flush_period_in_ms: 10000 ``` ### When Hints Are NOT Enough Hints are **best-effort only**. They fail in these scenarios: **Scenario 1: Max Hint Window Exceeded** ``` Node C is down for 4 hours (> 3 hour default max_hint_window) → Hints are dropped after 3 hours → Node C will have missing data → Must run repair to recover ``` **Scenario 2: Coordinator Crashes** ``` Coordinator stores hints locally, then crashes → Hints are lost (not replicated) → Node C will have missing data → Must run repair to recover ``` **Scenario 3: Hint Storage Full** ``` Node C is down for extended period → Hints fill up disk space (hit max_hints_size_per_host_in_mb) → Newer hints are dropped → Node C will have missing data → Must run repair to recover ``` **Key Insight**: Hints are a **temporary bridge**, not a replacement for repair! ### Monitoring Hints ```bash theme={null} # Check hint statistics nodetool statushandoff # View hints stored nodetool viewbuildstatuses # Force hint replay for a specific node nodetool handoff # Disable hinted handoff (dangerous!) nodetool disablehandoff # Enable hinted handoff nodetool enablehandoff ``` *** ## Part 4: Read Repair ### The Problem Even with hinted handoff, replicas can diverge: * Hints were dropped (too old) * Coordinator crashed before replaying hints * Network partitions prevented writes from reaching some replicas **Read repair** fixes inconsistencies by comparing replicas during reads. ### Read Repair Types Cassandra has **two types** of read repair: 1. **Blocking Read Repair** (Foreground) 2. **Background Read Repair** ### 1. Blocking Read Repair (Foreground) Happens automatically when you read with a consistency level that queries multiple replicas. **Example**: Read with `CL=QUORUM` (RF=3) ``` Client: SELECT * FROM users WHERE id = 42 WITH CL=QUORUM ↓ Coordinator queries replicas A, B, C: • A: {id: 42, name: 'Alice', email: 'alice@example.com', updated: t=100} • B: {id: 42, name: 'Alice', email: 'alice@example.com', updated: t=100} • C: {id: 42, name: 'Bob', email: 'bob@example.com', updated: t=50} ↓ Coordinator compares: • A and B agree (both at t=100) • C is outdated (t=50 < t=100) ↓ Coordinator sends repair mutation to C: UPDATE users SET name='Alice', email='alice@example.com' WHERE id=42 ↓ Respond to client with newest data: 'Alice' ``` **Key Points**: * Repair happens **synchronously** (blocks the read) * Client gets the **newest data** * Divergent replicas are **fixed immediately** * Only repairs data actually **requested** in the query ### Read Repair Probability Not every read triggers repair! There's a configurable probability: ```cql theme={null} -- Set read repair chance (0.0 to 1.0) ALTER TABLE users WITH read_repair_chance = 0.1; -- 10% of reads -- Set datacenter-local read repair chance ALTER TABLE users WITH dclocal_read_repair_chance = 0.1; ``` **Why probabilistic?** * **Performance**: Full comparison on every read is expensive * **Traffic**: Increases network and CPU overhead * **Tunable**: Higher for critical data, lower for less important data **Default Values** (Cassandra 3.0+): * `read_repair_chance = 0.0` (disabled) * `dclocal_read_repair_chance = 0.1` (10%) **Modern Best Practice**: Rely on repair, not read repair, for consistency. Use `read_repair_chance = 0.0`. ### Read Repair Algorithm ```java theme={null} class ReadRepairHandler { public Row handleRead(ReadCommand cmd, ConsistencyLevel cl) { // 1. Determine which replicas to query List replicas = getReplicas(cmd.key); // 2. Send read to required replicas List responses = queryReplicas(replicas, cl); // 3. Find the newest version (highest timestamp) Row newestRow = findNewest(responses); // 4. Check if read repair triggered (probabilistic) if (shouldDoReadRepair()) { // 5. Compare all responses to newest for (int i = 0; i < responses.size(); i++) { if (!responses.get(i).equals(newestRow)) { // 6. Send repair mutation to outdated replica Mutation repair = computeDiff(newestRow, responses.get(i)); sendRepair(replicas.get(i), repair); } } } // 7. Return newest data to client return newestRow; } } ``` ### 2. Background Read Repair **Purpose**: Repair data that **wasn't queried** in the read request. **Example**: ```cql theme={null} -- Table has 10 columns CREATE TABLE users ( id int PRIMARY KEY, name text, email text, phone text, address text, -- ... 6 more columns ); -- Client queries only 2 columns SELECT name, email FROM users WHERE id = 42; ``` **Blocking read repair** only compares and repairs `name` and `email` columns. **Background read repair** compares and repairs **all columns**, running asynchronously: ``` Client query returns immediately with (name, email) ↓ Background thread compares full rows across replicas: • A: {id: 42, name: 'Alice', email: 'alice@...', phone: '555-1234', ...} • B: {id: 42, name: 'Alice', email: 'alice@...', phone: '555-1234', ...} • C: {id: 42, name: 'Alice', email: 'alice@...', phone: '555-5678', ...} ← Different! ↓ Repair mutation sent to C to fix phone column ``` **Configuration**: ```yaml theme={null} # cassandra.yaml # Background read repair runs asynchronously # No user-facing config - always enabled for columns not in SELECT ``` ### Read Repair Monitoring ```bash theme={null} # Check read repair statistics nodetool tpstats # Look for "ReadRepairStage" section: # Pool Name Active Pending Completed Blocked # ReadRepairStage 0 0 123456 0 # View table-specific read repair settings nodetool tablestats keyspace.table ``` *** ## Part 5: Anti-Entropy Repair (The Big One) ### Why We Need Repair Neither hinted handoff nor read repair is sufficient: * **Hints**: Expire after max\_hint\_window, lost if coordinator crashes * **Read repair**: Only fixes data that's queried **Problem**: Unread data can drift indefinitely! **Solution**: **Anti-entropy repair** - actively compare all data across replicas. ### The Repair Process Overview ``` 1. Choose a token range to repair 2. Build Merkle trees on each replica for that range 3. Compare Merkle trees to find differences 4. Stream missing/outdated data between replicas 5. Repeat for all token ranges ``` ### Merkle Trees Explained A **Merkle tree** (hash tree) allows efficient comparison of large datasets. **Structure**: ``` Root Hash / \ Hash(AB) Hash(CD) / \ / \ Hash(A) Hash(B) Hash(C) Hash(D) / | | \ [Rows] [Rows] [Rows] [Rows] 1-1000 1001-2000 2001-3000 3001-4000 ``` **How It Works**: 1. **Divide token range** into segments (default: 2^15 = 32,768 segments) 2. **Hash each segment's data** (all rows in that token range) 3. **Build tree** by hashing pairs of hashes upward 4. **Compare trees** from root down **Example Comparison**: ``` Node A Merkle Tree: Root: 0xABCD1234 / \ 0x1234ABCD 0x5678EFGH / \ / \ 0xAAAA 0xBBBB 0xCCCC 0xDDDD Node B Merkle Tree: Root: 0xABCD1234 / \ 0x1234ABCD 0x9999ZZZZ ← DIFFERENT! / \ / \ 0xAAAA 0xBBBB 0xXXXX 0xYYYY ``` **Analysis**: * Root hashes differ → trees are different * Left subtree hashes match → left half is identical (skip!) * Right subtree hashes differ → need to check further * Leaf hashes 0xCCCC vs 0xXXXX differ → stream segment 3 * Leaf hashes 0xDDDD vs 0xYYYY differ → stream segment 4 **Efficiency**: Instead of comparing millions of rows, we compare \~30K hashes! ### Merkle Tree Building ```java theme={null} class MerkleTreeBuilder { private static final int TREE_DEPTH = 15; // 2^15 = 32768 leaves public MerkleTree buildTree(ColumnFamilyStore cfs, Range range) { MerkleTree tree = new MerkleTree(TREE_DEPTH); // Read all data in the token range Iterator rows = cfs.rangeIterator(range); while (rows.hasNext()) { Row row = rows.next(); Token token = getToken(row.key); // Find which leaf this row belongs to Leaf leaf = tree.getLeafForToken(token); // Update leaf hash with this row's hash leaf.updateHash(hash(row)); } // Compute internal node hashes bottom-up tree.seal(); return tree; } private byte[] hash(Row row) { MessageDigest md = MessageDigest.getInstance("MD5"); // Hash all columns and their timestamps for (Cell cell : row.cells()) { md.update(cell.name().bytes); md.update(cell.value().bytes); md.update(longToBytes(cell.timestamp())); } return md.digest(); } } ``` ### Running Repair **Full Repair** (all data): ```bash theme={null} nodetool repair ``` **Repair Specific Keyspace**: ```bash theme={null} nodetool repair my_keyspace ``` **Repair Specific Table**: ```bash theme={null} nodetool repair my_keyspace my_table ``` **Incremental Repair** (only unrepaired data): ```bash theme={null} nodetool repair -inc ``` **Repair Specific Token Range**: ```bash theme={null} nodetool repair -st -et ``` ### Repair Output Example ```bash theme={null} $ nodetool repair my_keyspace [2023-11-15 10:00:00,123] Starting repair command #1 [2023-11-15 10:00:00,456] Repair session 12345678 for range (-3074457345618258603,-3074457345618258602] [2023-11-15 10:00:01,789] Building Merkle trees for users [2023-11-15 10:00:15,234] Merkle tree for users is ready [2023-11-15 10:00:15,567] Differences detected for users [2023-11-15 10:00:15,890] Streaming data to /10.0.1.11 [2023-11-15 10:00:45,123] Streaming completed for users [2023-11-15 10:00:45,456] Repair session 12345678 completed [2023-11-15 10:00:45,789] Repair command #1 finished in 45.67 seconds ``` ### Full vs. Incremental Repair **Full Repair**: * Compares **all data** every time * Expensive for large datasets * Use when: major inconsistency suspected **Incremental Repair** (Cassandra 2.2+): * Marks SSTables as "repaired" after successful repair * Only repairs **unrepaired SSTables** on subsequent runs * Much faster for ongoing maintenance * Requires special compaction strategy **Enable Incremental Repair**: ```cql theme={null} ALTER TABLE users WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160 }; ``` ```bash theme={null} # First incremental repair - repairs all data nodetool repair -inc # Subsequent repairs - only new data nodetool repair -inc # Much faster! ``` ### Repair Scheduling **Best Practice**: Run repair within `gc_grace_seconds` (default: 10 days) **Why?** Cassandra uses tombstones to mark deleted data. If a replica is down longer than `gc_grace_seconds`, tombstones can be garbage collected, and deleted data might "resurrect." **Timeline**: ``` Day 0: DELETE FROM users WHERE id = 42 → Tombstone written to A and B → Node C is down (doesn't get tombstone) Day 5: Garbage collection runs → Tombstone still present (< 10 days old) Day 8: Repair runs → Tombstone streamed to Node C → Data properly deleted everywhere Day 12: Garbage collection runs → Tombstone removed (> 10 days old) ``` **Bad Timeline**: ``` Day 0: DELETE FROM users WHERE id = 42 → Tombstone written to A and B → Node C is down Day 12: Garbage collection runs → Tombstone removed (> 10 days old) Day 14: Node C comes back → Still has old data for id=42 Day 15: Repair runs → Compares A (no data) vs C (old data) → Old data wins! (has timestamp, no tombstone to override) → DELETED DATA RESURRECTED! ⚠️ ``` **Automated Scheduling**: Use `cassandra-reaper` (open source tool): ```bash theme={null} # Install Reaper docker run -p 8080:8080 thelastpickle/cassandra-reaper # Schedule automatic repairs reaper-cli schedule add \ --cluster my_cluster \ --keyspace my_keyspace \ --interval 7 \ --segments 64 ``` Or simple cron: ```bash theme={null} # Run repair weekly 0 2 * * 0 nodetool repair -pr my_keyspace ``` ### Repair Performance Impact Repair is **resource-intensive**: | Resource | Impact | Reason | | ------------ | --------- | ----------------------------------- | | **CPU** | High | Building Merkle trees, hashing data | | **Disk I/O** | Very High | Reading all SSTables | | **Network** | High | Streaming differing data | | **Memory** | Medium | Storing Merkle trees | **Mitigation Strategies**: 1. **Repair During Off-Peak Hours** ```bash theme={null} # Run at 2 AM 0 2 * * * nodetool repair ``` 2. **Limit Repair Parallelism** ```bash theme={null} # Repair one token range at a time nodetool repair -seq ``` 3. **Throttle Streaming** ```bash theme={null} # Limit streaming bandwidth to 16 MB/s nodetool setstreamthroughput 16 ``` 4. **Use Sub-Range Repair** ```bash theme={null} # Repair in smaller chunks for range in $(calculate_ranges); do nodetool repair -st $start -et $end sleep 300 # 5-minute break between ranges done ``` *** ## Part 6: Multi-Datacenter Replication ### Why Multiple Datacenters? 1. **Geographic Distribution**: Serve users with low latency worldwide 2. **Disaster Recovery**: Survive datacenter failures 3. **Compliance**: Keep data in specific regions (GDPR, etc.) 4. **Read Scalability**: Read from local DC, reduce cross-DC traffic ### Network Topology Cassandra models datacenters and racks: ``` Cluster ├── Datacenter: us-east │ ├── Rack: rack1 │ │ ├── Node A (10.0.1.10) │ │ └── Node B (10.0.1.11) │ └── Rack: rack2 │ ├── Node C (10.0.2.10) │ └── Node D (10.0.2.11) └── Datacenter: eu-west ├── Rack: rack1 │ ├── Node E (10.1.1.10) │ └── Node F (10.1.1.11) └── Rack: rack2 ├── Node G (10.1.2.10) └── Node H (10.1.2.11) ``` ### Configuring Datacenters **1. Define Topology** (cassandra-rackdc.properties): ```properties theme={null} # Node A configuration dc=us-east rack=rack1 # Node E configuration dc=eu-west rack=rack1 ``` **2. Use NetworkTopologyStrategy**: ```cql theme={null} CREATE KEYSPACE my_keyspace WITH replication = { 'class': 'NetworkTopologyStrategy', 'us-east': 3, -- 3 replicas in us-east 'eu-west': 2 -- 2 replicas in eu-west }; ``` ### Multi-DC Replica Placement With `NetworkTopologyStrategy`, Cassandra ensures: * Replicas are spread across different racks within a DC * Each DC gets the specified number of replicas **Example**: Token range `[0, 100)` with RF=\{us-east:3, eu-west:2} ``` Token range [0, 100) owned by Node A (us-east, rack1) Replica placement: 1. Node A (us-east, rack1) -- Primary 2. Node C (us-east, rack2) -- Different rack 3. Node B (us-east, rack1) -- Back to rack1 4. Node E (eu-west, rack1) -- EU primary 5. Node G (eu-west, rack2) -- Different rack ``` **Visual**: ``` us-east: rack1: [A*, B] rack2: [C*] eu-west: rack1: [E*] rack2: [G*] * = Replica for token range [0, 100) ``` ### Multi-DC Consistency Levels New consistency levels for multi-DC: | Consistency Level | Meaning | | ----------------- | ------------------------------ | | **LOCAL\_ONE** | At least 1 replica in local DC | | **LOCAL\_QUORUM** | Quorum of replicas in local DC | | **EACH\_QUORUM** | Quorum in **every** DC | **Example**: Keyspace with RF=\{us-east:3, eu-west:3} ```cql theme={null} -- Write from us-east with LOCAL_QUORUM INSERT INTO users (id, name) VALUES (1, 'Alice') WITH CONSISTENCY LOCAL_QUORUM; Required acknowledgments: ✓ 2 of 3 replicas in us-east (quorum) ✗ 0 of 3 replicas in eu-west (not required) Result: Write succeeds if 2 us-east nodes respond ``` ```cql theme={null} -- Write with EACH_QUORUM INSERT INTO users (id, name) VALUES (1, 'Alice') WITH CONSISTENCY EACH_QUORUM; Required acknowledgments: ✓ 2 of 3 replicas in us-east (quorum) ✓ 2 of 3 replicas in eu-west (quorum) Result: Write succeeds only if 2 nodes in BOTH DCs respond ``` ### Multi-DC Write Flow **Scenario**: Client in us-east writes with `CL=LOCAL_QUORUM`, RF=\{us-east:3, eu-west:2} ``` Client (us-east) → Coordinator (us-east Node A) ↓ Coordinator determines replicas: us-east: [A, B, C] eu-west: [E, F] ↓ Coordinator sends writes: • A, B, C: Synchronously (wait for response) • E, F: Asynchronously (don't wait) ↓ Wait for LOCAL_QUORUM (2/3 from us-east): • A: ACK (20ms) • B: ACK (25ms) • C: ... (waiting) ↓ LOCAL_QUORUM satisfied → Respond to client ↓ Background: Write propagates to eu-west • E: ACK (150ms) - cross-DC latency • F: ACK (160ms) ``` **Key Points**: * **Local writes are fast** (no cross-DC latency in critical path) * **Remote writes happen asynchronously** * **Eventual consistency** across DCs ### Multi-DC Read Flow **Scenario**: Client in eu-west reads with `CL=LOCAL_QUORUM`, RF=\{us-east:3, eu-west:2} ``` Client (eu-west) → Coordinator (eu-west Node E) ↓ Coordinator determines LOCAL replicas: eu-west: [E, F] ↓ Coordinator sends reads to E, F ↓ E returns: {id: 1, name: 'Alice', timestamp: 100} F returns: {id: 1, name: 'Alice', timestamp: 100} ↓ Compare timestamps → Both match ↓ Respond to client: 'Alice' ``` **No cross-DC communication** for reads with `LOCAL_*` consistency! ### Multi-DC Repair Repair operates **per-DC** by default: ```bash theme={null} # Repair only local DC nodetool repair -dc us-east # Repair only specific datacenter from any node nodetool repair -dc eu-west # Repair across all DCs (expensive!) nodetool repair -full ``` **Best Practice**: Run per-DC repairs regularly, cross-DC repairs less frequently. ### Monitoring Multi-DC Clusters ```bash theme={null} # View cluster status across DCs nodetool status # Example output: Datacenter: us-east =================== Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.0.1.10 2.5 TB 256 50.1% aaa-bbb rack1 UN 10.0.1.11 2.3 TB 256 49.9% ccc-ddd rack1 UN 10.0.2.10 2.4 TB 256 50.0% eee-fff rack2 Datacenter: eu-west =================== -- Address Load Tokens Owns Host ID Rack UN 10.1.1.10 1.8 TB 256 50.2% ggg-hhh rack1 UN 10.1.1.11 1.7 TB 256 49.8% iii-jjj rack2 ``` *** ## Part 7: Cluster Management Operations ### Adding a Node (Bootstrap) **Process**: 1. **Install Cassandra** on new node 2. **Configure** cassandra.yaml: ```yaml theme={null} cluster_name: 'MyCluster' seeds: "10.0.1.10,10.0.1.11" # Existing seed nodes listen_address: 10.0.3.10 # New node IP rpc_address: 10.0.3.10 ``` 3. **Start Cassandra**: ```bash theme={null} cassandra -f ``` 4. **Bootstrap Process**: ``` New Node Joins ↓ Gossip with seeds ↓ Receive cluster topology ↓ Calculate token ownership ↓ Stream data from existing nodes ↓ Mark status as NORMAL ↓ Start serving traffic ``` **Timeline**: ``` t=0s: Start Cassandra [2023-11-15 10:00:00] Joining cluster t=5s: Gossip establishes [2023-11-15 10:00:05] Cluster state received t=10s: Token calculation [2023-11-15 10:00:10] Calculating tokens (num_tokens=256) t=15s: Streaming begins [2023-11-15 10:00:15] Streaming from /10.0.1.10 [2023-11-15 10:00:15] Streaming from /10.0.1.11 [2023-11-15 10:00:15] Streaming from /10.0.2.10 t=3600s (1 hour): Streaming completes (varies by data size) [2023-11-15 11:00:00] Streaming completed t=3605s: Node ready [2023-11-15 11:00:05] Node state: NORMAL [2023-11-15 11:00:05] Accepting client connections ``` **Monitor Bootstrap**: ```bash theme={null} # Check bootstrap status nodetool netstats # Example output: Bootstrap 45.2% Sending 123 files, 567.8 GB total /10.0.1.10: 45/100 files, 234.5 GB / 500 GB /10.0.1.11: 34/90 files, 189.2 GB / 450 GB ``` ### Decommissioning a Node **Purpose**: Safely remove a node from the cluster **Process**: ```bash theme={null} # On the node to be removed nodetool decommission ``` **What Happens**: ``` 1. Node streams its data to other nodes 2. Token ranges reassigned 3. Node marks itself as DECOMMISSIONED 4. Gossip informs cluster 5. Node shuts down ``` **Timeline**: ``` t=0s: Run decommission [2023-11-15 10:00:00] Starting decommission t=5s: Streaming begins [2023-11-15 10:00:05] Streaming to /10.0.1.11 [2023-11-15 10:00:05] Streaming to /10.0.2.10 t=3600s: Streaming completes [2023-11-15 11:00:00] Streaming completed t=3605s: Decommission complete [2023-11-15 11:00:05] Node decommissioned [2023-11-15 11:00:05] Safe to shutdown ``` **Never Just Kill the Node!** Decommission ensures data isn't lost. ### Replacing a Dead Node **Scenario**: Node crashed and can't be recovered (hardware failure) **Process**: 1. **Prepare new node** with same IP (or use `replace_address`) 2. **Configure cassandra.yaml**: ```yaml theme={null} # Use this to replace dead node with different IP cassandra.replace_address: 10.0.1.10 # IP of dead node ``` 3. **Start new node**: ```bash theme={null} cassandra -f ``` 4. **Streaming**: ``` New node streams data from other replicas Same token ownership as dead node Takes over dead node's responsibilities ``` 5. **Remove replace\_address** and restart: ```yaml theme={null} # cassandra.yaml # cassandra.replace_address: 10.0.1.10 # COMMENT OUT! ``` ### Monitoring Cluster Health **Real-time Status**: ```bash theme={null} # Cluster overview nodetool status # Detailed node info nodetool info # Performance stats nodetool tpstats # Table statistics nodetool tablestats keyspace.table ``` **Key Metrics to Monitor**: | Metric | Command | Healthy Range | | ------------------- | -------------------------- | ------------- | | Heap usage | `nodetool info` | \< 75% | | Pending compactions | `nodetool compactionstats` | \< 20 | | Dropped messages | `nodetool tpstats` | 0 | | Read/write latency | `nodetool tablehistograms` | \< 10ms p95 | | Disk usage | `df -h` | \< 80% | *** ## Part 8: Hands-On Exercises ### Exercise 1: Observe Gossip Propagation **Setup**: 3-node cluster **Task**: 1. Monitor gossip on Node A: `watch -n 1 nodetool gossipinfo` 2. On Node B, change schema: `ALTER TABLE users ADD phone text;` 3. Observe SCHEMA state propagate through gossip 4. Measure propagation time **Expected**: Schema UUID updates across all nodes within 1-3 seconds ### Exercise 2: Test Failure Detection **Task**: 1. Monitor failure detector: `watch -n 1 nodetool failuredetector` 2. Pause Node B: `nodetool pausehandoff` 3. Simulate network delay: `sudo tc qdisc add dev eth0 root netem delay 500ms` 4. Watch Phi values rise 5. Remove delay: `sudo tc qdisc del dev eth0 root` 6. Watch Phi values drop **Question**: At what Phi value did the node get marked as DOWN? ### Exercise 3: Hinted Handoff **Scenario**: Test hint storage and replay **Steps**: ```bash theme={null} # 1. Stop Node C nodetool drain systemctl stop cassandra # 2. Write data with CL=ONE (will succeed via hints) cqlsh -e "INSERT INTO test.users (id, name) VALUES (999, 'Hinted') USING CONSISTENCY ONE;" # 3. Check hints on coordinator ls -lh /var/lib/cassandra/hints/ # 4. Restart Node C systemctl start cassandra # 5. Wait for hint replay (10 min by default) # Or force: nodetool handoff # 6. Verify data on Node C cqlsh -e "SELECT * FROM test.users WHERE id = 999;" ``` ### Exercise 4: Read Repair in Action **Setup**: ```cql theme={null} CREATE TABLE test.users ( id int PRIMARY KEY, name text, email text ) WITH read_repair_chance = 1.0; -- Always repair ``` **Create Inconsistency**: ```bash theme={null} # Write to Node A only cqlsh -e "INSERT INTO test.users (id, name, email) VALUES (500, 'Alice', 'alice@example.com') USING CONSISTENCY ONE;" # Write different value to Node B only cqlsh -e "INSERT INTO test.users (id, name, email) VALUES (500, 'Bob', 'bob@example.com') USING CONSISTENCY ONE;" ``` **Trigger Read Repair**: ```bash theme={null} # Read with QUORUM (triggers repair) cqlsh -e "SELECT * FROM test.users WHERE id = 500 USING CONSISTENCY QUORUM;" # Check both nodes now agree cqlsh -e "SELECT * FROM test.users WHERE id = 500;" cqlsh -e "SELECT * FROM test.users WHERE id = 500;" ``` **Expected**: Both nodes now have the same data (newest timestamp wins) ### Exercise 5: Run Repair and Analyze **Task**: ```bash theme={null} # Start repair with verbose output nodetool repair test users -full 2>&1 | tee repair.log # Analyze repair output grep "Merkle tree" repair.log grep "Differences detected" repair.log grep "Streaming" repair.log # Check repair statistics nodetool compactionstats ``` **Questions**: * How long did Merkle tree building take? * How much data was streamed? * Which token ranges had differences? ### Exercise 6: Multi-DC Consistency **Setup**: 2-DC cluster **Create Keyspace**: ```cql theme={null} CREATE KEYSPACE multidc WITH replication = { 'class': 'NetworkTopologyStrategy', 'dc1': 2, 'dc2': 2 }; ``` **Test Consistency Levels**: ```cql theme={null} -- Write from DC1 INSERT INTO multidc.test (id, val) VALUES (1, 'DC1 write') USING CONSISTENCY LOCAL_QUORUM; -- Immediately read from DC2 SELECT * FROM multidc.test WHERE id = 1 USING CONSISTENCY LOCAL_QUORUM; ``` **Question**: Is the data visible in DC2 immediately? Why or why not? **Test EACH\_QUORUM**: ```cql theme={null} INSERT INTO multidc.test (id, val) VALUES (2, 'Global write') USING CONSISTENCY EACH_QUORUM; ``` **Question**: How does write latency compare to LOCAL\_QUORUM? ### Exercise 7: Simulate Node Failure and Recovery **Scenario**: Test complete failure recovery workflow **Steps**: 1. **Create baseline data**: ```bash theme={null} for i in {1..1000}; do cqlsh -e "INSERT INTO test.users (id, name) VALUES ($i, 'User$i');" done ``` 2. **Stop Node C**: ```bash theme={null} systemctl stop cassandra ``` 3. **Continue writes** (hints accumulate): ```bash theme={null} for i in {1001..2000}; do cqlsh -e "INSERT INTO test.users (id, name) VALUES ($i, 'User$i') USING CONSISTENCY QUORUM;" done ``` 4. **Restart Node C after 5 hours** (exceeds hint window): ```bash theme={null} # Simulate time passing... systemctl start cassandra ``` 5. **Check for missing data**: ```bash theme={null} # Query Node C for recently written data cqlsh -e "SELECT COUNT(*) FROM test.users WHERE id > 1000;" ``` 6. **Run repair**: ```bash theme={null} nodetool repair test users ``` 7. **Verify consistency**: ```bash theme={null} cqlsh -e "SELECT COUNT(*) FROM test.users;" # Should match other nodes now ``` *** ## Part 9: Production Best Practices ### Repair Schedules **Recommendation**: Repair every 7 days (within gc\_grace\_seconds) **Strategy 1: Full Cluster Repair (Small Clusters)** ```bash theme={null} # Weekly full repair 0 2 * * 0 nodetool repair -full ``` **Strategy 2: Per-Node Repair (Large Clusters)** ```bash theme={null} # Each node repairs only its primary ranges # Stagger across nodes throughout the week # Node A: Sunday 0 2 * * 0 nodetool repair -pr # Node B: Monday 0 2 * * 1 nodetool repair -pr # Node C: Tuesday 0 2 * * 2 nodetool repair -pr ``` **Strategy 3: Incremental Repair** ```bash theme={null} # Run daily incremental repair (faster) 0 2 * * * nodetool repair -inc ``` ### Gossip Tuning **High Latency Networks** (cross-region): ```yaml theme={null} # cassandra.yaml # Increase failure detection threshold phi_convict_threshold: 12 # Increase gossip interval gossip_interval_ms: 2000 # Default: 1000 # Increase shadow round interval shadow_round_ms: 120000 # Default: 60000 ``` **Very Large Clusters** (500+ nodes): ```yaml theme={null} # Reduce gossip overhead gossip_interval_ms: 2000 ``` ### Multi-DC Best Practices **1. Use LOCAL Consistency Levels** ```cql theme={null} -- Write locally, async replicate INSERT ... USING CONSISTENCY LOCAL_QUORUM; -- Read locally SELECT ... USING CONSISTENCY LOCAL_QUORUM; ``` **2. Configure Snitches Correctly** ```yaml theme={null} # cassandra.yaml endpoint_snitch: GossipingPropertyFileSnitch # cassandra-rackdc.properties dc=us-east-1 rack=rack1 ``` **3. Set DC-Aware Load Balancing** (driver config): ```python theme={null} # Python driver from cassandra.cluster import Cluster from cassandra.policies import DCAwareRoundRobinPolicy cluster = Cluster( contact_points=['10.0.1.10'], load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east-1') ) ``` **4. Separate Seeds Per DC** ```yaml theme={null} # In us-east nodes seeds: "10.0.1.10,10.0.1.11" # In eu-west nodes seeds: "10.1.1.10,10.0.1.10" # Local seed + one remote seed ``` ### Monitoring Checklist | Metric | Tool | Alert Threshold | | ----------------------- | -------------------------- | ---------------- | | Pending gossip messages | `nodetool tpstats` | > 100 | | Failure detector phi | `nodetool failuredetector` | > 8 = DOWN | | Hints stored | `nodetool statushandoff` | > 10 GB per node | | Pending compactions | `nodetool compactionstats` | > 20 | | Last repair time | Log scraping | > 10 days | | Inter-DC latency | `nodetool proxyhistograms` | p99 > 200ms | *** ## Part 10: Common Issues and Debugging ### Issue 1: Gossip Not Propagating **Symptoms**: * Nodes don't see schema changes * `nodetool describecluster` shows different schema versions **Diagnosis**: ```bash theme={null} # Check gossip status nodetool statusgossip # View schema versions nodetool describecluster # Example bad output: Schema versions: e84b6a60-1f3e-11ec-9621-0242ac130002: [10.0.1.10, 10.0.1.11] f95c7b71-2g4f-22fd-a732-1353bd241113: [10.0.1.12] # DIFFERENT! ``` **Solutions**: ```bash theme={null} # Force schema agreement nodetool resetlocalschema # Restart gossip nodetool disablegossip nodetool enablegossip # Last resort: rolling restart ``` ### Issue 2: False Failure Detection **Symptoms**: * Nodes marked as DOWN when actually healthy * Logs show: `FatClient ... has been silent for 30000ms, removing from gossip` **Diagnosis**: ```bash theme={null} # Check phi values nodetool failuredetector # Check GC pauses (common cause) grep "GC" /var/log/cassandra/system.log | grep "application stopped" ``` **Solutions**: ```yaml theme={null} # Increase phi threshold (cassandra.yaml) phi_convict_threshold: 12 # Default: 8 # Tune JVM GC (jvm.options) -XX:MaxGCPauseMillis=500 ``` ### Issue 3: Hints Accumulating **Symptoms**: * Disk space filling up * `/var/lib/cassandra/hints/` directory growing **Diagnosis**: ```bash theme={null} # Check hint statistics nodetool statushandoff # Check hint directory size du -sh /var/lib/cassandra/hints/ # Find which nodes have hints ls -lh /var/lib/cassandra/hints/ ``` **Solutions**: ```bash theme={null} # 1. Force immediate hint delivery nodetool handoff # 2. If node is permanently gone, remove hints nodetool removenode # 3. Reduce max hint window # cassandra.yaml: max_hint_window_in_ms: 3600000 # 1 hour instead of 3 ``` ### Issue 4: Repair Taking Too Long **Symptoms**: * Repair runs for days * High CPU and disk I/O **Diagnosis**: ```bash theme={null} # Check repair progress nodetool compactionstats # Check active repair sessions nodetool repair_admin list ``` **Solutions**: ```bash theme={null} # 1. Use sub-range repair nodetool repair -st -et # 2. Use incremental repair nodetool repair -inc # 3. Limit parallelism nodetool repair -seq # 4. Throttle streaming nodetool setstreamthroughput 16 # MB/s ``` *** ## Summary & Key Takeaways **Gossip Protocol**: * Peer-to-peer state propagation * Exponential spread: log(N) rounds to full propagation * Seeds are just initial contact points * Runs every 1 second **Failure Detection**: * Phi Accrual: suspicion level, not binary * Phi = 8 (default) = 99.999% confidence node is down * Adapts to network conditions * Configurable via `phi_convict_threshold` **Hinted Handoff**: * Temporary solution for node failures * Stored locally on coordinator * Max window: 3 hours (default) * **Not a replacement for repair!** **Read Repair**: * Fixes data queried in reads * Foreground (blocking) + Background (async) * Probabilistic (10% by default) * Limited scope **Anti-Entropy Repair**: * Compares all data using Merkle trees * Essential for consistency * Must run within `gc_grace_seconds` * Resource-intensive **Multi-DC**: * `NetworkTopologyStrategy` for replication * `LOCAL_*` consistency for low latency * Asynchronous cross-DC replication * Per-DC repair recommended **Cluster Operations**: * Bootstrap: Automated, streams data * Decommission: Safe node removal * Replace: For dead node recovery * Never just kill nodes! *** ## What's Next? JVM tuning, monitoring, troubleshooting, and running Cassandra in production at scale