Performance Tuning & Production Operations
Module Duration : 10-12 hours
Learning Style : Deep Technical + Hands-On Tuning + Production War Stories
Outcome : Operate Cassandra clusters at peak performance in production environments
Introduction: The Production Reality
Running Cassandra in production is vastly different from development:
Development: Single node, small dataset, tolerant of restarts
Production: 50+ nodes, multi-TB per node, 24/7 uptime, millisecond SLAs
This module covers everything you need to run Cassandra successfully in production: JVM tuning, OS configuration, monitoring, capacity planning, and troubleshooting real-world issues.
Part 1: The JVM - Cassandra’s Foundation
Cassandra runs on the Java Virtual Machine (JVM). JVM performance directly impacts Cassandra performance , especially around garbage collection (GC).
Why GC Matters
Problem : Cassandra keeps data in memory (MemTables, caches). When JVM runs GC:
Stop-the-world pauses : Application threads freeze
Pauses > 1 second → timeouts, failed requests
Pauses > 10 seconds → nodes marked as DOWN by failure detector
Goal : Keep GC pauses < 200ms
Heap Size Configuration
Cassandra’s heap is split into two regions:
Heap (Total)
├── Young Generation (Eden + Survivor spaces)
│ └── Short-lived objects (mutations, query results)
└── Old Generation (Tenured)
└── Long-lived objects (MemTables, caches, bloom filters)
Sizing Rules :
# jvm.options or jvm11-server.options
# Rule 1: Max heap = 8GB or 25% of RAM (whichever is smaller)
# Why? Large heaps = long GC pauses
# Example: 64GB RAM machine
-Xms8G # Initial heap
-Xmx8G # Max heap (always match Xms!)
# Rule 2: Young gen = 100MB per CPU core (up to 2GB max)
# Example: 16 cores
-Xmn1600M # 16 * 100MB = 1.6GB
Why These Limits?
| Heap Size | GC Pause Time | Issue |
|-----------|---------------|-------|
| < 4GB | 50-100ms | Too small, frequent GC |
| 4-8GB | 100-200ms | Optimal |
| 8-16GB | 200-500ms | Acceptable for some workloads |
| > 16GB | 500ms-5s | Too long, avoid! |
Real Example :
# Bad configuration (128GB RAM)
-Xms32G -Xmx32G # GC pauses will be 1-3 seconds!
# Good configuration (128GB RAM)
-Xms8G -Xmx8G # GC pauses ~200ms, use remaining RAM for OS cache
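The two sizing rules are simple enough to script. Below is a minimal helper (a Python sketch; the function name and return format are illustrative) that derives the flags from a machine's RAM and core count:

```python
def jvm_heap_flags(ram_gb: int, cores: int) -> list[str]:
    """Apply the sizing rules above: heap = min(8 GB, 25% of RAM),
    young gen = 100 MB per core, capped at 2 GB."""
    heap_gb = min(8, ram_gb // 4)        # Rule 1
    young_mb = min(cores * 100, 2048)    # Rule 2
    return [f"-Xms{heap_gb}G", f"-Xmx{heap_gb}G", f"-Xmn{young_mb}M"]

# 64 GB RAM, 16 cores -> ['-Xms8G', '-Xmx8G', '-Xmn1600M']
print(jvm_heap_flags(64, 16))
```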
Garbage Collection Algorithms
Cassandra supports three main GC algorithms:
1. G1GC (G1 Garbage Collector) - Recommended
Best For : Cassandra 3.0+, default choice
Configuration :
# jvm11-server.options
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=200 # Target pause time
-XX:InitiatingHeapOccupancyPercent=70 # When to start GC
-XX:ParallelGCThreads=16 # GC thread count (# of cores)
How G1GC Works :
Heap divided into ~2000 equal-sized regions (1-32MB each)
Region types:
├── Eden (young objects)
├── Survivor (survived 1+ GC)
├── Old (long-lived)
└── Humongous (objects > 50% region size)
GC Process:
1. Young GC: Collect Eden + Survivor (10-50ms)
2. Concurrent marking: Find live objects in Old gen
3. Mixed GC: Collect Young + some Old regions (50-200ms)
Tuning for Read-Heavy :
# Tolerate more garbage before GC (fewer pauses)
-XX:InitiatingHeapOccupancyPercent=75
Tuning for Write-Heavy :
# Start GC earlier (prevent MemTable flush storms)
-XX:InitiatingHeapOccupancyPercent=65
2. CMS (Concurrent Mark Sweep) - Legacy
Best For : Cassandra 2.x (deprecated in Cassandra 3.0+)
# jvm.options (Cassandra 2.x)
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
Issues :
Fragmentation in Old Gen → Full GC (5-10 second pauses!)
Deprecated in Java 9+
3. ZGC / Shenandoah - Experimental
Best For : Cassandra 4.0+, Java 11+, cutting-edge deployments
# ZGC (Java 15+)
-XX:+UseZGC
-XX:ZCollectionInterval=120 # GC every 2 minutes
# Shenandoah (Java 11+)
-XX:+UseShenandoahGC
Benefits :
Sub-10ms pauses even with 100GB heaps!
Still experimental for Cassandra
GC Logging and Monitoring
Enable GC Logging :
# jvm11-server.options
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/cassandra/gc.log:time,uptime:filecount=10,filesize=10m
Analyze GC Logs :
# View GC log
tail -f /var/log/cassandra/gc.log
# Example GC event:
[2023-11-15T10:23:45.678+0000][14.523s] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 2048M->512M(8192M) 45.678ms
Breakdown :
Pause Young: Young generation collection
2048M->512M: Heap before → after GC
(8192M): Total heap size
45.678ms: Pause duration (monitor this! )
Tools for GC Analysis :
GCViewer (GUI):
# Download GCViewer
wget https://github.com/chewiebug/GCViewer/releases/download/1.36/gcviewer-1.36.jar
# Visualize GC log
java -jar gcviewer-1.36.jar /var/log/cassandra/gc.log
GCEasy (Web-based):
# Upload gc.log to https://gceasy.io
# Generates detailed analysis and recommendations
Key Metrics to Watch :
Pause time p99 : Should be < 200ms
Pause frequency : Young GC every 5-10 seconds is normal
Full GC events : Should be 0! Any Full GC is a red flag
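For a quick command-line check without GCViewer, a rough parser can pull the pause durations out of gc.log and report the p99. The sketch below (Python) assumes the unified-logging format produced by the -Xlog settings above; the regex is deliberately loose:

```python
import re

def gc_pause_stats(path="/var/log/cassandra/gc.log"):
    """Extract 'Pause ... 45.678ms' durations and report count, max, and approximate p99."""
    pauses = []
    pattern = re.compile(r"Pause.*?(\d+\.\d+)ms")
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                pauses.append(float(m.group(1)))
    if not pauses:
        return None
    pauses.sort()
    idx = min(len(pauses) - 1, int(len(pauses) * 0.99))
    return {"events": len(pauses), "max_ms": pauses[-1], "p99_ms": pauses[idx]}

# Alert if the p99 exceeds the 200 ms target
stats = gc_pause_stats()
if stats and stats["p99_ms"] > 200:
    print(f"WARNING: GC pause p99 = {stats['p99_ms']:.1f} ms")
```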
Common GC Issues
Issue 1: Frequent Full GCs
Symptoms :
[GC] Pause Full (Allocation Failure) 8192M->7890M(8192M) 5234.567ms
Causes :
Heap too small
Memory leak (improper cache configuration)
Large object allocation (huge queries)
Solutions :
# 1. Increase heap (up to 8GB)
-Xms8G -Xmx8G
# 2. Check for memory leaks
nodetool info | grep "Heap Memory"
# 3. Limit query size
# cassandra.yaml:
tombstone_failure_threshold: 10000
range_request_timeout_in_ms: 10000
Issue 2: Long Young GC Pauses
Symptoms :
[GC] Pause Young 3000M->500M(8192M) 850.123ms # > 500ms!
Cause : Young gen too large
Solution :
# Reduce young gen size
-Xmn1G # Instead of -Xmn2G
Issue 3: Memory Pressure
Symptoms :
Constant GC activity
nodetool tpstats shows dropped mutations
Heap constantly near max
Diagnosis :
# Check heap usage
nodetool info | grep "Heap Memory"
# Example bad output:
Heap Memory (MB): 7890 / 8192 # 96% used!
Solutions :
# 1. Reduce MemTable size (cassandra.yaml)
memtable_heap_space_in_mb: 2048 # Default: calculated
# 2. Reduce cache sizes
key_cache_size_in_mb: 100 # Default: auto
row_cache_size_in_mb: 0 # Disable if not needed
# 3. Add more nodes (scale out)
Part 2: Operating System Tuning
Disk I/O Configuration
Cassandra is I/O intensive . OS settings massively impact performance.
File System Choice
| File System | Performance | Recommendation |
|-------------|-------------|----------------|
| ext4 | Good | Safe default |
| XFS | Better | Recommended for large volumes |
| ZFS | Variable | Use with caution (compression conflicts) |
| NTFS | Poor | Avoid (Windows) |
Mount Options (XFS):
# /etc/fstab
/dev/sdb1 /var/lib/cassandra xfs noatime,nodiratime,nobarrier 0 0
Why These Options?
noatime: Don’t update access time (reduces writes)
nodiratime: Don’t update directory access time
nobarrier: Disable write barriers (safe with battery-backed RAID)
I/O Scheduler
For SSDs :
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Set to noop or none (optimal for SSDs)
echo noop > /sys/block/sda/queue/scheduler
# Make persistent (add to /etc/rc.local)
echo "echo noop > /sys/block/sda/queue/scheduler" >> /etc/rc.local
For HDDs :
# Use deadline scheduler
echo deadline > /sys/block/sda/queue/scheduler
Readahead
Default : Often 8KB (too small for Cassandra)
# Check current readahead
blockdev --getra /dev/sda
# Set to 8-16MB for sequential reads
blockdev --setra 16384 /dev/sda # 16384 * 512 bytes = 8MB
# Make persistent
echo "blockdev --setra 16384 /dev/sda" >> /etc/rc.local
Linux Kernel Settings
Critical sysctl Settings :
# /etc/sysctl.conf
# Network tuning
net.core.rmem_max = 16777216 # Max receive buffer
net.core.wmem_max = 16777216 # Max send buffer
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 5000 # Queue size for network packets
# VM settings
vm.max_map_count = 1048575 # Max memory maps (for large heaps)
vm.swappiness = 1 # Avoid swap (but don't disable entirely)
# File descriptor limits
fs.file-max = 1048576 # System-wide max FDs
Apply Settings :
# Reload kernel parameters from /etc/sysctl.conf
sysctl -p
User Limits
Cassandra opens many files simultaneously:
# /etc/security/limits.conf
cassandra soft nofile 65536
cassandra hard nofile 65536
cassandra soft nproc 65536
cassandra hard nproc 65536
cassandra soft memlock unlimited
cassandra hard memlock unlimited
cassandra soft as unlimited
cassandra hard as unlimited
Verify :
# As cassandra user
ulimit -n # File descriptors (should be 65536)
ulimit -u # Processes (should be 65536)
Swap Configuration
Philosophy : Minimize swap, but don’t disable entirely.
Why Not Disable Swap?
Linux kernel may OOM-kill Cassandra if no swap
Small swap (1-2GB) acts as emergency overflow
Configuration :
# Set swappiness to 1 (only use swap in emergencies)
sysctl -w vm.swappiness=1
# Limit swap size to 1GB
dd if=/dev/zero of=/swapfile bs=1G count=1
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Add to /etc/fstab
echo "/swapfile none swap sw 0 0" >> /etc/fstab
Transparent Huge Pages (THP)
Issue : THP causes GC pauses and memory fragmentation.
Disable THP :
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never ← bad (enabled)
# Disable THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Make persistent (add to /etc/rc.local)
echo "echo never > /sys/kernel/mm/transparent_hugepage/enabled" >> /etc/rc.local
echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local
CPU Governor
For Performance :
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Set to performance (max CPU frequency)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do
echo performance > $cpu
done
# Install cpufrequtils for persistence
apt-get install cpufrequtils
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils
systemctl restart cpufrequtils
Part 3: Cassandra Configuration Tuning
Compaction Strategy Selection
Choosing the right compaction strategy is critical for performance.
STCS (Size-Tiered Compaction Strategy)
Best For : Write-heavy, time-series data, small tables
How It Works :
SSTables grouped by similar size:
Tier 1: [1MB, 1MB, 1MB, 1MB] → Compact to 4MB
Tier 2: [4MB, 4MB, 4MB, 4MB] → Compact to 16MB
Tier 3: [16MB, 16MB, 16MB, 16MB] → Compact to 64MB
Configuration :
ALTER TABLE users WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,   -- Min SSTables to compact
  'max_threshold': 32   -- Max SSTables to compact
};
Pros :
Fast writes (less compaction overhead)
Simple, predictable
Cons :
Read amplification (query may touch many SSTables)
Temporary disk space = 2x data size during compaction
LCS (Leveled Compaction Strategy)
Best For : Read-heavy, frequently updated data
How It Works :
L0: [10MB, 10MB, 10MB, 10MB] (unsorted)
↓ Compact into L1
L1: [10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB] (sorted, max 100MB)
↓ Compact into L2
L2: [10MB × 100 SSTables] (sorted, max 1GB)
Configuration :
ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160   -- SSTable target size
};
Pros :
Low read amplification (90% reads touch 1 SSTable)
Predictable disk space usage
Cons :
More compaction overhead (impacts writes)
More I/O intensive
TWCS (Time Window Compaction Strategy)
Best For : Time-series data with TTL
How It Works :
Bucket by time window:
Window 1 (Day 1): [SSTable, SSTable] → Compact
Window 2 (Day 2): [SSTable, SSTable] → Compact
Window 3 (Day 3): [SSTable, SSTable] → Compact
After TTL expires:
→ Delete entire window (entire SSTable dropped, instant!)
Configuration :
CREATE TABLE metrics (
  sensor_id int,
  timestamp timestamp,
  value double,
  PRIMARY KEY ((sensor_id), timestamp)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': 1,        -- Window size
  'compaction_window_unit': 'DAYS'    -- HOURS, DAYS, WEEKS
} AND default_time_to_live = 2592000; -- 30 days
Pros :
Ultra-fast TTL deletion (drop entire SSTable)
Minimal read amplification for time-range queries
Cons :
Only suitable for time-series with TTL
Comparison Table :
| Workload | Strategy | Reason |
|----------|----------|--------|
| Write-heavy, small dataset | STCS | Fast writes, simple |
| Read-heavy, updates | LCS | Low read amplification |
| Time-series with TTL | TWCS | Fast TTL deletion |
| IoT sensor data | TWCS | Time-based, expires |
| User profiles (read-heavy) | LCS | Frequently updated, read often |
MemTable Configuration
MemTables are in-memory write buffers. Tuning them balances memory vs. flush frequency.
# cassandra.yaml
# Total heap space for all MemTables (MB)
memtable_heap_space_in_mb: 2048 # 25% of heap is reasonable
# Concurrent flush threads (MemTable -> SSTable writers)
memtable_flush_writers: 2
# Control flush frequency
memtable_cleanup_threshold: 0.5 # Flush when MemTable space is 50% used
Trade-offs :
| Setting | Small MemTable | Large MemTable |
|---------|----------------|----------------|
| Flush frequency | Frequent | Infrequent |
| SSTables created | Many | Few |
| Write performance | Lower | Higher |
| Compaction overhead | Higher | Lower |
| Memory usage | Lower | Higher |
Recommendation : Default is usually good. Only adjust if:
Many small writes : Increase MemTable size (reduce flush frequency)
Memory pressure : Decrease MemTable size
Cache Configuration
Cassandra has three caches:
1. Key Cache
Purpose : Cache partition key → SSTable mapping (avoid Bloom filter checks)
Configuration :
# cassandra.yaml
key_cache_size_in_mb: 100 # Or 'auto' (5% of heap)
key_cache_save_period: 14400 # Save to disk every 4 hours
When to Use :
Read-heavy workloads with hot partitions
Queries by primary key
When to Disable :
Write-heavy workloads
Cold data (rarely queried)
2. Row Cache
Purpose : Cache entire rows (most aggressive caching)
Configuration :
# cassandra.yaml
row_cache_size_in_mb: 0 # DISABLED by default (use carefully!)
Warning : Row cache is dangerous :
Consumes heap (increases GC pressure)
Only helps if reading exact same rows repeatedly
Most production clusters disable this
When to Use : Small, frequently-read tables (e.g., configuration)
3. Counter Cache
Purpose : Cache counter column values (counter tables only)
Configuration :
# cassandra.yaml
counter_cache_size_in_mb: 50
Enable Per Table :
ALTER TABLE counters WITH caching = {
  'keys': 'ALL',
  'rows_per_partition': 'ALL'
};
Commit Log Tuning
CommitLog is Cassandra’s write-ahead log.
Two Modes :
Periodic (default):
# cassandra.yaml
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000 # Sync every 10 seconds
Pros : High write throughput
Cons : Up to 10 seconds of data loss on crash
Batch :
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2 # Sync every 2ms
Pros : Minimal data loss (2ms window)
Cons : 30-50% lower write throughput
Disk Configuration :
# Put CommitLog on separate disk (if using HDD)
commitlog_directory: /mnt/commitlog # Separate disk!
# CommitLog size
commitlog_total_space_in_mb: 8192 # 8GB
Best Practice : Use separate SSD for CommitLog if using HDDs for data.
Concurrent Operations
Control thread pool sizes:
# cassandra.yaml
# Read thread pool (handle read requests)
concurrent_reads: 32 # Default: 32
# Write thread pool (handle write requests)
concurrent_writes: 32 # Default: 32
# Counter write thread pool
concurrent_counter_writes: 32 # Default: 32
# Compaction threads
concurrent_compactors: 2 # 1 per disk spindle (HDD) or 2-4 (SSD)
Tuning :
More CPU cores : Increase concurrent_reads/writes
More disks : Increase concurrent_compactors
Memory constrained : Decrease to reduce overhead
Part 4: Monitoring and Observability
Key Metrics to Monitor
1. System Metrics
CPU :
# Monitor CPU usage
top
htop
# Check per-core usage
mpstat -P ALL 1
Target : < 80% average, < 95% peak
Disk I/O :
# Monitor disk I/O
iostat -x 1
# Key metrics:
# - %util: Disk utilization (should be < 90%)
# - await: Average wait time (should be < 10ms SSD, < 20ms HDD)
# - r/s, w/s: Reads/writes per second
Network :
# Monitor network throughput
iftop
nethogs
# Check dropped packets
netstat -s | grep -i drop
2. JVM Metrics
Heap Usage :
nodetool info | grep "Heap Memory"
# Example output:
Heap Memory (MB): 3456 / 8192 # 42% used (healthy)
GC Metrics :
# GC stats
nodetool gcstats
# Example output:
Interval (ms)   Max GC Elapsed (ms)   Total GC Elapsed (ms)   Stdev GC Elapsed (ms)
10234           189                   3456                    45
Target : Max GC < 200ms
3. Cassandra Metrics
Thread Pool Stats :
nodetool tpstats
# Example output:
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 2 0 12345678 0 0
MutationStage 1 0 98765432 0 0
CompactionExecutor 2 15 123456 0 0
Message type Dropped
READ 0
MUTATION 0 # Any non-zero is BAD!
Critical : Dropped messages should be 0!
Table Statistics :
nodetool tablestats keyspace.table
# Key metrics:
# - SSTable count (should be < 20 for LCS, < 50 for STCS)
# - Space used by snapshots
# - Read/write latency
Compaction Stats :
nodetool compactionstats
# Example output:
pending tasks: 5
- keyspace.table: 5
Active compactions:
compaction type keyspace table completed total progress
Compaction test users 456 MB 1.2 GB 38.00%
Pending compactions should stay low (< 20). High pending = falling behind.
Latency :
nodetool tablehistograms keyspace.table
# Example output:
Percentile Read Latency Write Latency
50% 2.1 ms 0.8 ms
75% 4.5 ms 1.2 ms
95% 12.3 ms 3.4 ms
99% 28.7 ms 8.9 ms
Targets :
p50 < 5ms
p95 < 20ms
p99 < 50ms
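These targets are easy to turn into a scripted check. Here is a rough sketch (Python); it shells out to nodetool and assumes the column layout shown in the example above, which varies between Cassandra versions:

```python
import subprocess

def check_read_p99(keyspace, table, target_ms=50.0):
    """Parse `nodetool tablehistograms` output and flag a read p99 above target."""
    out = subprocess.run(
        ["nodetool", "tablehistograms", keyspace, table],
        capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        parts = line.split()
        # Expect a row shaped like the example above: "99%  28.7 ms  8.9 ms"
        if parts and parts[0] == "99%":
            read_ms = float(parts[1])
            if read_ms > target_ms:
                print(f"ALERT: {keyspace}.{table} read p99 {read_ms} ms > {target_ms} ms")
            return read_ms
    return None

check_read_p99("my_keyspace", "users")
```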
Throughput :
nodetool tablestats keyspace.table | grep -i "read\|write"
# Example output:
Local read count: 12345678
Local read latency: 3.456 ms
Local write count: 98765432
Local write latency: 1.234 ms
Monitoring Tools
1. Prometheus + Grafana (Recommended)
Architecture :
Cassandra nodes → JMX → Prometheus JMX Exporter → Prometheus → Grafana
Setup :
Install JMX Exporter :
# Download jmx_prometheus_javaagent
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.0/jmx_prometheus_javaagent-0.17.0.jar
# Move to Cassandra directory
mv jmx_prometheus_javaagent-0.17.0.jar /usr/share/cassandra/
Configure JMX Exporter (cassandra_jmx.yml):
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value
    name: cassandra_$1_$2
Add to JVM Options :
# jvm11-server.options
-javaagent:/usr/share/cassandra/jmx_prometheus_javaagent-0.17.0.jar=7070:/etc/cassandra/cassandra_jmx.yml
Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'cassandra'
    static_configs:
      - targets: ['node1:7070', 'node2:7070', 'node3:7070']
Import Grafana Dashboard :
2. DataStax OpsCenter
Commercial tool with free tier:
# Install OpsCenter
apt-get install opscenter
# Start OpsCenter
service opscenterd start
# Access: http://localhost:8888
Features :
Visual cluster topology
Performance graphs
Repair scheduling
Backup management
3. nodetool (Quick Checks)
# Overall cluster health
nodetool status
# Node-specific info
nodetool info
# Performance metrics
nodetool tpstats
nodetool tablehistograms keyspace.table
nodetool cfhistograms keyspace.table
# GC monitoring
nodetool gcstats
# Compaction status
nodetool compactionstats
Alert Thresholds
Critical Alerts :
| Metric | Threshold | Action |
|--------|-----------|--------|
| Dropped mutations | > 0 | Investigate immediately (data loss!) |
| GC pause time p99 | > 1s | Tune JVM or add nodes |
| Pending compactions | > 50 | Increase compaction threads |
| Disk usage | > 85% | Add capacity or cleanup |
| Node down | Any | Investigate and repair |
Warning Alerts :
| Metric | Threshold | Action |
|--------|-----------|--------|
| Read latency p99 | > 50ms | Check query patterns |
| Write latency p99 | > 20ms | Check disk I/O |
| Heap usage | > 85% | Review cache settings |
| Pending hints | > 1GB | Check network/nodes |
| SSTable count | > 50 | Review compaction strategy |
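The most critical of these, dropped mutations, is easy to script against nodetool output. A sketch (Python; the tpstats layout differs slightly between versions, so treat the parsing as illustrative):

```python
import subprocess

def dropped_messages():
    """Return message types with a non-zero count in the 'Message type  Dropped'
    section of `nodetool tpstats`."""
    out = subprocess.run(["nodetool", "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    dropped, in_section = {}, False
    for line in out.splitlines():
        if line.startswith("Message type"):
            in_section = True
            continue
        if in_section and line.strip():
            parts = line.split()
            if len(parts) >= 2 and parts[-1].isdigit() and int(parts[-1]) > 0:
                dropped[parts[0]] = int(parts[-1])
    return dropped

# Any non-zero dropped MUTATION count should page someone
print(dropped_messages() or "OK: no dropped messages")
```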
Part 5: Capacity Planning
Disk Capacity
Formula :
Required Disk = (Data Size × Replication Factor × (1 + Compaction Overhead)) / Number of Nodes
Compaction Overhead :
STCS: 50% (2x data during compaction)
LCS: 10% (1.1x data)
TWCS: 20% (1.2x data)
Example :
Data size: 10 TB
Replication factor: 3
Compaction strategy: STCS (50% overhead)
Number of nodes: 10
Disk per node = (10 TB × 3 × 1.5) / 10 = 4.5 TB
Recommendation: 6 TB disks (33% headroom)
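The same formula as a quick calculation (a Python sketch; the overhead factors are the ones listed above):

```python
def disk_per_node_tb(data_tb, rf, nodes, strategy="STCS"):
    """Required disk = (data size x RF x (1 + compaction overhead)) / node count."""
    overhead = {"STCS": 0.5, "LCS": 0.1, "TWCS": 0.2}[strategy]
    return data_tb * rf * (1 + overhead) / nodes

need = disk_per_node_tb(10, 3, 10, "STCS")      # 4.5 TB per node
print(f"Per node: {need} TB; provision ~{need * 1.33:.1f} TB for ~33% headroom")
```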
Memory Capacity
Formula :
Total RAM = Heap + OS Cache + OS Overhead
Heap: 8GB (max recommended)
OS Cache: 50-75% of remaining RAM
OS Overhead: 4-8GB
Example (64GB RAM):
Heap: 8GB
OS Overhead: 8GB
OS Cache: 48GB (for caching SSTables)
Total: 64GB ✓
Why Large OS Cache?
Cassandra relies on OS page cache for SSTable caching
Larger cache = fewer disk reads = better performance
CPU Capacity
Rule of Thumb : 1 CPU core per 1-2 TB of data
Example :
Data per node: 4TB
Recommended cores: 4-8 cores
For read-heavy: 8 cores (more parallelism)
For write-heavy: 4 cores (less GC pressure)
Network Capacity
Formula :
Network throughput = (Write rate × Replication factor) + (Read rate) + (Repair traffic)
Example :
Write rate: 10,000 writes/sec × 5 KB/write = 50 MB/s
Replication factor: 3
Read rate: 5,000 reads/sec × 10 KB/read = 50 MB/s
Network needed = (50 MB/s × 3) + 50 MB/s = 200 MB/s
Recommendation: 1 Gbps NIC (125 MB/s max) is too small!
Use: 10 Gbps NIC (1250 MB/s max)
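The memory split and network estimate can be sketched the same way (Python; the workload numbers mirror the worked examples above and are assumptions about your own traffic):

```python
def ram_split_gb(total_ram_gb, heap_gb=8, os_overhead_gb=8):
    """Total RAM = heap + OS overhead + OS page cache (whatever is left)."""
    return {"heap": heap_gb, "os_overhead": os_overhead_gb,
            "page_cache": total_ram_gb - heap_gb - os_overhead_gb}

def network_mb_per_s(write_ops, write_kb, read_ops, read_kb, rf):
    """Throughput = write rate x RF + read rate (repair traffic excluded here)."""
    writes = write_ops * write_kb / 1024   # MB/s of incoming writes
    reads = read_ops * read_kb / 1024      # MB/s served to clients
    return writes * rf + reads

print(ram_split_gb(64))                            # {'heap': 8, 'os_overhead': 8, 'page_cache': 48}
print(network_mb_per_s(10_000, 5, 5_000, 10, 3))   # ~195 MB/s -> needs a 10 Gbps NIC
```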
Scaling Triggers
When to Add Nodes :
| Metric | Threshold | Action |
|--------|-----------|--------|
| Disk usage | > 70% | Add nodes (or cleanup) |
| CPU usage | > 75% average | Add nodes |
| Read latency p99 | > 50ms (sustained) | Add nodes or optimize |
| Write latency p99 | > 20ms (sustained) | Add nodes or tune |
| Compaction falling behind | Pending > 30 | Add nodes or tune compaction |
Scaling Example :
Initial: 6 nodes, 4TB each = 24TB total capacity (8TB user data × RF=3)
After 6 months:
- Data grows to 12TB
- 6 nodes × 4TB = 24TB total capacity
- Usage: (12TB × 3) / 24TB = 150% (OVER CAPACITY!)
Action: Add 6 more nodes
- 12 nodes × 4TB = 48TB total capacity
- Usage: (12TB × 3) / 48TB = 75% (healthy)
Part 6: Backup and Disaster Recovery
Snapshot-Based Backups
How Snapshots Work :
1. SSTable files are immutable
2. Snapshot creates hardlinks (not copies!)
3. Hardlinks consume minimal disk space
4. Compaction doesn't delete files with active hardlinks
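To convince yourself that a snapshot hardlink costs almost nothing, here is a small illustration (a Python sketch using a throwaway file rather than a real SSTable):

```python
import os

# Create a dummy "SSTable" and hardlink it, the way nodetool snapshot does
with open("/tmp/fake-sstable.db", "wb") as f:
    f.write(b"x" * 1024 * 1024)                  # 1 MB of data
os.makedirs("/tmp/snapshots", exist_ok=True)
os.link("/tmp/fake-sstable.db", "/tmp/snapshots/fake-sstable.db")

orig = os.stat("/tmp/fake-sstable.db")
link = os.stat("/tmp/snapshots/fake-sstable.db")
print(orig.st_ino == link.st_ino)    # True: same inode, same blocks on disk
print(orig.st_nlink)                 # 2: data survives until both links are removed

# Deleting the original (as compaction would) does not free the blocks
os.remove("/tmp/fake-sstable.db")
print(os.path.getsize("/tmp/snapshots/fake-sstable.db"))   # still 1048576 bytes
```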
Create Snapshot :
# Snapshot all keyspaces
nodetool snapshot
# Snapshot specific keyspace
nodetool snapshot my_keyspace
# Snapshot with name
nodetool snapshot -t backup_20231115 my_keyspace
Snapshot Location :
/var/lib/cassandra/data/<keyspace>/<table>/snapshots/<snapshot_name>/
List Snapshots :
nodetool listsnapshots
# Example output:
Snapshot name Keyspace Table Size
backup_20231115 my_keyspace users 2.3 GB
Delete Snapshot :
# Delete specific snapshot
nodetool clearsnapshot -t backup_20231115
# Delete all snapshots
nodetool clearsnapshot --all
Incremental Backups
Enable Incremental Backups :
# cassandra.yaml
incremental_backups: true
How It Works :
1. When SSTable is flushed, hardlink is created in backups/ directory
2. Backup contains only new data since last full snapshot
3. Restore = Full snapshot + Incremental backups
Backup Location :
/var/lib/cassandra/data/<keyspace>/<table>/backups/
Backup Strategy :
Day 0: Full snapshot
Day 1-6: Incremental backups accumulate
Day 7: Full snapshot + delete old incrementals
Backup to External Storage
Script Example (S3):
#!/bin/bash
# backup_to_s3.sh
KEYSPACE="my_keyspace"
SNAPSHOT="backup_$(date +%Y%m%d_%H%M%S)"
S3_BUCKET="s3://my-cassandra-backups"
# 1. Create snapshot
nodetool snapshot -t "$SNAPSHOT" "$KEYSPACE"
# 2. Find each table's snapshot directory and upload it to S3
for SNAPSHOT_DIR in /var/lib/cassandra/data/$KEYSPACE/*/snapshots/$SNAPSHOT; do
  TABLE_DIR=$(basename "$(dirname "$(dirname "$SNAPSHOT_DIR")")")
  aws s3 sync "$SNAPSHOT_DIR" "$S3_BUCKET/$HOSTNAME/$SNAPSHOT/$TABLE_DIR/"
done
# 3. Delete local snapshot
nodetool clearsnapshot -t "$SNAPSHOT"
echo "Backup complete: $S3_BUCKET/$HOSTNAME/$SNAPSHOT/"
Automate with Cron :
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup_to_s3.sh
Restore from Backup
Full Restore Process :
Stop Cassandra :
systemctl stop cassandra
Clear existing data :
rm -rf /var/lib/cassandra/data/my_keyspace/users/*
Restore snapshot :
# Download from S3
aws s3 sync s3://my-cassandra-backups/node1/backup_20231115/ /tmp/restore/
# Copy to data directory
cp -r /tmp/restore/* /var/lib/cassandra/data/my_keyspace/users/
Fix ownership :
chown -R cassandra:cassandra /var/lib/cassandra/data
Restart Cassandra :
systemctl start cassandra
Run repair (important!):
nodetool repair my_keyspace users
Point-in-Time Recovery
Requirements :
Full snapshot
Incremental backups
CommitLog archives
CommitLog Archiving :
# conf/commitlog_archiving.properties (its own file, not cassandra.yaml)
archive_command=/usr/local/bin/archive_commitlog.sh %path
restore_command=/usr/local/bin/restore_commitlog.sh %from %to
restore_directories=/var/lib/cassandra/commitlog_archive
# Format: yyyy:MM:dd HH:mm:ss
restore_point_in_time=2023:11:15 10:30:00
Archive Script :
#!/bin/bash
# archive_commitlog.sh
aws s3 cp "$1" "s3://my-cassandra-backups/commitlogs/$(hostname)/$(basename "$1")"
Part 7: Troubleshooting Production Issues
Issue 1: High Read Latency
Symptoms :
nodetool tablehistograms keyspace.table
# p99: 500ms (very high!)
Diagnosis Steps :
Check SSTable Count :
nodetool tablestats keyspace.table | grep "SSTable count"
# SSTable count: 73 (too many!)
Solution : Switch to LCS or run compaction:
nodetool compact keyspace table
Check for Wide Partitions :
nodetool cfstats keyspace.table | grep "Compacted partition maximum bytes"
# Compacted partition maximum bytes: 2147483648 (2GB partition!)
Solution : Redesign data model (split partition)
Check Disk I/O :
iostat -x 1
# %util: 98% (disk saturated!)
Solution : Add nodes or upgrade to SSDs
Check for Tombstones :
# Enable tracing
cqlsh> TRACING ON;
cqlsh> SELECT * FROM keyspace.table WHERE id = 123;
# Look for:
# Read 10 live rows and 50000 tombstone cells (too many tombstones!)
Solution : Run repair or adjust gc_grace_seconds
Issue 2: Write Timeouts
Symptoms :
WriteTimeout: Timed out waiting for write response
Diagnosis :
Check Dropped Mutations :
nodetool tpstats | grep "Dropped"
# MUTATION: 12345 (non-zero = bad!)
Cause : Nodes can’t keep up with write load
Check Pending Compactions :
nodetool compactionstats
# pending tasks: 87 (falling behind!)
Cause : Compaction can’t keep up
Solutions :
# Increase compaction threads
nodetool setcompactionthroughput 0 # Unlimited (use carefully!)
# Or in cassandra.yaml:
compaction_throughput_mb_per_sec: 64 # Default: 16
# Increase concurrent compactors
concurrent_compactors: 4 # Default: 2
Check GC Pauses :
grep "GC" /var/log/cassandra/system.log | tail -20
# [GC pause (young) 1234ms] (too long!)
Cause : Heap pressure
Solution : Reduce MemTable size or increase heap
Issue 3: Node Marked as DOWN (But It’s Running)
Symptoms :
nodetool status
# DN 10.0.1.13 (Down/Normal) ← but node is actually running!
Diagnosis :
Check Failure Detector :
nodetool failuredetector
# /10.0.1.13: 15.34 (phi > 8 = marked down)
Check for GC Pauses :
# On node 10.0.1.13
grep "GC pause" /var/log/cassandra/system.log
# [GC pause (young) 5678ms] (> phi threshold!)
Cause : GC pause exceeded failure detector threshold
Solutions :
# Increase phi threshold (cassandra.yaml)
phi_convict_threshold: 12 # Default: 8
# Tune JVM GC
-XX:MaxGCPauseMillis=200
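To see why a long GC pause gets a live node convicted, here is a rough sketch of the phi calculation (Python; a simplified exponential model with a 1-second gossip heartbeat assumed, whereas Cassandra's actual accrual detector estimates the interval distribution from recent heartbeats):

```python
import math

def phi(ms_since_last_heartbeat, mean_heartbeat_interval_ms=1000):
    """phi = -log10(P(heartbeat still arrives)), exponential approximation."""
    return ms_since_last_heartbeat / (mean_heartbeat_interval_ms * math.log(10))

print(phi(3_000))    # ~1.3  -> fine
print(phi(10_000))   # ~4.3  -> suspicious
print(phi(20_000))   # ~8.7  -> above the default phi_convict_threshold of 8
```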
Issue 4: Disk Full
Symptoms :
ERROR: Cannot allocate memory for commitlog segment
Diagnosis :
df -h
# /var/lib/cassandra: 98% used (critical!)
# Check snapshot disk usage
nodetool listsnapshots | awk '{sum+=$4} END {print sum}'
# 2.3 TB in snapshots!
Solutions :
Delete Old Snapshots :
nodetool clearsnapshot --all
Clean Incremental Backups :
rm -rf /var/lib/cassandra/data/*/*/backups/*
Compact Tables :
nodetool compact keyspace table
Add Nodes (long-term solution)
Issue 5: Schema Mismatch
Symptoms :
nodetool describecluster
# Schema versions:
# e84b6a60-1f3e-11ec-9621-0242ac130002: [10.0.1.10, 10.0.1.11]
# f95c7b71-2g4f-22fd-a732-1353bd241113: [10.0.1.12] ← different!
Diagnosis :
Node 10.0.1.12 has different schema
Gossip not propagating schema updates
Solutions :
Force Schema Reset :
# On the outlier node (10.0.1.12)
nodetool resetlocalschema
Restart Gossip :
nodetool disablegossip
nodetool enablegossip
Rolling Restart (last resort):
# Restart nodes one by one
systemctl restart cassandra
Part 8: Advanced Production Topics
Multi-DC Latency Optimization
Problem : Cross-DC writes add 100-200ms latency
Solution : Use LOCAL_QUORUM for writes:
-- Consistency is set per session or per statement, not in the CQL text.
-- In cqlsh:
CONSISTENCY LOCAL_QUORUM;
INSERT INTO users (id, name) VALUES (1, 'Alice');
Why : Write only waits for local DC acknowledgment, remote DC replicates asynchronously.
Trade-off : Remote DC may lag by seconds/minutes during network issues.
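In application code the same thing is done by setting the consistency level on the statement. A minimal Python driver sketch (the contact point and keyspace name are placeholders):

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(['10.0.1.10']).connect('my_keyspace')

insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM  # wait only for the local DC
)
session.execute(insert, (1, 'Alice'))
```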
Read Consistency Tuning
Scenario : Reads occasionally return stale data
Diagnosis : Repair not running frequently enough
Solutions :
Increase Read Repair Chance :
ALTER TABLE users WITH dclocal_read_repair_chance = 0.2; -- 20% of reads
Use Higher Consistency Level :
-- In cqlsh, raise the session consistency level, then read:
CONSISTENCY QUORUM; -- Instead of ONE
SELECT * FROM users WHERE id = 1;
Run Repair More Frequently :
# Daily incremental repair
0 2 * * * nodetool repair -inc
Handling Large Partitions
Problem : Partition > 100MB causes:
High read latency
Timeouts
OOM errors
Detection :
# Find large partitions
nodetool cfstats keyspace.table | grep "Compacted partition maximum bytes"
# Or use SSTable tools
sstablemetadata /var/lib/cassandra/data/keyspace/table/mc-123-big-Data.db | grep "Partition size"
Solutions :
Redesign Data Model (best):
-- Bad: All events for a user in one partition
CREATE TABLE user_events (
  user_id int,
  event_time timestamp,
  event_data text,
  PRIMARY KEY (user_id, event_time) -- user_id partition grows unbounded!
);
-- Good: Bucket by time
CREATE TABLE user_events (
  user_id int,
  bucket text, -- e.g., "2023-11"
  event_time timestamp,
  event_data text,
  PRIMARY KEY ((user_id, bucket), event_time) -- Bounded partitions
);
Add Compaction Threshold :
ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'min_threshold': 4,
  'max_threshold': 32
};
Connection Pool Tuning (Driver-Side)
Python Driver :
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, RetryPolicy, HostDistance

cluster = Cluster(
    contact_points=['10.0.1.10'],
    protocol_version=4,
    # Timeouts
    connect_timeout=10,               # Initial connection timeout (seconds)
    control_connection_timeout=10,
    # Load balancing
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east-1'),
    # Retries
    default_retry_policy=RetryPolicy(),
)

# Connection pool: connections per host are set on the cluster, not as kwargs
cluster.set_core_connections_per_host(HostDistance.LOCAL, 10)
cluster.set_max_connections_per_host(HostDistance.LOCAL, 20)
Java Driver :
// Java driver 4.x: pool size and timeouts are set through the config loader
CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("10.0.1.10", 9042))
    .withLocalDatacenter("us-east-1")
    .withConfigLoader(DriverConfigLoader.programmaticBuilder()
        // Connection pool: connections per node
        .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 10)
        // Timeouts
        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(5))
        .withDuration(DefaultDriverOption.CONNECTION_CONNECT_TIMEOUT, Duration.ofSeconds(10))
        .build())
    .build();
Part 9: Pre-Production Checklist
Load Testing Tools :
cassandra-stress (built-in):
# Write test
cassandra-stress write n=1000000 -node 10.0.1.10 -rate threads=50
# Read test
cassandra-stress read n=1000000 -node 10.0.1.10 -rate threads=50
# Mixed workload
cassandra-stress mixed ratio\(write=1,read=3\) n=1000000 -node 10.0.1.10
NoSQLBench :
# Install
wget https://github.com/nosqlbench/nosqlbench/releases/download/4.15.102/nb.jar
# Run workload
java -jar nb.jar cql-keyvalue rampup-cycles=10000 main-cycles=100000 host=10.0.1.10
Key Metrics to Capture :
Throughput (ops/sec)
Latency (p50, p95, p99, p999)
Error rate
Resource utilization (CPU, RAM, disk, network)
Part 10: Hands-On Exercises
Exercise 1: JVM Tuning
Scenario : Node experiencing 2-second GC pauses
Task :
Analyze GC log: cat /var/log/cassandra/gc.log
Identify problem (heap too large? Young gen too large?)
Adjust JVM settings in jvm11-server.options
Restart and monitor improvements
Expected : GC pauses < 200ms after tuning
Exercise 2: Compaction Strategy Comparison
Task :
Create three identical tables with different compaction strategies (STCS, LCS, TWCS)
Load 10GB of data into each
Run mixed read/write workload with cassandra-stress
Compare SSTable count, read latency, write latency
Questions :
Which strategy has lowest SSTable count?
Which strategy has best read latency?
Which strategy has best write throughput?
Exercise 3: Monitoring Setup
Task :
Install Prometheus + Grafana
Configure JMX exporter on Cassandra nodes
Import Cassandra dashboard
Create custom alerts for:
GC pause > 500ms
Dropped mutations > 0
Pending compactions > 20
Exercise 4: Backup and Restore
Task :
Create snapshot of my_keyspace
Upload to S3 (or local directory)
Drop table: DROP TABLE my_keyspace.users;
Restore from snapshot
Verify data integrity
Exercise 5: Troubleshoot Slow Queries
Scenario : Query taking 5 seconds
Task :
TRACING ON;
SELECT * FROM users WHERE user_id = 12345;
Analyze trace output
Identify bottleneck (tombstones? large partition? SSTable count?)
Apply fix (repair? compaction? data model change?)
Verify improvement
Summary & Production Best Practices
JVM :
Heap: 8GB max (25% of RAM)
GC: G1GC with 200ms pause target
Monitor GC logs continuously
OS :
SSDs for all storage
XFS with noatime
Disable THP and swap
I/O scheduler: noop
Cassandra :
Compaction strategy matches workload
Run repair weekly (within gc_grace_seconds)
Use LOCAL_QUORUM for multi-DC
Monitor: GC, tpstats, compaction, latency
Capacity Planning :
1 core per 1-2TB data
64GB+ RAM (8GB heap + 48GB OS cache)
Disk: 50% headroom for compaction
Network: 10 Gbps minimum
Backup :
Daily snapshots to external storage
Incremental backups enabled
Test restore procedures regularly
Troubleshooting :
High reads → Check SSTable count, tombstones
High writes → Check pending compactions, GC
Timeouts → Check dropped messages, disk I/O
Node down → Check failure detector, GC pauses
What’s Next?
Module 7: Capstone Project - Building a Production System Apply everything you’ve learned to design and implement a real-world Cassandra application at scale