Performance Tuning & Production Operations

Module Duration: 10-12 hours
Learning Style: Deep Technical + Hands-On Tuning + Production War Stories
Outcome: Operate Cassandra clusters at peak performance in production environments

Introduction: The Production Reality

Running Cassandra in production is vastly different from development:
  • Development: Single node, small dataset, tolerant of restarts
  • Production: 50+ nodes, multi-TB per node, 24/7 uptime, millisecond SLAs
This module covers everything you need to run Cassandra successfully in production: JVM tuning, OS configuration, monitoring, capacity planning, and troubleshooting real-world issues.

Part 1: The JVM - Cassandra’s Foundation

Cassandra runs on the Java Virtual Machine (JVM). JVM performance directly impacts Cassandra performance, especially around garbage collection (GC).

Why GC Matters

Problem: Cassandra keeps data in memory (MemTables, caches). When JVM runs GC:
  • Stop-the-world pauses: Application threads freeze
  • Pauses > 1 second → timeouts, failed requests
  • Pauses > 10 seconds → nodes marked as DOWN by failure detector
Goal: Keep GC pauses < 200ms

Heap Size Configuration

Cassandra’s heap is split into two regions:
Heap (Total)
├── Young Generation (Eden + Survivor spaces)
│   └── Short-lived objects (mutations, query results)
└── Old Generation (Tenured)
    └── Long-lived objects (MemTables, caches, bloom filters)
Sizing Rules:
# jvm.options or jvm11-server.options

# Rule 1: Max heap = 8GB or 25% of RAM (whichever is smaller)
# Why? Large heaps = long GC pauses

# Example: 64GB RAM machine
-Xms8G   # Initial heap
-Xmx8G   # Max heap (always match Xms!)

# Rule 2: Young gen = 100MB per CPU core (up to 2GB max)
# Example: 16 cores
-Xmn1600M   # 16 * 100MB = 1.6GB
Why These Limits?
Heap Size | GC Pause Time | Issue
< 4GB     | 50-100ms      | Too small, frequent GC
4-8GB     | 100-200ms     | Optimal
8-16GB    | 200-500ms     | Acceptable for some workloads
> 16GB    | 500ms-5s      | Too long, avoid!
Real Example:
# Bad configuration (128GB RAM)
-Xms32G -Xmx32G  # GC pauses will be 1-3 seconds!

# Good configuration (128GB RAM)
-Xms8G -Xmx8G    # GC pauses ~200ms, use remaining RAM for OS cache
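The two sizing rules can be sanity-checked with a small script on the target host. This is an illustrative sketch only (the file name heap_size_hint.sh is made up); it applies Rule 1 and Rule 2 to whatever machine it runs on:
#!/bin/bash
# heap_size_hint.sh - illustrative only: apply the sizing rules above to this host
ram_gb=$(awk '/MemTotal/ {printf "%d", $2/1024/1024}' /proc/meminfo)
cores=$(nproc)

# Rule 1: heap = min(8GB, 25% of RAM)
heap_gb=$(( ram_gb / 4 )); [ "$heap_gb" -gt 8 ] && heap_gb=8

# Rule 2: young gen = 100MB per core, capped at 2GB
young_mb=$(( cores * 100 )); [ "$young_mb" -gt 2048 ] && young_mb=2048

echo "-Xms${heap_gb}G -Xmx${heap_gb}G -Xmn${young_mb}M"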

Garbage Collection Algorithms

Cassandra supports three main GC algorithms:

1. G1GC (Garbage-First) - Default

Best For: Cassandra 3.0+, default choice
Configuration:
# jvm11-server.options
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=200           # Target pause time
-XX:InitiatingHeapOccupancyPercent=70  # When to start GC
-XX:ParallelGCThreads=16           # GC thread count (# of cores)
How G1GC Works:
Heap divided into ~2000 equal-sized regions (1-32MB each)

Region types:
├── Eden (young objects)
├── Survivor (survived 1+ GC)
├── Old (long-lived)
└── Humongous (objects > 50% region size)

GC Process:
1. Young GC: Collect Eden + Survivor (10-50ms)
2. Concurrent marking: Find live objects in Old gen
3. Mixed GC: Collect Young + some Old regions (50-200ms)
Tuning for Read-Heavy:
# Tolerate more garbage before GC (fewer pauses)
-XX:InitiatingHeapOccupancyPercent=75
Tuning for Write-Heavy:
# Start GC earlier (prevent MemTable flush storms)
-XX:InitiatingHeapOccupancyPercent=65

2. CMS (Concurrent Mark Sweep) - Legacy

Best For: Cassandra 2.x (deprecated in Cassandra 3.0+)
# jvm.options (Cassandra 2.x)
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
Issues:
  • Fragmentation in Old Gen → Full GC (5-10 second pauses!)
  • Deprecated in Java 9+

3. ZGC / Shenandoah - Experimental

Best For: Cassandra 4.0+, Java 11+, cutting-edge deployments
# ZGC (Java 15+)
-XX:+UseZGC
-XX:ZCollectionInterval=120  # GC every 2 minutes

# Shenandoah (Java 11+)
-XX:+UseShenandoahGC
Benefits:
  • Sub-10ms pauses even with 100GB heaps!
  • Still experimental for Cassandra

GC Logging and Monitoring

Enable GC Logging:
# jvm11-server.options
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/cassandra/gc.log:time,uptime:filecount=10,filesize=10m
Analyze GC Logs:
# View GC log
tail -f /var/log/cassandra/gc.log

# Example GC event:
[2023-11-15T10:23:45.678+0000][14.523s] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 2048M->512M(8192M) 45.678ms
Breakdown:
  • Pause Young: Young generation collection
  • 2048M->512M: Heap before → after GC
  • (8192M): Total heap size
  • 45.678ms: Pause duration (monitor this!)
Tools for GC Analysis:
  1. GCViewer (GUI):
# Download GCViewer
wget https://github.com/chewiebug/GCViewer/releases/download/1.36/gcviewer-1.36.jar

# Visualize GC log
java -jar gcviewer-1.36.jar /var/log/cassandra/gc.log
  2. GCEasy (Web-based):
# Upload gc.log to https://gceasy.io
# Generates detailed analysis and recommendations
Key Metrics to Watch:
  • Pause time p99: Should be < 200ms
  • Pause frequency: Young GC every 5-10 seconds is normal
  • Full GC events: Should be 0! Any Full GC is a red flag
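To spot violations of the 200ms target without a GUI, a rough scan against the unified log format shown above works (it assumes the pause duration is the last field, e.g. 45.678ms):
# Print every GC event whose pause exceeded 200ms
awk '/Pause/ {ms=$NF; sub(/ms$/, "", ms); if (ms+0 > 200) print}' /var/log/cassandra/gc.log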

Common GC Issues

Issue 1: Frequent Full GCs

Symptoms:
[GC] Pause Full (Allocation Failure) 8192M->7890M(8192M) 5234.567ms
Causes:
  • Heap too small
  • Memory leak (improper cache configuration)
  • Large object allocation (huge queries)
Solutions:
# 1. Increase heap (up to 8GB)
-Xms8G -Xmx8G

# 2. Check for memory leaks
nodetool info | grep "Heap Memory"

# 3. Limit query size
# cassandra.yaml:
tombstone_failure_threshold: 10000
range_request_timeout_in_ms: 10000

Issue 2: Long Young GC Pauses

Symptoms:
[GC] Pause Young 3000M->500M(8192M) 850.123ms  # > 500ms!
Cause: Young gen too large
Solution:
# Reduce young gen size
-Xmn1G  # Instead of -Xmn2G

Issue 3: Memory Pressure

Symptoms:
  • Constant GC activity
  • nodetool tpstats shows dropped mutations
  • Heap constantly near max
Diagnosis:
# Check heap usage
nodetool info | grep "Heap Memory"

# Example bad output:
Heap Memory (MB): 7890 / 8192  # 96% used!
Solutions:
# 1. Reduce MemTable size (cassandra.yaml)
memtable_heap_space_in_mb: 2048  # Default: calculated

# 2. Reduce cache sizes
key_cache_size_in_mb: 100        # Default: auto
row_cache_size_in_mb: 0          # Disable if not needed

# 3. Add more nodes (scale out)

Part 2: Operating System Tuning

Disk I/O Configuration

Cassandra is I/O intensive. OS settings massively impact performance.

File System Choice

FS   | Performance | Recommendation
ext4 | Good        | Safe default
XFS  | Better      | Recommended for large volumes
ZFS  | Variable    | Use with caution (compression conflicts)
NTFS | Poor        | Avoid (Windows)
Mount Options (XFS):
# /etc/fstab
/dev/sdb1  /var/lib/cassandra  xfs  noatime,nodiratime,nobarrier  0  0
Why These Options?
  • noatime: Don’t update access time (reduces writes)
  • nodiratime: Don’t update directory access time
  • nobarrier: Disable write barriers (safe with battery-backed RAID)

I/O Scheduler

For SSDs:
# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set to noop or none (optimal for SSDs)
echo noop > /sys/block/sda/queue/scheduler

# Make persistent (add to /etc/rc.local)
echo "echo noop > /sys/block/sda/queue/scheduler" >> /etc/rc.local
For HDDs:
# Use deadline scheduler
echo deadline > /sys/block/sda/queue/scheduler

Readahead

Default: Often 8KB (too small for Cassandra)
# Check current readahead
blockdev --getra /dev/sda

# Set to 8-16MB for sequential reads
blockdev --setra 16384 /dev/sda  # 16384 * 512 bytes = 8MB

# Make persistent
echo "blockdev --setra 16384 /dev/sda" >> /etc/rc.local

Linux Kernel Settings

Critical sysctl Settings:
# /etc/sysctl.conf

# Network tuning
net.core.rmem_max = 16777216           # Max receive buffer
net.core.wmem_max = 16777216           # Max send buffer
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 5000     # Queue size for network packets

# VM settings
vm.max_map_count = 1048575             # Max memory maps (Cassandra mmaps many SSTable files)
vm.swappiness = 1                      # Avoid swap (but don't disable entirely)

# File descriptor limits
fs.file-max = 1048576                  # System-wide max FDs
Apply Settings:
sysctl -p

User Limits

Cassandra opens many files simultaneously:
# /etc/security/limits.conf

cassandra soft nofile 65536
cassandra hard nofile 65536
cassandra soft nproc 65536
cassandra hard nproc 65536
cassandra soft memlock unlimited
cassandra hard memlock unlimited
cassandra soft as unlimited
cassandra hard as unlimited
Verify:
# As cassandra user
ulimit -n  # File descriptors (should be 65536)
ulimit -u  # Processes (should be 65536)

Swap Configuration

Philosophy: Minimize swap, but don’t disable it entirely.
Why Not Disable Swap?
  • Linux kernel may OOM-kill Cassandra if no swap
  • Small swap (1-2GB) acts as emergency overflow
Configuration:
# Set swappiness to 1 (only use swap in emergencies)
sysctl -w vm.swappiness=1

# Limit swap size to 1GB
dd if=/dev/zero of=/swapfile bs=1G count=1
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Add to /etc/fstab
echo "/swapfile none swap sw 0 0" >> /etc/fstab

Transparent Huge Pages (THP)

Issue: THP causes GC pauses and memory fragmentation.
Disable THP:
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never  ← bad (enabled)

# Disable THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Make persistent (add to /etc/rc.local)
echo "echo never > /sys/kernel/mm/transparent_hugepage/enabled" >> /etc/rc.local
echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local

CPU Governor

For Performance:
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Set to performance (max CPU frequency)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > $cpu
done

# Install cpufrequtils for persistence
apt-get install cpufrequtils
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils
systemctl restart cpufrequtils
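A quick way to confirm the settings from this part survive a reboot is a small check script. A sketch, assuming /dev/sda is the data disk (adjust the device, and repeat for additional disks):
#!/bin/bash
# verify_os_tuning.sh - print the OS settings covered in Part 2 for review
echo "THP enabled:   $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
echo "Swappiness:    $(sysctl -n vm.swappiness)"
echo "max_map_count: $(sysctl -n vm.max_map_count)"
echo "I/O scheduler: $(cat /sys/block/sda/queue/scheduler)"
echo "Readahead:     $(blockdev --getra /dev/sda) sectors"
echo "Open files:    $(ulimit -n)"
echo "CPU governor:  $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"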

Part 3: Cassandra Configuration Tuning

Compaction Strategy Selection

Choosing the right compaction strategy is critical for performance.

STCS (Size-Tiered Compaction Strategy)

Best For: Write-heavy, time-series data, small tables
How It Works:
SSTables grouped by similar size:
Tier 1: [1MB, 1MB, 1MB, 1MB] → Compact to 4MB
Tier 2: [4MB, 4MB, 4MB, 4MB] → Compact to 16MB
Tier 3: [16MB, 16MB, 16MB, 16MB] → Compact to 64MB
Configuration:
ALTER TABLE users WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,          -- Min SSTables to compact
  'max_threshold': 32          -- Max SSTables to compact
};
Pros:
  • Fast writes (less compaction overhead)
  • Simple, predictable
Cons:
  • Read amplification (query may touch many SSTables)
  • Temporary disk space = 2x data size during compaction

LCS (Leveled Compaction Strategy)

Best For: Read-heavy, frequently updated data
How It Works:
L0: [10MB, 10MB, 10MB, 10MB]  (unsorted)
↓ Compact into L1
L1: [10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB]  (sorted, max 100MB)
↓ Compact into L2
L2: [10MB × 100 SSTables]  (sorted, max 1GB)
Configuration:
ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160    -- SSTable target size
};
Pros:
  • Low read amplification (90% reads touch 1 SSTable)
  • Predictable disk space usage
Cons:
  • More compaction overhead (impacts writes)
  • More I/O intensive

TWCS (Time Window Compaction Strategy)

Best For: Time-series data with TTL
How It Works:
Bucket by time window:
Window 1 (Day 1): [SSTable, SSTable] → Compact
Window 2 (Day 2): [SSTable, SSTable] → Compact
Window 3 (Day 3): [SSTable, SSTable] → Compact

After TTL expires:
→ Delete entire window (entire SSTable dropped, instant!)
Configuration:
CREATE TABLE metrics (
  sensor_id int,
  timestamp timestamp,
  value double,
  PRIMARY KEY ((sensor_id), timestamp)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': 1,      -- Window size
  'compaction_window_unit': 'DAYS'  -- HOURS, DAYS, WEEKS
} AND default_time_to_live = 2592000;  -- 30 days
Pros:
  • Ultra-fast TTL deletion (drop entire SSTable)
  • Minimal read amplification for time-range queries
Cons:
  • Only suitable for time-series with TTL
Comparison Table:
Workload                   | Strategy | Reason
Write-heavy, small dataset | STCS     | Fast writes, simple
Read-heavy, updates        | LCS      | Low read amplification
Time-series with TTL       | TWCS     | Fast TTL deletion
IoT sensor data            | TWCS     | Time-based, expires
User profiles (read-heavy) | LCS     | Frequently updated, read often
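Before changing anything, it helps to see which strategy each table currently uses. A quick check via cqlsh (the keyspace name my_keyspace is a placeholder):
# Compaction settings are stored in the schema tables
cqlsh -e "SELECT keyspace_name, table_name, compaction FROM system_schema.tables WHERE keyspace_name = 'my_keyspace';"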

MemTable Configuration

MemTables are in-memory write buffers. Tuning them balances memory vs. flush frequency.
# cassandra.yaml

# Total heap space for all MemTables (MB)
memtable_heap_space_in_mb: 2048  # 25% of heap is reasonable

# Number of concurrent MemTable flush writer threads
memtable_flush_writers: 2

# Control flush frequency
memtable_cleanup_threshold: 0.5  # Flush when MemTable space is 50% used
Trade-offs:
Setting             | Small MemTable | Large MemTable
Flush frequency     | Frequent       | Infrequent
SSTables created    | Many           | Few
Write performance   | Lower          | Higher
Compaction overhead | Higher         | Lower
Memory usage        | Lower          | Higher
Recommendation: Default is usually good. Only adjust if:
  • Many small writes: Increase MemTable size (reduce flush frequency)
  • Memory pressure: Decrease MemTable size

Cache Configuration

Cassandra has three caches:

1. Key Cache

Purpose: Cache partition key → SSTable mapping (avoid Bloom filter checks)
Configuration:
# cassandra.yaml
key_cache_size_in_mb: 100  # Or 'auto' (5% of heap)
key_cache_save_period: 14400  # Save to disk every 4 hours
When to Use:
  • Read-heavy workloads with hot partitions
  • Queries by primary key
When to Disable:
  • Write-heavy workloads
  • Cold data (rarely queried)

2. Row Cache

Purpose: Cache entire rows (most aggressive caching)
Configuration:
# cassandra.yaml
row_cache_size_in_mb: 0  # DISABLED by default (use carefully!)
Warning: Row cache is dangerous:
  • Consumes heap (increases GC pressure)
  • Only helps if reading exact same rows repeatedly
  • Most production clusters disable this
When to Use: Small, frequently-read tables (e.g., configuration)

3. Counter Cache

Purpose: Cache counter column values (counter tables only)
Configuration:
# cassandra.yaml
counter_cache_size_in_mb: 50
Enable Per Table:
ALTER TABLE counters WITH caching = {
  'keys': 'ALL',
  'rows_per_partition': 'ALL'
};

Commit Log Tuning

CommitLog is Cassandra’s write-ahead log. Two Modes:
  1. Periodic (default):
# cassandra.yaml
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000  # Sync every 10 seconds
  • Pros: High write throughput
  • Cons: Up to 10 seconds of data loss on crash
  2. Batch:
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2  # Sync every 2ms
  • Pros: Minimal data loss (2ms window)
  • Cons: 30-50% lower write throughput
Disk Configuration:
# Put CommitLog on separate disk (if using HDD)
commitlog_directory: /mnt/commitlog  # Separate disk!

# CommitLog size
commitlog_total_space_in_mb: 8192  # 8GB
Best Practice: Use separate SSD for CommitLog if using HDDs for data.

Concurrent Operations

Control thread pool sizes:
# cassandra.yaml

# Read thread pool (handle read requests)
concurrent_reads: 32  # Default: 32

# Write thread pool (handle write requests)
concurrent_writes: 32  # Default: 32

# Counter write thread pool
concurrent_counter_writes: 32  # Default: 32

# Compaction threads
concurrent_compactors: 2  # 1 per disk spindle (HDD) or 2-4 (SSD)
Tuning:
  • More CPU cores: Increase concurrent_reads/writes
  • More disks: Increase concurrent_compactors
  • Memory constrained: Decrease to reduce overhead

Part 4: Monitoring and Observability

Key Metrics to Monitor

1. System Metrics

CPU:
# Monitor CPU usage
top
htop

# Check per-core usage
mpstat -P ALL 1
Target: < 80% average, < 95% peak
Disk I/O:
# Monitor disk I/O
iostat -x 1

# Key metrics:
# - %util: Disk utilization (should be < 90%)
# - await: Average wait time (should be < 10ms SSD, < 20ms HDD)
# - r/s, w/s: Reads/writes per second
Network:
# Monitor network throughput
iftop
nethogs

# Check dropped packets
netstat -s | grep -i drop

2. JVM Metrics

Heap Usage:
nodetool info | grep "Heap Memory"

# Example output:
Heap Memory (MB): 3456 / 8192  # 42% used (healthy)
GC Metrics:
# GC stats
nodetool gcstats

# Example output:
Interval (ms) Max GC Elapsed (ms) Total GC Elapsed (ms) Stdev GC Elapsed (ms)
10234         189                  3456                  45
Target: Max GC < 200ms

3. Cassandra Metrics

Thread Pool Stats:
nodetool tpstats

# Example output:
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         2         0       12345678         0                 0
MutationStage                     1         0       98765432         0                 0
CompactionExecutor                2        15         123456         0                 0

Message type           Dropped
READ                         0
MUTATION                     0  # Any non-zero is BAD!
Critical: Dropped messages should be 0!
Table Statistics:
nodetool tablestats keyspace.table

# Key metrics:
# - SSTable count (should be < 20 for LCS, < 50 for STCS)
# - Space used by snapshots
# - Read/write latency
Compaction Stats:
nodetool compactionstats

# Example output:
pending tasks: 5
- keyspace.table: 5

Active compactions:
compaction type        keyspace   table      completed    total   progress
Compaction             test       users      456 MB       1.2 GB  38.00%
Pending compactions should stay low (< 20). High pending = falling behind.

4. Performance Metrics

Latency:
nodetool tablehistograms keyspace.table

# Example output:
Percentile      Read Latency     Write Latency
50%                 2.1 ms           0.8 ms
75%                 4.5 ms           1.2 ms
95%                 12.3 ms          3.4 ms
99%                 28.7 ms          8.9 ms
Targets:
  • p50 < 5ms
  • p95 < 20ms
  • p99 < 50ms
Throughput:
nodetool tablestats keyspace.table | grep -i "read\|write"

# Example output:
Local read count: 12345678
Local read latency: 3.456 ms
Local write count: 98765432
Local write latency: 1.234 ms

Monitoring Tools

1. Prometheus + Grafana

Architecture:
Cassandra nodes → JMX → Prometheus JMX Exporter → Prometheus → Grafana
Setup:
  1. Install JMX Exporter:
# Download jmx_prometheus_javaagent
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.0/jmx_prometheus_javaagent-0.17.0.jar

# Move to Cassandra directory
mv jmx_prometheus_javaagent-0.17.0.jar /usr/share/cassandra/
  2. Configure JMX Exporter (cassandra_jmx.yml):
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value
    name: cassandra_$1_$2
  3. Add to JVM Options:
# jvm11-server.options
-javaagent:/usr/share/cassandra/jmx_prometheus_javaagent-0.17.0.jar=7070:/etc/cassandra/cassandra_jmx.yml
  4. Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'cassandra'
    static_configs:
      - targets: ['node1:7070', 'node2:7070', 'node3:7070']
  5. Import a Grafana dashboard for Cassandra metrics.

2. DataStax OpsCenter

Commercial tool with free tier:
# Install OpsCenter
apt-get install opscenter

# Start OpsCenter
service opscenterd start

# Access: http://localhost:8888
Features:
  • Visual cluster topology
  • Performance graphs
  • Repair scheduling
  • Backup management

3. Nodetool (Built-in)

Quick Checks:
# Overall cluster health
nodetool status

# Node-specific info
nodetool info

# Performance metrics
nodetool tpstats
nodetool tablehistograms keyspace.table
nodetool cfhistograms keyspace.table

# GC monitoring
nodetool gcstats

# Compaction status
nodetool compactionstats

Alert Thresholds

Critical Alerts:
Metric              | Threshold | Action
Dropped mutations   | > 0       | Investigate immediately (data loss!)
GC pause time p99   | > 1s      | Tune JVM or add nodes
Pending compactions | > 50      | Increase compaction threads
Disk usage          | > 85%     | Add capacity or cleanup
Node down           | Any       | Investigate and repair
Warning Alerts:
Metric            | Threshold | Action
Read latency p99  | > 50ms    | Check query patterns
Write latency p99 | > 20ms    | Check disk I/O
Heap usage        | > 85%     | Review cache settings
Pending hints     | > 1GB     | Check network/nodes
SSTable count     | > 50      | Review compaction strategy
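Both tables translate naturally into a scheduled check. A rough, cron-friendly sketch covering two of the critical thresholds (hook the echo lines into whatever alerting you already run; tpstats labels can vary slightly between Cassandra versions):
#!/bin/bash
# alert_check.sh - flag dropped mutations (> 0) and pending compactions (> 50)
dropped=$(nodetool tpstats | awk '/^MUTATION/ {sum += $2} END {print sum + 0}')
pending=$(nodetool compactionstats | awk '/pending tasks/ {print $3; exit}')

[ "${dropped:-0}" -gt 0 ]  && echo "CRITICAL: $dropped dropped mutations on $(hostname)"
[ "${pending:-0}" -gt 50 ] && echo "CRITICAL: $pending pending compactions on $(hostname)"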

Part 5: Capacity Planning

Disk Capacity

Formula:
Required Disk = (Data Size × Replication Factor × (1 + Compaction Overhead)) / Number of Nodes
Compaction Overhead:
  • STCS: 50% (2x data during compaction)
  • LCS: 10% (1.1x data)
  • TWCS: 20% (1.2x data)
Example:
Data size: 10 TB
Replication factor: 3
Compaction strategy: STCS (50% overhead)
Number of nodes: 10

Disk per node = (10 TB × 3 × 1.5) / 10 = 4.5 TB

Recommendation: 6 TB disks (33% headroom)
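The same arithmetic as a throwaway shell calculation (numbers are the example's assumptions):
data_tb=10; rf=3; overhead=1.5; nodes=10
echo "scale=1; ($data_tb * $rf * $overhead) / $nodes" | bc   # 4.5 TB per node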

Memory Capacity

Formula:
Total RAM = Heap + OS Cache + OS Overhead

Heap: 8GB (max recommended)
OS Cache: 50-75% of remaining RAM
OS Overhead: 4-8GB
Example (64GB RAM):
Heap: 8GB
OS Overhead: 8GB
OS Cache: 48GB (for caching SSTables)

Total: 64GB ✓
Why Large OS Cache?
  • Cassandra relies on OS page cache for SSTable caching
  • Larger cache = fewer disk reads = better performance

CPU Capacity

Rule of Thumb: 1 CPU core per 1-2 TB of data
Example:
Data per node: 4TB
Recommended cores: 4-8 cores

For read-heavy: 8 cores (more parallelism)
For write-heavy: 4 cores (less GC pressure)

Network Capacity

Formula:
Network throughput = (Write rate × Replication factor) + (Read rate) + (Repair traffic)
Example:
Write rate: 10,000 writes/sec × 5 KB/write = 50 MB/s
Replication factor: 3
Read rate: 5,000 reads/sec × 10 KB/read = 50 MB/s

Network needed = (50 MB/s × 3) + 50 MB/s = 200 MB/s

Recommendation: 1 Gbps NIC (125 MB/s max) is too small!
Use: 10 Gbps NIC (1250 MB/s max)
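The same estimate in shell form (illustrative numbers from the example; repair and streaming traffic come on top):
write_mb=50; rf=3; read_mb=50
echo "$(( write_mb * rf + read_mb )) MB/s sustained"   # 200 MB/s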

Scaling Triggers

When to Add Nodes:
Metric                    | Threshold          | Action
Disk usage                | > 70%              | Add nodes (or cleanup)
CPU usage                 | > 75% average      | Add nodes
Read latency p99          | > 50ms (sustained) | Add nodes or optimize
Write latency p99         | > 20ms (sustained) | Add nodes or tune
Compaction falling behind | Pending > 30       | Add nodes or tune compaction
Scaling Example:
Initial: 6 nodes, 4TB each = 24TB total capacity (8TB user data × RF=3)

After 6 months:
- Data grows to 12TB
- 6 nodes × 4TB = 24TB total capacity
- Usage: (12TB × 3) / 24TB = 150% (OVER CAPACITY!)

Action: Add 6 more nodes
- 12 nodes × 4TB = 48TB total capacity
- Usage: (12TB × 3) / 48TB = 75% (healthy)

Part 6: Backup and Disaster Recovery

Snapshot-Based Backups

How Snapshots Work:
1. SSTable files are immutable
2. Snapshot creates hardlinks (not copies!)
3. Hardlinks consume minimal disk space
4. Compaction doesn't delete files with active hardlinks
Create Snapshot:
# Snapshot all keyspaces
nodetool snapshot

# Snapshot specific keyspace
nodetool snapshot my_keyspace

# Snapshot with name
nodetool snapshot -t backup_20231115 my_keyspace
Snapshot Location:
/var/lib/cassandra/data/<keyspace>/<table>/snapshots/<snapshot_name>/
List Snapshots:
nodetool listsnapshots

# Example output:
Snapshot name    Keyspace    Table    Size
backup_20231115  my_keyspace users    2.3 GB
Delete Snapshot:
# Delete specific snapshot
nodetool clearsnapshot -t backup_20231115

# Delete all snapshots
nodetool clearsnapshot --all

Incremental Backups

Enable Incremental Backups:
# cassandra.yaml
incremental_backups: true
How It Works:
1. When SSTable is flushed, hardlink is created in backups/ directory
2. Backup contains only new data since last full snapshot
3. Restore = Full snapshot + Incremental backups
Backup Location:
/var/lib/cassandra/data/<keyspace>/<table>/backups/
Backup Strategy:
Day 0: Full snapshot
Day 1-6: Incremental backups accumulate
Day 7: Full snapshot + delete old incrementals
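Incremental backup files accumulate under backups/ until you ship and clear them. A minimal sketch, assuming the same S3 bucket used by the snapshot script in the next section (the S3 key layout is just one possible convention):
#!/bin/bash
# ship_incrementals.sh - upload incremental backups, then clear the shipped files
S3_BUCKET="s3://my-cassandra-backups"
for dir in /var/lib/cassandra/data/*/*/backups; do
  # Keyspace/table are preserved in the key because $dir contains the full path
  aws s3 sync "$dir" "$S3_BUCKET/$HOSTNAME/incrementals$dir" && rm -f "$dir"/*
done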

Backup to External Storage

Script Example (S3):
#!/bin/bash
# backup_to_s3.sh

KEYSPACE="my_keyspace"
SNAPSHOT="backup_$(date +%Y%m%d_%H%M%S)"
S3_BUCKET="s3://my-cassandra-backups"

# 1. Create snapshot
nodetool snapshot -t $SNAPSHOT $KEYSPACE

# 2. Find snapshot directory
SNAPSHOT_DIR="/var/lib/cassandra/data/$KEYSPACE/*/snapshots/$SNAPSHOT"

# 3. Upload to S3
aws s3 sync $SNAPSHOT_DIR $S3_BUCKET/$HOSTNAME/$SNAPSHOT/

# 4. Delete local snapshot
nodetool clearsnapshot -t $SNAPSHOT

echo "Backup complete: $S3_BUCKET/$HOSTNAME/$SNAPSHOT/"
Automate with Cron:
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup_to_s3.sh

Restore from Backup

Full Restore Process:
  1. Stop Cassandra:
systemctl stop cassandra
  2. Clear existing data:
rm -rf /var/lib/cassandra/data/my_keyspace/users/*
  3. Restore snapshot:
# Download from S3
aws s3 sync s3://my-cassandra-backups/node1/backup_20231115/ /tmp/restore/

# Copy to data directory
cp -r /tmp/restore/* /var/lib/cassandra/data/my_keyspace/users/
  4. Fix ownership:
chown -R cassandra:cassandra /var/lib/cassandra/data
  5. Restart Cassandra:
systemctl start cassandra
  6. Run repair (important!):
nodetool repair my_keyspace users

Point-in-Time Recovery

Requirements:
  • Full snapshot
  • Incremental backups
  • CommitLog archives
CommitLog Archiving:
# conf/commitlog_archiving.properties (not cassandra.yaml)
archive_command=/usr/local/bin/archive_commitlog.sh %path
restore_command=/usr/local/bin/restore_commitlog.sh %from %to
restore_directories=/var/lib/cassandra/commitlog_archive
restore_point_in_time=2023:11:15 10:30:00
Archive Script:
#!/bin/bash
# archive_commitlog.sh
aws s3 cp $1 s3://my-cassandra-backups/commitlogs/$(hostname)/$(basename $1)
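The configuration above also references a restore script. A minimal sketch of that counterpart, assuming the archived segments have already been synced from S3 into restore_directories before Cassandra starts:
#!/bin/bash
# restore_commitlog.sh - $1 = %from (archived segment), $2 = %to (live commitlog path)
cp -f "$1" "$2"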

Part 7: Troubleshooting Production Issues

Issue 1: High Read Latency

Symptoms:
nodetool tablehistograms keyspace.table
# p99: 500ms (very high!)
Diagnosis Steps:
  1. Check SSTable Count:
nodetool tablestats keyspace.table | grep "SSTable count"
# SSTable count: 73  (too many!)
Solution: Switch to LCS or run compaction:
nodetool compact keyspace table
  2. Check for Wide Partitions:
nodetool cfstats keyspace.table | grep "Compacted partition maximum bytes"
# Compacted partition maximum bytes: 2147483648  (2GB partition!)
Solution: Redesign data model (split partition)
  3. Check Disk I/O:
iostat -x 1
# %util: 98%  (disk saturated!)
Solution: Add nodes or upgrade to SSDs
  4. Check for Tombstones:
# Enable tracing
cqlsh> TRACING ON
cqlsh> SELECT * FROM keyspace.table WHERE id = 123;

# Look for:
# Read 10 live rows and 50000 tombstone cells  (too many tombstones!)
Solution: Run repair or adjust gc_grace_seconds
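The first three checks can be bundled into a single triage script (the tombstone check still needs TRACING in cqlsh). A rough sketch, taking keyspace and table as arguments:
#!/bin/bash
# triage_read_latency.sh <keyspace> <table>
KS=$1; TBL=$2
echo "== SSTable count =="
nodetool tablestats $KS.$TBL | grep "SSTable count"
echo "== Largest compacted partition =="
nodetool tablestats $KS.$TBL | grep "Compacted partition maximum bytes"
echo "== Disk utilization (two iostat reports) =="
iostat -x 1 2
echo "== Latency percentiles =="
nodetool tablehistograms $KS.$TBL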

Issue 2: Write Timeouts

Symptoms:
WriteTimeout: Timed out waiting for write response
Diagnosis:
  1. Check Dropped Mutations:
nodetool tpstats | grep "Dropped"
# MUTATION: 12345  (non-zero = bad!)
Cause: Nodes can’t keep up with write load
  2. Check Pending Compactions:
nodetool compactionstats
# pending tasks: 87  (falling behind!)
Cause: Compaction can’t keep up
Solutions:
# Increase compaction threads
nodetool setcompactionthroughput 0  # Unlimited (use carefully!)

# Or in cassandra.yaml:
compaction_throughput_mb_per_sec: 64  # Default: 16

# Increase concurrent compactors
concurrent_compactors: 4  # Default: 2
  3. Check GC Pauses:
grep "GC" /var/log/cassandra/system.log | tail -20
# [GC pause (young) 1234ms]  (too long!)
Cause: Heap pressure
Solution: Reduce MemTable size or increase heap

Issue 3: Node Marked as DOWN (But It’s Running)

Symptoms:
nodetool status
# DN  10.0.1.13  (Down/Normal)  ← but node is actually running!
Diagnosis:
  1. Check Failure Detector:
nodetool failuredetector
# /10.0.1.13: 15.34  (phi > 8 = marked down)
  2. Check for GC Pauses:
# On node 10.0.1.13
grep "GC pause" /var/log/cassandra/system.log
# [GC pause (young) 5678ms]  (> phi threshold!)
Cause: GC pause exceeded failure detector threshold
Solutions:
# Increase phi threshold (cassandra.yaml)
phi_convict_threshold: 12  # Default: 8

# Tune JVM GC
-XX:MaxGCPauseMillis=200

Issue 4: Disk Full

Symptoms:
ERROR: Cannot allocate memory for commitlog segment
Diagnosis:
df -h
# /var/lib/cassandra: 98% used (critical!)

# Check snapshot disk usage
nodetool listsnapshots | awk '{sum+=$4} END {print sum}'
# 2.3 TB in snapshots!
Solutions:
  1. Delete Old Snapshots:
nodetool clearsnapshot --all
  2. Clean Incremental Backups:
rm -rf /var/lib/cassandra/data/*/*/backups/*
  3. Compact Tables:
nodetool compact
  4. Add Nodes (long-term solution)

Issue 5: Schema Mismatch

Symptoms:
nodetool describecluster
# Schema versions:
#   e84b6a60-1f3e-11ec-9621-0242ac130002: [10.0.1.10, 10.0.1.11]
#   f95c7b71-2b4f-22fd-a732-1353bd241113: [10.0.1.12]  ← different!
Diagnosis:
  • Node 10.0.1.12 has different schema
  • Gossip not propagating schema updates
Solutions:
  1. Force Schema Reset:
# On the outlier node (10.0.1.12)
nodetool resetlocalschema
  2. Restart Gossip:
nodetool disablegossip
nodetool enablegossip
  3. Rolling Restart (last resort):
# Restart nodes one by one
systemctl restart cassandra

Part 8: Advanced Production Topics

Multi-DC Latency Optimization

Problem: Cross-DC writes add 100-200ms latency
Solution: Use LOCAL_QUORUM for writes:
-- Consistency is set per session/statement (e.g., in cqlsh or the driver), not in the CQL itself:
CONSISTENCY LOCAL_QUORUM;
INSERT INTO users (id, name) VALUES (1, 'Alice');
Why: Write only waits for local DC acknowledgment; the remote DC replicates asynchronously.
Trade-off: Remote DC may lag by seconds/minutes during network issues.

Read Consistency Tuning

Scenario: Reads occasionally return stale data
Diagnosis: Repair not running frequently enough
Solutions:
  1. Increase Read Repair Chance:
ALTER TABLE users WITH dclocal_read_repair_chance = 0.2;  -- 20% of reads (Cassandra 3.x; removed in 4.0)
  2. Use Higher Consistency Level:
CONSISTENCY QUORUM;  -- set in cqlsh or via the driver, instead of ONE
SELECT * FROM users WHERE id = 1;
  3. Run Repair More Frequently:
# Daily incremental repair
0 2 * * * nodetool repair -inc

Handling Large Partitions

Problem: Partition > 100MB causes:
  • High read latency
  • Timeouts
  • OOM errors
Detection:
# Find large partitions
nodetool cfstats keyspace.table | grep "Compacted partition maximum bytes"

# Or use SSTable tools
sstablemetadata /var/lib/cassandra/data/keyspace/table/mc-123-big-Data.db | grep "Partition size"
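To sweep every table on a node rather than checking one at a time, a rough scan over nodetool tablestats (the threshold and script name are illustrative):
#!/bin/bash
# find_large_partitions.sh - list tables whose largest compacted partition exceeds 100MB
THRESHOLD=$((100 * 1024 * 1024))
nodetool tablestats | awk -v t=$THRESHOLD '
  /Table:/ {table = $2}
  /Compacted partition maximum bytes/ {if ($NF + 0 > t) print table, $NF}'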
Solutions:
  1. Redesign Data Model (best):
-- Bad: All events for a user in one partition
CREATE TABLE user_events (
    user_id int,
    event_time timestamp,
    event_data text,
    PRIMARY KEY (user_id, event_time)  -- user_id partition grows unbounded!
);

-- Good: Bucket by time
CREATE TABLE user_events (
    user_id int,
    bucket text,  -- e.g., "2023-11"
    event_time timestamp,
    event_data text,
    PRIMARY KEY ((user_id, bucket), event_time)  -- Bounded partitions
);
  2. Adjust Compaction Thresholds (partial mitigation; thresholds are sub-options of the compaction map):
ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'min_threshold': 4,
  'max_threshold': 32
};

Connection Pool Tuning (Driver-Side)

Python Driver:
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, HostDistance, RetryPolicy

cluster = Cluster(
    contact_points=['10.0.1.10'],
    protocol_version=4,

    # Timeouts
    connect_timeout=10,             # Initial connection timeout (seconds)
    control_connection_timeout=10,

    # Load balancing
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east-1'),

    # Retries
    default_retry_policy=RetryPolicy(),
)

# Connection pool: per-host connection counts are set on the Cluster object
cluster.set_core_connections_per_host(HostDistance.LOCAL, 10)
cluster.set_max_connections_per_host(HostDistance.LOCAL, 20)

session = cluster.connect()
Java Driver:
// Driver 4.x: pool size and timeouts are configured through the driver config loader
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.net.InetSocketAddress;
import java.time.Duration;

DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
    // Connection pool
    .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 10)
    // Timeouts
    .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(5))
    .withDuration(DefaultDriverOption.CONNECTION_CONNECT_TIMEOUT, Duration.ofSeconds(10))
    .build();

CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("10.0.1.10", 9042))
    .withLocalDatacenter("us-east-1")
    .withConfigLoader(loader)
    .build();

Part 9: Performance Checklist

Pre-Production Checklist

  • Hardware
    • SSDs for data and commitlog
    • 10 Gbps network
    • 64GB+ RAM per node
    • 8+ CPU cores per node
  • OS Configuration
    • XFS filesystem with noatime,nodiratime
    • Small emergency swap with swappiness=1 (per Part 2)
    • THP disabled
    • I/O scheduler: noop (SSD) or deadline (HDD)
    • Readahead: 8-16MB
    • File descriptor limits: 65536
    • CPU governor: performance
  • JVM Configuration
    • Heap: 8GB max
    • G1GC enabled
    • GC logging enabled
    • GC pause target: 200ms
  • Cassandra Configuration
    • Compaction strategy matches workload
    • Concurrent operations tuned
    • Caches configured
    • CommitLog on separate disk (if HDD)
    • NetworkTopologyStrategy for multi-DC
  • Monitoring
    • Prometheus + Grafana or OpsCenter
    • Alerts configured
    • Log aggregation (e.g., ELK stack)
  • Backup
    • Snapshot schedule configured
    • Incremental backups enabled
    • Restore procedure tested
  • Repair
    • Automated repair schedule (weekly)
    • Repair monitoring

Performance Testing

Load Testing Tools:
  1. cassandra-stress (built-in):
# Write test
cassandra-stress write n=1000000 -node 10.0.1.10 -rate threads=50

# Read test
cassandra-stress read n=1000000 -node 10.0.1.10 -rate threads=50

# Mixed workload
cassandra-stress mixed ratio\(write=1,read=3\) n=1000000 -node 10.0.1.10
  2. NoSQLBench:
# Install
wget https://github.com/nosqlbench/nosqlbench/releases/download/4.15.102/nb.jar

# Run workload
java -jar nb.jar cql-keyvalue rampup-cycles=10000 main-cycles=100000 host=10.0.1.10
Key Metrics to Capture:
  • Throughput (ops/sec)
  • Latency (p50, p95, p99, p999)
  • Error rate
  • Resource utilization (CPU, RAM, disk, network)

Part 10: Hands-On Exercises

Exercise 1: JVM Tuning

Scenario: Node experiencing 2-second GC pauses
Task:
  1. Analyze GC log: cat /var/log/cassandra/gc.log
  2. Identify problem (heap too large? Young gen too large?)
  3. Adjust JVM settings in jvm11-server.options
  4. Restart and monitor improvements
Expected: GC pauses < 200ms after tuning

Exercise 2: Compaction Strategy Comparison

Task:
  1. Create three identical tables with different compaction strategies (STCS, LCS, TWCS)
  2. Load 10GB of data into each
  3. Run mixed read/write workload with cassandra-stress
  4. Compare SSTable count, read latency, write latency
Questions:
  • Which strategy has lowest SSTable count?
  • Which strategy has best read latency?
  • Which strategy has best write throughput?

Exercise 3: Monitoring Setup

Task:
  1. Install Prometheus + Grafana
  2. Configure JMX exporter on Cassandra nodes
  3. Import Cassandra dashboard
  4. Create custom alerts for:
    • GC pause > 500ms
    • Dropped mutations > 0
    • Pending compactions > 20

Exercise 4: Backup and Restore

Task:
  1. Create snapshot of my_keyspace
  2. Upload to S3 (or local directory)
  3. Drop table: DROP TABLE my_keyspace.users;
  4. Restore from snapshot
  5. Verify data integrity

Exercise 5: Troubleshoot Slow Queries

Scenario: Query taking 5 seconds
Task:
TRACING ON;
SELECT * FROM users WHERE user_id = 12345;
  1. Analyze trace output
  2. Identify bottleneck (tombstones? large partition? SSTable count?)
  3. Apply fix (repair? compaction? data model change?)
  4. Verify improvement

Summary & Production Best Practices

JVM:
  • Heap: 8GB max (25% of RAM)
  • GC: G1GC with 200ms pause target
  • Monitor GC logs continuously
OS:
  • SSDs for all storage
  • XFS with noatime
  • Disable THP; keep swappiness=1 (small emergency swap)
  • I/O scheduler: noop
Cassandra:
  • Compaction strategy matches workload
  • Run repair weekly (within gc_grace_seconds)
  • Use LOCAL_QUORUM for multi-DC
  • Monitor: GC, tpstats, compaction, latency
Capacity Planning:
  • 1 core per 1-2TB data
  • 64GB+ RAM (8GB heap + 48GB OS cache)
  • Disk: 50% headroom for compaction
  • Network: 10 Gbps minimum
Backup:
  • Daily snapshots to external storage
  • Incremental backups enabled
  • Test restore procedures regularly
Troubleshooting:
  • High reads → Check SSTable count, tombstones
  • High writes → Check pending compactions, GC
  • Timeouts → Check dropped messages, disk I/O
  • Node down → Check failure detector, GC pauses

What’s Next?

Module 7: Capstone Project - Building a Production System

Apply everything you’ve learned to design and implement a real-world Cassandra application at scale