Performance Tuning & Production Operations
Module Duration : 10-12 hours
Learning Style : Deep Technical + Hands-On Tuning + Production War Stories
Outcome : Operate Cassandra clusters at peak performance in production environments
Introduction: The Production Reality
Running Cassandra in production is vastly different from development:
Development: Single node, small dataset, tolerant of restarts
Production: 50+ nodes, multi-TB per node, 24/7 uptime, millisecond SLAs
This module covers everything you need to run Cassandra successfully in production: JVM tuning, OS configuration, monitoring, capacity planning, and troubleshooting real-world issues.
Part 1: The JVM - Cassandra’s Foundation
Cassandra runs on the Java Virtual Machine (JVM). JVM performance directly impacts Cassandra performance , especially around garbage collection (GC).
Why GC Matters
Problem : Cassandra keeps data in memory (MemTables, caches). When JVM runs GC:
Stop-the-world pauses : Application threads freeze
Pauses > 1 second → timeouts, failed requests
Pauses > 10 seconds → nodes marked as DOWN by failure detector
Goal : Keep GC pauses < 200ms
Heap Size Configuration
Cassandra’s heap is split into two regions:
Heap (Total)
├── Young Generation (Eden + Survivor spaces)
│ └── Short-lived objects (mutations, query results)
└── Old Generation (Tenured)
└── Long-lived objects (MemTables, caches, bloom filters)
Sizing Rules :
# jvm.options or jvm11-server.options
# Rule 1: Max heap = 8GB or 25% of RAM (whichever is smaller)
# Why? Large heaps = long GC pauses
# Example: 64GB RAM machine
-Xms8G # Initial heap
-Xmx8G # Max heap (always match Xms!)
# Rule 2: Young gen = 100MB per CPU core (up to 2GB max)
# Example: 16 cores
-Xmn1600M # 16 * 100MB = 1.6GB
Why These Limits?
| Heap Size | GC Pause Time | Issue |
|-----------|---------------|-------|
| < 4GB | 50-100ms | Too small, frequent GC |
| 4-8GB | 100-200ms | Optimal |
| 8-16GB | 200-500ms | Acceptable for some workloads |
| > 16GB | 500ms-5s | Too long, avoid! |
Real Example :
# Bad configuration (128GB RAM)
-Xms32G -Xmx32G # GC pauses will be 1-3 seconds!
# Good configuration (128GB RAM)
-Xms8G -Xmx8G # GC pauses ~200ms, use remaining RAM for OS cache
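The two sizing rules are simple enough to script. Below is a minimal helper (a Python sketch; the function name and return format are illustrative) that derives the flags from a machine's RAM and core count:

```python
def jvm_heap_flags(ram_gb: int, cores: int) -> list[str]:
    """Apply the sizing rules above: heap = min(8 GB, 25% of RAM),
    young gen = 100 MB per core, capped at 2 GB."""
    heap_gb = min(8, ram_gb // 4)        # Rule 1
    young_mb = min(cores * 100, 2048)    # Rule 2
    return [f"-Xms{heap_gb}G", f"-Xmx{heap_gb}G", f"-Xmn{young_mb}M"]

# 64 GB RAM, 16 cores -> ['-Xms8G', '-Xmx8G', '-Xmn1600M']
print(jvm_heap_flags(64, 16))
```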
Garbage Collection Algorithms
Cassandra supports three main GC algorithms:
1. G1GC (G1 Garbage Collector) - Recommended
Best For : Cassandra 3.0+, default choice
Configuration :
# jvm11-server.options
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=200 # Target pause time
-XX:InitiatingHeapOccupancyPercent=70 # When to start GC
-XX:ParallelGCThreads=16 # GC thread count (# of cores)
How G1GC Works :
Heap divided into ~2000 equal-sized regions (1-32MB each)
Region types:
├── Eden (young objects)
├── Survivor (survived 1+ GC)
├── Old (long-lived)
└── Humongous (objects > 50% region size)
GC Process:
1. Young GC: Collect Eden + Survivor (10-50ms)
2. Concurrent marking: Find live objects in Old gen
3. Mixed GC: Collect Young + some Old regions (50-200ms)
Tuning for Read-Heavy :
# Tolerate more garbage before GC (fewer pauses)
-XX:InitiatingHeapOccupancyPercent=75
Tuning for Write-Heavy :
# Start GC earlier (prevent MemTable flush storms)
-XX:InitiatingHeapOccupancyPercent=65
2. CMS (Concurrent Mark Sweep) - Legacy
Best For : Cassandra 2.x (deprecated in Cassandra 3.0+)
# jvm.options (Cassandra 2.x)
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
Issues :
Fragmentation in Old Gen → Full GC (5-10 second pauses!)
Deprecated in Java 9+
3. ZGC / Shenandoah - Experimental
Best For : Cassandra 4.0+, Java 11+, cutting-edge deployments
# ZGC (Java 15+)
-XX:+UseZGC
-XX:ZCollectionInterval=120 # GC every 2 minutes
# Shenandoah (Java 11+)
-XX:+UseShenandoahGC
Benefits :
Sub-10ms pauses even with 100GB heaps!
Still experimental for Cassandra
GC Logging and Monitoring
Enable GC Logging :
# jvm11-server.options
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/cassandra/gc.log:time,uptime:filecount=10,filesize=10m
Analyze GC Logs :
# View GC log
tail -f /var/log/cassandra/gc.log
# Example GC event:
[2023-11-15T10:23:45.678+0000][14.523s] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 2048M->512M(8192M) 45.678ms
Breakdown :
Pause Young: Young generation collection
2048M->512M: Heap before → after GC
(8192M): Total heap size
45.678ms: Pause duration (monitor this! )
Tools for GC Analysis :
GCViewer (GUI):
# Download GCViewer
wget https://github.com/chewiebug/GCViewer/releases/download/1.36/gcviewer-1.36.jar
# Visualize GC log
java -jar gcviewer-1.36.jar /var/log/cassandra/gc.log
GCEasy (Web-based):
# Upload gc.log to https://gceasy.io
# Generates detailed analysis and recommendations
Key Metrics to Watch :
Pause time p99 : Should be < 200ms
Pause frequency : Young GC every 5-10 seconds is normal
Full GC events : Should be 0! Any Full GC is a red flag
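For a quick command-line check without GCViewer, a rough parser can pull the pause durations out of gc.log and report the p99. The sketch below (Python) assumes the unified-logging format produced by the -Xlog settings above; the regex is deliberately loose:

```python
import re

def gc_pause_stats(path="/var/log/cassandra/gc.log"):
    """Extract 'Pause ... 45.678ms' durations and report count, max, and approximate p99."""
    pauses = []
    pattern = re.compile(r"Pause.*?(\d+\.\d+)ms")
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                pauses.append(float(m.group(1)))
    if not pauses:
        return None
    pauses.sort()
    idx = min(len(pauses) - 1, int(len(pauses) * 0.99))
    return {"events": len(pauses), "max_ms": pauses[-1], "p99_ms": pauses[idx]}

# Alert if the p99 exceeds the 200 ms target
stats = gc_pause_stats()
if stats and stats["p99_ms"] > 200:
    print(f"WARNING: GC pause p99 = {stats['p99_ms']:.1f} ms")
```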
Common GC Issues
Issue 1: Frequent Full GCs
Symptoms :
[GC] Pause Full (Allocation Failure) 8192M->7890M(8192M) 5234.567ms
Causes :
Heap too small
Memory leak (improper cache configuration)
Large object allocation (huge queries)
Solutions :
# 1. Increase heap (up to 8GB)
-Xms8G -Xmx8G
# 2. Check for memory leaks
nodetool info | grep "Heap Memory"
# 3. Limit query size
# cassandra.yaml:
tombstone_failure_threshold: 10000
range_request_timeout_in_ms: 10000
Issue 2: Long Young GC Pauses
Symptoms :
[GC] Pause Young 3000M->500M(8192M) 850.123ms # > 500ms!
Cause : Young gen too large
Solution :
# Reduce young gen size
-Xmn1G # Instead of -Xmn2G
Issue 3: Memory Pressure
Symptoms :
Constant GC activity
nodetool tpstats shows dropped mutations
Heap constantly near max
Diagnosis :
# Check heap usage
nodetool info | grep "Heap Memory"
# Example bad output:
Heap Memory (MB): 7890 / 8192 # 96% used!
Solutions :
# 1. Reduce MemTable size (cassandra.yaml)
memtable_heap_space_in_mb: 2048 # Default: calculated
# 2. Reduce cache sizes
key_cache_size_in_mb: 100 # Default: auto
row_cache_size_in_mb: 0 # Disable if not needed
# 3. Add more nodes (scale out)
Part 2: Operating System Tuning
Disk I/O Configuration
Cassandra is I/O intensive . OS settings massively impact performance.
File System Choice
| File System | Performance | Recommendation |
|-------------|-------------|----------------|
| ext4 | Good | Safe default |
| XFS | Better | Recommended for large volumes |
| ZFS | Variable | Use with caution (compression conflicts) |
| NTFS | Poor | Avoid (Windows) |
Mount Options (XFS):
# /etc/fstab
/dev/sdb1 /var/lib/cassandra xfs noatime,nodiratime,nobarrier 0 0
Why These Options?
noatime: Don’t update access time (reduces writes)
nodiratime: Don’t update directory access time
nobarrier: Disable write barriers (safe with battery-backed RAID)
I/O Scheduler
For SSDs :
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Set to noop or none (optimal for SSDs)
echo noop > /sys/block/sda/queue/scheduler
# Make persistent (add to /etc/rc.local)
echo "echo noop > /sys/block/sda/queue/scheduler" >> /etc/rc.local
For HDDs :
# Use deadline scheduler
echo deadline > /sys/block/sda/queue/scheduler
Readahead
Default : Often 8KB (too small for Cassandra)
# Check current readahead
blockdev --getra /dev/sda
# Set to 8-16MB for sequential reads
blockdev --setra 16384 /dev/sda # 16384 * 512 bytes = 8MB
# Make persistent
echo "blockdev --setra 16384 /dev/sda" >> /etc/rc.local
Linux Kernel Settings
Critical sysctl Settings :
# /etc/sysctl.conf
# Network tuning
net.core.rmem_max = 16777216 # Max receive buffer
net.core.wmem_max = 16777216 # Max send buffer
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 5000 # Queue size for network packets
# VM settings
vm.max_map_count = 1048575 # Max memory maps (for large heaps)
vm.swappiness = 1 # Avoid swap (but don't disable entirely)
# File descriptor limits
fs.file-max = 1048576 # System-wide max FDs
Apply Settings :
# Reload kernel parameters from /etc/sysctl.conf
sysctl -p
User Limits
Cassandra opens many files simultaneously:
# /etc/security/limits.conf
cassandra soft nofile 65536
cassandra hard nofile 65536
cassandra soft nproc 65536
cassandra hard nproc 65536
cassandra soft memlock unlimited
cassandra hard memlock unlimited
cassandra soft as unlimited
cassandra hard as unlimited
Verify :
# As cassandra user
ulimit -n # File descriptors (should be 65536)
ulimit -u # Processes (should be 65536)
Swap Configuration
Philosophy : Minimize swap, but don’t disable entirely.
Why Not Disable Swap?
Linux kernel may OOM-kill Cassandra if no swap
Small swap (1-2GB) acts as emergency overflow
Configuration :
# Set swappiness to 1 (only use swap in emergencies)
sysctl -w vm.swappiness=1
# Limit swap size to 1GB
dd if=/dev/zero of=/swapfile bs=1G count=1
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Add to /etc/fstab
echo "/swapfile none swap sw 0 0" >> /etc/fstab
Transparent Huge Pages (THP)
Issue : THP causes GC pauses and memory fragmentation.
Disable THP :
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never ← bad (enabled)
# Disable THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Make persistent (add to /etc/rc.local)
echo "echo never > /sys/kernel/mm/transparent_hugepage/enabled" >> /etc/rc.local
echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local
CPU Governor
For Performance :
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Set to performance (max CPU frequency)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do
echo performance > $cpu
done
# Install cpufrequtils for persistence
apt-get install cpufrequtils
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils
systemctl restart cpufrequtils
Part 3: Cassandra Configuration Tuning
Compaction Strategy Selection
Choosing the right compaction strategy is critical for performance.
STCS (Size-Tiered Compaction Strategy)
Best For : Write-heavy, time-series data, small tables
How It Works :
SSTables grouped by similar size:
Tier 1: [1MB, 1MB, 1MB, 1MB] → Compact to 4MB
Tier 2: [4MB, 4MB, 4MB, 4MB] → Compact to 16MB
Tier 3: [16MB, 16MB, 16MB, 16MB] → Compact to 64MB
Configuration :
ALTER TABLE users WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,   -- Min SSTables to compact
  'max_threshold': 32   -- Max SSTables to compact
};
Pros :
Fast writes (less compaction overhead)
Simple, predictable
Cons :
Read amplification (query may touch many SSTables)
Temporary disk space = 2x data size during compaction
LCS (Leveled Compaction Strategy)
Best For : Read-heavy, frequently updated data
How It Works :
L0: [10MB, 10MB, 10MB, 10MB] (unsorted)
↓ Compact into L1
L1: [10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB, 10MB] (sorted, max 100MB)
↓ Compact into L2
L2: [10MB × 100 SSTables] (sorted, max 1GB)
Configuration :
ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160   -- SSTable target size
};
Pros :
Low read amplification (90% reads touch 1 SSTable)
Predictable disk space usage
Cons :
More compaction overhead (impacts writes)
More I/O intensive
TWCS (Time Window Compaction Strategy)
Best For : Time-series data with TTL
How It Works :
Bucket by time window:
Window 1 (Day 1): [SSTable, SSTable] → Compact
Window 2 (Day 2): [SSTable, SSTable] → Compact
Window 3 (Day 3): [SSTable, SSTable] → Compact
After TTL expires:
→ Delete entire window (entire SSTable dropped, instant!)
Configuration :
CREATE TABLE metrics (
  sensor_id int,
  timestamp timestamp,
  value double,
  PRIMARY KEY ((sensor_id), timestamp)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': 1,        -- Window size
  'compaction_window_unit': 'DAYS'    -- HOURS, DAYS, WEEKS
} AND default_time_to_live = 2592000; -- 30 days
Pros :
Ultra-fast TTL deletion (drop entire SSTable)
Minimal read amplification for time-range queries
Cons :
Only suitable for time-series with TTL
Comparison Table :
| Workload | Strategy | Reason |
|----------|----------|--------|
| Write-heavy, small dataset | STCS | Fast writes, simple |
| Read-heavy, updates | LCS | Low read amplification |
| Time-series with TTL | TWCS | Fast TTL deletion |
| IoT sensor data | TWCS | Time-based, expires |
| User profiles (read-heavy) | LCS | Frequently updated, read often |
MemTable Configuration
MemTables are in-memory write buffers. Tuning them balances memory vs. flush frequency.
# cassandra.yaml
# Total heap space for all MemTables (MB)
memtable_heap_space_in_mb: 2048 # 25% of heap is reasonable
# Concurrent flush threads (MemTable -> SSTable writers)
memtable_flush_writers: 2
# Control flush frequency
memtable_cleanup_threshold: 0.5 # Flush when MemTable space is 50% used
Trade-offs :
| Setting | Small MemTable | Large MemTable |
|---------|----------------|----------------|
| Flush frequency | Frequent | Infrequent |
| SSTables created | Many | Few |
| Write performance | Lower | Higher |
| Compaction overhead | Higher | Lower |
| Memory usage | Lower | Higher |
Recommendation : Default is usually good. Only adjust if:
Many small writes : Increase MemTable size (reduce flush frequency)
Memory pressure : Decrease MemTable size
Cache Configuration
Cassandra has three caches:
1. Key Cache
Purpose : Cache partition key → SSTable mapping (avoid Bloom filter checks)
Configuration :
# cassandra.yaml
key_cache_size_in_mb: 100 # Or 'auto' (5% of heap)
key_cache_save_period: 14400 # Save to disk every 4 hours
When to Use :
Read-heavy workloads with hot partitions
Queries by primary key
When to Disable :
Write-heavy workloads
Cold data (rarely queried)
2. Row Cache
Purpose : Cache entire rows (most aggressive caching)
Configuration :
# cassandra.yaml
row_cache_size_in_mb: 0 # DISABLED by default (use carefully!)
Warning : Row cache is dangerous :
Consumes heap (increases GC pressure)
Only helps if reading exact same rows repeatedly
Most production clusters disable this
When to Use : Small, frequently-read tables (e.g., configuration)
3. Counter Cache
Purpose : Cache counter column values (counter tables only)
Configuration :
# cassandra.yaml
counter_cache_size_in_mb: 50
Enable Per Table :
ALTER TABLE counters WITH caching = {
  'keys': 'ALL',
  'rows_per_partition': 'ALL'
};
Commit Log Tuning
CommitLog is Cassandra’s write-ahead log.
Two Modes :
Periodic (default):
# cassandra.yaml
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000 # Sync every 10 seconds
Pros : High write throughput
Cons : Up to 10 seconds of data loss on crash
Batch :
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2 # Sync every 2ms
Pros : Minimal data loss (2ms window)
Cons : 30-50% lower write throughput
Disk Configuration :
# Put CommitLog on separate disk (if using HDD)
commitlog_directory: /mnt/commitlog # Separate disk!
# CommitLog size
commitlog_total_space_in_mb: 8192 # 8GB
Best Practice : Use separate SSD for CommitLog if using HDDs for data.
Concurrent Operations
Control thread pool sizes:
# cassandra.yaml
# Read thread pool (handle read requests)
concurrent_reads: 32 # Default: 32
# Write thread pool (handle write requests)
concurrent_writes: 32 # Default: 32
# Counter write thread pool
concurrent_counter_writes: 32 # Default: 32
# Compaction threads
concurrent_compactors: 2 # 1 per disk spindle (HDD) or 2-4 (SSD)
Tuning :
More CPU cores : Increase concurrent_reads/writes
More disks : Increase concurrent_compactors
Memory constrained : Decrease to reduce overhead
Part 4: Monitoring and Observability
Key Metrics to Monitor
1. System Metrics
CPU :
# Monitor CPU usage
top
htop
# Check per-core usage
mpstat -P ALL 1
Target : < 80% average, < 95% peak
Disk I/O :
# Monitor disk I/O
iostat -x 1
# Key metrics:
# - %util: Disk utilization (should be < 90%)
# - await: Average wait time (should be < 10ms SSD, < 20ms HDD)
# - r/s, w/s: Reads/writes per second
Network :
# Monitor network throughput
iftop
nethogs
# Check dropped packets
netstat -s | grep -i drop
2. JVM Metrics
Heap Usage :
nodetool info | grep "Heap Memory"
# Example output:
Heap Memory (MB): 3456 / 8192 # 42% used (healthy)
GC Metrics :
# GC stats
nodetool gcstats
# Example output:
Interval (ms)   Max GC Elapsed (ms)   Total GC Elapsed (ms)   Stdev GC Elapsed (ms)
10234           189                   3456                    45
Target : Max GC < 200ms
3. Cassandra Metrics
Thread Pool Stats :
nodetool tpstats
# Example output:
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 2 0 12345678 0 0
MutationStage 1 0 98765432 0 0
CompactionExecutor 2 15 123456 0 0
Message type Dropped
READ 0
MUTATION 0 # Any non-zero is BAD!
Critical : Dropped messages should be 0!
Table Statistics :
nodetool tablestats keyspace.table
# Key metrics:
# - SSTable count (should be < 20 for LCS, < 50 for STCS)
# - Space used by snapshots
# - Read/write latency
Compaction Stats :
nodetool compactionstats
# Example output:
pending tasks: 5
- keyspace.table: 5
Active compactions:
compaction type keyspace table completed total progress
Compaction test users 456 MB 1.2 GB 38.00%
Pending compactions should stay low (< 20). High pending = falling behind.
Latency :
nodetool tablehistograms keyspace.table
# Example output:
Percentile Read Latency Write Latency
50% 2.1 ms 0.8 ms
75% 4.5 ms 1.2 ms
95% 12.3 ms 3.4 ms
99% 28.7 ms 8.9 ms
Targets :
p50 < 5ms
p95 < 20ms
p99 < 50ms
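These targets are easy to turn into a scripted check. Here is a rough sketch (Python); it shells out to nodetool and assumes the column layout shown in the example above, which varies between Cassandra versions:

```python
import subprocess

def check_read_p99(keyspace, table, target_ms=50.0):
    """Parse `nodetool tablehistograms` output and flag a read p99 above target."""
    out = subprocess.run(
        ["nodetool", "tablehistograms", keyspace, table],
        capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        parts = line.split()
        # Expect a row shaped like the example above: "99%  28.7 ms  8.9 ms"
        if parts and parts[0] == "99%":
            read_ms = float(parts[1])
            if read_ms > target_ms:
                print(f"ALERT: {keyspace}.{table} read p99 {read_ms} ms > {target_ms} ms")
            return read_ms
    return None

check_read_p99("my_keyspace", "users")
```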
Throughput :
nodetool tablestats keyspace.table | grep -i "read\|write"
# Example output:
Local read count: 12345678
Local read latency: 3.456 ms
Local write count: 98765432
Local write latency: 1.234 ms
Monitoring Tools
1. Prometheus + Grafana (Recommended)
Architecture :
Cassandra nodes → JMX → Prometheus JMX Exporter → Prometheus → Grafana
Setup :
Install JMX Exporter :
# Download jmx_prometheus_javaagent
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.0/jmx_prometheus_javaagent-0.17.0.jar
# Move to Cassandra directory
mv jmx_prometheus_javaagent-0.17.0.jar /usr/share/cassandra/
Configure JMX Exporter (cassandra_jmx.yml):
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value
    name: cassandra_$1_$2
Add to JVM Options :
# jvm11-server.options
-javaagent:/usr/share/cassandra/jmx_prometheus_javaagent-0.17.0.jar=7070:/etc/cassandra/cassandra_jmx.yml
Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'cassandra'
    static_configs:
      - targets: ['node1:7070', 'node2:7070', 'node3:7070']
Import Grafana Dashboard :
2. DataStax OpsCenter
Commercial tool with free tier:
# Install OpsCenter
apt-get install opscenter
# Start OpsCenter
service opscenterd start
# Access: http://localhost:8888
Features :
Visual cluster topology
Performance graphs
Repair scheduling
Backup management
3. nodetool (Quick Checks)
# Overall cluster health
nodetool status
# Node-specific info
nodetool info
# Performance metrics
nodetool tpstats
nodetool tablehistograms keyspace.table
nodetool cfhistograms keyspace.table
# GC monitoring
nodetool gcstats
# Compaction status
nodetool compactionstats
Alert Thresholds
Critical Alerts :
| Metric | Threshold | Action |
|--------|-----------|--------|
| Dropped mutations | > 0 | Investigate immediately (data loss!) |
| GC pause time p99 | > 1s | Tune JVM or add nodes |
| Pending compactions | > 50 | Increase compaction threads |
| Disk usage | > 85% | Add capacity or cleanup |
| Node down | Any | Investigate and repair |
Warning Alerts :
| Metric | Threshold | Action |
|--------|-----------|--------|
| Read latency p99 | > 50ms | Check query patterns |
| Write latency p99 | > 20ms | Check disk I/O |
| Heap usage | > 85% | Review cache settings |
| Pending hints | > 1GB | Check network/nodes |
| SSTable count | > 50 | Review compaction strategy |
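The most critical of these, dropped mutations, is easy to script against nodetool output. A sketch (Python; the tpstats layout differs slightly between versions, so treat the parsing as illustrative):

```python
import subprocess

def dropped_messages():
    """Return message types with a non-zero count in the 'Message type  Dropped'
    section of `nodetool tpstats`."""
    out = subprocess.run(["nodetool", "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    dropped, in_section = {}, False
    for line in out.splitlines():
        if line.startswith("Message type"):
            in_section = True
            continue
        if in_section and line.strip():
            parts = line.split()
            if len(parts) >= 2 and parts[-1].isdigit() and int(parts[-1]) > 0:
                dropped[parts[0]] = int(parts[-1])
    return dropped

# Any non-zero dropped MUTATION count should page someone
print(dropped_messages() or "OK: no dropped messages")
```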
Part 5: Capacity Planning
Disk Capacity
Formula :
Required Disk = (Data Size × Replication Factor × (1 + Compaction Overhead)) / Number of Nodes
Compaction Overhead :
STCS: 50% (2x data during compaction)
LCS: 10% (1.1x data)
TWCS: 20% (1.2x data)
Example :
Data size: 10 TB
Replication factor: 3
Compaction strategy: STCS (50% overhead)
Number of nodes: 10
Disk per node = (10 TB × 3 × 1.5) / 10 = 4.5 TB
Recommendation: 6 TB disks (33% headroom)
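The same formula as a quick calculation (a Python sketch; the overhead factors are the ones listed above):

```python
def disk_per_node_tb(data_tb, rf, nodes, strategy="STCS"):
    """Required disk = (data size x RF x (1 + compaction overhead)) / node count."""
    overhead = {"STCS": 0.5, "LCS": 0.1, "TWCS": 0.2}[strategy]
    return data_tb * rf * (1 + overhead) / nodes

need = disk_per_node_tb(10, 3, 10, "STCS")      # 4.5 TB per node
print(f"Per node: {need} TB; provision ~{need * 1.33:.1f} TB for ~33% headroom")
```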
Memory Capacity
Formula :
Total RAM = Heap + OS Cache + OS Overhead
Heap: 8GB (max recommended)
OS Cache: 50-75% of remaining RAM
OS Overhead: 4-8GB
Example (64GB RAM):
Heap: 8GB
OS Overhead: 8GB
OS Cache: 48GB (for caching SSTables)
Total: 64GB ✓
Why Large OS Cache?
Cassandra relies on OS page cache for SSTable caching
Larger cache = fewer disk reads = better performance
CPU Capacity
Rule of Thumb : 1 CPU core per 1-2 TB of data
Example :
Data per node: 4TB
Recommended cores: 4-8 cores
For read-heavy: 8 cores (more parallelism)
For write-heavy: 4 cores (less GC pressure)
Network Capacity
Formula :
Network throughput = (Write rate × Replication factor) + (Read rate) + (Repair traffic)
Example :
Write rate: 10,000 writes/sec × 5 KB/write = 50 MB/s
Replication factor: 3
Read rate: 5,000 reads/sec × 10 KB/read = 50 MB/s
Network needed = (50 MB/s × 3) + 50 MB/s = 200 MB/s
Recommendation: 1 Gbps NIC (125 MB/s max) is too small!
Use: 10 Gbps NIC (1250 MB/s max)
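The memory split and network estimate can be sketched the same way (Python; the workload numbers mirror the worked examples above and are assumptions about your own traffic):

```python
def ram_split_gb(total_ram_gb, heap_gb=8, os_overhead_gb=8):
    """Total RAM = heap + OS overhead + OS page cache (whatever is left)."""
    return {"heap": heap_gb, "os_overhead": os_overhead_gb,
            "page_cache": total_ram_gb - heap_gb - os_overhead_gb}

def network_mb_per_s(write_ops, write_kb, read_ops, read_kb, rf):
    """Throughput = write rate x RF + read rate (repair traffic excluded here)."""
    writes = write_ops * write_kb / 1024   # MB/s of incoming writes
    reads = read_ops * read_kb / 1024      # MB/s served to clients
    return writes * rf + reads

print(ram_split_gb(64))                            # {'heap': 8, 'os_overhead': 8, 'page_cache': 48}
print(network_mb_per_s(10_000, 5, 5_000, 10, 3))   # ~195 MB/s -> needs a 10 Gbps NIC
```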
Scaling Triggers
When to Add Nodes :
| Metric | Threshold | Action |
|--------|-----------|--------|
| Disk usage | > 70% | Add nodes (or cleanup) |
| CPU usage | > 75% average | Add nodes |
| Read latency p99 | > 50ms (sustained) | Add nodes or optimize |
| Write latency p99 | > 20ms (sustained) | Add nodes or tune |
| Compaction falling behind | Pending > 30 | Add nodes or tune compaction |
Scaling Example :
Initial: 6 nodes, 4TB each = 24TB total capacity (8TB user data × RF=3)
After 6 months:
- Data grows to 12TB
- 6 nodes × 4TB = 24TB total capacity
- Usage: (12TB × 3) / 24TB = 150% (OVER CAPACITY!)
Action: Add 6 more nodes
- 12 nodes × 4TB = 48TB total capacity
- Usage: (12TB × 3) / 48TB = 75% (healthy)
Part 6: Backup and Disaster Recovery
Snapshot-Based Backups
How Snapshots Work :
1. SSTable files are immutable
2. Snapshot creates hardlinks (not copies!)
3. Hardlinks consume minimal disk space
4. Compaction doesn't delete files with active hardlinks
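To convince yourself that a snapshot hardlink costs almost nothing, here is a small illustration (a Python sketch using a throwaway file rather than a real SSTable):

```python
import os

# Create a dummy "SSTable" and hardlink it, the way nodetool snapshot does
with open("/tmp/fake-sstable.db", "wb") as f:
    f.write(b"x" * 1024 * 1024)                  # 1 MB of data
os.makedirs("/tmp/snapshots", exist_ok=True)
os.link("/tmp/fake-sstable.db", "/tmp/snapshots/fake-sstable.db")

orig = os.stat("/tmp/fake-sstable.db")
link = os.stat("/tmp/snapshots/fake-sstable.db")
print(orig.st_ino == link.st_ino)    # True: same inode, same blocks on disk
print(orig.st_nlink)                 # 2: data survives until both links are removed

# Deleting the original (as compaction would) does not free the blocks
os.remove("/tmp/fake-sstable.db")
print(os.path.getsize("/tmp/snapshots/fake-sstable.db"))   # still 1048576 bytes
```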
Create Snapshot :
# Snapshot all keyspaces
nodetool snapshot
# Snapshot specific keyspace
nodetool snapshot my_keyspace
# Snapshot with name
nodetool snapshot -t backup_20231115 my_keyspace
Snapshot Location :
/var/lib/cassandra/data/<keyspace>/<table>/snapshots/<snapshot_name>/
List Snapshots :
nodetool listsnapshots
# Example output:
Snapshot name Keyspace Table Size
backup_20231115 my_keyspace users 2.3 GB
Delete Snapshot :
# Delete specific snapshot
nodetool clearsnapshot -t backup_20231115
# Delete all snapshots
nodetool clearsnapshot --all
Incremental Backups
Enable Incremental Backups :
# cassandra.yaml
incremental_backups: true
How It Works :
1. When SSTable is flushed, hardlink is created in backups/ directory
2. Backup contains only new data since last full snapshot
3. Restore = Full snapshot + Incremental backups
Backup Location :
/var/lib/cassandra/data/<keyspace>/<table>/backups/
Backup Strategy :
Day 0: Full snapshot
Day 1-6: Incremental backups accumulate
Day 7: Full snapshot + delete old incrementals
Backup to External Storage
Script Example (S3):
#!/bin/bash
# backup_to_s3.sh
KEYSPACE="my_keyspace"
SNAPSHOT="backup_$(date +%Y%m%d_%H%M%S)"
S3_BUCKET="s3://my-cassandra-backups"
# 1. Create snapshot
nodetool snapshot -t "$SNAPSHOT" "$KEYSPACE"
# 2. Find each table's snapshot directory and upload it to S3
for SNAPSHOT_DIR in /var/lib/cassandra/data/$KEYSPACE/*/snapshots/$SNAPSHOT; do
  TABLE_DIR=$(basename "$(dirname "$(dirname "$SNAPSHOT_DIR")")")
  aws s3 sync "$SNAPSHOT_DIR" "$S3_BUCKET/$HOSTNAME/$SNAPSHOT/$TABLE_DIR/"
done
# 3. Delete local snapshot
nodetool clearsnapshot -t "$SNAPSHOT"
echo "Backup complete: $S3_BUCKET/$HOSTNAME/$SNAPSHOT/"
Automate with Cron :
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup_to_s3.sh
Restore from Backup
Full Restore Process :
Stop Cassandra :
systemctl stop cassandra
Clear existing data :
rm -rf /var/lib/cassandra/data/my_keyspace/users/*
Restore snapshot :
# Download from S3
aws s3 sync s3://my-cassandra-backups/node1/backup_20231115/ /tmp/restore/
# Copy to data directory
cp -r /tmp/restore/* /var/lib/cassandra/data/my_keyspace/users/
Fix ownership :
chown -R cassandra:cassandra /var/lib/cassandra/data
Restart Cassandra :
systemctl start cassandra
Run repair (important!):
nodetool repair my_keyspace users
Point-in-Time Recovery
Requirements :
Full snapshot
Incremental backups
CommitLog archives
CommitLog Archiving :
# conf/commitlog_archiving.properties (its own file, not cassandra.yaml)
archive_command=/usr/local/bin/archive_commitlog.sh %path
restore_command=/usr/local/bin/restore_commitlog.sh %from %to
restore_directories=/var/lib/cassandra/commitlog_archive
# Format: yyyy:MM:dd HH:mm:ss
restore_point_in_time=2023:11:15 10:30:00
Archive Script :
#!/bin/bash
# archive_commitlog.sh
aws s3 cp "$1" "s3://my-cassandra-backups/commitlogs/$(hostname)/$(basename "$1")"
Part 7: Troubleshooting Production Issues
Issue 1: High Read Latency
Symptoms :
nodetool tablehistograms keyspace.table
# p99: 500ms (very high!)
Diagnosis Steps :
Check SSTable Count :
nodetool tablestats keyspace.table | grep "SSTable count"
# SSTable count: 73 (too many!)
Solution : Switch to LCS or run compaction:
nodetool compact keyspace table
Check for Wide Partitions :
nodetool cfstats keyspace.table | grep "Compacted partition maximum bytes"
# Compacted partition maximum bytes: 2147483648 (2GB partition!)
Solution : Redesign data model (split partition)
Check Disk I/O :
iostat -x 1
# %util: 98% (disk saturated!)
Solution : Add nodes or upgrade to SSDs
Check for Tombstones :
# Enable tracing
cqlsh> TRACING ON;
cqlsh> SELECT * FROM keyspace.table WHERE id = 123;
# Look for:
# Read 10 live rows and 50000 tombstone cells (too many tombstones!)
Solution : Run repair or adjust gc_grace_seconds
Issue 2: Write Timeouts
Symptoms :
WriteTimeout: Timed out waiting for write response
Diagnosis :
Check Dropped Mutations :
nodetool tpstats | grep "Dropped"
# MUTATION: 12345 (non-zero = bad!)
Cause : Nodes can’t keep up with write load
Check Pending Compactions :
nodetool compactionstats
# pending tasks: 87 (falling behind!)
Cause : Compaction can’t keep up
Solutions :
# Increase compaction threads
nodetool setcompactionthroughput 0 # Unlimited (use carefully!)
# Or in cassandra.yaml:
compaction_throughput_mb_per_sec: 64 # Default: 16
# Increase concurrent compactors
concurrent_compactors: 4 # Default: 2
Check GC Pauses :
grep "GC" /var/log/cassandra/system.log | tail -20
# [GC pause (young) 1234ms] (too long!)
Cause : Heap pressure
Solution : Reduce MemTable size or increase heap
Issue 3: Node Marked as DOWN (But It’s Running)
Symptoms :
nodetool status
# DN 10.0.1.13 (Down/Normal) ← but node is actually running!
Diagnosis :
Check Failure Detector :
nodetool failuredetector
# /10.0.1.13: 15.34 (phi > 8 = marked down)
Check for GC Pauses :
# On node 10.0.1.13
grep "GC pause" /var/log/cassandra/system.log
# [GC pause (young) 5678ms] (> phi threshold!)
Cause : GC pause exceeded failure detector threshold
Solutions :
# Increase phi threshold (cassandra.yaml)
phi_convict_threshold: 12 # Default: 8
# Tune JVM GC
-XX:MaxGCPauseMillis=200
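To see why a long GC pause gets a live node convicted, here is a rough sketch of the phi calculation (Python; a simplified exponential model with a 1-second gossip heartbeat assumed, whereas Cassandra's actual accrual detector estimates the interval distribution from recent heartbeats):

```python
import math

def phi(ms_since_last_heartbeat, mean_heartbeat_interval_ms=1000):
    """phi = -log10(P(heartbeat still arrives)), exponential approximation."""
    return ms_since_last_heartbeat / (mean_heartbeat_interval_ms * math.log(10))

print(phi(3_000))    # ~1.3  -> fine
print(phi(10_000))   # ~4.3  -> suspicious
print(phi(20_000))   # ~8.7  -> above the default phi_convict_threshold of 8
```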
Issue 4: Disk Full
Symptoms :
ERROR: Cannot allocate memory for commitlog segment
Diagnosis :
df -h
# /var/lib/cassandra: 98% used (critical!)
# Check snapshot disk usage
nodetool listsnapshots | awk '{sum+=$4} END {print sum}'
# 2.3 TB in snapshots!
Solutions :
Delete Old Snapshots :
nodetool clearsnapshot --all
Clean Incremental Backups :
rm -rf /var/lib/cassandra/data/*/*/backups/*
Compact Tables :
nodetool compact keyspace table
Add Nodes (long-term solution)
Issue 5: Schema Mismatch
Symptoms :
nodetool describecluster
# Schema versions:
# e84b6a60-1f3e-11ec-9621-0242ac130002: [10.0.1.10, 10.0.1.11]
# f95c7b71-2g4f-22fd-a732-1353bd241113: [10.0.1.12] ← different!
Diagnosis :
Node 10.0.1.12 has different schema
Gossip not propagating schema updates
Solutions :
Force Schema Reset :
# On the outlier node (10.0.1.12)
nodetool resetlocalschema
Restart Gossip :
nodetool disablegossip
nodetool enablegossip
Rolling Restart (last resort):
# Restart nodes one by one
systemctl restart cassandra
Part 8: Advanced Production Topics
Multi-DC Latency Optimization
Problem : Cross-DC writes add 100-200ms latency
Solution : Use LOCAL_QUORUM for writes:
-- Consistency is set per session or per statement, not in the CQL text.
-- In cqlsh:
CONSISTENCY LOCAL_QUORUM;
INSERT INTO users (id, name) VALUES (1, 'Alice');
Why : Write only waits for local DC acknowledgment, remote DC replicates asynchronously.
Trade-off : Remote DC may lag by seconds/minutes during network issues.
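In application code the same thing is done by setting the consistency level on the statement. A minimal Python driver sketch (the contact point and keyspace name are placeholders):

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(['10.0.1.10']).connect('my_keyspace')

insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM  # wait only for the local DC
)
session.execute(insert, (1, 'Alice'))
```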
Read Consistency Tuning
Scenario : Reads occasionally return stale data
Diagnosis : Repair not running frequently enough
Solutions :
Increase Read Repair Chance :
ALTER TABLE users WITH dclocal_read_repair_chance = 0.2; -- 20% of reads
Use Higher Consistency Level :
-- In cqlsh, raise the session consistency level, then read:
CONSISTENCY QUORUM; -- Instead of ONE
SELECT * FROM users WHERE id = 1;
Run Repair More Frequently :
# Daily incremental repair
0 2 * * * nodetool repair -inc
Handling Large Partitions
Problem : Partition > 100MB causes:
High read latency
Timeouts
OOM errors
Detection :
# Find large partitions
nodetool cfstats keyspace.table | grep "Compacted partition maximum bytes"
# Or use SSTable tools
sstablemetadata /var/lib/cassandra/data/keyspace/table/mc-123-big-Data.db | grep "Partition size"
Solutions :
Redesign Data Model (best):
-- Bad: All events for a user in one partition
CREATE TABLE user_events (
  user_id int,
  event_time timestamp,
  event_data text,
  PRIMARY KEY (user_id, event_time) -- user_id partition grows unbounded!
);
-- Good: Bucket by time
CREATE TABLE user_events (
  user_id int,
  bucket text, -- e.g., "2023-11"
  event_time timestamp,
  event_data text,
  PRIMARY KEY ((user_id, bucket), event_time) -- Bounded partitions
);
Add Compaction Threshold :
ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'min_threshold': 4,
  'max_threshold': 32
};
Connection Pool Tuning (Driver-Side)
Python Driver :
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, RetryPolicy, HostDistance

cluster = Cluster(
    contact_points=['10.0.1.10'],
    protocol_version=4,
    # Timeouts
    connect_timeout=10,               # Initial connection timeout (seconds)
    control_connection_timeout=10,
    # Load balancing
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east-1'),
    # Retries
    default_retry_policy=RetryPolicy(),
)

# Connection pool: connections per host are set on the cluster, not as kwargs
cluster.set_core_connections_per_host(HostDistance.LOCAL, 10)
cluster.set_max_connections_per_host(HostDistance.LOCAL, 20)
Java Driver :
// Java driver 4.x: pool size and timeouts are set through the config loader
CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("10.0.1.10", 9042))
    .withLocalDatacenter("us-east-1")
    .withConfigLoader(DriverConfigLoader.programmaticBuilder()
        // Connection pool: connections per node
        .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 10)
        // Timeouts
        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(5))
        .withDuration(DefaultDriverOption.CONNECTION_CONNECT_TIMEOUT, Duration.ofSeconds(10))
        .build())
    .build();
Part 9: Pre-Production Checklist
Load Testing Tools :
cassandra-stress (built-in):
# Write test
cassandra-stress write n=1000000 -node 10.0.1.10 -rate threads=50
# Read test
cassandra-stress read n=1000000 -node 10.0.1.10 -rate threads=50
# Mixed workload
cassandra-stress mixed ratio\(write=1,read=3\) n=1000000 -node 10.0.1.10
NoSQLBench :
# Install
wget https://github.com/nosqlbench/nosqlbench/releases/download/4.15.102/nb.jar
# Run workload
java -jar nb.jar cql-keyvalue rampup-cycles=10000 main-cycles=100000 host=10.0.1.10
Key Metrics to Capture :
Throughput (ops/sec)
Latency (p50, p95, p99, p999)
Error rate
Resource utilization (CPU, RAM, disk, network)
Part 10: Hands-On Exercises
Exercise 1: JVM Tuning
Scenario : Node experiencing 2-second GC pauses
Task :
Analyze GC log: cat /var/log/cassandra/gc.log
Identify problem (heap too large? Young gen too large?)
Adjust JVM settings in jvm11-server.options
Restart and monitor improvements
Expected : GC pauses < 200ms after tuning
Exercise 2: Compaction Strategy Comparison
Task :
Create three identical tables with different compaction strategies (STCS, LCS, TWCS)
Load 10GB of data into each
Run mixed read/write workload with cassandra-stress
Compare SSTable count, read latency, write latency
Questions :
Which strategy has lowest SSTable count?
Which strategy has best read latency?
Which strategy has best write throughput?
Exercise 3: Monitoring Setup
Task :
Install Prometheus + Grafana
Configure JMX exporter on Cassandra nodes
Import Cassandra dashboard
Create custom alerts for:
GC pause > 500ms
Dropped mutations > 0
Pending compactions > 20
Exercise 4: Backup and Restore
Task :
Create snapshot of my_keyspace
Upload to S3 (or local directory)
Drop table: DROP TABLE my_keyspace.users;
Restore from snapshot
Verify data integrity
Exercise 5: Troubleshoot Slow Queries
Scenario : Query taking 5 seconds
Task :
TRACING ON;
SELECT * FROM users WHERE user_id = 12345;
Analyze trace output
Identify bottleneck (tombstones? large partition? SSTable count?)
Apply fix (repair? compaction? data model change?)
Verify improvement
Summary & Production Best Practices
JVM :
Heap: 8GB max (25% of RAM)
GC: G1GC with 200ms pause target
Monitor GC logs continuously
OS :
SSDs for all storage
XFS with noatime
Disable THP and swap
I/O scheduler: noop
Cassandra :
Compaction strategy matches workload
Run repair weekly (within gc_grace_seconds)
Use LOCAL_QUORUM for multi-DC
Monitor: GC, tpstats, compaction, latency
Capacity Planning :
1 core per 1-2TB data
64GB+ RAM (8GB heap + 48GB OS cache)
Disk: 50% headroom for compaction
Network: 10 Gbps minimum
Backup :
Daily snapshots to external storage
Incremental backups enabled
Test restore procedures regularly
Troubleshooting :
High reads → Check SSTable count, tombstones
High writes → Check pending compactions, GC
Timeouts → Check dropped messages, disk I/O
Node down → Check failure detector, GC pauses
What’s Next?
Module 7: Capstone Project - Building a Production System Apply everything you’ve learned to design and implement a real-world Cassandra application at scale