Kafka Operations
Cluster Architecture
Key Configurations (server.properties)
High Availability & Replication
Replication Factor
In-Sync Replicas (ISR)
Monitoring & Observability
Key Metrics (JMX)
Consumer Lag
Security
1. Encryption (SSL/TLS)
2. Authentication (SASL)
3. Authorization (ACLs)
Troubleshooting
Broker Failure
Consumer Issues
Data Loss Scenarios
Maintenance Tasks
Rebalancing Partitions
Log Compaction
Log Compaction Deep Dive
How It Works
Configuration
Tombstones (Deletes)
Retention Strategies
Time-Based Retention
Size-Based Retention
Combined
Disaster Recovery
Multi-Datacenter Replication
MirrorMaker 2 Architecture
Backup Strategies
Performance Tuning
Broker Tuning
Producer Tuning
Consumer Tuning
Interview Questions & Answers
Common Pitfalls

Kafka Operations

Learn to deploy, manage, and secure Kafka clusters in production environments.

Cluster Architecture

A production Kafka cluster typically consists of:

3+ Brokers: For availability and replication.
3+ ZooKeeper nodes (or KRaft controllers): For cluster coordination. Use an odd number so that a quorum survives a node failure.

Key Configurations (`server.properties`)

Property	Description	Recommended
`broker.id`	Unique integer ID	Unique per broker
`log.dirs`	Directories where log data is stored	Separate disk/mount
`num.partitions`	Default partition count for new topics	3
`default.replication.factor`	Default replication factor for new topics	3
`min.insync.replicas`	Minimum in-sync replicas required to acknowledge an `acks=all` write	2
`auto.create.topics.enable`	Auto-create topics on first use	`false` (Production)

High Availability & Replication

Replication Factor

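RF=3: Standard for production. Combined with min.insync.replicas=2, the topic stays writable for acks=all producers after 1 broker failure, and acknowledged data survives.
Min ISR=2: With acks=all, a write is acknowledged only after at least 2 in-sync replicas (leader included) have it.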
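
How these two settings interact can be sketched in a few lines of Python. This is an illustrative model rather than broker code: the function name `write_accepted` and its parameters are assumptions made for the example. It only mirrors the rule that an `acks=all` write is rejected when the ISR is smaller than `min.insync.replicas`.

```python
def write_accepted(isr_size: int, min_insync_replicas: int, acks: str) -> bool:
    """Return True if a produce request would be accepted in this simplified model."""
    if isr_size < 1:
        # No in-sync replica means there is no leader to write to.
        return False
    if acks == "all":
        # acks=all writes need at least min.insync.replicas members in the ISR.
        return isr_size >= min_insync_replicas
    # acks=0 and acks=1 only need a live leader; min.insync.replicas is not enforced.
    return True


# RF=3, min.insync.replicas=2
print(write_accepted(isr_size=3, min_insync_replicas=2, acks="all"))  # all replicas healthy
print(write_accepted(isr_size=2, min_insync_replicas=2, acks="all"))  # one broker down
print(write_accepted(isr_size=1, min_insync_replicas=2, acks="all"))  # two brokers down
```

With two brokers down the partition is still readable, but `acks=all` producers are rejected until a follower rejoins the ISR.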

In-Sync Replicas (ISR)

Replicas that are caught up with the leader.

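A follower that has not caught up with the leader within replica.lag.time.max.ms is removed from the ISR.
If the leader fails, only a member of the ISR can become the new leader (unless unclean leader election is enabled).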

Monitoring & Observability

Key Metrics (JMX)

Under Replicated Partitions

Critical: Should be 0. >0 means data isn’t fully replicated.

Active Controller Count

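Critical: Should be exactly 1 when summed across all brokers. 0 means there is no active controller; more than 1 indicates a split-brain condition.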

Offline Partitions

Critical: >0 means data is unavailable.

Request Latency

Time to process produce/fetch requests.
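
The thresholds above translate directly into alert rules. The following is a minimal sketch that assumes the metric values have already been scraped into a dictionary. The key names (`under_replicated_partitions` and so on) and the function `find_critical_alerts` are hypothetical, not JMX bean names.

```python
def find_critical_alerts(snapshot: dict) -> list:
    """Compare a cluster-wide metrics snapshot against the thresholds listed above."""
    alerts = []
    if snapshot["under_replicated_partitions"] > 0:
        alerts.append("under-replicated partitions > 0")
    if snapshot["active_controller_count"] != 1:
        alerts.append("active controller count != 1")
    if snapshot["offline_partitions"] > 0:
        alerts.append("offline partitions > 0")
    return alerts


healthy = {"under_replicated_partitions": 0, "active_controller_count": 1, "offline_partitions": 0}
degraded = {"under_replicated_partitions": 12, "active_controller_count": 1, "offline_partitions": 0}
print(find_critical_alerts(healthy))
print(find_critical_alerts(degraded))
```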

Consumer Lag

The most important metric for consumers.

Lag: Difference, per partition, between the latest (log end) offset and the consumer group’s committed offset.
Monitoring: Use kafka-consumer-groups.sh or tools like Prometheus/Grafana.

# Check lag
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group
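
The lag arithmetic itself is simple subtraction per partition. The sketch below assumes the offsets have already been collected into two dictionaries keyed by partition number. The function name `consumer_lag` and the sample offsets are made up for illustration, and treating a missing committed offset as 0 is an assumption of this example.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption for this sketch: a partition with no committed offset starts at 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


log_end = {0: 120, 1: 95, 2: 300}
committed = {0: 100, 1: 95, 2: 250}
lag = consumer_lag(log_end, committed)
print(lag)                # {0: 20, 1: 0, 2: 50}
print(sum(lag.values()))  # 70
```

Alerting on the trend of total lag is usually more useful than alerting on a single reading, since a consumer that is keeping up can still show a small non-zero lag.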

Security

1. Encryption (SSL/TLS)

Encrypts data in transit between clients and brokers, and between brokers.

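# server.properties
# Paths and passwords below are placeholders; use real secrets in production.
listeners=SSL://:9093
# Encrypt broker-to-broker traffic as well
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234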

2. Authentication (SASL)

Verifies identity of clients.

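SASL/PLAIN: Username/password. Credentials are sent as-is, so pair it with TLS.
SASL/SCRAM: Salted Challenge Response Authentication Mechanism. More secure than PLAIN because the password itself is never sent over the wire.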
SASL/GSSAPI (Kerberos): Enterprise integration.

3. Authorization (ACLs)

Controls what authenticated users can do.

# Allow 'alice' to write to 'finance-topic'
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
    --allow-principal User:alice \
    --operation Write \
    --topic finance-topic

Troubleshooting

Broker Failure

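Check Logs: server.log in the broker’s log directory (for example /var/log/kafka/server.log; the path depends on the installation).
Check Disk: Is the disk full? A broker cannot accept writes once its log directory is full.
Check ZooKeeper: Is the broker connected to ZooKeeper (or, in KRaft mode, to the controller quorum)?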

Consumer Issues

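Stuck Consumer: Check whether the group is rebalancing constantly (“stop-the-world”). A common cause is processing that takes longer than max.poll.interval.ms between polls.
High Lag: The consumer is too slow.
- Solution: Add more partitions and more consumer instances. Consumers beyond the partition count stay idle.
- Solution: Optimize processing logic.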

Data Loss Scenarios

Unclean Leader Election: If every ISR member is unavailable and unclean election is enabled, an out-of-sync replica becomes leader and any messages it never received are lost.
- Config: unclean.leader.election.enable=false (Default).
Producer acks=1: The leader acknowledges the write but crashes before followers replicate it.
- Config: acks=all.
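
The trade-off behind unclean leader election can be modelled as a small function. This is a simplified sketch, not the controller's actual algorithm: `elect_leader` and its parameters are names invented for the example, and it picks the first eligible replica in assignment order.

```python
from typing import Optional


def elect_leader(replicas: list, isr: set, alive: set,
                 unclean_enabled: bool = False) -> Optional[int]:
    """Pick a new leader for one partition, or None if the partition goes offline."""
    # Clean election: first live replica that is still in the ISR.
    for broker in replicas:
        if broker in isr and broker in alive:
            return broker
    # Unclean election: fall back to any live replica, accepting possible data loss.
    if unclean_enabled:
        for broker in replicas:
            if broker in alive:
                return broker
    return None


# Brokers 2 and 3 are alive, but only broker 1 (now dead) was in sync.
print(elect_leader([1, 2, 3], isr={1}, alive={2, 3}))                        # None
print(elect_leader([1, 2, 3], isr={1}, alive={2, 3}, unclean_enabled=True))  # 2
```

The first call leaves the partition offline but intact; the second restores availability at the cost of whatever broker 2 had not yet replicated.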

Maintenance Tasks

Rebalancing Partitions

When adding new brokers, partitions don’t automatically move. Use kafka-reassign-partitions.sh.

Log Compaction

For topics where only the latest value for a key matters (e.g., user profile updates).

Config: cleanup.policy=compact.

Log Compaction Deep Dive

Log compaction ensures Kafka retains the latest value for each key, rather than retaining by time.

How It Works

Before Compaction:
offset: 0  1  2  3  4  5  6  7
key:    A  B  A  C  B  A  C  B
value:  v1 v1 v2 v1 v2 v3 v2 v3

After Compaction:
offset: 5  6  7
key:    A  C  B
value:  v3 v2 v3
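
The before/after picture above can be reproduced with a few lines of Python. This is a minimal in-memory model, assuming records are `(offset, key, value)` tuples. The function name `compact` is illustrative, and the real log cleaner works segment by segment and never compacts the active segment.

```python
def compact(log: list) -> list:
    """Keep only the record with the highest offset for each key."""
    latest = {}
    for offset, key, value in log:
        # Later records overwrite earlier ones for the same key.
        latest[key] = (offset, key, value)
    # Surviving records keep their original offsets and relative order.
    return sorted(latest.values())


keys = ["A", "B", "A", "C", "B", "A", "C", "B"]
values = ["v1", "v1", "v2", "v1", "v2", "v3", "v2", "v3"]
log = [(offset, key, value) for offset, (key, value) in enumerate(zip(keys, values))]

print(compact(log))  # [(5, 'A', 'v3'), (6, 'C', 'v2'), (7, 'B', 'v3')]
```

Note that offsets are preserved, so a compacted log has gaps in its offset sequence.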

Configuration

# Topic-level
cleanup.policy=compact
min.cleanable.dirty.ratio=0.5   # Compact when 50% is "dirty"
delete.retention.ms=86400000    # Keep tombstones for 24h

Tombstones (Deletes)

To delete a key, produce a message with null value:

producer.send(new ProducerRecord<>("topic", "key-to-delete", null));

The tombstone is kept for delete.retention.ms, then removed during compaction.
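
The tombstone lifecycle can be sketched the same way. The model below is a simplification: it measures tombstone age from the record timestamp, which is an assumption of this example rather than a statement of how the cleaner tracks it, and the names `compact_with_tombstones`, `now_ms`, and the sample keys are hypothetical.

```python
def compact_with_tombstones(log: list, now_ms: int, delete_retention_ms: int) -> list:
    """Compact (offset, timestamp_ms, key, value) records; value None is a tombstone."""
    latest = {}
    for record in log:
        latest[record[2]] = record  # index 2 is the key
    survivors = []
    for offset, timestamp_ms, key, value in latest.values():
        if value is None and now_ms - timestamp_ms > delete_retention_ms:
            # The tombstone has outlived delete.retention.ms, so the key disappears.
            continue
        survivors.append((offset, timestamp_ms, key, value))
    return sorted(survivors)


DAY_MS = 86_400_000
log = [
    (0, 0, "user-1", "alice"),
    (1, 1_000, "user-2", "bob"),
    (2, 2_000, "user-1", None),  # tombstone for user-1
]

print(compact_with_tombstones(log, now_ms=3_000, delete_retention_ms=DAY_MS))
print(compact_with_tombstones(log, now_ms=2_000 + DAY_MS + 1, delete_retention_ms=DAY_MS))
```

In the first call the tombstone is still present, so a consumer reading the topic learns that `user-1` was deleted. In the second call it has been dropped and only `user-2` remains.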

Interview Tip: Compacted topics are ideal for:

Changelogs (database CDC)
Configuration/feature flags
User profile caches
KTable backing in Kafka Streams

Retention Strategies

Time-Based Retention

retention.ms=604800000          # 7 days
retention.bytes=-1               # Unlimited size

Size-Based Retention

retention.ms=-1                  # Unlimited time
retention.bytes=1073741824       # 1GB per partition

Combined

retention.ms=604800000           # 7 days OR
retention.bytes=1073741824       # 1GB, whichever comes first
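
How the two limits combine can be modelled over a list of log segments. This is a simplified sketch: retention works on whole segments, the active (newest) segment is never deleted, and the function `segments_to_delete` and its tuple layout are assumptions made for the example.

```python
def segments_to_delete(segments: list, now_ms: int,
                       retention_ms: int, retention_bytes: int) -> list:
    """segments: oldest-first (base_offset, size_bytes, max_timestamp_ms); last is active."""
    closed = segments[:-1]  # the active segment is never eligible
    total = sum(size for _, size, _ in segments)
    doomed = []
    for base_offset, size, max_timestamp_ms in closed:
        # -1 disables a limit, matching retention.ms=-1 / retention.bytes=-1.
        expired = retention_ms >= 0 and now_ms - max_timestamp_ms > retention_ms
        oversized = retention_bytes >= 0 and total - size >= retention_bytes
        if not (expired or oversized):
            break  # deletion only proceeds from the oldest segment forward
        doomed.append(base_offset)
        total -= size
    return doomed


DAY_MS = 86_400_000
MIB = 1024 * 1024
segments = [
    (0, 512 * MIB, 1 * DAY_MS),
    (1000, 512 * MIB, 5 * DAY_MS),
    (2000, 512 * MIB, 9 * DAY_MS),
    (3000, 100 * MIB, 10 * DAY_MS),  # active segment
]

# 7 days OR 1GB, whichever comes first
print(segments_to_delete(segments, 10 * DAY_MS, 604_800_000, 1_073_741_824))  # [0]
```

Because deletion happens per segment, a partition can temporarily hold more than `retention.bytes` or data older than `retention.ms` until the oldest segment as a whole qualifies.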

Disaster Recovery

Multi-Datacenter Replication

Tool	Description
MirrorMaker 2	Kafka-native, replicates topics between clusters
Confluent Replicator	Commercial, more features
Cluster Linking	Confluent Cloud, byte-for-byte replication

MirrorMaker 2 Architecture

Backup Strategies

Regular Backups: Use tools like Kafka Connect to S3
Topic Mirroring: MirrorMaker 2 to standby cluster
Retention Policy: Set high retention for critical topics

Performance Tuning

Broker Tuning

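# Socket buffers (102400 is the default; raise for high-latency or cross-datacenter links)
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400

# Request-handling threads (8 and 3 are the defaults; scale with disks and cores)
num.io.threads=8
num.network.threads=3

# Log flush settings (forcing frequent flushes costs throughput; by default Kafka
# leaves flushing to the OS and relies on replication for durability)
log.flush.interval.messages=10000
log.flush.interval.ms=1000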

Producer Tuning

# Batching
batch.size=16384           # 16KB batches
linger.ms=5                # Wait 5ms to fill batch

# Compression
compression.type=lz4       # lz4 or zstd for best perf

# Memory
buffer.memory=33554432     # 32MB buffer
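
The relationship between `batch.size` and `linger.ms` comes down to a simple rule: a batch is sent when it is full or when the linger time has elapsed, whichever happens first. The sketch below is a deliberately simplified model of that condition. The function `batch_ready` is a hypothetical name, not part of the producer API.

```python
def batch_ready(batch_bytes: int, waited_ms: float,
                batch_size: int = 16384, linger_ms: float = 5) -> bool:
    """Simplified send condition for one partition's batch."""
    return batch_bytes >= batch_size or waited_ms >= linger_ms


print(batch_ready(batch_bytes=16384, waited_ms=1))  # full batch, sent before linger expires
print(batch_ready(batch_bytes=2048, waited_ms=5))   # linger expired, sent partially full
print(batch_ready(batch_bytes=2048, waited_ms=2))   # keeps waiting for more records
```

Raising `linger.ms` trades a little latency for fuller batches, which also gives compression more data to work with.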

Consumer Tuning

# Fetch settings
fetch.min.bytes=1024       # Wait for 1KB before returning
fetch.max.wait.ms=500      # Or 500ms, whichever first
max.partition.fetch.bytes=1048576  # 1MB per partition

Interview Questions & Answers

How do you monitor a Kafka cluster?

Key Metrics to Monitor:

Under-replicated partitions: Should be 0
Active Controller Count: Should be 1
Offline Partitions: Should be 0
Request Latency: Produce/Fetch p99
Consumer Lag: Per consumer group

Tools:

JMX + Prometheus + Grafana
Confluent Control Center
Burrow (consumer lag)
Cruise Control (auto-rebalancing)

How do you add a new broker to a cluster?

Configure new broker with unique broker.id
Point to same ZooKeeper/KRaft controllers
Start the broker
Partitions don’t auto-move! Use kafka-reassign-partitions.sh

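# Generate a proposed plan (prints the current and proposed assignments;
# save the proposed JSON as plan.json)
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --generate \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4"

# Execute reassignment
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --execute \
  --reassignment-json-file plan.json

# Check progress until every partition reports completion
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --verify \
  --reassignment-json-file plan.json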
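
The reassignment file is plain JSON listing the desired replica set for each partition, so it can also be written by hand or by a script. The sketch below builds one with a simple round-robin layout. The helper `build_reassignment` and the topic name `orders` are hypothetical, and the round-robin placement is an assumption of this example, not the algorithm that `--generate` uses.

```python
import json


def build_reassignment(topic: str, partition_count: int,
                       brokers: list, replication_factor: int) -> dict:
    """Spread each partition's replicas across brokers in round-robin order."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for partition in range(partition_count):
        replicas = [brokers[(partition + i) % len(brokers)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = build_reassignment("orders", partition_count=4, brokers=[1, 2, 3, 4],
                          replication_factor=3)
print(json.dumps(plan, indent=2))
```

The first broker in each `replicas` list is the preferred leader, so rotating the starting broker per partition also spreads leadership evenly.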

What is log compaction and when would you use it?

Log compaction keeps only the latest value per key, not by time.

Use Cases:

Database changelog (CDC)
User profile cache
Configuration store
KTable backing store

Config: cleanup.policy=compact

How do you handle a broker failure?

Automatic: Controller elects new leaders from ISR
Check Under-Replicated Partitions: If > 0, replication is catching up
Bring broker back: It will rejoin and catch up
If disk failed: May need to reassign partitions

Prevent data loss:

RF=3, min.insync.replicas=2
acks=all on producers
unclean.leader.election.enable=false

How do you upgrade Kafka without downtime?

Rolling Upgrade Process:

Set inter.broker.protocol.version to current version
Upgrade brokers one at a time (preferred replicas first)
After all brokers upgraded, bump protocol version
Upgrade clients (consumers first, then producers)

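Key: Follow the upgrade notes for the target release. Many releases support a direct rolling upgrade from much older versions, but some transitions (for example, moving from ZooKeeper to KRaft) require an intermediate step.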

What is Cruise Control?

An open-source tool for Kafka cluster management:

Auto-rebalancing: Moves partitions for even load
Self-healing: Handles broker failures
Goal-based optimization: CPU, disk, network balance

Reduces operational burden of managing large clusters.

Common Pitfalls

1. Disk Full: Kafka stops accepting writes. Monitor disk usage and set retention appropriately.
2. Under-Replicated Partitions: Indicates slow broker or network issues. Don’t ignore!
3. Ignoring the Upgrade Path: Skipping required intermediate steps can break the cluster. Always follow the upgrade notes for the target release.
4. No Monitoring: Without metrics, you’re flying blind. Set up alerting on key metrics.
5. Manual Partition Management: Use Cruise Control for large clusters to avoid imbalanced load.
