Kafka Operations
Learn to deploy, manage, and secure Kafka clusters in production environments.
Cluster Architecture
A production Kafka cluster typically consists of:
- 3+ Brokers: For availability and replication.
- 3+ ZooKeeper Nodes (or KRaft controllers): For cluster coordination.
Key Configurations (server.properties)
| Property | Description | Recommended |
|---|---|---|
| broker.id | Unique integer ID | Unique per broker |
| log.dirs | Where data is stored | Separate disk/mount |
| num.partitions | Default partitions | 3 |
| default.replication.factor | Default replication | 3 |
| min.insync.replicas | Min replicas for ack | 2 |
| auto.create.topics.enable | Auto-create topics | false (Production) |
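The recommendations in the table can be checked mechanically. The following is a minimal sketch, assuming the broker config is a plain key=value file; the function names and the sample contents are hypothetical and only illustrate the idea.

```python
# Recommended values taken from the table above.
RECOMMENDED = {
    "num.partitions": "3",
    "default.replication.factor": "3",
    "min.insync.replicas": "2",
    "auto.create.topics.enable": "false",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_broker_config(text):
    """Return (key, actual, recommended) for every setting that deviates."""
    props = parse_properties(text)
    return [
        (key, props.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if props.get(key) != wanted
    ]


# Hypothetical file contents, for illustration only.
sample = """
broker.id=1
log.dirs=/data/kafka
num.partitions=3
default.replication.factor=3
min.insync.replicas=1
"""

for key, actual, wanted in check_broker_config(sample):
    print(f"{key}: found {actual!r}, recommended {wanted!r}")
```

A setting that is missing from the file is reported with an actual value of None, which means the broker falls back to its built-in default.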
High Availability & Replication
Replication Factor
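- RF=3: Standard for production. Combined with min.insync.replicas=2 and acks=all, 1 broker can fail with no data loss while writes continue.
- Min ISR=2: With acks=all, a write is acknowledged only after at least 2 in-sync replicas (the leader included) have it.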
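How the replication factor and the minimum ISR size interact can be sketched as a small model. It assumes, as a simplification, that every surviving replica is in sync; the function name is hypothetical.

```python
def can_accept_write(replication_factor, min_insync_replicas, failed_brokers, acks="all"):
    """Model whether a produce request to one partition is accepted.

    Simplification: every surviving replica is assumed to be in sync.
    """
    alive = replication_factor - failed_brokers
    if alive <= 0:
        return False  # no replica left to lead the partition
    if acks == "all":
        # The leader rejects the write if the ISR is smaller than the minimum.
        return alive >= min_insync_replicas
    return True  # acks=0 and acks=1 only need a live leader


# RF=3, min ISR=2: one failure is tolerated, a second one blocks acks=all writes.
print(can_accept_write(3, 2, failed_brokers=1))
print(can_accept_write(3, 2, failed_brokers=2))
```

With two brokers down the data is still present on the last replica, but producers using acks=all are rejected until a second replica rejoins the ISR.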
In-Sync Replicas (ISR)
Replicas that are caught up with the leader.
- If a follower falls too far behind (it has not caught up within replica.lag.time.max.ms), it’s removed from ISR.
- If leader fails, only a member of ISR can become new leader.
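The two ISR rules can be illustrated with a simplified model. This is a sketch, not the broker's actual code: the function names are hypothetical, the lag limit stands in for replica.lag.time.max.ms, and the election simply picks the first assigned replica that is both alive and in the ISR.

```python
def in_sync_replicas(leader, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the ISR: the leader plus every follower that caught up recently.

    last_caught_up_ms maps follower broker id -> time it last matched the leader.
    """
    isr = {leader}
    for broker, caught_up in last_caught_up_ms.items():
        if now_ms - caught_up <= max_lag_ms:
            isr.add(broker)
    return isr


def elect_leader(assigned_replicas, isr, live_brokers):
    """Pick the first assigned replica that is alive and in the ISR, else None."""
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None  # partition goes offline (unclean election disabled)


# Broker 3 last caught up 40 s ago, so it drops out of the ISR.
isr = in_sync_replicas(leader=1, last_caught_up_ms={2: 99_000, 3: 60_000},
                       now_ms=100_000, max_lag_ms=30_000)
print(sorted(isr))
print(elect_leader([1, 2, 3], isr, live_brokers={2, 3}))
print(elect_leader([1, 2, 3], isr, live_brokers={3}))
```

In the last call only the out-of-sync broker 3 is alive, so no leader is elected and the partition stays offline rather than losing data.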
Monitoring & Observability
Key Metrics (JMX)
Under Replicated Partitions
Critical: Should be 0. >0 means data isn’t fully replicated.
Active Controller Count
Critical: Should be 1 per cluster.
Offline Partitions
Critical: >0 means data is unavailable.
Request Latency
Time to process produce/fetch requests.
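The three critical thresholds above translate directly into alert rules. The sketch below assumes the values have already been collected and aggregated across the cluster; the dictionary keys are hypothetical names, and the comments note the JMX MBeans they would typically come from.

```python
def check_cluster_health(metrics):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    # kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
    if metrics["under_replicated_partitions"] > 0:
        problems.append("under-replicated partitions: data is not fully replicated")
    # kafka.controller:type=KafkaController,name=ActiveControllerCount (summed over brokers)
    if metrics["active_controller_count"] != 1:
        problems.append("active controller count is not exactly 1")
    # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
    if metrics["offline_partitions"] > 0:
        problems.append("offline partitions: data is unavailable")
    return problems


# Hypothetical snapshot, for illustration only.
snapshot = {
    "under_replicated_partitions": 4,
    "active_controller_count": 1,
    "offline_partitions": 0,
}
for problem in check_cluster_health(snapshot):
    print("ALERT:", problem)
```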
Consumer Lag
The most important metric for consumers.
- Lag: Difference between latest offset and consumer offset.
- Monitoring: Use kafka-consumer-groups.sh or tools like Prometheus/Grafana.
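The lag definition above is simple arithmetic per partition. A minimal sketch, assuming the end offsets and the group's committed offsets have already been fetched (the function names and the sample numbers are hypothetical):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = latest (log end) offset - committed consumer offset.

    Partitions without a committed offset get None, since lag is undefined there.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        lag[partition] = None if committed is None else max(end - committed, 0)
    return lag


def total_lag(lag):
    """Sum the lag over all partitions that have a committed offset."""
    return sum(value for value in lag.values() if value is not None)


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 900, 2: 40}
committed_offsets = {0: 1200, 1: 900}

lag = consumer_lag(end_offsets, committed_offsets)
print(lag)
print(total_lag(lag))
```

A lag that keeps growing over time is the signal to act on; a constant, small lag is normal for a consumer that is keeping up.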
Security
1. Encryption (SSL/TLS)
Encrypts data in transit between clients and brokers, and between brokers.
2. Authentication (SASL)
Verifies identity of clients.
- SASL/PLAIN: Username/Password.
- SASL/SCRAM: Salted Challenge Response (More secure).
- SASL/GSSAPI (Kerberos): Enterprise integration.
3. Authorization (ACLs)
Controls what authenticated users can do.
Troubleshooting
Broker Failure
- Check Logs: /var/log/kafka/server.log.
- Check Disk: Is disk full? Kafka stops accepting writes if disk is full.
- Check Zookeeper: Is broker connected to ZK?
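The disk check from the list above can be automated with the standard library. This is a minimal sketch: the 90% threshold is an arbitrary assumption, and the path should point at the mount that holds log.dirs (the current directory is used here only so the example runs anywhere).

```python
import shutil


def disk_usage_percent(path):
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path, threshold_percent=90.0):
    """True when usage is at or above the (assumed) alert threshold."""
    return disk_usage_percent(path) >= threshold_percent


# Replace "." with the Kafka data directory on a real broker.
data_dir = "."
print(f"{data_dir}: {disk_usage_percent(data_dir):.1f}% used")
if is_disk_nearly_full(data_dir):
    print("WARNING: disk is nearly full")
```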
Consumer Issues
- Stuck Consumer: Check if rebalancing is happening constantly (“stop-the-world”).
- High Lag: Consumer is too slow.
  - Solution: Add more partitions and more consumer instances.
  - Solution: Optimize processing logic.
Data Loss Scenarios
- Unclean Leader Election: If all ISRs fail, non-ISR replica becomes leader (data loss).
  - Config: unclean.leader.election.enable=false (Default).
- Producer Acks=1: Leader accepts write but crashes before replicating.
  - Config: acks=all.
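The acks=1 scenario can be reproduced with a toy model of one partition. This is a sketch under simplifying assumptions: replication is an explicit method call instead of a background fetch, and the class and function names are hypothetical.

```python
class Partition:
    """Toy partition: one leader log plus follower logs that copy it."""

    def __init__(self, follower_count):
        self.leader_log = []
        self.follower_logs = [[] for _ in range(follower_count)]

    def replicate(self):
        """Bring every follower up to date with the leader."""
        for log in self.follower_logs:
            log[:] = self.leader_log

    def produce(self, record, acks):
        """Append to the leader; acks='all' waits for replication before acking."""
        self.leader_log.append(record)
        if acks == "all":
            self.replicate()
        return True  # acknowledgement sent to the producer

    def fail_leader(self):
        """The leader crashes; the first follower takes over with what it has."""
        self.leader_log = self.follower_logs.pop(0)


def lost_records(acks):
    """Return acknowledged records that are missing after a leader crash."""
    partition = Partition(follower_count=2)
    acked = []
    for record, replicate_afterwards in [("a", True), ("b", False)]:
        if partition.produce(record, acks):
            acked.append(record)
        if replicate_afterwards:
            partition.replicate()  # followers caught up in the background
    partition.fail_leader()  # crash before "b" was fetched by followers
    return [record for record in acked if record not in partition.leader_log]


print(lost_records("1"))
print(lost_records("all"))
```

With acks=1 the producer was told that "b" succeeded even though it existed only on the crashed leader; with acks=all the acknowledgement is sent only after the followers have the record.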
Maintenance Tasks
Rebalancing Partitions
When adding new brokers, partitions don’t automatically move. Use kafka-reassign-partitions.sh.
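The reassignment tool reads its plan from a JSON file. The sketch below builds such a plan with a naive round-robin placement; the topic name, broker ids, and the placement strategy are assumptions for illustration, and a real plan should also account for racks and current load.

```python
import json


def build_reassignment(topic, partition_count, broker_ids, replication_factor):
    """Spread replicas round-robin over broker_ids and return the plan as a dict."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for partition in range(partition_count):
        replicas = [
            broker_ids[(partition + offset) % len(broker_ids)]
            for offset in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical topic "orders" after broker 4 joined brokers 1-3.
plan = build_reassignment("orders", partition_count=3, broker_ids=[1, 2, 3, 4],
                          replication_factor=2)
print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file and passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute. The tool's own --generate mode can also propose a plan.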
Log Compaction
For topics where only the latest value for a key matters (e.g., user profile updates).
- Config: cleanup.policy=compact.
Log Compaction Deep Dive
Log compaction ensures Kafka retains the latest value for each key, rather than retaining by time.
How It Works
Configuration
Tombstones (Deletes)
To delete a key, produce a message with that key and a null value (a tombstone). The tombstone is retained for delete.retention.ms, then removed during compaction.
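The effect of compaction and tombstones can be shown on an in-memory list of records. This is a conceptual sketch only: a real broker compacts closed log segments in the background and keeps tombstones for delete.retention.ms before dropping them; the function names and the sample data are hypothetical.

```python
def compact(log):
    """Keep only the latest record for each key, preserving log order.

    log is a list of (key, value) tuples; value None marks a tombstone.
    """
    latest_index = {}
    for index, (key, _value) in enumerate(log):
        latest_index[key] = index  # later records overwrite earlier ones
    return [record for index, record in enumerate(log)
            if latest_index[record[0]] == index]


def drop_tombstones(log):
    """Remove tombstones, as happens once delete.retention.ms has passed."""
    return [(key, value) for key, value in log if value is not None]


log = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example"),
    ("user-1", "alice@new.example"),
    ("user-2", None),  # tombstone: delete user-2
]

compacted = compact(log)
print(compacted)
print(drop_tombstones(compacted))
```

After compaction the tombstone for user-2 is still visible, so consumers that read the topic later learn about the delete; only the second step removes the key entirely.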
Retention Strategies
Time-Based Retention
Size-Based Retention
Combined
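How time-based (retention.ms) and size-based (retention.bytes) limits combine can be sketched as follows. This is a simplified model under stated assumptions: segments are listed oldest first, the last one is the active segment and is never deleted, and a segment is removed as soon as either limit allows it. The function name and the sample numbers are hypothetical.

```python
def segments_to_keep(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the segments that survive retention, oldest first.

    Each segment is a dict with 'last_timestamp_ms' and 'size_bytes'.
    """
    kept = list(segments)
    # Time-based: drop old segments from the front while they exceed retention.ms.
    if retention_ms is not None:
        while len(kept) > 1 and now_ms - kept[0]["last_timestamp_ms"] > retention_ms:
            kept.pop(0)
    # Size-based: drop the oldest segment only if the rest still holds
    # at least retention.bytes.
    if retention_bytes is not None:
        excess = sum(segment["size_bytes"] for segment in kept) - retention_bytes
        while len(kept) > 1 and excess >= kept[0]["size_bytes"]:
            excess -= kept.pop(0)["size_bytes"]
    return kept


# Four hypothetical 100-byte segments written at t=1000..4000 ms.
segments = [{"last_timestamp_ms": t, "size_bytes": 100} for t in (1000, 2000, 3000, 4000)]

print(len(segments_to_keep(segments, now_ms=5000, retention_ms=2500)))
print(len(segments_to_keep(segments, now_ms=5000, retention_bytes=250)))
print(len(segments_to_keep(segments, now_ms=5000, retention_ms=3500, retention_bytes=250)))
```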
Disaster Recovery
Multi-Datacenter Replication
| Tool | Description |
|---|---|
| MirrorMaker 2 | Kafka-native, replicates topics between clusters |
| Confluent Replicator | Commercial, more features |
| Cluster Linking | Confluent Cloud, byte-for-byte replication |
MirrorMaker 2 Architecture
Backup Strategies
- Regular Backups: Use tools like Kafka Connect to S3
- Topic Mirroring: MirrorMaker 2 to standby cluster
- Retention Policy: Set high retention for critical topics
Performance Tuning
Broker Tuning
Producer Tuning
Consumer Tuning
Interview Questions & Answers
How do you monitor a Kafka cluster?
Key Metrics to Monitor:
- Under-replicated partitions: Should be 0
- Active Controller Count: Should be 1
- Offline Partitions: Should be 0
- Request Latency: Produce/Fetch p99
- Consumer Lag: Per consumer group
Tools:
- JMX + Prometheus + Grafana
- Confluent Control Center
- Burrow (consumer lag)
- Cruise Control (auto-rebalancing)
How do you add a new broker to a cluster?
- Configure new broker with unique broker.id
- Point to same ZooKeeper/KRaft controllers
- Start the broker
- Partitions don’t auto-move! Use kafka-reassign-partitions.sh
What is log compaction and when would you use it?
Log compaction keeps only the latest value for each key instead of expiring records by age.
Use Cases:
- Database changelog (CDC)
- User profile cache
- Configuration store
- KTable backing store
Config: cleanup.policy=compact
How do you handle a broker failure?
- Automatic: Controller elects new leaders from ISR
- Check Under-Replicated Partitions: If > 0, replication is catching up
- Bring broker back: It will rejoin and catch up
- If disk failed: May need to reassign partitions
Prevention:
- RF=3, min.insync.replicas=2
- acks=all on producers
- unclean.leader.election.enable=false
How do you upgrade Kafka without downtime?
Rolling Upgrade Process:
- Set inter.broker.protocol.version to current version
- Upgrade brokers one at a time (preferred replicas first)
- After all brokers upgraded, bump protocol version
- Upgrade clients (consumers first, then producers)
What is Cruise Control?
An open-source tool for Kafka cluster management:
- Auto-rebalancing: Moves partitions for even load
- Self-healing: Handles broker failures
- Goal-based optimization: CPU, disk, network balance
Common Pitfalls
🎉 Congratulations! You’ve completed the Kafka Crash Course.