Kafka Operations
Learn to deploy, manage, and secure Kafka clusters in production environments. Running Kafka in production is like running a commercial airline: the mechanics of flight are well understood, but operational discipline (monitoring, maintenance, capacity planning, and incident response) is what separates a smooth operation from a disaster.
Cluster Architecture
A production Kafka cluster typically consists of:
- 3+ Brokers: For availability and replication.
- 3+ ZooKeeper Nodes (or KRaft controllers): For cluster coordination.
Key Configurations (server.properties)
| Property | Description | Recommended | Why |
|---|---|---|---|
| broker.id | Unique integer ID | Unique per broker | Identifies the broker in metadata. Cannot be changed after data is written. |
| log.dirs | Where data is stored | Separate disk/mount | Dedicated disks avoid I/O contention with OS or application logs. Use multiple dirs for JBOD. |
| num.partitions | Default partitions for auto-created topics | 3 | Only matters if auto.create.topics.enable=true (which it should not be in production). |
| default.replication.factor | Default replication for auto-created topics | 3 | RF=3 survives one broker failure with no data loss. |
| min.insync.replicas | Min replicas that must ack a write | 2 | With RF=3, this ensures writes survive one broker failure. Set to 1 and you risk silent data loss. |
| auto.create.topics.enable | Create topics on first produce/consume | false (Production) | A typo in a topic name silently creates a new topic. Debugging “where did my messages go?” at 3 AM is not fun. |
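The recommendations in the table can be checked mechanically before a broker is deployed. The following is a minimal sketch, not an official tool: the function names `parse_properties` and `check_broker_config` and the sample values are hypothetical, and the fallback values assume Kafka's stock defaults (replication factor 1, min ISR 1, auto-creation enabled).

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_broker_config(props):
    """Return warnings for settings that differ from the table's recommendations."""
    warnings = []
    # Fallbacks below assume Kafka's built-in defaults when a key is absent.
    rf = int(props.get("default.replication.factor", "1"))
    min_isr = int(props.get("min.insync.replicas", "1"))
    if rf < 3:
        warnings.append("default.replication.factor is below 3")
    if min_isr < 2:
        warnings.append("min.insync.replicas is below 2")
    if min_isr >= rf:
        warnings.append("min.insync.replicas leaves no room for a broker failure")
    if props.get("auto.create.topics.enable", "true").lower() != "false":
        warnings.append("auto.create.topics.enable should be false in production")
    return warnings


# Hypothetical server.properties content for illustration.
sample = """
broker.id=1
log.dirs=/data/kafka
default.replication.factor=3
min.insync.replicas=2
auto.create.topics.enable=false
"""
print(check_broker_config(parse_properties(sample)))  # prints []
```

A check like this fits naturally into a deployment pipeline, so a misconfigured broker is caught before it joins the cluster.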
High Availability & Replication
Replication Factor
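- RF=3: Standard for production. Combined with min.insync.replicas=2, one broker can fail with no data loss and no interruption to writes.
- Min ISR=2: With acks=all on the producer, a write is acknowledged only after at least 2 replicas have it.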
In-Sync Replicas (ISR)
Replicas that are caught up with the leader.
- If a follower falls too far behind, it’s removed from the ISR.
- If the leader fails, only a member of the ISR can become the new leader.
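The relationship between replication factor, ISR size, and write availability comes down to simple arithmetic. The sketch below illustrates it under the assumption that producers use acks=all; the function names are hypothetical.

```python
def accepts_acks_all_writes(isr_size, min_insync_replicas):
    """An acks=all write is accepted only while the ISR is large enough."""
    return isr_size >= min_insync_replicas


def tolerated_failures(replication_factor, min_insync_replicas):
    """Brokers that can fail while acks=all writes still succeed."""
    return max(replication_factor - min_insync_replicas, 0)


print(tolerated_failures(3, 2))        # prints 1
print(accepts_acks_all_writes(2, 2))   # prints True
print(accepts_acks_all_writes(1, 2))   # prints False
```

With RF=3 and min ISR=2, losing a second broker does not lose acknowledged data, but the partition stops accepting acks=all writes until a replica rejoins the ISR.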
Monitoring & Observability
Key Metrics (JMX)
Under Replicated Partitions
Active Controller Count
Offline Partitions
Request Latency
Consumer Lag
The most important metric for consumers.
- Lag: Difference between the latest offset in a partition and the consumer group's committed offset.
- Monitoring: Use kafka-consumer-groups.sh or tools like Prometheus/Grafana.
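Lag is computed per partition and usually summed per consumer group. The following sketch shows the calculation on plain dictionaries; the function name and the sample offsets are hypothetical, and treating a partition with no committed offset as starting from 0 is an assumption made for illustration only.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no commit yet is counted from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


end_offsets = {0: 100, 1: 250, 2: 40}   # latest offset per partition
committed = {0: 90, 1: 250}             # partition 2 has no commit yet
lag = consumer_lag(end_offsets, committed)
print(lag)                 # prints {0: 10, 1: 0, 2: 40}
print(sum(lag.values()))   # prints 50
```

A lag value that keeps growing over time matters more than a single large reading, so alerting on the trend is usually more useful than alerting on a fixed threshold.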
Security
1. Encryption (SSL/TLS)
Encrypts data in transit between clients and brokers, and between brokers. Without TLS, every message is sent in plain text over the network, including any sensitive data your applications produce. In regulated environments (PCI, HIPAA, SOC2), encryption in transit is mandatory.
2. Authentication (SASL)
Verifies identity of clients.
- SASL/PLAIN: Username/Password.
- SASL/SCRAM: Salted Challenge Response (More secure).
- SASL/GSSAPI (Kerberos): Enterprise integration.
3. Authorization (ACLs)
Controls what authenticated users can do. ACLs follow the principle of least privilege: deny everything by default, then grant specific permissions to specific principals.
Troubleshooting
Kafka troubleshooting follows a systematic pattern: check the obvious first (disk, network, logs), then move to cluster-level health (ISR, controller), then application-level issues (consumer lag, producer errors).
Broker Failure
- Check Logs: /var/log/kafka/server.log. Look for FATAL or ERROR entries. The most common causes are disk failures, OOM kills, and GC pauses.
- Check Disk: Is the disk full? A broker cannot accept writes once its log directory is full, and full disks are one of the most common causes of production Kafka outages. Run df -h on broker hosts and set alerts at 80% capacity.
- Check ZooKeeper/KRaft Connectivity: Is the broker connected to the coordination layer? Run echo ruok | nc zk-host 2181 for ZooKeeper, or check KRaft controller logs for lost quorum.
- Check Under-Replicated Partitions: kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092. If a broker is slow (not dead), you will see its partitions falling out of ISR before the broker actually fails.
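The 80% disk alert from the checklist above can be sketched with the Python standard library. This is an illustration rather than a replacement for a real monitoring agent; the function names are hypothetical, and the path passed in would be whatever directory log.dirs points at on the broker host.

```python
import shutil


def usage_fraction(total_bytes, used_bytes):
    """Fraction of the disk that is in use, between 0.0 and 1.0."""
    return used_bytes / total_bytes


def disk_needs_alert(path, threshold=0.80):
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_fraction(usage.total, usage.used) >= threshold


# "." stands in for a Kafka log directory in this example.
print(disk_needs_alert("."))
```

Alerting at 80% rather than at 95% leaves time to reduce retention or add capacity before the broker stops accepting writes.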
Consumer Issues
- Stuck Consumer (Rebalance Storm): Check if rebalancing is happening constantly (“stop-the-world”). Common causes: GC pauses exceeding session.timeout.ms, slow processing exceeding max.poll.interval.ms, or a consumer that keeps crashing and rejoining. Check kafka-consumer-groups.sh --describe and look for rapidly changing partition assignments.
- High Lag: Consumer is too slow. Diagnose: is the bottleneck CPU (processing), I/O (database writes), or network (large messages)? Solutions ranked by effort: (1) Increase max.poll.records and batch database writes, (2) Optimize processing logic, (3) Add more partitions and consumer instances, (4) Consider async processing with a local buffer.
- Duplicate Processing: Usually caused by offsets being committed too late (consumer crashes after processing but before committing). Make your processing idempotent (e.g., upsert instead of insert, deduplicate by message ID).
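The idempotency advice for duplicate processing can be illustrated with an in-memory store that combines both techniques: an upsert keyed by user, and deduplication by message ID. The class name, field names, and sample data are hypothetical; a real consumer would keep this state in a database rather than in memory.

```python
class ProfileStore:
    """In-memory stand-in for a database table keyed by user ID."""

    def __init__(self):
        self.rows = {}          # user_id -> latest profile
        self.seen_ids = set()   # message IDs that were already applied

    def apply(self, message_id, user_id, profile):
        """Apply a message once; return False if it was a redelivery."""
        if message_id in self.seen_ids:
            return False                 # duplicate: skip side effects
        self.rows[user_id] = profile     # upsert: insert or overwrite
        self.seen_ids.add(message_id)
        return True


store = ProfileStore()
print(store.apply("m1", "u1", {"name": "Ada"}))  # prints True
print(store.apply("m1", "u1", {"name": "Ada"}))  # prints False (redelivered)
print(len(store.rows))                           # prints 1
```

Because redelivered messages become no-ops, the consumer can commit offsets after processing without risking double-applied writes.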
Data Loss Scenarios
- Unclean Leader Election: If all in-sync replicas fail and unclean election is enabled, an out-of-sync replica becomes leader (data loss). Config: unclean.leader.election.enable=false (Default). This means the partition becomes unavailable rather than losing data: availability is sacrificed for durability.
- Producer acks=1: Leader accepts write but crashes before replicating. Config: acks=all. This is one of the most common misconfigurations that lead to data loss in production.
- min.insync.replicas=1 with RF=2: Only one copy exists after a single broker failure. Config: RF=3, min.insync.replicas=2. This is the minimum safe configuration for production data.
Maintenance Tasks
Rebalancing Partitions
When adding new brokers, partitions don’t automatically move. Use kafka-reassign-partitions.sh.
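kafka-reassign-partitions.sh takes its target layout as a JSON file. The sketch below generates such a file with a simple round-robin placement. The round-robin strategy, the function name, the topic name, and the broker IDs are all assumptions for illustration; the tool's own --generate mode or Cruise Control will usually produce a better-balanced plan.

```python
import json


def build_reassignment(topic, partition_count, broker_ids, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin over brokers."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds the number of brokers")
    partitions = []
    n = len(broker_ids)
    for p in range(partition_count):
        # The first replica in the list is the preferred leader.
        replicas = [broker_ids[(p + i) % n] for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical topic with 4 partitions spread over brokers 1-4 (4 is new).
plan = build_reassignment("orders", 4, [1, 2, 3, 4], 3)
print(json.dumps(plan, indent=2))
```

The resulting JSON can be saved to a file and passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute.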
Log Compaction
For topics where only the latest value for a key matters (e.g., user profile updates).
- Config: cleanup.policy=compact.
Log Compaction Deep Dive
Log compaction ensures Kafka retains the latest value for each key, rather than retaining by time.
How It Works
Configuration
Tombstones (Deletes)
To delete a key, produce a message with that key and a null value. The tombstone is retained for delete.retention.ms, then removed during compaction.
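The effect of compaction and tombstones can be modelled on a small in-memory log. This is a conceptual sketch only: the real log cleaner works segment by segment and never touches the active segment, and the function names and sample records are hypothetical.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order.

    `log` is a list of (key, value) tuples in offset order.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)    # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


def drop_tombstones(compacted):
    """Model the state after delete.retention.ms: null-valued records are gone."""
    return [(key, value) for key, value in compacted if value is not None]


log = [("u1", "a"), ("u2", "b"), ("u1", "c"), ("u2", None)]  # None = tombstone
compacted = compact(log)
print(compacted)                   # prints [('u1', 'c'), ('u2', None)]
print(drop_tombstones(compacted))  # prints [('u1', 'c')]
```

The tombstone survives the first pass so that consumers reading the topic still see the delete; only after the retention window does the key disappear entirely.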
Retention Strategies
Time-Based Retention
Size-Based Retention
Combined
Disaster Recovery
Multi-Datacenter Replication
| Tool | Description |
|---|---|
| MirrorMaker 2 | Kafka-native, replicates topics between clusters |
| Confluent Replicator | Commercial, more features |
| Cluster Linking | Confluent Cloud, byte-for-byte replication |
MirrorMaker 2 Architecture
Backup Strategies
- Regular Backups: Use tools like Kafka Connect to S3
- Topic Mirroring: MirrorMaker 2 to standby cluster
- Retention Policy: Set high retention for critical topics
Performance Tuning
Broker Tuning
Producer Tuning
Consumer Tuning
Interview Questions & Answers
How do you monitor a Kafka cluster?
- Under-replicated partitions: Should be 0
- Active Controller Count: Should be 1
- Offline Partitions: Should be 0
- Request Latency: Produce/Fetch p99
- Consumer Lag: Per consumer group
- JMX + Prometheus + Grafana
- Confluent Control Center
- Burrow (consumer lag)
- Cruise Control (auto-rebalancing)
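The expected values listed above translate directly into a health check. The sketch below assumes the metric values were already scraped (for example from JMX) into a dictionary; the key names and the function name are hypothetical.

```python
def cluster_health(metrics):
    """Return a list of problems based on the expected metric values."""
    problems = []
    if metrics["under_replicated_partitions"] != 0:
        problems.append("under-replicated partitions present")
    # Summed across all brokers, exactly one controller should be active.
    if metrics["active_controller_count"] != 1:
        problems.append("active controller count is not 1")
    if metrics["offline_partitions"] != 0:
        problems.append("offline partitions present")
    return problems


healthy = {
    "under_replicated_partitions": 0,
    "active_controller_count": 1,
    "offline_partitions": 0,
}
print(cluster_health(healthy))  # prints []
```

Request latency and consumer lag are left out of the sketch because they have no fixed expected value and need thresholds tuned per workload.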
How do you add a new broker to a cluster?
- Configure new broker with unique broker.id
- Point to same ZooKeeper/KRaft controllers
- Start the broker
- Partitions don’t auto-move! Use kafka-reassign-partitions.sh
What is log compaction and when would you use it?
- Database changelog (CDC)
- User profile cache
- Configuration store
- KTable backing store
cleanup.policy=compact
How do you handle a broker failure?
- Automatic: Controller elects new leaders from ISR
- Check Under-Replicated Partitions: If > 0, replication is catching up
- Bring broker back: It will rejoin and catch up
- If disk failed: May need to reassign partitions
- RF=3, min.insync.replicas=2
- acks=all on producers
- unclean.leader.election.enable=false
How do you upgrade Kafka without downtime?
- Set inter.broker.protocol.version to current version
- Upgrade brokers one at a time (preferred replicas first)
- After all brokers upgraded, bump protocol version
- Upgrade clients (consumers first, then producers)
What is Cruise Control?
- Auto-rebalancing: Moves partitions for even load
- Self-healing: Handles broker failures
- Goal-based optimization: CPU, disk, network balance
Common Pitfalls
Interview Deep-Dive
Your Kafka cluster's disk usage is growing faster than expected. You have 7 days of retention configured but disk is filling up in 3 days. Walk me through how you diagnose and resolve this.
- First, I identify which topics are consuming the most disk. I check per-topic sizes with kafka-log-dirs.sh --describe --bootstrap-server localhost:9092. This shows the size of each topic-partition across brokers. Often, one or two topics dominate disk usage (e.g., a high-volume logging topic or a topic with large message payloads).
- Second, I check if retention is working correctly. I verify the topic-level configuration: kafka-configs.sh --describe --topic big-topic --bootstrap-server localhost:9092. If retention.ms and retention.bytes are set at the topic level, they override broker defaults. I also check cleanup.policy: if it is set to compact instead of delete, old segments are not deleted by time, they are compacted by key.
- Third, I check segment configuration. The active segment is never deleted by retention policy. If segment.bytes is set to 1GB (default) and a partition receives only 100MB/day, a segment takes about 10 days to fill by size, so it rolls only when segment.ms expires (7 days by default). Until it rolls, it is immune to retention cleanup, and after it rolls it is kept until its newest record is older than the retention period. The fix is to reduce segment.ms (e.g., to 1 hour) so segments roll by time even if they have not reached the size limit.
- Fourth, I check for partition imbalance. If some brokers have more partitions than others, disk usage is uneven. Kafka does not auto-rebalance partitions when new brokers are added. I use kafka-reassign-partitions.sh or Cruise Control to rebalance.
- Immediate fixes: reduce retention for the largest topics (if the business allows it), enable compression at the producer level (compression.type=lz4 or zstd can often reduce disk usage by 50-80% for text-like payloads), and add disk capacity if needed.
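The segment arithmetic in the third step can be worked through in a few lines. This sketch assumes binary units (1 GB = 1024 MB) and Kafka's default segment.ms of 7 days; the function names are hypothetical. It also shows why data can outlive the configured retention: a segment is only deletable once its newest record has expired.

```python
MS_PER_DAY = 86_400_000
DEFAULT_SEGMENT_MS = 7 * MS_PER_DAY   # Kafka's default segment.ms (7 days)
GB = 1024 ** 3
MB = 1024 ** 2


def days_until_roll(segment_bytes, bytes_per_day, segment_ms=DEFAULT_SEGMENT_MS):
    """A segment rolls when it is full or when segment.ms expires, whichever is first."""
    by_size = segment_bytes / bytes_per_day
    by_time = segment_ms / MS_PER_DAY
    return min(by_size, by_time)


def worst_case_record_age_days(roll_days, retention_days):
    """The oldest record in a segment waits for the roll, then for retention."""
    return roll_days + retention_days


print((1 * GB) / (100 * MB))                        # prints 10.24 (size-based roll alone)
roll = days_until_roll(1 * GB, 100 * MB)
print(roll)                                         # prints 7.0 (time-based roll wins)
print(worst_case_record_age_days(roll, 7))          # prints 14.0
hourly = days_until_roll(1 * GB, 100 * MB, segment_ms=3_600_000)
print(round(worst_case_record_age_days(hourly, 7), 2))  # prints 7.04
```

For a low-volume partition with 7-day retention, the oldest data can sit on disk for roughly twice the retention period under default settings, while an hourly segment.ms brings it back close to 7 days at the cost of more, smaller segment files.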
zstd can reduce storage by 60-80% for JSON/text payloads. Third, move cold data to a separate Kafka cluster with cheaper storage (larger, slower disks) using MirrorMaker 2, and set a shorter retention on the primary cluster. The compliance team reads from the archive cluster for audits.
You need to perform a zero-downtime upgrade of your Kafka cluster from version 3.3 to 3.6. Describe the exact steps and the risks at each stage.
- Kafka supports rolling upgrades: you upgrade brokers one at a time while the cluster continues serving traffic. The key constraint is the inter-broker protocol version: brokers must be able to communicate with each other during the transition.
- Step 1: Before touching any broker, set inter.broker.protocol.version to the current version (3.3) in the configuration files. This ensures that even after the binary is upgraded, the broker speaks the old protocol to communicate with not-yet-upgraded brokers. (log.message.format.version is deprecated since 3.0 and is ignored once the inter-broker protocol is 3.0 or higher, so it does not need to be pinned here.)
- Step 2: Upgrade brokers one at a time. For each broker: stop it cleanly (kill -TERM, not kill -9), replace the binary with 3.6, start it with the updated binary but the same old protocol version. Wait for the broker to rejoin the cluster and all its replicas to be in-sync (check under-replicated partitions = 0) before moving to the next broker.
- Step 3: Once all brokers are running the 3.6 binary, update inter.broker.protocol.version to 3.6 and do another rolling restart. This enables the new protocol features.
- Step 4: Upgrade clients. Consumers first (they are more tolerant of broker version changes), then producers. Kafka maintains backward compatibility, so old clients work with new brokers, but new client features require new broker protocol.
- Risks: the biggest risk is upgrading too fast. If you start the next broker before the previous one has fully rejoined and caught up, you temporarily reduce the number of in-sync replicas, which can cause write failures if min.insync.replicas is not met. Before skipping intermediate versions (for example, going from 3.0 to 3.6 directly), read the upgrade notes for every release in between, because protocol and configuration changes accumulate.
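The "one broker at a time, wait for zero under-replicated partitions" rule in the steps above can be expressed as a small control loop. This is a sketch of the gating logic only: `upgrade_broker` and `under_replicated_count` are hypothetical callables that the operator would supply (for example, wrappers around an SSH script and a metrics query).

```python
import time


def rolling_upgrade(broker_ids, upgrade_broker, under_replicated_count,
                    poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Upgrade brokers sequentially, halting if the cluster does not recover."""
    upgraded = []
    for broker_id in broker_ids:
        upgrade_broker(broker_id)            # stop, swap binary, start
        for _ in range(max_polls):
            if under_replicated_count() == 0:
                break                        # replicas caught up; safe to continue
            sleep(poll_seconds)
        else:
            # Never move on while replicas are still behind.
            raise RuntimeError(f"broker {broker_id} did not catch up; halting")
        upgraded.append(broker_id)
    return upgraded


# Dry run with fake callables: the URP count recovers after each restart.
counts = iter([2, 0, 0, 1, 0])
done = rolling_upgrade([1, 2, 3], lambda b: None, lambda: next(counts),
                       sleep=lambda s: None)
print(done)  # prints [1, 2, 3]
```

Raising instead of continuing is deliberate: an upgrade that stalls with two of three replicas healthy is recoverable, while pressing on can take a partition below min.insync.replicas.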
Your team has a 50-broker Kafka cluster with 10,000 partitions. Partition leadership is unevenly distributed -- some brokers lead 500 partitions while others lead 50. How do you rebalance, and what are the risks?
- Uneven partition leadership is common and happens organically: when brokers restart, preferred replica election may not run, and new topics may be created when some brokers are temporarily down. This causes hot spots where some brokers are overloaded with leader responsibilities (handling all reads and writes for those partitions) while others are idle.
- First, I check if running a preferred leader election fixes the issue: kafka-leader-election.sh --election-type preferred --all-topic-partitions --bootstrap-server localhost:9092. Kafka assigns a “preferred replica” for each partition (typically the first broker in the replica list). If a failover occurred and the preferred replica is healthy again, this command restores leadership to it. This is lightweight and often resolves most of the imbalance.
- If the replica assignment itself is skewed (not just the leadership), I use Cruise Control or kafka-reassign-partitions.sh to physically move partition replicas between brokers. Cruise Control is preferred for large clusters because it optimizes for disk, CPU, and network balance simultaneously, and it throttles data movement to avoid impacting production traffic.
- Risks of partition reassignment: the broker physically copies data from the current replicas to the new replicas. For a partition with 50GB of data, this is 50GB of network and disk I/O. Moving many partitions simultaneously can saturate the broker’s network and disk, causing increased latency for production traffic. I always set a replication throttle (kafka-reassign-partitions.sh --throttle 50000000 for 50MB/s) and run reassignments during low-traffic periods.
- Monitoring during rebalance: watch UnderReplicatedPartitions (may be non-zero during data movement but should trend toward zero), broker CPU and disk I/O, and consumer lag (should not spike significantly).
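Two numbers help when planning the rebalance described above: how skewed leadership is, and how long a throttled move will take. The sketch below computes both, using decimal units (1 GB = 10^9 bytes) to match the 50000000 bytes/s throttle; the function names and the sample leader counts are hypothetical.

```python
def leader_skew(leader_counts):
    """Ratio of the busiest broker's leader count to the cluster average."""
    average = sum(leader_counts) / len(leader_counts)
    return max(leader_counts) / average


def reassignment_seconds(bytes_to_move, throttle_bytes_per_sec):
    """Lower bound on copy time for one replica under a replication throttle."""
    return bytes_to_move / throttle_bytes_per_sec


# Three brokers leading 500, 50, and 50 partitions: average is 200.
print(leader_skew([500, 50, 50]))                     # prints 2.5
# One 50 GB partition at a 50 MB/s throttle.
print(reassignment_seconds(50 * 10**9, 50_000_000))   # prints 1000.0
```

At roughly 17 minutes per 50 GB replica, moving hundreds of partitions takes hours, which is why reassignments are usually batched and scheduled for low-traffic windows.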