Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Kafka Fundamentals
Master the core concepts of Apache Kafka event streaming and understand its distributed architecture.What is Kafka?
Apache Kafka is a distributed event streaming platform. Unlike traditional message queues (like RabbitMQ), Kafka is designed to handle massive streams of events, store them durably, and process them in real-time. Think of the difference this way: a traditional message queue is like a postal service — once the letter is delivered, it is gone. Kafka is like a newspaper archive — events are published, anyone can read them (even people who subscribe later), and they stay around for as long as you configure.Event Streaming
Distributed
Durable
Scalable
Core Architecture
1. Events (Messages)
An event records the fact that “something happened”.- Key: Optional, used for partitioning (e.g.,
user_123) - Value: The data payload (e.g., JSON
{"action": "login"}) - Timestamp: When it happened
- Headers: Optional metadata
2. Topics
A Topic is a logical category or feed name to which records are published. Think of it as a named channel on a TV network — producers broadcast to a channel, and any number of viewers (consumers) can tune in independently.- Analogous to a table in a database or a folder in a filesystem.
- Multi-subscriber: Can have zero, one, or many consumers. Each consumer reads independently at its own pace.
- Append-only: New events are always added to the end. You cannot insert, update, or delete individual events (log compaction is the closest thing to an “update”).
3. Partitions
Topics are split into Partitions.- Scalability: Partitions allow a topic to be spread across multiple servers.
- Ordering: Order is guaranteed only within a partition, not across the entire topic.
- Offset: Each message in a partition has a unique ID called an offset.
4. Brokers
A Kafka server is called a Broker.- Receives messages from producers.
- Assigns offsets.
- Commits messages to disk storage.
- Serves fetch requests from consumers.
- A Cluster consists of multiple brokers working together.
5. Replication
Kafka replicates partitions across multiple brokers for fault tolerance. This is Kafka’s insurance policy — if a hard drive or server dies, your data survives because copies exist on other machines.- Replication Factor: Number of copies (usually 3). RF=3 means the data exists on 3 different brokers. You can lose any one broker and still have all your data.
- Leader: One broker is the leader for a partition; handles all reads/writes. This simplifies consistency — there is one source of truth.
- Followers: Replicate data from the leader. If the leader fails, a follower is promoted to the new leader within seconds, and producers/consumers reconnect automatically.
Producers & Consumers
Producers
Applications that publish (write) events to Kafka topics.- Partitioning Strategy: Decides which partition a message goes to.
- Round-robin: If no key is provided (load balancing).
- Hash-based: If key is provided (same key always goes to same partition).
Consumers
Applications that subscribe to (read) events from Kafka topics.- Consumer Groups: A set of consumers working together to consume a topic.
- Each partition is consumed by only one consumer in the group.
- Allows parallel processing of a topic.
- Offsets: Consumers track their progress by committing offsets.
Kafka vs RabbitMQ (Deep Dive)
| Feature | Apache Kafka | RabbitMQ |
|---|---|---|
| Design | Distributed Commit Log | Traditional Message Broker |
| Message Retention | Policy-based (e.g., 7 days), durable | Deleted after consumption (usually) |
| Throughput | Extremely High (Millions/sec) | High (Thousands/sec) |
| Ordering | Guaranteed per partition | Guaranteed per queue |
| Consumption | Pull-based (Consumer polls) | Push-based (Broker pushes) |
| Use Case | Event streaming, Log aggregation, Analytics | Complex routing, Task queues |
The Power of Pull-Based Consumption
Kafka’s pull-based model is a key architectural decision that distinguishes it from traditional push-based messaging systems.Consumer Control
Consumer Control
Batching Efficiency
Batching Efficiency
Rewind & Replay
Rewind & Replay
- Recovering from errors
- Testing new processing logic on old data
- Training ML models
ZooKeeper vs KRaft (Critical for Interviews!)
Kafka has traditionally relied on ZooKeeper for cluster coordination. KRaft (Kafka Raft) is the new consensus protocol that removes this dependency.ZooKeeper Mode (Legacy)
ZooKeeper responsibilities:- Broker registration and discovery
- Controller election
- Topic configuration storage
- ACLs and quotas
KRaft Mode (New Standard - Kafka 3.3+)
| Aspect | ZooKeeper | KRaft |
|---|---|---|
| Architecture | Separate cluster | Integrated with Kafka |
| Latency | Higher (two systems) | Lower (single system) |
| Scalability | Limited by ZK | Millions of partitions |
| Operational Complexity | Two systems to manage | Single system |
| Production Ready | Yes | Yes (Kafka 3.6+) |
Partition Internals Deep Dive
Understanding partition internals is crucial for performance tuning and interviews.Log Segments
Each partition is stored as a series of log segments:How Writes Work
- Producer sends message to partition leader
- Leader appends to active segment file (sequential I/O - very fast!)
- Leader replicates to followers (if acks=all)
- Leader responds to producer with offset
How Reads Work
- Consumer requests offset range
- Broker uses
.indexfile to find segment - Broker does sequential read from segment
- Returns batch of messages
In-Sync Replicas (ISR) Deep Dive
ISR is one of the most important concepts for durability.What is ISR?
The set of replicas that are “caught up” with the leader. A replica is removed from ISR if:- It falls behind by more than
replica.lag.time.max.ms(default 30s) - It’s disconnected from ZooKeeper/controllers
ISR Configuration
| Config | Description | Default |
|---|---|---|
min.insync.replicas | Minimum ISRs for write to succeed | 1 |
replica.lag.time.max.ms | Max lag before removal from ISR | 30000 |
Data Loss Scenarios
acks=1, leader fails
acks=1, leader fails
acks=allAll ISRs fail simultaneously
All ISRs fail simultaneously
unclean.leader.election.enable=false (default)min.insync.replicas=1 with RF=2
min.insync.replicas=1 with RF=2
The Golden Rule
Interview Questions & Answers
How does Kafka achieve high throughput?
How does Kafka achieve high throughput?
- Sequential I/O: Append-only writes, no random disk seeks
- OS Page Cache: Uses filesystem cache for zero-copy reads
- Batching: Groups messages to reduce network/disk overhead
- Compression: Reduces network bandwidth
- Partitioning: Parallelism across brokers
What happens when a broker fails?
What happens when a broker fails?
- Controller detects broker failure (via heartbeats)
- For each partition where failed broker was leader:
- Controller elects new leader from ISR
- Updates metadata in all brokers
- Producers/consumers get metadata update and reconnect
- Data is safe if replica was in ISR
How do you choose the number of partitions?
How do you choose the number of partitions?
partitions = max(throughput/producer_throughput, throughput/consumer_throughput)Considerations:- More partitions = more parallelism
- More partitions = more memory/file handles
- Can’t decrease partitions (only increase)
- Each partition = at most one consumer in a group
What is the difference between a Topic and a Partition?
What is the difference between a Topic and a Partition?
- Topic: Logical category/feed (like a table)
- Partition: Physical subdivision of a topic (like a shard)
- Scalability (spread across brokers)
- Parallelism (multiple consumers)
- Ordering (guaranteed within partition)
How does Kafka handle consumer failures?
How does Kafka handle consumer failures?
- Consumer stops sending heartbeats
- After
session.timeout.ms, consumer is considered dead - Rebalance is triggered in the consumer group
- Partitions are reassigned to remaining consumers
- New consumer starts from last committed offset
What is the Controller in Kafka?
What is the Controller in Kafka?
- Monitors broker liveness
- Elects partition leaders when brokers fail
- Updates cluster metadata
- Manages partition reassignments
Common Pitfalls
Installation & Quick Start
- Docker (Recommended)
- Local Install
docker-compose up -dCLI Power User Commands
Topic Management
Producing & Consuming
Consumer Groups
Key Takeaways
- Topics are logs of events, divided into Partitions.
- Partitions allow Kafka to scale and guarantee ordering.
- Brokers form a cluster to provide durability and availability.
- Consumer Groups allow parallel processing of topics.
- Kafka is pull-based and stores data for a set retention period.
Interview Deep-Dive
You are designing a Kafka topic for an e-commerce system that processes orders. How do you decide the number of partitions and the partition key? Walk me through your reasoning.
You are designing a Kafka topic for an e-commerce system that processes orders. How do you decide the number of partitions and the partition key? Walk me through your reasoning.
- The partition key determines which messages land in the same partition, and the number of partitions determines the maximum parallelism for consumers.
- For an order processing system, I would use
customer_idas the partition key. This guarantees that all orders for a given customer are processed in order (ordering is guaranteed within a partition). If I usedorder_id, orders for the same customer could land in different partitions and be processed out of order — a customer might see “order shipped” before “order confirmed.” - For the number of partitions, I work backwards from throughput requirements. If the system needs to process 10,000 orders/second and each consumer instance can handle 1,000 orders/second, I need at least 10 partitions (one consumer per partition in a consumer group). I would start with 12-16 partitions to leave room for growth, because you can increase partitions later but never decrease them, and adding partitions changes the key-to-partition mapping (existing keys may land in different partitions).
- Gotchas to consider: if the customer base has a heavy skew (one enterprise customer generates 50% of orders), that customer’s partition becomes a hot partition. The fix is either a compound key (
customer_id + order_date) to spread the load, or a custom partitioner that sub-partitions high-volume customers. - I would also set the replication factor to 3 and
min.insync.replicasto 2, ensuring that a single broker failure does not cause data loss or availability issues.
Explain what happens when a Kafka broker fails, from the moment of failure to full recovery. Be specific about the role of the controller, ISR, and leader election.
Explain what happens when a Kafka broker fails, from the moment of failure to full recovery. Be specific about the role of the controller, ISR, and leader election.
- Step 1: The controller detects the failure. In ZooKeeper mode, the broker’s ZK session expires (default 18 seconds). In KRaft mode, the controller quorum detects missing heartbeats. This detection latency is the first component of recovery time.
- Step 2: For every partition where the failed broker was the leader, the controller selects a new leader from the ISR (In-Sync Replicas) set. The selection is deterministic — typically the first broker in the ISR list. If the ISR is empty (all replicas failed), the controller either waits for an ISR member to recover or, if
unclean.leader.election.enable=true, promotes a non-ISR replica (risking data loss). - Step 3: The controller updates the cluster metadata with the new leader assignments and broadcasts this to all remaining brokers.
- Step 4: Producers and consumers receive metadata updates (either through polling or through the
NotLeaderForPartitionExceptionerror that triggers a metadata refresh). They reconnect to the new leaders. This typically takes 1-5 seconds. - Step 5: For partitions where the failed broker was a follower, the remaining ISR continues to serve reads and writes from the leader. There is no immediate impact — just reduced redundancy.
- Step 6: When the failed broker restarts, it rejoins the cluster, begins fetching data from current leaders to catch up, and is re-added to ISR once it is caught up (within
replica.lag.time.max.ms). - Total downtime for affected partitions: typically 5-30 seconds, dominated by the failure detection latency. With KRaft, this is faster because there is no ZooKeeper session timeout.
acks=all will receive a NotLeaderForPartitionException (or LeaderNotAvailableException during the election window). With retries enabled (and retries should be set to Integer.MAX_VALUE in production), the producer will retry the request. The retry triggers a metadata refresh, which discovers the new leader, and the message is sent to the new leader. No data is lost — the message is either committed to the old leader’s ISR before failure (in which case the new leader has it) or retried to the new leader. This is why acks=all + retries + idempotent producer is the gold standard for durability.Explain why Kafka moved from ZooKeeper to KRaft. What were the specific pain points with ZooKeeper, and what does KRaft solve?
Explain why Kafka moved from ZooKeeper to KRaft. What were the specific pain points with ZooKeeper, and what does KRaft solve?
- ZooKeeper was Kafka’s external dependency for cluster coordination: broker registration, controller election, topic metadata storage, and ACLs. It worked, but it introduced three categories of pain.
- First, operational complexity. Running Kafka in production meant managing two distributed systems — the Kafka cluster and the ZooKeeper ensemble. Each has its own deployment, monitoring, backup, and upgrade procedures. ZooKeeper has its own configuration, its own failure modes, and its own capacity limits. For small teams, this doubled the operational burden.
- Second, scalability limits. ZooKeeper stores all metadata in memory and uses a consensus protocol (ZAB) that becomes a bottleneck at scale. LinkedIn hit limits around 200,000 partitions because ZooKeeper could not handle the metadata volume. Partition creation and deletion became slow because they required synchronous writes to ZooKeeper.
- Third, latency. Controller election and metadata updates required round-trips to ZooKeeper. When a broker failed, the detection and leader election process involved ZooKeeper session timeouts (default 18 seconds) plus the metadata update propagation. This meant partitions could be unavailable for 20-30 seconds during a broker failure.
- KRaft solves all three. It embeds the consensus protocol (Raft) directly into Kafka. Metadata is stored in a Kafka internal topic (
__cluster_metadata), managed by a quorum of controller nodes. No external ZooKeeper. This simplifies operations (one system), enables millions of partitions (metadata scales with Kafka itself), and reduces failover latency (Raft leader election is sub-second).
kafka-metadata.sh) that performs the transition online. The process: first, deploy KRaft controller nodes alongside the existing ZooKeeper-based cluster. Second, use the migration tool to replicate metadata from ZooKeeper to the KRaft controllers. Third, reconfigure brokers to use KRaft controllers instead of ZooKeeper (rolling restart). Fourth, once all brokers are on KRaft, decommission the ZooKeeper ensemble. The migration is designed to be zero-downtime, but I would schedule it during a low-traffic window and have a rollback plan. The key risk is the “point of no return” — once brokers are on KRaft and you decommission ZooKeeper, you cannot easily go back.Next: Kafka Producers & Consumers →