
Kafka Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to produce and consume messages, feel free to skip ahead. No judgment.
This chapter reveals the distributed systems magic behind Kafka. We will explore the commit log abstraction, understand how replication guarantees durability, and demystify consumer group coordination. This knowledge is what allows you to tune Kafka for millions of messages per second.

Why Internals Matter

Understanding Kafka internals helps you:
  • Tune for performance when milliseconds matter
  • Debug production issues when messages go missing
  • Design better systems that leverage Kafka’s strengths
  • Ace interviews, where questions about Kafka internals are common
  • Make informed decisions about partitioning and replication

The Fundamental Abstraction: The Commit Log

At its core, Kafka is an append-only, immutable commit log. This simple abstraction is the source of Kafka’s power.
Topic: user-events
Partition 0:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ ...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
  ▲                                            ▲
  │                                            │
Oldest                                      Newest
(may be deleted)                         (append here)
Key properties:
  • Append-only: New messages always go to the end
  • Immutable: Messages never change after writing
  • Ordered: Messages have sequential offsets within a partition
  • Durable: Messages persist to disk (not just memory)
This is fundamentally different from traditional message queues where messages are deleted after consumption.
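
Because the log is retained rather than deleted on read, a consumer can rewind and replay it at any time. A minimal sketch with the Java client, assuming an already-configured KafkaConsumer<String, String> named consumer:

// Attach directly to partition 0 of user-events and rewind to the oldest retained offset
TopicPartition p0 = new TopicPartition("user-events", 0);
consumer.assign(List.of(p0));
consumer.seekToBeginning(List.of(p0));
// Subsequent polls re-read the log from the beginning
consumer.poll(Duration.ofSeconds(1))
        .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));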

Log Segments: How Data is Stored

A partition is not one giant file. It is split into segments for manageability.
Partition directory: /kafka-logs/user-events-0/

├── 00000000000000000000.log      # Segment 1: offsets 0-999
├── 00000000000000000000.index    # Offset index
├── 00000000000000000000.timeindex # Timestamp index
├── 00000000000000001000.log      # Segment 2: offsets 1000-1999
├── 00000000000000001000.index
├── 00000000000000001000.timeindex
├── 00000000000000002000.log      # Active segment
└── ...

Segment Structure

Each segment consists of:
File        Purpose
.log        Actual message data
.index      Maps offset to file position
.timeindex  Maps timestamp to offset
.snapshot   Producer state for idempotency

Segment Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│                      Segment Lifecycle                           │
│                                                                  │
│  Create ──▶ Active ──▶ Rolled ──▶ Eligible for deletion         │
│                │                                                 │
│                │ When:                                           │
│                │ - Size > segment.bytes (1GB default)           │
│                │ - Age > segment.ms (7 days default)            │
│                │ - Index full                                    │
└─────────────────────────────────────────────────────────────────┘

Log Compaction

For topics where only the latest value per key matters:
Before compaction:
┌───────────┬───────────┬───────────┬───────────┬───────────┐
│ Key:A,V:1 │ Key:B,V:1 │ Key:A,V:2 │ Key:A,V:3 │ Key:B,V:2 │
└───────────┴───────────┴───────────┴───────────┴───────────┘

After compaction:
┌───────────┬───────────┐
│ Key:A,V:3 │ Key:B,V:2 │
└───────────┴───────────┘
Use cases:
  • CDC (Change Data Capture) - keep latest row state
  • User profiles - keep latest version
  • Configuration - keep current settings

The Index Files: Fast Lookups

Reading from offset 1,000,000 should not require scanning 1M messages. Indexes solve this.

Offset Index

Offset Index (.index):
┌─────────────────┬──────────────────┐
│ Relative Offset │ Physical Position│
├─────────────────┼──────────────────┤
│       0         │        0         │
│      100        │     102400       │
│      200        │     204800       │
│      ...        │       ...        │
└─────────────────┴──────────────────┘
Finding offset 150:
  1. Binary search index: 100 < 150 < 200
  2. Start at position 102400
  3. Scan forward until offset 150
Indexes are sparse (not every offset) for space efficiency.
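
The lookup is a binary search over the sparse entries followed by a short scan of the .log file. A simplified illustration of the idea in Java (these are not Kafka's actual classes):

// Toy model of a sparse offset index: sorted (relative offset -> file position) pairs
long[] relativeOffsets = {0, 100, 200};
long[] filePositions   = {0, 102_400, 204_800};

long target = 150;
int i = Arrays.binarySearch(relativeOffsets, target);
if (i < 0) i = -i - 2;                                // largest indexed offset <= target (here: 100)
long startPosition = filePositions[Math.max(i, 0)];   // 102400
// The broker then scans the .log file forward from startPosition until it reaches offset 150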

Timestamp Index

Timestamp Index (.timeindex):
┌───────────────┬─────────────────┐
│  Timestamp    │  Relative Offset│
├───────────────┼─────────────────┤
│ 1701590400000 │       0         │
│ 1701590410000 │      100        │
│ 1701590420000 │      200        │
└───────────────┴─────────────────┘
This powers offsetsForTimes() - seek to a specific time.
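
For example, with the Java consumer you can resolve a timestamp to an offset and seek to it. The topic name and time window below are illustrative, and consumer is assumed to be a configured KafkaConsumer<String, String>:

TopicPartition p0 = new TopicPartition("user-events", 0);
consumer.assign(List.of(p0));

// Ask the broker for the earliest offset whose timestamp is >= the given time
long oneHourAgo = System.currentTimeMillis() - 3_600_000;
Map<TopicPartition, OffsetAndTimestamp> result =
        consumer.offsetsForTimes(Map.of(p0, oneHourAgo));

OffsetAndTimestamp found = result.get(p0);
if (found != null) {
    consumer.seek(p0, found.offset());   // next poll() starts roughly one hour back
}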

Replication: The Safety Net

Kafka replicates partitions across brokers for fault tolerance.

Replication Topology

Topic: orders (replication-factor=3)

         ┌──────────────────────────────────────────────────┐
         │                 Partition 0                       │
         ├────────────────────┬──────────────┬──────────────┤
         │    Broker 1        │   Broker 2   │   Broker 3   │
         │    (Leader)        │  (Follower)  │  (Follower)  │
         │    [0,1,2,3,4]     │  [0,1,2,3,4] │  [0,1,2,3]   │
         └────────────────────┴──────────────┴──────────────┘
                                                     ▲
                                              Slightly behind

Leader and Followers

Role       Responsibilities
Leader     Handles all reads and writes for the partition
Followers  Replicate from the leader, ready to take over
Producers always write to the leader, and by default consumers read from it as well. This keeps consistency simple.

In-Sync Replicas (ISR)

The ISR is the set of replicas that are “caught up” with the leader:
ISR = [Broker1, Broker2, Broker3]  # All caught up
ISR = [Broker1, Broker2]           # Broker3 fell behind
ISR = [Broker1]                    # Only leader in sync (dangerous!)
A replica falls out of ISR when:
  • More than replica.lag.time.max.ms behind (default: 30s)
  • Not sending fetch requests

Durability Guarantees: acks

Producers control durability with acks:
acks      Meaning                Durability  Performance
0         Fire and forget        None        Fastest
1         Leader acknowledged    Moderate    Fast
all (-1)  All ISR acknowledged   Highest     Slowest
acks=all with min.insync.replicas=2:

Producer ──▶ Leader (Broker1)

              ├──▶ Follower (Broker2) ──▶ ACK

              └──▶ Follower (Broker3) ──▶ ACK

            ◀──────────────────────────────┘
                    Producer receives ACK
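
On the producer side this is a handful of settings (min.insync.replicas itself is a topic or broker config, not a producer one). A sketch, assuming props is the producer's Properties, producer is built from it, and orderJson is a placeholder payload:

// Wait for all in-sync replicas to acknowledge before a send() is considered successful
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

producer.send(new ProducerRecord<>("orders", "order-42", orderJson), (metadata, exception) -> {
    if (exception != null) {
        // The write did not satisfy acks=all / min.insync.replicas, e.g. NotEnoughReplicasException
        exception.printStackTrace();
    }
});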

Leader Election

When a leader fails:
  1. Controller detects leader failure (via ZooKeeper/KRaft)
  2. Controller selects new leader from ISR
  3. Controller updates metadata
  4. Brokers and clients fetch new metadata
  5. New leader starts serving requests
Before failure:
  ISR = [Broker1 (Leader), Broker2, Broker3]

Broker1 dies:
  ISR = [Broker2, Broker3]
  Controller elects Broker2 as new leader

After election:
  ISR = [Broker2 (Leader), Broker3]
Unclean leader election: If ISR is empty, Kafka can elect a non-ISR replica (data loss risk). Controlled by unclean.leader.election.enable (default: false).

Producer Internals

Understanding producer internals helps you tune for throughput.

Producer Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Kafka Producer                            │
│                                                                  │
│  ┌──────────┐   ┌──────────────┐   ┌────────────────────────┐  │
│  │  Record  │──▶│  Serializer  │──▶│     Partitioner        │  │
│  │  (K,V)   │   │              │   │ (Which partition?)     │  │
│  └──────────┘   └──────────────┘   └───────────┬────────────┘  │
│                                                  │               │
│                                    ┌─────────────▼─────────────┐│
│                                    │    Record Accumulator     ││
│                                    │  ┌─────┐ ┌─────┐ ┌─────┐ ││
│                                    │  │Batch│ │Batch│ │Batch│ ││
│                                    │  │ P0  │ │ P1  │ │ P2  │ ││
│                                    │  └─────┘ └─────┘ └─────┘ ││
│                                    └───────────┬───────────────┘│
│                                                │                │
│                                    ┌───────────▼───────────────┐│
│                                    │      Sender Thread        ││
│                                    │   (Network I/O)           ││
│                                    └───────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Batching: The Performance Trick

Producers accumulate records into batches:
Config         Purpose                     Default
batch.size     Max batch size in bytes     16KB
linger.ms      Wait time for more records  0ms
buffer.memory  Total memory for buffering  32MB
linger.ms=0:   Record arrives ──▶ Send immediately
linger.ms=10:  Record arrives ──▶ Wait 10ms ──▶ Send batch
Larger batches = better throughput, higher latency.
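
In the Java client these map directly onto producer properties. A fragment added to the producer's Properties; the values are illustrative, not recommendations:

// Accumulate up to 32 KB per partition batch and wait up to 5 ms for more records to arrive
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // compressed batches = less network and disk I/O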

Partitioning Strategy

Default partitioner:
  • If key is null: Sticky partitioning (Kafka 2.4+) fills a batch for one partition before moving to the next; older clients used round-robin
  • If key is present: hash(key) % numPartitions (a murmur2 hash of the serialized key)
// Same key always goes to same partition
producer.send(new ProducerRecord<>("orders", "user-123", orderData));
// user-123 orders always in same partition = ordered processing

Idempotent Producer

Enable with enable.idempotence=true:
Without idempotence:
  Producer sends ──▶ Network error ──▶ Retry ──▶ Duplicate message!

With idempotence:
  Producer sends (PID=1, Seq=5) ──▶ Network error ──▶ Retry (PID=1, Seq=5)


                                                   Broker: "Already have Seq=5"
                                                   Deduplicate!
Each producer gets a Producer ID (PID) and sequence numbers per partition.
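
Enabling it is a single setting added to the producer's Properties (it requires acks=all, which recent clients apply automatically when idempotence is on):

// Broker deduplicates retried batches using the producer's PID + per-partition sequence numbers
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");   // required by idempotence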

Consumer Internals

Consumer groups are Kafka’s killer feature for parallel processing.

Consumer Group Coordination

Consumer Group: order-processors

┌─────────────────────────────────────────────────────────────────┐
│                    Group Coordinator                             │
│                    (Broker 2)                                    │
│                                                                  │
│   Manages:                                                       │
│   - Group membership                                             │
│   - Partition assignment                                         │
│   - Offset commits                                               │
└─────────────────────────────────────────────────────────────────┘

           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      Consumer 1      Consumer 2      Consumer 3
      [P0, P1]        [P2, P3]        [P4, P5]

The Rebalance Protocol

When group membership changes (consumer joins/leaves), partitions are reassigned:
1. Consumer sends JoinGroup request
2. Coordinator waits for all consumers (session.timeout.ms)
3. Coordinator selects a leader consumer
4. Leader runs partition assignment algorithm
5. Coordinator sends assignments to all consumers
6. Consumers start fetching from assigned partitions
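
Applications can hook into these steps with a ConsumerRebalanceListener, for example to commit progress before partitions are taken away. The topic name is illustrative and consumer is a configured KafkaConsumer<String, String>:

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before partitions move to another consumer: commit what we have processed
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        System.out.println("Assigned: " + partitions);
    }
});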

Partition Assignment Strategies

Strategy                   Behavior
RangeAssignor              Assign contiguous partitions per topic
RoundRobinAssignor         Distribute partitions evenly across consumers
StickyAssignor             Minimize reassignments during rebalance
CooperativeStickyAssignor  Incremental rebalance (no stop-the-world)
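
The strategy is a consumer-side setting. A minimal sketch of opting into cooperative (incremental) rebalancing, added to the consumer's Properties:

// Only the partitions that actually move are revoked; the rest keep being consumed during the rebalance
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());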

Offset Management

Consumers track progress via offsets:
__consumer_offsets topic (internal):
┌─────────────────┬──────────────────┬─────────────────┐
│  Group          │  Topic-Partition │  Committed      │
├─────────────────┼──────────────────┼─────────────────┤
│  order-procs    │  orders-0        │  1523           │
│  order-procs    │  orders-1        │  1891           │
│  order-procs    │  orders-2        │  1456           │
└─────────────────┴──────────────────┴─────────────────┘
Commit strategies:
Strategy       Code                      Risk
Auto commit    enable.auto.commit=true   At-least-once, possible duplicates
Sync commit    consumer.commitSync()     At-least-once, blocks thread
Async commit   consumer.commitAsync()    At-least-once, no blocking
Manual offset  Commit after processing   Exactly-once possible
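
The classic at-least-once loop commits only after processing succeeds. A sketch assuming enable.auto.commit=false, a subscribed consumer, and that process() is a placeholder for your own logic:

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);            // your business logic (placeholder)
    }
    consumer.commitSync();          // commit only after the whole batch has been processed
}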

ZooKeeper vs KRaft

Kafka is transitioning from ZooKeeper to KRaft (Kafka Raft).

ZooKeeper Mode (Legacy)

┌─────────────────────────────────────────────────────────────────┐
│                    ZooKeeper Ensemble                            │
│     (Stores: broker list, topic config, controller election)    │
└────────────────────────────────────┬────────────────────────────┘

     ┌───────────────────────────────┼───────────────────────────┐
     ▼                               ▼                           ▼
 Broker 1                        Broker 2                    Broker 3
 (Controller)
ZooKeeper stored:
  • Broker registration and liveness
  • Topic and partition metadata
  • Controller election
  • ACLs and quotas

KRaft Mode (New Standard)

┌─────────────────────────────────────────────────────────────────┐
│                    Kafka Controllers                             │
│           (Raft consensus, metadata stored in Kafka)            │
│                                                                  │
│     Controller 1 ◀──▶ Controller 2 ◀──▶ Controller 3            │
└────────────────────────────────────┬────────────────────────────┘

     ┌───────────────────────────────┼───────────────────────────┐
     ▼                               ▼                           ▼
 Broker 1                        Broker 2                    Broker 3
KRaft advantages:
  • Simpler operations: One system instead of two
  • Lower latency: No ZooKeeper round-trips
  • Better scalability: Millions of partitions possible
  • Faster recovery: Metadata in Kafka itself

Interview Deep Dive Questions

Question: Why is Kafka so fast?
Answer: 1) Sequential disk I/O (append-only log), 2) Batching (produce/consume in batches), 3) Zero-copy transfer (sendfile syscall), 4) Page cache utilization (OS caches data), 5) Compression (reduce network/disk I/O), 6) Partitioning (parallel processing across brokers).

Question: What is the ISR, and how does it interact with acks=all?
Answer: ISR (In-Sync Replicas) is the set of replicas caught up with the leader. When acks=all, the producer waits for all ISR replicas to acknowledge. If a replica falls behind (lag > replica.lag.time.max.ms), it is removed from the ISR. min.insync.replicas sets the minimum ISR size for writes to succeed. This balances durability and availability.

Question: Walk through what happens during a consumer group rebalance.
Answer: 1) Consumers send heartbeats to the group coordinator, 2) On membership change, the coordinator triggers a rebalance, 3) All consumers stop consuming (stop-the-world), 4) Consumers send JoinGroup requests, 5) The coordinator elects a leader consumer who runs the assignment strategy, 6) The coordinator sends assignments, 7) Consumers start consuming assigned partitions. CooperativeStickyAssignor enables incremental rebalance.

Question: How does Kafka support exactly-once semantics?
Answer: Requires: 1) Idempotent producer (enable.idempotence=true) - prevents duplicates via PID+sequence, 2) Transactional producer (transactional.id) - atomic writes across partitions, 3) Consumer with isolation.level=read_committed - only sees committed transactions. Used in Kafka Streams for exactly-once stream processing.

Question: What happens when a broker that leads partitions fails?
Answer: 1) The controller detects the failure (missed heartbeats or ZK session loss), 2) For each partition led by the failed broker, the controller elects a new leader from the ISR, 3) The controller updates metadata and broadcasts it to all brokers, 4) Clients refresh metadata and connect to new leaders, 5) Followers catch up from the new leader. If min.insync.replicas is not met, the partition becomes unavailable for writes.

Question: How does Kafka differ from a traditional message queue?
Answer: 1) Log-based vs queue-based (messages not deleted on consumption), 2) Pull model vs push model (consumers control pace), 3) Replayability (seek to any offset), 4) Consumer groups (partitions enable parallel consumption), 5) Retention-based (time/size) vs delivery-based deletion, 6) Higher throughput (millions vs thousands per second), 7) Ordering per partition only.

Tuning Kafka: Key Configurations

Producer Performance

# Batching
batch.size=65536          # 64KB batches
linger.ms=10              # Wait up to 10ms for batch

# Compression
compression.type=lz4       # Fast compression

# Throughput
buffer.memory=67108864     # 64MB buffer
max.block.ms=60000         # Wait if buffer full

Consumer Performance

# Batching
fetch.min.bytes=1048576    # Wait for 1MB before responding
fetch.max.wait.ms=500      # Or 500ms, whichever first
max.poll.records=500       # Records per poll()

# Parallelism
max.partition.fetch.bytes=1048576

Broker Performance

# Networking
num.network.threads=8
num.io.threads=16

# Log
log.segment.bytes=1073741824   # 1GB segments
log.retention.hours=168        # 7 days retention

# Replication
num.replica.fetchers=4
replica.fetch.max.bytes=1048576

Key Takeaways

  1. Kafka is an append-only commit log - simple abstraction, powerful results
  2. Partitions split into segments - enables efficient retention and compaction
  3. Indexes enable fast seeks - sparse offset and timestamp indexes
  4. ISR determines durability - acks=all waits for ISR acknowledgment
  5. Producers batch for throughput - linger.ms and batch.size control this
  6. Consumer groups enable parallelism - partitions assigned to consumers
  7. Rebalancing is expensive - use CooperativeStickyAssignor to minimize impact
  8. KRaft is the future - simpler, faster, scales to millions of partitions

Ready to build streaming applications? Next up: Kafka Streams where we will transform and aggregate event streams in real-time.