
Kafka Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to produce and consume messages, feel free to skip ahead. No judgment.
This chapter reveals the distributed systems magic behind Kafka. We will explore the commit log abstraction, understand how replication guarantees durability, and demystify consumer group coordination. This knowledge is what allows you to tune Kafka for millions of messages per second.

Why Internals Matter

Understanding Kafka internals helps you:
  • Tune for performance when milliseconds matter
  • Debug production issues when messages go missing
  • Design better systems that leverage Kafka’s strengths
  • Ace interviews, where questions about Kafka internals are common
  • Make informed decisions about partitioning and replication

The Fundamental Abstraction: The Commit Log

At its core, Kafka is an append-only, immutable commit log. This simple abstraction is the source of Kafka’s power.
Topic: user-events
Partition 0:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ ...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
  ▲                                            ▲
  │                                            │
Oldest                                      Newest
(may be deleted)                         (append here)
Key properties:
  • Append-only: New messages always go to the end
  • Immutable: Messages never change after writing
  • Ordered: Messages have sequential offsets within a partition
  • Durable: Messages persist to disk (not just memory)
This is fundamentally different from traditional message queues where messages are deleted after consumption.
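
Because the log is retained rather than deleted on read, a consumer can rewind and replay it at any time. A minimal sketch with the Java client, assuming an already-configured KafkaConsumer<String, String> named consumer:

// Attach directly to partition 0 of user-events and rewind to the oldest retained offset
TopicPartition p0 = new TopicPartition("user-events", 0);
consumer.assign(List.of(p0));
consumer.seekToBeginning(List.of(p0));
// Subsequent polls re-read the log from the beginning
consumer.poll(Duration.ofSeconds(1))
        .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));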

Log Segments: How Data is Stored

A partition is not one giant file. It is split into segments for manageability.
Partition directory: /kafka-logs/user-events-0/

├── 00000000000000000000.log      # Segment 1: offsets 0-999
├── 00000000000000000000.index    # Offset index
├── 00000000000000000000.timeindex # Timestamp index
├── 00000000000000001000.log      # Segment 2: offsets 1000-1999
├── 00000000000000001000.index
├── 00000000000000001000.timeindex
├── 00000000000000002000.log      # Active segment
└── ...

Segment Structure

Each segment consists of:
File        Purpose
.log        Actual message data
.index      Maps offset to file position
.timeindex  Maps timestamp to offset
.snapshot   Producer state for idempotency

Segment Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│                      Segment Lifecycle                           │
│                                                                  │
│  Create ──▶ Active ──▶ Rolled ──▶ Eligible for deletion         │
│                │                                                 │
│                │ When:                                           │
│                │ - Size > segment.bytes (1GB default)           │
│                │ - Age > segment.ms (7 days default)            │
│                │ - Index full                                    │
└─────────────────────────────────────────────────────────────────┘

Log Compaction

For topics where only the latest value per key matters:
Before compaction:
┌───────────┬───────────┬───────────┬───────────┬───────────┐
│ Key:A,V:1 │ Key:B,V:1 │ Key:A,V:2 │ Key:A,V:3 │ Key:B,V:2 │
└───────────┴───────────┴───────────┴───────────┴───────────┘

After compaction:
┌───────────┬───────────┐
│ Key:A,V:3 │ Key:B,V:2 │
└───────────┴───────────┘
Use cases:
  • CDC (Change Data Capture) - keep latest row state
  • User profiles - keep latest version
  • Configuration - keep current settings

The Index Files: Fast Lookups

Reading from offset 1,000,000 should not require scanning 1M messages. Indexes solve this.

Offset Index

Offset Index (.index):
┌─────────────────┬──────────────────┐
│ Relative Offset │ Physical Position│
├─────────────────┼──────────────────┤
│       0         │        0         │
│      100        │     102400       │
│      200        │     204800       │
│      ...        │       ...        │
└─────────────────┴──────────────────┘
Finding offset 150:
  1. Binary search index: 100 < 150 < 200
  2. Start at position 102400
  3. Scan forward until offset 150
Indexes are sparse (not every offset) for space efficiency.
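
The lookup is a binary search over the sparse entries followed by a short scan of the .log file. A simplified illustration of the idea in Java (these are not Kafka's actual classes):

// Toy model of a sparse offset index: sorted (relative offset -> file position) pairs
long[] relativeOffsets = {0, 100, 200};
long[] filePositions   = {0, 102_400, 204_800};

long target = 150;
int i = Arrays.binarySearch(relativeOffsets, target);
if (i < 0) i = -i - 2;                                // largest indexed offset <= target (here: 100)
long startPosition = filePositions[Math.max(i, 0)];   // 102400
// The broker then scans the .log file forward from startPosition until it reaches offset 150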

Timestamp Index

Timestamp Index (.timeindex):
┌───────────────┬─────────────────┐
│  Timestamp    │  Relative Offset│
├───────────────┼─────────────────┤
│ 1701590400000 │       0         │
│ 1701590410000 │      100        │
│ 1701590420000 │      200        │
└───────────────┴─────────────────┘
This powers offsetsForTimes() - seek to a specific time.
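
For example, with the Java consumer you can resolve a timestamp to an offset and seek to it. The topic name and time window below are illustrative, and consumer is assumed to be a configured KafkaConsumer<String, String>:

TopicPartition p0 = new TopicPartition("user-events", 0);
consumer.assign(List.of(p0));

// Ask the broker for the earliest offset whose timestamp is >= the given time
long oneHourAgo = System.currentTimeMillis() - 3_600_000;
Map<TopicPartition, OffsetAndTimestamp> result =
        consumer.offsetsForTimes(Map.of(p0, oneHourAgo));

OffsetAndTimestamp found = result.get(p0);
if (found != null) {
    consumer.seek(p0, found.offset());   // next poll() starts roughly one hour back
}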

Replication: The Safety Net

Kafka replicates partitions across brokers for fault tolerance.

Replication Topology

Topic: orders (replication-factor=3)

         ┌──────────────────────────────────────────────────┐
         │                 Partition 0                       │
         ├────────────────────┬──────────────┬──────────────┤
         │    Broker 1        │   Broker 2   │   Broker 3   │
         │    (Leader)        │  (Follower)  │  (Follower)  │
         │    [0,1,2,3,4]     │  [0,1,2,3,4] │  [0,1,2,3]   │
         └────────────────────┴──────────────┴──────────────┘
                                                     ▲
                                              Slightly behind

Leader and Followers

Role       Responsibilities
Leader     Handles all reads and writes for the partition
Followers  Replicate from the leader, ready to take over
Producers always write to the leader, and by default consumers read from it as well. This keeps consistency simple.

In-Sync Replicas (ISR)

The ISR is the set of replicas that are “caught up” with the leader:
ISR = [Broker1, Broker2, Broker3]  # All caught up
ISR = [Broker1, Broker2]           # Broker3 fell behind
ISR = [Broker1]                    # Only leader in sync (dangerous!)
A replica falls out of ISR when:
  • More than replica.lag.time.max.ms behind (default: 30s)
  • Not sending fetch requests

Durability Guarantees: acks

Producers control durability with acks:
acks      Meaning                Durability  Performance
0         Fire and forget        None        Fastest
1         Leader acknowledged    Moderate    Fast
all (-1)  All ISR acknowledged   Highest     Slowest
acks=all with min.insync.replicas=2:

Producer ──▶ Leader (Broker1)

              ├──▶ Follower (Broker2) ──▶ ACK

              └──▶ Follower (Broker3) ──▶ ACK

            ◀──────────────────────────────┘
                    Producer receives ACK
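
On the producer side this is a handful of settings (min.insync.replicas itself is a topic or broker config, not a producer one). A sketch, assuming props is the producer's Properties, producer is built from it, and orderJson is a placeholder payload:

// Wait for all in-sync replicas to acknowledge before a send() is considered successful
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

producer.send(new ProducerRecord<>("orders", "order-42", orderJson), (metadata, exception) -> {
    if (exception != null) {
        // The write did not satisfy acks=all / min.insync.replicas, e.g. NotEnoughReplicasException
        exception.printStackTrace();
    }
});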

Leader Election

When a leader fails:
  1. Controller detects leader failure (via ZooKeeper/KRaft)
  2. Controller selects new leader from ISR
  3. Controller updates metadata
  4. Brokers and clients fetch new metadata
  5. New leader starts serving requests
Before failure:
  ISR = [Broker1 (Leader), Broker2, Broker3]

Broker1 dies:
  ISR = [Broker2, Broker3]
  Controller elects Broker2 as new leader

After election:
  ISR = [Broker2 (Leader), Broker3]
Unclean leader election: If ISR is empty, Kafka can elect a non-ISR replica (data loss risk). Controlled by unclean.leader.election.enable (default: false).

Producer Internals

Understanding producer internals helps you tune for throughput.

Producer Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Kafka Producer                            │
│                                                                  │
│  ┌──────────┐   ┌──────────────┐   ┌────────────────────────┐  │
│  │  Record  │──▶│  Serializer  │──▶│     Partitioner        │  │
│  │  (K,V)   │   │              │   │ (Which partition?)     │  │
│  └──────────┘   └──────────────┘   └───────────┬────────────┘  │
│                                                  │               │
│                                    ┌─────────────▼─────────────┐│
│                                    │    Record Accumulator     ││
│                                    │  ┌─────┐ ┌─────┐ ┌─────┐ ││
│                                    │  │Batch│ │Batch│ │Batch│ ││
│                                    │  │ P0  │ │ P1  │ │ P2  │ ││
│                                    │  └─────┘ └─────┘ └─────┘ ││
│                                    └───────────┬───────────────┘│
│                                                │                │
│                                    ┌───────────▼───────────────┐│
│                                    │      Sender Thread        ││
│                                    │   (Network I/O)           ││
│                                    └───────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Batching: The Performance Trick

Producers accumulate records into batches:
Config         Purpose                     Default
batch.size     Max batch size in bytes     16KB
linger.ms      Wait time for more records  0ms
buffer.memory  Total memory for buffering  32MB
linger.ms=0:   Record arrives ──▶ Send immediately
linger.ms=10:  Record arrives ──▶ Wait 10ms ──▶ Send batch
Larger batches = better throughput, higher latency.
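
In the Java client these map directly onto producer properties. A fragment added to the producer's Properties; the values are illustrative, not recommendations:

// Accumulate up to 32 KB per partition batch and wait up to 5 ms for more records to arrive
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // compressed batches = less network and disk I/O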

Partitioning Strategy

Default partitioner:
  • If key is null: Sticky partitioning (Kafka 2.4+) fills a batch for one partition before moving to the next; older clients used round-robin
  • If key is present: hash(key) % numPartitions (a murmur2 hash of the serialized key)
// Same key always goes to same partition
producer.send(new ProducerRecord<>("orders", "user-123", orderData));
// user-123 orders always in same partition = ordered processing

Idempotent Producer

Enable with enable.idempotence=true:
Without idempotence:
  Producer sends ──▶ Network error ──▶ Retry ──▶ Duplicate message!

With idempotence:
  Producer sends (PID=1, Seq=5) ──▶ Network error ──▶ Retry (PID=1, Seq=5)


                                                   Broker: "Already have Seq=5"
                                                   Deduplicate!
Each producer gets a Producer ID (PID) and sequence numbers per partition.
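
Enabling it is a single setting added to the producer's Properties (it requires acks=all, which recent clients apply automatically when idempotence is on):

// Broker deduplicates retried batches using the producer's PID + per-partition sequence numbers
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");   // required by idempotence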

Consumer Internals

Consumer groups are Kafka’s killer feature for parallel processing.

Consumer Group Coordination

Consumer Group: order-processors

┌─────────────────────────────────────────────────────────────────┐
│                    Group Coordinator                             │
│                    (Broker 2)                                    │
│                                                                  │
│   Manages:                                                       │
│   - Group membership                                             │
│   - Partition assignment                                         │
│   - Offset commits                                               │
└─────────────────────────────────────────────────────────────────┘

           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      Consumer 1      Consumer 2      Consumer 3
      [P0, P1]        [P2, P3]        [P4, P5]

The Rebalance Protocol

When group membership changes (consumer joins/leaves), partitions are reassigned:
1. Consumer sends JoinGroup request
2. Coordinator waits for all consumers (session.timeout.ms)
3. Coordinator selects a leader consumer
4. Leader runs partition assignment algorithm
5. Coordinator sends assignments to all consumers
6. Consumers start fetching from assigned partitions
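
Applications can hook into these steps with a ConsumerRebalanceListener, for example to commit progress before partitions are taken away. The topic name is illustrative and consumer is a configured KafkaConsumer<String, String>:

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before partitions move to another consumer: commit what we have processed
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        System.out.println("Assigned: " + partitions);
    }
});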

Partition Assignment Strategies

Strategy                   Behavior
RangeAssignor              Assign contiguous partitions per topic
RoundRobinAssignor         Distribute partitions evenly across consumers
StickyAssignor             Minimize reassignments during rebalance
CooperativeStickyAssignor  Incremental rebalance (no stop-the-world)
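
The strategy is a consumer-side setting. A minimal sketch of opting into cooperative (incremental) rebalancing, added to the consumer's Properties:

// Only the partitions that actually move are revoked; the rest keep being consumed during the rebalance
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());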

Offset Management

Consumers track progress via offsets:
__consumer_offsets topic (internal):
┌─────────────────┬──────────────────┬─────────────────┐
│  Group          │  Topic-Partition │  Committed      │
├─────────────────┼──────────────────┼─────────────────┤
│  order-procs    │  orders-0        │  1523           │
│  order-procs    │  orders-1        │  1891           │
│  order-procs    │  orders-2        │  1456           │
└─────────────────┴──────────────────┴─────────────────┘
Commit strategies:
Strategy       Code                      Risk
Auto commit    enable.auto.commit=true   At-least-once, possible duplicates
Sync commit    consumer.commitSync()     At-least-once, blocks thread
Async commit   consumer.commitAsync()    At-least-once, no blocking
Manual offset  Commit after processing   Exactly-once possible
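
The classic at-least-once loop commits only after processing succeeds. A sketch assuming enable.auto.commit=false, a subscribed consumer, and that process() is a placeholder for your own logic:

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);            // your business logic (placeholder)
    }
    consumer.commitSync();          // commit only after the whole batch has been processed
}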

ZooKeeper vs KRaft

Kafka is transitioning from ZooKeeper to KRaft (Kafka Raft).

ZooKeeper Mode (Legacy)

┌─────────────────────────────────────────────────────────────────┐
│                    ZooKeeper Ensemble                            │
│     (Stores: broker list, topic config, controller election)    │
└────────────────────────────────────┬────────────────────────────┘

     ┌───────────────────────────────┼───────────────────────────┐
     ▼                               ▼                           ▼
 Broker 1                        Broker 2                    Broker 3
 (Controller)
ZooKeeper stored:
  • Broker registration and liveness
  • Topic and partition metadata
  • Controller election
  • ACLs and quotas

KRaft Mode (New Standard)

┌─────────────────────────────────────────────────────────────────┐
│                    Kafka Controllers                             │
│           (Raft consensus, metadata stored in Kafka)            │
│                                                                  │
│     Controller 1 ◀──▶ Controller 2 ◀──▶ Controller 3            │
└────────────────────────────────────┬────────────────────────────┘

     ┌───────────────────────────────┼───────────────────────────┐
     ▼                               ▼                           ▼
 Broker 1                        Broker 2                    Broker 3
KRaft advantages:
  • Simpler operations: One system instead of two
  • Lower latency: No ZooKeeper round-trips
  • Better scalability: Millions of partitions possible
  • Faster recovery: Metadata in Kafka itself

Interview Deep Dive Questions

Question: Why is Kafka so fast?
Answer: 1) Sequential disk I/O (append-only log), 2) Batching (produce/consume in batches), 3) Zero-copy transfer (sendfile syscall), 4) Page cache utilization (OS caches data), 5) Compression (reduce network/disk I/O), 6) Partitioning (parallel processing across brokers).

Question: What is the ISR, and how does it interact with acks=all?
Answer: ISR (In-Sync Replicas) is the set of replicas caught up with the leader. When acks=all, the producer waits for all ISR replicas to acknowledge. If a replica falls behind (lag > replica.lag.time.max.ms), it is removed from the ISR. min.insync.replicas sets the minimum ISR size for writes to succeed. This balances durability and availability.

Question: Walk through what happens during a consumer group rebalance.
Answer: 1) Consumers send heartbeats to the group coordinator, 2) On membership change, the coordinator triggers a rebalance, 3) All consumers stop consuming (stop-the-world), 4) Consumers send JoinGroup requests, 5) The coordinator elects a leader consumer who runs the assignment strategy, 6) The coordinator sends assignments, 7) Consumers start consuming assigned partitions. CooperativeStickyAssignor enables incremental rebalance.

Question: How does Kafka support exactly-once semantics?
Answer: Requires: 1) Idempotent producer (enable.idempotence=true) - prevents duplicates via PID+sequence, 2) Transactional producer (transactional.id) - atomic writes across partitions, 3) Consumer with isolation.level=read_committed - only sees committed transactions. Used in Kafka Streams for exactly-once stream processing.

Question: What happens when a broker that leads partitions fails?
Answer: 1) The controller detects the failure (missed heartbeats or ZK session loss), 2) For each partition led by the failed broker, the controller elects a new leader from the ISR, 3) The controller updates metadata and broadcasts it to all brokers, 4) Clients refresh metadata and connect to new leaders, 5) Followers catch up from the new leader. If min.insync.replicas is not met, the partition becomes unavailable for writes.

Question: How does Kafka differ from a traditional message queue?
Answer: 1) Log-based vs queue-based (messages not deleted on consumption), 2) Pull model vs push model (consumers control pace), 3) Replayability (seek to any offset), 4) Consumer groups (partitions enable parallel consumption), 5) Retention-based (time/size) vs delivery-based deletion, 6) Higher throughput (millions vs thousands per second), 7) Ordering per partition only.

Tuning Kafka: Key Configurations

Producer Performance

# Batching
batch.size=65536          # 64KB batches
linger.ms=10              # Wait up to 10ms for batch

# Compression
compression.type=lz4       # Fast compression

# Throughput
buffer.memory=67108864     # 64MB buffer
max.block.ms=60000         # Wait if buffer full

Consumer Performance

# Batching
fetch.min.bytes=1048576    # Wait for 1MB before responding
fetch.max.wait.ms=500      # Or 500ms, whichever first
max.poll.records=500       # Records per poll()

# Parallelism
max.partition.fetch.bytes=1048576

Broker Performance

# Networking
num.network.threads=8
num.io.threads=16

# Log
log.segment.bytes=1073741824   # 1GB segments
log.retention.hours=168        # 7 days retention

# Replication
num.replica.fetchers=4
replica.fetch.max.bytes=1048576

Key Takeaways

  1. Kafka is an append-only commit log - simple abstraction, powerful results
  2. Partitions split into segments - enables efficient retention and compaction
  3. Indexes enable fast seeks - sparse offset and timestamp indexes
  4. ISR determines durability - acks=all waits for ISR acknowledgment
  5. Producers batch for throughput - linger.ms and batch.size control this
  6. Consumer groups enable parallelism - partitions assigned to consumers
  7. Rebalancing is expensive - use CooperativeStickyAssignor to minimize impact
  8. KRaft is the future - simpler, faster, scales to millions of partitions

Ready to build streaming applications? Next up: Kafka Streams where we will transform and aggregate event streams in real-time.