Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Kafka Operations

Learn to deploy, manage, and secure Kafka clusters in production environments. Running Kafka in production is like running a commercial airline: the mechanics of flight are well understood, but operational discipline — monitoring, maintenance, capacity planning, and incident response — is what separates a smooth operation from a disaster.

Cluster Architecture

A production Kafka cluster typically consists of:
  • 3+ Brokers: For availability and replication.
  • 3+ Zookeeper Nodes (or KRaft controllers): For cluster coordination.

Key Configurations (server.properties)

PropertyDescriptionRecommendedWhy
broker.idUnique integer IDUnique per brokerIdentifies the broker in metadata. Cannot be changed after data is written.
log.dirsWhere data is storedSeparate disk/mountDedicated disks avoid I/O contention with OS or application logs. Use multiple dirs for JBOD.
num.partitionsDefault partitions for auto-created topics3Only matters if auto.create.topics.enable=true (which it should not be in production).
default.replication.factorDefault replication for auto-created topics3RF=3 survives one broker failure with no data loss.
min.insync.replicasMin replicas that must ack a write2With RF=3, this ensures writes survive one broker failure. Set to 1 and you risk silent data loss.
auto.create.topics.enableCreate topics on first produce/consumefalse (Production)A typo in a topic name silently creates a new topic. Debugging “where did my messages go?” at 3 AM is not fun.

High Availability & Replication

Replication Factor

  • RF=3: Standard for production. Allows 1 broker failure with no data loss.
  • Min ISR=2: Ensures data is written to at least 2 brokers before ack.

In-Sync Replicas (ISR)

Replicas that are caught up with the leader.
  • If a follower falls too far behind, it’s removed from ISR.
  • If leader fails, only a member of ISR can become new leader.

Monitoring & Observability

Key Metrics (JMX)

Under Replicated Partitions

Critical: Should be 0. >0 means data isn’t fully replicated.

Active Controller Count

Critical: Should be 1 per cluster.

Offline Partitions

Critical: >0 means data is unavailable.

Request Latency

Time to process produce/fetch requests.

Consumer Lag

The most important metric for consumers.
  • Lag: Difference between latest offset and consumer offset.
  • Monitoring: Use kafka-consumer-groups.sh or tools like Prometheus/Grafana.
# Check lag
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

Security

1. Encryption (SSL/TLS)

Encrypts data in transit between clients and brokers, and between brokers. Without TLS, every message is sent in plain text over the network — including any sensitive data your applications produce. In regulated environments (PCI, HIPAA, SOC2), encryption in transit is mandatory.
# server.properties -- TLS configuration for broker-to-client encryption.
# In production, also enable inter-broker TLS with a separate listener.
listeners=SSL://:9093
# JKS keystore containing the broker's private key and certificate.
# Generate with: keytool -genkey -keyalg RSA -keystore server.keystore.jks
ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234      # Use a secret manager in production, not plaintext
ssl.key.password=test1234           # Password for the private key within the keystore
# Truststore contains the CA certificate(s) used to verify client certificates.
# Both brokers and clients need a truststore to verify each other's identity.
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234
Production gotcha: TLS adds CPU overhead for encryption/decryption. On high-throughput clusters (100K+ messages/sec), this can increase produce latency by 20-40%. Benchmark with and without TLS to size your brokers appropriately. Modern CPUs with AES-NI hardware acceleration minimize this impact.

2. Authentication (SASL)

Verifies identity of clients.
  • SASL/PLAIN: Username/Password.
  • SASL/SCRAM: Salted Challenge Response (More secure).
  • SASL/GSSAPI (Kerberos): Enterprise integration.

3. Authorization (ACLs)

Controls what authenticated users can do. ACLs follow the principle of least privilege: deny everything by default, then grant specific permissions to specific principals.
# Allow 'alice' to write to 'finance-topic'.
# Without this ACL, alice gets "TopicAuthorizationException" on produce.
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
    --allow-principal User:alice \
    --operation Write \
    --topic finance-topic

# Allow the 'analytics' consumer group to read from all topics prefixed with 'events-'
# The --resource-pattern-type prefixed flag is powerful for topic naming conventions.
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
    --allow-principal User:analytics-service \
    --operation Read \
    --topic events- \
    --resource-pattern-type prefixed \
    --group analytics-group

Troubleshooting

Kafka troubleshooting follows a systematic pattern: check the obvious first (disk, network, logs), then move to cluster-level health (ISR, controller), then application-level issues (consumer lag, producer errors).

Broker Failure

  1. Check Logs: /var/log/kafka/server.log. Look for FATAL or ERROR entries. The most common causes are disk failures, OOM kills, and GC pauses.
  2. Check Disk: Is disk full? Kafka stops accepting writes when disk is full. This is the single most common cause of production Kafka outages. Run df -h on broker hosts and set alerts at 80% capacity.
  3. Check ZooKeeper/KRaft Connectivity: Is the broker connected to the coordination layer? Run echo ruok | nc zk-host 2181 for ZooKeeper, or check KRaft controller logs for lost quorum.
  4. Check Under-Replicated Partitions: kafka-topics.sh --describe --under-replicated-partitions. If a broker is slow (not dead), you will see its partitions falling out of ISR before the broker actually fails.

Consumer Issues

  • Stuck Consumer (Rebalance Storm): Check if rebalancing is happening constantly (“stop-the-world”). Common causes: GC pauses exceeding session.timeout.ms, slow processing exceeding max.poll.interval.ms, or a consumer that keeps crashing and rejoining. Check kafka-consumer-groups.sh --describe and look for rapidly changing partition assignments.
  • High Lag: Consumer is too slow. Diagnose: is the bottleneck CPU (processing), I/O (database writes), or network (large messages)? Solutions ranked by effort: (1) Increase max.poll.records and batch database writes, (2) Optimize processing logic, (3) Add more partitions and consumer instances, (4) Consider async processing with a local buffer.
  • Duplicate Processing: Usually caused by offsets being committed too late (consumer crashes after processing but before committing). Make your processing idempotent (e.g., upsert instead of insert, deduplicate by message ID).

Data Loss Scenarios

  • Unclean Leader Election: If all ISRs fail, a non-ISR replica becomes leader (data loss). Config: unclean.leader.election.enable=false (Default). This means the partition becomes unavailable rather than losing data — availability is sacrificed for durability.
  • Producer acks=1: Leader accepts write but crashes before replicating. Config: acks=all. This is the most common misconfiguration that leads to data loss in production.
  • min.insync.replicas=1 with RF=2: Only one copy exists after a single broker failure. Config: RF=3, min.insync.replicas=2. This is the minimum safe configuration for production data.

Maintenance Tasks

Rebalancing Partitions

When adding new brokers, partitions don’t automatically move. Use kafka-reassign-partitions.sh.

Log Compaction

For topics where only the latest value for a key matters (e.g., user profile updates).
  • Config: cleanup.policy=compact.

Log Compaction Deep Dive

Log compaction ensures Kafka retains the latest value for each key, rather than retaining by time.

How It Works

Before Compaction:
offset: 0  1  2  3  4  5  6  7
key:    A  B  A  C  B  A  C  B
value:  v1 v1 v2 v1 v2 v3 v2 v3

After Compaction:
offset: 5  6  7
key:    A  C  B
value:  v3 v2 v3

Configuration

# Topic-level
cleanup.policy=compact
min.cleanable.dirty.ratio=0.5   # Compact when 50% is "dirty"
delete.retention.ms=86400000    # Keep tombstones for 24h

Tombstones (Deletes)

To delete a key, produce a message with null value:
producer.send(new ProducerRecord<>("topic", "key-to-delete", null));
The tombstone is kept for delete.retention.ms, then removed during compaction.
Interview Tip: Compacted topics are ideal for:
  • Changelogs (database CDC)
  • Configuration/feature flags
  • User profile caches
  • KTable backing in Kafka Streams

Retention Strategies

Time-Based Retention

retention.ms=604800000          # 7 days
retention.bytes=-1               # Unlimited size

Size-Based Retention

retention.ms=-1                  # Unlimited time
retention.bytes=1073741824       # 1GB per partition

Combined

retention.ms=604800000           # 7 days OR
retention.bytes=1073741824       # 1GB, whichever comes first

Disaster Recovery

Multi-Datacenter Replication

ToolDescription
MirrorMaker 2Kafka-native, replicates topics between clusters
Confluent ReplicatorCommercial, more features
Cluster LinkingConfluent Cloud, byte-for-byte replication

MirrorMaker 2 Architecture

Backup Strategies

  1. Regular Backups: Use tools like Kafka Connect to S3
  2. Topic Mirroring: MirrorMaker 2 to standby cluster
  3. Retention Policy: Set high retention for critical topics

Performance Tuning

Broker Tuning

# Socket buffers: increase for high-throughput, cross-datacenter replication.
# Default is OS-dependent (~128KB). Increase to match your network bandwidth.
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400

# Network threads handle reading requests from the socket.
# I/O threads handle writing to disk and replication.
# Rule of thumb: num.network.threads = number of NICs or cores (whichever is smaller).
# num.io.threads = number of disks * 2.
num.io.threads=8
num.network.threads=3

# Log flush settings: how often Kafka flushes data from OS page cache to disk.
# WARNING: Setting these too aggressively hurts throughput. Kafka relies on
# replication (not fsync) for durability. In most production setups, leave
# these at defaults and let the OS manage flushing.
log.flush.interval.messages=10000   # Flush after N messages (default: Long.MAX_VALUE)
log.flush.interval.ms=1000          # Flush every N ms (default: Long.MAX_VALUE)

Producer Tuning

# Batching: the #1 lever for throughput. Larger batches = fewer network round trips.
batch.size=16384           # 16KB max batch size per partition
linger.ms=5                # Wait up to 5ms to fill the batch before sending.
                           # 0 = send immediately (lowest latency, worst throughput).
                           # 5-20ms is a good tradeoff for most workloads.

# Compression: reduces network and disk I/O at the cost of CPU.
# lz4 = best compression speed. zstd = best compression ratio. snappy = balanced.
compression.type=lz4

# Buffer memory: total memory available for batching across all partitions.
# If this fills up, producer.send() blocks for max.block.ms then throws.
buffer.memory=33554432     # 32MB. Increase for high-throughput producers.

Consumer Tuning

# Fetch settings: control how aggressively the consumer pulls data.
fetch.min.bytes=1024       # Wait for 1KB of data before returning a fetch response.
                           # Higher = better throughput, higher latency.
fetch.max.wait.ms=500      # Maximum wait time for fetch.min.bytes to be satisfied.
                           # fetch completes when EITHER condition is met.
max.partition.fetch.bytes=1048576  # 1MB per partition per fetch. Increase for
                                   # large messages or high-throughput topics.

Interview Questions & Answers

Key Metrics to Monitor:
  1. Under-replicated partitions: Should be 0
  2. Active Controller Count: Should be 1
  3. Offline Partitions: Should be 0
  4. Request Latency: Produce/Fetch p99
  5. Consumer Lag: Per consumer group
Tools:
  • JMX + Prometheus + Grafana
  • Confluent Control Center
  • Burrow (consumer lag)
  • Cruise Control (auto-rebalancing)
  1. Configure new broker with unique broker.id
  2. Point to same ZooKeeper/KRaft controllers
  3. Start the broker
  4. Partitions don’t auto-move! Use kafka-reassign-partitions.sh
# Generate reassignment plan
kafka-reassign-partitions.sh --generate \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4"

# Execute reassignment
kafka-reassign-partitions.sh --execute \
  --reassignment-json-file plan.json
Log compaction keeps only the latest value per key, not by time.Use Cases:
  • Database changelog (CDC)
  • User profile cache
  • Configuration store
  • KTable backing store
Config: cleanup.policy=compact
  1. Automatic: Controller elects new leaders from ISR
  2. Check Under-Replicated Partitions: If > 0, replication is catching up
  3. Bring broker back: It will rejoin and catch up
  4. If disk failed: May need to reassign partitions
Prevent data loss:
  • RF=3, min.insync.replicas=2
  • acks=all on producers
  • unclean.leader.election.enable=false
Rolling Upgrade Process:
  1. Set inter.broker.protocol.version to current version
  2. Upgrade brokers one at a time (preferred replicas first)
  3. After all brokers upgraded, bump protocol version
  4. Upgrade clients (consumers first, then producers)
Key: Never skip versions. 2.x → 3.0 → 3.5, not 2.x → 3.5
An open-source tool for Kafka cluster management:
  • Auto-rebalancing: Moves partitions for even load
  • Self-healing: Handles broker failures
  • Goal-based optimization: CPU, disk, network balance
Reduces operational burden of managing large clusters.

Common Pitfalls

1. Disk Full: Kafka stops accepting writes. Monitor disk usage and set retention appropriately.2. Under-Replicated Partitions: Indicates slow broker or network issues. Don’t ignore!3. Skipping Versions: Can cause data corruption. Always follow upgrade path.4. No Monitoring: Without metrics, you’re flying blind. Set up alerting on key metrics.5. Manual Partition Management: Use Cruise Control for large clusters to avoid imbalanced load.

Interview Deep-Dive

Strong Answer:
  • First, I identify which topics are consuming the most disk. I check per-topic sizes with kafka-log-dirs.sh --describe --bootstrap-server localhost:9092. This shows the size of each topic-partition across brokers. Often, one or two topics dominate disk usage (e.g., a high-volume logging topic or a topic with large message payloads).
  • Second, I check if retention is working correctly. I verify the topic-level configuration: kafka-configs.sh --describe --topic big-topic --bootstrap-server localhost:9092. If retention.ms and retention.bytes are set at the topic level, they override broker defaults. I also check cleanup.policy — if it is set to compact instead of delete, old segments are not deleted by time, they are compacted by key.
  • Third, I check segment configuration. The active segment is never deleted by retention policy. If segment.bytes is set to 1GB (default) and a partition receives only 100MB/day, a segment takes 10 days to fill and roll. Until it rolls, it is immune to retention cleanup. The fix is to reduce segment.ms (e.g., to 1 hour) so segments roll by time even if they have not reached the size limit.
  • Fourth, I check for partition imbalance. If some brokers have more partitions than others, disk usage is uneven. Kafka does not auto-rebalance partitions when new brokers are added. I use kafka-reassign-partitions.sh or Cruise Control to rebalance.
  • Immediate fixes: reduce retention for the largest topics (if the business allows it), enable compression at the producer level (compression.type=lz4 or zstd can reduce disk usage 50-80%), and add disk capacity if needed.
Follow-up: The team wants to keep 30 days of retention for compliance, but disk costs are too high. What options do you have?Three options. First, tiered storage (Kafka 3.6+): automatically moves older segments to cheap object storage (S3, GCS) while keeping recent data on local disk. Consumers transparently read from either tier. This is the modern answer for long retention periods. Second, compression: if producers are not already compressing, switching to zstd can reduce storage by 60-80% for JSON/text payloads. Third, move cold data to a separate Kafka cluster with cheaper storage (larger, slower disks) using MirrorMaker 2, and set a shorter retention on the primary cluster. The compliance team reads from the archive cluster for audits.
Strong Answer:
  • Kafka supports rolling upgrades: you upgrade brokers one at a time while the cluster continues serving traffic. The key constraint is the inter-broker protocol version — brokers must be able to communicate with each other during the transition.
  • Step 1: Before touching any broker, set inter.broker.protocol.version and log.message.format.version to the current version (3.3) in the configuration files. This ensures that even after the binary is upgraded, the broker speaks the old protocol to communicate with not-yet-upgraded brokers.
  • Step 2: Upgrade brokers one at a time. For each broker: stop it cleanly (kill -TERM, not kill -9), replace the binary with 3.6, start it with the updated binary but the same old protocol version. Wait for the broker to rejoin the cluster and all its replicas to be in-sync (check under-replicated partitions = 0) before moving to the next broker.
  • Step 3: Once all brokers are running the 3.6 binary, update inter.broker.protocol.version to 3.6 and do another rolling restart. This enables the new protocol features.
  • Step 4: Upgrade clients. Consumers first (they are more tolerant of broker version changes), then producers. Kafka maintains backward compatibility, so old clients work with new brokers, but new client features require new broker protocol.
  • Risks: the biggest risk is upgrading too fast — if you start the next broker before the previous one has fully rejoined and caught up, you temporarily reduce the number of in-sync replicas, which can cause write failures if min.insync.replicas is not met. Never skip versions (do not go from 3.0 to 3.6 directly; go 3.0 to 3.3 to 3.6) because protocol changes accumulate.
Follow-up: During the rolling upgrade, one broker fails to start on the new version. How do you handle the situation?I check the broker logs for the startup failure (common causes: incompatible configuration, Java version mismatch, corrupt log segments). If the fix is quick (configuration issue), I fix it and restart. If the fix requires investigation, I roll back that single broker to the old version. Since the old protocol version is still configured, the rolled-back broker communicates fine with the already-upgraded brokers. The cluster operates with a mix of versions, which is the supported state during an upgrade. I fix the issue on a test environment, then retry the upgrade for that broker. The key rule: never leave a broker down during an upgrade if you can restart it on the old version. A running broker on the old version is better than a dead broker on the new version.
Strong Answer:
  • Uneven partition leadership is common and happens organically: when brokers restart, preferred replica election may not run, and new topics may be created when some brokers are temporarily down. This causes hot spots where some brokers are overloaded with leader responsibilities (handling all reads and writes for those partitions) while others are idle.
  • First, I check if running a preferred leader election fixes the issue: kafka-leader-election.sh --election-type preferred --all-topic-partitions --bootstrap-server localhost:9092. Kafka assigns a “preferred replica” for each partition (typically the first broker in the replica list). If a failover occurred and the preferred replica is healthy again, this command restores leadership to it. This is lightweight and usually resolves 80% of imbalance.
  • If the replica assignment itself is skewed (not just the leadership), I use Cruise Control or kafka-reassign-partitions.sh to physically move partition replicas between brokers. Cruise Control is preferred for large clusters because it optimizes for disk, CPU, and network balance simultaneously, and it throttles data movement to avoid impacting production traffic.
  • Risks of partition reassignment: the broker physically copies data from the current replicas to the new replicas. For a partition with 50GB of data, this is 50GB of network and disk I/O. Moving many partitions simultaneously can saturate the broker’s network and disk, causing increased latency for production traffic. I always set a replication throttle (kafka-reassign-partitions.sh --throttle 50000000 for 50MB/s) and run reassignments during low-traffic periods.
  • Monitoring during rebalance: watch UnderReplicatedPartitions (should be non-zero during data movement but trend toward zero), broker CPU and disk I/O, and consumer lag (should not spike significantly).
Follow-up: How does Cruise Control determine the “ideal” partition distribution?Cruise Control builds a model of the cluster based on real-time metrics (CPU usage, disk usage, network throughput, leadership count per broker). It defines “goals” in priority order (e.g., rack awareness first, then disk capacity balance, then CPU balance, then network balance). The optimizer generates a partition reassignment plan that satisfies as many goals as possible, starting from the highest priority. You can customize goals and their priorities. For example, in a multi-AZ deployment, rack awareness (ensuring replicas span AZs) is non-negotiable, while CPU balance is best-effort. Cruise Control can also self-heal: when a broker fails, it can automatically move partitions to restore balance without human intervention.

Congratulations! You have completed the Kafka Crash Course. Next: Kubernetes Crash Course →