Synchronous communication works for simple request-response scenarios, but microservices often need asynchronous patterns to achieve loose coupling, resilience, and scalability.

Think of async communication like the postal system versus a phone call. With a phone call (sync), both parties must be available simultaneously and the caller waits for the response. With mail (async), you drop a letter in the mailbox and go about your day: the recipient processes it whenever they are ready, and if the post office is overwhelmed, letters queue up rather than callers getting a busy signal. The trade-off is latency for resilience: you cannot get an immediate reply, but neither party needs to be available at the same moment, and a spike in mail volume does not crash the post office.
Learning Objectives:
Understand when to use async over sync communication
Implement message queues with RabbitMQ
Build event-driven systems with Apache Kafka
Handle message ordering, deduplication, and dead letters
Before we dive into the mechanics of message brokers, let’s make sure we deeply understand why async exists at all. In a synchronous architecture, service A calls service B and waits for a response before continuing. Every service in the chain becomes a liability: if any one of them is slow, the whole chain is slow; if any one of them is down, the whole chain is down. This coupling is the silent killer of microservices: teams start splitting the monolith to get independence, then accidentally rebuild the monolith with HTTP cables instead of function calls.

Async flips the dependency. Instead of “I call you and wait,” it’s “I drop a message and trust someone will process it.” The producer doesn’t need to know who the consumers are, how many there are, or whether they’re currently up. That decoupling is the single most important architectural property async gives you; everything else (better throughput, natural retries, elastic scaling) is downstream of it.

The tradeoff is complexity. You now have to reason about eventual consistency (“when will this actually happen?”), message ordering, duplicate delivery, and broker operations. Many teams underestimate this and reach for async too early, before they actually need it. A good rule: stay synchronous until the coupling pain becomes concrete (cascading failures, head-of-line blocking, inability to scale one service independently). Then pay the async tax deliberately.
SYNCHRONOUS COMMUNICATION
─────────────────────────────────────────────────────────────────────────────

  Service A                    Service B                    Service C
      │                            │                            │
      │──────── Request ──────────▶│                            │
      │                            │──────── Request ──────────▶│
      │                            │                            │
      │                            │◀─────── Response ──────────│
      │◀─────── Response ──────────│                            │
      │                            │                            │

  ⚠️ Problems:
  • A must wait for B and C (high latency)
  • If B fails, A fails
  • Tight coupling between services
  • Scaling issues (chain bottleneck)

─────────────────────────────────────────────────────────────────────────────

ASYNCHRONOUS COMMUNICATION
─────────────────────────────────────────────────────────────────────────────

  Service A             Message Broker                Service B     Service C
      │                        │                          │             │
      │─── Publish Event ─────▶│                          │             │
      │◀── Acknowledgment ─────│                          │             │
      │                        │                          │             │
      │                        │─────── Event ───────────▶│             │
      │                        │─────── Event ─────────────────────────▶│
      │                        │                          │             │
      │                        │◀────── Ack ──────────────│             │
      │                        │◀────── Ack ────────────────────────────│

  ✅ Benefits:
  • A doesn't wait (immediate response)
  • B and C process independently
  • B and C can fail and retry
  • Loose coupling, better scaling
There are two foundational patterns you need to internalize before writing any broker code: point-to-point (queue) and publish-subscribe (topic). The difference is not about which broker technology you use. RabbitMQ, Kafka, and Azure Service Bus can express both, and SQS covers fan-out when paired with SNS. It’s about the delivery semantics you want. Choosing the wrong pattern is one of the most common and expensive mistakes in event-driven systems: if you use a queue where you needed a topic, new consumers will silently miss events; if you use a topic where you needed a queue, the same work gets done multiple times.

The mental shortcut: ask “is this work or is this news?” If it’s work (send this email, charge this card, resize this image), you want point-to-point, where exactly one worker should do the job. If it’s news (an order was placed, a user signed up), you want pub/sub, where every interested party should hear about it.
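The difference in delivery semantics can be shown without any broker at all. The following is a minimal in-memory sketch, not broker code: the function names are hypothetical, and round-robin dispatch is used as a stand-in for competing consumers on a real queue.

```python
from itertools import cycle


def deliver_point_to_point(messages, workers):
    """Queue semantics ("work"): each message is handed to exactly one worker."""
    received = {worker: [] for worker in workers}
    next_worker = cycle(workers)  # round-robin stands in for competing consumers
    for message in messages:
        received[next(next_worker)].append(message)
    return received


def deliver_pub_sub(messages, subscribers):
    """Topic semantics ("news"): every subscriber gets its own copy of every message."""
    return {subscriber: list(messages) for subscriber in subscribers}


jobs = ["email-1", "email-2", "email-3", "email-4"]
work = deliver_point_to_point(jobs, ["worker-a", "worker-b"])
news = deliver_pub_sub(["order.placed#1", "order.placed#2"], ["analytics", "notifications"])
```

With two workers on the queue, each email job is done once; with two subscribers on the topic, both see both events. Putting analytics and notifications behind `deliver_point_to_point` would reproduce the "each team sees about half the events" failure.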
Caveats & Common Pitfalls: Queue vs topic selection
Using a topic where you needed a queue. Three instances of the same service all subscribe to order.placed and each sends the confirmation email. Customer gets three emails, you get three charges on your email provider, and the bug looks like “random duplicates” because scaling the consumer made it worse.
Using a queue where you needed a topic. Analytics and notifications both read from one queue; whichever pulls the message first wins and the other never sees it. Events go missing depending on race conditions.
Forgetting that “consumer” and “service instance” are not the same. In Kafka, a consumer group with N instances is one logical consumer. In RabbitMQ, N processes reading the same queue is one logical consumer. In RabbitMQ, N processes bound to the same exchange with their own queues is N consumers. Getting this distinction wrong costs weeks.
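Ordering assumptions that do not hold. FIFO is per-queue in RabbitMQ and per-partition in Kafka. Once you scale out, Kafka still preserves order inside each partition but not across partitions, and several RabbitMQ consumers competing on one queue can finish messages out of order even though the queue delivered them in order.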
Solutions & Patterns: “Work vs news” plus partition keys

The decision rule is the “work or news” test. Work (should happen exactly once across the organization) goes to a queue or a Kafka consumer group. News (everyone interested should hear about it) goes to a topic with one consumer group per consumer domain. If you want both (one team’s work AND broadcast to many teams), use Kafka with one consumer group per team: each team behaves like a queue consumer within its group, and the topic broadcasts across groups.

Decision rule for ordering: ordering requirements define partitioning. If all events for order 123 must be processed in order, the partition key is order_id = 123. All events for that order land in one partition, so one consumer processes them serially. Events for different orders can process in parallel across partitions.

Before: notifications team subscribes to order.placed queue. analytics team subscribes to the same queue. Both teams see roughly half the events because they are racing on the same queue.
After: order.placed is a Kafka topic. notifications-svc consumer group reads independently; analytics-svc consumer group reads independently. Each team sees every event and commits its own offsets.
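To make the partition-key rule concrete, here is a small sketch of key-based routing. It is an illustration under stated assumptions: `partition_for` is a hypothetical helper, and CRC32 is used only because it is a stable hash in the standard library (Kafka's Java client hashes keys with murmur2, so real partition numbers will differ).

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash, so equal keys always collide."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-123", "created"),
    ("order-456", "created"),
    ("order-123", "paid"),
    ("order-123", "shipped"),
    ("order-456", "paid"),
]

# Append each event to the partition chosen by its key, preserving publish order.
partitions = {p: [] for p in range(3)}
for key, event_type in events:
    partitions[partition_for(key, 3)].append((key, event_type))
```

Whichever partition `order-123` hashes to, all three of its events sit there in publish order, while `order-456` may be processed in parallel elsewhere.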
Your team just migrated from RabbitMQ to Kafka. A week later, downstream teams complain that some events appear to arrive in the wrong order. Walk through the diagnosis and fix.
Strong Answer Framework:
Clarify the ordering guarantee. Kafka guarantees order per partition, not per topic. If the producer is not specifying a partition key, the partitioner spreads messages across partitions without regard to which entity they belong to, and cross-partition ordering is lost.
Identify the “same entity” dimension. For order events, that is order_id. For user events, that is user_id. All events about the same entity must share a partition key.
Audit the producer: is ProducerRecord being constructed with a key? If the key is null, the default partitioner spreads records across partitions (round-robin in older clients, sticky per-batch assignment since Kafka 2.4), so order is lost by design.
Fix the producer to set the partition key. For in-flight events that are already mis-ordered, you typically accept the one-time disruption or replay from a snapshot.
Document the partition-key strategy in the event catalog so new events get it right from day one.
Real-World Example: LinkedIn, where Kafka was born, publishes a case study on their “Brooklin” cross-cluster replication framework that spent significant design effort on per-entity ordering via partition keys. Pinterest’s 2017 outage writeup “Helix-based master-slave handoff” explicitly cited partition-key misuse as a source of duplicate and out-of-order processing.

Senior Follow-up Questions:
Follow-up 1: “What if two events for the same entity must be ordered, but the entity changes over time (e.g., a user changes their phone number)?”
The partition key should be stable for the life of the ordering requirement. Usually that is a primary identifier like user_id, not a mutable field like phone. If ordering must hold across identity changes, use the earliest stable identifier.

Follow-up 2: “How do you repartition a topic without losing ordering?”
You cannot, cleanly. Repartitioning reshuffles events across partitions and breaks in-flight ordering. The standard playbook is: freeze producers, drain consumers to zero lag, rebuild a new topic with new partitioning, replay history if needed, cut over, resume producers.

Follow-up 3: “What is the cost of using a high-cardinality partition key?”
Key cardinality by itself is nearly free: keys are hashed onto a fixed number of partitions, so a million distinct keys do not create a million partitions. The cost sits in the partition count, because each partition has bookkeeping cost (a file handle, a controller entry, broker memory). Thousands of partitions per broker is fine; tens of thousands starts to hurt. The risk with keys is the opposite one: a low-cardinality or skewed key concentrates traffic on a few partitions. Prefer user_id over session_id when ordering must hold across all of a user’s sessions, since session_id only orders events within one session.
Common Wrong Answers:
“Kafka guarantees ordering by default, so the producer must be broken.” Fails because Kafka only guarantees per-partition ordering; a null key intentionally spreads across partitions.
“Add a single-threaded consumer to preserve order.” Fails because even a single consumer reads from multiple partitions concurrently; the fix is at produce time, not consume time.
Further Reading:
Kafka documentation on partition keys and ordering semantics.
Jay Kreps’s “The Log: What every software engineer should know about real-time data’s unifying abstraction.”
Gwen Shapira’s Kafka: The Definitive Guide, chapter on producers.
RabbitMQ is a “smart broker, dumb consumer” system built on the AMQP protocol. The broker is responsible for routing, retry policies, dead-lettering, and priorities; consumers just receive whatever the broker hands them. This model is excellent when your messaging needs are rich and varied: different queues with different priorities, complex routing rules, per-message TTLs. It’s less ideal when you need to replay past events or process millions of messages per second; that’s Kafka’s turf.

Before we look at the code, understand the four core AMQP concepts: producers publish to exchanges, not directly to queues. An exchange applies bindings (routing rules) to decide which queue(s) a message lands in. Consumers then read from queues. This indirection is what gives RabbitMQ its flexibility: you can swap routing logic without changing producer or consumer code.
A common beginner mistake with RabbitMQ is treating connections like HTTP connections: open one per request, close it when done. Don’t. Opening a TCP connection plus an AMQP handshake is expensive (tens to hundreds of milliseconds), and RabbitMQ has hard limits on concurrent connections per node. The correct pattern is one long-lived connection per service process, with many lightweight channels multiplexed over it. Channels are cheap, thread-local units of work. If you ignore this and open a connection per publish, you’ll exhaust file descriptors, blow through RabbitMQ’s connection limit in production, and wonder why the broker falls over at moderate load.

The code below also enables a heartbeat. Without heartbeats, a network partition can leave your service in a zombie state: the TCP connection still appears open at the OS level, but the broker has already moved on. Heartbeats force both sides to periodically confirm the connection is alive.
Node.js

// config/rabbitmq.js
// This module manages the single RabbitMQ connection for the service.
// One connection per service, multiple channels per connection -- this is the
// recommended pattern from the RabbitMQ team. Creating a new connection per
// request is an anti-pattern that exhausts file descriptors.
const amqp = require('amqplib');

class RabbitMQConnection {
  constructor() {
    this.connection = null;
    this.channel = null;
    this.reconnecting = false; // prevents overlapping reconnect attempts
    this.closing = false;      // set by close() so a deliberate shutdown does not reconnect
  }

  async connect() {
    try {
      this.connection = await amqp.connect({
        hostname: process.env.RABBITMQ_HOST || 'localhost',
        port: Number(process.env.RABBITMQ_PORT) || 5672,
        username: process.env.RABBITMQ_USER || 'guest',
        password: process.env.RABBITMQ_PASS || 'guest',
        vhost: process.env.RABBITMQ_VHOST || '/',
        // Detects dead connections; without this, a network partition can
        // leave your service thinking it is connected when it is not.
        heartbeat: 60
      });
      this.channel = await this.connection.createChannel();

      // Handle connection events
      this.connection.on('error', (err) => {
        console.error('RabbitMQ connection error:', err);
        this.reconnect();
      });
      this.connection.on('close', () => {
        console.log('RabbitMQ connection closed');
        this.reconnect();
      });

      console.log('Connected to RabbitMQ');
      return this.channel;
    } catch (error) {
      console.error('Failed to connect to RabbitMQ:', error);
      throw error;
    }
  }

  async reconnect() {
    // 'error' and 'close' can both fire for one failure; only one loop should run.
    if (this.reconnecting || this.closing) return;
    this.reconnecting = true;
    try {
      for (;;) {
        console.log('Attempting to reconnect to RabbitMQ...');
        await new Promise(resolve => setTimeout(resolve, 5000));
        try {
          await this.connect();
          return;
        } catch (error) {
          // connect() already logged the failure; wait and try again.
        }
      }
    } finally {
      this.reconnecting = false;
    }
  }

  async setupExchanges() {
    // Direct exchange for point-to-point
    await this.channel.assertExchange('direct_exchange', 'direct', { durable: true });
    // Topic exchange for routing patterns
    await this.channel.assertExchange('topic_exchange', 'topic', { durable: true });
    // Fanout exchange for broadcasting
    await this.channel.assertExchange('fanout_exchange', 'fanout', { durable: true });
    // Dead letter exchange -- messages that fail after max retries land here.
    // This is your safety net: monitor this queue and alert on it. Messages in the
    // DLQ represent real business operations that failed and need human attention.
    await this.channel.assertExchange('dlx_exchange', 'direct', { durable: true });
  }

  async close() {
    this.closing = true;
    await this.channel?.close();
    await this.connection?.close();
  }
}

module.exports = new RabbitMQConnection();

Python

# config/rabbitmq.py
# One long-lived connection per process, multiple channels on top.
# We use aio-pika because it's the idiomatic async AMQP client for Python --
# pika (sync) works too, but blocks your event loop if you're running FastAPI
# or any asyncio-based service.
import logging
import os
from typing import Optional

import aio_pika
from aio_pika import ExchangeType
from aio_pika.abc import AbstractRobustChannel, AbstractRobustConnection

logger = logging.getLogger(__name__)


class RabbitMQConnection:
    def __init__(self) -> None:
        self.connection: Optional[AbstractRobustConnection] = None
        self.channel: Optional[AbstractRobustChannel] = None

    async def connect(self) -> AbstractRobustChannel:
        try:
            # connect_robust auto-reconnects on network errors --
            # this replaces the manual reconnect loop you'd write with pika.
            self.connection = await aio_pika.connect_robust(
                host=os.getenv("RABBITMQ_HOST", "localhost"),
                port=int(os.getenv("RABBITMQ_PORT", "5672")),
                login=os.getenv("RABBITMQ_USER", "guest"),
                password=os.getenv("RABBITMQ_PASS", "guest"),
                virtualhost=os.getenv("RABBITMQ_VHOST", "/"),
                heartbeat=60,  # detect half-open connections
            )
            # Publisher confirms (aio-pika's default, made explicit here): the
            # broker ACKs every publish. Without this, a publish can silently
            # fail on broker crash.
            self.channel = await self.connection.channel(publisher_confirms=True)
            # Prefetch: cap how many unacked messages a consumer on this
            # channel can hold at once.
            await self.channel.set_qos(prefetch_count=10)
            self.connection.close_callbacks.add(self._on_close)
            logger.info("Connected to RabbitMQ")
            return self.channel
        except Exception:
            logger.exception("Failed to connect to RabbitMQ")
            raise

    def _on_close(self, connection, exc) -> None:
        logger.warning("RabbitMQ connection closed: %s", exc)

    async def setup_exchanges(self) -> None:
        assert self.channel is not None
        # durable=True means the exchange survives broker restarts.
        await self.channel.declare_exchange(
            "direct_exchange", ExchangeType.DIRECT, durable=True
        )
        await self.channel.declare_exchange(
            "topic_exchange", ExchangeType.TOPIC, durable=True
        )
        await self.channel.declare_exchange(
            "fanout_exchange", ExchangeType.FANOUT, durable=True
        )
        # Dead-letter exchange: monitor this; messages here = humans needed.
        await self.channel.declare_exchange(
            "dlx_exchange", ExchangeType.DIRECT, durable=True
        )

    async def close(self) -> None:
        if self.channel:
            await self.channel.close()
        if self.connection:
            await self.connection.close()


rabbitmq = RabbitMQConnection()
The producer is the easy half of messaging, but it hides a few critical decisions. First, persistent: true (or delivery_mode=2 in Python) writes messages to disk on the broker. Without it, every message sits only in RAM, and a broker restart throws them all away. For any business event you care about, you want persistence. Second, messages are opaque bytes to the broker; you need to set contentType so consumers know how to parse them. Third, always include a messageId and a correlationId: the first lets consumers deduplicate, the second lets you trace a single user action across a dozen services in your logs.

A subtle tradeoff: persistent messages are ~10x slower than transient ones because they involve an fsync. If you’re publishing telemetry where losing a few messages on broker crash is acceptable, turn persistence off and gain throughput. If you’re publishing “payment charged,” pay the cost.
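As a rough illustration of the three producer decisions above, the sketch below assembles a message body and its AMQP properties using only the standard library. `build_message` and the dictionary layout are hypothetical; in a real service these values would be passed to your AMQP client's message constructor rather than kept in a dict.

```python
import json
import uuid
from typing import Any, Optional


def build_message(
    payload: dict[str, Any],
    correlation_id: Optional[str] = None,
    persistent: bool = True,
) -> dict[str, Any]:
    """Return the serialized body plus the properties a consumer relies on."""
    return {
        "body": json.dumps(payload).encode("utf-8"),  # opaque bytes to the broker
        "properties": {
            "content_type": "application/json",       # tells consumers how to parse the body
            "delivery_mode": 2 if persistent else 1,  # AMQP: 2 = persistent, 1 = transient
            "message_id": str(uuid.uuid4()),          # lets consumers deduplicate
            # Reuse the caller's ID so one user action can be traced across services.
            "correlation_id": correlation_id or str(uuid.uuid4()),
        },
    }


payment = build_message({"order_id": "123", "status": "charged"}, correlation_id="req-42")
telemetry = build_message({"cpu": 0.71}, persistent=False)
```

The payment event keeps the default persistent mode, while the telemetry sample opts out, which mirrors the throughput-versus-durability choice described above.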
Consumers are where most of the hard problems live. Three things to internalize before reading the code:

Prefetch (QoS) is load balancing’s secret weapon. By default, RabbitMQ will push as many messages as possible to each consumer. If one consumer is fast and another is slow, they end up with equal queue depths, so the slow one becomes a bottleneck. Setting prefetch to a small number (often just 1-10) forces fair dispatch: the broker stops delivering to a consumer once that many messages are unacknowledged, and resumes only as acks come back. This is the single most impactful tuning knob in RabbitMQ.

Ack timing matters enormously. If you ack before processing and then crash, the message is lost, because the broker thinks you succeeded. If you ack only after processing and crash before the ack is sent, the message is redelivered, possibly to another consumer, which means your handler must be idempotent. The standard pattern is “ack after success, nack on failure” which gives at-least-once semantics. Do not take shortcuts here; this is how companies lose orders.

Retries need backoff, not tight loops. A naive retry just re-queues the failing message immediately, which hammers the broker and the downstream service. Always use exponential backoff with a maximum retry count. After the max is exceeded, send the message to a DLQ where a human can investigate.
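The retry rule above (exponential backoff, a retry cap, then the DLQ) reduces to a small decision function. This is a sketch with hypothetical names and default values; how the delay is actually applied (a delayed exchange, a TTL queue, or a scheduler) depends on your broker setup and is not shown.

```python
from typing import Optional, Tuple


def backoff_delay(failures: int, base_seconds: float = 1.0, cap_seconds: float = 60.0) -> float:
    """Delay before the next attempt: base * 2^(failures - 1), capped at cap_seconds."""
    return min(cap_seconds, base_seconds * (2 ** (failures - 1)))


def handle_failure(failures: int, max_retries: int = 5) -> Tuple[str, Optional[float]]:
    """Decide what to do after a message has failed `failures` times."""
    if failures > max_retries:
        return ("dead_letter", None)  # give up: route to the DLQ for a human
    return ("retry", backoff_delay(failures))


decisions = [handle_failure(n) for n in range(1, 7)]
```

Production systems usually add random jitter to the delay so that many failing messages do not all retry at the same instant; it is left out here to keep the output deterministic.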
Now let’s see what it looks like to actually use these primitives in a real domain. The shape of the code below is the critical pattern: the Order service publishes events describing what happened in its world (an order was created, paid, shipped), and it does not know who listens. The Inventory service subscribes to the events it cares about and reacts. Neither service calls the other directly. If tomorrow you add a Recommendations service that also wants to hear about orders, you don’t touch Order or Inventory; you just subscribe.

This is the payoff of async: you can add new consumers (analytics, fraud detection, email notifications, loyalty points) without modifying any existing service. Contrast that with a synchronous world, where adding a new consumer means the Order service has to learn about it, handle its failures, and potentially slow down to wait for it.
Connection-per-publish. Tens of milliseconds of AMQP handshake per message plus a hard node-level connection limit means moderate load exhausts the broker. Use one long-lived connection per process with many channels.
Unbounded queue growth. When consumers fall behind and you have no x-max-length or TTL, queues grow until memory or disk runs out and the broker refuses new publishes. The correct answer is queue length bounds with overflow to a DLQ, not infinite growth.
Auto-ack equals data loss. autoAck: true acknowledges on delivery, not on successful processing. A consumer crash mid-processing drops the message. Always use manual acks in production.
Publisher confirms off by default. Without publisher confirms, a broker restart between the basic.publish and the broker’s disk fsync silently loses messages. Turn on confirms for any publish that must not be lost.
Solutions & Patterns: Bounded queues, manual acks, publisher confirms

The production-ready pattern: one connection per process, channels pooled per thread. Queues declared with x-max-length (or bytes) and x-overflow = reject-publish-dlx so overflow routes to a DLQ instead of silently dropping. Consumers use manual acks and only ack after the business logic succeeds. Publishers use publisher confirms with a handler that retries on nack. Monitor queue depth, unacked message count, and connection count on every node.

Decision rule for prefetch: set prefetch to roughly the number of concurrent tasks the consumer can handle, not infinity. A prefetch of 1000 on a consumer that processes one at a time means one slow consumer hogs 1000 messages while other consumers starve.

Before: One consumer with default prefetch pulls 10,000 messages from a shared queue. All other consumers sit idle. One crash loses all 10,000.
After: Prefetch of 50 per consumer; manual ack on each message; publisher confirms on the producer side. Work is spread evenly, a crash leaves at most 50 messages in flight, and those messages are redelivered to other consumers rather than lost.
Your RabbitMQ cluster is healthy but one queue has 4 million unacked messages and publishers are being throttled. What is happening and what do you do in the next hour?
Strong Answer Framework:
Identify the pattern: unacked messages are delivered but not yet acknowledged. A huge unacked count means either consumers are stuck processing, consumers are hung with their channel still open (when a consumer’s connection actually drops, the broker requeues its unacked messages), or prefetch is set too high and messages are buffered but not being worked on.
Check consumer health: are the consumer processes running, are they CPU-bound, are they deadlocked on a downstream? rabbitmqctl list_consumers plus service logs.
If consumers are stuck: kill them so messages redeliver, and fix the underlying cause (often a downstream call without a timeout).
If prefetch is too high: reduce it in code and restart. A prefetch of 1 is safest during triage; raise later.
Publishers being throttled means the broker is applying flow control because memory is above the high-watermark. Once consumers drain, flow control releases automatically.
Real-World Example: Instagram’s 2019 feed service degradation, described by their engineering team in internal talks, involved exactly this pattern: prefetch of 10,000 on a consumer that deadlocked on a downstream database query. The fix was to cut prefetch and add a strict timeout on the downstream call.

Senior Follow-up Questions:
Follow-up 1: “What is flow control in RabbitMQ?”
When broker memory crosses a high-watermark (40% of system memory by default in the 3.x series), RabbitMQ pauses publishers by no longer reading from their connections, so TCP backpressure builds and publishers block on the socket. This is intentional backpressure; the broker is protecting itself from OOM. The fix is to drain consumers, not to raise the watermark.

Follow-up 2: “How do you diagnose a stuck consumer?”
Thread dump or CPU flame graph first. If the consumer is in a downstream call, the timeout is too long (or missing). If the consumer is in GC, heap is too small. If the consumer is idle but not pulling messages, check the channel’s prefetch and unacked counts.

Follow-up 3: “Why is prefetch of 1 not always right?”
Because it serializes work when the consumer could handle concurrency. Prefetch should roughly equal the consumer’s internal concurrency limit. A prefetch of 1 with async I/O wastes throughput; a prefetch of 1000 with sync I/O creates head-of-line blocking.
Common Wrong Answers:
“Restart the broker to clear the queue.” Fails because persistent messages survive restart and the root cause is not addressed.
“Purge the queue.” Fails because it loses real user work and does not diagnose the consumer problem.
Further Reading:
RabbitMQ docs on consumer prefetch and flow control.
“Reliable Messaging with RabbitMQ” by Alvaro Videla and Jason Williams.
CloudAMQP’s blog on diagnosing broker-side throttling.
Kafka excels at high-throughput, ordered event streaming. Where RabbitMQ is a smart broker with simple consumers (the broker routes messages), Kafka is a dumb broker with smart consumers (the broker is just an append-only log, and consumers track their own position). This architectural difference drives all of Kafka’s trade-offs: higher throughput and replay capability, but more operational complexity and consumer-side bookkeeping.

RabbitMQ vs. Kafka, the honest trade-off:
Before any code, the mental model: a Kafka topic is an ordered, append-only log split into partitions. Each partition is a completely independent, ordered sequence of messages. You cannot guarantee ordering across partitions, only within one. This is not a bug; it’s what lets Kafka scale. If all messages had to be globally ordered, a single node’s write speed would cap the system. By sharding into partitions, you can parallelize writes and reads across the cluster.

The partition key is your routing decision. If you publish with key="order-123", Kafka hashes the key and picks a partition, and every future message with that same key lands on the same partition. This is how you guarantee ordering for a single entity (all events for order-123 are in order) while still scaling horizontally.

Consumer groups are the other magical piece. A group is a set of consumer processes that share the work of reading a topic. Each partition is assigned to exactly one consumer in the group. Add another group, and it independently reads the same topic with its own offsets: that’s pub/sub. Add more consumers to the same group, and you parallelize within that subscription: that’s work distribution. Kafka gives you both patterns from the same primitive.
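The consumer-group rule (each partition belongs to exactly one consumer per group, and groups are independent) can be sketched as a plain assignment function. This is an illustration, not Kafka's assignor: `assign_partitions` is a hypothetical helper that uses simple round-robin, and the group and consumer names are made up.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


topic_partitions = [0, 1, 2, 3]

# Two groups read the same topic independently: that is pub/sub across groups.
groups = {
    "notifications-svc": assign_partitions(topic_partitions, ["n1", "n2"]),
    "analytics-svc": assign_partitions(topic_partitions, ["a1"]),
}

# More consumers than partitions: the extra consumer gets nothing to read.
oversized = assign_partitions(topic_partitions, ["c1", "c2", "c3", "c4", "c5"])
```

The last case is worth noticing: the partition count is the ceiling on parallelism within a group, so adding a fifth consumer to a four-partition topic leaves it idle.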
The Kafka producer has more knobs than the RabbitMQ producer because Kafka exposes more of its internals. The one you absolutely must get right is the partition key. Choose a key that identifies the entity whose events must stay ordered (order ID, user ID, account ID). Choose badly (say, always using null as key) and Kafka will spread your events across partitions, shattering any ordering guarantee. Choose too narrowly (a single hot key) and all your traffic ends up on one partition, defeating the whole point of sharding.

Other critical settings: acks=all (wait for all in-sync replicas to confirm the write, trading latency for durability), enable.idempotence=true (prevents duplicate writes on producer retries, which happen more often than you think on flaky networks), and compression.type=snappy or lz4 (often 3-5x throughput improvement at a tiny CPU cost). Defaults vary by client and version: the Java producer has defaulted to acks=all with idempotence enabled since Kafka 3.0, while many other clients have not, so check what your client actually ships with instead of assuming.
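For reference, the settings named above could be collected into one configuration mapping. The property names are the standard Kafka producer ones (as used by the Java client and librdkafka-based clients); the broker address and the choice of lz4 are placeholder assumptions, and how the mapping is handed to a producer depends on the client library you use.

```python
# Durability and throughput settings for a producer that must not lose events.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumption: replace with your brokers
    "acks": "all",                          # wait for all in-sync replicas
    "enable.idempotence": True,             # broker de-duplicates producer retries
    "compression.type": "lz4",              # or "snappy"; compresses whole batches
}
```

Keeping these in one place makes it easy to review them against the broker-side `min.insync.replicas` setting, which determines how many replicas "all" actually means.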
Kafka consumers are where the “smart consumer” philosophy bites hardest. Unlike RabbitMQ, the broker doesn’t track what you’ve processed; you do. Your consumer periodically commits an offset saying “I’ve processed through offset 12345 in partition 2.” If you crash before committing, you’ll re-read from your last committed offset. If you commit before actually processing, you’ll skip messages on crash.

This leads to two common offset-management strategies. Auto-commit periodically commits the current position in the background. It’s easy to set up but gives you at-most-once semantics if you’re not careful: a commit can happen between receiving and processing, and a crash loses the message. Manual commit after processing gives at-least-once semantics: you only commit after you’ve successfully processed. This is what you almost always want, and it means your handlers must be idempotent.

Watch the poll loop and the heartbeat. If your handler runs so long that the consumer does not poll again within max.poll.interval.ms, or its heartbeats stop for longer than session.timeout.ms, Kafka thinks you died and rebalances your partitions to another consumer. Now two consumers are processing the same batch: the original one that isn’t actually dead and the replacement. This is a classic source of duplicate processing in Kafka.
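The two commit orders described above can be simulated with a list standing in for a partition. This is a toy model with hypothetical names, not client code: the "crash" is placed between the two steps (commit and process) for one chosen offset, which is the window where the strategies differ.

```python
def run_consumer(log, start, crash_at, commit_first):
    """Consume `log` from `start`; crash between commit and process at `crash_at`.

    Returns (messages processed in this run, last committed offset).
    """
    processed = []
    committed = start
    for offset in range(start, len(log)):
        if commit_first:
            committed = offset + 1           # commit, then process
            if offset == crash_at:
                return processed, committed  # crashed before processing
            processed.append(log[offset])
        else:
            processed.append(log[offset])    # process, then commit
            if offset == crash_at:
                return processed, committed  # crashed before committing
            committed = offset + 1
    return processed, committed


log = ["a", "b", "c"]

# Commit before processing: "b" is committed but never handled (at-most-once).
first, resume_from = run_consumer(log, 0, crash_at=1, commit_first=True)
at_most_once = first + run_consumer(log, resume_from, None, True)[0]

# Commit after processing: "b" is handled twice after restart (at-least-once).
first, resume_from = run_consumer(log, 0, crash_at=1, commit_first=False)
at_least_once = first + run_consumer(log, resume_from, None, False)[0]
```

The duplicated "b" in the second run is why commit-after-processing only works together with idempotent handlers.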
This example illustrates two things worth pausing on. First, note how the producer uses order.id as the Kafka key. That’s deliberate: all events for a given order (created, paid, shipped, cancelled) will land on the same partition and be consumed in order. If we used random keys or none at all, an “order.paid” event could be processed before “order.created” by different consumers, which is a nightmare for downstream services.

Second, note the in-memory processedEvents set in the consumer. This is a cheap, fragile idempotency mechanism: it only works if the service never restarts and never runs multiple replicas. Real production idempotency uses Redis, a database table, or a Kafka-native pattern like transactional processing. We’re showing the simple version here for clarity; swap it out for a Redis-backed version (shown later) before going to production.
Node.js

// order-service/kafkaEvents.js
const { randomUUID } = require('crypto');
const KafkaProducer = require('../kafka/producer');

class OrderKafkaPublisher {
  constructor() {
    this.producer = new KafkaProducer();
  }

  async initialize() {
    await this.producer.connect();
  }

  async orderCreated(order) {
    // Use orderId as key for ordering
    await this.producer.send('orders', {
      // A random UUID stays unique even when two events share a millisecond;
      // Date.now() alone can collide and defeat consumer-side deduplication.
      eventId: `evt-${randomUUID()}`,
      eventType: 'order.created',
      occurredAt: new Date().toISOString(),
      data: {
        orderId: order.id,
        customerId: order.customerId,
        items: order.items,
        totalAmount: order.totalAmount
      }
    }, order.id); // Key ensures same order events go to same partition
  }

  async orderStatusChanged(orderId, oldStatus, newStatus) {
    await this.producer.send('orders', {
      eventId: `evt-${randomUUID()}`,
      eventType: 'order.status_changed',
      occurredAt: new Date().toISOString(),
      data: { orderId, oldStatus, newStatus }
    }, orderId);
  }
}

module.exports = OrderKafkaPublisher;

// inventory-service/kafkaHandler.js
const KafkaConsumer = require('../kafka/consumer');

class InventoryKafkaHandler {
  constructor(inventoryService) {
    this.consumer = new KafkaConsumer('inventory-service');
    this.inventoryService = inventoryService;
    this.processedEvents = new Set(); // For idempotency
  }

  async start() {
    await this.consumer.connect();
    await this.consumer.subscribe(['orders'], async (event, metadata) => {
      // Idempotency check
      if (this.processedEvents.has(event.eventId)) {
        console.log(`Skipping duplicate event: ${event.eventId}`);
        return;
      }

      switch (event.eventType) {
        case 'order.created':
          await this.handleOrderCreated(event.data);
          break;
        case 'order.cancelled':
          await this.handleOrderCancelled(event.data);
          break;
        default:
          break; // Event types this service does not care about
      }

      this.processedEvents.add(event.eventId);

      // Crude cleanup: dropping every ID bounds memory, but a duplicate that
      // arrives right after a clear will be processed again.
      if (this.processedEvents.size > 10000) {
        this.processedEvents.clear();
      }
    });
  }

  async handleOrderCreated(data) {
    for (const item of data.items) {
      await this.inventoryService.reserve(
        item.productId,
        item.quantity,
        data.orderId
      );
    }
  }

  async handleOrderCancelled(data) {
    await this.inventoryService.releaseByOrder(data.orderId);
  }
}

module.exports = InventoryKafkaHandler;

Python

# order_service/kafka_events.py
import uuid
from datetime import datetime, timezone
from typing import Any

from kafka.producer import KafkaProducer


class OrderKafkaPublisher:
    def __init__(self) -> None:
        self.producer = KafkaProducer()

    async def initialize(self) -> None:
        await self.producer.connect()

    async def order_created(self, order: dict[str, Any]) -> None:
        # key=order["id"] guarantees all events for this order go to the same
        # partition -> consumers process them in the order they were published.
        await self.producer.send(
            topic="orders",
            message={
                "event_id": f"evt-{uuid.uuid4()}",
                "event_type": "order.created",
                "occurred_at": datetime.now(timezone.utc).isoformat(),
                "data": {
                    "order_id": order["id"],
                    "customer_id": order["customer_id"],
                    "items": order["items"],
                    "total_amount": str(order["total_amount"]),
                },
            },
            key=order["id"],
        )

    async def order_status_changed(
        self, order_id: str, old_status: str, new_status: str
    ) -> None:
        await self.producer.send(
            topic="orders",
            message={
                "event_id": f"evt-{uuid.uuid4()}",
                "event_type": "order.status_changed",
                "occurred_at": datetime.now(timezone.utc).isoformat(),
                "data": {
                    "order_id": order_id,
                    "old_status": old_status,
                    "new_status": new_status,
                },
            },
            key=order_id,
        )


# inventory_service/kafka_handler.py
from collections import deque
from typing import Any

from kafka.consumer import KafkaConsumer


class InventoryKafkaHandler:
    MAX_TRACKED_EVENTS = 10_000

    def __init__(self, inventory_service) -> None:
        self.consumer = KafkaConsumer(group_id="inventory-service")
        self.inventory = inventory_service
        # Sliding window of recently seen event IDs. Cheap and good enough
        # for single-instance deployments. Swap for Redis in production.
        self._processed: set[str] = set()
        # Insertion order of the IDs above. No maxlen here: a bounded deque
        # would silently drop IDs that would then never leave the set.
        self._processed_order: deque[str] = deque()

    async def start(self) -> None:
        await self.consumer.connect(topics=["orders"], from_beginning=False)
        await self.consumer.consume(self._on_event)

    async def _on_event(self, event: dict[str, Any], metadata: dict[str, Any]) -> None:
        event_id = event["event_id"]
        if event_id in self._processed:
            return  # duplicate

        event_type = event["event_type"]
        if event_type == "order.created":
            await self._handle_created(event["data"])
        elif event_type == "order.cancelled":
            await self._handle_cancelled(event["data"])

        self._processed.add(event_id)
        self._processed_order.append(event_id)
        # Evict oldest IDs so the set doesn't grow unbounded.
        while len(self._processed_order) > self.MAX_TRACKED_EVENTS:
            self._processed.discard(self._processed_order.popleft())

    async def _handle_created(self, data: dict[str, Any]) -> None:
        for item in data["items"]:
            await self.inventory.reserve(
                item["product_id"], item["quantity"], data["order_id"]
            )

    async def _handle_cancelled(self, data: dict[str, Any]) -> None:
        await self.inventory.release_by_order(data["order_id"])
Caveats & Common Pitfalls: Kafka in production
Consumer group lag as silent failure. If nobody is alerting on lag, consumers can fall behind indefinitely while the topic looks healthy. By the time someone notices at 4 million messages, catching up takes days and the data is stale.
Auto-commit at the wrong moment. Auto-commit is on by default and commits every 5 seconds whether or not you have finished processing. If offsets for a batch have been committed and the consumer then crashes mid-batch, the unprocessed messages are skipped on restart and you lose them silently.
Partition rebalances as stop-the-world events. A consumer join/leave triggers a rebalance. During a rebalance, every consumer in the group pauses. With naive settings, a single consumer restart can freeze the whole group for tens of seconds. Use cooperative rebalancing (CooperativeStickyAssignor).
Retention assumptions that do not hold. You assume Kafka keeps events for a week; retention is actually 24 hours in your config. Replay from three days ago and the events are gone. Always verify log.retention.hours and log.retention.bytes in production.
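The auto-commit and rebalance pitfalls above can be sketched as a consumer configuration plus a commit-after-processing loop. This is a minimal sketch, assuming the confluent-kafka Python client: the configuration keys are standard Kafka consumer settings, while `process_batch`, the handler, and the broker address are hypothetical names used for illustration. The loop relies only on `consume()` and `commit()`, so it can be exercised with any object that provides those two methods.

```python
from typing import Any, Callable

# Standard Kafka consumer settings; the values are illustrative assumptions.
CONSUMER_CONFIG = {
    "bootstrap.servers": "localhost:9092",  # assumption: a local broker
    "group.id": "inventory-service",
    "enable.auto.commit": False,            # commit manually, after processing
    "auto.offset.reset": "earliest",
    "partition.assignment.strategy": "cooperative-sticky",
}


def process_batch(
    consumer: Any,
    handler: Callable[[Any], None],
    batch_size: int = 100,
    timeout: float = 1.0,
) -> int:
    """Process one batch and commit only if every message succeeded.

    If the handler raises, nothing is committed, so the batch is redelivered
    after a restart (at-least-once). Real messages should also be checked
    with message.error() before handling; that is omitted here for brevity.
    """
    messages = consumer.consume(num_messages=batch_size, timeout=timeout)
    for message in messages:
        handler(message)
    if messages:
        consumer.commit(asynchronous=False)  # synchronous: wait for the broker
    return len(messages)
```

With the real client, the consumer would be created as `Consumer(CONSUMER_CONFIG)` and subscribed to a topic before calling `process_batch` in a loop. Because a failed batch is redelivered in full, the handler still needs to be idempotent.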
Solutions & Patterns: Manual commits, lag alerting, cooperative rebalance
Production-grade Kafka consumers disable auto-commit, batch-process a reasonable chunk (50-500 messages), commit offsets only after the business logic has durably succeeded, and handle the “processed but not committed” case with an idempotent consumer. Alert on lag per (topic, partition, consumer_group) with thresholds based on ingestion rate: if the topic gets 10K msgs/sec and your SLO is 30 seconds, the lag budget is 300K messages across the whole topic. Divide that by the partition count to get the per-partition threshold (12,500 messages per partition with 24 partitions, assuming keys are evenly distributed).
Decision rule for partition count: rough rule is target_throughput / per_partition_throughput with a floor of max_consumers_you_expect. You can add partitions later, but doing so changes the key-to-partition mapping, which breaks ordering for keys that have events in flight. Start generous: 24 partitions is a reasonable default for a new topic that may need to scale.
Before: Consumer group with auto-commit every 5 seconds, no lag alert. Consumer deadlocks on a slow downstream. Lag climbs to 12 million over a weekend. Monday morning, business reports missing data from Friday afternoon.
After: Manual commit after batch success; alert on lag over 100K msgs/partition with 5 minute evaluation; cooperative rebalance keeps partition reassignment smooth during deploys. Outages are caught within minutes.
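The two sizing rules in this section are simple arithmetic, sketched below. The function names are hypothetical, and the per-partition threshold assumes traffic is spread evenly across partitions, which only holds when keys are well distributed.

```python
import math


def lag_alert_threshold(
    ingest_rate_per_sec: float, slo_seconds: float, partitions: int
) -> float:
    """Per-partition lag (in messages) that corresponds to the freshness SLO."""
    topic_budget = ingest_rate_per_sec * slo_seconds  # budget for the whole topic
    return topic_budget / partitions


def partition_count(
    target_throughput: float, per_partition_throughput: float, max_consumers: int
) -> int:
    """target / per-partition throughput, with a floor of the expected consumers."""
    needed = math.ceil(target_throughput / per_partition_throughput)
    return max(needed, max_consumers)
```

For example, `lag_alert_threshold(10_000, 30, 24)` gives 12,500 messages per partition, and `partition_count(50_000, 5_000, 24)` stays at the floor of 24 because throughput alone would only need 10.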
Your Kafka consumer group is 2 million messages behind during Black Friday peak. Walk through the decision tree of what to do.
Strong Answer Framework:
Classify the lag. Is it growing, steady, or shrinking? A growing lag means consumers cannot keep up with producers; steady means you are matching production rate; shrinking means you will eventually catch up on your own.
Triage consumer health. Is the consumer CPU-bound, I/O-bound on a downstream, or blocked on a lock? Profile before scaling, because adding consumers to a consumer group limited by a downstream just moves the bottleneck.
If consumers are under-provisioned and the topic has spare partitions: scale horizontally by adding consumer instances up to the partition count. Kafka rebalances automatically.
If all partitions are saturated: the topic is the bottleneck. You cannot add partitions mid-incident without breaking ordering. Options are (a) accept the delay, (b) spin up a separate consumer group that reads in parallel and commits to a different offset store for catch-up processing, or (c) drop non-critical processing temporarily.
Decide on the business trade-off. Is stale data acceptable for now and you backfill later? Or must you catch up in real time even at the cost of spending engineering effort?
Post-incident: raise partition count, alert on lag earlier, and profile the consumer for systemic slow paths.
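The first four steps of the framework above can be sketched as two small functions. All names, the 5% tolerance, and the bottleneck labels are illustrative assumptions; a real runbook would draw these inputs from lag metrics and profiling data.

```python
def classify_lag(samples: list[int], tolerance: float = 0.05) -> str:
    """Classify the lag trend from samples taken at regular intervals."""
    if len(samples) < 2:
        raise ValueError("need at least two lag samples")
    first, last = samples[0], samples[-1]
    change = last - first
    # Treat small relative movement as noise.
    if abs(change) <= tolerance * max(first, 1):
        return "steady"
    return "growing" if change > 0 else "shrinking"


def next_action(trend: str, bottleneck: str, consumers: int, partitions: int) -> str:
    """Map the triage result to the next step. bottleneck: 'cpu', 'downstream', 'lock'."""
    if trend == "shrinking":
        return "wait: the group will catch up on its own"
    if bottleneck != "cpu":
        return "fix the downstream or lock contention before scaling"
    if consumers < partitions:
        return "scale: add consumers up to the partition count"
    return "saturated: accept delay, run a catch-up group, or shed non-critical work"
```

The ordering of the checks mirrors the framework: scaling is only considered after the bottleneck has been identified as the consumers themselves.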
Real-World Example: Shopify’s Black Friday 2019 writeup described exactly this pattern: their order ingest Kafka consumers fell behind peak traffic; they had preemptively added partitions and autoscaling but still had to manually intervene on one stream that was bottlenecked on a downstream warehouse write. DoorDash’s 2021 “Peak performance” engineering post covers similar scaling decisions around Kafka consumer capacity.
Senior Follow-up Questions:
Follow-up 1: “Why can’t you add consumers beyond the partition count?”
Each partition is assigned to exactly one consumer in a group. Consumers beyond the partition count sit idle. The number of partitions is the upper bound on horizontal scaling for a consumer group.
Follow-up 2: “What is cooperative rebalancing and why does it matter during an incident?”
The default eager rebalance revokes every partition from every consumer, reassigns all, then resumes. During the rebalance, no consumer makes progress. Cooperative (CooperativeStickyAssignor) incrementally reassigns only the partitions that moved, so most consumers keep working. During a Black Friday incident, the difference is five seconds of downtime versus thirty.
Follow-up 3: “How do you do ‘parallel catchup’ without breaking ordering guarantees?”
Spin up a new consumer group that reads from the current lag position with relaxed ordering (or with a reshuffled partition key). Use it to process the backlog while the primary group continues to handle new events. Merge results carefully downstream. This is viable when you can tolerate temporary ordering relaxation for the backlog; it does not work for strict per-entity ordering.
Common Wrong Answers:
“Add more partitions right now.” Fails because adding partitions changes the hash distribution and breaks ordering for in-flight events.
“Reset the consumer offset to latest and skip the backlog.” Fails because it silently drops real user data; only acceptable if the business confirms the data is not valuable.
Further Reading:
Confluent’s blog “Things You Should Know About Kafka Consumer Rebalancing.”
Shopify Engineering “Pipelines Meet Pipes: Shopify Black Friday” (2019).
Kafka documentation on CooperativeStickyAssignor.
Your team wants 'exactly-once' semantics for a Kafka-to-database pipeline. The staff engineer pushes back. What is the honest answer and what pattern do you propose?
Strong Answer Framework:
Clarify what exactly-once actually means. Kafka supports exactly-once within Kafka via transactions (produce + consume + commit offset atomically). It does not, and cannot, provide exactly-once across arbitrary side effects like “charge a credit card” or “write to Postgres.”
Identify the real requirement. “Exactly-once” usually means “no duplicates visible to the consumer.” This is achievable via at-least-once delivery plus idempotent processing — a much simpler architecture with the same end-user semantics.
Propose the pattern: at-least-once Kafka consumption, plus a deduplication layer. Deduplication can be (a) an upsert in the target database keyed by event ID, (b) a Redis-based idempotency cache, or (c) a unique constraint that makes duplicate inserts fail harmlessly.
If strict exactly-once within Kafka is required (e.g., for financial transformations that do not leave Kafka), enable transactional producers with isolation.level=read_committed consumers and enable.idempotence=true.
Document the boundary clearly: “within Kafka, exactly-once. From Kafka to Postgres, at-least-once with idempotent writes.”
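Option (c) from the pattern above, a unique constraint that makes duplicate inserts harmless, can be sketched with the standard-library `sqlite3` module standing in for the target database. The table layout and function names are assumptions for illustration; in Postgres the equivalent statement would use `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # event_id as primary key is what makes redelivery harmless.
    conn.execute(
        "CREATE TABLE orders ("
        " event_id TEXT PRIMARY KEY,"
        " order_id TEXT NOT NULL,"
        " total_cents INTEGER NOT NULL)"
    )
    conn.commit()


def apply_order_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Write the order row once; return False when the event was already applied."""
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "INSERT OR IGNORE INTO orders (event_id, order_id, total_cents)"
            " VALUES (?, ?, ?)",
            (event["event_id"], event["order_id"], event["total_cents"]),
        )
    return cursor.rowcount == 1
```

The consumer can commit its Kafka offset after this call regardless of the return value: a redelivered event simply writes nothing.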
Real-World Example: Confluent’s own Kafka Streams documentation is careful to specify that exactly-once applies “for Kafka-to-Kafka workflows.” Uber’s Kafka-based financial pipelines explicitly use idempotent consumers with event-id dedup rather than relying on end-to-end exactly-once, because writing to non-Kafka systems breaks the guarantee.
Senior Follow-up Questions:
Follow-up 1: “How do Kafka transactions actually work?”
A transaction groups produces to multiple partitions and topics, plus offset commits, into one atomic unit. The broker uses a transaction coordinator and a two-phase commit protocol. Consumers with isolation.level=read_committed skip aborted transactions’ messages entirely.
Follow-up 2: “What is the performance cost of enabling transactions?”
Roughly 3-20% throughput reduction depending on transaction size and commit frequency. Small, frequent commits are worse than fewer, larger commits. For most non-financial workloads, the cost is not worth it because at-least-once plus idempotent consumer achieves the same user-visible semantics.
Follow-up 3: “What makes a consumer idempotent in practice?”
Every event has a unique ID, and the consumer either upserts by that ID (database), checks and sets an idempotency cache (Redis), or writes to a table with a unique constraint on event ID (fail silently on duplicate). The pattern is: “processing the same event twice produces the same state as processing it once.”
Common Wrong Answers:
“Kafka has exactly-once, just enable it.” Fails because it only applies within Kafka’s own boundaries; any side effect outside Kafka (DB write, HTTP call, email send) breaks the guarantee.
“Just deduplicate on the consumer side using timestamps.” Fails because timestamps are not unique and can be rewritten; the dedup key must be a unique event ID that the producer assigns once and never changes.
Further Reading:
Jay Kreps’s blog “Exactly-Once Support in Apache Kafka.”
Confluent’s “Exactly-Once Semantics Are Possible: Here’s How Kafka Does It.”
Tyler Akidau’s “The world beyond batch: Streaming 101” (on consistency models).
A well-designed event schema is the most underrated investment in an event-driven system. Events are your public API to the rest of the organization: once a service publishes order.created with a certain shape, every downstream consumer depends on it, and changing that shape becomes a coordination nightmare across teams. Unlike HTTP APIs where you can version by URL (/v1/orders, /v2/orders), events tend to sprawl unversioned because nobody thought to version them.
Spend the 30 minutes at design time to nail down the envelope. Include metadata fields that future-you will need: an event ID for deduplication, an event type for routing, an occurred_at timestamp for temporal queries, a version for schema evolution, a correlation ID for distributed tracing, and the event data itself as a nested object. Separate envelope from payload, because it lets you evolve each independently.
A common anti-pattern: stuffing every possible field into the data object “just in case.” Resist. Events should describe what happened, not carry a full entity snapshot. If a consumer needs data that isn’t in the event, either add it to the event (if it is part of what happened and the producer owns it) or have the consumer fetch it from the service that is the authority for it. Fat events that duplicate every service’s data become impossible to evolve.
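The envelope described above can be sketched as a small dataclass. The class name `EventEnvelope` and the `evt-` prefix are illustrative assumptions; the field set follows the list in the text, with the payload kept in a nested `data` object so envelope and payload can evolve separately.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class EventEnvelope:
    event_type: str              # used for routing, e.g. "order.created"
    data: dict[str, Any]         # the payload: what happened
    correlation_id: str          # ties the event to a request for tracing
    event_version: str = "1.0"   # present from day one, even if it never changes
    event_id: str = field(default_factory=lambda: f"evt-{uuid.uuid4()}")
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_dict(self) -> dict[str, Any]:
        """Return a plain dict ready for JSON serialization."""
        return asdict(self)
```

A producer would build one envelope per event, for example `EventEnvelope(event_type="order.created", data={"order_id": "ord-123"}, correlation_id="req-1").to_dict()`, and serialize the result.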
Schemas must evolve; the goal is to evolve them without breaking every consumer. There are two kinds of changes: backward-compatible (old consumers can still parse new events, e.g., adding an optional field) and breaking (old consumers will error or misinterpret, e.g., removing a field, renaming a field, changing a field’s type). Backward-compatible changes are free; breaking changes require careful orchestration.
The standard strategy: never make a breaking change in place. Instead, publish both the old and new schema in parallel for a migration period. Roll consumers onto the new schema one by one, monitor that they’re healthy, then retire the old schema once all consumers have moved. This is tedious but it’s the only way to coordinate schema changes across teams without forcing a synchronous upgrade.
Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, or Azure Schema Registry) for anything beyond toy scale. It enforces compatibility rules at publish time: a producer trying to publish a breaking change gets rejected before the broker ever sees it. This is much better than discovering the break in production when three consumer services crash.
    from decimal import Decimal
    from typing import Any

    # Version 1.0
    event_v1 = {
        "event_type": "order.created",
        "event_version": "1.0",
        "data": {
            "order_id": "ord-123",
            "customer_id": "cust-456",
            "total": 59.98,  # just a number
        },
    }

    # Version 1.1 - additive (backward compatible)
    event_v1_1 = {
        "event_type": "order.created",
        "event_version": "1.1",
        "data": {
            "order_id": "ord-123",
            "customer_id": "cust-456",
            "total": 59.98,
            "currency": "USD",  # new optional field
        },
    }

    # Version 2.0 - breaking (shape change)
    event_v2 = {
        "event_type": "order.created",
        "event_version": "2.0",
        "data": {
            "order_id": "ord-123",
            "customer_id": "cust-456",
            "amount": {"value": 59.98, "currency": "USD"},
        },
    }


    class EventHandler:
        """Handle multiple schema versions by dispatching on event_version.

        Keep one code path per major version; retire old ones only after
        all producers have upgraded.
        """

        async def handle_order_created(self, event: dict[str, Any]) -> None:
            version: str = event["event_version"]
            data = event["data"]
            if version.startswith("1."):
                await self._handle_v1(data)
            elif version.startswith("2."):
                await self._handle_v2(data)
            else:
                raise ValueError(f"Unsupported event version: {version}")

        async def _handle_v1(self, data: dict[str, Any]) -> None:
            amount = {
                "value": Decimal(str(data["total"])),
                "currency": data.get("currency", "USD"),
            }
            # ... process with normalized amount

        async def _handle_v2(self, data: dict[str, Any]) -> None:
            amount = data["amount"]  # already structured
            # ... process
Caveats & Common Pitfalls: Event schema evolution
No version field on the envelope. The first breaking change forces a migration across every consumer, because nobody can distinguish old events from new. Add event_version on day one even if it is always “1”.
Shipping a breaking change as a minor revision. Renaming a required field or changing its type silently breaks every consumer that has not redeployed. Every change should be evaluated as backward-compatible (add a field, make optional) or breaking (rename, remove, change type).
Fat events with full entity snapshots. Event payloads grow to include every field in the entity “just in case.” Storage and network costs climb, and evolving any one field breaks all consumers because they all depend on it.
Events that describe state instead of change. An order.updated event with a full object requires consumers to diff against previous state to know what actually changed. Prefer domain-specific events: order.address_changed, order.shipped, order.refunded.
Solutions & Patterns: Versioned envelopes, additive evolution, domain events
Every event has an envelope with event_id, event_type, event_version, occurred_at, correlation_id, and data. The envelope fields evolve rarely; the data payload is the place where schemas grow. Follow additive evolution: new fields are optional with defaults; deprecated fields are marked but left on the wire; breaking changes get a new event_version or a new topic.
Decision rule: if a change removes or renames a field, it is breaking. Bump event_version and publish both shapes during migration; consumers switch on version. If a change adds an optional field or an enum value with a known default, it is backward-compatible. Ship it without a version bump.
Before: order.updated events carry the full Order entity. When the team adds a priority field, every consumer must redeploy to handle the new key in their JSON schema, or their parsers break.
After: order.priority_set is a new event with just {order_id, priority, set_at}. Consumers who care about priority subscribe; others do not. No one’s parser breaks.
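The decision rule above can be sketched as a small checker that compares two versions of a payload schema. `FieldSpec` and `classify_change` are hypothetical names, and the rules are a simplification of what a schema registry enforces; the sketch covers only the cases named in the text (remove, rename, type change, added required field).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    type: str
    required: bool = True


def classify_change(old: dict[str, FieldSpec], new: dict[str, FieldSpec]) -> str:
    """Return 'breaking' or 'backward-compatible' for a schema change."""
    for name, spec in old.items():
        if name not in new:
            return "breaking"  # field removed (a rename looks like remove + add)
        if new[name].type != spec.type:
            return "breaking"  # type changed
        if new[name].required and not spec.required:
            return "breaking"  # optional field became required
    for name, spec in new.items():
        if name not in old and spec.required:
            return "breaking"  # new required field: existing events lack it
    return "backward-compatible"
```

A check like this could run in CI against the previous schema version, so a breaking change is flagged before it is published rather than after consumers fail.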
Your team wants to 'just add a required field' to an existing event. The platform team says no. Explain why, and propose a migration plan.
Strong Answer Framework:
Explain the blast radius. Every consumer of this event has a deserialization path that currently works without the field. Adding a required field will break them at deserialization or at validation, depending on the consumer’s implementation.
Identify who consumes the event via the schema registry or the event catalog. If the registry is missing, that is a platform gap to address first.
Propose the migration: introduce the new field as optional with a documented default; instrument producer-side population so you can see whether all producers actually set it; once adoption is confirmed over a grace period (one or two release cycles), mark it required in the consumer-side contract test. At no point is anyone forced to deploy in lockstep.
Document the rule in the event governance guide: fields are optional at introduction; they become required only after telemetry confirms all producers populate them.
If the field is truly load-bearing and cannot be optional (e.g., a tenant ID for data segregation), treat it as a new event version, not a change to the existing one.
Real-World Example: Confluent’s schema registry with BACKWARD compatibility mode enforces exactly this pattern: adding required fields is rejected by the registry itself when the schema is registered. Shopify’s “data platform playbook” talks describe the same policy: events are additive-only, and breaking changes become new event types.
Senior Follow-up Questions:
Follow-up 1: “What does Avro’s ‘BACKWARD’ compatibility mean exactly?”
A consumer using the new schema can read data written with the old schema. This is achieved by giving every new field a default, so old records still deserialize. The alternatives are FORWARD (a consumer using the old schema can read data written with the new schema) and FULL (both). BACKWARD is the Confluent Schema Registry default and implies that consumers move to the new schema before producers start writing it. Choose FORWARD when producers upgrade first, and FULL when you cannot control the order.
Follow-up 2: “What if the schema registry is not in place and we have no idea who consumes the event?”
Step zero is to establish visibility: log every consumer that parses the event (user-agent, service name, timestamp) into a central catalog. Wait one to two weeks of peak traffic to enumerate consumers. Only then change the schema.
Follow-up 3: “How do you handle polyglot consumers where different languages enforce schema differently?”
Pick a serialization format with cross-language contracts: Avro or Protobuf with a shared registry. JSON Schema can work but enforcement varies widely by client library. Document the compatibility rule (BACKWARD or FULL) and gate changes on registry checks in CI.
Common Wrong Answers:
“Just ship it; consumers will upgrade when they break.” Fails because production failures become your outage, not the consumer’s.
“Version the topic instead of the schema.” Fails because now every consumer must subscribe to two topics forever; the versioning happens at the event-type level, not the topic level.
Further Reading:
Confluent Schema Registry documentation on compatibility modes.
Martin Kleppmann’s Designing Data-Intensive Applications, chapter on encoding and evolution.
Ben Stopford’s Designing Event-Driven Systems (O’Reilly).
There are three possible delivery semantics, and understanding which one you have (and which one you need) is fundamental.
At-most-once: each message is delivered zero or one times. No duplicates, but you can lose messages on failure. Use only when lost data is acceptable (e.g., click telemetry).
At-least-once: each message is delivered one or more times. No data loss, but duplicates are possible. This is the default sane choice and what you’ll want 90% of the time.
Exactly-once: each message is delivered exactly once. Sounds ideal, but it’s expensive, only works within specific technology boundaries (Kafka transactions, for example), and does not cross from messaging into side effects like sending emails or calling external APIs.
The practical answer is almost always “at-least-once with idempotent consumers.” Accept that duplicates will happen and make your handlers safe to run twice. This is radically simpler than chasing exactly-once across service boundaries, and it works with any broker.
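The difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. The simulation below is a minimal sketch with no broker involved: `run_consumer` is a hypothetical function that injects one crash between the two steps and then restarts from the last committed offset.

```python
def run_consumer(messages: list[str], commit_first: bool, crash_at: int) -> list[str]:
    """Return the messages that were processed, with one crash at index crash_at."""
    processed: list[str] = []
    committed = 0   # offset the consumer resumes from after a restart
    offset = 0
    crashed = False
    while offset < len(messages):
        crash_now = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1               # step 1: commit
            if crash_now:                        # crash before processing
                crashed, offset = True, committed
                continue
            processed.append(messages[offset])   # step 2: process
        else:
            processed.append(messages[offset])   # step 1: process
            if crash_now:                        # crash before committing
                crashed, offset = True, committed
                continue
            committed = offset + 1               # step 2: commit
        offset += 1
    return processed
```

With `["a", "b", "c"]` and a crash at index 1, committing first yields `["a", "c"]` (message lost, at-most-once), while processing first yields `["a", "b", "b", "c"]` (message duplicated, at-least-once). The duplicate is the case an idempotent handler has to absorb.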
Idempotency is the single most important property for at-least-once messaging. An idempotent handler produces the same result no matter how many times it runs with the same input. “Charge customer $50” is not idempotent: run it twice, charge twice. “Set order total to $50” is idempotent. “Insert order record” is not idempotent without a unique constraint; “Insert order record if not exists” is.
The standard implementation: before processing, atomically check “have I seen this event ID before?” in a shared store (Redis, a DB table). If yes, ack and skip. If no, process, record that you’ve seen it, then ack. The tricky part is atomicity: you need the “record” step to be in the same transaction as the business operation, otherwise you can crash between them and still double-process. Many production systems use a database table as the idempotency store alongside the business DB specifically to get a single transaction.
Kafka’s “exactly-once semantics” (EOS) is the closest any broker comes to the real thing, but it has a strict scope: it only guarantees exactly-once within Kafka, meaning a consume-transform-produce loop where input, output, and consumer offsets are all committed in one transaction. It does not extend to calling an external service or writing to an arbitrary database. The moment your side effect leaves the Kafka-aware world, you’re back to at-least-once territory and need idempotency.
Practically: use Kafka transactions for stream-processing jobs that read from one topic, transform, and write to another topic. Use idempotent consumers for anything involving external side effects. Don’t let the “exactly-once” marketing fool you into thinking you’ve escaped duplicate handling entirely.
    # Kafka exactly-once in Python via aiokafka's transactional API.
    # Scope: only covers "consume -> produce -> commit offset" within Kafka.
    # Side effects outside Kafka (DB writes, API calls) still need idempotency.
    import asyncio

    from aiokafka import AIOKafkaProducer, TopicPartition


    async def main() -> None:
        producer = AIOKafkaProducer(
            bootstrap_servers="localhost:9092",
            enable_idempotence=True,
            transactional_id="order-processor",
        )
        await producer.start()
        try:
            # The transaction aborts automatically if the block raises,
            # and the exception propagates to the caller.
            async with producer.transaction():
                await producer.send(
                    topic="orders-processed",
                    value=b'{"status": "done"}',
                )
                # Commit the consumer offsets inside the same transaction.
                # This ties the output message and the input offset together:
                # either both commit or neither does. The offset here is
                # hard-coded for illustration; in a real loop it is the
                # offset of the last consumed message plus one.
                await producer.send_offsets_to_transaction(
                    {TopicPartition("orders", 0): 100},
                    group_id="order-processor",
                )
        finally:
            await producer.stop()


    asyncio.run(main())
Caveats & Common Pitfalls: Delivery semantics
Believing in “exactly-once” across systems. Kafka’s exactly-once applies within Kafka. Once you write to Postgres or send an email, you are back to at-least-once. Treating the messaging layer’s guarantee as end-to-end is a common source of phantom bugs.
Idempotency that is not actually idempotent. A handler that does count += 1 is not idempotent. A handler that does SET last_count = n is. A handler that does INSERT ... ON CONFLICT DO NOTHING is. Review the handler with this lens before trusting at-least-once.
Choosing at-most-once for convenience. “We enabled auto-commit and forgot about it” is how at-most-once sneaks in. It silently loses data during crashes and is only acceptable when the data is genuinely ephemeral.
Storing idempotency keys without a TTL. After a year of operation, the idempotency table has hundreds of millions of keys and queries slow down. Keep TTLs reasonable (24 hours to 7 days for most use cases).
Solutions & Patterns: At-least-once plus idempotent consumers
The practical default: at-least-once delivery with idempotent consumers. Every event has a unique event_id. The consumer’s first action is a conditional insert keyed by event_id into a processed_events table (INSERT ... ON CONFLICT DO NOTHING, or SET NX with a TTL in Redis). If the key already exists, skip processing. If not, process and commit.
Decision rule for when to pay the exactly-once cost: only when the pipeline stays entirely within Kafka (Kafka Streams, Flink-to-Kafka), and only when the business semantics demand zero duplicates even in the face of consumer retries. Everywhere else, at-least-once plus idempotency is simpler, cheaper, and end-to-end equivalent.
Before: Consumer increments a counter on every event. A single retry causes double-counting. Business metrics drift over time, and by the time anyone notices, the history is irreparable.
After: Consumer records the event ID in the processed_events table in the same transaction as the increment. Duplicates are safely no-ops. Counters match producer-side expected totals.
A product manager asks you to guarantee 'zero duplicate emails' in your notification service. Walk through how you deliver on that, and the honest caveats.
Strong Answer Framework:
Clarify the requirement. “Zero duplicates” at the user level is achievable; “zero duplicates” at the messaging layer is not, because at-least-once is the only robust delivery mode for a networked pipeline ending in an external email provider.
Design the pipeline for at-least-once delivery plus idempotent sending. Every email event has a unique ID; before calling the email provider, the consumer checks an idempotency store keyed by (user_id, event_id). If present, skip. If absent, send, then record.
Handle the failure window. If the consumer crashes between “send” and “record,” the retry will try to send again. The fix is to pass a stable idempotency key to the email provider, if it offers one, so the provider deduplicates on its end. Support varies by provider and API version, so verify it in the provider’s current documentation rather than assuming it.
Document the edge cases: if the provider does not support idempotency keys, you cannot guarantee zero duplicates; you can only reduce probability. The honest answer is “we guarantee near-zero under all plausible failure modes, but not mathematical zero.”
Add observability: duplicate-suppression metric, email-send latency, idempotency-store hit rate. If duplicates start appearing, you see it within minutes.
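The check, send, record sequence above can be sketched as follows. Everything here is a hypothetical stand-in: the `send` callable represents the email provider call, the in-memory set represents an idempotency store such as Redis with a TTL, and whether a provider honors the idempotency key passed to it is an assumption that must be verified per provider.

```python
from typing import Callable

# send(recipient_user_id, body, idempotency_key)
SendFn = Callable[[str, str, str], None]


class IdempotentEmailSender:
    def __init__(self, send: SendFn) -> None:
        self._send = send
        self._sent: set[tuple[str, str]] = set()  # stand-in for Redis/DB with TTL
        self.suppressed = 0                       # duplicate-suppression metric

    def deliver(self, user_id: str, event_id: str, body: str) -> bool:
        """Send at most once per (user_id, event_id); return False if suppressed."""
        key = (user_id, event_id)
        if key in self._sent:
            self.suppressed += 1
            return False
        # The same key is passed on every retry, so a provider that supports
        # idempotency keys can drop a resend after a crash before "record".
        self._send(user_id, body, f"{user_id}:{event_id}")
        self._sent.add(key)  # record only after the send succeeded
        return True
```

If `send` raises, nothing is recorded and the retry sends again, which is the window that provider-side deduplication is meant to close.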
Real-World Example: Stripe’s webhook delivery system explicitly uses at-least-once delivery and expects receivers to deduplicate by event ID; their docs state that webhook endpoints may receive the same event more than once. Provider-side deduplication in email APIs, where it exists, addresses the same gap: the caller cannot close the window between “send” and “record” on its own.
Senior Follow-up Questions:
Follow-up 1: “What is the retention policy for the idempotency store?”
Long enough to cover the longest plausible retry window. For most email pipelines, 24-72 hours is sufficient because anything older is unlikely to retry. Keep TTLs tight to keep the store small and fast.
Follow-up 2: “What if two different events should produce one email? (e.g., order confirmation and shipping confirmation combined)”
That is a business logic problem, not an infrastructure problem. The application layer should decide when to send; the infrastructure layer should just ensure that once a decision is made, it is delivered at-least-once and idempotently. Mixing the two leads to confused architecture.
Follow-up 3: “How do you handle the case where the email provider’s idempotency store expires before our retry?”
That is a real edge case: if we retry 72 hours later and the provider’s key has expired, we may send twice. Pragmatic fix: track our own send state with a longer TTL, and only retry within the provider’s idempotency window. After that, treat it as non-retryable and escalate to a human.
Common Wrong Answers:
“Use exactly-once Kafka transactions.” Fails because exactly-once does not extend to the email provider, which is the actual source of duplicates.
“Dedupe on our side using email subject and recipient.” Fails because a user might legitimately receive two emails with the same subject (e.g., two separate orders), and subject-based dedup breaks that.
Further Reading:
Stripe’s docs on webhook idempotency.
Amazon SES developer guide, “Duplicate suppression” section.
Gregor Hohpe’s Enterprise Integration Patterns, “Idempotent Receiver” chapter.
Dead Letter Queues (DLQs) are the exception-handling mechanism of async systems. When a message fails all retry attempts, it lands in the DLQ rather than being lost. Think of it as the “undeliverable mail” bin at the post office: someone needs to periodically open it, figure out why each letter failed, and decide what to do.
Production pitfall: The number one operational mistake with DLQs is not monitoring them. Teams set up DLQs, messages silently accumulate, and nobody notices until a customer reports that their order from two weeks ago never shipped. Set up alerts on DLQ depth, and build a simple admin tool to inspect and replay DLQ messages.
    # DLQ handler: picks up messages that exhausted retries, lets a human or
    # automated policy decide to replay or park for investigation.
    import json
    import logging
    from datetime import datetime, timezone
    from typing import Any, Awaitable, Callable, Optional

    import aio_pika
    from aio_pika.abc import AbstractIncomingMessage, AbstractRobustChannel

    logger = logging.getLogger(__name__)

    RetryDecision = Callable[[dict[str, Any], Optional[dict[str, Any]]], Awaitable[bool]]


    class DeadLetterHandler:
        def __init__(self, channel: AbstractRobustChannel, db) -> None:
            self.channel = channel
            self.db = db  # e.g., motor client for MongoDB

        async def setup_dlq(self, original_queue: str) -> None:
            dlq_name = f"{original_queue}.dlq"
            await self.channel.declare_queue(dlq_name, durable=True)
            # Rejected or expired messages on the original queue are routed
            # to the DLQ through the default exchange.
            await self.channel.declare_queue(
                original_queue,
                durable=True,
                arguments={
                    "x-dead-letter-exchange": "",
                    "x-dead-letter-routing-key": dlq_name,
                },
            )

        async def process_dlq(
            self,
            dlq_name: str,
            retry_decision: RetryDecision,
        ) -> None:
            queue = await self.channel.declare_queue(dlq_name, durable=True)

            async def _on_message(message: AbstractIncomingMessage) -> None:
                try:
                    content = json.loads(message.body)
                    headers = dict(message.headers or {})
                    # RabbitMQ records the failure history in the x-death header.
                    death = (headers.get("x-death") or [{}])[0]
                    logger.warning(
                        "DLQ message: original_queue=%s reason=%s count=%s first_death=%s",
                        death.get("queue"),
                        death.get("reason"),
                        death.get("count"),
                        death.get("time"),
                    )

                    should_retry = await retry_decision(content, death)
                    if should_retry and death.get("queue"):
                        # Replay to the original queue with a manual-retry marker.
                        new_headers = {
                            **headers,
                            "x-manual-retry": True,
                            "x-retry-timestamp": datetime.now(timezone.utc).isoformat(),
                        }
                        await self.channel.default_exchange.publish(
                            aio_pika.Message(
                                body=message.body,
                                headers=new_headers,
                                content_type=message.content_type,
                                delivery_mode=aio_pika.DeliveryMode.PERSISTENT,
                            ),
                            routing_key=death["queue"],
                        )
                    else:
                        await self._store_for_investigation(content, death)
                    await message.ack()
                except Exception:
                    logger.exception("DLQ processing failed")
                    # Requeue so the message stays in the DLQ for manual
                    # intervention. nack(requeue=False) would discard it,
                    # because the DLQ has no dead-letter exchange of its own.
                    # A message that always fails will be redelivered, so
                    # pair this with a prefetch limit and alerting.
                    await message.nack(requeue=True)

            await queue.consume(_on_message)

        async def _store_for_investigation(
            self,
            message: dict[str, Any],
            death_info: dict[str, Any],
        ) -> None:
            await self.db.failed_messages.insert_one(
                {
                    "message": message,
                    "death_info": death_info,
                    "stored_at": datetime.now(timezone.utc),
                    "status": "pending_investigation",
                }
            )
Caveats & Common Pitfalls: Dead Letter Queues
DLQ exists, nobody looks at it. Messages accumulate for weeks. When the business asks “why did order 42 never ship?”, you find three months of unprocessed events in the DLQ. Always alert on DLQ depth and set up a scheduled review cadence.
No context on why the message failed. The DLQ message has just the payload, no stack trace, no retry count, no timing. Reprocessing blindly repeats the same error. Store failure context in message headers or a sidecar record.
Poison-message storms. A single malformed message fails forever, triggering retries that overwhelm the normal pipeline. The DLQ is supposed to quarantine poison; make sure max-retries is low (3-5) and DLQ routing is mandatory, not best-effort.
Replaying DLQ in bulk without triage. Someone kicks off “reprocess all 10,000 DLQ messages” and they fail again for the same reasons. Always triage and fix the underlying cause before bulk replay, and replay in small batches with monitoring.
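For the "no context" pitfall above, one option is to attach failure details as headers at the moment a consumer gives up on a message. This is a sketch only: the helper name and the `x-error-*` header names are hypothetical conventions, not part of any broker's protocol.

```python
import traceback
from datetime import datetime, timezone
from typing import Any, Optional


def build_failure_headers(
    exc: BaseException,
    retry_count: int,
    first_failed_at: Optional[str] = None,
) -> dict[str, Any]:
    """Describe why a message is being dead-lettered (header names are hypothetical)."""
    now = datetime.now(timezone.utc).isoformat()
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "x-error-class": type(exc).__name__,
        "x-error-message": str(exc)[:500],  # truncate: headers are not a log store
        "x-error-trace": trace[-2000:],     # keep only the tail of the stack trace
        "x-retry-count": retry_count,
        "x-first-failed-at": first_failed_at or now,
        "x-last-failed-at": now,
    }


headers = build_failure_headers(ValueError("bad sku"), retry_count=3)
```

If the trace is too large for headers in your broker, the same dictionary can be written to a sidecar record keyed by message ID instead.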
Solutions & Patterns: DLQ as a first-class operational surface
Treat the DLQ as a production-critical subsystem. Every message that lands in the DLQ carries metadata: original topic/queue, retry count, last error (class and message), first-failed-at, last-failed-at. Build a simple admin UI (or at minimum, a CLI) that can list, inspect, reason about, and selectively replay messages. Alert on (a) DLQ depth over N, (b) new arrivals per hour, (c) oldest message age.
Decision rule for reprocessing: triage first, fix the cause, then replay in small batches (100-1000 at a time). Never replay more than you can watch in real time. If the replay triggers new DLQ arrivals, stop and re-triage.
Before: DLQ grows silently. One day a VP asks about missing revenue data, and you find 800K unprocessed payment-confirmation events from the past quarter.
After: DLQ depth is on the oncall dashboard. Alert fires at over 100 messages. On-call reviews within the hour, identifies root cause, fixes, and replays. Mean time-to-drain is under two hours.
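The three alert conditions (depth, arrival rate, oldest message age) can be expressed as one small check that a monitoring job runs on a schedule. This is a sketch under assumptions: only the depth threshold of 100 comes from the text, while the other defaults and all names are placeholders.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DlqThresholds:
    max_depth: int = 100                             # from the "over 100 messages" example
    max_arrivals_per_hour: int = 20                  # hypothetical default
    max_oldest_age: timedelta = timedelta(hours=1)   # hypothetical default


def dlq_alerts(
    depth: int,
    arrivals_last_hour: int,
    oldest_age: timedelta,
    t: DlqThresholds = DlqThresholds(),
) -> list[str]:
    """Return one human-readable alert per breached threshold."""
    alerts = []
    if depth > t.max_depth:
        alerts.append(f"depth {depth} > {t.max_depth}")
    if arrivals_last_hour > t.max_arrivals_per_hour:
        alerts.append(f"arrivals/hour {arrivals_last_hour} > {t.max_arrivals_per_hour}")
    if oldest_age > t.max_oldest_age:
        alerts.append(f"oldest message age {oldest_age} > {t.max_oldest_age}")
    return alerts
```

Age is worth alerting on separately from depth: a DLQ holding five messages that are three weeks old is a bigger problem than one holding fifty messages from the last minute.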
You discover 1.2 million messages in your DLQ from the past 6 weeks. Walk through how you triage and safely drain it.
Strong Answer Framework:
Stop the bleeding first. Identify if new messages are still flowing in. If yes, that means the underlying bug is still active; fix or disable the failing path before touching the backlog.
Classify the DLQ. Group messages by (failure reason, topic, time range). You will usually find a handful of distinct error classes accounting for most of the volume.
For each class, decide: is the root cause fixed? If not, fix it and deploy. Is the message still relevant to process, or has the underlying state moved on? Some events may be stale beyond usefulness (e.g., a “send shipping notification” event where the order was cancelled two weeks ago).
Replay in batches with monitoring. Start with a small batch (100 messages). Watch success rate. If healthy, ramp up. If failures resume, stop and investigate.
Document findings. If this DLQ buildup was a symptom of a monitoring gap, close the gap: add depth alerts, add age alerts, add a scheduled weekly review.
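The classification and batched-replay steps above can be sketched as two small functions. This assumes each DLQ message has already been exported to a dict with `error_class`, `queue`, and an ISO-8601 `failed_at` field; those field names, and the `replay_one` callback, are hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Iterable, Sequence


def classify(messages: Iterable[dict[str, Any]]) -> Counter:
    """Count DLQ messages per (error_class, queue, day) bucket."""
    return Counter(
        (m["error_class"], m["queue"], m["failed_at"][:10])  # ISO date prefix = day
        for m in messages
    )


def replay_in_batches(
    messages: Sequence[dict[str, Any]],
    replay_one: Callable[[dict[str, Any]], bool],
    start_batch: int = 100,
    max_batch: int = 1000,
    min_success_rate: float = 0.99,
) -> int:
    """Replay with a ramping batch size and stop at the first unhealthy batch.

    Returns the number of messages attempted.
    """
    attempted, batch_size = 0, start_batch
    while attempted < len(messages):
        batch = messages[attempted:attempted + batch_size]
        successes = sum(1 for m in batch if replay_one(m))
        attempted += len(batch)
        if successes / len(batch) < min_success_rate:
            break  # failures resumed: stop and re-triage
        batch_size = min(batch_size * 2, max_batch)  # healthy batch: ramp up
    return attempted
```

`classify(...).most_common(5)` usually surfaces the handful of error classes that account for most of the volume, which is where triage should start.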
Real-World Example: The Pinterest engineering blog described a 2018 incident where their ads analytics DLQ accumulated tens of millions of messages during a multi-week downstream outage; the remediation involved building the “DLQ triage tool” that became a reusable platform capability. AWS’s SQS team publicly advises treating DLQs as first-class operational surfaces, with specific CloudWatch metrics for DLQ depth.
Senior Follow-up Questions:
Follow-up 1: “How do you decide between replay to original topic vs side-processing?”
Replay to the original topic preserves ordering and is simplest, but floods the consumer with old events mixed with new ones. Side-processing via a separate consumer group isolates the backlog from real-time traffic, at the cost of decoupling from the original ordering. For time-sensitive data, side-process. For ordering-sensitive data, replay carefully in small batches.
Follow-up 2: “What if a message in the DLQ has a schema version that no consumer supports anymore?”
Translate or drop. Build a translator that reads old-schema events and writes new-schema equivalents. Drop only with explicit business sign-off and a written record of what was dropped.
Follow-up 3: “How do you prevent this from happening again?”
Alerts on depth, age, and rate of new arrivals. Weekly review cadence during oncall rotation. A “DLQ budget” SLO like “less than 10 new DLQ messages per day” and if exceeded, a paging event.
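The "translate" option from Follow-up 2 is often built as a chain of single-step upgrades, so each old version only needs to know how to reach the next one. The sketch below invents a three-version schema history purely for illustration; the field names, the version numbers, and the `USD` default are all assumptions (and a default like that is exactly the kind of choice that needs business sign-off).

```python
from typing import Any, Callable

Upcaster = Callable[[dict[str, Any]], dict[str, Any]]

# Hypothetical schema history for an order event:
#   v1: {"schema_version": 1, "order_id": ..., "amount": 12.5}   (amount in dollars)
#   v2: replaces "amount" with integer "amount_cents"
#   v3: v2 plus a required "currency" field
CURRENT_VERSION = 3


def _v1_to_v2(event: dict[str, Any]) -> dict[str, Any]:
    out = {k: v for k, v in event.items() if k != "amount"}
    out["amount_cents"] = round(event["amount"] * 100)
    out["schema_version"] = 2
    return out


def _v2_to_v3(event: dict[str, Any]) -> dict[str, Any]:
    # Assumed default currency; confirm with the business before relying on it.
    return {**event, "currency": event.get("currency", "USD"), "schema_version": 3}


UPCASTERS: dict[int, Upcaster] = {1: _v1_to_v2, 2: _v2_to_v3}


def upcast(event: dict[str, Any]) -> dict[str, Any]:
    """Apply single-step upgrades until the event reaches CURRENT_VERSION."""
    event = dict(event)  # never mutate the caller's copy
    while event["schema_version"] < CURRENT_VERSION:
        step = UPCASTERS.get(event["schema_version"])
        if step is None:
            raise LookupError(f"no upcaster from version {event['schema_version']}")
        event = step(event)
    return event
```

Events that raise `LookupError` are the candidates for the "drop with sign-off" path, since no translation exists for them.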
Common Wrong Answers:
“Click ‘replay all’ and see what happens.” Fails because the same errors will recur, doubling the DLQ and likely disrupting real traffic.
“Delete the DLQ to clean up.” Fails because it destroys evidence of real business events that may still need processing.
Further Reading:
AWS SQS documentation on DLQ best practices.
Pinterest Engineering’s “Building a Real-Time User Action Counting System.”
Gregor Hohpe’s Enterprise Integration Patterns, “Dead Letter Channel” chapter.
'Your team is debating between RabbitMQ and Kafka for an order processing pipeline that handles 50,000 orders per day. Which do you choose and why?'
Strong Answer:
At 50,000 orders per day (roughly 0.6 messages per second on average, maybe 5-10 per second at peak), either technology can handle the throughput comfortably. The decision comes down to consumption patterns, not volume.
I would choose RabbitMQ if the primary need is task distribution: each order message should be processed by exactly one consumer, with complex routing rules (e.g., high-value orders go to a priority queue, international orders go to a different handler). RabbitMQ’s exchange-binding model gives you flexible routing out of the box. It also supports message TTL, dead letter queues, and per-message acknowledgment natively, which is exactly what order processing needs.
I would choose Kafka if we need event broadcasting (multiple consumers each reading the same order events independently), event replay (a new analytics service joins and needs to process historical orders), or if we expect growth to millions of events per day. Kafka’s log-based architecture means consumers maintain their own offsets and can replay from any point. This is essential if you are building an event-driven architecture where the same OrderPlaced event triggers payment, inventory, notification, and analytics independently.
For an order processing pipeline specifically, I usually recommend Kafka because orders naturally fan out to multiple downstream services, and the ability to replay events when you add a new consumer or need to reprocess after a bug fix is invaluable. The operational overhead of Kafka is higher (ZooKeeper/KRaft, partition management, consumer group rebalancing), but at the scale where you are processing real orders, you should have the infrastructure maturity to manage it.
Follow-up: “What happens if a Kafka consumer crashes halfway through processing an order? How do you prevent duplicate processing?”
This is the classic at-least-once delivery problem. When the consumer crashes before committing its offset, Kafka has no record that the message was partially processed, so the consumer that takes over the partition after the rebalance reads it and processes it again. You get duplicate processing unless your consumer is idempotent.
I enforce idempotency by storing a processed message ID (the topic, partition, and offset together, or a unique order ID) in the same database transaction as the business operation. Before processing, the consumer checks: “Have I already processed order-12345?” If yes, skip. If no, process and record. This pattern is called the “idempotent consumer” and it is non-negotiable for any message-driven system handling financial transactions.
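The idempotent consumer can be sketched with nothing more than a primary-key constraint and a single transaction. The example uses an in-memory SQLite database as a stand-in for the service's real database; the table names, the `handle_order` function, and the `topic-partition-offset` ID format are assumptions made for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a dedup table and a business table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE payments (order_id TEXT, amount_cents INTEGER)")
    conn.commit()
    return conn


def handle_order(
    conn: sqlite3.Connection,
    message_id: str,
    order_id: str,
    amount_cents: int,
) -> bool:
    """Apply the business write once per message_id. Returns False for duplicates."""
    try:
        with conn:  # one transaction: both inserts commit, or neither does
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "INSERT INTO payments (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key conflict: this message was already processed


conn = make_db()
first = handle_order(conn, "orders-0-42", "order-12345", 1999)
second = handle_order(conn, "orders-0-42", "order-12345", 1999)  # redelivery
```

Letting the database constraint reject the duplicate, rather than doing a separate "have I seen this?" read first, avoids a race between two consumers that receive the same message at nearly the same time.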
'Explain the difference between choreography and orchestration in a saga, and describe a scenario where choreography falls apart.'
Strong Answer:
In choreography, each service listens for events and decides independently what to do next. OrderService publishes OrderPlaced, PaymentService hears it and publishes PaymentCharged, InventoryService hears that and publishes InventoryReserved. There is no central coordinator: each service knows its own role and the events it cares about.
In orchestration, a central Saga Orchestrator tells each service what to do in sequence. The orchestrator sends “charge payment” to PaymentService, waits for the response, then sends “reserve inventory” to InventoryService, waits again, and so on. The workflow logic lives in one place.
Choreography falls apart in two specific scenarios. First: when you have more than 4-5 steps in the saga. At that point, the workflow becomes an implicit state machine spread across multiple services, and nobody can see the full picture. Debugging “why did this order get stuck?” requires tracing events across 6 services to reconstruct the sequence. I have seen teams spend hours debugging a choreographed saga where one service silently dropped an event due to a schema mismatch, and the order just sat in a “processing” state forever with no error logged.
Second: when compensation logic is complex. In choreography, if step 4 fails, you need to publish a compensating event that triggers step 3’s rollback, which triggers step 2’s rollback, and so on. Each service must know what events to listen for to trigger its compensation. This creates a reverse event chain that is difficult to reason about and easy to get wrong. I have seen cases where a payment refund was triggered twice because two services independently detected the failure and both published compensation events.
My rule of thumb: choreography for simple, 2-3 step flows where each service is genuinely independent. Orchestration for anything with complex business logic, conditional branching, or more than 4 steps. Most real-world order processing pipelines end up with orchestration because the business rules are inherently sequential and conditional.
Follow-up: “How do you handle the orchestrator itself becoming a single point of failure?”
The orchestrator must be stateful and persistent: it stores saga state in a database (PostgreSQL or a dedicated saga store). If the orchestrator crashes, it recovers by reading incomplete sagas from the database and resuming from their last known state. I run multiple orchestrator instances behind a load balancer, using database-level locking or leader election to ensure only one instance processes a given saga at a time. The key design principle: the orchestrator should be idempotent and recoverable, not stateless. Stateless orchestration is a contradiction, because someone has to remember where in the saga you are.
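A toy orchestrator makes the two properties described here concrete: progress is recorded after every step so a restart can resume, and a failure triggers compensations in reverse order from a single place. This is a sketch, not a production design. The `SagaOrchestrator` class is hypothetical, and the plain dict passed as `store` stands in for a durable saga table.

```python
from typing import Callable

# (name, action, compensation); each callable receives the shared saga context
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


class SagaOrchestrator:
    def __init__(self, steps: list[Step], store: dict) -> None:
        self.steps = steps
        self.store = store  # stand-in for a durable saga table

    def run(self, saga_id: str, ctx: dict) -> str:
        """Run or resume a saga; returns 'completed' or 'compensated'."""
        state = self.store.setdefault(saga_id, {"completed": 0, "status": "running"})
        if state["status"] != "running":
            return state["status"]  # already finished: safe to call again
        for index in range(state["completed"], len(self.steps)):
            _, action, _ = self.steps[index]
            try:
                action(ctx)
            except Exception:
                self._compensate(state, ctx)
                return state["status"]
            state["completed"] = index + 1  # record progress after each step
        state["status"] = "completed"
        return state["status"]

    def _compensate(self, state: dict, ctx: dict) -> None:
        # Undo only the steps that finished, most recent first.
        for _, _, compensate in reversed(self.steps[: state["completed"]]):
            compensate(ctx)
        state["status"] = "compensated"
```

A crash between running an action and recording its completion means that action runs again on resume, which is why each step still needs to be idempotent on the receiving service.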
'A consumer is falling behind on a Kafka topic. The lag is growing by 10,000 messages per hour. Walk me through your investigation and remediation.'
Strong Answer:
Growing consumer lag means the consumption rate is lower than the production rate. I would investigate in this order:
First, check consumer throughput metrics. Is the consumer processing slowly (high per-message latency) or is it processing at normal speed but the production rate spiked? These are fundamentally different problems. A Grafana dashboard showing consumer processing rate versus producer rate over the last 24 hours tells me which scenario I am in.
If production spiked, the answer is scaling consumers. I would add more consumer instances (up to the number of partitions, since Kafka caps a consumer group's parallelism at the partition count). If I have 10 partitions and 3 consumers, I can scale to 10 consumers for a roughly linear throughput improvement. If I already have as many consumers as partitions, I need to add partitions to the topic, but that is a heavier operation: it changes the key-to-partition mapping and usually calls for a maintenance window.
If the consumer is slow, I would profile the per-message processing time. Common culprits: synchronous database writes that block on I/O, calling external APIs inline during message processing, or inefficient deserialization. The fix depends on the root cause: batch database inserts instead of one-at-a-time, move external API calls to a separate async step, or switch to a faster serialization format.
A subtle issue I have encountered: consumer group rebalancing storms. When you add or remove consumers, Kafka triggers a rebalance that pauses all consumers in the group for seconds to minutes. If your consumer has a long processing time and you have max.poll.interval.ms set too low, Kafka thinks the consumer is dead, triggers a rebalance, which pauses the other consumers, which causes them to miss their poll interval, which triggers more rebalances. The fix is tuning max.poll.interval.ms to be longer than your longest expected processing time, and reducing max.poll.records so each poll returns fewer messages.
Follow-up: “What if the lag is caused by poison pill messages (messages that cause the consumer to crash every time)?”
This is exactly why dead letter queues (DLQs) exist. I configure the consumer to retry a failed message 3 times with backoff, and after the third failure, publish it to a DLQ topic and move on. The DLQ gets monitored with alerts so someone investigates the bad messages, but the main consumer is no longer stuck. Without a DLQ, one malformed message blocks the entire partition permanently.
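The retry-then-dead-letter behaviour from the follow-up can be sketched independently of any Kafka client by injecting the handler and the DLQ publisher as callables. The function name and parameters are hypothetical. It also only covers failures that surface as Python exceptions; a message that kills the process outright needs a persisted attempt counter instead.

```python
import time
from typing import Any, Callable


def process_with_dlq(
    message: Any,
    handler: Callable[[Any], None],
    publish_to_dlq: Callable[[Any, Exception], None],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try handler up to max_attempts times, then dead-letter the message.

    Returns True if handled, False if the message went to the DLQ. Either way
    the caller can commit the offset and move on to the next message.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                publish_to_dlq(message, exc)  # quarantine and unblock the partition
                return False
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff: 0.5s, 1s, ...
    raise AssertionError("unreachable")
```

The total backoff time counts against `max.poll.interval.ms`, so the retry budget per message should stay well below that setting to avoid triggering the rebalance storm described above.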