Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Real-Time Systems — Protocols, Architecture, and Production Reality

Real-time is not a feature. It is a contract between your system and your users: “When something changes, you will know immediately.” Breaking that contract — even for 500 milliseconds — is the difference between a collaborative tool and a frustrating one. This chapter covers how that contract is actually implemented, from the wire protocol to the global deployment.

Real-World Stories: Why This Matters

In 2015, Figma set out to build a collaborative design tool that ran entirely in the browser. The founding team — led by Dylan Field and Evan Wallace — understood that the hardest problem was not rendering vector graphics in WebGL. The hardest problem was making two designers editing the same file feel like they were sitting next to each other.Figma’s initial prototype used WebSockets to synchronize state between clients. Every cursor movement, every shape drag, every color change had to arrive at every other client’s screen within a window of about 50-100 milliseconds — the threshold below which humans perceive actions as “simultaneous.” Above that threshold, the experience feels laggy, like a bad video call. Below it, the experience feels magical, like a shared whiteboard.The engineering challenge was not just latency. It was conflict resolution at speed. When two designers simultaneously drag the same rectangle, whose drag wins? When one designer deletes a frame while another is adding elements to it, what happens? Figma built a custom operational transformation system (not off-the-shelf OT or CRDTs) optimized for their specific data model — a tree of design nodes. Every operation is transformed against concurrent operations before being applied, ensuring all clients converge to the same state regardless of the order operations arrive.By 2022, Figma supported hundreds of simultaneous editors in a single file. Adobe acquired (and later walked away from acquiring) the company for $20 billion — the largest private software acquisition ever attempted. The lesson: real-time collaboration done well is not an incremental improvement. It is a category-defining competitive moat. And it starts with choosing the right real-time protocol and building the right conflict resolution on top of it.
Discord launched in 2015 as a voice chat tool for gamers, built on a premise that existing voice solutions (Skype, TeamSpeak, Mumble) were either too heavy, too ugly, or required self-hosting. The core technology choice was WebRTC for voice and video, with WebSockets for text chat and control signaling.The decision to use WebRTC was not obvious at the time. WebRTC was young, poorly documented, and had rough edges across browsers. But it offered something no other approach could: peer-to-peer audio with sub-200ms latency without requiring users to install anything beyond a browser (or later, the Discord client, which embeds the WebRTC stack). For a gaming voice chat, this was non-negotiable — gamers need to hear “behind you!” before they are dead, not after.As Discord scaled, pure peer-to-peer broke down. In a group call with 10 people, each participant would need to send their audio stream to 9 others — that is 90 audio streams in a 10-person call. On a home internet connection, this saturates upload bandwidth quickly. Discord solved this with Selective Forwarding Units (SFUs) — servers that receive each participant’s single audio stream and forward it to all other participants. The client sends one stream up; the SFU sends N-1 streams down. This reduced upload bandwidth from O(N) to O(1) per client while keeping latency low because the SFU does not decode or re-encode the audio, just forwards packets.By 2023, Discord regularly handled over 10 million concurrent voice connections. Their infrastructure runs custom SFUs written in Rust (for performance-critical packet handling) with WebSocket-based signaling servers in Elixir (for high-concurrency connection management). The lesson: WebRTC gives you the building blocks, but at scale, you are building significant infrastructure around those building blocks. The protocol is the starting point, not the finish line.
Slack’s first version, launched in 2013, used long polling for real-time message delivery. Every Slack client maintained an open HTTP request to the server. When a new message arrived, the server responded to the pending request, the client displayed the message, and immediately opened a new long-polling request. Simple, reliable, and compatible with every corporate proxy and firewall.But long polling has a cost. Each pending request holds a server-side connection open, consuming memory and file descriptors. With hundreds of thousands of concurrent users, each with a long-polling connection, Slack’s server fleet was burning resources just to hold connections. Worse, every reconnect (after a message delivery or timeout) required a full HTTP round-trip, adding latency spikes.Slack migrated to WebSockets for their real-time messaging layer. This reduced per-connection overhead (WebSocket frames are much smaller than HTTP headers), eliminated reconnect latency for message delivery, and enabled bidirectional communication (typing indicators, presence updates). The WebSocket connections terminated at edge servers that maintained subscription state — which channels and DMs each connection cared about — and routed incoming messages only to relevant connections.The scaling challenge then became the fan-out problem. When a message is posted in a channel with 10,000 members, the system needs to identify which of those members are currently connected, find which edge server each connection lives on, and deliver the message to each one — all within 200 milliseconds. Slack built a pub/sub layer backed by a custom message broker that tracked channel memberships and connection locations. The lesson: the protocol upgrade from long polling to WebSockets was the easy part. The hard part was building the routing and fan-out infrastructure that makes real-time delivery work at scale.

1. Real-Time Communication Protocols

1.1 WebSocket — Full-Duplex Persistent Connection

Analogy: Think of a phone call. Once you dial and the other person picks up (the handshake), you have a two-way channel. Either side can talk at any time without redialing. A WebSocket connection works the same way — after the initial HTTP handshake upgrades the connection, both client and server can send messages at any time over the same persistent TCP connection. Compare this to traditional HTTP, which is more like sending letters: one request, one response, connection closed.
How it works: The client sends an HTTP request with an Upgrade: websocket header. The server responds with 101 Switching Protocols. From that point on, the TCP connection is a WebSocket connection — binary or text frames flow in both directions.
Client → Server: GET /chat HTTP/1.1
                 Upgrade: websocket
                 Connection: Upgrade
                 Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==

Server → Client: HTTP/1.1 101 Switching Protocols
                 Upgrade: websocket
                 Connection: Upgrade
                 Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

[TCP connection is now a WebSocket — bidirectional frames flow freely]
Key characteristics:
  • Full-duplex: Both sides send and receive simultaneously
  • Persistent: Connection stays open until explicitly closed
  • Low overhead: After handshake, frame headers are 2-14 bytes (vs HTTP headers of 200-800+ bytes per request)
  • Protocol: ws:// (unencrypted) and wss:// (TLS-encrypted, always use this in production)
  • Binary and text: Supports both UTF-8 text frames and binary frames

1.2 Server-Sent Events (SSE) — One-Way Server Push

SSE is a simpler protocol for cases where you only need the server to push data to the client. It uses a plain HTTP response with Content-Type: text/event-stream that the server keeps open, writing events as they occur.
Client → Server: GET /events HTTP/1.1
                 Accept: text/event-stream

Server → Client: HTTP/1.1 200 OK
                 Content-Type: text/event-stream

                 data: {"score": "3-1", "minute": 67}

                 event: goal
                 data: {"team": "home", "scorer": "Martinez"}

                 id: 1042
                 data: {"score": "4-1", "minute": 72}
Key characteristics:
  • One-way: Server to client only (client sends data via regular HTTP requests)
  • Auto-reconnect: The browser’s EventSource API automatically reconnects if the connection drops
  • Event IDs: The id: field enables resumption — on reconnect, the browser sends Last-Event-ID so the server can replay missed events
  • Text only: UTF-8 text (no binary), but you can Base64-encode or use JSON
  • HTTP/2 multiplexing: SSE connections can share an HTTP/2 connection, eliminating the browser’s 6-connection-per-domain limit
When SSE beats WebSocket: If your use case is server-to-client only (live scores, stock tickers, CI/CD build logs, notification feeds), SSE is simpler to implement, easier to debug (it is just HTTP), works with existing HTTP infrastructure (proxies, CDNs, load balancers), and gets auto-reconnect and event replay for free. Do not reach for WebSockets when SSE solves your problem.

1.3 Long Polling — The Reliable Fallback

Long polling is the simplest real-time pattern. The client sends an HTTP request, and the server holds it open until new data is available (or a timeout occurs). The server responds, the client immediately sends a new request, and the cycle repeats.
Client → Server: GET /messages?since=1042    [held open...]
                                              [30 seconds pass, no new data]
Server → Client: 204 No Content

Client → Server: GET /messages?since=1042    [held open...]
                                              [new message arrives after 3s]
Server → Client: 200 OK
                 [{"id": 1043, "text": "hello"}]

Client → Server: GET /messages?since=1043    [cycle continues]
When long polling is still appropriate:
  • Corporate environments where WebSocket connections are blocked by proxies or firewalls
  • Serverless architectures (AWS Lambda) where persistent connections are not possible
  • Low-frequency updates where the overhead of a persistent connection is not justified
  • As a fallback when WebSocket connections fail (Socket.IO does this automatically)
Long polling’s hidden cost: Each pending request consumes a server thread or connection slot. At 50,000 concurrent users, you have 50,000 open HTTP connections. This is the same connection cost as WebSockets but with additional overhead: every response triggers a new HTTP request with full headers, increasing bandwidth and latency. Long polling is not “cheaper” than WebSockets — it is just more compatible with legacy infrastructure.

1.4 WebRTC — Peer-to-Peer Real-Time Communication

WebRTC (Web Real-Time Communication) enables direct peer-to-peer audio, video, and data transfer between browsers without plugins. It is the protocol behind Google Meet, Discord voice, and Zoom’s browser client. The connection process (simplified):
  1. Signaling: Peers exchange session descriptions (SDP — Session Description Protocol) through a signaling server (usually WebSocket). SDP describes codecs, network candidates, and encryption keys.
  2. ICE Gathering: Each peer discovers its network addresses using STUN servers (Simple Traversal of UDP through NAT). STUN tells you your public IP and port.
  3. Connectivity Check: Peers try to connect directly using discovered addresses. If direct connection fails (symmetric NAT, corporate firewall), traffic is relayed through a TURN server (Traversal Using Relays around NAT).
  4. DTLS Handshake: Once connected, peers establish encrypted channels using DTLS (Datagram TLS).
  5. Media/Data Flow: Audio and video flow over SRTP (Secure RTP). Arbitrary data flows over SCTP data channels.
Peer A                    Signaling Server                    Peer B
  |--- SDP Offer ----------->|                                  |
  |                           |--- SDP Offer ------------------>|
  |                           |<-- SDP Answer ------------------|
  |<-- SDP Answer ------------|                                  |
  |                                                              |
  |--- ICE Candidates ------->|--- ICE Candidates ------------->|
  |<-- ICE Candidates --------|<-- ICE Candidates --------------|
  |                                                              |
  |<=============== Direct P2P Connection (DTLS/SRTP) =========>|
STUN vs TURN in practice: STUN is lightweight and usually free — it just tells a peer its public IP. About 80-85% of WebRTC connections succeed with STUN alone. TURN is a relay server that forwards all media traffic — it is expensive (bandwidth costs) and adds latency, but it is necessary for the 15-20% of connections that cannot traverse NAT directly. You must budget for TURN servers. Skipping TURN means 1 in 5 users cannot connect.

TURN Server Cost: Self-Hosted vs Managed Services

TURN servers are the most underestimated cost in WebRTC deployments. Because they relay all media traffic for connections that cannot go peer-to-peer, bandwidth costs scale linearly with usage. A senior engineer needs to understand this cost structure before committing to a WebRTC architecture. The math that matters: A typical 1:1 video call at 720p consumes approximately 1.5-2 Mbps per direction. If 15% of your calls require TURN relay, and you have 10,000 concurrent calls, that is 1,500 relayed calls consuming ~6 Tbps of bandwidth per hour through your TURN infrastructure. At cloud bandwidth prices ($0.05-0.09/GB), that adds up fast.
ApproachCost ModelLatencyOperational BurdenBest For
Self-hosted (coturn)Server cost + bandwidth. A single c5.xlarge on AWS (125/mo)handles 500concurrentrelayedstreams.Bandwidthistherealcost:125/mo) handles ~500 concurrent relayed streams. Bandwidth is the real cost: 0.05-0.09/GB on AWS, cheaper on providers like Hetzner ($0.01/GB) or with committed use discounts.Lowest (you control placement)High: you manage deployment, scaling, monitoring, geographic distribution, TLS certificatesTeams with infra expertise, cost-sensitive at scale, strict data residency requirements
Twilio TURN (Network Traversal Service)~$0.40/GB of relayed data. No server management.Good (global PoPs)Low: fully managedStartups, low-to-medium volume, fast time-to-market
LiveKit (open-source SFU + TURN)Self-hosted: server cost only. LiveKit Cloud: usage-based pricing starting ~$0.006/participant-minute for video. Includes TURN, SFU, recording, and room management.Excellent (built-in SFU reduces TURN need)Medium (self-hosted) to Low (cloud)Teams building conferencing features, need SFU anyway
Cloudflare TURNIncluded in Workers/Calls pricing. Relatively new but leverages Cloudflare’s global network.Excellent (edge network)LowTeams already on Cloudflare’s platform
XirsysPay-per-use, ~$0.16/GB. Global infrastructure with dedicated TURN servers.GoodLowMid-scale deployments wanting a TURN specialist
Cost optimization strategies:
  • ICE candidate ordering: Configure ICE to always try direct (host) and STUN (server-reflexive) candidates before TURN (relay). This is the default, but verify your configuration is not inadvertently preferring relay candidates.
  • TURN-only-when-needed: Use TURN as a fallback, never as default. Some misconfigured clients always use TURN, turning a 15% cost into a 100% cost.
  • Bandwidth-adaptive streams: When on TURN, automatically downgrade video quality (720p to 480p or audio-only) to reduce relay bandwidth.
  • Regional TURN placement: Place TURN servers in regions where your users are. A TURN server in US-East relaying between two users in Tokyo adds unnecessary latency and cross-region bandwidth cost.
  • Connection quality monitoring: Track what percentage of your connections use TURN. If it is significantly above 20%, investigate whether your STUN configuration is correct or if a specific network environment (corporate VPN, specific ISP) is forcing unnecessary relay.
A senior engineer would say: “At our current scale of 500 concurrent video calls, Twilio TURN costs us about 800/month.Thatisabargainforzerooperationaloverhead.Butourgrowthprojectionsshow10,000concurrentcallsin18monthsatthatpoint,Twiliowouldcostus800/month. That is a bargain for zero operational overhead. But our growth projections show 10,000 concurrent calls in 18 months -- at that point, Twilio would cost us 16,000/month and self-hosted coturn on Hetzner would cost about $2,000/month including bandwidth. We should plan the migration to self-hosted when we cross the 3,000 concurrent call mark, which gives us enough volume to justify the ops investment.”

1.5 Protocol Comparison

AspectWebSocketSSELong PollingWebRTC
DirectionBidirectionalServer → ClientServer → Client (with request per update)Bidirectional (peer-to-peer)
Underlying ProtocolTCP (upgraded from HTTP)HTTPHTTPUDP (SRTP/SCTP)
LatencyVery low (~1-5ms over LAN)Low (~5-50ms)Medium (~50-500ms per cycle)Lowest (~10-50ms P2P)
Auto-ReconnectManual (you implement it)Built-in (EventSource)ManualManual (ICE restart)
Binary DataYesNo (text only)Yes (via HTTP body)Yes (data channels)
Browser SupportUniversal (IE10+)Universal (no IE)UniversalUniversal (modern browsers)
Scaling DifficultyMedium-HardEasy (stateless HTTP)Easy (stateless HTTP)Hard (TURN servers, SFUs)
Proxy/Firewall FriendlySometimes blockedYes (plain HTTP)Yes (plain HTTP)Often blocked (UDP)
Max Connections per Browser~255 per domain6 per domain (HTTP/1.1), unlimited on HTTP/26 per domain (HTTP/1.1)Limited by CPU/bandwidth
Best ForChat, collaboration, gamingDashboards, feeds, notificationsLegacy fallback, serverlessVoice, video, screen sharing

1.6 Decision Guide: Which Protocol for Which Use Case

Use WebSocket when:
  • You need bidirectional communication (chat, collaborative editing, multiplayer games)
  • Low latency matters and you have control over your infrastructure
  • You need to push binary data (game state, sensor data)
Use SSE when:
  • Data flows in one direction: server to client
  • You want built-in reconnection and event replay
  • You need to work through corporate proxies and CDNs without special configuration
  • Examples: live dashboards, notification feeds, CI/CD build logs, stock tickers
Use Long Polling when:
  • WebSocket and SSE are blocked by infrastructure
  • You are in a serverless environment (Lambda, Cloud Functions)
  • Update frequency is low (every few seconds or minutes)
  • You need maximum compatibility with zero infrastructure changes
Use WebRTC when:
  • You need peer-to-peer audio, video, or screen sharing
  • Ultra-low latency is critical (sub-100ms)
  • You want to minimize server bandwidth costs (peers exchange data directly)
  • You need data channels for high-performance P2P data transfer (gaming, file sharing)

2. WebSocket Architecture at Scale

2.1 Connection Lifecycle

A WebSocket connection goes through distinct phases, and understanding each one is critical for building reliable systems.
1. HTTP Handshake (Upgrade request)

2. Connection Open (onopen fires)

3. Message Exchange (frames: text, binary, ping, pong)

4. Heartbeat (ping/pong every 30-60s to detect dead connections)

5. Close (clean: close frame exchange; unclean: TCP reset/timeout)
Ping/pong heartbeats are essential. Without them, a connection can appear open on the server side even after the client has disconnected (mobile user enters a tunnel, laptop closes its lid). The server sends a ping frame every 30-60 seconds. If the client does not respond with a pong within a timeout (e.g., 10 seconds), the server closes the connection and cleans up resources. Without this, you accumulate “ghost connections” that consume memory and file descriptors.

2.2 Connection Limits and Capacity

A single server can handle roughly 50K-100K concurrent WebSocket connections. This is not a WebSocket limitation — it is a function of operating system resources. Each connection consumes a file descriptor and approximately 20-50KB of memory (TCP buffers, application-level state, SSL state for wss://). At 50K connections with 40KB each, that is 2GB of RAM just for connection state — before any message processing.The practical limits:
  • File descriptors: Linux defaults to 1024 per process; you must raise ulimit -n to 100K+ for WebSocket servers
  • Memory: Budget 20-50KB per idle connection, more for connections with large buffers or application state
  • CPU: Message routing and serialization; a single core can typically handle 10K-50K messages/second depending on message size and processing complexity
  • Ephemeral ports: For outbound connections (proxies, load balancers), you are limited by the 16-bit port range (~64K ports)
Cross-chapter connection: OS Fundamentals. The connection limits above are fundamentally OS-level constraints. File descriptors, epoll for efficient I/O multiplexing, socket buffer tuning (SO_RCVBUF/SO_SNDBUF), and ulimit configuration are all covered in depth in OS Fundamentals. Understanding why Linux’s epoll can handle 100K concurrent sockets while select/poll choke at a few thousand is essential for anyone tuning WebSocket servers. Each WebSocket connection is, at the kernel level, just a file descriptor backed by a TCP socket — the same abstraction that applies to files, pipes, and network connections.

2.3 Scaling Horizontally

A single server caps at roughly 100K connections. For 1 million concurrent users, you need at least 10-20 WebSocket servers. The problem: when User A connects to Server 1 and sends a message to User B on Server 3, how does the message get there? Approach 1: Sticky Sessions Route each user to a specific server (by user ID hash, IP hash, or cookie). All messages for a conversation go through the same server. Simple, but fragile — if a server dies, all its connections are lost and must reconnect to a different server. And sticky sessions break down for group chats where participants span multiple servers. Approach 2: Pub/Sub Backbone (the standard approach) Every WebSocket server subscribes to a message broker. When Server 1 receives a message for a channel, it publishes to the broker. All servers subscribed to that channel receive the message and forward it to their local connections.
                    ┌──────────────┐
                    │  Redis Pub/Sub│
                    │  or Kafka     │
                    └──┬───┬───┬───┘
                       │   │   │
              ┌────────┘   │   └────────┐
              ▼            ▼            ▼
        ┌─────────┐  ┌─────────┐  ┌─────────┐
        │  WS      │  │  WS      │  │  WS      │
        │  Server 1│  │  Server 2│  │  Server 3│
        └─────────┘  └─────────┘  └─────────┘
          │  │  │      │  │  │      │  │  │
         Users A-D   Users E-H   Users I-L
Redis Pub/Sub is the most common choice for moderate scale (up to hundreds of thousands of connections). It is simple, fast, and well-supported by libraries like Socket.IO’s Redis adapter. Limitation: Redis Pub/Sub is fire-and-forget — if a WebSocket server is temporarily disconnected from Redis, it misses messages. No durability, no replay. Kafka or NATS for higher scale or when you need message durability. Kafka provides ordered, durable message streams that consumers can replay from any offset. This is valuable for systems that need to recover connection state after a server restart.
Cross-chapter connection: Messaging. The pub/sub backbone for WebSocket fan-out is the same messaging pattern discussed in the Messaging, Concurrency & State chapter. Redis Pub/Sub gives you at-most-once delivery (fine for ephemeral typing indicators). Kafka gives you at-least-once with replay (necessary for chat message delivery guarantees). The choice depends on whether missing a message is annoying or unacceptable.

2.4 Connection State Management

You need to track which users are connected and which server they are on. This mapping — user_id → {server_id, connection_id, subscribed_channels} — is called the connection registry. Options:
  • In-memory on each server: Fast, but lost on restart. Other servers cannot look up who is connected where.
  • Redis hash: HSET connections:user123 server ws-server-7 channels ["general","random"]. Fast lookups, shared across servers, but adds a Redis dependency to every connect/disconnect event.
  • Distributed hash table (e.g., Hazelcast, etcd): For very large deployments where a single Redis instance becomes a bottleneck.
In practice, Redis is the standard choice for connection registries up to several million connections.

2.5 Load Balancing WebSockets

WebSocket connections start as HTTP and then upgrade. This means your load balancer must support the HTTP Upgrade mechanism. Not all do.
Load BalancerWebSocket SupportNotes
NginxYes (since 1.3)Requires proxy_set_header Upgrade and proxy_set_header Connection "upgrade"
AWS ALBYesNative WebSocket support, sticky sessions via cookies
AWS NLBYesL4 load balancing, no HTTP-level inspection, lower latency
HAProxyYesExcellent WebSocket support, fine-grained connection management
CloudflareYesWebSocket proxying included, but adds ~10-20ms latency
AWS API GatewayYes (WebSocket APIs)Managed, but limited to 500 concurrent connections per route by default
Common gotcha: Connection timeouts. Many load balancers and proxies have idle connection timeouts (default 60 seconds on ALB). A WebSocket with no traffic for 60 seconds gets killed. Solution: implement ping/pong heartbeats at an interval shorter than the idle timeout (e.g., every 30 seconds).
Cross-chapter connection: API Gateways. WebSocket traffic at the gateway level introduces unique challenges compared to HTTP. API gateways like Kong, AWS API Gateway (WebSocket APIs), and Envoy must handle the HTTP Upgrade handshake, maintain long-lived connections (unlike stateless HTTP routing), and route based on message content rather than URL paths. The API Gateways & Service Mesh chapter covers how gateways handle protocol-level concerns like rate limiting per-connection, authentication on handshake, and routing WebSocket connections through a service mesh. If you are deploying WebSocket servers behind an API gateway, understand the gateway’s connection timeout, max-connection limits, and whether it supports WebSocket-aware health checks.
Cross-chapter connection: Cloud Service Patterns. If you are building on AWS, two managed services can simplify WebSocket infrastructure significantly. AWS API Gateway WebSocket APIs provide a fully managed WebSocket endpoint with route-based message dispatch — the gateway invokes Lambda functions for $connect, $disconnect, and custom routes, eliminating the need to manage WebSocket servers entirely. The trade-off is a 500 concurrent connection default limit per route (raisable to 10K) and higher per-message cost at scale. AWS AppSync offers managed real-time subscriptions over WebSocket for GraphQL APIs, handling connection management, fan-out, and authorization automatically. Both are covered in Cloud Service Patterns. The senior decision: managed WebSocket services make sense below ~50K concurrent connections where operational simplicity outweighs the per-message cost premium. Above that, self-managed WebSocket servers on EC2/ECS are typically more cost-effective.

2.6 Reconnection Strategies

Connections will drop. Mobile users enter elevators. Laptops sleep. Servers deploy. Your reconnection strategy determines whether users notice. Exponential backoff with jitter:
function reconnect(attempt) {
  // Base delay: 1s, 2s, 4s, 8s... capped at 30s
  const baseDelay = Math.min(1000 * Math.pow(2, attempt), 30000);

  // Add random jitter (0-100% of base delay) to prevent thundering herd
  const jitter = Math.random() * baseDelay;
  const delay = baseDelay + jitter;

  setTimeout(() => {
    const ws = new WebSocket('wss://example.com/ws');
    ws.onopen = () => {
      attempt = 0; // Reset on successful connection
      // Send last-received message ID to recover missed messages
      ws.send(JSON.stringify({ type: 'resume', lastMessageId: lastId }));
    };
    ws.onclose = () => reconnect(attempt + 1);
  }, delay);
}
Why jitter matters: When a WebSocket server restarts, all 50,000 connections drop simultaneously. Without jitter, all 50,000 clients reconnect at the same time, overwhelming the new server. With jitter, reconnections spread over a window, allowing the server to absorb load gradually. Connection state recovery: On reconnect, the client sends its last-received message ID. The server replays any messages the client missed during the disconnection. This requires the server (or a backing store like Redis Streams or Kafka) to retain recent messages for replay.

2.7 Connection Recovery Patterns

Reconnection is only half the problem. The harder half is recovering state so the user does not notice the interruption. A robust connection recovery system has three layers: reconnection mechanics (covered above), message gap detection, and client-side buffering during disconnects. Message gap detection and sync: When a client reconnects, it needs to answer: “What did I miss?” There are three common patterns: Pattern 1: Sequence-based resumption. The server assigns each message a monotonically increasing sequence number per channel or conversation. On reconnect, the client sends {type: "resume", channel: "general", lastSeq: 4271}. The server replays all messages with sequence > 4271 from a message buffer (Redis Streams, Kafka consumer offset, or an in-memory ring buffer with a configurable retention window of 5-30 minutes). Pattern 2: Timestamp-based delta sync. The client sends its last-received timestamp. The server returns all messages after that timestamp. This is simpler to implement but vulnerable to clock skew between server instances. Mitigate by using server-assigned timestamps only, never client timestamps. Pattern 3: Snapshot + delta. For long disconnections (> 30 minutes), replaying individual messages is too expensive. Instead, the server sends a full state snapshot (current channel list, unread counts, last N messages per active conversation) plus a delta of events since the snapshot was generated. This is the approach Slack uses for reconnection after extended offline periods.
// Client-side connection recovery with gap detection
class RealtimeClient {
  constructor(url) {
    this.url = url;
    this.lastSeqByChannel = {};  // Track last received sequence per channel
    this.pendingOutbound = [];    // Buffer outbound messages during disconnect
    this.isConnected = false;
  }

  onReconnect(ws) {
    this.isConnected = true;

    // Step 1: Request missed messages for all active channels
    for (const [channel, lastSeq] of Object.entries(this.lastSeqByChannel)) {
      ws.send(JSON.stringify({
        type: 'resume',
        channel,
        lastSeq
      }));
    }

    // Step 2: Flush buffered outbound messages
    while (this.pendingOutbound.length > 0) {
      const msg = this.pendingOutbound.shift();
      ws.send(JSON.stringify(msg));
    }
  }

  send(message) {
    if (this.isConnected) {
      this.ws.send(JSON.stringify(message));
    } else {
      // Buffer during disconnect -- cap at 100 messages to prevent memory issues
      if (this.pendingOutbound.length < 100) {
        this.pendingOutbound.push(message);
      }
    }
  }

  onMessage(msg) {
    // Track sequence for gap detection
    if (msg.seq && msg.channel) {
      const expectedSeq = (this.lastSeqByChannel[msg.channel] || 0) + 1;
      if (msg.seq > expectedSeq) {
        // Gap detected -- request missing messages
        this.requestGapFill(msg.channel, expectedSeq, msg.seq - 1);
      }
      this.lastSeqByChannel[msg.channel] = msg.seq;
    }
  }
}
Client-side buffering during disconnects: While the connection is down, the client should not silently drop user actions. Key strategies:
  • Outbound message queue: Buffer messages the user sends while disconnected (chat messages, edits, reactions). On reconnect, flush the queue with idempotency keys so the server can deduplicate. Cap the buffer size (100-500 messages) to prevent memory exhaustion on mobile devices.
  • Optimistic UI with reconciliation: Show sent messages immediately in the UI with a “pending” indicator (gray checkmark, spinner). On reconnect, confirm delivery and update the indicator, or show a “failed to send” state if the server rejects after replay.
  • Local state snapshot: Before transitioning to “disconnected” state, snapshot the current application state (open channels, scroll positions, draft messages). On reconnect, restore this state so the user returns to exactly where they were.
The gap-fill thundering herd. When a server restarts and 50,000 clients reconnect simultaneously, each requesting gap-fill, you can overwhelm your message store. Mitigate with: (1) jittered reconnection (as discussed above), (2) a dedicated gap-fill service that can handle burst reads, separate from the main message write path, and (3) rate limiting gap-fill requests per client (max one gap-fill request per channel per reconnect).

2.8 Graceful Shutdown During Deploys

When deploying a new version of your WebSocket server, you cannot just kill the process — that drops all connections instantly. Instead:
  1. Stop accepting new connections on the old server instance (remove from load balancer)
  2. Send a “reconnect” control message to all connected clients: {"type": "reconnect", "reason": "server_maintenance"}
  3. Wait for clients to disconnect (with a timeout, e.g., 30 seconds)
  4. Force-close remaining connections after the timeout
  5. New server instance is already accepting connections — clients reconnect to it
This “drain” pattern ensures zero message loss if clients implement reconnection with state recovery.

3. Building Chat Systems

3.1 Architecture Overview

A production chat system has three layers:
┌─────────────────────────────────────────────────┐
│           Client (Browser / Mobile App)          │
│  - WebSocket connection to edge server           │
│  - Local message cache for offline access        │
│  - Optimistic UI (show message before server ack)│
└──────────────────────┬──────────────────────────┘
                       │ WebSocket (wss://)
┌──────────────────────▼──────────────────────────┐
│           Real-Time Layer                        │
│  - WebSocket servers (connection termination)    │
│  - Pub/sub backbone (Redis, Kafka, NATS)         │
│  - Presence service (who is online)              │
│  - Typing indicator service (fire-and-forget)    │
└──────────────────────┬──────────────────────────┘

┌──────────────────────▼──────────────────────────┐
│           Persistence Layer                      │
│  - Message storage (Cassandra, ScyllaDB, DynamoDB│
│  - Message search (Elasticsearch)                │
│  - User/channel metadata (PostgreSQL)            │
│  - Media storage (S3 + CDN)                      │
└─────────────────────────────────────────────────┘

3.2 Message Ordering

Message ordering is the single hardest problem in chat systems. It seems trivial (“just use timestamps”) until you have servers in multiple regions with clock skew, users on connections with different latencies, and messages that arrive at the server out of the order they were sent.
The standard solution: Snowflake IDs Inspired by Twitter’s Snowflake, assign each message a 64-bit ID that encodes:
| 41 bits: timestamp (ms since epoch) | 10 bits: worker ID | 13 bits: sequence |
|  ~69 years of milliseconds          | 1024 workers       | 8192 per ms/worker|
  • Timestamp bits provide rough ordering without centralized coordination
  • Worker ID ensures uniqueness across servers
  • Sequence number handles multiple messages within the same millisecond on the same worker
Messages are ordered by Snowflake ID, not by client-perceived timestamp. If two messages have near-identical timestamps, the deterministic tiebreaker (lower worker ID, then lower sequence) ensures every client converges to the same order. Per-conversation sequence numbers provide an additional ordering guarantee within a single conversation. Each conversation maintains a monotonically increasing counter. When a message is persisted, it receives the next sequence number for that conversation. Clients can detect gaps (sequence 41, 42, 44 — where is 43?) and request missing messages.

3.3 Delivery Guarantees

Chat messages use at-least-once delivery with client-side deduplication:
  1. Client sends message with a client-generated UUID (client_msg_id)
  2. Server persists message, assigns Snowflake ID, returns ACK with the server ID
  3. If client does not receive ACK within timeout, it resends with the same client_msg_id
  4. Server checks client_msg_id for deduplication before persisting
  5. Other clients receiving the message use the Snowflake ID for dedup on the display side
Why not exactly-once? As discussed in the Messaging chapter, exactly-once delivery is provably impossible (the Two Generals Problem). Chat systems implement exactly-once processing — the network may deliver a message twice, but the deduplication layer ensures it appears once in the conversation.

3.4 Read Receipts and Typing Indicators

These two features have radically different reliability requirements: Typing indicators — fire-and-forget over WebSocket. No persistence, no delivery guarantee. If a typing indicator is lost, the worst case is a missing ”…” animation for 2 seconds. Send a typing_start event when the user begins typing, and let it auto-expire after 3-5 seconds client-side (or send typing_stop on submit/pause). Read receipts — persistent. When User A reads a message from User B, the read receipt must be durably stored and eventually delivered to User B, even if User B is offline. Store the latest read position per user per conversation (user_id, conversation_id, last_read_message_id). Batch read receipt updates — do not send a receipt for every single message scrolled past. Send one update when the user pauses scrolling or switches away.

3.5 Group Chat Fan-Out

The fan-out strategy depends on group size: Small groups (< 500 members): Broadcast. When a message arrives, look up all members, find their WebSocket connections, and push the message to each one. This is O(N) per message but acceptable for small N. Discord uses this for most servers. Large channels (500-100,000+ members): Pull on demand. Do not push every message to every connection. Instead, push a lightweight notification (“new message in #announcements”) and let the client pull the full message when the user navigates to the channel. This dramatically reduces fan-out for channels where most members are not actively viewing. Hybrid (what Slack does): Push to all members who have the channel in their active view. For members who have the channel in their sidebar but are not viewing it, push only an unread count badge update. For members who have the channel muted, do nothing until they open it.

3.6 Offline Messages and Push Notifications

When a user is offline (no active WebSocket connection), messages must be queued and delivered later:
  1. Message arrives at the server for a conversation
  2. Check connection registry — is the recipient connected?
  3. If yes: deliver via WebSocket
  4. If no: store in the “undelivered messages” queue for that user, and optionally trigger a push notification (APNs for iOS, FCM for Android)
  5. On reconnect: client sends lastMessageId, server replays all messages after that ID
Push notification dedup and batching. If a user receives 47 messages while offline, do not send 47 push notifications. Batch them: “You have 47 new messages.” Better yet, group by conversation: “3 messages from Alice, 12 in #general, 32 in #random.” Apple and Google rate-limit push notifications per device, so batching is not optional at scale — it is required.

3.7 End-to-End Encryption (E2EE) Basics

End-to-end encryption means the server cannot read message content. Only the sender and recipient(s) have the decryption keys. The server transports ciphertext it cannot decrypt. The Signal Protocol (used by Signal, WhatsApp, and Facebook Messenger):
  • Each user generates a long-term identity key pair and ephemeral prekeys
  • A “double ratchet” algorithm derives new encryption keys for every message, providing forward secrecy (compromising a key does not decrypt past messages) and future secrecy (compromising a key does not decrypt future messages)
  • For group chats, the sender encrypts the message key once per group member using each member’s public key (Sender Keys optimization reduces this cost)
The trade-off with E2EE: The server cannot index, search, moderate, or analyze message content. Features like server-side search, link previews generated server-side, and content moderation require different approaches (client-side search indexes, client-side link preview generation).

4. Live Feeds and Notifications

4.1 Presence Systems: Who Is Online?

Presence (online/offline/away/DND) seems simple until you consider that a single user might have 5 connected devices, presence changes thousands of times per second across your user base, and every change needs to be broadcast to that user’s contacts. Architecture:
  1. Each WebSocket connection sends heartbeats to the presence service
  2. If no heartbeat is received within the timeout (e.g., 30 seconds), the user transitions to “offline”
  3. The presence service maintains a per-user state machine: online → idle (5 min no activity) → offline (connection lost)
  4. Presence changes are published via pub/sub to all users who have subscribed to that user’s presence (typically friends, channel co-members)
Scaling presence:
  • Do not broadcast globally. A user’s presence change only matters to their contacts, not all users.
  • Batch presence updates. Instead of sending a separate message for each online/offline transition, batch them into periodic snapshots (e.g., every 5 seconds: “here are all the presence changes in your contact list since the last batch”).
  • Stale presence is acceptable. Showing a user as “online” for 30 seconds after they disconnect is fine. Showing them as “offline” when they are actually typing is not. Bias toward showing “online” longer.
How Discord handles presence at scale: Discord tracks presence per-session (a user on desktop and mobile has two sessions). Presence updates are sharded by user ID across multiple presence service instances. Each instance handles a subset of users. When you open a channel, your client subscribes to presence for the members of that channel, not all users globally. This subscription-based model limits fan-out to what the client actually needs to display.

4.2 Notification Delivery: In-App, Push, Email

A robust notification system delivers through the appropriate channel based on user state and urgency:
User StateIn-App (WebSocket)Push (Mobile)Email
Active in appYes (immediate)No (they’re already here)No
App in backgroundNoYes (within seconds)No
Offline for < 1 hourNoYesNo
Offline for > 1 hourQueue for next sessionBatched summaryYes (digest)
Deduplication across channels: If you deliver a message via WebSocket and the user reads it, cancel the pending push notification. This requires the notification service to check read state before dispatching push/email. A 5-10 second delay before sending push allows the in-app delivery to succeed first.

4.3 SSE for Dashboards and Monitoring

For read-only live data — dashboards, monitoring panels, log tails — SSE is the right choice over WebSocket.
// Client
const events = new EventSource('/api/metrics/stream');

events.addEventListener('cpu', (e) => {
  updateCPUChart(JSON.parse(e.data));
});

events.addEventListener('memory', (e) => {
  updateMemoryChart(JSON.parse(e.data));
});

events.onerror = () => {
  // EventSource auto-reconnects with Last-Event-ID
  console.log('Connection lost, auto-reconnecting...');
};
# Server (Python/FastAPI)
async def metrics_stream(request: Request):
    async def event_generator():
        last_id = request.headers.get('Last-Event-ID', '0')
        while True:
            metrics = await get_latest_metrics(since=last_id)
            for metric in metrics:
                yield f"id: {metric.id}\nevent: {metric.type}\ndata: {json.dumps(metric.data)}\n\n"
                last_id = metric.id
            await asyncio.sleep(1)
    return StreamingResponse(event_generator(), media_type="text/event-stream")
Why SSE here: Auto-reconnect with Last-Event-ID means the dashboard recovers gracefully from network blips. No custom reconnection logic needed. HTTP infrastructure (CDNs, proxies, load balancers) handles SSE connections without special configuration.

5. Collaborative Editing

5.1 Operational Transformation (OT) — How Google Docs Works

OT is the algorithm family that powers Google Docs, Google Sheets, and many collaborative editors. The core idea: when two users make concurrent edits, transform one operation against the other so that both can be applied and the documents converge to the same state. Example:
Document state: "ABCD"

User 1: Insert 'X' at position 1 → "AXBCD"
User 2: Delete character at position 2 → "ABD"

These operations were made concurrently (neither user saw the other's edit).

If we apply User 1's insert first:
  "ABCD" → Insert X at 1 → "AXBCD" → Delete at position 2... but position 2 is now 'B' not 'C'!

OT transforms User 2's operation against User 1's:
  User 1 inserted before position 2, so User 2's position shifts: delete at position 3 instead.
  "AXBCD" → Delete at 3 → "AXBD" ✓

Conversely, if we apply User 2's delete first:
  "ABCD" → Delete at 2 → "ABD" → Insert X at 1 → "AXBD" ✓

Both orderings produce the same result. That is the OT guarantee.
OT’s challenge: The transformation functions become exponentially complex as you add more operation types. Google Docs’ OT implementation handles text insertion, deletion, formatting, cursor movement, table operations, image insertion, and more. Each pair of operation types needs a transformation function. With N operation types, you need O(N^2) transformation functions, and each must be provably correct.

5.2 CRDTs — Conflict-Free Replicated Data Types

Cross-chapter connection: Distributed Systems Theory. CRDTs are a specific solution to the broader problem of replicated state convergence covered in depth in Distributed Systems Theory. That chapter covers the mathematical foundations — commutativity, associativity, and idempotence — that make CRDTs work. If you want to understand why CRDTs guarantee convergence without coordination, read the CRDT section there. This chapter focuses on how they apply to collaborative editing specifically.
CRDTs take a different approach: design the data structure itself so that concurrent operations always merge without conflicts. There is no transformation step — operations are applied directly, and the data structure’s mathematical properties guarantee convergence. Popular CRDT libraries:
  • Yjs — The most widely used CRDT library for collaborative editing. Used by Notion, JupyterLab, and others.
  • Automerge — Designed by Martin Kleppmann (author of “Designing Data-Intensive Applications”). More academically rigorous, with a focus on local-first software.
How a text CRDT works (simplified): Instead of position-based operations (“insert at position 5”), CRDTs assign each character a unique, immutable ID and maintain a partial order:
Character 'H' (id: A1) → 'e' (id: A2) → 'l' (id: A3) → 'l' (id: A4) → 'o' (id: A5)

User A inserts 'X' after A2: 'H' → 'e' → 'X' (id: A6) → 'l' → 'l' → 'o'
User B inserts 'Y' after A2: 'H' → 'e' → 'Y' (id: B1) → 'l' → 'l' → 'o'

Merge: Both insertions are after A2. A deterministic tiebreaker (e.g., compare IDs)
  decides the order: 'H' → 'e' → 'X' → 'Y' → 'l' → 'l' → 'o' (or Y before X)
  Both users converge to the same result without transformation.

5.3 OT vs CRDT: Trade-Offs

AspectOperational Transformation (OT)CRDTs
Central server requiredYes (server orders operations)No (can work peer-to-peer)
Latency modelClient → server → other clientsClient → client (or via relay)
ComplexityO(N^2) transformation functionsComplex data structures, metadata overhead
Memory overheadLow (only current document state)Higher (tombstones for deleted characters, unique IDs per character)
Offline supportPoor (needs server to resolve conflicts)Excellent (merge on reconnect)
Proven at scaleGoogle Docs (billions of users)Newer but growing (Notion, Figma)
Undo/redoStraightforward (inverse operations)Hard (requires causal history tracking)
My take: If you are building a server-mediated collaborative tool (like Google Docs) and you control the infrastructure, OT is battle-tested and well-understood. If you are building a local-first application where users may be offline and need to sync later, CRDTs are the right choice. Figma uses a custom approach that borrows from both — server-ordered operations with CRDT-like convergence properties. The “OT vs CRDT” debate is less important than understanding your specific latency, offline, and consistency requirements.

5.4 Cursor and Selection Synchronization

Showing other users’ cursors and selections is what makes collaborative editing feel collaborative. Implementation:
  • Each client broadcasts cursor position and selection range via WebSocket (fire-and-forget, like typing indicators)
  • Other clients render colored cursors and selection highlights with the user’s name
  • Challenge: Cursor positions are expressed as document offsets, which change as other users insert or delete text. You must transform cursor positions against incoming operations — a simpler version of the OT problem

6. Real-Time at Scale — Production Concerns

6.1 Capacity Planning

Memory budget per connection:
ComponentMemoryNotes
TCP socket buffer8-16 KBSO_RCVBUF + SO_SNDBUF
TLS state (wss://)10-20 KBSession keys, cipher state
Application state2-10 KBUser ID, subscriptions, last heartbeat
Total per connection~20-50 KBVaries by implementation
Rough capacity planning:
  • 1M concurrent connections at 40KB each = 40 GB RAM just for connection state
  • That is 10-20 servers with 4-8 GB allocated for connections (leave headroom for message processing)
  • Each server handles 50K-100K connections
  • Message throughput: budget 1-5 ms CPU per message for routing, serialization, and pub/sub publish

6.2 Backpressure: When Clients Cannot Keep Up

A user on a slow 3G connection in a busy chat channel might receive messages faster than their connection can deliver them. Without backpressure, the server’s per-connection send buffer grows until it exhausts memory. Strategies:
  • Per-connection send buffer limit: If the buffer exceeds a threshold (e.g., 1 MB), drop low-priority messages (presence updates, typing indicators) first, then older messages
  • Client-side flow control: The client sends an acknowledgment for batches of messages. If acks slow down, the server throttles sending
  • Channel downgrade: For users on slow connections, switch from pushing every message to pushing a “N new messages” notification and let the client pull on demand
The worst backpressure scenario: A server with 50K connections, 100 of which are slow, can have those 100 connections consume most of the server’s memory in their send buffers while the other 49,900 connections are fine. Always cap per-connection buffer size and have a “kill the connection” fallback for clients that are too far behind.

6.3 Message Ordering Guarantees

Not all applications need the same ordering guarantees:
GuaranteeWhat It MeansCostUse Case
Total orderEvery client sees every message in the same orderHigh (central sequencer required)Financial trading, collaborative editing
Causal orderIf A caused B, everyone sees A before B (concurrent messages may vary)Medium (vector clocks or similar)Chat (replies appear after the message they reply to)
Per-sender orderMessages from the same sender are in order; across senders, no guaranteeLow (per-sender sequence numbers)Most chat systems
Best-effortNo guaranteesLowestTyping indicators, presence updates

6.4 Geographic Distribution

For a global user base, running all WebSocket servers in a single region means users on the other side of the planet have 150-300ms minimum latency just for the TCP connection. Solutions: Multi-region WebSocket deployment:
  • WebSocket edge servers in each major region (US-East, US-West, EU, Asia)
  • Users connect to the nearest edge via GeoDNS or Anycast
  • Edge servers relay messages through a global pub/sub backbone (Kafka, NATS, or a custom mesh)
  • Message persistence happens in a primary region, with edge servers providing low-latency delivery
The consistency challenge: A message sent from Tokyo and a message sent from London for the same channel might arrive at different edge servers at different times. The global message ordering must be resolved before delivery to clients. Common approach: assign a primary region per channel or conversation, route messages through it for ordering, then fan out globally.

6.5 Monitoring Real-Time Systems

Key metrics to track:
MetricWhat It Tells YouAlert Threshold (example)
Active connectionsCurrent load on your WebSocket fleet> 80% of capacity
Connection rate (connects/sec)Deployment events, DDoS, or thundering herd> 2x normal rate
Disconnection rateNetwork issues, server health> 5% of connections in 1 minute
Message throughput (msgs/sec)System load, traffic patternsN/A (baseline varies)
Message delivery latency (p50, p99)User experiencep99 > 500ms
Pub/sub lagBackbone bottleneck> 100ms or growing
Per-connection send buffer sizeSlow clients, backpressureAny connection > 1 MB
Error rateBugs, protocol issues> 0.1% of messages
Cross-chapter connection: Observability. Real-time system monitoring builds on the observability fundamentals covered in Caching & Observability. Use distributed tracing to follow a message from the sender’s WebSocket through the pub/sub backbone to the recipient’s WebSocket. This end-to-end trace is the only way to diagnose “messages are slow” complaints.

6.6 Security

Authentication on connect: Authenticate during the WebSocket handshake, not after. Pass a JWT or session token as a query parameter (wss://example.com/ws?token=...) or in the initial HTTP Upgrade headers. Reject unauthenticated connections before the upgrade completes. Message validation: Validate every inbound message. A malicious client can send arbitrary data over a WebSocket connection. Validate message structure (schema validation), content (length limits, character encoding), and permissions (can this user send to this channel?). Rate limiting per connection: Limit messages per connection per second (e.g., 10 messages/second for chat, 1 message/second for typing indicators). This prevents a single malicious or buggy client from overwhelming the server. Implement rate limiting server-side, not client-side — clients can be modified.
WebSocket-specific security gotcha: WebSocket connections are not subject to CORS (Cross-Origin Resource Sharing) protections the way HTTP requests are. Any web page can open a WebSocket to your server. You must validate the Origin header during the handshake and reject connections from unauthorized origins. Without this, any website can establish a WebSocket to your server and act as an authenticated user if cookies are shared.

6.7 Real-Time Testing Strategies

Testing real-time systems is fundamentally harder than testing HTTP APIs. You are not testing request-response pairs — you are testing persistent connections, event ordering, concurrent state mutations, and behavior under network degradation. Most teams under-invest here and discover failures in production.

Unit and Integration Testing

Message protocol tests: Validate that your WebSocket message schema is enforced. Send malformed messages, oversized payloads, binary data when text is expected, and messages with missing required fields. Assert that the server rejects gracefully (close frame with appropriate error code) rather than crashing or silently dropping. Connection lifecycle tests: Test every state transition: connect, authenticate, subscribe to channels, receive messages, unsubscribe, disconnect, reconnect with resume. Automate these as integration tests that run against a real WebSocket server instance, not mocks. Ordering tests: Send 1,000 numbered messages rapidly on a single channel and verify that all connected clients receive them in order. Then send messages from multiple concurrent senders and verify per-sender ordering is maintained.
// Example: WebSocket integration test with k6
// k6 has native WebSocket support for load testing
import ws from 'k6/ws';
import { check } from 'k6';

export default function () {
  const url = 'wss://staging.example.com/ws?token=test-jwt';
  const res = ws.connect(url, {}, function (socket) {
    socket.on('open', () => {
      // Subscribe to a test channel
      socket.send(JSON.stringify({ type: 'subscribe', channel: 'load-test' }));
    });

    socket.on('message', (data) => {
      const msg = JSON.parse(data);
      check(msg, {
        'has sequence number': (m) => m.seq !== undefined,
        'sequence is increasing': (m) => m.seq > lastSeq,
        'delivery under 500ms': (m) => Date.now() - m.serverTimestamp < 500,
      });
    });

    // Keep connection alive for test duration
    socket.setTimeout(() => socket.close(), 60000);
  });

  check(res, { 'status is 101': (r) => r && r.status === 101 });
}

Load Testing

Real-time systems have unique load characteristics: the metric that matters is not requests-per-second but concurrent connections and message throughput under load. Tools for WebSocket load testing:
ToolStrengthsLimitations
k6Native WebSocket support, scriptable in JavaScript, excellent metrics output, integrates with Grafana. Can simulate complex interaction patterns (subscribe, send messages, validate responses).Single-machine concurrency limited by OS file descriptors. Use k6 cloud or distribute across machines for > 50K connections.
ArtilleryYAML-based scenarios, WebSocket engine built-in, good for CI/CD pipelines. Supports think-time and ramp-up patterns.Less flexible than k6 for complex custom logic.
websocket-benchPurpose-built for WebSocket connection volume testing. Lightweight, can open 100K+ connections from a single machine.Less feature-rich for message validation and scenario scripting.
GatlingJVM-based, excellent for sustained long-duration load tests. WebSocket support via plugin.Steeper learning curve (Scala DSL).
LocustPython-based, easy to write custom WebSocket behaviors. Good for teams already using Python.WebSocket support requires websocket-client extension, not as polished as HTTP testing.
What to measure during load tests:
  • Connection establishment rate: How many connections per second can the server accept? This tests the handshake path, authentication, and subscription setup.
  • Message delivery latency (p50, p95, p99) at load: Latency under 100 connections is meaningless. Test at 50%, 80%, and 100% of expected peak concurrent connections.
  • Message throughput ceiling: At what message rate does delivery latency start degrading? This is your system’s saturation point.
  • Memory growth over time: Run a 4-hour soak test and plot memory usage. WebSocket servers that leak memory per-connection or per-message will show a slow, steady climb.
  • Reconnection storm recovery: Simultaneously disconnect 10% of connections and measure how long until all clients are reconnected and caught up on missed messages.
# Example: Artillery WebSocket load test scenario
config:
  target: "wss://staging.example.com"
  phases:
    - duration: 60
      arrivalRate: 100     # 100 new connections/sec for 60 seconds = 6,000 connections
    - duration: 300
      arrivalRate: 0       # Hold connections for 5 minutes (soak phase)
  ws:
    subprotocol: "chat-v1"

scenarios:
  - engine: ws
    flow:
      - send: '{"type":"auth","token":"{{$randomString()}}"}'
      - think: 1
      - send: '{"type":"subscribe","channel":"load-test"}'
      - think: 5
      - loop:
          - send: '{"type":"message","text":"hello from Artillery"}'
          - think: 2
        count: 50

Chaos Testing for Connections

Production connection failures are messy: partial network partitions, asymmetric packet loss, sudden server termination, DNS flaps. You need to test for these deliberately. Chaos scenarios to run:
  1. Kill the WebSocket server mid-stream. kill -9 the process while clients are connected. Verify all clients detect the disconnection within your heartbeat timeout (30-60 seconds), reconnect with jitter, and resume without message loss.
  2. Introduce asymmetric packet loss. Use tc (Linux traffic control) or Toxiproxy to add 10% packet loss on the server-to-client path only. Verify clients detect degradation and the server’s backpressure mechanisms activate before send buffers overflow.
  3. Partition the pub/sub backbone. Temporarily block traffic between WebSocket servers and Redis/Kafka. Verify that messages sent during the partition are delivered after recovery (if using durable pub/sub like Kafka) and that non-durable messages (typing indicators) are simply dropped without cascading failure.
  4. Exhaust file descriptors. Set ulimit -n to a low value (e.g., 1024) on a test server and push connections past that limit. Verify the server rejects new connections gracefully with a proper error instead of crashing or hanging.
  5. Simulate mobile network transitions. Rapidly cycle a test client through connected-disconnected-connected states (simulating WiFi to cellular handoff). Verify the reconnection and gap-fill path handles rapid state changes without duplicate subscriptions or orphaned server-side state.
Tooling:
  • Toxiproxy (by Shopify): A TCP proxy that simulates network conditions — latency, bandwidth limits, connection resets, timeouts. Run it between your test clients and WebSocket servers.
  • Chaos Monkey / Litmus (for Kubernetes): Randomly terminate WebSocket server pods to test connection migration and reconnection.
  • tc (traffic control): Linux kernel tool for injecting latency, packet loss, and bandwidth limits at the network interface level.
Cross-chapter connection: Testing. The chaos testing patterns above extend the testing fundamentals covered in Testing, Logging & Versioning. The key difference for real-time systems is that you cannot test connections in isolation — you must test the full lifecycle: connect, authenticate, subscribe, send/receive under load, disconnect unexpectedly, reconnect, and verify state recovery. Integration tests that start from a WebSocket handshake are non-negotiable; unit tests on message handlers alone miss the majority of production failure modes.

Interview Questions

What they are really testing: Can you navigate the intersection of distributed systems, real-time protocols, and conflict resolution? Do you understand OT vs CRDTs, or do you wave your hands and say “just merge the edits”?Strong answer framework:
  1. Clarify scope: Text-only or rich text? How many concurrent editors per document? Offline support needed? What latency target for seeing others’ edits?
  2. Protocol choice: WebSocket for bidirectional real-time sync. SSE is insufficient because clients need to send operations, not just receive them.
  3. Conflict resolution: Choose between OT (server-mediated, proven by Google Docs) and CRDTs (peer-to-peer capable, used by Figma and Notion).
    • OT: Client sends operations to server. Server transforms and applies in canonical order, then broadcasts to all clients. Simpler for server-first architectures.
    • CRDT: Each client applies operations locally and broadcasts to peers. Operations commute mathematically, so no central ordering needed. Better for offline/local-first.
  4. Architecture: Document service (HTTP API for CRUD), WebSocket server (real-time sync), operation log (append-only store of all operations for history/undo), snapshot service (periodic full-document snapshots to avoid replaying entire history).
  5. Cursor sync: Broadcast cursor positions via WebSocket, fire-and-forget. Transform cursor positions against incoming operations.
  6. Scaling: Shard by document ID. All editors of the same document connect to the same WebSocket server (or a small cluster). Documents are independent — no cross-document coordination needed.
Common mistakes:
  • Ignoring conflict resolution entirely (“we’ll just use timestamps”)
  • Not distinguishing OT from CRDT and their trade-offs
  • Forgetting cursor synchronization
  • Not addressing what happens when a user goes offline and reconnects with divergent edits
Words that impress: “operational transformation,” “convergence guarantee,” “causal ordering,” “tombstone compaction” (CRDT-specific), “operation rebasing”
What they are really testing: Do you understand the resource constraints of persistent connections and the architectural patterns for horizontal scaling?Strong answer framework:
  1. Capacity math: 1M connections at ~40KB each = 40GB RAM for connection state. Budget for 10-20 WebSocket servers at 50K-100K connections each. Increase OS file descriptor limits (ulimit -n 200000).
  2. Load balancing: L7 load balancer with WebSocket upgrade support (ALB, Nginx, HAProxy). Distribute connections by consistent hashing on user ID for even spread.
  3. Pub/sub backbone: Redis Pub/Sub or Kafka for message routing between servers. When Server 3 receives a message for a user on Server 7, it publishes to a topic. Server 7 subscribes and delivers.
  4. Connection registry: Redis hash mapping user_id → server_id. Updated on connect/disconnect. Used for direct message routing and presence.
  5. Graceful deploys: Rolling deploys with connection draining. Send reconnect signal to clients on old instances. New instances absorb reconnections.
  6. Reconnection strategy: Exponential backoff with jitter to prevent thundering herd. Last-message-ID resumption to avoid missing messages during reconnect.
Example answer: “I’d start with the math: 1 million connections at 40KB each is 40GB of RAM for connection state alone. That tells me I need 15-20 servers handling 50-70K connections each, with file descriptor limits raised to 100K+. I’d use an ALB for load balancing with WebSocket upgrade support, and a Redis Pub/Sub layer as the message routing backbone between servers. Each server subscribes to channels relevant to its connected users. A Redis hash tracks which user is on which server for direct message routing. For deploys, I’d drain connections gracefully by sending a reconnect control message to clients, who reconnect with exponential backoff and jitter to spread the load. The key insight is that the WebSocket protocol itself scales fine — the challenge is message routing across servers and managing connection state.”Common mistakes:
  • Thinking WebSockets are inherently harder to scale than HTTP (they use the same TCP connections)
  • Forgetting about file descriptor limits
  • Proposing sticky sessions without addressing what happens when a server dies
  • Not mentioning reconnection strategy
What they are really testing: Do you reach for the simplest tool that solves the problem, or do you default to the most complex one?Strong answer framework:“For a live sports score dashboard, I would choose SSE over WebSocket, and here is why:
  1. Direction: Scores flow server → client. The client does not need to send real-time data back. SSE is designed for exactly this pattern.
  2. Auto-reconnect: SSE’s EventSource API reconnects automatically and sends Last-Event-ID so the server can replay missed score updates. With WebSocket, I’d have to build this reconnection and replay logic myself.
  3. Infrastructure compatibility: SSE is plain HTTP. It works through corporate proxies, CDNs, and load balancers without special configuration. WebSocket requires upgrade support from every proxy in the chain.
  4. HTTP/2 multiplexing: If the dashboard tracks multiple sports simultaneously, each sport can be a separate SSE stream. Under HTTP/2, these share a single TCP connection. Under HTTP/1.1, you are limited to 6 connections per domain, which is a constraint if tracking many streams.
  5. When I would switch to WebSocket: If the dashboard adds interactive features — live betting, chat, user-generated predictions that need real-time sync — then the bidirectional nature of WebSocket becomes necessary. But for score updates alone, SSE is simpler, more robust, and easier to operate.”
Common mistakes:
  • Defaulting to WebSocket without considering simpler alternatives
  • Not knowing that SSE has built-in reconnection and event replay
  • Missing the HTTP/2 multiplexing advantage for SSE
Words that impress: “EventSource auto-reconnect,” “Last-Event-ID resumption,” “HTTP/2 stream multiplexing,” “infrastructure transparency”
What they are really testing: Do you understand the root causes of message ordering issues in distributed systems, and can you design a solution that balances correctness with performance?Strong answer framework:
  1. Diagnose the root cause. Out-of-order delivery can come from multiple sources:
    • Clock skew across servers (if ordering by timestamp)
    • Messages routed through different servers with different latencies
    • Race condition in the pub/sub layer (two messages published concurrently arrive in different order at different subscribers)
    • Client-side rendering race (two messages arrive, the UI renders the second before the first)
  2. Solution: Per-conversation sequence numbers.
    • Assign each message a monotonically increasing sequence number within its conversation
    • The sequence is generated at the persistence layer (a single writer per conversation, or an atomic counter)
    • Clients sort by sequence number, not by arrival time or timestamp
    • Clients detect gaps: if they receive sequence 42 and then 44, they know 43 is missing and request it
  3. Snowflake IDs as a secondary sort. For messages across conversations or for global ordering, use Snowflake IDs that encode timestamp + worker ID + sequence, providing roughly chronological order without central coordination.
  4. Client-side buffering. Hold incoming messages in a small buffer (50-100ms) before rendering. Sort the buffer by sequence number before displaying. This smooths out minor ordering inconsistencies from network jitter.
Example answer: “First, I’d investigate whether the ordering issue is at the server or client. I’d add logging to track message sequence numbers at each hop: on send, at the pub/sub layer, on delivery to the WebSocket server, and on arrival at the client. If messages are in order at the server but out of order at the client, it is a client rendering issue — fix it with a small client-side buffer that sorts by sequence number before rendering. If messages are out of order at the server, the issue is likely that messages for the same conversation are being processed by multiple servers concurrently. The fix is to partition conversations so all messages for a single conversation route through a single ordering point — either a dedicated partition in Kafka or a per-conversation sequence counter in Redis.”Common mistakes:
  • Saying “just use timestamps” without acknowledging clock skew
  • Not distinguishing between server-side and client-side ordering issues
  • Proposing a global total order when per-conversation order is sufficient (and much cheaper)
What they are really testing: Can you design a system that delivers notifications across multiple channels (in-app, push, email) with appropriate urgency, deduplication, and user preferences?Strong answer framework:
  1. Event ingestion: All notifiable events (likes, comments, follows, mentions) flow through an event bus (Kafka). A notification service consumes these events and decides what to send to whom.
  2. User preference engine: Users configure which notifications they want and through which channels. Store preferences in a fast-access store (Redis or DynamoDB). The notification service checks preferences before dispatching.
  3. Channel selection based on user state:
    • User has active WebSocket? Deliver in-app immediately.
    • User app is backgrounded? Send push notification (with 5-10 second delay in case they return to the app).
    • User offline for > 1 hour? Queue for email digest.
  4. Deduplication: If 50 people like your photo in 1 minute, do not send 50 notifications. Aggregate: “Alice, Bob, and 48 others liked your photo.” Use a time-window aggregation per (user, event_type, target_object).
  5. Priority levels: Mention in a conversation = high priority (immediate push). Someone liked your post = medium (can batch). Weekly digest = low (email only). Rate limit per user per channel to prevent notification fatigue.
  6. Scale considerations: At 100M users, even 1% being active generates 1M notification decisions per minute. The notification service must be horizontally scalable, with Kafka partitioned by recipient user ID for even distribution.
Common mistakes:
  • Designing only one delivery channel (e.g., only push notifications)
  • Not aggregating similar notifications
  • Forgetting user preferences and mute settings
  • Not considering the “cancel notification” case (user reads the message before push is sent)
What they are really testing: Do you understand that naive presence broadcast does not scale, and can you design a system with bounded fan-out?Strong answer framework:
  1. State management: Each user has a presence state: online, idle, offline, DND. Presence is updated by WebSocket heartbeats (every 30s). If no heartbeat in 60s, transition to offline.
  2. Storage: Redis with TTL-based expiration. Key: presence:{user_id}, value: {status, last_seen, device}, TTL: 90 seconds. The heartbeat resets the TTL. When TTL expires, user is implicitly offline — no explicit “set offline” needed for disconnection detection.
  3. Fan-out strategy — subscription-based, not broadcast.
    • When User A opens a channel or conversation, they subscribe to presence for users in that view (not all 50M users)
    • The presence service tracks these subscriptions: “User A cares about the presence of Users B, C, D”
    • When User B’s presence changes, the service notifies only users subscribed to B’s presence
    • When User A navigates away, unsubscribe from those presence updates
  4. Batching: Do not send individual presence updates. Batch all changes in a 5-second window and send a single “presence delta” message: {online: [B, C], offline: [D]}.
  5. Consistency trade-off: Presence is inherently eventual. Showing a user as “online” for 30 seconds after they disconnect is acceptable. Over-investing in presence consistency is a common mistake.
Words that impress: “subscription-based fan-out,” “TTL-based expiration,” “presence delta batching,” “eventual consistency is acceptable here”
What they are really testing: Do you understand ICE, STUN, TURN, and NAT traversal, or do you think WebRTC is magic?Strong answer framework:
  1. Signaling: Peers exchange SDP (Session Description Protocol) offers and answers through a signaling server (typically WebSocket). SDP describes media codecs, encryption parameters, and ICE candidates.
  2. ICE candidate gathering: Each peer queries a STUN server to discover its public IP and port (the “server reflexive candidate”). It also knows its local IP (“host candidate”). If a TURN server is configured, it allocates a relay address (“relay candidate”).
  3. Connectivity checks: Both peers attempt to connect using all candidate pairs (host-to-host, host-to-server-reflexive, host-to-relay). ICE tries the most direct path first and falls back to TURN relay if nothing else works.
  4. When it fails:
    • Symmetric NAT: Both peers are behind symmetric NATs that assign different external ports for different destinations. STUN cannot discover a usable external address. ~15-20% of connections require TURN for this reason.
    • Corporate firewalls: Firewalls that block UDP entirely. TURN over TCP (or TURN over TLS on port 443) is the fallback.
    • No TURN server configured: If STUN-only candidates fail and there is no TURN server, the connection fails silently. This is the #1 deployment mistake with WebRTC.
  5. At scale (SFU vs Mesh vs MCU):
    • Mesh: Each peer connects to every other peer. Works for 2-4 participants. At 10 participants, that is 90 connections — unsustainable.
    • SFU (Selective Forwarding Unit): Each peer sends one stream to the server, which forwards it to all others. O(1) upload, O(N-1) download. Discord, Zoom, and Google Meet use SFUs.
    • MCU (Multipoint Control Unit): Server decodes all streams, composites them into one, and sends a single stream to each peer. Lowest client bandwidth but highest server CPU cost. Rarely used now.
Common mistakes:
  • Not mentioning TURN servers (critical for production)
  • Thinking WebRTC is purely peer-to-peer at scale (production systems use SFUs)
  • Not understanding why symmetric NAT breaks STUN
What they are really testing: Can you reason about latency compensation, authoritative servers, and the trade-off between responsiveness and consistency in a latency-sensitive context?Strong answer framework:
  1. Authoritative server model: The server holds the ground truth of game state. Clients send inputs (key presses, movements), not state changes. The server validates inputs, updates the state, and broadcasts the authoritative state to all clients. This prevents cheating.
  2. Client-side prediction: To avoid the feel of input lag, the client predicts the result of its own inputs immediately (e.g., moves the player character before the server confirms). When the server’s authoritative state arrives, the client reconciles: if the prediction was correct, smooth; if not, the client corrects.
  3. Server reconciliation: The client tags each input with a sequence number. When the server’s state update arrives, the client replays any unacknowledged inputs on top of the server’s state to produce the current predicted state. This is the “rollback and replay” technique.
  4. Entity interpolation: For other players’ positions, the client does not show the latest server state (which is in the past due to latency). Instead, it interpolates between the two most recent server states, displaying a smooth animation that is slightly behind real-time (~100ms behind).
  5. Protocol: UDP (or WebRTC data channels) for game state, not TCP. TCP’s retransmission and ordering guarantees add latency that is unacceptable for real-time games. A dropped position update is irrelevant once the next one arrives.
  6. Tick rate: The server runs a game loop at a fixed rate (e.g., 60 ticks/second for a fast-paced game, 20 ticks/second for a slower game). Each tick processes all buffered player inputs, updates the game state, and sends a snapshot to clients.
Common mistakes:
  • Using TCP for real-time game state (adds unacceptable latency from retransmissions)
  • Trusting the client to send state changes instead of just inputs (enables cheating)
  • Not mentioning client-side prediction and server reconciliation
  • Forgetting about entity interpolation for other players
Words that impress: “authoritative server,” “client-side prediction,” “server reconciliation,” “entity interpolation,” “tick rate,” “input buffering”
What they are really testing: Do you understand the operational reality of WebSocket connection management — that connections die silently and you need active detection?Strong answer framework:
  1. Why connections go stale: Mobile users enter tunnels, laptops hibernate, NAT mappings expire. The TCP connection appears open on the server side, but no data can reach the client. Without active detection, these “zombie connections” accumulate and waste memory, file descriptors, and message delivery effort.
  2. Detection: Ping/pong heartbeats.
    • Server sends a WebSocket ping frame every 30 seconds
    • Client responds with pong (this is handled automatically by the WebSocket protocol, no application code needed)
    • If no pong is received within 10 seconds, mark the connection as suspect
    • If two consecutive pings go unanswered (60 seconds), close the connection and clean up
  3. Application-level heartbeats: In addition to protocol-level ping/pong, have the client send an application-level heartbeat (e.g., {"type": "heartbeat", "timestamp": ...}) every 30 seconds. This confirms not just that the TCP connection is alive, but that the client application is running and responsive.
  4. Cleanup: On connection close (clean or timeout), remove the user from the connection registry, update presence to “offline” (with a grace period for reconnection), and unsubscribe from all pub/sub channels.
  5. Monitoring: Track the ratio of stale connections detected per minute. A sudden spike in stale connections indicates a network issue, a load balancer timeout misconfiguration, or a server-side bug.
Common mistakes:
  • Relying solely on TCP keepalive (default interval is 2 hours — far too slow)
  • Not cleaning up connection registry and pub/sub subscriptions on stale connection detection
  • Setting heartbeat interval longer than the load balancer’s idle timeout
What they are really testing: Can you handle the intersection of real-time delivery, strict ordering, and data consistency under high contention?Strong answer framework:
  1. Bid submission: Clients submit bids via WebSocket. Each bid includes: auction ID, user ID, bid amount, client timestamp.
  2. Ordering and fairness: The server is the authoritative timestamp source, not the client. Bids are ordered by server receive time. For bids arriving within the same millisecond, a deterministic tiebreaker (e.g., lower user ID) ensures fairness and consistency.
  3. Bid validation: Single-threaded processing per auction (or optimistic locking). For each bid: verify bid > current highest bid, verify user has funds/authorization, atomically update the highest bid, and broadcast the new bid to all watchers.
  4. Real-time broadcast: All users watching the auction receive bid updates via WebSocket within 100ms. Use a pub/sub channel per auction. Display the current highest bid, the bidder, and a countdown timer.
  5. Auction close timing: The auction close time must be server-authoritative. Client clocks cannot be trusted. The server sends periodic time-sync messages so clients display accurate countdowns. “Snipe protection”: extend the auction by 30-60 seconds if a bid arrives in the final minute (eBay-style).
  6. Consistency: This is a case where strong consistency matters. Two users placing bids simultaneously must not both think they won. Use a single writer per auction (dedicated partition or lock) to serialize bid processing.
Words that impress: “server-authoritative timestamp,” “single-threaded bid processing per auction,” “snipe protection extension,” “optimistic locking with conflict detection”

Real-World Architecture Examples

Stack: WebSocket (text chat, signaling) + WebRTC (voice/video) + Elixir (WebSocket servers) + Rust (voice SFUs) + Cassandra → ScyllaDB (message storage) + Redis (presence, pub/sub)Key decisions:
  • Snowflake IDs for globally unique, roughly ordered message IDs
  • Lazy fan-out for large servers (> 1000 members): push messages only to connected, viewing members
  • Voice uses SFU architecture: each client sends one audio stream, receives N-1 streams
  • Custom Elixir-based gateway servers handle millions of WebSocket connections (Elixir’s BEAM VM excels at massive concurrency)
Scale: 200+ million monthly users, 10M+ concurrent voice connections, billions of messages per day.
Stack: WebSocket (real-time messaging) + PHP → Hack → Java (backend evolution) + MySQL (message storage) + Redis + Kafka (pub/sub) + Vitess (MySQL sharding)Key decisions:
  • Evolved from long polling → WebSocket as scale grew
  • Channel-based pub/sub: each Slack channel is a pub/sub topic
  • Per-workspace message ordering (not global): messages within a workspace are ordered, but cross-workspace ordering is not guaranteed
  • Enterprise features drove architecture: message retention policies, compliance exports, DLP (Data Loss Prevention) required messages to be stored server-side in a queryable format
Scale: 750K+ organizations, tens of millions of daily active users, message delivery < 200ms p99.
Stack: WebSocket (real-time sync) + Custom OT-like conflict resolution + Rust (performance-critical server components) + PostgreSQL (metadata) + S3 (file storage)Key decisions:
  • Custom conflict resolution system (not pure OT or pure CRDT) optimized for their tree-structured design data model
  • Server-mediated synchronization: all operations go through the server for ordering
  • Incremental sync: clients send small operation deltas, not full document state
  • Multiplayer presence: cursor positions, selection highlights, and viewport indicators synced via WebSocket
  • WebAssembly for client-side rendering (consistent performance across browsers)
Scale: Hundreds of concurrent editors per document, sub-100ms edit propagation latency.
Stack: WebSocket-like persistent connection + OT (Operational Transformation) + Bigtable/Spanner (storage) + Custom collaboration infrastructureKey decisions:
  • OT with server as the single source of truth for operation ordering
  • Operations are transformed, applied, and broadcast in server-canonical order
  • Revision history is a complete operation log — any past state can be reconstructed by replaying operations from the beginning
  • “Suggested edits” mode is a separate OT operation type that does not modify the document but exists in a parallel operation stream
Scale: Billions of documents, millions of concurrent collaborative sessions.

Cross-Chapter Connections

Where real-time systems connect to the rest of this guide:
  • OS Fundamentals: File descriptors, epoll for I/O multiplexing, socket buffer tuning, and ulimit configuration are the kernel-level primitives that determine how many WebSocket connections a single server can handle. Understanding why epoll scales to 100K connections while select does not is essential for capacity planning.
  • Messaging, Concurrency & State: The pub/sub backbone for WebSocket fan-out uses the same messaging patterns (Kafka, Redis Pub/Sub, delivery semantics). Delivery guarantees (at-least-once + dedup) are identical. Kafka’s durable log enables message replay for connection recovery; Redis Pub/Sub’s fire-and-forget model suits ephemeral events like typing indicators.
  • Distributed Systems Theory: CRDTs for collaborative editing, vector clocks for causal ordering, and the CAP theorem trade-offs in multi-region WebSocket deployments all build on distributed systems fundamentals. The consistency-availability trade-off in presence systems is a direct application of the theory covered there.
  • API Gateways & Service Mesh: WebSocket connections at the gateway level require HTTP Upgrade support, long-lived connection management, and message-based routing — fundamentally different from stateless HTTP proxying. Kong, Envoy, and AWS API Gateway each handle WebSocket differently.
  • Cloud Service Patterns: AWS API Gateway WebSocket APIs, AWS AppSync real-time subscriptions, and serverless WebSocket patterns (Lambda + DynamoDB for connection tracking) provide managed alternatives to self-hosted WebSocket infrastructure. The cost-vs-control trade-off is critical at different scale points.
  • Performance & Scalability: Connection capacity planning, backpressure, and latency budgets are performance engineering applied to persistent connections. Tail latency in message delivery follows the same principles as HTTP request tail latency.
  • Networking & Deployment: WebSocket load balancing, TLS termination, DNS-based routing for multi-region deployments, and HTTP/2 multiplexing for SSE all build on networking fundamentals.
  • Caching & Observability: Monitoring WebSocket connection counts, message delivery latency percentiles, and pub/sub lag uses the same observability stack (Prometheus, Grafana, distributed tracing).
  • Reliability Principles: Graceful degradation (downgrading from push to pull), circuit breakers on downstream services, and reconnection with exponential backoff are reliability patterns applied to real-time systems.
  • Auth & Security: WebSocket authentication (JWT on handshake), Origin header validation, per-connection rate limiting, and E2EE (Signal Protocol) are security concerns specific to persistent connections.
  • Testing, Logging & Versioning: Real-time systems require chaos testing (connection drops, network partitions, pub/sub failures) and load testing of persistent connections — extending the testing patterns covered there beyond request-response paradigms.
  • System Design Practice: Chat systems, notification systems, and collaborative editors are common system design interview problems that require the real-time patterns covered here.

Foundational:
  • WebSocket RFC 6455 — The actual protocol specification. Surprisingly readable for an RFC. Read sections 1 (Introduction) and 4 (Opening Handshake) at minimum.
  • MDN: Server-Sent Events — The best practical guide to SSE with code examples.
  • WebRTC for the Curious — A free, open-source book that explains WebRTC from the ground up. The best resource for understanding ICE, STUN, TURN, and SDP without getting lost in browser API details.
Architecture and Scale:Collaborative Editing:
  • Designing Data-Intensive Applications by Martin Kleppmann — Chapter 5 (Replication) and Chapter 9 (Consistency and Consensus) provide the distributed systems foundation for understanding OT and CRDTs.
  • CRDTs: The Hard Parts — Martin Kleppmann’s talk on the real challenges of CRDTs beyond the academic papers. Essential viewing for anyone considering CRDTs in production.
  • Yjs Documentation — Practical guide to implementing collaborative editing with the most popular CRDT library.
Advanced:

Quick Reference Cheatsheet

ScenarioProtocolWhy
Chat / messagingWebSocketBidirectional, low latency
Live dashboard / scoresSSEServer push only, auto-reconnect
Voice / video callWebRTCP2P, ultra-low latency
IoT sensor dataWebSocket or MQTTPersistent, lightweight
Legacy enterpriseLong pollingProxy/firewall compatibility
Collaborative editorWebSocketBidirectional operations + cursor sync
Notifications (in-app)WebSocket or SSEServer push
File transfer (P2P)WebRTC data channelDirect peer connection
CI/CD build logsSSEOne-way stream, auto-resume with event IDs
Multiplayer game stateWebRTC data channel or raw UDPLowest latency, loss-tolerant
The senior engineer’s heuristic: Start with SSE if your data flows one way. Move to WebSocket if you need bidirectional communication. Use WebRTC only for media or P2P data. Use long polling as a fallback, not a first choice. The simplest protocol that solves your problem is the right one — it will be easier to debug, monitor, and operate in production.

Interview Deep-Dive Questions

These questions go beyond surface-level recall. They simulate the multi-layered probing that senior and staff-level interviews use to separate candidates who have operated real-time systems in production from those who have only read about them.

Q1: You inherit a WebSocket service that works fine at 10K connections but falls over at 50K. Walk me through your investigation.

What the interviewer is really testing: Can you reason about system resource exhaustion methodically? Do you reach for data before theories, or do you guess? Strong answer: The way I think about this is as a resource exhaustion problem, and the question is which resource. I would investigate in order of most common to least common:
  • File descriptors first. This is the most frequent cause and the easiest to verify. ulimit -n on most Linux distros defaults to 1024. At 50K connections, you need at least 50K file descriptors plus headroom for log files, database connections, and the pub/sub client. I would check cat /proc/<pid>/limits for the running process and compare Max open files against the actual count from ls /proc/<pid>/fd | wc -l. If the soft limit is anywhere near 50K, that is the issue. Fix: set ulimit -n 200000 in the service unit file, and also raise fs.file-max at the kernel level if it is low.
  • Memory next. Each WebSocket connection consumes 20-50KB between TCP socket buffers, TLS session state, and application-level data. At 50K connections and 40KB each, that is 2GB just for connection state. If the server has 4GB total and your application plus the JVM or Node.js runtime already uses 1.5GB, you are pushing into swap or hitting OOM. I would check memory consumption patterns with pmap or the runtime’s heap profiler, looking specifically for per-connection allocations that are larger than expected — maybe someone stored the full user object instead of just the user ID on each connection.
  • Ephemeral port exhaustion. If the WebSocket server itself makes outbound connections — to a Redis pub/sub, a database, or an upstream service — each outbound connection consumes an ephemeral port. The default range is only ~28K ports. If the server is creating and tearing down outbound connections frequently without reusing them (connection pool misconfiguration), you can hit this limit. Check with ss -s and look at the TIME_WAIT count.
  • CPU saturation from message fan-out. If each incoming message triggers a broadcast to all 50K connections, the serialization and write syscall overhead per message scales linearly. At 50K connections with 100 messages per second, you are doing 5 million serialization-and-send operations per second. Profile the CPU: is it spending time in JSON serialization, in TLS encryption, or in kernel syscalls for socket writes?
  • The load balancer or proxy in front. Many reverse proxies have their own connection limits that are lower than the backend server’s capacity. Nginx’s worker_connections default is 1024. An ALB has a default of 100 concurrent connections per target in some configurations. The server might be fine; the proxy might be the bottleneck.
I would always start by reproducing the failure in a staging environment with a load testing tool like k6, incrementally ramping connections from 10K to 60K while watching dmesg, application logs, and key system metrics (file descriptors, memory RSS, CPU, and network buffer stats). The logs at the moment of failure usually point directly to the root cause.

Follow-up: You discover the cause is memory — each connection holds 200KB instead of the expected 40KB. How do you find the leak?

A 5x memory overhead per connection tells me something is being allocated per-connection that should not be. My approach:
  • Heap snapshot diff. In Node.js, I would take a heap snapshot at 10K connections and another at 20K connections, then compare them in Chrome DevTools. I am looking for objects whose count grew proportionally with the connection count. Often the culprit is a retained reference — an event listener that was added on connection open but never removed on close, or a message history buffer that was supposed to be capped but has no size limit.
  • Common culprits I have seen in production. (1) Each connection subscribes to a pub/sub channel, but the subscription callback closure captures a reference to the entire request context including large headers and the HTTP body from the upgrade request. (2) A message broadcast function that creates a per-connection copy of the message payload instead of serializing once and sending the same buffer to all connections. (3) TLS session caching gone wrong — the server is caching TLS session tickets per-connection instead of per-client.
  • The fix pattern. Once I identify the oversized object, I either eliminate it (why is the full HTTP upgrade request retained after the connection is established?), cap it (ring buffer instead of unbounded array), or share it (serialize the message once, send the same Buffer to all connections instead of serializing per-connection).

Follow-up: The system is in production and crashing right now. You cannot afford a heap snapshot. What do you do?

Triage under fire means I need data without invasive profiling:
  • Immediate mitigation: Reduce the connection limit per server by pulling some instances out of the load balancer rotation, spreading the load. If I have auto-scaling, scale out horizontally to reduce per-instance connection count below the threshold where the problem manifests.
  • Non-invasive data collection: Check /proc/<pid>/status for VmRSS (resident set size) to confirm memory is the issue. Use jemalloc’s stats endpoint if the runtime supports it. For Node.js, process.memoryUsage() exposed via a health endpoint gives heap used vs heap total without pausing the process.
  • If I suspect a specific code path: Add a lightweight gauge metric that tracks the size of the data structure I suspect (connection map size, subscription count, buffer sizes) and emit it to the monitoring system. This costs almost nothing and gives me the trend data I need.
  • Last resort: Enable core dumps, let the next OOM kill generate a dump, and analyze offline with gdb or lldb while fresh instances handle the traffic.

Going Deeper: How would you design the connection state management from scratch to prevent this class of problem?

I would enforce a strict memory budget per connection at the architecture level:
  • Define a ConnectionContext struct or class with a fixed set of fields: user ID, server ID, subscribed channels (set with max cardinality), last heartbeat timestamp, and a capped send buffer (ring buffer, max 1MB). No extensible maps or “metadata” bags. If you need to add a field, it is a code change with a review.
  • Implement a per-connection memory accounting system. When a connection is created, increment a gauge. When it is destroyed, decrement. Alert if the per-connection memory (total RSS / active connections) drifts above the budget. This catches regressions before they become outages.
  • Separate the hot path (message routing) from the cold path (user profile lookup, channel metadata). The connection context should hold only what is needed to route a message. Everything else is looked up on demand from a shared cache.

Q2: Explain the trade-offs between Redis Pub/Sub and Kafka as the backbone for cross-server WebSocket message routing.

What the interviewer is really testing: Do you understand the delivery semantics, failure modes, and operational characteristics of each system, or do you just know their names? Strong answer: The way I frame this decision is around one fundamental question: what happens when a WebSocket server misses a message? If the answer is “nothing important — the user will just not see a typing indicator for two seconds,” Redis Pub/Sub is the right choice. If the answer is “a chat message is permanently lost and the user has to refresh to see it,” you need Kafka or something with similar durability.
  • Redis Pub/Sub is fire-and-forget. When a message is published to a channel, Redis delivers it to all currently subscribed clients. If a WebSocket server is temporarily disconnected from Redis — during a network blip, a Redis failover, or during the server’s own restart — every message published during that window is gone. There is no replay, no offset, no “give me what I missed.” This makes Redis Pub/Sub ideal for ephemeral events: typing indicators, presence updates, cursor positions in a collaborative editor. These are events where the latest state supersedes any missed intermediate states. Redis Pub/Sub is also extremely fast — sub-millisecond publish-to-subscribe latency within the same datacenter — and operationally simple. A single Redis instance can handle hundreds of thousands of publishes per second.
  • Kafka provides durable, ordered, replayable message streams. Each message is written to a partition log and retained for a configurable duration (hours, days, or indefinitely). Consumers track their offset — the position in the log they have read up to. If a WebSocket server restarts, it resumes from its last committed offset and replays everything it missed. This makes Kafka necessary for durable messages: chat messages, notifications, bid updates in an auction — anything where losing a message means data loss visible to the user. The trade-off is higher latency (typically 5-20ms end-to-end vs sub-1ms for Redis), more operational complexity (Zookeeper or KRaft, partition management, consumer group coordination), and higher resource consumption.
  • The hybrid approach, which is what I have seen work best in production: Use Redis Pub/Sub as the primary real-time delivery path for all messages (including durable ones). Simultaneously, write durable messages to Kafka. If a WebSocket server detects it missed messages (via sequence number gap detection on the client), it fetches the missing messages from Kafka’s log. This gives you Redis’s speed for the happy path and Kafka’s durability for recovery. The critical insight: you are not choosing one or the other. You are using each for what it is good at.

Follow-up: Redis Pub/Sub has no backpressure mechanism. What happens when a subscriber cannot keep up, and how do you handle it?

This is a real production failure mode that bites people. When a WebSocket server subscribes to Redis Pub/Sub but processes messages slower than they arrive — maybe because it is CPU-saturated from serializing messages for 80K connected clients — Redis does not slow down. It buffers messages in memory for that subscriber. If the subscriber falls far enough behind, Redis’s output buffer for that client grows unbounded. The default behavior when the output buffer exceeds the configured limit (client-output-buffer-limit pubsub) is that Redis disconnects the subscriber entirely. So the failure mode is: slow subscriber falls behind, Redis buffers messages, buffer exceeds limit, Redis kills the subscription, the WebSocket server loses its real-time feed silently (unless it monitors the subscription connection), and all clients on that server stop receiving messages. I have seen this cascade: one slow server loses its subscription, users on that server stop receiving messages, they start hitting refresh, the refresh storm creates more load, and more servers fall behind. Mitigation:
  • Monitor the Redis subscriber output buffer size. Alert if any subscriber’s buffer exceeds 50% of the configured limit. INFO CLIENTS in Redis shows per-client buffer usage. This gives you early warning.
  • Set aggressive output buffer limits with a hard cap. Configure client-output-buffer-limit pubsub 256mb 64mb 60 — meaning: hard limit 256MB (disconnect immediately), soft limit 64MB sustained for 60 seconds (disconnect if above 64MB for over a minute). These prevent a single slow subscriber from exhausting Redis’s memory.
  • Implement subscriber health checks. Each WebSocket server should monitor the age of the last message it received from Redis. If no message arrives for a threshold period (and messages are expected), proactively reconnect the subscription. Do not wait for Redis to kill it.
  • Decouple subscription from processing. Instead of processing messages synchronously in the Redis subscription callback, push them to a local in-process queue (bounded, with drop-oldest policy for non-durable messages) and process from the queue. This isolates the subscription speed from the processing speed.

Follow-up: You chose the hybrid Redis + Kafka approach. How do you ensure a message is not displayed twice on the client — once from Redis real-time delivery and once from Kafka replay?

Deduplication is handled at the client layer, not the server layer. Every message has a globally unique ID (Snowflake ID or UUID). The client maintains a set of recently seen message IDs (bounded window, last 10,000 IDs or last 30 minutes). When a message arrives — whether from the live WebSocket stream or from a gap-fill replay — the client checks the ID against its dedup set. If it has been seen, it is silently dropped. On the server side, the gap-fill response from Kafka includes messages that may already have been delivered via Redis. This is intentional — it is cheaper to over-deliver and let the client dedup than to track which messages were successfully delivered to which client via which path. This is the at-least-once delivery with client-side deduplication pattern that virtually every production chat system uses.

Q3: A product manager asks you to add “who is typing” to a chat product that has 200K concurrent users. How do you design it, and what do you explicitly not do?

What the interviewer is really testing: Can you design a feature with the right level of reliability — not over-engineering, not under-engineering? Do you know where to cut corners intentionally? Strong answer: Typing indicators are the canonical example of a feature where the reliability requirements are deliberately low, and recognizing that is the first sign of an experienced engineer. The worst case of a missed typing indicator is that someone does not see the ”…” animation for three seconds. The worst case of over-engineering it is that you add load and complexity to systems that should be focused on delivering actual messages. What I would build:
  • Client emits a typing_start event when the user begins typing in a conversation. This is sent over the existing WebSocket connection as a lightweight message: {"type": "typing", "conversation_id": "abc", "user_id": "123"}. The client throttles emissions to at most one every 3 seconds to prevent flooding.
  • Server broadcasts the event to other participants in the conversation via the existing pub/sub backbone (Redis Pub/Sub). No Kafka, no persistence, no acknowledgment. Fire-and-forget.
  • Receiving clients display the indicator for 4-5 seconds, then auto-expire it. If another typing_start arrives from the same user, reset the timer. No typing_stop event is strictly necessary — the auto-expiry handles it. I would send a typing_stop when the user submits the message (so the indicator disappears immediately rather than lingering for 4 seconds), but if it is lost, no harm done.
What I would explicitly not do:
  • No persistence. Typing state is never written to a database. It is ephemeral by nature. If a user disconnects and reconnects, they do not need to know who was typing 10 seconds ago.
  • No delivery guarantees. No ACKs, no retries, no sequence numbers on typing events. If Redis drops a typing event during a blip, nobody notices.
  • No fan-out to offline users. If a conversation participant is offline, do not queue the typing indicator for them. Do not send push notifications. This sounds obvious, but I have seen systems where typing events flowed through the same delivery pipeline as messages, including the “queue for offline delivery” path, which wasted storage and push notification budget.
  • No per-user tracking on the server. The server does not maintain a data structure of “who is currently typing in which conversation.” It simply relays the event. The client is responsible for tracking the state and auto-expiring stale indicators. This keeps the server stateless with respect to typing.
At 200K concurrent users, the math: If 5% of users are actively typing at any given time, that is 10K typing events every 3 seconds, or about 3,300 events per second. Each event is ~100 bytes. That is 330KB/s of typing traffic through Redis Pub/Sub — trivial. The fan-out is bounded by conversation size, not total user count. A typing event in a 5-person DM fans out to 4 recipients. Even if the average conversation has 20 participants, that is 66K recipient-events per second — well within Redis Pub/Sub’s capacity on a single instance.

Follow-up: The product manager now wants “Alice is typing…” to show the user’s name. How does this change your design if you have 50 participants in a channel?

It does not change the transport design at all. The typing event already carries the user_id. The receiving client looks up the display name from its local user cache (which it already has because it needs to display message sender names). The only new consideration is the UI aggregation logic: with 50 participants, you might have 5 people typing simultaneously. Showing “Alice, Bob, Carol, Dave, and Eve are typing…” is cluttered. The standard pattern is: 1 person shows their name (“Alice is typing…”), 2-3 people show names (“Alice and Bob are typing…”), 4+ shows a count (“4 people are typing…”). The one subtle issue at 50 participants: if typing events arrive from many users in a burst, the client UI should debounce the display update. Do not re-render the typing indicator on every single event. Batch events into a 500ms window and then update the display once. This prevents UI jank from rapid sequential re-renders.

Follow-up: How do you prevent a malicious client from sending 1,000 typing events per second to flood the system?

Rate limiting per connection on the server side. I would set a limit of 1 typing event per second per connection. Any typing event that arrives within 1 second of the previous one from the same connection is silently dropped. This is implemented as a simple timestamp check in the message handler — no external rate limiter needed. For a more sophisticated attack where someone opens many connections, per-user rate limiting (across all their connections) is needed. A lightweight in-memory counter per user ID, reset every second, with a limit of 2-3 events per second. If exceeded, drop the event and optionally send a rate-limit warning back to the client. The key insight: rate limiting on typing events can be aggressive because the cost of false-positive throttling is zero. Nobody notices if their typing indicator takes an extra second to appear.

Q4: Compare OT and CRDTs for collaborative editing. When would you choose each, and what is the hardest production problem you would face with each?

What the interviewer is really testing: Do you understand these at an implementation level, or just at a buzzword level? Can you reason about the actual production trade-offs beyond “OT needs a server, CRDTs do not”? Strong answer: Both OT and CRDTs solve the same fundamental problem — converging divergent document states when multiple users edit concurrently — but they make opposite architectural bets. Operational Transformation bets on a central server. Every operation flows through the server, which establishes a total order. When two operations conflict, the server transforms one against the other using a pair-wise transformation function. The mathematical requirement is that transformations satisfy the “TP1” property: applying operation A then transformed-B produces the same result as applying B then transformed-A. Google Docs has been running OT in production for over 15 years. CRDTs bet on the data structure itself. Instead of transforming operations after the fact, you design the data structure so that concurrent operations commute — they produce the same result regardless of the order they are applied. No central server is needed for convergence. Yjs and Automerge are the main libraries. Figma uses a custom approach with CRDT-like properties. When I would choose OT:
  • Server-mediated architecture where all clients connect to a central server (the Google Docs model). OT is simpler when you already have a server ordering operations.
  • Rich document formats with many operation types (text, tables, images, formatting). OT’s N-squared transformation matrix is painful but at least each transformation function can be hand-tuned. CRDT metadata overhead grows with document complexity.
  • Undo/redo is a hard requirement. OT has a well-understood inverse operation model. CRDT undo requires tracking causal history, which is significantly more complex.
When I would choose CRDTs:
  • Local-first or offline-first applications where users may edit for hours without connectivity and then sync. CRDTs were designed for exactly this. OT requires server round-trips for conflict resolution.
  • Peer-to-peer architectures without a central server. Think a collaborative tool that syncs directly between devices.
  • When I want to use an existing library (Yjs) rather than building conflict resolution from scratch. Yjs has been battle-tested and the ergonomics are excellent.
The hardest production problem with OT: transformation function correctness. With N operation types, you need N-squared transformation functions, and every single one must be mathematically correct. If even one transformation function has a subtle bug — say, it miscalculates a position offset when an insert and a formatting change overlap at a boundary — users’ documents diverge. They are both looking at the same “document” but seeing different content. And the divergence may not be detected until much later. Google has teams dedicated to verifying OT correctness. For a startup, this is a significant investment. The hardest production problem with CRDTs: metadata bloat. Every character in a CRDT-based document carries a unique ID and a reference to its predecessor. Deleted characters are not removed — they become tombstones that are invisible to the user but still consume memory and must be processed during merge operations. A document that has been heavily edited — say, 100,000 insertions and 80,000 deletions — carries 100,000 character IDs and 80,000 tombstones. Over time, the document’s internal representation can be 10-50x larger than the visible text. Tombstone compaction (garbage collecting tombstones that all peers have observed) is the mitigation, but it requires coordination between peers to confirm that a tombstone is safe to remove — which partially negates the “no coordination needed” advantage.

Follow-up: You are building a Notion-like tool. The document is not just text — it is a tree of blocks (paragraphs, headings, lists, embeds). How does this affect your choice?

This is where the decision gets nuanced. Block-based documents are fundamentally tree-structured, and the conflict resolution semantics are different from linear text. For text content within a block, both OT and CRDTs work well — this is the problem they were designed for. The harder problem is structural operations on the tree: moving a block from one position to another, nesting a block inside a list, deleting a parent block while someone else is editing a child. CRDTs have an advantage here because tree CRDTs (like the one Yjs implements) can handle concurrent structural changes — one user moves block A above block B while another user nests block B inside block A. The CRDT’s merge semantics produce a deterministic result, though it may not be the result either user intended. The UI then needs to handle the “surprising merge” gracefully. OT handles tree operations by defining transformation functions for move, nest, unnest, and delete operations. The N-squared problem gets worse because tree operations interact with text operations in non-obvious ways. Figma’s custom approach is instructive: they model the document as a flat set of nodes with parent pointers rather than a nested tree, which reduces the combinatorial explosion of operation interactions. In practice, Notion chose to use a CRDT-based approach for this reason. The structural merge semantics are imperfect, but they converge reliably without needing a massive transformation function matrix.

Follow-up: How do you test that your collaborative editing system actually converges correctly?

This is an area where most teams under-invest, and it is where production bugs hide. My approach:
  • Fuzz testing with random operations. Generate thousands of random document operations (insert, delete, format, move) across N simulated clients. Apply them in every possible order permutation (or a large random sample of orderings). Verify that all clients converge to the same document state. This is the most effective test — it finds edge cases that hand-written tests never cover. The Yjs and Automerge test suites are heavily based on this approach.
  • Specific regression tests for known-hard cases. Concurrent insert at the same position. Delete of a range that overlaps with an insert. Formatting change that spans a boundary where another user is typing. Block move that conflicts with a block delete. Each of these should be an explicit test case.
  • Long-running convergence soak test. Simulate a document with 10 clients editing continuously for 24 hours. Periodically snapshot all clients’ document states and verify they are identical. This catches slow divergence bugs — the kind where documents drift apart by one character every few thousand operations.

Q5: You are designing a system where 100,000 users are watching a live sports event and need score updates within 1 second. Walk through your architecture.

What the interviewer is really testing: Can you design a high-fan-out, read-heavy real-time system? Do you understand the difference between pushing to 100K users and pushing to 100K individual connections? Strong answer: This is a classic one-to-many broadcast problem, and the key insight is that the data flow is entirely unidirectional — server to client. That immediately points me toward SSE over WebSocket. The users are not sending data back; they are consuming a stream. Architecture:
  • Data ingestion: A small service consumes score data from the sports data provider’s API (or webhook). This service validates, normalizes, and publishes score events to an internal message bus (Redis Pub/Sub is sufficient here — the event rate is low, maybe one event every few seconds per game, and the data is ephemeral).
  • Edge SSE servers: A fleet of SSE servers, each handling 10-20K concurrent SSE connections. Each server subscribes to the Redis Pub/Sub channel for score events. When a score event arrives, the server writes it to all active SSE connections. At 100K users, that is 5-10 edge servers.
  • Why SSE and not WebSocket: Auto-reconnection with Last-Event-ID is built into the browser’s EventSource API. If a user’s connection drops (mobile network switch, laptop wakes from sleep), the browser reconnects automatically and the server replays missed events from a small in-memory buffer keyed by event ID. With WebSocket, I would have to build all of this reconnection and replay logic myself. For a read-only stream, that is unnecessary complexity.
  • CDN edge caching (the advanced optimization). If the sports data provider sends updates every 5 seconds, I can put a CDN (Cloudflare, Fastly) in front of the SSE endpoints with a 1-2 second cache TTL. Every CDN edge location holds the SSE connection to my origin and fans out to thousands of local users. This turns my 100K origin connections into maybe 50 CDN-to-origin connections, with the CDN handling the fan-out at the edge. Fastly and Cloudflare both support SSE streaming through their CDN. This is the architecture that many live-score websites use.
  • Fallback for environments where SSE is blocked. Long polling with a 5-second timeout as the fallback. The response includes the latest score state plus a cursor. The next request includes the cursor. Simple, works everywhere.
The math: Each SSE event is roughly 200 bytes (JSON payload). At one event every 5 seconds, that is 40 bytes/second per connection. At 100K connections, that is 4MB/s of outbound bandwidth — trivial. The bottleneck is not bandwidth but connection count. With SSE over HTTP/2, each domain can multiplex many streams over a single TCP connection, but in practice the browser opens one EventSource connection per stream. The 6-connection-per-domain limit in HTTP/1.1 is the constraint to watch for if users are subscribed to multiple simultaneous game streams.

Follow-up: The product team now wants to show a real-time activity feed — “John just predicted Team A will win” — alongside the scores. 10,000 users are submitting predictions per minute. How does this change your architecture?

Now the data flow is bidirectional. Users submit predictions (client-to-server) and see others’ predictions (server-to-client). I would keep SSE for the server-to-client score stream but add an HTTP POST endpoint for prediction submissions. This avoids upgrading to WebSocket. The interesting design problem is the activity feed fan-out. 10,000 predictions per minute is ~167 per second. Pushing all 167 predictions per second to all 100K users is 16.7 million messages per second. That is expensive and wasteful — users cannot cognitively process 167 predictions per second anyway. The solution is server-side sampling and aggregation. Instead of pushing every prediction, the server aggregates predictions in a 5-second window (“1,247 predictions in the last 5 seconds: 62% Team A, 38% Team B”) and pushes the aggregate. Plus, push a curated sample of 3-5 individual predictions with user names for social proof (“John from London just predicted Team A”). This reduces fan-out from 167 events/second to 1 aggregate event every 5 seconds — perfectly manageable.

Follow-up: How do you handle the thundering herd when a goal is scored and 100K clients all simultaneously receive the event?

The thundering herd here is not on the receive side — SSE connections are already open and the server writes to each one. The concern is if the goal event triggers 100K clients to simultaneously make HTTP requests for additional data (goal replay video, updated standings, commentary). Mitigation:
  • Push enough data in the SSE event that the client does not need to fetch more. Include the scorer name, new score, and minute in the event payload. The client can update the UI immediately without any HTTP request.
  • Stagger client-initiated requests with client-side jitter. If the client does need to fetch additional data (like a highlight clip URL), add a random delay of 0-5 seconds before making the request. setTimeout(() => fetch('/highlights/latest'), Math.random() * 5000).
  • CDN caching for supplementary data. The highlight clip URL, updated standings, and commentary are the same for all 100K users. Serve them from CDN with a short cache TTL. The first request hits origin; the next 99,999 hit the CDN edge.

Q6: Your WebSocket-based chat application works perfectly in development but users report “messages arrive 5-10 seconds late” in production. How do you diagnose this?

What the interviewer is really testing: Can you systematically trace a latency problem through a distributed system, or do you randomly poke at things? Strong answer: A 5-10 second latency in production but not development tells me the problem is in the infrastructure between the client and the application logic, not in the application logic itself. In development, the client connects directly to a local server with no intermediaries. In production, there are layers — load balancers, proxies, TLS termination, pub/sub backbone, geographic routing — and any of them could be adding latency. My diagnostic approach, layer by layer:
  • Step 1: Instrument the message path. Add timestamps at each hop: (1) when the sender’s client sends the message, (2) when the sender’s WebSocket server receives it, (3) when it is published to the pub/sub backbone, (4) when the recipient’s WebSocket server receives it from pub/sub, (5) when it is written to the recipient’s WebSocket, and (6) when the recipient’s client receives it. The gap between these timestamps tells me exactly where the latency lives. If the gap is between (3) and (4), the pub/sub layer is the bottleneck. If it is between (5) and (6), it is a network or proxy issue.
  • Step 2: Check the load balancer idle timeout. This is the most common surprise. AWS ALB defaults to a 60-second idle timeout, and some HTTP proxies (Cloudflare, nginx) have shorter timeouts. If the WebSocket connection goes idle for longer than the proxy timeout, the proxy silently kills it. The client does not know the connection is dead until it tries to send or until the next heartbeat fails. Messages sent to that connection on the server side are written to a dead socket and silently dropped. The client eventually reconnects and fetches missed messages — hence the 5-10 second delay (heartbeat detection time + reconnect time + gap-fill time). The fix: ensure the heartbeat interval is shorter than the smallest idle timeout in the proxy chain.
  • Step 3: Check Nagle’s algorithm and TCP buffering. Nagle’s algorithm (TCP_NODELAY not set) causes the TCP stack to buffer small messages and send them in batches. A small WebSocket frame (like a chat message) might be held by the TCP stack for up to 200ms waiting for more data. In some cases, interaction with delayed ACKs on the receiving side can push this to seconds. The fix: set TCP_NODELAY on the WebSocket server’s TCP sockets. Most WebSocket libraries do this by default, but some do not.
  • Step 4: Check pub/sub consumer lag. If using Kafka, check consumer lag metrics. If the WebSocket servers’ Kafka consumer group has significant lag (thousands of unconsumed messages), every message is delayed by the time it takes to process the backlog. This can happen if the consumer’s processing throughput dropped (GC pauses, resource contention) or if the message rate spiked (viral event in a large channel).
  • Step 5: Is it all users or specific users? If specific users report latency but others are fine, the issue is likely geographic (user far from the server region, adding 200ms+ RTT) or network-specific (user’s corporate proxy buffering WebSocket traffic, mobile carrier doing transparent proxying that interferes with WebSocket frames). Ask the affected users to run a WebSocket latency test from their network and compare with unaffected users.

Follow-up: You discover the latency is in the pub/sub layer — Redis Pub/Sub has 3-5 seconds of latency. What could cause that?

Redis Pub/Sub should have sub-millisecond latency under normal conditions. A 3-5 second delay is pathological and points to one of a few causes:
  • Redis is under memory pressure and swapping to disk. Check INFO memory for used_memory vs maxmemory, and check the OS for swap usage. If Redis is swapping, every operation that touches swapped-out pages stalls.
  • A slow subscriber is blocking the event loop. In Redis’s single-threaded model, if one subscriber is slow to consume messages, Redis buffers messages for that subscriber, and in extreme cases the buffer management itself can slow down the event loop. Check CLIENT LIST for clients with large obl (output buffer length) values.
  • Network saturation between the WebSocket servers and Redis. If the Redis instance is in a different availability zone or region, the inter-AZ network might be congested. Check network throughput and packet loss between the WebSocket servers and Redis.
  • Too many subscriptions. Each WebSocket server subscribes to a Redis channel for every conversation its connected users are in. At 50K connections with an average of 20 conversations each, that is 1 million subscription patterns. Redis evaluates every publish against all subscriber patterns. If you are using pattern-based subscriptions (PSUBSCRIBE), the matching cost is higher than exact channel matches. Switch to exact SUBSCRIBE per channel.

Going Deeper: If you had to redesign the pub/sub layer from scratch for sub-50ms guaranteed delivery latency, what would you do differently?

I would move from Redis Pub/Sub to NATS or NATS JetStream. NATS is purpose-built for low-latency messaging and handles millions of messages per second with consistent sub-5ms latency. It does not have Redis’s single-threaded bottleneck. For durable message delivery, NATS JetStream provides Kafka-like persistence with NATS’s latency characteristics. The alternative is to eliminate the pub/sub hop entirely for the common case. If I use consistent hashing to route all participants of a conversation to the same WebSocket server, messages within a conversation never leave the server process — no pub/sub hop at all. The pub/sub layer is only needed for conversations that span multiple servers (which is the minority if the hashing is designed around conversation, not user). This is the approach Discord uses: each “guild” (server) is assigned to a specific gateway process, and all members of that guild connect to the same process.

Q7: Explain backpressure in a WebSocket system. What happens when you ignore it, and how do you implement it?

What the interviewer is really testing: Do you understand flow control in persistent-connection systems? Have you dealt with slow consumers in production? Strong answer: Backpressure is the mechanism by which a system signals “slow down, I cannot keep up.” In HTTP, backpressure is natural — the client sends a request, waits for the response, and the rate of requests is bounded by the response time. In WebSocket, there is no such natural governor. The server can push messages to a client as fast as it wants, and if the client’s network or processing speed cannot keep up, messages accumulate. What happens when you ignore backpressure: The server maintains a send buffer for each WebSocket connection. When the server writes a message to a connection, it goes into this buffer. The operating system’s TCP stack drains the buffer by sending data over the network. If the client is on a slow connection (3G mobile, congested WiFi, VPN with high latency), the TCP send window shrinks, the buffer drains slower, and messages accumulate. Without a cap on this buffer, a single slow client in a busy channel can consume tens or hundreds of megabytes of server memory. Multiply by 100 slow clients and you can OOM the server. I have seen a production incident where one WebSocket server OOM-crashed because 200 clients on a corporate VPN were slow, and each had a 50MB accumulated send buffer. The server had 8GB of RAM; 200 clients at 50MB each consumed 10GB. How I implement backpressure:
  • Per-connection send buffer cap. Set a maximum send buffer size per connection (1-5MB depending on the application). Monitor the buffer size. When it approaches the limit, start shedding load for that connection.
  • Priority-based message shedding. When a connection’s buffer is above 50% capacity, drop low-priority messages first: typing indicators, presence updates, cursor positions. These are ephemeral and the latest state supersedes any dropped intermediate states. Between 50-80%, start dropping older messages in the buffer (the client will gap-fill on reconnect). Above 80%, send a “you are too slow, please reconnect” control message and close the connection.
  • Adaptive message quality. For connections with growing buffers, switch from push-every-message to push-notification-only mode: instead of sending the full message payload, send a lightweight notification (“3 new messages in #general”) and let the client pull the full messages via HTTP when its connection improves. This is what Slack’s hybrid push-pull model does.
  • Server-side monitoring. Track a distribution of per-connection buffer sizes across the fleet. Alert on the 99th percentile buffer size, not the average. The average will look healthy because 99% of connections are fast. The p99 is where the time bombs are.

Follow-up: How does TCP flow control interact with WebSocket backpressure? Isn’t TCP already handling this?

TCP has its own flow control via the receive window — the receiver advertises how much data it can accept. When the receive window fills, the sender stops sending. But this TCP-level backpressure only prevents the kernel from sending more TCP segments. It does not prevent the application from writing more data to the socket’s send buffer. The application-level write succeeds immediately (it just copies data to the kernel buffer), so the application thinks the message was sent when it is actually sitting in a kernel buffer waiting for the TCP window to open. The problem compounds because Node.js and similar runtimes buffer writes in user-space too. The write call returns a boolean indicating whether the kernel buffer is full (socket.write() returns false in Node.js), but many WebSocket libraries ignore this return value and keep writing. The correct implementation checks the return value and pauses writing until the drain event fires, indicating the kernel buffer has space. So the layers are: application write queue, user-space runtime buffer, kernel TCP send buffer, and the network. You need backpressure at the application level (your per-connection buffer cap) because by the time TCP flow control kicks in, you have already committed megabytes of memory per connection.

Follow-up: How would you implement graceful degradation for a user on a slow connection rather than just disconnecting them?

I would implement a tiered quality-of-service model per connection:
  • Tier 1 (full fidelity): Buffer is below 30% capacity. Client receives everything: messages, typing indicators, presence updates, read receipts, cursor positions.
  • Tier 2 (reduced fidelity): Buffer is between 30-60%. Stop sending typing indicators, presence updates, and cursor positions for this connection. These are reconstructed from the current state when the connection recovers. This reduces message volume by roughly 40-60% in a typical chat application.
  • Tier 3 (essential only): Buffer is between 60-85%. Only send actual chat messages and direct mentions. Aggregate channel activity into periodic summaries (“12 new messages in #general”) instead of individual messages. Switch the client to pull-on-demand mode where opening a channel triggers an HTTP fetch rather than relying on the push stream.
  • Tier 4 (disconnection): Buffer exceeds 85%. Send a control message indicating the connection quality is too poor, close the connection cleanly, and let the client reconnect when conditions improve.
The client should be aware of its current tier (via a control message from the server) and adjust its UI accordingly — for example, showing a “slow connection” indicator and fetching messages on demand instead of expecting them to arrive instantly.

Q8: How would you implement end-to-end encryption for group messages in a chat application with 100 participants?

What the interviewer is really testing: Do you understand the key management complexity of group E2EE and the fundamental tension between E2EE and server-side features? Strong answer: End-to-end encryption for 1:1 messages is well-understood — the Signal Protocol’s double ratchet handles it elegantly. Group E2EE is where the complexity explodes, because you need every group member to decrypt every message without the server having access to the plaintext. The naive approach and why it fails: Encrypt the message with each recipient’s public key and send N copies (one per recipient). For 100 participants, the sender encrypts the message 100 times and uploads 100 ciphertexts. If the message is 1KB, that is 100KB uploaded per message. This is the “fan-out encryption” approach, and it does not scale beyond small groups. The Sender Keys approach (used by Signal for groups, WhatsApp):
  • Each group member generates a “sender key” — a symmetric key used to encrypt their messages to the group. The sender key is itself distributed to each group member via individual 1:1 encrypted channels (using the double ratchet).
  • When User A sends a message to the group, they encrypt it once with their sender key. All 100 recipients can decrypt it using User A’s sender key that they received earlier.
  • The result: The sender encrypts once (not 100 times). The server stores one ciphertext (not 100). Fan-out is handled by the server distributing the same ciphertext to all recipients. Massive bandwidth improvement.
  • The trade-off: When a member leaves the group, every remaining member must generate a new sender key and distribute it to all other members. Otherwise, the departed member could still decrypt future messages if they have the old sender keys. This “group ratchet” on member removal is O(N) key distributions for N remaining members. For a group of 100, removing one member triggers 99 key distributions. This is manageable but gets expensive for groups with frequent membership changes.
The MLS (Messaging Layer Security) protocol — the emerging standard: MLS uses a tree-based key agreement structure (a “ratchet tree”) where each group member is a leaf node. Key updates propagate up the tree logarithmically. When a member is added or removed, only O(log N) key operations are needed instead of O(N). For a group of 100, that is ~7 operations instead of 99. IETF ratified MLS as RFC 9420, and it is being adopted by Cisco Webex and is under consideration by others. The features you lose with E2EE:
  • Server-side search is impossible (the server cannot read messages). You must build client-side search indexes, which means every client indexes all their messages locally. For mobile clients with limited storage, this is a significant constraint.
  • Server-side content moderation, spam filtering, and link previews cannot operate on encrypted content. Moderation must be reporter-based (users flag content) or client-side (the client scans after decryption, which raises privacy concerns).
  • Message history for new group members requires the group to re-encrypt historical messages for the new member’s key, or accept that new members cannot see messages from before they joined.

Follow-up: A user loses their phone and gets a new one. How do they recover their message history if messages are end-to-end encrypted?

This is one of the hardest UX problems in E2EE systems. The messages are encrypted with keys that existed on the old device, which is now lost. Option 1: Encrypted cloud backup. The client periodically backs up the message database and encryption keys to the cloud (iCloud, Google Drive), encrypted with a user-chosen passphrase or a key derived from a PIN. On the new device, the user enters their passphrase and recovers everything. WhatsApp uses this approach. The risk: if the passphrase is weak, the backup is vulnerable. If the user forgets the passphrase, the history is gone. Option 2: Multi-device sync with key escrow to yourself. If the user has a second device (tablet, desktop), the encryption keys can be synced between devices via a secure channel established when the second device was added. Losing one device does not lose the keys because the other device still has them. Signal implements this with “linked devices.” Option 3: Accept that history is lost. This is what Signal does for users without backup or linked devices. The security model prioritizes forward secrecy over convenience. New device, fresh start. This is a deliberate design choice, not a limitation. The honest answer in an interview: “There is no solution that preserves both strong E2EE guarantees and seamless history recovery. Every approach trades off security for convenience. The right choice depends on the user base — a consumer app like WhatsApp chooses convenience (encrypted backups), while a high-security tool like Signal chooses security (accept history loss).”

Q9: You need to deploy a WebSocket service across 3 regions (US, EU, Asia) to serve a global user base. What are the hardest problems you will face?

What the interviewer is really testing: Can you reason about the intersection of real-time systems and geo-distribution? Do you understand that real-time across continents is fundamentally harder than real-time within a datacenter? Strong answer: The physics problem is that the speed of light imposes a floor on cross-region latency: US-East to EU-West is ~80ms, US-East to Tokyo is ~150ms. You cannot make a real-time system feel instant when the participants are on different continents. The architecture’s job is to minimize how often you pay that cross-region penalty. Problem 1: Message ordering across regions. If User A in Tokyo and User B in London both send messages to the same channel at nearly the same time, the messages arrive at their respective regional servers at different times. The servers might assign different orderings. If I use a Snowflake ID with a millisecond timestamp, and the two messages are within a few milliseconds of each other, clock synchronization between regions (even with NTP, drift can be tens of milliseconds) could cause inconsistent ordering at different regions. Solution: Designate a “home region” per conversation or channel. All messages for that conversation are routed through the home region for ordering. The home region assigns the authoritative sequence number and then fans out to other regions. This means a user in Tokyo sending a message to a conversation homed in US-East pays a 150ms penalty for the message to travel to US-East, get sequenced, and propagate back. But the ordering is globally consistent. The alternative — allowing each region to accept and order messages independently — requires a consensus protocol (which has even higher latency) or accepting eventual consistency in ordering (which causes messages to appear in different orders for different users temporarily). Problem 2: Presence and “who is online” across regions. Each regional WebSocket fleet knows who is connected to its servers, but not who is connected in other regions. When a user in EU opens a channel and wants to see who is online, the presence system must aggregate data from all three regions. If this aggregation happens synchronously, it adds cross-region latency to every presence query. Solution: Replicate presence state asynchronously. Each region publishes its presence changes to a global presence topic (Kafka or NATS across regions). Each region maintains a local cache of global presence state that is eventually consistent (up to a few seconds stale). Since presence is inherently approximate (showing someone as “online” when they disconnected 5 seconds ago is fine), eventual consistency is acceptable here. Problem 3: Connection routing and failover. Users should connect to the nearest region via GeoDNS or Anycast. But what happens when a region fails? If the EU region goes down, EU users need to fail over to the next nearest region (US-East). Their WebSocket connections all drop, and they reconnect to US-East. The thundering herd of EU users reconnecting to US-East must not overwhelm it. This requires capacity headroom in each region to absorb a neighboring region’s traffic (typically provisioning each region at 50-60% capacity to handle failover). Problem 4: Write amplification in the pub/sub layer. A message published in Tokyo must reach subscribers in EU and US. That is cross-region network traffic for every single message. For a high-traffic channel with messages every second, this is 3x write amplification (once per region). Optimize by batching cross-region replication — instead of sending each message individually across the WAN, batch messages in 50-100ms windows and send a single compressed batch. This trades a small increase in latency for a large reduction in cross-region bandwidth.

Follow-up: A conversation has 3 participants: one in Tokyo, one in London, one in New York. Where do you “home” the conversation, and what latency do they each experience?

I would home the conversation in the region where the creator is located or where the majority of participants are, but for a 3-person conversation evenly distributed, there is no great answer — someone pays the cross-region penalty. If homed in US-East: New York user sees ~5ms latency (local), London user sees ~80ms, Tokyo user sees ~150ms. If homed in EU: London is local, New York is ~80ms, Tokyo is ~200ms (EU to Asia). US-East is the best compromise because it minimizes the maximum latency (150ms worst case vs 200ms if EU-homed). A more sophisticated approach is to let the conversation home follow the activity. If for the past hour only the Tokyo and London participants are chatting, temporarily re-home to EU (which is roughly equidistant from both). But this adds complexity and the benefit is marginal. The honest answer is: for 3 participants across 3 continents, the minimum message delivery latency is bounded by the speed of light, and no amount of software architecture changes that. The best you can do is ensure the software adds zero unnecessary latency on top of the physics.

Follow-up: How do you handle the case where the home region goes down mid-conversation?

This requires a failover mechanism for conversation ownership. My approach:
  • Pre-designate a failover region for each conversation (e.g., if home is US-East, failover is EU-West). Store this mapping in a globally replicated metadata store (DynamoDB Global Tables, CockroachDB, or Consul).
  • Failure detection: Each region monitors the health of other regions via heartbeats. If US-East misses heartbeats from EU-West’s WebSocket fleet, EU-West may be down. After a configurable threshold (10-15 seconds to avoid false positives from transient network issues), declare the region unhealthy.
  • Failover: All conversations homed in the failed region are automatically re-homed to their designated failover region. The failover region begins accepting and sequencing messages. The last known sequence number from the failed region is used as the starting point (retrieved from the globally replicated metadata store or from Kafka’s durable log).
  • Recovery: When the failed region comes back, it does not automatically reclaim conversations. There is a manual or operator-triggered process to re-home conversations back, avoiding split-brain scenarios where both regions think they own the conversation.

Q10: What is the difference between total ordering, causal ordering, and per-sender ordering in a chat system? When is each sufficient?

What the interviewer is really testing: Do you understand ordering semantics at a theoretical level, and more importantly, can you map that theory to practical product requirements? Strong answer: Message ordering in distributed systems is a spectrum, and choosing the wrong point on that spectrum either wastes engineering effort (over-ordering) or creates user-visible bugs (under-ordering). Total ordering means every participant sees every message in the exact same order. Message 1, then Message 2, then Message 3 — no variation. This requires a single sequencer that assigns a global order to every message. The cost is that every message must pass through this sequencer, which is a throughput bottleneck and a latency penalty (every message pays the round-trip to the sequencer). Total ordering is necessary for financial trading systems (the order of buy/sell matters absolutely), collaborative editing (operations must be applied in the same order for convergence), and live auctions (bid ordering determines the winner). Causal ordering means if Message A caused Message B (B is a reply to A, or B was sent after the sender read A), then everyone sees A before B. But for concurrent messages (sent independently with no causal relationship), different participants might see them in different orders. This is implemented with vector clocks or Lamport timestamps. The cost is lower than total ordering — no central sequencer needed, just metadata per message that tracks causal dependencies. Causal ordering is sufficient for most chat applications. The product requirement is: “a reply always appears after the message it replies to.” That is a causal constraint. Whether two independent messages in #general appear as (Alice, Bob) or (Bob, Alice) does not matter — users will not notice because there is no causal relationship between them. Per-sender ordering means messages from the same sender arrive in the order that sender sent them. Alice’s messages are always in order relative to each other, and Bob’s messages are always in order relative to each other, but Alice’s Message 3 might appear before or after Bob’s Message 2 on different clients. This is the cheapest to implement: each sender maintains a local sequence number, and the client sorts each sender’s messages by sequence. No server-side coordination needed. Per-sender ordering is sufficient for typing indicators, presence updates, and low-stakes chat where the occasional cross-sender reorder is not confusing. In practice, most production chat systems use a hybrid: Per-conversation sequence numbers (assigned by a single writer per conversation) provide total ordering within a conversation. Across conversations, ordering is best-effort. This gives users a strong ordering guarantee where it matters (within the conversation they are reading) without the cost of global total ordering (which would require serializing every message across every conversation through a single point).

Follow-up: Two users in the same conversation both send a message at nearly the same time. With per-conversation sequence numbers, one gets sequence 42 and the other gets 43. But User A sees their message first and User B sees their message first. Is this a problem?

This is a subtle but important UX question. Both users applied “optimistic UI” — their own message appears immediately in their view before the server assigns a sequence number. When the server-assigned sequence arrives, each user’s client reorders to match the authoritative sequence. So there is a brief moment (50-200ms) where User A sees their message above User B’s, and User B sees their message above User A’s. Then both clients converge to the server-assigned order (say, User A’s message is 42, User B’s is 43, so User A’s appears first). Is this a problem? For casual chat, no. Users rarely notice this sub-second reorder for concurrent messages. For applications where visible ordering matters (support ticketing, medical records, audit logs), you should not use optimistic UI for message ordering. Instead, wait for the server-assigned sequence before displaying the message. The trade-off is that every message has a visible delay equal to the round-trip time to the server. The pragmatic approach most chat apps take: apply optimistic UI (show your message immediately) but do not reorder other people’s messages around yours. If your message gets sequence 43 but you already displayed it above the incoming sequence 42 message, keep it there in your local view. Only new-arriving messages respect the sequence. This avoids the jarring visual reorder while maintaining convergence for all subsequent viewers of the conversation history (who will see the messages in sequence order).

Follow-up: How do vector clocks work for causal ordering, and what is their practical limitation?

A vector clock is an array of counters, one per participant. When User A sends a message, they increment their counter and attach the full vector clock to the message. When User B receives User A’s message, they update their vector clock by taking the element-wise maximum of their clock and the received clock, then increment their own counter. Two messages are causally ordered if one’s vector clock is strictly less than the other’s (every element is less-than-or-equal, with at least one strictly less). If neither dominates, the messages are concurrent. The practical limitation: The vector clock’s size is proportional to the number of participants. In a conversation with 100 participants, every message carries a 100-element vector clock. That is 800 bytes of metadata per message (100 elements times 8 bytes each). For a high-volume channel with thousands of messages per hour, this metadata overhead is significant — both in network bandwidth and in storage. Mitigations: (1) Use a hash or interval tree clock that compresses the vector when most entries are identical (common when only a few users are actively chatting). (2) Use Lamport timestamps instead of vector clocks when you only need a partial causal ordering (Lamport timestamps are a single counter, not a vector, but they cannot detect concurrency — they only guarantee “if A happened before B, then A’s timestamp is less than B’s”). (3) For chat specifically, just use server-assigned per-conversation sequence numbers — they give you total ordering within the conversation, which is stronger than causal ordering and simpler to implement. Vector clocks shine in peer-to-peer systems without a central server, but if you have a server, you do not need them.

Q11: Walk me through how you would implement graceful shutdown for a WebSocket server during a rolling deployment without losing messages.

What the interviewer is really testing: Have you actually deployed WebSocket services, or is this all theoretical? Do you understand the operational complexity of managing persistent connections during infrastructure changes? Strong answer: Graceful shutdown for a WebSocket server is fundamentally different from graceful shutdown for an HTTP server. With HTTP, you stop accepting new requests and wait for in-flight requests to complete — typically a few seconds. With WebSocket, “in-flight” means 50,000 persistent connections that could be mid-conversation. You cannot just wait for them to finish because they might never finish. The goal is to migrate all connections to a new server instance with zero message loss and minimal user disruption. My step-by-step deployment process:
  • Step 1: Remove from load balancer. Signal the load balancer to stop routing new connections to the instance being drained. On AWS ALB, this is deregistration. On Kubernetes, this is failing the readiness probe while keeping the liveness probe healthy (so the pod is not killed but stops receiving new traffic).
  • Step 2: Send a reconnect advisory. Send a control message to all connected clients: {"type": "system", "action": "reconnect", "reason": "server_maintenance", "delay_ms": 5000}. The delay_ms field tells the client to wait a random duration between 0 and 5000ms before reconnecting. This is the jitter that prevents thundering herd.
  • Step 3: Wait for voluntary disconnections. Start a drain timer (e.g., 30 seconds). Well-behaved clients will receive the advisory, buffer any pending outbound messages, close the connection, and reconnect to a different server (since this one is no longer in the load balancer pool). During this window, the server continues processing messages normally for still-connected clients.
  • Step 4: Force-close remaining connections. After 30 seconds, force-close any connections that did not disconnect voluntarily. These are likely stale connections (client crashed, network partition) or clients that do not implement the reconnect advisory. Send a WebSocket close frame with status code 1001 (“Going Away”) before TCP-closing.
  • Step 5: Flush state to durable storage. Before the process exits, ensure the connection registry is updated (all connections for this server are removed from Redis), and any buffered messages that were not yet published to the pub/sub backbone are flushed.
  • Step 6: New instance is already accepting connections. The new server version was started and added to the load balancer before Step 1 began. Clients reconnecting from the draining instance land on the new version. On reconnect, each client sends its lastMessageId and the new server replays missed messages from the message store.
The critical detail most people miss: message delivery during the drain window. Between Step 1 (when the server is removed from the load balancer) and Step 4 (when connections are force-closed), the server is still receiving messages from the pub/sub backbone and delivering them to still-connected clients. If a message arrives at the pub/sub layer and is consumed by the draining server, but the client disconnects before receiving it, the message could be lost unless the pub/sub consumer has not committed the offset (Kafka) or unless the message is also available for the new server to deliver from the message store.

Follow-up: How does this work in Kubernetes with a rolling deployment strategy?

In Kubernetes, the lifecycle hook is the key integration point:
  • Define a preStop hook that sends the reconnect advisory and sleeps for the drain duration. The preStop hook runs before the SIGTERM is sent to the process.
  • Set terminationGracePeriodSeconds to be longer than the drain duration (e.g., 45 seconds for a 30-second drain). This gives the pod time to complete the drain before Kubernetes force-kills it.
  • The readiness probe must fail immediately when the drain starts (so Kubernetes removes the pod from the Service endpoint and the load balancer stops sending new connections). The liveness probe must continue passing during the drain (so Kubernetes does not restart the pod prematurely).
  • Use a PodDisruptionBudget to ensure Kubernetes does not drain too many WebSocket pods simultaneously. If you have 10 pods, a PDB of maxUnavailable: 1 ensures only one pod drains at a time, preventing a scenario where 50% of your capacity is simultaneously draining.
The gotcha: Kubernetes Services have an endpoint propagation delay. After the readiness probe fails, it can take 1-5 seconds for all kube-proxy instances across the cluster to update their routing tables. During that window, new connections may still arrive at the draining pod. The pod must handle these gracefully — either accept them (they will be short-lived) or reject them with a “try again” response.

Follow-up: What happens if the reconnect advisory message itself is lost because the client’s connection is already degraded?

If the client never receives the advisory, it will not proactively disconnect. It will discover the disconnection when the server force-closes the connection in Step 4, or when its next heartbeat ping gets no response. From the client’s perspective, this looks like an unexpected disconnection, and its standard reconnection logic kicks in (exponential backoff with jitter, reconnect to a new server, send lastMessageId for gap-fill). The user experience is slightly worse — instead of a seamless background reconnection with zero visible interruption, the user might see a brief “reconnecting…” indicator for 1-2 seconds. But no messages are lost because the gap-fill mechanism catches up on missed messages. To minimize even this scenario, I would send the reconnect advisory 3 times over a 5-second window (with a unique advisory ID so the client deduplicates). If the connection is degraded but not dead, repetition increases the chance of at least one advisory getting through.

Q12: Your WebSocket server uses JSON for message serialization. A staff engineer proposes switching to Protocol Buffers. Evaluate this proposal.

What the interviewer is really testing: Can you evaluate a technical proposal by quantifying the trade-offs rather than just listing pros and cons? Do you resist or embrace changes based on evidence? Strong answer: My first question would be: “What is the problem we are trying to solve?” If the answer is “JSON is slow” without data, I would push back. If the answer is “our profiling shows 30% of CPU time is spent on JSON serialization, and our message throughput is bottlenecked at 50K messages/second per server,” then the proposal has merit and I would evaluate it seriously. Where Protocol Buffers help:
  • Serialization speed. Protobuf serialization is typically 5-10x faster than JSON serialization and 2-5x faster than JSON parsing. For a WebSocket server processing 50K messages per second, if each message takes 0.1ms to serialize as JSON but 0.01ms as Protobuf, that is the difference between 5 seconds and 0.5 seconds of CPU time per second — a meaningful reduction that frees CPU for message routing.
  • Message size. Protobuf messages are typically 30-50% smaller than their JSON equivalents. Field names are replaced by numeric tags, types are encoded efficiently (varints for integers), and there is no quoting or delimiters. For a chat message, JSON might be 200 bytes; Protobuf might be 120 bytes. At 50K messages per second to 50K connections, that is a bandwidth saving of 200GB/hour. At cloud bandwidth prices, that adds up.
  • Schema enforcement. Protobuf requires a schema definition (.proto file). This acts as a contract between client and server. Malformed messages fail at deserialization rather than causing runtime errors when a field is missing or the wrong type. This catches integration bugs earlier.
Where Protocol Buffers hurt:
  • Debugging and observability. JSON is human-readable. When I tail a WebSocket connection’s traffic with wscat or inspect it in browser devtools, I can read the messages. Protobuf is binary — I need a decoder and the .proto file to make sense of it. This makes debugging production issues slower. Every engineer on the team needs the proto file and a decoder tool, and log inspection becomes a multi-step process.
  • Client complexity. Browser clients need a Protobuf library (protobuf.js or protobuf-ts). This adds bundle size (protobuf.js is ~40KB minified+gzipped). Native JSON parsing in browsers is implemented in C++ inside the JS engine and is highly optimized — the gap between JSON and Protobuf in browser JavaScript is much smaller than in server-side Java or Go.
  • Schema evolution. Adding a field to a Protobuf message is backward-compatible (old clients ignore new fields). Removing or renaming a field is not. You must maintain field numbers forever. With JSON, clients simply ignore fields they do not recognize, and schema evolution is more forgiving (though less safe).
  • Migration cost. Every client and server must be updated simultaneously, or you need a compatibility layer that accepts both JSON and Protobuf during the transition. For a mobile app with a 2-week release cycle and users who do not update, this transition period could last months.
My recommendation framework: If message throughput is not a measured bottleneck, keep JSON. The debugging and development velocity advantages outweigh the theoretical performance gains. If profiling shows serialization is a top-3 CPU consumer and message throughput is the scaling constraint, Protobuf is the right move — but invest in tooling (proto-to-JSON debug proxy, logging middleware that decodes Protobuf for log inspection, browser devtools extension). A middle ground that I have seen work well: use JSON for control messages (auth, subscribe, error responses — low volume, debugging matters) and Protobuf for data messages (chat messages, game state updates — high volume, performance matters). This requires a message framing layer that indicates the encoding, but it gives you the best of both worlds.

Follow-up: What about MessagePack or CBOR as alternatives that are binary but self-describing?

MessagePack and CBOR sit between JSON and Protobuf on the spectrum. They are binary (smaller, faster to parse than JSON) but self-describing (field names are included, no schema required). This means you get roughly 50-70% of Protobuf’s size reduction and 2-3x JSON’s serialization speed, without the schema management overhead. MessagePack is the more practical choice for WebSocket upgrades from JSON because the data model is identical — you can convert JSON to MessagePack mechanically. No schema definition, no code generation, no proto files to manage. Libraries are mature and available in every language. Socket.IO supports MessagePack as a drop-in serialization alternative. The trade-off compared to Protobuf: MessagePack is larger (field names are still transmitted) and does not enforce a schema (so you do not get the contract-enforcement benefit). If your primary goal is performance, MessagePack gives you 70% of the benefit at 20% of the migration cost. If your goal is both performance and contract enforcement, Protobuf is the right investment.

Going Deeper: At what scale does serialization format actually matter? Give me the numbers.

Let me work through the math for a concrete scenario: a chat service handling 100K concurrent connections with an average message rate of 500 messages per second, each message averaging 250 bytes as JSON. JSON: 500 messages/second times 250 bytes times 100K fan-out (worst case, large channel broadcast) = 12.5 GB/second of outbound serialized data. But most messages fan out to small conversations (average 5 recipients), so realistic outbound is 500 times 5 times 250 bytes = 625 KB/second. CPU for serialization: at ~100 microseconds per JSON serialize in Node.js, 500 messages/second costs 50ms of CPU per second — about 5% of one core. Negligible. Protobuf: Same scenario, messages are 150 bytes (40% smaller). Outbound is 375 KB/second. CPU for serialization: ~10 microseconds per message, 500 messages/second costs 5ms of CPU per second — 0.5% of one core. The verdict: At 500 messages per second, the serialization format does not matter. The CPU and bandwidth differences are noise. At 50,000 messages per second (a high-throughput trading or gaming system), the difference becomes meaningful: JSON consumes 5 seconds of CPU per second on serialization alone (you need 5 cores just for serialization), while Protobuf consumes 0.5 seconds (half a core). That is where the investment in Protobuf pays off. The rule of thumb I use: below 5,000 messages per second per server, stick with JSON. Between 5,000-50,000, consider MessagePack. Above 50,000, Protobuf or FlatBuffers (zero-copy deserialization) become necessary.

Advanced Interview Scenarios

These questions are designed to expose blind spots that survive even strong preparation. Several have counterintuitive answers. All demand that candidates show judgment, not just knowledge.

Q13: Your product manager says “we need real-time updates on the dashboard.” You have 48 hours. The obvious answer is WebSockets. Why might that be the wrong call, and what do you actually build?

What weak candidates say:“WebSockets are the standard for real-time, so I would set up a WebSocket server, connect the dashboard clients, and push updates.” They jump straight to implementation without questioning whether the requirement actually needs WebSockets. Some mention Socket.IO as if it solves all problems.What strong candidates say:The first thing I do is interrogate what “real-time” actually means to the PM. In my experience, 80% of the time someone says “real-time,” they mean “fresh enough that users do not notice staleness” — which is 5-10 seconds, not 50 milliseconds. That distinction changes the architecture completely.
  • If “real-time” means sub-second updates and the dashboard has fewer than 10,000 concurrent viewers: SSE is the correct answer, not WebSocket. The data flows one direction (server to client), the browser’s EventSource API gives me auto-reconnect and Last-Event-ID resumption for free, and SSE works through every corporate proxy and CDN without configuration changes. I can have this running in 4 hours, not 48. At Cloudflare’s scale, their analytics dashboard uses SSE for exactly this reason — it is operationally invisible compared to WebSocket.
  • If “real-time” means every-5-seconds freshness and the data source is a SQL database: I would use short polling with a 5-second interval from the client. A simple setInterval(() => fetch('/api/dashboard'), 5000). No persistent connections, no new infrastructure, no new failure modes. The dashboard queries hit a read replica. At 10,000 concurrent dashboards polling every 5 seconds, that is 2,000 requests per second — trivial for any HTTP server behind a CDN. This sounds unsophisticated, but it is what Grafana does for most of its dashboard panels. Grafana’s default refresh interval is 5 seconds of plain HTTP polling. Nobody calls Grafana “not real-time.”
  • If the dashboard needs sub-100ms updates with bidirectional interaction (users clicking to drill down triggers real-time data exploration): Now WebSocket is justified. But even then, in a 48-hour deadline, I would reach for a managed service: AWS AppSync with GraphQL subscriptions, or Supabase Realtime (which wraps PostgreSQL’s WAL into a WebSocket stream). Self-hosting WebSocket infrastructure in 48 hours means skipping reconnection logic, backpressure, monitoring, and graceful shutdown — all of which will bite you in production.
The meta-lesson: the engineering maturity signal here is not knowing how to build WebSocket infrastructure. It is knowing when not to. The cheapest, most reliable real-time system is the one that avoids persistent connections entirely.War Story: At a fintech startup I advised, the team spent three weeks building a WebSocket-based dashboard for account balances. After launch, they discovered that balance updates happened at most once per minute (when settlement batches completed). They had built persistent-connection infrastructure to deliver data that changed 60 times per hour. We replaced it with a 10-second polling endpoint in an afternoon. Server costs dropped by 40% because we no longer needed dedicated WebSocket server instances, and the ops burden of monitoring connection health disappeared entirely.

Follow-up: The PM comes back and says the dashboard now needs to serve 500,000 concurrent viewers during a product launch event. Does your answer change?

At 500K concurrent viewers, the answer is SSE behind a CDN, and the architecture looks nothing like a traditional WebSocket deployment. The key insight is that a dashboard broadcasting the same data to all viewers is a content distribution problem, not a messaging problem.I would put Cloudflare or Fastly in front of my SSE origin servers. The CDN edge nodes each hold one SSE connection to my origin and fan out to thousands of local viewers. My origin serves maybe 200 SSE connections (one per CDN edge PoP worldwide), not 500,000. The CDN handles the fan-out at the edge layer. Fastly’s streaming capabilities and Cloudflare Workers with TransformStream both support this pattern.The math: 500K connections at 40KB each would require 20GB of RAM if served directly — 10-20 WebSocket servers, a pub/sub backbone, connection registry, monitoring. With a CDN-fronted SSE approach, I need 2-3 small origin servers and a CDN plan that costs a fraction of the server fleet. This is how the BBC and ESPN serve live score updates to millions of viewers.

Follow-up: A colleague argues that polling “wastes resources” compared to WebSocket. How do you respond with data?

This is a common misconception that deserves a quantitative rebuttal. Let me compare the actual resource consumption.Polling at 5-second intervals, 10,000 clients: 2,000 HTTP requests per second. Each request is stateless, handled in ~5ms, consuming ~50KB of memory transiently. Peak concurrent connections: ~10 (requests per second times response time). Server memory for connections: effectively zero because connections close immediately.WebSocket with 10,000 clients: 10,000 persistent connections. Each consumes 20-50KB of memory continuously (TCP buffers, TLS state, application state). Total: 200-500MB of dedicated RAM, plus file descriptor overhead, heartbeat processing, connection registry maintenance, and reconnection handling. A message every 5 seconds is 2,000 messages per second — the same throughput as polling, but with 500MB of persistent memory overhead.For infrequent updates (every 5+ seconds), polling is often cheaper than WebSocket in total resource consumption. WebSocket wins when update frequency is high (multiple times per second) or when latency below the polling interval matters. The break-even point in my experience is around 1-2 updates per second per client — below that, polling is more resource-efficient.The real cost of WebSocket is not CPU or bandwidth. It is operational complexity: monitoring connection health, handling reconnection storms, managing graceful deploys, and debugging issues that only manifest with persistent connections.

Q14: You have a Slack-like application with one channel that has 500,000 members. A user posts a message. Walk me through exactly what happens and where it breaks.

What weak candidates say:“We look up all 500,000 members, find their WebSocket connections, and push the message to each one.” They describe the fan-out as a simple loop and do not recognize that this is a fundamentally different problem from a 50-person channel.What strong candidates say:A 500K-member channel is a qualitatively different beast from a normal channel. The naive approach — push the full message to every member’s WebSocket connection — does not just perform poorly; it can take down your entire real-time infrastructure. Let me trace the message through the system and identify every failure point.
  • Step 1: Message arrives at a WebSocket server. A user sends a message. The server persists it (Cassandra write, ~5ms) and publishes to the pub/sub backbone. So far, identical to a small channel. No problem.
  • Step 2: Fan-out begins — and this is where it breaks. The system needs to determine which of 500K members are currently connected and where. If 10% are online, that is 50,000 WebSocket connections spread across 20 servers. The connection registry lookup (SMEMBERS channel:500k-channel in Redis) returns 50,000 entries. Just reading this set from Redis takes measurable time — at 500 bytes per entry, that is 25MB of data transferred from Redis. On a 10Gbps network link, that is 20ms just for the read, but the real cost is Redis CPU time iterating the set.
  • Step 3: Server-to-server routing. The message must reach 20 WebSocket servers. With Redis Pub/Sub, this is fine — publish once, 20 subscribers receive it. But each of those 20 servers must then write the message to 2,500 local connections each. Serializing a 500-byte JSON message 2,500 times takes ~250ms of CPU (at 100us per serialization). If the server serializes once and sends the same buffer to all connections, the CPU cost drops to negligible, but the I/O cost remains: 2,500 write() syscalls, each pushing data into a TCP send buffer.
  • Step 4: The thundering herd on the client. 50,000 clients receive the message roughly simultaneously. If the message contains a user avatar URL that is not cached, 50,000 clients fetch the avatar at the same time. If the message triggers a read receipt update, 50,000 clients send read receipt events back to the server. These secondary effects amplify the original fan-out.
How production systems actually handle this:The answer is not to push the full message to all 500K members. Slack, Discord, and Teams all use tiered delivery for large channels:
  1. Active viewers (channel is on-screen): Push the full message via WebSocket. This might be 2,000-5,000 users. This is the only group that needs instant delivery.
  2. Sidebar-visible (channel is in the sidebar but not actively viewed): Push only a lightweight badge update: {"channel_id": "xyz", "unread_count": 47}. This is ~50 bytes instead of the full message. The full message is fetched when the user clicks into the channel. Another 10,000-20,000 users.
  3. Muted or inactive members: Do nothing. Do not push anything. When the user next opens the app, the client fetches unread counts via HTTP. The remaining 470,000+ members.
This tiered approach reduces the fan-out from 500,000 to maybe 25,000 — a 20x reduction. And the 25,000 includes 20,000 badge-only updates that are 10x smaller than the full message. The effective load is equivalent to broadcasting to maybe 4,000-5,000 full recipients.War Story: Discord’s engineering blog describes exactly this pattern. They call it “lazy guilds.” For servers with more than 1,000 members, Discord does not push full message content to all connected members. The client requests messages for a channel only when the user scrolls to it. Discord found that in a 100,000-member server, fewer than 1% of members are actively viewing any given channel at any given time. Pushing messages to the other 99% was pure waste — consuming pub/sub bandwidth, WebSocket server CPU, and client battery life for messages that would never be read in real-time.

Follow-up: The 500K-member channel receives 200 messages per minute during a company all-hands. How do you prevent this from degrading performance for the rest of the platform?

200 messages per minute is ~3.3 messages per second. With 5,000 active viewers each needing the full message, that is 16,500 message deliveries per second from that single channel. This is significant but not catastrophic — unless the channel’s traffic competes for the same pub/sub and WebSocket server resources as every other channel on the platform.Isolation is the key architectural decision:
  • Dedicated pub/sub topic/partition for hot channels. Instead of routing the 500K channel through the same Redis Pub/Sub instance as smaller channels, assign it a dedicated Redis instance or a dedicated Kafka partition. This prevents the hot channel’s traffic from increasing latency for all other channels. At Slack’s scale, they dynamically promote channels to “hot” status when message rate exceeds a threshold, and route them to dedicated infrastructure.
  • Dedicated WebSocket server group. If the 500K channel is predictably hot (company all-hands are scheduled), pre-assign a pool of WebSocket servers specifically for users viewing this channel. This is capacity isolation — a burst in the hot channel cannot starve connection capacity for users in other channels.
  • Server-side message batching. Instead of pushing each of the 3.3 messages per second individually, batch them into 1-second windows: push a batch of 3-4 messages every second. This reduces the number of WebSocket write operations by 3-4x and allows the client to render messages in chunks rather than one-by-one, which is smoother for the user and cheaper for the client’s rendering engine.
  • Client-side rate limiting on read receipts. If 5,000 users each send a read receipt for each of the 200 messages, that is 1,000,000 read receipt events per minute flowing back to the server. Throttle: the client sends at most one read receipt per channel per 5 seconds, reporting the highest message ID seen. This reduces read receipt traffic from 1M to 60K per minute.

Follow-up: How do you detect that a channel has become “hot” before it causes problems?

Hot channel detection requires real-time monitoring of per-channel metrics, not just aggregate metrics. Most monitoring setups track total message throughput and total connection count but miss per-channel distribution. A platform handling 100K messages per second looks healthy in aggregate, but if 80K of those messages are going to one channel, that channel is a ticking bomb.What I would instrument:
  • Per-channel message rate (messages per second, computed as a sliding window counter in Redis or an in-process counter). Alert when any single channel exceeds a threshold — say, 10x the platform median channel rate.
  • Per-channel fan-out cost (number of WebSocket deliveries per message). A message in a 5-person DM costs 4 deliveries. A message in the 500K channel costs 5,000+ deliveries. Track this as a metric: fan_out_ratio = deliveries / messages. Alert when it exceeds a threshold.
  • Per-channel pub/sub consumer lag. If one channel’s messages are accumulating faster than they are being consumed, that channel is the bottleneck. Kafka per-partition lag is the direct signal.
At Slack, they built an internal tool called “Channel Health” that displays per-channel message rates, member counts, and delivery latency in real-time. When a channel crosses into “hot” territory, it is automatically promoted to dedicated infrastructure. The promotion is reversible — when the burst subsides, the channel returns to shared infrastructure.

Q15: A mobile user is on a subway, cycling between connectivity and dead zones every 30-90 seconds. Describe exactly what happens to their WebSocket connection and chat experience, and what you build to make it seamless.

What weak candidates say:“The WebSocket disconnects and reconnects. We use exponential backoff.” They treat the reconnection as a single event and do not consider the cascading effects on state, message delivery, and user experience.What strong candidates say:The subway scenario is the hardest real-world test for a real-time system because it combines rapid connection cycling, unpredictable disconnection duration, and a user who is actively trying to use the app. Let me trace through the full lifecycle.
  • Second 0-30 (connected): User is reading and sending messages normally. WebSocket is open, messages flow in both directions, heartbeats succeed. The client has a local message cache (SQLite on mobile) that mirrors the last N messages per conversation.
  • Second 30 (enters dead zone): The cellular connection drops. Neither the client nor server detects this immediately. The TCP connection is technically “open” on both sides — no FIN or RST was sent. The server’s next heartbeat ping (sent at second 60 if the interval is 30 seconds) will go unanswered.
  • Second 30-60 (dead zone, server unaware): The server continues to believe the client is connected. Messages arriving for this user are pushed to the WebSocket — but the write() call succeeds at the application level (data goes into the kernel’s TCP send buffer). The kernel attempts to send TCP segments but gets no ACKs back. TCP retransmission kicks in: retry after 200ms, 400ms, 800ms… The kernel’s send buffer fills up.
  • Second 60 (server detects): The heartbeat ping times out. The server marks the connection as dead and begins cleanup: removes the user from the connection registry, updates presence to “offline” (with a 30-second grace period for reconnection), and unsubscribes from pub/sub channels. Messages arriving after this point for the offline user are queued in the undelivered message store.
  • Second 60-90 (dead zone, client side): The mobile client’s WebSocket onclose event may not fire until the OS-level TCP timeout (which can be 30-120 seconds depending on the mobile OS’s TCP keepalive configuration). The client should not wait for onclose. Instead, it should detect the dead connection by its own heartbeat timeout: if the client sends a ping and gets no pong within 10 seconds, it considers the connection dead and enters “offline mode.”
  • Second 90 (exits dead zone): Cellular connectivity returns. The client’s reconnection logic activates.
What I build to make this seamless:
  1. Aggressive client-side dead connection detection. Do not rely on the WebSocket onclose event on mobile. Implement an application-level heartbeat where the client sends a ping every 15 seconds and expects a pong within 5 seconds. If two consecutive pings fail, transition to offline mode immediately. On iOS, use NWPathMonitor and on Android use ConnectivityManager to detect network state changes and trigger immediate reconnection attempts when connectivity returns, rather than waiting for the next backoff timer.
  2. Offline message queue with optimistic UI. While offline, the user can still compose and “send” messages. These go into a local outbound queue (SQLite, capped at 500 messages). The UI shows a clock icon indicating “pending.” On reconnect, the queue is flushed with idempotency keys. The server deduplicates using the client-generated message UUID. Users in a subway frequently type messages during dead zones and expect them to send when connectivity returns.
  3. Fast reconnection with minimal round-trips. On reconnect, the client sends a single “resume” message: {"type": "resume", "last_seq": {"general": 4271, "random": 8903, "dm-alice": 142}, "token": "jwt..."}. This combines authentication, channel re-subscription, and gap-fill request into one round-trip. The server responds with missed messages for all channels in a single batched response. Two round-trips total: TCP+TLS handshake (1 RTT if using TLS 1.3 with 0-RTT resumption), then resume+response (1 RTT). On a 4G connection with 50ms RTT, the user is fully caught up in 100ms after connectivity returns.
  4. Connection state indicator in the UI. A subtle bar at the top of the chat: green (connected), yellow (reconnecting, your messages will be sent when connection is restored), red (offline for >60 seconds, showing last-known time). This sets user expectations. The worst UX is when the user thinks they are connected but their messages are silently failing.
  5. Adaptive heartbeat interval. When the client detects it is on a mobile network (vs WiFi), reduce the heartbeat interval from 30 seconds to 15 seconds. Mobile connections drop more frequently, and faster detection means faster recovery. The trade-off is slightly more battery consumption from more frequent heartbeats, but modern mobile radios batch small packets efficiently.
War Story: WhatsApp’s engineering team published data showing that the average mobile user experiences 3-5 connection drops per hour in urban environments, and 10-15 per hour during commutes. Their reconnection path is optimized to complete in under 200ms on 4G, including TLS resumption and message gap-fill. They achieve this by keeping the TLS session ticket cached on the client and using a single-round-trip resume protocol that avoids the full handshake. The result: most users never see the “connecting…” indicator because the reconnection completes faster than the UI refresh rate.

Follow-up: The user has been offline for 4 hours (long flight). They open the app. What is different about this reconnection compared to the 90-second subway gap?

A 4-hour gap crosses the threshold where message-by-message replay is impractical. During a 90-second gap, the user might have missed 5-20 messages across a few channels. During a 4-hour gap, they might have missed 10,000+ messages across 50 channels. Replaying 10,000 messages over a cellular connection would take 30+ seconds and consume significant bandwidth.The approach changes from delta-sync to snapshot-sync:Instead of replaying individual messages, the server sends a lightweight state snapshot: for each channel the user is in, send {channel_id, unread_count, last_message_preview, last_message_timestamp}. This is maybe 200 bytes per channel, times 50 channels = 10KB. The client renders the channel list with unread counts immediately.When the user taps into a specific channel, then the client fetches the last 50 messages for that channel via a paginated HTTP API (not WebSocket). This lazy-loading approach means the user sees something useful in under 1 second (the channel list with counts) and loads message details on demand.Slack calls this “lazy boot” and it reduced their median app load time from 5 seconds to 1.5 seconds. The key insight: after a long offline period, the user does not need every message immediately. They need orientation — which channels have activity, how much did they miss — and then depth on demand.

Q16: Your WebSocket service is under a DDoS attack — an attacker is opening 100,000 connections per second. The connections authenticate successfully (the attacker has stolen valid API keys). How do you defend?

What weak candidates say:“Block the IP address” or “Add rate limiting on connections.” These are necessary but insufficient — a sophisticated attacker uses a botnet with thousands of IPs, and standard rate limiting does not help when the credentials are valid.What strong candidates say:This is one of the nastiest attack vectors for real-time systems because it exploits the fundamental resource asymmetry of WebSocket: the attacker spends almost nothing to open a connection, but the server allocates 20-50KB of memory, a file descriptor, pub/sub subscriptions, and a presence entry for each one. At 100K connections per second, the server fleet is exhausting resources within minutes even if each connection does nothing after connecting.Immediate triage (first 5 minutes):
  • Identify the compromised API keys. The attacker has valid keys, but they are likely using a small set of stolen credentials. Query the connection registry for API keys with abnormally high concurrent connection counts. A legitimate user might have 3-5 concurrent connections (phone, desktop, tablet, two browser tabs). An attacker using a stolen key might have 10,000. Identify keys with more than 20 concurrent connections and revoke them immediately. This is the fastest path to mitigation.
  • Per-key connection limit. Implement a hard cap: no single API key can have more than 50 concurrent WebSocket connections. Reject new connections that exceed this limit at the handshake stage, before allocating any resources. This should have been in place before the attack, but if it was not, deploy it as an emergency measure. Redis INCR with TTL on connections:{api_key} — increment on connect, decrement on disconnect, reject if the count exceeds 50.
  • Connection rate limit per key. Cap new connections per key to 5 per minute. A legitimate user reconnecting after a network blip creates 1-2 connections. An attacker trying to exhaust resources creates hundreds per second. Use a sliding window counter in Redis: INCR conn_rate:{api_key}:{minute_bucket}.
Deeper defenses (next hours):
  • Proof-of-work on the handshake. Before the WebSocket upgrade completes, require the client to solve a computational challenge (a hashcash-style proof-of-work). Legitimate clients spend 50-100ms computing the solution — imperceptible to a human user. An attacker trying to open 100K connections per second must now spend 100K times 100ms = 10,000 CPU-seconds per second on proof-of-work, which requires a massive botnet. This shifts the economic asymmetry back toward the defender.
  • Behavioral analysis post-connection. Legitimate users send and receive messages, subscribe to channels, send heartbeats with consistent timing, and exhibit bursty traffic patterns (type a message, wait, read, etc.). Attack connections tend to be idle (resource exhaustion attack) or send at fixed intervals (scripted). After 30 seconds, score each connection on a behavioral model. Connections that score below a threshold get challenged (CAPTCHA via a control message) or disconnected.
  • Connection cost accounting. Each WebSocket connection should have a “cost budget.” Subscribing to a channel costs 1 unit. Each channel subscription costs 0.1 units per second (ongoing). Sending a message costs 0.5 units. The budget starts at 100 units and replenishes at 1 unit per second. When the budget hits zero, the connection is throttled (messages queued, not delivered in real-time) and eventually disconnected. This prevents an attacker from subscribing to thousands of channels to amplify the pub/sub fan-out cost.
  • API key rotation and scoping. The stolen API keys are a credential management problem. Long-lived API keys for WebSocket access are a liability. Migrate to short-lived tokens (JWT with 15-minute expiry, refreshed via HTTP). Scope tokens to specific capabilities: a “dashboard viewer” token cannot subscribe to chat channels. This limits the blast radius of a compromised token.
War Story: In 2020, a gaming platform I consulted for experienced exactly this attack. The attacker had scraped 50,000 valid user session tokens from a compromised third-party browser extension. They opened 200K connections in 10 minutes, each subscribing to 100 game lobby channels, creating a 20M-subscription fan-out that overwhelmed the Redis Pub/Sub layer. The immediate fix was per-token connection limits (deployed in 20 minutes via a feature flag). The longer-term fix was migrating from long-lived session tokens to short-lived JWTs with a 10-minute expiry and adding per-connection subscription caps (max 50 channels per connection). The attack cost us approximately 45 minutes of degraded service. Without the per-token connection limit, we estimated it would have been a full outage within 5 more minutes.

Follow-up: The attacker changes tactics — instead of opening many connections, they open 100 connections and send 10,000 messages per second per connection to large channels, amplifying fan-out. How do you defend?

This is a message amplification attack, and it is more insidious because 100 connections look completely normal. The damage comes from the fan-out: if each message goes to a 10,000-member channel, those 10,000 messages per second become 100,000,000 deliveries per second. The pub/sub layer and WebSocket servers collapse under the fan-out, not the ingest.Defense layers:
  • Per-connection message rate limit. 10,000 messages per second from a single connection is trivially detectable. Cap at 10 messages per second for chat, 1 per second for channel messages in large channels. This should already be in place.
  • Per-user per-channel rate limit. Even at 10 messages per second, if the user posts to 100 different large channels in rotation, the aggregate fan-out is enormous. Track per-user message rate across all channels and cap it (e.g., 30 messages per minute total).
  • Fan-out cost awareness in rate limiting. This is the sophisticated defense: rate limits should not be flat across all channels. Posting to a 5-person DM has a fan-out cost of 4. Posting to a 500K-member channel has a fan-out cost of 5,000+. Weight the rate limit by fan-out cost. A user might be allowed 30 small-channel messages per minute, but only 2 large-channel messages per minute. Discord implements this: posting in large servers is more aggressively rate-limited than in small ones.
  • Temporary channel muting. If a channel’s inbound message rate exceeds a threshold (say, 10x its normal rate, sustained for 30 seconds), automatically enable “slow mode” — restrict posting to one message per user per 30 seconds. This is a product-level defense that most chat platforms already have, usually triggered manually by admins. Making it automatic for attack scenarios is a small but effective improvement.

Q17: You are debugging a production incident: two users in the same collaborative editing session are looking at different document content. The OT/CRDT system has silently diverged. How do you find and fix the root cause?

What weak candidates say:“Just have both clients reload the document from the server.” This fixes the symptom but not the root cause. The divergence will recur. Some candidates do not even recognize that divergence in a collaborative editing system is a critical bug — they treat it as a minor UI glitch.What strong candidates say:Silent divergence is the most dangerous bug category in collaborative editing systems because users may not notice it for hours or days. User A thinks the document says “deploy to production,” User B thinks it says “deploy to staging.” Both are confident they are reading the authoritative version. This can have real business consequences.My investigation approach:
  • Step 1: Confirm and characterize the divergence. Get both users’ document state (via a debug endpoint that exports the raw internal representation, not just the rendered text). Diff the two representations. Is the divergence in text content, in formatting, in block structure, or in metadata? A single-character divergence points to an off-by-one in a transformation function. A structural divergence (blocks in different order) points to a concurrent move/rearrange conflict that was resolved differently.
  • Step 2: Replay the operation log. Every collaborative editing system maintains an operation log — the ordered sequence of operations applied to the document. Export the full operation log from the server and the last N operations from each client’s local log. For OT systems, the server’s log is the authoritative sequence. For CRDTs, each client’s log includes operations from all peers.
  • Step 3: Find the divergence point. Apply the operation log from the beginning, one operation at a time, on two fresh document instances (simulating Client A and Client B). At each step, compare the document states. The first operation where the states diverge is the bug. This is tedious but deterministic. I would automate this as a replay tool that takes the operation log as input and outputs the divergence point.
  • Step 4: Analyze the diverging operation. The divergence point is typically one of these:
    • Transformation function bug (OT). Two concurrent operations that should have been transformed against each other were not, or the transformation produced an incorrect result. For example: User A inserts at position 10, User B deletes the range 8-12. The transformation should adjust User A’s insert position to account for the deletion, but a boundary condition was missed (insert at the start of the deleted range vs. end of the deleted range).
    • Merge order sensitivity (CRDT). Two concurrent operations were applied in different orders on different clients, and the CRDT’s commutativity guarantee was violated for a specific operation combination. This means the CRDT implementation has a bug — CRDTs must commute by definition. Check for missed edge cases in the merge function, especially around tombstone handling and concurrent inserts at the same position.
    • Clock/sequencing error. The server assigned the same sequence number to two different operations, or clock skew caused an operation to be classified as “concurrent” when it was actually causally dependent. This is an infrastructure bug, not an algorithm bug.
  • Step 5: Write a regression test. Extract the exact operation sequence that caused the divergence and encode it as an automated test. The test replays the operations in both orderings and asserts that the final document states are identical. Add this to the fuzz testing suite.
The nuclear option for recovery: If the divergence is detected in production and users are actively editing, the server can force-push a “canonical snapshot” to all clients: take the server’s current document state, broadcast it as a full snapshot, and have all clients replace their local state. This is destructive to any in-flight client-side operations but guarantees reconvergence. Notify users with a banner: “Document was resynchronized. Please verify your recent edits.” Every major collaborative editor has this mechanism as a last resort.War Story: The Figma engineering team disclosed that during early development, they encountered divergence bugs approximately once per million operations. At their scale (hundreds of millions of operations per day), that meant multiple divergence events per day. Their fix was not just patching individual transformation functions but building an automated divergence detection system: every 60 seconds, the server computes a hash of the authoritative document state and broadcasts it to all clients. Each client computes its own document hash and compares. If the hashes differ, the client reports the divergence (with its local operation log) and requests a full re-sync. This turned silent divergence into loud, automatically-detected, and automatically-recovered divergence.

Follow-up: How do you build automated divergence detection into a collaborative editing system?

The pattern is periodic checksum reconciliation:
  1. The server maintains a running hash (CRC32 or xxHash for speed) of the canonical document state. Every time an operation is applied, the hash is updated incrementally.
  2. Every N seconds (30-60), the server broadcasts a “checkpoint” message to all connected clients: {"type": "checkpoint", "seq": 48291, "hash": "a3f8c2b1"}.
  3. Each client, upon receiving the checkpoint, fast-forwards or rewinds its local state to sequence 48291 (applying or undoing operations as needed), computes the hash, and compares.
  4. If the hashes match, the client is in sync. No action needed.
  5. If the hashes differ, the client reports the divergence and requests a full document snapshot from the server. The client replaces its local state with the snapshot and replays any locally-buffered operations that have not yet been acknowledged by the server.
The engineering subtlety is that computing the hash must be fast enough to not block the editing experience. CRC32 on a 100KB document takes microseconds. On a 10MB document with embedded images, you need to hash only the text content and structural metadata, not the binary blobs (which are stored by reference, not inline).Google Docs uses a similar mechanism. Their “revision checkpoint” system periodically snapshots the document state and distributes the snapshot hash. This is also the mechanism that powers their version history — each checkpoint is a recoverable snapshot.

Q18: Your team runs 50 WebSocket servers behind an ALB. On Tuesday at 2 AM, a Redis cluster failover happens (primary dies, replica promotes). No messages are lost in Redis — but 30% of your users report they stopped receiving messages for 2 minutes. What happened?

What weak candidates say:“Redis went down so messages were lost.” But the question explicitly states no messages were lost in Redis. Weak candidates do not read the scenario carefully and jump to the most obvious explanation.What strong candidates say:This is a classic production incident that every team running Redis Pub/Sub for real-time fan-out hits eventually. The messages were not lost in Redis — they were lost between Redis and the WebSocket servers during the failover. Here is exactly what happened:
  • The Redis primary dies. The Redis Sentinel (or Redis Cluster) detects the failure and promotes a replica to primary. This takes 5-15 seconds depending on configuration (sentinel down-after-milliseconds plus election time plus failover time).
  • During the failover window, the 50 WebSocket servers’ Redis Pub/Sub connections are broken. Each server had a persistent TCP connection to the old primary for its SUBSCRIBE commands. When the primary dies, these connections either get a TCP RST (if the failure is clean) or simply time out (if the server process crashed). The Redis client library on each WebSocket server detects the disconnection and begins reconnecting.
  • Here is the critical failure: Redis Pub/Sub subscriptions are connection-scoped. When the connection to the old primary dies, all subscriptions on that connection are gone. The client library reconnects to the new primary, but it does not automatically re-subscribe to all the channels. Some Redis client libraries (like ioredis in Node.js) do re-subscribe automatically on reconnect. Others (like the basic redis package in Python) do not. If your library does not re-subscribe, the WebSocket server silently loses its pub/sub feed. It does not crash, does not log an error — it just stops receiving messages.
  • Even if the library re-subscribes automatically, there is a gap. Between the moment the old connection died and the moment the new subscription is active on the new primary, any messages published to those channels are lost. Redis Pub/Sub is fire-and-forget — there is no replay. Messages published during the 10-30 second reconnection window are delivered only to subscribers that maintained their connection (the other 70% of servers that reconnected faster, or that were connected to a different shard).
  • Why 30% and not 100%? Because not all 50 WebSocket servers reconnected at the same speed. Some detected the failure faster (shorter TCP timeout, more aggressive health checks), reconnected sooner, and re-subscribed. Their users saw at most a 2-3 second gap. The 30% of users on servers with slower reconnection saw the full 2-minute gap (the time it took for the server to detect the dead connection via TCP timeout, reconnect, re-subscribe, and for the gap-fill mechanism to catch up).
How to prevent this:
  • Use a Redis client library that auto-resubscribes. In Node.js, ioredis re-subscribes all channels on reconnect by default. Verify this behavior in your specific library. Write an integration test that kills the Redis primary, waits for failover, and verifies that the subscriber receives messages after reconnection.
  • Implement a pub/sub health check. Every 10 seconds, publish a “heartbeat” message on a dedicated channel. Each WebSocket server subscribes to this channel and tracks the last-received heartbeat. If no heartbeat arrives for 30 seconds, the server knows its pub/sub feed is broken and initiates a forced reconnection and resubscription, regardless of what the Redis client library thinks the connection state is.
  • Dual-path delivery for critical messages. The same hybrid Redis + Kafka pattern from Q2: publish durable messages to both Redis Pub/Sub (for real-time delivery) and Kafka (for durability). When a client reports a sequence gap on reconnect, the WebSocket server fetches missing messages from Kafka. The 2-minute pub/sub blackout becomes a 2-minute delay that is backfilled transparently.
  • Reduce failover detection time. Set down-after-milliseconds in Redis Sentinel to 5000ms (5 seconds) instead of the default 30000ms. Set the Redis client’s reconnection backoff to start at 100ms, not 1 second. Set TCP_USER_TIMEOUT on the subscriber socket to 10 seconds. These aggressive timeouts shrink the failure window from 2 minutes to 10-15 seconds.
War Story: This exact incident happened at a social media company I worked with in 2022. They ran 40 WebSocket servers with Redis Pub/Sub. During a routine Redis maintenance (planned failover), 25% of their users lost real-time updates for 90 seconds. The root cause was that their Python Redis client (redis-py) did not auto-resubscribe on reconnect. The servers reconnected to the new Redis primary within 5 seconds, but the subscription state was empty. No error was logged because the reconnection itself succeeded — the client just was not subscribed to anything. The fix was migrating to redis-py with the retry_on_timeout=True and a custom callback that re-issued all SUBSCRIBE commands. They also added the pub/sub heartbeat health check, which has caught 3 similar incidents since.

Follow-up: How would you design the pub/sub layer to be resilient to broker failures without the complexity of Kafka?

NATS is the answer that more teams should consider. NATS provides at-most-once pub/sub (like Redis) with significantly better clustering and failure handling. A NATS cluster of 3 nodes tolerates single-node failure with no subscription loss — subscribers connected to surviving nodes continue receiving messages without interruption. If a subscriber’s connected node dies, the NATS client library automatically reconnects to another cluster member and resubscribes. This is built into the NATS protocol, not bolted on as a client-library feature.For the gap-fill problem (messages lost during any failure window), combine NATS with NATS JetStream, which adds durable, replayable streams on top of NATS’s core pub/sub. JetStream gives you Kafka-like replay semantics with NATS’s operational simplicity (single binary, no Zookeeper, no partition management). Consumer offsets, replay from offset, and durable subscriptions — all built in.The trade-off compared to Redis Pub/Sub: NATS is a separate service to deploy and monitor. If you are already running Redis for caching and connection registry, adding NATS means another piece of infrastructure. But if your pub/sub failure mode is “silent subscription loss during Redis failover,” NATS eliminates that entire failure class.

Q19: A staff engineer on your team proposes using Cloudflare Durable Objects for the real-time collaboration layer instead of traditional WebSocket servers with Redis. Evaluate this architecture.

What weak candidates say:“I have not heard of Durable Objects” (which is fair — it is a newer technology) or “edge computing is just for CDNs.” They cannot evaluate an unfamiliar proposal on its merits because they lack a framework for reasoning about the trade-offs.What strong candidates say:This is an interesting proposal that represents a fundamentally different architecture from the traditional “WebSocket servers + pub/sub backbone” model. Let me evaluate it on its merits.What Durable Objects are: Cloudflare Durable Objects are single-threaded JavaScript objects that run on Cloudflare’s edge network. Each Durable Object has a unique ID, a persistent storage layer (key-value, backed by SQLite under the hood), and the ability to accept WebSocket connections directly. The critical property is that a Durable Object with a given ID runs on exactly one Cloudflare edge server at a time — it is a globally unique singleton. If two clients in different continents connect to the same Durable Object, the first client’s connection pins the object to a specific edge location, and the second client’s connection is routed to that same location.Why this is compelling for real-time collaboration:
  • The pub/sub backbone disappears. In the traditional architecture, if User A is on WebSocket Server 1 and User B is on Server 3, a Redis Pub/Sub layer routes messages between them. With Durable Objects, both users connect to the same Durable Object instance (one per document or conversation). All message routing happens in-process. No Redis, no Kafka, no cross-server communication. The entire fan-out problem for a single document collapses to a single-threaded event loop.
  • State is co-located with connections. The Durable Object holds the document state, the operation log, and the connected users’ WebSocket connections in the same process. No network hops to a database or cache for state reads. Mutation is a local operation with persistence to Durable Object storage (strong consistency within the object).
  • Automatic scaling at the document level. Each document is an independent Durable Object. 10,000 documents means 10,000 independent actors distributed across Cloudflare’s network. No capacity planning for WebSocket servers, no connection registry, no load balancing.
  • Global distribution with strong consistency per document. The Durable Object runs at one edge location, but Cloudflare’s network routes all connections to that location with sub-100ms latency for most of the world. The object can migrate closer to where most clients are, though the migration mechanism has latency.
Where the proposal breaks down:
  • Single-threaded performance ceiling. A Durable Object is a single JavaScript isolate. For a collaborative document with 5-50 concurrent editors, this is fine — the message volume is low. For a chat channel with 10,000 active participants sending messages at 100/second, a single-threaded JavaScript isolate cannot keep up. The traditional architecture handles this by distributing load across multiple WebSocket servers.
  • Geographic penalty for globally distributed users. The Durable Object runs at one location. If a document has editors in London and Tokyo, one group pays 200-250ms round-trip latency for every operation. Traditional architectures can mitigate this with regional read replicas or regional edge servers that handle display while routing writes to the primary. Durable Objects do not have a read replica model.
  • Vendor lock-in. Durable Objects are a Cloudflare-proprietary technology. There is no equivalent on AWS, GCP, or Azure. The collaboration logic — the most complex, business-critical code in the system — becomes inseparable from the Cloudflare platform. If Cloudflare changes pricing, has an outage, or discontinues the product, migration is a rewrite.
  • Debugging and observability are immature. Durable Objects run in a sandboxed environment with limited logging, no traditional APM integration, and no ability to SSH into the runtime to inspect state. Debugging a divergence bug in a Durable Object running on an edge server in Singapore is significantly harder than debugging the same issue on an EC2 instance you control.
  • Cost at scale is uncertain. Durable Objects bill per request, per millisecond of CPU time, and per GB of storage. For a collaborative editing workload with high message volume (hundreds of operations per second per document), the per-request cost may exceed the cost of dedicated WebSocket servers. I would build a cost model comparing: (a) Durable Objects at projected message volume, (b) self-managed WebSocket servers on reserved EC2 instances, and (c) a managed alternative like AWS AppSync.
My recommendation:For a startup building a collaborative editing product with fewer than 100,000 documents and fewer than 50 concurrent editors per document, Durable Objects are an excellent choice. They eliminate 80% of the infrastructure complexity (no pub/sub, no connection registry, no load balancing, no horizontal scaling logic) and let the team focus on the collaboration algorithm. The geographic latency penalty is acceptable for most use cases.For a platform operating at Figma or Google Docs scale — millions of documents, hundreds of concurrent editors, strict latency requirements across continents — the single-threaded ceiling and geographic penalty make Durable Objects insufficient as the sole architecture. But they could serve as the edge layer for a hybrid architecture: Durable Objects for low-traffic documents, with a traditional WebSocket+pub/sub backend for high-traffic or latency-sensitive documents, and a router that promotes documents between tiers based on usage patterns.War Story: PartyKit, an open-source framework built on Cloudflare Workers and Durable Objects, is being used by companies like Stashpad and tldraw for exactly this pattern — real-time collaboration at the edge. Their published benchmarks show sub-50ms operation propagation for users within the same continent as the Durable Object, and 150-250ms for cross-continental scenarios. For comparison, a traditional WebSocket + Redis Pub/Sub architecture within a single AWS region achieves 10-30ms, but requires significantly more infrastructure to set up and operate. The trade-off is infrastructure simplicity vs. latency control.
It depends on message volume, not connection count. Durable Objects can handle thousands of concurrent WebSocket connections — each connection is lightweight (the Cloudflare runtime is optimized for this). The constraint is CPU time per request, not connection count.At 500 concurrent editors with a typical collaborative editing workload (each user generates 1-5 operations per second while actively editing, but most users are reading at any given time), the realistic inbound operation rate is maybe 50-100 operations per second. Each operation requires: validate, transform/merge, persist, broadcast to 499 connections. If transformation and persistence take 1ms each and broadcasting is batched, total CPU per operation is ~5ms. At 100 operations per second, that is 500ms of CPU per second — 50% utilization of the single-threaded isolate. Comfortable, but approaching the point where latency starts increasing.The concern is not steady-state but burst. If 200 users paste large content blocks simultaneously (a common pattern during collaborative brainstorming), the operation queue spikes and the single thread becomes a bottleneck. Durable Objects do not have an auto-scaling mechanism within a single object — you cannot add more threads.Mitigation: implement operation batching within the Durable Object. Instead of processing each operation individually, collect operations for a 10ms window, transform them as a batch, persist once, and broadcast once. This amortizes the per-operation overhead and smooths out bursts.

Q20: Your company has a microservice that sends real-time price updates to trading dashboards. Latency requirement: p99 under 50ms from price change to screen update. Current p99 is 200ms. Where do you look?

What weak candidates say:“Optimize the WebSocket server” or “Use a faster serialization format.” These are generic answers that do not demonstrate the ability to systematically trace latency through a pipeline. Optimizing the wrong component wastes weeks.What strong candidates say:A p99 of 200ms means that 99% of price updates arrive within 200ms, but the target is 50ms. The gap is 150ms. Before optimizing anything, I need to know where that 150ms lives. In a price-update pipeline, the latency budget is consumed across multiple stages, and the fix depends entirely on which stage is the bottleneck.My latency decomposition approach:I instrument every stage with monotonic timestamps (not wall-clock time — use performance.now() in JavaScript, clock_gettime(CLOCK_MONOTONIC) in C, time.monotonic_ns() in Python) and emit them as structured log events or metrics:
  • T0: Price change event generated by the market data source (exchange feed or data provider API).
  • T1: Event received by our ingestion service. The gap T1-T0 is external network latency — we cannot optimize this directly, but we can choose data providers with lower latency (co-located feeds vs. REST API polling).
  • T2: Event processed and published to the internal message bus (Kafka topic or Redis Pub/Sub channel). The gap T2-T1 is processing time: validation, normalization, enrichment. If this is >5ms, profile the processing logic.
  • T3: Event consumed by the WebSocket/SSE server from the message bus. The gap T3-T2 is message bus latency. For Kafka, this includes partition leader write, replication, and consumer poll interval. For Redis Pub/Sub, this should be sub-1ms.
  • T4: Event serialized and written to the client’s WebSocket connection. The gap T4-T3 is server-side processing: serialization, TLS encryption, and the write() syscall.
  • T5: Event received by the client and rendered on screen. The gap T5-T4 is network latency (server to client) plus client-side rendering time.
Where the 150ms typically hides in trading dashboards:
  • Kafka consumer poll interval (the most common offender). Kafka consumers poll for messages in batches. The default fetch.min.bytes and fetch.max.wait.ms settings cause the consumer to wait up to 500ms to accumulate a batch before returning. For latency-sensitive consumption, set fetch.max.wait.ms=1 and fetch.min.bytes=1. This makes the consumer return immediately when any message is available, at the cost of more frequent, smaller fetches. This single configuration change often drops p99 from 200ms to 50ms.
  • Client-side rendering and JavaScript event loop. If the dashboard uses React or a similar framework, a price update arriving via WebSocket goes through: onmessage callback, state update, virtual DOM diff, DOM commit, browser paint. With a complex dashboard rendering 500 price cells, the React render cycle alone can take 50-100ms. The fix: bypass React for real-time price updates. Write directly to the DOM (element.textContent = newPrice) for the price cells, and use React only for structural changes. This is what Bloomberg Terminal’s web client and Refinitiv Eikon do — the real-time price grid uses direct DOM manipulation, not a declarative framework.
  • TLS overhead. Each WebSocket message goes through TLS encryption. For small messages (a price update is ~100 bytes), the per-message TLS overhead (generating the IV, encrypting, computing the MAC) can add 0.1-0.5ms. At p99, this is usually negligible, but if the server is CPU-saturated, TLS processing can queue up. Ensure the server uses hardware AES-NI acceleration (openssl speed aes-256-gcm should show >1 GB/s on modern CPUs).
  • Nagle’s algorithm. As mentioned in Q6, TCP_NODELAY not set can add up to 200ms of buffering for small messages. This is the first thing I check for any WebSocket latency issue. A single setsockopt(TCP_NODELAY, 1) call eliminates it.
  • Garbage collection pauses. If the WebSocket server runs on a GC’d runtime (JVM, Node.js, Go), GC pauses directly add to message delivery latency. A 100ms GC pause at the wrong moment pushes p99 past 200ms. Check GC logs. For JVM, use ZGC or Shenandoah (<1ms pause). For Node.js, check --max-old-space-size is generous enough to avoid frequent major GCs. For Go, GC pauses are typically <1ms in recent versions but verify with GODEBUG=gctrace=1.
The optimization order I would follow:
  1. Instrument every stage (1 day).
  2. Fix Kafka consumer poll interval if applicable (1 hour, configuration change).
  3. Set TCP_NODELAY everywhere (1 hour).
  4. Profile and fix GC pauses if present (1-3 days).
  5. Optimize client-side rendering if T5-T4 is the bottleneck (3-5 days).
  6. Consider replacing Kafka with Redis Pub/Sub or NATS for the real-time path (1-2 weeks, only if Kafka is confirmed as the bottleneck after step 2).
War Story: At a fintech company I worked with, the trading dashboard had a p99 of 350ms. After instrumenting every stage, we found: Kafka consumer poll interval accounted for 180ms (they were using default settings), Nagle’s algorithm on the WebSocket server added 60ms intermittently (TCP_NODELAY was not set), and React re-rendering the 800-cell price grid took 90ms on slower client machines. Fixing the Kafka settings took 5 minutes. Setting TCP_NODELAY took 10 minutes. Optimizing the client rendering (switching from React state updates to direct DOM writes for the price grid) took a week. The result: p99 dropped from 350ms to 35ms. The entire budget was consumed roughly equally across the three fixes.

Follow-up: The data provider sends price updates over a REST API that you poll every second. How does this change your latency analysis?

Polling at 1-second intervals introduces an inherent average latency of 500ms (half the polling interval) and a worst-case latency of 1000ms (the price changes right after you polled). No downstream optimization can overcome this — you are limited by the polling frequency.To meet the 50ms p99 target with a polled source:
  • Increase polling frequency to 50ms intervals (20 polls per second). This reduces the polling-induced latency to an average of 25ms and worst-case of 50ms. But this is 20 HTTP requests per second to the data provider, which may violate their rate limits or increase costs.
  • Negotiate a WebSocket or webhook feed from the data provider. Many financial data providers (Bloomberg B-PIPE, Refinitiv Elektron, IEX Cloud) offer streaming feeds that push updates in real-time. This eliminates the polling latency entirely. The cost is typically higher than a REST API plan.
  • Use a market data aggregator (like Polygon.io or Alpaca) that provides WebSocket streaming as a standard interface, even if the underlying exchanges provide data differently.
The fundamental lesson: in any real-time pipeline, the latency floor is determined by the slowest ingestion method. If you poll a REST API once per second, your p99 can never be below ~1 second, regardless of how fast the rest of the pipeline is. The first optimization is always to fix the ingestion.

Q21: You are tasked with building a “live cursors” feature — showing where every user’s cursor is on a shared whiteboard, like Figma. There are 200 concurrent users on the same board. What are the non-obvious engineering challenges?

What weak candidates say:“Each client broadcasts its cursor position over WebSocket whenever it moves. Other clients render it.” This describes the happy path and misses every hard problem.What strong candidates say:Live cursors look trivial — it is just x,y coordinates, right? — but at 200 concurrent users, the engineering challenges are surprisingly deep. The core problem is that cursor movements generate extremely high-frequency events (a mouse moving generates 60-120 position events per second), and broadcasting all of them to 199 other users creates a volume that neither the network nor the rendering engine can handle.Challenge 1: Event volume explosion.200 users, each generating 60 cursor events per second = 12,000 events per second inbound to the server. Each event must be broadcast to 199 other users = 2.4 million cursor deliveries per second. Each event is ~50 bytes (user_id, x, y, viewport info). That is 120 MB/second of outbound cursor data. This is untenable.Solution: Client-side throttling + server-side batching.
  • Client-side: Throttle cursor position broadcasts to at most 20 per second (every 50ms). Humans cannot perceive cursor update rates above ~30 fps, so 20 updates per second is visually smooth. This reduces inbound events from 12,000 to 4,000 per second.
  • Server-side: Batch all cursor updates into 50ms windows. Every 50ms, collect all cursor positions received, package them into a single batch message, and broadcast once. Instead of 4,000 individual messages, the server sends 20 batches per second, each containing up to 200 cursor positions. 20 broadcasts times 199 recipients = 3,980 deliveries per second. Manageable.
  • Client-side interpolation: Receiving cursors at 20fps would look jittery. The client interpolates between received positions, rendering smooth cursor movement at 60fps locally by tweening between the last two known positions. This is the same technique used for entity interpolation in multiplayer games.
Challenge 2: Viewport-based filtering.If the whiteboard is large (10,000 x 10,000 pixels), User A might be looking at the top-left quadrant while User B is looking at the bottom-right. There is no reason to send User A the cursor positions of users whose cursors are outside User A’s viewport. This is spatial filtering.Each client reports its viewport rectangle to the server. The server maintains a spatial index (a simple grid partition — divide the board into 100x100 cells) and only broadcasts a user’s cursor to clients whose viewport overlaps the cell containing that cursor. For a 10,000x10,000 board with 200 users distributed evenly, the average viewport might cover 10% of the board, so each user receives cursor updates for ~20 nearby users instead of 199. This is a 10x reduction in fan-out.Challenge 3: Z-order and name label rendering at scale.With 200 cursors, the whiteboard could be cluttered with 200 name labels. The client needs to manage visual clutter: fade out cursors that have not moved in 5 seconds, cluster overlapping name labels, prioritize rendering cursors near the user’s own cursor position.Challenge 4: Cursor position is in document coordinates, not screen coordinates.If the whiteboard supports zoom and pan, cursor positions must be transmitted in document-space coordinates and each client must transform them to its own screen-space coordinates based on its zoom level and pan offset. A cursor at document position (500, 300) appears at different screen positions for a user zoomed to 100% vs. one zoomed to 50%.Challenge 5: Cursor persistence across reconnection.When a user reconnects after a brief disconnection, they need the current positions of all other users’ cursors immediately — not a history of movements. The server maintains a last-known-position cache per user. On reconnect, the client receives a snapshot of all active cursor positions in a single message.War Story: Figma’s engineering team shared that cursor synchronization consumed more engineering effort than they expected. The initial implementation broadcast every cursor event to every client, and performance degraded noticeably above 20 concurrent editors. Their fix involved aggressive throttling (limiting broadcasts to 12 per second), viewport-based filtering (only sending cursor positions for users visible in your viewport), and a custom batching protocol that packs up to 50 cursor positions into a single binary WebSocket frame using a compact encoding (2 bytes for user ID, 4 bytes for x, 4 bytes for y = 10 bytes per cursor, 50 cursors = 500 bytes per batch). The final system handles hundreds of concurrent cursors with minimal CPU and bandwidth overhead.

Follow-up: Figma uses binary WebSocket frames for cursor data. A junior engineer on your team says “just use JSON, it is easier to debug.” Who is right and at what scale does it matter?

The junior engineer is right for a product with fewer than 30 concurrent cursors. The binary encoding matters at scale, and the threshold is lower than most people expect.JSON cursor update: {"uid":"u_123","x":1547.32,"y":892.14} = ~45 bytes. At 20 updates per second from 200 users, broadcast to 199 recipients with viewport filtering (assume 10x reduction): 20 * 200 * 20 * 45 = 3.6 MB/second of cursor data.Binary cursor update: 2 bytes (user ID as uint16) + 4 bytes (x as float32) + 4 bytes (y as float32) = 10 bytes. Same calculation: 20 * 200 * 20 * 10 = 0.8 MB/second.The 4.5x size difference matters when you consider that this data flows over WebSocket on every connected client’s network. On a mobile device with limited bandwidth or a user on a congested office WiFi, 3.6 MB/s of cursor data alone can cause noticeable degradation. At 0.8 MB/s, it is background noise.But the more impactful difference is CPU. JSON parsing a cursor update takes ~5-10 microseconds in V8. Binary parsing (DataView reads) takes <1 microsecond. At 4,000 cursor updates per second arriving at each client (200 users * 20 fps * viewport filter), JSON parsing consumes 20-40ms of CPU per second — 2-4% of one core at 60fps. Binary parsing consumes 4ms. On a low-power mobile device where the JavaScript thread is also handling rendering and user interaction, that 2-4% matters.My recommendation: use JSON during development and the first 6 months of the product. Switch to a binary protocol when you have performance data showing cursor sync is a measurable bottleneck, which will happen around 30-50 concurrent cursors. Build the binary protocol behind a feature flag so you can A/B test the performance impact.

Q22: Your chat system’s WebSocket servers are stateless — all state is in Redis and Postgres. A senior engineer argues this is wrong and you should keep user session state in-memory on the WebSocket server. Who is right?

What weak candidates say:“Stateless is always better because it makes scaling easier.” This is a reasonable default for HTTP services, but applying it blindly to WebSocket services reveals a lack of experience with persistent connections.What strong candidates say:This is one of those cases where the “obvious best practice” from the HTTP world — stateless servers — is actually suboptimal for WebSocket services. The senior engineer is right, and here is why.Why stateless WebSocket servers are slower than they need to be:In a stateless model, every time the server needs to route a message, it must:
  1. Look up the sender’s subscriptions (Redis call, ~1ms).
  2. Look up the recipient’s connection location (Redis call, ~1ms).
  3. Check the sender’s permissions for the target channel (Postgres or Redis, ~1-5ms).
  4. Fetch the sender’s display name for the message payload (Redis or Postgres, ~1ms).
That is 3-8ms of external lookups per message, added to every message’s processing latency. At 50,000 messages per second, that is 50,000 Redis round-trips per second just for subscription lookups. Redis can handle this, but each round-trip adds network latency and occupies a connection from the Redis connection pool.What in-memory state eliminates:If the WebSocket server maintains an in-memory map of {connection_id -> {user_id, display_name, subscribed_channels, permissions_cache}}, all of those lookups become in-process hash table reads: ~100 nanoseconds instead of ~1 millisecond. That is a 10,000x speedup per lookup. The per-message processing drops from 3-8ms to microseconds.The in-memory state is populated once on connection establishment (when the user authenticates and subscribes to channels) and updated incrementally on subscription changes. The total memory cost is modest: 200 users at 2KB of session state each = 400KB. Even 100K users at 2KB each = 200MB — well within a server’s RAM budget.The trade-off:
  • Server restarts lose state. On restart or crash, all in-memory session state is lost. Every connected client must re-authenticate and re-subscribe. But this already happens because the WebSocket connections themselves are lost on restart — clients reconnect and go through the full handshake including auth and subscription. The in-memory state is rebuilt naturally during reconnection.
  • State can become stale. If a user’s permissions change (they are removed from a channel) while they are connected, the in-memory permissions cache must be invalidated. Solution: subscribe to a “user_permissions_changed” event on the pub/sub backbone. When it fires, invalidate the affected user’s cache, and on the next message, re-fetch from the source of truth. The stale window is bounded by the pub/sub delivery latency (sub-second).
  • Cross-server lookups still need Redis. If Server A needs to route a direct message to a user on Server B, Server A does not have Server B’s in-memory state. The connection registry in Redis is still needed for cross-server routing. But cross-server lookups are the minority — most messages are delivered to users on the same server (especially with conversation-based connection sharding).
My recommendation:Use in-memory state as a cache with Redis as the source of truth. Every WebSocket server maintains an in-memory session cache. On connection open, populate from Redis. On message routing, read from the in-memory cache (hot path). On cache miss, fall back to Redis (cold path, rare after warmup). On external state change (permissions, subscription), invalidate via pub/sub. On server restart, the cache is cold and all lookups hit Redis until clients reconnect and warm it up.This is not stateful vs. stateless. It is hot-cache with fallback to source of truth. The WebSocket server is not the owner of the state — Redis is. But the server keeps a local copy for performance. This pattern is exactly how CPU caches work relative to main memory, and the same principle applies.War Story: Discord’s engineering blog describes this exact architecture. Their gateway servers (Elixir-based WebSocket servers) maintain per-connection state in Elixir process memory. Each connection is a lightweight Elixir process holding the user’s guild memberships, permissions, and active subscriptions. This in-memory state allows Discord to route messages without any Redis lookups in the hot path. They estimate this saves them 50-100 billion Redis reads per day at their current scale. The Elixir/BEAM VM is particularly well-suited for this because each connection is an isolated process — if one connection’s state is corrupted, it crashes only that process, not the entire server.

Follow-up: How do you handle the case where a user is connected on 5 devices simultaneously, and session state changes need to propagate to all of them?

Five concurrent connections for the same user may live on 5 different WebSocket servers. When the user’s state changes (e.g., they join a new channel from their phone), all 5 connections need to learn about the new subscription.The propagation mechanism:
  1. User joins channel from Device 1, which is connected to Server A. Server A updates Redis (the source of truth) and its own in-memory cache.
  2. Server A publishes a state-change event to the pub/sub backbone: {"type": "user_state_changed", "user_id": "123", "change": "channel_joined", "channel_id": "new-channel"}.
  3. All WebSocket servers receive this event. Each server checks its local connection map for user 123. Servers B and C (which host the user’s other devices) find matching connections, update their in-memory caches, and subscribe those connections to the new channel’s message stream. Servers D-Z, which do not host any of user 123’s connections, ignore the event.
  4. The user’s other 4 devices now reflect the new channel without any explicit action from the user.
The subtlety: this pub/sub event is broadcast to all 50 WebSocket servers, but only 2-3 of them have the relevant connections. This is a waste — 47 servers process an event that is irrelevant to them. At scale (10M connected users, each with 2-3 devices), the volume of per-user state-change events can be significant.Optimization: Instead of broadcasting to all servers, use the connection registry (which maps user_id -> [server_ids]) to publish the state-change event only to the servers that host the user’s connections. This is a targeted publish rather than a broadcast. Redis supports this with per-server channels: publish to state_changes:server_a, state_changes:server_b, etc.