Skip to main content
WhatsApp Messaging System Design

Problem Statement

Design a messaging application like WhatsApp that:
  • Supports 1:1 and group messaging
  • Delivers messages in real-time
  • Stores chat history
  • Shows online status and read receipts
  • Supports media sharing

Step 1: Requirements Clarification

Functional Requirements

Core Features

  • 1:1 messaging
  • Group chats (up to 500 members)
  • Message delivery & read receipts
  • Online/offline status
  • Push notifications

Extended Features

  • Media sharing (images, videos, files)
  • Voice/video calls
  • End-to-end encryption
  • Message search
  • Typing indicators

Non-Functional Requirements

  • Low Latency: Messages delivered in <100ms
  • High Availability: 99.99% uptime
  • Message Ordering: Messages appear in order
  • Durability: No message loss
  • Scale: 2 billion users, 100 billion messages/day

Capacity Estimation

┌─────────────────────────────────────────────────────────────────┐
│                 WhatsApp Scale Estimation                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Users:                                                         │
│  • 2 billion total users                                       │
│  • 500 million DAU                                             │
│  • 60 million concurrent connections at peak                   │
│                                                                 │
│  Messages:                                                      │
│  • 100 billion messages/day                                    │
│  • Average message size: 100 bytes                             │
│  • Peak: 100B / 86,400 ≈ 1.15 million messages/second          │
│                                                                 │
│  Storage:                                                       │
│  • Daily text: 100B × 100 bytes = 10 TB/day                    │
│  • Daily media (20% with 100KB avg): 20B × 100KB = 2 PB/day   │
│                                                                 │
│  Connections:                                                   │
│  • 60M concurrent WebSocket connections                        │
│  • Each connection: ~10 KB memory                              │
│  • Total: 600 GB just for connections                          │
│                                                                 │
│  Bandwidth:                                                     │
│  • 1.15M msg/sec × 100 bytes = 115 MB/sec (text only)         │
│  • With media: ~50 GB/sec                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Step 2: High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                 WhatsApp Architecture                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                      ┌───────────────┐                         │
│                      │    Clients    │                         │
│                      │ (Mobile/Web)  │                         │
│                      └───────┬───────┘                         │
│                              │ WebSocket                       │
│                      ┌───────▼───────┐                         │
│                      │     CDN       │ (for media)             │
│                      │ + Load Balancer│                        │
│                      └───────┬───────┘                         │
│                              │                                  │
│        ┌─────────────────────┼─────────────────────┐           │
│        │                     │                     │            │
│  ┌─────▼─────┐        ┌──────▼──────┐       ┌─────▼─────┐     │
│  │  Gateway  │        │   Gateway   │       │  Gateway  │     │
│  │ Server 1  │        │  Server 2   │       │ Server N  │     │
│  │(WebSocket)│        │ (WebSocket) │       │(WebSocket)│     │
│  └─────┬─────┘        └──────┬──────┘       └─────┬─────┘     │
│        │                     │                    │            │
│        └─────────────────────┼────────────────────┘            │
│                              │                                  │
│                       ┌──────▼──────┐                          │
│                       │   Message   │                          │
│                       │   Router    │                          │
│                       └──────┬──────┘                          │
│                              │                                  │
│   ┌──────────────────────────┼──────────────────────────┐      │
│   │                          │                          │       │
│   ▼                          ▼                          ▼       │
│ ┌──────────┐          ┌──────────────┐          ┌──────────┐  │
│ │ Message  │          │   User &     │          │  Media   │  │
│ │   DB     │          │   Session    │          │  Storage │  │
│ │(Cassandra)│          │   Store     │          │   (S3)   │  │
│ └──────────┘          └──────────────┘          └──────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Core Services

ServiceResponsibility
Gateway ServiceWebSocket connections, message routing
Message ServiceStore/retrieve messages, delivery tracking
User ServiceUser profiles, contacts, blocking
Session ServiceTrack online/offline, connection mapping
Push ServiceOffline notifications
Group ServiceGroup management, membership
Media ServiceUpload, process, serve media

Step 3: WebSocket Connection Management

Connection Flow

┌─────────────────────────────────────────────────────────────────┐
│                 WebSocket Connection Flow                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Client connects to WebSocket                               │
│     ────────────────────────────                                │
│     wss://chat.whatsapp.com/ws?token=xxx                       │
│                                                                 │
│  2. Gateway validates token                                     │
│     ──────────────────────────                                  │
│     Verify JWT, extract user_id                                │
│                                                                 │
│  3. Register connection                                         │
│     ─────────────────────                                       │
│     Session Store: user_id → {gateway_id, connection_id}       │
│                                                                 │
│  4. Subscribe to user's message queue                          │
│     ─────────────────────────────────                           │
│     Gateway subscribes to "messages:{user_id}"                 │
│                                                                 │
│  5. Send pending messages                                       │
│     ───────────────────────                                     │
│     Deliver any messages queued while offline                  │
│                                                                 │
│  6. Heartbeat                                                   │
│     ─────────                                                   │
│     Client sends ping every 30 seconds                         │
│     Server responds with pong                                  │
│     No ping for 60s → connection dead                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Session Store Design

# Redis structure for session management

# User's current connection(s)
# Key: session:{user_id}
# Value: {gateway_id, connection_id, last_seen, device_info}

class SessionStore:
    def __init__(self, redis_client):
        self.redis = redis_client
    
    def register_connection(self, user_id, gateway_id, connection_id, device):
        session_data = {
            "gateway_id": gateway_id,
            "connection_id": connection_id,
            "device": device,
            "connected_at": time.time()
        }
        
        # Support multiple devices
        self.redis.hset(
            f"session:{user_id}",
            device,
            json.dumps(session_data)
        )
        
        # Update last seen
        self.redis.zadd("online_users", {user_id: time.time()})
    
    def get_connections(self, user_id):
        """Get all active connections for a user"""
        sessions = self.redis.hgetall(f"session:{user_id}")
        return [json.loads(s) for s in sessions.values()]
    
    def is_online(self, user_id):
        last_seen = self.redis.zscore("online_users", user_id)
        if not last_seen:
            return False
        return time.time() - last_seen < 60  # 60 second threshold
    
    def remove_connection(self, user_id, device):
        self.redis.hdel(f"session:{user_id}", device)
        
        # Check if user has other active sessions
        if not self.redis.exists(f"session:{user_id}"):
            self.redis.zrem("online_users", user_id)

Step 4: Message Delivery

Message Flow

┌─────────────────────────────────────────────────────────────────┐
│                 Message Delivery Flow                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    Sender                                                Receiver│
│       │                                                      │   │
│       │ 1. Send message                                      │   │
│       ├────────────────────────────────────────────────────►│   │
│       │                                                      │   │
│       │   {                                                  │   │
│       │     "msg_id": "uuid",                               │   │
│       │     "to": "receiver_id",                            │   │
│       │     "content": "Hello!",                            │   │
│       │     "timestamp": 1705312800                         │   │
│       │   }                                                  │   │
│       │                                                      │   │
│   ┌───┴───┐                                              ┌───┴───┐
│   │Gateway│                                              │Gateway│
│   │   A   │                                              │   B   │
│   └───┬───┘                                              └───┬───┘
│       │                                                      │   │
│       │ 2. Store in DB                                       │   │
│       ├─────────────►[Message DB]                           │   │
│       │                                                      │   │
│       │ 3. Lookup receiver's gateway                        │   │
│       ├─────────────►[Session Store]                        │   │
│       │◄─────────────  gateway_B                            │   │
│       │                                                      │   │
│       │ 4. Route message                                     │   │
│       ├─────────────────────────────────────────────────────►│   │
│       │                                                      │   │
│       │                                          5. Deliver  │   │
│       │                                          ◄───────────┤   │
│       │                                                      │   │
│       │◄─────────────────────────────────────────────────────┤   │
│       │              6. ACK (delivered)                      │   │
│       │                                                      │   │
│   [Update DB: delivered=true]                               │   │
│       │                                                      │   │
│       │◄─────────────────────────────────────────────────────┤   │
│       │              7. ACK (read)                           │   │
│       │                                                      │   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Message States

┌─────────────────────────────────────────────────────────────────┐
│                    Message State Machine                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐      │
│  │  SENT  │────►│ STORED │────►│DELIVERED│────►│  READ  │      │
│  └────────┘     └────────┘     └────────┘     └────────┘      │
│      │              │              │              │             │
│      │              │              │              │             │
│   1 Tick         2 Ticks        2 Blue Ticks  (in UI)         │
│                                                                 │
│  SENT:      Message sent to server                             │
│  STORED:    Message persisted in database                      │
│  DELIVERED: Message received by recipient's device             │
│  READ:      Message opened/viewed by recipient                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Message Service Implementation

class MessageService:
    def __init__(self, db, session_store, message_router, push_service):
        self.db = db
        self.sessions = session_store
        self.router = message_router
        self.push = push_service
    
    async def send_message(self, sender_id, message):
        # 1. Generate message ID (for idempotency)
        msg_id = message.get("msg_id") or generate_uuid()
        
        # 2. Store message in database
        stored_message = await self.db.store_message({
            "id": msg_id,
            "conversation_id": self.get_conversation_id(sender_id, message["to"]),
            "sender_id": sender_id,
            "recipient_id": message["to"],
            "content": message["content"],
            "type": message.get("type", "text"),
            "status": "stored",
            "created_at": time.time()
        })
        
        # 3. Check if recipient is online
        recipient_sessions = self.sessions.get_connections(message["to"])
        
        if recipient_sessions:
            # 4a. Route to online recipient
            for session in recipient_sessions:
                await self.router.route_message(
                    session["gateway_id"],
                    session["connection_id"],
                    stored_message
                )
        else:
            # 4b. Send push notification for offline recipient
            await self.push.send_notification(
                message["to"],
                {
                    "title": f"New message from {sender_id}",
                    "body": message["content"][:100]
                }
            )
        
        # 5. Send ACK to sender
        return {"msg_id": msg_id, "status": "stored"}
    
    async def ack_delivered(self, msg_id, user_id):
        await self.db.update_message_status(msg_id, "delivered")
        
        # Notify sender about delivery
        message = await self.db.get_message(msg_id)
        sender_sessions = self.sessions.get_connections(message["sender_id"])
        
        for session in sender_sessions:
            await self.router.send_receipt(
                session["gateway_id"],
                session["connection_id"],
                {"msg_id": msg_id, "status": "delivered"}
            )
    
    async def ack_read(self, msg_ids, user_id):
        # Batch update for efficiency
        await self.db.update_messages_status(msg_ids, "read")
        
        # Group by sender and notify
        # ...

Step 5: Data Models

Message Storage (Cassandra)

-- Messages table partitioned by conversation
CREATE TABLE messages (
    conversation_id UUID,
    message_id      TIMEUUID,
    sender_id       UUID,
    content         TEXT,
    content_type    TEXT,        -- text, image, video, audio
    media_url       TEXT,
    status          TEXT,        -- stored, delivered, read
    created_at      TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Index for fetching unread messages per user
CREATE TABLE user_messages (
    user_id         UUID,
    conversation_id UUID,
    message_id      TIMEUUID,
    status          TEXT,
    PRIMARY KEY (user_id, conversation_id, message_id)
) WITH CLUSTERING ORDER BY (conversation_id ASC, message_id DESC);

-- Conversation metadata
CREATE TABLE conversations (
    user_id             UUID,
    conversation_id     UUID,
    other_participant   UUID,       -- for 1:1 chats
    is_group            BOOLEAN,
    last_message_at     TIMESTAMP,
    last_message_preview TEXT,
    unread_count        INT,
    PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);

Why Cassandra?

RequirementCassandra Feature
High write throughputDistributed writes, no single master
Time-series dataNatural fit for chat history
Partition by conversationData locality for chat retrieval
Horizontal scalingAdd nodes as users grow

Step 6: Group Messaging

Group Message Fan-out

┌─────────────────────────────────────────────────────────────────┐
│                 Group Message Fan-out                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Sender sends to group (500 members)                           │
│                                                                 │
│     Sender                                                      │
│        │                                                        │
│        ▼                                                        │
│   ┌─────────┐                                                  │
│   │ Gateway │                                                  │
│   └────┬────┘                                                  │
│        │                                                        │
│        ▼                                                        │
│   ┌─────────┐         ┌──────────────┐                        │
│   │ Message │────────►│ Group        │                        │
│   │ Service │         │ Fan-out      │                        │
│   └─────────┘         │ Service      │                        │
│                       └──────┬───────┘                        │
│                              │                                  │
│         ┌────────────────────┼────────────────────┐            │
│         │                    │                    │             │
│         ▼                    ▼                    ▼             │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐        │
│   │Member 1-100│      │Member 101-200│    │Member 401-500│     │
│   │(Batch 1) │        │(Batch 2) │        │(Batch 5) │        │
│   └──────────┘        └──────────┘        └──────────┘        │
│                                                                 │
│   For each member:                                             │
│   1. Check if online (session store)                           │
│   2. If online → route message via WebSocket                  │
│   3. If offline → queue for later + push notification         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
class GroupMessageService:
    def __init__(self, group_store, message_router, session_store):
        self.groups = group_store
        self.router = message_router
        self.sessions = session_store
    
    async def send_group_message(self, sender_id, group_id, message):
        # 1. Verify sender is in group
        if not await self.groups.is_member(group_id, sender_id):
            raise UnauthorizedError()
        
        # 2. Get all group members
        members = await self.groups.get_members(group_id)
        
        # 3. Store message once
        stored_message = await self.store_message(group_id, sender_id, message)
        
        # 4. Fan out to members in batches
        batch_size = 100
        for i in range(0, len(members), batch_size):
            batch = members[i:i + batch_size]
            await self.fanout_batch(batch, stored_message, sender_id)
        
        return stored_message
    
    async def fanout_batch(self, members, message, sender_id):
        tasks = []
        for member_id in members:
            if member_id == sender_id:
                continue  # Don't send to sender
            
            tasks.append(self.deliver_to_member(member_id, message))
        
        await asyncio.gather(*tasks)

Step 7: Presence and Typing Indicators

Presence System

┌─────────────────────────────────────────────────────────────────┐
│                 Presence System Design                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Online Status:                                                 │
│  ──────────────                                                 │
│  • Updated on WebSocket connect/disconnect                     │
│  • Heartbeat every 30 seconds                                  │
│  • "Last seen" stored on disconnect                            │
│                                                                 │
│  Redis Structure:                                               │
│  ────────────────                                               │
│  ZADD online_users {timestamp} {user_id}                       │
│                                                                 │
│  Check online: Is score > (now - 60 seconds)?                  │
│                                                                 │
│  Privacy Controls:                                              │
│  ─────────────────                                              │
│  • "Nobody" - always show "last seen recently"                 │
│  • "Contacts" - only contacts see status                       │
│  • "Everyone" - public status                                  │
│                                                                 │
│  Presence Subscription:                                         │
│  ──────────────────────                                         │
│  • Only subscribe to contacts' presence                        │
│  • Pub/sub channel per user                                    │
│  • Aggregate updates (don't send every second)                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Typing Indicators

class TypingIndicatorService:
    def __init__(self, redis_client, message_router, session_store):
        self.redis = redis_client
        self.router = message_router
        self.sessions = session_store
    
    async def set_typing(self, user_id, conversation_id):
        """Called when user starts typing"""
        # Set typing indicator with 5 second TTL
        key = f"typing:{conversation_id}:{user_id}"
        self.redis.setex(key, 5, "1")
        
        # Notify other participants
        participants = await self.get_conversation_participants(conversation_id)
        
        for participant_id in participants:
            if participant_id == user_id:
                continue
            
            sessions = self.sessions.get_connections(participant_id)
            for session in sessions:
                await self.router.send_event(
                    session["gateway_id"],
                    session["connection_id"],
                    {
                        "type": "typing",
                        "conversation_id": conversation_id,
                        "user_id": user_id
                    }
                )
    
    async def clear_typing(self, user_id, conversation_id):
        """Called when user stops typing or sends message"""
        key = f"typing:{conversation_id}:{user_id}"
        self.redis.delete(key)
        
        # Notify stop typing
        # ...

Step 8: End-to-End Encryption

E2E Encryption Overview

┌─────────────────────────────────────────────────────────────────┐
│                 End-to-End Encryption (Signal Protocol)        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Key Exchange (Initial Setup):                                  │
│  ─────────────────────────────                                  │
│                                                                 │
│  Alice                              Bob                         │
│    │                                 │                          │
│    │ 1. Generate key pairs:          │                          │
│    │    - Identity Key (IK)          │                          │
│    │    - Signed Pre-Key (SPK)       │                          │
│    │    - One-Time Pre-Keys (OPK)    │                          │
│    │                                 │                          │
│    ├────► Upload to server ◄─────────┤                          │
│    │                                 │                          │
│    │ 2. To message Bob:              │                          │
│    │    - Fetch Bob's public keys    │                          │
│    │    - Generate ephemeral key     │                          │
│    │    - Derive shared secret       │                          │
│    │                                 │                          │
│    │    shared_secret = ECDH(        │                          │
│    │      Alice_IK, Bob_SPK,         │                          │
│    │      Alice_ephemeral, Bob_OPK   │                          │
│    │    )                            │                          │
│    │                                 │                          │
│    │ 3. Encrypt message              │                          │
│    │    ciphertext = AES(            │                          │
│    │      message,                   │                          │
│    │      derived_key(shared_secret) │                          │
│    │    )                            │                          │
│    │                                 │                          │
│    ├──────────────────────────────────►                          │
│    │         Encrypted message        │                          │
│                                                                 │
│  Server CANNOT read messages - only routes encrypted blobs     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Server’s Role: With E2E encryption, the server only stores and routes encrypted messages. It cannot read content. This affects search functionality (must be client-side) and backups (encrypted).

Step 9: Offline Message Handling

┌─────────────────────────────────────────────────────────────────┐
│                 Offline Message Queue                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  When recipient is offline:                                     │
│  ─────────────────────────                                      │
│  1. Store message in persistent queue                          │
│  2. Send push notification                                     │
│  3. When recipient connects, deliver queued messages           │
│                                                                 │
│  Redis List per user:                                           │
│  ─────────────────────                                          │
│  LPUSH offline:{user_id} {message_json}                        │
│  LRANGE offline:{user_id} 0 -1  # Get all                      │
│  DEL offline:{user_id}          # After delivery               │
│                                                                 │
│  Flow:                                                          │
│  ──────                                                         │
│                                                                 │
│  User connects                                                  │
│       │                                                         │
│       ▼                                                         │
│  Check offline queue                                            │
│       │                                                         │
│       ▼                                                         │
│  ┌─────────────────┐                                           │
│  │ Messages exist? │───No───► Done                             │
│  └────────┬────────┘                                           │
│           │ Yes                                                 │
│           ▼                                                     │
│  Deliver all messages in order                                 │
│           │                                                     │
│           ▼                                                     │
│  Clear queue                                                    │
│           │                                                     │
│           ▼                                                     │
│  Send delivery ACKs to senders                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Final Architecture

┌─────────────────────────────────────────────────────────────────┐
│                 Complete WhatsApp Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                        ┌───────────────┐                       │
│                        │    Clients    │                       │
│                        └───────┬───────┘                       │
│                                │ WebSocket                     │
│                        ┌───────▼───────┐                       │
│                        │ Load Balancer │                       │
│                        └───────┬───────┘                       │
│                                │                                │
│    ┌───────────────────────────┼───────────────────────────┐   │
│    │              │            │            │              │    │
│    ▼              ▼            ▼            ▼              ▼    │
│ ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐  │
│ │  GW  │     │  GW  │     │  GW  │     │  GW  │     │  GW  │  │
│ │  1   │     │  2   │     │  3   │     │  4   │     │  N   │  │
│ └──┬───┘     └──┬───┘     └──┬───┘     └──┬───┘     └──┬───┘  │
│    │            │            │            │            │       │
│    └────────────┴────────────┼────────────┴────────────┘       │
│                              │                                  │
│                       ┌──────▼──────┐                          │
│                       │   Kafka     │                          │
│                       │  (Events)   │                          │
│                       └──────┬──────┘                          │
│                              │                                  │
│    ┌─────────────────────────┼─────────────────────────┐       │
│    │                         │                         │        │
│    ▼                         ▼                         ▼        │
│ ┌──────────┐          ┌──────────────┐          ┌──────────┐  │
│ │ Message  │          │   Session    │          │   Push   │  │
│ │ Service  │          │   Service    │          │ Service  │  │
│ └────┬─────┘          └──────┬───────┘          └────┬─────┘  │
│      │                       │                       │         │
│      ▼                       ▼                       ▼         │
│ ┌──────────┐          ┌──────────────┐          ┌──────────┐  │
│ │Cassandra │          │    Redis     │          │  FCM/    │  │
│ │(Messages)│          │  (Sessions)  │          │  APNS    │  │
│ └──────────┘          └──────────────┘          └──────────┘  │
│                                                                 │
│                       ┌──────────────┐                         │
│                       │  S3 + CDN    │                         │
│                       │   (Media)    │                         │
│                       └──────────────┘                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Design Decisions

DecisionChoiceReasoning
ProtocolWebSocketBidirectional, low latency
Message DBCassandraHigh write throughput, time-series
Session StoreRedisFast lookups, TTL support
Message QueueKafkaReliable message routing
MediaS3 + CDNScalable, cost-effective
PushFCM/APNSPlatform native notifications
EncryptionSignal ProtocolIndustry standard E2E

Common Interview Questions

  1. Use TIMEUUID in Cassandra (includes timestamp)
  2. Client-side sequence numbers for conversation
  3. Deliver messages in batches, sorted by timestamp
  4. Handle network reordering by buffering and sorting
  1. Store all device sessions for each user
  2. Fan out message to all connected devices
  3. Track delivery status per device
  4. Show latest status (delivered if any device received)
  1. Horizontal scaling of gateway servers
  2. Consistent hashing for user→gateway mapping
  3. Redis pub/sub for cross-gateway routing
  4. Each gateway handles ~1M connections (64GB RAM)
  1. Queue messages during partition
  2. Retry with exponential backoff
  3. Client-side message caching
  4. Sync on reconnection
  5. Accept eventual consistency (chat doesn’t need strong consistency)