Problem Statement
Design a messaging application like WhatsApp that:- Supports 1:1 and group messaging
- Delivers messages in real-time
- Stores chat history
- Shows online status and read receipts
- Supports media sharing
Step 1: Requirements Clarification
Functional Requirements
Core Features
- 1:1 messaging
- Group chats (up to 500 members)
- Message delivery & read receipts
- Online/offline status
- Push notifications
Extended Features
- Media sharing (images, videos, files)
- Voice/video calls
- End-to-end encryption
- Message search
- Typing indicators
Non-Functional Requirements
- Low Latency: Messages delivered in <100ms
- High Availability: 99.99% uptime
- Message Ordering: Messages appear in order
- Durability: No message loss
- Scale: 2 billion users, 100 billion messages/day
Capacity Estimation
Copy
┌─────────────────────────────────────────────────────────────────┐
│ WhatsApp Scale Estimation │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Users: │
│ • 2 billion total users │
│ • 500 million DAU │
│ • 60 million concurrent connections at peak │
│ │
│ Messages: │
│ • 100 billion messages/day │
│ • Average message size: 100 bytes │
│ • Peak: 100B / 86,400 ≈ 1.15 million messages/second │
│ │
│ Storage: │
│ • Daily text: 100B × 100 bytes = 10 TB/day │
│ • Daily media (20% with 100KB avg): 20B × 100KB = 2 PB/day │
│ │
│ Connections: │
│ • 60M concurrent WebSocket connections │
│ • Each connection: ~10 KB memory │
│ • Total: 600 GB just for connections │
│ │
│ Bandwidth: │
│ • 1.15M msg/sec × 100 bytes = 115 MB/sec (text only) │
│ • With media: ~50 GB/sec │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 2: High-Level Design
Copy
┌─────────────────────────────────────────────────────────────────┐
│ WhatsApp Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ Clients │ │
│ │ (Mobile/Web) │ │
│ └───────┬───────┘ │
│ │ WebSocket │
│ ┌───────▼───────┐ │
│ │ CDN │ (for media) │
│ │ + Load Balancer│ │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ┌─────▼─────┐ ┌──────▼──────┐ ┌─────▼─────┐ │
│ │ Gateway │ │ Gateway │ │ Gateway │ │
│ │ Server 1 │ │ Server 2 │ │ Server N │ │
│ │(WebSocket)│ │ (WebSocket) │ │(WebSocket)│ │
│ └─────┬─────┘ └──────┬──────┘ └─────┬─────┘ │
│ │ │ │ │
│ └─────────────────────┼────────────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Message │ │
│ │ Router │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────────────────────────┼──────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Message │ │ User & │ │ Media │ │
│ │ DB │ │ Session │ │ Storage │ │
│ │(Cassandra)│ │ Store │ │ (S3) │ │
│ └──────────┘ └──────────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Core Services
| Service | Responsibility |
|---|---|
| Gateway Service | WebSocket connections, message routing |
| Message Service | Store/retrieve messages, delivery tracking |
| User Service | User profiles, contacts, blocking |
| Session Service | Track online/offline, connection mapping |
| Push Service | Offline notifications |
| Group Service | Group management, membership |
| Media Service | Upload, process, serve media |
Step 3: WebSocket Connection Management
Connection Flow
Copy
┌─────────────────────────────────────────────────────────────────┐
│ WebSocket Connection Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Client connects to WebSocket │
│ ──────────────────────────── │
│ wss://chat.whatsapp.com/ws?token=xxx │
│ │
│ 2. Gateway validates token │
│ ────────────────────────── │
│ Verify JWT, extract user_id │
│ │
│ 3. Register connection │
│ ───────────────────── │
│ Session Store: user_id → {gateway_id, connection_id} │
│ │
│ 4. Subscribe to user's message queue │
│ ───────────────────────────────── │
│ Gateway subscribes to "messages:{user_id}" │
│ │
│ 5. Send pending messages │
│ ─────────────────────── │
│ Deliver any messages queued while offline │
│ │
│ 6. Heartbeat │
│ ───────── │
│ Client sends ping every 30 seconds │
│ Server responds with pong │
│ No ping for 60s → connection dead │
│ │
└─────────────────────────────────────────────────────────────────┘
Session Store Design
Copy
# Redis structure for session management
# User's current connection(s)
# Key: session:{user_id}
# Value: {gateway_id, connection_id, last_seen, device_info}
class SessionStore:
def __init__(self, redis_client):
self.redis = redis_client
def register_connection(self, user_id, gateway_id, connection_id, device):
session_data = {
"gateway_id": gateway_id,
"connection_id": connection_id,
"device": device,
"connected_at": time.time()
}
# Support multiple devices
self.redis.hset(
f"session:{user_id}",
device,
json.dumps(session_data)
)
# Update last seen
self.redis.zadd("online_users", {user_id: time.time()})
def get_connections(self, user_id):
"""Get all active connections for a user"""
sessions = self.redis.hgetall(f"session:{user_id}")
return [json.loads(s) for s in sessions.values()]
def is_online(self, user_id):
last_seen = self.redis.zscore("online_users", user_id)
if not last_seen:
return False
return time.time() - last_seen < 60 # 60 second threshold
def remove_connection(self, user_id, device):
self.redis.hdel(f"session:{user_id}", device)
# Check if user has other active sessions
if not self.redis.exists(f"session:{user_id}"):
self.redis.zrem("online_users", user_id)
Step 4: Message Delivery
Message Flow
Copy
┌─────────────────────────────────────────────────────────────────┐
│ Message Delivery Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Sender Receiver│
│ │ │ │
│ │ 1. Send message │ │
│ ├────────────────────────────────────────────────────►│ │
│ │ │ │
│ │ { │ │
│ │ "msg_id": "uuid", │ │
│ │ "to": "receiver_id", │ │
│ │ "content": "Hello!", │ │
│ │ "timestamp": 1705312800 │ │
│ │ } │ │
│ │ │ │
│ ┌───┴───┐ ┌───┴───┐
│ │Gateway│ │Gateway│
│ │ A │ │ B │
│ └───┬───┘ └───┬───┘
│ │ │ │
│ │ 2. Store in DB │ │
│ ├─────────────►[Message DB] │ │
│ │ │ │
│ │ 3. Lookup receiver's gateway │ │
│ ├─────────────►[Session Store] │ │
│ │◄───────────── gateway_B │ │
│ │ │ │
│ │ 4. Route message │ │
│ ├─────────────────────────────────────────────────────►│ │
│ │ │ │
│ │ 5. Deliver │ │
│ │ ◄───────────┤ │
│ │ │ │
│ │◄─────────────────────────────────────────────────────┤ │
│ │ 6. ACK (delivered) │ │
│ │ │ │
│ [Update DB: delivered=true] │ │
│ │ │ │
│ │◄─────────────────────────────────────────────────────┤ │
│ │ 7. ACK (read) │ │
│ │ │ │
│ │
└─────────────────────────────────────────────────────────────────┘
Message States
Copy
┌─────────────────────────────────────────────────────────────────┐
│ Message State Machine │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ SENT │────►│ STORED │────►│DELIVERED│────►│ READ │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ 1 Tick 2 Ticks 2 Blue Ticks (in UI) │
│ │
│ SENT: Message sent to server │
│ STORED: Message persisted in database │
│ DELIVERED: Message received by recipient's device │
│ READ: Message opened/viewed by recipient │
│ │
└─────────────────────────────────────────────────────────────────┘
Message Service Implementation
Copy
class MessageService:
def __init__(self, db, session_store, message_router, push_service):
self.db = db
self.sessions = session_store
self.router = message_router
self.push = push_service
async def send_message(self, sender_id, message):
# 1. Generate message ID (for idempotency)
msg_id = message.get("msg_id") or generate_uuid()
# 2. Store message in database
stored_message = await self.db.store_message({
"id": msg_id,
"conversation_id": self.get_conversation_id(sender_id, message["to"]),
"sender_id": sender_id,
"recipient_id": message["to"],
"content": message["content"],
"type": message.get("type", "text"),
"status": "stored",
"created_at": time.time()
})
# 3. Check if recipient is online
recipient_sessions = self.sessions.get_connections(message["to"])
if recipient_sessions:
# 4a. Route to online recipient
for session in recipient_sessions:
await self.router.route_message(
session["gateway_id"],
session["connection_id"],
stored_message
)
else:
# 4b. Send push notification for offline recipient
await self.push.send_notification(
message["to"],
{
"title": f"New message from {sender_id}",
"body": message["content"][:100]
}
)
# 5. Send ACK to sender
return {"msg_id": msg_id, "status": "stored"}
async def ack_delivered(self, msg_id, user_id):
await self.db.update_message_status(msg_id, "delivered")
# Notify sender about delivery
message = await self.db.get_message(msg_id)
sender_sessions = self.sessions.get_connections(message["sender_id"])
for session in sender_sessions:
await self.router.send_receipt(
session["gateway_id"],
session["connection_id"],
{"msg_id": msg_id, "status": "delivered"}
)
async def ack_read(self, msg_ids, user_id):
# Batch update for efficiency
await self.db.update_messages_status(msg_ids, "read")
# Group by sender and notify
# ...
Step 5: Data Models
Message Storage (Cassandra)
Copy
-- Messages table partitioned by conversation
CREATE TABLE messages (
conversation_id UUID,
message_id TIMEUUID,
sender_id UUID,
content TEXT,
content_type TEXT, -- text, image, video, audio
media_url TEXT,
status TEXT, -- stored, delivered, read
created_at TIMESTAMP,
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- Index for fetching unread messages per user
CREATE TABLE user_messages (
user_id UUID,
conversation_id UUID,
message_id TIMEUUID,
status TEXT,
PRIMARY KEY (user_id, conversation_id, message_id)
) WITH CLUSTERING ORDER BY (conversation_id ASC, message_id DESC);
-- Conversation metadata
CREATE TABLE conversations (
user_id UUID,
conversation_id UUID,
other_participant UUID, -- for 1:1 chats
is_group BOOLEAN,
last_message_at TIMESTAMP,
last_message_preview TEXT,
unread_count INT,
PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);
Why Cassandra?
| Requirement | Cassandra Feature |
|---|---|
| High write throughput | Distributed writes, no single master |
| Time-series data | Natural fit for chat history |
| Partition by conversation | Data locality for chat retrieval |
| Horizontal scaling | Add nodes as users grow |
Step 6: Group Messaging
Group Message Fan-out
Copy
┌─────────────────────────────────────────────────────────────────┐
│ Group Message Fan-out │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Sender sends to group (500 members) │
│ │
│ Sender │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Gateway │ │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌──────────────┐ │
│ │ Message │────────►│ Group │ │
│ │ Service │ │ Fan-out │ │
│ └─────────┘ │ Service │ │
│ └──────┬───────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Member 1-100│ │Member 101-200│ │Member 401-500│ │
│ │(Batch 1) │ │(Batch 2) │ │(Batch 5) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ For each member: │
│ 1. Check if online (session store) │
│ 2. If online → route message via WebSocket │
│ 3. If offline → queue for later + push notification │
│ │
└─────────────────────────────────────────────────────────────────┘
Copy
class GroupMessageService:
def __init__(self, group_store, message_router, session_store):
self.groups = group_store
self.router = message_router
self.sessions = session_store
async def send_group_message(self, sender_id, group_id, message):
# 1. Verify sender is in group
if not await self.groups.is_member(group_id, sender_id):
raise UnauthorizedError()
# 2. Get all group members
members = await self.groups.get_members(group_id)
# 3. Store message once
stored_message = await self.store_message(group_id, sender_id, message)
# 4. Fan out to members in batches
batch_size = 100
for i in range(0, len(members), batch_size):
batch = members[i:i + batch_size]
await self.fanout_batch(batch, stored_message, sender_id)
return stored_message
async def fanout_batch(self, members, message, sender_id):
tasks = []
for member_id in members:
if member_id == sender_id:
continue # Don't send to sender
tasks.append(self.deliver_to_member(member_id, message))
await asyncio.gather(*tasks)
Step 7: Presence and Typing Indicators
Presence System
Copy
┌─────────────────────────────────────────────────────────────────┐
│ Presence System Design │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Online Status: │
│ ────────────── │
│ • Updated on WebSocket connect/disconnect │
│ • Heartbeat every 30 seconds │
│ • "Last seen" stored on disconnect │
│ │
│ Redis Structure: │
│ ──────────────── │
│ ZADD online_users {timestamp} {user_id} │
│ │
│ Check online: Is score > (now - 60 seconds)? │
│ │
│ Privacy Controls: │
│ ───────────────── │
│ • "Nobody" - always show "last seen recently" │
│ • "Contacts" - only contacts see status │
│ • "Everyone" - public status │
│ │
│ Presence Subscription: │
│ ────────────────────── │
│ • Only subscribe to contacts' presence │
│ • Pub/sub channel per user │
│ • Aggregate updates (don't send every second) │
│ │
└─────────────────────────────────────────────────────────────────┘
Typing Indicators
Copy
class TypingIndicatorService:
def __init__(self, redis_client, message_router, session_store):
self.redis = redis_client
self.router = message_router
self.sessions = session_store
async def set_typing(self, user_id, conversation_id):
"""Called when user starts typing"""
# Set typing indicator with 5 second TTL
key = f"typing:{conversation_id}:{user_id}"
self.redis.setex(key, 5, "1")
# Notify other participants
participants = await self.get_conversation_participants(conversation_id)
for participant_id in participants:
if participant_id == user_id:
continue
sessions = self.sessions.get_connections(participant_id)
for session in sessions:
await self.router.send_event(
session["gateway_id"],
session["connection_id"],
{
"type": "typing",
"conversation_id": conversation_id,
"user_id": user_id
}
)
async def clear_typing(self, user_id, conversation_id):
"""Called when user stops typing or sends message"""
key = f"typing:{conversation_id}:{user_id}"
self.redis.delete(key)
# Notify stop typing
# ...
Step 8: End-to-End Encryption
E2E Encryption Overview
Copy
┌─────────────────────────────────────────────────────────────────┐
│ End-to-End Encryption (Signal Protocol) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Key Exchange (Initial Setup): │
│ ───────────────────────────── │
│ │
│ Alice Bob │
│ │ │ │
│ │ 1. Generate key pairs: │ │
│ │ - Identity Key (IK) │ │
│ │ - Signed Pre-Key (SPK) │ │
│ │ - One-Time Pre-Keys (OPK) │ │
│ │ │ │
│ ├────► Upload to server ◄─────────┤ │
│ │ │ │
│ │ 2. To message Bob: │ │
│ │ - Fetch Bob's public keys │ │
│ │ - Generate ephemeral key │ │
│ │ - Derive shared secret │ │
│ │ │ │
│ │ shared_secret = ECDH( │ │
│ │ Alice_IK, Bob_SPK, │ │
│ │ Alice_ephemeral, Bob_OPK │ │
│ │ ) │ │
│ │ │ │
│ │ 3. Encrypt message │ │
│ │ ciphertext = AES( │ │
│ │ message, │ │
│ │ derived_key(shared_secret) │ │
│ │ ) │ │
│ │ │ │
│ ├──────────────────────────────────► │
│ │ Encrypted message │ │
│ │
│ Server CANNOT read messages - only routes encrypted blobs │
│ │
└─────────────────────────────────────────────────────────────────┘
Server’s Role: With E2E encryption, the server only stores and routes encrypted messages. It cannot read content. This affects search functionality (must be client-side) and backups (encrypted).
Step 9: Offline Message Handling
Copy
┌─────────────────────────────────────────────────────────────────┐
│ Offline Message Queue │
├─────────────────────────────────────────────────────────────────┤
│ │
│ When recipient is offline: │
│ ───────────────────────── │
│ 1. Store message in persistent queue │
│ 2. Send push notification │
│ 3. When recipient connects, deliver queued messages │
│ │
│ Redis List per user: │
│ ───────────────────── │
│ LPUSH offline:{user_id} {message_json} │
│ LRANGE offline:{user_id} 0 -1 # Get all │
│ DEL offline:{user_id} # After delivery │
│ │
│ Flow: │
│ ────── │
│ │
│ User connects │
│ │ │
│ ▼ │
│ Check offline queue │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Messages exist? │───No───► Done │
│ └────────┬────────┘ │
│ │ Yes │
│ ▼ │
│ Deliver all messages in order │
│ │ │
│ ▼ │
│ Clear queue │
│ │ │
│ ▼ │
│ Send delivery ACKs to senders │
│ │
└─────────────────────────────────────────────────────────────────┘
Final Architecture
Copy
┌─────────────────────────────────────────────────────────────────┐
│ Complete WhatsApp Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ Clients │ │
│ └───────┬───────┘ │
│ │ WebSocket │
│ ┌───────▼───────┐ │
│ │ Load Balancer │ │
│ └───────┬───────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ GW │ │ GW │ │ GW │ │ GW │ │ GW │ │
│ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │ N │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │ │
│ └────────────┴────────────┼────────────┴────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Kafka │ │
│ │ (Events) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌─────────────────────────┼─────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Message │ │ Session │ │ Push │ │
│ │ Service │ │ Service │ │ Service │ │
│ └────┬─────┘ └──────┬───────┘ └────┬─────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │Cassandra │ │ Redis │ │ FCM/ │ │
│ │(Messages)│ │ (Sessions) │ │ APNS │ │
│ └──────────┘ └──────────────┘ └──────────┘ │
│ │
│ ┌──────────────┐ │
│ │ S3 + CDN │ │
│ │ (Media) │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Design Decisions
| Decision | Choice | Reasoning |
|---|---|---|
| Protocol | WebSocket | Bidirectional, low latency |
| Message DB | Cassandra | High write throughput, time-series |
| Session Store | Redis | Fast lookups, TTL support |
| Message Queue | Kafka | Reliable message routing |
| Media | S3 + CDN | Scalable, cost-effective |
| Push | FCM/APNS | Platform native notifications |
| Encryption | Signal Protocol | Industry standard E2E |
Common Interview Questions
How do you ensure message ordering?
How do you ensure message ordering?
- Use TIMEUUID in Cassandra (includes timestamp)
- Client-side sequence numbers for conversation
- Deliver messages in batches, sorted by timestamp
- Handle network reordering by buffering and sorting
How do you handle message delivery to multiple devices?
How do you handle message delivery to multiple devices?
- Store all device sessions for each user
- Fan out message to all connected devices
- Track delivery status per device
- Show latest status (delivered if any device received)
How do you scale WebSocket connections?
How do you scale WebSocket connections?
- Horizontal scaling of gateway servers
- Consistent hashing for user→gateway mapping
- Redis pub/sub for cross-gateway routing
- Each gateway handles ~1M connections (64GB RAM)
How do you handle network partitions?
How do you handle network partitions?
- Queue messages during partition
- Retry with exponential backoff
- Client-side message caching
- Sync on reconnection
- Accept eventual consistency (chat doesn’t need strong consistency)