Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Problem Statement

Design a Twitter-like social media platform that:
  • Users can post tweets (280 characters)
  • Users can follow other users
  • Users see a timeline of tweets from people they follow
  • Support for likes, retweets, and replies

Step 1: Requirements Clarification

Functional Requirements

Core Features

  • Post tweets (text, images, videos)
  • Follow/unfollow users
  • View home timeline (feed)
  • Like, retweet, reply
  • User profiles

Extended Features

  • Search tweets
  • Trending topics
  • Notifications
  • Direct messages

Non-Functional Requirements

  • Low Latency: Timeline loads in <200ms
  • High Availability: 99.99% uptime
  • Eventual Consistency: Acceptable for feeds
  • Scale: 500M users, 200M DAU

Capacity Estimation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Twitter Scale Estimation                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Users:                                                         β”‚
β”‚  β€’ 500 million total users                                     β”‚
β”‚  β€’ 200 million DAU                                             β”‚
β”‚  β€’ Average: 200 followers per user                             β”‚
β”‚  β€’ Power users: 10M+ followers (celebrities)                   β”‚
β”‚                                                                 β”‚
β”‚  Tweets:                                                        β”‚
β”‚  β€’ 10% users tweet daily = 20M tweets/day                      β”‚
β”‚  β€’ Tweet QPS = 20M / 86,400 β‰ˆ 230 QPS                          β”‚
β”‚  β€’ Peak: 230 Γ— 3 = 700 QPS                                     β”‚
β”‚                                                                 β”‚
β”‚  Timeline reads:                                                β”‚
β”‚  β€’ 200M DAU Γ— 5 views/day = 1B views/day                       β”‚
β”‚  β€’ Timeline QPS = 1B / 86,400 β‰ˆ 11,600 QPS                     β”‚
β”‚  β€’ Peak: 11,600 Γ— 3 = 35,000 QPS                               β”‚
β”‚                                                                 β”‚
β”‚  Read:Write ratio = 11,600:230 = 50:1                          β”‚
β”‚                                                                 β”‚
β”‚  Storage:                                                       β”‚
β”‚  β€’ Tweet: 280 chars + metadata = 500 bytes                     β”‚
β”‚  β€’ Daily: 20M Γ— 500 = 10 GB                                    β”‚
β”‚  β€’ Yearly: 10 GB Γ— 365 = 3.6 TB                                β”‚
β”‚  β€’ Media: 10% tweets with 1MB image = 2M Γ— 1MB = 2 TB/day     β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 2: High-Level Design

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Twitter Architecture                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚                       β”‚   Client   β”‚                           β”‚
β”‚                       β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β”‚                             β”‚                                   β”‚
β”‚                       β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚                       β”‚    CDN     β”‚                           β”‚
β”‚                       β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β”‚                             β”‚                                   β”‚
β”‚                       β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚                       β”‚ API Gatewayβ”‚                           β”‚
β”‚                       β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β”‚                             β”‚                                   β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚    β”‚                        β”‚                        β”‚          β”‚
β”‚    β–Ό                        β–Ό                        β–Ό          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚ β”‚  Tweet   β”‚         β”‚   Timeline   β”‚         β”‚  User    β”‚    β”‚
β”‚ β”‚ Service  β”‚         β”‚   Service    β”‚         β”‚ Service  β”‚    β”‚
β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜    β”‚
β”‚      β”‚                      β”‚                      β”‚           β”‚
β”‚      β”‚                      β”‚                      β”‚           β”‚
β”‚ β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”    β”‚
β”‚ β”‚Tweet DB  β”‚         β”‚Timeline Cacheβ”‚         β”‚ User DB  β”‚    β”‚
β”‚ β”‚(Cassandra)β”‚         β”‚   (Redis)    β”‚         β”‚(Postgres)β”‚    β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Services

ServiceResponsibility
Tweet ServiceCreate, read, delete tweets
Timeline ServiceBuild and serve user timelines
User ServiceUser profiles, follow relationships
Fan-out ServiceDistribute tweets to followers
Search ServiceFull-text search on tweets
Notification ServicePush notifications

Step 3: The Timeline Problem

The core challenge is: How do we show a user the latest tweets from everyone they follow?

Approach 1: Fan-out on Read (Pull)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Fan-out on Read                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  When user opens timeline:                                      β”‚
β”‚  ────────────────────────                                       β”‚
β”‚  1. Get list of followees (200 users)                          β”‚
β”‚  2. For each followee, get recent tweets                       β”‚
β”‚  3. Merge and sort by timestamp                                β”‚
β”‚  4. Return top N tweets                                         β”‚
β”‚                                                                 β”‚
β”‚  User                                                           β”‚
β”‚    β”‚                                                            β”‚
β”‚    β–Ό                                                            β”‚
β”‚  "Get timeline"                                                β”‚
β”‚    β”‚                                                            β”‚
β”‚    β–Ό                                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
β”‚  β”‚  SELECT tweets FROM tweet_table         β”‚                   β”‚
β”‚  β”‚  WHERE user_id IN (followee_ids)        β”‚                   β”‚
β”‚  β”‚  ORDER BY created_at DESC               β”‚                   β”‚
β”‚  β”‚  LIMIT 100                              β”‚                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
β”‚                                                                 β”‚
β”‚  + Simple to implement                                         β”‚
β”‚  + No extra storage                                            β”‚
β”‚  - Slow: 200 queries per timeline request                     β”‚
β”‚  - 11,600 QPS Γ— 200 = 2.3M queries/second!                    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Approach 2: Fan-out on Write (Push)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Fan-out on Write                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  When user posts a tweet:                                       β”‚
β”‚  ────────────────────────                                       β”‚
β”‚  1. Save tweet to database                                     β”‚
β”‚  2. Get all followers (could be millions)                      β”‚
β”‚  3. Push tweet ID to each follower's timeline cache            β”‚
β”‚                                                                 β”‚
β”‚  Tweet Posted                                                   β”‚
β”‚    β”‚                                                            β”‚
β”‚    β–Ό                                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                               β”‚
β”‚  β”‚ Save Tweet  β”‚                                               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜                                               β”‚
β”‚         β”‚                                                       β”‚
β”‚    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”                                                 β”‚
β”‚    β–Ό         β–Ό                                                 β”‚
β”‚  Fan-out to each follower's timeline cache:                    β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚Follower 1β”‚ β”‚Follower 2β”‚ β”‚Follower 3β”‚ β”‚    ...   β”‚          β”‚
β”‚  β”‚ Timeline β”‚ β”‚ Timeline β”‚ β”‚ Timeline β”‚ β”‚          β”‚          β”‚
β”‚  β”‚[tweet_id]β”‚ β”‚[tweet_id]β”‚ β”‚[tweet_id]β”‚ β”‚[tweet_id]β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                                                                 β”‚
β”‚  + Fast reads: O(1) cache lookup                              β”‚
β”‚  + Timeline is pre-computed                                    β”‚
β”‚  - Celebrity problem: 50M followers = 50M writes              β”‚
β”‚  - Wasted writes for inactive users                           β”‚
β”‚  - More storage (timeline caches)                             β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Approach 3: Hybrid (What Twitter Uses)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Hybrid Approach                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Regular Users (< 10K followers):                              β”‚
β”‚  ──────────────────────────────                                 β”‚
β”‚  β†’ Fan-out on Write                                            β”‚
β”‚  β†’ Push to all followers' timelines                            β”‚
β”‚                                                                 β”‚
β”‚  Celebrities (> 10K followers):                                 β”‚
β”‚  ─────────────────────────────                                  β”‚
β”‚  β†’ Fan-out on Read                                             β”‚
β”‚  β†’ Fetch at read time and merge                                β”‚
β”‚                                                                 β”‚
β”‚  Timeline Generation:                                           β”‚
β”‚  ────────────────────                                           β”‚
β”‚  1. Read user's pre-computed timeline (regular users' tweets)  β”‚
β”‚  2. Get list of celebrity followees                            β”‚
β”‚  3. Fetch recent tweets from each celebrity                    β”‚
β”‚  4. Merge both lists, sort by time                             β”‚
β”‚  5. Return top N                                                β”‚
β”‚                                                                 β”‚
β”‚         User's Timeline                                         β”‚
β”‚              β”‚                                                  β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                       β”‚
β”‚    β”‚                   β”‚                                        β”‚
β”‚    β–Ό                   β–Ό                                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚  β”‚ Timeline  β”‚   β”‚  Celebrity    β”‚                             β”‚
β”‚  β”‚  Cache    β”‚   β”‚   Tweets      β”‚                             β”‚
β”‚  β”‚ (pushed)  β”‚   β”‚  (pulled)     β”‚                             β”‚
β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚        β”‚                 β”‚                                      β”‚
β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                      β”‚
β”‚                 β”‚                                               β”‚
β”‚           β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”                                        β”‚
β”‚           β”‚   Merge   β”‚                                        β”‚
β”‚           β”‚   & Sort  β”‚                                        β”‚
β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                        β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 4: Detailed Component Design

Data Models

-- Users table
CREATE TABLE users (
    user_id         BIGINT PRIMARY KEY,
    username        VARCHAR(50) UNIQUE,
    email           VARCHAR(255),
    display_name    VARCHAR(100),
    bio             TEXT,
    profile_image   VARCHAR(500),
    followers_count BIGINT DEFAULT 0,
    following_count BIGINT DEFAULT 0,
    created_at      TIMESTAMP
);

-- Tweets table (Cassandra)
CREATE TABLE tweets (
    tweet_id        BIGINT,         -- Snowflake ID
    user_id         BIGINT,
    content         TEXT,
    media_urls      LIST<TEXT>,
    created_at      TIMESTAMP,
    likes_count     BIGINT,
    retweets_count  BIGINT,
    replies_count   BIGINT,
    PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

-- Follow relationships (Graph or Cassandra)
CREATE TABLE followers (
    user_id         BIGINT,
    follower_id     BIGINT,
    created_at      TIMESTAMP,
    PRIMARY KEY (user_id, follower_id)
);

CREATE TABLE following (
    user_id         BIGINT,
    following_id    BIGINT,
    created_at      TIMESTAMP,
    PRIMARY KEY (user_id, following_id)
);

Timeline Cache Structure

# Redis sorted set for each user's timeline
# Key: timeline:{user_id}
# Score: tweet timestamp
# Value: tweet_id

# Add tweet to timeline
ZADD timeline:123 1705312800 "tweet_456789"

# Get latest 50 tweets
ZREVRANGE timeline:123 0 49

# Trim old tweets (keep only 800)
ZREMRANGEBYRANK timeline:123 0 -801

# Timeline structure
{
    "timeline:123": {
        "tweet_999": 1705312800,  # Most recent
        "tweet_998": 1705312700,
        "tweet_997": 1705312600,
        ...
        # Store last 800 tweet IDs
    }
}

# Full tweet data cached separately
{
    "tweet:999": {
        "user_id": 456,
        "content": "Hello world!",
        "created_at": 1705312800,
        "likes": 42
    }
}

Fan-out Service

class FanoutService:
    def __init__(self, follower_service, timeline_cache, message_queue):
        self.follower_service = follower_service
        self.timeline_cache = timeline_cache
        self.queue = message_queue
    
    def fanout_tweet(self, tweet):
        user_id = tweet.user_id
        followers = self.follower_service.get_followers(user_id)
        
        # Check if user is a celebrity
        if len(followers) > 10_000:
            # Don't fan out for celebrities
            # Store tweet in celebrity tweets cache instead
            self.cache_celebrity_tweet(user_id, tweet)
            return
        
        # Fan-out to all followers
        for batch in self.batch(followers, 1000):
            # Process in parallel
            self.queue.publish("fanout", {
                "tweet_id": tweet.id,
                "tweet_time": tweet.created_at,
                "follower_ids": batch
            })
    
    def process_fanout_batch(self, message):
        tweet_id = message["tweet_id"]
        tweet_time = message["tweet_time"]
        
        pipeline = self.timeline_cache.pipeline()
        for follower_id in message["follower_ids"]:
            # Add to timeline sorted set
            pipeline.zadd(
                f"timeline:{follower_id}",
                {tweet_id: tweet_time}
            )
            # Trim to 800 tweets
            pipeline.zremrangebyrank(f"timeline:{follower_id}", 0, -801)
        
        pipeline.execute()

Timeline Service

class TimelineService:
    def __init__(self, timeline_cache, tweet_service, follow_service):
        self.cache = timeline_cache
        self.tweet_service = tweet_service
        self.follow_service = follow_service
    
    def get_timeline(self, user_id, count=50, cursor=None):
        # 1. Get pre-computed timeline (from regular users)
        if cursor:
            timeline_ids = self.cache.zrevrangebyscore(
                f"timeline:{user_id}",
                cursor, 
                "-inf",
                start=0, 
                num=count
            )
        else:
            timeline_ids = self.cache.zrevrange(
                f"timeline:{user_id}",
                0, 
                count - 1
            )
        
        # 2. Get celebrity tweets (fan-out on read)
        celebrity_followees = self.follow_service.get_celebrity_followees(user_id)
        celebrity_tweets = []
        
        for celeb_id in celebrity_followees:
            recent_tweets = self.tweet_service.get_recent_tweets(
                celeb_id, 
                count=10
            )
            celebrity_tweets.extend(recent_tweets)
        
        # 3. Merge and sort
        all_tweet_ids = list(timeline_ids) + [t.id for t in celebrity_tweets]
        
        # Fetch full tweet objects
        tweets = self.tweet_service.get_tweets_batch(all_tweet_ids)
        
        # Sort by time and take top N
        tweets.sort(key=lambda t: t.created_at, reverse=True)
        return tweets[:count]

Tweet Search with Elasticsearch

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Search Architecture                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Tweet Created                                                  β”‚
β”‚      β”‚                                                          β”‚
β”‚      β–Ό                                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚  Kafka   │────►│   Consumer   │────►│ Elasticsearch β”‚       β”‚
β”‚  β”‚          β”‚     β”‚   Service    β”‚     β”‚               β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                β”‚                β”‚
β”‚                                                β–Ό                β”‚
β”‚                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚                              β”‚ GET /tweets/_search         β”‚   β”‚
β”‚                              β”‚ {                           β”‚   β”‚
β”‚                              β”‚   "query": {                β”‚   β”‚
β”‚                              β”‚     "match": {              β”‚   β”‚
β”‚                              β”‚       "content": "keyword"  β”‚   β”‚
β”‚                              β”‚     }                       β”‚   β”‚
β”‚                              β”‚   }                         β”‚   β”‚
β”‚                              β”‚ }                           β”‚   β”‚
β”‚                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
class TrendingService:
    def __init__(self, redis_client):
        self.redis = redis_client
    
    def track_hashtag(self, hashtag, location="global"):
        """Called when a tweet with hashtag is posted"""
        current_window = self.get_current_window()
        
        # Increment counter in current time window
        self.redis.zincrby(
            f"trending:{location}:{current_window}",
            1,
            hashtag
        )
    
    def get_trending(self, location="global", count=10):
        """Get top trending hashtags"""
        # Get last 5 time windows (e.g., 5 minute windows)
        windows = self.get_recent_windows(5)
        
        # Merge counts from recent windows
        self.redis.zunionstore(
            f"trending:{location}:merged",
            [f"trending:{location}:{w}" for w in windows]
        )
        
        # Get top hashtags
        return self.redis.zrevrange(
            f"trending:{location}:merged",
            0,
            count - 1,
            withscores=True
        )
    
    def get_current_window(self):
        """5-minute time windows"""
        import time
        return int(time.time() / 300) * 300

Step 6: Additional Components

Media Upload

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Media Upload Flow                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  1. Client requests upload URL                                 β”‚
β”‚     POST /api/v1/media/upload-url                              β”‚
β”‚                                                                 β”‚
β”‚  2. Server returns pre-signed S3 URL                           β”‚
β”‚     { "upload_url": "https://s3.../presigned", "media_id": 123}β”‚
β”‚                                                                 β”‚
β”‚  3. Client uploads directly to S3                              β”‚
β”‚     PUT https://s3.../presigned                                β”‚
β”‚                                                                 β”‚
β”‚  4. S3 triggers processing lambda                              β”‚
β”‚     - Generate thumbnails                                      β”‚
β”‚     - Transcode video                                          β”‚
β”‚     - CDN propagation                                          β”‚
β”‚                                                                 β”‚
β”‚  5. Client includes media_id in tweet                          β”‚
β”‚     POST /api/v1/tweets                                        β”‚
β”‚     { "content": "Hello!", "media_ids": [123] }               β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ Client │───►│   API   │───►│   S3    │───►│   CDN   β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                    β”‚              β”‚                            β”‚
β”‚                    β”‚              β–Ό                            β”‚
β”‚                    β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                    β”‚         β”‚ Lambda  β”‚                       β”‚
β”‚                    β”‚         β”‚ Process β”‚                       β”‚
β”‚                    β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                    β”‚              β”‚                            β”‚
β”‚                    β”‚              β–Ό                            β”‚
β”‚                    β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                    └────────►│ Media   β”‚                       β”‚
β”‚                              β”‚   DB    β”‚                       β”‚
β”‚                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Notifications

class NotificationService:
    def __init__(self, push_service, websocket_service, email_service):
        self.push = push_service
        self.ws = websocket_service
        self.email = email_service
    
    def notify_mention(self, mentioned_user_id, tweet):
        notification = {
            "type": "mention",
            "from_user": tweet.user_id,
            "tweet_id": tweet.id,
            "content": f"@{tweet.user.username} mentioned you"
        }
        
        self.send_notification(mentioned_user_id, notification)
    
    def notify_like(self, tweet_owner_id, liker_id, tweet_id):
        # Batch likes to avoid notification spam
        # "5 people liked your tweet"
        pass
    
    def send_notification(self, user_id, notification):
        # 1. Store in notifications table
        self.store_notification(user_id, notification)
        
        # 2. Send real-time if user is online
        if self.ws.is_connected(user_id):
            self.ws.send(user_id, notification)
        
        # 3. Send push notification
        self.push.send(user_id, notification)

Final Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Complete Twitter Architecture                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                         β”‚   Clients    β”‚                       β”‚
β”‚                         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                                β”‚                                β”‚
β”‚                         β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                         β”‚     CDN      β”‚                       β”‚
β”‚                         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                                β”‚                                β”‚
β”‚                         β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                         β”‚ API Gateway  β”‚                       β”‚
β”‚                         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                                β”‚                                β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚    β”‚            β”‚              β”‚              β”‚            β”‚    β”‚
β”‚    β–Ό            β–Ό              β–Ό              β–Ό            β–Ό    β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”  β”‚
β”‚ β”‚Tweet β”‚   β”‚ User β”‚      β”‚ Timeline β”‚   β”‚Searchβ”‚    β”‚Notif β”‚  β”‚
β”‚ β”‚ Svc  β”‚   β”‚ Svc  β”‚      β”‚   Svc    β”‚   β”‚ Svc  β”‚    β”‚ Svc  β”‚  β”‚
β”‚ β””β”€β”€β”¬β”€β”€β”€β”˜   β””β”€β”€β”¬β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”¬β”€β”€β”€β”˜    β””β”€β”€β”¬β”€β”€β”€β”˜  β”‚
β”‚    β”‚          β”‚               β”‚            β”‚           β”‚       β”‚
β”‚    β”‚          β”‚               β”‚            β”‚           β”‚       β”‚
β”‚    β–Ό          β–Ό               β–Ό            β–Ό           β–Ό       β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚ β”‚Tweet β”‚  β”‚ User β”‚      β”‚ Timeline β”‚  β”‚Elasticβ”‚  β”‚  Push   β”‚  β”‚
β”‚ β”‚  DB  β”‚  β”‚  DB  β”‚      β”‚  Cache   β”‚  β”‚Search β”‚  β”‚ Service β”‚  β”‚
β”‚ β”‚      β”‚  β”‚      β”‚      β”‚ (Redis)  β”‚  β”‚       β”‚  β”‚         β”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                                 β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚    β”‚                    Kafka                              β”‚   β”‚
β”‚    β”‚  (tweets, fanout, notifications, search indexing)     β”‚   β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚    β”‚                 S3 + CDN (Media)                      β”‚   β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Design Decisions

DecisionChoiceReasoning
TimelineHybrid fan-outBalance between read/write costs
Tweet StorageCassandraHigh write throughput, time-series friendly
User DataPostgreSQLACID for user operations
Timeline CacheRedis Sorted SetsO(log n) insert, O(1) range reads
SearchElasticsearchFull-text search, real-time indexing
MediaS3 + CDNScalable object storage
MessagingKafkaDurable, high-throughput event streaming

Common Interview Questions

Use hybrid approach: fan-out on write for regular users, fan-out on read for celebrities (>10K followers). Celebrity tweets are fetched at read time and merged with the pre-computed timeline.Why 10K is the threshold: Below 10K followers, fan-out on write is fast (writing 10K Redis entries takes ~50ms). Above 10K, write latency becomes noticeable and wastes writes for inactive followers. Twitter reportedly uses a similar threshold. The exact number is tunable β€” monitor the P99 fan-out latency and adjust.The math behind this: If a celebrity with 50M followers tweets, fan-out on write means 50M Redis ZADD operations. At 100K writes/second, that takes 500 seconds β€” by which time the tweet is old news. Fan-out on read for that celebrity means each of their followers fetches one extra tweet at read time, adding ~1ms of latency.
  1. Pre-compute timelines via fan-out on write β€” this is the main mechanism for low latency
  2. Short TTL on cache (5 minutes) to bound staleness
  3. Use WebSocket or Server-Sent Events for real-time updates to active sessions (only push to users with the app open)
  4. Periodic client-side refresh (pull) as a fallback for users who lose their WebSocket connection
  5. For celebrity tweets merged at read time, cache the merged result for 30-60 seconds to avoid recomputing on every scroll
Trade-off: Pushing real-time updates via WebSocket to 200M DAU is expensive β€” each connection consumes ~10KB of server memory. Twitter uses a selective push: only push to users who are actively viewing their timeline.
  1. Mark tweet as deleted in DB (soft delete) β€” never hard delete, you need audit trails
  2. Async job to remove from all timeline caches β€” this is a reverse fan-out that can take minutes for popular users
  3. Client-side filtering as backup β€” check deletion status when rendering, hiding deleted tweets immediately
  4. Accept eventual consistency β€” deleted tweets may briefly appear in some users’ timelines
Key nuance: For legal/compliance deletions (court orders, GDPR right-to-erasure), you need a faster path. Use a β€œblocked tweet IDs” set in Redis that the timeline service checks before returning results. This adds ~1ms latency but ensures legally required deletions take effect within seconds.
  1. Chronological is the simplest and what users often say they prefer
  2. Algorithmic ranking uses engagement prediction (will this user engage with this tweet?) based on features like: author relationship strength, content type, recency, engagement velocity
  3. Hybrid: show the most recent tweets chronologically at the top, then switch to relevance-ranked for older tweets
  4. Store a pre-computed ranking score alongside the tweet ID in the timeline cache
What interviewers want to hear: The ranking model runs at read time using pre-computed features. Training happens offline on billions of engagement signals. A/B testing is essential β€” Twitter famously received backlash when switching to algorithmic timelines, so they offered both options. Mention this trade-off between engagement metrics (algorithmic wins) and user satisfaction (chronological feels more fair).

Key Trade-offs

DecisionOption AOption BRecommendation
Fan-out strategyWrite (push)Read (pull)Hybrid. Fan-out on write for users with fewer than ~10K followers (covers 99%+ of accounts). Fan-out on read for celebrities. The math: at 200M DAU with a 50:1 read-to-write ratio, pre-computing timelines converts 7M DB queries/sec into 35K Redis lookups/sec. But fanning out to 50M celebrity followers takes 500 seconds at 100K writes/sec β€” unacceptable. The hybrid gives you O(1) reads for 99% of the timeline and adds only 5-20ms of merge latency for celebrity tweets.
Tweet storageSQL (PostgreSQL)NoSQL (Cassandra)Cassandra for tweets, PostgreSQL for users. Tweets are append-heavy (20M/day), rarely updated, and read by partition key (user_id + time range). This is Cassandra’s sweet spot. User data needs ACID transactions (follow/unfollow, profile updates), which PostgreSQL handles natively. The trade-off: you lose JOIN capability, but you never need to JOIN tweets with user data in the hot path β€” denormalize the author name into the tweet record.
Timeline cache structureList (LPUSH/LRANGE)Sorted Set (ZADD/ZRANGE)Sorted Set in Redis. The score is the tweet timestamp (or ranking score for algorithmic timelines). This gives you O(log N) inserts, O(log N + M) range queries, and natural deduplication. Lists are simpler but require O(N) deduplication and do not support score-based ranking. At 200M cached timelines x 800 entries x 8 bytes per entry = ~1.28 TB of Redis memory β€” expensive but feasible with a cluster.
Real-time updatesWebSocket pushClient pollingSelective WebSocket push for active sessions, client polling as fallback. Maintaining 200M persistent connections is expensive (~10KB per connection = 2TB of memory). Twitter’s approach: push updates only to users with the app open (typically 5-10% of DAU), poll for everyone else on a 30-60 second interval. This reduces connection count by 90% while keeping active users feeling real-time.
Search indexIn-houseElasticsearchElasticsearch for the first iteration β€” it handles full-text search, real-time indexing, and relevance scoring out of the box. At Twitter’s scale (20M tweets/day), you eventually build custom infrastructure, but ES handles up to ~100K index operations/sec per cluster. The trade-off: ES clusters are operationally expensive and require careful shard management, but the alternative (building a custom inverted index) is a multi-year engineering investment.

Common Candidate Mistakes

Mistake 1: Only discussing one fan-out strategy
  Candidates who only propose fan-out on write get stuck when
  asked "what about a user with 50M followers?" Those who only
  propose fan-out on read get stuck when asked "how do you serve
  35K timeline reads per second?" The hybrid approach is the
  expected answer -- know why each piece exists.

Mistake 2: Ignoring the read:write ratio
  Twitter's read:write ratio is roughly 50:1 to 100:1. This
  means you should heavily optimize the read path (pre-computed
  caches, CDN for media) and accept more complexity on writes.
  Candidates who design a write-optimized system miss the point.

Mistake 3: Not considering cache memory requirements
  200M DAU x 800 tweet IDs per timeline x 8 bytes = 1.28 TB of
  Redis memory just for timeline caches. This is achievable with
  a Redis cluster but expensive. Discuss TTL-based eviction for
  inactive users and only caching timelines for users who have
  been active in the last 7 days.

Mistake 4: Using a single Kafka topic for fan-out
  A single fan-out topic with 200M+ messages per tweet creates a
  hot partition. Shard fan-out work by follower ID range, with
  multiple consumer groups processing in parallel. Each consumer
  batch-writes to Redis using pipeline mode for efficiency.

Interview Deep-Dive Questions

Q1: Fan-out Trade-offs at Scale

Question: Walk me through the fan-out on write versus fan-out on read trade-off. You have 200M DAU and a 50:1 read-to-write ratio. Which do you pick and why? Be specific about the numbers. Strong Answer:
  • The read-to-write ratio is the deciding factor. At 50:1, every optimization dollar should go toward the read path. Fan-out on write pre-computes timelines so reads become a single O(1) Redis ZREVRANGE β€” at 35K peak QPS that means 35K simple cache lookups instead of 35K x 200 = 7M DB queries per second.
  • Fan-out on write costs us more storage and write amplification. When a user with 5K followers tweets, that is 5K ZADD operations β€” roughly 5ms with Redis pipelining. Totally acceptable for regular users.
  • The breaking point is celebrities. A user with 50M followers triggers 50M writes. At 100K pipelined writes/sec, that takes ~500 seconds β€” the tweet is stale before delivery finishes. This is why you need the hybrid: push for regular users (under ~10K followers), pull for celebrities at read time.
  • Memory math: 200M DAU x 800 tweet IDs x 8 bytes = ~1.28 TB of Redis. That is around 80 r6g.2xlarge instances on AWS at roughly $25K/month. Expensive but manageable for a platform at this scale.
  • The key trade-off most people miss: fan-out on write also wastes writes for inactive users. If only 40% of users check their timeline daily, 60% of those writes are thrown away when the cache TTLs out. You can mitigate this by only maintaining timeline caches for users active in the last 7 days, and falling back to fan-out on read for dormant accounts.
Red flag answer: β€œI would just use fan-out on write for everyone.” This ignores the celebrity problem entirely and shows the candidate has not thought about write amplification at scale. Equally bad: β€œI would use fan-out on read for everyone” β€” this ignores the 50:1 ratio and guarantees unacceptable read latency. Follow-ups:
  1. How do you handle the transition when a user crosses the 10K follower threshold mid-day? Do their old tweets stay pushed, or do you backfill?
  2. A tweet from a 9,999-follower account goes viral and they gain 50K followers in an hour. How does your system handle this race condition?
  3. What metrics would you monitor to dynamically adjust the celebrity threshold instead of using a hard-coded 10K cutoff?

Q2: Timeline Ranking β€” Chronological vs. Algorithmic

Question: Your PM wants to switch from chronological to algorithmic timeline ranking. How would you design the ranking system, and what are the risks? Strong Answer:
  • The ranking model runs at read time on pre-computed features, not raw data. You never want ML inference blocking the hot path with feature computation. Pre-compute features like author-reader interaction strength (how often does this user like/reply to this author), tweet engagement velocity (likes per minute in the first 10 minutes), content type affinity (does this user engage more with images or text), and recency decay.
  • Architecture: a lightweight scoring service sits between the timeline cache and the API response. It takes the top ~200 candidates from the cache, scores them using a gradient-boosted tree or small neural net (inference under 10ms), and returns the top 50. The model is trained offline on billions of (user, tweet, engagement) tuples using something like XGBoost or a two-tower embedding model.
  • The risk is engagement trap: algorithmic feeds optimize for engagement (clicks, likes, retweets) which can amplify outrage content because anger drives engagement. Twitter saw exactly this backlash β€” users accused the algorithm of promoting inflammatory content. The mitigation is multi-objective optimization: score = 0.6 * engagement_prediction + 0.3 * satisfaction_prediction + 0.1 * diversity_bonus.
  • Always offer a chronological fallback. Twitter learned this the hard way and added β€œLatest Tweets” as a toggle. Store the user’s preference and short-circuit the ranker when chronological is selected.
  • A/B testing is non-negotiable before any rollout. Measure not just engagement rate but also session length, unfollow rate, and β€œtime well spent” surveys. Engagement can go up while user satisfaction goes down β€” those metrics can diverge.
Red flag answer: β€œJust sort by a relevance score using machine learning.” This is hand-waving. A strong candidate names specific features, discusses the inference latency budget, and acknowledges the engagement-vs-satisfaction tension. Bonus red flag: not mentioning A/B testing at all. Follow-ups:
  1. How do you handle the cold-start problem for a new user who has no engagement history to train features on?
  2. Your ranking model has a bug that accidentally suppresses all tweets containing links for 3 hours. How do you detect and recover from this?
  3. How would you design the ranking pipeline to be explainable β€” i.e., β€œwhy am I seeing this tweet?”

Q3: The Celebrity Problem in Depth

Question: Elon Musk has 180M followers and tweets 30 times a day. Walk me through exactly what happens in your system when he posts a tweet, end to end. Where are the bottlenecks? Strong Answer:
  • Since 180M followers far exceeds the celebrity threshold, this tweet does NOT trigger fan-out on write. The tweet is saved to Cassandra (partitioned by user_id, clustered by tweet_id DESC), the tweet object is cached in Redis (tweet:), and it is published to Kafka for search indexing and trending topic tracking. Total write-path latency: ~20ms. No follower-specific work happens at write time.
  • The cost shifts to read time. When any of his 180M followers opens their timeline, the timeline service fetches their pre-computed cache (regular user tweets) AND pulls Elon’s latest tweets from the celebrity tweet cache. This adds one extra Redis lookup per celebrity the user follows β€” typically 5-20 celebrities, so 5-20ms of extra latency.
  • The bottleneck is the celebrity tweet cache. If 50M of Elon’s followers are online and request their timeline within a 5-minute window, that celebrity cache key gets hit 50M / 300 = ~167K times per second. You need read replicas on that Redis key, or a local in-process cache (like Caffeine/Guava) with a 5-10 second TTL on each timeline service instance to absorb the read amplification.
  • Another bottleneck: the merge step. Each user’s timeline service must merge their pre-computed timeline with celebrity tweets and re-sort. If a user follows 20 celebrities, that is merging ~200 candidate tweets and sorting. This is fast (sub-millisecond in memory) but at 35K QPS it adds up to CPU pressure on the timeline service fleet.
  • Engagement counters (likes, retweets) on viral celebrity tweets are a hidden bottleneck. A tweet getting 500K likes per minute means 8,333 counter increments per second on a single row. You need a distributed counter pattern β€” shard the counter across N Redis keys (likes::shard_0 through shard_15) and sum on read.
Red flag answer: β€œWe just fan out to all 180M followers asynchronously.” This shows the candidate did not internalize why the hybrid approach exists. At 100K writes/sec, fanning out 180M entries takes 30 minutes β€” and Elon tweets 30 times a day, meaning the system would never catch up. Follow-ups:
  1. What happens when a celebrity deletes a tweet? How is the reverse fan-out different from the write path?
  2. How do you handle a celebrity tweeting during a major live event (Super Bowl) when read QPS spikes 10x?
  3. If a celebrity’s account gets hacked and posts spam to 180M followers, how does your system limit the blast radius?

Question: Design the trending topics feature. How do you compute trends in real-time, and how do you prevent manipulation? Strong Answer:
  • The naive approach is counting hashtag frequency, but that surfaces perpetually popular topics (like #love or #music) rather than genuinely trending ones. The key insight is measuring the rate of change β€” a topic is trending when its mention velocity exceeds its historical baseline. Think of it as a z-score: (current_rate - historical_mean) / standard_deviation. A hashtag that normally gets 100 mentions/hour but suddenly gets 10,000 mentions/hour is more β€œtrending” than one that always gets 50,000.
  • Architecture: tweets flow into Kafka, a stream processor (Flink or Kafka Streams) extracts hashtags and computes windowed counts β€” 1-minute tumbling windows aggregated into 5-minute sliding windows. These counts feed into Redis sorted sets keyed by location (trending:US:current, trending:global:current). A background job computes the velocity score every minute and updates the sorted set scores.
  • Gaming prevention is the hard part. Bot networks coordinate to push hashtags. Defenses: (1) weight contributions by account age and trust score β€” a 6-year-old verified account’s hashtag counts for more than a 2-day-old account with no followers, (2) detect coordinated behavior β€” if 1,000 accounts all tweet the same hashtag within 30 seconds and those accounts share creation date patterns or IP ranges, discount them, (3) rate-limit hashtag contributions per account β€” one account cannot increment a hashtag more than 3 times per hour, (4) human review queue for topics that spike unnaturally fast.
  • Localization matters. Trends in Tokyo and trends in New York are different. Partition by country and city using the user’s profile location or IP geolocation. Maintain separate sorted sets per region and a global rollup.
  • The ZUNIONSTORE approach in the code above works for small scale but becomes a bottleneck at Twitter scale. At 20M tweets/day with an average of 1.5 hashtags each, that is 30M hashtag events per day flowing through the pipeline. The stream processor must partition by hashtag to avoid hot keys for viral tags.
Red flag answer: β€œJust count hashtags and pick the top 10.” This misses the velocity-over-baseline insight entirely and would surface the same generic hashtags every day. Also a red flag: no mention of gaming/manipulation prevention β€” this is a top-3 concern for any trending feature in production. Follow-ups:
  1. How would you handle a breaking news event where the trending topic is not a hashtag but a person’s name or a phrase? How do you extract trends from unstructured text?
  2. Your trending algorithm surfaces a politically sensitive topic that is genuinely trending but your trust-and-safety team wants removed. How do you design the system to support editorial overrides without compromising the engineering?
  3. How do you handle the β€œecho chamber” effect where trending topics in a user’s network differ wildly from globally trending topics?

Q5: Search Architecture at Twitter Scale

Question: A user searches for β€œelection results” on Twitter. Walk me through the request lifecycle from keystroke to rendered results. How do you handle relevance, recency, and scale? Strong Answer:
  • The query hits the API gateway, which routes to the Search Service. The Search Service sends the query to an Elasticsearch cluster. But this is not a simple match query β€” Twitter search is fundamentally different from web search because recency dominates relevance. A tweet from 2 minutes ago about election results matters more than a highly-liked tweet from 3 months ago.
  • Index design: tweets are indexed into time-partitioned Elasticsearch indices β€” one index per day (or per hour during high-volume events). The search query fans out to the last N indices (e.g., last 7 days by default, last 24 hours for β€œLatest” tab). This partitioning means old indices can be moved to cheaper storage or fewer replicas.
  • Relevance scoring combines: (1) text relevance (BM25 score from Elasticsearch), (2) recency decay (exponential decay function β€” a tweet loses 50% relevance score every 6 hours), (3) author authority (verified status, follower count as a log-scaled signal), (4) engagement signals (likes and retweets boosted, but with diminishing returns to avoid viral-only results).
  • Typeahead/autocomplete is a separate system. As the user types, the client queries a prefix tree (trie) or a dedicated autocomplete index that suggests trending queries, usernames, and hashtags. This must respond in under 50ms, so it uses an in-memory data structure, not Elasticsearch.
  • At scale, the Elasticsearch cluster is enormous β€” potentially hundreds of nodes. The key optimization is routing: use a custom routing key (like user_id % shard_count) for user-specific queries, but for global keyword search, you must scatter-gather across all shards. To keep latency under 200ms, use aggressive timeouts (100ms per shard) and return partial results if some shards are slow.
  • Abuse vector: search is a spam magnet. Spammers stuff trending keywords into tweets to appear in search results. Counter this with a search-quality layer that demotes tweets from low-trust accounts, applies duplicate content detection (near-duplicate hashing like SimHash), and boosts tweets from accounts the searcher follows or engages with.
Red flag answer: β€œJust put tweets in Elasticsearch and query them.” This ignores time-partitioned indexing, the recency-vs-relevance tension, typeahead as a separate system, and abuse prevention. Any answer that treats Twitter search like Google web search misses the fundamental difference: tweets are ephemeral, web pages are not. Follow-ups:
  1. How do you handle a search query that matches 50 million tweets? You cannot score all of them. What is your early-termination strategy?
  2. The Elasticsearch cluster has a hot shard for today’s index during a major event. How do you rebalance in real time?
  3. How would you implement β€œsearch within a conversation thread” versus global search? Are they the same system or different?

Q6: Abuse Detection and Content Moderation at Scale

Question: Design the abuse detection and content moderation pipeline for Twitter. A user posts a tweet β€” how do you decide in real time whether it violates platform policies? Strong Answer:
  • Content moderation must run in the tweet creation hot path, but you cannot block tweet publishing on slow ML models. The solution is a two-tier system: a fast synchronous tier (under 50ms) that catches obvious violations before the tweet is visible, and a slow asynchronous tier (minutes to hours) that does deeper analysis.
  • Tier 1 (synchronous, pre-publish): a lightweight classifier runs against the tweet text. This is typically a distilled BERT model or a fast regex/keyword matcher for known harmful content (slurs, known CSAM hashes using PhotoDNA, known spam URLs from a blocklist). If the confidence score exceeds 0.95, the tweet is blocked immediately and the user sees an error. If it scores between 0.7-0.95, the tweet is published but flagged for urgent human review. Under 0.7, it passes through.
  • Tier 2 (asynchronous, post-publish): a Kafka consumer picks up every published tweet and runs heavier models β€” image classification (nudity detection, violence detection via CNNs), coordinated inauthentic behavior detection (graph analysis of retweet/reply patterns), misinformation matching (embedding similarity against a known-claims database). Violations trigger automatic removal or visibility reduction (β€œsoft intervention” where the tweet stays up but is de-ranked from search and recommendations).
  • The hardest part is not the ML β€” it is the policy layer. Content rules change constantly (new regulations, new types of abuse, cultural context). The system needs a rules engine that is decoupled from the ML models. Policy team writes rules like β€œif hate_speech_score > 0.8 AND region = EU then action = remove” and these rules are hot-reloaded without redeploying the service.
  • False positive management is critical. At 20M tweets/day, even a 0.1% false positive rate means 20K legitimate tweets wrongly blocked per day. You need an appeals pipeline: blocked users can contest, which routes to human review with the ML model’s explanation attached. Track false positive rate as a top-line metric alongside catch rate.
  • Rate limiting and behavioral signals complement content analysis: a new account posting 100 tweets/hour with trending hashtags is almost certainly a bot, regardless of content. Combine content signals with behavioral signals (posting frequency, account age, follower/following ratio, device fingerprint) in a composite risk score.
Red flag answer: β€œUse a machine learning model to classify tweets.” This is vague to the point of being meaningless. No mention of the two-tier sync/async architecture, no mention of false positive rates, no mention of the policy layer being separate from the ML layer, no mention of human-in-the-loop review. Also a red flag: any answer that suggests blocking all flagged content without an appeals process. Follow-ups:
  1. A state actor is running a disinformation campaign using AI-generated text that passes your content classifiers. How do you adapt?
  2. Your abuse detection model has a bias β€” it flags African American Vernacular English (AAVE) as toxic at 2x the rate of other dialects. How do you detect and fix this?
  3. During a mass-shooting event, your system needs to simultaneously suppress graphic content AND allow news reporting. How do you handle this tension in real time?

Q7: Timeline Cache Consistency and Failure Modes

Question: The Redis cluster holding your timeline caches loses a node. What happens to the 10 million users whose timelines were on that node? How does the system recover? Strong Answer:
  • First, Redis cluster setup matters. Each shard should have at least one replica (ideally two for critical data). When a primary node dies, Redis Sentinel or Redis Cluster’s built-in failover promotes the replica to primary within 10-30 seconds. During those seconds, writes to that shard fail β€” fan-out messages for those users accumulate in Kafka (which is durable) and are retried once the new primary is up.
  • Timeline reads during failover return a cache miss. The timeline service must have a fallback path: on cache miss, fall back to fan-out on read for that user β€” fetch their followee list, query Cassandra for recent tweets from each followee, merge and sort, then serve the result AND repopulate the cache. This is slower (200-500ms vs 10ms) but keeps the system available.
  • The critical design principle: the timeline cache is a cache, not a source of truth. The source of truth is the tweet table in Cassandra and the follow graph. Any cache entry can be reconstructed from these two sources. This means losing the entire Redis cluster is survivable (painful, but survivable) β€” it just means temporary latency increase while caches are rebuilt.
  • Cache warming strategy after recovery: do NOT try to rebuild all 10M timelines at once β€” that would overload Cassandra with a thundering herd. Instead, rebuild lazily (on next user request) and proactively only for β€œhot” users (users who have been active in the last hour, based on a separate active-users set). Spread proactive rebuilds over 5-10 minutes using a priority queue.
  • Kafka is the key to durability here. Fan-out messages are not fire-and-forget β€” they are committed to Kafka with a retention period (e.g., 24 hours). If a Redis node dies and comes back, the fan-out consumer can replay messages from the last known offset to rebuild the cache. This is why consumer offset management in Kafka is critical β€” store offsets in Kafka itself (not in the Redis that just died).
Red flag answer: β€œRedis has replication so it is fine.” This hand-waves past the actual failure sequence, the fallback read path, the thundering herd problem during recovery, and the role of Kafka as a durability layer. Also a red flag: any answer that treats the timeline cache as the source of truth. Follow-ups:
  1. What if it is not a single node but an entire availability zone that goes down, taking 33% of your Redis cluster? How does your architecture handle AZ failure?
  2. How do you detect β€œsilent corruption” where a Redis node is up but returning stale data because replication lag exceeded the TTL?
  3. You discover that 5% of timeline caches have a subtle data inconsistency β€” some tweets appear in timelines of users who do not follow the author. How do you audit and repair at scale?

Q8: Designing Twitter Direct Messages β€” Different Constraints Entirely

Question: Now design the Direct Messages (DM) feature. Why can you NOT reuse the timeline architecture for DMs? What changes fundamentally? Strong Answer:
  • The timeline and DMs look superficially similar (a list of messages sorted by time) but have completely different consistency and privacy requirements. Timelines tolerate eventual consistency β€” if a tweet appears 5 seconds late, nobody notices. DMs require strong ordering and at-least-once delivery β€” if a message is lost or arrives out of order, users notice immediately and trust erodes.
  • Data model shift: timelines are a single user reading a merged feed from many authors. DMs are a conversation between exactly 2 (or N in group chats) participants. The partition key changes from user_id to conversation_id, and you need a per-user conversation list sorted by last-message timestamp (an inbox).
  • Storage choice changes: Cassandra is great for timelines (high write throughput, eventual consistency) but DMs need stronger consistency guarantees. You would use a partition-aware store like CockroachDB or DynamoDB with strong consistency reads, or Cassandra with LOCAL_QUORUM reads/writes and careful conflict resolution.
  • No fan-out needed: a DM goes to 1 recipient (or a small group), not thousands of followers. The write path is simple: write to the messages table, update both users’ inbox sorted sets, send a push notification. The complexity shifts to real-time delivery β€” you need WebSocket connections to push messages instantly, with a message queue (like RabbitMQ or Redis Streams) per-user for buffering when the WebSocket is disconnected.
  • End-to-end encryption (E2EE) changes everything architecturally. With E2EE, the server cannot read message content, which means: no server-side search over DM content, no server-side spam/abuse detection on DM text (you have to rely on metadata and user reports), and key management becomes a major subsystem (device-based keys, key rotation, multi-device sync via the Signal Protocol or similar).
  • Read receipts and typing indicators are real-time presence signals that do not exist in the timeline world. These require a lightweight presence service (heartbeats every 5-10 seconds per active connection) that tracks who is online and in which conversation. At 50M concurrent connections, this presence service alone needs careful engineering β€” fan-out of β€œuser is typing” to the other conversation participant must be sub-100ms.
Red flag answer: β€œIt is basically the same as the timeline, just between two users.” This completely misses the consistency, encryption, real-time delivery, and presence requirements that make DMs a fundamentally different system. Also a red flag: not mentioning E2EE at all β€” post-2023, any messaging system design that ignores encryption will raise eyebrows. Follow-ups:
  1. How do you handle message ordering in a group DM when two participants send messages at the same millisecond from different data centers?
  2. A user has 50,000 DM conversations. How do you efficiently render their inbox sorted by most-recent-message without scanning all 50K conversations?
  3. Law enforcement serves a legal request for a user’s DM history, but you have E2EE enabled. What is your technical and architectural response?