What is System Design?

System Design, also called High-Level Design (HLD), focuses on the architecture of software systems. While Low-Level Design (LLD) deals with classes and code, HLD deals with:
  • Components and how they interact
  • Scalability and handling growth
  • Reliability and fault tolerance
  • Data storage and retrieval patterns
  • Trade-offs between different approaches
Think of system design like city planning. LLD is architecture for individual buildings (how rooms connect, plumbing layout). HLD is urban planning: where do you put the roads, power grid, water treatment, and hospitals so that a city of 10 million people actually works? You are the city planner. Every decision — where to route traffic, how many fire stations, whether to build a subway — involves trade-offs between cost, speed, and resilience.
Interview Reality: System design interviews are often the deciding factor for senior roles. A strong system design performance can make up for weaker coding rounds!
Practical tip: Interviewers are not looking for a perfect answer. They are looking for how you think. A senior engineer who says “I’d choose X because of Y, and the trade-off is Z, which is acceptable given our constraints” will always beat a candidate who memorizes the “right” architecture. Practice thinking out loud and articulating trade-offs — that is the real skill being evaluated.

🎯 Learning Path

Choose your path based on your experience level:
Level | Focus Areas | Time
Junior/Mid (L3-L4) | Fundamentals → Building Blocks → Easy Cases | 2-3 weeks
Senior (L5) | + Deep Dives → Medium Cases → Trade-offs | 4-6 weeks
Staff+ (L6+) | + Consensus → Global Architecture → All Cases | 6-8 weeks

Start Here (Interview Prep)

📋 Interview Guide

Start here! RESHADED framework and strategies

❓ Question Bank

30+ questions by difficulty with hints

🎭 Mock Interviews

Realistic scenarios with evaluation criteria

Course Structure

📚 Foundations (Everyone)

Fundamentals

Scalability, Latency, Throughput, CAP Theorem

Building Blocks

Load Balancers, Caches, Message Queues, CDN

Networking

DNS, TCP/UDP, HTTP, WebSockets, gRPC

Databases

SQL vs NoSQL, Sharding, Replication, Indexing

API Design

REST, GraphQL, Rate Limiting, Versioning

Estimations Mastery

Back-of-envelope calculations with practice problems

🔧 Core Patterns (Senior+)

Distributed Systems

Consistency, Consensus, Transactions

Microservices

Patterns, Service Mesh, Communication

Scalability Patterns

Stateless Design, Multi-level Caching, Sharding

Reliability Patterns

Circuit Breakers, Retries, Bulkheads

Event-Driven Architecture

Event Sourcing, CQRS, Saga Pattern

Trade-off Framework

Systematic approach to design decisions

🚀 Staff+ Deep Dives

For Staff/Principal Interviews: These topics separate senior from staff-level candidates. Interviewers expect deep technical knowledge and ability to make architectural decisions.

Consensus Algorithms

Raft, Paxos, Leader Election (with code)

Global Architecture

Multi-region, Disaster Recovery, Geo-distribution

Data Modeling

Event Sourcing, CQRS, Schema Evolution

Real-Time Systems

WebSockets, SSE, Push Architecture

Search Systems

Elasticsearch, Inverted Index, Ranking

Time-Series Data

Metrics, IoT, Financial data patterns

CDN & Edge Computing

Edge functions, Caching strategies

Data Pipelines

Batch vs Stream, Lambda/Kappa

📖 Quick References

Patterns Reference

Quick lookup for all design patterns

Observability

Metrics, Logs, Traces, SLIs/SLOs

Security Patterns

JWT, OAuth, RBAC, mTLS

Advanced Concepts

Linearizability, Vector Clocks

💼 Case Studies (Practice These!)

URL Shortener

🟢 Easy - Start here

Rate Limiter

🟢 Easy - Algorithms

Notification System

🟡 Medium - Multi-channel

Twitter/X

🟡 Medium - Feed design

WhatsApp

🟡 Medium - Real-time

E-Commerce

🟡 Medium - Transactions

Payment System

🔴 Hard - Financial

Netflix

🔴 Hard - Video streaming

Spotify

🔴 Hard - Audio streaming

Uber

🔴 Hard - Geo-matching

Google Maps

🔴 Hard - Navigation

Interview Framework (RESHADED)

1. R - Requirements (5 min)

Ask about users, scale, features, and constraints. “Are we optimizing for reads or writes?”

2. E - Estimation (5 min)

Calculate QPS, storage, bandwidth. Show your math clearly.

3. S - Storage Schema (5 min)

Define data models, choose SQL vs NoSQL with justification.

4. H - High-Level Design (10 min)

Draw main components and data flow. Start simple.

5. A - API Design (5 min)

Define key endpoints, request/response formats.

6. D - Deep Dive (10 min)

Detail specific components, discuss trade-offs.

7. E - Edge Cases (3 min)

Handle failures, race conditions, hot spots.

8. D - Discuss (2 min)

Summarize trade-offs, mention what you’d add given more time.
Pro Tip: Check out the full Interview Guide for the complete RESHADED framework, cheatsheets, and common mistakes to avoid.

Top Interview Problems by Level

Problem | Level | Key Concepts | Priority
URL Shortener | L3-L4 | Hashing, DB, Cache | Must Know
Rate Limiter | L3-L4 | Token Bucket, Redis | Must Know
Twitter Feed | L4-L5 | Fan-out, Timeline, Cache | Must Know
WhatsApp | L4-L5 | WebSockets, Message Queue | Must Know
E-Commerce | L4-L5 | Inventory, Transactions | Must Know
Notification | L4-L5 | Push/Pull, Templates | Recommended
Spotify | L5 | Streaming, Recommendations | Recommended
YouTube | L5-L6 | Video Processing, CDN | Recommended
Uber | L5-L6 | Location, Matching | Know Basics
Google Docs | L6+ | CRDT, Real-time collab | Staff+

Quick Reference: Key Numbers

Latency Numbers

  • L1 cache: 0.5 ns
  • RAM: 100 ns
  • SSD: 100 μs
  • HDD: 10 ms
  • Same DC: 0.5 ms
  • Cross-continent: 150 ms

Scale Numbers

  • Seconds/day: 86,400 (round to ~100,000 for quick math)
  • 1M DAU → ~12 QPS (assuming one request per user per day)
  • 100M DAU → ~1,200 QPS (same assumption)
  • 99.99% = ~52 min downtime/year
Availability | Downtime/Year | Use Case
99% (2 nines) | 3.65 days | Internal tools
99.9% (3 nines) | 8.76 hours | Standard SaaS
99.99% (4 nines) | 52.6 minutes | Critical services
99.999% (5 nines) | 5.26 minutes | Payment systems
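
The scale and availability figures above come from two small formulas. Here is a minimal Python sketch of both, assuming traffic is spread evenly across the day and a 365-day year; the function names are illustrative and not part of the course material.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 exactly
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # ignores leap years


def average_qps(daily_active_users: int, requests_per_user_per_day: float = 1.0) -> float:
    """Average requests per second, assuming traffic is spread evenly over the day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a service can be down and still meet the availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR / 60


if __name__ == "__main__":
    for dau in (1_000_000, 100_000_000):
        print(f"{dau:>11,} DAU -> {average_qps(dau):,.1f} QPS")
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_minutes_per_year(target):,.2f} min/year")
```

Real traffic is rarely flat, so in an interview it is reasonable to multiply the average by a stated peak factor and say that the factor is an assumption.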

What Makes This Course Different

💻 Working Code

Every pattern includes Python + JavaScript implementations you can run

🎯 Interview Focused

Organized by what interviewers actually ask at each level

⚖️ Trade-off Framework

Systematic approach to design decisions, not just memorized answers

🔄 Real Scenarios

Mock interview scripts with expected follow-ups
Start with fundamentals! Many candidates jump to case studies without understanding the building blocks. This leads to surface-level answers that crumble under follow-up questions. A senior engineer who says “I’d use a message queue here” but cannot explain backpressure, exactly-once delivery trade-offs, or dead-letter queues will not pass. Master the building blocks first, then the patterns, then the case studies. Each layer builds on the one before it.

Interview Deep-Dive Questions

What the interviewer is really testing: Whether you have a repeatable methodology for tackling ambiguity, and whether you understand why each phase of the design process exists: not just the sequence, but the reasoning behind it.
Strong Answer:
  • I use a structured approach that I think of as “narrow before you build.” The biggest mistake candidates make is jumping to drawing boxes and arrows before understanding what they are building. My framework has four major phases, and the ordering is intentional because each phase reduces ambiguity for the next.
  • Phase 1 — Requirements and constraints (5 minutes): Ask clarifying questions to turn a vague prompt into a concrete problem. “Design Twitter” is not a problem — “Design the home timeline feed for 500M users where 80% of traffic is reads” is a problem. I specifically ask: Who are the users and what are the primary use cases? What is the expected scale (DAU, QPS, storage)? What are the most important quality attributes (latency, consistency, availability)? What is explicitly out of scope? This phase matters because a system optimized for 1000 users looks nothing like one for 100M users.
  • Phase 2 - Estimation and data model (5 minutes): Back-of-envelope calculations to establish the scale constraints. If we have 500M DAU with an average of 10 timeline loads per day, that is 5B reads/day, or roughly 58K QPS on average (5B / 86,400 seconds); round up to 60K and expect peaks above that. Fan-out: if each user follows 200 accounts and the average account posts 2 tweets/day, each timeline has about 400 new candidate tweets per day to rank. This informs whether we need pre-computation (fan-out on write) or can compute on the fly (fan-out on read). I also sketch the core data model here: what entities exist and what the key relationships are.
  • Phase 3 — High-level design (10 minutes): Draw the main components and their interactions. Start with the simplest possible design that meets the requirements, then iterate. For the timeline, that might be: Client, API Gateway, Timeline Service, Cache (pre-computed timelines), Tweet Storage. Explain why each component exists and what protocol connects them.
  • Phase 4 — Deep dive and trade-offs (15 minutes): This is where you show depth. Pick 2-3 components and go deep. For the timeline, the interesting deep dives are: (a) fan-out strategy — on-write for normal users, on-read for celebrities (hybrid approach), (b) ranking algorithm — chronological vs. engagement-weighted, (c) caching strategy — pre-computed timelines in Redis with invalidation on new tweets. For each decision, explicitly state the trade-off: “Fan-out on write gives low-latency reads but costs more storage and makes celebrity tweets slow to propagate.”
  • The ordering matters because you cannot make good component decisions without understanding scale (Phase 2), and you cannot estimate scale without understanding requirements (Phase 1). Candidates who skip to Phase 3 end up backtracking when the interviewer asks “but what if there are 100M users?” and their single-server design falls apart.
  • Example: In a real interview at a FAANG company, I was asked to design a notification system. I spent the first 5 minutes establishing that the key constraint was multi-channel delivery (push, SMS, email) with different latency requirements for each channel. That constraint completely shaped the architecture — a priority queue per channel rather than a single queue. If I had skipped requirements gathering, I would have designed a single-queue system and had to redesign mid-interview.
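
The Phase 2 arithmetic can be written down so each number traces back to a stated input. This is a sketch using the figures from the answer above; the `TimelineWorkload` class and the peak factor are hypothetical names and assumptions added for illustration.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class TimelineWorkload:
    """Inputs gathered during requirements; every derived number comes from these."""
    daily_active_users: int
    timeline_loads_per_user_per_day: int
    accounts_followed: int
    tweets_per_account_per_day: int

    @property
    def reads_per_day(self) -> int:
        return self.daily_active_users * self.timeline_loads_per_user_per_day

    @property
    def average_read_qps(self) -> float:
        return self.reads_per_day / SECONDS_PER_DAY

    @property
    def candidate_tweets_per_day(self) -> int:
        """New tweets per day that could appear in one user's timeline."""
        return self.accounts_followed * self.tweets_per_account_per_day

    def peak_read_qps(self, peak_factor: float) -> float:
        """Peak estimate; the factor is an assumption you should state out loud."""
        return self.average_read_qps * peak_factor


workload = TimelineWorkload(
    daily_active_users=500_000_000,
    timeline_loads_per_user_per_day=10,
    accounts_followed=200,
    tweets_per_account_per_day=2,
)

if __name__ == "__main__":
    print(f"reads/day:        {workload.reads_per_day:,}")
    print(f"average read QPS: {workload.average_read_qps:,.0f}")
    print(f"candidates/day:   {workload.candidate_tweets_per_day}")
```

Keeping the inputs separate from the derived values makes it easy to redo the estimate when the interviewer changes a number.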
Follow-up: You are 25 minutes into a 45-minute interview and the interviewer says “let us go deeper on the database choice.” You have been using a generic “Database” box in your diagram. How do you handle this?
This is a normal and expected part of the interview. The interviewer is testing depth. Walk through the decision systematically: (1) What are the access patterns? (reads vs. writes, point lookups vs. range scans, joins vs. denormalized). (2) What are the consistency requirements? (strong vs. eventual). (3) What is the scale? (can a single node handle it, or do we need horizontal scaling?). Then make a concrete choice with justification: “For the timeline cache, I would use Redis because we need sub-millisecond reads for pre-computed lists and the data structure (sorted set) maps perfectly to a ranked timeline. For the tweet store, I would use Cassandra because we need high write throughput, the access pattern is partition-key lookups (get tweets by user_id), and we can tolerate eventual consistency for tweet reads.” Always name a specific technology and explain why.
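
The hybrid fan-out strategy from the deep-dive phase (push on write for normal accounts, pull on read for celebrities) can be sketched in memory. This is an illustration under stated assumptions, not a production design: the dictionaries stand in for the tweet store and the per-user timeline cache, and `TimelineService` and `celebrity_threshold` are hypothetical names.

```python
import heapq
import itertools
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Tweet:
    timestamp: int  # logical clock; larger means newer
    author: str
    text: str


class TimelineService:
    def __init__(self, celebrity_threshold: int):
        self.celebrity_threshold = celebrity_threshold  # hypothetical follower cutoff
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # stand-in for the tweet store
        self.cached_timelines = defaultdict(list)  # stand-in for a per-user cache
        self._clock = itertools.count(1)

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author: str, text: str) -> Tweet:
        tweet = Tweet(next(self._clock), author, text)
        self.tweets_by_author[author].append(tweet)
        # Fan-out on write: push to each follower's cache, but only for normal accounts.
        if not self.is_celebrity(author):
            for follower in self.followers[author]:
                self.cached_timelines[follower].append(tweet)
        return tweet

    def home_timeline(self, user: str, limit: int = 10) -> list:
        # Start from the pre-computed entries, then pull celebrity tweets at read time.
        # A set removes duplicates if an author crossed the threshold after posting.
        candidates = set(self.cached_timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                candidates.update(self.tweets_by_author[author])
        return heapq.nlargest(limit, candidates)  # newest first


if __name__ == "__main__":
    service = TimelineService(celebrity_threshold=2)
    service.follow("alice", "bob")
    service.follow("alice", "star")
    service.follow("carol", "star")
    service.post("bob", "hello")
    service.post("star", "big news")
    for tweet in service.home_timeline("alice"):
        print(tweet.author, tweet.text)
```

The trade-off is visible in the code: `post` does work proportional to follower count for normal accounts, while `home_timeline` does extra work per followed celebrity.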
What the interviewer is really testing: Whether you understand that database selection is about access patterns and consistency requirements, not a binary SQL-vs-NoSQL decision, and whether you can teach and mentor effectively.
Strong Answer:
  • The statement is not wrong, but it is dangerously incomplete. NoSQL databases can scale writes horizontally more easily than most SQL databases, but that is one dimension of a multi-dimensional decision. The nuance the junior engineer is missing is that “scales better” depends on what you are scaling and what you are giving up.
  • What SQL gives you that most NoSQL databases do not: (1) ACID transactions across multiple rows and tables. If you are building a financial system where “debit account A and credit account B” must be atomic, you need transactions. (2) Flexible querying with joins. If you do not know all your query patterns upfront, SQL’s query planner lets you ask ad-hoc questions. NoSQL databases are optimized for specific access patterns — if your query does not match the partition key design, you get full table scans. (3) Strong consistency by default. With most SQL databases, a read after a write returns the latest value. With many NoSQL databases, you get eventual consistency unless you explicitly request strong reads, which often negates the performance benefit.
  • What NoSQL gives you: (1) Horizontal write scaling. Cassandra, DynamoDB, and MongoDB distribute writes across many nodes by partitioning on a key (consistent hashing on the partition key in Cassandra, hash partitioning in DynamoDB, hashed or ranged shard keys in MongoDB). Most SQL databases require sharding, which is operationally complex. (2) Schema flexibility. Document stores let you evolve your data model without up-front schema migrations. (3) Purpose-built data models: key-value stores for caching (Redis), wide-column for time-series (Cassandra), document for nested/variable schemas (MongoDB), graph for relationship traversal (Neo4j).
  • The right framework is to start with access patterns: What queries will you run? How will you write data? What consistency do you need? Then pick the database that best serves those patterns. For an e-commerce product catalog (hierarchical, variable attributes, read-heavy), a document store makes sense. For the order processing pipeline (transactional, relational, consistency-critical), PostgreSQL is likely the right choice. Many real systems use both — polyglot persistence.
  • How I would respond to the junior engineer: “You are right that horizontal scaling is easier with NoSQL. But let me ask you — what are the access patterns for our use case? What happens if we need to join data across tables? What consistency guarantees does the business require? Let us work through the decision together.” This turns it into a teaching moment rather than a correction.
  • Example: Instagram famously runs on PostgreSQL at massive scale. They handle billions of rows by using application-level sharding (not a NoSQL database) and aggressive caching. Their access patterns (user profiles, posts, followers) fit relational modeling well, and they valued strong consistency for core social features. Meanwhile, they use Cassandra for their direct messaging system, where the write throughput requirements and access patterns (fetch messages by conversation ID) are a better fit for wide-column storage. Same company, different databases for different problems.
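
To make the “debit account A and credit account B” case concrete, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the helper names, and the starting balances are assumptions made for the example; the point is only that both updates commit together or not at all.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory database with two accounts; the CHECK constraint forbids overdrafts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, from_id: str, to_id: str, amount: int) -> None:
    """Debit and credit in one transaction."""
    # The connection context manager commits on success and rolls back
    # if any statement inside raises.
    with conn:
        debited = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id)
        )
        if debited.rowcount != 1:
            raise LookupError(f"unknown account: {from_id}")
        credited = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id)
        )
        if credited.rowcount != 1:
            raise LookupError(f"unknown account: {to_id}")


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


if __name__ == "__main__":
    conn = open_demo_db()
    transfer(conn, "A", "B", 30)
    print(balances(conn))
    try:
        transfer(conn, "A", "Z", 10)  # credit target does not exist
    except LookupError as error:
        print("rolled back:", error)
    print(balances(conn))
```

In the failing transfer the debit has already executed when the error is raised, so the unchanged balances afterwards show the rollback at work. Getting the same guarantee across partitions in a store without multi-row transactions usually means building it in the application.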
Follow-up: The junior engineer then asks, “But what about CockroachDB or Spanner? They are SQL databases that scale horizontally. Why would anyone choose NoSQL anymore?”
Great question. NewSQL databases like CockroachDB and Spanner do offer SQL semantics with horizontal scaling. The trade-offs are: (1) Cost: Spanner is expensive, and CockroachDB’s distributed transactions add latency (cross-node coordination whenever a transaction spans nodes). For simple key-value lookups or append-only writes, DynamoDB or Cassandra will be faster and cheaper. (2) Operational complexity: running a distributed SQL cluster is more complex than using a managed NoSQL service. (3) Specialization: Redis as a cache, Elasticsearch for search, and Neo4j for graph queries are purpose-built for their access patterns and will outperform a general-purpose distributed SQL database for those specific use cases. The right mental model is: use distributed SQL when you need both horizontal scaling AND complex queries with transactions; use NoSQL when your access patterns are simple and well-defined and you can trade consistency or query flexibility for performance or cost.
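
The horizontal write scaling mentioned in the answer above often relies on consistent hashing to decide which node owns a key. Below is a minimal sketch of a hash ring with virtual nodes; it is a teaching example, not how any particular database implements partitioning, and `ConsistentHashRing` and its parameters are hypothetical names.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=(), virtual_nodes: int = 100):
        self._virtual_nodes = virtual_nodes  # ring points per physical node
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key: str) -> int:
        # First 8 bytes of SHA-256 give a 64-bit ring position.
        # Collisions between positions are ignored in this sketch.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        for replica in range(self._virtual_nodes):
            point = self._position(f"{node}#{replica}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        if not self._points:
            raise LookupError("ring has no nodes")
        index = bisect.bisect_right(self._points, self._position(key))
        return self._owner[self._points[index % len(self._points)]]


if __name__ == "__main__":
    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {key: ring.node_for(key) for key in keys}
    ring.add_node("node-d")
    moved = sum(1 for key in keys if ring.node_for(key) != before[key])
    print(f"{moved} of {len(keys)} keys moved after adding a node")
```

The property worth stating in an interview is that adding a node only moves keys onto the new node; keys never shuffle between the existing nodes, which is what keeps rebalancing cheap compared with `hash(key) % node_count`.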
What the interviewer is really testing: Whether you understand CAP at a deeper level than “pick two out of three” and can reason about consistency-availability trade-offs in practical terms.
Strong Answer:
  • The CAP theorem states that in a distributed system experiencing a network partition, you must choose between consistency (every read returns the most recent write) and availability (every request receives a response). You cannot have both during a partition. The “pick two” framing is misleading because partitions are not optional — they will happen. So the real choice is: during a partition, do you sacrifice consistency or availability?
  • Why practitioners say it is misunderstood: (1) CAP only applies during partitions. When the network is healthy (which is the vast majority of the time), you can have both consistency and availability. The interesting design question is what happens during the rare partition event — not the steady state. (2) CAP presents a binary choice, but real systems operate on a spectrum. You can tune consistency levels per operation (Cassandra’s QUORUM reads are more consistent than ONE reads but less available). (3) CAP says nothing about latency. A system that is “consistent and available” but takes 30 seconds to respond is technically CAP-compliant but useless in practice.
  • What you should think about instead: the PACELC model, which extends CAP. It says: if there is a Partition, choose between Availability and Consistency; Else (when the system is running normally), choose between Latency and Consistency. This captures the real trade-off engineers face daily. Most of the time, there is no partition. The daily trade-off is: do I read from a local replica (fast, possibly stale) or the leader (slow, definitely current)?
  • Practical examples: DynamoDB is PA/EL (during partition: available; else: low latency, eventual consistency by default). PostgreSQL with synchronous replication is PC/EC (during partition: consistent, refuses writes; else: consistent, higher latency). Cassandra is configurable: at ONE consistency it is PA/EL, while QUORUM reads and writes push it toward PC/EC, because every request waits on a majority of replicas and fails when a majority is unreachable.
  • In interviews, I translate CAP into concrete decisions: “For the shopping cart, I choose availability over consistency. If there is a network issue, I would rather show the user a potentially stale cart than an error page. For the checkout and payment flow, I choose consistency — I would rather reject the request than risk a double charge.”
  • Example: Amazon’s Dynamo paper (which inspired DynamoDB and Cassandra) explicitly chose availability over consistency for the shopping cart because a lost cart item means a lost sale, while a slightly stale cart is merely annoying. But their ordering system chose consistency because processing an order twice is far worse than temporarily rejecting one. Same company, different CAP choices for different subsystems.
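
The tunable-consistency point can be reduced to one inequality: with N replicas, W write acknowledgements, and R read acknowledgements, read and write quorums overlap when R + W > N. The sketch below models only that counting argument; `QuorumConfig` is a hypothetical name, and real systems add details (repair, hinted handoff, clock handling) that are out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    replicas: int    # N: copies of each key
    write_acks: int  # W: replicas that must acknowledge a write
    read_acks: int   # R: replicas that must answer a read

    def __post_init__(self):
        for name in ("write_acks", "read_acks"):
            value = getattr(self, name)
            if not 1 <= value <= self.replicas:
                raise ValueError(f"{name} must be between 1 and {self.replicas}")

    @property
    def quorums_overlap(self) -> bool:
        """True when every read quorum shares a replica with every write quorum."""
        return self.read_acks + self.write_acks > self.replicas

    def can_write(self, reachable_replicas: int) -> bool:
        return reachable_replicas >= self.write_acks

    def can_read(self, reachable_replicas: int) -> bool:
        return reachable_replicas >= self.read_acks


if __name__ == "__main__":
    quorum = QuorumConfig(replicas=3, write_acks=2, read_acks=2)
    one = QuorumConfig(replicas=3, write_acks=1, read_acks=1)
    # Suppose a partition leaves a client able to reach only one of three replicas.
    for label, config in (("QUORUM", quorum), ("ONE", one)):
        print(label, "overlap:", config.quorums_overlap,
              "readable with 1 replica:", config.can_read(1))
```

With a single reachable replica, the QUORUM configuration refuses the request (consistency over availability) while the ONE configuration answers, possibly with stale data (availability over consistency).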
Follow-up: A colleague argues that with modern databases like Spanner achieving strong consistency globally with sub-10ms latency, the CAP theorem is “solved.” Are they right?
They are conflating normal operation with partition behavior. When the network is healthy, Spanner achieves strong consistency with low latency by using TrueTime (atomic clocks and GPS receivers in every datacenter) to assign globally ordered commit timestamps, which removes some of the coordination that other distributed databases need to order transactions. But during a network partition, Spanner chooses consistency: replicas cut off from a majority stop accepting writes rather than risk inconsistency. It does not “solve” CAP; it makes a clear CP choice and invests heavily in making partitions extremely rare and short-lived (via Google’s private network infrastructure). For companies not running on Google’s network, the partition probability is higher and the CAP trade-off is more pressing.