What is System Design?
System Design (HLD - High Level Design) focuses on the architecture of software systems. While LLD deals with classes and code, HLD deals with:
- Components and how they interact
- Scalability and handling growth
- Reliability and fault tolerance
- Data storage and retrieval patterns
- Trade-offs between different approaches
🎯 Learning Path
Choose your path based on your experience level:

| Level | Focus Areas | Time |
|---|---|---|
| Junior/Mid (L3-L4) | Fundamentals → Building Blocks → Easy Cases | 2-3 weeks |
| Senior (L5) | + Deep Dives → Medium Cases → Trade-offs | 4-6 weeks |
| Staff+ (L6+) | + Consensus → Global Architecture → All Cases | 6-8 weeks |
Start Here (Interview Prep)
📋 Interview Guide
Start here! RESHADED framework and strategies
❓ Question Bank
30+ questions by difficulty with hints
🎭 Mock Interviews
Realistic scenarios with evaluation criteria
Course Structure
📚 Foundations (Everyone)
Fundamentals
Scalability, Latency, Throughput, CAP Theorem
Building Blocks
Load Balancers, Caches, Message Queues, CDN
Networking
DNS, TCP/UDP, HTTP, WebSockets, gRPC
Databases
SQL vs NoSQL, Sharding, Replication, Indexing
API Design
REST, GraphQL, Rate Limiting, Versioning
Estimations Mastery
Back-of-envelope calculations with practice problems
🔧 Core Patterns (Senior+)
Distributed Systems
Consistency, Consensus, Transactions
Microservices
Patterns, Service Mesh, Communication
Scalability Patterns
Stateless Design, Multi-level Caching, Sharding
Reliability Patterns
Circuit Breakers, Retries, Bulkheads
Event-Driven Architecture
Event Sourcing, CQRS, Saga Pattern
Trade-off Framework
Systematic approach to design decisions
🚀 Staff+ Deep Dives
Consensus Algorithms
Raft, Paxos, Leader Election (with code)
Global Architecture
Multi-region, Disaster Recovery, Geo-distribution
Data Modeling
Event Sourcing, CQRS, Schema Evolution
Real-Time Systems
WebSockets, SSE, Push Architecture
Search Systems
Elasticsearch, Inverted Index, Ranking
Time-Series Data
Metrics, IoT, Financial data patterns
CDN & Edge Computing
Edge functions, Caching strategies
Data Pipelines
Batch vs Stream, Lambda/Kappa
📖 Quick References
Patterns Reference
Quick lookup for all design patterns
Observability
Metrics, Logs, Traces, SLIs/SLOs
Security Patterns
JWT, OAuth, RBAC, mTLS
Advanced Concepts
Linearizability, Vector Clocks
💼 Case Studies (Practice These!)
URL Shortener
🟢 Easy - Start here
Rate Limiter
🟢 Easy - Algorithms
Notification System
🟡 Medium - Multi-channel
Twitter/X
🟡 Medium - Feed design
🟡 Medium - Real-time
E-Commerce
🟡 Medium - Transactions
Payment System
🔴 Hard - Financial
Netflix
🔴 Hard - Video streaming
Spotify
🔴 Hard - Audio streaming
Uber
🔴 Hard - Geo-matching
Google Maps
🔴 Hard - Navigation
Interview Framework (RESHADED)
R - Requirements (5 min)
Ask about users, scale, features, and constraints. “Are we optimizing for reads or writes?”
Top Interview Problems by Level
| Problem | Level | Key Concepts | Priority |
|---|---|---|---|
| URL Shortener | L3-L4 | Hashing, DB, Cache | Must Know |
| Rate Limiter | L3-L4 | Token Bucket, Redis | Must Know |
| Twitter Feed | L4-L5 | Fan-out, Timeline, Cache | Must Know |
| | L4-L5 | WebSockets, Message Queue | Must Know |
| E-Commerce | L4-L5 | Inventory, Transactions | Must Know |
| Notification | L4-L5 | Push/Pull, Templates | Recommended |
| Spotify | L5 | Streaming, Recommendations | Recommended |
| YouTube | L5-L6 | Video Processing, CDN | Recommended |
| Uber | L5-L6 | Location, Matching | Know Basics |
| Google Docs | L6+ | CRDT, Real-time collab | Staff+ |
Quick Reference: Key Numbers
Latency Numbers
- L1 cache: 0.5 ns
- RAM: 100 ns
- SSD: 100 μs
- HDD: 10 ms
- Same DC: 0.5 ms
- Cross-continent: 150 ms
Scale Numbers
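- Seconds/day: 86,400 (round to ~100,000 for quick math)
- 1M DAU → ~12 QPS (at one request per user per day)
- 100M DAU → ~1,200 QPS (at one request per user per day)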
- 99.99% = 52 min downtime/year
| Availability | Downtime/Year | Use Case |
|---|---|---|
| 99% (2 nines) | 3.65 days | Internal tools |
| 99.9% (3 nines) | 8.76 hours | Standard SaaS |
| 99.99% (4 nines) | 52.6 minutes | Critical services |
| 99.999% (5 nines) | 5.26 minutes | Payment systems |
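The downtime figures in the table follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_minutes_per_year` is illustrative, not part of the course material):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget in minutes per year for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Print the budget for each row of the table above.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Each extra nine divides the budget by ten, which is why moving from three nines to four usually means replacing manual recovery with automated failover.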
What Makes This Course Different
💻 Working Code
Every pattern includes Python + JavaScript implementations you can run
🎯 Interview Focused
Organized by what interviewers actually ask at each level
⚖️ Trade-off Framework
Systematic approach to design decisions, not just memorized answers
🔄 Real Scenarios
Mock interview scripts with expected follow-ups
Interview Deep-Dive Questions
Q1: Walk me through how you would approach a system design interview for a problem you have never seen before. What is your framework, and why does that ordering matter?
What the interviewer is really testing: Whether you have a repeatable methodology for tackling ambiguity, and whether you understand why each phase of the design process exists: not just the sequence, but the reasoning behind it.

Strong Answer:
- I use a structured approach that I think of as “narrow before you build.” The biggest mistake candidates make is jumping to drawing boxes and arrows before understanding what they are building. My framework has four major phases, and the ordering is intentional because each phase reduces ambiguity for the next.
- Phase 1 — Requirements and constraints (5 minutes): Ask clarifying questions to turn a vague prompt into a concrete problem. “Design Twitter” is not a problem — “Design the home timeline feed for 500M users where 80% of traffic is reads” is a problem. I specifically ask: Who are the users and what are the primary use cases? What is the expected scale (DAU, QPS, storage)? What are the most important quality attributes (latency, consistency, availability)? What is explicitly out of scope? This phase matters because a system optimized for 1000 users looks nothing like one for 100M users.
- Phase 2: Estimation and data model (5 minutes): Back-of-envelope calculations to establish the scale constraints. If we have 500M DAU with an average of 10 timeline loads per day, that is 5B reads/day or roughly 60K QPS on average. Fan-out: if each user follows 200 accounts and the average account posts 2 tweets/day, a single timeline load has about 400 new tweets per day to rank. This informs whether we need pre-computation (fan-out on write) or can compute on the fly (fan-out on read). I also sketch the core data model here: which entities exist and what the key relationships are.
- Phase 3 — High-level design (10 minutes): Draw the main components and their interactions. Start with the simplest possible design that meets the requirements, then iterate. For the timeline, that might be: Client, API Gateway, Timeline Service, Cache (pre-computed timelines), Tweet Storage. Explain why each component exists and what protocol connects them.
- Phase 4 — Deep dive and trade-offs (15 minutes): This is where you show depth. Pick 2-3 components and go deep. For the timeline, the interesting deep dives are: (a) fan-out strategy — on-write for normal users, on-read for celebrities (hybrid approach), (b) ranking algorithm — chronological vs. engagement-weighted, (c) caching strategy — pre-computed timelines in Redis with invalidation on new tweets. For each decision, explicitly state the trade-off: “Fan-out on write gives low-latency reads but costs more storage and makes celebrity tweets slow to propagate.”
- The ordering matters because you cannot make good component decisions without understanding scale (Phase 2), and you cannot estimate scale without understanding requirements (Phase 1). Candidates who skip to Phase 3 end up backtracking when the interviewer asks “but what if there are 100M users?” and their single-server design falls apart.
- Example: In a real interview at a FAANG company, I was asked to design a notification system. I spent the first 5 minutes establishing that the key constraint was multi-channel delivery (push, SMS, email) with different latency requirements for each channel. That constraint completely shaped the architecture — a priority queue per channel rather than a single queue. If I had skipped requirements gathering, I would have designed a single-queue system and had to redesign mid-interview.
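The Phase 2 arithmetic above can be checked in a few lines. This is a rough sketch using only the numbers quoted in the answer; the helper names are hypothetical, and the result is an average rate, so peak traffic would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(daily_active_users: int, requests_per_user_per_day: float) -> float:
    """Average queries per second, spread evenly over the day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def timeline_candidates(followed_accounts: int, posts_per_account_per_day: float) -> float:
    """New posts per day that one user's timeline has to rank."""
    return followed_accounts * posts_per_account_per_day


# Numbers from the timeline example: 500M DAU, 10 loads/day,
# 200 followed accounts, 2 tweets/day per account.
reads_per_day = 500_000_000 * 10
qps = average_qps(500_000_000, 10)
candidates = timeline_candidates(200, 2)

print(f"reads/day: {reads_per_day:,}")
print(f"average QPS: {qps:,.0f}")
print(f"tweets to rank per timeline: {candidates:.0f}")
```

Rounding 86,400 up to 100,000 seconds per day gives 50K QPS instead, which is close enough for an interview; the order of magnitude is what drives the design.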
Q2: A junior engineer on your team says 'We should use NoSQL because it scales better than SQL.' How do you respond, and what nuance are they missing?
What the interviewer is really testing: Whether you understand that database selection is about access patterns and consistency requirements, not a binary SQL-vs-NoSQL decision, and whether you can teach and mentor effectively.

Strong Answer:
- The statement is not wrong, but it is dangerously incomplete. NoSQL databases can scale writes horizontally more easily than most SQL databases, but that is one dimension of a multi-dimensional decision. The nuance the junior engineer is missing is that “scales better” depends on what you are scaling and what you are giving up.
- What SQL gives you that most NoSQL databases do not: (1) ACID transactions across multiple rows and tables. If you are building a financial system where “debit account A and credit account B” must be atomic, you need transactions. (2) Flexible querying with joins. If you do not know all your query patterns upfront, SQL’s query planner lets you ask ad-hoc questions. NoSQL databases are optimized for specific access patterns — if your query does not match the partition key design, you get full table scans. (3) Strong consistency by default. With most SQL databases, a read after a write returns the latest value. With many NoSQL databases, you get eventual consistency unless you explicitly request strong reads, which often negates the performance benefit.
- What NoSQL gives you: (1) Horizontal write scaling. Cassandra, DynamoDB, and MongoDB distribute writes across many nodes by partitioning data on a key (hash-based in Cassandra and DynamoDB, hash- or range-based in MongoDB). Most SQL databases need manual or application-level sharding to do the same, which is operationally complex. (2) Schema flexibility. Document stores let you evolve your data model without upfront schema migrations. (3) Purpose-built data models: key-value stores for caching (Redis), wide-column for time-series (Cassandra), document for nested/variable schemas (MongoDB), graph for relationship traversal (Neo4j).
- The right framework is to start with access patterns: What queries will you run? How will you write data? What consistency do you need? Then pick the database that best serves those patterns. For an e-commerce product catalog (hierarchical, variable attributes, read-heavy), a document store makes sense. For the order processing pipeline (transactional, relational, consistency-critical), PostgreSQL is likely the right choice. Many real systems use both — polyglot persistence.
- How I would respond to the junior engineer: “You are right that horizontal scaling is easier with NoSQL. But let me ask you — what are the access patterns for our use case? What happens if we need to join data across tables? What consistency guarantees does the business require? Let us work through the decision together.” This turns it into a teaching moment rather than a correction.
- Example: Instagram famously runs on PostgreSQL at massive scale. They handle billions of rows by using application-level sharding (not a NoSQL database) and aggressive caching. Their access patterns (user profiles, posts, followers) fit relational modeling well, and they valued strong consistency for core social features. Meanwhile, they use Cassandra for their direct messaging system, where the write throughput requirements and access patterns (fetch messages by conversation ID) are a better fit for wide-column storage. Same company, different databases for different problems.
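The "debit account A and credit account B" requirement from the answer can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for a production SQL database; the `accounts` table, the `transfer` helper, and the balances are assumptions made up for this example.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst atomically; return False if it was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first so that a failed debit visibly exercises the rollback.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

ok = transfer(conn, "A", "B", 60)        # A has 100, so this commits
rejected = transfer(conn, "A", "B", 60)  # A has 40, so the credit to B is undone
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, rejected, balances)
```

The second call credits B and then fails on the debit, yet B's balance is unchanged afterwards. Getting the same guarantee across partitions in a store without multi-row transactions typically requires application-level compensation logic.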
Q3: Explain the CAP theorem to me. Then tell me why most practitioners say it is misunderstood and what you should actually think about instead.
What the interviewer is really testing: Whether you understand CAP at a deeper level than “pick two out of three” and can reason about consistency-availability trade-offs in practical terms.

Strong Answer:
- The CAP theorem states that in a distributed system experiencing a network partition, you must choose between consistency (every read returns the most recent write) and availability (every request receives a response). You cannot have both during a partition. The “pick two” framing is misleading because partitions are not optional — they will happen. So the real choice is: during a partition, do you sacrifice consistency or availability?
- Why practitioners say it is misunderstood: (1) CAP only applies during partitions. When the network is healthy (which is the vast majority of the time), you can have both consistency and availability. The interesting design question is what happens during the rare partition event — not the steady state. (2) CAP presents a binary choice, but real systems operate on a spectrum. You can tune consistency levels per operation (Cassandra’s QUORUM reads are more consistent than ONE reads but less available). (3) CAP says nothing about latency. A system that is “consistent and available” but takes 30 seconds to respond is technically CAP-compliant but useless in practice.
- What you should think about instead: the PACELC model, which extends CAP. It says: if there is a Partition, choose between Availability and Consistency; Else (when the system is running normally), choose between Latency and Consistency. This captures the real trade-off engineers face daily. Most of the time, there is no partition. The daily trade-off is: do I read from a local replica (fast, possibly stale) or the leader (slow, definitely current)?
- Practical examples: DynamoDB is PA/EL (during partition: available; else: low latency, eventual consistency by default). PostgreSQL with synchronous replication is PC/EC (during partition: consistent, refuses writes; else: consistent, higher latency). Cassandra is configurable: with QUORUM reads and writes it behaves like PC/EC (it waits for a majority of replicas, paying latency for consistency), while at ONE consistency it is PA/EL.
- In interviews, I translate CAP into concrete decisions: “For the shopping cart, I choose availability over consistency. If there is a network issue, I would rather show the user a potentially stale cart than an error page. For the checkout and payment flow, I choose consistency — I would rather reject the request than risk a double charge.”
- Example: Amazon’s Dynamo paper (which inspired DynamoDB and Cassandra) explicitly chose availability over consistency for the shopping cart because a lost cart item means a lost sale, while a slightly stale cart is merely annoying. But their ordering system chose consistency because processing an order twice is far worse than temporarily rejecting one. Same company, different CAP choices for different subsystems.
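The tunable-consistency point (QUORUM versus ONE) can be made concrete with the standard replica-overlap rule: a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N. The sketch below is a simplified model of that rule, not Cassandra's API; the function names are hypothetical, and it ignores details such as hinted handoff and read repair.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of `replicas`."""
    return replicas // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> int:
    """Replicas that can be unreachable while reads and writes both still succeed."""
    return n - max(r, w)


N = 3  # replication factor assumed for this example
for name, level in (("ONE", 1), ("QUORUM", quorum(N))):
    consistent = reads_see_latest_write(N, r=level, w=level)
    failures = tolerated_failures(N, r=level, w=level)
    print(f"{name}: overlap guaranteed={consistent}, tolerated failures={failures}")
```

With three replicas, QUORUM on both sides guarantees overlap but survives only one unreachable replica, while ONE survives two and can return stale data. That is the consistency-availability dial described in the answer.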