# System Design Fundamentals

> Core concepts every system designer must know

<Tip>
  **Interview Essential**: These fundamentals are the building blocks of every system design. Interviewers expect you to naturally incorporate these concepts without being asked.
</Tip>

## Quick Reference Card

```
┌─────────────────────────────────────────────────────────────────┐
│              FUNDAMENTALS CHEAT SHEET                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCALING                                                        │
│  • Vertical = bigger machine (easy but limited)                 │
│  • Horizontal = more machines (complex but unlimited)           │
│                                                                 │
│  AVAILABILITY (memorize these!)                                 │
│  • 99.9% = 8.76 hours downtime/year                             │
│  • 99.99% = 52 minutes downtime/year                           │
│  • 99.999% = 5 minutes downtime/year                           │
│                                                                 │
│  CAP THEOREM                                                    │
│  • CP = Bank, inventory (consistency > availability)           │
│  • AP = Social media, cache (availability > consistency)       │
│                                                                 │
│  LATENCY NUMBERS (Jeff Dean's famous list)                     │
│  • L1 cache: 0.5 ns                                            │
│  • RAM: 100 ns                                                 │
│  • SSD: 100 μs                                                 │
│  • HDD: 10 ms                                                  │
│  • Same datacenter: 0.5 ms                                     │
│  • Cross-continent: 150 ms                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Scalability

Scalability is the system's ability to handle increased load. Think of it like a restaurant: vertical scaling is buying a bigger kitchen for your one chef; horizontal scaling is opening more locations with more chefs. The bigger kitchen has limits (there is only so large a building can be), but multiple locations can grow almost indefinitely -- at the cost of coordinating menus, supply chains, and quality across them.

### Vertical vs Horizontal Scaling

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/scaling-types.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=4c1fe2723f6ed5bfdc1d5d45e73471a1" alt="Vertical vs Horizontal Scaling" width="1080" height="1080" data-path="images/system-design/scaling-types.svg" />

<CardGroup cols={2}>
  <Card title="Vertical Scaling" icon="arrow-up">
    **Pros**: Simple to operate, usually no application changes

    **Cons**: Hard ceiling on machine size, single point of failure, cost rises steeply at the high end
  </Card>

  <Card title="Horizontal Scaling" icon="arrows-left-right">
    **Pros**: Scales well beyond a single machine, tolerates node failures, runs on commodity hardware

    **Cons**: Operational complexity, services should be stateless, data consistency is harder to maintain
  </Card>
</CardGroup>

## Latency vs Throughput

These two metrics are the heartbeat and breathing rate of your system. Latency tells you how fast a single request moves through the pipe; throughput tells you how many requests the pipe can complete per unit of time. They are related but not interchangeable -- you can have low latency with low throughput (a single fast server) or high throughput with high latency (a batch processing cluster). In interviews, always clarify which one the requirements prioritize, because optimizing for one often comes at the expense of the other.

| Metric         | Definition                     | Example                |
| -------------- | ------------------------------ | ---------------------- |
| **Latency**    | Time to complete one request   | 200ms response time    |
| **Throughput** | Requests handled per unit time | 10,000 requests/second |
| **Bandwidth**  | Maximum data transfer rate     | 1 Gbps network         |
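The two contrasting cases above (a single fast server versus a batch cluster) can be put into numbers. This is a simplified sketch: the function names, the 10 ms service time, and the 10,000-item batch every 5 seconds are illustrative assumptions, and queueing effects are ignored.

```python
def single_server(service_time_s: float) -> tuple[float, float]:
    """One request at a time: latency equals the service time."""
    latency_s = service_time_s
    throughput_rps = 1 / service_time_s
    return latency_s, throughput_rps


def batch_pipeline(batch_size: int, batch_interval_s: float) -> tuple[float, float]:
    """A full batch completes every interval, so a request may wait about one interval."""
    approx_latency_s = batch_interval_s
    throughput_rps = batch_size / batch_interval_s
    return approx_latency_s, throughput_rps


fast_latency, fast_rps = single_server(service_time_s=0.010)
batch_latency, batch_rps = batch_pipeline(batch_size=10_000, batch_interval_s=5.0)

print(f"single server: {fast_latency * 1000:.0f} ms latency, {fast_rps:.0f} req/s")
print(f"batch cluster: {batch_latency * 1000:.0f} ms latency, {batch_rps:.0f} req/s")
```

Under these assumptions the batch pipeline moves about 20 times more requests per second, while each individual request waits roughly 500 times longer.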

### Latency Percentiles

```
p50 (median):  50% of requests faster than this
p95:           95% of requests faster than this
p99:           99% of requests faster than this
p99.9:         99.9% of requests faster than this

Example:
p50 = 100ms   (typical request)
p95 = 200ms   (slow request)
p99 = 500ms   (very slow request)
p99.9 = 2s    (tail latency: 1 in 1,000 requests is slower)
```
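To make the definitions concrete, here is a minimal sketch that computes percentiles with the nearest-rank method. The `percentile` helper and the 100 sample latencies are hypothetical; monitoring systems often use interpolation or histogram approximations instead.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# 100 made-up latencies in ms: mostly fast, with a slow tail
latencies_ms = [100] * 90 + [200] * 5 + [500] * 4 + [2000]

for p in (50, 95, 99, 99.9):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

With this sample the single 2-second request only shows up at p99.9, which is why averages and medians hide tail latency.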

## Availability

Availability = Uptime / (Uptime + Downtime)

### The "Nines" of Availability

| Availability      | Downtime/Year | Downtime/Month |
| ----------------- | ------------- | -------------- |
| 99% (two 9s)      | 3.65 days     | 7.3 hours      |
| 99.9% (three 9s)  | 8.76 hours    | 43.8 minutes   |
| 99.99% (four 9s)  | 52.6 minutes  | 4.38 minutes   |
| 99.999% (five 9s) | 5.26 minutes  | 26.3 seconds   |
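The downtime figures in the table follow directly from the availability formula. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for nines in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(nines)
    print(f"{nines}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```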

### Achieving High Availability

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/availability-layers.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=c84dcdec2ef73f4d3e0b98b4299fa69f" alt="Availability Layers" width="1080" height="1080" data-path="images/system-design/availability-layers.svg" />
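Layering and redundancy affect availability in opposite directions. The sketch below assumes components fail independently, which real systems only approximate; the function names and sample figures are illustrative.

```python
from math import prod


def serial_availability(parts: list[float]) -> float:
    """Every component in the chain must be up, so availabilities multiply."""
    return prod(parts)


def redundant_availability(copies: list[float]) -> float:
    """The group is down only when all redundant copies are down at once."""
    return 1 - prod(1 - a for a in copies)


# Two 99% servers behind a load balancer, in front of a 99.9% database
app_tier = redundant_availability([0.99, 0.99])
whole_system = serial_availability([app_tier, 0.999])

print(f"app tier: {app_tier:.4%}")
print(f"app tier + database: {whole_system:.4%}")
```

Duplicating the application server lifts that tier to roughly four nines, but the chain as a whole is still capped by its weakest non-redundant layer.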

## CAP Theorem

In a distributed system during a network partition, you must choose between consistency and availability. This is often stated as "pick 2 out of 3," but that framing is slightly misleading -- partition tolerance is not optional in any real distributed system (networks *will* fail). The real question is: when a partition happens, do you refuse to serve requests (CP) or serve potentially stale data (AP)?

Think of it like a chain of restaurants during a phone outage between locations. A CP restaurant stops taking orders until it can confirm inventory with the warehouse ("Sorry, we cannot guarantee we have that dish right now"). An AP restaurant keeps serving but might accidentally sell a dish it has run out of ("We will fix it later if there is a conflict"). Neither approach is universally better -- it depends on whether your users tolerate errors or staleness.

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/cap-theorem.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=0efc35bec496dea9c2c5b1008b709dbf" alt="CAP Theorem" width="1080" height="1080" data-path="images/system-design/cap-theorem.svg" />

<CardGroup cols={3}>
  <Card title="Consistency (C)" icon="equals">
    Every read sees the most recent write, so all nodes appear to hold the same data at the same time
  </Card>

  <Card title="Availability (A)" icon="circle-check">
    Every request to a working node gets a non-error response, though it may not reflect the latest write
  </Card>

  <Card title="Partition Tolerance (P)" icon="network-wired">
    System works despite network partitions
  </Card>
</CardGroup>

### Real-World Trade-offs

| System        | Choice | Reason                    |
| ------------- | ------ | ------------------------- |
| Banking       | CP     | Consistency is critical   |
| Social Media  | AP     | Availability preferred    |
| Shopping Cart | AP     | Can merge conflicts later |
| Inventory     | CP     | Need accurate counts      |
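The difference between the two choices shows up in what a node does when it is cut off from its peers. This is a toy sketch, not a real replication protocol; `ReplicaNode`, its `mode` values, and the `partitioned` flag are hypothetical names.

```python
class ReplicaNode:
    """Toy replica whose read behavior depends on its CAP choice ("CP" or "AP")."""

    def __init__(self, mode: str, value: int):
        self.mode = mode
        self.value = value        # Last value this node managed to replicate
        self.partitioned = False  # True while the node cannot reach its peers

    def read(self) -> int:
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data
            raise RuntimeError("unavailable during partition")
        # AP (or a healthy CP node): answer from local state, which may be stale
        return self.value


inventory = ReplicaNode("CP", value=10)
feed = ReplicaNode("AP", value=10)
inventory.partitioned = feed.partitioned = True

print(feed.read())  # Still answers, possibly with an outdated value
try:
    inventory.read()
except RuntimeError as err:
    print(f"CP node refused: {err}")
```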

## ACID vs BASE

### ACID (Traditional Databases)

ACID is the safety contract of relational databases. Think of it like a bank wire transfer: the money leaves your account and arrives in the recipient's account as one indivisible operation. If anything fails mid-way, the entire operation is rolled back as if it never happened.

| Property        | Description                                                          |
| --------------- | -------------------------------------------------------------------- |
| **Atomicity**   | All operations in a transaction succeed or all fail                  |
| **Consistency** | Each transaction leaves the data valid (constraints and rules hold)  |
| **Isolation**   | Concurrent transactions behave as if they ran one at a time          |
| **Durability**  | Committed data survives crashes                                      |
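Atomicity is easiest to see with a transfer that fails halfway. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` helper, and the balances are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money in one transaction; any failure undoes both updates."""
    try:
        with conn:  # Commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # Debit second so an overdraft fails after the credit was applied
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT id, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500), balances(conn))  # Overdraft attempt
print(transfer(conn, "alice", "bob", 30), balances(conn))   # Valid transfer
```

The overdraft attempt violates the `CHECK` constraint after Bob has already been credited, and the rollback removes that credit as well, so no money appears or disappears.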

### BASE (NoSQL Databases)

BASE is the pragmatic alternative for systems that prioritize availability and scale over strict transactional guarantees. Think of it like a social media "like" count -- if two servers temporarily disagree on whether a post has 4,999 or 5,001 likes, nobody notices and the counts will converge shortly. You are trading immediate precision for the ability to handle millions of concurrent operations without locking.

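| Property                  | Description                                                       |
| ------------------------- | ----------------------------------------------------------------- |
| **Basically Available**   | System keeps responding, possibly with degraded or stale results  |
| **Soft state**            | State may change over time, even without new input, as nodes sync |
| **Eventually consistent** | Replicas converge to the same value once updates stop arriving    |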
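One way a like counter can converge without locking is a grow-only counter, where each server tracks its own increments and merges by taking the per-server maximum. This is a sketch of that one technique, not the method any particular database uses; the class and server names are hypothetical.

```python
class GCounter:
    """Grow-only counter: each node counts its own increments; merge takes the max per node."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}  # node_id -> increments seen from that node

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to repeat and to apply in any order."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


server_a, server_b = GCounter("a"), GCounter("b")
server_a.increment(3)  # Three likes arrive at server A
server_b.increment(2)  # Two likes arrive at server B
print(server_a.value(), server_b.value())  # Temporarily different

server_a.merge(server_b)
server_b.merge(server_a)
print(server_a.value(), server_b.value())  # Same total on both
```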

<Tip>
  **Interview Pattern**: When an interviewer asks "SQL or NoSQL?", never answer with just the technology. Frame it as: "The consistency requirements of \[feature X] suggest ACID guarantees, so I would lean toward a relational store here. But for \[feature Y] where we are read-heavy and can tolerate brief staleness, a NoSQL store with eventual consistency gives us better horizontal scalability." This shows you understand the *why*, not just the *what*.
</Tip>

## Consistency Patterns

### Strong Consistency

Every read receives the most recent write. All nodes see the same data at the same time.

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/consistency-patterns.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=403b803ae714c4be9101c03450ee66bb" alt="Consistency Patterns" width="1080" height="1080" data-path="images/system-design/consistency-patterns.svg" />

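```python theme={null}
# Python: Strong Consistency with Synchronous Replication
from typing import Any

class StrongConsistencyDB:
    def __init__(self, replicas: list):
        # Each node is assumed to expose read(), write(), and sync_write()
        self.replicas = replicas
        self.primary = replicas[0]

    def write(self, key: str, value: Any) -> bool:
        """Write to primary and wait for ALL replicas to acknowledge"""
        previous = self.primary.read(key)  # Saved so a failed write can be undone
        self.primary.write(key, value)

        # Synchronously replicate to all secondaries
        acked = []
        for replica in self.replicas[1:]:
            if not replica.sync_write(key, value):  # Blocking call
                # Rollback on failure
                self.rollback(key, previous, acked)
                return False
            acked.append(replica)
        return True

    def rollback(self, key: str, previous: Any, acked: list) -> None:
        """Restore the old value on the primary and on replicas that already acknowledged"""
        self.primary.write(key, previous)
        for replica in acked:
            replica.sync_write(key, previous)

    def read(self, key: str) -> Any:
        """Read from primary (guaranteed latest)"""
        return self.primary.read(key)

# Usage in banking system (primary, replica1, replica2 are node objects defined elsewhere)
db = StrongConsistencyDB(replicas=[primary, replica1, replica2])
db.write("account:123:balance", 1000)  # Blocks until all replicas confirm
balance = db.read("account:123:balance")  # Returns 1000 if the write succeeded
```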

```javascript theme={null}
// JavaScript: Strong Consistency with Synchronous Replication
class StrongConsistencyDB {
  constructor(replicas) {
    this.replicas = replicas;
    this.primary = replicas[0];
  }

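  async write(key, value) {
    // Remember the old value so a failed write can be undone
    const previous = await this.primary.read(key);

    // Write to primary first
    await this.primary.write(key, value);

    // Wait for ALL replicas to acknowledge (strong consistency)
    const replicationPromises = this.replicas.slice(1).map(
      replica => replica.syncWrite(key, value)
    );

    try {
      await Promise.all(replicationPromises);  // Rejects as soon as any replica fails
      return true;
    } catch (error) {
      await this.rollback(key, previous);
      throw new Error('Replication failed: ' + error.message);
    }
  }

  async rollback(key, previous) {
    // Best-effort restore of the old value on every node
    await this.primary.write(key, previous);
    await Promise.allSettled(
      this.replicas.slice(1).map(replica => replica.syncWrite(key, previous))
    );
  }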

  async read(key) {
    // Always read from primary for guaranteed consistency
    return await this.primary.read(key);
  }
}

// Usage
const db = new StrongConsistencyDB([primary, replica1, replica2]);
await db.write('account:123:balance', 1000);
const balance = await db.read('account:123:balance');  // Always 1000
```

### Eventual Consistency

Reads might return stale data, but eventually all nodes will have the same data.

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/eventual-consistency.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=a8b2380d40b3dd605be4955b8f7c8064" alt="Eventual Consistency" width="1080" height="720" data-path="images/system-design/eventual-consistency.svg" />

```python theme={null}
# Python: Eventual Consistency with Async Replication
import asyncio
from datetime import datetime, timezone
from typing import Any

class EventualConsistencyDB:
    def __init__(self, local_node, replicas: list):
        self.local_node = local_node  # Node that serves this instance's reads and writes
        self.replicas = replicas      # Remote nodes that receive changes later
        self.replication_queue = asyncio.Queue()

    async def write(self, key: str, value: Any) -> bool:
        """Write to local node immediately, replicate asynchronously"""
        # Write to local node (fast!)
        timestamp = datetime.now(timezone.utc)
        self.local_node.write(key, value, timestamp)

        # Queue one replication task per replica (does not wait for delivery)
        for replica in self.replicas:
            await self.replication_queue.put((replica, key, value, timestamp))

        return True  # Returns before any replica has the data

    async def replicate_worker(self):
        """Background worker that replicates to other nodes"""
        while True:
            replica, key, value, timestamp = await self.replication_queue.get()
            try:
                await replica.async_write(key, value, timestamp)
            except Exception:
                # Retry only the failed replica after a short pause (eventual consistency)
                await asyncio.sleep(0.1)
                await self.replication_queue.put((replica, key, value, timestamp))

    async def read(self, key: str) -> Any:
        """Read from local node (might be stale!)"""
        return self.local_node.read(key)

# Usage in social media (run inside a coroutine; node1, node2, node3 are defined elsewhere)
db = EventualConsistencyDB(local_node=node1, replicas=[node2, node3])
worker = asyncio.create_task(db.replicate_worker())  # Start background replication
await db.write("post:456", {"content": "Hello World!"})
# User might not see this post immediately on other nodes
# But eventually (usually within milliseconds), all nodes will have it
```

```javascript theme={null}
// JavaScript: Eventual Consistency with Async Replication
class EventualConsistencyDB {
  constructor(localNode, replicas) {
    this.localNode = localNode;  // Node that serves this instance's reads and writes
    this.replicas = replicas;
    this.replicationQueue = [];
    this.draining = false;
    this.startReplicationWorker();
  }

  async write(key, value) {
    const timestamp = Date.now();

    // Write to local node immediately
    await this.localNode.write(key, value, timestamp);

    // Queue one replication task per replica (fire and forget)
    for (const replica of this.replicas) {
      this.replicationQueue.push({ replica, key, value, timestamp });
    }

    return true;  // Returns before any replica has the data
  }

  startReplicationWorker() {
    setInterval(async () => {
      if (this.draining) return;  // Skip if the previous run is still in progress
      this.draining = true;

      // Take the current batch; failures are re-queued for the NEXT tick,
      // so an unreachable replica cannot keep this loop spinning
      const batch = this.replicationQueue.splice(0);
      for (const item of batch) {
        try {
          await item.replica.asyncWrite(item.key, item.value, item.timestamp);
        } catch (error) {
          this.replicationQueue.push(item);
        }
      }
      this.draining = false;
    }, 100);  // Process every 100ms
  }

  async read(key) {
    // Read from local (might be stale)
    return await this.localNode.read(key);
  }
}
```

### Read-Your-Writes Consistency

Users always see their own writes immediately, even if other users see stale data.

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/read-your-writes.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=81ff3335a383c35b9a4c5042f2cce4c6" alt="Read Your Writes" width="1080" height="720" data-path="images/system-design/read-your-writes.svg" />

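```python theme={null}
# Python: Read-Your-Writes with Session Tracking
import random
import threading
from datetime import datetime, timedelta, timezone
from typing import Any

class ReadYourWritesDB:
    # Must exceed the worst expected replication lag for the guarantee to hold
    PRIMARY_READ_WINDOW = timedelta(seconds=5)

    def __init__(self, primary, replicas):
        # Nodes are assumed to expose read(key) and write(key, value, timestamp)
        self.primary = primary
        self.replicas = replicas
        self.user_last_write = {}  # Track when each user last wrote

    def write(self, user_id: str, key: str, value: Any) -> bool:
        """Write to primary and track the write timestamp"""
        timestamp = datetime.now(timezone.utc)
        self.primary.write(key, value, timestamp)

        # Remember when this user last wrote
        self.user_last_write[user_id] = timestamp

        # Async replication to replicas
        self.async_replicate(key, value, timestamp)
        return True

    def async_replicate(self, key: str, value: Any, timestamp: datetime) -> None:
        """Copy the write to every replica on a background thread"""
        def copy():
            for replica in self.replicas:
                replica.write(key, value, timestamp)
        threading.Thread(target=copy, daemon=True).start()

    def get_random_replica(self):
        return random.choice(self.replicas)

    def read(self, user_id: str, key: str) -> Any:
        """
        If user recently wrote, read from primary.
        Otherwise, read from replica (faster).
        """
        last_write = self.user_last_write.get(user_id)

        # If user wrote within the window, use primary
        if last_write and (datetime.now(timezone.utc) - last_write) < self.PRIMARY_READ_WINDOW:
            return self.primary.read(key)

        # Safe to read from replica (user hasn't written recently)
        return self.get_random_replica().read(key)

# Usage (primary, replica1, replica2 are node objects defined elsewhere)
db = ReadYourWritesDB(primary, [replica1, replica2])
db.write("user_123", "profile:user_123", {"name": "Alice"})
profile = db.read("user_123", "profile:user_123")  # Reads from PRIMARY
profile = db.read("user_456", "profile:user_123")  # Reads from REPLICA
```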

```javascript theme={null}
// JavaScript: Read-Your-Writes Consistency
class ReadYourWritesDB {
  constructor(primary, replicas) {
    this.primary = primary;
    this.replicas = replicas;
    this.userLastWrite = new Map();  // userId -> timestamp
  }

  async write(userId, key, value) {
    const timestamp = Date.now();
    
    // Write to primary
    await this.primary.write(key, value, timestamp);
    
    // Track when user last wrote
    this.userLastWrite.set(userId, timestamp);
    
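    // Async replication (fire and forget)
    this.asyncReplicate(key, value, timestamp);
    return true;
  }

  asyncReplicate(key, value, timestamp) {
    // Not awaited on purpose; failures are logged rather than surfaced to the writer
    for (const replica of this.replicas) {
      Promise.resolve(replica.write(key, value, timestamp))
        .catch(error => console.error('Replication failed:', error));
    }
  }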

  async read(userId, key) {
    const lastWrite = this.userLastWrite.get(userId);
    const fiveSecondsAgo = Date.now() - 5000;
    
    // If user wrote recently, read from primary
    if (lastWrite && lastWrite > fiveSecondsAgo) {
      return await this.primary.read(key);
    }
    
    // Otherwise, read from any replica (faster)
    const replica = this.replicas[Math.floor(Math.random() * this.replicas.length)];
    return await replica.read(key);
  }
}

// User always sees their own updates immediately
const db = new ReadYourWritesDB(primary, [replica1, replica2]);
await db.write('user_123', 'profile:user_123', { name: 'Alice' });
const myProfile = await db.read('user_123', 'profile:user_123');  // From PRIMARY
const theirProfile = await db.read('user_456', 'profile:user_123');  // From REPLICA
```

## Back-of-the-Envelope Estimation

### Common Calculations

```python theme={null}
# Daily Active Users (DAU) to QPS
DAU = 100_000_000  # 100 million
requests_per_user_per_day = 10
seconds_per_day = 86400

QPS = (DAU * requests_per_user_per_day) / seconds_per_day
# = 1,000,000,000 / 86,400 ≈ 11,574 QPS

# Peak QPS (2-3x average)
peak_QPS = QPS * 2.5  # ≈ 29,000 QPS
```

### Storage Estimation

```python theme={null}
# Example: Twitter-like service
users = 500_000_000
tweets_per_user_per_day = 2
tweet_size = 280  # bytes (280 characters at ~1 byte each; multi-byte text needs more)
metadata_size = 200  # bytes

daily_tweets = users * tweets_per_user_per_day
# = 1,000,000,000 tweets/day

daily_storage = daily_tweets * (tweet_size + metadata_size)
# = 1B * 480 bytes = 480 GB/day

yearly_storage = daily_storage * 365
# = 175 TB/year (just text, not including media)
```

### Memory Estimation

```python theme={null}
# Cache sizing (80/20 rule)
# 20% of data serves 80% of requests

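daily_requests = 1_000_000_000
request_size = 500  # bytes (average response)
hot_fraction = 0.2  # Share of a day's data kept in cache (the "20" in 80/20)

# Cache the hottest 20% of a day's responses
# (upper bound: assumes every request is for a distinct item)
cache_size = hot_fraction * daily_requests * request_size
# = 100 GB of cache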
```

<Note>
  **Interview Tip**: Don't worry about exact numbers. Round liberally and show your reasoning. 86,400 ≈ 100,000 is fine for estimation.
</Note>

## Interview Questions on Fundamentals

<Accordion title="When would you choose CP over AP?">
  **Answer**: Choose CP (Consistency over Availability) when:

  * **Financial systems**: Bank transfers, payments - incorrect balance is worse than unavailability
  * **Inventory management**: Overselling is costly (e.g., airline seats)
  * **Booking systems**: Double-booking causes real-world problems
  * **Leader election**: Only one leader should exist at a time

  **Key phrase**: "In this case, returning wrong data is worse than returning no data."
</Accordion>

<Accordion title="How do you achieve 99.99% availability?">
  **Answer**: Redundancy at every layer:

  1. Multiple DNS providers
  2. CDN with many edge locations
  3. Load balancers in active-passive or active-active mode
  4. Multiple application servers (stateless)
  5. Database replication (primary + replicas)
  6. Multi-region deployment
  7. Health checks and automatic failover
  8. Circuit breakers to prevent cascade failures
</Accordion>

<Accordion title="Explain eventual consistency with an example">
  **Answer**: "When you post on social media, your friend might not see it right away because the data needs to propagate across replicas. This is acceptable because:

  1. Availability is more important than instant consistency
  2. The delay is usually sub-second and imperceptible
  3. The data will eventually be consistent everywhere

  Compare to a bank transfer where you MUST see accurate balance immediately - that needs strong consistency."
</Accordion>

<Accordion title="How do you estimate QPS quickly?">
  **Answer**: Use the "divide by 100,000" rule:

  * DAU × requests per user per day ÷ 100,000 ≈ QPS
  * Example: 100M DAU × 10 requests = 1B requests/day; 1B ÷ 100,000 = 10,000 QPS (the exact figure is about 11,574)
  * Peak = 2-3x average

  For storage:

  * 1 request = \~500 bytes → 10,000 QPS = 5 MB/second = 432 GB/day
</Accordion>
