
Chapter 2: Architecture and Partitioning

DynamoDB’s architecture blends several distributed-systems principles: consistent hashing for data distribution, a stateless request-routing layer, replicated storage nodes for durability, and automatic partitioning for scale. Understanding this architecture is crucial for effective data modeling and performance optimization.
Chapter Goals:
  • Understand DynamoDB’s high-level architecture
  • Master consistent hashing and the hash ring concept
  • Learn how data is partitioned and distributed
  • Grasp request routing and storage node design
  • Understand auto-scaling and adaptive capacity

High-Level Architecture

System Components

DynamoDB’s architecture consists of several key layers:
┌─────────────────────────────────────────────────────────────┐
│                  DYNAMODB ARCHITECTURE                      │
│                                                             │
│                        ┌──────────┐                         │
│                        │  Client  │                         │
│                        │   SDK    │                         │
│                        └─────┬────┘                         │
│                              │                              │
│                              │ HTTPS/API Call               │
│                              ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │           REQUEST ROUTING LAYER                      │  │
│  │  ┌──────────────────────────────────────────────┐   │  │
│  │  │ • Authentication & Authorization (IAM)       │   │  │
│  │  │ • Request validation                         │   │  │
│  │  │ • Partition key hashing                      │   │  │
│  │  │ • Route to correct storage node              │   │  │
│  │  │ • Throttling & request metering              │   │  │
│  │  └──────────────────────────────────────────────┘   │  │
│  └──────────────────┬───────────────────────────────────┘  │
│                     │                                       │
│                     │ Routes to storage nodes               │
│                     ↓                                       │
│  ┌──────────────────────────────────────────────────────┐  │
│  │            STORAGE NODE LAYER                        │  │
│  │                                                      │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │  │
│  │  │ Storage  │  │ Storage  │  │ Storage  │  ...     │  │
│  │  │ Node 1   │  │ Node 2   │  │ Node 3   │          │  │
│  │  │          │  │          │  │          │          │  │
│  │  │ Partition│  │ Partition│  │ Partition│          │  │
│  │  │ P1, P2   │  │ P3, P4   │  │ P5, P6   │          │  │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘          │  │
│  │       │             │             │                 │  │
│  └───────┼─────────────┼─────────────┼─────────────────┘  │
│          │             │             │                    │
│          │   Replication (3 AZs)     │                    │
│          ↓             ↓             ↓                    │
│  ┌────────────────────────────────────────────────────┐   │
│  │        REPLICATION & PERSISTENCE LAYER             │   │
│  │                                                    │   │
│  │  AZ-A           AZ-B           AZ-C                │   │
│  │  ┌─────┐        ┌─────┐        ┌─────┐            │   │
│  │  │ SSD │        │ SSD │        │ SSD │            │   │
│  │  │Store│        │Store│        │Store│            │   │
│  │  └─────┘        └─────┘        └─────┘            │   │
│  │                                                    │   │
│  │  • B-tree storage on SSD                          │   │
│  │  • Write-ahead logging                            │   │
│  │  • Synchronous replication across 3 AZs           │   │
│  └────────────────────────────────────────────────────┘   │
│                              ↓                            │
│  ┌────────────────────────────────────────────────────┐   │
│  │            CONTROL PLANE                           │   │
│  │  • Partition management & splitting                │   │
│  │  • Auto-scaling decisions                          │   │
│  │  • Health monitoring                               │   │
│  │  • Backup and restore                              │   │
│  │  • Global table coordination                       │   │
│  └────────────────────────────────────────────────────┘   │
│                                                            │
└────────────────────────────────────────────────────────────┘

Layer Responsibilities

Request Router Responsibilities
┌─────────────────────────────────────────────┐
│      REQUEST ROUTER FUNCTIONS               │
├─────────────────────────────────────────────┤
│                                             │
│  1. AUTHENTICATION & AUTHORIZATION          │
│     • Validate IAM credentials              │
│     • Check resource permissions            │
│     • Enforce security policies             │
│                                             │
│  2. REQUEST VALIDATION                      │
│     • Validate API parameters               │
│     • Check payload size limits             │
│     • Validate data types                   │
│                                             │
│  3. PARTITION KEY HASHING                   │
│     • Hash partition key with MD5           │
│     • Determine partition from hash         │
│     • Locate storage nodes for partition    │
│                                             │
│  4. REQUEST ROUTING                         │
│     • Route to primary storage node         │
│     • Handle replica routing for reads      │
│     • Retry on failures                     │
│                                             │
│  5. CAPACITY MANAGEMENT                     │
│     • Track consumed capacity               │
│     • Apply throttling (provisioned mode)   │
│     • Return throttling errors (400)        │
│                                             │
│  6. METRICS & MONITORING                    │
│     • Emit CloudWatch metrics               │
│     • Track latency and errors              │
│     • Log for audit trails                  │
│                                             │
└─────────────────────────────────────────────┘
Key Point: The request router is stateless and can scale independently of storage nodes. This enables DynamoDB to handle traffic spikes without affecting storage.

Consistent Hashing

The Partitioning Problem

Before diving into consistent hashing, understand the problem it solves:
┌────────────────────────────────────────────────────────────┐
│          WHY PARTITIONING IS NECESSARY                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Single-Node Limitations:                                 │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                                                     │  │
│  │  Single DB Server:                                  │  │
│  │  • Storage: ~10 TB max (practical limit)           │  │
│  │  • Memory: ~1 TB max                                │  │
│  │  • Throughput: ~50k requests/sec                    │  │
│  │  • Network: ~10 Gbps                                │  │
│  │                                                     │  │
│  │  DynamoDB Needs:                                    │  │
│  │  • Storage: Unlimited (petabytes+)                  │  │
│  │  • Memory: Terabytes across cluster                 │  │
│  │  • Throughput: Millions of requests/sec             │  │
│  │  • Network: Aggregate bandwidth across nodes        │  │
│  │                                                     │  │
│  │  Solution: PARTITION DATA ACROSS MANY NODES         │  │
│  │                                                     │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                            │
└────────────────────────────────────────────────────────────┘

Naive Hashing Problems

Simple hash-based partitioning has a major flaw:
NAIVE APPROACH: hash(key) % N

Example with 3 nodes:
┌──────────────────────────────────────┐
│ hash("user-123") = 456               │
│ 456 % 3 = 0 → Node 0                 │
│                                      │
│ hash("user-456") = 789               │
│ 789 % 3 = 0 → Node 0                 │
│                                      │
│ hash("user-789") = 234               │
│ 234 % 3 = 0 → Node 0                 │
└──────────────────────────────────────┘

PROBLEM: Adding or removing a node

Add Node 3 (now N=4):
┌──────────────────────────────────────┐
│ hash("user-123") = 456               │
│ 456 % 4 = 0 → Node 0 ✓ (same)        │
│                                      │
│ hash("user-456") = 789               │
│ 789 % 4 = 1 → Node 1 ✗ (moved!)     │
│                                      │
│ hash("user-789") = 234               │
│ 234 % 4 = 2 → Node 2 ✗ (moved!)     │
└──────────────────────────────────────┘

Result: Adding 1 node reassigns keys on almost EVERY node!
        (Roughly N/(N+1) of all keys move when going from N to N+1 nodes)

With 1000 nodes, adding 1 node relocates ~99.9% of the data.
This causes massive data migration and downtime.
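
You can reproduce this effect with a few lines of Python: hash a batch of keys, place them with modulo 3, then with modulo 4, and count how many land on a different node. This is a small illustrative sketch (MD5 is used only to mirror the chapter's examples):

import hashlib

def bucket(key: str, n: int) -> int:
    # MD5 the key (as in the chapter's examples) and place it with modulo N.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user-{i}" for i in range(100_000)]

# Compare placements with 3 nodes vs. 4 nodes.
moved = sum(bucket(k, 3) != bucket(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # prints roughly 75%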

Consistent Hashing Solution

Consistent hashing minimizes data movement when nodes are added or removed:
┌────────────────────────────────────────────────────────────┐
│              CONSISTENT HASHING RING                       │
│                                                            │
│                          0°                                │
│                          •                                 │
│                      ╱       ╲                             │
│                   ╱             ╲                          │
│              270° •               • 90°                    │
│                 ╱                   ╲                      │
│                │                     │                     │
│                │    ┌─────────┐      │                     │
│                │    │ Hash    │      │                     │
│                │    │ Ring    │      │                     │
│                │    │ (0-2³²) │      │                     │
│                │    └─────────┘      │                     │
│                │                     │                     │
│                 ╲                   ╱                      │
│                  •                 •                       │
│                180°               360°/0°                  │
│                                                            │
│  Hash Space: 0 to 2³² - 1 (4,294,967,295)                 │
│  Arranged in a ring (2³² - 1 wraps to 0)                  │
│                                                            │
└────────────────────────────────────────────────────────────┘

How Consistent Hashing Works

Step 1: Place nodes on the hash ring

3 Storage Nodes: A, B, C

Node placement:
┌────────────────────────────────────────┐
│ hash("Node-A") = 1000 → Position 1000  │
│ hash("Node-B") = 2000 → Position 2000  │
│ hash("Node-C") = 3000 → Position 3000  │
└────────────────────────────────────────┘

Ring visualization:

                   0


        3000 •─────┴─────• 1000
       Node-C         Node-A
             ╲       ╱
              ╲     ╱
               ╲   ╱
                ╲ ╱

               2000
              Node-B

Each node is responsible for the range from
the previous node (clockwise) to itself.

Node-A: (3000, 1000]
Node-B: (1000, 2000]
Node-C: (2000, 3000]

Step 2: Determine which node stores each key

Keys to store:

hash("user-123") = 500  → Between 0 and 1000
                        → Assign to Node-A

hash("user-456") = 1500 → Between 1000 and 2000
                        → Assign to Node-B

hash("user-789") = 2500 → Between 2000 and 3000
                        → Assign to Node-C

Ring with keys:

                   0

               500 • (user-123)

        3000 •─────┴─────• 1000
       Node-C         Node-A
             ╲       ╱
              ╲     ╱
       2500 • ╲     ╱ • 1500
    (user-789) ╲   ╱  (user-456)
                ╲ ╱

               2000
              Node-B

Rule: Key goes to the next node clockwise

Step 3: Minimal data movement when adding nodes

Add Node-D at position 1500:

Before:
Node-A: (3000, 1000]  ← owns 500 (user-123)
Node-B: (1000, 2000]  ← owns 1500 (user-456)
Node-C: (2000, 3000]  ← owns 2500 (user-789)

After adding Node-D at 1500:
Node-A: (3000, 1000]  ← still owns 500 (user-123) ✓
Node-D: (1000, 1500]  ← takes 1500 from Node-B
Node-B: (1500, 2000]  ← keeps other keys
Node-C: (2000, 3000]  ← still owns 2500 (user-789) ✓

Result: ONLY keys in range (1000, 1500] moved!
        About 1/4 of the keys move (vs ~75% with naive hashing going from 3 to 4 nodes)

Ring after addition:

                   0

               500 • (user-123)

        3000 •─────┴─────• 1000
       Node-C         Node-A
             ╲           │
              ╲          │
       2500 • ╲     1500 • Node-D
    (user-789) ╲         │ (user-456)
                ╲        │
                 •───────•
               2000
              Node-B

Advantage: Only ~K/(N+1) keys move when adding a node
           (vs nearly all K keys with naive modulo hashing)

Step 4: Minimal data movement when removing nodes

Remove Node-B at position 2000:

Before:
Node-A: (3000, 1000]
Node-B: (1000, 2000]  ← owns 1500 (user-456)
Node-C: (2000, 3000]

After removing Node-B:
Node-A: (3000, 1000]  ← unchanged
Node-C: (1000, 3000]  ← takes Node-B's keys

Only keys in Node-B's range move to Node-C
Node-A and other nodes unaffected

Ring after removal:

                   0

               500 • (user-123)

        3000 •─────┴─────• 1000
       Node-C         Node-A
             ╲       ╱
              ╲     ╱
       2500 • ╲     ╱ • 1500
    (user-789) ╲   ╱  (user-456)
                ╲ ╱

              (Node-B removed)
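
The clockwise-lookup rule maps directly onto a sorted list of ring positions plus a binary search. The sketch below is illustrative, not DynamoDB's internal code (HashRing and ring_hash are made-up names); it implements add, remove, and lookup, and measures how few keys move when a node joins.

import bisect
import hashlib

def ring_hash(value: str) -> int:
    # 32-bit position on the ring, derived from MD5 as in the chapter's examples.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes=()):
        self._positions = []   # sorted ring positions
        self._owners = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_hash(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_hash(node)
        self._positions.remove(pos)
        del self._owners[pos]

    def node_for(self, key: str) -> str:
        # A key belongs to the first node at or after its position (clockwise),
        # wrapping around to the start of the ring if necessary.
        pos = ring_hash(key)
        idx = bisect.bisect_left(self._positions, pos) % len(self._positions)
        return self._owners[self._positions[idx]]

ring = HashRing(["Node-A", "Node-B", "Node-C"])
before = {key: ring.node_for(key) for key in (f"user-{i}" for i in range(10_000))}

ring.add_node("Node-D")
moved = sum(before[key] != ring.node_for(key) for key in before)
print(f"{moved / len(before):.1%} of keys moved")   # roughly 1/(N+1) on average

With single-token nodes the exact fraction that moves depends on where the new node happens to land on the ring, which is precisely the problem virtual nodes (next section) smooth out.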

Virtual Nodes (Tokens)

Real implementations use virtual nodes for better load distribution:
PROBLEM with Single Token Per Node:
┌────────────────────────────────────────┐
│ If nodes are unevenly distributed:    │
│                                        │
│     0°     •────────•     90°          │
│         Node-A   Node-B                │
│                                        │
│ Node-A: 0° to 5° (tiny range)          │
│ Node-B: 5° to 90° (huge range)         │
│                                        │
│ Result: Uneven data distribution!      │
└────────────────────────────────────────┘

SOLUTION: Virtual Nodes (Tokens)
┌────────────────────────────────────────┐
│ Each physical node gets multiple       │
│ positions (tokens) on the ring         │
│                                        │
│ Node-A has 128 tokens:                 │
│   hash("Node-A-token-0") = 100         │
│   hash("Node-A-token-1") = 500         │
│   hash("Node-A-token-2") = 1200        │
│   ...                                  │
│   hash("Node-A-token-127") = 3900      │
│                                        │
│ Node-B has 128 tokens:                 │
│   hash("Node-B-token-0") = 50          │
│   hash("Node-B-token-1") = 300         │
│   ...                                  │
│                                        │
│ Result: Much more even distribution!   │
│ Each node gets ~1/N of the data        │
└────────────────────────────────────────┘

DynamoDB uses this approach internally to ensure
even data distribution across storage nodes.
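
The effect of tokens is easy to measure: build the same ring once with 1 token per node and once with 128, then count how many keys each node owns. A small self-contained sketch (node and token naming is illustrative):

import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

def build_ring(nodes, tokens_per_node):
    # Each physical node appears at several positions ("tokens") on the ring.
    return sorted(
        (ring_hash(f"{node}-token-{t}"), node)
        for node in nodes
        for t in range(tokens_per_node)
    )

def owner(ring, key):
    # First token clockwise from the key's position owns the key.
    pos = ring_hash(key)
    for token_pos, node in ring:
        if token_pos >= pos:
            return node
    return ring[0][1]   # wrapped past the last token back to the start

for tokens in (1, 128):
    ring = build_ring([f"Node-{c}" for c in "ABCD"], tokens)
    load = Counter(owner(ring, f"user-{i}") for i in range(20_000))
    print(f"{tokens:>3} token(s) per node -> {dict(load)}")

With 1 token per node the per-node counts are visibly lopsided; with 128 tokens each node ends up with close to 1/4 of the keys.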

Partition Key and Data Distribution

Partition Key Hashing

┌────────────────────────────────────────────────────────────┐
│           PARTITION KEY DETERMINATION                      │
└────────────────────────────────────────────────────────────┘

Table Schema:
┌──────────────────────────────────┐
│ Table: Users                     │
│                                  │
│ Partition Key: userId            │
│ Sort Key: (none)                 │
└──────────────────────────────────┘

Write Operation:
┌──────────────────────────────────┐
│ PutItem({                        │
│   userId: "user-123",            │
│   name: "Alice",                 │
│   email: "alice@example.com"    │
│ })                               │
└──────────────────────────────────┘

Step 1: Extract partition key

  userId = "user-123"

Step 2: Hash the partition key value

  hash = MD5("user-123")
       = 3e3e9f56c8f7e04b...
       = 1042345678 (as 32-bit integer)

Step 3: Map to partition

  Partition = hash % num_partitions
            = 1042345678 % 1000
            = Partition 678

Step 4: Route to storage nodes hosting Partition 678

  Primary: Node-42 (AZ-A)
  Replica 1: Node-108 (AZ-B)
  Replica 2: Node-251 (AZ-C)

Step 5: Write to all replicas (quorum W=2)

  Write to Node-42 (AZ-A) ✓
  Write to Node-108 (AZ-B) ✓
  Return success (2/3 acks received)
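
All of this hashing and routing is invisible to the client; the application only supplies the partition key and the item's attributes. A minimal boto3 sketch of the PutItem above (the region and existing "Users" table are assumptions for the example):

import boto3

# Assumes a "Users" table with partition key "userId" already exists.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
users = dynamodb.Table("Users")

users.put_item(
    Item={
        "userId": "user-123",        # partition key: DynamoDB hashes this to pick the partition
        "name": "Alice",
        "email": "alice@example.com",
    }
)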

Partition Size Limits

Partition Constraints: Each partition has hard limits:
┌────────────────────────────────────┐
│   PARTITION LIMITS (Per Partition) │
├────────────────────────────────────┤
│                                    │
│  Storage: 10 GB maximum            │
│  RCU: 3,000 read capacity units    │
│  WCU: 1,000 write capacity units   │
│                                    │
│  Exceeding these triggers:         │
│  → Automatic partition split       │
│                                    │
└────────────────────────────────────┘
When a partition exceeds limits, DynamoDB automatically splits it:
Before Split:
┌─────────────────────────────────┐
│  Partition P1                   │
│  Range: (0, 1000]               │
│  Size: 10.2 GB (exceeds limit!) │
│  Storage Node: Node-A           │
└─────────────────────────────────┘

After Split:
┌─────────────────────────────────┐
│  Partition P1-A                 │
│  Range: (0, 500]                │
│  Size: 5.1 GB                   │
│  Storage Node: Node-A           │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│  Partition P1-B                 │
│  Range: (500, 1000]             │
│  Size: 5.1 GB                   │
│  Storage Node: Node-B           │
└─────────────────────────────────┘

Note: Once split, partitions NEVER merge back!
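
A rough way to estimate how many partitions a table has is to combine the throughput and storage limits above. The helper below is a widely used rule of thumb for sizing exercises, not an official formula:

import math

def estimate_partitions(rcu: int, wcu: int, storage_gb: float) -> int:
    # Partitions must cover both the throughput limits and the storage limit.
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_storage = math.ceil(storage_gb / 10)
    return max(by_throughput, by_storage)

print(estimate_partitions(rcu=6000, wcu=2000, storage_gb=55))   # -> 6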

Choosing Good Partition Keys

┌────────────────────────────────────────────────────────────┐
│              PARTITION KEY BEST PRACTICES                  │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ✓ HIGH CARDINALITY                                       │
│    ┌────────────────────────────────────────────┐         │
│    │ Good: userId (millions of unique values)   │         │
│    │ Bad: status (only "active" or "inactive")  │         │
│    │                                            │         │
│    │ Reason: Many unique values = even          │         │
│    │         distribution across partitions     │         │
│    └────────────────────────────────────────────┘         │
│                                                            │
│  ✓ UNIFORM ACCESS PATTERN                                 │
│    ┌────────────────────────────────────────────┐         │
│    │ Good: userId (random access)               │         │
│    │ Bad: date (everyone queries today's date)  │         │
│    │                                            │         │
│    │ Reason: Avoid hot partitions where all    │         │
│    │         traffic goes to one partition      │         │
│    └────────────────────────────────────────────┘         │
│                                                            │
│  ✓ KNOWN ACCESS PATTERNS                                  │
│    ┌────────────────────────────────────────────┐         │
│    │ Design partition key based on how you      │         │
│    │ will query the data                        │         │
│    │                                            │         │
│    │ Example:                                   │         │
│    │ • Access by userId → PK: userId            │         │
│    │ • Access by orderId → PK: orderId          │         │
│    │ • Access by country+date →                 │         │
│    │   PK: country, SK: date                    │         │
│    └────────────────────────────────────────────┘         │
│                                                            │
│  ✗ AVOID TIME-BASED PARTITION KEYS                        │
│    ┌────────────────────────────────────────────┐         │
│    │ Bad: timestamp or date as PK               │         │
│    │                                            │         │
│    │ Problem:                                   │         │
│    │ • All writes go to "current time" partition│         │
│    │ • Creates hot partition                    │         │
│    │ • Causes throttling                        │         │
│    │                                            │         │
│    │ Solution:                                  │         │
│    │ • Use userId as PK, timestamp as SK        │         │
│    │ • Or use composite key with random suffix  │         │
│    └────────────────────────────────────────────┘         │
│                                                            │
└────────────────────────────────────────────────────────────┘
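
The "random suffix" technique from the last box can be as simple as appending a shard number to the hot key. The helper below is an illustrative sketch (SHARD_COUNT and the key format are assumptions); remember that reads must then fan out across all shards and merge the results.

import random

SHARD_COUNT = 10   # illustrative; size it to the hot key's write rate

def sharded_pk(base_key: str) -> str:
    # Writes: spread a single hot logical key across several partitions.
    return f"{base_key}#{random.randint(1, SHARD_COUNT)}"

def all_shard_pks(base_key: str) -> list[str]:
    # Reads: fan out across every shard and merge the results.
    return [f"{base_key}#{n}" for n in range(1, SHARD_COUNT + 1)]

print(sharded_pk("2024-01-15"))      # e.g. "2024-01-15#7"
print(all_shard_pks("2024-01-15"))   # query each of these keys, then merge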

Request Routing

Write Path

┌────────────────────────────────────────────────────────────┐
│                    WRITE REQUEST FLOW                      │
└────────────────────────────────────────────────────────────┘

1. Client sends PutItem request
   ┌──────────────┐
   │    Client    │ PutItem(userId="user-123", name="Alice")
   └──────┬───────┘

          ↓ HTTPS/TLS
   ┌─────────────────────────────────────────┐
   │      Request Router (Endpoint)          │
   │  ┌────────────────────────────────────┐ │
   │  │ 1. Authenticate IAM credentials    │ │
   │  │ 2. Authorize access to table       │ │
   │  │ 3. Validate request parameters     │ │
   │  │ 4. Check capacity (if provisioned) │ │
   │  └────────────────────────────────────┘ │
   └──────────────┬──────────────────────────┘


   ┌─────────────────────────────────────────┐
   │  2. Hash partition key to find partition│
   │     hash("user-123") = 1042345678       │
   │     → Partition 678                     │
   └──────────────┬──────────────────────────┘


   ┌─────────────────────────────────────────┐
   │  3. Look up storage nodes for P678      │
   │     Primary: Node-42 (AZ-A)             │
   │     Replica 1: Node-108 (AZ-B)          │
   │     Replica 2: Node-251 (AZ-C)          │
   └──────────────┬──────────────────────────┘


   ┌─────────────────────────────────────────┐
   │  4. Forward write to Primary (Node-42)  │
   └─────────────────┬───────────────────────┘


              ┌─────────────┐
              │   Node-42   │ (Primary - AZ-A)
              │  Partition  │
              │     678     │
              └──────┬──────┘

      ┌──────────────┼──────────────┐
      │              │              │
      ↓              ↓              ↓
  ┌────────┐    ┌────────┐    ┌────────┐
  │Node-42 │    │Node-108│    │Node-251│
  │ (AZ-A) │    │ (AZ-B) │    │ (AZ-C) │
  │Primary │    │Replica │    │Replica │
  └───┬────┘    └───┬────┘    └───┬────┘
      │             │             │
      │  5. Write to WAL (Write-Ahead Log)
      │             │             │
      ↓             ↓             ↓
  ┌────────┐    ┌────────┐    ┌────────┐
  │  WAL   │    │  WAL   │    │  WAL   │
  └───┬────┘    └───┬────┘    └───┬────┘
      │             │             │
      │  6. Write to B-tree storage
      │             │             │
      ↓             ↓             ↓
  ┌────────┐    ┌────────┐    ┌────────┐
  │ B-tree │    │ B-tree │    │ B-tree │
  │  SSD   │    │  SSD   │    │  SSD   │
  └───┬────┘    └───┬────┘    └───┬────┘
      │             │             │
      │  7. Send ACK back
      │             │             │
      ↓             ↓             ↓
      ACK 1         ACK 2         ACK 3
      │             │
      └─────────────┴─────────────┐

      8. Wait for quorum (W=2)    │
         Got 2 ACKs → Success!    │

                        ┌──────────────────┐
                        │ Return success   │
                        │ to client        │
                        └──────────────────┘

Latency Breakdown (typical):
• Request routing: ~1ms
• Network to storage node: ~1ms
• Write to WAL + replicate: ~2-3ms
• B-tree update: ~1ms
• Total: ~5-7ms (p50)

Read Path (Eventually Consistent)

┌────────────────────────────────────────────────────────────┐
│         EVENTUALLY CONSISTENT READ FLOW                    │
└────────────────────────────────────────────────────────────┘

1. Client sends GetItem request
   ┌──────────────┐
   │    Client    │ GetItem(userId="user-123")
   └──────┬───────┘  ConsistentRead=false (default)


   ┌─────────────────────────────────────────┐
   │      Request Router                     │
   │  • Authenticate                         │
   │  • Hash partition key → Partition 678   │
   │  • Lookup storage nodes                 │
   └──────────────┬──────────────────────────┘


   ┌─────────────────────────────────────────┐
   │  Partition 678 replicas:                │
   │  • Node-42 (AZ-A) - Primary             │
   │  • Node-108 (AZ-B) - Replica 1          │
   │  • Node-251 (AZ-C) - Replica 2          │
   └──────────────┬──────────────────────────┘


   Eventually Consistent Read:
   Route to ANY replica (load balanced)

          ┌───────┼───────┐
          │       │       │
          ↓       ↓       ↓
      Node-42  Node-108  Node-251

         Pick Node-108 (random)


          ┌──────────────┐
          │  Node-108    │
          │  Read from   │
          │  B-tree      │
          └──────┬───────┘


          ┌──────────────┐
          │ Return item  │
          │ to client    │
          └──────────────┘

Latency: ~3-5ms (p50)
Cost: 0.5 RCU per 4KB read

Note: May return stale data if replication lag exists!
      But much faster and cheaper than strong reads.

Read Path (Strongly Consistent)

┌────────────────────────────────────────────────────────────┐
│          STRONGLY CONSISTENT READ FLOW                     │
└────────────────────────────────────────────────────────────┘

1. Client sends GetItem request
   ┌──────────────┐
   │    Client    │ GetItem(userId="user-123")
   └──────┬───────┘  ConsistentRead=true


   ┌─────────────────────────────────────────┐
   │      Request Router                     │
   │  • Authenticate                         │
   │  • Hash partition key → Partition 678   │
   │  • Lookup storage nodes                 │
   └──────────────┬──────────────────────────┘


   Strongly Consistent Read:
   MUST route to PRIMARY replica only


          ┌──────────────┐
          │   Node-42    │ (Primary only)
          │   Read from  │
          │   B-tree     │
          └──────┬───────┘


          ┌──────────────┐
          │ Return item  │
          │ to client    │
          └──────────────┘

Latency: ~5-10ms (p50) - higher than eventual
Cost: 1 RCU per 4KB read (double the cost)

Guarantees: ALWAYS returns latest written data
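
Choosing between the two read paths is a single flag on the request. A minimal boto3 sketch reusing the earlier Users table example (table, key, and region are assumptions):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
users = dynamodb.Table("Users")

# Eventually consistent (default): may be served by any replica, 0.5 RCU per 4 KB.
item = users.get_item(Key={"userId": "user-123"}).get("Item")

# Strongly consistent: routed to the primary, 1 RCU per 4 KB, always the latest write.
latest = users.get_item(Key={"userId": "user-123"}, ConsistentRead=True).get("Item")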

Auto-Scaling and Adaptive Capacity

Provisioned Capacity Mode

┌────────────────────────────────────────────────────────────┐
│            PROVISIONED CAPACITY MODEL                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Table-Level Capacity:                                     │
│  ┌──────────────────────────────────────────────┐         │
│  │ RCU (Read Capacity Units): 1000             │         │
│  │ WCU (Write Capacity Units): 500              │         │
│  │                                              │         │
│  │ 1 RCU = 1 strongly consistent read/sec       │         │
│  │         (or 2 eventually consistent reads)   │         │
│  │         for items up to 4 KB                 │         │
│  │                                              │         │
│  │ 1 WCU = 1 write/sec for items up to 1 KB     │         │
│  └──────────────────────────────────────────────┘         │
│                                                            │
│  Partition-Level Distribution:                            │
│  ┌──────────────────────────────────────────────┐         │
│  │ Table has 10 partitions                      │         │
│  │                                              │         │
│  │ Each partition gets:                         │         │
│  │ • RCU: 1000 / 10 = 100 RCU per partition     │         │
│  │ • WCU: 500 / 10 = 50 WCU per partition       │         │
│  │                                              │         │
│  │ If traffic exceeds partition's capacity:    │         │
│  │ → ProvisionedThroughputExceededException    │         │
│  │ → Throttling (HTTP 400)                      │         │
│  └──────────────────────────────────────────────┘         │
│                                                            │
└────────────────────────────────────────────────────────────┘
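
The RCU/WCU rules above reduce to simple arithmetic: reads round up to 4 KB units (half price for eventually consistent), writes round up to 1 KB units. A small sketch of that sizing math (function names are illustrative):

import math

def rcu_for(reads_per_sec: float, item_kb: float, strongly_consistent: bool) -> float:
    # Reads round up to 4 KB units; eventually consistent reads cost half.
    units = math.ceil(item_kb / 4) * reads_per_sec
    return units if strongly_consistent else units / 2

def wcu_for(writes_per_sec: float, item_kb: float) -> float:
    # Writes round up to 1 KB units.
    return math.ceil(item_kb) * writes_per_sec

print(rcu_for(100, item_kb=6, strongly_consistent=True))    # 200 RCU
print(rcu_for(100, item_kb=6, strongly_consistent=False))   # 100 RCU
print(wcu_for(50, item_kb=2.5))                             # 150 WCU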

The Hot Partition Problem

┌────────────────────────────────────────────────────────────┐
│                  HOT PARTITION PROBLEM                     │
└────────────────────────────────────────────────────────────┘

Table: Products
Total Capacity: 1000 WCU across 10 partitions
Per-Partition: 100 WCU each

Scenario: Product launch (iPhone 15)
┌──────────────────────────────────────────┐
│                                          │
│  Partition 1: productId="iphone-15"      │
│  ↓ 500 writes/sec                        │
│  Capacity: 100 WCU                       │
│  Utilization: 500% ✗ THROTTLED!          │
│                                          │
│  Partition 2-10: Other products          │
│  ↓ 10 writes/sec each                    │
│  Capacity: 100 WCU each                  │
│  Utilization: 10% (idle)                 │
│                                          │
└──────────────────────────────────────────┘

Result:
• Table has 1000 WCU available
• Only using 590 WCU total
• But Partition 1 is throttled!

This is the "hot partition" problem.
Even with plenty of total capacity, uneven
distribution causes throttling.

Adaptive Capacity

DynamoDB’s solution to hot partitions:
┌────────────────────────────────────────────────────────────┐
│               ADAPTIVE CAPACITY                            │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Automatic Capacity Redistribution:                        │
│                                                            │
│  BEFORE Adaptive Capacity:                                 │
│  ┌────────────────────────────────────────────┐           │
│  │ Partition 1: 100 WCU (fixed)               │           │
│  │ Partition 2: 100 WCU (fixed)               │           │
│  │ ...                                        │           │
│  │ Partition 10: 100 WCU (fixed)              │           │
│  │                                            │           │
│  │ Total: 1000 WCU                            │           │
│  │ Hot partition → Throttled                  │           │
│  └────────────────────────────────────────────┘           │
│                                                            │
│  AFTER Adaptive Capacity:                                  │
│  ┌────────────────────────────────────────────┐           │
│  │ Partition 1: 280 WCU (boosted) ✓           │           │
│  │ Partition 2: 80 WCU                        │           │
│  │ Partition 3: 80 WCU                        │           │
│  │ ...                                        │           │
│  │ Partition 10: 80 WCU                       │           │
│  │                                            │           │
│  │ Total: 1000 WCU                            │           │
│  │ Hot partition → Boosted from idle capacity │           │
│  └────────────────────────────────────────────┘           │
│                                                            │
│  How it works:                                             │
│  • DynamoDB monitors partition utilization                 │
│  • Detects hot partitions                                  │
│  • Borrows capacity from underutilized partitions          │
│  • Temporarily boosts hot partition capacity               │
│  • Happens automatically in minutes                        │
│                                                            │
│  Limitations:                                              │
│  • Can't exceed table's total provisioned capacity         │
│  • Best effort (not guaranteed)                            │
│  • Takes ~5-30 minutes to activate                         │
│  • Doesn't solve sustained hot partitions                  │
│                                                            │
└────────────────────────────────────────────────────────────┘

Burst Capacity

┌────────────────────────────────────────────────────────────┐
│                    BURST CAPACITY                          │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  DynamoDB reserves unused capacity for bursts:             │
│                                                            │
│  ┌──────────────────────────────────────────────┐         │
│  │ Provisioned: 100 WCU                         │         │
│  │                                              │         │
│  │ If using only 50 WCU:                        │         │
│  │ • 50 WCU unused                              │         │
│  │ • Unused capacity accrues for up to 5 minutes│         │
│  │ • Burst pool: 15,000 write units (300s × 50) │         │
│  │                                              │         │
│  │ Traffic spike:                               │         │
│  │ • Burst to 150 WCU (50 over provisioned)     │         │
│  │ • Use burst capacity pool                    │         │
│  │ • No throttling for short spikes             │         │
│  └──────────────────────────────────────────────┘         │
│                                                            │
│  Burst Capacity Pool:                                      │
│  • Accumulates up to 300 seconds of unused capacity        │
│  • Per partition                                           │
│  • Helps handle short traffic spikes                       │
│  • Not guaranteed (best effort)                            │
│                                                            │
└────────────────────────────────────────────────────────────┘
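
To make the numbers above concrete, here is the per-partition burst arithmetic as a tiny sketch; it approximates the documented 300-second retention behavior and is not an API:

# Per-partition burst arithmetic (an approximation of the documented behavior):
provisioned_wcu = 100
consumed_wcu = 50                                   # steady-state usage

unused_per_sec = provisioned_wcu - consumed_wcu
burst_pool = min(300 * unused_per_sec, 300 * provisioned_wcu)   # at most 300 s of capacity
print(burst_pool)                                   # 15,000 write units banked

spike_wcu = 150                                     # 50 WCU over the provisioned rate
print(burst_pool / (spike_wcu - provisioned_wcu))   # ~300 s of spike absorbed before throttling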

On-Demand Capacity Mode

┌────────────────────────────────────────────────────────────┐
│              ON-DEMAND CAPACITY MODE                       │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Pay-per-request pricing:                                  │
│  ┌──────────────────────────────────────────────┐         │
│  │ • No capacity planning required              │         │
│  │ • Scales automatically to workload           │         │
│  │ • Pay only for requests you make             │         │
│  │                                              │         │
│  │ Pricing (example):                           │         │
│  │ • Write: $1.25 per million requests          │         │
│  │ • Read: $0.25 per million requests           │         │
│  │                                              │         │
│  │ (vs Provisioned: ~$0.47/WCU/month)           │         │
│  └──────────────────────────────────────────────┘         │
│                                                            │
│  Scaling:                                                  │
│  ┌──────────────────────────────────────────────┐         │
│  │ • Scales to 2x previous peak within 30 min   │         │
│  │ • Can handle traffic spikes automatically    │         │
│  │ • No throttling (unless DDoS protection)     │         │
│  │                                              │         │
│  │ Throttling protection:                       │         │
│  │ • Maximum 40,000 RCU per table               │         │
│  │ • Maximum 40,000 WCU per table               │         │
│  │ • Per partition: 3,000 RCU, 1,000 WCU        │         │
│  └──────────────────────────────────────────────┘         │
│                                                            │
│  When to use On-Demand:                                    │
│  • Unpredictable workloads                                 │
│  • New applications (unknown traffic)                      │
│  • Spiky traffic patterns                                  │
│  • Development/testing                                     │
│                                                            │
│  When to use Provisioned:                                  │
│  • Predictable traffic                                     │
│  • Cost optimization (cheaper at consistent load)          │
│  • Reserved capacity discounts                             │
│                                                            │
└────────────────────────────────────────────────────────────┘

Key Takeaways

Chapter 2 Summary:
  1. Architecture Layers:
    • Request Router: Stateless, handles routing and throttling
    • Storage Nodes: Store data, handle replication
    • Control Plane: Manages partitions and scaling
  2. Consistent Hashing:
    • Minimizes data movement when nodes are added/removed
    • Uses hash ring (0 to 2³²-1)
    • Virtual nodes ensure even distribution
    • Only K/(N+1) keys move when adding a node
  3. Partitioning:
    • Partition key hashed with MD5
    • Each partition limited to 10GB, 3000 RCU, 1000 WCU
    • Automatic splitting when limits exceeded
    • Choose high-cardinality partition keys
  4. Request Routing:
    • Writes go to primary, replicate to 2 replicas (W=2)
    • Eventually consistent reads from any replica
    • Strongly consistent reads from primary only
    • Typical latency: 5-10ms
  5. Capacity Management:
    • Provisioned mode: Fixed capacity, cheaper for predictable loads
    • On-demand mode: Auto-scaling, pay-per-request
    • Adaptive capacity: Redistributes capacity to hot partitions
    • Burst capacity: Handles short traffic spikes
  6. Hot Partitions:
    • Occur when traffic is uneven across partitions
    • Mitigated by adaptive capacity
    • Best solved with better partition key design

Interview Questions

Question 1: Explain how consistent hashing works and why DynamoDB uses it.
Difficulty: Medium

Strong Answer:
1. THE PROBLEM:
   Traditional hashing (hash(key) % N) causes massive
   data movement when nodes are added/removed.
   With 1000 nodes, adding 1 more moves ~99.9% of the data!

2. CONSISTENT HASHING SOLUTION:
   • Hash space forms a ring (0 to 2³²-1)
   • Nodes are placed on the ring via hash
   • Keys are placed on the ring via hash
   • Keys belong to the next node clockwise

3. ADDING A NODE:
   • New node placed on ring
   • Only keys between new node and previous node move
   • Other keys unaffected
   • Minimal data movement: K/(N+1) keys

4. REMOVING A NODE:
   • Node removed from ring
   • Its keys move to next node clockwise
   • Other nodes unaffected

5. VIRTUAL NODES:
   • Each physical node has multiple positions (tokens)
   • Ensures even distribution
   • DynamoDB uses this internally

6. BENEFITS:
   • Minimal data movement during scaling
   • Load balanced across nodes
   • Fault tolerance (replicas on different nodes)
Follow-up: What happens if hash distribution is uneven?
Answer: Use virtual nodes (tokens). Each physical node gets 128+ positions on the ring, which makes the distribution statistically even.
Question 2: Why does DynamoDB separate the request router layer from the storage nodes?
Difficulty: Medium

Strong Answer:
SEPARATION OF CONCERNS:

1. INDEPENDENT SCALING:
   Request Router:
   • CPU-bound (hashing, routing, auth)
   • Stateless (can scale horizontally easily)
   • Scale based on request rate

   Storage Nodes:
   • I/O-bound (disk reads/writes)
   • Stateful (contains data)
   • Scale based on data size and throughput

   Benefit: Scale each layer independently!

2. MULTI-TENANCY:
   • Router can handle many tables
   • Storage nodes isolated per tenant
   • Better resource utilization
   • Noisy neighbor protection

3. OPERATIONAL SIMPLICITY:
   • Storage nodes focused on durability
   • Routers handle complex logic
   • Easier to debug and monitor
   • Cleaner failure domains

4. SECURITY:
   • Router enforces IAM policies
   • Storage nodes don't need credentials
   • Attack surface reduced

5. PERFORMANCE:
   • Router layer can cache metadata
   • Storage optimized for disk I/O
   • No coordination overhead

CONTRAST WITH DYNAMO:
Original Dynamo: P2P architecture, every node
does everything. Complex but more control.

DynamoDB: Layered architecture, simpler to
operate and scale, better for managed service.
Question 3: What is a hot partition, and how does DynamoDB mitigate it?
Difficulty: Medium-Hard

Strong Answer:
HOT PARTITION: One partition receives disproportionate
traffic, causing throttling even when table has capacity.

CAUSES:
1. Poor partition key choice (e.g., status="active")
2. Viral content (celebrity tweet, viral video)
3. Time-based access (everyone queries "today")

DYNAMODB SOLUTIONS:

1. ADAPTIVE CAPACITY (Automatic):
   • Monitors partition utilization
   • Borrows unused capacity from cold partitions
   • Boosts hot partition capacity
   • Activates in 5-30 minutes
   • Limitation: Can't exceed table total capacity

2. BURST CAPACITY (Automatic):
   • Accumulates unused capacity (up to 5 minutes)
   • Uses burst pool for short spikes
   • Per-partition burst pool
   • Helps with brief traffic spikes

3. PARTITION SPLITTING (Automatic):
   • If partition exceeds 10GB or sustained high traffic
   • DynamoDB splits partition
   • Creates two partitions with half the key range
   • Distributes load across two nodes

APPLICATION-LEVEL SOLUTIONS:

4. WRITE SHARDING:
   Add random suffix to partition key
   PK: productId + random(1-10)

   Distributes writes across 10 partitions
   On read: Query all 10 shards, merge results

5. CACHING (DAX):
   • Use DynamoDB Accelerator (DAX)
   • In-memory cache for reads
   • Microsecond latency
   • Reduces load on hot partitions

6. BETTER PARTITION KEY DESIGN:
   • Use high-cardinality keys (userId, not status)
   • Avoid time-based keys as PK
   • Use composite keys to spread load

EXAMPLE:
Bad: PK=date (everyone queries today)
Good: PK=userId, SK=date
Question 4: Compare provisioned and on-demand capacity modes. When would you choose each?
Difficulty: Easy-Medium

Strong Answer:
┌────────────────────────────────────────────────┐
│         PROVISIONED CAPACITY                   │
├────────────────────────────────────────────────┤
│ Pricing:                                       │
│ • Pay for provisioned RCU/WCU per hour         │
│ • $0.00065/WCU/hour, $0.00013/RCU/hour (approx)│
│ • Cheaper for sustained consistent load        │
│                                                │
│ Scaling:                                       │
│ • Manual scaling or auto-scaling               │
│ • Unlimited scale-ups; scale-downs limited     │
│   (up to 4 per day, plus hourly allowances)    │
│ • Must provision capacity upfront              │
│                                                │
│ Throttling:                                    │
│ • Throttled if exceeding provisioned capacity  │
│ • Handle ProvisionedThroughputExceededException│
│                                                │
│ Best for:                                      │
│ • Predictable traffic                          │
│ • Cost optimization                            │
│ • Production workloads with known patterns     │
└────────────────────────────────────────────────┘

┌────────────────────────────────────────────────┐
│           ON-DEMAND CAPACITY                   │
├────────────────────────────────────────────────┤
│ Pricing:                                       │
│ • Pay per request                              │
│ • $1.25/million WRU, $0.25/million RRU (approx)│
│ • ~7x more expensive at full utilization       │
│                                                │
│ Scaling:                                       │
│ • Automatic instant scaling                    │
│ • No capacity planning needed                  │
│ • Scales to 2x previous peak in 30 min         │
│                                                │
│ Throttling:                                    │
│ • Minimal throttling (DDoS protection only)    │
│ • Max 40K RCU/WCU per table                    │
│                                                │
│ Best for:                                      │
│ • Unpredictable workloads                      │
│ • Spiky traffic                                │
│ • New applications                             │
│ • Dev/test environments                        │
└────────────────────────────────────────────────┘

COST COMPARISON EXAMPLE:

Workload: 1M writes/day, 5M reads/day (sustained)

Provisioned:
• WCU: 12 (1M / 86,400 sec)
• RCU: 58 (5M / 86,400 sec, strongly consistent)
• Cost: ~$11/month (12 × $0.00065/hr + 58 × $0.00013/hr, ~730 hrs)

On-Demand:
• Writes: 30M/month × $1.25/M = ~$37.50
• Reads: 150M/month × $0.25/M = ~$37.50
• Cost: ~$75/month

For steady, sustained traffic: Provisioned cheaper
For spiky or very low traffic: On-demand cheaper

SWITCHING:
• Can switch once per 24 hours
• Choose based on traffic patterns
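
The arithmetic behind the comparison above can be reproduced in a few lines; the prices are approximate us-east-1 list prices and the helper names are illustrative:

WCU_HOUR, RCU_HOUR = 0.00065, 0.00013      # provisioned, approx. us-east-1 list prices
WRITE_PER_M, READ_PER_M = 1.25, 0.25       # on-demand, per million request units
HOURS_PER_MONTH = 730

def provisioned_monthly(wcu: float, rcu: float) -> float:
    return (wcu * WCU_HOUR + rcu * RCU_HOUR) * HOURS_PER_MONTH

def on_demand_monthly(writes: float, reads: float) -> float:
    return writes / 1e6 * WRITE_PER_M + reads / 1e6 * READ_PER_M

# Sustained 1M writes/day, 5M reads/day:
print(round(provisioned_monthly(wcu=12, rcu=58)))           # ~$11/month
print(round(on_demand_monthly(30_000_000, 150_000_000)))    # ~$75/month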
Question 5: Design a partition key strategy for a Twitter-like application.
Difficulty: Hard

Scenario: Design partition key strategy for Twitter-like app with:
  • 100M users
  • Users post tweets
  • Users view their timeline (tweets from followers)
  • Popular users have millions of followers
Strong Answer:
REQUIREMENTS ANALYSIS:

1. Access Patterns:
   • Get user's tweets: Frequent
   • Get user timeline: Very frequent
   • Post tweet: Moderate
   • Follow/unfollow: Infrequent

2. Scale Considerations:
   • 100M users → Need high cardinality PK
   • Viral tweets → Hot partition risk
   • Celebrity users → Massive fan-out

PARTITION KEY STRATEGIES:

OPTION 1: User-Centric (Single Table Design)
┌──────────────────────────────────────────┐
│ PK: userId                               │
│ SK: TYPE#timestamp                       │
│                                          │
│ User Profile:                            │
│ PK: USER#alice                           │
│ SK: METADATA                             │
│                                          │
│ User's Tweets:                           │
│ PK: USER#alice                           │
│ SK: TWEET#2024-01-15T10:00:00#id123      │
│                                          │
│ User Timeline (fan-out on write):        │
│ PK: USER#alice#TIMELINE                  │
│ SK: TWEET#2024-01-15T10:00:00#id456      │
│                                          │
│ Pros:                                    │
│ + Single query for user's tweets         │
│ + Single query for timeline              │
│ + No hot partitions (userId distributed) │
│                                          │
│ Cons:                                    │
│ - Fan-out write expensive for celebrities│
│ - Timeline storage duplicates tweet data │
└──────────────────────────────────────────┘

OPTION 2: Hybrid Approach (Handle Celebrities)
┌──────────────────────────────────────────┐
│ Regular users: Fan-out on write          │
│ • Tweet written to followers' timelines  │
│ • Fast timeline reads                    │
│                                          │
│ Celebrities (>10K followers):            │
│ • NO fan-out                             │
│ • Timeline: Pull from followed users     │
│ • Use DAX cache for celebrity tweets     │
│                                          │
│ Timeline Query (alice follows celebrity):│
│ 1. Read alice's timeline (fan-out tweets)│
│ 2. Query celebrity tweets separately     │
│ 3. Merge and sort by timestamp           │
│ 4. Cache in DAX                          │
└──────────────────────────────────────────┘

OPTION 3: Sharded Tweets (Avoid Hot Partitions)
┌──────────────────────────────────────────┐
│ For viral tweets, shard the tweet:       │
│                                          │
│ Original tweet:                          │
│ PK: TWEET#viral123                       │
│ Gets hot (millions of reads)             │
│                                          │
│ Sharded approach:                        │
│ PK: TWEET#viral123#SHARD#{rand(1-100)}   │
│                                          │
│ Read path:                               │
│ • Pick random shard: SHARD#42            │
│ • All shards have same content           │
│ • Distributes load across 100 partitions │
│                                          │
│ Benefit: 100x capacity for viral content │
└──────────────────────────────────────────┘

RECOMMENDED STRATEGY:

Tables:
1. Users table:
   PK: userId
   (User profiles, settings)

2. Tweets table:
   PK: userId
   SK: tweetId
   (User's tweets, easy to query)

3. Timeline table:
   PK: userId
   SK: timestamp#tweetId
   (Fan-out on write for regular users)

4. Follows table:
   PK: userId
   SK: FOLLOWS#followedUserId
   (Track relationships)

Fan-out Strategy:
• If follower count < 10K: Fan-out on write
• If follower count ≥ 10K: Pull model
• Use DynamoDB Streams + Lambda for fan-out
• Cache celebrity tweets in DAX

This balances read performance, write cost,
and handles hot partitions gracefully.
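
A fan-out-on-write step like the one recommended above would typically run in a Lambda consuming the tweet table's DynamoDB Stream. A hedged boto3 sketch, with table and attribute names assumed from the strategy above:

import boto3

# Table and attribute names follow the recommended strategy above (assumptions).
dynamodb = boto3.resource("dynamodb")
timeline = dynamodb.Table("Timeline")

FANOUT_LIMIT = 10_000   # above this follower count, switch to the pull model

def fan_out_tweet(author_id: str, tweet_id: str, timestamp: str,
                  follower_ids: list[str]) -> None:
    # Fan-out on write: copy a pointer to the tweet into each follower's
    # timeline partition so their timeline read is a single-partition Query.
    if len(follower_ids) >= FANOUT_LIMIT:
        return   # celebrity path: readers pull the author's tweets directly
    with timeline.batch_writer() as batch:
        for follower_id in follower_ids:
            batch.put_item(Item={
                "userId": follower_id,               # PK: the reader's timeline
                "sk": f"{timestamp}#{tweet_id}",     # SK: newest-first ordering
                "authorId": author_id,
            })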

What’s Next?

In Chapter 3: Data Model and Access Patterns, we’ll explore:
  • Tables, items, and attributes in detail
  • Primary keys: partition key vs composite key
  • Secondary indexes (GSI and LSI)
  • Data modeling patterns and anti-patterns

Continue to Chapter 3

Master DynamoDB’s data model and learn how to design effective schemas