Chapter 2: Architecture and Partitioning
DynamoDB’s architecture represents a sophisticated blend of distributed systems principles: consistent hashing for data distribution, a stateless request-routing layer, storage nodes for durability, and automatic partitioning for scale. Understanding this architecture is crucial for effective data modeling and performance optimization.
Chapter Goals:
- Understand DynamoDB’s high-level architecture
- Master consistent hashing and the hash ring concept
- Learn how data is partitioned and distributed
- Grasp request routing and storage node design
- Understand auto-scaling and adaptive capacity
High-Level Architecture
System Components
DynamoDB’s architecture consists of several key layers:
┌─────────────────────────────────────────────────────────────┐
│ DYNAMODB ARCHITECTURE │
│ │
│ ┌──────────┐ │
│ │ Client │ │
│ │ SDK │ │
│ └─────┬────┘ │
│ │ │
│ │ HTTPS/API Call │
│ ↓ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ REQUEST ROUTING LAYER │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ • Authentication & Authorization (IAM) │ │ │
│ │ │ • Request validation │ │ │
│ │ │ • Partition key hashing │ │ │
│ │ │ • Route to correct storage node │ │ │
│ │ │ • Throttling & request metering │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └──────────────────┬───────────────────────────────────┘ │
│ │ │
│ │ Routes to storage nodes │
│ ↓ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ STORAGE NODE LAYER │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Storage │ │ Storage │ │ Storage │ ... │ │
│ │ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Partition│ │ Partition│ │ Partition│ │ │
│ │ │ P1, P2 │ │ P3, P4 │ │ P5, P6 │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │
│ └───────┼─────────────┼─────────────┼─────────────────┘ │
│ │ │ │ │
│ │ Replication (3 AZs) │ │
│ ↓ ↓ ↓ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ REPLICATION & PERSISTENCE LAYER │ │
│ │ │ │
│ │ AZ-A AZ-B AZ-C │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │ SSD │ │ SSD │ │ SSD │ │ │
│ │ │Store│ │Store│ │Store│ │ │
│ │ └─────┘ └─────┘ └─────┘ │ │
│ │ │ │
│ │ • B-tree storage on SSD │ │
│ │ • Write-ahead logging │ │
│ │ • Synchronous replication across 3 AZs │ │
│ └────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ CONTROL PLANE │ │
│ │ • Partition management & splitting │ │
│ │ • Auto-scaling decisions │ │
│ │ • Health monitoring │ │
│ │ • Backup and restore │ │
│ │ • Global table coordination │ │
│ └────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
Layer Responsibilities
- Request Router Layer
- Storage Node Layer
- Control Plane
Request Router Responsibilities
Key Point: The request router is stateless and can scale independently of storage nodes. This enables DynamoDB to handle traffic spikes without affecting storage.
┌─────────────────────────────────────────────┐
│ REQUEST ROUTER FUNCTIONS │
├─────────────────────────────────────────────┤
│ │
│ 1. AUTHENTICATION & AUTHORIZATION │
│ • Validate IAM credentials │
│ • Check resource permissions │
│ • Enforce security policies │
│ │
│ 2. REQUEST VALIDATION │
│ • Validate API parameters │
│ • Check payload size limits │
│ • Validate data types │
│ │
│ 3. PARTITION KEY HASHING │
│ • Hash partition key with MD5 │
│ • Determine partition from hash │
│ • Locate storage nodes for partition │
│ │
│ 4. REQUEST ROUTING │
│ • Route to primary storage node │
│ • Handle replica routing for reads │
│ • Retry on failures │
│ │
│ 5. CAPACITY MANAGEMENT │
│ • Track consumed capacity │
│ • Apply throttling (provisioned mode) │
│ • Return throttling errors (400) │
│ │
│ 6. METRICS & MONITORING │
│ • Emit CloudWatch metrics │
│ • Track latency and errors │
│ • Log for audit trails │
│ │
└─────────────────────────────────────────────┘
Storage Node Responsibilities
Key Point: Each storage node manages multiple partitions and handles replication to ensure durability and availability.
┌─────────────────────────────────────────────┐
│ STORAGE NODE FUNCTIONS │
├─────────────────────────────────────────────┤
│ │
│ 1. DATA STORAGE │
│ • Store items in B-tree index │
│ • Maintain sort order by sort key │
│ • Store on SSD for fast access │
│ │
│ 2. REPLICATION │
│ • Leader-based replication │
│ • Synchronous writes to 2 replicas │
│ • Cross-AZ replication │
│ • Quorum: W=2, R=1 or R=2 │
│ │
│ 3. WRITE-AHEAD LOGGING │
│ • Log writes before applying │
│ • Enable crash recovery │
│ • Support point-in-time recovery │
│ │
│ 4. INDEX MAINTENANCE │
│ • Update B-tree on writes │
│ • Maintain secondary index entries │
│ • Asynchronous GSI updates │
│ │
│ 5. CONSISTENCY │
│ • Eventually consistent: Read from any │
│ • Strongly consistent: Read from leader │
│ • Last-write-wins conflict resolution │
│ │
│ 6. STREAMS (if enabled) │
│ • Capture change events │
│ • Store in DynamoDB Streams │
│ • 24-hour retention │
│ │
└─────────────────────────────────────────────┘
Control Plane Responsibilities
Key Point: The control plane operates asynchronously and handles all operational tasks, keeping the data plane (request router + storage nodes) focused on serving requests.
┌─────────────────────────────────────────────┐
│ CONTROL PLANE FUNCTIONS │
├─────────────────────────────────────────────┤
│ │
│ 1. PARTITION MANAGEMENT │
│ • Monitor partition metrics │
│ • Trigger partition splits │
│ • Balance load across nodes │
│ • Handle partition migrations │
│ │
│ 2. AUTO-SCALING │
│ • Monitor capacity utilization │
│ • Scale up/down provisioned capacity │
│ • Adaptive capacity allocation │
│ • Burst capacity management │
│ │
│ 3. HEALTH MONITORING │
│ • Detect node failures │
│ • Trigger failover │
│ • Replace unhealthy nodes │
│ • Ensure replication factor │
│ │
│ 4. BACKUP & RESTORE │
│ • Continuous backups │
│ • Point-in-time recovery │
│ • On-demand backups │
│ • Cross-region backup copies │
│ │
│ 5. GLOBAL TABLES │
│ • Multi-region replication │
│ • Conflict resolution │
│ • Region failover │
│ • Consistency management │
│ │
│ 6. TABLE OPERATIONS │
│ • Create/delete tables │
│ • Update table settings │
│ • Add/remove indexes │
│ • Change capacity modes │
│ │
└─────────────────────────────────────────────┘
Consistent Hashing
The Partitioning Problem
Before diving into consistent hashing, understand the problem it solves:
┌────────────────────────────────────────────────────────────┐
│ WHY PARTITIONING IS NECESSARY │
├────────────────────────────────────────────────────────────┤
│ │
│ Single-Node Limitations: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Single DB Server: │ │
│ │ • Storage: ~10 TB max (practical limit) │ │
│ │ • Memory: ~1 TB max │ │
│ │ • Throughput: ~50k requests/sec │ │
│ │ • Network: ~10 Gbps │ │
│ │ │ │
│ │ DynamoDB Needs: │ │
│ │ • Storage: Unlimited (petabytes+) │ │
│ │ • Memory: Terabytes across cluster │ │
│ │ • Throughput: Millions of requests/sec │ │
│ │ • Network: Aggregate bandwidth across nodes │ │
│ │ │ │
│ │ Solution: PARTITION DATA ACROSS MANY NODES │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
Naive Hashing Problems
Simple hash-based partitioning has a major flaw:
NAIVE APPROACH: hash(key) % N
Example with 3 nodes:
┌──────────────────────────────────────┐
│ hash("user-123") = 456 │
│ 456 % 3 = 0 → Node 0 │
│ │
│ hash("user-456") = 789 │
│ 789 % 3 = 0 → Node 0 │
│ │
│ hash("user-789") = 234 │
│ 234 % 3 = 0 → Node 0 │
└──────────────────────────────────────┘
PROBLEM: Adding or removing a node
Add Node 3 (now N=4):
┌──────────────────────────────────────┐
│ hash("user-123") = 456 │
│ 456 % 4 = 0 → Node 0 ✓ (same) │
│ │
│ hash("user-456") = 789 │
│ 789 % 4 = 1 → Node 1 ✗ (moved!) │
│ │
│ hash("user-789") = 234 │
│ 234 % 4 = 2 → Node 2 ✗ (moved!) │
└──────────────────────────────────────┘
Result: Adding 1 node remapped keys across ALL nodes!
(With hash(key) % N, roughly N/(N+1) of all keys move
when the cluster grows from N to N+1 nodes.)
With 1000 nodes, adding 1 node moves ~99.9% of the data!
This causes massive data migration and downtime.
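A quick simulation makes the scale of the problem concrete. The sketch below (plain Python, illustrative only) counts how many of 100,000 keys land on a different node when a 1,000-node cluster grows by one under modulo placement:

import hashlib

def node_for(key: str, num_nodes: int) -> int:
    # Naive placement: hash the key, then take the hash modulo the node count
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user-{i}" for i in range(100_000)]
n = 1000
moved = sum(1 for k in keys if node_for(k, n) != node_for(k, n + 1))
print(f"{moved / len(keys):.1%} of keys moved")   # ~99.9% for n=1000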
Consistent Hashing Solution
Consistent hashing minimizes data movement when nodes are added or removed:
┌────────────────────────────────────────────────────────────┐
│ CONSISTENT HASHING RING │
│ │
│ 0° │
│ • │
│ ╱ ╲ │
│ ╱ ╲ │
│ 270° • • 90° │
│ ╱ ╲ │
│ │ │ │
│ │ ┌─────────┐ │ │
│ │ │ Hash │ │ │
│ │ │ Ring │ │ │
│ │ │ (0-2³²) │ │ │
│ │ └─────────┘ │ │
│ │ │ │
│ ╲ ╱ │
│ • • │
│ 180° 360°/0° │
│ │
│ Hash Space: 0 to 2³² - 1 (4,294,967,295) │
│ Arranged in a ring (2³² - 1 wraps to 0) │
│ │
└────────────────────────────────────────────────────────────┘
How Consistent Hashing Works
Step 1: Hash Nodes onto Ring
Place nodes on the hash ring
3 Storage Nodes: A, B, C
Node placement:
┌────────────────────────────────────────┐
│ hash("Node-A") = 1000 → Position 1000 │
│ hash("Node-B") = 2000 → Position 2000 │
│ hash("Node-C") = 3000 → Position 3000 │
└────────────────────────────────────────┘
Ring visualization:
0
│
│
3000 •─────┴─────• 1000
Node-C Node-A
╲ ╱
╲ ╱
╲ ╱
╲ ╱
•
2000
Node-B
Each node is responsible for the range from
the previous node (clockwise) to itself.
Node-A: (3000, 1000]
Node-B: (1000, 2000]
Node-C: (2000, 3000]
Step 2: Hash Keys onto Ring
Determine which node stores each key
Keys to store:
hash("user-123") = 500 → Between 0 and 1000
→ Assign to Node-A
hash("user-456") = 1500 → Between 1000 and 2000
→ Assign to Node-B
hash("user-789") = 2500 → Between 2000 and 3000
→ Assign to Node-C
Ring with keys:
0
│
500 • (user-123)
│
3000 •─────┴─────• 1000
Node-C Node-A
╲ ╱
╲ ╱
2500 • ╲ ╱ • 1500
(user-789) ╲ ╱ (user-456)
╲ ╱
•
2000
Node-B
Rule: Key goes to the next node clockwise
Step 3: Adding a Node
Minimal data movement when adding nodes
Add Node-D at position 1500:
Before:
Node-A: (3000, 1000] ← owns 500 (user-123)
Node-B: (1000, 2000] ← owns 1500 (user-456)
Node-C: (2000, 3000] ← owns 2500 (user-789)
After adding Node-D at 1500:
Node-A: (3000, 1000] ← still owns 500 (user-123) ✓
Node-D: (1000, 1500] ← takes 1500 from Node-B
Node-B: (1500, 2000] ← keeps other keys
Node-C: (2000, 3000] ← still owns 2500 (user-789) ✓
Result: ONLY keys in range (1000, 1500] moved!
Only that one slice of keys moved (instead of
nearly all keys, as with naive modulo hashing)
Ring after addition:
0
│
500 • (user-123)
│
3000 •─────┴─────• 1000
Node-C Node-A
╲ │
╲ │
2500 • ╲ 1500 • Node-D
(user-789) ╲ │ (user-456)
╲ │
•───────•
2000
Node-B
Advantage: On average only K/(N+1) keys move when adding a node
(versus nearly all K keys with naive modulo hashing)
Step 4: Removing a Node
Minimal data movement when removing nodes
Remove Node-B at position 2000:
Before:
Node-A: (3000, 1000]
Node-B: (1000, 2000] ← owns 1500 (user-456)
Node-C: (2000, 3000]
After removing Node-B:
Node-A: (3000, 1000] ← unchanged
Node-C: (1000, 3000] ← takes Node-B's keys
Only keys in Node-B's range move to Node-C
Node-A and other nodes unaffected
Ring after removal:
0
│
500 • (user-123)
│
3000 •─────┴─────• 1000
Node-C Node-A
╲ ╱
╲ ╱
2500 • ╲ ╱ • 1500
(user-789) ╲ ╱ (user-456)
╲ ╱
•
(Node-B removed)
Virtual Nodes (Tokens)
Real implementations use virtual nodes for better load distribution:
PROBLEM with Single Token Per Node:
┌────────────────────────────────────────┐
│ If nodes are unevenly distributed: │
│ │
│ 0° •────────• 90° │
│ Node-A Node-B │
│ │
│ Node-A: 0° to 5° (tiny range) │
│ Node-B: 5° to 90° (huge range) │
│ │
│ Result: Uneven data distribution! │
└────────────────────────────────────────┘
SOLUTION: Virtual Nodes (Tokens)
┌────────────────────────────────────────┐
│ Each physical node gets multiple │
│ positions (tokens) on the ring │
│ │
│ Node-A has 128 tokens: │
│ hash("Node-A-token-0") = 100 │
│ hash("Node-A-token-1") = 500 │
│ hash("Node-A-token-2") = 1200 │
│ ... │
│ hash("Node-A-token-127") = 3900 │
│ │
│ Node-B has 128 tokens: │
│ hash("Node-B-token-0") = 50 │
│ hash("Node-B-token-1") = 300 │
│ ... │
│ │
│ Result: Much more even distribution! │
│ Each node gets ~1/N of the data │
└────────────────────────────────────────┘
DynamoDB uses this approach internally to ensure
even data distribution across storage nodes.
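The mechanics described above fit in a few dozen lines. Below is a minimal consistent-hash ring in Python with virtual nodes. It is a sketch of the general technique, not DynamoDB's actual (internal, unpublished) implementation:

import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Map a string onto a 32-bit ring position
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, tokens_per_node: int = 128):
        self.tokens_per_node = tokens_per_node
        self._ring = []      # sorted token positions
        self._owner = {}     # token position -> physical node

    def add_node(self, node: str) -> None:
        # Each physical node gets many token positions on the ring
        for i in range(self.tokens_per_node):
            pos = ring_hash(f"{node}-token-{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        for pos in [p for p, n in self._owner.items() if n == node]:
            self._ring.remove(pos)
            del self._owner[pos]

    def node_for(self, key: str) -> str:
        # A key belongs to the first token at or clockwise from its position
        pos = ring_hash(key)
        idx = bisect.bisect_left(self._ring, pos)
        if idx == len(self._ring):   # wrap past the top of the ring
            idx = 0
        return self._owner[self._ring[idx]]

ring = ConsistentHashRing()
for node in ("Node-A", "Node-B", "Node-C"):
    ring.add_node(node)
print(ring.node_for("user-123"))   # e.g. 'Node-B'

Adding a fourth node and re-running node_for over a sample of keys shows only roughly a quarter of them changing owners — the K/(N+1) property in action.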
Partition Key and Data Distribution
Partition Key Hashing
┌────────────────────────────────────────────────────────────┐
│ PARTITION KEY DETERMINATION │
└────────────────────────────────────────────────────────────┘
Table Schema:
┌──────────────────────────────────┐
│ Table: Users │
│ │
│ Partition Key: userId │
│ Sort Key: (none) │
└──────────────────────────────────┘
Write Operation:
┌──────────────────────────────────┐
│ PutItem({ │
│ userId: "user-123", │
│ name: "Alice", │
│ email: "alice@example.com" │
│ }) │
└──────────────────────────────────┘
Step 1: Extract partition key
↓
userId = "user-123"
Step 2: Hash the partition key value
↓
hash = MD5("user-123")
= 3e3e9f56c8f7e04b...
= 1042345678 (as 32-bit integer)
Step 3: Map to partition
↓
Partition = hash % num_partitions
= 1042345678 % 1000
= Partition 678
(Simplified: in practice the hash maps into a
partition's contiguous range of the hash space,
per the consistent-hashing scheme above.)
Step 4: Route to storage nodes hosting Partition 678
↓
Primary: Node-42 (AZ-A)
Replica 1: Node-108 (AZ-B)
Replica 2: Node-251 (AZ-C)
Step 5: Write to all replicas (quorum W=2)
↓
Write to Node-42 (AZ-A) ✓
Write to Node-108 (AZ-B) ✓
Return success (2/3 acks received)
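Steps 1-3 can be expressed as a short Python sketch. One hedge: AWS does not publish the exact hash function or mapping, so the MD5 and modulo below are illustrative stand-ins:

import hashlib

def partition_for(partition_key_value: str, num_partitions: int) -> int:
    # Step 2: hash the partition key value (the real hash function is internal)
    digest = hashlib.md5(partition_key_value.encode()).digest()
    hash_32 = int.from_bytes(digest[:4], "big")
    # Step 3: map the hash to a partition (DynamoDB actually maps the hash
    # into a partition's contiguous key range, not a simple modulo)
    return hash_32 % num_partitions

print(partition_for("user-123", 1000))   # e.g. 486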
Partition Size Limits
Partition Constraints: Each partition has hard limits. When a partition exceeds them, DynamoDB automatically splits it:
┌────────────────────────────────────┐
│ PARTITION LIMITS (Per Partition) │
├────────────────────────────────────┤
│ │
│ Storage: 10 GB maximum │
│ RCU: 3,000 read capacity units │
│ WCU: 1,000 write capacity units │
│ │
│ Exceeding these triggers: │
│ → Automatic partition split │
│ │
└────────────────────────────────────┘
Before Split:
┌─────────────────────────────────┐
│ Partition P1 │
│ Range: (0, 1000] │
│ Size: 10.2 GB (exceeds limit!) │
│ Storage Node: Node-A │
└─────────────────────────────────┘
After Split:
┌─────────────────────────────────┐
│ Partition P1-A │
│ Range: (0, 500] │
│ Size: 5.1 GB │
│ Storage Node: Node-A │
└─────────────────────────────────┘
┌─────────────────────────────────┐
│ Partition P1-B │
│ Range: (500, 1000] │
│ Size: 5.1 GB │
│ Storage Node: Node-B │
└─────────────────────────────────┘
Note: Once split, partitions NEVER merge back!
Choosing Good Partition Keys
┌────────────────────────────────────────────────────────────┐
│ PARTITION KEY BEST PRACTICES │
├────────────────────────────────────────────────────────────┤
│ │
│ ✓ HIGH CARDINALITY │
│ ┌────────────────────────────────────────────┐ │
│ │ Good: userId (millions of unique values) │ │
│ │ Bad: status (only "active" or "inactive") │ │
│ │ │ │
│ │ Reason: Many unique values = even │ │
│ │ distribution across partitions │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ✓ UNIFORM ACCESS PATTERN │
│ ┌────────────────────────────────────────────┐ │
│ │ Good: userId (random access) │ │
│ │ Bad: date (everyone queries today's date) │ │
│ │ │ │
│ │ Reason: Avoid hot partitions where all │ │
│ │ traffic goes to one partition │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ✓ KNOWN ACCESS PATTERNS │
│ ┌────────────────────────────────────────────┐ │
│ │ Design partition key based on how you │ │
│ │ will query the data │ │
│ │ │ │
│ │ Example: │ │
│ │ • Access by userId → PK: userId │ │
│ │ • Access by orderId → PK: orderId │ │
│ │ • Access by country+date → │ │
│ │ PK: country, SK: date │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ✗ AVOID TIME-BASED PARTITION KEYS │
│ ┌────────────────────────────────────────────┐ │
│ │ Bad: timestamp or date as PK │ │
│ │ │ │
│ │ Problem: │ │
│ │ • All writes go to "current time" partition│ │
│ │ • Creates hot partition │ │
│ │ • Causes throttling │ │
│ │ │ │
│ │ Solution: │ │
│ │ • Use userId as PK, timestamp as SK │ │
│ │ • Or use composite key with random suffix │ │
│ └────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
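The "random suffix" technique from the last box looks like this in practice. A hedged boto3 sketch: the Events table name, key attribute names, and the 10-shard count are assumptions for illustration:

import random
import boto3

table = boto3.resource("dynamodb").Table("Events")   # hypothetical table

NUM_SHARDS = 10

def put_event(event_date: str, event_id: str, payload: dict) -> None:
    # Spread writes for a single hot date across NUM_SHARDS partitions
    shard = random.randint(1, NUM_SHARDS)
    table.put_item(Item={
        "pk": f"{event_date}#{shard}",   # e.g. "2024-01-15#7"
        "sk": event_id,
        **payload,
    })

The trade-off: reads must now query all NUM_SHARDS key values and merge the results (an example appears in Question 3 at the end of this chapter).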
Request Routing
Write Path
┌────────────────────────────────────────────────────────────┐
│ WRITE REQUEST FLOW │
└────────────────────────────────────────────────────────────┘
1. Client sends PutItem request
┌──────────────┐
│ Client │ PutItem(userId="user-123", name="Alice")
└──────┬───────┘
│
↓ HTTPS/TLS
┌─────────────────────────────────────────┐
│ Request Router (Endpoint) │
│ ┌────────────────────────────────────┐ │
│ │ 1. Authenticate IAM credentials │ │
│ │ 2. Authorize access to table │ │
│ │ 3. Validate request parameters │ │
│ │ 4. Check capacity (if provisioned) │ │
│ └────────────────────────────────────┘ │
└──────────────┬──────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ 2. Hash partition key to find partition│
│ hash("user-123") = 1042345678 │
│ → Partition 678 │
└──────────────┬──────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ 3. Look up storage nodes for P678 │
│ Primary: Node-42 (AZ-A) │
│ Replica 1: Node-108 (AZ-B) │
│ Replica 2: Node-251 (AZ-C) │
└──────────────┬──────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ 4. Forward write to Primary (Node-42) │
└─────────────────┬───────────────────────┘
│
↓
┌─────────────┐
│ Node-42 │ (Primary - AZ-A)
│ Partition │
│ 678 │
└──────┬──────┘
│
┌──────────────┼──────────────┐
│ │ │
↓ ↓ ↓
┌────────┐ ┌────────┐ ┌────────┐
│Node-42 │ │Node-108│ │Node-251│
│ (AZ-A) │ │ (AZ-B) │ │ (AZ-C) │
│Primary │ │Replica │ │Replica │
└───┬────┘ └───┬────┘ └───┬────┘
│ │ │
│ 5. Write to WAL (Write-Ahead Log)
│ │ │
↓ ↓ ↓
┌────────┐ ┌────────┐ ┌────────┐
│ WAL │ │ WAL │ │ WAL │
└───┬────┘ └───┬────┘ └───┬────┘
│ │ │
│ 6. Write to B-tree storage
│ │ │
↓ ↓ ↓
┌────────┐ ┌────────┐ ┌────────┐
│ B-tree │ │ B-tree │ │ B-tree │
│ SSD │ │ SSD │ │ SSD │
└───┬────┘ └───┬────┘ └───┬────┘
│ │ │
│ 7. Send ACK back
│ │ │
↓ ↓ ↓
ACK 1 ACK 2 ACK 3
│ │
└─────────────┴─────────────┐
│
8. Wait for quorum (W=2) │
Got 2 ACKs → Success! │
↓
┌──────────────────┐
│ Return success │
│ to client │
└──────────────────┘
Latency Breakdown (typical):
• Request routing: ~1ms
• Network to storage node: ~1ms
• Write to WAL + replicate: ~2-3ms
• B-tree update: ~1ms
• Total: ~5-7ms (p50)
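From the client's point of view, all of this is one API call; routing, replication, and the quorum wait happen behind the endpoint. A minimal boto3 example (the Users table is assumed):

import boto3

client = boto3.client("dynamodb")

response = client.put_item(
    TableName="Users",   # hypothetical table
    Item={
        "userId": {"S": "user-123"},
        "name": {"S": "Alice"},
        "email": {"S": "alice@example.com"},
    },
    ReturnConsumedCapacity="TOTAL",
)
# A successful response means the write reached a quorum of replicas (W=2)
print(response["ConsumedCapacity"])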
Read Path (Eventually Consistent)
┌────────────────────────────────────────────────────────────┐
│ EVENTUALLY CONSISTENT READ FLOW │
└────────────────────────────────────────────────────────────┘
1. Client sends GetItem request
┌──────────────┐
│ Client │ GetItem(userId="user-123")
└──────┬───────┘ ConsistentRead=false (default)
│
↓
┌─────────────────────────────────────────┐
│ Request Router │
│ • Authenticate │
│ • Hash partition key → Partition 678 │
│ • Lookup storage nodes │
└──────────────┬──────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Partition 678 replicas: │
│ • Node-42 (AZ-A) - Primary │
│ • Node-108 (AZ-B) - Replica 1 │
│ • Node-251 (AZ-C) - Replica 2 │
└──────────────┬──────────────────────────┘
│
↓
Eventually Consistent Read:
Route to ANY replica (load balanced)
│
┌───────┼───────┐
│ │ │
↓ ↓ ↓
Node-42 Node-108 Node-251
│
Pick Node-108 (random)
│
↓
┌──────────────┐
│ Node-108 │
│ Read from │
│ B-tree │
└──────┬───────┘
│
↓
┌──────────────┐
│ Return item │
│ to client │
└──────────────┘
Latency: ~3-5ms (p50)
Cost: 0.5 RCU per 4KB read
Note: May return stale data if replication lag exists!
But much faster and cheaper than strong reads.
Read Path (Strongly Consistent)
┌────────────────────────────────────────────────────────────┐
│ STRONGLY CONSISTENT READ FLOW │
└────────────────────────────────────────────────────────────┘
1. Client sends GetItem request
┌──────────────┐
│ Client │ GetItem(userId="user-123")
└──────┬───────┘ ConsistentRead=true
│
↓
┌─────────────────────────────────────────┐
│ Request Router │
│ • Authenticate │
│ • Hash partition key → Partition 678 │
│ • Lookup storage nodes │
└──────────────┬──────────────────────────┘
│
↓
Strongly Consistent Read:
MUST route to PRIMARY replica only
│
↓
┌──────────────┐
│ Node-42 │ (Primary only)
│ Read from │
│ B-tree │
└──────┬───────┘
│
↓
┌──────────────┐
│ Return item │
│ to client │
└──────────────┘
Latency: ~5-10ms (p50) - higher than eventual
Cost: 1 RCU per 4KB read (double the cost)
Guarantees: ALWAYS returns latest written data
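Client-side, the two read paths differ by a single flag. Another minimal boto3 sketch against the assumed Users table:

import boto3

client = boto3.client("dynamodb")
key = {"userId": {"S": "user-123"}}

# Eventually consistent (the default): may be served by any replica,
# costs 0.5 RCU per 4 KB, and can briefly return stale data
eventual = client.get_item(TableName="Users", Key=key)

# Strongly consistent: served by the leader replica,
# costs 1 RCU per 4 KB, and reflects all acknowledged writes
strong = client.get_item(TableName="Users", Key=key, ConsistentRead=True)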
Auto-Scaling and Adaptive Capacity
Provisioned Capacity Mode
┌────────────────────────────────────────────────────────────┐
│ PROVISIONED CAPACITY MODEL │
├────────────────────────────────────────────────────────────┤
│ │
│ Table-Level Capacity: │
│ ┌──────────────────────────────────────────────┐ │
│ │ RCU (Read Capacity Units): 1000 │ │
│ │ WCU (Write Capacity Units): 500 │ │
│ │ │ │
│ │ 1 RCU = 1 strongly consistent read/sec │ │
│ │ (or 2 eventually consistent reads) │ │
│ │ for items up to 4 KB │ │
│ │ │ │
│ │ 1 WCU = 1 write/sec for items up to 1 KB │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Partition-Level Distribution: │
│ ┌──────────────────────────────────────────────┐ │
│ │ Table has 10 partitions │ │
│ │ │ │
│ │ Each partition gets: │ │
│ │ • RCU: 1000 / 10 = 100 RCU per partition │ │
│ │ • WCU: 500 / 10 = 50 WCU per partition │ │
│ │ │ │
│ │ If traffic exceeds partition's capacity: │ │
│ │ → ProvisionedThroughputExceededException │ │
│ │ → Throttling (HTTP 400) │ │
│ └──────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
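The unit definitions above reduce to simple arithmetic. A small sizing helper (plain Python; the sample workload numbers are illustrative):

import math

def required_capacity(item_kb: float, reads_per_sec: float,
                      writes_per_sec: float,
                      strongly_consistent: bool = False) -> tuple:
    # Back-of-the-envelope RCU/WCU sizing from the definitions above
    rcu_per_read = math.ceil(item_kb / 4)       # 4 KB per RCU
    if not strongly_consistent:
        rcu_per_read /= 2                       # eventual reads cost half
    wcu_per_write = math.ceil(item_kb / 1)      # 1 KB per WCU
    return (math.ceil(reads_per_sec * rcu_per_read),
            math.ceil(writes_per_sec * wcu_per_write))

print(required_capacity(item_kb=2, reads_per_sec=500, writes_per_sec=100))
# -> (250, 200): 500 eventual reads/sec of a 2 KB item needs 250 RCU,
#    and 100 writes/sec of the same item needs 200 WCU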
The Hot Partition Problem
┌────────────────────────────────────────────────────────────┐
│ HOT PARTITION PROBLEM │
└────────────────────────────────────────────────────────────┘
Table: Products
Total Capacity: 1000 WCU across 10 partitions
Per-Partition: 100 WCU each
Scenario: Product launch (iPhone 15)
┌──────────────────────────────────────────┐
│ │
│ Partition 1: productId="iphone-15" │
│ ↓ 500 writes/sec │
│ Capacity: 100 WCU │
│ Utilization: 500% ✗ THROTTLED! │
│ │
│ Partition 2-10: Other products │
│ ↓ 10 writes/sec each │
│ Capacity: 100 WCU each │
│ Utilization: 10% (idle) │
│ │
└──────────────────────────────────────────┘
Result:
• Table has 1000 WCU available
• Only using 590 WCU total
• But Partition 1 is throttled!
This is the "hot partition" problem.
Even with plenty of total capacity, uneven
distribution causes throttling.
Adaptive Capacity
DynamoDB’s solution to hot partitions:
┌────────────────────────────────────────────────────────────┐
│ ADAPTIVE CAPACITY │
├────────────────────────────────────────────────────────────┤
│ │
│ Automatic Capacity Redistribution: │
│ │
│ BEFORE Adaptive Capacity: │
│ ┌────────────────────────────────────────────┐ │
│ │ Partition 1: 100 WCU (fixed) │ │
│ │ Partition 2: 100 WCU (fixed) │ │
│ │ ... │ │
│ │ Partition 10: 100 WCU (fixed) │ │
│ │ │ │
│ │ Total: 1000 WCU │ │
│ │ Hot partition → Throttled │ │
│ └────────────────────────────────────────────┘ │
│ │
│ AFTER Adaptive Capacity: │
│ ┌────────────────────────────────────────────┐ │
│ │ Partition 1: 300 WCU (boosted) ✓ │ │
│ │ Partition 2: 90 WCU │ │
│ │ Partition 3: 90 WCU │ │
│ │ ... │ │
│ │ Partition 10: 90 WCU │ │
│ │ │ │
│ │ Total: 1000 WCU │ │
│ │ Hot partition → Boosted from idle capacity │ │
│ └────────────────────────────────────────────┘ │
│ │
│ How it works: │
│ • DynamoDB monitors partition utilization │
│ • Detects hot partitions │
│ • Borrows capacity from underutilized partitions │
│ • Temporarily boosts hot partition capacity │
│ • Happens automatically in minutes │
│ │
│ Limitations: │
│ • Can't exceed table's total provisioned capacity │
│ • Best effort (not guaranteed) │
│ • Takes ~5-30 minutes to activate │
│ • Doesn't solve sustained hot partitions │
│ │
└────────────────────────────────────────────────────────────┘
Burst Capacity
┌────────────────────────────────────────────────────────────┐
│ BURST CAPACITY │
├────────────────────────────────────────────────────────────┤
│ │
│ DynamoDB reserves unused capacity for bursts: │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Provisioned: 100 WCU │ │
│ │ │ │
│ │ If using only 50 WCU: │ │
│ │ • 50 WCU unused │ │
│ │ • DynamoDB saves 50 WCU for up to 5 minutes │ │
│  │ • Burst pool: 15,000 WCU-seconds (300s × 50)│          │
│ │ │ │
│ │ Traffic spike: │ │
│ │ • Burst to 150 WCU (50 over provisioned) │ │
│ │ • Use burst capacity pool │ │
│ │ • No throttling for short spikes │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Burst Capacity Pool: │
│ • Accumulates up to 300 seconds of unused capacity │
│ • Per partition │
│  • Helps handle short traffic spikes                      │
│ • Not guaranteed (best effort) │
│ │
└────────────────────────────────────────────────────────────┘
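Burst capacity behaves much like a token bucket that accrues up to 300 seconds of unused capacity. The model below is an analogy of mine, not AWS's published implementation:

class BurstBucket:
    # Token-bucket model of per-partition burst capacity (conceptual)
    def __init__(self, provisioned_per_sec: float, window_sec: int = 300):
        self.rate = provisioned_per_sec
        self.max_tokens = provisioned_per_sec * window_sec   # 300s of capacity
        self.tokens = 0.0

    def tick(self, consumed: float) -> bool:
        # Advance one second; return False if the request would be throttled
        self.tokens = min(self.max_tokens, self.tokens + self.rate)
        if consumed <= self.tokens:
            self.tokens -= consumed
            return True
        return False

bucket = BurstBucket(provisioned_per_sec=50)   # 50 WCU provisioned
for second in range(400):
    bucket.tick(25)          # sustained underuse builds burst headroom
print(bucket.tick(150))      # a brief spike well above 50 WCU -> True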
On-Demand Capacity Mode
┌────────────────────────────────────────────────────────────┐
│ ON-DEMAND CAPACITY MODE │
├────────────────────────────────────────────────────────────┤
│ │
│ Pay-per-request pricing: │
│ ┌──────────────────────────────────────────────┐ │
│ │ • No capacity planning required │ │
│ │ • Scales automatically to workload │ │
│ │ • Pay only for requests you make │ │
│ │ │ │
│ │ Pricing (example): │ │
│ │ • Write: $1.25 per million requests │ │
│ │ • Read: $0.25 per million requests │ │
│ │ │ │
│ │ (vs Provisioned: ~$0.47/WCU/month) │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Scaling: │
│ ┌──────────────────────────────────────────────┐ │
│ │ • Scales to 2x previous peak within 30 min │ │
│ │ • Can handle traffic spikes automatically │ │
│ │ • No throttling (unless DDoS protection) │ │
│ │ │ │
│ │ Throttling protection: │ │
│  │ • Default quota: 40,000 RCU per table        │          │
│  │ • Default quota: 40,000 WCU per table        │          │
│ │ • Per partition: 3,000 RCU, 1,000 WCU │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ When to use On-Demand: │
│ • Unpredictable workloads │
│ • New applications (unknown traffic) │
│ • Spiky traffic patterns │
│ • Development/testing │
│ │
│ When to use Provisioned: │
│ • Predictable traffic │
│ • Cost optimization (cheaper at consistent load) │
│ • Reserved capacity discounts │
│ │
└────────────────────────────────────────────────────────────┘
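Choosing between the modes is mostly arithmetic. A sketch using the example prices quoted above (actual prices vary by region; always check current pricing):

HOURS_PER_MONTH = 730

def provisioned_cost(rcu: int, wcu: int) -> float:
    # Example rates from this chapter, not current list prices
    return (rcu * 0.00013 + wcu * 0.00065) * HOURS_PER_MONTH

def on_demand_cost(reads_per_month: float, writes_per_month: float) -> float:
    return reads_per_month / 1e6 * 0.25 + writes_per_month / 1e6 * 1.25

# Steady workload: 1M writes/day, 5M reads/day (see Question 4 below)
print(f"provisioned: ${provisioned_cost(rcu=58, wcu=12):.2f}/month")   # ~$11
print(f"on-demand:   ${on_demand_cost(150e6, 30e6):.2f}/month")        # ~$75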
Key Takeaways
Chapter 2 Summary:
- Architecture Layers:
  - Request Router: Stateless, handles routing and throttling
  - Storage Nodes: Store data, handle replication
  - Control Plane: Manages partitions and scaling
- Consistent Hashing:
  - Minimizes data movement when nodes are added/removed
  - Uses a hash ring (0 to 2³²-1)
  - Virtual nodes ensure even distribution
  - Only ~K/(N+1) keys move when adding a node
- Partitioning:
  - Partition key hashed with MD5
  - Each partition limited to 10 GB, 3,000 RCU, 1,000 WCU
  - Automatic splitting when limits are exceeded
  - Choose high-cardinality partition keys
- Request Routing:
  - Writes go to the primary, then replicate to 2 replicas (W=2)
  - Eventually consistent reads served from any replica
  - Strongly consistent reads served from the primary only
  - Typical latency: 5-10 ms
- Capacity Management:
  - Provisioned mode: Fixed capacity, cheaper for predictable loads
  - On-demand mode: Auto-scaling, pay-per-request
  - Adaptive capacity: Redistributes capacity to hot partitions
  - Burst capacity: Handles short traffic spikes
- Hot Partitions:
  - Occur when traffic is uneven across partitions
  - Mitigated by adaptive capacity
  - Best solved with better partition key design
Interview Questions
Question 1: Explain how consistent hashing works in DynamoDB
Difficulty: Medium
Strong Answer:
Follow-up: What happens if the hash distribution is uneven?
Answer: Use virtual nodes (tokens). Each physical node gets 128+ positions on the ring, ensuring statistically even distribution.
1. THE PROBLEM:
Traditional hash (hash(key) % N) causes massive
data movement when nodes are added/removed.
With 1000 nodes, adding 1 more moves nearly ALL of the data!
2. CONSISTENT HASHING SOLUTION:
• Hash space forms a ring (0 to 2³²-1)
• Nodes are placed on the ring via hash
• Keys are placed on the ring via hash
• Keys belong to the next node clockwise
3. ADDING A NODE:
• New node placed on ring
• Only keys between new node and previous node move
• Other keys unaffected
• Minimal data movement: K/(N+1) keys
4. REMOVING A NODE:
• Node removed from ring
• Its keys move to next node clockwise
• Other nodes unaffected
5. VIRTUAL NODES:
• Each physical node has multiple positions (tokens)
• Ensures even distribution
• DynamoDB uses this internally
6. BENEFITS:
• Minimal data movement during scaling
• Load balanced across nodes
• Fault tolerance (replicas on different nodes)
Question 2: Why does DynamoDB separate the request router from storage nodes?
Difficulty: Medium
Strong Answer:
SEPARATION OF CONCERNS:
1. INDEPENDENT SCALING:
Request Router:
• CPU-bound (hashing, routing, auth)
• Stateless (can scale horizontally easily)
• Scale based on request rate
Storage Nodes:
• I/O-bound (disk reads/writes)
• Stateful (contains data)
• Scale based on data size and throughput
Benefit: Scale each layer independently!
2. MULTI-TENANCY:
• Router can handle many tables
• Storage nodes isolated per tenant
• Better resource utilization
• Noisy neighbor protection
3. OPERATIONAL SIMPLICITY:
• Storage nodes focused on durability
• Routers handle complex logic
• Easier to debug and monitor
• Cleaner failure domains
4. SECURITY:
• Router enforces IAM policies
• Storage nodes don't need credentials
• Attack surface reduced
5. PERFORMANCE:
• Router layer can cache metadata
• Storage optimized for disk I/O
• No coordination overhead
CONTRAST WITH DYNAMO:
Original Dynamo: P2P architecture, every node
does everything. Complex but more control.
DynamoDB: Layered architecture, simpler to
operate and scale, better for managed service.
Question 3: How does DynamoDB handle hot partitions?
Difficulty: Medium-Hard
Strong Answer:
HOT PARTITION: One partition receives disproportionate
traffic, causing throttling even when table has capacity.
CAUSES:
1. Poor partition key choice (e.g., status="active")
2. Viral content (celebrity tweet, viral video)
3. Time-based access (everyone queries "today")
DYNAMODB SOLUTIONS:
1. ADAPTIVE CAPACITY (Automatic):
• Monitors partition utilization
• Borrows unused capacity from cold partitions
• Boosts hot partition capacity
• Activates in 5-30 minutes
• Limitation: Can't exceed table total capacity
2. BURST CAPACITY (Automatic):
• Accumulates unused capacity (up to 5 minutes)
• Uses burst pool for short spikes
• Per-partition burst pool
• Helps with brief traffic spikes
3. PARTITION SPLITTING (Automatic):
• If partition exceeds 10GB or sustained high traffic
• DynamoDB splits partition
• Creates two partitions with half the key range
• Distributes load across two nodes
APPLICATION-LEVEL SOLUTIONS:
4. WRITE SHARDING:
Add random suffix to partition key
PK: productId + random(1-10)
Distributes writes across 10 partitions
On read: Query all 10 shards, merge results
5. CACHING (DAX):
• Use DynamoDB Accelerator (DAX)
• In-memory cache for reads
• Microsecond latency
• Reduces load on hot partitions
6. BETTER PARTITION KEY DESIGN:
• Use high-cardinality keys (userId, not status)
• Avoid time-based keys as PK
• Use composite keys to spread load
EXAMPLE:
Bad: PK=date (everyone queries today)
Good: PK=userId, SK=date
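The read side of write sharding (point 4 above) has to fan out across every shard and merge. A hedged boto3 sketch, assuming the productId + suffix scheme and a hypothetical Products table:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Products")   # hypothetical table

def read_sharded(product_id: str, num_shards: int = 10) -> list:
    # Query every shard of a write-sharded key and merge the results
    items = []
    for shard in range(1, num_shards + 1):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{product_id}#{shard}")
        )
        items.extend(resp["Items"])
    return items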
Question 4: Compare provisioned vs on-demand capacity modes
Difficulty: Easy-Medium
Strong Answer:
┌────────────────────────────────────────────────┐
│ PROVISIONED CAPACITY │
├────────────────────────────────────────────────┤
│ Pricing: │
│ • Pay for provisioned RCU/WCU per hour │
│ • $0.00065/WCU/hour, $0.00013/RCU/hour (approx)│
│ • Cheaper for sustained consistent load │
│ │
│ Scaling: │
│ • Manual scaling or auto-scaling │
│ • Unlimited scale-ups; scale-downs limited     │
│   (4 per day, plus more after 4-hour gaps)     │
│ • Must provision capacity upfront │
│ │
│ Throttling: │
│ • Throttled if exceeding provisioned capacity │
│ • Handle ProvisionedThroughputExceeded errors  │
│ │
│ Best for: │
│ • Predictable traffic │
│ • Cost optimization │
│ • Production workloads with known patterns │
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│ ON-DEMAND CAPACITY │
├────────────────────────────────────────────────┤
│ Pricing: │
│ • Pay per request │
│ • $1.25/million WRU, $0.25/million RRU (approx)│
│ • ~5x more expensive than provisioned │
│ │
│ Scaling: │
│ • Automatic instant scaling │
│ • No capacity planning needed │
│ • Scales to 2x previous peak in 30 min │
│ │
│ Throttling: │
│ • Minimal throttling (DDoS protection only) │
│ • Max 40K RCU/WCU per table │
│ │
│ Best for: │
│ • Unpredictable workloads │
│ • Spiky traffic │
│ • New applications │
│ • Dev/test environments │
└────────────────────────────────────────────────┘
COST COMPARISON EXAMPLE:
Workload: 1M writes/day, 5M reads/day (steady)
Provisioned:
• WCU: 12 (1M / 86,400 sec)
• RCU: 58 (5M / 86,400 sec)
• Cost: ~$11/month at the example rates above
On-Demand:
• Writes: 30M/month × $1.25/M ≈ $37.50
• Reads: 150M/month × $0.25/M ≈ $37.50
• Cost: ~$75/month
For LOW or spiky traffic: On-demand cheaper
(provisioned capacity would sit idle)
For HIGH sustained traffic: Provisioned cheaper
SWITCHING:
• Can switch once per 24 hours
• Choose based on traffic patterns
Question 5: Design a partition key strategy for a social media app
Difficulty: Hard
Scenario: Design a partition key strategy for a Twitter-like app with:
- 100M users
- Users post tweets
- Users view their timeline (tweets from accounts they follow)
- Popular users have millions of followers
REQUIREMENTS ANALYSIS:
1. Access Patterns:
• Get user's tweets: Frequent
• Get user timeline: Very frequent
• Post tweet: Moderate
• Follow/unfollow: Infrequent
2. Scale Considerations:
• 100M users → Need high cardinality PK
• Viral tweets → Hot partition risk
• Celebrity users → Massive fan-out
PARTITION KEY STRATEGIES:
OPTION 1: User-Centric (Single Table Design)
┌──────────────────────────────────────────┐
│ PK: userId │
│ SK: TYPE#timestamp │
│ │
│ User Profile: │
│ PK: USER#alice │
│ SK: METADATA │
│ │
│ User's Tweets: │
│ PK: USER#alice │
│ SK: TWEET#2024-01-15T10:00:00#id123 │
│ │
│ User Timeline (fan-out on write): │
│ PK: USER#alice#TIMELINE │
│ SK: TWEET#2024-01-15T10:00:00#id456 │
│ │
│ Pros: │
│ + Single query for user's tweets │
│ + Single query for timeline │
│ + No hot partitions (userId distributed) │
│ │
│ Cons: │
│ - Fan-out write expensive for celebrities│
│ - Timeline storage duplicates tweet data │
└──────────────────────────────────────────┘
OPTION 2: Hybrid Approach (Handle Celebrities)
┌──────────────────────────────────────────┐
│ Regular users: Fan-out on write │
│ • Tweet written to followers' timelines │
│ • Fast timeline reads │
│ │
│ Celebrities (>10K followers): │
│ • NO fan-out │
│ • Timeline: Pull from followed users │
│ • Use DAX cache for celebrity tweets │
│ │
│ Timeline Query (alice follows celebrity):│
│ 1. Read alice's timeline (fan-out tweets)│
│ 2. Query celebrity tweets separately │
│ 3. Merge and sort by timestamp │
│ 4. Cache in DAX │
└──────────────────────────────────────────┘
OPTION 3: Sharded Tweets (Avoid Hot Partitions)
┌──────────────────────────────────────────┐
│ For viral tweets, shard the tweet: │
│ │
│ Original tweet: │
│ PK: TWEET#viral123 │
│ Gets hot (millions of reads) │
│ │
│ Sharded approach: │
│ PK: TWEET#viral123#SHARD#{rand(1-100)} │
│ │
│ Read path: │
│ • Pick random shard: SHARD#42 │
│ • All shards have same content │
│ • Distributes load across 100 partitions │
│ │
│ Benefit: 100x capacity for viral content │
└──────────────────────────────────────────┘
RECOMMENDED STRATEGY:
Tables:
1. Users table:
PK: userId
(User profiles, settings)
2. Tweets table:
PK: userId
SK: tweetId
(User's tweets, easy to query)
3. Timeline table:
PK: userId
SK: timestamp#tweetId
(Fan-out on write for regular users)
4. Follows table:
PK: userId
SK: FOLLOWS#followedUserId
(Track relationships)
Fan-out Strategy:
• If follower count < 10K: Fan-out on write
• If follower count ≥ 10K: Pull model
• Use DynamoDB Streams + Lambda for fan-out
• Cache celebrity tweets in DAX
This balances read performance, write cost,
and handles hot partitions gracefully.
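The fan-out pipeline in the recommended strategy would hang off DynamoDB Streams. A heavily simplified Lambda handler sketch; the table names, attribute names, 10K threshold, and the follower lookup are all assumptions carried over from the answer above:

import boto3

timeline = boto3.resource("dynamodb").Table("Timeline")   # hypothetical table
FANOUT_LIMIT = 10_000   # above this, fall back to the pull model

def get_followers(author_id: str) -> list:
    # Assumed helper: look up followers (e.g. via a GSI on the Follows table)
    return []   # stubbed for this sketch

def handler(event, context):
    # Triggered by the Tweets table's stream; fans new tweets out to timelines
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        tweet = record["dynamodb"]["NewImage"]
        author = tweet["userId"]["S"]
        followers = get_followers(author)
        if len(followers) >= FANOUT_LIMIT:
            continue   # celebrity: readers pull these tweets instead
        with timeline.batch_writer() as batch:
            for follower_id in followers:
                batch.put_item(Item={
                    "userId": follower_id,
                    "sk": f"{tweet['createdAt']['S']}#{tweet['tweetId']['S']}",
                    "authorId": author,
                })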
What’s Next?
In Chapter 3: Data Model and Access Patterns, we’ll explore:
- Tables, items, and attributes in detail
- Primary keys: partition key vs composite key
- Secondary indexes (GSI and LSI)
- Data modeling patterns and anti-patterns
Continue to Chapter 3
Master DynamoDB’s data model and learn how to design effective schemas