Chapter 2: Architecture and Partitioning
DynamoDB's architecture represents a sophisticated blend of distributed systems principles: consistent hashing for data distribution, a request router layer for routing, storage nodes for durability, and automatic partitioning for scale. Understanding this architecture is crucial for effective data modeling and performance optimization.

Chapter Goals:
- Understand DynamoDBβs high-level architecture
- Master consistent hashing and the hash ring concept
- Learn how data is partitioned and distributed
- Grasp request routing and storage node design
- Understand auto-scaling and adaptive capacity
High-Level Architecture
System Components
DynamoDB's architecture consists of several key layers:

DYNAMODB ARCHITECTURE

  Client SDK
      |
      | HTTPS / API call
      v
  REQUEST ROUTING LAYER
    - Authentication & authorization (IAM)
    - Request validation
    - Partition key hashing
    - Routing to the correct storage node
    - Throttling & request metering
      |
      | routes to storage nodes
      v
  STORAGE NODE LAYER
    Storage Node 1        Storage Node 2        Storage Node 3   ...
    (Partitions P1, P2)   (Partitions P3, P4)   (Partitions P5, P6)
      |
      | replication (3 AZs)
      v
  REPLICATION & PERSISTENCE LAYER
    AZ-A: SSD store    AZ-B: SSD store    AZ-C: SSD store
    - B-tree storage on SSD
    - Write-ahead logging
    - Synchronous replication across 3 AZs

  CONTROL PLANE (runs alongside the data plane)
    - Partition management & splitting
    - Auto-scaling decisions
    - Health monitoring
    - Backup and restore
    - Global table coordination
Layer Responsibilities
- Request Router Layer
- Storage Node Layer
- Control Plane
Request Router Responsibilities

Key Point: The request router is stateless and can scale independently of the storage nodes. This lets DynamoDB handle traffic spikes without affecting storage.
REQUEST ROUTER FUNCTIONS

1. Authentication & authorization
   - Validate IAM credentials
   - Check resource permissions
   - Enforce security policies
2. Request validation
   - Validate API parameters
   - Check payload size limits
   - Validate data types
3. Partition key hashing
   - Hash the partition key with MD5
   - Determine the partition from the hash
   - Locate the storage nodes for that partition
4. Request routing
   - Route to the primary storage node
   - Handle replica routing for reads
   - Retry on failures
5. Capacity management
   - Track consumed capacity
   - Apply throttling (provisioned mode)
   - Return throttling errors (HTTP 400)
6. Metrics & monitoring
   - Emit CloudWatch metrics
   - Track latency and errors
   - Log for audit trails
Storage Node Responsibilities

Key Point: Each storage node manages multiple partitions and handles replication to ensure durability and availability.
STORAGE NODE FUNCTIONS

1. Data storage
   - Store items in a B-tree index
   - Maintain sort order by sort key
   - Store on SSD for fast access
2. Replication
   - Leader-based replication
   - Synchronous writes to 2 replicas
   - Cross-AZ replication
   - Quorum: W=2, R=1 or R=2
3. Write-ahead logging
   - Log writes before applying them
   - Enable crash recovery
   - Support point-in-time recovery
4. Index maintenance
   - Update the B-tree on writes
   - Maintain secondary index entries
   - Asynchronous GSI updates
5. Consistency
   - Eventually consistent: read from any replica
   - Strongly consistent: read from the leader
   - Last-write-wins conflict resolution
6. Streams (if enabled)
   - Capture change events
   - Store them in DynamoDB Streams
   - 24-hour retention
Control Plane Responsibilities

Key Point: The control plane operates asynchronously and handles all operational tasks, keeping the data plane (request router + storage nodes) focused on serving requests.
CONTROL PLANE FUNCTIONS

1. Partition management
   - Monitor partition metrics
   - Trigger partition splits
   - Balance load across nodes
   - Handle partition migrations
2. Auto-scaling
   - Monitor capacity utilization
   - Scale provisioned capacity up or down
   - Adaptive capacity allocation
   - Burst capacity management
3. Health monitoring
   - Detect node failures
   - Trigger failover
   - Replace unhealthy nodes
   - Ensure the replication factor
4. Backup & restore
   - Continuous backups
   - Point-in-time recovery
   - On-demand backups
   - Cross-region backup copies
5. Global tables
   - Multi-region replication
   - Conflict resolution
   - Region failover
   - Consistency management
6. Table operations
   - Create/delete tables
   - Update table settings
   - Add/remove indexes
   - Change capacity modes
Consistent Hashing
The Partitioning Problem
Before diving into consistent hashing, understand the problem it solves.

WHY PARTITIONING IS NECESSARY

Single-node limitations:
  - Storage: ~10 TB max (practical limit)
  - Memory: ~1 TB max
  - Throughput: ~50k requests/sec
  - Network: ~10 Gbps

DynamoDB needs:
  - Storage: effectively unlimited (petabytes and beyond)
  - Memory: terabytes across the cluster
  - Throughput: millions of requests/sec
  - Network: aggregate bandwidth across many nodes

Solution: partition data across many nodes.
Naive Hashing Problems
Simple hash-based partitioning has a major flaw.

NAIVE APPROACH: hash(key) % N
Example with 3 nodes:
  hash("user-123") = 456  →  456 % 3 = 0  →  Node 0
  hash("user-456") = 789  →  789 % 3 = 0  →  Node 0
  hash("user-789") = 234  →  234 % 3 = 0  →  Node 0
PROBLEM: adding or removing a node

Add Node 3 (now N = 4):
  hash("user-123") = 456  →  456 % 4 = 0  →  Node 0  (same)
  hash("user-456") = 789  →  789 % 4 = 1  →  Node 1  (moved!)
  hash("user-789") = 234  →  234 % 4 = 2  →  Node 2  (moved!)

Result: adding one node reshuffles keys on ALL nodes.
With hash(key) % N, roughly N/(N+1) of all keys change owner when a node is
added; with 1000 nodes, that is ~99.9% of the data!
This causes massive data migration and downtime.
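To see the scale of the problem concretely, the sketch below counts how many keys change owner when one node is added under modulo placement (the key set and hash function are arbitrary choices for illustration, not DynamoDB internals):

```python
import hashlib

def owner(key: str, num_nodes: int) -> int:
    """Naive placement: hash the key, then take it modulo the node count."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"user-{i}" for i in range(100_000)]

for n in (3, 10, 100, 1000):
    moved = sum(owner(k, n) != owner(k, n + 1) for k in keys)
    print(f"{n} -> {n + 1} nodes: {moved / len(keys):.1%} of keys change owner")
# With modulo placement, almost every key changes owner as soon as N changes.
```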
Consistent Hashing Solution
Consistent hashing minimizes data movement when nodes are added or removed.

CONSISTENT HASHING RING

  Hash space: 0 to 2^32 - 1 (4,294,967,295)
  The positions are arranged in a ring: after 2^32 - 1 the space wraps back to 0
  (picture a circle with 0° at the top and 360° wrapping back to 0°).
  Both nodes and keys are hashed onto positions on this ring.
How Consistent Hashing Works
Step 1: Hash Nodes onto Ring

Place the nodes on the hash ring.

3 storage nodes: A, B, C

Node placement:
  hash("Node-A") = 1000  →  position 1000
  hash("Node-B") = 2000  →  position 2000
  hash("Node-C") = 3000  →  position 3000

Ring visualization (clockwise from 0):

            0
   3000 ----+---- 1000
   Node-C        Node-A
       \          /
        \        /
          2000
         Node-B

Each node is responsible for the range from the previous node (moving
clockwise) up to and including its own position:
  Node-A: (3000, 1000]
  Node-B: (1000, 2000]
  Node-C: (2000, 3000]
Step 2: Hash Keys onto Ring

Determine which node stores each key.

Keys to store:
  hash("user-123") = 500   →  between 0 and 1000     →  assign to Node-A
  hash("user-456") = 1500  →  between 1000 and 2000  →  assign to Node-B
  hash("user-789") = 2500  →  between 2000 and 3000  →  assign to Node-C

Ring with keys:

            0
        500 • (user-123)
   3000 ----+---- 1000
   Node-C        Node-A
       \          /
   2500 •        • 1500
   (user-789)    (user-456)
        \        /
          2000
         Node-B

Rule: a key goes to the next node clockwise.
Step 3: Adding a Node

Minimal data movement when adding nodes.

Add Node-D at position 1500:

Before:
  Node-A: (3000, 1000]  →  owns 500  (user-123)
  Node-B: (1000, 2000]  →  owns 1500 (user-456)
  Node-C: (2000, 3000]  →  owns 2500 (user-789)

After adding Node-D at 1500:
  Node-A: (3000, 1000]  →  still owns 500  (user-123)  ✓
  Node-D: (1000, 1500]  →  takes 1500 (user-456) from Node-B
  Node-B: (1500, 2000]  →  keeps its other keys
  Node-C: (2000, 3000]  →  still owns 2500 (user-789)  ✓

Result: ONLY keys in the range (1000, 1500] moved: 1 of 3 keys in this
example, and about 1/(N+1) of all keys in general (here 1/4).

Ring after the addition:

            0
        500 • (user-123)
   3000 ----+---- 1000
   Node-C        Node-A
       \          |
   2500 •         • 1500  Node-D (user-456)
   (user-789)     |
        \         |
          2000
         Node-B

Advantage: only about K/(N+1) keys move when adding a node, instead of
nearly all K keys with naive modulo hashing.
Step 4: Removing a Node

Minimal data movement when removing nodes.

Remove Node-B at position 2000:

Before:
  Node-A: (3000, 1000]
  Node-B: (1000, 2000]  →  owns 1500 (user-456)
  Node-C: (2000, 3000]

After removing Node-B:
  Node-A: (3000, 1000]  →  unchanged
  Node-C: (1000, 3000]  →  takes over Node-B's keys

Only the keys in Node-B's range move (to Node-C).
Node-A and all other nodes are unaffected.

Ring after the removal:

            0
        500 • (user-123)
   3000 ----+---- 1000
   Node-C        Node-A
       \          /
   2500 •        • 1500
   (user-789)    (user-456)
        \        /
      (Node-B removed)
Virtual Nodes (Tokens)
Real implementations use virtual nodes for better load distribution.

PROBLEM with a single token per node:

  If nodes land unevenly on the ring:
    Node-A: 0° to 5° of the ring (tiny range)
    Node-B: 5° to 90° of the ring (huge range)

  Result: uneven data distribution!

SOLUTION: virtual nodes (tokens)

  Each physical node gets multiple positions (tokens) on the ring.

  Node-A has 128 tokens:
    hash("Node-A-token-0")   = 100
    hash("Node-A-token-1")   = 500
    hash("Node-A-token-2")   = 1200
    ...
    hash("Node-A-token-127") = 3900

  Node-B has 128 tokens:
    hash("Node-B-token-0") = 50
    hash("Node-B-token-1") = 300
    ...

  Result: a much more even distribution; each node ends up with ~1/N of the data.

DynamoDB uses this approach internally to ensure even data distribution
across storage nodes.
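A toy ring with virtual nodes can be sketched in a few lines; this is an illustration of the technique, not DynamoDB's actual implementation, and the token count and hash function are arbitrary choices:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a position on the 0..2^32-1 ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes=(), tokens_per_node=128):
        self.tokens_per_node = tokens_per_node
        self._ring = []       # sorted token positions
        self._owner = {}      # token position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node gets many positions (tokens) for an even spread.
        for i in range(self.tokens_per_node):
            pos = ring_hash(f"{node}-token-{i}")
            self._owner[pos] = node
            bisect.insort(self._ring, pos)

    def node_for(self, key: str) -> str:
        # A key belongs to the first token clockwise from its hash (wrapping to 0).
        pos = ring_hash(key)
        idx = bisect.bisect_right(self._ring, pos) % len(self._ring)
        return self._owner[self._ring[idx]]

ring = ConsistentHashRing(["Node-A", "Node-B", "Node-C"])
before = {k: ring.node_for(k) for k in (f"user-{i}" for i in range(10_000))}
ring.add_node("Node-D")
moved = sum(ring.node_for(k) != v for k, v in before.items())
print(f"{moved / len(before):.1%} of keys moved after adding Node-D")  # roughly 1/4
```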
Partition Key and Data Distribution
Partition Key Hashing
PARTITION KEY DETERMINATION

Table schema:
  Table: Users
  Partition key: userId
  Sort key: (none)

Write operation:
  PutItem({
    userId: "user-123",
    name: "Alice",
    email: "alice@example.com"
  })

Step 1: Extract the partition key
  userId = "user-123"

Step 2: Hash the partition key value
  hash = MD5("user-123")
       = 3e3e9f56c8f7e04b...
       = 1042345678 (as a 32-bit integer)

Step 3: Map to a partition
  partition = hash % num_partitions
            = 1042345678 % 1000
            = Partition 678

Step 4: Route to the storage nodes hosting Partition 678
  Primary:   Node-42  (AZ-A)
  Replica 1: Node-108 (AZ-B)
  Replica 2: Node-251 (AZ-C)

Step 5: Write to the replicas (quorum W=2)
  Write to Node-42  (AZ-A)  ✓
  Write to Node-108 (AZ-B)  ✓
  Return success (2/3 acks received)
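The real hash function and partition map are internal to DynamoDB; the sketch below simply mirrors the example steps above (the MD5 hash, the modulo step, and the replica map are illustrative assumptions, not a published spec):

```python
import hashlib

NUM_PARTITIONS = 1000

# Hypothetical partition -> replica map (kept by the router's metadata service).
PARTITION_REPLICAS = {
    678: ["Node-42 (AZ-A)", "Node-108 (AZ-B)", "Node-251 (AZ-C)"],
}

def route(partition_key: str):
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    hash32 = int(digest, 16) % (2 ** 32)        # treat the digest as a 32-bit integer
    partition = hash32 % NUM_PARTITIONS         # map the hash to a partition
    replicas = PARTITION_REPLICAS.get(partition, [])
    return partition, replicas

print(route("user-123"))
```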
Partition Size Limits
Each partition has hard limits. When a partition exceeds them, DynamoDB automatically splits it:
PARTITION LIMITS (per partition)

  Storage: 10 GB maximum
  RCU:     3,000 read capacity units
  WCU:     1,000 write capacity units

  Exceeding these triggers an automatic partition split.

Before the split:
  Partition P1
    Range: (0, 1000]
    Size: 10.2 GB (exceeds the limit!)
    Storage node: Node-A

After the split:
  Partition P1-A
    Range: (0, 500]
    Size: 5.1 GB
    Storage node: Node-A

  Partition P1-B
    Range: (500, 1000]
    Size: 5.1 GB
    Storage node: Node-B

Note: once split, partitions NEVER merge back!
Choosing Good Partition Keys
PARTITION KEY BEST PRACTICES

✓ HIGH CARDINALITY
  Good: userId (millions of unique values)
  Bad:  status (only "active" or "inactive")
  Reason: many unique values spread data evenly across partitions.

✓ UNIFORM ACCESS PATTERN
  Good: userId (random access)
  Bad:  date (everyone queries today's date)
  Reason: avoid hot partitions where all traffic hits one partition.

✓ KNOWN ACCESS PATTERNS
  Design the partition key based on how you will query the data.
  Examples:
    - Access by userId        →  PK: userId
    - Access by orderId       →  PK: orderId
    - Access by country+date  →  PK: country, SK: date

✓ AVOID TIME-BASED PARTITION KEYS
  Bad: timestamp or date as the PK.
  Problem:
    - All writes go to the "current time" partition
    - Creates a hot partition
    - Causes throttling
  Solution:
    - Use userId as the PK and the timestamp as the SK
    - Or use a composite key with a random suffix (see the write-sharding sketch below)
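As referenced above, a write-sharding sketch for the random-suffix approach might look like the following with boto3 (the table name, key attribute names, and shard count are hypothetical; the table is assumed to use "pk" as its partition key and "sk" as its sort key):

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 10
table = boto3.resource("dynamodb").Table("ProductEvents")  # hypothetical table

def write_event(product_id: str, timestamp: str, payload: dict) -> None:
    # Spread writes for one hot product across NUM_SHARDS partition key values.
    shard = random.randint(1, NUM_SHARDS)
    table.put_item(Item={"pk": f"{product_id}#{shard}", "sk": timestamp, **payload})

def read_events(product_id: str) -> list:
    # Reads fan out to every shard and merge the results.
    items = []
    for shard in range(1, NUM_SHARDS + 1):
        resp = table.query(KeyConditionExpression=Key("pk").eq(f"{product_id}#{shard}"))
        items.extend(resp["Items"])
    return items
```

The trade-off is the usual one: writes scale across NUM_SHARDS partitions, while reads pay for NUM_SHARDS queries plus a client-side merge.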
Request Routing
Write Path
WRITE REQUEST FLOW

1. Client sends a PutItem request
     Client  →  PutItem(userId="user-123", name="Alice")
     (over HTTPS/TLS to the DynamoDB endpoint)

2. Request router (endpoint)
     - Authenticate IAM credentials
     - Authorize access to the table
     - Validate request parameters
     - Check capacity (if provisioned)

3. Hash the partition key to find the partition
     hash("user-123") = 1042345678  →  Partition 678

4. Look up the storage nodes for Partition 678
     Primary:   Node-42  (AZ-A)
     Replica 1: Node-108 (AZ-B)
     Replica 2: Node-251 (AZ-C)

5. Forward the write to the primary (Node-42), which replicates it to
   Node-108 (AZ-B) and Node-251 (AZ-C)

6. Each node writes to its write-ahead log (WAL) first, then applies the
   write to its B-tree storage on SSD

7. Each node sends an ACK back

8. Wait for quorum (W=2): once 2 ACKs have arrived, return success to the client

Latency breakdown (typical):
  - Request routing:          ~1 ms
  - Network to storage node:  ~1 ms
  - Write to WAL + replicate: ~2-3 ms
  - B-tree update:            ~1 ms
  - Total:                    ~5-7 ms (p50)
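From the client's perspective, that entire flow is a single SDK call; a minimal boto3 sketch (the table name is hypothetical):

```python
import boto3

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table name

# One call: the SDK signs the request, the request router authenticates it,
# hashes userId to find the partition, forwards the write to the partition's
# primary node, and returns once the W=2 quorum has acknowledged.
table.put_item(Item={
    "userId": "user-123",
    "name": "Alice",
    "email": "alice@example.com",
})
```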
Read Path (Eventually Consistent)
EVENTUALLY CONSISTENT READ FLOW

1. Client sends a GetItem request
     Client  →  GetItem(userId="user-123"), ConsistentRead=false (default)

2. Request router
     - Authenticate
     - Hash the partition key  →  Partition 678
     - Look up the storage nodes

3. Partition 678 replicas:
     - Node-42  (AZ-A): primary
     - Node-108 (AZ-B): replica 1
     - Node-251 (AZ-C): replica 2

4. Eventually consistent read: route to ANY replica (load balanced),
   e.g., pick Node-108 at random

5. Node-108 reads the item from its B-tree and returns it to the client

Latency: ~3-5 ms (p50)
Cost: 0.5 RCU per 4 KB read

Note: the read may return stale data if there is replication lag,
but it is faster and cheaper than a strongly consistent read.
Read Path (Strongly Consistent)
STRONGLY CONSISTENT READ FLOW

1. Client sends a GetItem request
     Client  →  GetItem(userId="user-123"), ConsistentRead=true

2. Request router
     - Authenticate
     - Hash the partition key  →  Partition 678
     - Look up the storage nodes

3. Strongly consistent read: MUST route to the PRIMARY replica only
     Node-42 (primary) reads the item from its B-tree and returns it to the client

Latency: ~5-10 ms (p50), higher than eventual
Cost: 1 RCU per 4 KB read (double the cost)
Guarantee: ALWAYS returns the latest written data
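Both read paths are selected by a single request flag; a minimal boto3 sketch assuming the same hypothetical Users table:

```python
import boto3

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table name

# Eventually consistent (default): may be served by any replica; 0.5 RCU per 4 KB.
eventual = table.get_item(Key={"userId": "user-123"})

# Strongly consistent: served by the partition's leader; 1 RCU per 4 KB.
strong = table.get_item(Key={"userId": "user-123"}, ConsistentRead=True)

print(eventual.get("Item"), strong.get("Item"))
```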
Auto-Scaling and Adaptive Capacity
Provisioned Capacity Mode
PROVISIONED CAPACITY MODEL

Table-level capacity:
  RCU (read capacity units):  1,000
  WCU (write capacity units): 500

  1 RCU = 1 strongly consistent read/sec (or 2 eventually consistent reads)
          for items up to 4 KB
  1 WCU = 1 write/sec for items up to 1 KB

Partition-level distribution:
  The table has 10 partitions, so each partition gets:
    - RCU: 1,000 / 10 = 100 RCU per partition
    - WCU:   500 / 10 =  50 WCU per partition

  If traffic exceeds a partition's share:
    → ProvisionedThroughputExceededException
    → Throttling (HTTP 400)
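The unit definitions above reduce to simple arithmetic; a small helper based on the 4 KB read / 1 KB write rounding rules:

```python
import math

def rcu_for(item_size_kb: float, reads_per_sec: float, strongly_consistent: bool) -> float:
    """Read capacity: 1 RCU = one strongly consistent 4 KB read/sec (0.5 for eventual)."""
    units_per_read = math.ceil(item_size_kb / 4)
    rcu = units_per_read * reads_per_sec
    return rcu if strongly_consistent else rcu / 2

def wcu_for(item_size_kb: float, writes_per_sec: float) -> float:
    """Write capacity: 1 WCU = one 1 KB write/sec (size rounds up to the next KB)."""
    return math.ceil(item_size_kb) * writes_per_sec

# Example: 6 KB items, 100 strongly consistent reads/sec and 20 writes/sec.
print(rcu_for(6, 100, strongly_consistent=True))   # 200 RCU
print(wcu_for(6, 20))                               # 120 WCU
```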
The Hot Partition Problem
HOT PARTITION PROBLEM

Table: Products
Total capacity: 1,000 WCU across 10 partitions
Per partition: 100 WCU each

Scenario: product launch (iPhone 15)

  Partition 1: productId="iphone-15"
    → 500 writes/sec
    Capacity: 100 WCU
    Utilization: 500% → THROTTLED!

  Partitions 2-10: other products
    → 10 writes/sec each
    Capacity: 100 WCU each
    Utilization: 10% (idle)

Result:
  - The table has 1,000 WCU available
  - Only 590 WCU are being used in total
  - But Partition 1 is throttled!

This is the "hot partition" problem: even with plenty of total capacity,
uneven distribution causes throttling.
Adaptive Capacity
DynamoDB's solution to hot partitions:

ADAPTIVE CAPACITY

Automatic capacity redistribution.

BEFORE adaptive capacity:
  Partition 1:  100 WCU (fixed)
  Partition 2:  100 WCU (fixed)
  ...
  Partition 10: 100 WCU (fixed)
  Total: 1,000 WCU
  Hot partition → throttled

AFTER adaptive capacity:
  Partition 1:  300 WCU (boosted) ✓
  Partition 2:   90 WCU
  Partition 3:   90 WCU
  ...
  Partition 10:  90 WCU
  Total: 1,000 WCU
  Hot partition → boosted using idle capacity

How it works:
  - DynamoDB monitors partition utilization
  - Detects hot partitions
  - Borrows capacity from underutilized partitions
  - Temporarily boosts the hot partition's capacity
  - Happens automatically within minutes

Limitations:
  - Can't exceed the table's total provisioned capacity
  - Best effort (not guaranteed)
  - Takes ~5-30 minutes to activate
  - Doesn't solve sustained hot partitions
Burst Capacity
BURST CAPACITY

DynamoDB reserves unused capacity for bursts.

  Provisioned: 100 WCU

  If only 50 WCU are being used:
    - 50 WCU go unused every second
    - DynamoDB banks unused capacity for up to 5 minutes (300 seconds)
    - Burst pool: up to 300 s × 50 WCU = 15,000 write units

  Traffic spike:
    - Burst to 150 WCU (50 over provisioned)
    - The spike draws down the burst capacity pool
    - No throttling for short spikes

Burst capacity pool:
  - Accumulates up to 300 seconds of unused capacity
  - Maintained per partition
  - Helps handle short traffic spikes
  - Not guaranteed (best effort)
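The pool behaves roughly like a token bucket capped at 300 seconds of the provisioned rate; the toy simulation below illustrates that mental model (an assumption-level sketch, not DynamoDB's exact algorithm):

```python
def simulate_burst(provisioned_wcu: float, demand_per_sec: list) -> None:
    """Toy model: unused capacity accrues (capped at 300 s worth) and absorbs spikes."""
    pool_cap = 300 * provisioned_wcu   # up to 300 seconds of unused capacity
    pool = 0.0
    for second, demand in enumerate(demand_per_sec):
        available = provisioned_wcu + pool
        if demand <= available:
            # Unused capacity tops the pool up; spikes above provisioned drain it.
            pool = min(pool_cap, pool + provisioned_wcu - demand)
            status = "ok"
        else:
            pool = 0.0
            status = "THROTTLED"
        print(f"t={second:>2}s  demand={demand:>4}  pool={pool:>7.0f}  {status}")

# 50 WCU provisioned: quiet for 10 s, then a 150 WCU spike that outruns the pool.
simulate_burst(50, [20] * 10 + [150] * 5)
```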
On-Demand Capacity Mode
ON-DEMAND CAPACITY MODE

Pay-per-request pricing:
  - No capacity planning required
  - Scales automatically with the workload
  - Pay only for the requests you make

  Pricing (example):
    - Write: $1.25 per million requests
    - Read:  $0.25 per million requests
    (vs provisioned: ~$0.47 per WCU per month)

Scaling:
  - Scales to 2x the previous peak within 30 minutes
  - Handles traffic spikes automatically
  - No throttling apart from the protective limits below

  Throttling protection:
    - Default maximums of 40,000 RCU and 40,000 WCU per table
    - Per partition: 3,000 RCU, 1,000 WCU

When to use on-demand:
  - Unpredictable workloads
  - New applications (unknown traffic)
  - Spiky traffic patterns
  - Development/testing

When to use provisioned:
  - Predictable traffic
  - Cost optimization (cheaper at consistent load)
  - Reserved capacity discounts
Key Takeaways
Chapter 2 Summary:

- Architecture layers:
  - Request router: stateless, handles routing and throttling
  - Storage nodes: store data, handle replication
  - Control plane: manages partitions and scaling
- Consistent hashing:
  - Minimizes data movement when nodes are added or removed
  - Uses a hash ring (0 to 2^32 - 1)
  - Virtual nodes ensure even distribution
  - Only ~K/(N+1) keys move when adding a node
- Partitioning:
  - Partition key hashed with MD5
  - Each partition limited to 10 GB, 3,000 RCU, 1,000 WCU
  - Automatic splitting when limits are exceeded
  - Choose high-cardinality partition keys
- Request routing:
  - Writes go to the primary and replicate to 2 replicas (W=2)
  - Eventually consistent reads can be served by any replica
  - Strongly consistent reads come from the primary only
  - Typical latency: 5-10 ms
- Capacity management:
  - Provisioned mode: fixed capacity, cheaper for predictable loads
  - On-demand mode: auto-scaling, pay-per-request
  - Adaptive capacity: redistributes capacity to hot partitions
  - Burst capacity: handles short traffic spikes
- Hot partitions:
  - Occur when traffic is uneven across partitions
  - Mitigated by adaptive capacity
  - Best solved with better partition key design
Interview Questions
Question 1: Explain how consistent hashing works in DynamoDB

Difficulty: Medium

Strong Answer:
1. THE PROBLEM:
   Traditional hashing (hash(key) % N) causes massive data movement when
   nodes are added or removed. With 1000 nodes, adding one more remaps
   nearly every key (~99.9% of the data)!

2. CONSISTENT HASHING SOLUTION:
   - The hash space forms a ring (0 to 2^32 - 1)
   - Nodes are placed on the ring via a hash
   - Keys are placed on the ring via a hash
   - A key belongs to the next node clockwise

3. ADDING A NODE:
   - The new node is placed on the ring
   - Only keys between the new node and the previous node move
   - Other keys are unaffected
   - Minimal data movement: ~K/(N+1) keys

4. REMOVING A NODE:
   - The node is removed from the ring
   - Its keys move to the next node clockwise
   - Other nodes are unaffected

5. VIRTUAL NODES:
   - Each physical node has multiple positions (tokens)
   - Ensures even distribution
   - DynamoDB uses this internally

6. BENEFITS:
   - Minimal data movement during scaling
   - Load balanced across nodes
   - Fault tolerance (replicas on different nodes)

Follow-up: What happens if the hash distribution is uneven?
Answer: Use virtual nodes (tokens). Each physical node gets 128+ positions on
the ring, which keeps the distribution statistically even.
Question 2: Why does DynamoDB separate the request router from storage nodes?
Difficulty: Medium

Strong Answer:
SEPARATION OF CONCERNS:

1. INDEPENDENT SCALING:
   Request router:
     - CPU-bound (hashing, routing, auth)
     - Stateless (easy to scale horizontally)
     - Scales with the request rate
   Storage nodes:
     - I/O-bound (disk reads and writes)
     - Stateful (hold the data)
     - Scale with data size and throughput
   Benefit: each layer scales independently!

2. MULTI-TENANCY:
   - A router can serve many tables
   - Storage nodes are isolated per tenant
   - Better resource utilization
   - Noisy-neighbor protection

3. OPERATIONAL SIMPLICITY:
   - Storage nodes focus on durability
   - Routers handle the complex logic
   - Easier to debug and monitor
   - Cleaner failure domains

4. SECURITY:
   - The router enforces IAM policies
   - Storage nodes don't need credentials
   - Reduced attack surface

5. PERFORMANCE:
   - The router layer can cache metadata
   - Storage is optimized for disk I/O
   - No coordination overhead

CONTRAST WITH DYNAMO:
  Original Dynamo: a P2P architecture where every node does everything.
  Complex, but gives more control.
  DynamoDB: a layered architecture, simpler to operate and scale, and
  better suited to a managed service.
Question 3: How does DynamoDB handle hot partitions?
Difficulty: Medium-Hard

Strong Answer:
HOT PARTITION: one partition receives a disproportionate share of the
traffic, causing throttling even when the table has spare capacity.

CAUSES:
1. Poor partition key choice (e.g., status="active")
2. Viral content (celebrity tweet, viral video)
3. Time-based access (everyone queries "today")

DYNAMODB SOLUTIONS:

1. ADAPTIVE CAPACITY (automatic):
   - Monitors partition utilization
   - Borrows unused capacity from cold partitions
   - Boosts the hot partition's capacity
   - Activates in 5-30 minutes
   - Limitation: can't exceed the table's total capacity

2. BURST CAPACITY (automatic):
   - Accumulates unused capacity (up to 5 minutes' worth)
   - Uses the burst pool for short spikes
   - Maintained per partition
   - Helps with brief traffic spikes

3. PARTITION SPLITTING (automatic):
   - If a partition exceeds 10 GB or sees sustained high traffic
   - DynamoDB splits the partition
   - Creates two partitions, each with half the key range
   - Distributes the load across two nodes

APPLICATION-LEVEL SOLUTIONS:

4. WRITE SHARDING:
   - Add a random suffix to the partition key
   - PK: productId + random(1-10)
   - Distributes writes across 10 partitions
   - On read: query all 10 shards and merge the results

5. CACHING (DAX):
   - Use DynamoDB Accelerator (DAX)
   - In-memory cache for reads
   - Microsecond latency
   - Reduces the load on hot partitions

6. BETTER PARTITION KEY DESIGN:
   - Use high-cardinality keys (userId, not status)
   - Avoid time-based keys as the PK
   - Use composite keys to spread the load

EXAMPLE:
  Bad:  PK=date (everyone queries today)
  Good: PK=userId, SK=date
Question 4: Compare provisioned vs on-demand capacity modes
Difficulty: Easy-Medium

Strong Answer:
PROVISIONED CAPACITY

Pricing:
  - Pay for provisioned RCU/WCU per hour
  - ~$0.00065 per WCU-hour, ~$0.00013 per RCU-hour (approx)
  - Cheaper for sustained, consistent load

Scaling:
  - Manual scaling or auto-scaling
  - Capacity increases any time; decreases are limited to a handful per day
  - Must provision capacity up front

Throttling:
  - Throttled when exceeding provisioned capacity
  - Must handle ProvisionedThroughputExceededException

Best for:
  - Predictable traffic
  - Cost optimization
  - Production workloads with known patterns

ON-DEMAND CAPACITY

Pricing:
  - Pay per request
  - ~$1.25 per million write request units, ~$0.25 per million read request units (approx)
  - Roughly 5x more expensive than fully utilized provisioned capacity

Scaling:
  - Automatic, near-instant scaling
  - No capacity planning needed
  - Scales to 2x the previous peak within 30 minutes

Throttling:
  - Minimal throttling (protective limits only)
  - Default maximums of 40K RCU / 40K WCU per table

Best for:
  - Unpredictable workloads
  - Spiky traffic
  - New applications
  - Dev/test environments
COST COMPARISON EXAMPLE:

Workload: 1M writes/day, 5M reads/day, spread evenly.

Provisioned:
  - WCU: ~12 (1M / 86,400 sec)
  - RCU: ~58 (5M / 86,400 sec)
  - Cost: roughly $11/month at the rates above

On-demand:
  - Writes: 30M/month × $1.25/million ≈ $37.50
  - Reads: 150M/month × $0.25/million ≈ $37.50
  - Cost: roughly $75/month

For steady, sustained traffic, provisioned is cheaper; on-demand wins when
traffic is spiky or mostly idle, because provisioned capacity would sit
unused. (The arithmetic is sketched in the code below.)

SWITCHING:
  - You can switch modes once per 24 hours
  - Choose based on your traffic patterns
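The break-even arithmetic from the answer above, using the example rates (actual prices vary by region and change over time):

```python
# Example rates (approximate; actual pricing varies by region and over time).
WCU_HOUR, RCU_HOUR = 0.00065, 0.00013   # provisioned, per capacity-unit-hour
WRITE_PER_M, READ_PER_M = 1.25, 0.25    # on-demand, per million requests
HOURS_PER_MONTH, DAYS_PER_MONTH = 730, 30

writes_per_day, reads_per_day = 1_000_000, 5_000_000

# Provisioned: size capacity for the steady average rate.
wcu = writes_per_day / 86_400
rcu = reads_per_day / 86_400
provisioned = (wcu * WCU_HOUR + rcu * RCU_HOUR) * HOURS_PER_MONTH

# On-demand: pay per request.
on_demand = (writes_per_day * DAYS_PER_MONTH / 1e6) * WRITE_PER_M \
          + (reads_per_day * DAYS_PER_MONTH / 1e6) * READ_PER_M

print(f"provisioned ~ ${provisioned:.0f}/month, on-demand ~ ${on_demand:.0f}/month")
# ~$11/month vs ~$75/month for this steady workload; on-demand only wins when
# traffic is spiky or mostly idle, so provisioned capacity would sit unused.
```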
Question 5: Design a partition key strategy for a social media app
Difficulty: Hard

Scenario: Design a partition key strategy for a Twitter-like app with:
- 100M users
- Users post tweets
- Users view their timeline (tweets from followers)
- Popular users have millions of followers
REQUIREMENTS ANALYSIS:

1. Access patterns:
   - Get a user's tweets: frequent
   - Get a user's timeline: very frequent
   - Post a tweet: moderate
   - Follow/unfollow: infrequent

2. Scale considerations:
   - 100M users → need a high-cardinality PK
   - Viral tweets → hot partition risk
   - Celebrity users → massive fan-out
PARTITION KEY STRATEGIES:
OPTION 1: User-Centric (Single Table Design)
PK: userId
SK: TYPE#timestamp

User profile:
  PK: USER#alice
  SK: METADATA

User's tweets:
  PK: USER#alice
  SK: TWEET#2024-01-15T10:00:00#id123

User timeline (fan-out on write):
  PK: USER#alice#TIMELINE
  SK: TWEET#2024-01-15T10:00:00#id456

Pros:
  + Single query for a user's tweets
  + Single query for the timeline
  + No hot partitions (userId is well distributed)

Cons:
  - Fan-out writes are expensive for celebrities
  - Timeline storage duplicates tweet data
OPTION 2: Hybrid Approach (Handle Celebrities)
Regular users: fan-out on write
  - The tweet is written to followers' timelines
  - Fast timeline reads

Celebrities (>10K followers):
  - NO fan-out
  - Timeline: pull tweets from the followed users
  - Use a DAX cache for celebrity tweets

Timeline query (alice follows a celebrity):
  1. Read alice's timeline (fan-out tweets)
  2. Query the celebrity's tweets separately
  3. Merge and sort by timestamp
  4. Cache the result in DAX
OPTION 3: Sharded Tweets (Avoid Hot Partitions)
For viral tweets, shard the tweet:

Original tweet:
  PK: TWEET#viral123
  Gets hot (millions of reads)

Sharded approach:
  PK: TWEET#viral123#SHARD#{rand(1-100)}

Read path:
  - Pick a random shard, e.g., SHARD#42
  - All shards have the same content
  - Distributes the load across 100 partitions

Benefit: 100x read capacity for viral content.
RECOMMENDED STRATEGY:
Tables:
1. Users table:
PK: userId
(User profiles, settings)
2. Tweets table:
PK: userId
SK: tweetId
(User's tweets, easy to query)
3. Timeline table:
PK: userId
SK: timestamp#tweetId
(Fan-out on write for regular users)
4. Follows table:
PK: userId
SK: FOLLOWS#followedUserId
(Track relationships)
Fan-out Strategy:
- If follower count < 10K: fan-out on write
- If follower count ≥ 10K: pull model
- Use DynamoDB Streams + Lambda for the fan-out
- Cache celebrity tweets in DAX
This balances read performance, write cost,
and handles hot partitions gracefully.
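A hedged sketch of the hybrid fan-out rule from Option 2, using boto3 (the table name, key attributes, and the 10K threshold are illustrative assumptions):

```python
import boto3

# Hypothetical Timeline table: partition key "userId", sort key "sk" = "timestamp#tweetId".
timeline = boto3.resource("dynamodb").Table("Timeline")

FANOUT_THRESHOLD = 10_000  # above this, treat the author as a celebrity (pull model)

def publish_tweet(author_id: str, tweet_id: str, ts: str, follower_ids: list) -> None:
    if len(follower_ids) >= FANOUT_THRESHOLD:
        # Celebrity: skip fan-out; followers pull the author's tweets at read time.
        return
    # Regular user: fan out on write, one timeline item per follower.
    with timeline.batch_writer() as batch:
        for follower_id in follower_ids:
            batch.put_item(Item={
                "userId": follower_id,
                "sk": f"{ts}#{tweet_id}",
                "authorId": author_id,
            })
```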
What's Next?

In Chapter 3: Data Model and Access Patterns, we'll explore:
- Tables, items, and attributes in detail
- Primary keys: partition key vs composite key
- Secondary indexes (GSI and LSI)
- Data modeling patterns and anti-patterns
Continue to Chapter 3
Master DynamoDB's data model and learn how to design effective schemas.