Chapter 7: Fault Tolerance and Reliability
Introduction
DynamoDB is designed for high availability and fault tolerance, with built-in mechanisms for data durability, automatic replication, and disaster recovery. This chapter explores how DynamoDB achieves reliability and how to build resilient applications on top of it.
DynamoDB’s Built-in Fault Tolerance
Multi-AZ Replication
DynamoDB automatically replicates data across three Availability Zones within a region.
Quorum-Based Replication
DynamoDB uses quorum writes and reads to ensure consistency and availability.
Deep Dive: Consensus and the Evolution from Gossip to Paxos
While the original 2007 Dynamo paper relied on Gossip Protocols for membership and decentralized coordination, modern DynamoDB uses a more structured approach for its storage layer.
1. Leader-Based Replication (Paxos)
Each partition is replicated three times, and one of those replicas is elected as the Leader.
- Election: DynamoDB uses the Paxos consensus algorithm to elect a leader for each partition’s replication group.
- Role of the Leader: All writes for a partition MUST go through the leader. The leader coordinates the replication to the followers.
- Consistency: The leader ensures that even if one replica is lagging, the quorum (2 of 3) always reflects the most recent acknowledged write.
2. Failure Detection: The Heartbeat
Instead of the “sloppy quorum” and hinted handoff described in the original paper (which prioritized availability over consistency), modern DynamoDB uses strict failure detection.
- Lease Mechanism: The leader holds a lease. If the leader fails, the lease expires, and the remaining replicas use Paxos to elect a new leader.
- Deterministic vs. Probabilistic: The original gossip protocol was probabilistic (eventual convergence). Paxos-based leadership in modern DynamoDB is deterministic, providing much stronger consistency guarantees for “Strongly Consistent” reads.
3. Comparison: Original Paper vs. Modern DynamoDB
| Feature | Original Dynamo (2007) | Modern DynamoDB (AWS) |
|---|---|---|
| Consensus | Gossip Protocol (Probabilistic) | Multi-Paxos (Deterministic) |
| Consistency | Eventual Consistency (Always) | Strongly or Eventually Consistent |
| Coordination | Peer-to-Peer (No Leader) | Leader-based (per Partition) |
| Durability | Merkle Trees (Anti-Entropy) | Continuous Log Replication & PITR |
4. Cross-Track Analysis: Fault Domains
A. AZ-Awareness vs. Hadoop Rack-Awareness
DynamoDB’s AZ-awareness is the cloud-scale evolution of Hadoop’s Rack-Awareness.
- Hadoop Rack-Awareness: Designed for physical data centers where the primary fault domain is a Server Rack (shared power/switch). Hadoop places the 2nd and 3rd replicas on a different rack to survive a switch failure.
- DynamoDB AZ-Awareness: Designed for cloud regions where the fault domain is an Availability Zone (entire data center). DynamoDB ensures that the 3 replicas of a partition are distributed across 3 different AZs, surviving a total data center outage.
B. Anti-Entropy and the Merkle Tree Legacy
As discussed in Chapter 1: Introduction, the original Dynamo paper introduced Merkle Trees for efficient data reconciliation.
- The Problem: In a distributed system, replicas can drift due to bit rot or missed writes. Comparing entire datasets between nodes is too slow.
- The Solution (Merkle Trees): A hash tree where every leaf is a data item hash, and every parent is a hash of its children. If the root hashes of two nodes match, the data is identical. If not, they only swap the specific branches that differ (a minimal sketch follows after this list).
- Modern Reality: While modern DynamoDB has replaced Merkle-tree gossip with Paxos-based logs for synchronous replication, the principle of Hash-based Verification remains at the core of DynamoDB’s background “Scrubbing” process, which continuously verifies data integrity on disk.
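To make the branch comparison concrete, here is a minimal, illustrative Merkle-tree reconciliation sketch in Python. The data structures and function names are hypothetical teaching aids, not DynamoDB internals:

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build(items: list[bytes]) -> dict:
    """Build a tiny Merkle tree: leaves hash items, parents hash child hashes."""
    nodes = [{"hash": sha(x), "range": (i, i)} for i, x in enumerate(items)]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            left, right = nodes[i], nodes[min(i + 1, len(nodes) - 1)]
            paired.append({
                "hash": sha((left["hash"] + right["hash"]).encode()),
                "range": (left["range"][0], right["range"][1]),
                "children": (left, right),
            })
        nodes = paired
    return nodes[0]

def diff(a: dict, b: dict, out: list):
    """Descend only into branches whose hashes disagree (the anti-entropy trick)."""
    if a["hash"] == b["hash"]:
        return                      # whole subtree identical, nothing to exchange
    if "children" not in a:         # reached a leaf that differs
        out.append(a["range"][0])
        return
    for child_a, child_b in zip(a["children"], b["children"]):
        diff(child_a, child_b, out)

replica_a = [b"item-0", b"item-1", b"item-2", b"item-3"]
replica_b = [b"item-0", b"item-1", b"item-2-stale", b"item-3"]
stale = []
diff(build(replica_a), build(replica_b), stale)
print(stale)  # -> [2]: only this key range needs repair
```

Only the root hashes are exchanged when replicas agree; the differing branch is walked only on mismatch, which is what kept anti-entropy traffic cheap in the original design.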
| Feature | Original Dynamo (2007) | Modern DynamoDB |
|---|---|---|
| Membership | Gossip Protocol | Managed Membership Service |
| Leader | Peer-to-Peer (No leader) | Strong Leader per Partition |
| Consensus | Version Clocks / Conflict Resolution | Paxos-based Leadership |
| Fault Tolerance | Hinted Handoff (availability-first, AP) | Strict Quorum Writes (strong or eventual consistency) |
Automatic Failure Detection and Recovery
Deep Dive: Evolution from the 2007 Dynamo Paper
While modern DynamoDB has replaced many of the original paper’s mechanisms (like Gossip-based membership) with more deterministic AWS-managed services, the core principles of Anti-Entropy remain central to its design.
1. Merkle Trees and Anti-Entropy
In the original 2007 paper (referenced in Chapter 1), Dynamo used Merkle Trees (hash trees) for anti-entropy. This allowed nodes to compare their datasets by only exchanging the roots of their hash trees, drastically reducing the bandwidth needed to detect inconsistencies. In modern DynamoDB:
- Active Anti-Entropy: This is now handled by the Log-Structured Merge-Tree (LSM) storage engine and the Paxos log.
- Repair: If a replica falls behind, it doesn’t just “gossip” for the data. Instead, the Paxos leader identifies the missing log sequence numbers and pushes the missing entries to the lagging replica.
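A heavily simplified, hypothetical sketch of this log-based repair, using invented in-memory leader and replica objects (nothing here reflects DynamoDB’s actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    applied_lsn: int = 0                      # highest log sequence number applied
    data: dict = field(default_factory=dict)

    def apply(self, lsn: int, key: str, value: str):
        self.data[key] = value
        self.applied_lsn = lsn

@dataclass
class Leader:
    log: list = field(default_factory=list)   # [(lsn, key, value), ...]

    def write(self, key: str, value: str) -> int:
        lsn = len(self.log) + 1
        self.log.append((lsn, key, value))
        return lsn

    def catch_up(self, replica: Replica):
        """Push only the log entries the lagging replica is missing."""
        for lsn, key, value in self.log:
            if lsn > replica.applied_lsn:
                replica.apply(lsn, key, value)

leader, follower = Leader(), Replica()
for i in range(5):
    leader.write(f"k{i}", f"v{i}")
leader.catch_up(follower)                     # follower replays LSNs 1..5
print(follower.applied_lsn, len(follower.data))  # -> 5 5
```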
2. Quorum Systems (R + W > N)
The original paper’s “Sloppy Quorum” has evolved into a strict, Paxos-based quorum.
- Then: Under the sloppy quorum, any healthy node (even a stand-in for a failed node, via hinted handoff) could accept a write.
- Now: A majority of the defined replica group (Paxos quorum) must acknowledge the write to the log before it is considered successful. This provides the “Strong Consistency” option that was absent in the original decentralized design.
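The quorum rule itself is simple arithmetic: with N = 3 replicas, W = 2 acknowledged writes and R = 2 read responses guarantee overlap (R + W = 4 > 3), so at least one replica in every read set holds the latest acknowledged write. A tiny illustrative check:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """R + W > N guarantees that every read set intersects every write set."""
    return r + w > n

print(quorum_overlaps(n=3, w=2, r=2))  # True  -> strongly consistent reads possible
print(quorum_overlaps(n=3, w=2, r=1))  # False -> a single-replica read may be stale
```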
Deep Dive: PITR Archiving and Performance Isolation
Point-in-Time Recovery (PITR) provides continuous backups for the last 35 days. Unlike traditional snapshot-based backups, PITR is “Zero-Impact” on live database performance.
1. Log-Structured Archiving
DynamoDB doesn’t perform “scans” to back up data. Instead, it uses a Log-Structured approach:
- Stream Archiving: Every write that is committed to the Paxos log is asynchronously copied to a highly durable S3-backed storage system.
- Metadata Versioning: DynamoDB maintains a global timeline of these log entries. When you request a restore to `T=10:30:05`, DynamoDB identifies the base snapshot and replays all log entries up to that exact second.
2. Performance Isolation
- Background Process: The archival process happens on the storage nodes’ background threads, completely separate from the request-processing threads.
- No Locking: Because the on-disk storage blocks are immutable (analogous to HBase’s HFiles), the background archiver can read data without acquiring any locks, ensuring that live read/write latency is unaffected by PITR.
Backup and Restore Strategies
On-Demand Backups
Point-in-Time Recovery (PITR)
Deep Dive: PITR Mechanics and Performance Impact
Point-in-Time Recovery (PITR) is one of DynamoDB’s most impressive engineering feats, allowing restoration to any second in the last 35 days with zero impact on application performance.
1. The Log-Structured Approach
Unlike traditional databases that might rely on periodic snapshots and WAL (Write-Ahead Log) replay, DynamoDB’s PITR is built on a log-structured storage engine.
- Continuous Archiving: Every write to a DynamoDB table is automatically archived to a highly durable storage layer (internal S3-like system) in the background.
- Zero Performance Hit: Because the archiving happens asynchronously from the main request path (leader replication), enabling PITR does not increase latency for your `PutItem` or `UpdateItem` calls.
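Enabling PITR is a single control-plane call. A minimal boto3 sketch (the table name is a placeholder):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Turn on continuous backups / PITR for an existing table.
dynamodb.update_continuous_backups(
    TableName="Orders",  # placeholder table name
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Inspect the earliest and latest restorable times.
desc = dynamodb.describe_continuous_backups(TableName="Orders")
print(desc["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"])
```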
2. The “Restore-as-New” Pattern
It’s important to understand that a PITR restore never overwrites your existing table.
- New Table Creation: DynamoDB creates a new table and populates it from the archived logs at the specified timestamp.
- Data Integrity: This “side-by-side” restore allows you to verify the data before pointing your application to the new table.
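A minimal boto3 sketch of this restore-as-new flow (table names and the timestamp are placeholders):

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.client("dynamodb")

# Restore into a NEW table; the source table is never overwritten.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="Orders",                 # placeholder source table
    TargetTableName="Orders-restored-1030",   # placeholder target table
    RestoreDateTime=datetime(2024, 5, 1, 10, 30, 5, tzinfo=timezone.utc),
)

# Alternatively, restore to the latest restorable time:
# dynamodb.restore_table_to_point_in_time(
#     SourceTableName="Orders",
#     TargetTableName="Orders-restored-latest",
#     UseLatestRestorableTime=True,
# )
```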
3. Consistency and Recovery Granularity
- Second-Level Precision: You can restore to any second within the 35-day window.
- Metadata Restore: PITR restores the base table, its data, and (by default) its secondary indexes, but you must manually reconfigure:
- IAM policies
- TTL settings
- Auto-scaling policies
- CloudWatch alarms, tags, and stream settings
4. Performance Metrics: RTO vs. Data Size
The Recovery Time Objective (RTO) for PITR is proportional to the size of the table, not the length of the recovery window. A 10GB table will restore much faster than a 10TB table, regardless of whether you are restoring to 1 hour ago or 30 days ago.
Backup to S3
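For archival beyond the 35-day window, a PITR-enabled table can be exported to S3 without consuming read capacity. A minimal boto3 sketch (the table ARN and bucket are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Export a PITR-enabled table to S3 (no RCU consumption, no impact on the live table).
dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/Orders",  # placeholder
    S3Bucket="my-dynamodb-archive",                                   # placeholder
    ExportFormat="DYNAMODB_JSON",
)
```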
Disaster Recovery Strategies
RTO and RPO
Automated Failover
Error Handling and Retry Logic
Exponential Backoff
Circuit Breaker Pattern
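No circuit breaker ships with DynamoDB itself; the sketch below is a generic, illustrative breaker (invented class name and thresholds) that an application might wrap around SDK calls so sustained failures fail fast instead of piling up retries:

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; allow a trial call after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit
        return result
```

An application could then wrap reads, for example `breaker.call(table.get_item, Key={"pk": "user#1"})`, so that while DynamoDB is throttling or unreachable the caller degrades gracefully rather than queueing more work.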
Graceful Degradation
Monitoring and Alerting
CloudWatch Alarms
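As one hedged example, a boto3 call that alarms on sustained write throttling for a single table (alarm name, table name, threshold, and SNS topic are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the table records write throttling for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="Orders-write-throttles",                     # placeholder name
    Namespace="AWS/DynamoDB",
    MetricName="WriteThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "Orders"}],  # placeholder table
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```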
Custom Health Checks
Interview Questions and Answers
Question 1: How does DynamoDB achieve fault tolerance?
Answer: DynamoDB achieves fault tolerance through multiple mechanisms:
1. Multi-AZ Replication:
- Every write is automatically replicated to 3 AZs
- Uses quorum writes (2 of 3 must acknowledge)
- Survives single AZ failure without data loss
2. Automatic Failure Detection and Recovery:
- Continuous health monitoring of storage nodes
- Failed replicas automatically repaired
- Traffic automatically routed to healthy nodes
3. Durable Storage:
- SSD-based storage with redundancy
- Write-ahead logging
- Data checksums to detect corruption
4. No Single Point of Failure:
- Distributed architecture
- No master node dependency
- Each partition independently replicated
Question 2: What is the difference between on-demand backups and PITR?
Answer: On-Demand Backups:
- Manual snapshots
- Retained until explicitly deleted
- Full table backup
- No performance impact during backup
- Restore creates new table
- Use case: Before major changes, compliance archival
Point-in-Time Recovery (PITR):
- Continuous backups
- 35-day retention (automatic)
- Second-level granularity
- Protects against accidental writes/deletes
- Restore to any point in window
- Use case: Operational recovery, accidental data loss
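A minimal boto3 sketch of the on-demand side of this comparison (table and backup names are placeholders; in practice, wait for the backup to reach AVAILABLE before restoring):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# On-demand backup: an explicit snapshot, retained until you delete it.
backup = dynamodb.create_backup(
    TableName="Orders",                        # placeholder table
    BackupName="Orders-before-schema-change",  # placeholder backup name
)
backup_arn = backup["BackupDetails"]["BackupArn"]

# Poll dynamodb.describe_backup(BackupArn=backup_arn) until the backup is AVAILABLE,
# then restore it -- this also creates a NEW table.
dynamodb.restore_table_from_backup(
    TargetTableName="Orders-restored",
    BackupArn=backup_arn,
)
```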
Question 3: Explain RTO and RPO in the context of DynamoDB disaster recovery.
Answer:
RTO (Recovery Time Objective): How quickly can you recover service after a disaster?
RPO (Recovery Point Objective): How much data can you afford to lose?
DynamoDB DR Strategies:
| Strategy | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Global Tables | Seconds | Sub-second | High | Mission-critical |
| PITR | Minutes to hours (size-dependent) | ~5 minutes | Medium | Production apps |
| Daily Backups | Hours | 24 hours | Low | Non-critical data |
| Weekly S3 Export | Days | 7 days | Very Low | Archival |
Question 4: How do you handle throttling in a resilient way?
Answer: Multi-layered approach:
1. Exponential Backoff with Jitter:
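A minimal sketch of exponential backoff with full jitter around a DynamoDB write (retry limits, table name, and the retryable error set are illustrative; note the AWS SDKs also ship built-in retry modes):

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # placeholder table name

RETRYABLE = {"ProvisionedThroughputExceededException", "ThrottlingException"}

def put_with_backoff(item: dict, max_attempts: int = 5, base: float = 0.05, cap: float = 2.0):
    """Retry throttled writes with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return table.put_item(Item=item)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential ceiling.
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))

put_with_backoff({"pk": "user#1", "sk": "order#42", "status": "NEW"})
```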
Question 5: Design a disaster recovery plan for a multi-region application using DynamoDB.
Answer: Architecture:
- Global Tables for data replication
- Route53 for DNS failover
- CloudWatch for monitoring
- Automated scripts for failover
- Regular DR testing
- Quarterly DR drills
- Automated testing in staging
- Runbook documentation
- Team training
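As an illustrative starting point for the Global Tables piece, adding a replica region with boto3 (assumes the table already uses the current Global Tables version, 2019.11.21; names are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Add a replica in a second region, turning the table into a Global Table.
dynamodb.update_table(
    TableName="Orders",  # placeholder table
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)
```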
Summary
DynamoDB Fault Tolerance Features:
- Built-in Replication:
  - 3-AZ automatic replication
  - Quorum-based writes
  - No single point of failure
- Backup Options:
  - On-demand backups (long-term retention)
  - PITR (35-day window, second-level granularity)
  - S3 export (archival)
- Disaster Recovery:
  - Global Tables (RTO: seconds, RPO: sub-second)
  - Multi-region deployment
  - Automated failover
- Error Handling:
  - Exponential backoff
  - Circuit breakers
  - Graceful degradation
- Monitoring:
  - CloudWatch metrics and alarms
  - Custom health checks
  - Proactive alerting