# Well-Architected Framework

> Master the 6 pillars for building reliable, secure, and efficient AWS systems

<Frame>
  <img src="https://mintcdn.com/devweeekends/sTu6A4whRFPJo0_g/images/aws/well-architected-pillars.svg?fit=max&auto=format&n=sTu6A4whRFPJo0_g&q=85&s=24325814b33cdd47e1aacfbf3577c337" alt="AWS Well-Architected 6 Pillars" width="1080" height="1080" data-path="images/aws/well-architected-pillars.svg" />
</Frame>

## Module Overview

<Info>
  **Estimated Time**: 3-4 hours | **Difficulty**: Intermediate | **Prerequisites**: Previous AWS modules
</Info>

The Well-Architected Framework is central to AWS certifications and to real-world architecture work. Think of it as the building code for cloud systems: just as a physical building must meet structural, fire-safety, and electrical standards, your cloud architecture must meet standards for operational excellence, security, reliability, performance, cost, and sustainability. The framework does not prescribe one "right" architecture; it provides best practices and design principles that give you a structured way to evaluate trade-offs. A startup might intentionally accept lower reliability for faster time-to-market, while a bank might pay more for higher security. The framework helps you make those decisions explicitly rather than accidentally.

**What You'll Learn:**

* The 6 pillars and their design principles
* Common architectural patterns
* Trade-offs between pillars
* How to perform architecture reviews
* Practical implementation guidance

## Overview

The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and implementing designs that scale.

```
┌────────────────────────────────────────────────────────────────┐
│               Well-Architected Framework Pillars                │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│   │ Operational  │  │   Security   │  │  Reliability │        │
│   │  Excellence  │  │              │  │              │        │
│   │     🔧       │  │     🔒       │  │     💪       │        │
│   └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                                 │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│   │ Performance  │  │    Cost      │  │Sustainability│        │
│   │  Efficiency  │  │ Optimization │  │              │        │
│   │     ⚡       │  │     💰       │  │     🌱       │        │
│   └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```

***

## 1. Operational Excellence

*Run and monitor systems to deliver business value*

### Design Principles

* **Perform operations as code** - Infrastructure as Code (IaC)
* **Make frequent, small, reversible changes**
* **Refine operations procedures frequently**
* **Anticipate failure** - Pre-mortems
* **Learn from operational failures**
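
As a rough illustration of "operations as code", the sketch below builds a small CloudFormation template as a Python dictionary and serializes it to JSON, so the infrastructure definition can be reviewed and versioned like any other code. The function name `build_bucket_template` and the logical ID `AppDataBucket` are hypothetical; the resource properties follow the `AWS::S3::Bucket` resource type.

```python
import json


def build_bucket_template(bucket_logical_id: str = "AppDataBucket") -> dict:
    """Return a minimal CloudFormation template for one versioned, encrypted bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: versioned, KMS-encrypted S3 bucket",
        "Resources": {
            # The logical ID is a hypothetical name chosen for this example.
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                    "BucketEncryption": {
                        "ServerSideEncryptionConfiguration": [
                            {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
                        ]
                    },
                },
            }
        },
    }


template = build_bucket_template()
template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the template is plain data, every change to it shows up as a small, reviewable diff, which also supports the "frequent, small, reversible changes" principle.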

### Key Practices

```
┌────────────────────────────────────────────────────────────────┐
│               Operational Excellence Practices                  │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Infrastructure as Code                                        │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐             │
│   │CloudForm- │    │ Terraform │    │    CDK    │             │
│   │  ation    │    │           │    │           │             │
│   └───────────┘    └───────────┘    └───────────┘             │
│                                                                 │
│   Observability                                                 │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐             │
│   │CloudWatch │    │  X-Ray    │    │  Logs     │             │
│   │  Metrics  │    │ Tracing   │    │ Insights  │             │
│   └───────────┘    └───────────┘    └───────────┘             │
│                                                                 │
│   Automation                                                    │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐             │
│   │  Lambda   │    │   SSM     │    │EventBridge│             │
│   │ Functions │    │ Runbooks  │    │   Rules   │             │
│   └───────────┘    └───────────┘    └───────────┘             │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```

***

## 2. Security

*Protect information, systems, and assets*

### Design Principles

* **Implement a strong identity foundation** - Least privilege
* **Enable traceability** - Monitor, alert, audit
* **Apply security at all layers**
* **Automate security best practices**
* **Protect data in transit and at rest**
* **Keep people away from data** - Reduce manual access
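
One way to make "least privilege" and "automate security best practices" concrete is a small lint that flags wildcard grants in an IAM policy document. This is a minimal sketch, not a replacement for a dedicated policy analysis tool; `find_wildcard_statements` and the sample policy (including the bucket name) are hypothetical.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM allows a single string or a list for Action/Resource; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list[str]:
    """Return a finding for every Allow statement that uses a wildcard action or resource."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        label = statement.get("Sid", f"statement[{index}]")
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        if any(action == "*" or action.endswith(":*") for action in actions):
            findings.append(f"{label}: wildcard action")
        if "*" in resources:
            findings.append(f"{label}: wildcard resource")
    return findings


# Hypothetical policy: one scoped statement and one overly broad statement.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReports",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-reports/*",
        },
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    ],
}

findings = find_wildcard_statements(sample_policy)
print(findings)
```

A check like this could run in a CI pipeline so that overly broad policies are caught before deployment rather than during an audit.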

### Security Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                    Defense in Depth                             │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Edge                                                          │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  CloudFront + WAF + Shield                              │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Network                   ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  VPC + Security Groups + NACLs + Flow Logs             │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Compute                   ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  IAM Roles + Instance Metadata + Patching              │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Application               ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  Secrets Manager + Parameter Store + Code Scanning     │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Data                      ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  KMS Encryption + Backup + Versioning + Access Logging │   │
│   └────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```

***

## 3. Reliability

*Recover from failures and meet demand*

### Design Principles

* **Automatically recover from failure**
* **Test recovery procedures**
* **Scale horizontally**
* **Stop guessing capacity**
* **Manage change through automation**

### High Availability Pattern

```
┌────────────────────────────────────────────────────────────────┐
│                Multi-AZ Architecture                            │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    ┌───────────────┐                           │
│                    │  Route 53     │                           │
│                    │  (DNS + HC)   │                           │
│                    └───────┬───────┘                           │
│                            │                                    │
│                    ┌───────▼───────┐                           │
│                    │     ALB       │                           │
│                    │  (Multi-AZ)   │                           │
│                    └───────┬───────┘                           │
│              ┌─────────────┼─────────────┐                     │
│              │             │             │                     │
│              ▼             ▼             ▼                     │
│        ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│        │   EC2    │  │   EC2    │  │   EC2    │               │
│        │  (AZ-1)  │  │  (AZ-2)  │  │  (AZ-3)  │               │
│        └────┬─────┘  └────┬─────┘  └────┬─────┘               │
│             │             │             │                      │
│             └─────────────┼─────────────┘                      │
│                           │                                    │
│                    ┌──────▼──────┐                             │
│                    │    RDS      │                             │
│                    │  Multi-AZ   │                             │
│                    │(Primary+    │                             │
│                    │ Standby)    │                             │
│                    └─────────────┘                             │
│                                                                 │
│   RTO: Minutes  |  RPO: Near-zero  |  Availability: 99.99%    │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```
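
An availability target such as the 99.99% in the diagram is easier to reason about as a yearly downtime budget. The sketch below does that arithmetic for a 365-day year; the function name `annual_downtime_minutes` is a hypothetical helper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year at the given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why higher targets tend to cost disproportionately more.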

### Reliability Metrics

| Metric   | Definition                 | What it measures                                   |
| -------- | -------------------------- | -------------------------------------------------- |
| **RTO**  | Recovery Time Objective    | Maximum acceptable time to restore service         |
| **RPO**  | Recovery Point Objective   | Maximum acceptable data loss, measured in time     |
| **MTTR** | Mean Time To Recovery      | Average time to restore service after a failure    |
| **MTBF** | Mean Time Between Failures | Average operating time between failures            |
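
MTTR and MTBF can be derived from an incident log, and together they give an availability estimate via `MTBF / (MTBF + MTTR)`. The following is a minimal sketch using made-up incidents over a 30-day window; the `Incident` class and `reliability_metrics` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_hour: float  # hours since the start of the observation window
    end_hour: float    # hour at which service was restored


def reliability_metrics(incidents: list[Incident], window_hours: float) -> dict[str, float]:
    """Compute MTTR, MTBF, and the implied availability for an observation window."""
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTTR and MTBF")
    downtime = sum(incident.end_hour - incident.start_hour for incident in incidents)
    uptime = window_hours - downtime
    mttr = downtime / len(incidents)  # average time to restore service
    mtbf = uptime / len(incidents)    # average operating time between failures
    return {
        "mttr_hours": mttr,
        "mtbf_hours": mtbf,
        "availability": mtbf / (mtbf + mttr),
    }


# Hypothetical incident log for a 30-day (720-hour) window.
incidents = [Incident(100, 101), Incident(400, 400.5), Incident(650, 651.5)]
metrics = reliability_metrics(incidents, window_hours=720)
print(metrics)
```

RTO and RPO are targets you agree on in advance; MTTR and MTBF are measurements that tell you whether the system is actually meeting them.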

***

## 4. Performance Efficiency

*Use resources efficiently as demand changes*

### Design Principles

* **Democratize advanced technologies** - Use managed services instead of building from scratch
* **Go global in minutes** - Deploy to multiple regions with a few clicks
* **Use serverless architectures** - Eliminate idle capacity costs
* **Experiment more often** - Lower the cost of experimentation
* **Consider mechanical sympathy** - Match architecture to workload (e.g., use Graviton ARM instances for compute-bound workloads, which are often around 20% cheaper than comparable x86 instances and offer better price-performance for many workloads)

### Performance Patterns

```
┌────────────────────────────────────────────────────────────────┐
│              Performance Optimization Stack                     │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Caching Layers                                                │
│   ┌──────────────────────────────────────────────────────────┐ │
│   │                                                           │ │
│   │   Browser    →   CDN      →   API Cache   →   DB Cache   │ │
│   │   (headers)     (CloudFront)  (API GW)      (ElastiCache) │ │
│   │                                                           │ │
│   └──────────────────────────────────────────────────────────┘ │
│                                                                 │
│   Compute Selection                                             │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐             │
│   │ Right-size  │ │ Graviton    │ │ Spot for    │             │
│   │ instances   │ │ (ARM)       │ │ batch       │             │
│   └─────────────┘ └─────────────┘ └─────────────┘             │
│                                                                 │
│   Database Optimization                                         │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐             │
│   │ Read        │ │ Connection  │ │ Query       │             │
│   │ Replicas    │ │ Pooling     │ │ Optimization│             │
│   └─────────────┘ └─────────────┘ └─────────────┘             │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```
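
The database-cache layer in the stack above is commonly implemented with the cache-aside pattern: check the cache first, fall back to the database on a miss, then store the result with a time-to-live. The sketch below uses an in-memory dictionary as a stand-in for ElastiCache; `TTLCache`, `get_or_load`, and `load_user_from_db` are hypothetical names, not part of any AWS SDK.

```python
import time
from typing import Callable


class TTLCache:
    """In-memory cache-aside helper; a stand-in for a real cache such as ElastiCache."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, object]] = {}  # key -> (expiry, value)

    def get_or_load(self, key: str, loader: Callable[[str], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the database entirely
        value = loader(key)  # cache miss or expired entry: read from the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = []


def load_user_from_db(user_id: str) -> dict:
    """Hypothetical database read; records each call so the effect of caching is visible."""
    db_calls.append(user_id)
    return {"id": user_id, "name": "example"}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_load("user-1", load_user_from_db)
second = cache.get_or_load("user-1", load_user_from_db)
print(f"database reads: {len(db_calls)}")
```

The TTL is the main design choice here: a longer TTL reduces database load but increases the window in which readers can see stale data.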

***

## 5. Cost Optimization

*Avoid unnecessary costs*

### Design Principles

* **Implement cloud financial management** - Assign cost ownership to teams
* **Adopt a consumption model** - Pay only for what you use (serverless, auto-scaling)
* **Measure overall efficiency** - Cost per transaction, not just total spend
* **Stop spending on undifferentiated heavy lifting** - Use managed services (RDS over self-managed MySQL can save an estimated 20-40 hours/month in DBA time)
* **Analyze and attribute expenditure** - Tag everything; untagged resources are invisible to cost allocation

Cost tip: The three biggest AWS bill surprises for most teams are (1) NAT Gateway data processing (about \$0.045/GB, which adds up fast), (2) CloudWatch Logs ingestion (about \$0.50/GB, so a verbose microservice writing 1 TB of logs a month can cost \$500/month in ingestion alone), and (3) idle or forgotten resources (dev EC2 instances running over weekends, unused EBS volumes). Rates vary by region, so check current pricing for yours. Set up AWS Budgets alerts at 50%, 80%, and 100% of expected spend.
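
The arithmetic behind the cost tip can be sketched in a few lines. The per-GB rates below are the ones quoted above and should be treated as assumptions, since pricing differs by region and changes over time; the traffic volumes and the expected monthly spend are made-up inputs.

```python
# Rates quoted in the cost tip; actual prices vary by region and over time.
NAT_PROCESSING_USD_PER_GB = 0.045
LOGS_INGESTION_USD_PER_GB = 0.50


def monthly_cost(gb_per_month: float, usd_per_gb: float) -> float:
    """Return the monthly charge for a usage-priced service, rounded to cents."""
    return round(gb_per_month * usd_per_gb, 2)


def budget_alert_thresholds(
    expected_monthly_usd: float, percents: tuple[int, ...] = (50, 80, 100)
) -> dict[int, float]:
    """Return the dollar amount at which each budget alert should fire."""
    return {percent: round(expected_monthly_usd * percent / 100, 2) for percent in percents}


# Hypothetical workload: 5 TB through the NAT Gateway and 1 TB of logs per month.
nat_cost = monthly_cost(5_000, NAT_PROCESSING_USD_PER_GB)
logs_cost = monthly_cost(1_000, LOGS_INGESTION_USD_PER_GB)
thresholds = budget_alert_thresholds(2_000)

print(f"NAT data processing: {nat_cost:,.2f} USD/month")
print(f"CloudWatch Logs ingestion: {logs_cost:,.2f} USD/month")
print(f"Budget alert thresholds: {thresholds}")
```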

### Cost Optimization Strategies

```
┌────────────────────────────────────────────────────────────────┐
│                 Cost Optimization Levers                        │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Right-Sizing                                                  │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  • Use Cost Explorer recommendations                    │   │
│   │  • Monitor with CloudWatch                              │   │
│   │  • Downsize underutilized instances                     │   │
│   └────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Pricing Models (Up to 72% savings)                           │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│   │   Reserved   │  │    Spot      │  │   Savings    │        │
│   │  Instances   │  │  Instances   │  │    Plans     │        │
│   │  (1-3 year)  │  │ (up to 90%)  │  │  (flexible)  │        │
│   └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                                 │
│   Architecture Optimization                                     │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  • Serverless (Lambda, Fargate)                         │   │
│   │  • S3 Lifecycle policies                                │   │
│   │  • Auto-scaling                                         │   │
│   │  • Delete unused resources                              │   │
│   └────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```
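
As an example of the "S3 Lifecycle policies" lever, the sketch below builds a lifecycle configuration that moves log objects to cheaper storage classes and eventually expires them. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`; the function name, the `logs/` prefix, the rule ID, and the day counts are hypothetical choices, and the code only constructs the configuration without calling AWS.

```python
def build_log_lifecycle_config(
    prefix: str = "logs/",
    ia_after_days: int = 30,
    glacier_after_days: int = 90,
    expire_after_days: int = 365,
) -> dict:
    """Build an S3 lifecycle configuration that tiers and then expires objects."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("day counts must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": "tier-and-expire-logs",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


lifecycle_config = build_log_lifecycle_config()
print(lifecycle_config)
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `LifecycleConfiguration` alongside a `Bucket` name; the right day counts depend on how often the data is actually read.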

***

## 6. Sustainability

*Minimize environmental impact*

### Design Principles

* **Understand your impact**
* **Establish sustainability goals**
* **Maximize utilization**
* **Adopt efficient hardware/software**
* **Use managed services**
* **Reduce downstream impact**

### Sustainability Practices

| Area        | Practice                          |
| ----------- | --------------------------------- |
| **Compute** | Right-size, Graviton (ARM), Spot  |
| **Storage** | Lifecycle policies, compression   |
| **Data**    | Efficient formats, cold storage   |
| **Code**    | Optimize algorithms, reduce calls |
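
"Maximize utilization" can be checked with a simple report over CPU samples. The sketch below flags instances whose average CPU falls below a threshold; the 60% default mirrors the review checklist on this page and is a starting point rather than a rule, and the instance names and samples are made up. In practice the samples might come from CloudWatch metrics.

```python
from statistics import mean


def underutilized(
    cpu_samples_by_instance: dict[str, list[float]], threshold_percent: float = 60.0
) -> list[str]:
    """Return the names of instances whose average CPU is below the threshold."""
    return sorted(
        name
        for name, samples in cpu_samples_by_instance.items()
        if samples and mean(samples) < threshold_percent
    )


# Hypothetical CPU utilization samples (percent) per instance.
samples = {
    "web-1": [70, 80, 75],
    "batch-1": [10, 20, 15],
    "api-1": [55, 65, 58],
}

candidates = underutilized(samples)
print(candidates)
```

Instances on the resulting list are candidates for right-sizing or consolidation, which lowers both cost and environmental footprint.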

***

## Architecture Review Checklist

```python
well_architected_review = {
    "operational_excellence": [
        "□ Infrastructure as Code? (CloudFormation, Terraform, or CDK -- not console clicks)",
        "□ Monitoring and alerting? (CloudWatch alarms for CPU, errors, latency)",
        "□ Runbooks for incidents? (documented steps, not tribal knowledge)",
        "□ CI/CD pipeline? (automated testing and deployment, not manual deploys)",
        "□ Rollback strategy? (canary deployments, blue-green, or feature flags)",
    ],
    "security": [
        "□ Least privilege IAM? (no * permissions, scoped to specific resources)",
        "□ Encryption at rest/transit? (KMS for data at rest, TLS for transit)",
        "□ Network segmentation? (private subnets for databases, SG references not IPs)",
        "□ Audit logging? (CloudTrail in all regions, log file validation enabled)",
        "□ Secrets management? (Secrets Manager or Parameter Store, never env vars)",
    ],
    "reliability": [
        "□ Multi-AZ deployment? (minimum 2 AZs for production)",
        "□ Health checks? (ALB health checks + Route 53 for DNS failover)",
        "□ Backup strategy? (automated snapshots, cross-region for DR)",
        "□ Disaster recovery tested? (run a DR drill at least quarterly)",
        "□ RTO/RPO defined? (documented and agreed upon with stakeholders)",
    ],
    "performance": [
        "□ Right-sized resources? (use Cost Explorer recommendations)",
        "□ Caching strategy? (ElastiCache for DB, CloudFront for static content)",
        "□ CDN for static content? (reduces latency by 50-90% for global users)",
        "□ Database optimized? (read replicas, connection pooling, query analysis)",
    ],
    "cost": [
        "□ Reserved/Savings Plans for baseline? (40-60% savings on steady-state)",
        "□ Spot instances for fault-tolerant workloads? (up to 90% savings)",
        "□ Unused resources cleaned? (weekly audit of idle EBS, old snapshots)",
        "□ Cost allocation tags on all resources? (Team, Environment, Project)",
        "□ Budget alerts configured? (50%, 80%, 100% thresholds)",
    ],
    "sustainability": [
        "□ Resource utilization >60%? (right-size or consolidate underused instances)",
        "□ Graviton (ARM) instances where possible? (20% better price-performance)",
        "□ Data lifecycle policies? (S3 lifecycle rules, RDS snapshot retention)",
    ]
}
```
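
A checklist like the one above becomes more useful once the answers are recorded and summarized. The following is a minimal sketch that scores each pillar by the fraction of items satisfied and lists the weakest pillars first; the `summarize_review` and `weakest_pillars` helpers and the sample answers are hypothetical.

```python
def summarize_review(answers: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Return the fraction of checklist items satisfied for each pillar."""
    return {
        pillar: sum(items.values()) / len(items)
        for pillar, items in answers.items()
        if items
    }


def weakest_pillars(scores: dict[str, float], limit: int = 3) -> list[str]:
    """Return the lowest-scoring pillars, worst first."""
    return sorted(scores, key=scores.get)[:limit]


# Hypothetical answers for three pillars of one workload.
answers = {
    "security": {
        "Least privilege IAM": True,
        "Encryption at rest/transit": True,
        "Audit logging": False,
        "Secrets management": True,
    },
    "reliability": {"Multi-AZ deployment": True, "Disaster recovery tested": False},
    "cost": {"Budget alerts configured": False, "Cost allocation tags": False},
}

scores = summarize_review(answers)
print(scores)
print(weakest_pillars(scores, limit=2))
```

Tracking these scores across reviews shows whether the improvement plan is actually closing gaps.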

<Tip>
  **Pro Tip**: Use the AWS Well-Architected Tool in the console to perform self-assessments and get improvement recommendations based on your workload.
</Tip>

***

## 🎯 Interview Questions

<AccordionGroup>
  <Accordion title="Q1: Explain the 6 pillars of Well-Architected Framework">
    1. **Operational Excellence**: Run and monitor systems (IaC, automation)
    2. **Security**: Protect information and assets (IAM, encryption)
    3. **Reliability**: Recover from failures, meet demand (Multi-AZ, backup)
    4. **Performance Efficiency**: Use resources efficiently (right-sizing, caching)
    5. **Cost Optimization**: Avoid unnecessary costs (Reserved, Spot, Savings Plans)
    6. **Sustainability**: Minimize environmental impact (efficiency, managed services)

    Each pillar has design principles and best practices.
  </Accordion>

  <Accordion title="Q2: How would you design for high availability?">
    **Strategy:**

    1. **Multi-AZ deployment** across at least 2 AZs
    2. **Load balancer** for traffic distribution
    3. **Auto Scaling** for capacity
    4. **RDS Multi-AZ** for database HA
    5. **Route 53 health checks** for DNS failover

    **Metrics:**

    * 99.9% = 8.76 hours downtime/year
    * 99.99% = 52.6 minutes/year
    * 99.999% = 5.26 minutes/year

    **Trade-off**: Higher availability = higher cost
  </Accordion>

  <Accordion title="Q3: How do you balance cost vs performance?">
    **Strategies:**

    1. **Right-size first** - don't over-provision
    2. **Reserved/Savings Plans** to cover the steady-state baseline (commonly 60-70% of usage)
    3. **Spot instances** for fault-tolerant workloads
    4. **Serverless** for variable workloads
    5. **Caching** to reduce compute/database load

    **Trade-offs:**

    * Reserved = commitment vs discount
    * Spot = savings vs interruption risk
    * Caching = complexity vs performance
  </Accordion>

  <Accordion title="Q4: How do you implement defense in depth?">
    **Layers:**

    1. **Edge**: CloudFront, WAF, Shield
    2. **Network**: VPC, Security Groups, NACLs
    3. **Compute**: IAM roles, patching, hardening
    4. **Application**: Input validation, secrets management
    5. **Data**: Encryption at rest/transit, backup

    **Principle**: If one layer fails, others still protect
  </Accordion>

  <Accordion title="Q5: What's the difference between RTO and RPO?">
    **RTO (Recovery Time Objective):**

    * How long to recover from failure
    * "How much downtime is acceptable?"
    * Example: 4 hours RTO = must be back in 4 hours

    **RPO (Recovery Point Objective):**

    * How much data loss is acceptable
    * "How far back in time to recover?"
    * Example: 1 hour RPO = max 1 hour of data loss

    **Trade-offs:**

    * Lower RTO/RPO = more expensive (Multi-AZ, continuous backup)
    * Higher RTO/RPO = cheaper (single-AZ, daily backup)
  </Accordion>
</AccordionGroup>

***

## 🧪 Hands-On Lab: Well-Architected Review

**Objective**: Perform a Well-Architected review on an existing workload

<Steps>
  <Step title="Access Well-Architected Tool">
    Go to AWS Console → Well-Architected Tool → Define Workload
  </Step>

  <Step title="Answer Pillar Questions">
    Go through each pillar's questions, marking current state
  </Step>

  <Step title="Review High-Risk Issues">
    Identify HRIs (High Risk Issues) flagged by the tool
  </Step>

  <Step title="Create Improvement Plan">
    Prioritize and create action items for improvements
  </Step>

  <Step title="Implement Changes">
    Address top 3 issues and re-run review
  </Step>
</Steps>

***

## Next: Case Studies

<Card title="Serverless URL Shortener" icon="link" href="/aws/case-study-serverless">
  Build a complete serverless application with Lambda, API Gateway, and DynamoDB
</Card>