AWS Well-Architected 6 Pillars

Module Overview

Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Previous AWS modules
The Well-Architected Framework is critical for AWS certifications and real-world architecture. Think of it as the building code for cloud systems: just as a physical building must meet structural, fire safety, and electrical standards, your cloud architecture must meet standards for operations, security, reliability, performance, cost, and sustainability. It provides best practices and design principles for building cloud systems.

The framework does not prescribe one “right” architecture; it gives you a structured way to evaluate trade-offs. A startup might intentionally accept lower reliability for faster time-to-market; a bank might pay more for higher security. The framework helps you make those decisions explicitly rather than accidentally.

What You’ll Learn:
  • The 6 pillars and their design principles
  • Common architectural patterns
  • Trade-offs between pillars
  • How to perform architecture reviews
  • Practical implementation guidance

Overview

The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and implementing designs that scale.
┌────────────────────────────────────────────────────────────────┐
│               Well-Architected Framework Pillars                │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│   │ Operational  │  │   Security   │  │  Reliability │        │
│   │  Excellence  │  │              │  │              │        │
│   │     🔧       │  │     🔒       │  │     💪       │        │
│   └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                                 │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│   │ Performance  │  │    Cost      │  │Sustainability│        │
│   │  Efficiency  │  │ Optimization │  │              │        │
│   │     ⚡       │  │     💰       │  │     🌱       │        │
│   └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

1. Operational Excellence

Run and monitor systems to deliver business value

Design Principles

  • Perform operations as code - Infrastructure as Code (IaC)
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure - Pre-mortems
  • Learn from operational failures

Key Practices

┌────────────────────────────────────────────────────────────────┐
│               Operational Excellence Practices                  │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Infrastructure as Code                                        │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐             │
│   │CloudForm- │    │ Terraform │    │    CDK    │             │
│   │  ation    │    │           │    │           │             │
│   └───────────┘    └───────────┘    └───────────┘             │
│                                                                 │
│   Observability                                                 │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐             │
│   │CloudWatch │    │  X-Ray    │    │  Logs     │             │
│   │  Metrics  │    │ Tracing   │    │ Insights  │             │
│   └───────────┘    └───────────┘    └───────────┘             │
│                                                                 │
│   Automation                                                    │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐             │
│   │  Lambda   │    │   SSM     │    │EventBridge│             │
│   │ Functions │    │ Runbooks  │    │   Rules   │             │
│   └───────────┘    └───────────┘    └───────────┘             │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
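
To make "perform operations as code" concrete, here is a minimal sketch that builds a CloudFormation template as a plain Python dictionary and serializes it to JSON. The function name `build_template` and the logical ID `AppLogsBucket` are illustrative assumptions, not part of any AWS API; in practice you would hand the resulting JSON to CloudFormation, or use Terraform or CDK instead.

```python
import json


def build_template(bucket_logical_id: str = "AppLogsBucket") -> dict:
    """Return a CloudFormation template (as a dict) for one S3 bucket.

    The bucket has versioning and KMS default encryption enabled.
    `bucket_logical_id` is a hypothetical name chosen for this example.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: versioned, KMS-encrypted S3 bucket",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                    "BucketEncryption": {
                        "ServerSideEncryptionConfiguration": [
                            {
                                "ServerSideEncryptionByDefault": {
                                    "SSEAlgorithm": "aws:kms"
                                }
                            }
                        ]
                    },
                },
            }
        },
    }


# Serialize the template so it can be committed to version control and reviewed.
template = build_template()
template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the template is ordinary data, it can be code-reviewed, diffed, and rolled back like any other change, which is what makes small, reversible changes practical.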

2. Security

Protect information, systems, and assets

Design Principles

  • Implement a strong identity foundation - Least privilege
  • Enable traceability - Monitor, alert, audit
  • Apply security at all layers
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data - Reduce manual access
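
As a rough illustration of the least-privilege principle, the sketch below scans an IAM policy document (as a Python dict) for wildcard actions and resources in `Allow` statements. The helper names `find_wildcards` and `_as_list` and both sample policies are hypothetical; this is a simplified lint, not a substitute for IAM Access Analyzer.

```python
def _as_list(value):
    """IAM accepts either a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcards(policy: dict) -> list:
    """Return human-readable findings for wildcard grants in Allow statements."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements with wildcards are not over-permissive
        for action in _as_list(statement.get("Action")):
            if action == "*" or action.endswith(":*"):
                findings.append(f"Statement {index}: wildcard action {action!r}")
        for resource in _as_list(statement.get("Resource")):
            if resource == "*":
                findings.append(f"Statement {index}: wildcard resource '*'")
    return findings


# Hypothetical policies used only for this example.
broad_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-bucket/*",
    },
}

for finding in find_wildcards(broad_policy):
    print(finding)
```

Some actions only support `"Resource": "*"`, so treat findings as prompts for review rather than automatic failures.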

Security Architecture

┌────────────────────────────────────────────────────────────────┐
│                    Defense in Depth                             │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Edge                                                          │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  CloudFront + WAF + Shield                              │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Network                   ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  VPC + Security Groups + NACLs + Flow Logs             │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Compute                   ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  IAM Roles + Instance Metadata + Patching              │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Application               ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  Secrets Manager + Parameter Store + Code Scanning     │   │
│   └────────────────────────────────────────────────────────┘   │
│                             │                                   │
│   Data                      ▼                                   │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  KMS Encryption + Backup + Versioning + Access Logging │   │
│   └────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

3. Reliability

Recover from failures and meet demand

Design Principles

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally
  • Stop guessing capacity
  • Manage change through automation

High Availability Pattern

┌────────────────────────────────────────────────────────────────┐
│                Multi-AZ Architecture                            │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    ┌───────────────┐                           │
│                    │  Route 53     │                           │
│                    │  (DNS + HC)   │                           │
│                    └───────┬───────┘                           │
│                            │                                    │
│                    ┌───────▼───────┐                           │
│                    │     ALB       │                           │
│                    │  (Multi-AZ)   │                           │
│                    └───────┬───────┘                           │
│              ┌─────────────┼─────────────┐                     │
│              │             │             │                     │
│              ▼             ▼             ▼                     │
│        ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│        │   EC2    │  │   EC2    │  │   EC2    │               │
│        │  (AZ-1)  │  │  (AZ-2)  │  │  (AZ-3)  │               │
│        └────┬─────┘  └────┬─────┘  └────┬─────┘               │
│             │             │             │                      │
│             └─────────────┼─────────────┘                      │
│                           │                                    │
│                    ┌──────▼──────┐                             │
│                    │    RDS      │                             │
│                    │  Multi-AZ   │                             │
│                    │(Primary+    │                             │
│                    │ Standby)    │                             │
│                    └─────────────┘                             │
│                                                                 │
│   RTO: Minutes  |  RPO: Near-zero  |  Availability: 99.99%    │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Reliability Metrics

Metric | Definition | What it measures
RTO | Recovery Time Objective | How fast you must recover
RPO | Recovery Point Objective | How much data loss is acceptable
MTTR | Mean Time To Recovery | Average recovery time
MTBF | Mean Time Between Failures | Average uptime between failures
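
MTBF and MTTR combine into a steady-state availability estimate using the standard relation availability = MTBF / (MTBF + MTTR). The sketch below applies it; the function name and the sample figures (720 hours between failures, 1 hour to recover) are assumptions chosen for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails about once a month, takes an hour to recover.
estimate = availability(mtbf_hours=720, mttr_hours=1)
print(f"Estimated availability: {estimate:.4%}")
```

The relation shows two levers: fail less often (raise MTBF) or recover faster (lower MTTR). Automated recovery mostly targets the second.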

4. Performance Efficiency

Use resources efficiently as demand changes

Design Principles

  • Democratize advanced technologies - Use managed services instead of building from scratch
  • Go global in minutes - Deploy to multiple regions with a few clicks
  • Use serverless architectures - Eliminate idle capacity costs
  • Experiment more often - Lower the cost of experimentation
  • Consider mechanical sympathy - Match architecture to workload (e.g., use Graviton ARM instances for compute-bound workloads; they are typically priced roughly 20% below comparable x86 instances and offer better price-performance)

Performance Patterns

┌────────────────────────────────────────────────────────────────┐
│              Performance Optimization Stack                     │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Caching Layers                                                │
│   ┌──────────────────────────────────────────────────────────┐ │
│   │                                                           │ │
│   │   Browser    →   CDN      →   API Cache   →   DB Cache   │ │
│   │   (headers)     (CloudFront)  (API GW)      (ElastiCache) │ │
│   │                                                           │ │
│   └──────────────────────────────────────────────────────────┘ │
│                                                                 │
│   Compute Selection                                             │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐             │
│   │ Right-size  │ │ Graviton    │ │ Spot for    │             │
│   │ instances   │ │ (ARM)       │ │ batch       │             │
│   └─────────────┘ └─────────────┘ └─────────────┘             │
│                                                                 │
│   Database Optimization                                         │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐             │
│   │ Read        │ │ Connection  │ │ Query       │             │
│   │ Replicas    │ │ Pooling     │ │ Optimization│             │
│   └─────────────┘ └─────────────┘ └─────────────┘             │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
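
The DB Cache layer in the diagram is commonly implemented with the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses an in-memory dictionary with a TTL as a stand-in for ElastiCache; `TTLCache`, `slow_db_lookup`, and `get_user` are hypothetical names, and the "database" is simulated.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for ElastiCache)."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl_seconds, value)


CALLS = {"db": 0}  # counts simulated database round trips


def slow_db_lookup(user_id: int) -> dict:
    """Simulated database query; a real one would hit RDS or DynamoDB."""
    CALLS["db"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache: TTLCache, user_id: int) -> dict:
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = slow_db_lookup(user_id)
    cache.set(key, value)
    return value
```

Calling `get_user` twice with the same ID inside the TTL window results in a single database lookup. The trade-off is staleness: a shorter TTL means fresher data but more database load.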

5. Cost Optimization

Avoid unnecessary costs

Design Principles

  • Implement cloud financial management - Assign cost ownership to teams
  • Adopt a consumption model - Pay only for what you use (serverless, auto-scaling)
  • Measure overall efficiency - Cost per transaction, not just total spend
  • Stop spending on undifferentiated heavy lifting - Use managed services (RDS over self-managed MySQL saves 20-40 hours/month in DBA time)
  • Analyze and attribute expenditure - Tag everything; untagged resources are invisible to cost allocation
Cost tip: The three biggest AWS bill surprises for most teams are (1) NAT Gateway data processing ($0.045/GB adds up fast), (2) CloudWatch Logs ingestion ($0.50/GB; a verbose microservice can generate $500/month in logs alone), and (3) idle or forgotten resources (dev EC2 instances running over weekends, unused EBS volumes). Set up AWS Budgets alerts at 50%, 80%, and 100% of expected spend.
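
The arithmetic behind the cost tip can be sketched as follows, using the per-GB rates quoted there. The traffic volumes and the $2,000 expected spend are made-up inputs, and actual rates vary by region, so check current pricing before relying on the numbers.

```python
NAT_PROCESSING_USD_PER_GB = 0.045  # rate quoted in the cost tip
LOGS_INGESTION_USD_PER_GB = 0.50   # rate quoted in the cost tip


def monthly_cost(gb_per_month: float, usd_per_gb: float) -> float:
    """Data volume multiplied by the per-GB rate."""
    return gb_per_month * usd_per_gb


def budget_alert_amounts(expected_spend: float, thresholds=(0.5, 0.8, 1.0)) -> list:
    """Dollar amounts at which each budget alert would fire."""
    return [round(expected_spend * t, 2) for t in thresholds]


# Hypothetical workload: 1 TB of logs and 5 TB through the NAT Gateway per month.
logs_cost = monthly_cost(1000, LOGS_INGESTION_USD_PER_GB)
nat_cost = monthly_cost(5000, NAT_PROCESSING_USD_PER_GB)
print(f"Logs ingestion: ${logs_cost:,.2f}/month")
print(f"NAT processing: ${nat_cost:,.2f}/month")
print("Alert at:", budget_alert_amounts(2000))
```

At $0.50/GB, roughly 1 TB of log ingestion per month is what produces the $500 figure mentioned in the tip.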

Cost Optimization Strategies

┌────────────────────────────────────────────────────────────────┐
│                 Cost Optimization Levers                        │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Right-Sizing                                                  │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  • Use Cost Explorer recommendations                    │   │
│   │  • Monitor with CloudWatch                              │   │
│   │  • Downsize underutilized instances                     │   │
│   └────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Pricing Models (Up to 72% savings)                           │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│   │   Reserved   │  │    Spot      │  │   Savings    │        │
│   │  Instances   │  │  Instances   │  │    Plans     │        │
│   │  (1-3 year)  │  │ (up to 90%)  │  │  (flexible)  │        │
│   └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                                 │
│   Architecture Optimization                                     │
│   ┌────────────────────────────────────────────────────────┐   │
│   │  • Serverless (Lambda, Fargate)                         │   │
│   │  • S3 Lifecycle policies                                │   │
│   │  • Auto-scaling                                         │   │
│   │  • Delete unused resources                              │   │
│   └────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
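
Since untagged resources are invisible to cost allocation, a simple tag audit is a useful first step. The sketch below checks an inventory for the Team, Environment, and Project tags; the inventory structure and resource IDs are invented for illustration, and a real audit would pull this data from the AWS APIs or a tagging report.

```python
REQUIRED_TAGS = {"Team", "Environment", "Project"}


def missing_tags(resources: list) -> dict:
    """Map each non-compliant resource ID to its sorted list of missing tags."""
    report = {}
    for resource in resources:
        missing = REQUIRED_TAGS - set(resource.get("tags", {}))
        if missing:
            report[resource["id"]] = sorted(missing)
    return report


# Hypothetical inventory; IDs and tag values are placeholders.
inventory = [
    {"id": "i-0aaa", "tags": {"Team": "payments", "Environment": "prod", "Project": "checkout"}},
    {"id": "vol-0bbb", "tags": {"Environment": "dev"}},
    {"id": "i-0ccc", "tags": {}},
]

for resource_id, tags in missing_tags(inventory).items():
    print(f"{resource_id} is missing: {', '.join(tags)}")
```

Running a check like this on a schedule turns "tag everything" from a guideline into something that is actually enforced.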

6. Sustainability

Minimize environmental impact

Design Principles

  • Understand your impact
  • Establish sustainability goals
  • Maximize utilization
  • Adopt efficient hardware/software
  • Use managed services
  • Reduce downstream impact

Sustainability Practices

Area | Practice
Compute | Right-size, Graviton (ARM), Spot
Storage | Lifecycle policies, compression
Data | Efficient formats, cold storage
Code | Optimize algorithms, reduce calls

Architecture Review Checklist

well_architected_review = {
    "operational_excellence": [
        "□ Infrastructure as Code? (CloudFormation, Terraform, or CDK -- not console clicks)",
        "□ Monitoring and alerting? (CloudWatch alarms for CPU, errors, latency)",
        "□ Runbooks for incidents? (documented steps, not tribal knowledge)",
        "□ CI/CD pipeline? (automated testing and deployment, not manual deploys)",
        "□ Rollback strategy? (canary deployments, blue-green, or feature flags)",
    ],
    "security": [
        "□ Least privilege IAM? (no * permissions, scoped to specific resources)",
        "□ Encryption at rest/transit? (KMS for data at rest, TLS for transit)",
        "□ Network segmentation? (private subnets for databases, SG references not IPs)",
        "□ Audit logging? (CloudTrail in all regions, log file validation enabled)",
        "□ Secrets management? (Secrets Manager or Parameter Store, never env vars)",
    ],
    "reliability": [
        "□ Multi-AZ deployment? (minimum 2 AZs for production)",
        "□ Health checks? (ALB health checks + Route 53 for DNS failover)",
        "□ Backup strategy? (automated snapshots, cross-region for DR)",
        "□ Disaster recovery tested? (run a DR drill at least quarterly)",
        "□ RTO/RPO defined? (documented and agreed upon with stakeholders)",
    ],
    "performance": [
        "□ Right-sized resources? (use Cost Explorer recommendations)",
        "□ Caching strategy? (ElastiCache for DB, CloudFront for static content)",
        "□ CDN for static content? (reduces latency by 50-90% for global users)",
        "□ Database optimized? (read replicas, connection pooling, query analysis)",
    ],
    "cost": [
        "□ Reserved/Savings Plans for baseline? (40-60% savings on steady-state)",
        "□ Spot instances for fault-tolerant workloads? (up to 90% savings)",
        "□ Unused resources cleaned? (weekly audit of idle EBS, old snapshots)",
        "□ Cost allocation tags on all resources? (Team, Environment, Project)",
        "□ Budget alerts configured? (50%, 80%, 100% thresholds)",
    ],
    "sustainability": [
        "□ Resource utilization >60%? (right-size or consolidate underused instances)",
        "□ Graviton (ARM) instances where possible? (20% better price-performance)",
        "□ Data lifecycle policies? (S3 lifecycle rules, RDS snapshot retention)",
    ]
}
Pro Tip: Use the AWS Well-Architected Tool in the console to perform self-assessments and get improvement recommendations based on your workload.
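
One way to track progress against a checklist like the one above is to record each item with a done/not-done flag and summarize per pillar. This is a minimal sketch under that assumption; `summarize`, `open_items`, and `sample_review` are hypothetical names, and the sample covers only two pillars to stay short.

```python
def summarize(review: dict) -> dict:
    """Return a 'done/total' string for each pillar."""
    return {
        pillar: f"{sum(done for _, done in items)}/{len(items)}"
        for pillar, items in review.items()
    }


def open_items(review: dict) -> list:
    """List (pillar, item) pairs that are still unchecked."""
    return [
        (pillar, item)
        for pillar, items in review.items()
        for item, done in items
        if not done
    ]


# Each entry is (checklist item, completed?). Values here are illustrative.
sample_review = {
    "security": [("Least privilege IAM?", True), ("Audit logging?", False)],
    "reliability": [("Multi-AZ deployment?", True), ("RTO/RPO defined?", True)],
}

print(summarize(sample_review))
for pillar, item in open_items(sample_review):
    print(f"TODO [{pillar}] {item}")
```

The open items list is a natural input for the improvement plan you build after a review.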

🎯 Interview Questions

The six pillars:
  1. Operational Excellence: Run and monitor systems (IaC, automation)
  2. Security: Protect information and assets (IAM, encryption)
  3. Reliability: Recover from failures, meet demand (Multi-AZ, backup)
  4. Performance Efficiency: Use resources efficiently (right-sizing, caching)
  5. Cost Optimization: Avoid unnecessary costs (Reserved, Spot, Savings Plans)
  6. Sustainability: Minimize environmental impact (efficiency, managed services)
Each pillar has design principles and best practices.
High availability strategy:
  1. Multi-AZ deployment across at least 2 AZs
  2. Load balancer for traffic distribution
  3. Auto Scaling for capacity
  4. RDS Multi-AZ for database HA
  5. Route 53 health checks for DNS failover
Availability metrics:
  • 99.9% = 8.76 hours downtime/year
  • 99.99% = 52.5 minutes/year
  • 99.999% = 5.25 minutes/year
Trade-off: Higher availability = higher cost
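
The downtime figures for each availability level follow from simple arithmetic: allowed downtime is (1 - availability) multiplied by the length of a year. A short sketch, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why the cost of achieving it tends to rise sharply.
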
Cost optimization strategies:
  1. Right-size first - don’t over-provision
  2. Reserved/Savings Plans for baseline (60-70%)
  3. Spot instances for fault-tolerant workloads
  4. Serverless for variable workloads
  5. Caching to reduce compute/database load
Trade-offs:
  • Reserved = commitment vs discount
  • Spot = savings vs interruption risk
  • Caching = complexity vs performance
Defense in depth layers:
  1. Edge: CloudFront, WAF, Shield
  2. Network: VPC, Security Groups, NACLs
  3. Compute: IAM roles, patching, hardening
  4. Application: Input validation, secrets management
  5. Data: Encryption at rest/transit, backup
Principle: If one layer fails, others still protect
RTO (Recovery Time Objective):
  • How long to recover from failure
  • “How much downtime is acceptable?”
  • Example: 4 hours RTO = must be back in 4 hours
RPO (Recovery Point Objective):
  • How much data loss is acceptable
  • “How far back in time to recover?”
  • Example: 1 hour RPO = max 1 hour of data loss
Trade-offs:
  • Lower RTO/RPO = more expensive (Multi-AZ, continuous backup)
  • Higher RTO/RPO = cheaper (single-AZ, daily backup)
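
To check a recovery plan against these objectives, compare the backup interval with the RPO (worst-case data loss is the time since the last backup) and the measured restore time with the RTO. The sketch below uses the 4-hour RTO and 1-hour RPO examples from above; `RecoveryPlan` and `meets_objectives` are hypothetical names, and the plan values are made up.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_hours: float   # worst-case data loss if failure hits just before a backup
    restore_duration_hours: float  # time to restore service, ideally measured in a DR drill


def meets_objectives(plan: RecoveryPlan, rto_hours: float, rpo_hours: float) -> dict:
    """Report whether the plan satisfies each objective."""
    return {
        "rto_met": plan.restore_duration_hours <= rto_hours,
        "rpo_met": plan.backup_interval_hours <= rpo_hours,
    }


# Hypothetical plan: daily backups, two-hour restore.
daily_backup_plan = RecoveryPlan(backup_interval_hours=24, restore_duration_hours=2)
result = meets_objectives(daily_backup_plan, rto_hours=4, rpo_hours=1)
print(result)
```

In this example the plan meets the RTO but not the RPO: daily backups can lose up to 24 hours of data, so a 1-hour RPO needs more frequent or continuous backups.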

🧪 Hands-On Lab: Well-Architected Review

Objective: Perform a Well-Architected review on an existing workload
  1. Access Well-Architected Tool: go to AWS Console → Well-Architected Tool → Define Workload
  2. Answer Pillar Questions: go through each pillar’s questions, marking the current state
  3. Review High-Risk Issues: identify HRIs (High Risk Issues) flagged by the tool
  4. Create Improvement Plan: prioritize and create action items for improvements
  5. Implement Changes: address the top 3 issues and re-run the review
