# Observability & Monitoring

> Master CloudWatch, X-Ray, CloudTrail, and AWS monitoring best practices

<Frame>
  <img src="https://mintcdn.com/devweeekends/sTu6A4whRFPJo0_g/images/aws/cloudwatch-observability.svg?fit=max&auto=format&n=sTu6A4whRFPJo0_g&q=85&s=bc3947b4886c5d68a774569e47fe9f6b" alt="AWS Observability Stack" width="1080" height="1080" data-path="images/aws/cloudwatch-observability.svg" />
</Frame>

## Module Overview

<Info>
  **Estimated Time**: 3-4 hours | **Difficulty**: Intermediate | **Prerequisites**: Core Concepts, Compute
</Info>

Observability is critical for running production workloads. Without it, debugging production issues is like diagnosing a patient over the phone with no test results -- you are guessing. Metrics tell you WHAT is happening (fever), logs tell you WHY (infection details), and traces tell you WHERE (which organ). This module covers the complete AWS monitoring stack for logs, metrics, traces, and alerts.

**What You'll Learn:**

* CloudWatch metrics, logs, and alarms
* X-Ray distributed tracing
* CloudTrail for audit logging
* EventBridge for event-driven automation
* Building observability dashboards
* Alerting and incident response

***

## Observability Pillars

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Three Pillars of Observability                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│   │     METRICS     │  │      LOGS       │  │     TRACES      │        │
│   │   (CloudWatch)  │  │  (CloudWatch    │  │    (X-Ray)      │        │
│   │                 │  │     Logs)       │  │                 │        │
│   └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│            │                    │                    │                  │
│   What happened?       Why did it         Where did it                 │
│   (CPU 85%)            happen?            happen?                      │
│                        (error logs)       (service A→B→C)              │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                    AWS Observability Stack                       │  │
│   │                                                                  │  │
│   │   CloudWatch     CloudWatch     X-Ray        CloudTrail         │  │
│   │   Metrics        Logs           Traces       Audit Logs         │  │
│   │      │              │              │              │              │  │
│   │      └──────────────┴──────────────┴──────────────┘              │  │
│   │                            │                                     │  │
│   │                    CloudWatch Dashboards                         │  │
│   │                    CloudWatch Alarms                             │  │
│   │                    EventBridge Automation                        │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

***

## CloudWatch Metrics

Collect and track metrics from AWS services and custom applications.

### Built-in vs Custom Metrics

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Metrics Types                             │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   BUILT-IN METRICS (Free - Basic Monitoring)                           │
│   ─────────────────────────────────────────                            │
│   EC2:        CPUUtilization, NetworkIn/Out, DiskRead/Write            │
│   RDS:        CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS      │
│   Lambda:     Invocations, Duration, Errors, Throttles, ConcurrentExec │
│   ALB:        RequestCount, TargetResponseTime, HTTPCode_Target_2XX    │
│   DynamoDB:   ConsumedRCU, ConsumedWCU, ThrottledRequests              │
│   S3:         BucketSizeBytes, NumberOfObjects                         │
│                                                                         │
│   Resolution: 5 minutes (basic), 1 minute (detailed - extra cost)      │
│                                                                         │
│   CUSTOM METRICS (You publish)                                         │
│   ────────────────────────────                                         │
│   • Application-specific metrics                                       │
│   • Business KPIs (orders/min, signups/hour)                          │
│   • Memory utilization (not built-in for EC2!)                        │
│   • Queue depth, cache hit ratio                                       │
│                                                                         │
│   Resolution: 1 second to 1 minute (high-resolution)                   │
│   Cost: $0.30 per metric per month                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

### Publishing Custom Metrics

```python theme={null}
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

def publish_custom_metric(namespace: str, metric_name: str, 
                          value: float, unit: str = 'Count',
                          dimensions: list = None):
    """
    Publish custom metric to CloudWatch.
    
    Cost tip: Each unique combination of namespace + metric name + dimensions
    counts as one custom metric ($0.30/month). If you add a "RequestId"
    dimension, every single request creates a new metric -- that is thousands
    of dollars/month. Use low-cardinality dimensions like Environment, Region,
    or Service. Never use user IDs, request IDs, or timestamps as dimensions.
    """
    
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        'Timestamp': datetime.now(timezone.utc),
    }
    
    if dimensions:
        metric_data['Dimensions'] = dimensions
    
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_data]
    )

# Example: Track orders per minute
publish_custom_metric(
    namespace='MyApp/Ecommerce',
    metric_name='OrdersPlaced',
    value=42,
    unit='Count',
    dimensions=[
        {'Name': 'Environment', 'Value': 'Production'},
        {'Name': 'Region', 'Value': 'us-east-1'}
    ]
)

# Example: Track memory utilization (not built-in!)
import psutil

publish_custom_metric(
    namespace='MyApp/System',
    metric_name='MemoryUtilization',
    value=psutil.virtual_memory().percent,
    unit='Percent',
    dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ]
)

# High-resolution metrics (1-second granularity)
# Cost warning: High-resolution metrics cost the same per metric ($0.30/month),
# but publishing every second means up to 60x more PutMetricData calls, and
# alarms on high-resolution metrics are priced higher than standard alarms.
# Only use StorageResolution=1 for metrics where second-level precision matters
# (e.g., real-time trading systems). For most applications, 60-second standard
# resolution is sufficient.
cloudwatch.put_metric_data(
    Namespace='MyApp/HighFrequency',
    MetricData=[{
        'MetricName': 'TransactionsPerSecond',
        'Value': 1500,
        'Unit': 'Count/Second',
        'StorageResolution': 1  # 1 = high-res, 60 = standard
    }]
)
```
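
To make the cardinality warning in the docstring above concrete, here is a rough back-of-the-envelope sketch. It assumes the flat \$0.30 per metric per month quoted in this module (real pricing is tiered) and treats every combination of dimension values as actually published, so it is an upper bound. `estimate_custom_metric_cost` is a hypothetical helper, not part of boto3.

```python
from math import prod

PRICE_PER_METRIC_USD = 0.30  # flat rate quoted in this module; real pricing is tiered


def estimate_custom_metric_cost(metric_names: int, dimension_cardinalities: dict) -> float:
    """Upper-bound monthly cost: one billable metric per unique dimension-value combination."""
    # prod() of an empty sequence is 1, which covers metrics without dimensions.
    combinations = prod(dimension_cardinalities.values())
    return round(metric_names * combinations * PRICE_PER_METRIC_USD, 2)


# 5 metric names, 3 environments x 4 regions
low = estimate_custom_metric_cost(5, {"Environment": 3, "Region": 4})
# Same 5 metric names, but with a per-request dimension (100,000 requests assumed)
high = estimate_custom_metric_cost(5, {"Environment": 3, "RequestId": 100_000})

print(f"low-cardinality:  ${low:,.2f}/month")
print(f"high-cardinality: ${high:,.2f}/month")
```

The point is the multiplication: a single high-cardinality dimension dominates the bill regardless of how few metric names you publish.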

### CloudWatch Embedded Metric Format (EMF)

```python theme={null}
import json
from datetime import datetime, timezone

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """
    Embedded Metric Format - publish metrics via logs.
    Automatically extracted by CloudWatch.
    
    Why EMF over PutMetricData? EMF is the preferred approach for Lambda
    because: (1) no additional API call latency added to your function,
    (2) the metric data is embedded in your log output so you get metrics
    AND logs in one write, and (3) you avoid PutMetricData throttling
    limits (150 TPS). The downside is you pay for log ingestion ($0.50/GB),
    but for Lambda you are already paying that regardless.
    """
    emf_log = {
        "_aws": {
            # Epoch milliseconds. A timezone-aware datetime is required here:
            # calling .timestamp() on a naive utcnow() value assumes local time.
            "Timestamp": int(datetime.now(timezone.utc).timestamp() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{
                    "Name": metric_name,
                    "Unit": "Count"
                }]
            }]
        },
        metric_name: value,
        **dimensions
    }
    
    # Print to stdout - CloudWatch Logs extracts the metric
    print(json.dumps(emf_log))

# Usage in Lambda
emit_emf_metric(
    metric_name="OrderValue",
    value=99.99,
    dimensions={"Service": "Checkout", "Environment": "prod"}
)
```

***

## CloudWatch Logs

Centralized log management for all AWS services and applications. CloudWatch Logs is often the single biggest line item on an AWS bill that teams do not expect -- a service generating 100 GB of logs per day costs roughly \$1,500/month in ingestion alone, before storage and queries.

### Log Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs Structure                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   LOG GROUP: /aws/lambda/my-function                                   │
│   ───────────────────────────────────                                  │
│       │                                                                 │
│       ├── LOG STREAM: 2024/01/15/[$LATEST]abc123                       │
│       │       │                                                         │
│       │       ├── Log Event: {"level": "INFO", "msg": "Started"}       │
│       │       ├── Log Event: {"level": "ERROR", "msg": "Failed"}       │
│       │       └── Log Event: {"level": "INFO", "msg": "Completed"}     │
│       │                                                                 │
│       ├── LOG STREAM: 2024/01/15/[$LATEST]def456                       │
│       │       └── ...                                                   │
│       │                                                                 │
│       └── LOG STREAM: 2024/01/16/[$LATEST]ghi789                       │
│               └── ...                                                   │
│                                                                         │
│   RETENTION SETTINGS:                                                  │
│   • 1 day to 10 years (or never expire)                                │
│   • Export to S3 for long-term storage                                 │
│   • Stream to Kinesis/Lambda for real-time processing                  │
│                                                                         │
│   PRICING:                                                              │
│   • Ingestion: $0.50 per GB                                            │
│   • Storage: $0.03 per GB/month                                        │
│   • Queries (Logs Insights): $0.005 per GB scanned                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
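
The pricing figures above can be turned into a quick estimate. This is a minimal sketch that assumes the three list prices shown in the diagram, a 30-day month, and a steady log volume; it ignores compression of stored data and any free tier. `estimate_monthly_logs_cost` is a hypothetical helper.

```python
INGESTION_USD_PER_GB = 0.50
STORAGE_USD_PER_GB_MONTH = 0.03
QUERY_USD_PER_GB_SCANNED = 0.005


def estimate_monthly_logs_cost(gb_per_day: float, retention_days: int,
                               gb_scanned_per_month: float = 0.0,
                               days_in_month: int = 30) -> dict:
    """Rough monthly CloudWatch Logs cost at steady state."""
    ingestion = gb_per_day * days_in_month * INGESTION_USD_PER_GB
    # At steady state roughly `retention_days` worth of logs is stored at any time.
    storage = gb_per_day * retention_days * STORAGE_USD_PER_GB_MONTH
    queries = gb_scanned_per_month * QUERY_USD_PER_GB_SCANNED
    return {
        "ingestion": round(ingestion, 2),
        "storage": round(storage, 2),
        "queries": round(queries, 2),
        "total": round(ingestion + storage + queries, 2),
    }


# 100 GB/day, 30-day retention, 1 TB scanned by Logs Insights per month
print(estimate_monthly_logs_cost(100, retention_days=30, gb_scanned_per_month=1000))
```

With these inputs ingestion accounts for 1,500 of the 1,595 dollars, which is why reducing what you log usually saves more than shortening retention.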

### Structured Logging Best Practices

```python theme={null}
import json
import logging
from datetime import datetime, timezone

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
    
    def _format_log(self, level: str, message: str, **kwargs) -> str:
        log_entry = {
            # Timezone-aware UTC timestamp, so the ISO string carries an explicit +00:00 offset
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "service": self.service_name,
            "message": message,
            **kwargs
        }
        return json.dumps(log_entry)
    
    def info(self, message: str, **kwargs):
        print(self._format_log("INFO", message, **kwargs))
    
    def error(self, message: str, **kwargs):
        print(self._format_log("ERROR", message, **kwargs))
    
    def warn(self, message: str, **kwargs):
        print(self._format_log("WARN", message, **kwargs))

# Usage
logger = StructuredLogger("order-service")

def process_order(order_id: str, user_id: str):
    logger.info(
        "Processing order",
        order_id=order_id,
        user_id=user_id,
        action="process_order"
    )
    
    try:
        # Process order...
        logger.info(
            "Order completed",
            order_id=order_id,
            duration_ms=150,
            action="order_complete"
        )
    except Exception as e:
        logger.error(
            "Order failed",
            order_id=order_id,
            error=str(e),
            action="order_failed"
        )
        raise
```
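
The same idea can also be built on the standard `logging` module, which gives you log levels, handlers, and exception formatting for free. The sketch below is one possible variant, not a replacement for the class above: `JsonFormatter`, `get_logger`, and the `context` key passed through `extra` are names chosen for this example, and the correlation ID is simply a random UUID.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the entry.
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


def get_logger(service_name: str) -> logging.Logger:
    logger = logging.getLogger(service_name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter(service_name))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from printing the record again
    return logger


logger = get_logger("order-service")
correlation_id = str(uuid.uuid4())  # generate once per request and pass it along
logger.info("Processing order",
            extra={"context": {"order_id": "o-123", "correlation_id": correlation_id}})
```

Logging the same `correlation_id` in every service that touches a request lets a single Logs Insights filter pull the whole request history together.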

### CloudWatch Logs Insights

```sql theme={null}
# Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors by service
# (Logs Insights discovers the fields of JSON log events automatically,
# so level, service, and message can be referenced directly)
fields @timestamp, level, service, message
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate p99 latency from Lambda
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

# Find slow API requests (duration_ms is a field in the JSON log event)
fields @timestamp, message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

# Error rate over time
# strcontains() returns 1 on a match and 0 otherwise, so sum() counts errors
filter ispresent(level)
| fields strcontains(level, "ERROR") as is_error
| stats count(*) as total,
        sum(is_error) as errors,
        errors / total * 100 as error_rate
  by bin(5m)
```
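
Queries like these can also be run from code, for example in a scheduled report. The sketch below uses the `start_query` and `get_query_results` operations of the boto3 CloudWatch Logs client and polls until the query finishes. `run_insights_query` is a hypothetical helper; the client is passed in as a parameter so the function can be exercised without AWS credentials, and the polling limits are arbitrary defaults.

```python
import time


def run_insights_query(logs_client, log_group: str, query: str,
                       start_time: int, end_time: int,
                       poll_seconds: float = 1.0, max_polls: int = 60) -> list:
    """Start a Logs Insights query and poll until it completes.

    `logs_client` is expected to be boto3.client("logs").
    `start_time` and `end_time` are epoch seconds.
    Returns one dict per result row, mapping field name to value.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Query {query_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Query {query_id} did not complete after {max_polls} polls")


# Usage (requires AWS credentials and an existing log group):
#   import boto3
#   now = int(time.time())
#   rows = run_insights_query(
#       boto3.client("logs"), "/aws/lambda/my-function",
#       'filter level = "ERROR" | stats count(*) as error_count by service',
#       start_time=now - 3600, end_time=now,
#   )
```

Keeping the time window narrow matters here too, since Logs Insights bills by the amount of data scanned.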

***

## X-Ray Distributed Tracing

Trace requests across microservices to identify bottlenecks and errors. While metrics tell you "p99 latency is 3 seconds" and logs tell you "this function threw an error," traces tell you "the request spent 2.5 of those 3 seconds waiting on a DynamoDB call that was throttled." Traces are the only tool that gives you a request-level view across service boundaries.

### X-Ray Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Distributed Tracing                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Request Flow with Trace:                                             │
│                                                                         │
│   Client                                                                │
│     │                                                                   │
│     │  Trace ID: 1-abc123-def456789                                    │
│     ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  API Gateway                                                     │  │
│   │  Segment: 50ms                                                   │  │
│   └───────────────────────────┬─────────────────────────────────────┘  │
│                               │                                         │
│                               ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Lambda: Order Service                                           │  │
│   │  Segment: 200ms                                                  │  │
│   │  ├── Subsegment: DynamoDB Query (30ms)                          │  │
│   │  ├── Subsegment: External API Call (100ms)                      │  │
│   │  └── Subsegment: SNS Publish (20ms)                             │  │
│   └───────────────────────────┬─────────────────────────────────────┘  │
│                               │                                         │
│              ┌────────────────┴────────────────┐                       │
│              ▼                                 ▼                       │
│   ┌──────────────────────┐        ┌──────────────────────┐            │
│   │  Lambda: Inventory   │        │  Lambda: Payment      │            │
│   │  Segment: 80ms       │        │  Segment: 150ms       │            │
│   │  └── DynamoDB (50ms) │        │  └── Stripe API (120ms)│           │
│   └──────────────────────┘        └──────────────────────┘            │
│                                                                         │
│   Service Map (auto-generated):                                        │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐        │
│   │   API    │───►│  Order   │───►│ Inventory│───►│ DynamoDB │        │
│   │ Gateway  │    │ Service  │    │ Service  │    │          │        │
│   └──────────┘    └────┬─────┘    └──────────┘    └──────────┘        │
│                        │                                                │
│                        └──────────►┌──────────┐                        │
│                                    │ Payment  │                        │
│                                    │ Service  │                        │
│                                    └──────────┘                        │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

### Instrumenting Lambda with X-Ray

```python theme={null}
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, etc.)
patch_all()

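dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')

# Placeholder business logic so the example is self-contained;
# replace these with your own implementations.
def validate_order(order_id: str) -> None:
    if not order_id:
        raise ValueError("order_id is required")

def call_payment_api(order: dict) -> dict:
    return {"status": "ok", "order_id": order.get("order_id")}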

@xray_recorder.capture('process_order')
def process_order(order_id: str):
    """Process order with X-Ray tracing."""
    
    # Add annotation (indexed, searchable)
    xray_recorder.put_annotation('order_id', order_id)
    
    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'order_id': order_id,
        'timestamp': '2024-01-15T10:00:00Z'
    })
    
    # Subsegment for custom operation
    with xray_recorder.in_subsegment('validate_order') as subsegment:
        subsegment.put_annotation('validation_type', 'full')
        validate_order(order_id)
    
    # DynamoDB call is automatically traced
    result = table.get_item(Key={'order_id': order_id})
    
    # External API call with custom subsegment
    with xray_recorder.in_subsegment('external_api') as subsegment:
        subsegment.put_metadata('api', 'payment_gateway')
        response = call_payment_api(result['Item'])
    
    return response

def lambda_handler(event, context):
    order_id = event.get('order_id')
    return process_order(order_id)
```
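
Traces become far more useful when log lines carry the same trace ID. A minimal sketch of one way to do that: Lambda exposes the current tracing header in the `_X_AMZN_TRACE_ID` environment variable, in the form `Root=...;Parent=...;Sampled=...`, and the helper below extracts the `Root` value so it can be added to structured logs. `parse_trace_header` and `current_trace_id` are hypothetical helper names, and the sample header value is illustrative.

```python
import os


def parse_trace_header(header: str) -> dict:
    """Split an X-Ray tracing header into its key/value parts."""
    parts = {}
    for item in header.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            parts[key] = value
    return parts


def current_trace_id(default: str = "") -> str:
    """Return the X-Ray trace ID of the current Lambda invocation, if any."""
    header = os.environ.get("_X_AMZN_TRACE_ID", "")
    return parse_trace_header(header).get("Root", default)


sample = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
print(parse_trace_header(sample)["Root"])
```

Logging `current_trace_id()` as a `trace_id` field lets you jump from a slow trace in X-Ray straight to the matching log events in Logs Insights.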

***

## CloudWatch Alarms

Automated alerts and actions based on metric thresholds.

### Alarm Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Alarm States                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Alarm States:                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │    ┌──────────┐        ┌──────────┐        ┌──────────┐        │  │
│   │    │    OK    │◄──────►│ ALARM    │◄──────►│INSUFFICIENT│       │  │
│   │    │  (green) │        │  (red)   │        │   DATA     │       │  │
│   │    └──────────┘        └──────────┘        └──────────┘        │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Alarm Components:                                                     │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   METRIC           STATISTIC       PERIOD      THRESHOLD        │  │
│   │   CPUUtilization   Average         5 minutes   > 80%            │  │
│   │                                                                  │  │
│   │   EVALUATION PERIODS: 3                                         │  │
│   │   (3 consecutive 5-min periods above 80% = ALARM)               │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Alarm Actions:                                                        │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   • SNS Topic (email, SMS, Lambda)                              │  │
│   │   • Auto Scaling (scale up/down)                                │  │
│   │   • EC2 Actions (stop, terminate, reboot)                       │  │
│   │   • Systems Manager (run automation)                            │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
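
The evaluation rule in the diagram (3 consecutive periods above the threshold) can be sketched in a few lines. This is a deliberately simplified model for building intuition: real CloudWatch alarms also support "M out of N" datapoints and configurable treatment of missing data, neither of which is modeled here. `evaluate_alarm` is a hypothetical function.

```python
def evaluate_alarm(datapoints: list, threshold: float, evaluation_periods: int) -> str:
    """Return the alarm state for the most recent `evaluation_periods` datapoints.

    Each datapoint is the statistic for one period (e.g., 5-minute average CPU);
    None stands for a period with no data.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(value is None for value in recent):
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Last three 5-minute averages are all above 80% -> ALARM
print(evaluate_alarm([70, 85, 90, 95], threshold=80, evaluation_periods=3))
# One of the last three dipped below 80% -> OK
print(evaluate_alarm([85, 90, 75], threshold=80, evaluation_periods=3))
```

Requiring several consecutive breaching periods is what keeps a single short spike from paging anyone.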

### Essential Alarms (Terraform)

```hcl theme={null}
# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300  # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  
  # Alarm actions fire when state transitions to ALARM.
  # Triggering both notification AND auto-scaling is a production best practice:
  # auto-scaling handles the immediate capacity need while the team investigates.
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  
  # OK actions fire when alarm returns to normal -- useful for "all clear"
  # notifications so on-call engineers know the issue resolved automatically.
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  alarm_description   = "Lambda error rate exceeds 5%"
  
  metric_query {
    id          = "error_rate"
    expression  = "(errors / invocations) * 100"
    label       = "Error Rate"
    return_data = true
  }
  
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }
  
  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DynamoDB Throttling Alarm
# Why threshold 0? Because ANY throttling means you are losing requests.
# DynamoDB throttling is silent -- your application gets a 400 error but
# CloudWatch won't page you unless you set this up. Even one throttled
# request can cascade into retries that cause more throttling.
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttle" {
  alarm_name          = "dynamodb-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  
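  # ThrottledRequests is published per operation, so the alarm needs an
  # Operation dimension as well (create one alarm per operation you care about).
  dimensions = {
    TableName = "my-table"
    Operation = "PutItem"
  }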
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "critical-system-alarm"
  
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
```

***

## CloudTrail (Audit Logging)

Track all API calls for security and compliance.

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudTrail Architecture                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Every AWS API Call:                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   User/Role ───► AWS API ───► CloudTrail ───► S3 Bucket         │  │
│   │                                    │                             │  │
│   │                                    └──► CloudWatch Logs          │  │
│   │                                    └──► EventBridge (real-time)  │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Event Types:                                                          │
│   ─────────────                                                         │
│   • Management Events: Control plane (CreateBucket, RunInstances)      │
│   • Data Events: Data plane (S3 GetObject, Lambda Invoke)              │
│   • Insights Events: Unusual API activity detection                    │
│                                                                         │
│   Sample Event:                                                         │
│   {                                                                     │
│     "eventTime": "2024-01-15T10:30:00Z",                               │
│     "eventSource": "s3.amazonaws.com",                                 │
│     "eventName": "DeleteBucket",                                       │
│     "userIdentity": {                                                   │
│       "type": "IAMUser",                                               │
│       "userName": "admin",                                             │
│       "arn": "arn:aws:iam::123456789012:user/admin"                    │
│     },                                                                  │
│     "sourceIPAddress": "203.0.113.50",                                 │
│     "requestParameters": { "bucketName": "my-bucket" }                 │
│   }                                                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

### CloudTrail Best Practices

```python theme={null}
# CloudTrail configuration checklist
cloudtrail_config = {
    "multi_region": True,           # Trail in all regions
    "log_file_validation": True,    # Detect tampering
    "s3_encryption": "SSE-KMS",     # Encrypt logs
    "cloudwatch_logs": True,        # Real-time analysis
    "data_events": {
        "s3": ["arn:aws:s3:::sensitive-bucket/*"],
        "lambda": True
    },
    "insights": True,               # Anomaly detection
    "organization_trail": True      # Multi-account
}

# Real-time alerting with EventBridge
# Detect root user login
eventbridge_rule = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {
        "userIdentity": {
            "type": ["Root"]
        }
    }
}
```
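
To see how a pattern like the one above selects events, here is a simplified matcher. It only models the subset used in this example (nested objects plus lists meaning "any of these values"); real EventBridge patterns support more operators such as prefix and numeric matching, which are not implemented here. `matches` and the two sample events are illustrative.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if `event` satisfies `pattern` (exact-value matching only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object in the pattern: recurse into the same key of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # A list in the pattern means "any of these values".
            return False
    return True


root_login_pattern = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}

root_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"eventName": "ConsoleLogin", "userIdentity": {"type": "Root"}},
}
iam_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"eventName": "ConsoleLogin", "userIdentity": {"type": "IAMUser"}},
}

print(matches(root_login_pattern, root_event))
print(matches(root_login_pattern, iam_event))
```

Fields that the pattern does not mention (such as `eventName` here) are ignored, so a pattern only needs to list what distinguishes the events you care about.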

***

## 🎯 Interview Questions

<AccordionGroup>
  <Accordion title="Q1: How would you debug a slow API response?">
    **Systematic approach:**

    1. **X-Ray Trace**: Find the specific slow request
       * Identify which service/subsegment is slow
       * Check annotations for context

    2. **CloudWatch Metrics**: Check historical patterns
       * Is this a spike or gradual increase?
       * Correlate with CPU, memory, connections

    3. **CloudWatch Logs**: Find related errors
       ```sql theme={null}
       fields @timestamp, @message
       | filter trace_id = "1-abc123..."
       | sort @timestamp
       ```

    4. **Service-specific checks**:
       * Lambda: Cold starts? Memory sufficient?
       * DynamoDB: Throttling? Hot partition?
       * RDS: Connection pool exhausted?
  </Accordion>

  <Accordion title="Q2: What metrics should you monitor for a web application?">
    **Essential metrics by layer:**

    **Load Balancer:**

    * RequestCount, TargetResponseTime
    * HTTPCode\_Target\_5XX, HTTPCode\_ELB\_5XX
    * HealthyHostCount, UnhealthyHostCount

    **Compute (EC2/Lambda):**

    * CPUUtilization, MemoryUtilization (custom)
    * Lambda: Duration, Errors, Throttles

    **Database:**

    * CPUUtilization, FreeableMemory
    * DatabaseConnections, ReadIOPS, WriteIOPS
    * DynamoDB: ThrottledRequests

    **Application:**

    * Error rate, Latency (p50, p95, p99)
    * Requests per second
    * Business metrics (orders, signups)
  </Accordion>

  <Accordion title="Q3: How do you reduce CloudWatch Logs costs?">
    **Cost optimization strategies:**

    1. **Reduce ingestion**:
       * Filter logs at source (log level INFO not DEBUG)
       * Use sampling for high-volume logs

    2. **Optimize retention**:
       * Set appropriate retention (7-30 days for most)
       * Export to S3 for long-term (cheaper)

    3. **Use Logs Insights efficiently**:
       * Narrow time ranges
       * Use specific log groups
       * Cache common queries

    4. **Consider alternatives**:
       * Kinesis Firehose → S3 for high volume
       * OpenSearch for complex analysis
  </Accordion>

  <Accordion title="Q4: How do you set up alerting for a production system?">
    **Alert hierarchy:**

    1. **Critical (PagerDuty/immediate)**:
       * Service down (health check failures)
       * Error rate > 5%
       * Latency p99 > 5s
       * Security events (root login)

    2. **Warning (Slack/email)**:
       * Error rate > 1%
       * CPU > 80% sustained
       * Disk > 85%
       * Approaching quotas

    3. **Informational (dashboard)**:
       * Deployment events
       * Scaling events
       * Cost anomalies

    **Best practices:**

    * Avoid alert fatigue (tune thresholds)
    * Use composite alarms
    * Include runbook links in alerts
  </Accordion>

  <Accordion title="Q5: CloudWatch vs third-party observability tools?">
    **CloudWatch advantages:**

    * Native integration, no agents for AWS services
    * Lower cost for basic use cases
    * No data egress charges

    **Third-party advantages (Datadog, New Relic):**

    * Better visualization and correlation
    * APM with code-level insights
    * Multi-cloud support
    * More powerful querying

    **Hybrid approach:**

    * Use CloudWatch for AWS metrics/logs
    * Stream to third-party for analysis
    * Keep costs balanced
  </Accordion>
</AccordionGroup>

***

## 🧪 Hands-On Lab: Build Observability Dashboard

<Steps>
  <Step title="Enable X-Ray on Lambda">
    Add X-Ray SDK and enable active tracing
  </Step>

  <Step title="Create CloudWatch Dashboard">
    Add widgets for key metrics (CPU, errors, latency)
  </Step>

  <Step title="Set Up Structured Logging">
    Implement JSON logging with correlation IDs
  </Step>

  <Step title="Create Alarms">
    Set up CPU, error rate, and latency alarms
  </Step>

  <Step title="Configure CloudTrail">
    Enable multi-region trail with CloudWatch Logs
  </Step>
</Steps>
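
For the dashboard step, a dashboard is just a JSON document of widgets placed on a 24-column grid. The sketch below builds such a body in Python; the function name, ASG name, region, and dashboard name are placeholders for this lab, and the `put_dashboard` call is left commented out because it needs AWS credentials.

```python
import json


def metric_widget(title: str, metrics: list, x: int, y: int,
                  stat: str = "Average", period: int = 300,
                  region: str = "us-east-1", width: int = 12, height: int = 6) -> dict:
    """Build one metric widget; x/y/width/height are grid units."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "metrics": metrics,  # each entry: [namespace, metric, dimName, dimValue, ...]
            "stat": stat,
            "period": period,
            "region": region,
            "view": "timeSeries",
        },
    }


def build_dashboard_body(function_name: str, asg_name: str) -> str:
    widgets = [
        metric_widget("EC2 CPU",
                      [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", asg_name]],
                      x=0, y=0),
        metric_widget("Lambda errors",
                      [["AWS/Lambda", "Errors", "FunctionName", function_name]],
                      x=12, y=0, stat="Sum"),
        metric_widget("Lambda p99 duration",
                      [["AWS/Lambda", "Duration", "FunctionName", function_name]],
                      x=0, y=6, stat="p99"),
    ]
    return json.dumps({"widgets": widgets})


body = build_dashboard_body("my-function", "web-asg")
print(body)

# To create the dashboard (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="observability-lab", DashboardBody=body)
```

Generating the body from code keeps dashboards reviewable and repeatable across environments, in the same spirit as the Terraform alarms earlier in this module.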

***

## Next Module

<Card title="CDN & Edge Services" icon="globe" href="/aws/cdn-edge">
  Master CloudFront, Global Accelerator, and edge computing
</Card>