# Amazon CloudWatch

> Master CloudWatch metrics, logs, alarms, dashboards, Log Insights, and operational monitoring

<Frame>
  <img src="https://mintcdn.com/devweeekends/sTu6A4whRFPJo0_g/images/aws/cloudwatch-deep-dive.svg?fit=max&auto=format&n=sTu6A4whRFPJo0_g&q=85&s=c474595b010a4d9ff8feaef1b456af02" alt="CloudWatch Deep Dive" width="1080" height="1080" data-path="images/aws/cloudwatch-deep-dive.svg" />
</Frame>

## Module Overview

<Info>
  **Estimated Time**: 4-5 hours | **Difficulty**: Intermediate | **Prerequisites**: Core Concepts
</Info>

Amazon CloudWatch is the unified monitoring and observability service for AWS. Think of it as the nervous system of your cloud infrastructure: it collects signals (metrics), records conversations (logs), and triggers reflexes (alarms and auto-scaling actions) when something goes wrong. If you only learn one AWS observability service, make it CloudWatch, because most AWS services publish their metrics to it and many third-party monitoring tools read from it. This module is a deep dive into the CloudWatch capabilities you need for production monitoring.

**What You'll Learn:**

* CloudWatch Metrics (built-in and custom)
* CloudWatch Logs and Log Insights
* CloudWatch Alarms and composite alarms
* Dashboards and visualization
* CloudWatch Synthetics (canaries)
* CloudWatch Contributor Insights
* CloudWatch Anomaly Detection
* Cross-account and cross-region monitoring

***

## CloudWatch Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Architecture                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Data Sources                        CloudWatch                        │
│   ────────────                        ──────────                       │
│   ┌──────────────┐                   ┌───────────────────────────────┐ │
│   │ EC2, RDS,    │───────────────────│         METRICS               │ │
│   │ Lambda, etc. │   Auto-collected  │  • Standard (5-min)           │ │
│   └──────────────┘                   │  • Detailed (1-min)           │ │
│                                      │  • High-res (1-sec)           │ │
│   ┌──────────────┐                   └───────────────────────────────┘ │
│   │ Applications │   PutMetricData   ┌───────────────────────────────┐ │
│   │ (Custom)     │───────────────────│      CUSTOM METRICS           │ │
│   └──────────────┘                   │  • Business KPIs              │ │
│                                      │  • App-specific metrics       │ │
│   ┌──────────────┐                   └───────────────────────────────┘ │
│   │ Lambda, ECS, │   Auto-collected  ┌───────────────────────────────┐ │
│   │ EC2, VPC     │───────────────────│          LOGS                 │ │
│   └──────────────┘                   │  • Log Groups                 │ │
│                                      │  • Log Streams                │ │
│   ┌──────────────┐                   │  • Log Insights               │ │
│   │ Applications │   Agent/SDK       └───────────────────────────────┘ │
│   └──────────────┘                                                     │
│                                      ┌───────────────────────────────┐ │
│   Processing & Actions               │         ALARMS                │ │
│   ─────────────────────              │  • Metric Alarms              │ │
│   • Alarms → SNS, Auto Scaling       │  • Composite Alarms           │ │
│   • EventBridge → Lambda, etc.       │  • Anomaly Detection          │ │
│   • Dashboards → Visualization       └───────────────────────────────┘ │
│   • Contributor Insights                                               │
│                                      ┌───────────────────────────────┐ │
│                                      │       DASHBOARDS              │ │
│                                      │  • Widgets                    │ │
│                                      │  • Cross-account view         │ │
│                                      │  • Sharing                    │ │
│                                      └───────────────────────────────┘ │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

***

## CloudWatch Metrics

### Built-in Metrics

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Built-in Metrics by Service                          │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   EC2 (Basic - 5 min | Detailed - 1 min):                              │
│   ─────────────────────────────────────                                │
│   • CPUUtilization       • DiskReadOps         • DiskWriteOps          │
│   • NetworkIn            • NetworkOut          • StatusCheckFailed     │
│   ⚠️ Memory NOT included (use CloudWatch Agent)                        │
│                                                                         │
│   Lambda:                                                               │
│   ───────                                                               │
│   • Invocations          • Duration            • Errors                │
│   • Throttles            • ConcurrentExec      • UnreservedConcurrent  │
│   • IteratorAge (streams)                                              │
│                                                                         │
│   RDS:                                                                  │
│   ────                                                                  │
│   • CPUUtilization       • DatabaseConnections • FreeableMemory        │
│   • ReadIOPS             • WriteIOPS           • ReadLatency           │
│   • WriteLatency         • FreeStorageSpace                            │
│                                                                         │
│   DynamoDB:                                                             │
│   ─────────                                                             │
│   • ConsumedRCU          • ConsumedWCU         • ProvisionedRCU        │
│   • ProvisionedWCU       • ThrottledRequests   • SystemErrors          │
│   • ReturnedItemCount    • SuccessfulRequestLatency                    │
│                                                                         │
│   Application Load Balancer:                                            │
│   ──────────────────────────                                           │
│   • RequestCount         • TargetResponseTime  • HTTPCode_Target_2XX   │
│   • HTTPCode_Target_4XX  • HTTPCode_Target_5XX • HealthyHostCount      │
│   • UnHealthyHostCount   • ActiveConnectionCount                       │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

### Metric Dimensions

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Metric Dimensions                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Metrics are uniquely identified by:                                   │
│   Namespace + Metric Name + Dimensions                                  │
│                                                                         │
│   Example:                                                              │
│   ─────────                                                             │
│   Namespace: AWS/EC2                                                    │
│   Metric: CPUUtilization                                                │
│   Dimensions: InstanceId=i-1234567890abcdef0                           │
│                                                                         │
│   Each unique combination creates a separate time series:               │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  AWS/EC2 | CPUUtilization | InstanceId=i-abc123      → Series 1 │  │
│   │  AWS/EC2 | CPUUtilization | InstanceId=i-def456      → Series 2 │  │
│   │  AWS/EC2 | CPUUtilization | AutoScalingGroupName=web → Series 3 │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Aggregation Example:                                                  │
│   ────────────────────                                                 │
│   Query by ASG dimension to get aggregate across all instances:        │
│   aws cloudwatch get-metric-statistics \                               │
│     --namespace AWS/EC2 \                                              │
│     --metric-name CPUUtilization \                                     │
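│     --dimensions Name=AutoScalingGroupName,Value=my-asg \              │
│     --start-time 2024-01-15T00:00:00Z \                                │
│     --end-time 2024-01-15T01:00:00Z \                                  │
│     --period 300 --statistics Average                                  │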
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
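The same aggregate query can be issued from Python. The following is a minimal sketch, assuming `boto3` is installed and AWS credentials are configured; the helper names `build_asg_cpu_query` and `fetch_asg_cpu` and the group name `my-asg` are illustrative, not part of any AWS API. Note that `get_metric_statistics` needs a time range, a period, and at least one statistic in addition to the namespace, metric name, and dimensions.

```python
from datetime import datetime, timedelta, timezone


def build_asg_cpu_query(asg_name: str, hours: int = 1, period: int = 300) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,            # seconds covered by each data point
        "Statistics": ["Average"],
    }


def fetch_asg_cpu(asg_name: str) -> list:
    """Run the query; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**build_asg_cpu_query(asg_name))
    # Datapoints are not guaranteed to be ordered, so sort by timestamp.
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
```

Separating the request builder from the network call keeps the query definition easy to inspect and test without touching AWS.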

### Custom Metrics

```python theme={null}
import boto3
from datetime import datetime, timezone
from typing import Optional

cloudwatch = boto3.client('cloudwatch')

def publish_business_metric(metric_name: str, value: float,
                            dimensions: Optional[list] = None):
    """
    Publish custom business metric to CloudWatch.

    Cost: $0.30 per metric per month (first 10,000).

    IMPORTANT: Each unique combination of metric name + dimensions creates
    a separate billable metric. If you have 3 dimensions with 10 values each,
    that is up to 10 x 10 x 10 = 1,000 metrics = $300/month. Plan dimensions carefully.

    Common mistake: using high-cardinality values (like user_id or request_id)
    as dimensions -- this creates millions of metrics and a massive bill.
    Use dimensions for categories (region, environment, product_type), not identifiers.
    """
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': 'Count',
        'Timestamp': datetime.now(timezone.utc),  # timezone-aware UTC timestamp
        'StorageResolution': 60  # Standard resolution (60 = 1 min)
    }

    if dimensions:
        metric_data['Dimensions'] = dimensions

    cloudwatch.put_metric_data(
        Namespace='MyApplication',
        MetricData=[metric_data]
    )

# Example: Track orders placed
publish_business_metric(
    metric_name='OrdersPlaced',
    value=1,
    dimensions=[
        {'Name': 'Environment', 'Value': 'production'},
        {'Name': 'Region', 'Value': 'us-east-1'},
        {'Name': 'ProductCategory', 'Value': 'electronics'}
    ]
)

# High-resolution metric (1-second granularity)
def publish_high_res_metric(metric_name: str, value: float):
    """High-resolution metric for real-time monitoring.

    Standard resolution (60s) is sufficient for most use cases.
    Use 1-second resolution only for latency-critical paths where you need
    to detect spikes within seconds (e.g., payment processing, trading systems).
    Cost: same per-metric price, but alarms that evaluate the metric at
    sub-minute periods are billed at the higher high-resolution alarm rate.
    """
    cloudwatch.put_metric_data(
        Namespace='MyApplication/HighFrequency',
        MetricData=[{
            'MetricName': metric_name,
            'Value': value,
            'Unit': 'Milliseconds',
            'Timestamp': datetime.now(timezone.utc),
            'StorageResolution': 1  # 1-second resolution
        }]
    )

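# Batch publish (up to 1000 metrics per request, 1 MB max payload)
def publish_batch_metrics(metrics: list, batch_size: int = 1000):
    """Publish metrics in chunks so nothing past the per-request limit is dropped."""
    for start in range(0, len(metrics), batch_size):
        cloudwatch.put_metric_data(
            Namespace='MyApplication',
            MetricData=metrics[start:start + batch_size]
        )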
```
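The dimension arithmetic in the docstring above is worth checking before you ship a new metric. The sketch below is a rough upper-bound estimator, assuming every combination of dimension values is actually published and that the $0.30 first-tier price quoted above applies to all metrics; the function name `estimate_custom_metric_cost` is illustrative.

```python
from math import prod


def estimate_custom_metric_cost(cardinalities, price_per_metric=0.30):
    """Return (metric_count, monthly_cost_usd) for one metric name.

    cardinalities: number of distinct values for each dimension.
    price_per_metric: assumed flat price; real pricing is tiered, so large
    counts are overstated by this sketch.
    """
    metric_count = prod(cardinalities)
    return metric_count, round(metric_count * price_per_metric, 2)


# Three categorical dimensions with 10 values each
count, cost = estimate_custom_metric_cost([10, 10, 10])
print(f"{count:,} metrics, about ${cost:,.2f}/month")

# Same metric with a user_id dimension of 50,000 values added
count, cost = estimate_custom_metric_cost([10, 10, 10, 50_000])
print(f"{count:,} metrics, about ${cost:,.2f}/month (upper bound)")
```

Because only combinations that are actually published are billed, the real figure is usually lower, but the multiplication shows why identifiers make poor dimensions.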

### CloudWatch Agent

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Agent                                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   The CloudWatch Agent collects metrics and logs from:                 │
│   • EC2 instances (Linux & Windows)                                    │
│   • On-premises servers                                                │
│   • Containers                                                         │
│                                                                         │
│   Metrics NOT available without agent:                                 │
│   ────────────────────────────────────                                 │
│   • Memory utilization                                                 │
│   • Disk space utilization                                             │
│   • Disk I/O                                                           │
│   • Network connections                                                │
│   • Process information                                                │
│                                                                         │
│   Configuration (amazon-cloudwatch-agent.json):                        │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  {                                                               │  │
│   │    "metrics": {                                                  │  │
│   │      "namespace": "MyApp/EC2",                                   │  │
│   │      "metrics_collected": {                                      │  │
│   │        "mem": {                                                  │  │
│   │          "measurement": ["mem_used_percent"]                     │  │
│   │        },                                                        │  │
│   │        "disk": {                                                 │  │
│   │          "measurement": ["disk_used_percent"],                   │  │
│   │          "resources": ["/", "/data"]                             │  │
│   │        }                                                         │  │
│   │      }                                                           │  │
│   │    },                                                            │  │
│   │    "logs": {                                                     │  │
│   │      "logs_collected": {                                         │  │
│   │        "files": {                                                │  │
│   │          "collect_list": [{                                      │  │
│   │            "file_path": "/var/log/myapp/*.log",                  │  │
│   │            "log_group_name": "myapp-logs",                       │  │
│   │            "log_stream_name": "{instance_id}"                    │  │
│   │          }]                                                      │  │
│   │        }                                                         │  │
│   │      }                                                           │  │
│   │    }                                                             │  │
│   │  }                                                               │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
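Since the configuration above is drawn inside a box and cannot be copied directly, the following sketch rebuilds the same structure as a Python dictionary and prints it as JSON. The helper name `build_agent_config` and its parameters are illustrative; the keys and values are the ones shown in the box.

```python
import json


def build_agent_config(namespace: str, log_path: str, log_group: str,
                       disk_paths: list) -> dict:
    """Assemble a CloudWatch Agent config equivalent to the one shown above."""
    return {
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["disk_used_percent"],
                    "resources": disk_paths,
                },
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [{
                        "file_path": log_path,
                        "log_group_name": log_group,
                        "log_stream_name": "{instance_id}",
                    }]
                }
            }
        },
    }


config = build_agent_config("MyApp/EC2", "/var/log/myapp/*.log",
                            "myapp-logs", ["/", "/data"])
print(json.dumps(config, indent=2))
```

Generating the file from code makes it easy to vary the namespace or log paths per environment while keeping one template.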

***

## CloudWatch Logs

### Log Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs Structure                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   LOG GROUP: /aws/lambda/order-service                                 │
│   ────────────────────────────────────                                 │
│   • Retention: 1 day to 10 years, or never expire                      │
│   • Encryption: Optional KMS encryption                                │
│   • Metric Filters: Extract metrics from logs                          │
│   • Subscription Filters: Stream to Kinesis/Lambda                     │
│                                                                         │
│   └── LOG STREAM: 2024/01/15/[$LATEST]abc123                          │
│       └── LOG EVENT: {"timestamp": 1705312800000,                      │
│                       "message": "Order processed: ORD-123"}           │
│       └── LOG EVENT: {"timestamp": 1705312801000,                      │
│                       "message": "Payment confirmed: PAY-456"}         │
│                                                                         │
│   └── LOG STREAM: 2024/01/15/[$LATEST]def456                          │
│       └── ...                                                          │
│                                                                         │
│   Pricing (us-east-1):                                                 │
│   ────────────────────                                                 │
│   • Ingestion: $0.50 per GB                                            │
│   • Storage: $0.03 per GB/month                                        │
│   • Log Insights queries: $0.005 per GB scanned                        │
│   • Export to S3: Free (but S3 storage costs apply)                    │
│                                                                         │
│   Retention Settings:                                                  │
│   ───────────────────                                                  │
│   1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, 2 months,          │
│   3 months, 4 months, 5 months, 6 months, 1 year, 13 months,          │
│   18 months, 2 years, 3 years, 5 years, 6 years, 7 years, 8 years,    │
│   9 years, 10 years, Never Expire                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
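Retention is set per log group, and the API only accepts specific day counts. The sketch below rounds a requested retention up to the next accepted value before calling `put_retention_policy`. It assumes `boto3` and configured credentials for the actual call; the list of accepted values reflects the API documentation at the time of writing and should be treated as an assumption, and the helper names are illustrative.

```python
import bisect

# Day counts accepted by CloudWatch Logs (assumed current; check the API docs).
VALID_RETENTION_DAYS = [1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400,
                        545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653]


def nearest_valid_retention(requested_days: int) -> int:
    """Round a requested retention up to the next value the API accepts."""
    if requested_days < 1:
        raise ValueError("retention must be at least 1 day")
    index = bisect.bisect_left(VALID_RETENTION_DAYS, requested_days)
    if index == len(VALID_RETENTION_DAYS):
        raise ValueError("longer than 10 years: leave the group at Never Expire")
    return VALID_RETENTION_DAYS[index]


def set_retention(log_group: str, requested_days: int) -> int:
    """Apply the policy; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rounding helper works without boto3

    days = nearest_valid_retention(requested_days)
    boto3.client("logs").put_retention_policy(
        logGroupName=log_group, retentionInDays=days
    )
    return days
```

For example, `nearest_valid_retention(45)` rounds up to 60 days, so a "45-day" compliance requirement is met without the API rejecting the value.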

### Structured Logging Best Practices

```python theme={null}
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""

    def __init__(self, service: str, environment: str):
        self.service = service
        self.environment = environment
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)

    def _log(self, level: str, message: str, **extra):
        log_entry = {
            # Timezone-aware UTC timestamp, e.g. 2024-01-15T10:00:00.123456Z
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "level": level,
            "service": self.service,
            "environment": self.environment,
            "message": message,
            **extra
        }
        # Lambda automatically captures stdout; default=str keeps values that
        # are not JSON-serializable (dates, Decimals) from raising.
        print(json.dumps(log_entry, default=str))
    
    def info(self, message: str, **extra):
        self._log("INFO", message, **extra)
    
    def error(self, message: str, error: Exception = None, **extra):
        if error:
            extra["error_type"] = type(error).__name__
            extra["error_message"] = str(error)
        self._log("ERROR", message, **extra)
    
    def metric(self, name: str, value: float, unit: str = "Count", **extra):
        """Log a metric that can be extracted with metric filters."""
        self._log("METRIC", f"{name}={value}", 
                  metric_name=name, metric_value=value, metric_unit=unit, **extra)

# Usage
logger = StructuredLogger(service="order-service", environment="prod")

def process_order(order_id: str, user_id: str):
    logger.info("Processing order", order_id=order_id, user_id=user_id)
    
    try:
        # Business logic...
        logger.info("Order completed", 
                   order_id=order_id, 
                   duration_ms=150,
                   items_count=3)
        logger.metric("OrdersProcessed", 1, "Count", order_id=order_id)
        
    except Exception as e:
        logger.error("Order processing failed", 
                    error=e, 
                    order_id=order_id,
                    user_id=user_id)
        raise
```

### Log Insights Queries

```sql theme={null}
# Run each query below separately. Logs Insights comments start with "#".

# Find all errors (set the time range to the last hour)
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors by service
# (fields in JSON log events are discovered automatically, so no parse is needed)
fields @timestamp, service, level
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate Lambda p99 latency
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| stats avg(@duration) as avg_ms,
        pct(@duration, 50) as p50_ms,
        pct(@duration, 95) as p95_ms,
        pct(@duration, 99) as p99_ms
  by bin(1h)

# Find slow API requests
fields @timestamp, @message
| parse @message '"duration_ms":*,' as duration
| filter duration > 1000
| sort duration desc
| limit 50

# Error rate calculation
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(level = "ERROR") as errors
  by bin(5m)
| sort @timestamp desc

# Top contributors to log volume
fields @logStream, @message
| stats count(*) as log_count by @logStream
| sort log_count desc
| limit 20

# Correlation by request ID
fields @timestamp, @message, @logStream
| parse @message '"request_id":"*"' as request_id
| filter request_id = "abc-123-def-456"
| sort @timestamp asc

### Metric Filters

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Metric Filters                                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Extract metrics from logs automatically:                              │
│                                                                         │
│   Filter Pattern               │ Metric Created                        │
│   ─────────────────────────────┼───────────────────────────────────── │
│   ERROR                        │ Count of ERROR occurrences            │
│   [timestamp, level=ERROR, ...] │ Count of structured errors           │
│   "status": 500                │ Count of 500 errors                   │
│   { $.latency > 1000 }         │ Count of slow requests (JSON)        │
│                                                                         │
│   Example: Track application errors                                    │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Filter Pattern: { $.level = "ERROR" }                          │  │
│   │  Metric Name: ApplicationErrors                                  │  │
│   │  Metric Namespace: MyApp                                        │  │
│   │  Metric Value: 1                                                 │  │
│   │  Default Value: 0                                                │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Example: Extract latency from JSON logs                              │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Filter Pattern: { $.type = "REQUEST" }                         │  │
│   │  Metric Name: RequestLatency                                     │  │
│   │  Metric Namespace: MyApp                                        │  │
│   │  Metric Value: $.latency                                         │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
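The two example filters above map directly onto the `put_metric_filter` API. This sketch builds the request arguments for both, assuming the log group `/aws/lambda/order-service` from earlier in this module; the helper name `build_metric_filter` and the filter names are illustrative. The actual call is left commented out because it needs `boto3` and AWS credentials.

```python
def build_metric_filter(log_group, filter_name, pattern, metric_name,
                        namespace, metric_value="1", default_value=None):
    """Build keyword arguments for CloudWatch Logs put_metric_filter."""
    transformation = {
        "metricName": metric_name,
        "metricNamespace": namespace,
        # "1" counts matching events; a selector such as "$.latency"
        # publishes the value found in the log event instead.
        "metricValue": metric_value,
    }
    if default_value is not None:
        # Value reported for periods in which no event matches.
        transformation["defaultValue"] = default_value
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [transformation],
    }


error_filter = build_metric_filter(
    "/aws/lambda/order-service", "application-errors",
    '{ $.level = "ERROR" }', "ApplicationErrors", "MyApp",
    default_value=0.0,
)

latency_filter = build_metric_filter(
    "/aws/lambda/order-service", "request-latency",
    '{ $.type = "REQUEST" }', "RequestLatency", "MyApp",
    metric_value="$.latency",
)

# To create the filters:
#   import boto3
#   logs = boto3.client("logs")
#   logs.put_metric_filter(**error_filter)
#   logs.put_metric_filter(**latency_filter)
```

Setting a default value of 0 on the error counter means quiet periods report zero rather than no data, which keeps alarms on that metric out of the INSUFFICIENT_DATA state.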

***

## CloudWatch Alarms

### Alarm Types and States

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Alarms                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Alarm States:                                                         │
│   ┌───────────┐     ┌───────────┐     ┌─────────────────┐              │
│   │    OK     │────►│   ALARM   │────►│ INSUFFICIENT    │              │
│   │  (green)  │◄────│   (red)   │◄────│     DATA        │              │
│   └───────────┘     └───────────┘     └─────────────────┘              │
│                                                                         │
│   Alarm Components:                                                     │
│   ─────────────────                                                    │
│   • Metric: What to monitor (CPUUtilization, custom metric)           │
│   • Statistic: How to aggregate (Average, Sum, Max, p99)              │
│   • Period: Length of one data point (60s, 300s, etc.)                 │
│   • Evaluation Periods: How many periods before alarm                  │
│   • Threshold: Value that triggers alarm                               │
│   • Comparison: GreaterThan, LessThan, etc.                           │
│                                                                         │
│   Example Configuration:                                                │
│   ─────────────────────                                                │
│   Metric: CPUUtilization                                               │
│   Statistic: Average                                                   │
│   Period: 300 seconds (5 min)                                          │
│   Evaluation Periods: 3                                                │
│   Threshold: 80%                                                       │
│   → Alarm triggers when CPU > 80% for 3 consecutive 5-min periods     │
│                                                                         │
│   Alarm Actions:                                                        │
│   ───────────────                                                      │
│   • SNS Topic → Email, SMS, Lambda                                     │
│   • Auto Scaling → Scale up/down                                       │
│   • EC2 Actions → Stop, terminate, reboot, recover                     │
│   • Systems Manager → Create OpsItems or incidents                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
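To make the "3 consecutive periods" rule concrete, here is a simplified model of how the example alarm changes state. It is a sketch, not CloudWatch's actual evaluation logic: it handles only a greater-than comparison and ignores missing-data treatment and "M out of N" settings. The function name `evaluate_alarm` is illustrative.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return the alarm state for a simplified GreaterThanThreshold alarm.

    datapoints: one aggregated value per period, oldest first.
    The alarm fires only when every one of the last N periods breaches.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Average CPU per 5-minute period, threshold 80%, 3 evaluation periods
print(evaluate_alarm([45, 82, 85, 91], 80, 3))  # ALARM: last three all > 80
print(evaluate_alarm([45, 82, 70, 91], 80, 3))  # OK: one period dipped to 70
print(evaluate_alarm([91, 95], 80, 3))          # INSUFFICIENT_DATA
```

The second case shows why longer evaluation windows reduce noise: a single dip below the threshold resets the count, so brief spikes do not page anyone.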

### Essential Alarms (Terraform)

```hcl theme={null}
# High CPU Alarm with Auto Scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.app_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU exceeds 80% for 15 minutes"
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate (Metric Math)
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.app_name}-lambda-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  
  metric_query {
    id          = "error_rate"
    expression  = "errors / invocations * 100"
    label       = "Error Rate %"
    return_data = true
  }
  
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }
  
  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Anomaly Detection Alarm
resource "aws_cloudwatch_metric_alarm" "traffic_anomaly" {
  alarm_name          = "${var.app_name}-traffic-anomaly"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  
  threshold_metric_id = "ad1"
  
  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }
  
  metric_query {
    id          = "ad1"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
    label       = "Traffic Anomaly Band"
    return_data = true
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "${var.app_name}-critical"
  
  alarm_rule = join(" OR ", [
    "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.lambda_error_rate.alarm_name})",
  ])
  
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
```
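The metric math in the Lambda error-rate alarm can be reasoned about offline. The sketch below mirrors the `errors / invocations * 100` expression over per-period sums and applies the same threshold (5%) and evaluation periods (2). It chooses to treat periods with zero invocations as missing data; how a real alarm handles missing points depends on its treat-missing-data setting, which the Terraform above leaves at the default. The function names are illustrative.

```python
from typing import List, Optional


def error_rate_percent(errors: List[float],
                       invocations: List[float]) -> List[Optional[float]]:
    """Per-period error rate, mirroring `errors / invocations * 100`."""
    return [
        None if n == 0 else e / n * 100  # no traffic: treat as missing
        for e, n in zip(errors, invocations)
    ]


def breaches(rates: List[Optional[float]], threshold: float = 5,
             evaluation_periods: int = 2) -> bool:
    """True when the last N periods all have data above the threshold."""
    recent = rates[-evaluation_periods:]
    return (len(recent) == evaluation_periods
            and all(r is not None and r > threshold for r in recent))


# Four 5-minute periods: sums of Errors and Invocations
rates = error_rate_percent(errors=[1, 0, 12, 9], invocations=[200, 0, 150, 100])
print(rates)
print(breaches(rates))
```

Alarming on a rate rather than a raw error count keeps the threshold meaningful as traffic grows, since ten errors out of a million invocations is very different from ten out of twenty.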

***

## CloudWatch Dashboards

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Dashboard Design                                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  MY APPLICATION DASHBOARD                           📊 🔗 ⚙️   │    │
│   ├───────────────────────────────────────────────────────────────┤    │
│   │                                                               │    │
│   │  ┌─────────────────────┐  ┌─────────────────────────────────┐│    │
│   │  │ 🟢 Healthy Hosts: 4 │  │ 📈 Request Rate                 ││    │
│   │  │ 🔴 Errors: 0.1%     │  │ [Line graph of requests/min]   ││    │
│   │  │ ⏱️ P99 Latency: 45ms│  │                                 ││    │
│   │  └─────────────────────┘  └─────────────────────────────────┘│    │
│   │                                                               │    │
│   │  ┌────────────────────────────────────────────────────────┐  │    │
│   │  │ 📊 API Response Times by Endpoint                      │  │    │
│   │  │ ┌────────────────────────────────────────────────────┐ │  │    │
│   │  │ │ /orders ████████████████ 45ms                      │ │  │    │
│   │  │ │ /users  ████████ 25ms                              │ │  │    │
│   │  │ │ /products ██████████████████ 55ms                  │ │  │    │
│   │  │ └────────────────────────────────────────────────────┘ │  │    │
│   │  └────────────────────────────────────────────────────────┘  │    │
│   │                                                               │    │
│   │  ┌─────────────────────┐  ┌─────────────────────────────────┐│    │
│   │  │ Lambda Invocations  │  │ DynamoDB Consumed Capacity      ││    │
│   │  │ [Stacked area chart]│  │ [RCU/WCU over time]             ││    │
│   │  └─────────────────────┘  └─────────────────────────────────┘│    │
│   │                                                               │    │
│   │  ┌────────────────────────────────────────────────────────┐  │    │
│   │  │ 📝 Recent Errors (Log Widget)                          │  │    │
│   │  │ 10:23:45 ERROR Payment failed: insufficient funds      │  │    │
│   │  │ 10:22:30 ERROR Timeout connecting to external API      │  │    │
│   │  └────────────────────────────────────────────────────────┘  │    │
│   │                                                               │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Widget Types:                                                         │
│   • Line, Stacked area, Number, Gauge, Bar, Pie                        │
│   • Text (Markdown), Log (Log Insights), Alarm status                  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

### Dashboard as Code

```python theme={null}
import json

# CloudWatch Dashboard JSON definition
dashboard_body = {
    "widgets": [
        # Key Metrics Row
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Request Count",
                "metrics": [
                    ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "Sum",
                "region": "us-east-1"
            }
        },
        {
            "type": "metric",
            "x": 6, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Response Time (p99)",
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "p99",
                "region": "us-east-1"
            }
        },
        # Lambda Metrics
        {
            "type": "metric",
            "x": 0, "y": 6, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda Performance",
                "metrics": [
                    ["AWS/Lambda", "Duration", "FunctionName", "my-function", {"stat": "Average"}],
                    [".", ".", ".", ".", {"stat": "p95"}],
                    [".", ".", ".", ".", {"stat": "p99"}]
                ],
                "period": 60,
                "region": "us-east-1"
            }
        },
        # Alarm Status Widget
        {
            "type": "alarm",
            "x": 12, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Alarm Status",
                "alarms": [
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU",
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ErrorRate"
                ]
            }
        },
        # Log Widget
        {
            "type": "log",
            "x": 0, "y": 12, "width": 24, "height": 6,
            "properties": {
                "title": "Recent Errors",
                "query": "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
                "region": "us-east-1"
            }
        }
    ]
}

# Create/update dashboard
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_dashboard(
    DashboardName='MyApplicationDashboard',
    DashboardBody=json.dumps(dashboard_body)
)
```
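
Hand-written widget coordinates are easy to get wrong. The sketch below checks a dashboard body against the 24-column dashboard grid and flags overlapping widgets before `put_dashboard` is called. The helper name `find_layout_problems` and the sample widgets are illustrative and not part of any AWS SDK; the check assumes every widget sets `x`, `y`, `width`, and `height` explicitly.

```python
from itertools import combinations

GRID_COLUMNS = 24  # CloudWatch dashboards lay widgets out on a 24-column grid


def find_layout_problems(body):
    """Return human-readable layout problems for a dashboard body dict."""
    problems = []
    widgets = body.get("widgets", [])

    # A widget must start at column 0 or later and end at or before column 24.
    for i, w in enumerate(widgets):
        if w["x"] < 0 or w["x"] + w["width"] > GRID_COLUMNS:
            problems.append(f"widget {i} extends past the {GRID_COLUMNS}-column grid")

    # Two widgets overlap when their rectangles intersect on both axes.
    for (i, a), (j, b) in combinations(enumerate(widgets), 2):
        overlap_x = a["x"] < b["x"] + b["width"] and b["x"] < a["x"] + a["width"]
        overlap_y = a["y"] < b["y"] + b["height"] and b["y"] < a["y"] + a["height"]
        if overlap_x and overlap_y:
            problems.append(f"widgets {i} and {j} overlap")
    return problems


# Hypothetical layout with two deliberate mistakes
sample = {"widgets": [
    {"type": "metric", "x": 0, "y": 0, "width": 6, "height": 6},
    {"type": "metric", "x": 6, "y": 0, "width": 6, "height": 6},
    {"type": "metric", "x": 4, "y": 3, "width": 12, "height": 6},  # overlaps the first two
    {"type": "log", "x": 20, "y": 12, "width": 6, "height": 6},    # 20 + 6 > 24
]}

for problem in find_layout_problems(sample):
    print(problem)
```

Running a check like this in CI keeps layout mistakes out of shared dashboards; it complements whatever validation the service itself performs.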

***

## CloudWatch Synthetics

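Synthetics canaries are scripted checks that run on a schedule against your endpoints and APIs, so you can detect failures around the clock, even when no real users are generating traffic.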

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Synthetics                                │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Canary Types:                                                         │
│   ─────────────                                                        │
│   • Heartbeat: Simple availability check                               │
│   • API: REST API endpoint testing                                     │
│   • Broken Link: Check for broken links                               │
│   • Visual: Screenshot comparison                                      │
│   • GUI Workflow: Multi-step browser tests                             │
│                                                                         │
│   Features:                                                             │
│   ─────────                                                            │
│   • Runs on a schedule (1 min to 1 hour)                               │
│   • Captures screenshots and HAR files                                 │
│   • Integrates with CloudWatch Alarms                                  │
│   • Uses Puppeteer or Selenium                                         │
│                                                                         │
│   Example: API Canary Script (Node.js)                                 │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  const synthetics = require('Synthetics');                      │  │
│   │  const log = require('SyntheticsLogger');                       │  │
│   │                                                                  │  │
│   │  const apiCanary = async function() {                           │  │
│   │    const page = await synthetics.getPage();                     │  │
│   │                                                                  │  │
│   │    // Test API endpoint                                          │  │
│   │    const response = await page.goto('https://api.example.com'); │  │
│   │                                                                  │  │
│   │    if (response.status() !== 200) {                             │  │
│   │      throw new Error(`Expected 200, got ${response.status()}`); │  │
│   │    }                                                             │  │
│   │                                                                  │  │
│   │    log.info('API check passed');                                 │  │
│   │  };                                                              │  │
│   │                                                                  │  │
│   │  exports.handler = async () => {                                │  │
│   │    return await apiCanary();                                     │  │
│   │  };                                                              │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Pricing: $0.0012 per canary run                                      │
│   100 runs/day × 30 days = $3.60/month per canary                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
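
Canaries publish their results as metrics in the `CloudWatchSynthetics` namespace, so alerting on a failing canary is an ordinary metric alarm. A minimal sketch, assuming a canary named `my-api-canary` and an SNS topic ARN (both placeholders); the helper `canary_alarm_params` and the threshold values are illustrative choices, not recommendations from AWS:

```python
def canary_alarm_params(canary_name, sns_topic_arn, min_success_percent=90.0):
    """Build put_metric_alarm keyword arguments for a canary's SuccessPercent metric."""
    return {
        "AlarmName": f"{canary_name}-success-rate",
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,                  # evaluate in 5-minute windows
        "EvaluationPeriods": 2,         # two consecutive bad windows before alarming
        "Threshold": min_success_percent,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # a canary that stops reporting counts as failing
        "AlarmActions": [sns_topic_arn],
    }


params = canary_alarm_params(
    "my-api-canary",                                  # hypothetical canary name
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical SNS topic
)
print(params["AlarmName"])

# With AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Treating missing data as breaching is a design choice: it turns a silently stopped canary into an alert instead of a gap in the graph.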

***

## CloudWatch Contributor Insights

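Contributor Insights analyzes log data continuously and ranks the top contributors in high-cardinality datasets, such as the IP addresses, users, or endpoints responsible for most of the traffic or errors.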

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Contributor Insights                                 │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Use Cases:                                                            │
│   ───────────                                                          │
│   • Find top IPs making requests                                       │
│   • Identify users causing errors                                      │
│   • Detect DDoS patterns                                               │
│   • Find hottest DynamoDB partition keys                               │
│                                                                         │
│   Example: Top Error-Producing API Endpoints                           │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Rank │ Endpoint          │ Error Count │ % of Total            │  │
│   │  ─────┼───────────────────┼─────────────┼───────────────────── │  │
│   │  1    │ /api/v1/checkout  │ 1,234       │ 45%                   │  │
│   │  2    │ /api/v1/payment   │ 567         │ 21%                   │  │
│   │  3    │ /api/v1/search    │ 234         │ 9%                    │  │
│   │  4    │ /api/v1/users     │ 123         │ 5%                    │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Rule Definition:                                                      │
│   {                                                                     │
│     "Schema": {                                                         │
│       "Name": "CloudWatchLogRule",                                     │
│       "Version": 1                                                      │
│     },                                                                  │
│     "LogGroupNames": ["/aws/lambda/my-api"],                           │
│     "LogFormat": "JSON",                                               │
│     "Contribution": {                                                   │
│       "Keys": ["$.path"],                                               │
│       "Filters": [{"Match": "$.status_code", "GreaterThan": 499}]       │
│     },                                                                  │
│     "AggregateOn": "Count"                                              │
│   }                                                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
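
To make the ranking concrete, here is a rough local equivalent of what such a rule computes: keep events whose `status_code` is above 499, group them by `path`, and rank the groups by count. The sample log lines are invented for illustration, and the function name `top_error_contributors` is hypothetical; the real service does this continuously over the log group.

```python
import json
from collections import Counter

# Invented sample events in the JSON shape the rule expects
log_lines = [
    '{"path": "/api/v1/checkout", "status_code": 500}',
    '{"path": "/api/v1/checkout", "status_code": 502}',
    '{"path": "/api/v1/payment", "status_code": 503}',
    '{"path": "/api/v1/search", "status_code": 200}',
    '{"path": "/api/v1/checkout", "status_code": 404}',
]


def top_error_contributors(lines, limit=10):
    """Count events with status_code > 499 per path, highest count first.

    Returns (rank, path, count, percent_of_matching_events) tuples.
    """
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("status_code", 0) > 499:
            counts[event["path"]] += 1
    total = sum(counts.values())
    return [
        (rank, path, count, round(100 * count / total))
        for rank, (path, count) in enumerate(counts.most_common(limit), start=1)
    ]


for row in top_error_contributors(log_lines):
    print(row)
```

Only the three 5xx events match the filter, so the 404 and the 200 do not affect the ranking.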

***

## Best Practices

<CardGroup cols={2}>
  <Card title="Set Retention Policies" icon="clock">
    Configure log retention to balance cost and compliance needs
  </Card>

  <Card title="Use Structured Logging" icon="code">
    JSON format enables powerful Log Insights queries
  </Card>

  <Card title="Alert on Symptoms" icon="bell">
    Focus on user-facing metrics, not just infrastructure
  </Card>

  <Card title="Tune Alarm Thresholds" icon="sliders">
    Avoid alert fatigue with well-calibrated thresholds
  </Card>

  <Card title="Dashboard Hierarchy" icon="sitemap">
    Executive → Service → Debug dashboards
  </Card>

  <Card title="Cost Awareness" icon="dollar-sign">
    Monitor CloudWatch spend; custom metrics and log ingestion add up quickly
  </Card>
</CardGroup>
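
Retention is the easiest of these practices to automate. The sketch below finds log groups that never expire and applies a default retention period. It assumes the response shape of `describe_log_groups`, where a group without a `retentionInDays` key keeps its events indefinitely; the function names and the 30-day default are illustrative choices.

```python
def groups_without_retention(log_groups):
    """Return names of log groups that have no retention policy set."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_default_retention(logs_client, days=30):
    """Set a retention policy on every log group that lacks one."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for name in groups_without_retention(page["logGroups"]):
            logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
            updated.append(name)
    return updated


# Hypothetical describe_log_groups entries
sample = [
    {"logGroupName": "/aws/lambda/my-function"},
    {"logGroupName": "/aws/lambda/billing", "retentionInDays": 90},
]
print(groups_without_retention(sample))

# With AWS credentials configured:
#   import boto3
#   apply_default_retention(boto3.client("logs"), days=30)
```

Groups that already have a policy are left alone, so compliance-driven retention settings are not overwritten.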

### Cost Optimization

```python theme={null}
cost_tips = {
    "logs": [
        "Set appropriate retention (default: never expire)",
        "Export to S3 for long-term storage (cheaper)",
        "Use metric filters instead of querying raw logs",
        "Filter at source (log level, sampling)",
    ],
    "metrics": [
        "Minimize custom metric dimensions",
        "Use EMF for metrics-from-logs (no PutMetricData API charges; log ingestion and custom metric fees still apply)",
        "Consider high-res metrics only where needed",
        "Delete unused custom metrics",
    ],
    "alarms": [
        "Use composite alarms to reduce noise",
        "Consolidate similar alarms",
        "Avoid very short periods on high-cardinality metrics",
    ],
    "dashboards": [
        "Share dashboards instead of duplicating",
        "Use automatic refresh wisely",
    ]
}

# Cost estimation (illustrative list prices; rates vary by region and change over time)
monthly_costs = {
    "custom_metrics": "10,000 metrics × $0.30 = $3,000",
    "log_ingestion": "100 GB × $0.50 = $50",
    "log_storage": "1 TB × $0.03 = $30",
    "log_insights": "500 GB scanned × $0.005 = $2.50",
    "alarms": "100 alarms × $0.10 = $10",
    "dashboards": "First 3 free, then $3/month each",
}
```
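
The same arithmetic can be turned into a small estimator. This is a sketch that reuses the illustrative unit prices from the estimate above, applies them as flat rates (it ignores volume tiers and free-tier allowances other than the three free dashboards), and uses hypothetical names throughout.

```python
# Illustrative unit prices in USD, copied from the estimate above; not live pricing.
PRICES = {
    "custom_metric": 0.30,         # per metric per month
    "log_ingestion_gb": 0.50,      # per GB ingested
    "log_storage_gb": 0.03,        # per GB-month stored
    "insights_scanned_gb": 0.005,  # per GB scanned by Log Insights
    "alarm": 0.10,                 # per alarm per month
    "dashboard": 3.00,             # per dashboard beyond the free ones
}
FREE_DASHBOARDS = 3


def estimate_monthly_cost(custom_metrics=0, ingested_gb=0, stored_gb=0,
                          scanned_gb=0, alarms=0, dashboards=0):
    """Return a per-line-item cost breakdown plus a total, in USD."""
    items = {
        "custom_metrics": custom_metrics * PRICES["custom_metric"],
        "log_ingestion": ingested_gb * PRICES["log_ingestion_gb"],
        "log_storage": stored_gb * PRICES["log_storage_gb"],
        "log_insights": scanned_gb * PRICES["insights_scanned_gb"],
        "alarms": alarms * PRICES["alarm"],
        "dashboards": max(0, dashboards - FREE_DASHBOARDS) * PRICES["dashboard"],
    }
    items = {name: round(cost, 2) for name, cost in items.items()}
    items["total"] = round(sum(items.values()), 2)
    return items


estimate = estimate_monthly_cost(
    custom_metrics=10_000, ingested_gb=100, stored_gb=1000,
    scanned_gb=500, alarms=100, dashboards=5,
)
for name, cost in estimate.items():
    print(f"{name:>15}: ${cost:,.2f}")
```

With these inputs, custom metrics account for almost all of the total, which is why trimming metric dimensions usually matters more than tuning log queries.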

***

## 🎯 Interview Questions

<AccordionGroup>
  <Accordion title="Q1: CloudWatch Logs vs X-Ray vs CloudTrail?">
    **CloudWatch Logs:**

    * Application and system logs
    * What your code outputs
    * Debugging, troubleshooting

    **X-Ray:**

    * Distributed tracing
    * Request flow across services
    * Performance analysis

    **CloudTrail:**

    * AWS API call history
    * Who did what, when
    * Security auditing
  </Accordion>

  <Accordion title="Q2: How to reduce CloudWatch Logs costs?">
    1. **Set retention policies** (don't keep forever)
    2. **Export to S3** for long-term (use lifecycle rules)
    3. **Filter at source** (log levels, sampling)
    4. **Use metric filters** instead of Log Insights for common queries
    5. **Trim log payloads** before ingestion (shorter messages, fewer fields)
    6. **Use Contributor Insights rules** for ongoing analysis
  </Accordion>

  <Accordion title="Q3: What's the difference between metric alarms and composite alarms?">
    **Metric Alarms:**

    * Monitor a single metric or metric math expression
    * Simple threshold or anomaly detection
    * One condition

    **Composite Alarms:**

    * Combine multiple alarms with AND/OR/NOT
    * Reduce alert noise
    * Complex conditions like: "High CPU AND High Memory"
    * Better for on-call (fewer, more actionable alerts)
  </Accordion>

  <Accordion title="Q4: EC2 memory not showing in CloudWatch?">
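    **Why:** EC2 metrics (basic or detailed monitoring) don't include memory, because memory usage lives inside the guest OS and isn't visible to the hypervisor.

    **Solution:** Install the CloudWatch agent, which publishes memory metrics such as `mem_used_percent` to the `CWAgent` namespace.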

    ```bash theme={null}
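    # Install the agent (Amazon Linux; the instance role needs CloudWatchAgentServerPolicy)
    sudo yum install -y amazon-cloudwatch-agent

    # Generate a config; the wizard saves it to /opt/aws/amazon-cloudwatch-agent/bin/config.json
    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

    # Load that config and start the agent
    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
      -a fetch-config -m ec2 -s \
      -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json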
    ```
  </Accordion>

  <Accordion title="Q5: How to monitor Lambda cold starts?">
    **Built-in:** Lambda doesn't have a direct cold start metric.

    **Solutions:**

    1. **Init Duration** in REPORT logs

    ```sql theme={null}
    fields @timestamp, @initDuration
    | filter @type = "REPORT" and ispresent(@initDuration)
    | stats avg(@initDuration), max(@initDuration), count(*)
    ```

    2. **Custom metric** from code (measure init time)

    3. **X-Ray** shows initialization segment

    4. **Lambda Insights** (enhanced monitoring)
  </Accordion>
</AccordionGroup>
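
The composite condition from Q3 ("High CPU AND High Memory") maps onto the rule expression that `put_composite_alarm` accepts. A minimal sketch, assuming metric alarms named `HighCPU` and `HighMemory` already exist; those names, the composite alarm name, and the SNS topic ARN are placeholders, and `all_in_alarm` is a hypothetical helper.

```python
def all_in_alarm(*alarm_names):
    """Build an AlarmRule that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm("HighCPU", "HighMemory")
print(rule)  # ALARM("HighCPU") AND ALARM("HighMemory")

# With AWS credentials configured, the composite alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="HostSaturated",
#       AlarmRule=rule,
#       AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
#   )
```

Attaching the notification action only to the composite alarm, and leaving the child alarms without actions, is what reduces the page count for on-call engineers.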

***

## 🧪 Hands-On Lab

<Steps>
  <Step title="Set Up Structured Logging">
    Implement JSON logging in a Lambda function
  </Step>

  <Step title="Create Metric Filter">
    Extract error count metric from logs
  </Step>

  <Step title="Build Dashboard">
    Create operational dashboard with key metrics
  </Step>

  <Step title="Configure Alarms">
    Set up metric and composite alarms with SNS notification
  </Step>

  <Step title="Create Canary">
    Set up synthetic monitoring for your API endpoint
  </Step>
</Steps>
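
For the first two lab steps, a possible starting point is a JSON log formatter plus a metric filter that counts error events. This is a sketch under stated assumptions: the logger name, the `order_id` field, the `lambda_handler` body, and the filter and metric names in the trailing comment are all hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Fields passed via extra={"context": {...}} are merged into the payload.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


logger = logging.getLogger("orders")
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setFormatter(JsonFormatter())
logger.addHandler(stream_handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate lines from a pre-configured root logger


def lambda_handler(event, context):
    order_id = event.get("order_id", "unknown")
    logger.error("Payment failed", extra={"context": {"order_id": order_id}})
    return {"statusCode": 500}


lambda_handler({"order_id": "o-123"}, None)

# Step 2, with AWS credentials configured: count ERROR events as a metric.
#   import boto3
#   boto3.client("logs").put_metric_filter(
#       logGroupName="/aws/lambda/my-function",
#       filterName="error-count",
#       filterPattern='{ $.level = "ERROR" }',
#       metricTransformations=[{
#           "metricName": "ErrorCount",
#           "metricNamespace": "MyApp",
#           "metricValue": "1",
#           "defaultValue": 0,
#       }],
#   )
```

Because every line is valid JSON, the same `level` field works in the metric filter pattern and in Log Insights queries without any parsing step.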

***

## Next Module

<Card title="AWS X-Ray" icon="microscope" href="/aws/xray">
  Master distributed tracing with AWS X-Ray
</Card>
