CloudWatch Deep Dive

Module Overview

Estimated Time: 4-5 hours | Difficulty: Intermediate | Prerequisites: Core Concepts
Amazon CloudWatch is the unified monitoring and observability service for AWS. This module provides a comprehensive deep dive into CloudWatch capabilities for production monitoring.

What You’ll Learn:
  • CloudWatch Metrics (built-in and custom)
  • CloudWatch Logs and Log Insights
  • CloudWatch Alarms and composite alarms
  • Dashboards and visualization
  • CloudWatch Synthetics (canaries)
  • CloudWatch Contributor Insights
  • CloudWatch Anomaly Detection
  • Cross-account and cross-region monitoring

CloudWatch Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Architecture                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Data Sources                        CloudWatch                        │
│   ────────────                        ──────────                       │
│   ┌──────────────┐                   ┌───────────────────────────────┐ │
│   │ EC2, RDS,    │───────────────────│         METRICS               │ │
│   │ Lambda, etc. │   Auto-collected  │  • Standard (5-min)           │ │
│   └──────────────┘                   │  • Detailed (1-min)           │ │
│                                      │  • High-res (1-sec)           │ │
│   ┌──────────────┐                   └───────────────────────────────┘ │
│   │ Applications │   PutMetricData   ┌───────────────────────────────┐ │
│   │ (Custom)     │───────────────────│      CUSTOM METRICS           │ │
│   └──────────────┘                   │  • Business KPIs              │ │
│                                      │  • App-specific metrics       │ │
│   ┌──────────────┐                   └───────────────────────────────┘ │
│   │ Lambda, ECS, │   Auto-collected  ┌───────────────────────────────┐ │
│   │ EC2, VPC     │───────────────────│          LOGS                 │ │
│   └──────────────┘                   │  • Log Groups                 │ │
│                                      │  • Log Streams                │ │
│   ┌──────────────┐                   │  • Log Insights               │ │
│   │ Applications │   Agent/SDK       └───────────────────────────────┘ │
│   └──────────────┘                                                     │
│                                      ┌───────────────────────────────┐ │
│   Processing & Actions               │         ALARMS                │ │
│   ─────────────────────              │  • Metric Alarms              │ │
│   • Alarms → SNS, Auto Scaling       │  • Composite Alarms           │ │
│   • EventBridge → Lambda, etc.       │  • Anomaly Detection          │ │
│   • Dashboards → Visualization       └───────────────────────────────┘ │
│   • Contributor Insights                                               │
│                                      ┌───────────────────────────────┐ │
│                                      │       DASHBOARDS              │ │
│                                      │  • Widgets                    │ │
│                                      │  • Cross-account view         │ │
│                                      │  • Sharing                    │ │
│                                      └───────────────────────────────┘ │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudWatch Metrics

Built-in Metrics

┌────────────────────────────────────────────────────────────────────────┐
│                    Built-in Metrics by Service                          │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   EC2 (Basic - 5 min | Detailed - 1 min):                              │
│   ─────────────────────────────────────                                │
│   • CPUUtilization       • DiskReadOps         • DiskWriteOps          │
│   • NetworkIn            • NetworkOut          • StatusCheckFailed     │
│   ⚠️ Memory NOT included (use CloudWatch Agent)                        │
│                                                                         │
│   Lambda:                                                               │
│   ───────                                                               │
│   • Invocations          • Duration            • Errors                │
│   • Throttles            • ConcurrentExec      • UnreservedConcurrent  │
│   • IteratorAge (streams)                                              │
│                                                                         │
│   RDS:                                                                  │
│   ────                                                                  │
│   • CPUUtilization       • DatabaseConnections • FreeableMemory        │
│   • ReadIOPS             • WriteIOPS           • ReadLatency           │
│   • WriteLatency         • FreeStorageSpace                            │
│                                                                         │
│   DynamoDB:                                                             │
│   ─────────                                                             │
│   • ConsumedRCU          • ConsumedWCU         • ProvisionedRCU        │
│   • ProvisionedWCU       • ThrottledRequests   • SystemErrors          │
│   • ReturnedItemCount    • SuccessfulRequestLatency                    │
│                                                                         │
│   Application Load Balancer:                                            │
│   ──────────────────────────                                           │
│   • RequestCount         • TargetResponseTime  • HTTPCode_Target_2XX   │
│   • HTTPCode_Target_4XX  • HTTPCode_Target_5XX • HealthyHostCount      │
│   • UnHealthyHostCount   • ActiveConnectionCount                       │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Metric Dimensions

┌────────────────────────────────────────────────────────────────────────┐
│                    Metric Dimensions                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Metrics are uniquely identified by:                                   │
│   Namespace + Metric Name + Dimensions                                  │
│                                                                         │
│   Example:                                                              │
│   ─────────                                                             │
│   Namespace: AWS/EC2                                                    │
│   Metric: CPUUtilization                                                │
│   Dimensions: InstanceId=i-1234567890abcdef0                           │
│                                                                         │
│   Each unique combination creates a separate time series:               │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  AWS/EC2 | CPUUtilization | InstanceId=i-abc123      → Series 1 │  │
│   │  AWS/EC2 | CPUUtilization | InstanceId=i-def456      → Series 2 │  │
│   │  AWS/EC2 | CPUUtilization | AutoScalingGroupName=web → Series 3 │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Aggregation Example:                                                  │
│   ────────────────────                                                 │
│   Query by ASG dimension to get aggregate across all instances:        │
│   aws cloudwatch get-metric-statistics \                               │
│     --namespace AWS/EC2 \                                              │
│     --metric-name CPUUtilization \                                     │
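│     --dimensions Name=AutoScalingGroupName,Value=my-asg \              │
│     --start-time 2024-01-15T00:00:00Z \                                │
│     --end-time 2024-01-15T01:00:00Z \                                  │
│     --period 300 --statistics Average                                  │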
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
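
The identity rule in the box above can be modelled in a few lines. This is a minimal sketch, not an AWS API: the helper name `series_key` and the sample datapoints are assumptions made for illustration. It shows that repeating the same namespace, metric name, and dimension set does not create a new time series, while any new dimension combination does.

```python
from typing import Dict, FrozenSet, Tuple

SeriesKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]

def series_key(namespace: str, metric_name: str, dimensions: Dict[str, str]) -> SeriesKey:
    """Model the identity rule: namespace + metric name + dimension set."""
    return (namespace, metric_name, frozenset(dimensions.items()))

# Hypothetical datapoints: (namespace, metric name, dimensions)
datapoints = [
    ("AWS/EC2", "CPUUtilization", {"InstanceId": "i-abc123"}),
    ("AWS/EC2", "CPUUtilization", {"InstanceId": "i-def456"}),
    ("AWS/EC2", "CPUUtilization", {"AutoScalingGroupName": "web"}),
    ("AWS/EC2", "CPUUtilization", {"InstanceId": "i-abc123"}),  # same series as the first
]

unique_series = {series_key(ns, name, dims) for ns, name, dims in datapoints}
print(len(unique_series))  # 3
```

For custom metrics the same counting applies to cost: every distinct key is a separately billed metric, so high-cardinality dimensions such as user IDs multiply the bill quickly.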

Custom Metrics

import boto3
from datetime import datetime, timezone
from typing import Optional

cloudwatch = boto3.client('cloudwatch')

def publish_business_metric(metric_name: str, value: float,
                            dimensions: Optional[list] = None):
    """
    Publish a custom business metric to CloudWatch.

    Cost: $0.30 per metric per month (first 10,000 metrics).
    Each unique combination of dimension values counts as a separate metric.
    """
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': 'Count',
        'Timestamp': datetime.now(timezone.utc),  # timezone-aware UTC timestamp
        'StorageResolution': 60  # Standard resolution (60 = 1 min)
    }

    if dimensions:
        metric_data['Dimensions'] = dimensions

    cloudwatch.put_metric_data(
        Namespace='MyApplication',
        MetricData=[metric_data]
    )

# Example: Track orders placed
publish_business_metric(
    metric_name='OrdersPlaced',
    value=1,
    dimensions=[
        {'Name': 'Environment', 'Value': 'production'},
        {'Name': 'Region', 'Value': 'us-east-1'},
        {'Name': 'ProductCategory', 'Value': 'electronics'}
    ]
)

# High-resolution metric (1-second granularity)
def publish_high_res_metric(metric_name: str, value: float):
    """High-resolution metric for real-time monitoring."""
    cloudwatch.put_metric_data(
        Namespace='MyApplication/HighFrequency',
        MetricData=[{
            'MetricName': metric_name,
            'Value': value,
            'Unit': 'Milliseconds',
            'Timestamp': datetime.now(timezone.utc),
            'StorageResolution': 1  # 1-second resolution
        }]
    )

# Batch publish (up to 1,000 metrics per request, 1 MB max payload)
def publish_batch_metrics(metrics: list, batch_size: int = 1000):
    """Publish multiple metrics, splitting them into requests of at most 1,000."""
    # Slicing to the first 1,000 would silently drop the rest, so send in chunks.
    for start in range(0, len(metrics), batch_size):
        cloudwatch.put_metric_data(
            Namespace='MyApplication',
            MetricData=metrics[start:start + batch_size]
        )
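
When a process records many observations per minute, one option is to aggregate them locally and send a single statistic set instead of one entry per observation. `PutMetricData` accepts a `StatisticValues` field for this purpose. The sketch below only builds the request entry; the helper name `to_statistic_set` and the metric name `CheckoutLatency` are assumptions for illustration.

```python
from typing import Dict, List

def to_statistic_set(metric_name: str, values: List[float],
                     unit: str = "Milliseconds") -> Dict:
    """Collapse raw observations into one MetricData entry using StatisticValues."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "MetricName": metric_name,
        "Unit": unit,
        "StatisticValues": {
            "SampleCount": float(len(values)),
            "Sum": float(sum(values)),
            "Minimum": float(min(values)),
            "Maximum": float(max(values)),
        },
    }

# Three hypothetical latency samples collected during one interval
entry = to_statistic_set("CheckoutLatency", [120.0, 95.0, 310.0])
print(entry["StatisticValues"])

# With a boto3 CloudWatch client this entry would be sent as:
# cloudwatch.put_metric_data(Namespace="MyApplication", MetricData=[entry])
```

The trade-off is that CloudWatch needs raw data points to compute percentiles, so statistics such as p99 are generally not available for data published as statistic sets.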

CloudWatch Agent

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Agent                                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   The CloudWatch Agent collects metrics and logs from:                 │
│   • EC2 instances (Linux & Windows)                                    │
│   • On-premises servers                                                │
│   • Containers                                                         │
│                                                                         │
│   Metrics NOT available without agent:                                 │
│   ────────────────────────────────────                                 │
│   • Memory utilization                                                 │
│   • Disk space utilization                                             │
│   • Disk I/O (OS-level, per device)                                    │
│   • Network connections                                                │
│   • Process information                                                │
│                                                                         │
│   Configuration (amazon-cloudwatch-agent.json):                        │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  {                                                               │  │
│   │    "metrics": {                                                  │  │
│   │      "namespace": "MyApp/EC2",                                   │  │
│   │      "metrics_collected": {                                      │  │
│   │        "mem": {                                                  │  │
│   │          "measurement": ["mem_used_percent"]                     │  │
│   │        },                                                        │  │
│   │        "disk": {                                                 │  │
│   │          "measurement": ["disk_used_percent"],                   │  │
│   │          "resources": ["/", "/data"]                             │  │
│   │        }                                                         │  │
│   │      }                                                           │  │
│   │    },                                                            │  │
│   │    "logs": {                                                     │  │
│   │      "logs_collected": {                                         │  │
│   │        "files": {                                                │  │
│   │          "collect_list": [{                                      │  │
│   │            "file_path": "/var/log/myapp/*.log",                  │  │
│   │            "log_group_name": "myapp-logs",                       │  │
│   │            "log_stream_name": "{instance_id}"                    │  │
│   │          }]                                                      │  │
│   │        }                                                         │  │
│   │      }                                                           │  │
│   │    }                                                             │  │
│   │  }                                                               │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudWatch Logs

Log Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs Structure                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   LOG GROUP: /aws/lambda/order-service                                 │
│   ────────────────────────────────────                                 │
│   • Retention: 1 day to never expire                                   │
│   • Encryption: Optional KMS encryption                                │
│   • Metric Filters: Extract metrics from logs                          │
│   • Subscription Filters: Stream to Kinesis/Lambda                     │
│                                                                         │
│   └── LOG STREAM: 2024/01/15/[$LATEST]abc123                          │
│       └── LOG EVENT: {"timestamp": 1705312800000,                      │
│                       "message": "Order processed: ORD-123"}           │
│       └── LOG EVENT: {"timestamp": 1705312801000,                      │
│                       "message": "Payment confirmed: PAY-456"}         │
│                                                                         │
│   └── LOG STREAM: 2024/01/15/[$LATEST]def456                          │
│       └── ...                                                          │
│                                                                         │
│   Pricing (us-east-1):                                                 │
│   ────────────────────                                                 │
│   • Ingestion: $0.50 per GB                                            │
│   • Storage: $0.03 per GB/month                                        │
│   • Log Insights queries: $0.005 per GB scanned                        │
│   • Export to S3: Free (but S3 storage costs apply)                    │
│                                                                         │
│   Retention Settings:                                                  │
│   ───────────────────                                                  │
│   1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, 2 months,          │
│   3 months, 4 months, 5 months, 6 months, 1 year, 13 months,          │
│   18 months, 2 years, 3 years, 5 years, 6 years, 7 years, 8 years,    │
│   9 years, 10 years, Never Expire                                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
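
The prices listed above combine into a simple monthly estimate. The sketch below is back-of-the-envelope arithmetic only: it uses the us-east-1 figures from the box, ignores the free tier and any tiered or vended-log pricing, and the function name and sample volumes are assumptions for illustration.

```python
# Prices taken from the pricing box above (us-east-1); check current pricing before relying on them.
INGEST_PER_GB = 0.50
STORAGE_PER_GB_MONTH = 0.03
INSIGHTS_PER_GB_SCANNED = 0.005

def estimate_monthly_logs_cost(ingested_gb: float, stored_gb: float,
                               scanned_gb: float) -> float:
    """Rough monthly CloudWatch Logs cost in USD for the given volumes."""
    cost = (
        ingested_gb * INGEST_PER_GB
        + stored_gb * STORAGE_PER_GB_MONTH
        + scanned_gb * INSIGHTS_PER_GB_SCANNED
    )
    return round(cost, 2)

# Hypothetical workload: 200 GB ingested, 600 GB retained, 1,000 GB scanned by queries
print(estimate_monthly_logs_cost(ingested_gb=200, stored_gb=600, scanned_gb=1000))
```

In this example ingestion accounts for 100 of the roughly 123 dollars, which is why reducing log volume usually saves more than shortening retention.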

Structured Logging Best Practices

import json
from datetime import datetime, timezone

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""

    def __init__(self, service: str, environment: str):
        self.service = service
        self.environment = environment

    def _log(self, level: str, message: str, **extra):
        log_entry = {
            # Timezone-aware UTC timestamp with a trailing "Z"
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "environment": self.environment,
            "message": message,
            **extra
        }
        # Lambda automatically captures stdout. default=str avoids a TypeError
        # when an extra field is not JSON-serializable.
        print(json.dumps(log_entry, default=str))
    
    def info(self, message: str, **extra):
        self._log("INFO", message, **extra)
    
    def error(self, message: str, error: Exception = None, **extra):
        if error:
            extra["error_type"] = type(error).__name__
            extra["error_message"] = str(error)
        self._log("ERROR", message, **extra)
    
    def metric(self, name: str, value: float, unit: str = "Count", **extra):
        """Log a metric that can be extracted with metric filters."""
        self._log("METRIC", f"{name}={value}", 
                  metric_name=name, metric_value=value, metric_unit=unit, **extra)

# Usage
logger = StructuredLogger(service="order-service", environment="prod")

def process_order(order_id: str, user_id: str):
    logger.info("Processing order", order_id=order_id, user_id=user_id)
    
    try:
        # Business logic...
        logger.info("Order completed", 
                   order_id=order_id, 
                   duration_ms=150,
                   items_count=3)
        logger.metric("OrdersProcessed", 1, "Count", order_id=order_id)
        
    except Exception as e:
        logger.error("Order processing failed", 
                    error=e, 
                    order_id=order_id,
                    user_id=user_id)
        raise

Log Insights Queries

# Find all errors (set the query time range to the last hour)
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors by service (Logs Insights auto-discovers fields in JSON logs)
fields @timestamp, service, level
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate Lambda latency percentiles (average, p50, p95, p99) per hour
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| stats avg(@duration) as avg_ms,
        pct(@duration, 50) as p50_ms,
        pct(@duration, 95) as p95_ms,
        pct(@duration, 99) as p99_ms
  by bin(1h)

# Find slow API requests (duration_ms is a discovered JSON field)
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

# Error rate calculation
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(level = "ERROR") as errors
  by bin(5m)
| sort @timestamp desc

# Top contributors to log volume
fields @logStream, @message
| stats count(*) as log_count by @logStream
| sort log_count desc
| limit 20

# Correlation by request ID
fields @timestamp, @message, @logStream
| parse @message '"request_id":"*"' as request_id
| filter request_id = "abc-123-def-456"
| sort @timestamp asc
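
Queries like these can also be run programmatically. Logs Insights is asynchronous: `start_query` returns a query ID, and `get_query_results` is polled until the query reaches a terminal status. The sketch below takes the client as a parameter (in practice `boto3.client("logs")`) so the polling logic stays easy to test; the function name, polling defaults, and the log group in the usage comment are assumptions for illustration.

```python
import time
from typing import Any, Dict, List

TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled", "Timeout"}

def run_insights_query(client: Any, log_group: str, query: str,
                       start_epoch: int, end_epoch: int,
                       poll_seconds: float = 1.0,
                       max_polls: int = 60) -> List[Dict[str, str]]:
    """Start a Logs Insights query and poll until it reaches a terminal status."""
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,  # epoch seconds
        endTime=end_epoch,      # epoch seconds
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")

    if response["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} pairs; flatten to dicts.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]

# Usage with a real client (requires AWS credentials):
# import boto3
# now = int(time.time())
# rows = run_insights_query(boto3.client("logs"), "/aws/lambda/order-service",
#                           "fields @timestamp, @message | limit 20", now - 3600, now)
```

Note that all values come back as strings, so numeric fields such as counts need converting before further arithmetic.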

Metric Filters

┌────────────────────────────────────────────────────────────────────────┐
│                    Metric Filters                                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Extract metrics from logs automatically:                              │
│                                                                         │
│   Filter Pattern               │ Metric Created                        │
│   ─────────────────────────────┼───────────────────────────────────── │
│   ERROR                        │ Count of ERROR occurrences            │
│   [timestamp, level=ERROR, ...] │ Count of structured errors           │
│   { $.status = 500 }           │ Count of 500 errors                   │
│   { $.latency > 1000 }         │ Count of slow requests (JSON)        │
│                                                                         │
│   Example: Track application errors                                    │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Filter Pattern: { $.level = "ERROR" }                          │  │
│   │  Metric Name: ApplicationErrors                                  │  │
│   │  Metric Namespace: MyApp                                        │  │
│   │  Metric Value: 1                                                 │  │
│   │  Default Value: 0                                                │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Example: Extract latency from JSON logs                              │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Filter Pattern: { $.type = "REQUEST" }                         │  │
│   │  Metric Name: RequestLatency                                     │  │
│   │  Metric Namespace: MyApp                                        │  │
│   │  Metric Value: $.latency                                         │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
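
The two example filters above map directly onto the `put_metric_filter` call of the CloudWatch Logs API. The sketch below only builds the request arguments, so it runs without AWS access; the helper name `metric_filter_request`, the filter names, and the log group are assumptions for illustration.

```python
from typing import Any, Dict, Optional

def metric_filter_request(log_group: str, filter_name: str, pattern: str,
                          metric_name: str, namespace: str,
                          metric_value: str = "1",
                          default_value: Optional[float] = None) -> Dict[str, Any]:
    """Build keyword arguments for the Logs API put_metric_filter call."""
    transformation: Dict[str, Any] = {
        "metricName": metric_name,
        "metricNamespace": namespace,
        "metricValue": metric_value,  # a literal such as "1" or a selector such as "$.latency"
    }
    if default_value is not None:
        # Value published for periods in which no log event matches
        transformation["defaultValue"] = float(default_value)
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [transformation],
    }

# Count ERROR-level events, publishing 0 when nothing matches
errors = metric_filter_request(
    "/aws/lambda/order-service", "application-errors",
    '{ $.level = "ERROR" }', "ApplicationErrors", "MyApp", default_value=0)

# Publish the latency field of each REQUEST event as the metric value
latency = metric_filter_request(
    "/aws/lambda/order-service", "request-latency",
    '{ $.type = "REQUEST" }', "RequestLatency", "MyApp", metric_value="$.latency")

# With a boto3 Logs client: logs.put_metric_filter(**errors)
```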

CloudWatch Alarms

Alarm Types and States

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Alarms                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Alarm States:                                                         │
│   ┌───────────┐     ┌───────────┐     ┌─────────────────┐              │
│   │    OK     │────►│   ALARM   │────►│ INSUFFICIENT    │              │
│   │  (green)  │◄────│   (red)   │◄────│     DATA        │              │
│   └───────────┘     └───────────┘     └─────────────────┘              │
│                                                                         │
│   Alarm Components:                                                     │
│   ─────────────────                                                    │
│   • Metric: What to monitor (CPUUtilization, custom metric)           │
│   • Statistic: How to aggregate (Average, Sum, Max, p99)              │
│   • Period: Length of each aggregation window (60s, 300s, etc.)        │
│   • Evaluation Periods: How many periods before alarm                  │
│   • Threshold: Value that triggers alarm                               │
│   • Comparison: GreaterThan, LessThan, etc.                           │
│                                                                         │
│   Example Configuration:                                                │
│   ─────────────────────                                                │
│   Metric: CPUUtilization                                               │
│   Statistic: Average                                                   │
│   Period: 300 seconds (5 min)                                          │
│   Evaluation Periods: 3                                                │
│   Threshold: 80%                                                       │
│   → Alarm triggers when CPU > 80% for 3 consecutive 5-min periods     │
│                                                                         │
│   Alarm Actions:                                                        │
│   ───────────────                                                      │
│   • SNS Topic → Email, SMS, Lambda                                     │
│   • Auto Scaling → Scale up/down                                       │
│   • EC2 Actions → Stop, terminate, reboot, recover                     │
│   • Systems Manager → Create OpsItems or incidents                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
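
The example configuration above (threshold 80, 3 evaluation periods) can be traced with a small simulation. This is a deliberately simplified model and not how CloudWatch is implemented: it ignores "M out of N" datapoints-to-alarm settings and configurable missing-data treatment, and simply reports INSUFFICIENT_DATA when any datapoint in the window is missing. The function name and sample values are assumptions for illustration.

```python
from typing import List, Optional

def evaluate_alarm(datapoints: List[Optional[float]], threshold: float,
                   evaluation_periods: int) -> str:
    """Return the alarm state implied by the most recent datapoints.

    Each datapoint is the statistic for one period; None marks a missing period.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods or any(value is None for value in window):
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in window):
        return "ALARM"
    return "OK"

# Average CPU per 5-minute period, oldest first
print(evaluate_alarm([70, 85, 90, 95], threshold=80, evaluation_periods=3))    # ALARM
print(evaluate_alarm([85, 90, 75], threshold=80, evaluation_periods=3))        # OK
print(evaluate_alarm([85, None, 90], threshold=80, evaluation_periods=3))      # INSUFFICIENT_DATA
```

The second case shows why consecutive periods matter: a single period back under the threshold keeps the alarm in OK even though two of the three periods breached.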

Essential Alarms (Terraform)

# High CPU Alarm with Auto Scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.app_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU exceeds 80% for 15 minutes"
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate (Metric Math)
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.app_name}-lambda-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  
  metric_query {
    id          = "error_rate"
    expression  = "errors / invocations * 100"
    label       = "Error Rate %"
    return_data = true
  }
  
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }
  
  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Anomaly Detection Alarm
resource "aws_cloudwatch_metric_alarm" "traffic_anomaly" {
  alarm_name          = "${var.app_name}-traffic-anomaly"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  
  threshold_metric_id = "ad1"
  
  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }
  
  metric_query {
    id          = "ad1"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
    label       = "Traffic Anomaly Band"
    return_data = true
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "${var.app_name}-critical"
  
  alarm_rule = join(" OR ", [
    "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.lambda_error_rate.alarm_name})",
  ])
  
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
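
The metric math expression in the Lambda alarm, `errors / invocations * 100`, is plain arithmetic over the per-period sums. The sketch below reproduces it in Python to show which periods would exceed the 5% threshold. Returning `None` for periods with zero invocations is a modelling choice of this sketch, and the function names and sample numbers are assumptions for illustration.

```python
from typing import List, Optional, Tuple

def error_rate_percent(errors: float, invocations: float) -> Optional[float]:
    """Mirror the expression errors / invocations * 100 for one period."""
    if invocations == 0:
        return None  # no traffic in this period, so no rate can be computed
    return errors / invocations * 100

def breaching_periods(periods: List[Tuple[float, float]],
                      threshold: float = 5.0) -> List[bool]:
    """For each (errors, invocations) pair, report whether the rate exceeds the threshold."""
    rates = [error_rate_percent(errors, invocations) for errors, invocations in periods]
    return [rate is not None and rate > threshold for rate in rates]

# Hypothetical 5-minute sums: (Errors, Invocations)
periods = [(12, 300), (18, 300), (21, 300), (0, 0)]
print(breaching_periods(periods))  # [False, True, True, False]
```

With `evaluation_periods = 2`, the second and third periods together would be enough to move the alarm into ALARM.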

CloudWatch Dashboards

┌────────────────────────────────────────────────────────────────────────┐
│                    Dashboard Design                                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  MY APPLICATION DASHBOARD                           📊 🔗 ⚙️   │    │
│   ├───────────────────────────────────────────────────────────────┤    │
│   │                                                               │    │
│   │  ┌─────────────────────┐  ┌─────────────────────────────────┐│    │
│   │  │ 🟢 Healthy Hosts: 4 │  │ 📈 Request Rate                 ││    │
│   │  │ 🔴 Errors: 0.1%     │  │ [Line graph of requests/min]   ││    │
│   │  │ ⏱️ P99 Latency: 45ms│  │                                 ││    │
│   │  └─────────────────────┘  └─────────────────────────────────┘│    │
│   │                                                               │    │
│   │  ┌────────────────────────────────────────────────────────┐  │    │
│   │  │ 📊 API Response Times by Endpoint                      │  │    │
│   │  │ ┌────────────────────────────────────────────────────┐ │  │    │
│   │  │ │ /orders ████████████████ 45ms                      │ │  │    │
│   │  │ │ /users  ████████ 25ms                              │ │  │    │
│   │  │ │ /products ██████████████████ 55ms                  │ │  │    │
│   │  │ └────────────────────────────────────────────────────┘ │  │    │
│   │  └────────────────────────────────────────────────────────┘  │    │
│   │                                                               │    │
│   │  ┌─────────────────────┐  ┌─────────────────────────────────┐│    │
│   │  │ Lambda Invocations  │  │ DynamoDB Consumed Capacity      ││    │
│   │  │ [Stacked area chart]│  │ [RCU/WCU over time]             ││    │
│   │  └─────────────────────┘  └─────────────────────────────────┘│    │
│   │                                                               │    │
│   │  ┌────────────────────────────────────────────────────────┐  │    │
│   │  │ 📝 Recent Errors (Log Widget)                          │  │    │
│   │  │ 10:23:45 ERROR Payment failed: insufficient funds      │  │    │
│   │  │ 10:22:30 ERROR Timeout connecting to external API      │  │    │
│   │  └────────────────────────────────────────────────────────┘  │    │
│   │                                                               │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Widget Types:                                                         │
│   • Line, Stacked area, Number, Gauge, Bar, Pie                        │
│   • Text (Markdown), Log (Log Insights), Alarm status                  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Dashboard as Code

import json

# CloudWatch Dashboard JSON definition
dashboard_body = {
    "widgets": [
        # Key Metrics Row
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Request Count",
                "metrics": [
                    ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "Sum",
                "region": "us-east-1"
            }
        },
        {
            "type": "metric",
            "x": 6, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Response Time (p99)",
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "p99",
                "region": "us-east-1"
            }
        },
        # Lambda Metrics
        {
            "type": "metric",
            "x": 0, "y": 6, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda Performance",
                "metrics": [
                    ["AWS/Lambda", "Duration", "FunctionName", "my-function", {"stat": "Average"}],
                    [".", ".", ".", ".", {"stat": "p95"}],
                    [".", ".", ".", ".", {"stat": "p99"}]
                ],
                "period": 60,
                "region": "us-east-1"
            }
        },
        # Alarm Status Widget
        {
            "type": "alarm",
            "x": 12, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Alarm Status",
                "alarms": [
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU",
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ErrorRate"
                ]
            }
        },
        # Log Widget
        {
            "type": "log",
            "x": 0, "y": 12, "width": 24, "height": 6,
            "properties": {
                "title": "Recent Errors",
                "query": "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
                "region": "us-east-1"
            }
        }
    ]
}

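# Create/update dashboard
import boto3
cloudwatch = boto3.client('cloudwatch')
response = cloudwatch.put_dashboard(
    DashboardName='MyApplicationDashboard',
    DashboardBody=json.dumps(dashboard_body)
)

# put_dashboard returns non-fatal problems found in the definition
for message in response.get('DashboardValidationMessages', []):
    print(message)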
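Because the dashboard body is plain JSON, its layout can be checked locally before it is uploaded. The sketch below assumes the 24-column dashboard grid; `find_layout_problems` is a hypothetical helper name, not part of boto3. It flags widgets that run past the grid or overlap each other.

```python
from itertools import combinations

GRID_COLUMNS = 24  # CloudWatch dashboards are laid out on a 24-column grid


def find_layout_problems(widgets):
    """Return human-readable layout problems for a list of widget dicts."""
    problems = []
    for i, w in enumerate(widgets):
        if w["x"] < 0 or w["x"] + w["width"] > GRID_COLUMNS:
            problems.append(f"widget {i} exceeds the {GRID_COLUMNS}-column grid")
    for (i, a), (j, b) in combinations(enumerate(widgets), 2):
        overlap_x = a["x"] < b["x"] + b["width"] and b["x"] < a["x"] + a["width"]
        overlap_y = a["y"] < b["y"] + b["height"] and b["y"] < a["y"] + a["height"]
        if overlap_x and overlap_y:
            problems.append(f"widgets {i} and {j} overlap")
    return problems


# Positions copied from the dashboard definition in this section
layout = [
    {"x": 0, "y": 0, "width": 6, "height": 6},    # Request Count
    {"x": 6, "y": 0, "width": 6, "height": 6},    # Response Time (p99)
    {"x": 0, "y": 6, "width": 12, "height": 6},   # Lambda Performance
    {"x": 12, "y": 0, "width": 6, "height": 6},   # Alarm Status
    {"x": 0, "y": 12, "width": 24, "height": 6},  # Recent Errors
]
print(find_layout_problems(layout))
```

For the layout above the list is empty. Running the same check in CI is one way to catch a misplaced widget before it reaches a shared dashboard.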

CloudWatch Synthetics

Canaries are scripted tests that run on a schedule and monitor your endpoints and APIs around the clock.
┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Synthetics                                │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Canary Types:                                                         │
│   ─────────────                                                        │
│   • Heartbeat: Simple availability check                               │
│   • API: REST API endpoint testing                                     │
│   • Broken Link: Check for broken links                               │
│   • Visual: Screenshot comparison                                      │
│   • GUI Workflow: Multi-step browser tests                             │
│                                                                         │
│   Features:                                                             │
│   ─────────                                                            │
│   • Runs on a schedule (rate: 1 min to 1 hour, or cron)                │
│   • Captures screenshots and HAR files                                 │
│   • Integrates with CloudWatch Alarms                                  │
│   • Uses Puppeteer (Node.js) or Selenium (Python)                      │
│                                                                         │
│   Example: API Canary Script (Node.js)                                 │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  const synthetics = require('Synthetics');                      │  │
│   │  const log = require('SyntheticsLogger');                       │  │
│   │                                                                  │  │
│   │  const apiCanary = async function() {                           │  │
│   │    const page = await synthetics.getPage();                     │  │
│   │                                                                  │  │
│   │    // Test API endpoint                                          │  │
│   │    const response = await page.goto('https://api.example.com'); │  │
│   │                                                                  │  │
│   │    if (response.status() !== 200) {                             │  │
│   │      throw new Error(`Expected 200, got ${response.status()}`); │  │
│   │    }                                                             │  │
│   │                                                                  │  │
│   │    log.info('API check passed');                                 │  │
│   │  };                                                              │  │
│   │                                                                  │  │
│   │  exports.handler = async () => {                                │  │
│   │    return await apiCanary();                                     │  │
│   │  };                                                              │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Pricing: $0.0012 per canary run                                      │
│   100 runs/day × 30 days = $3.60/month per canary                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
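The pricing arithmetic in the diagram generalizes to any schedule. The sketch below uses the per-run price quoted above and assumes a 30-day month; both function names are illustrative.

```python
PRICE_PER_RUN_USD = 0.0012  # per-run price quoted in the diagram above


def runs_per_day(interval_minutes):
    """Number of runs per day for a canary scheduled every `interval_minutes`."""
    return 24 * 60 / interval_minutes


def monthly_canary_cost(daily_runs, days=30):
    """Estimated monthly cost in USD for a single canary."""
    return round(daily_runs * days * PRICE_PER_RUN_USD, 2)


print(monthly_canary_cost(100))              # the 100 runs/day example above
print(monthly_canary_cost(runs_per_day(1)))  # a canary that runs every minute
```

A canary that runs every minute costs about $51.84 per month at this price, so the run interval is the main cost lever. Canary runs also create Lambda invocations, S3 artifacts, and logs, which are billed separately and not included here.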

CloudWatch Contributor Insights

Identify the top contributors in high-cardinality data, such as the IPs, users, or endpoints responsible for most of the traffic or errors.
┌────────────────────────────────────────────────────────────────────────┐
│                    Contributor Insights                                 │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Use Cases:                                                            │
│   ───────────                                                          │
│   • Find top IPs making requests                                       │
│   • Identify users causing errors                                      │
│   • Detect DDoS patterns                                               │
│   • Find hottest DynamoDB partition keys                               │
│                                                                         │
│   Example: Top Error-Producing API Endpoints                           │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Rank │ Endpoint          │ Error Count │ % of Total            │  │
│   │  ─────┼───────────────────┼─────────────┼───────────────────── │  │
│   │  1    │ /api/v1/checkout  │ 1,234       │ 45%                   │  │
│   │  2    │ /api/v1/payment   │ 567         │ 21%                   │  │
│   │  3    │ /api/v1/search    │ 234         │ 9%                    │  │
│   │  4    │ /api/v1/users     │ 123         │ 5%                    │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Rule Definition:                                                      │
│   {                                                                     │
│     "Schema": {                                                         │
│       "Name": "CloudWatchLogRule",                                      │
│       "Version": 1                                                      │
│     },                                                                  │
│     "LogGroupNames": ["/aws/lambda/my-api"],                            │
│     "LogFormat": "JSON",                                                │
│     "Contribution": {                                                   │
│       "Keys": ["$.path"],                                               │
│       "Filters": [{"Match": "$.status_code", "GreaterThan": 499}]       │
│     },                                                                  │
│     "AggregateOn": "Count"                                              │
│   }                                                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
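To make the ranking concrete, here is a local approximation of what such a rule computes: count 5xx events per endpoint and report each endpoint's share. It is a sketch for intuition only, not how the service is implemented, and the sample log lines are made up.

```python
import json
from collections import Counter


def top_contributors(log_lines, key="path", status_field="status_code", limit=3):
    """Rank values of `key` among JSON log events whose status is above 499."""
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        if event.get(status_field, 0) > 499:
            counts[event.get(key, "<missing>")] += 1
    total = sum(counts.values())
    # (contributor, error count, percent of all matching events)
    return [(value, n, round(100 * n / total)) for value, n in counts.most_common(limit)]


# Hypothetical JSON log events
sample_logs = [
    '{"path": "/api/v1/checkout", "status_code": 500}',
    '{"path": "/api/v1/checkout", "status_code": 503}',
    '{"path": "/api/v1/checkout", "status_code": 500}',
    '{"path": "/api/v1/payment", "status_code": 502}',
    '{"path": "/api/v1/search", "status_code": 200}',
    '{"path": "/api/v1/users", "status_code": 404}',
]
for endpoint, errors, share in top_contributors(sample_logs):
    print(f"{endpoint:<20} {errors:>3} {share:>3}%")
```

Only the four 5xx events are counted, so `/api/v1/checkout` ranks first with 3 errors (75%) and `/api/v1/payment` second with 1 (25%). Contributor Insights does the same kind of aggregation continuously over the log group.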

Best Practices

Set Retention Policies

Configure log retention to balance cost and compliance needs

Use Structured Logging

JSON format lets CloudWatch Logs Insights discover fields automatically, so queries can filter and aggregate on them directly

Alert on Symptoms

Focus on user-facing metrics, not just infrastructure

Tune Alarm Thresholds

Avoid alert fatigue with well-calibrated thresholds

Dashboard Hierarchy

Executive → Service → Debug dashboards

Cost Awareness

Monitor CloudWatch spend: custom metrics and log ingestion scale with volume and can grow quickly
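The structured logging practice can be sketched with the standard `logging` module. The formatter below writes one JSON object per line. The logger name, the field names, and the `extra={"fields": ...}` convention are assumptions of this example rather than anything CloudWatch requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra fields passed via extra={"fields": {...}} (a convention of this sketch)
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("orders")  # hypothetical logger name
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed", extra={"fields": {"order_id": "o-123", "status_code": 502}})
```

Once logs look like this, a query such as `filter level = "ERROR" | stats count(*) by status_code` works without any parsing step.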

Cost Optimization

cost_tips = {
    "logs": [
        "Set appropriate retention (default: never expire)",
        "Export to S3 for long-term storage (cheaper)",
        "Use metric filters instead of querying raw logs",
        "Filter at source (log level, sampling)",
    ],
    "metrics": [
        "Minimize custom metric dimensions",
        "Use EMF to publish metrics through logs (no PutMetricData API charge; log ingestion and custom metric charges still apply)",
        "Consider high-res metrics only where needed",
        "Delete unused custom metrics",
    ],
    "alarms": [
        "Use composite alarms to reduce noise",
        "Consolidate similar alarms",
        "Avoid very short periods on high-cardinality metrics",
    ],
    "dashboards": [
        "Share dashboards instead of duplicating",
        "Use automatic refresh wisely",
    ]
}

# Rough monthly cost estimation (example list prices in USD; verify against current pricing for your region)
monthly_costs = {
    "custom_metrics": "10,000 metrics × $0.30 = $3,000",
    "log_ingestion": "100 GB × $0.50 = $50",
    "log_storage": "1 TB × $0.03 = $30",
    "log_insights": "500 GB scanned × $0.005 = $2.50",
    "alarms": "100 alarms × $0.10 = $10",
    "dashboards": "First 3 free, then $3/month each",
}
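The estimates above can be turned into a small calculator. This sketch reuses the unit prices from `monthly_costs` as flat rates, a simplification because real pricing is tiered. The function and key names are illustrative.

```python
# Unit prices taken from the monthly_costs examples above (USD)
PRICES = {
    "custom_metric": 0.30,     # per metric per month
    "log_ingestion_gb": 0.50,  # per GB ingested
    "log_storage_gb": 0.03,    # per GB-month stored
    "log_insights_gb": 0.005,  # per GB scanned
    "alarm": 0.10,             # per alarm per month
    "dashboard": 3.00,         # per dashboard beyond the free ones
}
FREE_DASHBOARDS = 3


def estimate_monthly_cost(metrics, ingest_gb, stored_gb, scanned_gb, alarms, dashboards):
    """Return a per-item USD breakdown plus a total, using flat rates."""
    billable_dashboards = max(0, dashboards - FREE_DASHBOARDS)
    items = {
        "custom_metrics": metrics * PRICES["custom_metric"],
        "log_ingestion": ingest_gb * PRICES["log_ingestion_gb"],
        "log_storage": stored_gb * PRICES["log_storage_gb"],
        "log_insights": scanned_gb * PRICES["log_insights_gb"],
        "alarms": alarms * PRICES["alarm"],
        "dashboards": billable_dashboards * PRICES["dashboard"],
    }
    items = {name: round(cost, 2) for name, cost in items.items()}
    items["total"] = round(sum(items.values()), 2)
    return items


estimate = estimate_monthly_cost(10_000, 100, 1_000, 500, 100, 3)
for name, cost in estimate.items():
    print(f"{name:>15}: ${cost:,.2f}")
```

With these inputs, custom metrics account for $3,000 of a $3,092.50 total, which is why trimming metric dimensions usually saves more than tuning anything else on the list.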

🎯 Interview Questions

What is the difference between CloudWatch Logs, X-Ray, and CloudTrail?
CloudWatch Logs:
  • Application and system logs
  • What your code outputs
  • Debugging, troubleshooting
X-Ray:
  • Distributed tracing
  • Request flow across services
  • Performance analysis
CloudTrail:
  • AWS API call history
  • Who did what, when
  • Security auditing
How can you reduce CloudWatch Logs costs?
  1. Set retention policies (don’t keep forever)
  2. Export to S3 for long-term storage (use lifecycle rules)
  3. Filter at source (log levels, sampling)
  4. Use metric filters instead of Logs Insights for common queries
  5. Trim log payloads before ingestion (ingestion is measured in uncompressed bytes)
  6. Use Contributor Insights rules for ongoing analysis
What is the difference between metric alarms and composite alarms?
Metric Alarms:
  • Monitor single metric
  • Simple threshold or anomaly detection
  • One condition
Composite Alarms:
  • Combine multiple alarms with AND/OR/NOT
  • Reduce alert noise
  • Complex conditions like: “High CPU AND High Memory”
  • Better for on-call (fewer, more actionable alerts)
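A composite alarm is defined by a rule expression over other alarms. The sketch below builds the “High CPU AND High Memory” rule and the keyword arguments for `put_composite_alarm`. The alarm names, the composite alarm name, and the SNS topic ARN are placeholders, and the helper functions are hypothetical conveniences.

```python
def alarm(name):
    """Rule term that is true while the named alarm is in the ALARM state."""
    return f'ALARM("{name}")'


def all_of(*terms):
    """Combine rule terms with AND."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Combine rule terms with OR."""
    return "(" + " OR ".join(terms) + ")"


def composite_alarm_params(name, rule, sns_topic_arn):
    """Keyword arguments for the CloudWatch PutCompositeAlarm API."""
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "AlarmActions": [sns_topic_arn],
    }


rule = all_of(alarm("HighCPU"), alarm("HighMemory"))
params = composite_alarm_params(
    "HostSaturated",                               # placeholder alarm name
    rule,
    "arn:aws:sns:us-east-1:123456789012:on-call",  # placeholder topic ARN
)
print(params["AlarmRule"])

# To create the alarm (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_composite_alarm(**params)
```

Attaching the notification action only to the composite alarm, and leaving the child alarms without actions, is what reduces the number of pages.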
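Why can’t you see memory metrics for EC2 in CloudWatch?
Why: EC2 monitoring (basic or detailed) doesn’t include memory. Memory usage lives inside the guest OS and is not visible to the hypervisor.
Solution: Install the CloudWatch Agent.
# Install agent (Amazon Linux)
sudo yum install -y amazon-cloudwatch-agent

# Configure (the wizard saves its output to /opt/aws/amazon-cloudwatch-agent/bin/config.json)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Load the configuration and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json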
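The configuration the agent reads is a JSON file. As a rough illustration, the part that matters for memory can be generated from a Python dict. The `mem` section and the `mem_used_percent` measurement follow the agent's configuration schema. The 60-second interval is an arbitrary choice for this sketch.

```python
import json

# Minimal agent configuration that reports memory utilization per instance
agent_config = {
    "metrics": {
        # Tag each datapoint with the instance ID (resolved by the agent at runtime)
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,  # seconds; illustrative value
            }
        },
    }
}

rendered = json.dumps(agent_config, indent=2)
print(rendered)
```

Metrics collected this way are custom metrics (by default in the `CWAgent` namespace), so they count toward custom metric charges.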
How do you monitor Lambda cold starts?
Built-in: Lambda doesn’t have a direct cold start metric.
Solutions:
  1. Init Duration in REPORT logs (Logs Insights query):
fields @timestamp, @initDuration
| filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration), max(@initDuration), count(*)
  2. Custom metric from code (measure init time)
  3. X-Ray shows initialization segment
  4. Lambda Insights (enhanced monitoring)
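The custom-metric option can be sketched with a module-level flag, since module code runs once per execution environment. This example counts cold starts rather than timing them, and emits the count as an Embedded Metric Format (EMF) log line. The namespace `MyApp/Lambda`, the metric name `ColdStart`, and the hard-coded function name are assumptions.

```python
import json
import time

# Module scope runs once per execution environment, i.e. on a cold start
_is_cold_start = True


def emf_record(function_name, cold_start, now=None):
    """Build a CloudWatch Embedded Metric Format record as a dict."""
    return {
        "_aws": {
            "Timestamp": int((now or time.time()) * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/Lambda",  # hypothetical namespace
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "ColdStart", "Unit": "Count"}],
            }],
        },
        "FunctionName": function_name,
        "ColdStart": 1 if cold_start else 0,
    }


def handler(event, context):
    """Lambda entry point: reports 1 on the first invocation, 0 afterwards."""
    global _is_cold_start
    cold, _is_cold_start = _is_cold_start, False
    # In Lambda, a JSON line on stdout lands in CloudWatch Logs, where EMF is extracted
    print(json.dumps(emf_record("my-function", cold)))
    return {"cold_start": cold}
```

The `Sum` of `ColdStart` gives the number of cold starts, and its `Average` gives the cold start rate across invocations.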

🧪 Hands-On Lab

  1. Set Up Structured Logging: Implement JSON logging in a Lambda function
  2. Create Metric Filter: Extract an error count metric from logs
  3. Build Dashboard: Create an operational dashboard with key metrics
  4. Configure Alarms: Set up metric and composite alarms with SNS notification
  5. Create Canary: Set up synthetic monitoring for your API endpoint
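As a starting point for the metric filter step, the sketch below builds the arguments for `put_metric_filter`. It assumes the function logs JSON with a `level` field. The filter name, metric name, and `MyApp` namespace are placeholders.

```python
def error_count_metric_filter(log_group, namespace="MyApp"):
    """Keyword arguments for the CloudWatch Logs PutMetricFilter API."""
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",
        # JSON filter pattern: matches events whose "level" field equals "ERROR"
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [{
            "metricName": "ErrorCount",
            "metricNamespace": namespace,
            "metricValue": "1",   # add 1 to the metric for every matching event
            "defaultValue": 0.0,  # report 0 when nothing matches, so alarms have data
        }],
    }


params = error_count_metric_filter("/aws/lambda/my-function")
print(params["filterPattern"])

# To apply it (requires AWS credentials):
# import boto3
# boto3.client("logs").put_metric_filter(**params)
```

The resulting `ErrorCount` metric can then be charted on the dashboard and used as the input for the metric alarm in the later steps.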

Next Module

AWS X-Ray

Master distributed tracing with AWS X-Ray