Amazon CloudWatch - Dev Weekends

Module Overview

Estimated Time: 4-5 hours | Difficulty: Intermediate | Prerequisites: Core Concepts

Amazon CloudWatch is the unified monitoring and observability service for AWS. This module provides a comprehensive deep-dive into CloudWatch capabilities for production monitoring.

What You’ll Learn:

CloudWatch Metrics (built-in and custom)
CloudWatch Logs and Log Insights
CloudWatch Alarms and composite alarms
Dashboards and visualization
CloudWatch Synthetics (canaries)
CloudWatch Contributor Insights
CloudWatch Anomaly Detection
Cross-account and cross-region monitoring

CloudWatch Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Architecture                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Data Sources                        CloudWatch                        │
│   ────────────                        ──────────                       │
│   ┌──────────────┐                   ┌───────────────────────────────┐ │
│   │ EC2, RDS,    │───────────────────│         METRICS               │ │
│   │ Lambda, etc. │   Auto-collected  │  • Standard (5-min)           │ │
│   └──────────────┘                   │  • Detailed (1-min)           │ │
│                                      │  • High-res (1-sec)           │ │
│   ┌──────────────┐                   └───────────────────────────────┘ │
│   │ Applications │   PutMetricData   ┌───────────────────────────────┐ │
│   │ (Custom)     │───────────────────│      CUSTOM METRICS           │ │
│   └──────────────┘                   │  • Business KPIs              │ │
│                                      │  • App-specific metrics       │ │
│   ┌──────────────┐                   └───────────────────────────────┘ │
│   │ Lambda, ECS, │   Auto-collected  ┌───────────────────────────────┐ │
│   │ EC2, VPC     │───────────────────│          LOGS                 │ │
│   └──────────────┘                   │  • Log Groups                 │ │
│                                      │  • Log Streams                │ │
│   ┌──────────────┐                   │  • Log Insights               │ │
│   │ Applications │   Agent/SDK       └───────────────────────────────┘ │
│   └──────────────┘                                                     │
│                                      ┌───────────────────────────────┐ │
│   Processing & Actions               │         ALARMS                │ │
│   ─────────────────────              │  • Metric Alarms              │ │
│   • Alarms → SNS, Auto Scaling       │  • Composite Alarms           │ │
│   • EventBridge → Lambda, etc.       │  • Anomaly Detection          │ │
│   • Dashboards → Visualization       └───────────────────────────────┘ │
│   • Contributor Insights                                               │
│                                      ┌───────────────────────────────┐ │
│                                      │       DASHBOARDS              │ │
│                                      │  • Widgets                    │ │
│                                      │  • Cross-account view         │ │
│                                      │  • Sharing                    │ │
│                                      └───────────────────────────────┘ │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudWatch Metrics

Built-in Metrics

┌────────────────────────────────────────────────────────────────────────┐
│                    Built-in Metrics by Service                          │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   EC2 (Basic - 5 min | Detailed - 1 min):                              │
│   ─────────────────────────────────────                                │
│   • CPUUtilization       • DiskReadOps         • DiskWriteOps          │
│   • NetworkIn            • NetworkOut          • StatusCheckFailed     │
│   ⚠️ Memory NOT included (use CloudWatch Agent)                        │
│                                                                         │
│   Lambda:                                                               │
│   ───────                                                               │
│   • Invocations          • Duration            • Errors                │
│   • Throttles            • ConcurrentExec      • UnreservedConcurrent  │
│   • IteratorAge (streams)                                              │
│                                                                         │
│   RDS:                                                                  │
│   ────                                                                  │
│   • CPUUtilization       • DatabaseConnections • FreeableMemory        │
│   • ReadIOPS             • WriteIOPS           • ReadLatency           │
│   • WriteLatency         • FreeStorageSpace                            │
│                                                                         │
│   DynamoDB:                                                             │
│   ─────────                                                             │
│   • ConsumedRCU          • ConsumedWCU         • ProvisionedRCU        │
│   • ProvisionedWCU       • ThrottledRequests   • SystemErrors          │
│   • ReturnedItemCount    • SuccessfulRequestLatency                    │
│                                                                         │
│   Application Load Balancer:                                            │
│   ──────────────────────────                                           │
│   • RequestCount         • TargetResponseTime  • HTTPCode_Target_2XX   │
│   • HTTPCode_Target_4XX  • HTTPCode_Target_5XX • HealthyHostCount      │
│   • UnHealthyHostCount   • ActiveConnectionCount                       │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Metric Dimensions

┌────────────────────────────────────────────────────────────────────────┐
│                    Metric Dimensions                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Metrics are uniquely identified by:                                   │
│   Namespace + Metric Name + Dimensions                                  │
│                                                                         │
│   Example:                                                              │
│   ─────────                                                             │
│   Namespace: AWS/EC2                                                    │
│   Metric: CPUUtilization                                                │
│   Dimensions: InstanceId=i-1234567890abcdef0                           │
│                                                                         │
│   Each unique combination creates a separate time series:               │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  AWS/EC2 | CPUUtilization | InstanceId=i-abc123      → Series 1 │  │
│   │  AWS/EC2 | CPUUtilization | InstanceId=i-def456      → Series 2 │  │
│   │  AWS/EC2 | CPUUtilization | AutoScalingGroupName=web → Series 3 │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Aggregation Example:                                                  │
│   ────────────────────                                                 │
│   Query by ASG dimension to get aggregate across all instances:        │
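│   aws cloudwatch get-metric-statistics \                               │
│     --namespace AWS/EC2 \                                              │
│     --metric-name CPUUtilization \                                     │
│     --dimensions Name=AutoScalingGroupName,Value=my-asg \              │
│     --statistics Average --period 300 \                                │
│     --start-time 2024-01-15T00:00:00Z \                                │
│     --end-time 2024-01-15T01:00:00Z                                    │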
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
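
The same aggregate query can be expressed in Python. The sketch below only builds the keyword arguments for boto3's get_metric_statistics call, so it runs without AWS credentials. The helper name asg_cpu_query and the one-hour window are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

def asg_cpu_query(asg_name: str, end: datetime, hours: int = 1) -> dict:
    """Build get_metric_statistics arguments for ASG-wide average CPU."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        # Querying by the ASG dimension returns one aggregated time series
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # 5-minute data points
        "Statistics": ["Average"],
    }

params = asg_cpu_query("my-asg", datetime(2024, 1, 15, 1, 0, tzinfo=timezone.utc))
# With credentials configured, the dict could be passed on as:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```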

Custom Metrics

import boto3
from datetime import datetime, timezone
from typing import Optional

cloudwatch = boto3.client('cloudwatch')

def publish_business_metric(metric_name: str, value: float,
                            dimensions: Optional[list] = None):
    """
    Publish custom business metric to CloudWatch.

    Cost: $0.30 per metric per month (first 10,000)
    """
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': 'Count',
        'Timestamp': datetime.now(timezone.utc),  # Timezone-aware UTC timestamp
        'StorageResolution': 60  # Standard resolution (60 = 1 min)
    }

    if dimensions:
        metric_data['Dimensions'] = dimensions

    cloudwatch.put_metric_data(
        Namespace='MyApplication',
        MetricData=[metric_data]
    )

# Example: Track orders placed
publish_business_metric(
    metric_name='OrdersPlaced',
    value=1,
    dimensions=[
        {'Name': 'Environment', 'Value': 'production'},
        {'Name': 'Region', 'Value': 'us-east-1'},
        {'Name': 'ProductCategory', 'Value': 'electronics'}
    ]
)

# High-resolution metric (1-second granularity)
def publish_high_res_metric(metric_name: str, value: float):
    """High-resolution metric for real-time monitoring."""
    cloudwatch.put_metric_data(
        Namespace='MyApplication/HighFrequency',
        MetricData=[{
            'MetricName': metric_name,
            'Value': value,
            'Unit': 'Milliseconds',
            'Timestamp': datetime.now(timezone.utc),
            'StorageResolution': 1  # 1-second resolution
        }]
    )

# Batch publish (up to 1000 metrics per request, 1 MB max payload)
def publish_batch_metrics(metrics: list):
    """Publish metrics in chunks of 1000 so none are silently dropped."""
    for start in range(0, len(metrics), 1000):
        cloudwatch.put_metric_data(
            Namespace='MyApplication',
            MetricData=metrics[start:start + 1000]  # Max 1000 per request
        )
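
Because every unique dimension combination is its own time series, and each one is billed as a separate custom metric, dimension cardinality drives cost. The following is a rough sketch of that arithmetic using the $0.30 first-tier price quoted above. The function name and the worst-case assumption that every combination is actually published are illustrative, and pricing beyond the first 10,000 metrics is deliberately not modelled.

```python
from math import prod

COST_CENTS_PER_METRIC = 30   # $0.30 per metric per month (first 10,000)
FIRST_TIER_LIMIT = 10_000

def estimate_custom_metric_cost(cardinalities: dict) -> tuple:
    """Return (time_series_count, monthly_cost_usd) for one metric name.

    `cardinalities` maps each dimension name to the number of distinct
    values it can take. Worst case, every combination is published.
    """
    series = prod(cardinalities.values())
    if series > FIRST_TIER_LIMIT:
        raise ValueError("beyond the first pricing tier; not modelled here")
    return series, series * COST_CENTS_PER_METRIC / 100

series, cost = estimate_custom_metric_cost(
    {"Environment": 3, "Region": 2, "ProductCategory": 5}
)
print(f"{series} time series, about ${cost:.2f} per month")
```

Adding a high-cardinality dimension such as a user ID multiplies the series count, which is why such values usually belong in logs rather than in metric dimensions.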

CloudWatch Agent

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Agent                                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   The CloudWatch Agent collects metrics and logs from:                 │
│   • EC2 instances (Linux & Windows)                                    │
│   • On-premises servers                                                │
│   • Containers                                                         │
│                                                                         │
│   Metrics NOT available without agent:                                 │
│   ────────────────────────────────────                                 │
│   • Memory utilization                                                 │
│   • Disk space utilization                                             │
│   • Disk I/O                                                           │
│   • Network connections                                                │
│   • Process information                                                │
│                                                                         │
│   Configuration (amazon-cloudwatch-agent.json):                        │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  {                                                               │  │
│   │    "metrics": {                                                  │  │
│   │      "namespace": "MyApp/EC2",                                   │  │
│   │      "metrics_collected": {                                      │  │
│   │        "mem": {                                                  │  │
│   │          "measurement": ["mem_used_percent"]                     │  │
│   │        },                                                        │  │
│   │        "disk": {                                                 │  │
│   │          "measurement": ["disk_used_percent"],                   │  │
│   │          "resources": ["/", "/data"]                             │  │
│   │        }                                                         │  │
│   │      }                                                           │  │
│   │    },                                                            │  │
│   │    "logs": {                                                     │  │
│   │      "logs_collected": {                                         │  │
│   │        "files": {                                                │  │
│   │          "collect_list": [{                                      │  │
│   │            "file_path": "/var/log/myapp/*.log",                  │  │
│   │            "log_group_name": "myapp-logs",                       │  │
│   │            "log_stream_name": "{instance_id}"                    │  │
│   │          }]                                                      │  │
│   │        }                                                         │  │
│   │      }                                                           │  │
│   │    }                                                             │  │
│   │  }                                                               │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
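
When many hosts need slightly different settings, the agent configuration can be generated rather than hand-edited. This is a minimal sketch that reproduces the JSON structure shown above; the function name build_agent_config and its parameters are hypothetical, and only the keys that appear in the example configuration are used.

```python
import json

def build_agent_config(namespace: str, mount_points: list,
                       log_path: str, log_group: str) -> dict:
    """Build a CloudWatch Agent config with memory, disk, and one log file."""
    return {
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["disk_used_percent"],
                    "resources": mount_points,
                },
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [{
                        "file_path": log_path,
                        "log_group_name": log_group,
                        # {instance_id} is a placeholder the agent fills in
                        "log_stream_name": "{instance_id}",
                    }]
                }
            }
        },
    }

config = build_agent_config("MyApp/EC2", ["/", "/data"],
                            "/var/log/myapp/*.log", "myapp-logs")
config_text = json.dumps(config, indent=2)
print(config_text)
```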

CloudWatch Logs

Log Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs Structure                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   LOG GROUP: /aws/lambda/order-service                                 │
│   ────────────────────────────────────                                 │
│   • Retention: 1 day to never expire                                   │
│   • Encryption: Optional KMS encryption                                │
│   • Metric Filters: Extract metrics from logs                          │
│   • Subscription Filters: Stream to Kinesis/Lambda                     │
│                                                                         │
│   └── LOG STREAM: 2024/01/15/[$LATEST]abc123                          │
│       └── LOG EVENT: {"timestamp": 1705312800000,                      │
│                       "message": "Order processed: ORD-123"}           │
│       └── LOG EVENT: {"timestamp": 1705312801000,                      │
│                       "message": "Payment confirmed: PAY-456"}         │
│                                                                         │
│   └── LOG STREAM: 2024/01/15/[$LATEST]def456                          │
│       └── ...                                                          │
│                                                                         │
│   Pricing (us-east-1):                                                 │
│   ────────────────────                                                 │
│   • Ingestion: $0.50 per GB                                            │
│   • Storage: $0.03 per GB/month                                        │
│   • Log Insights queries: $0.005 per GB scanned                        │
│   • Export to S3: Free (but S3 storage costs apply)                    │
│                                                                         │
│   Retention Settings:                                                  │
│   ───────────────────                                                  │
│   1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, 2 months,           │
│   3 months, 4 months, 5 months, 6 months, 1 year, 13 months,           │
│   18 months, 2 years, 3 years, 5 years, 6 years, 7 years, 8 years,     │
│   9 years, 10 years, Never Expire                                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
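
The pricing figures above combine into a simple monthly estimate. The sketch below uses only the three us-east-1 prices listed in the box; the function name and the sample volumes are assumptions, and real bills may include free-tier allowances or other charges that are not modelled here.

```python
PRICE_INGEST_PER_GB = 0.50             # ingestion
PRICE_STORAGE_PER_GB_MONTH = 0.03      # storage
PRICE_INSIGHTS_PER_GB_SCANNED = 0.005  # Log Insights queries

def estimate_monthly_log_cost(ingested_gb: float, stored_gb: float,
                              scanned_gb: float) -> float:
    """Rough monthly CloudWatch Logs cost in USD, rounded to cents."""
    total = (ingested_gb * PRICE_INGEST_PER_GB
             + stored_gb * PRICE_STORAGE_PER_GB_MONTH
             + scanned_gb * PRICE_INSIGHTS_PER_GB_SCANNED)
    return round(total, 2)

# Hypothetical month: 100 GB ingested, 300 GB retained, 1,000 GB scanned
monthly_cost = estimate_monthly_log_cost(100, 300, 1000)
print(f"Estimated cost: ${monthly_cost:.2f}")
```

In this example ingestion accounts for most of the total, which is why shorter retention alone often saves less than reducing log volume at the source.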

Structured Logging Best Practices

import json
import logging
from datetime import datetime, timezone

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""

    def __init__(self, service: str, environment: str):
        self.service = service
        self.environment = environment
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)

    def _log(self, level: str, message: str, **extra):
        log_entry = {
            # Timezone-aware UTC timestamp, rendered with a trailing "Z"
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "environment": self.environment,
            "message": message,
            **extra
        }
        # default=str keeps non-JSON-serializable extras from raising
        print(json.dumps(log_entry, default=str))  # Lambda automatically captures stdout
    
    def info(self, message: str, **extra):
        self._log("INFO", message, **extra)
    
    def error(self, message: str, error: Exception = None, **extra):
        if error:
            extra["error_type"] = type(error).__name__
            extra["error_message"] = str(error)
        self._log("ERROR", message, **extra)
    
    def metric(self, name: str, value: float, unit: str = "Count", **extra):
        """Log a metric that can be extracted with metric filters."""
        self._log("METRIC", f"{name}={value}", 
                  metric_name=name, metric_value=value, metric_unit=unit, **extra)

# Usage
logger = StructuredLogger(service="order-service", environment="prod")

def process_order(order_id: str, user_id: str):
    logger.info("Processing order", order_id=order_id, user_id=user_id)
    
    try:
        # Business logic...
        logger.info("Order completed", 
                   order_id=order_id, 
                   duration_ms=150,
                   items_count=3)
        logger.metric("OrdersProcessed", 1, "Count", order_id=order_id)
        
    except Exception as e:
        logger.error("Order processing failed", 
                    error=e, 
                    order_id=order_id,
                    user_id=user_id)
        raise

Log Insights Queries

# Find all errors (set the query time range to the last hour)
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors by service (fields in JSON logs are discovered automatically)
fields @timestamp, service, level
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate Lambda p99 latency
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| stats avg(@duration) as avg_ms,
        pct(@duration, 50) as p50_ms,
        pct(@duration, 95) as p95_ms,
        pct(@duration, 99) as p99_ms
  by bin(1h)

# Find slow API requests (duration_ms is discovered from the JSON log)
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

# Error rate calculation
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(level = "ERROR") as errors
  by bin(5m)
| sort @timestamp desc

# Top contributors to log volume
fields @logStream, @message
| stats count(*) as log_count by @logStream
| sort log_count desc
| limit 20

# Correlation by request ID
fields @timestamp, @message, @logStream
| parse @message '"request_id":"*"' as request_id
| filter request_id = "abc-123-def-456"
| sort @timestamp asc
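
Queries like these can also be run programmatically. The sketch below uses the start_query and get_query_results operations of the boto3 CloudWatch Logs client and polls until the query finishes. The helper name run_insights_query, the polling interval, and the retry limit are assumptions; the client is passed in as a parameter so the function can be exercised with a stub.

```python
import time

def run_insights_query(logs_client, log_group: str, query: str,
                       start_epoch: int, end_epoch: int,
                       poll_seconds: float = 1.0, max_polls: int = 60) -> list:
    """Start a Logs Insights query and poll until it completes."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,   # seconds since the Unix epoch
        endTime=end_epoch,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

With credentials configured, a call might look like run_insights_query(boto3.client("logs"), "/aws/lambda/order-service", query_text, start, end), where query_text holds one of the queries above.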

Metric Filters

┌────────────────────────────────────────────────────────────────────────┐
│                    Metric Filters                                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Extract metrics from logs automatically:                              │
│                                                                         │
│   Filter Pattern               │ Metric Created                        │
│   ─────────────────────────────┼───────────────────────────────────── │
│   ERROR                        │ Count of ERROR occurrences            │
│   [timestamp, level=ERROR, ...] │ Count of structured errors           │
│   "status": 500                │ Count of 500 errors                   │
│   { $.latency > 1000 }         │ Count of slow requests (JSON)        │
│                                                                         │
│   Example: Track application errors                                    │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Filter Pattern: { $.level = "ERROR" }                          │  │
│   │  Metric Name: ApplicationErrors                                  │  │
│   │  Metric Namespace: MyApp                                        │  │
│   │  Metric Value: 1                                                 │  │
│   │  Default Value: 0                                                │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Example: Extract latency from JSON logs                              │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Filter Pattern: { $.type = "REQUEST" }                         │  │
│   │  Metric Name: RequestLatency                                     │  │
│   │  Metric Namespace: MyApp                                        │  │
│   │  Metric Value: $.latency                                         │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
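
The "Track application errors" example above maps directly onto the put_metric_filter operation of the boto3 CloudWatch Logs client. The sketch below only builds the request arguments, so it runs without AWS access; the helper name error_metric_filter and the choice to reuse the metric name as the filter name are assumptions.

```python
def error_metric_filter(log_group: str) -> dict:
    """Build put_metric_filter arguments that count JSON logs with level ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "ApplicationErrors",
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [{
            "metricName": "ApplicationErrors",
            "metricNamespace": "MyApp",
            "metricValue": "1",    # passed as a string; "$.latency" would extract a field
            "defaultValue": 0.0,   # emitted when no log events match
        }],
    }

filter_args = error_metric_filter("/aws/lambda/order-service")
# With credentials configured:
#   boto3.client("logs").put_metric_filter(**filter_args)
```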

CloudWatch Alarms

Alarm Types and States

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Alarms                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Alarm States:                                                         │
│   ┌───────────┐     ┌───────────┐     ┌─────────────────┐              │
│   │    OK     │────►│   ALARM   │────►│ INSUFFICIENT    │              │
│   │  (green)  │◄────│   (red)   │◄────│     DATA        │              │
│   └───────────┘     └───────────┘     └─────────────────┘              │
│                                                                         │
│   Alarm Components:                                                     │
│   ─────────────────                                                    │
│   • Metric: What to monitor (CPUUtilization, custom metric)           │
│   • Statistic: How to aggregate (Average, Sum, Max, p99)              │
│   • Period: Data point interval (60s, 300s, etc.)                      │
│   • Evaluation Periods: How many periods before alarm                  │
│   • Threshold: Value that triggers alarm                               │
│   • Comparison: GreaterThan, LessThan, etc.                           │
│                                                                         │
│   Example Configuration:                                                │
│   ─────────────────────                                                │
│   Metric: CPUUtilization                                               │
│   Statistic: Average                                                   │
│   Period: 300 seconds (5 min)                                          │
│   Evaluation Periods: 3                                                │
│   Threshold: 80%                                                       │
│   → Alarm triggers when CPU > 80% for 3 consecutive 5-min periods     │
│                                                                         │
│   Alarm Actions:                                                        │
│   ───────────────                                                      │
│   • SNS Topic → Email, SMS, Lambda                                     │
│   • Auto Scaling → Scale up/down                                       │
│   • EC2 Actions → Stop, terminate, reboot, recover                     │
│   • Systems Manager → Run automation documents                         │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
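
To make the evaluation rule concrete, the sketch below replays the example configuration (threshold 80, 3 evaluation periods) against a list of 5-minute averages. It is a simplified model, not CloudWatch's actual evaluator: it assumes a GreaterThanThreshold comparison, requires every one of the most recent periods to breach, and ignores missing-data handling. The function name alarm_state is hypothetical.

```python
def alarm_state(datapoints: list, threshold: float, evaluation_periods: int) -> str:
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for the newest datapoints."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # ALARM only when every one of the last N periods breaches the threshold
    return "ALARM" if all(value > threshold for value in recent) else "OK"

cpu_5min_averages = [62.0, 85.5, 91.2, 88.0]
print(alarm_state(cpu_5min_averages, threshold=80, evaluation_periods=3))  # ALARM
```

A single spike such as [62.0, 95.0, 70.0, 65.0] stays in OK under this rule, which is the point of requiring several consecutive periods.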

Essential Alarms (Terraform)

# High CPU Alarm with Auto Scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.app_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU exceeds 80% for 15 minutes"
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate (Metric Math)
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.app_name}-lambda-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  
  metric_query {
    id          = "error_rate"
    expression  = "errors / invocations * 100"
    label       = "Error Rate %"
    return_data = true
  }
  
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }
  
  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Anomaly Detection Alarm
resource "aws_cloudwatch_metric_alarm" "traffic_anomaly" {
  alarm_name          = "${var.app_name}-traffic-anomaly"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  
  threshold_metric_id = "ad1"
  
  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }
  
  metric_query {
    id          = "ad1"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
    label       = "Traffic Anomaly Band"
    return_data = true
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "${var.app_name}-critical"
  
  alarm_rule = join(" OR ", [
    "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.lambda_error_rate.alarm_name})",
  ])
  
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
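
The metric math expression in the Lambda alarm is simple arithmetic, and working through it by hand helps when choosing a threshold. The sketch below mirrors errors / invocations * 100 for a single 5-minute period. The function names are hypothetical, and returning None for a period with zero invocations is a choice made for this sketch rather than a statement about how CloudWatch evaluates the expression.

```python
from typing import Optional

def error_rate_percent(errors: float, invocations: float) -> Optional[float]:
    """Mirror the expression errors / invocations * 100 for one period."""
    if invocations == 0:
        return None  # no traffic in this period, so no rate to report
    return errors / invocations * 100

def breaches(rate: Optional[float], threshold: float = 5.0) -> bool:
    """GreaterThanThreshold comparison; a missing value does not breach."""
    return rate is not None and rate > threshold

busy_period = error_rate_percent(errors=2, invocations=32)
quiet_period = error_rate_percent(errors=0, invocations=0)
print(busy_period, breaches(busy_period))
print(quiet_period, breaches(quiet_period))
```

Low-traffic functions can swing past 5% on a single failure, so pairing the rate with a minimum invocation count is worth considering.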

CloudWatch Dashboards

┌────────────────────────────────────────────────────────────────────────┐
│                    Dashboard Design                                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  MY APPLICATION DASHBOARD                           📊 🔗 ⚙️   │    │
│   ├───────────────────────────────────────────────────────────────┤    │
│   │                                                               │    │
│   │  ┌─────────────────────┐  ┌─────────────────────────────────┐│    │
│   │  │ 🟢 Healthy Hosts: 4 │  │ 📈 Request Rate                 ││    │
│   │  │ 🔴 Errors: 0.1%     │  │ [Line graph of requests/min]   ││    │
│   │  │ ⏱️ P99 Latency: 45ms│  │                                 ││    │
│   │  └─────────────────────┘  └─────────────────────────────────┘│    │
│   │                                                               │    │
│   │  ┌────────────────────────────────────────────────────────┐  │    │
│   │  │ 📊 API Response Times by Endpoint                      │  │    │
│   │  │ ┌────────────────────────────────────────────────────┐ │  │    │
│   │  │ │ /orders ████████████████ 45ms                      │ │  │    │
│   │  │ │ /users  ████████ 25ms                              │ │  │    │
│   │  │ │ /products ██████████████████ 55ms                  │ │  │    │
│   │  │ └────────────────────────────────────────────────────┘ │  │    │
│   │  └────────────────────────────────────────────────────────┘  │    │
│   │                                                               │    │
│   │  ┌─────────────────────┐  ┌─────────────────────────────────┐│    │
│   │  │ Lambda Invocations  │  │ DynamoDB Consumed Capacity      ││    │
│   │  │ [Stacked area chart]│  │ [RCU/WCU over time]             ││    │
│   │  └─────────────────────┘  └─────────────────────────────────┘│    │
│   │                                                               │    │
│   │  ┌────────────────────────────────────────────────────────┐  │    │
│   │  │ 📝 Recent Errors (Log Widget)                          │  │    │
│   │  │ 10:23:45 ERROR Payment failed: insufficient funds      │  │    │
│   │  │ 10:22:30 ERROR Timeout connecting to external API      │  │    │
│   │  └────────────────────────────────────────────────────────┘  │    │
│   │                                                               │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Widget Types:                                                         │
│   • Line, Stacked area, Number, Gauge, Bar, Pie                        │
│   • Text (Markdown), Log (Log Insights), Alarm status                  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Dashboard as Code

import json

# CloudWatch Dashboard JSON definition
dashboard_body = {
    "widgets": [
        # Key Metrics Row
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Request Count",
                "metrics": [
                    ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "Sum",
                "region": "us-east-1"
            }
        },
        {
            "type": "metric",
            "x": 6, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Response Time (p99)",
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "p99",
                "region": "us-east-1"
            }
        },
        # Lambda Metrics
        {
            "type": "metric",
            "x": 0, "y": 6, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda Performance",
                "metrics": [
                    ["AWS/Lambda", "Duration", "FunctionName", "my-function", {"stat": "Average"}],
                    [".", ".", ".", ".", {"stat": "p95"}],
                    [".", ".", ".", ".", {"stat": "p99"}]
                ],
                "period": 60,
                "region": "us-east-1"
            }
        },
        # Alarm Status Widget
        {
            "type": "alarm",
            "x": 12, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Alarm Status",
                "alarms": [
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU",
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ErrorRate"
                ]
            }
        },
        # Log Widget
        {
            "type": "log",
            "x": 0, "y": 12, "width": 24, "height": 6,
            "properties": {
                "title": "Recent Errors",
                "query": "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
                "region": "us-east-1"
            }
        }
    ]
}

# Create/update dashboard
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_dashboard(
    DashboardName='MyApplicationDashboard',
    DashboardBody=json.dumps(dashboard_body)
)
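Because the dashboard body is plain JSON, its layout can be sanity-checked before `put_dashboard` is called. The sketch below is a hypothetical helper (`validate_layout` is not part of boto3). It assumes the 24-column grid that CloudWatch dashboards use and reports widgets that run past the right edge or overlap each other.

```python
from itertools import combinations

GRID_COLUMNS = 24  # CloudWatch dashboards place widgets on a 24-column grid


def _overlaps(a, b):
    """Return True when two widget rectangles share at least one grid cell."""
    return (
        a["x"] < b["x"] + b["width"]
        and b["x"] < a["x"] + a["width"]
        and a["y"] < b["y"] + b["height"]
        and b["y"] < a["y"] + a["height"]
    )


def validate_layout(widgets):
    """Return a list of human-readable layout problems (empty if none)."""
    problems = []
    for i, w in enumerate(widgets):
        if w["x"] < 0 or w["x"] + w["width"] > GRID_COLUMNS:
            problems.append(f"widget {i} exceeds the {GRID_COLUMNS}-column grid")
    for (i, a), (j, b) in combinations(enumerate(widgets), 2):
        if _overlaps(a, b):
            problems.append(f"widgets {i} and {j} overlap")
    return problems


# Same coordinates as the dashboard above (properties omitted for brevity)
layout = [
    {"x": 0, "y": 0, "width": 6, "height": 6},
    {"x": 6, "y": 0, "width": 6, "height": 6},
    {"x": 0, "y": 6, "width": 12, "height": 6},
    {"x": 12, "y": 0, "width": 6, "height": 6},
    {"x": 0, "y": 12, "width": 24, "height": 6},
]
print(validate_layout(layout))  # [] means no problems were found
```

Running a check like this in CI is one way to catch layout mistakes before they reach a shared dashboard.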

CloudWatch Synthetics

Canary tests that monitor your endpoints 24/7.

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Synthetics                                │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Canary Types:                                                         │
│   ─────────────                                                        │
│   • Heartbeat: Simple availability check                               │
│   • API: REST API endpoint testing                                     │
│   • Broken Link: Check for broken links                               │
│   • Visual: Screenshot comparison                                      │
│   • GUI Workflow: Multi-step browser tests                             │
│                                                                         │
│   Features:                                                             │
│   ─────────                                                            │
│   • Runs on a schedule (1 min to 1 hour)                               │
│   • Captures screenshots and HAR files                                 │
│   • Integrates with CloudWatch Alarms                                  │
│   • Uses Puppeteer or Selenium                                         │
│                                                                         │
│   Example: API Canary Script (Node.js)                                 │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  const synthetics = require('Synthetics');                      │  │
│   │  const log = require('SyntheticsLogger');                       │  │
│   │                                                                  │  │
│   │  const apiCanary = async function() {                           │  │
│   │    const page = await synthetics.getPage();                     │  │
│   │                                                                  │  │
│   │    // Test API endpoint                                          │  │
│   │    const response = await page.goto('https://api.example.com'); │  │
│   │                                                                  │  │
│   │    if (response.status() !== 200) {                             │  │
│   │      throw new Error(`Expected 200, got ${response.status()}`); │  │
│   │    }                                                             │  │
│   │                                                                  │  │
│   │    log.info('API check passed');                                 │  │
│   │  };                                                              │  │
│   │                                                                  │  │
│   │  exports.handler = async () => {                                │  │
│   │    return await apiCanary();                                     │  │
│   │  };                                                              │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Pricing: $0.0012 per canary run                                      │
│   100 runs/day × 30 days = $3.60/month per canary                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
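The pricing arithmetic above generalizes to any schedule. The following sketch assumes the per-run price quoted in the diagram and a fixed-rate schedule. The function names are illustrative, and the estimate leaves out the Lambda and S3 charges that a canary's runs and artifacts also incur.

```python
PRICE_PER_RUN_USD = 0.0012  # per-run price quoted above; check current AWS pricing


def runs_per_day_for_interval(interval_minutes):
    """Number of runs per day for a fixed-rate schedule."""
    return (24 * 60) // interval_minutes


def monthly_canary_cost(runs_per_day, days=30, price_per_run=PRICE_PER_RUN_USD):
    """Estimated monthly cost in USD for one canary (canary runs only)."""
    return round(runs_per_day * days * price_per_run, 2)


print(monthly_canary_cost(100))                           # 3.6
print(monthly_canary_cost(runs_per_day_for_interval(5)))  # 10.37
print(monthly_canary_cost(runs_per_day_for_interval(1)))  # 51.84
```

A one-minute schedule costs roughly fourteen times as much as the 100-runs-per-day example, so the interval is worth choosing per endpoint.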

CloudWatch Contributor Insights

Identify the top contributors in high-cardinality data, such as the IPs, users, or endpoints behind most requests or errors.

┌────────────────────────────────────────────────────────────────────────┐
│                    Contributor Insights                                 │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Use Cases:                                                            │
│   ───────────                                                          │
│   • Find top IPs making requests                                       │
│   • Identify users causing errors                                      │
│   • Detect DDoS patterns                                               │
│   • Find hottest DynamoDB partition keys                               │
│                                                                         │
│   Example: Top Error-Producing API Endpoints                           │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Rank │ Endpoint          │ Error Count │ % of Total            │  │
│   │  ─────┼───────────────────┼─────────────┼───────────────────── │  │
│   │  1    │ /api/v1/checkout  │ 1,234       │ 45%                   │  │
│   │  2    │ /api/v1/payment   │ 567         │ 21%                   │  │
│   │  3    │ /api/v1/search    │ 234         │ 9%                    │  │
│   │  4    │ /api/v1/users     │ 123         │ 5%                    │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Rule Definition:                                                      │
│   {                                                                     │
│     "Schema": {                                                         │
│       "Name": "CloudWatchLogRule",                                     │
│       "Version": 1                                                      │
│     },                                                                  │
│     "LogGroupNames": ["/aws/lambda/my-api"],                           │
│     "LogFormat": "JSON",                                               │
│     "Contribution": {                                                   │
│       "Keys": ["$.path"],                                              │
│       "Filters": [{"Match": "$.status_code", "GreaterThan": 499}]     │
│     },                                                                  │
│     "AggregateOn": "Count"                                             │
│   }                                                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
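To make the rule's behavior concrete, here is a local sketch in plain Python with no AWS calls. It mimics what a rule of this shape computes: it keeps JSON log events whose `status_code` is greater than 499, counts them per `path`, and ranks the top contributors. The sample events and the `top_contributors` helper are made up for illustration. The real service evaluates the rule continuously as logs are ingested.

```python
import json
from collections import Counter


def top_contributors(log_lines, key="path", status_field="status_code",
                     greater_than=499, limit=10):
    """Count matching events per key; return (key, count, share %) tuples."""
    counts = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        status = event.get(status_field)
        if isinstance(status, (int, float)) and status > greater_than and key in event:
            counts[event[key]] += 1
    total = sum(counts.values())
    return [
        (name, n, round(100 * n / total))
        for name, n in counts.most_common(limit)
    ]


# Hypothetical log events for illustration
sample_logs = [
    '{"path": "/api/v1/checkout", "status_code": 500}',
    '{"path": "/api/v1/checkout", "status_code": 502}',
    '{"path": "/api/v1/payment", "status_code": 503}',
    '{"path": "/api/v1/search", "status_code": 200}',
    'not json',
]
for rank, (endpoint, errors, share) in enumerate(top_contributors(sample_logs), 1):
    print(rank, endpoint, errors, f"{share}%")
```

With the sample events, `/api/v1/checkout` ranks first with two of the three server errors.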

Best Practices

Set Retention Policies

Configure log retention to balance cost and compliance needs

Use Structured Logging

JSON format enables powerful Log Insights queries

Alert on Symptoms

Focus on user-facing metrics, not just infrastructure

Tune Alarm Thresholds

Avoid alert fatigue with well-calibrated thresholds

Dashboard Hierarchy

Executive → Service → Debug dashboards

Cost Awareness

Monitor CloudWatch costs. Custom metrics and log ingestion add up quickly.

Cost Optimization

cost_tips = {
    "logs": [
        "Set appropriate retention (default: never expire)",
        "Export to S3 for long-term storage (cheaper)",
        "Use metric filters instead of querying raw logs",
        "Filter at source (log level, sampling)",
    ],
    "metrics": [
        "Minimize custom metric dimensions",
        "Use EMF for metrics-from-logs (no PutMetricData API charges; log ingestion and custom metric charges still apply)",
        "Consider high-res metrics only where needed",
        "Delete unused custom metrics",
    ],
    "alarms": [
        "Use composite alarms to reduce noise",
        "Consolidate similar alarms",
        "Avoid very short periods on high-cardinality metrics",
    ],
    "dashboards": [
        "Share dashboards instead of duplicating",
        "Use automatic refresh wisely",
    ]
}

# Cost estimation
monthly_costs = {
    "custom_metrics": "10,000 metrics × $0.30 = $3,000",
    "log_ingestion": "100 GB × $0.50 = $50",
    "log_storage": "1 TB × $0.03 = $30",
    "log_insights": "500 GB scanned × $0.005 = $2.50",
    "alarms": "100 alarms × $0.10 = $10",
    "dashboards": "First 3 free, then $3/month each",
}
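The estimate above stores each line as a string, so nothing is actually computed. A small calculator makes the same arithmetic reusable. This sketch assumes the flat unit prices quoted above and ignores volume tiers and free-tier allowances other than the three free dashboards. The usage figures and key names are illustrative.

```python
# Unit prices as quoted in the estimate above (USD); verify against current AWS pricing
UNIT_PRICES = {
    "custom_metrics": 0.30,     # per metric per month
    "log_ingestion_gb": 0.50,   # per GB ingested
    "log_storage_gb": 0.03,     # per GB-month stored
    "log_insights_gb": 0.005,   # per GB scanned
    "alarms": 0.10,             # per alarm per month
    "dashboards": 3.00,         # per dashboard per month beyond the free ones
}
FREE_DASHBOARDS = 3


def estimate_monthly_cost(usage):
    """Return a per-item cost breakdown plus a 'total' entry, in USD."""
    breakdown = {}
    for item, quantity in usage.items():
        if item == "dashboards":
            quantity = max(0, quantity - FREE_DASHBOARDS)
        breakdown[item] = round(quantity * UNIT_PRICES[item], 2)
    breakdown["total"] = round(sum(breakdown.values()), 2)
    return breakdown


usage = {
    "custom_metrics": 10_000,
    "log_ingestion_gb": 100,
    "log_storage_gb": 1_000,
    "log_insights_gb": 500,
    "alarms": 100,
    "dashboards": 5,
}
for item, cost in estimate_monthly_cost(usage).items():
    print(f"{item:>18}: ${cost:,.2f}")
```

In this example custom metrics account for nearly all of the bill, which is why trimming metric dimensions tends to matter more than tuning alarms or dashboards.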

🎯 Interview Questions

Q1: CloudWatch Logs vs X-Ray vs CloudTrail?

CloudWatch Logs:

Application and system logs
What your code outputs
Debugging, troubleshooting

X-Ray:

Distributed tracing
Request flow across services
Performance analysis

CloudTrail:

AWS API call history
Who did what, when
Security auditing

Q2: How to reduce CloudWatch Logs costs?

Set retention policies (don’t keep forever)
Export to S3 for long-term (use lifecycle rules)
Filter at source (log levels, sampling)
Use metric filters instead of Log Insights for common queries
Trim log payloads before ingestion (ingestion is billed on uncompressed size, so compression alone doesn’t reduce the charge)
Use Contributor Insights rules for ongoing analysis
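As one way to filter at source, the sketch below samples low-severity records before they are written. It uses only Python's standard `logging` module. The `SamplingFilter` class, the logger name, and the 10% rate are assumptions for illustration. Warnings and errors are never dropped.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every record at or above `keep_level`; sample the rest."""

    def __init__(self, sample_rate=0.1, keep_level=logging.WARNING, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self.keep_level = keep_level
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= self.keep_level:
            return True  # never drop warnings or errors
        return self.rng.random() < self.sample_rate


logger = logging.getLogger("orders")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1))
logger.addHandler(handler)

for i in range(20):
    logger.debug("cache lookup %d", i)  # roughly 10% of these are emitted
logger.error("payment failed")          # always emitted
```

Sampling trades completeness for cost, so it suits high-volume debug output better than audit or error logs.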

Q3: What's the difference between metric alarms and composite alarms?

Metric Alarms:

Monitor single metric
Simple threshold or anomaly detection
One condition

Composite Alarms:

Combine multiple alarms with AND/OR/NOT
Reduce alert noise
Complex conditions like: “High CPU AND High Memory”
Better for on-call (fewer, more actionable alerts)
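A composite alarm's condition is a rule expression built from the states of other alarms. The sketch below assembles such an expression as a string. The helper functions are illustrative, and `HighMemory` and `DeploymentInProgress` are hypothetical alarm names. The resulting string is what would be passed as `AlarmRule` to boto3's `put_composite_alarm`.

```python
def alarm(name):
    """Rule fragment that is true while the named alarm is in ALARM state."""
    return f'ALARM("{name}")'


def all_of(*rules):
    """Combine rule fragments with AND."""
    return "(" + " AND ".join(rules) + ")"


def any_of(*rules):
    """Combine rule fragments with OR."""
    return "(" + " OR ".join(rules) + ")"


def negate(rule):
    """Invert a rule fragment."""
    return f"NOT {rule}"


# "High CPU AND High Memory", suppressed while a deployment alarm is active
rule = all_of(
    alarm("HighCPU"),
    alarm("HighMemory"),
    negate(alarm("DeploymentInProgress")),
)
print(rule)
# (ALARM("HighCPU") AND ALARM("HighMemory") AND NOT ALARM("DeploymentInProgress"))
```

Only the composite alarm needs a notification action. The underlying metric alarms can stay silent, which is where the noise reduction comes from.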

Q4: EC2 memory not showing in CloudWatch?

Why: Neither basic nor detailed EC2 monitoring includes memory. Memory usage lives inside the guest OS and isn’t visible to the hypervisor.

Solution: Install the CloudWatch Agent

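# Install the agent (Amazon Linux; the instance role needs CloudWatchAgentServerPolicy)
sudo yum install -y amazon-cloudwatch-agent

# Generate a config interactively
# (the wizard writes /opt/aws/amazon-cloudwatch-agent/bin/config.json)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Load that config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json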
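The wizard above produces a JSON configuration file. For reference, a minimal configuration that collects only memory utilization might look like the one below. It is built as a Python dictionary so it can be generated or templated. The 60-second interval is an assumed value, and by default the agent publishes to the `CWAgent` namespace.

```python
import json

# Minimal agent configuration that collects memory utilization only
agent_config = {
    "metrics": {
        # Tag each datapoint with the instance it came from
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,  # seconds (assumed value)
            }
        },
    }
}

print(json.dumps(agent_config, indent=2))
```

Once the agent is running with a configuration like this, `mem_used_percent` can be graphed and alarmed on like any other metric.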

Q5: How to monitor Lambda cold starts?

Built-in: Lambda doesn’t publish a dedicated cold start metric.

Solutions:

Init Duration in REPORT logs

fields @timestamp, @initDuration
| filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration), max(@initDuration), count(*)

Custom metric from code (measure init time)
X-Ray shows initialization segment
Lambda Insights (enhanced monitoring)
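For the custom-metric option, one common pattern is a module-level flag: module scope runs once per execution environment, so the first invocation that sees the flag set is a cold start. The sketch below counts cold starts rather than timing initialization, and emits the count as an Embedded Metric Format log line. The `MyApp/Lambda` namespace, the `ColdStart` metric name, and the `emf_record` helper are assumptions for illustration.

```python
import json
import time

_cold_start = True  # module scope runs once per execution environment


def emf_record(function_name, cold_start, now_ms=None):
    """Build an Embedded Metric Format record for a ColdStart counter."""
    return {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/Lambda",  # hypothetical namespace
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "ColdStart", "Unit": "Count"}],
            }],
        },
        "FunctionName": function_name,
        "ColdStart": 1 if cold_start else 0,
    }


def handler(event, context):
    global _cold_start
    function_name = getattr(context, "function_name", "unknown")
    # In Lambda, stdout goes to CloudWatch Logs, where EMF lines become metrics
    print(json.dumps(emf_record(function_name, _cold_start)))
    _cold_start = False  # later invocations in this environment are warm
    return {"statusCode": 200}
```

Summing `ColdStart` over a period gives the number of cold starts. Dividing that by the function's invocation count gives a cold start rate.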

🧪 Hands-On Lab

Set Up Structured Logging

Implement JSON logging in a Lambda function

Create Metric Filter

Extract error count metric from logs

Build Dashboard

Create operational dashboard with key metrics

Configure Alarms

Set up metric and composite alarms with SNS notification

Create Canary

Set up synthetic monitoring for your API endpoint

Next Module

AWS X-Ray

Master distributed tracing with AWS X-Ray
