Module Overview
Estimated Time: 4-5 hours | Difficulty: Intermediate | Prerequisites: Core Concepts
This module covers:
- CloudWatch Metrics (built-in and custom)
- CloudWatch Logs and Log Insights
- CloudWatch Alarms and composite alarms
- Dashboards and visualization
- CloudWatch Synthetics (canaries)
- CloudWatch Contributor Insights
- CloudWatch Anomaly Detection
- Cross-account and cross-region monitoring
CloudWatch Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Data Sources CloudWatch │
│ ──────────── ────────── │
│ ┌──────────────┐ ┌───────────────────────────────┐ │
│ │ EC2, RDS, │───────────────────│ METRICS │ │
│ │ Lambda, etc. │ Auto-collected │ • Standard (5-min) │ │
│ └──────────────┘ │ • Detailed (1-min) │ │
│ │ • High-res (1-sec) │ │
│ ┌──────────────┐ └───────────────────────────────┘ │
│ │ Applications │ PutMetricData ┌───────────────────────────────┐ │
│ │ (Custom) │───────────────────│ CUSTOM METRICS │ │
│ └──────────────┘ │ • Business KPIs │ │
│ │ • App-specific metrics │ │
│ ┌──────────────┐ └───────────────────────────────┘ │
│ │ Lambda, ECS, │ Auto-collected ┌───────────────────────────────┐ │
│ │ EC2, VPC │───────────────────│ LOGS │ │
│ └──────────────┘ │ • Log Groups │ │
│ │ • Log Streams │ │
│ ┌──────────────┐ │ • Log Insights │ │
│ │ Applications │ Agent/SDK └───────────────────────────────┘ │
│ └──────────────┘ │
│ ┌───────────────────────────────┐ │
│ Processing & Actions │ ALARMS │ │
│ ───────────────────── │ • Metric Alarms │ │
│ • Alarms → SNS, Auto Scaling │ • Composite Alarms │ │
│ • EventBridge → Lambda, etc. │ • Anomaly Detection │ │
│ • Dashboards → Visualization └───────────────────────────────┘ │
│ • Contributor Insights │
│ ┌───────────────────────────────┐ │
│ │ DASHBOARDS │ │
│ │ • Widgets │ │
│ │ • Cross-account view │ │
│ │ • Sharing │ │
│ └───────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
CloudWatch Metrics
Built-in Metrics
┌────────────────────────────────────────────────────────────────────────┐
│ Built-in Metrics by Service │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ EC2 (Basic - 5 min | Detailed - 1 min): │
│ ───────────────────────────────────── │
│ • CPUUtilization • DiskReadOps • DiskWriteOps │
│ • NetworkIn • NetworkOut • StatusCheckFailed │
│ ⚠️ Memory NOT included (use CloudWatch Agent) │
│ │
│ Lambda: │
│ ─────── │
│ • Invocations • Duration • Errors │
│ • Throttles • ConcurrentExec • UnreservedConcurrent │
│ • IteratorAge (streams) │
│ │
│ RDS: │
│ ──── │
│ • CPUUtilization • DatabaseConnections • FreeableMemory │
│ • ReadIOPS • WriteIOPS • ReadLatency │
│ • WriteLatency • FreeStorageSpace │
│ │
│ DynamoDB: │
│ ───────── │
│ • ConsumedRCU • ConsumedWCU • ProvisionedRCU │
│ • ProvisionedWCU • ThrottledRequests • SystemErrors │
│ • ReturnedItemCount • SuccessfulRequestLatency │
│ │
│ Application Load Balancer: │
│ ────────────────────────── │
│ • RequestCount • TargetResponseTime • HTTPCode_Target_2XX │
│ • HTTPCode_Target_4XX • HTTPCode_Target_5XX • HealthyHostCount │
│ • UnHealthyHostCount • ActiveConnectionCount │
│ │
└────────────────────────────────────────────────────────────────────────┘
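To discover which metrics a service actually publishes (and the dimensions attached to them), you can list them with boto3; a minimal sketch, assuming default credentials:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Each returned entry is one unique metric: namespace + name + dimensions.
paginator = cloudwatch.get_paginator('list_metrics')
for page in paginator.paginate(Namespace='AWS/EC2', MetricName='CPUUtilization'):
    for metric in page['Metrics']:
        print(metric['MetricName'], metric['Dimensions'])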
Metric Dimensions
┌────────────────────────────────────────────────────────────────────────┐
│ Metric Dimensions │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Metrics are uniquely identified by: │
│ Namespace + Metric Name + Dimensions │
│ │
│ Example: │
│ ───────── │
│ Namespace: AWS/EC2 │
│ Metric: CPUUtilization │
│ Dimensions: InstanceId=i-1234567890abcdef0 │
│ │
│ Each unique combination creates a separate time series: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ AWS/EC2 | CPUUtilization | InstanceId=i-abc123 → Time Series 1│ │
│ │ AWS/EC2 | CPUUtilization | InstanceId=i-def456 → Time Series 2│ │
│ │ AWS/EC2 | CPUUtilization | AutoScalingGroup=web → Time Series 3│ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Aggregation Example: │
│ ──────────────────── │
│ Query by ASG dimension to get aggregate across all instances: │
│ aws cloudwatch get-metric-statistics \ │
│ --namespace AWS/EC2 \ │
│ --metric-name CPUUtilization \ │
│ --dimensions Name=AutoScalingGroupName,Value=my-asg │
│ │
└────────────────────────────────────────────────────────────────────────┘
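The CLI call in the diagram is abbreviated; it omits the required time range, period, and statistics arguments. A complete boto3 equivalent might look like this (the ASG name my-asg is a placeholder):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

# Query the AutoScalingGroupName dimension to aggregate CPU
# across every instance in the group.
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'my-asg'}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=['Average', 'Maximum']
)
for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], round(point['Average'], 1))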
Custom Metrics
import boto3
from datetime import datetime
cloudwatch = boto3.client('cloudwatch')
def publish_business_metric(metric_name: str, value: float,
dimensions: list = None):
"""
Publish custom business metric to CloudWatch.
Cost: $0.30 per metric per month (first 10,000)
"""
metric_data = {
'MetricName': metric_name,
'Value': value,
'Unit': 'Count',
'Timestamp': datetime.utcnow(),
'StorageResolution': 60 # Standard resolution (60 = 1 min)
}
if dimensions:
metric_data['Dimensions'] = dimensions
cloudwatch.put_metric_data(
Namespace='MyApplication',
MetricData=[metric_data]
)
# Example: Track orders placed
publish_business_metric(
metric_name='OrdersPlaced',
value=1,
dimensions=[
{'Name': 'Environment', 'Value': 'production'},
{'Name': 'Region', 'Value': 'us-east-1'},
{'Name': 'ProductCategory', 'Value': 'electronics'}
]
)
# High-resolution metric (1-second granularity)
def publish_high_res_metric(metric_name: str, value: float):
"""High-resolution metric for real-time monitoring."""
cloudwatch.put_metric_data(
Namespace='MyApplication/HighFrequency',
MetricData=[{
'MetricName': metric_name,
'Value': value,
'Unit': 'Milliseconds',
'Timestamp': datetime.utcnow(),
'StorageResolution': 1 # 1-second resolution
}]
)
# Batch publish (up to 1,000 metrics per request, 1 MB max payload)
def publish_batch_metrics(metrics: list):
    """Efficiently publish multiple metrics, chunking to the API limit."""
    for i in range(0, len(metrics), 1000):  # PutMetricData caps at 1,000 metrics per call
        cloudwatch.put_metric_data(
            Namespace='MyApplication',
            MetricData=metrics[i:i + 1000]
        )
CloudWatch Agent
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Agent │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ The CloudWatch Agent collects metrics and logs from: │
│ • EC2 instances (Linux & Windows) │
│ • On-premises servers │
│ • Containers │
│ │
│ Metrics NOT available without agent: │
│ ──────────────────────────────────── │
│ • Memory utilization │
│ • Disk space utilization │
│ • Disk I/O │
│ • Network connections │
│ • Process information │
│ │
│ Configuration (amazon-cloudwatch-agent.json): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ { │ │
│ │ "metrics": { │ │
│ │ "namespace": "MyApp/EC2", │ │
│ │ "metrics_collected": { │ │
│ │ "mem": { │ │
│ │ "measurement": ["mem_used_percent"] │ │
│ │ }, │ │
│ │ "disk": { │ │
│ │ "measurement": ["disk_used_percent"], │ │
│ │ "resources": ["/", "/data"] │ │
│ │ } │ │
│ │ } │ │
│ │ }, │ │
│ │ "logs": { │ │
│ │ "logs_collected": { │ │
│ │ "files": { │ │
│ │ "collect_list": [{ │ │
│ │ "file_path": "/var/log/myapp/*.log", │ │
│ │ "log_group_name": "myapp-logs", │ │
│ │ "log_stream_name": "{instance_id}" │ │
│ │ }] │ │
│ │ } │ │
│ │ } │ │
│ │ } │ │
│ │ } │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
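A common deployment pattern is to store this JSON in SSM Parameter Store so every instance can fetch it at startup; a sketch, with a hypothetical parameter name:

import json
import boto3

ssm = boto3.client('ssm')

# Store the agent configuration centrally; instances then load it with:
#   sudo amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 \
#       -c ssm:AmazonCloudWatch-myapp -s
agent_config = {
    "metrics": {
        "namespace": "MyApp/EC2",
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["disk_used_percent"], "resources": ["/", "/data"]}
        }
    }
}
ssm.put_parameter(
    Name='AmazonCloudWatch-myapp',  # hypothetical parameter name
    Type='String',
    Value=json.dumps(agent_config),
    Overwrite=True
)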
CloudWatch Logs
Log Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Logs Structure │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ LOG GROUP: /aws/lambda/order-service │
│ ──────────────────────────────────── │
│ • Retention: 7 days to never expire │
│ • Encryption: Optional KMS encryption │
│ • Metric Filters: Extract metrics from logs │
│ • Subscription Filters: Stream to Kinesis/Lambda │
│ │
│ └── LOG STREAM: 2024/01/15/[$LATEST]abc123 │
│ └── LOG EVENT: {"timestamp": 1705312800000, │
│ "message": "Order processed: ORD-123"} │
│ └── LOG EVENT: {"timestamp": 1705312801000, │
│ "message": "Payment confirmed: PAY-456"} │
│ │
│ └── LOG STREAM: 2024/01/15/[$LATEST]def456 │
│ └── ... │
│ │
│ Pricing (us-east-1): │
│ ──────────────────── │
│ • Ingestion: $0.50 per GB │
│ • Storage: $0.03 per GB/month │
│ • Log Insights queries: $0.005 per GB scanned │
│ • Export to S3: Free (but S3 storage costs apply) │
│ │
│ Retention Settings: │
│ ─────────────────── │
│ 1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, 2 months, │
│ 3 months, 6 months, 1 year, 13 months, 18 months, 2 years, │
│ 3 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, │
│ Never Expire │
│ │
└────────────────────────────────────────────────────────────────────────┘
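Retention is configured per log group; a minimal boto3 sketch using the log group from the diagram:

import boto3

logs = boto3.client('logs')

# New log groups default to "Never Expire"; set an explicit retention
# period so storage costs don't grow unbounded.
logs.put_retention_policy(
    logGroupName='/aws/lambda/order-service',
    retentionInDays=30  # must be one of the supported values listed above
)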
Structured Logging Best Practices
import json
from datetime import datetime
class StructuredLogger:
"""JSON structured logger for CloudWatch Logs."""
def __init__(self, service: str, environment: str):
self.service = service
self.environment = environment
def _log(self, level: str, message: str, **extra):
log_entry = {
"timestamp": datetime.utcnow().isoformat() + "Z",
"level": level,
"service": self.service,
"environment": self.environment,
"message": message,
**extra
}
print(json.dumps(log_entry)) # Lambda automatically captures stdout
def info(self, message: str, **extra):
self._log("INFO", message, **extra)
def error(self, message: str, error: Exception = None, **extra):
if error:
extra["error_type"] = type(error).__name__
extra["error_message"] = str(error)
self._log("ERROR", message, **extra)
def metric(self, name: str, value: float, unit: str = "Count", **extra):
"""Log a metric that can be extracted with metric filters."""
self._log("METRIC", f"{name}={value}",
metric_name=name, metric_value=value, metric_unit=unit, **extra)
# Usage
logger = StructuredLogger(service="order-service", environment="prod")
def process_order(order_id: str, user_id: str):
logger.info("Processing order", order_id=order_id, user_id=user_id)
try:
# Business logic...
logger.info("Order completed",
order_id=order_id,
duration_ms=150,
items_count=3)
logger.metric("OrdersProcessed", 1, "Count", order_id=order_id)
except Exception as e:
logger.error("Order processing failed",
error=e,
order_id=order_id,
user_id=user_id)
raise
Log Insights Queries
# Find all errors in the last hour
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Parse JSON logs and aggregate by service
fields @timestamp, @message
| parse @message '{"service":"*","level":"*","message":"*"}' as service, level, msg
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc
# Calculate Lambda p99 latency
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| stats avg(@duration) as avg_ms,
pct(@duration, 50) as p50_ms,
pct(@duration, 95) as p95_ms,
pct(@duration, 99) as p99_ms
by bin(1h)
# Find slow API requests
fields @timestamp, @message
| parse @message '"duration_ms":*,' as duration
| filter duration > 1000
| sort duration desc
| limit 50
# Error rate calculation
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
sum(level = "ERROR") as errors
by bin(5m)
| sort @timestamp desc
# Top contributors to log volume
fields @logStream, @message
| stats count(*) as log_count by @logStream
| sort log_count desc
| limit 20
# Correlation by request ID
fields @timestamp, @message, @logStream
| parse @message '"request_id":"*"' as request_id
| filter request_id = "abc-123-def-456"
| sort @timestamp asc
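The same queries can be run programmatically; a sketch using the asynchronous Logs Insights API (log group name and time window are examples):

import time
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs')

# Start the query, then poll until it finishes.
query_id = logs.start_query(
    logGroupName='/aws/lambda/order-service',
    startTime=int((datetime.now(timezone.utc) - timedelta(hours=1)).timestamp()),
    endTime=int(datetime.now(timezone.utc).timestamp()),
    queryString='fields @timestamp, @message | filter @message like /ERROR/ | limit 20'
)['queryId']

while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result['results']:
    print({field['field']: field['value'] for field in row})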
Metric Filters
┌────────────────────────────────────────────────────────────────────────┐
│ Metric Filters │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Extract metrics from logs automatically: │
│ │
│ Filter Pattern │ Metric Created │
│ ─────────────────────────────┼───────────────────────────────────── │
│ ERROR │ Count of ERROR occurrences │
│ [timestamp, level=ERROR, ...] │ Count of structured errors │
│ "status": 500 │ Count of 500 errors │
│ { $.latency > 1000 } │ Count of slow requests (JSON) │
│ │
│ Example: Track application errors │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Filter Pattern: { $.level = "ERROR" } │ │
│ │ Metric Name: ApplicationErrors │ │
│ │ Metric Namespace: MyApp │ │
│ │ Metric Value: 1 │ │
│ │ Default Value: 0 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Example: Extract latency from JSON logs │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Filter Pattern: { $.type = "REQUEST" } │ │
│ │ Metric Name: RequestLatency │ │
│ │ Metric Namespace: MyApp │ │
│ │ Metric Value: $.latency │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
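Creating the first example filter with boto3 might look like this (log group and names are placeholders):

import boto3

logs = boto3.client('logs')

# Every log event matching the JSON pattern increments the metric by 1;
# defaultValue keeps the metric at 0 when nothing matches.
logs.put_metric_filter(
    logGroupName='/aws/lambda/order-service',
    filterName='application-errors',
    filterPattern='{ $.level = "ERROR" }',
    metricTransformations=[{
        'metricName': 'ApplicationErrors',
        'metricNamespace': 'MyApp',
        'metricValue': '1',
        'defaultValue': 0.0
    }]
)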
CloudWatch Alarms
Alarm Types and States
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Alarms │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Alarm States: │
│ ┌───────────┐ ┌───────────┐ ┌─────────────────┐ │
│ │ OK │────►│ ALARM │────►│ INSUFFICIENT │ │
│ │ (green) │◄────│ (red) │◄────│ DATA │ │
│ └───────────┘ └───────────┘ └─────────────────┘ │
│ │
│ Alarm Components: │
│ ───────────────── │
│ • Metric: What to monitor (CPUUtilization, custom metric) │
│ • Statistic: How to aggregate (Average, Sum, Max, p99) │
│ • Period: Evaluation period (60s, 300s, etc.) │
│ • Evaluation Periods: How many periods before alarm │
│ • Threshold: Value that triggers alarm │
│ • Comparison: GreaterThan, LessThan, etc. │
│ │
│ Example Configuration: │
│ ───────────────────── │
│ Metric: CPUUtilization │
│ Statistic: Average │
│ Period: 300 seconds (5 min) │
│ Evaluation Periods: 3 │
│ Threshold: 80% │
│ → Alarm triggers when CPU > 80% for 3 consecutive 5-min periods │
│ │
│ Alarm Actions: │
│ ─────────────── │
│ • SNS Topic → Email, SMS, Lambda │
│ • Auto Scaling → Scale up/down │
│ • EC2 Actions → Stop, terminate, reboot, recover │
│ • Systems Manager → Run automation documents │
│ │
└────────────────────────────────────────────────────────────────────────┘
Essential Alarms (Terraform)
# High CPU Alarm with Auto Scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "${var.app_name}-high-cpu"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300
statistic = "Average"
threshold = 80
alarm_description = "CPU exceeds 80% for 15 minutes"
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.web.name
}
alarm_actions = [
aws_sns_topic.alerts.arn,
aws_autoscaling_policy.scale_up.arn
]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Lambda Error Rate (Metric Math)
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
alarm_name = "${var.app_name}-lambda-errors"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
threshold = 5 # 5% error rate
metric_query {
id = "error_rate"
expression = "errors / invocations * 100"
label = "Error Rate %"
return_data = true
}
metric_query {
id = "errors"
metric {
metric_name = "Errors"
namespace = "AWS/Lambda"
period = 300
stat = "Sum"
dimensions = {
FunctionName = aws_lambda_function.api.function_name
}
}
}
metric_query {
id = "invocations"
metric {
metric_name = "Invocations"
namespace = "AWS/Lambda"
period = 300
stat = "Sum"
dimensions = {
FunctionName = aws_lambda_function.api.function_name
}
}
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Anomaly Detection Alarm
resource "aws_cloudwatch_metric_alarm" "traffic_anomaly" {
alarm_name = "${var.app_name}-traffic-anomaly"
comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
evaluation_periods = 2
threshold_metric_id = "ad1"
metric_query {
id = "m1"
return_data = true
metric {
metric_name = "RequestCount"
namespace = "AWS/ApplicationELB"
period = 300
stat = "Sum"
dimensions = {
LoadBalancer = aws_lb.main.arn_suffix
}
}
}
metric_query {
id = "ad1"
expression = "ANOMALY_DETECTION_BAND(m1, 2)"
label = "Traffic Anomaly Band"
return_data = true
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
alarm_name = "${var.app_name}-critical"
alarm_rule = join(" OR ", [
"ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name})",
"ALARM(${aws_cloudwatch_metric_alarm.lambda_error_rate.alarm_name})",
])
alarm_actions = [aws_sns_topic.pagerduty.arn]
}
CloudWatch Dashboards
┌────────────────────────────────────────────────────────────────────────┐
│ Dashboard Design │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ MY APPLICATION DASHBOARD 📊 🔗 ⚙️ │ │
│ ├───────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────────────────┐│ │
│ │ │ 🟢 Healthy Hosts: 4 │ │ 📈 Request Rate ││ │
│ │ │ 🔴 Errors: 0.1% │ │ [Line graph of requests/min] ││ │
│ │ │ ⏱️ P99 Latency: 45ms│ │ ││ │
│ │ └─────────────────────┘ └─────────────────────────────────┘│ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ 📊 API Response Times by Endpoint │ │ │
│ │ │ ┌────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ /orders ████████████████ 45ms │ │ │ │
│ │ │ │ /users ████████ 25ms │ │ │ │
│ │ │ │ /products ██████████████████ 55ms │ │ │ │
│ │ │ └────────────────────────────────────────────────────┘ │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────────────────┐│ │
│ │ │ Lambda Invocations │ │ DynamoDB Consumed Capacity ││ │
│ │ │ [Stacked area chart]│ │ [RCU/WCU over time] ││ │
│ │ └─────────────────────┘ └─────────────────────────────────┘│ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ 📝 Recent Errors (Log Widget) │ │ │
│ │ │ 10:23:45 ERROR Payment failed: insufficient funds │ │ │
│ │ │ 10:22:30 ERROR Timeout connecting to external API │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ Widget Types: │
│ • Line, Stacked area, Number, Gauge, Bar, Pie │
│ • Text (Markdown), Log (Log Insights), Alarm status │
│ │
└────────────────────────────────────────────────────────────────────────┘
Dashboard as Code
import json
import boto3
# CloudWatch Dashboard JSON definition
dashboard_body = {
"widgets": [
# Key Metrics Row
{
"type": "metric",
"x": 0, "y": 0, "width": 6, "height": 6,
"properties": {
"title": "Request Count",
"metrics": [
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234"]
],
"period": 60,
"stat": "Sum",
"region": "us-east-1"
}
},
{
"type": "metric",
"x": 6, "y": 0, "width": 6, "height": 6,
"properties": {
"title": "Response Time (p99)",
"metrics": [
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234"]
],
"period": 60,
"stat": "p99",
"region": "us-east-1"
}
},
# Lambda Metrics
{
"type": "metric",
"x": 0, "y": 6, "width": 12, "height": 6,
"properties": {
"title": "Lambda Performance",
"metrics": [
["AWS/Lambda", "Duration", "FunctionName", "my-function", {"stat": "Average"}],
[".", ".", ".", ".", {"stat": "p95"}],
[".", ".", ".", ".", {"stat": "p99"}]
],
"period": 60,
"region": "us-east-1"
}
},
# Alarm Status Widget
{
"type": "alarm",
"x": 12, "y": 0, "width": 6, "height": 6,
"properties": {
"title": "Alarm Status",
"alarms": [
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU",
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:ErrorRate"
]
}
},
# Log Widget
{
"type": "log",
"x": 0, "y": 12, "width": 24, "height": 6,
"properties": {
"title": "Recent Errors",
"query": "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
"region": "us-east-1"
}
}
]
}
# Create/update dashboard
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_dashboard(
DashboardName='MyApplicationDashboard',
DashboardBody=json.dumps(dashboard_body)
)
CloudWatch Synthetics
Canary tests that monitor your endpoints 24/7.
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Synthetics │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Canary Types: │
│ ───────────── │
│ • Heartbeat: Simple availability check │
│ • API: REST API endpoint testing │
│ • Broken Link: Check for broken links │
│ • Visual: Screenshot comparison │
│ • GUI Workflow: Multi-step browser tests │
│ │
│ Features: │
│ ───────── │
│ • Runs on a schedule (1 min to 1 hour) │
│ • Captures screenshots and HAR files │
│ • Integrates with CloudWatch Alarms │
│ • Uses Puppeteer or Selenium │
│ │
│ Example: API Canary Script (Node.js) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ const synthetics = require('Synthetics'); │ │
│ │ const log = require('SyntheticsLogger'); │ │
│ │ │ │
│ │ const apiCanary = async function() { │ │
│ │ const page = await synthetics.getPage(); │ │
│ │ │ │
│ │ // Test API endpoint │ │
│ │ const response = await page.goto('https://api.example.com'); │ │
│ │ │ │
│ │ if (response.status() !== 200) { │ │
│ │ throw new Error(`Expected 200, got ${response.status()}`); │ │
│ │ } │ │
│ │ │ │
│ │ log.info('API check passed'); │ │
│ │ }; │ │
│ │ │ │
│ │ exports.handler = async () => { │ │
│ │ return await apiCanary(); │ │
│ │ }; │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Pricing: $0.0012 per canary run │
│ 100 runs/day × 30 days = $3.60/month per canary │
│ │
└────────────────────────────────────────────────────────────────────────┘
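Canaries can also be provisioned via the API; a sketch with placeholder bucket, role, and runtime values:

import boto3

synthetics = boto3.client('synthetics')

# The canary script is packaged as a zip in S3; results and screenshots
# land in the artifact bucket.
synthetics.create_canary(
    Name='api-health-check',
    Code={
        'S3Bucket': 'my-canary-code-bucket',         # placeholder
        'S3Key': 'canaries/api-canary.zip',          # placeholder
        'Handler': 'apiCanary.handler'
    },
    ArtifactS3Location='s3://my-canary-artifacts/',  # placeholder
    ExecutionRoleArn='arn:aws:iam::123456789012:role/CanaryRole',
    Schedule={'Expression': 'rate(5 minutes)'},
    RuntimeVersion='syn-nodejs-puppeteer-9.1'        # check current supported runtimes
)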
CloudWatch Contributor Insights
Identify top contributors to high-cardinality data.
┌────────────────────────────────────────────────────────────────────────┐
│ Contributor Insights │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Use Cases: │
│ ─────────── │
│ • Find top IPs making requests │
│ • Identify users causing errors │
│ • Detect DDoS patterns │
│ • Find hottest DynamoDB partition keys │
│ │
│ Example: Top Error-Producing API Endpoints │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Rank │ Endpoint │ Error Count │ % of Total │ │
│ │ ─────┼───────────────────┼─────────────┼───────────────────── │ │
│ │ 1 │ /api/v1/checkout │ 1,234 │ 45% │ │
│ │ 2 │ /api/v1/payment │ 567 │ 21% │ │
│ │ 3 │ /api/v1/search │ 234 │ 9% │ │
│ │ 4 │ /api/v1/users │ 123 │ 5% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Rule Definition: │
│ { │
│ "Schema": { │
│ "Name": "CloudWatchLogRule", │
│ "Version": 1 │
│ }, │
│ "LogGroupNames": ["/aws/lambda/my-api"], │
│ "LogFormat": "JSON", │
│ "Fields": { │
│ "endpoint": "$.path", │
│ "status": "$.status_code" │
│ }, │
│ "Contribution": { │
│ "Keys": ["endpoint"], │
│ "Filters": [{"Match": "$.status_code", "GreaterThan": 499}] │
│ } │
│ } │
│ │
└────────────────────────────────────────────────────────────────────────┘
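The rule shown above can be created programmatically; a minimal sketch reusing that definition (the rule name is hypothetical):

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

rule_definition = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/aws/lambda/my-api"],
    "LogFormat": "JSON",
    "Fields": {"endpoint": "$.path", "status": "$.status_code"},
    "Contribution": {
        "Keys": ["endpoint"],
        "Filters": [{"Match": "$.status_code", "GreaterThan": 499}]
    }
}

cloudwatch.put_insight_rule(
    RuleName='top-error-endpoints',  # hypothetical name
    RuleState='ENABLED',
    RuleDefinition=json.dumps(rule_definition)
)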
Best Practices
Set Retention Policies: Configure log retention to balance cost and compliance needs.
Use Structured Logging: JSON format enables powerful Log Insights queries.
Alert on Symptoms: Focus on user-facing metrics, not just infrastructure.
Tune Alarm Thresholds: Avoid alert fatigue with well-calibrated thresholds.
Dashboard Hierarchy: Executive → Service → Debug dashboards.
Cost Awareness: Monitor CloudWatch costs; they can surprise you.
Cost Optimization
cost_tips = {
"logs": [
"Set appropriate retention (default: never expire)",
"Export to S3 for long-term storage (cheaper)",
"Use metric filters instead of querying raw logs",
"Filter at source (log level, sampling)",
],
"metrics": [
"Minimize custom metric dimensions",
"Use EMF for metrics-from-logs (free publishing)",
"Consider high-res metrics only where needed",
"Delete unused custom metrics",
],
"alarms": [
"Use composite alarms to reduce noise",
"Consolidate similar alarms",
"Avoid very short periods on high-cardinality metrics",
],
"dashboards": [
"Share dashboards instead of duplicating",
"Use automatic refresh wisely",
]
}
# Cost estimation
monthly_costs = {
"custom_metrics": "10,000 metrics × $0.30 = $3,000",
"log_ingestion": "100 GB × $0.50 = $50",
"log_storage": "1 TB × $0.03 = $30",
"log_insights": "500 GB scanned × $0.005 = $2.50",
"alarms": "100 alarms × $0.10 = $10",
"dashboards": "First 3 free, then $3/month each",
}
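The EMF tip above refers to the CloudWatch embedded metric format: instead of calling PutMetricData, you write a specially-shaped JSON log line and CloudWatch extracts the metric from it. A minimal sketch:

import json
import time

# CloudWatch parses the "_aws" block and turns OrdersProcessed into a
# metric in the MyApplication namespace, dimensioned by Environment.
emf_event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
        "CloudWatchMetrics": [{
            "Namespace": "MyApplication",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrdersProcessed", "Unit": "Count"}]
        }]
    },
    "Environment": "production",
    "OrdersProcessed": 1
}
print(json.dumps(emf_event))  # Lambda forwards stdout to CloudWatch Logs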
🎯 Interview Questions
Q1: CloudWatch Logs vs X-Ray vs CloudTrail?
CloudWatch Logs:
- Application and system logs
- What your code outputs
- Debugging, troubleshooting
X-Ray:
- Distributed tracing
- Request flow across services
- Performance analysis
CloudTrail:
- AWS API call history
- Who did what, when
- Security auditing
Q2: How to reduce CloudWatch Logs costs?
- Set retention policies (don’t keep forever)
- Export to S3 for long-term (use lifecycle rules)
- Filter at source (log levels, sampling)
- Use metric filters instead of Log Insights for common queries
- Compress logs before ingestion
- Use Contributor Insights rules for ongoing analysis
Q3: What's the difference between metric alarms and composite alarms?
Metric Alarms:
- Monitor single metric
- Simple threshold or anomaly detection
- One condition
Composite Alarms:
- Combine multiple alarms with AND/OR/NOT
- Reduce alert noise
- Complex conditions like: “High CPU AND High Memory”
- Better for on-call (fewer, more actionable alerts)
Q4: EC2 memory not showing in CloudWatch?
Why: EC2 basic monitoring doesn't include memory. Memory usage is tracked inside the guest OS and isn't visible to the hypervisor.
Solution: Install the CloudWatch Agent
# Install agent
sudo yum install amazon-cloudwatch-agent
# Configure
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start
sudo systemctl start amazon-cloudwatch-agent
Q5: How to monitor Lambda cold starts?
Built-in: Lambda doesn't have a direct cold-start metric.
Solutions:
- Init Duration in REPORT logs
fields @timestamp, @initDuration
| filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration), max(@initDuration), count(*)
- Custom metric from code (measure init time)
- X-Ray shows initialization segment
- Lambda Insights (enhanced monitoring)
🧪 Hands-On Lab
1. Set Up Structured Logging: Implement JSON logging in a Lambda function
2. Create Metric Filter: Extract an error-count metric from logs
3. Build Dashboard: Create an operational dashboard with key metrics
4. Configure Alarms: Set up metric and composite alarms with SNS notifications
5. Create Canary: Set up synthetic monitoring for your API endpoint
Next Module
AWS X-Ray
Master distributed tracing with AWS X-Ray