Module Overview
Estimated Time: 4-5 hours | Difficulty: Intermediate | Prerequisites: Core Concepts
- CloudWatch Metrics (built-in and custom)
- CloudWatch Logs and Log Insights
- CloudWatch Alarms and composite alarms
- Dashboards and visualization
- CloudWatch Synthetics (canaries)
- CloudWatch Contributor Insights
- CloudWatch Anomaly Detection
- Cross-account and cross-region monitoring
CloudWatch Architecture
Data Sources → CloudWatch

  EC2, RDS, Lambda, etc. → (auto-collected) → METRICS
      • Standard (5-min)
      • Detailed (1-min)
      • High-res (1-sec)

  Applications (custom) → (PutMetricData) → CUSTOM METRICS
      • Business KPIs
      • App-specific metrics

  Lambda, ECS, EC2, VPC → (auto-collected) → LOGS
  Applications → (Agent/SDK) → LOGS
      • Log Groups
      • Log Streams
      • Log Insights

Processing & Actions
  • Alarms → SNS, Auto Scaling
  • EventBridge → Lambda, etc.
  • Dashboards → Visualization
  • Contributor Insights

ALARMS
  • Metric Alarms
  • Composite Alarms
  • Anomaly Detection

DASHBOARDS
  • Widgets
  • Cross-account view
  • Sharing
CloudWatch Metrics
Built-in Metrics
EC2 (Basic: 5 min | Detailed: 1 min):
  • CPUUtilization
  • DiskReadOps
  • DiskWriteOps
  • NetworkIn
  • NetworkOut
  • StatusCheckFailed
  ⚠️ Memory is NOT included (use the CloudWatch Agent)

Lambda:
  • Invocations
  • Duration
  • Errors
  • Throttles
  • ConcurrentExecutions
  • UnreservedConcurrentExecutions
  • IteratorAge (streams)

RDS:
  • CPUUtilization
  • DatabaseConnections
  • FreeableMemory
  • ReadIOPS
  • WriteIOPS
  • ReadLatency
  • WriteLatency
  • FreeStorageSpace

DynamoDB:
  • ConsumedReadCapacityUnits
  • ConsumedWriteCapacityUnits
  • ProvisionedReadCapacityUnits
  • ProvisionedWriteCapacityUnits
  • ThrottledRequests
  • SystemErrors
  • ReturnedItemCount
  • SuccessfulRequestLatency

Application Load Balancer:
  • RequestCount
  • TargetResponseTime
  • HTTPCode_Target_2XX_Count
  • HTTPCode_Target_4XX_Count
  • HTTPCode_Target_5XX_Count
  • HealthyHostCount
  • UnHealthyHostCount
  • ActiveConnectionCount
Metric Dimensions
Metrics are uniquely identified by:
  Namespace + Metric Name + Dimensions

Example:
  Namespace:  AWS/EC2
  Metric:     CPUUtilization
  Dimensions: InstanceId=i-1234567890abcdef0

Each unique combination creates a separate time series:
  AWS/EC2 | CPUUtilization | InstanceId=i-abc123        → Time Series 1
  AWS/EC2 | CPUUtilization | InstanceId=i-def456        → Time Series 2
  AWS/EC2 | CPUUtilization | AutoScalingGroupName=web   → Time Series 3

Aggregation Example:
  Query by the AutoScalingGroupName dimension to get the aggregate across all
  instances in the group. The time range, period, and statistic are required;
  the values here are placeholders:

  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=AutoScalingGroupName,Value=my-asg \
    --start-time 2024-01-15T00:00:00Z \
    --end-time 2024-01-15T01:00:00Z \
    --period 300 \
    --statistics Average
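The same aggregation query can be prepared from Python. The sketch below only builds the request arguments, so it runs without AWS credentials; the helper name `build_statistics_request` and its defaults are illustrative assumptions, while the keys match the parameters of boto3's `get_metric_statistics`.

```python
from datetime import datetime, timedelta, timezone

def build_statistics_request(namespace, metric_name, dimensions,
                             minutes=60, period=300, statistic="Average",
                             now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Hypothetical helper: namespace + metric name + dimensions select the
    time series; the time range, period, and statistic shape the result.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": name, "Value": value}
                       for name, value in sorted(dimensions.items())],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": [statistic],
    }

request = build_statistics_request(
    "AWS/EC2", "CPUUtilization", {"AutoScalingGroupName": "my-asg"}
)
# With credentials configured, the request could be sent as:
#   boto3.client("cloudwatch").get_metric_statistics(**request)
```

Changing only the `dimensions` argument (for example to an `InstanceId`) selects a different time series of the same metric.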
Custom Metrics
import boto3
from datetime import datetime, timezone
from typing import Optional

cloudwatch = boto3.client('cloudwatch')

def publish_business_metric(metric_name: str, value: float,
                            dimensions: Optional[list] = None):
    """
    Publish custom business metric to CloudWatch.
    Cost: $0.30 per metric per month (first 10,000)
    """
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': 'Count',
        'Timestamp': datetime.now(timezone.utc),
        'StorageResolution': 60  # Standard resolution (60 = 1 min)
    }
    if dimensions:
        metric_data['Dimensions'] = dimensions
    cloudwatch.put_metric_data(
        Namespace='MyApplication',
        MetricData=[metric_data]
    )

# Example: Track orders placed
publish_business_metric(
    metric_name='OrdersPlaced',
    value=1,
    dimensions=[
        {'Name': 'Environment', 'Value': 'production'},
        {'Name': 'Region', 'Value': 'us-east-1'},
        {'Name': 'ProductCategory', 'Value': 'electronics'}
    ]
)

# High-resolution metric (1-second granularity)
def publish_high_res_metric(metric_name: str, value: float):
    """High-resolution metric for real-time monitoring."""
    cloudwatch.put_metric_data(
        Namespace='MyApplication/HighFrequency',
        MetricData=[{
            'MetricName': metric_name,
            'Value': value,
            'Unit': 'Milliseconds',
            'Timestamp': datetime.now(timezone.utc),
            'StorageResolution': 1  # 1-second resolution
        }]
    )

# Batch publish (up to 1000 per request, 1MB max)
def publish_batch_metrics(metrics: list):
    """Publish multiple metrics in one request.

    Only the first 1000 entries are sent; split larger lists into
    several calls.
    """
    cloudwatch.put_metric_data(
        Namespace='MyApplication',
        MetricData=metrics[:1000]  # Max 1000 per request
    )
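Because `publish_batch_metrics` slices to the first 1000 entries, a longer list would lose data. One way to avoid that, sketched here under the assumption that the 1000-per-request limit quoted above applies, is to split the list into batches first. `chunk_metrics` is a hypothetical helper name and makes no AWS calls.

```python
from typing import Iterator, List

MAX_METRICS_PER_REQUEST = 1000  # limit quoted in the text above

def chunk_metrics(metrics: List[dict],
                  size: int = MAX_METRICS_PER_REQUEST) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` metric entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(metrics), size):
        yield metrics[start:start + size]

# Illustrative data: 2500 identical datapoints
sample = [{"MetricName": "OrdersPlaced", "Value": 1, "Unit": "Count"}
          for _ in range(2500)]
batches = list(chunk_metrics(sample))
```

Each batch could then be passed to `publish_batch_metrics` in turn. This sketch does not check the 1 MB payload limit, which would need a size estimate per entry.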
CloudWatch Agent
The CloudWatch Agent collects metrics and logs from:
  • EC2 instances (Linux & Windows)
  • On-premises servers
  • Containers

Metrics NOT available without agent:
  • Memory utilization
  • Disk space utilization
  • Disk I/O
  • Network connections
  • Process information

Configuration (amazon-cloudwatch-agent.json):

  {
    "metrics": {
      "namespace": "MyApp/EC2",
      "metrics_collected": {
        "mem": {
          "measurement": ["mem_used_percent"]
        },
        "disk": {
          "measurement": ["disk_used_percent"],
          "resources": ["/", "/data"]
        }
      }
    },
    "logs": {
      "logs_collected": {
        "files": {
          "collect_list": [{
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "myapp-logs",
            "log_stream_name": "{instance_id}"
          }]
        }
      }
    }
  }
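When the same agent configuration is needed for several hosts with different disks or log paths, it can be convenient to generate the JSON rather than edit it by hand. The following is a minimal sketch that reproduces the structure shown above; `build_agent_config` and its parameters are hypothetical names, not part of the agent.

```python
import json

def build_agent_config(namespace, disk_paths, log_path, log_group):
    """Return a CloudWatch Agent config dict mirroring the example above."""
    return {
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["disk_used_percent"],
                    "resources": list(disk_paths),
                },
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [{
                        "file_path": log_path,
                        "log_group_name": log_group,
                        # Placeholder resolved by the agent at runtime
                        "log_stream_name": "{instance_id}",
                    }]
                }
            }
        },
    }

config = build_agent_config(
    "MyApp/EC2", ["/", "/data"], "/var/log/myapp/*.log", "myapp-logs"
)
config_json = json.dumps(config, indent=2)
```

The resulting `config_json` string can be written to `amazon-cloudwatch-agent.json`.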
CloudWatch Logs
Log Architecture
LOG GROUP: /aws/lambda/order-service
  • Retention: 1 day to never expire
  • Encryption: Optional KMS encryption
  • Metric Filters: Extract metrics from logs
  • Subscription Filters: Stream to Kinesis/Lambda

  ├── LOG STREAM: 2024/01/15/[$LATEST]abc123
  │     ├── LOG EVENT: {"timestamp": 1705312800000,
  │     │               "message": "Order processed: ORD-123"}
  │     └── LOG EVENT: {"timestamp": 1705312801000,
  │                     "message": "Payment confirmed: PAY-456"}
  │
  └── LOG STREAM: 2024/01/15/[$LATEST]def456
        └── ...

Pricing (us-east-1):
  • Ingestion: $0.50 per GB
  • Storage: $0.03 per GB/month
  • Log Insights queries: $0.005 per GB scanned
  • Export to S3: Free (but S3 storage costs apply)

Retention Settings:
  1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, 2 months,
  3 months, 4 months, 5 months, 6 months, 1 year, 13 months,
  18 months, 2 years, 3 years, 5 years, 6 years, 7 years, 8 years,
  9 years, 10 years, Never Expire
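Retention can only be set to specific values, so a compliance requirement such as "keep 10 days" has to be rounded up to the next allowed setting. The sketch below assumes the list of day counts accepted by the `put_retention_policy` API as I understand it; `nearest_retention` is a hypothetical helper and the list should be checked against current AWS documentation.

```python
import bisect

# Assumed set of retentionInDays values accepted by CloudWatch Logs
VALID_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]

def nearest_retention(days_required: int) -> int:
    """Return the smallest allowed retention that covers days_required."""
    if days_required < 1:
        raise ValueError("days_required must be at least 1")
    index = bisect.bisect_left(VALID_RETENTION_DAYS, days_required)
    if index == len(VALID_RETENTION_DAYS):
        raise ValueError("longer than 10 years: leave the group at never expire")
    return VALID_RETENTION_DAYS[index]

retention = nearest_retention(10)
# With credentials configured, this could be applied as:
#   boto3.client("logs").put_retention_policy(
#       logGroupName="/aws/lambda/order-service", retentionInDays=retention)
```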
Structured Logging Best Practices
import json
import logging
from datetime import datetime, timezone
from typing import Optional

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""

    def __init__(self, service: str, environment: str):
        self.service = service
        self.environment = environment
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)

    def _log(self, level: str, message: str, **extra):
        log_entry = {
            # UTC timestamp in ISO 8601 form, e.g. 2024-01-15T10:00:00.000000Z
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "level": level,
            "service": self.service,
            "environment": self.environment,
            "message": message,
            **extra
        }
        print(json.dumps(log_entry))  # Lambda automatically captures stdout

    def info(self, message: str, **extra):
        self._log("INFO", message, **extra)

    def error(self, message: str, error: Optional[Exception] = None, **extra):
        if error:
            extra["error_type"] = type(error).__name__
            extra["error_message"] = str(error)
        self._log("ERROR", message, **extra)

    def metric(self, name: str, value: float, unit: str = "Count", **extra):
        """Log a metric that can be extracted with metric filters."""
        self._log("METRIC", f"{name}={value}",
                  metric_name=name, metric_value=value, metric_unit=unit, **extra)

# Usage
logger = StructuredLogger(service="order-service", environment="prod")

def process_order(order_id: str, user_id: str):
    logger.info("Processing order", order_id=order_id, user_id=user_id)
    try:
        # Business logic...
        logger.info("Order completed",
                    order_id=order_id,
                    duration_ms=150,
                    items_count=3)
        logger.metric("OrdersProcessed", 1, "Count", order_id=order_id)
    except Exception as e:
        logger.error("Order processing failed",
                     error=e,
                     order_id=order_id,
                     user_id=user_id)
        raise
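The `metric()` method above depends on a metric filter to turn log lines into metrics. An alternative mentioned later in this module is the Embedded Metric Format (EMF), where the log line itself declares the metric. The following is a minimal sketch of an EMF record, assuming the documented `_aws` envelope; `build_emf_record` is a hypothetical helper, and the namespace and dimension values are illustrative.

```python
import json
import time

def build_emf_record(namespace, metric_name, value, unit="Count",
                     dimensions=None, timestamp_ms=None):
    """Return one EMF-formatted JSON log line for a single metric."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            # Epoch milliseconds
            "Timestamp": (timestamp_ms if timestamp_ms is not None
                          else int(time.time() * 1000)),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set listing the dimension keys used below
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        # Metric value and dimension values live at the top level
        metric_name: value,
    }
    record.update(dimensions)
    return json.dumps(record)

line = build_emf_record(
    "MyApplication", "OrdersProcessed", 1,
    dimensions={"Environment": "prod"},
    timestamp_ms=1705312800000,
)
print(line)  # In Lambda, stdout is captured by CloudWatch Logs
```

Printing such a line from a Lambda function should be enough for CloudWatch to extract the metric, since no separate `PutMetricData` call or metric filter is involved.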
Log Insights Queries
# Find all errors (select "last hour" as the query time range)
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors by service
# (fields of JSON log events such as service and level are discovered automatically)
fields @timestamp, service, level
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate Lambda p99 latency
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| stats avg(@duration) as avg_ms,
        pct(@duration, 50) as p50_ms,
        pct(@duration, 95) as p95_ms,
        pct(@duration, 99) as p99_ms
  by bin(1h)

# Find slow API requests (duration_ms is a field of the JSON log event)
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

# Error rate calculation
fields @timestamp, @message
| filter ispresent(level)
| stats count(*) as total,
        sum(level = "ERROR") as errors
  by bin(5m)
| sort @timestamp desc

# Top contributors to log volume
fields @logStream, @message
| stats count(*) as log_count by @logStream
| sort log_count desc
| limit 20

# Correlation by request ID (request_id is a field of the JSON log event)
fields @timestamp, @message, @logStream
| filter request_id = "abc-123-def-456"
| sort @timestamp asc
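Queries like these can also be started from code. The sketch below builds the arguments for the boto3 `start_query` call on the CloudWatch Logs client without contacting AWS; `build_start_query_request` is a hypothetical helper, and the log group name is taken from the examples in this module.

```python
from datetime import datetime, timedelta, timezone

ERRORS_QUERY = """
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
""".strip()

def build_start_query_request(log_group, query, minutes=60, now=None):
    """Build keyword arguments for logs.start_query over the last N minutes."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
    }

request = build_start_query_request("/aws/lambda/order-service", ERRORS_QUERY)
# With credentials configured:
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**request)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until complete
```

Queries run asynchronously, so the results call typically has to be repeated until its status reports completion.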
Metric Filters
Extract metrics from logs automatically:

  Filter Pattern                  | Metric Created
  --------------------------------+-------------------------------
  ERROR                           | Count of ERROR occurrences
  [timestamp, level=ERROR, ...]   | Count of structured errors
  "status": 500                   | Count of 500 errors
  { $.latency > 1000 }            | Count of slow requests (JSON)

Example: Track application errors
  Filter Pattern:   { $.level = "ERROR" }
  Metric Name:      ApplicationErrors
  Metric Namespace: MyApp
  Metric Value:     1
  Default Value:    0

Example: Extract latency from JSON logs
  Filter Pattern:   { $.type = "REQUEST" }
  Metric Name:      RequestLatency
  Metric Namespace: MyApp
  Metric Value:     $.latency
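The two example filters can be expressed as arguments to the boto3 `put_metric_filter` call on the CloudWatch Logs client. This sketch only builds the arguments; `build_metric_filter`, the filter names, and the log group are illustrative assumptions.

```python
def build_metric_filter(log_group, filter_name, pattern, metric_name,
                        namespace, metric_value="1", default_value=None):
    """Build keyword arguments for logs.put_metric_filter."""
    transformation = {
        "metricName": metric_name,
        "metricNamespace": namespace,
        # A literal such as "1", or a JSON selector such as "$.latency"
        "metricValue": metric_value,
    }
    if default_value is not None:
        # Value reported for periods in which no log event matches
        transformation["defaultValue"] = default_value
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [transformation],
    }

error_filter = build_metric_filter(
    "/aws/lambda/order-service", "application-errors",
    '{ $.level = "ERROR" }', "ApplicationErrors", "MyApp",
    metric_value="1", default_value=0,
)
latency_filter = build_metric_filter(
    "/aws/lambda/order-service", "request-latency",
    '{ $.type = "REQUEST" }', "RequestLatency", "MyApp",
    metric_value="$.latency",
)
# With credentials configured:
#   boto3.client("logs").put_metric_filter(**error_filter)
```

The latency filter omits a default value so that periods without requests report no data rather than a misleading zero latency.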
CloudWatch Alarms
Alarm Types and States
Alarm States:
  OK (green) ↔ ALARM (red) ↔ INSUFFICIENT_DATA

Alarm Components:
  • Metric: What to monitor (CPUUtilization, custom metric)
  • Statistic: How to aggregate (Average, Sum, Max, p99)
  • Period: Length of time over which the statistic is computed (60s, 300s, etc.)
  • Evaluation Periods: How many of the most recent periods are evaluated
  • Threshold: Value the statistic is compared against
  • Comparison: GreaterThan, LessThan, etc.

Example Configuration:
  Metric: CPUUtilization
  Statistic: Average
  Period: 300 seconds (5 min)
  Evaluation Periods: 3
  Threshold: 80%
  → Alarm triggers when CPU > 80% for 3 consecutive 5-min periods

Alarm Actions:
  • SNS Topic → Email, SMS, Lambda
  • Auto Scaling → Scale up/down
  • EC2 Actions → Stop, terminate, reboot, recover
  • Systems Manager → Create OpsItems or Incident Manager incidents
    (which can in turn run automation runbooks)
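The "3 consecutive periods" rule can be illustrated with a small simulation. This is a simplified model, not CloudWatch's actual evaluation logic: it assumes one datapoint per period, treats any missing datapoint in the window as insufficient data, and ignores the "M out of N datapoints" option. `evaluate_alarm` is a hypothetical function.

```python
from typing import List, Optional

def evaluate_alarm(datapoints: List[Optional[float]],
                   threshold: float,
                   evaluation_periods: int) -> str:
    """Return the alarm state for a GreaterThanThreshold alarm (simplified)."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    if all(p > threshold for p in recent):
        return "ALARM"
    return "OK"

# Average CPU per 5-minute period, oldest first (illustrative values)
state_sustained = evaluate_alarm([70, 85, 90, 95], threshold=80, evaluation_periods=3)
state_spike = evaluate_alarm([85, 90, 75], threshold=80, evaluation_periods=3)
state_missing = evaluate_alarm([85, None, 90], threshold=80, evaluation_periods=3)
```

The single dip to 75% in the second series keeps the alarm in OK, which is why requiring several periods helps filter out short spikes.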
Essential Alarms (Terraform)
# High CPU Alarm with Auto Scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.app_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU exceeds 80% for 15 minutes"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate (Metric Math)
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.app_name}-lambda-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5 # 5% error rate

  metric_query {
    id          = "error_rate"
    expression  = "errors / invocations * 100"
    label       = "Error Rate %"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }

  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions = {
        FunctionName = aws_lambda_function.api.function_name
      }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Anomaly Detection Alarm
resource "aws_cloudwatch_metric_alarm" "traffic_anomaly" {
  alarm_name          = "${var.app_name}-traffic-anomaly"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  threshold_metric_id = "ad1"

  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  metric_query {
    id          = "ad1"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
    label       = "Traffic Anomaly Band"
    return_data = true
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "${var.app_name}-critical"
  alarm_rule = join(" OR ", [
    "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.lambda_error_rate.alarm_name})",
  ])
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
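The metric math expression and the composite rule can be checked by hand with a few lines of Python. This is an illustration of the arithmetic only, not how CloudWatch evaluates alarms; the function names are hypothetical. Returning `None` when there are no invocations is an assumption about how to model the division by zero, which in this sketch is treated as "no datapoint".

```python
from typing import Optional

def error_rate_percent(errors: float, invocations: float) -> Optional[float]:
    """Mirror the expression 'errors / invocations * 100'."""
    if invocations == 0:
        return None  # modeled as a missing datapoint (assumption)
    return errors / invocations * 100

def breaches(rate: Optional[float], threshold: float = 5.0) -> bool:
    """GreaterThanThreshold comparison; a missing datapoint does not breach."""
    return rate is not None and rate > threshold

def composite_or(*alarm_states: str) -> str:
    """Mirror an alarm rule of the form ALARM(a) OR ALARM(b)."""
    return "ALARM" if any(s == "ALARM" for s in alarm_states) else "OK"

rate = error_rate_percent(errors=10, invocations=100)   # about 10 percent
lambda_state = "ALARM" if breaches(rate) else "OK"
critical_state = composite_or("OK", lambda_state)       # high_cpu is OK here
```

With the OR rule, the composite alarm fires as soon as either child alarm is in ALARM; swapping `any` for `all` would model an AND rule.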
CloudWatch Dashboards
Example layout: MY APPLICATION DASHBOARD

  Row 1
    Status summary (number widgets)       | Request Rate
      🟢 Healthy Hosts: 4                 |   [Line graph of requests/min]
      🔴 Errors: 0.1%                     |
      ⏱️ P99 Latency: 45ms                |

  Row 2
    API Response Times by Endpoint
      /orders     45ms
      /users      25ms
      /products   55ms

  Row 3
    Lambda Invocations                    | DynamoDB Consumed Capacity
      [Stacked area chart]                |   [RCU/WCU over time]

  Row 4
    Recent Errors (Log Widget)
      10:23:45 ERROR Payment failed: insufficient funds
      10:22:30 ERROR Timeout connecting to external API

Widget Types:
  • Line, Stacked area, Number, Gauge, Bar, Pie
  • Text (Markdown), Log (Log Insights), Alarm status
Dashboard as Code
import json

# CloudWatch Dashboard JSON definition
dashboard_body = {
    "widgets": [
        # Key Metrics Row
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Request Count",
                "metrics": [
                    ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "Sum",
                "region": "us-east-1"
            }
        },
        {
            "type": "metric",
            "x": 6, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Response Time (p99)",
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234"]
                ],
                "period": 60,
                "stat": "p99",
                "region": "us-east-1"
            }
        },
        # Lambda Metrics
        {
            "type": "metric",
            "x": 0, "y": 6, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda Performance",
                "metrics": [
                    ["AWS/Lambda", "Duration", "FunctionName", "my-function", {"stat": "Average"}],
                    [".", ".", ".", ".", {"stat": "p95"}],
                    [".", ".", ".", ".", {"stat": "p99"}]
                ],
                "period": 60,
                "region": "us-east-1"
            }
        },
        # Alarm Status Widget
        {
            "type": "alarm",
            "x": 12, "y": 0, "width": 6, "height": 6,
            "properties": {
                "title": "Alarm Status",
                "alarms": [
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU",
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ErrorRate"
                ]
            }
        },
        # Log Widget
        {
            "type": "log",
            "x": 0, "y": 12, "width": 24, "height": 6,
            "properties": {
                "title": "Recent Errors",
                "query": "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
                "region": "us-east-1"
            }
        }
    ]
}

# Create/update dashboard
import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_dashboard(
    DashboardName='MyApplicationDashboard',
    DashboardBody=json.dumps(dashboard_body)
)
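Hand-written widget coordinates are easy to get wrong. A minimal sketch of a widget builder plus a layout check follows, assuming the dashboard grid is 24 columns wide (consistent with the full-width log widget above, which uses `"width": 24`). `metric_widget` and `layout_errors` are hypothetical helpers.

```python
import json

GRID_COLUMNS = 24  # assumed dashboard grid width

def metric_widget(title, metrics, x, y, width=6, height=6,
                  stat="Sum", period=60, region="us-east-1"):
    """Return a metric widget dict in the dashboard body format."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "metrics": metrics,
            "period": period,
            "stat": stat,
            "region": region,
        },
    }

def layout_errors(widgets):
    """Return messages for widgets that fall outside the grid."""
    errors = []
    for widget in widgets:
        title = widget.get("properties", {}).get("title", "<untitled>")
        if (widget["x"] < 0 or widget["width"] < 1
                or widget["x"] + widget["width"] > GRID_COLUMNS):
            errors.append(f"{title}: outside the {GRID_COLUMNS}-column grid")
    return errors

alb = ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234"]
widgets = [
    metric_widget("Request Count", [alb], x=0, y=0),
    metric_widget("Too Wide", [alb], x=20, y=0, width=6),  # deliberately invalid
]
problems = layout_errors(widgets)
body = json.dumps({"widgets": widgets[:1]})  # keep only the valid widget
```

Running such a check before `put_dashboard` catches layout mistakes locally instead of at deployment time.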
CloudWatch Synthetics
Canary tests that monitor your endpoints 24/7.

Canary Types:
  • Heartbeat: Simple availability check
  • API: REST API endpoint testing
  • Broken Link: Check for broken links
  • Visual: Screenshot comparison
  • GUI Workflow: Multi-step browser tests

Features:
  • Runs on a schedule (1 min to 1 hour)
  • Captures screenshots and HAR files
  • Integrates with CloudWatch Alarms
  • Uses Puppeteer or Selenium

Example: API Canary Script (Node.js)

  const synthetics = require('Synthetics');
  const log = require('SyntheticsLogger');

  const apiCanary = async function() {
    const page = await synthetics.getPage();

    // Test API endpoint
    const response = await page.goto('https://api.example.com');

    if (response.status() !== 200) {
      throw new Error(`Expected 200, got ${response.status()}`);
    }

    log.info('API check passed');
  };

  exports.handler = async () => {
    return await apiCanary();
  };

Pricing: $0.0012 per canary run
  100 runs/day × 30 days = $3.60/month per canary
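The pricing arithmetic generalizes to other schedules. A small sketch using the per-run price quoted above (an assumption that may not match current AWS pricing); `monthly_canary_cost` is a hypothetical helper.

```python
PRICE_PER_CANARY_RUN = 0.0012  # USD, figure quoted in the text above

def monthly_canary_cost(runs_per_day: int, days: int = 30,
                        price_per_run: float = PRICE_PER_CANARY_RUN) -> float:
    """Estimated monthly cost in USD for one canary, rounded to cents."""
    return round(runs_per_day * days * price_per_run, 2)

cost_100_per_day = monthly_canary_cost(100)          # the example above
cost_every_minute = monthly_canary_cost(24 * 60)     # one run per minute
```

The estimate covers canary runs only; any Lambda, S3, or alarm charges incurred by the canary are outside this sketch.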
CloudWatch Contributor Insights
Identify top contributors to high cardinality data.

Use Cases:
  • Find top IPs making requests
  • Identify users causing errors
  • Detect DDoS patterns
  • Find hottest DynamoDB partition keys

Example: Top Error-Producing API Endpoints

  Rank | Endpoint          | Error Count | % of Total
  -----+-------------------+-------------+-----------
  1    | /api/v1/checkout  | 1,234       | 45%
  2    | /api/v1/payment   | 567         | 21%
  3    | /api/v1/search    | 234         | 9%
  4    | /api/v1/users     | 123         | 5%

Rule Definition (for JSON logs, keys are given directly as JSON paths):

  {
    "Schema": {
      "Name": "CloudWatchLogRule",
      "Version": 1
    },
    "LogGroupNames": ["/aws/lambda/my-api"],
    "LogFormat": "JSON",
    "Contribution": {
      "Keys": ["$.path"],
      "Filters": [{"Match": "$.status_code", "GreaterThan": 499}]
    },
    "AggregateOn": "Count"
  }
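What the rule computes can be mimicked locally on a handful of log lines. This is an illustration of the idea (count events per key after filtering), not the service's implementation; `top_error_endpoints` is a hypothetical function, and the `path` and `status_code` field names follow the rule definition above.

```python
import json
from collections import Counter

def top_error_endpoints(log_lines, limit=3):
    """Count events with status_code > 499 per path and return the top ones."""
    counts = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(event, dict):
            continue
        status = event.get("status_code")
        if isinstance(status, int) and status > 499 and "path" in event:
            counts[event["path"]] += 1
    return counts.most_common(limit)

# Illustrative log lines
sample_logs = [
    '{"path": "/api/v1/checkout", "status_code": 500}',
    '{"path": "/api/v1/checkout", "status_code": 503}',
    '{"path": "/api/v1/payment", "status_code": 502}',
    '{"path": "/api/v1/search", "status_code": 200}',
    'plain text line that is not JSON',
]
ranking = top_error_endpoints(sample_logs)
```

Contributor Insights performs this kind of ranking continuously on incoming log events, which is what makes it suited to high-cardinality keys.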
Best Practices
- Set Retention Policies: Configure log retention to balance cost and compliance needs.
- Use Structured Logging: JSON format enables powerful Log Insights queries.
- Alert on Symptoms: Focus on user-facing metrics, not just infrastructure.
- Tune Alarm Thresholds: Avoid alert fatigue with well-calibrated thresholds.
- Dashboard Hierarchy: Executive → Service → Debug dashboards.
- Cost Awareness: Monitor CloudWatch costs; they can surprise you.
Cost Optimization
cost_tips = {
    "logs": [
        "Set appropriate retention (default: never expire)",
        "Export to S3 for long-term storage (cheaper)",
        "Use metric filters instead of querying raw logs",
        "Filter at source (log level, sampling)",
    ],
    "metrics": [
        "Minimize custom metric dimensions",
        "Use EMF for metrics-from-logs (no PutMetricData API calls; "
        "log ingestion and custom metric charges still apply)",
        "Consider high-res metrics only where needed",
        "Delete unused custom metrics",
    ],
    "alarms": [
        "Use composite alarms to reduce noise",
        "Consolidate similar alarms",
        "Avoid very short periods on high-cardinality metrics",
    ],
    "dashboards": [
        "Share dashboards instead of duplicating",
        "Use automatic refresh wisely",
    ]
}

# Cost estimation
monthly_costs = {
    "custom_metrics": "10,000 metrics × $0.30 = $3,000",
    "log_ingestion": "100 GB × $0.50 = $50",
    "log_storage": "1 TB × $0.03 = $30",
    "log_insights": "500 GB scanned × $0.005 = $2.50",
    "alarms": "100 alarms × $0.10 = $10",
    "dashboards": "First 3 free, then $3/month each",
}
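The figures above can be turned into a small calculator. This is a rough sketch using only the unit prices quoted in this module (assumed to be us-east-1 list prices that may have changed); it does not model volume tiers beyond the first 10,000 metrics or any free-tier allowances other than the three free dashboards. `estimate_monthly_cost` is a hypothetical function.

```python
PRICES = {  # USD, as quoted in this module
    "custom_metric": 0.30,
    "log_ingestion_gb": 0.50,
    "log_storage_gb": 0.03,
    "insights_scanned_gb": 0.005,
    "alarm": 0.10,
    "dashboard": 3.00,
}
FREE_DASHBOARDS = 3

def estimate_monthly_cost(custom_metrics=0, ingested_gb=0, stored_gb=0,
                          scanned_gb=0, alarms=0, dashboards=0):
    """Return a per-category cost breakdown plus a total, in USD."""
    billable_dashboards = max(0, dashboards - FREE_DASHBOARDS)
    costs = {
        "custom_metrics": round(custom_metrics * PRICES["custom_metric"], 2),
        "log_ingestion": round(ingested_gb * PRICES["log_ingestion_gb"], 2),
        "log_storage": round(stored_gb * PRICES["log_storage_gb"], 2),
        "log_insights": round(scanned_gb * PRICES["insights_scanned_gb"], 2),
        "alarms": round(alarms * PRICES["alarm"], 2),
        "dashboards": round(billable_dashboards * PRICES["dashboard"], 2),
    }
    costs["total"] = round(sum(costs.values()), 2)
    return costs

estimate = estimate_monthly_cost(
    custom_metrics=10_000, ingested_gb=100, stored_gb=1_000,
    scanned_gb=500, alarms=100, dashboards=5,
)
```

In this example custom metrics dominate the bill, which is why limiting metric dimensions is usually the first optimization to consider.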
🎯 Interview Questions
Q1: CloudWatch Logs vs X-Ray vs CloudTrail?
CloudWatch Logs:
- Application and system logs
- What your code outputs
- Debugging, troubleshooting
X-Ray:
- Distributed tracing
- Request flow across services
- Performance analysis
CloudTrail:
- AWS API call history
- Who did what, when
- Security auditing
Q2: How to reduce CloudWatch Logs costs?
- Set retention policies (don't keep forever)
- Export to S3 for long-term (use lifecycle rules)
- Filter at source (log levels, sampling)
- Use metric filters instead of Log Insights for common queries
- Reduce log payload size before ingestion (shorter messages, fewer redundant fields)
- Use Contributor Insights rules for ongoing analysis
Q3: What's the difference between metric alarms and composite alarms?
Metric Alarms:
- Monitor single metric
- Simple threshold or anomaly detection
- One condition
Composite Alarms:
- Combine multiple alarms with AND/OR/NOT
- Reduce alert noise
- Complex conditions like: "High CPU AND High Memory"
- Better for on-call (fewer, more actionable alerts)
Q4: EC2 memory not showing in CloudWatch?
Why: EC2 monitoring (basic or detailed) doesn't include memory. Memory usage is tracked inside the OS and is not visible to the hypervisor.
Solution: Install the CloudWatch Agent
# Install agent
sudo yum install amazon-cloudwatch-agent
# Configure
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start
sudo systemctl start amazon-cloudwatch-agent
Q5: How to monitor Lambda cold starts?
Built-in: Lambda doesn't have a direct cold start metric.
Solutions:
- Init Duration in REPORT logs (present only on cold starts):
    fields @timestamp, @initDuration
    | filter @type = "REPORT" and ispresent(@initDuration)
    | stats avg(@initDuration), max(@initDuration), count(*)
- Custom metric from code (measure init time)
- X-Ray shows initialization segment
- Lambda Insights (enhanced monitoring)
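For the first option, the Init Duration value can also be pulled out of raw REPORT lines in code, for example in a log-processing function. The sketch below assumes the usual tab-separated REPORT line layout; the sample lines and the request ID are made up for illustration, and `init_duration_ms` is a hypothetical helper.

```python
import re
from typing import Optional

INIT_DURATION = re.compile(r"Init Duration: ([\d.]+) ms")

def init_duration_ms(report_line: str) -> Optional[float]:
    """Return the Init Duration in ms, or None for a warm invocation."""
    match = INIT_DURATION.search(report_line)
    return float(match.group(1)) if match else None

# Illustrative REPORT lines (cold start first, then a warm invocation)
cold = ("REPORT RequestId: 1a2b3c\tDuration: 102.25 ms\tBilled Duration: 103 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 51 MB\tInit Duration: 183.74 ms")
warm = ("REPORT RequestId: 4d5e6f\tDuration: 12.10 ms\tBilled Duration: 13 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 51 MB")

cold_init = init_duration_ms(cold)
warm_init = init_duration_ms(warm)
is_cold_start = cold_init is not None
```

Counting lines where the value is present gives the number of cold starts, and publishing the value as a custom metric makes it alarmable.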
🧪 Hands-On Lab