Module Overview
Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Core Concepts, Compute
Observability is critical for running production workloads. This module covers the complete AWS monitoring stack for logs, metrics, traces, and alerts.
What You’ll Learn:
CloudWatch metrics, logs, and alarms
X-Ray distributed tracing
CloudTrail for audit logging
EventBridge for event-driven automation
Building observability dashboards
Alerting and incident response
Observability Pillars
┌────────────────────────────────────────────────────────────────────────┐
│ Three Pillars of Observability │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ METRICS │ │ LOGS │ │ TRACES │ │
│ │ (CloudWatch) │ │ (CloudWatch │ │ (X-Ray) │ │
│ │ │ │ Logs) │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ What happened? Why did it Where did it │
│ (CPU 85%) happen? happen? │
│ (error logs) (service A→B→C) │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ AWS Observability Stack │ │
│ │ │ │
│ │ CloudWatch CloudWatch X-Ray CloudTrail │ │
│ │ Metrics Logs Traces Audit Logs │ │
│ │ │ │ │ │ │ │
│ │ └──────────────┴──────────────┴──────────────┘ │ │
│ │ │ │ │
│ │ CloudWatch Dashboards │ │
│ │ CloudWatch Alarms │ │
│ │ EventBridge Automation │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
CloudWatch Metrics
Collect and track metrics from AWS services and custom applications.
Built-in vs Custom Metrics
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Metrics Types │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ BUILT-IN METRICS (Free - Basic Monitoring) │
│ ───────────────────────────────────────── │
│ EC2: CPUUtilization, NetworkIn/Out, DiskRead/Write │
│ RDS: CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS │
│ Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExec │
│ ALB: RequestCount, TargetResponseTime, HTTPCode_Target_2XX │
│ DynamoDB: ConsumedRCU, ConsumedWCU, ThrottledRequests │
│ S3: BucketSizeBytes, NumberOfObjects │
│ │
│ Resolution: 5 minutes (basic), 1 minute (detailed - extra cost) │
│ │
│ CUSTOM METRICS (You publish) │
│ ──────────────────────────── │
│ • Application-specific metrics │
│ • Business KPIs (orders/min, signups/hour) │
│ • Memory utilization (not built-in for EC2!) │
│ • Queue depth, cache hit ratio │
│ │
│ Resolution: 1 second to 1 minute (high-resolution) │
│ Cost: $0.30 per metric per month │
│ │
└────────────────────────────────────────────────────────────────────────┘
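Before publishing anything custom, it helps to see how built-in metrics are read back. A minimal sketch using boto3's GetMetricData API to pull the last three hours of EC2 CPUUtilization; the instance ID is a placeholder and credentials/region are assumed to come from the environment.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Pull 3 hours of the built-in CPUUtilization metric for one instance (placeholder ID)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'cpu',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/EC2',
                'MetricName': 'CPUUtilization',
                'Dimensions': [{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}]
            },
            'Period': 300,   # 5-minute basic monitoring resolution
            'Stat': 'Average'
        }
    }],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow()
)

result = response['MetricDataResults'][0]
for timestamp, value in zip(result['Timestamps'], result['Values']):
    print(timestamp, value)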
Publishing Custom Metrics
import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def publish_custom_metric(namespace: str, metric_name: str,
                          value: float, unit: str = 'Count',
                          dimensions: list = None):
    """Publish a custom metric to CloudWatch."""
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
        'Timestamp': datetime.utcnow(),
    }
    if dimensions:
        metric_data['Dimensions'] = dimensions

    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_data]
    )

# Example: Track orders per minute
publish_custom_metric(
    namespace='MyApp/Ecommerce',
    metric_name='OrdersPlaced',
    value=42,
    unit='Count',
    dimensions=[
        {'Name': 'Environment', 'Value': 'Production'},
        {'Name': 'Region', 'Value': 'us-east-1'}
    ]
)

# Example: Track memory utilization (not built-in for EC2!)
import psutil

publish_custom_metric(
    namespace='MyApp/System',
    metric_name='MemoryUtilization',
    value=psutil.virtual_memory().percent,
    unit='Percent',
    dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ]
)

# High-resolution metrics (1-second granularity)
cloudwatch.put_metric_data(
    Namespace='MyApp/HighFrequency',
    MetricData=[{
        'MetricName': 'TransactionsPerSecond',
        'Value': 1500,
        'Unit': 'Count/Second',
        'StorageResolution': 1  # 1 = high-resolution, 60 = standard
    }]
)
Embedded Metric Format (EMF)
import json
from datetime import datetime

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """
    Embedded Metric Format - publish metrics via logs.
    Automatically extracted by CloudWatch.
    """
    emf_log = {
        "_aws": {
            "Timestamp": int(datetime.utcnow().timestamp() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{
                    "Name": metric_name,
                    "Unit": "Count"
                }]
            }]
        },
        metric_name: value,
        **dimensions
    }
    # Print to stdout - CloudWatch Logs extracts the metric
    print(json.dumps(emf_log))

# Usage in Lambda
emit_emf_metric(
    metric_name="OrderValue",
    value=99.99,
    dimensions={"Service": "Checkout", "Environment": "prod"}
)
CloudWatch Logs
Centralized log management for all AWS services and applications.
Log Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Logs Structure │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ LOG GROUP: /aws/lambda/my-function │
│ ─────────────────────────────────── │
│ │ │
│ ├── LOG STREAM: 2024/01/15/[$LATEST]abc123 │
│ │ │ │
│ │ ├── Log Event: {"level": "INFO", "msg": "Started"} │
│ │ ├── Log Event: {"level": "ERROR", "msg": "Failed"} │
│ │ └── Log Event: {"level": "INFO", "msg": "Completed"} │
│ │ │
│ ├── LOG STREAM: 2024/01/15/[$LATEST]def456 │
│ │ └── ... │
│ │ │
│ └── LOG STREAM: 2024/01/16/[$LATEST]ghi789 │
│ └── ... │
│ │
│ RETENTION SETTINGS: │
│ • 1 day to 10 years (or never expire) │
│ • Export to S3 for long-term storage │
│ • Stream to Kinesis/Lambda for real-time processing │
│ │
│ PRICING: │
│ • Ingestion: $0.50 per GB │
│ • Storage: $0.03 per GB/month │
│ • Queries (Logs Insights): $0.005 per GB scanned │
│ │
└────────────────────────────────────────────────────────────────────────┘
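Retention is not set automatically; new log groups default to never-expire, which quietly accumulates storage cost. A minimal sketch, assuming the log group name from the diagram, that applies a 30-day retention policy with boto3:

import boto3

logs = boto3.client('logs')

# Set a 30-day retention policy on a log group (name is an example)
logs.put_retention_policy(
    logGroupName='/aws/lambda/my-function',
    retentionInDays=30
)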
Structured Logging Best Practices
import json
from datetime import datetime

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""

    def __init__(self, service_name: str):
        self.service_name = service_name

    def _format_log(self, level: str, message: str, **kwargs) -> str:
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": level,
            "service": self.service_name,
            "message": message,
            **kwargs
        }
        return json.dumps(log_entry)

    # In Lambda, anything written to stdout is captured by CloudWatch Logs
    def info(self, message: str, **kwargs):
        print(self._format_log("INFO", message, **kwargs))

    def error(self, message: str, **kwargs):
        print(self._format_log("ERROR", message, **kwargs))

    def warn(self, message: str, **kwargs):
        print(self._format_log("WARN", message, **kwargs))

# Usage
logger = StructuredLogger("order-service")

def process_order(order_id: str, user_id: str):
    logger.info(
        "Processing order",
        order_id=order_id,
        user_id=user_id,
        action="process_order"
    )
    try:
        # Process order...
        logger.info(
            "Order completed",
            order_id=order_id,
            duration_ms=150,
            action="order_complete"
        )
    except Exception as e:
        logger.error(
            "Order failed",
            order_id=order_id,
            error=str(e),
            action="order_failed"
        )
        raise
CloudWatch Logs Insights
# Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Parse JSON logs and aggregate errors by service
fields @timestamp, @message
| parse @message '{"level":"*","service":"*","message":"*"' as level, service, msg
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate p50/p95/p99 latency from Lambda REPORT lines
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

# Find slow API requests
fields @timestamp, @message
| parse @message '"duration_ms":*,' as duration
| filter duration > 1000
| sort duration desc
| limit 50

# Error rate over time
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(strcontains(level, "ERROR")) as errors,
        sum(strcontains(level, "ERROR")) / count(*) * 100 as error_rate
  by bin(5m)
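These queries can also be run programmatically. A minimal sketch that runs the first query (errors in the last hour) against an example log group using boto3's start_query and get_query_results; Insights queries are asynchronous, so the code polls until completion.

import time
import boto3
from datetime import datetime, timedelta

logs = boto3.client('logs')

# Kick off the "find all errors" query against an example log group
query_id = logs.start_query(
    logGroupName='/aws/lambda/my-function',
    startTime=int((datetime.utcnow() - timedelta(hours=1)).timestamp()),
    endTime=int(datetime.utcnow().timestamp()),
    queryString='fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100'
)['queryId']

# Poll until the query finishes (Insights queries run asynchronously)
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result['results']:
    print({field['field']: field['value'] for field in row})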
X-Ray Distributed Tracing
Trace requests across microservices to identify bottlenecks and errors.
X-Ray Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ X-Ray Distributed Tracing │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Request Flow with Trace: │
│ │
│ Client │
│ │ │
│ │ Trace ID: 1-abc123-def456789 │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ API Gateway │ │
│ │ Segment: 50ms │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Lambda: Order Service │ │
│ │ Segment: 200ms │ │
│ │ ├── Subsegment: DynamoDB Query (30ms) │ │
│ │ ├── Subsegment: External API Call (100ms) │ │
│ │ └── Subsegment: SNS Publish (20ms) │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ┌────────────────┴────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Lambda: Inventory │ │ Lambda: Payment │ │
│ │ Segment: 80ms │ │ Segment: 150ms │ │
│ │ └── DynamoDB (50ms) │ │ └── Stripe API (120ms)│ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ Service Map (auto-generated): │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ API │───►│ Order │───►│ Inventory│───►│ DynamoDB │ │
│ │ Gateway │ │ Service │ │ Service │ │ │ │
│ └──────────┘ └────┬─────┘ └──────────┘ └──────────┘ │
│ │ │
│ └──────────►┌──────────┐ │
│ │ Payment │ │
│ │ Service │ │
│ └──────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
Instrumenting Lambda with X-Ray
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, etc.)
patch_all()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')

@xray_recorder.capture('process_order')
def process_order(order_id: str):
    """Process order with X-Ray tracing."""
    # Add annotation (indexed, searchable)
    xray_recorder.put_annotation('order_id', order_id)

    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'order_id': order_id,
        'timestamp': '2024-01-15T10:00:00Z'
    })

    # Subsegment for a custom operation
    # (validate_order and call_payment_api are application helpers, not shown)
    with xray_recorder.in_subsegment('validate_order') as subsegment:
        subsegment.put_annotation('validation_type', 'full')
        validate_order(order_id)

    # DynamoDB call is automatically traced
    result = table.get_item(Key={'order_id': order_id})

    # External API call with a custom subsegment
    with xray_recorder.in_subsegment('external_api') as subsegment:
        subsegment.put_metadata('api', 'payment_gateway')
        response = call_payment_api(result['Item'])

    return response

def lambda_handler(event, context):
    order_id = event.get('order_id')
    return process_order(order_id)
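Instrumenting the code is only half the setup: the Lambda function itself must have active tracing enabled, otherwise no segments are sent. In practice this is usually configured in Terraform or SAM; the sketch below shows the equivalent boto3 call with a placeholder function name.

import boto3

lambda_client = boto3.client('lambda')

# Turn on active tracing so Lambda samples requests and sends segments to X-Ray
lambda_client.update_function_configuration(
    FunctionName='order-service',        # example function name
    TracingConfig={'Mode': 'Active'}     # 'PassThrough' is the default
)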
CloudWatch Alarms
Automated alerts and actions based on metric thresholds.
Alarm Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Alarm States │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Alarm States: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ OK │◄──────►│ ALARM │◄──────►│INSUFFICIENT│ │ │
│ │ │ (green) │ │ (red) │ │ DATA │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Alarm Components: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ METRIC STATISTIC PERIOD THRESHOLD │ │
│ │ CPUUtilization Average 5 minutes > 80% │ │
│ │ │ │
│ │ EVALUATION PERIODS: 3 │ │
│ │ (3 consecutive 5-min periods above 80% = ALARM) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Alarm Actions: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ • SNS Topic (email, SMS, Lambda) │ │
│ │ • Auto Scaling (scale up/down) │ │
│ │ • EC2 Actions (stop, terminate, reboot) │ │
│ │ • Systems Manager (run automation) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
Creating Alarms with Terraform
# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300  # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  alarm_description   = "Lambda error rate exceeds 5%"

  metric_query {
    id          = "error_rate"
    expression  = "(errors / invocations) * 100"
    label       = "Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }

  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DynamoDB Throttling Alarm
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttle" {
  alarm_name          = "dynamodb-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    TableName = "my-table"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name    = "critical-system-alarm"
  alarm_rule    = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
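Before trusting an alarm in production it is worth verifying that its actions actually fire. A minimal sketch, assuming the high-cpu alarm above already exists, that temporarily forces it into the ALARM state; the next metric evaluation resets it automatically.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Force the alarm into ALARM to verify that SNS and scaling actions fire
cloudwatch.set_alarm_state(
    AlarmName='high-cpu-utilization',
    StateValue='ALARM',
    StateReason='Manual test of notification and scaling actions'
)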
CloudTrail (Audit Logging)
Track all API calls for security and compliance.
┌────────────────────────────────────────────────────────────────────────┐
│ CloudTrail Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Every AWS API Call: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ User/Role ───► AWS API ───► CloudTrail ───► S3 Bucket │ │
│ │ │ │ │
│ │ └──► CloudWatch Logs │ │
│ │ └──► EventBridge (real-time) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Event Types: │
│ ───────────── │
│ • Management Events: Control plane (CreateBucket, RunInstances) │
│ • Data Events: Data plane (S3 GetObject, Lambda Invoke) │
│ • Insights Events: Unusual API activity detection │
│ │
│ Sample Event: │
│ { │
│ "eventTime": "2024-01-15T10:30:00Z", │
│ "eventSource": "s3.amazonaws.com", │
│ "eventName": "DeleteBucket", │
│ "userIdentity": { │
│ "type": "IAMUser", │
│ "userName": "admin", │
│ "arn": "arn:aws:iam::123456789012:user/admin" │
│ }, │
│ "sourceIPAddress": "203.0.113.50", │
│ "requestParameters": { "bucketName": "my-bucket" } │
│ } │
│ │
└────────────────────────────────────────────────────────────────────────┘
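Recent management events can be queried directly, without setting up Athena or parsing the S3 files. A minimal sketch using boto3's lookup_events to find DeleteBucket calls from the past week; CloudTrail keeps 90 days of management events queryable this way.

import boto3
from datetime import datetime, timedelta

cloudtrail = boto3.client('cloudtrail')

# Look up recent DeleteBucket calls (management events from the last 90 days
# are queryable without any extra setup)
events = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'DeleteBucket'}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow()
)

for event in events['Events']:
    print(event['EventTime'], event.get('Username', '-'), event['EventName'])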
CloudTrail Best Practices
# CloudTrail configuration checklist
cloudtrail_config = {
    "multi_region": True,            # Trail in all regions
    "log_file_validation": True,     # Detect tampering
    "s3_encryption": "SSE-KMS",      # Encrypt logs
    "cloudwatch_logs": True,         # Real-time analysis
    "data_events": {
        "s3": ["arn:aws:s3:::sensitive-bucket/*"],
        "lambda": True
    },
    "insights": True,                # Anomaly detection
    "organization_trail": True       # Multi-account
}

# Real-time alerting with EventBridge
# Detect root user login
eventbridge_rule = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {
        "userIdentity": {
            "type": ["Root"]
        }
    }
}
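The rule above is just a pattern; it still needs to be created and pointed at a target. A minimal sketch that registers it with EventBridge and routes matches to an SNS topic; the rule name and topic ARN are placeholders.

import json
import boto3

events = boto3.client('events')

# Create the rule from the pattern above (names and ARNs are placeholders)
events.put_rule(
    Name='detect-root-console-login',
    EventPattern=json.dumps({
        "source": ["aws.signin"],
        "detail-type": ["AWS Console Sign In via CloudTrail"],
        "detail": {"userIdentity": {"type": ["Root"]}}
    }),
    State='ENABLED'
)

# Send matching events to an SNS topic for immediate alerting
events.put_targets(
    Rule='detect-root-console-login',
    Targets=[{'Id': 'alert-topic', 'Arn': 'arn:aws:sns:us-east-1:123456789012:security-alerts'}]
)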
🎯 Interview Questions
Q1: How would you debug a slow API response?
Systematic approach:
1. X-Ray trace: find the specific slow request (see the sketch after this list)
   - Identify which service or subsegment is slow
   - Check annotations for context
2. CloudWatch Metrics: check historical patterns
   - Is this a spike or a gradual increase?
   - Correlate with CPU, memory, and connection counts
3. CloudWatch Logs: find related log entries
   fields @timestamp, @message
   | filter trace_id = "1-abc123..."
   | sort @timestamp
4. Service-specific checks:
   - Lambda: cold starts? Memory sufficient?
   - DynamoDB: throttling? Hot partition?
   - RDS: connection pool exhausted?
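For step 1, the slow request can be located programmatically as well as in the console. A minimal sketch using the X-Ray GetTraceSummaries API with a filter expression for traces slower than one second; the time window is arbitrary.

import boto3
from datetime import datetime, timedelta

xray = boto3.client('xray')

# Find traces slower than 1 second in the last 30 minutes
summaries = xray.get_trace_summaries(
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    FilterExpression='duration > 1'
)

for trace in summaries['TraceSummaries']:
    print(trace['Id'], trace['Duration'])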
Q2: What metrics should you monitor for a web application?
Essential metrics by layer (a p99 latency alarm sketch follows this list):
Load Balancer:
- RequestCount, TargetResponseTime
- HTTPCode_Target_5XX, HTTPCode_ELB_5XX
- HealthyHostCount, UnHealthyHostCount
Compute (EC2/Lambda):
- CPUUtilization, MemoryUtilization (custom metric)
- Lambda: Duration, Errors, Throttles
Database:
- CPUUtilization, FreeableMemory
- DatabaseConnections, ReadIOPS, WriteIOPS
- DynamoDB: ThrottledRequests
Application:
- Error rate, latency (p50, p95, p99)
- Requests per second
- Business metrics (orders, signups)
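For the latency percentiles, a regular alarm works if it uses an extended statistic instead of Average. A minimal sketch of a p99 alarm on ALB TargetResponseTime; the load balancer dimension value and SNS topic ARN are placeholders.

import boto3

cloudwatch = boto3.client('cloudwatch')

# p99 latency alarm on the load balancer layer (dimension value is a placeholder)
cloudwatch.put_metric_alarm(
    AlarmName='alb-p99-latency',
    Namespace='AWS/ApplicationELB',
    MetricName='TargetResponseTime',
    Dimensions=[{'Name': 'LoadBalancer', 'Value': 'app/my-alb/50dc6c495c0c9188'}],
    ExtendedStatistic='p99',         # percentile instead of Average/Sum
    Period=300,
    EvaluationPeriods=3,
    Threshold=1.0,                   # seconds
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)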
Q3: How do you reduce CloudWatch Logs costs?
Cost optimization strategies (an export-to-S3 sketch follows this list):
1. Reduce ingestion:
   - Filter logs at the source (log at INFO, not DEBUG)
   - Use sampling for high-volume logs
2. Optimize retention:
   - Set appropriate retention (7-30 days for most workloads)
   - Export to S3 for long-term storage (cheaper)
3. Use Logs Insights efficiently:
   - Narrow the time range
   - Query specific log groups
   - Cache common queries
4. Consider alternatives:
   - Kinesis Data Firehose → S3 for high volume
   - OpenSearch for complex analysis
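A sketch of the export-plus-shorter-retention pattern, assuming a destination bucket whose policy already allows CloudWatch Logs to write to it; the log group, bucket, and prefix names are placeholders.

import boto3
from datetime import datetime, timedelta

logs = boto3.client('logs')

def ms(dt):
    """Convert a datetime to epoch milliseconds (required by the Logs export API)."""
    return int(dt.timestamp() * 1000)

# Export the last 30 days of logs to S3, then shorten retention on the log group
logs.create_export_task(
    taskName='export-last-30-days',
    logGroupName='/aws/lambda/my-function',
    fromTime=ms(datetime.utcnow() - timedelta(days=30)),
    to=ms(datetime.utcnow()),
    destination='my-log-archive-bucket',     # bucket must allow CloudWatch Logs to write
    destinationPrefix='lambda/my-function'
)
logs.put_retention_policy(logGroupName='/aws/lambda/my-function', retentionInDays=14)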
Q4: How do you set up alerting for a production system?
Alert hierarchy:
Critical (PagerDuty/immediate):
- Service down (health check failures)
- Error rate > 5%
- Latency p99 > 5s
- Security events (root login)
Warning (Slack/email):
- Error rate > 1%
- CPU > 80% sustained
- Disk > 85%
- Approaching service quotas
Informational (dashboard):
- Deployment events
- Scaling events
- Cost anomalies
Best practices (a composite-alarm sketch follows this list):
- Avoid alert fatigue (tune thresholds)
- Use composite alarms
- Include runbook links in alerts
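A minimal sketch of a composite alarm created with boto3 that pages only when both child alarms are in ALARM and carries a runbook link in its description; the alarm names, runbook URL, and topic ARN are assumptions.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Page only when errors and latency breach together, and point to the runbook
cloudwatch.put_composite_alarm(
    AlarmName='checkout-critical',
    AlarmRule='ALARM("lambda-high-error-rate") AND ALARM("alb-p99-latency")',
    AlarmDescription='Checkout degraded. Runbook: https://wiki.example.com/runbooks/checkout',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:pagerduty']
)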
Q5: CloudWatch vs third-party observability tools?
CloudWatch is the native option: no extra vendor to onboard, deep integration with AWS services and IAM, and pay-per-use pricing, but its dashboards and cross-account or cross-cloud views are more limited. Third-party tools such as Datadog, New Relic, or Grafana offer richer visualization, unified views across clouds, and more mature alerting workflows, at the cost of license fees, data egress, and another system to operate. A common pattern is to keep CloudWatch as the system of record for metrics and logs and layer a third-party tool on top when teams need cross-cloud correlation or advanced analytics.
🧪 Hands-On Lab: Build Observability Dashboard
1. Enable X-Ray on Lambda: add the X-Ray SDK and enable active tracing
2. Create a CloudWatch dashboard: add widgets for key metrics (CPU, errors, latency); a dashboard sketch follows this list
3. Set up structured logging: implement JSON logging with correlation IDs
4. Create alarms: set up CPU, error rate, and latency alarms
5. Configure CloudTrail: enable a multi-region trail with CloudWatch Logs
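For step 2, dashboards can be created as code rather than clicked together. A minimal sketch that creates a one-widget dashboard with boto3's put_dashboard; the dashboard name, function name, and region are examples.

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

# One metric widget as a starting point for the lab dashboard (names are examples)
dashboard_body = {
    "widgets": [{
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "metrics": [["AWS/Lambda", "Errors", "FunctionName", "my-function"]],
            "stat": "Sum",
            "period": 300,
            "region": "us-east-1",
            "title": "Lambda Errors"
        }
    }]
}

cloudwatch.put_dashboard(
    DashboardName='observability-lab',
    DashboardBody=json.dumps(dashboard_body)
)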
Next Module
CDN & Edge Services: Master CloudFront, Global Accelerator, and edge computing.