Module Overview
Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Core Concepts, Compute
- CloudWatch metrics, logs, and alarms
- X-Ray distributed tracing
- CloudTrail for audit logging
- EventBridge for event-driven automation
- Building observability dashboards
- Alerting and incident response
Observability Pillars
Three Pillars of Observability

  +------------------+    +------------------+    +------------------+
  |     METRICS      |    |       LOGS       |    |      TRACES      |
  |   (CloudWatch)   |    | (CloudWatch Logs)|    |     (X-Ray)      |
  +------------------+    +------------------+    +------------------+
    What happened?          Why did it happen?      Where did it happen?
    (CPU 85%)               (error logs)            (service A→B→C)

AWS Observability Stack

  CloudWatch Metrics | CloudWatch Logs | X-Ray Traces | CloudTrail Audit Logs
                                  |
                                  ▼
      CloudWatch Dashboards, CloudWatch Alarms, EventBridge Automation
CloudWatch Metrics
Collect and track metrics from AWS services and custom applications.
Built-in vs Custom Metrics
CloudWatch Metrics Types

BUILT-IN METRICS (Free - Basic Monitoring)
  EC2:      CPUUtilization, NetworkIn/NetworkOut, DiskReadOps/DiskWriteOps
  RDS:      CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS
  Lambda:   Invocations, Duration, Errors, Throttles, ConcurrentExecutions
  ALB:      RequestCount, TargetResponseTime, HTTPCode_Target_2XX
  DynamoDB: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
  S3:       BucketSizeBytes, NumberOfObjects

  Resolution: 5 minutes (basic), 1 minute (detailed - extra cost)

CUSTOM METRICS (You publish)
  • Application-specific metrics
  • Business KPIs (orders/min, signups/hour)
  • Memory utilization (not built-in for EC2!)
  • Queue depth, cache hit ratio

  Resolution: 1 minute (standard) down to 1 second (high-resolution)
  Cost: $0.30 per metric per month
Publishing Custom Metrics
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')


def publish_custom_metric(namespace: str, metric_name: str,
                          value: float, unit: str = 'Count',
                          dimensions: list = None):
    """
    Publish custom metric to CloudWatch.

    Cost tip: Each unique combination of namespace + metric name + dimensions
    counts as one custom metric ($0.30/month). If you add a "RequestId"
    dimension, every single request creates a new metric, which can add up
    to thousands of dollars/month. Use low-cardinality dimensions like
    Environment, Region, or Service. Never use user IDs, request IDs, or
    timestamps as dimensions.
    """
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        'Timestamp': datetime.now(timezone.utc),
    }
    if dimensions:
        metric_data['Dimensions'] = dimensions
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_data]
    )

# Example: Track orders per minute
publish_custom_metric(
    namespace='MyApp/Ecommerce',
    metric_name='OrdersPlaced',
    value=42,
    unit='Count',
    dimensions=[
        {'Name': 'Environment', 'Value': 'Production'},
        {'Name': 'Region', 'Value': 'us-east-1'}
    ]
)

# Example: Track memory utilization (not built-in!)
import psutil

publish_custom_metric(
    namespace='MyApp/System',
    metric_name='MemoryUtilization',
    value=psutil.virtual_memory().percent,
    unit='Percent',
    dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ]
)

# High-resolution metrics (1-second granularity)
# Cost warning: A high-resolution metric costs the same per metric ($0.30/month),
# but publishing up to 60x more data points means more PutMetricData requests,
# and alarms on high-resolution metrics are billed at a higher rate.
# Only use StorageResolution=1 for metrics where second-level precision matters
# (e.g., real-time trading systems). For most applications, 60-second standard
# resolution is sufficient.
cloudwatch.put_metric_data(
    Namespace='MyApp/HighFrequency',
    MetricData=[{
        'MetricName': 'TransactionsPerSecond',
        'Value': 1500,
        'Unit': 'Count/Second',
        'StorageResolution': 1  # 1 = high-res, 60 = standard
    }]
)
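The cost tip in the publish_custom_metric docstring becomes concrete with a little arithmetic. The sketch below is illustrative only: estimate_custom_metric_cost is a hypothetical helper, it uses the flat $0.30 per metric per month quoted in this module (real pricing is tiered and varies by Region), and it assumes every combination of dimension values is actually published.

```python
from math import prod

PRICE_PER_METRIC_USD = 0.30  # flat rate quoted in this module; real pricing is tiered


def estimate_custom_metric_cost(metric_names: int, dimension_cardinalities: dict) -> float:
    """Estimate monthly cost when every dimension-value combination is published.

    Each unique (namespace, metric name, dimension values) tuple is billed as
    one custom metric, so the count multiplies across dimensions.
    """
    combinations = prod(dimension_cardinalities.values())
    return round(metric_names * combinations * PRICE_PER_METRIC_USD, 2)


# Low cardinality: 3 environments x 4 regions = 12 billed metrics
low = estimate_custom_metric_cost(1, {"Environment": 3, "Region": 4})

# High cardinality: a RequestId dimension with 100,000 distinct values
high = estimate_custom_metric_cost(1, {"RequestId": 100_000})

print(f"low-cardinality:  ${low:,.2f}/month")
print(f"high-cardinality: ${high:,.2f}/month")
```

The same metric name goes from a few dollars to tens of thousands of dollars per month purely because of one dimension, which is why dimension values should come from small, fixed sets.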
CloudWatch Embedded Metric Format (EMF)
import json
import time


def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """
    Embedded Metric Format - publish metrics via logs.
    Automatically extracted by CloudWatch.

    Why EMF over PutMetricData? EMF is the preferred approach for Lambda
    because: (1) no additional API call latency added to your function,
    (2) the metric data is embedded in your log output so you get metrics
    AND logs in one write, and (3) you avoid PutMetricData throttling
    limits (150 TPS). The downside is you pay for log ingestion ($0.50/GB),
    but for Lambda you are already paying that regardless.
    """
    emf_log = {
        "_aws": {
            # Epoch milliseconds. time.time() is timezone-independent, whereas
            # datetime.utcnow().timestamp() is wrong on hosts not set to UTC.
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                # One dimension set containing every dimension name
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{
                    "Name": metric_name,
                    "Unit": "Count"
                }]
            }]
        },
        # Metric value and dimension values live at the top level of the log
        metric_name: value,
        **dimensions
    }
    # Print to stdout - CloudWatch Logs extracts the metric
    print(json.dumps(emf_log))


# Usage in Lambda
emit_emf_metric(
    metric_name="OrderValue",
    value=99.99,
    dimensions={"Service": "Checkout", "Environment": "prod"}
)
CloudWatch Logs
Centralized log management for all AWS services and applications. CloudWatch Logs is often the largest unexpected line item on an AWS bill: a service generating 100 GB of logs per day costs roughly $1,500/month in ingestion alone (100 GB x 30 days x $0.50/GB), before storage and queries.
Log Architecture
CloudWatch Logs Structure

LOG GROUP: /aws/lambda/my-function
├── LOG STREAM: 2024/01/15/[$LATEST]abc123
│   ├── Log Event: {"level": "INFO", "msg": "Started"}
│   ├── Log Event: {"level": "ERROR", "msg": "Failed"}
│   └── Log Event: {"level": "INFO", "msg": "Completed"}
├── LOG STREAM: 2024/01/15/[$LATEST]def456
│   └── ...
└── LOG STREAM: 2024/01/16/[$LATEST]ghi789
    └── ...

RETENTION SETTINGS:
  • 1 day to 10 years (or never expire)
  • Export to S3 for long-term storage
  • Stream to Kinesis/Lambda for real-time processing

PRICING:
  • Ingestion: $0.50 per GB
  • Storage: $0.03 per GB/month
  • Queries (Logs Insights): $0.005 per GB scanned
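To see how the three prices above combine, here is a rough estimator. It is a sketch under stated assumptions: estimate_monthly_log_cost is a hypothetical helper, the prices are the ones listed in this module (they vary by Region), and storage is approximated as "retention days worth of uncompressed logs", which likely overestimates the stored volume.

```python
INGESTION_USD_PER_GB = 0.50       # prices as listed in this module
STORAGE_USD_PER_GB_MONTH = 0.03
QUERY_USD_PER_GB_SCANNED = 0.005


def estimate_monthly_log_cost(gb_per_day: float, retention_days: int,
                              gb_scanned_per_month: float = 0.0,
                              days_in_month: int = 30) -> dict:
    """Break a month of CloudWatch Logs spend into its three components."""
    ingested_gb = gb_per_day * days_in_month
    # Steady state: about retention_days worth of logs are stored at any time.
    stored_gb = gb_per_day * retention_days
    return {
        "ingestion": round(ingested_gb * INGESTION_USD_PER_GB, 2),
        "storage": round(stored_gb * STORAGE_USD_PER_GB_MONTH, 2),
        "queries": round(gb_scanned_per_month * QUERY_USD_PER_GB_SCANNED, 2),
    }


cost = estimate_monthly_log_cost(gb_per_day=100, retention_days=30,
                                 gb_scanned_per_month=2000)
print(cost)
```

Ingestion dominates in this example, so reducing what you log usually saves more than shortening retention.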
Structured Logging Best Practices
import json
from datetime import datetime, timezone


class StructuredLogger:
    """JSON structured logger for CloudWatch Logs.

    Writes one JSON object per line to stdout, which Lambda forwards to
    CloudWatch Logs.
    """

    def __init__(self, service_name: str):
        self.service_name = service_name

    def _format_log(self, level: str, message: str, **kwargs) -> str:
        log_entry = {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "service": self.service_name,
            "message": message,
            **kwargs
        }
        # default=str avoids a TypeError on values json cannot encode natively
        return json.dumps(log_entry, default=str)

    def info(self, message: str, **kwargs):
        print(self._format_log("INFO", message, **kwargs))

    def error(self, message: str, **kwargs):
        print(self._format_log("ERROR", message, **kwargs))

    def warn(self, message: str, **kwargs):
        print(self._format_log("WARN", message, **kwargs))

# Usage
logger = StructuredLogger("order-service")


def process_order(order_id: str, user_id: str):
    logger.info(
        "Processing order",
        order_id=order_id,
        user_id=user_id,
        action="process_order"
    )
    try:
        # Process order...
        logger.info(
            "Order completed",
            order_id=order_id,
            duration_ms=150,
            action="order_complete"
        )
    except Exception as e:
        logger.error(
            "Order failed",
            order_id=order_id,
            error=str(e),
            action="order_failed"
        )
        raise
CloudWatch Logs Insights
# Find all errors (set the time range to the last hour in the query window)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors from JSON logs by service
# (Logs Insights discovers top-level JSON fields automatically, so the
# level and service keys written by the structured logger can be used directly)
fields @timestamp, level, service, message
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate p99 latency from Lambda
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

# Find slow API requests (duration_ms is a discovered JSON field)
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

# Error rate over time
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(level = "ERROR") as errors
  by bin(5m)
| display errors / total * 100 as error_rate
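Queries like these can also be run from code. The following is a minimal sketch, assuming a boto3 CloudWatch Logs client is passed in; run_insights_query is a hypothetical helper name, and the polling limits are arbitrary defaults. It uses the start_query and get_query_results operations, which are asynchronous: you start a query, then poll until it reaches a terminal status.

```python
import time


def run_insights_query(logs_client, log_group: str, query: str,
                       lookback_seconds: int = 3600,
                       poll_interval: float = 1.0, max_polls: int = 60) -> list:
    """Start a Logs Insights query and poll until it finishes."""
    end = int(time.time())
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=end - lookback_seconds,  # epoch seconds
        endTime=end,
        queryString=query,
    )

    status, response = "Scheduled", {}
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=started["queryId"])
        status = response["status"]
        if status not in ("Scheduled", "Running"):
            break
        time.sleep(poll_interval)

    if status != "Complete":
        raise RuntimeError(f"Query did not complete (status: {status})")

    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response.get("results", [])]


# Usage (requires AWS credentials; the log group name is a placeholder):
# import boto3
# rows = run_insights_query(boto3.client("logs"), "/aws/lambda/my-function",
#                           "fields @timestamp, @message | limit 20")
```

Because you pay per GB scanned, keeping lookback_seconds small is the simplest way to keep programmatic queries cheap.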
X-Ray Distributed Tracing
Trace requests across microservices to identify bottlenecks and errors. While metrics tell you "p99 latency is 3 seconds" and logs tell you "this function threw an error," traces tell you "the request spent 2.5 of those 3 seconds waiting on a DynamoDB call that was throttled." Traces are the only tool that gives you a request-level view across service boundaries.
X-Ray Architecture
X-Ray Distributed Tracing

Request Flow with Trace:

  Client
    │  Trace ID: 1-abc123-def456789
    ▼
  API Gateway (Segment: 50ms)
    │
    ▼
  Lambda: Order Service (Segment: 200ms)
    ├── Subsegment: DynamoDB Query (30ms)
    ├── Subsegment: External API Call (100ms)
    └── Subsegment: SNS Publish (20ms)
    │
    ├──► Lambda: Inventory (Segment: 80ms)
    │      └── DynamoDB (50ms)
    └──► Lambda: Payment (Segment: 150ms)
           └── Stripe API (120ms)

Service Map (auto-generated):

  API Gateway ──► Order Service ──► Inventory Service ──► DynamoDB
                       │
                       └──► Payment Service
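The trace ID in the diagram is abbreviated. A real X-Ray trace ID has the form 1-<8 hex digits of epoch seconds>-<24 hex digits>, and it travels between services in the X-Amzn-Trace-Id HTTP header. The sketch below shows how that header breaks down; parse_trace_header is a hypothetical helper written for illustration, not part of the X-Ray SDK, and the header value is a sample.

```python
def parse_trace_header(header: str) -> dict:
    """Parse an X-Amzn-Trace-Id header value into its components."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value

    # Root has the form 1-<8 hex digits of epoch seconds>-<24 hex digits>
    version, epoch_hex, unique_id = fields["Root"].split("-")
    return {
        "trace_id": fields["Root"],
        "version": int(version),
        "start_epoch": int(epoch_hex, 16),   # when the trace started
        "unique_id": unique_id,
        "parent_id": fields.get("Parent"),   # segment that made the call
        "sampled": fields.get("Sampled") == "1",
    }


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
trace = parse_trace_header(header)
print(trace["trace_id"], trace["parent_id"], trace["sampled"])
```

Every service that forwards this header unchanged in its Root field contributes segments to the same trace, which is what lets X-Ray stitch the request flow and service map together.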
Instrumenting Lambda with X-Ray
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, etc.)
patch_all()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')


def validate_order(order_id: str) -> None:
    """Placeholder: replace with your own validation logic."""
    if not order_id:
        raise ValueError("order_id is required")


def call_payment_api(order: dict) -> dict:
    """Placeholder: replace with the real payment gateway call."""
    return {"status": "ok", "order_id": order.get("order_id")}


@xray_recorder.capture('process_order')
def process_order(order_id: str):
    """Process order with X-Ray tracing."""
    # Add annotation (indexed, searchable)
    xray_recorder.put_annotation('order_id', order_id)

    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'order_id': order_id,
        'timestamp': '2024-01-15T10:00:00Z'
    })

    # Subsegment for custom operation
    with xray_recorder.in_subsegment('validate_order') as subsegment:
        subsegment.put_annotation('validation_type', 'full')
        validate_order(order_id)

    # DynamoDB call is automatically traced
    result = table.get_item(Key={'order_id': order_id})
    item = result.get('Item')
    if item is None:
        # get_item omits 'Item' when the key does not exist
        raise KeyError(f"order {order_id} not found")

    # External API call with custom subsegment
    with xray_recorder.in_subsegment('external_api') as subsegment:
        subsegment.put_metadata('api', 'payment_gateway')
        response = call_payment_api(item)
    return response


def lambda_handler(event, context):
    order_id = event.get('order_id')
    return process_order(order_id)
CloudWatch Alarms
Automated alerts and actions based on metric thresholds.
Alarm Architecture
CloudWatch Alarm States

Alarm States:

  OK (green) ──► ALARM (red) ──► INSUFFICIENT_DATA

Alarm Components:

  METRIC           STATISTIC   PERIOD      THRESHOLD
  CPUUtilization   Average     5 minutes   > 80%

  EVALUATION PERIODS: 3
  (3 consecutive 5-min periods above 80% = ALARM)

Alarm Actions:

  • SNS Topic (email, SMS, Lambda)
  • Auto Scaling (scale up/down)
  • EC2 Actions (stop, terminate, reboot)
  • Systems Manager (run automation)
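The "3 consecutive periods above 80%" rule can be modeled in a few lines. This is a deliberately simplified sketch, not CloudWatch's actual evaluation logic: evaluate_alarm is a hypothetical function, it only covers the GreaterThanThreshold case where every period in the window must breach, and it treats missing datapoints (None) in the simplest possible way.

```python
def evaluate_alarm(datapoints: list, threshold: float, evaluation_periods: int) -> str:
    """Return the alarm state implied by the most recent datapoints.

    datapoints holds one aggregated value per period (oldest first);
    None stands for a period with no data.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods or all(p is None for p in window):
        return "INSUFFICIENT_DATA"
    if all(p is not None and p > threshold for p in window):
        return "ALARM"
    return "OK"


# Average CPUUtilization per 5-minute period
print(evaluate_alarm([70, 85, 90, 82], threshold=80, evaluation_periods=3))
print(evaluate_alarm([85, 90, 75], threshold=80, evaluation_periods=3))
print(evaluate_alarm([85, 90], threshold=80, evaluation_periods=3))
```

Requiring several consecutive breaching periods is what keeps a single short spike from paging anyone; the trade-off is that the alarm fires later (here, after 15 minutes).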
Essential Alarms (Terraform)
# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300 # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  # Alarm actions fire when state transitions to ALARM.
  # Triggering both notification AND auto-scaling is a production best practice:
  # auto-scaling handles the immediate capacity need while the team investigates.
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]

  # OK actions fire when alarm returns to normal -- useful for "all clear"
  # notifications so on-call engineers know the issue resolved automatically.
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5 # 5% error rate
  alarm_description   = "Lambda error rate exceeds 5%"

  metric_query {
    id          = "error_rate"
    expression  = "(errors / invocations) * 100"
    label       = "Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }

  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DynamoDB Throttling Alarm
# Why threshold 0? Because ANY throttling means requests are being rejected.
# DynamoDB throttling is easy to miss -- your application gets a 400 error
# (which the SDK retries), but nobody is paged unless you set this up. Even
# one throttled request can cascade into retries that cause more throttling.
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttle" {
  alarm_name          = "dynamodb-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    TableName = "my-table"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name    = "critical-system-alarm"
  alarm_rule    = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
CloudTrail (Audit Logging)
Track all API calls for security and compliance.

CloudTrail Architecture

Every AWS API Call:

  User/Role ──► AWS API ──► CloudTrail ──► S3 Bucket
                                 ├──► CloudWatch Logs
                                 └──► EventBridge (real-time)

Event Types:
  • Management Events: Control plane (CreateBucket, RunInstances)
  • Data Events: Data plane (S3 GetObject, Lambda Invoke)
  • Insights Events: Unusual API activity detection

Sample Event:
  {
    "eventTime": "2024-01-15T10:30:00Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "DeleteBucket",
    "userIdentity": {
      "type": "IAMUser",
      "userName": "admin",
      "arn": "arn:aws:iam::123456789012:user/admin"
    },
    "sourceIPAddress": "203.0.113.50",
    "requestParameters": { "bucketName": "my-bucket" }
  }
CloudTrail Best Practices
# CloudTrail configuration checklist
cloudtrail_config = {
    "multi_region": True,          # Trail in all regions
    "log_file_validation": True,   # Detect tampering
    "s3_encryption": "SSE-KMS",    # Encrypt logs
    "cloudwatch_logs": True,       # Real-time analysis
    "data_events": {
        "s3": ["arn:aws:s3:::sensitive-bucket/*"],
        "lambda": True
    },
    "insights": True,              # Anomaly detection
    "organization_trail": True     # Multi-account
}

# Real-time alerting with EventBridge
# Detect root user login
eventbridge_rule = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {
        "userIdentity": {
            "type": ["Root"]
        }
    }
}
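To build intuition for how an event pattern like the root-login rule selects events, here is a simplified matcher. It is only a sketch of the exact-value case: matches is a hypothetical function, and real EventBridge patterns also support prefix, numeric, anything-but, and other operators that this ignores. The sample events are made up for illustration.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style matching (exact values only).

    - A list in the pattern matches if the event value equals any element.
    - A dict in the pattern recurses into the same key of the event.
    - Keys absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


root_login_pattern = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}

root_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"userIdentity": {"type": "Root"}, "sourceIPAddress": "203.0.113.50"},
}
iam_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"userIdentity": {"type": "IAMUser"}},
}

print(matches(root_login_pattern, root_event))
print(matches(root_login_pattern, iam_event))
```

Note that extra fields in the event (such as sourceIPAddress) do not prevent a match; a pattern only constrains the fields it names.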
🎯 Interview Questions
Q1: How would you debug a slow API response?
Systematic approach:
- X-Ray Trace: Find the specific slow request
  - Identify which service/subsegment is slow
  - Check annotations for context
- CloudWatch Metrics: Check historical patterns
  - Is this a spike or gradual increase?
  - Correlate with CPU, memory, connections
- CloudWatch Logs: Find related errors
    fields @timestamp, @message
    | filter trace_id = "1-abc123..."
    | sort @timestamp
- Service-specific checks:
  - Lambda: Cold starts? Memory sufficient?
  - DynamoDB: Throttling? Hot partition?
  - RDS: Connection pool exhausted?
Q2: What metrics should you monitor for a web application?
Essential metrics by layer:
Load Balancer:
- RequestCount, TargetResponseTime
- HTTPCode_Target_5XX, HTTPCode_ELB_5XX
- HealthyHostCount, UnhealthyHostCount
Compute:
- CPUUtilization, MemoryUtilization (custom)
- Lambda: Duration, Errors, Throttles
Database:
- CPUUtilization, FreeableMemory
- DatabaseConnections, ReadIOPS, WriteIOPS
- DynamoDB: ThrottledRequests
Application:
- Error rate, Latency (p50, p95, p99)
- Requests per second
- Business metrics (orders, signups)
Q3: How do you reduce CloudWatch Logs costs?
Cost optimization strategies:
- Reduce ingestion:
  - Filter logs at source (log level INFO not DEBUG)
  - Use sampling for high-volume logs
- Optimize retention:
  - Set appropriate retention (7-30 days for most)
  - Export to S3 for long-term (cheaper)
- Use Logs Insights efficiently:
  - Narrow time ranges
  - Use specific log groups
  - Cache common queries
- Consider alternatives:
  - Kinesis Firehose → S3 for high volume
  - OpenSearch for complex analysis
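"Use sampling for high-volume logs" can be as simple as a gate in front of your logger. The sketch below is one possible approach, not a prescribed one: should_log is a hypothetical helper, the 10% default rate is arbitrary, and the rng parameter exists only so the behavior can be tested deterministically.

```python
import random


def should_log(level: str, sample_rate: float = 0.1, rng=random.random) -> bool:
    """Always keep WARN/ERROR; keep only a fraction of lower-severity logs."""
    if level in ("WARN", "ERROR"):
        return True
    return rng() < sample_rate


# Roughly 10% of INFO lines survive, while every ERROR line is kept
kept_info = sum(should_log("INFO", sample_rate=0.1) for _ in range(10_000))
kept_error = sum(should_log("ERROR", sample_rate=0.1) for _ in range(10_000))
print(f"INFO kept: {kept_info} of 10000, ERROR kept: {kept_error} of 10000")
```

Since ingestion is billed per GB, dropping 90% of routine INFO lines cuts that part of the bill by about the same proportion while preserving every line you are likely to need during an incident.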
Q4: How do you set up alerting for a production system?
Alert hierarchy:
- Critical (PagerDuty/immediate):
  - Service down (health check failures)
  - Error rate > 5%
  - Latency p99 > 5s
  - Security events (root login)
- Warning (Slack/email):
  - Error rate > 1%
  - CPU > 80% sustained
  - Disk > 85%
  - Approaching quotas
- Informational (dashboard):
  - Deployment events
  - Scaling events
  - Cost anomalies
Best practices:
- Avoid alert fatigue (tune thresholds)
- Use composite alarms
- Include runbook links in alerts
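The alert hierarchy above maps naturally onto a small routing function. This is an illustrative sketch only: classify_alert and ROUTES are hypothetical names, the thresholds are the ones listed in the answer, and the destination strings stand in for whatever paging and chat integrations you actually use.

```python
# Severity -> destination, following the hierarchy in the answer above
ROUTES = {
    "critical": "pagerduty",
    "warning": "slack",
    "info": "dashboard",
}


def classify_alert(error_rate_pct: float = 0.0, p99_latency_s: float = 0.0,
                   cpu_pct: float = 0.0, disk_pct: float = 0.0) -> str:
    """Map current signal values to a severity tier."""
    if error_rate_pct > 5 or p99_latency_s > 5:
        return "critical"
    if error_rate_pct > 1 or cpu_pct > 80 or disk_pct > 85:
        return "warning"
    return "info"


for signal in ({"error_rate_pct": 7.5}, {"cpu_pct": 91.0}, {"error_rate_pct": 0.2}):
    severity = classify_alert(**signal)
    print(signal, "->", severity, "->", ROUTES[severity])
```

Checking the critical conditions first means a signal that crosses both thresholds pages once at the higher severity instead of also producing a warning.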
Q5: CloudWatch vs third-party observability tools?
CloudWatch advantages:
- Native integration, no agents for AWS services
- Lower cost for basic use cases
- No data egress charges
Third-party advantages:
- Better visualization and correlation
- APM with code-level insights
- Multi-cloud support
- More powerful querying
Recommended approach:
- Use CloudWatch for AWS metrics/logs
- Stream to third-party for analysis
- Keep costs balanced
🧪 Hands-On Lab: Build Observability Dashboard
Next Module
CDN & Edge Services
Master CloudFront, Global Accelerator, and edge computing