AWS Observability Stack

Module Overview

Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Core Concepts, Compute
Observability is critical for running production workloads. This module covers the AWS monitoring stack for metrics, logs, traces, and alerts.

What You’ll Learn:
  • CloudWatch metrics, logs, and alarms
  • X-Ray distributed tracing
  • CloudTrail for audit logging
  • EventBridge for event-driven automation
  • Building observability dashboards
  • Alerting and incident response

Observability Pillars

┌────────────────────────────────────────────────────────────────────────┐
│                    Three Pillars of Observability                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│   │     METRICS     │  │      LOGS       │  │     TRACES      │        │
│   │   (CloudWatch)  │  │  (CloudWatch    │  │    (X-Ray)      │        │
│   │                 │  │     Logs)       │  │                 │        │
│   └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│            │                    │                    │                  │
│   What happened?       Why did it         Where did it                 │
│   (CPU 85%)            happen?            happen?                      │
│                        (error logs)       (service A→B→C)              │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                    AWS Observability Stack                       │  │
│   │                                                                  │  │
│   │   CloudWatch     CloudWatch     X-Ray        CloudTrail         │  │
│   │   Metrics        Logs           Traces       Audit Logs         │  │
│   │      │              │              │              │              │  │
│   │      └──────────────┴──────────────┴──────────────┘              │  │
│   │                            │                                     │  │
│   │                    CloudWatch Dashboards                         │  │
│   │                    CloudWatch Alarms                             │  │
│   │                    EventBridge Automation                        │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudWatch Metrics

Collect and track metrics from AWS services and custom applications.

Built-in vs Custom Metrics

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Metrics Types                             │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   BUILT-IN METRICS (Free - Basic Monitoring)                           │
│   ─────────────────────────────────────────                            │
│   EC2:        CPUUtilization, NetworkIn/Out, DiskRead/Write            │
│   RDS:        CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS      │
│   Lambda:     Invocations, Duration, Errors, Throttles, ConcurrentExec │
│   ALB:        RequestCount, TargetResponseTime, HTTPCode_Target_2XX    │
│   DynamoDB:   ConsumedRead/WriteCapacityUnits, ThrottledRequests       │
│   S3:         BucketSizeBytes, NumberOfObjects                         │
│                                                                         │
│   Resolution: 5 minutes (basic), 1 minute (detailed - extra cost)      │
│                                                                         │
│   CUSTOM METRICS (You publish)                                         │
│   ────────────────────────────                                         │
│   • Application-specific metrics                                       │
│   • Business KPIs (orders/min, signups/hour)                          │
│   • Memory utilization (not built-in for EC2!)                        │
│   • Queue depth, cache hit ratio                                       │
│                                                                         │
│   Resolution: 1 minute (standard) or 1 second (high-resolution)        │
│   Cost: $0.30 per metric per month                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Publishing Custom Metrics

import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def publish_custom_metric(namespace: str, metric_name: str, 
                          value: float, unit: str = 'Count',
                          dimensions: list = None):
    """Publish custom metric to CloudWatch."""
    
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
        'Timestamp': datetime.utcnow(),
    }
    
    if dimensions:
        metric_data['Dimensions'] = dimensions
    
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_data]
    )

# Example: Track orders per minute
publish_custom_metric(
    namespace='MyApp/Ecommerce',
    metric_name='OrdersPlaced',
    value=42,
    unit='Count',
    dimensions=[
        {'Name': 'Environment', 'Value': 'Production'},
        {'Name': 'Region', 'Value': 'us-east-1'}
    ]
)

# Example: Track memory utilization (not built-in!)
import psutil

publish_custom_metric(
    namespace='MyApp/System',
    metric_name='MemoryUtilization',
    value=psutil.virtual_memory().percent,
    unit='Percent',
    dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ]
)

# High-resolution metrics (1-second granularity)
cloudwatch.put_metric_data(
    Namespace='MyApp/HighFrequency',
    MetricData=[{
        'MetricName': 'TransactionsPerSecond',
        'Value': 1500,
        'Unit': 'Count/Second',
        'StorageResolution': 1  # 1 = high-res, 60 = standard
    }]
)
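Reading metrics back goes through `get_metric_data`. A minimal sketch, reusing the `MyApp/Ecommerce` namespace and `OrdersPlaced` metric from the examples above (the helper names `build_metric_query` and `last_hour` are ours):

```python
from datetime import datetime, timedelta

def build_metric_query(namespace, metric_name, stat="Sum", period=300,
                       dimensions=None):
    """Build one MetricDataQueries entry for get_metric_data."""
    metric = {"Namespace": namespace, "MetricName": metric_name}
    if dimensions:
        metric["Dimensions"] = dimensions
    return {
        "Id": "m1",
        "MetricStat": {"Metric": metric, "Period": period, "Stat": stat},
        "ReturnData": True,
    }

def last_hour(namespace, metric_name, **kwargs):
    """Fetch the last hour of datapoints for a single metric."""
    import boto3  # imported here so build_metric_query stays usable offline
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.utcnow()
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=[build_metric_query(namespace, metric_name, **kwargs)],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
    )
    return resp["MetricDataResults"][0]

# e.g. last_hour("MyApp/Ecommerce", "OrdersPlaced",
#                dimensions=[{"Name": "Environment", "Value": "Production"}])
```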

CloudWatch Embedded Metric Format (EMF)

import json

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """
    Embedded Metric Format - publish metrics via logs.
    Automatically extracted by CloudWatch.
    """
    emf_log = {
        "_aws": {
            "Timestamp": int(datetime.utcnow().timestamp() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{
                    "Name": metric_name,
                    "Unit": "Count"
                }]
            }]
        },
        metric_name: value,
        **dimensions
    }
    
    # Print to stdout - CloudWatch Logs extracts the metric
    print(json.dumps(emf_log))

# Usage in Lambda
emit_emf_metric(
    metric_name="OrderValue",
    value=99.99,
    dimensions={"Service": "Checkout", "Environment": "prod"}
)

CloudWatch Logs

Centralized log management for all AWS services and applications.

Log Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs Structure                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   LOG GROUP: /aws/lambda/my-function                                   │
│   ───────────────────────────────────                                  │
│       │                                                                 │
│       ├── LOG STREAM: 2024/01/15/[$LATEST]abc123                       │
│       │       │                                                         │
│       │       ├── Log Event: {"level": "INFO", "msg": "Started"}       │
│       │       ├── Log Event: {"level": "ERROR", "msg": "Failed"}       │
│       │       └── Log Event: {"level": "INFO", "msg": "Completed"}     │
│       │                                                                 │
│       ├── LOG STREAM: 2024/01/15/[$LATEST]def456                       │
│       │       └── ...                                                   │
│       │                                                                 │
│       └── LOG STREAM: 2024/01/16/[$LATEST]ghi789                       │
│               └── ...                                                   │
│                                                                         │
│   RETENTION SETTINGS:                                                  │
│   • 1 day to 10 years (or never expire)                                │
│   • Export to S3 for long-term storage                                 │
│   • Stream to Kinesis/Lambda for real-time processing                  │
│                                                                         │
│   PRICING:                                                              │
│   • Ingestion: $0.50 per GB                                            │
│   • Storage: $0.03 per GB/month                                        │
│   • Queries (Logs Insights): $0.005 per GB scanned                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
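Retention is set per log group with `put_retention_policy`, which only accepts a fixed list of day values. A sketch that rounds a requested retention up to an accepted value (the helper names are ours; the value set reflects the API's documented list at the time of writing):

```python
# Retention periods (days) accepted by PutRetentionPolicy
VALID_RETENTION_DAYS = {1, 3, 5, 7, 14, 30, 60, 90, 120, 150,
                        180, 365, 400, 545, 731, 1827, 3653}

def nearest_valid_retention(days):
    """Round up to the closest retention period CloudWatch accepts."""
    eligible = [d for d in sorted(VALID_RETENTION_DAYS) if d >= days]
    return eligible[0] if eligible else 3653

def set_log_retention(log_group, days):
    """Apply a retention policy to a log group."""
    import boto3  # imported here so the helpers above stay usable offline
    logs = boto3.client("logs")
    logs.put_retention_policy(
        logGroupName=log_group,
        retentionInDays=nearest_valid_retention(days),
    )

# e.g. set_log_retention("/aws/lambda/my-function", 30)
```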

Structured Logging Best Practices

import json
import logging
from datetime import datetime

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
    
    def _format_log(self, level: str, message: str, **kwargs) -> str:
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": level,
            "service": self.service_name,
            "message": message,
            **kwargs
        }
        return json.dumps(log_entry)
    
    def info(self, message: str, **kwargs):
        print(self._format_log("INFO", message, **kwargs))
    
    def error(self, message: str, **kwargs):
        print(self._format_log("ERROR", message, **kwargs))
    
    def warn(self, message: str, **kwargs):
        print(self._format_log("WARN", message, **kwargs))

# Usage
logger = StructuredLogger("order-service")

def process_order(order_id: str, user_id: str):
    logger.info(
        "Processing order",
        order_id=order_id,
        user_id=user_id,
        action="process_order"
    )
    
    try:
        # Process order...
        logger.info(
            "Order completed",
            order_id=order_id,
            duration_ms=150,
            action="order_complete"
        )
    except Exception as e:
        logger.error(
            "Order failed",
            order_id=order_id,
            error=str(e),
            action="order_failed"
        )
        raise
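The hands-on lab at the end of this module calls for correlation IDs. One way to stamp every entry from a logger like `StructuredLogger` above is a `contextvars`-based request context (all names in this sketch are ours):

```python
import contextvars
import json
import uuid

# Correlation ID shared by all log calls within one request context
_correlation_id = contextvars.ContextVar("correlation_id", default=None)

def set_correlation_id(value=None):
    """Start a new request context; generate an ID if none is supplied."""
    cid = value or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid

def with_correlation(entry):
    """Attach the current correlation ID to a log entry, if one is set."""
    cid = _correlation_id.get()
    if cid:
        entry = {**entry, "correlation_id": cid}
    return entry

# Example: stamp an entry produced by a JSON logger
set_correlation_id("req-42")
print(json.dumps(with_correlation({"level": "INFO", "message": "Processing order"})))
# → {"level": "INFO", "message": "Processing order", "correlation_id": "req-42"}
```

With the ID in every log line, Logs Insights can filter on `correlation_id` to pull all events for one request.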

CloudWatch Logs Insights

-- Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

-- Parse JSON logs and aggregate
fields @timestamp, @message
| parse @message '{"level":"*","service":"*","message":"*"' as level, service, msg
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

-- Calculate p99 latency from Lambda
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

-- Find slow API requests
fields @timestamp, @message
| parse @message '"duration_ms":*,' as duration
| filter duration > 1000
| sort duration desc
| limit 50

-- Error rate over time
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(strcontains(level, "ERROR")) as errors,
        sum(strcontains(level, "ERROR")) / count(*) * 100 as error_rate
  by bin(5m)
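These queries can also be run programmatically with `start_query` / `get_query_results`, which return rows as lists of field/value pairs. A minimal polling sketch (`rows_to_dicts` and `run_insights_query` are our helper names):

```python
import time

def rows_to_dicts(results):
    """Flatten Insights result rows ([{field, value}, ...]) into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]

def run_insights_query(log_group, query, minutes=60):
    """Start a Logs Insights query and poll until it finishes."""
    import boto3  # imported here so rows_to_dicts stays usable offline
    logs = boto3.client("logs")
    end = int(time.time())
    resp = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,  # epoch seconds
        endTime=end,
        queryString=query,
    )
    query_id = resp["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return rows_to_dicts(result["results"])
        time.sleep(1)

# e.g. run_insights_query("/aws/lambda/my-function",
#                         'fields @timestamp, @message | filter @message like /ERROR/')
```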

X-Ray Distributed Tracing

Trace requests across microservices to identify bottlenecks and errors.

X-Ray Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Distributed Tracing                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Request Flow with Trace:                                             │
│                                                                         │
│   Client                                                                │
│     │                                                                   │
│     │  Trace ID: 1-abc123-def456789                                    │
│     ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  API Gateway                                                     │  │
│   │  Segment: 50ms                                                   │  │
│   └───────────────────────────┬─────────────────────────────────────┘  │
│                               │                                         │
│                               ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Lambda: Order Service                                           │  │
│   │  Segment: 200ms                                                  │  │
│   │  ├── Subsegment: DynamoDB Query (30ms)                          │  │
│   │  ├── Subsegment: External API Call (100ms)                      │  │
│   │  └── Subsegment: SNS Publish (20ms)                             │  │
│   └───────────────────────────┬─────────────────────────────────────┘  │
│                               │                                         │
│              ┌────────────────┴────────────────┐                       │
│              ▼                                 ▼                       │
│   ┌──────────────────────┐        ┌──────────────────────┐            │
│   │  Lambda: Inventory   │        │  Lambda: Payment      │            │
│   │  Segment: 80ms       │        │  Segment: 150ms       │            │
│   │  └── DynamoDB (50ms) │        │  └── Stripe API (120ms)│           │
│   └──────────────────────┘        └──────────────────────┘            │
│                                                                         │
│   Service Map (auto-generated):                                        │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐        │
│   │   API    │───►│  Order   │───►│ Inventory│───►│ DynamoDB │        │
│   │ Gateway  │    │ Service  │    │ Service  │    │          │        │
│   └──────────┘    └────┬─────┘    └──────────┘    └──────────┘        │
│                        │                                                │
│                        └──────────►┌──────────┐                        │
│                                    │ Payment  │                        │
│                                    │ Service  │                        │
│                                    └──────────┘                        │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Instrumenting Lambda with X-Ray

import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, etc.)
patch_all()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')

@xray_recorder.capture('process_order')
def process_order(order_id: str):
    """Process order with X-Ray tracing."""
    
    # Add annotation (indexed, searchable)
    xray_recorder.put_annotation('order_id', order_id)
    
    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'order_id': order_id,
        'timestamp': '2024-01-15T10:00:00Z'
    })
    
    # Subsegment for custom operation
    with xray_recorder.in_subsegment('validate_order') as subsegment:
        subsegment.put_annotation('validation_type', 'full')
        validate_order(order_id)
    
    # DynamoDB call is automatically traced
    result = table.get_item(Key={'order_id': order_id})
    
    # External API call with custom subsegment
    with xray_recorder.in_subsegment('external_api') as subsegment:
        subsegment.put_metadata('api', 'payment_gateway')
        response = call_payment_api(result['Item'])
    
    return response

def lambda_handler(event, context):
    order_id = event.get('order_id')
    return process_order(order_id)
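Because annotations are indexed, you can retrieve matching traces with `GetTraceSummaries` and a filter expression. A sketch (the helper names are ours):

```python
def annotation_filter(key, value):
    """Build an X-Ray filter expression matching a string annotation."""
    return f'annotation.{key} = "{value}"'

def traces_for_annotation(key, value, minutes=30):
    """Fetch trace summaries whose annotations match key = value."""
    import boto3  # imported here so annotation_filter stays usable offline
    from datetime import datetime, timedelta
    xray = boto3.client("xray")
    now = datetime.utcnow()
    summaries = []
    paginator = xray.get_paginator("get_trace_summaries")
    for page in paginator.paginate(
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        FilterExpression=annotation_filter(key, value),
    ):
        summaries.extend(page["TraceSummaries"])
    return summaries

# e.g. traces_for_annotation("order_id", "o-1001") finds traces tagged via
# xray_recorder.put_annotation("order_id", "o-1001")
```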

CloudWatch Alarms

Automated alerts and actions based on metric thresholds.

Alarm Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Alarm States                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Alarm States:                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │    ┌──────────┐        ┌──────────┐        ┌──────────┐        │  │
│   │    │    OK    │◄──────►│ ALARM    │◄──────►│INSUFFICIENT│       │  │
│   │    │  (green) │        │  (red)   │        │   DATA     │       │  │
│   │    └──────────┘        └──────────┘        └──────────┘        │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Alarm Components:                                                     │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   METRIC           STATISTIC       PERIOD      THRESHOLD        │  │
│   │   CPUUtilization   Average         5 minutes   > 80%            │  │
│   │                                                                  │  │
│   │   EVALUATION PERIODS: 3                                         │  │
│   │   (3 consecutive 5-min periods above 80% = ALARM)               │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Alarm Actions:                                                        │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   • SNS Topic (email, SMS, Lambda)                              │  │
│   │   • Auto Scaling (scale up/down)                                │  │
│   │   • EC2 Actions (stop, terminate, reboot)                       │  │
│   │   • Systems Manager (run automation)                            │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Essential Alarms (Terraform)

# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300  # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  alarm_description   = "Lambda error rate exceeds 5%"
  
  metric_query {
    id          = "error_rate"
    expression  = "(errors / invocations) * 100"
    label       = "Error Rate"
    return_data = true
  }
  
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }
  
  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DynamoDB Throttling Alarm
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttle" {
  alarm_name          = "dynamodb-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  
  dimensions = {
    TableName = "my-table"
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "critical-system-alarm"
  
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
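Once alarms exist, it helps to see which are firing and to verify their actions end to end: `describe_alarms` filters by state, and `set_alarm_state` forces a transition so you can test notification wiring. A sketch (the helper names are ours):

```python
def alarm_names_in_state(alarms, state):
    """Pick alarm names currently in the given state."""
    return [a["AlarmName"] for a in alarms if a["StateValue"] == state]

def firing_alarms():
    """List metric alarms currently in ALARM."""
    import boto3  # imported here so alarm_names_in_state stays usable offline
    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.describe_alarms(StateValue="ALARM")
    return alarm_names_in_state(resp["MetricAlarms"], "ALARM")

def exercise_alarm(alarm_name):
    """Force an alarm into ALARM so its actions fire; CloudWatch
    re-evaluates it back to its true state on the next period."""
    import boto3
    boto3.client("cloudwatch").set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason="Manual test of notification wiring",
    )

# e.g. exercise_alarm("high-cpu-utilization") should produce an SNS alert
```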

CloudTrail (Audit Logging)

Track all API calls for security and compliance.

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudTrail Architecture                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Every AWS API Call:                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   User/Role ───► AWS API ───► CloudTrail ───► S3 Bucket         │  │
│   │                                    │                             │  │
│   │                                    └──► CloudWatch Logs          │  │
│   │                                    └──► EventBridge (real-time)  │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Event Types:                                                          │
│   ─────────────                                                         │
│   • Management Events: Control plane (CreateBucket, RunInstances)      │
│   • Data Events: Data plane (S3 GetObject, Lambda Invoke)              │
│   • Insights Events: Unusual API activity detection                    │
│                                                                         │
│   Sample Event:                                                         │
│   {                                                                     │
│     "eventTime": "2024-01-15T10:30:00Z",                               │
│     "eventSource": "s3.amazonaws.com",                                 │
│     "eventName": "DeleteBucket",                                       │
│     "userIdentity": {                                                   │
│       "type": "IAMUser",                                               │
│       "userName": "admin",                                             │
│       "arn": "arn:aws:iam::123456789012:user/admin"                    │
│     },                                                                  │
│     "sourceIPAddress": "203.0.113.50",                                 │
│     "requestParameters": { "bucketName": "my-bucket" }                 │
│   }                                                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudTrail Best Practices

# CloudTrail configuration checklist
cloudtrail_config = {
    "multi_region": True,           # Trail in all regions
    "log_file_validation": True,    # Detect tampering
    "s3_encryption": "SSE-KMS",     # Encrypt logs
    "cloudwatch_logs": True,        # Real-time analysis
    "data_events": {
        "s3": ["arn:aws:s3:::sensitive-bucket/*"],
        "lambda": True
    },
    "insights": True,               # Anomaly detection
    "organization_trail": True      # Multi-account
}

# Real-time alerting with EventBridge
# Detect root user login
eventbridge_rule = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {
        "userIdentity": {
            "type": ["Root"]
        }
    }
}
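Events from the last 90 days of the CloudTrail event history can also be queried ad hoc with `lookup_events`. A sketch using the `DeleteBucket` event from the sample above (the helper names are ours):

```python
def lookup_attribute(key, value):
    """Build one LookupAttributes entry for lookup_events."""
    return {"AttributeKey": key, "AttributeValue": value}

def recent_events(event_name, hours=24):
    """Fetch recent management events by name from CloudTrail event history."""
    import boto3  # imported here so lookup_attribute stays usable offline
    from datetime import datetime, timedelta
    cloudtrail = boto3.client("cloudtrail")
    now = datetime.utcnow()
    resp = cloudtrail.lookup_events(
        LookupAttributes=[lookup_attribute("EventName", event_name)],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
    )
    return resp["Events"]

# e.g. recent_events("DeleteBucket") surfaces deletions like the sample event above
```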

🎯 Interview Questions

Q: An API request is slow. How do you find the bottleneck?

Systematic approach:
  1. X-Ray Trace: Find the specific slow request
    • Identify which service/subsegment is slow
    • Check annotations for context
  2. CloudWatch Metrics: Check historical patterns
    • Is this a spike or gradual increase?
    • Correlate with CPU, memory, connections
  3. CloudWatch Logs: Find related errors
    fields @timestamp, @message
    | filter trace_id = "1-abc123..."
    | sort @timestamp
    
  4. Service-specific checks:
    • Lambda: Cold starts? Memory sufficient?
    • DynamoDB: Throttling? Hot partition?
    • RDS: Connection pool exhausted?

Q: Which metrics would you monitor for a typical web application?

Essential metrics by layer:

Load Balancer:
  • RequestCount, TargetResponseTime
  • HTTPCode_Target_5XX, HTTPCode_ELB_5XX
  • HealthyHostCount, UnhealthyHostCount
Compute (EC2/Lambda):
  • CPUUtilization, MemoryUtilization (custom)
  • Lambda: Duration, Errors, Throttles
Database:
  • CPUUtilization, FreeableMemory
  • DatabaseConnections, ReadIOPS, WriteIOPS
  • DynamoDB: ThrottledRequests
Application:
  • Error rate, Latency (p50, p95, p99)
  • Requests per second
  • Business metrics (orders, signups)

Q: CloudWatch costs are growing. How do you reduce them?

Cost optimization strategies:
  1. Reduce ingestion:
    • Filter logs at source (log level INFO not DEBUG)
    • Use sampling for high-volume logs
  2. Optimize retention:
    • Set appropriate retention (7-30 days for most)
    • Export to S3 for long-term (cheaper)
  3. Use Logs Insights efficiently:
    • Narrow time ranges
    • Use specific log groups
    • Cache common queries
  4. Consider alternatives:
    • Kinesis Firehose → S3 for high volume
    • OpenSearch for complex analysis

Q: How do you design alerting without causing alert fatigue?

Alert hierarchy:
  1. Critical (PagerDuty/immediate):
    • Service down (health check failures)
    • Error rate > 5%
    • Latency p99 > 5s
    • Security events (root login)
  2. Warning (Slack/email):
    • Error rate > 1%
    • CPU > 80% sustained
    • Disk > 85%
    • Approaching quotas
  3. Informational (dashboard):
    • Deployment events
    • Scaling events
    • Cost anomalies
Best practices:
  • Avoid alert fatigue (tune thresholds)
  • Use composite alarms
  • Include runbook links in alerts

Q: When would you choose CloudWatch over a third-party tool like Datadog?

CloudWatch advantages:
  • Native integration, no agents for AWS services
  • Lower cost for basic use cases
  • No data egress charges
Third-party advantages (Datadog, New Relic):
  • Better visualization and correlation
  • APM with code-level insights
  • Multi-cloud support
  • More powerful querying
Hybrid approach:
  • Use CloudWatch for AWS metrics/logs
  • Stream to third-party for analysis
  • Keep costs balanced

🧪 Hands-On Lab: Build Observability Dashboard

1. Enable X-Ray on Lambda: add the X-Ray SDK and enable active tracing.
2. Create a CloudWatch Dashboard: add widgets for key metrics (CPU, errors, latency).
3. Set Up Structured Logging: implement JSON logging with correlation IDs.
4. Create Alarms: set up CPU, error rate, and latency alarms.
5. Configure CloudTrail: enable a multi-region trail with CloudWatch Logs.

Next Module

CDN & Edge Services

Master CloudFront, Global Accelerator, and edge computing