AWS Observability Stack

Module Overview

Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Core Concepts, Compute

Observability is critical for running production workloads. Without it, debugging production issues is like diagnosing a patient over the phone with no test results: you are guessing. Metrics tell you WHAT is happening (fever), logs tell you WHY (infection details), and traces tell you WHERE (which organ). This module covers the complete AWS monitoring stack for logs, metrics, traces, and alerts.

What You’ll Learn:
  • CloudWatch metrics, logs, and alarms
  • X-Ray distributed tracing
  • CloudTrail for audit logging
  • EventBridge for event-driven automation
  • Building observability dashboards
  • Alerting and incident response

Observability Pillars

┌────────────────────────────────────────────────────────────────────────┐
│                    Three Pillars of Observability                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│   │     METRICS     │  │      LOGS       │  │     TRACES      │        │
│   │   (CloudWatch)  │  │  (CloudWatch    │  │    (X-Ray)      │        │
│   │                 │  │     Logs)       │  │                 │        │
│   └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│            │                    │                    │                  │
│   What happened?       Why did it         Where did it                 │
│   (CPU 85%)            happen?            happen?                      │
│                        (error logs)       (service A→B→C)              │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                    AWS Observability Stack                       │  │
│   │                                                                  │  │
│   │   CloudWatch     CloudWatch     X-Ray        CloudTrail         │  │
│   │   Metrics        Logs           Traces       Audit Logs         │  │
│   │      │              │              │              │              │  │
│   │      └──────────────┴──────────────┴──────────────┘              │  │
│   │                            │                                     │  │
│   │                    CloudWatch Dashboards                         │  │
│   │                    CloudWatch Alarms                             │  │
│   │                    EventBridge Automation                        │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudWatch Metrics

Collect and track metrics from AWS services and custom applications.

Built-in vs Custom Metrics

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Metrics Types                             │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   BUILT-IN METRICS (Free - Basic Monitoring)                           │
│   ─────────────────────────────────────────                            │
│   EC2:        CPUUtilization, NetworkIn/Out, DiskRead/Write            │
│   RDS:        CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS      │
│   Lambda:     Invocations, Duration, Errors, Throttles, ConcurrentExec │
│   ALB:        RequestCount, TargetResponseTime, HTTPCode_Target_2XX    │
│   DynamoDB:   ConsumedRCU, ConsumedWCU, ThrottledRequests              │
│   S3:         BucketSizeBytes, NumberOfObjects                         │
│                                                                         │
│   Resolution: 5 minutes (basic), 1 minute (detailed - extra cost)      │
│                                                                         │
│   CUSTOM METRICS (You publish)                                         │
│   ────────────────────────────                                         │
│   • Application-specific metrics                                       │
│   • Business KPIs (orders/min, signups/hour)                          │
│   • Memory utilization (not built-in for EC2!)                        │
│   • Queue depth, cache hit ratio                                       │
│                                                                         │
│   Resolution: 1 second to 1 minute (high-resolution)                   │
│   Cost: $0.30 per metric per month                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Publishing Custom Metrics

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

def publish_custom_metric(namespace: str, metric_name: str,
                          value: float, unit: str = 'Count',
                          dimensions: list = None):
    """
    Publish custom metric to CloudWatch.

    Cost tip: Each unique combination of namespace + metric name + dimensions
    counts as one custom metric ($0.30/month). If you add a "RequestId"
    dimension, every single request creates a new metric -- that is thousands
    of dollars/month. Use low-cardinality dimensions like Environment, Region,
    or Service. Never use user IDs, request IDs, or timestamps as dimensions.
    """

    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        'Timestamp': datetime.now(timezone.utc),
    }
    
    if dimensions:
        metric_data['Dimensions'] = dimensions
    
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_data]
    )

# Example: Track orders per minute
publish_custom_metric(
    namespace='MyApp/Ecommerce',
    metric_name='OrdersPlaced',
    value=42,
    unit='Count',
    dimensions=[
        {'Name': 'Environment', 'Value': 'Production'},
        {'Name': 'Region', 'Value': 'us-east-1'}
    ]
)

# Example: Track memory utilization (not built-in!)
import psutil

publish_custom_metric(
    namespace='MyApp/System',
    metric_name='MemoryUtilization',
    value=psutil.virtual_memory().percent,
    unit='Percent',
    dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ]
)

# High-resolution metrics (1-second granularity)
# Cost warning: High-resolution metrics cost the same per metric ($0.30/month),
# but publishing every second means far more PutMetricData API calls, and alarms
# that evaluate high-resolution data are billed at a higher rate than standard alarms.
# Only use StorageResolution=1 for metrics where second-level precision matters
# (e.g., real-time trading systems). For most applications, 60-second standard
# resolution is sufficient and cheaper to publish and alarm on.
cloudwatch.put_metric_data(
    Namespace='MyApp/HighFrequency',
    MetricData=[{
        'MetricName': 'TransactionsPerSecond',
        'Value': 1500,
        'Unit': 'Count/Second',
        'StorageResolution': 1  # 1 = high-res, 60 = standard
    }]
)
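
The cost tip in the docstring above is easier to reason about with numbers. The following is a rough sketch, not a billing calculator: it assumes the flat $0.30 per metric per month quoted in this module, ignores volume tiers and API request charges, and treats every dimension combination as actually published. The function name `estimate_custom_metric_cost` is made up for illustration.

```python
from math import prod

COST_PER_METRIC_USD = 0.30  # per-metric monthly price quoted in this module


def estimate_custom_metric_cost(metric_names, dimension_values):
    """Worst-case metric count and monthly cost for one namespace.

    metric_names: list of metric names.
    dimension_values: dict of dimension name -> number of distinct values.
    Assumes every combination of dimension values is published.
    """
    combinations = prod(dimension_values.values())
    metric_count = len(metric_names) * combinations
    return metric_count, round(metric_count * COST_PER_METRIC_USD, 2)


# Low cardinality: 3 environments x 4 regions
low = estimate_custom_metric_cost(["OrdersPlaced"], {"Environment": 3, "Region": 4})

# High cardinality: a RequestId dimension with 100,000 distinct values
high = estimate_custom_metric_cost(["OrdersPlaced"], {"Environment": 3, "RequestId": 100_000})

print(low)   # (12, 3.6)
print(high)  # (300000, 90000.0)
```

The two calls differ only in one dimension, which is the point: cardinality multiplies, so a single unbounded dimension dominates the bill.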

CloudWatch Embedded Metric Format (EMF)

import json
import time

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """
    Embedded Metric Format - publish metrics via logs.
    Automatically extracted by CloudWatch.
    
    Why EMF over PutMetricData? EMF is the preferred approach for Lambda
    because: (1) no additional API call latency added to your function,
    (2) the metric data is embedded in your log output so you get metrics
    AND logs in one write, and (3) you avoid PutMetricData throttling
    limits (150 TPS). The downside is you pay for log ingestion ($0.50/GB),
    but for Lambda you are already paying that regardless.
    """
    emf_log = {
        "_aws": {
            # Epoch milliseconds. time.time() is timezone-independent, whereas
            # .timestamp() on a naive utcnow() value is interpreted as local time.
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{
                    "Name": metric_name,
                    "Unit": "Count"
                }]
            }]
        },
        metric_name: value,
        **dimensions
    }
    
    # Print to stdout - CloudWatch Logs extracts the metric
    print(json.dumps(emf_log))

# Usage in Lambda
emit_emf_metric(
    metric_name="OrderValue",
    value=99.99,
    dimensions={"Service": "Checkout", "Environment": "prod"}
)
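
To make the EMF layout concrete, here is a small sketch that reads an EMF log line back into metric records. It only illustrates how the `_aws` metadata points at top-level keys of the same JSON object; it is not how CloudWatch itself is implemented, and `extract_emf_metrics` is a hypothetical helper name.

```python
import json


def extract_emf_metrics(log_line: str) -> list:
    """Return one record per (dimension set, metric) declared in an EMF line."""
    doc = json.loads(log_line)
    records = []
    for directive in doc["_aws"]["CloudWatchMetrics"]:
        for dimension_set in directive["Dimensions"]:
            # Dimension names refer to top-level keys holding their values.
            dims = {name: doc[name] for name in dimension_set}
            for metric in directive["Metrics"]:
                records.append({
                    "namespace": directive["Namespace"],
                    "name": metric["Name"],
                    "unit": metric.get("Unit", "None"),
                    # The metric value is also stored under a top-level key.
                    "value": doc[metric["Name"]],
                    "dimensions": dims,
                })
    return records


sample_line = json.dumps({
    "_aws": {
        "Timestamp": 1705312800000,
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Service", "Environment"]],
            "Metrics": [{"Name": "OrderValue", "Unit": "Count"}],
        }],
    },
    "OrderValue": 99.99,
    "Service": "Checkout",
    "Environment": "prod",
})

records = extract_emf_metrics(sample_line)
print(records[0]["name"], records[0]["value"], records[0]["dimensions"])
```

Note that `Dimensions` is a list of lists: each inner list is one dimension set, and the same value is recorded once per set.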

CloudWatch Logs

Centralized log management for all AWS services and applications. CloudWatch Logs is often the single biggest line item on an AWS bill that teams do not expect — a service generating 100 GB of logs per day costs roughly $1,500/month in ingestion alone, before storage and queries.

Log Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs Structure                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   LOG GROUP: /aws/lambda/my-function                                   │
│   ───────────────────────────────────                                  │
│       │                                                                 │
│       ├── LOG STREAM: 2024/01/15/[$LATEST]abc123                       │
│       │       │                                                         │
│       │       ├── Log Event: {"level": "INFO", "msg": "Started"}       │
│       │       ├── Log Event: {"level": "ERROR", "msg": "Failed"}       │
│       │       └── Log Event: {"level": "INFO", "msg": "Completed"}     │
│       │                                                                 │
│       ├── LOG STREAM: 2024/01/15/[$LATEST]def456                       │
│       │       └── ...                                                   │
│       │                                                                 │
│       └── LOG STREAM: 2024/01/16/[$LATEST]ghi789                       │
│               └── ...                                                   │
│                                                                         │
│   RETENTION SETTINGS:                                                  │
│   • 1 day to 10 years (or never expire)                                │
│   • Export to S3 for long-term storage                                 │
│   • Stream to Kinesis/Lambda for real-time processing                  │
│                                                                         │
│   PRICING:                                                              │
│   • Ingestion: $0.50 per GB                                            │
│   • Storage: $0.03 per GB/month                                        │
│   • Queries (Logs Insights): $0.005 per GB scanned                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
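
The pricing figures above can be turned into a quick back-of-the-envelope estimate. This sketch assumes the list prices shown in the box, a 30-day month, and a steady state where roughly `retention_days` worth of data is stored at any time. It ignores compression, Logs Insights query charges, and any free tier, so treat the output as an upper-bound guess rather than a bill.

```python
INGESTION_USD_PER_GB = 0.50       # from the pricing box above
STORAGE_USD_PER_GB_MONTH = 0.03   # from the pricing box above


def estimate_log_cost(gb_per_day: float, retention_days: int,
                      days_in_month: int = 30) -> dict:
    """Approximate monthly CloudWatch Logs cost in USD."""
    ingestion = gb_per_day * days_in_month * INGESTION_USD_PER_GB
    # Steady state: about retention_days of logs are held at once.
    stored_gb = gb_per_day * retention_days
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    return {
        "ingestion": round(ingestion, 2),
        "storage": round(storage, 2),
        "total": round(ingestion + storage, 2),
    }


cost = estimate_log_cost(gb_per_day=100, retention_days=30)
print(cost)  # {'ingestion': 1500.0, 'storage': 90.0, 'total': 1590.0}
```

Ingestion dominates storage by more than an order of magnitude here, which is why reducing what you log usually saves more than shortening retention.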

Structured Logging Best Practices

import json
import logging
from datetime import datetime, timezone

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
    
    def _format_log(self, level: str, message: str, **kwargs) -> str:
        log_entry = {
            # Timezone-aware UTC timestamp, e.g. 2024-01-15T10:00:00+00:00
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "service": self.service_name,
            "message": message,
            **kwargs
        }
        return json.dumps(log_entry)
    
    def info(self, message: str, **kwargs):
        print(self._format_log("INFO", message, **kwargs))
    
    def error(self, message: str, **kwargs):
        print(self._format_log("ERROR", message, **kwargs))
    
    def warn(self, message: str, **kwargs):
        print(self._format_log("WARN", message, **kwargs))

# Usage
logger = StructuredLogger("order-service")

def process_order(order_id: str, user_id: str):
    logger.info(
        "Processing order",
        order_id=order_id,
        user_id=user_id,
        action="process_order"
    )
    
    try:
        # Process order...
        logger.info(
            "Order completed",
            order_id=order_id,
            duration_ms=150,
            action="order_complete"
        )
    except Exception as e:
        logger.error(
            "Order failed",
            order_id=order_id,
            error=str(e),
            action="order_failed"
        )
        raise
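
An alternative to printing JSON by hand is to plug a formatter into the standard `logging` module, so existing `logger.info(...)` calls produce the same structured output. The sketch below is one possible shape, not the module's prescribed approach: `JsonFormatter`, `build_logger`, and the `context` key passed through `extra` are names chosen for this example, and `correlation_id` is simply an illustrative field.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become record.context.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, default=str)


def build_logger(service_name: str) -> logging.Logger:
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter(service_name))
        logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("order-service")
log.info("Processing order",
         extra={"context": {"order_id": "o-123", "correlation_id": "req-42"}})
```

Carrying a correlation ID in every entry is what lets you later filter one request's log lines across services.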

CloudWatch Logs Insights

# Find all errors (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Aggregate errors by service
# (fields in JSON log events are discovered automatically, so no parse step is needed)
fields @timestamp, level, service, message
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate p50/p95/p99 latency from Lambda REPORT lines
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

# Find slow API requests (duration_ms comes from the JSON log events)
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

# Error rate over time (strcontains returns 1 on a match, 0 otherwise)
fields @timestamp, strcontains(level, "ERROR") as is_error
| filter ispresent(level)
| stats count(*) as total,
        sum(is_error) as errors,
        sum(is_error) / count(*) * 100 as error_rate
  by bin(5m)
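
If the `bin(5m)` aggregation in the last query is unfamiliar, the same calculation can be written out in plain Python. This is only a local illustration of the grouping logic, assuming JSON log lines with `timestamp` and `level` fields like those produced by the structured logger earlier; `error_rate_by_bin` is a hypothetical helper and `bin_minutes` is assumed to divide 60 evenly.

```python
import json
from collections import defaultdict
from datetime import datetime


def error_rate_by_bin(log_lines, bin_minutes=5):
    """Return {bin_start_iso: error_rate_percent} for JSON log lines."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in log_lines:
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["timestamp"])
        # Floor the timestamp to the start of its bin, like bin(5m).
        bin_start = ts.replace(minute=ts.minute - ts.minute % bin_minutes,
                               second=0, microsecond=0)
        key = bin_start.isoformat()
        totals[key] += 1
        if entry.get("level") == "ERROR":
            errors[key] += 1
    return {key: errors[key] / totals[key] * 100 for key in sorted(totals)}


sample_logs = [
    json.dumps({"timestamp": "2024-01-15T10:01:00", "level": "INFO"}),
    json.dumps({"timestamp": "2024-01-15T10:03:00", "level": "ERROR"}),
    json.dumps({"timestamp": "2024-01-15T10:04:00", "level": "INFO"}),
    json.dumps({"timestamp": "2024-01-15T10:04:30", "level": "INFO"}),
    json.dumps({"timestamp": "2024-01-15T10:07:00", "level": "ERROR"}),
]

rates = error_rate_by_bin(sample_logs)
print(rates)
# {'2024-01-15T10:00:00': 25.0, '2024-01-15T10:05:00': 100.0}
```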

X-Ray Distributed Tracing

Trace requests across microservices to identify bottlenecks and errors. While metrics tell you “p99 latency is 3 seconds” and logs tell you “this function threw an error,” traces tell you “the request spent 2.5 of those 3 seconds waiting on a DynamoDB call that was throttled.” Traces are the only tool that gives you a request-level view across service boundaries.

X-Ray Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Distributed Tracing                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Request Flow with Trace:                                             │
│                                                                         │
│   Client                                                                │
│     │                                                                   │
│     │  Trace ID: 1-abc123-def456789                                    │
│     ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  API Gateway                                                     │  │
│   │  Segment: 50ms                                                   │  │
│   └───────────────────────────┬─────────────────────────────────────┘  │
│                               │                                         │
│                               ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  Lambda: Order Service                                           │  │
│   │  Segment: 200ms                                                  │  │
│   │  ├── Subsegment: DynamoDB Query (30ms)                          │  │
│   │  ├── Subsegment: External API Call (100ms)                      │  │
│   │  └── Subsegment: SNS Publish (20ms)                             │  │
│   └───────────────────────────┬─────────────────────────────────────┘  │
│                               │                                         │
│              ┌────────────────┴────────────────┐                       │
│              ▼                                 ▼                       │
│   ┌──────────────────────┐        ┌──────────────────────┐            │
│   │  Lambda: Inventory   │        │  Lambda: Payment      │            │
│   │  Segment: 80ms       │        │  Segment: 150ms       │            │
│   │  └── DynamoDB (50ms) │        │  └── Stripe API (120ms)│           │
│   └──────────────────────┘        └──────────────────────┘            │
│                                                                         │
│   Service Map (auto-generated):                                        │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐        │
│   │   API    │───►│  Order   │───►│ Inventory│───►│ DynamoDB │        │
│   │ Gateway  │    │ Service  │    │ Service  │    │          │        │
│   └──────────┘    └────┬─────┘    └──────────┘    └──────────┘        │
│                        │                                                │
│                        └──────────►┌──────────┐                        │
│                                    │ Payment  │                        │
│                                    │ Service  │                        │
│                                    └──────────┘                        │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
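
The trace ID in the diagram is abbreviated. As far as I know, X-Ray propagates context in an `X-Amzn-Trace-Id` HTTP header, and a full trace ID has the form `1-<8 hex digits of epoch seconds>-<24 hex digits>`. The sketch below parses such a header under that assumption; `parse_trace_header` is an illustrative helper, not part of the X-Ray SDK, and the sample value is a format example rather than a real trace.

```python
import re
from datetime import datetime, timezone

# Version 1, 8 hex digits of epoch seconds, 24 hex digits of random identifier.
TRACE_ID_RE = re.compile(r"^1-([0-9a-f]{8})-([0-9a-f]{24})$")


def parse_trace_header(header: str) -> dict:
    """Split an X-Amzn-Trace-Id header value into its fields."""
    fields = dict(
        part.strip().split("=", 1)
        for part in header.split(";")
        if "=" in part
    )
    root = fields.get("Root", "")
    match = TRACE_ID_RE.match(root)
    if not match:
        raise ValueError(f"not a valid X-Ray trace ID: {root!r}")
    # The first segment encodes when the trace started.
    started = datetime.fromtimestamp(int(match.group(1), 16), tz=timezone.utc)
    return {
        "trace_id": root,
        "parent_id": fields.get("Parent"),
        "sampled": fields.get("Sampled") == "1",
        "started_at": started,
    }


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
info = parse_trace_header(header)
print(info["trace_id"], info["sampled"], info["started_at"].isoformat())
```

Logging the `trace_id` alongside your structured log fields is what makes it possible to jump from a trace to the matching log lines.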

Instrumenting Lambda with X-Ray

import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, etc.)
patch_all()

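dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')

# Placeholder helpers so the example is self-contained; replace with real logic.
def validate_order(order_id: str) -> None:
    if not order_id:
        raise ValueError("order_id is required")

def call_payment_api(order: dict) -> dict:
    return {"status": "ok", "order_id": order["order_id"]}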

@xray_recorder.capture('process_order')
def process_order(order_id: str):
    """Process order with X-Ray tracing."""
    
    # Add annotation (indexed, searchable)
    xray_recorder.put_annotation('order_id', order_id)
    
    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'order_id': order_id,
        'timestamp': '2024-01-15T10:00:00Z'
    })
    
    # Subsegment for custom operation
    with xray_recorder.in_subsegment('validate_order') as subsegment:
        subsegment.put_annotation('validation_type', 'full')
        validate_order(order_id)
    
    # DynamoDB call is automatically traced
    result = table.get_item(Key={'order_id': order_id})
    item = result.get('Item')
    if item is None:
        # get_item omits the 'Item' key when no record matches
        raise LookupError(f"Order {order_id} not found")
    
    # External API call with custom subsegment
    with xray_recorder.in_subsegment('external_api') as subsegment:
        subsegment.put_metadata('api', 'payment_gateway')
        response = call_payment_api(item)
    
    return response

def lambda_handler(event, context):
    order_id = event.get('order_id')
    return process_order(order_id)

CloudWatch Alarms

Automated alerts and actions based on metric thresholds.

Alarm Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Alarm States                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Alarm States:                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │    ┌──────────┐        ┌──────────┐        ┌──────────┐        │  │
│   │    │    OK    │◄──────►│ ALARM    │◄──────►│INSUFFICIENT│       │  │
│   │    │  (green) │        │  (red)   │        │   DATA     │       │  │
│   │    └──────────┘        └──────────┘        └──────────┘        │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Alarm Components:                                                     │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   METRIC           STATISTIC       PERIOD      THRESHOLD        │  │
│   │   CPUUtilization   Average         5 minutes   > 80%            │  │
│   │                                                                  │  │
│   │   EVALUATION PERIODS: 3                                         │  │
│   │   (3 consecutive 5-min periods above 80% = ALARM)               │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Alarm Actions:                                                        │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   • SNS Topic (email, SMS, Lambda)                              │  │
│   │   • Auto Scaling (scale up/down)                                │  │
│   │   • EC2 Actions (stop, terminate, reboot)                       │  │
│   │   • Systems Manager (run automation)                            │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
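
The "3 consecutive periods above threshold" rule can be simulated in a few lines. This is a deliberately simplified model: it treats the first datapoints as INSUFFICIENT_DATA until a full window exists, and it ignores missing-data handling and "M out of N" datapoint settings that real alarms support. `evaluate_alarm` is a name invented for this sketch.

```python
def evaluate_alarm(datapoints, threshold=80.0, evaluation_periods=3):
    """Return the simulated alarm state after each datapoint arrives."""
    states = []
    for i in range(len(datapoints)):
        # Look at the most recent `evaluation_periods` datapoints.
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        if len(window) < evaluation_periods:
            states.append("INSUFFICIENT_DATA")
        elif all(value > threshold for value in window):
            states.append("ALARM")
        else:
            states.append("OK")
    return states


# Average CPU per 5-minute period
cpu = [70, 85, 90, 95, 60, 88]
states = evaluate_alarm(cpu)
print(states)
# ['INSUFFICIENT_DATA', 'INSUFFICIENT_DATA', 'OK', 'ALARM', 'OK', 'OK']
```

A single dip below the threshold resets the streak, which is why requiring several evaluation periods filters out short spikes.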

Essential Alarms (Terraform)

# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300  # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  
  # Alarm actions fire when state transitions to ALARM.
  # Triggering both notification AND auto-scaling is a production best practice:
  # auto-scaling handles the immediate capacity need while the team investigates.
  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  
  # OK actions fire when alarm returns to normal -- useful for "all clear"
  # notifications so on-call engineers know the issue resolved automatically.
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  alarm_description   = "Lambda error rate exceeds 5%"
  
  metric_query {
    id          = "error_rate"
    expression  = "(errors / invocations) * 100"
    label       = "Error Rate"
    return_data = true
  }
  
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }
  
  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DynamoDB Throttling Alarm
# Why threshold 0? Because ANY throttling means requests are being rejected.
# DynamoDB throttling is easy to miss -- the SDK retries the 400 error for you, and
# CloudWatch won't page you unless you set this up. Even one throttled
# request can cascade into retries that cause more throttling.
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttle" {
  alarm_name          = "dynamodb-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  # The metric is only emitted while throttling occurs, so treat gaps as healthy.
  treat_missing_data  = "notBreaching"
  
  # ThrottledRequests is published per operation, so the Operation dimension
  # is required; create one alarm per operation you care about.
  dimensions = {
    TableName = "my-table"
    Operation = "PutItem"
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name = "critical-system-alarm"
  
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}

CloudTrail (Audit Logging)

Track all API calls for security and compliance.

┌────────────────────────────────────────────────────────────────────────┐
│                    CloudTrail Architecture                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Every AWS API Call:                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                                                                  │  │
│   │   User/Role ───► AWS API ───► CloudTrail ───► S3 Bucket         │  │
│   │                                    │                             │  │
│   │                                    └──► CloudWatch Logs          │  │
│   │                                    └──► EventBridge (real-time)  │  │
│   │                                                                  │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Event Types:                                                          │
│   ─────────────                                                         │
│   • Management Events: Control plane (CreateBucket, RunInstances)      │
│   • Data Events: Data plane (S3 GetObject, Lambda Invoke)              │
│   • Insights Events: Unusual API activity detection                    │
│                                                                         │
│   Sample Event:                                                         │
│   {                                                                     │
│     "eventTime": "2024-01-15T10:30:00Z",                               │
│     "eventSource": "s3.amazonaws.com",                                 │
│     "eventName": "DeleteBucket",                                       │
│     "userIdentity": {                                                   │
│       "type": "IAMUser",                                               │
│       "userName": "admin",                                             │
│       "arn": "arn:aws:iam::123456789012:user/admin"                    │
│     },                                                                  │
│     "sourceIPAddress": "203.0.113.50",                                 │
│     "requestParameters": { "bucketName": "my-bucket" }                 │
│   }                                                                     │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

CloudTrail Best Practices

# CloudTrail configuration checklist
cloudtrail_config = {
    "multi_region": True,           # Trail in all regions
    "log_file_validation": True,    # Detect tampering
    "s3_encryption": "SSE-KMS",     # Encrypt logs
    "cloudwatch_logs": True,        # Real-time analysis
    "data_events": {
        "s3": ["arn:aws:s3:::sensitive-bucket/*"],
        "lambda": True
    },
    "insights": True,               # Anomaly detection
    "organization_trail": True      # Multi-account
}

# Real-time alerting with EventBridge
# Detect root user login
eventbridge_rule = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {
        "userIdentity": {
            "type": ["Root"]
        }
    }
}
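
To see why the rule above fires for a root sign-in and not for an ordinary IAM user, here is a simplified matcher. It covers only the exact-value subset of EventBridge patterns used in this rule (a list means "any of these values", a nested dict recurses); real EventBridge patterns support more operators, which this sketch does not attempt. The `matches` function and both sample events are illustrative.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Exact-value subset of event pattern matching."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event must have a matching nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True


root_login_pattern = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}

root_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"userIdentity": {"type": "Root"}, "sourceIPAddress": "203.0.113.50"},
}
iam_user_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"userIdentity": {"type": "IAMUser", "userName": "admin"}},
}

print(matches(root_login_pattern, root_event))      # True
print(matches(root_login_pattern, iam_user_event))  # False
```

Extra fields in the event, such as `sourceIPAddress`, are ignored: a pattern only constrains the keys it names.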

🎯 Interview Questions

Systematic approach:
  1. X-Ray Trace: Find the specific slow request
    • Identify which service/subsegment is slow
    • Check annotations for context
  2. CloudWatch Metrics: Check historical patterns
    • Is this a spike or gradual increase?
    • Correlate with CPU, memory, connections
  3. CloudWatch Logs: Find related errors
    fields @timestamp, @message
    | filter trace_id = "1-abc123..."
    | sort @timestamp
    
  4. Service-specific checks:
    • Lambda: Cold starts? Memory sufficient?
    • DynamoDB: Throttling? Hot partition?
    • RDS: Connection pool exhausted?

Essential metrics by layer:
Load Balancer:
  • RequestCount, TargetResponseTime
  • HTTPCode_Target_5XX, HTTPCode_ELB_5XX
  • HealthyHostCount, UnhealthyHostCount
Compute (EC2/Lambda):
  • CPUUtilization, MemoryUtilization (custom)
  • Lambda: Duration, Errors, Throttles
Database:
  • CPUUtilization, FreeableMemory
  • DatabaseConnections, ReadIOPS, WriteIOPS
  • DynamoDB: ThrottledRequests
Application:
  • Error rate, Latency (p50, p95, p99)
  • Requests per second
  • Business metrics (orders, signups)

Cost optimization strategies:
  1. Reduce ingestion:
    • Filter logs at source (log level INFO not DEBUG)
    • Use sampling for high-volume logs
  2. Optimize retention:
    • Set appropriate retention (7-30 days for most)
    • Export to S3 for long-term (cheaper)
  3. Use Logs Insights efficiently:
    • Narrow time ranges
    • Use specific log groups
    • Cache common queries
  4. Consider alternatives:
    • Kinesis Firehose → S3 for high volume
    • OpenSearch for complex analysis

Alert hierarchy:
  1. Critical (PagerDuty/immediate):
    • Service down (health check failures)
    • Error rate > 5%
    • Latency p99 > 5s
    • Security events (root login)
  2. Warning (Slack/email):
    • Error rate > 1%
    • CPU > 80% sustained
    • Disk > 85%
    • Approaching quotas
  3. Informational (dashboard):
    • Deployment events
    • Scaling events
    • Cost anomalies
Best practices:
  • Avoid alert fatigue (tune thresholds)
  • Use composite alarms
  • Include runbook links in alerts
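
The alert hierarchy above can be expressed as a small routing function. This sketch uses the thresholds from the list as-is; the function name `classify_alert`, its parameters, and the returned labels are illustrative choices, and real thresholds should be tuned per service.

```python
def classify_alert(error_rate_pct: float = 0.0, p99_latency_s: float = 0.0,
                   cpu_pct: float = 0.0, disk_pct: float = 0.0) -> str:
    """Map current readings to a severity tier from the hierarchy above."""
    # Critical: page someone immediately.
    if error_rate_pct > 5 or p99_latency_s > 5:
        return "critical"
    # Warning: notify via Slack or email.
    if error_rate_pct > 1 or cpu_pct > 80 or disk_pct > 85:
        return "warning"
    # Everything else stays on the dashboard.
    return "info"


print(classify_alert(error_rate_pct=7.5))               # critical
print(classify_alert(error_rate_pct=2.0, cpu_pct=40))   # warning
print(classify_alert(cpu_pct=50, disk_pct=60))          # info
```

Checking the critical conditions first means a reading that crosses both tiers is always routed to the more urgent channel.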

CloudWatch advantages:
  • Native integration, no agents for AWS services
  • Lower cost for basic use cases
  • No data egress charges
Third-party advantages (Datadog, New Relic):
  • Better visualization and correlation
  • APM with code-level insights
  • Multi-cloud support
  • More powerful querying
Hybrid approach:
  • Use CloudWatch for AWS metrics/logs
  • Stream to third-party for analysis
  • Keep costs balanced

🧪 Hands-On Lab: Build Observability Dashboard

  1. Enable X-Ray on Lambda: Add X-Ray SDK and enable active tracing
  2. Create CloudWatch Dashboard: Add widgets for key metrics (CPU, errors, latency)
  3. Set Up Structured Logging: Implement JSON logging with correlation IDs
  4. Create Alarms: Set up CPU, error rate, and latency alarms
  5. Configure CloudTrail: Enable multi-region trail with CloudWatch Logs

Next Module

CDN & Edge Services

Master CloudFront, Global Accelerator, and edge computing