Module Overview
Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Core Concepts, Compute
Observability is critical for running production workloads. This module covers the complete AWS monitoring stack for logs, metrics, traces, and alerts.
What You’ll Learn:
CloudWatch metrics, logs, and alarms
X-Ray distributed tracing
CloudTrail for audit logging
EventBridge for event-driven automation
Building observability dashboards
Alerting and incident response
Observability Pillars
┌────────────────────────────────────────────────────────────────────────┐
│ Three Pillars of Observability │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ METRICS │ │ LOGS │ │ TRACES │ │
│ │ (CloudWatch) │ │ (CloudWatch │ │ (X-Ray) │ │
│ │ │ │ Logs) │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ What happened? Why did it Where did it │
│ (CPU 85%) happen? happen? │
│ (error logs) (service A→B→C) │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ AWS Observability Stack │ │
│ │ │ │
│ │ CloudWatch CloudWatch X-Ray CloudTrail │ │
│ │ Metrics Logs Traces Audit Logs │ │
│ │ │ │ │ │ │ │
│ │ └──────────────┴──────────────┴──────────────┘ │ │
│ │ │ │ │
│ │ CloudWatch Dashboards │ │
│ │ CloudWatch Alarms │ │
│ │ EventBridge Automation │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
CloudWatch Metrics
Collect and track metrics from AWS services and custom applications.
Built-in vs Custom Metrics
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Metrics Types │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ BUILT-IN METRICS (Free - Basic Monitoring) │
│ ───────────────────────────────────────── │
│ EC2: CPUUtilization, NetworkIn/Out, DiskRead/Write │
│ RDS: CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS │
│ Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExec │
│ ALB: RequestCount, TargetResponseTime, HTTPCode_Target_2XX │
│ DynamoDB: ConsumedRCU, ConsumedWCU, ThrottledRequests │
│ S3: BucketSizeBytes, NumberOfObjects │
│ │
│ Resolution: 5 minutes (basic), 1 minute (detailed - extra cost) │
│ │
│ CUSTOM METRICS (You publish) │
│ ──────────────────────────── │
│ • Application-specific metrics │
│ • Business KPIs (orders/min, signups/hour) │
│ • Memory utilization (not built-in for EC2!) │
│ • Queue depth, cache hit ratio │
│ │
│ Resolution: 1 second to 1 minute (high-resolution) │
│ Cost: $0.30 per metric per month │
│ │
└────────────────────────────────────────────────────────────────────────┘
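Before publishing anything custom, it helps to see how built-in metrics are read back. A minimal sketch using boto3's GetMetricData API to pull the last three hours of EC2 CPUUtilization; the instance ID is a placeholder and credentials/region are assumed to come from the environment.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Pull 3 hours of the built-in CPUUtilization metric for one instance (placeholder ID)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'cpu',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/EC2',
                'MetricName': 'CPUUtilization',
                'Dimensions': [{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}]
            },
            'Period': 300,   # 5-minute basic monitoring resolution
            'Stat': 'Average'
        }
    }],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow()
)

result = response['MetricDataResults'][0]
for timestamp, value in zip(result['Timestamps'], result['Values']):
    print(timestamp, value)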
Publishing Custom Metrics
import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def publish_custom_metric(namespace: str, metric_name: str,
                          value: float, unit: str = 'Count',
                          dimensions: list = None):
    """Publish a custom metric to CloudWatch."""
    metric_data = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
        'Timestamp': datetime.utcnow(),
    }
    if dimensions:
        metric_data['Dimensions'] = dimensions

    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_data]
    )

# Example: Track orders per minute
publish_custom_metric(
    namespace='MyApp/Ecommerce',
    metric_name='OrdersPlaced',
    value=42,
    unit='Count',
    dimensions=[
        {'Name': 'Environment', 'Value': 'Production'},
        {'Name': 'Region', 'Value': 'us-east-1'}
    ]
)

# Example: Track memory utilization (not built-in for EC2!)
import psutil

publish_custom_metric(
    namespace='MyApp/System',
    metric_name='MemoryUtilization',
    value=psutil.virtual_memory().percent,
    unit='Percent',
    dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ]
)

# High-resolution metrics (1-second granularity)
cloudwatch.put_metric_data(
    Namespace='MyApp/HighFrequency',
    MetricData=[{
        'MetricName': 'TransactionsPerSecond',
        'Value': 1500,
        'Unit': 'Count/Second',
        'StorageResolution': 1  # 1 = high-resolution, 60 = standard
    }]
)
Embedded Metric Format (EMF)
import json
from datetime import datetime

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """
    Embedded Metric Format - publish metrics via logs.
    Automatically extracted by CloudWatch.
    """
    emf_log = {
        "_aws": {
            "Timestamp": int(datetime.utcnow().timestamp() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{
                    "Name": metric_name,
                    "Unit": "Count"
                }]
            }]
        },
        metric_name: value,
        **dimensions
    }
    # Print to stdout - CloudWatch Logs extracts the metric
    print(json.dumps(emf_log))

# Usage in Lambda
emit_emf_metric(
    metric_name="OrderValue",
    value=99.99,
    dimensions={"Service": "Checkout", "Environment": "prod"}
)
CloudWatch Logs
Centralized log management for all AWS services and applications.
Log Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Logs Structure │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ LOG GROUP: /aws/lambda/my-function │
│ ─────────────────────────────────── │
│ │ │
│ ├── LOG STREAM: 2024/01/15/[$LATEST]abc123 │
│ │ │ │
│ │ ├── Log Event: {"level": "INFO", "msg": "Started"} │
│ │ ├── Log Event: {"level": "ERROR", "msg": "Failed"} │
│ │ └── Log Event: {"level": "INFO", "msg": "Completed"} │
│ │ │
│ ├── LOG STREAM: 2024/01/15/[$LATEST]def456 │
│ │ └── ... │
│ │ │
│ └── LOG STREAM: 2024/01/16/[$LATEST]ghi789 │
│ └── ... │
│ │
│ RETENTION SETTINGS: │
│ • 1 day to 10 years (or never expire) │
│ • Export to S3 for long-term storage │
│ • Stream to Kinesis/Lambda for real-time processing │
│ │
│ PRICING: │
│ • Ingestion: $0.50 per GB │
│ • Storage: $0.03 per GB/month │
│ • Queries (Logs Insights): $0.005 per GB scanned │
│ │
└────────────────────────────────────────────────────────────────────────┘
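Retention is not set automatically; new log groups default to never-expire, which quietly accumulates storage cost. A minimal sketch, assuming the log group name from the diagram, that applies a 30-day retention policy with boto3:

import boto3

logs = boto3.client('logs')

# Set a 30-day retention policy on a log group (name is an example)
logs.put_retention_policy(
    logGroupName='/aws/lambda/my-function',
    retentionInDays=30
)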
Structured Logging Best Practices
import json
from datetime import datetime

class StructuredLogger:
    """JSON structured logger for CloudWatch Logs."""

    def __init__(self, service_name: str):
        self.service_name = service_name

    def _format_log(self, level: str, message: str, **kwargs) -> str:
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": level,
            "service": self.service_name,
            "message": message,
            **kwargs
        }
        return json.dumps(log_entry)

    # In Lambda, anything written to stdout is captured by CloudWatch Logs
    def info(self, message: str, **kwargs):
        print(self._format_log("INFO", message, **kwargs))

    def error(self, message: str, **kwargs):
        print(self._format_log("ERROR", message, **kwargs))

    def warn(self, message: str, **kwargs):
        print(self._format_log("WARN", message, **kwargs))

# Usage
logger = StructuredLogger("order-service")

def process_order(order_id: str, user_id: str):
    logger.info(
        "Processing order",
        order_id=order_id,
        user_id=user_id,
        action="process_order"
    )
    try:
        # Process order...
        logger.info(
            "Order completed",
            order_id=order_id,
            duration_ms=150,
            action="order_complete"
        )
    except Exception as e:
        logger.error(
            "Order failed",
            order_id=order_id,
            error=str(e),
            action="order_failed"
        )
        raise
CloudWatch Logs Insights
# Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Parse JSON logs and aggregate errors by service
fields @timestamp, @message
| parse @message '{"level":"*","service":"*","message":"*"' as level, service, msg
| filter level = "ERROR"
| stats count(*) as error_count by service
| sort error_count desc

# Calculate p50/p95/p99 latency from Lambda REPORT lines
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

# Find slow API requests
fields @timestamp, @message
| parse @message '"duration_ms":*,' as duration
| filter duration > 1000
| sort duration desc
| limit 50

# Error rate over time
fields @timestamp, @message
| filter @message like /level/
| parse @message '"level":"*"' as level
| stats count(*) as total,
        sum(strcontains(level, "ERROR")) as errors,
        sum(strcontains(level, "ERROR")) / count(*) * 100 as error_rate
  by bin(5m)
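These queries can also be run programmatically. A minimal sketch that runs the first query (errors in the last hour) against an example log group using boto3's start_query and get_query_results; Insights queries are asynchronous, so the code polls until completion.

import time
import boto3
from datetime import datetime, timedelta

logs = boto3.client('logs')

# Kick off the "find all errors" query against an example log group
query_id = logs.start_query(
    logGroupName='/aws/lambda/my-function',
    startTime=int((datetime.utcnow() - timedelta(hours=1)).timestamp()),
    endTime=int(datetime.utcnow().timestamp()),
    queryString='fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100'
)['queryId']

# Poll until the query finishes (Insights queries run asynchronously)
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result['results']:
    print({field['field']: field['value'] for field in row})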
X-Ray Distributed Tracing
Trace requests across microservices to identify bottlenecks and errors.
X-Ray Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ X-Ray Distributed Tracing │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Request Flow with Trace: │
│ │
│ Client │
│ │ │
│ │ Trace ID: 1-abc123-def456789 │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ API Gateway │ │
│ │ Segment: 50ms │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Lambda: Order Service │ │
│ │ Segment: 200ms │ │
│ │ ├── Subsegment: DynamoDB Query (30ms) │ │
│ │ ├── Subsegment: External API Call (100ms) │ │
│ │ └── Subsegment: SNS Publish (20ms) │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ┌────────────────┴────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Lambda: Inventory │ │ Lambda: Payment │ │
│ │ Segment: 80ms │ │ Segment: 150ms │ │
│ │ └── DynamoDB (50ms) │ │ └── Stripe API (120ms)│ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ Service Map (auto-generated): │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ API │───►│ Order │───►│ Inventory│───►│ DynamoDB │ │
│ │ Gateway │ │ Service │ │ Service │ │ │ │
│ └──────────┘ └────┬─────┘ └──────────┘ └──────────┘ │
│ │ │
│ └──────────►┌──────────┐ │
│ │ Payment │ │
│ │ Service │ │
│ └──────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
Instrumenting Lambda with X-Ray
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, etc.)
patch_all()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')

@xray_recorder.capture('process_order')
def process_order(order_id: str):
    """Process order with X-Ray tracing."""
    # Add annotation (indexed, searchable)
    xray_recorder.put_annotation('order_id', order_id)

    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'order_id': order_id,
        'timestamp': '2024-01-15T10:00:00Z'
    })

    # Subsegment for a custom operation
    # (validate_order and call_payment_api are application helpers, not shown)
    with xray_recorder.in_subsegment('validate_order') as subsegment:
        subsegment.put_annotation('validation_type', 'full')
        validate_order(order_id)

    # DynamoDB call is automatically traced
    result = table.get_item(Key={'order_id': order_id})

    # External API call with a custom subsegment
    with xray_recorder.in_subsegment('external_api') as subsegment:
        subsegment.put_metadata('api', 'payment_gateway')
        response = call_payment_api(result['Item'])

    return response

def lambda_handler(event, context):
    order_id = event.get('order_id')
    return process_order(order_id)
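Instrumenting the code is only half the setup: the Lambda function itself must have active tracing enabled, otherwise no segments are sent. In practice this is usually configured in Terraform or SAM; the sketch below shows the equivalent boto3 call with a placeholder function name.

import boto3

lambda_client = boto3.client('lambda')

# Turn on active tracing so Lambda samples requests and sends segments to X-Ray
lambda_client.update_function_configuration(
    FunctionName='order-service',        # example function name
    TracingConfig={'Mode': 'Active'}     # 'PassThrough' is the default
)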
CloudWatch Alarms
Automated alerts and actions based on metric thresholds.
Alarm Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ CloudWatch Alarm States │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Alarm States: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ OK │◄──────►│ ALARM │◄──────►│INSUFFICIENT│ │ │
│ │ │ (green) │ │ (red) │ │ DATA │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Alarm Components: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ METRIC STATISTIC PERIOD THRESHOLD │ │
│ │ CPUUtilization Average 5 minutes > 80% │ │
│ │ │ │
│ │ EVALUATION PERIODS: 3 │ │
│ │ (3 consecutive 5-min periods above 80% = ALARM) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Alarm Actions: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ • SNS Topic (email, SMS, Lambda) │ │
│ │ • Auto Scaling (scale up/down) │ │
│ │ • EC2 Actions (stop, terminate, reboot) │ │
│ │ • Systems Manager (run automation) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
Creating Alarms with Terraform
# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300  # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [
    aws_sns_topic.alerts.arn,
    aws_autoscaling_policy.scale_up.arn
  ]
  ok_actions = [aws_sns_topic.alerts.arn]
}

# Lambda Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5  # 5% error rate
  alarm_description   = "Lambda error rate exceeds 5%"

  metric_query {
    id          = "error_rate"
    expression  = "(errors / invocations) * 100"
    label       = "Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }

  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = "my-function" }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DynamoDB Throttling Alarm
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttle" {
  alarm_name          = "dynamodb-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    TableName = "my-table"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# Composite Alarm (multiple conditions)
resource "aws_cloudwatch_composite_alarm" "critical" {
  alarm_name    = "critical-system-alarm"
  alarm_rule    = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
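Before trusting an alarm in production it is worth verifying that its actions actually fire. A minimal sketch, assuming the high-cpu alarm above already exists, that temporarily forces it into the ALARM state; the next metric evaluation resets it automatically.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Force the alarm into ALARM to verify that SNS and scaling actions fire
cloudwatch.set_alarm_state(
    AlarmName='high-cpu-utilization',
    StateValue='ALARM',
    StateReason='Manual test of notification and scaling actions'
)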
CloudTrail (Audit Logging)
Track all API calls for security and compliance.
┌────────────────────────────────────────────────────────────────────────┐
│ CloudTrail Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Every AWS API Call: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ User/Role ───► AWS API ───► CloudTrail ───► S3 Bucket │ │
│ │ │ │ │
│ │ └──► CloudWatch Logs │ │
│ │ └──► EventBridge (real-time) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Event Types: │
│ ───────────── │
│ • Management Events: Control plane (CreateBucket, RunInstances) │
│ • Data Events: Data plane (S3 GetObject, Lambda Invoke) │
│ • Insights Events: Unusual API activity detection │
│ │
│ Sample Event: │
│ { │
│ "eventTime": "2024-01-15T10:30:00Z", │
│ "eventSource": "s3.amazonaws.com", │
│ "eventName": "DeleteBucket", │
│ "userIdentity": { │
│ "type": "IAMUser", │
│ "userName": "admin", │
│ "arn": "arn:aws:iam::123456789012:user/admin" │
│ }, │
│ "sourceIPAddress": "203.0.113.50", │
│ "requestParameters": { "bucketName": "my-bucket" } │
│ } │
│ │
└────────────────────────────────────────────────────────────────────────┘
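Recent management events can be queried directly, without setting up Athena or parsing the S3 files. A minimal sketch using boto3's lookup_events to find DeleteBucket calls from the past week; CloudTrail keeps 90 days of management events queryable this way.

import boto3
from datetime import datetime, timedelta

cloudtrail = boto3.client('cloudtrail')

# Look up recent DeleteBucket calls (management events from the last 90 days
# are queryable without any extra setup)
events = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'DeleteBucket'}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow()
)

for event in events['Events']:
    print(event['EventTime'], event.get('Username', '-'), event['EventName'])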
CloudTrail Best Practices
# CloudTrail configuration checklist
cloudtrail_config = {
    "multi_region": True,            # Trail in all regions
    "log_file_validation": True,     # Detect tampering
    "s3_encryption": "SSE-KMS",      # Encrypt logs
    "cloudwatch_logs": True,         # Real-time analysis
    "data_events": {
        "s3": ["arn:aws:s3:::sensitive-bucket/*"],
        "lambda": True
    },
    "insights": True,                # Anomaly detection
    "organization_trail": True       # Multi-account
}

# Real-time alerting with EventBridge
# Detect root user login
eventbridge_rule = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {
        "userIdentity": {
            "type": ["Root"]
        }
    }
}
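The rule above is just a pattern; it still needs to be created and pointed at a target. A minimal sketch that registers it with EventBridge and routes matches to an SNS topic; the rule name and topic ARN are placeholders.

import json
import boto3

events = boto3.client('events')

# Create the rule from the pattern above (names and ARNs are placeholders)
events.put_rule(
    Name='detect-root-console-login',
    EventPattern=json.dumps({
        "source": ["aws.signin"],
        "detail-type": ["AWS Console Sign In via CloudTrail"],
        "detail": {"userIdentity": {"type": ["Root"]}}
    }),
    State='ENABLED'
)

# Send matching events to an SNS topic for immediate alerting
events.put_targets(
    Rule='detect-root-console-login',
    Targets=[{'Id': 'alert-topic', 'Arn': 'arn:aws:sns:us-east-1:123456789012:security-alerts'}]
)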
🎯 Interview Questions
Q1: How would you debug a slow API response?
Systematic approach:
1. X-Ray trace: find the specific slow request (see the sketch after this list)
   - Identify which service or subsegment is slow
   - Check annotations for context
2. CloudWatch Metrics: check historical patterns
   - Is this a spike or a gradual increase?
   - Correlate with CPU, memory, and connection counts
3. CloudWatch Logs: find related log entries
   fields @timestamp, @message
   | filter trace_id = "1-abc123..."
   | sort @timestamp
4. Service-specific checks:
   - Lambda: cold starts? Memory sufficient?
   - DynamoDB: throttling? Hot partition?
   - RDS: connection pool exhausted?
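For step 1, the slow request can be located programmatically as well as in the console. A minimal sketch using the X-Ray GetTraceSummaries API with a filter expression for traces slower than one second; the time window is arbitrary.

import boto3
from datetime import datetime, timedelta

xray = boto3.client('xray')

# Find traces slower than 1 second in the last 30 minutes
summaries = xray.get_trace_summaries(
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    FilterExpression='duration > 1'
)

for trace in summaries['TraceSummaries']:
    print(trace['Id'], trace['Duration'])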
Q2: What metrics should you monitor for a web application?
Essential metrics by layer (a p99 latency alarm sketch follows this list):
Load Balancer:
- RequestCount, TargetResponseTime
- HTTPCode_Target_5XX, HTTPCode_ELB_5XX
- HealthyHostCount, UnHealthyHostCount
Compute (EC2/Lambda):
- CPUUtilization, MemoryUtilization (custom metric)
- Lambda: Duration, Errors, Throttles
Database:
- CPUUtilization, FreeableMemory
- DatabaseConnections, ReadIOPS, WriteIOPS
- DynamoDB: ThrottledRequests
Application:
- Error rate, latency (p50, p95, p99)
- Requests per second
- Business metrics (orders, signups)
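For the latency percentiles, a regular alarm works if it uses an extended statistic instead of Average. A minimal sketch of a p99 alarm on ALB TargetResponseTime; the load balancer dimension value and SNS topic ARN are placeholders.

import boto3

cloudwatch = boto3.client('cloudwatch')

# p99 latency alarm on the load balancer layer (dimension value is a placeholder)
cloudwatch.put_metric_alarm(
    AlarmName='alb-p99-latency',
    Namespace='AWS/ApplicationELB',
    MetricName='TargetResponseTime',
    Dimensions=[{'Name': 'LoadBalancer', 'Value': 'app/my-alb/50dc6c495c0c9188'}],
    ExtendedStatistic='p99',         # percentile instead of Average/Sum
    Period=300,
    EvaluationPeriods=3,
    Threshold=1.0,                   # seconds
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)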
Q3: How do you reduce CloudWatch Logs costs?
Cost optimization strategies (an export-to-S3 sketch follows this list):
1. Reduce ingestion:
   - Filter logs at the source (log at INFO, not DEBUG)
   - Use sampling for high-volume logs
2. Optimize retention:
   - Set appropriate retention (7-30 days for most workloads)
   - Export to S3 for long-term storage (cheaper)
3. Use Logs Insights efficiently:
   - Narrow the time range
   - Query specific log groups
   - Cache common queries
4. Consider alternatives:
   - Kinesis Data Firehose → S3 for high volume
   - OpenSearch for complex analysis
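A sketch of the export-plus-shorter-retention pattern, assuming a destination bucket whose policy already allows CloudWatch Logs to write to it; the log group, bucket, and prefix names are placeholders.

import boto3
from datetime import datetime, timedelta

logs = boto3.client('logs')

def ms(dt):
    """Convert a datetime to epoch milliseconds (required by the Logs export API)."""
    return int(dt.timestamp() * 1000)

# Export the last 30 days of logs to S3, then shorten retention on the log group
logs.create_export_task(
    taskName='export-last-30-days',
    logGroupName='/aws/lambda/my-function',
    fromTime=ms(datetime.utcnow() - timedelta(days=30)),
    to=ms(datetime.utcnow()),
    destination='my-log-archive-bucket',     # bucket must allow CloudWatch Logs to write
    destinationPrefix='lambda/my-function'
)
logs.put_retention_policy(logGroupName='/aws/lambda/my-function', retentionInDays=14)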
Q4: How do you set up alerting for a production system?
Alert hierarchy:
Critical (PagerDuty/immediate):
- Service down (health check failures)
- Error rate > 5%
- Latency p99 > 5s
- Security events (root login)
Warning (Slack/email):
- Error rate > 1%
- CPU > 80% sustained
- Disk > 85%
- Approaching service quotas
Informational (dashboard):
- Deployment events
- Scaling events
- Cost anomalies
Best practices (a composite-alarm sketch follows this list):
- Avoid alert fatigue (tune thresholds)
- Use composite alarms
- Include runbook links in alerts
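A minimal sketch of a composite alarm created with boto3 that pages only when both child alarms are in ALARM and carries a runbook link in its description; the alarm names, runbook URL, and topic ARN are assumptions.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Page only when errors and latency breach together, and point to the runbook
cloudwatch.put_composite_alarm(
    AlarmName='checkout-critical',
    AlarmRule='ALARM("lambda-high-error-rate") AND ALARM("alb-p99-latency")',
    AlarmDescription='Checkout degraded. Runbook: https://wiki.example.com/runbooks/checkout',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:pagerduty']
)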
Q5: CloudWatch vs third-party observability tools?
CloudWatch is the native option: no extra vendor to onboard, deep integration with AWS services and IAM, and pay-per-use pricing, but its dashboards and cross-account or cross-cloud views are more limited. Third-party tools such as Datadog, New Relic, or Grafana offer richer visualization, unified views across clouds, and more mature alerting workflows, at the cost of license fees, data egress, and another system to operate. A common pattern is to keep CloudWatch as the system of record for metrics and logs and layer a third-party tool on top when teams need cross-cloud correlation or advanced analytics.
🧪 Hands-On Lab: Build Observability Dashboard
1. Enable X-Ray on Lambda: add the X-Ray SDK and enable active tracing
2. Create a CloudWatch dashboard: add widgets for key metrics (CPU, errors, latency); a dashboard sketch follows this list
3. Set up structured logging: implement JSON logging with correlation IDs
4. Create alarms: set up CPU, error rate, and latency alarms
5. Configure CloudTrail: enable a multi-region trail with CloudWatch Logs
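For step 2, dashboards can be created as code rather than clicked together. A minimal sketch that creates a one-widget dashboard with boto3's put_dashboard; the dashboard name, function name, and region are examples.

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

# One metric widget as a starting point for the lab dashboard (names are examples)
dashboard_body = {
    "widgets": [{
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "metrics": [["AWS/Lambda", "Errors", "FunctionName", "my-function"]],
            "stat": "Sum",
            "period": 300,
            "region": "us-east-1",
            "title": "Lambda Errors"
        }
    }]
}

cloudwatch.put_dashboard(
    DashboardName='observability-lab',
    DashboardBody=json.dumps(dashboard_body)
)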
Next Module
CDN & Edge Services: Master CloudFront, Global Accelerator, and edge computing.