AWS X-Ray Deep Dive

Module Overview

Estimated Time: 3-4 hours | Difficulty: Intermediate | Prerequisites: Lambda, CloudWatch
AWS X-Ray helps you analyze and debug distributed applications. This module covers tracing concepts, instrumentation, and production debugging patterns.

What You’ll Learn:
  • X-Ray concepts: traces, segments, subsegments
  • Instrumenting Lambda, API Gateway, and SDK calls
  • Service maps and analytics
  • Annotations and metadata
  • Sampling rules
  • Integration with CloudWatch and other services

Why X-Ray?

Visualize Request Flow

See how requests traverse your microservices architecture

Find Bottlenecks

Identify which service or dependency is causing latency

Debug Errors

Trace errors back to their source across services

Analyze Performance

Understand latency distributions and trends

X-Ray Concepts

┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Tracing Concepts                               │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   TRACE                                                                 │
│   ─────                                                                 │
│   A unique ID that follows a request through all services              │
│   Trace ID: 1-581cf771-a006649127e371903a2de979                        │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  API Gateway    Lambda: Order    DynamoDB    Lambda: Payment   │  │
│   │  ┌─────────┐    ┌───────────┐    ┌────────┐  ┌──────────────┐  │  │
│   │  │ Segment │───►│  Segment  │───►│Subseg. │  │   Segment    │  │  │
│   │  │  50ms   │    │   200ms   │    │  30ms  │  │    150ms     │  │  │
│   │  └─────────┘    └───────────┘    └────────┘  └──────────────┘  │  │
│   │                      │                                          │  │
│   │                      └── Subsegment: SNS Publish (20ms)        │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   SEGMENT                                                               │
│   ───────                                                               │
│   • Work done by a single service                                      │
│   • Contains timing, metadata, annotations                             │
│   • Auto-created for Lambda, API Gateway, etc.                         │
│                                                                         │
│   SUBSEGMENT                                                            │
│   ──────────                                                            │
│   • Work done within a segment                                         │
│   • AWS SDK calls, HTTP calls, custom operations                       │
│   • You create these in your code                                      │
│                                                                         │
│   ANNOTATION                                                            │
│   ──────────                                                            │
│   • Key-value pairs for filtering traces                               │
│   • Indexed and searchable                                             │
│   • Example: user_id = "user-123"                                      │
│                                                                         │
│   METADATA                                                              │
│   ────────                                                              │
│   • Additional data for debugging                                      │
│   • Not indexed, not searchable                                        │
│   • Example: full request/response body                                │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
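
To make the trace ID format and the segment structure concrete, here is a minimal sketch using only the standard library. It assumes the documented trace ID layout (version `1`, 8 hex digits of epoch seconds, 24 random hex digits). The helper names and the `order-lambda` service name are illustrative and not part of any SDK.

```python
import os
import time
from typing import Optional


def new_trace_id(now: Optional[float] = None) -> str:
    """Build a trace ID: version, hex epoch seconds, 24 random hex digits."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def trace_start_epoch(trace_id: str) -> int:
    """Recover the epoch-seconds timestamp embedded in a trace ID."""
    version, epoch_hex, unique = trace_id.split("-")
    if version != "1" or len(epoch_hex) != 8 or len(unique) != 24:
        raise ValueError(f"not an X-Ray trace ID: {trace_id!r}")
    return int(epoch_hex, 16)


# A hand-built segment with one subsegment, mirroring the diagram above.
segment = {
    "name": "order-lambda",                    # hypothetical service name
    "id": os.urandom(8).hex(),                 # 16 hex digit segment ID
    "trace_id": new_trace_id(),
    "start_time": 1478293361.0,
    "end_time": 1478293361.2,
    "annotations": {"user_id": "user-123"},    # indexed, usable in filters
    "metadata": {"debug": {"note": "large payloads go here"}},  # not indexed
    "subsegments": [
        {
            "name": "DynamoDB",
            "id": os.urandom(8).hex(),
            "start_time": 1478293361.05,
            "end_time": 1478293361.08,
        },
    ],
}
```

Decoding the example ID from the diagram with `trace_start_epoch("1-581cf771-a006649127e371903a2de979")` gives `1478293361`, the moment that trace started.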

Service Map

┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Service Map                                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Auto-generated visualization of your architecture:                   │
│                                                                         │
│   ┌──────────┐                                                         │
│   │ Clients  │                                                         │
│   │  (web)   │                                                         │
│   └────┬─────┘                                                         │
│        │                                                                │
│        ▼                                                                │
│   ┌──────────┐         ┌───────────────┐                               │
│   │   API    │────────►│  Order        │                               │
│   │ Gateway  │         │  Lambda       │                               │
│   │ 99.8%    │         │  avg: 125ms   │                               │
│   │ 45ms avg │         │  99.5% OK     │                               │
│   └──────────┘         └───────┬───────┘                               │
│                                │                                        │
│                ┌───────────────┼───────────────┐                       │
│                │               │               │                       │
│                ▼               ▼               ▼                       │
│   ┌───────────────┐  ┌───────────────┐  ┌───────────────┐             │
│   │   DynamoDB    │  │     SNS       │  │   Payment     │             │
│   │   Orders      │  │   Notify      │  │   Lambda      │             │
│   │   avg: 8ms    │  │   avg: 15ms   │  │   avg: 200ms  │             │
│   │   100% OK     │  │   100% OK     │  │   98% OK      │             │
│   └───────────────┘  └───────────────┘  └───────────────┘             │
│                                                 │                       │
│                                                 ▼                       │
│                                    ┌───────────────────┐               │
│                                    │   Stripe API      │               │
│                                    │   (external)      │               │
│                                    │   avg: 180ms      │               │
│                                    │   97% OK          │               │
│                                    └───────────────────┘               │
│                                                                         │
│   Color coding:                                                         │
│   🟢 Green: Successful calls                                           │
│   🟡 Yellow: Client errors (4xx)                                       │
│   🔴 Red: Server faults (5xx)                                          │
│   🟣 Purple: Throttling (429)                                          │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
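
The per-node numbers on the map (average latency, success rate) can be derived from aggregate statistics. The sketch below assumes a node shaped like one entry of the `Services` list returned by the X-Ray `GetServiceGraph` API, with `TotalResponseTime` taken to be in seconds. The sample values are invented to match the Payment Lambda node above.

```python
def summarize_node(node: dict) -> dict:
    """Turn raw counts into the average latency and success rate shown on the map."""
    stats = node["SummaryStatistics"]
    total = stats["TotalCount"]
    if total == 0:
        return {"name": node["Name"], "avg_ms": 0.0, "ok_pct": 0.0}
    return {
        "name": node["Name"],
        # Assumption: TotalResponseTime is the summed response time in seconds
        "avg_ms": round(stats["TotalResponseTime"] / total * 1000, 1),
        "ok_pct": round(stats["OkCount"] / total * 100, 1),
    }


# Invented sample data, not real API output
sample_node = {
    "Name": "payment-lambda",
    "SummaryStatistics": {
        "OkCount": 980,
        "ErrorStatistics": {"TotalCount": 5},    # 4xx
        "FaultStatistics": {"TotalCount": 15},   # 5xx
        "TotalCount": 1000,
        "TotalResponseTime": 200.0,
    },
}

summary = summarize_node(sample_node)
```

With this sample, `summary` reports an average of 200.0 ms and 98.0% successful calls.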

Instrumenting Lambda

Automatic Instrumentation

# Enable X-Ray in Lambda configuration (or SAM/CDK)
# Lambda automatically creates segment for each invocation

# SAM template example
"""
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Tracing: Active  # This enables X-Ray
"""

# CDK example
"""
const fn = new lambda.Function(this, 'MyFunction', {
  // runtime, handler, and code props omitted for brevity
  tracing: lambda.Tracing.ACTIVE,
});
"""

Manual Instrumentation with SDK

from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, requests, httplib, etc.)
patch_all()

# Or patch specific libraries
from aws_xray_sdk.core import patch
patch(['boto3', 'requests'])

import boto3
import requests

# boto3 and requests calls are now automatically traced
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Orders')

def lambda_handler(event, context):
    order_id = event.get('order_id')

    # The segment Lambda creates is read-only, so attach annotations and
    # metadata to a subsegment that this code owns
    with xray_recorder.in_subsegment('handle_order') as subsegment:
        # Add annotation (indexed, searchable)
        subsegment.put_annotation('order_id', order_id)
        subsegment.put_annotation('user_type', 'premium')

        # Add metadata (not indexed, for debugging)
        subsegment.put_metadata('event', event)

        # DynamoDB call is automatically traced as subsegment
        result = table.get_item(Key={'order_id': order_id})

        # External HTTP call is automatically traced
        response = requests.get('https://api.example.com/validate', timeout=5)

    return result.get('Item')

Custom Subsegments

from aws_xray_sdk.core import xray_recorder
import time

@xray_recorder.capture('process_order')
def process_order(order: dict):
    """Function decorator creates subsegment automatically."""
    
    # All code here is captured in 'process_order' subsegment
    validate_order(order)
    calculate_total(order)
    
    return order

def complex_operation(data: dict):
    """Manual subsegment for granular tracing."""
    
    # Create custom subsegment
    with xray_recorder.in_subsegment('data_transformation') as subsegment:
        subsegment.put_annotation('data_size', len(data))
        
        # Nested subsegment
        with xray_recorder.in_subsegment('step_1_parse'):
            parsed = parse_data(data)
        
        with xray_recorder.in_subsegment('step_2_transform'):
            transformed = transform_data(parsed)
        
        with xray_recorder.in_subsegment('step_3_validate'):
            validated = validate_data(transformed)
    
    return validated

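import traceback

def async_operation():
    """Begin and end a subsegment manually, recording any exception."""

    subsegment = xray_recorder.begin_subsegment('async_task')
    try:
        # Your async work
        result = do_async_work()
        subsegment.put_metadata('result', result)
    except Exception as e:
        # add_exception expects the stack trace as its second argument
        subsegment.add_exception(e, traceback.extract_tb(e.__traceback__))
        raise
    finally:
        # Always close the subsegment, even on failure
        xray_recorder.end_subsegment()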

Tracing Across Services

┌────────────────────────────────────────────────────────────────────────┐
│                    Cross-Service Tracing                                │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Trace ID is passed via HTTP headers:                                 │
│   X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1 │
│                                                                         │
│   Service A (Lambda)                                                    │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  # Trace ID automatically passed to downstream                  │  │
│   │  response = requests.post(                                      │  │
│   │      'https://service-b.example.com/api',                       │  │
│   │      json=data                                                   │  │
│   │  )                                                               │  │
│   │  # X-Ray SDK adds trace header automatically (when patched)     │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                         │                                               │
│                         │ X-Amzn-Trace-Id header                       │
│                         ▼                                               │
│   Service B (ECS/EC2)                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  from aws_xray_sdk.core import xray_recorder                    │  │
│   │  from aws_xray_sdk.ext.flask.middleware import XRayMiddleware   │  │
│   │                                                                  │  │
│   │  app = Flask(__name__)                                          │  │
│   │  XRayMiddleware(app, xray_recorder)  # Auto-extract trace ID    │  │
│   │                                                                  │  │
│   │  @app.route('/api', methods=['POST'])                           │  │
│   │  def api():                                                      │  │
│   │      # This segment links to same trace                         │  │
│   │      return process_request(request.json)                        │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   SQS Integration:                                                      │
│   ─────────────────                                                    │
│   • SQS passes trace header in system attribute: AWSTraceHeader        │
│   • Lambda extracts it automatically                                   │
│   • For custom consumers, extract and set segment parent               │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
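
For a custom consumer, the trace header has to be read and split by hand. The following sketch parses the `X-Amzn-Trace-Id` format shown above and pulls `AWSTraceHeader` out of an SQS record shaped like the ones Lambda delivers (system attributes under the `attributes` key). The function names are illustrative, not SDK calls.

```python
from typing import Optional


def parse_trace_header(header: str) -> dict:
    """Split 'Root=...;Parent=...;Sampled=1' into a dict of its fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # ignore fragments without '='
            fields[key] = value
    return fields


def trace_header_from_sqs_record(record: dict) -> Optional[str]:
    """Return the AWSTraceHeader system attribute of an SQS record, if present."""
    return record.get("attributes", {}).get("AWSTraceHeader")


# Sample record with only the fields this sketch needs
record = {
    "messageId": "example-message-id",
    "body": "{}",
    "attributes": {
        "AWSTraceHeader": (
            "Root=1-5759e988-bd862e3fe1be46a994272793;"
            "Parent=53995c3f42cd8ad8;Sampled=1"
        ),
    },
}

header = trace_header_from_sqs_record(record)
context = parse_trace_header(header) if header else {}
is_sampled = context.get("Sampled") == "1"
```

A consumer would then start its own segment with `context["Root"]` as the trace ID and `context["Parent"]` as the parent ID, so its work joins the producer's trace.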

Sampling Rules

┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Sampling                                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Why Sample?                                                           │
│   ────────────                                                         │
│   • Reduce costs (X-Ray charges per trace)                             │
│   • Reduce noise in high-volume systems                                │
│   • Still get statistical significance                                  │
│                                                                         │
│   Default Rule:                                                         │
│   ──────────────                                                       │
│   • First request each second: Always traced (reservoir = 1)          │
│   • Additional requests: 5% sampled                                    │
│                                                                         │
│   Custom Sampling Rules:                                                │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │  {                                                               │  │
│   │    "version": 2,                                                 │  │
│   │    "rules": [                                                    │  │
│   │      {                                                           │  │
│   │        "description": "High priority for checkout",              │  │
│   │        "host": "*",                                              │  │
│   │        "http_method": "*",                                       │  │
│   │        "url_path": "/checkout/*",                                │  │
│   │        "fixed_target": 10,     # 10 per second                   │  │
│   │        "rate": 1.0,            # 100% of remaining               │  │
│   │        "service_name": "*",                                      │  │
│   │        "service_type": "*",                                      │  │
│   │        "priority": 1                                             │  │
│   │      },                                                          │  │
│   │      {                                                           │  │
│   │        "description": "Low volume for health checks",            │  │
│   │        "host": "*",                                              │  │
│   │        "http_method": "GET",                                     │  │
│   │        "url_path": "/health",                                    │  │
│   │        "fixed_target": 1,                                        │  │
│   │        "rate": 0.01,           # 1%                              │  │
│   │        "priority": 2                                             │  │
│   │      }                                                           │  │
│   │    ],                                                            │  │
│   │    "default": {                                                  │  │
│   │      "fixed_target": 1,                                          │  │
│   │      "rate": 0.05              # 5% default                      │  │
│   │    }                                                             │  │
│   │  }                                                               │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
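
The reservoir-plus-rate behaviour and rule priority can be sketched in a few lines. This is a simplified local model, not the SDK's sampler: it uses `fnmatchcase` as a stand-in for X-Ray's wildcard matching, treats the lowest priority number as the winner, and takes the clock and random source as parameters so the result is repeatable.

```python
import random
from fnmatch import fnmatchcase


class ReservoirSampler:
    """Trace the first `fixed_target` requests each second, then a fraction `rate`."""

    def __init__(self, fixed_target: int, rate: float, rng=random.random):
        self.fixed_target = fixed_target
        self.rate = rate
        self._rng = rng
        self._second = None
        self._used = 0

    def should_sample(self, now: float) -> bool:
        second = int(now)
        if second != self._second:       # new second: refill the reservoir
            self._second, self._used = second, 0
        if self._used < self.fixed_target:
            self._used += 1
            return True
        return self._rng() < self.rate   # beyond the reservoir: percentage


def pick_rule(rules: list, method: str, path: str):
    """Return the matching rule with the lowest priority number, or None."""
    matches = [
        rule for rule in rules
        if fnmatchcase(method, rule["http_method"])
        and fnmatchcase(path, rule["url_path"])
    ]
    return min(matches, key=lambda rule: rule["priority"], default=None)


# Illustrative rules in the same shape as the JSON above
rules = [
    {"description": "health checks", "http_method": "GET",
     "url_path": "/health", "fixed_target": 1, "rate": 0.01, "priority": 2},
    {"description": "everything else", "http_method": "*",
     "url_path": "*", "fixed_target": 1, "rate": 0.05, "priority": 9},
]

sampler = ReservoirSampler(fixed_target=1, rate=0.0)
decisions = [sampler.should_sample(100.0 + i / 10) for i in range(10)]
```

With `rate=0.0`, only the single reservoir slot is used, so exactly one of the ten requests in that second is traced. The next second gets a fresh slot.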

X-Ray Analytics

Filter Expressions

┌────────────────────────────────────────────────────────────────────────┐
│                    X-Ray Analytics Queries                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Find traces by:                                                       │
│   ───────────────                                                      │
│                                                                         │
│   # By annotation (custom indexed data)                                 │
│   annotation.user_id = "user-123"                                      │
│   annotation.order_id = "ORD-456"                                      │
│                                                                         │
│   # By response time                                                    │
│   responsetime > 5                    # Slower than 5 seconds          │
│   duration > 2 AND duration < 5       # Between 2-5 seconds            │
│                                                                         │
│   # By status                                                           │
│   http.status = 500                                                    │
│   fault = true                        # Any 5xx error                  │
│   error = true                        # Any 4xx error                  │
│   throttle = true                     # 429 errors                     │
│                                                                         │
│   # By service                                                          │
│   service("order-service")                                             │
│   service("payment-lambda") AND fault = true                           │
│                                                                         │
│   # By URL                                                              │
│   http.url CONTAINS "/api/orders"                                      │
│   http.method = "POST"                                                 │
│                                                                         │
│   # Combined queries                                                    │
│   annotation.user_type = "premium" AND responsetime > 1                │
│   service("checkout") AND http.status >= 500                           │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
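
When filter expressions are assembled in scripts, a small builder keeps quoting and `AND` joins consistent. The helper below is a hypothetical convenience function covering only a few of the keywords shown above. The resulting string could be passed as `FilterExpression` to the boto3 X-Ray client's `get_trace_summaries` call.

```python
from typing import Optional


def build_filter(
    service: Optional[str] = None,
    fault: Optional[bool] = None,
    min_response_time: Optional[float] = None,
    annotations: Optional[dict] = None,
) -> str:
    """Join the given conditions into one X-Ray filter expression."""
    clauses = []
    if service is not None:
        clauses.append(f'service("{service}")')
    if fault is not None:
        clauses.append(f"fault = {'true' if fault else 'false'}")
    if min_response_time is not None:
        clauses.append(f"responsetime > {min_response_time}")
    for key, value in (annotations or {}).items():
        if isinstance(value, bool):          # check bool before int
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = f'"{value}"'          # strings are quoted
        clauses.append(f"annotation.{key} = {rendered}")
    return " AND ".join(clauses)


faulty_payments = build_filter(service="payment-lambda", fault=True)
slow_premium = build_filter(min_response_time=1,
                            annotations={"user_type": "premium"})
```

Here `faulty_payments` is `service("payment-lambda") AND fault = true`, matching the service example in the box above.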

Trace Analysis

┌────────────────────────────────────────────────────────────────────────┐
│                    Trace Timeline View                                  │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Trace ID: 1-5759e988-bd862e3fe1be46a994272793                        │
│   Duration: 342ms | Segments: 5 | Subsegments: 12                      │
│                                                                         │
│   Timeline:                                                             │
│   0ms        100ms       200ms       300ms       400ms                 │
│   │          │           │           │           │                     │
│   ├──────────────────────────────────────────────┤ API Gateway (50ms)  │
│   │                                                                     │
│   │ ├────────────────────────────────────────┤ Order Lambda (280ms)    │
│   │ │                                                                   │
│   │ │ ├────┤ DynamoDB GetItem (25ms)                                   │
│   │ │      │                                                            │
│   │ │      ├──────────────────────┤ Payment Lambda (150ms)             │
│   │ │      │                      │                                     │
│   │ │      │ ├──────────────────┤ Stripe API (120ms)                   │
│   │ │      │                                                            │
│   │ │      ├────┤ DynamoDB UpdateItem (20ms)                           │
│   │ │           │                                                       │
│   │ │           ├──┤ SNS Publish (15ms)                                │
│   │                                                                     │
│   Legend:                                                               │
│   ████ In progress                                                      │
│   ░░░░ Waiting/Idle                                                     │
│   ▓▓▓▓ Error                                                            │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
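
Reading a timeline comes down to asking where the time is actually spent, not which bar is longest. One way to sketch that is "self time": a node's duration minus the time covered by its children. The example uses the durations from the timeline above and assumes child calls run one after another without overlapping.

```python
from typing import Iterator, Tuple

# Durations in milliseconds, copied from the timeline above
timeline = {
    "name": "Order Lambda", "ms": 280, "children": [
        {"name": "DynamoDB GetItem", "ms": 25, "children": []},
        {"name": "Payment Lambda", "ms": 150, "children": [
            {"name": "Stripe API", "ms": 120, "children": []},
        ]},
        {"name": "DynamoDB UpdateItem", "ms": 20, "children": []},
        {"name": "SNS Publish", "ms": 15, "children": []},
    ],
}


def self_times(node: dict) -> Iterator[Tuple[str, int]]:
    """Yield (name, own time) for a node and all of its descendants."""
    child_total = sum(child["ms"] for child in node["children"])
    yield node["name"], node["ms"] - child_total
    for child in node["children"]:
        yield from self_times(child)


# Largest self time first
ranked = sorted(self_times(timeline), key=lambda item: item[1], reverse=True)
```

The first entry of `ranked` is the Stripe API call at 120 ms. The Order Lambda's own code accounts for only 70 ms of its 280 ms bar.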

Integration Patterns

Lambda with Powertools for AWS Lambda (Python)

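import json

import boto3
from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.utilities.typing import LambdaContext

logger = Logger(service="order-service")
tracer = Tracer(service="order-service")  # patches supported libraries such as boto3 by default
metrics = Metrics(service="order-service")

# Table used by get_order below
table = boto3.resource('dynamodb').Table('Orders')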

@tracer.capture_method
def get_order(order_id: str) -> dict:
    """Method automatically traced as subsegment."""
    
    # Add annotation for filtering
    tracer.put_annotation(key="order_id", value=order_id)
    
    # Get from DynamoDB (auto-traced if boto3 patched)
    response = table.get_item(Key={'order_id': order_id})
    
    return response.get('Item')

@tracer.capture_method
def process_payment(order: dict) -> dict:
    """Payment processing with detailed tracing."""
    
    tracer.put_metadata(key="order_amount", value=order['amount'])
    
    # External call traced (payment_client is a placeholder for your payment SDK client)
    result = payment_client.charge(
        amount=order['amount'],
        customer=order['customer_id']
    )
    
    tracer.put_annotation(key="payment_status", value=result['status'])
    
    return result

@logger.inject_lambda_context
@tracer.capture_lambda_handler
@metrics.log_metrics
def lambda_handler(event: dict, context: LambdaContext) -> dict:
    """Main handler with full observability stack."""
    
    order_id = event['pathParameters']['order_id']
    
    order = get_order(order_id)
    payment = process_payment(order)
    
    return {
        'statusCode': 200,
        'body': json.dumps({'order': order, 'payment': payment})
    }

API Gateway Integration

# SAM template
Resources:
  ApiGateway:
    Type: AWS::Serverless::Api
    Properties:
      StageName: prod
      TracingEnabled: true  # Enable X-Ray on API Gateway
      
  OrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Tracing: Active  # Enable X-Ray on Lambda
      Events:
        GetOrder:
          Type: Api
          Properties:
            RestApiId: !Ref ApiGateway
            Path: /orders/{order_id}
            Method: get

ECS/Fargate Integration

# Dockerfile
"""
FROM python:3.11-slim

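# Install the X-Ray daemon into the app image. On ECS/Fargate the daemon more
# commonly runs as a separate sidecar container in the same task.
RUN apt-get update && apt-get install -y curl unzip
RUN curl -o daemon.zip https://s3.us-east-2.amazonaws.com/aws-xray-assets.us-east-2/xray-daemon/aws-xray-daemon-linux-3.x.zip
RUN unzip daemon.zip && mv xray /usr/local/bin/

COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt

# Start the daemon in the background, then the app
# (-o runs the daemon in local mode, skipping the EC2 instance metadata check)
CMD ["sh", "-c", "xray -o & exec python app.py"]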
"""

# Flask application with X-Ray
from flask import Flask, jsonify
from aws_xray_sdk.core import xray_recorder, patch_all
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

patch_all()

app = Flask(__name__)

# Configure X-Ray for non-Lambda environment
xray_recorder.configure(
    service='order-service',
    daemon_address='127.0.0.1:2000',  # Local daemon
    context_missing='LOG_ERROR'
)

XRayMiddleware(app, xray_recorder)

@app.route('/orders/<order_id>')
def get_order(order_id):
    # Request automatically traced by the middleware
    order = fetch_order(order_id)  # fetch_order: your data-access helper, defined elsewhere
    return jsonify(order)

Debugging with X-Ray

Common Patterns

┌────────────────────────────────────────────────────────────────────────┐
│                    Debugging Patterns                                   │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   1. Finding the Slow Service                                          │
│   ────────────────────────────                                         │
│   Query: responsetime > 2                                              │
│   Look at: Timeline view to see which segment is slowest               │
│                                                                         │
│   2. Tracking Down Errors                                              │
│   ───────────────────────                                              │
│   Query: fault = true AND annotation.order_id = "ORD-123"              │
│   Look at: Exception details in segment                                │
│                                                                         │
│   3. Correlating with CloudWatch Logs                                  │
│   ─────────────────────────────────────                                │
│   • X-Ray trace ID is in Lambda REPORT log                             │
│   • Search logs by trace ID for full context                           │
│   • Use CloudWatch ServiceLens for unified view                        │
│                                                                         │
│   4. Identifying Cold Starts                                           │
│   ──────────────────────────                                           │
│   Query: service("my-lambda") AND annotation.cold_start = true         │
│   (Requires adding annotation in init code)                            │
│                                                                         │
│   5. Downstream Dependency Analysis                                    │
│   ─────────────────────────────────                                    │
│   • Look at external service subsegments                               │
│   • Check for timeouts, errors, slow response times                    │
│   • Stripe API taking 3s? That's your bottleneck!                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
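
Pattern 4 depends on an annotation that your code has to set. A minimal sketch: a module-level flag is true only for the first invocation in an execution environment. The `put_annotation` parameter is a stand-in so the example runs without the X-Ray SDK. In a real function it would be the `put_annotation` method of a subsegment you opened. If you use Powertools Tracer, it already records a `ColdStart` annotation for you.

```python
# Module scope runs once per execution environment, i.e. on a cold start
_cold_start = True


def consume_cold_start() -> bool:
    """Return True on the first call in this environment, False afterwards."""
    global _cold_start
    was_cold = _cold_start
    _cold_start = False
    return was_cold


def lambda_handler(event, context, put_annotation=None):
    is_cold = consume_cold_start()
    if put_annotation is not None:
        # Assumption: put_annotation(key, value) writes to an open subsegment
        put_annotation("cold_start", is_cold)
    return {"cold_start": is_cold}


# Simulate two invocations in the same environment
recorded = []
first = lambda_handler({}, None, lambda key, value: recorded.append((key, value)))
second = lambda_handler({}, None, lambda key, value: recorded.append((key, value)))
```

Only the first invocation is marked as a cold start, so `annotation.cold_start = true` selects exactly those traces.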

Best Practices

Use Annotations Wisely

Add business context (user_id, order_id) for easy filtering

Configure Sampling

Tracing 100% of requests is expensive; set sampling rates by how important each path is

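Capture Failures on Critical Paths

X-Ray decides whether to sample when a request starts, before its outcome is known, so rules cannot target errors directly. Use high sampling rates on critical or failure-prone paths so that most failures are traced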

Set Meaningful Names

Name subsegments clearly for easy timeline reading

Production Checklist

xray_checklist = {
    "instrumentation": [
        "✓ Enable X-Ray on Lambda (Tracing: Active)",
        "✓ Enable X-Ray on API Gateway",
        "✓ Patch AWS SDK and HTTP clients",
        "✓ Add annotations for business context",
        "✓ Create subsegments for key operations",
    ],
    "sampling": [
        "✓ Configure custom sampling rules",
        "✓ High sample rate on critical, failure-prone paths",
        "✓ Lower sample rate for health checks",
    ],
    "analysis": [
        "✓ Set up ServiceLens dashboards",
        "✓ Create CloudWatch alarms on X-Ray metrics",
        "✓ Document common filter expressions",
    ],
    "cost": [
        "✓ Estimate trace volume and costs",
        "✓ Adjust sampling to budget",
        "✓ Monitor X-Ray costs in Cost Explorer",
    ]
}
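
For the cost items in the checklist, a rough volume estimate is simple arithmetic over the sampling settings. The sketch below assumes steady traffic, a single reservoir, and a 30-day month. The price per million traces is a placeholder parameter, so check current X-Ray pricing before relying on the dollar figure.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month


def sampled_per_second(rps: float, fixed_target: int, rate: float) -> float:
    """Traces recorded per second: the reservoir, plus a share of the rest."""
    reservoir = min(rps, fixed_target)
    return reservoir + (rps - reservoir) * rate


def monthly_estimate(rps: float, fixed_target: int, rate: float,
                     usd_per_million: float) -> tuple:
    """Return (traces per month, estimated recording cost in USD)."""
    traces = sampled_per_second(rps, fixed_target, rate) * SECONDS_PER_MONTH
    return traces, traces / 1_000_000 * usd_per_million


# 100 requests/second under the default rule (reservoir 1, 5%);
# 5.0 USD per million traces is an assumed price, not a quoted one
traces, cost = monthly_estimate(rps=100, fixed_target=1, rate=0.05,
                                usd_per_million=5.0)
```

At 100 requests per second the default rule records about 5.95 traces per second, roughly 15.4 million per month. Halving `rate` has far more effect on that number than changing the reservoir.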

🎯 Interview Questions

How does X-Ray differ from CloudWatch Logs and CloudTrail?
X-Ray:
  • Distributed tracing
  • Request flow across services
  • Performance and latency analysis
CloudWatch Logs:
  • Application logs (what your code outputs)
  • Debugging with log messages
  • Metric extraction from logs
CloudTrail:
  • AWS API call history
  • Security auditing (who did what)
  • Compliance and governance
How is trace context propagated between services?
Mechanism:
  • Trace ID passed via X-Amzn-Trace-Id HTTP header
  • For SQS: AWSTraceHeader message attribute
  • SDK automatically propagates when patched
Format:
Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1
  • Root: Trace ID
  • Parent: Parent segment ID
  • Sampled: Whether to trace (1=yes, 0=no)
What is the difference between segments and subsegments?
Segments:
  • Represent a service/compute unit
  • Created automatically (Lambda, API Gateway)
  • Top-level work unit in a trace
Subsegments:
  • Work done within a segment
  • You create these (SDK calls, custom operations)
  • Nested, can have parent subsegments
  • Example: DynamoDB call within Lambda
How do you keep X-Ray costs under control?
Strategies:
  1. Sampling rules: Don’t trace everything
    • Up to 100% for critical or failure-prone paths
    • 5-10% for normal requests
    • 1% for health checks
  2. Filter trace types:
    • Skip tracing for certain paths
    • Higher rates for production, lower for dev
  3. Optimize subsegments:
    • Don’t create too many nested subsegments
    • Use meaningful grouping
