
Monitoring & Observability

What You’ll Learn

By the end of this chapter, you’ll understand:
  • What monitoring and observability mean - The difference between knowing your app is broken vs. knowing WHY it’s broken
  • The Three Pillars - Metrics, Logs, and Traces and when to use each one
  • Azure Monitor ecosystem - Application Insights, Log Analytics, and how they work together
  • KQL (Kusto Query Language) - How to query logs to find problems in production
  • Distributed tracing - How to track a single user request across 10 different microservices
  • Cost optimization - How to avoid $10,000/month telemetry bills (yes, this happens!)

Introduction: What is Monitoring & Observability?

Start Here if You’re Completely New

Monitoring = Knowing your application is broken.
Observability = Knowing WHY your application is broken.

Think of it like a car.

Monitoring (Dashboard lights):
  • Check Engine Light ✅ (Something is wrong!)
  • Temperature Gauge: HIGH ⚠️ (Engine is overheating!)
  • Fuel Gauge: EMPTY ⚠️ (You’re out of gas!)
What you know: The car has a problem.
What you DON’T know: Why the engine is overheating (radiator leak? broken fan? low coolant?)

Observability (Diagnostic tools):
  • Metrics: Temperature readings every second (shows temperature spiked at 3:15 PM)
  • Logs: “Coolant level low” warning at 3:10 PM → “Fan belt broke” error at 3:14 PM
  • Traces: Complete timeline from “fan belt snapped” → “fan stopped” → “engine overheated” → “check engine light”
Result: You know EXACTLY what failed, when, and why!

Why Observability Matters (Real-World Example)

The $2 Million Bug (True Story)

Scenario: E-commerce site during a Black Friday sale

What happened:
  • 3:00 PM: Sales suddenly drop 90%
  • 3:05 PM: CEO calls: “FIX IT NOW!”
  • 3:30 PM: Still debugging…
  • 4:00 PM: Finally found the bug (payment API timeout)
  • Total downtime: 1 hour
  • Lost revenue: $2,000,000
Without Observability:
Step 1: Check if servers are running (10 min)
Step 2: Check database connections (15 min)
Step 3: Restart application (10 min - didn't help)
Step 4: Check payment gateway logs (20 min - found it!)
Step 5: Fix payment timeout setting (5 min)
Total time: 60 minutes ❌
With Observability:
Step 1: Open Application Insights dashboard (30 sec)
Step 2: See payment API latency spiked from 200ms → 5000ms (1 min)
Step 3: View distributed trace showing payment gateway timeout (2 min)
Step 4: Fix payment timeout setting (5 min)
Total time: 8.5 minutes ✅
Saved: $1.8 million in revenue
Cost of observability tools: ~$500/month
ROI: roughly 3,600x return on investment

The Three Pillars of Observability (Explained Simply)

Real-World Analogy: Investigating a Crime

Crime scene: Your application crashed.

1. Metrics = Security Camera (Numbers over time)

What you see:
  • 3:14:23 PM: 100 people in store
  • 3:14:25 PM: 95 people in store
  • 3:14:27 PM: 0 people in store ← Everyone left suddenly!
What it tells you: WHEN something happened (3:14:27 PM)
What it DOESN’T tell you: WHY everyone left

Example metrics:
  • CPU usage: 25% → 75% → 100% ← CPU spiked!
  • Request rate: 1000/sec → 500/sec → 0/sec ← Requests dropped!
  • Error rate: 1% → 5% → 25% ← Errors increased!
2. Logs = Witness Statements (Discrete events)

What you see:
3:14:20 PM: "Customer #1234 entered checkout"
3:14:22 PM: "Payment gateway timeout after 5 seconds"
3:14:23 PM: "Error: Cannot process payment - gateway unavailable"
3:14:25 PM: "Customer #1234 abandoned cart"
What it tells you: Exactly WHAT happened (payment gateway timed out)
What it DOESN’T tell you: WHICH service caused the timeout

3. Traces = Detective’s Timeline (Request journey)

What you see:
Request ID: abc123 (Customer #1234's checkout)

Frontend (50ms)
  ↓
API Gateway (20ms)
  ↓
Order Service (100ms)
  ↓
Payment Service (5000ms) ← BOTTLENECK!
  ↓ Timeout (payment gateway never responded)

Total time: 5,170ms (should be ~200ms)
What it tells you: WHERE the problem occurred (Payment Service → payment gateway)

How Azure Monitor Works (Behind the Scenes)

The Complete Picture

Your Application:
Web App (Frontend)
  ↓ sends telemetry
Application Insights
  ↓ stores data in
Log Analytics Workspace
  ↓ you query using
KQL (Kusto Query Language)
  ↓ creates
Dashboards & Alerts
Think of it like a security system:
  1. Sensors (Your application with Application Insights SDK)
    • Motion sensors = Telemetry in your code
    • Automatically detect: HTTP requests, database queries, exceptions
  2. Recording device (Log Analytics Workspace)
    • DVR that stores all footage
    • Stores: Metrics, logs, traces for 30-90 days
  3. Monitoring screen (Azure Portal dashboards)
    • View live feeds
    • See alerts when motion detected
  4. Alert system (Azure Monitor Alerts)
    • Calls police when break-in detected
    • Sends email/SMS when error rate > 5%

Cost of Observability (Real Numbers)

Before You Start: Understand the Costs

Azure Monitor pricing (as of 2024):

1. Data Ingestion (getting data IN):
  • First 5 GB/month: FREE ✅
  • After 5 GB: $2.76/GB
2. Data Retention (Keeping data):
  • First 31 days: FREE ✅
  • After 31 days: $0.12/GB/month
Real-World Example: E-commerce App

Scenario:
  • 1 million requests/day
  • Each request generates ~5 KB telemetry
  • Total data: 5,000,000 KB/day = ~5 GB/day = 150 GB/month
Cost Calculation:
Monthly ingestion: 150 GB
- First 5 GB free: $0
- Remaining 145 GB × $2.76 = $400.20/month
Data retention (90 days):
- First 31 days: $0
- Days 32-90: 150 GB × $0.12 × 2 months = $36/month
Total: ~$436/month
Optimization (Enable sampling at 10%):
Monthly ingestion: 15 GB (90% reduction!)
- First 5 GB free: $0
- Remaining 10 GB × $2.76 = $27.60/month
Data retention: 15 GB × $0.12 × 2 = $3.60/month
Total: ~$31/month (93% savings!)
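
To experiment with these numbers, the arithmetic above can be wrapped in a small helper. This is an illustrative sketch only, hard-coding the prices quoted in this section (not an official pricing API):

// Rough Azure Monitor cost estimate based on the prices quoted above (illustrative only)
static double EstimateMonthlyCost(double gbPerDay, double samplingPercentage, int retentionDays)
{
    const double freeIngestionGb = 5.0;       // free ingestion per month
    const double ingestionPricePerGb = 2.76;  // $/GB ingested
    const double retentionPricePerGb = 0.12;  // $/GB/month beyond 31 days

    double monthlyGb = gbPerDay * 30 * (samplingPercentage / 100.0);
    double ingestion = Math.Max(0, monthlyGb - freeIngestionGb) * ingestionPricePerGb;
    double retention = monthlyGb * retentionPricePerGb * Math.Max(0, retentionDays - 31) / 30.0;

    return ingestion + retention;
}

// EstimateMonthlyCost(5, 100, 90) ≈ $436; EstimateMonthlyCost(5, 10, 90) ≈ $31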
Cost for Typical Apps:
  • Small app (10k requests/day): ~$5-20/month
  • Medium app (1M requests/day): ~$30-400/month (with/without sampling)
  • Large app (100M requests/day): ~$1,000-10,000/month
[!WARNING] Gotcha: Debug Logs Cost Money!
Developers often enable debug logging in production and forget to turn it off. Example mistake:
logger.LogDebug($"Processing order {orderId} for customer {customerId}");
// Runs 1 million times/day ≈ 5 GB/day ≈ 150 GB/month ≈ $400/month!
Fix: Only log errors/warnings in production, use debug logs in development only.
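
One way to enforce that is an environment-gated log filter at startup. A minimal ASP.NET Core sketch, assuming your code logs under a "MyApp" category prefix (illustrative name):

var builder = WebApplication.CreateBuilder(args);

if (!builder.Environment.IsDevelopment())
{
    // In production, drop Debug/Information for your own categories;
    // Warning and above still flow to Application Insights.
    builder.Logging.AddFilter("MyApp", LogLevel.Warning);
}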

Observability vs Monitoring (The Real Difference)

Monitoring = Pre-defined dashboards
  • You must know what to monitor ahead of time
  • “I’ll track CPU, memory, request rate”
  • Works great for known issues
Observability = Ask any question
  • You can investigate unknown problems
  • “Show me all requests from user X that failed between 3-4 PM”
  • Essential for debugging new issues

Real-World Comparison

Problem: “Checkout is slow for some users”

Monitoring approach (limited):
Dashboard shows:
- Average response time: 250ms ✅ (Looks fine!)
- CPU usage: 40% ✅ (Looks fine!)
- Error rate: 0.5% ✅ (Looks fine!)

Conclusion: Everything looks normal, but users are still complaining!
Observability approach (powerful):
KQL Query:
requests
| where name contains "checkout"
| where duration > 5000  // > 5 seconds
| summarize count() by client_City

Results:
London: 2 slow requests
Tokyo: 1,543 slow requests ← FOUND IT!

Cause: Database replica in Asia is down!
Observability lets you ask questions you didn’t think of when building dashboards.
Azure Monitoring Stack
[!TIP] Jargon Alert: Observability vs Monitoring
Monitoring: Tells you the system is dead. (“CPU is 100%”)
Observability: Tells you why the system is dead. (“The database query from line 42 is hanging.”)
[!WARNING] Gotcha: Log Retention Costs
Log Analytics charges you to ingest data and to keep it. Storing debug logs for 365 days is expensive. Set retention to 30 days for dev/test and use “Data Export” to move old logs to Blob Storage (Archive Tier) for long-term compliance.

1. The Three Pillars of Observability

Metrics

What: Time-series numerical data

Examples:
  • CPU usage: 75%
  • Request rate: 1,000/sec
  • Error rate: 2.5%
  • Response time: 250ms
Use: Real-time monitoring, alerting
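
Besides the metrics collected automatically, you can record your own. A minimal sketch using the Application Insights SDK's pre-aggregated metrics (the metric name and value are illustrative):

using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

// GetMetric() pre-aggregates locally, so only aggregates are sent - cheap at high volume
var telemetryClient = new TelemetryClient(TelemetryConfiguration.CreateDefault());
int pendingOrders = 42; // illustrative value; in practice read this from your queue or store
telemetryClient.GetMetric("PendingOrders").TrackValue(pendingOrders);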

Logs

What: Discrete event records

Examples:
  • “User login failed”
  • “Payment processed: $99.99”
  • “Database connection timeout”
Use: Debugging, troubleshooting
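
In .NET these events are typically written through ILogger with structured placeholders, so each value becomes a queryable field rather than a blob of text. A short sketch (category and values illustrative):

using Microsoft.Extensions.Logging;

// The {TimeoutSeconds} and {OrderId} placeholders become custom dimensions in Log Analytics
using var loggerFactory = LoggerFactory.Create(b => b.AddConsole());
var logger = loggerFactory.CreateLogger("Checkout");

string orderId = "ORD-1234"; // illustrative
logger.LogError("Payment gateway timeout after {TimeoutSeconds}s for order {OrderId}", 5, orderId);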

Traces

What: Request flow across services

Examples:
  • Frontend → API → Database
  • Total time: 450ms
  • DB query took 300ms
Use: Performance analysis, bottleneck identification
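
In .NET, each hop of a trace is represented by an Activity (a span); both the Application Insights SDK and OpenTelemetry build on this. A minimal sketch (source and operation names illustrative):

using System.Diagnostics;

// Each Activity is one span; children started inside it share the same TraceId
var source = new ActivitySource("MyApp.Checkout");

using (var activity = source.StartActivity("ProcessPayment"))
{
    // StartActivity returns null if nothing is listening, hence the null-conditional calls
    activity?.SetTag("order.id", "ORD-1234");
    // ... call the payment gateway here ...
}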

2. Azure Monitor Components

Azure Monitor is the umbrella service: Application Insights collects application telemetry, a Log Analytics Workspace stores it, you query it with KQL, and dashboards and Azure Monitor Alerts act on the results (see “How Azure Monitor Works” above).


3. Application Insights Deep Dive

Enable Application Insights

// Program.cs
var builder = WebApplication.CreateBuilder(args);

// Add Application Insights
builder.Services.AddApplicationInsightsTelemetry();

var app = builder.Build();
Install NuGet:
dotnet add package Microsoft.ApplicationInsights.AspNetCore
Configuration (appsettings.json):
{
  "ApplicationInsights": {
    "ConnectionString": "InstrumentationKey=xxx;IngestionEndpoint=https://xxx"
  }
}

Custom Telemetry

// Track business events
telemetryClient.TrackEvent("OrderPlaced",
    properties: new Dictionary<string, string> {
        { "OrderId", orderId },
        { "CustomerId", customerId }
    },
    metrics: new Dictionary<string, double> {
        { "Amount", amount }
    });

// Query in KQL:
customEvents
| where name == "OrderPlaced"
| summarize totalRevenue=sum(todouble(customMeasurements.Amount)) by bin(timestamp, 1h)
| render timechart
// Track external dependencies (DB, APIs, etc.)
var sql = "SELECT * FROM Orders WHERE CustomerId = @id";

using (var operation = telemetryClient.StartOperation<DependencyTelemetry>("SQL Query"))
{
    operation.Telemetry.Type = "SQL";
    operation.Telemetry.Data = sql;

    try
    {
        var result = await database.QueryAsync(sql);
        operation.Telemetry.Success = true;
    }
    catch (Exception ex)
    {
        operation.Telemetry.Success = false;
        telemetryClient.TrackException(ex);
        throw;
    }
}

4. KQL (Kusto Query Language) Mastery

Essential Queries for Production

// Find slowest requests
requests
| where timestamp > ago(24h)
| summarize
    count=count(),
    avg_duration=avg(duration),
    p50=percentile(duration, 50),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
    by operation_Name
| order by p95 desc

// Find requests slower than SLA
requests
| where duration > 1000  // > 1 second
| project timestamp, name, url, duration, resultCode
| order by duration desc

5. Distributed Tracing

Application Insights automatically correlates telemetry across services using the W3C traceparent header; all requests and dependencies that belong to the same end-to-end transaction share an operation_Id.
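
If you want to surface the correlation id yourself (for example, to return it to a caller or stamp it into logs), you can read it from the current Activity. A minimal ASP.NET Core sketch, assuming W3C tracing (the default in recent SDKs) so the trace id matches operation_Id in KQL:

// Minimal API endpoint that echoes the current trace id back to the client
app.MapGet("/api/orders/{id}", (string id) =>
{
    var traceId = System.Diagnostics.Activity.Current?.TraceId.ToString() ?? "no-active-trace";
    return Results.Ok(new { orderId = id, traceId });
});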

6. Alerting Strategy

# CPU alert
az monitor metrics alert create \
  --name high-cpu-alert \
  --resource-group rg-prod \
  --scopes /subscriptions/.../virtualMachines/vm-web-01 \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action email-action-group

7. Interview Questions

Beginner Level

Q: What is the difference between metrics and logs?

Answer:

Metrics:
  • Numerical time-series data (CPU: 75%, requests: 1000/sec)
  • Cheap to store (aggregated)
  • Real-time monitoring
  • Limited context
Logs:
  • Discrete event records (structured or unstructured)
  • Expensive to store (high volume)
  • Rich context and details
  • Used for debugging
Example: Metrics tell you “error rate is 5%”, logs tell you “User X’s payment failed because of timeout”
Q: What are the three pillars of observability?

Answer:
  1. Metrics: Numerical measurements over time (CPU, memory, request rate)
  2. Logs: Discrete event records (errors, audit trails)
  3. Traces: Request flow across distributed systems
All three are needed for complete observability. Metrics show what is wrong, logs show why, and traces show where in the system.
Q: What is Application Insights and what does it provide?

Answer: Application Insights is Azure’s Application Performance Monitoring (APM) service.

Features:
  • Automatic telemetry collection (requests, dependencies, exceptions)
  • Distributed tracing
  • Application Map (visualize dependencies)
  • Live Metrics Stream (real-time monitoring)
  • Smart Detection (anomaly detection)
Use case: Monitor web applications, detect performance issues, track user behavior

Intermediate Level

Q: How would you debug a slow API endpoint in production?

Answer: Step-by-step approach:
  1. Identify the slow endpoint:
requests
| where timestamp > ago(1h)
| summarize p95=percentile(duration, 95) by operation_Name
| order by p95 desc
  2. Find slow dependencies:
dependencies
| where operation_Name == "/api/orders"
| summarize p95=percentile(duration, 95) by name
| order by p95 desc
  3. View the end-to-end transaction: Use Application Map or search by operation_Id to see the entire request flow.
Q: What is distributed tracing and how does it work?

Answer: Distributed tracing tracks a single user request as it flows through multiple services.

How it works:
  1. Generate TraceId (unique per request)
  2. Each service creates a SpanId (unique per operation)
  3. Pass TraceId and parent SpanId via HTTP headers (traceparent)
  4. All telemetry includes TraceId for correlation
Query all operations in a trace:
union requests, dependencies
| where operation_Id == "abc123"
| project timestamp, itemType, name, duration
| order by timestamp asc

Advanced Level

Q: How would you design observability for a microservices platform?

Answer:

1. Instrumentation:
  • Enable Application Insights on all services
  • Implement distributed tracing
  • Use structured logging (JSON)
2. Metrics:
  • Service-level: Request rate, error rate, latency (p50, p95, p99)
  • Infrastructure: CPU, memory, disk, network
  • Business: Orders/min, revenue/hour, conversion rate
3. Dashboards:
  • Overview: Health of all services (green/yellow/red)
  • Service Detail: Golden signals per service
  • Business: KPIs (revenue, active users, conversion)
4. Alerts:
  • Critical: Service down, high error rate (> 5%)
  • Warning: Degraded performance, resource usage > 80%
5. On-Call Runbooks:
  • Document troubleshooting steps for each alert
  • Include dashboard links, KQL queries
  • Escalation paths
Q: How would you reduce Application Insights telemetry costs?

Answer: Optimization strategies:

1. Enable Sampling:
// Adaptive sampling automatically throttles telemetry volume under load
services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = true;
});
// For a fixed rate instead (e.g. keep ~10% of telemetry, a ~90% reduction),
// configure sampling on the telemetry processor chain.
2. Filter Unnecessary Telemetry:
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Don't send health check requests
public class FilterHealthCheckProcessor : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public FilterHealthCheckProcessor(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        if (item is RequestTelemetry request &&
            request.Url?.AbsolutePath == "/health")
        {
            return; // Drop health check telemetry instead of forwarding it
        }
        _next.Process(item);
    }
}
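
To make the SDK actually run this processor, register it at startup. With the ASP.NET Core integration package this is one line:

// Program.cs - the processor is inserted into the telemetry pipeline for every item
builder.Services.AddApplicationInsightsTelemetry();
builder.Services.AddApplicationInsightsTelemetryProcessor<FilterHealthCheckProcessor>();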
3. Reduce Retention:
# Set retention to 30 days (vs 90 default)
az monitor app-insights component update \
  --app myapp \
  --retention-time 30
Expected Savings: 70-90% reduction

8. Best Practices

Structured Logging

Use structured logs (JSON) for easier querying. Include context (user ID, correlation ID).
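
If you also ship logs to stdout (for containers or a sidecar collector), the built-in JSON console formatter keeps them machine-readable. A minimal ASP.NET Core sketch:

// Emit console logs as JSON so fields can be parsed instead of regexed out of text
builder.Logging.AddJsonConsole(options =>
{
    options.IncludeScopes = true; // scopes often carry correlation ids and user context
});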

Correlation IDs

Track requests across services with operation_Id. Essential for distributed tracing.

Sample High-Volume Data

Enable adaptive sampling for high-traffic apps to control costs while preserving insights.

Monitor SLIs/SLOs

Define Service Level Indicators (latency, error rate) and Objectives (99.9% uptime).

Alert Runbooks

Every alert needs a runbook: What it means, how to troubleshoot, escalation path.

Test Observability

Regularly test your monitoring: Can you detect and diagnose issues quickly?

9. Key Takeaways

Three Pillars

Metrics, Logs, and Traces work together for complete observability.

Distributed Tracing

Application Insights automatically traces requests across services using correlation IDs.

KQL is Essential

Master KQL for querying logs, creating dashboards, and building alerts.

Golden Signals

Monitor Latency, Traffic, Errors, and Saturation for system health.

Smart Alerts

Alert on symptoms (error rate), not causes (CPU). Reduce alert fatigue.

Cost Optimization

Use sampling, filter noise, and reduce retention to control telemetry costs.

Next Steps

Continue to Chapter 10

Master Azure security, compliance, and governance