Monitoring & Observability

What You’ll Learn

By the end of this chapter, you’ll understand:

What monitoring and observability mean - The difference between knowing your app is broken vs. knowing WHY it’s broken
The Three Pillars - Metrics, Logs, and Traces and when to use each one
Azure Monitor ecosystem - Application Insights, Log Analytics, and how they work together
KQL (Kusto Query Language) - How to query logs to find problems in production
Distributed tracing - How to track a single user request across 10 different microservices
Cost optimization - How to avoid $10,000/month telemetry bills (yes, this happens!)

Introduction: What is Monitoring & Observability?

Start Here if You’re Completely New

Monitoring = Knowing your application is broken
Observability = Knowing WHY your application is broken

Think of it like a car:

Monitoring (Dashboard lights):

Check Engine Light ✅ (Something is wrong!)
Temperature Gauge: HIGH ⚠️ (Engine is overheating!)
Fuel Gauge: EMPTY ⚠️ (You’re out of gas!)

What you know: The car has a problem
What you DON’T know: Why the engine is overheating (radiator leak? broken fan? low coolant?)

Observability (Diagnostic tools):

Metrics: Temperature readings every second (shows temperature spiked at 3:15 PM)
Logs: “Coolant level low” warning at 3:10 PM → “Fan belt broke” error at 3:14 PM
Traces: Complete timeline from “fan belt snapped” → “fan stopped” → “engine overheated” → “check engine light”

Result: You know EXACTLY what failed, when, and why!

Why Observability Matters (Real-World Example)

The $2 Million Bug (True Story)

Scenario: E-commerce site during Black Friday sale

What Happened:

3:00 PM: Sales suddenly drop 90%
3:05 PM: CEO calls: “FIX IT NOW!”
3:30 PM: Still debugging…
4:00 PM: Finally found the bug (payment API timeout)
Total downtime: 1 hour
Lost revenue: $2,000,000

Without Observability:

Step 1: Check if servers are running (10 min)
Step 2: Check database connections (15 min)
Step 3: Restart application (10 min - didn't help)
Step 4: Check payment gateway logs (20 min - found it!)
Step 5: Fix payment timeout setting (5 min)
Total time: 60 minutes ❌

With Observability:

Step 1: Open Application Insights dashboard (30 sec)
Step 2: See payment API latency spiked from 200ms → 5000ms (1 min)
Step 3: View distributed trace showing payment gateway timeout (2 min)
Step 4: Fix payment timeout setting (5 min)
Total time: 8.5 minutes ✅
Saved: ~$1.7 million in revenue (8.5 minutes of downtime instead of 60, at the same loss rate)

Cost of observability tools: ~$500/month
ROI: roughly 3,400x on that month's spend

The Three Pillars of Observability (Explained Simply)

Real-World Analogy: Investigating a Crime

Crime Scene: Your application crashed

1. Metrics = Security Camera (Numbers over time)

What you see:

3:14:23 PM: 100 people in store
3:14:25 PM: 95 people in store
3:14:27 PM: 0 people in store ← Everyone left suddenly!

What it tells you: WHEN something happened (3:14:27 PM)
What it DOESN’T tell you: WHY everyone left

Example metrics:

CPU usage: 25% → 75% → 100% ← CPU spiked!
Request rate: 1000/sec → 500/sec → 0/sec ← Requests dropped!
Error rate: 1% → 5% → 25% ← Errors increased!

2. Logs = Witness Statements (Discrete events)

What you see:

2:20 PM: "Customer #1234 entered checkout"
2:22 PM: "Payment gateway timeout after 5 seconds"
2:23 PM: "Error: Cannot process payment - gateway unavailable"
2:25 PM: "Customer #1234 abandoned cart"

What it tells you: Exactly WHAT happened (payment gateway timed out)
What it DOESN’T tell you: Which service caused the timeout?

3. Traces = Detective’s Timeline (Request journey)

What you see:

Request ID: abc123 (Customer #1234's checkout)

Frontend (50ms)
  ↓
API Gateway (20ms)
  ↓
Order Service (100ms)
  ↓
Payment Service (5000ms) ← BOTTLENECK!
  ↓ Timeout (payment gateway never responded)

Total time: 5,170ms (should be ~200ms)

What it tells you: WHERE the problem occurred (Payment Service → payment gateway)
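The timeline above can be read programmatically. The following is a minimal sketch, assuming each hop is reduced to a service name and a duration; the `Span` class and `find_bottleneck` function are hypothetical names for illustration, not part of any tracing SDK.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hop of a request; names and durations mirror the timeline above."""
    service: str
    duration_ms: int


def find_bottleneck(spans: list[Span]) -> tuple[Span, float]:
    """Return the slowest span and its share of the total request time."""
    total = sum(s.duration_ms for s in spans)
    slowest = max(spans, key=lambda s: s.duration_ms)
    return slowest, slowest.duration_ms / total


trace = [
    Span("Frontend", 50),
    Span("API Gateway", 20),
    Span("Order Service", 100),
    Span("Payment Service", 5000),
]

slowest, share = find_bottleneck(trace)
total_ms = sum(s.duration_ms for s in trace)
print(f"Total: {total_ms} ms; slowest: {slowest.service} ({share:.0%} of the request)")
```

A real trace is a tree of nested spans rather than a flat list, so a production version would compare each span's own time, excluding its children.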

How Azure Monitor Works (Behind the Scenes)

The Complete Picture

Your Application:

Web App (Frontend)
  ↓ sends telemetry
Application Insights
  ↓ stores data in
Log Analytics Workspace
  ↓ you query using
KQL (Kusto Query Language)
  ↓ creates
Dashboards & Alerts

Think of it like a security system:

Sensors (Your application with Application Insights SDK)
- Motion sensors = Telemetry in your code
- Automatically detect: HTTP requests, database queries, exceptions
Recording device (Log Analytics Workspace)
- DVR that stores all footage
- Stores: Metrics, logs, traces for 30-90 days
Monitoring screen (Azure Portal dashboards)
- View live feeds
- See alerts when motion detected
Alert system (Azure Monitor Alerts)
- Calls police when break-in detected
- Sends email/SMS when error rate > 5%

Cost of Observability (Real Numbers)

Before You Start: Understand the Costs

Azure Monitor Pricing (as of 2024):

1. Data Ingestion (Getting data IN):

First 5 GB/month: FREE ✅
After 5 GB: $2.76/GB

2. Data Retention (Keeping data):

First 31 days: FREE ✅
After 31 days: $0.12/GB/month

Real-World Example: E-commerce App

Scenario:

1 million requests/day
Each request generates ~5 KB telemetry
Total data: 5,000,000 KB/day = ~5 GB/day = 150 GB/month

Cost Calculation:

Monthly ingestion: 150 GB
- First 5 GB free: $0
- Remaining 145 GB × $2.76 = $400.20/month
Data retention (90 days):
- First 31 days: $0
- Days 32-90: 150 GB × $0.12 × 2 months = $36/month
Total: ~$436/month

Optimization (Enable sampling at 10%):

Monthly ingestion: 15 GB (90% reduction!)
- First 5 GB free: $0
- Remaining 10 GB × $2.76 = $27.60/month
Data retention: 15 GB × $0.12 × 2 = $3.60/month
Total: ~$31/month (93% savings!)
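The two calculations above follow the same formula, so they can be captured in a small estimator. This is a sketch using only the prices quoted in this section; the function name and constants are illustrative, and actual Azure pricing varies by region and plan.

```python
FREE_GB = 5             # free ingestion allowance per month (figure from the text)
INGEST_PER_GB = 2.76    # USD per GB after the free allowance
RETAIN_PER_GB = 0.12    # USD per GB per month after the free 31 days


def monthly_cost(gb_per_month: float, extra_retention_months: int = 2) -> float:
    """Estimate the monthly bill: ingestion plus retention beyond the free period."""
    ingestion = max(gb_per_month - FREE_GB, 0) * INGEST_PER_GB
    retention = gb_per_month * RETAIN_PER_GB * extra_retention_months
    return round(ingestion + retention, 2)


full = monthly_cost(150)                 # no sampling
sampled = monthly_cost(150 * 10 / 100)   # 10% sampling
print(full, sampled, f"{1 - sampled / full:.0%} saved")
```

Plugging in your own daily volume before enabling telemetry gives a rough budget figure; the `max(..., 0)` guard keeps small apps under the free allowance from producing a negative ingestion cost.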

Cost for Typical Apps:

Small app (10k requests/day): ~$5-20/month
Medium app (1M requests/day): ~$30-400/month (with/without sampling)
Large app (100M requests/day): ~$1,000-10,000/month

[!WARNING] Gotcha: Debug Logs Cost Money!

Developers often enable debug logging in production and forget to turn it off. Example mistake:

logger.LogDebug($"Processing order {orderId} for customer {customerId}");
// Runs 1 million times/day. At the ~5 KB per telemetry item assumed above, that is ~5 GB/day, or about $400/month!

Fix: Log only warnings and errors in production; keep debug logs for development.
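One way to apply this fix is to set the log level from the environment. The sketch below uses Python's standard `logging` module; the `CountingHandler` class and the environment names are assumptions made for illustration, standing in for whatever handler ships records to your telemetry backend.

```python
import logging


class CountingHandler(logging.Handler):
    """Counts the records that would actually be shipped to the telemetry backend."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        self.count += 1


def configure_logger(environment: str) -> tuple[logging.Logger, CountingHandler]:
    """DEBUG and above in development, WARNING and above in production."""
    logger = logging.getLogger(f"orders.{environment}")
    logger.setLevel(logging.WARNING if environment == "production" else logging.DEBUG)
    logger.propagate = False
    handler = CountingHandler()
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger, handler


prod_logger, prod_handler = configure_logger("production")
for order_id in range(1000):
    prod_logger.debug("Processing order %s", order_id)  # filtered out by the level check
prod_logger.error("Payment gateway timeout")            # still recorded

dev_logger, dev_handler = configure_logger("development")
dev_logger.debug("Processing order %s", 1)              # recorded in development
```

Because the level check happens before the record is created, the suppressed debug calls cost almost nothing at runtime and nothing in ingestion.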

Observability vs Monitoring (The Real Difference)

Monitoring = Pre-defined dashboards

You must know what to monitor ahead of time
“I’ll track CPU, memory, request rate”
Works great for known issues

Observability = Ask any question

You can investigate unknown problems
“Show me all requests from user X that failed between 3-4 PM”
Essential for debugging new issues

Real-World Comparison

Problem: “Checkout is slow for some users”

Monitoring approach (limited):

Dashboard shows:
- Average response time: 250ms ✅ (Looks fine!)
- CPU usage: 40% ✅ (Looks fine!)
- Error rate: 0.5% ✅ (Looks fine!)

Conclusion: Everything looks normal, but users still complaining!

Observability approach (powerful):

KQL Query:
requests
| where name contains "checkout"
| where duration > 5000  // > 5 seconds
| summarize count() by client_City

Results:
London: 2 slow requests
Tokyo: 1,543 slow requests ← FOUND IT!

Cause: Database replica in Asia is down!

Observability lets you ask questions you didn’t think of when building dashboards.
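The reason the dashboard looked healthy is that an average hides a small slow tail. Here is a minimal sketch with made-up sample data (the cities and durations are hypothetical) that mirrors the query above: filter for slow requests, then count them by city.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of (city, duration in ms): 990 fast requests, 10 slow ones.
requests = (
    [("London", 200)] * 990
    + [("Tokyo", 6000)] * 9
    + [("London", 5500)]
)

# The dashboard view: one number for everything.
average_ms = mean(duration for _, duration in requests)

# The observability view: slow requests only, grouped by city.
slow_by_city = Counter(city for city, duration in requests if duration > 5000)

print(f"average: {average_ms} ms")
print(slow_by_city.most_common())
```

The average stays close to the healthy 200 ms even though every Tokyo request in the sample takes six seconds, which is why grouping the outliers by a dimension finds the problem and the dashboard does not.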

[!TIP] Jargon Alert: Observability vs Monitoring

Monitoring: Tells you the system is dead. (“CPU is 100%”)
Observability: Tells you why the system is dead. (“The database query from line 42 is hanging.”)

[!WARNING] Gotcha: Log Retention Costs

Log Analytics charges you both to ingest data and to keep it, so storing debug logs for 365 days is expensive. Set retention to 30 days for dev/test, and use “Data Export” to move old logs to Blob Storage (Archive Tier) for long-term compliance.

1. The Three Pillars of Observability

Metrics

What: Time-series numerical data

Examples:

CPU usage: 75%
Request rate: 1,000/sec
Error rate: 2.5%
Response time: 250ms

Use: Real-time monitoring, alerting

Logs

What: Discrete event records

Examples:

“User login failed”
“Payment processed: $99.99”
“Database connection timeout”

Use: Debugging, troubleshooting

Traces

What: Request flow across services

Examples:

Frontend → API → Database
Total time: 450ms
DB query took 300ms

Use: Performance analysis, bottleneck identification

2. Azure Monitor Components

3. Application Insights Deep Dive

Enable Application Insights

ASP.NET Core:

// Program.cs
var builder = WebApplication.CreateBuilder(args);

// Add Application Insights (the connection string is read from configuration)
builder.Services.AddApplicationInsightsTelemetry();

var app = builder.Build();

Install NuGet:

dotnet add package Microsoft.ApplicationInsights.AspNetCore

Configuration (appsettings.json):

{
  "ApplicationInsights": {
    "ConnectionString": "InstrumentationKey=xxx;IngestionEndpoint=https://xxx"
  }
}

Node.js:

// app.js
const appInsights = require('applicationinsights');
appInsights.setup('YOUR_CONNECTION_STRING')
  .setAutoDependencyCorrelation(true)
  .setAutoCollectRequests(true)
  .setAutoCollectPerformance(true, true)
  .setAutoCollectExceptions(true)
  .start();

// Track custom events
const client = appInsights.defaultClient;
client.trackEvent({ name: 'OrderPlaced', properties: { orderId: '123' } });

Python:

# app.py
# Uses the legacy applicationinsights package together with Flask
from flask import Flask
from applicationinsights import TelemetryClient
from applicationinsights.flask.ext import AppInsights

app = Flask(__name__)
app.config['APPINSIGHTS_INSTRUMENTATIONKEY'] = 'YOUR_KEY'
appinsights = AppInsights(app)

# Track custom events
tc = TelemetryClient('YOUR_KEY')
tc.track_event('OrderPlaced', {'orderId': '123'})
tc.flush()  # send buffered telemetry now instead of waiting for the next batch

Custom Telemetry

Track Custom Events

// Track business events
telemetryClient.TrackEvent("OrderPlaced",
    properties: new Dictionary<string, string> {
        { "OrderId", orderId },
        { "CustomerId", customerId }
    },
    metrics: new Dictionary<string, double> {
        { "Amount", amount }
    });

// Query in KQL:
customEvents
| where name == "OrderPlaced"
| summarize totalRevenue=sum(todouble(customMeasurements.Amount)) by bin(timestamp, 1h)
| render timechart

Track Dependencies

// Track external dependencies (DB, APIs, etc.)
const string sql = "SELECT * FROM Orders WHERE CustomerId = @id";

using (var operation = telemetryClient.StartOperation<DependencyTelemetry>("SQL Query"))
{
    operation.Telemetry.Type = "SQL";
    operation.Telemetry.Data = sql;

    try
    {
        // database is a placeholder for your own data-access client
        var result = await database.QueryAsync(sql);
        operation.Telemetry.Success = true;
    }
    catch (Exception ex)
    {
        operation.Telemetry.Success = false;
        telemetryClient.TrackException(ex);
        throw;
    }
} // Disposing the operation stops the timer and sends the telemetry
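The same pattern (start a timer, run the call, record success or failure, re-raise errors) can be sketched outside of any SDK. The Python version below is an illustration only: `track_dependency` and the `recorded` list are hypothetical stand-ins for a telemetry client, not an Application Insights API.

```python
import time
from contextlib import contextmanager

recorded = []  # stand-in for a telemetry client; a real SDK would send these


@contextmanager
def track_dependency(name: str, dep_type: str):
    """Time the wrapped call and record success or failure; errors still propagate."""
    start = time.perf_counter()
    success = True
    try:
        yield
    except Exception:
        success = False
        raise
    finally:
        recorded.append({
            "name": name,
            "type": dep_type,
            "success": success,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


with track_dependency("SQL Query", "SQL"):
    time.sleep(0.01)  # pretend to run the query

try:
    with track_dependency("Payment API", "HTTP"):
        raise TimeoutError("gateway did not respond")
except TimeoutError:
    pass  # the caller still sees the error; the failure was recorded first

print([(r["name"], r["success"]) for r in recorded])
```

Recording in the `finally` clause guarantees one telemetry item per call whether it succeeds or fails, which is the property the `using` block provides in the C# example.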

4. KQL (Kusto Query Language) Mastery

Essential Queries for Production

The queries below are grouped into performance analysis, error analysis, dependency failures, and user analytics.

// Find slowest requests
requests
| where timestamp > ago(24h)
| summarize
    count=count(),
    avg_duration=avg(duration),
    p50=percentile(duration, 50),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
    by operation_Name
| order by p95 desc

// Find requests slower than SLA
requests
| where duration > 1000  // > 1 second
| project timestamp, name, url, duration, resultCode
| order by duration desc

// Find top errors
requests
| where timestamp > ago(1h)
| where success == false
| summarize count() by resultCode, operation_Name
| order by count_ desc

// Error rate over time
requests
| where timestamp > ago(24h)
| summarize
    total=count(),
    failed=countif(success == false)
    by bin(timestamp, 5m)
| extend errorRate = (failed * 100.0) / total
| render timechart

// Recent exceptions with stack traces (the details column holds the parsed stack)
exceptions
| where timestamp > ago(1h)
| project timestamp, type, outerMessage, innermostMessage, details
| order by timestamp desc

// Failed dependencies
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize count() by name, type, resultCode
| order by count_ desc

// Database query performance
dependencies
| where type == "SQL"
| where timestamp > ago(24h)
| summarize
    count(),
    avg(duration),
    p95=percentile(duration, 95)
    by name
| order by p95 desc

// Active users
pageViews
| where timestamp > ago(7d)
| summarize dau=dcount(user_Id) by bin(timestamp, 1d)
| render timechart

// Most popular pages
pageViews
| where timestamp > ago(7d)
| summarize count() by name
| order by count_ desc
| take 10
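To make the `summarize ... by operation_Name` pattern concrete, here is a rough Python equivalent of the first performance query. It is a sketch under stated assumptions: the sample rows are invented, and the nearest-rank percentile used here is a simple exact method, whereas KQL's `percentile()` is an approximation, so results on real data can differ slightly.

```python
import math
from collections import defaultdict


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]


def summarize_by_operation(rows: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (operation_name, duration_ms) rows and compute count, avg, p50, p95."""
    groups: dict[str, list[float]] = defaultdict(list)
    for name, duration in rows:
        groups[name].append(duration)
    return {
        name: {
            "count": len(durations),
            "avg": sum(durations) / len(durations),
            "p50": percentile(durations, 50),
            "p95": percentile(durations, 95),
        }
        for name, durations in groups.items()
    }


# Hypothetical data: 18 fast checkouts, 2 very slow ones, and 20 fast product views.
rows = [("POST /checkout", 100.0)] * 18 + [("POST /checkout", 4000.0)] * 2
rows += [("GET /products", 80.0)] * 20
stats = summarize_by_operation(rows)

# Equivalent of "order by p95 desc"
for name, s in sorted(stats.items(), key=lambda kv: kv[1]["p95"], reverse=True):
    print(name, s)
```

In this sample the median for checkout is healthy while the 95th percentile is not, which is why the queries above sort by p95 rather than by average.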

5. Distributed Tracing

Application Insights automatically correlates telemetry across services by propagating the W3C Trace Context traceparent header on outgoing HTTP calls.
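For intuition, the header has four dash-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags. The sketch below shows what a service does with it when making a downstream call. The function names are hypothetical, and in practice the SDK handles this for you; the sample header value is the one used as an example in the W3C Trace Context specification.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a fresh trace
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Because every hop reuses the trace ID and only replaces the span ID, all telemetry for one user request can later be joined on a single identifier.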

6. Alerting Strategy

Alert types: Metric Alerts, Log Query Alerts, Smart Detection. The examples below show a metric alert (Azure CLI) and a log query alert (KQL).

# CPU alert
az monitor metrics alert create \
  --name high-cpu-alert \
  --resource-group rg-prod \
  --scopes /subscriptions/.../virtualMachines/vm-web-01 \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action email-action-group

// High error rate alert
requests
| where timestamp > ago(5m)
| summarize
    total=count(),
    failed=countif(success == false)
| extend errorRate = (failed * 100.0) / total
| where errorRate > 5  // Alert if > 5% errors

7. Interview Questions

Beginner Level

Q1: What's the difference between metrics and logs?

Answer:

Metrics:

Numerical time-series data (CPU: 75%, requests: 1000/sec)
Cheap to store (aggregated)
Real-time monitoring
Limited context

Logs:

Discrete event records (structured or unstructured)
Expensive to store (high volume)
Rich context and details
Used for debugging

Example: Metrics tell you “error rate is 5%”, logs tell you “User X’s payment failed because of timeout”

Q2: What are the three pillars of observability?

Answer:

Metrics: Numerical measurements over time (CPU, memory, request rate)
Logs: Discrete event records (errors, audit trails)
Traces: Request flow across distributed systems

All three are needed for complete observability. Metrics show what is wrong, logs show why, and traces show where in the system.

Q3: What is Application Insights?

Answer: Application Insights is Azure’s Application Performance Monitoring (APM) service.

Features:

Automatic telemetry collection (requests, dependencies, exceptions)
Distributed tracing
Application Map (visualize dependencies)
Live Metrics Stream (real-time monitoring)
Smart Detection (anomaly detection)

Use case: Monitor web applications, detect performance issues, track user behavior

Intermediate Level

Q4: How would you troubleshoot a slow API endpoint?

Answer:

Step-by-step approach:

Identify the slow endpoint:

requests
| where timestamp > ago(1h)
| summarize p95=percentile(duration, 95) by operation_Name
| order by p95 desc

Find slow dependencies:

dependencies
| where operation_Name == "/api/orders"
| summarize p95=percentile(duration, 95) by name
| order by p95 desc

View end-to-end transaction: Use Application Map or search by operation_Id to see the entire request flow

Q5: Explain distributed tracing and correlation

Answer: Distributed tracing tracks a single user request as it flows through multiple services.

How it works:

Generate TraceId (unique per request)
Each service creates a SpanId (unique per operation)
Pass TraceId and parent SpanId via HTTP headers (traceparent)
All telemetry includes TraceId for correlation

Query all operations in a trace:

union requests, dependencies
| where operation_Id == "abc123"
| project timestamp, itemType, name, duration
| order by timestamp asc

Advanced Level

Q6: Design a monitoring strategy for microservices

Answer:

1. Instrumentation:

Enable Application Insights on all services
Implement distributed tracing
Use structured logging (JSON)

2. Metrics:

Service-level: Request rate, error rate, latency (p50, p95, p99)
Infrastructure: CPU, memory, disk, network
Business: Orders/min, revenue/hour, conversion rate

3. Dashboards:

Overview: Health of all services (green/yellow/red)
Service Detail: Golden signals per service
Business: KPIs (revenue, active users, conversion)

4. Alerts:

Critical: Service down, high error rate (> 5%)
Warning: Degraded performance, resource usage > 80%

5. On-Call Runbooks:

Document troubleshooting steps for each alert
Include dashboard links, KQL queries
Escalation paths

Q7: How do you optimize telemetry costs?

Answer:

Optimization Strategies:

1. Enable Sampling:

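// Fixed-rate sampling: turn off adaptive sampling, then keep ~10% of telemetry.
// (Adaptive sampling is on by default and adjusts the rate automatically.)
services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;
});
services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseSampling(10.0); // Keep 10%, reducing volume by ~90%
    chain.Build();
});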

2. Filter Unnecessary Telemetry:

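// Don't send health check requests
public class FilterHealthCheckProcessor : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    // The SDK passes in the next processor in the chain
    public FilterHealthCheckProcessor(ITelemetryProcessor next)
    {
        _next = next;
    }

    public void Process(ITelemetry item)
    {
        if (item is RequestTelemetry request &&
            request.Url?.AbsolutePath == "/health")
        {
            return; // Skip: the item is never sent
        }
        _next.Process(item);
    }
}

// Register with: services.AddApplicationInsightsTelemetryProcessor<FilterHealthCheckProcessor>();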

3. Reduce Retention:

# Set retention to 30 days (vs 90 default)
az monitor app-insights component update \
  --app myapp \
  --resource-group rg-prod \
  --retention-time 30

Expected Savings: 70-90% reduction
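The filtering strategy in this answer is a chain-of-processors pattern: each processor either drops an item or hands it to the next one. A language-neutral sketch in Python follows; the dictionary shape of a telemetry item and the `filter_paths` helper are hypothetical, chosen only to show the control flow.

```python
from typing import Callable

Telemetry = dict  # hypothetical shape: {"type": "request", "path": "/health"}
Processor = Callable[[Telemetry], None]


def filter_paths(ignored: set[str], next_processor: Processor) -> Processor:
    """Build a processor that drops requests to ignored paths and forwards the rest."""
    def process(item: Telemetry) -> None:
        if item.get("type") == "request" and item.get("path") in ignored:
            return  # dropped: never reaches the next processor or the sender
        next_processor(item)
    return process


sent: list[Telemetry] = []                       # stand-in for the network sender
pipeline = filter_paths({"/health"}, sent.append)

for path in ["/health", "/api/orders", "/health", "/api/checkout"]:
    pipeline({"type": "request", "path": path})

print([item["path"] for item in sent])
```

Health probes often fire every few seconds from several sources, so dropping them at the start of the chain can remove a large share of request telemetry without losing anything useful for debugging.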

8. Best Practices

Structured Logging

Use structured logs (JSON) for easier querying. Include context (user ID, correlation ID).
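As a minimal sketch of what this looks like with Python's standard `logging` module: the formatter below writes one JSON object per record and pulls context fields from `extra`. The field names `user_id` and `correlation_id` are assumptions for the example, not a required schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "user_id": getattr(record, "user_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # a real app would write to stdout or a telemetry handler
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Payment failed after %s ms", 5000,
    extra={"user_id": "1234", "correlation_id": "abc123"},
)
line = stream.getvalue().strip()
print(line)
```

With fields emitted as JSON properties rather than embedded in the message text, a query can filter on the correlation ID directly instead of parsing strings.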

Correlation IDs

Track requests across services with operation_Id. Essential for distributed tracing.

Sample High-Volume Data

Enable adaptive sampling for high-traffic apps to control costs while preserving insights.
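One reason sampling can preserve insights is that the keep-or-drop decision can be made per operation rather than per item, so a sampled request keeps all of its related telemetry. The sketch below illustrates that idea with a hash of the operation ID. It is not the SDK's actual algorithm, and the `keep` function and ID format are hypothetical.

```python
import hashlib


def keep(operation_id: str, sampling_percentage: float) -> bool:
    """Decide from a hash of the operation ID, so every item in a trace gets the same verdict."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sampling_percentage * 100


operation_ids = [f"op-{i}" for i in range(10_000)]
kept = [op for op in operation_ids if keep(op, 10)]
print(f"kept {len(kept)} of {len(operation_ids)}")
```

Because the decision depends only on the ID, every service in a distributed call chain reaches the same verdict independently, and counts can be scaled back up by dividing by the sampling rate.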

Monitor SLIs/SLOs

Define Service Level Indicators (latency, error rate) and Objectives (99.9% uptime).
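An SLO becomes actionable once it is converted into an error budget, the amount of downtime it permits. A small sketch of that arithmetic, assuming a 30-day period and availability measured in minutes (both function names are illustrative):

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period while still meeting the SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, period_days)
    return 1 - downtime_minutes / budget


print(round(error_budget_minutes(99.9), 1))    # budget for a 99.9% SLO over 30 days
print(round(budget_remaining(99.9, 8.5), 2))   # after an 8.5-minute incident
print(round(budget_remaining(99.9, 60), 2))    # after a 60-minute incident
```

A 99.9% objective allows roughly 43 minutes of downtime per 30 days, so the hour-long outage from the Black Friday example would exhaust the month's budget on its own, while the 8.5-minute version would not.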

Alert Runbooks

Every alert needs a runbook: What it means, how to troubleshoot, escalation path.

Test Observability

Regularly test your monitoring: Can you detect and diagnose issues quickly?

9. Key Takeaways

Three Pillars

Metrics, Logs, and Traces work together for complete observability.

Distributed Tracing

Application Insights automatically traces requests across services using correlation IDs.

KQL is Essential

Master KQL for querying logs, creating dashboards, and building alerts.

Golden Signals

Monitor Latency, Traffic, Errors, and Saturation for system health.

Smart Alerts

Alert on symptoms (error rate), not causes (CPU). Reduce alert fatigue.

Cost Optimization

Use sampling, filter noise, and reduce retention to control telemetry costs.

Next Steps

Continue to Chapter 10

Master Azure security, compliance, and governance
