Monitoring & Observability
What You’ll Learn
By the end of this chapter, you’ll understand:
- What monitoring and observability mean - The difference between knowing your app is broken vs. knowing WHY it’s broken
- The Three Pillars - Metrics, Logs, and Traces and when to use each one
- Azure Monitor ecosystem - Application Insights, Log Analytics, and how they work together
- KQL (Kusto Query Language) - How to query logs to find problems in production
- Distributed tracing - How to track a single user request across 10 different microservices
- Cost optimization - How to avoid $10,000/month telemetry bills (yes, this happens!)
Introduction: What is Monitoring & Observability?
Start Here if You’re Completely New
Monitoring = Knowing your application is broken
Observability = Knowing WHY your application is broken
Think of it like a car:
Monitoring (Dashboard lights):
- Check Engine Light ✅ (Something is wrong!)
- Temperature Gauge: HIGH ⚠️ (Engine is overheating!)
- Fuel Gauge: EMPTY ⚠️ (You’re out of gas!)
Observability (Diagnostic data):
- Metrics: Temperature readings every second (shows temperature spiked at 3:15 PM)
- Logs: “Coolant level low” warning at 3:10 PM → “Fan belt broke” error at 3:14 PM
- Traces: Complete timeline from “fan belt snapped” → “fan stopped” → “engine overheated” → “check engine light”
Why Observability Matters (Real-World Example)
The $2 Million Bug (True Story)
Scenario: E-commerce site during Black Friday sale
What Happened:
- 3:00 PM: Sales suddenly drop 90%
- 3:05 PM: CEO calls: “FIX IT NOW!”
- 3:30 PM: Still debugging…
- 4:00 PM: Finally found the bug (payment API timeout)
- Total downtime: 1 hour
- Lost revenue: $2,000,000
The Three Pillars of Observability (Explained Simply)
Real-World Analogy: Investigating a Crime
Crime Scene: Your application crashed
1. Metrics = Security Camera (Numbers over time)
What you see:
- 3:14:23 PM: 100 people in store
- 3:14:25 PM: 95 people in store
- 3:14:27 PM: 0 people in store ← Everyone left suddenly!
In your application:
- CPU usage: 25% → 75% → 100% ← CPU spiked!
- Request rate: 1000/sec → 500/sec → 0/sec ← Requests dropped!
- Error rate: 1% → 5% → 25% ← Errors increased!
How Azure Monitor Works (Behind the Scenes)
The Complete Picture
Your Application:
- Sensors (Your application with Application Insights SDK)
  - Motion sensors = Telemetry in your code
  - Automatically detect: HTTP requests, database queries, exceptions
- Recording device (Log Analytics Workspace)
  - DVR that stores all footage
  - Stores: Metrics, logs, traces for 30-90 days
- Monitoring screen (Azure Portal dashboards)
  - View live feeds
  - See alerts when motion detected
- Alert system (Azure Monitor Alerts)
  - Calls police when break-in detected
  - Sends email/SMS when error rate > 5%
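The "error rate > 5%" rule in the alert system can be sketched in plain Python. This only illustrates the threshold logic and is not how Azure Monitor Alerts is implemented; `WindowStats`, `should_alert`, and the sample counts are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts collected over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(stats: WindowStats) -> float:
    """Return the failed fraction of requests, or 0.0 for an empty window."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def should_alert(stats: WindowStats, threshold: float = 0.05) -> bool:
    """Fire only when the error rate is strictly above the threshold (5% by default)."""
    return error_rate(stats) > threshold


healthy = WindowStats(total_requests=1000, failed_requests=10)   # 1% errors
failing = WindowStats(total_requests=1000, failed_requests=250)  # 25% errors
print(should_alert(healthy), should_alert(failing))
```

Evaluating the rate over a window, rather than per request, keeps a single failed request from paging anyone.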
Cost of Observability (Real Numbers)
Before You Start: Understand the Costs
Azure Monitor Pricing (as of 2024):
1. Data Ingestion (Getting data IN):
- First 5 GB/month: FREE ✅
- After 5 GB: $2.76/GB
2. Data Retention (Keeping data):
- First 31 days: FREE ✅
- After 31 days: $0.12/GB/month
Example calculation:
- 1 million requests/day
- Each request generates ~5 KB telemetry
- Total data: 5,000,000 KB/day = ~5 GB/day = 150 GB/month
- Ingestion cost without sampling: (150 GB - 5 GB free) × $2.76/GB ≈ $400/month
Typical monthly costs:
- Small app (10k requests/day): ~$5-20/month
- Medium app (1M requests/day): ~$30-400/month (with/without sampling)
- Large app (100M requests/day): ~$1,000-10,000/month
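The arithmetic behind these figures can be reproduced with a short script. This is a rough estimator that uses only the ingestion prices quoted in this section (5 GB free, then $2.76/GB) and ignores retention charges; the function names and the 10% sampling rate are assumptions made for illustration.

```python
FREE_GB_PER_MONTH = 5.0   # free ingestion allowance quoted in this section
PRICE_PER_GB = 2.76       # USD per GB after the free allowance, quoted in this section


def monthly_ingestion_gb(requests_per_day: int, kb_per_request: float, days: int = 30) -> float:
    """Estimate monthly telemetry volume in GB, using 1 GB = 1,000,000 KB."""
    return requests_per_day * kb_per_request * days / 1_000_000


def monthly_ingestion_cost(gb: float, sampling_rate: float = 1.0) -> float:
    """Estimate the monthly ingestion bill in USD.

    sampling_rate is the fraction of telemetry kept (1.0 means no sampling).
    """
    billable_gb = max(0.0, gb * sampling_rate - FREE_GB_PER_MONTH)
    return billable_gb * PRICE_PER_GB


volume = monthly_ingestion_gb(requests_per_day=1_000_000, kb_per_request=5)
print(f"volume:       {volume:.0f} GB/month")
print(f"no sampling:  ${monthly_ingestion_cost(volume):.2f}/month")
print(f"10% sampling: ${monthly_ingestion_cost(volume, sampling_rate=0.10):.2f}/month")
```

For the medium app (1M requests/day at ~5 KB each), this lands at roughly $400/month without sampling and just under $30/month when only 10% of telemetry is kept, which is where the "$30-400" range comes from.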
[!WARNING] Gotcha: Debug Logs Cost Money!
Developers often enable debug logging in production and forget to turn it off.
Fix: Log only errors and warnings in production, and keep debug logs for development.
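One way to apply this fix is to pick the log level from the deployment environment. The sketch below uses only Python's standard `logging` module; the `APP_ENV` variable name and the `"app"` logger name are assumptions for illustration, not Azure conventions.

```python
import logging
import os


def configure_logging(environment: str) -> logging.Logger:
    """Return a logger that is verbose in development and quiet in production."""
    # Production keeps WARNING and above; every other environment keeps DEBUG too.
    level = logging.WARNING if environment == "production" else logging.DEBUG
    logger = logging.getLogger("app")  # "app" is a placeholder logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        logger.addHandler(logging.StreamHandler())
    return logger


# APP_ENV is a hypothetical variable name; use whatever your deployment sets.
log = configure_logging(os.environ.get("APP_ENV", "development"))
log.debug("cart contents: %s", {"sku": "A1", "qty": 2})  # dropped in production
log.warning("payment API responded slowly")                # kept everywhere
```

Because the level is checked before a record is created, debug calls left in the code never reach the telemetry pipeline in production.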
Observability vs Monitoring (The Real Difference)
Monitoring = Pre-defined dashboards
- You must know what to monitor ahead of time
- “I’ll track CPU, memory, request rate”
- Works great for known issues
Observability = Ad-hoc investigation
- You can investigate unknown problems
- “Show me all requests from user X that failed between 3-4 PM”
- Essential for debugging new issues
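The ad-hoc question "show me all requests from user X that failed between 3-4 PM" can be illustrated over a handful of in-memory records. In practice the same filter would run against your telemetry store; the record fields, user IDs, and timestamps below are made up for the example.

```python
from datetime import datetime

# Hypothetical request telemetry; real records would come from your telemetry store.
requests = [
    {"user_id": "user-x", "name": "POST /checkout", "success": False,
     "timestamp": datetime(2024, 11, 29, 15, 12)},
    {"user_id": "user-x", "name": "GET /cart", "success": True,
     "timestamp": datetime(2024, 11, 29, 15, 20)},
    {"user_id": "user-y", "name": "POST /checkout", "success": False,
     "timestamp": datetime(2024, 11, 29, 15, 30)},
    {"user_id": "user-x", "name": "POST /checkout", "success": False,
     "timestamp": datetime(2024, 11, 29, 16, 5)},
]


def failed_requests_for_user(records, user_id, start, end):
    """Return failed requests for one user with start <= timestamp < end."""
    return [
        r for r in records
        if r["user_id"] == user_id
        and not r["success"]
        and start <= r["timestamp"] < end
    ]


matches = failed_requests_for_user(
    requests,
    user_id="user-x",
    start=datetime(2024, 11, 29, 15, 0),
    end=datetime(2024, 11, 29, 16, 0),
)
for r in matches:
    print(r["timestamp"], r["name"])
```

The filter only works because each record carries the user ID, outcome, and timestamp. A pre-aggregated CPU or request-rate metric could not answer this question.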
Real-World Comparison
Problem: “Checkout is slow for some users”
[!TIP] Jargon Alert: Observability vs Monitoring
Monitoring: Tells you the system is dead. (“CPU is 100%”)
Observability: Tells you why the system is dead. (“The database query from line 42 is hanging.”)
[!WARNING] Gotcha: Log Retention Costs
Log Analytics charges you to ingest data and to keep it. Storing debug logs for 365 days is expensive. Set retention to 30 days for dev/test and use “Data Export” to move old logs to Blob Storage (Archive Tier) for long-term compliance.
1. The Three Pillars of Observability
Metrics
What: Time-series numerical data
Examples:
- CPU usage: 75%
- Request rate: 1,000/sec
- Error rate: 2.5%
- Response time: 250ms
Logs
What: Discrete event records
Examples:
- “User login failed”
- “Payment processed: $99.99”
- “Database connection timeout”
Traces
What: Request flow across services
Examples:
- Frontend → API → Database
- Total time: 450ms
- DB query took 300ms
2. Azure Monitor Components
3. Application Insights Deep Dive
Enable Application Insights
- ASP.NET Core
- Node.js
- Python
Custom Telemetry
Track Custom Events
Track Dependencies
4. KQL (Kusto Query Language) Mastery
Essential Queries for Production
- Performance Analysis
- Error Analysis
- Dependency Failures
- User Analytics
5. Distributed Tracing
Application Insights automatically correlates telemetry across services using the W3C traceparent header.
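To make the header concrete, here is a small parser for the version `00` layout defined by the W3C Trace Context specification: `version-traceid-parentid-flags`, all lowercase hex. It is a sketch for reading headers while debugging, not the code Application Insights runs; `TraceContext` and `parse_traceparent` are names invented for this example.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex (version) - 32 hex (trace id) - 16 hex (parent id) - 2 hex (flags).
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    version: str
    trace_id: str    # shared by every service handling the request
    parent_id: str   # span id of the caller
    sampled: bool    # lowest bit of the flags field


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header, returning None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent ids.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

The `trace_id` field is the value to search for when you want every piece of telemetry that belongs to one request.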
6. Alerting Strategy
- Metric Alerts
- Log Query Alerts
- Smart Detection
7. Interview Questions
Beginner Level
Q1: What's the difference between metrics and logs?
Answer:
Metrics:
- Numerical time-series data (CPU: 75%, requests: 1000/sec)
- Cheap to store (aggregated)
- Real-time monitoring
- Limited context
Logs:
- Discrete event records (structured or unstructured)
- Expensive to store (high volume)
- Rich context and details
- Used for debugging
Q2: What are the three pillars of observability?
Answer:
- Metrics: Numerical measurements over time (CPU, memory, request rate)
- Logs: Discrete event records (errors, audit trails)
- Traces: Request flow across distributed systems
Q3: What is Application Insights?
Answer:
Application Insights is Azure’s Application Performance Monitoring (APM) service.
Features:
- Automatic telemetry collection (requests, dependencies, exceptions)
- Distributed tracing
- Application Map (visualize dependencies)
- Live Metrics Stream (real-time monitoring)
- Smart Detection (anomaly detection)
Intermediate Level
Q4: How would you troubleshoot a slow API endpoint?
Answer:
Step-by-step approach:
- Identify the slow endpoint:
- Find slow dependencies:
- View end-to-end transaction: Use Application Map or search by operation_Id to see the entire request flow
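The first step, identifying the slow endpoint, usually comes down to ranking endpoints by a high percentile of request duration rather than by the average. The sketch below does this offline over made-up request records using only the standard library; in production the same grouping would be a query against your request telemetry.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical request telemetry: (endpoint name, duration in milliseconds).
records = [("GET /api/products", 100 + i) for i in range(20)]
records += [("POST /api/checkout", 200 + 50 * i) for i in range(20)]


def p95_by_endpoint(rows):
    """Group durations by endpoint and return each endpoint's 95th percentile."""
    durations = defaultdict(list)
    for endpoint, duration_ms in rows:
        durations[endpoint].append(duration_ms)
    result = {}
    for endpoint, values in durations.items():
        if len(values) < 2:
            result[endpoint] = float(values[0])  # too few points to interpolate
        else:
            # n=100 yields 99 cut points; index 94 is the 95th percentile.
            result[endpoint] = quantiles(values, n=100, method="inclusive")[94]
    return result


p95 = p95_by_endpoint(records)
slowest = max(p95, key=p95.get)
print(slowest, round(p95[slowest]), "ms")
```

Ranking by p95 surfaces endpoints that are slow for a minority of users, which an average would hide.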
Q5: Explain distributed tracing and correlation
Answer:
Distributed Tracing tracks a single user request as it flows through multiple services.
How it works:
- Generate TraceId (unique per request)
- Each service creates a SpanId (unique per operation)
- Pass TraceId and parent SpanId via HTTP headers (traceparent)
- All telemetry includes TraceId for correlation
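The propagation steps can be simulated in a few lines. This sketch generates header values by hand to show which part stays constant and which part changes at each hop; real SDKs do this for you, and the function names here are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: fresh 16-byte trace id and 8-byte span id, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace id, but mint a new span id for this service's work."""
    version, trace_id, _caller_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Simulate Frontend -> API -> Database, each hop forwarding its own header.
frontend = new_traceparent()
api = child_traceparent(frontend)
database = child_traceparent(api)

for name, header in [("frontend", frontend), ("api", api), ("database", database)]:
    print(f"{name:9} {header}")
```

All three headers share the same 32-character trace id while the 16-character span id differs per hop, which is what lets telemetry from separate services be stitched into one timeline.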
Advanced Level
Q6: Design a monitoring strategy for microservices
Answer:
1. Instrumentation:
- Enable Application Insights on all services
- Implement distributed tracing
- Use structured logging (JSON)
2. Metrics:
- Service-level: Request rate, error rate, latency (p50, p95, p99)
- Infrastructure: CPU, memory, disk, network
- Business: Orders/min, revenue/hour, conversion rate
3. Dashboards:
- Overview: Health of all services (green/yellow/red)
- Service Detail: Golden signals per service
- Business: KPIs (revenue, active users, conversion)
4. Alerts:
- Critical: Service down, high error rate (> 5%)
- Warning: Degraded performance, resource usage > 80%
5. Runbooks:
- Document troubleshooting steps for each alert
- Include dashboard links, KQL queries
- Escalation paths
Q7: How do you optimize telemetry costs?
Answer:
Optimization Strategies:
1. Enable Sampling
2. Filter Unnecessary Telemetry
3. Reduce Retention
Expected Savings: 70-90% reduction
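Sampling is easiest to reason about with a toy implementation. The sketch below is a simple fixed-rate sampler that decides from a hash of the trace ID, so every service keeps or drops the same traces. It is not the adaptive sampling algorithm Application Insights uses, and `keep_trace` and the synthetic trace IDs are assumptions for illustration.

```python
import hashlib


def keep_trace(trace_id: str, sampling_percentage: float) -> bool:
    """Decide from the trace id alone, so every service makes the same choice."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") / 2**64 * 100
    return bucket < sampling_percentage


trace_ids = [f"trace-{i}" for i in range(10_000)]  # synthetic ids for the demo
kept = [t for t in trace_ids if keep_trace(t, sampling_percentage=10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Keeping about 10% of traces cuts ingested volume by roughly 90%, and because the decision is per trace rather than per event, the traces that remain are complete end to end.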
8. Best Practices
Structured Logging
Use structured logs (JSON) for easier querying. Include context (user ID, correlation ID).
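A minimal way to get JSON logs with context from Python's standard `logging` module is a custom formatter. This is a sketch under simple assumptions: the field names `user_id` and `correlation_id`, the `checkout` logger name, and the sample values are all illustrative.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "user_id": getattr(record, "user_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # placeholder logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "Payment API timeout",
    extra={"user_id": "u-123", "correlation_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```

Because every line is a JSON object with fixed keys, a log query can filter on `user_id` or `correlation_id` directly instead of pattern-matching free text.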
Correlation IDs
Track requests across services with operation_Id. Essential for distributed tracing.
Sample High-Volume Data
Enable adaptive sampling for high-traffic apps to control costs while preserving insights.
Monitor SLIs/SLOs
Define Service Level Indicators (latency, error rate) and Objectives (99.9% uptime).
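An availability objective is easier to act on when converted into allowed downtime, often called an error budget. The helper below is a back-of-the-envelope sketch; the 30-day window and the function names are assumptions, not something Azure Monitor prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime observed so far (negative means the SLO is missed)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


print(f"99.9% over 30 days allows {error_budget_minutes(0.999):.1f} minutes of downtime")
print(f"after a 60-minute outage: {budget_remaining(0.999, 60):.1f} minutes left")
```

At 99.9% over 30 days the budget is about 43 minutes, so a one-hour outage like the Black Friday incident earlier in this chapter would breach the objective on its own.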
Alert Runbooks
Every alert needs a runbook: What it means, how to troubleshoot, escalation path.
Test Observability
Regularly test your monitoring: Can you detect and diagnose issues quickly?
9. Key Takeaways
Three Pillars
Metrics, Logs, and Traces work together for complete observability.
Distributed Tracing
Application Insights automatically traces requests across services using correlation IDs.
KQL is Essential
Master KQL for querying logs, creating dashboards, and building alerts.
Golden Signals
Monitor Latency, Traffic, Errors, and Saturation for system health.
Smart Alerts
Alert on symptoms (error rate), not causes (CPU). Reduce alert fatigue.
Cost Optimization
Use sampling, filter noise, and reduce retention to control telemetry costs.
Next Steps
Continue to Chapter 10
Master Azure security, compliance, and governance