Monitoring & Observability
What You’ll Learn
By the end of this chapter, you’ll understand:
- What monitoring and observability mean - The difference between knowing your app is broken vs. knowing WHY it’s broken
- The Three Pillars - Metrics, Logs, and Traces and when to use each one
- Azure Monitor ecosystem - Application Insights, Log Analytics, and how they work together
- KQL (Kusto Query Language) - How to query logs to find problems in production
- Distributed tracing - How to track a single user request across 10 different microservices
- Cost optimization - How to avoid $10,000/month telemetry bills (yes, this happens!)
Introduction: What is Monitoring & Observability?
Start Here if You’re Completely New
Monitoring = Knowing your application is broken
Observability = Knowing WHY your application is broken
Think of it like a car:
Monitoring (Dashboard lights):
- Check Engine Light ✅ (Something is wrong!)
- Temperature Gauge: HIGH ⚠️ (Engine is overheating!)
- Fuel Gauge: EMPTY ⚠️ (You’re out of gas!)
Observability (the diagnosis behind the lights):
- Metrics: Temperature readings every second (shows temperature spiked at 3:15 PM)
- Logs: “Coolant level low” warning at 3:10 PM → “Fan belt broke” error at 3:14 PM
- Traces: Complete timeline from “fan belt snapped” → “fan stopped” → “engine overheated” → “check engine light”
Why Observability Matters (Real-World Example)
The $2 Million Bug (True Story)
Scenario: E-commerce site during Black Friday sale
What Happened:
- 3:00 PM: Sales suddenly drop 90%
- 3:05 PM: CEO calls: “FIX IT NOW!”
- 3:30 PM: Still debugging…
- 4:00 PM: Finally found the bug (payment API timeout)
- Total downtime: 1 hour
- Lost revenue: $2,000,000
The Three Pillars of Observability (Explained Simply)
Real-World Analogy: Investigating a Crime
Crime Scene: Your application crashed
1. Metrics = Security Camera (Numbers over time)
What you see:
- 3:14:23 PM: 100 people in store
- 3:14:25 PM: 95 people in store
- 3:14:27 PM: 0 people in store ← Everyone left suddenly!
The same view for an application:
- CPU usage: 25% → 75% → 100% ← CPU spiked!
- Request rate: 1000/sec → 500/sec → 0/sec ← Requests dropped!
- Error rate: 1% → 5% → 25% ← Errors increased!
How Azure Monitor Works (Behind the Scenes)
The Complete Picture
Your Application:
- Sensors (Your application with Application Insights SDK)
  - Motion sensors = Telemetry in your code
  - Automatically detect: HTTP requests, database queries, exceptions
- Recording device (Log Analytics Workspace)
  - DVR that stores all footage
  - Stores: Metrics, logs, traces for 30-90 days
- Monitoring screen (Azure Portal dashboards)
  - View live feeds
  - See alerts when motion detected
- Alert system (Azure Monitor Alerts)
  - Calls police when break-in detected
  - Sends email/SMS when error rate > 5%
Cost of Observability (Real Numbers)
Before You Start: Understand the Costs
Azure Monitor Pricing (as of 2024):
1. Data Ingestion (Getting data IN):
- First 5 GB/month: FREE ✅
- After 5 GB: $2.76/GB
2. Data Retention (Keeping data):
- First 31 days: FREE ✅
- After 31 days: $0.12/GB/month
Example (medium app):
- 1 million requests/day
- Each request generates ~5 KB telemetry
- Total data: 5,000,000 KB/day = ~5 GB/day = 150 GB/month
- Ingestion cost without sampling: (150 GB - 5 GB free) × $2.76/GB ≈ $400/month
Typical monthly costs:
- Small app (10k requests/day): ~$5-20/month
- Medium app (1M requests/day): ~$30-400/month (with/without sampling)
- Large app (100M requests/day): ~$1,000-10,000/month
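The arithmetic above can be reproduced with a short Python sketch. It assumes the 2024 figures quoted in this section (5 GB free per month, $2.76/GB afterwards) and decimal units; the function name and the sampling_rate parameter are illustrative, not part of any Azure SDK.

```python
FREE_GB_PER_MONTH = 5.0   # free ingestion allowance quoted above
PRICE_PER_GB = 2.76       # USD per GB after the free allowance


def monthly_ingestion_cost(requests_per_day: float,
                           kb_per_request: float = 5.0,
                           sampling_rate: float = 1.0,
                           days: int = 30) -> tuple[float, float]:
    """Return (GB ingested per month, estimated USD ingestion cost).

    sampling_rate is the fraction of telemetry kept (1.0 = no sampling).
    Uses 1 GB = 1,000,000 KB, matching the rough estimate in the text.
    """
    gb_per_month = requests_per_day * kb_per_request * days * sampling_rate / 1_000_000
    billable_gb = max(0.0, gb_per_month - FREE_GB_PER_MONTH)
    return gb_per_month, round(billable_gb * PRICE_PER_GB, 2)


if __name__ == "__main__":
    for rate in (1.0, 0.1):
        gb, cost = monthly_ingestion_cost(1_000_000, sampling_rate=rate)
        print(f"keep {rate:.0%} of telemetry: {gb:.0f} GB/month, ~${cost:,.2f}")
```

For the medium app this gives 150 GB and roughly $400 with no sampling, and 15 GB and roughly $28 when only 10% of telemetry is kept. Retention charges are not included.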
[!WARNING] Gotcha: Debug Logs Cost Money! Developers often enable debug logging in production and forget to turn it off. Every debug line is ingested and billed like any other telemetry. Fix: Log only errors and warnings in production, and keep debug logs for development.
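One way to apply this rule is to choose the log level from the deployment environment. The sketch below uses only Python's standard logging module; the APP_ENV variable name and the logger name "app" are assumptions made for illustration.

```python
import logging
import os


def configure_logging(environment: str) -> logging.Logger:
    """Pick a log level from the deployment environment.

    Production keeps WARNING and above; every other environment keeps DEBUG.
    """
    level = logging.WARNING if environment == "production" else logging.DEBUG
    logger = logging.getLogger("app")
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers when called more than once
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # APP_ENV is a hypothetical variable name; use whatever your deployment sets.
    log = configure_logging(os.environ.get("APP_ENV", "development"))
    log.debug("cart contents: %s", {"sku": "A1", "qty": 2})  # dropped in production
    log.warning("payment API latency above threshold")       # kept everywhere
```

Because the level is set once at startup, a forgotten debug statement in the code cannot inflate the production bill.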
Observability vs Monitoring (The Real Difference)
Monitoring = Pre-defined dashboards
- You must know what to monitor ahead of time
- “I’ll track CPU, memory, request rate”
- Works great for known issues
Observability = Ad-hoc questions about your telemetry
- You can investigate unknown problems
- “Show me all requests from user X that failed between 3-4 PM”
- Essential for debugging new issues
Real-World Comparison
Problem: “Checkout is slow for some users”
[!TIP] Jargon Alert: Observability vs Monitoring Monitoring: Tells you the system is dead. (“CPU is 100%”) Observability: Tells you why the system is dead. (“The database query from line 42 is hanging.”)
[!WARNING] Gotcha: Log Retention Costs Log Analytics charges you to ingest data and to keep it. Storing debug logs for 365 days is expensive. Set retention to 30 days for dev/test and use “Data Export” to move old logs to Blob Storage (Archive Tier) for long-term compliance.
1. The Three Pillars of Observability
Metrics
- CPU usage: 75%
- Request rate: 1,000/sec
- Error rate: 2.5%
- Response time: 250ms
Logs
- “User login failed”
- “Payment processed: $99.99”
- “Database connection timeout”
Traces
- Frontend → API → Database
- Total time: 450ms
- DB query took 300ms
2. Azure Monitor Components
3. Application Insights Deep Dive
Enable Application Insights
- ASP.NET Core
- Node.js
- Python
Custom Telemetry
Track Custom Events
Track Dependencies
4. KQL (Kusto Query Language) Mastery
Essential Queries for Production
- Performance Analysis
- Error Analysis
- Dependency Failures
- User Analytics
5. Distributed Tracing
Trace context travels between services in the W3C traceparent header.
6. Alerting Strategy
- Metric Alerts
- Log Query Alerts
- Smart Detection
7. Interview Questions
Beginner Level
Q1: What's the difference between metrics and logs?
Metrics:
- Numerical time-series data (CPU: 75%, requests: 1000/sec)
- Cheap to store (aggregated)
- Real-time monitoring
- Limited context
Logs:
- Discrete event records (structured or unstructured)
- Expensive to store (high volume)
- Rich context and details
- Used for debugging
Q2: What are the three pillars of observability?
- Metrics: Numerical measurements over time (CPU, memory, request rate)
- Logs: Discrete event records (errors, audit trails)
- Traces: Request flow across distributed systems
Q3: What is Application Insights?
- Automatic telemetry collection (requests, dependencies, exceptions)
- Distributed tracing
- Application Map (visualize dependencies)
- Live Metrics Stream (real-time monitoring)
- Smart Detection (anomaly detection)
Intermediate Level
Q4: How would you troubleshoot a slow API endpoint?
- Identify the slow endpoint:
- Find slow dependencies:
- View end-to-end transaction: Use Application Map or search by operation_Id to see the entire request flow
Q5: Explain distributed tracing and correlation
- Generate TraceId (unique per request)
- Each service creates a SpanId (unique per operation)
- Pass TraceId and parent SpanId via HTTP headers (traceparent)
- All telemetry includes TraceId for correlation
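The propagation steps above can be sketched in a few lines of Python. This is a simplified illustration of the W3C traceparent format (version 00 only: version, 32-hex trace ID, 16-hex span ID, 2-hex flags); it ignores the tracestate header, and the function names are hypothetical. In practice the Application Insights or OpenTelemetry SDK does this for you.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: fresh trace ID, fresh span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Continue a trace downstream: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a new trace instead
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    frontend = new_traceparent()
    api = child_traceparent(frontend)
    print(frontend)
    print(api)  # same trace ID as the line above, different span ID
```

Each service sends the value from child_traceparent on its outgoing calls, so every hop shares one trace ID while keeping its own span ID.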
Advanced Level
Q6: Design a monitoring strategy for microservices
Instrumentation:
- Enable Application Insights on all services
- Implement distributed tracing
- Use structured logging (JSON)
Key metrics:
- Service-level: Request rate, error rate, latency (p50, p95, p99)
- Infrastructure: CPU, memory, disk, network
- Business: Orders/min, revenue/hour, conversion rate
Dashboards:
- Overview: Health of all services (green/yellow/red)
- Service Detail: Golden signals per service
- Business: KPIs (revenue, active users, conversion)
Alerts:
- Critical: Service down, high error rate (> 5%)
- Warning: Degraded performance, resource usage > 80%
Runbooks:
- Document troubleshooting steps for each alert
- Include dashboard links, KQL queries
- Escalation paths
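As a rough illustration of the service-level numbers in this answer, the sketch below computes latency percentiles and an error rate from one window of raw request data and applies the 5% critical threshold. The function names and the nearest-rank percentile method are choices made for this example; a real setup would normally compute these in Application Insights or KQL.

```python
import math


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def service_signals(durations_ms: list[float], failures: int) -> dict:
    """Summarise one service's time window (expects at least one request)."""
    ordered = sorted(durations_ms)
    error_rate = failures / len(ordered) * 100
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "error_rate_pct": error_rate,
        "critical": error_rate > 5.0,  # threshold taken from the alert list above
    }


if __name__ == "__main__":
    window = [120.0, 95.0, 110.0, 2400.0, 130.0, 105.0, 98.0, 115.0, 101.0, 125.0]
    print(service_signals(window, failures=1))
```

Percentiles matter here because one slow request (2400 ms in the sample window) barely moves the median but dominates p95 and p99, which is what users at the tail experience.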
Q7: How do you optimize telemetry costs?
8. Best Practices
Structured Logging
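A minimal sketch of structured logging with Python's standard library, shown under the assumption that each log line should be one JSON object carrying a correlation ID. The field names (including operation_id, borrowed from the Application Insights column of the same name) and the class name are illustrative choices.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Set by callers via extra={"operation_id": ...}; None when absent.
            "operation_id": getattr(record, "operation_id", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.error("Database connection timeout", extra={"operation_id": "abc123"})
```

Because every line is a JSON object with the same fields, a query can filter by operation_id directly instead of parsing free text.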
Correlation IDs
Sample High-Volume Data
Monitor SLIs/SLOs
Alert Runbooks
Test Observability
9. Key Takeaways
Three Pillars
Distributed Tracing
KQL is Essential
Golden Signals
Smart Alerts
Cost Optimization
Interview Deep-Dive
Checkout conversion drops 40% but no alerts fire. CPU, memory, and error rates look normal. How do you diagnose this?
- Why monitoring missed it: CPU, memory, and 500 error rates are infrastructure metrics. A conversion drop is a business metric — the app is technically working but something is wrong with user experience.
- Diagnosis: Open Application Map for dependency latency. Check Performance blade for /checkout P95 response time — if it jumped from 800ms to 4 seconds, users abandon due to slowness (200 OK response, but slow). Use distributed tracing to find which dependency call is the bottleneck. Check Failures blade for dependency 429s (rate limiting) causing retries.
- The likely culprit: A third-party dependency (payment gateway, fraud detection) responding slowly. Application Insights dependency tracking shows exact call, latency, and success rate.
- What was missing: No alerts on business KPIs. Create custom metrics tracking checkout start vs order confirmation events. Alert when conversion drops 20% vs same hour last week.
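The business-KPI alert described in the last point could be evaluated along these lines. This is a sketch only: the function names are hypothetical, the counts would come from your custom checkout-start and order-confirmation events, and the 20% threshold is the one suggested above.

```python
def conversion_rate(checkout_starts: int, orders_confirmed: int) -> float:
    """Fraction of started checkouts that ended in a confirmed order."""
    return orders_confirmed / checkout_starts if checkout_starts else 0.0


def conversion_alert(current: tuple[int, int],
                     same_hour_last_week: tuple[int, int],
                     max_relative_drop: float = 0.20) -> bool:
    """True when conversion fell by more than max_relative_drop versus the baseline.

    Both arguments are (checkout_starts, orders_confirmed) for a one-hour window.
    """
    baseline = conversion_rate(*same_hour_last_week)
    if baseline == 0.0:
        return False  # no baseline to compare against
    drop = (baseline - conversion_rate(*current)) / baseline
    return drop > max_relative_drop


if __name__ == "__main__":
    # 30% conversion last week versus 18% now is a 40% relative drop.
    print(conversion_alert(current=(1000, 180), same_hour_last_week=(1000, 300)))
```

Comparing against the same hour last week, rather than a fixed number, keeps normal daily and weekly traffic patterns from triggering the alert.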
Your Application Insights bill jumped from $500 to $8,000/month. How do you reduce it without losing critical observability?
- What happened: Ingestion went from ~185 GB to ~2,900 GB/month at $2.76/GB. Common causes: verbose DEBUG logging in production, trace telemetry for every request, dependency tracking logging full SQL queries, or snapshot debugging enabled.
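- Immediate fix: Enable sampling. Keeping 1 in 5 requests cuts ingestion by roughly 80% while preserving statistical accuracy. Also set a daily cap to prevent runaway costs. The cap is expressed in GB per day, so size it near your normal daily volume: ~2,900 GB/month is only about 97 GB/day, so a 200 GB/day cap would not have triggered here.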
- Medium-term: Use “Usage and estimated costs” blade to find which telemetry type is largest. Filter out health check endpoints with TelemetryProcessors. Move verbose logs to separate workspace with 7-day retention.
- Never cut: Distributed tracing and exception tracking. A 1-hour outage costs $50K+ for most e-commerce sites — far more than monthly telemetry costs.
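To show why sampling can cut volume without breaking traces, here is a sketch of fixed-rate sampling keyed on the operation ID. It is not the adaptive algorithm Application Insights uses (which adjusts the rate to traffic); the function name and the 20% default are assumptions chosen to mirror the 1-in-5 figure above.

```python
import hashlib


def keep_telemetry(operation_id: str, sample_percent: float = 20.0) -> bool:
    """Decide whether to keep telemetry for a request, based on its operation ID.

    Hashing the ID, instead of rolling a random number per item, means every
    span and log line belonging to one request gets the same decision.
    """
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sample_percent * 100


if __name__ == "__main__":
    kept = sum(keep_telemetry(f"op-{i}") for i in range(10_000))
    print(f"kept {kept} of 10000 operations")
```

Since the decision depends only on the operation ID, a kept request retains its complete end-to-end trace, which is what makes sampled data still usable for debugging.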
A user says checkout is slow but your API returns 200 OK in 300ms. Where is the problem?
- The disconnect: Server measures 300ms. User experiences: DNS (50ms) + TCP (100ms) + TLS (100ms) + TTFB (300ms) + download (200ms) + JS rendering (2s) + third-party scripts (analytics, chat). Total perceived: 2.75 seconds.
- How distributed tracing reveals this: Application Insights JavaScript SDK captures browser timing. End-to-end trace shows: Browser (2.75s) -> CDN (50ms) -> Front Door (20ms) -> API (300ms). 2 seconds are client-side rendering and third-party scripts.
- The fix: Defer non-critical third-party scripts after checkout. Lazy-load below-fold content. Pre-connect to payment API domain. Reduces perceived time from 2.75s to under 1 second without backend changes.
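The gap between server time and perceived time in this answer is simple addition, sketched below with the timings quoted above. The function and key names are illustrative, and third-party scripts are left out because the text gives no figure for them.

```python
def latency_breakdown(phases_ms: dict[str, int], server_phase: str = "ttfb") -> dict[str, float]:
    """Total perceived latency and the share of it spent waiting on the server."""
    total = sum(phases_ms.values())
    server = phases_ms[server_phase]
    return {
        "total_ms": total,
        "server_ms": server,
        "server_share_pct": round(server / total * 100, 1),
    }


# Timings from the answer above, in milliseconds.
CHECKOUT_PHASES_MS = {
    "dns": 50, "tcp": 100, "tls": 100, "ttfb": 300,
    "download": 200, "js_rendering": 2000,
}

if __name__ == "__main__":
    print(latency_breakdown(CHECKOUT_PHASES_MS))
```

The phases sum to 2,750 ms, of which the 300 ms API response is about 11%, so backend tuning alone cannot fix what the user perceives.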