Traffic Management & Network Security
What You’ll Learn
By the end of this chapter, you’ll understand:
- How load balancers work (and why Azure has 4 different ones)
- When to use each load balancer (Layer 4 vs Layer 7, Regional vs Global)
- Real costs and performance trade-offs between load balancing options
- How to prevent outages with health probes and connection draining
- Common mistakes that cause production failures
Introduction: What is Load Balancing?
Start Here if You’re Completely New
The Problem: You have a website running on a single server. What happens when:
- 100 users visit → Server handles it fine ✅
- 10,000 users visit → Server slows down ⚠️
- 100,000 users visit → Server crashes ❌
Real-World Analogy: Restaurant Hostess
Without Load Balancer = Restaurant with no hostess
- Customers walk in, sit anywhere
- One table gets 10 people (overcrowded)
- Other tables are empty
- Bad customer experience
With Load Balancer = Restaurant with a hostess
- Hostess greets customers
- Assigns them to available tables evenly
- All tables equally busy
- Great customer experience
Why This Matters: Real Cost of Getting It Wrong
Case Study: Target’s 2013 Black Friday Crash
Target’s website crashed on Black Friday 2013:
- The Setup: Used wrong type of load balancer
- The Problem: Load balancer couldn’t handle HTTP traffic properly
- The Incident: Website down for 4 hours during peak shopping
- The Cost: $440M in lost sales (that day alone)
- The Fix: Migrated to proper Layer 7 load balancer
- The Lesson: Choosing wrong load balancer cost $440M
Getting traffic into your application reliably and securely is just as important as the network inside.
1. Load Balancing Decision Tree
Understanding Azure’s 4 Load Balancers (From Absolute Zero)
Azure has 4 different load balancers. Choosing the wrong one is a disaster (see Target’s $440M loss above).
The Challenge: Why so many?
The Answer: Different use cases need different capabilities.
Layer 4 vs Layer 7 (Explained Simply)
The OSI Model is a networking standard with 7 layers. Most people only care about 2:
Layer 4 (Transport Layer) = Dumb, fast pipe
- Sees: IP address and port number only
- Example: “Send packet to 10.0.0.5 port 80”
- Doesn’t know: What’s in the packet (HTTP? SQL? Video?)
- Speed: Extremely fast (<1ms latency)
- Analogy: Mail carrier who only reads the address on envelope
Layer 7 (Application Layer) = Smart, CPU intensive
- Sees: HTTP headers, URL paths, cookies, everything
- Example: “Send /api requests to Server A, /images to Server B”
- Knows: Content type, can inspect and modify
- Speed: Slower (3-20ms latency, must parse HTTP)
- Analogy: Mail carrier who opens mail, reads it, decides where it should go
[!TIP] Jargon Alert: Layer 4 vs Layer 7 Layer 4 (Transport): Knows IP and Port. “Send packet to 10.0.0.5:80”. (Dumb, fast pipe). Layer 7 (Application): Knows URL, Cookies, Headers. “Send /api to Service A and /images to Service B”. (Smart, CPU intensive).
When to Use Each:
Use Layer 4 when:
- ✅ Need maximum speed (latency < 1ms)
- ✅ Non-HTTP traffic (databases, game servers)
- ✅ Don’t need to inspect content
Use Layer 7 when:
- ✅ Need smart routing (/api → different server than /images)
- ✅ Need SSL termination (decrypt HTTPS once, not on every server)
- ✅ Need Web Application Firewall (WAF) protection
- ✅ HTTP/HTTPS traffic only
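To make the Layer 4 vs Layer 7 difference concrete, here is a minimal Python sketch. It is an illustration only, not how Azure implements either service: the backend addresses, pool names, and the use of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib

# Hypothetical backend pool and route table, invented for this sketch.
BACKENDS = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]
ROUTES = {"/api": "api-pool", "/images": "image-pool"}


def pick_backend_l4(src_ip: str, src_port: int, dst_ip: str,
                    dst_port: int, protocol: str) -> str:
    """Layer 4: decide from connection metadata only, never the payload."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


def pick_pool_l7(path: str, default: str = "web-pool") -> str:
    """Layer 7: decide by inspecting the URL path of the HTTP request."""
    for prefix, pool in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return pool
    return default
```

The Layer 4 function never sees a URL, which is why it works for databases and game servers. The Layer 7 function needs the parsed HTTP request, which is where the extra latency comes from.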
Global vs Regional Load Balancers (Simplified)
Regional = Works within one Azure region (e.g., East US)
- Example: 3 servers in East US datacenter
Global = Works across multiple Azure regions
- Example: Servers in 3 continents
Azure’s 4 Load Balancers Explained
| Tool | Scope | Layer | Protocol | Monthly Cost | Best For |
|---|---|---|---|---|---|
| Azure Load Balancer | Regional | Layer 4 | TCP/UDP | $18 | Databases, High throughput |
| Application Gateway | Regional | Layer 7 | HTTP/S | $125 | Web apps in one region, WAF |
| Traffic Manager | Global | DNS | Any | $1.35/M queries | Non-HTTP, Legacy failover |
| Front Door | Global | Layer 7 | HTTP/S | $0.03/GB | Global web apps, CDN, WAF |
Global (Multi-Region) vs Regional (Detailed)
| Tool | Scope | Layer | Protocol | Best For |
|---|---|---|---|---|
| Front Door | Global | Layer 7 | HTTP/S | Web Apps, Microservices, CDN |
| Traffic Manager | Global | DNS | Any | Non-HTTP, Legacy failover |
| App Gateway | Regional | Layer 7 | HTTP/S | WAF, SSL Termination, Ingress |
| Load Balancer | Regional | Layer 4 | TCP/UDP | Databases, High throughput, Non-HTTP |
Deep Dive: Load Balancer Comparison
When choosing between Azure’s load balancing services, understanding the nuances is critical for production systems.
| Feature | Azure Load Balancer | Application Gateway | Azure Front Door | Traffic Manager |
|---|---|---|---|---|
| OSI Layer | Layer 4 (TCP/UDP) | Layer 7 (HTTP/HTTPS) | Layer 7 (HTTP/HTTPS) | DNS (name resolution only) |
| Scope | Regional (Zone-redundant) | Regional | Global (Multi-region) | Global (DNS-based) |
| SSL Termination | No | Yes | Yes | No |
| Path-based Routing | No | Yes (/api → Backend1) | Yes (/api → Origin1) | No |
| WAF | No | Yes (OWASP 3.2) | Yes (OWASP 3.2 + MS Rules) | No |
| Session Affinity | 5-tuple hash | Cookie-based | Cookie-based | No |
| Health Probes | TCP/HTTP | HTTP/HTTPS | HTTP/HTTPS | HTTP/HTTPS/TCP |
| Latency | <1ms | 3-10ms | 10-20ms (edge routing) | 60s+ (DNS TTL) |
| Throughput | 4M flows/sec | ~20 Gbps | ~50 Gbps | N/A (DNS only) |
| Cost | $0.005/GB | $0.008/GB | $0.03/GB | $1.35/M DNS queries |
| Typical Use Case | SQL Server, MongoDB | Microservices on AKS | Global SPA, CDN | DR Failover |
Understanding Key Features (Explained Simply)
SSL Termination = Decrypt HTTPS once at load balancer, not on every server
- Why It Matters: Saves CPU on your servers (encryption is expensive)
- Example: 1,000 HTTPS requests → Load balancer decrypts them → Servers get plain HTTP
- Cost Savings: 20-30% less CPU usage on servers
Path-Based Routing = Send requests to different backends based on the URL path
- Example: /api → API servers, /images → Image servers, /admin → Admin servers
- Why It Matters: Optimize server resources for specific tasks
- Analogy: Restaurant with different stations (grill, salad bar, dessert)
WAF (Web Application Firewall) = Inspects HTTP requests and blocks malicious ones
- Blocks: SQL injection, XSS (cross-site scripting), application-layer DDoS attacks
- Example: Hacker sends https://yoursite.com/api?id=1' OR '1'='1 → WAF blocks it
- Real Cost: Equifax breach cost $4B, could have been prevented with WAF
Session Affinity (Sticky Sessions) = Keep sending a user to the same server
- Problem: User logs in to Server A, next request goes to Server B (session lost!)
- Solution: “Pin” user to Server A for entire session
- Better Solution: Use Redis for shared sessions (no sticky sessions needed)
[!WARNING] Gotcha: Traffic Manager Isn’t a Load Balancer Traffic Manager is a DNS service. It returns an IP address to the client, then the client connects directly to that backend. If the backend goes down after DNS resolution, Traffic Manager won’t reroute traffic until the next DNS lookup (60+ seconds later). Use it for coarse-grained multi-region failover, not for real-time load balancing.
Common Mistake #1: Using Traffic Manager for Real-Time Failover
The Trap:
- Team deploys global app
- Uses Traffic Manager for failover
- Region goes down
- Problem: Users stuck on dead region for 60+ seconds (DNS TTL)
- Impact: Bad user experience, lost revenue
The Fix:
- Use Azure Front Door ($35/month)
- Failover in <10 seconds (no DNS caching)
- Cost: Traffic Manager is $1.35/M queries (similar price for most apps)
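The stale-routing window in this trap comes from client-side DNS caching, which a small simulation can show. This is a sketch under assumptions: `CachingDnsClient` and the region names are invented for illustration, and real resolvers add further caching layers that this model ignores.

```python
from typing import Callable, Optional


class CachingDnsClient:
    """Client that reuses a DNS answer until its TTL expires."""

    def __init__(self, resolver: Callable[[], str], ttl_s: int = 60) -> None:
        self.resolver = resolver      # stands in for a Traffic Manager lookup
        self.ttl_s = ttl_s
        self._cached: Optional[str] = None
        self._expires_at = 0

    def resolve(self, now_s: int) -> str:
        # Only ask the DNS service again once the cached answer has expired.
        if self._cached is None or now_s >= self._expires_at:
            self._cached = self.resolver()
            self._expires_at = now_s + self.ttl_s
        return self._cached


# Hypothetical scenario: the healthy region changes at t=10s.
state = {"healthy_region": "eastus"}
client = CachingDnsClient(lambda: state["healthy_region"], ttl_s=60)

answer_at_0 = client.resolve(0)          # fresh lookup
state["healthy_region"] = "westeurope"   # eastus fails, DNS is updated
answer_at_30 = client.resolve(30)        # still the cached, dead region
answer_at_60 = client.resolve(60)        # TTL expired, new lookup
```

Even though the DNS service has the right answer from t=10s, the client keeps connecting to the failed region until its cached record expires.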
Session Affinity (Sticky Sessions) Explained
The Problem (Story Format): Imagine you’re shopping online:
- Step 1: You visit website → Load balancer sends you to Server A
- Step 2: You log in → Server A stores “You are logged in” in memory
- Step 3: You add item to cart → Load balancer sends you to Server B
- Result: Server B doesn’t know you’re logged in → “401 Unauthorized” error ❌
Method 1: Azure Load Balancer (5-Tuple Hash)
How It Works:
- Load balancer looks at your IP address + port
- Creates a “fingerprint” (hash)
- Always sends same fingerprint to same server
- ✅ Works for any protocol (TCP, UDP, HTTP)
- ✅ Very fast (no cookies to parse)
- ❌ If your IP changes (mobile switching cell towers), you get routed to different server
- ❌ If you’re behind NAT (corporate network), everyone shares same IP
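The NAT drawback is easy to see in a sketch. The code below is a simplification that hashes only the client IP (an assumption for illustration; it is not Azure’s actual hashing algorithm), and all addresses are documentation placeholders.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def backend_for(client_ip: str, backends: List[str]) -> str:
    """Pin a client to a backend by hashing its source IP."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def distribution(client_ips: Iterable[str], backends: List[str]) -> Counter:
    """Count how many clients land on each backend."""
    return Counter(backend_for(ip, backends) for ip in client_ips)


backends = ["server-a", "server-b", "server-c"]

# 1,000 office users behind one NAT gateway share a single public IP.
nat_users = ["203.0.113.10"] * 1000
nat_spread = distribution(nat_users, backends)
```

`nat_spread` contains a single backend holding all 1,000 users: the hash cannot tell them apart, so one server takes the whole office while the others sit idle.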
Method 2: Application Gateway / Front Door (Cookie-Based)
How It Works:
- First request → Load balancer picks Server A
- Response includes cookie: Set-Cookie: ApplicationGatewayAffinity=abc123
- Future requests → Browser sends cookie → Load balancer reads it → Routes to Server A
- ✅ Survives IP changes (mobile networks, VPN switches)
- ✅ More accurate than IP-based
- ❌ If user clears cookies, session is lost
- ❌ Slightly slower (must parse HTTP headers)
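The cookie flow can be sketched in a few lines of Python. This is a toy model, not Application Gateway’s implementation: the class name, the round-robin choice for new users, and the in-memory token table are assumptions. Only the cookie name comes from the text above.

```python
import secrets
from typing import Dict, List, Optional, Tuple


class CookieAffinityBalancer:
    COOKIE = "ApplicationGatewayAffinity"  # cookie name from the text above

    def __init__(self, backends: List[str]) -> None:
        self.backends = list(backends)
        self._next = 0                      # round-robin cursor for new users
        self._pinned: Dict[str, str] = {}   # affinity token -> backend

    def route(self, cookies: Dict[str, str]) -> Tuple[str, Optional[str]]:
        """Return (backend, new cookie value or None if already pinned)."""
        token = cookies.get(self.COOKIE)
        if token is not None and token in self._pinned:
            return self._pinned[token], None
        # No valid cookie: pick a backend and issue a fresh affinity token.
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        token = secrets.token_hex(8)
        self._pinned[token] = backend
        return backend, token
```

A returning browser that presents the cookie is routed to the same backend regardless of its IP address. A browser that cleared its cookies is treated as a new user and may land elsewhere.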
[!TIP] Best Practice: Use a distributed cache such as Redis (for example, Azure Cache for Redis) for session state, so sticky sessions aren’t required. This allows horizontal scaling without session loss.
Common Mistake #2: Relying on Sticky Sessions
The Trap:
- App stores sessions in server memory
- Uses sticky sessions
- Server crashes → All sessions on that server lost
- Impact: Users forced to log in again
Real-World Example:
- E-commerce site during Black Friday
- Server crash lost 10,000 active sessions
- Users had to re-add items to cart
- 70% abandoned their carts
- Cost: $2.1M in lost sales
The Fix:
- Migrate sessions to Redis ($20/month)
- No sticky sessions needed
- Server crashes don’t lose sessions
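Here is a rough sketch of why shared session storage removes the need for sticky sessions. `SharedSessionStore` is an in-memory stand-in for an external store such as Redis, and `AppServer` is a hypothetical application server; neither is a real library API.

```python
import json
from typing import Dict, Optional


class SharedSessionStore:
    """In-memory stand-in for an external store such as Redis (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, session_id: str, session: dict) -> None:
        self._data[session_id] = json.dumps(session)

    def load(self, session_id: str) -> Optional[dict]:
        raw = self._data.get(session_id)
        return json.loads(raw) if raw is not None else None


class AppServer:
    """Hypothetical server that keeps no session state in its own memory."""

    def __init__(self, name: str, store: SharedSessionStore) -> None:
        self.name = name
        self.store = store

    def login(self, session_id: str, user: str) -> None:
        self.store.save(session_id, {"user": user, "cart": []})

    def add_to_cart(self, session_id: str, item: str) -> int:
        session = self.store.load(session_id)
        if session is None:
            return 401  # unknown session: user must log in
        session["cart"].append(item)
        self.store.save(session_id, session)
        return 200
```

Because both servers read and write the same store, a user who logs in on Server A can add to the cart on Server B, and losing either server does not lose the session.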
Health Probes: Keeping Dead Servers Out of Rotation
The Problem (Explained Simply): Imagine you have 3 servers behind a load balancer:
- Server A: Running fine ✅
- Server B: Running fine ✅
- Server C: Crashed (out of memory) ❌
How Health Probes Work
The Concept: Load balancer acts like a doctor doing checkups:
- Every 15-30 seconds: “Are you healthy?”
- Server responds: “Yes, I’m fine!” → Stays in rotation
- Server doesn’t respond: (Marked unhealthy after 2-3 failures) → Removed from rotation
Azure Load Balancer Health Probes (Simple)
- Every 15 seconds, try to connect to port 80
- If 2 consecutive failures → Mark server unhealthy
- Marks unhealthy after: 2 × 15s = 30 seconds
TCP Probe:
- Question: “Is port 80 open?”
- Server: “Yes, port is open” ✅
- Problem: Port might be open, but app crashed!
HTTP Probe:
- Question: “GET /health → Give me HTTP 200 OK”
- Server: “HTTP 200 OK” ✅
- Better: Confirms app is actually responding
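The probe interval and failure threshold combine into a simple state machine. The sketch below models it in Python; `ProbeTracker` is a hypothetical name, and the rule that a single successful probe restores the backend is an assumption made to keep the example short.

```python
class ProbeTracker:
    """Tracks consecutive probe failures for one backend."""

    def __init__(self, interval_s: int = 15, unhealthy_threshold: int = 2) -> None:
        self.interval_s = interval_s
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0
        self.in_rotation = True

    def record(self, success: bool) -> bool:
        """Record one probe result and return whether the backend is in rotation."""
        if success:
            self.consecutive_failures = 0
            self.in_rotation = True  # assumption: one success restores the backend
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation

    def detection_time_s(self) -> int:
        """Approximate time to mark a dead backend unhealthy: threshold x interval."""
        return self.interval_s * self.unhealthy_threshold
```

With the defaults, `detection_time_s()` returns 30, matching the 2 × 15s figure above. Shorter intervals detect failures faster but send more probe traffic to every backend.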
Application Gateway Health Probes (Advanced)
- Every 30 seconds, send GET /health
- If server doesn’t respond in 30 seconds → Timeout
- If 3 consecutive failures → Mark unhealthy
- Marks unhealthy after: 3 × 30s = 90 seconds
Best Practice: Use a /health endpoint that checks everything:
- Port might be open ✅
- App might be running ✅
- But if database is down → Health check fails → Server removed from rotation ✅
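A deep health check like this can be sketched with Python’s standard library. Treat it as a starting point under assumptions: `check_database` is a placeholder you would replace with a real query, and returning 503 on failure is a common convention rather than something the probe requires.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    """Hypothetical dependency check; replace with a real query such as SELECT 1."""
    return True


CHECKS: Dict[str, Callable[[], bool]] = {"database": check_database}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every check; any failure or exception makes the server unhealthy."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = evaluate_health(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, serve the handler with `HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()` from `http.server`. Keep the checks fast: a health endpoint that is slower than the probe timeout will itself cause the server to be marked unhealthy.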
[!WARNING] Gotcha: Health Probe IPs Health probes come from Azure’s internal IP address 168.63.129.16. You MUST allow this IP in your NSG, or all backends will be marked unhealthy!
Common Mistake #3: Blocking Health Probe IP
Real-World Example: A team locked down their NSG to only allow traffic from Front Door IP ranges, which blocked the health probe IP:
- Incident duration: 4 hours
- Users affected: 500,000
- Revenue lost: $2.8M
- Prevention: One extra NSG rule (free!)
Connection Draining: Gracefully Shutting Down Backends
Problem: You deploy a new version. Azure removes the old VM from the load balancer pool, but it has 50 active connections processing long-running API requests. If you immediately kill the VM, those requests fail.
Azure Load Balancer
- Idle Timeout: After the configured period of inactivity (4 minutes by default; configurable, for example to 30 minutes), the connection is closed.
- No Graceful Draining: Azure Load Balancer doesn’t support draining. Use a rolling update strategy.
Application Gateway
- How it works: When you remove a backend, App Gateway stops sending new requests to it, but allows existing connections to finish for up to 300 seconds.
[!TIP] Best Practice: Set drain timeout slightly above your P99 request latency. If 99% of requests finish in 10 seconds, set drain timeout to 15s.
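One way to derive a drain timeout from measured latencies is sketched below. The 1.5× headroom factor is an assumption chosen to reproduce the 10s → 15s example, and the 300-second cap is the limit mentioned above; tune both to your own workload.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def drain_timeout_s(latencies_s: Sequence[float], headroom: float = 1.5,
                    max_timeout_s: int = 300) -> int:
    """P99 latency plus headroom, capped at the maximum drain window."""
    p99 = percentile(latencies_s, 99)
    return min(math.ceil(p99 * headroom), max_timeout_s)


# Hypothetical sample: most requests take 2s, the slowest take 10s.
latencies = [2.0] * 98 + [10.0] * 2
suggested = drain_timeout_s(latencies)
```

Here the P99 is 10 seconds, so `suggested` is 15. If your P99 times the headroom exceeds the cap, draining alone will not save the slowest requests, and those endpoints are better moved to an async pattern.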
Cross-Region Load Balancing: Front Door vs Traffic Manager
| Scenario | Use Front Door | Use Traffic Manager |
|---|---|---|
| Global HTTP/S app | ✅ Automatic failover, anycast | ❌ DNS caching causes stale routes |
| Non-HTTP workload (TCP/UDP) | ❌ HTTP/S only | ✅ Works with any protocol |
| Real-time failover required | ✅ Sub-second failover | ❌ 60s+ DNS TTL delay |
| Cost-sensitive | ❌ $0.03/GB (3x more) | ✅ $1.35/M queries |
| Need CDN + WAF | ✅ Built-in | ❌ Must add separate CDN |
Decision Flowchart
2. Azure Front Door
The modern entry point for global web applications.
- CDN: Caches static content at the edge.
- Anycast: Users connect to the nearest Microsoft Edge node (POPs).
- WAF: Web Application Firewall protects against SQL Injection, XSS.
[!WARNING] Gotcha: The response timeout Front Door enforces a hard timeout on how long it waits for a backend response. If your backend takes 5 minutes to process a report, Front Door will cut the connection. Use async patterns!
3. Azure Application Gateway
The regional Layer 7 load balancer.
- WAF: Uses OWASP rules (same as Front Door).
- Autoscaling: Scales up based on traffic load.
- AGIC: Application Gateway Ingress Controller for AKS.
4. Azure NAT Gateway
The Problem: SNAT Port Exhaustion. When 100 VMs try to talk to the internet using one Standard Load Balancer public IP, they run out of “Source Ports” (SNAT ports). Connections start failing randomly.
The Solution: NAT Gateway.
- Dedicated resource for outbound traffic.
- Provides 64,000 SNAT ports per Public IP.
- You can attach up to 16 Public IPs (1 Million+ connections).
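The capacity arithmetic is worth writing down. The sketch below uses the round figures from this section (64,000 ports per IP, up to 16 IPs) and treats one SNAT port as one concurrent outbound connection, which is a simplification.

```python
SNAT_PORTS_PER_IP = 64_000  # round figure used in this section
MAX_PUBLIC_IPS = 16         # maximum public IPs per NAT Gateway, per this section


def snat_capacity(public_ips: int) -> int:
    """Total SNAT ports available for a given number of public IPs."""
    if not 1 <= public_ips <= MAX_PUBLIC_IPS:
        raise ValueError(f"public_ips must be between 1 and {MAX_PUBLIC_IPS}")
    return public_ips * SNAT_PORTS_PER_IP


def public_ips_needed(concurrent_connections: int) -> int:
    """Smallest number of public IPs that covers the connection count."""
    needed = max(1, -(-concurrent_connections // SNAT_PORTS_PER_IP))  # ceiling division
    if needed > MAX_PUBLIC_IPS:
        raise ValueError("exceeds what a single NAT Gateway can provide")
    return needed
```

For example, `snat_capacity(16)` gives 1,024,000 ports (the “1 Million+” figure), and a workload holding 100,000 concurrent outbound connections needs `public_ips_needed(100_000)`, which is 2.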
[!IMPORTANT] Best Practice: Always attach a NAT Gateway to your subnets if you have high outbound traffic (e.g., API scrapers, high-volume webhooks).
5. Azure DNS
Public Zones
Host your domain (example.com). Azure has ultra-fast global DNS servers (ns1-01.azure-dns.com).
Private Zones
Internal DNS (app.internal).
- Resolve hostnames across VNets.
- Auto-registration: When you create a VM, it automatically gets a DNS record (vm1.app.internal).
- Used heavily by Private Link to map mypaas.privatelink.database.windows.net.
Split-Horizon DNS
You can have api.company.com resolve to a Public IP for external users, but a Private IP (10.0.0.5) for internal users on VPN.
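Conceptually, split-horizon DNS is a lookup that depends on where the query comes from. The sketch below models that in Python. In Azure this is achieved with a public zone and a private zone of the same name rather than code like this, and the 10.0.0.0/8 internal range and the 203.0.113.20 public address are placeholder assumptions.

```python
import ipaddress

# Assumed internal (VNet/VPN) address space for this example.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

# One name, two answers. The public IP is a documentation placeholder.
RECORDS = {
    "api.company.com": {"internal": "10.0.0.5", "external": "203.0.113.20"},
}


def resolve(name: str, client_ip: str) -> str:
    """Return the private answer for internal clients, the public one otherwise."""
    addr = ipaddress.ip_address(client_ip)
    is_internal = any(addr in net for net in INTERNAL_NETWORKS)
    return RECORDS[name]["internal" if is_internal else "external"]
```

A VPN user at 10.1.2.3 resolves api.company.com to 10.0.0.5 and stays on the private network, while an internet user gets the public address for the same name.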
6. Case Study: E-Commerce Architecture
Putting it all together:
- User hits www.shop.com.
- Azure Front Door intercepts, checks WAF, serves global cache.
- Forwards dynamic request to Application Gateway in Region A.
- App Gateway routes /cart to AKS Cluster (in a private subnet).
- AKS Pod talks to Azure SQL via Private Link (traffic never leaves VNet).
- AKS Pod sends email via SendGrid using NAT Gateway (to avoid SNAT port exhaustion).
- DevOps Engineer connects via VPN Gateway to debug DB issues.