Cost Optimization

What You’ll Learn

By the end of this chapter, you’ll understand:
  • Why cloud costs spiral out of control - The $50,000 surprise cloud bill (true story!)
  • Azure pricing model - How Azure charges you (compute, storage, network) and hidden costs
  • Quick wins - How to save 30-60% immediately (right-sizing, auto-shutdown, reserved instances)
  • FinOps framework - How to implement Financial Operations culture in your team
  • Service-specific optimization - AKS, Cosmos DB, Storage cost reduction tactics
  • Real-world cost reduction - Case studies showing $28,000/month savings

Introduction: The Cloud Cost Crisis (Start Here if You’re New)

The $50,000 Surprise Bill (True Story)

Scenario: Small startup with 10 employees
Month 1: Azure bill = $500
  ↓ "Looks reasonable!"

Month 2: Azure bill = $2,000
  ↓ "We launched new features, makes sense"

Month 3: Azure bill = $8,000
  ↓ "Uh oh..."

Month 4: Azure bill = $50,000 ❌
  ↓ CEO: "WE'RE GOING BANKRUPT!"
What happened?
  • Developer left 20 VMs running for testing (forgot to shut them down)
  • Autoscaling set to create unlimited VMs (no budget limit)
  • Storage set to Hot tier for all data (including old logs)
  • Network traffic exploded (no CDN, serving 100 TB/month from Azure directly)
  • No monitoring, no alerts, no cost tracking
Cost breakdown (Month 4):
  • 20 test VMs @ $200/month = $4,000/month
  • 50 autoscaled production VMs @ $300/month = $15,000/month
  • 200 TB storage (Hot tier) @ $37/TB = $7,400/month
  • 100 TB network egress @ $87/TB = $8,700/month
  • Cosmos DB (50,000 RU/s always on) = $4,800/month
  • Azure SQL (Premium tier, unused) = $3,000/month
  • Misc (snapshots, disks, load balancers) = $7,100/month
Total: $50,000/month for a startup with 1,000 users!
The fix (took 1 week):
  • Auto-shutdown test VMs at night → Saved $2,400/month (60% savings)
  • Right-sized production VMs → Saved $6,000/month
  • Moved old data to Archive tier → Saved $6,500/month
  • Added Azure Front Door CDN → Saved $7,000/month
  • Cosmos DB autoscale → Saved $2,500/month
  • Deleted unused resources → Saved $10,000/month
New monthly cost: $15,600/month (69% savings!)
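The breakdown and the fixes above can be checked with a few lines of arithmetic. The sketch below only re-adds the figures quoted in this story; the dictionary keys are illustrative names, not Azure billing categories.

```python
# Month 4 line items from the breakdown above (USD per month).
month4_costs = {
    "test_vms": 20 * 200,
    "production_vms": 50 * 300,
    "hot_storage": 200 * 37,
    "network_egress": 100 * 87,
    "cosmos_db": 4_800,
    "azure_sql": 3_000,
    "misc": 7_100,
}

# Monthly savings attributed to each fix.
monthly_savings = {
    "auto_shutdown_test_vms": 2_400,
    "right_sized_production_vms": 6_000,
    "archive_tier_for_old_data": 6_500,
    "front_door_cdn": 7_000,
    "cosmos_db_autoscale": 2_500,
    "deleted_unused_resources": 10_000,
}

total_before = sum(month4_costs.values())
total_saved = sum(monthly_savings.values())
total_after = total_before - total_saved
savings_pct = round(100 * total_saved / total_before)

print(f"Before: ${total_before:,}  After: ${total_after:,}  Saved: {savings_pct}%")
```

Keeping a table like this next to each optimization project makes it easy to spot a line item that does not add up.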

Why Cloud Costs Are Hard to Control

Cloud vs. Traditional Data Center:
Traditional Data Center (CapEx model):
Buy 10 servers upfront: $100,000
  ↓ Pay once
  ↓ Use for 3 years
  ↓ Total cost: $100,000 (fixed, predictable)
  ↓ Cost per month: $2,778
  ↓ No surprises ✅
Cloud (OpEx model):
Spin up 10 VMs: $0 upfront
  ↓ Pay per second of usage
  ↓ Developer creates 50 more VMs for testing
  ↓ Autoscaling creates 100 more VMs during spike
  ↓ Someone leaves VM running over weekend
  ↓ Total cost: Unpredictable! ❌
  ↓ Can range from $1,000 to $100,000/month
The Problem:
  • Easy to spend: Spinning up resources takes 5 minutes, costs happen instantly
  • Hard to track: Bills come 30 days later (you forget what you created)
  • Invisible waste: Unused resources keep billing you (forgotten VMs, orphaned disks)
  • Complex pricing: 1,000+ pricing variables (VM sizes, storage tiers, network egress, etc.)

Real-World Cost Horror Stories

Example 1: The $180,000 Test Environment
What happened:
  • QA team created test environment for Black Friday load testing
  • Used production-sized VMs (50 Standard_D16s_v3 VMs)
  • Ran load test for 2 days
  • Forgot to delete test environment
  • Test VMs ran for 6 months
Cost:
  • 50 VMs × $600/month × 6 months = $180,000
  • Should have cost: 50 VMs × $600/month × 0.07 months (2 days) = $2,100
  • Waste: $177,900
Prevention:
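  • Auto-delete policy: Delete all resources tagged “Test” after 7 days
  • Cost: close to $0 (run it as a scheduled job, for example an Azure Automation runbook; Azure Policy alone can deny or audit resources but cannot delete them)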
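One way to sketch the selection logic behind such a cleanup job, assuming you already have an inventory of resources with their tags and a creation timestamp. The `Environment=Test` tag, the `created` field, and the function name are hypothetical; the actual deletion step is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def expired_test_resources(resources, now, max_age_days=7):
    """Return names of resources tagged Environment=Test that are older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for resource in resources:
        tags = resource.get("tags") or {}  # untagged resources may carry None
        if tags.get("Environment") == "Test" and resource["created"] < cutoff:
            expired.append(resource["name"])
    return expired


# Hypothetical inventory; in practice this would come from your own resource listing.
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
inventory = [
    {"name": "vm-loadtest-01", "tags": {"Environment": "Test"},
     "created": datetime(2026, 1, 2, tzinfo=timezone.utc)},
    {"name": "vm-loadtest-02", "tags": {"Environment": "Test"},
     "created": datetime(2026, 1, 12, tzinfo=timezone.utc)},
    {"name": "vm-web-01", "tags": {"Environment": "Production"},
     "created": datetime(2025, 6, 1, tzinfo=timezone.utc)},
    {"name": "disk-untagged", "tags": None,
     "created": datetime(2025, 6, 1, tzinfo=timezone.utc)},
]

print(expired_test_resources(inventory, now))
```

Only resources that are both tagged as test and past the age limit are selected, so production and untagged resources are never candidates for deletion.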

Example 2: The $200,000 CDN Bill
What happened:
  • E-commerce site serves product images directly from Azure Storage
  • No CDN (Content Delivery Network)
  • 500 TB/month of network egress
Cost:
  • Network egress: 500 TB × $87/TB = $43,500/month
  • Storage: 10 TB × $18/TB = $180/month
  • Total: $43,680/month
Fix (Added Azure Front Door CDN):
  • CDN cache hit rate: 95%
  • Network egress from Azure: 25 TB (the other 95% is served from CDN cache)
  • 25 TB × $87/TB = $2,175/month
  • CDN cost: $2,000/month
  • Storage (unchanged): $180/month
  • Total: $4,355/month (90% savings!)
Savings: $39,325/month
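The effect of the cache hit rate on delivery cost can be sketched as a small function. It uses the $87/TB egress rate and the flat $2,000 CDN fee from this example; both are this chapter's illustrative figures rather than a price quote, the function name is made up for the sketch, and storage is left out because the CDN does not change it.

```python
def monthly_delivery_cost(traffic_tb, cache_hit_rate=0.0, cdn_fee=0.0, egress_per_tb=87.0):
    """Egress from the origin plus a flat CDN fee, in USD per month."""
    origin_tb = traffic_tb * (1 - cache_hit_rate)  # only cache misses reach Azure Storage
    return round(origin_tb * egress_per_tb + cdn_fee, 2)


without_cdn = monthly_delivery_cost(500)
with_cdn = monthly_delivery_cost(500, cache_hit_rate=0.95, cdn_fee=2_000)

print(f"Without CDN: ${without_cdn:,.0f}/month")
print(f"With CDN:    ${with_cdn:,.0f}/month")
```

Because the origin cost scales with the miss rate, a drop from a 95% to a 90% hit rate doubles the egress part of the bill, which is why the hit rate is worth monitoring.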

Example 3: The “Stopped” VM That Kept Billing
What happened:
  • Developer “stopped” 20 VMs thinking it would save money
  • VMs showed as “Stopped” in Azure Portal
  • Still got charged $4,000/month
Why?
  • VMs were “Stopped” (shutdown OS) but NOT “Deallocated”
  • Azure still reserved the hardware (CPU, RAM)
  • Still charging compute costs
Fix:
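# Wrong: shutting down the guest OS, or stopping with -StayProvisioned,
# leaves the VM allocated (still charges!)
Stop-AzVM -Name "my-vm" -ResourceGroupName "rg-prod" -StayProvisioned -Force

# Correct: Stop-AzVM without -StayProvisioned deallocates the VM (stops compute charges)
Stop-AzVM -Name "my-vm" -ResourceGroupName "rg-prod" -Force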
Savings: $4,000/month

Understanding Azure Pricing (Simple Explanation)

Azure has 3 main cost categories:
1. Compute (Running things)
  • What: Cost to run VMs, App Services, Functions, Containers
  • How charged: Per second/minute/hour
  • Analogy: Renting a car (pay per hour you use it)
  • Example: Standard_D2s_v3 VM = $0.096/hour = $70/month (if running 24/7)
2. Storage (Storing things)
  • What: Cost to store data (disks, blobs, databases)
  • How charged: Per GB stored per month + operations (read/write)
  • Analogy: Renting a storage unit (pay per square foot per month)
  • Example: 1 TB Blob Storage (Hot tier) = $18/month
3. Network (Moving things)
  • What: Cost to transfer data in/out of Azure
  • How charged: Per GB transferred
  • Analogy: Shipping packages (pay per pound shipped)
  • Example: 100 GB egress = $8.70
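The three categories combine into a rough monthly estimate. The sketch below uses the example rates quoted above ($0.096/hour compute, $18/TB Hot storage, $0.087/GB egress) and assumes 730 hours in a month; the function and parameter names are made up for illustration, and real bills also include per-operation charges that are ignored here.

```python
HOURS_PER_MONTH = 730  # common approximation: 365 days * 24 hours / 12 months


def estimate_monthly_cost(vm_hourly_rate, vm_hours, storage_tb, egress_gb,
                          storage_per_tb=18.0, egress_per_gb=0.087):
    """Return a per-category cost estimate in USD per month."""
    compute = vm_hourly_rate * vm_hours
    storage = storage_tb * storage_per_tb
    network = egress_gb * egress_per_gb
    return {
        "compute": round(compute, 2),
        "storage": round(storage, 2),
        "network": round(network, 2),
        "total": round(compute + storage + network, 2),
    }


# One Standard_D2s_v3 running 24/7, 1 TB of Hot blob storage, 100 GB of egress.
estimate = estimate_monthly_cost(0.096, HOURS_PER_MONTH, storage_tb=1, egress_gb=100)
print(estimate)
```

Splitting the estimate by category shows at a glance which of the three dominates, which is usually where optimization should start.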

The Hidden Costs Nobody Tells You About

1. Stopped VMs still cost money (if not deallocated)
  • VM stopped (OS shutdown): Still charges compute ❌
  • VM deallocated: Compute stops, only storage charged ✅
2. Unattached disks keep billing
  • Deleted VM, but disk still exists: $100/month waste
  • Fix: Delete disks when deleting VMs
3. Snapshots are not free
  • 10 snapshots × 100 GB × $0.05/GB = $50/month
  • Fix: Delete old snapshots after 30 days
4. Public IP addresses cost money
  • Static public IP: $3.50/month (even if not attached to anything)
  • Unattached public IPs: Common waste
5. Load balancers without backends
  • Empty load balancer: $20/month
  • Fix: Delete unused load balancers
6. Database backups
  • Automatic backups for Azure SQL: Included
  • Keeping backups > 35 days: Extra cost ($0.20/GB/month)
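To see how these small items add up, here is a minimal sketch that totals them using the example rates from the list above. The rates are this chapter's illustrative figures, and the function and parameter names are hypothetical.

```python
# Monthly rates taken from the list above (USD); real prices vary by region and SKU.
RATES = {
    "unattached_disk": 100.00,      # per orphaned disk
    "snapshot_gb": 0.05,            # per GB of snapshot data
    "idle_public_ip": 3.50,         # per unattached static IP
    "empty_load_balancer": 20.00,   # per load balancer with no backends
}


def monthly_waste(unattached_disks=0, snapshot_gb=0, idle_public_ips=0, empty_load_balancers=0):
    """Estimate the monthly cost of forgotten resources."""
    total = (
        unattached_disks * RATES["unattached_disk"]
        + snapshot_gb * RATES["snapshot_gb"]
        + idle_public_ips * RATES["idle_public_ip"]
        + empty_load_balancers * RATES["empty_load_balancer"]
    )
    return round(total, 2)


# Example: 3 orphaned disks, 10 snapshots of 100 GB, 5 idle IPs, 2 empty load balancers.
print(monthly_waste(unattached_disks=3, snapshot_gb=10 * 100,
                    idle_public_ips=5, empty_load_balancers=2))
```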

What is Cost Optimization?

Cost optimization is the practice of reducing cloud spending while maintaining or improving performance. It’s not about being cheap; it’s about spending wisely and eliminating waste.
Why It Matters:
  • The Problem: Cloud bills can spiral out of control. A $1,000/month bill can become $10,000/month if not monitored.
  • The Solution: Understand what you’re paying for, eliminate waste, and optimize resources.
Real-World Example:
  • Before Optimization: Company spends $50,000/month on Azure
    • 50 VMs running 24/7 (even at night)
    • Oversized VMs (using 10% CPU but paying for 100%)
    • Unused resources (old disks, snapshots)
    • No reservations (paying full price)
  • After Optimization: Company spends $20,000/month (60% savings!)
    • Auto-shutdown for dev/test VMs
    • Right-sized VMs
    • Deleted unused resources
    • Purchased 3-year reservations
Azure Cost Optimization

Understanding Azure Pricing

How Azure Charges You: Azure uses a pay-as-you-go model. You pay for what you use, billed per second/minute/hour.
Key Concepts:
What: Charges for running VMs, App Services, Functions
How It Works:
  • VM running = charged per second
  • VM stopped (deallocated) = no compute charge (storage still charged)
  • VM stopped (not deallocated) = still charged (reserves hardware)
Example:
Standard_D2s_v3 VM:
- Running: $0.096/hour = $70/month (if running 24/7)
- Stopped (deallocated): $0/month compute, $10/month storage
- Stopped (not deallocated): $70/month (still reserves hardware!)
Gotcha: Just shutting down the OS doesn’t stop charges. You must deallocate the VM.
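The three power states map to charges as in the following sketch. The $70 compute and $10 disk figures are the example values from above, and the function name is made up for illustration.

```python
def vm_monthly_bill(state, compute_monthly=70.0, disk_monthly=10.0):
    """Return (compute, storage) charges in USD for a VM in the given power state."""
    if state not in ("running", "stopped", "deallocated"):
        raise ValueError(f"unknown state: {state}")
    # Only deallocation releases the hardware; a stopped VM still holds its CPU and RAM.
    compute = 0.0 if state == "deallocated" else compute_monthly
    return compute, disk_monthly


for state in ("running", "stopped", "deallocated"):
    compute, storage = vm_monthly_bill(state)
    print(f"{state:<12} compute=${compute:.0f} storage=${storage:.0f}")
```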

1. Azure Cost Management

Cost Analysis

Visualize spending by service, resource group, tags

Budgets

Set spending limits and alerts

Recommendations

Azure Advisor provides savings recommendations

Cost Allocation

Use tags for chargeback and showback
[!WARNING] Gotcha: B-Series Credits
“Burstable” (B-series) VMs are cheap because they limit your CPU. If you use 100% CPU for too long, you run out of “credits” and your VM gets throttled to its baseline performance (as low as about 10% of a vCPU on the smallest sizes). Monitor the CPU Credits Remaining metric!
[!TIP] Jargon Alert: Amortization
If you pay $3,600 upfront for a 3-year reservation, "Amortized Cost" shows it as $100/month in your reports. This helps you track the effective monthly burn rate, not just the cash flow.
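The difference between the two views can be sketched as follows, using the $3,600 three-year example from the tip. The function names are illustrative and are not part of any Azure SDK.

```python
def amortized_monthly_cost(upfront_payment, term_years):
    """Spread an upfront reservation payment evenly across the term."""
    return upfront_payment / (term_years * 12)


def cost_views(upfront_payment, term_years):
    """Return (actual, amortized) per-month cost lists over the reservation term."""
    months = term_years * 12
    actual = [float(upfront_payment)] + [0.0] * (months - 1)  # cash leaves in month 1
    amortized = [amortized_monthly_cost(upfront_payment, term_years)] * months
    return actual, amortized


actual, amortized = cost_views(3_600, 3)
print(actual[:3])     # first three months of the actual-cost view
print(amortized[:3])  # first three months of the amortized view
```

Both views sum to the same total over the term; only the month in which the cost is recognized differs.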

2. Quick Wins

Right-sizing:
Problem: VMs running at 10% CPU
Solution: Downsize VM
# Before: Standard_D4s_v3 (4 vCPU, $140/month)
# After: Standard_D2s_v3 (2 vCPU, $70/month)
# Savings: $70/month (50%)
# Note: resizing restarts the VM

az vm resize \
  --resource-group rg-prod \
  --name vm-web-01 \
  --size Standard_D2s_v3
Reserved Instances:
For stable workloads:
  • 1-year: 30-50% savings
  • 3-year: 50-70% savings
Example: $1,000/month → $400/month with 3-year RI
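A reservation is billed whether or not the VM runs, so it only pays off above a certain utilization. The sketch below works that out for the $1,000 to $400 example (a 60% discount); the function names are illustrative, and the model ignores disks and other charges that are the same in both cases.

```python
def reserved_monthly_cost(on_demand_monthly, discount):
    """Monthly cost under a reservation, paid regardless of usage."""
    return on_demand_monthly * (1 - discount)


def break_even_utilization(discount):
    """Fraction of the month a VM must run before a reservation beats pay-as-you-go."""
    return 1 - discount


def cheaper_option(on_demand_monthly, discount, utilization):
    """Compare pay-as-you-go at the given utilization against a reservation."""
    pay_as_you_go = on_demand_monthly * utilization
    reserved = reserved_monthly_cost(on_demand_monthly, discount)
    return "reservation" if reserved < pay_as_you_go else "pay-as-you-go"


print(round(reserved_monthly_cost(1_000, 0.60), 2))
print(round(break_even_utilization(0.60), 2))
print(cheaper_option(1_000, 0.60, utilization=0.30))  # e.g. a dev VM on an office-hours schedule
```

This is why reservations suit the always-on base load, while VMs that are shut down most of the week are usually better left on pay-as-you-go.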
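Auto-shutdown for dev/test:
# Goal: stop VMs at 6 PM and start them at 8 AM on weekdays
# (weekday-only stop/start schedules need Azure Automation or a similar scheduler)

# The built-in daily auto-shutdown covers the stop half; --time is HHMM in UTC
# (rg-dev and vm-dev-01 are placeholder names)
az vm auto-shutdown --resource-group rg-dev --name vm-dev-01 --time 1800

# Savings: 60% (nights + weekends)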
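How much a schedule saves follows directly from the hours the VM stays on. A minimal sketch, assuming the 8 AM to 6 PM weekday schedule described above; it covers compute hours only, and since disks keep billing while a VM is deallocated, the saving on the whole bill is somewhat lower than the compute figure.

```python
HOURS_PER_WEEK = 7 * 24


def compute_hours_saved(start_hour=8, stop_hour=18, days_per_week=5):
    """Fraction of weekly compute hours avoided by running only on the given schedule."""
    running_hours = (stop_hour - start_hour) * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


print(f"{compute_hours_saved():.0%} of compute hours avoided")
```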
Spot VMs:
For fault-tolerant workloads:
  • Up to 90% discount
  • Can be evicted with 30-second warning
  • Perfect for: Batch jobs, testing, CI/CD agents
Delete unused resources:
Common wastes:
  • Unattached disks ($100/month each)
  • Old snapshots ($50/month)
  • Orphaned public IPs ($4/month each)
  • Unused load balancers ($20/month)
# Find unattached disks
az disk list --query "[?diskState=='Unattached'].{Name:name, ResourceGroup:resourceGroup}"

# Delete them
az disk delete --name disk-old --resource-group rg-prod --yes

3. Storage Cost Optimization

Hot → Cool → Archive

Example: 1 TB for 1 year
- Hot: $216/year
- Cool: $120/year (44% savings)
- Archive: $12/year (94% savings!)

4. FinOps Best Practices

Tagging Strategy

Tag every resource with:
  • CostCenter
  • Environment (prod/dev/test)
  • Owner
  • Project

Showback/Chargeback

Show teams their costs
  • Monthly cost reports
  • Budget per team
  • Accountability

Regular Reviews

  • Weekly: Cost anomalies
  • Monthly: Optimization opportunities
  • Quarterly: Architecture review

Automation

  • Auto-shutdown for dev/test
  • Auto-scale based on usage
  • Budget alerts


5. FinOps Framework

FinOps Lifecycle: Inform, Optimize, Operate
FinOps (Financial Operations) = Cultural practice bringing financial accountability to cloud spending.

FinOps Principles

  1. Teams need to collaborate: Finance, Engineering, Product work together
  2. Everyone takes ownership: Engineers see cost impact of their decisions
  3. Centralized team drives FinOps: Dedicated FinOps team enables best practices
  4. Reports should be accessible: Real-time cost visibility for all stakeholders
  5. Decisions are driven by business value: Cost vs performance tradeoffs
  6. Take advantage of variable cost model: Right-size continuously

FinOps Lifecycle

┌─────────────┐
│   INFORM    │  ← Visibility, allocation, benchmarking
└──────┬──────┘
       ↓
┌─────────────┐
│  OPTIMIZE   │  ← Right-sizing, reservations, waste removal
└──────┬──────┘
       ↓
┌─────────────┐
│  OPERATE    │  ← Continuous improvement, automation
└─────────────┘

Implementing FinOps

Phase 1: Visibility (Month 1-2)
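# 1. Enable Cost Management: schedule a daily export of actual costs
#    (needs the costmanagement CLI extension; $SUBSCRIPTION_ID and the
#    container name cost-exports are placeholders)
az costmanagement export create \
  --name daily-costs \
  --scope "/subscriptions/$SUBSCRIPTION_ID" \
  --type ActualCost \
  --timeframe MonthToDate \
  --recurrence Daily \
  --recurrence-period from="2026-01-01T00:00:00Z" to="2026-12-31T00:00:00Z" \
  --storage-account-id /subscriptions/.../storageAccounts/costs \
  --storage-container cost-exports

# 2. Tag all resources (Merge keeps tags that already exist on the resource)
az tag update --resource-id $RESOURCE_ID \
  --operation Merge \
  --tags Environment=Production Team=Platform CostCenter=Engineering

# 3. Create cost alerts (a monthly cost budget)
az consumption budget create \
  --budget-name monthly-budget \
  --category cost \
  --amount 10000 \
  --time-grain monthly \
  --start-date 2026-01-01 \
  --end-date 2026-12-31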
Phase 2: Accountability (Month 3-4)
  • Assign cost owners to each resource group
  • Create showback reports per team
  • Monthly cost review meetings
Phase 3: Optimization (Month 5+)
  • Implement right-sizing recommendations
  • Purchase reserved instances for base load
  • Automate waste removal

6. Tagging Strategy & Governance

Cost Allocation through Tagging Strategy
Tags = Metadata for cost allocation, automation, and governance.

Essential Tags

| Tag | Example | Purpose |
|---|---|---|
| Environment | Production, Dev, Staging | Separate costs by environment |
| CostCenter | Engineering, Marketing | Chargeback to departments |
| Owner | [email protected] | Accountability |
| Project | ProjectAlpha, Migration2026 | Track project costs |
| Application | WebApp, API, Database | Group related resources |
| Criticality | Critical, High, Medium, Low | Prioritize optimization |
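Before enforcing tags, it helps to know how far the current estate is from compliance. The following is a minimal sketch of such a check over an exported resource list; the choice of required tags, the inventory shape, and the function names are assumptions for illustration.

```python
# Tags every resource is expected to carry (a subset of the table above).
REQUIRED_TAGS = ("Environment", "CostCenter", "Owner", "Project")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag names that are absent or empty."""
    tags = tags or {}  # untagged resources may report None
    return [name for name in required if not tags.get(name)]


def untagged_report(resources):
    """Map resource name -> missing tags, listing only non-compliant resources."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource.get("tags"))
        if missing:
            report[resource["name"]] = missing
    return report


# Hypothetical inventory.
resources = [
    {"name": "vm-web-01", "tags": {"Environment": "Production", "CostCenter": "Engineering",
                                   "Owner": "platform-team", "Project": "ProjectAlpha"}},
    {"name": "disk-old", "tags": {"Environment": "Dev"}},
    {"name": "pip-unknown", "tags": None},
]

print(untagged_report(resources))
```

Resources that show up in the report are the ones whose cost cannot be attributed to a team, so they are a natural starting point for cleanup.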

Tag Governance with Azure Policy

Require an Environment tag on every virtual machine (the same rule can be repeated for other resource types and tags):
{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Compute/virtualMachines"
      },
      {
        "field": "tags['Environment']",
        "exists": "false"
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}

Bulk Tagging Script

# Tag every resource in a resource group
$rg = "rg-production"
$tags = @{
    Environment = "Production"
    CostCenter = "Engineering"
    Owner = "[email protected]"
}

Get-AzResource -ResourceGroupName $rg | ForEach-Object {
    $resource = $_
    # Tags is $null on untagged resources, so start from an empty table
    $resourceTags = $resource.Tags
    if ($null -eq $resourceTags) { $resourceTags = @{} }

    # Merge tags (don't overwrite existing)
    foreach ($key in $tags.Keys) {
        if (-not $resourceTags.ContainsKey($key)) {
            $resourceTags[$key] = $tags[$key]
        }
    }

    Set-AzResource -ResourceId $resource.ResourceId -Tag $resourceTags -Force
}

7. Service-Specific Cost Optimization

AKS Cost Optimization

Problem: AKS cluster costs $5,000/month, mostly idle
Solutions:
  1. Right-size node pools:
# Enable cluster autoscaler (rg-prod is a placeholder resource group name)
az aks nodepool update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --min-count 3 \
  --max-count 10 \
  --enable-cluster-autoscaler
  2. Use Spot VMs for fault-tolerant workloads:
# --spot-max-price -1 means pay up to the current on-demand price
az aks nodepool add \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-count 3

# Savings: Up to 90% for batch jobs

Cosmos DB Cost Optimization

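Problem: Cosmos DB costs $2,000/month with 20,000 RU/s provisioned
Solutions:
  1. Use Autoscale instead of Manual:
# Before: 20,000 RU/s manual = $1,920/month (always on)
# After: 2,000-20,000 RU/s autoscale = $960/month avg

# Switch the container from manual to autoscale throughput
# (rg-prod is a placeholder resource group name)
az cosmosdb sql container throughput migrate \
  --resource-group rg-prod \
  --account-name mycosmosdb \
  --database-name mydb \
  --name mycontainer \
  --throughput-type autoscale

# Then set the autoscale ceiling
az cosmosdb sql container throughput update \
  --resource-group rg-prod \
  --account-name mycosmosdb \
  --database-name mydb \
  --name mycontainer \
  --max-throughput 20000
  2. Optimize queries (reduce RU consumption):
-- Costlier: SELECT * returns every property of every matching document
SELECT * FROM c WHERE c.status = 'active'

-- Cheaper: return only the fields you need (the c.status filter uses the default index either way)
SELECT c.id, c.name FROM c WHERE c.status = 'active'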
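Whether autoscale saves money depends on how spiky the workload is. The sketch below models autoscale billing under three stated assumptions: each hour is billed at the highest RU/s reached in that hour, the floor is 10% of the configured maximum, and the autoscale rate is 1.5 times the manual rate. The manual rate is derived from this chapter's $1,920 per 20,000 RU/s example; check current Cosmos DB pricing before relying on any of these numbers.

```python
# Manual-throughput price per RU/s per month, derived from the example above.
MANUAL_MONTHLY_PER_RU = 1_920 / 20_000


def autoscale_monthly_cost(hourly_peak_ru, max_ru,
                           manual_monthly_per_ru=MANUAL_MONTHLY_PER_RU,
                           autoscale_multiplier=1.5):
    """Estimate monthly autoscale cost from a representative list of hourly peak RU/s."""
    floor_ru = max_ru * 0.10  # assumed minimum billable throughput
    billed = [min(max(peak, floor_ru), max_ru) for peak in hourly_peak_ru]
    average_billed_ru = sum(billed) / len(billed)
    return round(average_billed_ru * manual_monthly_per_ru * autoscale_multiplier, 2)


# A typical day: idle for 20 hours, at the 20,000 RU/s ceiling for 4 hours.
spiky_day = [500] * 20 + [20_000] * 4
print(autoscale_monthly_cost(spiky_day, max_ru=20_000))

# A workload pinned at the ceiling around the clock.
flat_day = [20_000] * 24
print(autoscale_monthly_cost(flat_day, max_ru=20_000))
```

Under these assumptions a workload that sits at its peak all day costs more on autoscale than on manual throughput, so autoscale is a fit for variable traffic rather than a universal saving.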

Storage Cost Optimization

Problem: 100 TB of blob storage costs $2,000/month in Hot tier
Solution: Use lifecycle management:
{
  "rules": [
    {
      "name": "move-to-cool-after-30-days",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            }
          }
        }
      }
    }
  ]
}
Savings:
  • Hot: $0.0184/GB = $1,884/month for 100 TB
  • Cool: $0.01/GB = $1,024/month (46% savings)
  • Archive: $0.00099/GB = $101/month (95% savings)
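In practice data is spread across all three tiers, so the real bill is a blend. The sketch below applies the 30-day and 90-day thresholds from the policy above and the per-GB prices from this list; the age buckets in the example are made up, and retrieval and early-deletion fees for Cool and Archive are not modeled.

```python
GB_PER_TB = 1024
PRICE_PER_GB = {"hot": 0.0184, "cool": 0.01, "archive": 0.00099}  # USD per GB per month


def tier_for_age(days_since_modified):
    """Mirror the lifecycle rule: over 90 days -> archive, over 30 days -> cool."""
    if days_since_modified > 90:
        return "archive"
    if days_since_modified > 30:
        return "cool"
    return "hot"


def blended_monthly_cost(tb_by_age):
    """tb_by_age: list of (days_since_modified, terabytes) buckets."""
    total = 0.0
    for age_days, terabytes in tb_by_age:
        total += terabytes * GB_PER_TB * PRICE_PER_GB[tier_for_age(age_days)]
    return round(total, 2)


all_hot = blended_monthly_cost([(0, 100)])
# Hypothetical age profile: 10 TB recent, 20 TB a couple of months old, 70 TB older.
with_policy = blended_monthly_cost([(10, 10), (60, 20), (200, 70)])

print(f"All Hot: ${all_hot:,.2f}/month, with lifecycle policy: ${with_policy:,.2f}/month")
```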

8. Real-World Cost Reduction Case Studies

Case Study 1: E-Commerce Platform

Before:
  • Monthly cost: $45,000
  • 50 VMs running 24/7
  • All resources in Premium tier
Optimizations:
  1. Right-sized VMs (avg CPU 15%) → Saved $8,000/month
  2. Stopped dev/test VMs at night → Saved $6,000/month
  3. Purchased 3-year RIs for prod → Saved $12,000/month
  4. Moved logs to Cool storage → Saved $2,000/month
After:
  • Monthly cost: $17,000
  • Savings: $28,000/month (62%)

Case Study 2: SaaS Startup

Before:
  • Monthly cost: $12,000
  • AKS cluster with 20 nodes
  • Cosmos DB at 50,000 RU/s
Optimizations:
  1. Enabled AKS cluster autoscaler → Saved $3,500/month
  2. Cosmos DB autoscale → Saved $2,500/month
  3. Lifecycle management for storage → Saved $1,200/month
  4. Deleted unused resources → Saved $800/month
After:
  • Monthly cost: $4,000
  • Savings: $8,000/month (67%)

9. Cost Optimization Checklist

✅ Right-sized VMs (not oversized)
✅ Reserved Instances for stable workloads
✅ Spot VMs for batch/testing
✅ Auto-shutdown for dev/test
✅ Deleted unused resources
✅ Storage lifecycle policies
✅ Monitoring and alerts enabled
✅ Tags on all resources
✅ Regular cost reviews scheduled


10. Interview Questions

Beginner Level

Q: What is the difference between CapEx and OpEx?
Answer:
  • CapEx (Capital Expenditure): Upfront cost for physical infrastructure (servers, datacenters). Depreciated over time.
  • OpEx (Operational Expenditure): Pay-as-you-go model (cloud). Billed for what you use as you use it. No upfront cost.
Q: How do you stop compute charges for a VM you are not using?
Answer: You must Deallocate the VM (Stopped/Deallocated). Just shutting down the OS (Stopped) still reserves the hardware and incurs compute charges. Note: Storage (Disk) is still billed even when deallocated.

Intermediate Level

Q: When would you use Reserved Instances versus Spot VMs?
Answer:
  • Reserved Instances (RI): Commit to 1 or 3 years for steady workloads. ~40-70% savings. A billing discount; it does not by itself guarantee capacity.
  • Spot VMs: Run on unused Azure capacity at a discount of up to ~90%. No guarantee (can be evicted). Good for stateless/batch jobs.
Q: What is Azure Hybrid Benefit?
Answer: A licensing benefit that lets you bring your own on-premises Windows Server and SQL Server licenses (with Software Assurance) to Azure. It removes the cost of the OS license from the hourly VM rate, saving up to 40%.

Advanced Level

Q: How would you track and allocate cloud costs per team or tenant?
Answer:
  1. Tagging Policy: Enforce a CostCenter tag on all resources via Azure Policy.
  2. Cost Management: Create cost views filtered by tag.
  3. Exports: Export cost data to Azure Storage/Power BI for custom reporting.
  4. Subscription Partitioning: For strict isolation, give each tenant their own Subscription (Management Groups for governance).

11. Key Takeaways

Visibility

You can’t optimize what you can’t measure. Use cost analysis and dashboards properly.

Accountability

Use Tags to assign costs to teams. Make engineers responsible for their cloud spend.

Commitment

Use Reserved Instances and Savings Plans for predictable base loads to save 50%+.

Waste Elimination

Aggressively find and delete unused resources (orphaned disks, IPs, stopped VMs).

Architecture

Serverless and PaaS often scale to zero and cost less than always-on IaaS for variable workloads.

Next Steps

Continue to Chapter 13

Master high availability and disaster recovery strategies