Chapter 17: Managing the Bill - Cost Management and FinOps
In the cloud, cost is not just a line item for finance; it is an engineering metric. A poorly architected system isn’t just slow or insecure; it is also expensive. In Google Cloud, managing costs requires a deep understanding of the billing hierarchy, the discount models, and the automation tools available to enforce “FinOps” (Financial Operations) principles.
1. The GCP Billing Hierarchy
To manage costs at scale, you must first understand how they are attributed.
- Cloud Billing Account: The top-level resource linked to a payment method.
- Projects: All resources live in projects, and each project is linked to one billing account.
- Labels: These are key-value pairs (e.g., team:search, env:prod) attached to resources. Labels are the single most important tool for cost allocation. They are exported into your billing data, allowing you to see exactly how much each team is spending.
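To make the label-based attribution above concrete, here is a minimal sketch of rolling up cost by label value. The rows are hypothetical and only loosely shaped like billing-export records (a cost plus a list of key/value labels); the function name `cost_by_label` and all amounts are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows: each has a cost and a list of {"key", "value"} labels.
# The amounts and label values are invented for this example.
rows = [
    {"cost": 12.50, "labels": [{"key": "team", "value": "search"},
                               {"key": "env", "value": "prod"}]},
    {"cost": 3.25, "labels": [{"key": "team", "value": "search"},
                              {"key": "env", "value": "dev"}]},
    {"cost": 7.00, "labels": [{"key": "team", "value": "ads"}]},
    {"cost": 1.75, "labels": []},  # spend with no labels at all
]


def cost_by_label(rows, label_key):
    """Sum cost per value of `label_key`; rows lacking it go to 'unlabeled'."""
    totals = defaultdict(float)
    for row in rows:
        value = next(
            (label["value"] for label in row["labels"] if label["key"] == label_key),
            "unlabeled",
        )
        totals[value] += row["cost"]
    return dict(totals)


print(cost_by_label(rows, "team"))
```

The "unlabeled" bucket is worth tracking on its own: the larger it grows, the less of the bill you can attribute to a team.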
2. Discount Models: CUDs and SUDs
GCP offers several ways to reduce your “list price” spend.
Sustained Use Discounts (SUDs)
SUDs are automatic. If you run an eligible Compute Engine instance for more than 25% of a month, Google automatically starts applying a discount to the additional usage. For a full month, the discount can reach up to 30%, depending on the machine family (not every family is eligible).
Committed Use Discounts (CUDs)
CUDs require a commitment (1 or 3 years) in exchange for deep discounts (up to 70%).
- Resource-based CUDs: You commit to a specific amount of vCPU and RAM in a specific region. Best for predictable, steady-state workloads.
- Flexible (Spend-based) CUDs: You commit to a specific hourly spend (e.g., “$10/hour”). This applies across multiple regions and even multiple products (Compute Engine, Cloud Run, Spanner). Best for dynamic organizations that change machine types or regions frequently.
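The arithmetic behind both discount models can be sketched in a few lines. The tier table below assumes the incremental SUD schedule commonly published for N1-style machine types (each successive quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate); treat it as an assumption and check the current pricing page. The CUD helper uses a simplified model in which a commitment is billed for every hour whether or not it is used, and it ignores any SUDs the on-demand alternative would earn.

```python
# Assumed incremental schedule: (upper bound of usage band, rate multiplier).
SUD_TIERS = [(0.25, 1.00), (0.50, 0.80), (0.75, 0.60), (1.00, 0.40)]


def sud_effective_discount(usage_fraction):
    """Effective discount for running `usage_fraction` (0 < x <= 1) of a month."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    billed = 0.0
    lower = 0.0
    for upper, rate in SUD_TIERS:
        if usage_fraction <= lower:
            break
        # Bill only the part of this band that was actually used.
        billed += (min(usage_fraction, upper) - lower) * rate
        lower = upper
    return 1.0 - billed / usage_fraction


def cud_break_even_utilization(cud_discount):
    """Fraction of the term a resource must run for a commitment to pay off.

    Committed cost = (1 - discount) * list price for every hour;
    on-demand cost = utilization * list price. They are equal when
    utilization = 1 - discount.
    """
    return 1.0 - cud_discount


print(round(sud_effective_discount(1.0), 4))        # full month
print(round(sud_effective_discount(0.5), 4))        # half a month
print(round(cud_break_even_utilization(0.55), 4))   # hypothetical 55% CUD
```

Under these assumptions a full month of usage works out to the 30% ceiling mentioned above, and a commitment with a hypothetical 55% discount only pays off if the workload would otherwise run more than about 45% of the time.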
3. Cost Optimization Strategies
The Recommender API
Google uses ML to analyze your resource usage and provides “Recommendations.”
- Rightsizing: It might suggest moving a VM from n2-standard-4 to n2-standard-2 if the CPU usage is consistently below 10%.
- Idle Resources: It identifies unattached Persistent Disks, idle IP addresses, and unused Load Balancers that are costing you money every hour.
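The rightsizing idea can be illustrated with a deliberately naive rule. This is not Google's recommendation algorithm, which is not described here; it is a hypothetical heuristic that halves the vCPU count only when every observed CPU sample is under the 10% threshold from the example above. The function name and the sample values are assumptions.

```python
def suggest_vcpus(current_vcpus, cpu_samples, low_threshold=0.10):
    """Return a suggested vCPU count using a toy 'consistently idle' rule.

    cpu_samples: CPU utilization readings as fractions (0.0 to 1.0).
    The VM is halved only if *all* samples are below `low_threshold`
    and it has more than 2 vCPUs; otherwise it is left unchanged.
    """
    if not cpu_samples:
        return current_vcpus  # no data, no suggestion
    if current_vcpus > 2 and max(cpu_samples) < low_threshold:
        return current_vcpus // 2
    return current_vcpus


# A 4-vCPU VM that never exceeds 7% CPU is a candidate for 2 vCPUs.
print(suggest_vcpus(4, [0.03, 0.07, 0.05]))
# One busy sample is enough to leave the VM alone under this rule.
print(suggest_vcpus(4, [0.03, 0.40, 0.05]))
```

A real recommendation would also weigh memory, observation window, and burst patterns, so treat this only as a way to reason about the threshold.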
Spot VMs (formerly Preemptible)
Spot VMs offer a 60-91% discount compared to on-demand prices.
- The Catch: Google can take them back at any time with a 30-second notice.
- Best Use: Batch processing, CI/CD runners, and fault-tolerant GKE node pools.
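Because of the short notice, workloads on Spot VMs are usually written so they can stop between units of work and resume later. The sketch below shows that pattern with an in-memory checkpoint. It assumes the preemption notice reaches your process as SIGTERM during shutdown, which you should verify for your setup; the class and function names are hypothetical, and a real job would persist the checkpoint to durable storage rather than keep it in memory.

```python
import signal


class CheckpointingWorker:
    """Processes items one at a time and can stop cleanly between items."""

    def __init__(self, items):
        self.items = list(items)
        self.next_index = 0          # the checkpoint: next item to process
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signature matches a signal handler, but it can be called directly.
        self.stop_requested = True

    def run(self, process):
        """Process items from the checkpoint until done or asked to stop."""
        while self.next_index < len(self.items) and not self.stop_requested:
            process(self.items[self.next_index])
            self.next_index += 1     # advance only after the item succeeded
        return self.next_index


def install_sigterm_handler(worker):
    """Assumption: shutdown delivers SIGTERM to this process."""
    signal.signal(signal.SIGTERM, worker.request_stop)
```

Keeping each unit of work well under 30 seconds is what makes this safe: the loop only checks the stop flag between items.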
4. Advanced Visibility: Billing Export to BigQuery
The standard billing console is fine for small projects, but for enterprises, you must enable the Billing Export to BigQuery.
- Granularity: You get per-hour, per-resource cost data.
- Custom Dashboards: Point Looker Studio at your BigQuery billing dataset to build custom dashboards for every team lead.
- Anomaly Detection: You can write SQL queries to detect “Cost Spikes” (e.g., “Alert me if any project spends 20% more today than it did yesterday”).
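The day-over-day rule quoted above ("20% more today than yesterday") is simple enough to sketch directly. In practice the per-project daily totals would come from a query against the billing export; here they are plain dictionaries with invented values, and `find_cost_spikes` is a hypothetical helper name.

```python
def find_cost_spikes(today, yesterday, threshold=0.20):
    """Return {project: fractional_increase} where cost grew by more than threshold.

    today / yesterday: dicts mapping project id to that day's total cost.
    Projects with no positive baseline yesterday are skipped, since a
    percentage increase is undefined for them.
    """
    spikes = {}
    for project, cost_today in today.items():
        cost_yesterday = yesterday.get(project, 0.0)
        if cost_yesterday <= 0:
            continue
        increase = (cost_today - cost_yesterday) / cost_yesterday
        if increase > threshold:
            spikes[project] = increase
    return spikes


# Invented daily totals per project.
yesterday = {"proj-a": 100.0, "proj-b": 100.0}
today = {"proj-a": 130.0, "proj-b": 105.0, "proj-new": 50.0}
print(find_cost_spikes(today, yesterday))
```

Brand-new projects are skipped here on purpose; a separate check with an absolute spend limit is a reasonable way to cover them.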
5. GKE Cost Optimization
Kubernetes is a major cost driver. GKE offers specialized tools to keep it under control:
- GKE Autopilot: You pay for the resources your Pods request rather than for whole nodes. Google handles the “bin-packing” (fitting as many pods onto a node as possible), eliminating the cost of idle node capacity.
- Cost Allocation: GKE can attribute costs down to the Namespace or even the Label level within a cluster. This is essential for chargebacks in a shared cluster environment.
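As a rough illustration of chargeback in a shared cluster, the sketch below splits a cluster's cost across namespaces in proportion to their CPU requests. This is a simplified model chosen for the example, not a description of how GKE cost allocation computes its figures; the function name and numbers are assumptions.

```python
def allocate_by_requests(cluster_cost, cpu_requests_by_namespace):
    """Split `cluster_cost` across namespaces by share of total CPU requests."""
    total = sum(cpu_requests_by_namespace.values())
    if total <= 0:
        raise ValueError("total CPU requests must be positive")
    return {
        namespace: cluster_cost * requested / total
        for namespace, requested in cpu_requests_by_namespace.items()
    }


# Invented example: a 1000-unit monthly bill and requests in vCPUs.
print(allocate_by_requests(1000.0, {"search": 6.0, "ads": 2.0}))
```

A proportional split like this charges idle node capacity back to the teams; whether that is fair depends on who controls the node pool sizing.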
6. Budgets and Programmatic Alerts
A “Budget” in GCP does not stop your services; it only sends alerts.
- Thresholds: Set alerts at 50%, 90%, and 100% of your expected spend.
- Pub/Sub Integration: You can send a budget alert to a Pub/Sub topic. This can trigger a Cloud Function that automatically shuts down non-production environments if they exceed their monthly limit.
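A handler for the Pub/Sub integration might look like the sketch below. It assumes the message arrives as a Pub/Sub-style event whose `data` field is base64-encoded JSON containing `costAmount`, `budgetAmount`, and `budgetDisplayName`; verify these field names against the current budget notification format. The `shut_down` callback is hypothetical and stands in for whatever action you choose to take.

```python
import base64
import json


def parse_budget_notification(event):
    """Decode the base64 JSON payload from a Pub/Sub-style event dict."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def handle_budget_alert(event, context=None, shut_down=None, limit_ratio=1.0):
    """Call `shut_down(budget_name)` once cost reaches limit_ratio * budget.

    Returns True if the shutdown action was triggered. The explicit
    comparison matters because notifications can also arrive while
    spend is still under budget.
    """
    payload = parse_budget_notification(event)
    cost = payload["costAmount"]
    budget = payload["budgetAmount"]
    if budget > 0 and cost / budget >= limit_ratio and shut_down is not None:
        shut_down(payload["budgetDisplayName"])
        return True
    return False
```

Keeping the decision logic separate from the shutdown action makes it easy to test the handler without touching any real resources, and to restrict the destructive step to non-production projects.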
6. Advanced FinOps: Egress and Orphans
6.1 Identifying “Orphaned” Resources
A common source of waste is “orphaned” resources: disks or IPs left behind after a VM is deleted.
- BigQuery SQL: Use the billing export to find resources with cost > 0 but usage = 0 or no associated labels.
- Automation: Use the Recommender API to identify these orphans, and script their deletion in non-production projects.
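The filter described above can be sketched over in-memory rows. It reads the rule as "cost above zero and either zero usage or no labels", which is one interpretation of the wording; the row shape, resource names, and the function name `find_orphan_candidates` are assumptions for illustration.

```python
def find_orphan_candidates(rows):
    """Return names of resources that cost money but look unused or unowned."""
    return [
        row["resource"]
        for row in rows
        if row["cost"] > 0 and (row["usage"] == 0 or not row["labels"])
    ]


# Invented rows: resource name, cost, usage amount, and labels.
rows = [
    {"resource": "disk-old-1", "cost": 4.0, "usage": 0, "labels": {"team": "search"}},
    {"resource": "ip-unknown", "cost": 2.5, "usage": 10, "labels": {}},
    {"resource": "vm-prod-1", "cost": 90.0, "usage": 720, "labels": {"team": "ads"}},
    {"resource": "disk-free", "cost": 0.0, "usage": 0, "labels": {}},
]
print(find_orphan_candidates(rows))
```

The output is a list of candidates, not a deletion list: an unlabeled resource may still be in use, so a review step before deletion is prudent.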
6.2 Network Egress Analysis
Egress is often the most misunderstood cost.
- Internet Egress: Sending data to the public internet (Expensive).
- Cross-Region Egress: Sending data between GCP regions (Moderate).
- Zonal Egress: Sending data between zones in the same region (Small, but adds up).
- Tip: Use VPC Flow Logs joined with BigQuery billing to identify which specific service is driving high egress costs.
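When analyzing flow records, a first step is to bucket each flow into one of the egress categories listed above. The sketch below does that from source and destination locations. It assumes you have already resolved each endpoint to a region and zone (with `None` meaning the destination is outside GCP); the function name and category strings are hypothetical.

```python
def classify_egress(src_region, src_zone, dst_region=None, dst_zone=None):
    """Bucket a flow into an egress category based on where it ends up."""
    if dst_region is None:
        return "internet"        # destination is not a GCP location
    if src_region != dst_region:
        return "cross-region"
    if src_zone != dst_zone:
        return "cross-zone"      # same region, different zones
    return "same-zone"


print(classify_egress("us-central1", "us-central1-a"))
print(classify_egress("us-central1", "us-central1-a", "europe-west1", "europe-west1-b"))
print(classify_egress("us-central1", "us-central1-a", "us-central1", "us-central1-b"))
```

Summing bytes per category and per service is usually enough to show which traffic pattern is driving the bill.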
7. Interview Preparation
1. Q: What are “Committed Use Discounts” (CUDs) and how do they differ from “Sustained Use Discounts” (SUDs)?
A: SUDs are automatic; you get them just by running a VM for more than 25% of a month. CUDs require a commitment (1 or 3 years) but offer much deeper discounts (up to 70%). CUDs can be Resource-based (fixed vCPU/RAM in one region) or Flexible (spend-based, applying across multiple regions and products like Cloud Run and Spanner).
2. Q: Why is “Billing Export to BigQuery” considered a mandatory FinOps practice?
A: The standard Cloud Console only provides high-level views. Billing Export provides granular, per-resource, hourly cost data. By exporting to BigQuery, you can:
- Join costs with Labels to create accurate department-level chargebacks.
- Build custom dashboards in Looker Studio.
- Write SQL queries to detect “Cost Spikes” or “Zombie Resources” (idle disks/IPs) programmatically.
- VM Rightsizing: Suggesting a smaller machine type if CPU is low.
- Idle Resources: Identifying unattached Persistent Disks or unassigned Static IPs.
- CUD Recommendations: Identifying where a commitment would save money based on steady-state usage.
- Disable Billing for the project (shuts down all resources).
- Scale down GKE deployments to zero.
- Remove external IP addresses.