
Chapter 15: Infrastructure as Code - Terraform on GCP

In the modern cloud era, clicking through the web console is considered an “anti-pattern” for production systems. Infrastructure as Code (IaC) allows you to define your entire data center in text files. This ensures that your environments are reproducible, version-controlled, and auditable. In Google Cloud, Terraform is the undisputed king of IaC.

1. Why Terraform is the GCP Standard

While Google offers Deployment Manager (native) and Config Connector (Kubernetes-based), Terraform remains the preferred choice for most engineers.
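  • The Google Provider: The google and google-beta providers are developed by Google together with HashiCorp and are among the most comprehensive Terraform providers in existence. New GCP features frequently gain Terraform support at or shortly after launch, often in google-beta first.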
  • Immutable Infrastructure: Terraform encourages you to replace resources rather than patching them, which reduces “Configuration Drift.”
  • Plan and Apply: The terraform plan command acts as a safety net: it previews which resources Terraform intends to create, change, or destroy, so you can review the result before running terraform apply.

2. Advanced State Management

The terraform.tfstate file is the source of truth for your infrastructure. In production, you must manage this with extreme care.

The GCS Backend

Always store your state in a Cloud Storage bucket with the following configuration:
  • Versioning: Enable bucket versioning so you can recover from a corrupted state.
  • Locking: The gcs backend supports state locking natively, with no separate lock table to configure. This prevents two developers from running terraform apply at the same time and corrupting the state.
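
The bucket described above can itself be managed with Terraform. The following is a minimal sketch, assuming a small bootstrap configuration that is applied once with local state before the backend is switched to GCS. The bucket name is the placeholder used in this chapter's lab, and the project_id variable and the "US" location are assumptions for illustration.

```hcl
# bootstrap/main.tf (hypothetical one-off configuration for the state bucket)
variable "project_id" {
  type        = string
  description = "Existing project that will own the state bucket."
}

resource "google_storage_bucket" "tfstate" {
  name     = "my-company-tfstate" # bucket names are globally unique; placeholder
  project  = var.project_id
  location = "US" # assumed multi-region

  # Restrict access to IAM only and block public exposure of state data.
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  # Keep older generations of the state object so a corrupted
  # state can be recovered from a previous version.
  versioning {
    enabled = true
  }

  # Make Terraform reject any plan that would destroy this bucket.
  lifecycle {
    prevent_destroy = true
  }
}
```

Locking needs no extra resource here: it is handled by the gcs backend when other configurations point at this bucket.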

Structuring Environments

  • Separate State per Environment: Never use the same state file for dev and prod. If you accidentally delete your state in dev, you don’t want it to impact prod.
  • Workspaces vs. Directories: Most GCP experts prefer separate directories (e.g., environments/prod/, environments/dev/) over Terraform Workspaces for clearer isolation and variable management.
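
With the directory approach, each environment is its own root module with its own backend block. A minimal sketch, assuming the environments/dev/ and environments/prod/ layout mentioned above; the bucket and prefix values are placeholders, and some teams go further and use a separate bucket or project per environment.

```hcl
# environments/dev/backend.tf
terraform {
  backend "gcs" {
    bucket = "my-company-tfstate"
    prefix = "terraform/state/dev"
  }
}

# environments/prod/backend.tf (a separate directory and a separate state)
terraform {
  backend "gcs" {
    bucket = "my-company-tfstate"
    prefix = "terraform/state/prod"
  }
}
```

Because terraform init runs per directory, a mistake made while working in environments/dev/ cannot touch the state stored under the prod prefix.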

3. The Cloud Foundation Toolkit (CFT)

Instead of writing every resource from scratch, Google provides the Cloud Foundation Toolkit.
  • Best-Practice Modules: CFT is a set of open-source Terraform modules that implement Google’s best practices for VPCs, GKE clusters, Project factories, and more.
  • Opinionated Security: These modules come with “secure defaults,” like disabling public IPs for GKE nodes or enforcing encryption on buckets.

3.1 The Project Factory Module Deep Dive

The Project Factory is the most critical module in the CFT. It automates the creation of GCP projects while enforcing organization-level compliance.
What it Automates:
  • Project Creation: Handles the google_project resource.
  • Billing Linkage: Connects the project to the central billing account.
  • Service API Enablement: Enables a list of APIs (e.g., compute.googleapis.com, container.googleapis.com) automatically.
  • Shared VPC Attachment: Handles the complex handshake of attaching a project as a “Service Project” to a host VPC.
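  • Default Service Account Handling: (Security Best Practice) Deletes, disables, or deprivileges the default, over-privileged compute service account, depending on the module's default_service_account input.
  • Group IAM Bindings: Assigns standard IAM roles to Google Groups (managed in Google Workspace, formerly G Suite) for developers, auditors, and admins.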
Example Implementation:
module "project-factory" {
  source  = "terraform-google-modules/project-factory/google"
  version = "~> 14.0"

  name              = "prod-data-platform"
  random_project_id = true
  org_id            = var.org_id
  billing_account   = var.billing_account
  folder_id         = var.folder_id

  # Shared VPC Configuration
  svpc_host_project_id = "host-project-123"
  shared_vpc_subnets   = [
    "projects/host-project-123/regions/us-central1/subnets/data-subnet"
  ]

  # API Enablement
  activate_apis = [
    "compute.googleapis.com",
    "bigquery.googleapis.com",
    "storage-api.googleapis.com"
  ]
}
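
The example above references three input variables that it does not declare. The sketch below declares them and shows two small additions: making the default service account behaviour explicit, and exposing the generated project ID. The module label project_factory_hardened is a hypothetical name, and the list of accepted values for default_service_account reflects my understanding of the module rather than something stated in this chapter, so check it against the version you pin.

```hcl
variable "org_id" {
  type        = string
  description = "Numeric organization ID."
}

variable "billing_account" {
  type        = string
  description = "Billing account ID to link new projects to."
}

variable "folder_id" {
  type        = string
  description = "Folder that will contain the project."
}

module "project_factory_hardened" {
  source  = "terraform-google-modules/project-factory/google"
  version = "~> 14.0"

  name              = "prod-data-platform"
  random_project_id = true
  org_id            = var.org_id
  billing_account   = var.billing_account
  folder_id         = var.folder_id

  # Assumed accepted values: "delete", "deprivilege", "disable", "keep".
  default_service_account = "delete"
}

# With random_project_id = true the final ID is only known after apply,
# so expose it for other configurations to consume.
output "project_id" {
  value = module.project_factory_hardened.project_id
}
```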

4. Google Cloud Deploy: Managed CD

Once your infrastructure is provisioned, you need a way to deploy your code. Cloud Deploy is Google’s fully managed continuous delivery service for GKE, Cloud Run, and Anthos.

Key Concepts

  • Delivery Pipeline: Defines the progression of a release through different targets (e.g., dev → staging → prod).
  • Skaffold: Cloud Deploy uses Skaffold under the hood to decouple the build/deploy configuration from the pipeline definition.
  • Rollout Strategies:
    • Canary: Deploy to a small percentage of users first.
    • Blue/Green: Deploy a full new version and switch traffic instantly.
  • Approval Gates: You can require a manual “click” from a lead engineer before a release moves into the production target.
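
Cloud Deploy pipelines are usually written as YAML, but since this chapter is about Terraform, the same concepts can be sketched with the Google provider's Cloud Deploy resources. This is an illustration under assumptions: the clusters variable, the pipeline name, and the Skaffold profile names are hypothetical, and rollout strategies are left at their defaults.

```hcl
variable "project_id" {
  type = string
}

variable "clusters" {
  type        = map(string)
  description = "Full GKE cluster resource IDs; must contain the keys dev, staging, and prod."
}

# One target per environment; only prod requires a manual approval.
resource "google_clouddeploy_target" "env" {
  for_each = var.clusters

  project          = var.project_id
  location         = "us-central1"
  name             = each.key
  require_approval = each.key == "prod"

  gke {
    cluster = each.value # e.g. projects/PROJECT/locations/REGION/clusters/NAME
  }
}

# The pipeline fixes the promotion order: dev, then staging, then prod.
resource "google_clouddeploy_delivery_pipeline" "app" {
  project  = var.project_id
  location = "us-central1"
  name     = "app-pipeline"

  serial_pipeline {
    stages {
      target_id = google_clouddeploy_target.env["dev"].name
      profiles  = ["dev"] # Skaffold profile applied for this stage
    }
    stages {
      target_id = google_clouddeploy_target.env["staging"].name
      profiles  = ["staging"]
    }
    stages {
      target_id = google_clouddeploy_target.env["prod"].name
      profiles  = ["prod"]
    }
  }
}
```

The stages are listed explicitly rather than generated from the map, because map iteration is alphabetical and would put prod before staging.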

5. Policy as Code: Guarding the Pipeline

To prevent developers from accidentally creating insecure infrastructure (like a publicly readable Cloud Storage bucket), use Policy as Code.
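  • Terraform Validator: A tool that checks your terraform plan against your organization’s security policies before it is applied. The standalone terraform-validator project has since been succeeded by the gcloud beta terraform vet command, which serves the same purpose.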
  • Example Policy: “No VM can have an external IP address unless it has a specific tag.”

6. Config Connector: The K8s Alternative

For teams that are “all-in” on Kubernetes, Config Connector allows you to manage GCP resources using Kubernetes YAML.
  • The CRD Model: A Cloud SQL instance becomes a kind: SQLInstance object in your GKE cluster.
  • Reconciliation: Kubernetes’ controller loop constantly checks the state of your GCP resources and fixes any drift, just as it does for pods.

7. Advanced Terraform Patterns: Meta-Arguments and DRY Code

7.1 Lifecycle Meta-Arguments

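Terraform provides the lifecycle meta-argument, a nested block whose settings control how a resource is handled during an apply.
  • prevent_destroy: Essential for critical resources like production databases or DNS zones. While the setting is present, Terraform rejects any plan that would destroy the resource. It does not protect a resource whose entire block is removed from the code, because the setting is removed along with it.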
  • ignore_changes: Useful when certain attributes are managed by other tools (e.g., GKE node pool sizes managed by the autoscaler).
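
Both settings live inside a lifecycle block on the resource. A minimal sketch, assuming an existing GKE cluster whose name is passed in through a hypothetical cluster_name variable; the zone name, DNS name, and pool sizes are placeholders.

```hcl
variable "cluster_name" {
  type        = string
  description = "Name of an existing GKE cluster (assumed for this sketch)."
}

# A DNS zone that must never be destroyed by accident.
resource "google_dns_managed_zone" "prod" {
  name     = "prod-zone"
  dns_name = "prod.example.com."

  lifecycle {
    # Any plan that would destroy this zone fails with an error.
    prevent_destroy = true
  }
}

# A node pool whose size is owned by the cluster autoscaler.
resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = var.cluster_name
  location = "us-central1"

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  lifecycle {
    # The autoscaler changes the node count at runtime; without this,
    # the next plan would try to undo the autoscaler's work.
    ignore_changes = [initial_node_count, node_count]
  }
}
```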

7.2 Keeping Code DRY with Terragrunt

In a large-scale GCP environment, you often have dozens of nearly identical projects. Terragrunt is a thin wrapper that helps you:
  • Inherit configuration: Define your provider and backend once and inherit them across all projects.
  • Dependency management: Ensure your VPC is created before your GKE cluster.
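
The two points above can be sketched with a root terragrunt.hcl and one child. The directory layout, the project ID, and the module path are hypothetical, and recent Terragrunt releases recommend naming the root file explicitly in find_in_parent_folders, so treat this as an outline rather than a definitive setup.

```hcl
# terragrunt.hcl (repository root): backend and provider defined once
remote_state {
  backend = "gcs"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket = "my-company-tfstate"
    # Each child folder gets its own state prefix, e.g. "prod/gke".
    prefix = path_relative_to_include()
    # Used by Terragrunt only if it has to create the bucket (placeholders).
    project  = "my-company-admin-project"
    location = "us"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  region = "us-central1"
}
EOF
}

# prod/gke/terragrunt.hcl (child): inherits the root and depends on the VPC
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../modules/gke" # hypothetical local module
}

# Terragrunt applies ../vpc first and exposes its outputs here.
dependency "vpc" {
  config_path = "../vpc"
}

inputs = {
  network = dependency.vpc.outputs.network_name
}
```

The network_name output is assumed to come from a VPC module such as the CFT network module used in the lab later in this chapter.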

7.3 Google Terraform Validator

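This tool allows you to validate your terraform plan against your organization’s security policies. The policies are written as constraints in a policy library (the format introduced by Forseti Config Validator) and are evaluated against Cloud Asset Inventory (CAI) representations of the planned resources.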
  • Integration: Run it in your Cloud Build pipeline. If a developer tries to create a bucket without encryption, the build fails.

8. Interview Preparation

1. Q: Why is it critical to store the Terraform state file in a remote GCS backend? A: Storing state locally prevents collaboration and is a single point of failure. A GCS Backend provides:
  • Locking: Prevents two users from running apply simultaneously and corrupting the state.
  • Versioning: Allows you to roll back to a previous state if the current one is corrupted.
  • Security: State files often contain sensitive information (like DB passwords); GCS allows you to restrict access via IAM.
2. Q: Explain the concept of “Configuration Drift” and how Terraform handles it. A: Configuration Drift occurs when the real-world infrastructure no longer matches the code, typically because of manual changes in the console. Terraform detects this during the plan phase: it refreshes the state from the live environment, compares the result with your configuration, and proposes changes that bring the infrastructure back in line with the code. If the manual change was intentional, you update the code instead so that the next plan shows no differences.
3. Q: What are “Terraform Modules” and why are they used in enterprise environments? A: Modules are containers for multiple resources that are used together. They enable Code Reusability and Standardization. In an enterprise, you can create a “Standard VPC Module” that includes all necessary subnets, firewalls, and logging. Developers then use this module rather than writing networking code from scratch, ensuring compliance with company security standards.
4. Q: How does “Config Connector” differ from Terraform? A:
  • Terraform: Is a standalone CLI tool that uses HCL. The configuration is declarative, but execution is on demand: nothing changes until a person or a pipeline runs terraform apply.
  • Config Connector: Is a Kubernetes-native controller. You define GCP resources as K8s YAML files. The controller is constantly “reconciling”: if a resource is deleted in the console, Config Connector will automatically recreate it within minutes without any manual intervention.
5. Q: What is the “Cloud Foundation Toolkit” (CFT)? A: CFT is a collection of open-source Terraform modules maintained by Google. They are built to Google’s Best Practices for security and reliability. Instead of reinventing the wheel, architects use CFT modules for complex setups like “Project Factory” (automated project creation), “Networking,” and “GKE” to ensure their environment is enterprise-hardened from day one.

Implementation: The “Platform Engineer” Lab

Setting up a Production Terraform Backend

# backend.tf
terraform {
  backend "gcs" {
    bucket  = "my-company-tfstate"
    prefix  = "terraform/state/prod"
  }
}

# provider.tf
provider "google" {
  project = var.project_id
  region  = "us-central1"
}

# vpc.tf (Using a CFT module)
module "vpc" {
  source  = "terraform-google-modules/network/google"
  version = "~> 6.0"

  project_id   = var.project_id
  network_name = "prod-vpc"
  routing_mode = "GLOBAL"

  subnets = [
    {
      subnet_name   = "prod-subnet-01"
      subnet_ip     = "10.0.1.0/24"
      subnet_region = "us-central1"
    }
  ]
}

Pro-Tip: Terraform Graph

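When debugging slow Terraform runs, use terraform graph | dot -Tpng > graph.png (the dot command comes from Graphviz). This visualizes the dependency graph of your resources, helping you spot long dependency chains that force resources to be created one after another. It also makes it easier to trace the cycle errors that Terraform reports when resources depend on each other in a loop.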