Chapter 5: Virtual Machines at Scale - Compute Engine

Compute Engine (GCE) is Google Cloud’s Infrastructure-as-a-Service (IaaS) flagship. While it looks like standard VMs on the surface, its architecture is a marvel of custom hardware and software integration. From the Titan security chip to the Colossus file system, GCE is designed for consistency, security, and performance at massive scale.

1. Under the Hood: The GCE Infrastructure

1.1 The Titan Security Chip

Every physical server in Google’s data centers contains a Titan chip—a custom hardware root of trust.
  • Secure Boot: Titan verifies the integrity of the BIOS and OS bootloaders before the machine is allowed to join the network.
  • Identity: It provides a unique, cryptographically verifiable identity to the hardware, preventing “insider threat” hardware tampering.

1.2 Virtualization: KVM and the "Hypervisor"

Google uses a heavily modified version of KVM (Kernel-based Virtual Machine).
  • No Overcommit: Unlike some cloud providers, GCP does not overcommit CPU resources on standard machine types. If you buy 4 vCPUs, you get 4 physical threads of execution.
  • Live Migration: This is GCE’s “killer feature.” When Google needs to perform hardware maintenance or update a host kernel, it moves your running VM to a new host without a reboot or noticeable downtime.
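
Host-maintenance behavior is controlled per instance. A minimal sketch of pinning it explicitly at creation time (instance name is illustrative; MIGRATE is already the default on most machine types):

```bash
# Keep the VM running through host maintenance via live migration
gcloud compute instances create web-vm \
    --zone=us-central1-a \
    --maintenance-policy=MIGRATE

# GPU and some specialized instances must instead use:
#   --maintenance-policy=TERMINATE
```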

1.3 The Life of a Packet (Andromeda SDN)

When a VM sends a packet, it doesn’t just “go out.” It undergoes a complex transformation:
  1. vNIC Interception: The packet is intercepted by the virtual NIC.
  2. Andromeda Encapsulation: Google’s SDN, Andromeda, encapsulates the packet (usually in a custom GRE or VXLAN-like header).
  3. Flow Programming: Andromeda checks if this flow is known. If not, it consults the central controller to program the host’s Open vSwitch (OVS).
  4. Hardware Offload: On modern instances, this encapsulation is offloaded to custom ASICs, ensuring that the host CPU isn’t wasted on networking overhead.

2. Machine Families: The Right Tool for the Job

GCP categorizes VMs into families, each backed by specific hardware platforms and optimized for a particular class of workload.

2.1 Custom Machine Types

A unique GCP feature: if a predefined machine (like n2-standard-4) doesn’t fit your needs, you can create a Custom Machine Type.
  • The Rule: You must follow the supported CPU-to-Memory ratios (typically 1 vCPU to 1GB–6.5GB of RAM).
  • SRE Tip: Use custom machines to match your app’s exact profile (e.g., an app that needs 10 vCPUs but only 12GB of RAM). This saves ~15% compared to paying for the unused RAM in a larger standard instance.
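
A sketch of the SRE tip above as a gcloud command (instance name is illustrative):

```bash
# Create an N2-based custom machine: 10 vCPUs, 12 GB RAM
gcloud compute instances create custom-app-vm \
    --zone=us-central1-a \
    --custom-vm-type=n2 \
    --custom-cpu=10 \
    --custom-memory=12GB
```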

2.2 General Purpose (Balanced)

  • E2: Cost-optimized, uses dynamic resource scheduling. Good for small apps and dev environments. No local SSD support.
  • N2 / N2D: The workhorses. N2 uses Intel (Ice Lake), N2D uses AMD (EPYC). Best for web servers, enterprise apps, and databases.
  • Tau T2D: Google’s best price-performance for scale-out workloads. Uses AMD EPYC processors.

Tau T2D: The Price-Performance Leader

Tau VMs are specifically designed for "scale-out" workloads (web servers, microservices, media transcoding). Performance benchmark (estimated):

| Metric | N2 (Standard) | Tau T2D | % Difference |
| --- | --- | --- | --- |
| Price per vCPU/hr | $0.048 | $0.038 | ~20% cheaper |
| SPECrate®2017_int | 100 | 142 | ~40% higher |
| Overall value | Baseline | +60% | Massive |
Note: Tau T2D does not support Local SSD or GPUs, as it is strictly optimized for compute efficiency.

2.3 Compute Optimized (High Frequency)

  • C2 / C2D: Optimized for single-threaded performance and memory bandwidth. Features high clock speeds (up to 3.8+ GHz).
  • C3: The first machine type powered by Intel Sapphire Rapids and Google’s custom IPU (Infrastructure Processing Unit). This offloads storage and networking entirely, leaving the CPU 100% available for your code.

2.4 Accelerator Optimized (GPU & TPU)

  • A2 / A3: Designed specifically for NVIDIA A100 and H100 GPUs.
  • TPU v4/v5: Google’s custom AI silicon.
    • ICI (Inter-Chip Interconnect): TPU pods are connected by a dedicated high-speed 2D/3D torus network, bypassing the standard data center network to achieve ultra-low latency for model weight synchronization.

3. Advanced VM Management: Spot VMs and Preemption

3.1 Spot VMs (The 91% Discount)

Spot VMs are excess Google capacity offered at a massive discount (60-91%).
  • The Catch: They can be reclaimed by Google at any time with only a 30-second notice.
  • The Strategy: Use them for fault-tolerant, stateless workloads (e.g., batch processing, rendering, stateless web workers).
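
Provisioning one is a single flag. A minimal sketch (instance name is illustrative):

```bash
# Request a Spot VM; GCE may reclaim it with ~30 seconds' notice
gcloud compute instances create batch-worker \
    --zone=us-central1-a \
    --provisioning-model=SPOT \
    --instance-termination-action=DELETE
```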

3.2 Handling the Termination Signal

In a production Spot environment, your app MUST handle the termination signal. Example (Python termination handler):

```python
import signal
import sys
import time

def handle_termination(sig, frame):
    """Invoked when GCE sends SIGTERM ~30 seconds before preemption."""
    print("Received termination signal! Checkpointing work...")
    # 1. Flush logs
    # 2. Upload current progress to GCS
    # 3. Gracefully shut down
    sys.exit(0)

# Listen for the SIGTERM signal sent by GCE
signal.signal(signal.SIGTERM, handle_termination)

# Keep the worker alive; replace the sleep with real batch work
while True:
    time.sleep(1)
```
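
Beyond SIGTERM, the instance can also poll the metadata server to detect preemption, and the handler can be registered as a shutdown script. A sketch of both, assuming the handler is saved as handler.sh (file and instance names are illustrative):

```bash
# Register the handler as a shutdown script at creation time
gcloud compute instances create batch-worker \
    --zone=us-central1-a \
    --provisioning-model=SPOT \
    --metadata-from-file=shutdown-script=handler.sh

# From inside the VM: returns TRUE once the instance has been preempted
curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
```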

4. Special Use Cases: Nested Virtualization and Argos VCU

4.1 Nested Virtualization

GCP allows you to run a hypervisor (like KVM or VMware) inside a GCE VM.
  • Use Case: Dev/Test environments for specialized software that requires its own kernel modules or proprietary virtualization.
  • Requirement: The VM must run on a CPU platform of Intel Haswell or newer, and nested virtualization must be enabled on the instance (or baked into the image via a special license).
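
A sketch of the flag-based approach (the older method attaches a special license to the image instead):

```bash
# Create a VM with nested virtualization enabled (requires Haswell+)
gcloud compute instances create nested-host \
    --zone=us-central1-a \
    --min-cpu-platform="Intel Haswell" \
    --enable-nested-virtualization
```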

4.2 Argos VCU (Video Compression Units)

Google uses custom ASICs called Argos for YouTube and Google Meet. These are now being exposed to GCE users for massive-scale video transcoding, offering 20x-40x better performance/watt than standard CPUs.

5. Deployment Artifacts: Custom Images vs. Machine Images

To achieve fast boot times in a managed instance group (MIG, covered in the next section), you must bake your application into a bootable artifact.

5.1 Custom Images (The Standard)

A custom image is a disk image that includes the OS, your application, and dependencies.
  • Scope: Single disk (the boot disk).
  • Best For: Instance Templates, MIGs, and sharing base OS builds across the organization.
  • Internals: Stored as compressed tarballs in a hidden GCS bucket. When you boot, GCP uses streaming to start the VM before the entire image is even copied to the PD.
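
A sketch of capturing one from a prepared boot disk (names are illustrative):

```bash
# Capture a custom image from a builder VM's boot disk
gcloud compute images create web-app-v1 \
    --source-disk=builder-vm \
    --source-disk-zone=us-central1-a \
    --family=web-app
```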

5.2 Machine Images (The Comprehensive Backup)

A machine image is a more “complete” artifact than a custom image.
  • Scope: Captures all disks (boot + data), metadata, labels, and the instance configuration.
  • Best For: Cloning entire environments or creating consistent backups of complex multi-disk servers.
  • Internals: Uses differential compression. If you create multiple machine images of the same VM, only the changed blocks are stored.
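
A sketch of capturing one from a running instance (names are illustrative):

```bash
# Capture all disks + configuration of an existing VM
gcloud compute machine-images create db-server-backup \
    --source-instance=db-server \
    --source-instance-zone=us-central1-a
```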

6. Managed Instance Groups (MIGs)

A MIG is a collection of identical VM instances that you control as a single entity. It provides the automation layer for high availability.

The Auto-Healing Loop

  1. Health Check: You define a health check (e.g., “Is port 8080 responding?”).
  2. Detection: If an instance fails the check, the MIG marks it as unhealthy.
  3. Recreation: The MIG deletes the unhealthy instance and creates a fresh one from the Instance Template.

Autoscaling Strategies

  • CPU Utilization: Scale when average CPU across the group exceeds X%.
  • Load Balancing Capacity: Scale based on the number of requests per second reaching the LB.
  • Cloud Monitoring Metrics: Scale based on custom metrics (e.g., number of messages in a Pub/Sub queue).
  • Predictive Autoscaling: Google uses ML to predict upcoming traffic spikes (based on historical data) and starts scaling before the traffic arrives.
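
A sketch of attaching an autoscaler to a regional MIG, combining CPU targeting with the predictive method (names and thresholds are illustrative):

```bash
gcloud compute instance-groups managed set-autoscaling prod-web-mig \
    --region=us-central1 \
    --min-num-replicas=3 \
    --max-num-replicas=20 \
    --target-cpu-utilization=0.65 \
    --cpu-utilization-predictive-method=optimize-availability
```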

Rolling Updates and Blue/Green

  • Max Surge: How many extra instances can be created during an update (e.g., maxSurge=3 means 3 new VMs are built before deleting old ones).
  • Max Unavailable: How many instances can be offline during an update.
  • Canary Updates: Roll out a new version to only 10% of the group to test for errors before full deployment.
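
A sketch of both update patterns (template names are illustrative):

```bash
# Rolling update: build 3 new VMs before deleting old ones
gcloud compute instance-groups managed rolling-action start-update prod-web-mig \
    --region=us-central1 \
    --version=template=prod-web-template-v2 \
    --max-surge=3 \
    --max-unavailable=0

# Canary: send v2 to ~10% of the group, keep v1 on the rest
gcloud compute instance-groups managed rolling-action start-update prod-web-mig \
    --region=us-central1 \
    --version=template=prod-web-template \
    --canary-version=template=prod-web-template-v2,target-size=10%
```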

7. Sole-Tenant Nodes: Hardware Isolation

For workloads that require physical isolation (compliance) or specific licensing (BYOL), GCP offers Sole-Tenant Nodes.
  • Physical Host Reservation: You rent the entire physical server. No “noisy neighbors.”
  • Node Groups: Group these hosts and apply placement policies (e.g., ensure two VMs never run on the same physical rack).
  • Overcommit (Internal): You can overcommit CPUs on your own sole-tenant nodes to save money if you know your workloads are bursty.
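
Provisioning follows a three-step flow: node template → node group → VM. A minimal sketch (node type and names are illustrative):

```bash
# 1. Define the physical host shape
gcloud compute sole-tenancy node-templates create secure-template \
    --node-type=n2-node-80-640 \
    --region=us-central1

# 2. Reserve the physical hosts
gcloud compute sole-tenancy node-groups create secure-group \
    --node-template=secure-template \
    --target-size=1 \
    --zone=us-central1-a

# 3. Place a VM on your dedicated hardware
gcloud compute instances create isolated-vm \
    --node-group=secure-group \
    --zone=us-central1-a
```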

8. Advanced Security: Shielded & Confidential VMs

Shielded VMs

A suite of security features that protect your VMs from boot-level malware (rootkits).
  • vTPM (Virtual Trusted Platform Module): Stores keys and secrets securely.
  • Integrity Monitoring: Alerts you if the boot state of the VM has changed from its known “good” state.

Confidential Computing

Confidential VMs use AMD SEV (Secure Encrypted Virtualization) to encrypt data in use (while it is in RAM).
  • The Key: The encryption key is generated by the AMD hardware and is never accessible to Google or the host OS.
  • Use Case: Processing highly sensitive data (PII, financial records, medical data) where you don’t even trust the cloud provider.
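
A sketch of launching one; Confidential VMs require an AMD-based family such as N2D (names are illustrative):

```bash
gcloud compute instances create confidential-vm \
    --zone=us-central1-a \
    --machine-type=n2d-standard-2 \
    --confidential-compute \
    --maintenance-policy=TERMINATE
```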

9. Storage Architecture: Persistent Disks (PD)

Persistent Disks are not local drives; they are network-attached storage distributed across the Colossus file system.
  • Standard PD (pd-standard): HDD-backed, sequential IO.
  • Balanced PD (pd-balanced): SSD-backed, best for general enterprise apps.
  • Performance SSD (pd-ssd): High IOPS, low latency.
  • Extreme PD (pd-extreme): For the most demanding databases (up to 100k IOPS).
  • Local SSD: Physically attached to the host. Fast but ephemeral (data is lost if the instance is deleted).

Regional PD: The High Availability King

Regional PDs synchronously replicate data across two zones in the same region. If an entire zone fails, you can force-attach the disk to a VM in the second zone with zero data loss.
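
A sketch of creating a regional disk and force-attaching it after a zone outage (names are illustrative):

```bash
# Create a disk replicated across two zones
gcloud compute disks create ha-data-disk \
    --region=us-central1 \
    --replica-zones=us-central1-a,us-central1-b \
    --size=200GB \
    --type=pd-ssd

# After a zone failure: force-attach to a standby VM in the surviving zone
gcloud compute instances attach-disk standby-vm \
    --zone=us-central1-b \
    --disk=ha-data-disk \
    --disk-scope=regional \
    --force-attach
```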

10. Networking: gVNIC and Tiered Networking

gVNIC (Google Virtual NIC)

A modern device driver designed for high-throughput, low-latency networking. It is required for many high-performance machine types (like C2) to reach 50 Gbps+ throughput.
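
gVNIC is selected per network interface at creation time; the guest image must ship the gve driver (modern Google-provided images do). A minimal sketch:

```bash
gcloud compute instances create fast-net-vm \
    --zone=us-central1-a \
    --machine-type=c2-standard-8 \
    --network-interface=nic-type=GVNIC \
    --image-family=debian-12 \
    --image-project=debian-cloud
```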

Network Service Tiers

  • Premium Tier (Default): Traffic stays on Google’s private global fiber network for as long as possible. Best performance.
  • Standard Tier: Traffic exits Google’s network at the nearest PoP and travels over the public internet. Cheaper, but higher latency.
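
The tier can be chosen per instance or set as a project-wide default. A sketch of both:

```bash
# Per-instance: serve this VM's external traffic over the Standard Tier
gcloud compute instances create budget-vm \
    --zone=us-central1-a \
    --network-tier=STANDARD

# Project-wide default for new resources
gcloud compute project-info update --default-network-tier=STANDARD
```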

11. Advanced Instance Management: Resource Policies and Schedules

11.1 Snapshot Schedules

Never rely on manual backups. Resource policies allow you to automate data protection.
  • Schedule: Daily, weekly, or hourly.
  • Retention: Define how many days/weeks of snapshots to keep.
  • Consistency: Use Application-Consistent Snapshots for databases (requires the guest agent to freeze the file system briefly).
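
A sketch of a daily schedule with 14-day retention, attached to an existing disk (names and times are illustrative):

```bash
# Daily snapshot at 04:00 UTC, keep 14 days
gcloud compute resource-policies create snapshot-schedule daily-backup \
    --region=us-central1 \
    --daily-schedule \
    --start-time=04:00 \
    --max-retention-days=14

# Attach the policy to an existing disk
gcloud compute disks add-resource-policies prod-db-disk \
    --resource-policies=daily-backup \
    --zone=us-central1-a
```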

11.2 Instance Schedules (Cost Savings)

For non-production environments, use Instance Schedules to automatically start and stop VMs.
  • Scenario: Turn off dev servers at 6:00 PM and start them at 8:00 AM on weekdays.
  • Benefit: Reduces costs by ~65% for dev/test environments.
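
A sketch of that exact scenario (timezone and names are illustrative):

```bash
# Start at 08:00, stop at 18:00, Monday-Friday
gcloud compute resource-policies create instance-schedule dev-hours \
    --region=us-central1 \
    --timezone=America/New_York \
    --vm-start-schedule="0 8 * * MON-FRI" \
    --vm-stop-schedule="0 18 * * MON-FRI"

# Attach to a dev VM
gcloud compute instances add-resource-policies dev-vm \
    --resource-policies=dev-hours \
    --zone=us-central1-a
```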

11.3 Placement Policies

Control where your VMs are physically located relative to each other:
  • Spread: VMs are placed on different physical racks (reduces correlated failure risk).
  • Compact: VMs are placed as close as possible (reduces latency for HPC/clustered workloads).
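
A sketch of each policy type (counts are illustrative):

```bash
# Spread: place VMs across 3 separate availability domains (racks)
gcloud compute resource-policies create group-placement spread-policy \
    --region=us-central1 \
    --availability-domain-count=3

# Compact: pack up to 4 VMs close together for low latency
gcloud compute resource-policies create group-placement compact-policy \
    --region=us-central1 \
    --collocation=collocated \
    --vm-count=4
```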

12. Implementation: Pro-Level VM Configuration

When creating a production VM, never just “click through” the console. Follow these best practices:
  1. Use Service Accounts: Never use the default “Compute Engine Service Account” with broad permissions. Create a custom SA with the least privilege.
  2. Enable Shielded VM: It should be the default for all workloads.
  3. Optimize Boot Time: Use Custom Images with your application pre-installed rather than relying on heavy startup scripts.
  4. Tag for Firewalls: Use Network Tags to apply firewall rules dynamically (e.g., all VMs with the web-server tag allow port 80).
  5. Metadata and Labelling: Use labels for cost tracking (e.g., env=prod, team=billing).

Lab: Creating a Resilient MIG with gcloud

```bash
# 1. Create a custom service account
gcloud iam service-accounts create web-worker-sa

# 2. Create an instance template with Shielded VM features
gcloud compute instance-templates create prod-web-template \
    --machine-type=n2-standard-2 \
    --service-account=web-worker-sa@YOUR_PROJECT.iam.gserviceaccount.com \
    --scopes=cloud-platform \
    --shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --image-family=debian-11 \
    --image-project=debian-cloud

# 3. Create a regional MIG (across 3 zones)
gcloud compute instance-groups managed create prod-web-mig \
    --template=prod-web-template \
    --size=3 \
    --region=us-central1 \
    --distribution-policy-zones=us-central1-a,us-central1-b,us-central1-c

# 4. Set auto-healing
gcloud compute health-checks create http web-hc --port=80
gcloud compute instance-groups managed set-autohealing prod-web-mig \
    --health-check=web-hc \
    --initial-delay=300 \
    --region=us-central1
```

---

## 13. Interview Preparation

<AccordionGroup>
  <Accordion title="Q1: Explain 'Live Migration' in Compute Engine and why it's a significant engineering achievement.">
    **Answer:** Live Migration is the ability to move a running VM from one physical host to another without rebooting the VM or causing noticeable downtime.
    
    *   **Engineering Challenge:** It requires transferring the entire state of the VM (RAM, CPU registers, device state) while the VM is still executing instructions.
    *   **Google's Approach:** Google uses a pre-copy mechanism where it streams RAM pages to the new host while the VM is running. It then performs a very brief "brownout" (typically <50ms) to transfer the final delta and switch execution.
    *   **Value:** It allows Google to perform host hardware maintenance, kernel patches, and infrastructure upgrades without affecting customer workloads, maintaining high availability for "always-on" systems.
  </Accordion>

  <Accordion title="Q2: When would you use a Sole-Tenant Node instead of a standard multi-tenant VM?">
    **Answer:** I would choose Sole-Tenant Nodes for three main reasons:
    
    1.  **Security & Compliance:** When regulatory requirements (like HIPAA or PCI-DSS) mandate physical hardware isolation from other customers.
    2.  **Licensing (BYOL):** When you have existing software licenses tied to physical cores or processors (e.g., certain Windows or Oracle licenses).
    3.  **Performance Consistency:** To eliminate the "noisy neighbor" effect entirely by ensuring your VMs are the only ones using the physical CPU cache and memory controllers on that host.
  </Accordion>

  <Accordion title="Q3: How does 'Confidential Computing' protect data while it's being processed in RAM?">
    **Answer:** Confidential VMs use **AMD SEV (Secure Encrypted Virtualization)** to encrypt data *in use*.
    
    *   **Mechanism:** The hardware generates a unique encryption key for each VM. This key is stored in the processor itself and is never accessible to the host OS, the hypervisor, or Google employees.
    *   **In-Memory Encryption:** All data written to RAM by the VM is encrypted by the CPU. Even if someone were to physically scan the RAM sticks or compromise the hypervisor, they would only see encrypted ciphertext.
    *   **Use Case:** Highly sensitive data processing (e.g., financial transactions, genomic research) where the data owner does not trust the infrastructure provider.
  </Accordion>

  <Accordion title="Q4: Compare Regional Persistent Disks (Regional PD) with standard Zonal PDs. What are the trade-offs?">
    **Answer:** 
    
    *   **Regional PD:** Synchronously replicates data across **two zones** in the same region.
        *   *Pro:* High Availability. If a zone fails, you can force-attach the disk to a VM in the second zone with zero data loss (RPO=0).
        *   *Con:* Higher write latency (since data must be written to two locations) and higher cost.
    *   **Zonal PD:** Stores data in a single zone.
        *   *Pro:* Lowest latency and lowest cost.
        *   *Con:* If the zone fails, the disk is inaccessible until the zone is restored.
    
    **Interview Insight:** For stateful workloads like databases where you can't afford any data loss, Regional PD is the preferred architectural choice.
  </Accordion>

  <Accordion title="Q5: What are 'Shielded VMs' and what specific threats do they mitigate?">
    **Answer:** Shielded VMs are a set of security features that protect your VMs from **rootkits and boot-level malware**.
    
    *   **Secure Boot:** Uses a hardware root of trust (Titan chip) to verify the digital signature of the firmware and OS bootloader.
    *   **vTPM:** A virtualized Trusted Platform Module that securely stores keys and provides measured boot capability.
    *   **Integrity Monitoring:** Continuously compares the current boot state with a known "good" baseline and alerts you if there is a mismatch.
    *   **Threats Mitigated:** Unauthorized kernel modifications, BIOS tampering, and "Man-in-the-Middle" attacks on the boot process.
  </Accordion>
</AccordionGroup>