
Chapter 1: GCP Fundamentals & Architecture

Google Cloud Platform (GCP) isn’t just a collection of rented servers. It is a massive, global-scale distributed system built on over two decades of engineering innovation. To be a GCP Engineer, you must understand the “under-the-hood” architecture that makes Google’s cloud unique.

1. Google’s Physical Infrastructure: The Global Network

Most cloud providers rent space in third-party data centers. Google, however, builds its own data centers and, more importantly, its own fiber optic network.

1.1 The Network Advantage

  • Jupiter Network Fabric: Inside Google’s data centers, the “Jupiter” network provides 1.3 Petabits per second of total bisection bandwidth. This allows every server in a data center to talk to any other server at full speed, as if they were on the same switch.
  • Andromeda (Software Defined Network): This is the “brain” that manages the network. It handles everything from load balancing to firewalls without needing dedicated hardware appliances.
  • B4 Global Network: Google’s private global backbone. When you send data from a VM in New York to a VM in London, it stays on Google’s private fiber, bypassing the public internet entirely.

B4 vs. The Public Internet: The Latency Reality

While the public internet relies on unpredictable BGP routing through dozens of intermediate ISPs, B4 uses centralized traffic engineering to optimize for the shortest path.
Route            | Standard Internet (Estimated) | Google B4 Backbone | Improvement
NYC to London    | 85ms - 110ms                  | 68ms - 74ms        | ~25%
Tokyo to Sydney  | 140ms - 180ms                 | 105ms - 115ms      | ~35%
Sao Paulo to NYC | 130ms - 160ms                 | 100ms - 110ms      | ~30%
Note: These numbers represent Round-Trip Time (RTT) and can vary based on solar flares, undersea cable conditions, and current traffic engineering policies.
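If you want a rough feel for backbone latency yourself, one hedged approach is to stand up two small VMs in different regions and ping between their internal IPs, which keeps the traffic on Google’s network. The instance names, zones, and machine type below are illustrative, and the default VPC’s default-allow-icmp rule is assumed to be in place:
# Two throwaway probe VMs in roughly NYC-adjacent and London regions (names are hypothetical)
gcloud compute instances create probe-nyc --zone=us-east4-a --machine-type=e2-micro
gcloud compute instances create probe-lon --zone=europe-west2-a --machine-type=e2-micro

# Ping the London probe's internal IP from the NYC probe; internal traffic rides Google's backbone
gcloud compute ssh probe-nyc --zone=us-east4-a --command="ping -c 10 [INTERNAL_IP_OF_probe-lon]"

# Clean up when done
gcloud compute instances delete probe-nyc --zone=us-east4-a --quiet
gcloud compute instances delete probe-lon --zone=europe-west2-a --quiet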

1.2 Regions and Zones: Designing for Failure

  • Region: A geographical area (e.g., us-east1 in South Carolina).
  • Zone: An isolated failure domain within a region. Think of a zone as one or more physical data centers.
  • Low Latency: Zones in the same region are connected by high-speed networking with low single-digit-millisecond (often sub-millisecond) round-trip latency.
SRE Best Practice:
“Everything fails, all the time.”
To protect against a single data center failing (e.g., due to a power outage), deploy your application across at least two zones in the same region (multi-zonal high availability). To protect against an entire region failing (e.g., a natural disaster), deploy across multiple regions (multi-region disaster recovery).
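To see how a region decomposes into zones (and what quota you hold there), the commands below are a minimal sketch using us-east1 as the example region:
# List the zones that make up a region
gcloud compute zones list --filter="region:us-east1"

# Describe the region to see its zones, status, and per-region quotas
gcloud compute regions describe us-east1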

1.3 Choosing Regions and Zones (Real-World Considerations)

When selecting regions and zones, consider:
  • Latency to users: Place workloads close to your primary user base.
  • Data residency: Some industries require data to stay in specific countries.
  • Available services: Not all services or machine types are in every region.
  • Cost: Pricing can vary slightly between regions.
Example pattern:
  • Latency‑sensitive frontends in europe-west1.
  • Batch/analytics workloads in us-central1 (often cheaper and well connected).
  • DR site in a different continent (asia-southeast1).
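Before committing to a region, it is worth verifying availability from the CLI. A small sketch (the region, zone, and machine type are just examples):
# Which regions are currently available to your project?
gcloud compute regions list --format="table(name,status)"

# Is a specific machine type offered in the zone you plan to use?
gcloud compute machine-types list --filter="zone:europe-west1-b AND name=e2-standard-4"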

1.4 Hardware Security: The Titan Chip

Google doesn’t trust third-party hardware entirely. Every server in a Google data center includes a custom-designed hardware chip called Titan.

The Root of Trust

Titan is a low-power microcontroller designed to ensure that a machine boots from a known-good state.
  • Secure Boot: Titan verifies the first stage of the bootloader. If the signature is invalid, the machine will not boot.
  • Integrity Monitoring: It continuously monitors the firmware and BIOS for any signs of tampering.
  • Identity: Titan provides a cryptographically strong identity to each machine, which is used for service-to-service authentication (ALTS).

1.5 The Jupiter Network: Inside the Data Center

While B4 connects data centers, Jupiter is the network inside them.

Clos Topology and Bisection Bandwidth

Jupiter uses a Clos topology, a non-blocking, multi-stage switching design originally developed for telephone networks.
  • Total Throughput: 1.3 Petabits per second (Pbps) of bisection bandwidth.
  • Why it matters: In traditional networks, traffic “oversubscribes” the core switches, leading to bottlenecks. In Jupiter, any server can talk to any other server at full 10Gbps/100Gbps speed without congestion.
  • Optical Circuit Switching (OCS): Google uses MEMS-based optical switches to dynamically reconfigure the network topology without manual cabling.

1.6 Andromeda: The SDN Brain

Andromeda is Google’s Software-Defined Networking (SDN) stack. It is the virtualization layer that makes VPCs possible.

Control Plane vs. Data Plane

  • The Control Plane (Centralized): Andromeda’s control plane manages the configuration of millions of virtual endpoints. It computes the shortest path and pushes flow rules to the hosts.
  • The Data Plane (Distributed): The actual packet processing happens on the GCE hosts. A per-host dataplane handles encapsulation (encap/decap), firewalls, and load balancing in software, often leveraging specialized NIC features; low-bandwidth flows can be offloaded to shared “Hoverboard” gateways so each host only programs flow rules for its high-traffic peers.
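From the customer side, this is why a VPC firewall rule is pure configuration: you declare intent, Andromeda’s control plane distributes it, and the hosts enforce it in software. A hedged example (the rule name, network, and ranges are illustrative):
# Declarative firewall rule; no hardware appliance is provisioned, the hosts enforce it
gcloud compute firewall-rules create allow-internal-web \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:80,tcp:443 \
    --source-ranges=10.0.0.0/8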

1.7 Colossus: The Planet-Scale File System

All GCP storage services (Cloud Storage, Persistent Disk, BigQuery) are built on top of Colossus, the successor to the original Google File System (GFS).

Distributed Storage Architecture

  • D-Nodes: The storage servers that hold the data chunks.
  • Curators: Metadata managers that handle replication, recovery, and garbage collection.
  • Reed-Solomon Encoding: Instead of simple replication (which is expensive), Colossus uses erasure coding. It breaks data into k data chunks and m parity chunks; even if multiple disks fail, the data can be reconstructed (a worked example follows this list).
  • Scalability: Colossus handles exabytes of data across millions of disks without a single point of failure.
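A rough worked example (the chunk counts are hypothetical, not Colossus’s actual parameters): with k = 6 data chunks and m = 3 parity chunks per stripe,
  • Storage overhead = (k + m) / k = 9 / 6 = 1.5x raw capacity, versus 3.0x for three-way replication.
  • Durability: any 3 of the 9 chunks can be lost and the stripe is still fully reconstructable.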

1.8 Google’s Custom Hardware: The TPU and Custom Silicon

Google’s scale allows it to design its own silicon, optimizing for specific workloads like Artificial Intelligence and Video Transcoding.

Tensor Processing Units (TPUs)

TPUs are Google’s custom-developed ASICs (Application-Specific Integrated Circuits) used to accelerate machine learning workloads.
  • TPU v4/v5: These are the latest generations, featuring high-bandwidth memory (HBM) and specialized interconnects that allow thousands of TPUs to work together as a single supercomputer (TPU Pods).
  • Architecture: TPUs use a Matrix Multiplication Unit (MXU) that can process thousands of operations in a single clock cycle, significantly outperforming general-purpose GPUs for large-scale training.
  • Networking: TPU Pods use a specialized, low-latency topology (e.g., a 3D torus) to ensure that the data bottleneck isn’t the network.

Argos: The VCU (Video Coding Unit)

Argos is a custom chip designed to handle the massive video transcoding requirements of YouTube.
  • Efficiency: It is 20-30x more efficient than traditional CPUs for video processing.
  • Impact: By offloading video transcoding to Argos, Google frees up millions of CPU cores for other cloud tasks.

1.9 Planet-Scale Engineering: Borg, Colossus, and Spanner

The services you use in GCP are the externalized versions of the tools Google uses to run its own business.

Borg: The Predecessor to Kubernetes

Borg is Google’s internal cluster manager. It handles hundreds of thousands of jobs, across many thousands of machines, in a multitude of clusters.
  • Lessons Learned: Kubernetes was designed based on the 15+ years of experience Google had running Borg. Concepts like Pods, Services, and Labels all originated in Borg.

The “Global Consistency” Challenge

In a traditional distributed system, a network partition forces a choice between Availability and Consistency (the CAP theorem). Google’s engineers pushed this trade-off to its practical limit with Cloud Spanner, which is technically a CP system but delivers availability high enough to be treated as “effectively CA.”
  • The Secret: As discussed in Chapter 7, Spanner uses TrueTime (GPS + atomic clocks) to bound clock uncertainty across the fleet to just a few milliseconds. This makes externally consistent, globally distributed transactions practical, something previously considered infeasible at this scale.

1.10 The Life of a Packet: From User to TPU

Understanding how a request moves through Google’s infrastructure is key to optimizing performance.
  1. Anycast Entry: A user’s browser resolves api.google.com to an Anycast IP address. The request is routed via BGP to the physically closest Google Edge Point of Presence (PoP).
  2. Edge Termination: The Google Front End (GFE) terminates the TCP and TLS connections. If the request is for a cached asset, Cloud CDN serves it immediately.
  3. Backbone Transit: If it’s a dynamic request, the GFE proxies it over the B4 private backbone. The packet is encapsulated using Google’s proprietary protocol and sent at near-light speeds across the globe.
  4. Cluster Entry: The packet arrives at a data center and is unencapsulated. It hits a Maglev load balancer, which uses consistent hashing to select a healthy backend server.
  5. Andromeda Delivery: The Andromeda SDN identifies the target virtual machine (VM) and delivers the packet directly to the host’s virtual NIC (vNIC).
  6. Application Logic: The code running on GCE or GKE processes the request. It might call a database (Spanner) or an AI model (running on a TPU).
  7. Titan Verification: Every step of this compute process is secured by Titan chips, ensuring that the firmware and OS haven’t been tampered with.

1.11 Data Center Design: Power and Cooling at Scale

Google’s data centers are some of the most efficient in the world, achieving a Power Usage Effectiveness (PUE) of ~1.1 (where 1.0 is perfect efficiency).

Evaporative Cooling

Most data centers use massive air conditioners. Google uses evaporative cooling (or “swamp coolers”).
  • Process: Hot air from the servers is passed through water-soaked pads. The evaporation of the water cools the air, which is then recycled back to the servers.
  • Efficiency: This approach uses roughly 10% of the energy of traditional chillers.

Custom UPS (Uninterruptible Power Supply)

Traditional data centers use large, centralized UPS systems. Google builds a battery directly into every server rack.
  • Impact: This reduces power conversion losses and ensures that a single UPS failure doesn’t take down an entire row of servers.

2. The GCP Resource Hierarchy: Governing at Scale

GCP uses a strict “Parent-Child” hierarchy. This is the secret to how Google manages millions of resources across thousands of customers while maintaining strict security boundaries.

2.1 Cloud Identity: The Authentication Root

Before the Organization node, there is Cloud Identity.
  • The Directory: It stores your users, groups, and device information.
  • SSO Integration: Cloud Identity can federate with Active Directory, Azure AD, or Okta using SAML 2.0 or OIDC.
  • The Binding: Your GCP Organization is bound to your Cloud Identity domain (e.g., acme.com); domain ownership must be verified before the Organization node is created.

2.2 Tier 1: The Organization (The Root)

This represents your company. It is linked to your domain (e.g., company.com) via Cloud Identity or Google Workspace.
  • Centralized Ownership: If an employee leaves the company, the Organization ensures that the company—not the individual—owns the projects and data.
  • Global Policies: You can apply organization policies (Org Policies) that restrict what can be done anywhere under the org (e.g., disallow public IPs, restrict regions).
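To see which organization policies are in effect at the root, you can list and describe them from the CLI. A minimal sketch (the organization ID is a placeholder):
# List org policies set at the organization level (hypothetical org ID)
gcloud resource-manager org-policies list --organization=123456789012

# Inspect one constraint, e.g. the external-IP restriction
gcloud resource-manager org-policies describe compute.vmExternalIpAccess --organization=123456789012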

2.3 Tier 2: Folders (The Departments)

Folders are optional but highly recommended for any organization with more than 5 projects.
  • Example: You can have a Prod/ folder and a Dev/ folder, or folders by business unit (Finance/, Marketing/, Platform/).
  • Inheritance: Permissions (IAM) and org policies applied to a folder are automatically inherited by all projects inside it.
Typical patterns:
  • Org → Prod → Payments-Project
  • Org → NonProd → Shared-Dev-Tools
  • Org → Security → Logging-Aggregation.
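Folders are created and listed with the Resource Manager commands; a minimal sketch with placeholder IDs:
# Create top-level folders under the organization (hypothetical org ID)
gcloud resource-manager folders create --display-name="Prod" --organization=123456789012
gcloud resource-manager folders create --display-name="NonProd" --organization=123456789012

# Confirm the folder structure
gcloud resource-manager folders list --organization=123456789012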

2.4 Tier 3: Projects (The Containers)

The project is the fundamental unit for enabling APIs, billing, and managing resources.
  • Project ID: A permanent, globally unique string.
  • Project Number: A permanent, unique number assigned by Google (used internally and in some APIs).
  • Trust Boundary: By default, resources in Project A cannot talk to resources in Project B unless you explicitly connect them (e.g., via VPC Peering, Shared VPC, or service perimeters).
  • Billing Link: Each project is linked to at most one billing account; without a billing link, the project can only use free services.
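Creating a project inside a folder and attaching billing looks roughly like this (the project ID, folder ID, and billing account below are placeholders):
# Create a project under a folder
gcloud projects create prod-web-12345 --name="Prod Web" --folder=987654321098

# Link it to a billing account so paid services can be enabled
gcloud billing projects link prod-web-12345 --billing-account=012345-6789AB-CDEF01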

2.5 Tier 4: Resources (The Infrastructure)

The actual VMs, Cloud Storage buckets, BigQuery datasets, GKE clusters, etc.
  • IAM can be set at the resource level for fine‑grained control.
  • Labels on resources flow into billing export for cost allocation.
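Labels are attached directly to resources; for example, adding cost-allocation labels to an existing VM (the instance name, zone, and label values are illustrative):
# Label a VM so its cost shows up per environment/team in the billing export
gcloud compute instances add-labels my-vm --zone=us-central1-a --labels=env=prod,team=payments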

2.6 Designing a Hierarchy for a Real Company

Example design for a mid‑size org:
Org: example.com
├── Folder: Prod
│   ├── Project: prod-web
│   ├── Project: prod-data
│   └── Project: prod-shared-vpc
├── Folder: NonProd
│   ├── Project: dev-web
│   ├── Project: test-web
│   └── Project: sandbox
└── Folder: Security
    ├── Project: logging-aggregation
    └── Project: security-tools
Key ideas:
  • Separate prod vs non‑prod to keep access and blast radius distinct.
  • Have shared services projects (logging, networking) managed by platform teams.

3. Quotas and Limits: Preventing “Bill Shock”

Google Cloud uses quotas to protect you from accidental overspending and to protect their infrastructure from being overwhelmed.

3.1 Types of Quotas

  1. Rate Quotas:
    Limits on how many API calls you can make per unit time (e.g., 1,000 requests per minute to the Cloud Build API).
  2. Allocation Quotas:
    Limits on how many resources you can have (e.g., “You can only have 24 vCPUs in region us-central1”).
  3. Per‑user and per‑region limits:
    Some services also enforce additional caps per user or per region on top of the project-level quotas.

3.2 How to Inspect and Request Quota Increases

  • Console: IAM & Admin → Quotas (or search “Quotas”).
  • CLI: gcloud compute project-info describe and gcloud services commands.
Engineer’s Note: If you need more resources than your quota allows, you must file a Quota Increase Request in the Console. Google usually approves these within minutes for established accounts, but large jumps may require justification. Practical steps:
  • Before a major launch, review quotas in each region you plan to use.
  • Use monitoring alerts on quota metrics where possible to avoid surprises.
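A minimal sketch of inspecting quotas from the CLI (us-central1 is just an example region; the --flatten/--format projection is one common way to get a readable table):
# Per-region quotas (CPUs, addresses, disks, ...) for a region you plan to launch in
gcloud compute regions describe us-central1

# The same data as a compact table
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.usage,quotas.limit)"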

4. Interaction Tools: Console, CLI, and Shell

4.1 The Google Cloud Console

The web-based GUI. Excellent for visual learners and for exploring new services. Use cases:
  • Viewing resource topology, metrics, and logs.
  • Quick one-off changes or experiments.
  • Browsing documentation integrated into product UIs.

4.2 The gcloud CLI

The most powerful tool for a GCP Engineer. It allows you to automate everything.
  • Structure: gcloud [SERVICE] [GROUP] [COMMAND] [FLAGS]
  • Example: gcloud compute instances create my-vm --zone=us-central1-a
Best practices:
  • Use --format and --filter to build scripts that parse output reliably.
  • Store common settings (project, region, zone) using gcloud config set.
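For example, a compact, script-friendly listing of running VMs (the zone is illustrative):
# Running instances in one zone as a clean table
gcloud compute instances list \
    --filter="zone:us-central1-a AND status=RUNNING" \
    --format="table(name,status,machineType.basename())"

# View the defaults currently in effect
gcloud config list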

4.3 Cloud Shell (The Hidden Gem)

A free, temporary Linux VM accessible via your browser.
  • Pre-configured: Has gcloud, kubectl, terraform, docker, and git pre-installed.
  • $HOME directory: You get 5 GB of persistent storage for your scripts.
  • Boost Mode: Need more power? You can “boost” Cloud Shell to a larger VM (4 vCPUs and 16 GB of RAM) for a 24-hour window.
Cloud Shell is the fastest way to get a reproducible environment without installing anything locally.
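A quick way to confirm what a fresh Cloud Shell session gives you:
# Verify the pre-installed tooling in Cloud Shell
gcloud version
kubectl version --client
terraform version
git --version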

Lab: Deep Dive into gcloud and Cloud Shell

Open Cloud Shell and execute these “Production-ready” commands:
# 1. Update the gcloud components to the latest version
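# (Note: in Cloud Shell the component manager may be disabled because gcloud is kept up to date automatically; this step mainly applies to local installs.)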
gcloud components update

# 2. Set your default project and zone to save typing later
gcloud config set project [YOUR_PROJECT_ID]
gcloud config set compute/zone us-central1-a

# 3. Use the 'filter' and 'format' flags (Essential for automation)
# Find only the regions that are currently UP and display them as a clean list
gcloud compute regions list --filter="status:UP" --format="value(name)"

# 4. View your current quotas
gcloud compute project-info describe --project [YOUR_PROJECT_ID]
Extend this lab by:
  • Listing all projects you have access to: gcloud projects list.
  • Describing one of them: gcloud projects describe [PROJECT_ID].
  • Experimenting with different --format outputs (e.g., table, json, yaml).

Summary Checklist

  • Do you understand the difference between a Region and a Zone?
  • Can you explain why the Organization node is important for security?
  • Do you know how to request a quota increase?
  • Have you successfully launched Cloud Shell?
In the next chapter, we will master Identity & Access Management (IAM)—the system that determines who has the “keys” to your kingdom.

Interview Preparation

Question 1: Explain the roles of Jupiter, Andromeda, and B4 in Google’s network architecture.
Answer: These are the three pillars of Google’s network:
  1. Jupiter: The physical network fabric inside a data center. It provides 1.3 Pbps of bisection bandwidth, allowing thousands of servers to communicate at full speed without congestion.
  2. Andromeda: The Software-Defined Network (SDN) stack. It’s the “intelligence” that manages routing, firewalls, and load balancing at the host level rather than using discrete hardware appliances.
  3. B4: The private global fiber backbone that connects Google’s data centers worldwide. It uses centralized traffic engineering to optimize for latency, often beating the public internet by 25-35%.

Question 2: What is the significance of the Organization node in the GCP resource hierarchy?
Answer: The Organization node is the root of the hierarchy and represents the company. Its primary significance includes:
  • Centralized Control: It prevents “shadow projects” by ensuring all projects created by employees are owned by the company domain.
  • Governance: It allows for the application of Organization Policies (e.g., restricting which regions can be used) that cannot be overridden by project-level admins.
  • IAM Inheritance: Roles granted at the Org level flow down to all folders and projects, enabling consistent access control across the entire company.

Question 3: How would you design a resource hierarchy for a large company with multiple business units?
Answer: I would use a folder-based structure:
  1. Root: Organization node (company.com).
  2. Folders (Tier 1): One folder per business unit (e.g., Retail, Cloud-Services).
  3. Folders (Tier 2): Inside each BU folder, create sub-folders for environments (e.g., Prod, Non-Prod, Sandbox).
  4. Projects: Application-specific projects (e.g., retail-inventory-prod) live inside the environment folders.
  5. Shared Folders: A dedicated Security or Networking folder for centralized resources like Shared VPC host projects or log sinks.

Question 4: What are the two main types of quotas in GCP, and what does each protect against?
Answer:
  1. Rate Quotas: Limit the number of API requests over time (e.g., 1000 requests per minute). These protect the API control plane from being overwhelmed.
  2. Allocation Quotas: Limit the total number of resources you can hold at one time (e.g., 24 vCPUs in a region). These protect your budget and Google’s capacity.
Interview Tip: Mention that quotas are per-project and often per-region. If you hit a quota, you must request an increase in the console, which Google typically reviews for capacity and account history.

Question 5: What is Cloud Shell, and why would you use it?
Answer: Cloud Shell is a temporary, managed Linux VM that is:
  • Pre-configured: It comes with gcloud, kubectl, terraform, and docker pre-installed and updated.
  • Authenticated: It automatically uses your console credentials, removing the need to manage local keys.
  • Persistent: It includes 5GB of $HOME directory storage that persists between sessions.
  • Accessible: It provides “Boost Mode” (4 vCPUs, 16GB RAM) for heavy operations like building large container images.