
Chapter 18: Putting it All Together - The Capstone Project

Congratulations! You have completed the 18-chapter journey to becoming a Google Cloud Engineer. You’ve mastered everything from the low-level Jupiter network fabric to high-level serverless orchestration. Now, it’s time to prove your skills by building a Planet-Scale Social Feed Architecture.

1. The Scenario: “CloudStream Global”

You are the Lead Cloud Architect for “CloudStream,” a new social platform. Your CEO wants the platform to launch on three continents (North America, Europe, and Asia) on day one. The requirements are strict:
  • Zero Data Loss: User metadata must be strongly consistent globally.
  • Low Latency: Images and videos must load in under 200ms anywhere in the world.
  • Infinite Scale: The system must handle 10 requests or 10 million requests without manual intervention.
  • Hardened Security: The platform must survive massive DDoS attacks and prevent internal data exfiltration.

2. The Reference Architecture

To meet these requirements, you will implement the following “Best-of-GCP” stack:

Edge and Traffic Management

  • Global HTTP(S) Load Balancer: A single anycast IP (34.x.x.x) fronting the whole world.
  • Cloud Armor: WAF policies to block SQLi/XSS and an “Edge Security Policy” to block non-launch countries.
  • Cloud CDN: Enabled for the media bucket, using Signed Cookies for private content.
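
The WAF part of the Cloud Armor bullet above could be sketched in Terraform roughly as follows. This is a minimal sketch under stated assumptions, not a production rule set: the resource name `feed_waf`, the policy name, and the rule priorities are illustrative, and the geo-blocking rule for non-launch countries is left out because the country list is not given in this chapter.

```hcl
# Sketch only: names and priorities are illustrative assumptions.
resource "google_compute_security_policy" "feed_waf" {
  name        = "cloudstream-feed-waf"
  description = "WAF rules for the Feed API backend service"

  # Deny requests that match the preconfigured SQL injection signatures.
  rule {
    action      = "deny(403)"
    priority    = 1000
    description = "Block SQL injection attempts"
    match {
      expr {
        expression = "evaluatePreconfiguredExpr('sqli-stable')"
      }
    }
  }

  # Deny requests that match the preconfigured cross-site scripting signatures.
  rule {
    action      = "deny(403)"
    priority    = 1001
    description = "Block XSS attempts"
    match {
      expr {
        expression = "evaluatePreconfiguredExpr('xss-stable')"
      }
    }
  }

  # Default rule: allow everything that no earlier rule denied.
  rule {
    action      = "allow"
    priority    = 2147483647
    description = "Default allow"
    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["*"]
      }
    }
  }
}
```

The policy takes effect only once it is attached to a backend service, for example through the `security_policy` argument of `google_compute_backend_service`.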

Compute and Orchestration

  • GKE Autopilot (Regional): Running the core “Feed API” in us-central1, europe-west1, and asia-east1.
  • Workload Identity: All pods use custom service accounts to talk to databases.
  • Cloud Run: Running the “Image Resizer” microservice, triggered by Eventarc whenever a new photo is uploaded to GCS.
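
The Workload Identity bullet above might translate into Terraform along these lines. Treat it as a standalone sketch: the `project_id` variable, the IAM service account `feed-api`, and the Kubernetes namespace/service account pair `feed/feed-api` are all hypothetical names chosen for illustration.

```hcl
# Assumed input; declare it once per module.
variable "project_id" {
  type        = string
  description = "Project that hosts the GKE clusters"
}

# IAM service account that the Feed API pods will act as.
resource "google_service_account" "feed_api" {
  account_id   = "feed-api"
  display_name = "Feed API workload"
}

# Let the Kubernetes service account feed/feed-api impersonate the IAM
# service account. No JSON key is created or mounted anywhere.
resource "google_service_account_iam_member" "feed_api_workload_identity" {
  service_account_id = google_service_account.feed_api.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[feed/feed-api]"
}

# Grant the workload only the database access it needs.
resource "google_project_iam_member" "feed_api_spanner" {
  project = var.project_id
  role    = "roles/spanner.databaseUser"
  member  = "serviceAccount:${google_service_account.feed_api.email}"
}
```

On the cluster side, the Kubernetes service account also needs the `iam.gke.io/gcp-service-account` annotation pointing at the IAM service account's email.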

Data and Storage Layer

  • Cloud Spanner (Multi-region Instance): Storing user profiles, followers, and the “Social Graph” with external consistency using TrueTime.
  • Cloud Storage (Dual-region): Storing raw and resized media assets with Turbo Replication enabled.
  • Memorystore for Redis Cluster: Caching the most popular “Hot Feeds” to reduce Spanner read costs.
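
A dual-region media bucket with Turbo Replication could look roughly like this in Terraform. The bucket name is a placeholder, and the region pair is an assumption borrowed from the interview section's example (us-central1 and us-east1); the chapter does not say which pair CloudStream uses.

```hcl
resource "google_storage_bucket" "media" {
  name     = "cloudstream-media-example" # placeholder; bucket names are globally unique
  location = "US"                        # parent location for a configurable dual-region

  # The two regions that hold the data (assumed pair).
  custom_placement_config {
    data_locations = ["US-CENTRAL1", "US-EAST1"]
  }

  # ASYNC_TURBO enables Turbo Replication; DEFAULT is standard replication.
  rpo = "ASYNC_TURBO"

  # Keep the bucket private; content is served through Cloud CDN.
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
}
```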

Analytics and Intelligence

  • Pub/Sub: Ingesting every “Like” and “Share” event.
  • Dataflow: A streaming pipeline that aggregates engagement metrics in real-time.
  • BigQuery: Storing petabytes of engagement data for the marketing team.

3. Implementation Phases

Phase 1: The Foundation (IaC)

Write Terraform code to provision the VPC, subnets in three regions, and the Cloud Spanner instance. Store the Terraform state in a GCS bucket with object versioning enabled. The GCS backend locks the state automatically during operations, so no separate locking resource is needed.

Detailed Variable Definition (variables.tf): To ensure consistency across regions, use a map-based variable structure:
variable "regions" {
  type = map(object({
    cidr        = string
    gke_master  = string
    gke_pods    = string
    gke_svcs    = string
  }))
  default = {
    "us-central1" = {
      cidr       = "10.0.1.0/24"
      gke_master = "172.16.0.0/28"
      gke_pods   = "10.1.0.0/16"
      gke_svcs   = "10.2.0.0/20"
    },
    "europe-west1" = {
      cidr       = "10.0.2.0/24"
      gke_master = "172.16.0.16/28"
      gke_pods   = "10.3.0.0/16"
      gke_svcs   = "10.4.0.0/20"
    },
    "asia-east1" = {
      cidr       = "10.0.3.0/24"
      gke_master = "172.16.0.32/28"
      gke_pods   = "10.5.0.0/16"
      gke_svcs   = "10.6.0.0/20"
    }
  }
}

variable "spanner_config" {
  type        = string
  description = "The multi-regional instance configuration"
  default     = "nam-eur-asia1" # Multi-continent Spanner config
}
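
A main.tf that consumes these variables might be sketched as follows. It is one possible layout, not the course's reference solution: the state bucket name, the resource names, and the single-node Spanner sizing are assumptions, and the `gke_master` ranges are left for the GKE cluster definitions, which are not shown here.

```hcl
terraform {
  # The GCS backend locks state automatically while Terraform runs.
  backend "gcs" {
    bucket = "cloudstream-tf-state-example" # hypothetical; must exist (with versioning on) before `terraform init`
    prefix = "capstone/state"
  }
}

# Custom-mode VPC: subnets are created explicitly below.
resource "google_compute_network" "main" {
  name                    = "cloudstream-vpc"
  auto_create_subnetworks = false
  routing_mode            = "GLOBAL"
}

# One subnet per entry in var.regions, keyed by region name.
resource "google_compute_subnetwork" "regional" {
  for_each = var.regions

  name                     = "cloudstream-${each.key}"
  region                   = each.key
  network                  = google_compute_network.main.id
  ip_cidr_range            = each.value.cidr
  private_ip_google_access = true

  # Secondary ranges that GKE uses for pod and service IPs.
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = each.value.gke_pods
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = each.value.gke_svcs
  }
}

# Multi-region Spanner instance; capacity is an illustrative minimum.
resource "google_spanner_instance" "main" {
  name         = "cloudstream-spanner"
  display_name = "CloudStream Spanner"
  config       = var.spanner_config
  num_nodes    = 1
}
```

Using `for_each` over the map means adding a fourth region is a change to variables.tf only.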

Phase 2: Security Hardening

  • Implement VPC Service Controls. Create a perimeter around your project that includes BigQuery and Cloud Storage to prevent developers from accidentally (or maliciously) copying data to personal accounts.
  • Use Secret Manager to store the Spanner connection strings and API keys for external services.
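
The perimeter described above could be declared in Terraform roughly like this. It assumes the project belongs to an organization that already has an Access Context Manager access policy; `access_policy_id`, `project_number`, and the perimeter name `cloudstream` are hypothetical inputs.

```hcl
# Assumed inputs; declare them once per module.
variable "access_policy_id" {
  type        = string
  description = "Numeric ID of the organization's access policy"
}

variable "project_number" {
  type        = string
  description = "Project number (not the project ID) to place inside the perimeter"
}

resource "google_access_context_manager_service_perimeter" "cloudstream" {
  parent = "accessPolicies/${var.access_policy_id}"
  name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/cloudstream"
  title  = "cloudstream"

  status {
    # Projects protected by the perimeter.
    resources = ["projects/${var.project_number}"]

    # APIs whose data may not cross the perimeter boundary.
    restricted_services = [
      "bigquery.googleapis.com",
      "storage.googleapis.com",
    ]
  }
}
```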

Phase 3: The CI/CD Pipeline

Configure Cloud Build to:
  1. Run unit tests.
  2. Build the Docker image and push it to Artifact Registry.
  3. Trigger Cloud Deploy to roll out the image to the dev GKE cluster, followed by a manual approval gate for production.
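
The dev-then-prod rollout with an approval gate in step 3 might be modelled with Cloud Deploy resources like these. This is a sketch under assumptions: the cluster names `feed-dev` and `feed-prod`, the single `us-central1` location, and the `project_id` variable are invented for illustration, and the Cloud Build steps that create releases are not shown.

```hcl
# Assumed input; declare it once per module.
variable "project_id" {
  type        = string
  description = "Project that hosts the GKE clusters"
}

resource "google_clouddeploy_target" "dev" {
  name     = "dev"
  location = "us-central1"

  gke {
    cluster = "projects/${var.project_id}/locations/us-central1/clusters/feed-dev"
  }
}

resource "google_clouddeploy_target" "prod" {
  name     = "prod"
  location = "us-central1"

  # Rollouts to this target wait for a manual approval.
  require_approval = true

  gke {
    cluster = "projects/${var.project_id}/locations/us-central1/clusters/feed-prod"
  }
}

# Releases progress through the stages in the order listed.
resource "google_clouddeploy_delivery_pipeline" "feed_api" {
  name     = "feed-api"
  location = "us-central1"

  serial_pipeline {
    stages {
      target_id = google_clouddeploy_target.dev.name
    }
    stages {
      target_id = google_clouddeploy_target.prod.name
    }
  }
}
```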

Phase 4: Observability

Create a Cloud Monitoring Dashboard that tracks the “Golden Signals” (Latency, Traffic, Errors, Saturation). Set up an SLO of 99.9% availability for the Feed API.
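
To make the 99.9% target concrete, the downtime it allows can be worked out as below. The 30-day window is an assumption; this chapter does not specify a compliance period.

```latex
\[
\text{error budget}
  = (1 - 0.999) \times 30 \times 24 \times 60\ \text{min}
  = 0.001 \times 43{,}200\ \text{min}
  = 43.2\ \text{min}
\]
```

In other words, the Feed API may be unavailable for roughly 43 minutes per 30 days before the SLO is breached.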

Phase 5: Global Traffic Routing

Use a Global External HTTP(S) Load Balancer with Network Endpoint Groups (NEG) to route traffic directly to your GKE services across all three regions.
  • Anycast IP: Users in Tokyo hit the asia-east1 cluster, while users in London hit europe-west1, all using the same IP address.
  • Failover: If the US region goes down, the Load Balancer automatically shifts traffic to Europe or Asia based on proximity and capacity.

Phase 6: Real-time Analytics Pipeline

Ingest engagement data (likes, shares) into BigQuery using Pub/Sub and Dataflow.
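  • Windowing: Use a 1-minute sliding window in Dataflow to calculate “Trending Topics” in real-time. A sliding window also needs a period (how often a new window starts); a shorter period refreshes results more often at a higher processing cost.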
  • Sink: Write the results to a BigQuery table partitioned by hour for efficient querying by the business team.
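
The ingestion topic and the hourly-partitioned sink could be provisioned with Terraform along these lines. The topic, dataset, table, and column names are hypothetical; in particular, the three-column schema is only an assumption about what the Dataflow job might emit.

```hcl
# Topic that receives every like and share event.
resource "google_pubsub_topic" "engagement" {
  name = "engagement-events"
}

resource "google_bigquery_dataset" "analytics" {
  dataset_id = "engagement"
  location   = "US"
}

# Sink table for the Dataflow job, partitioned by hour on the window end time.
resource "google_bigquery_table" "trending_topics" {
  dataset_id = google_bigquery_dataset.analytics.dataset_id
  table_id   = "trending_topics"

  time_partitioning {
    type  = "HOUR"
    field = "window_end"
  }

  # Assumed output schema of the streaming aggregation.
  schema = jsonencode([
    { name = "topic", type = "STRING", mode = "REQUIRED" },
    { name = "event_count", type = "INTEGER", mode = "REQUIRED" },
    { name = "window_end", type = "TIMESTAMP", mode = "REQUIRED" },
  ])
}
```

Queries that filter on `window_end` then scan only the matching hourly partitions.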

4. The “Chaos” Test (Final Exam)

To graduate, you must prove your architecture can survive the following:
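  1. Zonal Outage: Take the Feed API pods in one zone out of service. Autopilot manages node pools for you, so you cannot disable one directly; deleting the pods scheduled in that zone is one way to simulate the failure. Does the Load Balancer automatically shift traffic to the remaining healthy endpoints?
  2. The “Slashdot” Effect: Simulate a massive traffic spike. Do the GKE Horizontal Pod Autoscaler (HPA) and the Cloud Run service scale fast enough?
  3. Security Breach: Try to download an object from your “Private” media bucket without a signed URL or signed cookie. Does it fail as expected?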

5. Final Words: The Path to Professional Architect

You now possess the technical knowledge required for the Google Cloud Professional Cloud Architect (PCA) certification. This exam doesn’t just test your knowledge of buttons; it tests your ability to design complex systems like the one you built in this project. Two resources are worth studying before the exam:
  • Google Cloud Architecture Framework: The 5 pillars (Operational Excellence, Security, Reliability, Performance, Cost).
  • Site Reliability Engineering (SRE) Book: Understanding how Google runs these systems at scale.

Welcome to the fold, Cloud Engineer. The world’s data is now in your hands.

The Project: “Global-Shop”

You are tasked with building the infrastructure for a global e-commerce platform that must be highly available, scalable, and secure.

The Architecture Requirements:

  1. Networking:
    • Create custom-mode VPC networks in a Hub-and-Spoke topology (a shared hub network connected to per-workload spoke networks).
    • Use Cloud NAT for private instances to access the internet.
    • Set up Cloud DNS for internal and external name resolution.
  2. Compute & Orchestration:
    • Deploy a GKE Autopilot cluster in us-central1.
    • Use Cloud Run for lightweight microservices (e.g., Email Service, PDF Generator).
    • Use Artifact Registry to store all container images.
  3. Data & Storage:
    • Use Cloud SQL (PostgreSQL) for the core order management system.
    • Use Cloud Spanner for the global inventory system (requires strong consistency across regions).
    • Use Cloud Storage for product images and static assets.
    • Use BigQuery to analyze sales data in real-time.
  4. Load Balancing & Performance:
    • Use a Global External HTTP(S) Load Balancer as the single entry point.
    • Enable Cloud CDN for fast image loading globally.
    • Attach Cloud Armor to the Load Balancer to protect against SQL injection and DDoS.
  5. Observability & Security:
    • Set up Cloud Monitoring dashboards for your GKE cluster and Cloud SQL.
    • Configure Log-based alerts for “Payment Failure” events.
    • Store all sensitive API keys and DB passwords in Secret Manager.
  6. Infrastructure as Code:
    • Define at least the VPC and GCS buckets using Terraform.
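
As a starting point for the Infrastructure as Code requirement, a minimal Terraform sketch might cover one network, Cloud NAT, and an assets bucket. Every name and the `10.10.0.0/24` range below are assumptions for illustration, and the sketch defines a single network only; a hub-and-spoke layout would add further networks and the connections between them.

```hcl
# Custom-mode VPC with one explicitly defined subnet.
resource "google_compute_network" "shop" {
  name                    = "global-shop-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "shop_us" {
  name          = "global-shop-us-central1"
  region        = "us-central1"
  network       = google_compute_network.shop.id
  ip_cidr_range = "10.10.0.0/24"
}

# Cloud NAT is configured on a Cloud Router in the same region.
resource "google_compute_router" "shop_us" {
  name    = "global-shop-router"
  region  = "us-central1"
  network = google_compute_network.shop.id
}

# Outbound internet access for instances without external IPs.
resource "google_compute_router_nat" "shop_us" {
  name                               = "global-shop-nat"
  router                             = google_compute_router.shop_us.name
  region                             = google_compute_router.shop_us.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

# Bucket for product images and static assets.
resource "google_storage_bucket" "assets" {
  name                        = "global-shop-assets-example" # placeholder; bucket names are globally unique
  location                    = "US"
  uniform_bucket_level_access = true
}
```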

Deliverables

1. The Architecture Diagram

Create a detailed architecture diagram showing all the components and how they interact. You can use tools like Lucidchart, Draw.io, or Google Slides.

2. The Deployment Script

Provide a README and the necessary scripts (gcloud commands or Terraform files) to provision the entire environment.

3. The “War Story”

Write a short document explaining:
  • One major challenge you faced during deployment.
  • How you optimized for cost.
  • How you ensured high availability if an entire zone goes down.

7. Interview Preparation: Architectural Deep Dive

1. Q: Why choose Cloud Spanner over Cloud SQL for a global social media application?
A: A global social media app needs External Consistency so that every region sees the same data (e.g., avoiding a case where a user sees a new follower in one region but not in another). Cloud SQL accepts writes in a single region, and its cross-region read replicas are asynchronous, so they can serve stale data. Cloud Spanner provides synchronous, multi-regional replication with 99.999% availability, ensuring that a committed “write” in the US is visible to any subsequent strong “read” in Asia.

2. Q: How does “Cloud Armor” protect a global application at the Edge?
A: Cloud Armor is a WAF (Web Application Firewall) that integrates with the Global Load Balancer. It blocks attacks (SQL injection, XSS, DDoS) at Google’s network edge, before the traffic reaches your GKE cluster or VPC. This saves your compute resources from processing malicious traffic and provides large-scale defense against Layer 7 DDoS attacks.

3. Q: Explain the benefit of using “Dual-Region” Cloud Storage buckets for media assets.
A: Dual-region buckets (e.g., us-central1 and us-east1) provide High Availability across two geographic locations. If one region goes down, the data remains accessible from the other; Standard storage in a dual-region has a typical monthly availability above 99.99%, backed by a 99.95% SLA. When combined with Turbo Replication, Google targets replication of 100% of newly uploaded data between the regions within 15 minutes (RPO).

4. Q: Why is “Workload Identity” mandatory for this project’s security architecture?
A: Managing static JSON service account keys for thousands of GKE pods is a security nightmare: every key is a long-lived credential that can be copied or leaked. Workload Identity eliminates these keys entirely. It maps the Kubernetes Service Account to a Google IAM service account and provides short-lived, auto-rotating tokens, so there is no key file on the container filesystem to steal.

5. Q: What are the “Four Golden Signals” you would track on your Capstone dashboard?
A:
  • Latency: Time taken to load the social feed (p50, p95, p99).
  • Traffic: Number of HTTP requests per second (RPS) hitting the Load Balancer.
  • Errors: Rate of 4xx and 5xx responses (crucial for detecting bugs vs. capacity issues).
  • Saturation: CPU/Memory utilization of the GKE nodes and the Spanner instance (helps with scaling decisions).

Final Review Checklist

  • Does the application scale to zero if there’s no traffic? (Cloud Run)
  • Is there any single point of failure? (Avoid single-zone deployments)
  • Is the “Least Privilege” principle applied in IAM?
  • Are all secrets stored in Secret Manager?
  • Is billing export to BigQuery enabled?

Congratulations!

You are now a GCP Cloud Engineer. You have the skills to design, build, and manage planet-scale infrastructure on Google Cloud.

What’s Next?
  • Take the GCP Professional Cloud Architect or Professional Cloud Developer certification exams.
  • Share your Capstone Project on GitHub and LinkedIn.
  • Keep exploring! Google Cloud is constantly evolving.
See you in the cloud!