
Chapter 7: SQL at Scale - Cloud SQL, AlloyDB, and Cloud Spanner

Google Cloud’s relational database portfolio is designed to bridge the gap between traditional RDBMS reliability and modern cloud-scale requirements. Whether you need a standard PostgreSQL instance for a monolithic application or a globally distributed, strongly consistent database for planet-scale systems, GCP offers specialized solutions tailored to your needs. This chapter assumes no prior database administration experience. We will cover database fundamentals, connectivity patterns, high availability architecture, backup strategies, scaling approaches, and transaction management from the ground up.

0. From Scratch: Understanding Relational Databases

Before diving into GCP’s specific offerings, let’s establish foundational concepts about relational databases.

0.1 What is a Relational Database?

A relational database stores data in tables with rows and columns:
  • Table: A collection of related data (e.g., “Customers”, “Orders”)
  • Row: A single record in the table (e.g., one customer)
  • Column: An attribute or field (e.g., “customer_name”, “email”)
  • Primary Key: Unique identifier for each row
  • Foreign Key: Reference to a row in another table (creates relationships)
Relational databases use SQL (Structured Query Language) to query and manipulate data.
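To make these terms concrete, here is a minimal, runnable sketch using Python's standard-library sqlite3 module (table and column names are illustrative; the same primary-key and foreign-key concepts apply unchanged to MySQL, PostgreSQL, and SQL Server):
import sqlite3

# In-memory database; the DDL concepts carry over to any RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign-key constraints

conn.execute("""
CREATE TABLE Customers (
    customer_id   INTEGER PRIMARY KEY,   -- primary key: unique per row
    customer_name TEXT NOT NULL,
    email         TEXT
)""")

conn.execute("""
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customers(customer_id),  -- foreign key
    total       REAL
)""")

conn.execute("INSERT INTO Customers VALUES (1, 'Acme Corp', 'ops@example.com')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 42.50)")

# A join follows the foreign-key relationship between the two tables.
for row in conn.execute("""
    SELECT c.customer_name, o.order_id, o.total
    FROM Orders o JOIN Customers c ON c.customer_id = o.customer_id"""):
    print(row)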

0.2 ACID Properties

Production databases must guarantee ACID properties:
  • Atomicity: A transaction either completes entirely or not at all (no partial updates)
  • Consistency: Database remains in a valid state before and after transactions
  • Isolation: Concurrent transactions don’t interfere with each other
  • Durability: Once committed, data survives crashes and power failures
These properties are critical for financial systems, e-commerce, and any application where data correctness is paramount.
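To see atomicity and durability in code, the sketch below wraps a two-step funds transfer in one transaction using psycopg2; the connection parameters and the accounts table are illustrative placeholders:
import psycopg2

# Placeholder connection details for any PostgreSQL instance.
conn = psycopg2.connect(host="127.0.0.1", dbname="mydb", user="postgres", password="password")

def transfer(conn, from_acct, to_acct, amount):
    """Move funds atomically: both UPDATEs commit together or not at all."""
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE account_id = %s",
                (amount, from_acct))
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE account_id = %s",
                (amount, to_acct))
        conn.commit()      # Durability: once this returns, the change survives crashes
    except Exception:
        conn.rollback()    # Atomicity: no partial update is ever visible
        raise

transfer(conn, from_acct=1, to_acct=2, amount=100)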

0.3 Database Workload Types

Understanding workload types helps choose the right database:
  • OLTP (Online Transaction Processing):
    • High volume of short transactions (inserts, updates, deletes)
    • Examples: e-commerce checkouts, bank transfers, user registrations
    • Optimized for write throughput and low latency
  • OLAP (Online Analytical Processing):
    • Complex queries scanning large datasets
    • Examples: business intelligence, reporting, data warehousing
    • Optimized for read throughput and aggregations
  • Hybrid (HTAP):
    • Both transactional and analytical workloads on the same database
    • Examples: real-time dashboards, operational analytics
    • Requires specialized architecture

1. Cloud SQL: The Managed Standard

Cloud SQL is a fully managed service for MySQL, PostgreSQL, and SQL Server. It is the go-to choice for migrating existing applications to the cloud or starting new projects that don’t require global scale.

1.1 Cloud SQL Architecture: The Principal’s View

While GCP abstracts most of the complexity, a Principal Engineer must understand the underlying infrastructure to predict behavior during tail-latency events or regional degradations.

The “Managed Instance” Anatomy

Cloud SQL is not a “serverless” database in the same way BigQuery is; it is a highly automated orchestration of Google Compute Engine (GCE) and specialized storage.
  • Orchestration Layer: A control plane (managed by Google) handles provisioning, patching, and failover.
  • The VM Instance: A hidden GCE instance runs the database engine (MySQL, Postgres, or SQL Server).
  • The Sidecar: Each instance runs a “Sentinel” or agent process that monitors health and handles administrative tasks (backups, log exports).
  • Persistent Storage: Uses Google’s block storage. For HA, it uses Regional Persistent Disk (Regional PD), which provides synchronous replication at the block level across two zones.

Zonal vs. Regional Storage

Feature | Zonal Storage | Regional Storage (HA)
Replication | Single zone (locally redundant) | Synchronous across two zones
Write Latency | Lowest | Higher (network round-trip to 2nd zone)
Availability | Zonal failure = outage | Zonal failure = ~60s failover
Recovery | Requires backup restore | Automatic failover

1.2 High Availability (HA) Deep Dive

Cloud SQL’s HA isn’t just “having a standby.” It’s a coordinated block-level replication and IP takeover mechanism.

The Failover Lifecycle

When a primary instance becomes unreachable (node failure, network partition, or zonal outage), the following sequence occurs:
  1. Detection: The control plane’s monitoring service (Sentinel) detects that the primary is unresponsive (typically after 2-3 missed heartbeats).
  2. Fencing: The control plane ensures the old primary cannot write to the disk anymore (avoiding “split-brain” scenarios).
  3. Disk Attachment: The Regional PD is detached from the failed primary and attached to the standby VM in the second zone.
  4. Engine Recovery: The database engine on the standby starts. It must perform Crash Recovery by replaying the Write-Ahead Log (WAL) from the disk to ensure data consistency.
  5. IP Takeover: The internal load balancer or DNS-based service endpoint points to the new instance.
Principal Note: The failover time is dominated by WAL recovery. If you have very long-running transactions or a high volume of un-checkpointed writes, your failover time will increase. Monitoring pg_stat_activity for long-running transactions is a critical HA strategy.
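As a minimal sketch of that monitoring advice, the following psycopg2 snippet flags transactions that have been open longer than a threshold (connection details are placeholders, e.g. an Auth Proxy listening on localhost):
import psycopg2

# Assumes connectivity to the primary, e.g. via the Cloud SQL Auth Proxy.
conn = psycopg2.connect(host="127.0.0.1", port=5432, dbname="mydb", user="postgres")

LONG_TXN_SECONDS = 300  # alert threshold; tune for your workload

with conn.cursor() as cur:
    cur.execute("""
        SELECT pid, usename, state,
               now() - xact_start AS txn_age,
               left(query, 80)    AS current_query
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
          AND now() - xact_start > make_interval(secs => %s)
        ORDER BY txn_age DESC
    """, (LONG_TXN_SECONDS,))
    for pid, user, state, age, query in cur.fetchall():
        print(f"Long-running transaction pid={pid} user={user} age={age} state={state}: {query}")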

Configuration (CLI)

# Create an HA PostgreSQL Instance
gcloud sql instances create prod-db \
    --database-version=POSTGRES_14 \
    --tier=db-custom-4-15360 \
    --region=us-central1 \
    --availability-type=REGIONAL \
    --storage-type=SSD \
    --storage-size=100GB \
    --storage-auto-increase \
    --backup-start-time=03:00

HA Limitations and Considerations

  • Regional Scope: HA protects against zonal failures but not regional disasters
  • Performance Impact: Synchronous replication adds slight latency to writes (~2-5ms)
  • Cost: HA configuration doubles compute costs (primary + standby)
  • Planned Maintenance: Even with HA, maintenance operations may cause brief disruptions

1.3 Connectivity Patterns

Connecting applications to Cloud SQL securely requires careful architecture.

Private IP vs Public IP

Private IP (Recommended):
  • Database instance has only a private IP address
  • Accessible only from within your VPC network
  • Requires VPC peering setup
  • More secure (no Internet exposure)
Public IP:
  • Database accessible from the Internet
  • Requires authorized networks (IP allowlists)
  • Use only for development or migration scenarios
  • Always enable SSL/TLS

The Cloud SQL Auth Proxy

Never open port 3306 (MySQL) or 5432 (PostgreSQL) to the public internet. The Cloud SQL Auth Proxy is a secure connection method:
  • IAM Authentication: Uses service account credentials instead of passwords
  • Automatic Encryption: All traffic encrypted with TLS
  • Certificate Management: Handles SSL certificate rotation automatically
  • No IP Allowlists: Eliminates need to maintain IP ranges
Auth Proxy Setup (Local Development)
# Download the proxy binary
curl -o cloud_sql_proxy https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64
chmod +x cloud_sql_proxy

# Run the proxy
./cloud_sql_proxy -instances=PROJECT_ID:REGION:INSTANCE_NAME=tcp:5432

# Connect from your application to localhost:5432
psql "host=127.0.0.1 port=5432 dbname=mydb user=postgres"
Auth Proxy in Production (Kubernetes Sidecar)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: gcr.io/project/app:latest
        env:
        - name: DB_HOST
          value: "127.0.0.1"
        - name: DB_PORT
          value: "5432"
      - name: cloud-sql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:latest
        command:
        - "/cloud_sql_proxy"
        - "-instances=PROJECT_ID:REGION:INSTANCE_NAME=tcp:5432"
        securityContext:
          runAsNonRoot: true

Private Service Connection (VPC Peering)

For private IP access without the Auth Proxy:
# Allocate an IP range for Cloud SQL
gcloud compute addresses create google-managed-services-default \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=16 \
    --network=default

# Create private service connection
gcloud services vpc-peerings connect \
    --service=servicenetworking.googleapis.com \
    --ranges=google-managed-services-default \
    --network=default

1.4 Backup and Recovery

Cloud SQL provides automated and on-demand backups.

Automated Backups

  • Frequency: Daily, at a configured time
  • Retention: 7 days (default), up to 365 days
  • Point-in-Time Recovery (PITR): Restore to any point in the last 7 days
  • Transaction Logging: PITR relies on binary logs (MySQL) or write-ahead logs (PostgreSQL) to capture every change
# Configure backup settings
gcloud sql instances patch prod-db \
    --backup-start-time=03:00 \
    --retained-backups-count=30 \
    --retained-transaction-log-days=7

On-Demand Backups

# Create manual backup
gcloud sql backups create \
    --instance=prod-db \
    --description="Pre-migration backup"

Restore Operations

# Restore from a specific backup
gcloud sql backups restore BACKUP_ID \
    --restore-instance=prod-db

# Point-in-time restore to 1 hour ago
gcloud sql instances clone prod-db prod-db-restore \
    --point-in-time='2024-01-24T15:00:00.000Z'

Backup Best Practices

  • Test Restores: Regularly practice restore procedures
  • Export Backups: For critical data, export backups to Cloud Storage
  • Cross-Region Copies: For disaster recovery, maintain backups in multiple regions
  • Monitor Backup Success: Set up alerts for failed backups

1.5 Read Replicas and Scaling

Read replicas allow horizontal scaling of read traffic.

Read Replica Architecture

  • Asynchronous Replication: Changes replicated from primary to replicas
  • Read-Only: Replicas cannot accept writes
  • Replication Lag: Typically milliseconds, but it can grow under load (see the lag-check sketch after this list)
  • Multiple Replicas: Create multiple replicas for different purposes
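Because replication is asynchronous, it is worth checking lag before routing read-your-writes traffic to a replica. A minimal sketch, assuming the replica behaves like a standard PostgreSQL streaming replica (which Cloud SQL read replicas do) and is reachable via psycopg2:
import psycopg2

def replica_lag_seconds(replica_conn):
    """Return approximate replication lag in seconds on a PostgreSQL replica."""
    with replica_conn.cursor() as cur:
        cur.execute("""
            SELECT CASE
                     WHEN pg_is_in_recovery()
                     THEN EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
                     ELSE 0
                   END
        """)
        (lag,) = cur.fetchone()
        return float(lag or 0)

# Route reads away from a lagging replica (endpoint is illustrative).
replica = psycopg2.connect(host="REPLICA_PRIVATE_IP", dbname="mydb", user="postgres")
if replica_lag_seconds(replica) > 5:
    print("Replica lag > 5s; fall back to the primary for read-your-writes queries")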

Creating Read Replicas

# Create a read replica in the same region
gcloud sql instances create prod-db-replica-1 \
    --master-instance-name=prod-db \
    --region=us-central1 \
    --tier=db-custom-2-7680

# Create a cross-region read replica for disaster recovery
gcloud sql instances create prod-db-replica-dr \
    --master-instance-name=prod-db \
    --region=europe-west1 \
    --tier=db-custom-4-15360

Application Integration Patterns

Pattern 1: Connection Pooling with Read/Write Split
# Python example using SQLAlchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Primary (read-write)
primary_engine = create_engine('postgresql://user:pass@primary-ip/db')

# Replica (read-only)
replica_engine = create_engine('postgresql://user:pass@replica-ip/db')

# Use primary for writes
def create_order(order_data):
    session = sessionmaker(bind=primary_engine)()
    # ... insert order
    session.commit()

# Use replica for reads
def get_order_list():
    session = sessionmaker(bind=replica_engine)()
    return session.query(Order).all()
Pattern 2: Load Balancer for a Pooler Tier Cloud SQL replicas cannot be attached directly as backends of a Compute Engine load balancer, so the usual pattern is to run a lightweight pooler/proxy tier (for example, PgBouncer or the Cloud SQL Auth Proxy in a managed instance group) behind an internal TCP load balancer and point that tier at the replicas; alternatively, keep replica selection in the application as in Pattern 1. The commands below assume a hypothetical pooler instance group named read-pooler-mig:
# Create health check for the pooler tier
gcloud compute health-checks create tcp read-replica-health \
    --port=5432

# Create an internal backend service for the pooler tier
gcloud compute backend-services create read-replica-backend \
    --protocol=TCP \
    --load-balancing-scheme=internal \
    --health-checks=read-replica-health \
    --region=us-central1

# Add the pooler managed instance group (created separately) as a backend
gcloud compute backend-services add-backend read-replica-backend \
    --instance-group=read-pooler-mig \
    --instance-group-zone=us-central1-a \
    --region=us-central1

1.6 Maintenance Windows and Upgrades

Cloud SQL requires periodic maintenance for security patches and updates.

Maintenance Window Configuration

# Set maintenance window to Sunday 4:00 AM
gcloud sql instances patch prod-db \
    --maintenance-window-day=SUN \
    --maintenance-window-hour=04 \
    --maintenance-release-channel=production

Maintenance Release Channels

  • Production: Stable updates, lower frequency
  • Preview: Early access to new features, higher frequency
  • Advance Notice: Maintenance notifications are sent at least one week before the scheduled window

Minimizing Maintenance Impact

  • Enable HA: Reduces disruption during maintenance
  • Use Read Replicas: Redirect traffic during primary maintenance
  • Monitor Maintenance Notifications: Subscribe to maintenance alerts
  • Test on Non-Production: Apply updates to dev/staging first

1.7 Performance Optimization

Query Insights

Enable Cloud SQL Query Insights to identify slow queries:
# Enable Query Insights
gcloud sql instances patch prod-db \
    --insights-config-query-insights-enabled \
    --insights-config-query-string-length=1024 \
    --insights-config-record-application-tags \
    --insights-config-record-client-address
Query Insights provides:
  • Top queries by execution time, CPU, I/O
  • Query execution plans
  • Historical query performance trends
  • Recommendations for missing indexes

Connection Pooling

Database connections are expensive resources, so always use connection pooling. Built-in Cloud SQL Python Connector (recommended):
from google.cloud.sql.connector import Connector
from sqlalchemy import create_engine

# Initialize Connector
connector = Connector()

# Create connection pool
def getconn():
    conn = connector.connect(
        "PROJECT:REGION:INSTANCE",
        "pg8000",
        user="postgres",
        password="password",
        db="mydb"
    )
    return conn

# Use with SQLAlchemy
engine = create_engine(
    "postgresql+pg8000://",
    creator=getconn,
    pool_size=10,
    max_overflow=5,
    pool_pre_ping=True
)

Storage Auto-Increase

Enable automatic storage increase to prevent out-of-disk errors:
gcloud sql instances patch prod-db \
    --storage-auto-increase \
    --storage-auto-increase-limit=500

2. AlloyDB for PostgreSQL

AlloyDB is a PostgreSQL-compatible, enterprise-grade database designed for the most demanding transactional and analytical workloads.

2.1 Architecture: Decoupled Compute and Storage

AlloyDB separates the database engine from the storage layer, enabling superior performance and scalability.

Log-Structured Storage

  • Traditional Approach: Write full data pages to disk
  • AlloyDB Approach: Write only logs (change records) to storage
  • Benefit: Reduces I/O overhead by 10x for write-heavy workloads

Multi-Tier Caching

AlloyDB uses a sophisticated caching hierarchy:
  1. L1 Cache (RAM): Hot data in memory
  2. L2 Cache (Local SSD): Frequently accessed data
  3. L3 Cache (Regional Storage): Persistent storage layer
This architecture provides:
  • Sub-millisecond read latency for cached data
  • Automatic promotion of frequently accessed data
  • Transparent to applications (standard PostgreSQL wire protocol)

2.2 Columnar Engine (Hybrid OLTP/OLAP)

AlloyDB’s columnar engine enables analytical queries on transactional data without ETL or replication.

How It Works

  • Dual Storage Format: Data stored in both row and column formats
  • Automatic Population: ML algorithms identify analytical query patterns
  • Memory-Based: Columnar data kept in memory for fast access
  • Zero Impact: Analytical queries don’t affect transactional performance

Configuring the Columnar Engine

# Create AlloyDB cluster with columnar engine enabled
gcloud alloydb clusters create prod-alloydb-cluster \
    --region=us-central1 \
    --password=secure-password

# Create primary instance with columnar engine
gcloud alloydb instances create prod-alloydb-primary \
    --cluster=prod-alloydb-cluster \
    --region=us-central1 \
    --instance-type=PRIMARY \
    --cpu-count=8 \
    --database-flags=google_columnar_engine.enabled=on,google_columnar_engine.memory_size_mb=4096

Use Cases for Columnar Engine

  • Real-time dashboards: Query live transactional data
  • Operational analytics: No need for separate data warehouse
  • Hybrid workloads: OLTP and OLAP on the same database
  • Simplified architecture: Eliminates ETL pipelines

2.3 AlloyDB High Availability

AlloyDB provides enterprise-grade HA with sub-minute failover:
  • Cross-Zone Replication: Automatic replication across availability zones
  • Automatic Failover: Typically completes in 10-30 seconds
  • Zero Data Loss: Synchronous replication ensures no data loss
  • Transparent to Applications: Connection endpoint remains the same

2.4 Read Pool Instances

AlloyDB supports read pool instances for scaling read traffic:
# Create read pool instance
gcloud alloydb instances create prod-alloydb-read-pool \
    --cluster=prod-alloydb-cluster \
    --region=us-central1 \
    --instance-type=READ_POOL \
    --cpu-count=4 \
    --read-pool-node-count=3
Read pools automatically:
  • Distribute read traffic across multiple nodes
  • Scale nodes up/down based on load
  • Maintain low replication lag (typically under 1ms)
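As a sketch of how an application can target the read pool separately from the primary, the example below uses the AlloyDB Python connector (the google-cloud-alloydb-connector package) with pg8000 and SQLAlchemy; the project, cluster, and credential values are placeholders taken from the commands above:
import sqlalchemy
from google.cloud.alloydb.connector import Connector

connector = Connector()

# Instance URI format: projects/<PROJECT>/locations/<REGION>/clusters/<CLUSTER>/instances/<INSTANCE>
PRIMARY_URI = "projects/my-project/locations/us-central1/clusters/prod-alloydb-cluster/instances/prod-alloydb-primary"
READ_POOL_URI = "projects/my-project/locations/us-central1/clusters/prod-alloydb-cluster/instances/prod-alloydb-read-pool"

def make_engine(instance_uri):
    # The connector opens a TLS connection and returns a DB-API connection (pg8000 driver).
    def getconn():
        return connector.connect(
            instance_uri,
            "pg8000",
            user="postgres",
            password="secure-password",
            db="postgres",
        )
    return sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn, pool_pre_ping=True)

write_engine = make_engine(PRIMARY_URI)   # INSERT/UPDATE/DELETE go to the primary
read_engine = make_engine(READ_POOL_URI)  # SELECT-heavy traffic spread across the read pool nodes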

2.5 AlloyDB vs Cloud SQL PostgreSQL

Feature | Cloud SQL PostgreSQL | AlloyDB for PostgreSQL
Performance | Standard PostgreSQL | Up to 4x faster transactions
Analytical Queries | Slow on OLTP data | Up to 100x faster (columnar)
Scaling | Vertical + read replicas | Vertical + read pools
Availability SLA | 99.95% | 99.99%
Failover Time | Under 60 seconds | 10-30 seconds
Max Size | 64 TB | 64 TB+
Price | Lower | Higher (premium features)
When to Choose AlloyDB:
  • Need for hybrid OLTP/OLAP workloads
  • Sub-minute failover requirements
  • Maximum PostgreSQL performance
  • Enterprise support and SLAs

3. Cloud Spanner: The Global Giant

Cloud Spanner is Google’s globally distributed “NewSQL” database. It provides the horizontal scalability of NoSQL with the consistency of SQL and ACID transactions across continents.

3.1 Understanding Distributed Databases

Traditional relational databases struggle with global scale due to the CAP theorem:
  • C (Consistency): All nodes see the same data
  • A (Availability): System remains operational
  • P (Partition Tolerance): System continues despite network splits
Most databases choose two:
  • CP: Consistent and Partition-Tolerant (sacrifice availability)
  • AP: Available and Partition-Tolerant (sacrifice consistency)
Spanner’s Innovation: Technically a CP system, Spanner is engineered with redundant global networking, Paxos replication, and TrueTime to be so highly available (99.999% in multi-region configurations) that it is “effectively CA” in practice, delivering strong consistency with global availability.

3.2 The Engineering of TrueTime

TrueTime is not just a “very accurate clock.” It is a distributed system that explicitly accounts for its own uncertainty.

The now() Interval

When a Spanner server calls TrueTime.now(), it doesn’t get a single number. It gets an interval: [earliest, latest]. The width of this interval (latest - earliest) is the uncertainty (ε).

External Consistency via “Commit Wait”

To guarantee that a transaction T1 that finishes before T2 starts will always be seen by T2, Spanner follows the Commit Wait rule:
  1. Assign T1 a commit timestamp s1 = TrueTime.now().latest.
  2. Delay the commit of T1 until TrueTime.now().earliest > s1.
This ensures that the commit timestamp s1 is definitely in the past across the entire fleet before the transaction is finalized. This “wait” is typically less than 10ms, but it is the secret to global serializability.
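The rule is easy to express as a few lines of Python. The sketch below is purely conceptual (Spanner performs commit wait internally; client code never does this), with the TrueTime interval modeled as a pair of bounds and a made-up uncertainty value:
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on the "true" current time (seconds)
    latest: float    # upper bound on the "true" current time (seconds)

EPSILON = 0.004  # illustrative uncertainty (~4ms); real values come from GPS/atomic clock masters

def truetime_now() -> TTInterval:
    t = time.time()
    return TTInterval(earliest=t - EPSILON, latest=t + EPSILON)

def commit_with_wait() -> float:
    # 1. Assign the commit timestamp from the top of the uncertainty interval.
    s1 = truetime_now().latest
    # 2. Commit wait: block until the timestamp is guaranteed to be in the past everywhere.
    while truetime_now().earliest <= s1:
        time.sleep(0.001)
    return s1  # any later transaction, anywhere, will observe a strictly greater timestamp

print(f"committed at timestamp {commit_with_wait():.6f}")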

3.3 Spanner Architecture: Paxos Groups and Splits

Spanner scales by sharding data into Splits (contiguous ranges of rows).

Paxos Groups

Each split is not just stored on one machine; it is replicated across multiple zones/regions. Each split forms its own Paxos Group.
  • The Leader: One replica in the Paxos group is elected as the leader. All writes for that split must go through the leader.
  • The Followers: Replicas that participate in the consensus but don’t initiate writes.

Scaling via Resharding

As a split grows too large (~4GB) or receives too much traffic, Spanner automatically:
  1. Splits the Split: Divides the row range into two.
  2. Migrates Replicas: Moves Paxos group members to different servers to balance CPU and storage.

3.4 Read-Only Transactions vs. Read-Write Transactions

A Principal Engineer must distinguish between these to optimize global performance.

Read-Write Transactions (Pessimistic)

  • Uses 2-Phase Locking (2PL) on the leaders.
  • Requires 2-Phase Commit (2PC) if the transaction spans multiple splits.
  • Performance is limited by cross-region consensus latency.

Read-Only Transactions (Snapshot Reads)

  • Lock-Free: Does not take any locks.
  • Timestamp-Based: Reads data as of a specific TrueTime timestamp.
  • Local Execution: Can be served by the nearest replica (even if it’s not the leader), provided that replica has caught up to the requested timestamp.
Optimization Tip: Use Read-Only transactions for 99% of your dashboard and reporting needs to avoid contention with the write path.
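The distinction maps directly onto the Spanner Python client: database.snapshot() performs a lock-free read (optionally at a bounded staleness so a nearby replica can serve it), while database.run_in_transaction() runs a locking read-write transaction with automatic retries. Instance, database, and column names below follow the examples in this chapter:
import datetime
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("my-database")

# Read-only: lock-free snapshot read, up to 15 seconds stale, servable by the nearest replica.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
    rows = snapshot.execute_sql(
        "SELECT OrderId, TotalAmount FROM Orders WHERE CustomerId = @cid",
        params={"cid": 12345},
        param_types={"cid": spanner.param_types.INT64},
    )
    for order_id, total in rows:
        print(order_id, total)

# Read-write: pessimistic locking; the function is retried automatically on aborts.
def update_total(transaction):
    transaction.execute_update(
        "UPDATE Orders SET TotalAmount = TotalAmount + 10 "
        "WHERE CustomerId = @cid AND OrderId = @oid",
        params={"cid": 12345, "oid": 678},
        param_types={"cid": spanner.param_types.INT64, "oid": spanner.param_types.INT64},
    )

database.run_in_transaction(update_total)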

3.5 Spanner Data Modeling

Data modeling in Spanner differs from traditional SQL databases.

Interleaved Tables (Parent-Child)

Spanner supports table interleaving to co-locate related data:
CREATE TABLE Customers (
  CustomerId INT64 NOT NULL,
  Name STRING(MAX),
  Email STRING(MAX)
) PRIMARY KEY (CustomerId);

CREATE TABLE Orders (
  CustomerId INT64 NOT NULL,
  OrderId INT64 NOT NULL,
  OrderDate DATE,
  TotalAmount NUMERIC
) PRIMARY KEY (CustomerId, OrderId),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
Benefits:
  • Locality: Orders stored physically near their customer
  • Performance: Joins between customers and orders are extremely fast
  • Atomic Updates: Updates to customer and orders in same transaction

Choosing Primary Keys

Bad Primary Key Design (Creates Hotspots):
-- Sequential IDs create hotspot on last split
CREATE TABLE Events (
  EventId INT64 NOT NULL,  -- Auto-increment
  ...
) PRIMARY KEY (EventId);
Good Primary Key Design:
-- UUID spreads writes across splits
CREATE TABLE Events (
  EventId STRING(36) NOT NULL,  -- UUID
  ...
) PRIMARY KEY (EventId);

-- Or hash-based key
CREATE TABLE Events (
  ShardId INT64 NOT NULL,  -- Hash(userId) % 1000
  EventId INT64 NOT NULL,
  ...
) PRIMARY KEY (ShardId, EventId);
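A small sketch of generating such keys on the client side: a random UUID for the first schema, and a deterministic shard prefix (a stable hash of the user ID modulo the shard count) for the second. The shard count mirrors the Hash(userId) % 1000 comment above:
import hashlib
import uuid

NUM_SHARDS = 1000  # matches the Hash(userId) % 1000 comment above

def uuid_event_id() -> str:
    """Random UUIDv4 string: writes scatter uniformly across splits."""
    return str(uuid.uuid4())

def shard_id(user_id: int) -> int:
    """Stable hash of the user id (Python's built-in hash() is not stable across processes)."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Example key tuples for the two "good" schemas above.
print(("EventId",), (uuid_event_id(),))
print(("ShardId", "EventId"), (shard_id(user_id=42), 1001))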

3.6 The “Mutation” Limit

In Spanner, an INSERT, UPDATE, or DELETE is called a mutation.

Mutation Constraints

  • Limit: Maximum 20,000 mutations per commit
  • Calculation: (Columns affected) × (Indexes + 1)
  • Example: Updating 5 columns on a table with 2 indexes = 15 mutations per row

Bulk Loading Strategies

Wrong Approach (Fails):
# DON'T: Single transaction with 1M rows
with database.batch() as batch:
    for i in range(1_000_000):
        batch.insert("Orders", columns, values)
# Fails: the commit exceeds Spanner's per-transaction mutation limit
Correct Approach (Batching):
# DO: Insert in chunks small enough to stay under the mutation limit
def batch_insert(rows, batch_size=5000):
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i+batch_size]
        with database.batch() as batch:
            # values expects an iterable of rows, so pass the whole chunk at once
            batch.insert("Orders", columns=columns, values=chunk)
Production Approach (Dataflow):
# Use Dataflow (Apache Beam) for massive ingests
import apache_beam as beam
from apache_beam.io.gcp.experimental.spannerio import WriteMutation, WriteToSpanner

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Read CSV' >> beam.io.ReadFromText('gs://bucket/data.csv')
     | 'Parse' >> beam.Map(parse_csv_row)  # yields (CustomerId, OrderId, OrderDate, TotalAmount) tuples
     | 'To Mutations' >> beam.Map(lambda row: WriteMutation.insert(
         table='Orders',
         columns=('CustomerId', 'OrderId', 'OrderDate', 'TotalAmount'),
         values=[row]))
     | 'Write to Spanner' >> WriteToSpanner(
         project_id='my-project',
         instance_id='my-instance',
         database_id='my-database'))

3.7 Spanner Performance Optimization

Commit Wait and Latency

Spanner’s commit includes a “commit wait” period based on TrueTime uncertainty:
  • Typical commit wait: 5-10ms
  • Total write latency: Network + processing + commit wait
  • Regional: ~10-20ms
  • Multi-regional: ~50-100ms (cross-region consensus)

Query Best Practices

Use Secondary Indexes Wisely:
-- Create secondary index for common queries
CREATE INDEX OrdersByDate ON Orders(OrderDate);

-- Force index usage if needed
SELECT * FROM Orders@{FORCE_INDEX=OrdersByDate}
WHERE OrderDate = '2024-01-24';
Avoid Full Table Scans:
-- BAD: Full table scan
SELECT * FROM Orders WHERE LOWER(CustomerName) = 'acme';

-- GOOD: Use indexed column
SELECT * FROM Orders WHERE CustomerId = 12345;

3.8 Spanner Pricing Model

Spanner pricing is based on:
  1. Nodes: ~$0.90/hour per node (regional), ~$3.00/hour per node (multi-regional)
  2. Storage: $0.30/GB/month
  3. Network Egress: Standard GCP egress rates
Sizing Guidelines:
  • Start with 3 nodes minimum (1 per zone for HA)
  • Each node provides roughly 10,000 QPS of reads or 2,000 QPS of writes (see the sizing sketch after this list)
  • Monitor CPU usage; add nodes when sustained >65%
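Putting those guidelines into numbers, here is a back-of-the-envelope sizing helper that assumes the per-node throughput figures and the 65% CPU target above; treat the result as a starting point for load testing, not a guarantee:
import math

READS_PER_NODE = 10_000    # approximate read QPS per node (guideline above)
WRITES_PER_NODE = 2_000    # approximate write QPS per node (guideline above)
TARGET_UTILIZATION = 0.65  # keep sustained CPU below ~65%

def spanner_nodes(read_qps: float, write_qps: float) -> int:
    """Estimate node count for a regional instance from peak read/write QPS."""
    raw = read_qps / READS_PER_NODE + write_qps / WRITES_PER_NODE
    with_headroom = raw / TARGET_UTILIZATION
    return max(3, math.ceil(with_headroom))  # 3-node minimum for zonal redundancy

# Example: 40,000 reads/s and 5,000 writes/s peak
# raw = 4.0 + 2.5 = 6.5 node-equivalents; 6.5 / 0.65 = 10 -> 10 nodes
print(spanner_nodes(40_000, 5_000))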

4. Decision Matrix: Which Database to Choose?

Scenario | Recommendation | Rationale
Standard OLTP app, single region | Cloud SQL | Cost-effective, familiar SQL
Heavy analytics on transactional data | AlloyDB | Columnar engine for OLAP
Global application, low-latency reads | Spanner (multi-regional) | External consistency globally
Financial system, strong consistency | Spanner or AlloyDB | ACID guarantees, high availability
Migrating from MySQL/Postgres | Cloud SQL → AlloyDB | Easiest migration path
Migrating from Oracle | AlloyDB | Advanced enterprise features
IoT, time-series data | Bigtable (NoSQL) | Better for non-relational data
Budget-constrained, moderate load | Cloud SQL | Lowest cost option

5. Database Migration Service (DMS)

DMS simplifies migrations from on-premises or other clouds to Cloud SQL or AlloyDB.

5.1 Migration Types

  • One-Time (Dump and Restore): Full database copy, requires downtime
  • Continuous (CDC): Change Data Capture keeps target in sync, minimal downtime

5.2 Migration Steps

# 1. Create migration job
gcloud database-migration migration-jobs create my-migration \
    --source=mysql-on-prem \
    --destination=cloudsql \
    --type=CONTINUOUS \
    --region=us-central1

# 2. Start migration
gcloud database-migration migration-jobs start my-migration \
    --region=us-central1

# 3. Monitor progress
gcloud database-migration migration-jobs describe my-migration \
    --region=us-central1

# 4. Promote (cutover to Cloud SQL)
gcloud database-migration migration-jobs promote my-migration \
    --region=us-central1

5.3 Migration Best Practices

  • Test migrations: Practice on non-production data first
  • Validate data: Compare row counts and checksums between source and target (see the sketch after this list)
  • Monitor replication lag: Ensure lag is minimal before cutover
  • Plan rollback: Have a rollback plan in case of issues
  • Incremental cutover: Migrate non-critical systems first
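For the data-validation step, a lightweight sketch that compares per-table row counts between source and target with SQLAlchemy (the connection strings and table list are placeholders; checksum-based tools give stronger guarantees):
import sqlalchemy

# Placeholder DSNs: source could be on-prem, target is Cloud SQL (e.g. via the Auth Proxy).
SOURCE_DSN = "postgresql+pg8000://user:pass@source-host/appdb"
TARGET_DSN = "postgresql+pg8000://user:pass@127.0.0.1:5432/appdb"

TABLES = ["customers", "orders", "order_items"]  # illustrative table list

def row_counts(dsn, tables):
    engine = sqlalchemy.create_engine(dsn)
    counts = {}
    with engine.connect() as conn:
        for table in tables:
            counts[table] = conn.execute(
                sqlalchemy.text(f"SELECT COUNT(*) FROM {table}")).scalar()
    return counts

source_counts = row_counts(SOURCE_DSN, TABLES)
target_counts = row_counts(TARGET_DSN, TABLES)

for table in TABLES:
    status = "OK" if source_counts[table] == target_counts[table] else "MISMATCH"
    print(f"{table}: source={source_counts[table]} target={target_counts[table]} [{status}]")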

6. Production Readiness: SRE Best Practices

6.1 Monitoring and Alerting

Key Metrics to Monitor:
  • Cloud SQL:
    • CPU utilization (alert at >80%)
    • Disk utilization (alert at >75%)
    • Replication lag (alert at >5 seconds)
    • Connection count (alert at >80% of max)
  • Spanner:
    • CPU utilization per node (alert at >65%)
    • Storage per node (alert at >80% of limit)
    • Leader changes (indicates instability)
    • Latency percentiles (p50, p95, p99)
# Example: Create Cloud Monitoring alert
gcloud alpha monitoring policies create \
    --notification-channels=CHANNEL_ID \
    --display-name="Cloud SQL High CPU" \
    --condition-display-name="CPU > 80%" \
    --condition-threshold-value=0.8 \
    --condition-threshold-duration=300s

6.2 Disaster Recovery Planning

RTO/RPO Definitions:
  • RTO (Recovery Time Objective): How long can you be down?
  • RPO (Recovery Point Objective): How much data loss is acceptable?
Strategies:
Strategy | RTO | RPO | Cost
Automated Backups | Hours | Minutes | Low
HA Configuration | Seconds | Zero | Medium
Cross-Region Replica | Minutes | Seconds | High
Multi-Regional Spanner | Seconds | Zero | Highest

6.3 Security Hardening

  • Network Security:
    • Use private IP addresses
    • Implement VPC Service Controls
    • Enable Cloud SQL Auth Proxy
  • Access Control:
    • Follow principle of least privilege
    • Use IAM for database access
    • Rotate passwords regularly
  • Encryption:
    • Enable encryption at rest (default)
    • Use customer-managed encryption keys (CMEK) for compliance
    • Always use TLS for connections

6.4 Cost Optimization

  • Right-Sizing:
    • Start small, scale up based on metrics
    • Use committed use discounts for Spanner
    • Consider shared-core instances for dev/test
  • Storage Management:
    • Enable storage auto-increase with limits
    • Archive old data to Cloud Storage
    • Use table partitioning for large tables
  • Query Optimization:
    • Use Query Insights to identify slow queries
    • Add appropriate indexes
    • Implement connection pooling

7. Interview Preparation: Architectural Deep Dive

1. Q: What is the primary difference between Cloud SQL High Availability (HA) and Cloud Spanner’s availability model?
A: Cloud SQL HA is regional and restricted to a single primary plus one standby. It uses synchronous replication to a regional persistent disk; failover takes ~60 seconds and requires a brief disconnection. Cloud Spanner is natively distributed: it has no “standby” because every node is active. In multi-regional configurations it offers 99.999% availability with transparent failover and no manual intervention, since data is replicated across multiple regions using Paxos consensus.

2. Q: Explain the “Mutation Limit” in Cloud Spanner and how to design for it.
A: Cloud Spanner limits a single transaction to 20,000 mutations. The mutation count is calculated as (Number of Columns Updated) × (Number of Secondary Indexes + 1). If you update 10 columns in a table with 3 indexes, that’s 40 mutations per row. To handle large ingests, implement application-level batching (e.g., chunks of 5,000 rows) or use Dataflow with the SpannerIO connector, which handles partitioning and mutation limits automatically.

3. Q: Why does Cloud Spanner require “TrueTime” to maintain external consistency?
A: In a distributed system, clocks on different servers are never perfectly in sync. Spanner uses TrueTime (GPS + atomic clocks) to provide a bounded uncertainty interval [earliest, latest]. When a transaction commits, Spanner waits until it is certain that the current time is past the commit’s latest timestamp. This “commit wait” ensures that any subsequent transaction starting anywhere in the world will receive a timestamp strictly greater than the previous one, guaranteeing global serializability (external consistency).

4. Q: When should an architect choose AlloyDB over standard Cloud SQL PostgreSQL?
A: AlloyDB should be chosen when the workload requires enterprise-grade performance or Hybrid Transactional/Analytical Processing (HTAP). AlloyDB is up to 4x faster for transactional workloads and up to 100x faster for analytical queries than standard Postgres due to its log-structured storage and Columnar Engine. It also provides a 99.99% SLA (compared to 99.95% for Cloud SQL) and sub-minute failover times.

5. Q: What is the “Hotspot” problem in Cloud Spanner/Bigtable, and how do you prevent it?
A: Hotspots occur when many concurrent writes hit the same range of keys (e.g., using a sequential timestamp or an auto-incrementing ID as a primary key). Since Spanner and Bigtable store data lexicographically by key, these writes all hit the same split or node, causing a bottleneck. Prevention: use a UUID (random string) or a hashed version of the key. If a sequential ID is required, prefix it with a shard ID (e.g., hash(id) % 100).

Lab: Production-Grade Cloud SQL Deployment

This lab walks through deploying a production-ready Cloud SQL instance with HA, backups, monitoring, and security.

Lab Architecture

  • VPC with private subnet
  • Cloud SQL with private IP
  • HA enabled (regional persistent disk)
  • Automated backups and PITR
  • Read replica in different region
  • Cloud SQL Auth Proxy for secure access
  • Monitoring and alerting

Step 1: Network Setup

# Create VPC
gcloud compute networks create prod-vpc --subnet-mode=custom

# Create subnet
gcloud compute networks subnets create db-subnet \
    --network=prod-vpc \
    --region=us-central1 \
    --range=10.0.1.0/24

# Allocate IP range for private service connection
gcloud compute addresses create google-managed-services \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=16 \
    --network=prod-vpc

# Create private service connection
gcloud services vpc-peerings connect \
    --service=servicenetworking.googleapis.com \
    --ranges=google-managed-services \
    --network=prod-vpc

Step 2: Create Production Cloud SQL Instance

# Create HA PostgreSQL instance with private IP
gcloud sql instances create prod-db \
    --database-version=POSTGRES_14 \
    --tier=db-custom-4-15360 \
    --region=us-central1 \
    --network=prod-vpc \
    --no-assign-ip \
    --availability-type=REGIONAL \
    --storage-type=SSD \
    --storage-size=100GB \
    --storage-auto-increase \
    --storage-auto-increase-limit=500 \
    --enable-point-in-time-recovery \
    --backup-start-time=03:00 \
    --retained-backups-count=30 \
    --retained-transaction-log-days=7 \
    --maintenance-window-day=SUN \
    --maintenance-window-hour=04 \
    --maintenance-release-channel=production \
    --insights-config-query-insights-enabled \
    --insights-config-query-string-length=1024

Step 3: Create Read Replica

# Create cross-region read replica for DR
gcloud sql instances create prod-db-replica \
    --master-instance-name=prod-db \
    --region=us-east1 \
    --tier=db-custom-2-7680 \
    --availability-type=ZONAL

Step 4: Configure Security

# Create database and user
gcloud sql databases create appdb --instance=prod-db
gcloud sql users create appuser --instance=prod-db --password=SECURE_PASSWORD

# Set up IAM authentication (recommended): enable the flag, then add the service account as a database user
gcloud sql instances patch prod-db \
    --database-flags=cloudsql.iam_authentication=on

gcloud sql users create app-service-account@PROJECT_ID.iam \
    --instance=prod-db \
    --type=CLOUD_IAM_SERVICE_ACCOUNT

Step 5: Set Up Monitoring

# Enable Cloud Monitoring
gcloud services enable monitoring.googleapis.com

# Create notification channel (email)
gcloud alpha monitoring channels create \
    --display-name="DBA Team" \
    --type=email \
    --channel-labels=email_address=dba-team@example.com

# Create alert for high CPU
gcloud alpha monitoring policies create \
    --notification-channels=CHANNEL_ID \
    --display-name="Cloud SQL High CPU" \
    --condition-display-name="CPU > 80%" \
    --condition-threshold-value=0.8 \
    --condition-threshold-duration=300s

Step 6: Test Connectivity

# Install Cloud SQL Proxy
curl -o cloud_sql_proxy https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64
chmod +x cloud_sql_proxy

# Run proxy
./cloud_sql_proxy -instances=PROJECT_ID:us-central1:prod-db=tcp:5432 &

# Connect and test
psql "host=127.0.0.1 port=5432 dbname=appdb user=appuser"

Step 7: Validate HA and Backup

# Trigger manual backup
gcloud sql backups create --instance=prod-db

# List backups
gcloud sql backups list --instance=prod-db

# Simulate failover (triggers automatic failover to standby)
gcloud sql instances failover prod-db

Summary: Database Selection Framework

Choose Cloud SQL when:
  • Migrating existing MySQL/PostgreSQL/SQL Server apps
  • Single-region deployment sufficient
  • Budget-conscious choice
  • Standard OLTP workload
Choose AlloyDB when:
  • Need PostgreSQL with maximum performance
  • Hybrid OLTP/OLAP workloads
  • Enterprise-grade SLAs required
  • Real-time analytics on transactional data
Choose Spanner when:
  • Global application with users worldwide
  • Need strong consistency across regions
  • Scale beyond single database limits
  • Mission-critical, always-available system
In the next chapter, we explore NoSQL databases for unstructured and semi-structured data with Cloud Firestore and Bigtable.