Chapter 7: SQL at Scale - Cloud SQL, AlloyDB, and Cloud Spanner
Google Cloud’s relational database portfolio is designed to bridge the gap between traditional RDBMS reliability and modern cloud-scale requirements. Whether you need a standard PostgreSQL instance for a monolithic application or a globally distributed database that delivers strong consistency for planet-scale systems, GCP offers specialized solutions tailored to your needs. This chapter assumes no prior database administration experience. We will cover database fundamentals, connectivity patterns, high availability architecture, backup strategies, scaling approaches, and transaction management from the ground up.
0. From Scratch: Understanding Relational Databases
Before diving into GCP’s specific offerings, let’s establish foundational concepts about relational databases.
0.1 What is a Relational Database?
A relational database stores data in tables with rows and columns:
- Table: A collection of related data (e.g., “Customers”, “Orders”)
- Row: A single record in the table (e.g., one customer)
- Column: An attribute or field (e.g., “customer_name”, “email”)
- Primary Key: Unique identifier for each row
- Foreign Key: Reference to a row in another table (creates relationships)
0.2 ACID Properties
Production databases must guarantee ACID properties (a minimal example follows this list):
- Atomicity: A transaction either completes entirely or not at all (no partial updates)
- Consistency: Database remains in a valid state before and after transactions
- Isolation: Concurrent transactions don’t interfere with each other
- Durability: Once committed, data survives crashes and power failures
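Atomicity is easiest to see in a money transfer. The sketch below is a minimal illustration, assuming a hypothetical accounts table and a PostgreSQL connection string in $DB_URI; it is not tied to any specific GCP product.

```bash
# Hypothetical example: transfer 100 between two accounts atomically.
# Assumes an `accounts` table exists and $DB_URI holds a PostgreSQL connection string.
psql "$DB_URI" <<'SQL'
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- if any statement fails, the whole transaction is rolled back
SQL
```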
0.3 Database Workload Types
Understanding workload types helps choose the right database:
- OLTP (Online Transaction Processing):
  - High volume of short transactions (inserts, updates, deletes)
  - Examples: e-commerce checkouts, bank transfers, user registrations
  - Optimized for write throughput and low latency
- OLAP (Online Analytical Processing):
  - Complex queries scanning large datasets
  - Examples: business intelligence, reporting, data warehousing
  - Optimized for read throughput and aggregations
- Hybrid (HTAP):
  - Both transactional and analytical workloads on the same database
  - Examples: real-time dashboards, operational analytics
  - Requires specialized architecture
1. Cloud SQL: The Managed Standard
Cloud SQL is a fully managed service for MySQL, PostgreSQL, and SQL Server. It is the go-to choice for migrating existing applications to the cloud or starting new projects that don’t require global scale.
1.1 Cloud SQL Architecture: The Principal’s View
While GCP abstracts most of the complexity, a Principal Engineer must understand the underlying infrastructure to predict behavior during tail-latency events or regional degradations.
The “Managed Instance” Anatomy
Cloud SQL is not a “serverless” database in the same way BigQuery is; it is a highly automated orchestration of Google Compute Engine (GCE) and specialized storage.
- Orchestration Layer: A control plane (managed by Google) handles provisioning, patching, and failover.
- The VM Instance: A hidden GCE instance runs the database engine (MySQL, Postgres, or SQL Server).
- The Sidecar: Each instance runs a “Sentinel” or agent process that monitors health and handles administrative tasks (backups, log exports).
- Persistent Storage: Uses Google’s block storage. For HA, it uses Regional Persistent Disk (Regional PD), which provides synchronous replication at the block level across two zones.
Zonal vs. Regional Storage
| Feature | Zonal Storage | Regional Storage (HA) |
|---|---|---|
| Replication | Single zone (locally redundant) | Synchronous across two zones |
| Write Latency | Lowest | Higher (network round-trip to 2nd zone) |
| Availability | Zonal failure = Outage | Zonal failure = ~60s Failover |
| Recovery | Requires backup restore | Automatic failover |
1.2 High Availability (HA) Deep Dive
Cloud SQL’s HA isn’t just “having a standby.” It’s a coordinated block-level replication and IP takeover mechanism.
The Failover Lifecycle
When a primary instance becomes unreachable (node failure, network partition, or zonal outage), the following sequence occurs:
1. Detection: The control plane’s monitoring service (Sentinel) detects that the primary is unresponsive (typically after 2-3 missed heartbeats).
2. Fencing: The control plane ensures the old primary can no longer write to the disk (avoiding “split-brain” scenarios).
3. Disk Attachment: The Regional PD is detached from the failed primary and attached to the standby VM in the second zone.
4. Engine Recovery: The database engine on the standby starts. It must perform crash recovery by replaying the Write-Ahead Log (WAL) from the disk to ensure data consistency.
5. IP Takeover: The internal load balancer or DNS-based service endpoint points to the new instance.
Principal Note: The failover time is dominated by WAL recovery. If you have very long-running transactions or a high volume of un-checkpointed writes, your failover time will increase. Monitoring pg_stat_activity for long-running transactions is a critical HA strategy.
Configuration (CLI)
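A minimal sketch of enabling HA, assuming illustrative instance names, tiers, and regions; --availability-type=REGIONAL is the flag that provisions the cross-zone standby and regional disk.

```bash
# Create a new PostgreSQL instance with HA (regional persistent disk + standby).
gcloud sql instances create prod-db \
    --database-version=POSTGRES_15 \
    --tier=db-custom-2-8192 \
    --region=us-central1 \
    --availability-type=REGIONAL

# Or enable HA on an existing zonal instance (triggers a restart).
gcloud sql instances patch prod-db --availability-type=REGIONAL
```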
HA Limitations and Considerations
- Regional Scope: HA protects against zonal failures but not regional disasters
- Performance Impact: Synchronous replication adds slight latency to writes (~2-5ms)
- Cost: HA configuration doubles compute costs (primary + standby)
- Planned Maintenance: Even with HA, maintenance operations may cause brief disruptions
1.3 Connectivity Patterns
Connecting applications to Cloud SQL securely requires careful architecture.
Private IP vs Public IP
Private IP (Recommended):
- Database instance has only a private IP address
- Accessible only from within your VPC network
- Requires VPC peering setup
- More secure (no Internet exposure)
Public IP:
- Database accessible from the Internet
- Requires authorized networks (IP allowlists)
- Use only for development or migration scenarios
- Always enable SSL/TLS
The Cloud SQL Auth Proxy
Never open port 3306 (MySQL) or 5432 (PostgreSQL) to the public internet. The Cloud SQL Auth Proxy is a secure connection method:
- IAM Authentication: Uses service account credentials instead of passwords
- Automatic Encryption: All traffic encrypted with TLS
- Certificate Management: Handles SSL certificate rotation automatically
- No IP Allowlists: Eliminates need to maintain IP ranges
Auth Proxy Setup (Local Development)
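A local-development sketch using the v2 proxy binary; the download version and instance connection name are illustrative, and Application Default Credentials are assumed to be configured.

```bash
# Download the Cloud SQL Auth Proxy (v2); version shown is illustrative.
curl -o cloud-sql-proxy \
    https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.14.0/cloud-sql-proxy.linux.amd64
chmod +x cloud-sql-proxy

# Listen on localhost:5432 and tunnel to the instance (PROJECT:REGION:INSTANCE).
./cloud-sql-proxy --port 5432 my-project:us-central1:prod-db &

# Connect as if the database were local; traffic is encrypted end to end.
psql "host=127.0.0.1 port=5432 user=app_user dbname=appdb"
```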
Auth Proxy in Production (Kubernetes Sidecar)
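A sketch of the sidecar pattern, with illustrative Deployment, image tag, project, and service-account names; the application talks to 127.0.0.1 while the proxy container handles IAM and TLS.

```bash
# Illustrative Deployment: the app connects to 127.0.0.1:5432,
# and the proxy sidecar tunnels to the Cloud SQL instance.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      serviceAccountName: app-ksa   # bound to a GCP service account via Workload Identity
      containers:
      - name: app
        image: us-docker.pkg.dev/my-project/app/web:latest
        env:
        - name: DB_HOST
          value: "127.0.0.1"
      - name: cloud-sql-proxy
        image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.0
        args:
        - "--port=5432"
        - "my-project:us-central1:prod-db"
EOF
```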
Private Service Connection (VPC Peering)
For private IP access without the Auth Proxy, configure Private Services Access (VPC peering between your VPC and Google’s service producer network), as sketched below.
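A minimal sketch of the peering setup; the network and range names are illustrative.

```bash
# Reserve an IP range for Google-managed services and peer it with your VPC.
gcloud compute addresses create google-managed-services-range \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=20 \
    --network=prod-vpc

gcloud services vpc-peerings connect \
    --service=servicenetworking.googleapis.com \
    --ranges=google-managed-services-range \
    --network=prod-vpc

# New instances can now be created with --network=prod-vpc and --no-assign-ip.
```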
1.4 Backup and Recovery
Cloud SQL provides automated and on-demand backups.
Automated Backups
- Frequency: Daily, at a configured time
- Retention: 7 days (default), up to 365 days
- Point-in-Time Recovery (PITR): Restore to any point in the last 7 days
- Binary Logging: Required for PITR, captures all changes
On-Demand Backups
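A quick sketch of taking and listing a manual backup; the instance name is illustrative.

```bash
# Take an on-demand backup of the primary instance.
gcloud sql backups create \
    --instance=prod-db \
    --description="pre-migration backup"

# List backups to confirm completion and capture the backup ID.
gcloud sql backups list --instance=prod-db
```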
Restore Operations
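A sketch of the two common restore paths; BACKUP_ID, instance names, and the timestamp are placeholders.

```bash
# Restore a specific backup over an existing instance (overwrites its data).
gcloud sql backups restore BACKUP_ID --restore-instance=prod-db

# Point-in-time recovery: clone the instance to a new one at a given timestamp.
gcloud sql instances clone prod-db prod-db-recovered \
    --point-in-time='2024-05-01T03:15:00Z'
```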
Backup Best Practices
- Test Restores: Regularly practice restore procedures
- Export Backups: For critical data, export backups to Cloud Storage
- Cross-Region Copies: For disaster recovery, maintain backups in multiple regions
- Monitor Backup Success: Set up alerts for failed backups
1.5 Read Replicas and Scaling
Read replicas allow horizontal scaling of read traffic.
Read Replica Architecture
- Asynchronous Replication: Changes replicated from primary to replicas
- Read-Only: Replicas cannot accept writes
- Replication Lag: Typically milliseconds, can increase under load
- Multiple Replicas: Create multiple replicas for different purposes
Creating Read Replicas
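A minimal sketch of creating a replica from an existing primary; names, tier, and region are illustrative.

```bash
# Create a read replica of the primary instance.
gcloud sql instances create prod-db-replica-1 \
    --master-instance-name=prod-db \
    --region=us-central1 \
    --tier=db-custom-2-8192

# Verify the replica is running.
gcloud sql instances describe prod-db-replica-1 --format='value(state)'
```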
Application Integration Patterns
Pattern 1: Connection Pooling with Read/Write Split. Route writes through a pool pointed at the primary and reads through a separate pool pointed at the replicas, accepting that replica reads may lag the primary slightly.
1.6 Maintenance Windows and Upgrades
Cloud SQL requires periodic maintenance for security patches and updates.
Maintenance Window Configuration
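A sketch of pinning maintenance to a low-traffic window; the instance name and schedule are illustrative.

```bash
# Schedule maintenance for Sunday 03:00 UTC on the stable release channel.
gcloud sql instances patch prod-db \
    --maintenance-window-day=SUN \
    --maintenance-window-hour=3 \
    --maintenance-release-channel=production
```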
Maintenance Release Channels
- Production: Stable updates, lower frequency
- Preview: Early access to new features, higher frequency
- Advance Notice: Maintenance is announced at least one week before it is applied
Minimizing Maintenance Impact
- Enable HA: Reduces disruption during maintenance
- Use Read Replicas: Redirect traffic during primary maintenance
- Monitor Maintenance Notifications: Subscribe to maintenance alerts
- Test on Non-Production: Apply updates to dev/staging first
1.7 Performance Optimization
Query Insights
Enable Cloud SQL Query Insights to identify slow queries (a CLI sketch follows this list):
- Top queries by execution time, CPU, I/O
- Query execution plans
- Historical query performance trends
- Recommendations for missing indexes
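A minimal sketch of turning Query Insights on from the CLI; the instance name is illustrative and the flags should be confirmed against the current gcloud release.

```bash
# Turn on Query Insights with per-query tagging and client address capture.
gcloud sql instances patch prod-db \
    --insights-config-query-insights-enabled \
    --insights-config-record-application-tags \
    --insights-config-record-client-address
```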
Connection Pooling
Database connections are expensive resources. Use connection pooling. The built-in Cloud SQL Connector libraries (recommended) handle IAM authentication, TLS, and connection creation inside your application code.
Storage Auto-Increase
Enable automatic storage increase to prevent out-of-disk errors (sketch below):
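A short example with an illustrative instance name; the limit is in GB and prevents unbounded growth.

```bash
# Grow storage automatically, but cap it to avoid runaway costs.
gcloud sql instances patch prod-db \
    --storage-auto-increase \
    --storage-auto-increase-limit=500
```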
2. AlloyDB for PostgreSQL
AlloyDB is a PostgreSQL-compatible, enterprise-grade database designed for the most demanding transactional and analytical workloads.
2.1 Architecture: Decoupled Compute and Storage
AlloyDB separates the database engine from the storage layer, enabling superior performance and scalability.
Log-Structured Storage
- Traditional Approach: Write full data pages to disk
- AlloyDB Approach: Write only logs (change records) to storage
- Benefit: Reduces I/O overhead by 10x for write-heavy workloads
Multi-Tier Caching
AlloyDB uses a sophisticated caching hierarchy:
- L1 Cache (RAM): Hot data in memory
- L2 Cache (Local SSD): Frequently accessed data
- L3 Cache (Regional Storage): Persistent storage layer
Benefits:
- Sub-millisecond read latency for cached data
- Automatic promotion of frequently accessed data
- Transparent to applications (standard PostgreSQL wire protocol)
2.2 Columnar Engine (Hybrid OLTP/OLAP)
AlloyDB’s columnar engine enables analytical queries on transactional data without ETL or replication.
How It Works
- Dual Storage Format: Data stored in both row and column formats
- Automatic Population: ML algorithms identify analytical query patterns
- Memory-Based: Columnar data kept in memory for fast access
- Zero Impact: Analytical queries don’t affect transactional performance
Configuring the Columnar Engine
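A sketch under the assumption that the columnar engine is toggled via the google_columnar_engine.enabled database flag; instance and cluster names are illustrative, and the SQL helper for manually adding a table is an assumption worth verifying against current AlloyDB documentation.

```bash
# Enable the columnar engine flag on an AlloyDB instance (restarts the instance).
gcloud alloydb instances update prod-primary \
    --cluster=prod-cluster \
    --region=us-central1 \
    --database-flags=google_columnar_engine.enabled=on

# Optionally pin a hot table into the column store from psql
# (function name is an assumption -- check the AlloyDB docs for your version).
psql "$ALLOYDB_URI" -c "SELECT google_columnar_engine_add('public.orders');"
```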
Use Cases for Columnar Engine
- Real-time dashboards: Query live transactional data
- Operational analytics: No need for separate data warehouse
- Hybrid workloads: OLTP and OLAP on the same database
- Simplified architecture: Eliminates ETL pipelines
2.3 AlloyDB High Availability
AlloyDB provides enterprise-grade HA with sub-minute failover:
- Cross-Zone Replication: Automatic replication across availability zones
- Automatic Failover: Typically completes in 10-30 seconds
- Zero Data Loss: Synchronous replication ensures no data loss
- Transparent to Applications: Connection endpoint remains the same
2.4 Read Pool Instances
AlloyDB supports read pool instances for scaling read traffic (see the sketch after this list):
- Distribute read traffic across multiple nodes
- Scale nodes up/down based on load
- Maintain low replication lag (typically under 1ms)
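A minimal sketch of adding a read pool to an existing cluster; names, region, node count, and CPU size are illustrative.

```bash
# Add a two-node read pool to an existing AlloyDB cluster.
gcloud alloydb instances create prod-read-pool \
    --cluster=prod-cluster \
    --region=us-central1 \
    --instance-type=READ_POOL \
    --read-pool-node-count=2 \
    --cpu-count=4
```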
2.5 AlloyDB vs Cloud SQL PostgreSQL
| Feature | Cloud SQL PostgreSQL | AlloyDB for PostgreSQL |
|---|---|---|
| Performance | Standard PostgreSQL | Up to 4x faster transactions |
| Analytical Queries | Slow on OLTP data | Up to 100x faster (columnar) |
| Scaling | Vertical + Read Replicas | Vertical + Read Pools |
| Availability | 99.95% | 99.99% |
| Failover Time | under 60 seconds | 10-30 seconds |
| Max Size | 64 TB | 64 TB+ |
| Price | Lower | Higher (premium features) |
When to choose AlloyDB:
- Need for hybrid OLTP/OLAP workloads
- Sub-second failover requirements
- Maximum PostgreSQL performance
- Enterprise support and SLAs
3. Cloud Spanner: The Global Giant
Cloud Spanner is often described as the first “NewSQL” database. It provides the scalability of NoSQL with the consistency of SQL and ACID transactions across continents.
3.1 Understanding Distributed Databases
Traditional relational databases struggle with global scale because of the trade-offs described by the CAP theorem:
- C (Consistency): All nodes see the same data
- A (Availability): System remains operational
- P (Partition Tolerance): System continues despite network splits
When a network partition occurs, a distributed system must choose between consistency and availability:
- CP: Consistent and Partition-Tolerant (sacrifice availability)
- AP: Available and Partition-Tolerant (sacrifice consistency)
Spanner is technically a CP system, but Google’s network and replication design make partitions rare enough that it behaves, in practice, like a highly available one.
3.2 The Engineering of TrueTime
TrueTime is not just a “very accurate clock.” It is a distributed system that explicitly accounts for its own uncertainty.
The now() Interval
When a Spanner server calls TrueTime.now(), it doesn’t get a single number. It gets an interval: [earliest, latest].
The width of this interval (latest - earliest) is the uncertainty (ε).
External Consistency via “Commit Wait”
To guarantee that a transaction T1 which commits before transaction T2 starts will always be observed by T2, Spanner follows the Commit Wait rule:
- Assign T1 a commit timestamp s = TT.now().latest.
- Delay the commit acknowledgement of T1 until TT.now().earliest > s, i.e., until TrueTime guarantees that s is in the past on every server.
3.3 Spanner Architecture: Paxos Groups and Splits
Spanner scales by sharding data into Splits (contiguous ranges of rows).
Paxos Groups
Each split is not just stored on one machine; it is replicated across multiple zones/regions, and each split forms its own Paxos Group.
- The Leader: One replica in the Paxos group is elected as the leader. All writes for that split must go through the leader.
- The Followers: Replicas that participate in the consensus but don’t initiate writes.
Scaling via Resharding
As a split grows too large (~4 GB) or receives too much traffic, Spanner automatically:
- Splits the Split: Divides the row range into two.
- Migrates Replicas: Moves Paxos group members to different servers to balance CPU and storage.
3.4 Read-Only Transactions vs. Read-Write Transactions
A Principal Engineer must distinguish between these to optimize global performance.
Read-Write Transactions (Pessimistic)
- Uses 2-Phase Locking (2PL) on the leaders.
- Requires 2-Phase Commit (2PC) if the transaction spans multiple splits.
- Performance is limited by cross-region consensus latency.
Read-Only Transactions (Snapshot Reads)
- Lock-Free: Does not take any locks.
- Timestamp-Based: Reads data as of a specific TrueTime timestamp.
- Local Execution: Can be served by the nearest replica (even if it’s not the leader), provided that replica has caught up to the requested timestamp.
Optimization Tip: Use Read-Only transactions for 99% of your dashboard and reporting needs to avoid contention with the write path.
3.5 Spanner Data Modeling
Data modeling in Spanner differs from traditional SQL databases.
Interleaved Tables (Parent-Child)
Spanner supports table interleaving to co-locate related data (DDL sketch after this list):
- Locality: Orders stored physically near their customer
- Performance: Joins between customers and orders are extremely fast
- Atomic Updates: Updates to customer and orders in same transaction
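A minimal interleaving sketch with hypothetical Customers/Orders tables and illustrative instance and database names; the child table’s primary key must begin with the parent’s key.

```bash
# Illustrative DDL: Orders rows are stored under their parent Customer row.
gcloud spanner databases ddl update appdb --instance=prod-spanner --ddl='
CREATE TABLE Customers (
  CustomerId STRING(36) NOT NULL,
  Name       STRING(MAX)
) PRIMARY KEY (CustomerId);

CREATE TABLE Orders (
  CustomerId STRING(36) NOT NULL,
  OrderId    STRING(36) NOT NULL,
  Amount     NUMERIC
) PRIMARY KEY (CustomerId, OrderId),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
'
```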
Choosing Primary Keys
Bad Primary Key Design (Creates Hotspots): monotonically increasing keys such as timestamps or auto-incrementing IDs concentrate all writes on a single split. Prefer UUIDs or hash-prefixed keys, as sketched below.
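A hedged sketch contrasting a hotspot-prone key with a UUID key; table, column, instance, and database names are illustrative.

```bash
# Anti-pattern: a monotonically increasing key sends every insert to one split.
#   CREATE TABLE Events ( EventTimestamp TIMESTAMP NOT NULL, Payload STRING(MAX) )
#   PRIMARY KEY (EventTimestamp);

# Better: lead with a UUID (or a hash-based shard prefix) to spread writes.
gcloud spanner databases ddl update appdb --instance=prod-spanner --ddl='
CREATE TABLE Events (
  EventId        STRING(36) NOT NULL,
  EventTimestamp TIMESTAMP NOT NULL,
  Payload        STRING(MAX)
) PRIMARY KEY (EventId);
'
```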
3.6 The “Mutation” Limit
In Spanner, an INSERT, UPDATE, or DELETE is called a mutation.
Mutation Constraints
- Limit: Maximum 20,000 mutations per commit
- Calculation: (Columns affected) × (Indexes + 1)
- Example: Updating 5 columns on a table with 2 indexes = 5 × (2 + 1) = 15 mutations per row
Bulk Loading Strategies
Wrong Approach (Fails): loading the entire dataset in a single transaction exceeds the 20,000-mutation limit. Right Approach: commit in bounded chunks, or use Dataflow with the SpannerIO connector, as sketched below.
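A rough batching sketch under the assumption that the source data has already been rendered into one multi-row INSERT statement per chunk file, each sized well below the mutation limit; instance and database names are illustrative. For serious volumes, Dataflow with SpannerIO is the better tool.

```bash
# Commit each chunk independently so no single transaction hits the limit.
for chunk in ./chunks/insert_*.sql; do
  gcloud spanner databases execute-sql appdb \
      --instance=prod-spanner \
      --sql="$(cat "$chunk")"
done
```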
3.7 Spanner Performance Optimization
Commit Wait and Latency
Spanner’s commit includes a “commit wait” period based on TrueTime uncertainty:
- Typical commit wait: 5-10ms
- Total write latency: Network + processing + commit wait
- Regional: ~10-20ms
- Multi-regional: ~50-100ms (cross-region consensus)
Query Best Practices
Use Secondary Indexes Wisely: each index multiplies the mutation cost of every write, and a query only benefits when the index covers it (or when you direct the optimizer with a FORCE_INDEX hint), as sketched below.
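A short sketch with a hypothetical Status column on the Orders table; instance, database, and index names are illustrative.

```bash
# Create a secondary index, then force a query to use it.
gcloud spanner databases ddl update appdb --instance=prod-spanner \
    --ddl='CREATE INDEX OrdersByStatus ON Orders(Status)'

gcloud spanner databases execute-sql appdb --instance=prod-spanner \
    --sql='SELECT OrderId FROM Orders@{FORCE_INDEX=OrdersByStatus} WHERE Status = "SHIPPED"'
```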
3.8 Spanner Pricing Model
Spanner pricing is based on:
- Nodes: approximately $3 per node per hour for common multi-regional configurations (regional nodes cost less)
- Storage: $0.30/GB/month
- Network Egress: Standard GCP egress rates
Capacity guidance:
- Start with 3 nodes minimum for production workloads (data is replicated across 3 zones for HA regardless of node count)
- Each node provides ~10,000 QPS reads, ~2,000 QPS writes
- Monitor CPU usage; add nodes when sustained >65%
4. Decision Matrix: Which Database to Choose?
| Scenario | Recommendation | Rationale |
|---|---|---|
| Standard OLTP app, single region | Cloud SQL | Cost-effective, familiar SQL |
| Heavy analytics on transactional data | AlloyDB | Columnar engine for OLAP |
| Global application, low-latency reads | Spanner (Multi-Regional) | External consistency globally |
| Financial system, strong consistency | Spanner or AlloyDB | ACID guarantees, high availability |
| Migrating from MySQL/Postgres | Cloud SQL → AlloyDB | Easiest migration path |
| Migrating from Oracle | AlloyDB | Advanced enterprise features |
| IoT, time-series data | Bigtable (NoSQL) | Better for non-relational data |
| Budget-constrained, moderate load | Cloud SQL | Lowest cost option |
5. Database Migration Service (DMS)
DMS simplifies migrations from on-premises or other clouds to Cloud SQL or AlloyDB.
5.1 Migration Types
- One-Time (Dump and Restore): Full database copy, requires downtime
- Continuous (CDC): Change Data Capture keeps target in sync, minimal downtime
5.2 Migration Steps
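A rough outline of a continuous (CDC) migration driven from the gcloud database-migration surface; all names, hosts, and flag values are illustrative, and the exact flags should be checked against current DMS documentation.

```bash
# 1. Describe the source (self-managed PostgreSQL) as a connection profile.
gcloud database-migration connection-profiles create postgresql src-pg \
    --region=us-central1 \
    --host=10.0.0.5 --port=5432 \
    --username=migration_user --password=CHANGE_ME

# 2. Create the migration job from the source profile to the destination profile.
gcloud database-migration migration-jobs create pg-to-cloudsql \
    --region=us-central1 \
    --type=CONTINUOUS \
    --source=src-pg \
    --destination=prod-db-target

# 3. Start replication, monitor lag, then promote the target at cutover.
gcloud database-migration migration-jobs start pg-to-cloudsql --region=us-central1
gcloud database-migration migration-jobs promote pg-to-cloudsql --region=us-central1
```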
5.3 Migration Best Practices
- Test migrations: Practice on non-production data first
- Validate data: Compare row counts and checksums
- Monitor replication lag: Ensure lag is minimal before cutover
- Plan rollback: Have a rollback plan in case of issues
- Incremental cutover: Migrate non-critical systems first
6. Production Readiness: SRE Best Practices
6.1 Monitoring and Alerting
Key Metrics to Monitor:
- Cloud SQL:
  - CPU utilization (alert at >80%)
  - Disk utilization (alert at >75%)
  - Replication lag (alert at >5 seconds)
  - Connection count (alert at >80% of max)
- Spanner:
  - CPU utilization per node (alert at >65%)
  - Storage per node (alert at >80% of limit)
  - Leader changes (indicates instability)
  - Latency percentiles (p50, p95, p99)
6.2 Disaster Recovery Planning
RTO/RPO Definitions:
- RTO (Recovery Time Objective): How long can you be down?
- RPO (Recovery Point Objective): How much data loss is acceptable?
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Automated Backups | Hours | Minutes | Low |
| HA Configuration | Seconds | Zero | Medium |
| Cross-Region Replica | Minutes | Seconds | High |
| Multi-Regional Spanner | Seconds | Zero | Highest |
6.3 Security Hardening
- Network Security:
  - Use private IP addresses
  - Implement VPC Service Controls
  - Enable Cloud SQL Auth Proxy
- Access Control:
  - Follow principle of least privilege
  - Use IAM for database access
  - Rotate passwords regularly
- Encryption:
  - Enable encryption at rest (default)
  - Use customer-managed encryption keys (CMEK) for compliance
  - Always use TLS for connections
6.4 Cost Optimization
- Right-Sizing:
  - Start small, scale up based on metrics
  - Use committed use discounts for Spanner
  - Consider shared-core instances for dev/test
- Storage Management:
  - Enable storage auto-increase with limits
  - Archive old data to Cloud Storage
  - Use table partitioning for large tables
- Query Optimization:
  - Use Query Insights to identify slow queries
  - Add appropriate indexes
  - Implement connection pooling
7. Interview Preparation: Architectural Deep Dive
1. Q: What is the primary difference between Cloud SQL High Availability (HA) and Cloud Spanner’s availability model?
A: Cloud SQL HA is Regional but restricted to a single primary and one standby. It uses synchronous replication to a regional persistent disk; failover takes ~60 seconds and requires a brief disconnection. Cloud Spanner is Natively Distributed. It doesn’t have a “standby”; every node is active. In multi-regional configs, it offers 99.999% availability with transparent failover and no manual intervention, as data is replicated across multiple regions using Paxos consensus.
2. Q: Explain the “Mutation Limit” in Cloud Spanner and how to design for it.
A: Cloud Spanner limits a single transaction to 20,000 mutations. A mutation is calculated as (Number of Columns Updated) × (Number of Secondary Indexes + 1). If you update 10 columns in a table with 3 indexes, that’s 40 mutations per row. To handle large ingests, you must implement application-level batching (e.g., chunks of 5,000 rows) or use Dataflow with the SpannerIO connector, which handles partitioning and mutation limits automatically.
3. Q: Why does Cloud Spanner require “TrueTime” to maintain external consistency?
A: In a distributed system, clocks on different servers are never perfectly in sync. Spanner uses TrueTime (GPS + Atomic Clocks) to provide a bounded uncertainty interval [earliest, latest]. When a transaction commits, Spanner waits until it is certain that the current time is past the commit’s latest timestamp. This “commit wait” ensures that any subsequent transaction starting anywhere in the world will see a timestamp strictly greater than the previous one, guaranteeing global serializability (External Consistency).
4. Q: When should an architect choose AlloyDB over standard Cloud SQL PostgreSQL?
A: AlloyDB should be chosen when the workload requires Enterprise-grade performance or Hybrid Transactional/Analytical Processing (HTAP). AlloyDB is up to 4x faster for writes and 100x faster for analytical queries than standard Postgres due to its log-structured storage and Columnar Engine. It also provides a 99.99% SLA (compared to 99.95% for Cloud SQL) and sub-second failover times.
5. Q: What is the “Hotspot” problem in Cloud Spanner/Bigtable, and how do you prevent it?
A: Hotspots occur when many concurrent writes hit the same range of keys (e.g., using a sequential timestamp or auto-incrementing ID as a Primary Key). Since Spanner and Bigtable store data lexicographically, these writes all hit the same “split” or node, causing a bottleneck. Prevention: Use a UUID (random string) or a hashed version of the key. If a sequential ID is required, prefix it with a shard ID (e.g., hash(id) % 100).
Lab: Production-Grade Cloud SQL Deployment
This lab walks through deploying a production-ready Cloud SQL instance with HA, backups, monitoring, and security.
Lab Architecture
- VPC with private subnet
- Cloud SQL with private IP
- HA enabled (regional persistent disk)
- Automated backups and PITR
- Read replica in different region
- Cloud SQL Auth Proxy for secure access
- Monitoring and alerting
Step 1: Network Setup
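A sketch of the network prerequisites; VPC, subnet, range names, and CIDR blocks are illustrative.

```bash
# Custom VPC, subnet, and Private Services Access for the database tier.
gcloud compute networks create prod-vpc --subnet-mode=custom
gcloud compute networks subnets create prod-app-subnet \
    --network=prod-vpc --region=us-central1 --range=10.10.0.0/24

gcloud compute addresses create prod-db-range \
    --global --purpose=VPC_PEERING --prefix-length=20 --network=prod-vpc
gcloud services vpc-peerings connect \
    --service=servicenetworking.googleapis.com \
    --ranges=prod-db-range --network=prod-vpc
```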
Step 2: Create Production Cloud SQL Instance
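A sketch of creating the HA, private-IP-only instance with PITR and extended backup retention; project, instance, tier, and retention values are illustrative.

```bash
# HA instance with private IP only, PITR, and 30-day backup retention.
gcloud sql instances create prod-db \
    --database-version=POSTGRES_15 \
    --tier=db-custom-4-16384 \
    --region=us-central1 \
    --availability-type=REGIONAL \
    --network=projects/my-project/global/networks/prod-vpc \
    --no-assign-ip \
    --backup-start-time=03:00 \
    --enable-point-in-time-recovery \
    --retained-backups-count=30 \
    --storage-auto-increase
```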
Step 3: Create Read Replica
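A sketch of the cross-region replica for DR and read scaling; names, tier, and region are illustrative.

```bash
# Cross-region replica of the primary.
gcloud sql instances create prod-db-replica-east \
    --master-instance-name=prod-db \
    --region=us-east1 \
    --tier=db-custom-4-16384
```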
Step 4: Configure Security
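A sketch of least-privilege access plus enforced TLS; project ID and service account names are illustrative.

```bash
# Least-privilege service account for the application.
gcloud iam service-accounts create app-db-client \
    --display-name="App DB client"
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:app-db-client@my-project.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"

# Require SSL/TLS for all connections to the instance.
gcloud sql instances patch prod-db --require-ssl
```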
Step 5: Set Up Monitoring
Step 6: Test Connectivity
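A quick connectivity check through the Auth Proxy from a machine with Application Default Credentials; names are illustrative.

```bash
# Tunnel through the Auth Proxy and run a trivial query.
./cloud-sql-proxy --port 5432 my-project:us-central1:prod-db &
psql "host=127.0.0.1 port=5432 user=app_user dbname=appdb" -c 'SELECT version();'
```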
Step 7: Validate HA and Backup
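A sketch of exercising failover and verifying backups; instance names and the PITR timestamp are illustrative.

```bash
# Trigger a manual failover to the standby and verify the app reconnects.
gcloud sql instances failover prod-db

# Confirm automated backups are landing and rehearse a PITR clone.
gcloud sql backups list --instance=prod-db
gcloud sql instances clone prod-db prod-db-pitr-test \
    --point-in-time='2024-05-01T03:15:00Z'
```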
Summary: Database Selection Framework
Choose Cloud SQL when:
- Migrating existing MySQL/PostgreSQL/SQL Server apps
- Single-region deployment sufficient
- Budget-conscious choice
- Standard OLTP workload
Choose AlloyDB when:
- Need PostgreSQL with maximum performance
- Hybrid OLTP/OLAP workloads
- Enterprise-grade SLAs required
- Real-time analytics on transactional data
Choose Cloud Spanner when:
- Global application with users worldwide
- Need strong consistency across regions
- Scale beyond single database limits
- Mission-critical, always-available system