Chapter 6: Data at Scale - Cloud Storage
Google Cloud Storage (GCS) is a world-class object storage service designed for global consistency, massive scalability, and extreme durability. While other clouds offer object storage, GCS distinguishes itself with its unified API, instant retrieval across all storage classes, and its foundation on the Colossus file system.

1. The Foundation: Colossus and Erasure Coding
GCS isn’t just “hard drives in a rack.” It sits on top of Colossus, Google’s cluster-level file system (the successor to GFS).

1.1 Erasure Coding vs. Traditional RAID
Traditional RAID (like RAID 5/6) has a “rebuild” problem: if a disk fails, the remaining disks are hammered to reconstruct the lost data, increasing the risk of a second failure. Colossus instead uses Reed-Solomon (RS) erasure coding (typically 14 data chunks + 2 parity chunks):
- Chunk Distribution: A file is broken into 14 data chunks, and Google calculates 2 extra parity chunks from them.
- The Magic: Any 14 of those 16 pieces can reconstruct the original file.
- Distributed Striping: These chunks are spread across many different physical racks. If an entire rack loses power, the file is still readable with no added latency, because the other chunks (more than the 14 needed) live elsewhere.
- Infinite Scale: Because there is no central “controller” (unlike a NAS/SAN), you can read/write as fast as your network allows.
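As a toy illustration of the erasure-coding idea (not Google's actual implementation), here is the simplest special case: k data chunks plus one XOR parity chunk, where any k of the k+1 chunks can rebuild the file. Colossus uses the more general Reed-Solomon scheme described above, which tolerates multiple simultaneous losses.

```python
# Toy erasure-coding demo: k data chunks + 1 XOR parity chunk.
# XOR parity tolerates the loss of any ONE chunk; Reed-Solomon (14+2)
# generalizes this to tolerate the loss of any two.

def encode(data: bytes, k: int):
    """Split data into k equal chunks and append one XOR parity chunk."""
    data += b"\x00" * ((-len(data)) % k)       # pad to a multiple of k
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, c))
    return chunks + [parity]

def reconstruct(chunks, lost: int):
    """Rebuild the chunk at index `lost` by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    rebuilt = survivors[0]
    for c in survivors[1:]:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, c))
    return rebuilt

chunks = encode(b"object data stored in Colossus", k=4)
rebuilt = reconstruct(chunks, lost=2)   # pretend the rack holding chunk 2 died
assert rebuilt == chunks[2]
```

Because parity is the XOR of all data chunks, XOR-ing any k survivors cancels everything except the missing chunk, which is exactly the property that makes rack-level failures invisible to readers.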
1.2 Strong Global Consistency
Unlike other clouds that historically used “eventual consistency,” GCS provides strong consistency for all operations.
- The “Read-after-Write” Promise: As soon as you receive a 200 OK for an upload, any client in the world requesting that object sees the new version immediately.
- Implementation: GCS uses a distributed consensus protocol (in the Paxos/Raft family) to ensure that metadata updates are synchronized globally before the write is acknowledged.
2. Storage Classes: One API, Many Costs
One of GCS’s greatest strengths is that all storage classes use the same API and offer millisecond first-byte latency. There is no “Glacier-style” retrieval delay in GCP.

| Class | Use Case | Durability | Min Duration |
|---|---|---|---|
| Standard | Hot data, websites, mobile apps, streaming. | 11 nines | None |
| Nearline | Monthly backups, long-tail media. | 11 nines | 30 Days |
| Coldline | Quarterly backups, disaster recovery. | 11 nines | 90 Days |
| Archive | Yearly regulatory archives, “never delete” data. | 11 nines | 365 Days |
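The minimum-duration column drives early-deletion charges: deleting a Nearline object after 5 days still bills you for the full 30 days. A minimal sketch of that billing rule (class names and durations taken from the table above; this is an illustration, not a billing API):

```python
# Minimum storage durations from the table above, in days.
MIN_DURATION = {"Standard": 0, "Nearline": 30, "Coldline": 90, "Archive": 365}

def billed_days(storage_class: str, actual_days: int) -> int:
    """Days you pay for: deleting early still incurs the class minimum."""
    return max(actual_days, MIN_DURATION[storage_class])

assert billed_days("Standard", 5) == 5      # no minimum duration
assert billed_days("Nearline", 5) == 30     # early-deletion charge applies
assert billed_days("Archive", 400) == 400   # past the minimum, pay actual use
```

This is why colder classes are only cheaper if the data actually stays cold: churning objects through Archive can cost more than leaving them in Standard.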
3. Location Types and Availability
Where you put your data affects both cost and availability.
- Regional: Data is stored in a single region (e.g., us-central1). Lowest cost, but vulnerable to a full-region outage.
- Dual-Region: Data is replicated across two specific regions (e.g., us-east1 and us-west1).
  - Turbo Replication: An optional feature that replicates newly written objects to the second region within 15 minutes (a 15-minute RPO).
Monitoring Turbo Replication
For mission-critical global apps, you should monitor replication health.
- Metric: storage.googleapis.com/replication/replication_delay
- Alerting: Set an alert in Cloud Monitoring if the delay exceeds 10 minutes. This gives you a 5-minute buffer before the 15-minute RPO target is breached.
- Multi-Region: Data is spread across a large geographic area (e.g., “US”, “EU”). Highest availability and best performance for global users.
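The alerting setup described above can be expressed as a Cloud Monitoring alert policy. A hedged sketch in the AlertPolicy JSON shape (the filter string, resource type, and aggregation settings are assumptions; verify the metric’s unit and resource type before deploying):

```json
{
  "displayName": "Turbo Replication delay > 10 min",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "replication_delay above threshold",
      "conditionThreshold": {
        "filter": "metric.type=\"storage.googleapis.com/replication/replication_delay\" resource.type=\"gcs_bucket\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 600,
        "duration": "300s",
        "aggregations": [
          {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MAX"}
        ]
      }
    }
  ]
}
```

The 600-second threshold encodes the 10-minute alert line, leaving the 5-minute buffer before the 15-minute RPO target.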
4. Security Architecture
Accessing data in GCS is managed through several layers of security.

Uniform vs. Fine-Grained Access
- Uniform (Recommended): Uses IAM roles at the bucket level. It simplifies auditing and prevents “hidden” public files.
- Fine-Grained: Uses ACLs (Access Control Lists) on individual objects. Use this only if you need different permissions for every file in a bucket.
Encryption
- Default: Everything is encrypted at rest using Google-managed keys.
- CMEK (Customer-Managed Encryption Keys): You manage the keys in Cloud KMS.
- CSEK (Customer-Supplied Encryption Keys): You provide the raw key for every request. Google never stores the key on disk.
Signed URLs and Cookies
Provide time-limited access (e.g., 15 minutes) to an object for a user who does not have a Google account. Perfect for profile pictures or temporary download links.

5. Lifecycle Management and Versioning
5.1 Lifecycle Mechanics: The Principal’s View
Lifecycle rules are a background orchestration service. Understanding their asynchronous nature is key for SREs.
- The 24-Hour Cycle: GCS evaluates lifecycle rules roughly once every 24 hours. If you upload a file that should be deleted after 1 day, it might take up to 48 hours to actually disappear.
- Propagation Delay: After a rule is applied, it can take up to 24 hours for the changes to take effect across the global fleet.
- Action Precedence: If multiple rules match an object at the same time, GCS follows a specific precedence: Delete takes precedence over SetStorageClass.
- Billing: You are charged for the storage class the object is currently in until the lifecycle action is fully finalized by the background worker.
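The mechanics above can be captured in a lifecycle configuration file, applied with `gcloud storage buckets update gs://BUCKET --lifecycle-file=lifecycle.json` (the bucket name is hypothetical; the JSON shape is the standard GCS lifecycle syntax):

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
```

Remember the 24-hour evaluation cycle: objects reaching age 90 or 365 will transition or disappear within roughly a day of crossing the threshold, not at the exact moment.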
5.2 Object Holds: Dynamic Retention
Beyond static policies, GCS supports object holds for dynamic legal or administrative control.
- Event-Based Hold: Used when you don’t know the deletion date in advance (e.g., “keep these records until the contract ends”). Once the hold is released, the retention period starts counting.
- Temporary Hold: A simple “do not delete” flag used for legal discovery or troubleshooting.
- Comparison: While a Retention Policy applies to the whole bucket, a Hold is applied to individual objects.
5.3 Object Versioning
When enabled, GCS keeps the history of an object. If you overwrite cat.jpg, GCS stores the old version with a unique generation number.
- Use Case: Recovery from accidental deletion or malicious ransomware.
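The generation-number behavior can be modeled with a small in-memory sketch (a toy stand-in for GCS, not client-library code):

```python
# Toy model of object versioning: every overwrite of a name gets a new
# "generation" number and older generations are kept, as GCS does.
import itertools

class VersionedBucket:
    def __init__(self):
        self._gen = itertools.count(1)
        self._objects = {}              # name -> list of (generation, data)

    def upload(self, name, data):
        """Store a new generation; previous generations remain readable."""
        gen = next(self._gen)
        self._objects.setdefault(name, []).append((gen, data))
        return gen

    def read(self, name, generation=None):
        """Read the live (latest) version, or a specific old generation."""
        versions = self._objects[name]
        if generation is None:
            return versions[-1][1]
        return dict(versions)[generation]

bucket = VersionedBucket()
g1 = bucket.upload("cat.jpg", b"old cat")
bucket.upload("cat.jpg", b"new cat")
assert bucket.read("cat.jpg") == b"new cat"                  # live object
assert bucket.read("cat.jpg", generation=g1) == b"old cat"   # recovered
```

The recovery use case maps directly onto the second read: after a ransomware overwrite, the attacker’s data is merely the newest generation, and the original is still addressable by its generation number.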
6. Performance Optimization
Parallel Composite Uploads
For very large files (GBs/TBs), the gcloud storage tool can split the file into chunks (up to 32 components), upload them in parallel, and then instruct GCS to “compose” them into a single object on the server side.
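The split-upload-compose flow can be sketched locally (a toy stand-in for the bucket; real tools use much larger chunks and actual GCS upload and compose calls):

```python
# Toy sketch of a parallel composite upload: split locally, "upload" the
# parts concurrently, then "compose" them server-side by concatenation.
# GCS's compose operation accepts at most 32 source components.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024          # real tools use much larger chunks (many MBs)
store = {}            # stand-in for the destination bucket

def upload_part(args):
    name, data = args
    store[name] = data                     # one PUT per component
    return name

def composite_upload(data: bytes, dest: str):
    parts = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    assert len(parts) <= 32, "compose is limited to 32 components"
    names = [f"{dest}.part{i}" for i in range(len(parts))]
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(upload_part, zip(names, parts)))   # parallel uploads
    store[dest] = b"".join(store[n] for n in names)      # server-side compose
    for n in names:                                      # clean up components
        del store[n]

blob = bytes(range(256)) * 40              # 10,240 bytes -> 10 parts
composite_upload(blob, "big-file.bin")
assert store["big-file.bin"] == blob
```

The win comes from the parallel uploads saturating available bandwidth; the compose step is a cheap metadata operation on the server, not a re-upload.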
gcloud storage vs gsutil
Google recently released the Go-based gcloud storage component. It is significantly faster than the Python-based gsutil because it uses a more efficient threading model and multi-stream uploads.
7. Data Transfer Services
- Storage Transfer Service (STS): A managed service to pull data from AWS S3, Azure Blob Storage, or on-prem servers into GCS.
- Transfer Appliance: A high-capacity server Google ships to you. You fill it with PBs of data and ship it back for upload.
8. Retention Policies and Bucket Lock (WORM)
For legal and compliance reasons (e.g., SEC 17a-4, HIPAA), you may need WORM (Write Once, Read Many) storage.

8.1 The “Immutable” Contract
- Retention Policy: Define a period (e.g., 7 years) during which objects cannot be deleted or modified.
- Bucket Lock: This is the “Nuclear Option.” Once you lock a retention policy, it can never be removed, not even by a Project Owner or Google Support.
Principal Warning: Bucket Lock is irreversible. If you lock a bucket containing 100TB with a 10-year retention, you are legally and financially committed to paying for that storage for the next decade. There is no “undo” button.
8.2 Compliance Mode vs. Governance Mode
- Governance Mode (holds): Can be removed by users with specific permissions.
- Compliance Mode (Bucket Lock): Cannot be removed by anyone once locked.
9. Automation and Integration
9.1 Pub/Sub Notifications for GCS
You can configure GCS to send a message to a Pub/Sub topic whenever an object is created, deleted, or updated.
- Use Cases: Triggering a Cloud Function to resize an image, starting a Dataflow job when a new CSV arrives, or auditing sensitive file deletions.
9.2 Abort Incomplete Multipart Uploads
A common “hidden cost” in GCS is incomplete uploads. If an upload fails halfway, the already-uploaded parts stay in your bucket and you are charged for them!
- The Fix: Add a lifecycle rule to abort incomplete multipart uploads after 7 days. This ensures that failed uploads don’t leak money.
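The fix is a one-rule lifecycle configuration; `AbortIncompleteMultipartUpload` is the GCS lifecycle action name for this cleanup:

```json
{
  "rule": [
    {
      "action": {"type": "AbortIncompleteMultipartUpload"},
      "condition": {"age": 7}
    }
  ]
}
```

Here `age` counts from when the multipart upload was initiated, so any upload still unfinished after 7 days is discarded along with its orphaned parts.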
9.3 External Tables in BigQuery
GCS is the primary storage for data lakes. You can query files in GCS directly from BigQuery without importing them.
- Formats: CSV, JSON, Avro, Parquet, ORC.
- SRE Tip: Use Parquet or ORC for production data lakes; they are columnar and significantly faster and cheaper to query than JSON or CSV. (Avro is a compact row-oriented alternative for streaming-style workloads.)
10. Advanced Security: CMEK and IAM Conditions
10.1 Customer-Managed Encryption Keys (CMEK)
While GCS encrypts everything by default, CMEK puts you in control of the key lifecycle in Cloud KMS.
- Key Rotation: You can rotate the key in KMS without re-encrypting existing objects (new objects and versions use the new key).
- Access Control: You can disable the key in KMS to instantly revoke access to all data in the bucket, even if someone has IAM permissions to read the objects.
10.2 IAM Conditions for GCS
You can grant access based on object attributes like name prefixes.
- Example: Grant a user access only to objects starting with user-data/alice/.
- Constraint: This is only possible if you use Uniform Bucket-Level Access.
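An IAM policy binding carrying such a condition might look like this (the bucket name and user are hypothetical; `resource.name.startsWith` is the CEL function used to match object prefixes):

```json
{
  "role": "roles/storage.objectViewer",
  "members": ["user:alice@example.com"],
  "condition": {
    "title": "alice-prefix-only",
    "expression": "resource.name.startsWith(\"projects/_/buckets/my-bucket/objects/user-data/alice/\")"
  }
}
```

Note that the condition matches the full resource name (`projects/_/buckets/BUCKET/objects/OBJECT`), not just the object path, which is why the bucket must appear in the expression.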
11. Implementation: The “Storage Expert” Lab
Setting up a Secure, Multi-Region Bucket
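One way to sketch this lab as infrastructure-as-code, assuming the Terraform google provider (the bucket name and lifecycle values are illustrative, not prescriptive):

```hcl
# Sketch: secure, multi-region GCS bucket. Names and ages are examples.
resource "google_storage_bucket" "secure" {
  name          = "my-secure-bucket"    # hypothetical; must be globally unique
  location      = "US"                  # multi-region
  storage_class = "STANDARD"

  uniform_bucket_level_access = true    # IAM only, no per-object ACLs
  public_access_prevention    = "enforced"

  versioning {
    enabled = true                      # recover from accidental deletes
  }

  lifecycle_rule {
    condition { age = 365 }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }
}
```

This bakes the chapter’s recommendations (uniform access, public access prevention, versioning, lifecycle tiering) into a single reviewable artifact instead of console clicks.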
Pro-Tip: Public Access Prevention
In the GCS console, always enable Public Access Prevention (PAP) at the project or bucket level. This acts as a global “safety switch” that prevents anyone from accidentally making a bucket public, even if their IAM permissions would otherwise allow it.

Interview Preparation
Q1: What is 'Colossus' and how does it provide 11 nines of durability for Cloud Storage?
Answer: Colossus is Google’s next-generation cluster file system. It provides high durability using Reed-Solomon Erasure Coding.
- Mechanism: When an object is uploaded, Colossus breaks it into chunks and calculates parity blocks. These chunks are distributed across thousands of independent disks, power domains, and failure zones.
- Self-Healing: Colossus continuously monitors the health of all disks. If a disk fails, it immediately reconstructs the missing chunks from the parity data onto fresh disks in the background.
- Result: This massively distributed architecture ensures that even the simultaneous loss of several disks or an entire equipment rack will not result in data loss.
Q2: Explain the 'Unified API' concept in GCS. How does it differ from other cloud providers?
Answer: GCS uses a single, consistent API for all storage classes (Standard, Nearline, Coldline, Archive).
- Latency: Unlike AWS S3 Glacier which requires a “thawing” period (minutes to hours) to retrieve data, GCS Archive class provides millisecond latency for the first byte. The retrieval is instant.
- Tooling: You use the same gcloud storage commands or client libraries regardless of whether the data is hot or cold.
- Abstraction: This allows developers to write code without complex “restore” workflows, simplifying application logic for long-term data retention.
Q3: When would you use 'Uniform Bucket-Level Access' vs. fine-grained ACLs?
Answer:
- Uniform (Recommended): Use this for almost all production buckets. It enforces security using IAM roles only. This simplifies auditing, prevents “hidden” public files at the object level, and integrates with features like VPC Service Controls.
- ACLs: Use these only when you have a legacy requirement where different users need different permissions for every single file within the same bucket (e.g., a shared drive where some files are private and others are public).
Q4: What is 'Turbo Replication' and how do you monitor its health?
Answer: Turbo Replication is a premium feature for Dual-Region buckets that replicates newly written objects between the two regions within 15 minutes (a 15-minute RPO).
- Monitoring: I would monitor the storage.googleapis.com/replication/replication_delay metric in Cloud Monitoring.
- Alerting: I would set an alert to trigger if the delay exceeds 10 minutes, giving the SRE team a buffer to investigate before the 15-minute RPO is breached.
Q5: Explain the 'Bucket Lock' (WORM) feature and its regulatory importance.
Answer: Bucket Lock allows you to “lock” a Retention Policy on a bucket.
- WORM (Write Once, Read Many): Once a policy is locked, objects cannot be deleted or modified until they reach the specified age (e.g., 7 years).
- Immutability: Crucially, once locked, the policy cannot be removed or shortened, even by a Project Owner or Google Support.
- Compliance: This is essential for meeting regulatory requirements like FINRA or HIPAA, where certain data must be preserved for audit purposes without any possibility of tampering or deletion.