Chapter 13: Operations and Visibility - Google Cloud Observability
Observability is the ability to understand the internal state of your system by examining its external outputs. In the cloud, this means more than just “checking if a server is up.” Google Cloud Observability (formerly Stackdriver) provides a unified suite for Monitoring, Logging, and Application Performance Management (APM), built on the same principles Google uses to maintain its own global services.
1. Cloud Monitoring: The Pulse of Your System
Cloud Monitoring collects metrics, events, and metadata from GCP, AWS, and even on-premises infrastructure.
The Ops Agent
To get deep visibility into Compute Engine VMs (like disk usage, memory, and application logs), you must install the Ops Agent. It combines the power of Fluent Bit (for logs) and OpenTelemetry (for metrics) into a single, high-performance binary.
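Installation is a short, scripted process: download Google's repository script and run it with the install flag. The commands below follow the public installation flow; the script URL and config path are current as of this writing and may change.

```bash
# Download and run Google's Ops Agent repository script, then install the agent.
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Custom logging/metrics pipelines are declared in this YAML file; restart to apply.
sudo vim /etc/google-cloud-ops-agent/config.yaml
sudo systemctl restart google-cloud-ops-agent
```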
Monitoring Query Language (MQL) for Cross-Project Analysis
For complex analysis, Cloud Monitoring offers MQL. It allows you to perform advanced operations like:
- Ratio calculation: “What is the ratio of 5xx errors to total requests?”
- Forecasting: “When will my disk reach 90% capacity based on current growth?”
- Aggregation: Grouping metrics by labels like `region` or `version`.
- Cross-project analysis: Scoping the `fetch` command with `resource.labels.project_id` to pull metrics from multiple projects into a single query.
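As a sketch of the ratio pattern, the query below computes the fraction of load-balancer requests that returned a 5xx response, per backend target. It assumes the public `loadbalancing.googleapis.com` request-count metric; verify the exact resource and label names against your own metric descriptors before relying on it.

```
fetch https_lb_rule::loadbalancing.googleapis.com/https/request_count
| { filter metric.response_code_class = 500   # numerator: 5xx responses only
  ; ident }                                   # denominator: all requests
| group_by [resource.backend_target_name]
| every 1m
| ratio
```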
Managed Service for Prometheus (GMP)
If you already use Prometheus for Kubernetes monitoring, GMP provides a fully managed, globally scalable backend. You can keep your Prometheus configurations and Grafana dashboards but offload the storage and scaling to Google.
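With managed collection enabled on a GKE cluster, scrape targets are declared with a `PodMonitoring` resource instead of a `prometheus.yml` scrape config. A minimal sketch follows; the metadata, labels, and port name are placeholders.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: frontend-monitoring        # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      app: frontend                # pods to scrape (placeholder label)
  endpoints:
    - port: metrics                # named container port exposing /metrics
      interval: 30s
```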
2. Cloud Logging: The Ledger of Events
Cloud Logging is a real-time log management service that can ingest terabytes of logs per second.
Log Analytics (SQL Support) with Cross-Source Joins
One of the most powerful recent additions is Log Analytics. You can now query your logs using standard BigQuery SQL. This allows you to join log data with other datasets or perform complex group-by operations that were previously impossible in a standard log explorer.
Advanced SQL Examples:
- Joining Logs with Metrics: Correlate application errors in logs with CPU spikes in metrics.
- Cross-Source Joins: Join Cloud Audit logs with Application logs to see who made a change that caused an error.
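A minimal sketch of the SQL style, assuming the `_Default` log bucket has been upgraded to Log Analytics (the project ID is a placeholder): find the top 10 client IPs generating 5xx responses in the last hour.

```sql
SELECT
  http_request.remote_ip AS client_ip,
  COUNT(*) AS error_count
FROM `my-project.global._Default._AllLogs`   -- placeholder project / default bucket view
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND http_request.status >= 500
GROUP BY client_ip
ORDER BY error_count DESC
LIMIT 10;
```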
The Log Router and Sinks
Every log entry passes through the Log Router, where you can configure:
- Exclusion Filters: Drop logs you don’t need (to save money).
- Log Sinks: Export logs to:
  - Cloud Storage: For long-term, low-cost compliance (WORM support).
  - BigQuery: For deep analytical processing.
  - Pub/Sub: For real-time processing by a Cloud Function or external tool (like Splunk or Datadog).
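As an illustration, a sink that routes only warning-and-above Compute Engine logs to BigQuery can be created with `gcloud`; the project, dataset, and filter below are placeholders.

```bash
gcloud logging sinks create warn-logs-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/ops_logs \
  --log-filter='resource.type="gce_instance" AND severity>=WARNING'

# The command prints the sink's writer identity (a service account); grant it
# BigQuery Data Editor on the dataset so the exported entries can be written.
```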
3. APM: Distributed Tracing and Profiling
When an application is slow, you need to know where it’s slow. APM tools provide the “why.”
Cloud Trace
Cloud Trace is a distributed tracing system. It tracks a single request as it moves through your frontend, multiple microservices, and databases.
- Latency Distributions: Visualize how latency varies over time.
- Analysis Reports: Automatically identify if a new deployment caused a latency regression.
Cloud Profiler
Cloud Profiler is a continuous profiling tool that analyzes the CPU and memory consumption of your code in production.
- Low Overhead: It uses statistical sampling to keep overhead below 0.5%.
- Flame Graphs: Identify exactly which function in your code is consuming the most resources, helping you optimize costs and performance without guessing.
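Enabling the profiler is typically a few lines at process startup. A minimal Python sketch using the `google-cloud-profiler` package; the service name and version are placeholders.

```python
import googlecloudprofiler

try:
    googlecloudprofiler.start(
        service="checkout-api",      # hypothetical service name shown in the Profiler UI
        service_version="1.2.0",     # lets you compare flame graphs across releases
        verbose=1,
    )
except (ValueError, NotImplementedError) as exc:
    # Profiling is best-effort: never crash the application if the agent cannot start.
    print(f"Profiler not started: {exc}")
```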
4. The SRE Framework: SLIs, SLOs, and SLAs
A Principal Engineer doesn’t just build systems; they define the standard for their survival. GCP provides a native framework for managing reliability.
4.1 Definitions: The Reliability Hierarchy
- SLI (Service Level Indicator): A quantitative measure of some aspect of the level of service.
- Example: “Percentage of successful HTTP requests.”
- SLO (Service Level Objective): A target value or range of values for a service level that is measured by an SLI.
- Example: “99.9% of HTTP requests must return a 2xx status code over 30 days.”
- SLA (Service Level Agreement): A legal contract between you and your customers that defines the consequences of missing the SLO (usually financial credits).
- Example: “If uptime falls below 99.9%, we refund 10% of the monthly bill.”
4.2 Implementing SLOs in GCP
To implement an SLO in Cloud Monitoring, you follow the Compliance Period model:
- Metric Selection: Use an existing metric (e.g., Load Balancer Latency) or a custom MQL-based SLI.
- The Good Total Ratio: Define what counts as “good” vs “total.”
Ratio = (Good Events) / (Total Events)
- Error Budgets: The inverse of your SLO. If your SLO is 99.9%, your error budget is 0.1%.
- Burn Rate: The speed at which you are consuming your budget. An alert should trigger if your burn rate suggests you will exhaust your budget before the end of the 30-day period.
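The arithmetic is simple enough to sanity-check by hand; the sketch below works through a 99.9% SLO over a 30-day window and the commonly used 14.4x fast-burn threshold.

```python
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60              # 43,200 minutes in a 30-day window

error_budget = 1 - SLO                     # 0.001 -> 0.1% of the window may fail
budget_minutes = WINDOW_MINUTES * error_budget
print(budget_minutes)                      # 43.2 minutes of total downtime allowed

# Burn rate = observed error rate / budgeted error rate.
observed_error_rate = 0.0144               # e.g., 1.44% of requests currently failing
burn_rate = observed_error_rate / error_budget
print(round(burn_rate, 1))                 # 14.4

# At a sustained burn rate of 14.4 the whole 30-day budget is gone in ~50 hours,
# which is why "14.4x over 1 hour" is a common fast-burn alerting threshold.
print(WINDOW_MINUTES / burn_rate / 60)     # ~50 hours
```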
4.3 Reliability Math: Allowed Downtime
| SLO | Allowed Downtime (Per Month) | Allowed Downtime (Per Year) |
|---|---|---|
| 99% | 7.3 hours | 3.65 days |
| 99.9% | 43.8 minutes | 8.77 hours |
| 99.99% | 4.38 minutes | 52.56 minutes |
Principal Note: 100% is never the target. A 100% SLO leaves zero error budget, which usually means you aren’t shipping new features or taking any useful risk. Use your Error Budget to justify risky deployments or infrastructure changes.
5. Advanced Log Analytics: SRE Forensic SQL
Log Analytics allows you to treat logs as a structured BigQuery dataset. This is the primary tool for “Post-Mortem” analysis.
The “Anatomy of an Incident” Query
Use SQL like the following to correlate multiple log streams during a production outage.
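A minimal sketch, assuming the `_Default` bucket has been upgraded to Log Analytics; the project ID, incident window, and payload fields are placeholders. It pairs each Cloud Audit Log change with the application errors that appeared within the following 15 minutes.

```sql
WITH changes AS (
  SELECT
    timestamp AS change_time,
    proto_payload.audit_log.authentication_info.principal_email AS actor,
    proto_payload.audit_log.method_name AS method
  FROM `my-project.global._Default._AllLogs`
  WHERE log_name LIKE '%cloudaudit.googleapis.com%'
    AND timestamp BETWEEN TIMESTAMP('2024-05-01 10:00:00') AND TIMESTAMP('2024-05-01 12:00:00')
),
errors AS (
  SELECT
    timestamp AS error_time,
    severity,
    JSON_VALUE(json_payload.message) AS message
  FROM `my-project.global._Default._AllLogs`
  WHERE severity = 'ERROR'
    AND timestamp BETWEEN TIMESTAMP('2024-05-01 10:00:00') AND TIMESTAMP('2024-05-01 12:00:00')
)
SELECT c.change_time, c.actor, c.method, e.error_time, e.message
FROM changes AS c
JOIN errors AS e
  ON e.error_time BETWEEN c.change_time AND TIMESTAMP_ADD(c.change_time, INTERVAL 15 MINUTE)
ORDER BY c.change_time, e.error_time;
```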
6. Advanced Monitoring: Metric Scopes and Dashboards as Code
6.1 Metric Scopes (Multi-Project Monitoring)
In a large organization, your resources are spread across hundreds of projects. Metric Scopes allow you to monitor multiple “Member Projects” from a single “Scoping Project.”
- Centralized Ops: Your SRE team can create one dashboard that pulls CPU metrics from the `payments`, `identity`, and `frontend` projects simultaneously.
- Limit: A single scoping project can monitor up to 375 member projects.
6.2 Dashboards as Code
Manual dashboarding is an anti-pattern. You should define your dashboards in JSON or Terraform to ensure reproducibility and version control.
- GCP Dashboard JSON: Every dashboard can be exported as a JSON structure.
- Terraform Resource: `google_monitoring_dashboard` allows you to manage these JSON structures as code.
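A trimmed sketch of the Terraform pattern: the resource type and its `dashboard_json` argument are real, but the dashboard layout itself is a minimal, hypothetical example.

```hcl
resource "google_monitoring_dashboard" "api_overview" {
  dashboard_json = jsonencode({
    displayName = "API Overview"            # hypothetical dashboard
    gridLayout = {
      widgets = [
        {
          title = "HTTPS request count"
          xyChart = {
            dataSets = [{
              timeSeriesQuery = {
                timeSeriesFilter = {
                  filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
                }
              }
            }]
          }
        }
      ]
    }
  })
}
```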
6.3 Synthetic Monitoring
Beyond simple “ping” checks, Synthetic Monitoring allows you to run scripted tests (using Mocha/Puppeteer) that simulate real user journeys.
- Example: “Log in -> Add to cart -> Checkout.”
- Benefit: Catch logic errors in your frontend before users do.
7. Interview Preparation
1. Q: What is the “Golden Signals” framework in SRE, and how do you monitor them in GCP?
A: The Golden Signals are Latency, Traffic, Errors, and Saturation.
- Latency/Traffic/Errors: Monitored via Cloud Monitoring metrics (e.g., `loadbalancing.googleapis.com/https/request_count`).
- Saturation: Monitored via resource metrics (e.g., `compute.googleapis.com/instance/cpu/utilization`).
You use these to define SLIs (Service Level Indicators), which then form the basis of your SLOs (Service Level Objectives).
3. Q: How does Log Analytics change the way you investigate logs compared with the standard Logs Explorer?
A: It lets you query logs with standard BigQuery SQL: JOIN (correlating Audit logs with App logs), GROUP BY (finding the top 10 error-causing IP addresses), and window functions, turning logs into a structured analytical dataset.
4. Q: When would you use MQL (Monitoring Query Language) over the standard dashboard UI?
A: MQL is required for complex calculations that the GUI cannot perform. Examples include:
- Ratios: Calculating the error rate by dividing `5xx_count` by `total_request_count`.
- Cross-Project Aggregation: Summarizing CPU usage across 50 different projects into one chart.
- Advanced Filtering: Using regex or multiple join conditions across different metric types.
5. Q: What is the difference between Cloud Trace and Cloud Profiler?
A: Both are APM tools, but they answer different questions:
- Cloud Trace: A distributed tracing tool that follows a single request as it moves through multiple microservices, identifying which service is causing a latency bottleneck.
- Cloud Profiler: A continuous profiling tool that looks inside the code to see which specific function or line of code is consuming the most CPU or memory, helping developers optimize performance and reduce compute costs.