
Chapter 13: Operations and Visibility - Google Cloud Observability

Observability is the ability to understand the internal state of your system by examining its external outputs. In the cloud, this means more than just “checking if a server is up.” Google Cloud Observability (formerly Stackdriver) provides a unified suite for Monitoring, Logging, and Application Performance Management (APM), built on the same principles Google uses to maintain its own global services.

1. Cloud Monitoring: The Pulse of Your System

Cloud Monitoring collects metrics, events, and metadata from GCP, AWS, and even on-premises infrastructure.

The Ops Agent

To get deep visibility into Compute Engine VMs (like disk usage, memory, and application logs), you must install the Ops Agent. It combines the power of Fluent Bit (for logs) and OpenTelemetry (for metrics) into a single, high-performance binary.
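Google publishes an installation script for the agent. A minimal sketch for a single Debian or Ubuntu VM (run over SSH) looks like the following; at fleet scale you would normally roll the agent out with an agent policy or configuration management instead.

# Download and run the Ops Agent installation script
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Confirm the agent is running
sudo systemctl status google-cloud-ops-agent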

Monitoring Query Language (MQL) for Cross-Project Analysis

For complex analysis, Cloud Monitoring offers MQL. It allows you to perform advanced operations like:
  • Ratio calculation: “What is the ratio of 5xx errors to total requests?”
  • Forecasting: “When will my disk reach 90% capacity based on current growth?”
  • Aggregation: Grouping metrics by labels like region or version.
Cross-Project Queries: because a metrics scope can include many member projects (see Section 6.1), a single fetch can return data from all of them; group by resource.project_id to break the results out per project.
fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by 1m, [value_utilization_mean: mean(value.utilization)]
| every 1m
| group_by [resource.project_id], [avg_utilization: mean(value_utilization_mean)]
This is essential for organizations with multiple environments (dev, staging, prod) or business units in separate projects.

Managed Service for Prometheus (GMP)

If you already use Prometheus for Kubernetes monitoring, GMP provides a fully managed, globally scalable backend. You can keep your Prometheus configurations and Grafana dashboards but offload the storage and scaling to Google.
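As a sketch of what adoption looks like on GKE (the cluster name, zone, app label, and port below are placeholder assumptions), you enable managed collection on the cluster and then declare scrape targets with a PodMonitoring resource:

# Turn on managed collection for an existing GKE cluster
gcloud container clusters update my-cluster \
    --zone=us-central1-a \
    --enable-managed-prometheus

# Tell GMP which pods and ports to scrape
kubectl apply -f - <<EOF
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-app-monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
EOF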

2. Cloud Logging: The Ledger of Events

Cloud Logging is a real-time log management service built to ingest and index log data at very large scale.

Log Analytics (SQL Support) with Cross-Source Joins

One of the most powerful recent additions is Log Analytics. You can now query your logs using standard BigQuery SQL. This allows you to join log data with other datasets or perform complex group-by operations that were previously impossible in a standard log explorer. Advanced SQL Examples:
  1. Joining Logs with Metrics: Correlate application errors in logs with CPU spikes in metrics.
  2. Cross-Source Joins: Join Cloud Audit logs with Application logs to see who made a change that caused an error.
SELECT
  a.timestamp,
  a.method,
  u.user_email,
  a.status
FROM
  `project.dataset.cloud_audit_log_entries` a
JOIN
  `project.dataset.application_log_entries` u
ON
  a.request_id = u.request_id
WHERE
  a.method = 'google.cloud.compute.v1.Instances.Insert'
  AND u.log_message LIKE '%failed%'
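Beyond joins, ordinary SQL aggregations become trivial. The sketch below runs a group-by against the default Log Analytics view (the project, bucket, and view names are placeholders for your own) to surface which resource types produced the most errors in the last day:

SELECT
  resource.type AS resource_type,
  COUNT(*) AS error_count
FROM
  `my-project.global._Default._Default`
WHERE
  severity = 'ERROR'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY
  resource_type
ORDER BY
  error_count DESC
LIMIT 10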

The Log Router and Sinks

Every log entry passes through the Log Router, where you can configure:
  • Exclusion Filters: Drop logs you don’t need (to save money); see the CLI sketch after this list.
  • Log Sinks: Export logs to:
    • Cloud Storage: For long-term, low-cost compliance (WORM support).
    • BigQuery: For deep analytical processing.
    • Pub/Sub: For real-time processing by a Cloud Function or external tool (like Splunk or Datadog).
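As a minimal sketch, an exclusion can be attached to the built-in _Default sink from the command line. The exclusion name and filter below are placeholders; verify the --add-exclusion flag against your gcloud version:

# Stop routing low-value DEBUG logs into the _Default bucket to cut ingestion costs
gcloud logging sinks update _Default \
    --add-exclusion=name=drop-debug,filter='severity<=DEBUG'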

3. APM: Distributed Tracing and Profiling

When an application is slow, you need to know where it’s slow. APM tools provide the “why.”

Cloud Trace

Cloud Trace is a distributed tracing system. It tracks a single request as it moves through your frontend, multiple microservices, and databases.
  • Latency Distributions: Visualize how latency varies over time.
  • Analysis Reports: Automatically identify if a new deployment caused a latency regression.

Cloud Profiler

Cloud Profiler is a continuous profiling tool that analyzes the CPU and memory consumption of your code in production.
  • Low Overhead: It uses statistical sampling to keep overhead below 0.5%.
  • Flame Graphs: Identify exactly which function in your code is consuming the most resources, helping you optimize costs and performance without guessing.

4. The SRE Framework: SLIs, SLOs, and SLAs

A Principal Engineer doesn’t just build systems; they define the standard for their survival. GCP provides a native framework for managing reliability.

4.1 Definitions: The Reliability Hierarchy

  • SLI (Service Level Indicator): A quantitative measure of some aspect of the level of service.
    • Example: “Percentage of successful HTTP requests.”
  • SLO (Service Level Objective): A target value or range of values for a service level that is measured by an SLI.
    • Example: “99.9% of HTTP requests must return a 2xx status code over 30 days.”
  • SLA (Service Level Agreement): A legal contract between you and your customers that defines the consequences of missing the SLO (usually financial credits).
    • Example: “If uptime falls below 99.9%, we refund 10% of the monthly bill.”

4.2 Implementing SLOs in GCP

To implement an SLO in Cloud Monitoring, you define an SLI, a target, and a compliance period (rolling window or calendar period):
  1. Metric Selection: Use an existing service metric (e.g., load balancer request latency) or define a custom SLI from your own metrics.
  2. The Good Total Ratio: Define what counts as “good” vs “total.”
    • Ratio = (Good Events) / (Total Events)
  3. Error Budgets: The complement of your SLO. If your SLO is 99.9%, your error budget is 0.1%.
    • Burn Rate: The speed at which you are consuming your budget. An alert should trigger if your burn rate suggests you will exhaust the budget before the end of the 30-day compliance period (see the worked example below).
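A quick worked example of the budget math: a 30-day window contains 43,200 minutes, so a 99.9% SLO leaves an error budget of roughly 43 minutes of total unavailability (or the equivalent share of failed requests). Burn rate compares your observed error rate to the budgeted rate: if 1% of requests are currently failing against a 0.1% budget, the burn rate is 10x, which would exhaust the entire 30-day budget in about 3 days and should page someone immediately.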

4.3 Reliability Math: Allowed Downtime

SLO       Allowed Downtime (Per Month)    Allowed Downtime (Per Year)
99%       7.3 hours                       3.65 days
99.9%     43.8 minutes                    8.77 hours
99.99%    4.38 minutes                    52.56 minutes
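These figures follow directly from allowed downtime = (1 − SLO) × period. For example, at 99.9% over an average month of 30.44 days: 0.001 × 30.44 × 24 × 60 ≈ 43.8 minutes.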
Principal Note: 100% is never the target. 100% uptime means you aren’t deploying enough new features. Use your Error Budget to justify risky deployments or infrastructure changes.

5. Advanced Log Analytics: SRE Forensic SQL

Log Analytics allows you to treat logs as a structured BigQuery dataset. This is the primary tool for “Post-Mortem” analysis.

The “Anatomy of an Incident” Query

Use this SQL to correlate multiple log streams during a production outage:
-- Assumes the standard Log Analytics view schema (proto_payload stored as JSON);
-- 'my-project' and the _Default bucket/view are placeholders.
SELECT
  t.timestamp,
  t.text_payload AS error_message,
  JSON_VALUE(a.proto_payload.audit_log.method_name) AS admin_action,
  JSON_VALUE(a.proto_payload.audit_log.authentication_info.principal_email) AS actor
FROM
  `my-project.global._Default._Default` AS t
JOIN
  `my-project.global._Default._Default` AS a
ON
  -- Correlate each error with audit-log entries written in the 15 minutes before it
  a.log_name LIKE '%cloudaudit.googleapis.com%'
  AND a.timestamp BETWEEN TIMESTAMP_SUB(t.timestamp, INTERVAL 15 MINUTE) AND t.timestamp
WHERE
  t.severity = 'ERROR'
  AND t.timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY
  t.timestamp DESC

6. Advanced Monitoring: Metric Scopes and Dashboards as Code

6.1 Metric Scopes (Multi-Project Monitoring)

In a large organization, your resources are spread across hundreds of projects. Metric Scopes allow you to monitor multiple “Member Projects” from a single “Scoping Project.”
  • Centralized Ops: Your SRE team can create one dashboard that pulls CPU metrics from the payments, identity, and frontend projects simultaneously.
  • Limit: A single scoping project can monitor up to 375 member projects.

6.2 Dashboards as Code

Manual dashboarding is an anti-pattern. You should define your dashboards in JSON or Terraform to ensure reproducibility and version control.
  • GCP Dashboard JSON: Every dashboard can be exported as a JSON structure (see the CLI sketch after this list).
  • Terraform Resource: google_monitoring_dashboard allows you to manage these JSON structures as code.
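For example, once a dashboard definition lives in your repository as JSON, it can be exported and recreated from the CLI. This is a sketch; the file names are placeholders and the commands assume a gcloud release that includes the monitoring dashboards surface:

# Export existing dashboard definitions for version control
gcloud monitoring dashboards list --format=json > dashboards-export.json

# Recreate a dashboard from a JSON definition stored in your repo
gcloud monitoring dashboards create --config-from-file=dashboard.json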

6.3 Synthetic Monitoring

Beyond simple “ping” checks, Synthetic Monitoring allows you to run scripted tests (using Mocha/Puppeteer) that simulate real user journeys.
  • Example: “Log in -> Add to cart -> Checkout.”
  • Benefit: Catch logic errors in your frontend before users do.

7. Interview Preparation

1. Q: What is the “Golden Signals” framework in SRE, and how do you monitor them in GCP? A: The Golden Signals are Latency, Traffic, Errors, and Saturation.
  • Latency/Traffic/Errors: Monitored via Cloud Monitoring metrics (e.g., loadbalancing.googleapis.com/https/request_count).
  • Saturation: Monitored via resource metrics (e.g., compute.googleapis.com/instance/cpu/utilization).
You use these signals to define SLIs (Service Level Indicators), which then form the basis of your SLOs (Service Level Objectives).
2. Q: Explain the difference between a “Log Router” and a “Log Sink”. A: The Log Router is the central engine that receives every log entry from every GCP service. It applies “Exclusion Filters” to drop logs you don’t want to pay for. A Log Sink is a rule that tells the Router to export specific logs to a destination: Cloud Storage (long-term/compliance), BigQuery (analysis/SQL), or Pub/Sub (real-time streaming/third-party tools).
3. Q: How does “Log Analytics” (SQL support) change log investigation? A: Traditionally, logs were searched using a text-based filter (the Logs Explorer). Log Analytics allows you to use BigQuery SQL to query logs. This enables complex operations like JOIN (correlating Audit logs with App logs), GROUP BY (finding the top 10 error-causing IP addresses), and window functions, turning logs into a structured analytical dataset.
4. Q: When would you use MQL (Monitoring Query Language) over the standard dashboard UI? A: MQL is required for complex calculations that the GUI cannot perform. Examples include:
  • Ratios: Calculating the error rate by dividing 5xx_count by total_request_count.
  • Cross-Project Aggregation: Summarizing CPU usage across 50 different projects into one chart.
  • Advanced Filtering: Using regex or multiple join conditions across different metric types.
5. Q: What is the primary purpose of Cloud Trace and Cloud Profiler? A:
  • Cloud Trace: A distributed tracing tool that follows a single request as it moves through multiple microservices, identifying which service is causing a latency bottleneck.
  • Cloud Profiler: A continuous profiling tool that looks inside the code to see which specific function or line of code is consuming the most CPU or Memory, helping developers optimize performance and reduce compute costs.

Implementation: The “SRE” Lab

Setting up a Log-Based Alert and Export

# 1. Create a Log-based Metric to count 404 errors
gcloud logging metrics create 404-error-count \
    --description="Count of 404 errors" \
    --log-filter='resource.type="gce_instance" AND textPayload:"404"'

# 2. Create an Alerting Policy based on that metric
# (Typically done via Terraform or Console due to JSON complexity)
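# A CLI alternative (a sketch: assumes the alpha monitoring surface is available
# and that policy.json already contains a valid AlertPolicy definition)
gcloud alpha monitoring policies create --policy-from-file=policy.json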

# 3. Create a Log Sink to export all Audit Logs to BigQuery
gcloud logging sinks create audit-logs-to-bq \
    bigquery.googleapis.com/projects/$PROJECT_ID/datasets/audit_logs \
    --log-filter='logName:"logs/cloudaudit.googleapis.com"'

Pro-Tip: Uptime Checks from the Edge

Always set up Uptime Checks that originate from multiple geographic regions. A service might look “up” from the Google Network, but it could be unreachable for users in Europe or Asia due to a BGP or DNS issue. Global uptime checks catch these “hidden” outages.