Observability in Microservices
When you have 50 services, “tailing the logs” is impossible. You need centralized observability.
1. The Three Pillars
- Logs: Immutable record of discrete events. (“Error at 10:00 PM”).
- Metrics: Aggregated data over time. (“CPU usage is 80%”, “Requests per second: 50”).
- Tracing: The path of a single request across multiple services.
2. Distributed Tracing with Zipkin/Micrometer
Spring Boot 3 uses Micrometer Tracing (formerly Spring Cloud Sleuth). Dependencies:
- io.micrometer:micrometer-tracing-bridge-brave
- io.zipkin.reporter2:zipkin-reporter-brave
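Each request carries a Trace ID (global: shared by every service the request touches) and a Span ID (local: one per unit of work inside a service). These IDs are propagated between services via HTTP headers (traceparent).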
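To make the header concrete, here is a minimal stdlib-only sketch of the W3C traceparent layout (version, trace ID, span ID, flags). Micrometer Tracing handles this for you; the class name TraceParentSketch and its methods are hypothetical and exist only for illustration. The point it shows is that an outgoing call keeps the trace ID and gets a fresh span ID.

```java
import java.security.SecureRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of traceparent handling; not Micrometer's implementation. */
final class TraceParentSketch {
    // Version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;
    final String spanId;
    final String flags;

    TraceParentSketch(String traceId, String spanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    /** Parses an incoming header value, rejecting anything that is not version 00. */
    static TraceParentSketch parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        return new TraceParentSketch(m.group(1), m.group(2), m.group(3));
    }

    /** Context for an outgoing call: same trace ID, new span ID, same flags. */
    TraceParentSketch newChild() {
        long id;
        do {
            id = RANDOM.nextLong();
        } while (id == 0L); // an all-zero span ID is not valid
        return new TraceParentSketch(traceId, String.format("%016x", id), flags);
    }

    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + flags;
    }

    public static void main(String[] args) {
        TraceParentSketch incoming =
                parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        TraceParentSketch outgoing = incoming.newChild();
        System.out.println("incoming: " + incoming.toHeader());
        System.out.println("outgoing: " + outgoing.toHeader());
    }
}
```

Both printed headers share the same 32-character trace ID, which is what lets a tracing backend stitch spans from different services into one request.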
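Running Zipkin: start a Zipkin server; by default it listens on port 9411.
Config (application.yml): set management.tracing.sampling.probability (1.0 samples every request, which suits development) and, if Zipkin is not running on localhost, management.zipkin.tracing.endpoint.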
When a request hits Order Service, which in turn calls Inventory Service, you can see the full timeline in the Zipkin UI (http://localhost:9411).
3. Metrics with Prometheus & Grafana
Actuator exposes metrics at /actuator/metrics. Prometheus scrapes them from /actuator/prometheus, which serves the same metrics in Prometheus text format.
Dependency: io.micrometer:micrometer-registry-prometheus.
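For a sense of what a scrape actually returns, the following stdlib-only sketch renders one counter sample in the Prometheus text exposition format. The registry dependency above produces this output for you; the class PrometheusTextSketch, the metric name orders_created_total, and the service label are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative renderer for one counter sample; not the Micrometer registry. */
final class PrometheusTextSketch {

    /** Returns the HELP line, the TYPE line and one sample line. */
    static String renderCounter(String name, String help,
                                Map<String, String> labels, double value) {
        String labelPart = labels.isEmpty()
                ? ""
                : labels.entrySet().stream()
                        .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                        .collect(Collectors.joining(",", "{", "}"));
        return "# HELP " + name + " " + help + "\n"
                + "# TYPE " + name + " counter\n"
                + name + labelPart + " " + value + "\n";
    }

    /** Label values escape backslash, double quote and newline. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("service", "order-service");
        System.out.print(renderCounter(
                "orders_created_total", "Orders created", labels, 42.0));
    }
}
```

Each scrape is just a plain-text page of lines like these, which is why Prometheus can pull metrics from any service that exposes the endpoint.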
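Config: expose the prometheus endpoint over HTTP by adding it to management.endpoints.web.exposure.include.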
Prometheus config (prometheus.yml): Prometheus scrapes /actuator/prometheus every 15s. Grafana visualizes the data.
4. Centralized Logging (ELK / Loki)
Don’t write logs to files. Write to the console (STDOUT). Use a log shipper (Fluentd/Promtail) to send them to Elasticsearch or Loki.
Lombok Logging: annotate a class with @Slf4j to get a ready-made log field.
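One common way to make STDOUT logs easy for a shipper to parse is to emit one JSON object per line and include the trace ID, so a log entry can be matched to its trace. The sketch below assumes that approach and uses only the standard library; the class StdoutJsonLogSketch and the field names ts, level, traceId and message are hypothetical, and a real service would normally configure its logging framework to produce this instead.

```java
import java.time.Instant;

/** Illustrative single-line JSON log formatter writing to STDOUT. */
final class StdoutJsonLogSketch {

    static String formatLine(Instant timestamp, String level,
                             String traceId, String message) {
        return "{\"ts\":\"" + timestamp + "\""
                + ",\"level\":\"" + escape(level) + "\""
                + ",\"traceId\":\"" + escape(traceId) + "\""
                + ",\"message\":\"" + escape(message) + "\"}";
    }

    /** Escapes quotes, backslashes and control characters so the line stays valid JSON. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One event per line on STDOUT; the shipper forwards each line as-is.
        System.out.println(formatLine(
                Instant.parse("2024-01-01T22:00:00Z"),
                "ERROR",
                "4bf92f3577b34da6a3ce929d0e0e4736",
                "Inventory call failed"));
    }
}
```

Keeping each event on a single line matters because shippers usually treat a newline as the boundary between log entries.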
5. Deep Dive: Spring Boot Actuator
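Actuator exposes operational information about your running application.
Enabling Actuator
Add the spring-boot-starter-actuator dependency. Over HTTP only the health endpoint is exposed by default; other endpoints must be opted in through management.endpoints.web.exposure.include.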
Key Endpoints
| Endpoint | Description |
|---|---|
| /actuator/health | Application health (UP/DOWN). Includes DB, Disk, etc. |
| /actuator/info | Application metadata (version, Git commit). |
| /actuator/metrics | All available metrics. |
| /actuator/metrics/{name} | Specific metric (e.g., jvm.memory.used). |
| /actuator/env | Environment properties. |
| /actuator/loggers | View/Change log levels at runtime. |
| /actuator/prometheus | Prometheus-formatted metrics. |
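The health endpoint reports one overall status even though it checks several components (DB, disk, and so on). The sketch below illustrates the idea of rolling component statuses up into a single result by picking the most severe one. It is a stdlib-only approximation, not Spring Boot's own aggregator; the class HealthAggregationSketch and the severity ordering used here are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

/** Illustrative roll-up of component statuses into one overall status. */
final class HealthAggregationSketch {
    // Most severe first. Assumed ordering for this sketch.
    private static final List<String> SEVERITY =
            List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe status present, or UNKNOWN if none is recognised. */
    static String aggregate(Map<String, String> componentStatuses) {
        return SEVERITY.stream()
                .filter(componentStatuses::containsValue)
                .findFirst()
                .orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        Map<String, String> components = Map.of(
                "db", "UP",
                "diskSpace", "DOWN");
        // A single failing component is enough to mark the application as down.
        System.out.println("overall: " + aggregate(components));
    }
}
```

This is why a load balancer or orchestrator can poll a single URL: one unhealthy dependency is enough to change the overall answer.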
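Securing Actuator
Endpoints such as /actuator/env and /actuator/loggers reveal or change runtime state, so expose only the endpoints you need and require authentication for the rest.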
6. Custom Metrics with Micrometer
Track your own business KPIs.
- Counter: Monotonically increasing (e.g., requests served).
- Gauge: Current value (e.g., active connections).
- Timer: Duration of events (e.g., request latency).
- Distribution Summary: Statistical summary (e.g., request size).
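The difference between the four meter types is easiest to see in code. The sketch below models their semantics with the standard library only, so it runs without any dependency. It is not Micrometer's API: the classes SketchCounter, SketchGauge, SketchTimer and SketchSummary are hypothetical stand-ins, and in a real service you would register Micrometer meters on a MeterRegistry instead.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleSupplier;

/** Stdlib-only stand-ins for the four meter types; not Micrometer's API. */
final class MeterSemanticsSketch {

    /** Counter: can only go up. */
    static final class SketchCounter {
        private final LongAdder count = new LongAdder();
        void increment() { count.increment(); }
        long count() { return count.sum(); }
    }

    /** Gauge: samples the current value on demand instead of storing one. */
    static final class SketchGauge {
        private final DoubleSupplier source;
        SketchGauge(DoubleSupplier source) { this.source = source; }
        double value() { return source.getAsDouble(); }
    }

    /** Timer: counts events and accumulates their total duration. */
    static final class SketchTimer {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalNanos = new LongAdder();
        void record(Runnable task) {
            long start = System.nanoTime();
            try {
                task.run();
            } finally {
                totalNanos.add(System.nanoTime() - start);
                count.increment();
            }
        }
        long count() { return count.sum(); }
        long totalNanos() { return totalNanos.sum(); }
    }

    /** Distribution summary: count, mean and max of recorded amounts. */
    static final class SketchSummary {
        private long count;
        private double total;
        private double max;
        synchronized void record(double amount) {
            count++;
            total += amount;
            max = Math.max(max, amount);
        }
        synchronized long count() { return count; }
        synchronized double mean() { return count == 0 ? 0.0 : total / count; }
        synchronized double max() { return max; }
    }

    public static void main(String[] args) {
        SketchCounter ordersCreated = new SketchCounter();
        AtomicInteger activeConnections = new AtomicInteger();
        SketchGauge connections = new SketchGauge(() -> activeConnections.get());
        SketchTimer checkoutLatency = new SketchTimer();
        SketchSummary requestSize = new SketchSummary();

        ordersCreated.increment();
        activeConnections.set(7); // the gauge reflects this on its next read
        checkoutLatency.record(() -> { /* business logic would run here */ });
        requestSize.record(512);
        requestSize.record(2048);

        System.out.println("orders created: " + ordersCreated.count());
        System.out.println("active connections: " + connections.value());
        System.out.println("checkouts timed: " + checkoutLatency.count());
        System.out.println("mean request size: " + requestSize.mean());
    }
}
```

Note the design difference: a counter and a timer are pushed to on every event, while a gauge is pulled from, so it always reports whatever the underlying value is at read time.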