Observability in Microservices

When you have 50 services, “tailing the logs” is impossible. You need centralized observability.

1. The Three Pillars

  1. Logs: Immutable records of discrete events (“Error at 10:00 PM”).
  2. Metrics: Data aggregated over time (“CPU usage is 80%”, “Requests per second: 50”).
  3. Tracing: The path of a single request across multiple services.

2. Distributed Tracing with Zipkin/Micrometer

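Spring Boot 3 uses Micrometer Tracing, the successor to Spring Cloud Sleuth. Dependencies (in addition to spring-boot-starter-actuator, which supplies the tracing auto-configuration):
  • io.micrometer:micrometer-tracing-bridge-brave
  • io.zipkin.reporter2:zipkin-reporter-brave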
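How it works: Every request gets a unique Trace ID that is shared by all services the request touches, and each unit of work within it gets its own Span ID. These IDs are propagated between services via HTTP headers (by default the W3C traceparent header).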
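To make the propagation step concrete, here is a minimal sketch of what travels in a W3C traceparent header and how a downstream service would derive its own span from it. The class TraceParent and its methods are hypothetical names for illustration only; Micrometer Tracing and Brave do this for you and their internals differ.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Toy model of a W3C traceparent value: version-traceId-spanId-flags. */
final class TraceParent {
    final String traceId;  // 32 hex chars, shared by every span in the trace
    final String spanId;   // 16 hex chars, unique per span
    final boolean sampled; // lowest bit of the flags field

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Starts a brand-new trace, as the first service in the chain would. */
    static TraceParent newRoot(boolean sampled) {
        return new TraceParent(randomHex16() + randomHex16(), randomHex16(), sampled);
    }

    /** Parses an incoming header value such as "00-<traceId>-<spanId>-01". */
    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) == 1;
        return new TraceParent(parts[1], parts[2], sampled);
    }

    /** Creates the span for the next hop: same Trace ID, fresh Span ID. */
    TraceParent childSpan() {
        return new TraceParent(traceId, randomHex16(), sampled);
    }

    /** Renders the value to send in the outgoing "traceparent" header. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    private static String randomHex16() {
        long value;
        do {
            value = ThreadLocalRandom.current().nextLong();
        } while (value == 0L); // all-zero IDs are not valid
        return String.format("%016x", value);
    }
}
```

In this sketch, Order Service would call newRoot(...) and send toHeader() to Inventory Service, which calls parse(...) and then childSpan(). Because the Trace ID never changes, Zipkin can stitch both spans into one timeline.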
Running Zipkin:
docker run -d -p 9411:9411 openzipkin/zipkin
Config (application.yml):
management:
  tracing:
    sampling:
      probability: 1.0 # Sample 100% of requests (Don't do this in prod!)
Now, when you hit Order Service, which calls Inventory Service, you can see the full timeline in Zipkin UI (http://localhost:9411).
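The sampling probability set above decides what fraction of new traces get recorded. A rough sketch of the idea follows; ProbabilitySampler is a hypothetical class written for this illustration, not the sampler that Micrometer Tracing or Brave actually ships.

```java
import java.util.function.DoubleSupplier;

/** Conceptual probability sampler; not the Micrometer or Brave implementation. */
final class ProbabilitySampler {
    private final double probability;
    private final DoubleSupplier randomSource; // must yield values in [0.0, 1.0)

    ProbabilitySampler(double probability) {
        this(probability, Math::random);
    }

    ProbabilitySampler(double probability, DoubleSupplier randomSource) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0.0 and 1.0");
        }
        this.probability = probability;
        this.randomSource = randomSource;
    }

    /** Decides, when a new trace starts, whether that trace should be recorded. */
    boolean shouldSample() {
        return randomSource.getAsDouble() < probability;
    }
}
```

With probability 1.0 every trace is kept, which is convenient locally but costly under production traffic; a value such as 0.1 keeps roughly one trace in ten.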

3. Metrics with Prometheus & Grafana

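Actuator exposes metrics as JSON at /actuator/metrics. With the Prometheus registry on the classpath, it also serves them in Prometheus text format at /actuator/prometheus, which is the endpoint Prometheus scrapes.
Dependency: io.micrometer:micrometer-registry-prometheus.
Config (application.yml):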
management:
  endpoints:
    web:
      exposure:
        include: prometheus
Prometheus Config (prometheus.yml):
scrape_configs:
  - job_name: 'spring_micrometer'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']
Grafana: Connect Grafana to Prometheus and import a standard Spring Boot Dashboard (ID: 4701). You’ll get instant graphs for JVM memory, GC pauses, and HTTP throughput.
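To give a feel for what Prometheus reads on each scrape, the sketch below renders one counter sample in the Prometheus text format. It assumes the usual Micrometer naming convention for Prometheus (dots become underscores, counters get a _total suffix); PrometheusLine is a hypothetical helper, and the real endpoint also emits # HELP and # TYPE lines that are omitted here.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative only: formats one counter sample the way a scrape would see it. */
final class PrometheusLine {

    static String counterLine(String meterName, Map<String, String> tags, double value) {
        // Assumed convention: "orders.placed" is published as "orders_placed_total"
        String name = meterName.replace('.', '_') + "_total";

        // Tags become labels; sorting keeps the output stable
        String labels = new TreeMap<>(tags).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));

        return name + (labels.isEmpty() ? "" : "{" + labels + "}") + " " + value;
    }

    public static void main(String[] args) {
        System.out.println(counterLine("orders.placed", Map.of("status", "success"), 3.0));
    }
}
```

Knowing the published name matters when you write Grafana queries: you query the underscore form, not the dotted name used in Java code.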

4. Centralized Logging (ELK / Loki)

Don’t write logs to files. Write to the console (STDOUT) and let a log shipper (Fluentd/Promtail) forward them to Elasticsearch or Loki.
Logging with Lombok:
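import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Slf4j // Lombok generates a static SLF4J logger field named "log"
@Service
public class OrderService {
    public void createOrder() {
        // With tracing enabled, the current Trace ID and Span ID are placed in the MDC.
        // Spring Boot 3.2+ prints them by default; earlier 3.x needs them in logging.pattern.level.
        log.info("Creating order...");
    }
}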
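The mechanism behind trace-aware log lines is a per-thread context that the log formatter reads. The following is a simplified stand-in for SLF4J's MDC, written only to show the idea; CorrelatedLog is a hypothetical class and the bracketed layout merely resembles the correlation format Spring Boot prints.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for MDC-based log correlation; not SLF4J or Logback code. */
final class CorrelatedLog {
    // Each thread gets its own context map, so concurrent requests do not mix IDs
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.get().clear();
    }

    /** Builds a line like " INFO [service,traceId,spanId] message". */
    static String format(String level, String service, String message) {
        Map<String, String> ctx = CONTEXT.get();
        return String.format("%5s [%s,%s,%s] %s", level, service,
                ctx.getOrDefault("traceId", ""), ctx.getOrDefault("spanId", ""), message);
    }

    /** Writes to STDOUT, where a log shipper can pick the line up. */
    static void info(String service, String message) {
        System.out.println(format("INFO", service, message));
    }
}
```

Once every line carries the Trace ID, you can paste that ID into Elasticsearch or Loki and see the log output of all services that handled the same request.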
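On the metrics side, Prometheus scrapes /actuator/prometheus at its configured scrape_interval (15s is a common setting; the Prometheus default is 1m), and Grafana visualizes the data.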

5. Deep Dive: Spring Boot Actuator

Actuator exposes operational information about your running application.

Enabling Actuator

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
By default, most endpoints are enabled but, for security, only /actuator/health is exposed over HTTP. To expose all of them:
management:
  endpoints:
    web:
      exposure:
        include: "*" # WARNING: Don't do this in production without security

Key Endpoints

Endpoint                    Description
/actuator/health            Application health (UP/DOWN). Includes DB, disk, etc.
/actuator/info              Application metadata (version, Git commit).
/actuator/metrics           All available metrics.
/actuator/metrics/{name}    Specific metric (e.g., jvm.memory.used).
/actuator/env               Environment properties.
/actuator/loggers           View/change log levels at runtime.
/actuator/prometheus        Prometheus-formatted metrics.
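As an example of using these endpoints from code, the sketch below builds two requests with the JDK's HttpClient API: a health check, and a POST to the loggers endpoint that changes a log level at runtime using a configuredLevel JSON body. The base URL and the logger name com.example.orders are placeholders, and the class ActuatorRequests is a hypothetical helper; the endpoints must be exposed for the calls to succeed.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds requests against Actuator endpoints; sending them is left to the caller. */
final class ActuatorRequests {

    /** GET /actuator/health */
    static HttpRequest health(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/health"))
                .GET()
                .build();
    }

    /** POST /actuator/loggers/{name} with the desired level, e.g. "DEBUG". */
    static HttpRequest setLogLevel(String baseUrl, String loggerName, String level) {
        String body = "{\"configuredLevel\":\"" + level + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

To execute a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Keeping request construction separate from sending makes the helper easy to check without a running service.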

Securing Actuator

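import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {
    @Bean
    public SecurityFilterChain actuatorSecurity(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                .requestMatchers("/actuator/**").hasRole("ADMIN") // only admins may call Actuator
                .anyRequest().authenticated()
        )
        // An authentication mechanism is required, otherwise no caller can ever log in
        .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}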

6. Custom Metrics with Micrometer

Track your own business KPIs.
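import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final MeterRegistry meterRegistry;

    public OrderService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Gauge (current value): register once; Micrometer calls the function on each scrape.
        // Passing a plain number to gauge() on every call would not update the reported value.
        Gauge.builder("orders.pending", this, OrderService::getPendingOrderCount)
                .register(meterRegistry);
    }

    public void placeOrder(Order order) {
        // Record time, even if processing throws
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            processOrder(order);
            // Increment counter only after the order actually succeeded
            meterRegistry.counter("orders.placed", "status", "success").increment();
        } finally {
            sample.stop(meterRegistry.timer("order.processing.time"));
        }
    }

    // processOrder(Order) and getPendingOrderCount() are the service's own methods (not shown)
}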
Metric Types:
  • Counter: Monotonically increasing (e.g., requests served).
  • Gauge: Current value (e.g., active connections).
  • Timer: Duration of events (e.g., request latency).
  • Distribution Summary: Statistical summary (e.g., request size).
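The differences between these types are easier to see in a toy model. The classes below are deliberately simplified stand-ins written for this explanation; they are not Micrometer's implementations and the names inside ToyMeters are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleSupplier;

/** Toy versions of the four meter types, to show what each one remembers. */
final class ToyMeters {

    /** Counter: a total that only ever goes up. */
    static final class Counter {
        private final LongAdder count = new LongAdder();
        void increment() { count.increment(); }
        long count() { return count.sum(); }
    }

    /** Gauge: stores nothing itself; it reads the current value when asked. */
    static final class Gauge {
        private final DoubleSupplier source;
        Gauge(DoubleSupplier source) { this.source = source; }
        double value() { return source.getAsDouble(); }
    }

    /** Timer: number of events plus their total duration. */
    static final class Timer {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalNanos = new LongAdder();

        void record(Runnable task) {
            long start = System.nanoTime();
            try {
                task.run();
            } finally {
                totalNanos.add(System.nanoTime() - start);
                count.increment();
            }
        }
        long count() { return count.sum(); }
        long totalNanos() { return totalNanos.sum(); }
    }

    /** Distribution summary: like a timer, but for arbitrary amounts such as bytes. */
    static final class Summary {
        private long count;
        private double total;
        private double max;

        synchronized void record(double amount) {
            count++;
            total += amount;
            max = Math.max(max, amount);
        }
        synchronized long count() { return count; }
        synchronized double mean() { return count == 0 ? 0.0 : total / count; }
        synchronized double max() { return max; }
    }
}
```

The practical rule that falls out of this: use a counter for things you count as they happen, and a gauge for things you can look up at any moment, such as a queue size.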

Monitoring Flow (Complete Picture)