Distributed Tracing: Monitoring the Microservice Maze

"In a monolith, the stack trace is your map. In a microservice mesh, the stack trace is a lie—the real map is the Trace ID."
When a monolithic application fails, you check one log file. It’s linear, predictable, and contained. But when you move to a distributed architecture, a single user click can trigger a chain reaction: the API Gateway authenticates, the Order Service validates, the Inventory Service checks stock, and the Payment Service communicates with an external bank.
If that request fails or takes 10 seconds, where is the bottleneck?
- Is it a slow database in the Inventory service?
- A GC pause in the Payment service?
- Or a congested network bridge in the Gateway?
Without Distributed Tracing, you are effectively blind. You might have 50 potential culprits, 200 network hops, and 1,000 log files to search. Distributed tracing is the practice of tagging every request with a unique identifier that follows it across every service boundary, giving you a subatomic view of your system’s performance.
1. The Anatomy of a Trace: Spans, Contexts, and Propagation
To master observability, you must first understand the fundamental units of work defined by the OpenTelemetry specification and implemented by tracers like Brave.
The Trace vs. The Span
- The Trace: The entire journey of a request from the moment it enters the system (usually at the load balancer or gateway) to the moment the response is returned. A Trace is a Directed Acyclic Graph (DAG) of spans.
- The Span: A single unit of work. This could be an HTTP GET request, an SQL query execution, or the time spent serializing a JSON object. Every span has a start time, an end time, and metadata (Tags and Logs).
Correlation IDs: The Primary Key of Observability
The magic of tracing lies in the Trace ID. This 64-bit or 128-bit hex string is generated at the entry point and must be "propagated" to every downstream service.
- Trace ID: Shared by every span in the entire request journey.
- Span ID: Unique to each individual step.
- Parent ID: Links a span to the step that called it, allowing Zipkin to reconstruct the "Tree View" of the request.
W3C Trace Context: The Universal Language
In the early days, tracing was fragmented. Twitter used X-B3-TraceId (Zipkin), while others used custom headers. This made "Polyglot" tracing (Java talking to Go talking to Node.js) a nightmare.
Enter the W3C Trace Context standard. It defines two critical headers:
- traceparent: Contains the version, trace ID, parent span ID, and flags (like "should I sample this?"). Example:
00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
- tracestate: Allows vendors to pass their own proprietary data without breaking the standard trace ID.
By adopting W3C, your Spring Boot services can now "hand off" a trace to a Python AI agent or a Rust-based crypto engine with zero effort.
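The traceparent format is simple enough to dissect by hand. Here is a minimal, framework-free Java sketch of that dissection; the TraceParent class and the field names are illustrative, not part of any library:

```java
import java.util.Map;

public class TraceParent {
    // Splits a W3C traceparent header into its four dash-separated fields:
    // version - trace-id - parent-id - trace-flags
    static Map<String, String> parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return Map.of(
                "version", parts[0],   // "00" for the current spec version
                "traceId", parts[1],   // 32 hex chars = 128-bit trace ID
                "parentId", parts[2],  // 16 hex chars = 64-bit span ID of the caller
                "flags", parts[3]);    // "01" means "sampled"
    }

    public static void main(String[] args) {
        Map<String, String> f =
                parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        // The downstream service uses traceId to join the same trace and
        // parentId to attach its new span under the caller's span.
        System.out.println(f.get("traceId") + " sampled=" + "01".equals(f.get("flags")));
    }
}
```

In a real service this parsing is done for you by the tracer; the point is that the entire cross-service contract fits in one header.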
2. Micrometer Tracing: The Modern Standard
If you are coming from Spring Boot 2.x, you likely used Spring Cloud Sleuth. However, in the 2026 enterprise landscape (and since Spring Boot 3.0), Sleuth has been retired. Tracing has moved into the Micrometer ecosystem—the same library used for metrics.
Why the Change?
Historically, metrics and tracing were handled by separate libraries. This was inefficient. If you wanted to time a method, you used Micrometer for a timer and Sleuth for a span. Now, you use the Observation API.
One Instrumentation to Rule Them All: You create a single "Observation." Depending on your configuration, this observation can automatically produce a Micrometer Timer and a Tracing Span simultaneously.
The Architecture: Bridge and Handler
Micrometer Tracing acts as a Facade (similar to SLF4J). It provides a common API, but you must choose a "Tracer Implementation" and a "Reporter."
- The Tracer (Brave or OTel):
- Brave: The classic Zipkin-compatible tracer. Reliable and mature.
- OpenTelemetry (OTel): The future-proof industry standard. Use this if you plan to export to Jaeger, Honeycomb, or AWS X-Ray.
- The Bridge: A small library that translates Micrometer calls into Brave or OTel calls.
- The Reporter: Sends the completed spans to a backend (like Zipkin).
Pro-Grade Dependency Setup
To enable tracing in a modern Spring Boot 3 application, your pom.xml needs to look like this (using Brave as the tracer):
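A typical setup looks like the fragment below. Versions are managed by the Spring Boot BOM; the coordinates are the standard Micrometer and Zipkin artifacts, but verify them against your Boot version:

```xml
<!-- Actuator pulls in the Micrometer Observation API -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- The Bridge: translates Micrometer Observations into Brave spans -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<!-- The Reporter: ships finished spans to a Zipkin collector -->
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-reporter-brave</artifactId>
</dependency>
```

To switch tracers later, you swap the bridge (for example to micrometer-tracing-bridge-otel) and the reporter; your application code, written against the Micrometer facade, stays untouched.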
3. The "Hardware Mirror": Observability Overhead
We must never forget the hardware reality: Tracing is not free. Every time you start a span, the JVM must:
- Read the current system clock (high-precision).
- Create a Span object on the Heap.
- Store the context in a ThreadLocal (which can impact cache locality).
- Buffer the span in memory before sending it over the network to Zipkin.
If your service processes 100,000 requests per second, creating 100,000 spans will significantly increase your Allocation Rate and trigger more frequent GC Pauses. This is why the next section—Sampling—is the most important part of your production configuration.
4. Sampling: The "Bouncer" for Your Spans
In a massive distributed system, you cannot trace 100% of requests. The I/O cost of serializing and sending every span to a Zipkin collector would consume more CPU than the actual business logic. You must use Sampling.
Head-Based Sampling: The Fast Path
This is the default for most systems. The decision to sample is made at the very beginning of the request (the "Head").
- If the first service decides to sample, it sets the "Sampled" flag in the W3C traceparent to 01.
- Every downstream service sees this flag and must follow the instruction.
The Probability Configuration: In Spring Boot, you control this via property files:
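For example, to keep roughly one trace in ten (the property name below is the standard Spring Boot 3 Actuator one):

```properties
# Sample 10% of requests; 1.0 traces everything, 0.0 disables tracing
management.tracing.sampling.probability=0.1
```

Spring Boot defaults this value to 0.1, so an "I only see some of my traces" surprise in a new environment is usually this property, not a broken exporter.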
Tail-Based Sampling: The Intelligent Path
What if you only want to see traces that failed or took longer than 500ms? Head-based sampling can't do this, because the decision is made before the outcome of the request is known. Tail-based sampling collects everything in memory at the collector level (using the OpenTelemetry Collector), waits for the request to finish, and then decides whether to keep it or discard it.
- Pros: You never miss a single error. No wasted storage on "boring" 200 OK requests.
- Cons: Requires a dedicated OTel Collector layer and higher memory usage at the ingress.
5. Async & Kafka: Keeping the Trace Alive (The Context Gap)
One of the most common "broken traces" occurs when you move from a standard HTTP thread to an asynchronous process or a message broker like Kafka.
The ThreadLocal Trap
Standard tracing context is stored in a ThreadLocal. When you call .thenApply() on a CompletableFuture or use an @Async method, you are jumping to a different thread. That thread is "cold"—it has no idea about the Trace ID of the parent.
The Micrometer Solution:
You must wrap your Executors or use the ObservationRegistry.
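The trap and the fix can be demonstrated with nothing but the JDK. The ThreadLocal below stands in for the tracer's context storage, and wrap() mimics what Micrometer's context propagation does for you automatically; the class and method names are illustrative:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextGapDemo {
    // Stand-in for the tracer's ThreadLocal-based context storage.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Captures the caller's context and restores it on the worker thread.
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = TRACE_ID.get(); // runs on the submitting thread
        return () -> {
            TRACE_ID.set(captured);       // runs on the pool thread
            try {
                return task.call();
            } finally {
                TRACE_ID.remove();        // avoid leaking into reused threads
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TRACE_ID.set("4bf92f3577b34da6");

        // The pool thread is "cold": it never saw the parent's trace ID.
        String bare = pool.submit((Callable<String>) TRACE_ID::get).get();
        // Wrapping carries the context across the thread hop.
        String propagated = pool.submit(wrap(TRACE_ID::get)).get();

        // prints: bare=null, propagated=4bf92f3577b34da6
        System.out.println("bare=" + bare + ", propagated=" + propagated);
        pool.shutdown();
    }
}
```

In production you do not write this wrapper yourself; you let the framework decorate the Executor so every submitted task is wrapped the same way.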
Message-Driven Tracing (Kafka/RabbitMQ)
When you produce a message to Kafka, the Trace ID must travel with the message.
- Producer: Micrometer Tracing intercepts the KafkaTemplate and injects the Trace ID into the Kafka Record Headers.
- Consumer: The receiver reads the headers, "re-hydrates" the context, and starts a child span. This allows you to see the "gap" in time between when a message was sent and when it was actually processed by the consumer—a vital metric for identifying Consumer Lag.
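Conceptually, the producer-side injection and consumer-side extraction reduce to the sketch below. The map is a stand-in for real Kafka record headers, and the helper names are illustrative; in practice Micrometer and KafkaTemplate do this for you:

```java
import java.util.HashMap;
import java.util.Map;

public class KafkaTraceContext {
    // Producer side: serialize the current context into a traceparent
    // header so it travels inside the record, not on the thread.
    static void inject(String traceId, String spanId, Map<String, String> headers) {
        headers.put("traceparent", "00-" + traceId + "-" + spanId + "-01");
    }

    // Consumer side: re-hydrate the context from the headers; the trace ID
    // lets the consumer start a child span of the producer's span.
    static String extractTraceId(Map<String, String> headers) {
        String traceparent = headers.get("traceparent");
        return traceparent == null ? null : traceparent.split("-")[1];
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", headers);
        // prints: 4bf92f3577b34da6a3ce929d0e0e4736
        System.out.println(extractTraceId(headers));
    }
}
```

Because the context rides in the record itself, the trace survives even if the message sits in the topic for minutes before a consumer picks it up.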
6. Visualizing the Journey: Zipkin and Grafana Tempo
A Trace ID is just a string until you visualize it.
Zipkin: The Classic Choice
Zipkin provides a lightweight, easy-to-run UI built around two key views:
- Waterfall Charts: See exactly which service is the bottleneck. If Service A takes 2s but its call to Service B only takes 100ms, you know the problem is in Service A's local logic (e.g., a slow loop or lock contention).
- Service Graphs: Automatically generate a map of your entire architecture showing how services interact.
Grafana Tempo: The High-Scale Alternative
Tempo is "Trace-ID only" storage. It doesn't index your traces. Instead, it relies on your Logs (Loki) or Metrics (Prometheus) to find the Trace ID. Once you have the ID, Tempo retrieves the full trace from object storage (like S3). This is far more scalable than Zipkin's database-heavy approach for 2026-scale clusters.
7. Metrics + Traces: The Power of Exemplars
The ultimate "Observability Superpower" is Exemplars. Imagine you are looking at a Prometheus graph showing a spike in latency. Usually, you would have to manually search Zipkin for a trace at that exact time. With Exemplars, the Trace ID is attached directly to the metric point. You can click a "dot" on the Grafana graph and jump instantly to the specific trace that caused that latency spike.
Summary: Mastering the View
Distributed tracing is the difference between "I think the server is slow" and "I know the Inventory database query on table 'X' took 1.4s due to a missing index."
- Infrastructure First: Trace context propagation must be handled by the framework (Micrometer), not the developer.
- Performance is a Feature: Use sampling to keep your JVM allocation rate healthy.
- Correlate Everything: A trace without logs is just a timeline. A trace with logs is a forensic evidence file.
You have now mastered the art of distributed monitoring. No request can move through your mesh without leaving a digital footprint.
Part of the Java Enterprise Mastery — engineering the view.
