Observability with OpenTelemetry and Spring Boot: Logs, Metrics, Traces in One Stack (2026)

Production systems fail in ways you didn't predict. A service slows down — but the CPU is fine, the memory is fine, the error rate is zero. Somewhere in the chain, a database query is taking 800ms instead of 5ms, triggered by a code path that only runs for enterprise accounts. You'd never find this by looking at a single metric. You need all three pillars of observability: metrics to know something is wrong, traces to find where it's wrong, and logs to understand why.

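OpenTelemetry has become the de facto standard for vendor-neutral instrumentation. It works across languages and backends, and Spring Boot 3.x supports it through Micrometer Tracing and OTLP exporters. This guide walks through a complete setup: instrumentation, a local Grafana backend, custom spans, correlated logs, and SLO alerts.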

The Three Pillars and Why You Need All Three

Metrics are aggregates over time: request rate, error rate, P99 latency, heap usage. They're cheap to store and perfect for alerting — but they tell you that something is wrong, not what or why.

Traces follow a request through the system. A distributed trace shows every service call, database query, and cache operation for a single request, with timing for each. Traces answer "which part of the system is slow for this specific request?"

Logs are the full context around a specific event. When a trace shows a slow database call, the corresponding log tells you the exact SQL, the parameters, the user ID, and any warnings Hibernate generated.

These three signals work together. An alert fires on a metric (P99 > 1s). You look at a trace to find the slow span (a repository method). You look at the log for that span to find the exact query and parameters. Without all three, you're flying partially blind.

The OpenTelemetry Stack for Spring Boot

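<!-- pom.xml: Spring Boot 3.x auto-configures tracing and OTLP export from these -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<!-- Needed to export Micrometer metrics over OTLP -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-otlp</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.5.0</version>
</dependency>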

# application.yml
spring:
  application:
    name: order-service

management:
  tracing:
    sampling:
      probability: 1.0   # 100% in dev; 10% in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
    metrics:
      export:
        url: http://otel-collector:4318/v1/metrics
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

With this configuration, Spring Boot automatically:

Creates a trace for every HTTP request
Propagates trace context to downstream service calls
Injects trace IDs into all log lines
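Exports traces and metrics to the OpenTelemetry Collector over OTLP (shipping logs the same way needs an OTLP log appender or a collector-side log pipeline, depending on your Spring Boot version)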
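
Propagation to downstream calls works by sending the trace context in an HTTP header. The sketch below assumes the W3C Trace Context format and its `traceparent` header; the `TraceParent` class is a hypothetical illustration of the header layout, not a Spring or OpenTelemetry API (the framework handles this for you).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of a W3C `traceparent` header (version 00). */
final class TraceParent {
    // Layout: version "00", 32 hex chars trace ID, 16 hex chars parent span ID, 2 hex chars flags
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Parses an incoming header value; returns empty for anything malformed. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are invalid in the W3C format
        if (traceId.chars().allMatch(c -> c == '0') || spanId.chars().allMatch(c -> c == '0')) {
            return Optional.empty();
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return Optional.of(new TraceParent(traceId, spanId, (flags & 0x01) != 0));
    }

    /** Header to send downstream: same trace ID, the caller's span as the new parent. */
    String childHeader(String callerSpanId) {
        return "00-" + traceId + "-" + callerSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

The trace ID stays constant across every hop, which is what lets a backend stitch spans from different services into one trace; only the span ID changes per call.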

The LGTM Stack: Loki, Grafana, Tempo, Mimir

Grafana Labs maintains an open-source observability stack that works well with OpenTelemetry data:

# docker-compose.yml — local development observability
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yml"]   # point the collector at the mounted file
    volumes:
      - ./otel-config.yml:/etc/otel/config.yml
    ports:
      - "4318:4318"   # OTLP HTTP
      - "4317:4317"   # OTLP gRPC

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

  tempo:   # Distributed traces
    image: grafana/tempo:latest

  loki:    # Logs
    image: grafana/loki:latest

  mimir:   # Metrics
    image: grafana/mimir:latest

The beautiful part of this stack is Grafana's ability to link between signals. From a trace in Tempo, you can click directly to the logs for that trace in Loki. From a Grafana dashboard, you can click on a spike in latency and jump to exemplar traces from that time period.

Custom Spans for Business Operations

Spring Boot auto-instruments HTTP endpoints, JDBC calls, and Spring Data repositories. But you also want traces for business operations that matter:

import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class OrderService {

    private final Tracer tracer;

    public OrderResult processOrder(OrderRequest req) {
        // Create a custom span for the business operation
        Span span = tracer.nextSpan()
            .name("order.processing")
            .tag("order.customerId", req.getCustomerId())
            .tag("order.region", req.getRegion())
            .tag("order.itemCount", String.valueOf(req.getItems().size()))
            .start();

        try (Tracer.SpanInScope scope = tracer.withSpan(span)) {
            // Spans started while this scope is open become children of "order.processing"
            validateOrder(req);          // Gets its own span only if instrumented separately
            OrderResult result = saveAndPublish(req);

            span.tag("order.id", result.orderId().toString());
            span.tag("order.outcome", "success");
            return result;

        } catch (Exception e) {
            span.tag("order.outcome", "failure");
            span.error(e);               // Records the exception on the span; safe for null messages
            throw e;
        } finally {
            span.end();
        }
    }
}

Custom spans let you see business-level timing in traces. Instead of seeing only "JDBC execute" and "HTTP GET", you see "order.processing" with its children — immediately showing which part of your business logic is slow.

Structured Logging with Trace Correlation

Logs are most useful when they include the trace ID — this is what enables navigation from a trace to the corresponding logs:

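import static net.logstash.logback.argument.StructuredArguments.kv;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class OrderService {

    public void processOrder(Order order) {
        // Spring Boot and Micrometer Tracing put traceId and spanId into the MDC,
        // so this log line carries the trace ID without any manual code.
        log.info("Processing order",
            // kv(...) from logstash-logback-encoder adds each pair as a JSON field: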
            kv("orderId", order.getId()),
            kv("customerId", order.getCustomerId()),
            kv("amount", order.getTotal()),
            kv("region", order.getRegion())
        );
    }
}

<!-- pom.xml: JSON encoder for Logback -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <version>7.4</version>
</dependency>

<!-- logback-spring.xml: structured JSON logs for log aggregation -->
<configuration>
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <!-- Only these MDC keys are written; omit both lines to include every MDC entry -->
            <includeMdcKeyName>traceId</includeMdcKeyName>
            <includeMdcKeyName>spanId</includeMdcKeyName>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON"/>
    </root>
</configuration>

JSON-structured logs are machine-parseable. Loki can filter by {app="order-service"} | json | traceId="abc123" to find all logs for a specific trace.

SLO Monitoring: Alert on What Users Experience

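# SLO burn-rate alert (PromQL), usable as a Grafana alert rule
# SLO: 99.9% of requests complete in under 1 second, over 30 days
#
# Error budget: 0.1% of requests. At steady traffic that equals
# 0.1% × 30 days × 24 hours × 3600 seconds = 2592 seconds,
# or about 43 minutes of slow responses per month
#
# The query assumes a histogram bucket at exactly 1s. If le="1.0" is missing, add one with
# management.metrics.distribution.slo.http.server.requests: 1s
#
# Multi-window burn-rate alert: fires when the error budget burns 14.4x faster
# than sustainable over both the last hour and the last 5 minutes.
# Threshold on the good-request ratio: 1 - 14.4 × 0.001 = 0.9856
(
  sum(rate(http_server_requests_seconds_bucket{le="1.0",status!~"5.."}[1h]))
  /
  sum(rate(http_server_requests_seconds_count{status!~"5.."}[1h]))
) < 0.9856
and
(
  sum(rate(http_server_requests_seconds_bucket{le="1.0",status!~"5.."}[5m]))
  /
  sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
) < 0.9856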

SLO-based alerting tells you when users are experiencing problems, not when some internal metric crosses an arbitrary threshold. It's the difference between alerting on what matters and alert fatigue from metrics that fluctuate harmlessly.
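
The arithmetic behind a burn-rate alert is easy to get wrong, so here is a minimal sketch that checks it. The `ErrorBudget` class and its method names are hypothetical; the numbers are the 99.9% target and 30-day window used in this section.

```java
/** Hypothetical helper for error-budget and burn-rate arithmetic. */
final class ErrorBudget {
    private final double sloTarget;    // e.g. 0.999 for "99.9% of requests under 1 second"
    private final double windowHours;  // SLO window length in hours

    ErrorBudget(double sloTarget, double windowDays) {
        this.sloTarget = sloTarget;
        this.windowHours = windowDays * 24.0;
    }

    /** Fraction of requests allowed to miss the target. */
    double budgetFraction() {
        return 1.0 - sloTarget;
    }

    /** The budget expressed as seconds of the window, assuming steady traffic. */
    double budgetSeconds() {
        return budgetFraction() * windowHours * 3600.0;
    }

    /** How many times faster than sustainable the budget burns at this good-request ratio. */
    double burnRate(double goodRatio) {
        return (1.0 - goodRatio) / budgetFraction();
    }

    /** Good-request ratio below which the given burn rate is exceeded. */
    double goodRatioThreshold(double burnRate) {
        return 1.0 - burnRate * budgetFraction();
    }

    /** Share of the whole window's budget used by burning at this rate for some hours. */
    double budgetConsumed(double burnRate, double hours) {
        return burnRate * hours / windowHours;
    }
}
```

With `new ErrorBudget(0.999, 30)`, a 14.4x burn rate corresponds to a good-request ratio just under 0.9856, and sustaining it for one hour uses about 2% of the monthly budget. A ratio just under 0.999 is only a 1x burn, which is the sustainable rate and not worth paging anyone for.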

Common Mistakes to Avoid

100% sampling in production: tracing every request at high throughput is expensive; use adaptive sampling or a fixed rate (1-10%) for high-traffic services
Missing trace propagation: if you run work on your own threads (a ThreadPoolExecutor, or CompletableFuture with a custom executor), the trace context does not follow the task automatically; propagate it explicitly, for example by registering a ContextPropagatingTaskDecorator for Spring-managed executors
Logs without trace IDs: logs that don't include traceId can't be linked to traces in Grafana; always include the correlation fields
Collecting metrics without dashboards: metrics are only useful if someone looks at them; create dashboards for every service before shipping to production
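
The propagation mistake is easier to see in isolation. The sketch below uses only the JDK: `TraceContext` is a hypothetical stand-in for a tracer's thread-bound state (not a Micrometer class), and `propagating` shows the capture-and-restore pattern that context-propagation libraries apply for you.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a tracer's thread-bound context. */
final class TraceContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String traceId) { CURRENT.set(traceId); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }

    /** Captures the caller's trace ID now and restores it around the task on the worker thread. */
    static <T> Callable<T> propagating(Callable<T> task) {
        final String captured = CURRENT.get();   // read on the submitting thread
        return () -> {
            String previous = CURRENT.get();     // whatever the worker thread had before
            CURRENT.set(captured);
            try {
                return task.call();
            } finally {
                // Leave the pooled thread as we found it
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }
}

final class PropagationDemo {
    /** Returns { trace ID seen by a plain task, trace ID seen by a wrapped task }. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TraceContext.set("abc123");
        try {
            Callable<String> readTraceId = TraceContext::get;
            String plain = pool.submit(readTraceId).get();
            String wrapped = pool.submit(TraceContext.propagating(readTraceId)).get();
            return new String[] { plain, wrapped };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            TraceContext.clear();
            pool.shutdown();
        }
    }
}
```

The plain task runs on a pool thread that never saw the caller's trace ID, so any span or log line it produces would start a new, disconnected trace. The wrapped task sees the caller's ID because it was captured at submission time, not at execution time.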

Summary

Complete observability requires all three pillars: metrics (Micrometer → Prometheus/Mimir), traces (OpenTelemetry → Tempo), and logs (structured with trace IDs → Loki). Spring Boot 3 auto-instruments HTTP calls, JDBC, and Spring Data out of the box. Custom spans add business-level visibility. Grafana links all three signals, making incident investigation fast. SLO-based alerting ensures you alert on user experience, not on internal metrics.
