The three pillars of observability are metrics, logs, and traces. OpenTelemetry unifies them with a single instrumentation standard. Here's how to implement it properly in Spring Boot.
JOptimize Team
Production systems fail in ways you didn't predict. A service slows down — but the CPU is fine, the memory is fine, the error rate is zero. Somewhere in the chain, a database query is taking 800ms instead of 5ms, triggered by a code path that only runs for enterprise accounts. You'd never find this by looking at a single metric. You need all three pillars of observability: metrics to know something is wrong, traces to find where it's wrong, and logs to understand why.
OpenTelemetry has become the industry standard for instrumentation. It works across languages and backends, and Spring Boot 3.x has first-class support for it. This guide covers setting up the complete observability stack.
Metrics are aggregates over time: request rate, error rate, P99 latency, heap usage. They're cheap to store and perfect for alerting — but they tell you that something is wrong, not what or why.
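To make "aggregates over time" concrete, here is a minimal sketch of a nearest-rank percentile over a window of latency samples. The class name `LatencyPercentiles` and the sample data are made up for illustration; real metrics libraries use histogram buckets rather than sorting raw samples.

```java
import java.util.Arrays;

/** Nearest-rank percentile over a window of latency samples (illustrative only). */
final class LatencyPercentiles {

    /** Returns the smallest sample such that at least {@code percentile}% of samples are <= it. */
    static double percentile(double[] samplesMs, double percentile) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // 1-based nearest rank, clamped so that percentile 0 maps to the first sample
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical window: 98 fast requests at 5 ms and 2 slow ones at 800 ms
        double[] samples = new double[100];
        Arrays.fill(samples, 5.0);
        samples[98] = 800.0;
        samples[99] = 800.0;

        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```

With this data the median stays at 5 ms while the P99 jumps to 800 ms: the metric shows that a small share of requests is slow, but nothing in the number says which requests or why.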
Traces follow a request through the system. A distributed trace shows every service call, database query, and cache operation for a single request, with timing for each. Traces answer "which part of the system is slow for this specific request?"
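For a trace to cover several services, every hop has to pass the trace ID along. OpenTelemetry's default propagation format is the W3C Trace Context `traceparent` HTTP header. The sketch below parses a version-00 header; the `TraceParent` record is a hypothetical helper for illustration, not part of any SDK, and it ignores the companion `tracestate` header.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal parser for a version-00 W3C "traceparent" header (illustrative, not SDK code). */
record TraceParent(String traceId, String parentSpanId, boolean sampled) {

    // version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // The spec treats all-zero IDs as invalid.
        if (traceId.chars().allMatch(c -> c == '0')
                || parentSpanId.chars().allMatch(c -> c == '0')) {
            throw new IllegalArgumentException("All-zero ID in traceparent: " + header);
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(traceId, parentSpanId, sampled);
    }

    public static void main(String[] args) {
        // Example header taken from the W3C Trace Context specification
        System.out.println(parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
    }
}
```

In a Spring Boot application you normally never parse this header yourself; the instrumentation reads it on incoming requests and writes it on outgoing ones.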
Logs are the full context around a specific event. When a trace shows a slow database call, the corresponding log tells you the exact SQL, the parameters, the user ID, and any warnings Hibernate generated.
These three signals work together. An alert fires on a metric (P99 > 1s). You look at a trace to find the slow span (a repository method). You look at the log for that span to find the exact query and parameters. Without all three, you're flying partially blind.
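The "find the slow span" step is what a trace viewer does for you visually. As a rough sketch of the idea, the code below picks the span with the largest self time (its own duration minus its direct children). The `Span` record is heavily simplified and hypothetical: span names double as IDs, children are assumed to run sequentially, and the durations are invented.

```java
import java.util.Comparator;
import java.util.List;

final class SlowSpanFinder {

    /** Simplified span; real spans carry IDs, timestamps and attributes. */
    record Span(String name, String parent, long durationMs) {}

    /** Self time = own duration minus the time spent in direct children. */
    static long selfTimeMs(Span span, List<Span> trace) {
        long childTime = trace.stream()
                .filter(s -> span.name().equals(s.parent()))
                .mapToLong(Span::durationMs)
                .sum();
        return span.durationMs() - childTime;
    }

    /** The span that accounts for the most time on its own. */
    static Span slowest(List<Span> trace) {
        return trace.stream()
                .max(Comparator.comparingLong(s -> selfTimeMs(s, trace)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
                new Span("GET /orders", null, 830),
                new Span("order.processing", "GET /orders", 825),
                new Span("OrderRepository.findByCustomer", "order.processing", 800),
                new Span("cache.get", "order.processing", 3));
        System.out.println("Slowest span: " + slowest(trace).name());
    }
}
```

Here the HTTP span and the business span are slow only because they contain the repository call, which is why self time rather than total duration points at the culprit.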
<!-- Spring Boot 3.x auto-configures OpenTelemetry -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.5.0</version>
</dependency>
# application.yml
spring:
  application:
    name: order-service

management:
  tracing:
    sampling:
      probability: 1.0  # 100% in dev; 10% in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
    metrics:
      export:
        url: http://otel-collector:4318/v1/metrics
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"
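The sampling probability deserves a closer look. Ratio-based samplers typically derive the keep/drop decision from the trace ID itself rather than from a random number, so every service that sees the same trace reaches the same decision. The class below is a simplified sketch of that idea under my own assumptions (the name `RatioSampler` and the exact bit handling are illustrative); it is not the OpenTelemetry SDK's sampler.

```java
/** Simplified trace-ID-based ratio sampler (a sketch, not the SDK implementation). */
final class RatioSampler {

    private final double probability;
    private final long threshold;

    RatioSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.probability = probability;
        this.threshold = (long) (probability * Long.MAX_VALUE);
    }

    /** @param traceId 32 lowercase hex characters */
    boolean shouldSample(String traceId) {
        if (probability >= 1.0) {
            return true;
        }
        if (probability <= 0.0) {
            return false;
        }
        // Treat the low 64 bits of the trace ID as a uniformly distributed number.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        // Drop one bit so the value is non-negative, then compare with the threshold.
        return (low >>> 1) < threshold;
    }

    public static void main(String[] args) {
        RatioSampler tenPercent = new RatioSampler(0.1);
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("Keep trace " + traceId + "? " + tenPercent.shouldSample(traceId));
    }
}
```

Because the decision is a pure function of the trace ID, lowering the probability in production reduces volume without producing traces that are complete in one service and missing in the next.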
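With the application.yml configuration above, Spring Boot automatically samples requests at the configured probability, exports traces and metrics to the collector over OTLP, publishes latency histogram buckets for http.server.requests, and prefixes every log line with the application name, trace ID, and span ID.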
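Grafana Labs maintains a complete observability stack that integrates well with OpenTelemetry: Tempo for distributed traces, Loki for logs, and Mimir for metrics, with Grafana as the UI on top. A local setup looks like this: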
# docker-compose.yml: local development observability
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yml"]  # point the collector at the mounted file
    volumes:
      - ./otel-config.yml:/etc/otel/config.yml
    ports:
      - "4318:4318"  # OTLP HTTP
      - "4317:4317"  # OTLP gRPC
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
  tempo:  # Distributed traces
    image: grafana/tempo:latest
  loki:  # Logs
    image: grafana/loki:latest
  mimir:  # Metrics
    image: grafana/mimir:latest
The beautiful part of this stack is Grafana's ability to link between signals. From a trace in Tempo, you can click directly to the logs for that trace in Loki. From a Grafana dashboard, you can click on a spike in latency and jump to exemplar traces from that time period.
Spring Boot auto-instruments HTTP endpoints, JDBC calls, and Spring Data repositories. But you also want traces for business operations that matter:
import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class OrderService {

    private final Tracer tracer;

    public OrderResult processOrder(OrderRequest req) {
        // Create a custom span for the business operation
        Span span = tracer.nextSpan()
                .name("order.processing")
                .tag("order.customerId", req.getCustomerId())
                .tag("order.region", req.getRegion())
                .tag("order.itemCount", String.valueOf(req.getItems().size()))
                .start();

        try (Tracer.SpanInScope scope = tracer.withSpan(span)) {
            // Spans started inside this scope become children of "order.processing"
            validateOrder(req);
            OrderResult result = saveAndPublish(req);
            span.tag("order.id", result.orderId().toString());
            span.tag("order.outcome", "success");
            return result;
        } catch (Exception e) {
            span.tag("order.outcome", "failure");
            span.error(e); // records the exception on the span, even if its message is null
            throw e;
        } finally {
            span.end();
        }
    }
}
Custom spans let you see business-level timing in traces. Instead of seeing only "JDBC execute" and "HTTP GET", you see "order.processing" with its children — immediately showing which part of your business logic is slow.
Logs are most useful when they include the trace ID — this is what enables navigation from a trace to the corresponding logs:
import static net.logstash.logback.argument.StructuredArguments.kv;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class OrderService {

    public void processOrder(Order order) {
        // Spring Boot + Micrometer automatically add traceId and spanId to the MDC,
        // so this log line includes the trace ID without any manual code.
        // kv(...) comes from the logstash encoder and emits each pair as a structured JSON field.
        log.info("Processing order",
                kv("orderId", order.getId()),
                kv("customerId", order.getCustomerId()),
                kv("amount", order.getTotal()),
                kv("region", order.getRegion()));
    }
}
<!-- pom.xml: encoder that writes structured JSON logs for log aggregation -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <version>7.4</version>
</dependency>
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <!-- Auto-includes traceId and spanId from MDC -->
        <includeMdcKeyName>traceId</includeMdcKeyName>
        <includeMdcKeyName>spanId</includeMdcKeyName>
    </encoder>
</appender>
JSON-structured logs are machine-parseable. Loki can filter by {app="order-service"} | json | traceId="abc123" to find all logs for a specific trace.
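If the MDC feels like magic, the mechanism behind it is small: a per-thread map whose entries get written into every log line. The toy version below shows why application code never has to pass the trace ID to the logger. `ToyMdc` is a hypothetical stand-in written for this article; it is not SLF4J's MDC, and its JSON escaping covers only quotes and backslashes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy per-thread logging context, sketching how MDC-style correlation works. */
final class ToyMdc {

    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.remove();
    }

    /** Renders one JSON log line: context fields first, then level and message. */
    static String jsonLine(String level, String message) {
        Map<String, String> fields = new LinkedHashMap<>(CONTEXT.get());
        fields.put("level", level);
        fields.put("message", message);
        return fields.entrySet().stream()
                .map(e -> quote(e.getKey()) + ":" + quote(e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
    }

    // Minimal escaping: backslashes and double quotes only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // The tracing layer would do this when a span starts
        put("traceId", "abc123");
        put("spanId", "def456");
        // Business code logs without knowing anything about tracing
        System.out.println(jsonLine("INFO", "Processing order"));
        clear();
    }
}
```

The tracing instrumentation plays the role of the `put` and `clear` calls, which is why forgetting to include the MDC fields in the encoder silently breaks the link between logs and traces.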
# Grafana SLO definition: alert when the SLO is at risk
# 99.9% of requests under 1 second, over 30 days
# Error budget: 0.1% × 30 days × 24 hours × 3600 seconds = 2592 seconds
# That's about 43 minutes of slow responses per month
# PromQL: multi-window burn rate alert
# Fires when the error budget is burning 14.4x faster than normal
# in both the 1-hour and the 5-minute window.
# 14.4 × 0.1% = 1.44% slow requests, so the threshold is 1 - 0.0144 = 0.9856
(
  sum(rate(http_server_requests_seconds_bucket{le="1.0",status!~"5.."}[1h]))
  /
  sum(rate(http_server_requests_seconds_count{status!~"5.."}[1h]))
) < 0.9856
and
(
  sum(rate(http_server_requests_seconds_bucket{le="1.0",status!~"5.."}[5m]))
  /
  sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
) < 0.9856
SLO-based alerting tells you when users are experiencing problems, not when some internal metric crosses an arbitrary threshold. It's the difference between alerting on what matters and alert fatigue from metrics that fluctuate harmlessly.
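The arithmetic behind error budgets and burn rates is easy to get wrong by a factor, so here is a small sketch that spells it out. The `ErrorBudget` class and its method names are hypothetical; the inputs are the 99.9% target, the 30-day window, and the 14.4x burn rate used in this article.

```java
import java.time.Duration;

/** Error budget arithmetic for a ratio-based SLO (illustrative sketch). */
final class ErrorBudget {

    private final double sloTarget; // e.g. 0.999
    private final Duration window;  // e.g. 30 days

    ErrorBudget(double sloTarget, Duration window) {
        this.sloTarget = sloTarget;
        this.window = window;
    }

    /** Total time the service may spend out of SLO during the window. */
    Duration allowedBadTime() {
        return Duration.ofMillis(Math.round((1.0 - sloTarget) * window.toMillis()));
    }

    /** How many times faster than "exactly on budget" the budget is being spent. */
    double burnRate(double observedGoodRatio) {
        return (1.0 - observedGoodRatio) / (1.0 - sloTarget);
    }

    /** Good-request ratio below which the given burn rate is exceeded. */
    double goodRatioThreshold(double burnRate) {
        return 1.0 - burnRate * (1.0 - sloTarget);
    }

    /** Time until the whole budget is gone at a constant burn rate. */
    Duration timeToExhaustion(double burnRate) {
        return Duration.ofMillis(Math.round(window.toMillis() / burnRate));
    }

    public static void main(String[] args) {
        ErrorBudget budget = new ErrorBudget(0.999, Duration.ofDays(30));
        System.out.println("Budget (minutes): " + budget.allowedBadTime().toMinutes());
        System.out.println("Threshold at 14.4x: " + budget.goodRatioThreshold(14.4));
        System.out.println("Hours to exhaustion at 14.4x: " + budget.timeToExhaustion(14.4).toHours());
    }
}
```

For a 99.9% target over 30 days this gives a budget of about 43 minutes, and a 14.4x burn rate corresponds to a good-request ratio of roughly 0.9856. At that pace the monthly budget would be gone in about 50 hours, or around 2% of it per hour, which is why that burn rate is usually treated as page-worthy.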
Logs without a traceId can't be linked to traces in Grafana; always include the correlation fields.

Complete observability requires all three pillars: metrics (Micrometer → Prometheus/Mimir), traces (OpenTelemetry → Tempo), and logs (structured with trace IDs → Loki). Spring Boot 3 auto-instruments HTTP calls, JDBC, and Spring Data out of the box. Custom spans add business-level visibility. Grafana links all three signals, making incident investigation fast. SLO-based alerting ensures you alert on user experience, not on internal metrics.
OpenTelemetry traces tell you something is slow. JOptimize tells you why — N+1 queries, missing indexes, and over-fetching — before they appear in production traces.
Observe better, fix faster.