Consumer lag means your consumers can't keep up with producers. Unaddressed, it grows until you're hours behind real-time. Here's how to monitor it, find the cause, and fix it.
JOptimize Team
Kafka consumer lag is the gap between the latest message produced and the latest message processed. A lag of zero means your consumers are keeping up in real-time. A growing lag means your consumers are falling behind producers — and if it keeps growing, you'll eventually be processing yesterday's events today.
Lag is one of the most important metrics to monitor in a Kafka-based system. A lag spike can indicate a slow consumer, a rebalance, a downstream database bottleneck, or a sudden increase in produce rate. Each cause has a different fix.
Kafka stores messages in ordered logs per partition. Each consumer group maintains an offset — the position of the last committed message. Lag is:
Lag = Latest offset (producer position) - Consumer group offset (committed position)
A topic with 6 partitions can have lag in any of them. Total lag is the sum across all partitions. But a single partition with high lag while others are at zero indicates an imbalance — one partition has a slow consumer or a hot key.
# Check lag from command line kafka-consumer-groups.sh --bootstrap-server kafka:9092 \ --group order-service \ --describe # Output: # GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG # order-service orders.placed 0 10523 10523 0 # order-service orders.placed 1 8201 10890 2689 ← lag! # order-service orders.placed 2 11043 11043 0
Partition 1 has a lag of 2689 — nearly 2700 unprocessed messages. This is a problem.
The command-line check is for manual diagnosis. Production monitoring needs automated alerting:
# Prometheus + Kafka Exporter — automatically scrapes consumer group lag services: kafka-exporter: image: danielqsj/kafka-exporter:latest command: --kafka.server=kafka:9092 ports: - "9308:9308"
# Grafana alert: fire when lag exceeds threshold for 5 minutes
kafka_consumergroup_lag_sum{
consumergroup="order-service",
topic="orders.placed"
} > 10000
Set the threshold based on your acceptable processing delay. For a real-time notification system, even 1000 might be too much. For an overnight batch report, 100000 might be acceptable.
By default, a Spring Kafka listener processes one message at a time. If each message takes 50ms to process, one partition can handle only 20 messages/second. At 100 messages/second producer rate, lag grows by 80/second:
// Default: 1 thread per listener — slow for I/O-bound processing @KafkaListener(topics = "orders.placed", groupId = "order-service") public void processOrder(OrderPlacedEvent event) { inventoryService.reserveStock(event.orderId()); // 50ms DB call notificationService.sendEmail(event.customerId()); // 80ms external call } // Total: 130ms per message → only 7.7 msg/sec per partition
// Fix: increase concurrency — one thread per partition (up to partition count) @KafkaListener( topics = "orders.placed", groupId = "order-service", concurrency = "6" // 6 threads for 6 partitions — full parallelism ) public void processOrder(OrderPlacedEvent event) { inventoryService.reserveStock(event.orderId()); notificationService.sendEmail(event.customerId()); }
# application.yml spring: kafka: listener: concurrency: 6 # Global default type: SINGLE # SINGLE (one msg) or BATCH (list of msgs)
If the bottleneck is a slow database or external service call, concurrency helps but has limits. Batch processing reduces per-message overhead:
// Batch listener: receive up to 500 messages at once @KafkaListener( topics = "orders.placed", groupId = "order-service", containerFactory = "batchKafkaListenerContainerFactory" ) public void processOrders(List<OrderPlacedEvent> events) { log.info("Processing batch of {} orders", events.size()); // Batch insert instead of individual updates List<StockReservation> reservations = events.stream() .map(e -> new StockReservation(e.orderId(), e.items())) .toList(); inventoryService.batchReserveStock(reservations); // 1 DB round trip for N events // Batch email dispatch List<EmailRequest> emails = events.stream() .map(e -> new EmailRequest(e.customerId(), "Order confirmed")) .toList(); notificationService.batchSend(emails); // 1 API call for N events }
@Bean public ConcurrentKafkaListenerContainerFactory<String, OrderPlacedEvent> batchKafkaListenerContainerFactory() { ConcurrentKafkaListenerContainerFactory<String, OrderPlacedEvent> factory = new ConcurrentKafkaListenerContainerFactory<>(); factory.setConsumerFactory(consumerFactory()); factory.setBatchListener(true); factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH); return factory; }
Every consumer join or leave triggers a rebalance during which consumption stops. Frequent rebalances appear as periodic lag spikes:
# application.yml — tune consumer session parameters spring: kafka: consumer: # Increase session timeout — prevent rebalances from slow poll cycles properties: session.timeout.ms: 45000 # Default 45s — increase for slow processing heartbeat.interval.ms: 15000 # 1/3 of session timeout max.poll.interval.ms: 600000 # 10 min — for slow batch processing max.poll.records: 500 # Match your batch size
If max.poll.interval.ms is too low and your processing takes longer, the consumer is kicked from the group and triggers a rebalance. Increase it to match your actual processing time.
Messages with the same partition key always go to the same partition. If one customer generates 90% of your events, one partition is overwhelmed while others are idle:
// Problem: all events for customer ID 1 go to partition 1 kafkaTemplate.send("orders.placed", customerId.toString(), event); // If customer 1 has 10,000 orders/second, partition 1 has all the work // Fix: add entropy to the partition key for high-volume entities public String partitionKey(Long customerId, Long orderId) { // Hash combines customer (for ordering) with order (for spread) return customerId + "-" + (orderId % 10); // Spreads across 10 sub-keys } // Downside: ordering guarantee is now per-bucket, not per-customer
For strict ordering requirements, there's no perfect solution — ordering requires a single partition, and a single partition has one consumer. The trade-off is real.
Sometimes the consumer itself is slow — an N+1 query, a missing index, or a synchronous external HTTP call that should be async:
// SLOW: N+1 in consumer public void processOrder(OrderPlacedEvent event) { Order order = orderRepo.findById(event.orderId()).orElseThrow(); order.getItems().forEach(item -> { // N+1 — lazy load per item Product product = productRepo.findById(item.getProductId()).orElseThrow(); inventoryService.reserve(product, item.getQuantity()); }); } // FAST: eager fetch in consumer public void processOrder(OrderPlacedEvent event) { Order order = orderRepo.findByIdWithItems(event.orderId()).orElseThrow(); // JOIN FETCH List<Long> productIds = order.getItems().stream() .map(OrderItem::getProductId).toList(); Map<Long, Product> products = productRepo.findAllById(productIds).stream() .collect(toMap(Product::getId, identity())); // 1 query for all products order.getItems().forEach(item -> inventoryService.reserve(products.get(item.getProductId()), item.getQuantity())); }
Kafka consumer lag monitoring requires Kafka Exporter + Grafana with threshold alerts. When lag grows, the cause is usually one of five things: single-threaded consumers (fix: increase concurrency), slow downstream processing (fix: batch processing), frequent rebalancing (fix: tune session and poll timeouts), partition imbalance from hot keys, or slow consumer app code (fix: eliminate N+1 queries). Address them in that order.
N+1 queries in Kafka consumers are especially damaging — they multiply lag. JOptimize detects these patterns in your consumer services.
Keep up with your Kafka topics.
Master Spring Boot, security, and Java performance with hands-on courses.
JOptimize finds N+1 queries, EAGER collections, and 70+ other issues in your Java codebase — in under 30 seconds.