kubernetes, spring-boot, autoscaling, devops, java, performance

Kubernetes HPA with Spring Boot: Auto-Scale on CPU, Memory, and Custom Metrics (2026)

Manual Kubernetes scaling is reactive and error-prone. Learn how to configure Horizontal Pod Autoscaler with Spring Boot using CPU, memory, and custom Micrometer metrics.


JOptimize Team

May 28, 2026 · 8 min read

Manual Kubernetes scaling (bumping replicas when the on-call engineer notices high CPU) is too slow and too reactive. By the time a human responds, users are already seeing degraded performance. Horizontal Pod Autoscaler (HPA) re-evaluates real metrics every 15 seconds by default and adjusts the replica count without waiting for a human.


Basic CPU-Based HPA

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale up when average CPU exceeds 70% of the request
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Act on the lowest recommendation from the last 60s
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60            # Add at most 2 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Act on the highest recommendation from the last 5 min
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120           # Remove at most 1 pod per 2 minutes

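The behavior block controls how quickly HPA reacts. Without it, HPA falls back to its defaults: a 300-second scale-down stabilization window, after which it may remove every surplus pod in a single step, and no scale-up window at all. Capping scale-down at one pod per two minutes keeps a brief lull from draining capacity and reduces flapping (scale down → traffic spike → scale up → repeat).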
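To make the scale-down stabilization window concrete, here is a minimal sketch of the idea, not the controller's actual code: recent replica recommendations are remembered, and the highest one still inside the window wins. The class and method names are hypothetical and chosen for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of scale-down stabilization; not Kubernetes source code.
final class ScaleDownStabilizer {
    private record Recommendation(long timestampSeconds, int replicas) {}

    private final long windowSeconds;
    private final Deque<Recommendation> history = new ArrayDeque<>();

    ScaleDownStabilizer(long windowSeconds) {
        this.windowSeconds = windowSeconds;
    }

    /** Records a recommendation and returns the highest one still inside the window. */
    int stabilize(long nowSeconds, int recommendedReplicas) {
        history.addLast(new Recommendation(nowSeconds, recommendedReplicas));
        // Drop recommendations that have aged out of the window.
        while (history.peekFirst().timestampSeconds() < nowSeconds - windowSeconds) {
            history.removeFirst();
        }
        int highest = recommendedReplicas;
        for (Recommendation r : history) {
            highest = Math.max(highest, r.replicas());
        }
        return highest;
    }
}
```

With a 300-second window, a recommendation of 10 replicas keeps the deployment at 10 until it ages out, even if every later recommendation is 4.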


Resource Requests Are Required

HPA CPU utilization is calculated as: actual CPU / requested CPU. Without resource requests, HPA can't calculate utilization:

# deployment.yaml (REQUIRED for HPA to work)
containers:
  - name: order-service
    resources:
      requests:
        cpu: "500m"      # HPA measures against this
        memory: "512Mi"
      limits:
        cpu: "2000m"
        memory: "1Gi"

With requests.cpu=500m and actual CPU at 350m, utilization = 70% — HPA holds. With actual CPU at 400m, utilization = 80% → HPA scales up.
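The arithmetic above can be sketched in a few lines. This assumes the replica formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and the default 10% tolerance; the class name and the starting count of 4 replicas are illustrative assumptions.

```java
// Hypothetical helper that mirrors the documented HPA formula; not controller code.
final class HpaReplicaMath {
    // Default HPA tolerance: a ratio within 10% of 1.0 causes no change.
    static final double TOLERANCE = 0.10;

    /** Utilization in percent: actual usage divided by the CPU request. */
    static double utilizationPercent(double usedMillicores, double requestedMillicores) {
        return 100.0 * usedMillicores / requestedMillicores;
    }

    /** ceil(currentReplicas * current / target), unless the ratio is within tolerance. */
    static int desiredReplicas(int currentReplicas, double currentUtilization, double targetUtilization) {
        double ratio = currentUtilization / targetUtilization;
        if (Math.abs(ratio - 1.0) <= TOLERANCE) {
            return currentReplicas;
        }
        return (int) Math.ceil(currentReplicas * ratio);
    }

    public static void main(String[] args) {
        double held = utilizationPercent(350, 500); // 350m used of a 500m request
        double high = utilizationPercent(400, 500); // 400m used of a 500m request
        System.out.println(desiredReplicas(4, held, 70)); // holds at 4
        System.out.println(desiredReplicas(4, high, 70)); // scales to 5
    }
}
```

The tolerance matters in practice: at 75% against a 70% target the ratio is about 1.07, so HPA leaves the replica count alone.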


Custom Metrics: Scale on HTTP Request Rate

Expose custom metrics from Spring Boot via Micrometer:

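import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

// Counts in-flight requests with a servlet filter
// (assumes Spring MVC on Spring Boot 3, i.e. jakarta.servlet).
@Component
public class RequestRateMetrics extends OncePerRequestFilter {

    private final AtomicLong activeRequests = new AtomicLong(0);

    public RequestRateMetrics(MeterRegistry meterRegistry) {
        // Expose active request count as a gauge
        Gauge.builder("http.server.active_requests", activeRequests, AtomicLong::get)
            .description("Current active HTTP requests")
            .register(meterRegistry);
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        activeRequests.incrementAndGet();      // Request started
        try {
            chain.doFilter(request, response);
        } finally {
            activeRequests.decrementAndGet();  // Request finished, even if it failed
        }
    }
}

# HPA using custom metric (requires Prometheus Adapter)
# metadata, scaleTargetRef, and replica bounds as in hpa.yaml above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_active_requests   # Prometheus metric name
        target:
          type: AverageValue
          averageValue: 100   # Scale up when avg > 100 active requests per pod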
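For a Pods metric with an AverageValue target, the replica count follows from the total across pods divided by the per-pod target. The sketch below is a simplified model that ignores the HPA tolerance and any behavior policies; the class name and sample readings are hypothetical.

```java
import java.util.List;

// Hypothetical model of AverageValue scaling; ignores tolerance and behavior policies.
final class PodsMetricScaling {
    /** Replicas needed so the per-pod average falls to the target, clamped to [min, max]. */
    static int desiredReplicas(List<Integer> activeRequestsPerPod, int targetAverage,
                               int minReplicas, int maxReplicas) {
        long total = 0;
        for (int value : activeRequestsPerPod) {
            total += value;
        }
        int needed = (int) Math.ceil((double) total / targetAverage);
        return Math.max(minReplicas, Math.min(maxReplicas, needed));
    }
}
```

Three pods reporting 150, 130, and 170 active requests total 450, so a target of 100 per pod calls for 5 replicas.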

KEDA: Event-Driven Autoscaling (Scale to Zero)

For Kafka consumers, KEDA (Kubernetes Event-Driven Autoscaling) scales based on consumer lag:

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer-scaler
spec:
  scaleTargetRef:
    name: order-consumer
  minReplicaCount: 0    # Scale to ZERO when no messages
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: order-service
        topic: orders
        lagThreshold: "100"   # Scale up when lag > 100 messages per replica
        offsetResetPolicy: latest

With KEDA:

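  • 0 messages of consumer lag → 0 consumer pods (save money)
  • 1000 messages of lag → 10 pods (100 messages/pod threshold)
  • 10000 messages of lag → 20 pods (capped at maxReplicaCount; by default KEDA also will not exceed the topic's partition count)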
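The three cases above follow from one calculation, sketched here as a simplified model rather than KEDA's implementation. It assumes the default Kafka scaler behavior of not exceeding the partition count; the class name and the partition count of 32 in the usage note are hypothetical.

```java
// Hypothetical model of lag-based scaling; not KEDA source code.
final class KafkaLagScaling {
    /** ceil(lag / threshold), capped by maxReplicas and partitions, floored at minReplicas. */
    static int desiredReplicas(long totalLag, long lagThreshold,
                               int minReplicas, int maxReplicas, int partitions) {
        int byLag = (int) Math.ceil((double) totalLag / lagThreshold);
        int capped = Math.min(byLag, Math.min(maxReplicas, partitions));
        return Math.max(minReplicas, capped);
    }
}
```

For a 32-partition topic with the settings above, a lag of 0 gives 0 pods, 1000 gives 10, and 10000 gives 20. With only 12 partitions, a lag of 10000 would stop at 12 pods, because extra consumers in the group would sit idle.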

Startup and Readiness Probe Tuning for HPA

# deployment.yaml (fast probes for quick scale-up response)
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 24   # Allow 120s (24 * 5s) for startup
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8081
  periodSeconds: 5
  failureThreshold: 3    # Remove from LB after 15s if unhealthy

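With a 5-second readiness period, a new pod joins the load balancer within about 5 seconds of the application reporting ready, so scale-up response time is dominated by image pull and JVM startup rather than by probing. Slow probes add their full period on top of that.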
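The probe settings translate into time budgets with simple arithmetic. The sketch below is an approximation that ignores probe timeouts and scheduling jitter; the class and method names are hypothetical.

```java
// Hypothetical helper for reasoning about probe timing; ignores timeoutSeconds and jitter.
final class ProbeBudget {
    /** Longest time a container may take to pass its startup probe before it is restarted. */
    static int maxStartupSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    /** Time until a pod that stops responding is removed from the load balancer. */
    static int secondsUntilUnready(int periodSeconds, int failureThreshold) {
        return periodSeconds * failureThreshold;
    }
}
```

For the values above, the startup budget is 10 + 5 × 24 = 130 seconds once the initial delay is counted, and an unhealthy pod is removed after 5 × 3 = 15 seconds.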


Monitoring HPA Activity

# Watch HPA in real-time
kubectl get hpa order-service-hpa -w

# Detailed status
kubectl describe hpa order-service-hpa
# Shows: current metrics, last scale time, events

# HPA events
kubectl events --for hpa/order-service-hpa

Metrics to chart in Grafana:

  • kube_horizontalpodautoscaler_status_current_replicas — current pods
  • kube_horizontalpodautoscaler_status_desired_replicas — desired pods
  • container_cpu_usage_seconds_total — cumulative CPU seconds per container; chart its rate() to see actual CPU per pod
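Because container_cpu_usage_seconds_total is a cumulative counter, actual CPU usage is the difference between two samples divided by the time between them. A minimal sketch of that calculation, with hypothetical names and sample values:

```java
// Hypothetical helper showing how a cumulative CPU counter becomes a usage figure.
final class CpuRate {
    /** Average cores used between two samples of a cumulative CPU-seconds counter. */
    static double coresUsed(double earlierCpuSeconds, double laterCpuSeconds, double intervalSeconds) {
        return (laterCpuSeconds - earlierCpuSeconds) / intervalSeconds;
    }

    /** Utilization in percent relative to the CPU request, the figure HPA compares to its target. */
    static double utilizationPercent(double coresUsed, double requestedCores) {
        return 100.0 * coresUsed / requestedCores;
    }
}
```

If the counter rises from 1200 to 1221 CPU-seconds over 60 seconds, the pod averaged 0.35 cores, which is 70% of a 500m request.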

Common Mistakes to Avoid

  • Shortening the scale-down stabilization window: scaleDown.stabilizationWindowSeconds defaults to 300 seconds; setting it much lower lets HPA scale down as soon as CPU drops, causing flapping; keep it at 300 seconds or more
  • Setting the CPU request far below the limit: throttling happens at the limit but HPA measures against the request; a pod with a 500m request and 2000m limit running at 1500m shows 300% utilization, so a 70% target trips long before the pod is actually constrained
  • Scaling on memory for Java apps: the JVM seldom returns freed heap to the operating system, so container memory stays high after a load spike has passed; memory utilization is a poor scaling signal for Java apps; use CPU or request rate instead
  • Not testing scale-up time: measure the time from traffic spike to new pods serving traffic; if it exceeds 2 minutes, tune the probes and speed up pod startup

Summary

HPA scales Spring Boot pods automatically based on CPU utilization (simplest) or custom metrics like request rate (more accurate), and KEDA adds event-driven scale-to-zero. The critical configuration is behavior: a slow scale-down (300s window, one pod at a time) prevents flapping, and fast readiness probes minimize time-to-serving for new pods. Always set resource requests, because HPA can't calculate utilization without them.


Profile Your App Before Setting HPA Targets

JOptimize's live profiling shows your app's actual CPU and memory profile under load — use it to set realistic HPA targets based on real behavior.

Know your app's resource profile before auto-scaling it.
