Kubernetes HPA with Spring Boot: Auto-Scale on CPU, Memory, and Custom Metrics (2026)

Manual Kubernetes scaling, where the on-call engineer bumps replicas after noticing high CPU, is too slow and too reactive. By the time a human responds, users are already seeing degraded performance. The Horizontal Pod Autoscaler (HPA) re-evaluates metrics every 15 seconds by default and adjusts replica counts automatically.

Basic CPU-Based HPA

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Target: average CPU at 70% of requests
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Act on the lowest recommendation from the last 60s
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60              # Add max 2 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Act on the highest recommendation from the last 5min
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120             # Remove max 1 pod per 2 minutes

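The behavior block controls how quickly HPA changes the replica count. Kubernetes already applies a 300-second scale-down stabilization window by default, so this block mainly makes that choice explicit and adds rate limits: at most 2 pods added per minute and 1 pod removed per 2 minutes. Without that damping (for example, with the scale-down window set to 0), HPA scales down aggressively and causes flapping (scale down → traffic spike → scale up → repeat).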
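Rate limits cut both ways: they prevent flapping, but they also cap how fast the service can grow. The sketch below estimates how long the Pods policies in the behavior block above take to move between minReplicas and maxReplicas. The class and method names are hypothetical, and the result is a rough estimate that ignores the stabilization window, metric lag, and pod startup time.

```java
// Rough estimate of how long a "type: Pods" scaling policy needs to change the replica count.
final class ScalingPace {

    /** Policy periods needed to change the replica count by |to - from| at podsPerPeriod per period. */
    static int periodsNeeded(int from, int to, int podsPerPeriod) {
        int delta = Math.abs(to - from);
        return (delta + podsPerPeriod - 1) / podsPerPeriod; // ceiling division
    }

    /** Approximate duration in seconds: one policy period per step. */
    static int approxSeconds(int from, int to, int podsPerPeriod, int periodSeconds) {
        return periodsNeeded(from, to, podsPerPeriod) * periodSeconds;
    }

    public static void main(String[] args) {
        // scaleUp: 2 pods per 60s, from minReplicas=2 to maxReplicas=20
        System.out.println("scale up 2 -> 20: ~" + approxSeconds(2, 20, 2, 60) + "s");
        // scaleDown: 1 pod per 120s, from 20 back to 2
        System.out.println("scale down 20 -> 2: ~" + approxSeconds(20, 2, 1, 120) + "s");
    }
}
```

With the values from hpa.yaml, growing from 2 to 20 pods takes about 9 minutes and shrinking back takes about 36 minutes. If a traffic spike needs full capacity sooner, a larger scaleUp value or a Percent policy may be a better fit.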

Resource Requests Are Required

HPA CPU utilization is calculated as: actual CPU / requested CPU. Without resource requests, HPA can't calculate utilization:

# deployment.yaml — REQUIRED for HPA to work
containers:
- name: order-service
  resources:
    requests:
      cpu: "500m"      # HPA measures against this
      memory: "512Mi"
    limits:
      cpu: "2000m"
      memory: "1Gi"

With requests.cpu=500m and actual CPU at 350m, utilization is 70% and HPA holds. With actual CPU at 400m, utilization is 80%; the ratio 80/70 ≈ 1.14 exceeds HPA's default 10% tolerance, so HPA scales up.
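The arithmetic above can be sketched in a few lines. This follows the formula in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. The class name, method names, and the current replica count of 4 are illustrative assumptions, not part of any Kubernetes API.

```java
// Sketch of the replica calculation documented for the Horizontal Pod Autoscaler.
final class HpaMath {

    // Default tolerance: HPA skips scaling when the ratio is within 10% of 1.0.
    static final double TOLERANCE = 0.10;

    /** Average CPU utilization as a percentage of the CPU request. */
    static double utilizationPercent(double avgUsageMillicores, double requestMillicores) {
        return avgUsageMillicores * 100.0 / requestMillicores;
    }

    /** desired = ceil(current * currentUtilization / targetUtilization), clamped to [min, max]. */
    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        double ratio = currentUtilization / targetUtilization;
        if (Math.abs(ratio - 1.0) <= TOLERANCE) {
            return currentReplicas; // within tolerance: no change
        }
        int desired = (int) Math.ceil(currentReplicas * ratio);
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        double request = 500; // requests.cpu = 500m, as in the deployment above
        int current = 4;      // hypothetical current replica count
        for (double usage : new double[] {350, 400}) {
            double util = utilizationPercent(usage, request);
            int desired = desiredReplicas(current, util, 70, 2, 20);
            System.out.printf("usage=%.0fm utilization=%.0f%% desired=%d%n", usage, util, desired);
        }
    }
}
```

Starting from 4 replicas, 350m keeps the count at 4, while 400m yields ceil(4 × 80/70) = 5. The behavior policies then decide how quickly HPA is allowed to reach that number.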

Custom Metrics: Scale on HTTP Request Rate

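Expose a custom metric from Spring Boot via Micrometer. The example tracks in-flight requests, a concurrency signal that rises with both request rate and latency: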

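import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import jakarta.servlet.FilterChain;   // jakarta.* assumes Spring Boot 3; use javax.* on Spring Boot 2
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class RequestRateMetrics extends OncePerRequestFilter {

    private final AtomicLong activeRequests = new AtomicLong(0);

    public RequestRateMetrics(MeterRegistry meterRegistry) {
        // Expose active request count as a gauge
        Gauge.builder("http.server.active_requests",
                activeRequests, AtomicLong::get)
            .description("Current active HTTP requests")
            .register(meterRegistry);
    }

    // A servlet filter sees both the start and the end of each request,
    // so it can keep the in-flight count accurate.
    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        activeRequests.incrementAndGet();
        try {
            chain.doFilter(request, response);
        } finally {
            activeRequests.decrementAndGet(); // Runs even if the handler throws
        }
    }
}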

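# HPA using custom metric (requires Prometheus Adapter to serve it through the custom metrics API)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa   # Same HPA as above; one Deployment should have only one HPA
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_server_active_requests  # Prometheus name of the Micrometer gauge
      target:
        type: AverageValue
        averageValue: 100  # Target average of 100 active requests per pod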

KEDA: Event-Driven Autoscaling (Scale to Zero)

For Kafka consumers, KEDA (Kubernetes Event-Driven Autoscaling) scales based on consumer lag:

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer-scaler
spec:
  scaleTargetRef:
    name: order-consumer
  minReplicaCount: 0    # Scale to ZERO when no messages
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: order-service
      topic: orders
      lagThreshold: "100"   # Scale up when lag > 100 messages per replica
      offsetResetPolicy: latest

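With KEDA (assuming the topic has at least 20 partitions, because the Kafka scaler does not exceed the partition count by default):

0 messages of lag → 0 consumer pods (save money)
1000 messages of lag → 10 pods (100 messages/pod threshold)
10000 messages of lag → 20 pods (at max)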
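The mapping from lag to pod count can be approximated as a ceiling division with caps. This is a simplified sketch, not KEDA's actual implementation (KEDA feeds the lag into an HPA as an average-value metric). The class name, method name, and the partition counts in the example are hypothetical. The partition cap reflects the Kafka scaler's default behavior, which can be changed with its allowIdleConsumers setting.

```java
// Simplified model of lag-based scaling for a Kafka consumer group.
final class LagScaling {

    /**
     * Approximate consumer pod count for a given total lag:
     * ceil(totalLag / lagThreshold), capped by maxReplicas and the partition count,
     * and never below minReplicas.
     */
    static int desiredConsumers(long totalLag, long lagThreshold,
                                int minReplicas, int maxReplicas, int partitions) {
        if (totalLag <= 0) {
            return minReplicas; // nothing to consume: fall back to the minimum (0 = scale to zero)
        }
        long byLag = (totalLag + lagThreshold - 1) / lagThreshold; // ceiling division
        long capped = Math.min(byLag, Math.min(maxReplicas, partitions));
        return (int) Math.max(capped, minReplicas);
    }

    public static void main(String[] args) {
        int partitions = 32; // hypothetical partition count for the "orders" topic
        for (long lag : new long[] {0, 1000, 10000}) {
            System.out.println("lag=" + lag + " pods=" + desiredConsumers(lag, 100, 0, 20, partitions));
        }
    }
}
```

With a 12-partition topic, the same 10000 messages of lag would stop at 12 pods in this model, since extra consumers in the group would sit idle.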

Startup and Readiness Probe Tuning for HPA

# deployment.yaml — fast probes for quick scale-up response
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 24   # Allow about 130s (10s initial delay + 24 * 5s) for startup

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8081
  periodSeconds: 5
  failureThreshold: 3    # Remove from LB after 15s if unhealthy

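With a 5-second probe period, a new pod joins the load balancer within about 5 seconds of reporting ready, so time-to-serving is dominated by scheduling, image pull, and JVM startup. Keeping the total at 15-30 seconds is important for effective scale-up response time.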
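The probe settings translate into time budgets with simple arithmetic. The sketch below is an approximation that ignores probe timeouts and scheduling jitter; the class and method names are hypothetical.

```java
// Approximate time budgets implied by Kubernetes probe settings.
final class ProbeBudget {

    /** Longest startup the startupProbe tolerates before the container is restarted. */
    static int startupBudgetSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    /** Time until a failing pod is marked unready and removed from Service endpoints. */
    static int unreadyAfterSeconds(int periodSeconds, int failureThreshold) {
        return periodSeconds * failureThreshold;
    }

    public static void main(String[] args) {
        // Values from the deployment.yaml fragment above
        System.out.println("startup budget: ~" + startupBudgetSeconds(10, 5, 24) + "s");
        System.out.println("unready after: ~" + unreadyAfterSeconds(5, 3) + "s");
    }
}
```

If measured startup time gets close to the startup budget, raise failureThreshold rather than periodSeconds, so a pod that starts quickly is still detected quickly.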

Monitoring HPA Activity

# Watch HPA in real-time
kubectl get hpa order-service-hpa -w

# Detailed status
kubectl describe hpa order-service-hpa
# Shows: current metrics, last scale time, events

# HPA events
kubectl events --for hpa/order-service-hpa

Metrics to chart in Grafana:

kube_horizontalpodautoscaler_status_current_replicas: current pods
kube_horizontalpodautoscaler_status_desired_replicas: desired pods
container_cpu_usage_seconds_total: cumulative CPU seconds per container; graph its rate() to see actual CPU per pod

Common Mistakes to Avoid

No stabilization window: setting scaleDown.stabilizationWindowSeconds to 0 or a few seconds makes HPA scale down as soon as CPU drops, causing flapping; keep it at the 300-second default or higher
CPU request far below the CPU limit: throttling happens at the limit but HPA measures against the request; a pod with a 500m request and 2000m limit running at 1500m reports 300% utilization, so size requests close to steady-state usage (if only a limit is set, Kubernetes sets the request equal to the limit)
Scaling on memory for Java apps: the JVM typically keeps committed heap after GC frees objects, so memory utilization rarely falls and is a poor scaling signal; use CPU or request rate instead
Not testing scale-up time: measure the time from traffic spike to new pods serving traffic; if it exceeds 2 minutes, tune probes and pod startup speed

Summary

HPA scales Spring Boot pods automatically based on CPU utilization (simplest), custom metrics like request rate (more accurate), or KEDA for event-driven scale-to-zero. The critical configuration is behavior: slow scale-down (300s window) prevents flapping, and fast readiness probes minimize time-to-serving for new pods. Always set resource requests — HPA can't calculate utilization without them.
