A single slow downstream service can cascade and take down your entire Spring Boot app. Learn how to implement circuit breakers, retry with backoff, and rate limiting with Resilience4j.
JOptimize Team
A microservice that makes three external calls per request is roughly three times as likely to hit a dependency failure as one that makes a single call: if each dependency succeeds 99.9% of the time, all three succeed only about 99.7% of the time. Without resilience patterns, one slow dependency cascades into a full outage: threads pile up waiting for a timeout, connection pools are exhausted, and your healthy service becomes unhealthy. Resilience4j gives you the tools to break this cascade.
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
A circuit breaker monitors calls to an external service. After a threshold of failures, it "opens" and returns a fallback immediately without calling the service — giving it time to recover.
CLOSED (normal) → failure rate ≥ 50% → OPEN (fast-fail) → wait 30s → HALF-OPEN (trial calls) → back to CLOSED if the trial calls succeed, or back to OPEN if they fail
@Slf4j
@Service
@RequiredArgsConstructor
public class PaymentService {

    private final PaymentClient paymentClient;
    private final PaymentQueue paymentQueue; // assumed queue bean used by the fallback

    @CircuitBreaker(
        name = "payment-service",
        fallbackMethod = "paymentFallback"
    )
    public PaymentResult processPayment(PaymentRequest request) {
        return paymentClient.process(request);
    }

    // Fallback: called when the call throws or when the circuit is OPEN
    private PaymentResult paymentFallback(PaymentRequest request, Exception ex) {
        log.warn("Payment service unavailable, queuing request: {}", ex.getMessage());
        paymentQueue.enqueue(request);
        return PaymentResult.queued(request.getOrderId());
    }
}
Configuration:
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10            # Evaluate last 10 calls
        failure-rate-threshold: 50         # Open if >= 50% fail
        wait-duration-in-open-state: 30s   # Wait 30s before trying again
        permitted-number-of-calls-in-half-open-state: 3
        minimum-number-of-calls: 5         # Need at least 5 calls before evaluating
        record-exceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - feign.FeignException
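To make these settings concrete, here is a minimal JDK-only sketch of how a count-based window could drive the CLOSED, OPEN, and HALF-OPEN transitions. It is not Resilience4j's implementation: the class `TinyCircuitBreaker` and its methods are hypothetical, time is passed in explicitly to keep the behaviour deterministic, and the sketch does not cap the number of concurrent trial calls in HALF-OPEN.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative count-based circuit breaker; NOT Resilience4j's actual implementation. */
class TinyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;       // like sliding-window-size
    private final int minimumCalls;     // like minimum-number-of-calls
    private final double thresholdPct;  // like failure-rate-threshold
    private final long openMillis;      // like wait-duration-in-open-state
    private final int halfOpenCalls;    // like permitted-number-of-calls-in-half-open-state

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    TinyCircuitBreaker(int windowSize, int minimumCalls, double thresholdPct,
                       long openMillis, int halfOpenCalls) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.thresholdPct = thresholdPct;
        this.openMillis = openMillis;
        this.halfOpenCalls = halfOpenCalls;
    }

    /** Returns true if a call may proceed at the given time. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN;   // wait is over: allow trial calls
            window.clear();
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a permitted call and updates the state. */
    void record(boolean failed, long nowMillis) {
        boolean halfOpen = state == State.HALF_OPEN;
        int limit = halfOpen ? halfOpenCalls : windowSize;
        window.addLast(failed);
        if (window.size() > limit) {
            window.removeFirst();      // keep only the most recent calls
        }
        int needed = halfOpen ? halfOpenCalls : minimumCalls;
        if (window.size() < needed) {
            return;                    // not enough calls to evaluate yet
        }
        long failures = window.stream().filter(f -> f).count();
        double rate = 100.0 * failures / window.size();
        if (rate >= thresholdPct) {
            state = State.OPEN;        // trip (or re-trip) the breaker
            openedAt = nowMillis;
            window.clear();
        } else if (halfOpen) {
            state = State.CLOSED;      // trial calls succeeded
            window.clear();
        }
    }

    State state() {
        return state;
    }
}
```

With the values from the configuration (10, 5, 50.0, 30 000 ms, 3), four recorded calls never trip the breaker regardless of outcome, because the minimum of five calls has not been reached.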
For transient failures (network blip, brief overload), retry is the right tool. Without backoff, retries can amplify load on an already-struggling service.
@Slf4j
@Service
@RequiredArgsConstructor
public class InventoryService {

    private final InventoryClient inventoryClient;

    @Retry(
        name = "inventory-service",
        fallbackMethod = "inventoryFallback"
    )
    public InventoryStatus checkStock(Long productId) {
        return inventoryClient.getStatus(productId);
    }

    // Called only after all retry attempts have failed
    private InventoryStatus inventoryFallback(Long productId, Exception ex) {
        log.warn("Inventory check failed after retries: {}", ex.getMessage());
        return InventoryStatus.unknown(); // Degrade gracefully
    }
}
resilience4j:
  retry:
    instances:
      inventory-service:
        max-attempts: 3                        # Includes the initial call
        wait-duration: 500ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2.0    # Waits 500ms, then 1s (3 attempts = 2 waits)
        retry-exceptions:
          - java.io.IOException
          - java.net.SocketTimeoutException
        ignore-exceptions:
          - com.myapp.exception.ValidationException   # Don't retry validation errors
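The wait times that such a configuration produces can be worked out by hand. The JDK-only sketch below does the arithmetic and wraps it in a small retry loop; `BackoffSketch` and its methods are hypothetical names used for illustration and are not part of Resilience4j.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustrative exponential backoff; NOT Resilience4j's actual implementation. */
class BackoffSketch {

    /** Waits between attempts: initial * multiplier^n, one wait fewer than attempts. */
    static List<Long> waits(int maxAttempts, long initialMillis, double multiplier) {
        List<Long> waits = new ArrayList<>();
        double next = initialMillis;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(Math.round(next));
            next *= multiplier;
        }
        return waits;
    }

    /** Calls the task up to maxAttempts times, sleeping between failed attempts. */
    static <T> T retry(Supplier<T> task, int maxAttempts,
                       long initialMillis, double multiplier) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        List<Long> waits = waits(maxAttempts, initialMillis, multiplier);
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < waits.size()) {
                    sleep(waits.get(attempt)); // no sleep after the final attempt
                }
            }
        }
        throw last; // every attempt failed
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

`BackoffSketch.waits(3, 500, 2.0)` yields 500 and 1000 milliseconds; a 2-second wait only appears once a fourth attempt is allowed.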
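The recommended order: Retry wraps CircuitBreaker, so every attempt passes through the breaker and counts toward its failure rate. Two details matter. First, put the fallback on the outer @Retry: a fallback on the inner @CircuitBreaker turns the exception into a normal return value, so the retry never sees a failure and never retries. Second, when the circuit is open every attempt fails immediately with CallNotPermittedException, so retrying it is wasted; add io.github.resilience4j.circuitbreaker.CallNotPermittedException to the retry's ignore-exceptions.
@Retry(name = "payment-service", fallbackMethod = "paymentFallback")  // Outer: retry transient failures, then fall back
@CircuitBreaker(name = "payment-service")                             // Inner: open on sustained failure
public PaymentResult processPayment(PaymentRequest request) {
    return paymentClient.process(request);
}
The order in which the annotations are written does not determine the nesting. Resilience4j applies its aspects in a fixed default order, Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( method ) ) ) ) ), so Retry is outermost. To change that, set the aspect-order properties (for example resilience4j.retry.retryAspectOrder and resilience4j.circuitbreaker.circuitBreakerAspectOrder) or use the functional decorator API instead of annotations.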
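How nesting changes behaviour can be seen with plain function composition. The JDK-only sketch below models a retry and a fallback as wrappers around a `Supplier` and counts how often the remote call runs in each arrangement; `withRetry` and `withFallback` are hypothetical helpers and say nothing about Resilience4j's internals.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative decorator nesting; NOT Resilience4j's actual implementation. */
class DecoratorOrderSketch {

    /** Retries the supplier up to maxAttempts times on any RuntimeException. */
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int i = 0; i < maxAttempts; i++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Converts any RuntimeException from the supplier into a fallback value. */
    static <T> Supplier<T> withFallback(Supplier<T> inner, T fallback) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                return fallback;
            }
        };
    }

    /** Fallback inside the retry: the retry only ever sees a successful result. */
    static int callsWithInnerFallback() {
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> remote = () -> {
            calls.incrementAndGet();
            throw new IllegalStateException("down");
        };
        withRetry(withFallback(remote, "queued"), 3).get();
        return calls.get();
    }

    /** Fallback outside the retry: all attempts run before falling back. */
    static int callsWithOuterFallback() {
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> remote = () -> {
            calls.incrementAndGet();
            throw new IllegalStateException("down");
        };
        withFallback(withRetry(remote, 3), "queued").get();
        return calls.get();
    }
}
```

In this model the remote call runs once when the fallback sits inside the retry and three times when it sits outside, which is why the placement of the fallback deserves a deliberate choice.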
A bulkhead isolates calls to a service so one slow dependency can't exhaust all available threads:
@Bulkhead(
    name = "slow-external-api",
    type = Bulkhead.Type.SEMAPHORE,
    fallbackMethod = "externalApiFallback"
)
public ExternalData callExternalApi(String query) {
    return externalApiClient.query(query);
}
resilience4j:
  bulkhead:
    instances:
      slow-external-api:
        max-concurrent-calls: 10   # Max 10 concurrent calls to this service
        max-wait-duration: 0       # Fail immediately if at capacity
With 10 max concurrent calls and 200 threads trying to call the slow API, 190 threads get an immediate fallback response instead of queuing and eventually timing out.
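The mechanism behind a semaphore bulkhead is small enough to sketch with the JDK alone. The class below is a hypothetical illustration, not Resilience4j's implementation: it takes a permit without waiting, runs the call, and always returns the permit afterwards.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative semaphore bulkhead; NOT Resilience4j's actual implementation. */
class SemaphoreBulkheadSketch {
    private final Semaphore permits;

    SemaphoreBulkheadSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free, otherwise returns the fallback at once. */
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {   // non-blocking, like max-wait-duration: 0
            return fallback.get();
        }
        try {
            return call.get();
        } finally {
            permits.release();         // free the slot even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because rejected callers never block, the threads that would otherwise queue behind the slow dependency stay available for other work.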
@RateLimiter(
    name = "external-api",
    fallbackMethod = "rateLimitFallback"
)
public SearchResult search(String query) {
    return searchApiClient.search(query);
}

private SearchResult rateLimitFallback(String query, RequestNotPermitted ex) {
    return SearchResult.cached(query);
}
resilience4j:
  ratelimiter:
    instances:
      external-api:
        limit-for-period: 100      # 100 calls
        limit-refresh-period: 1s   # per second
        timeout-duration: 0        # Fail immediately if rate exceeded
Use a rate limiter when calling external APIs with quotas (Google Maps, Stripe, etc.) or to protect your own endpoints from overload.
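A permits-per-period limiter can be sketched with the JDK alone. The class below is a hypothetical illustration rather than Resilience4j's implementation: it grants a fixed number of permits per refresh period, never waits for one, and takes its clock as a parameter so the behaviour can be checked deterministically (in real use the clock could be `System::nanoTime`).

```java
import java.util.function.LongSupplier;

/** Illustrative fixed-window rate limiter; NOT Resilience4j's actual implementation. */
class FixedWindowRateLimiterSketch {
    private final int limitForPeriod;       // like limit-for-period
    private final long refreshPeriodNanos;  // like limit-refresh-period
    private final LongSupplier clock;       // injectable time source
    private long windowStart;
    private int used;

    FixedWindowRateLimiterSketch(int limitForPeriod, long refreshPeriodNanos,
                                 LongSupplier clock) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true if the call fits in the current period; never waits. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        long elapsed = now - windowStart;
        if (elapsed >= refreshPeriodNanos) {
            // Move to the start of the current period and refill the permits.
            windowStart += (elapsed / refreshPeriodNanos) * refreshPeriodNanos;
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false;   // like timeout-duration: 0, reject immediately
    }
}
```

A caller that gets `false` would return cached or degraded data, which is the role the fallback method plays in the annotated version.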
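management:
  endpoints:
    web:
      exposure:
        include: health,circuitbreakers,retries
  endpoint:
    health:
      show-details: always   # Needed to see per-breaker details in the health response
  health:
    circuitbreakers:
      enabled: true

resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        register-health-indicator: true   # The health indicator is off by default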
This exposes /actuator/health with circuit breaker state:
{
  "status": "UP",
  "components": {
    "circuitBreakers": {
      "status": "UP",
      "details": {
        "payment-service": {
          "status": "CIRCUIT_CLOSED",
          "details": {
            "failureRate": "12.5%",
            "bufferedCalls": 8
          }
        }
      }
    }
  }
}
Micrometer exposes Resilience4j metrics to Prometheus: resilience4j_circuitbreaker_state, resilience4j_retry_calls_total, resilience4j_bulkhead_available_concurrent_calls.
@Retry without ignore-exceptions: retrying a 400 Bad Request (which will never succeed) wastes time and adds load; always specify which exceptions to retry.
Circuit breaker without a fallback: when the circuit is open, the call throws CallNotPermittedException and the caller gets a 500; always provide a meaningful fallback.
Missing timeouts: without them, threads keep waiting on a slow dependency; always set connectTimeout and readTimeout on your HTTP client.
Resilience4j provides four tools for microservice resilience: circuit breaker (stop calling failing services), retry with exponential backoff (handle transient failures), bulkhead (limit concurrent calls to isolate load), and rate limiter (protect quotas). Layer them in order (retry wraps circuit breaker) and monitor state via Actuator and Micrometer to catch cascading failures before they propagate.
JOptimize flags HTTP client calls without circuit breaker or retry configuration, uncapped thread pools calling external services, and missing timeout configurations in Spring Boot microservices.