Circuit Breaker Pattern: Stopping Cascading Failures in Distributed Systems

Complete guide to the Circuit Breaker pattern for distributed system resilience. Understand how cascading failures occur, implement the three-state circuit breaker (CLOSED, OPEN, HALF-OPEN), configure failure rate and slow call thresholds, add fallback strategies (static data, cached response, degraded mode), implement with Resilience4j (Java), Opossum (Node.js), and configure at the service mesh level with Istio outlier detection.

Emily Ross
7 min read

How Cascading Failures Actually Happen

The failure sequence is specific and predictable:

text
Time 0:   100 users/sec hitting /api/recommendations
          RecommendationService is healthy - avg 50ms response

Time 1m:  RecommendationService DB becomes slow (disk I/O saturation)
          RecommendationService now takes 30 seconds to respond

Time 2m:  API server holds one thread per pending request
          100 users/sec x 30 seconds lag = demand for 3,000 concurrent threads

Time 3m:  API server is pinned at its max thread pool (500 threads)
          Every thread is stuck waiting on recommendations, so ALL other requests queue - login fails, checkout fails

Time 4m:  Users cannot log in - the entire site is down
          Root cause: a slow RECOMMENDATION widget

This is a cascading failure caused by resource exhaustion: threads consumed by the slow service cause all other operations to fail.
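The arithmetic in the timeline is Little's law: requests in flight equal arrival rate multiplied by latency. A minimal sketch using the numbers above; the class and method names are illustrative and not taken from any library:

```java
// Illustrative sketch: class and method names are not from any library.
final class ThreadExhaustion {
    // Little's law: requests in flight = arrival rate x time each request is held.
    static long concurrentRequests(long requestsPerSecond, long latencyMillis) {
        return requestsPerSecond * latencyMillis / 1000;
    }

    // Seconds until a fixed pool is fully occupied, assuming no thread is
    // released in the meantime (true when latency exceeds the fill time).
    static double secondsUntilPoolFull(int poolSize, long requestsPerSecond) {
        return (double) poolSize / requestsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(concurrentRequests(100, 50));      // healthy: 5 threads busy
        System.out.println(concurrentRequests(100, 30_000));  // degraded: 3000 threads needed
        System.out.println(secondsUntilPoolFull(500, 100));   // 500-thread pool is full after 5.0 s
    }
}
```

At a healthy 50 ms the service keeps about 5 threads busy; at 30 seconds the same traffic needs 3,000, and a 500-thread pool is gone within seconds rather than minutes.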


Circuit Breaker Mechanics: The Three States

CLOSED State (Normal Operation)

  • All calls pass through to the downstream service
  • The circuit breaker records success/failure of each call
  • Calculates failure rate using a sliding window (count-based or time-based)
  • Transition to OPEN when failure rate crosses threshold

OPEN State (Protection Mode)

  • All calls are blocked immediately - no network call is made
  • Returns a predefined fallback response instantly
  • The service gets complete relief - no requests, CPU/DB can recover
  • After a configurable wait duration, moves to HALF-OPEN

HALF-OPEN State (Testing Recovery)

  • Allows a small number of trial requests through
  • If trial requests succeed: circuit closes (normal operation resumes)
  • If trial requests fail: circuit re-opens and waits again before the next trial (some implementations can lengthen each successive wait)
  • Prevents flooding a recovering service
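The three states can be sketched as a small state machine. This is a simplified illustration, not how Resilience4j or any other library implements it: it trips on consecutive failures instead of a sliding-window rate, allows a single trial call in HALF-OPEN, is not thread-safe, and takes an injected clock so the timing can be tested. All names are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Minimal single-threaded sketch; names and thresholds are assumptions.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the circuit
    private final long openWaitMillis;    // how long to stay OPEN before a trial call
    private final LongSupplier clock;     // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openWaitMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openWaitMillis = openWaitMillis;
        this.clock = clock;
    }

    State state() {
        return state;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openWaitMillis) {
                return fallback.get();        // blocked: no downstream call is made
            }
            state = State.HALF_OPEN;          // wait elapsed: let one trial call through
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;             // any success, including a trial, closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;           // trial failed or threshold reached
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

While the circuit is OPEN the `action` supplier is never invoked, which is what gives the downstream service room to recover.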

Configuration Deep Dive: Thresholds and Windows

Two types of calls trip a circuit breaker:

  1. Failure Rate - Too many errors (5xx, connection refused, timeout)
  2. Slow Call Rate - Too many calls exceeding a slow call duration threshold (e.g., > 3 seconds)
yaml
# Resilience4j configuration (application.yml):
resilience4j:
  circuitbreaker:
    instances:
      recommendationService:
        registerHealthIndicator: true
        
        # Sliding window: count-based (last N calls) or time-based (last N seconds):
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20         # Evaluate based on last 20 calls
        
        # Trip when 50%+ of calls fail:
        failureRateThreshold: 50
        
        # Also trip when 60%+ of calls take > 3 seconds:
        slowCallRateThreshold: 60
        slowCallDurationThreshold: 3000ms
        
        # Minimum calls before evaluating failure rate:
        minimumNumberOfCalls: 5
        
        # How long to stay OPEN before testing:
        waitDurationInOpenState: 30s
        
        # How many trial calls in HALF-OPEN state:
        permittedNumberOfCallsInHalfOpenState: 5
        
        # Which exceptions count as failures:
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        
        # Which exceptions do NOT trip the circuit:
        ignoreExceptions:
          - com.myapp.exceptions.BusinessRuleException  # Not a service fault
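To make the window settings concrete, here is a sketch of a count-based sliding window that tracks both failure rate and slow call rate. The parameter names mirror the YAML keys above, but the class itself is hypothetical and is not Resilience4j's implementation:

```java
// Sketch of a count-based sliding window; not Resilience4j's implementation.
final class CountBasedWindow {
    private final boolean[] failed;            // ring buffer: did each of the last N calls fail?
    private final boolean[] slow;              // ring buffer: was each of the last N calls slow?
    private final int minimumNumberOfCalls;
    private final long slowCallDurationMillis;
    private int recorded = 0;                  // total calls recorded so far

    CountBasedWindow(int slidingWindowSize, int minimumNumberOfCalls, long slowCallDurationMillis) {
        this.failed = new boolean[slidingWindowSize];
        this.slow = new boolean[slidingWindowSize];
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.slowCallDurationMillis = slowCallDurationMillis;
    }

    void record(boolean success, long durationMillis) {
        int slot = recorded % failed.length;   // overwrites the oldest entry once the window is full
        failed[slot] = !success;
        slow[slot] = durationMillis > slowCallDurationMillis;
        recorded++;
    }

    // Percentage of failed calls in the window, or -1 until enough calls were seen.
    double failureRate() {
        return rate(failed);
    }

    // Percentage of slow calls in the window, or -1 until enough calls were seen.
    double slowCallRate() {
        return rate(slow);
    }

    boolean shouldTrip(double failureRateThreshold, double slowCallRateThreshold) {
        return failureRate() >= failureRateThreshold || slowCallRate() >= slowCallRateThreshold;
    }

    private double rate(boolean[] flags) {
        int n = Math.min(recorded, flags.length);
        if (n < minimumNumberOfCalls) {
            return -1;
        }
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (flags[i]) {
                hits++;
            }
        }
        return 100.0 * hits / n;
    }
}
```

With `minimumNumberOfCalls` set to 5, four straight failures do not trip the circuit yet; the rate is only evaluated once the fifth call is recorded.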

Fallback Strategies: Failing Gracefully

A circuit breaker without a good fallback just replaces one failure mode with another. Design fallbacks that provide real value:

Fallback Type       | Description                             | Example
Static response     | Return a hardcoded default              | Empty recommendations list
Cached response     | Return last known good data from Redis  | Show 1-hour-old recommendations
Degraded mode       | Return reduced functionality            | Show top-10 all-time bestsellers
Error with context  | Explicit UI message                     | "Recommendations temporarily unavailable"
Alternative service | Call a simpler fallback service         | Basic category-based suggestions

Never return confusing errors or silent failures - users should always know if a feature is degraded.
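One way to combine these strategies is a fallback chain that tries each option in order and tells the caller whether the result is degraded. A minimal sketch in plain Java; `FallbackChain`, `Result`, and `firstAvailable` are hypothetical names, and the suppliers stand in for real cache or service lookups:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative sketch: tries the primary call, then each fallback in order.
final class FallbackChain {
    // Carries a flag so the UI can tell users the feature is degraded.
    static final class Result<T> {
        final T value;
        final boolean degraded;

        Result(T value, boolean degraded) {
            this.value = value;
            this.degraded = degraded;
        }
    }

    static <T> Result<T> firstAvailable(Supplier<T> primary,
                                        List<Supplier<Optional<T>>> fallbacks,
                                        T staticDefault) {
        try {
            return new Result<>(primary.get(), false);
        } catch (RuntimeException primaryFailure) {
            for (Supplier<Optional<T>> fallback : fallbacks) {
                try {
                    Optional<T> value = fallback.get();
                    if (value.isPresent()) {
                        return new Result<>(value.get(), true);
                    }
                } catch (RuntimeException fallbackFailure) {
                    // This fallback is unavailable too; try the next one.
                }
            }
            return new Result<>(staticDefault, true);   // last resort: static response
        }
    }
}
```

Ordering the list as cached response, then degraded mode, with a static default at the end, matches the table from most to least useful.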


Implementation: Resilience4j for Java/Kotlin

java
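import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeoutException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;

@Service
public class RecommendationClient {
    private static final Logger log = LoggerFactory.getLogger(RecommendationClient.class);

    private final WebClient webClient;
    // Application-specific collaborators; the types shown here are assumptions.
    private final ProductRepository productRepository;
    private final Map<String, List<Product>> redisCache;  // stand-in for a Redis-backed cache

    public RecommendationClient(WebClient webClient, ProductRepository productRepository,
                                Map<String, List<Product>> redisCache) {
        this.webClient = webClient;
        this.productRepository = productRepository;
        this.redisCache = redisCache;
    }

    @CircuitBreaker(name = "recommendationService", fallbackMethod = "getStaticRecommendations")
    public List<Product> getPersonalizedRecommendations(String userId) {
        return webClient.get()
            .uri("/recommendations/{userId}", userId)
            .retrieve()
            .bodyToFlux(Product.class)
            .collectList()
            .timeout(Duration.ofSeconds(3))  // Combine with timeout!
            .block();
    }

    // Fallback: called when the circuit is OPEN and the call is rejected
    public List<Product> getStaticRecommendations(String userId, CallNotPermittedException ex) {
        log.warn("Circuit OPEN for recommendations. User: {}", userId);
        return productRepository.findTop10Bestsellers(); // Degraded fallback
    }

    // Fallback: called on timeout. block() may surface the reactive timeout wrapped in
    // a runtime exception, so verify which exception type actually reaches this method.
    public List<Product> getStaticRecommendations(String userId, TimeoutException ex) {
        log.warn("Recommendation timeout. Returning cached. User: {}", userId);
        return redisCache.getOrDefault("top10", List.of()); // Cached fallback
    }
}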

Bulkhead Pattern: Isolating Thread Pools

The Bulkhead pattern (named after the watertight compartments in a ship's hull) runs calls to different services in separate thread pools, so one slow service can't exhaust all threads:

java
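// The fallback must take the same arguments plus the exception and return the
// same type as this method (CompletableFuture<List<Product>>).
@Bulkhead(
    name = "recommendationService",
    type = Bulkhead.Type.THREADPOOL,
    fallbackMethod = "getStaticRecommendations"
)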
public CompletableFuture<List<Product>> getRecommendationsAsync(String userId) {
    return CompletableFuture.supplyAsync(
        () -> webClient.get().uri("/recommendations/{userId}", userId)...
    );
}

yaml
# Configuration: only 10 threads can wait on the recommendation service,
# so chat, login, and checkout are unaffected.
resilience4j.thread-pool-bulkhead.instances.recommendationService.maxThreadPoolSize: 10
resilience4j.thread-pool-bulkhead.instances.recommendationService.coreThreadPoolSize: 5
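The isolation idea can also be shown without any library: give the slow dependency its own bounded pool and reject work once that pool and its queue are full. A sketch under those assumptions, using only `java.util.concurrent`; the class name and sizes are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch of a thread-pool bulkhead built on the JDK executor.
final class ThreadPoolBulkheadSketch {
    private final ExecutorService pool;

    ThreadPoolBulkheadSketch(int maxThreads, int queueCapacity) {
        // Bounded pool plus bounded queue; AbortPolicy rejects work once both are full.
        this.pool = new ThreadPoolExecutor(
            maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy());
    }

    // Runs the task on the dedicated pool, or answers with the fallback
    // immediately when the bulkhead is saturated.
    <T> CompletableFuture<T> submit(Supplier<T> task, Supplier<T> fallback) {
        try {
            return CompletableFuture.supplyAsync(task, pool);
        } catch (RejectedExecutionException saturated) {
            return CompletableFuture.completedFuture(fallback.get());
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Callers of other features never touch this pool, so a stalled recommendation service can tie up at most `maxThreads` threads plus `queueCapacity` queued tasks.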

Combining Retry + Circuit Breaker Correctly

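A critical mistake: putting Retry outside Circuit Breaker without accounting for it. Every retry attempt then passes through the breaker as a separate call, so retries send more requests to a failing service, and once the circuit is OPEN each attempt is rejected immediately while the retry loop adds nothing but delay.

java
// ❌ Retry wraps Circuit Breaker
// If CB is OPEN, retry tries 3 times x 100ms = 3 rejected calls and 3 fallbacks
// (not wrong, but wasteful).
// Note: with Resilience4j's Spring annotations this is the default nesting, and
// the order in which the annotations are written does not change it.
@Retry(name = "recSvc")
@CircuitBreaker(name = "recSvc")
public Product getRecommendation(String userId) { ... }

// ✅ Circuit Breaker wraps Retry
// Retry tries 3 times -> CB records one aggregated result -> trips after threshold
// To get this nesting, set the aspect order (higher value = higher priority):
//   resilience4j.circuitbreaker.circuitBreakerAspectOrder: 1
//   resilience4j.retry.retryAspectOrder: 2
@CircuitBreaker(name = "recSvc")
@Retry(name = "recSvc")           // CB sees aggregated results
public Product getRecommendation(String userId) { ... }

// Even better: Use separate names + Retry for transient network errors only
// CB for detecting genuinely failing services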
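What each nesting means for the breaker is easiest to see with plain function decorators. In this sketch `RecordingBreaker` is a hypothetical stand-in that only records the outcomes it observes, and `withRetry` retries immediately with no backoff; neither is a Resilience4j API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch: shows how many outcomes a breaker observes per logical call.
final class DecoratorOrder {
    // Stand-in for a circuit breaker: records each outcome that passes through it.
    static final class RecordingBreaker {
        final List<Boolean> outcomes = new ArrayList<>();

        <T> Supplier<T> decorate(Supplier<T> inner) {
            return () -> {
                try {
                    T result = inner.get();
                    outcomes.add(true);
                    return result;
                } catch (RuntimeException e) {
                    outcomes.add(false);
                    throw e;
                }
            };
        }
    }

    // Retries immediately up to maxAttempts times, then rethrows the last failure.
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }
}
```

Composed as `breaker.decorate(withRetry(call, 3))`, one failed logical call adds a single failure to the breaker; composed as `withRetry(breaker.decorate(call), 3)`, the same call adds three.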

Monitoring Circuit Breaker State

Expose circuit breaker metrics to your observability stack:

yaml
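# Spring Boot Actuator exposes CB health and metrics:
management:
  endpoints.web.exposure.include: health,metrics,prometheus,circuitbreakerevents
  health.circuitbreakers.enabled: true

# Prometheus metrics exposed (label layout varies by Resilience4j version):
# resilience4j_circuitbreaker_state{name="recommendationService",state="open"} 1.0
#   Micrometer-based versions emit one series per state; 1 marks the active state.
#   Older exporters encode a single value: 0 = CLOSED, 1 = OPEN, 2 = HALF_OPEN
# resilience4j_circuitbreaker_failure_rate{name="recommendationService"} 65.0
#   Reported as a percentage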

Alert when:

  • Any circuit enters OPEN state -> team notification (service is DOWN)
  • Circuit stays OPEN > 5 minutes -> page the on-call engineer

Frequently Asked Questions

When should I lower the failure threshold vs lengthen the wait duration? Failure threshold controls how sensitive the circuit is - lower it (30%) for critical payment services, keep it higher (60%) for non-critical recommendation engines. Wait duration controls how long the downstream gets to recover - lengthen it (60s) for services needing database recovery; keep it shorter (10s) for services with transient network issues.

Can I use Circuit Breaker at the API Gateway or mesh level instead of per-service? Yes - this is the service mesh approach. Istio's DestinationRule with outlierDetection implements circuit breaking at the network layer, between any two services, without any code changes: hosts that keep returning errors are temporarily ejected from the load-balancing pool. The trade-off: it only detects HTTP 5xx errors and TCP connection failures, not application-level errors like malformed responses. Application-level circuit breakers (Resilience4j) can detect any custom failure condition.


Key Takeaway

The Circuit Breaker pattern is the difference between a system that fails completely and one that fails gracefully. Combined with the Bulkhead pattern (thread pool isolation) and well-designed fallbacks, it transforms a distributed system from a "house of cards" into a resilient architecture where partial failures remain partial. The 30 minutes spent configuring Resilience4j or Istio outlier detection is insurance against the 4am incident where a slow recommendation service brings down your entire checkout flow.
