Software Architecture · System Design

Circuit Breaker Pattern: Stopping Cascading Failures in Distributed Systems

TopicTrick Team



How Cascading Failures Actually Happen

The threat model is specific and predictable: one downstream dependency (say, a recommendation service) turns slow. Every caller thread that touches it blocks waiting for a response, the caller's fixed-size thread pool fills up with blocked threads, and soon requests that have nothing to do with recommendations are queueing and timing out behind them.

This is a cascading failure caused by resource exhaustion: threads consumed by the slow service cause all other operations to fail.
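The exhaustion mechanism can be sketched in plain Java. In this toy simulation (the pool sizes and request counts are illustrative), a `CountDownLatch` stands in for a downstream service that never responds; with only 2 worker threads, both get stuck and every later request can do nothing but queue:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Demonstrates thread-pool exhaustion caused by one slow dependency. */
class CascadeDemo {
    /** Returns {activeThreads, queuedRequests} after hitting a hung downstream. */
    static int[] simulate() throws InterruptedException {
        CountDownLatch downstream = new CountDownLatch(1);  // downstream never answers

        // The caller's entire capacity: 2 worker threads.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        // Five requests touch the slow dependency: 2 block, 3 can only queue.
        for (int i = 0; i < 5; i++) {
            pool.submit(() -> { downstream.await(); return "ok"; });
        }
        Thread.sleep(100);  // let both workers pick up (and block on) their tasks

        int[] snapshot = { pool.getActiveCount(), pool.getQueue().size() };

        downstream.countDown();  // release the blocked threads
        pool.shutdown();
        return snapshot;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] s = simulate();
        // Every worker is stuck; any unrelated request now queues behind them.
        System.out.println("active=" + s[0] + " queued=" + s[1]);
    }
}
```

Nothing here is broken in the traditional sense: no exception, no crash. The pool is simply full of threads doing nothing, which is exactly the state a circuit breaker is designed to prevent.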


Circuit Breaker Mechanics: The Three States

```mermaid
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: failure rate crosses threshold
    OPEN --> HALF_OPEN: wait duration elapses
    HALF_OPEN --> CLOSED: trial calls succeed
    HALF_OPEN --> OPEN: trial calls fail
```

CLOSED State (Normal Operation)

  • All calls pass through to the downstream service
  • The circuit breaker records success/failure of each call
  • Calculates failure rate using a sliding window (count-based or time-based)
  • Transition to OPEN when failure rate crosses threshold

OPEN State (Protection Mode)

  • All calls are blocked immediately — no network call is made
  • Returns a predefined fallback response instantly
  • The service gets complete relief — no requests, CPU/DB can recover
  • After a configurable wait duration, moves to HALF-OPEN

HALF-OPEN State (Testing Recovery)

  • Allows a small number of trial requests through
  • If trial requests succeed: circuit closes (normal operation resumes)
  • If trial requests fail: circuit re-opens (longer wait before next trial)
  • Prevents flooding a recovering service
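The three-state machine above can be sketched in a few dozen lines of plain Java. This is an illustrative toy, not a replacement for a library like Resilience4j: the count-based "window" is deliberately crude and there is no metrics eviction.

```java
import java.time.Duration;
import java.time.Instant;

/** Toy count-based circuit breaker illustrating the three states. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // calls evaluated per window
    private final double failureRateThreshold; // e.g. 0.5 = trip at 50% failures
    private final Duration waitInOpen;         // recovery time before HALF_OPEN
    private final int trialCalls;              // permitted calls in HALF_OPEN

    private State state = State.CLOSED;
    private int calls = 0, failures = 0, trialsLeft = 0;
    private Instant openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         Duration waitInOpen, int trialCalls) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpen = waitInOpen;
        this.trialCalls = trialCalls;
    }

    synchronized State state() { return state; }

    /** True if the call may proceed; false means fail fast and use the fallback. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (Instant.now().isBefore(openedAt.plus(waitInOpen))) {
                return false;            // OPEN: block instantly, no network call
            }
            state = State.HALF_OPEN;     // wait elapsed: admit trial requests
            trialsLeft = trialCalls;
        }
        return true;
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            if (--trialsLeft <= 0) { state = State.CLOSED; calls = 0; failures = 0; }
        } else {
            calls++;
        }
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) {
            trip();                      // still failing: re-open
            return;
        }
        calls++;
        failures++;
        if (calls >= windowSize && (double) failures / calls >= failureRateThreshold) {
            trip();                      // failure rate crossed the threshold
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = Instant.now();
        calls = 0;
        failures = 0;
    }
}
```

The caller's contract is the important part: check `allowRequest()` before the network call, serve the fallback when it returns false, and report every outcome back via `recordSuccess()` or `recordFailure()`.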

Configuration Deep Dive: Thresholds and Windows

Two kinds of signal can trip a circuit breaker:

  1. Failure Rate — Too many errors (5xx, connection refused, timeout)
  2. Slow Call Rate — Too many calls exceeding a slow call duration threshold (e.g., > 3 seconds)
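On Spring Boot, for instance, these knobs map onto Resilience4j configuration properties roughly as follows (the instance name and values here are illustrative, not recommendations):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      recommendationService:
        slidingWindowType: COUNT_BASED       # or TIME_BASED
        slidingWindowSize: 20                # evaluate the last 20 calls
        minimumNumberOfCalls: 10             # don't judge on a tiny sample
        failureRateThreshold: 50             # trip at 50% errors
        slowCallDurationThreshold: 3s        # calls slower than 3s count as slow
        slowCallRateThreshold: 50            # trip at 50% slow calls
        waitDurationInOpenState: 30s         # recovery window before HALF-OPEN
        permittedNumberOfCallsInHalfOpenState: 5
```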

Fallback Strategies: Failing Gracefully

A circuit breaker without a good fallback just replaces one failure mode with another. Design fallbacks that provide real value:

| Fallback Type | Description | Example |
| --- | --- | --- |
| Static response | Return a hardcoded default | Empty recommendations list |
| Cached response | Return last known good data from Redis | Show 1-hour-old recommendations |
| Degraded mode | Return reduced functionality | Show top-10 all-time bestsellers |
| Error with context | Explicit UI message | "Recommendations temporarily unavailable" |
| Alternative service | Call a simpler fallback service | Basic category-based suggestions |

Never return confusing errors or silent failures — users should always know if a feature is degraded.


Implementation: Resilience4j for Java/Kotlin

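A minimal Resilience4j setup might look like the following sketch. The `recommendationClient.fetchForUser` call, `userId`, and `cachedRecommendations` helper are hypothetical placeholders for your own HTTP client and cache; the configuration mirrors the thresholds discussed above.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                          // trip at 50% failures
        .slowCallDurationThreshold(Duration.ofSeconds(3))  // calls > 3s count as slow
        .slowCallRateThreshold(50)
        .slidingWindowSize(20)
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(5)
        .build();

CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
        .circuitBreaker("recommendationService");

// Decorate the remote call; the breaker records every success/failure.
Supplier<List<String>> guarded = CircuitBreaker.decorateSupplier(
        breaker, () -> recommendationClient.fetchForUser(userId));

List<String> recs;
try {
    recs = guarded.get();                  // throws CallNotPermittedException while OPEN
} catch (Exception e) {
    recs = cachedRecommendations(userId);  // fallback: last known good data
}
```

Note that while the circuit is OPEN, `guarded.get()` fails in microseconds rather than holding a thread for the full timeout, which is what breaks the exhaustion cycle described earlier.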

Bulkhead Pattern: Isolating Thread Pools

The Bulkhead pattern (named after ship watertight compartments) runs different services in separate thread pools, so one slow service can't exhaust all threads:

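A sketch of the idea in plain Java (pool sizes and service names are illustrative): each downstream dependency gets its own fixed pool, so saturating one bulkhead leaves the others untouched.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Bulkheads as separate thread pools: one hung dependency cannot starve another. */
class BulkheadDemo {
    static String placeOrderWhileRecsDown() throws Exception {
        // One fixed pool per downstream dependency: these are the bulkheads.
        ExecutorService recommendationPool = Executors.newFixedThreadPool(4);
        ExecutorService checkoutPool = Executors.newFixedThreadPool(8);

        // Recommendations hangs: saturate its bulkhead with blocked calls.
        CountDownLatch recsHung = new CountDownLatch(1);
        for (int i = 0; i < 10; i++) {
            recommendationPool.submit(() -> { recsHung.await(); return "recs"; });
        }

        // Checkout still owns all 8 of its threads, so it completes immediately.
        Future<String> order = checkoutPool.submit(() -> "order-confirmed");
        String result = order.get(1, TimeUnit.SECONDS);

        recsHung.countDown();
        recommendationPool.shutdown();
        checkoutPool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(placeOrderWhileRecsDown());
    }
}
```

With a single shared pool, the same scenario would leave checkout waiting behind ten blocked recommendation calls; the isolation is what keeps the failure partial.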

Combining Retry + Circuit Breaker Correctly

A critical mistake: putting Retry outside Circuit Breaker. This defeats the circuit breaker — retries send more requests to a failing service, worsening the cascade.

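With Resilience4j this comes down to decoration order. In the sketch below (the breaker name and `remoteCall` are hypothetical placeholders), the retry is wrapped inside the circuit breaker, so once the circuit is OPEN no retry attempts execute against the failing service at all:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;
import java.util.function.Supplier;

CircuitBreaker breaker = CircuitBreaker.ofDefaults("recommendationService");

Retry retry = Retry.of("recommendationRetry", RetryConfig.custom()
        .maxAttempts(3)
        .waitDuration(Duration.ofMillis(200))
        .build());

// Correct order: Retry on the inside, Circuit Breaker on the outside.
// One "call" seen by the breaker = one full retried batch, and while the
// circuit is OPEN the retries never fire against the failing service.
Supplier<String> withRetry = Retry.decorateSupplier(retry, () -> remoteCall());
Supplier<String> resilient = CircuitBreaker.decorateSupplier(breaker, withRetry);
```

Reversing the order means every logical request becomes up to three physical requests against an already-struggling service before the breaker can intervene.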

Monitoring Circuit Breaker State

Expose circuit breaker metrics to your observability stack:

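With Resilience4j on Spring Boot, for example, exposing breaker state to Prometheus might look like this (the instance name is illustrative):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health, prometheus
  health:
    circuitbreakers:
      enabled: true

resilience4j:
  circuitbreaker:
    instances:
      recommendationService:
        registerHealthIndicator: true
```

Micrometer then publishes gauges such as `resilience4j_circuitbreaker_state` per breaker instance, which your alerting rules can watch for transitions into OPEN.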

Alert when:

  • Any circuit enters OPEN state → team notification (service is DOWN)
  • Circuit stays OPEN > 5 minutes → page the on-call engineer

Frequently Asked Questions

When should I lower the failure threshold vs lengthen the wait duration? Failure threshold controls how sensitive the circuit is — lower it (30%) for critical payment services, keep it higher (60%) for non-critical recommendation engines. Wait duration controls how long the downstream gets to recover — lengthen it (60s) for services needing database recovery; keep it shorter (10s) for services with transient network issues.

Can I use Circuit Breaker at the API Gateway level instead of per-service? Yes — this is the service mesh approach. Istio's DestinationRule with outlierDetection implements circuit breaking at the network layer, between any two services, without any code changes. The trade-off: it only detects HTTP 5xx errors and TCP connection failures, not application-level errors like malformed responses. Application-level circuit breakers (Resilience4j) can detect any custom failure condition.
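As a sketch, an Istio DestinationRule with outlier detection might look like this (the host and thresholds are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: recommendation-service
spec:
  host: recommendation-service.prod.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # eject a pod after 5 consecutive 5xx responses
      interval: 10s              # analysis sweep interval
      baseEjectionTime: 30s      # minimum time an ejected pod stays out
      maxEjectionPercent: 50     # never eject more than half the pool
```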


Key Takeaway

The Circuit Breaker pattern is the difference between a system that fails completely and one that fails gracefully. Combined with the Bulkhead pattern (thread pool isolation) and well-designed fallbacks, it transforms a distributed system from a "house of cards" into a resilient architecture where partial failures remain partial. The 30 minutes spent configuring Resilience4j or Istio outlier detection is insurance against the 4am incident where a slow recommendation service brings down your entire checkout flow.

Read next: Zero Trust Architecture: Securing Software for 2026 →


Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.