Resilience Patterns: Circuit Breakers

TopicTrick Team

1. The Circuit Breaker: Stop the Bleeding

A circuit breaker works like the breaker in your home's electrical panel. It has three states:

  • Closed State (Normal): Traffic flows through.
  • Open State (Broken): If the service fails 5 times in a row, the breaker opens. All requests then fail instantly with a "Service Busy" message. This gives the struggling service time to heal without being bombarded with new requests.
  • Half-Open (Testing): After 1 minute, the breaker lets 1 request through. If it succeeds, the breaker closes and everything goes back to normal. If it fails, the breaker opens again and the wait starts over.
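
The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation: the class name `CircuitBreaker`, the exception `CircuitOpenError`, and the injectable `clock` parameter are all assumptions made for this example, and the defaults (5 failures, 60 seconds) simply mirror the numbers in the list.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    # Hypothetical helper for illustration; defaults mirror the text above.
    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the cool-down can be tested
        self.failures = 0           # consecutive failures seen so far
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("service busy")
            # Cool-down is over: let one trial request through.
            self.state = "half_open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial request, or too many consecutive failures, opens the breaker.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)`, and treat `CircuitOpenError` as the "Service Busy" case. The sketch is not thread-safe; a shared breaker would need a lock around the state changes.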

2. Retries: The "Try Again" Logic

Not all errors are permanent! A network blip might make a request fail, but the same request might succeed 10 ms later.

  • Exponential Backoff: Don't hammer the service with instant retries. Wait 1s, then 2s, then 4s, then 8s.
  • The Reason: Growing delays give a struggling server room to recover. If 1,000 servers all retry at the exact same millisecond, they act like a DDoS attack and can crash the target server.
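
As a rough sketch of the backoff schedule above, the helper below retries a callable and doubles the wait after each failure. The function name `retry_with_backoff` and its parameters are assumptions made for this example; the `sleep` argument is injectable only so the delays can be inspected without actually waiting.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call func, waiting base_delay * factor**attempt between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # With the defaults the waits are 1s, 2s, 4s, 8s.
            sleep(base_delay * factor ** attempt)
```

In real code it is usually worth retrying only errors that are plausibly transient (timeouts, connection resets) and only operations that are safe to repeat.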

3. Timeouts: The "Don't Wait" Rule

Waiting is a leading cause of microservice failure: every request stuck on a slow dependency holds a thread or connection that other requests need.

  • As a rule of thumb, never wait more than 2 seconds for a service response.
  • If the service hasn't answered by then, give up!
  • The Fail-Safe: Show the user cached data (Module 183) or a default response. It is always better to show an old profile picture than a loading spinner that never ends.
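
One way to sketch the "give up and fall back" rule is with Python's standard `concurrent.futures` module. The names `call_with_timeout` and `slow_profile_lookup` are hypothetical, and the pool size is an arbitrary choice for this example. An HTTP client's own timeout setting is usually the better first line of defence, because the approach below stops waiting but cannot interrupt work that is already running.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool for outbound calls (size chosen arbitrarily for the sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout=2.0, fallback=None):
    """Return func() if it finishes within `timeout` seconds, else `fallback`."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # cancel() only helps if the call has not started yet;
        # a call that is already running keeps going in the background.
        future.cancel()
        return fallback


def slow_profile_lookup():
    # Hypothetical stand-in for a slow downstream service.
    time.sleep(0.3)
    return "fresh profile"
```

For instance, `call_with_timeout(slow_profile_lookup, timeout=0.05, fallback="cached profile")` returns the cached value instead of making the user wait for the slow call.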

4. Bulkheads: Isolate the Damage

Named after the watertight walls inside a ship's hull.

  • If one room on the ship floods, the others stay dry.
  • The Logic: If the "Recommendation Engine" is crashing the server, cap it at a fixed share of the server's resources, for example 10% of the CPU. The critical Payment engine stays safe and fast. This is the secret to building applications that are partially broken but still fully functional for money-making tasks.
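
CPU shares are normally enforced by the platform (for example, container resource limits), so the sketch below swaps in a related in-process technique: capping how many calls a non-critical feature may have in flight at once. The names `Bulkhead` and `BulkheadFullError` are assumptions made for this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    # Hypothetical helper: limits concurrent calls into one dependency.
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a flooded compartment
        # cannot tie up threads that other features need.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving the recommendation calls their own small `Bulkhead` means that, even when every slot is stuck on a slow response, payment requests are never waiting behind them.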

Frequently Asked Questions

Are libraries like Hystrix still used? Netflix Hystrix is legacy: it is in maintenance mode and no longer actively developed. In 2026, we use Resilience4j (Java), gobreaker (Go), or Service Mesh features in Istio (Module 194) that handle circuit breaking at the network level, so you don't even have to write the code yourself!

What is 'Jitter'? When doing retries, always add a small random delay (jitter). Instead of exactly 2.0s, wait 2.1s or 1.9s. This spreads a massive wave of retries out over time, giving the target service and its database a much better chance of surviving the load.
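
A jittered delay can be computed in a couple of lines. In this sketch the function name `jittered_delay` and the 10% `spread` are assumptions chosen to match the 1.9s to 2.1s example above; other jitter strategies, such as picking a random delay anywhere between zero and the backoff value, are also common.

```python
import random


def jittered_delay(attempt, base_delay=1.0, factor=2.0, spread=0.1, rng=random):
    """Exponential backoff delay, randomly nudged by up to +/- `spread`."""
    delay = base_delay * factor ** attempt
    # uniform(0.9, 1.1) with the default spread, so a nominal 2.0s wait
    # lands somewhere between roughly 1.8s and 2.2s.
    return delay * rng.uniform(1.0 - spread, 1.0 + spread)
```

Each client draws its own random factor, so a thousand clients that failed at the same instant no longer retry at the same instant.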


Key Takeaway

Resilience is about "Graceful Failure." By mastering the Circuit Breaker and the discipline of Timeouts, you gain the ability to build systems that keep serving users in the face of network chaos. You graduate from "Hoping everything works" to "Architecting for the inevitable error."
