The Saga Pattern: Distributed Transactions in Microservices — From ACID to Compensating Transactions

Table of Contents
- Why Distributed Transactions Fail: The CAP Theorem Constraint
- Two-Phase Commit (2PC): Why It Doesn't Scale
- The Saga Solution: Local Transactions + Compensations
- Choreography-Based Saga: Decentralized Event Flow
- Orchestration-Based Saga: Centralized Coordinator
- Designing Compensating Transactions: The Hard Part
- Implementing Orchestration with Temporal
- The Pivot Transaction: A Critical Concept
- Idempotency in Saga Steps
- Common Pitfalls: Saga Anti-Patterns
- Frequently Asked Questions
- Key Takeaway
Why Distributed Transactions Fail: The CAP Theorem Constraint
The CAP theorem states that a distributed system cannot guarantee Consistency, Availability, and Partition Tolerance all at once: when a network partition occurs, it must give up either consistency or availability. In a microservices architecture with independent databases:
- Partition Tolerance is non-negotiable (network unreliability is a reality)
- Availability is typically required for business-critical operations
- Strong Consistency (like ACID transactions) is therefore what you sacrifice
When the Payment Service's database and the Order Service's database are on different servers, no single local database transaction can span both writes. Without a coordination protocol, one service may commit while the other fails, leaving data in an inconsistent state.
Two-Phase Commit (2PC): Why It Doesn't Scale
Two-Phase Commit attempts to provide distributed ACID semantics:
Phase 1 (Prepare): The coordinator sends "Can you commit?" to all participants. They lock their resources and respond "Yes" or "No."
Phase 2 (Commit/Abort): If all participants say "Yes," the coordinator sends "Commit." If any participant says "No," it sends "Abort."
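The two phases can be sketched in a few lines. This is a minimal in-memory illustration, not a production protocol: `Participant`, `can_commit`, and `two_phase_commit` are hypothetical names, and real implementations also need timeouts, durable logs, and coordinator recovery.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical resource manager taking part in 2PC."""
    name: str
    can_commit: bool = True
    state: str = "idle"

    def prepare(self) -> bool:
        # Phase 1: a real participant would lock its resources here, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (Commit/Abort): commit only on a unanimous "Yes".
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


ok = [Participant("orders"), Participant("payments")]
bad = [Participant("orders"), Participant("payments", can_commit=False)]
ok_result = two_phase_commit(ok)
bad_result = two_phase_commit(bad)
```

Note that every participant stays in the "prepared" state, holding its locks, until the coordinator reaches Phase 2.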
Why 2PC fails in microservices:
| Problem | Impact |
|---|---|
| Synchronous locking | All involved databases lock records for the entire transaction duration — seconds of lock contention |
| Coordinator SPOF | If the coordinator crashes between phases, all participants are blocked with locked resources indefinitely |
| Availability | All participants must be available simultaneously — if one is down, the entire transaction blocks |
| Latency | Two network round trips minimum before any commit |
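At Netflix scale, a request that fans out to 10 or more services would hold row locks in every participating database until the slowest participant responds. For high-throughput workloads, that level of lock contention is unacceptable.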
The Saga Solution: Local Transactions + Compensations
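A Saga replaces one large ACID transaction with a sequence of local transactions T1, T2, ..., Tn, each isolated to one service's database.
Each transaction Ti has a corresponding compensating transaction Ci that logically undoes Ti's effect. If Ti fails, the saga runs the compensations for the steps that already committed, in reverse order: C(i-1), ..., C1.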
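As a rough sketch of that idea, the runner below executes each Ti and, on failure, runs the compensations of the completed steps in reverse. The names `run_saga`, `Step`, and the sample step labels are illustrative assumptions, and the steps are plain in-process callables rather than real service calls.

```python
from typing import Callable

# A step is a pair (Ti, Ci): the forward action and its compensation.
Step = tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    completed: list[Callable[[], None]] = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Undo the steps that already committed, newest first.
            for compensate in reversed(completed):
                compensate()
            return False
        completed.append(compensation)
    return True


log: list[str] = []


def fail_shipping() -> None:
    raise RuntimeError("no courier available")


ok = run_saga([
    (lambda: log.append("T1 createOrder"), lambda: log.append("C1 cancelOrder")),
    (lambda: log.append("T2 chargeCard"), lambda: log.append("C2 refundCard")),
    (fail_shipping, lambda: log.append("C3 cancelShipment")),
])
print(ok, log)
```

The failed step itself is not compensated, because its local transaction never committed.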
Choreography-Based Saga: Decentralized Event Flow
In choreography, no central coordinator exists. Each service reacts to the events it subscribes to and emits its own events in response.
Advantages:
- No single point of failure — each service manages its own flow
- Easy to add new participants (new service subscribes to relevant events)
- Loose coupling between services
Disadvantages:
- The complete business workflow is implicit — spread across multiple services
- Debugging "where did the saga get stuck?" requires correlating events across services
- Cyclic event dependencies can emerge as the saga grows
- Difficult to handle failure scenarios that span 5+ services
Use choreography for: Simple 2-3 step workflows with clear event boundaries and well-understood failure modes.
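A choreographed flow might look like the sketch below. It assumes an in-memory `EventBus` standing in for a real message broker, and the event names (`OrderCreated`, `PaymentCompleted`, `PaymentFailed`) and handler functions are hypothetical.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a message broker (assumption for illustration)."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        self.history: list[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(payload)


bus = EventBus()


def payment_service(order: dict) -> None:
    # Reacts to OrderCreated and emits its own success or failure event.
    if order["amount"] <= order["credit"]:
        bus.publish("PaymentCompleted", order)
    else:
        bus.publish("PaymentFailed", order)


def order_service_confirm(order: dict) -> None:
    order["status"] = "confirmed"


def order_service_cancel(order: dict) -> None:
    # Compensation: the order service reacts to the failure event.
    order["status"] = "cancelled"


bus.subscribe("OrderCreated", payment_service)
bus.subscribe("PaymentCompleted", order_service_confirm)
bus.subscribe("PaymentFailed", order_service_cancel)

order = {"id": 1, "amount": 150, "credit": 100, "status": "pending"}
bus.publish("OrderCreated", order)
print(order["status"], bus.history)
```

No function here describes the whole workflow. It only emerges from the subscriptions, which is the implicitness the disadvantages list warns about.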
Orchestration-Based Saga: Centralized Coordinator
An explicit Saga Orchestrator service knows the full workflow and coordinates each step, telling each participant what to do and reacting to its reply.
Advantages:
- The complete workflow is explicit and visible in one place
- Easy to add compensation logic — the orchestrator handles all failure scenarios
- Simpler to monitor progress and debug stuck sagas
- No cyclic dependencies possible
Disadvantages:
- The orchestrator becomes a central dependency — must be highly available
- Risk of business logic leaking into the orchestrator (should only coordinate, not decide)
Use orchestration for: Complex workflows with 4+ services, multiple failure scenarios, strict business rules, or when explicitness is valued over decoupling.
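One way to picture an orchestrator is as a small state machine that owns the step list and records what happened. The sketch below is an assumption-laden illustration: `OrderSagaOrchestrator`, `SagaStep`, and the step functions are hypothetical, and the participants are local functions rather than remote services.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]
    compensation: Callable[[dict], None]


@dataclass
class OrderSagaOrchestrator:
    steps: list[SagaStep]
    state: str = "NOT_STARTED"
    trail: list[str] = field(default_factory=list)

    def execute(self, order: dict) -> str:
        done: list[SagaStep] = []
        for step in self.steps:
            self.state = f"RUNNING:{step.name}"
            try:
                step.action(order)
            except Exception as exc:
                self.trail.append(f"{step.name} failed: {exc}")
                # Compensate completed steps in reverse order.
                for prior in reversed(done):
                    self.state = f"COMPENSATING:{prior.name}"
                    prior.compensation(order)
                    self.trail.append(f"{prior.name} compensated")
                self.state = "ROLLED_BACK"
                return self.state
            done.append(step)
            self.trail.append(f"{step.name} ok")
        self.state = "COMPLETED"
        return self.state


def reserve_stock(order: dict) -> None:
    order["stock"] = "reserved"


def release_stock(order: dict) -> None:
    order["stock"] = "released"


def charge_card(order: dict) -> None:
    raise RuntimeError("card declined")


def refund_card(order: dict) -> None:
    order["payment"] = "refunded"


saga = OrderSagaOrchestrator([
    SagaStep("reserveStock", reserve_stock, release_stock),
    SagaStep("chargeCard", charge_card, refund_card),
])
order = {"id": 42}
result = saga.execute(order)
print(result, saga.trail)
```

The `state` and `trail` fields are what make a stuck saga easy to locate: the whole workflow and its history live in one place.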
Designing Compensating Transactions: The Hard Part
Compensating transactions are not simply rollbacks — they are new business actions that produce a consistent end-state:
| Forward Transaction | Naive "Rollback" | Correct Compensating Transaction |
|---|---|---|
| chargeCard($150) | Delete payment record | refundCard($150) + send customer refund email |
| sendWelcomeEmail() | Delete email record | Cannot un-send; send "We're sorry" follow-up |
| reserveConferenceRoom() | Delete reservation | cancelReservation() + notify attendees |
| createOrder() | Delete order row | cancelOrder() + notify customer with reason |
The irreversibility problem: Some actions cannot be compensated — they can only be acknowledged and mitigated:
- Emails sent (can't un-send — send apology instead)
- Physical goods dispatched (initiate return process)
- External API calls with side effects (log the inconsistency, handle manually)
Design rule: Define the compensating transaction for every saga step before implementing the forward transaction. If you can't define a meaningful compensation, reconsider your saga decomposition.
Implementing Orchestration with Temporal
Temporal is a widely used durable workflow engine that suits orchestration-based sagas.
Temporal's key advantage: it persists each workflow's event history, so if your orchestrator crashes mid-saga, the workflow resumes from the last successful step when a worker picks it up again. No manual recovery code is needed.
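The resume-after-crash behavior can be illustrated without any workflow engine. The sketch below is not Temporal's API. It is a hypothetical `run_with_journal` helper that persists completed step names to a JSON file and skips them on restart, which is the general idea behind durable execution.

```python
import json
import os
import tempfile
from typing import Callable


def run_with_journal(path: str, steps: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run steps in order, skipping any already recorded in the journal file."""
    completed: list[str] = []
    if os.path.exists(path):
        with open(path) as f:
            completed = json.load(f)
    executed_now: list[str] = []
    for name, action in steps:
        if name in completed:
            continue  # finished before the crash; do not run it again
        action()
        completed.append(name)
        with open(path, "w") as f:
            json.dump(completed, f)  # persist progress after each step
        executed_now.append(name)
    return executed_now


journal = os.path.join(tempfile.mkdtemp(), "saga-42.json")
calls: list[str] = []


def crash() -> None:
    raise RuntimeError("orchestrator process died")


# First run: the second step simulates a crash mid-saga.
try:
    run_with_journal(journal, [
        ("reserveStock", lambda: calls.append("reserveStock")),
        ("chargeCard", crash),
    ])
except RuntimeError:
    pass

# Restart: the journal shows reserveStock is done, so only chargeCard runs.
resumed = run_with_journal(journal, [
    ("reserveStock", lambda: calls.append("reserveStock")),
    ("chargeCard", lambda: calls.append("chargeCard")),
])
print(resumed, calls)
```

In this sketch, a step that completes just before a crash but before the journal write would run again on restart, so each step still needs to tolerate re-execution.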
Idempotency in Saga Steps
Every saga step and compensating transaction must be idempotent: safe to execute multiple times with the same result. Temporal and other frameworks retry failed activities, so your activity handlers can receive the same call more than once.
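A common way to get idempotency is to deduplicate on an idempotency key. The sketch below assumes a hypothetical `PaymentService` with an in-memory lookup table. In a real service, the key check and the charge would need to happen in the same local transaction, for example behind a unique constraint.

```python
class PaymentService:
    """Hypothetical payment handler that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.processed: dict[str, str] = {}  # idempotency key -> charge id
        self.charges_made = 0

    def charge_card(self, idempotency_key: str, amount_cents: int) -> str:
        # A retried call with a known key returns the stored result
        # instead of charging the card a second time.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.charges_made += 1
        charge_id = f"ch_{self.charges_made}"
        self.processed[idempotency_key] = charge_id
        return charge_id


payments = PaymentService()
first = payments.charge_card("saga-42:chargeCard", 15_000)
retry = payments.charge_card("saga-42:chargeCard", 15_000)
print(first, retry, payments.charges_made)
```

Deriving the key from the saga ID and step name, as assumed here, means every retry of the same step maps to the same key.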
Frequently Asked Questions
Should the Saga Orchestrator be a separate service or embedded in the Order Service?
For small systems (< 5 services in the saga), embedding the orchestrator in the initiating service (Order Service) is pragmatic. For complex enterprise workflows touching 10+ services, a dedicated workflow service (using Temporal, Conductor, or Camunda) provides better visibility, reusability across workflows, and separation of concerns.
How do I handle a compensating transaction that also fails?
A failed compensation is one of the hardest failure classes in a saga, because the system is left only partially undone. Strategies: (1) Retry the compensation with exponential backoff (works for transient failures). (2) Store the failed compensation in a dead-letter queue for manual intervention. (3) Alert the operations team with a reconciliation report. (4) Design compensations to be fault-tolerant from the start (idempotent, with retry logic). Some failures genuinely require human intervention, so design your system to surface them clearly rather than letting them fail silently.
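Strategies (1) and (2) can be combined in a small helper. This is a sketch under stated assumptions: `compensate_with_retry` is a hypothetical function, the dead-letter queue is a plain list, and the `sleep` parameter exists only so the delays can be observed without actually waiting.

```python
import time
from typing import Callable


def compensate_with_retry(
    compensation: Callable[[], None],
    dead_letters: list[str],
    name: str,
    max_attempts: int = 4,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    for attempt in range(max_attempts):
        try:
            compensation()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    dead_letters.append(name)  # park for manual intervention
    return False


attempts = {"n": 0}


def flaky_refund() -> None:
    # Fails twice with a transient error, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("payment gateway timeout")


def broken_refund() -> None:
    raise ConnectionError("payment gateway down")


dlq: list[str] = []
delays: list[float] = []
recovered = compensate_with_retry(flaky_refund, dlq, "refundCard:order-6", sleep=delays.append)
parked = compensate_with_retry(broken_refund, dlq, "refundCard:order-7", sleep=delays.append)
print(recovered, parked, dlq)
```

Because the compensation may run several times, it has to be idempotent for this retry loop to be safe.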
Key Takeaway
The Saga Pattern is the correct answer to distributed transactions in microservices — not because it's simpler than 2PC, but because it's more realistic. Failures in a distributed system are normal events, not exceptions. Sagas make failure handling explicit and first-class: every forward transaction has a defined compensating transaction, every step is idempotent, and the coordinator (if using orchestration) maintains a complete audit trail of what happened and why. Master this pattern, and distributed consistency problems become manageable engineering problems rather than sources of midnight incidents.
