The Saga Pattern: Distributed Transactions in Microservices — From ACID to Compensating Transactions

TopicTrick Team

Why Distributed Transactions Fail: The CAP Theorem Constraint

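The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability. In a microservices architecture with independent databases: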

  • Partition Tolerance is non-negotiable (network unreliability is a reality)
  • Availability is typically required for business-critical operations
  • Strong Consistency (like ACID transactions) is therefore what you sacrifice

When the Payment Service's database and the Order Service's database run on different servers, no single local transaction can span both, so a write to both cannot be made atomic without a distributed commit protocol. One service may commit while the other fails, leaving the data in an inconsistent state.


Two-Phase Commit (2PC): Why It Doesn't Scale

Two-Phase Commit attempts to provide distributed ACID semantics:

Phase 1 (Prepare): The coordinator sends "Can you commit?" to all participants. They lock their resources and respond "Yes" or "No."

Phase 2 (Commit/Abort): If all participants vote "Yes," the coordinator sends "Commit." If any participant votes "No," it sends "Abort."
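
The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production protocol: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names invented for this example, and real participants would also persist their votes and hold locks between phases.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical resource manager that votes, then commits or aborts."""
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared -> committed, or aborted

    def prepare(self) -> bool:
        # Phase 1: vote. A "Yes" vote means resources stay locked until phase 2.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (Prepare): ask every participant for a vote.
    votes = [p.prepare() for p in participants]
    # Phase 2 (Commit/Abort): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("orders"), Participant("payments")]
committed = two_phase_commit(healthy)

mixed = [Participant("orders"), Participant("payments", can_commit=False)]
aborted = two_phase_commit(mixed)
```

Note that every participant sits in the `prepared` state, holding its locks, until the coordinator's second message arrives. That window is where the problems listed next come from.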

Why 2PC fails in microservices:

| Problem | Impact |
| --- | --- |
| Synchronous locking | All involved databases lock records for the entire transaction duration, which can mean seconds of lock contention |
| Coordinator SPOF | If the coordinator crashes between phases, all participants are blocked with locked resources indefinitely |
| Availability | All participants must be available simultaneously; if one is down, the entire transaction blocks |
| Latency | Two network round trips minimum before any commit |

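At the scale of a company like Netflix, where a single request may touch 10+ services, 2PC would hold database row locks until the slowest participant responds, plausibly hundreds of milliseconds per transaction. At high throughput that is unacceptable.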


The Saga Solution: Local Transactions + Compensations

A Saga replaces one large ACID transaction with a sequence of local transactions, each isolated to one service's database:

    Success:         T1 -> T2 -> ... -> Tn
    Failure at Tk:   T1 -> ... -> T(k-1) -> Tk fails -> C(k-1) -> ... -> C1

Each transaction Ti has a corresponding compensating transaction Ci that logically undoes Ti's effect.
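
The Ti/Ci pairing can be sketched as a small executor. This is a minimal sketch under stated assumptions: `SagaStep` and `run_saga` are hypothetical names for this example, each callable stands in for a local transaction inside one service, and compensations are assumed not to fail.

```python
from typing import Callable, NamedTuple


class SagaStep(NamedTuple):
    name: str
    action: Callable[[], None]      # Ti: local transaction in one service
    compensate: Callable[[], None]  # Ci: logically undoes Ti


def run_saga(steps: list[SagaStep]) -> bool:
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Tk failed: undo the completed steps in reverse order.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: list[str] = []


def decline_card() -> None:
    raise RuntimeError("card declined")


steps = [
    SagaStep("create_order", lambda: log.append("T1"), lambda: log.append("C1")),
    SagaStep("reserve_stock", lambda: log.append("T2"), lambda: log.append("C2")),
    SagaStep("charge_card", decline_card, lambda: log.append("C3")),
]
succeeded = run_saga(steps)
```

Here the third step fails, so the log ends up as T1, T2, C2, C1. C3 never runs because T3 never committed anything that needs undoing.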


Choreography-Based Saga: Decentralized Event Flow

In choreography, no central coordinator exists. Each service reacts to the events it subscribes to, runs its local transaction, and emits an event of its own for the next service to pick up.
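
A choreographed order flow might look like the sketch below. Everything here is an assumption made for illustration: the in-memory `EventBus` stands in for a real message broker, and the event names (`OrderCreated`, `PaymentCompleted`, `PaymentFailed`) and the `run_order` helper are hypothetical.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """In-memory stand-in for a message broker (assumption for this sketch)."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.history: list[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def run_order(card_ok: bool) -> tuple[str, list[str]]:
    bus = EventBus()
    orders: dict[str, str] = {}

    # Payment Service: reacts to OrderCreated and emits its own outcome event.
    def on_order_created(event: dict) -> None:
        bus.publish("PaymentCompleted" if card_ok else "PaymentFailed", event)

    # Order Service: reacts to the payment outcome.
    def on_payment_completed(event: dict) -> None:
        orders[event["order_id"]] = "CONFIRMED"

    def on_payment_failed(event: dict) -> None:
        orders[event["order_id"]] = "CANCELLED"  # compensation for createOrder

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("PaymentCompleted", on_payment_completed)
    bus.subscribe("PaymentFailed", on_payment_failed)

    # Order Service: local transaction, then emit the first event.
    orders["o-1"] = "PENDING"
    bus.publish("OrderCreated", {"order_id": "o-1"})
    return orders["o-1"], bus.history
```

No function in this sketch describes the whole workflow. The sequence only emerges from who subscribes to what, which is both the appeal and the debugging cost of choreography.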

Advantages:

  • No single point of failure — each service manages its own flow
  • Easy to add new participants (new service subscribes to relevant events)
  • Loose coupling between services

Disadvantages:

  • The complete business workflow is implicit — spread across multiple services
  • Debugging "where did the saga get stuck?" requires correlating events across services
  • Cyclic event dependencies can emerge as the saga grows
  • Difficult to handle failure scenarios that span 5+ services

Use choreography for: Simple 2-3 step workflows with clear event boundaries and well-understood failure modes.


Orchestration-Based Saga: Centralized Coordinator

An explicit Saga Orchestrator service knows the full workflow and coordinates each step: it sends a command to one service, waits for the reply, and decides which command to send next.
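
One way to keep the workflow visible in one place is a table-driven state machine. The sketch below is illustrative only: the states, reply names, and command names are assumptions (including an Inventory Service that the article does not describe), and message transport is left out entirely.

```python
# (current state, reply received) -> (next state, command to send)
TRANSITIONS: dict[tuple[str, str], tuple[str, str]] = {
    ("START", "begin"): ("AWAITING_INVENTORY", "ReserveInventory"),
    ("AWAITING_INVENTORY", "InventoryReserved"): ("AWAITING_PAYMENT", "ChargeCard"),
    ("AWAITING_INVENTORY", "InventoryRejected"): ("CANCELLED", "CancelOrder"),
    ("AWAITING_PAYMENT", "PaymentCompleted"): ("COMPLETED", "ConfirmOrder"),
    ("AWAITING_PAYMENT", "PaymentFailed"): ("RELEASING_INVENTORY", "ReleaseInventory"),
    ("RELEASING_INVENTORY", "InventoryReleased"): ("CANCELLED", "CancelOrder"),
}


class OrderSagaOrchestrator:
    def __init__(self) -> None:
        self.state = "START"
        self.audit: list[tuple[str, str, str]] = []

    def handle(self, reply: str) -> str:
        """Consume a reply from a service and return the next command to send."""
        next_state, command = TRANSITIONS[(self.state, reply)]
        self.audit.append((self.state, reply, command))
        self.state = next_state
        return command


happy = OrderSagaOrchestrator()
happy_commands = [
    happy.handle(r) for r in ("begin", "InventoryReserved", "PaymentCompleted")
]

failed = OrderSagaOrchestrator()
failed_commands = [
    failed.handle(r)
    for r in ("begin", "InventoryReserved", "PaymentFailed", "InventoryReleased")
]
```

The `audit` list answers "where did the saga get stuck?" directly: the last entry shows the state the orchestrator was in and the command it is still waiting on.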

Advantages:

  • The complete workflow is explicit and visible in one place
  • Easy to add compensation logic — the orchestrator handles all failure scenarios
  • Simpler to monitor progress and debug stuck sagas
  • No cyclic dependencies possible

Disadvantages:

  • The orchestrator becomes a central dependency — must be highly available
  • Risk of business logic leaking into the orchestrator (should only coordinate, not decide)

Use orchestration for: Complex workflows with 4+ services, multiple failure scenarios, strict business rules, or when explicitness is valued over decoupling.


Designing Compensating Transactions: The Hard Part

Compensating transactions are not simply rollbacks — they are new business actions that produce a consistent end-state:

| Forward Transaction | Naive "Rollback" | Correct Compensating Transaction |
| --- | --- | --- |
| chargeCard($150) | Delete payment record | refundCard($150) + send customer refund email |
| sendWelcomeEmail() | Delete email record | Cannot un-send; send a "We're sorry" follow-up |
| reserveConferenceRoom() | Delete reservation | cancelReservation() + notify attendees |
| createOrder() | Delete order row | cancelOrder() + notify customer with reason |

The irreversibility problem: Some actions cannot be compensated — they can only be acknowledged and mitigated:

  • Emails sent (can't un-send — send apology instead)
  • Physical goods dispatched (initiate return process)
  • External API calls with side effects (log the inconsistency, handle manually)

Design rule: Define the compensating transaction for every saga step before implementing the forward transaction. If you can't define a meaningful compensation, reconsider your saga decomposition.
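
To make the "new business action, not a rollback" idea concrete, here is a small sketch of the chargeCard/refundCard pair. It assumes an append-only ledger; the `PaymentLedger` class and its method names are hypothetical and amounts are in cents.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    # Append-only entries: (kind, order_id, amount_in_cents)
    entries: list[tuple[str, str, int]] = field(default_factory=list)
    notifications: list[str] = field(default_factory=list)

    def charge_card(self, order_id: str, cents: int) -> None:
        self.entries.append(("charge", order_id, cents))

    def refund_card(self, order_id: str) -> None:
        """Compensation: record a refund instead of deleting the charge."""
        outstanding = self.balance(order_id)
        if outstanding > 0:  # nothing to do if already refunded
            self.entries.append(("refund", order_id, outstanding))
            self.notifications.append(f"refund email for {order_id}")

    def balance(self, order_id: str) -> int:
        return sum(
            cents if kind == "charge" else -cents
            for kind, oid, cents in self.entries
            if oid == order_id
        )


ledger = PaymentLedger()
ledger.charge_card("o-1", 15000)
ledger.refund_card("o-1")
ledger.refund_card("o-1")  # a repeated compensation adds nothing
```

After compensation the balance is zero, but the charge is still on record and the customer has been told what happened. Deleting the payment row would have lost both.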


Implementing Orchestration with Temporal

Temporal is a widely used open-source workflow engine and a common choice for implementing orchestration sagas.

Temporal's key advantage: workflow state is persisted as an event history, so if the worker running your orchestrator crashes mid-saga, the workflow resumes after the last completed step once a worker is available again. No hand-written recovery code is needed.
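
The idea behind that guarantee can be illustrated without Temporal itself. The sketch below is not Temporal's API; it is a simplified stdlib stand-in that records each completed step in a journal file and skips recorded steps on restart. The `run_durable` function, the JSON journal format, and the step names are all assumptions for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_durable(
    steps: list[tuple[str, Callable[[], str]]], journal_path: Path
) -> dict[str, str]:
    # Load the results of steps that completed before an earlier crash, if any.
    journal = json.loads(journal_path.read_text()) if journal_path.exists() else {}
    for name, step in steps:
        if name in journal:
            continue  # completed in a previous run: do not execute again
        journal[name] = step()
        # Persist after every step so a restart resumes from this point.
        journal_path.write_text(json.dumps(journal))
    return journal


calls: list[str] = []
crash_once = {"armed": True}


def reserve_inventory() -> str:
    calls.append("reserve")
    return "reserved"


def charge_card() -> str:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated worker crash")
    calls.append("charge")
    return "charged"


steps = [("reserve", reserve_inventory), ("charge", charge_card)]
journal_file = Path(tempfile.mkdtemp()) / "saga-journal.json"

try:
    run_durable(steps, journal_file)  # first run crashes at the second step
except RuntimeError:
    pass
result = run_durable(steps, journal_file)  # restart picks up after "reserve"
```

The inventory reservation runs exactly once even though the saga was started twice. A workflow engine gives you this behavior, plus timers, retries, and visibility, without you maintaining the journal yourself.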


Idempotency in Saga Steps

Every saga step and compensating transaction must be idempotent: safe to execute multiple times with the same result. Temporal and other frameworks retry failed activities, so your activity handlers can receive duplicate calls.
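
A common way to get idempotency is to key each step on a caller-supplied identifier and store the result. The following is a minimal in-memory sketch; `PaymentActivity`, the key format, and the receipt strings are hypothetical, and a real handler would keep the key table in the same database transaction as the charge.

```python
class PaymentActivity:
    def __init__(self) -> None:
        self.results: dict[str, str] = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge_card(self, idempotency_key: str, cents: int) -> str:
        # A retry carries the same key, so return the stored result
        # instead of charging the card a second time.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}"
        self.results[idempotency_key] = receipt
        return receipt


payments = PaymentActivity()
first = payments.charge_card("saga-42:charge", 15000)
retry = payments.charge_card("saga-42:charge", 15000)  # duplicate delivery
```

Deriving the key from the saga ID plus the step name means every retry of the same step maps to the same key, while a different saga or a different step gets its own.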

Frequently Asked Questions

Should the Saga Orchestrator be a separate service or embedded in the Order Service? For small systems (fewer than five services in the saga), embedding the orchestrator in the initiating service (the Order Service) is pragmatic. For complex enterprise workflows touching 10+ services, a dedicated workflow service (using Temporal, Conductor, or Camunda) provides better visibility, reuse across workflows, and separation of concerns.

How do I handle a compensating transaction that also fails? A failed compensation is one of the hardest failures to handle in a distributed system, because the saga can neither move forward nor finish undoing its work. Strategies: (1) Retry the compensation with exponential backoff (works for transient failures). (2) Store the failed compensation in a dead-letter queue for manual intervention. (3) Alert the operations team with a reconciliation report. (4) Design compensations to be fault-tolerant from the start (idempotent, with retry logic). Some failures genuinely require human intervention; design your system to surface them clearly rather than letting them fail silently.
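
Strategies (1) and (2) combine naturally, as in the sketch below. The function name, its parameters, and the list used as a dead-letter queue are assumptions for this example; the `sleep` parameter exists only so the delays can be observed without actually waiting.

```python
import time
from typing import Callable


def compensate_with_retry(
    compensation: Callable[[], None],
    dead_letters: list[dict],
    attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    last_error = None
    for attempt in range(attempts):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base, 2x base, 4x base, ...
                sleep(base_delay * 2 ** attempt)
    # Retries exhausted: park the failure for manual intervention.
    dead_letters.append(
        {"compensation": compensation.__name__, "error": str(last_error)}
    )
    return False


delays: list[float] = []
dead_letters: list[dict] = []


def refund_card() -> None:
    raise ConnectionError("refund API unavailable")


succeeded = compensate_with_retry(
    refund_card, dead_letters, attempts=3, sleep=delays.append
)
```

Because the compensation may run several times before it succeeds, this only works safely if the compensation itself is idempotent.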


Key Takeaway

The Saga Pattern is the correct answer to distributed transactions in microservices — not because it's simpler than 2PC, but because it's more realistic. Failures in a distributed system are normal events, not exceptions. Sagas make failure handling explicit and first-class: every forward transaction has a defined compensating transaction, every step is idempotent, and the coordinator (if using orchestration) maintains a complete audit trail of what happened and why. Master this pattern, and distributed consistency problems become manageable engineering problems rather than sources of midnight incidents.
