What is the difference between a choreography saga and an orchestration saga?

In choreography, each service reacts to events and publishes its own events - there is no central coordinator. In orchestration, a central orchestrator tells each service what to do and handles failures. Choreography is simpler and more decoupled for straightforward flows; orchestration is easier to reason about, test, and debug for complex multi-step workflows. Orchestration is generally preferred for sagas with many steps or complex compensation logic.

How do I implement compensating transactions in a saga?

Design a compensating action for every step that can be rolled back - if payment is charged, the compensation is a refund; if inventory is reserved, the compensation is a release. Compensating transactions must be idempotent (safe to call multiple times) and must eventually succeed. Store the saga state and completed steps in a durable store so compensation can continue after failures or restarts. Not all steps are compensatable - design sagas to put non-compensatable steps last.
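
The durable saga state described above can be pictured as a small table of completed steps, so that a restarted process can work out which compensations are still owed. The following is a minimal sketch using SQLite; the table layout and the names `SagaLog`, `record_step`, `mark_compensated`, and `pending_compensations` are illustrative assumptions, not part of any framework.

```python
import sqlite3


class SagaLog:
    """Durable record of completed saga steps (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to survive process restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS saga_steps ("
            " saga_id TEXT, seq INTEGER, step TEXT,"
            " compensated INTEGER DEFAULT 0,"
            " PRIMARY KEY (saga_id, seq))"
        )

    def record_step(self, saga_id: str, seq: int, step: str) -> None:
        # INSERT OR IGNORE keeps the write idempotent if a step is re-recorded.
        self.db.execute(
            "INSERT OR IGNORE INTO saga_steps (saga_id, seq, step) VALUES (?, ?, ?)",
            (saga_id, seq, step),
        )
        self.db.commit()

    def mark_compensated(self, saga_id: str, seq: int) -> None:
        self.db.execute(
            "UPDATE saga_steps SET compensated = 1 WHERE saga_id = ? AND seq = ?",
            (saga_id, seq),
        )
        self.db.commit()

    def pending_compensations(self, saga_id: str) -> list[str]:
        # Completed steps not yet compensated, most recent first.
        rows = self.db.execute(
            "SELECT step FROM saga_steps WHERE saga_id = ? AND compensated = 0"
            " ORDER BY seq DESC",
            (saga_id,),
        ).fetchall()
        return [step for (step,) in rows]


log = SagaLog()
log.record_step("order-42", 1, "create_order")
log.record_step("order-42", 2, "reserve_inventory")
# Suppose payment fails here: the log still says what must be undone, and in which order.
pending = log.pending_compensations("order-42")
```

Marking each compensation as done once it succeeds means a crash in the middle of compensation only repeats the work that was not yet recorded, which is why the compensations themselves need to be idempotent.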

How does the saga pattern relate to the two-phase commit (2PC)?

2PC is a synchronous distributed transaction protocol - all participating services must be available and agree before committing. Sagas are an asynchronous alternative - each service executes its step and publishes a result, with compensating transactions handling failures. 2PC is simpler conceptually but impractical in microservices (it blocks, holds distributed locks, and fails when any participant is unavailable). Sagas give up isolation and immediate atomicity (intermediate states are visible to other transactions until the saga completes or is compensated) in exchange for availability and loose coupling.

The Saga Pattern: Distributed Transactions in Microservices - From ACID to Compensating Transactions

Q: Should the Saga Orchestrator be a separate service or embedded in the initiating service?

For small systems with fewer than 5 services in the saga, embedding the orchestrator in the initiating service (e.g. Order Service) is pragmatic and reduces infrastructure. For complex enterprise workflows touching 10 or more services, a dedicated orchestration service (using a workflow engine like Temporal or Conductor) is cleaner - it centralises saga state, provides visibility, and separates coordination concerns from business logic.

Why Distributed Transactions Fail: The CAP Theorem Constraint
Two-Phase Commit (2PC): Why It Doesn't Scale
The Saga Solution: Local Transactions + Compensations
Choreography-Based Saga: Decentralized Event Flow
Orchestration-Based Saga: Centralized Coordinator
Designing Compensating Transactions: The Hard Part
Implementing Orchestration with Temporal
The Pivot Transaction: A Critical Concept
Idempotency in Saga Steps
Common Pitfalls: Saga Anti-Patterns
Frequently Asked Questions
Key Takeaway

Why Distributed Transactions Fail: The CAP Theorem Constraint

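The CAP Theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability. In a microservices architecture with independent databases: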

Partition Tolerance is non-negotiable (network unreliability is a reality)
Availability is typically required for business-critical operations
Strong Consistency (like ACID transactions) is therefore what you sacrifice

When the Payment Service's database and the Order Service's database are on different servers, no single local ACID transaction can span both writes. Without a distributed transaction protocol, one service may commit while the other fails - leaving data in an inconsistent state.
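
The inconsistency is easy to reproduce with two independent databases. The sketch below uses two in-memory SQLite connections as stand-ins for the Order and Payment databases; the table names, the `CHECK` constraint, and the invalid amount are assumptions chosen only to force the second write to fail.

```python
import sqlite3

# Two separate databases: no single transaction can cover both.
orders_db = sqlite3.connect(":memory:")
payments_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
payments_db.execute(
    "CREATE TABLE payments (order_id TEXT PRIMARY KEY, cents INTEGER CHECK (cents > 0))"
)

# Local transaction 1 commits in the Order database.
with orders_db:
    orders_db.execute("INSERT INTO orders VALUES ('o-1', 'PLACED')")

# Local transaction 2 fails in the Payment database (invalid amount).
try:
    with payments_db:
        payments_db.execute("INSERT INTO payments VALUES ('o-1', 0)")
except sqlite3.IntegrityError:
    pass  # the payment rolled back, but the order commit cannot be undone from here

order_count = orders_db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
payment_count = payments_db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

After this runs there is one placed order and no payment for it: each database is internally consistent, but the system as a whole is not.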

Two-Phase Commit (2PC): Why It Doesn't Scale

Two-Phase Commit attempts to provide distributed ACID semantics:

Phase 1 (Prepare): The coordinator sends "Can you commit?" to all participants. They lock their resources and respond "Yes" or "No."

Phase 2 (Commit/Abort): If all say "Yes," coordinator sends "Commit." If any say "No," sends "Abort."
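
The two phases can be sketched as a small in-memory simulation. `Participant` and `two_phase_commit` are hypothetical names used for illustration; a real implementation would also need durable logs, timeouts, and coordinator recovery, none of which are shown.

```python
class Participant:
    """In-memory stand-in for a service taking part in 2PC."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.locked = False
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: lock resources and vote "Yes" or "No".
        self.locked = True
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"
        self.locked = False

    def abort(self) -> None:
        self.state = "aborted"
        self.locked = False


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (Commit/Abort): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


happy = [Participant("orders"), Participant("payments")]
unhappy = [Participant("orders"), Participant("payments", can_commit=False)]
committed = two_phase_commit(happy)
aborted = not two_phase_commit(unhappy)
```

Every participant holds its lock from `prepare` until the coordinator's decision arrives. In this simulation that window is instantaneous; across a network it lasts for as long as the slowest participant and the coordinator take to respond.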

Why 2PC fails in microservices:

Problem	Impact
Synchronous locking	All involved databases lock records for the entire transaction duration - seconds of lock contention
Coordinator SPOF	If the coordinator crashes between phases, all participants are blocked with locked resources indefinitely
Availability	All participants must be available simultaneously - if one is down, the entire transaction blocks
Latency	Two network round trips minimum before any commit

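At the scale of a company like Netflix, a single 2PC transaction spanning 10+ services would hold row locks for hundreds of milliseconds or more while the coordinator waits on every participant - unacceptable for high-throughput paths.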

The Saga Solution: Local Transactions + Compensations

A Saga replaces one large ACID transaction with a sequence of local transactions, each isolated to one service's database:

text

Forward Path (Happy Path):
T1: OrderService.createOrder()       -> Order DB committed
T2: InventoryService.reserveItems()  -> Inventory DB committed  
T3: PaymentService.chargeCard()      -> Payment DB committed
T4: ShippingService.scheduleShipment() -> Shipping DB committed
-> SAGA SUCCESS

Failure Path (T3 PaymentService fails):
T1: Done ✓
T2: Done ✓
T3: FAILED (card declined)
-> Execute compensating transactions in reverse:
C2: InventoryService.releaseItems()    -> Releases the reservation
C1: OrderService.cancelOrder()         -> Marks order CANCELLED
-> SAGA COMPENSATED (data consistent again)

Each transaction Ti has a corresponding compensating transaction Ci that logically undoes Ti's effect.
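
The forward and failure paths above can be sketched as a small executor that pairs each Ti with its Ci. This is an in-process illustration only: `run_saga`, `Step`, and the stub actions are hypothetical names, and a real saga would invoke remote services and persist its progress.

```python
from typing import Callable

# (name, forward action Ti, compensating action Ci)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    trace: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            trace.append(f"{name}: FAILED")
            # The failed step made no durable change, so only earlier steps are undone.
            for done_name, _, compensate in reversed(completed):
                compensate()
                trace.append(f"{done_name}: compensated")
            return trace
        completed.append(step)
        trace.append(f"{name}: done")
    return trace


def decline_card() -> None:
    raise RuntimeError("card declined")


def noop() -> None:
    pass


trace = run_saga([
    ("T1 createOrder", noop, noop),
    ("T2 reserveItems", noop, noop),
    ("T3 chargeCard", decline_card, noop),
    ("T4 scheduleShipment", noop, noop),
])
```

The resulting trace mirrors the failure path: T1 and T2 complete, T3 fails, then T2 and T1 are compensated in that order and T4 never runs.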

Choreography-Based Saga: Decentralized Event Flow

In choreography, no central coordinator exists. Each service reacts to events and emits its own.
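
A choreographed flow can be sketched with a synchronous in-memory event bus standing in for a message broker. The event names (`OrderCreated`, `InventoryReserved`, and so on), the `EventBus` class, and the `card_ok` flag are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.log.append(event)
        for handler in self.handlers[event]:
            handler(payload)


bus = EventBus()


# Each service knows only the events it consumes and the events it emits.
def inventory_on_order_created(order: dict) -> None:
    bus.publish("InventoryReserved", order)


def payment_on_inventory_reserved(order: dict) -> None:
    bus.publish("PaymentCharged" if order["card_ok"] else "PaymentFailed", order)


def inventory_on_payment_failed(order: dict) -> None:
    bus.publish("InventoryReleased", order)  # compensation for the reservation


def order_on_inventory_released(order: dict) -> None:
    bus.publish("OrderCancelled", order)  # compensation for the order


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)
bus.subscribe("InventoryReleased", order_on_inventory_released)

bus.publish("OrderCreated", {"order_id": "o-1", "card_ok": False})
event_trail = bus.log
```

No function here describes the whole workflow; it only emerges from the subscriptions, which is the source of both the loose coupling and the debugging difficulty.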

Advantages:

No single point of failure - each service manages its own flow
Easy to add new participants (new service subscribes to relevant events)
Loose coupling between services

Disadvantages:

The complete business workflow is implicit - spread across multiple services
Debugging "where did the saga get stuck?" requires correlating events across services
Cyclic event dependencies can emerge as the saga grows
Difficult to handle failure scenarios that span 5+ services

Use choreography for: Simple 2-3 step workflows with clear event boundaries and well-understood failure modes.

Orchestration-Based Saga: Centralized Coordinator

An explicit Saga Orchestrator service knows the full workflow and coordinates each step.

Advantages:

The complete workflow is explicit and visible in one place
Easy to add compensation logic - the orchestrator handles all failure scenarios
Simpler to monitor progress and debug stuck sagas
No cyclic dependencies possible

Disadvantages:

The orchestrator becomes a central dependency - must be highly available
Risk of business logic leaking into the orchestrator (should only coordinate, not decide)

Use orchestration for: Complex workflows with 4+ services, multiple failure scenarios, strict business rules, or when explicitness is valued over decoupling.

Designing Compensating Transactions: The Hard Part

Compensating transactions are not simply rollbacks - they are new business actions that produce a consistent end-state:

Forward Transaction	Naive "Rollback"	Correct Compensating Transaction
`chargeCard($150)`	Delete payment record	`refundCard($150)` + send customer refund email
`sendWelcomeEmail()`	Delete email record	Cannot un-send - send "We're sorry" follow-up
`reserveConferenceRoom()`	Delete reservation	`cancelReservation()` + notify attendees
`createOrder()`	Delete order row	`cancelOrder()` + notify customer with reason

The irreversibility problem: Some actions cannot be compensated - they can only be acknowledged and mitigated:

Emails sent (can't un-send - send apology instead)
Physical goods dispatched (initiate return process)
External API calls with side effects (log the inconsistency, handle manually)

Design rule: Define the compensating transaction for every saga step before implementing the forward transaction. If you can't define a meaningful compensation, reconsider your saga decomposition.
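
The difference between a rollback and a compensation can be sketched with an append-only ledger, where a refund is a new entry rather than the deletion of the charge. `PaymentLedger` and its methods are hypothetical names; amounts are in cents.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Append-only ledger: history is never deleted, only offset."""

    entries: list[tuple[str, str, int]] = field(default_factory=list)

    def _total(self, kind: str, order_id: str) -> int:
        return sum(c for k, oid, c in self.entries if k == kind and oid == order_id)

    def charge(self, order_id: str, cents: int) -> None:
        self.entries.append(("CHARGE", order_id, cents))

    def refund(self, order_id: str) -> None:
        # Compensation: add an offsetting entry for whatever is still outstanding.
        outstanding = self._total("CHARGE", order_id) - self._total("REFUND", order_id)
        if outstanding > 0:
            self.entries.append(("REFUND", order_id, outstanding))

    def balance(self, order_id: str) -> int:
        return self._total("CHARGE", order_id) - self._total("REFUND", order_id)


ledger = PaymentLedger()
ledger.charge("o-1", 15000)  # $150.00
ledger.refund("o-1")
ledger.refund("o-1")  # a duplicate compensation request changes nothing
```

The ledger ends with two entries and a zero balance, so the audit trail shows both that the customer was charged and that the charge was reversed.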

Implementing Orchestration with Temporal

Temporal is a widely used workflow engine for implementing orchestration sagas:

python

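# Temporal workflow in Python (temporalio SDK).
# OrderResult and the activities (create_order, reserve_inventory,
# charge_payment, schedule_shipping, cancel_order, release_inventory) are
# assumed to be defined elsewhere; refund_payment is an assumed compensation
# activity added so a charged payment can be undone.
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

# Bound the retries so a permanent failure (e.g. card declined) surfaces as
# ActivityError instead of being retried indefinitely.
RETRY = RetryPolicy(maximum_attempts=3)


@workflow.defn
class PlaceOrderWorkflow:
    @workflow.run
    async def run(self, order_id: str, customer_id: str, items: list) -> OrderResult:
        # Temporal durably records workflow state - crash-safe!
        # Compensations for the steps completed so far, in completion order.
        compensations = []

        try:
            # Step 1: Create order
            order = await workflow.execute_activity(
                create_order,
                args=[order_id, customer_id, items],
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RETRY,
            )
            compensations.append(cancel_order)

            # Step 2: Reserve inventory
            await workflow.execute_activity(
                reserve_inventory,
                args=[order_id, items],
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RETRY,
            )
            compensations.append(release_inventory)

            # Step 3: Process payment (arguments match charge_payment(order_id, amount))
            await workflow.execute_activity(
                charge_payment,
                args=[order_id, order.total],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RETRY,
            )
            compensations.append(refund_payment)

            # Step 4: Schedule shipping
            await workflow.execute_activity(
                schedule_shipping,
                args=[order_id, order.shipping_address],
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RETRY,
            )

            return OrderResult(success=True, order_id=order_id)

        except ActivityError as e:
            # Compensate only the steps that completed, in reverse order.
            # Compensations keep Temporal's default retry policy so they are
            # retried until they succeed.
            for compensate in reversed(compensations):
                await workflow.execute_activity(
                    compensate,
                    args=[order_id],
                    start_to_close_timeout=timedelta(seconds=10),
                )
            return OrderResult(success=False, error=str(e))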

Temporal's key advantage: if the worker running your orchestrator crashes mid-saga, Temporal replays the recorded workflow history and resumes after the last completed step - no manual recovery code needed.

Idempotency in Saga Steps

Every saga step and compensating transaction must be idempotent - safe to execute multiple times with the same result. Temporal and other frameworks retry failed activities, so your activity handlers can receive the same call more than once:

python

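import stripe  # needs a stripe-python release that provides the *_async methods
from temporalio import activity

# payment_db and PaymentResult are assumed to be defined elsewhere.


@activity.defn
async def charge_payment(order_id: str, amount: int) -> PaymentResult:
    # amount is in the smallest currency unit (cents), as Stripe expects.
    # Idempotency: use order_id as idempotency key
    existing = await payment_db.find_by_order(order_id)
    if existing and existing.status == "COMPLETED":
        return PaymentResult(payment_id=existing.id, status="COMPLETED")

    # Safe to retry: if a crash happens between the charge and the save below,
    # the retry reuses the same key and the gateway does not charge twice.
    # A real charge also needs a payment source or customer, omitted here.
    result = await stripe.Charge.create_async(
        amount=amount,
        currency="usd",
        idempotency_key=f"saga-{order_id}",  # Stripe will deduplicate
    )
    await payment_db.save(order_id, result.id, "COMPLETED")
    return PaymentResult(payment_id=result.id, status="COMPLETED")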
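
Compensations need the same property as forward steps. As a minimal sketch, an inventory service can key reservations by order so that both a repeated reserve and a repeated release are harmless. The `Inventory` class is a hypothetical in-memory stand-in for a real inventory database.

```python
class Inventory:
    """In-memory stand-in for an inventory service, keyed by order_id."""

    def __init__(self, stock: int) -> None:
        self.available = stock
        self.reservations: dict[str, int] = {}

    def reserve(self, order_id: str, qty: int) -> bool:
        if order_id in self.reservations:
            return True  # duplicate delivery: already reserved, nothing to do
        if qty > self.available:
            return False
        self.available -= qty
        self.reservations[order_id] = qty
        return True

    def release(self, order_id: str) -> None:
        # pop() with a default turns a repeated or unmatched release into a no-op.
        self.available += self.reservations.pop(order_id, 0)


inventory = Inventory(stock=10)
inventory.reserve("o-1", 2)
inventory.reserve("o-1", 2)  # retried forward step
after_reserve = inventory.available
inventory.release("o-1")
inventory.release("o-1")  # retried compensation
after_release = inventory.available
```

Stock drops by two once, not twice, and returns to its starting level once, regardless of how many times each call is retried.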

Frequently Asked Questions

Should the Saga Orchestrator be a separate service or embedded in the Order Service? For small systems (< 5 services in the saga), embedding the orchestrator in the initiating service (Order Service) is pragmatic. For complex enterprise workflows touching 10+ services, a dedicated workflow service (using Temporal, Conductor, or Camunda) provides better visibility, reusability across workflows, and separation of concerns.

How do I handle a compensating transaction that also fails? A failed compensation is among the hardest failures to handle in a distributed system, because the saga can then neither complete nor undo itself. Strategies: (1) Retry the compensation with exponential backoff (works for transient failures). (2) Store the failed compensation in a dead-letter queue for manual intervention. (3) Alert the operations team with a reconciliation report. (4) Design compensations to be fault-tolerant from the start (idempotent, with retry logic). Some failures genuinely require human intervention - design your system to surface them clearly rather than letting them fail silently.
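
Strategies (1) and (2) can be combined in a few lines. The sketch below retries a compensation with exponential backoff and, once the attempts are exhausted, records it in a plain list standing in for a dead-letter queue. The function name, parameters, and default delays are illustrative assumptions.

```python
import time
from typing import Callable


def compensate_with_retry(
    compensate: Callable[[], None],
    dead_letters: list[str],
    name: str,
    max_attempts: int = 4,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry with exponential backoff; park the compensation if it never succeeds."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    dead_letters.append(name)  # surface for manual intervention
    return False


calls = {"refund": 0}


def flaky_refund() -> None:
    calls["refund"] += 1
    if calls["refund"] < 3:
        raise ConnectionError("gateway unavailable")


def broken_release() -> None:
    raise RuntimeError("inventory service down")


delays: list[float] = []  # record the delays instead of actually sleeping
dead_letters: list[str] = []
refunded = compensate_with_retry(flaky_refund, dead_letters, "refund o-1", sleep=delays.append)
released = compensate_with_retry(broken_release, dead_letters, "release o-1", sleep=delays.append)
```

The transient failure recovers on the third attempt, while the permanently failing compensation ends up in `dead_letters`, where an alert or reconciliation report can pick it up.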

Key Takeaway

The Saga Pattern is the correct answer to distributed transactions in microservices - not because it's simpler than 2PC, but because it's more realistic. Failures in a distributed system are normal events, not exceptions. Sagas make failure handling explicit and first-class: every forward transaction has a defined compensating transaction, every step is idempotent, and the coordinator (if using orchestration) maintains a complete audit trail of what happened and why. Master this pattern, and distributed consistency problems become manageable engineering problems rather than sources of midnight incidents.

Part of the Software Architecture Hub - comprehensive guides from architectural foundations to advanced distributed systems patterns.
