
Event-Driven Architecture: Scaling Beyond Request-Response with Kafka, Idempotency & Choreography

TopicTrick Team
Events vs Commands vs Queries

Understanding the three types of messages prevents architectural confusion:

| Type | Direction | Semantic | Example | Response? |
|------|-----------|----------|---------|-----------|
| Command | One producer → one consumer | "Do this" — an instruction | CreateInvoiceCommand | Yes (success/failure) |
| Event | One producer → many consumers | "This happened" — a fact | UserRegisteredEvent | No — fire and forget |
| Query | One requester → one responder | "Tell me this" | GetUserByIdQuery | Yes (data) |

Why events are different:

  • A Command expects something specific to happen — it fails if the consumer rejects it
  • An Event records that something already happened — producers don't control or wait for reactions
  • Events are past tense (OrderCreated, PaymentReceived) — they are immutable historical facts
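
To make the distinction concrete, here is a minimal sketch (the class names are illustrative, not from any framework) contrasting a command, which names an action the handler may reject, with an event, which records an immutable past fact:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CreateInvoiceCommand:
    # Imperative: "do this" — the handler may reject it and reply with failure.
    customer_id: str
    amount_pence: int

@dataclass(frozen=True)  # frozen: events are immutable historical facts
class UserRegisteredEvent:
    # Past tense: "this happened" — no response expected.
    user_id: str
    email: str
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

cmd = CreateInvoiceCommand(customer_id="cust-1", amount_pence=4999)
evt = UserRegisteredEvent(user_id="user-1", email="alice@example.com")
```

Attempting to mutate the event raises an error, which enforces the "immutable fact" semantic at the type level.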

Pub/Sub vs Competing Consumers: When to Use Each

Pub/Sub: One event → all consumer groups receive it. Each consumer group processes all events independently. New consumer groups can be added without affecting the producer.

Work Queue: One task → exactly one worker processes it. Adding workers increases throughput — workers compete for tasks.

In Kafka: Both models use the same topic. The difference is consumer group structure:

  • Multiple consumer groups = Pub/Sub (each group gets all messages)
  • Multiple consumers in one group = Competing Consumers (partitions distributed)
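
The two models can be simulated in a few lines of plain Python (no broker needed; the names are illustrative): every group independently sees every partition, while consumers inside one group split the partitions between them:

```python
def assign_partitions(partitions, consumers):
    """Round-robin partition assignment within ONE consumer group:
    each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Pub/Sub: two separate groups — each is assigned ALL partitions independently.
analytics = assign_partitions(partitions, ["analytics-1"])
emailer = assign_partitions(partitions, ["email-1"])

# Competing consumers: two consumers in ONE group share the partitions.
workers = assign_partitions(partitions, ["worker-1", "worker-2"])

print(analytics)  # {'analytics-1': [0, 1, 2, 3]}
print(workers)    # {'worker-1': [0, 2], 'worker-2': [1, 3]}
```

Real Kafka assignors (range, round-robin, sticky) are more sophisticated, but the invariant is the same: within a group, a partition belongs to exactly one consumer at a time.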

Kafka vs RabbitMQ: The Honest Comparison

| Feature | Apache Kafka | RabbitMQ |
|---------|--------------|----------|
| Model | Append-only log (events stored for days/weeks) | Queue (messages deleted after consumption) |
| Throughput | Millions of messages/second | ~50K messages/second |
| Message replay | Yes — any consumer group can replay from any offset | No — once consumed, gone |
| Message routing | Topic + partition key | Exchange + routing key (very flexible) |
| Message ordering | Guaranteed within a partition | Guaranteed within a queue |
| Retention | Configurable (days, weeks, forever) | Until acknowledged |
| Perfect for | Event streaming, audit logs, big data pipelines | Task queues, RPC, complex routing |
| Operational complexity | High (ZooKeeper/KRaft, tuning required) | Lower (management UI, simple ops) |

Rule of thumb: Use Kafka when you need replay, multiple independent consumers, or high volume. Use RabbitMQ when you need flexible routing, request/reply, or simple task queues.


Designing Events: Schema and Versioning

Events are contracts between services. Breaking changes to event schemas break all consumers.

Good event design principles:

json
{
  "eventId": "7a9f0d2e-4b1c-4a8e-9c3d-123456789abc",
  "eventType": "user.registered.v2",
  "timestamp": "2026-04-17T12:00:00Z",
  "version": 2,
  "correlationId": "request-abc-123",
  "source": "identity-service",
  "payload": {
    "userId": "user-550e8400",
    "email": "alice@example.com",
    "registrationMethod": "GOOGLE_OAUTH",
    "country": "GB"
  }
}
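
A small helper (hypothetical, standard library only) can build this envelope so every producer emits the same fields with the same types:

```python
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, version: int, source: str,
               payload: dict, correlation_id: str) -> dict:
    """Wrap a payload in the standard event envelope shown above."""
    return {
        "eventId": str(uuid.uuid4()),      # unique per event; enables dedup
        "eventType": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": version,
        "correlationId": correlation_id,   # ties the event back to a request
        "source": source,
        "payload": payload,
    }

event = make_event("user.registered.v2", 2, "identity-service",
                   {"userId": "user-550e8400", "country": "GB"},
                   correlation_id="request-abc-123")
```

Centralising envelope construction in one place means a schema change (say, adding a `traceId`) touches one function, not every producer.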

Evolving schemas safely (Avro + Schema Registry):

avro
// v1 schema — deployed first
{
  "type": "record",
  "name": "UserRegistered",
  "fields": [
    {"name": "userId", "type": "string"},
    {"name": "email",  "type": "string"}
  ]
}

// v2 schema — added field with default (backward compatible!)
{
  "type": "record",
  "name": "UserRegistered",
  "fields": [
    {"name": "userId", "type": "string"},
    {"name": "email",  "type": "string"},
    {"name": "country", "type": "string", "default": "UNKNOWN"}
    // No default = BREAKING CHANGE — old consumers can't deserialize
  ]
}
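
The mechanics behind backward compatibility can be sketched without Avro itself: the reader resolves an old record against its own (newer) schema, filling missing fields from defaults. This is a simplified stand-in for what a Schema Registry-aware deserializer does, not Avro's actual resolution algorithm:

```python
V2_FIELDS = [
    {"name": "userId"},
    {"name": "email"},
    {"name": "country", "default": "UNKNOWN"},
]

def read_with_schema(record: dict, fields: list) -> dict:
    """Resolve an old record against a newer reader schema:
    a missing field takes the schema default; no default is an error."""
    out = {}
    for f in fields:
        if f["name"] in record:
            out[f["name"]] = record[f["name"]]
        elif "default" in f:
            out[f["name"]] = f["default"]
        else:
            raise ValueError(f"missing field without default: {f['name']}")
    return out

v1_record = {"userId": "user-1", "email": "alice@example.com"}
resolved = read_with_schema(v1_record, V2_FIELDS)
print(resolved["country"])  # UNKNOWN — the default fills the gap
```

This is exactly why adding a field *without* a default is a breaking change: old records hit the `ValueError` branch.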

Idempotency: Handling At-Least-Once Delivery

Kafka's default delivery guarantee is at-least-once — a message may be redelivered if a consumer crashes after processing but before committing its offset. Your handlers must therefore be idempotent (applying the same message twice has the same effect as applying it once):

python
def handle_payment_received(event: dict):
    payment_id = event['paymentId']

    # Idempotency check: have we already processed this payment?
    if processed_events.exists(payment_id):
        logger.info(f"Duplicate payment event ignored: {payment_id}")
        return  # Safe to skip — we already processed this

    # Process the payment (first time only):
    with transaction():
        payment = Payment.from_event(event)
        payment_repository.save(payment)
        billing_service.reconcile(payment)
        # Mark as processed in the SAME transaction, so a crash cannot
        # leave the payment saved but unmarked (or vice versa):
        processed_events.insert(payment_id)
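
Stubbing the stores with in-memory structures (illustrative names, not a real repository API) makes the property testable: replaying the same event leaves the system in exactly the same state:

```python
processed_events: set[str] = set()
balances: dict[str, int] = {}

def handle_payment_received(event: dict) -> None:
    payment_id = event["paymentId"]
    if payment_id in processed_events:   # duplicate delivery, skip
        return
    balances[event["accountId"]] = (
        balances.get(event["accountId"], 0) + event["amount"])
    processed_events.add(payment_id)     # marked in the same "transaction"

event = {"paymentId": "pay-1", "accountId": "acct-1", "amount": 100}
handle_payment_received(event)
handle_payment_received(event)           # redelivered: no double charge
print(balances["acct-1"])  # 100
```

A useful unit test for any handler is precisely this: call it twice with the same event and assert the state is unchanged after the second call.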

Dead-Letter Queues: Graceful Failure Handling

When a consumer fails to process a message (validation error, downstream service down, unexpected data), it must not be lost or block the main queue:

python
# Kafka consumer with DLQ routing:
@kafka_consumer('orders.created')
def handle_order_created(message: Message):
    try:
        order = parse_order(message.value)
        process_order(order)
        message.commit()
    except ValidationError as e:
        # Permanent failure — route to the DLQ immediately:
        dlq_producer.send('orders.created.dlq',
                          value=message.value,
                          headers={'failure_reason': str(e),
                                   'original_offset': str(message.offset)})
        message.commit()  # Acknowledge to avoid infinite redelivery
    except TemporaryError as e:
        # Transient failure — back off, then retry (do NOT commit):
        time.sleep(exponential_backoff(message.retry_count))
        raise  # Let the framework redeliver up to max_retries
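
The `exponential_backoff` helper above is not part of any Kafka client; a typical sketch doubles the delay per attempt, caps it, and adds full jitter so retrying consumers don't stampede a recovering downstream service:

```python
import random

def exponential_backoff(retry_count: int, base: float = 0.5,
                        cap: float = 30.0) -> float:
    """Delay in seconds: base * 2^retries, capped, with full jitter
    (a uniformly random delay between 0 and the capped maximum)."""
    delay = min(cap, base * (2 ** retry_count))
    return random.uniform(0.0, delay)
```

The jitter matters: if every failed consumer slept exactly the same deterministic delay, they would all retry in lockstep and hit the struggling service simultaneously.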

Choreography vs Orchestration

Two patterns for coordinating multi-step workflows in EDA:

Choreography (Decentralized): Each service reacts to events and publishes its own events — no central coordinator:

text
OrderService:     Emits OrderCreated
InventoryService: Listens OrderCreated → Reserves stock → Emits InventoryReserved
PaymentService:   Listens InventoryReserved → Charges card → Emits PaymentProcessed
ShippingService:  Listens PaymentProcessed → Creates shipment

✅ Highly decoupled. ❌ Hard to visualize the full workflow. Hard to handle failures.
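
The chain above can be run with a toy in-process event bus (all names illustrative). Note that no service knows the full workflow, only which events it reacts to, which is both the strength and the debugging weakness of choreography:

```python
from collections import defaultdict

handlers = defaultdict(list)
log = []

def subscribe(event_type):
    """Register a handler for an event type (decorator)."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def emit(event_type, data):
    """Record the event, then deliver it to every subscriber."""
    log.append(event_type)
    for fn in handlers[event_type]:
        fn(data)

@subscribe("OrderCreated")
def reserve_stock(order):          # InventoryService
    emit("InventoryReserved", order)

@subscribe("InventoryReserved")
def charge_card(order):            # PaymentService
    emit("PaymentProcessed", order)

@subscribe("PaymentProcessed")
def create_shipment(order):        # ShippingService
    pass

emit("OrderCreated", {"orderId": "ord-1"})
print(log)  # ['OrderCreated', 'InventoryReserved', 'PaymentProcessed']
```

The `log` list is the only place the end-to-end flow is visible, which mirrors the real-world problem: in production, reconstructing a choreographed workflow requires correlation IDs and distributed tracing.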

Orchestration (Centralized): A dedicated Saga Coordinator sends commands and tracks state:

text
SagaCoordinator:  → Commands InventoryService.reserve(order)
InventoryService: → Replies success/failure
SagaCoordinator:  → Commands PaymentService.charge(order)
PaymentService:   → Replies success/failure
SagaCoordinator:  → Commands ShippingService.createShipment(order)

✅ Visible workflow, easy failure compensation. ❌ The coordinator becomes a new dependency.
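
An orchestrated saga can be sketched as a coordinator that walks a list of (action, compensation) pairs, undoing completed steps in reverse order when one fails. The service calls here are hypothetical stubs; a real coordinator would also persist its position so it survives restarts:

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.
    On a failed action, run compensations for every completed
    step in reverse order, then report failure."""
    done = []
    for action, compensate in steps:
        if action():
            done.append(compensate)
        else:
            for undo in reversed(done):   # compensating transactions
                undo()
            return False
    return True

trace = []

def step(name, succeed=True):
    """Stub service call that records its name in the trace."""
    def call():
        trace.append(name)
        return succeed
    return call

ok = run_saga([
    (step("reserve"), step("release")),
    (step("charge", succeed=False), step("refund")),  # payment fails here
    (step("ship"), step("cancel")),
])
print(ok)     # False
print(trace)  # ['reserve', 'charge', 'release']
```

The trace shows the essential saga behaviour: the failed `charge` step is never compensated (it did not complete), `ship` is never attempted, and the already-completed `reserve` is undone by `release`.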

When to use each:

  • Choreography: < 4 steps, well-understood failure modes
  • Orchestration (Saga): Multi-step transactions requiring compensating actions

Frequently Asked Questions

How do I guarantee exactly-once delivery in Kafka? Kafka's built-in Exactly-Once Semantics (EOS) uses idempotent producers (producer ID + sequence numbers) and transactional APIs (beginTransaction, commitTransaction). Combined with read_committed isolation on consumers, this ensures a message is processed exactly once as long as you use Kafka transactions correctly. However, EOS only applies within the Kafka ecosystem — once you write to an external system (database, REST API), you need application-level idempotency.
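
The standard Kafka client configuration keys involved look like this (a sketch of the relevant settings, not a complete client config):

```properties
# Producer — idempotent writes plus transactions:
enable.idempotence=true
transactional.id=order-processor-1
acks=all

# Consumer — only read events from committed transactions:
isolation.level=read_committed
```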

What is the difference between a topic partition and a consumer group? A partition is Kafka's unit of parallelism on the producer side — messages with the same key (e.g., userId) go to the same partition, guaranteeing per-key ordering. A consumer group is Kafka's unit of parallelism on the consumer side — each partition is assigned to exactly one consumer in the group, enabling parallel consumption without duplicate processing.
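
Key-based partitioning has this general shape (Kafka's default partitioner actually uses murmur2 hashing; CRC32 stands in here only to keep the sketch deterministic):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Same key → same partition → per-key ordering preserved.
    (Kafka's default partitioner uses murmur2, not CRC32.)"""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All of user-42's events land on one partition, so they stay ordered:
assert partition_for("user-42", 6) == partition_for("user-42", 6)
```

This also explains a common gotcha: changing the number of partitions changes the key-to-partition mapping, so per-key ordering guarantees only hold while the partition count is stable.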


Key Takeaway

Event-Driven Architecture enables the loose coupling that makes large distributed systems resilient and scalable. The technical depth lies not in setting up a Kafka cluster, but in the operational patterns: schema versioning without breaking consumers, idempotent handlers for at-least-once delivery, dead-letter queues for unprocessable messages, and transactional outbox for atomic database-plus-event writes. Mastering these patterns is what separates an event-driven system that works in demos from one that runs reliably in production under real-world failure conditions.

Read next: CQRS & Event Sourcing: A Practical Implementation Guide →


Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.