Event-Driven Architecture: Scaling Beyond Request-Response with Kafka, Idempotency & Choreography

Table of Contents
- Events vs Commands vs Queries
- Pub/Sub vs Competing Consumers: When to Use Each
- Kafka vs RabbitMQ: The Honest Comparison
- Designing Events: Schema and Versioning
- Idempotency: Handling At-Least-Once Delivery
- Dead-Letter Queues: Graceful Failure Handling
- Choreography vs Orchestration
- The Transactional Outbox Pattern
- Message Ordering: Partitions and Sequence Numbers
- Observability: Distributed Tracing in EDA
- Frequently Asked Questions
- Key Takeaway
Events vs Commands vs Queries
Understanding the three types of messages prevents architectural confusion:
| Type | Direction | Semantic | Example | Response? |
|---|---|---|---|---|
| Command | One producer → one consumer | "Do this" — an instruction | CreateInvoiceCommand | Yes (success/failure) |
| Event | One producer → many consumers | "This happened" — a fact | UserRegisteredEvent | No — fire and forget |
| Query | One requester → one responder | "Tell me this" | GetUserByIdQuery | Yes (data) |
Why events are different:
- A Command expects something specific to happen — it fails if the consumer rejects it
- An Event records that something already happened — producers don't control or wait for reactions
- Events are past tense (OrderCreated, PaymentReceived) — they are immutable historical facts
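That immutability can be made explicit in code. A minimal sketch in Python, using a hypothetical OrderCreated event with illustrative fields:

```python
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical event: past-tense name, frozen so it stays an immutable fact.
@dataclass(frozen=True)
class OrderCreated:
    order_id: str
    total_cents: int

event = OrderCreated(order_id="ord-42", total_cents=1999)

try:
    event.total_cents = 0  # an event is a historical fact — mutation fails
except FrozenInstanceError:
    pass  # expected: frozen dataclasses reject assignment
</imports>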
Pub/Sub vs Competing Consumers: When to Use Each
Pub/Sub: One event → all consumer groups receive it. Each consumer group processes all events independently. New consumer groups can be added without affecting the producer.
Work Queue: One task → exactly one worker processes it. Adding workers increases throughput — workers compete for tasks.
In Kafka: Both models use the same topic. The difference is consumer group structure:
- Multiple consumer groups = Pub/Sub (each group gets all messages)
- Multiple consumers in one group = Competing Consumers (partitions distributed)
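The group mechanics above can be sketched with a toy assignment function. This is an illustration of the idea, not Kafka's actual rebalance protocol, and all names are made up:

```python
from collections import defaultdict

# Toy model: a topic with 4 partitions, partitions spread round-robin
# across the consumers in each group.
NUM_PARTITIONS = 4

def assign(group_members: list) -> dict:
    """Assign each partition to one consumer in the group."""
    assignment = defaultdict(list)
    for p in range(NUM_PARTITIONS):
        assignment[group_members[p % len(group_members)]].append(p)
    return dict(assignment)

# Pub/Sub: two independent groups — each group is assigned ALL partitions,
# so each group sees every message.
analytics = assign(["analytics-1"])
emailing = assign(["email-1"])

# Competing consumers: one group with two members — partitions are split,
# so each message is processed by exactly one member.
workers = assign(["worker-1", "worker-2"])
print(workers)  # {'worker-1': [0, 2], 'worker-2': [1, 3]}
```

Adding a second member to the `workers` group halves each member's partition load; adding a whole new group leaves existing groups untouched.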
Kafka vs RabbitMQ: The Honest Comparison
| Feature | Apache Kafka | RabbitMQ |
|---|---|---|
| Model | Append-only log (events stored for days/weeks) | Queue (messages deleted after consumption) |
| Throughput | Millions of messages/second | ~50K messages/second |
| Message replay | Yes — any consumer group can replay from any offset | No — once consumed, gone |
| Message routing | Topic + partition key | Exchange + routing key (very flexible) |
| Message ordering | Guaranteed within a partition | Guaranteed within a queue |
| Retention | Configurable (days, weeks, forever) | Until acknowledged |
| Perfect for | Event streaming, audit logs, big data pipelines | Task queues, RPC, complex routing |
| Operational complexity | High (ZooKeeper/KRaft, tuning required) | Lower (management UI, simple ops) |
Rule of thumb: Use Kafka when you need replay, multiple independent consumers, or high volume. Use RabbitMQ when you need flexible routing, request/reply, or simple task queues.
Designing Events: Schema and Versioning
Events are contracts between services. Breaking changes to event schemas break all consumers.
Good event design principles:
```json
{
  "eventId": "7a9f0d2e-4b1c-4a8e-9c3d-123456789abc",
  "eventType": "user.registered.v2",
  "timestamp": "2026-04-17T12:00:00Z",
  "version": 2,
  "correlationId": "request-abc-123",
  "source": "identity-service",
  "payload": {
    "userId": "user-550e8400",
    "email": "alice@example.com",
    "registrationMethod": "GOOGLE_OAUTH",
    "country": "GB"
  }
}
```
Evolving schemas safely (Avro + Schema Registry):
```json
// v1 schema — deployed first
{
  "type": "record",
  "name": "UserRegistered",
  "fields": [
    {"name": "userId", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}

// v2 schema — added field with default (backward compatible!)
{
  "type": "record",
  "name": "UserRegistered",
  "fields": [
    {"name": "userId", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "country", "type": "string", "default": "UNKNOWN"}
    // No default = BREAKING CHANGE — old consumers can't deserialize
  ]
}
```
Idempotency: Handling At-Least-Once Delivery
With standard consumer configuration (offsets committed after processing), Kafka delivers messages at least once — a message may be redelivered if a consumer crashes before committing its offset. Your handlers must therefore be idempotent (applying the same message twice has the same effect as applying it once):
```python
def handle_payment_received(event: dict):
    payment_id = event['paymentId']
    # Idempotency check: have we already processed this payment?
    if processed_events.exists(payment_id):
        logger.info(f"Duplicate payment event ignored: {payment_id}")
        return  # Safe to skip — we already processed this
    # First delivery: save the payment, reconcile, and record the event as
    # processed in ONE transaction, so a crash cannot leave them out of sync:
    with transaction():
        payment = Payment.from_event(event)
        payment_repository.save(payment)
        billing_service.reconcile(payment)
        processed_events.insert(payment_id)  # Prevents future duplicates
```
Dead-Letter Queues: Graceful Failure Handling
When a consumer fails to process a message (validation error, downstream service down, unexpected data), it must not be lost or block the main queue:
```python
# Kafka consumer with DLQ routing:
@kafka_consumer('orders.created')
def handle_order_created(message: Message):
    try:
        order = parse_order(message.value)
        process_order(order)
        message.commit()
    except ValidationError as e:
        # Permanent failure — route to the dead-letter topic immediately:
        dlq_producer.send('orders.created.dlq',
                          value=message.value,
                          headers={'failure_reason': str(e),
                                   'original_offset': str(message.offset)})
        message.commit()  # Acknowledge to avoid infinite retry
    except TemporaryError as e:
        # Transient failure — back off and retry (do NOT commit):
        time.sleep(exponential_backoff(message.retry_count))
        raise  # Let the framework retry up to max_retries
```
Choreography vs Orchestration
Two patterns for coordinating multi-step workflows in EDA:
Choreography (Decentralized): Each service reacts to events and publishes its own events — no central coordinator:
```
OrderService:     emits OrderCreated
InventoryService: listens for OrderCreated      → reserves stock → emits InventoryReserved
PaymentService:   listens for InventoryReserved → charges card   → emits PaymentProcessed
ShippingService:  listens for PaymentProcessed  → creates shipment
```
✅ Highly decoupled. ❌ Hard to visualize the full workflow. Hard to handle failures.
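The choreography flow above can be sketched as a tiny in-memory event bus, where each hypothetical service subscribes to one event and emits the next — note that no component knows the overall workflow:

```python
from collections import defaultdict

subscribers = defaultdict(list)
log = []  # records which events actually flowed

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def emit(event_type, payload):
    log.append(event_type)
    for handler in subscribers[event_type]:
        handler(payload)

# Each "service" only knows its own trigger and its own output event:
subscribe("OrderCreated",      lambda o: emit("InventoryReserved", o))
subscribe("InventoryReserved", lambda o: emit("PaymentProcessed", o))
subscribe("PaymentProcessed",  lambda o: print(f"shipment created for {o['id']}"))

emit("OrderCreated", {"id": "ord-42"})
# log is now ['OrderCreated', 'InventoryReserved', 'PaymentProcessed']
```

This also shows the downside: the end-to-end workflow exists only implicitly, in the sum of the subscriptions.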
Orchestration (Centralized): A dedicated Saga Coordinator sends commands and tracks state:
```
SagaCoordinator:  → commands InventoryService.reserve(order)
InventoryService: → replies success/failure
SagaCoordinator:  → commands PaymentService.charge(order)
PaymentService:   → replies success/failure
SagaCoordinator:  → commands ShippingService.createShipment(order)
```
✅ Visible workflow, easy failure compensation. ❌ The coordinator becomes a new dependency.
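A minimal sketch of the coordinator idea: run each step in order, and on failure run the compensations of the already-completed steps in reverse. All step names are illustrative, not a real saga framework:

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables. Returns True on success."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            # Roll back completed steps in reverse order:
            for comp in reversed(done):
                comp()
            return False
    return True

events = []

def charge_card():
    raise RuntimeError("card declined")  # simulate a payment failure

ok = run_saga([
    (lambda: events.append("stock reserved"), lambda: events.append("stock released")),
    (charge_card,                             lambda: events.append("charge refunded")),
])
print(ok, events)  # False ['stock reserved', 'stock released']
```

The failed step's own compensation never runs — only the compensations of steps that completed — which is exactly the bookkeeping a saga coordinator exists to centralize.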
When to use each:
- Choreography: < 4 steps, well-understood failure modes
- Orchestration (Saga): Multi-step transactions requiring compensating actions
Frequently Asked Questions
How do I guarantee exactly-once delivery in Kafka?
Kafka's built-in Exactly-Once Semantics (EOS) uses idempotent producers (producer ID + sequence numbers) and transactional APIs (beginTransaction, commitTransaction). Combined with read_committed isolation on consumers, this ensures a message is processed exactly once as long as you use Kafka transactions correctly. However, EOS only applies within the Kafka ecosystem — once you write to an external system (database, REST API), you need application-level idempotency.
What is the difference between a topic partition and a consumer group?
A partition is Kafka's unit of ordering and parallelism within a topic — messages with the same key (e.g., userId) go to the same partition, guaranteeing per-key ordering. A consumer group is Kafka's unit of parallelism on the consumer side — each partition is assigned to exactly one consumer in the group, enabling parallel consumption without duplicate processing.
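The per-key routing can be illustrated with a stand-in partitioner: a stable hash of the key modulo the partition count. (Kafka's default partitioner uses murmur2; `zlib.crc32` is used here only to keep the sketch reproducible in pure Python.)

```python
import zlib

NUM_PARTITIONS = 6  # illustrative topic configuration

def partition_for(key: str) -> int:
    """Stable key → partition mapping (stand-in for Kafka's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Every event for the same user lands on the same partition, so whichever
# consumer owns that partition sees the user's events in order:
p1 = partition_for("user-550e8400")
p2 = partition_for("user-550e8400")
assert p1 == p2
```

Note the corollary: changing the partition count changes the key → partition mapping, which is why partition counts should be chosen generously up front.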
Key Takeaway
Event-Driven Architecture enables the loose coupling that makes large distributed systems resilient and scalable. The technical depth lies not in setting up a Kafka cluster, but in the operational patterns: schema versioning without breaking consumers, idempotent handlers for at-least-once delivery, dead-letter queues for unprocessable messages, and transactional outbox for atomic database-plus-event writes. Mastering these patterns is what separates an event-driven system that works in demos from one that runs reliably in production under real-world failure conditions.
Read next: CQRS & Event Sourcing: A Practical Implementation Guide →
Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.
