
Event-Driven Architecture: Scaling Beyond Request-Response with Kafka, Idempotency & Choreography

TopicTrick Team


Events vs Commands vs Queries

Understanding the three types of messages prevents architectural confusion:

| Type | Direction | Semantic | Example | Response? |
|---|---|---|---|---|
| Command | One producer → one consumer | "Do this" — an instruction | CreateInvoiceCommand | Yes (success/failure) |
| Event | One producer → many consumers | "This happened" — a fact | UserRegisteredEvent | No — fire and forget |
| Query | One requester → one responder | "Tell me this" | GetUserByIdQuery | Yes (data) |

Why events are different:

  • A Command expects something specific to happen — it fails if the consumer rejects it
  • An Event records that something already happened — producers don't control or wait for reactions
  • Events are past tense (OrderCreated, PaymentReceived) — they are immutable historical facts
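The "immutable historical fact" property can be enforced in code. A minimal sketch, assuming an illustrative OrderCreated event with hypothetical field names:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Sketch of an immutable, past-tense domain event.
# The field names here are illustrative conventions, not a standard schema.
@dataclass(frozen=True)
class OrderCreated:
    order_id: str
    customer_id: str
    occurred_at: str  # ISO-8601 timestamp of when the fact happened

event = OrderCreated("ord-42", "cust-7", datetime.now(timezone.utc).isoformat())
# frozen=True makes every field read-only: attempting to reassign one
# raises dataclasses.FrozenInstanceError, so the fact cannot be rewritten.
```

Consumers can then trust that an event they receive is exactly what the producer recorded.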

Pub/Sub vs Competing Consumers: When to Use Each


Pub/Sub: One event → all consumer groups receive it. Each consumer group processes all events independently. New consumer groups can be added without affecting the producer.

Work Queue: One task → exactly one worker processes it. Adding workers increases throughput — workers compete for tasks.

In Kafka: Both models use the same topic. The difference is consumer group structure:

  • Multiple consumer groups = Pub/Sub (each group gets all messages)
  • Multiple consumers in one group = Competing Consumers (partitions distributed)
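The two bullets above can be made concrete with a toy simulation (not the Kafka client) of how consumer-group structure selects the model on one topic with four partitions:

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions across the consumers of ONE group,
    mimicking Kafka's rule that each partition is assigned to exactly
    one consumer within a group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Competing consumers: two consumers in ONE group split the partitions,
# so each message is processed by exactly one worker.
work_queue = assign_partitions(partitions, ["worker-a", "worker-b"])

# Pub/Sub: two separate groups EACH receive the full partition set,
# so each group independently sees every message.
pub_sub = {
    group: assign_partitions(partitions, [f"{group}-consumer"])
    for group in ["billing", "analytics"]
}
```

Group names ("billing", "analytics") are illustrative; the point is that the topic never changes — only the group layout does.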

Kafka vs RabbitMQ: The Honest Comparison

| Feature | Apache Kafka | RabbitMQ |
|---|---|---|
| Model | Append-only log (events stored for days/weeks) | Queue (messages deleted after consumption) |
| Throughput | Millions of messages/second | ~50K messages/second |
| Message replay | Yes — any consumer group can replay from any offset | No — once consumed, gone |
| Message routing | Topic + partition key | Exchange + routing key (very flexible) |
| Message ordering | Guaranteed within a partition | Guaranteed within a queue |
| Retention | Configurable (days, weeks, forever) | Until acknowledged |
| Perfect for | Event streaming, audit logs, big data pipelines | Task queues, RPC, complex routing |
| Operational complexity | High (ZooKeeper/KRaft, tuning required) | Lower (management UI, simple ops) |

Rule of thumb: Use Kafka when you need replay, multiple independent consumers, or high volume. Use RabbitMQ when you need flexible routing, request/reply, or simple task queues.


Designing Events: Schema and Versioning

Events are contracts between services. Breaking changes to event schemas break all consumers.

Good event design principles:

  • Name events in the past tense (OrderCreated, not CreateOrder)
  • Include a unique event ID so consumers can deduplicate redeliveries
  • Carry an explicit schema version so consumers can evolve safely
  • Timestamp when the fact occurred, not when the message was published
  • Keep the payload minimal: the facts of what happened, not derived state
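A hedged sketch of an event envelope following these principles, written as a Python dict for readability. The field names (eventId, schemaVersion, occurredAt) are illustrative conventions, not a mandated standard:

```python
# Example event envelope. Every field name here is an assumed convention.
order_created = {
    "eventId": "7f9c2d6e-0b1a-4c3d-9e8f-1a2b3c4d5e6f",  # unique, for deduplication
    "eventType": "OrderCreated",                         # past tense: a fact
    "schemaVersion": 1,                                  # consumers can branch on this
    "occurredAt": "2024-05-01T12:00:00Z",                # when the fact happened
    "payload": {                                         # minimal facts only
        "orderId": "ord-42",
        "customerId": "cust-7",
        "totalCents": 1999,
        "currency": "EUR",
    },
}
```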

Evolving schemas safely (Avro + Schema Registry): under backward-compatible evolution, new fields are added only with defaults, and existing fields are never removed or renamed. Old consumers ignore fields they don't recognize, and new consumers fall back to the default when reading records written with the old schema.
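A sketch of that rule, with two versions of an Avro record schema written as Python dicts (the schema name and fields are illustrative). Version 2 adds a field with a default, so v2 consumers can still read v1 records:

```python
user_registered_v1 = {
    "type": "record",
    "name": "UserRegistered",
    "fields": [
        {"name": "userId", "type": "string"},
        {"name": "email", "type": "string"},
    ],
}

user_registered_v2 = {
    "type": "record",
    "name": "UserRegistered",
    "fields": [
        {"name": "userId", "type": "string"},
        {"name": "email", "type": "string"},
        # New optional field: the default is what makes old records readable.
        {"name": "referralCode", "type": ["null", "string"], "default": None},
    ],
}

def is_backward_compatible(old, new):
    """Simplified check: every field the new schema adds must declare a default."""
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f for f in new["fields"] if f["name"] not in old_names)
```

In practice a Schema Registry enforces this check on registration; the function above only illustrates the core rule.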

Idempotency: Handling At-Least-Once Delivery

Kafka guarantees at-least-once delivery — a message may be delivered multiple times if a consumer crashes before acknowledging. Your handlers must be idempotent (applying the same message twice has the same effect as once):

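A minimal idempotent-handler sketch, using an in-memory set of processed event IDs as a stand-in for a dedup table. In production the check and the state change must be atomic (e.g. a unique constraint on event ID in the same database transaction); the class and method names are illustrative:

```python
class AccountLedger:
    def __init__(self):
        self.balance = 0
        self._processed: set[str] = set()  # stand-in for a persistent dedup table

    def handle_payment_received(self, event_id: str, amount: int) -> bool:
        """Apply the event at most once; redeliveries become no-ops."""
        if event_id in self._processed:
            return False  # duplicate delivery: already applied
        self.balance += amount
        self._processed.add(event_id)
        return True

ledger = AccountLedger()
ledger.handle_payment_received("evt-1", 100)  # applied
ledger.handle_payment_received("evt-1", 100)  # redelivered: ignored, balance unchanged
```

Applying the same event twice leaves the balance at 100 — exactly the "same effect as once" property the text requires.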

Dead-Letter Queues: Graceful Failure Handling

When a consumer fails to process a message (validation error, downstream service down, unexpected data), it must not be lost or block the main queue:

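A sketch of a retry-then-dead-letter policy: after a fixed number of failed attempts the message moves to a dead-letter queue for inspection instead of blocking the main queue. Queues are plain lists here; in Kafka they would be topics (e.g. a hypothetical orders and orders.DLQ):

```python
MAX_RETRIES = 3  # illustrative policy, tune per workload

def consume(message, handler, dead_letters):
    """Try the handler up to MAX_RETRIES times; on persistent failure,
    park the message (with its error context) in the dead-letter queue."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            handler(message)
            return "processed"
        except Exception as exc:
            last_error = exc  # keep the latest failure for diagnostics
    # Preserve the failure context so operators can diagnose and replay later.
    dead_letters.append({
        "message": message,
        "error": str(last_error),
        "attempts": MAX_RETRIES,
    })
    return "dead-lettered"
```

Recording the error alongside the message is what makes later manual replay practical.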

Choreography vs Orchestration

Two patterns for coordinating multi-step workflows in EDA:

Choreography (Decentralized): Each service reacts to events and publishes its own events — no central coordinator:

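The pattern can be sketched with a tiny in-memory event bus: each service subscribes only to the events it cares about and publishes its own, with no coordinator in sight. Service and event names are illustrative:

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []  # every published event, in order

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
# Each "service" knows only its own trigger and its own output event:
bus.subscribe("OrderCreated", lambda e: bus.publish("PaymentReceived", e))   # payment service
bus.subscribe("PaymentReceived", lambda e: bus.publish("OrderShipped", e))   # shipping service

bus.publish("OrderCreated", {"orderId": "ord-42"})
```

Note that the full workflow (OrderCreated → PaymentReceived → OrderShipped) exists only implicitly, spread across the subscriptions — which is exactly the visibility problem noted below.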

✅ Highly decoupled. ❌ Hard to visualize the full workflow. Hard to handle failures.

Orchestration (Centralized): A dedicated Saga Coordinator sends commands and tracks state:

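A minimal coordinator sketch under assumed step names: it runs each step in order, records state, and on failure replays compensations for the already-completed steps in reverse:

```python
class SagaCoordinator:
    def __init__(self, steps):
        # steps: list of (name, action, compensation) tuples; names are illustrative
        self.steps = steps
        self.history = []  # the coordinator's explicit record of workflow state

    def run(self, ctx):
        completed = []
        for name, action, compensate in self.steps:
            try:
                action(ctx)
                completed.append((name, compensate))
                self.history.append(f"{name}:ok")
            except Exception:
                self.history.append(f"{name}:failed")
                # Compensate everything that already succeeded, newest first.
                for done_name, comp in reversed(completed):
                    comp(ctx)
                    self.history.append(f"{done_name}:compensated")
                return "rolled-back"
        return "completed"
```

Because the coordinator holds the step list and the history, the whole workflow is visible and auditable in one place — the flip side being that every service now depends on it.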

✅ Visible workflow, easy failure compensation. ❌ The coordinator becomes a new dependency.

When to use each:

  • Choreography: < 4 steps, well-understood failure modes
  • Orchestration (Saga): Multi-step transactions requiring compensating actions

Frequently Asked Questions

How do I guarantee exactly-once delivery in Kafka? Kafka's built-in Exactly-Once Semantics (EOS) uses idempotent producers (producer ID + sequence numbers) and transactional APIs (beginTransaction, commitTransaction). Combined with read_committed isolation on consumers, this ensures a message is processed exactly once as long as you use Kafka transactions correctly. However, EOS only applies within the Kafka ecosystem — once you write to an external system (database, REST API), you need application-level idempotency.

What is the difference between a topic partition and a consumer group? A partition is Kafka's unit of parallelism on the producer side — messages with the same key (e.g., userId) go to the same partition, guaranteeing per-key ordering. A consumer group is Kafka's unit of parallelism on the consumer side — each partition is assigned to exactly one consumer in the group, enabling parallel consumption without duplicate processing.
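The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner hashes the key bytes with murmur2; crc32 is used here only as a stand-in to show the property that matters — the mapping is deterministic, so one key always lands on one partition:

```python
import zlib

NUM_PARTITIONS = 4  # illustrative topic size

def partition_for(key: str) -> int:
    # Stand-in for Kafka's default partitioner (which uses murmur2, not crc32):
    # a deterministic hash of the key, modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Every message keyed by the same userId maps to the same partition,
# which is what gives per-key ordering.
```

Note the corollary: changing NUM_PARTITIONS changes the mapping, which is why resizing a keyed topic breaks per-key ordering guarantees for in-flight data.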


Key Takeaway

Event-Driven Architecture enables the loose coupling that makes large distributed systems resilient and scalable. The technical depth lies not in setting up a Kafka cluster, but in the operational patterns: schema versioning without breaking consumers, idempotent handlers for at-least-once delivery, dead-letter queues for unprocessable messages, and transactional outbox for atomic database-plus-event writes. Mastering these patterns is what separates an event-driven system that works in demos from one that runs reliably in production under real-world failure conditions.

Read next: CQRS & Event Sourcing: A Practical Implementation Guide →


Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.