Every architect I know has a slight fondness for event-driven systems, and every architect I know has at least one production scar from one. They are powerful. They are also a particular kind of expensive, and the expense is often invisible until it's too late to back out.
When events are the right answer
The two genuinely strong cases:
- Fan-out. One thing happens; many things need to know about it; the producers and consumers evolve independently. A user signs up, and six downstream systems want to react. A request-response model here is misery; a published event is clean.
- Temporal decoupling. The producer does not want to wait for the consumer, and can't bound the consumer's latency. A payment completes; an email must eventually be sent; the payment shouldn't fail if the email service is slow.
If your case is neither of those, you probably want RPCs, not events. RPCs are simpler to reason about, easier to debug, and produce more legible stack traces.
The three patterns I trust
1. Transactional outbox
The single most important pattern in event-driven systems. When a service mutates its database and also publishes an event, you must not treat these as two independent operations. If you do, you will eventually lose events, or send events for state changes that never happened.
The outbox pattern: in the same transaction as the database write, insert a row into an "outbox" table. A separate process polls the outbox and publishes to the message broker. The publish can fail and retry freely because the source of truth is the outbox row. When the broker confirms the publish, mark the outbox row as sent.
This is boring, correct, and the right default.
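A minimal sketch of the pattern, using Python's stdlib sqlite3 as a stand-in for the service database. The table names, the `user.signed_up` event type, and the shape of the `publish` callback are all illustrative, not a prescription:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id TEXT PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT NOT NULL,
                         payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0);
""")

def sign_up(email):
    """Business write and outbox insert commit in the same transaction."""
    user_id = str(uuid.uuid4())
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute("INSERT INTO users (id, email) VALUES (?, ?)",
                     (user_id, email))
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "user.signed_up",
             json.dumps({"user_id": user_id})),
        )
    return user_id

def relay(publish):
    """Poll unsent rows, publish each, then mark it sent. Safe to re-run:
    a crash after publish but before the update produces a duplicate,
    which idempotent consumers absorb."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE sent = 0"
    ).fetchall()
    for outbox_id, event_type, payload in rows:
        publish(event_type, payload)  # broker client call goes here
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?",
                         (outbox_id,))
    return len(rows)
```

In practice `relay` runs as a loop (or a change-data-capture process tails the table), but the invariant is the same: the outbox row is the source of truth, and the broker only ever sees events whose state change actually committed.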
2. Event envelope with explicit versioning
Every event should have an envelope containing: a schema version, an event type, a unique ID, a timestamp, and the producer's identity. The payload is inside. Consumers must look at the version and type before they deserialise the payload.
Why: events outlive the services that produced them. You will want to change the schema. You will have consumers that haven't been updated. Without the envelope, schema evolution is a weekend of misery. With it, it's a Tuesday afternoon.
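A sketch of what that looks like on the wire. The field names, the two-version history, and the upcasting of a hypothetical v1 payload (defaulting a "plan" field that v2 added) are illustrative assumptions:

```python
import json
import uuid
from datetime import datetime, timezone

def wrap(event_type, schema_version, producer, payload):
    """Build the envelope; the payload travels inside it."""
    return json.dumps({
        "id": str(uuid.uuid4()),
        "type": event_type,
        "version": schema_version,
        "produced_at": datetime.now(timezone.utc).isoformat(),
        "producer": producer,
        "payload": payload,
    })

def handle(raw):
    """Check type and version before touching the payload."""
    envelope = json.loads(raw)
    if envelope["type"] != "user.signed_up":
        return None  # not an event this consumer cares about
    if envelope["version"] > 2:
        raise ValueError(f"unknown schema version {envelope['version']}")
    payload = envelope["payload"]
    if envelope["version"] == 1:
        # v1 predates the "plan" field; upcast by defaulting it
        payload.setdefault("plan", "free")
    return payload
```

The upcasting branch is the whole point: an old producer keeps emitting v1, a new consumer keeps accepting it, and the migration happens at read time instead of in a coordinated big-bang deploy.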
3. Dead-letter queue with a replay tool
Events will fail to process. Some because the consumer had a bug. Some because a downstream dependency was down. Some because the event itself was malformed. The question is not whether you need a DLQ; the question is whether you have the tooling to do something useful with it.
A DLQ you can't replay from is a graveyard. A DLQ with a simple replay tool — select a set of messages, fix the bug, push them back into the topic — turns failed events from a crisis into an exercise.
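The replay tool really can be this simple. A sketch with in-memory lists standing in for the DLQ and the topic; real brokers have their own redrive APIs, but the shape — select, republish, remove from the DLQ — is the same:

```python
def replay(dlq, publish, select):
    """Push selected dead-lettered messages back into the main topic.

    `select` picks which failures to retry (by error class, time window,
    whatever the incident calls for). Replayed messages leave the DLQ;
    everything else stays put for later inspection.
    """
    kept, replayed = [], 0
    for msg in dlq:
        if select(msg):
            publish(msg["body"])
            replayed += 1
        else:
            kept.append(msg)
    dlq[:] = kept  # mutate in place: the DLQ now holds only unreplayed messages
    return replayed
```

The selection predicate is what turns this from a blunt instrument into a tool: after fixing a consumer bug, you replay exactly the messages that failed with that bug's signature and leave the genuinely malformed ones where they are.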
The mistakes I see repeatedly
- Using events for request-response. "Send an event to service B, wait for the reply event". You've reinvented RPC badly.
- Chaining events. A produces B, B produces C, C produces D, D produces E. Debugging any production issue in such a chain is a morning of your life you won't get back. Keep the fan-out wide and the chain short.
- Relying on ordering guarantees you don't have. Most brokers guarantee ordering per partition, not globally. Consumers must handle out-of-order arrival.
- Assuming exactly-once delivery. You have at-least-once. Plan for duplicates. Make consumers idempotent.
- Treating the schema registry as optional. It is not. The first time you introduce a breaking change by accident you will wish you had it.
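For the ordering point specifically, the standard defence is a per-entity version (or sequence number) on the event, so a late arrival can never overwrite newer state. A minimal sketch, with dicts standing in for the consumer's store:

```python
applied = {}  # entity_id -> highest event version applied so far
state = {}    # entity_id -> current state

def apply_update(entity_id, version, new_state):
    """Drop any event at or below the version already applied, so
    out-of-order or duplicate deliveries cannot clobber newer state."""
    if version <= applied.get(entity_id, 0):
        return False  # stale or duplicate; ignore
    applied[entity_id] = version
    state[entity_id] = new_state
    return True
```

This only works if the producer stamps a monotonically increasing version per entity — which it can, because a single entity's changes originate in one place even when delivery order doesn't survive the trip through the broker.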
Idempotency, which deserves its own section
Every consumer must be idempotent. This is non-negotiable in an event-driven system. The standard approach is a dedupe table: before processing, the consumer records the event ID in a "processed" table as part of the same transaction as the work it does. If it sees an event ID it has already recorded, it skips.
The dedupe table needs a retention policy, because it only grows. For most volumes a week of history is plenty; a matching event ID older than that is almost certainly a producer bug rather than a genuine duplicate.
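A sketch of the dedupe-table approach, again with stdlib sqlite3 standing in for the consumer's database; the table names and the email-sending work are illustrative. The primary key on the event ID is what makes the skip atomic with the work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails_sent (user_id TEXT NOT NULL);
    CREATE TABLE processed_events (
        event_id TEXT PRIMARY KEY,
        processed_at TEXT NOT NULL DEFAULT (datetime('now'))
    );
""")

def handle_event(event_id, user_id):
    """Record the event ID and do the work in one transaction.
    A redelivery hits the primary key, the transaction rolls back,
    and the work is not repeated."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,))
            conn.execute(
                "INSERT INTO emails_sent (user_id) VALUES (?)",
                (user_id,))
        return True   # processed
    except sqlite3.IntegrityError:
        return False  # duplicate; already handled

def purge_dedupe_history():
    """The retention policy: anything older than a week goes."""
    with conn:
        conn.execute("DELETE FROM processed_events "
                     "WHERE processed_at < datetime('now', '-7 days')")
```

Note the order doesn't actually matter inside the transaction — what matters is that the dedupe insert and the work commit or roll back together, which is the same discipline the outbox pattern applies on the producing side.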
Kafka is not the only answer
Kafka is the industry default, and for good reason — the durability, throughput and ecosystem are all excellent. It is also genuinely expensive to run well. If you are not sure you need Kafka, you almost certainly don't.
Alternatives worth considering: a managed broker (AWS SQS + SNS, GCP Pub/Sub), NATS for low-latency in-cluster messaging, and — my slightly controversial recommendation — a plain Postgres table used as a queue if your volumes are under a few thousand messages per second. Postgres-as-a-queue is dismissed as a toy and is, in fact, the correct answer for more cases than the internet gives it credit for.
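The Postgres-as-a-queue shape is small enough to sketch. In Postgres the claim would be `SELECT ... FOR UPDATE SKIP LOCKED`, which lets concurrent workers take different rows without blocking each other; the sketch below uses stdlib sqlite3, which has no SKIP LOCKED, so a single write transaction stands in for the row lock. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE queue (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        done INTEGER NOT NULL DEFAULT 0
    )
""")

def enqueue(payload):
    with conn:
        conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))

def claim_and_process(work):
    """Claim the oldest pending row and process it in one transaction.
    In Postgres the SELECT becomes:
        SELECT id, payload FROM queue WHERE done = false
        ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED
    so parallel workers each grab a different row without waiting."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM queue WHERE done = 0 "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None  # queue drained
        work(row[1])
        conn.execute("UPDATE queue SET done = 1 WHERE id = ?", (row[0],))
        return row[1]
```

Because the claim and the completion mark share a transaction, a worker that crashes mid-job simply releases the row for the next poll — crash recovery for free, from the database you already operate.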
The question is not "events or no events". The question is whether the couplings that events remove are couplings you actually needed to remove.
— Nivaan