The three-pillar model of observability — logs, metrics, traces — is a useful starting point and a misleading destination. Teams that collect all three and stop there end up with a huge bill, three disconnected tools, and an on-call engineer who still has to guess when the pager goes off at three in the morning. Observability is not what you collect. It is what questions you can answer quickly.
The diagnostic question
The test of an observability stack is a single question, asked at 3 a.m. on a Wednesday: why did that user see an error?
A mature stack can answer it in under five minutes, from the pager to a diagnosis, for at least 80% of real incidents. An immature stack has you bouncing between tools, correlating by timestamp, and making educated guesses. Every observability decision you make should be evaluated against that question.
What the three pillars do well, and what they don't
Briefly, because the industry has absorbed this:
- Metrics tell you that something is wrong, at scale. They don't tell you what.
- Logs tell you what happened on a specific request. They don't tell you the shape of the problem.
- Traces tell you where time was spent on a specific request. They don't tell you that anything is wrong.
The three together are better than any one alone. But they are still three separate views of the same system, and in an incident, translating between them by hand is where most of the on-call time goes.
What a mature stack adds
1. Structured events with high cardinality
The underrated fourth pillar. An "event" in this sense is a structured record for every unit of work the service performs — every HTTP request, every background job, every external call — with as many attributes as you can capture: user ID, tenant, route, latency, outcome, error code, deploy version, feature flags active, region, instance. Fifty to a hundred attributes per event is normal.
You should be able to slice this dataset by any of those attributes, in real time. The reason is that real incidents almost never announce themselves cleanly — they show up as "requests are slow for a specific tenant on a specific route since the last deploy", and you need to be able to find that intersection without running nine separate dashboards.
Tools in this space: Honeycomb pioneered it, Datadog and New Relic have caught up, and you can approximate it on a shoestring with ClickHouse or a well-indexed Elasticsearch cluster.
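To make "slice by any attribute" concrete, here is a minimal sketch in Python. The events, attribute names (`tenant`, `route`, `deploy_id`, `duration_ms`), and values are invented for illustration, not any vendor's schema — the point is that the incident's intersection is one filter, not nine dashboards:

```python
# Sketch: finding "slow for one tenant on one route since the last deploy"
# by filtering a wide-event dataset. All names and values are illustrative.
from statistics import median

events = [
    {"tenant": "acme",   "route": "/checkout", "deploy_id": "v42", "duration_ms": 1800, "status": 200},
    {"tenant": "acme",   "route": "/checkout", "deploy_id": "v42", "duration_ms": 2100, "status": 200},
    {"tenant": "acme",   "route": "/checkout", "deploy_id": "v41", "duration_ms": 90,   "status": 200},
    {"tenant": "globex", "route": "/checkout", "deploy_id": "v42", "duration_ms": 85,   "status": 200},
]

# The incident's shape, expressed as a single query over attributes:
suspects = [
    e for e in events
    if e["tenant"] == "acme" and e["route"] == "/checkout" and e["deploy_id"] == "v42"
]
print(len(suspects), median(e["duration_ms"] for e in suspects))
```

In a real store (ClickHouse, Honeycomb, a well-indexed Elasticsearch) the same filter runs over billions of events; the query shape is what matters.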
2. Correlation by trace ID across every tool
Every log line must contain a trace ID. Every metric must be derivable from or joinable to a trace. Your dashboards must link from a p99 spike straight into the traces responsible for it, and from a trace into the logs it produced.
This is a simple rule that pays compound interest. On-call engineers stop saying "let me check the logs in the other tool" and start saying "let me click through to the related traces".
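One way to enforce the rule mechanically, sketched with Python's standard `logging` module: a filter stamps every record with the current trace ID. In a real service the ID would come from your tracing SDK's context (e.g. OpenTelemetry); here it is set by hand for illustration:

```python
# Sketch: every log line carries a trace ID, injected via a contextvar.
import logging
from contextvars import ContextVar

current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set once per request (normally by tracing middleware), then every log
# line from that request is joinable back to its trace:
current_trace_id.set("4bf92f3577b34da6")
logger.info("charge failed")  # -> INFO trace_id=4bf92f3577b34da6 charge failed
```

Once the ID is in every line, the "click through to related traces" link is a query, not an archaeology project.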
3. Error budgets, not alerts on every metric
A classic mistake: alert on everything, send every alert to PagerDuty. You end up with alert fatigue, and the real incidents get lost in the noise.
The SRE playbook has it right: define SLOs, define error budgets, alert on budget burn rates. A service with a 99.9% availability SLO has about forty-three minutes of budget per month. If you are burning it faster than expected, page. If you're not, don't.
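The arithmetic behind that forty-three minutes, plus a burn-rate check, sketched below. The 14.4x fast-burn threshold is a commonly cited value from the SRE literature (it exhausts a 30-day budget in about two days); treat it and the function shape as illustrative, and tune thresholds and windows to your own service:

```python
# Sketch: error-budget arithmetic for a 99.9% availability SLO.
SLO = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60               # 43,200 minutes
budget_minutes = (1 - SLO) * MINUTES_PER_MONTH  # ~43.2 — "about forty-three minutes"

def should_page(error_rate: float, burn_threshold: float = 14.4) -> bool:
    """Page only when the budget burns much faster than even spend.

    burn_rate = observed error rate / allowed error rate. 14.4x is a
    common fast-burn choice; adjust to your own alerting windows.
    """
    burn_rate = error_rate / (1 - SLO)
    return burn_rate >= burn_threshold

print(should_page(0.02))    # 2% errors = 20x burn -> page
print(should_page(0.0005))  # half the allowed rate -> don't page
```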
4. A canonical log line per request
Stripe popularised this and it is the single highest-leverage logging change you can make. One structured log line per request, at the end, containing every attribute you'd want. It dramatically reduces log volume, it makes every request queryable, and it is the basis for the structured-events view above.
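A minimal sketch of the pattern, assuming a generic request/response handler; the wrapper, attribute names, and `deploy_id` value are all illustrative, not Stripe's schema:

```python
# Sketch: one structured JSON line per request, emitted at the end,
# success or failure. Attribute names are illustrative.
import json
import time
import uuid

def handle_with_canonical_line(handler, request):
    event = {
        "trace_id": uuid.uuid4().hex,
        "route": request["route"],
        "tenant": request.get("tenant"),
        "deploy_id": "v42",  # stamped in at build time in real life
    }
    start = time.monotonic()
    try:
        response = handler(request)
        event["status"] = response["status"]
        return response
    except Exception as exc:
        event["status"] = 500
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 1)
        print(json.dumps(event))  # the one line; ship it to your event store

response = handle_with_canonical_line(
    lambda req: {"status": 200}, {"route": "/checkout", "tenant": "acme"}
)
```

The `finally` block is the important part: the line is emitted whether the request succeeded or blew up, so the dataset has no survivorship bias.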
Things I now insist on
- Every service emits a `service_version` or `deploy_id` attribute. "Did this incident start when we deployed?" becomes a one-query answer.
- Every request is tagged with the user or tenant ID where privacy allows. "Is this only happening for one customer?" becomes a one-query answer.
- Every feature flag evaluation is recorded in the event for that request. Flag-related incidents are otherwise extraordinarily hard to diagnose.
- Every on-call has a dashboard for their service, curated by the team, that shows the three or four numbers that actually matter. Not an autogenerated firehose.
The cost reality
Observability costs a startling amount at scale. I've seen it crack 20% of a cloud bill. The ways to keep it manageable: sample aggressively at the tail, keep high cardinality but short retention for the detailed view, and maintain a separate longer-retention store for pre-aggregated metrics.
Do not skimp on what you capture during the first ninety seconds of an incident. Do skimp on the ninetieth day of retention for a routine request. The cost curve is almost entirely about retention and cardinality; the value curve is almost entirely about the recent past.
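"Sample aggressively at the tail" can be as simple as a keep/drop decision at the collector. A sketch, with invented thresholds — the 1000 ms cutoff and 1% base rate are illustrative defaults, not recommendations:

```python
# Sketch: keep everything interesting, sample the routine.
import random

def keep_event(event: dict, slow_ms: float = 1000.0, base_rate: float = 0.01) -> bool:
    if event.get("status", 200) >= 500:
        return True                     # always keep errors
    if event.get("duration_ms", 0) >= slow_ms:
        return True                     # always keep the slow tail
    return random.random() < base_rate  # sample ordinary traffic

print(keep_event({"status": 502, "duration_ms": 30}))    # errors survive
print(keep_event({"status": 200, "duration_ms": 4200}))  # the slow tail survives
```

Paired with short retention on the detailed store, this keeps the first ninety seconds of an incident fully visible while the routine ninetieth-day request quietly ages out.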
— Nivaan