CAP Theorem in the Real World

The CAP theorem gets taught in approximately this form: in the presence of a network partition, a distributed system can be either consistent or available, but not both. The teaching is technically correct, practically misleading, and has been responsible for a great deal of architectural nonsense over the last fifteen years.

Here is what I wish someone had told me when I first met CAP as a graduate trainee at Google in 2013.

Partitions are not the exotic case

CAP frames the partition as the interesting event. In a real production system, "partition" covers a very wide set of things that happen every single day. A top-of-rack switch flapping. A region with a degraded cross-AZ link. A service briefly CPU-starved to the point where its heartbeats fail. A misconfigured firewall rule. A noisy neighbour on an underlying hypervisor.

These are all, from the algorithm's point of view, partitions. Not weather events. Everyday occurrences. A production system must make a sensible CAP choice not because a partition might one day happen but because one is happening roughly now, in some component, somewhere in your stack.
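
To make that concrete, here is a minimal sketch of a timeout-based failure detector. The class `HeartbeatDetector`, its `timeout_s` parameter, and the node names are all hypothetical, invented for illustration. The point is only that a cut link and a CPU-starved process look identical from the outside: both are just silence.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Timeout-based failure detector (hypothetical, for illustration only)."""
    timeout_s: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspected(self, now: float) -> set[str]:
        # A node is suspected once its last heartbeat is older than the timeout.
        # The detector has no way to know *why* the heartbeat is missing.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout_s}


detector = HeartbeatDetector(timeout_s=2.0)
for node in ("a", "b", "c"):
    detector.record_heartbeat(node, now=0.0)

# After t=0, "b" loses its network link and "c" is merely CPU-starved.
# Only "a" keeps reporting.
detector.record_heartbeat("a", now=1.0)
detector.record_heartbeat("a", now=3.0)

suspects = detector.suspected(now=3.0)
print(sorted(suspects))  # "b" and "c" are both treated as partitioned
```

Whatever the real cause, the rest of the system has to act on the suspicion, which is why the CAP choice is exercised far more often than the word "partition" suggests.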

CP and AP are not opposed moralities

Engineering culture has a tendency to turn CAP into a tribal identity. "We're a CP shop." "We're an AP shop." This is silly. Every real system makes different CAP choices for different workloads.

A bank's core ledger is CP and has to be. A bank's marketing preferences store is AP and should be. A social network's identity system is CP. A social network's feed ranking is AP. A sensible architecture contains both and is explicit about which is which.

The mistake I have watched teams make is picking a single global CAP posture and forcing every service to inherit it. You end up with either a hugely over-engineered CP solution for work that doesn't need it, or an AP solution holding money, which is where the interesting incidents happen.

PACELC is the framing that actually helps

CAP tells you what happens during a partition. It says nothing about what happens the rest of the time, which is, bluntly, almost all of the time.

Daniel Abadi's PACELC framing is more useful: if a Partition happens, you choose Availability or Consistency; Else, you choose Latency or Consistency. The second half is the bit that matters for day-to-day design, because your system's steady-state behaviour is shaped by it. A database that is "CP" under CAP can be either "PC/EL" (prioritise latency when healthy) or "PC/EC" (prioritise consistency always) — and those are radically different systems to operate.
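
One way to see why the second half matters is to write the two choices down as separate fields. The sketch below is only an illustration: the names `OnPartition`, `Otherwise`, and `Pacelc` are my own, not part of any library. It shows two stores that collapse to the same CAP label while carrying different PACELC labels.

```python
from dataclasses import dataclass
from enum import Enum


class OnPartition(Enum):
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class Otherwise(Enum):
    LATENCY = "L"
    CONSISTENCY = "C"


@dataclass(frozen=True)
class Pacelc:
    on_partition: OnPartition   # the "PA" or "PC" half
    otherwise: Otherwise        # the "EL" or "EC" half

    def label(self) -> str:
        return f"P{self.on_partition.value}/E{self.otherwise.value}"

    def cap_label(self) -> str:
        # CAP only sees the partition half of the decision.
        return f"{self.on_partition.value}P"


store_x = Pacelc(OnPartition.CONSISTENCY, Otherwise.LATENCY)
store_y = Pacelc(OnPartition.CONSISTENCY, Otherwise.CONSISTENCY)

print(store_x.cap_label(), store_y.cap_label())  # same CAP label
print(store_x.label(), store_y.label())          # different PACELC labels
```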

Every time I sit down to choose a datastore I ask the PACELC question, not the CAP question. The CAP answer is forced on you. The PACELC answer is where the actual decision lives.

"Eventual consistency" is the most abused phrase in the field

The phrase has come to mean "it'll probably be fine, don't worry about it". This is not what it means. Eventual consistency is a formal guarantee that, given no new writes and enough time, all replicas will converge to the same value. It guarantees nothing about how long "enough time" is, or what you see in the interim.
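
A toy simulation makes the guarantee and its gaps visible. This sketch assumes a last-writer-wins register with a full gossip round as the convergence mechanism; that is one common approach chosen for illustration, not the only one, and `Replica` and `anti_entropy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Last-writer-wins register (an assumed convergence strategy)."""
    value: str = ""
    stamp: tuple[int, str] = (0, "")  # (logical time, writer id); writer id breaks ties

    def write(self, value: str, time: int, writer: str) -> None:
        self.merge(value, (time, writer))

    def merge(self, value: str, stamp: tuple[int, str]) -> None:
        # Keep whichever write carries the larger stamp.
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp


def anti_entropy(replicas: list[Replica]) -> None:
    # One full gossip round: every replica merges every replica's state.
    states = [(r.value, r.stamp) for r in replicas]
    for r in replicas:
        for value, stamp in states:
            r.merge(value, stamp)


r1, r2, r3 = Replica(), Replica(), Replica()
r1.write("blue", time=1, writer="r1")
r2.write("green", time=2, writer="r2")

interim = [r.value for r in (r1, r2, r3)]    # before gossip: replicas disagree
anti_entropy([r1, r2, r3])
converged = [r.value for r in (r1, r2, r3)]  # after gossip: one value everywhere
```

Nothing in the model says when `anti_entropy` runs. Between the writes and that call, three readers can see three different answers, and that window is exactly the part the guarantee leaves open.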

In practice, an eventually-consistent system that normally converges in 10 ms can, under a partition, take minutes or longer. Your application had better be prepared for that. The correct discipline is to bound the staleness — to say, explicitly, "reads from this service may be up to N seconds stale, and callers must handle that" — and to make the bound observable.
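
A staleness bound can be enforced at read time rather than left in a wiki page. The following is a minimal sketch under stated assumptions: the replica is assumed to expose `caught_up_at`, the last moment it was known to be fully caught up with its leader, and the names `ReplicaRead`, `read_with_bound`, and `TooStale` are hypothetical.

```python
from dataclasses import dataclass


class TooStale(Exception):
    """Raised when a replica cannot meet the caller's staleness bound."""


@dataclass
class ReplicaRead:
    value: str
    caught_up_at: float  # assumption: last time the replica was known to be caught up


def read_with_bound(read: ReplicaRead, now: float, max_staleness_s: float) -> tuple[str, float]:
    staleness = now - read.caught_up_at
    if staleness > max_staleness_s:
        # Fail loudly so the caller can retry elsewhere or degrade on purpose.
        raise TooStale(f"replica is {staleness:.1f}s behind, bound is {max_staleness_s:.1f}s")
    # Return the measured staleness too, so it can be exported as a metric.
    return read.value, staleness


healthy = ReplicaRead(value="v42", caught_up_at=99.5)
partitioned = ReplicaRead(value="v17", caught_up_at=10.0)

value, staleness = read_with_bound(healthy, now=100.0, max_staleness_s=5.0)

try:
    read_with_bound(partitioned, now=100.0, max_staleness_s=5.0)
    rejected = False
except TooStale:
    rejected = True
```

Returning the measured staleness alongside the value is what makes the bound observable: the same number that gates the read can feed a dashboard and an alert.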

The operational consequence

When you are on-call at two in the morning and a database shard has lost quorum, your CAP choice becomes painfully concrete. If you chose CP, your users see errors. If you chose AP, they see stale data or, worse, the system accepts writes that you cannot honour when the partition resolves.

Both are bad. The question is which kind of bad your product can survive. A search query returning "service temporarily unavailable" is recoverable with a retry. An e-commerce order that was accepted but not actually charged is not recoverable without a human being sending an email.
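
The two kinds of bad can be sketched in a few lines. This is a deliberately simplified model, not how any particular database behaves: `Shard`, its `mode` flag, and the `unreplicated` list are hypothetical names, and a real AP system would need an actual reconciliation strategy behind that list.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """CP outcome: refuse the write rather than risk divergence."""


@dataclass
class Shard:
    replicas: int      # total replicas in the group
    reachable: int     # replicas currently reachable, this node included
    mode: str          # "CP" or "AP"
    log: list[str] = field(default_factory=list)
    unreplicated: list[str] = field(default_factory=list)

    def has_quorum(self) -> bool:
        return self.reachable > self.replicas // 2

    def write(self, record: str) -> str:
        if self.has_quorum():
            self.log.append(record)
            return "committed"
        if self.mode == "CP":
            raise Unavailable("no quorum; caller should retry later")
        # AP: accept locally and owe a reconciliation once the partition heals.
        self.unreplicated.append(record)
        return "accepted-locally"


cp = Shard(replicas=3, reachable=1, mode="CP")
ap = Shard(replicas=3, reachable=1, mode="AP")

try:
    cp.write("order-1")
    cp_result = "committed"
except Unavailable:
    cp_result = "unavailable"       # users see an error; nothing to clean up

ap_result = ap.write("order-1")     # users see success; a promise is now outstanding
```

The CP shard leaves no debt behind, only an error. The AP shard leaves an entry in `unreplicated`, and somebody has to decide what happens to it when the partition heals.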

Pick your bad on purpose. Pick it before the incident, not during.

What I tell teams now

For every data flow, write down the CAP and PACELC choice explicitly in the design doc. One sentence.
Bound the staleness for every eventually-consistent path, and monitor that bound.
Make sure the people who will be on-call for the system have read and understood the choice.
Don't let "eventual consistency" be a rhetorical device. The guarantee itself carries no time bound, so treat it as a contract and write the time bound into it yourself.
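
The first two items on that list are mechanical enough to check automatically. Below is a rough sketch of such a check, assuming the design doc's decisions are captured in a small record; `FlowDecision` and `problems` are hypothetical names, and the rule that anything other than PC/EC needs a staleness bound is my own simplification.

```python
from dataclasses import dataclass
from typing import Optional

VALID_LABELS = {"PA/EL", "PA/EC", "PC/EL", "PC/EC"}


@dataclass(frozen=True)
class FlowDecision:
    flow: str
    pacelc: str                              # e.g. "PC/EC" or "PA/EL"
    rationale: str                           # the one sentence from the design doc
    max_staleness_s: Optional[float] = None  # required if reads can ever be stale


def problems(decision: FlowDecision) -> list[str]:
    found = []
    if decision.pacelc not in VALID_LABELS:
        found.append("unknown PACELC label")
    if not decision.rationale.strip():
        found.append("missing rationale")
    # Simplification: any choice other than PC/EC can serve stale reads in some mode.
    can_be_stale = decision.pacelc != "PC/EC"
    if can_be_stale and decision.max_staleness_s is None:
        found.append("no staleness bound")
    return found


ledger = FlowDecision("core ledger", "PC/EC", "Money must never diverge.")
prefs = FlowDecision("marketing preferences", "PA/EL", "Stale preferences are tolerable.")
prefs_fixed = FlowDecision("marketing preferences", "PA/EL",
                           "Stale preferences are tolerable.", max_staleness_s=60.0)

print(problems(ledger), problems(prefs), problems(prefs_fixed))
```

A check like this does not make the decision for anyone. It only refuses to let the decision go unwritten.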

CAP is a twenty-five-year-old result. It is still correct. It is also still badly taught. If this post does one thing, I'd like it to be that you never again nod along at somebody describing their system as "CP" without asking "in what failure mode, with what PACELC trade-off, and at what latency?".

— Nivaan