The Hidden Cost of Microservices

I have watched three separate companies — one FAANG-scale, two Series B — decide that the answer to their software problems was microservices. In all three cases the answer was not microservices. In two of those three cases the intervention made things measurably worse for at least eighteen months. In the third, the lift was roughly break-even after three years of effort. This is not an indictment of microservices. It is a warning about their cost.

The pitch you usually hear

Microservices, we are told, let teams move independently. They let you scale hot components. They let you use the right language for the job. They reduce blast radius during incidents. They make your architecture "cloud-native", which was the phrase everybody used between about 2016 and 2021 to justify doing things they would otherwise have had to argue for on their own merits.

Every one of those claims is true, in theory. What they have in common is that they are statements about what microservices allow. They are not statements about what microservices cost to run. The costs rarely appear on a slide.

The real cost: coordination tax

The moment you split a single deployable artefact into two, you have introduced a network between them. At some point that network will be down, slow, partitioned or lossy. You now have to reason about a universe of failure modes that did not previously exist. Timeouts. Retries. Idempotency. Partial failures. Duplicate writes. You have to build and maintain a contract between the two services, handle versioning, handle breaking-change migrations, and handle the fact that the two will never be deployed at exactly the same time.
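
To make the tax concrete, here is a minimal sketch of what a single cross-service call starts to need once a network sits in the middle. Everything in it is hypothetical: FlakyPaymentsService stands in for a remote service whose responses sometimes get lost, and charge_with_retries is an illustrative client, not any real library's API. It shows only retries plus an idempotency key; a real client would also need timeouts, backoff and jitter.

```python
import uuid


class FlakyPaymentsService:
    """Hypothetical stand-in for a remote service behind an unreliable network."""

    def __init__(self, lose_first_n_responses: int) -> None:
        self._responses_to_lose = lose_first_n_responses
        self._charges_by_key: dict[str, int] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # Record the charge at most once per key, so a retry is not a duplicate write.
        self._charges_by_key.setdefault(idempotency_key, amount_cents)
        if self._responses_to_lose > 0:
            # The write succeeded but the response was lost: a partial failure.
            self._responses_to_lose -= 1
            raise TimeoutError("response lost")
        return "charged"

    @property
    def charge_count(self) -> int:
        return len(self._charges_by_key)


def charge_with_retries(service: FlakyPaymentsService, amount_cents: int,
                        max_attempts: int = 3) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # One key for the whole logical operation, reused on every attempt.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return service.charge(key, amount_cents)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
    raise AssertionError("unreachable")
```

Without the key, a caller that retried after a lost response would charge the card twice. Inside a monolith the same operation is a function call and none of this code exists.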

Each of these is manageable. None of them is free. A small team shipping a product that is not yet sure what it is will spend a startling fraction of its time on the tax and not on the product. I have seen teams of six engineers maintain twelve services and spend 40% of their sprints on inter-service plumbing that existed purely because the boundaries had been drawn in the wrong places.

The distributed monolith

Here is the failure pattern I have watched three times, which is specific enough to have a name. You start with a monolith. You decide microservices are the future. You split along functional boundaries that look sensible on a whiteboard — "Users", "Orders", "Payments", "Notifications". You deploy each separately.

Within six months you discover that every interesting feature requires you to touch at least three of these services. A checkout flow needs Users to tell it who the customer is, Orders to create the line items, Payments to charge the card, and Notifications to send the confirmation. The services have to be released together because the feature depends on coordinated changes to all four. You have taken a monolith and distributed it across four processes, each of which needs the others to be alive at the same time. You have built a distributed monolith.
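
The "alive at the same time" problem can be put into rough numbers. The sketch below assumes each service is independently available 99.9% of the time. That figure is illustrative, not a measurement from any of the companies above, and real failures are rarely fully independent, so read it as a back-of-the-envelope estimate only.

```python
def chain_availability(per_service_availability: float, services_in_path: int) -> float:
    """Availability of a request that needs every service in its path to be up,
    assuming the services fail independently of one another."""
    return per_service_availability ** services_in_path


HOURS_PER_YEAR = 24 * 365

# Compare one process with a checkout path that crosses four services.
for n in (1, 4):
    availability = chain_availability(0.999, n)
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{n} service(s): {availability:.4%} available, ~{downtime_hours:.1f} h/year down")
```

Under those assumptions a checkout that needs all four services is up about 99.6% of the time, roughly 35 hours of downtime a year against roughly 9 for a single process, before counting any of the network failures between them.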

A distributed monolith is strictly worse than a monolith. It has the deployment complexity of microservices and the coupling of a monolith. It is the architectural equivalent of choosing the worst of both worlds on purpose.

When microservices are the right answer

They are the right answer when the services can be owned by different teams, scaled to different sizes, and deployed on different schedules, and those axes are load-bearing. If you cannot name a concrete reason you need each service to be independently deployable, you probably don't need microservices.

At the companies where microservices worked, there were specific organisational facts that made them necessary: the teams were in three different time zones, each service had a different release cadence, and the load profiles were so different that the services needed different instance types. None of that is true for most startups.

The unfashionable alternative

A well-structured monolith with clean module boundaries will serve most companies to well past their Series B. I'm not joking. In my experience, a pair of engineers who understand their domain and have the discipline to keep their modules from tangling can ship product at roughly twice the pace of a team of ten on microservices, for years, before the scaling constraints start to bite.

If you are going to end up at microservices eventually (and some companies genuinely do), you will get there faster by starting from a monolith and carving off services when the seams reveal themselves than by starting from microservices and discovering that the seams were in the wrong place.
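
One way to keep a seam visible inside a monolith, sketched under assumptions: the Notifier interface, InProcessNotifier and checkout below are hypothetical names chosen for illustration, and this is one possible shape for a module boundary, not a prescribed one.

```python
from typing import Protocol


class Notifier(Protocol):
    """The seam: the only thing other modules may know about notifications."""

    def send_confirmation(self, customer_email: str, order_id: str) -> None: ...


class InProcessNotifier:
    """Today's implementation: a plain method call inside the monolith."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send_confirmation(self, customer_email: str, order_id: str) -> None:
        # Record the confirmation; a real module would hand it to a mail sender.
        self.sent.append((customer_email, order_id))


def checkout(customer_email: str, order_id: str, notifier: Notifier) -> str:
    # Checkout depends on the interface, not on how notifications are delivered.
    notifier.send_confirmation(customer_email, order_id)
    return order_id
```

If notifications ever do need their own deployment, a second implementation of the same interface that makes a network call could replace InProcessNotifier without touching checkout. Until then the boundary costs one interface and no network.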

A checklist before you split

Can you name the specific failure mode the split is supposed to fix? "Cleaner architecture" is not a failure mode.
Can two teams own the two sides? If not, you will have coordination without the autonomy benefit.
Is the interface between the two stable enough that you won't re-define it every quarter? If not, you are about to ship a lot of versioning work.
Can you afford the on-call? Every new service adds to somebody's pager load.
If a colleague asked you these questions at the pub, would your answers sound convincing after a pint?

Most of the time, for most teams, the honest answer to at least two of the above is "no". When it is, don't split.

— Nivaan