A query being slow is not mysterious. There are a small number of specific mechanical reasons. Once you know them, query plans stop feeling like tea-leaf reading and start feeling like invoices for work the database has been doing on your behalf.

Five causes, in rough order of how often I've encountered them in production at Google, Meta and Amazon.

1. You're doing a sequential scan when you could be doing an index lookup

A sequential scan reads every row in a table and applies the WHERE clause to each. An index lookup uses a B-tree (or similar) to find the matching rows without reading the rest.

The cost difference is enormous — the scan is O(N), the lookup is O(log N) — and the fix is usually one line: CREATE INDEX. Two caveats. First, indexes cost space and slow down writes, so you can't index everything. Second, an index that isn't used by the planner is worse than no index at all, because it still eats the write cost.

Always run EXPLAIN ANALYZE (PostgreSQL) or EXPLAIN (MySQL; EXPLAIN ANALYZE on 8.0.18+) before and after you add the index, and check that the planner actually picks it up. If it doesn't, find out why — usually it's because your filter predicate isn't sargable, or because the statistics are stale.
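The difference shows up directly in the plan. A minimal sketch using SQLite's EXPLAIN QUERY PLAN (the table, column, and index names here are invented; Postgres and MySQL print different output, but the scan-versus-indexed-search distinction is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable step in the last column.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT id FROM users WHERE email = 'user42@example.com'"
before = plan(q)   # full table scan: something like "SCAN users"
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(q)    # indexed lookup: something like "SEARCH users USING ... idx_users_email"
print(before)
print(after)
```

Same query, same result, O(N) work before and O(log N) after — and the plan tells you which one you're getting.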

2. You have an index but the planner won't use it

Five common reasons a perfectly good index gets ignored:

  1. The predicate isn't sargable — wrapping the indexed column in a function or expression hides it from the index.
  2. An implicit type conversion — comparing a text column to a number forces a cast on every row, which is just case 1 in disguise.
  3. A leading-wildcard LIKE ('%foo') — a B-tree can only seek on a known prefix.
  4. Stale statistics — the planner thinks the table is small or the predicate unselective; run ANALYZE.
  5. Genuinely low selectivity — if the predicate matches a large fraction of the table, a sequential scan really is cheaper, and the planner is right to skip the index.
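The sargability problem in particular is easy to reproduce. A sketch in SQLite (table and column names invented): wrapping the column in a function forces a scan, while an equivalent range predicate on the bare column uses the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE INDEX idx_orders_created ON orders (created)")

def plan(sql):
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-sargable: the function call hides the column from the index.
bad = plan("SELECT id FROM orders WHERE substr(created, 1, 4) = '2024'")

# Sargable rewrite of the same filter as a range on the bare column.
good = plan("SELECT id FROM orders "
            "WHERE created >= '2024-01-01' AND created < '2025-01-01'")
print(bad)    # a scan
print(good)   # a search using idx_orders_created
```

The two queries compute the same answer; only the second one lets the B-tree do its job.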

3. The query is correct and indexed and still slow because of the joins

Three-table joins start to hurt. Five-table joins almost always hurt. The planner has to pick a join order and a join strategy (nested loop, hash join, merge join) for each pair, and as the number of tables grows the search space balloons.

Two things help: making sure the join columns are indexed on both sides, and — often more importantly — reducing the cardinality at each step. A join that drops from a million rows to a hundred is cheap. A join that goes from a million to a billion (a Cartesian-style blow-up because of a bad predicate) will destroy any query. Look at the row-count estimates at each node in the plan. If the estimate is wildly different from the actual, you have a statistics problem.
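Indexing the join column is visible in the plan. A SQLite sketch (schema and index names invented): before the index exists, the planner has to scan, or build a transient index for, the inner side on every probe; afterwards the named index appears in the plan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, code TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_code TEXT, total REAL);
""")

def plan(sql):
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = """SELECT o.id FROM customers c
       JOIN orders o ON o.customer_code = c.code
       WHERE c.region = 'EU'"""
before = plan(q)   # no usable index on the join column yet
conn.execute("CREATE INDEX idx_orders_code ON orders (customer_code)")
after = plan(q)    # the inner side becomes an indexed lookup
print(before)
print(after)
```

This only fixes the lookup cost per probe; the cardinality point in the text — keeping intermediate row counts small — is still the bigger lever.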

4. You are doing useful work but the database is doing it on disk instead of in memory

Most production databases keep "hot" data in a buffer cache. If your working set fits, queries are fast. If it doesn't, the database has to fetch pages from disk, and disk — even SSD — is orders of magnitude slower than RAM.

This is the single most common reason a query that was fast on Monday is slow on Thursday. Nothing changed in your code. The table grew past the size of the buffer cache, or some other query evicted your pages, and now you're paying disk latency on every read.

Fixes, roughly in ascending order of expense: add memory, reshape the query to read fewer pages, partition the table so cold data doesn't interleave with hot, or move the hot part to a cache in front of the database. A well-configured Redis in front of a Postgres instance often removes the need for a larger Postgres instance.
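The cache-in-front option is usually the cache-aside pattern. A sketch in Python, with a plain dict standing in for Redis and a stub standing in for the database call (all names here are invented):

```python
import time

CACHE = {}          # key -> (value, expires_at); stand-in for Redis
TTL_SECONDS = 60

def slow_query(user_id):
    # Placeholder for the real (possibly disk-bound) database read.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    hit = CACHE.get(key)
    if hit is not None and hit[1] > time.monotonic():
        return hit[0]                                 # served from memory
    value = slow_query(user_id)                       # miss: go to the database
    CACHE[key] = (value, time.monotonic() + TTL_SECONDS)
    return value

user = get_user(1)   # first call misses, second call within the TTL hits
```

With Redis the dict becomes SETEX/GET calls, and you additionally have to think about invalidation on writes, which is where most cache bugs live.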

5. Lock contention

The sneakiest cause of slow queries, and the hardest to diagnose from an EXPLAIN plan. The query itself is fast. It just has to wait in a queue before it can start.

Common triggers: long-running transactions holding row-level locks, migrations that take an exclusive lock on a busy table, SELECT ... FOR UPDATE patterns used incorrectly. Diagnostics: look at your database's lock view (pg_locks in PostgreSQL, SHOW ENGINE INNODB STATUS in MySQL) during the slow window. If you see the query blocked, the query plan is a red herring.
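Contention is easy to reproduce with two connections. A SQLite sketch (table name invented): the second UPDATE is trivially fast on its own, but with a write transaction open on another connection it just waits, then gives up:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn1 = sqlite3.connect(path)
conn1.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn1.execute("INSERT INTO accounts VALUES (1, 100)")
conn1.commit()

# Connection 1 opens a write transaction and holds it open...
conn1.execute("BEGIN IMMEDIATE")
conn1.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")

# ...so connection 2's UPDATE must queue behind the lock.
conn2 = sqlite3.connect(path, timeout=0.1)   # give up after ~100 ms
blocked = False
try:
    conn2.execute("UPDATE accounts SET balance = balance + 1 WHERE id = 1")
except sqlite3.OperationalError:             # "database is locked"
    blocked = True

conn1.commit()   # release the lock
conn2.execute("UPDATE accounts SET balance = balance + 1 WHERE id = 1")  # now fine
```

No EXPLAIN output would have shown anything wrong with connection 2's statement, which is exactly the point.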

Reading a query plan — the two-minute version

Read from the innermost (most indented) nodes outward — the leaves execute first. With EXPLAIN ANALYZE, each node has an estimated row count and an actual row count. A large gap between the two is the single most useful diagnostic signal in the whole plan. Find the node where the estimate is wrong by an order of magnitude and you've usually found the cause of the slowness.

Look for, in order:

  1. Sequential scans on big tables (usually a missing index).
  2. Nested loops with large outer inputs (usually the wrong join strategy).
  3. Filters applied after a join that could have been applied before (usually a rewrite opportunity).
  4. Sort nodes on huge intermediate results (usually an index on the ORDER BY column will kill them).
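Item 4 is easy to see in miniature. A SQLite sketch (names invented): without an index the plan contains an explicit sort step, and an index on the ORDER BY column removes it because the index already delivers rows in order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT)")

def plan(sql):
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT id FROM events ORDER BY ts"
before = plan(q)   # contains a sort step: "USE TEMP B-TREE FOR ORDER BY"
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")
after = plan(q)    # rows come out of the index pre-sorted; the sort step is gone
print(before)
print(after)
```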

Every slow query is a database doing something it doesn't need to do. The job is to figure out what, and stop it.

Nivaan