B-Trees in System Design: How Databases Index Billions of Rows (Visualized)

A B-tree is a self-balancing, multi-way search tree in which every node holds multiple keys and multiple child pointers, keeping the tree so wide and shallow that even a billion-row table can be searched in three or four disk reads. First described by Bayer and McCreight in 1972, B-trees (and their close variant the B+ tree) power the default indexes of MySQL InnoDB, PostgreSQL, Oracle, SQL Server, SQLite, and almost every other relational database engine in production today.

To understand why B-trees matter, you need to understand disk I/O. A magnetic hard drive takes about 5–10 ms to seek to a random block; an SSD takes microseconds but still has a minimum read unit (a 4 KB page). Reading many small items individually is expensive. A B-tree is specifically engineered around this reality: each node is sized to fill exactly one OS page (typically 4 KB or 16 KB), so each node traversal costs exactly one I/O — and because each node holds dozens to hundreds of keys, the tree height stays tiny.

Anatomy of a B-Tree Node

A B-tree of order t obeys a small set of invariants. Every internal node (non-root) holds between t-1 and 2t-1 keys. A node with k keys has exactly k+1 child pointers. All leaf nodes sit at the same depth, guaranteeing the tree is always perfectly balanced. The root is special — it can hold as few as 1 key.

Within each node the keys are stored in sorted order. Searching is a classic binary search within the node — find the right slot, then follow the child pointer. The keys act as fences: all keys in child pointer i are greater than key i-1 and less than key i. This ordering is what lets a B-tree answer both point lookups (WHERE id = 42) and range scans (WHERE id BETWEEN 10 AND 100) efficiently.

High Fan-Out Keeps the Tree Shallow

The defining advantage of a B-tree over a binary search tree (BST) is fan-out — the number of children each node can have. A BST has fan-out 2, so a tree holding n keys has height log₂(n). A B-tree with order t = 100 has fan-out up to 200, producing height log₁₀₀(n). For one billion rows: a BST needs ≈30 levels; a B-tree with fan-out 200 needs only ≈5. Each level is a disk I/O, so the difference between 30 seeks and 5 seeks is the difference between a slow query and a fast one.

Fan-out: B-tree stays shallow vs binary tree grows deep

Same 15 keys — the binary tree (left) needs 4 levels; the B-tree with fan-out 4 (right) needs only 2 levels. Fewer levels = fewer disk reads.

Searching a B-Tree: O(log n) in Practice

A search starts at the root. At each node, binary-search through the sorted keys to find the correct child pointer, then follow it down to the next node. Repeat until you reach a leaf. Because the tree has height O(log_t n) and each node fit inside one disk page, you perform at most O(log_t n) page reads — typically 3–5 for a million-row table. The algorithm is the same for SELECT ... WHERE pk = ? in any SQL database.

B-tree search: descending from root to leaf

Watch the search descend level by level, comparing keys and following the correct child pointer until the target key is found in a leaf.

Insertion and Node Splitting

Insertion always targets the correct leaf node. If the leaf has room (fewer than 2t-1 keys), the new key is simply placed in sorted position and the operation is done. The tricky case is a full node. When a node is full and a new key must be inserted into it, the node splits: the median key is promoted up to the parent, and the remaining keys become two sibling nodes. If the parent is also full, the split propagates upward — in the worst case all the way to the root, which is the only way a B-tree grows in height. This top-down approach keeps all leaves at the same depth.

B-tree node split on insertion

Inserting key 35 into a full leaf node [25, 30, 37, 45] triggers a split: the median (37) is promoted to the parent, and two new nodes are created.

B-Tree vs B+ Tree: Which Does Your Database Use?

The term B-tree is often used loosely to cover both the original Bayer-McCreight B-tree and the more common B+ tree. The key distinction: in a pure B-tree, any node (internal or leaf) may hold actual data records. In a B+ tree, only leaf nodes hold data records; internal nodes hold only keys and child pointers. Additionally, B+ tree leaf nodes are linked together in a doubly linked list, creating a sorted sequence of all rows that can be scanned sequentially.

This design makes range scans dramatically cheaper. A query like WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31' descends to the first matching leaf, then follows the linked list forward until the range is exhausted — no backtracking up the tree. MySQL InnoDB, PostgreSQL, and virtually every modern relational engine uses B+ trees, not pure B-trees, precisely for this reason.

Property	B-Tree	B+ Tree
Data stored in	Any node (internal or leaf)	Leaf nodes only
Internal nodes hold	Keys + data pointers + child ptrs	Keys + child pointers only
Leaf node linking	No	Yes — doubly linked list
Range scan	Requires backtracking up tree	Follow leaf linked list — very fast
Point lookup	May stop at internal node (fast)	Always reaches a leaf
Used by	Older / niche engines	MySQL InnoDB, PostgreSQL, SQLite, Oracle
Fan-out (internal)	Lower (keys + data pointers)	Higher (keys + child ptrs only)
Tree height	Slightly taller	Slightly shorter (higher fan-out)

Page-Oriented Storage and the Buffer Pool

B-trees are inseparable from page-oriented storage. The database divides its data files into fixed-size pages — typically 4 KB (PostgreSQL) or 16 KB (MySQL InnoDB). Each B-tree node maps to exactly one page. When the engine needs to read or write a node, it reads or writes the entire page. The buffer pool (InnoDB) or shared buffers (PostgreSQL) keeps hot pages in RAM: if the root and the top few levels of the tree fit in memory, a search only needs one real disk I/O (the leaf), making lookups sub-millisecond on heavily cached systems.

Write-ahead logging (WAL) ensures durability: before a page is modified in the buffer pool, a log record is written to sequential disk storage. This means individual row inserts are cheap (sequential WAL write) while the actual B-tree page update can be deferred to a background checkpoint. This is the engine behind the durability in every ACID-compliant database.

B-Tree vs LSM Tree: Read vs Write Amplification

B-trees dominate relational databases, but a rival data structure — the Log-Structured Merge tree (LSM tree) — dominates write-heavy workloads in systems like RocksDB, Cassandra, LevelDB, and ScyllaDB. The trade-off is fundamental: B-trees optimize for reads (few page reads per lookup) at the cost of write amplification; LSM trees optimize for writes (sequential append) at the cost of read amplification.

Write amplification in a B-tree: inserting a single row may require reading a page, modifying it in memory, and writing it back — plus writing a WAL record. If the page is cold (not in buffer pool), that is two random I/Os per insert. In an LSM tree, every insert is a sequential append to an in-memory memtable, flushed to disk in sorted runs — far cheaper for write-heavy workloads but requiring periodic compaction that reads and rewrites large data files (read amplification during compaction).

Dimension	B-Tree / B+ Tree	LSM Tree
Write path	Random write to the correct page	Sequential append to memtable → SSTables
Read path	3–5 page reads (root → leaf)	May need to check multiple SSTables + bloom filters
Write amplification	Moderate (WAL + page write)	High (compaction rewrites data repeatedly)
Read amplification	Low (tree path is direct)	Higher (must merge multiple sorted runs)
Space amplification	Low	Higher (multiple versions during compaction)
Range scans	Excellent (linked leaf list in B+)	Good (sorted SSTables, bloom filters skip levels)
Best for	Read-heavy OLTP, balanced workloads	Write-heavy workloads, time-series, event logs
Examples	MySQL, PostgreSQL, SQLite, Oracle	RocksDB, Cassandra, LevelDB, ScyllaDB

Neither structure is universally better. The rough rule: choose a B-tree-backed engine (MySQL, PostgreSQL) when reads dominate or when you need strong consistency and ACID transactions. Choose an LSM-backed engine (RocksDB, Cassandra) when write throughput is the bottleneck and you can tolerate eventual consistency or tune compaction aggressively. Many systems today run both — PostgreSQL for transactional data, RocksDB as a storage engine for caches or queues.

Practical Implications: Using B-Tree Indexes Well

Understanding B-tree internals makes you a better SQL developer. A few key rules: Leftmost prefix rule — a composite index on (last_name, first_name) can serve queries filtering on last_name alone, but not on first_name alone, because tree traversal must start at the leftmost key. Avoid functions on indexed columns — WHERE YEAR(created_at) = 2024 cannot use the B-tree index on created_at because the function is applied before the comparison; use WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31' instead. Index cardinality matters — a low-cardinality column (like a boolean) makes a poor B-tree index because the tree cannot narrow down the search significantly; the optimizer may full-scan instead.

-- B-tree index on a single column
CREATE INDEX idx_user_email ON users(email);

-- Composite B-tree index (leftmost prefix rule applies)
CREATE INDEX idx_order_status_date ON orders(status, created_at);

-- This query CAN use idx_order_status_date (leftmost prefix matched)
SELECT * FROM orders WHERE status = 'shipped' AND created_at > '2024-01-01';

-- This query CANNOT use the composite index (skips leftmost column)
SELECT * FROM orders WHERE created_at > '2024-01-01';

-- Range scan exploiting B+ tree leaf linked list
SELECT id, total FROM orders
WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31'
ORDER BY created_at;  -- ORDER BY is free: leaves are already sorted

Frequently Asked Questions

Why do databases use B-trees instead of hash tables for indexes?

Hash indexes offer O(1) point lookups, but they cannot serve range queries (WHERE id > 100) or ordered scans at all — a hash function destroys ordering information. B-trees maintain sorted order, so they handle both point lookups and range scans efficiently. Most SQL databases do support hash indexes as a secondary option (e.g., CREATE INDEX ... USING HASH in PostgreSQL), but the B-tree index is the default precisely because it handles the full spectrum of query patterns you encounter in practice.

What happens when a B-tree root splits?

Root splitting is the only mechanism by which a B-tree grows taller. When the root node is full and a split is needed, the root itself is split: a new root node is created containing only the median key, and the two halves become its children. The tree height increases by exactly one. This is a rare event — the root must be completely full, and filling it requires enough prior insertions to have saturated every level below it. Once the tree reaches a stable height for a given dataset size, root splits become extremely infrequent.

How does a B-tree handle concurrent reads and writes in production databases?

Concurrent access to a B-tree is managed through latch coupling (also called crabbing). A writer acquires a latch on the root, then moves down the tree acquiring a latch on the child before releasing the parent — so no split can propagate upward past a latched node. Modern engines like InnoDB use optimistic latch coupling: reads are latch-free until a conflict is detected, dramatically improving read throughput. PostgreSQL uses a different approach — MVCC at the row level — so readers never block writers and vice versa, with dead tuples cleaned up by the VACUUM process that also keeps B-tree pages from bloating with stale index entries.

Three to five disk reads to find any row among a billion — that is the quiet engineering miracle that B-trees have been delivering for over fifty years. Understanding them is not academic; it is the foundation of every index decision you will ever make.
— alokknight Engineering