B-Trees in System Design: How Databases Index Billions of Rows (Visualized)
A B-tree is the self-balancing search tree behind nearly every relational database index. Its high fan-out keeps the tree shallow โ three or four levels can index hundreds of millions of rows โ so every lookup touches only a handful of disk pages. This guide explains the structure, operations, and trade-offs, with live animations.
A B-tree is a self-balancing, multi-way search tree in which every node holds multiple keys and multiple child pointers, keeping the tree so wide and shallow that even a billion-row table can be searched in three or four disk reads. First described by Bayer and McCreight in 1972, B-trees (and their close variant the B+ tree) power the default indexes of MySQL InnoDB, PostgreSQL, Oracle, SQL Server, SQLite, and almost every other relational database engine in production today.
To understand why B-trees matter, you need to understand disk I/O. A magnetic hard drive takes about 5โ10 ms to seek to a random block; an SSD takes microseconds but still has a minimum read unit (a 4 KB page). Reading many small items individually is expensive. A B-tree is specifically engineered around this reality: each node is sized to fill exactly one OS page (typically 4 KB or 16 KB), so each node traversal costs exactly one I/O โ and because each node holds dozens to hundreds of keys, the tree height stays tiny.
Anatomy of a B-Tree Node
A B-tree of order t obeys a small set of invariants. Every internal node (non-root) holds between t-1 and 2t-1 keys. A node with k keys has exactly k+1 child pointers. All leaf nodes sit at the same depth, guaranteeing the tree is always perfectly balanced. The root is special โ it can hold as few as 1 key.
Within each node the keys are stored in sorted order. Searching is a classic binary search within the node โ find the right slot, then follow the child pointer. The keys act as fences: all keys in child pointer i are greater than key i-1 and less than key i. This ordering is what lets a B-tree answer both point lookups (WHERE id = 42) and range scans (WHERE id BETWEEN 10 AND 100) efficiently.
High Fan-Out Keeps the Tree Shallow
The defining advantage of a B-tree over a binary search tree (BST) is fan-out โ the number of children each node can have. A BST has fan-out 2, so a tree holding n keys has height logโ(n). A B-tree with order t = 100 has fan-out up to 200, producing height logโโโ(n). For one billion rows: a BST needs โ30 levels; a B-tree with fan-out 200 needs only โ5. Each level is a disk I/O, so the difference between 30 seeks and 5 seeks is the difference between a slow query and a fast one.
Searching a B-Tree: O(log n) in Practice
A search starts at the root. At each node, binary-search through the sorted keys to find the correct child pointer, then follow it down to the next node. Repeat until you reach a leaf. Because the tree has height O(logt n) and each node fit inside one disk page, you perform at most O(logt n) page reads โ typically 3โ5 for a million-row table. The algorithm is the same for SELECT ... WHERE pk = ? in any SQL database.
Insertion and Node Splitting
Insertion always targets the correct leaf node. If the leaf has room (fewer than 2t-1 keys), the new key is simply placed in sorted position and the operation is done. The tricky case is a full node. When a node is full and a new key must be inserted into it, the node splits: the median key is promoted up to the parent, and the remaining keys become two sibling nodes. If the parent is also full, the split propagates upward โ in the worst case all the way to the root, which is the only way a B-tree grows in height. This top-down approach keeps all leaves at the same depth.
B-Tree vs B+ Tree: Which Does Your Database Use?
The term B-tree is often used loosely to cover both the original Bayer-McCreight B-tree and the more common B+ tree. The key distinction: in a pure B-tree, any node (internal or leaf) may hold actual data records. In a B+ tree, only leaf nodes hold data records; internal nodes hold only keys and child pointers. Additionally, B+ tree leaf nodes are linked together in a doubly linked list, creating a sorted sequence of all rows that can be scanned sequentially.
This design makes range scans dramatically cheaper. A query like WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31' descends to the first matching leaf, then follows the linked list forward until the range is exhausted โ no backtracking up the tree. MySQL InnoDB, PostgreSQL, and virtually every modern relational engine uses B+ trees, not pure B-trees, precisely for this reason.
| Property | B-Tree | B+ Tree |
|---|---|---|
| Data stored in | Any node (internal or leaf) | Leaf nodes only |
| Internal nodes hold | Keys + data pointers + child ptrs | Keys + child pointers only |
| Leaf node linking | No | Yes โ doubly linked list |
| Range scan | Requires backtracking up tree | Follow leaf linked list โ very fast |
| Point lookup | May stop at internal node (fast) | Always reaches a leaf |
| Used by | Older / niche engines | MySQL InnoDB, PostgreSQL, SQLite, Oracle |
| Fan-out (internal) | Lower (keys + data pointers) | Higher (keys + child ptrs only) |
| Tree height | Slightly taller | Slightly shorter (higher fan-out) |
Page-Oriented Storage and the Buffer Pool
B-trees are inseparable from page-oriented storage. The database divides its data files into fixed-size pages โ typically 4 KB (PostgreSQL) or 16 KB (MySQL InnoDB). Each B-tree node maps to exactly one page. When the engine needs to read or write a node, it reads or writes the entire page. The buffer pool (InnoDB) or shared buffers (PostgreSQL) keeps hot pages in RAM: if the root and the top few levels of the tree fit in memory, a search only needs one real disk I/O (the leaf), making lookups sub-millisecond on heavily cached systems.
Write-ahead logging (WAL) ensures durability: before a page is modified in the buffer pool, a log record is written to sequential disk storage. This means individual row inserts are cheap (sequential WAL write) while the actual B-tree page update can be deferred to a background checkpoint. This is the engine behind the durability in every ACID-compliant database.
B-Tree vs LSM Tree: Read vs Write Amplification
B-trees dominate relational databases, but a rival data structure โ the Log-Structured Merge tree (LSM tree) โ dominates write-heavy workloads in systems like RocksDB, Cassandra, LevelDB, and ScyllaDB. The trade-off is fundamental: B-trees optimize for reads (few page reads per lookup) at the cost of write amplification; LSM trees optimize for writes (sequential append) at the cost of read amplification.
Write amplification in a B-tree: inserting a single row may require reading a page, modifying it in memory, and writing it back โ plus writing a WAL record. If the page is cold (not in buffer pool), that is two random I/Os per insert. In an LSM tree, every insert is a sequential append to an in-memory memtable, flushed to disk in sorted runs โ far cheaper for write-heavy workloads but requiring periodic compaction that reads and rewrites large data files (read amplification during compaction).
| Dimension | B-Tree / B+ Tree | LSM Tree |
|---|---|---|
| Write path | Random write to the correct page | Sequential append to memtable โ SSTables |
| Read path | 3โ5 page reads (root โ leaf) | May need to check multiple SSTables + bloom filters |
| Write amplification | Moderate (WAL + page write) | High (compaction rewrites data repeatedly) |
| Read amplification | Low (tree path is direct) | Higher (must merge multiple sorted runs) |
| Space amplification | Low | Higher (multiple versions during compaction) |
| Range scans | Excellent (linked leaf list in B+) | Good (sorted SSTables, bloom filters skip levels) |
| Best for | Read-heavy OLTP, balanced workloads | Write-heavy workloads, time-series, event logs |
| Examples | MySQL, PostgreSQL, SQLite, Oracle | RocksDB, Cassandra, LevelDB, ScyllaDB |
Neither structure is universally better. The rough rule: choose a B-tree-backed engine (MySQL, PostgreSQL) when reads dominate or when you need strong consistency and ACID transactions. Choose an LSM-backed engine (RocksDB, Cassandra) when write throughput is the bottleneck and you can tolerate eventual consistency or tune compaction aggressively. Many systems today run both โ PostgreSQL for transactional data, RocksDB as a storage engine for caches or queues.
Practical Implications: Using B-Tree Indexes Well
Understanding B-tree internals makes you a better SQL developer. A few key rules: Leftmost prefix rule โ a composite index on (last_name, first_name) can serve queries filtering on last_name alone, but not on first_name alone, because tree traversal must start at the leftmost key. Avoid functions on indexed columns โ WHERE YEAR(created_at) = 2024 cannot use the B-tree index on created_at because the function is applied before the comparison; use WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31' instead. Index cardinality matters โ a low-cardinality column (like a boolean) makes a poor B-tree index because the tree cannot narrow down the search significantly; the optimizer may full-scan instead.
-- B-tree index on a single column
CREATE INDEX idx_user_email ON users(email);
-- Composite B-tree index (leftmost prefix rule applies)
CREATE INDEX idx_order_status_date ON orders(status, created_at);
-- This query CAN use idx_order_status_date (leftmost prefix matched)
SELECT * FROM orders WHERE status = 'shipped' AND created_at > '2024-01-01';
-- This query CANNOT use the composite index (skips leftmost column)
SELECT * FROM orders WHERE created_at > '2024-01-01';
-- Range scan exploiting B+ tree leaf linked list
SELECT id, total FROM orders
WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31'
ORDER BY created_at; -- ORDER BY is free: leaves are already sortedFrequently Asked Questions
Why do databases use B-trees instead of hash tables for indexes?
Hash indexes offer O(1) point lookups, but they cannot serve range queries (WHERE id > 100) or ordered scans at all โ a hash function destroys ordering information. B-trees maintain sorted order, so they handle both point lookups and range scans efficiently. Most SQL databases do support hash indexes as a secondary option (e.g., CREATE INDEX ... USING HASH in PostgreSQL), but the B-tree index is the default precisely because it handles the full spectrum of query patterns you encounter in practice.
What happens when a B-tree root splits?
Root splitting is the only mechanism by which a B-tree grows taller. When the root node is full and a split is needed, the root itself is split: a new root node is created containing only the median key, and the two halves become its children. The tree height increases by exactly one. This is a rare event โ the root must be completely full, and filling it requires enough prior insertions to have saturated every level below it. Once the tree reaches a stable height for a given dataset size, root splits become extremely infrequent.
How does a B-tree handle concurrent reads and writes in production databases?
Concurrent access to a B-tree is managed through latch coupling (also called crabbing). A writer acquires a latch on the root, then moves down the tree acquiring a latch on the child before releasing the parent โ so no split can propagate upward past a latched node. Modern engines like InnoDB use optimistic latch coupling: reads are latch-free until a conflict is detected, dramatically improving read throughput. PostgreSQL uses a different approach โ MVCC at the row level โ so readers never block writers and vice versa, with dead tuples cleaned up by the VACUUM process that also keeps B-tree pages from bloating with stale index entries.
Three to five disk reads to find any row among a billion โ that is the quiet engineering miracle that B-trees have been delivering for over fifty years. Understanding them is not academic; it is the foundation of every index decision you will ever make.
โ alokknight Engineering
