How Databases Work: Storage, Indexes & the Read/Write Path (Visualized)
A database is a system that durably stores data on disk and serves fast reads and writes through storage engines, pages, and indexes. This guide explains how databases physically store rows, how B-tree and LSM indexes work, and the exact read and write paths โ with live animations.
A database is a system that organizes data so it can be stored durably on disk and retrieved quickly, even as the dataset grows to billions of rows. At its core, a database is two things working together: a storage engine that decides how bytes land on disk, and an index that lets the engine find the right bytes without scanning everything.
When you run SELECT * FROM users WHERE id = 42, the database does not read every row. It walks an index to locate the exact disk page that holds row 42, pulls that page into memory, and returns the row. Understanding how that happens โ pages, indexes, the buffer pool, the write-ahead log โ is the foundation of database performance.
Storage Engines: Pages and Rows
The storage engine is the layer that reads and writes data on disk. It never deals in single rows; it deals in fixed-size pages (commonly 8 KB in PostgreSQL, 16 KB in MySQL's InnoDB). A page is the unit of I/O: to read one row, the engine reads the whole page that contains it. Each page packs many rows plus a header and a slot directory that maps row identifiers to byte offsets inside the page.
Because disk and SSD I/O is far slower than memory, databases keep recently used pages in a memory cache called the buffer pool (or page cache). A query that finds its page already cached is a cache hit and returns in microseconds; a miss forces a disk read costing milliseconds. Maximizing the buffer-pool hit rate is one of the single biggest levers on database throughput.
Indexes: Finding a Row Without Scanning
Without an index, finding a row means a full table scan โ reading every page until a match is found, an O(n) operation. An index is an auxiliary data structure that maps key values to the location of matching rows, turning that linear scan into a logarithmic lookup. The dominant structure in relational databases is the B-tree (specifically a B+tree).
A B+tree is a balanced, multi-way tree. Internal nodes hold separator keys that guide the search; leaf nodes hold the actual keys in sorted order, with pointers to the rows. Because the tree is shallow and wide (each node holds hundreds of keys), even a billion-row table is only three or four levels deep โ so a lookup touches just a handful of pages. The sorted leaves also make range queries (WHERE age BETWEEN 20 AND 30) efficient.
Indexes are not free: every index must be updated on each INSERT, UPDATE, and DELETE, and it consumes disk space. The art of indexing is creating enough indexes to make reads fast without slowing writes to a crawl. A primary index defines the physical order of rows (a clustered index in InnoDB); secondary indexes point back to the primary key.
The Write Path: WAL, Memtable, and Flush
Writing directly to a random page on disk for every change would be brutally slow and unsafe if the machine crashed mid-write. Databases solve this with a write-ahead log (WAL): the change is first appended sequentially to a log on durable storage, and only then applied in memory. Sequential appends are fast, and the log guarantees durability โ after a crash, the database replays the WAL to recover any committed changes that had not yet been flushed.
Log-structured engines like RocksDB and Cassandra take this further with an LSM-tree: writes land in an in-memory memtable (backed by the WAL); when it fills, it is flushed to an immutable on-disk file (an SSTable). Background compaction later merges these files. This makes writes extremely fast at the cost of more work on reads and during compaction โ the opposite trade-off from a B-tree.
B-tree vs LSM-tree Storage
| B-tree (update-in-place) | LSM-tree (log-structured) | |
|---|---|---|
| Write pattern | Random page writes | Sequential appends + compaction |
| Read speed | Fast, predictable | May check several SSTables |
| Write throughput | Lower | Very high |
| Used by | PostgreSQL, MySQL InnoDB | RocksDB, Cassandra, ScyllaDB |
Transactions and ACID
A transaction groups several reads and writes into a single unit that either fully commits or fully rolls back. Relational databases guarantee ACID properties: Atomicity (all-or-nothing), Consistency (constraints hold), Isolation (concurrent transactions don't corrupt each other), and Durability (committed data survives crashes, thanks to the WAL). Isolation levels โ read committed, repeatable read, serializable โ trade strictness for concurrency.
Most modern databases implement isolation with MVCC (multi-version concurrency control): instead of locking on reads, each transaction sees a consistent snapshot of the data as of its start time, while writers create new row versions. This lets readers and writers proceed without blocking each other โ the model PostgreSQL and MySQL InnoDB both use.
OLTP vs OLAP
Workloads split into two camps. OLTP (online transaction processing) handles many small, fast reads and writes โ the lifeblood of an app's user-facing traffic โ and stores rows together (row-oriented) so a single record is one page read. OLAP (online analytical processing) scans huge volumes to compute aggregates for analytics, and stores data column-by-column (columnar) so a query touching three columns reads only those columns. Systems like Snowflake, BigQuery, and ClickHouse are columnar; PostgreSQL and MySQL are row-oriented.
Data Models Compared
| Model | Best for | Examples |
|---|---|---|
| Relational | Structured data, joins, strong consistency | PostgreSQL, MySQL |
| Document | Flexible/nested schemas, fast iteration | MongoDB, Couchbase |
| Key-value | Simple lookups, caching, very high throughput | Redis, DynamoDB |
| Columnar | Analytics, aggregates over big scans | ClickHouse, BigQuery |
Frequently Asked Questions
How does a database find a row so quickly?
It uses an index, usually a B+tree. Instead of scanning every row, the database descends the tree comparing the search key at each level and following a single pointer, reaching the target leaf in three or four page reads even for a billion-row table. The page is then served from the buffer pool if cached, or read from disk on a miss.
What is a write-ahead log and why does it matter?
A write-ahead log (WAL) is a durable, append-only record of every change, written before the change is applied to the data pages. It makes writes fast (sequential appends beat random page writes) and guarantees durability: after a crash, the database replays the WAL to recover any committed changes that had not yet been flushed to their final location.
Should I use a B-tree or LSM-tree database?
Choose a B-tree engine (PostgreSQL, MySQL InnoDB) for read-heavy or mixed workloads where predictable latency and rich transactions matter. Choose an LSM-tree engine (RocksDB, Cassandra) for write-heavy ingestion at scale where you can tolerate compaction overhead and slightly more complex reads. Most general-purpose apps are well served by a B-tree relational database.
A database is just two ideas done well: lay bytes out in pages on disk, and build an index so you never have to read them all.
โ alokknight Engineering
