Partitioning in System Design: Sharding, Partition Keys, and Rebalancing (Visualized)
Partitioning splits a large dataset into smaller, independent pieces so each piece can live on its own node. This guide covers horizontal vs vertical partitioning, partition key selection, range, hash, and list strategies, hot partitions, rebalancing, and how partitioning differs from replication โ with live animations.
Partitioning is the technique of dividing a large dataset into smaller, independent subsets โ called partitions or shards โ so that each subset can be stored, queried, and scaled on its own node. Where a single-machine database eventually hits a ceiling of memory, I/O, and CPU, partitioning lets you spread both data and load across dozens or hundreds of machines, unlocking near-linear horizontal scalability.
Partitioning is fundamental to every large-scale data system. Cassandra, DynamoDB, MongoDB, Kafka, Elasticsearch, and virtually every major cloud database rely on it. Understanding how partitioning works โ and where it breaks โ is one of the most important skills in distributed systems design.
Why Partitioning?
A single relational database server can typically handle tens of gigabytes to a few terabytes before performance degrades. Beyond that, you face three hard walls: storage (the disk fills up), throughput (disk I/O and network become the bottleneck), and memory (the working set no longer fits in RAM). Vertical scaling โ buying a bigger machine โ is expensive and has a physical ceiling.
Partitioning attacks all three walls at once. By splitting data across N nodes, you get N times the storage, N independent I/O paths, and N memory pools. A query that touches only one partition runs against just 1/N of the data, making it proportionally faster. This is the fundamental reason why partitioning is the first tool architects reach for when a database needs to scale beyond a single node.
Horizontal Partitioning (Sharding)
Horizontal partitioning โ commonly called sharding โ splits a table by rows. Each partition holds a subset of the rows but retains the full column schema. If a users table has 500 million rows, you might split it into four shards of 125 million rows each. Every shard is an independent, fully-functional table on its own database server.
The key question in horizontal partitioning is: which rows go to which shard? The answer is determined by the partition key (also called the shard key). The partition key is a column โ or combination of columns โ whose value determines which shard a row belongs to. Choosing a good partition key is the single most important decision in a partitioned system: a poor choice leads to uneven data distribution, hot spots, and cross-shard queries that are expensive to execute.
Vertical Partitioning
Vertical partitioning splits a table by columns rather than rows. Each partition holds all rows but only a subset of the columns. A products table with 50 columns might be split into a core partition (product_id, name, price) queried constantly, and a detail partition (description, specifications, large BLOB fields) queried rarely. Because the hot columns live in a compact, separate store, they fit in cache more easily and queries that don't need the cold columns never touch those pages.
Vertical partitioning is also the conceptual basis for column-oriented databases (Parquet, Redshift, BigQuery). In a columnar store each column is stored as a contiguous block on disk โ an extreme form of vertical partitioning that makes analytical aggregates (SUM, AVG, COUNT over millions of rows) dramatically faster because the query only reads the one or two columns it needs.
Partitioning Strategies: Range, Hash, and List
Once you decide to shard horizontally, you need a strategy for mapping each row's partition key to a shard. The three canonical strategies are range partitioning, hash partitioning, and list partitioning. Each has different trade-offs in query efficiency, data distribution, and operational complexity.
Range partitioning assigns rows to shards based on a contiguous range of the partition key. For example, orders with order_date in January go to Shard 1, February to Shard 2, and so on. Range partitioning makes range scans efficient โ a query like WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31' touches at most three shards instead of all of them. The downside is uneven load: recent data is almost always hotter than old data, so the latest shard becomes a hot spot while earlier shards sit idle.
Hash partitioning applies a hash function to the partition key and maps the result to a shard โ shard = hash(key) % N. This scatters rows uniformly across all shards, eliminating hot spots. The price is that range queries become expensive: a WHERE user_id BETWEEN 1000 AND 2000 must fan out to all shards. Consistent hashing is a popular refinement: instead of a simple modulo, keys and shards are placed on a ring, so adding or removing a shard requires moving only 1/N of the keys rather than rehashing everything.
List partitioning explicitly maps discrete values of the partition key to specific shards. A multi-region database might route all rows where region = 'US' to US shards, region = 'EU' to EU shards, and so on. List partitioning gives you full control over co-location (European data never leaves EU nodes, satisfying GDPR) but requires manual assignment and doesn't scale automatically when you add a new region.
| Strategy | How rows are assigned | Range scan efficiency | Data distribution | Best for |
|---|---|---|---|---|
| Range | Contiguous key ranges per shard | Excellent โ touches few shards | Can be uneven (hot latest shard) | Time-series, append-heavy data |
| Hash | hash(key) % N | Poor โ must fan out to all shards | Very even | OLTP with point lookups |
| Consistent hash | Key placed on a ring | Poor for ranges | Even; minimal rebalancing cost | Distributed caches, DynamoDB |
| List | Explicit value-to-shard mapping | Good for known values | Manual; can be uneven | Multi-region, multi-tenant isolation |
Query Routing: How a Request Finds Its Shard
When a query arrives in a partitioned system, something must decide which shard to send it to. This is done by a router โ sometimes called a shard proxy, query coordinator, or driver-side routing. The router holds (or can compute) the partition map: a table that says which shard owns which key range or hash bucket. Given a query like SELECT * FROM users WHERE user_id = 42000000, the router hashes 42000000, looks up the result in the partition map, and forwards the query directly to the one shard that owns that key. No other shards are involved.
Queries that don't include the partition key โ such as SELECT * FROM users WHERE email = '[email protected]' โ cannot be routed to a single shard. The router must broadcast the query to all shards and merge the results, a scatter-gather operation that is much more expensive. This is why choosing a partition key that matches your most common query patterns is so critical.
Hot Partitions and Rebalancing
A hot partition (or hot shard) is one that receives a disproportionately large share of the traffic or data โ it is the most common operational problem in partitioned systems. Hot partitions typically arise from a poor partition key choice. If you partition a social media platform's posts by celebrity_id, a single celebrity with 50 million followers will cause their shard to handle thousands of times more read traffic than any other shard, making it a bottleneck even though the other shards are idle.
Strategies for avoiding hot partitions include: choosing a higher-cardinality partition key (user_id instead of country), adding a random suffix to the key to spread writes across multiple shards (sometimes called write sharding), and adding a read replica layer in front of hot shards. For time-series data, splitting by a compound key (e.g., sensor_id + date) ensures different sensors always go to different shards regardless of temporal clustering.
Rebalancing is the process of moving data between shards to restore even distribution โ either because a hot shard has emerged, because the cluster has grown (new shards added), or because a shard node has failed and its data must be redistributed. Rebalancing is expensive: it requires moving potentially gigabytes or terabytes of data while the system remains online, without losing any writes that arrive during the migration. Systems like Cassandra and DynamoDB automate this using consistent hashing and a gossip protocol; MongoDB uses a dedicated balancer background process that migrates chunks between shards whenever distribution becomes uneven.
Partitioning vs Replication
Partitioning and replication are complementary, not competing, techniques. Partitioning solves the scale problem: it splits data so each node holds a fraction of the total dataset, increasing overall capacity. Replication solves the availability problem: it copies the same data to multiple nodes, so if one node fails, another can serve the data. Most production systems use both simultaneously โ a dataset is split into shards (partitioning), and each shard is copied to 2โ3 replicas (replication). This gives you horizontal scalability from partitioning and fault tolerance from replication.
| Partitioning | Replication | |
|---|---|---|
| Primary goal | Scale capacity โ more total data | Fault tolerance โ survive node failure |
| What it does | Each node holds a unique subset of data | Multiple nodes hold identical copies |
| Read throughput | Increases (queries go to fewer rows) | Increases (reads can go to any replica) |
| Write throughput | Increases (writes spread across shards) | Limited (all replicas must be updated) |
| Data loss on failure | Yes โ that shard's data is unavailable | No โ other replicas serve the data |
| Used together? | Yes โ shards are typically replicated | Yes โ replicas are often also partitioned |
Partitioning in Practice: Code Example
PostgreSQL's declarative partitioning lets you define a parent table and attach child tables as range, hash, or list partitions. The planner automatically prunes irrelevant partitions from the query plan, so a query with a date range only scans the relevant child tables.
-- Range-partitioned orders table (PostgreSQL)
CREATE TABLE orders (
order_id BIGINT NOT NULL,
order_date DATE NOT NULL,
customer_id BIGINT,
total_cents INT
) PARTITION BY RANGE (order_date);
-- Attach one partition per quarter
CREATE TABLE orders_2024_q1
PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2
PARTITION OF orders
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Query only touches orders_2024_q1 (partition pruning)
EXPLAIN SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';Frequently Asked Questions
What is the difference between partitioning and sharding?
The terms are often used interchangeably, but there is a subtle distinction in common usage. Partitioning is the general concept of dividing a dataset into parts โ it includes both horizontal and vertical partitioning, and can happen entirely within a single database server (e.g., PostgreSQL table partitioning). Sharding typically implies horizontal partitioning where each partition (shard) lives on a separate physical server or node, enabling the dataset to span multiple machines. All sharding is partitioning, but not all partitioning is sharding.
How do I choose the right partition key?
A good partition key should satisfy four criteria. High cardinality: the key should have many distinct values so rows spread evenly across shards โ a boolean or low-cardinality enum makes a terrible partition key. Query alignment: the key should appear in your most frequent query predicates so the router can direct queries to a single shard. Uniform write distribution: avoid keys that cluster writes in time (like a timestamp) unless you combine them with another dimension. Stability: changing a row's partition key value is expensive because it may require moving the row to a different shard โ so choose a key that is immutable for the life of each row.
Does partitioning make cross-shard joins impossible?
Not impossible, but expensive. When two tables are partitioned by different keys and a query needs to join them, the system must either broadcast one side of the join to all shards and merge results (scatter-gather), or pull all matching rows from both tables to a coordinator node and perform the join there. Both approaches involve significant network I/O. The usual solution is co-location: partition related tables by the same key so rows that are frequently joined together land on the same shard. For example, if users and orders are both partitioned by user_id, then JOIN users ON orders.user_id = users.user_id can be resolved within a single shard without any cross-node communication.
The partition key is the single most consequential decision in a sharded system. Get it right at the start โ changing it later means migrating your entire dataset.
โ alokknight Engineering
