Database Sharding
Session 3.3 · ~5 min read
Why Shard?
A single database server has limits: CPU, memory, disk I/O, and network bandwidth. Vertical scaling buys you time, but eventually you hit the ceiling. When your tables hold billions of rows and your write throughput exceeds what one machine can handle, you need to split the data across multiple machines. That is sharding.
Sharding (also called horizontal partitioning) splits a single logical dataset across multiple database instances, called shards. Each shard holds a subset of the data and handles a subset of the queries. Together, the shards form one logical database.
Sharding is not the same as replication. Replication copies the same data to multiple machines for redundancy and read scaling. Sharding distributes different data to different machines for write scaling and storage capacity. Most production systems use both.
Three Sharding Strategies
Range-Based Sharding
Divide the shard key space into contiguous ranges. Users 1 through 1,000,000 go to Shard A. Users 1,000,001 through 2,000,000 go to Shard B. And so on.
Range-based sharding is simple to understand and implement. Range queries are efficient because all data within a range lives on the same shard. If you need all users with IDs between 500,000 and 600,000, that query hits a single shard.
The problem is hotspots. If user IDs are auto-incrementing, all new writes go to the highest shard. That shard becomes a bottleneck while the others sit idle. Time-based keys have the same problem: all recent data lands on the same shard.
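A minimal sketch of a range-based router, assuming four hypothetical shards with fixed inclusive upper bounds (the shard names and boundaries are illustrative, not from any real system):

```python
import bisect

# Inclusive upper bound of each shard's key range -- hypothetical boundaries.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
SHARDS = ["shard_a", "shard_b", "shard_c", "shard_d"]

def route_by_range(user_id: int) -> str:
    """Return the shard whose range covers user_id."""
    idx = bisect.bisect_left(RANGE_BOUNDS, user_id)
    if idx >= len(SHARDS):
        raise ValueError(f"user_id {user_id} is beyond the last range")
    return SHARDS[idx]

def shards_for_range(lo: int, hi: int) -> list[str]:
    """A range query touches only the shards whose ranges overlap [lo, hi]."""
    i = bisect.bisect_left(RANGE_BOUNDS, lo)
    j = bisect.bisect_left(RANGE_BOUNDS, hi)
    return SHARDS[i:j + 1]
```

Note how `shards_for_range(500_000, 600_000)` resolves to a single shard: this locality is exactly what range-based sharding buys you.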
Hash-Based Sharding
Apply a hash function to the shard key and use the result to determine the shard. For example: shard = hash(user_id) % number_of_shards. A good hash function distributes keys uniformly, so writes spread evenly across all shards.
Hash-based sharding eliminates hotspots for sequential keys. User IDs 1, 2, 3, 4 might map to shards 3, 1, 0, 2 respectively. The tradeoff: range queries become expensive. "Find all users with IDs 1 through 1000" now requires querying every shard and merging results, because those IDs are scattered.
Resharding is also painful. If you change the number of shards (say from 4 to 6), the modulo changes, and most keys map to different shards. You have to move data. Consistent hashing (covered in Session 3.6) mitigates this problem.
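The two points above can be sketched in a few lines. This uses MD5 only because Python's built-in `hash()` is not stable across processes; any uniform hash function would do, and the key counts are arbitrary:

```python
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    """Hash-based routing with a deterministic hash (MD5 here, since
    Python's built-in hash() is salted per process for strings)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# How many keys move when resharding from 4 to 6 shards?
keys = range(100_000)
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 6))
print(f"{moved / len(keys):.0%} of keys remap")  # roughly two thirds
```

A key stays put only when its hash gives the same remainder mod 4 and mod 6, which happens for about a third of uniformly distributed keys; the other two thirds must be physically moved. This is the resharding pain that consistent hashing mitigates.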
Directory-Based Sharding
Maintain a lookup table that maps each shard key (or key range) to a specific shard. The application checks the directory before every database operation to determine which shard to query.
Directory-based sharding offers maximum flexibility. You can place data on shards based on any criteria: geography, tenant ID, or custom business logic. You can move data between shards by updating the directory without changing the hash function or recalculating ranges.
The cost is the directory itself. It becomes a single point of failure and a latency bottleneck. Every query requires an extra lookup. The directory must be highly available, fast, and consistent. Many teams implement the directory in a cache (Redis) backed by a durable store.
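A toy sketch of the directory pattern, using an in-memory dict and invented tenant names; a production directory would live in a highly available store (for example Redis backed by a durable database), not in process memory:

```python
# Hypothetical directory: shard key (tenant ID) -> shard name.
directory = {
    "tenant_acme": "shard_eu_1",
    "tenant_globex": "shard_us_1",
    "tenant_initech": "shard_us_2",
}

def shard_for_tenant(tenant_id: str) -> str:
    """Every query pays this extra lookup before touching a shard."""
    try:
        return directory[tenant_id]
    except KeyError:
        raise LookupError(f"no shard mapping for {tenant_id}") from None

def move_tenant(tenant_id: str, new_shard: str) -> None:
    """Rebalancing is just a directory update (after the data is copied)."""
    directory[tenant_id] = new_shard
```

The flexibility is visible in `move_tenant`: no hash function changes, no ranges are recalculated, only the mapping is rewritten.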
Hash-Based Sharding in Action
```mermaid
graph TD
    Router["Shard Router
shard = hash(user_id) % 4"]
    Router -->|"hash = 0"| S0["Shard 0
Users: 4, 8, 12, ..."]
    Router -->|"hash = 1"| S1["Shard 1
Users: 1, 5, 9, ..."]
    Router -->|"hash = 2"| S2["Shard 2
Users: 2, 6, 10, ..."]
    Router -->|"hash = 3"| S3["Shard 3
Users: 3, 7, 11, ..."]
    S0 --- DB0[("PostgreSQL
Instance 0")]
    S1 --- DB1[("PostgreSQL
Instance 1")]
    S2 --- DB2[("PostgreSQL
Instance 2")]
    S3 --- DB3[("PostgreSQL
Instance 3")]
```
In this diagram, the shard router computes hash(user_id) % 4 and routes the query to the appropriate shard. Each shard is a separate PostgreSQL instance holding roughly one quarter of all users. Writes are distributed evenly. Reads by user_id hit exactly one shard.
Shard Key Selection
The shard key determines everything. It controls data distribution, query routing, and which operations are efficient. A bad shard key creates hotspots, forces cross-shard queries, and makes resharding a nightmare. Choose it based on your most common access pattern.
A good shard key has three properties:
- High cardinality. The key should have many distinct values. Sharding by country gives you ~200 shards at most, and some shards (India, USA) will be much larger than others.
- Even distribution. Values should distribute data uniformly across shards. Auto-incrementing IDs with hash-based sharding achieve this. Sharding by last name does not (too many "Smith" entries on one shard).
- Query alignment. Your most frequent query should include the shard key. If 90% of queries are "get all orders for user X," then user_id is a good shard key because each query hits one shard.
Strategy Comparison
| Dimension | Range-Based | Hash-Based | Directory-Based |
|---|---|---|---|
| Distribution | Uneven if keys are sequential | Even with good hash function | Depends on mapping logic |
| Range queries | Efficient (same shard) | Expensive (scatter-gather) | Possible if directory supports ranges |
| Hotspot risk | High (time-based, sequential keys) | Low (uniform distribution) | Low (configurable placement) |
| Resharding | Split ranges, move data | Painful (most keys remap) | Update directory (flexible) |
| Complexity | Low | Medium | High (directory service required) |
| Extra infrastructure | None | None | Lookup service (must be HA) |
| Use cases | Time-series data, archival | User data, general purpose | Multi-tenant, geo-based routing |
Cross-Shard Queries
The hardest problem in sharding is queries that span multiple shards. If your data is sharded by user_id but you need to "find all users in Jakarta," the city field is not the shard key. The query must hit every shard, collect partial results, and merge them. This is called a scatter-gather query.
Scatter-gather queries are slow and resource-intensive. They scale linearly with the number of shards. With 4 shards, it is tolerable. With 400 shards, it is a serious performance problem.
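A scatter-gather query can be sketched with four in-memory "shards" standing in for real database instances (the user rows and cities are invented for illustration):

```python
# Four hypothetical shards, each holding a slice of the users table
# (sharded by user_id, so cities are scattered arbitrarily).
shards = [
    [{"user_id": 4, "city": "Jakarta"}, {"user_id": 8, "city": "Oslo"}],
    [{"user_id": 1, "city": "Jakarta"}, {"user_id": 5, "city": "Lima"}],
    [{"user_id": 2, "city": "Oslo"},    {"user_id": 6, "city": "Jakarta"}],
    [{"user_id": 3, "city": "Lima"},    {"user_id": 7, "city": "Jakarta"}],
]

def find_users_by_city(city: str) -> list[dict]:
    """Scatter: run the filter on every shard. Gather: merge the partial
    results. Cost grows with the shard count, not the result size."""
    partials = [[u for u in shard if u["city"] == city] for shard in shards]
    merged = [u for partial in partials for u in partial]
    return sorted(merged, key=lambda u: u["user_id"])
```

Even if only one shard holds matching rows, all four must still be queried, because nothing in the shard key tells the router where a given city's users live.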
Solutions include maintaining secondary indexes (either global or per-shard), denormalizing data into a separate lookup table, or using a search engine like Elasticsearch for cross-shard queries. Each solution adds complexity and consistency challenges.
Real-World Examples
Instagram: Sharding PostgreSQL
Instagram needed to handle over 25 photo uploads and 90 likes per second in its early days. They chose to shard PostgreSQL rather than switch to NoSQL. Their approach used thousands of "logical" shards mapped to a smaller number of physical PostgreSQL instances. Each logical shard was a PostgreSQL schema containing the sharded tables. This allowed them to move logical shards between physical servers without re-bucketing data.
For globally unique IDs, Instagram designed a custom ID generation scheme using PostgreSQL's PL/pgSQL. Each ID encodes 41 bits for timestamp, 13 bits for logical shard ID, and 10 bits for an auto-incrementing sequence. This produces IDs that are sortable by time and unique across all shards.
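The bit layout (41 + 13 + 10 = 64 bits) can be sketched in Python; Instagram implemented this in PL/pgSQL, and the epoch value below is an arbitrary stand-in for their custom epoch:

```python
# Sketch of the 64-bit layout described above:
#   41 bits: milliseconds since a custom epoch (EPOCH_MS is illustrative)
#   13 bits: logical shard ID
#   10 bits: per-shard auto-incrementing sequence
EPOCH_MS = 1_314_220_021_721

def make_id(now_ms: int, shard_id: int, seq: int) -> int:
    assert 0 <= shard_id < 2**13 and 0 <= seq < 2**10
    return ((now_ms - EPOCH_MS) << 23) | (shard_id << 10) | seq

def split_id(id_: int) -> tuple[int, int, int]:
    """Recover (timestamp_ms, shard_id, sequence) from an ID."""
    return (id_ >> 23) + EPOCH_MS, (id_ >> 10) & (2**13 - 1), id_ & (2**10 - 1)
```

Because the timestamp occupies the high bits, numerically sorting IDs sorts them by creation time, and embedding the shard ID means any ID can be routed back to its shard without a lookup.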
Discord: Cassandra to ScyllaDB
Discord stored trillions of messages in Cassandra, partitioned by channel_id. This worked well for most channels, but large public channels (with millions of messages) created hot partitions. Cassandra's garbage collector caused latency spikes under this workload. Discord eventually migrated to ScyllaDB (a Cassandra-compatible database written in C++) and added a data services layer in Rust with request coalescing to handle hot partitions. The migration reduced their cluster from 177 Cassandra nodes to 72 ScyllaDB nodes, with p99 read latency dropping from 40-125ms to 15ms.
```mermaid
graph LR
    subgraph Placement
        LS1["Logical Shard 1
(PG Schema)"] --> PS1["Physical Server A"]
        LS2["Logical Shard 2
(PG Schema)"] --> PS1
        LS3["Logical Shard 3
(PG Schema)"] --> PS2["Physical Server B"]
        LS4["Logical Shard 4
(PG Schema)"] --> PS2
    end
    subgraph Resharding
        LS2 -.->|"Move shard"| PS2
    end
```
Instagram's logical shard abstraction is the key insight. By separating logical partitioning from physical placement, resharding becomes a data migration problem rather than a re-hashing problem. You move schemas between servers without touching the application's routing logic.
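The separation can be sketched as two independent layers: a fixed hash that picks a logical shard, and a mutable placement map that assigns logical shards to servers (the shard count and server names here are hypothetical):

```python
import hashlib

# Hypothetical setup: many logical shards, few physical servers.
NUM_LOGICAL_SHARDS = 8192
placement = {ls: f"server_{ls % 4}" for ls in range(NUM_LOGICAL_SHARDS)}

def logical_shard(user_id: int) -> int:
    """Fixed routing layer: this function never changes after launch."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def server_for(user_id: int) -> str:
    return placement[logical_shard(user_id)]

def move_logical_shard(ls: int, new_server: str) -> None:
    """Resharding: copy the schema to the new server, then flip the map.
    No keys are re-hashed; application routing logic is untouched."""
    placement[ls] = new_server
```

Moving capacity around is now a `move_logical_shard` call plus a data copy, rather than a cluster-wide re-bucketing of every key.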
Systems Thinking Lens
Sharding introduces a reinforcing complexity loop. More data requires more shards. More shards increase the surface area for cross-shard queries and operational issues. Operational issues require more engineering effort. More engineering effort means less time for feature development. This loop accelerates as the system grows.
The leverage point is the shard key. A well-chosen shard key keeps most queries within a single shard, which limits the impact of the complexity loop. Instagram's choice to shard by user_id aligned with their primary access pattern (show me this user's photos), keeping the complexity manageable even at massive scale.
Further Reading
- Instagram Engineering, Sharding & IDs at Instagram. How Instagram designed their sharding scheme and ID generation for PostgreSQL.
- Discord Engineering, How Discord Stores Trillions of Messages. The journey from MongoDB to Cassandra to ScyllaDB, with details on hot partition problems.
- PlanetScale, Sharding Strategies: Directory-Based, Range-Based, and Hash-Based. Clear comparison of the three sharding strategies with practical guidance.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapter 6 ("Partitioning"). Thorough treatment of partitioning strategies and rebalancing.
- Instagram Engineering, Handling Growth with Postgres: 5 Tips from Instagram. Practical tips for scaling PostgreSQL at Instagram's growth rate.
Assignment
You have a users table with 100 million rows, sharded by hash(user_id) % 4 across four PostgreSQL instances. The system works well for user-level queries.
A new feature requires finding all users in a specific city (e.g., "Find all users in Jakarta"). This is a cross-shard query because city is not the shard key.
- Explain why this query is expensive with the current sharding scheme. How many shards must be queried? What does the query execution look like?
- Propose at least two solutions to make city-based lookups efficient. For each, describe the tradeoff (what do you gain, what do you pay in complexity or consistency?).
- If you could go back and choose a different shard key, what would you pick? Consider that the system still needs fast user_id lookups. Is there a compound key strategy that works for both access patterns?