Database Indexing
Session 3.5 · ~5 min read
Why Indexes Exist
Without an index, every query is a full table scan. The database reads every row, checks if it matches your condition, and discards what does not. For a table with 100 rows, this is fine. For a table with 10 million rows, it is catastrophic.
An index is a separate data structure that maps column values to their physical locations in the table. Think of it like the index at the back of a textbook: instead of reading every page to find "B-tree," you look up "B-tree" in the index and jump directly to page 247.
A database index trades write speed and storage space for faster reads. Every index you add speeds up some queries and slows down every INSERT, UPDATE, and DELETE.
B-Tree Indexes
The B-tree (specifically B+ tree) is the default index type in PostgreSQL, MySQL, SQL Server, and virtually every relational database. It organizes data in a balanced tree structure where every leaf node is the same distance from the root.
A B+ tree with a branching factor of 500 can index 8 TB of data with only 4 levels. That means any lookup requires at most 4 disk I/O operations, regardless of table size. The time complexity is O(log n), which grows very slowly.
B-trees support equality lookups (WHERE id = 42), range queries (WHERE price BETWEEN 10 AND 50), prefix matching (WHERE name LIKE 'Jo%'), and sorting (ORDER BY created_at). This versatility is why they are the default.
[30 | 70]"] --> L1["[10 | 20]"] R --> L2["[40 | 50 | 60]"] R --> L3["[80 | 90]"] L1 --> D1["Rows: 1-9"] L1 --> D2["Rows: 10-19"] L1 --> D3["Rows: 20-29"] L2 --> D4["Rows: 30-39"] L2 --> D5["Rows: 40-49"] L2 --> D6["Rows: 50-59"] L2 --> D7["Rows: 60-69"] L3 --> D8["Rows: 70-79"] L3 --> D9["Rows: 80-89"] L3 --> D10["Rows: 90-99"] style R fill:#222221,stroke:#c8a882,color:#ede9e3 style L1 fill:#222221,stroke:#6b8f71,color:#ede9e3 style L2 fill:#222221,stroke:#6b8f71,color:#ede9e3 style L3 fill:#222221,stroke:#6b8f71,color:#ede9e3
The root node contains separator keys that direct the search. Internal nodes narrow the range. Leaf nodes contain the actual pointers to table rows. In a B+ tree, leaf nodes are also linked together, which makes range scans efficient: once you find the starting point, you walk the linked list instead of re-traversing the tree.
Hash Indexes
A hash index applies a hash function to the column value and stores the result in a hash table. Lookups are O(1) on average, which is faster than B-tree's O(log n). PostgreSQL benchmarks show hash indexes can be 10-22% faster than B-trees for pure equality lookups.
The tradeoff is severe: hash indexes support only equality comparisons (WHERE id = 42). They cannot do range queries, sorting, or prefix matching. If you need any of those operations on the column, a hash index is useless.
GIN and GiST Indexes
PostgreSQL offers two specialized index types for complex data. GIN (Generalized Inverted Index) is built for full-text search, arrays, and JSONB. It creates an entry for every element inside a composite value. GIN lookups are about 3x faster than GiST for text search, but GIN indexes take about 3x longer to build.
GiST (Generalized Search Tree) handles geometric data, ranges, and nearest-neighbor queries. GiST indexes are lossy for text search (they may return false positives that require a recheck), but they update faster than GIN, making them better for frequently changing data.
Index Comparison
| Dimension | B-Tree | Hash | GIN | GiST |
|---|---|---|---|---|
| Lookup time | O(log n) | O(1) | O(log n) per element | O(log n) |
| Range queries | Yes | No | No | Yes |
| Sorting | Yes | No | No | No |
| Full-text search | No | No | Yes (preferred) | Yes (lossy) |
| JSONB/Array | Limited | No | Yes | No |
| Geometric/Spatial | No | No | No | Yes |
| Build speed | Moderate | Fast | Slow (3x B-tree) | Moderate |
| Write overhead | Moderate | Low | High | Moderate |
| Best for | General purpose | Equality-only lookups | Full-text, JSONB, arrays | Geometry, ranges, nearest-neighbor |
Composite Indexes and Column Order
A composite index includes multiple columns. The database sorts entries first by the leftmost column, then by the second column within each group, and so on. This ordering has a critical consequence called the leftmost prefix rule.
The leftmost prefix rule: a composite index on (A, B, C) can serve queries on A, on (A, B), and on (A, B, C). It cannot efficiently serve queries on B alone, C alone, or (B, C). The index must be read from left to right without skipping columns.
Consider an index on (user_id, status, created_at). This index efficiently supports:
WHERE user_id = 5WHERE user_id = 5 AND status = 'pending'WHERE user_id = 5 AND status = 'pending' ORDER BY created_at
It does NOT efficiently support WHERE status = 'pending' alone, because status is not the leftmost column.
Index Selectivity
Selectivity measures how well an index narrows down results. A column with 1 million unique values in a 1 million row table has selectivity of 1.0 (perfect). A boolean column with only two values has selectivity close to 0 (terrible).
Put highly selective columns first in composite indexes. If user_id narrows 10 million rows to 50, and status narrows them to 5 million, always lead with user_id.
Covering Indexes
A covering index contains all columns needed by a query. The database can answer the query entirely from the index without touching the table data. This eliminates the random I/O cost of jumping from the index to the table rows.
In PostgreSQL, you use INCLUDE to add non-searchable columns to the index:
CREATE INDEX idx_orders_user_status
ON orders (user_id, status)
INCLUDE (created_at, total);
Now a query selecting user_id, status, created_at, total with a WHERE on user_id and status never touches the heap. The savings are substantial for wide tables with many columns.
The Hidden Write Cost
Every index must be updated on every write operation. An INSERT into a table with 5 indexes means 6 write operations: one for the table and one for each index. An UPDATE that modifies an indexed column triggers an index delete and an index insert for each affected index.
This cost is not theoretical. On write-heavy workloads, excessive indexing is a common cause of degraded performance. The solution is to index deliberately: only create indexes that serve actual queries, and drop indexes that query plans never use.
Query Time vs. Table Size
The following chart shows how query response time changes as table size grows, comparing a full table scan against an indexed lookup.
The table scan time grows linearly with row count. The indexed lookup barely moves. At 10 million rows, the difference is four orders of magnitude.
Systems Thinking Lens
Indexes are a classic balancing loop. Adding an index improves read performance, which encourages more queries, which eventually leads to more indexes, which degrades write performance. The leverage point is not "add more indexes" but "understand your query patterns." A single well-designed composite index often replaces three or four single-column indexes while also reducing write overhead.
Detected"] --> B["Add Index"] B --> C["Reads Faster"] C --> D["More Queries
Written"] D --> E["More Indexes
Requested"] E --> F["Write Performance
Degrades"] F --> G["Review &
Consolidate"] G --> B style A fill:#222221,stroke:#c8a882,color:#ede9e3 style F fill:#222221,stroke:#c8a882,color:#ede9e3
Further Reading
- MySQL Reference Manual, Comparison of B-Tree and Hash Indexes. Official MySQL documentation on when to use each index type.
- PlanetScale, Composite Indexes. Practical guide to composite index design with the leftmost prefix rule.
- pganalyze, Understanding Postgres GIN Indexes. Deep dive into GIN index internals, build time, and query performance.
- PostgreSQL Documentation, Preferred Index Types for Text Search. Official comparison of GIN vs GiST for full-text search.
- EnterpriseDB, Are Hash Indexes Faster than B-Tree Indexes in Postgres?. Benchmark results comparing hash and B-tree index performance.
Assignment
You have this query running thousands of times per second:
SELECT * FROM orders
WHERE user_id = ?
AND status = 'pending'
ORDER BY created_at DESC;
Design the optimal composite index for this query. Answer these questions:
- What columns should the index include, and in what order?
- Why does column order matter here? What happens if you put
statusfirst instead ofuser_id? - Would making this a covering index (using
INCLUDE) help? What columns would you include? - What is the write cost of this index on a table that receives 500 inserts per second?