Course → Module 3: Storage, Databases & Caching

Why Indexes Exist

Without an index, every query is a full table scan. The database reads every row, checks if it matches your condition, and discards what does not. For a table with 100 rows, this is fine. For a table with 10 million rows, it is catastrophic.

An index is a separate data structure that maps column values to their physical locations in the table. Think of it like the index at the back of a textbook: instead of reading every page to find "B-tree," you look up "B-tree" in the index and jump directly to page 247.

A database index trades write speed and storage space for faster reads. Every index you add speeds up some queries and slows down every INSERT, UPDATE, and DELETE.
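The tradeoff is visible in the DDL itself. As a sketch against a hypothetical `users` table:

```sql
-- Speed up lookups by email at the cost of extra work
-- on every INSERT, UPDATE, and DELETE of users.
CREATE INDEX idx_users_email ON users (email);

-- If query plans never use it, reclaim the write overhead:
DROP INDEX idx_users_email;
```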

B-Tree Indexes

The B-tree (specifically B+ tree) is the default index type in PostgreSQL, MySQL, SQL Server, and virtually every relational database. It organizes data in a balanced tree structure where every leaf node is the same distance from the root.

A B+ tree with a branching factor of 500 can index 8 TB of data with only 4 levels. That means any lookup requires at most 4 disk I/O operations, regardless of table size. The time complexity is O(log n), which grows very slowly.

B-trees support equality lookups (WHERE id = 42), range queries (WHERE price BETWEEN 10 AND 50), prefix matching (WHERE name LIKE 'Jo%'), and sorting (ORDER BY created_at). This versatility is why they are the default.
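As a sketch, each of these query shapes can be served by a B-tree index (table and column names are illustrative):

```sql
CREATE INDEX idx_products_price ON products (price);

SELECT * FROM products WHERE price = 42;               -- equality
SELECT * FROM products WHERE price BETWEEN 10 AND 50;  -- range
SELECT * FROM products ORDER BY price;                 -- sorting

-- Prefix matching needs an index on the text column. (In PostgreSQL,
-- a plain B-tree serves LIKE 'Jo%' only under the C collation;
-- otherwise declare the index with text_pattern_ops.)
CREATE INDEX idx_products_name ON products (name text_pattern_ops);
SELECT * FROM products WHERE name LIKE 'Jo%';          -- prefix
```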

graph TD
    R["Root Node<br/>[30 | 70]"] --> L1["[10 | 20]"]
    R --> L2["[40 | 50 | 60]"]
    R --> L3["[80 | 90]"]
    L1 --> D1["Rows: 1-9"]
    L1 --> D2["Rows: 10-19"]
    L1 --> D3["Rows: 20-29"]
    L2 --> D4["Rows: 30-39"]
    L2 --> D5["Rows: 40-49"]
    L2 --> D6["Rows: 50-59"]
    L2 --> D7["Rows: 60-69"]
    L3 --> D8["Rows: 70-79"]
    L3 --> D9["Rows: 80-89"]
    L3 --> D10["Rows: 90-99"]
    style R fill:#222221,stroke:#c8a882,color:#ede9e3
    style L1 fill:#222221,stroke:#6b8f71,color:#ede9e3
    style L2 fill:#222221,stroke:#6b8f71,color:#ede9e3
    style L3 fill:#222221,stroke:#6b8f71,color:#ede9e3

The root node contains separator keys that direct the search. Internal nodes narrow the range. Leaf nodes contain the actual pointers to table rows. In a B+ tree, leaf nodes are also linked together, which makes range scans efficient: once you find the starting point, you walk the linked list instead of re-traversing the tree.

Hash Indexes

A hash index applies a hash function to the column value and stores the result in a hash table. Lookups are O(1) on average, which is faster than B-tree's O(log n). PostgreSQL benchmarks show hash indexes can be 10-22% faster than B-trees for pure equality lookups.

The tradeoff is severe: hash indexes support only equality comparisons (WHERE id = 42). They cannot do range queries, sorting, or prefix matching. If you need any of those operations on the column, a hash index is useless.
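A minimal PostgreSQL sketch, with an illustrative `sessions` table:

```sql
-- Hash indexes are declared with USING HASH.
CREATE INDEX idx_sessions_token ON sessions USING HASH (token);

SELECT * FROM sessions WHERE token = 'abc123';  -- can use the hash index
SELECT * FROM sessions WHERE token > 'abc';     -- cannot: no range support
```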

GIN and GiST Indexes

PostgreSQL offers two specialized index types for complex data. GIN (Generalized Inverted Index) is built for full-text search, arrays, and JSONB. It creates an entry for every element inside a composite value. GIN lookups are about 3x faster than GiST for text search, but GIN indexes take about 3x longer to build.

GiST (Generalized Search Tree) handles geometric data, ranges, and nearest-neighbor queries. GiST indexes are lossy for text search (they may return false positives that require a recheck), but they update faster than GIN, making them better for frequently changing data.
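Illustrative PostgreSQL declarations for both index types (table and column names are assumptions):

```sql
-- GIN for full-text search over a document body:
CREATE INDEX idx_docs_body ON docs
    USING GIN (to_tsvector('english', body));

-- GIN for containment queries on a JSONB column:
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- GiST for range overlap, e.g. booking conflicts on a tsrange column:
CREATE INDEX idx_bookings_during ON bookings USING GIST (during);
```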

Index Comparison

| Dimension | B-Tree | Hash | GIN | GiST |
|---|---|---|---|---|
| Lookup time | O(log n) | O(1) | O(log n) per element | O(log n) |
| Range queries | Yes | No | No | Yes |
| Sorting | Yes | No | No | No |
| Full-text search | No | No | Yes (preferred) | Yes (lossy) |
| JSONB/Array | Limited | No | Yes | No |
| Geometric/Spatial | No | No | No | Yes |
| Build speed | Moderate | Fast | Slow (~3x B-tree) | Moderate |
| Write overhead | Moderate | Low | High | Moderate |
| Best for | General purpose | Equality-only lookups | Full-text, JSONB, arrays | Geometry, ranges, nearest-neighbor |

Composite Indexes and Column Order

A composite index includes multiple columns. The database sorts entries first by the leftmost column, then by the second column within each group, and so on. This ordering has a critical consequence called the leftmost prefix rule.

The leftmost prefix rule: a composite index on (A, B, C) can serve queries on A, on (A, B), and on (A, B, C). It cannot efficiently serve queries on B alone, C alone, or (B, C). The index must be read from left to right without skipping columns.

Consider an index on (user_id, status, created_at). This index efficiently supports:

  - WHERE user_id = ?
  - WHERE user_id = ? AND status = ?
  - WHERE user_id = ? AND status = ? ORDER BY created_at

It does NOT efficiently support WHERE status = 'pending' alone, because status is not the leftmost column.
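A sketch of the rule in SQL, against a hypothetical orders table:

```sql
CREATE INDEX idx_orders_user_status_created
    ON orders (user_id, status, created_at);

-- Can use the index (a leftmost prefix is present):
SELECT * FROM orders WHERE user_id = 7;
SELECT * FROM orders WHERE user_id = 7 AND status = 'pending';

-- Cannot use it efficiently (skips the leftmost column):
SELECT * FROM orders WHERE status = 'pending';
SELECT * FROM orders WHERE status = 'pending' AND created_at > now() - interval '1 day';
```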

Index Selectivity

Selectivity measures how well an index narrows down results. A column with 1 million unique values in a 1 million row table has selectivity of 1.0 (perfect). A boolean column with only two values has selectivity close to 0 (terrible).

Put highly selective columns first in composite indexes. If user_id narrows 10 million rows to 50, and status narrows them to 5 million, always lead with user_id.
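You can estimate selectivity directly before choosing the column order; a sketch using PostgreSQL cast syntax (the `orders` columns are illustrative):

```sql
-- Selectivity = distinct values / total rows.
-- Values near 1.0 are highly selective; values near 0 are poor
-- candidates to lead a composite index.
SELECT COUNT(DISTINCT user_id)::numeric / COUNT(*) AS user_id_selectivity,
       COUNT(DISTINCT status)::numeric  / COUNT(*) AS status_selectivity
FROM orders;
```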

Covering Indexes

A covering index contains all columns needed by a query. The database can answer the query entirely from the index without touching the table data. This eliminates the random I/O cost of jumping from the index to the table rows.

In PostgreSQL, you use INCLUDE to add non-searchable columns to the index:

CREATE INDEX idx_orders_user_status
ON orders (user_id, status)
INCLUDE (created_at, total);

Now a query selecting user_id, status, created_at, total with a WHERE on user_id and status never touches the heap. The savings are substantial for wide tables with many columns.
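To confirm the heap is skipped, inspect the plan. With the index above, PostgreSQL should report an Index Only Scan, though exact plan output varies by version and table statistics:

```sql
EXPLAIN (ANALYZE)
SELECT user_id, status, created_at, total
FROM orders
WHERE user_id = 42 AND status = 'shipped';
-- Look for "Index Only Scan using idx_orders_user_status" and a low
-- "Heap Fetches" count (regular VACUUMs keep the visibility map fresh).
```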

The Hidden Write Cost

Every index must be updated on every write operation. An INSERT into a table with 5 indexes means 6 write operations: one for the table and one for each index. An UPDATE that modifies an indexed column triggers an index delete and an index insert for each affected index.

This cost is not theoretical. On write-heavy workloads, excessive indexing is a common cause of degraded performance. The solution is to index deliberately: only create indexes that serve actual queries, and drop indexes that query plans never use.
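In PostgreSQL, the statistics views make unused indexes easy to find; a sketch:

```sql
-- Indexes that no query plan has used since statistics were last
-- reset are candidates for removal.
SELECT schemaname, relname AS table_name,
       indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;
```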

Query Time vs. Table Size

The following chart shows how query response time changes as table size grows, comparing a full table scan against an indexed lookup.

The table scan time grows linearly with row count. The indexed lookup barely moves. At 10 million rows, the difference is four orders of magnitude.

Systems Thinking Lens

Indexes are a classic balancing loop. Adding an index improves read performance, which encourages more queries, which eventually leads to more indexes, which degrades write performance. The leverage point is not "add more indexes" but "understand your query patterns." A single well-designed composite index often replaces three or four single-column indexes while also reducing write overhead.

graph LR
    A["Slow Query<br/>Detected"] --> B["Add Index"]
    B --> C["Reads Faster"]
    C --> D["More Queries<br/>Written"]
    D --> E["More Indexes<br/>Requested"]
    E --> F["Write Performance<br/>Degrades"]
    F --> G["Review &<br/>Consolidate"]
    G --> B
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style F fill:#222221,stroke:#c8a882,color:#ede9e3

Assignment

You have this query running thousands of times per second:

SELECT * FROM orders
WHERE user_id = ?
  AND status = 'pending'
ORDER BY created_at DESC;

Design the optimal composite index for this query. Answer these questions:

  1. What columns should the index include, and in what order?
  2. Why does column order matter here? What happens if you put status first instead of user_id?
  3. Would making this a covering index (using INCLUDE) help? What columns would you include?
  4. What is the write cost of this index on a table that receives 500 inserts per second?