Module 8: Real-World Case Studies II

The Problem

The web contains over 100 billion pages. A user types a few words and expects relevant results in under 300 milliseconds. The system must crawl those pages, understand their content, store it in a format optimized for retrieval, and rank results by relevance. Every component operates at a scale where naive approaches collapse.

A search engine is not one system. It is a pipeline of specialized subsystems, each solving a different problem, connected end-to-end.

The Pipeline

At the highest level, a search engine is a five-stage pipeline. Raw web pages enter one end. Ranked results exit the other.

```mermaid
graph LR
    A[URL Frontier] --> B[Crawler]
    B --> C[Parser]
    C --> D[Indexer]
    D --> E[Ranking Engine]
    E --> F[Query Server]
    F --> G[User Results]
```

Each stage has distinct compute characteristics. Crawling is I/O-bound and network-limited. Parsing is CPU-bound. Indexing is both CPU and storage-intensive. Ranking combines precomputed signals with real-time scoring. Serving demands low latency above all else.

Crawling: Discovery at Scale

The crawler's job is to fetch web pages. It starts with a seed list of known URLs, fetches each page, extracts links from the HTML, and adds new URLs to the frontier. This process runs continuously across thousands of machines.

Three problems dominate crawler design:

  1. Politeness: the crawler must rate-limit requests per host so it does not overwhelm the sites it fetches from.
  2. Prioritization: crawl budget is finite, so the frontier must decide which URLs to fetch next and which to defer.
  3. Freshness: pages change after they are crawled, so known URLs must be revisited before their indexed copies go stale.

Key insight: The URL frontier is the brain of the crawler. Its prioritization strategy determines what the search engine knows and how current that knowledge is. A poorly designed frontier wastes crawl budget on low-value pages while high-value content goes stale.
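The frontier's core behavior can be sketched as a priority queue paired with a seen-set. This `URLFrontier` class and its URLs are hypothetical; a production frontier would also enforce per-host politeness delays and persist its state to disk:

```python
import heapq

class URLFrontier:
    """Minimal priority-based URL frontier (illustrative sketch only)."""

    def __init__(self):
        self._heap = []      # (priority, tie_breaker, url) entries
        self._seen = set()   # avoids re-enqueueing known URLs
        self._counter = 0    # tie-breaker keeps ordering stable for equal priorities

    def push(self, url, priority):
        """Enqueue a URL; lower priority values are fetched first."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the highest-priority URL to crawl next."""
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = URLFrontier()
frontier.push("https://example.com/news", priority=0)     # high value: crawl soon
frontier.push("https://example.com/archive", priority=5)  # low value: can wait
frontier.push("https://example.com/news", priority=0)     # duplicate, ignored
print(frontier.pop())  # https://example.com/news
```

The seen-set doubles as duplicate suppression; at real scale it would be a Bloom filter or a persistent key-value store rather than an in-memory set.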

The Inverted Index

Once pages are crawled and parsed, the content must be stored in a structure optimized for search. A naive approach (scan every document for every query) is impossibly slow at scale. The solution is the inverted index.

A forward index maps documents to the words they contain. An inverted index flips this: it maps each word to the list of documents containing that word. This list is called a posting list.

```mermaid
graph TD
    subgraph "Inverted Index"
        T1["Term: 'database'"] --> P1["Doc 4, Doc 17, Doc 203"]
        T2["Term: 'sharding'"] --> P2["Doc 17, Doc 89"]
        T3["Term: 'replication'"] --> P3["Doc 4, Doc 89, Doc 112"]
    end
    subgraph "Query: database sharding"
        Q["Intersect posting lists"] --> R["Doc 17"]
    end
```

Each entry in a posting list typically stores more than just the document ID. It includes the term frequency (how many times the word appears), the positions of each occurrence (enabling phrase queries), and metadata like whether the term appeared in a title or heading. This additional information supports both ranking and advanced query types like "exact phrase" matching.
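A minimal sketch of such an index, storing per-document term positions (term frequency falls out as the length of each position list). The documents and IDs here are invented for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} — a minimal posting format."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    4: "replication keeps a database available",
    17: "database sharding splits a database",
    89: "sharding and replication",
}
index = build_inverted_index(docs)
print(sorted(index["database"]))    # [4, 17]
print(len(index["database"][17]))   # term frequency in doc 17: 2
print(index["sharding"][17])        # positions enable phrase queries: [1]
```

Position lists are what make a phrase query like "database sharding" checkable: the engine verifies that "sharding" occurs at a position one greater than some occurrence of "database".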

A query for "database sharding" becomes an intersection of two posting lists. The engine retrieves the posting list for "database" and the posting list for "sharding," then finds documents that appear in both lists. Sorted posting lists make this intersection efficient using a merge-join approach.
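The merge-join intersection can be sketched as follows, assuming each posting list is a sorted list of document IDs (the IDs match the diagram above):

```python
def intersect(postings_a, postings_b):
    """Merge-join intersection of two sorted doc-ID lists, O(m + n)."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])  # document contains both terms
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the pointer behind the smaller ID
        else:
            j += 1
    return result

database = [4, 17, 203]
sharding = [17, 89]
print(intersect(database, sharding))  # [17]
```

Because both lists are sorted, each pointer only moves forward, so the cost is linear in the total list length rather than quadratic.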

Ranking: TF-IDF, BM25, and PageRank

Retrieving documents is not enough. The engine must decide which documents are most relevant. Three algorithms form the foundation of modern ranking.

| Algorithm | What It Measures | Strengths | Limitations |
|---|---|---|---|
| TF-IDF | Term frequency weighted by inverse document frequency | Simple, interpretable, effective baseline | No term saturation; long documents unfairly boosted |
| BM25 | TF-IDF with saturation curve and document length normalization | Industry standard (Elasticsearch, Solr, Lucene default) | Tuning parameters (k1, b) require experimentation |
| PageRank | Link graph authority (how many quality pages link to this one) | Query-independent quality signal; resists keyword stuffing | Expensive to compute; can be gamed with link farms |

TF-IDF scores a document by how often a query term appears in it (term frequency), discounted by how common that term is across all documents (inverse document frequency). The word "the" appears everywhere, so it contributes almost nothing. The word "sharding" appears in far fewer documents, so its presence is a stronger signal.
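A sketch of the textbook TF-IDF formula (real engines use smoothed variants, and the toy corpus here is invented):

```python
import math

def tf_idf(term, doc, corpus):
    """Raw term frequency weighted by inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)       # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "database", "sharding", "guide"],
    ["the", "cooking", "guide"],
    ["the", "gardening", "guide"],
]
print(tf_idf("the", corpus[0], corpus))       # 0.0 — appears in every document
print(tf_idf("sharding", corpus[0], corpus))  # log(3/1): a rare, strong signal
```

"the" scores zero because its IDF is log(3/3) = 0, exactly the behavior described above.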

BM25 improves on TF-IDF with two key additions. First, it applies a saturation function: the tenth occurrence of a term in a document adds less relevance than the first. Second, it normalizes by document length so that a 500-word page and a 50,000-word page are compared fairly. BM25 is the default ranking function in Elasticsearch and Apache Lucene.
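A sketch of the BM25 score for a single term, using the common k1 and b defaults. Production implementations such as Lucene use a slightly different IDF formulation; the documents here are invented to show the saturation effect:

```python
import math

def bm25_score(term, doc, corpus, k1=1.2, b=0.75):
    """Simplified single-term BM25: saturating TF, length-normalized."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    n = len(corpus)
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
    avg_len = sum(len(d) for d in corpus) / n
    # Saturation: the denominator grows with tf, so each extra
    # occurrence contributes less; b controls length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return idf * norm

corpus = [["sharding", "basics"], ["cooking", "recipes"], ["gardening", "tips"]]
doc_once = ["sharding"] + ["filler"] * 9   # term appears once
doc_many = ["sharding"] * 10               # term appears ten times
s1 = bm25_score("sharding", doc_once, corpus)
s10 = bm25_score("sharding", doc_many, corpus)
print(s1 < s10 < 10 * s1)  # True: more occurrences help, with diminishing returns
```

The comparison at the end is the saturation curve in action: ten occurrences score higher than one, but nowhere near ten times higher.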

PageRank operates on the link graph, not on document content. Each page starts with equal rank. On each iteration, a page distributes its rank equally among the pages it links to. After many iterations, pages linked by many high-quality sources accumulate higher rank. Google's original 1998 paper used PageRank as a query-independent authority signal layered on top of content-based scoring.
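The iteration described above can be sketched as follows; the four-page link graph is invented, and real implementations operate on billions of pages with sparse-matrix methods:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over an adjacency dict {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # every page starts with equal rank
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}  # teleport component
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
            else:  # distribute rank equally among the pages it links to
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "c" — three in-links, the most authority
```

The damping factor (0.85 in the original paper) models a surfer who occasionally jumps to a random page, which keeps the iteration from getting trapped in link cycles.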

Fuzzy Search

Users misspell words. They type "databse" instead of "database." A strict inverted index lookup returns nothing for a misspelled term. Fuzzy search bridges this gap.

The standard approach computes edit distance (Levenshtein distance) between the query term and terms in the index. An edit distance of 1 means one character insertion, deletion, or substitution separates the two strings. "databse" is edit distance 1 from "database."
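The classic dynamic-programming computation of Levenshtein distance, kept to two rows of the DP table:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(edit_distance("databse", "database"))  # 1: a single insertion
```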

At scale, computing edit distance against every term in the index is too slow. Practical systems use n-gram indexes (breaking terms into overlapping character sequences) or finite state transducers to efficiently find candidate terms within a given edit distance. Elasticsearch supports fuzzy queries natively using automaton-based matching.
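The n-gram approach can be sketched as below. This is a simplified stand-in for the automaton-based matching real engines use: candidates that share enough n-grams with the query would then be verified with an exact edit-distance check. Function names and thresholds are illustrative:

```python
from collections import defaultdict

def ngrams(term, n=3):
    """Overlapping character n-grams, e.g. 'data' -> {'dat', 'ata'}."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def build_ngram_index(terms, n=3):
    """Map each n-gram to the set of index terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index

def fuzzy_candidates(query, index, n=3, min_shared=2):
    """Index terms sharing at least `min_shared` n-grams with the query."""
    counts = defaultdict(int)
    for gram in ngrams(query, n):
        for term in index.get(gram, ()):
            counts[term] += 1
    return {t for t, c in counts.items() if c >= min_shared}

index = build_ngram_index(["database", "sharding", "replication"])
print(fuzzy_candidates("databse", index))  # {'database'}
```

The misspelling "databse" shares the trigrams "dat", "ata", and "tab" with "database", so it surfaces as a candidate without scanning the whole term dictionary.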

Elasticsearch Sharding

A single-machine inverted index cannot hold the entire web. Elasticsearch distributes the index across a cluster using sharding. Each index is split into primary shards, and each shard is a self-contained Lucene index with its own inverted index.

When a query arrives, a coordinator node fans it out to every shard in parallel. Each shard searches its local inverted index, scores candidates using BM25, and returns its top K results. The coordinator merges these partial results into a global top-K list. This scatter-gather pattern enables horizontal scaling: more shards means more documents indexed, and more replicas means higher read throughput.
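The coordinator's merge step can be sketched as a k-way merge of the per-shard results. The shard data and helper name are hypothetical; each shard is assumed to return its local hits already sorted by descending score:

```python
import heapq

def global_top_k(shard_results, k):
    """Merge per-shard (score, doc_id) lists into a global top-k.

    Each shard contributes only its local top-k, so the coordinator
    merges a handful of small sorted lists, not the full indexes.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(merged)[:k]

shard_1 = [(9.2, "doc17"), (4.1, "doc203")]
shard_2 = [(7.7, "doc89"), (6.5, "doc4")]
print(global_top_k([shard_1, shard_2], k=3))
# [(9.2, 'doc17'), (7.7, 'doc89'), (6.5, 'doc4')]
```

This is why shard count affects latency: the query is only as fast as the slowest shard, and the coordinator's merge work grows with the fan-out.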

Routing determines which shard holds a given document. By default, Elasticsearch hashes the document ID and takes the modulo of the number of primary shards. This ensures even distribution. Replicas of each shard live on different nodes, providing fault tolerance. If a node dies, its replicas on other nodes continue serving queries without interruption.
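The routing rule itself is one line. Elasticsearch actually hashes the routing value (the document ID by default) with murmur3; `crc32` stands in here to keep the sketch stdlib-only:

```python
import zlib

def route(doc_id, num_primary_shards):
    """Deterministically map a document ID to a primary shard."""
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# Every node computes the same shard for the same document ID,
# so both indexing and lookup agree on where a document lives.
print(route("doc-17", 5) == route("doc-17", 5))  # True
```

This determinism is also why shard count is fixed at index creation: changing the modulus would silently remap every existing document to a different shard.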

| Concept | Purpose | Trade-off |
|---|---|---|
| Primary Shard | Holds a partition of the index; handles writes | Shard count is fixed at index creation |
| Replica Shard | Copy of primary; serves read traffic | More replicas = more storage, higher read throughput |
| Coordinator Node | Receives query, fans out, merges results | Bottleneck if too many shards (fan-out overhead) |
| Segment Merging | Lucene merges small segments into larger ones | Improves query speed but consumes I/O during merge |

Putting It Together

A complete search engine combines all of these components into a feedback-driven system. The crawler feeds the indexer. The indexer builds inverted indexes distributed across shards. The ranking engine combines content signals (BM25) with authority signals (PageRank) and freshness signals (crawl recency). The query server orchestrates distributed retrieval and returns results under a latency budget.

Each component is a system in itself, with its own scaling challenges, failure modes, and optimization surfaces. The art of search engine design is in how these components interact, not in any single component alone.

Assignment

A brand new webpage is published on the internet. Trace its journey from completely unknown to appearing in search results. For each stage, answer:

  1. How does the crawler discover the URL? What data structure manages crawl priority?
  2. Once fetched, how is the page's content processed and stored? What data structure enables fast keyword lookup?
  3. When a user searches for a term on that page, how does the system retrieve and rank the result? Name the specific algorithms involved.
  4. If the search index is distributed across 100 shards, describe how the query reaches the right shard and how partial results are merged.