Module 8: Real-World Case Studies II

The Problem: Infinite Content, Finite Attention

A user follows 500 friends, 200 pages, and 50 groups. In the last 24 hours, those sources produced 3,000 new posts. The user will scroll through maybe 50 before closing the app. The feed ranking system decides which 50 posts appear first, and in what order.

This is not a sorting problem. It is a prediction problem. The system must estimate, for each candidate post, how likely this specific user is to engage with it. Engagement means different things: a like, a comment, a share, a click, or simply spending five seconds reading. Each action has a different weight. A share is worth more than a like. A "hide post" signal is strongly negative.

The original Facebook approach, called EdgeRank (retired around 2013), used a simple formula: affinity × weight × decay. How close is the user to the author? How valuable is the content type? How fresh is the post? This worked for hundreds of millions of users. It broke at billions.
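
A minimal sketch of an EdgeRank-style score. The affinity values, content weights, and 24-hour half-life below are illustrative stand-ins, not Facebook's actual numbers:

```python
import math

def edgerank_score(affinity: float, content_weight: float,
                   post_age_hours: float, half_life_hours: float = 24.0) -> float:
    """EdgeRank-style score: affinity x weight x time decay.

    affinity:       how close the viewer is to the author (0..1)
    content_weight: value of the content type (e.g. photo > link)
    decay:          exponential falloff with post age (half-life is illustrative)
    """
    decay = 0.5 ** (post_age_hours / half_life_hours)
    return affinity * content_weight * decay

# A fresh photo from a close friend outranks a day-old link
# from a weakly connected source.
close_fresh = edgerank_score(affinity=0.9, content_weight=1.5, post_age_hours=1)
weak_stale  = edgerank_score(affinity=0.2, content_weight=0.8, post_age_hours=24)
```

The formula's weakness is visible even in the sketch: the three factors multiply independently, so it cannot learn interactions such as "this user ignores videos from pages but watches videos from friends."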

Key insight: Feed ranking is a real-time multi-objective optimization problem. The system simultaneously maximizes engagement, content diversity, user satisfaction, and platform health, and these objectives often conflict.

Ranking Signals

Modern feed ranking systems use hundreds of signals. They fall into a few categories.

| Signal Category | Examples | Weight | Source |
|---|---|---|---|
| Social graph | Friendship strength, interaction frequency, mutual friends | High | Graph DB, interaction logs |
| Content type preference | User prefers videos over text, images over links | Medium-High | User behavior model |
| Recency | Post age, time since last refresh | Medium | Post metadata |
| Engagement velocity | How fast the post is accumulating likes/comments | Medium | Real-time counters |
| Author history | Author's average engagement rate, post frequency | Medium | Author profile service |
| Negative signals | Hides, unfollows, "see less" clicks, spam reports | High (negative) | User feedback logs |
| Content understanding | NLP topic extraction, image classification, link quality | Low-Medium | ML classifiers |
| Session context | Time of day, device type, network speed, scroll depth | Low | Client telemetry |

No single signal dominates. The model learns non-linear combinations. A video post from a close friend at 8 PM on mobile might rank highest. The same video from a page the user barely interacts with, at 3 AM, ranks much lower.

The Ranking Pipeline

Scoring 3,000 candidates with a full neural network for every user request is too expensive. The standard approach is a multi-stage funnel that progressively narrows candidates while increasing model complexity.

```mermaid
graph TD
    A[Candidate Generation<br/>~3,000 posts] --> B[First-Pass Ranking<br/>Lightweight model<br/>~500 posts]
    B --> C[Main Ranking<br/>Deep neural network<br/>~100 posts]
    C --> D[Re-Ranking<br/>Diversity + policy rules<br/>~50 posts]
    D --> E[Final Feed<br/>~20 posts per page]
    F[User Profile] --> B
    F --> C
    G[Social Graph] --> A
    H[Content Features] --> B
    H --> C
    I[Real-time Signals] --> C
    I --> D
```

Stage 1: Candidate generation. Pull all eligible posts from the user's social graph. This is a fan-out-on-read operation. For each user, query their friends list, followed pages, and joined groups, then fetch recent posts from each. Pre-computed friend lists and post indexes make this fast. Output: ~3,000 candidates.
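
The fan-out-on-read step can be sketched as follows. The in-memory dicts stand in for the graph service and the pre-computed per-source post indexes:

```python
from itertools import chain

# Hypothetical stand-ins for the social graph and post-index services.
FRIENDS = {"u1": ["alice", "bob"]}
PAGES   = {"u1": ["page1"]}
GROUPS  = {"u1": ["group1"]}
RECENT_POSTS = {  # source_id -> post IDs from the last 24h (pre-indexed)
    "alice": ["post1", "post2"],
    "bob":   ["post3"],
    "page1": ["post4"],
    "group1": [],
}

def generate_candidates(user_id: str) -> list[str]:
    """Fan-out-on-read: collect recent posts from every followed source."""
    sources = chain(FRIENDS[user_id], PAGES[user_id], GROUPS[user_id])
    candidates: list[str] = []
    for source in sources:
        candidates.extend(RECENT_POSTS.get(source, []))
    return candidates

candidates = generate_candidates("u1")  # ~3,000 in production; 4 in this toy graph
```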

Stage 2: First-pass ranking. A lightweight model (logistic regression or a shallow neural net) scores each candidate using pre-computed features. This model runs in under 1ms per candidate. It eliminates obvious non-starters: posts in languages the user does not read, content types they never engage with, posts from sources they rarely interact with. Output: ~500 candidates.
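
A first-pass ranker of this kind reduces to a dot product and a sigmoid per candidate. The feature names and weights below are invented for illustration:

```python
import math

# Illustrative pre-computed features per candidate:
# (source_affinity, content_type_match, language_match). Weights are made up.
WEIGHTS = (2.0, 1.5, 3.0)
BIAS = -3.0

def first_pass_score(features: tuple[float, float, float]) -> float:
    """Logistic regression: dot product + sigmoid -- microseconds per post."""
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

def first_pass_filter(candidates: dict, keep: int = 500) -> list[str]:
    """Keep the top-k candidates by lightweight score."""
    scored = sorted(candidates.items(),
                    key=lambda kv: first_pass_score(kv[1]), reverse=True)
    return [post_id for post_id, _ in scored[:keep]]

demo = {"p1": (0.9, 1.0, 1.0),   # strong source, preferred type, right language
        "p2": (0.1, 0.2, 0.0)}   # weak source, wrong language
top = first_pass_filter(demo, keep=1)
```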

Stage 3: Main ranking. A deep neural network scores the remaining candidates using the full feature set, including real-time engagement signals. This is the most compute-intensive stage, typically taking 10-50ms for the full batch. The model predicts multiple outcomes simultaneously: probability of like, comment, share, click, and hide. These predictions are combined into a single score using learned weights. Output: ~100 candidates.
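
Collapsing the per-event predictions into one score can be sketched like this. The event weights are illustrative, not the learned values a production system would use:

```python
# Illustrative combination weights: a share is worth more than a like,
# and a predicted hide is strongly negative.
EVENT_WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 8.0,
                 "click": 0.5, "hide": -20.0}

def combined_score(predictions: dict[str, float]) -> float:
    """Collapse per-event probabilities into a single ranking score."""
    return sum(EVENT_WEIGHTS[event] * p for event, p in predictions.items())

post_a = combined_score({"like": 0.30, "comment": 0.05, "share": 0.02,
                         "click": 0.10, "hide": 0.001})
post_b = combined_score({"like": 0.40, "comment": 0.02, "share": 0.01,
                         "click": 0.20, "hide": 0.05})
# post_a ranks above post_b despite fewer predicted likes:
# the higher hide probability drags post_b down.
```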

Stage 4: Re-ranking. Policy and diversity rules adjust the final ordering. No more than 3 posts from the same source in the top 20. Boost content types that are underrepresented. Demote engagement bait. Insert ads at designated positions. Output: the final feed page.
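
The per-source cap can be enforced with a single greedy pass over the model's ordering, without re-running the ML model. A minimal sketch of that one rule:

```python
from collections import Counter

def rerank(ranked_posts: list[tuple[str, str]],
           max_per_source: int = 3, top_n: int = 20) -> list[tuple[str, str]]:
    """Enforce a per-source cap within the top N, preserving score order.

    ranked_posts: (post_id, source_id) pairs, already sorted by model score.
    Posts that would exceed the cap are deferred past the top N, not dropped.
    """
    per_source: Counter = Counter()
    selected, deferred = [], []
    for post_id, source in ranked_posts:
        if len(selected) < top_n and per_source[source] >= max_per_source:
            deferred.append((post_id, source))
            continue
        selected.append((post_id, source))
        per_source[source] += 1
    return selected + deferred
```

The other rules (content-type boosts, engagement-bait demotion, ad slots) follow the same pattern: cheap adjustments to an already-scored list.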

High-Level System Architecture

```mermaid
graph LR
    subgraph Client
        APP[Mobile/Web Client]
    end
    subgraph Edge
        CDN[CDN]
        LB[Load Balancer]
    end
    subgraph "Feed Service"
        FS[Feed Assembler]
        CR[Candidate Retriever]
        RK[Ranking Service]
        AB[A/B Config Service]
    end
    subgraph "Data Layer"
        SG[(Social Graph<br/>TAO / Graph DB)]
        PS[(Post Store<br/>Sharded MySQL)]
        FC[(Feature Cache<br/>Redis / Memcached)]
        ML[ML Model Server<br/>GPU Cluster]
        RT[(Real-time Counters<br/>Engagement Stream)]
    end
    APP --> CDN --> LB --> FS
    FS --> CR
    CR --> SG
    CR --> PS
    FS --> RK
    RK --> FC
    RK --> ML
    RK --> RT
    FS --> AB
```

The Feed Assembler is the orchestrator. It calls the Candidate Retriever to gather posts, passes them to the Ranking Service for scoring, and checks the A/B Config Service to determine which ranking model variant this user sees. The ML Model Server runs on GPUs and serves predictions via gRPC with batched inference to maximize throughput.

Pre-Computation vs. Real-Time

Not everything can be computed at request time. The system splits work between offline pre-computation and online real-time scoring.

Pre-computed (offline): User embeddings (updated hourly), friend strength scores (updated daily), content embeddings (computed at post creation), author quality scores (updated hourly), and content safety classifications (computed at post creation).

Real-time (online): Engagement velocity (likes in last 5 minutes), session context (device, time, scroll position), recency decay, and interaction between user state and post features.

Pre-computed features are stored in a feature store (often Redis or a dedicated system like Feast) and looked up at serving time. This keeps online latency low. The main ranking model combines pre-computed embeddings with real-time signals to produce the final score.
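
A sketch of the serving-time merge, with a plain dict and an invented key scheme standing in for the feature store. In production the lookup would be a single batched call (e.g. a Redis MGET), one round-trip for the whole candidate set rather than one per post:

```python
# Hypothetical feature store contents (pre-computed offline).
FEATURE_STORE = {
    "user:u1:embedding": [0.1, 0.3],
    "post:p1:embedding": [0.2, 0.4],
    "post:p2:embedding": [0.0, 0.9],
}

def batch_lookup(keys: list[str]) -> list:
    """Batched fetch of pre-computed features (stand-in for a single MGET)."""
    return [FEATURE_STORE[k] for k in keys]

def features_for_batch(user_id: str, post_ids: list[str],
                       realtime: dict) -> list[list[float]]:
    """Merge pre-computed embeddings with per-request real-time signals."""
    keys = [f"user:{user_id}:embedding"] + \
           [f"post:{p}:embedding" for p in post_ids]
    user_emb, *post_embs = batch_lookup(keys)  # one round-trip for everything
    return [user_emb + emb + [realtime[p]["likes_last_5m"]]
            for p, emb in zip(post_ids, post_embs)]
```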

A/B Testing Infrastructure

Feed ranking changes are never deployed blindly. Every change goes through A/B testing. At Facebook's scale, this means running hundreds of experiments simultaneously, each affecting a slice of users.

The A/B testing infrastructure must guarantee several properties. First, experiment isolation: when a user is enrolled in experiments A and B simultaneously, the two must not contaminate each other's results through interaction effects. Second, statistical power: each experiment needs enough users to detect small changes in its metrics. Third, guardrail metrics: even if an experiment improves engagement, it must not degrade user satisfaction surveys, session length, or content diversity below a threshold.

The ranking pipeline is parameterized so that different model versions, feature sets, or re-ranking rules can be swapped per user without changing the serving infrastructure. The A/B Config Service tells the Feed Assembler which configuration to use for each request, based on the user's experiment assignments.
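
Deterministic hash-based bucketing is one common way to implement stable per-user assignments. The experiment name and the 10% split in this sketch are invented:

```python
import hashlib

def experiment_bucket(user_id: str, experiment: str,
                      num_buckets: int = 100) -> int:
    """Deterministic, per-experiment bucketing.

    Hashing user_id together with the experiment name keeps assignments
    stable across requests while decorrelating buckets between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def ranking_config(user_id: str) -> dict:
    """Sketch: 10% of users get a new model variant; the rest get control."""
    config = {"model": "ranker_v1", "source_cap": 3}
    if experiment_bucket(user_id, "new_ranker") < 10:
        config["model"] = "ranker_v2"
    return config
```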

The 200ms Budget

From the moment a user opens the app to the moment the first 20 posts render, the total time budget is roughly 200ms (excluding network transit). Here is how it breaks down:

| Stage | Latency Budget | Operation |
|---|---|---|
| Candidate retrieval | 30-50ms | Query social graph + post indexes |
| Feature lookup | 20-30ms | Batch fetch from feature store |
| First-pass ranking | 10-20ms | Lightweight model on ~3,000 posts |
| Main ranking | 30-50ms | Deep model on ~500 posts (GPU) |
| Re-ranking | 5-10ms | Diversity and policy rules |
| Response assembly | 10-20ms | Hydrate posts with media URLs, author info |
| Total server-side | 105-180ms | |

Every millisecond matters. Feature lookups are batched into a single round-trip. The main ranking model uses batched GPU inference (score 500 posts in one forward pass, not 500 individual calls). The response includes pre-fetched media URLs so the client can begin downloading images before the user scrolls to them.
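
The batching win can be illustrated with a toy two-layer model in NumPy standing in for the GPU model server; the layer sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32))  # toy two-layer network with random weights
W2 = rng.normal(size=(32, 1))

def score_batch(features: np.ndarray) -> np.ndarray:
    """One forward pass over the whole batch: (500, 64) in, (500,) out.

    On a GPU this is a handful of kernel launches instead of 500 separate
    calls, which is where the batching win comes from.
    """
    hidden = np.maximum(features @ W1, 0.0)  # ReLU
    return (hidden @ W2).ravel()

scores = score_batch(rng.normal(size=(500, 64)))  # all 500 candidates at once
```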

Assignment

A user opens their social media app. They follow 400 friends and 100 pages. In the last 24 hours, 2,500 new posts were created by those sources. The app must show the top 20 posts within 200ms of the request hitting the server.

Design the end-to-end flow:

  1. What happens in the first 50ms? What data is fetched and from where?
  2. How does the first-pass ranker reduce 2,500 candidates to 400? What signals does it use?
  3. The main ranking model runs on a GPU cluster. How do you batch 400 candidates into one inference call? What is the expected latency?
  4. The re-ranker must enforce: no more than 2 posts from the same source in the top 20, at least 3 different content types (text, image, video), and one "discover" post from outside the user's graph. How do you implement these rules without re-running the ML model?
  5. Draw a timeline showing all stages and their latency budgets.