Module 8: Real-World Case Studies II

The Problem: Infinite Content, Finite Attention

A user follows 500 friends, 200 pages, and 50 groups. In the last 24 hours, those sources produced 3,000 new posts. The user will scroll through maybe 50 before closing the app. The feed ranking system decides which 50 posts appear first, and in what order.

This is not a sorting problem. It is a prediction problem. The system must estimate, for each candidate post, how likely this specific user is to engage with it. Engagement means different things: a like, a comment, a share, a click, or simply spending five seconds reading. Each action has a different weight. A share is worth more than a like. A "hide post" signal is strongly negative.

The original Facebook approach, called EdgeRank (retired around 2013), used a simple formula: affinity × weight × decay. How close is the user to the author? How valuable is the content type? How fresh is the post? This worked for hundreds of millions of users. It broke at billions.
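
A minimal sketch of an EdgeRank-style score. The affinity values, content weights, and 24-hour half-life below are illustrative stand-ins, not Facebook's actual numbers:

```python
import math

def edgerank_score(affinity: float, content_weight: float,
                   post_age_hours: float, half_life_hours: float = 24.0) -> float:
    """EdgeRank-style score: affinity x weight x time decay.

    affinity:       how close the viewer is to the author (0..1)
    content_weight: value of the content type (e.g. photo > link)
    decay:          exponential falloff with post age (half-life is illustrative)
    """
    decay = 0.5 ** (post_age_hours / half_life_hours)
    return affinity * content_weight * decay

# A fresh photo from a close friend outranks a day-old link
# from a weakly connected source.
close_fresh = edgerank_score(affinity=0.9, content_weight=1.5, post_age_hours=1)
weak_stale  = edgerank_score(affinity=0.2, content_weight=0.8, post_age_hours=24)
```

The formula's weakness is visible even in the sketch: the three factors multiply independently, so it cannot learn interactions such as "this user ignores videos from pages but watches videos from friends."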

Key insight: Feed ranking is a real-time multi-objective optimization problem. The system simultaneously maximizes engagement, content diversity, user satisfaction, and platform health, and these objectives often conflict.

Ranking Signals

Modern feed ranking systems use hundreds of signals. They fall into a few categories.

| Signal Category | Examples | Weight | Source |
|---|---|---|---|
| Social graph | Friendship strength, interaction frequency, mutual friends | High | Graph DB, interaction logs |
| Content type preference | User prefers videos over text, images over links | Medium-High | User behavior model |
| Recency | Post age, time since last refresh | Medium | Post metadata |
| Engagement velocity | How fast the post is accumulating likes/comments | Medium | Real-time counters |
| Author history | Author's average engagement rate, post frequency | Medium | Author profile service |
| Negative signals | Hides, unfollows, "see less" clicks, spam reports | High (negative) | User feedback logs |
| Content understanding | NLP topic extraction, image classification, link quality | Low-Medium | ML classifiers |
| Session context | Time of day, device type, network speed, scroll depth | Low | Client telemetry |

No single signal dominates. The model learns non-linear combinations. A video post from a close friend at 8 PM on mobile might rank highest. The same video from a page the user barely interacts with, at 3 AM, ranks much lower.

The Ranking Pipeline

Scoring 3,000 candidates with a full neural network for every user request is too expensive. The standard approach is a multi-stage funnel that progressively narrows candidates while increasing model complexity.

```mermaid
graph TD
    A[Candidate Generation<br/>~3,000 posts] --> B[First-Pass Ranking<br/>Lightweight model<br/>~500 posts]
    B --> C[Main Ranking<br/>Deep neural network<br/>~100 posts]
    C --> D[Re-Ranking<br/>Diversity + policy rules<br/>~50 posts]
    D --> E[Final Feed<br/>~20 posts per page]
    F[User Profile] --> B
    F --> C
    G[Social Graph] --> A
    H[Content Features] --> B
    H --> C
    I[Real-time Signals] --> C
    I --> D
```

Stage 1: Candidate generation. Pull all eligible posts from the user's social graph. This is a fan-out-on-read operation. For each user, query their friends list, followed pages, and joined groups, then fetch recent posts from each. Pre-computed friend lists and post indexes make this fast. Output: ~3,000 candidates.
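
The fan-out-on-read step can be sketched as follows. The in-memory dicts stand in for the graph service and the pre-computed per-source post indexes:

```python
from itertools import chain

# Hypothetical stand-ins for the social graph and post-index services.
FRIENDS = {"u1": ["alice", "bob"]}
PAGES   = {"u1": ["page1"]}
GROUPS  = {"u1": ["group1"]}
RECENT_POSTS = {  # source_id -> post IDs from the last 24h (pre-indexed)
    "alice": ["post1", "post2"],
    "bob":   ["post3"],
    "page1": ["post4"],
    "group1": [],
}

def generate_candidates(user_id: str) -> list[str]:
    """Fan-out-on-read: collect recent posts from every followed source."""
    sources = chain(FRIENDS[user_id], PAGES[user_id], GROUPS[user_id])
    candidates: list[str] = []
    for source in sources:
        candidates.extend(RECENT_POSTS.get(source, []))
    return candidates

candidates = generate_candidates("u1")  # ~3,000 in production; 4 in this toy graph
```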

Stage 2: First-pass ranking. A lightweight model (logistic regression or a shallow neural net) scores each candidate using pre-computed features. This model runs in under 1ms per candidate. It eliminates obvious non-starters: posts in languages the user does not read, content types they never engage with, posts from sources they rarely interact with. Output: ~500 candidates.
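
A first-pass ranker of this kind reduces to a dot product and a sigmoid per candidate. The feature names and weights below are invented for illustration:

```python
import math

# Illustrative pre-computed features per candidate:
# (source_affinity, content_type_match, language_match). Weights are made up.
WEIGHTS = (2.0, 1.5, 3.0)
BIAS = -3.0

def first_pass_score(features: tuple[float, float, float]) -> float:
    """Logistic regression: dot product + sigmoid -- microseconds per post."""
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

def first_pass_filter(candidates: dict, keep: int = 500) -> list[str]:
    """Keep the top-k candidates by lightweight score."""
    scored = sorted(candidates.items(),
                    key=lambda kv: first_pass_score(kv[1]), reverse=True)
    return [post_id for post_id, _ in scored[:keep]]

demo = {"p1": (0.9, 1.0, 1.0),   # strong source, preferred type, right language
        "p2": (0.1, 0.2, 0.0)}   # weak source, wrong language
top = first_pass_filter(demo, keep=1)
```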

Stage 3: Main ranking. A deep neural network scores the remaining candidates using the full feature set, including real-time engagement signals. This is the most compute-intensive stage, typically taking 10-50ms for the full batch. The model predicts multiple outcomes simultaneously: probability of like, comment, share, click, and hide. These predictions are combined into a single score using learned weights. Output: ~100 candidates.
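
Collapsing the per-event predictions into one score can be sketched like this. The event weights are illustrative, not the learned values a production system would use:

```python
# Illustrative combination weights: a share is worth more than a like,
# and a predicted hide is strongly negative.
EVENT_WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 8.0,
                 "click": 0.5, "hide": -20.0}

def combined_score(predictions: dict[str, float]) -> float:
    """Collapse per-event probabilities into a single ranking score."""
    return sum(EVENT_WEIGHTS[event] * p for event, p in predictions.items())

post_a = combined_score({"like": 0.30, "comment": 0.05, "share": 0.02,
                         "click": 0.10, "hide": 0.001})
post_b = combined_score({"like": 0.40, "comment": 0.02, "share": 0.01,
                         "click": 0.20, "hide": 0.05})
# post_a ranks above post_b despite fewer predicted likes:
# the higher hide probability drags post_b down.
```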

Stage 4: Re-ranking. Policy and diversity rules adjust the final ordering. No more than 3 posts from the same source in the top 20. Boost content types that are underrepresented. Demote engagement bait. Insert ads at designated positions. Output: the final feed page.
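
The per-source cap can be enforced with a single greedy pass over the model's ordering, without re-running the ML model. A minimal sketch of that one rule:

```python
from collections import Counter

def rerank(ranked_posts: list[tuple[str, str]],
           max_per_source: int = 3, top_n: int = 20) -> list[tuple[str, str]]:
    """Enforce a per-source cap within the top N, preserving score order.

    ranked_posts: (post_id, source_id) pairs, already sorted by model score.
    Posts that would exceed the cap are deferred past the top N, not dropped.
    """
    per_source: Counter = Counter()
    selected, deferred = [], []
    for post_id, source in ranked_posts:
        if len(selected) < top_n and per_source[source] >= max_per_source:
            deferred.append((post_id, source))
            continue
        selected.append((post_id, source))
        per_source[source] += 1
    return selected + deferred
```

The other rules (content-type boosts, engagement-bait demotion, ad slots) follow the same pattern: cheap adjustments to an already-scored list.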

High-Level System Architecture

```mermaid
graph LR
    subgraph Client
        APP[Mobile/Web Client]
    end
    subgraph Edge
        CDN[CDN]
        LB[Load Balancer]
    end
    subgraph "Feed Service"
        FS[Feed Assembler]
        CR[Candidate Retriever]
        RK[Ranking Service]
        AB[A/B Config Service]
    end
    subgraph "Data Layer"
        SG[(Social Graph<br/>TAO / Graph DB)]
        PS[(Post Store<br/>Sharded MySQL)]
        FC[(Feature Cache<br/>Redis / Memcached)]
        ML[ML Model Server<br/>GPU Cluster]
        RT[(Real-time Counters<br/>Engagement Stream)]
    end
    APP --> CDN --> LB --> FS
    FS --> CR
    CR --> SG
    CR --> PS
    FS --> RK
    RK --> FC
    RK --> ML
    RK --> RT
    FS --> AB
```

The Feed Assembler is the orchestrator. It calls the Candidate Retriever to gather posts, passes them to the Ranking Service for scoring, and checks the A/B Config Service to determine which ranking model variant this user sees. The ML Model Server runs on GPUs and serves predictions via gRPC with batched inference to maximize throughput.

Pre-Computation vs. Real-Time

Not everything can be computed at request time. The system splits work between offline pre-computation and online real-time scoring.

Pre-computed (offline): User embeddings (updated hourly), friend strength scores (updated daily), content embeddings (computed at post creation), author quality scores (updated hourly), and content safety classifications (computed at post creation).

Real-time (online): Engagement velocity (likes in last 5 minutes), session context (device, time, scroll position), recency decay, and interaction between user state and post features.

Pre-computed features are stored in a feature store (often Redis or a dedicated system like Feast) and looked up at serving time. This keeps online latency low. The main ranking model combines pre-computed embeddings with real-time signals to produce the final score.
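
A sketch of the serving-time merge, with a plain dict and an invented key scheme standing in for the feature store. In production the lookup would be a single batched call (e.g. a Redis MGET), one round-trip for the whole candidate set rather than one per post:

```python
# Hypothetical feature store contents (pre-computed offline).
FEATURE_STORE = {
    "user:u1:embedding": [0.1, 0.3],
    "post:p1:embedding": [0.2, 0.4],
    "post:p2:embedding": [0.0, 0.9],
}

def batch_lookup(keys: list[str]) -> list:
    """Batched fetch of pre-computed features (stand-in for a single MGET)."""
    return [FEATURE_STORE[k] for k in keys]

def features_for_batch(user_id: str, post_ids: list[str],
                       realtime: dict) -> list[list[float]]:
    """Merge pre-computed embeddings with per-request real-time signals."""
    keys = [f"user:{user_id}:embedding"] + \
           [f"post:{p}:embedding" for p in post_ids]
    user_emb, *post_embs = batch_lookup(keys)  # one round-trip for everything
    return [user_emb + emb + [realtime[p]["likes_last_5m"]]
            for p, emb in zip(post_ids, post_embs)]
```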

A/B Testing Infrastructure

Feed ranking changes are never deployed blindly. Every change goes through A/B testing. At Facebook's scale, this means running hundreds of experiments simultaneously, each affecting a slice of users.

The A/B testing infrastructure must guarantee several properties. First, experiment isolation: when a user is enrolled in experiments A and B simultaneously, the two must not contaminate each other's results through interaction effects. Second, statistical power: each experiment needs enough users to detect small changes in its metrics. Third, guardrail metrics: even if an experiment improves engagement, it must not degrade user satisfaction surveys, session length, or content diversity below a threshold.

The ranking pipeline is parameterized so that different model versions, feature sets, or re-ranking rules can be swapped per user without changing the serving infrastructure. The A/B Config Service tells the Feed Assembler which configuration to use for each request, based on the user's experiment assignments.
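
Deterministic hash-based bucketing is one common way to implement stable per-user assignments. The experiment name and the 10% split in this sketch are invented:

```python
import hashlib

def experiment_bucket(user_id: str, experiment: str,
                      num_buckets: int = 100) -> int:
    """Deterministic, per-experiment bucketing.

    Hashing user_id together with the experiment name keeps assignments
    stable across requests while decorrelating buckets between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def ranking_config(user_id: str) -> dict:
    """Sketch: 10% of users get a new model variant; the rest get control."""
    config = {"model": "ranker_v1", "source_cap": 3}
    if experiment_bucket(user_id, "new_ranker") < 10:
        config["model"] = "ranker_v2"
    return config
```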

The 200ms Budget

From the moment a user opens the app to the moment the first 20 posts render, the total time budget is roughly 200ms (excluding network transit). Here is how it breaks down:

| Stage | Latency Budget | Operation |
|---|---|---|
| Candidate retrieval | 30-50ms | Query social graph + post indexes |
| Feature lookup | 20-30ms | Batch fetch from feature store |
| First-pass ranking | 10-20ms | Lightweight model on ~3,000 posts |
| Main ranking | 30-50ms | Deep model on ~500 posts (GPU) |
| Re-ranking | 5-10ms | Diversity and policy rules |
| Response assembly | 10-20ms | Hydrate posts with media URLs, author info |
| Total server-side | 105-180ms | |

Every millisecond matters. Feature lookups are batched into a single round-trip. The main ranking model uses batched GPU inference (score 500 posts in one forward pass, not 500 individual calls). The response includes pre-fetched media URLs so the client can begin downloading images before the user scrolls to them.
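
The batching win can be illustrated with a toy two-layer model in NumPy standing in for the GPU model server; the layer sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32))  # toy two-layer network with random weights
W2 = rng.normal(size=(32, 1))

def score_batch(features: np.ndarray) -> np.ndarray:
    """One forward pass over the whole batch: (500, 64) in, (500,) out.

    On a GPU this is a handful of kernel launches instead of 500 separate
    calls, which is where the batching win comes from.
    """
    hidden = np.maximum(features @ W1, 0.0)  # ReLU
    return (hidden @ W2).ravel()

scores = score_batch(rng.normal(size=(500, 64)))  # all 500 candidates at once
```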

Assignment

A user opens their social media app. They follow 400 friends and 100 pages. In the last 24 hours, 2,500 new posts were created by those sources. The app must show the top 20 posts within 200ms of the request hitting the server.

Design the end-to-end flow:

  1. What happens in the first 50ms? What data is fetched and from where?
  2. How does the first-pass ranker reduce 2,500 candidates to 400? What signals does it use?
  3. The main ranking model runs on a GPU cluster. How do you batch 400 candidates into one inference call? What is the expected latency?
  4. The re-ranker must enforce: no more than 2 posts from the same source in the top 20, at least 3 different content types (text, image, video), and one "discover" post from outside the user's graph. How do you implement these rules without re-running the ML model?
  5. Draw a timeline showing all stages and their latency budgets.