Module 7: Real-World Case Studies I

The Scale of Video

Over 500 hours of video are uploaded to YouTube every minute. That is 720,000 hours of new content per day. Netflix streams to over 260 million subscribers across 190 countries. At peak hours, Netflix alone accounts for roughly 15% of all downstream internet traffic in North America.

These numbers reveal something important about the architecture. Video is not a request-response problem. It is a pipeline problem. Raw footage enters one end, and optimized streams exit the other, tailored to thousands of different device and network combinations. Every step in between is a system design decision.

Key insight: Video streaming is a storage problem disguised as a networking problem. The real complexity is not in delivering bytes. It is in preparing the right bytes for every possible viewer.

The Full Pipeline

From the moment a creator clicks "upload" to the moment a viewer presses play, the video passes through five major stages: upload, transcoding, storage, distribution, and playback.

```mermaid
flowchart LR
    A[Creator] -->|Chunked Upload| B[Upload Service]
    B --> C[Object Storage - Raw]
    C --> D[Transcoding Pipeline]
    D --> E[Multiple Resolutions + Codecs]
    E --> F[Object Storage - Processed]
    F --> G[CDN Edge Servers]
    G --> H[Viewer Device]
    B --> I[Metadata Service]
    I --> J[(Metadata DB)]
    J --> K[Search / Recommendation]
```

Two parallel paths exist from the start. The video binary goes into object storage and the transcoding pipeline. The metadata (title, description, tags, thumbnail) goes into a relational database that feeds search and recommendation systems. These two paths are independent and should be treated as separate services.

Step 1: Chunked Upload

A raw 4K video file can easily be 10-50 GB. Uploading a 50 GB file as a single HTTP request is unreliable. Network interruptions, timeouts, and browser limits all conspire against you. The solution is chunked upload.

The client splits the file into chunks (typically 5-25 MB each). Each chunk is uploaded independently with a chunk index. The server reassembles them. If a chunk fails, only that chunk is retried. YouTube uses a resumable upload protocol that tracks which chunks have been received and allows the upload to continue from the last successful chunk after any interruption.
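The resumable scheme above can be sketched in a few lines. This is a toy model, not YouTube's protocol: `upload_chunk` is a hypothetical network call, and `received` stands in for the server's record of acknowledged chunk indices.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, within the typical 5-25 MB range

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (index, chunk) pairs so each part can be uploaded and retried independently."""
    for offset in range(0, len(data), chunk_size):
        yield offset // chunk_size, data[offset:offset + chunk_size]

def resumable_upload(data: bytes, received: set[int], upload_chunk,
                     chunk_size: int = CHUNK_SIZE) -> None:
    """Skip chunks the server already acknowledged; retry each failed chunk alone.

    `upload_chunk(index, chunk)` is a hypothetical network call; if it raises,
    only that chunk needs to be re-sent, and a later run resumes via `received`.
    """
    for index, chunk in split_into_chunks(data, chunk_size):
        if index in received:       # already on the server: resume past it
            continue
        upload_chunk(index, chunk)  # on failure, only this chunk is retried
        received.add(index)
```

Because every chunk carries its index, the server can reassemble the file in order even if chunks arrive out of order or across separate sessions.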

The upload target is object storage (S3, Google Cloud Storage). The raw file is stored as-is. This is the "source of truth" for the video. Everything downstream is a derived artifact.

Step 2: Transcoding

Raw uploaded video is not suitable for streaming. It may be in a codec the viewer's device cannot decode. It is almost certainly too large for mobile networks. Transcoding converts the raw video into multiple versions optimized for different devices and bandwidth conditions.

Netflix runs its transcoding pipeline on EC2 instances in AWS, processing petabytes of video data daily. Each video is encoded into multiple resolution and bitrate combinations called a "bitrate ladder."
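As a rough sketch of what one worker in such a pipeline might do, the snippet below generates one ffmpeg invocation per rung of a bitrate ladder. The ladder values come from the encoding-profile table below; the H.264/ffmpeg flags are a plausible baseline, not Netflix's or YouTube's actual encoder configuration.

```python
# Bitrate ladder (name, output height, video bitrate) from the table below.
LADDER = [
    ("240p",  240,  "300k"),
    ("360p",  360,  "700k"),
    ("480p",  480,  "1500k"),
    ("720p",  720,  "3000k"),
    ("1080p", 1080, "6000k"),
    ("2160p", 2160, "16000k"),
]

def transcode_commands(source: str) -> list[list[str]]:
    """Build one ffmpeg command per rung of the ladder (H.264 shown for brevity)."""
    stem = source.rsplit(".", 1)[0]
    commands = []
    for name, height, bitrate in LADDER:
        commands.append([
            "ffmpeg", "-i", source,
            "-c:v", "libx264", "-b:v", bitrate,
            "-vf", f"scale=-2:{height}",  # -2 preserves aspect ratio, keeps width even
            f"{stem}_{name}.mp4",
        ])
    return commands
```

In production these jobs run in parallel across a fleet of workers, since each rung is independent of the others.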

Video Encoding Profiles

| Resolution | Bitrate (video) | Storage per hour | Target Use Case |
|---|---|---|---|
| 240p | 300 Kbps | ~135 MB | 2G/3G mobile, extreme low bandwidth |
| 360p | 700 Kbps | ~315 MB | Low-end mobile |
| 480p | 1.5 Mbps | ~675 MB | Standard mobile, slow Wi-Fi |
| 720p | 3 Mbps | ~1.35 GB | Tablets, standard desktop |
| 1080p | 6 Mbps | ~2.7 GB | HD desktop, smart TVs |
| 4K (2160p) | 16 Mbps | ~7.2 GB | 4K TVs, high-bandwidth connections |

A single hour of uploaded video, transcoded into all six profiles, requires roughly 12.4 GB of storage. With 500 hours uploaded to YouTube every minute, that is roughly 6.2 TB of new processed video per minute, or roughly 8.9 PB of new transcoded content per day.
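The arithmetic behind those figures, reproduced directly from the per-hour storage column of the table:

```python
# Per-hour storage for each rung of the ladder, in GB (from the table above).
PROFILE_GB_PER_HOUR = [0.135, 0.315, 0.675, 1.35, 2.7, 7.2]

gb_per_uploaded_hour = sum(PROFILE_GB_PER_HOUR)    # all six profiles combined
tb_per_minute = gb_per_uploaded_hour * 500 / 1000  # 500 hours uploaded per minute
pb_per_day = tb_per_minute * 60 * 24 / 1000

print(f"{gb_per_uploaded_hour:.3f} GB/hour, {tb_per_minute:.2f} TB/min, {pb_per_day:.2f} PB/day")
# → 12.375 GB/hour, 6.19 TB/min, 8.91 PB/day
```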

Netflix takes this further with per-title encoding. Instead of using a fixed bitrate ladder, they analyze each video's complexity and generate a custom ladder. A slow dialogue scene needs less bitrate at 1080p than an action sequence. Netflix has reported bitrate savings of around 20% from per-title encoding, with no perceptible quality loss.

Step 3: Adaptive Bitrate Streaming

The viewer's bandwidth is not constant. It fluctuates as they move between Wi-Fi and cellular, as network congestion rises and falls, as other devices on the same network start or stop streaming. Adaptive Bitrate (ABR) streaming handles this dynamically.

The transcoded video is split into small segments, typically 2-10 seconds long. Each segment exists at every quality level. A manifest file (HLS uses .m3u8, DASH uses .mpd) lists all available segments and their quality levels. The video player on the client device monitors its download speed and buffer level, then requests the appropriate quality for each segment.

```mermaid
sequenceDiagram
    participant P as Video Player
    participant CDN as CDN Edge
    participant S as Origin Storage
    P->>CDN: Request manifest (.m3u8)
    CDN-->>P: Manifest with all quality levels
    Note over P: Bandwidth: 5 Mbps
    P->>CDN: Request segment 1 at 1080p
    CDN-->>P: Segment 1 (1080p)
    Note over P: Bandwidth drops to 1 Mbps
    P->>CDN: Request segment 2 at 480p
    CDN-->>P: Segment 2 (480p)
    Note over P: Bandwidth recovers to 8 Mbps
    P->>CDN: Request segment 3 at 1080p
    CDN-->>P: Segment 3 (1080p)
    Note over P,CDN: Quality adjusts per segment
```

The quality switch happens at segment boundaries. If each segment is 4 seconds long, the player can adjust quality every 4 seconds. Shorter segments allow faster adaptation but increase the number of HTTP requests and the manifest file size. Longer segments are more efficient but slower to adapt.
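A minimal throughput-based selection rule illustrates the per-segment decision. Real players also weigh buffer level and switching smoothness; the 0.8 safety factor here is an assumed headroom for throughput-estimation error, not a standard value.

```python
# Ladder bitrates in Kbps, from the encoding-profile table.
BITRATES_KBPS = {"240p": 300, "360p": 700, "480p": 1500,
                 "720p": 3000, "1080p": 6000, "2160p": 16000}

def select_quality(measured_kbps: float, safety: float = 0.8) -> str:
    """Pick the highest rung whose bitrate fits under a safety margin of throughput."""
    budget = measured_kbps * safety
    best = "240p"  # never go below the lowest rung
    for name, kbps in BITRATES_KBPS.items():
        if kbps <= budget and kbps > BITRATES_KBPS[best]:
            best = name
    return best
```

For example, `select_quality(5000)` returns `"720p"` under these assumptions: with 20% headroom the budget is 4 Mbps, which rules out the 6 Mbps 1080p rung. The player re-runs this decision before fetching each segment.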

Step 4: CDN Delivery

Serving video from a single origin datacenter does not work at global scale. A viewer in Jakarta requesting video from a server in Virginia would experience high latency and compete for bandwidth across undersea cables. Content Delivery Networks solve this by caching video segments at edge servers close to viewers.

Netflix operates its own CDN called Open Connect. Netflix embeds custom hardware appliances (called Open Connect Appliances, or OCAs) directly inside ISP networks. During off-peak hours, popular content is pre-positioned on these devices. When a subscriber presses play, the video is served from a box inside their own ISP's network, often within the same city.

YouTube uses Google's global network and edge caches. The principle is the same: move the data closer to the viewer. Popular videos are cached at many edge locations. Long-tail content (rarely watched videos) may only exist at a regional cache or the origin. The first viewer in a region pays the latency cost, and subsequent viewers benefit from the cache.
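The edge-cache behavior described above can be modeled with a toy LRU cache, where a miss falls through to the origin. This is a sketch of the general pattern, not either company's actual cache implementation; `fetch_from_origin` stands in for the slow cross-backbone request.

```python
from collections import OrderedDict

class EdgeCache:
    """A toy LRU edge cache: misses fall through to the origin, hits are served locally."""

    def __init__(self, fetch_from_origin, capacity: int = 1000):
        self.fetch_from_origin = fetch_from_origin  # slow path back to origin storage
        self.capacity = capacity
        self.store: OrderedDict[str, bytes] = OrderedDict()

    def get(self, segment_id: str) -> tuple[bytes, bool]:
        """Return (segment, was_hit). The first viewer in a region pays the miss."""
        if segment_id in self.store:
            self.store.move_to_end(segment_id)      # refresh LRU position
            return self.store[segment_id], True
        data = self.fetch_from_origin(segment_id)   # cache miss: fetch from origin
        self.store[segment_id] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)          # evict least recently used
        return data, False
```

The eviction policy is what keeps popular videos resident at many edges while long-tail content falls back to regional caches or the origin.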

Metadata and the Relational Layer

While the video binary lives in object storage and CDNs, everything else about the video lives in a relational database: title, description, upload date, view count, like count, comments, channel information, tags, categories, subtitles, and content moderation flags. This metadata drives search, recommendations, and the entire user interface.

YouTube uses a combination of MySQL (sharded through Vitess, the middleware YouTube built and later open-sourced) and Bigtable for different metadata needs. The key insight: video content and video metadata are separate systems with separate scaling characteristics. Content is write-once-read-many with massive bandwidth needs. Metadata is read-heavy with complex query patterns (search, filter, sort, aggregate).

```mermaid
flowchart TD
    subgraph Content Path
        RAW[Raw Video - S3/GCS] --> TC[Transcoding]
        TC --> PROC[Processed Segments - S3/GCS]
        PROC --> CDN[CDN Edge Cache]
        CDN --> VIEWER[Viewer]
    end
    subgraph Metadata Path
        UP[Upload Metadata] --> MDB[(MySQL / Vitess)]
        MDB --> SEARCH[Search Index - Elasticsearch]
        MDB --> REC[Recommendation Engine]
        MDB --> API[API for UI]
        API --> VIEWER
    end
```
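The fan-out at upload time can be sketched as two independent writes. In this toy version a dict stands in for object storage and an in-memory SQLite database for MySQL/Vitess; the `videos` schema is illustrative, not YouTube's.

```python
import sqlite3

# A dict stands in for object storage; SQLite stands in for MySQL/Vitess.
object_storage: dict[str, bytes] = {}

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE videos (
    id TEXT PRIMARY KEY,
    title TEXT,
    uploaded_at TEXT,
    view_count INTEGER DEFAULT 0)""")

def handle_upload(video_id: str, title: str, raw_bytes: bytes) -> None:
    """Fan out: binary to the content path, metadata to the relational path."""
    object_storage[f"raw/{video_id}"] = raw_bytes  # content path: write-once, bandwidth-heavy
    db.execute(
        "INSERT INTO videos (id, title, uploaded_at) VALUES (?, ?, datetime('now'))",
        (video_id, title))                         # metadata path: query-heavy
    db.commit()
```

Because the two writes target separate systems, each path can scale, fail, and be operated independently, which is exactly why the diagram above shows them as parallel branches.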

Assignment

A creator uploads a 4K, 1-hour video (raw file size: 30 GB). A viewer in a rural area with a 3G connection (500 Kbps average) wants to watch it. Walk through every step:

  1. How is the 30 GB file uploaded reliably? Calculate the number of chunks at 10 MB each and the time to upload on a 50 Mbps connection.
  2. How many transcoded versions are created? What is the total storage for all versions of this 1-hour video?
  3. The rural viewer presses play. Which resolution does the ABR algorithm select? What happens when their bandwidth fluctuates between 300 Kbps and 800 Kbps?
  4. The video is brand new and has not been cached at any edge server near the viewer. Describe the cache miss flow and what the viewer experiences on the first segment vs. subsequent segments.