Module 3: Storage, Databases & Caching
Systems Thinking × System Design · 11 sessions
Two Models, Two Philosophies
Every application stores data. The question is how that data is organized, queried, and scaled. For decades, relational databases (SQL) dominated. Then, around 2009, a wave of distributed systems gave rise to NoSQL. These are not competing religions. They are engineering tradeoffs. The right choice depends on your data shape, your access patterns, and your consistency requirements.
Relational Databases (SQL)
A relational database stores data in tables with predefined schemas, enforces relationships through foreign keys, and guarantees ACID properties: Atomicity, Consistency, Isolation, Durability. Every write either fully succeeds or fully rolls back.
Relational databases were designed for correctness. A banking system transferring money between accounts needs a guarantee: debit one account and credit the other, or do neither. ACID transactions provide that guarantee. The schema enforces structure. You define columns and types upfront. If you try to insert a string into an integer column, the database rejects it. This strictness prevents entire categories of bugs.
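The all-or-nothing guarantee is easy to see with SQLite, the relational database built into Python's standard library. This is a minimal sketch with an invented two-account schema; the CHECK constraint stands in for the "no overdrafts" business rule:

```python
import sqlite3

# In-memory stand-in for a banking schema (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically; any failure rolls back both."""
    try:
        with conn:  # one transaction: commit on success, rollback on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint caught an overdraft
        return False

ok = transfer(conn, 1, 2, 30)       # succeeds: balances become 70 and 80
failed = transfer(conn, 2, 1, 500)  # violates the CHECK: the debit is rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The failed transfer leaves no trace: the database returns to its pre-transaction state, which is exactly the atomicity guarantee.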
SQL databases excel at complex queries. Joins allow you to combine data from multiple tables in a single query. An e-commerce system can answer "show me all orders from customers in Jakarta who bought more than 3 items last month" with one SQL statement. Try doing that efficiently in a key-value store.
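A sketch of that query using SQLite and a toy three-table schema. Table and column names are invented, and the "last month" filter is pinned to a fixed date range to keep the example deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, created_at TEXT);
    CREATE TABLE order_items (order_id INTEGER, qty INTEGER);
    INSERT INTO customers VALUES (1, 'Jakarta'), (2, 'Bandung');
    INSERT INTO orders VALUES (10, 1, '2024-05-03'), (11, 2, '2024-05-04'),
                              (12, 1, '2024-05-10');
    INSERT INTO order_items VALUES (10, 5), (11, 9), (12, 2);
""")

# One declarative statement joins three tables, filters, groups, and aggregates
rows = conn.execute("""
    SELECT o.id, SUM(i.qty) AS items
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN order_items i ON i.order_id = o.id
    WHERE c.city = 'Jakarta'
      AND o.created_at BETWEEN '2024-05-01' AND '2024-05-31'
    GROUP BY o.id
    HAVING SUM(i.qty) > 3
""").fetchall()
```

In a key-value store, answering this question would mean fetching and filtering records in application code, or maintaining a precomputed index for this exact question.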
The cost of this power is rigidity. Schema changes (adding a column, changing a type) require migrations. On large tables with billions of rows, migrations can take hours and lock the table. Horizontal scaling is difficult because joins across machines are expensive. Most relational databases scale vertically first, then add read replicas for read-heavy workloads.
Major relational databases: PostgreSQL, MySQL, Oracle, SQL Server, SQLite.
NoSQL Databases
NoSQL databases abandon the rigid table-and-schema model in favor of flexible data structures (documents, key-value pairs, wide columns, or graphs). Most trade strong consistency for horizontal scalability and eventual consistency.
NoSQL emerged to solve problems that relational databases handle poorly: massive write throughput across distributed clusters, highly variable data shapes, and workloads where horizontal scaling matters more than multi-table joins.
A document database like MongoDB stores JSON-like documents. Each document can have different fields. You do not need to run ALTER TABLE to add a new attribute. You just start writing documents with the new field. This flexibility accelerates early development and works well for data that does not fit neatly into rows and columns.
Most NoSQL databases achieve horizontal scalability by partitioning data across nodes. Cassandra, for example, distributes data using consistent hashing. Adding a node automatically rebalances part of the data. This makes scaling to hundreds of nodes straightforward, something relational databases struggle with.
The tradeoff is consistency. Most NoSQL systems offer eventual consistency by default: after a write, replicas will converge to the same state, but a read immediately after a write might return stale data. Some NoSQL databases (like MongoDB with majority write concern, or DynamoDB with strong reads) can provide stronger guarantees at the cost of latency.
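The read-after-write anomaly is easier to see in a toy simulation. No real database is involved here; the `Replica` class just models replication lag as an explicit step:

```python
# Toy model of asynchronous replication: the primary acknowledges writes
# immediately; replicas apply them only when replication "catches up"
class Replica:
    def __init__(self):
        self.data = {}
        self.pending = []              # changes received but not yet applied

    def receive(self, key, value):
        self.pending.append((key, value))

    def catch_up(self):                # replication lag elapsing
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

primary, replica = {}, Replica()

def write(key, value):
    primary[key] = value               # acknowledged as soon as the primary has it
    replica.receive(key, value)        # shipped to the replica asynchronously

write("user:1:email", "andi@example.com")
stale = replica.data.get("user:1:email")   # read-after-write on the replica: stale
replica.catch_up()
fresh = replica.data.get("user:1:email")   # after convergence: the new value
```

The window between the write and `catch_up` is exactly the window in which an eventually consistent read returns old data.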
Throughput Comparison
Benchmark numbers depend heavily on hardware, configuration, data size, and workload shape. The following chart uses approximate figures from YCSB (Yahoo Cloud Serving Benchmark) studies and vendor benchmarks to illustrate relative throughput ranges. These are order-of-magnitude comparisons, not absolute truths for your specific workload.
Redis dominates because it is an in-memory store. Cassandra's write throughput is high because its LSM-tree storage engine is optimized for sequential writes. PostgreSQL's numbers are lower on raw throughput but it handles complex queries that the others cannot. These numbers shift dramatically with indexing, replication, and query complexity.
Comparison Across Dimensions
| Dimension | SQL (Relational) | NoSQL |
|---|---|---|
| Data model | Fixed schema, tables with rows and columns | Flexible: documents, key-value, wide-column, graph |
| Schema | Schema-on-write (enforced at insert time) | Schema-on-read (interpreted at query time) |
| Query language | SQL (standardized, declarative) | Varies per database (MongoDB query API, CQL, Cypher) |
| Joins | Native, efficient multi-table joins | Generally unsupported or expensive |
| Transactions | Full ACID across multiple tables | Limited (single-document or single-partition in most) |
| Consistency | Strong consistency by default | Eventual consistency by default (tunable in some) |
| Scaling | Primarily vertical; horizontal via read replicas or sharding | Designed for horizontal scaling from the start |
| Schema evolution | ALTER TABLE migrations (can be slow on large tables) | Add fields freely; old documents still valid |
| Best for | Transactional systems, complex queries, data integrity | High write throughput, variable schemas, massive scale |
Polyglot Persistence
Polyglot persistence means using different database technologies for different parts of the same system, matching each data store to the access pattern it handles best.
Most real-world systems do not pick one database for everything. An e-commerce platform might use PostgreSQL for orders and inventory (transactions matter), Redis for session storage and caching (speed matters), Elasticsearch for product search (full-text indexing matters), and a graph database for recommendations (relationship traversal matters).
This is not over-engineering. It is recognizing that no single database excels at every workload. The cost is operational complexity: more systems to monitor, more failure modes, more data synchronization concerns.
```mermaid
graph TD
    API[API] --> PG["PostgreSQL
Orders, Users, Payments
(ACID transactions)"]
    API --> Redis["Redis
Sessions, Cache
(sub-ms latency)"]
    API --> ES["Elasticsearch
Product Search
(full-text indexing)"]
    API --> Mongo["MongoDB
Product Catalog
(flexible schema)"]
    API --> Neo4j["Neo4j
Recommendations
(graph traversal)"]
    PG ---|"Source of truth"| Sync[Change Data Capture]
    Sync --> ES
    Sync --> Redis
```
The diagram above shows a polyglot architecture. PostgreSQL serves as the source of truth for transactional data. Change data capture (CDC) streams changes to Elasticsearch for search indexing and to Redis for cache warming. Each database handles the workload it was designed for.
The ACID vs. BASE Spectrum
SQL databases follow ACID. Most NoSQL databases follow BASE: Basically Available, Soft state, Eventually consistent. These are not binary categories. Many databases sit on a spectrum.
MongoDB now supports multi-document ACID transactions (since version 4.0). DynamoDB offers strongly consistent reads. CockroachDB is a distributed SQL database that provides ACID guarantees across multiple nodes. The boundaries between SQL and NoSQL are blurring. What matters is understanding the default behavior and what guarantees you are paying for in terms of latency and throughput.
Systems Thinking Lens
Choosing a database is a leverage point. Get it right early and the system scales naturally. Get it wrong and you spend months migrating under pressure. The feedback loop is delayed: you pick a database in month one, but the pain of a bad choice shows up in month twelve when traffic grows 10x.
Polyglot persistence creates a balancing loop. Each additional database solves one problem but introduces operational overhead. At some point, the cost of managing five different databases outweighs the performance benefit. The systems thinker asks: where is the point of diminishing returns for our team size and operational maturity?
Further Reading
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapters 2 and 3. The definitive treatment of data models, storage engines, and the SQL vs. NoSQL landscape.
- Wikipedia, NoSQL. Comprehensive overview of the NoSQL movement, history, and taxonomy.
- Rick Houlihan, Amazon DynamoDB Deep Dive (AWS re:Invent 2018). Masterclass on NoSQL data modeling and access pattern design from the DynamoDB team.
- Percona, PostgreSQL Replication Guide. Practical guide to scaling PostgreSQL with replicas.
- benchANT, Database Performance Ranking. Independent benchmarks comparing database throughput and cost across cloud providers.
Assignment
You are designing a system that stores four types of data. For each one, choose SQL or NoSQL (and name a specific database). Justify your choice in 2-3 sentences, explaining which properties of the workload drove your decision.
- User profiles with fields that vary by user type (individual, business, admin). Some users have 5 fields, others have 50.
- Chat messages in a messaging app with 10 million daily active users. Messages are write-heavy, read by time range, and rarely updated.
- Product catalog for an e-commerce site with 2 million SKUs. Users search by name, category, price range, and attributes that vary by product type.
- Social graph for a platform where users follow each other. Key queries: "who does user X follow?", "who follows user X?", "mutual friends between X and Y."
For each choice, also state the consistency model you would use (strong or eventual) and why.
Four Families, Four Philosophies
Saying "NoSQL" tells you almost nothing about how a database actually works. It is like saying "not a car" when someone asks what vehicle you drive. It could be a motorcycle, a bus, a helicopter, or a bicycle. NoSQL is an umbrella term covering at least four fundamentally different data models, each optimized for different access patterns.
The four families are: key-value stores, document databases, wide-column stores, and graph databases. Picking the right family means understanding your data shape and how you will query it.
Key-Value Stores
A key-value store is the simplest NoSQL model: every record is a unique key mapped to a blob of data. The database does not inspect the value. It only knows the key. This simplicity enables extreme speed.
Key-value stores are hash tables at scale. You give the database a key, it returns a value. You cannot query by value contents. You cannot filter or sort. You can only get, set, and delete by key. This constraint is the source of their power. With no secondary indexes, no joins, and no query parsing, operations complete in microseconds.
Redis is the most widely used key-value store. It holds data in memory, supports data structures beyond simple strings (lists, sets, sorted sets, hashes), and provides sub-millisecond latency. DynamoDB is Amazon's managed key-value and document store, built for horizontal scalability with single-digit millisecond reads at any scale.
Use when: Session storage, caching, rate limiting, leaderboards, feature flags. Any workload where the access pattern is "I know the key, give me the value."
Avoid when: You need to search by attributes within the value, or you need relationships between records.
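The key-value contract with TTL support, the primitive behind session stores and caches, fits in a few lines. Redis would serve this role in production; this dict-based sketch only illustrates the access pattern (all names are invented):

```python
import time

class KVStore:
    """Minimal in-memory key-value store with per-key time-to-live."""
    def __init__(self):
        self._data = {}                # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]        # lazy expiration on access
            return None
        return value

store = KVStore()
store.set("session:abc", {"user_id": 42}, ttl=1800)   # 30-minute session
session = store.get("session:abc")
```

Note what is missing: no query by value, no scan, no secondary index. That absence is what makes the real systems so fast.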
Document Databases
A document database stores self-describing records (usually JSON or BSON) that can contain nested objects and arrays. Unlike key-value stores, the database understands the document structure and can index and query individual fields.
Document databases sit between key-value simplicity and relational richness. Each document is a self-contained unit. A user document might contain the user's name, address (nested object), order history (array of objects), and preferences (nested map). You can query any of these fields.
MongoDB is the dominant document database. It supports rich queries, aggregation pipelines, secondary indexes, and multi-document ACID transactions. Couchbase and Amazon DocumentDB are other options in this space.
The document model works well when your data naturally forms aggregates. An e-commerce product is a good example: the product name, description, images, variants, and reviews form a logical unit that gets read and written together. Storing this as a single document avoids the five-table join you would need in a relational database.
Use when: Content management, product catalogs, user profiles with variable fields, event logging.
Avoid when: You need complex joins across document types or strict referential integrity between collections.
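A sketch of such an aggregate as a plain Python dict standing in for a BSON document (field names are invented for illustration):

```python
# One self-contained product document: name, variants, and reviews travel together
product = {
    "_id": "sku-1001",
    "name": "Trail Running Shoe",
    "variants": [
        {"size": 42, "color": "red", "stock": 3},
        {"size": 43, "color": "blue", "stock": 0},
    ],
    "reviews": [{"rating": 5, "text": "Great grip"}],
}

# Individual nested fields remain queryable, the way a document database
# would index them (e.g. an index on variants.stock)
in_stock = [v for v in product["variants"] if v["stock"] > 0]
avg_rating = sum(r["rating"] for r in product["reviews"]) / len(product["reviews"])
```

Reading the whole product is one document fetch; the relational equivalent would join products, variants, images, and reviews tables.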
Wide-Column Stores
A wide-column store organizes data into rows and column families, where each row can have a different set of columns. Rows are identified by a partition key, and within each partition, data is sorted by a clustering key. This structure is optimized for high write throughput and time-series access patterns.
Wide-column stores were inspired by Google's Bigtable paper (2006). The data model looks like a table, but it is fundamentally different from relational tables. There is no fixed schema. Each row can have millions of columns. Columns are grouped into families, and the database stores each family contiguously on disk.
Apache Cassandra is the most prominent wide-column store. It provides linear write scalability: adding a node to the cluster increases write throughput proportionally. Cassandra partitions data by a hash of the partition key and sorts data within each partition by the clustering columns. This makes time-ordered queries within a partition very fast.
HBase (built on Hadoop) and ScyllaDB (a Cassandra-compatible database written in C++) are other notable wide-column stores.
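The partition-key-plus-clustering-key layout can be sketched in a few lines. This models the access pattern only, not Cassandra's actual storage engine; keys and values are invented:

```python
from bisect import insort
from collections import defaultdict

# partition key -> list of (clustering key, value), kept sorted on insert
table = defaultdict(list)

def insert(partition_key, clustering_key, value):
    insort(table[partition_key], (clustering_key, value))

def range_scan(partition_key, start, end):
    # Touches exactly one partition; rows come back already time-ordered
    return [(t, v) for t, v in table[partition_key] if start <= t <= end]

insert("sensor-7", 1005, 21.4)
insert("sensor-7", 1001, 21.1)
insert("sensor-7", 1012, 22.0)
readings = range_scan("sensor-7", 1000, 1010)
```

The design consequence: queries that name a partition key are fast and local, and everything else (cross-partition scans, ad-hoc filters) fights the data model.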
Use when: Time-series data (IoT sensor readings, event logs), messaging at scale, write-heavy workloads with known query patterns.
Avoid when: You need ad-hoc queries, complex aggregations, or do not know your access patterns at design time.
Graph Databases
A graph database stores data as nodes (entities) and edges (relationships), with properties on both. It is optimized for traversing connections, making queries like "find all friends of friends who live in Jakarta" fast regardless of total data size.
Relational databases can model graphs using join tables, but traversing deep relationships becomes prohibitively expensive. A three-hop query (friends of friends of friends) means joining the relationship table against itself three times, and the cost of each join grows with the total size of that table. Graph databases use index-free adjacency: each node directly references its neighbors, so traversal cost is proportional to the local neighborhood size, not the total graph size.
Neo4j is the leading graph database. It uses the Cypher query language, which reads like a visual pattern: MATCH (a)-[:FOLLOWS]->(b)-[:FOLLOWS]->(c) RETURN c. Amazon Neptune and TigerGraph are other graph databases used at scale.
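The same two-hop pattern in miniature, using an in-memory adjacency map to show why index-free adjacency keeps traversal local (the graph data is invented):

```python
# Each node holds direct references to its neighbors, so a two-hop
# traversal touches only the local neighborhood, mirroring the Cypher
# pattern MATCH (a)-[:FOLLOWS]->(b)-[:FOLLOWS]->(c) RETURN c
follows = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D", "E"},
    "D": set(),
    "E": set(),
}

def follows_of_follows(user):
    """Everyone followed by someone the user follows (excluding the user)."""
    result = set()
    for middle in follows[user]:       # hop 1: the user's follows
        result |= follows[middle]      # hop 2: their follows
    result.discard(user)
    return result

second_hop = follows_of_follows("A")
```

The work done depends only on A's out-degree and the out-degrees of A's follows, regardless of how many million other nodes the graph contains.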
Use when: Social networks, recommendation engines, fraud detection, knowledge graphs, dependency mapping.
Avoid when: Your data has few relationships, or your queries are primarily simple lookups and range scans.
Data Model Comparison
```mermaid
graph TD
    subgraph Document
        D1["{ name: 'Andi',
orders: [...],
prefs: {...} }"]
    end
    subgraph Wide-Column
        WC1["Row key: user:1001"]
        WC1 --> WC2["profile:name"]
        WC1 --> WC3["profile:city"]
        WC1 --> WC4["metrics:logins"]
        WC1 --> WC5["metrics:last_seen"]
    end
    subgraph Graph
        G1((User A)) -->|FOLLOWS| G2((User B))
        G2 -->|FOLLOWS| G3((User C))
        G1 -->|LIKES| G4((Post 1))
    end
```
Comparison Table
| Dimension | Key-Value | Document | Wide-Column | Graph |
|---|---|---|---|---|
| Data model | Key mapped to opaque blob | JSON/BSON documents with nested fields | Rows with dynamic column families | Nodes, edges, properties |
| Query pattern | GET/SET by key only | Query by any field, aggregation pipelines | Partition key lookup, range scan within partition | Pattern matching, graph traversal |
| Schema | None (schemaless) | Flexible (schema-on-read) | Column families predefined, columns flexible | Flexible node/edge types |
| Scalability | Linear horizontal | Horizontal with sharding | Linear horizontal (masterless) | Vertical primarily; some horizontal |
| Read speed | Sub-ms (in-memory) | Low ms (indexed) | Low ms (within partition) | Low ms (local traversal) |
| Write speed | Sub-ms | Low ms | Very high (LSM-tree) | Moderate |
| Strengths | Speed, simplicity | Flexibility, rich queries | Write throughput, time-series | Relationship queries |
| Products | Redis, DynamoDB, Memcached | MongoDB, Couchbase, DocumentDB | Cassandra, HBase, ScyllaDB | Neo4j, Neptune, TigerGraph |
| Typical use case | Cache, sessions, rate limits | CMS, catalogs, user profiles | IoT, messaging, event logs | Social graphs, fraud, recommendations |
Radar Comparison
The radar chart below compares the four NoSQL families across five dimensions, scored on a relative scale of 1 (weakest) to 5 (strongest). These are directional, not absolute. Your mileage will vary depending on the specific product, configuration, and workload.
Key-value stores score the highest on speed and scalability but the lowest on query flexibility. Graph databases are the inverse: rich queries but harder to scale horizontally. Document and wide-column databases sit in different middle grounds. This is why polyglot persistence exists. No single family wins on every axis.
Choosing the Right Family
```mermaid
graph TD
    Q1{"Access by key only?"} -->|Yes| KV["Key-Value Store
(Redis, DynamoDB)"]
    Q1 -->|No| Q2{"Query by document fields?"}
    Q2 -->|Yes| Doc["Document Database
(MongoDB, Couchbase)"]
    Q2 -->|No| Q3{"Time-series or
write-heavy with
known partitions?"}
    Q3 -->|Yes| WC["Wide-Column Store
(Cassandra, ScyllaDB)"]
    Q3 -->|No| Q4{"Relationship
traversal?"}
    Q4 -->|Yes| Graph["Graph Database
(Neo4j, Neptune)"]
    Q4 -->|No| SQL["Consider Relational
(PostgreSQL, MySQL)"]
```
The decision tree above is a starting point, not a prescription. Many workloads could fit multiple families. When in doubt, start with the simplest option that meets your access pattern requirements. You can always add specialized databases later through polyglot persistence.
Systems Thinking Lens
Each NoSQL family creates a different constraint on your system. Key-value stores constrain your query patterns but free you from schema management. Graph databases free your query patterns but constrain your scaling options. These are balancing loops. The freedom you gain on one axis is paid for on another.
The leverage point is access pattern analysis. Before choosing a database family, write down your ten most important queries. If you cannot do that, you are not ready to choose. The database family should emerge from the access patterns, not the other way around. Too many teams pick a database because it is trendy, then spend months fighting its constraints.
Further Reading
- Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data (Google, 2006). The paper that inspired wide-column stores including HBase and Cassandra.
- MongoDB Documentation, Data Modeling Introduction. Official guide to document modeling patterns in MongoDB.
- Neo4j Documentation, What is a Graph Database?. Clear introduction to graph data modeling and the property graph model.
- ScyllaDB, NoSQL Database Comparison. Detailed comparison across NoSQL families with performance characteristics.
- Redis Documentation, Data Types. Goes beyond simple key-value to show Redis's rich data structure support.
Assignment
Match each workload below to the most appropriate NoSQL family (key-value, document, wide-column, or graph). For each, explain your reasoning in 2-3 sentences, focusing on the access pattern that drove your decision.
- Session store for a web application with 50 million active sessions. Sessions are created, read by session ID, and expire after 30 minutes.
- Time-series IoT data from 100,000 sensors, each sending a reading every 5 seconds. Queries are always "all readings from sensor X between time A and time B."
- Recommendation engine for a streaming platform. Key query: "users who watched Movie A also watched..." which requires traversing user-movie-user paths.
- Activity logs from a SaaS application. Each log entry has a timestamp, user ID, action type, and a metadata object whose fields vary by action type (login has IP and device; purchase has amount and items; error has stack trace).
Why Shard?
A single database server has limits. CPU, memory, disk I/O, network bandwidth. Vertical scaling buys you time, but eventually you hit the ceiling. When your tables hold billions of rows and your write throughput exceeds what one machine can handle, you need to split the data across multiple machines. That is sharding.
Sharding (also called horizontal partitioning) splits a single logical dataset across multiple database instances, called shards. Each shard holds a subset of the data and handles a subset of the queries. Together, the shards form one logical database.
Sharding is not the same as replication. Replication copies the same data to multiple machines for redundancy and read scaling. Sharding distributes different data to different machines for write scaling and storage capacity. Most production systems use both.
Three Sharding Strategies
Range-Based Sharding
Divide the shard key space into contiguous ranges. Users 1 through 1,000,000 go to Shard A. Users 1,000,001 through 2,000,000 go to Shard B. And so on.
Range-based sharding is simple to understand and implement. Range queries are efficient because all data within a range lives on the same shard. If you need all users with IDs between 500,000 and 600,000, that query hits a single shard.
The problem is hotspots. If user IDs are auto-incrementing, all new writes go to the highest shard. That shard becomes a bottleneck while the others sit idle. Time-based keys have the same problem: all recent data lands on the same shard.
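Range routing and its hotspot can both be sketched with `bisect` (the boundaries are illustrative):

```python
from bisect import bisect_right

# Upper bounds of shards 0, 1, and 2; any larger ID goes to shard 3
boundaries = [1_000_000, 2_000_000, 3_000_000]

def shard_for(user_id):
    # IDs are 1-based; subtracting 1 makes each boundary land on the left shard
    return bisect_right(boundaries, user_id - 1)

# A range query stays on one shard as long as it fits inside one range
same_shard = shard_for(500_000) == shard_for(600_000) == 0

# But auto-incrementing IDs send every new write to the last shard
hot = {shard_for(i) for i in range(3_000_001, 3_000_100)}
```

The `hot` set contains a single shard: with sequential keys, the newest range absorbs all the write load while the others sit idle.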
Hash-Based Sharding
Apply a hash function to the shard key and use the result to determine the shard. For example: shard = hash(user_id) % number_of_shards. A good hash function distributes keys uniformly, so writes spread evenly across all shards.
Hash-based sharding eliminates hotspots for sequential keys. User IDs 1, 2, 3, 4 might map to shards 3, 1, 0, 2 respectively. The tradeoff: range queries become expensive. "Find all users with IDs 1 through 1000" now requires querying every shard and merging results, because those IDs are scattered.
Resharding is also painful. If you change the number of shards (say from 4 to 6), the modulo changes, and most keys map to different shards. You have to move data. Consistent hashing (covered in Session 3.6) mitigates this problem.
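Both the even spread and the resharding pain are easy to demonstrate. This sketch uses CRC32 as a stable stand-in for a production hash function:

```python
import zlib

def shard_for(key, num_shards):
    # CRC32 is deterministic across processes; Python's built-in hash()
    # is randomized per process for strings, so it is unsuitable for routing
    return zlib.crc32(str(key).encode()) % num_shards

# Sequential IDs spread across every shard (no hotspot)
spread = {shard_for(uid, 4) for uid in range(1000)}

# Growing from 4 to 6 shards remaps most keys, which all have to move
moved = sum(1 for uid in range(1000) if shard_for(uid, 4) != shard_for(uid, 6))
fraction_moved = moved / 1000
```

With modulo placement, roughly two thirds of keys change shards when going from 4 to 6: only residues that agree modulo both 4 and 6 stay put.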
Directory-Based Sharding
Maintain a lookup table that maps each shard key (or key range) to a specific shard. The application checks the directory before every database operation to determine which shard to query.
Directory-based sharding offers maximum flexibility. You can place data on shards based on any criteria: geography, tenant ID, or custom business logic. You can move data between shards by updating the directory without changing the hash function or recalculating ranges.
The cost is the directory itself. It becomes a single point of failure and a latency bottleneck. Every query requires an extra lookup. The directory must be highly available, fast, and consistent. Many teams implement the directory in a cache (Redis) backed by a durable store.
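The pattern itself is a lookup table plus a move operation. Tenant and shard names here are invented; a real directory would be a replicated, highly available service rather than a dict:

```python
# Explicit placement: any business rule can decide where a tenant lives
directory = {
    "tenant-acme": "shard-eu-1",
    "tenant-globex": "shard-us-1",
}

def shard_for(tenant_id):
    return directory[tenant_id]        # the extra lookup on every query

def move_tenant(tenant_id, new_shard):
    # Real systems copy the data first, then flip this entry, then clean up
    directory[tenant_id] = new_shard

move_tenant("tenant-acme", "shard-eu-2")
location = shard_for("tenant-acme")
```

No hash function changed and no ranges were recalculated; the move was a single directory update, which is exactly the flexibility being paid for.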
Hash-Based Sharding in Action
```mermaid
graph TD
    Router["Shard Router
shard = hash(user_id) % 4"]
    Router -->|"hash = 0"| S0["Shard 0
Users: 4, 8, 12, ..."]
    Router -->|"hash = 1"| S1["Shard 1
Users: 1, 5, 9, ..."]
    Router -->|"hash = 2"| S2["Shard 2
Users: 2, 6, 10, ..."]
    Router -->|"hash = 3"| S3["Shard 3
Users: 3, 7, 11, ..."]
    S0 --- DB0[("PostgreSQL
Instance 0")]
    S1 --- DB1[("PostgreSQL
Instance 1")]
    S2 --- DB2[("PostgreSQL
Instance 2")]
    S3 --- DB3[("PostgreSQL
Instance 3")]
```
In this diagram, the shard router computes hash(user_id) % 4 and routes the query to the appropriate shard. Each shard is a separate PostgreSQL instance holding roughly one quarter of all users. Writes are distributed evenly. Reads by user_id hit exactly one shard.
Shard Key Selection
The shard key determines everything. It controls data distribution, query routing, and which operations are efficient. A bad shard key creates hotspots, forces cross-shard queries, and makes resharding a nightmare. Choose it based on your most common access pattern.
A good shard key has three properties:
- High cardinality. The key should have many distinct values. Sharding by country gives you ~200 shards at most, and some shards (India, USA) will be much larger than others.
- Even distribution. Values should distribute data uniformly across shards. Auto-incrementing IDs with hash-based sharding achieve this. Sharding by last name does not (too many "Smith" entries on one shard).
- Query alignment. Your most frequent query should include the shard key. If 90% of queries are "get all orders for user X," then user_id is a good shard key because each query hits one shard.
Strategy Comparison
| Dimension | Range-Based | Hash-Based | Directory-Based |
|---|---|---|---|
| Distribution | Uneven if keys are sequential | Even with good hash function | Depends on mapping logic |
| Range queries | Efficient (same shard) | Expensive (scatter-gather) | Possible if directory supports ranges |
| Hotspot risk | High (time-based, sequential keys) | Low (uniform distribution) | Low (configurable placement) |
| Resharding | Split ranges, move data | Painful (most keys remap) | Update directory (flexible) |
| Complexity | Low | Medium | High (directory service required) |
| Extra infrastructure | None | None | Lookup service (must be HA) |
| Use cases | Time-series data, archival | User data, general purpose | Multi-tenant, geo-based routing |
Cross-Shard Queries
The hardest problem in sharding is queries that span multiple shards. If your data is sharded by user_id but you need to "find all users in Jakarta," the city field is not the shard key. The query must hit every shard, collect partial results, and merge them. This is called a scatter-gather query.
Scatter-gather queries are slow and resource-intensive. They scale linearly with the number of shards. With 4 shards, it is tolerable. With 400 shards, it is a serious performance problem.
Solutions include maintaining secondary indexes (either global or per-shard), denormalizing data into a separate lookup table, or using a search engine like Elasticsearch for cross-shard queries. Each solution adds complexity and consistency challenges.
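Scatter-gather in miniature, over four in-memory "shards" with invented data. The key observation is that the loop touches every shard regardless of how many actually hold matches:

```python
# Data is sharded by user_id, so a query by city must visit all shards
shards = [
    [{"id": 1, "city": "Jakarta"}, {"id": 5, "city": "Bandung"}],
    [{"id": 2, "city": "Jakarta"}],
    [{"id": 3, "city": "Surabaya"}],
    [{"id": 4, "city": "Jakarta"}],
]

def users_in_city(city):
    results = []
    for shard in shards:                            # one round trip per shard
        results.extend(u for u in shard if u["city"] == city)
    return sorted(results, key=lambda u: u["id"])   # the merge step

jakarta_users = [u["id"] for u in users_in_city("Jakarta")]
```

With 4 shards this is four round trips; with 400 it is four hundred, which is why the cost of scatter-gather scales with the cluster, not with the result size.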
Real-World Examples
Instagram: Sharding PostgreSQL
Instagram needed to handle over 25 photo uploads and 90 likes per second in its early days. They chose to shard PostgreSQL rather than switch to NoSQL. Their approach used thousands of "logical" shards mapped to a smaller number of physical PostgreSQL instances. Each logical shard was a PostgreSQL schema containing the sharded tables. This allowed them to move logical shards between physical servers without re-bucketing data.
For globally unique IDs, Instagram designed a custom ID generation scheme using PostgreSQL's PL/pgSQL. Each ID encodes 41 bits for timestamp, 13 bits for logical shard ID, and 10 bits for an auto-incrementing sequence. This produces IDs that are sortable by time and unique across all shards.
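A Python sketch of the same bit layout. Instagram implemented theirs in PL/pgSQL inside each shard; the epoch constant below is an illustrative value, not theirs:

```python
EPOCH_MS = 1_293_840_000_000           # custom epoch in milliseconds (illustrative)

def make_id(now_ms, shard_id, seq):
    ts = now_ms - EPOCH_MS             # 41 bits: milliseconds since the custom epoch
    return (ts << 23) | (shard_id << 10) | (seq % 1024)  # 13-bit shard, 10-bit sequence

def shard_of(the_id):
    return (the_id >> 10) & 0x1FFF     # recover the logical shard from any ID

id1 = make_id(EPOCH_MS + 1, shard_id=5, seq=7)
id2 = make_id(EPOCH_MS + 2, shard_id=5, seq=8)
```

Because the timestamp occupies the high bits, later IDs always compare greater, so sorting by ID sorts by creation time; and because the shard ID is embedded, any ID can be routed without a directory lookup.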
Discord: Cassandra to ScyllaDB
Discord stored trillions of messages in Cassandra, partitioned by channel_id. This worked well for most channels, but large public channels (with millions of messages) created hot partitions. Cassandra's garbage collector caused latency spikes under this workload. Discord eventually migrated to ScyllaDB (a Cassandra-compatible database written in C++) and added a data services layer in Rust with request coalescing to handle hot partitions. The migration reduced their cluster from 177 Cassandra nodes to 72 ScyllaDB nodes, with p99 read latency dropping from 40-125ms to 15ms.
```mermaid
graph TD
    subgraph "Logical to Physical Mapping"
        LS1["Logical Shard 1
(PG Schema)"] --> PS1["Physical Server A"]
        LS2["Logical Shard 2
(PG Schema)"] --> PS1
        LS3["Logical Shard 3
(PG Schema)"] --> PS2["Physical Server B"]
        LS4["Logical Shard 4
(PG Schema)"] --> PS2
    end
    subgraph Resharding
        LS2 -.->|"Move shard"| PS2
    end
```
Instagram's logical shard abstraction is the key insight. By separating logical partitioning from physical placement, resharding becomes a data migration problem rather than a re-hashing problem. You move schemas between servers without touching the application's routing logic.
Systems Thinking Lens
Sharding introduces a reinforcing complexity loop. More data requires more shards. More shards increase the surface area for cross-shard queries and operational issues. Operational issues require more engineering effort. More engineering effort means less time for feature development. This loop accelerates as the system grows.
The leverage point is the shard key. A well-chosen shard key keeps most queries within a single shard, which limits the impact of the complexity loop. Instagram's choice to shard by user_id aligned with their primary access pattern (show me this user's photos), keeping the complexity manageable even at massive scale.
Further Reading
- Instagram Engineering, Sharding & IDs at Instagram. How Instagram designed their sharding scheme and ID generation for PostgreSQL.
- Discord Engineering, How Discord Stores Trillions of Messages. The journey from MongoDB to Cassandra to ScyllaDB, with details on hot partition problems.
- PlanetScale, Sharding Strategies: Directory-Based, Range-Based, and Hash-Based. Clear comparison of the three sharding strategies with practical guidance.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapter 6 ("Partitioning"). Thorough treatment of partitioning strategies and rebalancing.
- Instagram Engineering, Handling Growth with Postgres: 5 Tips from Instagram. Practical tips for scaling PostgreSQL at Instagram's growth rate.
Assignment
You have a users table with 100 million rows, sharded by hash(user_id) % 4 across four PostgreSQL instances. The system works well for user-level queries.
A new feature requires finding all users in a specific city (e.g., "Find all users in Jakarta"). This is a cross-shard query because city is not the shard key.
- Explain why this query is expensive with the current sharding scheme. How many shards must be queried? What does the query execution look like?
- Propose at least two solutions to make city-based lookups efficient. For each, describe the tradeoff (what do you gain, what do you pay in complexity or consistency?).
- If you could go back and choose a different shard key, what would you pick? Consider that the system still needs fast user_id lookups. Is there a compound key strategy that works for both access patterns?
Why Replicate?
A single database server is a single point of failure. If it goes down, your application goes down. Replication copies data from one server to others, providing two benefits: fault tolerance (if one server dies, another has the data) and read scaling (distribute read queries across multiple copies).
Database replication is the process of maintaining copies of the same data on multiple database servers. The source of writes is called the primary (or leader). The copies are called replicas (or followers). Replication can be synchronous or asynchronous, and uses either single-primary or multi-primary topology.
Replication and sharding solve different problems. Sharding splits data to handle more writes and store more data. Replication copies data to handle more reads and survive failures. Most production databases use both: shard the data across multiple primaries, then replicate each shard to one or more replicas.
Primary-Replica Replication
The most common replication topology. All writes go to a single primary server. The primary streams changes (via a write-ahead log, binary log, or oplog) to one or more replica servers. Replicas apply these changes and serve read queries.
```mermaid
graph TD
    Primary[("Primary
(read/write)")]
    Primary -->|"WAL stream"| R1[("Replica 1
(read-only)")]
    Primary -->|"WAL stream"| R2[("Replica 2
(read-only)")]
    Primary -->|"WAL stream"| R3[("Replica 3
(read-only)")]
    R[Read queries] --> R1
    R --> R2
    R --> R3
```
This topology is straightforward. There is no write conflict because only one server accepts writes. Replicas are read-only, which means you can add replicas to scale read throughput linearly. Three replicas handle roughly three times the read queries of a single server.
The downside is write scalability. All writes funnel through one server. If your write throughput exceeds what one machine can handle, primary-replica replication alone will not help. You need sharding for write scaling.
Failover is the other concern. If the primary dies, one of the replicas must be promoted to primary. This can be automated (most managed databases do this) but involves a brief window where writes are unavailable. The promoted replica may also be slightly behind the old primary, meaning some recent writes could be lost in asynchronous replication setups.
Multi-Primary Replication
Multi-primary replication (also called multi-leader or active-active) allows writes on multiple servers. Each primary replicates its changes to the others. This enables write availability across regions but introduces write conflicts that must be detected and resolved.
Multi-primary replication is used when you need to accept writes in multiple geographic regions. A user in Jakarta writes to the Jakarta primary. A user in Singapore writes to the Singapore primary. Each primary sends its changes to the other. Reads are served locally, and writes do not need to cross the ocean before being acknowledged.
The fundamental problem is conflicts. If user A updates their email on the Jakarta primary and user B updates the same email field on the Singapore primary at the same time, which write wins? Conflict resolution strategies include:
- Last-write-wins (LWW): The write with the later timestamp wins. Simple but loses data silently. Clock synchronization across data centers is imperfect.
- Application-level resolution: The database stores both versions and lets the application decide. CouchDB and DynamoDB use this approach.
- CRDTs (Conflict-free Replicated Data Types): Data structures designed to merge automatically without conflicts. Works for counters, sets, and some text editing scenarios but not for arbitrary data.
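To make the last-write-wins risk concrete, here is a minimal Python sketch of an LWW merge. The timestamps and email values are hypothetical; the point is that whichever version carries the later timestamp silently overwrites the other.

```python
def lww_merge(local, remote):
    """Last-write-wins: each version is a (timestamp, value) pair.

    The later timestamp wins and the loser is discarded silently,
    which is exactly the data-loss risk described above.
    """
    return local if local[0] >= remote[0] else remote

# Hypothetical concurrent updates to the same email field:
jakarta = (1700000000.120, "a@example.com")
singapore = (1700000000.250, "b@example.com")

print(lww_merge(jakarta, singapore)[1])  # the later Singapore write wins
```

If the data-center clocks are skewed, "later" may not mean "actually later," which is why LWW can drop a write the user believes succeeded.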
Jakarta")] <-->|"bi-directional"| P3[("Primary B
Singapore")] P2 -->|"one-way"| R2[("Replica")] P3 -->|"one-way"| R3[("Replica")] end
Synchronous vs. Asynchronous Replication
When the primary commits a write, it can wait for replicas to confirm before acknowledging the client (synchronous), or it can acknowledge immediately and let replicas catch up later (asynchronous). This is a fundamental tradeoff between durability and latency.
| Dimension | Synchronous | Asynchronous | Semi-Synchronous |
|---|---|---|---|
| Write latency | High (waits for replica ACK) | Low (immediate ACK from primary) | Medium (waits for 1 replica) |
| Data durability | Strong (data on multiple nodes) | Weak (data may exist only on primary) | Moderate (data on at least 2 nodes) |
| Risk of data loss on failure | None (all replicas confirmed) | Possible (unsynced writes lost) | Minimal (1 replica confirmed) |
| Impact of slow replica | Blocks all writes | No impact on writes | Blocks if the sync replica is slow |
| Throughput | Limited by slowest replica | Limited only by primary | Limited by one replica |
| Common use | Financial systems, compliance | Most web applications | PostgreSQL (sync standby), MySQL Group Replication |
Most production systems use asynchronous replication because the latency penalty of synchronous replication is too high, especially across data centers. A write that takes 2ms on the primary might take 50-200ms if it must wait for a replica in another region. Semi-synchronous replication is a common compromise: wait for at least one replica to confirm, then acknowledge the client.
Replication Lag
Replication lag is the delay between a write committed on the primary and that write becoming visible on a replica. In asynchronous replication, lag is typically under a second but can spike to minutes during high write load, long-running queries on replicas, or network congestion.
Replication lag causes stale reads. A user updates their profile on the primary, then immediately loads the profile page. If the read is routed to a replica that has not yet received the update, the user sees old data. This is the "read-your-own-writes" problem.
Lag is not constant. During normal operation, it might be 10-50 milliseconds. During a traffic spike that doubles write throughput, the replica's single-threaded apply process falls behind. Long-running analytical queries on replicas can also cause lag by competing for I/O and CPU resources.
The chart above simulates typical replication lag over a 24-hour period. During off-peak hours (midnight to 6 AM), lag stays under 20ms. During peak load (late morning and evening), lag spikes above the 200ms threshold. These spikes correlate with write volume: when the primary processes more writes per second, the replica falls further behind.
Handling Replication Lag
Several strategies mitigate the effects of replication lag:
- Read-your-own-writes: After a user writes data, route that user's subsequent reads to the primary (or a replica known to be up-to-date) for a short window. This ensures users see their own changes immediately.
- Monotonic reads: Ensure a user always reads from the same replica. This prevents the confusing experience of seeing newer data, then older data on the next request (because a different, more-lagged replica served it).
- Lag-aware routing: Monitor replication lag on each replica. Route reads only to replicas within an acceptable lag threshold. If all replicas are too far behind, fall back to the primary.
- Causal consistency: Track causal dependencies between writes and reads. A read that depends on a specific write waits until the replica has applied that write before returning results.
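The first and third strategies above can be combined in a small routing function. This is an illustrative Python sketch, not a production router: the replica names, lag numbers, and the sticky "recent writers" set are all assumptions for the example.

```python
import random

def route_read(user_id, recent_writers, replicas, max_lag_ms=200):
    """Pick a database endpoint for a read.

    recent_writers: set of user_ids that wrote within a sticky window;
    their reads go to the primary (read-your-own-writes).
    replicas: list of (name, lag_ms) pairs; only replicas within the
    lag threshold are eligible (lag-aware routing). If none qualify,
    fall back to the primary.
    """
    if user_id in recent_writers:
        return "primary"
    healthy = [name for name, lag in replicas if lag <= max_lag_ms]
    return random.choice(healthy) if healthy else "primary"

replicas = [("replica-1", 35), ("replica-2", 480), ("replica-3", 90)]
print(route_read(42, {42}, replicas))   # user 42 just wrote: primary
print(route_read(7, set(), replicas))   # replica-1 or replica-3, never replica-2
```

In practice the "recent writers" set would live in a session token or a short-TTL cache, and lag would be measured continuously rather than passed in as a constant.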
Replication Strategy Comparison
| Dimension | Primary-Replica | Multi-Primary |
|---|---|---|
| Write target | Single primary only | Any primary |
| Conflict potential | None (single writer) | Yes (concurrent writes on different primaries) |
| Write availability | Lost during failover | Available in all regions |
| Write latency | Low if client is near primary | Low (write to nearest primary) |
| Complexity | Low | High (conflict resolution logic) |
| Consistency | Strong (reads from primary) | Eventual (cross-region convergence) |
| Use case | Most applications | Multi-region, offline-capable, collaborative editing |
Systems Thinking Lens
Replication creates a fundamental balancing loop between consistency and performance. Adding replicas increases read throughput (reinforcing loop), but each replica introduces replication lag (balancing loop). The faster you write, the further behind replicas fall, and the more stale reads your users experience.
The leverage point is not the replication strategy itself but the read/write ratio of your workload. If your application is 95% reads and 5% writes, primary-replica replication with asynchronous replication works well. The lag affects only a small fraction of operations. If your application is 50% writes, replication lag becomes a dominant concern, and you may need synchronous replication or architectural changes (like event sourcing) that decouple reads from writes entirely.
Multi-primary replication has a hidden reinforcing loop: the more primaries you add, the more conflict resolution logic you need, the more edge cases emerge, the more engineering time you spend debugging subtle data inconsistencies instead of building features.
Further Reading
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapter 5 ("Replication"). The most thorough treatment of replication topologies, consistency models, and lag handling.
- Percona, Replication Lag in PostgreSQL. Practical guide to measuring, monitoring, and reducing PostgreSQL replication lag.
- CockroachDB, Synchronous and Asynchronous Replication Explained. Clear explanation of how data loss occurs during outages with async replication.
- AWS, Troubleshoot High Replica Lag with Amazon RDS for MySQL. Real-world troubleshooting steps for replication lag in managed databases.
- Percona, PostgreSQL Replication with Asynchronous and Synchronous Standbys. Hands-on configuration guide for both replication modes in PostgreSQL.
Assignment
Your application has a primary database server in Jakarta and an asynchronous replica in Surabaya. A user in Jakarta updates their profile (changes their display name). One second later, the same user loads their profile page. The read is routed to the Surabaya replica for load balancing.
- What will the user likely see? Explain why, referencing replication lag and the asynchronous replication model.
- Describe the "read-your-own-writes" problem in your own words. Why is it particularly frustrating for the user who just made the change?
- Propose three different solutions to fix this problem. For each solution, explain the mechanism and the tradeoff (what it costs in terms of complexity, latency, or infrastructure).
- Under what circumstances would you choose synchronous replication instead? What would that do to the user's write latency, given the network distance between Jakarta and Surabaya?
Why Indexes Exist
Without an index, every query is a full table scan. The database reads every row, checks if it matches your condition, and discards what does not. For a table with 100 rows, this is fine. For a table with 10 million rows, it is catastrophic.
An index is a separate data structure that maps column values to their physical locations in the table. Think of it like the index at the back of a textbook: instead of reading every page to find "B-tree," you look up "B-tree" in the index and jump directly to page 247.
A database index trades write speed and storage space for faster reads. Every index you add speeds up some queries and slows down every INSERT, UPDATE, and DELETE.
B-Tree Indexes
The B-tree (specifically B+ tree) is the default index type in PostgreSQL, MySQL, SQL Server, and virtually every relational database. It organizes data in a balanced tree structure where every leaf node is the same distance from the root.
A B+ tree with a branching factor of 500 can index 8 TB of data with only 4 levels. That means any lookup requires at most 4 disk I/O operations, regardless of table size. The time complexity is O(log n), which grows very slowly.
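The slow growth of tree depth is easy to check with a couple of lines of Python. The branching factor of 500 follows the example above; the entry counts are arbitrary illustrations.

```python
import math

def btree_levels(entries, branching_factor=500):
    """Levels needed for a B+ tree: ceil(log_b(entries)), minimum 1."""
    return max(1, math.ceil(math.log(entries, branching_factor)))

# Depth grows with the logarithm of the table size:
for n in (1_000, 1_000_000, 1_000_000_000, 10_000_000_000):
    print(f"{n:>14,} entries -> {btree_levels(n)} levels")
```

Ten billion entries still fit in four levels, which is why a lookup on a huge table costs only a handful of page reads.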
B-trees support equality lookups (WHERE id = 42), range queries (WHERE price BETWEEN 10 AND 50), prefix matching (WHERE name LIKE 'Jo%'), and sorting (ORDER BY created_at). This versatility is why they are the default.
[30 | 70]"] --> L1["[10 | 20]"] R --> L2["[40 | 50 | 60]"] R --> L3["[80 | 90]"] L1 --> D1["Rows: 1-9"] L1 --> D2["Rows: 10-19"] L1 --> D3["Rows: 20-29"] L2 --> D4["Rows: 30-39"] L2 --> D5["Rows: 40-49"] L2 --> D6["Rows: 50-59"] L2 --> D7["Rows: 60-69"] L3 --> D8["Rows: 70-79"] L3 --> D9["Rows: 80-89"] L3 --> D10["Rows: 90-99"] style R fill:#222221,stroke:#c8a882,color:#ede9e3 style L1 fill:#222221,stroke:#6b8f71,color:#ede9e3 style L2 fill:#222221,stroke:#6b8f71,color:#ede9e3 style L3 fill:#222221,stroke:#6b8f71,color:#ede9e3
The root node contains separator keys that direct the search. Internal nodes narrow the range. Leaf nodes contain the actual pointers to table rows. In a B+ tree, leaf nodes are also linked together, which makes range scans efficient: once you find the starting point, you walk the linked list instead of re-traversing the tree.
Hash Indexes
A hash index applies a hash function to the column value and stores the result in a hash table. Lookups are O(1) on average, which is faster than B-tree's O(log n). PostgreSQL benchmarks show hash indexes can be 10-22% faster than B-trees for pure equality lookups.
The tradeoff is severe: hash indexes support only equality comparisons (WHERE id = 42). They cannot do range queries, sorting, or prefix matching. If you need any of those operations on the column, a hash index is useless.
GIN and GiST Indexes
PostgreSQL offers two specialized index types for complex data. GIN (Generalized Inverted Index) is built for full-text search, arrays, and JSONB. It creates an entry for every element inside a composite value. GIN lookups are about 3x faster than GiST for text search, but GIN indexes take about 3x longer to build.
GiST (Generalized Search Tree) handles geometric data, ranges, and nearest-neighbor queries. GiST indexes are lossy for text search (they may return false positives that require a recheck), but they update faster than GIN, making them better for frequently changing data.
Index Comparison
| Dimension | B-Tree | Hash | GIN | GiST |
|---|---|---|---|---|
| Lookup time | O(log n) | O(1) | O(log n) per element | O(log n) |
| Range queries | Yes | No | No | Yes |
| Sorting | Yes | No | No | No |
| Full-text search | No | No | Yes (preferred) | Yes (lossy) |
| JSONB/Array | Limited | No | Yes | No |
| Geometric/Spatial | No | No | No | Yes |
| Build speed | Moderate | Fast | Slow (~3x GiST) | Moderate |
| Write overhead | Moderate | Low | High | Moderate |
| Best for | General purpose | Equality-only lookups | Full-text, JSONB, arrays | Geometry, ranges, nearest-neighbor |
Composite Indexes and Column Order
A composite index includes multiple columns. The database sorts entries first by the leftmost column, then by the second column within each group, and so on. This ordering has a critical consequence called the leftmost prefix rule.
The leftmost prefix rule: a composite index on (A, B, C) can serve queries on A, on (A, B), and on (A, B, C). It cannot efficiently serve queries on B alone, C alone, or (B, C). The index must be read from left to right without skipping columns.
Consider an index on (user_id, status, created_at). This index efficiently supports:
- `WHERE user_id = 5`
- `WHERE user_id = 5 AND status = 'pending'`
- `WHERE user_id = 5 AND status = 'pending' ORDER BY created_at`
It does NOT efficiently support WHERE status = 'pending' alone, because status is not the leftmost column.
Index Selectivity
Selectivity measures how well an index narrows down results. A column with 1 million unique values in a 1 million row table has selectivity of 1.0 (perfect). A boolean column with only two values has selectivity close to 0 (terrible).
Put highly selective columns first in composite indexes. If user_id narrows 10 million rows to 50, and status narrows them to 5 million, always lead with user_id.
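As a quick sanity check, selectivity is just the ratio of distinct values to total rows. The row counts below are hypothetical, chosen to mirror the user_id vs. status comparison above.

```python
def selectivity(distinct_values, total_rows):
    """Selectivity = distinct values / total rows. 1.0 is perfect."""
    return distinct_values / total_rows

# Hypothetical orders table with 10 million rows:
print(selectivity(2_000_000, 10_000_000))  # user_id: highly selective
print(selectivity(3, 10_000_000))          # status: nearly useless alone
```

Query planners estimate these ratios from table statistics, which is why running ANALYZE (or its equivalent) matters after large data changes.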
Covering Indexes
A covering index contains all columns needed by a query. The database can answer the query entirely from the index without touching the table data. This eliminates the random I/O cost of jumping from the index to the table rows.
In PostgreSQL, you use INCLUDE to add non-searchable columns to the index:
CREATE INDEX idx_orders_user_status
ON orders (user_id, status)
INCLUDE (created_at, total);
Now a query selecting user_id, status, created_at, total with a WHERE on user_id and status never touches the heap. The savings are substantial for wide tables with many columns.
The Hidden Write Cost
Every index must be updated on every write operation. An INSERT into a table with 5 indexes means 6 write operations: one for the table and one for each index. An UPDATE that modifies an indexed column triggers an index delete and an index insert for each affected index.
This cost is not theoretical. On write-heavy workloads, excessive indexing is a common cause of degraded performance. The solution is to index deliberately: only create indexes that serve actual queries, and drop indexes that query plans never use.
Query Time vs. Table Size
The following chart shows how query response time changes as table size grows, comparing a full table scan against an indexed lookup.
The table scan time grows linearly with row count. The indexed lookup barely moves. At 10 million rows, the difference is four orders of magnitude.
Systems Thinking Lens
Indexes are a classic balancing loop. Adding an index improves read performance, which encourages more queries, which eventually leads to more indexes, which degrades write performance. The leverage point is not "add more indexes" but "understand your query patterns." A single well-designed composite index often replaces three or four single-column indexes while also reducing write overhead.
Detected"] --> B["Add Index"] B --> C["Reads Faster"] C --> D["More Queries
Written"] D --> E["More Indexes
Requested"] E --> F["Write Performance
Degrades"] F --> G["Review &
Consolidate"] G --> B style A fill:#222221,stroke:#c8a882,color:#ede9e3 style F fill:#222221,stroke:#c8a882,color:#ede9e3
Further Reading
- MySQL Reference Manual, Comparison of B-Tree and Hash Indexes. Official MySQL documentation on when to use each index type.
- PlanetScale, Composite Indexes. Practical guide to composite index design with the leftmost prefix rule.
- pganalyze, Understanding Postgres GIN Indexes. Deep dive into GIN index internals, build time, and query performance.
- PostgreSQL Documentation, Preferred Index Types for Text Search. Official comparison of GIN vs GiST for full-text search.
- EnterpriseDB, Are Hash Indexes Faster than B-Tree Indexes in Postgres?. Benchmark results comparing hash and B-tree index performance.
Assignment
You have this query running thousands of times per second:
SELECT * FROM orders
WHERE user_id = ?
AND status = 'pending'
ORDER BY created_at DESC;
Design the optimal composite index for this query. Answer these questions:
- What columns should the index include, and in what order?
- Why does column order matter here? What happens if you put `status` first instead of `user_id`?
- Would making this a covering index (using `INCLUDE`) help? What columns would you include?
- What is the write cost of this index on a table that receives 500 inserts per second?
The Problem with Modulo Hashing
Distributed systems often need to assign data to nodes. The simplest approach is modulo hashing: take the hash of a key, divide by the number of nodes, and use the remainder as the node assignment. node = hash(key) % N.
This works well as long as N never changes. The moment you add or remove a node, nearly every key maps to a different node. If you go from 4 nodes to 5, the formula changes from hash(key) % 4 to hash(key) % 5. For most keys, the remainder is different. In a cache cluster, this means almost every cached item is now on the wrong node. The cache hit rate drops to nearly zero. Every request hits the database. You just turned a scaling event into a cascading failure.
Modulo hashing redistributes nearly all keys when the node count changes. For a system with K keys going from N to N+1 nodes, roughly N/(N+1) of all keys move. With 4 nodes, adding a 5th moves about 80% of keys.
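You can measure this redistribution directly. The sketch below hashes 10,000 synthetic keys (names are arbitrary) with MD5 and counts how many land on a different node when the cluster grows from 4 to 5.

```python
import hashlib

def node_for(key, n_nodes):
    """Modulo assignment: hash the key, take the remainder."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved going from 4 to 5 nodes")
```

The measured fraction comes out close to the theoretical 4/5: a key keeps its node only when hash % 4 equals hash % 5, which happens for about one key in five.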
Consistent Hashing
Consistent hashing solves this by mapping both nodes and keys onto the same circular hash space (a "hash ring"). Each node is placed at a position on the ring determined by hashing its identifier. Each key is placed on the ring by hashing its value. The key is assigned to the first node encountered when walking clockwise from the key's position.
When a node is added, it takes over a segment of the ring from its clockwise neighbor. Only the keys in that segment move. When a node is removed, its segment is absorbed by the next clockwise node. Only the keys assigned to the removed node need to move.
Consistent hashing guarantees that when the number of nodes changes from N to N+1, only K/N keys need to be remapped on average, where K is the total number of keys. This is the theoretical minimum.
position: 0°"] B["Node B
position: 90°"] C["Node C
position: 180°"] D["Node D
position: 270°"] end K1["Key 1 (45°)"] -.-> B K2["Key 2 (120°)"] -.-> C K3["Key 3 (200°)"] -.-> D K4["Key 4 (350°)"] -.-> A style A fill:#222221,stroke:#c8a882,color:#ede9e3 style B fill:#222221,stroke:#6b8f71,color:#ede9e3 style C fill:#222221,stroke:#c8a882,color:#ede9e3 style D fill:#222221,stroke:#6b8f71,color:#ede9e3
In this ring, Key 1 at position 45 degrees walks clockwise and hits Node B at 90 degrees. Key 4 at position 350 degrees wraps around and hits Node A at 0 degrees. Each key belongs to the next node clockwise.
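A minimal ring fits in a few lines of Python using a sorted list and binary search. This is an illustrative sketch (MD5 positions, made-up node names), not a production implementation; it has no virtual nodes yet.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent hash ring: one position per node."""

    def __init__(self, nodes=()):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def _pos(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self.ring, (self._pos(node), node))

    def get(self, key):
        """First node clockwise from the key's position (wraps around)."""
        i = bisect.bisect(self.ring, (self._pos(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(10_000)]
before = {k: ring.get(k) for k in keys}
ring.add("node-e")  # only keys in node-e's new arc should move
moved = sum(before[k] != ring.get(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding node-e")
```

Compare this with modulo hashing, where the same resize would move roughly four keys out of five: here, only the keys falling in the new node's arc are reassigned.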
Key Redistribution: The Numbers
The difference between modulo hashing and consistent hashing becomes dramatic when nodes change.
With modulo hashing, the problem gets worse as you scale. Going from 99 to 100 nodes moves 99% of keys. With consistent hashing, the same change moves only 1%. The gap widens at every scale.
Comparison Table
| Dimension | Modulo Hashing | Consistent Hashing |
|---|---|---|
| Key assignment | hash(key) % N | Next clockwise node on hash ring |
| Keys moved on add/remove | ~(N-1)/N of all keys | ~K/N keys (theoretical minimum) |
| Load distribution | Uniform (when N is stable) | Uneven without virtual nodes |
| Implementation complexity | Trivial | Moderate (ring, sorted map) |
| Node heterogeneity | Not supported | Supported via virtual nodes |
| Cache impact on resize | Near-total cache miss storm | Minimal disruption |
| Used by | Simple hash tables | DynamoDB, Cassandra, Memcached, CDNs |
The Problem with Basic Consistent Hashing
With only one position per node on the ring, load distribution is uneven. If Node A sits at 0 degrees and Node B sits at 10 degrees, Node B owns a tiny 10-degree arc while Node A owns a massive 350-degree arc. Node A gets roughly 35x more traffic than Node B. Random placement does not guarantee even spacing.
Removing a node makes this worse. The entire load of the removed node shifts to a single neighbor, potentially doubling that neighbor's load instantly.
Virtual Nodes
Virtual nodes (vnodes) solve load imbalance by giving each physical node multiple positions on the ring. Instead of hashing a node once, you hash it with different suffixes: hash("NodeA-1"), hash("NodeA-2"), hash("NodeA-3"), and so on. Each virtual node is a separate point on the ring, but they all map back to the same physical server.
Virtual nodes spread each physical node across many positions on the hash ring. With 100-200 virtual nodes per physical node, the standard deviation of load distribution drops to 5-10% of the mean.
15°"] B1["B-v1
45°"] C1["C-v1
80°"] A2["A-v2
130°"] B2["B-v2
170°"] C2["C-v2
210°"] A3["A-v3
260°"] B3["B-v3
300°"] C3["C-v3
340°"] end PA["Physical Node A"] --- A1 PA --- A2 PA --- A3 PB["Physical Node B"] --- B1 PB --- B2 PB --- B3 PC["Physical Node C"] --- C1 PC --- C2 PC --- C3 style PA fill:#222221,stroke:#c8a882,color:#ede9e3 style PB fill:#222221,stroke:#6b8f71,color:#ede9e3 style PC fill:#222221,stroke:#8a8478,color:#ede9e3
With 3 virtual nodes per physical node, each physical node owns 3 arcs on the ring instead of 1. The arcs are spread across the ring, so the load is more evenly distributed. When Node B is removed, its three arcs are absorbed by three different nodes, not just one. This prevents any single node from being overwhelmed.
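Extending the ring idea with virtual nodes takes only a suffix per position. This Python sketch (MD5 positions, arbitrary node names, 150 vnodes per node) shows the load evening out across three physical nodes.

```python
import bisect
import hashlib
from collections import Counter

def build_ring(nodes, vnodes=150):
    """Place each physical node at `vnodes` positions on the ring."""
    ring = []
    for node in nodes:
        for i in range(vnodes):
            pos = int(hashlib.md5(f"{node}-v{i}".encode()).hexdigest(), 16)
            ring.append((pos, node))
    ring.sort()
    return ring

def lookup(ring, key):
    """Walk clockwise from the key's position to the next virtual node."""
    pos = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect.bisect(ring, (pos, ""))
    return ring[i % len(ring)][1]

ring = build_ring(["node-a", "node-b", "node-c"], vnodes=150)
load = Counter(lookup(ring, f"key-{i}") for i in range(30_000))
print(load)  # each physical node ends up near 10,000 keys
```

With a single position per node, one unlucky placement can skew load badly; with 150 positions each, every physical node's share stays close to one third.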
Virtual Nodes Enable Heterogeneous Clusters
Not all servers are equal. A machine with 64 GB of RAM can handle more data than one with 16 GB. Virtual nodes let you assign more vnodes to more powerful machines. If Server A has 4x the capacity of Server B, give Server A 200 virtual nodes and Server B 50. Server A will own roughly 4x as much of the ring.
This is how DynamoDB and Cassandra handle mixed hardware in production clusters. The vnode count per node is a configuration parameter that can be adjusted without changing the hashing algorithm.
Systems Thinking Lens
Modulo hashing is a brittle system with tight coupling between node count and key assignment. Any change to one variable (node count) causes a cascade through the entire system (mass key migration). Consistent hashing decouples these variables. The hash ring acts as an abstraction layer that absorbs changes locally instead of propagating them globally.
Virtual nodes add a second layer of decoupling: physical capacity from ring position. This makes the system adaptive. You can scale heterogeneously, handle failures gracefully, and rebalance without coordination storms.
Further Reading
- David Karger et al., Consistent Hashing and Random Trees (STOC 1997). The original paper that introduced consistent hashing.
- AlgoMaster, Consistent Hashing Explained. Clear walkthrough of hash rings, virtual nodes, and redistribution math.
- GeeksforGeeks, Consistent Hashing in System Design. Visual explanation with diagrams and real-world applications.
- Hello Interview, Consistent Hashing for System Design. Interview-focused treatment with practice scenarios.
Assignment
You have 4 cache nodes (A, B, C, D) arranged on a consistent hash ring at positions 0, 90, 180, and 270 degrees. Each node has 1,000 keys assigned to it (4,000 keys total, uniformly distributed).
- Draw the hash ring with the 4 nodes and label the arcs each node owns.
- A 5th node E is added at position 135 degrees. Which node loses keys? How many keys move, approximately?
- Now calculate the same scenario with modulo hashing: 4,000 keys, going from `hash(key) % 4` to `hash(key) % 5`. How many keys move?
- If Node C (at 180 degrees) fails and is removed, what happens to its keys? Which node absorbs them? Why is this a problem, and how do virtual nodes fix it?
Three Kinds of Storage
When engineers say "storage," they might mean three fundamentally different things. Block storage gives you raw disk volumes. File storage gives you a hierarchical file system. Object storage gives you a flat namespace of immutable blobs with metadata. Each model serves different workloads, and choosing wrong means paying too much, building unnecessary complexity, or hitting walls that should not exist.
| Dimension | Block Storage | File Storage | Object Storage |
|---|---|---|---|
| Abstraction | Raw disk (sectors, volumes) | Hierarchical (directories, files) | Flat (bucket + key + metadata) |
| Access pattern | Random read/write, low latency | POSIX file operations | HTTP GET/PUT, whole-object |
| Mutability | Mutable (overwrite bytes in place) | Mutable (edit files) | Immutable (replace entire object) |
| Scalability | Limited by volume size (16 TB typical) | Limited by file system (PB range) | Virtually unlimited |
| Durability | 99.999% (replicated volumes) | Depends on implementation | 99.999999999% (11 nines, S3) |
| Cost | $$$ (EBS gp3: $0.08/GB/mo) | $$ (EFS: $0.30/GB/mo) | $ (S3 Standard: $0.023/GB/mo) |
| Typical use | Databases, boot volumes, OLTP | Shared file access, CMS, legacy apps | Media, backups, data lakes, static assets |
Object Storage: S3-Style
Amazon S3 is the canonical object store, and its API has become a de facto standard (MinIO, Cloudflare R2, and Backblaze B2 all implement S3-compatible APIs). The model is simple: you have buckets (containers) and objects (files with metadata). Every object has a unique key within its bucket. You interact with it over HTTP.
Object storage treats data as immutable blobs accessed by key. You cannot edit byte 500 of a file. You replace the entire object. This constraint enables massive durability (11 nines), global replication, and near-infinite scale at low cost.
Objects in S3 are replicated across at least 3 availability zones automatically. The durability guarantee of 99.999999999% means that if you store 10 million objects, you can statistically expect to lose one every 10,000 years. No database offers this durability at this cost.
Storage Cost Comparison
The price difference between storage tiers is dramatic. Choosing the right tier based on access frequency saves substantial money at scale.
EBS block storage costs 80x more per GB than Glacier Deep Archive. Even S3 Standard is 3.5x cheaper than EBS. The tradeoff is access latency: EBS gives you sub-millisecond random reads, while Glacier Deep Archive can take 12 hours to retrieve data. Choose your tier based on how often you access the data, not how important it is.
Pre-Signed URLs
A pre-signed URL is a time-limited URL that grants temporary access to a private S3 object. Your backend generates the URL using its AWS credentials, and the client uses it to upload or download directly from S3. The backend never touches the file bytes.
This pattern has three advantages. First, the file upload goes directly from the client to S3, so your backend servers do not handle large file transfers. This saves bandwidth, memory, and CPU. Second, the pre-signed URL expires, so even if it leaks, the exposure window is short. Third, S3 handles multipart uploads, retries, and storage durability. Your backend stays simple.
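The mechanics can be sketched with a toy signer. Real S3 pre-signed URLs use AWS Signature Version 4, which is more involved; this simplified stand-in (hypothetical bucket, key, and secret) shows the core idea: the server embeds an expiry and an HMAC that only it can mint, and anyone can later verify without database access.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-signing-key"  # stands in for the backend's credentials

def presign(bucket, key, expires_in=300):
    """Toy pre-signed URL: expiry plus an HMAC over bucket/key/expiry."""
    expires = int(time.time()) + expires_in
    payload = f"{bucket}/{key}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"https://{bucket}.example.com/{key}?expires={expires}&sig={sig}"

def verify(bucket, key, expires, sig):
    """Recompute the HMAC and check the URL has not expired."""
    payload = f"{bucket}/{key}:{expires}".encode()
    good = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(good, sig) and time.time() < int(expires)

url = presign("avatars", "user-42.jpg")
print(url)
```

Tampering with the key or the expiry invalidates the signature, and a leaked URL stops working after the expiry window, which is the property the pattern depends on.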
For downloads, the same pattern applies. The backend generates a pre-signed GET URL, and the client downloads directly from S3. To serve files even faster, put a CDN (CloudFront, Cloudflare) in front of the S3 bucket. The CDN caches popular objects at edge locations close to users, reducing latency from hundreds of milliseconds to single digits.
CDN Integration
The standard architecture for serving static assets at scale combines object storage with a CDN:
(Tokyo)"] E1 -->|cache miss| O["S3 Origin
(us-east-1)"] E1 -->|cache hit| U U2["User (London)"] --> E2["CDN Edge
(London)"] E2 -->|cache miss| O E2 -->|cache hit| U2 style O fill:#222221,stroke:#c8a882,color:#ede9e3 style E1 fill:#222221,stroke:#6b8f71,color:#ede9e3 style E2 fill:#222221,stroke:#6b8f71,color:#ede9e3
On the first request, the CDN edge fetches the object from S3 and caches it. Subsequent requests for the same object are served from the edge, which is geographically close to the user. For popular content (profile photos, product images, CSS files), the cache hit rate typically exceeds 95%. You pay S3 retrieval costs only on cache misses.
Distributed File Systems: HDFS and GFS
For big data workloads (log processing, analytics, machine learning training), the access pattern is different. You need to read and write very large files (gigabytes to terabytes) sequentially, and you want computation to move to the data rather than the other way around.
HDFS (Hadoop Distributed File System) splits large files into fixed-size blocks (128 MB default), replicates each block across 3 nodes, and tracks block locations in a central NameNode. Computation frameworks like MapReduce and Spark process data on the nodes where it lives, avoiding network transfer.
HDFS was inspired by Google's GFS paper (2003). The design optimizes for throughput over latency: sequential reads of entire blocks are fast, but random access to individual bytes is not supported. The NameNode is a single point of failure (mitigated by standby NameNodes in production), and the system is designed for append-only writes. You do not edit files in place.
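The block arithmetic is worth seeing once. Using the defaults above (128 MB blocks, replication factor 3), a quick calculation shows what one large file costs the cluster; the 10 GB file size is just an example.

```python
import math

BLOCK_SIZE = 128 * 1024**2  # HDFS default block size: 128 MB
REPLICATION = 3             # HDFS default replication factor

def hdfs_footprint(file_bytes):
    """Blocks created and raw disk consumed for one file."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    return blocks, file_bytes * REPLICATION

blocks, raw = hdfs_footprint(10 * 1024**3)  # a 10 GB file
print(blocks)          # 80 blocks of 128 MB
print(raw / 1024**3)   # 30.0 GB of raw storage across the cluster
```

Each of those 80 blocks lives on 3 different nodes, which is what lets a Spark or MapReduce job schedule computation next to any replica of the data.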
HDFS and object storage serve different needs. Object storage is for serving individual files to many users (a profile photo, a PDF). HDFS is for processing massive datasets across a compute cluster. In modern architectures, the line is blurring. Systems like Apache Iceberg and Delta Lake store data as Parquet files in S3, combining the durability of object storage with the query patterns of big data.
Systems Thinking Lens
Storage is a system with competing feedback loops. The cost loop pushes data toward colder, cheaper tiers. The latency loop pulls frequently accessed data toward hotter, faster tiers. The durability loop demands replication, which multiplies cost. A well-designed storage strategy does not pick one tier. It classifies data by access pattern and moves it through tiers automatically (S3 Intelligent-Tiering does this for $0.0025 per 1,000 objects per month).
The leverage point is lifecycle policy, not storage selection. The decision of where data starts is less important than the rules that govern where it moves over time.
Further Reading
- Amazon Web Services, Amazon S3 Pricing. Current pricing for all S3 storage classes, requests, and data transfer.
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, The Google File System (SOSP 2003). The paper that inspired HDFS and modern distributed file systems.
- CloudForecast, Amazon S3 Pricing Guide. Detailed breakdown of S3 cost components including hidden costs like request fees and retrieval charges.
- AWS Documentation, Uploading Objects Using Pre-Signed URLs. Official guide to generating and using pre-signed URLs for direct client uploads.
Assignment
Users of your application upload profile photos. Currently, photos are uploaded to your backend server, which saves them to local disk. The application has 500,000 users, and about 10,000 photos are uploaded per day. Average photo size is 2 MB.
Design a new upload flow using object storage. Answer these questions:
- Where does the image go? Which storage service and tier do you choose, and why?
- How does the client get the upload URL? Walk through the full request flow from "user clicks upload" to "photo is stored."
- How do you serve the photo fast to users around the world? Draw the read path.
- Estimate the monthly storage cost after one year, assuming no deletions and 2 MB average size with 10,000 uploads per day.
Why Cache the Read Path
Most applications are read-heavy. A typical e-commerce product page is read 5,000 times for every update. A user profile is read hundreds of times between edits. If every read hits the database, you are doing redundant work. The data has not changed, but the database parses the query, searches the index, fetches the row, and serializes the result every single time.
A cache stores the result of a previous computation so it can be reused. For read-heavy workloads, caching reduces database load, cuts response time from milliseconds to microseconds, and lets you serve more traffic without scaling the database.
There are two primary patterns for caching on the read path: cache-aside and read-through. They differ in who is responsible for populating the cache.
Cache-Aside (Lazy Loading)
In cache-aside, the application manages the cache directly. The cache sits beside the database, not in front of it. The application checks the cache first. On a hit, it returns the cached data. On a miss, it queries the database, writes the result to the cache, and returns it.
Cache-aside puts the application in control. The application decides what to cache, when to cache it, and when to evict it. The cache and database are independent systems with no awareness of each other.
Cache-aside has several strengths. The cache only contains data that has actually been requested, so you do not waste memory caching things nobody reads. The application has full control over cache keys, TTLs, and invalidation logic. And because the cache and database are independent, a cache failure does not bring down the application. Reads fall back to the database, which is slower but functional.
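The read and write paths above fit in a few lines. This is a minimal sketch in which a dict with expiry stamps stands in for Redis and `db_query` stands in for your real database call; the key names and TTL are illustrative.

```python
import time

cache: dict = {}          # key -> (value, expires_at); stands in for Redis
TTL_SECONDS = 300

def db_query(user_id):
    return {"id": user_id, "name": f"user-{user_id}"}  # placeholder DB row

def get_user(user_id):
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():      # cache hit, still fresh
        return entry[0]
    value = db_query(user_id)                 # cache miss: query the DB
    cache[user_id] = (value, time.time() + TTL_SECONDS)  # populate cache
    return value

def update_user(user_id, value):
    # Write path: update the DB (omitted), then invalidate the cache entry
    # so the next read repopulates it with fresh data.
    cache.pop(user_id, None)
```

Note that all the miss-handling and invalidation logic lives in the application, which is exactly what distinguishes cache-aside from read-through.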
The weakness is the cold start problem. When the cache is empty (after a restart or deployment), every request is a cache miss. All traffic hits the database simultaneously. This is the thundering herd problem.
Read-Through
In read-through, the cache itself is responsible for loading data from the database. The application only talks to the cache. If the cache does not have the data, it fetches it from the database, stores it, and returns it. The application never queries the database directly for cached entities.
Read-through delegates data loading to the cache layer. The application treats the cache as the only data source. Cache misses are handled internally by the cache provider.
Read-through simplifies the application code. The application does not contain any cache-miss logic or database fallback code. The cache provider handles everything. This is particularly useful when multiple services need the same caching behavior. Instead of implementing cache-miss handling in each service, the cache provider handles it once.
Read-through also avoids one specific problem with cache-aside: duplicate database queries on concurrent misses. In cache-aside, if 100 requests arrive for the same uncached key simultaneously, all 100 may query the database before any of them writes to the cache. Read-through implementations typically use internal locking: the first request triggers a database fetch, and subsequent requests for the same key wait for the result instead of making redundant database calls.
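The loader-plus-lock idea can be sketched as follows. This is a simplified in-process model, assuming a loader callback supplied by the application; real read-through providers apply the same per-key serialization across a cluster.

```python
import threading
import time

class ReadThroughCache:
    """Read-through sketch: the cache owns the loader. Concurrent misses
    on the same key are serialized by a per-key lock, so only one DB
    query runs per cold key."""

    def __init__(self, loader, ttl=300):
        self._loader = loader                 # callback that hits the DB
        self._data = {}                       # key -> (value, expires_at)
        self._locks = {}                      # key -> lock for miss handling
        self._guard = threading.Lock()
        self._ttl = ttl

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                   # hit: no DB involvement
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                            # one loader call per key
            entry = self._data.get(key)
            if entry and entry[1] > time.time():
                return entry[0]               # another request just loaded it
            value = self._loader(key)
            self._data[key] = (value, time.time() + self._ttl)
            return value
```

The double-check inside the lock is the important detail: requests that waited find the freshly loaded value and return without querying the database.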
Refresh-Ahead
Refresh-ahead is a proactive strategy that reloads cached data before it expires. The cache tracks access patterns and pre-fetches entries that are likely to be requested again. If a cached item with a 300-second TTL is accessed at the 250-second mark, the cache proactively refreshes it in the background so it never actually expires.
This eliminates the latency spike that occurs when a popular cache entry expires and the next request must wait for a database query. The downside is that it can waste resources refreshing data that nobody will request again.
Strategy Comparison
| Dimension | Cache-Aside | Read-Through | Refresh-Ahead |
|---|---|---|---|
| Who loads the cache | Application code | Cache provider | Cache provider (proactive) |
| Cache miss latency | Full DB query time | Full DB query time (first request) | Near zero (pre-fetched) |
| Thundering herd risk | High (N concurrent misses = N DB queries) | Low (internal locking) | Very low (entries rarely expire) |
| Application complexity | Higher (miss logic in app) | Lower (cache handles misses) | Lowest (transparent) |
| Wasted cache memory | Low (only requested data cached) | Low (only requested data cached) | Higher (pre-fetches may go unused) |
| Cache provider requirements | Any (Redis, Memcached) | Must support data loader callbacks | Must support TTL tracking + async refresh |
| Staleness control | TTL or explicit invalidation | TTL or explicit invalidation | Minimal staleness (continuous refresh) |
| Best for | General purpose, read-heavy, simple setups | Multi-service architectures, consistent patterns | Hot keys with strict latency requirements |
TTL Strategy
Time-to-live (TTL) determines how long a cached entry remains valid. Set it too short, and you get frequent cache misses that negate the benefit. Set it too long, and users see stale data.
The right TTL depends on two things: how often the underlying data changes and how much staleness your users can tolerate. A product catalog that updates twice a day can tolerate a 5-minute TTL. A stock price that changes every second needs a TTL under 1 second, or you should not cache it at all.
A practical approach: start with the ratio of reads to writes.
- 10,000:1 read/write ratio (product pages): TTL of 5-15 minutes. The data rarely changes, and even stale data is acceptable for a few minutes.
- 100:1 read/write ratio (user profiles): TTL of 1-5 minutes with explicit invalidation on write.
- 10:1 read/write ratio (inventory counts): TTL of 10-30 seconds, or skip caching and use database read replicas instead.
Cache Hit Ratio
The cache hit ratio measures what percentage of requests are served from the cache. Production systems typically target 90-99% hit ratios. Below 80%, the cache is not providing enough benefit to justify its operational cost. Above 95%, you are in excellent shape.
Hit ratio depends on the working set size relative to cache capacity, the access pattern (power-law distributions cache well, uniform distributions do not), and TTL settings. Monitor this metric continuously. A sudden drop in hit ratio usually indicates either a change in traffic patterns or a cache configuration problem.
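The metric itself is simple arithmetic; Redis exposes the raw counters as keyspace_hits and keyspace_misses in INFO output. A minimal sketch:

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache; 0.0 if no traffic yet."""
    total = hits + misses
    return hits / total if total else 0.0

# 990 hits and 10 misses -> 99% hit ratio, comfortably in the target band.
```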
Systems Thinking Lens
Caching introduces a balancing loop: as database load increases, you add caching, which reduces database load. But caching also introduces a new reinforcing loop: as the cache absorbs more traffic, the application becomes dependent on it. A cache failure that was tolerable at 50% hit ratio becomes catastrophic at 99% hit ratio, because the database has been sized for 1% of the traffic, not 100%.
The leverage point is not the cache itself but the invalidation strategy. A cache that serves stale data is worse than no cache, because it creates bugs that are intermittent and hard to reproduce. Design the invalidation path with the same rigor as the caching path.
Further Reading
- AWS Whitepapers, Database Caching Strategies Using Redis. Official AWS guide covering cache-aside, read-through, and write-through patterns with Redis.
- CodeAhoy, Caching Strategies and How to Choose the Right One. Practical comparison of all major caching strategies with decision criteria.
- NCache, Read-Through, Write-Through, Write-Behind Caching. Detailed explanation of cache provider-managed patterns with code examples.
- System Overflow, Cache-Aside Pattern. Deep dive into cache-aside implementation including thundering herd mitigation.
Assignment
You are building an e-commerce product page that serves 10,000 reads per second and receives approximately 2 writes per second (price changes, description edits). The product catalog has 50,000 active products. Average product data size is 2 KB.
- Which read caching strategy do you choose: cache-aside, read-through, or refresh-ahead? Justify your choice based on the read/write ratio and workload characteristics.
- What TTL do you set, and why? Consider the write frequency, acceptable staleness, and the thundering herd risk when a popular product's cache entry expires.
- Calculate the cache memory needed if 80% of traffic goes to 20% of products (power-law distribution). How much RAM does your Redis instance need?
- A flash sale starts and one product suddenly gets 50,000 reads per second. Its cache entry expires at the worst possible moment. What happens? How do you prevent it?
The Write Side of Caching
Session 3.8 covered the read path: cache-aside, read-through, and refresh-ahead. Those strategies answer one question: how does data get into the cache when someone reads it? This session covers the other half: what happens when someone writes new data?
The write path is where the hard trade-offs live. Every write creates two copies of truth: one in the cache, one in the database. The strategy you pick determines which copy updates first, whether they update together, and what happens when something fails between the two writes.
There are three core patterns: write-through, write-behind (also called write-back), and write-around. Each makes a different bet about what matters most.
Write-Through
In write-through caching, every write goes to both the cache and the database synchronously. The application writes to the cache, and the cache immediately writes to the database. The write is only acknowledged as successful after both stores have been updated.
The guarantee is strong: cache and database are always consistent. If the database write fails, the entire operation fails, and the cache is not updated. There is no window where the cache holds data that the database does not.
The cost is latency. Every write now has the latency of two operations: writing to cache (sub-millisecond) plus writing to the database (typically 1-10ms for a relational database). For read-heavy workloads where writes are infrequent, this penalty is acceptable. For write-heavy workloads, it adds up quickly.
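A write-through sketch, with dicts standing in for the two stores. Ordering the database write first is one common way to honor the guarantee: if the DB write raises, execution never reaches the cache line, so the stores cannot disagree.

```python
cache: dict = {}       # stands in for Redis
database: dict = {}    # stands in for the relational database

def write_through(key, value):
    # 1. Durable store first (typically 1-10 ms). If this raises,
    #    the operation fails and the cache is untouched.
    database[key] = value
    # 2. Then the cache (sub-millisecond). Only now is the write
    #    acknowledged to the caller.
    cache[key] = value
```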
Write-Behind (Write-Back)
Write-behind flips the priority. The application writes to the cache, and the cache acknowledges immediately. The database is updated asynchronously, sometime later. The delay might be a few milliseconds, a few seconds, or batched into periodic flushes.
Write latency drops dramatically because the application only waits for the cache write. The database sees fewer, larger batch writes instead of many small ones, which can significantly reduce I/O pressure. If your application writes 1,000 updates per second to the same key, write-behind can collapse those into a single database write.
The risk is data loss. If the cache node crashes before the async write reaches the database, those pending writes are gone. The database is now behind. For some systems this is fine. For others, it is catastrophic.
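A write-behind sketch using an in-process queue and a background flush thread. The queue of pending writes is exactly the "stock" that is lost if the process dies before the flush; a durable queue shrinks that risk but does not remove it.

```python
import queue
import threading

cache: dict = {}       # stands in for Redis
database: dict = {}    # stands in for the relational database
pending: queue.Queue = queue.Queue()   # writes awaiting the async flush

def write_behind(key, value):
    cache[key] = value           # fast path: the caller returns here
    pending.put((key, value))    # durable write is deferred

def flush_worker():
    while True:
        key, value = pending.get()
        database[key] = value    # async DB write (batchable in practice)
        pending.task_done()

threading.Thread(target=flush_worker, daemon=True).start()
```

A production implementation would also coalesce repeated writes to the same key before flushing, which is where the batching win described above comes from.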
Write-Around
Write-around skips the cache entirely on writes. The application writes directly to the database. The cache is only populated when data is read (through a read strategy like cache-aside or read-through).
Write-around avoids cache pollution. If you write data that is unlikely to be read soon (log entries, audit records, analytics events), putting it in the cache wastes memory and may evict more frequently accessed data. By writing only to the database, the cache stays focused on hot data.
The downside is that a read immediately after a write will always be a cache miss. The data exists in the database but not in the cache, so the first read pays the full database round-trip cost. For workloads where writes are rarely followed by immediate reads, this is efficient. For workloads where "write then read" is common, write-around creates unnecessary latency.
Comparing the Three Strategies
| Dimension | Write-Through | Write-Behind | Write-Around |
|---|---|---|---|
| Write latency | High (cache + DB sync) | Low (cache only) | Medium (DB only) |
| Read-after-write latency | Low (data in cache) | Low (data in cache) | High (cache miss) |
| Consistency | Strong | Eventual | Strong (DB is source of truth) |
| Data loss risk | None | Yes (if cache fails before flush) | None |
| DB write load | Same as without cache | Reduced (batching) | Same as without cache |
| Cache pollution | Possible (write-heavy data cached) | Possible | Minimal |
| Implementation complexity | Low | High (async queue, retry logic) | Low |
| Best for | Financial, medical, config data | High-throughput counters, metrics | Logs, audit trails, cold writes |
Write-through trades latency for consistency. Write-behind trades consistency for throughput. Write-around trades read-after-write speed for cache efficiency. No single strategy wins. The choice depends on what your system cannot afford to lose.
Combining Strategies
Production systems rarely use a single write strategy for everything. A banking application might use write-through for account balances (consistency is non-negotiable), write-behind for session activity tracking (high volume, eventual consistency is fine), and write-around for transaction audit logs (written once, read rarely).
The decision tree is straightforward. Ask three questions about each data type:
- Can you tolerate any data loss? If no, eliminate write-behind.
- Do reads typically follow writes? If yes, eliminate write-around.
- Is write throughput a bottleneck? If yes, consider write-behind with a durable queue (like Redis with AOF persistence) to reduce the data loss risk.
```mermaid
flowchart TD
    Start["New write"] --> Q1{"Can you tolerate any data loss?"}
    Q1 -->|No| Q2{"Is write latency critical?"}
    Q1 -->|Yes| Q3{"Do reads follow writes immediately?"}
    Q2 -->|No| WT["Write-Through"]
    Q2 -->|Yes| WTQ["Write-Through + async read replica"]
    Q3 -->|Yes| WB["Write-Behind"]
    Q3 -->|No| WA["Write-Around"]
```
Systems Thinking Lens
Each write strategy creates a different feedback loop. Write-through creates a tight, synchronous loop: every write immediately confirms both stores are consistent. The system is predictable but rigid. Write-behind introduces delay into the loop. The application "thinks" the write succeeded, but the database has not confirmed it yet. This delay is a stock of pending writes that can grow if the database slows down. If that stock grows faster than it drains, the system eventually fails. Write-around decouples write and read paths entirely, creating two separate loops that only interact when a cache miss triggers a database read.
The key insight: consistency is not free. Every write strategy pays for it somewhere, either in latency, risk, or complexity. Your job is to decide where the payment is least painful for each data type in your system.
Further Reading
- Caching Strategies and How to Choose the Right One, CodeAhoy
- Caching Strategy: Write-Behind (Write-Back) Pattern, EnjoyAlgorithms
- Performance at Scale with Amazon ElastiCache, AWS Whitepapers
- Write-through vs Write-back vs Write-around: Trade-offs, Design Gurus
Assignment
For each scenario below, choose the most appropriate write caching strategy and explain your reasoning. Then answer the follow-up question.
- Banking transaction: A user transfers $500 between accounts. Which write strategy should update the cache? What happens if the chosen strategy fails mid-write?
- Social media like counter: A viral post receives 10,000 likes per second. Which write strategy handles this load? What is at stake if the counter is temporarily inaccurate?
- Audit log: Every API request generates a compliance record that regulators may review months later. Which write strategy fits? What is at stake if the cache evicts this data before it is ever read?
For each answer, state: (a) the strategy, (b) why the alternatives are worse, and (c) what failure looks like if your chosen strategy breaks.
Two Tools, Different Philosophies
Redis and Memcached are both in-memory key-value stores. Both deliver sub-millisecond latency. Both are open-source and battle-tested at massive scale. But they were built with fundamentally different design goals, and those differences matter when you are choosing one for a real system.
Memcached was built to do one thing well: cache simple key-value pairs with maximum throughput. Redis was built as a data structure server that happens to be excellent at caching. That difference in ambition shapes everything from data model to persistence to operational complexity.
Redis: The Data Structure Server
Redis stores more than strings. Its native data types include strings, hashes, lists, sets, sorted sets, bitmaps, hyperloglogs, and streams. Each type comes with atomic operations. You can increment a counter, push to a list, add to a sorted set by score, or compute set intersections, all in a single command without client-side logic.
This matters because it reduces round trips. Instead of fetching a value, modifying it in your application, and writing it back (three operations with race conditions), you issue a single ZINCRBY or HINCRBY command. The operation is atomic. No locks needed.
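The race is easy to see in a deterministic replay. This pure-Python sketch interleaves two clients doing fetch-modify-write against a shared counter (the pattern ZINCRBY and HINCRBY replace with a single server-side operation):

```python
store = {"likes": 5}   # stands in for a Redis hash field

# Interleaving: both clients read before either writes back.
a = store["likes"]          # client A reads 5
b = store["likes"]          # client B reads 5
store["likes"] = a + 1      # A writes 6
store["likes"] = b + 1      # B writes 6 -- A's increment is lost
lost_update_result = store["likes"]   # 6, not the expected 7

# With one atomic server-side increment per client there is nothing
# to interleave: each operation reads and writes in a single step.
store["likes"] = 5
for _ in range(2):
    store["likes"] += 1     # stands in for one HINCRBY round trip
atomic_result = store["likes"]        # 7
```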
Redis also offers persistence through two mechanisms. RDB (Redis Database) takes point-in-time snapshots at configurable intervals. AOF (Append Only File) logs every write operation and can replay them on restart. You can use both together: RDB for fast recovery, AOF for minimal data loss. With AOF set to fsync every second, you lose at most one second of writes on a crash.
Additional features include pub/sub messaging, Lua scripting for server-side logic, transactions with MULTI/EXEC, and Redis Streams for event log workloads similar to a lightweight Kafka.
Memcached: The Pure Cache
Memcached stores strings. That is it. Every value is a blob of bytes up to 1 MB. There are no data structures, no persistence, no scripting, no pub/sub. When a Memcached server restarts, all data is gone.
This simplicity is a feature. Memcached is multi-threaded by default and serves concurrent reads and writes across all available CPU cores. Redis, historically single-threaded for command execution (though Redis 6+ introduced I/O threading for network operations), cannot match Memcached's single-node throughput when the workload is many concurrent clients issuing simple GET/SET operations.
Memcached's memory management is also predictable. Its slab allocator pre-divides memory into fixed-size chunks, which eliminates fragmentation. Redis uses jemalloc and can fragment under certain workloads (many small keys created and deleted), though in practice this is manageable with activedefrag.
Feature Comparison
| Dimension | Redis | Memcached |
|---|---|---|
| Data types | Strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog | Strings only |
| Max value size | 512 MB | 1 MB (default) |
| Threading | Single-threaded command execution (I/O threads since v6) | Multi-threaded |
| Persistence | RDB snapshots + AOF log | None |
| Eviction policies | 8 policies (LRU, LFU, random, TTL-based, noeviction) | LRU only |
| Replication | Built-in leader-follower | None (client-side consistent hashing) |
| Pub/Sub | Yes | No |
| Scripting | Lua scripts, Redis Functions | No |
| Cluster mode | Redis Cluster (automatic sharding) | Client-side sharding only |
| Memory efficiency | Variable (depends on data structures used) | Predictable (slab allocator) |
| Typical use cases | Session store, leaderboard, rate limiter, message broker, cache | Simple object caching, HTML fragment caching |
Performance: Throughput and Latency
Benchmarks are always context-dependent, but industry testing gives us useful baselines. On a standard Linux server, Redis handles roughly 1.2 to 1.5 million operations per second for simple GET/SET workloads using pipelining. Memcached, with its multi-threaded architecture, achieves 1.0 to 1.2 million operations per second on similar hardware, with an advantage in highly concurrent multi-core scenarios.
For latency, both deliver sub-millisecond response times. Redis typically achieves around 0.1 to 0.15ms for simple GETs. Memcached is in the same range, around 0.15 to 0.25ms. At the P99 level under heavy concurrent load, Memcached's multi-threading can show lower tail latencies because it does not serialize all commands through a single thread.
The practical takeaway: for simple caching of string values with high concurrency, Memcached may edge ahead. For anything requiring data structures, persistence, or atomic operations on complex types, Redis is the only option.
Eviction Policies
When memory is full, the cache must decide what to remove. Memcached uses LRU (Least Recently Used) and that is your only choice. Redis provides eight eviction policies:
| Policy | Behavior | When to use |
|---|---|---|
| allkeys-lru | Evict least recently used key from all keys | General-purpose caching (most common) |
| allkeys-lfu | Evict least frequently used key from all keys | When access frequency matters more than recency |
| volatile-lru | LRU among keys with TTL set | Mix of cache (with TTL) and persistent keys (no TTL) |
| volatile-lfu | LFU among keys with TTL set | Same as above, frequency-weighted |
| volatile-ttl | Evict keys with shortest remaining TTL first | When expiring-soon data is least valuable |
| allkeys-random | Random eviction from all keys | Uniform access patterns (rare) |
| volatile-random | Random eviction from keys with TTL | Rarely useful in practice |
| noeviction | Return error on writes when memory is full | When data loss is unacceptable (persistent data store) |
LRU assumes that recently accessed data will be accessed again. LFU assumes that frequently accessed data will be accessed again. For most web applications, allkeys-lru is the correct default. Switch to allkeys-lfu when you have a small set of permanently hot keys mixed with large amounts of infrequently scanned data.
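The mechanics behind allkeys-lru fit in a short class. A minimal sketch using an ordered dict: every access moves the key to the "recent" end, and when the cache is full, the other end is evicted. (Real Redis approximates LRU by sampling rather than tracking exact order.)

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU eviction: evict the key untouched for the longest time."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```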
Cache Stampede: The Thundering Herd
A cache stampede happens when a popular cache key expires and many concurrent requests simultaneously discover the cache miss. All of them hit the database at once, potentially overwhelming it. If the database slows down or crashes, the problem cascades: more requests time out, more retries stack up, and the system spirals.
Three proven prevention techniques:
1. Distributed locking. When a cache miss occurs, the first request acquires a lock (using Redis SET with the NX and EX options) and fetches from the database. All other requests wait for the lock to release, then read from the refreshed cache. Only one database query executes.
2. Probabilistic early expiration (XFetch). Instead of all keys expiring at exactly their TTL, each request computes a probability of refreshing the cache before expiration. As the TTL approaches zero, the probability increases. In practice, some request will refresh the key slightly before it expires, so the cache never actually goes empty.
3. Staggered TTL with jitter. Set TTLs as base_ttl + random(0, jitter_range). If 10,000 keys all have a 60-second TTL, they all expire at the same second. With jitter of 10 seconds, they expire spread across a 10-second window. This does not prevent stampede on a single key, but it prevents cache avalanche across many keys.
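Techniques 2 and 3 are each a one-liner. A sketch of both, where `delta` is the observed cost of recomputing the value and `beta` tunes eagerness (the probabilistic refresh follows the published XFetch recipe; parameter names are this sketch's own):

```python
import math
import random
import time

def should_refresh_early(expires_at: float, delta: float,
                         beta: float = 1.0) -> bool:
    # XFetch-style early expiration: -log(U) is exponentially
    # distributed, so refresh probability rises as expiry approaches
    # and some request refreshes the key slightly before it expires.
    return time.time() - delta * beta * math.log(random.random()) >= expires_at

def jittered_ttl(base_ttl: float, jitter: float) -> float:
    # Staggered TTL: spread expirations over a window so keys set
    # together do not all expire in the same second.
    return base_ttl + random.uniform(0, jitter)
```

Jitter prevents the many-key avalanche; early refresh (or locking) handles the single hot key.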
When to Choose Which
Choose Memcached when your workload is simple key-value caching with high concurrency, you do not need persistence, and you want predictable memory behavior. Memcached shines as a pure caching layer in front of a relational database where the application handles all data structure logic.
Choose Redis for everything else. If you need sorted sets for leaderboards, pub/sub for real-time features, streams for event logs, persistence for session stores, or Lua scripts for atomic multi-step operations, Redis is the answer. Its ecosystem is larger, its feature set is richer, and it can serve as both cache and lightweight data store.
In practice, Redis has become the default choice for most new systems. Memcached retains a strong position in legacy architectures and in environments where its multi-threaded performance on simple operations justifies the feature trade-off.
Further Reading
- Redis Benchmark Documentation, redis.io
- Memcached vs. Redis: Performance at Scale, AWS Whitepapers
- Thundering Herd Problem, Wikipedia
- A Crash Course in Caching (Final Part), ByteByteGo
- Redis vs Memcached in 2025, ScaleGrid
Assignment
It is midnight. The cache TTL for your most popular product page expires. Your e-commerce site has 50,000 active users. Within one second, 50,000 requests discover the cache is empty and all hit the database simultaneously.
- Describe what happens to the database, the application servers, and the user experience in this scenario. Be specific about the failure cascade.
- Design a fix using at least two of the three prevention techniques discussed (locking, probabilistic early expiry, staggered TTL). Explain how they work together.
- What if the database query for this product takes 2 seconds? How does that change your locking strategy? What do the 49,999 waiting requests do during those 2 seconds?
When "Big" Stopped Being an Adjective
For most of this course, we have discussed systems that handle thousands or millions of requests. Big data is what happens when the data itself becomes the bottleneck, not because the queries are complex, but because there is simply too much of it for any single machine to store, process, or analyze in a reasonable time.
The global datasphere hit roughly 64 zettabytes in 2020. By 2025, that number reached approximately 181 zettabytes. Projections for 2028 exceed 390 zettabytes. A single zettabyte is one trillion gigabytes. This growth is not linear. It is exponential, driven by IoT sensors, video streaming, mobile devices, and machine-generated logs.
This session introduces the fundamental processing paradigms and storage architectures that handle data at this scale.
OLTP vs. OLAP
Before diving into big data architectures, you need to understand the two fundamental database workload types. Every system leans toward one or the other.
| Dimension | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Process individual transactions | Analyze aggregated data |
| Queries | Simple: INSERT, UPDATE, SELECT by primary key | Complex: JOIN across millions of rows, GROUP BY, aggregations |
| Data volume per query | A few rows | Millions to billions of rows |
| Latency requirement | Milliseconds | Seconds to minutes |
| Storage format | Row-oriented | Column-oriented |
| Examples | MySQL, PostgreSQL, DynamoDB | BigQuery, Redshift, Snowflake, ClickHouse |
| Optimization | Indexes on primary/foreign keys | Columnar compression, partitioning, materialized views |
OLTP answers "what happened in this transaction." OLAP answers "what happened across all transactions." They optimize for opposite access patterns, which is why forcing one to do the other's job always ends badly.
Batch vs. Stream Processing
Data processing at scale splits into two paradigms based on when the data is processed relative to when it arrives.
Batch processing collects data over a period (hours, a day), then processes it all at once. MapReduce and Apache Spark are the canonical tools. You run a job that reads a full dataset, transforms it, and writes results. The job might take minutes or hours. The output is complete and consistent, but it is always behind real time by at least the batch interval.
Stream processing processes data as it arrives, record by record or in micro-batches of milliseconds. Apache Flink, Kafka Streams, and Apache Spark Structured Streaming are the primary tools. Latency is seconds or less. But handling out-of-order events, late arrivals, and exactly-once semantics is significantly more complex.
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (processes entire dataset) | High (but per-record overhead) |
| Data completeness | Full dataset available at processing time | Partial, must handle late arrivals |
| Complexity | Simpler (no state management) | Higher (windowing, watermarks, state) |
| Fault tolerance | Rerun the job | Checkpointing, exactly-once semantics |
| Tools | Spark, Hadoop MapReduce, dbt | Flink, Kafka Streams, Spark Streaming |
| Best for | Monthly reports, ETL, model training | Real-time dashboards, fraud detection, alerting |
Most production systems use both. The Lambda architecture runs a batch layer for accurate historical data and a speed layer for approximate real-time results. The Kappa architecture simplifies this by using a single stream processing layer for everything, replaying the event log when historical reprocessing is needed.
Data Lake vs. Data Warehouse vs. Data Lakehouse
Where you store data at scale depends on its structure, who needs it, and how it will be queried.
A data lake is cheap storage for everything. Raw files in any format (JSON, CSV, Parquet, images, logs) dumped into object storage like S3 or Azure Data Lake Storage. Schema is applied when reading ("schema-on-read"), not when writing. This makes ingestion fast and flexible, but querying is slow without additional tooling. Data lakes tend to become "data swamps" without governance.
A data warehouse is structured storage optimized for analytical queries. Data is cleaned, transformed, and loaded into a predefined schema ("schema-on-write"). Columnar storage formats enable fast aggregations. Snowflake, BigQuery, and Redshift are the dominant players. Warehouses are fast for BI queries but expensive for raw storage and inflexible for unstructured data.
A data lakehouse combines both. It adds a transaction layer (Delta Lake, Apache Iceberg, Apache Hudi) on top of data lake storage. This gives you ACID transactions, schema enforcement, and query optimization on cheap object storage. You get warehouse-like performance without copying data into a separate warehouse system.
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | Low (object storage) | High (proprietary format) | Low (object storage) |
| Data types | Structured, semi-structured, unstructured | Structured only | All types |
| Schema | Schema-on-read | Schema-on-write | Both (flexible) |
| ACID transactions | No | Yes | Yes |
| Query performance | Slow without optimization | Fast (columnar, indexed) | Fast (with table formats) |
| Governance | Weak ("data swamp" risk) | Strong | Strong |
| ML/AI support | Good (direct file access) | Poor (data must be exported) | Good (direct access + structure) |
| Examples | S3 + Athena, ADLS + Synapse | Snowflake, BigQuery, Redshift | Databricks, Delta Lake, Apache Iceberg |
Kafka as the Integration Backbone
Apache Kafka sits at the center of most modern data pipelines. It is a distributed event streaming platform that acts as a durable, high-throughput buffer between data producers and consumers. Kafka does not process data. It transports it reliably and lets multiple downstream systems consume the same events independently.
A Kafka topic is an ordered, append-only log of events. Producers write to topics. Consumers read from topics at their own pace. Because the log is persistent (stored on disk with configurable retention), a new consumer can start reading from the beginning. A real-time dashboard can read events as they arrive. A batch pipeline can read the same events hours later. Neither interferes with the other.
This decoupling is why Kafka is the backbone. It separates "when data is produced" from "when and how data is consumed," which lets teams build and evolve their processing pipelines independently.
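The log-and-offsets model behind this decoupling can be sketched without a broker. The class below is a toy in-memory analogue, not the Kafka API: real Kafka adds partitions, replication, and disk persistence, but the consumption model is the same.

```python
class Topic:
    """In-memory analogue of a Kafka topic: an append-only event log
    plus a per-consumer read offset."""

    def __init__(self):
        self.log = []       # the append-only event log
        self.offsets = {}   # consumer name -> next index to read

    def produce(self, event):
        self.log.append(event)

    def poll(self, consumer, max_events=10):
        # Each consumer advances only its own offset, so a slow batch
        # reader never blocks a fast real-time reader.
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch
```

A real-time consumer can poll continuously while a batch consumer starts hours later at offset 0 and replays the whole log; neither one's progress affects the other.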
A Modern Data Pipeline
Putting these pieces together, here is what a modern data architecture looks like for a company that needs both real-time insights and historical analytics.
```mermaid
flowchart LR
    subgraph Sources
        App["App Events"]
        DB["Database CDC"]
        IoT["IoT Sensors"]
        Logs["Server Logs"]
    end
    subgraph Ingestion
        Kafka["Apache Kafka (Event Backbone)"]
    end
    subgraph Processing
        Flink["Stream Processor (Flink / Kafka Streams)"]
        Spark["Batch Processor (Spark)"]
    end
    subgraph Storage
        Lake["Data Lakehouse (Delta Lake / Iceberg)"]
        WH["Data Warehouse (Snowflake / BigQuery)"]
    end
    subgraph Serving
        Dash["Real-Time Dashboards"]
        BI["BI Reports (Monthly)"]
        ML["ML Training Pipelines"]
    end
    App --> Kafka
    DB --> Kafka
    IoT --> Kafka
    Logs --> Kafka
    Kafka --> Flink
    Kafka --> Spark
    Flink --> Lake
    Flink --> Dash
    Spark --> Lake
    Lake --> WH
    Lake --> ML
    WH --> BI
```
Data flows left to right. All sources publish to Kafka. Stream processors consume from Kafka for real-time needs. Batch processors consume the same data (or read from the lakehouse) for historical analysis. The lakehouse serves as the central storage layer, feeding both the warehouse for BI and ML pipelines for model training.
The critical insight: Kafka decouples producers from consumers. Adding a new downstream system (a fraud detection engine, a recommendation model, a compliance audit) does not require changing any upstream source. You just add a new consumer group to the relevant Kafka topic.
```mermaid
flowchart LR
    subgraph OLTP["OLTP Layer"]
        PG["PostgreSQL (Transactions)"]
    end
    subgraph CDC["Change Data Capture"]
        Deb["Debezium"]
    end
    subgraph Stream["Stream Layer"]
        KF["Kafka"]
        FL["Flink"]
    end
    subgraph OLAP["OLAP Layer"]
        LH["Lakehouse (Iceberg)"]
        SF["Snowflake"]
    end
    PG --> Deb
    Deb --> KF
    KF --> FL
    FL --> LH
    LH --> SF
```
This second diagram shows a common pattern: Change Data Capture (CDC) using Debezium to stream database changes into Kafka, which then feeds into the analytics layer. This lets you keep your OLTP database optimized for transactions while building your OLAP layer from the same data, without any direct coupling between the two systems.
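Downstream consumers materialize these change events back into tables. The sketch below assumes Debezium's documented envelope (a `payload` object with `before`, `after`, and an `op` code: `c` create, `u` update, `d` delete, `r` snapshot read) and a primary-key column named `id`; both are illustrative simplifications of real connector output.

```python
import json

def apply_change(event_json, table):
    """Replay one Debezium-style change event into a dict keyed by
    primary key. op codes: c=create, u=update, d=delete, r=snapshot."""
    payload = json.loads(event_json)["payload"]
    op, before, after = payload["op"], payload["before"], payload["after"]
    if op in ("c", "r", "u"):
        table[after["id"]] = after          # upsert the new row image
    elif op == "d":
        table.pop(before["id"], None)       # remove the deleted row
    return table
```

Replaying the full event stream from offset 0 rebuilds the current state of the source table, which is exactly how a lakehouse table can be bootstrapped from CDC without ever querying the OLTP database directly.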
A data pipeline is not a single tool. It is a system of decoupled components, each optimized for one job: ingestion, transport, processing, storage, and serving. Kafka is the connective tissue. The lakehouse is the long-term memory.
Further Reading
- What is a Data Lakehouse?, Databricks
- Data Generation Volume Worldwide 2010-2029, Statista
- Apache Kafka Documentation, apache.org
- Data Warehouse vs Data Lake vs Data Lakehouse, Monte Carlo
- Apache Flink Documentation, apache.org
Assignment
Your application generates 1 TB of clickstream data per day. The business needs two things: real-time dashboards showing active users and trending pages (updated every 5 seconds), and monthly reports analyzing user behavior patterns over 30-day windows.
- Sketch the pipeline. Identify each component: data source, ingestion layer, stream processor, batch processor, storage layers, and serving layers. Name specific technologies for each.
- Where does Kafka sit in your design? How many partitions would you estimate for a topic handling 1 TB/day of clickstream events? Show your math (assume average event size of 500 bytes).
- Would you use a data lake, a data warehouse, or a lakehouse for long-term storage? Justify your choice in terms of cost, query performance, and support for both real-time and batch workloads.
- Draw the OLTP/OLAP boundary. Which parts of your system are transactional? Which are analytical? Where does the handoff happen?