Course → Module 3: Storage, Databases & Caching

Two Models, Two Philosophies

Every application stores data. The question is how that data is organized, queried, and scaled. For decades, relational databases (SQL) dominated. Then, around 2009, a wave of distributed systems gave rise to NoSQL. These are not competing religions. They are engineering tradeoffs. The right choice depends on your data shape, your access patterns, and your consistency requirements.

Relational Databases (SQL)

A relational database stores data in tables with predefined schemas, enforces relationships through foreign keys, and guarantees ACID properties: Atomicity, Consistency, Isolation, Durability. Every write either fully succeeds or fully rolls back.

Relational databases were designed for correctness. A banking system transferring money between accounts needs a guarantee: debit one account and credit the other, or do neither. ACID transactions provide that guarantee. The schema enforces structure. You define columns and types upfront. If you try to insert a string into an integer column, the database rejects it. This strictness prevents entire categories of bugs.

SQL databases excel at complex queries. Joins allow you to combine data from multiple tables in a single query. An e-commerce system can answer "show me all orders from customers in Jakarta who bought more than 3 items last month" with one SQL statement. Try doing that efficiently in a key-value store.

The cost of this power is rigidity. Schema changes (adding a column, changing a type) require migrations. On large tables with billions of rows, migrations can take hours and lock the table. Horizontal scaling is difficult because joins across machines are expensive. Most relational databases scale vertically first, then add read replicas for read-heavy workloads.

Major relational databases: PostgreSQL, MySQL, Oracle, SQL Server, SQLite.

NoSQL Databases

NoSQL databases abandon the rigid table-and-schema model in favor of flexible data structures (documents, key-value pairs, wide columns, or graphs). Most trade strong consistency for horizontal scalability and eventual consistency.

NoSQL emerged to solve problems that relational databases handle poorly: massive write throughput across distributed clusters, highly variable data shapes, and workloads where horizontal scaling matters more than multi-table joins.

A document database like MongoDB stores JSON-like documents. Each document can have different fields. You do not need to run ALTER TABLE to add a new attribute. You just start writing documents with the new field. This flexibility accelerates early development and works well for data that does not fit neatly into rows and columns.

Most NoSQL databases achieve horizontal scalability by partitioning data across nodes. Cassandra, for example, distributes data using consistent hashing. Adding a node automatically rebalances part of the data. This makes scaling to hundreds of nodes straightforward, something relational databases struggle with.

The tradeoff is consistency. Most NoSQL systems offer eventual consistency by default: after a write, replicas will converge to the same state, but a read immediately after a write might return stale data. Some NoSQL databases (like MongoDB with majority write concern, or DynamoDB with strong reads) can provide stronger guarantees at the cost of latency.

Throughput Comparison

Benchmark numbers depend heavily on hardware, configuration, data size, and workload shape. The following chart uses approximate figures from YCSB (Yahoo Cloud Serving Benchmark) studies and vendor benchmarks to illustrate relative throughput ranges. These are order-of-magnitude comparisons, not absolute truths for your specific workload.

Redis dominates because it is an in-memory store. Cassandra's write throughput is high because its LSM-tree storage engine is optimized for sequential writes. PostgreSQL's numbers are lower on raw throughput but it handles complex queries that the others cannot. These numbers shift dramatically with indexing, replication, and query complexity.

Comparison Across Dimensions

Dimension SQL (Relational) NoSQL
Data model Fixed schema, tables with rows and columns Flexible: documents, key-value, wide-column, graph
Schema Schema-on-write (enforced at insert time) Schema-on-read (interpreted at query time)
Query language SQL (standardized, declarative) Varies per database (MongoDB query API, CQL, Cypher)
Joins Native, efficient multi-table joins Generally unsupported or expensive
Transactions Full ACID across multiple tables Limited (single-document or single-partition in most)
Consistency Strong consistency by default Eventual consistency by default (tunable in some)
Scaling Primarily vertical; horizontal via read replicas or sharding Designed for horizontal scaling from the start
Schema evolution ALTER TABLE migrations (can be slow on large tables) Add fields freely; old documents still valid
Best for Transactional systems, complex queries, data integrity High write throughput, variable schemas, massive scale

Polyglot Persistence

Polyglot persistence means using different database technologies for different parts of the same system, matching each data store to the access pattern it handles best.

Most real-world systems do not pick one database for everything. An e-commerce platform might use PostgreSQL for orders and inventory (transactions matter), Redis for session storage and caching (speed matters), Elasticsearch for product search (full-text indexing matters), and a graph database for recommendations (relationship traversal matters).

This is not over-engineering. It is recognizing that no single database excels at every workload. The cost is operational complexity: more systems to monitor, more failure modes, more data synchronization concerns.

graph TB Client[Client Application] --> API[API Layer] API --> PG["PostgreSQL
Orders, Users, Payments
(ACID transactions)"] API --> Redis["Redis
Sessions, Cache
(sub-ms latency)"] API --> ES["Elasticsearch
Product Search
(full-text indexing)"] API --> Mongo["MongoDB
Product Catalog
(flexible schema)"] API --> Neo4j["Neo4j
Recommendations
(graph traversal)"] PG --- |"Source of truth"| Sync[Change Data Capture] Sync --> ES Sync --> Redis

The diagram above shows a polyglot architecture. PostgreSQL serves as the source of truth for transactional data. Change data capture (CDC) streams changes to Elasticsearch for search indexing and to Redis for cache warming. Each database handles the workload it was designed for.

The ACID vs. BASE Spectrum

SQL databases follow ACID. Most NoSQL databases follow BASE: Basically Available, Soft state, Eventually consistent. These are not binary categories. Many databases sit on a spectrum.

MongoDB now supports multi-document ACID transactions (since version 4.0). DynamoDB offers strongly consistent reads. CockroachDB is a distributed SQL database that provides ACID guarantees across multiple nodes. The boundaries between SQL and NoSQL are blurring. What matters is understanding the default behavior and what guarantees you are paying for in terms of latency and throughput.

Systems Thinking Lens

Choosing a database is a leverage point. Get it right early and the system scales naturally. Get it wrong and you spend months migrating under pressure. The feedback loop is delayed: you pick a database in month one, but the pain of a bad choice shows up in month twelve when traffic grows 10x.

Polyglot persistence creates a balancing loop. Each additional database solves one problem but introduces operational overhead. At some point, the cost of managing five different databases outweighs the performance benefit. The systems thinker asks: where is the point of diminishing returns for our team size and operational maturity?

Further Reading

Assignment

You are designing a system that stores four types of data. For each one, choose SQL or NoSQL (and name a specific database). Justify your choice in 2-3 sentences, explaining which properties of the workload drove your decision.

  1. User profiles with fields that vary by user type (individual, business, admin). Some users have 5 fields, others have 50.
  2. Chat messages in a messaging app with 10 million daily active users. Messages are write-heavy, read by time range, and rarely updated.
  3. Product catalog for an e-commerce site with 2 million SKUs. Users search by name, category, price range, and attributes that vary by product type.
  4. Social graph for a platform where users follow each other. Key queries: "who does user X follow?", "who follows user X?", "mutual friends between X and Y."

For each choice, also state the consistency model you would use (strong or eventual) and why.