Four Families, Four Philosophies

Saying "NoSQL" tells you almost nothing about how a database actually works. It is like saying "not a car" when someone asks what vehicle you drive. It could be a motorcycle, a bus, a helicopter, or a bicycle. NoSQL is an umbrella term covering at least four fundamentally different data models, each optimized for different access patterns.

The four families are: key-value stores, document databases, wide-column stores, and graph databases. Picking the right family means understanding your data shape and how you will query it.

Key-Value Stores

A key-value store is the simplest NoSQL model: every record is a unique key mapped to a blob of data. The database does not inspect the value. It only knows the key. This simplicity enables extreme speed.

Key-value stores are hash tables at scale. You give the database a key, it returns a value. You cannot query by value contents. You cannot filter or sort. You can only get, set, and delete by key. This constraint is the source of their power. With no secondary indexes, no joins, and no query parsing, the server-side work for an in-memory store is typically measured in microseconds, and the network round trip usually dominates the latency a client sees.
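The get/set/delete contract described above can be sketched in a few lines of Python. This is a toy, single-process illustration, not how Redis or DynamoDB is implemented; the class name `KeyValueStore` and the example key are hypothetical.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: every value is an opaque blob of bytes."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        # The store never parses or indexes the value.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # A lookup is a single hash-table probe on the key.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Returns True if the key existed, False otherwise.
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.set("session:abc", b'{"user_id": 1001}')
print(store.get("session:abc"))     # the raw bytes, exactly as stored
print(store.delete("session:abc"))  # True
print(store.get("session:abc"))     # None
```

Note that there is no method for "find all sessions for user 1001". Answering that would require the store to look inside the value, which is exactly what this model rules out.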

Redis is the most widely used key-value store. It holds data in memory, supports data structures beyond simple strings (lists, sets, sorted sets, hashes), and provides sub-millisecond latency. DynamoDB is Amazon's managed key-value and document store, built for horizontal scalability with single-digit millisecond reads at any scale.

Use when: Session storage, caching, rate limiting, leaderboards, feature flags. Any workload where the access pattern is "I know the key, give me the value."
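As one illustration of the use cases above, a fixed-window rate limiter needs nothing more than a counter per key. The sketch below uses a plain dictionary as a stand-in for the key-value store; the class name, the `rate:{client}:{window}` key format, and the `now` parameter are assumptions made for the example, and a production version would rely on the store's own atomic increment and key expiry.

```python
from typing import Dict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Stand-in for the key-value store: key -> request count.
        # Old windows are never evicted in this sketch.
        self._counters: Dict[str, int] = {}

    def allow(self, client_id: str, now: float) -> bool:
        # All requests in the same window share one key.
        window = int(now // self.window_seconds)
        key = f"rate:{client_id}:{window}"
        count = self._counters.get(key, 0) + 1
        self._counters[key] = count
        return count <= self.limit


limiter = FixedWindowRateLimiter(limit=2, window_seconds=60)
print(limiter.allow("client-1", now=0))   # True
print(limiter.allow("client-1", now=1))   # True
print(limiter.allow("client-1", now=2))   # False, third request in the window
print(limiter.allow("client-1", now=61))  # True, a new window has started
```

Every operation is "I know the key, give me the value", which is why this workload maps so cleanly onto the key-value model.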

Avoid when: You need to search by attributes within the value, or you need relationships between records.

Document Databases

A document database stores self-describing records (usually JSON or BSON) that can contain nested objects and arrays. Unlike key-value stores, the database understands the document structure and can index and query individual fields.

Document databases sit between key-value simplicity and relational richness. Each document is a self-contained unit. A user document might contain the user's name, address (nested object), order history (array of objects), and preferences (nested map). You can query any of these fields.
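To make the contrast with key-value stores concrete, here is a minimal Python sketch of querying a nested field. It scans a list of dictionaries and is only meant to show the idea; a real document database would answer the same query from an index. The helper names `get_path` and `find`, the dotted-path syntax, and the sample documents are assumptions for illustration.

```python
from typing import Any, Dict, Iterator, List

Document = Dict[str, Any]


def get_path(doc: Document, path: str) -> Any:
    """Follow a dotted path such as 'address.city'; return None if it is absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection: List[Document], path: str, expected: Any) -> Iterator[Document]:
    """Yield documents whose field at `path` equals `expected` (full scan, no index)."""
    for doc in collection:
        if get_path(doc, path) == expected:
            yield doc


users = [
    {"_id": 1001, "name": "Andi", "address": {"city": "Jakarta"}, "orders": [{"total": 25}]},
    {"_id": 1002, "name": "Budi", "address": {"city": "Bandung"}},  # no orders field
]

print([u["name"] for u in find(users, "address.city", "Jakarta")])  # ['Andi']
```

The second document has no `orders` field at all, which is legal here: each document describes its own shape.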

MongoDB is the dominant document database. It supports rich queries, aggregation pipelines, secondary indexes, and multi-document ACID transactions. Couchbase and Amazon DocumentDB are other options in this space.

The document model works well when your data naturally forms aggregates. An e-commerce product is a good example: the product name, description, images, variants, and reviews form a logical unit that gets read and written together. Storing this as a single document avoids the five-table join you would need in a relational database.

Use when: Content management, product catalogs, user profiles with variable fields, event logging.

Avoid when: You need complex joins across document types or strict referential integrity between collections.

Wide-Column Stores

A wide-column store organizes data into rows and column families, where each row can have a different set of columns. Rows are identified by a partition key, and within each partition, data is sorted by a clustering key. This structure is optimized for high write throughput and time-series access patterns.

Wide-column stores were inspired by Google's Bigtable paper (2006). The data model looks like a table, but it is fundamentally different from relational tables. Rows do not share a fixed set of columns, and a single row can hold millions of them. Columns are grouped into families, which are declared up front, and the database stores each family's data together on disk.

Apache Cassandra is the most prominent wide-column store. It is designed for near-linear write scalability: adding a node to the cluster increases write throughput roughly in proportion. Cassandra partitions data by a hash of the partition key and sorts data within each partition by the clustering columns. This makes time-ordered queries within a partition very fast.
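The two ideas in this paragraph, hashing the partition key to pick a node and keeping rows sorted by a clustering key inside each partition, can be sketched in Python as follows. This is a simplified model, not Cassandra's implementation: the class name `WideColumnSketch` is hypothetical, and hashing with MD5 modulo the node count is a stand-in for Cassandra's token ring.

```python
import bisect
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, float]  # (clustering key: timestamp, value: sensor reading)


class WideColumnSketch:
    def __init__(self, node_count: int) -> None:
        self.node_count = node_count
        # One map per node: partition key -> rows kept sorted by timestamp.
        self.nodes: List[Dict[str, List[Row]]] = [
            defaultdict(list) for _ in range(node_count)
        ]

    def node_for(self, partition_key: str) -> int:
        # Hash the partition key to decide which node owns the partition.
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.node_count

    def write(self, partition_key: str, timestamp: int, value: float) -> None:
        partition = self.nodes[self.node_for(partition_key)][partition_key]
        bisect.insort(partition, (timestamp, value))  # keeps clustering order

    def range_query(self, partition_key: str, start: int, end: int) -> List[Row]:
        # Touches one node and one partition, then slices a sorted list.
        partition = self.nodes[self.node_for(partition_key)].get(partition_key, [])
        lo = bisect.bisect_left(partition, (start, float("-inf")))
        hi = bisect.bisect_right(partition, (end, float("inf")))
        return partition[lo:hi]


db = WideColumnSketch(node_count=3)
db.write("sensor-42", 30, 21.5)
db.write("sensor-42", 10, 20.9)
db.write("sensor-42", 20, 21.1)
print(db.range_query("sensor-42", 10, 20))  # [(10, 20.9), (20, 21.1)]
```

A query that does not name the partition key has no single node to go to, which is why access patterns have to be known at design time.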

HBase (built on Hadoop) and ScyllaDB (a Cassandra-compatible database written in C++) are other notable wide-column stores.

Use when: Time-series data (IoT sensor readings, event logs), messaging at scale, write-heavy workloads with known query patterns.

Avoid when: You need ad-hoc queries, complex aggregations, or do not know your access patterns at design time.

Graph Databases

A graph database stores data as nodes (entities) and edges (relationships), with properties on both. It is optimized for traversing connections, making queries like "find all friends of friends who live in Jakarta" fast regardless of total data size.

Relational databases can model graphs using join tables, but deep traversals get expensive quickly. A three-hop query (friends of friends of friends) requires three self-joins on the relationship table, and each join has to search that table, through an index lookup at best or a full scan at worst, so the cost depends on the total size of the table as well as on the number of paths explored. Native graph databases use index-free adjacency: each node directly references its neighbors, so traversal cost is proportional to the local neighborhood size, not the total graph size.
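The adjacency idea can be illustrated with a breadth-first traversal over a plain adjacency map. This is a minimal Python sketch, not how any particular graph database stores data; the function name `within_hops` and the sample graph are made up for the example.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # node -> neighbors it points to


def within_hops(graph: Graph, start: str, max_hops: int) -> Set[str]:
    """Return every node reachable from `start` in 1..max_hops hops."""
    seen = {start}
    reached: Set[str] = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        # Only this node's own neighbor list is read, never the whole graph.
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached


follows = {"a": ["b"], "b": ["c"], "c": ["d"], "x": ["y"]}
print(sorted(within_hops(follows, "a", 2)))  # ['b', 'c']
```

The `x -> y` edge is never touched by the query starting at `a`. Work done is bounded by the neighborhood that is actually explored, regardless of how many unrelated nodes the graph contains.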

Neo4j is the leading graph database. It uses the Cypher query language, which reads like a visual pattern: MATCH (a)-[:FOLLOWS]->(b)-[:FOLLOWS]->(c) RETURN c. Amazon Neptune and TigerGraph are other graph databases used at scale.

Use when: Social networks, recommendation engines, fraud detection, knowledge graphs, dependency mapping.

Avoid when: Your data has few relationships, or your queries are primarily simple lookups and range scans.

Data Model Comparison

graph LR
    subgraph Key-Value
        KV1["key: user:1001"] --> KV2["value: {blob}"]
        KV3["key: session:abc"] --> KV4["value: {blob}"]
    end
    subgraph Document
        D1["{ _id: 1001, name: 'Andi', orders: [...], prefs: {...} }"]
    end
    subgraph Wide-Column
        WC1["Row key: user:1001"]
        WC1 --> WC2["profile:name"]
        WC1 --> WC3["profile:city"]
        WC1 --> WC4["metrics:logins"]
        WC1 --> WC5["metrics:last_seen"]
    end
    subgraph Graph
        G1((User A)) -->|FOLLOWS| G2((User B))
        G2 -->|FOLLOWS| G3((User C))
        G1 -->|LIKES| G4((Post 1))
    end

Comparison Table

| Dimension | Key-Value | Document | Wide-Column | Graph |
|---|---|---|---|---|
| Data model | Key mapped to opaque blob | JSON/BSON documents with nested fields | Rows with dynamic column families | Nodes, edges, properties |
| Query pattern | GET/SET by key only | Query by any field, aggregation pipelines | Partition key lookup, range scan within partition | Pattern matching, graph traversal |
| Schema | None (schemaless) | Flexible (schema-on-read) | Column families predefined, columns flexible | Flexible node/edge types |
| Scalability | Linear horizontal | Horizontal with sharding | Linear horizontal (masterless) | Vertical primarily; some horizontal |
| Read speed | Sub-ms (in-memory) | Low ms (indexed) | Low ms (within partition) | Low ms (local traversal) |
| Write speed | Sub-ms | Low ms | Very high (LSM-tree) | Moderate |
| Strengths | Speed, simplicity | Flexibility, rich queries | Write throughput, time-series | Relationship queries |
| Products | Redis, DynamoDB, Memcached | MongoDB, Couchbase, DocumentDB | Cassandra, HBase, ScyllaDB | Neo4j, Neptune, TigerGraph |
| Typical use case | Cache, sessions, rate limits | CMS, catalogs, user profiles | IoT, messaging, event logs | Social graphs, fraud, recommendations |

Radar Comparison

The radar chart below compares the four NoSQL families across five dimensions, scored on a relative scale of 1 (weakest) to 5 (strongest). These are directional, not absolute. Your mileage will vary depending on the specific product, configuration, and workload.

Key-value stores score the highest on speed and scalability but the lowest on query flexibility. Graph databases are the inverse: rich queries but harder to scale horizontally. Document and wide-column databases sit in different middle grounds. This is why polyglot persistence exists. No single family wins on every axis.

Choosing the Right Family

flowchart TD
    Start["What is your primary access pattern?"] --> Q1{"Simple key lookup?"}
    Q1 -->|Yes| KV["Key-Value Store (Redis, DynamoDB)"]
    Q1 -->|No| Q2{"Query by document fields?"}
    Q2 -->|Yes| Doc["Document Database (MongoDB, Couchbase)"]
    Q2 -->|No| Q3{"Time-series or write-heavy with known partitions?"}
    Q3 -->|Yes| WC["Wide-Column Store (Cassandra, ScyllaDB)"]
    Q3 -->|No| Q4{"Relationship traversal?"}
    Q4 -->|Yes| Graph["Graph Database (Neo4j, Neptune)"]
    Q4 -->|No| SQL["Consider Relational (PostgreSQL, MySQL)"]

The decision tree above is a starting point, not a prescription. Many workloads could fit multiple families. When in doubt, start with the simplest option that meets your access pattern requirements. You can always add specialized databases later through polyglot persistence.
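For readers who prefer code to diagrams, the same decision tree can be written as a small Python function. The `AccessPattern` fields and the `choose_family` name are invented for this sketch, and, like the diagram, the function encodes a first-pass heuristic rather than a rule: the questions are asked in the same order, and the first "yes" wins.

```python
from dataclasses import dataclass


@dataclass
class AccessPattern:
    # Each flag answers one question from the decision tree.
    simple_key_lookup: bool = False
    query_by_document_fields: bool = False
    time_series_or_write_heavy: bool = False
    relationship_traversal: bool = False


def choose_family(pattern: AccessPattern) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if pattern.simple_key_lookup:
        return "key-value"
    if pattern.query_by_document_fields:
        return "document"
    if pattern.time_series_or_write_heavy:
        return "wide-column"
    if pattern.relationship_traversal:
        return "graph"
    return "relational"


print(choose_family(AccessPattern(simple_key_lookup=True)))           # key-value
print(choose_family(AccessPattern(time_series_or_write_heavy=True)))  # wide-column
print(choose_family(AccessPattern()))                                 # relational
```

Because the first matching question wins, a workload that answers "yes" to several questions is assigned to the simplest family that fits, which matches the advice to start simple.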

Systems Thinking Lens

Each NoSQL family creates a different constraint on your system. Key-value stores constrain your query patterns but free you from schema management. Graph databases free your query patterns but constrain your scaling options. These are balancing loops. The freedom you gain on one axis is paid for on another.

The leverage point is access pattern analysis. Before choosing a database family, write down your ten most important queries. If you cannot do that, you are not ready to choose. The database family should emerge from the access patterns, not the other way around. Too many teams pick a database because it is trendy, then spend months fighting its constraints.

Assignment

Match each workload below to the most appropriate NoSQL family (key-value, document, wide-column, or graph). For each, explain your reasoning in 2-3 sentences, focusing on the access pattern that drove your decision.

  1. Session store for a web application with 50 million active sessions. Sessions are created, read by session ID, and expire after 30 minutes.
  2. Time-series IoT data from 100,000 sensors, each sending a reading every 5 seconds. Queries are always "all readings from sensor X between time A and time B."
  3. Recommendation engine for a streaming platform. Key query: "users who watched Movie A also watched..." which requires traversing user-movie-user paths.
  4. Activity logs from a SaaS application. Each log entry has a timestamp, user ID, action type, and a metadata object whose fields vary by action type (login has IP and device; purchase has amount and items; error has stack trace).