Scalability Vocabulary
Session 1.9 · ~5 min read
The Language of Distributed Systems
Before you can design systems that scale, you need a shared vocabulary. These terms appear in every system design discussion, every architecture review, and every post-mortem. They are not abstract academic concepts. Each one describes a concrete property that either exists in your system or does not.
This session defines the core terms, gives you a practical example for each, and introduces the CAP theorem, which formalizes the tradeoffs between three of them.
Core Terms
| Term | Definition | One-Line Example |
|---|---|---|
| Scalability | The ability of a system to handle increased load by adding resources | Adding more web servers behind a load balancer to serve more users |
| Availability | The proportion of time a system is operational and accessible | 99.99% availability means less than 53 minutes of downtime per year |
| Consistency | All nodes in a distributed system return the same data at the same time | After updating your profile photo, every server shows the new photo immediately |
| Fault Tolerance | The ability to continue operating correctly when components fail | A database cluster continues serving reads when one replica crashes |
| SPOF (Single Point of Failure) | A component whose failure brings down the entire system | A single database server with no replicas: if it dies, everything stops |
| Partition Tolerance | The system continues to operate despite network splits between nodes | Two data centers lose connectivity but both keep serving requests |
Scalability: Vertical vs. Horizontal
Vertical scaling (scaling up) means adding more resources to a single machine: more CPU, more RAM, faster disks. Horizontal scaling (scaling out) means adding more machines to distribute the load.
Vertical scaling is simpler. You upgrade the server and your application code does not change. But every machine has a ceiling. You cannot add infinite RAM. You cannot buy a CPU with 10,000 cores. And while you are upgrading, the machine is typically offline.
Horizontal scaling has no theoretical ceiling, but it introduces complexity. Your application must handle multiple instances, shared state, load distribution, and network communication between nodes. Most production systems combine both: scale vertically until it stops being cost-effective, then scale horizontally.
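To make "distribute the load" concrete, here is a toy sketch of the simplest load-balancing strategy, round-robin, over a hypothetical pool of app servers (the server names are made up for illustration):

```python
from itertools import cycle

# Hypothetical pool of identical app servers behind a load balancer.
servers = ["app-1", "app-2", "app-3"]

# Round-robin: each incoming request goes to the next server in turn,
# so load spreads evenly across all instances.
next_server = cycle(servers)

def route(request_id: int) -> str:
    """Assign a request to a server (request_id unused in pure round-robin)."""
    return next(next_server)

assignments = [route(i) for i in range(6)]
print(assignments)  # each server receives two of the six requests
```

Real load balancers add health checks, weighting, and session affinity on top of this, but the core idea is the same: more identical machines, with something in front deciding where each request goes.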
Availability: Measuring Uptime
Availability is expressed as a percentage, commonly referred to by the number of nines:
| Availability | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |
Each additional nine is exponentially harder and more expensive to achieve. Moving from 99.9% to 99.99% often requires redundant infrastructure across multiple availability zones, automated failover, and rigorous testing of failure scenarios. Most consumer web applications target three or four nines. Financial trading systems and emergency services aim for five.
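The downtime figures in the table follow directly from the availability percentage. A minimal sketch of the arithmetic (ignoring leap years, which shifts the numbers slightly):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.1f} min/year")
```

Running this reproduces the per-year column above: four nines allows roughly 52.6 minutes of downtime per year, and each added nine cuts the budget by a factor of ten.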
Consistency: Strong vs. Eventual
Strong consistency guarantees that after a write completes, every subsequent read returns the updated value. If you transfer $100 from account A to account B, strong consistency means no reader will ever see the money in both accounts or neither account. The system behaves as if there is a single copy of the data.
Eventual consistency allows replicas to diverge temporarily. After a write, some replicas may return stale data for a period of time. Eventually, all replicas converge to the same value. The "eventually" part can range from milliseconds to seconds, depending on the system.
Strong consistency is easier to reason about but harder to scale. It often requires coordination between nodes (locks, consensus protocols), which adds latency. Eventual consistency scales better because replicas can operate independently, but your application logic must handle stale reads gracefully.
Social media feeds use eventual consistency. If you post a photo and your friend sees it two seconds later, nobody notices. Banking transactions use strong consistency. If the balance is wrong even briefly, the consequences are severe.
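The divergence-then-convergence behavior can be modeled in a few lines. This is a toy simulation, not a real replication protocol: the write is acknowledged after reaching one replica, and a background reconciliation step later brings the others up to date:

```python
# Toy model of eventual consistency: a write lands on one replica first
# and propagates to the others asynchronously.
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.value = None

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]

def write(value: str) -> None:
    # The write is acknowledged after reaching only the first replica;
    # the others still hold the old (stale) value for now.
    replicas[0].value = value

def anti_entropy() -> None:
    # Background reconciliation: remaining replicas converge to r1's value.
    for r in replicas[1:]:
        r.value = replicas[0].value

write("new-photo")
print([r.value for r in replicas])  # r2 and r3 still stale
anti_entropy()
print([r.value for r in replicas])  # all replicas converged
```

A client reading from r2 or r3 between the write and the reconciliation sees the old photo; that window is the "eventually" in eventual consistency.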
Fault Tolerance and Single Points of Failure
A fault-tolerant system is designed to continue operating when things break. Hardware fails. Networks partition. Disks corrupt. Software crashes. The question is not whether failures happen but what the system does when they happen.
The first step in designing for fault tolerance is identifying every Single Point of Failure (SPOF). A SPOF is any component that, if it fails, takes the entire system down. Common SPOFs include:
- A single database server with no replicas
- A single load balancer with no failover
- A single DNS provider
- An application that depends on one external API with no fallback
The remedy for a SPOF is redundancy: run multiple instances, in multiple locations, with automatic failover. This does not eliminate failure. It reduces the probability that a single failure becomes a system-wide outage.
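The benefit of redundancy is multiplicative. Under the (often optimistic) assumption that replica failures are independent, the probability that every replica is down at once shrinks exponentially with the replica count:

```python
def outage_probability(per_node_failure: float, replicas: int) -> float:
    """Probability that ALL replicas are down simultaneously,
    assuming failures are independent (often optimistic in practice:
    correlated failures like a shared power feed break this assumption)."""
    return per_node_failure ** replicas

# A node that is unavailable 1% of the time:
for n in (1, 2, 3):
    print(f"{n} replica(s): {outage_probability(0.01, n):.6%} full-outage probability")
```

With one replica the outage probability is 1%; with three independent replicas it drops to one in a million. The caveat in the comment matters: correlated failures (same rack, same software bug, same misconfiguration) are exactly why redundancy reduces but does not eliminate outage risk.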
The CAP Theorem
The CAP theorem, proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002, states that a distributed data store can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance.
The three properties:

- Consistency: all nodes see the same data
- Availability: every request gets a response
- Partition Tolerance: the system works despite network splits

Distributed databases are commonly classified by which pair they prioritize: CP systems (MongoDB, HBase, Redis Cluster), AP systems (Cassandra, DynamoDB, CouchDB), and CA systems (single-node RDBMS, not viable in distributed systems).
In practice, partition tolerance is not optional. Networks fail. Packets get lost. Data centers lose connectivity. Any distributed system must tolerate partitions. The real choice is between consistency and availability during a partition:
- CP systems (Consistency + Partition Tolerance): When a network partition occurs, the system refuses to serve requests that might return stale data. It sacrifices availability to maintain consistency. Example: MongoDB in its default configuration will reject writes to a minority partition.
- AP systems (Availability + Partition Tolerance): When a partition occurs, the system continues serving requests, but different nodes may return different data. It sacrifices consistency to remain available. Example: Cassandra continues accepting writes on both sides of a partition and reconciles later.
- CA systems (Consistency + Availability): This combination requires no partitions, which means a single node or a network that never fails. It does not exist in real distributed systems. A single PostgreSQL server is "CA" only because it is not distributed.
The CAP theorem does not say you must permanently give up consistency or availability. It says that during a network partition, you must choose which one to sacrifice. When the network is healthy, you can have both.
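The CP-versus-AP choice during a partition can be sketched as a simple decision rule. This is a toy model, not how any particular database implements it: the CP node uses a majority-quorum check (a common mechanism in consensus-based systems), while the AP node accepts every write and defers reconciliation:

```python
# Toy decision rule during a network partition: a CP node refuses writes
# unless it can reach a majority (quorum) of the cluster; an AP node
# accepts writes regardless and reconciles the divergence later.
CLUSTER_SIZE = 5

def cp_accepts_write(reachable_nodes: int) -> bool:
    """CP behavior: sacrifice availability to protect consistency."""
    return reachable_nodes > CLUSTER_SIZE // 2  # requires a strict majority

def ap_accepts_write(reachable_nodes: int) -> bool:
    """AP behavior: stay available; consistency is repaired afterwards."""
    return True

# A partition splits the 5-node cluster into sides of 3 and 2:
print(cp_accepts_write(3), cp_accepts_write(2))  # majority side: True, minority: False
print(ap_accepts_write(2))                       # both sides keep accepting: True
```

Note what the quorum rule buys the CP system: at most one side of any partition can hold a majority, so conflicting writes on both sides are impossible. The AP system accepts that risk and must reconcile conflicts once the partition heals.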
Putting It Together
These terms are not independent. They interact. A system that prioritizes strong consistency may sacrifice availability during partitions. A system designed for high availability may accept eventual consistency. Eliminating SPOFs improves fault tolerance, which improves availability. Horizontal scaling enables higher throughput but makes strong consistency harder to achieve.
Understanding these terms and their tradeoffs is the foundation of every design decision you will make in the rest of this course. When someone says "we need 99.99% availability," you should immediately think about what that costs in terms of consistency, complexity, and infrastructure.
Further Reading
- Seth Gilbert and Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002). The formal proof of the CAP theorem.
- CAP Theorem, Wikipedia. Comprehensive overview with history, formal definition, and system classification.
- IBM, "What Is the CAP Theorem?". Accessible explanation with real-world database examples.
- Martin Kleppmann, "Please Stop Calling Databases CP or AP" (2015). An important critique of oversimplified CAP classifications.
Assignment
Without looking at the session content, define each of the following terms in your own words. One or two sentences each.
- Scalability
- Availability
- Consistency
- Fault Tolerance
- Single Point of Failure
- Partition Tolerance
After writing your definitions, compare them to the table at the top of this session. Where did your understanding differ? Which term was hardest to define precisely?
Bonus: Pick a service you use daily (Gmail, Spotify, Grab). Based on its behavior, would you classify it as a CP or AP system? What evidence supports your classification?