Module 1: Architectural Foundations & Core Concepts

The Language of Distributed Systems

Before you can design systems that scale, you need a shared vocabulary. These terms appear in every system design discussion, every architecture review, and every post-mortem. They are not abstract academic concepts. Each one describes a concrete property that either exists in your system or does not.

This session defines the core terms, gives you a practical example for each, and introduces the CAP theorem, which formalizes the tradeoffs between three of them.

Core Terms

| Term | Definition | One-Line Example |
| --- | --- | --- |
| Scalability | The ability of a system to handle increased load by adding resources | Adding more web servers behind a load balancer to serve more users |
| Availability | The proportion of time a system is operational and accessible | 99.99% availability means less than 53 minutes of downtime per year |
| Consistency | All nodes in a distributed system return the same data at the same time | After updating your profile photo, every server shows the new photo immediately |
| Fault Tolerance | The ability to continue operating correctly when components fail | A database cluster continues serving reads when one replica crashes |
| SPOF (Single Point of Failure) | A component whose failure brings down the entire system | A single database server with no replicas: if it dies, everything stops |
| Partition Tolerance | The system continues to operate despite network splits between nodes | Two data centers lose connectivity but both keep serving requests |

Scalability: Vertical vs. Horizontal

Vertical scaling (scaling up) means adding more resources to a single machine: more CPU, more RAM, faster disks. Horizontal scaling (scaling out) means adding more machines to distribute the load.

Vertical scaling is simpler. You upgrade the server and your application code does not change. But every machine has a ceiling. You cannot add infinite RAM. You cannot buy a CPU with 10,000 cores. And while you are upgrading, the machine is typically offline.

Horizontal scaling has no theoretical ceiling, but it introduces complexity. Your application must handle multiple instances, shared state, load distribution, and network communication between nodes. Most production systems use a combination: scale vertically until it becomes cost-ineffective, then scale horizontally.
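The load-distribution part of horizontal scaling is often easier to see in code than in prose. The sketch below is a minimal round-robin balancer over a pool of identical app servers; the server names and request IDs are illustrative placeholders, not part of any real deployment.

```python
from itertools import cycle

# Hypothetical pool of identical, interchangeable app servers.
servers = ["app-1", "app-2", "app-3"]

def make_balancer(pool):
    """Round-robin balancing: each new request goes to the next server
    in the rotation, so load spreads evenly across the pool."""
    rotation = cycle(pool)

    def route(request_id):
        return (request_id, next(rotation))

    return route

route = make_balancer(servers)
assignments = [route(i) for i in range(6)]
# Six requests land evenly: two on each of the three servers.
print(assignments)
```

Note what this sketch glosses over: real balancers also track server health, drain connections on failure, and often need sticky sessions or shared state — exactly the complexity the paragraph above refers to.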

Availability: Measuring Uptime

Availability is expressed as a percentage, commonly referred to by the number of nines:

| Availability | Downtime per Year | Downtime per Month |
| --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |

Each additional nine is exponentially harder and more expensive to achieve. Moving from 99.9% to 99.99% often requires redundant infrastructure across multiple availability zones, automated failover, and rigorous testing of failure scenarios. Most consumer web applications target three or four nines. Financial trading systems and emergency services aim for five.
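The downtime figures in the table are simple arithmetic: the fraction of the year the system is allowed to be down is one minus the availability. A quick sketch of that calculation (ignoring leap years):

```python
# Convert an availability percentage into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_pct):
    """Allowed downtime is the unavailable fraction of a full year."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
```

Running this reproduces the yearly column of the table: four nines leaves roughly 52.6 minutes per year, which is why the table at the top says "less than 53 minutes".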

Consistency: Strong vs. Eventual

Strong consistency guarantees that after a write completes, every subsequent read returns the updated value. If you transfer $100 from account A to account B, strong consistency means no reader will ever see the money in both accounts or neither account. The system behaves as if there is a single copy of the data.

Eventual consistency allows replicas to diverge temporarily. After a write, some replicas may return stale data for a period of time. Eventually, all replicas converge to the same value. The "eventually" part can range from milliseconds to seconds, depending on the system.

Strong consistency is easier to reason about but harder to scale. It often requires coordination between nodes (locks, consensus protocols), which adds latency. Eventual consistency scales better because replicas can operate independently, but your application logic must handle stale reads gracefully.

Social media feeds use eventual consistency. If you post a photo and your friend sees it two seconds later, nobody notices. Banking transactions use strong consistency. If the balance is wrong even briefly, the consequences are severe.
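The profile-photo example above can be made concrete with a toy model of eventual consistency. This is a deliberately simplified sketch: one replica accepts the write immediately, and the others only catch up when a replication pass runs. The replica names and the `replicate` helper are invented for illustration.

```python
class Replica:
    """One copy of the data. In a real system this would be a server."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]

def write(key, value):
    # The write is acknowledged after reaching a single replica: fast,
    # but other replicas serve stale data until replication completes.
    replicas[0].data[key] = value

def replicate():
    # Catch-up pass: copy the first replica's state to the others.
    for r in replicas[1:]:
        r.data.update(replicas[0].data)

write("profile_photo", "new.jpg")
stale = replicas[1].read("profile_photo")  # None: r2 has not caught up yet
replicate()
fresh = replicas[1].read("profile_photo")  # "new.jpg": replicas converged
```

The window between the write and the `replicate()` call is the "eventually" in eventual consistency: a reader hitting `r2` during that window sees the old value, and the application must tolerate that.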

Fault Tolerance and Single Points of Failure

A fault-tolerant system is designed to continue operating when things break. Hardware fails. Networks partition. Disks corrupt. Software crashes. The question is not whether failures happen but what the system does when they happen.

The first step in designing for fault tolerance is identifying every Single Point of Failure (SPOF). A SPOF is any component that, if it fails, takes the entire system down. Common SPOFs include a single database server with no replicas, a lone load balancer in front of an otherwise redundant pool, a single DNS provider, and the sole network link into a data center.

The remedy for a SPOF is redundancy: run multiple instances, in multiple locations, with automatic failover. This does not eliminate failure. It reduces the probability that a single failure becomes a system-wide outage.
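Why redundancy reduces rather than eliminates risk can be seen with a little probability. If each instance is down with probability p, and failures are independent, a total outage requires all n instances to be down at once, which happens with probability p^n. The 1% figure below is illustrative, and the independence assumption is the catch: real failures (power, deploys, shared dependencies) are often correlated, so the true benefit is smaller than this math suggests.

```python
# Probability that ALL n redundant instances are down at the same time,
# assuming each is independently down with probability p.
def outage_probability(p_single_failure, n_replicas):
    return p_single_failure ** n_replicas

p = 0.01  # illustrative: each instance is down 1% of the time
for n in (1, 2, 3):
    print(f"{n} replica(s): P(total outage) = {outage_probability(p, n):.6f}")
```

With one instance the outage probability is 1%; with three independent replicas it drops to one in a million. Correlated failures are exactly why redundancy is spread across "multiple locations" in the paragraph above.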

The CAP Theorem

The CAP theorem, proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002, states that a distributed data store can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance.

```mermaid
graph TD
    CAP((CAP Theorem))
    C[Consistency<br/>All nodes see same data]
    A[Availability<br/>Every request gets a response]
    P[Partition Tolerance<br/>System works despite network splits]
    CAP --- C
    CAP --- A
    CAP --- P
    CP[CP Systems<br/>MongoDB, HBase, Redis Cluster]
    AP[AP Systems<br/>Cassandra, DynamoDB, CouchDB]
    CA[CA Systems<br/>Single-node RDBMS<br/>Not viable in distributed systems]
    C --- CP
    C --- CA
    A --- AP
    A --- CA
    P --- CP
    P --- AP
```

In practice, partition tolerance is not optional. Networks fail. Packets get lost. Data centers lose connectivity. Any distributed system must tolerate partitions. The real choice is between consistency and availability during a partition: a CP system rejects or delays requests it cannot answer consistently, while an AP system keeps responding but may return stale data.

The CAP theorem does not say you must permanently give up consistency or availability. It says that during a network partition, you must choose which one to sacrifice. When the network is healthy, you can have both.
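The CP-versus-AP choice during a partition can be sketched as two behaviors of the same node. This is a conceptual model, not how any particular database implements it: the `Node` class and its modes are invented for illustration.

```python
class Node:
    """A data node that behaves as CP or AP when cut off from its peers."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"
        self.partitioned = False  # True when peers are unreachable

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP choice: refuse to answer rather than risk a stale read.
            raise TimeoutError("unavailable: cannot confirm consistency")
        # AP choice (or healthy network): answer with local data,
        # which may be stale during a partition.
        return self.value

cp, ap = Node("CP"), Node("AP")
cp.partitioned = ap.partitioned = True  # the network splits

# ap.read() returns "v1": available, but possibly stale.
# cp.read() raises TimeoutError: consistent, but not available.
```

When `partitioned` is set back to `False`, both nodes serve consistent, available reads again — which is exactly the point of the paragraph above: the sacrifice applies only while the partition lasts.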

Putting It Together

These terms are not independent. They interact. A system that prioritizes strong consistency may sacrifice availability during partitions. A system designed for high availability may accept eventual consistency. Eliminating SPOFs improves fault tolerance, which improves availability. Horizontal scaling enables higher throughput but makes strong consistency harder to achieve.

Understanding these terms and their tradeoffs is the foundation of every design decision you will make in the rest of this course. When someone says "we need 99.99% availability," you should immediately think about what that costs in terms of consistency, complexity, and infrastructure.

Further Reading

Assignment

Without looking at the session content, define each of the following terms in your own words. One or two sentences each.

  1. Scalability
  2. Availability
  3. Consistency
  4. Fault Tolerance
  5. Single Point of Failure
  6. Partition Tolerance

After writing your definitions, compare them to the table at the top of this session. Where did your understanding differ? Which term was hardest to define precisely?

Bonus: Pick a service you use daily (Gmail, Spotify, Grab). Based on its behavior, would you classify it as a CP or AP system? What evidence supports your classification?