Module 1: Architectural Foundations & Core Concepts

The Language of Distributed Systems

Before you can design systems that scale, you need a shared vocabulary. These terms appear in every system design discussion, every architecture review, and every post-mortem. They are not abstract academic concepts. Each one describes a concrete property that either exists in your system or does not.

This session defines the core terms, gives you a practical example for each, and introduces the CAP theorem, which formalizes the tradeoffs between three of them.

Core Terms

| Term | Definition | One-Line Example |
| --- | --- | --- |
| Scalability | The ability of a system to handle increased load by adding resources | Adding more web servers behind a load balancer to serve more users |
| Availability | The proportion of time a system is operational and accessible | 99.99% availability means less than 53 minutes of downtime per year |
| Consistency | All nodes in a distributed system return the same data at the same time | After updating your profile photo, every server shows the new photo immediately |
| Fault Tolerance | The ability to continue operating correctly when components fail | A database cluster continues serving reads when one replica crashes |
| SPOF (Single Point of Failure) | A component whose failure brings down the entire system | A single database server with no replicas: if it dies, everything stops |
| Partition Tolerance | The system continues to operate despite network splits between nodes | Two data centers lose connectivity but both keep serving requests |

Scalability: Vertical vs. Horizontal

Vertical scaling (scaling up) means adding more resources to a single machine: more CPU, more RAM, faster disks. Horizontal scaling (scaling out) means adding more machines to distribute the load.

Vertical scaling is simpler. You upgrade the server and your application code does not change. But every machine has a ceiling. You cannot add infinite RAM. You cannot buy a CPU with 10,000 cores. And while you are upgrading, the machine is typically offline.

Horizontal scaling has no theoretical ceiling, but it introduces complexity. Your application must handle multiple instances, shared state, load distribution, and network communication between nodes. Most production systems use a combination: scale vertically until it becomes cost-ineffective, then scale horizontally.
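The load-distribution part of horizontal scaling is often easier to see in code than in prose. The sketch below is a minimal round-robin balancer over a pool of identical app servers; the server names and request IDs are illustrative placeholders, not part of any real deployment.

```python
from itertools import cycle

# Hypothetical pool of identical, interchangeable app servers.
servers = ["app-1", "app-2", "app-3"]

def make_balancer(pool):
    """Round-robin balancing: each new request goes to the next server
    in the rotation, so load spreads evenly across the pool."""
    rotation = cycle(pool)

    def route(request_id):
        return (request_id, next(rotation))

    return route

route = make_balancer(servers)
assignments = [route(i) for i in range(6)]
# Six requests land evenly: two on each of the three servers.
print(assignments)
```

Note what this sketch glosses over: real balancers also track server health, drain connections on failure, and often need sticky sessions or shared state — exactly the complexity the paragraph above refers to.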

Availability: Measuring Uptime

Availability is expressed as a percentage, commonly referred to by the number of nines:

| Availability | Downtime per Year | Downtime per Month |
| --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |

Each additional nine is exponentially harder and more expensive to achieve. Moving from 99.9% to 99.99% often requires redundant infrastructure across multiple availability zones, automated failover, and rigorous testing of failure scenarios. Most consumer web applications target three or four nines. Financial trading systems and emergency services aim for five.
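The downtime figures in the table are simple arithmetic: the fraction of the year the system is allowed to be down is one minus the availability. A quick sketch of that calculation (ignoring leap years):

```python
# Convert an availability percentage into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_pct):
    """Allowed downtime is the unavailable fraction of a full year."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
```

Running this reproduces the yearly column of the table: four nines leaves roughly 52.6 minutes per year, which is why the table at the top says "less than 53 minutes".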

Consistency: Strong vs. Eventual

Strong consistency guarantees that after a write completes, every subsequent read returns the updated value. If you transfer $100 from account A to account B, strong consistency means no reader will ever see the money in both accounts or neither account. The system behaves as if there is a single copy of the data.

Eventual consistency allows replicas to diverge temporarily. After a write, some replicas may return stale data for a period of time. Eventually, all replicas converge to the same value. The "eventually" part can range from milliseconds to seconds, depending on the system.

Strong consistency is easier to reason about but harder to scale. It often requires coordination between nodes (locks, consensus protocols), which adds latency. Eventual consistency scales better because replicas can operate independently, but your application logic must handle stale reads gracefully.

Social media feeds use eventual consistency. If you post a photo and your friend sees it two seconds later, nobody notices. Banking transactions use strong consistency. If the balance is wrong even briefly, the consequences are severe.
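The profile-photo example above can be made concrete with a toy model of eventual consistency. This is a deliberately simplified sketch: one replica accepts the write immediately, and the others only catch up when a replication pass runs. The replica names and the `replicate` helper are invented for illustration.

```python
class Replica:
    """One copy of the data. In a real system this would be a server."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]

def write(key, value):
    # The write is acknowledged after reaching a single replica: fast,
    # but other replicas serve stale data until replication completes.
    replicas[0].data[key] = value

def replicate():
    # Catch-up pass: copy the first replica's state to the others.
    for r in replicas[1:]:
        r.data.update(replicas[0].data)

write("profile_photo", "new.jpg")
stale = replicas[1].read("profile_photo")  # None: r2 has not caught up yet
replicate()
fresh = replicas[1].read("profile_photo")  # "new.jpg": replicas converged
```

The window between the write and the `replicate()` call is the "eventually" in eventual consistency: a reader hitting `r2` during that window sees the old value, and the application must tolerate that.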

Fault Tolerance and Single Points of Failure

A fault-tolerant system is designed to continue operating when things break. Hardware fails. Networks partition. Disks corrupt. Software crashes. The question is not whether failures happen but what the system does when they happen.

The first step in designing for fault tolerance is identifying every Single Point of Failure (SPOF). A SPOF is any component that, if it fails, takes the entire system down. Common SPOFs include a single database server with no replicas, a lone load balancer in front of an otherwise redundant pool, a single DNS provider, and the sole network link into a data center.

The remedy for a SPOF is redundancy: run multiple instances, in multiple locations, with automatic failover. This does not eliminate failure. It reduces the probability that a single failure becomes a system-wide outage.
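Why redundancy reduces rather than eliminates risk can be seen with a little probability. If each instance is down with probability p, and failures are independent, a total outage requires all n instances to be down at once, which happens with probability p^n. The 1% figure below is illustrative, and the independence assumption is the catch: real failures (power, deploys, shared dependencies) are often correlated, so the true benefit is smaller than this math suggests.

```python
# Probability that ALL n redundant instances are down at the same time,
# assuming each is independently down with probability p.
def outage_probability(p_single_failure, n_replicas):
    return p_single_failure ** n_replicas

p = 0.01  # illustrative: each instance is down 1% of the time
for n in (1, 2, 3):
    print(f"{n} replica(s): P(total outage) = {outage_probability(p, n):.6f}")
```

With one instance the outage probability is 1%; with three independent replicas it drops to one in a million. Correlated failures are exactly why redundancy is spread across "multiple locations" in the paragraph above.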

The CAP Theorem

The CAP theorem, proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002, states that a distributed data store can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance.

```mermaid
graph TD
    CAP((CAP Theorem))
    C[Consistency<br/>All nodes see same data]
    A[Availability<br/>Every request gets a response]
    P[Partition Tolerance<br/>System works despite network splits]
    CAP --- C
    CAP --- A
    CAP --- P
    CP[CP Systems<br/>MongoDB, HBase, Redis Cluster]
    AP[AP Systems<br/>Cassandra, DynamoDB, CouchDB]
    CA[CA Systems<br/>Single-node RDBMS<br/>Not viable in distributed systems]
    C --- CP
    C --- CA
    A --- AP
    A --- CA
    P --- CP
    P --- AP
```

In practice, partition tolerance is not optional. Networks fail. Packets get lost. Data centers lose connectivity. Any distributed system must tolerate partitions. The real choice is between consistency and availability during a partition: a CP system rejects or delays requests it cannot answer consistently, while an AP system keeps responding but may return stale data.

The CAP theorem does not say you must permanently give up consistency or availability. It says that during a network partition, you must choose which one to sacrifice. When the network is healthy, you can have both.
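The CP-versus-AP choice during a partition can be sketched as two behaviors of the same node. This is a conceptual model, not how any particular database implements it: the `Node` class and its modes are invented for illustration.

```python
class Node:
    """A data node that behaves as CP or AP when cut off from its peers."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"
        self.partitioned = False  # True when peers are unreachable

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP choice: refuse to answer rather than risk a stale read.
            raise TimeoutError("unavailable: cannot confirm consistency")
        # AP choice (or healthy network): answer with local data,
        # which may be stale during a partition.
        return self.value

cp, ap = Node("CP"), Node("AP")
cp.partitioned = ap.partitioned = True  # the network splits

# ap.read() returns "v1": available, but possibly stale.
# cp.read() raises TimeoutError: consistent, but not available.
```

When `partitioned` is set back to `False`, both nodes serve consistent, available reads again — which is exactly the point of the paragraph above: the sacrifice applies only while the partition lasts.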

Putting It Together

These terms are not independent. They interact. A system that prioritizes strong consistency may sacrifice availability during partitions. A system designed for high availability may accept eventual consistency. Eliminating SPOFs improves fault tolerance, which improves availability. Horizontal scaling enables higher throughput but makes strong consistency harder to achieve.

Understanding these terms and their tradeoffs is the foundation of every design decision you will make in the rest of this course. When someone says "we need 99.99% availability," you should immediately think about what that costs in terms of consistency, complexity, and infrastructure.

Further Reading

Assignment

Without looking at the session content, define each of the following terms in your own words. One or two sentences each.

  1. Scalability
  2. Availability
  3. Consistency
  4. Fault Tolerance
  5. Single Point of Failure
  6. Partition Tolerance

After writing your definitions, compare them to the table at the top of this session. Where did your understanding differ? Which term was hardest to define precisely?

Bonus: Pick a service you use daily (Gmail, Spotify, Grab). Based on its behavior, would you classify it as a CP or AP system? What evidence supports your classification?