Module 0: Foundation: Systems Thinking Principles

Why We Keep Fixing Symptoms

A server crashes at 3 AM. The on-call engineer restarts it. The incident is marked "resolved." Two weeks later, it happens again. Another restart. Another resolution. After the fourth occurrence, someone finally asks: why does this keep happening?

Most incident response operates at the surface level. Something breaks, we fix the immediate problem, and we move on. This is not laziness. It is a natural consequence of how we perceive systems. We see events because events are visible. The structures that produce those events are hidden beneath layers of abstraction, history, and assumption.

The Iceberg Model, rooted in work by Daniel Kim and earlier articulations by Peter Senge in The Fifth Discipline (1990), gives us a framework for looking deeper.

The Four Levels

The Iceberg Model describes four levels of understanding, arranged from the most visible (above the waterline) to the most hidden (deep below it). Each level explains more than the one above it, and each requires more effort to see.

```mermaid
graph TB
    subgraph "Visible (Above the Waterline)"
        E["Events<br/>What happened?"]
    end
    subgraph "Below the Surface"
        P["Patterns<br/>What trends repeat?"]
        S["Structures<br/>What causes the patterns?"]
        M["Mental Models<br/>What assumptions created the structures?"]
    end
    E --> P --> S --> M
```

Most incident response stops at the waterline. The fix lives deeper.

Level 1: Events

Events are individual occurrences. They are specific, concrete, and observable. "The server crashed at 3 AM on Tuesday." "The API returned 500 errors for 12 minutes." "A customer complained about slow checkout."

Events are where most incident response begins and ends. The server crashed, so we restarted it. The API returned errors, so we rolled back the deployment. The checkout was slow, so we cleared the cache. Problem solved, ticket closed.

Event-level responses are reactive. They address what happened without asking why it happened. They are necessary (you have to restart the crashed server) but insufficient (you have not prevented the next crash).

Level 2: Patterns

Patterns emerge when you look at events over time. One server crash is an event. A server crash every Monday at 3 AM is a pattern. One slow checkout is an event. Slow checkouts every time a flash sale starts is a pattern.

Pattern recognition requires data and patience. You need to track events across time, group them, and look for recurring themes. This is why good logging and monitoring matter. Without historical data, every event looks isolated.

When you identify a pattern, you can move from reactive to anticipatory. You know the server will crash next Monday at 3 AM, so you can prepare. But you still do not know why it crashes. For that, you need to go deeper.
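Pattern detection like this can be automated once events are logged with timestamps. The sketch below groups a hypothetical incident log by service, weekday, and hour to surface recurring slots; the event data and the `recurring_slots` helper are illustrative, and real input would come from your monitoring system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident log: (timestamp, service) pairs. These values are
# illustrative; real data would come from logging or monitoring tooling.
events = [
    ("2024-03-04 03:02", "db"),
    ("2024-03-11 03:05", "db"),
    ("2024-03-18 03:01", "db"),
    ("2024-03-20 14:30", "api"),
]

def recurring_slots(events, min_count=2):
    """Group events by (service, weekday, hour); keep slots that repeat."""
    slots = Counter()
    for ts, service in events:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        slots[(service, dt.strftime("%A"), dt.hour)] += 1
    return {slot: n for slot, n in slots.items() if n >= min_count}

print(recurring_slots(events))
# The three Monday 3 AM "db" crashes surface as a single recurring slot;
# the lone "api" event does not.
```

Even this crude grouping turns four "isolated" events into one pattern worth investigating, which is the move from Level 1 to Level 2.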

Level 3: Structures

Structures are the architectural and organizational arrangements that produce the patterns. The server crashes every Monday at 3 AM because a cron job triggers a full table scan on a 500GB table. The table scan consumes all available memory, and the OOM killer terminates the database process.

Now you have something actionable. The pattern is caused by a specific structural arrangement: a scheduled job, a full table scan, an undersized instance, and the absence of a memory limit on that process. You can change any of these structural elements to break the pattern.
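One of those structural changes can be sketched concretely. The full table scan fails because it holds the entire result set in memory; streaming the table in bounded batches keeps memory proportional to the batch size, not the table size. This is a minimal sketch assuming a DB-API style connection, with an in-memory SQLite table and hypothetical `orders` schema standing in for the real 500GB table.

```python
import sqlite3

BATCH = 2  # tiny for demonstration; production batches would be thousands

def process_in_batches(conn, handle_row, batch=BATCH):
    """Stream rows with fetchmany so memory stays O(batch), not O(table)."""
    cur = conn.cursor()
    cur.execute("SELECT id, payload FROM orders ORDER BY id")
    while True:
        rows = cur.fetchmany(batch)  # bounded memory per iteration
        if not rows:
            break
        for row in rows:
            handle_row(row)

# Demo with an in-memory SQLite table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO orders (payload) VALUES (?)",
                 [("row-%d" % i,) for i in range(5)])
seen = []
process_in_batches(conn, seen.append)
print(len(seen))  # 5
```

The point is not this particular code but the shape of the intervention: a structural fix changes how the work is done, not merely whether the process gets restarted afterward.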

Structural analysis is where system design lives. When you redesign a data model, split a monolith, add a queue between two services, or change a cron schedule, you are intervening at the structural level. These changes are more difficult than event-level fixes, but they address root causes instead of symptoms.

Level 4: Mental Models

Mental models are the assumptions, beliefs, and values that shaped the structures in the first place. The 500GB table exists because the original developers assumed the table would never exceed 10GB. They chose a schema that made sense at 10GB but becomes catastrophic at 500GB. No one revisited that assumption as the data grew.

Mental model analysis asks: what were we thinking when we built this? What did we assume about scale, usage patterns, failure modes, or user behavior? Were those assumptions correct? Are they still correct?

Mental models are the deepest and most powerful level. Changing a mental model changes every structure built on top of it. If the team adopts the mental model "tables will grow indefinitely, and we must design for that," future schemas will include partitioning strategies, archival policies, and growth projections from the start.

Worked Example: Recurring Production Incident

Here is a complete walkthrough of all four levels, using a realistic scenario.

| Level | Question | Observation | Action Taken |
| --- | --- | --- | --- |
| Event | What happened? | The checkout service returned HTTP 503 for 8 minutes during a flash sale on March 15. | Restart the service; scale up manually. |
| Pattern | What trends repeat? | The checkout service goes down during every flash sale. It has happened 4 times in the past 6 months. | Pre-scale before known sale events. |
| Structure | What causes the pattern? | The checkout service makes a synchronous call to the inventory service for every item in the cart. During flash sales, inventory checks spike 50x. The inventory service has a fixed connection pool of 100 and no autoscaling. | Add autoscaling to the inventory service; make inventory checks asynchronous; introduce a circuit breaker between checkout and inventory. |
| Mental Model | What assumption created the structure? | The original design assumed checkout volume would be roughly constant. The inventory service was sized for average load, not peak load. No one anticipated 50x traffic spikes because the product did not originally run flash sales. | Adopt the assumption that traffic is bursty and unpredictable; design all new services for 10x peak over average; run load tests that simulate sale conditions. |

Notice how each level produces a different quality of fix. The event-level fix stops the bleeding. The pattern-level fix anticipates the next occurrence. The structure-level fix eliminates the vulnerability. The mental model fix prevents the same class of mistake from being built into future systems.

Reactive vs. Proactive Engineering

The Iceberg Model clarifies the difference between reactive and proactive engineering. Work at the event and pattern levels is reactive, or at best anticipatory: it responds to failures that already exist. Work at the structure and mental model levels is proactive: it removes the conditions that produce failures in the first place.

Most organizations spend 80% of their engineering effort at the event level, fighting fires. High-performing organizations invest the time to work at the structure and mental model levels, so there are fewer fires to fight.

Applying the Iceberg to System Design

The Iceberg Model is not just a postmortem tool. It is a design tool. When you are designing a new system, you can work the iceberg in reverse:

  1. Start with mental models. What assumptions are we making about scale, usage, and failure? Are those assumptions well-founded?
  2. Design structures accordingly. Choose architectures that match your realistic assumptions, not your optimistic ones.
  3. Anticipate patterns. What usage patterns will this structure produce? What failure patterns?
  4. Plan for events. When (not if) something goes wrong, what does the incident response look like?

Working top-down through the iceberg during design prevents you from having to work bottom-up through the iceberg during incidents.

Key takeaway: Events are symptoms. Patterns are clues. Structures are causes. Mental models are origins. Effective system design and incident response require moving below the waterline. Fixing events is necessary but temporary. Fixing structures and mental models is difficult but lasting.

Assignment

Think of a recurring production issue you have experienced or read about. Walk it down the iceberg through all four levels:

  1. Event: What is the visible incident? Describe it specifically (service, error, time, duration).
  2. Pattern: How often does it recur? Under what conditions? What is the rhythm?
  3. Structure: What architectural or organizational arrangement causes the pattern? What component, configuration, or dependency is responsible?
  4. Mental Model: What assumption or belief led to that structural choice? Was it ever valid? When did it stop being valid?

For each level, describe what fix you would apply at that level and how durable that fix would be.