Course → Module 4: Reliability, Security & System Resilience

The Problem with "It Works"

A system can be running and still be failing. A request can succeed in 3 seconds when it should take 300 milliseconds. An error can occur once per thousand requests, invisible in aggregate metrics but catastrophic for the affected users. A memory leak can consume 1% more RAM per hour, unnoticed for days until the process crashes at 3 AM.

Without observability, you only know a system is broken when users tell you. By then, the damage is done. Observability is not about collecting data. It is about building the ability to ask new questions about your system's behavior without deploying new code.

Without observability, debugging a distributed system is archaeology.

The Three Pillars

Observability rests on three types of telemetry data. Each answers a different question. None is sufficient alone. Together, they provide the ability to understand why a system is behaving the way it is.

Metrics

Metrics are numeric measurements collected at regular intervals. CPU usage at 72%. Request rate at 1,200 per second. Error rate at 0.3%. P99 latency at 450ms. Metrics are cheap to store, fast to query, and excellent for dashboards and alerts.

Metrics answer the question: "Is something wrong?" They tell you that error rates spiked at 14:32, or that memory usage is trending upward. They do not tell you why. For that, you need logs and traces.
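To make the two headline metrics above concrete, here is a minimal sketch (not from the course; in production a metrics library computes these for you) of deriving an error rate and a nearest-rank P99 latency from raw request samples:

```python
# Illustrative sketch: computing error rate and P99 latency from raw samples.
# Function names and the nearest-rank percentile method are assumptions.
import math

def error_rate(statuses):
    """Fraction of requests whose HTTP status indicates a server error."""
    if not statuses:
        return 0.0
    errors = sum(1 for s in statuses if s >= 500)
    return errors / len(statuses)

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

statuses = [200] * 997 + [500] * 3   # 0.3% error rate, as in the text
latencies = list(range(1, 101))      # 1..100 ms
print(error_rate(statuses))          # 0.003
print(percentile(latencies, 99))     # 99
```

Real systems compute these continuously over sliding windows rather than over a fixed batch, but the arithmetic is the same.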

Logs

Logs are timestamped records of discrete events. A user logged in. A database query took 2.3 seconds. An API call returned a 500 error with a stack trace. Logs are detailed, high-volume, and essential for post-incident investigation.

Structured logging means emitting logs as key-value pairs or JSON objects rather than free-form strings. This matters for searchability. `{"level":"error","service":"payment","user_id":"u-1234","msg":"charge failed","reason":"card_declined"}` is searchable and filterable. `Error: charge failed for user u-1234, card was declined` requires regex and hope.
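A minimal sketch of emitting such entries using only the standard library (real services typically reach for structlog or python-json-logger; the field names here are illustrative, not a fixed schema):

```python
# Structured logging sketch: a Formatter that renders records as JSON.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Merge any structured context passed via the `extra` argument.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Emits one JSON object per line: searchable, filterable, machine-parseable.
logger.error("charge failed",
             extra={"context": {"service": "payment",
                                "user_id": "u-1234",
                                "reason": "card_declined"}})
```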

Traces

A trace follows a single request as it moves through multiple services. When a user clicks "Place Order," that request might touch the API gateway, the order service, the inventory service, the payment service, and the notification service. A trace connects all of those interactions into a single story with a shared trace ID.

Each service's contribution to the request is a span. Spans have a start time, duration, parent span, and metadata. The full trace is a tree of spans that shows exactly where time was spent and where errors occurred.
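The span tree described above can be modeled in a few lines. This is a hedged sketch of the structure only; a real tracing SDK (OpenTelemetry, for example) manages IDs, clocks, context propagation, and export:

```python
# Sketch of a span tree: each span has a duration and child spans.
# The timings mirror the order/payment example used later in this module.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start_ms: int
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span excluding time spent in child spans."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# The order flow: gateway -> order service -> payment service.
payment = Span("payment.charge", start_ms=120, duration_ms=220)
order = Span("order.create", start_ms=40, duration_ms=350, children=[payment])
gateway = Span("gateway.request", start_ms=0, duration_ms=400, children=[order])

print(gateway.self_time_ms())  # 50: gateway overhead outside the order call
print(order.self_time_ms())    # 130: order-service work outside payment
```

Self-time is what makes span trees useful: it separates "this service is slow" from "this service is waiting on a slow dependency."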

Pillar Comparison

| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| What it captures | Numeric measurements over time | Discrete events with context | Request flow across services |
| Question it answers | "Is something wrong?" | "What exactly happened?" | "Where did time go?" |
| Granularity | Aggregated (per-minute, per-second) | Per-event | Per-request |
| Volume | Low (time series points) | Very high | Medium (sampled) |
| Storage cost | Low | High | Medium |
| Best for | Dashboards, alerting, SLO tracking | Debugging, auditing, compliance | Latency analysis, dependency mapping |
| Common tools | Prometheus, CloudWatch, Datadog | ELK Stack, Loki, CloudWatch Logs | Jaeger, Zipkin, AWS X-Ray |

Distributed Tracing with Correlation IDs

In a monolith, a stack trace shows you the full execution path. In a distributed system, the execution path crosses process and network boundaries. Stack traces stop at each service's boundary. You need a way to stitch the story back together.

A correlation ID (or trace ID) is a unique identifier generated at the entry point of a request and propagated through every service call. Each service includes this ID in its logs and passes it to downstream services via HTTP headers (typically X-Request-ID or the W3C traceparent header).

```mermaid
sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    Note over U,PS: Trace ID: abc-123
    U->>GW: POST /orders
    Note right of GW: Generate trace ID: abc-123<br/>Span A starts
    GW->>OS: Create order<br/>Header: traceparent=abc-123
    Note right of OS: Span B starts (parent: A)
    OS->>PS: Charge card<br/>Header: traceparent=abc-123
    Note right of PS: Span C starts (parent: B)
    PS-->>OS: Payment confirmed
    Note right of PS: Span C ends (220ms)
    OS-->>GW: Order created
    Note right of OS: Span B ends (350ms)
    GW-->>U: 201 Created
    Note right of GW: Span A ends (400ms)
```

With this trace, you can see that the total request took 400ms, the order service spent 350ms, and 220ms of that was waiting for the payment service. If latency increases, you can pinpoint exactly which service is responsible.
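The propagation mechanics can be sketched in a few lines. The `X-Request-ID` header name comes from the text; the function names and the stubbed downstream call are hypothetical:

```python
# Sketch of correlation-ID handling at a service boundary:
# reuse the incoming ID if present, generate one at the entry point,
# and attach the same ID to every outbound call.
import uuid

def ensure_trace_id(headers):
    """Reuse an incoming trace ID, or generate one at the entry point."""
    trace_id = headers.get("X-Request-ID")
    if not trace_id:
        trace_id = str(uuid.uuid4())
    return trace_id

def outbound_headers(headers, trace_id):
    """Build headers for a downstream call, propagating the trace ID."""
    out = dict(headers)
    out["X-Request-ID"] = trace_id  # same ID flows to the next service
    return out

incoming = {}                        # entry point: no ID on the request yet
tid = ensure_trace_id(incoming)      # generate a fresh ID here
out = outbound_headers(incoming, tid)
print(out["X-Request-ID"] == tid)    # True: downstream sees the same ID
```

Every service in the chain applies the same two rules, which is what makes the logs from five different processes stitchable into one story.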

Log Volume Distribution

Not all log entries are equally important, but the volume distribution follows a predictable pattern. Most logs are informational. A small fraction represent actual problems. Understanding this distribution helps you design log storage, set retention policies, and configure alerts.

The implication: if you alert on every log entry, you drown in noise. If you only alert on CRITICAL, you miss the ERROR entries that are precursors to outages. Effective alerting uses metrics (error rate exceeding a threshold) rather than individual log entries. Logs are for investigation after the alert fires.
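The rule above, alert on the error rate over a window rather than on individual log lines, can be sketched as follows. The window size and 1% threshold are illustrative assumptions:

```python
# Sketch of a sliding-window error-rate alert: individual errors are
# recorded silently; the alert fires only when the rate over the last
# `window` requests exceeds the threshold.
from collections import deque

class ErrorRateAlert:
    def __init__(self, window=1000, threshold=0.01):
        self.outcomes = deque(maxlen=window)  # True = request errored
        self.threshold = threshold

    def record(self, errored):
        self.outcomes.append(errored)

    def firing(self):
        if not self.outcomes:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold

alert = ErrorRateAlert(window=1000, threshold=0.01)
for _ in range(995):
    alert.record(False)
for _ in range(5):
    alert.record(True)
print(alert.firing())   # False: 0.5% error rate, under the 1% threshold
for _ in range(10):
    alert.record(True)  # window slides: the oldest successes are evicted
print(alert.firing())   # True: 1.5% error rate now exceeds the threshold
```

Production systems express the same idea as a metrics query (a rate over a time window) rather than an in-process counter, but the shape of the rule is identical.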

The Observability Stack

A complete observability stack needs tools for each pillar, plus a way to correlate between them. The most common open-source combination:

  • Metrics: Prometheus for collection and storage, with Grafana for dashboards
  • Logs: the ELK Stack (Elasticsearch, Logstash, Kibana) or Grafana Loki
  • Traces: Jaeger or Zipkin

Commercial alternatives (Datadog, New Relic, Splunk) bundle all three pillars into a single platform. The trade-off is cost versus operational complexity. Running your own Prometheus, Grafana, and ELK cluster requires engineering effort. Paying for a managed platform requires budget.
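To make the metrics pillar tangible, here is what a Prometheus scrape target actually serves: the text exposition format. This is hand-rolled purely for illustration; in practice the official prometheus_client library generates it, and the metric name and labels here are assumptions:

```python
# Sketch of the Prometheus text exposition format for a single counter.
# A Prometheus server scrapes this text over HTTP at a regular interval.
def render_counter(name, help_text, value, labels=None):
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} counter\n"
            f"{name}{label_str} {value}\n")

print(render_counter("http_requests_total",
                     "Total HTTP requests served.",
                     1200,
                     labels={"service": "payment", "status": "500"}))
```

Seeing the wire format demystifies the "pull" model: a metrics endpoint is just a plain-text page of name/label/value lines.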

Structured Logging Best Practices

Every log entry should include, at minimum: timestamp, severity level, service name, trace ID (if available), and a structured message. Beyond that, include any context that will help during debugging.

Do not log sensitive data: passwords, tokens, credit card numbers, personally identifiable information. Log redaction is easier to implement at the point of emission than after the fact. Build it into your logging library, not into your log pipeline.
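Redaction at the point of emission can be as simple as a filter applied to every structured entry before it is serialized. The sensitive-key list here is an illustrative assumption; extend it to match your own schema:

```python
# Sketch of point-of-emission redaction for structured log entries.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

def redact(entry):
    """Return a copy of a structured log entry with sensitive values masked."""
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
            for k, v in entry.items()}

entry = {"msg": "login", "user_id": "u-1234", "password": "hunter2"}
print(redact(entry))  # password becomes [REDACTED]; other fields untouched
```

Because this runs inside the logging library, sensitive values never reach disk or the log pipeline in the first place.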

Use consistent field names across services. If one service logs user_id and another logs userId and a third logs uid, correlating across services becomes painful. Agree on a schema and enforce it.

Assignment

A user reports: "The app is slow." You have no observability in place. No metrics, no centralized logs, no traces. All you have is SSH access to the servers.

  1. Describe how you would debug this without observability. What commands would you run? What files would you check? How long would it take?
  2. Now design a minimum viable observability stack. Specify:
    • 3 metrics to collect (what, from where, alert threshold)
    • 2 structured log fields to add to every log entry beyond timestamp and message
    • 1 trace configuration: which request path to instrument first and why
  3. A developer argues that adding tracing will slow down the application. How do you respond? What is the typical performance overhead of distributed tracing with sampling?