Module 5: Distributed Systems & Consensus

The Observability Gap in Distributed Systems

In a monolith, debugging is straightforward. A request enters the application, moves through functions, and you can trace the entire flow in a single stack trace. In a microservices architecture, a single user request might touch 6, 10, or 20 services. Each service has its own logs. If the response is slow, which service caused the delay? If there is an error, where did it originate? Without distributed tracing, you are left searching through logs from a dozen services, trying to correlate timestamps manually.

A trace is a story. Each span is a chapter. Without the story, you are debugging in the dark. Distributed tracing connects the dots across service boundaries, giving you a single timeline for a request that touches many systems.

Traces, Spans, and Context Propagation

Distributed tracing relies on three concepts:

Trace. A trace represents the complete journey of a single request through the system. It has a unique trace ID (typically a 128-bit random value) that is propagated across every service the request touches. All work related to that request shares this trace ID.

Span. A span represents a single unit of work within a trace. Each service creates one or more spans. A span has a start time, duration, operation name, status, and optional metadata (tags/attributes). Spans have parent-child relationships, forming a tree. The root span is created by the first service. Each downstream call creates a child span.

Context propagation. When Service A calls Service B, the trace ID and the current span ID must travel with the request. This is context propagation. For HTTP calls, the trace context is passed in headers (the W3C traceparent header is the standard). For Kafka messages, it is embedded in message headers. Without context propagation, spans from different services cannot be linked into a single trace.

```mermaid
graph TB
    subgraph "Trace: abc-123"
        R[Root Span: API Gateway<br/>0ms - 450ms] --> S1[Span: Order Service<br/>10ms - 400ms]
        S1 --> S2[Span: Inventory Service<br/>20ms - 120ms]
        S1 --> S3[Span: Payment Service<br/>130ms - 380ms]
        S3 --> S4[Span: Fraud Check<br/>140ms - 250ms]
    end
    style R fill:#222221,stroke:#c8a882,color:#ede9e3
    style S1 fill:#222221,stroke:#6b8f71,color:#ede9e3
    style S2 fill:#191918,stroke:#c8a882,color:#ede9e3
    style S3 fill:#191918,stroke:#c47a5a,color:#ede9e3
    style S4 fill:#191918,stroke:#8a8478,color:#ede9e3
```

In this trace, the API Gateway receives the request (root span, 450ms total). It calls the Order Service (child span, 390ms). The Order Service calls Inventory (100ms) and then Payment (250ms). Payment calls Fraud Check (110ms). The waterfall view shows that Payment is the bottleneck: it accounts for more than half the total latency. Without tracing, you would only know the overall response was 450ms.

The Span Hierarchy as a Timeline

Tracing UIs display spans as a waterfall diagram. Each span is a horizontal bar. Nested spans are indented under their parent. The width of the bar represents duration. This immediately reveals where time is spent.

```mermaid
gantt
    title Request Trace: abc-123 (450ms total)
    dateFormat X
    axisFormat %Lms
    section API Gateway
    Root span :0, 450
    section Order Service
    Process order :10, 400
    section Inventory
    Check stock :20, 120
    section Payment
    Charge card :130, 380
    section Fraud Check
    Verify transaction :140, 250
```

This Gantt-style view shows the same trace as a timeline. The Inventory check completes quickly (100ms). Payment takes 250ms, of which the Fraud Check accounts for 110ms. If you needed to reduce overall latency, you would focus on the Payment/Fraud Check path.

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the industry-standard framework for distributed tracing, metrics, and logs. It was formed by merging two earlier projects: OpenTracing and OpenCensus. The value proposition is simple: instrument your code once with OpenTelemetry, then send the data to any compatible backend (Jaeger, Zipkin, AWS X-Ray, Datadog, Grafana Tempo).

OpenTelemetry provides:

• APIs and SDKs for all major languages, so instrumentation code stays vendor-neutral.
• Automatic instrumentation for common frameworks and libraries (HTTP servers, database clients, messaging).
• OTLP, a standard wire protocol for exporting traces, metrics, and logs.
• The OpenTelemetry Collector, a pipeline component that receives, processes (sampling, batching, filtering), and exports telemetry.
• Semantic conventions: standard attribute names (such as http.method and db.system) so tooling interprets spans consistently across services.

Tracing Tools Compared

| Tool | Origin | Storage | Strengths | Limitations |
|---|---|---|---|---|
| Jaeger | Uber, CNCF graduated | Cassandra, Elasticsearch, Kafka, Badger | Adaptive sampling, dependency graph, strong Kubernetes integration | Requires self-hosting and storage management |
| Zipkin | Twitter, open source | Cassandra, Elasticsearch, MySQL, in-memory | Lightweight, simple UI, broad language support | Fewer advanced features than Jaeger, simpler sampling |
| AWS X-Ray | Amazon Web Services | Managed (AWS) | Native AWS integration (Lambda, ECS, API Gateway), no infrastructure to manage | Vendor lock-in, limited customization, AWS-only |
| Grafana Tempo | Grafana Labs | Object storage (S3, GCS) | No indexing required, cost-effective at scale, Grafana dashboard integration | Search requires TraceQL or trace ID lookup, newer ecosystem |
| Datadog APM | Datadog (commercial) | Managed (Datadog) | Unified metrics, logs, and traces in one platform; powerful search and alerting | Expensive at scale; proprietary |

Sampling Strategies

In a system processing millions of requests per second, storing a trace for every request is impractical. Storage costs would be enormous and most traces are uninteresting (successful requests with normal latency). Sampling decides which traces to keep.

| Strategy | How It Works | Advantage | Disadvantage |
|---|---|---|---|
| Head-based (probabilistic) | Decision made at the start of the trace. Example: sample 1% of all requests. The decision propagates to all downstream services. | Simple, low overhead. All spans in a sampled trace are captured. | Interesting traces (errors, slow requests) are missed at the same rate as boring ones. |
| Tail-based | Decision made after the trace is complete. The collector buffers all spans and decides to keep or drop based on the full trace (e.g., keep all traces with errors or latency > 2s). | Captures all interesting traces. No important data is lost. | Requires buffering all spans until the trace completes. Higher memory and compute cost at the collector. |
| Rate-limiting | Cap at N traces per second, regardless of traffic volume. Useful for controlling costs. | Predictable storage costs. | Under-samples during traffic spikes. Over-samples during low traffic. |
| Adaptive (Jaeger) | Automatically adjusts sampling rates per operation based on traffic volume. High-traffic endpoints get lower rates, low-traffic endpoints get higher rates. | Balanced representation across all endpoints without manual tuning. | More complex configuration. Requires Jaeger's collector infrastructure. |

In practice, many organizations use a combination: head-based sampling at 1-5% for general visibility, plus tail-based sampling to always capture errors and high-latency traces. This balances cost against the need to debug production issues.
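Both halves of that combination fit in a few lines. A common head-based trick is to hash the trace ID rather than call a random generator: every service then reaches the same verdict independently, even if the sampled flag is dropped somewhere in propagation. The sketch below (illustrative, not any particular SDK's implementation) pairs that with a simple tail-based keep rule:

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based probabilistic sampling keyed on the trace ID.

    Hashing the (random) trace ID instead of calling random() makes the
    decision deterministic: every service that sees the same trace ID
    reaches the same verdict without coordination.
    """
    # Interpret the low 56 bits of the hex trace ID as a uniform value in [0, 1).
    bucket = int(trace_id[-14:], 16) / float(1 << 56)
    return bucket < rate

def tail_keep(spans: list[dict], latency_threshold_ms: int = 2000) -> bool:
    """Tail-based policy applied at the collector once the trace is complete:
    keep the trace if any span errored or any span exceeded the latency threshold."""
    return any(s["error"] for s in spans) or any(
        s["duration_ms"] > latency_threshold_ms for s in spans
    )
```

In the combined setup described above, `head_sample(trace_id, 0.05)` would gate the 5% baseline at the edge, while `tail_keep` runs at the collector to rescue every error and slow trace regardless of the head decision.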

Designing a Tracing Setup

A production tracing pipeline typically looks like this:

```mermaid
graph LR
    A[Service A] -->|traceparent header| B[Service B]
    B -->|traceparent header| C[Service C]
    C -->|traceparent header| D[Service D]
    A -->|OTLP| Coll[OTel Collector]
    B -->|OTLP| Coll
    C -->|OTLP| Coll
    D -->|OTLP| Coll
    Coll -->|sampling + export| Backend[Jaeger / Tempo / X-Ray]
    Backend --> UI[Tracing UI]
    style Coll fill:#222221,stroke:#c8a882,color:#ede9e3
    style Backend fill:#191918,stroke:#6b8f71,color:#ede9e3
    style UI fill:#191918,stroke:#c47a5a,color:#ede9e3
```

Each service instruments its code with the OpenTelemetry SDK. On every incoming request, the SDK extracts the trace context from headers (or creates a new trace if none exists). On every outgoing call, the SDK injects the trace context into headers. Spans are exported via OTLP (OpenTelemetry Protocol) to the OTel Collector. The collector applies sampling, batching, and filtering, then exports to the tracing backend.
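The collector stage of this pipeline might be configured roughly as follows. This is a sketch, not a complete production config: it uses the tail_sampling processor from the OpenTelemetry Collector contrib distribution, and the `jaeger:4317` endpoint and the specific thresholds are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Keep every trace that errored or was slow; sample 5% of the rest.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
  batch:

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
```

Note that `decision_wait` is the buffering window discussed under tail-based sampling: the collector holds spans for that long before judging the complete trace, which is where the extra memory cost comes from.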

What to Capture in Spans

Not all spans are equally useful. At minimum, capture:

• The service name and a low-cardinality operation name (e.g., the HTTP route, not the full URL).
• Start time, duration, and status, with the error type and message on failure.
• Key request attributes: HTTP method and status code; for database spans, the sanitized statement and the target database.
• Correlation identifiers where policy allows (order ID, tenant ID), so traces can be joined with logs and business events.

Avoid capturing sensitive data in span attributes: passwords, credit card numbers, PII. Sanitize database queries to remove parameter values. Set clear policies for what is and is not included in trace data.
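One way to enforce such a policy is a sanitization pass over span attributes before export. A minimal sketch, assuming a hypothetical deny-list of attribute keys and a crude regex-based scrubber for SQL literals (production systems would hook this into the exporter, e.g. as a collector processor or an SDK span processor):

```python
import re

# Hypothetical deny-list; a real policy would be broader and centrally managed.
REDACT_KEYS = {"password", "credit_card", "ssn", "authorization"}

def sanitize_attributes(attrs: dict[str, str]) -> dict[str, str]:
    """Mask the values of sensitive span attributes before export."""
    return {
        k: ("[REDACTED]" if k.lower() in REDACT_KEYS else v)
        for k, v in attrs.items()
    }

def sanitize_sql(query: str) -> str:
    """Replace literal parameter values in a SQL statement with placeholders."""
    query = re.sub(r"'[^']*'", "?", query)          # string literals
    query = re.sub(r"\b\d+(\.\d+)?\b", "?", query)  # numeric literals
    return query
```

For example, `sanitize_sql("SELECT * FROM users WHERE email = 'a@b.com'")` keeps the query shape (useful for spotting slow statements) while dropping the value that would identify a user.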

Systems Thinking Lens

Distributed tracing introduces a meta-feedback loop. Without tracing, teams lack visibility into cross-service behavior, so they make local optimizations that may not improve overall system performance. Tracing closes this loop by making the global behavior visible. A delay in Service D is now traceable to a slow database query, which is traceable to a missing index. The systems thinker recognizes tracing as a leverage point: it does not fix problems directly, but it makes problems visible, which accelerates every other improvement.

However, tracing itself is a system with its own feedback dynamics. More services generate more spans. More spans require more collector capacity and storage. Costs grow with adoption. Sampling is the balancing loop that keeps the system sustainable. Without it, the observability infrastructure becomes a scaling problem of its own.

Further Reading

Assignment

A user reports that a page load takes 2 seconds. The request touches 6 microservices: API Gateway, User Service, Product Service, Recommendation Service, Cart Service, and Pricing Service.

  1. Without tracing: Describe your debugging process. Which logs do you check first? How do you correlate events across 6 services? How long might this investigation take?
  2. Design a tracing setup:
    • What ID propagates across all 6 services? How is it passed (which header)?
    • Where are spans created? Name at least 8 spans you would expect in this trace.
    • Which sampling strategy would you use if the system handles 50,000 requests per second? Justify your choice.
  3. With tracing in place: You see that the Recommendation Service span takes 1.4 seconds out of the 2-second total. What are your next steps? What span attributes would help you narrow down the cause?