Distributed Tracing
Session 5.8 · ~5 min read
The Observability Gap in Distributed Systems
In a monolith, debugging is straightforward. A request enters the application, moves through functions, and you can trace the entire flow in a single stack trace. In a microservices architecture, a single user request might touch 6, 10, or 20 services. Each service has its own logs. If the response is slow, which service caused the delay? If there is an error, where did it originate? Without distributed tracing, you are left searching through logs from a dozen services, trying to correlate timestamps manually.
A trace is a story. Each span is a chapter. Without the story, you are debugging in the dark. Distributed tracing connects the dots across service boundaries, giving you a single timeline for a request that touches many systems.
Traces, Spans, and Context Propagation
Distributed tracing relies on three concepts:
Trace. A trace represents the complete journey of a single request through the system. It has a unique trace ID (typically a 128-bit random value) that is propagated across every service the request touches. All work related to that request shares this trace ID.
Span. A span represents a single unit of work within a trace. Each service creates one or more spans. A span has a start time, duration, operation name, status, and optional metadata (tags/attributes). Spans have parent-child relationships, forming a tree. The root span is created by the first service. Each downstream call creates a child span.
Context propagation. When Service A calls Service B, the trace ID and the current span ID must travel with the request. This is context propagation. For HTTP calls, the trace context is passed in headers (the W3C traceparent header is the standard). For Kafka messages, it is embedded in message headers. Without context propagation, spans from different services cannot be linked into a single trace.
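As a hedged sketch of how these concepts appear in application code, here is the OpenTelemetry Python API (its setup is shown later in this section); the service and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-service")

# A span started while another span is active becomes its child,
# building the parent-child tree described above.
with tracer.start_as_current_span("handle_order"):          # parent span
    with tracer.start_as_current_span("check_inventory"):   # child span
        pass

    # Context propagation: inject the current trace ID and span ID into the
    # outgoing request's headers before calling the next service.
    headers = {}
    inject(headers)
    # With the SDK configured, headers now carries a W3C traceparent value
    # (version-traceID-spanID-flags), e.g.:
    # {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}

# The downstream service extracts the context so its spans join the same trace.
ctx = extract(headers)
with tracer.start_as_current_span("reserve_stock", context=ctx):
    pass
```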
[Diagram: span tree for an example request — API Gateway root span (0–450 ms) → Order Service (10–400 ms); Order Service → Inventory Service (20–120 ms) and Payment Service (130–380 ms); Payment Service → Fraud Check (140–250 ms).]
In this trace, the API Gateway receives the request (root span, 450ms total). It calls the Order Service (child span, 390ms). The Order Service calls Inventory (100ms) and then Payment (250ms). Payment calls Fraud Check (110ms). The waterfall view shows that Payment is the bottleneck: it accounts for more than half the total latency. Without tracing, you would only know the overall response was 450ms.
The Span Hierarchy as a Timeline
Tracing UIs display spans as a waterfall diagram. Each span is a horizontal bar. Nested spans are indented under their parent. The width of the bar represents duration. This immediately reveals where time is spent.
Viewed as a Gantt-style timeline, the same trace shows the Inventory check completing quickly (100ms), while Payment takes 250ms, with the Fraud Check consuming most of that time. If you needed to reduce overall latency, you would focus on the Payment/Fraud Check path.
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the industry-standard framework for distributed tracing, metrics, and logs. It was formed by merging two earlier projects: OpenTracing and OpenCensus. The value proposition is simple: instrument your code once with OpenTelemetry, then send the data to any compatible backend (Jaeger, Zipkin, AWS X-Ray, Datadog, Grafana Tempo).
OpenTelemetry provides:
- SDKs for Java, Python, Go, JavaScript, .NET, and more. Auto-instrumentation libraries hook into popular frameworks (Spring Boot, Express, Django) and create spans automatically for HTTP calls, database queries, and message broker interactions.
- The OpenTelemetry Collector, a proxy that receives telemetry data, processes it (filtering, sampling, batching), and exports it to one or more backends. This decouples your application from the tracing backend.
- W3C Trace Context propagation out of the box, ensuring interoperability across services written in different languages.
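A minimal setup sketch using the Python SDK, assuming a collector listening on the default OTLP gRPC endpoint at `localhost:4317` and an illustrative service name:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The resource identifies this service; its attributes appear on every span.
provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)

# Batch spans and export them over OTLP to the collector, which applies
# sampling/filtering and forwards to the backend (Jaeger, Tempo, Datadog, ...).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
```

Switching backends is then a collector configuration change; the application code stays the same.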
Tracing Tools Compared
| Tool | Origin | Storage | Strengths | Limitations |
|---|---|---|---|---|
| Jaeger | Uber, CNCF graduated | Cassandra, Elasticsearch, Kafka, Badger | Adaptive sampling, dependency graph, strong Kubernetes integration | Requires self-hosting and storage management |
| Zipkin | Twitter, open source | Cassandra, Elasticsearch, MySQL, in-memory | Lightweight, simple UI, broad language support | Fewer advanced features than Jaeger, simpler sampling |
| AWS X-Ray | Amazon Web Services | Managed (AWS) | Native AWS integration (Lambda, ECS, API Gateway), no infrastructure to manage | Vendor lock-in, limited customization, AWS-only |
| Grafana Tempo | Grafana Labs | Object storage (S3, GCS) | No indexing required, cost-effective at scale, Grafana dashboard integration | Search requires TraceQL or trace ID lookup, newer ecosystem |
| Datadog APM | Datadog (commercial) | Managed (Datadog) | Unified metrics, logs, and traces in one platform; powerful search and alerting | Expensive at scale, proprietary |
Sampling Strategies
In a system processing millions of requests per second, storing a trace for every request is impractical. Storage costs would be enormous and most traces are uninteresting (successful requests with normal latency). Sampling decides which traces to keep.
| Strategy | How It Works | Advantage | Disadvantage |
|---|---|---|---|
| Head-based (probabilistic) | Decision made at the start of the trace. Example: sample 1% of all requests. The decision propagates to all downstream services. | Simple, low overhead. All spans in a sampled trace are captured. | Interesting traces (errors, slow requests) are missed at the same rate as boring ones. |
| Tail-based | Decision made after the trace is complete. The collector buffers all spans and decides to keep or drop based on the full trace (e.g., keep all traces with errors or latency > 2s). | Captures all interesting traces. No important data is lost. | Requires buffering all spans until the trace completes. Higher memory and compute cost at the collector. |
| Rate-limiting | Cap at N traces per second, regardless of traffic volume. Useful for controlling costs. | Predictable storage costs. | Under-samples during traffic spikes. Over-samples during low traffic. |
| Adaptive (Jaeger) | Automatically adjusts sampling rates per operation based on traffic volume. High-traffic endpoints get lower rates, low-traffic endpoints get higher rates. | Balanced representation across all endpoints without manual tuning. | More complex configuration. Requires Jaeger's collector infrastructure. |
In practice, many organizations use a combination: head-based sampling at 1-5% for general visibility, plus tail-based sampling to always capture errors and high-latency traces. This balances cost against the need to debug production issues.
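As a sketch of the head-based half of that combination (the tail-based half lives in the collector, not in application code), the Python SDK takes a sampler when the tracer provider is constructed; `ParentBased` makes downstream services honor the decision made at the root so sampled traces are complete:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep roughly 5% of new traces, decided once at the root
# span; every downstream service follows that decision via ParentBased.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.05)))
```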
Designing a Tracing Setup
A production tracing pipeline typically looks like this:
Each service instruments its code with the OpenTelemetry SDK. On every incoming request, the SDK extracts the trace context from headers (or creates a new trace if none exists). On every outgoing call, the SDK injects the trace context into headers. Spans are exported via OTLP (OpenTelemetry Protocol) to the OTel Collector. The collector applies sampling, batching, and filtering, then exports to the tracing backend.
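A hedged sketch of those steps in one place, with real OpenTelemetry calls around an illustrative handler; `call_payment_service` and the span names are hypothetical stand-ins for your HTTP client and routes (auto-instrumentation typically does this for you):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-service")

def call_payment_service(order_id: str, headers: dict) -> None:
    """Hypothetical downstream HTTP call; a real client would send `headers`."""

def handle_request(incoming_headers: dict, order_id: str) -> None:
    # 1. Extract the trace context from the incoming request; if no traceparent
    #    header is present, a new trace starts here.
    ctx = extract(incoming_headers)

    # 2. Create a server span for this service's work.
    with tracer.start_as_current_span(
        "POST /orders", context=ctx, kind=trace.SpanKind.SERVER
    ) as span:
        span.set_attribute("order.id", order_id)

        # 3. Inject the context into the outgoing call so the downstream
        #    service's spans join the same trace.
        outgoing_headers: dict = {}
        with tracer.start_as_current_span("payment-service", kind=trace.SpanKind.CLIENT):
            inject(outgoing_headers)
            call_payment_service(order_id, headers=outgoing_headers)
```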
What to Capture in Spans
Not all spans are equally useful. At minimum, capture:
- Operation name: `GET /api/orders/{id}`, `db.query`, `kafka.produce`.
- Duration: start and end timestamps.
- Status: OK or ERROR, with an error message if applicable.
- Attributes: `http.method`, `http.status_code`, `db.system`, `db.statement` (sanitized), `messaging.system`, `user.id` (if relevant).
- Events: log entries attached to the span (e.g., "retry attempt 2", "cache miss").
Avoid capturing sensitive data in span attributes: passwords, credit card numbers, PII. Sanitize database queries to remove parameter values. Set clear policies for what is and is not included in trace data.
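A sketch of recording these fields on a span, keeping `db.statement` parameter-free as recommended above (attribute names follow OpenTelemetry semantic conventions; the query and cache key are illustrative):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("db.query") as span:
    # Attributes: metadata that makes the span searchable. Keep parameter
    # values out of db.statement so no PII or secrets end up in trace storage.
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = ?")

    # Events: timestamped notes attached to the span's timeline.
    span.add_event("cache miss", {"cache.key": "orders:42"})

    try:
        pass  # execute the query here
    except Exception as exc:
        # Status and exception details mark the span as failed in the backend.
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR, str(exc)))
        raise
```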
Systems Thinking Lens
Distributed tracing introduces a meta-feedback loop. Without tracing, teams lack visibility into cross-service behavior, so they make local optimizations that may not improve overall system performance. Tracing closes this loop by making the global behavior visible. A delay in Service D is now traceable to a slow database query, which is traceable to a missing index. The systems thinker recognizes tracing as a leverage point: it does not fix problems directly, but it makes problems visible, which accelerates every other improvement.
However, tracing itself is a system with its own feedback dynamics. More services generate more spans. More spans require more collector capacity and storage. Costs grow with adoption. Sampling is the balancing loop that keeps the system sustainable. Without it, the observability infrastructure becomes a scaling problem of its own.
Further Reading
- OpenTelemetry, Traces (official documentation). Definitive reference for trace concepts, span structure, context propagation, and the OpenTelemetry data model.
- Jaeger, Architecture Overview (official documentation). How Jaeger collectors, agents, and storage backends work together.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022). Comprehensive treatment of tracing, metrics, and logs in production systems, with practical guidance on sampling and cost management.
- W3C, Trace Context Specification. The standard for the `traceparent` and `tracestate` HTTP headers used for context propagation.
- Ajit Singh, Distributed Tracing: Jaeger vs Tempo vs Zipkin. Side-by-side comparison of tracing backends with architecture diagrams and feature matrices.
Assignment
A user reports that a page load takes 2 seconds. The request touches 6 microservices: API Gateway, User Service, Product Service, Recommendation Service, Cart Service, and Pricing Service.
- Without tracing: Describe your debugging process. Which logs do you check first? How do you correlate events across 6 services? How long might this investigation take?
- Design a tracing setup:
  - What ID propagates across all 6 services? How is it passed (which header)?
  - Where are spans created? Name at least 8 spans you would expect in this trace.
  - Which sampling strategy would you use if the system handles 50,000 requests per second? Justify your choice.
- With tracing in place: You see that the Recommendation Service span takes 1.4 seconds out of the 2-second total. What are your next steps? What span attributes would help you narrow down the cause?