Observability at Scale
Session 9.6 · ~5 min read
Beyond the Three Pillars
Session 4.9 introduced observability through metrics, logs, and traces. That foundation is necessary but insufficient for systems operating at scale. When your platform processes millions of requests per second across hundreds of services, the volume of telemetry data itself becomes an engineering problem. You need instrumentation standards, intelligent sampling, formal reliability targets, and deliberate failure injection.
This session covers the operational machinery that makes observability work in production: OpenTelemetry as the instrumentation layer, sampling strategies that keep costs manageable, SLI/SLO/SLA frameworks that translate reliability into numbers, error budgets that quantify acceptable risk, and chaos engineering that validates your assumptions before production does it for you.
OpenTelemetry
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework maintained by the Cloud Native Computing Foundation. It provides APIs, SDKs, and tools for generating, collecting, and exporting telemetry data: traces, metrics, and logs. The key insight behind OTel is separation of concerns. Your application code should produce telemetry. A separate component, the Collector, should decide where that telemetry goes.
The OTel Collector sits between your applications and your observability backends (Datadog, Grafana, Jaeger, or any other tool). It receives telemetry, processes it (filtering, batching, enriching, sampling), and exports it to one or more destinations. This means you instrument your code once and change backends without touching application code.
```mermaid
flowchart LR
    A1[Service A<br>OTel SDK] --> C[OTel Collector<br>Gateway]
    A2[Service B<br>OTel SDK] --> C
    A3[Service C<br>OTel SDK] --> C
    C --> P1[Processor: Batch]
    P1 --> P2[Processor: Tail Sampling]
    P2 --> E1[Exporter: Jaeger]
    P2 --> E2[Exporter: Prometheus]
    P2 --> E3[Exporter: Loki]
```
In production, many teams deploy a two-layer Collector architecture. A local agent Collector runs as a sidecar or DaemonSet on each node, handling buffering and basic processing. A gateway Collector receives data from all agents and performs cross-service operations like tail sampling, which requires seeing all spans for a trace before making a decision.
Sampling Strategies
At scale, you cannot store every trace. A service handling 50,000 requests per second generates terabytes of trace data per day. Sampling reduces this volume while preserving the traces that matter most.
Head-based sampling makes the decision at the start of a trace. The root span decides whether to sample, and that decision propagates to all downstream services via trace context headers. It is simple and cheap. The downside: the decision happens before you know whether the trace is interesting. You might drop a trace that later encounters an error.
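OTel SDK samplers such as `TraceIdRatioBased` derive the head-sampling decision deterministically from the trace ID, so every service reaches the same answer without coordination. Here is a minimal Python sketch of that idea (illustrative only, not the actual OTel implementation):

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based sampling: hash the trace ID into [0, 1)
    and compare against the configured rate. Any service applying the
    same function to the same trace ID reaches the same decision, so
    the root's choice is consistent across the whole trace."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Interpret the first 8 bytes as an unsigned integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

In practice the decision is carried downstream in the trace context (the sampled flag of the W3C `traceparent` header) rather than recomputed, but the hash makes the decision reproducible if a service must decide independently.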
Tail-based sampling makes the decision after the trace is complete. The Collector holds all spans in memory for a configurable window (typically 30 seconds), then evaluates policies: keep all traces with errors, keep all traces slower than 2 seconds, sample 1% of everything else. This produces higher-quality data but requires significant memory and a load-balancing layer that routes all spans for the same trace to the same Collector instance.
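The policies described above map directly onto the `tail_sampling` processor that ships in the Collector contrib distribution. A sketch of the relevant configuration (field names follow the processor's documentation, but verify against your Collector version):

```yaml
processors:
  batch:
  tail_sampling:
    decision_wait: 30s          # hold spans this long before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
```

A trace matching any policy is kept, which is why error and slow traces survive even though the baseline keeps only 1%.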
Most production systems combine both. Head sampling at 10% reduces the firehose. Tail sampling on the remaining 10% ensures error traces and slow traces are always retained.
SLI, SLO, and SLA
Reliability needs a shared language between engineering and business. That language is built on three concepts.
| Concept | Definition | Who Sets It | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behavior | Engineering | Proportion of requests completing in <300ms |
| SLO (Service Level Objective) | A target value or range for an SLI | Engineering + Product | 99.9% of requests complete in <300ms over 30 days |
| SLA (Service Level Agreement) | A contractual commitment with consequences for missing it | Business + Legal | 99.9% availability; breach triggers 10% service credit |
The relationship flows upward. SLIs are raw measurements. SLOs are internal targets set against those measurements. SLAs are external promises, typically set slightly below SLOs so you have a buffer before contractual penalties apply. If your SLO is 99.95%, your SLA might promise 99.9%.
Error Budgets
An error budget is the inverse of your SLO. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of that period. Translated to time:
| SLO | Error Budget (%) | Downtime per Month (30.44-day avg) | Downtime per Year |
|---|---|---|---|
| 99% | 1% | 7 hours 18 min | 3.65 days |
| 99.9% | 0.1% | 43.8 min | 8.77 hours |
| 99.95% | 0.05% | 21.9 min | 4.38 hours |
| 99.99% | 0.01% | 4.38 min | 52.6 min |
| 99.999% | 0.001% | 26.3 sec | 5.26 min |
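The table's arithmetic is simply (1 − SLO) × window length. A small helper makes both conversions explicit:

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of full downtime an availability SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60

def failed_requests_allowed(slo: float, total_requests: int) -> int:
    """Failed requests tolerated before the SLO is breached,
    given the total request count for the window."""
    return int((1 - slo) * total_requests)
```

At 99.9% over a strict 30-day window the budget is 43.2 minutes (the table's 43.8 comes from using the 365.25/12-day average month). A service doing 10,000 requests per minute handles 432 million requests in 30 days, so the same SLO expressed in requests tolerates 432,000 failures.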
Error budgets create a shared decision framework. When budget remains, engineering can deploy risky features, run experiments, and refactor aggressively. When the budget is nearly exhausted, the team shifts to stability work: fixing flaky tests, improving rollback speed, adding circuit breakers. The budget makes this tradeoff explicit rather than political.
An error budget is not permission to fail. It is permission to take risks.
Error Budget Burn Rate
Consider a hypothetical 30-day window for a service with a 99.9% SLO and a budget of roughly 43 minutes. A deploy on day 8 causes 15 minutes of degradation. A brief incident on day 19 burns another 10 minutes. Combined with steady-state background errors, the team has consumed about 70% of the budget by day 22 and freezes risky deploys for the rest of the month.
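Burn rate, how fast the budget is being consumed relative to a pace that would exactly exhaust it at the window's end, is the standard quantity to alert on (Google's SRE Workbook popularized multiwindow burn-rate alerts). A minimal sketch:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Budget consumption speed relative to a steady burn that exactly
    exhausts the budget at the end of the window.
    1.0 = on pace; 10.0 = budget gone in a tenth of the window."""
    return observed_error_rate / (1 - slo)

def budget_fraction_consumed(rate: float, hours: float,
                             window_days: float = 30) -> float:
    """Fraction of the window's error budget consumed by sustaining
    a given burn rate for `hours`."""
    return rate * hours / (window_days * 24)
```

For a 99.9% SLO, a sustained 1% error rate is a burn rate of 10: the whole 30-day budget is gone in 3 days. The SRE Workbook's fast-burn example fires when burn rate exceeds 14.4 over one hour, i.e. 2% of the monthly budget consumed in an hour.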
Chaos Engineering
Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. The idea originated at Netflix in 2010 when engineers built Chaos Monkey, a tool that randomly terminates virtual machine instances in production during business hours. The reasoning: if a single instance failure can cause a customer-facing outage, the architecture is not resilient enough.
Netflix expanded Chaos Monkey into the Simian Army, a suite of tools that simulate different failure modes: Latency Monkey injects delays, Chaos Gorilla takes down an entire availability zone, and Chaos Kong simulates the loss of an entire AWS region. Commercial platforms such as Gremlin now offer controlled fault injection with safety controls, targeting specific services, hosts, or containers.
The process follows the scientific method. Define steady state (normal request rate, latency, error rate). Form a hypothesis ("if we kill one database replica, the system should failover within 5 seconds with no user-facing errors"). Run the experiment. Observe. If the hypothesis holds, confidence increases. If it fails, you found a weakness before your customers did.
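That loop can be sketched as a tiny harness (all names here are hypothetical; real tools add blast-radius limits, scheduling, and automatic aborts):

```python
def run_chaos_experiment(steady_state, inject_fault, rollback) -> bool:
    """Minimal chaos-experiment loop: verify steady state, inject the
    fault, re-check the hypothesis, and always roll back."""
    if not steady_state():
        raise RuntimeError("aborting: system unhealthy before the experiment")
    try:
        inject_fault()
        # Hypothesis: steady state still holds while the fault is active.
        return steady_state()
    finally:
        rollback()  # always undo the fault, pass or fail

# Toy system: three database replicas; steady state is a readable quorum.
replicas = {"db-1", "db-2", "db-3"}
survived = run_chaos_experiment(
    steady_state=lambda: len(replicas) >= 2,
    inject_fault=lambda: replicas.discard("db-2"),
    rollback=lambda: replicas.add("db-2"),
)
```

The `finally` clause is the important part: the fault is undone whether the hypothesis holds or not, which is the sketch's stand-in for a production tool's rollback safety net.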
ML-Based Anomaly Detection
Static thresholds break at scale. A CPU usage alert at 80% makes sense for a single server. It is meaningless for an auto-scaling group where instances spin up at 70% and the fleet average fluctuates between 40% and 85% depending on time of day. Machine learning models can learn normal patterns and flag deviations. Seasonal decomposition handles daily and weekly cycles. Isolation forests detect outliers in multidimensional metric space. LSTM networks predict expected values and alert when actuals diverge beyond a confidence interval.
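As a baseline for comparison, a rolling z-score detector captures the simplest form of "learn normal, flag deviations" (a sketch, far cruder than the seasonal-decomposition or LSTM approaches above, and blind to seasonality):

```python
from collections import deque
from statistics import mean, stdev

class ZScoreDetector:
    """Flags points more than `threshold` standard deviations from the
    mean of a sliding window of recent observations."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.history.append(value)  # anomalies still update the baseline
        return anomalous
```

Even this toy illustrates the alert-fatigue problem: with a 3-sigma threshold, roughly one in 370 normal points is flagged by chance, which at millions of metric points per minute is a constant stream of noise unless detections are gated on something like error-budget burn.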
The risk with ML-based alerting is alert fatigue. A model that flags every statistical anomaly will generate hundreds of alerts per day, most of them irrelevant. The best implementations combine ML detection with SLO correlation: an anomaly is only escalated if it is burning error budget faster than expected.
Further Reading
- OpenTelemetry, Observability Primer. Official introduction to OTel concepts, components, and architecture.
- OpenTelemetry, Sampling. Detailed guide to head-based and tail-based sampling strategies with configuration examples.
- Netflix, Chaos Monkey. The original chaos engineering tool documentation.
- Gremlin, The Origin of Chaos Monkey. History and evolution of chaos engineering from Netflix to industry-wide practice.
- SigNoz, SLO vs SLA: Understanding the Differences. Practical comparison with real-world examples of SLI, SLO, and SLA implementation.
Assignment
You are the SRE for a payment processing service. The service handles credit card charges, refunds, and balance inquiries.
- Define SLIs. Choose at least three SLIs for this service. For each, specify the metric, how it is measured, and why it matters. Example: "Availability: proportion of non-5xx responses out of total requests, measured at the load balancer."
- Set SLOs. For each SLI, set a 30-day SLO target. Justify why you chose that number. A payment service likely needs higher reliability than a recommendation engine.
- Calculate error budget. For a 99.9% availability SLO over 30 days, calculate the exact error budget in minutes. If the service processes 10,000 requests per minute, how many failed requests can it tolerate per month before breaching the SLO?
- Design one chaos experiment. Pick a failure mode (database replica failure, network partition to payment gateway, spike in request volume). Define the steady state, hypothesis, experiment procedure, and rollback plan.