Distributed Tracing in System Design: Spans, Context Propagation & Latency Debugging (Visualized)
Distributed tracing follows a single request as it threads through dozens of microservices, stitching together a timeline that shows exactly where time was spent. This guide covers traces, spans, context propagation, sampling, OpenTelemetry, Jaeger, and Zipkin, with live animations of each concept.
Distributed tracing is an observability technique that records the end-to-end journey of a single request as it propagates across multiple services, collecting timing and metadata at every hop so engineers can visualize exactly where latency is introduced or where errors occur.
In a monolith, a slow endpoint is easy to profile: you grep the logs or attach a debugger. In a microservices architecture a single user-facing request might touch an API gateway, an auth service, a product catalog, a pricing engine, an inventory service, and a database before returning. Each of those components logs independently. Without tracing you end up correlating timestamps across half a dozen log streams and still have no clear picture. Tracing solves this by issuing a unique trace ID at the edge and threading it through every service, so all the timing data from a single request can be assembled into one coherent timeline.
Traces and Spans: The Building Blocks
A trace represents the entire lifecycle of one request. It is made up of spans, which are individual units of work. Each span records: a span_id, the trace_id it belongs to, the parent_span_id (so spans nest into a tree), a start timestamp, a duration, a service name, an operation name, and a bag of key-value attributes (also called tags). The root span is created by the first service that touches the request; every downstream call creates a child span. The result is a tree of spans that, when rendered as a waterfall chart, immediately shows the critical path.
Spans also carry events (timestamped, log-like annotations inside a span, useful for marking retries or cache misses) and a status (OK, ERROR, or UNSET). An ERROR status with an attached exception event is what lights up red in Jaeger's UI and tells you exactly which service threw the exception and at what point in the call chain.
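The span fields listed above can be modelled in a few lines. The following is a minimal sketch, not any tracer's real data model: the `Span` class, `build_tree`, `render`, and the sample services are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: float
    duration_ms: float
    status: str = "UNSET"          # OK, ERROR, or UNSET
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build_tree(spans):
    """Link each span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        parent = by_id.get(s.parent_span_id)
        if s.parent_span_id is None:
            root = s
        elif parent is not None:   # orphans (missing parent) are skipped
            parent.children.append(s)
    return root


def render(span, depth=0):
    """Return indented, waterfall-style lines with children in start order."""
    lines = [f"{'  ' * depth}{span.service}:{span.operation} "
             f"start={span.start_ms}ms dur={span.duration_ms}ms [{span.status}]"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans from one request, all sharing the same trace_id.
spans = [
    Span("t1", "a", None, "gateway", "GET /cart", 0, 120),
    Span("t1", "b", "a", "auth", "verify", 5, 20),
    Span("t1", "c", "a", "pricing", "quote", 30, 80, status="ERROR"),
    Span("t1", "d", "c", "db", "SELECT prices", 35, 60),
]
root = build_tree(spans)
print("\n".join(render(root)))
```

The parent_span_id is the only thing needed to rebuild the tree, which is why spans can arrive at the backend in any order and from any service.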
Trace Context and Propagation
For a trace tree to be assembled after the fact, every service must carry the same trace_id and must know its parent's span_id. This is done through context propagation: when Service A calls Service B, it injects the trace context into the outbound request headers, and Service B extracts those headers before processing the request. The W3C Trace Context traceparent header has become the standard format: {version}-{traceId}-{parentSpanId}-{flags}, where the version is currently 00. OpenTelemetry's SDK handles this injection and extraction automatically for most HTTP clients and servers, gRPC, Kafka, and other transports.
Baggage is a related concept: arbitrary key-value data that piggybacks on the trace context. You might attach user.tier=premium as baggage so every service handling the request can read it and copy it onto its own spans, which is useful for filtering traces in your backend UI. Baggage is not added to span attributes on its own; something in each service has to read it and set the attribute. Keep baggage small; it is copied into every outbound call and can add measurable overhead if abused.
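To make the inject and extract steps concrete, here is a minimal sketch of formatting and parsing a traceparent header by hand. The helper names (`inject`, `extract`, `new_trace_id`, `new_span_id`) are hypothetical, and the parser only handles version 00; in practice the OpenTelemetry SDK does this for you.

```python
import re
import secrets

# version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_trace_id():
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current span's identity into outbound headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: read the caller's context, or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", "").strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Service A injects; Service B extracts and starts a child span.
outbound = inject({}, new_trace_id(), new_span_id())
ctx = extract(outbound)
child_span = {
    "trace_id": ctx["trace_id"],              # same trace
    "parent_span_id": ctx["parent_span_id"],  # points back at Service A's span
    "span_id": new_span_id(),                 # fresh id for Service B's work
}
```

Real HTTP header names are case-insensitive; this sketch assumes the dictionary key is already lowercase.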
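As a rough illustration of how baggage travels, the sketch below encodes key-value pairs into a single comma-separated, percent-encoded header value in the style of the W3C baggage header, and decodes it on the receiving side. The function names and the 512-byte budget are assumptions made for this example, not values taken from any SDK.

```python
from urllib.parse import quote, unquote

MAX_BAGGAGE_BYTES = 512  # hypothetical budget to keep per-call overhead small


def encode_baggage(items):
    """Serialize a dict into a 'k1=v1,k2=v2' header value."""
    header = ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in items.items()
    )
    if len(header.encode("utf-8")) > MAX_BAGGAGE_BYTES:
        raise ValueError("baggage too large; it is copied into every outbound call")
    return header


def decode_baggage(header):
    """Parse a baggage header value back into a dict."""
    items = {}
    for member in header.split(","):
        member = member.strip()
        if "=" not in member:
            continue
        key, _, rest = member.partition("=")
        value = rest.split(";", 1)[0]  # drop optional ';property' metadata
        items[key.strip()] = unquote(value.strip())
    return items


# Service A attaches baggage; Service B reads it and tags its own span.
header_value = encode_baggage({"user.tier": "premium", "region": "eu west"})
received = decode_baggage(header_value)
span_attributes = {"user.tier": received["user.tier"]}  # explicit copy onto the span
```

The size check is the part worth keeping in any real implementation: because the header rides on every hop, a few kilobytes of baggage is multiplied by the fan-out of the request.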
Finding the Bottleneck: Latency Waterfall Analysis
The canonical output of a trace viewer like Jaeger or Zipkin is the waterfall chart: each span is a horizontal bar, positioned on a timeline. The width of the bar is the span's duration; the left edge is when work started. Child spans are indented under their parent. At a glance you can see: total request duration, which spans run in parallel versus sequentially, and which single span is the longest, which is usually the first place to look for the latency bottleneck. The critical path is the chain of spans, from root to leaf, that the request actually had to wait on; shortening any span on the critical path directly improves end-to-end latency, while shortening a span that runs in parallel off the path does not.
Common patterns that jump out in a waterfall: N+1 queries (dozens of tiny identical database spans repeated in a loop), serial fan-out (three service calls that could be parallelised but are done one after another), and cold-start latency (the first span of a service is 10x longer than subsequent ones, betraying a connection pool warmup or a JIT compilation event). Tracing is the fastest way to find all three.
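One simple way to approximate the critical path from raw span data is to start at the root and, at each level, follow the child that finishes last, since that is the call the parent was still waiting on. This is a heuristic sketch, not the algorithm used by Jaeger or Zipkin, and the span dictionaries and field names are hypothetical.

```python
def critical_path(spans):
    """Follow, at each level, the child span that ends last."""
    children = {}
    root = None
    for span in spans:
        if span["parent"] is None:
            root = span
        else:
            children.setdefault(span["parent"], []).append(span)

    path = [root]
    node = root
    while children.get(node["id"]):
        # The child with the latest end time is what held the parent open.
        node = max(children[node["id"]], key=lambda c: c["start"] + c["duration"])
        path.append(node)
    return path


# Hypothetical trace: catalog and pricing run in parallel under the gateway.
spans = [
    {"id": "1", "parent": None, "name": "gateway", "start": 0,  "duration": 120},
    {"id": "2", "parent": "1",  "name": "auth",    "start": 5,  "duration": 20},
    {"id": "3", "parent": "1",  "name": "catalog", "start": 30, "duration": 30},
    {"id": "4", "parent": "1",  "name": "pricing", "start": 30, "duration": 80},
    {"id": "5", "parent": "4",  "name": "db",      "start": 35, "duration": 60},
]
path_names = [s["name"] for s in critical_path(spans)]
```

In this example the catalog call finishes well before pricing, so speeding it up would not change the total; the pricing call and its database query are what the gateway span is waiting on.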
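The N+1 pattern in particular is easy to detect mechanically. The sketch below counts sibling spans that share a parent, service, and operation name and flags any group above a threshold; the field names and the default threshold of 10 are assumptions for illustration.

```python
from collections import Counter


def find_n_plus_one(spans, threshold=10):
    """Return {(parent, service, operation): count} for suspiciously repeated calls."""
    counts = Counter(
        (s["parent"], s["service"], s["operation"])
        for s in spans
        if s["parent"] is not None
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical trace: one catalog request issuing 25 identical queries in a loop.
spans = [{"id": "root", "parent": None, "service": "catalog", "operation": "GET /products"}]
spans += [
    {"id": f"q{i}", "parent": "root", "service": "db", "operation": "SELECT product"}
    for i in range(25)
]
spans.append({"id": "c1", "parent": "root", "service": "cache", "operation": "GET"})

suspects = find_n_plus_one(spans)
```

A check like this can run over sampled traces in a batch job, so the pattern is caught before someone has to spot it by eye in the waterfall.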
Sampling: Controlling the Overhead
At high traffic volume, recording and storing every single span for every request becomes expensive: a 50,000 RPS service emitting ten spans per request generates half a million spans per second. Sampling is the practice of only collecting traces for a fraction of requests. The two main strategies are:
Head-based sampling: a decision is made at the entry point (the root span) before the request is processed. If the coin flip says no, all spans for that trace are dropped immediately. It adds almost no overhead for discarded traces, but it is blind to outcomes: a slow request is just as likely to be dropped as a fast one.
Tail-based sampling: spans are buffered at a collector until the trace is complete, then a decision is made based on the outcome. For example, keep 100% of error traces and all traces slower than 500 ms, but only 1% of normal ones. It costs more memory at the collector layer but captures exactly the traces you care about. OpenTelemetry's Collector supports both modes.
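The difference between the two strategies shows up clearly in code. The following is a minimal sketch under stated assumptions: `head_sample` derives its decision from the trace ID so every service would reach the same answer, and `tail_sample` looks at the finished trace. Both functions, their parameters, and the span fields are hypothetical, not the OpenTelemetry implementations.

```python
import random


def head_sample(trace_id, rate):
    """Decide up front, using only the trace id (no knowledge of the outcome)."""
    # Map the last 8 hex characters of the id to a number in [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


def tail_sample(spans, slow_ms=500, baseline_rate=0.01, rng=random.random):
    """Decide after the trace is complete, based on what actually happened."""
    if any(s["status"] == "ERROR" for s in spans):
        return True                                   # keep every error trace
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > slow_ms:
        return True                                   # keep every slow trace
    return rng() < baseline_rate                      # keep a small share of the rest


# Hypothetical completed traces.
error_trace = [{"status": "ERROR", "start_ms": 0, "duration_ms": 40}]
slow_trace = [{"status": "OK", "start_ms": 0, "duration_ms": 900}]
normal_trace = [{"status": "OK", "start_ms": 0, "duration_ms": 40}]
```

Note that `tail_sample` needs the whole trace in memory before it can answer, which is exactly the buffering cost described above, while `head_sample` can run before a single span has been recorded.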
OpenTelemetry, Jaeger, and Zipkin
OpenTelemetry (OTel) is the CNCF project that has become the de facto standard for instrumenting applications to emit traces, metrics, and logs. It provides language SDKs (Go, Python, Java, Node.js, Rust, etc.), auto-instrumentation agents that instrument popular frameworks with zero code changes, and the OTel Collector, a vendor-neutral pipeline that receives, processes, and exports telemetry. By instrumenting once with OTel, you can route data to any backend without changing application code.
Jaeger (open-source, CNCF graduated) and Zipkin (open-source, originally from Twitter) are trace storage and query backends. Both accept spans over HTTP or gRPC, store them in Cassandra, Elasticsearch, or in-memory, and provide a UI for searching traces and rendering waterfall charts. Jaeger is generally preferred for new deployments for its adaptive sampling support and Kubernetes-native operator. Other options include commercial services (Datadog APM, Honeycomb, AWS X-Ray) and the open-source Grafana Tempo, which is also offered as a managed service; they provide similar query UIs, often with managed infrastructure.
```yaml
# OpenTelemetry Collector config: receives OTLP, exports to Jaeger
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Tail-based sampling: keep all errors + slow traces, plus 5% of the rest
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  # Jaeger ingests OTLP natively, and recent Collector releases no longer
  # ship the dedicated jaeger exporter, so an OTLP exporter is used here.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample first, then batch whatever survives.
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
```
Tracing vs Logging vs Metrics
Traces, logs, and metrics are the three pillars of observability, and each answers a different question. Metrics (counters, gauges, histograms) tell you that something is wrong: p99 latency spiked. Logs tell you what happened on a single machine at a single moment. Traces tell you why it was slow across the entire request path. In a mature system you start with a metric alert, jump to the traces for that time window, identify the slow service, then drill into that service's logs for the specific error message. The three pillars are complementary, not interchangeable.
| | Metrics | Logs | Traces |
|---|---|---|---|
| What it captures | Aggregated numbers over time | Discrete text events per process | End-to-end request timeline across services |
| Primary question | Is the system healthy right now? | What happened on this machine? | Why was this request slow? |
| Cardinality | Low (labels kept small) | High (free-form text) | Medium (one record per sampled request) |
| Storage cost | Very low | High | Medium |
| Best tool | Prometheus + Grafana | Loki, Elasticsearch, Splunk | Jaeger, Zipkin, Honeycomb, Tempo |
| Correlation key | Labels / time window | Timestamp + host | Trace ID (links to logs and metrics) |
Frequently Asked Questions
What is the difference between a trace and a span?
A trace is the complete record of one request's journey from start to finish. It is identified by a globally unique trace_id and contains all the timing data from every service the request visited. A span is one unit of work within that trace: a single function call, a database query, or an outbound HTTP request. Spans are the nodes of the trace tree; the trace is the tree itself. Every span carries the same trace_id and a parent_span_id pointing to the span that triggered it.
Does distributed tracing add significant latency overhead?
With a well-tuned setup the overhead is negligible. The OTel SDK records finished spans in an in-memory buffer in the application process, and a background thread exports them asynchronously, so the hot path is not blocked. The per-span cost is typically on the order of microseconds of CPU. Header injection adds well under a hundred bytes per outbound request for traceparent alone, more if you carry baggage. The real cost is storage and processing at the collector, which is why sampling is essential at scale. At 1 to 5% head-based sampling, the application overhead is effectively zero.
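The "buffer plus background thread" design is what keeps tracing off the hot path, and it can be sketched with the standard library alone. This is an illustrative model, not the OTel SDK's batch processor: the `BatchExporter` class, its parameters, and the `export_fn` callback are hypothetical, and `queue.Queue` uses a lock internally rather than being lock-free.

```python
import queue
import threading


class BatchExporter:
    """Buffer finished spans in memory and export them from a background thread."""

    def __init__(self, export_fn, max_queue=2048, batch_size=512, interval_s=0.05):
        self._queue = queue.Queue(maxsize=max_queue)
        self._export_fn = export_fn    # e.g. a function that ships a batch to a collector
        self._batch_size = batch_size
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_span_end(self, span):
        """Called on the request path: never blocks, drops the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        """Export up to one batch; return how many spans were sent."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._export_fn(batch)
        return len(batch)

    def _run(self):
        # Wake up every interval until shutdown is requested.
        while not self._stop.wait(self._interval_s):
            self._drain()

    def shutdown(self):
        """Stop the worker and flush whatever is still buffered."""
        self._stop.set()
        self._worker.join()
        while self._drain():
            pass


exported = []
exporter = BatchExporter(exported.extend)
for i in range(1000):
    exporter.on_span_end({"span_id": i})
exporter.shutdown()
```

The design choice worth noting is that a full buffer drops spans instead of blocking the request: losing some telemetry under load is preferred over adding latency to user traffic.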
How do I correlate a trace with logs from the same request?
Inject the trace_id and span_id into every log line emitted during that request. With OTel this is done automatically if you use a structured logging library that reads from the OTel context (e.g., opentelemetry-instrumentation-logging in Python, or the OTel log bridge API). Once both the trace and the logs carry the same trace_id, tools like Grafana can deep-link from a span directly to the correlated log lines in Loki, with no manual timestamp hunting required.
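For a sense of what that automatic injection does under the hood, here is a minimal standard-library sketch. It keeps the active trace context in a context variable and uses a logging filter to stamp it onto every record. The names `current_trace`, `TraceContextFilter`, `make_logger`, and the "checkout" logger are hypothetical; a real setup would read the IDs from the OTel context instead.

```python
import contextvars
import io
import logging

# Holds (trace_id, span_id) for whatever request is currently being handled.
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))


class TraceContextFilter(logging.Filter):
    """Attach the active trace_id and span_id to every log record."""

    def filter(self, record):
        record.trace_id, record.span_id = current_trace.get()
        return True


def make_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger("checkout")  # hypothetical service logger
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)

# While handling a request, set the context once; every log line picks it up.
token = current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
log.info("payment authorized")
current_trace.reset(token)

log.info("idle")  # outside any request: falls back to the "-" placeholders
```

Because the IDs live in a context variable rather than being passed as arguments, application code does not change; only the logging configuration does.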
Metrics tell you something is wrong. Logs tell you what happened. Traces tell you why, and exactly which service in your fleet is to blame.
- alokknight Engineering
