Distributed Tracing in System Design: Spans, Context Propagation & Latency Debugging (Visualized)

Distributed tracing is an observability technique that records the end-to-end journey of a single request as it propagates across multiple services, collecting timing and metadata at every hop so engineers can visualize exactly where latency is introduced or where errors occur.

In a monolith, a slow endpoint is easy to profile: you grep the logs and attach a debugger. In a microservices architecture, a single user-facing request might touch an API gateway, an auth service, a product catalog, a pricing engine, an inventory service, and a database before returning. Each of those services logs independently. Without tracing you end up correlating timestamps across six log streams and still have no clear picture. Tracing solves this by issuing a unique trace ID at the edge and threading it through every service so all the timing data from a single request can be assembled into one coherent timeline.

Traces and Spans: The Building Blocks

A trace represents the entire lifecycle of one request. It is made up of spans, which are individual units of work. Each span records a span_id, the trace_id it belongs to, the parent_span_id (empty for the root, so spans nest into a tree), a start timestamp, a duration, a service name, an operation name, and a bag of key-value attributes (also called tags). The root span is created by the first service that touches the request; every downstream call creates a child span. The result is a tree of spans that, when rendered as a waterfall chart, makes the critical path easy to see.
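As a rough illustration of how a backend could assemble spans into a tree, here is a minimal sketch using only the Python standard library. The Span class, the build_tree and render helpers, and the sample data are all hypothetical and are not part of any tracing SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: float
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Group spans by parent and return (root, children_by_parent_id)."""
    children = {}
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            children.setdefault(span.parent_span_id, []).append(span)
    # Order siblings by start time, as a waterfall view would.
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return root, children


def render(span, children, depth=0):
    """Return indented lines, one per span, in depth-first order."""
    lines = [f"{'  ' * depth}{span.service}:{span.operation} {span.duration_ms:g}ms"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical spans, listed out of order as a backend might receive them.
spans = [
    Span("c", "t1", "b", "database", "SELECT products", 25, 60),
    Span("a", "t1", None, "api-gateway", "GET /products", 0, 120),
    Span("b", "t1", "a", "product", "list_products", 20, 90),
    Span("d", "t1", "a", "auth", "verify_token", 2, 15),
]

root, children = build_tree(spans)
tree_lines = render(root, children)
print("\n".join(tree_lines))
```

The only link between spans is the parent_span_id, which is why arrival order does not matter: the tree can be rebuilt once all spans sharing a trace_id have been collected.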

Spans also carry events (timestamped log-like annotations inside a span — useful for marking retries or cache misses) and a status (OK, ERROR, or UNSET). An ERROR status with an attached exception event is what lights up red in Jaeger's UI and tells you exactly which service threw the exception and at what point in the call chain.
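A minimal sketch of events and status on a span, assuming a simplified in-memory model. The Event and Span classes, the add_event and record_exception methods, and the attribute names are hypothetical stand-ins chosen for illustration; they are not the OpenTelemetry API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    timestamp: float
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    service: str
    operation: str
    status: str = "UNSET"  # stays UNSET until code marks it OK or ERROR
    events: list = field(default_factory=list)

    def add_event(self, name, **attributes):
        """Attach a timestamped annotation to this span."""
        self.events.append(Event(name, time.time(), attributes))

    def record_exception(self, exc):
        """Store the exception as an event and flag the span as failed."""
        self.add_event("exception", type=type(exc).__name__, message=str(exc))
        self.status = "ERROR"


span = Span("pricing", "get_price")
span.add_event("cache_miss", key="sku-42")
try:
    raise TimeoutError("upstream timed out")
except TimeoutError as exc:
    span.record_exception(exc)

event_names = [e.name for e in span.events]
print(span.status, event_names)
```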

[Figure: A request building a trace tree of spans across services. The request travels from the API Gateway through the Auth, Product, and Database services, creating nested spans that form the full trace tree.]

Trace Context and Propagation

For a trace tree to be assembled after the fact, every service must carry the same trace_id and must know its parent's span_id. This is done through context propagation: when Service A calls Service B, it injects the trace context into the outbound request headers, and Service B extracts those headers before processing the request. The traceparent header defined by the W3C Trace Context specification has become the standard format: 00-{traceId}-{parentSpanId}-{flags}, where 00 is the version, traceId is 32 hex characters, parentSpanId is 16 hex characters, and flags is 2 hex characters whose lowest bit marks the trace as sampled. OpenTelemetry's SDKs and instrumentation libraries handle this injection and extraction automatically for most HTTP clients and servers, gRPC, Kafka, and other transports.
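To make the inject and extract steps concrete, here is a hand-rolled sketch of the traceparent format using plain dictionaries as stand-ins for HTTP headers. The inject and extract functions are hypothetical helpers written for this example; in practice an OpenTelemetry propagator does this work, and this sketch only handles the fixed-length layout described above.

```python
import re
import secrets

# version - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers, trace_id, span_id, sampled=True):
    """Write the caller's context into the outbound headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def extract(headers):
    """Return the incoming context, or None if the header is missing or invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_span_id, flags = match.groups()
    # Version ff and all-zero ids are treated as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Service A starts the trace and calls Service B.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_a = secrets.token_hex(8)     # 16 hex characters
request_to_b = {}
inject(request_to_b, trace_id, span_a)

# Service B continues the trace with its own span, then calls Service C.
ctx_b = extract(request_to_b)
span_b = secrets.token_hex(8)
request_to_c = {}
inject(request_to_c, ctx_b["trace_id"], span_b, ctx_b["sampled"])

ctx_c = extract(request_to_c)
print(request_to_c["traceparent"])
```

The trace_id stays constant across hops while the span id in the header changes at every hop: each service sends its own span id, which becomes the parent_span_id of whatever span the callee creates.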

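Baggage is a related concept: arbitrary key-value data that piggybacks on the trace context. You might attach user.tier=premium as baggage so every downstream service can read it and copy it onto its spans as an attribute, which is useful for filtering traces in your backend UI. In OpenTelemetry, baggage is not added to span attributes automatically; each service has to copy it explicitly or configure a processor that does. Keep baggage small; it is copied into every outbound call and can add measurable overhead if abused.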
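A small sketch of how baggage could travel in a header, loosely following the comma-separated key=value layout of the W3C Baggage header. The encode_baggage and decode_baggage helpers, the region key, and the http.route attribute are assumptions for illustration; a real propagator handles more edge cases. In OpenTelemetry, baggage entries are not span attributes by themselves, so the receiving side copies the value explicitly.

```python
from urllib.parse import quote, unquote


def encode_baggage(items):
    """Serialise key-value pairs into a baggage header value."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in items.items())


def decode_baggage(header_value):
    """Parse a baggage header value into a dict, skipping malformed members."""
    items = {}
    for member in header_value.split(","):
        key, sep, value = member.partition("=")
        if not sep or not key.strip():
            continue
        value = value.split(";", 1)[0]  # drop optional member properties
        items[key.strip()] = unquote(value.strip())
    return items


# Service A attaches baggage to the outbound request.
outbound_headers = {"baggage": encode_baggage({"user.tier": "premium", "region": "eu west"})}

# Service B reads it and copies the entry it cares about onto its own span.
baggage = decode_baggage(outbound_headers["baggage"])
span_attributes = {"http.route": "/products"}
span_attributes["user.tier"] = baggage["user.tier"]

print(outbound_headers["baggage"])
print(span_attributes)
```

Every entry is re-sent on every downstream call, so the header grows with each key; that is the overhead the paragraph above warns about.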

[Figure: Trace context propagating through HTTP headers between services. The traceparent header carries the trace ID and span ID from service to service, linking all spans into one trace.]

Finding the Bottleneck: Latency Waterfall Analysis

The canonical output of a trace viewer like Jaeger or Zipkin is the waterfall chart: each span is a horizontal bar positioned on a timeline. The width of the bar is the span's duration; the left edge is when work started. Child spans are indented under their parent. At a glance you can see total request duration, which spans run in parallel versus sequentially, and which span spends the most time in its own work rather than waiting on children: the latency bottleneck. The critical path is the chain of spans, from the root down, that the request was actually waiting on at each moment; shortening a span on the critical path improves end-to-end latency, while shortening a span that runs in parallel off the path does not.
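One way to approximate the critical path programmatically is to walk backwards from the end of each span, always following the child that finished last before the current cursor. The sketch below is a simplified heuristic under that assumption, not the algorithm of any particular trace viewer; the Span class and sample timings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float


def critical_path(span, children):
    """Return the spans the request was waiting on, walking back from the end."""
    path = [span]
    cursor = span.end_ms
    # Visit children from last-finishing to first-finishing.
    for child in sorted(children.get(span.span_id, []), key=lambda s: s.end_ms, reverse=True):
        # A child ending after the cursor overlapped work already on the path.
        if child.end_ms <= cursor:
            path.extend(critical_path(child, children))
            cursor = child.start_ms
    return path


spans = [
    Span("a", None, "api-gateway", 0, 120),
    Span("b", "a", "auth", 2, 17),
    Span("c", "a", "product", 20, 110),
    Span("d", "a", "inventory", 20, 60),  # runs in parallel with product
    Span("e", "c", "database", 25, 85),
]

children = {}
for s in spans:
    children.setdefault(s.parent_span_id, []).append(s)

root = children[None][0]
path_names = [s.name for s in critical_path(root, children)]
print(path_names)
```

In this made-up trace the inventory call is absent from the result because it finishes while the product call is still running; making inventory faster would not change the total duration.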

Common patterns that jump out in a waterfall: N+1 queries (dozens of tiny identical Database spans repeated in a loop), serial fan-out (three service calls that could be parallelised but are done one after another), and cold-start latency (the first span of a service is 10x longer than subsequent ones, betraying a connection pool warmup or a JIT compilation event). Tracing is the fastest way to find all three.
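The first of these patterns, N+1 queries, is simple enough to flag automatically. The following is a minimal sketch that counts identical child operations under the same parent; the function name, the dictionary layout of a span, and the threshold of 5 are arbitrary assumptions for illustration.

```python
from collections import Counter


def find_repeated_children(spans, threshold=5):
    """Return {(parent_span_id, service, operation): count} for suspicious repeats."""
    counts = Counter(
        (s["parent_span_id"], s["service"], s["operation"])
        for s in spans
        if s["parent_span_id"] is not None
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical trace: one pricing span that issues a query per cart item.
spans = [
    {"span_id": "a", "parent_span_id": None, "service": "api-gateway", "operation": "GET /cart"},
    {"span_id": "b", "parent_span_id": "a", "service": "pricing", "operation": "price_cart"},
]
spans += [
    {"span_id": f"q{i}", "parent_span_id": "b", "service": "database",
     "operation": "SELECT price WHERE sku = ?"}
    for i in range(12)
]

suspects = find_repeated_children(spans)
print(suspects)
```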

[Figure: Latency waterfall identifying the slow span. The highlighted slow span (amber) dominates the total request duration.]

Sampling: Controlling the Overhead

At high traffic volume, recording and storing every single span for every request becomes expensive — a 50 000 RPS service emitting ten spans per request generates half a million spans per second. Sampling is the practice of only collecting traces for a fraction of requests. The two main strategies are:

Head-based sampling: a decision is made at the entry point (the root span) before the request is processed, and the decision travels downstream in the trace context flags. If the coin flip says no, no spans for that trace are recorded or exported. It adds almost no overhead for discarded traces, but it is blind to outcomes: a slow request is just as likely to be dropped as a fast one.

Tail-based sampling: spans are buffered at a collector until the trace is complete, then a decision is made based on the outcome, for example keep 100% of error traces and all traces slower than 500 ms, but only 1% of normal ones. It costs more memory at the collector layer, and all spans of a trace must reach the same collector instance, but it captures exactly the traces you care about. OpenTelemetry supports both: head-based samplers are configured in the SDK, and the Collector offers probabilistic and tail-sampling processors.
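The two strategies can be sketched in a few lines. This is an illustrative model only: head_sample derives its verdict from the low 64 bits of the trace ID so that every service would reach the same decision, which is similar in spirit to ratio-based samplers but is not a copy of any SDK's implementation. The function names and the dictionary layout of a trace are assumptions.

```python
def head_sample(trace_id, ratio):
    """Decide from the trace ID alone, before any work has happened."""
    return int(trace_id[-16:], 16) < ratio * 2**64


def tail_sample(trace, slow_ms=500, baseline_ratio=0.01):
    """Decide after the trace is complete, using its outcome."""
    if any(span["status"] == "ERROR" for span in trace["spans"]):
        return True  # keep every failed trace
    if trace["duration_ms"] > slow_ms:
        return True  # keep every slow trace
    # Otherwise keep only a small baseline of normal traffic.
    return head_sample(trace["trace_id"], baseline_ratio)


# Volume arithmetic from the text: 50 000 RPS at ten spans per request.
spans_per_second = 50_000 * 10
sample_percent = 1
kept_per_second = spans_per_second * sample_percent // 100

low_id = "0" * 31 + "1"   # low 64 bits are tiny, so it falls inside the 1% band
high_id = "f" * 32        # low 64 bits are maximal, so it falls outside

failed = {"trace_id": high_id, "duration_ms": 40, "spans": [{"status": "ERROR"}]}
slow = {"trace_id": high_id, "duration_ms": 900, "spans": [{"status": "OK"}]}
normal = {"trace_id": high_id, "duration_ms": 40, "spans": [{"status": "OK"}]}

print(spans_per_second, kept_per_second)
print(tail_sample(failed), tail_sample(slow), tail_sample(normal))
```

With head sampling alone, the failed and slow traces above would have been dropped along with the normal one, since all three share an ID outside the 1% band.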

OpenTelemetry, Jaeger, and Zipkin

OpenTelemetry (OTel) is a CNCF project that has become the de facto standard for instrumenting applications to emit traces, metrics, and logs. It provides language SDKs (Go, Python, Java, Node.js, Rust, etc.), auto-instrumentation agents that instrument popular frameworks with zero code changes, and the OTel Collector, a vendor-neutral pipeline that receives, processes, and exports telemetry. By instrumenting once with OTel, you can route data to any compatible backend without changing application code.

Jaeger (open-source, CNCF graduated) and Zipkin (open-source, originally from Twitter) are trace storage and query backends. Both accept spans over HTTP or gRPC, store them in Cassandra, Elasticsearch, or in-memory, and provide a UI for searching traces and rendering waterfall charts. Jaeger is generally preferred for new deployments for its adaptive sampling support and Kubernetes-native operator. Other options include the open-source Grafana Tempo and managed commercial services such as Datadog APM, Honeycomb, and AWS X-Ray, which offer similar query UIs.

# OpenTelemetry Collector config — receives OTLP, exports to Jaeger
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Tail-based sampling: keep all errors + slow traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

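exporters:
  # Recent Collector releases no longer ship the dedicated jaeger exporter;
  # Jaeger ingests OTLP natively, so send OTLP to its gRPC port instead.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample first, then batch, so only the kept traces are batched.
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]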

Tracing vs Logging vs Metrics

Traces, logs, and metrics are the three pillars of observability — each answers a different question. Metrics (counters, gauges, histograms) tell you that something is wrong: p99 latency spiked. Logs tell you what happened on a single machine at a single moment. Traces tell you why it was slow across the entire request path. In a mature system you start with a metric alert, jump to the traces for that time window, identify the slow service, then drill into that service's logs for the specific error message. The three pillars are complementary, not interchangeable.

	Metrics	Logs	Traces
What it captures	Aggregated numbers over time	Discrete text events per process	End-to-end request timeline across services
Primary question	Is the system healthy right now?	What happened on this machine?	Why was this request slow?
Cardinality	Low (labels kept small)	High (free-form text)	Medium (one record per sampled request)
Storage cost	Very low	High	Medium
Best tool	Prometheus + Grafana	Loki, Elasticsearch, Splunk	Jaeger, Zipkin, Honeycomb, Tempo
Correlation key	Labels / time window	Timestamp + host	Trace ID (links to logs and metrics)

Frequently Asked Questions

What is the difference between a trace and a span?

A trace is the complete record of one request's journey from start to finish. It is identified by a globally unique trace_id and contains all the timing data from every service the request visited. A span is one unit of work within that trace: a single function call, a database query, or an outbound HTTP request. Spans are the nodes of the trace tree; the trace is the tree itself. Every span in a trace carries the same trace_id and, except for the root, a parent_span_id pointing to the span that triggered it.

Does distributed tracing add significant latency overhead?

With a well-tuned setup the overhead is negligible. The OTel SDK places finished spans in an in-memory queue inside the application process, and a background thread exports them asynchronously in batches, so the hot path is not blocked. The per-span cost is typically on the order of microseconds of CPU, though it varies by language and SDK. Header injection adds well under a hundred bytes per outbound request for traceparent alone (the header value is 55 characters), plus whatever tracestate and baggage you attach. The real cost is storage and processing at the collector, which is why sampling is essential at scale. At 1–5% head-based sampling, the application overhead is effectively zero.
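The asynchronous export pattern described above can be sketched with a bounded queue and a worker thread. This is a simplified model for illustration, not the SDK's actual processor: the BatchingExporter class, its parameters, and the drop-when-full policy are assumptions.

```python
import queue
import threading


class BatchingExporter:
    """Queue finished spans on the hot path; export them from a background thread."""

    def __init__(self, export, max_queue=2048, batch_size=512, interval_s=5.0):
        self._export = export  # callable that receives a list of spans
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_end(self, span):
        """Called by request threads; never blocks, drops if the queue is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        """Export up to one batch and return how many spans were sent."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._export(batch)
        return len(batch)

    def _run(self):
        # Wake up every interval until shutdown is requested.
        while not self._stop.wait(self._interval_s):
            self._drain()

    def shutdown(self):
        """Stop the worker and flush whatever is still queued."""
        self._stop.set()
        self._worker.join()
        while self._drain():
            pass


exported = []
exporter = BatchingExporter(exported.extend, interval_s=60)
for name in ["gateway", "auth", "database"]:
    exporter.on_end({"name": name})
exporter.shutdown()
exported_names = [span["name"] for span in exported]
print(exported_names)
```

The request thread only pays for a queue insertion; serialisation and network I/O happen on the worker. Dropping spans when the queue is full trades completeness for a bounded memory footprint.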

How do I correlate a trace with logs from the same request?

Inject the trace_id and span_id into every log line emitted during that request. With OTel this can be done automatically if you use a logging integration that reads from the OTel context (e.g., opentelemetry-instrumentation-logging in Python, or the OTel log bridge API). Once both the trace and the logs carry the same trace_id, tools like Grafana can deep-link from a span directly to the correlated log lines in Loki, with no manual timestamp hunting required.
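To show the mechanism without any OTel dependency, here is a hand-rolled sketch using the standard logging module and a context variable. The current_trace variable, the TraceContextFilter class, the logger name, and the IDs are hypothetical; an OTel logging integration would read the IDs from the active span instead.

```python
import contextvars
import io
import logging

# Holds the trace context of the request currently being handled.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace context onto every log record."""

    def filter(self, record):
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id", "-")
        record.span_id = ctx.get("span_id", "-")
        return True


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Set once per request, for example in middleware after extracting traceparent.
current_trace.set({
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
})
logger.info("price lookup failed")

log_line = stream.getvalue().strip()
print(log_line)
```

Because the IDs are attached by a filter, application code keeps calling logger.info as usual, and a log backend can index the trace_id field for lookups from a span.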

Metrics tell you something is wrong. Logs tell you what happened. Traces tell you why — and exactly which service in your fleet is to blame.
— alokknight Engineering