Observability in System Design: Metrics, Logs, Traces & OpenTelemetry (Visualized)

Observability is the engineering discipline of instrumenting a system so that any internal question — including questions you have not yet thought to ask — can be answered solely from its external outputs: metrics, logs, and traces. The term is borrowed from control theory: a system is said to be observable if its complete internal state can be inferred from its outputs alone.

As architectures migrate from monoliths to dozens of microservices, a single user request may touch ten services, three databases, and two queues before returning a response. When something goes wrong, the failure signal (a slow response, a 500 error) tells you almost nothing about where the problem is. Observability provides the breadcrumbs — rich telemetry emitted by every component — that let you reconstruct exactly what happened and why.

Observability vs Monitoring: Known-Unknowns vs Unknown-Unknowns

Monitoring is the practice of collecting predefined metrics and alerting when they breach known thresholds. It answers known-unknown questions: "Is CPU above 80%?" or "Did the error rate exceed 1%?" Monitoring is essential, but its blind spot is every failure mode for which nobody defined a threshold in advance.

Observability addresses unknown-unknowns — failure modes you did not anticipate. A fully observable system lets an engineer sit in front of a dashboard, see a performance regression in a single customer's requests, drill down through correlated traces, inspect the exact SQL query that caused a lock wait, and correlate it with a deployment event — all without writing new instrumentation first. This exploratory debugging capability is the hallmark of observability.

Dimension	Monitoring	Observability
Core question	"Is the system healthy?"	"Why is the system behaving this way?"
Failure mode coverage	Known-unknowns (pre-defined alerts)	Unknown-unknowns (ad-hoc exploration)
Primary data	Aggregated metrics + dashboards	Metrics + structured logs + distributed traces
Debugging style	Check dashboards, react to alerts	Correlate signals, drill into individual events
Cardinality tolerance	Low (aggregated counters)	High (per-request, per-user dimensions)
Tooling examples	Nagios, Zabbix, CloudWatch alarms	Honeycomb, Datadog APM, Grafana + Tempo + Loki

The Three Pillars of Observability

Observability is built on three complementary telemetry types, often called the three pillars. Each pillar answers a different question, and together they give you a complete picture of any incident: metrics tell you something is wrong, logs tell you what happened, and traces tell you where in the system it happened and how long each step took. The power comes from linking all three to the same request context.

The Three Pillars Combining to Debug an Incident

A latency spike in metrics leads to log inspection, then trace drill-down, to pinpoint the exact slow database call.

Pillar 1 — Metrics: Aggregated Numbers Over Time

Metrics are numeric measurements aggregated over time — counters, gauges, and histograms. A counter might record the total number of HTTP requests served; a gauge records the current number of active database connections; a histogram records a distribution of request latencies so you can compute percentiles like p50, p95, and p99. Metrics are extremely cheap to store because they collapse millions of events into a handful of numbers per interval.
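The three metric types can be sketched in a few lines of Python. This is a minimal illustration, not the API of Prometheus or any real client library: the class names, the bucket bounds, and the sample latencies are all assumptions made for the example. The histogram reports the upper bound of the bucket that contains a quantile, which is roughly how bucketed histograms estimate percentiles.

```python
import bisect

class Counter:
    """Monotonically increasing total, e.g. HTTP requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

class Gauge:
    """Point-in-time value that can rise or fall, e.g. active DB connections."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations per bucket; bucket upper bounds are inclusive."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra slot at the end acts as the overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value):
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        if self.total == 0:
            return None
        target = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")

# Hypothetical bucket bounds and request latencies, both in milliseconds.
requests_total = Counter()
active_connections = Gauge()
latency_ms = Histogram([50, 100, 250, 500, 1000])

for ms in [12, 30, 45, 48, 60, 75, 90, 120, 300, 900]:
    requests_total.inc()
    latency_ms.observe(ms)
active_connections.set(7)

print(requests_total.value)                   # 10
print(latency_ms.quantile_upper_bound(0.50))  # 100
print(latency_ms.quantile_upper_bound(0.99))  # 1000
```

Ten requests collapse into six bucket counts, which is why storage is cheap, and also why the estimate cannot say which request was the 900 ms one.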

The key limitation of metrics is that aggregation loses detail. If your p99 latency spikes, the metric alone cannot tell you which users were affected, which endpoint was slow, or what dependency caused the slowdown. That is why metrics must be paired with the other pillars. Popular metrics stacks include Prometheus (scrape-based) and StatsD + Graphite (push-based). The standard query language for Prometheus metrics is PromQL.

Pillar 2 — Logs: Structured Records of Events

Logs are timestamped records of discrete events — one log line per meaningful event. Historically logs were unstructured text, which made them hard to query at scale. Modern observability practice requires structured logging: emitting each log as a JSON object with well-defined fields so you can filter, aggregate, and correlate by any field. A structured log entry for an HTTP request might include request_id, user_id, endpoint, status, latency_ms, and trace_id.
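As a rough sketch, structured logging can be built on Python's standard logging module with a custom formatter. The JsonFormatter class, the "orders" logger name, and every field value below are illustrative assumptions; only the field names come from the paragraph above.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    FIELDS = ("request_id", "user_id", "endpoint", "status", "latency_ms", "trace_id")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for name in self.FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "request completed",
    extra={
        "request_id": "req-7f3a",
        "user_id": 9821,
        "endpoint": "/orders",
        "status": 200,
        "latency_ms": 134,
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    },
)

parsed = json.loads(stream.getvalue())
print(parsed["endpoint"], parsed["status"], parsed["trace_id"])
```

Because every entry is a JSON object, a log backend can filter on status or user_id directly instead of pattern-matching free text.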

The trace_id field is the critical link: when logs carry the same trace ID as the distributed trace for that request, an engineer can jump directly from a suspicious log line into the full trace. Tools like Grafana Loki, Elasticsearch, and Datadog Logs index structured logs for fast field-level search. Log retention is expensive, so teams often keep raw logs for 7–30 days and push aggregates to cold storage for longer.

Pillar 3 — Distributed Tracing: Following a Request Across Services

Distributed tracing records the path a single request takes through every service in your system, producing a trace — a directed acyclic graph of spans. A span represents one unit of work: an HTTP handler, a database query, a cache lookup, or an outbound RPC call. Each span records its operation name, start time, duration, parent span ID, and a set of key-value attributes.

The root span represents the entire request. Every downstream call creates a child span that references the parent via the parent_span_id. All spans in a trace share a single globally-unique trace_id, which is propagated across service boundaries in HTTP headers (for example traceparent in the W3C Trace Context standard). The resulting flame-graph view in a tool like Jaeger or Grafana Tempo instantly shows exactly where time was spent and which service caused a slowdown.
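Context propagation can be sketched as follows. The Span class and the helper functions are hypothetical names for illustration, not the OpenTelemetry API; the header layout (version, 32-hex-character trace ID, 16-hex-character parent span ID, 2-hex-character flags, joined by hyphens) follows the W3C Trace Context format.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                  # 32 hex chars, shared by every span in the trace
    span_id: str                   # 16 hex chars, unique per span
    parent_span_id: Optional[str]  # None for the root span
    start: float = field(default_factory=time.time)
    attributes: dict = field(default_factory=dict)

def start_root_span(name):
    return Span(name, secrets.token_hex(16), secrets.token_hex(8), None)

def to_traceparent(span, sampled=True):
    """Serialize the span context for an outbound HTTP request."""
    flags = "01" if sampled else "00"
    return f"00-{span.trace_id}-{span.span_id}-{flags}"

def start_child_from_header(name, header):
    """Continue the caller's trace from an incoming traceparent header."""
    version, trace_id, parent_id, _flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"malformed traceparent: {header!r}")
    return Span(name, trace_id, secrets.token_hex(8), parent_id)

# API gateway: create the root span and attach the header to the outbound call.
root = start_root_span("GET /orders")
header = to_traceparent(root)

# Orders service: read the header and create a child span in the same trace.
child = start_child_from_header("SELECT orders", header)

print(header)
print(child.trace_id == root.trace_id, child.parent_span_id == root.span_id)
```

The downstream service never generates a new trace ID; it only mints its own span ID and records the caller's span ID as its parent, which is what lets a backend reassemble the tree.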

Distributed Trace: A Request Flowing Across Services with Parent and Child Spans

A single request travels through API Gateway → Orders Service → DB, emitting correlated spans and logs at each hop.

High-Cardinality Data: The Real Superpower of Observability

Cardinality refers to the number of unique values a dimension (label, tag, or attribute) can take. A metric labelled by http_method has low cardinality — maybe five values (GET, POST, PUT, DELETE, PATCH). A metric labelled by user_id has high cardinality — potentially millions of unique values. Traditional time-series databases like Prometheus store one time series per unique label combination, so high-cardinality dimensions cause a cardinality explosion that makes the database slow and expensive.
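The arithmetic behind the explosion is a simple product. In this sketch the label names other than http_method and user_id, and all of the counts, are made-up numbers chosen for illustration; the result is a worst-case upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())

# Hypothetical label sets for a single request-count metric.
low_cardinality = {"http_method": 5, "status_code": 6, "region": 4}
with_user_id = dict(low_cardinality, user_id=1_000_000)

print(max_series(low_cardinality))  # 120
print(max_series(with_user_id))     # 120000000
```

Adding one label multiplies the bound rather than adding to it, which is why a single user_id label is enough to overwhelm a series-per-combination store.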

This is why high-cardinality debugging relies on traces and structured logs rather than metrics. A trace can carry user_id, request_id, customer_plan, region, and feature_flag as span attributes without multiplying the number of stored series, because traces are stored as individual events rather than aggregated series. Tools like Honeycomb and Grafana Tempo are built for high-cardinality trace data, allowing you to ask "show me all requests from user_id=9821 that took over 500ms in the last hour" interactively, without shipping new instrumentation first.
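A toy version of that kind of query, run over an in-memory list of spans, might look like the sketch below. The span dictionaries, the find_spans helper, and all values are assumptions for illustration; real trace backends expose their own query languages and index the data very differently.

```python
# Hypothetical root spans, one per request, with high-cardinality attributes.
spans = [
    {"trace_id": "t-001", "duration_ms": 120,
     "attributes": {"user_id": 9821, "region": "eu-west", "customer_plan": "pro"}},
    {"trace_id": "t-002", "duration_ms": 870,
     "attributes": {"user_id": 9821, "region": "eu-west", "customer_plan": "pro"}},
    {"trace_id": "t-003", "duration_ms": 910,
     "attributes": {"user_id": 1204, "region": "us-east", "customer_plan": "free"}},
]

def find_spans(spans, min_duration_ms=0, **attrs):
    """Return spans slower than min_duration_ms whose attributes match attrs."""
    return [
        s for s in spans
        if s["duration_ms"] > min_duration_ms
        and all(s["attributes"].get(key) == value for key, value in attrs.items())
    ]

slow_for_user = find_spans(spans, min_duration_ms=500, user_id=9821)
print([s["trace_id"] for s in slow_for_user])  # ['t-002']
```

The filter works on any attribute without prior setup because each span is kept as an individual record; an aggregated metric would have discarded user_id before the question could be asked.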

Low Cardinality vs High Cardinality: Why It Matters for Debugging

Low-cardinality metrics average away the outlier. High-cardinality trace data reveals the single slow user — invisible in aggregates.

Exemplars: Linking Metrics to Traces

An exemplar is a sample data point attached to a metric that carries the trace_id of a specific request that contributed to it. When your p99 latency histogram spikes, clicking an exemplar point in Grafana jumps you directly to the distributed trace of a real request that landed in the slow bucket, with no manual log searching required. An exemplar is a sampled request, not necessarily the single worst one, but it is usually enough to show where the time went. Exemplars bridge the gap between the aggregate view (metrics) and the individual view (traces), and are supported by the OpenMetrics format, Prometheus, and Grafana.
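Conceptually, an exemplar is just a little extra state kept next to each histogram bucket. The sketch below is an in-memory illustration under stated assumptions: ExemplarHistogram, its methods, the bucket bounds, and the trace IDs are all hypothetical, and it keeps only the most recent exemplar per bucket, whereas real client libraries have their own retention rules.

```python
import bisect

class ExemplarHistogram:
    """Latency histogram that remembers one example trace_id per bucket."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)        # last slot is +Inf
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, value, trace_id=None):
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        if trace_id is not None:
            # Keep only the latest exemplar for this bucket.
            self.exemplars[i] = {"trace_id": trace_id, "value": value}

    def slowest_exemplar(self):
        """Exemplar from the highest bucket that holds one, or None."""
        for exemplar in reversed(self.exemplars):
            if exemplar is not None:
                return exemplar
        return None

latency_s = ExemplarHistogram([0.1, 0.5, 1.0, 5.0])  # bounds in seconds
latency_s.observe(0.08, trace_id="trace-fast-0001")
latency_s.observe(0.09)                               # untraced request
latency_s.observe(2.7, trace_id="trace-slow-0042")

hit = latency_s.slowest_exemplar()
print(hit["trace_id"], hit["value"])  # trace-slow-0042 2.7
```

The metric stays cheap because only one trace ID is stored per bucket, yet that single pointer is enough to move from the aggregate to a concrete request.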

OpenTelemetry: The Unified Standard

OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral, open-source standard for generating, collecting, and exporting all three pillars of telemetry. Before OTel, every observability vendor had its own SDK and wire format, locking you in. OTel defines a single API and SDK (in every major language), a standard wire protocol (OTLP — the OpenTelemetry Protocol), and the OTel Collector — a vendor-agnostic pipeline that can receive telemetry from your services and route it to any backend (Jaeger, Tempo, Prometheus, Datadog, etc.).

With OTel you instrument your service once using the standard SDK, and then change backends without touching application code. The OTel Collector can also enrich spans (for example by adding Kubernetes pod metadata), apply sampling to reduce storage costs, and fan out telemetry to multiple backends simultaneously. OTel has become the de facto industry standard, and most major cloud providers and observability vendors can ingest OTLP directly.

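# OTel Collector config: receive OTLP, export metrics to Prometheus and traces to Jaeger
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: '0.0.0.0:8889'   # scrape target exposed by the Collector
  # Recent Collector releases removed the dedicated jaeger exporter.
  # Jaeger accepts OTLP directly, so traces are sent over OTLP/gRPC.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true           # plaintext; for local testing only
  # The debug exporter replaces the deprecated logging exporter.
  debug:
    verbosity: normal

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [debug]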

Sampling Strategies: Managing Telemetry Volume

Recording every single trace in a high-traffic system is prohibitively expensive. Sampling decides which traces to keep. Head-based sampling makes the decision at the root span (before downstream services run), usually using a fixed percentage; it is simple but may discard rare errors. Tail-based sampling defers the decision until the entire trace is complete, so it can keep 100% of traces that contain errors, high latency, or other interesting signals and drop most of the rest. Tail-based sampling is better for debugging, but the collector must buffer every span of a trace until the decision is made, which costs memory and delays export (it does not slow down the request itself).

Strategy	Decision point	Keeps errors?	Cost	Typical use
Head-based (fixed %)	Root span, before execution	Only by chance	Very low	High-volume, low-error services
Head-based (rate limit)	Root span, per-service rate cap	Only by chance	Low	Bursty traffic control
Tail-based	After full trace collected	Always	Medium (buffering)	Latency/error debugging
Always-on	Every request	Always	High	Low-traffic dev/staging
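The two main strategies in the table can be sketched as a pair of decision functions. This is a simplified illustration rather than the algorithm of any particular SDK or collector: the function names, the span dictionary shape, and the thresholds are assumptions. The head-based function derives its decision from the trace ID so that every service reaches the same verdict for the same trace.

```python
def head_sample(trace_id, rate):
    """Head-based: decide up front from the trace ID alone.

    Maps the first 64 bits of the hex trace ID onto [0, 2**64) and keeps
    the trace if it falls below rate * 2**64.
    """
    return int(trace_id[:16], 16) < int(rate * 2**64)

def tail_sample(spans, latency_threshold_ms, baseline_rate):
    """Tail-based: decide after all spans of the trace have been buffered."""
    if any(span["error"] for span in spans):
        return True  # always keep traces containing an error
    if max(span["duration_ms"] for span in spans) > latency_threshold_ms:
        return True  # always keep slow traces
    # Otherwise keep only a small baseline of healthy traffic.
    return head_sample(spans[0]["trace_id"], baseline_rate)

# Hypothetical completed traces, each a list of span dictionaries.
failed = [{"trace_id": "f" * 32, "duration_ms": 40, "error": True}]
slow = [{"trace_id": "f" * 32, "duration_ms": 1800, "error": False}]
healthy = [{"trace_id": "f" * 32, "duration_ms": 40, "error": False}]

print(head_sample("0" * 32, 0.10))  # True
print(head_sample("f" * 32, 0.10))  # False
print(tail_sample(failed, 500, 0.0), tail_sample(slow, 500, 0.0),
      tail_sample(healthy, 500, 0.0))  # True True False
```

Note that head_sample would have dropped the failed trace at a 10% rate, while tail_sample keeps it; the price is that all of its spans had to be held in memory until the trace finished.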

Frequently Asked Questions

What is the difference between observability and monitoring?

Monitoring answers predefined questions about known failure modes: you set up dashboards and alerts ahead of time. Observability lets you explore and answer any question about your system's internal state after the fact, including questions you did not think to ask before an incident. Monitoring is a subset of observability: a fully observable system makes monitoring easier, but monitoring alone cannot make a system observable. In practice, good observability means shipping rich structured logs, distributed traces, and metrics that link to both, with high-cardinality dimensions carried on the logs and traces, so that an on-call engineer can debug any incident without deploying new code to gain visibility.

Do I need all three pillars, or can I start with just metrics?

Metrics alone are a fine starting point for a simple monolithic service — they are cheap, well-understood, and integrate easily with alerting. As you add services, metrics will tell you something is wrong but not where or why. Structured logs add context for diagnosing individual events. Distributed traces become essential the moment a single request spans more than one service. The pragmatic approach: start with metrics and structured logs on day one, add distributed tracing when you have more than two services communicating, and wire them together with a shared trace_id and OTel exemplars for instant cross-signal navigation.

Why does high cardinality break traditional metrics databases?

A time-series database like Prometheus stores one series per unique combination of label values. If you add a label user_id with one million unique users, and you have 10 existing label combinations, you can end up with as many as 10 million series, each requiring its own in-memory index entry and on-disk chunk. This is called cardinality explosion: memory usage grows linearly with the number of series, the number of series grows multiplicatively with each new label, and query performance degrades sharply. Prometheus offers guardrails, such as the per-scrape sample_limit and label_limit settings, which fail a scrape that exceeds the configured size, but the fundamental limitation remains. Use traces and structured logs for high-cardinality dimensions; keep metric labels to low-cardinality values like status_code, endpoint, and region.

Observability is not a tool you buy — it is a property you engineer into your system by shipping rich telemetry from day one. Metrics tell you something broke; traces tell you where; logs tell you why. Together they give your team the power to debug anything, not just the things you anticipated.
— alokknight Engineering