Observability in System Design: Metrics, Logs, Traces & OpenTelemetry (Visualized)
Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike monitoring, which answers known questions, observability lets you ask questions you have never thought to ask before, a critical property for debugging distributed systems at scale.
Observability is the engineering discipline of instrumenting a system so that any internal question, including questions you have not yet thought to ask, can be answered solely from its external outputs: metrics, logs, and traces. The term is borrowed from control theory: a system is said to be observable if its complete internal state can be inferred from its outputs alone.
As architectures migrate from monoliths to dozens of microservices, a single user request may touch ten services, three databases, and two queues before returning a response. When something goes wrong, the failure signal (a slow response, a 500 error) tells you almost nothing about where the problem is. Observability provides the breadcrumbs (rich telemetry emitted by every component) that let you reconstruct exactly what happened and why.
Observability vs Monitoring: Known-Unknowns vs Unknown-Unknowns
Monitoring is the practice of collecting predefined metrics and alerting when they breach known thresholds. It answers known-unknown questions: "Is CPU above 80%?" or "Did the error rate exceed 1%?" Monitoring is essential, but its blind spot is the vast space of things that can go wrong that you have not pre-defined a threshold for.
Observability addresses unknown-unknowns: failure modes you did not anticipate. A fully observable system lets an engineer sit in front of a dashboard, see a performance regression in a single customer's requests, drill down through correlated traces, inspect the exact SQL query that caused a lock wait, and correlate it with a deployment event, all without writing new instrumentation first. This exploratory debugging capability is the hallmark of observability.
| Dimension | Monitoring | Observability |
|---|---|---|
| Core question | "Is the system healthy?" | "Why is the system behaving this way?" |
| Failure mode coverage | Known-unknowns (pre-defined alerts) | Unknown-unknowns (ad-hoc exploration) |
| Primary data | Aggregated metrics + dashboards | Metrics + structured logs + distributed traces |
| Debugging style | Check dashboards, react to alerts | Correlate signals, drill into individual events |
| Cardinality tolerance | Low (aggregated counters) | High (per-request, per-user dimensions) |
| Tooling examples | Nagios, Zabbix, CloudWatch alarms | Honeycomb, Datadog APM, Grafana + Tempo + Loki |
The Three Pillars of Observability
Observability is built on three complementary telemetry types, often called the three pillars. Each pillar answers a different question, and together they give you a complete picture of any incident: metrics tell you something is wrong, logs tell you what happened, and traces tell you where in the system it happened and how long each step took. The power comes from linking all three to the same request context.
Pillar 1 - Metrics: Aggregated Numbers Over Time
Metrics are numeric measurements aggregated over time: counters, gauges, and histograms. A counter might record the total number of HTTP requests served; a gauge records the current number of active database connections; a histogram records a distribution of request latencies so you can compute percentiles like p50, p95, and p99. Metrics are extremely cheap to store because they collapse millions of events into a handful of numbers per interval.
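To make the three metric types concrete, here is a minimal standard-library sketch. It is not how Prometheus or any particular client library implements them; the class names, the bucket bounds, and the sample latencies are assumptions chosen for illustration. The histogram stores only per-bucket counts, which is why it is cheap, and also why a percentile can only be reported as a bucket upper bound.

```python
import math
from bisect import bisect_left


class Counter:
    """Monotonically increasing total, e.g. HTTP requests served."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, e.g. active DB connections."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the individual events are not stored."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for observations above the largest bound (+Inf bucket).
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.upper_bounds, value)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Smallest bucket bound covering at least fraction q of observations."""
        if self.total == 0:
            raise ValueError("no observations recorded")
        target = math.ceil(q * self.total)
        seen = 0
        for bound, count in zip(self.upper_bounds + [math.inf], self.counts):
            seen += count
            if seen >= target:
                return bound
        return math.inf


# Hypothetical metrics for one service.
requests_total = Counter()
active_connections = Gauge()
latency_ms = Histogram([50, 100, 250, 500, 1000])

for sample in [12, 48, 95, 110, 240, 260, 30, 45, 70, 900]:
    requests_total.inc()
    latency_ms.observe(sample)
active_connections.set(17)

print(requests_total.value)                      # 10
print(latency_ms.quantile_upper_bound(0.50))     # 100
print(latency_ms.quantile_upper_bound(0.95))     # 1000
```

Ten requests collapse into six bucket counts. The p95 answer of "at most 1000 ms" is all the histogram can say; it cannot tell you which request was the slow one.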
The key limitation of metrics is that aggregation loses detail. If your p99 latency spikes, the metric alone cannot tell you which users were affected, which endpoint was slow, or what dependency caused the slowdown. That is why metrics must be paired with the other pillars. Popular metrics stacks include Prometheus (scrape-based) and StatsD + Graphite (push-based). The standard query language for Prometheus metrics is PromQL.
Pillar 2 - Logs: Structured Records of Events
Logs are timestamped records of discrete events, one log line per meaningful event. Historically logs were unstructured text, which made them hard to query at scale. Modern observability practice requires structured logging: emitting each log as a JSON object with well-defined fields so you can filter, aggregate, and correlate by any field. A structured log entry for an HTTP request might include request_id, user_id, endpoint, status, latency_ms, and trace_id.
The trace_id field is the critical link: when logs carry the same trace ID as the distributed trace for that request, an engineer can jump directly from a suspicious log line into the full trace. Tools like Grafana Loki, Elasticsearch, and Datadog Logs index structured logs for fast field-level search. Log retention is expensive, so teams often keep raw logs for 7-30 days and push aggregates to cold storage for longer.
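A structured log entry with the fields listed above could be produced with nothing more than Python's standard logging module. This is a sketch under assumptions: the logger name "checkout", the "fields" key used to pass structured data, and every field value are made up for illustration.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # A dict passed as extra={"fields": {...}} becomes record.fields.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Write to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "request completed",
    extra={"fields": {
        "request_id": "req-7f3a",
        "user_id": "9821",
        "endpoint": "/checkout",
        "status": 200,
        "latency_ms": 412,
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    }},
)

parsed = json.loads(buffer.getvalue())
print(parsed["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Because every field is a JSON key, a log backend can filter on status or latency_ms directly, and the trace_id value is what lets a log line be joined to its trace.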
Pillar 3 - Distributed Tracing: Following a Request Across Services
Distributed tracing records the path a single request takes through every service in your system, producing a trace: a directed acyclic graph of spans. A span represents one unit of work: an HTTP handler, a database query, a cache lookup, or an outbound RPC call. Each span records its operation name, start time, duration, parent span ID, and a set of key-value attributes.
The root span represents the entire request. Every downstream call creates a child span that references the parent via the parent_span_id. All spans in a trace share a single globally-unique trace_id, which is propagated across service boundaries in HTTP headers (for example traceparent in the W3C Trace Context standard). The resulting flame-graph view in a tool like Jaeger or Grafana Tempo instantly shows exactly where time was spent and which service caused a slowdown.
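The parent/child relationship and header propagation described above can be sketched as follows. This is an illustrative model, not the OpenTelemetry SDK: the Span class and helper names are assumptions, timing fields are omitted for brevity, and only version 00 of the traceparent header is handled (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, joined by hyphens).

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str                  # 32 hex chars, shared by every span in the trace
    span_id: str                   # 16 hex chars, unique per span
    parent_span_id: Optional[str]  # None for the root span
    attributes: dict = field(default_factory=dict)


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Create a root span, or a child that inherits the parent's trace_id."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def to_traceparent(span: Span, sampled: bool = True) -> str:
    """Serialize the span context for an outbound HTTP request."""
    return f"00-{span.trace_id}-{span.span_id}-{'01' if sampled else '00'}"


def from_traceparent(header: str, name: str) -> Span:
    """Continue the trace in the receiving service from an incoming header."""
    version, trace_id, parent_id, _flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"unsupported traceparent: {header!r}")
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


# Service A: a root span with a local child, then an outbound call.
root = start_span("GET /checkout")
db = start_span("SELECT orders", parent=root)
header = to_traceparent(root)

# Service B: picks up the same trace from the header.
remote = from_traceparent(header, "POST /payments")

print(db.trace_id == root.trace_id)           # True
print(remote.parent_span_id == root.span_id)  # True
```

The receiving service never generates a new trace_id; it only mints a new span_id and records the caller's span as its parent, which is what lets a backend stitch the spans back into one tree.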
High-Cardinality Data: The Real Superpower of Observability
Cardinality refers to the number of unique values a dimension (label, tag, or attribute) can take. A metric labelled by http_method has low cardinality, maybe five values (GET, POST, PUT, DELETE, PATCH). A metric labelled by user_id has high cardinality, potentially millions of unique values. Traditional time-series databases like Prometheus store one time series per unique label combination, so high-cardinality dimensions cause a cardinality explosion that makes the database slow and expensive.
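The arithmetic behind a cardinality explosion is a simple product, shown in the short sketch below. The status_code and region counts are assumed values for illustration, and the result is an upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-latency metric.
low = {"http_method": 5, "status_code": 6, "region": 4}
high = {**low, "user_id": 1_000_000}

print(series_count(low))   # 120
print(series_count(high))  # 120000000
```

Adding one high-cardinality label multiplies the series count rather than adding to it, which is why a single user_id label can take a metric from 120 series to 120 million.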
This is why high-cardinality debugging requires traces and structured logs, not metrics. A trace can carry user_id, request_id, customer_plan, region, and feature_flag as span attributes without multiplying the number of stored series, because traces are stored as individual documents, not aggregated series. Tools like Honeycomb and Grafana Tempo are purpose-built for high-cardinality trace data, allowing you to ask "show me all requests from user_id=9821 that took over 500ms in the last hour" interactively.
Exemplars: Linking Metrics to Traces
An exemplar is a sample data point attached to a metric that carries the trace_id of the specific request that produced it. When your p99 latency histogram spikes, clicking an exemplar point in Grafana jumps you directly to the distributed trace of a request that contributed to that spike, with no manual log searching required. Note that an exemplar is a sample from its bucket, not necessarily the single worst request. Exemplars bridge the gap between the aggregate view (metrics) and the individual view (traces), and are supported by Prometheus, OpenMetrics, and Grafana natively.
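Conceptually, an exemplar is a small piece of extra state kept next to each histogram bucket. The sketch below shows that idea only; it is not the Prometheus or OpenMetrics implementation. The class names, the bucket bounds, the trace IDs, and the "keep the most recent observation" policy are all assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exemplar:
    value: float
    trace_id: str


class ExemplarHistogram:
    """Bucketed latency counts, each bucket remembering one sample trace_id."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)  # last slot is +Inf
        self.exemplars: list[Optional[Exemplar]] = [None] * len(self.counts)

    def observe(self, value: float, trace_id: str) -> None:
        index = bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        # Policy assumed here: keep the most recent observation per bucket.
        self.exemplars[index] = Exemplar(value, trace_id)

    def exemplar_for(self, value: float) -> Optional[Exemplar]:
        """Return the stored sample for the bucket that `value` falls into."""
        return self.exemplars[bisect_left(self.upper_bounds, value)]


hist = ExemplarHistogram([100, 500, 1000])
hist.observe(42, "trace-aaa")
hist.observe(730, "trace-bbb")
hist.observe(880, "trace-ccc")

# The (500, 1000] bucket holds two observations but only one exemplar.
slow = hist.exemplar_for(900)
print(slow.trace_id)  # trace-ccc
```

The storage cost stays bounded at one trace ID per bucket, regardless of traffic, which is what makes exemplars affordable where a trace_id label on the metric itself would not be.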
OpenTelemetry: The Unified Standard
OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral, open-source standard for generating, collecting, and exporting all three pillars of telemetry. Before OTel, every observability vendor had its own SDK and wire format, locking you in. OTel defines a single API and SDK (in every major language), a standard wire protocol (OTLP, the OpenTelemetry Protocol), and the OTel Collector, a vendor-agnostic pipeline that can receive telemetry from your services and route it to any backend (Jaeger, Tempo, Prometheus, Datadog, etc.).
With OTel you instrument your service once using the standard SDK, and then change backends without touching application code. The OTel Collector can also enrich spans (adding Kubernetes pod metadata), sample aggressively to reduce storage costs, and fan out telemetry to multiple backends simultaneously. OTel is now the de-facto industry standard; all major cloud providers and observability vendors have committed to supporting OTLP natively.
# OTel Collector config: receive OTLP, export to Prometheus + Jaeger
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: '0.0.0.0:8889'
  # Current Collector releases no longer ship the dedicated jaeger exporter.
  # Jaeger accepts OTLP directly, so traces go out through a named otlp exporter.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # The debug exporter replaces the deprecated logging exporter.
  debug:
    verbosity: normal
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [debug]
Sampling Strategies: Managing Telemetry Volume
Recording every single trace in a high-traffic system is prohibitively expensive. Sampling decides which traces to keep. Head-based sampling makes the decision at the root span (before downstream services run), usually using a fixed percentage: simple, but it may discard rare errors. Tail-based sampling defers the decision until the entire trace is complete, keeping 100% of traces that contain errors, high latency, or other interesting signals, and dropping the rest. Tail-based sampling is superior for debugging but requires buffering complete traces in the collector before making a decision, adding latency and memory overhead.
| Strategy | Decision point | Keeps errors? | Cost | Typical use |
|---|---|---|---|---|
| Head-based (fixed %) | Root span, before execution | Only by chance | Very low | High-volume, low-error services |
| Head-based (rate limit) | Root span, per-service rate cap | Only by chance | Low | Bursty traffic control |
| Tail-based | After full trace collected | Always | Medium (buffering) | Latency/error debugging |
| Always-on | Every request | Always | High | Low-traffic dev/staging |
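The difference between the head-based and tail-based rows in the table can be sketched in a few lines. This is an illustration under assumptions, not the OTel Collector's sampler: the function names, the 500 ms threshold, and the example spans are made up. The head-based decision here is derived from the trace ID, one common way to make every service agree without coordination.

```python
from dataclasses import dataclass


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the root span, using only the trace ID.

    Deriving the verdict from the ID means every service reaches the same
    decision for a given trace, but nothing is known yet about errors.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000  # low 32 bits mapped to [0, 1)
    return bucket < ratio


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_error: bool = False


def tail_sample(spans: list[FinishedSpan], latency_threshold_ms: float = 500.0) -> bool:
    """Decide after the whole trace is buffered: keep errors and slow traces."""
    return any(s.is_error or s.duration_ms >= latency_threshold_ms for s in spans)


# Three hypothetical completed traces.
fast_ok = [FinishedSpan("GET /cart", 40), FinishedSpan("redis GET", 3)]
failed = [FinishedSpan("POST /pay", 120), FinishedSpan("charge card", 95, is_error=True)]
slow = [FinishedSpan("GET /search", 870)]

kept = [tail_sample(trace) for trace in (fast_ok, failed, slow)]
print(kept)  # [False, True, True]

print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))  # True
```

The head-based function needs only a string, so it costs almost nothing, while the tail-based function needs the full list of finished spans, which is exactly the buffering overhead noted in the table.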
Frequently Asked Questions
What is the difference between observability and monitoring?
Monitoring answers predefined questions about known failure modes: you set up dashboards and alerts ahead of time. Observability lets you explore and answer any question about your system's internal state after the fact, including questions you did not think to ask before an incident. Monitoring is a subset of observability: a fully observable system makes monitoring easier, but monitoring alone cannot make a system observable. In practice, good observability means shipping rich structured logs, distributed traces, and high-cardinality metrics so that an on-call engineer can debug any incident without deploying new code to gain visibility.
Do I need all three pillars, or can I start with just metrics?
Metrics alone are a fine starting point for a simple monolithic service: they are cheap, well-understood, and integrate easily with alerting. As you add services, metrics will tell you something is wrong but not where or why. Structured logs add context for diagnosing individual events. Distributed traces become essential the moment a single request spans more than one service. The pragmatic approach: start with metrics and structured logs on day one, add distributed tracing when you have more than two services communicating, and wire them together with a shared trace_id and OTel exemplars for instant cross-signal navigation.
Why does high cardinality break traditional metrics databases?
A time-series database like Prometheus stores one series per unique combination of label values. If you add a user_id label with one million unique users to a metric that already has 10 label combinations, you can end up with as many as 10 million series, each requiring its own in-memory index entry and on-disk chunk. This is called cardinality explosion: memory usage grows linearly with the number of unique label value combinations, and query performance degrades sharply. Prometheus offers guardrails such as the sample_limit and label_limit settings in a scrape config, which fail any scrape that exceeds them, but the fundamental limitation remains. Use traces and structured logs with high-cardinality dimensions; keep metrics labels to low-cardinality values like status_code, endpoint, and region.
Observability is not a tool you buy; it is a property you engineer into your system by shipping rich telemetry from day one. Metrics tell you something broke; traces tell you where; logs tell you why. Together they give your team the power to debug anything, not just the things you anticipated.
- alokknight Engineering
