Monitoring in System Design: Metrics, Logs, Traces, Golden Signals & Alerting (Visualized)

Monitoring is the practice of continuously collecting, storing, and visualizing signals from a running system so you can tell whether it is healthy and be alerted automatically when it is not. It answers a predefined question — “is this number within the range I expect?” — by tracking things you already know matter: request rate, error rate, latency, CPU, memory.

Monitoring is closely related to, but not the same as, observability. Monitoring tells you that something is wrong by watching known signals against known thresholds; observability is the broader property that lets you ask new questions about novel failures — to debug “unknown unknowns” by slicing rich, high-cardinality telemetry after the fact. Monitoring is a subset of observability: good monitoring is necessary, but on its own it only catches the failure modes you anticipated.

The Three Pillars: Metrics, Logs, and Traces

Most monitoring stacks are built on three complementary telemetry types. Metrics are numeric measurements aggregated over time — cheap to store, fast to query, ideal for dashboards and alerts (e.g. http_requests_total). Logs are timestamped, discrete event records — high detail, great for forensics, but expensive at scale. Traces follow a single request as it hops across services, breaking end-to-end latency down into per-service spans so you can see exactly where time went.

Aspect	Metrics	Logs	Traces
What it is	Aggregated numbers over time	Discrete timestamped events	Causal path of one request across services
Best for	Dashboards, alerting, trends	Forensic debugging, audit	Latency breakdown, dependency mapping
Cost / cardinality	Low per sample; grows with label cardinality	High volume, expensive at scale	Moderate; usually sampled
Example tools	Prometheus, Datadog	Loki, ELK / OpenSearch	Jaeger, Tempo, OpenTelemetry
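To make the distinction concrete, here is a minimal sketch of one request emitting all three telemetry types. It uses only the Python standard library; the in-memory `metrics`, `logs`, and `spans` stores and the `handle_request` function are hypothetical stand-ins for a real metrics client, logger, and tracer.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()  # metrics: numbers aggregated by name and labels
logs = []            # logs: one discrete, timestamped record per event
spans = []           # traces: timed spans tied together by a trace_id


def handle_request(path, trace_id=None):
    """Serve one (pretend) request and emit all three telemetry types."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.monotonic()
    status = 200  # stand-in for the real work the handler would do
    duration_ms = (time.monotonic() - start) * 1000

    # Metric: only a counter moves; the individual request is not kept.
    metrics[("http_requests_total", path, status)] += 1

    # Log: the full detail of this one event, as structured JSON.
    logs.append(json.dumps({
        "ts": time.time(), "level": "info", "path": path,
        "status": status, "duration_ms": duration_ms, "trace_id": trace_id,
    }))

    # Trace: a span that other services could extend with the same trace_id.
    spans.append({
        "trace_id": trace_id, "name": f"GET {path}", "service": "api",
        "duration_ms": duration_ms,
    })
    return trace_id
```

The counter stays the same size no matter how many requests arrive, while `logs` and `spans` grow with every call. That difference is roughly why metrics are cheap to keep and logs and traces are not.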

White-Box vs Black-Box Monitoring

White-box monitoring uses signals exposed from inside the system — application metrics, queue depths, GC pauses, the /metrics endpoint your code emits. It tells you why something is happening. Black-box monitoring probes the system from the outside, the way a user would — a synthetic HTTP check, a ping, an uptime probe. It tells you that something is broken right now, symptom-first, with no insight into internals. Mature setups use both: black-box to catch user-visible outages, white-box to explain them.
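A black-box check can be as small as the sketch below: it requests a URL the way a user would and reports only what is visible from outside (did it answer with a 2xx, and how fast). The `probe` function and its `opener` parameter are illustrative names; `opener` exists only so the network call can be swapped out, and defaults to the standard library's `urllib.request.urlopen`.

```python
import time
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Black-box check: return (healthy, latency_seconds) as seen from outside."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            healthy = 200 <= response.status < 300
    except OSError:
        # Covers connection errors, timeouts, and urllib's URLError/HTTPError
        # (urlopen raises HTTPError for 4xx/5xx responses).
        healthy = False
    return healthy, time.monotonic() - start
```

Run on a schedule against a health URL of your choosing, this tells you the service is down or slow, but nothing about why. Explaining the failure is the job of the white-box signals.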

The Four Golden Signals

Google's SRE book distills service health into four golden signals. Latency is how long requests take — measured at percentiles (p50, p95, p99), separating successful from failed requests. Traffic is demand on the system (requests/sec, transactions/sec). Errors is the rate of failed requests (5xx, timeouts, wrong answers). Saturation is how full the system is — the most constrained resource (CPU, memory, I/O, connection pool) as a fraction of capacity. If you can only instrument four things, instrument these.

Animation: the four golden signals, live. Latency, traffic, errors, and saturation are shown as live gauges. When a value crosses its threshold the gauge turns red, exactly the moment an SRE wants to be paged.
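As a rough sketch, the four signals can be derived from one window of request records plus a single resource gauge. The `Request` record, the nearest-rank `percentile` helper, and the connection-pool numbers are assumptions made for illustration; a real system would read these from its metrics backend rather than from a list in memory.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile, pct in (0, 100]; None if there are no values."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarize one window. Latency uses successful requests only."""
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(ok_durations, 50),
        "latency_p99_ms": percentile(ok_durations, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": failed / len(requests) if requests else 0.0,
        "saturation": pool_in_use / pool_size,  # most constrained resource
    }


# Made-up sample: 100 requests observed over a 10-second window.
sample = [Request(20, 200)] * 90 + [Request(250, 200)] * 8 + [Request(900, 500)] * 2
signals = golden_signals(sample, window_s=10, pool_in_use=45, pool_size=50)
print(signals)
```

In this sample the median is 20 ms while p99 is 250 ms, which is why latency is read at percentiles: an average would hide the slow tail.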

RED and USE Methods

Two popular checklists turn the golden signals into something you can apply per-component. The RED method (Tom Wilkie) is request-centric and best for services: track Rate (requests/sec), Errors (failed requests/sec), and Duration (latency distribution) for every endpoint. The USE method (Brendan Gregg) is resource-centric and best for infrastructure: for every resource track Utilization, Saturation, and Errors. RED watches what users experience; USE watches what your machines are doing. Together they cover both ends of a request.
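RED is essentially the request-side numbers of the golden signals grouped per endpoint, so the sketch below illustrates the resource side instead: USE applied to a worker pool. The `PoolSample` record and the idea of sampling the pool at regular intervals are assumptions for illustration, not part of either method's definition.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    busy_workers: int   # workers executing a task when the sample was taken
    queued_tasks: int   # tasks waiting for a free worker
    failed_tasks: int   # tasks that errored since the previous sample


def use_summary(samples, pool_size):
    """Utilization, Saturation, Errors for one resource over a set of samples."""
    n = len(samples)
    return {
        # Utilization: average fraction of the pool that was busy.
        "utilization": sum(s.busy_workers for s in samples) / (n * pool_size),
        # Saturation: average amount of work that had to wait.
        "saturation": sum(s.queued_tasks for s in samples) / n,
        # Errors: count of error events in the window.
        "errors": sum(s.failed_tasks for s in samples),
    }


samples = [
    PoolSample(4, 0, 0),
    PoolSample(8, 2, 0),
    PoolSample(8, 5, 1),
    PoolSample(8, 1, 0),
]
summary = use_summary(samples, pool_size=8)
print(summary)
```

Here utilization is 0.875 and the average queue length is 2.0. A pool that is nearly full and already queuing work is the kind of early warning USE is meant to surface before users notice.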

SLI, SLO, and SLA

These three define what “good enough” means. An SLI (Service Level Indicator) is a measured number, e.g. “the fraction of requests served in under 300 ms.” An SLO (Service Level Objective) is the target you hold that SLI to, e.g. “99.9% of requests under 300 ms over 30 days.” An SLA (Service Level Agreement) is a contractual promise to customers, usually looser than the internal SLO, with financial penalties if breached. The gap between 100% and your SLO is your error budget — the amount of unreliability you can spend on shipping features before you must stop and fix stability.
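The error budget is simple arithmetic, and a short sketch makes the numbers tangible. The function names below are illustrative, and both assume an SLO strictly below 100% (a 100% target leaves no budget to divide by).

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, bad_requests):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1 - slo) * total_requests
    return 1 - bad_requests / allowed_bad


print(round(error_budget_minutes(0.999), 1))              # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

A 99.9% SLO over 30 days allows about 43 minutes of total failure. Measured per request, 250 bad requests out of a million uses a quarter of the 1,000 the SLO permits, leaving 75% of the budget.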

Dashboards, Thresholds & Alerting

Dashboards visualize metrics over time so humans can spot trends; a wall of the four golden signals per service is the canonical layout. Alerting closes the loop: a rule watches a metric and fires when it stays beyond a threshold for a sustained window (e.g. “p99 latency > 300 ms for 5 minutes”). The animation below shows a latency time-series scrolling under a threshold line; once the series stays above the line for the full window, an alert fires and is routed to the on-call engineer.

Animation: threshold breach fires an alert. A live latency time-series scrolls left under a threshold line. When the value stays above the threshold the series turns red and an alert is dispatched to on-call.

The biggest operational hazard here is alert fatigue: too many noisy, low-value alerts train on-call engineers to ignore their pager, so the one real alert gets missed. Good practice is to alert on symptoms that users feel (high error rate, slow latency) rather than every internal cause, require a sustained duration to avoid flapping, page only on things that need a human now, and send everything else to a ticket or dashboard.
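The "sustained duration" requirement can be sketched as a small state machine, loosely modelled on the pending/firing distinction Prometheus makes for rules with a `for` clause. The class name and the sample values are made up for illustration.

```python
class SustainedThresholdAlert:
    """Fire only after the value has stayed above the threshold for a full window."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_started_at = None  # time of the first sample in the current breach

    def evaluate(self, value, now):
        """Return 'ok', 'pending', or 'firing' for the sample taken at time `now`."""
        if value <= self.threshold:
            self.breach_started_at = None  # any recovery resets the clock
            return "ok"
        if self.breach_started_at is None:
            self.breach_started_at = now
        if now - self.breach_started_at >= self.for_seconds:
            return "firing"
        return "pending"


# p99 latency in ms, sampled once a minute; rule: > 300 ms for 5 minutes.
alert = SustainedThresholdAlert(threshold=300, for_seconds=300)
p99_samples = [250, 320, 280, 310, 330, 340, 350, 360, 305]
states = [alert.evaluate(v, now=i * 60) for i, v in enumerate(p99_samples)]
print(states)
```

The one-minute spike at the second sample never pages anyone because the next sample resets the clock. Only the final sample, five minutes into an unbroken breach, returns `firing`.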

Push vs Pull: How Metrics Get Collected

There are two ways metrics reach the monitoring backend. In the pull model, used by Prometheus, the server scrapes an HTTP /metrics endpoint exposed by each target on a fixed interval. The monitoring system owns service discovery and learns that a target is unreachable as soon as its next scrape fails. In the push model, used by StatsD, Graphite, and the Datadog agent, each application actively sends its metrics to a collector. Push suits short-lived jobs and serverless functions that may be gone before a scrape; pull suits long-running services and makes “is it up?” trivial to answer.

Animation: Prometheus scraping targets on an interval. Every scrape interval the Prometheus server pulls /metrics from each target. Reachable targets flash green and report a value; an unreachable target's scrape fails and is marked DOWN.
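One scrape cycle of the pull model can be sketched in a few lines. This is a simplification under stated assumptions, not how Prometheus is implemented: `fetch` is a hypothetical stand-in for an HTTP GET of each target's /metrics endpoint, and the parser handles only plain `name value` lines with no timestamps. The `up` value mirrors the 1/0 series Prometheus records for each scrape.

```python
def parse_metrics(text):
    """Parse 'name value' lines; comment lines (# HELP, # TYPE) and blanks are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape_once(targets, fetch):
    """Pull from every target; a failed fetch marks that target as down (up = 0)."""
    results = {}
    for target in targets:
        try:
            results[target] = {"up": 1, "samples": parse_metrics(fetch(target))}
        except OSError:  # connection refused, timeout, DNS failure, ...
            results[target] = {"up": 0, "samples": {}}
    return results


# Hypothetical response bodies standing in for real HTTP calls.
FAKE_BODIES = {
    "api-1:9100": "# TYPE http_requests_total counter\nhttp_requests_total 1027\n",
}


def fake_fetch(target):
    if target not in FAKE_BODIES:
        raise ConnectionError(f"cannot reach {target}")
    return FAKE_BODIES[target]


results = scrape_once(["api-1:9100", "api-2:9100"], fake_fetch)
print(results["api-1:9100"])  # {'up': 1, 'samples': {'http_requests_total': 1027.0}}
print(results["api-2:9100"])  # {'up': 0, 'samples': {}}
```

The target never has to announce that it is down. Its absence shows up as a failed scrape, which is the "free" liveness signal the pull model provides.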

Common Tools

Prometheus is the de facto open-source metrics engine: a pull-based time-series database with its own query language, PromQL. Grafana is the visualization layer of choice — it builds dashboards on top of Prometheus, Loki, and dozens of other sources. Datadog is a popular hosted, all-in-one platform combining metrics, logs, traces, and alerting behind an agent. Other common names: Loki/ELK for logs, Jaeger/Tempo for traces, OpenTelemetry as the vendor-neutral instrumentation standard, and Alertmanager for routing Prometheus alerts.

A minimal Prometheus scrape config makes the pull model concrete — you declare targets and an interval, and the server does the rest:

global:
  scrape_interval: 15s        # how often to pull /metrics
  evaluation_interval: 15s    # how often to evaluate alert rules

scrape_configs:
  - job_name: api
    metrics_path: /metrics
    static_configs:
      - targets: ['api-1:9100', 'api-2:9100']

rule_files:
  - alert_rules.yml            # e.g. p99 latency > 300ms for 5m -> page

Frequently Asked Questions

What is the difference between monitoring and observability?

Monitoring watches predefined signals against known thresholds to tell you that something is wrong — it catches the failure modes you anticipated. Observability is the broader ability to ask new, unplanned questions of rich telemetry to debug “unknown unknowns” after they happen. Monitoring is a subset of observability; you need both, but they are not interchangeable.

What are the four golden signals?

Latency (how long requests take), traffic (how much demand the system is under), errors (the rate of failed requests), and saturation (how full the most constrained resource is). They come from Google's SRE practice and, if you can only monitor a few things per service, these are the four to choose.

Why does Prometheus pull metrics instead of receiving them?

Pulling lets the monitoring server own service discovery and scheduling, makes “is the target up?” a free side effect (a failed scrape means the target is unreachable), and keeps targets from overwhelming the backend. The trade-off is that short-lived jobs may vanish between scrapes; those push to the Prometheus Pushgateway, which Prometheus then scrapes, or use a push-based system instead.

Monitor symptoms, not just causes: alert on what your users feel, instrument enough to explain it, and guard the pager so the one alert that matters is never the one ignored.
— alokknight Engineering