Monitoring in System Design: Metrics, Logs, Traces, Golden Signals & Alerting (Visualized)
This guide covers the three pillars, white-box vs black-box monitoring, the four golden signals, RED and USE, SLI/SLO/SLA, dashboards, alerting, and push vs pull.
Monitoring is the practice of continuously collecting, storing, and visualizing signals from a running system so you can tell whether it is healthy and be alerted automatically when it is not. It answers a predefined question — “is this number within the range I expect?” — by tracking things you already know matter: request rate, error rate, latency, CPU, memory.
Monitoring is closely related to, but not the same as, observability. Monitoring tells you that something is wrong by watching known signals against known thresholds; observability is the broader property that lets you ask new questions about novel failures — to debug “unknown unknowns” by slicing rich, high-cardinality telemetry after the fact. Monitoring is a subset of observability: good monitoring is necessary, but on its own it only catches the failure modes you anticipated.
The Three Pillars: Metrics, Logs, and Traces
Most monitoring stacks are built on three complementary telemetry types. Metrics are numeric measurements aggregated over time — cheap to store, fast to query, ideal for dashboards and alerts (e.g. http_requests_total). Logs are timestamped, discrete event records — high detail, great for forensics, but expensive at scale. Traces follow a single request as it hops across services, breaking end-to-end latency down into per-service spans so you can see exactly where time went.
| | Metrics | Logs | Traces |
|---|---|---|---|
| What it is | Aggregated numbers over time | Discrete timestamped events | Causal path of one request across services |
| Best for | Dashboards, alerting, trends | Forensic debugging, audit | Latency breakdown, dependency mapping |
| Cost / cardinality | Low, as long as label cardinality stays bounded | High volume, expensive to store and index | Moderate; usually sampled |
| Example tool | Prometheus, Datadog | Loki, ELK / OpenSearch | Jaeger, Tempo, OpenTelemetry |
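To make the comparison concrete, here is a minimal sketch of one request producing all three telemetry types. The in-memory sinks (`request_counter`, `log_lines`, `spans`) and the `handle_request` helper are hypothetical stand-ins for a real metrics store, log pipeline, and trace backend, not any particular library's API.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for real backends.
request_counter = Counter()  # metric: aggregated count per (route, status)
log_lines = []               # logs: one JSON event per request
spans = []                   # traces: one span per unit of work


def handle_request(route, status, trace_id=None):
    """Record one request as a metric increment, a log event, and a span."""
    trace_id = trace_id or uuid.uuid4().hex
    wall_start = time.time()
    timer_start = time.perf_counter()
    # ... real request handling would happen here ...
    duration_s = time.perf_counter() - timer_start

    # Metric: only low-cardinality labels, so the number of series stays small.
    request_counter[(route, status)] += 1

    # Log: full detail for this single event, tagged with the trace id.
    log_lines.append(json.dumps({
        "ts": wall_start, "route": route, "status": status, "trace_id": trace_id,
    }))

    # Trace: a span that a tracing backend could stitch together by trace_id.
    spans.append({
        "trace_id": trace_id, "name": f"GET {route}",
        "start": wall_start, "duration_s": duration_s,
    })
    return trace_id
```

Note that the trace id appears in the log and the span but not in the metric: putting a per-request id into a metric label is exactly what makes metric cardinality explode.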
White-Box vs Black-Box Monitoring
White-box monitoring uses signals exposed from inside the system — application metrics, queue depths, GC pauses, the /metrics endpoint your code emits. It tells you why something is happening. Black-box monitoring probes the system from the outside, the way a user would — a synthetic HTTP check, a ping, an uptime probe. It tells you that something is broken right now, symptom-first, with no insight into internals. Mature setups use both: black-box to catch user-visible outages, white-box to explain them.
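A black-box check can be sketched in a few lines. In the version below, `probe` and `ProbeResult` are illustrative names, and `check` is an assumed callable that performs the outside-in request (for example an HTTP GET) and returns a status code. The probe reports only the symptom, which is the defining property of black-box monitoring.

```python
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    up: bool
    latency_s: float
    detail: str


def probe(check: Callable[[], int], max_latency_s: float = 1.0) -> ProbeResult:
    """Run one synthetic check from the outside and report only the symptom.

    `check` is a hypothetical callable that performs the request and returns
    a status code, or raises if the connection fails.
    """
    start = time.perf_counter()
    try:
        status = check()
    except Exception as exc:  # connection refused, timeout, DNS failure, ...
        return ProbeResult(False, time.perf_counter() - start, f"error: {exc}")
    latency = time.perf_counter() - start
    if not 200 <= status < 400:
        return ProbeResult(False, latency, f"status {status}")
    if latency > max_latency_s:
        return ProbeResult(False, latency, "too slow")
    return ProbeResult(True, latency, "ok")
```

The result says that the service is down or slow, never why; explaining it is the job of the white-box signals.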
The Four Golden Signals
Google's SRE book distills service health into four golden signals. Latency is how long requests take — measured at percentiles (p50, p95, p99), separating successful from failed requests. Traffic is demand on the system (requests/sec, transactions/sec). Errors is the rate of failed requests (5xx, timeouts, wrong answers). Saturation is how full the system is — the most constrained resource (CPU, memory, I/O, connection pool) as a fraction of capacity. If you can only instrument four things, instrument these.
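As a rough illustration, the four signals can be computed from one window of request records. This is a sketch under assumptions: the `Request` record, the connection-pool inputs, and the nearest-rank percentile are choices made here for brevity, and the window is assumed to contain at least one successful request.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarise one window of traffic into the four golden signals."""
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: successful requests only, so fast failures do not flatter the numbers.
        "latency_p50_ms": percentile(ok_latencies, 50),
        "latency_p99_ms": percentile(ok_latencies, 99),
        # Traffic: demand on the system.
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed.
        "error_ratio": failed / len(requests),
        # Saturation: the most constrained resource, here a hypothetical connection pool.
        "saturation": pool_in_use / pool_size,
    }
```

Production systems usually derive percentiles from histograms rather than raw samples, but the definitions are the same.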
RED and USE Methods
Two popular checklists turn the golden signals into something you can apply per-component. The RED method (Tom Wilkie) is request-centric and best for services: track Rate (requests/sec), Errors (failed requests/sec), and Duration (latency distribution) for every endpoint. The USE method (Brendan Gregg) is resource-centric and best for infrastructure: for every resource track Utilization, Saturation, and Errors. RED watches what users experience; USE watches what your machines are doing. Together they cover both ends of a request.
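The difference in viewpoint shows up clearly in code. The sketch below is illustrative only: `red` groups hypothetical `(endpoint, duration_ms, ok)` tuples per endpoint, while `use` describes a single resource, assumed here to be a worker pool.

```python
from collections import defaultdict


def red(requests, window_s):
    """RED per endpoint. `requests` is a list of (endpoint, duration_ms, ok) tuples."""
    by_endpoint = defaultdict(list)
    for endpoint, duration_ms, ok in requests:
        by_endpoint[endpoint].append((duration_ms, ok))
    report = {}
    for endpoint, rows in by_endpoint.items():
        report[endpoint] = {
            "rate_rps": len(rows) / window_s,
            "errors_rps": sum(1 for _, ok in rows if not ok) / window_s,
            # Duration is a distribution, so keep the sorted samples rather than one average.
            "durations_ms": sorted(d for d, _ in rows),
        }
    return report


def use(busy_workers, total_workers, queued_tasks, failed_tasks):
    """USE for one resource, here a hypothetical worker pool."""
    return {
        "utilization": busy_workers / total_workers,  # share of capacity in use
        "saturation": queued_tasks,                   # work waiting because the resource is full
        "errors": failed_tasks,                       # error events on the resource itself
    }
```

RED is keyed by endpoint and USE by resource, which is why the two checklists complement each other instead of overlapping.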
SLI, SLO, and SLA
These three define what “good enough” means. An SLI (Service Level Indicator) is a measured number, e.g. “the fraction of requests served in under 300 ms.” An SLO (Service Level Objective) is the target you hold that SLI to, e.g. “99.9% of requests under 300 ms over 30 days.” An SLA (Service Level Agreement) is a contractual promise to customers, usually looser than the internal SLO, with financial penalties if breached. The gap between 100% and your SLO is your error budget — the amount of unreliability you can spend on shipping features before you must stop and fix stability.
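The error budget is simple arithmetic. For a 99.9% SLO over 30 days, the budget is 0.1% of the window, which is equivalent to roughly 43.2 minutes of total outage. The helper names below are illustrative, and the request-count form assumes the SLI is a ratio of good requests to all requests.

```python
def error_budget(slo, window_days=30):
    """Return (budget as a fraction, budget as minutes of full outage) for an SLO such as 0.999."""
    budget_fraction = 1 - slo
    window_minutes = window_days * 24 * 60
    return budget_fraction, budget_fraction * window_minutes


def budget_remaining(slo, total_requests, bad_requests):
    """Fraction of the error budget still unspent, given request counts so far."""
    allowed_bad = (1 - slo) * total_requests
    return 1 - bad_requests / allowed_bad
```

With a 99.9% SLO and one million requests served, about 1,000 may be bad; if 250 already were, roughly three quarters of the budget is left. A negative result means the budget is overspent.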
Dashboards, Thresholds & Alerting
Dashboards visualize metrics over time so humans can spot trends; a wall of the four golden signals per service is the canonical layout. Alerting closes the loop: a rule watches a metric and fires when it crosses a threshold for a sustained window (e.g. “p99 latency > 300 ms for 5 minutes”). Picture a latency time series scrolling under a threshold line: a brief spike above the line does nothing, but once the series stays above it for the whole window, the alert fires and is routed to the on-call engineer.
The biggest operational hazard here is alert fatigue: too many noisy, low-value alerts train on-call engineers to ignore their pager, so the one real alert gets missed. Good practice is to alert on symptoms that users feel (high error rate, slow latency) rather than every internal cause, require a sustained duration to avoid flapping, page only on things that need a human now, and send everything else to a ticket or dashboard.
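The "sustained duration" requirement is what separates a useful alert from a flapping one. Below is a minimal sketch of such a rule evaluator. `evaluate_alert` is a hypothetical helper written for this guide, loosely modelled on the idea of a hold duration in alerting rules and not taken from any specific tool.

```python
def evaluate_alert(samples, threshold, for_seconds):
    """Return the timestamp at which the alert fires, or None.

    `samples` is a time-ordered list of (timestamp_s, value) pairs.
    The alert fires once the value has stayed above `threshold`
    continuously for at least `for_seconds`.
    """
    breach_started = None
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts  # condition just became true: pending
            if ts - breach_started >= for_seconds:
                return ts            # sustained long enough: firing
        else:
            breach_started = None    # any dip resets the clock, which suppresses flapping
    return None
```

Called with one sample per minute, `threshold=300`, and `for_seconds=300`, a two-minute spike never pages anyone, while a breach that lasts five full minutes does.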
Push vs Pull: How Metrics Get Collected
There are two ways metrics reach the monitoring backend. In the pull model, used by Prometheus, the server scrapes an HTTP /metrics endpoint exposed by each target at a fixed interval. The monitoring system owns service discovery and learns within one scrape interval that a target is unreachable, because the scrape fails. In the push model, used by StatsD, Graphite, and many Datadog agents, each application actively sends its metrics to a collector. Push suits short-lived jobs and serverless functions that may be gone before a scrape; pull suits long-running services and makes “is it up?” trivial to answer.
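The two models can be contrasted in a small simulation. Everything here is a simplified sketch: `render_metrics` emits only the bare `name value` lines of the Prometheus text format (no labels, no HELP/TYPE comments), the `fetch` callables stand in for HTTP GETs, and `PushCollector` is a made-up class, not a real client.

```python
def render_metrics(counters):
    """Render counters as simple `name value` lines, one per metric."""
    return "".join(f"{name} {value}\n" for name, value in sorted(counters.items()))


def scrape(targets):
    """Pull model: the server visits every known target and gets `up` as a side effect.

    `targets` maps a target name to a callable returning its /metrics body;
    the callable stands in for an HTTP GET and raises if the target is unreachable.
    """
    results = {}
    for name, fetch in targets.items():
        try:
            body = fetch()
            samples = dict(line.rsplit(" ", 1) for line in body.splitlines() if line)
            results[name] = {"up": 1, **{k: float(v) for k, v in samples.items()}}
        except Exception:
            results[name] = {"up": 0}  # a failed scrape is itself a signal
    return results


class PushCollector:
    """Push model: sources send samples whenever they like; the collector only receives."""

    def __init__(self):
        self.received = []

    def push(self, source, name, value):
        self.received.append((source, name, value))
```

In the pull version the server notices a dead target on its own. In the push version a silent source is indistinguishable from one with nothing to report, unless the collector separately tracks which sources it expects to hear from.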
Common Tools
Prometheus is the de facto open-source metrics engine: a pull-based time-series database with its own query language, PromQL. Grafana is the visualization layer of choice — it builds dashboards on top of Prometheus, Loki, and dozens of other sources. Datadog is a popular hosted, all-in-one platform combining metrics, logs, traces, and alerting behind an agent. Other common names: Loki/ELK for logs, Jaeger/Tempo for traces, OpenTelemetry as the vendor-neutral instrumentation standard, and Alertmanager for routing Prometheus alerts.
A minimal Prometheus scrape config makes the pull model concrete — you declare targets and an interval, and the server does the rest:
```yaml
global:
  scrape_interval: 15s      # how often to pull /metrics
  evaluation_interval: 15s  # how often to evaluate alert rules
scrape_configs:
  - job_name: api
    metrics_path: /metrics
    static_configs:
      - targets: ['api-1:9100', 'api-2:9100']
rule_files:
  - alert_rules.yml         # e.g. p99 latency > 300ms for 5m -> page
```
Frequently Asked Questions
What is the difference between monitoring and observability?
Monitoring watches predefined signals against known thresholds to tell you that something is wrong — it catches the failure modes you anticipated. Observability is the broader ability to ask new, unplanned questions of rich telemetry to debug “unknown unknowns” after they happen. Monitoring is a subset of observability; you need both, but they are not interchangeable.
What are the four golden signals?
Latency (how long requests take), traffic (how much demand the system is under), errors (the rate of failed requests), and saturation (how full the most constrained resource is). They come from Google's SRE practice and, if you can only monitor a few things per service, these are the four to choose.
Why does Prometheus pull metrics instead of receiving them?
Pulling lets the monitoring server own service discovery and scheduling, makes “is the target up?” a free side effect (a failed scrape means the target is unreachable), and avoids targets overwhelming the backend. The trade-off is that short-lived jobs may vanish between scrapes; such jobs push their metrics to the Prometheus Pushgateway, or to a push-based backend, instead.
Monitor symptoms, not just causes: alert on what your users feel, instrument enough to explain it, and guard the pager so the one alert that matters is never the one ignored.
— alokknight Engineering
