Metrics in System Design: Counters, Gauges, Histograms & the Four Golden Signals (Visualized)
Metrics are the numeric pulse of a running system: sampled, stored, and graphed over time so engineers can spot anomalies before users do. This guide covers metric types, the four golden signals, labels, cardinality, aggregation, pull vs push collection, and how metrics differ from logs and traces.
Metrics are numeric measurements of a system's behavior, sampled at regular intervals and stored in a time-series database so engineers can observe trends, set alerts, and diagnose incidents without digging through raw logs. Unlike a log entry, which records a single event that happened once (often as free-form text), a metric is a number that is recorded repeatedly over time, making it cheap to store, fast to query, and easy to graph or alert on.
Every production system worth operating emits metrics: CPU usage, request counts, error rates, queue depths, cache hit ratios, database connection pool sizes. Collectively these numbers answer the question "is the system healthy right now?" in a way that no individual log line can. A good metrics pipeline is typically the first layer of observability you build, because it gives you dashboards and pager-level alerts at very low storage and query cost.
The Four Core Metric Types
Metrics systems such as Prometheus, StatsD, and OpenTelemetry broadly converge on four fundamental types (the names used here follow Prometheus). Understanding which type to use for each signal prevents subtle bugs in your dashboards and alerts.
Counter: a monotonically increasing value that only goes up (or resets to zero on restart). Use it for things you count: total HTTP requests served, total errors thrown, total bytes sent. You rarely care about the raw value of a counter; you query its rate (requests per second) over a time window. A counter that suddenly drops back toward zero usually means the process restarted.
Gauge: a value that can go up or down arbitrarily. Use it for CPU percentage, memory in use, active database connections, queue depth, or anything else that represents a current state rather than an accumulation. You read the gauge's current value directly; no rate calculation needed.
Histogram: sorts observations into configurable buckets (e.g., request latency ≤10 ms, ≤50 ms, ≤200 ms, ≤1 s, plus a catch-all bucket for everything slower) and tracks how many observations fall in each bucket, plus a running sum and count. From a histogram you can estimate percentiles (p50, p95, p99) server-side at query time, which is ideal when you want SLO-level latency analysis across a large fleet.
Summary: like a histogram, but it calculates quantiles client-side in the instrumented process before shipping. Summaries are accurate for the local process, but their quantiles cannot be meaningfully aggregated across multiple instances, making them unsuitable for horizontal fleets. Prefer histograms in nearly every distributed setting.
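To make the counter semantics concrete, here is a minimal Python sketch of how a rate can be derived from raw counter samples while tolerating a restart. The function names and the sample data are hypothetical, and the logic is deliberately simplified: real implementations such as PromQL's rate() also extrapolate to the edges of the query window, which this sketch omits.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a restart."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went backwards: assume the process restarted from
            # zero, so everything counted since the restart is `curr`.
            total += curr
    return total


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second rate over the window covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical samples scraped every 15 s; the process restarts before t=45.
samples = [(0, 100), (15, 160), (30, 220), (45, 20), (60, 80)]
print(counter_increase(samples))  # 200.0 requests over the window
print(counter_rate(samples))      # about 3.33 requests per second
```

A gauge needs none of this: its latest sample is already the answer.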
Histogram Deep-Dive: Latency Buckets
A histogram is the right tool for latency. When Prometheus collects a histogram metric like http_request_duration_seconds, it stores one counter per bucket boundary (le = less than or equal). If a request takes 47 ms it increments every bucket whose boundary is ≥ 47 ms. At query time, PromQL's histogram_quantile(0.99, ...) interpolates across buckets to estimate the 99th-percentile latency. Because the bucket counters can be summed server-side across every instance in your fleet, no per-instance quantile has to be shipped over the wire.
Bucket design matters: too few buckets and your quantile estimates are coarse; too many and you multiply your time-series cardinality. A common starting point for web latency is boundaries at 5 ms, 25 ms, 50 ms, 100 ms, 250 ms, 500 ms, 1 s, 2.5 s, 5 s, 10 s. Adjust based on your p50 and SLO target.
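The bucket mechanics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Prometheus implementation: the class and function names are hypothetical, and the quantile estimate uses simple linear interpolation inside the matching bucket, with 0 as the lower edge of the first bucket and the highest finite boundary returned when the rank lands in the +Inf bucket.

```python
import math
from bisect import bisect_left
from typing import List


class Histogram:
    """Cumulative-bucket histogram: counts[i] = observations <= bounds[i]."""

    def __init__(self, bounds: List[float]):
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0  # running sum of all observed values

    def observe(self, value: float) -> None:
        self.total += value
        first = bisect_left(self.bounds, value)  # first boundary >= value
        for i in range(first, len(self.counts)):
            self.counts[i] += 1


def estimate_quantile(q: float, bounds: List[float], cumulative: List[int]) -> float:
    """Estimate the q-quantile from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if math.isinf(bounds[idx]):
        return bounds[idx - 1]  # rank falls in +Inf: report last finite bound
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0
    in_bucket = cumulative[idx] - below
    if in_bucket == 0:
        return lower
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


h = Histogram([0.01, 0.05, 0.2, 1.0])
h.observe(0.047)  # a 47 ms request
print(h.counts)   # every bucket from le=0.05 upward was incremented

# Bucket counts in the style of a /metrics page (hypothetical numbers).
bounds = [0.05, 0.25, 1.0, math.inf]
cumulative = [1623, 1790, 1842, 1847]
print(estimate_quantile(0.99, bounds, cumulative))  # roughly 0.81 seconds
```

The p99 estimate lands somewhere inside the 0.25 s to 1 s bucket, which is why coarse buckets around your SLO threshold produce coarse answers.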
Metric Types at a Glance
| Type | Goes up / down? | Query pattern | Best for |
|---|---|---|---|
| Counter | Up only (resets on restart) | rate() / irate() | Total requests, total errors, bytes sent |
| Gauge | Both directions | Current value directly | CPU %, memory, queue depth, connections |
| Histogram | Up only (per bucket) | histogram_quantile() | Latency percentiles, request size distribution |
| Summary | Up only (per quantile) | Current quantile value | Accurate local quantiles (single instance only) |
The Four Golden Signals
Google's SRE book distilled years of production experience into four metrics that, together, tell you almost everything about whether a service is healthy. They are the minimum viable dashboard for any user-facing service.
Latency: how long a request takes. Track latency for errors separately from successes: a fast error is not the same as a fast successful response, and conflating the two masks problems. Record latency as a histogram, track p50, p95, and p99, and set your SLO alert on p99.
Traffic: how much demand is hitting the system. HTTP requests per second, messages consumed per second, database queries per second. Traffic is the denominator in your error-rate calculation and the number you feed into capacity planning. A counter metric queried with rate() gives you this directly.
Errors: the rate at which requests fail. This includes explicit failures (HTTP 5xx, gRPC INTERNAL), implicit failures (HTTP 200 with a wrong payload), and policy failures (a response that took too long counts as a failure under an SLO). Track error rate as a percentage of traffic; alert when it crosses your error budget threshold.
Saturation: how "full" your service is. CPU utilization, memory pressure, thread-pool queue depth, disk I/O wait. Saturation often predicts latency degradation before users notice it. When saturation hits 100% the service is at or beyond capacity; at a sustained 80% you should be thinking about scaling.
| Signal | What it measures | Metric type | Alert when... |
|---|---|---|---|
| Latency | Request duration (p50 / p99) | Histogram | p99 > SLO threshold |
| Traffic | Request rate (req/s) | Counter (rate) | Sudden drop or unexpected spike |
| Errors | Error rate (% of traffic) | Counter (rate) | Error % > error budget threshold |
| Saturation | Resource fullness (CPU, memory, queue) | Gauge | Utilization > 80% sustained |
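As a rough illustration of the arithmetic behind the four signals, the following Python sketch computes them from one window of raw request records. Everything here is an assumption made for the example: the Request type, the worker-pool notion of saturation, and the nearest-rank percentile. In a real pipeline these numbers would come from counters, histograms, and gauges rather than from a list of requests.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_s: float
    status: int


def nearest_rank(sorted_values: List[float], q: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   busy_workers: int, total_workers: int) -> Dict[str, float]:
    failed = [r for r in requests if r.status >= 500]
    # Latency of successful requests only, so fast errors do not mask slowness.
    ok_latencies = sorted(r.duration_s for r in requests if r.status < 500)
    return {
        "traffic_rps": len(requests) / window_s,
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        "p99_latency_s": nearest_rank(ok_latencies, 0.99) if ok_latencies else math.nan,
        "saturation": busy_workers / total_workers,
    }


window = [Request(0.1, 200), Request(0.2, 200), Request(0.3, 200),
          Request(0.4, 200), Request(0.01, 500)]
print(golden_signals(window, window_s=10.0, busy_workers=8, total_workers=10))
```

Note how the 10 ms error is excluded from the latency figure: including it would pull the percentiles down and make the service look faster than it is for users who actually got a response.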
Labels and Cardinality
A raw metric value is rarely useful on its own. Labels (also called dimensions or tags) let you slice one metric into many views. For example, http_requests_total{method="GET", status="200", endpoint="/api/users"} uses three labels. Each unique combination of label values creates a separate time series in the database.
Cardinality is the number of unique time series a metric generates, and it is the most common scaling trap in metrics pipelines. Adding a label with high cardinality (user ID, IP address, session token, request UUID) can explode a metric from 100 series to 10 million, overwhelming your TSDB. The rule: labels should identify categories, not identities. Good labels: region, http_method, status_code. Bad labels: user_id, trace_id, raw_url.
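The multiplication behind that explosion is easy to check. The sketch below estimates the worst-case series count for a metric as the product of the distinct values each label can take; the label names and counts are hypothetical, and real cardinality is usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def worst_case_series(distinct_values: Dict[str, int]) -> int:
    """Upper bound on time series: one per combination of label values."""
    return prod(distinct_values.values())


# Hypothetical label set for http_requests_total.
labels = {"method": 5, "status": 10, "endpoint": 2}
print(worst_case_series(labels))  # 100

# Adding one identity-style label multiplies everything that came before it.
with_user_id = {**labels, "user_id": 100_000}
print(worst_case_series(with_user_id))  # 10000000
```

Running this kind of estimate before adding a label is a cheap way to catch a cardinality problem at review time instead of in production.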
Aggregation and Rollups
A time-series database accumulates raw samples at the scrape interval (typically 15 s or 30 s). Querying six months of full-resolution data for a dashboard would be painfully slow, so many metrics stacks apply rollups: older data is downsampled to coarser resolutions (1-minute averages, then 5-minute, then hourly). Thanos, for example, does this in its compactor component, and InfluxDB 1.x does it with continuous queries. The trade-off is resolution: a one-hour average cannot show you a 30-second spike that happened two months ago, because it has been averaged away.
When aggregating across label dimensions with PromQL, use sum by (region) (rate(...)) to sum per region, or sum without (instance) (rate(...)) to drop the per-instance dimension and get a fleet-wide total. Aggregating a counter requires rate() first: summing raw counter values across instances gives you a meaningless number that drops unpredictably whenever any one instance restarts.
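The loss of resolution is easy to demonstrate. This Python sketch downsamples one hour of hypothetical 15-second latency samples into a single bucket, once with an average and once with a maximum; the function name and data are made up for the illustration and do not mirror any particular TSDB's rollup code.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def downsample(samples: List[Sample], bucket_s: int,
               agg: Callable[[List[float]], float]) -> List[Sample]:
    """Group samples into fixed-width time buckets and aggregate each one."""
    buckets: Dict[int, List[float]] = {}
    for ts, value in samples:
        buckets.setdefault(ts - ts % bucket_s, []).append(value)
    return [(start, agg(values)) for start, values in sorted(buckets.items())]


# One hour of samples every 15 s: steady 100 ms latency with a 30 s spike.
raw = [(ts, 100.0) for ts in range(0, 3600, 15)]
raw[120] = (raw[120][0], 2000.0)
raw[121] = (raw[121][0], 2000.0)

print(downsample(raw, 3600, mean))  # the spike shrinks to a ~116 ms average
print(downsample(raw, 3600, max))   # a max rollup still shows 2000 ms
```

Storing more than one aggregate per bucket (for example min, max, and average) is one way to keep spikes visible after downsampling, at the cost of extra storage.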
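The rate-then-sum ordering can be mimicked in plain Python. In this sketch the helper names, instance names, and counter values are all hypothetical; each instance contributes a counter reading at the start and end of a 60-second window, one instance restarts in between, and the per-instance rates are summed by region afterwards.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def rate(first: float, last: float, window_s: float) -> float:
    """Per-second rate between two readings, treating a drop as a restart."""
    increase = last - first if last >= first else last
    return increase / window_s


def sum_by(series: List[Tuple[Labels, float]], label: str) -> Dict[str, float]:
    """Sum values that share the same value for `label`, dropping other labels."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


# (labels, counter at t=0, counter at t=60)
readings = [
    ({"instance": "api-1", "region": "eu"}, 1000.0, 1600.0),
    ({"instance": "api-2", "region": "eu"}, 5000.0, 300.0),  # restarted
    ({"instance": "api-3", "region": "us"}, 200.0, 800.0),
]

rates = [(labels, rate(first, last, 60.0)) for labels, first, last in readings]
print(sum_by(rates, "region"))  # {'eu': 15.0, 'us': 10.0}
```

Summing the raw eu counters first would give 6000 at the start and 1900 at the end, a drop that no longer corresponds to any single restart and cannot be corrected after the fact.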
Pull vs Push Collection: The Prometheus Model
Metrics collection architectures split into two camps. In the push model (StatsD, Graphite, InfluxDB line protocol), each instrumented process sends metric samples to a central collector at an interval it controls. In the pull model (Prometheus), the collector scrapes an HTTP endpoint on each target at an interval it controls.
Prometheus's pull approach has notable advantages: the collector can detect a target that stopped responding (scrape failure = alert); targets expose their own debug endpoint that humans can curl for instant inspection; and the scrape rate is uniform, making rate calculations reliable. The downside is that pull requires the target to be reachable by the collector, so ephemeral batch jobs and serverless functions need a Pushgateway to bridge the gap.
```yaml
# Prometheus scrape config: pull model example
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s        # Prometheus pulls every 15 seconds
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
    metrics_path: /metrics      # standard Prometheus exposition endpoint
  - job_name: 'batch-jobs'      # ephemeral jobs can't be scraped directly
    honor_labels: true          # keep the job/instance labels the jobs pushed
    static_configs:
      - targets: ['pushgateway:9091']  # they push to Pushgateway instead

# Resulting metric example (Prometheus text format on /metrics):
# http_requests_total{method="GET",status="200",endpoint="/api/users"} 1847
# http_request_duration_seconds_bucket{le="0.05"} 1623
# http_request_duration_seconds_bucket{le="0.25"} 1790
# http_request_duration_seconds_bucket{le="1.0"} 1842
# http_request_duration_seconds_bucket{le="+Inf"} 1847
```
Metrics vs Logs vs Traces
The three pillars of observability serve complementary, non-interchangeable roles. Metrics give you aggregated numeric health at low cost, which makes them ideal for dashboards, capacity planning, and alerting. They cannot tell you which specific request failed or why. Logs give you a verbatim event trail: rich context, but expensive to store and slow to query at scale. Traces (distributed tracing with tools like Jaeger or Tempo) follow a single request across microservice boundaries and reveal where latency was spent, but they typically sample only a fraction of requests and, per request captured, can cost as much as or more than logs.
A practical workflow: your alert fires on a metric threshold (p99 latency > 500 ms). You open a dashboard and narrow down which endpoint or region is affected, still using metrics. You then jump to logs filtered by that endpoint and time window to find an error message. Finally, you pull a trace ID from the log and load the flame graph in your tracing tool to pinpoint the slow database call. Metrics get you paged; logs and traces get you to root cause.
| | Metrics | Logs | Traces |
|---|---|---|---|
| Data shape | Numeric time series | Structured / unstructured text events | Spans with parent-child relationships |
| Storage cost | Very low | High | High (sampled) |
| Query speed | Very fast (TSDB) | Slow at scale | Medium (by trace ID) |
| Best for | Alerting, dashboards, SLOs | Debugging specific errors | Latency attribution across services |
| Sampling | All data (aggregated) | All events (or sampled) | Typically 1 to 10% of requests |
| Tools | Prometheus, Datadog, CloudWatch | Elasticsearch, Loki, Splunk | Jaeger, Tempo, Zipkin, X-Ray |
Frequently Asked Questions
What is the difference between a counter and a gauge in Prometheus?
A counter only ever increases (or resets to zero when the process restarts), so you always compute its rate() or increase() over a time window to get meaningful numbers. A gauge can go up or down freely and represents a current state, so you read its instantaneous value directly. Mixing the two up produces misleading graphs: rate() treats any decrease as a counter reset, so applying it to a gauge that legitimately goes down yields inflated numbers, while exposing a running total as a gauge invites dashboards to plot the raw value instead of its rate. Always match the type to the semantics: counters for accumulations, gauges for snapshots.
Why is high cardinality a problem in metrics?
Every unique combination of label values creates a separate time series that the TSDB must index, store, and query. A metric with a user_id label on a platform with one million users creates one million time series for that single metric. This can exhaust RAM in Prometheus (which indexes all active series in memory), spike ingestion cost in managed services like Datadog or Grafana Cloud, and slow down queries. The fix is to use low-cardinality labels that group data into a bounded number of categories (e.g., status_code has ~10 values, not millions) and move high-cardinality analysis to logs or traces.
Can metrics replace logs?
No, they are complementary. Metrics aggregate numeric signals across all requests and are excellent for detecting that something is wrong: error rate is 5%, p99 latency doubled, queue depth is growing. But they discard the individual event data you need to understand why it is wrong: which specific user got an error, what the stack trace was, which SQL query was slow. Logs preserve that event-level detail at higher storage cost. In practice, a healthy observability stack uses all three pillars: metrics for alerting, logs for investigation, traces for latency attribution, each doing the job it is cheapest and most accurate at.
Metrics are the heartbeat of a production system: cheap to collect, fast to query, and the first thing that tells you something is wrong. Build your golden signals first, guard cardinality fiercely, and let logs and traces answer the questions metrics cannot.
- alokknight Engineering
