Logging in System Design: Structured Logs, Aggregation, Correlation IDs & the ELK Stack (Visualized)
Logging is the practice of recording timestamped events from your software so engineers can debug issues, trace user journeys, and understand system behaviour. This guide covers structured vs unstructured logs, log levels, centralized aggregation with the ELK/EFK stack, correlation IDs, sampling, retention, cost, and how logs fit alongside metrics and traces.
Logging is the practice of writing discrete, timestamped records of events that occur inside a running system (what happened, when, where, and why) so engineers can later reconstruct the state of the application for debugging, auditing, and performance analysis.
Every production system emits logs: a web server records each HTTP request, an order service records payment attempts, a database driver records slow queries. Without logs, a bug in production is essentially invisible. With rich, well-structured logs piped into a central store, on-call engineers can answer the question "what exactly happened at 3 AM?" in minutes rather than hours.
What Is a Log Line?
A log line is a single record emitted at a specific moment in time. At minimum it carries a timestamp, a severity level, and a message. In practice it also carries metadata: the service name, host, request ID, user ID, and any other fields relevant to the event. The richer the metadata, the faster you can slice and filter logs when something goes wrong.
Structured vs Unstructured Logs
Unstructured logs are plain-text lines designed for humans to read: [2024-03-31 09:12:44] ERROR: Payment failed for user 8821 - timeout. They are easy to write but painful to query at scale because machines cannot reliably parse them. Structured logs emit every field as a typed key-value pair, almost always serialized as JSON. A log aggregator can then index every field individually, letting you filter by user_id=8821, level=ERROR, or duration_ms>500 quickly, even across billions of events.
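One way to emit structured logs is sketched below using Python's standard logging module. The class name JsonFormatter, the helper build_logger, and the list of extra fields are illustrative assumptions, not part of any particular library or of this article's own tooling.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical set of optional fields copied from the record when present.
    EXTRA_FIELDS = ("trace_id", "user_id", "order_id", "duration_ms")

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC, e.g. 2024-03-31T09:12:44Z
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-svc", sys.stdout)
    log.error(
        "Payment gateway timeout",
        extra={"user_id": 8821, "order_id": "ORD-992", "duration_ms": 3021},
    )
```

Keeping numeric fields such as user_id and duration_ms as numbers rather than strings is what lets an aggregator run range queries on them. The JSON record below shows the kind of line such a formatter would produce.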
{
  "timestamp": "2024-03-31T09:12:44Z",
  "level": "ERROR",
  "service": "payment-svc",
  "host": "pod-7f9d4",
  "trace_id": "abc123xyz",
  "user_id": 8821,
  "order_id": "ORD-992",
  "duration_ms": 3021,
  "message": "Payment gateway timeout"
}
Log Levels: Filtering the Noise
Every log line carries a severity level that tells readers how important the event is. The standard hierarchy, from noisiest to most severe, is DEBUG → INFO → WARN → ERROR → FATAL. In production you typically configure the log pipeline to suppress DEBUG entirely (too noisy and expensive), surface INFO for normal operation, and alert on ERROR and above. Raising the minimum log level is the fastest way to cut log volume and cost during a traffic spike, and if the level is read from runtime configuration it requires no new deployment.
| Level | When to use | Example event |
|---|---|---|
| DEBUG | Detailed tracing for developers during development; suppress in production | Entering function calculateTax with args {amount:99} |
| INFO | Normal operational events worth recording | User 8821 signed in; order ORD-992 created |
| WARN | Unexpected but recoverable situations that may need attention | Payment retry 2 of 3 due to transient timeout |
| ERROR | Failures that affect a specific request or operation | Payment gateway timeout after 3 attempts for ORD-992 |
| FATAL | System-level failure that causes a process to exit | Database connection pool exhausted; service shutting down |
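The effect of a minimum level can be sketched with Python's standard logging module. Note that Python names the WARN and FATAL levels WARNING and CRITICAL; the helper names make_logger and emit_one_of_each are made up for this illustration.

```python
import io
import logging


def make_logger(name: str, stream, level: int) -> logging.Logger:
    """Build a logger that drops every record below `level`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


def emit_one_of_each(logger: logging.Logger) -> None:
    """Emit one event per severity, mirroring the table above."""
    logger.debug("Entering calculate_tax with amount=99")
    logger.info("Order ORD-992 created")
    logger.warning("Payment retry 2 of 3 due to transient timeout")
    logger.error("Payment gateway timeout after 3 attempts for ORD-992")
    logger.critical("Database connection pool exhausted")


if __name__ == "__main__":
    out = io.StringIO()
    log = make_logger("order-svc", out, logging.INFO)
    emit_one_of_each(log)          # DEBUG is dropped; four lines are written
    log.setLevel(logging.WARNING)  # raise the threshold while the process runs
    emit_one_of_each(log)          # DEBUG and INFO are dropped; three more lines
    print(out.getvalue(), end="")
```

Because setLevel takes effect immediately, a service that re-reads its level from configuration can shed INFO volume during a spike without a redeploy.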
Centralized Log Aggregation: The ELK and EFK Stacks
In a microservices architecture, dozens of services running on hundreds of pods each write their own logs. SSH-ing into every machine to read log files is impractical. Centralized log aggregation solves this by collecting all logs in one place where they can be searched and analyzed together. The two dominant open-source stacks are:
ELK Stack: Elasticsearch (distributed full-text search and analytics store), Logstash (log ingestion and transformation pipeline), and Kibana (visualization and querying UI).
EFK Stack: replaces Logstash with Fluent Bit or Fluentd, which are lighter-weight log shippers better suited to Kubernetes environments.
Both stacks follow the same pattern: collect → parse → index → query.
The log shipper (Fluent Bit, Filebeat, or Logstash) runs as a DaemonSet on every node or as a sidecar in each pod. It tails log files or reads from stdout, parses fields, enriches each event with pod metadata (namespace, pod name, labels), and forwards batches to Elasticsearch. Kibana then provides a query UI where engineers can search across all services simultaneously; for instance, finding every log line for a single HTTP request, often in under a second.
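The shipper's parse, enrich, and batch steps can be sketched in a few lines of Python. This is a toy model of the pattern, not how Fluent Bit or Filebeat is implemented; the function names and the "kubernetes" field name are assumptions. The final step formats a batch in the newline-delimited layout that Elasticsearch's _bulk endpoint expects, and the HTTP call itself is left out.

```python
import json
from typing import Iterable, Iterator, List


def parse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON log lines; wrap unparseable lines instead of dropping them."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
            if not isinstance(event, dict):
                raise ValueError("log line is not a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            event = {"message": line, "parse_error": True}
        yield event


def enrich(events: Iterable[dict], pod_meta: dict) -> Iterator[dict]:
    """Attach pod metadata under a hypothetical 'kubernetes' key."""
    for event in events:
        yield {**event, "kubernetes": pod_meta}


def batches(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group events into lists of at most `size` items."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_bulk_body(batch: List[dict], index: str) -> str:
    """Build an NDJSON bulk body: one action line, then one document line."""
    lines = []
    for event in batch:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"  # the bulk format requires a trailing newline


if __name__ == "__main__":
    raw = ['{"level": "INFO", "message": "order created"}', "plain text line"]
    meta = {"namespace": "shop", "pod": "pod-7f9d4"}
    for batch in batches(enrich(parse_lines(raw), meta), size=500):
        print(to_bulk_body(batch, "logs-2024.03.31"), end="")
```

Batching matters because one bulk request carrying hundreds of events is far cheaper for the index than hundreds of single-document writes.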
Correlation IDs: Stitching a Request Across Services
In a microservices system, a single user request might touch an API gateway, an auth service, an order service, and a payment service, each writing its own log lines. A correlation ID (also called a trace ID or request ID) is a unique token generated at the edge of the system for each incoming request. Every service that handles that request attaches the same ID to every log line it emits. In Kibana you can then filter on trace_id=abc123xyz and instantly see the entire journey of that one request across all services, in chronological order.
Correlation IDs are generated at the API gateway (or a middleware layer) and propagated downstream via HTTP headers such as X-Request-ID or X-B3-TraceId. Every service reads the header on incoming requests and writes it into its own log context, typically via a logging middleware that injects the ID into a thread-local or async context so all log calls in the request's handling chain automatically include it.
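A minimal sketch of that middleware idea, using Python's contextvars and a logging filter, might look like this. The function names begin_request, outgoing_headers, and configure_logger are hypothetical, and headers are modeled as a plain dict with exact-case keys, which real HTTP frameworks do not guarantee.

```python
import contextvars
import logging
import sys
import uuid

# Header name taken from the text; a real system might standardize on another.
TRACE_HEADER = "X-Request-ID"

# Holds the current request's ID; each thread or asyncio task sees its own value.
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamp every record with the trace ID of the request being handled."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _trace_id.get()
        return True


def begin_request(headers: dict) -> str:
    """Reuse the incoming ID if present; otherwise mint one (the edge case)."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    """Headers to attach to downstream calls so the ID keeps propagating."""
    return {TRACE_HEADER: _trace_id.get()}


def configure_logger(name: str, stream) -> logging.Logger:
    """Build a logger whose every line starts with the current trace ID."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = configure_logger("order-svc", sys.stdout)
    begin_request({"X-Request-ID": "abc123xyz"})
    log.info("order ORD-992 created")  # line is prefixed with abc123xyz
    print(outgoing_headers())          # pass these to the payment service call
```

Application code never passes the ID around explicitly; the filter reads it from the context, so a forgotten argument cannot produce an unstamped log line.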
Sampling, Retention, Cost, and PII Concerns
At scale, logging costs real money. A service handling 100,000 requests per second, each emitting ten INFO lines, produces one million log events per second. At typical cloud log ingestion prices this costs thousands of dollars per day. Teams manage this with four techniques:
Log level control: suppress DEBUG in production and suppress INFO during traffic spikes.
Sampling: instead of logging every successful request, log a random 1% sample of INFO-level events while logging 100% of errors.
Retention policies: keep detailed logs for 7 to 30 days in a hot store (Elasticsearch) and archive to cheap object storage (S3 Glacier) for compliance.
PII scrubbing: before logs reach the central store, a pipeline stage should redact or hash fields that contain personal data (email addresses, credit card numbers, social security numbers) to comply with GDPR and CCPA.
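Two of these techniques, sampling and PII scrubbing, can be sketched as follows. The class and function names are illustrative, and the regular expressions are deliberately simple assumptions: they will miss some formats and can flag harmless numbers (a 13-digit millisecond timestamp, for example), so treat them as a starting point rather than a compliant redactor.

```python
import logging
import random
import re


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-higher record; keep lower levels with probability `rate`."""

    def __init__(self, rate: float, rng: random.Random = None):
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never sample away warnings or errors
        return self.rng.random() < self.rate


# Illustrative patterns only; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def redact(text: str) -> str:
    """Replace email addresses, US SSNs, and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    text = CARD_RE.sub("[CARD]", text)
    return text


if __name__ == "__main__":
    print(redact("Charge failed for jane@example.com, card 4111 1111 1111 1111"))
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("payment-svc")
    log.addFilter(SamplingFilter(rate=0.01))  # roughly 1% of INFO lines survive
    for _ in range(1000):
        log.info("payment ok")
    log.error("payment failed")               # always emitted
```

Random per-line sampling is the simplest scheme; sampling per request (keeping or dropping all lines that share a trace ID) preserves complete request histories at the same volume.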
Logs vs Metrics vs Traces: The Observability Triangle
Logs are one of the three pillars of observability. Understanding how they differ from metrics and traces helps you choose the right tool for each question.
Logs answer "what happened?": they record discrete events with full context.
Metrics answer "how is the system behaving over time?": they are numeric time-series values (request rate, error rate, p99 latency) that are cheap to store and easy to alert on, but carry no per-request detail.
Traces answer "how long did each step of this request take?": a distributed trace records spans for every service call in a request's path, forming a tree that reveals latency bottlenecks at the code level.
| Pillar | What it records | Best for | Typical tool |
|---|---|---|---|
| Logs | Discrete timestamped events with full context | Debugging specific failures, auditing, root cause analysis | Elasticsearch / Loki / CloudWatch Logs |
| Metrics | Numeric aggregates over time (counters, gauges, histograms) | Alerting, capacity planning, dashboards | Prometheus + Grafana / Datadog |
| Traces | Spans forming a tree of a request across services | Latency profiling, dependency mapping | Jaeger / Zipkin / OpenTelemetry |
| Combined | Correlate logs + metrics + traces via shared trace_id | Full end-to-end incident investigation | Grafana Stack / Datadog / Elastic APM |
Best Practices Summary
To build a robust logging system: (1) Emit structured JSON logs from every service. (2) Always include timestamp, level, service, trace_id, and a human-readable message. (3) Propagate correlation IDs end-to-end via HTTP headers. (4) Set the minimum log level per environment: DEBUG locally, INFO in staging, and INFO or WARN in production depending on volume. (5) Use a centralized collector (Fluent Bit, Filebeat) feeding Elasticsearch or a managed service. (6) Define retention policies and automate PII redaction in the pipeline. (7) Alert on error rates, not individual log lines: use metrics derived from logs for alerting, and the logs themselves for investigation.
Frequently Asked Questions
What is the difference between logging and monitoring?
Logging records discrete events (a user signed in, a payment failed) with full context. Monitoring tracks numeric health signals over time (CPU, request rate, error percentage) and triggers alerts when thresholds are breached. In practice you need both: monitoring tells you that something is wrong, and logs tell you why. Many teams derive metrics from logs (counting ERROR lines per minute) so the two systems share the same data source.
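Deriving a metric from logs can be as simple as the sketch below, which counts ERROR-and-above lines per minute. It assumes JSON lines with ISO 8601 UTC timestamps like those shown earlier in this article; the function name and threshold are made up for the example.

```python
import json
from collections import Counter
from typing import Dict, Iterable

# Severities counted as errors; an assumption for this sketch.
ERROR_LEVELS = {"ERROR", "FATAL"}


def errors_per_minute(lines: Iterable[str]) -> Dict[str, int]:
    """Count error events per minute from structured JSON log lines."""
    counts: Counter = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("level") in ERROR_LEVELS:
            # "2024-03-31T09:12:44Z"[:16] gives the minute bucket "2024-03-31T09:12"
            counts[event["timestamp"][:16]] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        '{"timestamp": "2024-03-31T09:12:44Z", "level": "ERROR", "message": "timeout"}',
        '{"timestamp": "2024-03-31T09:12:51Z", "level": "INFO", "message": "ok"}',
        '{"timestamp": "2024-03-31T09:12:59Z", "level": "ERROR", "message": "timeout"}',
        '{"timestamp": "2024-03-31T09:13:02Z", "level": "FATAL", "message": "pool exhausted"}',
    ]
    per_minute = errors_per_minute(sample)
    threshold = 1  # hypothetical alert threshold
    for minute, count in sorted(per_minute.items()):
        status = "ALERT" if count > threshold else "ok"
        print(minute, count, status)
```

The resulting series is a metric in the monitoring sense: cheap to store and alert on, while the underlying log lines remain available for the "why".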
Why use structured (JSON) logs instead of plain text?
Plain-text logs are readable to humans but require fragile regex parsing to extract fields, which breaks whenever the message format changes. Structured JSON logs give every field a consistent name and type, allowing log aggregators to index them directly. You can then filter by user_id, group by service, or compute the average duration_ms across a time window. Such queries would be impractical on unstructured text at any meaningful scale.
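The queries mentioned above become short, regex-free functions once every line is JSON. The sketch below runs them in memory for illustration; a real aggregator would do the same work against an index. The helper names are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def load_events(lines: Iterable[str]) -> List[dict]:
    """Parse non-empty JSON log lines into dictionaries."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_events(events: Iterable[dict], **criteria) -> List[dict]:
    """Keep events whose fields equal every given criterion, e.g. user_id=8821."""
    return [e for e in events if all(e.get(k) == v for k, v in criteria.items())]


def avg_duration_by_service(events: Iterable[dict]) -> Dict[str, float]:
    """Group by service and average duration_ms, skipping events without it."""
    durations = defaultdict(list)
    for event in events:
        if "duration_ms" in event:
            durations[event["service"]].append(event["duration_ms"])
    return {service: mean(values) for service, values in durations.items()}


if __name__ == "__main__":
    lines = [
        '{"service": "payment-svc", "level": "ERROR", "user_id": 8821, "duration_ms": 3021}',
        '{"service": "payment-svc", "level": "INFO", "user_id": 17, "duration_ms": 979}',
        '{"service": "order-svc", "level": "INFO", "user_id": 8821, "duration_ms": 40}',
    ]
    events = load_events(lines)
    print(filter_events(events, user_id=8821, level="ERROR"))
    print(avg_duration_by_service(events))
```

Because duration_ms arrives as a number, averaging needs no parsing step; with plain text, each of these functions would start with a format-specific regular expression.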
How do you reduce logging costs without losing visibility?
The most effective levers are: raise the minimum log level to WARN in production for high-volume services; sample INFO events (log 1 to 5% of successful requests, 100% of errors); set aggressive hot-tier retention (7 days in Elasticsearch) and archive older data to S3 at a fraction of the cost; and use log-derived metrics (counters in Prometheus) for alerting instead of querying raw logs. Together these can cut log storage costs by 80 to 95% while preserving full detail for errors and a representative sample of normal operations.
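A back-of-envelope calculation shows how sampling alone moves the numbers. It uses this article's figure of one million events per second; the average line size (500 bytes), the ingestion price (0.50 USD per GB), and the 1% error share are assumptions chosen for illustration, not quoted vendor prices.

```python
SECONDS_PER_DAY = 86_400


def daily_ingest_gb(events_per_sec: float, avg_bytes: float) -> float:
    """Daily log volume in gigabytes (1 GB = 1e9 bytes)."""
    return events_per_sec * avg_bytes * SECONDS_PER_DAY / 1e9


def sampled_events_per_sec(total: float, error_fraction: float, info_rate: float) -> float:
    """Events kept per second when all errors and a sample of the rest are logged."""
    errors = total * error_fraction
    normal = total - errors
    return errors + normal * info_rate


if __name__ == "__main__":
    events = 1_000_000   # 100,000 requests/s x 10 lines each (from the article)
    avg_bytes = 500      # assumed average size of one JSON log line
    price_per_gb = 0.50  # assumed ingestion price in USD

    baseline_gb = daily_ingest_gb(events, avg_bytes)
    kept = sampled_events_per_sec(events, error_fraction=0.01, info_rate=0.05)
    sampled_gb = daily_ingest_gb(kept, avg_bytes)
    reduction = 1 - sampled_gb / baseline_gb

    print(f"baseline: {baseline_gb:,.0f} GB/day, {baseline_gb * price_per_gb:,.0f} USD/day")
    print(f"sampled:  {sampled_gb:,.0f} GB/day, {sampled_gb * price_per_gb:,.0f} USD/day")
    print(f"reduction: {reduction:.1%}")
```

Under these assumptions the baseline is 43,200 GB per day and a 5% sample of non-error events removes about 94% of it, which is in line with the 80 to 95% range above. The result is sensitive to the error share: if most traffic is errors, sampling saves little.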
Logs are the memory of your system. Make them structured, ship them centrally, stamp every line with a correlation ID, and you will find any bug in minutes rather than hours.
alokknight Engineering
