Data Pipelines in System Design: Stages, Orchestration, Batch vs Streaming (Visualized)
A data pipeline moves raw records from one or more sources through a series of transformation steps into a destination where they can be queried or acted on. This guide covers pipeline stages, batch vs streaming, DAG orchestration, idempotency, retries, and data quality โ with live animations of each idea.
A data pipeline is an automated sequence of stages that ingests raw data from one or more sources, applies transformations and quality checks, and delivers clean, structured records to a destination system where they can be stored, analyzed, or acted on. It is the backbone of every analytics platform, ML feature store, and event-driven application at scale.
Without a pipeline, data stays siloed in databases, message queues, third-party APIs, and log files. A pipeline bridges those systems reliably โ handling retries when a source is temporarily down, deduplicating records so downstream counts stay correct, and enforcing a schema so consumers never receive surprising nulls or type mismatches.
The Core Pipeline Stages: Ingest โ Transform โ Sink
Every pipeline โ no matter how complex โ reduces to three logical stages. Ingest (also called Extract in ETL) reads raw data from the source: a relational database, an S3 bucket, a Kafka topic, a REST API, or a CDC stream from a transaction log. Transform cleans, enriches, joins, and reshapes that data โ dropping duplicates, casting types, computing derived columns, and applying business rules. Sink (Load) writes the final records to the destination: a data warehouse like BigQuery or Redshift, an Elasticsearch index, a feature store, or another Kafka topic for the next stage to consume.
In the classic ETL pattern, data is transformed before it lands in the warehouse. The modern ELT pattern (popularised by tools like dbt) loads raw data first and runs SQL-based transformations inside the warehouse, leveraging its compute instead of a separate cluster. ELT is now the dominant pattern for analytical pipelines because cloud warehouses are cheap and SQL is universally understood by data teams.
Batch Pipelines vs Streaming Pipelines
The most fundamental design choice in a pipeline is whether to process data in batches or as a continuous stream. A batch pipeline wakes up on a schedule (hourly, nightly, weekly), reads a bounded chunk of data accumulated since the last run, processes it, and goes back to sleep. A streaming pipeline subscribes to an event source (such as a Kafka topic) and processes each record as it arrives, often within milliseconds.
Batch is simpler to build, easier to debug, and cost-efficient when freshness requirements are relaxed โ a daily sales report doesn't need sub-second latency. Streaming is necessary when the business requires real-time decisions: fraud detection, live leaderboards, IoT anomaly alerts, or user-facing personalisation. Many real-world systems run a Lambda architecture: a fast streaming layer for recent data and a batch layer that reprocesses history to correct errors, with a serving layer that merges both views.
| Batch Pipeline | Streaming Pipeline | |
|---|---|---|
| Trigger | Scheduled (cron / time window) | Event-driven (new record arrives) |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (bulk reads/writes) | High, but per-event overhead |
| Complexity | Low โ simple job runs | Higher โ state, watermarks, exactly-once |
| Use cases | Nightly reports, ML training, backfills | Fraud detection, live dashboards, alerts |
| Tools | Spark, Hadoop, dbt, Airflow | Kafka Streams, Flink, Spark Structured Streaming |
Orchestration with DAGs
Real pipelines are not a single task โ they are a graph of tasks, where some tasks can run in parallel and others must wait for upstream dependencies to succeed. Orchestrators like Apache Airflow, Prefect, and Dagster model this as a Directed Acyclic Graph (DAG): each node is a task, each edge is a dependency. The orchestrator schedules tasks respecting dependency order, tracks their state (pending, running, success, failed), handles retries, and sends alerts when something goes wrong.
A typical Airflow DAG for a nightly analytics pipeline might look like: extract_from_postgres and extract_from_s3 run in parallel (no dependency between them); both feed into load_to_warehouse; which then triggers run_dbt_models; finally send_report_email runs last. If extract_from_postgres fails, Airflow retries it up to three times before marking the DAG run failed and paging on-call.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'data-eng',
'retries': 3,
'retry_delay': timedelta(minutes=5),
'email_on_failure': True,
}
with DAG(
dag_id='nightly_analytics',
schedule_interval='0 2 * * *', # 2 AM every night
start_date=datetime(2024, 1, 1),
catchup=True, # backfill missed runs
default_args=default_args,
) as dag:
extract_pg = PythonOperator(
task_id='extract_from_postgres',
python_callable=extract_postgres,
op_kwargs={'ds': '{{ ds }}'}, # logical date for idempotency
)
extract_s3 = PythonOperator(
task_id='extract_from_s3',
python_callable=extract_s3_files,
op_kwargs={'ds': '{{ ds }}'},
)
load_warehouse = PythonOperator(
task_id='load_to_warehouse',
python_callable=load_to_bigquery,
)
run_dbt = SQLExecuteQueryOperator(
task_id='run_dbt_models',
conn_id='bigquery_default',
sql='CALL run_dbt_daily();',
)
# dependency graph
[extract_pg, extract_s3] >> load_warehouse >> run_dbtIdempotency, Retries, and Backfills
Network calls fail. Third-party APIs return 500s. Database connections drop. A production pipeline must handle these failures gracefully โ which is why idempotency is one of the most important properties to design for. An idempotent task produces the same result no matter how many times it is executed for the same input parameters. A nightly extract task is idempotent if it uses a deterministic output path keyed on the date, for example s3://bucket/raw/events/date=2024-04-04/: re-running it overwrites exactly that partition, so running it twice is harmless.
A backfill is a special pipeline run that processes historical data โ for example, when you add a new transformation or fix a bug and need to recompute the last six months of records. Orchestrators like Airflow support backfills natively through the catchup=True setting and the --reset-dagruns CLI flag. Idempotency makes backfills safe: you re-run any date range without risking double-counting or corrupt state.
Retry strategies should use exponential backoff with jitter to avoid thundering-herd effects when many tasks fail simultaneously and hammer the same external service. A typical policy: retry 3 times, waiting 1 min, 5 min, 20 min between attempts, then page on-call. For transient network errors this is usually enough; for upstream data unavailability, a sensor task that waits for the source to become ready is preferable to infinite retries.
Data Quality Checks
Even a perfectly reliable pipeline can propagate bad data: NULL primary keys, out-of-range values, duplicate rows, or referential integrity violations. Data quality gates are assertions that run between pipeline stages and halt or quarantine a run if they fail. The tool dbt has built-in tests (not_null, unique, accepted_values, relationships) that run after each model transformation. Tools like Great Expectations and Soda Core let you write Python-native expectations that run as Airflow tasks.
Common checks to embed in every pipeline: (1) row count is within expected bounds (not 0, not 10ร normal โ both signal problems); (2) no NULLs in mandatory columns; (3) date ranges are as expected (no records from years in the future or distant past); (4) referential integrity โ every foreign key exists in the parent table; (5) distribution checks โ the fraction of events per category hasn't shifted dramatically (a common signal of an upstream schema change).
Key Tools in the Data Pipeline Ecosystem
The ecosystem has consolidated around a handful of battle-tested open-source tools. Apache Airflow is the dominant DAG orchestrator: Python-defined DAGs, a rich UI for monitoring, and thousands of provider integrations. Apache Kafka is the de-facto distributed event log for streaming pipelines โ producers write events to partitioned topics, consumers read at their own pace, and data is retained for days so consumers can replay. Apache Spark is the standard engine for large-scale distributed transformations โ both batch and structured streaming. dbt (data build tool) has revolutionised the Transform layer by letting analysts write SQL SELECT statements that dbt compiles into full CREATE TABLE AS SELECT or CREATE VIEW statements, with lineage tracking and built-in testing built in.
| Tool | Role | Key Strength | When to use |
|---|---|---|---|
| Apache Airflow | Orchestration | Python DAGs, backfill, UI | Batch scheduling, complex dependency graphs |
| Apache Kafka | Message broker / event log | High-throughput, durable, replayable | Streaming ingest, decoupling producers & consumers |
| Apache Spark | Distributed compute engine | Batch + streaming, massive scale | Large-scale ETL, ML feature engineering |
| dbt | SQL transformation layer | Lineage, testing, docs, incremental models | Transforming data already in the warehouse |
| Apache Flink | Stateful stream processing | Exactly-once semantics, event time | Low-latency aggregations, complex event processing |
| Prefect / Dagster | Modern orchestration | Python-native, observability, cloud-hosted | Teams wanting an Airflow alternative with better DX |
Frequently Asked Questions
What is the difference between ETL and ELT?
ETL (Extract, Transform, Load) transforms data in a separate compute layer before writing it to the destination โ historically necessary because storage in data warehouses was expensive and you only wanted to store clean, shaped data. ELT (Extract, Load, Transform) loads raw data into the warehouse first and runs SQL-based transformations inside it using tools like dbt. ELT has become the norm in the cloud era because storage is cheap, warehouse compute is powerful and scalable, and keeping raw data means you can always re-derive derived tables from scratch if your transformation logic changes.
How do you prevent duplicate records in a data pipeline?
Duplicates enter pipelines in two main ways: retried tasks re-writing records, and sources delivering events more than once (Kafka's at-least-once default). The primary defence is idempotent writes โ use an INSERT ... ON CONFLICT DO NOTHING or MERGE statement keyed on a natural primary key, or partition your output by a deterministic key (such as the event date) so a re-run overwrites the same partition rather than appending. In Kafka Streams and Flink, enabling exactly-once semantics with distributed transactions prevents duplicates end-to-end. For batch pipelines, a deduplication step using a ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) window function is a reliable final safeguard.
When should I choose streaming over batch?
Choose streaming when your business has a low-latency requirement โ fraud must be detected within seconds of a transaction, not the next morning. Also choose streaming when your data volume is so continuous that batching would create impractically large jobs or unacceptable delays. Choose batch when latency requirements are loose (daily or hourly freshness is fine), when your source systems only support bulk exports, or when the total engineering cost of operating a streaming cluster (Kafka brokers, consumer groups, watermark tuning) is not justified. A good heuristic: start with batch, prove value, and migrate to streaming only when the business genuinely needs the reduced latency.
A pipeline is only as trustworthy as its weakest quality gate. Build idempotency in from day one, automate data quality checks at every stage, and treat your pipeline code with the same rigour as your product code.
โ alokknight Engineering
