Data Lake in System Design: Architecture, Schema-on-Read, and the Lakehouse Pattern (Visualized)
A data lake is a centralized repository that stores raw structured, semi-structured, and unstructured data at any scale and low cost โ all without imposing a schema upfront. This guide covers object storage, schema-on-read, the data swamp risk, the lakehouse pattern, and how a lake differs from a warehouse, with live animations.
A data lake is a centralized storage repository that ingests and retains raw data in its native format โ structured tables, semi-structured JSON and logs, and unstructured files like images or audio โ at massive scale and low cost, deferring any structure or transformation until the data is actually read and queried.
Traditional data warehouses required engineers to define a strict schema before any data could be loaded. This worked well for predictable, uniform data but fell apart when organizations needed to store raw clickstreams, IoT sensor feeds, application logs, images, and third-party JSON blobs alongside relational tables. The data lake concept, popularized around 2010โ2013 alongside Hadoop and later object storage services like Amazon S3, solved this by separating storage from compute โ store everything cheaply first, figure out what to do with it later.
What Goes Into a Data Lake
A data lake accepts data from virtually every source without transformation at ingestion time. Common sources include: structured data (relational database snapshots, CSV exports, transactional records), semi-structured data (application logs in JSON or XML, Kafka event streams, clickstream data, API responses), and unstructured data (images, video, audio, PDFs, ML model artifacts, raw text corpora). All of it lands in the lake unchanged โ the producer's format is preserved exactly. This makes the lake a single source of truth that retains full audit history and enables replay of any historical pipeline.
Object Storage as the Foundation
Modern data lakes are built on object storage โ primarily Amazon S3, Google Cloud Storage, or Azure Data Lake Storage Gen2. Object storage offers near-infinite horizontal scalability at cents per gigabyte-month (versus dollars per gigabyte for SSDs), eleven-nines durability, and a flat namespace accessible by any compute engine. Data is organized by convention into a logical folder hierarchy, typically s3://bucket/year/month/day/source/, forming what is commonly called the Bronze / Silver / Gold tier pattern: raw ingested data, cleaned and joined data, and fully aggregated business-ready data.
Compute is decoupled from storage entirely. Apache Spark, Presto/Trino, AWS Athena, or Databricks spin up, read files from S3 over the network, and shut down โ you pay only for the compute time used. This separation makes the lake vastly cheaper than a warehouse for cold or infrequently queried data, and means you can swap your query engine without moving the data.
Schema-on-Read vs Schema-on-Write
The defining philosophical difference between a data lake and a data warehouse is when structure is imposed. In a schema-on-write system (traditional warehouses like Redshift or BigQuery), the schema is defined before any row can be inserted โ the ETL pipeline must transform, validate, and coerce every field to the target types before loading. Any schema change requires a migration. In a schema-on-read system (data lakes), raw bytes are stored as-is; structure is only applied at query time. An analyst can read the same JSON log file as a flat table of string columns today and as a deeply nested structured type tomorrow, without moving a byte.
Schema-on-read gives enormous flexibility for exploratory data science and ad-hoc analysis โ you do not need to know the right questions before storing the data. The trade-off is that query-time parsing is slower, and inconsistent producers can silently corrupt downstream analyses if a field changes type or name.
The Data Swamp Problem
The freedom of a data lake is also its greatest danger. Without governance, a lake rapidly degrades into a data swamp โ a repository where nobody knows what data exists, who owns it, whether it is still valid, or what format version it is in. Classic symptoms include: directories named data_final_v2_USE_THIS, duplicate tables from four different ETL runs with no lineage tracking, JSON files where the same field is sometimes a string and sometimes a number depending on the producer version, and no record of which files are safe to delete. The swamp makes the lake worthless for analytics and actively dangerous for ML models trained on stale or corrupted data.
Preventing the swamp requires investment in a data catalog (e.g., AWS Glue, Apache Atlas, DataHub) to track schema and ownership, data quality checks at ingestion (Great Expectations, dbt tests), access control through table-level and column-level permissions, and a clear retention policy so old raw data is archived or purged systematically. Governance is not optional โ it is the engineering work that makes the lake usable.
The Lakehouse: Bringing ACID to the Lake
The lakehouse architecture, introduced by Databricks around 2020, merges the best of both worlds. It keeps cheap object-storage files as the physical layer but adds a transactional metadata layer โ implemented by open table formats like Delta Lake, Apache Iceberg, and Apache Hudi โ on top of those files. This metadata layer gives you ACID transactions, schema enforcement, time travel (query the table as it was 30 days ago), and data versioning, all without leaving object storage.
Delta Lake stores Parquet files in S3 alongside a _delta_log/ folder containing JSON commit entries. Each write appends a new commit log entry recording which files were added and which were removed, enabling atomic multi-file updates. Apache Iceberg uses a similar approach with a manifest system, and adds hidden partitioning (no partition path embedded in file names) and row-level deletes via positional delete files โ making it the preferred choice for very large tables at companies like Apple and Netflix. The lakehouse pattern is now the default architecture for new data platform builds, replacing the separate lake + warehouse stack.
# Delta Lake: ACID writes on top of S3 object storage
from delta import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.getOrCreate()
# Write raw JSON events to Delta format (Parquet + transaction log)
df = spark.read.json("s3://raw/events/2024-04-17/")
df.write.format("delta").mode("append").save("s3://lake/events/")
# Time travel: query the table as it was 7 days ago
dt = DeltaTable.forPath(spark, "s3://lake/events/")
history_df = spark.read.format("delta") \
.option("versionAsOf", dt.history(10).select("version").collect()[-1][0]) \
.load("s3://lake/events/")
# MERGE (upsert) โ ACID guaranteed, no duplicates
dt.alias("target").merge(
source=df.alias("source"),
condition="target.uid = source.uid AND target.event_date = source.event_date"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()Data Lake vs Data Warehouse
A data warehouse is optimized for fast, structured, ad-hoc SQL queries over cleaned, modeled data โ it is where your BI dashboards and executive reports run. A data lake is optimized for raw storage, flexibility, and enabling diverse consumers including ML training pipelines, data science notebooks, and streaming processors. They are complementary, not competing: most mature data platforms ingest raw data into the lake and then promote curated, aggregated subsets into a warehouse for business intelligence.
| Data Lake | Data Warehouse | Lakehouse | |
|---|---|---|---|
| Storage | Object storage (S3, GCS) | Proprietary columnar (Redshift, Snowflake) | Object storage (S3) + open table format |
| Data format | Raw: JSON, CSV, Parquet, images, logs | Structured, typed columns only | Parquet + Delta/Iceberg transaction log |
| Schema | Schema-on-read (flexible) | Schema-on-write (rigid, enforced) | Schema-on-write with ACID, optional evolution |
| Query speed | Slower (scan raw files) | Very fast (pre-optimized indexes) | Fast (Parquet + partition pruning + Z-order) |
| ACID support | None natively | Full | Full (Delta Lake, Iceberg, Hudi) |
| Cost per TB | Very low (~$23/TB/month S3) | High ($200โ$500+/TB/month) | Very low (same as object storage) |
| Best for | Raw storage, ML training, exploration | BI dashboards, structured reporting | Unified analytics, ML + BI on same data |
| Examples | S3 + Glue, HDFS, ADLS Gen2 | Redshift, BigQuery, Snowflake, Synapse | Databricks, Apache Iceberg on S3, Delta Lake |
When to Use a Data Lake
Choose a data lake when your requirements include storing data before you know how you will use it, when you have diverse data types (binary files, logs, nested JSON) that a warehouse cannot store natively, or when you need to feed ML training pipelines that consume raw or lightly processed data. A lake is also the right choice when data volume is very large and cost per terabyte must be minimized โ for example archiving 10 PB of clickstream history at $0.023/GB/month on S3 costs a tiny fraction of what a warehouse would charge for equivalent storage.
A lake is a poor fit if all your consumers are business analysts who need fast SQL and sub-second dashboards over clean, narrow tables โ in that case a warehouse is a better primary tool. Modern practice is to use both: a lake as the raw landing zone and single source of truth, and a warehouse (or the Gold tier of the lakehouse) as the curated serving layer for BI.
Frequently Asked Questions
What is the difference between a data lake and a data warehouse?
A data lake stores raw data in its native format on cheap object storage with schema-on-read โ you pay almost nothing per terabyte and can store any type of data, but queries are slower and require compute to parse files. A data warehouse stores pre-modeled, typed, cleaned data in a proprietary columnar format optimized for fast SQL queries, at significantly higher cost per terabyte. The two are complementary: most organizations land raw data in the lake and promote curated subsets to the warehouse for BI dashboards and reporting.
What is a data swamp and how do you prevent it?
A data swamp is a data lake that has lost usability due to missing governance โ no data catalog, no ownership metadata, inconsistent schemas, and no quality checks. Files accumulate without anyone knowing what they contain, their freshness, or whether they are safe to read. Prevention requires a data catalog (AWS Glue, DataHub) to register every dataset with owner and schema, automated data quality gates at ingestion, column-level access control, and a retention policy enforced by lifecycle rules on the storage bucket. Governance tooling is not optional infrastructure โ it is what turns a pile of files into a reliable data asset.
What is a lakehouse and why is it replacing separate lake + warehouse stacks?
A lakehouse (Delta Lake, Apache Iceberg, Apache Hudi) adds a transactional metadata layer โ a commit log and manifest files โ on top of standard Parquet files in object storage. This gives you ACID transactions, schema enforcement, upserts, deletes, and time travel at warehouse-level reliability while keeping storage costs at lake prices. It eliminates the need to maintain two separate systems and two copies of data: ML engineers and data scientists query the same Parquet files that power the BI dashboards, with full ACID guarantees. This unification, combined with open file formats that any engine can read, is why the lakehouse has become the dominant architecture for new data platform designs.
A data lake that nobody governs is just an expensive trash bin. Add a catalog, quality checks, and a transaction layer โ then it becomes the foundation for everything from dashboards to ML models.
โ alokknight Engineering
