Data Compression in System Design: Lossless vs Lossy, Codecs & Trade-offs (Visualized)

Data compression is the process of encoding information using fewer bits than the original representation, exploiting patterns and redundancy in data to shrink file sizes, reduce network transfer time, and lower storage costs. It is one of the most pervasive optimizations in modern computing — silently at work in every HTTP response, every image you view, every audio file you stream, and every database column you scan.

The core trade-off in compression is simple: you spend CPU cycles during encoding and decoding in exchange for smaller data. The right point on that curve depends on your workload. A CDN serving billions of static files cares deeply about the compression ratio; a real-time video call cares deeply about latency and can sacrifice quality. Understanding this trade-off is what separates an engineer who blindly enables gzip from one who picks the right tool for the job.

Lossless vs Lossy Compression

Every compression algorithm belongs to one of two families. Lossless compression guarantees a perfect round-trip: decompress and you recover exactly the original bytes. Lossy compression permanently discards information deemed imperceptible to humans — lower-frequency audio components, high-frequency image detail — in exchange for dramatically smaller files. You can never recover the discarded data, but for human consumption the result is often indistinguishable from the original.

The choice is not a matter of preference — it is dictated by the data type. Source code, database records, log files, and binary executables must use lossless compression; a single flipped bit can corrupt a program or lose a financial record. Images, audio, and video files can use lossy compression because the human eye and ear do not register small quality drops, and the bandwidth savings are enormous: a raw 24-bit bitmap might compress 6:1 losslessly with PNG, but a JPEG at equivalent perceptual quality can reach 20:1 or better.

Lossless vs Lossy: perfect restore vs smaller but approximate

Watch how lossless compression restores data exactly, while lossy compression trades fidelity for a much smaller result.

How Lossless Compression Works

Lossless algorithms exploit two kinds of redundancy: statistical redundancy (some symbols appear more often than others) and structural redundancy (the same byte sequences repeat across the file). Most practical codecs combine both.

Run-Length Encoding and Dictionary Coding (LZ77)

Run-length encoding (RLE) is the simplest compression idea: replace a run of repeated symbols with a (count, symbol) pair. The string AAAABBBCCDDDDDD becomes 4A3B2C6D — a 15-byte input stored in 8 bytes. RLE shines on images with large uniform regions (fax pages, PNG solid fills) but gives no benefit on already-random data.

LZ77 (the basis of gzip, zlib, and modern codecs) generalises this with a sliding window. As it scans forward through input, it looks back in a window of recently seen bytes for the longest match to the current position. When a match is found, it emits a back-reference (offset, length) instead of the raw bytes. A back-reference to offset 20, length 9 says "copy 9 bytes from 20 positions ago" — potentially encoding a 9-byte phrase with just 2-3 bytes. Dictionary coding extends this idea by building a shared vocabulary of common phrases, allowing a single token to represent an entire word or sequence.

Run-Length Encoding and LZ77 Dictionary Compression — Visualized

Watch a repetitive string get compressed via RLE and LZ77 back-references, with a size bar shrinking in real time.

Entropy Coding: Huffman and Arithmetic Coding

Huffman coding tackles statistical redundancy: it assigns shorter bit-strings to frequently occurring symbols and longer ones to rare symbols. Given the input AAABBC, 'A' (50% frequency) gets code 0, 'B' (33%) gets 10, and 'C' (17%) gets 11 — storing 6 symbols in just 9 bits instead of 48. Huffman codes are optimal when symbol frequencies are known in advance, which is why a Huffman table is precomputed and stored alongside the compressed data. Arithmetic coding goes further, treating the whole message as a single fractional number in [0,1) and approaching the theoretical entropy limit, at the cost of more complex implementation. Both are used as the final entropy stage in gzip, bzip2, and DEFLATE.

Modern Lossless Codecs: gzip, zstd, Brotli, Snappy

The LZ77 + Huffman combination (DEFLATE) is the engine inside both gzip and zlib, which have been the dominant general-purpose lossless codecs for 30 years. zstd (Zstandard, Facebook, 2016) replaces DEFLATE with a faster variant (LZ77 + ANS entropy coding) and offers a compression-level dial from 1 (fast, ~2x) to 22 (very slow, ~3.5x), all while decompressing at several GB/s. Brotli (Google) optimises for web assets — it ships a large precomputed dictionary of common HTML/CSS/JS patterns, giving better ratios for text but slower compression. Snappy (Google) deliberately sacrifices ratio for speed, targeting in-memory storage engines like LevelDB where decompression at 500 MB/s matters more than saving an extra 10%.

The Compression Ratio vs CPU Trade-off

There is no free lunch: stronger compression requires more work. This trade-off has two axes — compression speed (how fast you encode) and decompression speed (how fast you decode). In most systems, data is written once and read many times, so a slow encode with fast decode is often acceptable. The sweet spot shifts by use-case: a cold-storage archive tolerates hours of encoding for maximum ratio; an HTTP server must compress a response in microseconds; a stream processor must both compress and decompress at wire speed.

Compression Ratio vs CPU Speed — Codec Comparison

Each codec bubble shows its ratio and speed. Watch how fast codecs sacrifice ratio, and strong codecs sacrifice speed.

Codec	Type	Ratio (typical)	Compress speed	Decompress speed	Best use-case
Snappy	Lossless	~1.7x	500 MB/s	1700 MB/s	In-memory storage, Kafka
LZ4	Lossless	~2.1x	400 MB/s	2000 MB/s	Real-time streaming
gzip (level 6)	Lossless	~2.7x	100 MB/s	400 MB/s	HTTP responses, logs
zstd (level 3)	Lossless	~3.0x	280 MB/s	900 MB/s	General purpose, S3
Brotli (level 11)	Lossless	~3.8x	2 MB/s	400 MB/s	Static web assets
JPEG (quality 80)	Lossy	~15-25x	varies	fast	Photographs, thumbnails
Opus (96kbps)	Lossy	~10x vs WAV	real-time	real-time	Voice/audio streaming
H.264 (video)	Lossy	~50-200x vs raw	varies	real-time	Video streaming

Columnar Compression in Analytics Databases

Row-oriented databases (MySQL, PostgreSQL) store all columns of a row together, which is optimal for OLTP point-lookups but terrible for compression. Analytical databases like Parquet, Apache ORC, ClickHouse, and Amazon Redshift store data column by column. Within a column, all values share the same data type and often the same domain (e.g., a country column that only ever contains 50 distinct strings). This column homogeneity makes compression dramatically more effective.

Columnar formats layer multiple compression techniques: dictionary encoding replaces repeated string values with small integer codes (turning "United States" into 0 across millions of rows); delta encoding stores only the difference between successive sorted integers rather than the full value; bit-packing removes leading zeros from small integers (a column of values 0–255 needs only 8 bits, not 64); run-length encoding collapses long sorted runs. The result: Parquet files with zstd compression typically reach 10–20x compression on real-world analytics tables, making it cheap to store petabytes on object storage like S3.

import pandas as pd

# Write a DataFrame as Parquet with zstd columnar compression
df = pd.DataFrame({
    'user_id': range(1_000_000),
    'country': ['US'] * 800_000 + ['UK'] * 150_000 + ['IN'] * 50_000,
    'event': ['click'] * 600_000 + ['view'] * 400_000,
    'timestamp': pd.date_range('2024-01-01', periods=1_000_000, freq='s'),
})

# CSV: ~55 MB  |  Parquet+zstd: ~3 MB  =>  ~18x compression
df.to_parquet(
    'events.parquet',
    engine='pyarrow',
    compression='zstd',           # per-column entropy coding
    index=False,
)

# Scan only the 'country' column — only 3 MB column page read from disk
country_counts = pd.read_parquet('events.parquet', columns=['country'])['country'].value_counts()
print(country_counts)

Where Compression Helps and Where It Hurts

Compression is not universally beneficial. It helps wherever I/O is the bottleneck: network transfers, disk reads, cold storage. It hurts wherever CPU is the bottleneck or data is already compressed. Compressing JPEG images or MP4 video files again with gzip will increase their size slightly, waste CPU, and accomplish nothing. Compressing small payloads (under ~1 KB for HTTP responses) often produces larger output than the original because the codec's header overhead dominates.

Encryption and compression must also be ordered carefully: always compress before encrypting. Encryption intentionally destroys all patterns in data, making it appear random — a compressed-then-encrypted payload compresses well; an encrypted-then-compressed payload does not compress at all and may slightly expand. This is also a security consideration: the CRIME and BREACH attacks exploited TLS compression to extract secrets by measuring how payload size changed with attacker-controlled input.

Scenario	Compress?	Recommended codec	Notes
HTTP API JSON responses	Yes	gzip or Brotli	Browser decompresses natively; saves 60-80% bandwidth
Log files (archival)	Yes	zstd or gzip	Cold storage; accept slow encode for ratio
PostgreSQL WAL / backups	Yes	zstd	pg_basebackup --compress=zstd
Kafka messages (text/JSON)	Yes	lz4 or zstd	Fast encode matters for throughput
Analytics storage (Parquet)	Yes	zstd or snappy	Column-level; dramatic ratio on typed data
JPEG / PNG images	No (already done)	—	Re-compressing wastes CPU, adds size
Encrypted payloads (TLS)	Before encrypt	zstd / gzip	Compress then encrypt, not after
Tiny payloads (<1 KB)	No	—	Header overhead exceeds savings
Real-time video/audio	Use lossy codec	H.265, Opus, AAC	Lossy is the only viable path at these ratios

Compression in HTTP: Content-Encoding and Transfer-Encoding

HTTP compression is negotiated by the client advertising its supported encodings in the Accept-Encoding request header (gzip, br, zstd). The server picks one, compresses the body, and signals the choice in the Content-Encoding response header. Nginx enables this with gzip on; and a one-line config; adding Brotli requires compiling the ngx_brotli module or using a CDN that supports it natively (Cloudflare, Fastly). Modern CDNs serve Brotli to capable browsers and fall back to gzip for older ones automatically. On average, switching from gzip to Brotli on text assets saves an additional 15–25% on top of gzip's savings.

Frequently Asked Questions

When should I use zstd instead of gzip?

Use zstd whenever you control both the encoder and decoder — server-to-server communication, database backups, log archival, Kafka messages, and S3 object storage. zstd achieves better compression ratios than gzip at every speed level and decompresses 2–3x faster, so it is a strict upgrade in most backend contexts. Stick with gzip or Brotli for HTTP responses to browsers, because browser support for zstd in HTTP is still rolling out (supported in Chrome 123+ and Firefox 126+ as of mid-2024). Brotli remains the best choice for static text assets where you can afford slow offline compression.

Does compression hurt database query performance?

For columnar analytical databases (ClickHouse, Redshift, Parquet on Spark), compression almost always improves query performance. Queries scan far fewer bytes from disk, and modern CPUs can decompress faster than SSDs or S3 can supply bytes — so the decompression CPU cost is hidden behind I/O savings. For row-oriented OLTP databases, compression on hot data can add latency on index lookups because every access decompresses a page. PostgreSQL's TOAST mechanism automatically compresses large column values inline; use it deliberately but do not compress small, frequently-accessed columns.

What is the theoretical limit of lossless compression?

Claude Shannon's source coding theorem (1948) proves that the minimum average bits per symbol is the entropy of the source: H = -Σ p(x) log₂ p(x). No lossless algorithm can compress below this limit on average. A truly random byte stream has maximum entropy (8 bits/byte) and is incompressible — this is why encrypted data and already-compressed data resist further compression. Practical codecs reach within a few percent of the theoretical limit for structured text; Huffman coding hits the limit when symbol frequencies are exact powers of two, and arithmetic coding approaches it asymptotically for any source distribution.

Compression is not about making data smaller — it is about making systems faster by letting bandwidth and storage do less work so the CPU can do more of what matters.
— alokknight Engineering