Data Compression in System Design: Lossless vs Lossy, Codecs & Trade-offs (Visualized)
Data compression reduces the number of bits needed to represent information, saving storage space and network bandwidth. This guide covers how lossless and lossy compression work, key algorithms (gzip, zstd, Huffman, LZ77, run-length), the compression-ratio vs CPU trade-off, columnar encoding in analytics, and where compression helps vs hurts β with live animations.
Data compression is the process of encoding information using fewer bits than the original representation, exploiting patterns and redundancy in data to shrink file sizes, reduce network transfer time, and lower storage costs. It is one of the most pervasive optimizations in modern computing β silently at work in every HTTP response, every image you view, every audio file you stream, and every database column you scan.
The core trade-off in compression is simple: you spend CPU cycles during encoding and decoding in exchange for smaller data. The right point on that curve depends on your workload. A CDN serving billions of static files cares deeply about the compression ratio; a real-time video call cares deeply about latency and can sacrifice quality. Understanding this trade-off is what separates an engineer who blindly enables gzip from one who picks the right tool for the job.
Lossless vs Lossy Compression
Every compression algorithm belongs to one of two families. Lossless compression guarantees a perfect round-trip: decompress and you recover exactly the original bytes. Lossy compression permanently discards information deemed imperceptible to humans β lower-frequency audio components, high-frequency image detail β in exchange for dramatically smaller files. You can never recover the discarded data, but for human consumption the result is often indistinguishable from the original.
The choice is not a matter of preference β it is dictated by the data type. Source code, database records, log files, and binary executables must use lossless compression; a single flipped bit can corrupt a program or lose a financial record. Images, audio, and video files can use lossy compression because the human eye and ear do not register small quality drops, and the bandwidth savings are enormous: a raw 24-bit bitmap might compress 6:1 losslessly with PNG, but a JPEG at equivalent perceptual quality can reach 20:1 or better.
How Lossless Compression Works
Lossless algorithms exploit two kinds of redundancy: statistical redundancy (some symbols appear more often than others) and structural redundancy (the same byte sequences repeat across the file). Most practical codecs combine both.
Run-Length Encoding and Dictionary Coding (LZ77)
Run-length encoding (RLE) is the simplest compression idea: replace a run of repeated symbols with a (count, symbol) pair. The string AAAABBBCCDDDDDD becomes 4A3B2C6D β a 15-byte input stored in 8 bytes. RLE shines on images with large uniform regions (fax pages, PNG solid fills) but gives no benefit on already-random data.
LZ77 (the basis of gzip, zlib, and modern codecs) generalises this with a sliding window. As it scans forward through input, it looks back in a window of recently seen bytes for the longest match to the current position. When a match is found, it emits a back-reference (offset, length) instead of the raw bytes. A back-reference to offset 20, length 9 says "copy 9 bytes from 20 positions ago" β potentially encoding a 9-byte phrase with just 2-3 bytes. Dictionary coding extends this idea by building a shared vocabulary of common phrases, allowing a single token to represent an entire word or sequence.
Entropy Coding: Huffman and Arithmetic Coding
Huffman coding tackles statistical redundancy: it assigns shorter bit-strings to frequently occurring symbols and longer ones to rare symbols. Given the input AAABBC, 'A' (50% frequency) gets code 0, 'B' (33%) gets 10, and 'C' (17%) gets 11 β storing 6 symbols in just 9 bits instead of 48. Huffman codes are optimal when symbol frequencies are known in advance, which is why a Huffman table is precomputed and stored alongside the compressed data. Arithmetic coding goes further, treating the whole message as a single fractional number in [0,1) and approaching the theoretical entropy limit, at the cost of more complex implementation. Both are used as the final entropy stage in gzip, bzip2, and DEFLATE.
Modern Lossless Codecs: gzip, zstd, Brotli, Snappy
The LZ77 + Huffman combination (DEFLATE) is the engine inside both gzip and zlib, which have been the dominant general-purpose lossless codecs for 30 years. zstd (Zstandard, Facebook, 2016) replaces DEFLATE with a faster variant (LZ77 + ANS entropy coding) and offers a compression-level dial from 1 (fast, ~2x) to 22 (very slow, ~3.5x), all while decompressing at several GB/s. Brotli (Google) optimises for web assets β it ships a large precomputed dictionary of common HTML/CSS/JS patterns, giving better ratios for text but slower compression. Snappy (Google) deliberately sacrifices ratio for speed, targeting in-memory storage engines like LevelDB where decompression at 500 MB/s matters more than saving an extra 10%.
The Compression Ratio vs CPU Trade-off
There is no free lunch: stronger compression requires more work. This trade-off has two axes β compression speed (how fast you encode) and decompression speed (how fast you decode). In most systems, data is written once and read many times, so a slow encode with fast decode is often acceptable. The sweet spot shifts by use-case: a cold-storage archive tolerates hours of encoding for maximum ratio; an HTTP server must compress a response in microseconds; a stream processor must both compress and decompress at wire speed.
| Codec | Type | Ratio (typical) | Compress speed | Decompress speed | Best use-case |
|---|---|---|---|---|---|
| Snappy | Lossless | ~1.7x | 500 MB/s | 1700 MB/s | In-memory storage, Kafka |
| LZ4 | Lossless | ~2.1x | 400 MB/s | 2000 MB/s | Real-time streaming |
| gzip (level 6) | Lossless | ~2.7x | 100 MB/s | 400 MB/s | HTTP responses, logs |
| zstd (level 3) | Lossless | ~3.0x | 280 MB/s | 900 MB/s | General purpose, S3 |
| Brotli (level 11) | Lossless | ~3.8x | 2 MB/s | 400 MB/s | Static web assets |
| JPEG (quality 80) | Lossy | ~15-25x | varies | fast | Photographs, thumbnails |
| Opus (96kbps) | Lossy | ~10x vs WAV | real-time | real-time | Voice/audio streaming |
| H.264 (video) | Lossy | ~50-200x vs raw | varies | real-time | Video streaming |
Columnar Compression in Analytics Databases
Row-oriented databases (MySQL, PostgreSQL) store all columns of a row together, which is optimal for OLTP point-lookups but terrible for compression. Analytical databases like Parquet, Apache ORC, ClickHouse, and Amazon Redshift store data column by column. Within a column, all values share the same data type and often the same domain (e.g., a country column that only ever contains 50 distinct strings). This column homogeneity makes compression dramatically more effective.
Columnar formats layer multiple compression techniques: dictionary encoding replaces repeated string values with small integer codes (turning "United States" into 0 across millions of rows); delta encoding stores only the difference between successive sorted integers rather than the full value; bit-packing removes leading zeros from small integers (a column of values 0β255 needs only 8 bits, not 64); run-length encoding collapses long sorted runs. The result: Parquet files with zstd compression typically reach 10β20x compression on real-world analytics tables, making it cheap to store petabytes on object storage like S3.
import pandas as pd
# Write a DataFrame as Parquet with zstd columnar compression
df = pd.DataFrame({
'user_id': range(1_000_000),
'country': ['US'] * 800_000 + ['UK'] * 150_000 + ['IN'] * 50_000,
'event': ['click'] * 600_000 + ['view'] * 400_000,
'timestamp': pd.date_range('2024-01-01', periods=1_000_000, freq='s'),
})
# CSV: ~55 MB | Parquet+zstd: ~3 MB => ~18x compression
df.to_parquet(
'events.parquet',
engine='pyarrow',
compression='zstd', # per-column entropy coding
index=False,
)
# Scan only the 'country' column β only 3 MB column page read from disk
country_counts = pd.read_parquet('events.parquet', columns=['country'])['country'].value_counts()
print(country_counts)Where Compression Helps and Where It Hurts
Compression is not universally beneficial. It helps wherever I/O is the bottleneck: network transfers, disk reads, cold storage. It hurts wherever CPU is the bottleneck or data is already compressed. Compressing JPEG images or MP4 video files again with gzip will increase their size slightly, waste CPU, and accomplish nothing. Compressing small payloads (under ~1 KB for HTTP responses) often produces larger output than the original because the codec's header overhead dominates.
Encryption and compression must also be ordered carefully: always compress before encrypting. Encryption intentionally destroys all patterns in data, making it appear random β a compressed-then-encrypted payload compresses well; an encrypted-then-compressed payload does not compress at all and may slightly expand. This is also a security consideration: the CRIME and BREACH attacks exploited TLS compression to extract secrets by measuring how payload size changed with attacker-controlled input.
| Scenario | Compress? | Recommended codec | Notes |
|---|---|---|---|
| HTTP API JSON responses | Yes | gzip or Brotli | Browser decompresses natively; saves 60-80% bandwidth |
| Log files (archival) | Yes | zstd or gzip | Cold storage; accept slow encode for ratio |
| PostgreSQL WAL / backups | Yes | zstd | pg_basebackup --compress=zstd |
| Kafka messages (text/JSON) | Yes | lz4 or zstd | Fast encode matters for throughput |
| Analytics storage (Parquet) | Yes | zstd or snappy | Column-level; dramatic ratio on typed data |
| JPEG / PNG images | No (already done) | β | Re-compressing wastes CPU, adds size |
| Encrypted payloads (TLS) | Before encrypt | zstd / gzip | Compress then encrypt, not after |
| Tiny payloads (<1 KB) | No | β | Header overhead exceeds savings |
| Real-time video/audio | Use lossy codec | H.265, Opus, AAC | Lossy is the only viable path at these ratios |
Compression in HTTP: Content-Encoding and Transfer-Encoding
HTTP compression is negotiated by the client advertising its supported encodings in the Accept-Encoding request header (gzip, br, zstd). The server picks one, compresses the body, and signals the choice in the Content-Encoding response header. Nginx enables this with gzip on; and a one-line config; adding Brotli requires compiling the ngx_brotli module or using a CDN that supports it natively (Cloudflare, Fastly). Modern CDNs serve Brotli to capable browsers and fall back to gzip for older ones automatically. On average, switching from gzip to Brotli on text assets saves an additional 15β25% on top of gzip's savings.
Frequently Asked Questions
When should I use zstd instead of gzip?
Use zstd whenever you control both the encoder and decoder β server-to-server communication, database backups, log archival, Kafka messages, and S3 object storage. zstd achieves better compression ratios than gzip at every speed level and decompresses 2β3x faster, so it is a strict upgrade in most backend contexts. Stick with gzip or Brotli for HTTP responses to browsers, because browser support for zstd in HTTP is still rolling out (supported in Chrome 123+ and Firefox 126+ as of mid-2024). Brotli remains the best choice for static text assets where you can afford slow offline compression.
Does compression hurt database query performance?
For columnar analytical databases (ClickHouse, Redshift, Parquet on Spark), compression almost always improves query performance. Queries scan far fewer bytes from disk, and modern CPUs can decompress faster than SSDs or S3 can supply bytes β so the decompression CPU cost is hidden behind I/O savings. For row-oriented OLTP databases, compression on hot data can add latency on index lookups because every access decompresses a page. PostgreSQL's TOAST mechanism automatically compresses large column values inline; use it deliberately but do not compress small, frequently-accessed columns.
What is the theoretical limit of lossless compression?
Claude Shannon's source coding theorem (1948) proves that the minimum average bits per symbol is the entropy of the source: H = -Ξ£ p(x) logβ p(x). No lossless algorithm can compress below this limit on average. A truly random byte stream has maximum entropy (8 bits/byte) and is incompressible β this is why encrypted data and already-compressed data resist further compression. Practical codecs reach within a few percent of the theoretical limit for structured text; Huffman coding hits the limit when symbol frequencies are exact powers of two, and arithmetic coding approaches it asymptotically for any source distribution.
Compression is not about making data smaller β it is about making systems faster by letting bandwidth and storage do less work so the CPU can do more of what matters.
β alokknight Engineering
