Serialization & Deserialization in System Design: Formats, Schema Evolution & Insecure Deserialization (Visualized)

Serialization is the process of converting an in-memory object into a flat sequence of bytes that can be stored on disk or sent over a network, and deserialization is the reverse — reconstructing the original object from those bytes. Every time a service writes JSON to a queue, caches a struct in Redis, or sends a gRPC call, it is serializing on one side and deserializing on the other.

The hard part is that memory and the wire are fundamentally different. In memory an object is a graph of pointers, references, and type information laid out for fast access. On the wire it must become a single linear stream of bytes with no pointers at all. The chosen format decides how that flattening happens — how compact it is, how fast it parses, whether a human can read it, and how safely it can be rebuilt from untrusted input.

From Object Graph to Byte Stream

Conceptually, serialization walks the object graph and emits a token for each field: a key, a type tag, and a value. Nested objects are visited recursively, and references are either inlined or replaced with identifiers. Deserialization reads those tokens back in order and rebuilds the structure. The animation below shows a structured object being flattened into a byte stream on the left and reconstructed into an identical object on the right.

Object flattened into bytes and reconstructed

Fields of an in-memory object are walked top-to-bottom, emitted as a byte stream, sent across the wire, then re-read in order to rebuild an identical object on the other side.
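The walk described above can be sketched in a few lines of Python. This is a toy illustration rather than any real format: the token layout `(key, type_tag, value)`, the helper names `flatten` and `rebuild`, and the sample record are all assumptions made for the example.

```python
from typing import Any, Iterator


def flatten(value: Any, key: Any = None) -> Iterator[tuple]:
    """Walk a nested dict/list depth-first and yield (key, type_tag, value) tokens."""
    if isinstance(value, dict):
        # Containers emit their size so the reader knows how many children follow.
        yield (key, "dict", len(value))
        for child_key, child in value.items():
            yield from flatten(child, child_key)
    elif isinstance(value, list):
        yield (key, "list", len(value))
        for index, child in enumerate(value):
            yield from flatten(child, index)
    else:
        # Leaf value: the type tag is simply the Python type name.
        yield (key, type(value).__name__, value)


def rebuild(tokens: list) -> Any:
    """Read tokens back in the order they were emitted and rebuild the structure."""
    stream = iter(tokens)

    def read() -> tuple:
        key, tag, payload = next(stream)
        if tag == "dict":
            return key, dict(read() for _ in range(payload))
        if tag == "list":
            return key, [read()[1] for _ in range(payload)]
        return key, payload

    return read()[1]


# Hypothetical sample record, used only for illustration.
user = {"id": 42, "name": "Ada", "roles": ["admin", "dev"], "address": {"city": "London"}}
tokens = list(flatten(user))
restored = rebuild(tokens)
```

The token list is already linear and pointer-free, which is the essential property; a real format would go one step further and encode each token as bytes.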

The Major Formats

JSON is text-based, self-describing, and human-readable. Every value carries its key inline, which makes it easy to debug and universally supported, but verbose and relatively slow to parse. MessagePack is essentially binary JSON: the same schema-less model, but values are packed into compact bytes, so payloads shrink and parsing speeds up while keeping the flexibility of dynamic fields.

Protocol Buffers (Protobuf) and Apache Avro are schema-driven binary formats. Both define fields in a separate schema and drop the field names from the wire entirely — Protobuf sends numeric field tags, Avro relies on a shared schema and sends almost nothing but the values in order. The result is dramatically smaller and faster than JSON, at the cost of needing the schema to read the data. The animation below contrasts the same record as verbose JSON text versus a compact Protobuf byte sequence.

JSON text vs compact Protobuf bytes

The same record encoded two ways: JSON repeats every field name as readable text, while Protobuf replaces names with compact numeric tags (a single byte each for field numbers 1 through 15). Watch the byte counters tick up as each field is encoded.
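To make the size difference concrete, here is a hand-rolled sketch of a Protobuf-style encoding for the same record. It is not the official protobuf library: it covers only varints and length-delimited strings, and the field numbers in `FIELD_NUMBERS` are assumed for illustration (a real project would declare them in a .proto file).

```python
import json


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_field(number: int, value) -> bytes:
    """Emit the key (field_number << 3 | wire_type) followed by the value."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        # Wire type 2: length-delimited (length prefix, then raw bytes).
        return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data
    # Wire type 0: varint; booleans are sent as 0 or 1.
    return encode_varint((number << 3) | 0) + encode_varint(int(value))


user = {"id": 42, "name": "Ada", "admin": True}
FIELD_NUMBERS = {"id": 1, "name": 2, "admin": 3}  # assumed schema, not from a real .proto

json_bytes = json.dumps(user, separators=(",", ":")).encode()
proto_bytes = b"".join(encode_field(FIELD_NUMBERS[k], v) for k, v in user.items())

print(len(json_bytes), len(proto_bytes))  # 35 9
```

For this tiny record the compact JSON form is 35 bytes and the tagged binary form is 9. The ratio on real data depends heavily on how long the field names are relative to the values.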

Finally, many languages ship a native serialization format: Python's pickle, Java's Serializable, Ruby's Marshal, .NET's BinaryFormatter. These are convenient because they capture arbitrary object graphs with zero schema work, but they encode type and construction instructions directly into the byte stream. That convenience is exactly what makes them dangerous on untrusted input, as we will see below.

Format	Size	Speed	Schema	Human-readable
JSON	Large (verbose)	Moderate	Schemaless	Yes
MessagePack	Small	Fast	Schemaless	No (binary)
Protobuf	Very small	Very fast	Required (.proto)	No (binary)
Avro	Very small	Very fast	Required (shared)	No (binary)
Native (pickle/Java)	Medium	Fast	Implicit in types	No (binary; unsafe on untrusted input)

Schema Evolution

Real systems deploy producers and consumers independently, so the format must tolerate schema evolution: old readers must survive new fields, and new readers must survive old data. Protobuf achieves this with stable numeric field tags: adding a field means assigning a new number, and readers skip tags they do not recognize. Avro resolves the writer's schema against the reader's schema at read time, matching fields by name and filling in defaults for fields the writer did not send. The golden rule across both: add only optional fields or fields with defaults, never reuse or renumber Protobuf tags (or rename Avro fields without an alias), and never change a field's type in place. JSON evolves loosely, since consumers typically ignore keys they do not look for, but without a schema there is nothing stopping a typo from silently becoming data.
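The tag-skipping behaviour can be sketched with a small decoder for a Protobuf-like wire layout. This is a simplified illustration, not the real protobuf runtime: it handles only wire types 0 (varint) and 2 (length-delimited), and the field numbers, the `email` field, and the two reader schemas are hypothetical.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, known: dict, defaults: dict) -> dict:
    """Decode fields listed in `known`; consume and drop any other field number."""
    record = dict(defaults)  # fields missing from the bytes keep their default
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if number in known:
            record[known[number]] = value
        # Unknown number: the value was consumed above, so it is simply skipped.
    return record


# v1 writer: id=42, name="Ada", admin=1 (booleans travel as varints).
old_message = b"\x08\x2a\x12\x03Ada\x18\x01"
# v2 writer adds a hypothetical field 4 (email) as a length-delimited string.
email = b"ada@x.test"
new_message = old_message + b"\x22" + bytes([len(email)]) + email

V1_FIELDS = {1: "id", 2: "name", 3: "admin"}
V2_FIELDS = {1: "id", 2: "name", 3: "admin", 4: "email"}

old_reader_new_data = decode(new_message, V1_FIELDS, {})
new_reader_old_data = decode(old_message, V2_FIELDS, {"email": ""})
```

The old reader silently drops field 4, and the new reader falls back to the default for the missing email. Both directions work only because field numbers 1 to 3 kept their meaning across versions.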

Performance and Size Trade-offs

The choice is a classic trade-off between ergonomics and efficiency. JSON is hard to beat for public APIs, debugging, and config because anyone can read it and no schema distribution is needed. But at scale (high-throughput RPC, event streams, bulk data files) the savings from binary schema formats can be large: payloads often 3-10x smaller depending on the data, far less CPU spent parsing, and lower garbage-collection pressure. A common pattern is JSON at the edge (browser-facing APIs) and Protobuf or Avro internally (service-to-service and Kafka topics).

import json, pickle

user = {"id": 42, "name": "Ada", "admin": True}

# Safe, portable, self-describing
blob = json.dumps(user).encode()      # b'{"id": 42, ...}'
back = json.loads(blob)               # dict, only data

# Convenient but DANGEROUS on untrusted input:
# pickle.loads can construct ARBITRARY objects and run
# code during reconstruction.
dump = pickle.dumps(user)
restored = pickle.loads(dump)         # never do this on network data

Insecure Deserialization: The Security Angle

Insecure deserialization is a vulnerability where an application rebuilds objects from attacker-controlled bytes without restriction, letting the attacker influence which objects get created and what runs during reconstruction. It is dangerous precisely because native formats treat the byte stream as instructions, not just data. With pickle, the stream can tell the unpickler to call any importable callable with attacker-chosen arguments (the same mechanism a class uses when it defines __reduce__). With Java serialization, crafted gadget chains (sequences of existing library classes whose readObject side effects combine) can reach a method that executes commands, turning a deserialize call into remote code execution.

The mechanism is subtle: the developer expected a User object, but the payload smuggles in an entirely different object whose mere construction triggers a side effect. The animation below shows a benign payload rebuilding the expected object, then an untrusted payload that deserializes into an unexpected object which fires a side effect the application never intended.

Untrusted payload triggers an unexpected object

A trusted byte stream deserializes into the expected User object. An attacker-crafted stream deserializes into an unexpected gadget object whose construction fires a side effect (RCE) the app never asked for.
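A deliberately harmless demonstration of this mechanism with pickle is sketched below. The `Gadget` class is hypothetical, and `eval("6 * 7")` stands in for whatever an attacker would actually run; the point is only that `pickle.loads` calls the callable named in the stream.

```python
import pickle


class Gadget:
    """Hypothetical attacker-side class; it never needs to exist on the victim."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A real payload would name something far less friendly than arithmetic.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Gadget())   # the bytes an attacker would send
result = pickle.loads(payload)     # the victim thinks it is loading data

print(type(result).__name__, result)  # int 42
```

The receiver gets back an `int`, not a `Gadget` and not a `User`: the expression was evaluated during loading, before any application code had a chance to inspect the result.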

Mitigations

The single most effective defense is to never deserialize untrusted data with a native format. Use a pure-data format like JSON, Protobuf, or Avro for anything crossing a trust boundary — they carry values, not executable construction logic. When native deserialization is unavoidable, layer defenses: (1) use a strict allowlist of permitted classes (Java's ObjectInputFilter, restricted unpicklers) so only known-safe types can be instantiated; (2) wrap payloads with an integrity check — sign them with an HMAC so any tampering is rejected before deserialization even begins; (3) apply schema validation on the decoded data so unexpected shapes are rejected; and (4) run deserialization with least privilege so a successful exploit has nowhere to go.
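For the allowlist idea in point (1), Python's pickle module lets a subclass of `pickle.Unpickler` override `find_class`, which is consulted whenever the stream asks for a global. The sketch below follows that documented pattern; the contents of `SAFE_BUILTINS` and the `Gadget` test payload are assumptions for illustration, and an allowlist like this should be treated as one layer of defense, not a guarantee.

```python
import builtins
import io
import pickle

# Assumed allowlist for this example; a real one depends on the application.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream references a global (class or function).
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Drop-in analogue of pickle.loads that enforces the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain data made of built-in containers and scalars loads normally.
safe = restricted_loads(pickle.dumps({"id": 42, "tags": {"a", "b"}}))


class Gadget:
    """Hypothetical malicious object: asks the unpickler to call eval."""

    def __reduce__(self):
        return (eval, ("6 * 7",))


try:
    restricted_loads(pickle.dumps(Gadget()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True  # builtins.eval is not on the allowlist
```

The rejection happens when the stream asks for `builtins.eval`, before anything is called, so the side effect never fires.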

Order matters: verify the signature first, then deserialize. Validating after deserialization is too late, because the damage with native formats happens during reconstruction. Treat every byte from the network as hostile until proven otherwise.
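A minimal sketch of that ordering, using the standard-library `hmac` module: the function names, the message layout (32-byte tag followed by the payload), and the hard-coded key are assumptions for the example. In practice the key would come from a secret store, and anyone holding it can still forge payloads.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"demo-key-do-not-use-in-production"  # hypothetical; load from a secret store
TAG_SIZE = hashlib.sha256().digest_size            # 32 bytes for SHA-256


def sign_blob(payload: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag computed over it."""
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def verify_then_load(message: bytes):
    """Check the tag first; only untampered bytes ever reach pickle.loads."""
    tag, payload = message[:TAG_SIZE], message[TAG_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("signature mismatch: refusing to deserialize")
    return pickle.loads(payload)


user = {"id": 42, "name": "Ada", "admin": True}
message = sign_blob(pickle.dumps(user))
restored = verify_then_load(message)

# Flip one bit in the last byte to simulate tampering in transit.
tampered = message[:-1] + bytes([message[-1] ^ 0x01])
try:
    verify_then_load(tampered)
    rejected = False
except ValueError:
    rejected = True
```

The tampered message is rejected inside `verify_then_load` before `pickle.loads` is reached, which is the property the ordering rule is meant to guarantee.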

Frequently Asked Questions

What is the difference between serialization and marshalling?

They overlap heavily. Serialization specifically means converting an object's state into a byte stream for storage or transport. Marshalling is a broader term that also implies packaging data so it can be moved across a boundary (such as a process or network), and historically includes transforming references or codebase information, not just raw state. In everyday usage the two are often used interchangeably.

Why is pickle (or Java serialization) considered unsafe?

Because these formats embed type and construction instructions in the byte stream, deserializing them can instantiate arbitrary classes and execute code during reconstruction. An attacker who controls the bytes can craft a payload that runs commands on your server. They are safe only for data you fully trust and have verified — never for input arriving from a network, user, cache, or queue without an integrity check.

Should I use JSON or Protobuf for my service?

Use JSON for public, browser-facing, or low-volume APIs where human readability and zero schema setup matter most. Use Protobuf or Avro for high-throughput internal RPC and event streams where smaller payloads, faster parsing, and enforced schema evolution pay off. Many systems do both: JSON at the edge, binary schema formats internally.

Serialization is just data leaving your process; deserialization is untrusted data entering it. Pick a format that carries values, not instructions — and verify the signature before you parse a single byte.
— alokknight Engineering