Full-Text Search in System Design: Inverted Indexes, BM25, and Elasticsearch Explained (Visualized)

Full-text search is a technique for finding documents whose textual content contains one or more query terms, using a pre-built index of every word so that results are returned in milliseconds regardless of corpus size. Unlike a primary-key lookup or a range scan, full-text search must rank results by relevance, handle linguistic variation (plural vs. singular, stemmed vs. exact), and tolerate typos — making it a fundamentally different problem from ordinary database querying.

Modern applications — e-commerce sites, documentation portals, code search tools, job boards — all depend on fast, relevant full-text search. A poor implementation can mean slow queries, irrelevant results, and frustrated users. Getting it right requires understanding the data structures and algorithms that power search engines like Elasticsearch, Lucene, and even PostgreSQL's built-in tsvector.

Why SQL LIKE Queries Do Not Scale

The first instinct is to use a SQL LIKE clause: WHERE body LIKE '%elasticsearch%'. This works on tiny datasets, but it has two fatal flaws at scale. First, a leading wildcard (%term) forces a full table scan — the database must read every row and every byte of every text column. No index can help. Second, LIKE is purely positional: it cannot understand that "run", "ran", and "running" are the same concept, or that "elastisearch" with a typo should still find Elasticsearch documents.

On a table with 10 million rows, a LIKE '%keyword%' query can take 20–60 seconds. A dedicated full-text search index returns the same result in under 10 milliseconds. The difference comes down to the data structure that full-text systems pre-compute at index time: the inverted index.

The Inverted Index: The Core Data Structure

An inverted index maps every unique term (word) to the list of documents that contain it — the opposite direction of a normal document store, hence "inverted". For each term, the index stores a postings list: an ordered array of document IDs (and optionally positions and frequencies). When a query arrives for "database", the engine looks up "database" in the index in O(1) time and immediately has the complete list of matching documents, without scanning any content at all.

A multi-term query like "database index" is answered by fetching both postings lists and intersecting them — only documents that appear in both lists match. Because postings lists are sorted by document ID, intersection runs in O(n) time using a merge algorithm, similar to the merge step in merge sort. This is exactly what Lucene (which powers both Elasticsearch and OpenSearch) does internally, and it is why search feels instant even over billions of documents.

Building an Inverted Index

Watch three documents get tokenized and each term mapped to the document IDs that contain it.

The Analysis Pipeline: From Raw Text to Index Terms

Before a word enters the inverted index, it passes through an analysis pipeline — a sequence of transformations that normalize text so that "Running", "runs", and "ran" all end up as the same index token. The pipeline typically consists of: (1) Tokenization — splitting the raw text on whitespace and punctuation into individual tokens; (2) Lowercasing — converting "Database" to "database" so queries are case-insensitive; (3) Stop-word removal — discarding ultra-common words like "the", "is", "a" that appear in nearly every document and would produce enormous, useless postings lists; and (4) Stemming or lemmatization — reducing inflected forms to a common base: "running" → "run", "databases" → "databas" (Porter stemmer) or "database" (lemmatizer).

The same pipeline runs at query time as well, so when a user types "Searching databases", the query is also lowercased, stop-words dropped, and stemmed before the engine looks up the postings lists. This symmetry is what makes stemming work: the document token and the query token land on the same form. Elasticsearch lets you configure a custom analyzer per field, mixing character filters, tokenizers, and token filters from a rich library.

Relevance Ranking: TF-IDF and BM25

Returning all matching documents is only half the job — the user wants the best matches first. Full-text engines assign every (document, query) pair a relevance score and sort results descending. The classic model is TF-IDF: Term Frequency (how often the term appears in the document — more occurrences = more relevant) multiplied by Inverse Document Frequency (how rare the term is across all documents — rare terms are more discriminating). A document that mentions "Elasticsearch" ten times scores higher than one that mentions it once, but common words like "the" have near-zero IDF so they barely move the score even if they appear hundreds of times.

BM25 (Best Match 25) is the successor to TF-IDF and the default scoring function in Elasticsearch since version 5.0, as well as in Lucene, OpenSearch, and PostgreSQL FTS. BM25 adds two key improvements: term frequency saturation (doubling the occurrence count does not double the score — the gain diminishes beyond a threshold controlled by parameter k1, typically 1.2) and length normalization (a long document that mentions a term once is penalized relative to a short document that mentions it once, controlled by parameter b, typically 0.75). These two corrections make BM25 significantly more accurate on real-world corpora.

Postings List Intersection for a Multi-Term Query

A two-term query fetches two postings lists and merges them to find documents matching both terms.

Relevance Scoring in Action: BM25 Ranking

BM25 Relevance Ranking

Documents are scored by BM25 (term frequency + length normalization + IDF), then sorted so the best match rises to the top.

Fuzzy Matching and Edit Distance

Users make typos. A search for "elastisearch" should still return Elasticsearch results. Fuzzy matching allows a query term to match index terms within a bounded edit distance (the number of single-character insertions, deletions, or substitutions needed to transform one word into another — also called Levenshtein distance). Elasticsearch's fuzziness: AUTO setting uses edit distance 1 for words of 5–8 characters and distance 2 for longer words, which covers the vast majority of real typos without producing too many false positives.

Under the hood, Lucene uses a Levenshtein automaton — a finite automaton built for the query term that accepts any string within the configured edit distance. Instead of comparing the query to every index term, the automaton traverses the index's term dictionary (stored as a finite state transducer) and collects all terms within edit distance in one pass. This makes fuzzy queries far faster than a naive O(V) scan over all vocabulary terms.

Named Systems: Elasticsearch, OpenSearch, and PostgreSQL FTS

Elasticsearch (and its open-source fork OpenSearch) are the most widely deployed dedicated search engines. They are built on top of Apache Lucene, a Java library that provides the inverted index, BM25 scoring, fuzzy matching, and more. Elasticsearch adds a distributed layer on top: documents are spread across shards (Lucene index instances), each shard is replicated across multiple nodes for fault tolerance, and query results from all shards are merged and re-ranked at the coordinating node. This architecture scales horizontally to billions of documents.

PostgreSQL full-text search is built into the database and uses the same conceptual model — tsvector (the pre-analyzed, indexed representation of a document) and tsquery (the analyzed query) — with a GIN (Generalized Inverted Index) or GiST index for fast lookup. For datasets under a few million documents where you already run PostgreSQL, its FTS is often sufficient and eliminates the operational complexity of a separate search cluster. For larger corpora, multi-language needs, semantic search, or advanced features like autocomplete and faceted navigation, a dedicated system like Elasticsearch is the right tool.

Feature	Elasticsearch / OpenSearch	PostgreSQL FTS
Underlying engine	Apache Lucene	Built-in GIN/GiST index
Default scorer	BM25	BM25 (pg 11+) / Ts_rank
Horizontal scaling	Native sharding + replication	Requires partitioning / Citus
Fuzzy matching	Levenshtein automaton, fuzziness param	pg_trgm extension (trigram similarity)
Tokenization/analyzers	Rich built-in + custom plugin ecosystem	Configurable dictionaries per language
Operational cost	Separate cluster to maintain	Already in your DB stack
Best for	Large corpora, complex queries, analytics	Smaller datasets, simpler ops

Indexing Strategy and Common Pitfalls

A few design decisions make or break a full-text search implementation. Index only what users search: avoid indexing raw HTML, markdown syntax characters, or internal metadata fields — they pollute the vocabulary with noise tokens and bloat the index. Choose the right analyzer per field: a product title benefits from an edge n-gram analyzer for prefix autocomplete, while a body field should use a standard analyzer with stemming. Store term vectors if you need to highlight matched snippets in results — without them, the engine cannot tell you which offsets in the original text matched the query. Control index refresh rate: in Elasticsearch, index.refresh_interval defaults to 1 second; during bulk indexing, set it to -1 and re-enable it afterward to get 5–10x indexing throughput.

Frequently Asked Questions

What is the difference between full-text search and vector search?

Full-text search (BM25-based) matches documents that share exact tokens with the query after analysis — it is lexical. Vector search (semantic search) encodes documents and queries as dense embedding vectors using a neural language model and retrieves documents whose vectors are nearest to the query vector in cosine or dot-product distance. Vector search can find a document about "automobile" even if the query says "car", because the embeddings capture meaning, not just spelling. In practice, modern search stacks combine both in a hybrid search pattern: BM25 for precision on exact terms, vectors for recall on conceptual matches, with Reciprocal Rank Fusion to merge the two result lists.

How does Elasticsearch handle distributed full-text search?

When you index a document, Elasticsearch routes it to a primary shard (determined by hash(document_id) % num_shards) and then replicates it to replica shards on other nodes. At query time, the coordinating node broadcasts the query to one copy of each shard (primary or replica), collects top-N results from each, and performs a global merge-and-sort. This means IDF scores are computed per-shard by default, which can cause slight ranking inconsistencies on small indices — you can force global IDF computation with search_type=dfs_query_then_fetch at the cost of an extra round-trip.

When should I use PostgreSQL FTS instead of Elasticsearch?

Use PostgreSQL FTS when your corpus is under roughly 5–10 million documents, your team already operates PostgreSQL, and you do not need features like real-time analytics, autocomplete with edge n-grams, or complex aggregations (facets). PostgreSQL FTS eliminates the complexity of keeping a separate search cluster synchronized with your database — data is always consistent because it is the database. A GIN index on a tsvector column with BM25 ranking via ts_rank_cd handles the majority of blog, documentation, and product search use cases without leaving your SQL stack.

Full-text search is not a feature you add to a database — it is a discipline with its own data structures, ranking models, and failure modes. Build the inverted index right, tune BM25 for your corpus, and your users will find what they mean, not just what they typed.
— alokknight Engineering