RAG & Retrieval · Intermediate

Also known as: keyword retrieval, sparse vector search

Sparse Retrieval

A retrieval method that represents documents as sparse term-frequency vectors, enabling fast keyword-based matching.

What Is Sparse Retrieval?

Sparse retrieval is a class of information retrieval methods that represent documents as high-dimensional vectors where most values are zero. Each dimension corresponds to a term in the vocabulary, and the non-zero values represent that term's importance in the document.

The "sparse" name reflects the vector structure: a vocabulary of 100,000 terms produces 100,000-dimensional vectors, but any given document contains only a few hundred distinct terms — so 99%+ of values are zero.
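As a concrete illustration, a sparse vector is rarely stored as an actual array of mostly zeros; a dictionary mapping term indices to counts captures the same information. The four-term vocabulary below is a toy example invented for this sketch:

```python
# Toy vocabulary: maps each term to its dimension index.
vocab = {"refund": 0, "cancel": 1, "policy": 2, "shipping": 3}

doc = "refund refund policy"

# Sparse representation: {dimension: weight}; all missing dimensions are zero.
sparse_vec = {}
for term in doc.split():
    idx = vocab[term]
    sparse_vec[idx] = sparse_vec.get(idx, 0) + 1

print(sparse_vec)  # → {0: 2, 2: 1}
```

With a real 100,000-term vocabulary the dictionary still holds only the few hundred terms the document actually contains, which is what makes the representation cheap to store and fast to intersect.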

The Inverted Index

Sparse retrieval is implemented via an inverted index: a mapping from each term to the list of documents containing it (the posting list), along with term frequency information.

"refund"  → [(doc_3, tf=2), (doc_7, tf=1), (doc_12, tf=4)]
"cancel"  → [(doc_1, tf=1), (doc_3, tf=3), (doc_9, tf=2)]
"policy"  → [(doc_3, tf=1), (doc_5, tf=2)]

At query time, the query terms are looked up in the index, and their posting lists are intersected or merged to produce a ranked result set. This operation is extremely fast — milliseconds even for billions of documents.
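The index structure and the query-time merge can be sketched in a few lines of Python. The documents are toy data, and the score here is a plain summed term frequency rather than a production ranking function like BM25:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index

def search(index, query):
    """Merge the query terms' posting lists, scoring docs by summed tf."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {
    "doc_1": "cancel order",
    "doc_3": "refund refund cancel cancel cancel policy",
    "doc_7": "refund request",
}
index = build_inverted_index(docs)
print(search(index, "refund policy"))  # → [('doc_3', 3), ('doc_7', 1)]
```

Note that only documents containing at least one query term are ever touched — the speed of sparse retrieval comes from never scanning the rest of the corpus.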

TF-IDF: The Foundation

Term Frequency–Inverse Document Frequency (TF-IDF) is the classic sparse scoring function:

TF-IDF(term, doc) = TF(term, doc) × IDF(term)

TF = frequency of term in doc / total terms in doc
IDF = log(total docs / docs containing term)

A term that appears often in a document but rarely in the corpus gets a high score — capturing the idea of a "discriminative" term.
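The formulas above translate directly into code. The three-document corpus and whitespace tokenization below are toy assumptions for illustration:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF as defined above: (tf / doc length) x log(N / df)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "the refund policy".split(),
    "the shipping policy".split(),
    "the refund was issued".split(),
]
doc = corpus[0]

print(round(tf_idf("refund", doc, corpus), 3))  # → 0.135 (in 2 of 3 docs)
print(round(tf_idf("the", doc, corpus), 3))     # → 0.0   (in every doc, IDF = 0)
```

The second result shows why IDF matters: "the" appears in every document, so log(3/3) = 0 and the term contributes nothing — exactly the behavior you want for stopwords.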

BM25: The Modern Standard

BM25 (Best Match 25) is the dominant sparse retrieval algorithm today. It improves on TF-IDF by adding:

  • Term frequency saturation — repeated occurrences have diminishing returns
  • Document length normalization — scores are adjusted by document length relative to the corpus average, so long, verbose documents don't dominate purely by containing more terms

BM25 is the default ranking function in Elasticsearch, OpenSearch, and Solr.
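A sketch of the per-term BM25 score, using the common smoothed-IDF variant and the conventional default parameters k1 = 1.2 and b = 0.75 (exact details vary slightly between implementations):

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """BM25 contribution of one term: saturating tf, length-normalized, IDF-weighted."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)         # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)

# Term frequency saturation: ten occurrences score far less than 10x one occurrence.
s1  = bm25_score(tf=1,  doc_len=100, avg_doc_len=100, n_docs=1000, df=50)
s10 = bm25_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1000, df=50)
print(s10 / s1)  # ≈ 1.96 — far below 10
```

The saturation in the last line is the key difference from raw TF-IDF: a document cannot climb the ranking simply by repeating a keyword many times.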

Sparse vs Dense: Trade-offs

                       Sparse                Dense
Vocabulary handling    Exact match only      Semantic similarity
Latency                Very fast (ms)        Fast (5–20 ms with ANN)
Infrastructure         Inverted index        Vector index (HNSW)
Handles synonyms       No                    Yes
Handles rare terms     Perfectly             Poorly
Training required      No                    Yes (embedding model)

SPLADE: Learned Sparse Retrieval

Recent work like SPLADE (SParse Lexical AnD Expansion) uses a neural network to learn which terms to expand a document with — keeping the sparse representation but improving recall by adding semantically related terms to the index. It bridges dense and sparse retrieval.

Why Sparse Retrieval Still Matters

Despite the rise of vector databases, sparse retrieval remains essential:

  • Exact term matching is reliable — product IDs, error codes, names, and rare jargon are retrieved precisely
  • No embedding model dependency — works without GPU-based inference
  • Explainability — it is easy to see which terms caused a document to rank
  • Speed — inverted index lookup is typically faster than approximate nearest-neighbor search

Sparse Retrieval in KnowledgeSDK

KnowledgeSDK's POST /v1/search endpoint uses Typesense, which maintains an inverted BM25 index alongside the vector index. Sparse and dense retrieval are combined automatically using Reciprocal Rank Fusion, giving you the benefits of both approaches.

This means queries containing exact product codes or version strings will retrieve correctly even when semantic similarity would miss them.
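Reciprocal Rank Fusion itself is simple to sketch: each document's fused score is the sum of 1/(k + rank) over every ranking it appears in, with k = 60 as the conventional constant. This is a generic illustration with made-up document IDs, not Typesense's or KnowledgeSDK's actual implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each doc scores sum of 1 / (k + rank) across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["doc_3", "doc_7", "doc_1"]   # e.g. from BM25
dense_ranking  = ["doc_3", "doc_9", "doc_7"]   # e.g. from vector search

print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
# → ['doc_3', 'doc_7', 'doc_9', 'doc_1']
```

Because RRF uses only rank positions, it needs no score calibration between the two retrievers — BM25 scores and cosine similarities are never compared directly.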

Related Terms

BM25
A probabilistic ranking function used in information retrieval that scores documents based on term frequency and inverse document frequency.
Dense Retrieval
A retrieval method that represents both queries and documents as dense vectors and finds matches via nearest-neighbor search.
Hybrid Search
A retrieval strategy that combines dense vector search with sparse keyword search (like BM25) to improve recall and precision.
