What Is BM25?
BM25 (Best Match 25), also called Okapi BM25, is a probabilistic ranking function for information retrieval. It scores each document against a query based on how often query terms appear in the document, adjusted for document length and corpus-wide term rarity.
BM25 is the default ranking function in Elasticsearch, OpenSearch, Apache Solr, and most traditional search systems. Despite being developed in the 1990s, it remains highly competitive and is a critical component of modern hybrid search pipelines.
The BM25 Formula
For a query Q containing terms q₁, q₂, ... qₙ, the score for document D is:
BM25(D, Q) = Σᵢ IDF(qᵢ) × [ f(qᵢ, D) × (k₁ + 1) ] / [ f(qᵢ, D) + k₁ × (1 − b + b × |D| / avgdl) ]
Where:
- f(qᵢ, D) — raw term frequency of term qᵢ in document D
- |D| — length of document D in words
- avgdl — average document length across the corpus
- k₁ — term frequency saturation parameter (typically 1.2–2.0)
- b — length normalization parameter (typically 0.75)
- IDF(qᵢ) — inverse document frequency of term qᵢ
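The formula can be sketched directly in Python. This is a minimal reference implementation, not how production engines compute it (they use precomputed inverted indexes); the IDF variant here is the smoothed form used by Lucene, with a +1 inside the log to keep scores non-negative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25.

    corpus: list of tokenized documents (lists of words), used to
    derive document frequencies and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        # n: number of documents in the corpus containing the term
        n = sum(1 for d in corpus if term in d)
        # Smoothed IDF; the outer +1 keeps it non-negative (Lucene variant)
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [
    ["fast", "vector", "search"],
    ["bm25", "ranking", "function", "bm25"],
    ["the", "quick", "fox"],
]
print(bm25_score(["bm25"], corpus[1], corpus))  # positive: term present
print(bm25_score(["bm25"], corpus[0], corpus))  # 0.0: term absent
```

A document that never mentions a query term contributes nothing for that term, which is the vocabulary-mismatch weakness discussed later.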
Understanding the Parameters
k₁: TF Saturation
Controls how quickly term frequency saturates. With k₁ = 1.2:
- A term appearing once contributes significantly
- A term appearing 10 times contributes only marginally more
- Prevents a document from dominating purely by repeating a keyword
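The saturation behavior is easy to verify by isolating the term-frequency factor of the formula (setting b = 0 so length normalization drops out). The factor is bounded above by k₁ + 1:

```python
def tf_component(f, k1=1.2):
    # BM25 term-frequency factor with length normalization ignored (b = 0).
    # Asymptotically approaches k1 + 1 as f grows.
    return f * (k1 + 1) / (f + k1)

print(tf_component(1))    # 1.0
print(tf_component(10))   # ~1.96
print(tf_component(100))  # still below the cap of k1 + 1 = 2.2
```

Going from 1 occurrence to 10 roughly doubles the contribution; going from 10 to 100 adds almost nothing, which is exactly the keyword-stuffing protection described above.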
b: Length Normalization
Controls how much document length affects scoring:
- b = 1.0 — full length normalization (shorter docs preferred)
- b = 0.0 — no length normalization
- b = 0.75 — the recommended default
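The effect of b shows up in the denominator multiplier k₁ × (1 − b + b × |D|/avgdl). A small sketch of just that piece:

```python
def length_penalty(doc_len, avgdl, b=0.75, k1=1.2):
    # The k1 * (1 - b + b * |D|/avgdl) part of the BM25 denominator.
    # Larger values shrink a term's contribution.
    return k1 * (1 - b + b * doc_len / avgdl)

print(length_penalty(100, 100))          # average-length doc: exactly k1
print(length_penalty(200, 100))          # twice-average doc: penalized
print(length_penalty(200, 100, b=0.0))   # b = 0: length is ignored
```

With b = 0.75, a document twice the average length gets a noticeably larger denominator (lower score per occurrence); with b = 0 the multiplier is always k₁ regardless of length.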
IDF: Discriminative Power
Terms appearing in many documents (like "the", "and") get near-zero IDF scores. Terms appearing in few documents (like "HNSW" or "pgvector") get high IDF scores and dominate matching.
BM25 vs TF-IDF
| | TF-IDF | BM25 |
|---|---|---|
| TF saturation | None (linear) | Saturating (tunable via k₁) |
| Length normalization | Basic | Tunable via b |
| Probabilistic basis | No | Yes |
| Retrieval quality | Good | Better (empirically) |
BM25 is almost universally preferred over TF-IDF in modern lexical search systems.
BM25 Strengths
- Exact term matching — reliable for product codes, names, error messages
- No training required — works on any text without a model
- Extremely fast — inverted index lookups are sub-millisecond at scale
- Interpretable — you can see exactly which terms drove a match
- Multilingual — works in any language without additional tooling
BM25 Weaknesses
- No synonym handling — "cancel" and "unsubscribe" are completely different terms
- No conceptual similarity — "machine learning" and "AI" are unrelated in BM25
- Vocabulary mismatch — the query must share words with the document
These weaknesses are precisely where dense (vector) retrieval excels, which is why BM25 and semantic search are combined in hybrid retrieval.
BM25 in Practice with KnowledgeSDK
KnowledgeSDK uses Typesense as its search layer. Typesense maintains a BM25 inverted index over the content field of each knowledge_item. When you call POST /v1/search, BM25 scoring runs in parallel with vector similarity search, and the results are fused via Reciprocal Rank Fusion.
This is especially valuable for technical queries — if a user searches for an exact error code or SKU, BM25 will surface the right document even if the vector similarity is imperfect.
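Reciprocal Rank Fusion itself is simple enough to sketch. This is an illustrative implementation of the standard RRF formula, not KnowledgeSDK's actual fusion code; the document IDs are made up, and k = 60 is the conventional constant from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs via RRF.

    rankings: list of ranked lists, each ordered best-first.
    Each appearance at rank r contributes 1 / (k + r).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
bm25_hits = ["doc_err_4012", "doc_setup", "doc_faq"]
vector_hits = ["doc_err_4012", "doc_faq", "doc_intro"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # "doc_err_4012" ranks first: both retrievers agree on it
```

Because RRF works on ranks rather than raw scores, it needs no calibration between BM25 scores and cosine similarities, which is a common reason hybrid systems choose it.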