What Is BM25?
BM25 (Best Match 25), also called Okapi BM25, is a probabilistic ranking function for information retrieval. It scores each document against a query based on how often query terms appear in the document, adjusted for document length and corpus-wide term rarity.
BM25 is the default ranking function in Elasticsearch, OpenSearch, Apache Solr, and most traditional search systems. Despite being developed in the 1990s, it remains highly competitive and is a critical component of modern hybrid search pipelines.
The BM25 Formula
For a query Q containing terms q₁, q₂, ... qₙ, the score for document D is:
BM25(D, Q) = Σᵢ IDF(qᵢ) × [ f(qᵢ, D) × (k₁ + 1) ] / [ f(qᵢ, D) + k₁ × (1 − b + b × |D| / avgdl) ]
Where:
- f(qᵢ, D) — raw term frequency of term qᵢ in document D
- |D| — length of document D in words
- avgdl — average document length across the corpus
- k₁ — term frequency saturation parameter (typically 1.2–2.0)
- b — length normalization parameter (typically 0.75)
- IDF(qᵢ) — inverse document frequency of term qᵢ
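The formula can be sketched directly in Python. This is a minimal reference implementation, not how production engines compute it (they use precomputed inverted indexes); the IDF variant here is the smoothed form used by Lucene, with a +1 inside the log to keep scores non-negative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25.

    corpus: list of tokenized documents (lists of words), used to
    derive document frequencies and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        # n: number of documents in the corpus containing the term
        n = sum(1 for d in corpus if term in d)
        # Smoothed IDF; the outer +1 keeps it non-negative (Lucene variant)
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [
    ["fast", "vector", "search"],
    ["bm25", "ranking", "function", "bm25"],
    ["the", "quick", "fox"],
]
print(bm25_score(["bm25"], corpus[1], corpus))  # positive: term present
print(bm25_score(["bm25"], corpus[0], corpus))  # 0.0: term absent
```

A document that never mentions a query term contributes nothing for that term, which is the vocabulary-mismatch weakness discussed later.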
Understanding the Parameters
k₁: TF Saturation
Controls how quickly term frequency saturates. With k₁ = 1.2:
- A term appearing once contributes significantly
- A term appearing 10 times contributes only marginally more
- Prevents a document from dominating purely by repeating a keyword
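The saturation behavior is easy to verify by isolating the term-frequency factor of the formula (setting b = 0 so length normalization drops out). The factor is bounded above by k₁ + 1:

```python
def tf_component(f, k1=1.2):
    # BM25 term-frequency factor with length normalization ignored (b = 0).
    # Asymptotically approaches k1 + 1 as f grows.
    return f * (k1 + 1) / (f + k1)

print(tf_component(1))    # 1.0
print(tf_component(10))   # ~1.96
print(tf_component(100))  # still below the cap of k1 + 1 = 2.2
```

Going from 1 occurrence to 10 roughly doubles the contribution; going from 10 to 100 adds almost nothing, which is exactly the keyword-stuffing protection described above.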
b: Length Normalization
Controls how much document length affects scoring:
- b = 1.0 — full length normalization (shorter docs preferred)
- b = 0.0 — no length normalization
- b = 0.75 — the recommended default
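The effect of b shows up in the denominator multiplier k₁ × (1 − b + b × |D|/avgdl). A small sketch of just that piece:

```python
def length_penalty(doc_len, avgdl, b=0.75, k1=1.2):
    # The k1 * (1 - b + b * |D|/avgdl) part of the BM25 denominator.
    # Larger values shrink a term's contribution.
    return k1 * (1 - b + b * doc_len / avgdl)

print(length_penalty(100, 100))          # average-length doc: exactly k1
print(length_penalty(200, 100))          # twice-average doc: penalized
print(length_penalty(200, 100, b=0.0))   # b = 0: length is ignored
```

With b = 0.75, a document twice the average length gets a noticeably larger denominator (lower score per occurrence); with b = 0 the multiplier is always k₁ regardless of length.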
IDF: Discriminative Power
Terms appearing in many documents (like "the", "and") get near-zero IDF scores. Terms appearing in few documents (like "HNSW" or "pgvector") get high IDF scores and dominate matching.
BM25 vs TF-IDF
| | TF-IDF | BM25 |
|---|---|---|
| TF saturation | None (linear) | Saturating (tunable via k₁) |
| Length normalization | Basic | Tunable via b |
| Probabilistic basis | No | Yes |
| Retrieval quality | Good | Better (empirically) |
BM25 is almost universally preferred over TF-IDF in modern lexical search systems.
BM25 Strengths
- Exact term matching — reliable for product codes, names, error messages
- No training required — works on any text without a model
- Extremely fast — inverted index lookups are sub-millisecond at scale
- Interpretable — you can see exactly which terms drove a match
- Multilingual — works in any language without additional tooling
BM25 Weaknesses
- No synonym handling — "cancel" and "unsubscribe" are completely different terms
- No conceptual similarity — "machine learning" and "AI" are unrelated in BM25
- Vocabulary mismatch — the query must share words with the document
These weaknesses are precisely where dense (vector) retrieval excels, which is why BM25 and semantic search are combined in hybrid retrieval.
BM25 in Practice with KnowledgeSDK
KnowledgeSDK uses Typesense as its search layer. Typesense maintains a BM25 inverted index over the content field of each knowledge_item. When you call POST /v1/search, BM25 scoring runs in parallel with vector similarity search, and the results are fused via Reciprocal Rank Fusion.
This is especially valuable for technical queries — if a user searches for an exact error code or SKU, BM25 will surface the right document even if the vector similarity is imperfect.
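Reciprocal Rank Fusion itself is simple enough to sketch. This is an illustrative implementation of the standard RRF formula, not KnowledgeSDK's actual fusion code; the document IDs are made up, and k = 60 is the conventional constant from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs via RRF.

    rankings: list of ranked lists, each ordered best-first.
    Each appearance at rank r contributes 1 / (k + r).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
bm25_hits = ["doc_err_4012", "doc_setup", "doc_faq"]
vector_hits = ["doc_err_4012", "doc_faq", "doc_intro"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # "doc_err_4012" ranks first: both retrievers agree on it
```

Because RRF works on ranks rather than raw scores, it needs no calibration between BM25 scores and cosine similarities, which is a common reason hybrid systems choose it.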