knowledgesdk.com/glossary/bm25
RAG & Retrieval · Intermediate

Also known as: Best Match 25, Okapi BM25

BM25

A probabilistic ranking function used in information retrieval that scores documents based on term frequency and inverse document frequency.

What Is BM25?

BM25 (Best Match 25), also called Okapi BM25, is a probabilistic ranking function for information retrieval. It scores each document against a query based on how often query terms appear in the document, adjusted for document length and corpus-wide term rarity.

BM25 is the default ranking function in Elasticsearch, OpenSearch, and Apache Solr (all built on Apache Lucene), and it is widely used in other keyword search systems. Although it was developed in the 1990s, it remains a strong baseline and a common component of modern hybrid search pipelines.

The BM25 Formula

For a query Q containing terms q₁, q₂, ... qₙ, the score for document D is:

$$
\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

Where:

  • f(qᵢ, D) — raw term frequency of term qᵢ in document D
  • |D| — length of document D in words
  • avgdl — average document length across the corpus
  • k₁ — term frequency saturation parameter (typically 1.2–2.0)
  • b — length normalization parameter (typically 0.75)
  • IDF(qᵢ) — inverse document frequency of term qᵢ
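
The formula can be sketched in a few lines of Python. This is a minimal illustration rather than any engine's actual implementation: the function name `bm25_scores`, the toy corpus, and whitespace tokenization are all assumptions, and because the formula above leaves IDF unspecified, the sketch assumes the Lucene-style variant ln(1 + (N − n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number containing the term.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            f = tf[q]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant (Lucene-style, never negative).
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            # Length normalization: 1 - b + b * |D| / avgdl
            norm = 1 - b + b * len(d) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores

# Hypothetical three-document corpus, tokenized on whitespace.
docs = [
    "the hnsw index speeds up vector search".split(),
    "the cat sat on the mat".split(),
    "pgvector adds vector search to postgres".split(),
]
scores = bm25_scores("hnsw vector".split(), docs)
```

Here the first document matches both query terms and ranks highest, the third matches only "vector", and the second shares no terms with the query and scores exactly zero.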

Understanding the Parameters

k₁: TF Saturation

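Controls how quickly term frequency saturates. With k₁ = 1.2 and a document of average length:

  • A term appearing once contributes a term-frequency factor of 1.0
  • A term appearing 10 times contributes about 1.96, roughly double rather than ten times as much
  • The factor can never exceed k₁ + 1 = 2.2, which prevents a document from dominating purely by repeating a keyword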
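
To make the saturation concrete, the snippet below evaluates only the term-frequency part of the formula for a document of average length (so the length term equals 1). The helper name `tf_component` is an assumption made for this illustration.

```python
def tf_component(f, k1=1.2, length_norm=1.0):
    """Term-frequency part of BM25; length_norm stands for 1 - b + b * |D| / avgdl."""
    return f * (k1 + 1) / (f + k1 * length_norm)

# Print the factor for increasing term counts.
for f in (1, 2, 5, 10, 100):
    print(f, round(tf_component(f), 3))
```

With k₁ = 1.2 the factor is 1.0 for a single occurrence, about 1.96 for ten occurrences, and about 2.17 for a hundred. It approaches but never reaches k₁ + 1 = 2.2.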

b: Length Normalization

Controls how much document length affects scoring:

  • b = 1.0 — full length normalization (shorter docs preferred)
  • b = 0.0 — no length normalization
  • b = 0.75 — the recommended default
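
The effect of b can be seen by comparing a short and a long document that contain the same term the same number of times. The sketch below is illustrative only; the function names and the document lengths (50 and 500 words against an average of 100) are assumptions.

```python
def length_norm(doc_len, avgdl, b=0.75):
    """Length term from the BM25 denominator: 1 - b + b * |D| / avgdl."""
    return 1 - b + b * doc_len / avgdl

def length_adjusted_tf(f, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency part of BM25 including length normalization."""
    return f * (k1 + 1) / (f + k1 * length_norm(doc_len, avgdl, b))

avgdl = 100
results = {}
for b in (0.0, 0.75, 1.0):
    short_doc = length_adjusted_tf(2, doc_len=50, avgdl=avgdl, b=b)
    long_doc = length_adjusted_tf(2, doc_len=500, avgdl=avgdl, b=b)
    results[b] = (short_doc, long_doc)
    print(b, round(short_doc, 3), round(long_doc, 3))
```

With b = 0 both documents receive the same factor. As b rises toward 1.0, the gap widens in favor of the shorter document.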

IDF: Discriminative Power

Terms appearing in many documents (like "the", "and") get near-zero IDF scores. Terms appearing in few documents (like "HNSW" or "pgvector") get high IDF scores and dominate the final score.
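
As a rough illustration, the snippet below computes IDF for a very common term and a rare one. It assumes the Lucene-style variant ln(1 + (N − n + 0.5) / (n + 0.5)), which is one of several IDF definitions in use. The corpus size and document frequencies are made-up numbers.

```python
import math

def idf(n_docs, doc_freq):
    """Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

n_docs = 1_000_000                  # hypothetical corpus size
common = idf(n_docs, 990_000)       # a term like "the", present in 99% of documents
rare = idf(n_docs, 100)             # a term like "pgvector", present in 100 documents
print(round(common, 3), round(rare, 3))
```

Under these assumptions the common term scores close to 0.01 and the rare term a little above 9, so a single match on the rare term outweighs many matches on the common one.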

BM25 vs TF-IDF

| | TF-IDF | BM25 |
|---|---|---|
| TF saturation | None (linear) | Saturating, bounded by k₁ + 1 (tunable via k₁) |
| Length normalization | Basic | Tunable with b |
| Probabilistic basis | No | Yes |
| Ranking quality | Good | Better (empirically) |

BM25 is generally preferred over TF-IDF for ranking in modern search systems.

BM25 Strengths

  • Exact term matching — reliable for product codes, names, error messages
  • No training required — works on any text without a model
  • Fast — an inverted index limits scoring to documents that contain at least one query term, so queries stay quick even on large corpora
  • Interpretable — you can see exactly which terms drove a match
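  • Language-independent scoring — the formula works for any language once text is tokenized, although languages without whitespace word boundaries (such as Chinese or Japanese) need a suitable tokenizer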

BM25 Weaknesses

  • No synonym handling — "cancel" and "unsubscribe" are completely different terms
  • No conceptual similarity — "machine learning" and "AI" are unrelated in BM25
  • Vocabulary mismatch — the query must share words with the document

These weaknesses are precisely where dense (vector) retrieval excels, which is why BM25 and semantic search are combined in hybrid retrieval.

BM25 in Practice with KnowledgeSDK

KnowledgeSDK uses Typesense as its search layer. Typesense maintains a BM25 inverted index over the content field of each knowledge_item. When you call POST /v1/search, BM25 scoring runs in parallel with vector similarity search, and the results are fused via Reciprocal Rank Fusion.

This is especially valuable for technical queries — if a user searches for an exact error code or SKU, BM25 will surface the right document even if the vector similarity is imperfect.
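
Reciprocal Rank Fusion itself is simple enough to sketch. The code below shows the general technique and is not KnowledgeSDK's or Typesense's implementation: the function name, the document IDs, and the constant k = 60 (a commonly used value) are assumptions. Each document receives 1 / (k + rank) from every ranked list it appears in, and the sums determine the fused order.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical result lists from the two retrievers.
bm25_ranking = ["doc_err_42", "doc_faq", "doc_billing"]
vector_ranking = ["doc_faq", "doc_cancel", "doc_err_42"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and vector similarities, which live on different scales. Documents that appear in both lists rise to the top of the fused ranking.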

Related Terms

Sparse Retrieval
A retrieval method that represents documents as sparse term-frequency vectors, enabling fast keyword-based matching.
Hybrid Search
A retrieval strategy that combines dense vector search with sparse keyword search (like BM25) to improve recall and precision.
Retrieval Pipeline
The end-to-end sequence of steps — query processing, search, re-ranking, and context assembly — that retrieves relevant documents for an LLM.