What Is Re-ranking?
Re-ranking is a post-retrieval step that takes a set of candidate documents returned by an initial retrieval stage and re-scores them using a more accurate but computationally heavier model. The goal is to improve the final ranked list's relevance before passing context to the LLM.
Re-ranking is the difference between "pretty good" and "production-quality" RAG.
Why Re-rank?
Initial retrieval (whether semantic, keyword, or hybrid) is optimized for speed and recall — it retrieves a broad set of plausibly relevant documents quickly. But bi-encoder similarity scores are approximate and sometimes miss subtle relevance signals.
A re-ranker is optimized for precision — it examines the query and each candidate document jointly, producing a much more accurate relevance score.
The trade-off: re-ranking is 10–100x slower per document than bi-encoder retrieval. By running it only on a small candidate set (top 20–50 from initial retrieval), you get the accuracy benefit at an acceptable latency cost.
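A back-of-the-envelope calculation makes the trade-off concrete. The per-document cost below is an assumed, illustrative number, not a measured benchmark:

```python
# Assumed, illustrative per-document cost; not a measured benchmark
cross_encoder_ms = 5.0   # one (query, document) forward pass
corpus_size = 1_000_000
candidate_set = 50

# Cross-encoding every document for every query would take minutes:
full_corpus_s = corpus_size * cross_encoder_ms / 1000   # seconds
# Re-ranking only the retrieved candidates is a fixed, small cost:
rerank_ms = candidate_set * cross_encoder_ms            # milliseconds
print(f"full corpus: {full_corpus_s:.0f} s, candidates only: {rerank_ms:.0f} ms")
```

At these assumed numbers, cross-encoding the whole corpus costs 5000 seconds per query, while re-ranking 50 candidates costs 250 ms, which is why the cross-encoder only ever sees the initial retriever's shortlist.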
Cross-Encoder Architecture
Re-rankers are typically cross-encoders: the query and document are concatenated and passed through a transformer together, rather than being encoded independently.
```
Input:  [CLS] query [SEP] document [SEP]
Output: scalar relevance score
```
Because the query and document attend to each other's tokens during encoding, the model captures fine-grained interaction signals that bi-encoders miss.
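The joint input format can be sketched as a simple string template. This is illustrative only: in practice the model's tokenizer inserts the special tokens for you when given a text pair.

```python
def cross_encoder_input(query: str, document: str) -> str:
    """Joint input: both texts share one sequence, so self-attention
    can compare query tokens and document tokens directly.
    (Real tokenizers add these special tokens automatically.)"""
    return f"[CLS] {query} [SEP] {document} [SEP]"

# A bi-encoder, by contrast, encodes each text separately into a vector
# and compares the vectors, with no cross-attention between the two texts.
print(cross_encoder_input("what is RAG?", "RAG combines retrieval with generation."))
```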
The Two-Stage Pipeline
```
Initial Retrieval (fast, high recall)
  Query → Embed → ANN Search → Top-50 candidates
          ↓
Re-ranking (slow, high precision)
  (query, candidate_1) → cross-encoder → score: 0.95
  (query, candidate_2) → cross-encoder → score: 0.41
  (query, candidate_3) → cross-encoder → score: 0.87
  ...
          ↓
Final Top-K (e.g., top 5) sent to LLM
```
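The pipeline above can be expressed as one small generic function. Here `retrieve` and `score` are stand-ins for your ANN search and cross-encoder; the toy implementations below exist only to make the sketch runnable:

```python
from typing import Callable, Sequence

def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],  # fast, high-recall stage
    score: Callable[[str, str], float],             # slow, high-precision stage
    candidates: int = 50,
    final_k: int = 5,
) -> list[str]:
    docs = retrieve(query, candidates)
    # Score every (query, doc) pair jointly, keep the best final_k
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:final_k]

# Toy stand-ins: retrieval returns a fixed list, scoring counts shared words
corpus = ["cats purr", "dogs bark", "cats and dogs", "fish swim"]
retrieve = lambda q, k: corpus[:k]
score = lambda q, d: len(set(q.split()) & set(d.split()))
print(two_stage_search("cats dogs", retrieve, score, final_k=2))
```

In a real system, `retrieve` would call your vector database and `score` would call a cross-encoder's `predict` (batched, not one pair at a time, for throughput).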
Popular Re-ranking Models
| Model | Notes |
|---|---|
| Cohere Rerank | Hosted API, easy to integrate |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | Lightweight, open source |
| BAAI/bge-reranker-large | Strong open-source model |
| Jina Rerank | Hosted API, multilingual |
Implementation Example
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial retrieval returns 20 candidates
candidates = vector_db.search(query, top_k=20)

# Re-rank: score each (query, document) pair jointly
pairs = [(query, doc.content) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-rank score only (a key avoids comparing the doc objects on ties)
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_5 = [doc for score, doc in ranked[:5]]
```
With the Cohere Rerank API:
```python
import cohere

co = cohere.Client("...")
results = co.rerank(
    query=query,
    documents=[doc.content for doc in candidates],
    top_n=5,
    model="rerank-english-v3.0",
)
```
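The response refers to documents by their position in the `documents` list you sent; in the Python SDK each result exposes `.index` and `.relevance_score`. The mapping back to your documents looks roughly like this, simulated with plain objects so the logic runs without an API key:

```python
from types import SimpleNamespace

documents = ["doc A", "doc B", "doc C"]

# Simulated response shape: results come back already sorted by score
results = SimpleNamespace(results=[
    SimpleNamespace(index=2, relevance_score=0.91),
    SimpleNamespace(index=0, relevance_score=0.47),
])

top_docs = [documents[r.index] for r in results.results]
print(top_docs)  # ['doc C', 'doc A']
```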
When to Add Re-ranking
Add a re-ranker when:
- LLM responses include irrelevant or off-topic context
- Initial retrieval returns many plausibly relevant but only a few truly relevant documents
- You are willing to add 50–200ms of latency for better answer quality
- Your application involves multi-document synthesis where ranking order matters
Re-ranking and KnowledgeSDK
KnowledgeSDK's POST /v1/search returns scored results from hybrid retrieval. These scores can serve as the initial-retrieval stage for a re-ranker in your application layer: request a larger candidate set (e.g., `limit: 20`) and apply a cross-encoder locally or via the Cohere API before passing the top 3–5 results to your LLM prompt.