What Is Re-ranking?
Re-ranking is a post-retrieval step that takes a set of candidate documents returned by an initial retrieval stage and re-scores them using a more accurate but computationally heavier model. The goal is to improve the final ranked list's relevance before passing context to the LLM.
Re-ranking is the difference between "pretty good" and "production-quality" RAG.
Why Re-rank?
Initial retrieval (whether semantic, keyword, or hybrid) is optimized for speed and recall — it retrieves a broad set of plausibly relevant documents quickly. But bi-encoder similarity scores are approximate and sometimes miss subtle relevance signals.
A re-ranker is optimized for precision — it examines the query and each candidate document jointly, producing a much more accurate relevance score.
The trade-off: re-ranking is 10–100x slower per document than bi-encoder retrieval. By running it only on a small candidate set (top 20–50 from initial retrieval), you get the accuracy benefit at an acceptable latency cost.
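A back-of-the-envelope calculation makes the trade-off concrete. The per-document cost below is an assumed, illustrative number, not a measured benchmark:

```python
# Assumed, illustrative per-document cost; not a measured benchmark
cross_encoder_ms = 5.0   # one (query, document) forward pass
corpus_size = 1_000_000
candidate_set = 50

# Cross-encoding every document for every query would take minutes:
full_corpus_s = corpus_size * cross_encoder_ms / 1000   # seconds
# Re-ranking only the retrieved candidates is a fixed, small cost:
rerank_ms = candidate_set * cross_encoder_ms            # milliseconds
print(f"full corpus: {full_corpus_s:.0f} s, candidates only: {rerank_ms:.0f} ms")
```

At these assumed numbers, cross-encoding the whole corpus costs 5000 seconds per query, while re-ranking 50 candidates costs 250 ms, which is why the cross-encoder only ever sees the initial retriever's shortlist.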
Cross-Encoder Architecture
Re-rankers are typically cross-encoders: the query and document are concatenated and passed through a transformer together, rather than being encoded independently.
```
Input:  [CLS] query [SEP] document [SEP]
Output: scalar relevance score
```
Because the query and document attend to each other's tokens during encoding, the model captures fine-grained interaction signals that bi-encoders miss.
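The joint input format can be sketched as a simple string template. This is illustrative only: in practice the model's tokenizer inserts the special tokens for you when given a text pair.

```python
def cross_encoder_input(query: str, document: str) -> str:
    """Joint input: both texts share one sequence, so self-attention
    can compare query tokens and document tokens directly.
    (Real tokenizers add these special tokens automatically.)"""
    return f"[CLS] {query} [SEP] {document} [SEP]"

# A bi-encoder, by contrast, encodes each text separately into a vector
# and compares the vectors, with no cross-attention between the two texts.
print(cross_encoder_input("what is RAG?", "RAG combines retrieval with generation."))
```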
The Two-Stage Pipeline
```
Initial Retrieval (fast, high recall)
  Query → Embed → ANN Search → Top-50 candidates
          ↓
Re-ranking (slow, high precision)
  (query, candidate_1) → cross-encoder → score: 0.95
  (query, candidate_2) → cross-encoder → score: 0.41
  (query, candidate_3) → cross-encoder → score: 0.87
  ...
          ↓
Final Top-K (e.g., top 5) sent to LLM
```
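The pipeline above can be expressed as one small generic function. Here `retrieve` and `score` are stand-ins for your ANN search and cross-encoder; the toy implementations below exist only to make the sketch runnable:

```python
from typing import Callable, Sequence

def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],  # fast, high-recall stage
    score: Callable[[str, str], float],             # slow, high-precision stage
    candidates: int = 50,
    final_k: int = 5,
) -> list[str]:
    docs = retrieve(query, candidates)
    # Score every (query, doc) pair jointly, keep the best final_k
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:final_k]

# Toy stand-ins: retrieval returns a fixed list, scoring counts shared words
corpus = ["cats purr", "dogs bark", "cats and dogs", "fish swim"]
retrieve = lambda q, k: corpus[:k]
score = lambda q, d: len(set(q.split()) & set(d.split()))
print(two_stage_search("cats dogs", retrieve, score, final_k=2))
```

In a real system, `retrieve` would call your vector database and `score` would call a cross-encoder's `predict` (batched, not one pair at a time, for throughput).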
Popular Re-ranking Models
| Model | Notes |
|---|---|
| Cohere Rerank | Hosted API, easy to integrate |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | Lightweight, open source |
| BAAI/bge-reranker-large | Strong open-source model |
| Jina Rerank | Hosted API, multilingual |
Implementation Example
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial retrieval returns 20 candidates
candidates = vector_db.search(query, top_k=20)

# Re-rank: score each (query, document) pair jointly
pairs = [(query, doc.content) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-rank score only (a key avoids comparing the doc objects on ties)
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_5 = [doc for score, doc in ranked[:5]]
```
With the Cohere Rerank API:
```python
import cohere

co = cohere.Client("...")
results = co.rerank(
    query=query,
    documents=[doc.content for doc in candidates],
    top_n=5,
    model="rerank-english-v3.0",
)
```
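The response refers to documents by their position in the `documents` list you sent; in the Python SDK each result exposes `.index` and `.relevance_score`. The mapping back to your documents looks roughly like this, simulated with plain objects so the logic runs without an API key:

```python
from types import SimpleNamespace

documents = ["doc A", "doc B", "doc C"]

# Simulated response shape: results come back already sorted by score
results = SimpleNamespace(results=[
    SimpleNamespace(index=2, relevance_score=0.91),
    SimpleNamespace(index=0, relevance_score=0.47),
])

top_docs = [documents[r.index] for r in results.results]
print(top_docs)  # ['doc C', 'doc A']
```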
When to Add Re-ranking
Add a re-ranker when:
- LLM responses include irrelevant or off-topic context
- Initial retrieval returns many plausibly relevant but only a few truly relevant documents
- You are willing to add 50–200ms of latency for better answer quality
- Your application involves multi-document synthesis where ranking order matters
Re-ranking and KnowledgeSDK
KnowledgeSDK's POST /v1/search returns scored results from hybrid retrieval. These scores can serve as the initial-retrieval stage for a re-ranker in your application layer: request a larger candidate set (e.g., `limit: 20`) and apply a cross-encoder locally or via the Cohere API before passing the top 3–5 results to your LLM prompt.