What Is a Retrieval Pipeline?
A retrieval pipeline is the complete sequence of processing steps that transforms a user's query into a set of relevant document chunks ready for injection into an LLM prompt. It is the "retrieval" half of Retrieval-Augmented Generation.
A well-designed retrieval pipeline dramatically improves answer quality. A poorly designed one leads to hallucinations, irrelevant context, and degraded user experience.
Anatomy of a Retrieval Pipeline
User Query
│
▼
1. Query Processing
├── Preprocessing (lowercase, strip PII)
├── Query expansion (optional)
└── Query embedding
│
▼
2. Retrieval
├── Dense search (vector ANN)
├── Sparse search (BM25)
└── Score fusion (RRF or weighted)
│
▼
3. Post-Retrieval Processing
├── Re-ranking (cross-encoder)
├── Deduplication
└── Score threshold filtering
│
▼
4. Context Assembly
├── Token budget management
├── Source attribution metadata
└── Final chunk ordering
│
▼
LLM Prompt
Stage 1: Query Processing
Before retrieval, the query may be:
- Cleaned — strip personally identifiable information, normalize whitespace
- Expanded — generate paraphrases or sub-questions to improve recall
- Embedded — converted to a vector for dense retrieval
- Classified — routed to a specialized index or retrieval strategy
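The cleaning and expansion steps can be sketched in a few lines. This is a minimal illustration, not a production implementation: `clean_query` and `expand_query` are hypothetical helpers, and real expansion would typically call an LLM to generate paraphrases rather than use string heuristics.

```python
import re

def clean_query(query: str) -> str:
    """Normalize whitespace; a real implementation would also strip PII."""
    return re.sub(r"\s+", " ", query).strip()

def expand_query(query: str) -> list[str]:
    """Return the original query plus paraphrase variants.
    Stub heuristic here; production systems usually generate
    paraphrases or sub-questions with an LLM."""
    variants = [query]
    if query.lower().startswith("how do i"):
        variants.append(query[len("how do i"):].strip().rstrip("?"))
    return variants

queries = expand_query(clean_query("  How do I   reset my password? "))
# queries[0] is the cleaned original; queries[1] is a keyword-style variant
```

Each variant can then be embedded and searched separately, with the results merged downstream.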
Stage 2: Retrieval
The core search step. For production RAG, hybrid search (dense + sparse) is the standard:
// Run dense and sparse retrieval in parallel, then fuse the rankings.
const [denseResults, sparseResults] = await Promise.all([
  vectorDb.search(queryEmbedding, { topK: 20 }),
  bm25Index.search(queryText, { topK: 20 })
]);
const fused = reciprocalRankFusion(denseResults, sparseResults);
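Reciprocal rank fusion (RRF) itself is simple enough to sketch directly. Each document scores 1/(k + rank) in every list it appears in, and the scores are summed; k = 60 is the constant commonly used in the literature. Here is a minimal Python version (the function and document IDs are illustrative):

```python
def reciprocal_rank_fusion(*result_lists, k=60):
    """Fuse ranked lists of document IDs: each document contributes
    1/(k + rank) per list it appears in; higher total wins."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # ranked IDs from vector search
sparse = ["b", "d", "a"]  # ranked IDs from BM25
fused = reciprocal_rank_fusion(dense, sparse)
# "b" wins: it ranks highly in both lists
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem of calibrating vector similarities against BM25 scores, which live on different scales.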
Stage 3: Post-Retrieval Processing
The initial retrieval returns a large candidate set (typically 20–50 chunks). Post-retrieval processing narrows this to the 3–10 chunks that will actually enter the prompt:
- Re-ranking — a cross-encoder re-scores candidates with higher accuracy
- Deduplication — remove chunks with >90% content overlap
- Threshold filtering — discard chunks below a minimum relevance score
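The deduplication and threshold steps can be sketched as a single pass over score-sorted candidates. This is a simplified illustration: it measures overlap with word-level Jaccard similarity rather than a proper shingle or embedding comparison, and it omits the cross-encoder re-ranking step. All names and thresholds are assumptions.

```python
def postprocess(candidates, min_score=0.3, overlap_threshold=0.9, top_n=5):
    """Filter (text, score) pairs, sorted by score descending:
    drop low scores, skip near-duplicates, keep at most top_n."""
    kept = []
    for text, score in candidates:
        if score < min_score:
            continue  # threshold filtering
        words = set(text.lower().split())
        is_dup = any(
            len(words & set(k.lower().split()))
            / max(len(words | set(k.lower().split())), 1) > overlap_threshold
            for k, _ in kept
        )
        if not is_dup:
            kept.append((text, score))
        if len(kept) == top_n:
            break
    return kept

results = postprocess([
    ("hybrid search combines dense and sparse", 0.9),
    ("Hybrid search combines dense and sparse", 0.85),  # near-duplicate
    ("see our pricing page", 0.1),                      # below threshold
])
```

In practice the overlap check would run on normalized shingles or embeddings, since whole-set word comparison misses paraphrased duplicates.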
Stage 4: Context Assembly
The surviving chunks must be assembled into a coherent prompt context:
def assemble_context(chunks, max_tokens=3000):
    context_parts = []
    token_budget = max_tokens
    for chunk in chunks:  # ordered by relevance
        tokens = count_tokens(chunk.content)
        if tokens > token_budget:
            break
        context_parts.append(
            f"Source: {chunk.source_url}\n{chunk.content}"
        )
        token_budget -= tokens
    return "\n\n---\n\n".join(context_parts)
Common Pipeline Configurations
Minimal (prototyping)
Query → Embed → Vector Search → Top-3 Chunks → LLM
Standard (production)
Query → Expand → Hybrid Search (top-20) → Re-rank → Top-5 → LLM
Advanced (high-stakes)
Query → Classify → Route → Expand → Hybrid Search → Re-rank →
Deduplicate → Score Filter → Context Assembly → LLM
Retrieval Pipeline with KnowledgeSDK
KnowledgeSDK handles stages 1 and 2 for you. POST /v1/search runs query embedding, hybrid retrieval (dense + BM25), and score fusion in a single API call:
curl -X POST https://api.knowledgesdk.com/v1/search \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how do I integrate with Zapier?",
    "limit": 10
  }'
You are responsible for stages 3 and 4 in your application: optionally re-ranking the returned results, filtering by score, and assembling the final prompt context. This split lets you customize post-retrieval behavior while offloading the infrastructure-heavy retrieval work.
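Stages 3 and 4 on the client side might look like the following sketch. The response field names (`score`, `content`, `source_url`) are assumptions for illustration; check the API reference for the actual response shape, and substitute a real tokenizer for the word-count proxy used here.

```python
def build_context(results, min_score=0.5, max_tokens=3000):
    """Client-side stages 3-4: threshold-filter search results,
    then assemble a token-budgeted context string.
    NOTE: field names here are illustrative, not the documented schema."""
    relevant = [r for r in results if r["score"] >= min_score]
    parts, budget = [], max_tokens
    for r in relevant:  # assumed sorted by score descending
        tokens = len(r["content"].split())  # crude proxy for a real tokenizer
        if tokens > budget:
            break
        parts.append(f"Source: {r['source_url']}\n{r['content']}")
        budget -= tokens
    return "\n\n---\n\n".join(parts)

ctx = build_context([
    {"score": 0.9, "content": "Zapier integration steps",
     "source_url": "https://docs.example.com/zapier"},
    {"score": 0.2, "content": "unrelated changelog entry",
     "source_url": "https://docs.example.com/changelog"},
])
```

The resulting string drops straight into your LLM prompt, with each chunk carrying the source attribution needed for citations.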
Measuring Pipeline Quality
- Recall@K — what fraction of relevant documents appear in the top-K results?
- Precision@K — what fraction of top-K results are actually relevant?
- MRR — Mean Reciprocal Rank; where does the first relevant result appear?
- End-to-end answer quality — ultimately the metric that matters most (human evaluation or LLM-as-judge)
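The ranking metrics above are straightforward to compute once you have labeled relevance judgments for a set of test queries. A minimal single-query sketch:

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant docs that appear in the top-k retrieved."""
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved that are relevant."""
    return len(set(relevant) & set(retrieved[:k])) / k

def reciprocal_rank(relevant, retrieved):
    """1/rank of the first relevant result, or 0 if none appears.
    MRR is this value averaged over many queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Tracking these per pipeline stage (after raw retrieval, after re-ranking) shows where relevant chunks are being gained or lost.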