What Is a Retrieval Pipeline?
A retrieval pipeline is the complete sequence of processing steps that transforms a user's query into a set of relevant document chunks ready for injection into an LLM prompt. It is the "retrieval" half of Retrieval-Augmented Generation.
A well-designed retrieval pipeline dramatically improves answer quality. A poorly designed one leads to hallucinations, irrelevant context, and degraded user experience.
Anatomy of a Retrieval Pipeline
User Query
│
▼
1. Query Processing
├── Preprocessing (lowercase, strip PII)
├── Query expansion (optional)
└── Query embedding
│
▼
2. Retrieval
├── Dense search (vector ANN)
├── Sparse search (BM25)
└── Score fusion (RRF or weighted)
│
▼
3. Post-Retrieval Processing
├── Re-ranking (cross-encoder)
├── Deduplication
└── Score threshold filtering
│
▼
4. Context Assembly
├── Token budget management
├── Source attribution metadata
└── Final chunk ordering
│
▼
LLM Prompt
Stage 1: Query Processing
Before retrieval, the query may be:
- Cleaned — strip personally identifiable information, normalize whitespace
- Expanded — generate paraphrases or sub-questions to improve recall
- Embedded — converted to a vector for dense retrieval
- Classified — routed to a specialized index or retrieval strategy
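The cleaning and expansion steps can be sketched in a few lines. This is a minimal illustration, not a production implementation: `clean_query` and `expand_query` are hypothetical helpers, and real expansion would typically call an LLM to generate paraphrases rather than use string heuristics.

```python
import re

def clean_query(query: str) -> str:
    """Normalize whitespace; a real implementation would also strip PII."""
    return re.sub(r"\s+", " ", query).strip()

def expand_query(query: str) -> list[str]:
    """Return the original query plus paraphrase variants.
    Stub heuristic here; production systems usually generate
    paraphrases or sub-questions with an LLM."""
    variants = [query]
    if query.lower().startswith("how do i"):
        variants.append(query[len("how do i"):].strip().rstrip("?"))
    return variants

queries = expand_query(clean_query("  How do I   reset my password? "))
# queries[0] is the cleaned original; queries[1] is a keyword-style variant
```

Each variant can then be embedded and searched separately, with the results merged downstream.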
Stage 2: Retrieval
The core search step. For production RAG, hybrid search (dense + sparse) is the standard:
// Run dense and sparse retrieval in parallel, then fuse the rankings.
const [denseResults, sparseResults] = await Promise.all([
  vectorDb.search(queryEmbedding, { topK: 20 }),
  bm25Index.search(queryText, { topK: 20 })
]);
const fused = reciprocalRankFusion(denseResults, sparseResults);
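Reciprocal rank fusion (RRF) itself is simple enough to sketch directly. Each document scores 1/(k + rank) in every list it appears in, and the scores are summed; k = 60 is the constant commonly used in the literature. Here is a minimal Python version (the function and document IDs are illustrative):

```python
def reciprocal_rank_fusion(*result_lists, k=60):
    """Fuse ranked lists of document IDs: each document contributes
    1/(k + rank) per list it appears in; higher total wins."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # ranked IDs from vector search
sparse = ["b", "d", "a"]  # ranked IDs from BM25
fused = reciprocal_rank_fusion(dense, sparse)
# "b" wins: it ranks highly in both lists
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem of calibrating vector similarities against BM25 scores, which live on different scales.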
Stage 3: Post-Retrieval Processing
The initial retrieval returns a large candidate set (typically 20–50 chunks). Post-retrieval processing narrows this to the 3–10 chunks that will actually enter the prompt:
- Re-ranking — a cross-encoder re-scores candidates with higher accuracy
- Deduplication — remove chunks with >90% content overlap
- Threshold filtering — discard chunks below a minimum relevance score
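The deduplication and threshold steps can be sketched as a single pass over score-sorted candidates. This is a simplified illustration: it measures overlap with word-level Jaccard similarity rather than a proper shingle or embedding comparison, and it omits the cross-encoder re-ranking step. All names and thresholds are assumptions.

```python
def postprocess(candidates, min_score=0.3, overlap_threshold=0.9, top_n=5):
    """Filter (text, score) pairs, sorted by score descending:
    drop low scores, skip near-duplicates, keep at most top_n."""
    kept = []
    for text, score in candidates:
        if score < min_score:
            continue  # threshold filtering
        words = set(text.lower().split())
        is_dup = any(
            len(words & set(k.lower().split()))
            / max(len(words | set(k.lower().split())), 1) > overlap_threshold
            for k, _ in kept
        )
        if not is_dup:
            kept.append((text, score))
        if len(kept) == top_n:
            break
    return kept

results = postprocess([
    ("hybrid search combines dense and sparse", 0.9),
    ("Hybrid search combines dense and sparse", 0.85),  # near-duplicate
    ("see our pricing page", 0.1),                      # below threshold
])
```

In practice the overlap check would run on normalized shingles or embeddings, since whole-set word comparison misses paraphrased duplicates.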
Stage 4: Context Assembly
The surviving chunks must be assembled into a coherent prompt context:
def assemble_context(chunks, max_tokens=3000):
    context_parts = []
    token_budget = max_tokens
    for chunk in chunks:  # ordered by relevance
        tokens = count_tokens(chunk.content)
        if tokens > token_budget:
            break
        context_parts.append(
            f"Source: {chunk.source_url}\n{chunk.content}"
        )
        token_budget -= tokens
    return "\n\n---\n\n".join(context_parts)
Common Pipeline Configurations
Minimal (prototyping)
Query → Embed → Vector Search → Top-3 Chunks → LLM
Standard (production)
Query → Expand → Hybrid Search (top-20) → Re-rank → Top-5 → LLM
Advanced (high-stakes)
Query → Classify → Route → Expand → Hybrid Search → Re-rank →
Deduplicate → Score Filter → Context Assembly → LLM
Retrieval Pipeline with KnowledgeSDK
KnowledgeSDK handles stages 1 and 2 for you. POST /v1/search runs query embedding, hybrid retrieval (dense + BM25), and score fusion in a single API call:
curl -X POST https://api.knowledgesdk.com/v1/search \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how do I integrate with Zapier?",
    "limit": 10
  }'
You are responsible for stages 3 and 4 in your application: optionally re-ranking the returned results, filtering by score, and assembling the final prompt context. This split lets you customize post-retrieval behavior while offloading the infrastructure-heavy retrieval work.
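Stages 3 and 4 on the client side might look like the following sketch. The response field names (`score`, `content`, `source_url`) are assumptions for illustration; check the API reference for the actual response shape, and substitute a real tokenizer for the word-count proxy used here.

```python
def build_context(results, min_score=0.5, max_tokens=3000):
    """Client-side stages 3-4: threshold-filter search results,
    then assemble a token-budgeted context string.
    NOTE: field names here are illustrative, not the documented schema."""
    relevant = [r for r in results if r["score"] >= min_score]
    parts, budget = [], max_tokens
    for r in relevant:  # assumed sorted by score descending
        tokens = len(r["content"].split())  # crude proxy for a real tokenizer
        if tokens > budget:
            break
        parts.append(f"Source: {r['source_url']}\n{r['content']}")
        budget -= tokens
    return "\n\n---\n\n".join(parts)

ctx = build_context([
    {"score": 0.9, "content": "Zapier integration steps",
     "source_url": "https://docs.example.com/zapier"},
    {"score": 0.2, "content": "unrelated changelog entry",
     "source_url": "https://docs.example.com/changelog"},
])
```

The resulting string drops straight into your LLM prompt, with each chunk carrying the source attribution needed for citations.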
Measuring Pipeline Quality
- Recall@K — what fraction of relevant documents appear in the top-K results?
- Precision@K — what fraction of top-K results are actually relevant?
- MRR — Mean Reciprocal Rank; where does the first relevant result appear?
- End-to-end answer quality — ultimately the metric that matters most (human evaluation or LLM-as-judge)
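The ranking metrics above are straightforward to compute once you have labeled relevance judgments for a set of test queries. A minimal single-query sketch:

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant docs that appear in the top-k retrieved."""
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved that are relevant."""
    return len(set(relevant) & set(retrieved[:k])) / k

def reciprocal_rank(relevant, retrieved):
    """1/rank of the first relevant result, or 0 if none appears.
    MRR is this value averaged over many queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Tracking these per pipeline stage (after raw retrieval, after re-ranking) shows where relevant chunks are being gained or lost.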