Technical · March 20, 2026 · 10 min read

How to Benchmark Your RAG Pipeline (RAGAS, LongMemEval, MemoryBench)

You can't improve what you don't measure. A practical guide to evaluating retrieval quality, answer faithfulness, and knowledge freshness in your RAG system.

Most RAG systems are evaluated by feel. Someone asks ten questions, the answers seem reasonable, and the system ships. Then a user asks something important, the retrieval returns the wrong chunk, the LLM confidently answers with wrong information, and nobody knows why because there was never a baseline to compare against.

Benchmarking is how you turn "it seems to work" into "it retrieves the right chunks 87% of the time, and faithfulness scores are 0.92." Those numbers let you diagnose what's wrong, measure whether a change actually helped, and set a quality bar before you ship.

This guide covers the frameworks, the metrics that matter, and how to build a test suite for your specific domain.

The Four Axes of RAG Quality

Before picking tools, understand what you're measuring. RAG quality has four dimensions that can fail independently:

1. Retrieval recall — When a user asks a question, do the right chunks come back? A system can have excellent retrieval precision (everything it returns is relevant) while having poor recall (it misses key passages). Recall@5 — the fraction of questions where the correct chunk appears in the top 5 results — is the most useful single retrieval metric; a short sketch of computing it follows this list.

2. Answer faithfulness — Does the LLM's answer actually reflect what the retrieved chunks say? High faithfulness means every claim in the answer can be traced to a retrieved passage. Low faithfulness means the LLM is adding context, making inferences, or fabricating details not present in the sources. This is the hallucination metric.

3. Answer relevance — Does the answer address the question that was asked? A system can have high faithfulness (everything it says is in the chunks) but low relevance (it answers a different question than the one posed). This happens when retrieval returns tangentially related content.

4. Knowledge freshness — Are the indexed chunks current? A retrieval system that correctly surfaces a chunk containing outdated information scores well on recall and faithfulness but fails in practice. Freshness is the dimension most evaluation frameworks ignore, which is why it deserves explicit tracking.
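
Recall@5 is straightforward to compute yourself. The sketch below assumes a test set where each question is paired with the IDs of the chunks that contain its answer, and a retriever.search interface that stands in for whatever your pipeline actually exposes; both are placeholders, not calls from a specific library.

def recall_at_k(test_cases, retriever, k=5):
    """Fraction of questions where at least one gold chunk appears in the top-k results."""
    hits = 0
    for case in test_cases:
        results = retriever.search(case["question"], top_k=k)
        retrieved_ids = {r["chunk_id"] for r in results}
        if retrieved_ids & set(case["gold_chunk_ids"]):  # any gold chunk retrieved?
            hits += 1
    return hits / len(test_cases)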

RAGAS: The Standard Framework

RAGAS (Retrieval Augmented Generation Assessment) is the most widely used open-source framework for evaluating RAG pipelines. It measures faithfulness, answer relevancy, context precision, and context recall using LLM-as-judge scoring.

Install it:

pip install ragas

A minimal evaluation run:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Your test set: questions, generated answers, retrieved contexts, ground truth answers
data = {
    "question": [
        "What is the data retention policy for user records?",
        "How long does the trial period last?",
        "What payment methods are accepted?",
    ],
    "answer": [
        # Generated by your RAG system
        "User records are retained for 3 years after account closure.",
        "The trial period lasts 14 days with no credit card required.",
        "We accept Visa, Mastercard, and PayPal.",
    ],
    "contexts": [
        # Retrieved chunks used to generate each answer
        [["Data retention: user records are kept for 3 years post-account closure..."]],
        [["14-day free trial, no credit card required..."]],
        [["Accepted payment methods: Visa, Mastercard, American Express, PayPal..."]],
    ],
    "ground_truth": [
        # Correct answers from your gold standard
        "User records are retained for 3 years after account closure.",
        "The trial period lasts 14 days with no credit card required.",
        "We accept Visa, Mastercard, American Express, and PayPal.",
    ],
}

dataset = Dataset.from_dict(data)

# Note: these metrics use an LLM judge under the hood, so judge credentials must be
# configured (OpenAI by default, or pass your own llm/embeddings to evaluate()).
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)

RAGAS scores each metric on a 0–1 scale. Target numbers for a production-ready RAG system:

  • Faithfulness: > 0.90. Below this, users will encounter hallucinated information.
  • Answer relevancy: > 0.85. Below this, answers are technically accurate but not addressing the question.
  • Context precision: > 0.80. Below this, irrelevant chunks are being retrieved and polluting the context.
  • Context recall: > 0.75. Below this, relevant information exists in your index but isn't being retrieved.
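
If you want the run itself to enforce these targets, a few checks on the RAGAS output are enough. A minimal sketch, assuming the results object supports dict-style access to aggregate scores (the exact accessor varies between ragas versions):

TARGETS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.75,
}

# Collect every metric that falls below its target and fail loudly if any do
failing = {m: results[m] for m, target in TARGETS.items() if results[m] < target}
if failing:
    raise SystemExit(f"Metrics below target: {failing}")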

LongMemEval: Testing Long-Context Memory

LongMemEval evaluates how well systems retrieve information that was indexed much earlier in a conversation or over a long time horizon. It's particularly relevant for web knowledge systems where content was extracted weeks or months ago and needs to be accurately retrieved now.

The benchmark tests scenarios that are common in production but rarely tested:

  • Information extracted 30+ days ago that must be retrieved accurately today
  • Contradictory information from multiple sources (old and new versions of a page) where the system must surface the correct version
  • Long documents where relevant information is buried in the middle, not the beginning or end

LongMemEval is the benchmark Supermemory scores 81.6% on — a useful reference point for what current-generation web memory systems achieve. Running your own pipeline against LongMemEval gives you a concrete comparison point.

To adapt LongMemEval for web knowledge evaluation:

  1. Extract a set of URLs at time T
  2. Wait a realistic interval (or simulate it)
  3. Run LongMemEval questions against your index
  4. Measure recall on information that was "old" vs. "recent"

The degradation curve — how recall drops as indexed content ages — tells you a lot about whether your freshness strategy is working.
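
A minimal sketch of measuring that curve: stratify your test cases by how old the indexed content is and compute recall@5 per age bucket. The indexed_at field, the bucket edges, and the retriever interface are all assumptions about your own pipeline.

from collections import defaultdict
from datetime import datetime, timezone

def recall_by_age(test_cases, retriever, k=5, bucket_edges=(7, 30, 90)):
    now = datetime.now(timezone.utc)
    hits, totals = defaultdict(int), defaultdict(int)
    for case in test_cases:
        age_days = (now - case["indexed_at"]).days
        # Place the case in the first bucket whose edge it falls under
        bucket = next((f"<= {b}d" for b in bucket_edges if age_days <= b),
                      f"> {bucket_edges[-1]}d")
        totals[bucket] += 1
        retrieved = {r["chunk_id"] for r in retriever.search(case["question"], top_k=k)}
        if retrieved & set(case["gold_chunk_ids"]):
            hits[bucket] += 1
    # e.g. {"<= 7d": 0.91, "<= 30d": 0.86, "> 90d": 0.74} (illustrative numbers)
    return {b: hits[b] / totals[b] for b in totals}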

MemoryBench: Evaluating Knowledge Across Sessions

MemoryBench evaluates memory persistence: whether a system can accurately recall information across multiple sessions and over time. For web knowledge systems, this maps directly to whether your index correctly serves queries on content extracted in previous runs.

Key things MemoryBench tests:

  • Entity consistency: Does the system correctly recall facts about the same entity across multiple queries?
  • Update propagation: When a page is re-extracted with updated content, does the old version stop surfacing in results?
  • Source attribution accuracy: Can the system correctly attribute facts to the right source URL?

The last point — update propagation — is where many RAG systems fail. If you re-extract a page that changed, your system may have both the old and new version indexed. Searches may return the old chunk with outdated information. Good systems handle this by replacing rather than appending on re-extraction; MemoryBench tests whether yours does.
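
The fix is structural rather than clever: treat the source URL as the unit of replacement. A minimal sketch, with a hypothetical store interface (delete_by_url / add) standing in for whatever your vector store or knowledge store actually exposes:

def reindex_page(store, url, new_chunks):
    # Remove every chunk previously indexed for this URL before adding the new ones,
    # so the stale version can no longer surface in search results.
    store.delete_by_url(url)
    store.add(url=url, chunks=new_chunks)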

Building Your Domain-Specific Test Set

Standard benchmarks give you a baseline. Your own test set tells you whether your system works for your actual use case.

Building a good test set:

1. Write 50 questions from your domain. Cover the full range of what users actually ask — simple lookups ("what is the refund policy?"), multi-hop questions ("what are the data retention requirements for EU customers under our enterprise plan?"), and edge cases ("what happens to data if I downgrade my plan?").

2. Write gold-standard answers. These should be what a subject matter expert would say — accurate, complete, from the right source. Don't generate these with an LLM; write them manually or have a domain expert write them.

3. Identify the correct source chunks. For each question, mark which specific passages in your indexed content contain the answer. This is what RAGAS uses for context recall evaluation — it checks whether your retrieval system actually returns those passages.

4. Run your RAG system on all 50 questions. Capture: the retrieved chunks, the generated answer, which source URLs were used.

5. Score with RAGAS. Compare generated answers against gold standard answers and retrieved chunks against the correct source chunks.

Fifty questions is enough to identify systemic problems. Once you've fixed those, expand to 200 questions for a more statistically significant baseline.
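
One possible shape for a test case, stored however you like (a JSONL file works well). The field names are illustrative, not a required schema; the point is pairing each question with a hand-written gold answer and the specific chunks that should come back.

test_case = {
    "question": "What happens to data if I downgrade my plan?",
    # Written by a domain expert, not generated by an LLM
    "ground_truth": "Your data is retained, but features above the new plan's limits are disabled until you upgrade again.",
    # Passages in your index that contain the answer (used for context recall)
    "gold_chunk_ids": ["pricing#downgrade-policy"],
    "source_url": "https://example.com/pricing",
}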

Diagnosing Failures

When scores are low, the fix depends on which metric is failing:

Low context recall (< 0.75): Your retrieval is missing relevant chunks. Common causes:

  • Chunks are too small — increase chunk size so more context is included per chunk
  • Query doesn't match chunk language — the user says "cancellation policy" but the document says "subscription termination terms"; hybrid search (semantic + BM25) handles this better than pure vector search
  • Content isn't indexed — add the missing source URLs

Low faithfulness (< 0.90): The LLM is generating claims not supported by the retrieved chunks. Common causes:

  • System prompt isn't strict enough — add explicit instruction to only use provided sources
  • Context window is too long — too many chunks dilute the model's attention; try returning fewer, higher-quality chunks
  • Model is "smart" and adding inferences — use a lower temperature; 0.0–0.2 for factual RAG

Low answer relevancy (< 0.85): Retrieved chunks are related but don't address the question. Common causes:

  • Wrong section indexed — you indexed a summary page but the relevant detail is on a sub-page
  • Query reformulation needed — rewrite the user's question before searching

Low freshness: You're surfacing outdated chunks alongside current ones. Common causes:

  • Re-extraction appending rather than replacing — check whether your knowledge store updates in-place
  • No freshness signal in retrieval ranking — consider recency-weighted ranking for time-sensitive content
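
For the last case, one common freshness signal is exponential decay on retrieval scores by content age. A minimal sketch; the half-life and the score and indexed_at fields are assumptions to tune for how quickly your content actually goes stale:

from datetime import datetime, timezone

def recency_weighted(results, half_life_days=30):
    now = datetime.now(timezone.utc)
    for r in results:
        age_days = (now - r["indexed_at"]).days  # indexed_at: tz-aware datetime per chunk
        r["score"] *= 0.5 ** (age_days / half_life_days)  # 1.0 when fresh, halves every half-life
    return sorted(results, key=lambda r: r["score"], reverse=True)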

KnowledgeSDK's Hybrid Search and Recall@5

Pure vector search retrieves by semantic similarity. BM25 (keyword) search retrieves by term matching. Hybrid search combines both — and it consistently scores higher on recall@5 than either approach alone.

The reason is simple: some queries are best answered by keyword matching ("what is the maxRetries config option?") while others are best answered by semantic similarity ("how do I handle errors when a request fails?"). Hybrid search handles both.

KnowledgeSDK uses hybrid search by default — every query runs semantic and keyword search in parallel, combines the scores, and returns the best results. If you're running benchmarks and comparing against other retrieval systems, verify whether the comparison is hybrid-vs-hybrid or hybrid-vs-pure-vector. Pure vector search typically scores 10–15% lower on recall@5 on mixed query sets.
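
If you want to reproduce a hybrid baseline for your own benchmark, reciprocal rank fusion (RRF) is a simple way to merge a semantic result list and a keyword result list. This is an illustrative fusion method, not necessarily how KnowledgeSDK combines scores internally:

def rrf_merge(semantic_ids, keyword_ids, k=60):
    # Each result list is ordered best-first; chunks ranked well by either
    # retriever accumulate a higher fused score.
    scores = {}
    for ranked_ids in (semantic_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)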

Benchmark Cadence

Run your benchmark:

  • After any change to chunking strategy — chunk size, overlap, or splitting method directly affects recall
  • After adding or removing source URLs — new content can affect retrieval patterns for existing queries
  • After upgrading the embedding model — different models have different strengths; re-benchmark before committing
  • Monthly in production — as your indexed content evolves, retrieval quality can drift; catch it before users do

A benchmark that runs automatically in CI on every configuration change pays off quickly. The first time it catches a chunking change that dropped recall from 0.84 to 0.71, it's earned its keep.
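
A sketch of that CI gate: compare a fresh run against a committed baseline and fail the build on a meaningful drop. The baseline file name and the metric dict passed in are assumptions; supply them from your own evaluation entry point.

import json
import sys

TOLERANCE = 0.02  # allow small run-to-run noise from the LLM judge

def check_regression(current, baseline_path="benchmark_baseline.json"):
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"recall_at_5": 0.84, "faithfulness": 0.93}
    regressions = {
        m: (baseline[m], current[m])
        for m in baseline
        if current[m] < baseline[m] - TOLERANCE
    }
    if regressions:
        print(f"Benchmark regressions (baseline, current): {regressions}")
        sys.exit(1)

# Usage: check_regression(run_benchmark())
# where run_benchmark() is your own evaluation script returning the metric dict.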

The Metric That Actually Matters

After all the benchmarking, the number that most closely predicts user satisfaction is faithfulness times recall@5. A system that retrieves the right chunks 90% of the time but hallucinates in answers is worse than a system with 80% recall and near-perfect faithfulness. Users can live with "I don't know" (low recall). They cannot live with confident wrong answers (low faithfulness).
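
A quick illustration with made-up numbers of why the composite favors grounded answers over raw recall:

system_a = {"recall_at_5": 0.90, "faithfulness": 0.75}  # finds the chunks, hallucinates in answers
system_b = {"recall_at_5": 0.80, "faithfulness": 0.98}  # finds less, but stays grounded

for name, s in (("A", system_a), ("B", system_b)):
    print(name, round(s["faithfulness"] * s["recall_at_5"], 3))
# A 0.675
# B 0.784  <- the lower-recall system wins on the composite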

Optimize faithfulness first. Then improve recall. The combination of the two — finding the right information and accurately reporting it — is what makes a RAG system genuinely useful rather than just impressive in demos.
