# Web Scraping with Haystack: Build a Live RAG Pipeline with KnowledgeSDK
Haystack by deepset is one of the most mature production RAG frameworks available. Unlike LangChain (which trades flexibility for speed-to-prototype), Haystack is designed for teams building systems that have to run in production: typed component interfaces, YAML-based pipeline definitions, built-in evaluation tooling, and strong support for hybrid retrieval.
The challenge with Haystack's web scraping story is that the default approach — the Apify integration — requires managing an Apify account, configuring Apify actors, and dealing with HTML output that still needs post-processing before it reaches your LLM.
This tutorial shows a cleaner path: a custom KnowledgeSDKFetcher component that integrates directly into your Haystack pipeline and returns LLM-ready markdown. You'll build a complete end-to-end RAG pipeline: give it a URL, get an answer.
## Why Haystack?
Before diving into code, it's worth being precise about when to choose Haystack over alternatives like LangChain or LlamaIndex:
| Criterion | Haystack | LangChain | LlamaIndex |
|---|---|---|---|
| Production-readiness | Excellent | Good | Good |
| Pipeline typing | Strong (component interfaces) | Loose | Moderate |
| YAML pipeline definitions | Yes | No | Partial |
| Evaluation tooling | Built-in | Third-party | Third-party |
| Learning curve | Steeper | Gentle | Moderate |
| Component ecosystem | Growing | Very large | Large |
Choose Haystack when you need a pipeline that a team can reason about, test independently, and deploy to production with confidence.
## Setup

Install dependencies:

```bash
pip install haystack-ai knowledgesdk openai
```
Set environment variables:

```bash
export KNOWLEDGESDK_API_KEY="knowledgesdk_live_..."
export OPENAI_API_KEY="sk-..."
```
## Step 1: Define the Custom KnowledgeSDKFetcher Component

In Haystack, every pipeline stage is a Component. Components declare typed inputs and outputs with the `@component` decorator and `@component.output_types`. This makes your pipeline self-documenting and type-safe.
```python
import os

from haystack import component, Document
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


@component
class KnowledgeSDKFetcher:
    """
    Haystack component that fetches URLs using KnowledgeSDK and returns Documents.

    Handles JavaScript rendering and anti-bot protections transparently.
    Returns clean markdown suitable for LLM processing.
    """

    @component.output_types(documents=list[Document])
    def run(self, urls: list[str]) -> dict:
        documents = []
        for url in urls:
            try:
                result = knowledge_client.extract(
                    url,
                    include_markdown=True,
                    include_structured=True,
                )
                doc = Document(
                    content=result.markdown,
                    meta={
                        "url": url,
                        "title": result.title or "",
                        "description": result.structured.get("description", "") if result.structured else "",
                        "source": "knowledgesdk",
                    },
                )
                documents.append(doc)
                print(f"Fetched: {url} ({len(result.markdown)} chars)")
            except Exception as e:
                print(f"Failed to fetch {url}: {e}")
        return {"documents": documents}
```
Notice that the component returns `Document` objects — Haystack's standard content unit. This makes KnowledgeSDKFetcher compatible with every downstream Haystack component: splitters, embedders, retrievers, and generators all work with `Document` objects.
## Step 2: Build the Indexing Pipeline
The indexing pipeline takes URLs, fetches them, splits them into chunks, embeds them, and writes to a document store.
```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter

# Initialize document store
document_store = InMemoryDocumentStore()

# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("fetcher", KnowledgeSDKFetcher())
indexing_pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="word", split_length=200, split_overlap=20),
)
indexing_pipeline.add_component(
    "embedder",
    OpenAIDocumentEmbedder(model="text-embedding-3-small"),
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store),
)

# Connect components
indexing_pipeline.connect("fetcher.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Run the indexing pipeline
urls_to_index = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/authentication",
    "https://docs.example.com/webhooks",
]

result = indexing_pipeline.run({"fetcher": {"urls": urls_to_index}})
print(f"Indexed {result['writer']['documents_written']} document chunks")
```
## Step 3: Build the Query Pipeline
The query pipeline takes a question, embeds it, retrieves relevant chunks, and generates an answer with an OpenAI model.
```python
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Prompt template
PROMPT_TEMPLATE = """
You are a helpful assistant. Answer the question based on the provided context.
If the context doesn't contain enough information to answer, say so.

Context:
{% for doc in documents %}
---
Source: {{ doc.meta.url }}
Title: {{ doc.meta.title }}

{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

# Build query pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store, top_k=5),
)
query_pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
query_pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o"))

# Connect components
query_pipeline.connect("embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Query the pipeline
def ask(question: str) -> str:
    result = query_pipeline.run({
        "embedder": {"text": question},
        "prompt_builder": {"question": question},
    })
    return result["generator"]["replies"][0]

# Example queries
print(ask("How do I authenticate API requests?"))
print(ask("What events can I subscribe to with webhooks?"))
```
## Step 4: Export as YAML
One of Haystack's strengths is that pipelines can be serialized to YAML. This lets you version control your pipeline configuration and deploy it without changing Python code.
```python
# Export pipeline to YAML
with open("rag_pipeline.yaml", "w") as f:
    query_pipeline.dump(f)
```
The resulting YAML defines every component, its parameters, and the connections:
```yaml
# rag_pipeline.yaml (excerpt)
components:
  embedder:
    type: haystack.components.embedders.openai_text_embedder.OpenAITextEmbedder
    init_parameters:
      model: text-embedding-3-small
  retriever:
    type: haystack.components.retrievers.in_memory.embedding_retriever.InMemoryEmbeddingRetriever
    init_parameters:
      top_k: 5
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: "..."
  generator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      model: gpt-4o
connections:
  - sender: embedder.embedding
    receiver: retriever.query_embedding
  - sender: retriever.documents
    receiver: prompt_builder.documents
  - sender: prompt_builder.prompt
    receiver: generator.prompt
```
Load it back:
```python
from haystack import Pipeline

with open("rag_pipeline.yaml", "r") as f:
    loaded_pipeline = Pipeline.load(f)
```
## Step 5: Add the KnowledgeSDK Search Component
For knowledge bases that grow beyond what fits in memory, add a KnowledgeSDK search component. This lets you search over all previously scraped content using semantic search without managing a separate vector store.
```python
import httpx

from haystack import component, Document


@component
class KnowledgeSDKSearchRetriever:
    """
    Haystack retriever component that uses KnowledgeSDK's semantic search API.

    Searches over all content previously indexed in your KnowledgeSDK knowledge base.
    """

    def __init__(self, api_key: str, top_k: int = 5):
        self.api_key = api_key
        self.top_k = top_k

    @component.output_types(documents=list[Document])
    def run(self, query: str) -> dict:
        response = httpx.post(
            "https://api.knowledgesdk.com/v1/search",
            headers={"x-api-key": self.api_key},
            json={"query": query, "limit": self.top_k},
        )
        response.raise_for_status()
        results = response.json()
        documents = [
            Document(
                content=item["content"],
                meta={
                    "url": item.get("url", ""),
                    "title": item.get("title", ""),
                    "score": item.get("score", 0),
                },
            )
            for item in results.get("results", [])
        ]
        return {"documents": documents}
```
Use it as a drop-in retriever in your query pipeline:
```python
search_pipeline = Pipeline()
search_pipeline.add_component(
    "retriever",
    KnowledgeSDKSearchRetriever(
        api_key=os.environ["KNOWLEDGESDK_API_KEY"],
        top_k=5,
    ),
)
search_pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
search_pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o"))

search_pipeline.connect("retriever.documents", "prompt_builder.documents")
search_pipeline.connect("prompt_builder.prompt", "generator.prompt")

result = search_pipeline.run({
    "retriever": {"query": "How does authentication work?"},
    "prompt_builder": {"question": "How does authentication work?"},
})
print(result["generator"]["replies"][0])
```
## Step 6: Complete End-to-End Example
Here is a self-contained script combining everything:
```python
import os

from haystack import Pipeline, component, Document
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


@component
class KnowledgeSDKFetcher:
    @component.output_types(documents=list[Document])
    def run(self, urls: list[str]) -> dict:
        documents = []
        for url in urls:
            result = knowledge_client.extract(url, include_markdown=True)
            documents.append(Document(
                content=result.markdown,
                meta={"url": url, "title": result.title or ""},
            ))
        return {"documents": documents}


# Build and run the indexing pipeline
store = InMemoryDocumentStore()
idx = Pipeline()
idx.add_component("fetcher", KnowledgeSDKFetcher())
idx.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
idx.add_component("embedder", OpenAIDocumentEmbedder(model="text-embedding-3-small"))
idx.add_component("writer", DocumentWriter(document_store=store))
idx.connect("fetcher.documents", "splitter.documents")
idx.connect("splitter.documents", "embedder.documents")
idx.connect("embedder.documents", "writer.documents")
idx.run({"fetcher": {"urls": ["https://docs.example.com/api-reference"]}})
print("Indexed.")

TEMPLATE = """Context: {% for doc in documents %}{{ doc.content }}\n{% endfor %}\nQuestion: {{ question }}\nAnswer:"""

# Build and run the query pipeline
qry = Pipeline()
qry.add_component("embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
qry.add_component("retriever", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
qry.add_component("prompt", PromptBuilder(template=TEMPLATE))
qry.add_component("generator", OpenAIGenerator(model="gpt-4o"))
qry.connect("embedder.embedding", "retriever.query_embedding")
qry.connect("retriever.documents", "prompt.documents")
qry.connect("prompt.prompt", "generator.prompt")

result = qry.run({
    "embedder": {"text": "What authentication methods are supported?"},
    "prompt": {"question": "What authentication methods are supported?"},
})
print(result["generator"]["replies"][0])
```
## Comparison: KnowledgeSDK vs. Apify Actor for Haystack

Here is how the standard Apify-based web fetching approach for Haystack compares:
| Aspect | Apify + Haystack | KnowledgeSDK + Haystack |
|---|---|---|
| Setup required | Apify account, actor configuration, webhook setup | One API key |
| Output format | HTML (needs post-processing) | Clean markdown |
| Component complexity | Complex (Apify webhook → Haystack) | Simple (direct component) |
| JS rendering | Yes | Yes |
| Anti-bot handling | Yes | Yes |
| Semantic search | No (Apify doesn't provide search) | Yes (built-in) |
| Change detection | Manual polling | Webhooks |
| Cost | Apify usage + Haystack | KnowledgeSDK only |
The KnowledgeSDK component is simpler, returns better output, and adds search and webhook capabilities that the Apify actor doesn't provide.
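The "change detection" row deserves a concrete shape. The payload field below (`changed_urls`) is an assumption about the webhook format, not a documented contract — the point is that re-indexing is just re-running the indexing pipeline on whatever URLs the webhook reports as changed:

```python
def handle_change_webhook(payload: dict, indexing_pipeline) -> int:
    """Re-index any URLs a change-notification webhook reports as updated.

    `payload["changed_urls"]` is a hypothetical field name; adapt it to the
    actual webhook schema your provider sends.
    """
    changed = payload.get("changed_urls", [])
    if changed:
        # Reuse the same indexing pipeline built in Step 2
        indexing_pipeline.run({"fetcher": {"urls": changed}})
    return len(changed)


# Demonstrate with a stub pipeline; real code would pass indexing_pipeline
class _StubPipeline:
    def __init__(self):
        self.calls = []

    def run(self, data):
        self.calls.append(data)


stub = _StubPipeline()
n = handle_change_webhook({"changed_urls": ["https://docs.example.com/webhooks"]}, stub)
print(n, stub.calls)
```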
## Conclusion
Haystack is a strong framework for production RAG pipelines — typed components, YAML-configurable, and built for teams. The missing piece has been a web scraping component that returns clean markdown without HTML post-processing overhead.
The KnowledgeSDKFetcher component in this tutorial fills that gap. It's a standard Haystack component that works with every existing Haystack splitter, embedder, and retriever. You get JavaScript rendering and anti-bot handling from KnowledgeSDK, and the full Haystack ecosystem for chunking, embedding, retrieval, and generation.
Ready to build your Haystack RAG pipeline? Start a free KnowledgeSDK trial at knowledgesdk.com.