# Web Scraping with Haystack: Build a Live RAG Pipeline with KnowledgeSDK
Haystack by deepset is one of the most mature production RAG frameworks available. Unlike LangChain (which trades flexibility for speed-to-prototype), Haystack is designed for teams building systems that have to run in production: typed component interfaces, YAML-based pipeline definitions, built-in evaluation tooling, and strong support for hybrid retrieval.
The challenge with Haystack's web scraping story is that the default approach — the Apify integration — requires managing an Apify account, configuring Apify actors, and dealing with HTML output that still needs post-processing before it reaches your LLM.
This tutorial shows a cleaner path: a custom KnowledgeSDKFetcher component that integrates directly into your Haystack pipeline and returns LLM-ready markdown. You'll build a complete end-to-end RAG pipeline: give it a URL, get an answer.
## Why Haystack?
Before diving into code, it's worth being precise about when to choose Haystack over alternatives like LangChain or LlamaIndex:
| Criterion | Haystack | LangChain | LlamaIndex |
|---|---|---|---|
| Production-readiness | Excellent | Good | Good |
| Pipeline typing | Strong (component interfaces) | Loose | Moderate |
| YAML pipeline definitions | Yes | No | Partial |
| Evaluation tooling | Built-in | Third-party | Third-party |
| Learning curve | Steeper | Gentle | Moderate |
| Component ecosystem | Growing | Very large | Large |
Choose Haystack when you need a pipeline that a team can reason about, test independently, and deploy to production with confidence.
## Setup

Install dependencies:

```bash
pip install haystack-ai knowledgesdk openai
```
Set environment variables:

```bash
export KNOWLEDGESDK_API_KEY="knowledgesdk_live_..."
export OPENAI_API_KEY="sk-..."
```
## Step 1: Define the Custom KnowledgeSDKFetcher Component

In Haystack, every pipeline stage is a Component. Components declare typed inputs and outputs with the `@component` decorator and `@component.output_types`. This makes your pipeline self-documenting and type-safe.
```python
import os

from haystack import component, Document
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


@component
class KnowledgeSDKFetcher:
    """
    Haystack component that fetches URLs using KnowledgeSDK and returns Documents.

    Handles JavaScript rendering and anti-bot protections transparently.
    Returns clean markdown suitable for LLM processing.
    """

    @component.output_types(documents=list[Document])
    def run(self, urls: list[str]) -> dict:
        documents = []
        for url in urls:
            try:
                result = knowledge_client.extract(
                    url,
                    include_markdown=True,
                    include_structured=True,
                )
                doc = Document(
                    content=result.markdown,
                    meta={
                        "url": url,
                        "title": result.title or "",
                        "description": result.structured.get("description", "") if result.structured else "",
                        "source": "knowledgesdk",
                    },
                )
                documents.append(doc)
                print(f"Fetched: {url} ({len(result.markdown)} chars)")
            except Exception as e:
                print(f"Failed to fetch {url}: {e}")
        return {"documents": documents}
```
Notice that the component returns `Document` objects — Haystack's standard content unit. This makes KnowledgeSDKFetcher compatible with every downstream Haystack component: splitters, embedders, retrievers, and generators all work with `Document` objects.
## Step 2: Build the Indexing Pipeline
The indexing pipeline takes URLs, fetches them, splits them into chunks, embeds them, and writes to a document store.
```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter

# Initialize document store
document_store = InMemoryDocumentStore()

# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("fetcher", KnowledgeSDKFetcher())
indexing_pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="word", split_length=200, split_overlap=20),
)
indexing_pipeline.add_component(
    "embedder",
    OpenAIDocumentEmbedder(model="text-embedding-3-small"),
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store),
)

# Connect components
indexing_pipeline.connect("fetcher.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Run the indexing pipeline
urls_to_index = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/authentication",
    "https://docs.example.com/webhooks",
]

result = indexing_pipeline.run({"fetcher": {"urls": urls_to_index}})
print(f"Indexed {result['writer']['documents_written']} document chunks")
```
## Step 3: Build the Query Pipeline
The query pipeline takes a question, embeds it, retrieves relevant chunks, and generates an answer with an OpenAI model.
```python
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Prompt template
PROMPT_TEMPLATE = """
You are a helpful assistant. Answer the question based on the provided context.
If the context doesn't contain enough information to answer, say so.

Context:
{% for doc in documents %}
---
Source: {{ doc.meta.url }}
Title: {{ doc.meta.title }}

{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

# Build query pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store, top_k=5),
)
query_pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
query_pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o"))

# Connect components
query_pipeline.connect("embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Query the pipeline
def ask(question: str) -> str:
    result = query_pipeline.run({
        "embedder": {"text": question},
        "prompt_builder": {"question": question},
    })
    return result["generator"]["replies"][0]

# Example queries
print(ask("How do I authenticate API requests?"))
print(ask("What events can I subscribe to with webhooks?"))
```
## Step 4: Export as YAML
One of Haystack's strengths is that pipelines can be serialized to YAML. This lets you version control your pipeline configuration and deploy it without changing Python code.
```python
# Export pipeline to YAML
with open("rag_pipeline.yaml", "w") as f:
    query_pipeline.dump(f)
```
The resulting YAML defines every component, its parameters, and the connections:
```yaml
# rag_pipeline.yaml (excerpt)
components:
  embedder:
    type: haystack.components.embedders.openai_text_embedder.OpenAITextEmbedder
    init_parameters:
      model: text-embedding-3-small
  retriever:
    type: haystack.components.retrievers.in_memory.embedding_retriever.InMemoryEmbeddingRetriever
    init_parameters:
      top_k: 5
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: "..."
  generator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      model: gpt-4o
connections:
  - sender: embedder.embedding
    receiver: retriever.query_embedding
  - sender: retriever.documents
    receiver: prompt_builder.documents
  - sender: prompt_builder.prompt
    receiver: generator.prompt
```
Load it back:
```python
from haystack import Pipeline

with open("rag_pipeline.yaml", "r") as f:
    loaded_pipeline = Pipeline.load(f)
```
## Step 5: Add the KnowledgeSDK Search Component
For knowledge bases that grow beyond what fits in memory, add a KnowledgeSDK search component. This lets you search over all previously scraped content using semantic search without managing a separate vector store.
```python
import httpx

from haystack import component, Document


@component
class KnowledgeSDKSearchRetriever:
    """
    Haystack retriever component that uses KnowledgeSDK's semantic search API.

    Searches over all content previously indexed in your KnowledgeSDK knowledge base.
    """

    def __init__(self, api_key: str, top_k: int = 5):
        self.api_key = api_key
        self.top_k = top_k

    @component.output_types(documents=list[Document])
    def run(self, query: str) -> dict:
        response = httpx.post(
            "https://api.knowledgesdk.com/v1/search",
            headers={"x-api-key": self.api_key},
            json={"query": query, "limit": self.top_k},
        )
        response.raise_for_status()
        results = response.json()
        documents = [
            Document(
                content=item["content"],
                meta={
                    "url": item.get("url", ""),
                    "title": item.get("title", ""),
                    "score": item.get("score", 0),
                },
            )
            for item in results.get("results", [])
        ]
        return {"documents": documents}
```
Use it as a drop-in retriever in your query pipeline:
```python
search_pipeline = Pipeline()
search_pipeline.add_component(
    "retriever",
    KnowledgeSDKSearchRetriever(
        api_key=os.environ["KNOWLEDGESDK_API_KEY"],
        top_k=5,
    ),
)
search_pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
search_pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o"))

search_pipeline.connect("retriever.documents", "prompt_builder.documents")
search_pipeline.connect("prompt_builder.prompt", "generator.prompt")

result = search_pipeline.run({
    "retriever": {"query": "How does authentication work?"},
    "prompt_builder": {"question": "How does authentication work?"},
})
print(result["generator"]["replies"][0])
```
## Step 6: Complete End-to-End Example
Here is a self-contained script combining everything:
```python
import os

from haystack import Pipeline, component, Document
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


@component
class KnowledgeSDKFetcher:
    @component.output_types(documents=list[Document])
    def run(self, urls: list[str]) -> dict:
        documents = []
        for url in urls:
            result = knowledge_client.extract(url, include_markdown=True)
            documents.append(Document(
                content=result.markdown,
                meta={"url": url, "title": result.title or ""},
            ))
        return {"documents": documents}


# Build and run the indexing pipeline
store = InMemoryDocumentStore()
idx = Pipeline()
idx.add_component("fetcher", KnowledgeSDKFetcher())
idx.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
idx.add_component("embedder", OpenAIDocumentEmbedder(model="text-embedding-3-small"))
idx.add_component("writer", DocumentWriter(document_store=store))
idx.connect("fetcher.documents", "splitter.documents")
idx.connect("splitter.documents", "embedder.documents")
idx.connect("embedder.documents", "writer.documents")
idx.run({"fetcher": {"urls": ["https://docs.example.com/api-reference"]}})
print("Indexed.")

TEMPLATE = """Context: {% for doc in documents %}{{ doc.content }}\n{% endfor %}\nQuestion: {{ question }}\nAnswer:"""

# Build and run the query pipeline
qry = Pipeline()
qry.add_component("embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
qry.add_component("retriever", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
qry.add_component("prompt", PromptBuilder(template=TEMPLATE))
qry.add_component("generator", OpenAIGenerator(model="gpt-4o"))
qry.connect("embedder.embedding", "retriever.query_embedding")
qry.connect("retriever.documents", "prompt.documents")
qry.connect("prompt.prompt", "generator.prompt")

result = qry.run({
    "embedder": {"text": "What authentication methods are supported?"},
    "prompt": {"question": "What authentication methods are supported?"},
})
print(result["generator"]["replies"][0])
```
## Comparison: KnowledgeSDK vs. Apify Actor for Haystack

Here is how the standard Apify-based web fetching approach for Haystack compares:
| Aspect | Apify + Haystack | KnowledgeSDK + Haystack |
|---|---|---|
| Setup required | Apify account, actor configuration, webhook setup | One API key |
| Output format | HTML (needs post-processing) | Clean markdown |
| Component complexity | Complex (Apify webhook → Haystack) | Simple (direct component) |
| JS rendering | Yes | Yes |
| Anti-bot handling | Yes | Yes |
| Semantic search | No (Apify doesn't provide search) | Yes (built-in) |
| Change detection | Manual polling | Webhooks |
| Cost | Apify usage + Haystack | KnowledgeSDK only |
The KnowledgeSDK component is simpler, returns better output, and adds search and webhook capabilities that the Apify actor doesn't provide.
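The "change detection" row deserves a concrete shape. The payload field below (`changed_urls`) is an assumption about the webhook format, not a documented contract — the point is that re-indexing is just re-running the indexing pipeline on whatever URLs the webhook reports as changed:

```python
def handle_change_webhook(payload: dict, indexing_pipeline) -> int:
    """Re-index any URLs a change-notification webhook reports as updated.

    `payload["changed_urls"]` is a hypothetical field name; adapt it to the
    actual webhook schema your provider sends.
    """
    changed = payload.get("changed_urls", [])
    if changed:
        # Reuse the same indexing pipeline built in Step 2
        indexing_pipeline.run({"fetcher": {"urls": changed}})
    return len(changed)


# Demonstrate with a stub pipeline; real code would pass indexing_pipeline
class _StubPipeline:
    def __init__(self):
        self.calls = []

    def run(self, data):
        self.calls.append(data)


stub = _StubPipeline()
n = handle_change_webhook({"changed_urls": ["https://docs.example.com/webhooks"]}, stub)
print(n, stub.calls)
```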
## Conclusion
Haystack is a strong framework for production RAG pipelines — typed components, YAML-configurable, and built for teams. The missing piece has been a web scraping component that returns clean markdown without HTML post-processing overhead.
The KnowledgeSDKFetcher component in this tutorial fills that gap. It's a standard Haystack component that works with every existing Haystack splitter, embedder, and retriever. You get JavaScript rendering and anti-bot handling from KnowledgeSDK, and the full Haystack ecosystem for chunking, embedding, retrieval, and generation.
Ready to build your Haystack RAG pipeline? Start a free KnowledgeSDK trial at knowledgesdk.com.