integration · March 20, 2026 · 15 min read

Web Scraping with Haystack: Build a Live RAG Pipeline with KnowledgeSDK

Build a production Haystack RAG pipeline with live web scraping. Custom KnowledgeSDKFetcher component, pipeline YAML, and end-to-end Q&A from URL to answer.


Haystack by deepset is one of the most mature production RAG frameworks available. Unlike LangChain (which trades flexibility for speed-to-prototype), Haystack is designed for teams building systems that have to run in production: typed component interfaces, YAML-based pipeline definitions, built-in evaluation tooling, and strong support for hybrid retrieval.

The challenge with Haystack's web scraping story is that the default approach — the Apify integration — requires managing an Apify account, configuring Apify actors, and dealing with HTML output that still needs post-processing before it reaches your LLM.

This tutorial shows a cleaner path: a custom KnowledgeSDKFetcher component that integrates directly into your Haystack pipeline and returns LLM-ready markdown. You'll build a complete end-to-end RAG pipeline: give it a URL, get an answer.


Why Haystack?

Before diving into code, it's worth being precise about when to choose Haystack over alternatives like LangChain or LlamaIndex:

| Criterion | Haystack | LangChain | LlamaIndex |
|---|---|---|---|
| Production-readiness | Excellent | Good | Good |
| Pipeline typing | Strong (component interfaces) | Loose | Moderate |
| YAML pipeline definitions | Yes | No | Partial |
| Evaluation tooling | Built-in | Third-party | Third-party |
| Learning curve | Steeper | Gentle | Moderate |
| Component ecosystem | Growing | Very large | Large |

Choose Haystack when you need a pipeline that a team can reason about, test independently, and deploy to production with confidence.


Setup

Install dependencies:

pip install haystack-ai knowledgesdk openai

Set environment variables:

export KNOWLEDGESDK_API_KEY="knowledgesdk_live_..."
export OPENAI_API_KEY="sk-..."

Step 1: Define the Custom KnowledgeSDKFetcher Component

In Haystack, every pipeline stage is a Component. Components have typed inputs and outputs, declared with the @component decorator and the @component.output_types annotation. This makes your pipeline self-documenting and type-safe.

import os
from haystack import component, Document
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

@component
class KnowledgeSDKFetcher:
    """
    Haystack component that fetches URLs using KnowledgeSDK and returns Documents.
    Handles JavaScript rendering and anti-bot protections transparently.
    Returns clean markdown suitable for LLM processing.
    """

    @component.output_types(documents=list[Document])
    def run(self, urls: list[str]) -> dict:
        documents = []

        for url in urls:
            try:
                result = knowledge_client.extract(
                    url,
                    include_markdown=True,
                    include_structured=True,
                )

                doc = Document(
                    content=result.markdown,
                    meta={
                        "url": url,
                        "title": result.title or "",
                        "description": result.structured.get("description", "") if result.structured else "",
                        "source": "knowledgesdk",
                    },
                )
                documents.append(doc)
                print(f"Fetched: {url} ({len(result.markdown)} chars)")

            except Exception as e:
                print(f"Failed to fetch {url}: {e}")

        return {"documents": documents}

Notice that the component returns a Document — Haystack's standard content unit. This makes KnowledgeSDKFetcher compatible with every downstream Haystack component: splitters, embedders, retrievers, and generators all work with Document objects.
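The run loop above fetches URLs one at a time. For larger batches you can parallelize fetching inside the component with a thread pool. A stdlib sketch, assuming a fetch callable that raises on failure (the names fetch_all and max_workers are illustrative, not part of Haystack or KnowledgeSDK):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls: list[str], fetch, max_workers: int = 4) -> dict[str, object]:
    """Fetch URLs concurrently; return {url: result} for successes only,
    logging failures instead of aborting the whole batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as e:
                print(f"Failed to fetch {url}: {e}")
    return results
```

Inside KnowledgeSDKFetcher.run you would pass a lambda wrapping knowledge_client.extract and build Documents from the returned dict; keep max_workers modest to stay within API rate limits.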


Step 2: Build the Indexing Pipeline

The indexing pipeline takes URLs, fetches them, splits them into chunks, embeds them, and writes to a document store.

from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter

# Initialize document store
document_store = InMemoryDocumentStore()

# Build indexing pipeline
indexing_pipeline = Pipeline()

indexing_pipeline.add_component("fetcher", KnowledgeSDKFetcher())
indexing_pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="word", split_length=200, split_overlap=20),
)
indexing_pipeline.add_component(
    "embedder",
    OpenAIDocumentEmbedder(model="text-embedding-3-small"),
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store),
)

# Connect components
indexing_pipeline.connect("fetcher.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Run the indexing pipeline
urls_to_index = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/authentication",
    "https://docs.example.com/webhooks",
]

result = indexing_pipeline.run({"fetcher": {"urls": urls_to_index}})
print(f"Indexed {result['writer']['documents_written']} document chunks")
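To make the splitter settings concrete: split_by="word" with split_length=200 and split_overlap=20 means each chunk starts 180 words after the previous one, so consecutive chunks share 20 words of context. A simplified pure-Python illustration of that windowing (not Haystack's actual implementation):

```python
def split_words(text: str, split_length: int = 200, split_overlap: int = 20) -> list[str]:
    """Window a text into word chunks; consecutive chunks overlap by
    split_overlap words, mirroring DocumentSplitter's word mode."""
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

A 500-word page therefore yields three chunks (words 0-199, 180-379, 360-499), which is the chunk count you should expect per page when sizing your embedding costs.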

Step 3: Build the Query Pipeline

The query pipeline takes a question, embeds it, retrieves relevant chunks, and generates an answer with an OpenAI model.

from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Prompt template
PROMPT_TEMPLATE = """
You are a helpful assistant. Answer the question based on the provided context.
If the context doesn't contain enough information to answer, say so.

Context:
{% for doc in documents %}
---
Source: {{ doc.meta.url }}
Title: {{ doc.meta.title }}

{{ doc.content }}
{% endfor %}

Question: {{ question }}

Answer:
"""

# Build query pipeline
query_pipeline = Pipeline()

query_pipeline.add_component("embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store, top_k=5),
)
query_pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
query_pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o"))

# Connect components
query_pipeline.connect("embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Query the pipeline
def ask(question: str) -> str:
    result = query_pipeline.run({
        "embedder": {"text": question},
        "prompt_builder": {"question": question},
    })
    return result["generator"]["replies"][0]

# Example queries
print(ask("How do I authenticate API requests?"))
print(ask("What events can I subscribe to with webhooks?"))
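With top_k=5 and roughly 200-word chunks the prompt stays small, but if you raise top_k or the chunk size it is worth enforcing a prompt budget before the generator call. A rough stdlib sketch using the common ~0.75 words-per-token heuristic (trim_to_budget is an illustrative helper, not a Haystack API):

```python
def trim_to_budget(doc_texts: list[str], max_tokens: int = 6000) -> list[str]:
    """Keep retrieved chunks in rank order until a rough token budget is hit.
    Assumes ~0.75 words per token, a coarse heuristic for English text."""
    kept, used = [], 0
    for text in doc_texts:
        est_tokens = int(len(text.split()) / 0.75)
        if used + est_tokens > max_tokens:
            break  # dropping lower-ranked chunks first preserves the best context
        kept.append(text)
        used += est_tokens
    return kept
```

Because the retriever returns chunks in descending relevance order, trimming from the tail discards the least useful context first.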

Step 4: Export as YAML

One of Haystack's strengths is that pipelines can be serialized to YAML. This lets you version control your pipeline configuration and deploy it without changing Python code.

# Export pipeline to YAML
with open("rag_pipeline.yaml", "w") as f:
    query_pipeline.dump(f)

The resulting YAML defines every component, its parameters, and the connections:

# rag_pipeline.yaml (excerpt)
components:
  embedder:
    type: haystack.components.embedders.openai_text_embedder.OpenAITextEmbedder
    init_parameters:
      model: text-embedding-3-small
  retriever:
    type: haystack.components.retrievers.in_memory.embedding_retriever.InMemoryEmbeddingRetriever
    init_parameters:
      top_k: 5
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: "..."
  generator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      model: gpt-4o
connections:
  - sender: embedder.embedding
    receiver: retriever.query_embedding
  - sender: retriever.documents
    receiver: prompt_builder.documents
  - sender: prompt_builder.prompt
    receiver: generator.prompt

Load it back:

from haystack import Pipeline

with open("rag_pipeline.yaml", "r") as f:
    loaded_pipeline = Pipeline.load(f)

Step 5: Add the KnowledgeSDK Search Component

For knowledge bases that grow beyond what fits in memory, add a KnowledgeSDK search component. This lets you search over all previously scraped content using semantic search without managing a separate vector store.

import httpx
from haystack import component, Document

@component
class KnowledgeSDKSearchRetriever:
    """
    Haystack retriever component that uses KnowledgeSDK's semantic search API.
    Searches over all content previously indexed in your KnowledgeSDK knowledge base.
    """

    def __init__(self, api_key: str, top_k: int = 5):
        self.api_key = api_key
        self.top_k = top_k

    @component.output_types(documents=list[Document])
    def run(self, query: str) -> dict:
        response = httpx.post(
            "https://api.knowledgesdk.com/v1/search",
            headers={"x-api-key": self.api_key},
            json={"query": query, "limit": self.top_k},
        )
        response.raise_for_status()
        results = response.json()

        documents = [
            Document(
                content=item["content"],
                meta={
                    "url": item.get("url", ""),
                    "title": item.get("title", ""),
                    "score": item.get("score", 0),
                },
            )
            for item in results.get("results", [])
        ]

        return {"documents": documents}

Use it as a drop-in retriever in your query pipeline:

search_pipeline = Pipeline()

search_pipeline.add_component(
    "retriever",
    KnowledgeSDKSearchRetriever(
        api_key=os.environ["KNOWLEDGESDK_API_KEY"],
        top_k=5,
    ),
)
search_pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
search_pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o"))

search_pipeline.connect("retriever.documents", "prompt_builder.documents")
search_pipeline.connect("prompt_builder.prompt", "generator.prompt")

result = search_pipeline.run({
    "retriever": {"query": "How does authentication work?"},
    "prompt_builder": {"question": "How does authentication work?"},
})
print(result["generator"]["replies"][0])
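Because this retriever makes a network call on every query, a production deployment should retry transient failures. A generic stdlib backoff helper you could wrap the httpx call in (illustrative, not part of KnowledgeSDK):

```python
import time
import random

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(); on exception, retry with exponential backoff plus jitter.
    Re-raises the last exception if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 0.5s, 1s, 2s, ... plus up to 100ms of jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In KnowledgeSDKSearchRetriever.run you would wrap the httpx.post call: `response = with_retries(lambda: httpx.post(...))`, keeping the raise_for_status check afterwards.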

Step 6: Complete End-to-End Example

Here is a self-contained script combining everything:

import os
from haystack import Pipeline, component, Document
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

@component
class KnowledgeSDKFetcher:
    @component.output_types(documents=list[Document])
    def run(self, urls: list[str]) -> dict:
        documents = []
        for url in urls:
            result = knowledge_client.extract(url, include_markdown=True)
            documents.append(Document(
                content=result.markdown,
                meta={"url": url, "title": result.title or ""},
            ))
        return {"documents": documents}

# Build and run
store = InMemoryDocumentStore()

idx = Pipeline()
idx.add_component("fetcher", KnowledgeSDKFetcher())
idx.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
idx.add_component("embedder", OpenAIDocumentEmbedder(model="text-embedding-3-small"))
idx.add_component("writer", DocumentWriter(document_store=store))
idx.connect("fetcher.documents", "splitter.documents")
idx.connect("splitter.documents", "embedder.documents")
idx.connect("embedder.documents", "writer.documents")

idx.run({"fetcher": {"urls": ["https://docs.example.com/api-reference"]}})
print("Indexed.")

TEMPLATE = """Context: {% for doc in documents %}{{ doc.content }}\n{% endfor %}\nQuestion: {{ question }}\nAnswer:"""

qry = Pipeline()
qry.add_component("embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
qry.add_component("retriever", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
qry.add_component("prompt", PromptBuilder(template=TEMPLATE))
qry.add_component("generator", OpenAIGenerator(model="gpt-4o"))
qry.connect("embedder.embedding", "retriever.query_embedding")
qry.connect("retriever.documents", "prompt.documents")
qry.connect("prompt.prompt", "generator.prompt")

result = qry.run({
    "embedder": {"text": "What authentication methods are supported?"},
    "prompt": {"question": "What authentication methods are supported?"},
})
print(result["generator"]["replies"][0])

Comparison: KnowledgeSDK vs. Apify Actor for Haystack

Here is how the custom component compares with the standard Apify-based approach to web fetching in Haystack:

| Aspect | Apify + Haystack | KnowledgeSDK + Haystack |
|---|---|---|
| Setup required | Apify account, actor configuration, webhook setup | One API key |
| Output format | HTML (needs post-processing) | Clean markdown |
| Component complexity | Complex (Apify webhook → Haystack) | Simple (direct component) |
| JS rendering | Yes | Yes |
| Anti-bot handling | Yes | Yes |
| Semantic search | No (Apify doesn't provide search) | Yes (built-in) |
| Change detection | Manual polling | Webhooks |
| Cost | Apify usage + Haystack | KnowledgeSDK only |

The KnowledgeSDK component is simpler, returns better output, and adds search and webhook capabilities that the Apify actor doesn't provide.


Conclusion

Haystack is a strong framework for production RAG pipelines — typed components, YAML-configurable, and built for teams. The missing piece has been a web scraping component that returns clean markdown without HTML post-processing overhead.

The KnowledgeSDKFetcher component in this tutorial fills that gap. It's a standard Haystack component that works with every existing Haystack splitter, embedder, and retriever. You get JavaScript rendering and anti-bot handling from KnowledgeSDK, and the full Haystack ecosystem for chunking, embedding, retrieval, and generation.

Ready to build your Haystack RAG pipeline? Start a free KnowledgeSDK trial at knowledgesdk.com.

