CrewAI is one of the most popular multi-agent orchestration frameworks for Python. It lets you define roles, goals, and tasks for AI agents that collaborate toward a shared outcome. Out of the box, however, CrewAI agents have no live web access — they operate only on context you provide at initialization.
KnowledgeSDK gives CrewAI agents real web research capabilities: scraping URLs on demand, searching a growing knowledge base, and monitoring pages for changes. This tutorial builds a three-agent competitive intelligence crew, with full working code.
## The Crew Architecture
- **Agent 1: The Researcher** — scrapes competitor websites and indexes content in the knowledge base using KnowledgeSDK's scrape and extract tools.
- **Agent 2: The Analyst** — searches the knowledge base for specific answers using KnowledgeSDK's hybrid search.
- **Agent 3: The Writer** — synthesizes research findings into a structured competitive intelligence brief.
## Prerequisites
```bash
pip install crewai knowledgesdk langchain-openai
```

(`langchain-openai` is needed for the `ChatOpenAI` import used in Step 2.)
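The examples below read two environment variables via `os.environ`. Export them before running — the values shown are placeholders, not real key formats:

```shell
# Replace with your actual keys before running the examples
export KNOWLEDGESDK_API_KEY="your-knowledgesdk-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```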
## Step 1: KnowledgeSDK Tools for CrewAI

CrewAI tools extend `BaseTool`. Here are three tools wrapping KnowledgeSDK's API:
```python
import os
from typing import Optional

from crewai.tools import BaseTool
from pydantic import BaseModel, Field
from knowledgesdk import KnowledgeSDK

ks_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


class ScrapeURLInput(BaseModel):
    url: str = Field(description="The URL to scrape and add to the knowledge base")


class ScrapeURLTool(BaseTool):
    name: str = "scrape_url"
    description: str = (
        "Scrape a web page and add it to the knowledge base. "
        "Use this to gather information from competitor websites, "
        "documentation pages, or any publicly accessible URL. "
        "Returns the page content as markdown."
    )
    args_schema: type[BaseModel] = ScrapeURLInput

    def _run(self, url: str) -> str:
        result = ks_client.scrape(url)
        chars = len(result.markdown)
        return (
            f"Successfully scraped and indexed: {url}\n"
            f"Content length: {chars} characters\n\n"
            f"Preview:\n{result.markdown[:500]}..."
        )


class SearchKnowledgeInput(BaseModel):
    query: str = Field(description="The search query to run against the knowledge base")
    limit: int = Field(default=5, description="Number of results to return (1-10)")


class SearchKnowledgeTool(BaseTool):
    name: str = "search_knowledge"
    description: str = (
        "Search the knowledge base using semantic + keyword hybrid search. "
        "Returns ranked results with titles, snippets, and source URLs."
    )
    args_schema: type[BaseModel] = SearchKnowledgeInput

    def _run(self, query: str, limit: int = 5) -> str:
        results = ks_client.search(query, limit=limit)
        if not results.items:
            return f"No results found for: '{query}'"
        output = f"Search results for: '{query}'\n\n"
        for i, item in enumerate(results.items, 1):
            output += f"{i}. [{item.score:.2f}] {item.title}\n"
            output += f"   URL: {item.url}\n"
            output += f"   {item.snippet}\n\n"
        return output


class ExtractSiteInput(BaseModel):
    url: str = Field(description="The base URL of the site to fully extract")
    max_pages: Optional[int] = Field(default=20, description="Max pages to extract")


class ExtractSiteTool(BaseTool):
    name: str = "extract_site"
    description: str = (
        "Extract and index an entire website. Use this for comprehensive coverage "
        "of a competitor's documentation or product site."
    )
    args_schema: type[BaseModel] = ExtractSiteInput

    def _run(self, url: str, max_pages: int = 20) -> str:
        result = ks_client.extract(url, options={"maxPages": max_pages})
        return (
            f"Extracted and indexed: {url}\n"
            f"Pages: {result.pageCount}\n"
            f"All pages are now searchable in the knowledge base."
        )
```
## Step 2: Define the Agents
```python
from crewai import Agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"], temperature=0.1)

scrape_tool = ScrapeURLTool()
search_tool = SearchKnowledgeTool()
extract_tool = ExtractSiteTool()

researcher = Agent(
    role="Web Research Specialist",
    goal=(
        "Gather comprehensive information from competitor websites by scraping "
        "their documentation, product pages, pricing, and API references. "
        "Ensure all relevant pages are indexed before handing off."
    ),
    backstory=(
        "You are an expert at navigating web sources to gather competitive intelligence. "
        "You know which pages matter most: pricing, API docs, SDKs, changelogs."
    ),
    tools=[scrape_tool, extract_tool],
    llm=llm,
    verbose=True,
    max_iter=10,
)

analyst = Agent(
    role="Competitive Intelligence Analyst",
    goal=(
        "Search the knowledge base to extract specific competitive insights. "
        "Answer precise questions about competitor products with source citations."
    ),
    backstory=(
        "You extract precise information from large amounts of web content, "
        "running targeted searches and cross-referencing results."
    ),
    tools=[search_tool],
    llm=llm,
    verbose=True,
    max_iter=15,
)

writer = Agent(
    role="Technical Content Strategist",
    goal=(
        "Synthesize competitive research into a clear, structured brief "
        "that product and sales teams can act on."
    ),
    backstory=(
        "You write concise competitive briefs for technical audiences, "
        "highlighting key differentiators and strategic recommendations."
    ),
    tools=[],
    llm=llm,
    verbose=True,
)
```
## Step 3: Define the Tasks
```python
from crewai import Task


def create_research_tasks(competitor_url: str, research_questions: list[str]):
    questions_text = "\n".join(f"- {q}" for q in research_questions)

    research_task = Task(
        description=f"""
        Gather comprehensive information about the competitor at: {competitor_url}

        Use extract_site to crawl their documentation. Then use scrape_url for:
        - Pricing page
        - API documentation
        - Getting started guide
        - SDK/integration pages
        - Changelog

        Report what you found and how many pages were indexed.
        """,
        expected_output="Summary of pages scraped and indexed with content types found.",
        agent=researcher,
    )

    analysis_task = Task(
        description=f"""
        Search the knowledge base to answer these questions:
        {questions_text}

        For each question: run a targeted search, review top results, extract
        the specific answer with source URLs. Note any unanswered questions.
        """,
        expected_output=(
            "Structured analysis answering each research question "
            "with evidence and source URLs."
        ),
        agent=analyst,
        context=[research_task],
    )

    brief_task = Task(
        description="""
        Write a competitive intelligence brief using the team's research.

        Structure:
        1. Executive Summary (3-4 sentences)
        2. Pricing Model
        3. Core Capabilities
        4. API and Developer Experience
        5. Key Differentiators vs Us
        6. Weaknesses and Gaps
        7. Strategic Recommendations (2-3 action items)

        Keep under 800 words. Write for a technical audience.
        """,
        expected_output="Structured competitive intelligence brief under 800 words.",
        agent=writer,
        context=[research_task, analysis_task],
    )

    return [research_task, analysis_task, brief_task]
```
## Step 4: Assemble and Run the Crew
```python
from crewai import Crew, Process

competitor_url = "https://docs.competitor.com"
research_questions = [
    "What are their pricing tiers and monthly costs?",
    "What programming languages do their SDKs support?",
    "What are their API rate limits?",
    "Do they support webhooks or real-time notifications?",
    "How do they handle authentication?",
    "What are their main limitations?",
]

tasks = create_research_tasks(competitor_url, research_questions)

crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=tasks,
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff()
print(result)
```
## Step 5: Keep the Knowledge Base Fresh with Webhooks
```python
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

pages_to_monitor = [
    "https://competitor.com/pricing",
    "https://docs.competitor.com/changelog",
]

for url in pages_to_monitor:
    webhook = client.webhooks.create(
        url=url,
        callback_url="https://your-app.com/webhooks/content-changed",
        events=["content.changed"],
    )
    print(f"Monitoring: {url}")
```
Handle the webhook to re-scrape automatically:
```python
# FastAPI webhook handler
import os

from fastapi import FastAPI, Request
from knowledgesdk import KnowledgeSDK

app = FastAPI()
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


@app.post("/webhooks/content-changed")
async def handle_change(request: Request):
    payload = await request.json()
    changed_url = payload.get("url")
    if payload.get("event") == "content.changed" and changed_url:
        client.scrape(changed_url)  # Re-index the changed page
    return {"status": "ok"}
```
## CrewAI Without vs With KnowledgeSDK
| Capability | CrewAI Alone | CrewAI + KnowledgeSDK |
|---|---|---|
| Live web access | No | Yes |
| JavaScript rendering | No | Yes |
| Persistent knowledge base | No | Yes |
| Semantic search over scraped content | No | Yes |
| Automatic content refresh | No | Yes (webhooks) |
| Setup complexity | N/A | 5 minutes |
## Running Multiple Competitors
```python
import json
from datetime import datetime

competitors = [
    {"name": "Firecrawl", "url": "https://docs.firecrawl.dev"},
    {"name": "Jina Reader", "url": "https://docs.jina.ai"},
    {"name": "Apify", "url": "https://docs.apify.com"},
]

briefs = {}
for competitor in competitors:
    print(f"Researching: {competitor['name']}...")
    tasks = create_research_tasks(
        competitor_url=competitor["url"],
        research_questions=[
            "What is their pricing?",
            "What are their main features?",
            "What are their API capabilities?",
            "What are their weaknesses?",
        ],
    )
    crew = Crew(
        agents=[researcher, analyst, writer],
        tasks=tasks,
        process=Process.sequential,
        verbose=False,
    )
    briefs[competitor["name"]] = {
        "brief": str(crew.kickoff()),
        "generated_at": datetime.now().isoformat(),
    }

with open("competitive_briefs.json", "w") as f:
    json.dump(briefs, f, indent=2)
```
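If you want a single human-readable document instead of raw JSON, the saved briefs can be rendered into one markdown report. The helper below is a hypothetical addition (not part of CrewAI or KnowledgeSDK) that only assumes the `briefs` dict shape built above:

```python
def render_report(briefs: dict) -> str:
    """Render the briefs dict into one markdown document, one section per competitor."""
    lines = ["# Competitive Intelligence Briefs", ""]
    for name, entry in sorted(briefs.items()):
        lines.append(f"## {name}")
        lines.append(f"_Generated: {entry['generated_at']}_")
        lines.append("")
        lines.append(entry["brief"])
        lines.append("")
    return "\n".join(lines)
```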
## FAQ
**Does KnowledgeSDK work with CrewAI's built-in tool registry?**

Yes. KnowledgeSDK tools implemented as `BaseTool` subclasses are fully compatible with CrewAI's tool system. You can mix them with DuckDuckGo search, file reading, or any other tool.

**Can I use CrewAI with Node.js instead of Python?**

CrewAI is Python-only. For Node.js multi-agent workflows, use KnowledgeSDK's `@knowledgesdk/node` SDK with a Node-native framework such as LangGraph.js, or a custom OpenAI function-calling implementation.

**How do I prevent the Researcher from re-scraping pages already indexed?**

KnowledgeSDK automatically updates existing entries when you re-scrape the same URL rather than creating duplicates. For efficiency, track scraped URLs in the task context to avoid unnecessary calls.
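One lightweight way to do that tracking — a sketch, not a KnowledgeSDK feature — is to normalize each URL and keep a set of ones already scraped, checking it before calling the scrape tool:

```python
from urllib.parse import urlsplit, urlunsplit

_seen_urls: set[str] = set()


def _normalize(url: str) -> str:
    """Lowercase scheme/host, strip trailing slashes, drop fragments."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        parts.query,
        "",  # drop the #fragment
    ))


def should_scrape(url: str) -> bool:
    """Return True the first time a URL is seen, False on repeats."""
    key = _normalize(url)
    if key in _seen_urls:
        return False
    _seen_urls.add(key)
    return True
```

A tool's `_run` could call `should_scrape(url)` first and return early with a "already indexed" message on repeats, saving API calls within a run.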
**Can agents search content from previous crew runs?**

Yes. The KnowledgeSDK knowledge base is persistent and scoped to your API key. Content indexed in one run is immediately available in all subsequent runs — your crew accumulates knowledge over time.

**Is there a limit on pages the Researcher can scrape per run?**

Limits depend on your plan. The `extract_site` tool's `max_pages` parameter controls crawl depth. For large competitor sites, start with 20-30 pages focused on the most important sections.

**Can I give the Writer agent search capabilities too?**

Yes. Any agent can have any tool. The Researcher → Analyst → Writer pattern is efficient, but a Writer with search access can do targeted lookups during drafting.

**What LLMs work with this setup?**

Any LLM that LangChain supports: OpenAI GPT-4o, Anthropic Claude, Google Gemini, Mistral, or local models via Ollama. Replace `ChatOpenAI` with your preferred provider.
## Conclusion
CrewAI handles agent coordination and role specialization. KnowledgeSDK provides the web research capabilities those agents need: JavaScript rendering, anti-bot bypass, persistent search, and webhook-triggered updates.
The Researcher → Analyst → Writer pattern is a proven starting point. Real production systems often add more agents (fact-checker, email drafter, social media writer) and additional KnowledgeSDK tools. The core pattern — scrape to knowledge base, search to retrieve, synthesize to output — scales to much more complex workflows.
Get your KnowledgeSDK API key and plug web research into your CrewAI agents today.
```bash
pip install crewai knowledgesdk langchain-openai
```