Microsoft AutoGen is a powerful multi-agent framework that supports conversational AI agents with tool use. By default, AutoGen agents are limited to the information they were trained on — they have no live web access. Every time your agent needs to look something up, it either hallucinates or tells the user it cannot help.
KnowledgeSDK changes this. By registering KnowledgeSDK's scrape and search functions as AutoGen tools, your agents can fetch live web content, build a persistent knowledge base, and answer questions grounded in real, current data.
This tutorial is Python-focused — AutoGen is Python-first and most production deployments use Python.
What We Are Building
A two-agent AutoGen system:
- WebResearchAgent: Equipped with KnowledgeSDK tools for scraping URLs and searching the knowledge base
- UserProxyAgent: Represents the human user, drives the conversation, and executes tool calls
The agents collaborate to research any topic from the live web, answer questions, and cite their sources.
Prerequisites
pip install pyautogen knowledgesdk
You need:
- A KnowledgeSDK API key (free tier at knowledgesdk.com)
- An OpenAI API key (or any AutoGen-compatible LLM)
Step 1: Define KnowledgeSDK Functions
AutoGen tools are plain Python functions with docstrings and type hints. AutoGen uses the docstring and signature to describe the tool to the LLM.
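To make that concrete, here is a rough illustration of how a signature and docstring can be mapped into the kind of schema the LLM sees. This is a simplified sketch, not AutoGen's actual implementation, and `get_weather` is a made-up example function:

```python
# Illustration only: a simplified signature-to-schema mapping, not AutoGen's
# real code. get_weather is a made-up example function.
import inspect

def get_weather(city: str, units: str = "metric") -> str:
    """Return the current weather for a city."""
    return f"Weather in {city} ({units})"

sig = inspect.signature(get_weather)
tool_schema = {
    "name": get_weather.__name__,
    "description": inspect.getdoc(get_weather),
    "parameters": {
        name: {
            "annotation": param.annotation.__name__,
            # Parameters without a default are required
            "required": param.default is inspect.Parameter.empty,
        }
        for name, param in sig.parameters.items()
    },
}
print(tool_schema)
```

The point: the better your docstrings and type hints, the better the LLM understands when and how to call each tool.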
```python
# Python — KnowledgeSDK tool functions for AutoGen
import os

from knowledgesdk import KnowledgeSDK

ks_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_url(url: str) -> str:
    """
    Scrape a web page and add it to the knowledge base.

    Use this tool when you need to gather information from a specific URL.
    The page content will be automatically indexed for future searches.
    Returns the scraped content as clean markdown.

    Args:
        url: The full URL to scrape (must start with https://)

    Returns:
        The page content as markdown text, or an error message.
    """
    result = ks_client.scrape(url)
    char_count = len(result.markdown)
    if char_count > 2000:
        return (
            f"Successfully scraped: {url}\n"
            f"Content length: {char_count} characters\n\n"
            f"---CONTENT---\n"
            f"{result.markdown[:2000]}"
            f"\n[...truncated, full content indexed for search]"
        )
    return f"Successfully scraped: {url}\n\n{result.markdown}"
```
```python
def search_knowledge(query: str, limit: int = 5) -> str:
    """
    Search the knowledge base using semantic and keyword hybrid search.

    Use this tool to find information from previously scraped web pages.
    The search uses both semantic similarity and keyword matching for
    accurate results. Always prefer this over re-scraping if you have
    already indexed the relevant pages.

    Args:
        query: Natural language search query
        limit: Number of results to return (1-10, default 5)

    Returns:
        Ranked search results with titles, snippets, and source URLs.
    """
    results = ks_client.search(query, limit=limit)
    if not results.items:
        return f"No results found for query: '{query}'. Try scraping relevant pages first."
    output = f"Search results for: '{query}'\n"
    output += f"Found {len(results.items)} results:\n\n"
    for i, item in enumerate(results.items, 1):
        output += f"{i}. {item.title} (score: {item.score:.2f})\n"
        output += f"   Source: {item.url}\n"
        output += f"   {item.snippet}\n\n"
    return output
```
```python
def extract_site(url: str, max_pages: int = 15) -> str:
    """
    Extract and index an entire website, crawling up to max_pages pages.

    Use this when you need comprehensive knowledge from a site
    (documentation, competitor analysis, news site). More thorough
    than scraping individual pages but takes longer.

    Args:
        url: The base URL of the site to extract
        max_pages: Maximum pages to extract (1-50, default 15)

    Returns:
        Summary of extraction including page count and indexed content.
    """
    result = ks_client.extract(url, options={"maxPages": max_pages})
    return (
        f"Successfully extracted: {url}\n"
        f"Pages indexed: {result.pageCount}\n"
        f"Total content: {result.totalCharacters} characters\n"
        f"All content is now searchable via search_knowledge."
    )
```
Step 2: Create AutoGen Agents with KnowledgeSDK Tools
```python
# Python — AutoGen agents with KnowledgeSDK tool registration
import autogen

# LLM configuration
llm_config = {
    "config_list": [
        {
            "model": "gpt-4o",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0.1,
    "timeout": 120,
}

# Web Research Agent — has web research capabilities
web_research_agent = autogen.AssistantAgent(
    name="WebResearchAgent",
    system_message="""You are a web research specialist with access to live web content.

You have three tools available:
1. scrape_url — Fetch and index a specific URL
2. search_knowledge — Search all previously indexed content
3. extract_site — Crawl and index an entire website

Research workflow:
- Check search_knowledge first — the content you need may already be indexed
- For known URLs: use scrape_url or extract_site to gather content, then search_knowledge
- For unknown sources: start with scrape_url on the most relevant pages
- Always cite your sources with URLs in your responses

Be thorough but efficient: never re-scrape a page that is already indexed.""",
    llm_config=llm_config,
)

# User Proxy — executes tool calls, represents the human
user_proxy = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",  # Fully automated — change to "TERMINATE" for manual control
    max_consecutive_auto_reply=15,
    # Tool-call messages can have content=None, so guard with `or ""`
    is_termination_msg=lambda msg: "RESEARCH_COMPLETE" in (msg.get("content") or ""),
    code_execution_config=False,  # Disable code execution; we use function tools only
)

# Register KnowledgeSDK functions as AutoGen tools
autogen.register_function(
    scrape_url,
    caller=web_research_agent,
    executor=user_proxy,
    name="scrape_url",
    description="Scrape a web page and add it to the searchable knowledge base",
)

autogen.register_function(
    search_knowledge,
    caller=web_research_agent,
    executor=user_proxy,
    name="search_knowledge",
    description="Search all indexed web content using hybrid semantic + keyword search",
)

autogen.register_function(
    extract_site,
    caller=web_research_agent,
    executor=user_proxy,
    name="extract_site",
    description="Crawl and index an entire website for comprehensive coverage",
)
```
Step 3: Run a Web Research Session
```python
# Python — Run an AutoGen web research session
def research(question: str, seed_urls: list[str] | None = None):
    """
    Run a web research session with AutoGen + KnowledgeSDK.

    Args:
        question: The research question to answer
        seed_urls: Optional list of URLs to scrape before researching
    """
    # Build the initial message
    if seed_urls:
        url_list = "\n".join(f"- {url}" for url in seed_urls)
        message = f"""Please research the following question:

{question}

Start by indexing these relevant pages:
{url_list}

Then search the knowledge base and provide a comprehensive answer with source citations.
When done, end your response with RESEARCH_COMPLETE."""
    else:
        message = f"""Please research the following question:

{question}

Find and scrape relevant web sources, then provide a comprehensive answer with citations.
When done, end your response with RESEARCH_COMPLETE."""

    # Initiate the conversation
    user_proxy.initiate_chat(
        web_research_agent,
        message=message,
        clear_history=True,
    )

# Example 1: Competitor research
research(
    question="What are Firecrawl's pricing tiers and rate limits in 2026?",
    seed_urls=[
        "https://firecrawl.dev/pricing",
        "https://docs.firecrawl.dev/rate-limits",
    ],
)

# Example 2: Technology research (no seed URLs)
research(
    question="What are the main differences between AutoGen and CrewAI for multi-agent systems?"
)
```
Step 4: Multi-Agent Pattern with Specialist Agents
For more complex research tasks, use multiple specialized agents:
```python
# Python — Multi-agent research team
scraper_agent = autogen.AssistantAgent(
    name="Scraper",
    system_message="""You are a web scraping specialist. Your only job is to
gather web content by calling scrape_url and extract_site. Do not analyze
content — just gather and index it, then report what you scraped.""",
    llm_config=llm_config,
)

analyst_agent = autogen.AssistantAgent(
    name="Analyst",
    system_message="""You are a research analyst. Use search_knowledge to find
specific information from the indexed knowledge base. Extract precise facts,
quote sources, and identify patterns. Always include source URLs.""",
    llm_config=llm_config,
)

# Register tools to the appropriate agents
autogen.register_function(
    scrape_url,
    caller=scraper_agent,
    executor=user_proxy,
    name="scrape_url",
    description="Scrape a URL and index it",
)

autogen.register_function(
    extract_site,
    caller=scraper_agent,
    executor=user_proxy,
    name="extract_site",
    description="Extract an entire site",
)

autogen.register_function(
    search_knowledge,
    caller=analyst_agent,
    executor=user_proxy,
    name="search_knowledge",
    description="Search indexed content",
)

# Group chat with all three agents
group_chat = autogen.GroupChat(
    agents=[user_proxy, scraper_agent, analyst_agent],
    messages=[],
    max_round=20,
    speaker_selection_method="round_robin",  # Or "auto" for LLM-driven selection
)

manager = autogen.GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
)

user_proxy.initiate_chat(
    manager,
    message="""Research task: Analyze the top 3 web scraping APIs (Firecrawl,
KnowledgeSDK, Jina Reader) and compare their pricing, features, and
developer experience.

Scraper: index their websites first.
Analyst: search and extract the comparison data.
End with RESEARCH_COMPLETE when done.""",
)
```
Step 5: Persistent Knowledge Base Across Sessions
One major advantage of KnowledgeSDK over in-memory approaches: the knowledge base persists across Python sessions. Content scraped in a previous run is available for search in the next.
```python
# Python — Persistent knowledge base pattern
import json
from datetime import datetime
from pathlib import Path

KNOWLEDGE_LOG = Path("knowledge_log.json")

def log_scrape(url: str):
    """Track what has been scraped to avoid redundant calls."""
    log = {}
    if KNOWLEDGE_LOG.exists():
        log = json.loads(KNOWLEDGE_LOG.read_text())
    log[url] = datetime.now().isoformat()
    KNOWLEDGE_LOG.write_text(json.dumps(log, indent=2))

def already_scraped(url: str) -> bool:
    """Check if a URL was already indexed in a previous session."""
    if not KNOWLEDGE_LOG.exists():
        return False
    log = json.loads(KNOWLEDGE_LOG.read_text())
    return url in log

# Modified scrape function that checks the log
def scrape_url_cached(url: str) -> str:
    """Scrape a URL (skips if already indexed in a previous session)."""
    if already_scraped(url):
        return f"Already indexed: {url}. Use search_knowledge to query it."
    result = ks_client.scrape(url)
    log_scrape(url)
    return f"Newly indexed: {url} ({len(result.markdown)} chars)"
```
Comparison: AutoGen Without vs With KnowledgeSDK
| Capability | AutoGen Alone | AutoGen + KnowledgeSDK |
|---|---|---|
| Answer questions from training data | Yes | Yes |
| Access live web content | No | Yes |
| Search specific domains | No | Yes |
| Persistent knowledge base | No | Yes |
| JavaScript-rendered pages | No | Yes |
| Source citations with URLs | No | Yes |
| Setup time | N/A | 5 minutes |
FAQ
Does KnowledgeSDK work with AutoGen 0.2 and the newer AgentChat API?
Yes. The function registration approach shown here works with AutoGen 0.2. For AutoGen's newer agentchat module (0.4+), use the same Python functions but register them via FunctionTool objects instead of register_function. The KnowledgeSDK client code is identical.
Can I use Claude or Gemini instead of GPT-4o?
Yes. AutoGen supports any LLM with a compatible API. Replace the config_list in llm_config with your preferred provider's model and API key. Claude and Gemini both support function calling.
How do I limit which websites the agent can scrape?
Add a whitelist check inside the scrape_url function before calling ks_client.scrape(). You can maintain a list of allowed domains and return an error message if the requested URL is not on the list.
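A minimal sketch of such a guard, using only the standard library. The domain names here are placeholders, and `scrape_url_guarded` is a hypothetical wrapper around the `scrape_url` tool from Step 1:

```python
# Hypothetical domain allowlist for scrape_url. The domains listed are
# placeholders; substitute the sites your agent is allowed to visit.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"firecrawl.dev", "knowledgesdk.com"}

def is_allowed(url: str) -> bool:
    """Allow a URL only if its host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def scrape_url_guarded(url: str) -> str:
    """Drop-in variant of scrape_url that refuses off-list domains."""
    if not is_allowed(url):
        return f"Error: scraping {url} is not permitted by the domain allowlist."
    return scrape_url(url)  # delegate to the tool defined in Step 1
```

Register `scrape_url_guarded` in place of `scrape_url` and the LLM never even sees whether a blocked scrape would have succeeded; it just gets the error string back as the tool result.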
What is the maximum content size per page?
KnowledgeSDK handles pages of any size. Very long pages are automatically chunked during indexing, and the search engine returns the most relevant chunks rather than the full document.
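The chunking idea can be pictured with a toy sketch. The window size and overlap below are made-up illustrative values, not KnowledgeSDK's actual server-side parameters:

```python
# Toy illustration of overlapping chunking; the sizes are made up and are
# not KnowledgeSDK's actual server-side parameters.
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows so searches can match mid-document."""
    chunks = []
    step = size - overlap  # each window starts `step` characters after the last
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

The overlap matters: without it, a sentence that straddles a chunk boundary would be split across two chunks and match neither search query well.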
Does AutoGen support streaming responses?
AutoGen supports streaming for the conversational responses from the LLM, but tool call results (from KnowledgeSDK) are returned synchronously. This is standard behavior for AutoGen.
Can I use this with a local LLM via Ollama?
Yes. AutoGen supports Ollama-compatible endpoints. Replace the OpenAI config with your local endpoint. Note that smaller local models may be less reliable at deciding when and how to call tools; GPT-4o or Claude tend to be more accurate at tool orchestration.
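A sketch of such a config, assuming a default local Ollama install with its OpenAI-compatible endpoint; the model name and port are assumptions, so match them to your setup:

```python
# Sketch: llm_config pointing at a local Ollama OpenAI-compatible endpoint.
# The model name and port are assumptions; match them to your local setup.
local_llm_config = {
    "config_list": [
        {
            "model": "llama3.1",
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # Ollama ignores the key, but AutoGen requires one
        }
    ],
    "temperature": 0.1,
}
```

Pass `local_llm_config` anywhere this tutorial uses `llm_config`; the tool registration code is unchanged.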
How do I handle rate limits if the agent scrapes too many pages at once?
KnowledgeSDK's API signals rate limiting by returning errors with Retry-After headers. For high-volume scraping in AutoGen sessions, add exponential backoff in your tool functions or limit max_pages in extract_site.
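A generic retry wrapper along those lines. This is a sketch: the exact exception class your KnowledgeSDK client raises on HTTP 429 is an assumption, so this version retries on any exception; narrow the `except` clause once you know the real type:

```python
# Generic exponential backoff with jitter. Retries on any exception because
# the client's actual rate-limit exception class is an assumption here;
# narrow the except clause for production use.
import random
import time

def with_backoff(fn, *args, retries: int = 4, base_delay: float = 1.0, **kwargs):
    """Call fn(*args, **kwargs), sleeping base_delay * 2**attempt plus jitter between failures."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the agent
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Usage inside a tool function:
# result = with_backoff(ks_client.scrape, url)
```

Because the wrapper takes the function and its arguments rather than wrapping a specific call, the same helper covers `scrape`, `search`, and `extract`.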
Conclusion
AutoGen provides excellent multi-agent orchestration and conversational AI. KnowledgeSDK provides the web research layer that AutoGen lacks: live scraping, persistent indexed knowledge, and hybrid semantic search.
The combination is particularly powerful for:
- Research assistants that need current web data to answer questions
- Competitive intelligence agents that monitor competitor sites
- Documentation assistants grounded in live product docs
- News analysis agents that need real-time content
The setup requires about 50 lines of Python to connect both systems, and the result is an agent that can genuinely research the web — not just pretend to.
Get your KnowledgeSDK API key and give your AutoGen agents web access today.
pip install pyautogen knowledgesdk