Microsoft AutoGen is a powerful multi-agent framework that supports conversational AI agents with tool use. By default, AutoGen agents are limited to the information they were trained on — they have no live web access. Every time your agent needs to look something up, it either hallucinates or tells the user it cannot help.
KnowledgeSDK changes this. By registering KnowledgeSDK's scrape and search functions as AutoGen tools, your agents can fetch live web content, build a persistent knowledge base, and answer questions grounded in real, current data.
This tutorial is Python-focused — AutoGen is Python-first and most production deployments use Python.
What We Are Building
A two-agent AutoGen system:
- WebResearchAgent: Equipped with KnowledgeSDK tools for scraping URLs and searching the knowledge base
- UserProxyAgent: Represents the human user, drives the conversation, and executes tool calls
The agents collaborate to research any topic from the live web, answer questions, and cite their sources.
Prerequisites
pip install pyautogen knowledgesdk
You need:
- A KnowledgeSDK API key (free tier at knowledgesdk.com)
- An OpenAI API key (or any AutoGen-compatible LLM)
Step 1: Define KnowledgeSDK Functions
AutoGen tools are plain Python functions with docstrings and type hints. AutoGen uses the docstring and signature to describe the tool to the LLM.
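To make that concrete, here is a rough illustration of how a signature and docstring can be mapped into the kind of schema the LLM sees. This is a simplified sketch, not AutoGen's actual implementation, and `get_weather` is a made-up example function:

```python
# Illustration only: a simplified signature-to-schema mapping, not AutoGen's
# real code. get_weather is a made-up example function.
import inspect

def get_weather(city: str, units: str = "metric") -> str:
    """Return the current weather for a city."""
    return f"Weather in {city} ({units})"

sig = inspect.signature(get_weather)
tool_schema = {
    "name": get_weather.__name__,
    "description": inspect.getdoc(get_weather),
    "parameters": {
        name: {
            "annotation": param.annotation.__name__,
            # Parameters without a default are required
            "required": param.default is inspect.Parameter.empty,
        }
        for name, param in sig.parameters.items()
    },
}
print(tool_schema)
```

The point: the better your docstrings and type hints, the better the LLM understands when and how to call each tool.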
```python
# Python — KnowledgeSDK tool functions for AutoGen
import os

from knowledgesdk import KnowledgeSDK

ks_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_url(url: str) -> str:
    """
    Scrape a web page and add it to the knowledge base.

    Use this tool when you need to gather information from a specific URL.
    The page content will be automatically indexed for future searches.
    Returns the scraped content as clean markdown.

    Args:
        url: The full URL to scrape (must start with https://)

    Returns:
        The page content as markdown text, or an error message.
    """
    result = ks_client.scrape(url)
    char_count = len(result.markdown)
    if char_count > 2000:
        return (
            f"Successfully scraped: {url}\n"
            f"Content length: {char_count} characters\n\n"
            f"---CONTENT---\n"
            f"{result.markdown[:2000]}"
            f"\n[...truncated, full content indexed for search]"
        )
    return f"Successfully scraped: {url}\n\n{result.markdown}"
```
```python
def search_knowledge(query: str, limit: int = 5) -> str:
    """
    Search the knowledge base using semantic and keyword hybrid search.

    Use this tool to find information from previously scraped web pages.
    The search uses both semantic similarity and keyword matching for
    accurate results. Always prefer this over re-scraping if you have
    already indexed the relevant pages.

    Args:
        query: Natural language search query
        limit: Number of results to return (1-10, default 5)

    Returns:
        Ranked search results with titles, snippets, and source URLs.
    """
    results = ks_client.search(query, limit=limit)
    if not results.items:
        return f"No results found for query: '{query}'. Try scraping relevant pages first."
    output = f"Search results for: '{query}'\n"
    output += f"Found {len(results.items)} results:\n\n"
    for i, item in enumerate(results.items, 1):
        output += f"{i}. {item.title} (score: {item.score:.2f})\n"
        output += f"   Source: {item.url}\n"
        output += f"   {item.snippet}\n\n"
    return output
```
```python
def extract_site(url: str, max_pages: int = 15) -> str:
    """
    Extract and index an entire website, crawling up to max_pages pages.

    Use this when you need comprehensive knowledge from a site
    (documentation, competitor analysis, news site). More thorough
    than scraping individual pages but takes longer.

    Args:
        url: The base URL of the site to extract
        max_pages: Maximum pages to extract (1-50, default 15)

    Returns:
        Summary of extraction including page count and indexed content.
    """
    result = ks_client.extract(url, options={"maxPages": max_pages})
    return (
        f"Successfully extracted: {url}\n"
        f"Pages indexed: {result.pageCount}\n"
        f"Total content: {result.totalCharacters} characters\n"
        f"All content is now searchable via search_knowledge."
    )
```
Step 2: Create AutoGen Agents with KnowledgeSDK Tools
```python
# Python — AutoGen agents with KnowledgeSDK tool registration
import autogen

# LLM configuration
llm_config = {
    "config_list": [
        {
            "model": "gpt-4o",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0.1,
    "timeout": 120,
}

# Web Research Agent — has web research capabilities
web_research_agent = autogen.AssistantAgent(
    name="WebResearchAgent",
    system_message="""You are a web research specialist with access to live web content.

You have three tools available:
1. scrape_url — Fetch and index a specific URL
2. search_knowledge — Search all previously indexed content
3. extract_site — Crawl and index an entire website

Research workflow:
- Check search_knowledge first — the content you need may already be indexed
- For known URLs: use scrape_url or extract_site to gather content, then search_knowledge
- For unknown sources: start with scrape_url on the most relevant pages
- Always cite your sources with URLs in your responses

Be thorough but efficient: never re-scrape a page that is already indexed.""",
    llm_config=llm_config,
)

# User Proxy — executes tool calls, represents the human
user_proxy = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",  # Fully automated — change to "TERMINATE" for manual control
    max_consecutive_auto_reply=15,
    # Tool-call messages can have content=None, so guard with `or ""`
    is_termination_msg=lambda msg: "RESEARCH_COMPLETE" in (msg.get("content") or ""),
    code_execution_config=False,  # Disable code execution; we use function tools only
)

# Register KnowledgeSDK functions as AutoGen tools
autogen.register_function(
    scrape_url,
    caller=web_research_agent,
    executor=user_proxy,
    name="scrape_url",
    description="Scrape a web page and add it to the searchable knowledge base",
)

autogen.register_function(
    search_knowledge,
    caller=web_research_agent,
    executor=user_proxy,
    name="search_knowledge",
    description="Search all indexed web content using hybrid semantic + keyword search",
)

autogen.register_function(
    extract_site,
    caller=web_research_agent,
    executor=user_proxy,
    name="extract_site",
    description="Crawl and index an entire website for comprehensive coverage",
)
```
Step 3: Run a Web Research Session
```python
# Python — Run an AutoGen web research session
def research(question: str, seed_urls: list[str] | None = None):
    """
    Run a web research session with AutoGen + KnowledgeSDK.

    Args:
        question: The research question to answer
        seed_urls: Optional list of URLs to scrape before researching
    """
    # Build the initial message
    if seed_urls:
        url_list = "\n".join(f"- {url}" for url in seed_urls)
        message = f"""Please research the following question:

{question}

Start by indexing these relevant pages:
{url_list}

Then search the knowledge base and provide a comprehensive answer with source citations.
When done, end your response with RESEARCH_COMPLETE."""
    else:
        message = f"""Please research the following question:

{question}

Find and scrape relevant web sources, then provide a comprehensive answer with citations.
When done, end your response with RESEARCH_COMPLETE."""

    # Initiate the conversation
    user_proxy.initiate_chat(
        web_research_agent,
        message=message,
        clear_history=True,
    )

# Example 1: Competitor research
research(
    question="What are Firecrawl's pricing tiers and rate limits in 2026?",
    seed_urls=[
        "https://firecrawl.dev/pricing",
        "https://docs.firecrawl.dev/rate-limits",
    ],
)

# Example 2: Technology research (no seed URLs)
research(
    question="What are the main differences between AutoGen and CrewAI for multi-agent systems?"
)
```
Step 4: Multi-Agent Pattern with Specialist Agents
For more complex research tasks, use multiple specialized agents:
```python
# Python — Multi-agent research team
scraper_agent = autogen.AssistantAgent(
    name="Scraper",
    system_message="""You are a web scraping specialist. Your only job is to
gather web content by calling scrape_url and extract_site. Do not analyze
content — just gather and index it, then report what you scraped.""",
    llm_config=llm_config,
)

analyst_agent = autogen.AssistantAgent(
    name="Analyst",
    system_message="""You are a research analyst. Use search_knowledge to find
specific information from the indexed knowledge base. Extract precise facts,
quote sources, and identify patterns. Always include source URLs.""",
    llm_config=llm_config,
)

# Register tools to the appropriate agents
autogen.register_function(
    scrape_url,
    caller=scraper_agent,
    executor=user_proxy,
    name="scrape_url",
    description="Scrape a URL and index it",
)

autogen.register_function(
    extract_site,
    caller=scraper_agent,
    executor=user_proxy,
    name="extract_site",
    description="Extract an entire site",
)

autogen.register_function(
    search_knowledge,
    caller=analyst_agent,
    executor=user_proxy,
    name="search_knowledge",
    description="Search indexed content",
)

# Group chat with all three agents
group_chat = autogen.GroupChat(
    agents=[user_proxy, scraper_agent, analyst_agent],
    messages=[],
    max_round=20,
    speaker_selection_method="round_robin",  # Or "auto" for LLM-driven selection
)

manager = autogen.GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
)

user_proxy.initiate_chat(
    manager,
    message="""Research task: Analyze the top 3 web scraping APIs (Firecrawl,
KnowledgeSDK, Jina Reader) and compare their pricing, features, and
developer experience.

Scraper: index their websites first.
Analyst: search and extract the comparison data.
End with RESEARCH_COMPLETE when done.""",
)
```
Step 5: Persistent Knowledge Base Across Sessions
One major advantage of KnowledgeSDK over in-memory approaches: the knowledge base persists across Python sessions. Content scraped in a previous run is available for search in the next.
```python
# Python — Persistent knowledge base pattern
import json
from datetime import datetime
from pathlib import Path

KNOWLEDGE_LOG = Path("knowledge_log.json")

def log_scrape(url: str):
    """Track what has been scraped to avoid redundant calls."""
    log = {}
    if KNOWLEDGE_LOG.exists():
        log = json.loads(KNOWLEDGE_LOG.read_text())
    log[url] = datetime.now().isoformat()
    KNOWLEDGE_LOG.write_text(json.dumps(log, indent=2))

def already_scraped(url: str) -> bool:
    """Check if a URL was already indexed in a previous session."""
    if not KNOWLEDGE_LOG.exists():
        return False
    log = json.loads(KNOWLEDGE_LOG.read_text())
    return url in log

# Modified scrape function that checks the log
def scrape_url_cached(url: str) -> str:
    """Scrape a URL (skips if already indexed in a previous session)."""
    if already_scraped(url):
        return f"Already indexed: {url}. Use search_knowledge to query it."
    result = ks_client.scrape(url)
    log_scrape(url)
    return f"Newly indexed: {url} ({len(result.markdown)} chars)"
```
Comparison: AutoGen Without vs With KnowledgeSDK
| Capability | AutoGen Alone | AutoGen + KnowledgeSDK |
|---|---|---|
| Answer questions from training data | Yes | Yes |
| Access live web content | No | Yes |
| Search specific domains | No | Yes |
| Persistent knowledge base | No | Yes |
| JavaScript-rendered pages | No | Yes |
| Source citations with URLs | No | Yes |
| Setup time | N/A | 5 minutes |
FAQ
Does KnowledgeSDK work with AutoGen 0.2 and the newer AgentChat API?
Yes. The function registration approach shown here works with AutoGen 0.2. For AutoGen's newer agentchat module (0.4+), use the same Python functions but register them via FunctionTool objects instead of register_function. The KnowledgeSDK client code is identical.
Can I use Claude or Gemini instead of GPT-4o?
Yes. AutoGen supports any LLM with a compatible API. Replace the config_list in llm_config with your preferred provider's model and API key. Claude and Gemini both support function calling.
How do I limit which websites the agent can scrape?
Add a whitelist check inside the scrape_url function before calling ks_client.scrape(). You can maintain a list of allowed domains and return an error message if the requested URL is not on the list.
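A minimal sketch of such a guard, using only the standard library. The domain names here are placeholders, and `scrape_url_guarded` is a hypothetical wrapper around the `scrape_url` tool from Step 1:

```python
# Hypothetical domain allowlist for scrape_url. The domains listed are
# placeholders; substitute the sites your agent is allowed to visit.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"firecrawl.dev", "knowledgesdk.com"}

def is_allowed(url: str) -> bool:
    """Allow a URL only if its host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def scrape_url_guarded(url: str) -> str:
    """Drop-in variant of scrape_url that refuses off-list domains."""
    if not is_allowed(url):
        return f"Error: scraping {url} is not permitted by the domain allowlist."
    return scrape_url(url)  # delegate to the tool defined in Step 1
```

Register `scrape_url_guarded` in place of `scrape_url` and the LLM never even sees whether a blocked scrape would have succeeded; it just gets the error string back as the tool result.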
What is the maximum content size per page?
KnowledgeSDK handles pages of any size. Very long pages are automatically chunked during indexing, and the search engine returns the most relevant chunks rather than the full document.
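The chunking idea can be pictured with a toy sketch. The window size and overlap below are made-up illustrative values, not KnowledgeSDK's actual server-side parameters:

```python
# Toy illustration of overlapping chunking; the sizes are made up and are
# not KnowledgeSDK's actual server-side parameters.
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows so searches can match mid-document."""
    chunks = []
    step = size - overlap  # each window starts `step` characters after the last
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

The overlap matters: without it, a sentence that straddles a chunk boundary would be split across two chunks and match neither search query well.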
Does AutoGen support streaming responses?
AutoGen supports streaming for the conversational responses from the LLM, but tool call results (from KnowledgeSDK) are returned synchronously. This is standard behavior for AutoGen.
Can I use this with a local LLM via Ollama?
Yes. AutoGen supports Ollama-compatible endpoints. Replace the OpenAI config with your local endpoint. Note that smaller local models may be less reliable at deciding when and how to call tools; GPT-4o or Claude tend to be more accurate at tool orchestration.
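A sketch of such a config, assuming a default local Ollama install with its OpenAI-compatible endpoint; the model name and port are assumptions, so match them to your setup:

```python
# Sketch: llm_config pointing at a local Ollama OpenAI-compatible endpoint.
# The model name and port are assumptions; match them to your local setup.
local_llm_config = {
    "config_list": [
        {
            "model": "llama3.1",
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # Ollama ignores the key, but AutoGen requires one
        }
    ],
    "temperature": 0.1,
}
```

Pass `local_llm_config` anywhere this tutorial uses `llm_config`; the tool registration code is unchanged.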
How do I handle rate limits if the agent scrapes too many pages at once?
KnowledgeSDK's API signals rate limiting by returning errors with Retry-After headers. For high-volume scraping in AutoGen sessions, add exponential backoff in your tool functions or limit max_pages in extract_site.
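A generic retry wrapper along those lines. This is a sketch: the exact exception class your KnowledgeSDK client raises on HTTP 429 is an assumption, so this version retries on any exception; narrow the `except` clause once you know the real type:

```python
# Generic exponential backoff with jitter. Retries on any exception because
# the client's actual rate-limit exception class is an assumption here;
# narrow the except clause for production use.
import random
import time

def with_backoff(fn, *args, retries: int = 4, base_delay: float = 1.0, **kwargs):
    """Call fn(*args, **kwargs), sleeping base_delay * 2**attempt plus jitter between failures."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the agent
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Usage inside a tool function:
# result = with_backoff(ks_client.scrape, url)
```

Because the wrapper takes the function and its arguments rather than wrapping a specific call, the same helper covers `scrape`, `search`, and `extract`.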
Conclusion
AutoGen provides excellent multi-agent orchestration and conversational AI. KnowledgeSDK provides the web research layer that AutoGen lacks: live scraping, persistent indexed knowledge, and hybrid semantic search.
The combination is particularly powerful for:
- Research assistants that need current web data to answer questions
- Competitive intelligence agents that monitor competitor sites
- Documentation assistants grounded in live product docs
- News analysis agents that need real-time content
The setup requires about 50 lines of Python to connect both systems, and the result is an agent that can genuinely research the web — not just pretend to.
Get your KnowledgeSDK API key and give your AutoGen agents web access today.
pip install pyautogen knowledgesdk