smolagents Web Scraping: Give HuggingFace Agents Web Access
HuggingFace's smolagents has become the go-to framework for researchers and ML engineers who want to build code agents without the abstraction overhead of LangChain. The design philosophy is minimal: small API surface, model-agnostic, easy to reason about.
smolagents ships with a DuckDuckGoSearchTool that lets agents search the web, but it returns only search snippets, and for many agent tasks that is a fundamental limitation. When your agent needs to read a full documentation page, extract structured data from a product listing, or compare two pages side by side, snippets simply don't contain enough information to answer the question.
This tutorial shows how to add KnowledgeSDK as a custom tool in a smolagents CodeAgent. The integration is under 20 lines of Python. You'll get full page content — not snippets — via JavaScript rendering and anti-bot handling, all transparent to the agent.
Why smolagents?
smolagents is worth understanding on its own terms before adding tools to it.
The key design decision is the CodeAgent pattern. Instead of using a ReAct loop where the agent outputs "Thought/Action/Observation" steps in natural language, a CodeAgent outputs Python code. The framework executes that code and feeds the result back to the model.
This makes agents more capable at multi-step tasks because:
- The model can use Python variables to pass data between steps
- The model can use loops, conditionals, and list comprehensions
- The model's reasoning is inspectable as real code, not natural language
The tradeoff is that you need a model capable of writing clean Python. Qwen2.5-Coder and GPT-4o work well. Smaller models may struggle.
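To make the contrast concrete, here's a minimal, self-contained sketch (not smolagents internals, and `scrape_webpage` here is a stub) of the kind of Python a CodeAgent emits, versus a ReAct agent that would spend one model call per "Thought/Action/Observation" step:

```python
# Stub tool standing in for a real registered web-reading tool.
def scrape_webpage(url: str) -> str:
    return f"content of {url}"

# A ReAct agent would emit "Thought: ... Action: scrape_webpage ..." as text,
# one tool call per model turn. A CodeAgent emits Python directly, so it can
# loop and keep intermediate results in variables:
pages = {}
for url in ["https://example.com/a", "https://example.com/b"]:
    pages[url] = scrape_webpage(url)  # variables persist across steps

# Later steps can reference everything gathered so far:
summary = " | ".join(pages[u] for u in sorted(pages))
```

The loop and the `pages` dict are exactly the kind of multi-step state that is awkward to express in a natural-language ReAct trace.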
Setup
pip install smolagents knowledgesdk
Set API keys:
export KNOWLEDGESDK_API_KEY="knowledgesdk_live_..."
export OPENAI_API_KEY="sk-..." # or use HuggingFace models
The Built-In DuckDuckGoSearchTool: What It Returns
smolagents ships with DuckDuckGoSearchTool. Let's be specific about what it returns so you understand the gap:
from smolagents import DuckDuckGoSearchTool
search_tool = DuckDuckGoSearchTool()
result = search_tool("KnowledgeSDK web scraping API")
# Result: a string with snippets like:
# "KnowledgeSDK is a web scraping API for AI agents... [snippet] ...
# KnowledgeSDK provides LLM-ready markdown extraction... [snippet]"
Each snippet is typically 100-200 characters. You get the gist of what a page is about, but not the content itself. For many queries — especially those that require reading instructions, code examples, or structured data from a specific page — snippets are insufficient.
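A rough back-of-the-envelope comparison, using made-up sample text, shows how little of a page a 150-character snippet actually carries:

```python
# Simulate a documentation page (~1800 words) and a typical search snippet.
page = "The PaymentIntent API requires an amount and a currency. " * 200
snippet = page[:150] + "..."

snippet_words = len(snippet.split())
page_words = len(page.split())
coverage = snippet_words / page_words  # fraction of the page the snippet carries
```

Here `coverage` comes out well under 5%, which is why snippet-only agents fail on tasks that require reading a page in full.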
Step 1: Define the KnowledgeSDK Tool
The @tool decorator is all you need. smolagents reads the function signature and docstring to understand when and how to call the tool.
import os
from smolagents import tool
from knowledgesdk import KnowledgeSDK
knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
@tool
def scrape_webpage(url: str) -> str:
    """
    Fetch the full content of a webpage and return it as clean markdown text.
    Use this when you need to read the complete content of a specific URL.
    Handles JavaScript-rendered pages and anti-bot protections automatically.
    Returns the full text content, not just a snippet.

    Args:
        url: The full URL to scrape, must start with http:// or https://

    Returns:
        The full page content as clean markdown text
    """
    result = knowledge_client.scrape(url)
    # Limit to the first 6000 words to stay within the model's context window
    words = result.markdown.split()
    content = " ".join(words[:6000])
    return f"# {result.title}\n\nSource: {url}\n\n{content}"
That's it. 15 lines including the docstring. The @tool decorator registers the function as a smolagents tool automatically.
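The word-based truncation in the tool above is easy to sanity-check in isolation. Note that 6000 words is a rough heuristic; for precise budgeting you would count tokens with your model's tokenizer instead:

```python
# Standalone version of the truncation logic used in scrape_webpage,
# exercised on sample text (no API call needed).
def truncate_words(markdown: str, limit: int = 6000) -> str:
    words = markdown.split()
    return " ".join(words[:limit])

sample = "word " * 10_000          # a 10,000-word "page"
truncated = truncate_words(sample)  # capped at 6000 words
```

Word counts only approximate token counts (English prose runs roughly 1.3 tokens per word), so leave headroom for the agent's own reasoning and prior steps.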
Step 2: Add Structured Extraction
For use cases that need typed metadata alongside content, add a second tool:
import json
@tool
def extract_webpage(url: str) -> str:
    """
    Extract structured knowledge from a webpage. Returns the full markdown content
    AND structured metadata including page title, description, main topics, and category.
    Use this when you need both the content and structured information about a page.

    Args:
        url: The full URL to extract from

    Returns:
        JSON string with keys: title, description, category, markdown, word_count
    """
    result = knowledge_client.extract(url, include_markdown=True, include_structured=True)
    data = {
        "title": result.title or "",
        "description": result.structured.get("description", "") if result.structured else "",
        "category": result.structured.get("category", "") if result.structured else "",
        "markdown": result.markdown[:5000],  # truncated to fit the context window
        "word_count": len(result.markdown.split()),
    }
    return json.dumps(data, indent=2)
Step 3: Initialize the CodeAgent
from smolagents import CodeAgent, OpenAIServerModel
model = OpenAIServerModel(
    model_id="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
)

agent = CodeAgent(
    tools=[scrape_webpage, extract_webpage],
    model=model,
    max_steps=10,
)
Or with a HuggingFace model:
from smolagents import CodeAgent, HfApiModel
model = HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(
    tools=[scrape_webpage, extract_webpage],
    model=model,
    max_steps=10,
)
Step 4: Example Agent Queries
Query 1: Read a Documentation Page
result = agent.run(
    "Read the documentation at https://docs.stripe.com/api/payment_intents "
    "and explain the required parameters for creating a PaymentIntent."
)
print(result)
The agent generates Python code that calls scrape_webpage, reads the result, and synthesizes an answer. No snippet limitations — it reads the complete page.
Query 2: Compare Two Pages
result = agent.run(
    """Compare the pricing at these two pages:
1. https://openai.com/api/pricing
2. https://anthropic.com/pricing
Create a comparison table showing cost per 1M input and output tokens for their flagship models."""
)
print(result)
The agent generates code that calls scrape_webpage twice, stores both results in variables, and produces a comparison.
Query 3: Research Task with Multiple URLs
result = agent.run(
    """Research how Pinecone, Weaviate, and Qdrant handle vector search indexing.
Read their respective documentation pages and summarize the key differences
in how each handles approximate nearest neighbor search."""
)
print(result)
A CodeAgent handles this naturally: it writes a loop that scrapes each documentation URL, stores the results in variables, and synthesizes a comparison. (Include the URLs in the prompt, or pair the agent with a search tool so it can find them itself.)
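As a hedged illustration, the generated code might look roughly like this. The `scrape_webpage` stub and the documentation URLs below are placeholders so the sketch runs standalone; in a real run the registered tool does the actual fetching:

```python
# Stub standing in for the real scrape_webpage tool.
def scrape_webpage(url: str) -> str:
    return f"# Docs\n\nSource: {url}\n\nindexing details for {url}"

# Hypothetical documentation URLs, for illustration only.
doc_urls = {
    "Pinecone": "https://docs.pinecone.io/guides/indexes",
    "Weaviate": "https://weaviate.io/developers/weaviate",
    "Qdrant": "https://qdrant.tech/documentation/",
}

# The loop-and-variables pattern a CodeAgent typically emits:
notes = {name: scrape_webpage(url) for name, url in doc_urls.items()}
summary = "\n".join(
    f"{name}: {len(text.split())} words read" for name, text in notes.items()
)
```

The key point is that all three pages live in `notes` at once, so the final synthesis step can compare them directly.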
Query 4: Extract Structured Product Data
result = agent.run(
    "Go to https://example.com/product/widget-pro and extract the product name, "
    "price, key features (as a list), and whether it's in stock."
)
print(result)
Step 5: Add Search Over Your Knowledge Base
For agents that query a pre-indexed set of URLs, add a search tool backed by KnowledgeSDK's semantic search:
import httpx
@tool
def search_knowledge_base(query: str) -> str:
    """
    Search a knowledge base of previously scraped web pages using semantic search.
    Use this when you want to find relevant information across many pages without
    scraping individual URLs. Returns the most relevant text passages and their sources.

    Args:
        query: A natural language search query

    Returns:
        A formatted string with the top matching passages and their source URLs
    """
    response = httpx.post(
        "https://api.knowledgesdk.com/v1/search",
        headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
        json={"query": query, "limit": 5},
    )
    results = response.json().get("results", [])
    if not results:
        return "No relevant results found in the knowledge base."
    output = []
    for i, result in enumerate(results, 1):
        output.append(
            f"**Result {i}** (score: {result.get('score', 0):.2f})\n"
            f"Source: {result.get('url', 'unknown')}\n"
            f"Title: {result.get('title', '')}\n\n"
            f"{result.get('content', '')[:500]}"
        )
    return "\n\n---\n\n".join(output)
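The formatting logic inside that tool can be exercised on hypothetical sample results, without any network call, to see exactly what the agent receives:

```python
# Hypothetical search results in the shape the API is assumed to return.
sample_results = [
    {"score": 0.91, "url": "https://example.com/a", "title": "Page A", "content": "Alpha " * 200},
    {"score": 0.85, "url": "https://example.com/b", "title": "Page B", "content": "Beta " * 200},
]

# Same formatting as in search_knowledge_base above.
blocks = []
for i, r in enumerate(sample_results, 1):
    blocks.append(
        f"**Result {i}** (score: {r['score']:.2f})\n"
        f"Source: {r['url']}\n"
        f"Title: {r['title']}\n\n"
        f"{r['content'][:500]}"  # passages capped at 500 characters
    )
formatted = "\n\n---\n\n".join(blocks)
```

Each passage is capped at 500 characters so five results stay well under a couple thousand tokens per tool call.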
# Agent with both search and scraping
agent_full = CodeAgent(
    tools=[search_knowledge_base, scrape_webpage, extract_webpage],
    model=model,
    max_steps=15,
    system_prompt="""You are a research agent with web search and scraping capabilities.

Strategy:
1. First try search_knowledge_base to find relevant information efficiently
2. If search doesn't return enough, use scrape_webpage to read specific pages
3. Use extract_webpage when you need structured metadata alongside content
4. Always cite your sources in the final answer""",
)
Comparing smolagents + KnowledgeSDK vs. DuckDuckGoSearchTool
| Capability | DuckDuckGoSearchTool | KnowledgeSDK Tools |
|---|---|---|
| Content returned | 100-200 char snippets | Full page (thousands of words) |
| Specific URL reading | No (search only) | Yes |
| JavaScript-rendered pages | No | Yes |
| Anti-bot (Cloudflare, etc.) | No | Yes |
| Structured data extraction | No | Yes |
| Semantic search over history | No | Yes |
| Setup complexity | None (built-in) | ~20 lines |
| Cost | Free | Per API call |
DuckDuckGoSearchTool is ideal for discovery: "find pages about topic X." KnowledgeSDK tools are ideal for reading: "read this specific page in full."
The ideal agent has both: use DuckDuckGoSearchTool to find relevant URLs, then KnowledgeSDK to read them completely.
Combined Search-Then-Read Agent
from smolagents import DuckDuckGoSearchTool
agent_combined = CodeAgent(
    tools=[
        DuckDuckGoSearchTool(),  # to find URLs
        scrape_webpage,          # to read them fully
        search_knowledge_base,   # for previously indexed content
    ],
    model=model,
    max_steps=15,
    system_prompt="""You are a research agent.

When researching a topic:
1. Use DuckDuckGoSearchTool to find the most relevant URLs
2. Use scrape_webpage to read the full content of the top 2-3 results
3. Synthesize information from full content, not just snippets
4. Cite sources with URLs in your answer""",
)
result = agent_combined.run(
    "What are the best practices for rate limiting in REST APIs? "
    "Find current documentation and best-practice guides and summarize the key recommendations."
)
print(result)
The agent writes code like this internally:
# Agent-generated code (example)
# Inside the sandbox, tools are available by name; DuckDuckGoSearchTool
# is exposed to the agent as web_search.
search_results = web_search("REST API rate limiting best practices")
# Extract URLs from the search results...
page1 = scrape_webpage("https://docs.example.com/rate-limiting")
page2 = scrape_webpage("https://engineering.example.com/api-rate-limits")
# Synthesize from full content
final_answer(f"Based on reading the full documentation pages:\n\n{synthesis}")
Node.js: Using KnowledgeSDK in a smolagents-Compatible Pattern
smolagents is Python-only, but if you're building a similar CodeAgent pattern in Node.js, here's how to define tools with KnowledgeSDK:
import KnowledgeSDK from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// Tool definition for any OpenAI function-calling compatible framework
const scrapeWebpageTool = {
  name: 'scrape_webpage',
  description: 'Fetch the full content of a webpage as clean markdown. Use for specific URLs that need to be read completely.',
  parameters: {
    type: 'object',
    properties: {
      url: {
        type: 'string',
        description: 'The full URL to scrape',
      },
    },
    required: ['url'],
  },
  execute: async ({ url }) => {
    const result = await client.scrape(url);
    const words = result.markdown.split(' ').slice(0, 6000);
    return `# ${result.title}\n\nSource: ${url}\n\n${words.join(' ')}`;
  },
};
Conclusion
smolagents' CodeAgent pattern is powerful for multi-step research tasks — but it's only as capable as the tools you give it. DuckDuckGoSearchTool returns snippets. KnowledgeSDK tools return full page content, structured extraction, and semantic search over your indexed knowledge base.
The integration takes under 20 lines of Python with the @tool decorator. You get JavaScript rendering, anti-bot handling, and clean markdown output automatically — your agent code never has to think about those details.
The most capable setup combines both: DuckDuckGoSearchTool for discovery, KnowledgeSDK for reading in depth. Your smolagents CodeAgent can write a search, extract the top URLs, scrape them fully, and synthesize an answer from complete page content — all in a single agent run.
Give your smolagents agent full web reading capabilities — start a free KnowledgeSDK trial at knowledgesdk.com.