smolagents Web Scraping: Give HuggingFace Agents Web Access
HuggingFace's smolagents has become the go-to framework for researchers and ML engineers who want to build code agents without the abstraction overhead of LangChain. The design philosophy is minimal: small API surface, model-agnostic, easy to reason about.
smolagents ships with a DuckDuckGoSearchTool that lets agents search the web, but it returns only search snippets, and for many agent tasks that is a fundamental limitation. When your agent needs to read a full documentation page, extract structured data from a product listing, or compare two pages side by side, snippets simply don't contain enough information to answer the question.
This tutorial shows how to add KnowledgeSDK as a custom tool in a smolagents CodeAgent. The integration is under 20 lines of Python. You'll get full page content — not snippets — via JavaScript rendering and anti-bot handling, all transparent to the agent.
Why smolagents?
smolagents is worth understanding on its own terms before adding tools to it.
The key design decision is the CodeAgent pattern. Instead of using a ReAct loop where the agent outputs "Thought/Action/Observation" steps in natural language, a CodeAgent outputs Python code. The framework executes that code and feeds the result back to the model.
This makes agents more capable at multi-step tasks because:
- The model can use Python variables to pass data between steps
- The model can use loops, conditionals, and list comprehensions
- The model's reasoning is inspectable as real code, not natural language
The tradeoff is that you need a model capable of writing clean Python. Qwen2.5-Coder and GPT-4o work well. Smaller models may struggle.
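To make the contrast concrete, here's a minimal, self-contained sketch (not smolagents internals, and `scrape_webpage` here is a stub) of the kind of Python a CodeAgent emits, versus a ReAct agent that would spend one model call per "Thought/Action/Observation" step:

```python
# Stub tool standing in for a real registered web-reading tool.
def scrape_webpage(url: str) -> str:
    return f"content of {url}"

# A ReAct agent would emit "Thought: ... Action: scrape_webpage ..." as text,
# one tool call per model turn. A CodeAgent emits Python directly, so it can
# loop and keep intermediate results in variables:
pages = {}
for url in ["https://example.com/a", "https://example.com/b"]:
    pages[url] = scrape_webpage(url)  # variables persist across steps

# Later steps can reference everything gathered so far:
summary = " | ".join(pages[u] for u in sorted(pages))
```

The loop and the `pages` dict are exactly the kind of multi-step state that is awkward to express in a natural-language ReAct trace.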
Setup
pip install smolagents knowledgesdk
Set API keys:
export KNOWLEDGESDK_API_KEY="knowledgesdk_live_..."
export OPENAI_API_KEY="sk-..." # or use HuggingFace models
The Built-In DuckDuckGoSearchTool: What It Returns
smolagents ships with DuckDuckGoSearchTool. Let's be specific about what it returns so you understand the gap:
from smolagents import DuckDuckGoSearchTool
search_tool = DuckDuckGoSearchTool()
result = search_tool("KnowledgeSDK web scraping API")
# Result: a string with snippets like:
# "KnowledgeSDK is a web scraping API for AI agents... [snippet] ...
# KnowledgeSDK provides LLM-ready markdown extraction... [snippet]"
Each snippet is typically 100-200 characters. You get the gist of what a page is about, but not the content itself. For many queries — especially those that require reading instructions, code examples, or structured data from a specific page — snippets are insufficient.
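A rough back-of-the-envelope comparison, using made-up sample text, shows how little of a page a 150-character snippet actually carries:

```python
# Simulate a documentation page (~1800 words) and a typical search snippet.
page = "The PaymentIntent API requires an amount and a currency. " * 200
snippet = page[:150] + "..."

snippet_words = len(snippet.split())
page_words = len(page.split())
coverage = snippet_words / page_words  # fraction of the page the snippet carries
```

Here `coverage` comes out well under 5%, which is why snippet-only agents fail on tasks that require reading a page in full.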
Step 1: Define the KnowledgeSDK Tool
The @tool decorator is all you need. smolagents reads the function signature and docstring to understand when and how to call the tool.
import os
from smolagents import tool
from knowledgesdk import KnowledgeSDK
knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
@tool
def scrape_webpage(url: str) -> str:
    """
    Fetch the full content of a webpage and return it as clean markdown text.
    Use this when you need to read the complete content of a specific URL.
    Handles JavaScript-rendered pages and anti-bot protections automatically.
    Returns the full text content, not just a snippet.

    Args:
        url: The full URL to scrape, must start with http:// or https://

    Returns:
        The full page content as clean markdown text
    """
    result = knowledge_client.scrape(url)
    # Limit to the first 6000 words to stay within the model's context window
    words = result.markdown.split()
    content = " ".join(words[:6000])
    return f"# {result.title}\n\nSource: {url}\n\n{content}"
That's it. 15 lines including the docstring. The @tool decorator registers the function as a smolagents tool automatically.
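The word-based truncation in the tool above is easy to sanity-check in isolation. Note that 6000 words is a rough heuristic; for precise budgeting you would count tokens with your model's tokenizer instead:

```python
# Standalone version of the truncation logic used in scrape_webpage,
# exercised on sample text (no API call needed).
def truncate_words(markdown: str, limit: int = 6000) -> str:
    words = markdown.split()
    return " ".join(words[:limit])

sample = "word " * 10_000          # a 10,000-word "page"
truncated = truncate_words(sample)  # capped at 6000 words
```

Word counts only approximate token counts (English prose runs roughly 1.3 tokens per word), so leave headroom for the agent's own reasoning and prior steps.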
Step 2: Add Structured Extraction
For use cases that need typed metadata alongside content, add a second tool:
import json
@tool
def extract_webpage(url: str) -> str:
    """
    Extract structured knowledge from a webpage. Returns the full markdown content
    AND structured metadata including page title, description, main topics, and category.
    Use this when you need both the content and structured information about a page.

    Args:
        url: The full URL to extract from

    Returns:
        JSON string with keys: title, description, category, markdown, word_count
    """
    result = knowledge_client.extract(url, include_markdown=True, include_structured=True)
    data = {
        "title": result.title or "",
        "description": result.structured.get("description", "") if result.structured else "",
        "category": result.structured.get("category", "") if result.structured else "",
        "markdown": result.markdown[:5000],  # truncated to fit the context window
        "word_count": len(result.markdown.split()),
    }
    return json.dumps(data, indent=2)
Step 3: Initialize the CodeAgent
from smolagents import CodeAgent, OpenAIServerModel
model = OpenAIServerModel(
    model_id="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
)

agent = CodeAgent(
    tools=[scrape_webpage, extract_webpage],
    model=model,
    max_steps=10,
)
Or with a HuggingFace model:
from smolagents import CodeAgent, HfApiModel
model = HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(
    tools=[scrape_webpage, extract_webpage],
    model=model,
    max_steps=10,
)
Step 4: Example Agent Queries
Query 1: Read a Documentation Page
result = agent.run(
    "Read the documentation at https://docs.stripe.com/api/payment_intents "
    "and explain the required parameters for creating a PaymentIntent."
)
print(result)
The agent generates Python code that calls scrape_webpage, reads the result, and synthesizes an answer. No snippet limitations — it reads the complete page.
Query 2: Compare Two Pages
result = agent.run(
    """Compare the pricing at these two pages:
1. https://openai.com/api/pricing
2. https://anthropic.com/pricing
Create a comparison table showing cost per 1M input and output tokens for their flagship models."""
)
print(result)
The agent generates code that calls scrape_webpage twice, stores both results in variables, and produces a comparison.
Query 3: Research Task with Multiple URLs
result = agent.run(
    """Research how Pinecone, Weaviate, and Qdrant handle vector search indexing.
Read their respective documentation pages and summarize the key differences
in how each handles approximate nearest neighbor search."""
)
print(result)
A CodeAgent handles this naturally: it writes a loop that scrapes each documentation URL, stores the results in variables, and synthesizes a comparison. (Include the URLs in the prompt, or pair the agent with a search tool so it can find them itself.)
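As a hedged illustration, the generated code might look roughly like this. The `scrape_webpage` stub and the documentation URLs below are placeholders so the sketch runs standalone; in a real run the registered tool does the actual fetching:

```python
# Stub standing in for the real scrape_webpage tool.
def scrape_webpage(url: str) -> str:
    return f"# Docs\n\nSource: {url}\n\nindexing details for {url}"

# Hypothetical documentation URLs, for illustration only.
doc_urls = {
    "Pinecone": "https://docs.pinecone.io/guides/indexes",
    "Weaviate": "https://weaviate.io/developers/weaviate",
    "Qdrant": "https://qdrant.tech/documentation/",
}

# The loop-and-variables pattern a CodeAgent typically emits:
notes = {name: scrape_webpage(url) for name, url in doc_urls.items()}
summary = "\n".join(
    f"{name}: {len(text.split())} words read" for name, text in notes.items()
)
```

The key point is that all three pages live in `notes` at once, so the final synthesis step can compare them directly.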
Query 4: Extract Structured Product Data
result = agent.run(
    "Go to https://example.com/product/widget-pro and extract the product name, "
    "price, key features (as a list), and whether it's in stock."
)
print(result)
Step 5: Add Search Over Your Knowledge Base
For agents that query a pre-indexed set of URLs, add a search tool backed by KnowledgeSDK's semantic search:
import httpx
@tool
def search_knowledge_base(query: str) -> str:
    """
    Search a knowledge base of previously scraped web pages using semantic search.
    Use this when you want to find relevant information across many pages without
    scraping individual URLs. Returns the most relevant text passages and their sources.

    Args:
        query: A natural language search query

    Returns:
        A formatted string with the top matching passages and their source URLs
    """
    response = httpx.post(
        "https://api.knowledgesdk.com/v1/search",
        headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
        json={"query": query, "limit": 5},
    )
    results = response.json().get("results", [])
    if not results:
        return "No relevant results found in the knowledge base."
    output = []
    for i, result in enumerate(results, 1):
        output.append(
            f"**Result {i}** (score: {result.get('score', 0):.2f})\n"
            f"Source: {result.get('url', 'unknown')}\n"
            f"Title: {result.get('title', '')}\n\n"
            f"{result.get('content', '')[:500]}"
        )
    return "\n\n---\n\n".join(output)
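The formatting logic inside that tool can be exercised on hypothetical sample results, without any network call, to see exactly what the agent receives:

```python
# Hypothetical search results in the shape the API is assumed to return.
sample_results = [
    {"score": 0.91, "url": "https://example.com/a", "title": "Page A", "content": "Alpha " * 200},
    {"score": 0.85, "url": "https://example.com/b", "title": "Page B", "content": "Beta " * 200},
]

# Same formatting as in search_knowledge_base above.
blocks = []
for i, r in enumerate(sample_results, 1):
    blocks.append(
        f"**Result {i}** (score: {r['score']:.2f})\n"
        f"Source: {r['url']}\n"
        f"Title: {r['title']}\n\n"
        f"{r['content'][:500]}"  # passages capped at 500 characters
    )
formatted = "\n\n---\n\n".join(blocks)
```

Each passage is capped at 500 characters so five results stay well under a couple thousand tokens per tool call.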
# Agent with both search and scraping
agent_full = CodeAgent(
    tools=[search_knowledge_base, scrape_webpage, extract_webpage],
    model=model,
    max_steps=15,
    system_prompt="""You are a research agent with web search and scraping capabilities.

Strategy:
1. First try search_knowledge_base to find relevant information efficiently
2. If search doesn't return enough, use scrape_webpage to read specific pages
3. Use extract_webpage when you need structured metadata alongside content
4. Always cite your sources in the final answer""",
)
Comparing smolagents + KnowledgeSDK vs. DuckDuckGoSearchTool
| Capability | DuckDuckGoSearchTool | KnowledgeSDK Tools |
|---|---|---|
| Content returned | 100-200 char snippets | Full page (thousands of words) |
| Specific URL reading | No (search only) | Yes |
| JavaScript-rendered pages | No | Yes |
| Anti-bot (Cloudflare, etc.) | No | Yes |
| Structured data extraction | No | Yes |
| Semantic search over history | No | Yes |
| Setup complexity | None (built-in) | ~20 lines |
| Cost | Free | Per API call |
DuckDuckGoSearchTool is ideal for discovery: "find pages about topic X." KnowledgeSDK tools are ideal for reading: "read this specific page in full."
The ideal agent has both: use DuckDuckGoSearchTool to find relevant URLs, then KnowledgeSDK to read them completely.
Combined Search-Then-Read Agent
from smolagents import DuckDuckGoSearchTool
agent_combined = CodeAgent(
    tools=[
        DuckDuckGoSearchTool(),  # to find URLs
        scrape_webpage,          # to read them fully
        search_knowledge_base,   # for previously indexed content
    ],
    model=model,
    max_steps=15,
    system_prompt="""You are a research agent.

When researching a topic:
1. Use DuckDuckGoSearchTool to find the most relevant URLs
2. Use scrape_webpage to read the full content of the top 2-3 results
3. Synthesize information from full content, not just snippets
4. Cite sources with URLs in your answer""",
)
result = agent_combined.run(
    "What are the best practices for rate limiting in REST APIs? "
    "Find current documentation and best-practice guides and summarize the key recommendations."
)
print(result)
The agent writes code like this internally:
# Agent-generated code (example)
# Inside the sandbox, tools are available by name; DuckDuckGoSearchTool
# is exposed to the agent as web_search.
search_results = web_search("REST API rate limiting best practices")
# Extract URLs from the search results...
page1 = scrape_webpage("https://docs.example.com/rate-limiting")
page2 = scrape_webpage("https://engineering.example.com/api-rate-limits")
# Synthesize from full content
final_answer(f"Based on reading the full documentation pages:\n\n{synthesis}")
Node.js: Using KnowledgeSDK in a smolagents-Compatible Pattern
smolagents is Python-only, but if you're building a similar CodeAgent pattern in Node.js, here's how to define tools with KnowledgeSDK:
import KnowledgeSDK from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// Tool definition for any OpenAI function-calling compatible framework
const scrapeWebpageTool = {
  name: 'scrape_webpage',
  description: 'Fetch the full content of a webpage as clean markdown. Use for specific URLs that need to be read completely.',
  parameters: {
    type: 'object',
    properties: {
      url: {
        type: 'string',
        description: 'The full URL to scrape',
      },
    },
    required: ['url'],
  },
  execute: async ({ url }) => {
    const result = await client.scrape(url);
    const words = result.markdown.split(' ').slice(0, 6000);
    return `# ${result.title}\n\nSource: ${url}\n\n${words.join(' ')}`;
  },
};
Conclusion
smolagents' CodeAgent pattern is powerful for multi-step research tasks — but it's only as capable as the tools you give it. DuckDuckGoSearchTool returns snippets. KnowledgeSDK tools return full page content, structured extraction, and semantic search over your indexed knowledge base.
The integration takes under 20 lines of Python with the @tool decorator. You get JavaScript rendering, anti-bot handling, and clean markdown output automatically — your agent code never has to think about those details.
The most capable setup combines both: DuckDuckGoSearchTool for discovery, KnowledgeSDK for reading in depth. Your smolagents CodeAgent can write a search, extract the top URLs, scrape them fully, and synthesize an answer from complete page content — all in a single agent run.
Give your smolagents agent full web reading capabilities — start a free KnowledgeSDK trial at knowledgesdk.com.