integration · March 20, 2026 · 13 min read

Google ADK Web Scraping: Custom Grounding Beyond Google Search

Google ADK's built-in search only covers the public index. Add KnowledgeSDK as a custom FunctionTool to scrape any URL — competitor pages, docs, paywalled content.


Google's Agent Development Kit launched in early 2025 and quickly became a serious option for production agent development. It ships with first-party Google Search grounding — your agent can run a search query and get back results from the public web index.

But Google Search grounding has a hard constraint: it only returns content from what Google has indexed, and it returns snippets, not full page content. If you need to:

  • Read a specific competitor pricing page end-to-end
  • Extract structured data from a JavaScript-rendered product listing
  • Access internal documentation behind a corporate domain
  • Scrape content from pages that aren't heavily indexed by Google
  • Get the full text of a page, not just the snippet Google shows

...then Google Search grounding isn't enough. You need a direct scraping tool.

This tutorial shows how to add KnowledgeSDK as a custom FunctionTool in Google ADK, giving your agent the ability to fetch any URL and read it as clean, LLM-ready markdown.


Google ADK Tool System: A Quick Overview

ADK agents can use three types of tools:

  1. Built-in tools — Google Search, code execution, file system access
  2. FunctionTools — Python functions decorated and registered as tools
  3. AgentTools — other agents wrapped so they can be called as tools

FunctionTool is the right choice here. You define a Python function, add a docstring that describes what it does (ADK uses this to decide when to call it), and register it with the agent. The framework handles tool invocation, result passing, and conversation history automatically.
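The docstring-and-signature contract is visible with plain Python introspection. This is a sketch of the convention, not ADK internals — a hypothetical `get_weather` function stands in for a real tool:

```python
import inspect

def get_weather(city: str, units: str = "metric") -> dict:
    """Look up the current weather for a city."""
    return {"city": city, "units": units}

# Roughly what a framework like ADK can read when registering the function:
sig = inspect.signature(get_weather)
description = inspect.getdoc(get_weather)  # the docstring becomes the tool description
params = {name: p.annotation.__name__ for name, p in sig.parameters.items()}

print(description)  # "Look up the current weather for a city."
print(params)       # {'city': 'str', 'units': 'str'} becomes the parameter schema
```

This is why the docstrings in the tools below are written for the model, not for human readers: they are the only description the LLM sees when deciding whether to call the tool.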


Limitations of Google Search Grounding

Before building, it's worth being precise about what Google Search grounding does and doesn't give you:

| Capability | Google Search Grounding | KnowledgeSDK FunctionTool |
| --- | --- | --- |
| Search the public web index | Yes | No (scrapes specific URLs) |
| Return full page content | No (snippets only) | Yes (full markdown) |
| Handle JavaScript-rendered pages | No | Yes |
| Access specific URLs directly | No | Yes |
| Bypass anti-bot protections | No | Yes |
| Return structured data | No | Yes |
| Pages not indexed by Google | No | Yes |
| Cost per call | Included in ADK | Per API call |

The two tools are complementary. A well-designed ADK agent might use Google Search to find relevant URLs, then KnowledgeSDK to read their full content.


Setup

Install dependencies:

pip install google-adk knowledgesdk

Set environment variables:

export GOOGLE_API_KEY="your-google-api-key"
export KNOWLEDGESDK_API_KEY="your-knowledgesdk-api-key"
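Both keys are read at import time in the snippets below, so a quick sanity check saves a confusing stack trace later. A small sketch:

```python
import os

def check_env(*names: str) -> list[str]:
    """Return the names of any required environment variables that are missing or empty."""
    return [n for n in names if not os.environ.get(n)]

missing = check_env("GOOGLE_API_KEY", "KNOWLEDGESDK_API_KEY")
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```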

Step 1: Define the KnowledgeSDK Tool

Python:

import os
from google.adk.tools import FunctionTool
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_url(url: str) -> dict:
    """
    Scrape any URL and return its full content as clean markdown text.
    Use this tool when you need to read the complete content of a specific webpage,
    including JavaScript-rendered pages, product pages, documentation, or competitor sites.
    Returns the page title, markdown content, and basic metadata.

    Args:
        url: The full URL to scrape (must start with http:// or https://)

    Returns:
        A dict with keys: title, markdown, url, word_count
    """
    try:
        result = knowledge_client.scrape(url)
        return {
            "url": url,
            "title": result.title or "Untitled",
            "markdown": result.markdown,
            "word_count": len(result.markdown.split()),
            "success": True,
        }
    except Exception as e:
        return {
            "url": url,
            "success": False,
            "error": str(e),
        }

def extract_structured(url: str) -> dict:
    """
    Extract structured knowledge from a URL. Returns clean markdown AND structured data
    including the page description, key headings, important links, and detected content type.
    Use this when you need both the raw content and structured metadata about a page.

    Args:
        url: The full URL to extract from

    Returns:
        A dict with keys: title, markdown, structured (description, headings, links, category)
    """
    try:
        result = knowledge_client.extract(url, include_markdown=True, include_structured=True)
        return {
            "url": url,
            "title": result.title or "Untitled",
            "markdown": result.markdown,
            "description": result.structured.get("description", ""),
            "headings": result.structured.get("headings", []),
            "category": result.structured.get("category", ""),
            "success": True,
        }
    except Exception as e:
        return {
            "url": url,
            "success": False,
            "error": str(e),
        }

# Wrap as ADK FunctionTools
scrape_tool = FunctionTool(func=scrape_url)
extract_tool = FunctionTool(func=extract_structured)
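These tools accept whatever URL the model produces, so it is worth validating before spending an API call. A hedged sketch of a guard you could call at the top of `scrape_url` — the scheme/host check is plain stdlib, not part of KnowledgeSDK:

```python
from urllib.parse import urlparse

def is_scrapable(url: str) -> bool:
    """Cheap pre-flight check: require an http(s) scheme and a hostname."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Reject obvious garbage before calling the API:
assert is_scrapable("https://stripe.com/pricing")
assert not is_scrapable("stripe.com/pricing")   # missing scheme
assert not is_scrapable("ftp://example.com")    # unsupported scheme
```

Returning an error dict for an invalid URL (as `scrape_url` already does for exceptions) lets the model correct itself instead of crashing the turn.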

Step 2: Register Tools with the ADK Agent

from google.adk.agents import Agent

# Create the agent with KnowledgeSDK tools
agent = Agent(
    name="web_research_agent",
    model="gemini-2.0-flash",
    description="A research agent that can read any webpage and extract structured information.",
    instruction="""You are a research agent with the ability to read any webpage.

When asked to research a topic or get information from a specific URL:
1. Use scrape_url for simple content reading
2. Use extract_structured when you need both content and metadata
3. Always cite the source URL in your response
4. If a page is very long, summarize the key points relevant to the question

You can read competitor pages, documentation, pricing pages, news articles, and any other web content.""",
    tools=[scrape_tool, extract_tool],
)

Step 3: Run Example Queries

Example 1: Read a Competitor Pricing Page

from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai import types

session_service = InMemorySessionService()
runner = Runner(agent=agent, app_name="web_research", session_service=session_service)

async def run_query(query: str) -> str:
    session = await session_service.create_session(
        app_name="web_research",
        user_id="user_001",
    )

    content = types.Content(role="user", parts=[types.Part(text=query)])

    final_response = ""
    async for event in runner.run_async(
        user_id="user_001",
        session_id=session.id,
        new_message=content,
    ):
        if event.is_final_response():
            final_response = event.content.parts[0].text

    return final_response

# Run queries
import asyncio

result = asyncio.run(run_query(
    "Read the pricing page at https://stripe.com/pricing and tell me the transaction fees for standard cards."
))
print(result)
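Scrape-backed queries can fail transiently (rate limits, slow pages). A small retry wrapper around any async query function, ADK-independent, with exponential backoff as an illustrative choice:

```python
import asyncio

async def with_retries(coro_fn, *args, attempts: int = 3, base_delay: float = 1.0):
    """Run an async callable, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the real error
            await asyncio.sleep(base_delay * (2 ** attempt))

# Usage with the run_query helper defined above:
# result = asyncio.run(with_retries(run_query, "Read https://example.com and summarize it."))
```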

Example 2: Compare Documentation Across Multiple URLs

result = asyncio.run(run_query(
    """Compare the authentication approaches described at these two pages:
    1. https://docs.example.com/auth/oauth
    2. https://docs.example.com/auth/api-keys

    Summarize the key differences and when to use each."""
))
print(result)

Example 3: Competitive Intelligence

result = asyncio.run(run_query(
    """Read the homepage and features page of https://competitor.com and:
    1. List their main product features
    2. Identify their target customer segment
    3. Note their key differentiators
    Return a structured analysis."""
))
print(result)

Step 4: Combine with Google Search

For the most powerful setup, give your agent both Google Search (to find URLs) and KnowledgeSDK (to read them fully). One caveat: some ADK versions allow only one built-in tool per agent and do not let built-in tools share an agent with FunctionTools; if you hit that restriction, put google_search in its own agent and expose it via AgentTool alongside the KnowledgeSDK tools.

from google.adk.tools import google_search

agent_with_search = Agent(
    name="full_research_agent",
    model="gemini-2.0-flash",
    description="Research agent with both web search and full-page reading capabilities.",
    instruction="""You are a comprehensive research agent.

Workflow for research tasks:
1. Use google_search to find relevant URLs on a topic
2. Use scrape_url or extract_structured to read the full content of the most relevant pages
3. Synthesize information from multiple sources
4. Always cite your sources

Use google_search first to discover URLs, then scrape_url to get complete content.""",
    tools=[
        google_search,  # built-in Google Search
        scrape_tool,    # KnowledgeSDK scrape
        extract_tool,   # KnowledgeSDK extract
    ],
)

# This agent can now: search for a topic, find relevant pages, read them fully, and synthesize
result = asyncio.run(run_query(
    "Research the latest developments in GraphRAG frameworks. Search for recent articles and read the most relevant ones in full."
))

Step 5: Add Semantic Search Over Scraped Content

For research-heavy agents, you can pre-index a set of URLs with KnowledgeSDK and add a search tool:

import httpx

def search_knowledge_base(query: str, limit: int = 5) -> dict:
    """
    Search a pre-indexed knowledge base of scraped web content using semantic search.
    Use this when you want to find relevant information across many previously scraped pages.
    More efficient than scraping individual URLs when the knowledge base covers the topic.

    Args:
        query: Natural language search query
        limit: Maximum number of results to return (default 5)

    Returns:
        A dict with search results, each containing url, title, snippet, and relevance score
    """
    response = httpx.post(
        "https://api.knowledgesdk.com/v1/search",
        headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
        json={"query": query, "limit": limit},
        timeout=30.0,
    )
    response.raise_for_status()  # surface HTTP errors instead of returning an error body as JSON
    return response.json()

search_tool = FunctionTool(func=search_knowledge_base)

# Add to agent
agent_with_search_kb = Agent(
    name="knowledge_agent",
    model="gemini-2.0-flash",
    instruction="""Research agent with access to a knowledge base and live scraping.

First check the knowledge base with search_knowledge_base. If you don't find relevant results,
use scrape_url to read specific URLs directly.""",
    tools=[search_tool, scrape_tool],
)
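The raw dict from search_knowledge_base is awkward for the model to quote. A sketch that flattens it into markdown the agent can cite — it assumes a `results` list of `{url, title, snippet, score}` objects, which is an assumption about the response shape, not documented behavior:

```python
def format_search_results(payload: dict) -> str:
    """Render search results as a markdown bullet list for the agent.

    Assumes payload["results"] is a list of dicts with url, title, snippet, score
    (an assumed response shape, not documented KnowledgeSDK behavior).
    """
    lines = []
    for r in payload.get("results", []):
        lines.append(
            f"- [{r.get('title', 'Untitled')}]({r.get('url', '')}) "
            f"(score {r.get('score', 0):.2f}): {r.get('snippet', '')}"
        )
    return "\n".join(lines) or "No results found."
```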

Complete Working Example

Here's a self-contained script you can run immediately:

import os
import asyncio
from google.adk.agents import Agent
from google.adk.tools import FunctionTool
from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai import types
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_url(url: str) -> dict:
    """
    Scrape any URL and return its full content as clean markdown.
    Use this to read specific webpages, documentation, or product pages.

    Args:
        url: The full URL to scrape
    """
    result = knowledge_client.scrape(url)
    return {
        "url": url,
        "title": result.title,
        "content": result.markdown[:5000],  # limit for context window
        "word_count": len(result.markdown.split()),
    }

agent = Agent(
    name="researcher",
    model="gemini-2.0-flash",
    instruction="You are a web research agent. Use scrape_url to read webpages and answer questions based on their content.",
    tools=[FunctionTool(func=scrape_url)],
)

async def main():
    sessions = InMemorySessionService()
    runner = Runner(agent=agent, app_name="researcher", session_service=sessions)
    session = await sessions.create_session(app_name="researcher", user_id="u1")

    question = "Read https://knowledgesdk.com and summarize what the product does."
    content = types.Content(role="user", parts=[types.Part(text=question)])

    async for event in runner.run_async(user_id="u1", session_id=session.id, new_message=content):
        if event.is_final_response():
            print(event.content.parts[0].text)

asyncio.run(main())
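The `result.markdown[:5000]` slice in the script can cut a word (or a markdown link) in half. A slightly gentler truncation helper — a sketch; 5,000 characters is an arbitrary context budget, not an ADK limit:

```python
def truncate_at_word(text: str, limit: int = 5000) -> str:
    """Truncate text to at most roughly `limit` characters, breaking on whitespace."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space so we never emit half a word
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut + " …[truncated]"
```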

Comparison: ADK Tool Approaches

| Approach | What it returns | JS rendering | Specific URLs | Full content |
| --- | --- | --- | --- | --- |
| Google Search built-in | Snippets from indexed pages | No | No | No |
| UrlContext tool | Fetched page content | No | Yes | Partial |
| KnowledgeSDK FunctionTool | Clean markdown | Yes | Yes | Yes |
| KnowledgeSDK + Search | Markdown + semantic search | Yes | Yes | Yes |

Conclusion

Google ADK's built-in Search grounding is powerful for public discovery tasks, but it returns snippets from the public index. For agents that need to read specific pages in full — competitor analysis, documentation ingestion, structured extraction from JS-rendered sites — you need a dedicated scraping tool.

Adding KnowledgeSDK as a FunctionTool takes less than 30 lines of Python. Your ADK agent gains the ability to read any URL and return clean, LLM-ready markdown. Combine it with Google Search for a complete research workflow: search to discover, scrape to read in full.

Add live web reading to your Google ADK agent — start a free KnowledgeSDK trial at knowledgesdk.com.
