Google ADK Web Scraping: Custom Grounding Beyond Google Search
Google's Agent Development Kit launched in early 2025 and quickly became a serious option for production agent development. It ships with first-party Google Search grounding — your agent can run a search query and get back results from the public web index.
But Google Search grounding has a hard constraint: it only returns content from what Google has indexed, and it returns snippets, not full page content. If you need to:
- Read a specific competitor pricing page end-to-end
- Extract structured data from a JavaScript-rendered product listing
- Access internal documentation behind a corporate domain
- Scrape content from pages that aren't heavily indexed by Google
- Get the full text of a page, not just the snippet Google shows
...then Google Search grounding isn't enough. You need a direct scraping tool.
This tutorial shows how to add KnowledgeSDK as a custom FunctionTool in Google ADK, giving your agent the ability to fetch any URL and read it as clean, LLM-ready markdown.
Google ADK Tool System: A Quick Overview
ADK agents can use three types of tools:
- Built-in tools — framework-provided capabilities such as Google Search and code execution
- FunctionTools — plain Python functions wrapped and registered as tools
- AgentTools — other agents used as sub-agents
FunctionTool is the right choice here. You define a Python function, add a docstring that describes what it does (ADK uses this to decide when to call it), and register it with the agent. The framework handles tool invocation, result passing, and conversation history automatically.
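Under the hood, ADK builds the tool declaration the model sees from the function's signature, type hints, and docstring. A standard-library sketch of what gets inspected (`get_weather` is a made-up stand-in here, not an ADK or KnowledgeSDK API):

```python
import inspect
from typing import get_type_hints


def get_weather(city: str, units: str = "metric") -> dict:
    """Return current weather for a city. Use when the user asks about weather."""
    return {"city": city, "units": units, "temp_c": 21}  # placeholder body


# What an ADK-style framework reads to build the tool declaration:
sig = inspect.signature(get_weather)
params = {name: get_type_hints(get_weather).get(name) for name in sig.parameters}

print(inspect.getdoc(get_weather))  # becomes the tool description the model sees
print(params)                       # becomes the parameter schema
```

This is why the docstring matters so much: it is effectively the prompt that tells the model when to call your tool.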
Limitations of Google Search Grounding
Before building, it's worth being precise about what Google Search grounding does and doesn't give you:
| Capability | Google Search Grounding | KnowledgeSDK FunctionTool |
|---|---|---|
| Search the public web index | Yes | No (scrape specific URLs) |
| Return full page content | No (snippets only) | Yes (full markdown) |
| Handle JavaScript-rendered pages | No | Yes |
| Access specific URLs directly | No | Yes |
| Bypass anti-bot protections | No | Yes |
| Return structured data | No | Yes |
| Pages not indexed by Google | No | Yes |
| Cost per call | Included in ADK | Per API call |
The two tools are complementary. A well-designed ADK agent might use Google Search to find relevant URLs, then KnowledgeSDK to read their full content.
Setup
Install dependencies:
```bash
pip install google-adk knowledgesdk
```
Set environment variables:
```bash
export GOOGLE_API_KEY="your-google-api-key"
export KNOWLEDGESDK_API_KEY="your-knowledgesdk-api-key"
```
Step 1: Define the KnowledgeSDK Tool
Python:
```python
import os

from google.adk.tools import FunctionTool
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


def scrape_url(url: str) -> dict:
    """
    Scrape any URL and return its full content as clean markdown text.

    Use this tool when you need to read the complete content of a specific webpage,
    including JavaScript-rendered pages, product pages, documentation, or competitor sites.
    Returns the page title, markdown content, and basic metadata.

    Args:
        url: The full URL to scrape (must start with http:// or https://)

    Returns:
        A dict with keys: title, markdown, url, word_count
    """
    try:
        result = knowledge_client.scrape(url)
        return {
            "url": url,
            "title": result.title or "Untitled",
            "markdown": result.markdown,
            "word_count": len(result.markdown.split()),
            "success": True,
        }
    except Exception as e:
        return {
            "url": url,
            "success": False,
            "error": str(e),
        }


def extract_structured(url: str) -> dict:
    """
    Extract structured knowledge from a URL. Returns clean markdown AND structured data
    including the page description, key headings, important links, and detected content type.

    Use this when you need both the raw content and structured metadata about a page.

    Args:
        url: The full URL to extract from

    Returns:
        A dict with keys: title, markdown, structured (description, headings, links, category)
    """
    try:
        result = knowledge_client.extract(url, include_markdown=True, include_structured=True)
        return {
            "url": url,
            "title": result.title or "Untitled",
            "markdown": result.markdown,
            "description": result.structured.get("description", ""),
            "headings": result.structured.get("headings", []),
            "category": result.structured.get("category", ""),
            "success": True,
        }
    except Exception as e:
        return {
            "url": url,
            "success": False,
            "error": str(e),
        }


# Wrap as ADK FunctionTools
scrape_tool = FunctionTool(func=scrape_url)
extract_tool = FunctionTool(func=extract_structured)
```
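ADK passes whatever the function returns straight back to the model, so returning an error dict instead of raising keeps failures visible to the agent. You can sanity-check that envelope locally with a stand-in client before paying for real API calls (`StubClient` and `scrape_url_with` are illustrative only, not part of KnowledgeSDK or ADK):

```python
from types import SimpleNamespace


class StubClient:
    """Stand-in for KnowledgeSDK: succeeds for one URL, fails for everything else."""

    def scrape(self, url):
        if url == "https://example.com":
            return SimpleNamespace(title="Example", markdown="Hello world from example.com")
        raise ValueError(f"fetch failed for {url}")


def scrape_url_with(client, url: str) -> dict:
    # Same envelope as scrape_url above, with the client injected for testing.
    try:
        result = client.scrape(url)
        return {
            "url": url,
            "title": result.title or "Untitled",
            "markdown": result.markdown,
            "word_count": len(result.markdown.split()),
            "success": True,
        }
    except Exception as e:
        return {"url": url, "success": False, "error": str(e)}


ok = scrape_url_with(StubClient(), "https://example.com")
bad = scrape_url_with(StubClient(), "https://nope.invalid")
print(ok["success"], ok["word_count"])   # True 4
print(bad["success"], bad["error"])
```

Both the success and failure paths return a dict the model can reason about, which is generally friendlier to an agent loop than an unhandled exception.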
Step 2: Register Tools with the ADK Agent
```python
from google.adk.agents import Agent

# Create the agent with KnowledgeSDK tools
agent = Agent(
    name="web_research_agent",
    model="gemini-2.0-flash",
    description="A research agent that can read any webpage and extract structured information.",
    instruction="""You are a research agent with the ability to read any webpage.

When asked to research a topic or get information from a specific URL:
1. Use scrape_url for simple content reading
2. Use extract_structured when you need both content and metadata
3. Always cite the source URL in your response
4. If a page is very long, summarize the key points relevant to the question

You can read competitor pages, documentation, pricing pages, news articles, and any other web content.""",
    tools=[scrape_tool, extract_tool],
)
```
Step 3: Run Example Queries
Example 1: Read a Competitor Pricing Page
```python
import asyncio

from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai import types

session_service = InMemorySessionService()
runner = Runner(agent=agent, app_name="web_research", session_service=session_service)


async def run_query(query: str) -> str:
    session = await session_service.create_session(
        app_name="web_research",
        user_id="user_001",
    )
    content = types.Content(role="user", parts=[types.Part(text=query)])
    final_response = ""
    async for event in runner.run_async(
        user_id="user_001",
        session_id=session.id,
        new_message=content,
    ):
        if event.is_final_response():
            final_response = event.content.parts[0].text
    return final_response


# Run queries
result = asyncio.run(run_query(
    "Read the pricing page at https://stripe.com/pricing and tell me the transaction fees for standard cards."
))
print(result)
```
Example 2: Compare Documentation Across Multiple URLs
```python
result = asyncio.run(run_query(
    """Compare the authentication approaches described at these two pages:
1. https://docs.example.com/auth/oauth
2. https://docs.example.com/auth/api-keys
Summarize the key differences and when to use each."""
))
print(result)
```
Example 3: Competitive Intelligence
```python
result = asyncio.run(run_query(
    """Read the homepage and features page of https://competitor.com and:
1. List their main product features
2. Identify their target customer segment
3. Note their key differentiators
Return a structured analysis."""
))
print(result)
```
Step 4: Combine with Google Search
For the most powerful setup, give your agent both Google Search (to find URLs) and KnowledgeSDK (to read them fully):
```python
from google.adk.tools import google_search

agent_with_search = Agent(
    name="full_research_agent",
    model="gemini-2.0-flash",
    description="Research agent with both web search and full-page reading capabilities.",
    instruction="""You are a comprehensive research agent.

Workflow for research tasks:
1. Use google_search to find relevant URLs on a topic
2. Use scrape_url or extract_structured to read the full content of the most relevant pages
3. Synthesize information from multiple sources
4. Always cite your sources

Use google_search first to discover URLs, then scrape_url to get complete content.""",
    tools=[
        google_search,  # built-in Google Search
        scrape_tool,    # KnowledgeSDK scrape
        extract_tool,   # KnowledgeSDK extract
    ],
)

# This agent can now: search for a topic, find relevant pages, read them fully, and synthesize
result = asyncio.run(run_query(
    "Research the latest developments in GraphRAG frameworks. Search for recent articles and read the most relevant ones in full."
))
```
Step 5: Add Semantic Search Over Scraped Content
For research-heavy agents, you can pre-index a set of URLs with KnowledgeSDK and add a search tool:
```python
import httpx


def search_knowledge_base(query: str, limit: int = 5) -> dict:
    """
    Search a pre-indexed knowledge base of scraped web content using semantic search.

    Use this when you want to find relevant information across many previously scraped pages.
    More efficient than scraping individual URLs when the knowledge base covers the topic.

    Args:
        query: Natural language search query
        limit: Maximum number of results to return (default 5)

    Returns:
        A dict with search results, each containing url, title, snippet, and relevance score
    """
    response = httpx.post(
        "https://api.knowledgesdk.com/v1/search",
        headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
        json={"query": query, "limit": limit},
        timeout=30,  # avoid hanging the agent loop on a slow network call
    )
    return response.json()
```
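Because this call goes over the network, transient failures are possible mid-conversation. A minimal, library-agnostic retry sketch you could wrap around it (`with_retries` is a hypothetical helper, not part of KnowledgeSDK or ADK):

```python
import time


def with_retries(fn, attempts: int = 3, backoff: float = 0.5):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (2 ** attempt))


# Demo with a function that fails twice before succeeding:
attempts_seen = []

def flaky():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, backoff=0.01))  # ok
```

In the agent, you would call something like `with_retries(lambda: search_knowledge_base(query))` so a single dropped connection doesn't end the tool call.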
```python
search_tool = FunctionTool(func=search_knowledge_base)

# Add to agent
agent_with_search_kb = Agent(
    name="knowledge_agent",
    model="gemini-2.0-flash",
    instruction="""Research agent with access to a knowledge base and live scraping.

First check the knowledge base with search_knowledge_base. If you don't find relevant results,
use scrape_url to read specific URLs directly.""",
    tools=[search_tool, scrape_tool],
)
```
Complete Working Example
Here's a self-contained script you can run immediately:
```python
import asyncio
import os

from google.adk.agents import Agent
from google.adk.tools import FunctionTool
from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai import types
from knowledgesdk import KnowledgeSDK

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


def scrape_url(url: str) -> dict:
    """
    Scrape any URL and return its full content as clean markdown.
    Use this to read specific webpages, documentation, or product pages.

    Args:
        url: The full URL to scrape
    """
    result = knowledge_client.scrape(url)
    return {
        "url": url,
        "title": result.title,
        "content": result.markdown[:5000],  # limit for context window
        "word_count": len(result.markdown.split()),
    }


agent = Agent(
    name="researcher",
    model="gemini-2.0-flash",
    instruction="You are a web research agent. Use scrape_url to read webpages and answer questions based on their content.",
    tools=[FunctionTool(func=scrape_url)],
)


async def main():
    sessions = InMemorySessionService()
    runner = Runner(agent=agent, app_name="researcher", session_service=sessions)
    session = await sessions.create_session(app_name="researcher", user_id="u1")

    question = "Read https://knowledgesdk.com and summarize what the product does."
    content = types.Content(role="user", parts=[types.Part(text=question)])

    async for event in runner.run_async(user_id="u1", session_id=session.id, new_message=content):
        if event.is_final_response():
            print(event.content.parts[0].text)


asyncio.run(main())
```
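One rough edge in the script above: the `result.markdown[:5000]` slice can cut a sentence or markdown block in half. If that matters for your prompts, a small helper can truncate at the last paragraph break instead (`truncate_markdown` is a sketch, not a KnowledgeSDK feature):

```python
def truncate_markdown(text: str, limit: int = 5000) -> str:
    """Truncate to at most `limit` chars, preferring a paragraph boundary."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last blank line inside the kept window, if one exists.
    boundary = cut.rfind("\n\n")
    if boundary > limit // 2:  # only back up if we keep at least half the budget
        cut = cut[:boundary]
    return cut.rstrip() + "\n\n[truncated]"


# Demo: a three-paragraph document truncated to 140 chars drops the tail cleanly.
doc = "Intro paragraph." + "\n\n" + "x" * 120 + "\n\n" + "Tail paragraph that gets dropped."
short = truncate_markdown(doc, limit=140)
print(short.endswith("[truncated]"))  # True
```

The explicit `[truncated]` marker also tells the model that content is missing, so it can say so rather than hallucinate the rest of the page.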
Comparison: ADK Tool Approaches
| Approach | What it returns | JS rendering | Specific URLs | Full content |
|---|---|---|---|---|
| Google Search built-in | Snippets from indexed pages | No | No | No |
| UrlContext tool | Fetched page content | No | Yes | Partial |
| KnowledgeSDK FunctionTool | Clean markdown | Yes | Yes | Yes |
| KnowledgeSDK + Search | Markdown + semantic search | Yes | Yes | Yes |
Conclusion
Google ADK's built-in Search grounding is powerful for public discovery tasks, but it returns snippets from the public index. For agents that need to read specific pages in full — competitor analysis, documentation ingestion, structured extraction from JS-rendered sites — you need a dedicated scraping tool.
Adding KnowledgeSDK as a FunctionTool takes less than 30 lines of Python. Your ADK agent gains the ability to read any URL and return clean, LLM-ready markdown. Combine it with Google Search for a complete research workflow: search to discover, scrape to read in full.
Add live web reading to your Google ADK agent — start a free KnowledgeSDK trial at knowledgesdk.com.