knowledgesdk.com/blog/langchain-web-scraping
Integration · March 19, 2026 · 13 min read

LangChain Web Scraping: Give Your AI Agent Live Web Access (2026)

Build a LangChain agent with live web access using knowledgeSDK. Two approaches: knowledgeSDK as a LangChain tool, and adding semantic search for querying scraped content.


LangChain is the most popular framework for building AI agents, but out of the box, your agent's knowledge is frozen at the LLM's training cutoff. To give it live access to web content, you need a web scraping layer.

This tutorial covers two approaches:

  1. knowledgeSDK as a LangChain tool — your agent can scrape any URL on demand
  2. knowledgeSDK search tool — your agent can query a pre-indexed knowledge base of scraped content

We'll also show a quick Jina Reader integration for lightweight prototyping.

By the end, you'll have a working LangChain agent that can read live websites, answer questions from current content, and stay up to date via webhooks.


Why LangChain Agents Need Web Scraping

LangChain's ReAct and OpenAI Functions agent patterns let your LLM decide which tools to use to answer a question. The problem is that without a web scraping tool, the agent is blind to anything that happened after its training cutoff.

Common scenarios where live web access matters:

  • Documentation Q&A: "How do I configure Stripe's latest webhook retry policy?" — the answer may have changed since GPT-4o's training
  • Competitor monitoring: "What is Competitor X's current pricing for their Pro plan?"
  • News-aware responses: "Summarize the key points from today's Federal Reserve announcement"
  • Code assistance: "What's the latest version of React and what changed in it?"

In each case, your agent needs to fetch and read a live URL, not rely on training data.


Setup

Install Dependencies

# Node.js
npm install langchain @langchain/openai @langchain/core @knowledgesdk/node

# Python
pip install langchain langchain-openai langchain-core knowledgesdk

Environment Variables

OPENAI_API_KEY=sk-...
KNOWLEDGESDK_API_KEY=sk_ks_...

Approach 1: knowledgeSDK as a LangChain Scrape Tool

The simplest integration: give your agent a scrape_url tool that fetches any URL and returns its markdown content. The agent decides when to use it.

Node.js Implementation

import { ChatOpenAI } from '@langchain/openai';
import { AgentExecutor, createOpenAIFunctionsAgent } from 'langchain/agents';
import { DynamicStructuredTool } from '@langchain/core/tools';
import { ChatPromptTemplate, MessagesPlaceholder } from '@langchain/core/prompts';
import { KnowledgeSDK } from '@knowledgesdk/node';
import { z } from 'zod';

const knowledgeClient = new KnowledgeSDK({
  apiKey: process.env.KNOWLEDGESDK_API_KEY,
});

// Define the scrape tool
const scrapeUrlTool = new DynamicStructuredTool({
  name: 'scrape_url',
  description: `Fetches the content of a URL and returns it as clean markdown text.
Use this when you need to read the current content of a specific webpage.
Handles JavaScript-rendered pages, anti-bot protection, and pagination automatically.`,
  schema: z.object({
    url: z.string().url().describe('The URL to scrape'),
  }),
  func: async ({ url }) => {
    try {
      console.log(`[scrape_url] Fetching: ${url}`);
      const result = await knowledgeClient.scrape({ url });
      return result.markdown || 'No content found at this URL.';
    } catch (error) {
      return `Error scraping ${url}: ${error.message}`;
    }
  },
});

// Build the agent
const llm = new ChatOpenAI({
  model: 'gpt-4o',
  temperature: 0,
});

const prompt = ChatPromptTemplate.fromMessages([
  [
    'system',
    `You are a helpful research assistant with access to live web content.
When you need current information from a specific URL, use the scrape_url tool.
Always cite your sources.`,
  ],
  ['human', '{input}'],
  new MessagesPlaceholder('agent_scratchpad'),
]);

const agent = await createOpenAIFunctionsAgent({
  llm,
  tools: [scrapeUrlTool],
  prompt,
});

const executor = new AgentExecutor({
  agent,
  tools: [scrapeUrlTool],
  verbose: true,
});

// Run the agent
const result = await executor.invoke({
  input: 'What are the current rate limits for the Stripe API? Check https://stripe.com/docs/rate-limits',
});

console.log(result.output);

Python Implementation

import os

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.tools import StructuredTool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from knowledgesdk import KnowledgeSDK
from pydantic import BaseModel, Field

knowledge_client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

# Tool input schema
class ScrapeInput(BaseModel):
    url: str = Field(description="The URL to scrape")

def scrape_url(url: str) -> str:
    """Fetch the content of a URL and return it as clean markdown."""
    try:
        print(f"[scrape_url] Fetching: {url}")
        result = knowledge_client.scrape(url=url)
        return result.markdown or "No content found at this URL."
    except Exception as e:
        return f"Error scraping {url}: {str(e)}"

# Define LangChain tool
scrape_tool = StructuredTool.from_function(
    func=scrape_url,
    name="scrape_url",
    description=(
        "Fetches the content of a URL and returns it as clean markdown. "
        "Use when you need to read a specific webpage. "
        "Handles JavaScript-rendered pages and anti-bot protection automatically."
    ),
    args_schema=ScrapeInput,
)

# Build the agent
llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful research assistant with live web access. "
        "Use the scrape_url tool when you need current information from a URL. "
        "Always cite your sources."
    )),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

agent = create_openai_functions_agent(llm=llm, tools=[scrape_tool], prompt=prompt)
executor = AgentExecutor(agent=agent, tools=[scrape_tool], verbose=True)

# Run the agent
result = executor.invoke({
    "input": "What are the current rate limits for the Stripe API? Check https://stripe.com/docs/rate-limits"
})

print(result["output"])

Approach 2: Jina Reader as a Quick Prototyping Tool

If you want to test quickly without an API key, Jina Reader works as a zero-friction alternative:

import requests
from langchain_core.tools import Tool

def scrape_with_jina(url: str) -> str:
    """Quick URL scraper using r.jina.ai (no API key required)."""
    response = requests.get(f"https://r.jina.ai/{url}", timeout=30)
    if response.status_code == 200:
        return response.text[:8000]  # Trim to avoid context overflow
    return f"Failed to fetch {url}: {response.status_code}"

jina_tool = Tool(
    name="scrape_url",
    description=(
        "Fetches a URL and returns markdown content. "
        "Use for reading live webpages."
    ),
    func=scrape_with_jina,
)

When to use Jina Reader instead of knowledgeSDK:

  • Building a quick demo with no budget
  • Testing agent logic before setting up API keys
  • One-off scrapes where you don't need indexing or search

When to switch to knowledgeSDK:

  • You need reliable JS rendering (Jina Reader is inconsistent for SPAs)
  • You need more than ~200 requests/hour
  • You want the scraped content to be searchable

Approach 3: knowledgeSDK Search Tool (The Powerful Pattern)

The scrape-on-demand approach (Approach 1) works, but it has a problem: every agent run scrapes fresh content from scratch. For a knowledge base that you've already indexed, this is wasteful and slow.

A better pattern: pre-index your knowledge base, then give the agent a search tool. The agent searches indexed content in <100ms instead of waiting 2-10 seconds per scrape.

// Node.js: knowledgeSDK search tool
const searchKnowledgeBase = new DynamicStructuredTool({
  name: 'search_knowledge_base',
  description: `Search across all previously indexed web content using semantic + keyword search.
Returns the most relevant content snippets with source URLs.
Use this to find information from sources that have already been scraped (docs, competitor sites, etc).
Do NOT use this for URLs that haven't been indexed — use scrape_url instead.`,
  schema: z.object({
    query: z.string().describe('The search query'),
    limit: z.number().optional().default(5).describe('Number of results to return'),
  }),
  func: async ({ query, limit = 5 }) => {
    const results = await knowledgeClient.search({ query, limit });

    if (results.results.length === 0) {
      return 'No results found in the knowledge base for this query.';
    }

    return results.results
      .map(
        (r, i) =>
          `[Result ${i + 1}] Source: ${r.url}\nTitle: ${r.title}\n\n${r.content}`
      )
      .join('\n\n---\n\n');
  },
});

# Python: knowledgeSDK search tool
class SearchInput(BaseModel):
    query: str = Field(description="The search query")
    limit: int = Field(default=5, description="Number of results to return")

def search_knowledge_base(query: str, limit: int = 5) -> str:
    """Search across all indexed web content."""
    results = knowledge_client.search(query=query, limit=limit)

    if not results.results:
        return "No results found in the knowledge base for this query."

    return "\n\n---\n\n".join(
        f"[Result {i+1}] Source: {r.url}\nTitle: {r.title}\n\n{r.content}"
        for i, r in enumerate(results.results)
    )

search_tool = StructuredTool.from_function(
    func=search_knowledge_base,
    name="search_knowledge_base",
    description=(
        "Search across previously indexed web content using semantic + keyword search. "
        "Returns relevant content snippets with sources. "
        "Use this for content already indexed; use scrape_url for new URLs."
    ),
    args_schema=SearchInput,
)

Combining Both Tools

The most powerful agent pattern gives LangChain both tools and lets it decide which to use:

const agent = await createOpenAIFunctionsAgent({
  llm,
  tools: [scrapeUrlTool, searchKnowledgeBase],
  prompt: ChatPromptTemplate.fromMessages([
    [
      'system',
      `You are a knowledgeable research assistant with two web tools:

1. search_knowledge_base: Fast semantic search over pre-indexed content (<100ms).
   Use this first for topics likely to be in your knowledge base.

2. scrape_url: Fetch any URL in real-time (2-10 seconds).
   Use this when search returns no results, or for specific URLs not in the knowledge base.

Always cite your sources and prefer search_knowledge_base for speed.`,
    ],
    ['human', '{input}'],
    new MessagesPlaceholder('agent_scratchpad'),
  ]),
});

const executor = new AgentExecutor({
  agent,
  tools: [scrapeUrlTool, searchKnowledgeBase],
  verbose: true,
  maxIterations: 5,
});

# Python: combined-tools agent
agent = create_openai_functions_agent(
    llm=llm,
    tools=[scrape_tool, search_tool],
    prompt=ChatPromptTemplate.from_messages([
        ("system", """You are a research assistant with two web tools:

1. search_knowledge_base: Fast semantic search over pre-indexed content (<100ms).
   Use this first for topics likely in your knowledge base.

2. scrape_url: Fetch any URL in real-time (2-10 seconds).
   Use this when search returns no results, or for specific new URLs.

Always cite sources. Prefer search_knowledge_base for speed."""),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ])
)

executor = AgentExecutor(
    agent=agent,
    tools=[scrape_tool, search_tool],
    verbose=True,
    max_iterations=5,
)

Building a Documentation Q&A Agent

Here's a complete end-to-end example: a documentation Q&A agent that indexes multiple API docs and answers questions from their current content.

Step 1: Index Your Documentation Sources

const docSources = [
  'https://stripe.com/docs/api',
  'https://docs.github.com/en/rest',
  'https://developers.notion.com/reference',
];

// Index all docs
console.log('Indexing documentation sources...');
await Promise.all(docSources.map(url => knowledgeClient.scrape({ url })));
console.log('Documentation indexed. Setting up change monitoring...');

// Subscribe to changes — knowledge base updates automatically
await Promise.all(
  docSources.map(url =>
    knowledgeClient.webhooks.subscribe({
      url,
      callbackUrl: 'https://your-app.com/webhooks/docs-updated',
      events: ['content.changed'],
    })
  )
);
console.log('Change monitoring active.');

# Python: index documentation sources
doc_sources = [
    "https://stripe.com/docs/api",
    "https://docs.github.com/en/rest",
    "https://developers.notion.com/reference",
]

# Index all docs
print("Indexing documentation sources...")
for url in doc_sources:
    knowledge_client.scrape(url=url)

# Subscribe to changes
for url in doc_sources:
    knowledge_client.webhooks.subscribe(
        url=url,
        callback_url="https://your-app.com/webhooks/docs-updated",
        events=["content.changed"]
    )
print("Documentation indexed and monitoring active.")
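On the receiving side, your callback endpoint re-scrapes the changed URL so the index stays fresh. The payload fields below (`event`, `url`) are an assumed shape, not a documented schema — check the actual webhook payload in your dashboard. The handler is framework-agnostic: wire it into Flask, FastAPI, or whatever serves your callback URL, and pass `knowledge_client.scrape` as the `rescrape` callable in production.

```python
def handle_docs_updated(payload: dict, rescrape) -> str:
    """Handle a content.changed notification by re-indexing the source.

    The payload field names here are assumptions, not a documented schema.
    `rescrape` stands in for knowledge_client.scrape so the handler is
    easy to unit test without network access.
    """
    if payload.get("event") != "content.changed":
        return "ignored"  # not an event this handler cares about
    url = payload.get("url")
    if not url:
        return "ignored"  # malformed payload; nothing to re-index
    rescrape(url)         # refresh the index for this source
    return f"re-indexed {url}"
```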

Step 2: Build the Q&A Agent

async function askDocBot(question) {
  const result = await executor.invoke({ input: question });
  return result.output;
}

// Example interactions
const examples = [
  'What are Stripe\'s current API rate limits?',
  'How do I authenticate GitHub API requests with a PAT?',
  'What\'s the difference between Notion pages and databases?',
];

for (const question of examples) {
  console.log('\n--- Question:', question);
  const answer = await askDocBot(question);
  console.log('Answer:', answer);
}

# Python: Q&A helper
def ask_doc_bot(question: str) -> str:
    result = executor.invoke({"input": question})
    return result["output"]

# Example interactions
questions = [
    "What are Stripe's current API rate limits?",
    "How do I authenticate GitHub API requests with a PAT?",
    "What's the difference between Notion pages and databases?",
]

for q in questions:
    print(f"\n--- Question: {q}")
    print(f"Answer: {ask_doc_bot(q)}")

Adding Memory to Your Web-Aware Agent

For production agents, you'll want to add conversation memory so the agent can reference previous questions:

import { BufferMemory } from 'langchain/memory';

const memory = new BufferMemory({
  memoryKey: 'chat_history',
  returnMessages: true,
});

const conversationalAgent = await createOpenAIFunctionsAgent({
  llm,
  tools: [scrapeUrlTool, searchKnowledgeBase],
  prompt: ChatPromptTemplate.fromMessages([
    ['system', 'You are a research assistant with live web access.'],
    new MessagesPlaceholder('chat_history'),
    ['human', '{input}'],
    new MessagesPlaceholder('agent_scratchpad'),
  ]),
});

const conversationalExecutor = new AgentExecutor({
  agent: conversationalAgent,
  tools: [scrapeUrlTool, searchKnowledgeBase],
  memory,
});

// Multi-turn conversation
await conversationalExecutor.invoke({ input: 'What is Stripe\'s pricing?' });
await conversationalExecutor.invoke({ input: 'How does that compare to their competitors?' });
await conversationalExecutor.invoke({ input: 'Show me the specific page you found that on' });

Performance Comparison: Scrape-on-Demand vs Search

| Approach | Response time | Cost per query | Freshness | Best for |
| --- | --- | --- | --- | --- |
| Jina Reader (direct) | 3-8s | $0 (rate-limited) | Real-time | Quick prototypes |
| knowledgeSDK scrape | 2-5s | ~$0.003/req | Real-time | Dynamic, one-off URLs |
| knowledgeSDK search | <100ms | ~$0.001/req | Minutes (webhook-refreshed) | Known knowledge base |
| Tavily search | 500ms-2s | ~$0.003/search | Hours-days | Open web search |

For production agents where users expect fast responses, the search-first pattern (check the knowledge base, fall back to scraping) gives the best user experience.
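The search-first pattern can be sketched as a small orchestration helper. This is a minimal illustration rather than knowledgeSDK API: `search_fn` and `scrape_fn` stand in for whatever client calls you use (for example, the search and scrape tools defined earlier).

```python
from typing import Callable, List

def search_first(
    query: str,
    fallback_url: str,
    search_fn: Callable[[str], List[str]],
    scrape_fn: Callable[[str], str],
) -> str:
    """Check the indexed knowledge base first; scrape live only on a miss."""
    snippets = search_fn(query)           # fast path: sub-100ms indexed search
    if snippets:
        return "\n\n---\n\n".join(snippets)
    return scrape_fn(fallback_url)        # slow path: 2-10s live scrape
```

In an agent, the system prompt encodes this same priority; the helper is useful when you want the fallback enforced in code rather than left to the LLM's judgment.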


Common Issues and Solutions

Agent Ignores the Tools

If your agent isn't using the scraping tools, check your system prompt. The agent needs explicit guidance on when to use each tool. Add examples in the system prompt:

When a user asks about current pricing, documentation, or recent events, use search_knowledge_base first.
If search returns no results, use scrape_url with the most relevant URL.

Scraped Content is Too Long for Context Window

knowledgeSDK's /v1/scrape returns full page content. For very long pages, truncate or chunk the output:

func: async ({ url }) => {
  const result = await knowledgeClient.scrape({ url });
  const markdown = result.markdown || '';
  // Trim to ~8K tokens to stay within context window
  return markdown.slice(0, 32000);
},
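The same guard in Python, as a standalone helper. The 32,000-character cap mirrors the ~8K-token budget above (roughly 4 characters per token); cutting at a paragraph break is an extra nicety so the model never sees a sentence chopped mid-word.

```python
def truncate_markdown(markdown: str, max_chars: int = 32_000) -> str:
    """Trim scraped markdown to fit the model's context window.

    ~4 characters per token, so 32K chars is roughly 8K tokens.
    Cuts at the last paragraph break before the limit when one exists.
    """
    if len(markdown) <= max_chars:
        return markdown
    cut = markdown.rfind("\n\n", 0, max_chars)
    return markdown[:cut] if cut > 0 else markdown[:max_chars]
```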

Anti-Bot Errors

If you're scraping sites with aggressive bot detection (Cloudflare Enterprise, Akamai), knowledgeSDK handles most cases automatically. For sites that still block, you may need to add a wait parameter or use session-based scraping.

Rate Limiting in Agent Loops

Agents in a loop can trigger many scraping calls quickly. Add rate limiting to your tool:

import pLimit from 'p-limit';

const limit = pLimit(3); // Max 3 concurrent scrapes

const scrapeUrlTool = new DynamicStructuredTool({
  // ...
  func: ({ url }) => limit(() => knowledgeClient.scrape({ url }).then(r => r.markdown)),
});
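In Python, the same concurrency cap can be enforced with a semaphore. This is a sketch mirroring `pLimit(3)` above; `scrape_fn` stands in for your actual client call.

```python
import threading
from typing import Callable

# Allow at most 3 scrapes in flight at once, mirroring pLimit(3)
_scrape_gate = threading.Semaphore(3)

def rate_limited_scrape(url: str, scrape_fn: Callable[[str], str]) -> str:
    """Run scrape_fn(url), blocking if 3 scrapes are already in flight."""
    with _scrape_gate:
        return scrape_fn(url)
```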

FAQ

Does LangChain have built-in web scraping? LangChain ships a WebBaseLoader that fetches raw HTML, plus a handful of document loaders for specific sites. These are basic compared to a dedicated scraping API: they don't handle JS rendering or anti-bot protection, and they don't return clean markdown. For production use, a dedicated scraping API like knowledgeSDK is the right approach.

What's the best way to handle authentication on scraped pages? For pages behind login, you'd need to use a full browser automation tool like Browserbase to handle the authentication flow first, then pass cookies to your scraping request. knowledgeSDK supports custom headers for bearer token authentication on API endpoints.

How many tools can a LangChain agent use? Technically unlimited, but in practice, more than 5-6 tools degrades agent decision-making quality. For web-aware agents, 2 tools (scrape + search) is the sweet spot.

Can I use knowledgeSDK with LlamaIndex instead of LangChain? Yes. The knowledgeSDK REST API works with any framework. For LlamaIndex, you'd implement a custom BaseTool or QueryEngineTool wrapping the knowledgeSDK client.

How do I prevent the agent from scraping sensitive internal URLs? Add a URL validation step in your tool implementation. Maintain a blocklist of internal domains and reject scraping requests for those URLs before they reach the API.
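A minimal validation gate might look like this. The blocked domains are placeholders; substitute your own internal hostnames.

```python
from urllib.parse import urlparse

# Hypothetical internal domains — replace with your own
BLOCKED_DOMAINS = {"internal.example.com", "intranet.example.com"}

def is_url_allowed(url: str) -> bool:
    """Reject URLs whose host is (or is a subdomain of) a blocked domain."""
    host = (urlparse(url).hostname or "").lower()
    return not any(
        host == blocked or host.endswith("." + blocked)
        for blocked in BLOCKED_DOMAINS
    )
```

Call this at the top of the tool's function and return an error string for disallowed URLs, so the request never reaches the API.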

Does the search tool work across all indexed content or just recent scrapes? knowledgeSDK's search queries across all content ever indexed under your API key. Content persists until you explicitly delete it, so your knowledge base grows over time.


Conclusion

Giving a LangChain agent live web access is straightforward with knowledgeSDK. The two-tool pattern — scrape for new URLs, search for known content — gives you both real-time access and fast retrieval from your pre-built knowledge base.

The key architectural decision is whether you want real-time scraping (slower, always fresh) or indexed search (faster, webhook-refreshed). For most production agents, combining both with a search-first strategy gives the best balance of speed and freshness.

For deeper coverage of the RAG pipeline that feeds this agent, see our web scraping for RAG guide.

Try knowledgeSDK free — get your API key at knowledgesdk.com/setup
