LangGraph Web Scraping: Build a Stateful Web Research Agent
LangGraph has become the dominant agent framework for production AI applications in 2026. With 34.5 million monthly npm and PyPI downloads combined, it has pulled ahead of competing orchestration frameworks because it solves the hardest problem in agent engineering: state management across complex, branching workflows.
Most LangGraph tutorials focus on simple chains or tool-calling agents. This guide is different. We will build a genuine stateful web research agent that:
- Scrapes multiple URLs concurrently using KnowledgeSDK
- Evaluates the confidence of its findings after each scrape
- Routes conditionally to "fetch more sources" when confidence is low
- Checkpoints long-running crawl jobs so they survive restarts
- Synthesizes findings into a structured research report
By the end, you will have a production-ready pattern for any agent that needs to gather and reason over live web data.
Why LangGraph for Web Research?
Before diving into code, let us understand why LangGraph is the right framework for this specific problem.
Web research agents have fundamentally different characteristics from simple chatbots:
- Long-running — scraping 20+ URLs can take several minutes
- Branching — confidence evaluations create dynamic routing decisions
- Resumable — network failures mid-crawl should not restart the entire job
- Concurrent — multiple URLs should be fetched in parallel
- Auditable — you need to trace which sources contributed to which conclusions
LangGraph handles all of this through its graph-based state machine architecture. Each node in the graph is a discrete operation. Edges define control flow. State persists across nodes via a typed schema. Checkpoints capture state at each step so you can resume from any point.
The alternative — a chain of LLM calls or a ReAct loop — cannot handle the branching, concurrency, or resumability requirements cleanly.
Architecture Overview
Our research agent has five nodes:
[START]
    │
    ▼
[plan_research] ← Decides which URLs to scrape based on the query
    │
    ▼
[scrape_sources] ← Fetches URLs concurrently via KnowledgeSDK
    │
    ▼
[evaluate_confidence] ← LLM grades the quality of gathered evidence
    │
    ├── confidence < 0.7 ──► [fetch_more_sources] ← Searches for additional URLs
    │                                │
    │                                └──► back to [scrape_sources]
    │
    └── confidence >= 0.7 ──► [synthesize_report] ← Generates final output
                                     │
                                     ▼
                                   [END]
The conditional edge at evaluate_confidence is the key innovation. Instead of scraping a fixed number of sources and hoping they are sufficient, the agent self-regulates — it keeps fetching until it has enough high-quality evidence to be confident in its answer.
Setup
Install the required packages:
# Python
pip install langgraph langchain-openai knowledgesdk
# Node.js
npm install @langchain/langgraph @langchain/openai @knowledgesdk/node
Set your environment variables:
export KNOWLEDGESDK_API_KEY="knowledgesdk_live_your_key_here"
export OPENAI_API_KEY="sk-your-openai-key"
Python Implementation
Step 1: Define the Agent State
from typing import TypedDict, List, Optional, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
import operator
class ResearchState(TypedDict):
    """The state passed between all nodes in our research graph."""
    query: str
    planned_urls: List[str]
    scraped_pages: Annotated[List[dict], operator.add]  # accumulates across iterations
    confidence_score: float
    iteration_count: int
    max_iterations: int
    final_report: Optional[str]
    sources_used: List[str]
The Annotated[List[dict], operator.add] on scraped_pages is important — it tells LangGraph to append new items rather than replace the entire list when the node runs again in a loop.
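To see what that reducer annotation actually does, here is the merge LangGraph performs for an `operator.add` channel, sketched with plain Python (no LangGraph required — the state values and URLs are made up for illustration):

```python
import operator

# For a key annotated with operator.add, LangGraph combines the node's
# return value with the existing state value via the reducer,
# instead of overwriting it.
existing = [{"url": "https://a.example", "title": "First source"}]
node_output = [{"url": "https://b.example", "title": "Second source"}]

# Equivalent to existing + node_output — pages from the first
# iteration survive when scrape_sources runs again in the loop.
merged = operator.add(existing, node_output)
print(len(merged))
```

A key without a reducer (like `confidence_score`) is simply replaced by whatever the most recent node returns.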
Step 2: Plan Research Node
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
import json
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
async def plan_research(state: ResearchState) -> dict:
    """Generate a list of URLs to scrape for the given query."""
    response = await llm.ainvoke([
        SystemMessage(content="""You are a research planning assistant.
Given a research query, return a JSON object with a 'urls' array containing
5-10 high-quality URLs to scrape. Focus on authoritative sources.
Return only valid JSON, no markdown."""),
        HumanMessage(content=f"Research query: {state['query']}")
    ])
    plan = json.loads(response.content)
    return {
        "planned_urls": plan["urls"],
        "iteration_count": 0,
    }
Step 3: Scrape Sources Node
import asyncio
import os
import knowledgesdk

# Read the API key from the environment rather than hardcoding it
ks_client = knowledgesdk.AsyncClient(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

async def scrape_sources(state: ResearchState) -> dict:
    """Scrape all planned URLs concurrently using KnowledgeSDK."""
    # Get URLs not yet scraped
    already_scraped = {p["url"] for p in state.get("scraped_pages", [])}
    urls_to_scrape = [
        url for url in state["planned_urls"]
        if url not in already_scraped
    ]
    if not urls_to_scrape:
        return {"scraped_pages": []}

    # Scrape concurrently
    tasks = [
        ks_client.scrape(url=url)
        for url in urls_to_scrape
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    pages = []
    for url, result in zip(urls_to_scrape, results):
        if isinstance(result, Exception):
            print(f"Failed to scrape {url}: {result}")
            continue
        pages.append({
            "url": url,
            "markdown": result.markdown,
            "title": result.title,
            "word_count": len(result.markdown.split()),
        })
    print(f"Scraped {len(pages)} pages in this iteration")
    return {"scraped_pages": pages}
Step 4: Evaluate Confidence Node
async def evaluate_confidence(state: ResearchState) -> dict:
    """Grade the quality of gathered evidence on a 0-1 scale."""
    pages = state["scraped_pages"]
    if not pages:
        return {"confidence_score": 0.0}

    # Build a summary of what we have
    evidence_summary = "\n\n".join([
        f"Source: {p['url']}\nTitle: {p['title']}\nExcerpt: {p['markdown'][:500]}..."
        for p in pages[:10]  # Cap at 10 to avoid token overflow
    ])
    response = await llm.ainvoke([
        SystemMessage(content="""You are a research quality evaluator.
Given a research query and gathered evidence, rate confidence on a 0.0-1.0 scale.
Consider: source diversity, information completeness, source authority, recency.
Return JSON: {"score": 0.85, "reasoning": "...", "gaps": ["...", "..."]}"""),
        HumanMessage(content=f"""
Query: {state['query']}

Evidence gathered:
{evidence_summary}

How confident are you that this evidence is sufficient to answer the query thoroughly?""")
    ])
    evaluation = json.loads(response.content)
    print(f"Confidence score: {evaluation['score']:.2f} — {evaluation['reasoning'][:100]}")
    return {
        "confidence_score": evaluation["score"],
        "iteration_count": state["iteration_count"] + 1,
    }
Step 5: Fetch More Sources Node
async def fetch_more_sources(state: ResearchState) -> dict:
    """Search for additional URLs when confidence is low."""
    # Use KnowledgeSDK's semantic search to find related content,
    # or use the LLM to suggest additional search angles
    response = await llm.ainvoke([
        SystemMessage(content="""Suggest 5 additional URLs to search for more information.
Return JSON: {"urls": ["url1", "url2", ...]}
Focus on different source types than already scraped."""),
        HumanMessage(content=f"""
Query: {state['query']}

Already scraped: {[p['url'] for p in state['scraped_pages']]}

What additional URLs would fill the gaps?""")
    ])
    additional = json.loads(response.content)
    current_urls = set(state["planned_urls"])
    new_urls = [u for u in additional["urls"] if u not in current_urls]
    return {
        "planned_urls": state["planned_urls"] + new_urls,
    }
Step 6: Synthesize Report Node
async def synthesize_report(state: ResearchState) -> dict:
    """Generate the final research report from all gathered evidence."""
    pages = state["scraped_pages"]

    # Combine all scraped content
    full_context = "\n\n---\n\n".join([
        f"# {p['title']}\nSource: {p['url']}\n\n{p['markdown'][:2000]}"
        for p in pages
    ])
    response = await llm.ainvoke([
        SystemMessage(content="""You are an expert research analyst.
Synthesize the provided sources into a comprehensive research report.
Include: executive summary, key findings, analysis, and citations.
Format in markdown with proper headings."""),
        HumanMessage(content=f"""
Research Query: {state['query']}

Sources:
{full_context}

Write a comprehensive research report.""")
    ])
    return {
        "final_report": response.content,
        "sources_used": [p["url"] for p in pages],
    }
Step 7: Conditional Routing
def should_fetch_more(state: ResearchState) -> str:
    """Decide whether to fetch more sources or synthesize the report."""
    confidence = state.get("confidence_score", 0.0)
    iteration = state.get("iteration_count", 0)
    max_iter = state.get("max_iterations", 3)
    if confidence >= 0.7 or iteration >= max_iter:
        return "synthesize"
    else:
        return "fetch_more"
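Because the router is a pure function of the state dict, you can sanity-check the routing logic before wiring it into a graph. A quick check (the function is restated here so the snippet is self-contained):

```python
# Restated router — identical logic to should_fetch_more above.
def should_fetch_more(state: dict) -> str:
    confidence = state.get("confidence_score", 0.0)
    iteration = state.get("iteration_count", 0)
    max_iter = state.get("max_iterations", 3)
    if confidence >= 0.7 or iteration >= max_iter:
        return "synthesize"
    return "fetch_more"

# Low confidence with iterations remaining -> keep fetching
assert should_fetch_more({"confidence_score": 0.4, "iteration_count": 1}) == "fetch_more"
# High confidence -> synthesize
assert should_fetch_more({"confidence_score": 0.9, "iteration_count": 1}) == "synthesize"
# Iteration cap reached -> synthesize even when confidence is low
assert should_fetch_more({"confidence_score": 0.2, "iteration_count": 3}) == "synthesize"
```

The iteration cap is what guarantees termination: even if the evaluator never reaches 0.7, the loop exits after `max_iterations` passes.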
Step 8: Assemble the Graph
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
# Build the graph
builder = StateGraph(ResearchState)
# Add nodes
builder.add_node("plan_research", plan_research)
builder.add_node("scrape_sources", scrape_sources)
builder.add_node("evaluate_confidence", evaluate_confidence)
builder.add_node("fetch_more_sources", fetch_more_sources)
builder.add_node("synthesize_report", synthesize_report)
# Add edges
builder.add_edge(START, "plan_research")
builder.add_edge("plan_research", "scrape_sources")
builder.add_edge("scrape_sources", "evaluate_confidence")
builder.add_conditional_edges(
    "evaluate_confidence",
    should_fetch_more,
    {
        "fetch_more": "fetch_more_sources",
        "synthesize": "synthesize_report",
    }
)
builder.add_edge("fetch_more_sources", "scrape_sources")
builder.add_edge("synthesize_report", END)
# Add checkpointing for resumability
checkpointer = MemorySaver()
graph = builder.compile(checkpointer=checkpointer)
Step 9: Run the Agent
import asyncio
async def run_research_agent(query: str, thread_id: str = "default"):
    """Run the research agent with checkpointing."""
    config = {
        "configurable": {"thread_id": thread_id},
        "recursion_limit": 25,  # Prevent infinite loops
    }
    initial_state = {
        "query": query,
        "planned_urls": [],
        "scraped_pages": [],
        "confidence_score": 0.0,
        "iteration_count": 0,
        "max_iterations": 3,
        "final_report": None,
        "sources_used": [],
    }
    print(f"Starting research agent for: {query}")
    async for event in graph.astream(initial_state, config=config):
        node_name = list(event.keys())[0]
        print(f"Completed node: {node_name}")

    # Get final state
    final_state = await graph.aget_state(config)
    return {
        "report": final_state.values["final_report"],
        "sources": final_state.values["sources_used"],
        "iterations": final_state.values["iteration_count"],
        "confidence": final_state.values["confidence_score"],
    }

# Run it
result = asyncio.run(run_research_agent(
    query="What are the key differences between LangGraph and AutoGen for production agent deployment in 2026?"
))
print(result["report"])
print(f"\nSources used: {len(result['sources'])}")
print(f"Iterations needed: {result['iterations']}")
print(f"Final confidence: {result['confidence']:.2f}")
Node.js Implementation
import { StateGraph, Annotation, START, END, MemorySaver } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import KnowledgeSDK from "@knowledgesdk/node";
// State definition
const ResearchAnnotation = Annotation.Root({
  query: Annotation<string>(),
  plannedUrls: Annotation<string[]>({
    reducer: (a, b) => [...new Set([...a, ...b])],
    default: () => [], // safe starting value if the key is omitted from the input
  }),
  scrapedPages: Annotation<Array<{ url: string; markdown: string; title: string }>>({
    reducer: (a, b) => [...a, ...b],
    default: () => [],
  }),
  confidenceScore: Annotation<number>(),
  iterationCount: Annotation<number>(),
  maxIterations: Annotation<number>(),
  finalReport: Annotation<string | null>(),
  sourcesUsed: Annotation<string[]>(),
});
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const ksClient = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
// Plan research node
async function planResearch(state: typeof ResearchAnnotation.State) {
  const response = await llm.invoke([
    { role: "system", content: "Return JSON with 'urls' array of 5-10 URLs to research. No markdown." },
    { role: "user", content: `Research query: ${state.query}` },
  ]);
  const plan = JSON.parse(response.content as string);
  return { plannedUrls: plan.urls, iterationCount: 0 };
}
// Scrape sources node
async function scrapeSources(state: typeof ResearchAnnotation.State) {
  const alreadyScraped = new Set(state.scrapedPages.map((p) => p.url));
  const urlsToScrape = state.plannedUrls.filter((u) => !alreadyScraped.has(u));
  if (urlsToScrape.length === 0) return { scrapedPages: [] };

  const results = await Promise.allSettled(
    urlsToScrape.map((url) => ksClient.scrape({ url }))
  );
  const pages = results
    .map((result, i) => {
      if (result.status === "rejected") return null;
      return {
        url: urlsToScrape[i],
        markdown: result.value.markdown,
        title: result.value.title || urlsToScrape[i],
      };
    })
    .filter(Boolean) as Array<{ url: string; markdown: string; title: string }>;
  console.log(`Scraped ${pages.length} pages`);
  return { scrapedPages: pages };
}
// Evaluate confidence node
async function evaluateConfidence(state: typeof ResearchAnnotation.State) {
  const evidenceSummary = state.scrapedPages
    .slice(0, 10)
    .map((p) => `Source: ${p.url}\nTitle: ${p.title}\n${p.markdown.slice(0, 500)}...`)
    .join("\n\n---\n\n");
  const response = await llm.invoke([
    {
      role: "system",
      content: 'Return JSON: {"score": 0.85, "reasoning": "..."} Score 0-1 for research quality.',
    },
    {
      role: "user",
      content: `Query: ${state.query}\n\nEvidence:\n${evidenceSummary}\n\nConfidence score?`,
    },
  ]);
  const evaluation = JSON.parse(response.content as string);
  console.log(`Confidence: ${evaluation.score.toFixed(2)}`);
  return {
    confidenceScore: evaluation.score,
    iterationCount: state.iterationCount + 1,
  };
}
// Conditional routing
function shouldFetchMore(state: typeof ResearchAnnotation.State): string {
  if (state.confidenceScore >= 0.7 || state.iterationCount >= state.maxIterations) {
    return "synthesize";
  }
  return "fetch_more";
}
// Fetch more sources node
async function fetchMoreSources(state: typeof ResearchAnnotation.State) {
  const response = await llm.invoke([
    { role: "system", content: 'Return JSON: {"urls": ["url1", ...]} with 5 additional URLs.' },
    {
      role: "user",
      content: `Query: ${state.query}\nAlready scraped: ${state.scrapedPages.map((p) => p.url).join(", ")}\nSuggest additional URLs.`,
    },
  ]);
  const additional = JSON.parse(response.content as string);
  // The plannedUrls reducer de-duplicates, so the raw suggestions can be returned as-is
  return { plannedUrls: additional.urls };
}
// Synthesize report node
async function synthesizeReport(state: typeof ResearchAnnotation.State) {
  const fullContext = state.scrapedPages
    .map((p) => `# ${p.title}\nSource: ${p.url}\n\n${p.markdown.slice(0, 2000)}`)
    .join("\n\n---\n\n");
  const response = await llm.invoke([
    {
      role: "system",
      content: "Write a comprehensive research report in markdown with citations.",
    },
    { role: "user", content: `Query: ${state.query}\n\nSources:\n${fullContext}` },
  ]);
  return {
    finalReport: response.content as string,
    sourcesUsed: state.scrapedPages.map((p) => p.url),
  };
}
// Build and compile the graph
const builder = new StateGraph(ResearchAnnotation)
  .addNode("plan_research", planResearch)
  .addNode("scrape_sources", scrapeSources)
  .addNode("evaluate_confidence", evaluateConfidence)
  .addNode("fetch_more_sources", fetchMoreSources)
  .addNode("synthesize_report", synthesizeReport)
  .addEdge(START, "plan_research")
  .addEdge("plan_research", "scrape_sources")
  .addEdge("scrape_sources", "evaluate_confidence")
  .addConditionalEdges("evaluate_confidence", shouldFetchMore, {
    fetch_more: "fetch_more_sources",
    synthesize: "synthesize_report",
  })
  .addEdge("fetch_more_sources", "scrape_sources")
  .addEdge("synthesize_report", END);

const checkpointer = new MemorySaver();
const graph = builder.compile({ checkpointer });
// Run the agent
async function runResearchAgent(query: string, threadId = "default") {
  const config = { configurable: { thread_id: threadId } };
  const initialState = {
    query,
    plannedUrls: [],
    scrapedPages: [],
    confidenceScore: 0,
    iterationCount: 0,
    maxIterations: 3,
    finalReport: null,
    sourcesUsed: [],
  };
  for await (const event of await graph.stream(initialState, config)) {
    const nodeName = Object.keys(event)[0];
    console.log(`Completed: ${nodeName}`);
  }
  const finalState = await graph.getState(config);
  return {
    report: finalState.values.finalReport,
    sources: finalState.values.sourcesUsed,
    iterations: finalState.values.iterationCount,
    confidence: finalState.values.confidenceScore,
  };
}
// Usage
const result = await runResearchAgent(
  "What are the key trends in AI agent frameworks for 2026?"
);
console.log(result.report);
Checkpointing for Long-Running Crawls
One of LangGraph's killer features for web research is checkpointing. If your agent is halfway through scraping 50 URLs and a network error occurs, you want to resume — not restart.
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

# Use PostgreSQL for persistent checkpointing in production.
# The async graph API needs the async saver (AsyncPostgresSaver),
# not the sync PostgresSaver.
async with AsyncPostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost/db"
) as checkpointer:
    await checkpointer.setup()  # create checkpoint tables on first use
    graph = builder.compile(checkpointer=checkpointer)

    # Run with a specific thread_id so the job can be resumed later
    config = {"configurable": {"thread_id": "research-job-abc123"}}

    # First run: pass the initial state
    async for event in graph.astream(initial_state, config=config):
        print(event)

    # If the job was interrupted, resume from the last checkpoint by
    # passing None as the input with the same thread_id:
    # async for event in graph.astream(None, config=config):
    #     print(event)

    # You can also inspect the state at any point
    state = await graph.aget_state(config)
    print(f"Pages scraped so far: {len(state.values['scraped_pages'])}")
Production Considerations
Rate Limiting
KnowledgeSDK handles anti-bot and rate limiting server-side, but you should still add concurrency limits for very large jobs:
from asyncio import Semaphore

async def scrape_sources_with_limit(state: ResearchState, max_concurrent: int = 5) -> dict:
    sem = Semaphore(max_concurrent)

    async def scrape_one(url: str):
        async with sem:
            return await ks_client.scrape(url=url)

    tasks = [scrape_one(url) for url in state["planned_urls"]]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # ... process results
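You can verify the semaphore cap in isolation with plain asyncio before trusting it with real traffic. This toy harness (with a hypothetical `fake_scrape` standing in for the real client call) counts how many tasks run at once and confirms the peak never exceeds the limit:

```python
import asyncio

async def main(n_tasks: int = 20, max_concurrent: int = 5) -> int:
    sem = asyncio.Semaphore(max_concurrent)
    active = 0   # tasks currently inside the semaphore
    peak = 0     # highest concurrency observed

    async def fake_scrape(i: int):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for network I/O
            active -= 1

    await asyncio.gather(*(fake_scrape(i) for i in range(n_tasks)))
    return peak

peak = asyncio.run(main())
print(f"Peak concurrency: {peak}")
```

The semaphore guarantees `peak <= max_concurrent` regardless of how many tasks are scheduled, which is exactly the property you want when fanning out dozens of scrape calls.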
Streaming Progress to Users
async for event in graph.astream_events(initial_state, config=config, version="v2"):
    if event["event"] == "on_chain_end":
        node = event["name"]
        if node == "scrape_sources":
            pages = event["data"]["output"]["scraped_pages"]
            yield f"Scraped {len(pages)} new pages\n"
        elif node == "evaluate_confidence":
            score = event["data"]["output"]["confidence_score"]
            yield f"Confidence score: {score:.0%}\n"
Conclusion
LangGraph's graph-based state machine is the right architecture for web research agents because it handles the branching logic, state accumulation, and checkpointing that make production research workflows reliable.
The combination of LangGraph's orchestration and KnowledgeSDK's extraction API gives you a clean separation of concerns: LangGraph manages the agent's decision-making, KnowledgeSDK handles the web data collection. Neither needs to know the implementation details of the other.
The pattern shown here — scrape, evaluate, conditionally fetch more — works for any domain where you need high-confidence research: competitor analysis, due diligence, market research, academic literature review, or customer intelligence gathering.
Start building your LangGraph research agent today. Sign up for KnowledgeSDK and get 1,000 free API requests per month. The scraping layer is handled — focus on the agent logic.