Use case · March 19, 2026 · 14 min read

Building a Deep Research Agent That Reads the Web (2026)

Build a multi-step research agent using LangChain and KnowledgeSDK that takes a question, scrapes sources, searches semantically, and synthesizes answers with citations.


Research is one of the highest-value tasks you can automate with AI. A researcher asks a question, generates search queries, reads multiple sources, finds the relevant passages, and synthesizes a coherent answer with citations. This process takes a human researcher 30 minutes to several hours. An AI agent with access to live web data can do it in under 60 seconds.

This guide builds a production-ready deep research agent using LangChain and KnowledgeSDK. The agent takes a research question, generates targeted search queries, scrapes the most relevant URLs, searches semantically across all scraped content, and returns a synthesized answer with numbered citations.

How the Agent Works

The research process is a six-step pipeline:

1. Parse question → generate search queries
2. Retrieve URLs for each query (from a URL corpus or search API)
3. Scrape each URL via KnowledgeSDK → clean markdown
4. Index all scraped content
5. Semantic search across indexed content
6. Synthesize answer with citations

Steps 3-5 are where KnowledgeSDK does the heavy lifting. It handles JavaScript rendering, anti-bot measures, and pagination so the agent always gets complete, clean content — not truncated HTML fragments.

Prerequisites

mkdir research-agent && cd research-agent
npm install langchain @langchain/core @langchain/openai @knowledgesdk/node dotenv zod
npm install -D typescript tsx @types/node

Python alternative:

pip install langchain langchain-openai knowledgesdk python-dotenv

.env:

OPENAI_API_KEY=sk-...
KNOWLEDGESDK_API_KEY=sk_ks_your_key
SERPER_API_KEY=...   # Optional: for Google search integration

The Research Agent (Node.js / TypeScript)

// src/researchAgent.ts
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import KnowledgeSDK from "@knowledgesdk/node";
import "dotenv/config";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const llm = new ChatOpenAI({ model: "gpt-4o", temperature: 0.1 });

interface ResearchResult {
  question: string;
  answer: string;
  citations: Citation[];
  sourcesScraped: number;
  searchQueries: string[];
}

interface Citation {
  index: number;
  url: string;
  title: string;
  excerpt: string;
}

// Step 1: Generate targeted search queries from the research question
async function generateSearchQueries(question: string): Promise<string[]> {
  const prompt = PromptTemplate.fromTemplate(
    `You are a research assistant. Given a research question, generate 3-5 specific
search queries that would help find the most relevant information.

Research question: {question}

Generate queries as a JSON array of strings. Each query should target a different
aspect of the question. Queries should be suitable for finding specific web pages.

Respond with ONLY the JSON array, no other text.`
  );

  const chain = prompt.pipe(llm).pipe(new StringOutputParser());
  const result = await chain.invoke({ question });

  try {
    // Models sometimes wrap JSON in markdown fences; strip them before parsing
    const cleaned = result.replace(/```(?:json)?/g, "").trim();
    return JSON.parse(cleaned);
  } catch {
    // Fallback: use the question itself as the single query
    return [question];
  }
}

// Step 2: Get URLs to research (from a predefined corpus or search API)
async function getUrlsForQuery(
  query: string,
  urlCorpus: string[]
): Promise<string[]> {
  // If you have a Serper or SerpAPI key, use it to get real search results
  if (process.env.SERPER_API_KEY) {
    const response = await fetch("https://google.serper.dev/search", {
      method: "POST",
      headers: {
        "X-API-KEY": process.env.SERPER_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ q: query, num: 5 }),
    });
    const data = await response.json();
    return (data.organic ?? []).map((r: any) => r.link).slice(0, 5);
  }

  // Fallback: use predefined URL corpus and let semantic search filter
  return urlCorpus;
}

// Step 3: Scrape URLs and add to knowledge base
async function scrapeUrls(
  urls: string[]
): Promise<Map<string, { title: string; markdown: string }>> {
  const results = new Map<string, { title: string; markdown: string }>();

  // Process in parallel with concurrency limit
  const CONCURRENCY = 3;
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    const batch = urls.slice(i, i + CONCURRENCY);
    const batchResults = await Promise.allSettled(
      batch.map(async (url) => {
        const result = await ks.scrape({ url });
        return { url, title: result.title ?? url, markdown: result.markdown };
      })
    );

    for (const result of batchResults) {
      if (result.status === "fulfilled") {
        const { url, title, markdown } = result.value;
        results.set(url, { title, markdown });
        console.log(`  Scraped: ${title || url} (${markdown.split(" ").length} words)`);
      } else {
        console.warn(`  Failed to scrape: ${result.reason}`);
      }
    }

    // Brief pause between batches
    if (i + CONCURRENCY < urls.length) {
      await new Promise((r) => setTimeout(r, 500));
    }
  }

  return results;
}

// Step 4: Search the indexed content
async function searchContent(
  queries: string[],
  limit: number = 8
): Promise<Array<{ url: string; title: string; content: string; score: number }>> {
  const allResults = new Map<string, { url: string; title: string; content: string; score: number }>();

  for (const query of queries) {
    const results = await ks.search({ query, limit, hybrid: true });

    for (const hit of results.hits) {
      const existing = allResults.get(hit.url);
      // Keep the highest-scoring result per URL across queries
      if (!existing || hit.score > existing.score) {
        allResults.set(hit.url, {
          url: hit.url,
          title: hit.title ?? hit.url,
          content: hit.content,
          score: hit.score,
        });
      }
    }
  }

  // Sort by score and return top results
  return [...allResults.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Step 5: Synthesize answer with citations
async function synthesizeAnswer(
  question: string,
  searchResults: Array<{ url: string; title: string; content: string; score: number }>
): Promise<{ answer: string; citations: Citation[] }> {
  if (searchResults.length === 0) {
    return {
      answer: "I could not find sufficient information to answer this question from the available sources.",
      citations: [],
    };
  }

  // Build numbered context with citations
  const citations: Citation[] = searchResults.map((r, i) => ({
    index: i + 1,
    url: r.url,
    title: r.title,
    excerpt: r.content.slice(0, 300),
  }));

  const context = searchResults
    .map(
      (r, i) =>
        `[${i + 1}] Source: ${r.title}\nURL: ${r.url}\n\n${r.content.slice(0, 2000)}`
    )
    .join("\n\n---\n\n");

  const prompt = PromptTemplate.fromTemplate(
    `You are a research assistant synthesizing information from web sources.

Research question: {question}

Sources:
{context}

Write a comprehensive, well-structured answer to the research question.
- Use inline citations like [1], [2], [3] to reference sources
- Be factual and precise — only state what the sources support
- Organize the answer with clear paragraphs
- If sources contradict each other, note the discrepancy
- End with a brief summary of key findings

Answer:`
  );

  const chain = prompt.pipe(llm).pipe(new StringOutputParser());
  const answer = await chain.invoke({ question, context });

  return { answer, citations };
}

// Main research function
export async function research(
  question: string,
  urlCorpus: string[] = []
): Promise<ResearchResult> {
  console.log(`\nResearching: "${question}"`);
  console.log("─".repeat(60));

  // Step 1: Generate search queries
  console.log("\n1. Generating search queries...");
  const searchQueries = await generateSearchQueries(question);
  console.log(`   Queries: ${searchQueries.join(" | ")}`);

  // Step 2: Get URLs
  console.log("\n2. Gathering URLs to research...");
  const urlSets = await Promise.all(
    searchQueries.map((q) => getUrlsForQuery(q, urlCorpus))
  );
  const uniqueUrls = [...new Set(urlSets.flat())].slice(0, 15);
  console.log(`   Found ${uniqueUrls.length} unique URLs`);

  // Step 3: Scrape URLs
  console.log(`\n3. Scraping ${uniqueUrls.length} URLs...`);
  const scrapedContent = await scrapeUrls(uniqueUrls);
  console.log(`   Successfully scraped: ${scrapedContent.size}/${uniqueUrls.length}`);

  // Step 4: Search indexed content
  console.log("\n4. Searching indexed content...");
  const searchResults = await searchContent(searchQueries);
  console.log(`   Found ${searchResults.length} relevant passages`);

  // Step 5: Synthesize
  console.log("\n5. Synthesizing answer...");
  const { answer, citations } = await synthesizeAnswer(question, searchResults);

  return {
    question,
    answer,
    citations,
    sourcesScraped: scrapedContent.size,
    searchQueries,
  };
}
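Scrapes occasionally fail transiently (timeouts, rate limits). A generic retry wrapper with exponential backoff — a sketch, not part of the KnowledgeSDK client — can harden the scrape step:

```typescript
// Retry any async call with exponential backoff. `attempts` and
// `baseDelayMs` are illustrative defaults, not SDK settings.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Backoff: 500ms, 1s, 2s, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

In scrapeUrls, wrap the call as `withRetry(() => ks.scrape({ url }))`; Promise.allSettled still catches URLs that fail all attempts.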

Python Version (LangChain)

# research_agent.py
import os
import asyncio
import json
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from knowledgesdk import AsyncKnowledgeSDK

ks = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

async def generate_search_queries(question: str) -> list[str]:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Generate 3-5 specific search queries for a research question. "
                   "Return ONLY a JSON array of strings."),
        ("human", "Research question: {question}")
    ])
    chain = prompt | llm | StrOutputParser()
    result = await chain.ainvoke({"question": question})
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        return [question]

async def scrape_url(url: str) -> dict | None:
    try:
        result = await ks.scrape(url)
        return {"url": url, "title": result.title, "markdown": result.markdown}
    except Exception as e:
        print(f"  Failed to scrape {url}: {e}")
        return None

async def scrape_urls_concurrent(
    urls: list[str], concurrency: int = 3
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_scrape(url: str) -> dict | None:
        async with semaphore:
            return await scrape_url(url)

    results = await asyncio.gather(*[bounded_scrape(url) for url in urls])
    return [r for r in results if r is not None]

async def search_content(queries: list[str], limit: int = 8) -> list[dict]:
    all_results = {}

    for query in queries:
        results = await ks.search(query=query, limit=limit, hybrid=True)
        for hit in results.hits:
            if hit.url not in all_results or hit.score > all_results[hit.url]["score"]:
                all_results[hit.url] = {
                    "url": hit.url,
                    "title": hit.title or hit.url,
                    "content": hit.content,
                    "score": hit.score,
                }

    return sorted(all_results.values(), key=lambda x: x["score"], reverse=True)[:limit]

async def synthesize_answer(
    question: str, search_results: list[dict]
) -> tuple[str, list[dict]]:
    if not search_results:
        return "Insufficient information found.", []

    citations = [
        {"index": i + 1, "url": r["url"], "title": r["title"],
         "excerpt": r["content"][:300]}
        for i, r in enumerate(search_results)
    ]

    context = "\n\n---\n\n".join(
        f"[{i+1}] {r['title']}\nURL: {r['url']}\n\n{r['content'][:2000]}"
        for i, r in enumerate(search_results)
    )

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a research assistant. Synthesize information from web sources "
                   "with inline citations like [1], [2]. Be factual and cite sources."),
        ("human", "Research question: {question}\n\nSources:\n{context}\n\nAnswer:"),
    ])

    chain = prompt | llm | StrOutputParser()
    answer = await chain.ainvoke({"question": question, "context": context})
    return answer, citations

async def research(question: str, url_corpus: list[str] | None = None) -> dict:
    print(f"\nResearching: '{question}'")

    queries = await generate_search_queries(question)
    print(f"Queries: {queries}")

    # Scrape the provided corpus (swap in a search API to find URLs at runtime)
    urls_to_scrape = url_corpus[:15] if url_corpus else []

    print(f"Scraping {len(urls_to_scrape)} URLs...")
    scraped = await scrape_urls_concurrent(urls_to_scrape)
    print(f"Scraped: {len(scraped)}")

    print("Searching indexed content...")
    results = await search_content(queries)

    print("Synthesizing answer...")
    answer, citations = await synthesize_answer(question, results)

    return {
        "question": question,
        "answer": answer,
        "citations": citations,
        "sources_scraped": len(scraped),
        "search_queries": queries,
    }

Running the Agent

// src/main.ts
import { research } from "./researchAgent";

async function main() {
  // Research with a predefined URL corpus
  const result = await research(
    "What are the main differences in pricing models between cloud scraping APIs in 2026?",
    [
      "https://firecrawl.dev/pricing",
      "https://apify.com/pricing",
      "https://www.scrapingbee.com/#pricing",
      "https://browserless.io/pricing",
      "https://knowledgesdk.com/pricing",
    ]
  );

  console.log("\n" + "=".repeat(60));
  console.log("RESEARCH RESULTS");
  console.log("=".repeat(60));
  console.log(`\nQuestion: ${result.question}`);
  console.log(`\nSources scraped: ${result.sourcesScraped}`);
  console.log(`\nAnswer:\n${result.answer}`);
  console.log("\nCitations:");
  result.citations.forEach((c) => {
    console.log(`  [${c.index}] ${c.title}\n       ${c.url}`);
    console.log(`       "${c.excerpt}..."`);
  });
}

main().catch(console.error);

Making the Agent More Autonomous

The getUrlsForQuery function above already falls back to the Serper API when SERPER_API_KEY is set. To make the agent fully autonomous, make web search the primary path rather than the fallback:

async function searchWeb(query: string): Promise<string[]> {
  // Using Serper API (Google search results)
  const response = await fetch("https://google.serper.dev/search", {
    method: "POST",
    headers: {
      "X-API-KEY": process.env.SERPER_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ q: query, num: 5 }),
  });

  const data = await response.json();
  return (data.organic ?? []).map((r: any) => r.link);
}

Use searchWeb as the sole URL source in getUrlsForQuery (keeping the corpus as an optional fallback). The agent can then research any question without a pre-configured URL list.
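Search results for related queries overlap heavily, and the same page often appears with different fragments or tracking parameters. A small normalizer (illustrative, not part of the SDK) keeps the scrape list deduplicated:

```typescript
// Normalize URLs (drop fragments and common tracking params, strip
// trailing slashes) before deduplicating, so the same page isn't
// scraped twice under different spellings.
function dedupeUrls(urls: string[], max = 15): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const raw of urls) {
    let key: string;
    try {
      const u = new URL(raw);
      u.hash = "";
      u.searchParams.delete("utm_source");
      u.searchParams.delete("utm_medium");
      key = u.origin + u.pathname.replace(/\/$/, "") + u.search;
    } catch {
      key = raw; // not a valid URL; dedupe on the raw string
    }
    if (!seen.has(key)) {
      seen.add(key);
      out.push(raw);
      if (out.length >= max) break;
    }
  }
  return out;
}
```

Use it in place of the `[...new Set(urlSets.flat())].slice(0, 15)` line in research().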

Adding Memory Between Research Sessions

Build a persistent research memory so the agent doesn't re-scrape URLs it already knows about:

import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

async function isAlreadyScraped(url: string): Promise<boolean> {
  const { rows } = await db.query(
    "SELECT 1 FROM scraped_urls WHERE url = $1 AND scraped_at > NOW() - INTERVAL '7 days'",
    [url]
  );
  return rows.length > 0;
}

async function markAsScraped(url: string): Promise<void> {
  await db.query(
    "INSERT INTO scraped_urls (url, scraped_at) VALUES ($1, NOW()) ON CONFLICT (url) DO UPDATE SET scraped_at = NOW()",
    [url]
  );
}

Check isAlreadyScraped before calling ks.scrape(). Content scraped within the last 7 days is served from KnowledgeSDK's cache anyway, but this saves the API call entirely for content you know is already indexed.
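For single-process runs where Postgres is overkill, the same idea can be sketched as an in-memory TTL map (illustrative; the Postgres version above is the durable equivalent):

```typescript
// Track scrape timestamps in memory and skip URLs fresher than ttlMs.
// The 7-day default mirrors the SQL interval above.
class ScrapeMemory {
  private seen = new Map<string, number>();
  constructor(private ttlMs = 7 * 24 * 60 * 60 * 1000) {}

  // True if the URL was scraped within the TTL window.
  isFresh(url: string, now = Date.now()): boolean {
    const t = this.seen.get(url);
    return t !== undefined && now - t < this.ttlMs;
  }

  // Record a successful scrape.
  mark(url: string, now = Date.now()): void {
    this.seen.set(url, now);
  }
}
```

Filter the URL list with `urls.filter((u) => !memory.isFresh(u))` before scraping, and call `memory.mark(url)` on success.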

Output Formatting

Format the research result for different consumers:

function formatAsMarkdown(result: ResearchResult): string {
  const citationList = result.citations
    .map((c) => `${c.index}. [${c.title}](${c.url})`)
    .join("\n");

  return `# ${result.question}

${result.answer}

## Sources

${citationList}

---
*Research conducted using ${result.sourcesScraped} sources via KnowledgeSDK*
*Queries: ${result.searchQueries.join(", ")}*`;
}

function formatAsSlackMessage(result: ResearchResult): object {
  return {
    blocks: [
      {
        type: "header",
        text: { type: "plain_text", text: "Research Result" },
      },
      {
        type: "section",
        text: { type: "mrkdwn", text: `*Question:* ${result.question}` },
      },
      {
        type: "section",
        text: { type: "mrkdwn", text: result.answer.slice(0, 2000) },
      },
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text: `*Sources:*\n${result.citations.map((c) => `• <${c.url}|${c.title}>`).join("\n")}`,
        },
      },
    ],
  };
}
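One caveat with Slack's <url|label> link syntax: labels containing &, <, or > must be entity-escaped, and a literal | inside the label would read as a second separator. A small escaper (illustrative) guards the citation titles before interpolation:

```typescript
// Escape a string for use as the label in Slack's <url|label> syntax.
// '&', '<', '>' become entities; '|' is swapped for '-' since it would
// otherwise be parsed as a separator.
function escapeSlackLabel(label: string): string {
  return label
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/\|/g, "-");
}
```

In formatAsSlackMessage, write `<${c.url}|${escapeSlackLabel(c.title)}>` instead of interpolating the raw title.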

Performance Benchmarks

On a typical research question requiring 10 URLs:

Step                               Time
Query generation                   ~1s
URL scraping (10 URLs, parallel)   ~8-15s
Semantic search                    <100ms
Answer synthesis                   ~3-5s
Total                              ~12-21s

Compare this to a human researcher spending 20-45 minutes on the same task: the agent is roughly 60-200x faster, always available, and produces consistently structured output.

FAQ

Can the agent scrape behind paywalls? No. KnowledgeSDK scrapes publicly accessible pages. For paywalled content, you'd need to provide authentication cookies, which is only appropriate for content you have permission to access.

How do I prevent the agent from hallucinating? The synthesis prompt explicitly instructs the model to only state what the sources support and to use citations. You can add an additional verification step: after synthesis, extract all factual claims and verify each against the cited source.
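Before doing full claim-by-claim verification, a cheap first pass (illustrative) is to check that every inline citation in the answer actually points at a provided source:

```typescript
// Return citation numbers used in the answer that don't correspond to
// any source index (1..sourceCount) — a quick sanity check before
// deeper verification.
function danglingCitations(answer: string, sourceCount: number): number[] {
  const used = new Set<number>();
  for (const m of answer.matchAll(/\[(\d+)\]/g)) {
    used.add(Number(m[1]));
  }
  return [...used].filter((n) => n < 1 || n > sourceCount).sort((a, b) => a - b);
}
```

If this returns a non-empty list, re-run synthesis or drop the offending sentences.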

Can I run multiple research questions in parallel? Yes, but be mindful of KnowledgeSDK rate limits. Each research run scrapes 5-15 URLs. For parallel runs, use a global rate limiter across runs.

How do I handle research questions where no good URLs are found? Return a graceful "insufficient information" response rather than hallucinating. The synthesis function already handles empty results, but you should also set a minimum threshold on search result quality (e.g., minimum score of 0.5) before attempting synthesis.
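The threshold mentioned above can be sketched as a small guard before synthesis (the 0.5 score and 2-hit minimum are illustrative values, not SDK defaults):

```typescript
interface ScoredHit {
  url: string;
  score: number;
}

// Gate synthesis on evidence quality: require at least `minHits` results
// at or above `minScore` before attempting an answer.
function hasSufficientEvidence(
  hits: ScoredHit[],
  minScore = 0.5,
  minHits = 2
): boolean {
  return hits.filter((h) => h.score >= minScore).length >= minHits;
}
```

Call it on searchResults at the top of synthesizeAnswer and return the "insufficient information" response when it fails.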

Can I use this with Anthropic Claude instead of OpenAI? Yes. Replace ChatOpenAI with ChatAnthropic from @langchain/anthropic. Claude 3.5 Sonnet and Claude Opus 4.6 both excel at synthesis and citation tasks.


Build agents that read the real web. Get your KnowledgeSDK API key at knowledgesdk.com/setup.
