Tutorial · March 20, 2026 · 16 min read

GraphRAG + Web Scraping: Extract Entities and Build Knowledge Graphs from Any Website

Build a GraphRAG pipeline with KnowledgeSDK: scrape any website to clean markdown, extract entities with Claude or GPT-4o, and load into Neo4j or LightRAG.

Most RAG pipelines treat every document as a bag of chunks. You split text into 512-token windows, embed them, and retrieve the top-k nearest neighbors at query time. That works for simple Q&A. It breaks down for questions that span multiple entities and relationships: "Which companies acquired competitors in the infrastructure space in the last 12 months?" or "What are the dependencies between modules A, B, and C in this codebase?"

GraphRAG solves this by adding a layer of structured knowledge — entities, relationships, and communities — on top of the raw text. Microsoft's GraphRAG paper showed that graph-augmented retrieval dramatically outperforms naive RAG on multi-hop questions. In 2026, three tools dominate this space: Microsoft GraphRAG, LightRAG, and Neo4j's LLM Graph Builder.

The missing piece in most tutorials is the ingestion layer. Where does the text come from? If you're building a GraphRAG pipeline over web content — documentation sites, competitor pages, news, product pages — you need to scrape it first. And the quality of your knowledge graph depends directly on the quality of your text input.

This tutorial builds a complete pipeline:

  1. Scrape any website with KnowledgeSDK → clean LLM-ready markdown
  2. Extract entities and relationships using Claude or GPT-4o with a Pydantic schema
  3. Load into Neo4j or use LightRAG for graph-based retrieval

Why Web Scraping Quality Matters for GraphRAG

Entity extraction is LLM-heavy. You're sending every scraped page through Claude or GPT-4o to pull out structured data. If the input text is noisy — full of navigation HTML, cookie banners, script tags, and boilerplate — the LLM wastes tokens on garbage and misses real entities.

KnowledgeSDK handles JavaScript rendering, anti-bot protection, and converts any page to clean markdown before the text reaches your entity extractor. The difference in practice:

Input quality                   | Extracted entities | Tokens wasted
Raw HTML (20KB page)            | ~60% recall        | ~8,000 tokens/page
Basic markdown (BeautifulSoup)  | ~75% recall        | ~4,000 tokens/page
KnowledgeSDK clean markdown     | ~95% recall        | ~1,200 tokens/page

Clean input means fewer LLM calls, lower cost, and a denser, more accurate knowledge graph.


Architecture Overview

Website URLs
    ↓
KnowledgeSDK /v1/extract
    ↓
Clean markdown + structured metadata
    ↓
Entity/Relationship Extractor (Claude or GPT-4o + Pydantic)
    ↓
Graph nodes and edges
    ↓
Neo4j or LightRAG
    ↓
GraphRAG queries

Each stage is decoupled. You can swap Neo4j for LightRAG, or switch from Claude to GPT-4o, without touching the scraping layer.
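
As a sketch of how the stages compose, here is the whole pipeline as plain Python functions. The function names are the ones defined in the steps below, so any stage can be replaced without touching the others:

Python:

def run_pipeline(urls: list[str]) -> None:
    # Step 1: KnowledgeSDK -> clean markdown
    pages = scrape_doc_site(urls)

    # Step 2: LLM + Pydantic entity/relationship extraction
    entities, relationships = {}, []
    for page in pages:
        graph = extract_graph_from_page(page)
        for entity in graph.entities:
            entities.setdefault(entity.name, entity)  # dedupe by canonical name
        relationships.extend(graph.relationships)

    # Step 3: load into the graph store (swap for LightRAG if preferred)
    load_entities_to_neo4j(entities, relationships)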


Step 1: Scrape with KnowledgeSDK

Install the SDK:

Node.js:

npm install @knowledgesdk/node

Python:

pip install knowledgesdk

Scraping a Documentation Site

Node.js:

import KnowledgeSDK from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function scrapeDocSite(urls) {
  const results = [];

  for (const url of urls) {
    const result = await client.extract(url, {
      includeMarkdown: true,
      includeStructured: true,
    });

    results.push({
      url,
      title: result.title,
      markdown: result.markdown,
      // structured fields: description, headings, links, etc.
      metadata: result.structured,
    });

    console.log(`Scraped: ${url} (${result.markdown.length} chars)`);
  }

  return results;
}

// Example: scrape a documentation site
const docUrls = [
  'https://docs.example.com/introduction',
  'https://docs.example.com/architecture',
  'https://docs.example.com/api-reference',
];

const pages = await scrapeDocSite(docUrls);

Python:

import os
from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_doc_site(urls: list[str]) -> list[dict]:
    results = []

    for url in urls:
        result = client.extract(url, include_markdown=True, include_structured=True)

        results.append({
            "url": url,
            "title": result.title,
            "markdown": result.markdown,
            "metadata": result.structured,
        })

        print(f"Scraped: {url} ({len(result.markdown)} chars)")

    return results

doc_urls = [
    "https://docs.example.com/introduction",
    "https://docs.example.com/architecture",
    "https://docs.example.com/api-reference",
]

pages = scrape_doc_site(doc_urls)

Scraping a Full Site with the Sitemap Endpoint

If you want to crawl an entire documentation site, use /v1/sitemap to discover all URLs first:

Python:

def scrape_full_site(base_url: str) -> list[dict]:
    # Step 1: discover all URLs
    sitemap = client.sitemap(base_url)
    print(f"Found {len(sitemap.urls)} URLs")

    # Step 2: scrape each one
    pages = []
    for url in sitemap.urls[:50]:  # limit for demo
        try:
            result = client.extract(url, include_markdown=True)
            pages.append({"url": url, "markdown": result.markdown, "title": result.title})
        except Exception as e:
            print(f"Failed {url}: {e}")

    return pages
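
Scraping a few hundred pages one at a time can be slow. Below is a small concurrency sketch using a thread pool; it assumes the KnowledgeSDK Python client tolerates concurrent calls from multiple threads, which is worth confirming against the SDK docs:

Python:

from concurrent.futures import ThreadPoolExecutor

def scrape_concurrently(urls: list[str], max_workers: int = 8) -> list[dict]:
    def scrape_one(url: str):
        try:
            result = client.extract(url, include_markdown=True)
            return {"url": url, "markdown": result.markdown, "title": result.title}
        except Exception as e:
            print(f"Failed {url}: {e}")
            return None

    # Fan out over a thread pool; one failed URL doesn't block the rest
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [page for page in pool.map(scrape_one, urls) if page is not None]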

Step 2: Extract Entities and Relationships

This is where you define what a "node" and an "edge" mean for your domain. For a software documentation site, nodes might be: Module, API, Concept, Configuration. For a business intelligence use case, nodes might be: Company, Person, Product, Acquisition.

Define Pydantic Schemas

Python (with Pydantic v2):

from pydantic import BaseModel, Field
from typing import Optional

class Entity(BaseModel):
    name: str = Field(description="The canonical name of the entity")
    type: str = Field(description="Entity type: Module, API, Concept, Company, Person, etc.")
    description: Optional[str] = Field(default=None, description="One-sentence description of this entity")
    aliases: list[str] = Field(default_factory=list, description="Alternative names or abbreviations")

class Relationship(BaseModel):
    source: str = Field(description="Name of the source entity")
    target: str = Field(description="Name of the target entity")
    relation: str = Field(description="Relationship type: DEPENDS_ON, EXTENDS, CONFIGURES, OWNS, ACQUIRED, etc.")
    description: Optional[str] = Field(default=None, description="Brief description of this relationship")

class KnowledgeGraph(BaseModel):
    entities: list[Entity]
    relationships: list[Relationship]
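
As a quick sanity check of the shape the extractor must return, here is a hand-written example validated against the schema (the entity names are made up for illustration):

Python:

sample = {
    "entities": [
        {"name": "AuthModule", "type": "Module", "description": "Handles authentication.", "aliases": ["auth"]},
        {"name": "RateLimiting", "type": "Concept", "description": "Throttles API requests.", "aliases": ["rate limiter"]},
    ],
    "relationships": [
        {"source": "RateLimiting", "target": "AuthModule", "relation": "DEPENDS_ON",
         "description": "Limits are applied per authenticated user."},
    ],
}

graph = KnowledgeGraph.model_validate(sample)  # raises ValidationError if the shape is wrong
print(graph.entities[0].name)  # "AuthModule"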

Extract with Claude

Python:

import anthropic
import json

anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Note: literal braces in the schema example are doubled so that str.format
# only substitutes {text} below.
EXTRACTION_PROMPT = """You are a knowledge graph extraction system. Given the following text from a documentation page, extract all entities and relationships.

For entities, identify: modules, APIs, concepts, configurations, companies, people, products.
For relationships, identify how entities connect: depends_on, extends, configures, part_of, created_by, used_by.

Return only a JSON object matching this schema, with no surrounding prose:
{{"entities": [{{"name": ..., "type": ..., "description": ..., "aliases": [...]}}], "relationships": [{{"source": ..., "target": ..., "relation": ..., "description": ...}}]}}

Text to analyze:
{text}"""

def extract_graph_from_page(page: dict) -> KnowledgeGraph:
    prompt = EXTRACTION_PROMPT.format(text=page["markdown"][:8000])  # limit tokens

    message = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )

    raw = message.content[0].text.strip()
    # The model may wrap the JSON in a ```json fence; strip it before parsing
    raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    data = json.loads(raw)
    return KnowledgeGraph.model_validate(data)

# Extract from all scraped pages
all_entities = {}
all_relationships = []

for page in pages:
    print(f"Extracting entities from: {page['url']}")
    graph = extract_graph_from_page(page)

    for entity in graph.entities:
        # deduplicate by name
        if entity.name not in all_entities:
            all_entities[entity.name] = entity

    all_relationships.extend(graph.relationships)

print(f"Total entities: {len(all_entities)}")
print(f"Total relationships: {len(all_relationships)}")

Extract with GPT-4o (Structured Outputs)

Node.js:

import OpenAI from 'openai';
import { zodResponseFormat } from 'openai/helpers/zod';
import { z } from 'zod';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const EntitySchema = z.object({
  name: z.string(),
  type: z.string(),
  description: z.string().nullable(), // structured outputs require every key; use nullable instead of optional
  aliases: z.array(z.string()),
});

const RelationshipSchema = z.object({
  source: z.string(),
  target: z.string(),
  relation: z.string(),
  description: z.string().nullable(),
});

const KnowledgeGraphSchema = z.object({
  entities: z.array(EntitySchema),
  relationships: z.array(RelationshipSchema),
});

async function extractGraphFromPage(page) {
  const response = await openai.beta.chat.completions.parse({
    model: 'gpt-4o-2024-11-20',
    messages: [
      {
        role: 'system',
        content: 'Extract entities and relationships from the provided text. Return structured knowledge graph data.',
      },
      {
        role: 'user',
        content: page.markdown.slice(0, 8000),
      },
    ],
    response_format: zodResponseFormat(KnowledgeGraphSchema, 'knowledge_graph'),
  });

  return response.choices[0].message.parsed;
}

// Process all pages
const allEntities = new Map();
const allRelationships = [];

for (const page of pages) {
  const graph = await extractGraphFromPage(page);

  for (const entity of graph.entities) {
    if (!allEntities.has(entity.name)) {
      allEntities.set(entity.name, entity);
    }
  }

  allRelationships.push(...graph.relationships);
}

Step 3A: Load into Neo4j

Neo4j is the most mature graph database for this use case. The LLM Graph Builder UI is great for exploration, but for production pipelines you want to load data programmatically.

Python:

from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"])
)

def load_entities_to_neo4j(entities: dict, relationships: list):
    with driver.session() as session:
        # Create entity nodes
        for name, entity in entities.items():
            session.run(
                """
                MERGE (e:Entity {name: $name})
                SET e.type = $type,
                    e.description = $description
                """,
                name=entity.name,
                type=entity.type,
                description=entity.description or "",
            )

        # Create relationships
        for rel in relationships:
            session.run(
                """
                MATCH (s:Entity {name: $source})
                MATCH (t:Entity {name: $target})
                MERGE (s)-[r:RELATES {type: $relation}]->(t)
                SET r.description = $description
                """,
                source=rel.source,
                target=rel.target,
                relation=rel.relation,
                description=rel.description or "",
            )

    print(f"Loaded {len(entities)} entities and {len(relationships)} relationships")

load_entities_to_neo4j(all_entities, all_relationships)
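
MERGE on Entity.name scans every Entity node unless Neo4j has an index to use. Creating a uniqueness constraint once, before the first load, keeps the load fast and prevents duplicate nodes (Neo4j 5 syntax; older versions use ASSERT instead of REQUIRE):

Python:

with driver.session() as session:
    # Run once per database before loading
    session.run(
        "CREATE CONSTRAINT entity_name IF NOT EXISTS "
        "FOR (e:Entity) REQUIRE e.name IS UNIQUE"
    )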

Querying the Graph

Once loaded, you can query with Cypher:

// Find all APIs that depend on a specific module
MATCH (api:Entity {type: "API"})-[r:RELATES {type: "DEPENDS_ON"}]->(m:Entity {name: "AuthModule"})
RETURN api.name, api.description

// Find 2-hop relationships from a concept
MATCH path = (c:Entity {name: "RateLimiting"})-[*1..2]-(connected)
RETURN path
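
The final stage of the architecture diagram, turning a graph query into an answer, is retrieve-then-generate: pull the relevant subgraph with Cypher and hand it to the LLM as context. A minimal sketch, reusing the Anthropic client from Step 2 (the Cypher and prompt wording here are illustrative, not a fixed recipe):

Python:

def graphrag_answer(question: str, entity_name: str) -> str:
    # Retrieve the 1-2 hop neighborhood of the entity the question is about
    with driver.session() as session:
        records = session.run(
            """
            MATCH (e:Entity {name: $name})-[rels:RELATES*1..2]-(other:Entity)
            RETURN e.name AS source, [r IN rels | r.type] AS relations, other.name AS target
            LIMIT 50
            """,
            name=entity_name,
        )
        facts = [
            f"{rec['source']} -{'/'.join(rec['relations'])}-> {rec['target']}"
            for rec in records
        ]

    # Generate an answer grounded in the retrieved subgraph
    message = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": "Graph facts:\n" + "\n".join(facts) + f"\n\nQuestion: {question}",
        }],
    )
    return message.content[0].text

print(graphrag_answer(
    "How does the rate limiter depend on authentication?",
    entity_name="RateLimiting",
))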

Step 3B: Use LightRAG for Simpler Setups

LightRAG combines entity extraction and graph-based retrieval into a single package. It's a good choice if you don't want to manage a separate Neo4j instance.

Python:

import asyncio

from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete

rag = LightRAG(
    working_dir="./lightrag_cache",
    llm_model_func=gpt_4o_mini_complete,
)

async def build_and_query(pages: list[dict]) -> str:
    # Insert all scraped pages; LightRAG extracts entities internally
    for page in pages:
        await rag.ainsert(page["markdown"])
        print(f"Inserted: {page['url']}")

    # Query with graph-aware retrieval
    return await rag.aquery(
        "What are the dependencies between the authentication module and the rate limiter?",
        param=QueryParam(mode="global"),  # global = graph community search
    )

print(asyncio.run(build_and_query(pages)))
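
QueryParam supports other retrieval modes besides global: "local" focuses on the neighborhood of matched entities, "hybrid" combines both, and "naive" falls back to plain vector retrieval. A short usage sketch (mode names as documented in current LightRAG releases):

Python:

async def query_examples() -> None:
    # Entity-centric question: local mode retrieves the matched entity's neighborhood
    print(await rag.aquery(
        "What configuration options does the authentication module expose?",
        param=QueryParam(mode="local"),
    ))

    # Broad question: hybrid combines local entity context with global community summaries
    print(await rag.aquery(
        "Summarize the main architectural layers described across the documentation.",
        param=QueryParam(mode="hybrid"),
    ))

asyncio.run(query_examples())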

LightRAG handles the entity extraction internally, so you skip the manual Pydantic extraction step. The tradeoff: less control over your schema and no ability to query the graph directly with Cypher.


Keeping the Graph Fresh with Webhooks

A static knowledge graph goes stale. When documentation updates, new relationships appear and old ones break. KnowledgeSDK webhooks let you register a URL and receive a notification whenever a scraped page changes.

Python:

import httpx

# Register webhooks for all scraped URLs
KNOWLEDGESDK_API_KEY = os.environ["KNOWLEDGESDK_API_KEY"]

def register_webhook(url: str, callback_url: str):
    response = httpx.post(
        "https://api.knowledgesdk.com/v1/webhooks",
        headers={"x-api-key": KNOWLEDGESDK_API_KEY},
        json={"url": url, "callbackUrl": callback_url, "events": ["content_changed"]},
    )
    return response.json()

for page in pages:
    register_webhook(page["url"], "https://your-app.com/webhooks/graphrag-update")

# Webhook handler (FastAPI example)
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/graphrag-update")
async def handle_update(request: Request):
    payload = await request.json()
    changed_url = payload["url"]

    # Re-scrape the changed page
    updated = client.extract(changed_url, include_markdown=True)

    # Re-extract entities from this page only
    graph = extract_graph_from_page({"markdown": updated.markdown})

    # Update only affected nodes in Neo4j (not a full re-crawl)
    with driver.session() as session:
        for entity in graph.entities:
            session.run(
                "MERGE (e:Entity {name: $name}) SET e.description = $description",
                name=entity.name,
                description=entity.description or "",
            )

    return {"status": "updated", "url": changed_url}

GraphRAG vs. Naive RAG: When to Use Each

Scenario                  | Naive RAG | GraphRAG
Simple Q&A over docs      | Good      | Overkill
Multi-hop questions       | Poor      | Excellent
"How does X relate to Y?" | Poor      | Excellent
Summarize a single page   | Good      | Good
Entity-centric queries    | Poor      | Excellent
Cost                      | Low       | High (LLM extraction per page)
Setup complexity          | Low       | High

GraphRAG adds meaningful value when your queries require connecting information across multiple documents, not just retrieving similar chunks.


Performance and Cost Estimates

For a 500-page documentation site:

Stage                                      | Time    | Cost
KnowledgeSDK extraction (500 pages)        | ~8 min  | ~$2.50
Entity extraction with Claude (500 pages)  | ~15 min | ~$18.00
Neo4j load                                 | ~2 min  | Free (self-hosted)
Total initial build                        | ~25 min | ~$20.50

Incremental updates (via webhooks, 10 pages changed per day):

Stage                              | Time    | Cost
KnowledgeSDK re-scrape (10 pages)  | ~10 sec | ~$0.05
Entity re-extraction (10 pages)    | ~20 sec | ~$0.36
Neo4j partial update               | ~2 sec  | Free
Total per update cycle             | ~32 sec | ~$0.41

Webhooks cut the incremental cost by roughly 98% versus re-crawling everything daily: a full daily rebuild would run ~$20.50, while the webhook-driven cycle above costs ~$0.41, about 2% of the full-rebuild price.


Conclusion

GraphRAG moves retrieval from "find similar chunks" to "traverse a knowledge graph." The quality of the graph depends on the quality of the input text. That's where KnowledgeSDK fits: it handles JavaScript rendering, anti-bot layers, and clean markdown conversion so your entity extractor sees signal, not noise.

The pipeline in this tutorial gives you:

  • Clean web content via KnowledgeSDK
  • Structured entity/relationship extraction via Claude or GPT-4o
  • Graph storage in Neo4j or LightRAG
  • Incremental updates via webhooks

Ready to build your knowledge graph? Start a free KnowledgeSDK trial at knowledgesdk.com — your first 1,000 extractions are free.
