Build a Knowledge Graph from Any Website Using LLMs
A vector database stores documents as embedding vectors and retrieves them by semantic similarity. That works well for "find me content similar to this query." It does not work well for "find all products that integrate with Stripe," "which competitors offer both a free tier and an enterprise plan," or "what are the two-hop connections between this company and its investors?"
These multi-hop reasoning questions require a knowledge graph — a data structure where entities (companies, products, people, features) are nodes and their relationships (integrates-with, competes-with, founded-by, priced-at) are typed edges.
This tutorial walks through building a knowledge graph from any website: crawling with KnowledgeSDK, extracting entities and relationships with a structured LLM prompt, loading into Neo4j, and querying the result. By the end, you will have a working pipeline that can answer questions like "find all integrations shared between our product and Competitor X" from scraped web content.
Why Knowledge Graphs Beat RAG for Relational Queries
To understand why this matters, consider a concrete example. You have scraped the documentation sites of 20 SaaS tools. You want to answer: "Which tools in my dataset integrate with both Stripe and Salesforce?"
With a vector database:
- You search for "Stripe integration" and get back a set of chunks
- You search for "Salesforce integration" and get back another set of chunks
- You then have to intersect the two result sets yourself; an LLM can attempt this, but it will miss tools whose Stripe and Salesforce mentions land in different chunks
With a knowledge graph:
- You have nodes for each tool, Stripe, and Salesforce
- You have INTEGRATES_WITH edges between them
- The query is a single Cypher statement: MATCH (t:Tool)-[:INTEGRATES_WITH]->(s:Service) WHERE s.name IN ['Stripe', 'Salesforce'] WITH t, count(DISTINCT s) AS n WHERE n = 2 RETURN t.name
- The result is exact, complete, and fast
| Capability | Vector DB | Knowledge Graph |
|---|---|---|
| "Find docs similar to X" | Excellent | Poor (not its strength) |
| "Find all X that have property Y" | Unreliable | Excellent |
| "Find X related to Y through Z" | Cannot do reliably | Native query |
| "How many X integrate with Y?" | Very unreliable | Exact count |
| Setup complexity | Low | Medium |
| Query language | Natural language | Cypher / SPARQL |
Architecture Overview
The pipeline has four stages:
[Target Website]
│
▼
[KnowledgeSDK Crawler] ──── scrapes pages, returns markdown
│
▼
[LLM Entity Extractor] ──── extracts (entity, relation, entity) triples
│
▼
[Neo4j Ingestion] ────────── loads nodes and edges
│
▼
[Graph Query Layer] ──────── Cypher queries or natural language via LLM
Step 1: Crawling the Target Site with KnowledgeSDK
First, get all URLs from the target site using the sitemap endpoint, then extract each page.
import os
from knowledgesdk import KnowledgeSDK
from typing import Generator
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
def crawl_site(base_url: str, max_pages: int = 100) -> Generator[dict, None, None]:
"""
Crawl a site using KnowledgeSDK sitemap discovery, then extract each page.
Yields dicts with url, title, and markdown content.
"""
print(f"Discovering URLs at {base_url}...")
sitemap = client.sitemap(url=base_url)
urls = sitemap.urls[:max_pages]
print(f"Found {len(urls)} URLs — extracting content...")
for url in urls:
try:
result = client.scrape(url=url)
if result.markdown and len(result.markdown.strip()) > 100:
yield {
"url": url,
"title": getattr(result, "title", url),
"markdown": result.markdown,
}
except Exception as e:
print(f"Failed to scrape {url}: {e}")
continue
# Example: crawl a competitor's integration docs
pages = list(crawl_site("https://docs.example-saas.com", max_pages=50))
print(f"Successfully crawled {len(pages)} pages")
Step 2: Entity and Relationship Extraction with a Pydantic Schema
The key to reliable LLM extraction is a strict output schema. We use Pydantic to define the structure and pass it to the LLM via instructor or OpenAI's structured output mode.
from pydantic import BaseModel, Field
from typing import Literal
from openai import OpenAI
import instructor
# Define the schema for extracted knowledge
class Entity(BaseModel):
name: str = Field(description="Canonical name of the entity")
entity_type: Literal["Product", "Company", "Feature", "Integration", "Person", "Technology", "Plan"] = Field(
description="Type of entity"
)
description: str = Field(description="Brief description of this entity", default="")
class Relationship(BaseModel):
source: str = Field(description="Name of the source entity")
relation: Literal[
"INTEGRATES_WITH",
"COMPETES_WITH",
"BUILT_BY",
"HAS_FEATURE",
"HAS_PLAN",
"PRICED_AT",
"USES_TECHNOLOGY",
"FOUNDED_BY",
"PART_OF",
] = Field(description="Type of relationship")
target: str = Field(description="Name of the target entity")
attributes: dict = Field(default_factory=dict, description="Optional relationship attributes like price or date")
class KnowledgeExtraction(BaseModel):
entities: list[Entity] = Field(description="List of entities found in the text")
relationships: list[Relationship] = Field(description="List of relationships between entities")
# Set up instructor-patched client for structured output
openai_client = instructor.from_openai(OpenAI())
EXTRACTION_SYSTEM_PROMPT = """You are a knowledge graph extraction system.
Extract entities and relationships from the provided text.
Focus on:
- Products and their features
- Companies and their relationships
- Technical integrations (tool A integrates with tool B)
- Pricing plans and their attributes
- People and their roles
Be precise with entity names — use the canonical form as it appears in the text.
Only extract relationships that are explicitly stated, not implied."""
def extract_knowledge(page: dict) -> KnowledgeExtraction:
    """Extract entities and relationships from a scraped page."""
    # Truncate content to 4,000 characters to avoid token overflows;
    # keep the comment outside the f-string so it doesn't leak into the prompt
    prompt = f"""Extract all entities and relationships from this webpage.

Page title: {page['title']}
URL: {page['url']}

Content:
{page['markdown'][:4000]}
"""
return openai_client.chat.completions.create(
model="gpt-4o",
response_model=KnowledgeExtraction,
messages=[
{"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
)
# Process all crawled pages
all_extractions = []
for page in pages:
extraction = extract_knowledge(page)
all_extractions.append(extraction)
print(f"Extracted {len(extraction.entities)} entities, {len(extraction.relationships)} relationships from {page['url']}")
Step 3: Loading into Neo4j
from neo4j import GraphDatabase
from collections import defaultdict
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
def merge_entity(tx, entity: Entity):
"""Create or update an entity node."""
tx.run(
"""
MERGE (e:Entity {name: $name})
SET e.entity_type = $entity_type,
e.description = $description
WITH e
CALL apoc.create.addLabels(e, [$entity_type]) YIELD node
RETURN node
""",
name=entity.name,
entity_type=entity.entity_type,
description=entity.description,
)
def merge_relationship(tx, rel: Relationship):
"""Create or update a relationship edge."""
query = f"""
MERGE (source:Entity {{name: $source}})
MERGE (target:Entity {{name: $target}})
MERGE (source)-[r:{rel.relation}]->(target)
SET r += $attributes
RETURN r
"""
tx.run(query, source=rel.source, target=rel.target, attributes=rel.attributes)
def load_to_neo4j(extractions: list[KnowledgeExtraction]):
"""Load all extracted knowledge into Neo4j."""
# Deduplicate entities (same name from different pages)
entity_map: dict[str, Entity] = {}
for extraction in extractions:
for entity in extraction.entities:
if entity.name not in entity_map:
entity_map[entity.name] = entity
with driver.session() as session:
# Load entities
print(f"Loading {len(entity_map)} unique entities...")
for entity in entity_map.values():
session.execute_write(merge_entity, entity)
# Load relationships
all_rels = [rel for e in extractions for rel in e.relationships]
print(f"Loading {len(all_rels)} relationships...")
for rel in all_rels:
try:
session.execute_write(merge_relationship, rel)
except Exception as e:
print(f"Failed to load relationship {rel}: {e}")
print("Knowledge graph loaded successfully")
load_to_neo4j(all_extractions)
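One practical addition before loading at scale: a uniqueness constraint on Entity.name makes the repeated MERGE lookups index-backed instead of full scans. A minimal sketch, assuming Neo4j 5.x constraint syntax — run it once before load_to_neo4j:

def create_constraints():
    """Ensure MERGE lookups on Entity.name hit an index rather than a label scan."""
    with driver.session() as session:
        session.run(
            "CREATE CONSTRAINT entity_name IF NOT EXISTS "
            "FOR (e:Entity) REQUIRE e.name IS UNIQUE"
        )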
Step 4: Querying the Knowledge Graph
With the graph loaded, you can now run Cypher queries that would be impossible with a vector database:
def query_graph(cypher: str) -> list[dict]:
with driver.session() as session:
result = session.run(cypher)
return [dict(record) for record in result]
# Q1: Find all products that integrate with Stripe
stripe_integrations = query_graph("""
MATCH (p:Product)-[:INTEGRATES_WITH]->(s:Entity {name: 'Stripe'})
RETURN p.name AS product, p.description AS description
ORDER BY p.name
""")
# Q2: Find integrations shared between our product and each competitor
shared_integrations = query_graph("""
MATCH (our:Product {name: 'OurProduct'})-[:INTEGRATES_WITH]->(i:Integration)
MATCH (competitor:Product)-[:INTEGRATES_WITH]->(i)
WHERE competitor.name <> 'OurProduct'
RETURN competitor.name AS competitor, collect(i.name) AS shared_integrations,
count(i) AS shared_count
ORDER BY shared_count DESC
""")
# Q3: Two-hop query — find companies that use technologies our competitor uses
competitor_tech_stack = query_graph("""
MATCH (competitor:Company {name: 'CompetitorX'})-[:USES_TECHNOLOGY]->(tech:Technology)
MATCH (other:Company)-[:USES_TECHNOLOGY]->(tech)
WHERE other.name <> 'CompetitorX'
RETURN other.name AS company, collect(tech.name) AS shared_tech
ORDER BY size(shared_tech) DESC
LIMIT 10
""")
# Q4: Find all plans and their prices across competitors
# (assumes PRICED_AT edges carry monthly_price / annual_price attributes)
pricing_landscape = query_graph("""
MATCH (company:Company)-[:HAS_PLAN]->(plan:Plan)-[r:PRICED_AT]->(price)
RETURN company.name AS company, plan.name AS plan,
r.monthly_price AS monthly, r.annual_price AS annual
ORDER BY company.name, r.monthly_price
""")
Step 5: Natural Language Query Interface
For non-technical users, add an LLM layer that translates natural language to Cypher:
from openai import OpenAI

# Use a plain OpenAI client here so we don't clobber the
# instructor-patched openai_client from Step 2
plain_openai = OpenAI()
CYPHER_SYSTEM_PROMPT = """You are a Neo4j Cypher query generator.
The knowledge graph has the following schema:
Nodes: Entity (with labels: Product, Company, Feature, Integration, Person, Technology, Plan)
Relationships: INTEGRATES_WITH, COMPETES_WITH, BUILT_BY, HAS_FEATURE, HAS_PLAN, PRICED_AT, USES_TECHNOLOGY, FOUNDED_BY, PART_OF
Generate a valid Cypher query to answer the user's question.
Return ONLY the Cypher query, no explanation."""
def natural_language_to_cypher(question: str) -> str:
    response = plain_openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": CYPHER_SYSTEM_PROMPT},
{"role": "user", "content": question},
],
temperature=0,
)
return response.choices[0].message.content.strip()
def ask_knowledge_graph(question: str) -> dict:
cypher = natural_language_to_cypher(question)
print(f"Generated Cypher: {cypher}")
results = query_graph(cypher)
# Let the LLM format the answer
    answer_response = plain_openai.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": f"Question: {question}\n\nData from knowledge graph:\n{results}\n\nProvide a clear, concise answer.",
}
],
)
return {
"question": question,
"cypher": cypher,
"raw_results": results,
"answer": answer_response.choices[0].message.content,
}
# Usage
result = ask_knowledge_graph(
"Which of our competitors offer both a Stripe integration and a free tier?"
)
print(result["answer"])
Node.js Implementation
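The same pipeline in Node.js, using @knowledgesdk/node, the openai package, and the official neo4j-driver: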
import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';
import neo4j from 'neo4j-driver';
const knowledgeClient = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI();
const driver = neo4j.driver(
process.env.NEO4J_URI,
neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD)
);
async function crawlAndExtract(baseUrl, maxPages = 50) {
const sitemap = await knowledgeClient.sitemap({ url: baseUrl });
const urls = sitemap.urls.slice(0, maxPages);
const pages = [];
for (const url of urls) {
try {
const result = await knowledgeClient.scrape({ url });
if (result.markdown?.trim().length > 100) {
pages.push({ url, title: result.title ?? url, markdown: result.markdown });
}
} catch (e) {
console.error(`Failed to scrape ${url}:`, e.message);
}
}
return pages;
}
async function extractKnowledge(page) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: 'Extract entities and relationships as JSON with {entities: [], relationships: []}',
},
{
role: 'user',
content: `Page: ${page.title}\nURL: ${page.url}\n\n${page.markdown.slice(0, 4000)}`,
},
],
response_format: { type: 'json_object' },
});
return JSON.parse(response.choices[0].message.content);
}
async function loadToNeo4j(extractions) {
const session = driver.session();
try {
for (const extraction of extractions) {
for (const entity of extraction.entities ?? []) {
await session.run(
        'MERGE (e:Entity {name: $name}) SET e.entity_type = $type, e.description = $description',
        { name: entity.name, type: entity.entity_type, description: entity.description ?? '' }
);
}
for (const rel of extraction.relationships ?? []) {
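        // Keep only uppercase letters and underscores so the relationship
        // type is safe to interpolate into the Cypher string below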
const relType = rel.relation.replace(/[^A-Z_]/g, '');
await session.run(
`MERGE (s:Entity {name: $source}) MERGE (t:Entity {name: $target}) MERGE (s)-[:${relType}]->(t)`,
{ source: rel.source, target: rel.target }
);
}
}
console.log('Knowledge graph loaded');
} finally {
await session.close();
}
}
// Main pipeline
const pages = await crawlAndExtract('https://docs.example.com', 30);
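// Note: Promise.all runs every extraction concurrently; batch or throttle for large crawls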
const extractions = await Promise.all(pages.map(extractKnowledge));
await loadToNeo4j(extractions);
await driver.close();
What This Enables That RAG Cannot
The knowledge graph built from this pipeline supports query patterns that are structurally impossible with flat vector search:
Multi-hop traversal: "Find all companies that are funded by investors who also funded Company X" — this requires traversing two edge types in sequence.
Set intersection: "Find integrations that Product A and Product B both support" — Cypher handles this natively; vector search requires multiple queries and manual intersection.
Aggregation over relationships: "Which integration is mentioned by the most competitors?" — a COUNT query over edges.
Path finding: "What is the shortest connection path between Company X and Company Y through shared investors or customers?" — Cypher's shortestPath function.
Completeness guarantees: If the graph contains all INTEGRATES_WITH edges for Stripe, a Cypher query will return all of them. A vector similarity search returns the most relevant chunks — which may miss pages where the integration is mentioned briefly.
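As a concrete sketch of the path-finding pattern mentioned above, using the query_graph helper from Step 4 — the entity names and the six-hop cap are illustrative assumptions, not part of the pipeline:

# Shortest connection between two companies through any relationship, up to 6 hops
connection = query_graph("""
    MATCH p = shortestPath(
        (a:Entity {name: 'Company X'})-[*..6]-(b:Entity {name: 'Company Y'})
    )
    RETURN [n IN nodes(p) | n.name] AS path, length(p) AS hops
""")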
Conclusion
Knowledge graphs built from web content unlock a class of relational queries that vector RAG cannot handle. The pipeline is now accessible to any developer: KnowledgeSDK handles the crawling and content extraction, an LLM with a Pydantic schema handles entity and relationship extraction, and Neo4j provides the graph storage and query engine.
The realistic use cases are substantial: competitive intelligence systems, market research, documentation linking, and any domain where understanding relationships between entities matters as much as the entities themselves.
Start building your knowledge graph. Get a free KnowledgeSDK API key at knowledgesdk.com and crawl your first site in minutes. 1,000 free requests per month, no credit card required.