tutorial · March 20, 2026 · 12 min read

Building a Knowledge Graph from Websites with Neo4j and KnowledgeSDK

Extract entities and relationships from any website, build a Neo4j knowledge graph, and query it for multi-hop reasoning in your RAG pipeline.


Vector RAG works by finding chunks of text that are semantically similar to a query. It's powerful for "what does this document say about X?" style questions. But for questions like "which companies does Acme Corp partner with?" or "what features does this product have that its competitor lacks?" — questions that require traversing relationships — vector similarity falls short.

Knowledge graphs complement vector retrieval by modeling explicit relationships between entities. Instead of finding the most similar passage, you traverse a graph: Company -[PARTNERS_WITH]-> Company, Product -[HAS_FEATURE]-> Feature. The answer is a chain of graph traversals, not a similarity score.

This tutorial walks through extracting web content with KnowledgeSDK, pulling entity-relationship triples out of that content with an LLM, storing them in Neo4j, and querying the result with Cypher. At the end, we combine graph traversal with vector search for hybrid GraphRAG.

Why Knowledge Graphs Complement Vector RAG

Consider a corpus of 200 company websites in your market. You want to answer: "Which companies in our sector use Kubernetes for infrastructure?"

With vector RAG, you'd search for "Kubernetes infrastructure" and get semantically similar passages. This works, but it treats each chunk in isolation. If the company mentions Kubernetes once in a case study and again in a job posting, you might miss the pattern.

With a knowledge graph, you'd have nodes for each company and technology, and edges for the relationship: (Acme Corp)-[:USES_TECHNOLOGY]->(Kubernetes). Your Cypher query returns every company with that edge in milliseconds, regardless of how many times or in what context it was mentioned.

The two approaches are additive:

  • Graph query → finds entities and their relationships with precision
  • Vector search → finds semantically similar content for context and explanation
  • Combined → finds entities via graph, then fetches narrative context via vector search

The Workflow

  1. Extract website content with KnowledgeSDK (POST /v1/extract)
  2. Extract entity + relationship triples from the markdown with GPT-4o
  3. Store entities as nodes and relationships as edges in Neo4j
  4. Query the graph with Cypher for multi-hop answers
  5. Optionally: enrich graph results with vector search from /v1/search

Step 1: Extract Website Content

KnowledgeSDK's /v1/extract endpoint handles JavaScript-rendered pages, removes navigation and boilerplate, and returns clean markdown. This is important — raw HTML is noisy input for LLM entity extraction.

import fetch from "node-fetch";

const KNOWLEDGE_API_KEY = process.env.KNOWLEDGE_API_KEY!;

async function extractWebsite(url: string): Promise<string> {
  const response = await fetch("https://api.knowledgesdk.com/v1/extract", {
    method: "POST",
    headers: {
      "x-api-key": KNOWLEDGE_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url }),
  });

  if (!response.ok) {
    throw new Error(`Extraction failed: ${response.statusText}`);
  }

  const data = await response.json() as { markdown: string; metadata: any };
  return data.markdown;
}

For large sites with many pages, use /v1/sitemap to discover all URLs first, then extract each page in small concurrent batches:

async function extractAllPages(domain: string): Promise<Map<string, string>> {
  // Get all URLs on the site
  const sitemapResponse = await fetch("https://api.knowledgesdk.com/v1/sitemap", {
    method: "POST",
    headers: { "x-api-key": KNOWLEDGE_API_KEY, "Content-Type": "application/json" },
    body: JSON.stringify({ url: domain }),
  });
  const { urls } = await sitemapResponse.json() as { urls: string[] };

  // Extract each page (with concurrency limit)
  const pages = new Map<string, string>();
  const limit = 5; // max concurrent extractions

  for (let i = 0; i < urls.length; i += limit) {
    const batch = urls.slice(i, i + limit);
    // allSettled: one failed page shouldn't abort the whole crawl
    const results = await Promise.allSettled(
      batch.map(async (url) => ({ url, content: await extractWebsite(url) }))
    );
    for (const result of results) {
      if (result.status === "fulfilled") {
        pages.set(result.value.url, result.value.content);
      }
    }
  }

  return pages;
}
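Sitemaps often include asset links and trailing-slash duplicates that waste extraction requests. A small pre-filter before crawling helps; this is a sketch, and the extension list is a starting point rather than anything the API prescribes:

```typescript
// Drop duplicates and non-HTML assets before extraction.
const SKIP_EXTENSIONS = [".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf", ".zip", ".css", ".js"];

function filterSitemapUrls(urls: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const url of urls) {
    const normalized = url.replace(/\/$/, ""); // treat trailing-slash variants as one URL
    const path = normalized.split("?")[0].toLowerCase();
    if (seen.has(normalized)) continue;
    if (SKIP_EXTENSIONS.some((ext) => path.endsWith(ext))) continue;
    seen.add(normalized);
    result.push(normalized);
  }
  return result;
}
```

Run the sitemap URLs through this before handing them to `extractAllPages`.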

Step 2: Extract Entity-Relationship Triples with GPT-4o

With clean markdown in hand, we ask an LLM to extract structured triples: { subject, predicate, object }. Each triple represents one relationship in our future graph.

import OpenAI from "openai";

const openai = new OpenAI();

interface Triple {
  subject: string;
  subjectType: string;
  predicate: string;
  object: string;
  objectType: string;
}

async function extractTriples(markdown: string, sourceUrl: string): Promise<Triple[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: `You are a knowledge graph extraction assistant.
Extract all entities and relationships from the provided text as a JSON object.
Return: { "triples": [{ "subject": "", "subjectType": "", "predicate": "", "object": "", "objectType": "" }] }
Entity types: Company, Product, Feature, Technology, Person, Location, Concept.
Predicates should be uppercase with underscores: HAS_FEATURE, USES_TECHNOLOGY, PARTNERS_WITH, FOUNDED_BY, COMPETES_WITH, OFFERS_PLAN, etc.
Extract only clear, factual relationships — not speculative ones.`,
      },
      {
        role: "user",
        content: `Source URL: ${sourceUrl}\n\nContent:\n${markdown.slice(0, 6000)}`,
      },
    ],
  });

  const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
  return parsed.triples ?? [];
}

Example output for a KnowledgeSDK product page:

{
  "triples": [
    { "subject": "KnowledgeSDK", "subjectType": "Product", "predicate": "HAS_ENDPOINT", "object": "/v1/extract", "objectType": "Feature" },
    { "subject": "KnowledgeSDK", "subjectType": "Product", "predicate": "HAS_ENDPOINT", "object": "/v1/search", "objectType": "Feature" },
    { "subject": "KnowledgeSDK", "subjectType": "Product", "predicate": "OFFERS_PLAN", "object": "Starter Plan", "objectType": "Concept" },
    { "subject": "Starter Plan", "subjectType": "Concept", "predicate": "COSTS", "object": "$29/month", "objectType": "Concept" }
  ]
}
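LLM output isn't always this clean: fields can come back empty, and predicates can arrive lowercase or space-separated. A hedged validation pass before anything reaches Neo4j is worth the few lines (a sketch; tighten the rules for your own ontology):

```typescript
// Same shape as the Triple interface from Step 2.
interface Triple {
  subject: string;
  subjectType: string;
  predicate: string;
  object: string;
  objectType: string;
}

// Drop incomplete triples and normalize predicates to UPPER_SNAKE_CASE
// so "has feature" and "HAS_FEATURE" merge into one edge type.
function validateTriples(raw: Triple[]): Triple[] {
  return raw
    .filter((t) => t.subject?.trim() && t.predicate?.trim() && t.object?.trim())
    .map((t) => ({
      ...t,
      subject: t.subject.trim(),
      object: t.object.trim(),
      predicate: t.predicate.trim().toUpperCase().replace(/[\s-]+/g, "_"),
    }));
}
```

Call it on the result of `extractTriples` before storage.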

Step 3: Store in Neo4j

Install the Neo4j driver and set up a connection:

import neo4j from "neo4j-driver";

const driver = neo4j.driver(
  process.env.NEO4J_URI!,
  neo4j.auth.basic(process.env.NEO4J_USER!, process.env.NEO4J_PASSWORD!)
);

async function storeTriples(triples: Triple[], sourceUrl: string): Promise<void> {
  const session = driver.session();
  const tx = session.beginTransaction();

  try {
    for (const triple of triples) {
      // MERGE ensures we don't create duplicate nodes
      await tx.run(
        `
        MERGE (s:Entity {name: $subject})
        SET s.type = $subjectType, s.updatedAt = datetime()
        MERGE (o:Entity {name: $object})
        SET o.type = $objectType, o.updatedAt = datetime()
        MERGE (s)-[r:RELATIONSHIP {predicate: $predicate}]->(o)
        SET r.sourceUrl = $sourceUrl, r.extractedAt = datetime()
        `,
        {
          subject: triple.subject,
          subjectType: triple.subjectType,
          predicate: triple.predicate,
          object: triple.object,
          objectType: triple.objectType,
          sourceUrl,
        }
      );
    }

    await tx.commit();
  } catch (err) {
    // Roll back explicitly so a half-written batch never lands in the graph
    await tx.rollback();
    throw err;
  } finally {
    await session.close();
  }
}

For better Neo4j performance, add indexes on the name property:

CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name);
CREATE INDEX entity_type IF NOT EXISTS FOR (e:Entity) ON (e.type);

Step 4: Full Pipeline — Extract to Graph

Putting it together:

async function buildKnowledgeGraph(urls: string[]): Promise<void> {
  console.log(`Building knowledge graph from ${urls.length} URLs...`);

  for (const url of urls) {
    try {
      console.log(`Extracting: ${url}`);
      const markdown = await extractWebsite(url);

      console.log(`Extracting triples from ${markdown.length} chars...`);
      const triples = await extractTriples(markdown, url);

      console.log(`Storing ${triples.length} triples in Neo4j...`);
      await storeTriples(triples, url);

      console.log(`Done: ${url} (${triples.length} relationships)`);
    } catch (err) {
      console.error(`Failed for ${url}:`, err);
    }
  }

  console.log("Knowledge graph build complete.");
}

// Run it
await buildKnowledgeGraph([
  "https://competitor-a.com",
  "https://competitor-b.com",
  "https://competitor-c.com",
]);

Step 5: Querying the Graph with Cypher

Once your graph is populated, you can answer structural questions instantly:

// What features does a product have?
MATCH (p:Entity {name: "KnowledgeSDK"})-[:RELATIONSHIP {predicate: "HAS_ENDPOINT"}]->(f)
RETURN f.name AS feature

// What technologies does a company use?
MATCH (c:Entity {type: "Company"})-[:RELATIONSHIP {predicate: "USES_TECHNOLOGY"}]->(t)
WHERE c.name = "Acme Corp"
RETURN t.name AS technology

// Multi-hop: which companies use the same technology as Acme?
MATCH (acme:Entity {name: "Acme Corp"})-[:RELATIONSHIP {predicate: "USES_TECHNOLOGY"}]->(tech)
      <-[:RELATIONSHIP {predicate: "USES_TECHNOLOGY"}]-(competitor)
WHERE competitor.name <> "Acme Corp"
RETURN competitor.name, tech.name

// What plans does each competitor offer?
MATCH (c:Entity {type: "Product"})-[:RELATIONSHIP {predicate: "OFFERS_PLAN"}]->(plan)
RETURN c.name AS product, collect(plan.name) AS plans

In Node.js:

async function findCompetitorFeatures(productName: string): Promise<string[]> {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (p:Entity {name: $name})-[:RELATIONSHIP]->(f:Entity {type: "Feature"})
       RETURN f.name AS feature`,
      { name: productName }
    );
    return result.records.map((r) => r.get("feature") as string);
  } finally {
    await session.close();
  }
}

Combining Graph and Vector Search

The most powerful queries combine both: use Cypher to identify relevant entities, then use /v1/search to fetch narrative context about those entities.

async function hybridQuery(question: string): Promise<string> {
  // Step 1: Identify entities mentioned in the question
  // (extractEntitiesFromQuery is a helper you supply, e.g. a small LLM call)
  const entities = await extractEntitiesFromQuery(question);

  // Step 2: Graph traversal to find related entities
  const session = driver.session();
  let graphResult;
  try {
    graphResult = await session.run(
      `MATCH (e:Entity)-[r:RELATIONSHIP]->(related)
       WHERE e.name IN $entities
       RETURN e.name, r.predicate, related.name LIMIT 20`,
      { entities }
    );
  } finally {
    await session.close();
  }

  const graphFacts = graphResult.records.map(
    (r) => `${r.get("e.name")} ${r.get("r.predicate")} ${r.get("related.name")}`
  );

  // Step 3: Vector search for semantic context
  const searchResponse = await fetch("https://api.knowledgesdk.com/v1/search", {
    method: "POST",
    headers: { "x-api-key": KNOWLEDGE_API_KEY, "Content-Type": "application/json" },
    body: JSON.stringify({ query: question, limit: 3 }),
  });
  const { results } = await searchResponse.json() as { results: any[] };
  const vectorContext = results.map((r: any) => r.content).join("\n\n");

  // Step 4: Combine and generate
  const context = `Graph facts:\n${graphFacts.join("\n")}\n\nContext:\n${vectorContext}`;
  // ... pass to LLM
  return context;
}

Use Cases

Competitive analysis: Extract all competitor websites monthly. Query: "which competitors offer enterprise SSO?" Graph gives you a precise list; vector search gives you the sales narrative around each one.

Documentation understanding: Extract your product's docs. Build a call graph of functions, modules, and concepts. Query: "what does the authentication module depend on?"

News intelligence: Extract news articles about your market. Model Company -[ANNOUNCED]-> Feature and Company -[ACQUIRED]-> Company relationships. Query: "which companies in our space have acquired an AI startup in the last 6 months?"

Limitations and What to Watch Out For

Schema design is the hard part. Your predicate vocabulary matters enormously. If one page generates USES_TECHNOLOGY and another generates BUILT_WITH for the same relationship type, your graph becomes fragmented. Define a predicate ontology before you start and include it in your extraction prompt.
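One way to enforce that ontology in code is a synonym map applied after extraction. The mappings below are illustrative, not from any KnowledgeSDK API; grow the list as you inspect real output:

```typescript
// Map predicate variants the LLM might emit onto one canonical vocabulary.
const PREDICATE_SYNONYMS: Record<string, string> = {
  BUILT_WITH: "USES_TECHNOLOGY",
  BUILT_ON: "USES_TECHNOLOGY",
  POWERED_BY: "USES_TECHNOLOGY",
  PARTNERED_WITH: "PARTNERS_WITH",
  INCLUDES_FEATURE: "HAS_FEATURE",
};

function canonicalizePredicate(predicate: string): string {
  const normalized = predicate.trim().toUpperCase().replace(/[\s-]+/g, "_");
  return PREDICATE_SYNONYMS[normalized] ?? normalized;
}
```

Apply it to every triple's predicate before the MERGE, so variant spellings land on the same edge type.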

Entity deduplication is tricky. "KnowledgeSDK" and "Knowledge SDK" and "the company" might all refer to the same entity. Use a canonicalization step (simple string normalization, or a separate LLM call for disambiguation) before storing.
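A minimal canonicalization sketch: normalize whitespace and casing for matching, and keep an alias map for known variants. The aliases below are examples, not extracted data; in practice you'd build this map per domain:

```typescript
// Known aliases resolve to one canonical name; keys are match-keys
// (lowercased, whitespace collapsed) so lookups are case-insensitive.
const ENTITY_ALIASES: Record<string, string> = {
  "knowledge sdk": "KnowledgeSDK",
  "knowledgesdk": "KnowledgeSDK",
};

function canonicalizeEntity(name: string): string {
  const cleaned = name.trim().replace(/\s+/g, " ");
  return ENTITY_ALIASES[cleaned.toLowerCase()] ?? cleaned;
}
```

Pronoun-like references ("the company") still need an LLM disambiguation pass; string rules alone won't catch them.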

Start small and expand incrementally. Build the graph for one or two websites first, inspect the output manually, fix the extraction prompt, then scale up. The cost of fixing a bad schema after ingesting 500 sites is much higher than getting it right on 5.
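A quick way to do that manual inspection is to tally triples by predicate; a fragmented vocabulary (USES_TECHNOLOGY next to BUILT_WITH) shows up immediately in the counts. A small sketch:

```typescript
// Count how many triples use each predicate, so schema drift is
// visible before anything is written to Neo4j.
function summarizeByPredicate(triples: { predicate: string }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const t of triples) {
    counts[t.predicate] = (counts[t.predicate] ?? 0) + 1;
  }
  return counts;
}
```

Print the result after extracting your first site or two, and fix the prompt before scaling.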

Knowledge graphs are powerful, but they reward deliberate design. The good news: with KnowledgeSDK handling the extraction layer and Neo4j handling the graph layer, the plumbing is straightforward. The intelligence is in your schema and your queries.
