Every AI application that helps developers eventually runs into the same wall: the LLM's training data is months or years out of date. Stripe updates its API. GitHub ships new Actions syntax. OpenAI deprecates endpoints. Your AI assistant confidently answers with stale information, and your users lose trust.
The fix is a living knowledge base: a continuously updated index of the documentation your AI needs to know. This tutorial walks through exactly how to build one using KnowledgeSDK, covering multi-page documentation sites, versioned docs, and the edge cases that break naive scrapers.
Why Documentation Sites Are Hard to Scrape
API documentation presents unique challenges that make generic scrapers fail:
Multi-page depth. Stripe's documentation has hundreds of pages across dozens of product areas. A single fetch() call gets you the homepage. You need to follow internal links, detect section boundaries, and know when to stop.
JavaScript rendering. Modern docs frameworks — Docusaurus, Mintlify, ReadTheDocs, GitBook — render content client-side. A headless browser or a service that manages one is required to get the actual text.
Versioned content. Many docs expose version selectors: /docs/v2/ vs /docs/v3/. Your scraper needs a strategy for which version(s) to index and how to handle version drift.
Auth-protected docs. Internal documentation, private API references, or docs behind a paywall need session cookies or API keys passed through to the scraper.
Navigation structures. Docs use sidebars, collapsible sections, and tab components. You need to extract the logical hierarchy, not just the raw text, so your AI can reason about where content lives.
KnowledgeSDK handles the headless browser pool, JavaScript execution, and pagination automatically. Your job is to give it the right starting points and extraction strategy.
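To make "logical hierarchy" concrete: once a page is scraped to markdown, you can recover its structure from the headings. A minimal, SDK-independent sketch (the function name and sample page are invented for illustration):

```python
import re

def heading_outline(markdown: str) -> list[dict]:
    """Collect markdown headings as {level, title} entries so the
    indexed page keeps its logical structure, not just raw text."""
    outline = []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.+)", line)
        if match:
            outline.append({"level": len(match.group(1)), "title": match.group(2).strip()})
    return outline

page = "# Payments\n## Create a charge\nBody text\n### Idempotency\nMore text"
print(heading_outline(page))
# [{'level': 1, 'title': 'Payments'}, {'level': 2, 'title': 'Create a charge'}, {'level': 3, 'title': 'Idempotency'}]
```

Storing this outline alongside each page lets your AI answer "where does X live in the docs?" rather than just "what does X say?".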
Architecture: Docs as a Knowledge Base
Before writing code, let's define the system we're building:
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Doc Sources   │─────▶│   KnowledgeSDK   │─────▶│ Knowledge Base  │
│ (URLs + config) │      │ Scrape + Extract │      │ (indexed docs)  │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         │                                                  │
         │               ┌──────────────────┐               │
         └──────────────▶│ Webhook refresh  │◀──────────────┘
                         │  (change alerts) │
                         └──────────────────┘
The system has three parts:
- Initial indexing — full extraction of all doc pages
- Semantic search — answer AI queries against the indexed content
- Change detection — webhook-driven re-indexing when docs update
Step 1: Initial Documentation Extraction
Node.js
import KnowledgeSDK from '@knowledgesdk/node';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
// Define your documentation sources
const docSources = [
  {
    name: 'stripe',
    baseUrl: 'https://docs.stripe.com',
    startUrls: [
      'https://docs.stripe.com/api',
      'https://docs.stripe.com/payments',
      'https://docs.stripe.com/billing',
    ],
  },
  {
    name: 'github',
    baseUrl: 'https://docs.github.com',
    startUrls: [
      'https://docs.github.com/en/actions',
      'https://docs.github.com/en/rest',
    ],
  },
  {
    name: 'openai',
    baseUrl: 'https://platform.openai.com',
    startUrls: [
      'https://platform.openai.com/docs/api-reference',
      'https://platform.openai.com/docs/guides',
    ],
  },
];
async function indexDocSource(source) {
  console.log(`Indexing ${source.name}...`);
  const results = [];
  for (const startUrl of source.startUrls) {
    try {
      // Use extract for full multi-page extraction
      const extraction = await client.extract({
        url: startUrl,
        includeSubpages: true,
        maxDepth: 3,
        sameDomain: true, // Stay within the docs domain
        tags: [`source:${source.name}`, 'type:documentation'],
      });
      results.push({
        source: source.name,
        startUrl,
        pageCount: extraction.pages?.length ?? 1,
        content: extraction,
      });
      console.log(`  ✓ ${startUrl} — ${extraction.pages?.length ?? 1} pages`);
    } catch (err) {
      console.error(`  ✗ Failed to extract ${startUrl}:`, err.message);
    }
  }
  return results;
}

async function buildKnowledgeBase() {
  const allResults = [];
  for (const source of docSources) {
    const results = await indexDocSource(source);
    allResults.push(...results);
  }
  console.log(`\nKnowledge base built: ${allResults.length} doc sections indexed`);
  return allResults;
}

buildKnowledgeBase().catch(console.error);
Python
import os
import asyncio
from knowledgesdk import KnowledgeSDK
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
doc_sources = [
    {
        "name": "stripe",
        "base_url": "https://docs.stripe.com",
        "start_urls": [
            "https://docs.stripe.com/api",
            "https://docs.stripe.com/payments",
            "https://docs.stripe.com/billing",
        ],
    },
    {
        "name": "github",
        "base_url": "https://docs.github.com",
        "start_urls": [
            "https://docs.github.com/en/actions",
            "https://docs.github.com/en/rest",
        ],
    },
]
def index_doc_source(source: dict) -> list:
    # The client calls are blocking, so these are plain synchronous
    # functions — async def with no await would be misleading here.
    print(f"Indexing {source['name']}...")
    results = []
    for start_url in source["start_urls"]:
        try:
            extraction = client.extract(
                url=start_url,
                include_subpages=True,
                max_depth=3,
                same_domain=True,
                tags=[f"source:{source['name']}", "type:documentation"],
            )
            page_count = len(extraction.get("pages", [])) or 1
            results.append({
                "source": source["name"],
                "start_url": start_url,
                "page_count": page_count,
                "content": extraction,
            })
            print(f"  + {start_url} — {page_count} pages")
        except Exception as e:
            print(f"  - Failed to extract {start_url}: {e}")
    return results

def build_knowledge_base() -> list:
    all_results = []
    for source in doc_sources:
        all_results.extend(index_doc_source(source))
    print(f"\nKnowledge base built: {len(all_results)} doc sections indexed")
    return all_results

build_knowledge_base()
Step 2: Single-Page Scraping for Targeted Extraction
For cases where you know the exact page you want — an API reference endpoint, a changelog entry, a specific guide — use the scrape endpoint for faster, targeted extraction:
// Node.js: Scrape a specific API reference page
async function scrapeApiReference(url) {
  const result = await client.scrape({ url });
  return {
    url,
    markdown: result.markdown,
    title: result.title,
    scrapedAt: new Date().toISOString(),
  };
}

// Batch scrape a known list of URLs
async function batchScrapeDocPages(urls) {
  const CONCURRENCY = 5; // Respect rate limits
  const results = [];
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    const batch = urls.slice(i, i + CONCURRENCY);
    const batchResults = await Promise.all(batch.map(scrapeApiReference));
    results.push(...batchResults);
    // Brief pause between batches
    if (i + CONCURRENCY < urls.length) {
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }
  return results;
}
# Python: Batch scrape with concurrency control
import asyncio
from datetime import datetime, timezone
from typing import Dict, List

async def scrape_api_reference(url: str) -> Dict:
    # client.scrape is blocking, so run it in a worker thread to get
    # real concurrency from asyncio
    result = await asyncio.to_thread(client.scrape, url=url)
    return {
        "url": url,
        "markdown": result["markdown"],
        "title": result.get("title", ""),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

async def batch_scrape_doc_pages(urls: List[str], concurrency: int = 5) -> List[Dict]:
    semaphore = asyncio.Semaphore(concurrency)  # Respect rate limits

    async def scrape_with_limit(url):
        async with semaphore:
            return await scrape_api_reference(url)

    results = await asyncio.gather(
        *(scrape_with_limit(url) for url in urls),
        return_exceptions=True,
    )
    return [r for r in results if not isinstance(r, Exception)]
Step 3: Handling Versioned Documentation
Versioned docs require a strategy. You generally want to index the latest stable version as the default, and keep older versions tagged separately so backward-compatibility questions can still be answered:
async function indexVersionedDocs(baseUrl, versions) {
  const versionResults = {};
  for (const version of versions) {
    const versionUrl = `${baseUrl}/${version}`;
    console.log(`Indexing version ${version}...`);
    try {
      const extraction = await client.extract({
        url: versionUrl,
        includeSubpages: true,
        maxDepth: 4,
        sameDomain: true,
        // versions is ordered latest-first, so versions[0] gets the latest tag
        tags: [`version:${version}`, version === versions[0] ? 'version:latest' : 'version:legacy'],
      });
      versionResults[version] = extraction;
    } catch (err) {
      console.error(`Failed to index version ${version}:`, err.message);
    }
  }
  return versionResults;
}

// Example: Index multiple versions of your own API docs (latest first)
const versions = ['v3', 'v2', 'v1'];
const results = await indexVersionedDocs('https://docs.yourapi.com', versions);
Step 4: Auth-Protected Documentation
Internal docs behind a login require passing session credentials. KnowledgeSDK supports custom headers for this:
// Node.js: Scrape auth-protected internal docs
async function scrapeInternalDocs(url, sessionToken) {
  const result = await client.scrape({
    url,
    headers: {
      'Authorization': `Bearer ${sessionToken}`,
      'Cookie': `session=${sessionToken}`,
    },
  });
  return result;
}
# Python: Auth-protected docs
def scrape_internal_docs(url: str, session_token: str) -> Dict:
    result = client.scrape(
        url=url,
        headers={
            "Authorization": f"Bearer {session_token}",
            "Cookie": f"session={session_token}",
        },
    )
    return result
Step 5: Querying the Knowledge Base
Once your docs are indexed, you can answer developer questions semantically:
async function answerDocQuestion(question, sourceFilter) {
  const searchResults = await client.search({
    query: question,
    tags: sourceFilter ? [`source:${sourceFilter}`] : undefined,
    limit: 5,
  });
  if (searchResults.results.length === 0) {
    return { answer: null, sources: [] };
  }
  // Build context for your LLM
  const context = searchResults.results
    .map((r, i) => `[${i + 1}] ${r.title}\nSource: ${r.url}\n\n${r.content}`)
    .join('\n\n---\n\n');
  return {
    context,
    sources: searchResults.results.map(r => ({ title: r.title, url: r.url })),
    query: question,
  };
}

// Usage in an AI assistant
const { context, sources } = await answerDocQuestion(
  'How do I create a subscription with a trial period in Stripe?',
  'stripe'
);

// Pass context to your LLM
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content: `You are a helpful API documentation assistant. Use the following documentation context to answer the user's question accurately.\n\nContext:\n${context}`,
    },
    { role: 'user', content: 'How do I create a subscription with a trial period?' },
  ],
});
# Python: Query docs knowledge base
from openai import OpenAI

openai_client = OpenAI()

def answer_doc_question(question: str, source_filter: str | None = None) -> Dict:
    search_params = {"query": question, "limit": 5}
    if source_filter:
        search_params["tags"] = [f"source:{source_filter}"]
    results = client.search(**search_params)
    if not results.get("results"):
        return {"answer": None, "sources": []}
    # Build context for the LLM from the top search hits
    context = "\n\n---\n\n".join(
        f"[{i + 1}] {r['title']}\nSource: {r['url']}\n\n{r['content']}"
        for i, r in enumerate(results["results"])
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"You are a helpful API documentation assistant. Use the following documentation context to answer accurately.\n\nContext:\n{context}",
            },
            {"role": "user", "content": question},
        ],
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [{"title": r["title"], "url": r["url"]} for r in results["results"]],
    }

result = answer_doc_question(
    "How do I handle webhook signature verification in Stripe?",
    source_filter="stripe",
)
print(result["answer"])
Step 6: Keeping Docs Fresh with Webhooks
Static indexing goes stale. Set up a cron job or webhook listener to detect when documentation changes and re-index affected sections:
import express from 'express';
import cron from 'node-cron';

const app = express();
app.use(express.json());

// KnowledgeSDK webhook for change detection
app.post('/webhooks/docs-changed', async (req, res) => {
  const { event, url, changeType } = req.body;
  if (event !== 'content.changed') {
    return res.status(200).json({ received: true });
  }
  console.log(`Doc changed: ${url} (${changeType})`);
  try {
    // Re-scrape the changed page
    await client.scrape({ url });
    console.log(`Re-indexed: ${url}`);
    // Optionally notify your team (notifySlack is your own helper)
    await notifySlack(`Docs updated: ${url}`);
  } catch (err) {
    console.error('Re-index failed:', err.message);
  }
  res.status(200).json({ received: true });
});

// Schedule a weekly full re-index (Sundays at 02:00)
cron.schedule('0 2 * * 0', async () => {
  console.log('Running weekly docs re-index...');
  await buildKnowledgeBase();
});

app.listen(3000, () => console.log('Webhook listener on :3000'));
Production Considerations
Rate limiting your own requests. Documentation sites are maintained by developer teams who notice unusual traffic. Keep concurrency low (3-5 parallel requests), add delays between batches, and identify your scraper with a descriptive User-Agent if allowed.
Deduplication. The same content often appears at multiple URLs (canonical pages, print views, mobile versions). Deduplicate by content hash before indexing to avoid bloating your knowledge base with duplicates.
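A minimal sketch of hash-based deduplication, in plain Python with no SDK dependency (the url/markdown page shape and example URLs are assumptions for illustration):

```python
import hashlib

def dedupe_pages(pages: list[dict]) -> list[dict]:
    """Keep the first page for each distinct content hash; later URLs
    serving identical markdown (print views, mobile mirrors) are dropped."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(page["markdown"].strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = [
    {"url": "https://docs.example.com/api", "markdown": "# API\nBody"},
    {"url": "https://docs.example.com/api?print=1", "markdown": "# API\nBody"},
    {"url": "https://docs.example.com/guides", "markdown": "# Guides"},
]
print([p["url"] for p in dedupe_pages(pages)])
# ['https://docs.example.com/api', 'https://docs.example.com/guides']
```

Hashing the stripped markdown (rather than the raw HTML) makes the comparison robust to trivial whitespace differences between page variants.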
Chunking long pages. Reference documentation pages can be extremely long. Consider chunking large pages into logical sections (by heading) so semantic search returns precise, relevant chunks rather than entire multi-thousand-word pages.
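One way to chunk by heading, sketched in plain Python (the function and sample document are illustrative, not part of the SDK):

```python
import re

def chunk_by_heading(markdown: str, max_level: int = 2) -> list[dict]:
    """Split a page at headings up to max_level; deeper headings stay
    inside their parent chunk so sections remain self-contained."""
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s+(.+)")
    chunks = []
    current = {"heading": "", "body": []}
    for line in markdown.splitlines():
        match = pattern.match(line)
        if match:
            # Flush the previous chunk before starting a new one
            if current["heading"] or any(current["body"]):
                chunks.append({"heading": current["heading"],
                               "text": "\n".join(current["body"]).strip()})
            current = {"heading": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)
    chunks.append({"heading": current["heading"], "text": "\n".join(current["body"]).strip()})
    return chunks

doc = "# API\nIntro\n## Charges\nCharge text\n### Details\nKept with Charges\n## Refunds\nRefund text"
print([c["heading"] for c in chunk_by_heading(doc)])
# ['API', 'Charges', 'Refunds']
```

Splitting at level 2 and below keeps H3+ subsections attached to their parent, which tends to give semantic search chunks that are focused but still complete.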
Monitoring freshness. Track the lastScrapedAt timestamp for each doc source. Alert if any source hasn't been refreshed in more than 7 days.
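A freshness check can be a few lines of standard-library Python; this sketch assumes each source record carries its last-scrape time as an ISO-8601 string (mirroring the lastScrapedAt timestamp, renamed snake_case here):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)

def stale_sources(sources: list[dict], now: datetime) -> list[str]:
    """Names of sources not refreshed within the staleness threshold."""
    return [
        s["name"]
        for s in sources
        if now - datetime.fromisoformat(s["last_scraped_at"]) > STALE_AFTER
    ]

sources = [
    {"name": "stripe", "last_scraped_at": "2024-06-01T00:00:00+00:00"},
    {"name": "github", "last_scraped_at": "2024-06-20T00:00:00+00:00"},
]
print(stale_sources(sources, now=datetime(2024, 6, 21, tzinfo=timezone.utc)))
# ['stripe']
```

Run it from the same cron job that does the weekly re-index and pipe any hits into your alerting channel.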
Version pinning for stability. If your application targets a specific API version, pin your knowledge base to that version's documentation. Avoid mixing v2 and v3 docs in the same index — it confuses semantic search.
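If you do index several versions, you can enforce the pin at query time by post-filtering search hits on the version tags applied in Step 3 (pin_version is a hypothetical helper; the hit records are illustrative):

```python
def pin_version(results: list[dict], version: str) -> list[dict]:
    """Keep only hits carrying the pinned version tag, so v2 and v3
    docs never mix in the same LLM context."""
    tag = f"version:{version}"
    return [r for r in results if tag in r.get("tags", [])]

hits = [
    {"title": "Create charge", "url": "https://docs.yourapi.com/v3/charges",
     "tags": ["version:v3", "version:latest"]},
    {"title": "Create charge (old)", "url": "https://docs.yourapi.com/v2/charges",
     "tags": ["version:v2", "version:legacy"]},
]
print([h["title"] for h in pin_version(hits, "v3")])
# ['Create charge']
```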
FAQ
Can I scrape docs behind a login or paywall? Yes, for documentation you have legitimate access to. Pass your session cookie or API key in the request headers. KnowledgeSDK will use them when fetching the page. Only scrape content you are authorized to access.
How do I handle docs that use client-side navigation? KnowledgeSDK runs a full headless browser, so client-side navigation (React Router, Vue Router, Next.js) is handled automatically. The page executes its JavaScript before the content is extracted.
How often should I re-index documentation? For actively maintained APIs like Stripe or GitHub, a weekly full re-index plus webhook-triggered spot updates works well. For more stable libraries, monthly re-indexing is usually sufficient.
What's the difference between /v1/scrape and /v1/extract? /v1/scrape processes a single URL and returns its content as clean markdown. /v1/extract follows links and processes an entire documentation section across multiple pages. Use scrape for targeted updates; use extract for initial full indexing.
Can I index docs in multiple languages? Yes. Scrape the language-specific URLs (e.g., /en/, /fr/) and tag each page with its language. Semantic search works across languages — embeddings capture meaning, not just keywords.
How do I handle pagination in long reference pages? KnowledgeSDK handles infinite scroll and paginated content automatically through its headless browser. For docs that split content across numbered pages, use the extract endpoint with the first page as the starting URL.
A living documentation knowledge base turns your AI assistant from a static chatbot into an accurate, up-to-date developer tool. Get started at knowledgesdk.com/setup and have your first documentation source indexed in under five minutes.