Compliance questions come in at inconvenient times. "Does our data retention policy cover backups?" "What's the SLA for incident notification under our enterprise agreement?" "Which version of our privacy policy applies to EU customers?"
The person asking needs an answer now. Your compliance team is busy. And the documents that contain the answers are scattered across an internal wiki, a regulatory portal, a terms-of-service page, and three PDFs that were last updated in 2023.
Most compliance chatbots are built by exporting those documents, pasting them into a vector database, and calling it done. That works for about six months, until the first policy page gets updated, and nobody remembers to re-index it.
This tutorial covers a better approach: a compliance chatbot that reads your policy pages directly from your website, stays current when content changes, and answers questions with citations back to the original source.
Why Websites, Not Documents
Your compliance documentation is almost certainly already published somewhere: your company's policy page, your terms of service, your privacy notice, regulatory portals you must comply with (GDPR guidance, FTC rules, SEC regulations). These pages are authoritative. They're what your legal team and your customers point to.
The problem with exporting documents is drift. Pages get updated, exports don't. A compliance chatbot operating on stale data is worse than no chatbot — it gives confident wrong answers.
Extracting directly from live URLs solves the freshness problem. Every extraction pulls from the current version of the page.
Architecture
The flow has three stages:
- Extraction: URLs → KnowledgeSDK → indexed, searchable knowledge
- Query: user question → semantic search over indexed knowledge → relevant chunks with source URLs
- Generation: chunks + question → LLM → answer with citations
No vector database to manage. No embedding pipeline. No chunking decisions. KnowledgeSDK handles all of that — including JavaScript rendering for pages that load content dynamically.
Step 1: Identify Your Sources
Start by listing the URLs that contain your compliance content. For most organizations this includes:
- Company policy pages (/privacy, /terms, /acceptable-use, /data-processing)
- Regulatory guidance you must follow (GDPR Article 5, CCPA consumer rights pages)
- Industry standards your certifications reference (SOC 2 trust principles, ISO 27001 controls)
- Enterprise agreement templates if you publish them
Be specific. A page like https://company.com/legal that just links to other pages is less useful than the actual policy pages it links to. Use POST /v1/sitemap to discover all pages under a legal or policy section:
```javascript
const ks = require('@knowledgesdk/node');

const client = new ks.KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });

// Discover all policy pages automatically
const sitemap = await client.sitemap({ url: 'https://company.com/legal' });
console.log(sitemap.urls); // All URLs under /legal
```
Review the list and filter to pages that contain actual compliance content, not navigation or marketing copy.
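That review step can be partially automated. A minimal filter over the discovered URLs might look like the sketch below — the include/exclude patterns are illustrative assumptions, not anything KnowledgeSDK provides; adjust them to your site's actual URL structure:

```javascript
// Keep only URLs that look like actual policy content; drop navigation,
// index, and marketing pages. These patterns are examples — tune them
// to match your own site.
const INCLUDE = ['/privacy', '/terms', '/policies/', '/legal/'];
const EXCLUDE = ['/legal/index', '/legal/contact', '/careers'];

function filterComplianceUrls(urls) {
  return urls.filter(
    (url) =>
      INCLUDE.some((p) => url.includes(p)) &&
      !EXCLUDE.some((p) => url.includes(p))
  );
}
```

Run the sitemap output through this filter, then spot-check the survivors by hand before extracting.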
Step 2: Extract Compliance Sources
With your URL list ready, extract each page. KnowledgeSDK renders JavaScript (so dynamic content loads), extracts clean markdown, and indexes it automatically against your API key. Every subsequent search then runs across all extracted pages.
```javascript
const sources = [
  'https://company.com/policies/data-retention',
  'https://company.com/policies/acceptable-use',
  'https://company.com/terms',
  'https://company.com/privacy',
  'https://gdpr.eu/article-5-how-to-process-personal-data/',
];

for (const url of sources) {
  const result = await client.extract({ url });
  console.log(`Extracted: ${result.title} (${result.word_count} words)`);
}
```
For large regulatory sites with hundreds of pages, use the async endpoint to avoid blocking:
```javascript
// Fire and forget — KnowledgeSDK handles it in the background
for (const url of sources) {
  const job = await client.extractAsync({ url });
  console.log(`Job started: ${job.jobId}`);
}
```
Extraction typically takes 15–60 seconds per page depending on page complexity. Run this once to populate your knowledge base, then set up scheduled re-extraction for pages that change.
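If you need to know when an async job actually finishes, a small polling helper works. The tutorial doesn't show KnowledgeSDK's job-status call, so this sketch takes `getStatus` as an injected function — wire it to whatever status endpoint your SDK version exposes:

```javascript
// Poll until a job reaches a terminal state. `getStatus` is injected
// because the job-status API isn't shown in this tutorial — pass in a
// function that returns 'running', 'completed', or 'failed'.
async function waitForJob(getStatus, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status === 'completed' || status === 'failed') return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Job did not finish in time');
}
```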
Step 3: Build the Chatbot Endpoint
With sources indexed, build an Express.js endpoint that accepts compliance questions and returns answers with citations.
```javascript
const express = require('express');
const ks = require('@knowledgesdk/node');
const OpenAI = require('openai');

const app = express();
app.use(express.json());

const client = new ks.KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.post('/compliance/ask', async (req, res) => {
  const { question } = req.body;
  if (!question) {
    return res.status(400).json({ error: 'Question is required' });
  }

  // Search indexed compliance knowledge
  const searchResults = await client.search({
    query: question,
    limit: 5,
  });

  if (!searchResults.results || searchResults.results.length === 0) {
    return res.json({
      answer: "I don't have information about that in the indexed compliance sources. Please consult your legal team.",
      citations: [],
    });
  }

  // Assemble context with source attribution
  const context = searchResults.results
    .map((r, i) => `[Source ${i + 1}: ${r.source_url}]\n${r.content}`)
    .join('\n\n---\n\n');

  const citations = searchResults.results.map((r) => ({
    url: r.source_url,
    title: r.title,
    excerpt: r.content.slice(0, 200) + '...',
  }));

  // Generate answer with strict grounding
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a compliance assistant. Answer questions ONLY using the provided source documents.
If the answer is not covered in the sources, say "This topic is not covered in the indexed compliance sources — please consult your legal team."
Never infer, speculate, or add information not present in the sources.
Always reference which source supports each point in your answer.
This is not legal advice. For high-stakes decisions, always involve qualified legal counsel.`,
      },
      {
        role: 'user',
        content: `Compliance question: ${question}\n\nSources:\n${context}`,
      },
    ],
  });

  return res.json({
    answer: completion.choices[0].message.content,
    citations,
    disclaimer: 'This response is generated from indexed compliance documents and is not legal advice.',
  });
});

app.listen(3000, () => console.log('Compliance chatbot running on port 3000'));
```
The search step returns chunks from your indexed sources, each with a source_url field. You pass those chunks as context to the LLM, and the LLM generates a grounded answer. If a user asks something outside the indexed sources, the system prompt instructs the model to say so rather than guess.
Step 4: Keep Sources Fresh
Compliance documents change. GDPR guidance gets updated. Your privacy policy gets revised. Your enterprise terms change with a new contract year. A compliance chatbot that doesn't track these changes is a liability.
Two strategies for freshness:
Webhook-triggered re-extraction: If your CMS or website platform supports webhooks on content changes, subscribe to them and trigger re-extraction when a policy page is updated.
```javascript
// Webhook handler — trigger when policy pages change
app.post('/webhooks/content-change', async (req, res) => {
  const { url } = req.body;

  // Only re-extract if it's a compliance-relevant page
  const compliancePatterns = ['/privacy', '/terms', '/policies', '/legal'];
  const isCompliance = compliancePatterns.some((p) => url.includes(p));

  if (isCompliance) {
    await client.extract({ url });
    console.log(`Re-extracted: ${url}`);
  }

  res.sendStatus(200);
});
```
Scheduled re-extraction: For external regulatory sites that don't offer webhooks, schedule a monthly re-extraction job. Most regulatory guidance changes infrequently, but catching a change within 30 days is better than never catching it.
```javascript
// Run monthly — cron "0 0 1 * *"
async function refreshComplianceSources() {
  const sources = await getComplianceSourceList(); // from your config
  for (const url of sources) {
    await client.extract({ url });
  }
  console.log(`Refreshed ${sources.length} compliance sources`);
}
```
Step 5: Surface Citations in the UI
Every search result includes source_url. Show it. Users need to know where answers come from — both to verify accuracy and to find the full document when they need more context.
A good citation display includes the source URL, the page title, and a short excerpt from the relevant chunk. This makes it clear the answer isn't invented — it came from a specific location in a specific document.
For high-stakes compliance decisions, encourage users to click through to the source and read the full section. The chatbot surfaces the right document; the human reads it and makes the judgment call.
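If your UI renders markdown, one way to display the citation objects the endpoint returns ({ url, title, excerpt }) is a simple numbered list with a quoted excerpt — a sketch, not the only layout:

```javascript
// Render citation objects as a markdown list: linked title plus a
// blockquoted excerpt, so users can verify and click through.
function renderCitations(citations) {
  if (citations.length === 0) return '_No sources cited._';
  return citations
    .map((c, i) => `${i + 1}. [${c.title}](${c.url})\n   > ${c.excerpt}`)
    .join('\n');
}
```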
Guardrails and Limitations
The system prompt above includes the most important guardrail: only answer from provided sources. But there are a few more worth adding:
- Jurisdiction scoping: "These sources cover [Company Name] policies and EU GDPR. They may not reflect requirements in other jurisdictions."
- Date awareness: Include the extraction date in your context so the LLM can note "as of [date]" in its answer.
- Escalation path: For any answer that could have legal consequences (contract disputes, regulatory investigations, employment matters), add a prompt: "For decisions with significant legal or financial consequences, please review with qualified legal counsel."
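For the date-awareness guardrail, one approach is to stamp each chunk with its extraction date when assembling context, so the model can qualify answers with "as of [date]". The `extracted_at` field here is an assumption — use whatever timestamp your extraction records actually carry:

```javascript
// Prefix each source chunk with its extraction date so the model can
// say "as of [date]" in answers. `extracted_at` is assumed to be an
// ISO timestamp stored alongside each indexed chunk.
function buildDatedContext(results) {
  return results
    .map((r, i) => {
      const date = r.extracted_at ? r.extracted_at.slice(0, 10) : 'unknown date';
      return `[Source ${i + 1}: ${r.source_url}, extracted ${date}]\n${r.content}`;
    })
    .join('\n\n---\n\n');
}
```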
This is a research and discovery tool, not a substitute for legal advice. Make that clear in the UI, in the disclaimer returned with every response, and in any documentation you write for internal users.
Extending the System
Once the basic chatbot is working, a few extensions add significant value:
Auto-discover all policy pages: Instead of maintaining a manual list of URLs, use POST /v1/sitemap to discover every page under your legal and policy sections automatically. Filter by URL pattern and extract everything matching /legal/*, /policies/*, /terms/*.
Multi-source compliance: Index sources from multiple domains — your company policies, relevant regulatory portals, industry standards. The search endpoint queries all indexed content simultaneously, so users get answers that synthesize across sources.
Audit logging: Log every question and answer with the source citations used. If your compliance team needs to audit what the chatbot told users, you have a complete record.
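A minimal audit record only needs the timestamp, the question, the answer, and the source URLs that backed it. A sketch of the record shape — where you persist it (database, append-only log) is up to you:

```javascript
// Build one audit record per question/answer exchange. Persisting it
// (database row, append-only log line) is left to your infrastructure.
function buildAuditRecord(question, answer, citations, now = new Date()) {
  return {
    timestamp: now.toISOString(),
    question,
    answer,
    sources: citations.map((c) => c.url),
  };
}
```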
The goal is a compliance assistant that's genuinely useful: one that knows your current policies, can find the right passage in seconds, and always points users back to the authoritative source. That's a different thing from a chatbot that guesses based on training data — and for compliance, the difference matters.