Competitive intelligence at scale has a tooling problem. Manual checking does not scale past 5-6 competitors. RSS feeds are absent from most modern sites. Custom polling scripts generate enormous amounts of unchanged data. And none of these approaches give you the ability to ask semantic questions across all your monitored content at once.
This guide walks through a complete system: build a corpus from 50 competitor sites, register change-detection webhooks, and run semantic queries like "which competitors added enterprise pricing this month?"
## What You Are Building
The system has four components:
- A URL corpus — the specific pages you want to monitor (pricing, features, blog, docs, careers)
- Bulk extraction — indexing all URLs into a searchable knowledge base on first run
- Change detection webhooks — re-index only when content actually changes
- Semantic search — query across all indexed content with natural language
## Step 1: Define Your Competitor URL Corpus
Start with a structured list. Most competitive intelligence programs care about a predictable set of page types:
```typescript
interface CompetitorPages {
  competitor: string;
  urls: string[];
}

const corpus: CompetitorPages[] = [
  {
    competitor: "competitorA",
    urls: [
      "https://competitorA.com/pricing",
      "https://competitorA.com/features",
      "https://competitorA.com/about",
      "https://competitorA.com/blog",
      "https://competitorA.com/docs",
      "https://competitorA.com/careers",
    ],
  },
  {
    competitor: "competitorB",
    urls: [
      "https://competitorB.com/pricing",
      "https://competitorB.com/product",
      "https://competitorB.com/changelog",
      "https://competitorB.com/api",
    ],
  },
  // ... up to 50 competitors
];

const allUrls = corpus.flatMap((c) => c.urls);
console.log(`Total URLs to monitor: ${allUrls.length}`);
```
For discovering URLs you have not enumerated, use the sitemap endpoint first:
```typescript
import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function discoverUrls(domain: string): Promise<string[]> {
  const sitemap = await client.sitemap(`https://${domain}`);
  // Filter to high-value page types
  return sitemap.urls.filter((url) => {
    const path = new URL(url).pathname.toLowerCase();
    return (
      path.includes("/pricing") ||
      path.includes("/features") ||
      path.includes("/product") ||
      path.includes("/changelog") ||
      path.includes("/about") ||
      path.includes("/docs")
    );
  });
}
```
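Discovered URLs can then be merged with the hand-curated corpus before extraction. A small sketch that deduplicates on a normalized URL (a hypothetical helper, not part of the SDK):

```typescript
// Merge hand-curated URLs with sitemap-discovered ones, dropping duplicates.
// Normalizing (lowercased host, no trailing slash, no fragment) avoids
// indexing the same page twice under slightly different URLs.
function mergeUrlSets(curated: string[], discovered: string[]): string[] {
  const normalize = (raw: string): string => {
    const u = new URL(raw);
    return `${u.protocol}//${u.host.toLowerCase()}${u.pathname.replace(/\/$/, "")}`;
  };
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const url of [...curated, ...discovered]) {
    const key = normalize(url);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(url);
    }
  }
  return merged;
}
```

Curated URLs come first, so when a page appears in both lists the hand-picked form wins.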
## Step 2: Bulk Extraction — Build the Baseline Corpus
Extract all URLs on first run. Use async extraction with a callback to handle the volume without blocking:
```typescript
async function buildBaselineCorpus(urls: string[]) {
  const BATCH_SIZE = 10;
  const DELAY_MS = 500;
  const results = { succeeded: 0, failed: 0, jobs: [] as string[] };

  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    const batchResults = await Promise.allSettled(
      batch.map(async (url) => {
        const job = await client.extractAsync(url, {
          callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/extraction-complete`,
        });
        return job.jobId;
      })
    );

    for (const result of batchResults) {
      if (result.status === "fulfilled") {
        results.succeeded++;
        results.jobs.push(result.value);
      } else {
        results.failed++;
        console.error("Extraction failed:", result.reason);
      }
    }

    console.log(`Progress: ${Math.min(i + BATCH_SIZE, urls.length)}/${urls.length}`);
    if (i + BATCH_SIZE < urls.length) {
      await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
    }
  }

  console.log(`Extraction queued: ${results.succeeded} succeeded, ${results.failed} failed`);
  return results;
}

await buildBaselineCorpus(allUrls);
```
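The `callbackUrl` above points at an `/webhooks/extraction-complete` endpoint that has not been defined yet. A minimal sketch of that handler; the payload fields (`jobId`, `url`, `status`, `error`) are assumptions to verify against what KnowledgeSDK actually posts:

```typescript
// Hypothetical callback payload shape; check it against the real callback body.
interface ExtractionCallback {
  jobId: string;
  url: string;
  status: "completed" | "failed";
  error?: string;
}

// In-memory job ledger; swap for a database in production.
const jobLedger = new Map<string, ExtractionCallback>();

function handleExtractionComplete(payload: ExtractionCallback): void {
  jobLedger.set(payload.jobId, payload);
  if (payload.status === "failed") {
    console.error(`Extraction failed for ${payload.url}: ${payload.error ?? "unknown"}`);
  }
}

// Wire it into the same Express app used for change webhooks:
// app.post("/webhooks/extraction-complete", (req, res) => {
//   handleExtractionComplete(req.body);
//   res.status(200).json({ received: true });
// });
```

Comparing `jobLedger.size` against the number of queued jobs tells you when the baseline build has finished.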
## Step 3: Register Change Detection Webhooks
After the baseline corpus is built, register webhooks so you only re-index when content actually changes:
```typescript
async function registerCompetitorWebhooks(urls: string[]) {
  const webhooks = await Promise.all(
    urls.map((url) =>
      client.webhooks.create({
        url,
        callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/content-changed`,
        events: ["content.changed"],
      })
    )
  );
  console.log(`Registered ${webhooks.length} change detection webhooks`);
  return webhooks;
}

await registerCompetitorWebhooks(allUrls);
```
Your webhook handler re-indexes the changed URL and optionally triggers an LLM analysis:
```typescript
import express from "express";
import Anthropic from "@anthropic-ai/sdk";

const app = express();
app.use(express.json());

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

app.post("/webhooks/content-changed", async (req, res) => {
  // Acknowledge immediately so the sender does not retry while we process
  res.status(200).json({ received: true });

  const { url, content, previousContent, event } = req.body;
  if (event !== "content.changed") return;

  // Re-index the updated content
  await client.extract(url);
  console.log(`Re-indexed: ${url}`);

  // Generate change summary with LLM
  const summary = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: `A competitor page changed. What changed and is it competitively significant?

URL: ${url}
Previous: ${previousContent?.slice(0, 1000)}
Current: ${content?.slice(0, 1000)}

Respond with: what changed, and whether it indicates a pricing, feature, or positioning change.`,
      },
    ],
  });

  const summaryText =
    summary.content[0].type === "text" ? summary.content[0].text : "";
  console.log(`Change summary for ${url}:\n${summaryText}`);

  // Store or send alert (Slack, email, etc.)
});
```
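The closing comment can be filled in with, for example, a Slack incoming webhook. A sketch, assuming a `SLACK_WEBHOOK_URL` environment variable holds the webhook URL (Node 18+ provides a global `fetch`):

```typescript
// Build the minimal Slack incoming-webhook payload ({ "text": ... }).
function formatSlackAlert(url: string, summary: string): { text: string } {
  return { text: `Competitor page changed: ${url}\n${summary}` };
}

// Post the alert; silently skips when no webhook URL is configured.
async function sendSlackAlert(url: string, summary: string): Promise<void> {
  const webhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!webhookUrl) return;
  await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(formatSlackAlert(url, summary)),
  });
}
```

Calling `sendSlackAlert(url, summaryText)` where the handler currently logs the summary turns each significant change into a team notification.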
## Step 4: Semantic Search Across All Monitored Content
With the corpus indexed, you can run natural language queries across all 50 competitors at once:
```typescript
async function competitorSearch(query: string) {
  const results = await client.search(query, { limit: 10 });
  return results.items.map((item) => ({
    source: item.sourceUrl,
    relevanceScore: item.score,
    excerpt: item.snippet,
    title: item.title,
  }));
}
```
The same wrapper in Python:

```python
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])


def competitor_search(query: str):
    results = client.search(query, limit=10)
    return [
        {
            "source": item.source_url,
            "score": item.score,
            "excerpt": item.snippet,
            "title": item.title,
        }
        for item in results.items
    ]
```
## Real Query Examples
Once your corpus is indexed, these queries become answerable:
```typescript
// Pricing intelligence
const pricingResults = await competitorSearch(
  "enterprise plan pricing annual commitment"
);

// Feature tracking
const featureResults = await competitorSearch(
  "SSO SAML support single sign-on"
);

// Positioning changes
const positioningResults = await competitorSearch(
  "AI-powered machine learning automation"
);

// Hiring signal (from careers pages)
const hiringResults = await competitorSearch(
  "engineering manager platform infrastructure"
);

// API and developer focus
const apiResults = await competitorSearch(
  "API rate limits webhook developer integration"
);
```
Each of these runs against your actual indexed competitor pages — not a third-party search index that may or may not include the pages you care about.
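Because every hit carries its source URL, results can also be attributed back to the competitor that owns them using the `corpus` array from Step 1. A sketch, with a hypothetical `SearchHit` type matching the `competitorSearch` return shape:

```typescript
interface SearchHit {
  source: string;
  relevanceScore: number;
  excerpt: string;
  title: string;
}

// Group search hits by the competitor whose hostname matches the source URL.
function groupByCompetitor(
  hits: SearchHit[],
  corpus: { competitor: string; urls: string[] }[]
): Record<string, SearchHit[]> {
  const hostToCompetitor = new Map<string, string>();
  for (const entry of corpus) {
    for (const url of entry.urls) {
      hostToCompetitor.set(new URL(url).host, entry.competitor);
    }
  }
  const grouped: Record<string, SearchHit[]> = {};
  for (const hit of hits) {
    const owner = hostToCompetitor.get(new URL(hit.source).host) ?? "unknown";
    (grouped[owner] ??= []).push(hit);
  }
  return grouped;
}
```

Grouping like this turns a flat relevance list into a per-competitor view, which is usually how the findings get reported.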
## Extending with an LLM Digest
The most useful production pattern combines periodic semantic search with LLM synthesis:
```typescript
async function weeklyCompetitiveDigest() {
  const queries = [
    "pricing changes tier updates enterprise",
    "new features product announcements",
    "API developer platform updates",
    "hiring plans engineering growth",
  ];

  const allResults = await Promise.all(
    queries.map(async (query) => {
      const results = await competitorSearch(query);
      return { query, results: results.slice(0, 3) };
    })
  );

  const context = allResults
    .map(
      ({ query, results }) =>
        `### ${query}\n${results.map((r) => `- [${r.source}] ${r.excerpt}`).join("\n")}`
    )
    .join("\n\n");

  const digest = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 800,
    messages: [
      {
        role: "user",
        content: `Based on the following content from competitor websites, write a concise weekly competitive intelligence digest. Focus on actionable observations.\n\n${context}`,
      },
    ],
  });

  return digest.content[0].type === "text" ? digest.content[0].text : "";
}
```
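Any scheduler can drive this. A stdlib-only sketch using timers (an OS cron job or a workflow scheduler works just as well; `weeklyCompetitiveDigest` is the function defined above):

```typescript
// Milliseconds until the next local-time occurrence of weekday/hour.
// weekday uses Date#getDay() numbering: 0 = Sunday ... 6 = Saturday.
function msUntilNext(weekday: number, hour: number, now: Date = new Date()): number {
  const next = new Date(now);
  next.setHours(hour, 0, 0, 0);
  while (next.getDay() !== weekday || next.getTime() <= now.getTime()) {
    next.setDate(next.getDate() + 1);
  }
  return next.getTime() - now.getTime();
}

// Run the digest every Monday at 09:00 local time, then weekly thereafter.
function scheduleWeeklyDigest(run: () => Promise<unknown>): void {
  setTimeout(() => {
    run().catch(console.error);
    setInterval(() => {
      run().catch(console.error);
    }, 7 * 24 * 60 * 60 * 1000);
  }, msUntilNext(1, 9));
}

// scheduleWeeklyDigest(weeklyCompetitiveDigest);
```

The fixed 7-day interval drifts across DST boundaries; if exact local times matter, recompute `msUntilNext` after each run instead.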
## Cost Estimate
Polling approach (50 pages × hourly):
- 50 pages × 24 polls/day × $0.001/request = $1.20/day, or $36/month in polling costs alone
- Plus: LLM calls on every changed page regardless of significance
- Plus: storage and diffing infrastructure
KnowledgeSDK webhook approach:
- Initial 50 extractions: covered by $29/month plan
- Webhook notifications: fire only when content actually changes
- LLM calls: only on changed pages
- Monthly cost: $29/month plan + LLM costs proportional to actual change frequency
For competitors whose pricing pages change once a month, the webhook model is approximately 720x more efficient per page than hourly polling.
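The arithmetic is easy to re-run for your own corpus size and polling interval. A quick sketch, using the $0.001/request assumption from the estimate above:

```typescript
// Monthly request count for a polling strategy (30-day month).
function monthlyRequests(pages: number, pollsPerDay: number): number {
  return pages * pollsPerDay * 30;
}

// Monthly polling cost in dollars.
function pollingCostPerMonth(
  pages: number,
  pollsPerDay: number,
  costPerRequest: number
): number {
  return monthlyRequests(pages, pollsPerDay) * costPerRequest;
}

// 50 pages polled hourly: 36,000 requests/month, $36/month at $0.001/request.
// Under webhooks, pages that change once a month cost ~50 re-indexes instead.
```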
## Summary
A competitive intelligence system built on KnowledgeSDK has three phases:
- Build — extract your competitor URL corpus on first run
- Monitor — register webhooks for change detection; re-index on change
- Search — run semantic queries across all indexed content on demand
The result is a searchable knowledge base of your competitor landscape that updates reactively rather than on a polling schedule, and that you can query with natural language rather than keyword searches.
```bash
npm install @knowledgesdk/node
pip install knowledgesdk
```