knowledgesdk.com/blog/monitor-competitor-websites-semantically
Tutorials · March 22, 2026 · 8 min read

How to Monitor 50 Competitor Websites and Search Them Semantically

A practical guide to building a competitive intelligence system that extracts, indexes, and monitors competitor web content — with semantic search and change detection webhooks.


Competitive intelligence at scale has a tooling problem. Manual checking does not scale past five or six competitors. RSS feeds are absent from most modern sites. Custom polling scripts re-fetch mostly unchanged pages, producing noise instead of signal. And none of these approaches lets you ask semantic questions across all your monitored content at once.

This guide walks through a complete system: build a competitive corpus from 50 competitor sites, register change detection, and run semantic queries like "which competitors added enterprise pricing this month?"


What You Are Building

The system has four components:

  1. A URL corpus — the specific pages you want to monitor (pricing, features, blog, docs, careers)
  2. Bulk extraction — indexing all URLs into a searchable knowledge base on first run
  3. Change detection webhooks — re-index only when content actually changes
  4. Semantic search — query across all indexed content with natural language

Step 1: Define Your Competitor URL Corpus

Start with a structured list. Most competitive intelligence programs care about a predictable set of page types:

interface CompetitorPages {
  competitor: string;
  urls: string[];
}

const corpus: CompetitorPages[] = [
  {
    competitor: "competitorA",
    urls: [
      "https://competitorA.com/pricing",
      "https://competitorA.com/features",
      "https://competitorA.com/about",
      "https://competitorA.com/blog",
      "https://competitorA.com/docs",
      "https://competitorA.com/careers",
    ],
  },
  {
    competitor: "competitorB",
    urls: [
      "https://competitorB.com/pricing",
      "https://competitorB.com/product",
      "https://competitorB.com/changelog",
      "https://competitorB.com/api",
    ],
  },
  // ... up to 50 competitors
];

const allUrls = corpus.flatMap((c) => c.urls);
console.log(`Total URLs to monitor: ${allUrls.length}`);

For discovering URLs you have not enumerated, use the sitemap endpoint first:

import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function discoverUrls(domain: string): Promise<string[]> {
  const sitemap = await client.sitemap(`https://${domain}`);
  // Filter to high-value page types
  return sitemap.urls.filter((url) => {
    const path = new URL(url).pathname.toLowerCase();
    return (
      path.includes("/pricing") ||
      path.includes("/features") ||
      path.includes("/product") ||
      path.includes("/changelog") ||
      path.includes("/about") ||
      path.includes("/docs")
    );
  });
}
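Discovered URLs can then be folded into the corpus without duplicating pages you already track. The `mergeDiscovered` helper below is not part of the SDK — just a small utility sketch:

```typescript
// Merge sitemap-discovered URLs into an existing URL list, skipping duplicates.
// Illustrative helper, not an SDK function.
function mergeDiscovered(existing: string[], discovered: string[]): string[] {
  const seen = new Set(existing);
  return [...existing, ...discovered.filter((url) => !seen.has(url))];
}

const current = ["https://competitorA.com/pricing", "https://competitorA.com/docs"];
const found = ["https://competitorA.com/pricing", "https://competitorA.com/changelog"];
console.log(mergeDiscovered(current, found));
// ["https://competitorA.com/pricing", "https://competitorA.com/docs", "https://competitorA.com/changelog"]
```

Run this against `discoverUrls(domain)` output before each re-baseline so new pages enter the corpus without re-extracting pages already indexed.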

Step 2: Bulk Extraction — Build the Baseline Corpus

Extract all URLs on first run. Use async extraction with a callback to handle the volume without blocking:

async function buildBaselineCorpus(urls: string[]) {
  const BATCH_SIZE = 10;
  const DELAY_MS = 500;
  const results = { succeeded: 0, failed: 0, jobs: [] as string[] };

  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);

    const batchResults = await Promise.allSettled(
      batch.map(async (url) => {
        const job = await client.extractAsync(url, {
          callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/extraction-complete`,
        });
        return job.jobId;
      })
    );

    for (const result of batchResults) {
      if (result.status === "fulfilled") {
        results.succeeded++;
        results.jobs.push(result.value);
      } else {
        results.failed++;
        console.error("Extraction failed:", result.reason);
      }
    }

    console.log(`Progress: ${Math.min(i + BATCH_SIZE, urls.length)}/${urls.length}`);

    if (i + BATCH_SIZE < urls.length) {
      await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
    }
  }

  console.log(`Extraction queued: ${results.succeeded} succeeded, ${results.failed} failed`);
  return results;
}

await buildBaselineCorpus(allUrls);

Step 3: Register Change Detection Webhooks

After the baseline corpus is built, register webhooks so you only re-index when content actually changes:

async function registerCompetitorWebhooks(urls: string[]) {
  const webhooks = await Promise.all(
    urls.map((url) =>
      client.webhooks.create({
        url,
        callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/content-changed`,
        events: ["content.changed"],
      })
    )
  );

  console.log(`Registered ${webhooks.length} change detection webhooks`);
  return webhooks;
}

await registerCompetitorWebhooks(allUrls);

Your webhook handler re-indexes the changed URL and optionally triggers an LLM analysis:

import express from "express";
import Anthropic from "@anthropic-ai/sdk";

const app = express();
app.use(express.json());
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

app.post("/webhooks/content-changed", async (req, res) => {
  res.status(200).json({ received: true });

  const { url, content, previousContent, event } = req.body;
  if (event !== "content.changed") return;

  // Re-index the updated content
  await client.extract(url);
  console.log(`Re-indexed: ${url}`);

  // Generate change summary with LLM
  const summary = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: `A competitor page changed. What changed and is it competitively significant?

URL: ${url}
Previous: ${previousContent?.slice(0, 1000)}
Current: ${content?.slice(0, 1000)}

Respond with: what changed, and whether it indicates a pricing, feature, or positioning change.`,
      },
    ],
  });

  const summaryText =
    summary.content[0].type === "text" ? summary.content[0].text : "";
  console.log(`Change summary for ${url}:\n${summaryText}`);

  // Store or send alert (Slack, email, etc.)
});
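The final step in the handler — sending the alert — can be as simple as posting to a Slack incoming webhook. A sketch, assuming a `SLACK_WEBHOOK_URL` environment variable (create the incoming webhook in your Slack workspace first) and Node 18+ for built-in `fetch`:

```typescript
// Format a change alert for Slack.
function formatAlert(url: string, summary: string): string {
  return `:rotating_light: Competitor page changed\n${url}\n\n${summary}`;
}

// Post the alert to a Slack incoming webhook (SLACK_WEBHOOK_URL is an assumption).
async function sendSlackAlert(url: string, summary: string): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: formatAlert(url, summary) }),
  });
}
```

Call `sendSlackAlert(url, summaryText)` where the handler above currently logs the change summary.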

Step 4: Semantic Search Across All Monitored Content

With the corpus indexed, you can run natural language queries across all 50 competitors at once:

async function competitorSearch(query: string) {
  const results = await client.search(query, { limit: 10 });

  return results.items.map((item) => ({
    source: item.sourceUrl,
    relevanceScore: item.score,
    excerpt: item.snippet,
    title: item.title,
  }));
}
The equivalent in Python:

import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def competitor_search(query: str):
    results = client.search(query, limit=10)
    return [
        {
            "source": item.source_url,
            "score": item.score,
            "excerpt": item.snippet,
            "title": item.title,
        }
        for item in results.items
    ]

Real Query Examples

Once your corpus is indexed, these queries become answerable:

// Pricing intelligence
const pricingResults = await competitorSearch(
  "enterprise plan pricing annual commitment"
);

// Feature tracking
const featureResults = await competitorSearch(
  "SSO SAML support single sign-on"
);

// Positioning changes
const positioningResults = await competitorSearch(
  "AI-powered machine learning automation"
);

// Hiring signal (from careers pages)
const hiringResults = await competitorSearch(
  "engineering manager platform infrastructure"
);

// API and developer focus
const apiResults = await competitorSearch(
  "API rate limits webhook developer integration"
);

Each of these runs against your actual indexed competitor pages — not a third-party search index that may or may not include the pages you care about.


Extending with an LLM Digest

The most useful production pattern combines periodic semantic search with LLM synthesis:

async function weeklyCompetitiveDigest() {
  const queries = [
    "pricing changes tier updates enterprise",
    "new features product announcements",
    "API developer platform updates",
    "hiring plans engineering growth",
  ];

  const allResults = await Promise.all(
    queries.map(async (query) => {
      const results = await competitorSearch(query);
      return { query, results: results.slice(0, 3) };
    })
  );

  const context = allResults
    .map(
      ({ query, results }) =>
        `### ${query}\n${results.map((r) => `- [${r.source}] ${r.excerpt}`).join("\n")}`
    )
    .join("\n\n");

  const digest = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 800,
    messages: [
      {
        role: "user",
        content: `Based on the following content from competitor websites, write a concise weekly competitive intelligence digest. Focus on actionable observations.\n\n${context}`,
      },
    ],
  });

  return digest.content[0].type === "text" ? digest.content[0].text : "";
}
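To make the digest actually weekly, schedule it. A cron library is the usual production choice; a dependency-free sketch with `setTimeout` looks like this (the Monday 09:00 target is arbitrary):

```typescript
// Milliseconds until the next Monday at 09:00 local time.
function msUntilNextMondayNine(now: Date): number {
  const next = new Date(now);
  next.setDate(now.getDate() + ((8 - now.getDay()) % 7)); // 0 days when today is Monday
  next.setHours(9, 0, 0, 0);
  if (next <= now) next.setDate(next.getDate() + 7); // this week's slot already passed
  return next.getTime() - now.getTime();
}

// Run a job every Monday at 09:00, re-arming after each run.
function scheduleWeekly(job: () => Promise<unknown>): void {
  setTimeout(async () => {
    await job();
    scheduleWeekly(job);
  }, msUntilNextMondayNine(new Date()));
}

// scheduleWeekly(weeklyCompetitiveDigest);
```

Re-arming after each run (rather than `setInterval`) keeps the schedule anchored to the clock even when a digest run takes a while.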

Cost Estimate

Polling approach (50 pages × hourly):

  • 50 pages × 24 hours × $0.001/request = $1.20/day = $36/month in polling costs alone
  • Plus: LLM calls on every changed page regardless of significance
  • Plus: storage and diffing infrastructure

KnowledgeSDK webhook approach:

  • Initial 50 extractions: covered by the $29/month plan
  • Webhook notifications: only fires on actual content changes
  • LLM calls: only on changed pages
  • Monthly cost: $29/month plan + LLM costs proportional to actual change frequency

For competitors whose pricing pages change once a month, the webhook model is approximately 720x more efficient per page than hourly polling.
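The arithmetic above, spelled out (the $0.001/request figure and the once-a-month change rate are the assumptions from this section):

```typescript
const PAGES = 50;
const COST_PER_REQUEST = 0.001; // assumed per-request cost from above

// Hourly polling: every page fetched 24 times a day, 30 days a month.
const pollsPerPagePerMonth = 24 * 30; // 720
const pollingCostPerMonth = PAGES * pollsPerPagePerMonth * COST_PER_REQUEST;
console.log(pollingCostPerMonth.toFixed(2)); // "36.00"

// Webhook model: one re-extraction per actual change (~1/month for a pricing page),
// so hourly polling does 720 fetches for every one that mattered.
const changesPerPagePerMonth = 1;
console.log(pollsPerPagePerMonth / changesPerPagePerMonth); // 720
```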


Summary

A competitive intelligence system built on KnowledgeSDK has three phases:

  1. Build — extract your competitor URL corpus on first run
  2. Monitor — register webhooks for change detection; re-index on change
  3. Search — run semantic queries across all indexed content on demand

The result is a searchable knowledge base of your competitor landscape that updates reactively rather than on a polling schedule, and that you can query with natural language rather than keyword searches.

Install the SDK for either language:

npm install @knowledgesdk/node
pip install knowledgesdk

