knowledgesdk.com/blog/monitor-competitor-websites-semantically
Tutorials · March 22, 2026 · 8 min read

How to Monitor 50 Competitor Websites and Search Them Semantically

A practical guide to building a competitive intelligence system that extracts, indexes, and monitors competitor web content — with semantic search and change detection webhooks.


Competitive intelligence at scale has a tooling problem. Manual checking does not scale past five or six competitors. RSS feeds are absent from most modern sites. Custom polling scripts re-fetch mostly unchanged pages, producing noise instead of signal. And none of these approaches lets you ask semantic questions across all your monitored content at once.

This guide walks through a complete system: build a competitive corpus from 50 competitor sites, register change detection, and run semantic queries like "which competitors added enterprise pricing this month?"


What You Are Building

The system has four components:

  1. A URL corpus — the specific pages you want to monitor (pricing, features, blog, docs, careers)
  2. Bulk extraction — indexing all URLs into a searchable knowledge base on first run
  3. Change detection webhooks — re-index only when content actually changes
  4. Semantic search — query across all indexed content with natural language

Step 1: Define Your Competitor URL Corpus

Start with a structured list. Most competitive intelligence programs care about a predictable set of page types:

interface CompetitorPages {
  competitor: string;
  urls: string[];
}

const corpus: CompetitorPages[] = [
  {
    competitor: "competitorA",
    urls: [
      "https://competitorA.com/pricing",
      "https://competitorA.com/features",
      "https://competitorA.com/about",
      "https://competitorA.com/blog",
      "https://competitorA.com/docs",
      "https://competitorA.com/careers",
    ],
  },
  {
    competitor: "competitorB",
    urls: [
      "https://competitorB.com/pricing",
      "https://competitorB.com/product",
      "https://competitorB.com/changelog",
      "https://competitorB.com/api",
    ],
  },
  // ... up to 50 competitors
];

const allUrls = corpus.flatMap((c) => c.urls);
console.log(`Total URLs to monitor: ${allUrls.length}`);

For discovering URLs you have not enumerated, use the sitemap endpoint first:

import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function discoverUrls(domain: string): Promise<string[]> {
  const sitemap = await client.sitemap(`https://${domain}`);
  // Filter to high-value page types
  return sitemap.urls.filter((url) => {
    const path = new URL(url).pathname.toLowerCase();
    return (
      path.includes("/pricing") ||
      path.includes("/features") ||
      path.includes("/product") ||
      path.includes("/changelog") ||
      path.includes("/about") ||
      path.includes("/docs")
    );
  });
}
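Discovered URLs can then be folded into the corpus without duplicating pages you already track. The `mergeDiscovered` helper below is not part of the SDK — just a small utility sketch:

```typescript
// Merge sitemap-discovered URLs into an existing URL list, skipping duplicates.
// Illustrative helper, not an SDK function.
function mergeDiscovered(existing: string[], discovered: string[]): string[] {
  const seen = new Set(existing);
  return [...existing, ...discovered.filter((url) => !seen.has(url))];
}

const current = ["https://competitorA.com/pricing", "https://competitorA.com/docs"];
const found = ["https://competitorA.com/pricing", "https://competitorA.com/changelog"];
console.log(mergeDiscovered(current, found));
// ["https://competitorA.com/pricing", "https://competitorA.com/docs", "https://competitorA.com/changelog"]
```

Run this against `discoverUrls(domain)` output before each re-baseline so new pages enter the corpus without re-extracting pages already indexed.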

Step 2: Bulk Extraction — Build the Baseline Corpus

Extract all URLs on first run. Use async extraction with a callback to handle the volume without blocking:

async function buildBaselineCorpus(urls: string[]) {
  const BATCH_SIZE = 10;
  const DELAY_MS = 500;
  const results = { succeeded: 0, failed: 0, jobs: [] as string[] };

  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);

    const batchResults = await Promise.allSettled(
      batch.map(async (url) => {
        const job = await client.extractAsync(url, {
          callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/extraction-complete`,
        });
        return job.jobId;
      })
    );

    for (const result of batchResults) {
      if (result.status === "fulfilled") {
        results.succeeded++;
        results.jobs.push(result.value);
      } else {
        results.failed++;
        console.error("Extraction failed:", result.reason);
      }
    }

    console.log(`Progress: ${Math.min(i + BATCH_SIZE, urls.length)}/${urls.length}`);

    if (i + BATCH_SIZE < urls.length) {
      await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
    }
  }

  console.log(`Extraction queued: ${results.succeeded} succeeded, ${results.failed} failed`);
  return results;
}

await buildBaselineCorpus(allUrls);

Step 3: Register Change Detection Webhooks

After the baseline corpus is built, register webhooks so you only re-index when content actually changes:

async function registerCompetitorWebhooks(urls: string[]) {
  const webhooks = await Promise.all(
    urls.map((url) =>
      client.webhooks.create({
        url,
        callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/content-changed`,
        events: ["content.changed"],
      })
    )
  );

  console.log(`Registered ${webhooks.length} change detection webhooks`);
  return webhooks;
}

await registerCompetitorWebhooks(allUrls);

Your webhook handler re-indexes the changed URL and optionally triggers an LLM analysis:

import express from "express";
import Anthropic from "@anthropic-ai/sdk";

const app = express();
app.use(express.json());
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

app.post("/webhooks/content-changed", async (req, res) => {
  res.status(200).json({ received: true });

  const { url, content, previousContent, event } = req.body;
  if (event !== "content.changed") return;

  // Re-index the updated content
  await client.extract(url);
  console.log(`Re-indexed: ${url}`);

  // Generate change summary with LLM
  const summary = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: `A competitor page changed. What changed and is it competitively significant?

URL: ${url}
Previous: ${previousContent?.slice(0, 1000)}
Current: ${content?.slice(0, 1000)}

Respond with: what changed, and whether it indicates a pricing, feature, or positioning change.`,
      },
    ],
  });

  const summaryText =
    summary.content[0].type === "text" ? summary.content[0].text : "";
  console.log(`Change summary for ${url}:\n${summaryText}`);

  // Store or send alert (Slack, email, etc.)
});
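The final step in the handler — sending the alert — can be as simple as posting to a Slack incoming webhook. A sketch, assuming a `SLACK_WEBHOOK_URL` environment variable (create the incoming webhook in your Slack workspace first) and Node 18+ for built-in `fetch`:

```typescript
// Format a change alert for Slack.
function formatAlert(url: string, summary: string): string {
  return `:rotating_light: Competitor page changed\n${url}\n\n${summary}`;
}

// Post the alert to a Slack incoming webhook (SLACK_WEBHOOK_URL is an assumption).
async function sendSlackAlert(url: string, summary: string): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: formatAlert(url, summary) }),
  });
}
```

Call `sendSlackAlert(url, summaryText)` where the handler above currently logs the change summary.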

Step 4: Semantic Search Across All Monitored Content

With the corpus indexed, you can run natural language queries across all 50 competitors at once:

async function competitorSearch(query: string) {
  const results = await client.search(query, { limit: 10 });

  return results.items.map((item) => ({
    source: item.sourceUrl,
    relevanceScore: item.score,
    excerpt: item.snippet,
    title: item.title,
  }));
}
The equivalent in Python:

import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def competitor_search(query: str):
    results = client.search(query, limit=10)
    return [
        {
            "source": item.source_url,
            "score": item.score,
            "excerpt": item.snippet,
            "title": item.title,
        }
        for item in results.items
    ]

Real Query Examples

Once your corpus is indexed, these queries become answerable:

// Pricing intelligence
const pricingResults = await competitorSearch(
  "enterprise plan pricing annual commitment"
);

// Feature tracking
const featureResults = await competitorSearch(
  "SSO SAML support single sign-on"
);

// Positioning changes
const positioningResults = await competitorSearch(
  "AI-powered machine learning automation"
);

// Hiring signal (from careers pages)
const hiringResults = await competitorSearch(
  "engineering manager platform infrastructure"
);

// API and developer focus
const apiResults = await competitorSearch(
  "API rate limits webhook developer integration"
);

Each of these runs against your actual indexed competitor pages — not a third-party search index that may or may not include the pages you care about.


Extending with an LLM Digest

The most useful production pattern combines periodic semantic search with LLM synthesis:

async function weeklyCompetitiveDigest() {
  const queries = [
    "pricing changes tier updates enterprise",
    "new features product announcements",
    "API developer platform updates",
    "hiring plans engineering growth",
  ];

  const allResults = await Promise.all(
    queries.map(async (query) => {
      const results = await competitorSearch(query);
      return { query, results: results.slice(0, 3) };
    })
  );

  const context = allResults
    .map(
      ({ query, results }) =>
        `### ${query}\n${results.map((r) => `- [${r.source}] ${r.excerpt}`).join("\n")}`
    )
    .join("\n\n");

  const digest = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 800,
    messages: [
      {
        role: "user",
        content: `Based on the following content from competitor websites, write a concise weekly competitive intelligence digest. Focus on actionable observations.\n\n${context}`,
      },
    ],
  });

  return digest.content[0].type === "text" ? digest.content[0].text : "";
}
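To make the digest actually weekly, schedule it. A cron library is the usual production choice; a dependency-free sketch with `setTimeout` looks like this (the Monday 09:00 target is arbitrary):

```typescript
// Milliseconds until the next Monday at 09:00 local time.
function msUntilNextMondayNine(now: Date): number {
  const next = new Date(now);
  next.setDate(now.getDate() + ((8 - now.getDay()) % 7)); // 0 days when today is Monday
  next.setHours(9, 0, 0, 0);
  if (next <= now) next.setDate(next.getDate() + 7); // this week's slot already passed
  return next.getTime() - now.getTime();
}

// Run a job every Monday at 09:00, re-arming after each run.
function scheduleWeekly(job: () => Promise<unknown>): void {
  setTimeout(async () => {
    await job();
    scheduleWeekly(job);
  }, msUntilNextMondayNine(new Date()));
}

// scheduleWeekly(weeklyCompetitiveDigest);
```

Re-arming after each run (rather than `setInterval`) keeps the schedule anchored to the clock even when a digest run takes a while.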

Cost Estimate

Polling approach (50 pages × hourly):

  • 50 pages × 24 hours × $0.001/request = $1.20/day = $36/month in polling costs alone
  • Plus: LLM calls on every changed page regardless of significance
  • Plus: storage and diffing infrastructure

KnowledgeSDK webhook approach:

  • Initial 50 extractions: covered by the $29/month plan
  • Webhook notifications: only fires on actual content changes
  • LLM calls: only on changed pages
  • Monthly cost: $29/month plan + LLM costs proportional to actual change frequency

For competitors whose pricing pages change once a month, the webhook model is approximately 720x more efficient per page than hourly polling.
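The arithmetic above, spelled out (the $0.001/request figure and the once-a-month change rate are the assumptions from this section):

```typescript
const PAGES = 50;
const COST_PER_REQUEST = 0.001; // assumed per-request cost from above

// Hourly polling: every page fetched 24 times a day, 30 days a month.
const pollsPerPagePerMonth = 24 * 30; // 720
const pollingCostPerMonth = PAGES * pollsPerPagePerMonth * COST_PER_REQUEST;
console.log(pollingCostPerMonth.toFixed(2)); // "36.00"

// Webhook model: one re-extraction per actual change (~1/month for a pricing page),
// so hourly polling does 720 fetches for every one that mattered.
const changesPerPagePerMonth = 1;
console.log(pollsPerPagePerMonth / changesPerPagePerMonth); // 720
```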


Summary

A competitive intelligence system built on KnowledgeSDK has three phases:

  1. Build — extract your competitor URL corpus on first run
  2. Monitor — register webhooks for change detection; re-index on change
  3. Search — run semantic queries across all indexed content on demand

The result is a searchable knowledge base of your competitor landscape that updates reactively rather than on a polling schedule, and that you can query with natural language rather than keyword searches.

Install the SDK for either language:

npm install @knowledgesdk/node
pip install knowledgesdk

