Use case · March 19, 2026 · 11 min read

Monitor Job Postings for Competitive Intelligence (With AI)

Scrape competitor job boards to understand their hiring plans, detect new AI teams forming, and get a weekly digest of competitive intelligence from job posts.


Every company telegraphs its strategy through its job postings. When Competitor A posts ten machine learning engineer roles in a quarter, they're building an AI feature. When they post five sales engineers targeting enterprise customers, they're moving upmarket. When they post a VP of Partnerships, they're planning an ecosystem play.

Job postings are one of the most reliable and legally unambiguous sources of competitive intelligence. They're public, intentional signals — companies want candidates to read them. Building a monitor to surface these signals automatically is one of the highest-ROI applications of web scraping.

This tutorial walks through building a job posting intelligence system that scrapes competitor career pages, categorizes new postings by team and function, and delivers a weekly digest with AI-generated insights.

The Intelligence Layer: Why Job Posts Reveal Strategy

Before we write code, it's worth understanding what patterns actually matter in job post analysis:

Team growth signals. Five new backend engineers = scaling a product. Five new salespeople = entering a growth phase. The ratio matters as much as the absolute numbers.

Technology bets. Job requirements reveal tech stack decisions 6-12 months before those decisions become public. "3+ years PyTorch experience" means they're betting on Python ML infra, not LLM APIs alone.

Go-to-market shifts. SDR posts signal outbound expansion. Customer success roles signal retention focus. Solutions engineer posts signal technical sales motion.

Leadership changes. VP-level and C-suite hiring often precedes major strategic shifts. A new CFO means an IPO or acquisition could be coming. A Chief Revenue Officer means sales-led growth.

Geographic expansion. Office-specific posts signal new market entry before any announcement.
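Several of these patterns reduce to simple counting. As a sketch (the function name, team labels, and the ratio interpretation are illustrative, not from any library), a week's postings can be summarized into team counts and a rough engineering-vs-go-to-market ratio:

```javascript
// Sketch: summarize a week's postings into team counts and a simple
// engineering-vs-go-to-market ratio. Thresholds are illustrative.
function hiringSignals(postings) {
  const counts = {};
  for (const p of postings) {
    counts[p.team] = (counts[p.team] || 0) + 1;
  }
  const eng = counts['Engineering'] || 0;
  const gtm = (counts['Sales'] || 0) + (counts['Marketing'] || 0);
  return {
    counts,
    // > 1 suggests a product-building phase; < 1 suggests a growth push
    engToGtmRatio: gtm === 0 ? eng : eng / gtm,
  };
}
```

A ratio tracked week over week is more informative than any single snapshot — watch for the trend line crossing 1 in either direction.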

Architecture Overview

┌──────────────────┐     ┌────────────────┐     ┌─────────────────────┐
│ Competitor Career│────▶│ KnowledgeSDK   │────▶│ Job Post Database   │
│ Pages (weekly)   │     │ Scrape + Index │     │ (Postgres)          │
└──────────────────┘     └────────────────┘     └─────────────────────┘
                                                          │
                                                          ▼
                                                 ┌─────────────────────┐
                                                 │ AI Categorization   │
                                                 │ + Change Detection  │
                                                 └─────────────────────┘
                                                          │
                                                          ▼
                                                 ┌─────────────────────┐
                                                 │ Weekly Digest       │
                                                 │ (Slack / Email)     │
                                                 └─────────────────────┘

Step 1: Database Schema

CREATE TABLE competitors (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL,
  careers_url TEXT NOT NULL,
  domain TEXT,
  active BOOLEAN DEFAULT true,
  last_scraped_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE job_postings (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  competitor_id UUID REFERENCES competitors(id) ON DELETE CASCADE,
  title TEXT NOT NULL,
  url TEXT,
  location TEXT,
  team TEXT,          -- Derived by AI: Engineering, Sales, Marketing, etc.
  function TEXT,      -- More specific: Backend, Frontend, ML, Account Executive
  seniority TEXT,     -- Junior, Mid, Senior, Staff, Director, VP, C-Suite
  remote BOOLEAN,
  content TEXT,       -- Full job description
  content_hash TEXT,  -- For dedup
  first_seen_at TIMESTAMPTZ DEFAULT NOW(),
  last_seen_at TIMESTAMPTZ DEFAULT NOW(),
  still_active BOOLEAN DEFAULT true,
  UNIQUE(competitor_id, content_hash)
);

CREATE INDEX idx_job_postings_competitor ON job_postings(competitor_id);
CREATE INDEX idx_job_postings_first_seen ON job_postings(first_seen_at DESC);
CREATE INDEX idx_job_postings_team ON job_postings(team);

CREATE TABLE weekly_digests (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  week_start DATE NOT NULL,
  content TEXT NOT NULL,
  new_postings_count INTEGER,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Step 2: Scrape Career Pages

Career pages vary enormously in structure. Some companies use Greenhouse, Lever, or Workday. Others build custom career sites. KnowledgeSDK handles all of them:

import KnowledgeSDK from '@knowledgesdk/node';
import crypto from 'crypto';
import pg from 'pg';

const sdk = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const db = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Common ATS URL patterns
const ATS_PATTERNS = {
  greenhouse: /greenhouse\.io\/embed\/job_app\?for=|greenhouse\.io\/jobs\//,
  lever: /jobs\.lever\.co\//,
  workday: /myworkdayjobs\.com\//,
  ashby: /jobs\.ashbyhq\.com\//,
  rippling: /ats\.rippling\.com\//,
};

function detectATS(url) {
  for (const [ats, pattern] of Object.entries(ATS_PATTERNS)) {
    if (pattern.test(url)) return ats;
  }
  return 'custom';
}

async function scrapeCareerPage(competitor) {
  console.log(`Scraping ${competitor.name} careers...`);

  try {
    // Scrape the careers page
    const result = await sdk.scrape({ url: competitor.careers_url });

    // Extract job listing links from the markdown
    const linkPattern = /\[([^\]]+)\]\((https?:\/\/[^\)]+)\)/g;
    const jobLinks = [];
    let match;

    while ((match = linkPattern.exec(result.markdown)) !== null) {
      const [, title, url] = match;

      // Filter out navigation links by checking ATS patterns or job-like URLs
      const isJobLink =
        Object.values(ATS_PATTERNS).some(p => p.test(url)) ||
        /\/jobs\/|\/careers\/|\/positions\/|\/openings\//i.test(url);

      if (isJobLink) {
        jobLinks.push({ title: title.trim(), url });
      }
    }

    // Deduplicate
    const unique = [...new Map(jobLinks.map(j => [j.url, j])).values()];
    console.log(`  Found ${unique.length} job links at ${competitor.name}`);

    return unique;
  } catch (err) {
    console.error(`Failed to scrape ${competitor.name}:`, err.message);
    return [];
  }
}

async function scrapeJobPosting(jobLink, competitor) {
  try {
    const result = await sdk.scrape({ url: jobLink.url });
    const contentHash = crypto
      .createHash('sha256')
      .update(result.markdown.slice(0, 2000))
      .digest('hex')
      .slice(0, 16);

    return {
      competitorId: competitor.id,
      title: result.title || jobLink.title,
      url: jobLink.url,
      content: result.markdown,
      contentHash,
      location: extractLocation(result.markdown),
      remote: detectRemote(result.markdown),
    };
  } catch (err) {
    console.error(`Failed to scrape job ${jobLink.url}:`, err.message);
    return null;
  }
}

function extractLocation(markdown) {
  // Common location patterns in job descriptions
  const patterns = [
    /Location:\s*([^\n]+)/i,
    /Office:\s*([^\n]+)/i,
    /\b(San Francisco|New York|London|Berlin|Toronto|Austin|Seattle|Remote)[,\s]/i,
  ];

  for (const pattern of patterns) {
    const match = markdown.match(pattern);
    if (match) return match[1]?.trim() || match[0]?.trim();
  }
  return null;
}

function detectRemote(markdown) {
  return /\b(remote|work from anywhere|fully remote|hybrid)\b/i.test(markdown);
}
The equivalent scraper in Python:

# Python: scrape career pages
import os
import re
import hashlib
from typing import List, Dict, Optional
from knowledgesdk import KnowledgeSDK

sdk = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_career_page(competitor: Dict) -> List[Dict]:
    print(f"Scraping {competitor['name']} careers...")

    try:
        result = sdk.scrape(url=competitor["careers_url"])
        markdown = result["markdown"]

        link_pattern = r'\[([^\]]+)\]\((https?://[^\)]+)\)'
        job_links = []

        ats_patterns = [
            r'greenhouse\.io',
            r'jobs\.lever\.co',
            r'myworkdayjobs\.com',
            r'jobs\.ashbyhq\.com',
        ]
        job_url_pattern = r'/jobs/|/careers/|/positions/|/openings/'

        for match in re.finditer(link_pattern, markdown):
            title, url = match.group(1), match.group(2)
            is_job = any(re.search(p, url) for p in ats_patterns) or re.search(job_url_pattern, url, re.IGNORECASE)
            if is_job:
                job_links.append({"title": title.strip(), "url": url})

        # Deduplicate
        seen = set()
        unique = []
        for link in job_links:
            if link["url"] not in seen:
                seen.add(link["url"])
                unique.append(link)

        print(f"  Found {len(unique)} job links")
        return unique

    except Exception as e:
        print(f"Failed to scrape {competitor['name']}: {e}")
        return []

def scrape_job_posting(job_link: Dict, competitor: Dict) -> Optional[Dict]:
    try:
        result = sdk.scrape(url=job_link["url"])
        content = result["markdown"]
        content_hash = hashlib.sha256(content[:2000].encode()).hexdigest()[:16]

        return {
            "competitor_id": competitor["id"],
            "title": result.get("title") or job_link["title"],
            "url": job_link["url"],
            "content": content,
            "content_hash": content_hash,
            "remote": bool(re.search(r'\b(remote|work from anywhere|fully remote|hybrid)\b', content, re.IGNORECASE)),
        }
    except Exception as e:
        print(f"Failed to scrape job {job_link['url']}: {e}")
        return None
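Before pointing either scraper at a live careers page, the link-extraction step can be sanity-checked against a hand-written markdown snippet (the URLs below are made up, and the regexes are a condensed version of the patterns above):

```javascript
// Extract job-like links from scraped markdown, as in scrapeCareerPage.
const LINK_RE = /\[([^\]]+)\]\((https?:\/\/[^)]+)\)/g;
const JOB_URL_RE = /jobs\.lever\.co|greenhouse\.io|\/jobs\/|\/careers\//i;

function extractJobLinks(markdown) {
  const links = [];
  for (const m of markdown.matchAll(LINK_RE)) {
    if (JOB_URL_RE.test(m[2])) links.push({ title: m[1].trim(), url: m[2] });
  }
  // Deduplicate by URL, keeping first occurrence
  return [...new Map(links.map(l => [l.url, l])).values()];
}

const sample = `
[About us](https://example.com/about)
[Senior Backend Engineer](https://jobs.lever.co/example/123)
[Senior Backend Engineer](https://jobs.lever.co/example/123)
[ML Engineer](https://example.com/careers/ml-engineer)
`;
```

Running `extractJobLinks(sample)` should drop the navigation link and the duplicate, leaving two job links.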

Step 3: AI Categorization

Use an LLM to extract structured metadata from each job description:

import OpenAI from 'openai';

const openai = new OpenAI();

async function categorizeJobPosting(posting) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Analyze this job posting and return a JSON object with these fields:
          - team: The broad team (Engineering, Product, Design, Sales, Marketing, Customer Success, Finance, Legal, HR, Operations, Other)
          - function: More specific role (e.g., Backend Engineer, ML Engineer, Account Executive, Product Manager, Data Scientist)
          - seniority: (Junior, Mid, Senior, Staff, Principal, Director, VP, C-Suite, Intern)
          - signals: Array of strategic signals this hire suggests (max 3, each under 15 words)
          Respond with only valid JSON.`,
      },
      {
        role: 'user',
        content: `Job title: ${posting.title}\n\n${posting.content.slice(0, 2000)}`,
      },
    ],
    response_format: { type: 'json_object' },
    max_tokens: 300,
  });

  try {
    return JSON.parse(response.choices[0].message.content);
  } catch {
    return { team: 'Other', function: posting.title, seniority: 'Unknown', signals: [] };
  }
}
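LLMs occasionally return values outside the enumerations in the prompt, so it's worth normalizing the parsed object before writing it to the database. A minimal sketch — the helper name and fallback values are ours, not part of any SDK:

```javascript
const TEAMS = ['Engineering', 'Product', 'Design', 'Sales', 'Marketing',
  'Customer Success', 'Finance', 'Legal', 'HR', 'Operations', 'Other'];
const SENIORITIES = ['Intern', 'Junior', 'Mid', 'Senior', 'Staff',
  'Principal', 'Director', 'VP', 'C-Suite'];

// Coerce an LLM categorization onto the allowed enums, falling back safely.
function normalizeCategorization(raw, title) {
  const team = TEAMS.find(t => t.toLowerCase() === String(raw.team).toLowerCase());
  const seniority = SENIORITIES.find(
    s => s.toLowerCase() === String(raw.seniority).toLowerCase()
  );
  return {
    team: team || 'Other',
    function: raw.function || title,
    seniority: seniority || 'Unknown',
    signals: Array.isArray(raw.signals) ? raw.signals.slice(0, 3) : [],
  };
}
```

Calling this on the parsed response keeps the `team` and `seniority` columns queryable even when the model improvises.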

Step 4: Change Detection

The real value comes from detecting new postings — jobs that weren't there last week:

async function processNewPostings(competitor, postings) {
  const newPostings = [];
  const BATCH_SIZE = 5;

  for (let i = 0; i < postings.length; i += BATCH_SIZE) {
    const batch = postings.slice(i, i + BATCH_SIZE);

    await Promise.all(batch.map(async (posting) => {
      if (!posting) return;

      // Try to upsert — INSERT on conflict updates last_seen_at
      const result = await db.query(`
        INSERT INTO job_postings (competitor_id, title, url, content, content_hash, location, remote)
        VALUES ($1, $2, $3, $4, $5, $6, $7)
        ON CONFLICT (competitor_id, content_hash) DO UPDATE
          SET last_seen_at = NOW(), still_active = true
        RETURNING id, (xmax = 0) AS is_new
      `, [
        posting.competitorId,
        posting.title,
        posting.url,
        posting.content,
        posting.contentHash,
        posting.location,
        posting.remote,
      ]);

      const row = result.rows[0];
      if (row?.is_new) {
        // New posting — categorize it
        const categorization = await categorizeJobPosting(posting);

        await db.query(`
          UPDATE job_postings
          SET team = $2, function = $3, seniority = $4
          WHERE id = $1
        `, [row.id, categorization.team, categorization.function, categorization.seniority]);

        newPostings.push({
          ...posting,
          ...categorization,
          competitor: competitor.name,
        });
      }
    }));

    await new Promise(r => setTimeout(r, 1000));
  }

  return newPostings;
}
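The upsert leans on Postgres's `xmax = 0` trick to distinguish fresh inserts from conflict-updates. The same new-vs-seen split can also be expressed as a pure function, which is handy for testing the pipeline without a database (the helper below is illustrative, not used by the code above):

```javascript
// Given content hashes already in the database and this week's scrape,
// split postings into new ones and re-seen ones.
function diffPostings(knownHashes, postings) {
  const known = new Set(knownHashes);
  const fresh = [];
  const seen = [];
  for (const p of postings) {
    (known.has(p.contentHash) ? seen : fresh).push(p);
  }
  return { fresh, seen };
}
```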

Step 5: Weekly Intelligence Digest

Generate a structured weekly brief:

async function generateWeeklyDigest() {
  const oneWeekAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000).toISOString();

  // Get all new postings from the past week
  const { rows: newPostings } = await db.query(`
    SELECT jp.*, c.name as competitor_name
    FROM job_postings jp
    JOIN competitors c ON jp.competitor_id = c.id
    WHERE jp.first_seen_at > $1
    ORDER BY jp.first_seen_at DESC
  `, [oneWeekAgo]);

  if (newPostings.length === 0) {
    return 'No new job postings detected this week.';
  }

  // Group by competitor and team
  const byCompetitor = {};
  for (const post of newPostings) {
    if (!byCompetitor[post.competitor_name]) {
      byCompetitor[post.competitor_name] = {};
    }
    const team = post.team || 'Other';
    if (!byCompetitor[post.competitor_name][team]) {
      byCompetitor[post.competitor_name][team] = [];
    }
    byCompetitor[post.competitor_name][team].push(post);
  }

  // Build summary for LLM
  const summaryLines = [];
  for (const [company, teams] of Object.entries(byCompetitor)) {
    summaryLines.push(`\n${company}:`);
    for (const [team, posts] of Object.entries(teams)) {
      const titles = posts.map(p => p.title).join(', ');
      summaryLines.push(`  ${team} (${posts.length}): ${titles}`);
    }
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are a competitive intelligence analyst. Analyze these competitor job postings from the past week and write a concise strategic brief. For each company, summarize what the hiring pattern suggests about their near-term product and go-to-market strategy. Be specific and actionable. Max 400 words.',
      },
      {
        role: 'user',
        content: `New job postings this week:\n${summaryLines.join('\n')}`,
      },
    ],
  });

  const aiAnalysis = response.choices[0].message.content;

  // Build the markdown digest
  const weekStart = new Date(oneWeekAgo).toLocaleDateString('en-US', { month: 'long', day: 'numeric' });
  const today = new Date().toLocaleDateString('en-US', { month: 'long', day: 'numeric', year: 'numeric' });

  let digest = `# Competitive Intelligence Digest\n`;
  digest += `*Week of ${weekStart} — ${today}*\n\n`;
  digest += `## AI Analysis\n\n${aiAnalysis}\n\n---\n\n`;
  digest += `## New Postings by Company\n\n`;

  for (const [company, teams] of Object.entries(byCompetitor)) {
    const totalPosts = Object.values(teams).flat().length;
    digest += `### ${company} (${totalPosts} new postings)\n\n`;

    for (const [team, posts] of Object.entries(teams)) {
      digest += `**${team}**\n`;
      for (const post of posts) {
        const location = post.location ? ` — ${post.location}` : '';
        const remote = post.remote ? ' (Remote)' : '';
        const url = post.url ? `[${post.title}](${post.url})` : post.title;
        digest += `- ${url}${location}${remote}\n`;
      }
      digest += '\n';
    }
  }

  digest += `---\n*${newPostings.length} new postings across ${Object.keys(byCompetitor).length} competitors*\n`;

  // Store digest
  await db.query(`
    INSERT INTO weekly_digests (week_start, content, new_postings_count)
    VALUES ($1, $2, $3)
  `, [oneWeekAgo, digest, newPostings.length]);

  return digest;
}
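The grouping step inside generateWeeklyDigest can be pulled out into a standalone helper, which keeps it unit-testable — a refactoring sketch, equivalent to the inline loop above:

```javascript
// Group postings first by competitor name, then by team.
function groupPostings(postings) {
  const grouped = {};
  for (const post of postings) {
    const company = post.competitor_name;
    const team = post.team || 'Other';
    grouped[company] ??= {};
    grouped[company][team] ??= [];
    grouped[company][team].push(post);
  }
  return grouped;
}
```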

Step 6: Delivery via Webhook

Send the digest to Slack or email via webhook:

async function sendDigestToSlack(digest) {
  const webhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!webhookUrl) return;

  // Truncate for Slack's 3000-char limit
  const truncated = digest.length > 2800
    ? digest.slice(0, 2800) + '\n\n_[Digest truncated — see full report in dashboard]_'
    : digest;

  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: '*Weekly Competitive Intelligence Digest*',
      blocks: [
        {
          type: 'section',
          text: { type: 'mrkdwn', text: truncated },
        },
      ],
    }),
  });
}

// Schedule: run every Monday at 8am
import cron from 'node-cron';
cron.schedule('0 8 * * 1', async () => {
  console.log('Generating weekly digest...');
  const digest = await generateWeeklyDigest();
  await sendDigestToSlack(digest);
});

Competitive Signal Patterns to Watch

Based on common competitive monitoring use cases, these job posting signals tend to be most predictive:

| Signal | What to Watch | What It Means |
| --- | --- | --- |
| ML/AI team surge | 5+ ML engineers in a quarter | Building AI-native features |
| Enterprise sales hiring | Solutions engineers, enterprise AEs | Moving upmarket |
| DevRel expansion | Developer advocates, community managers | Ecosystem/PLG push |
| Security/compliance | Security engineers, compliance officers | Enterprise sales motion |
| International offices | Location-specific posts | Geographic expansion |
| Executive hiring | VP+, C-suite | Strategic pivot incoming |
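These patterns can be encoded as simple rules over a week's categorized postings. A sketch — the thresholds, keyword regexes, and rule names are illustrative, not definitive:

```javascript
// Map one week's categorized postings to the signal patterns above.
function detectSignalPatterns(postings) {
  const signals = [];
  const mlCount = postings.filter(
    p => /ML|Machine Learning|AI/i.test(p.function || '')
  ).length;
  if (mlCount >= 5) signals.push('ML/AI team surge');
  if (postings.some(p => ['VP', 'C-Suite'].includes(p.seniority))) {
    signals.push('Executive hiring');
  }
  if (postings.some(p => /solutions engineer|enterprise/i.test(p.function || ''))) {
    signals.push('Enterprise sales hiring');
  }
  return signals;
}
```

Triggered rules can be prepended to the weekly digest so the highest-signal patterns aren't buried in the posting list.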

FAQ

Is scraping job postings from competitor career pages legal? Generally, yes. Job postings are publicly accessible information that companies intentionally make available to attract candidates. Scraping public career pages for competitive intelligence is legal in most jurisdictions and is a standard business intelligence practice.

How do I handle competitors using Greenhouse, Lever, or Workday? These ATS platforms serve job listings on public URLs. KnowledgeSDK handles all of them — JavaScript rendering and dynamic content loading are handled automatically.

What's the best cadence for scraping competitor career pages? Weekly scraping catches strategic shifts without over-indexing on day-to-day fluctuations. If you're tracking a competitor closely during a known strategic moment (post-fundraise, new product launch), increase to twice weekly.

How do I detect when a competitor closes a job posting? Mark postings as still_active = false when they're not found in the latest scrape. If a posting is missing for two consecutive scrapes, it's likely been filled or pulled.

Can I search historical job posting data? Yes. Store all job descriptions in KnowledgeSDK and use the /v1/search endpoint to query across them. "What ML roles has Competitor X posted in the last year?" becomes a semantic search query.


Competitive intelligence doesn't require corporate espionage — it's already published on career pages. Start monitoring at knowledgesdk.com/setup.
