knowledgesdk.com/blog/news-monitoring-ai
use-case · March 20, 2026 · 10 min read

News Monitoring for AI Agents: Real-Time Web Extraction + RAG

Build an AI news monitoring system that tracks specific topics, extracts articles from multiple sources, and enables semantic search — using web extraction APIs and vector embeddings.


Real-time news monitoring is one of the most common requests from AI teams: "we want our agent to know what happened in the last 24 hours." It sounds straightforward until you start building it. The naive approach — polling a news API for headlines — gets you 20% of the way there. For an AI agent to actually reason about recent events, it needs the full article text, structured and searchable, from authoritative primary sources.

The problem is that news content is some of the hardest web content to work with. Articles are behind JavaScript-heavy frontends, paywalls, and increasingly aggressive bot detection. The content changes constantly. And your search layer needs to find relevant articles by meaning, not just keyword match — because users ask questions like "what's happening with semiconductor supply chains?" not "TSMC Taiwan chip shortage."

This guide covers the full architecture for a production-grade news monitoring system: source selection, real-time extraction, entity enrichment, and semantic search across a live corpus. All code uses KnowledgeSDK, which handles extraction and search in a single API, eliminating the need to wire together separate scraping, embedding, and vector database services.


Why News APIs Aren't Enough

News APIs like NewsAPI, GDELT, and The Guardian API are the obvious starting point. They're easy to integrate and give you structured data. The problems surface quickly:

No full text. Most news APIs return titles, descriptions, and a URL. Full article text either requires an expensive tier or isn't available at all. For RAG, summaries and titles are useless — you need the complete article to answer nuanced questions.

Coverage gaps. Aggregator APIs prioritize mainstream sources and English content. If you're monitoring semiconductor industry news, policy papers from regulatory bodies, or niche trade publications, you won't find them through NewsAPI.

Rate limits that cap your reach. Free and low-tier news API plans limit you to a few thousand articles per day across all sources. For comprehensive monitoring, you need to scrape primary sources directly.

Latency on breaking news. News APIs index articles on their own schedules, which can lag primary sources by minutes to hours. For time-sensitive domains, this matters.

The better approach for AI teams: identify your primary source list, scrape directly, and build a living knowledge base that's updated as content changes.


The Architecture

A production news monitoring system has five stages:

  1. Source registry — a list of URLs (RSS feeds, section pages, journalist pages) to monitor
  2. Change detection — know when a source has new content without re-scraping everything
  3. Article extraction — scrape the full article, get clean markdown
  4. Embedding and indexing — convert content to vectors, store with metadata
  5. Semantic search — answer queries across the full corpus

The key insight is that stages 3, 4, and 5 can be collapsed into a single API call with the right tool. KnowledgeSDK's /v1/extract endpoint returns clean markdown AND automatically indexes the content for hybrid search — there's no separate embedding pipeline to maintain.


Setting Up Source Monitoring

The first decision is what to monitor. There are two approaches: RSS-based polling and page-based change detection.

RSS-based polling is simpler. Most news sites still publish RSS feeds. Poll the feed every 15 minutes, extract new article URLs, then scrape those URLs.

Page-based change detection uses webhooks to get notified when a page changes. This is more efficient for sites without RSS, or when you want to monitor specific section pages.

import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// Register section pages for change detection
const sourcesToMonitor = [
  'https://techcrunch.com/category/artificial-intelligence/',
  'https://www.reuters.com/technology/',
  'https://spectrum.ieee.org/artificial-intelligence',
];

for (const sourceUrl of sourcesToMonitor) {
  await client.webhooks.subscribe({
    url: sourceUrl,
    callbackUrl: 'https://your-app.com/webhooks/news-change',
    events: ['content.changed'],
  });
}

console.log(`Monitoring ${sourcesToMonitor.length} sources for new articles`);

When a source page changes — meaning new articles have been posted — your webhook receives a diff. You can parse the diff for new URLs and immediately scrape those articles.
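Parsing new article links out of that diff can be a simple markdown-link scan. Here is a minimal sketch of such a helper, assuming the webhook payload's `diff.added` field is a markdown string; the name `extractNewLinks` is illustrative, matching the handler shown later in this guide:

```javascript
// Pull unique absolute URLs out of markdown links in the diff's added content.
// Assumes `addedMarkdown` is a markdown string like "[Headline](https://...)".
function extractNewLinks(addedMarkdown) {
  // Match the URL portion of markdown links: [text](https://...)
  const linkPattern = /\]\((https?:\/\/[^)\s]+)\)/g;
  const urls = new Set(); // Set deduplicates repeated links
  for (const match of addedMarkdown.matchAll(linkPattern)) {
    urls.add(match[1]);
  }
  return [...urls];
}
```

In practice you may also want to filter out navigation links and restrict results to the source's own domain before scraping.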


Real-Time Article Extraction

When your webhook fires or your RSS poller finds a new URL, extraction is straightforward:

// Handle webhook notification of a changed news source
// (assumes an Express `app` and an initialized OpenAI `openai` client in scope)
app.post('/webhooks/news-change', async (req, res) => {
  res.sendStatus(200); // Acknowledge immediately

  const { url, diff } = req.body;
  
  // Extract new URLs from the diff (new links in the changed sections)
  const newArticleUrls = extractNewLinks(diff.added);

  for (const articleUrl of newArticleUrls) {
    try {
      const article = await client.scrape({
        url: articleUrl,
        // Content is automatically indexed for semantic search
      });

      // Run entity extraction on the article
      await enrichArticle(article);

      console.log(`Indexed: ${article.title} (${articleUrl})`);
    } catch (err) {
      console.error(`Failed to extract ${articleUrl}:`, err.message);
    }
  }
});

async function enrichArticle(article) {
  // Use LLM to extract entities from the article markdown
  const entities = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Extract the key entities from this news article. Return JSON with:
      - companies: array of company names mentioned
      - people: array of people mentioned  
      - topics: array of key topics/themes
      - sentiment: "positive" | "negative" | "neutral"
      - summary: 2-sentence summary
      
      Article:
      ${article.markdown.slice(0, 3000)}`,
    }],
    response_format: { type: 'json_object' },
  });

  return JSON.parse(entities.choices[0].message.content);
}

The client.scrape() call handles JavaScript rendering, anti-bot circumvention, and markdown conversion. More importantly, it automatically indexes the content so it's immediately searchable — no separate embedding step required.


Semantic Search Across the News Corpus

With content indexed, your AI agent can answer questions that span hundreds or thousands of articles:

// Agent querying the news corpus
async function answerNewsQuery(userQuery) {
  const results = await client.search({
    query: userQuery,
    limit: 10,
    // Hybrid search combines semantic (vector) + keyword (BM25)
    // "what's happening with OpenAI this week?" finds relevant articles
    // even if they don't contain the exact phrase
  });

  if (results.items.length === 0) {
    return "No relevant recent news found for that query.";
  }

  // Build context from search results
  const context = results.items
    .map(item => `**${item.title}** (${item.url})\n${item.excerpt}`)
    .join('\n\n---\n\n');

  // Generate a grounded answer
  const answer = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are a news analyst. Answer based only on the provided articles. Cite sources.',
      },
      {
        role: 'user',
        content: `Query: ${userQuery}\n\nRelevant articles:\n${context}`,
      },
    ],
  });

  return {
    answer: answer.choices[0].message.content,
    sources: results.items.map(item => ({ title: item.title, url: item.url })),
  };
}

// Example queries
const result = await answerNewsQuery("What's happening with AI regulation in Europe this week?");
const result2 = await answerNewsQuery("Which companies announced layoffs in the last 48 hours?");
const result3 = await answerNewsQuery("What are analysts saying about NVIDIA's latest earnings?");

Hybrid search is critical here. A pure keyword search for "AI regulation Europe" would miss articles that discuss "EU AI Act compliance" or "Brussels data governance." Vector search catches semantic equivalents. KnowledgeSDK's search layer combines both, which produces significantly better recall on news queries than either approach alone.
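For intuition, here is how hybrid result fusion is often implemented: reciprocal rank fusion (RRF) merges a keyword ranking and a vector ranking by summing reciprocal ranks. This is a generic sketch of the technique, not KnowledgeSDK's actual internals:

```javascript
// Reciprocal rank fusion: merge two ranked lists of document IDs.
// Each list contributes 1 / (k + rank + 1) per document; higher total wins.
function reciprocalRankFusion(keywordResults, vectorResults, k = 60) {
  const scores = new Map();
  for (const [rank, id] of keywordResults.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of vectorResults.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  // Sort by fused score, descending, and return just the IDs
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

A document that ranks moderately well in both lists typically beats one that ranks first in only one, which is exactly the behavior you want for news queries phrased in natural language.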


Entity Extraction and Topic Clustering

A flat corpus of articles isn't enough for sophisticated news monitoring. You need to track entities over time: which companies appear most frequently, how sentiment around a topic is trending, who is being quoted on a specific subject.

// Track entity mentions over time
async function getEntityTrends(entityName, days = 7) {
  const results = await client.search({
    query: entityName,
    limit: 50,
    filters: {
      // Filter to recent content
      after: new Date(Date.now() - days * 24 * 60 * 60 * 1000).toISOString(),
    },
  });

  // Group by day and count mentions
  const mentionsByDay = results.items.reduce((acc, item) => {
    const day = item.publishedAt?.split('T')[0] ?? 'unknown';
    acc[day] = (acc[day] ?? 0) + 1;
    return acc;
  }, {});

  return {
    entity: entityName,
    totalMentions: results.items.length,
    trend: mentionsByDay,
    representativeArticles: results.items.slice(0, 3).map(i => ({
      title: i.title,
      url: i.url,
      excerpt: i.excerpt,
    })),
  };
}

const trends = await getEntityTrends('OpenAI');
console.log(`OpenAI mentioned ${trends.totalMentions} times in the past week`);

This pattern — scrape, index, query — lets you build entity-aware news monitoring without a separate graph database. The semantic search finds articles by meaning; you apply your own aggregation logic on top.


Polling vs. Webhook-Based Refresh

There are two approaches to keeping your corpus fresh: polling (scrape every URL on a schedule) and webhooks (scrape only when content changes).

| | Polling | Webhooks |
| --- | --- | --- |
| Latency | 15 min average | Near real-time |
| API cost | High (re-scrapes unchanged pages) | Low (only scrapes on change) |
| Implementation | Simple | Slightly more complex |
| Works for all sites | Yes | Requires monitoring capability |
| Missed updates | Possible (between polls) | Rare |

For most news monitoring use cases, a hybrid approach works best: webhooks for high-priority sources (competitor sites, major publications), polling for the long tail of sources that aren't worth setting up individual webhooks for.
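One way to encode that hybrid strategy is a tiered source registry; the entries and field names below are illustrative, not a KnowledgeSDK API:

```javascript
// Tiered source registry: high-priority sources use webhooks,
// long-tail sources fall back to interval polling.
const sourceRegistry = [
  { url: 'https://techcrunch.com/category/artificial-intelligence/', strategy: 'webhook' },
  { url: 'https://www.reuters.com/technology/', strategy: 'webhook' },
  { url: 'https://example-trade-journal.com/feed', strategy: 'poll', intervalMinutes: 30 },
];

// Split the registry so each refresh path gets its own list
function partitionByStrategy(registry) {
  return {
    webhook: registry.filter(s => s.strategy === 'webhook'),
    poll: registry.filter(s => s.strategy === 'poll'),
  };
}
```

At startup, register webhook subscriptions for the first group and schedule polling loops for the second.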


Production Considerations

Deduplication. The same story gets syndicated across dozens of outlets. Before indexing, check if you've already seen substantially similar content. A simple approach: hash the first 500 characters of the article body and skip if already indexed.
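That hashing approach can be sketched in a few lines, assuming an in-memory `Set` of fingerprints (in production you would back this with a persistent store):

```javascript
import { createHash } from 'node:crypto';

const seenHashes = new Set();

// Returns true if substantially similar content was already indexed.
// Syndicated copies of a story usually share the same opening text.
function isDuplicate(articleBody) {
  const fingerprint = createHash('sha256')
    .update(articleBody.slice(0, 500).trim().toLowerCase())
    .digest('hex');
  if (seenHashes.has(fingerprint)) return true;
  seenHashes.add(fingerprint);
  return false;
}
```

Call `isDuplicate(article.markdown)` before indexing and skip the article when it returns true. For fuzzier matching across lightly edited syndications, a shingling or MinHash scheme is the usual next step.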

Paywall handling. Many premium news sources are partially or fully paywalled. If your monitoring targets require subscription content, you'll need credentials. KnowledgeSDK doesn't bypass authentication — it renders what an anonymous browser sees. For paywalled sources, consider whether the public-facing excerpt and metadata are sufficient.

Rate limiting. News sites will block you if you hammer them. Scrape at most one article per source every 2-3 seconds. KnowledgeSDK handles per-domain rate limiting automatically, but configure your polling loops conservatively.

Index freshness. Set a TTL or explicit staleness window for your indexed content. News older than 7 days is rarely useful in agent context. Periodically purge old indexed content to keep your search latency low and results relevant.
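A staleness window can also be enforced at query time. Here is a sketch, assuming each result item carries the `publishedAt` field used in the trends example above:

```javascript
// Drop indexed items older than the freshness window before building context.
function filterFresh(items, maxAgeDays = 7) {
  const cutoff = Date.now() - maxAgeDays * 24 * 60 * 60 * 1000;
  return items.filter(item =>
    item.publishedAt && new Date(item.publishedAt).getTime() >= cutoff
  );
}
```

Applying this filter to search results keeps stale articles out of agent context even before the periodic purge runs.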


Complete Pipeline: News Monitoring in 50 Lines

import { KnowledgeSDK } from '@knowledgesdk/node';
import Parser from 'rss-parser';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const rss = new Parser();

const RSS_FEEDS = [
  'https://techcrunch.com/feed/',
  'https://feeds.reuters.com/reuters/technologyNews',
  'https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml',
];

async function runNewsMonitor() {
  console.log('Starting news monitoring cycle...');

  for (const feedUrl of RSS_FEEDS) {
    const feed = await rss.parseURL(feedUrl);
    const recentItems = feed.items.slice(0, 10); // Latest 10 from each feed

    for (const item of recentItems) {
      if (!item.link) continue;

      // Scrape and auto-index the article
      const result = await client.scrape({ url: item.link });
      
      console.log(`Indexed: ${result.title ?? item.title}`);
      
      // Small delay to be respectful
      await new Promise(r => setTimeout(r, 1500));
    }
  }

  console.log('Cycle complete. Corpus is searchable.');
}

// Run on schedule
runNewsMonitor();
setInterval(runNewsMonitor, 15 * 60 * 1000); // Every 15 minutes

// Search endpoint (assumes an Express `app` in scope)
app.get('/api/news/search', async (req, res) => {
  const { q } = req.query;
  const results = await client.search({ query: q, limit: 10 });
  res.json(results);
});

This 50-line pipeline replaces what would typically require: a scraping service, an embedding pipeline, a vector database, an indexing job, and a search API. The corpus is live-updated every 15 minutes and searchable by semantic meaning immediately.

Start monitoring with KnowledgeSDK's free tier — 1,000 requests per month at no cost. Get your API key at knowledgesdk.com/setup.

