knowledgesdk.com/blog/news-aggregator-ai-agent
Use case · March 19, 2026 · 12 min read

Build a News Aggregator AI Agent with Web Scraping (No RSS Needed)

Build an AI news aggregator that scrapes any tech site, categorizes articles semantically, deduplicates stories, and delivers a daily brief—no RSS required.

RSS was supposed to solve news aggregation. It mostly did — until publishers started removing their feeds, paywalling their content, or simply never adding RSS in the first place. Today, many of the tech news sources worth reading don't expose a usable RSS feed. The rest expose truncated versions that require you to fetch the full article anyway.

The alternative is to scrape directly. With a scraping API that handles JavaScript rendering and a semantic search layer, you can build a news aggregator that works on any site, categorizes stories automatically, and delivers a de-duplicated daily brief — without RSS dependency.

This guide walks through building exactly that.

What We're Building

  • A scraping agent that collects articles from multiple tech news sources daily
  • A semantic deduplication engine that detects similar stories from different publishers
  • An AI categorization layer that tags articles by topic
  • A daily brief generator that produces a markdown newsletter
  • A search interface so you can query past articles by topic

Why Not Just Use RSS?

Before writing code, a quick recap of RSS's failure modes in 2026:

Missing feeds. Many tech blogs, company blogs, and niche publications don't publish RSS. Scraping is the only option.

Truncated content. Publishers that do have RSS often truncate articles to 200 words. You get the headline and a teaser — not the content your AI needs.

No semantic structure. RSS gives you a flat list of items with titles, dates, and descriptions. It tells you nothing about what the article actually covers or how it relates to other stories.

No deduplication. The same story breaks across 20 publications. RSS gives you 20 separate entries with no indication they're covering the same event.

Web scraping solves all four problems. The tradeoff is complexity — you need to handle JavaScript rendering, site structure variation, and content extraction. KnowledgeSDK handles the rendering and extraction, leaving you with clean markdown and structured metadata.

Source Configuration

Define your news sources as configuration objects. Each source has its index page and some metadata about its structure:

// Node.js: news sources config
const NEWS_SOURCES = [
  {
    id: 'techcrunch',
    name: 'TechCrunch',
    indexUrl: 'https://techcrunch.com',
    category: 'startup-news',
    articleUrlPattern: /techcrunch\.com\/\d{4}\/\d{2}\/\d{2}\//,
  },
  {
    id: 'hackernews',
    name: 'Hacker News',
    indexUrl: 'https://news.ycombinator.com',
    category: 'developer-news',
    articleUrlPattern: /news\.ycombinator\.com\/item/,
  },
  {
    id: 'theregister',
    name: 'The Register',
    indexUrl: 'https://www.theregister.com',
    category: 'tech-industry',
    articleUrlPattern: /theregister\.com\/\d{4}\/\d{2}\/\d{2}\//,
  },
  {
    id: 'venturebeat',
    name: 'VentureBeat',
    indexUrl: 'https://venturebeat.com',
    category: 'ai-enterprise',
    articleUrlPattern: /venturebeat\.com\/\d{4}\/\d{2}\/\d{2}\//,
  },
  {
    id: 'mit-news',
    name: 'MIT Technology Review',
    indexUrl: 'https://www.technologyreview.com',
    category: 'deep-tech',
    articleUrlPattern: /technologyreview\.com\/\d{4}\/\d{2}\/\d{2}\//,
  },
];
# Python: news sources config
NEWS_SOURCES = [
    {
        "id": "techcrunch",
        "name": "TechCrunch",
        "index_url": "https://techcrunch.com",
        "category": "startup-news",
    },
    {
        "id": "venturebeat",
        "name": "VentureBeat",
        "index_url": "https://venturebeat.com",
        "category": "ai-enterprise",
    },
    {
        "id": "theregister",
        "name": "The Register",
        "index_url": "https://www.theregister.com",
        "category": "tech-industry",
    },
]

Step 1: Scrape the Index Pages

The first step is extracting article links from each source's homepage or section pages:

import KnowledgeSDK from '@knowledgesdk/node';

const sdk = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function getArticleLinks(source) {
  try {
    const result = await sdk.scrape({ url: source.indexUrl });
    const markdown = result.markdown;

    // Extract URLs from markdown links
    const urlPattern = /\[([^\]]+)\]\((https?:\/\/[^\)]+)\)/g;
    const links = [];
    let match;

    while ((match = urlPattern.exec(markdown)) !== null) {
      const [, title, url] = match;

      // Filter to article URLs using source pattern
      if (source.articleUrlPattern && !source.articleUrlPattern.test(url)) continue;

      // Skip navigation, category, and tag pages
      if (url.includes('/tag/') || url.includes('/category/') || url.includes('/author/')) continue;

      links.push({ title: title.trim(), url });
    }

    // Deduplicate by URL
    const unique = [...new Map(links.map(l => [l.url, l])).values()];
    return unique.slice(0, 20); // Take top 20 links
  } catch (err) {
    console.error(`Failed to get links from ${source.name}:`, err.message);
    return [];
  }
}
# Python: extract article links from index page
import re
from typing import List, Dict

def get_article_links(sdk, source: Dict) -> List[Dict]:
    try:
        result = sdk.scrape(url=source["index_url"])
        markdown = result["markdown"]

        url_pattern = r'\[([^\]]+)\]\((https?://[^\)]+)\)'
        links = []

        for match in re.finditer(url_pattern, markdown):
            title, url = match.group(1), match.group(2)
            # Skip navigation pages
            if any(skip in url for skip in ["/tag/", "/category/", "/author/", "/page/"]):
                continue
            links.append({"title": title.strip(), "url": url})

        # Deduplicate by URL
        seen = set()
        unique = []
        for link in links:
            if link["url"] not in seen:
                seen.add(link["url"])
                unique.append(link)

        return unique[:20]
    except Exception as e:
        print(f"Failed to get links from {source['name']}: {e}")
        return []

Step 2: Scrape Individual Articles

Once you have article URLs, scrape each one to get the full content:

async function scrapeArticle(link, source) {
  try {
    const result = await sdk.scrape({ url: link.url });

    return {
      title: result.title || link.title,
      url: link.url,
      content: result.markdown,
      source: source.name,
      sourceId: source.id,
      category: source.category,
      wordCount: result.markdown.split(/\s+/).length,
      scrapedAt: new Date().toISOString(),
    };
  } catch (err) {
    console.error(`Failed to scrape ${link.url}:`, err.message);
    return null;
  }
}

async function scrapeAllSources(sources) {
  const allArticles = [];

  for (const source of sources) {
    console.log(`Collecting from ${source.name}...`);
    const links = await getArticleLinks(source);
    console.log(`  Found ${links.length} article links`);

    // Scrape articles with controlled concurrency
    const BATCH_SIZE = 3;
    for (let i = 0; i < links.length; i += BATCH_SIZE) {
      const batch = links.slice(i, i + BATCH_SIZE);
      const results = await Promise.allSettled(
        batch.map(link => scrapeArticle(link, source))
      );

      for (const result of results) {
        if (result.status === 'fulfilled' && result.value) {
          allArticles.push(result.value);
        }
      }

      // Rate limit between batches
      await new Promise(r => setTimeout(r, 1500));
    }

    console.log(`  Scraped ${allArticles.filter(a => a.sourceId === source.id).length} articles`);
  }

  return allArticles;
}

Step 3: Semantic Deduplication

The same story breaking across multiple publications is the biggest noise problem in news aggregation. Semantic deduplication identifies stories that cover the same event, even when the headlines differ:

import OpenAI from 'openai';

const openai = new OpenAI();

async function generateEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text.slice(0, 2000), // First 2000 chars is usually enough
  });
  return response.data[0].embedding;
}

function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

async function deduplicateArticles(articles) {
  const SIMILARITY_THRESHOLD = 0.87; // High threshold — only deduplicate near-identical stories

  // Generate embeddings for all articles
  const articlesWithEmbeddings = await Promise.all(
    articles.map(async (article) => ({
      ...article,
      embedding: await generateEmbedding(`${article.title}\n\n${article.content.slice(0, 500)}`),
    }))
  );

  const clusters = [];
  const assigned = new Set();

  for (let i = 0; i < articlesWithEmbeddings.length; i++) {
    if (assigned.has(i)) continue;

    const cluster = [articlesWithEmbeddings[i]];
    assigned.add(i);

    for (let j = i + 1; j < articlesWithEmbeddings.length; j++) {
      if (assigned.has(j)) continue;

      const similarity = cosineSimilarity(
        articlesWithEmbeddings[i].embedding,
        articlesWithEmbeddings[j].embedding
      );

      if (similarity >= SIMILARITY_THRESHOLD) {
        cluster.push(articlesWithEmbeddings[j]);
        assigned.add(j);
      }
    }

    // Keep the most authoritative source in the cluster as the primary
    const primary = cluster.sort((a, b) => {
      // Prefer longer articles (more content)
      return b.wordCount - a.wordCount;
    })[0];

    clusters.push({
      primary,
      duplicates: cluster.slice(1),
      coverageCount: cluster.length,
    });
  }

  return clusters;
}
# Python: semantic deduplication
import numpy as np
from openai import OpenAI

openai_client = OpenAI()

def generate_embedding(text: str) -> List[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text[:2000]
    )
    return response.data[0].embedding

def cosine_similarity(a: List[float], b: List[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

def deduplicate_articles(articles: List[Dict]) -> List[Dict]:
    THRESHOLD = 0.87

    for article in articles:
        text = f"{article['title']}\n\n{article['content'][:500]}"
        article["embedding"] = generate_embedding(text)

    clusters = []
    assigned = set()

    for i, article in enumerate(articles):
        if i in assigned:
            continue
        cluster = [article]
        assigned.add(i)

        for j in range(i + 1, len(articles)):
            if j in assigned:
                continue
            sim = cosine_similarity(article["embedding"], articles[j]["embedding"])
            if sim >= THRESHOLD:
                cluster.append(articles[j])
                assigned.add(j)

        # Assumes "word_count" was set when the article was scraped,
        # mirroring wordCount in the Node.js scraper above
        primary = max(cluster, key=lambda a: a["word_count"])
        clusters.append({
            "primary": primary,
            "duplicates": [a for a in cluster if a is not primary],
            "coverage_count": len(cluster),
        })

    return clusters

Step 4: AI Categorization

Use an LLM to apply consistent topic labels across all articles:

async function categorizeArticle(article) {
  const categories = [
    'AI & Machine Learning',
    'Startups & Funding',
    'Big Tech',
    'Security & Privacy',
    'Developer Tools',
    'Cloud & Infrastructure',
    'Cryptocurrency',
    'Policy & Regulation',
    'Science & Research',
    'Other',
  ];

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Categorize the following article into exactly one of these categories: ${categories.join(', ')}. Respond with only the category name.`,
      },
      {
        role: 'user',
        content: `Title: ${article.title}\n\nContent preview: ${article.content.slice(0, 500)}`,
      },
    ],
    max_tokens: 20,
  });

  return response.choices[0].message.content.trim();
}
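The brief generator in the next step reads an aiCategory field from each cluster's primary article, so the label needs to be attached after deduplication. A minimal wiring sketch — the categorizer is passed in as a parameter (use categorizeArticle from above in production), with the source-level category as a fallback when the call fails:

```javascript
// Attach an AI category to each cluster's primary article.
// `categorize` is any async (article) => categoryName function.
async function tagClusters(clusters, categorize) {
  for (const cluster of clusters) {
    try {
      cluster.primary.aiCategory = await categorize(cluster.primary);
    } catch {
      // Fall back to the source-level category if the LLM call fails
      cluster.primary.aiCategory = cluster.primary.category;
    }
  }
  return clusters;
}
```

Run this between deduplication and brief generation: `await tagClusters(clusters, categorizeArticle)`.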

Step 5: Generate the Daily Brief

Combine everything into a daily newsletter:

async function generateDailyBrief(clusters) {
  const today = new Date().toLocaleDateString('en-US', {
    weekday: 'long', year: 'numeric', month: 'long', day: 'numeric'
  });

  // Group by category
  const byCategory = {};
  for (const cluster of clusters) {
    const cat = cluster.primary.aiCategory || cluster.primary.category;
    if (!byCategory[cat]) byCategory[cat] = [];
    byCategory[cat].push(cluster);
  }

  // Generate brief using LLM
  const topStories = clusters
    .sort((a, b) => b.coverageCount - a.coverageCount) // Most covered stories first
    .slice(0, 10);

  const storySummaries = topStories.map((cluster, i) =>
    `${i + 1}. ${cluster.primary.title}\n   Source: ${cluster.primary.source}${cluster.duplicates.length > 0 ? ` (+${cluster.duplicates.length} more)` : ''}\n   ${cluster.primary.url}`
  ).join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are a tech news editor writing a concise daily brief for developers. Write 2-3 sentences summarizing the top stories, focusing on why they matter for developers and builders.',
      },
      {
        role: 'user',
        content: `Today's top tech stories:\n\n${storySummaries}`,
      },
    ],
  });

  const editorNote = response.choices[0].message.content;

  // Build markdown brief
  let brief = `# Tech News Brief — ${today}\n\n`;
  brief += `${editorNote}\n\n---\n\n`;

  for (const [category, categoryClusters] of Object.entries(byCategory)) {
    brief += `## ${category}\n\n`;
    for (const cluster of categoryClusters) {
      const { primary, coverageCount } = cluster;
      const coverage = coverageCount > 1 ? ` *(${coverageCount} sources)*` : '';
      brief += `**[${primary.title}](${primary.url})**${coverage}  \n`;
      brief += `*${primary.source}*\n\n`;
    }
  }

  return brief;
}

Step 6: Search Past Articles

Index scraped articles into KnowledgeSDK for semantic search:

async function indexArticleForSearch(article) {
  // Use the scrape endpoint to re-process and index the article
  // Articles are already scraped, but we index the content for search
  await sdk.scrape({
    url: article.url,
    // KnowledgeSDK automatically indexes content for /v1/search
  });
}

async function searchArticles(query, filters = {}) {
  const results = await sdk.search({
    query,
    limit: 10,
    ...filters,
  });

  return results.results.map(r => ({
    title: r.title,
    url: r.url,
    excerpt: r.content.slice(0, 200),
    relevanceScore: r.score,
  }));
}

// Example queries
const aiResults = await searchArticles('OpenAI GPT-5 release date');
const securityResults = await searchArticles('zero-day vulnerability patch');
const fundingResults = await searchArticles('Series B startup funding AI');

Production Considerations

Storage. A news aggregator scraping 20 sources, 20 articles each, daily will accumulate fast. Archive articles older than 30 days to cold storage. Keep recent articles in hot storage for search.
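The hot/cold split is a pure date cutoff over the scrapedAt field the scraper already records. A sketch of the partitioning only — the actual storage backends (object storage, a DB table) are up to you:

```javascript
// Split articles into hot (recent) and cold (archive) sets by scrape date.
function partitionByAge(articles, maxAgeDays = 30, now = new Date()) {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  const hot = [];
  const cold = [];
  for (const article of articles) {
    const scraped = new Date(article.scrapedAt).getTime();
    (scraped >= cutoff ? hot : cold).push(article);
  }
  return { hot, cold };
}
```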

Freshness. Run the agent twice daily — once in the morning and once in the afternoon. Set a publication date threshold to skip articles older than 48 hours.
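The 48-hour threshold can be applied as a filter before deduplication. This sketch assumes each article carries a publishedAt ISO date extracted from the page metadata — a field the scraper above doesn't yet populate — and keeps articles with unparseable dates rather than silently dropping them:

```javascript
// Keep only articles published within the freshness window.
function filterFresh(articles, maxAgeHours = 48, now = new Date()) {
  const cutoff = now.getTime() - maxAgeHours * 60 * 60 * 1000;
  return articles.filter(a => {
    const published = Date.parse(a.publishedAt);
    // Unparseable or missing dates pass through for manual review
    return Number.isNaN(published) || published >= cutoff;
  });
}
```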

Source health monitoring. Track success rates per source. If a source fails 3 days in a row, alert your team — the site may have changed structure or added anti-scraping measures.
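Consecutive-failure tracking needs only a small counter per source. A sketch — in practice you'd persist the health map between runs and wire needsAttention into your alerting:

```javascript
// Record a scrape outcome for a source and flag it after N straight failures.
function recordRun(health, sourceId, ok, alertAfter = 3) {
  const entry = health[sourceId] ?? { consecutiveFailures: 0 };
  entry.consecutiveFailures = ok ? 0 : entry.consecutiveFailures + 1;
  entry.needsAttention = entry.consecutiveFailures >= alertAfter;
  health[sourceId] = entry;
  return entry;
}
```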

Paywall detection. Some articles are paywalled — the index page lists them but the article content returns a login prompt. Detect this by checking if scraped content is under 200 words or contains "subscribe to read" patterns.
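Both checks — the word-count floor and the subscription-prompt patterns — fit in one heuristic. The pattern list here is illustrative; extend it as you hit new paywall phrasings:

```javascript
// Heuristic paywall check: very short content or subscription prompts.
const PAYWALL_PATTERNS = [
  /subscribe to (read|continue)/i,
  /sign in to (read|continue)/i,
  /this article is for subscribers/i,
];

function looksPaywalled(markdown, minWords = 200) {
  const wordCount = markdown.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount < minWords) return true;
  return PAYWALL_PATTERNS.some(p => p.test(markdown));
}
```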

FAQ

Does this work for non-English news sites? Yes. KnowledgeSDK returns content in whatever language the page is in. Semantic embeddings work cross-language, so text-embedding-3-small will cluster German and English articles about the same story together.

How do I handle sites that block scrapers? KnowledgeSDK handles most bot protection automatically. For sites with aggressive blocking (major newspapers with DRM, for instance), focus on their freely available content or use their official APIs if available.

What's the difference between this and a hosted news API? Hosted news APIs (like NewsAPI, GDELT) give you broad coverage but poor content quality — truncated articles, no JavaScript-rendered content, limited sources. A custom scraper gives you full article text from exactly the sources you care about.

How do I avoid re-scraping articles I've already indexed? Maintain a URL index (a simple set in Redis or a DB table). Before scraping any article link, check if the URL has already been processed today.
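The check-before-scrape pattern can be sketched as a tiny index wrapper — in-memory here, but the same interface maps directly onto Redis SADD or an INSERT ... ON CONFLICT against a DB table:

```javascript
// In-memory seen-URL index; swap the Set for Redis or a DB in production.
function makeSeenIndex(initial = []) {
  const seen = new Set(initial);
  return {
    // Returns true the first time a URL is offered, false on repeats.
    markNew(url) {
      if (seen.has(url)) return false;
      seen.add(url);
      return true;
    },
    size: () => seen.size,
  };
}
```

In the scraping loop, skip any link where `index.markNew(link.url)` returns false.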

Can I add Hacker News comments to the aggregator? Yes. Hacker News has a public API (https://hacker-news.firebaseio.com/v0/) that provides top stories and comment threads. Use the API for HN content rather than scraping the HTML.


Your team's daily brief, automated. Build your news agent in minutes at knowledgesdk.com/setup.
