E-Commerce Data Extraction for AI: Products, Prices, Reviews at Scale
E-commerce is the most competitive arena in web scraping. Amazon, Shopify stores, and retailer sites collectively represent the largest source of product data on the internet — and the most aggressively protected. Anti-bot systems, dynamic pricing, JavaScript-heavy pages, and geo-restricted content all stand between you and usable product data.
But for AI-native teams in 2026, the goal isn't just raw data. It's structured, searchable product knowledge that feeds recommendation engines, price intelligence systems, competitive analysis tools, and customer-facing AI assistants. Getting from URL to useful AI context requires a multi-step pipeline — this guide walks through each step.
What E-Commerce AI Agents Need
An AI shopping assistant, price tracker, or competitive intelligence system needs more than a product name and a price. It needs:
Product details. Name, description, specifications, dimensions, materials, compatibility, included accessories. The richer the product data, the better the AI can answer questions like "does this laptop have Thunderbolt 4?" or "will this case fit a 13-inch MacBook Pro?"
Pricing data. Current price, original price, discount percentage, sale end dates, price history trends. Structured pricing enables temporal analysis: "when was this product cheapest in the last 6 months?"
Availability. In stock, out of stock, low stock warnings, ship date estimates, seller/fulfillment information. Real-time availability data is critical for purchase intent applications.
Reviews and ratings. Aggregate rating, review count, verified purchase percentage, key themes from review text. LLM-powered review summarization surfaces what customers actually care about.
Competitive positioning. Price relative to similar products. Which features appear in competing products at lower price points. Customer sentiment comparison across brands.
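Several of the review fields above (aggregate rating, review count, verified purchase percentage) are simple rollups over raw review records. A minimal sketch, assuming an illustrative `Review` shape rather than any particular platform's schema:

```typescript
interface Review {
  rating: number;    // 1-5 stars
  verified: boolean; // verified-purchase flag
  text: string;
}

// Roll raw reviews up into the aggregate fields an AI agent consumes.
function summarizeReviews(reviews: Review[]) {
  const count = reviews.length;
  const avg = count === 0 ? 0 : reviews.reduce((s, r) => s + r.rating, 0) / count;
  const verifiedPct =
    count === 0 ? 0 : (reviews.filter(r => r.verified).length / count) * 100;
  return {
    reviewCount: count,
    rating: Math.round(avg * 10) / 10, // one decimal place
    verifiedPurchasePct: Math.round(verifiedPct)
  };
}
```

Theme extraction from review text is the one piece that needs an LLM; the numeric rollups are just arithmetic and belong in plain code.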
The Scraping Challenge
E-commerce sites are among the hardest to scrape reliably:
Amazon. Amazon's anti-bot systems are sophisticated and continuously updated. Prices change dynamically based on session history, account type, geographic location, and time of day. Product pages load in multiple stages via JavaScript. Amazon actively pursues scrapers legally and technically.
Dynamic pricing. Real-time price changes happen millions of times per day on major platforms. A scrape from 9 AM may not reflect the price at 9:05 AM. For price intelligence, you need both snapshot data and change-tracking infrastructure.
JavaScript-heavy pages. Modern product pages render key information (prices, availability, variant selectors) via JavaScript. A simple HTTP fetch returns a loading skeleton, not the actual data.
Geo-restricted prices. Prices differ by region. Without residential proxies in the target geography, you'll see incorrect or unavailable prices. A US IP sees different Amazon prices than a UK IP, often by design.
Quality scraping APIs handle JS rendering and anti-bot evasion natively, and geographic proxy targeting (US, UK, EU) handles geo-restrictions.
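Whichever API handles the fetch, your pipeline still has to request the right geography for each storefront. A small helper for that routing decision — the domain-to-region mapping and geo codes here are illustrative, not part of any SDK:

```typescript
// Map a storefront domain to the proxy geography whose prices you want.
// Illustrative mapping only — extend for the marketplaces you monitor.
const GEO_BY_DOMAIN: Record<string, string> = {
  'amazon.com': 'us',
  'amazon.co.uk': 'gb',
  'amazon.de': 'de'
};

function proxyGeoFor(productUrl: string): string {
  const host = new URL(productUrl).hostname.replace(/^www\./, '');
  return GEO_BY_DOMAIN[host] ?? 'us'; // fall back to US pricing
}
```

Deriving the geography from the URL keeps the decision in one place instead of scattering region flags across every extraction call.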
Architecture for AI-Powered E-Commerce Intelligence
Product URLs
│
▼
┌──────────────────────┐
│ KnowledgeSDK │
│ POST /v1/extract │ ← JS rendering + anti-bot + markdown
└──────────┬───────────┘
│ Clean markdown
▼
┌──────────────────────┐
│ LLM Extraction │ ← Structure: price, specs, availability
│ (GPT-4o / Claude) │
└──────────┬───────────┘
│ Structured JSON
▼
┌──────────────────────┐
│ Storage + Search │
│ KnowledgeSDK │ ← Hybrid keyword + vector search
│ POST /v1/search │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ AI Application │ ← Shopping assistant, price alerts, reports
└──────────────────────┘
Webhooks from KnowledgeSDK feed the change-detection layer: when a monitored product URL changes, the pipeline re-extracts and re-structures the data automatically.
Extracting Structured Product Data
Raw markdown from a product page is readable but not queryable. You need structured JSON for storage and comparison. LLMs excel at this extraction task — they handle inconsistent formatting, infer missing fields from context, and adapt to different site layouts without schema changes.
import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';

const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface ProductData {
  name: string;
  price: number;
  originalPrice?: number;
  currency: string;
  rating?: number;
  reviewCount?: number;
  availability: 'in_stock' | 'out_of_stock' | 'limited';
  specs: Record<string, string>;
  description: string;
  imageUrls: string[];
}

async function extractProduct(url: string): Promise<ProductData> {
  // Step 1: Get clean markdown from the product page
  const { markdown } = await ks.extract(url);

  // Step 2: LLM extracts structured data
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Extract product information from the following webpage content.
Return a JSON object with: name, price (number), originalPrice (if discounted),
currency (ISO code), rating (0-5), reviewCount, availability (in_stock/out_of_stock/limited),
specs (object of key-value pairs from the product specifications table),
description (2-3 sentence summary), imageUrls (array of product image URLs).`
      },
      { role: 'user', content: markdown }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(response.choices[0].message.content!) as ProductData;
}
Price Change Monitoring with Webhooks
Tracking price changes requires monitoring URLs over time and detecting when the price field changes. KnowledgeSDK's webhook system handles the monitoring layer; your code handles the interpretation.
import { NextRequest, NextResponse } from 'next/server';

// Register product URLs for price monitoring
const webhook = await ks.webhooks.create({
  url: 'https://your-app.com/webhooks/product-change',
  events: ['content.changed'],
  monitors: productUrls.map(url => ({
    url,
    schedule: 'every_hour'
  }))
});

// Webhook handler
export async function POST(req: NextRequest) {
  const { url, previousContent, newContent } = await req.json();

  // Extract prices from both versions
  const [prevProduct, newProduct] = await Promise.all([
    extractProductFromMarkdown(previousContent),
    extractProductFromMarkdown(newContent)
  ]);

  const priceChange = newProduct.price - prevProduct.price;
  const priceChangePct = (priceChange / prevProduct.price) * 100;

  if (Math.abs(priceChangePct) >= 5) {
    await notifyPriceChange({
      url,
      previousPrice: prevProduct.price,
      newPrice: newProduct.price,
      changePercent: priceChangePct,
      product: newProduct.name
    });
  }

  // Store snapshot in your database
  await db.priceHistory.insert({
    url,
    price: newProduct.price,
    timestamp: new Date(),
    product: newProduct
  });

  return NextResponse.json({ ok: true });
}
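With snapshots accumulating in the price history table, temporal questions like "when was this product cheapest in the last 6 months?" reduce to a filter-and-reduce over the stored rows. A sketch over an in-memory array with an assumed snapshot shape — in production this would be a SQL query against the same table:

```typescript
interface PriceSnapshot {
  price: number;
  timestamp: Date;
}

// Find the cheapest recorded price within the last `days` days.
function cheapestInWindow(
  history: PriceSnapshot[],
  days: number,
  now: Date = new Date()
): PriceSnapshot | null {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  const inWindow = history.filter(s => s.timestamp.getTime() >= cutoff);
  if (inWindow.length === 0) return null;
  return inWindow.reduce((min, s) => (s.price < min.price ? s : min));
}
```

The returned snapshot carries its timestamp, so the same call answers both "how cheap?" and "when?".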
Semantic Search Over Your Product Catalog
Once products are extracted and indexed, semantic search unlocks a new class of query. Instead of keyword matching ("laptop"), you can answer natural language queries ("budget laptop under $500 for college students") using KnowledgeSDK's hybrid search.
// Index a product page into KnowledgeSDK's search
// (extraction auto-indexes the page, so no separate indexing call is needed)
async function indexProduct(sourceUrl: string) {
  await ks.extract(sourceUrl);
}

// Semantic search across your catalog
async function searchProducts(query: string) {
  const results = await ks.search(query);
  return results.map(r => ({
    url: r.url,
    title: r.title,
    snippet: r.snippet,
    score: r.score
  }));
}

// Example usage
const results = await searchProducts('lightweight laptop for video editing under $1500');
// Returns semantically relevant products, not just keyword matches
The POST /v1/search endpoint uses hybrid search — combining dense vector similarity (semantic meaning) with sparse keyword matching (exact terms). This outperforms pure vector search on queries with specific model numbers or brands, and outperforms keyword search on conceptual queries.
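KnowledgeSDK doesn't expose its fusion formula, but a common way to combine the two signals is a weighted sum of normalized scores. A sketch of the general technique — `alpha` is an assumed semantic weight, not a documented parameter:

```typescript
// Fuse a dense (semantic) score and a sparse (keyword) score, both
// assumed normalized to [0, 1]. Higher alpha favors semantic similarity.
function hybridScore(dense: number, sparse: number, alpha = 0.7): number {
  return alpha * dense + (1 - alpha) * sparse;
}
```

With this shape, a query containing an exact model number still surfaces the exact-match page (high sparse score) even when its embedding similarity is middling, while a conceptual query leans on the dense component.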
Full Pipeline: Competitor Price Intelligence
Here's the complete pipeline for a competitive price intelligence system:
import KnowledgeSDK from '@knowledgesdk/node';

const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });

// 1. Discover competitor product URLs
const { urls } = await ks.sitemap('https://competitor.com/products');
console.log(`Found ${urls.length} product URLs`);

// 2. Extract all products (async for large catalogs)
const jobs = await Promise.all(
  urls.slice(0, 100).map(url =>
    ks.extractAsync(url, {
      callbackUrl: 'https://your-app.com/webhooks/extract-complete'
    })
  )
);

// 3. Poll for completion (or handle via callback)
for (const job of jobs) {
  let status = await ks.getJob(job.jobId);
  while (status.status === 'pending' || status.status === 'running') {
    await new Promise(r => setTimeout(r, 5000));
    status = await ks.getJob(job.jobId);
  }
  if (status.status === 'completed') {
    const product = await extractProduct(status.result.url);
    console.log(`Extracted: ${product.name} at ${product.currency}${product.price}`);
  }
}

// 4. Now searchable:
const cheapAlternatives = await ks.search(
  'gaming headset under $80 with noise cancellation'
);
What to Expect at Scale
Processing 10,000 product pages monthly sits comfortably in KnowledgeSDK's Starter tier at $29/month. Processing 100,000+ pages is a Pro use case at $99/month. Both tiers include the search index, webhooks, and async job infrastructure you need for a production pipeline.
For comparison, building this stack yourself — proxy infrastructure, headless browser fleet, vector database, change detection — runs $200-500/month in infrastructure costs before engineering time. The managed API path is not just simpler; it's usually cheaper at these scales.
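The per-page arithmetic makes that comparison concrete (tier prices from above; the DIY figure uses the low end of the quoted infrastructure range):

```typescript
// Back-of-envelope cost per extracted page at a given monthly spend.
function perPageCost(monthlyFee: number, pagesPerMonth: number): number {
  return monthlyFee / pagesPerMonth;
}

const starter = perPageCost(29, 10_000);   // Starter tier at 10k pages/month
const diy = perPageCost(200, 10_000);      // self-hosted stack, low-end estimate
```

At 10,000 pages a month the managed path works out to roughly $0.003 per page versus $0.02 for the low-end DIY estimate, before counting engineering time.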
E-commerce data extraction in 2026 is a solved problem at the infrastructure level. The differentiation is in what you do with the data once it's structured and searchable — which is where your AI application logic lives.