Semantic Product Search: Embedding Your E-Commerce Catalog for AI

Replace keyword search with semantic product search — customers find what they're looking for even when they don't know the product name. Here's how to build it.

A customer types "waterproof running shoes under $100" into your search bar. Your catalog contains exactly what they're looking for — trail runners with a waterproof membrane, listed at $89. But the product title is "TrailGuard X4 All-Weather Athletic Footwear." None of those words match the query. Zero results.

That customer leaves.

This is the fundamental failure of keyword search: it matches strings, not meaning. A customer who says "lightweight laptop for travel" and a customer who says "portable thin notebook for business trips" are looking for the same thing. Keyword search treats them as completely different queries.

Semantic product search solves this by embedding your catalog — converting each product into a vector that represents its meaning — and searching by conceptual similarity rather than exact word match.

Two Approaches to Semantic Product Search

The approach you choose depends on where your catalog lives.

Approach 1: Extract your product pages with KnowledgeSDK

If your products are already published on your website (which they almost certainly are), use the sitemap endpoint to discover all product URLs, then extract each one. KnowledgeSDK handles the JavaScript rendering, content extraction, embedding, and indexing automatically. Your products become searchable in minutes with no custom infrastructure.

Approach 2: Build a custom embedding pipeline

If your catalog lives in a database or PIM system and isn't published as individual pages, you'll embed product data directly and push it to a vector store. This gives you more control over exactly what gets embedded but requires more infrastructure.

For most e-commerce businesses, Approach 1 is the right starting point. If you already have product pages, you already have the content — just index it.

Approach 1: Indexing Product Pages with KnowledgeSDK

Start by discovering all your product URLs:

const ks = require('@knowledgesdk/node');
const client = new ks.KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });

// Top-level await is not available in CommonJS, so the work runs inside an async function
async function indexCatalog() {
  // Discover all URLs under /products/
  const sitemap = await client.sitemap({ url: 'https://yourstore.com/products' });

  // Filter to product pages (exclude category pages, filters, etc.)
  const productUrls = sitemap.urls.filter(url =>
    url.includes('/products/') && !url.includes('?') && !url.includes('/category/')
  );

  console.log(`Found ${productUrls.length} product pages`);

  // Queue an extraction job for each product page
  for (const url of productUrls) {
    await client.extractAsync({ url });
  }
}

indexCatalog().catch(console.error);

extractAsync queues each extraction job and returns without waiting for it to finish, so the loop waits only for each submission while the extraction work itself runs in parallel. For a catalog of 5,000 products, indexing typically completes in 10–20 minutes.
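For a large catalog you may want to cap how many submissions are in flight at once, for example to stay inside a rate limit. The helper below is a minimal sketch of that idea. It is generic JavaScript, not part of KnowledgeSDK: mapWithConcurrency is a hypothetical name, and the right limit for your account is an assumption you would need to check.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Resolves to per-item outcomes in input order; a failed item is recorded, not thrown.
async function mapWithConcurrency(items, limit, worker) {
  const outcomes = new Array(items.length);
  let next = 0;

  async function runLane() {
    while (next < items.length) {
      const index = next++; // claim the next unprocessed item
      try {
        outcomes[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        outcomes[index] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return outcomes;
}
```

With the client from the snippet above, a call might look like mapWithConcurrency(productUrls, 5, url => client.extractAsync({ url })), after which you can retry any outcome where ok is false.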

Once indexed, search works immediately:

// Customer searches (run this inside an async function)
const results = await client.search({
  query: 'waterproof running shoes under $100',
  limit: 10,
});

// results.results contains ranked product pages with title, content, source_url
for (const result of results.results) {
  console.log(result.title); // "TrailGuard X4 All-Weather Athletic Footwear"
  console.log(result.source_url); // https://yourstore.com/products/trailguard-x4
}

The hybrid search (semantic + keyword) returns the TrailGuard X4 because the product description mentions "waterproof membrane," "trail running," and the price. The semantic layer recognizes that "waterproof running shoes" and "all-weather athletic footwear" describe the same concept. The keyword layer picks up exact terms such as "waterproof." Text matching alone does not guarantee that every result is actually under $100, so enforce hard price limits against your own pricing data.
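A phrase like "under $100" is a hard constraint rather than a concept, so you may want to parse it out of the query and enforce it against live prices. The sketch below assumes a simple "under / below / less than $N" pattern; parsePriceCeiling, filterByPrice, and the price field are hypothetical names for illustration, not part of KnowledgeSDK.

```javascript
// Pull a price ceiling out of a query such as "waterproof running shoes under $100".
// Returns { text, maxPrice }, where maxPrice is null when no ceiling is found.
function parsePriceCeiling(query) {
  const pattern = /\b(?:under|below|less than)\s+\$?(\d[\d,]*(?:\.\d+)?)/i;
  const match = query.match(pattern);
  if (!match) return { text: query.trim(), maxPrice: null };
  // Remove the price phrase so only the descriptive part is sent to search
  const text = query.replace(pattern, ' ').replace(/\s+/g, ' ').trim();
  return { text, maxPrice: Number(match[1].replace(/,/g, '')) };
}

// Keep only products whose live price is strictly below the ceiling.
function filterByPrice(products, maxPrice) {
  if (maxPrice === null) return products;
  return products.filter(p => typeof p.price === 'number' && p.price < maxPrice);
}
```

You would send text to the search call and apply filterByPrice after looking up current prices, so a stale price in the indexed page cannot leak an over-budget product into the results.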

Building the Product Search API Endpoint

Wire this into a search endpoint your frontend calls:

const express = require('express');
const ks = require('@knowledgesdk/node');

const app = express();
app.use(express.json());

const client = new ks.KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

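app.get('/api/search', async (req, res) => {
  const { q, limit = 12 } = req.query;

  if (typeof q !== 'string' || q.trim() === '') {
    return res.status(400).json({ error: 'Query parameter q is required' });
  }

  // Parse limit with an explicit radix and clamp it to a sane range
  const parsedLimit = Number.parseInt(limit, 10);
  const safeLimit = Number.isNaN(parsedLimit) ? 12 : Math.min(Math.max(parsedLimit, 1), 50);

  try {
    const results = await client.search({ query: q, limit: safeLimit });

    // Format results for the product grid
    const products = results.results.map(result => ({
      title: result.title,
      url: result.source_url,
      description: (result.content || '').slice(0, 300),
      score: result.score,
      // Extract product ID from URL for your database lookup
      productId: extractProductId(result.source_url),
    }));

    return res.json({ query: q, count: products.length, products });
  } catch (err) {
    // Without this, a rejected promise in an async handler goes unhandled in Express 4
    console.error('Search failed:', err);
    return res.status(502).json({ error: 'Search is temporarily unavailable' });
  }
});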

function extractProductId(url) {
  // e.g., https://yourstore.com/products/trailguard-x4 → trailguard-x4
  const parts = url.split('/products/');
  return parts[1] ? parts[1].replace(/\/$/, '') : null;
}

app.listen(3000);

The productId extracted from the source URL lets you look up real-time inventory, pricing, and images from your own database — you're using KnowledgeSDK for ranking and relevance, then enriching results with live data from your systems.
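The enrichment step described above could look like the following sketch. It assumes you have already loaded live records into a Map keyed by productId; the price, stock, and imageUrl fields are hypothetical, and dropping out-of-stock items is one possible policy rather than a requirement.

```javascript
// Merge ranked search hits with live records keyed by productId.
// Hits with no live record, or with no stock, are dropped; ranking order is preserved.
function enrichResults(hits, liveById) {
  const enriched = [];
  for (const hit of hits) {
    const live = liveById.get(hit.productId);
    if (!live || live.stock <= 0) continue;
    enriched.push({
      ...hit,
      price: live.price,
      stock: live.stock,
      imageUrl: live.imageUrl,
    });
  }
  return enriched;
}
```

Because the search index only supplies ranking, anything that changes minute to minute (price, stock, images) comes from your own records at request time.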

Writing Better Product Descriptions for Embeddings

Embedding quality depends directly on description quality. A sparse product description like "Men's shoe, size 8-13, black" embeds poorly. A rich description embeds well and surfaces in many more relevant queries.

Include in every product description:

Use case: "Designed for trail running, hiking, and outdoor activities in wet conditions"
Material and construction: "Waterproof Gore-Tex membrane, reinforced toe cap, Vibram outsole"
Fit and sizing notes: "Runs true to size; wide toe box for natural foot position"
Style descriptors: "Low-profile design, available in slate gray and forest green"
Who it's for: "Ideal for runners who train in all weather conditions and don't want to change shoes for light rain"

Each of these phrases adds meaning to the product's vector, moving it closer to a different kind of customer query. "I train outdoors year-round" maps to use case. "I want a wide toe box" maps to fit. "I need something that looks casual but performs" maps to style.

This investment in description quality compounds. Better descriptions → better embeddings → more semantic search matches → more conversions. It also improves your SEO as a side effect.
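If your product data is stored as structured fields, one way to produce descriptions like these consistently is to assemble them from those fields. The sketch below assumes hypothetical field names (useCase, materials, fit, style, audience); adapt them to whatever your catalog actually stores.

```javascript
// Compose one text block per product from structured fields, skipping empty ones.
function buildEmbeddingText(product) {
  const sections = [
    ['Use case', product.useCase],
    ['Material and construction', product.materials],
    ['Fit and sizing', product.fit],
    ['Style', product.style],
    ["Who it's for", product.audience],
  ];
  const lines = sections
    .filter(([, value]) => typeof value === 'string' && value.trim() !== '')
    .map(([label, value]) => `${label}: ${value.trim()}`);
  return [product.title, ...lines].filter(Boolean).join('\n');
}
```

The same text can be rendered on the product page (Approach 1) or sent to an embedding model directly (Approach 2).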

Search UX Patterns That Work

Conversational search: Let users type natural language queries. Don't force them into keyword patterns. "Show me something like my current running shoes but waterproof" is a valid query for a semantic search system.

"No results" as a discovery signal: Instead of "no results found," run a broader semantic search and show "We didn't find an exact match, but here are similar products." Customers who search and find nothing often leave; customers who find alternatives often buy.

Autocomplete with semantic ranking: As the user types, query your semantic index for partial queries. "waterproof run..." should start surfacing relevant products before the user finishes typing.
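Querying on every keystroke can flood your search endpoint and let slow responses overwrite newer ones. A common way to handle both is to debounce the input and discard stale responses. The helpers below are a generic sketch; debounce and latestOnly are illustrative names, and the wait time you pass in is yours to tune.

```javascript
// Delay calls to `fn` until `waitMs` have passed with no new call.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Wrap an async lookup so that only the most recent call's result is delivered.
function latestOnly(lookup, onResult) {
  let latest = 0;
  return async (input) => {
    const ticket = ++latest;
    const result = await lookup(input);
    if (ticket === latest) onResult(result, input); // drop stale responses
  };
}
```

In a search box you would wrap your lookup with latestOnly, then wrap that with debounce and call it from the input handler.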

Embedding model choice: This applies when you run your own embedding pipeline (Approach 2); with Approach 1, embedding is handled for you. For English-only catalogs, text-embedding-3-small from OpenAI is cost-efficient and accurate. For multilingual catalogs (Spanish, German, or French customers searching in their own language for products described in English), consider BGE-M3, which is designed for multilingual and cross-lingual retrieval.
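The "no results" pattern from this section could be sketched as follows. The search argument is a placeholder for any async function that returns an array of results, and dropQualifiers is a deliberately naive, hypothetical way to broaden a query; a real system would likely use something smarter.

```javascript
// Try the query as typed; if nothing comes back, retry with a broader query
// and flag the response so the UI can say "here are similar products".
async function searchWithFallback(search, query, broaden) {
  const exact = await search(query);
  if (exact.length > 0) return { results: exact, isFallback: false };
  const broader = broaden(query);
  if (broader === query) return { results: [], isFallback: false };
  return { results: await search(broader), isFallback: true };
}

// One naive way to broaden: drop price phrases and keep the first three words.
function dropQualifiers(query) {
  return query
    .replace(/\b(?:under|below|over|above)\s+\$?\d+(?:\.\d+)?/gi, ' ')
    .split(/\s+/)
    .filter(Boolean)
    .slice(0, 3)
    .join(' ');
}
```

The isFallback flag lets the frontend switch its heading from a normal result count to the "similar products" message.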

Keeping Your Catalog Index Fresh

Product catalogs change constantly. New products launch. Old products are discontinued. Prices change (relevant to queries like "under $100"). Descriptions get updated.

A simple strategy for catalog freshness:

New products: Trigger extraction when a new product page is published. If your CMS supports webhooks on publish events, subscribe to them.

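// CMS webhook: product published
app.post('/webhooks/product-published', async (req, res) => {
  const { url } = req.body || {};
  if (typeof url !== 'string') {
    return res.status(400).json({ error: 'url is required' });
  }
  try {
    if (url.includes('/products/')) {
      await client.extractAsync({ url });
    }
    res.sendStatus(200);
  } catch (err) {
    console.error('Extraction request failed:', err);
    res.sendStatus(502);
  }
});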

Discontinued products: A removed product stays in your index until its URL is re-extracted and found to return a 404 or a redirect. Run a weekly cleanup job that checks whether each source URL is still live.
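A cleanup job of this kind might start with a check like the one below. It is a sketch that assumes Node 18+ (for the global fetch) and a server that answers HEAD requests; findDeadUrls is a hypothetical name, and how you then remove entries from the index depends on your index's API, which is not shown here.

```javascript
// Check each indexed URL and return the ones that no longer resolve to a live page.
// `fetchImpl` defaults to the global fetch available in Node 18+.
async function findDeadUrls(urls, fetchImpl = fetch) {
  const dead = [];
  for (const url of urls) {
    try {
      // redirect: 'manual' surfaces 3xx responses instead of following them
      const res = await fetchImpl(url, { method: 'HEAD', redirect: 'manual' });
      const isRedirect = res.status >= 300 && res.status < 400;
      if (res.status === 404 || res.status === 410 || isRedirect) {
        dead.push({ url, status: res.status });
      }
    } catch (error) {
      // Network failures get status null so the caller can retry instead of purging
      dead.push({ url, status: null, error: String(error) });
    }
  }
  return dead;
}
```

Entries with a null status are worth retrying on the next run before you treat the product as gone.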

Weekly full re-index: For smaller catalogs (under 1,000 products), a weekly re-index of everything is fast and keeps all content current. For larger catalogs, re-index only recently modified products.

A/B Testing Semantic vs. Keyword Search

Before fully replacing your existing keyword search, run both in parallel and measure:

Click-through rate: What percentage of search result pages result in a product click?
Search-to-purchase conversion: Of customers who searched, what percentage bought?
Zero-result rate: How often does a query return no results?
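These three metrics can be computed from a simple log of search sessions. The sketch below assumes each session record has resultCount, clicked, and purchased fields; that shape is an assumption for illustration, so map it onto whatever your analytics actually records.

```javascript
// Summarize search sessions for one arm of the test.
// Each session: { resultCount: number, clicked: boolean, purchased: boolean }
function summarizeSearchSessions(sessions) {
  const total = sessions.length;
  if (total === 0) {
    return { clickThroughRate: 0, conversionRate: 0, zeroResultRate: 0 };
  }
  const count = predicate => sessions.filter(predicate).length;
  return {
    clickThroughRate: count(s => s.clicked) / total,
    conversionRate: count(s => s.purchased) / total,
    zeroResultRate: count(s => s.resultCount === 0) / total,
  };
}
```

Run it once per arm and compare the two summaries side by side.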

Route 50% of traffic to semantic search, 50% to keyword search. Compare these metrics after two weeks. The improvement is typically largest for long-tail queries — "gift for someone who loves hiking and cooking" — where keyword search returns nothing useful and semantic search surfaces adventure kitchen gear and trail cookware.
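For the split itself, it helps if each visitor always lands in the same arm, so their experience does not flip between searches. One way to do that is to hash a stable visitor ID. The sketch below uses an FNV-1a-style hash with a MurmurHash3 finalizer; hash32, assignSearchArm, and the "search-ab:" salt are hypothetical names, and any well-mixed hash would serve.

```javascript
// 32-bit hash of a string: FNV-1a-style over UTF-16 code units, then a final mix.
function hash32(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  // Final mixing step (the fmix32 finalizer from MurmurHash3) to spread bits evenly
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // unsigned 32-bit integer
}

// Assign a visitor to 'semantic' or 'keyword'. The same ID always gets the same arm.
function assignSearchArm(visitorId, semanticShare = 0.5) {
  const bucket = hash32(`search-ab:${visitorId}`) / 0x100000000; // in [0, 1)
  return bucket < semanticShare ? 'semantic' : 'keyword';
}
```

Changing semanticShare lets you ramp up gradually, for example starting at 0.1 before moving to an even split.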

Teams that have measured this consistently see 15–30% improvement in search-to-purchase conversion for queries with no exact keyword match. The zero-result rate typically drops by 60–80%.

From Product Search to Recommendation

Once your catalog is embedded, you get product similarity for free. Two products with vectors that are close in embedding space are semantically similar — useful for:

"Customers also viewed" recommendations (find products similar to the current page)
"Complete the look" cross-sells (find semantically complementary products)
"You might also like" post-purchase recommendations

This turns your semantic search infrastructure into a recommendation engine. The index you built for search does double duty for personalization — no additional infrastructure required.
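If you have access to the raw vectors (for example, in a custom pipeline as in Approach 2), "close in embedding space" usually means high cosine similarity. The sketch below assumes a Map from product ID to a numeric vector; cosineSimilarity and mostSimilar are illustrative names, and a production system would normally ask the vector store for nearest neighbors rather than scan every product.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k products most similar to `productId`, excluding the product itself.
function mostSimilar(productId, vectorsById, k = 4) {
  const target = vectorsById.get(productId);
  if (!target) return [];
  return [...vectorsById]
    .filter(([id]) => id !== productId)
    .map(([id, vector]) => ({ id, score: cosineSimilarity(target, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With a hosted index where you cannot read vectors directly, a comparable effect is to run a search using the current product's title and description as the query.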

The investment in embedding your product catalog pays off in three ways simultaneously: better search results, lower bounce rates from zero-result pages, and smarter recommendations. All from the same underlying index.
