Web Scraping & Extraction · Beginner

Also known as: XML sitemap, sitemap.xml

Sitemap

An XML or HTML file listing all discoverable URLs on a website, used by crawlers to efficiently find and index pages.

What Is a Sitemap?

A sitemap is a file — most commonly an XML document located at /sitemap.xml — that lists all the URLs on a website that the owner wants crawlers and search engines to discover and index. Rather than forcing a crawler to follow every link on every page to find content, a sitemap provides a direct, structured inventory of the site.

Sitemaps were standardized by Google, Yahoo, and Microsoft in 2006 as the Sitemap Protocol (sitemaps.org) and are now supported by all major search engines.

Sitemap Formats

XML Sitemap (most common)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-pro</loc>
    <lastmod>2025-11-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/intro-to-scraping</loc>
    <lastmod>2025-10-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
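A sitemap like the one above can be parsed with Python's standard library. The namespace URI is the part that usually trips people up: `findall` with a bare tag name matches nothing in a namespaced document. A minimal sketch (the function name is mine):

```python
import xml.etree.ElementTree as ET

# Inline copy of the example sitemap, so this sketch is self-contained.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-pro</loc>
    <lastmod>2025-11-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/intro-to-scraping</loc>
    <lastmod>2025-10-01</lastmod>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries

for loc, lastmod in parse_sitemap(SITEMAP_XML):
    print(loc, lastmod)
```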

Sitemap Index

The protocol limits each sitemap file to 50,000 URLs and 50 MB uncompressed, so large sites split their URLs across multiple sitemap files and reference them from a sitemap index:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
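A scraper starting from an unknown sitemap URL has to handle both cases: the file may be a urlset of pages or a sitemapindex pointing at further sitemaps. A hedged sketch (the function name and the injected fetch callable are mine, not part of the protocol) that recurses through an index:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_urls(xml_text, fetch):
    """Return every page URL, expanding <sitemapindex> files recursively.

    `fetch` is any callable mapping a URL to its XML body (e.g. a thin
    wrapper over an HTTP client), injected so this stays testable offline.
    """
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        urls = []
        for child in root.findall(NS + "sitemap"):
            loc = child.findtext(NS + "loc")
            urls.extend(collect_urls(fetch(loc), fetch))
        return urls
    # Plain <urlset>: just collect its <loc> values.
    return [u.findtext(NS + "loc") for u in root.findall(NS + "url")]
```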

HTML Sitemap

A human-readable page listing links organized by section — primarily for user navigation rather than crawler consumption.

Key Sitemap Fields

Field          Description
<loc>          The canonical URL (required)
<lastmod>      Date the page was last modified
<changefreq>   How often the page changes (a hint, not a guarantee)
<priority>     Relative importance within the site (0.0–1.0)

Using Sitemaps for Scraping

For scrapers, a sitemap is a gift: it hands you the complete URL inventory without requiring a full crawl. Typical workflow:

  1. Fetch https://example.com/robots.txt — it often contains a Sitemap: directive pointing to the sitemap URL
  2. Download and parse the sitemap XML
  3. Filter URLs by path pattern, lastmod date, or changefreq
  4. Feed the filtered URL list into your scraper
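Steps 1 and 3 of the workflow above can be sketched in plain Python; the helper names are mine, and ISO 8601 dates compare correctly as strings, so no date parsing is needed:

```python
def sitemap_urls_from_robots(robots_txt):
    """Step 1: pull Sitemap: directives out of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("sitemap:"):
            # Split once on the label's colon; the URL keeps its own colons.
            urls.append(line.split(":", 1)[1].strip())
    return urls

def changed_since(entries, cutoff):
    """Step 3: keep URLs whose <lastmod> is on or after `cutoff`.

    `entries` is a list of (loc, lastmod) pairs, as parsed from the sitemap;
    entries with no lastmod are dropped.
    """
    return [loc for loc, lastmod in entries if lastmod and lastmod >= cutoff]
```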

KnowledgeSDK Sitemap API

KnowledgeSDK's POST /v1/sitemap endpoint discovers all URLs on a site — even if the site doesn't publish a sitemap.xml — by combining sitemap parsing with crawl-based discovery:

POST /v1/sitemap
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://docs.example.com"
}

The response is a flat JSON array of all discovered URLs, ready to pass to POST /v1/scrape or POST /v1/extract in batch.
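A request to this endpoint might be assembled like so with the standard library. The api.knowledgesdk.com host is my assumption (only the /v1/sitemap path, the bearer-token header, and the payload shape appear above), and the key is a placeholder:

```python
import json
import urllib.request

API_KEY = "knowledgesdk_live_..."  # placeholder; substitute your real key

def build_sitemap_request(site_url, api_key,
                          endpoint="https://api.knowledgesdk.com/v1/sitemap"):
    """Build a POST /v1/sitemap request object (not yet sent).

    The default endpoint host is an assumption for illustration.
    """
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it would look like:
#   urls = json.load(urllib.request.urlopen(build_sitemap_request(
#       "https://docs.example.com", API_KEY)))
```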

Why Sitemaps Matter for Large-Scale Extraction

  • Efficiency — no need to crawl every page to find all URLs
  • Freshness signals — lastmod tells you which pages have changed since your last scrape
  • Priority hints — focus extraction on high-priority pages first
  • Completeness — pages with no inbound links (orphan pages) still appear in sitemaps

For any project that needs to extract content from an entire website, checking for a sitemap first is usually the fastest starting point.

Related Terms

Web Crawling
The systematic traversal of websites by following links to discover and fetch pages at scale.
robots.txt
A text file at the root of a website that instructs web crawlers which pages or sections they are allowed or disallowed from accessing.
Indexing
The process of transforming raw content into a searchable structure — embeddings, inverted indexes, or graph nodes — that enables fast retrieval.
Semantic Search
