What Is a Sitemap?
A sitemap is a file — most commonly an XML document located at /sitemap.xml — that lists all the URLs on a website that the owner wants crawlers and search engines to discover and index. Rather than forcing a crawler to follow every link on every page to find content, a sitemap provides a direct, structured inventory of the site.
Sitemaps were standardized by Google, Yahoo, and Microsoft in 2006 as the Sitemaps protocol (sitemaps.org) and are now supported by all major search engines.
Sitemap Formats
XML Sitemap (most common)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-pro</loc>
    <lastmod>2025-11-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/intro-to-scraping</loc>
    <lastmod>2025-10-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```
Sitemap Index
Large sites split their URLs across multiple sitemap files (a single sitemap file is limited to 50,000 URLs and 50 MB uncompressed) and reference them from a sitemap index:
```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```
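A minimal sketch, using only Python's standard library, of expanding a sitemap index into its child sitemap URLs. A production crawler would also fetch each child, follow nested indexes, and handle gzipped files:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace used by both <urlset> and <sitemapindex>
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> URL of every child sitemap in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Each returned URL is itself a sitemap to download and parse in turn.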
HTML Sitemap
A human-readable page listing links organized by section — primarily for user navigation rather than crawler consumption.
Key Sitemap Fields
| Field | Description |
|---|---|
| `<loc>` | The canonical URL (required) |
| `<lastmod>` | Date the page was last modified |
| `<changefreq>` | How often the page changes (a hint, not a guarantee) |
| `<priority>` | Relative importance within the site (0.0–1.0) |
Using Sitemaps for Scraping
For scrapers, a sitemap is a gift: it hands you the complete URL inventory without requiring a full crawl. Typical workflow:
- Fetch `https://example.com/robots.txt` — it often contains a `Sitemap:` directive pointing to the sitemap URL
- Download and parse the sitemap XML
- Filter URLs by path pattern, `lastmod` date, or `changefreq`
- Feed the filtered URL list into your scraper
KnowledgeSDK Sitemap API
KnowledgeSDK's POST /v1/sitemap endpoint discovers all URLs on a site — even if the site doesn't publish a sitemap.xml — by combining sitemap parsing with crawl-based discovery:
```http
POST /v1/sitemap
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://docs.example.com"
}
```
The response is a flat JSON array of all discovered URLs, ready to pass to POST /v1/scrape or POST /v1/extract in batch.
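As a sketch, the request can be assembled with Python's `urllib`. The `api.knowledgesdk.com` host is an assumption (use the base URL from your KnowledgeSDK account), and the builder is shown separately from sending so the shape of the call is easy to inspect:

```python
import json
import urllib.request

def build_sitemap_request(
    api_key: str,
    target_url: str,
    # Assumed endpoint host -- substitute your real KnowledgeSDK base URL
    endpoint: str = "https://api.knowledgesdk.com/v1/sitemap",
) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/sitemap request."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(req)` and calling `json.load` on the response would yield the flat URL array described above.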
Why Sitemaps Matter for Large-Scale Extraction
- Efficiency — no need to crawl every page to find all URLs
- Freshness signals — `lastmod` tells you which pages have changed since your last scrape
- Priority hints — focus extraction on high-priority pages first
- Completeness — pages with no inbound links (orphan pages) still appear in sitemaps
For any project that needs to extract content from an entire website, checking for a sitemap first is usually the fastest starting point.