What Is a Sitemap?
A sitemap is a file — most commonly an XML document located at /sitemap.xml — that lists all the URLs on a website that the owner wants crawlers and search engines to discover and index. Rather than forcing a crawler to follow every link on every page to find content, a sitemap provides a direct, structured inventory of the site.
Sitemaps were standardized by Google, Yahoo, and Microsoft in 2006 as the Sitemaps protocol (sitemaps.org) and are now supported by all major search engines.
Sitemap Formats
XML Sitemap (most common)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-pro</loc>
    <lastmod>2025-11-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/intro-to-scraping</loc>
    <lastmod>2025-10-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```
Sitemap Index
Large sites split their URLs across multiple sitemap files (a single sitemap file is limited to 50,000 URLs and 50 MB uncompressed) and reference them from a sitemap index:
```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```
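A minimal sketch, using only Python's standard library, of expanding a sitemap index into its child sitemap URLs. A production crawler would also fetch each child, follow nested indexes, and handle gzipped files:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace used by both <urlset> and <sitemapindex>
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> URL of every child sitemap in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Each returned URL is itself a sitemap to download and parse in turn.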
HTML Sitemap
A human-readable page listing links organized by section — primarily for user navigation rather than crawler consumption.
Key Sitemap Fields
| Field | Description |
|---|---|
| `<loc>` | The canonical URL (required) |
| `<lastmod>` | Date the page was last modified |
| `<changefreq>` | How often the page changes (a hint, not a guarantee) |
| `<priority>` | Relative importance within the site (0.0–1.0) |
Using Sitemaps for Scraping
For scrapers, a sitemap is a gift: it hands you the complete URL inventory without requiring a full crawl. Typical workflow:
- Fetch `https://example.com/robots.txt` — it often contains a `Sitemap:` directive pointing to the sitemap URL
- Download and parse the sitemap XML
- Filter URLs by path pattern, `lastmod` date, or `changefreq`
- Feed the filtered URL list into your scraper
KnowledgeSDK Sitemap API
KnowledgeSDK's POST /v1/sitemap endpoint discovers all URLs on a site — even if the site doesn't publish a sitemap.xml — by combining sitemap parsing with crawl-based discovery:
```http
POST /v1/sitemap
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://docs.example.com"
}
```
The response is a flat JSON array of all discovered URLs, ready to pass to POST /v1/scrape or POST /v1/extract in batch.
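As a sketch, the request can be assembled with Python's `urllib`. The `api.knowledgesdk.com` host is an assumption (use the base URL from your KnowledgeSDK account), and the builder is shown separately from sending so the shape of the call is easy to inspect:

```python
import json
import urllib.request

def build_sitemap_request(
    api_key: str,
    target_url: str,
    # Assumed endpoint host -- substitute your real KnowledgeSDK base URL
    endpoint: str = "https://api.knowledgesdk.com/v1/sitemap",
) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/sitemap request."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(req)` and calling `json.load` on the response would yield the flat URL array described above.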
Why Sitemaps Matter for Large-Scale Extraction
- Efficiency — no need to crawl every page to find all URLs
- Freshness signals — `lastmod` tells you which pages have changed since your last scrape
- Priority hints — focus extraction on high-priority pages first
- Completeness — pages with no inbound links (orphan pages) still appear in sitemaps
For any project that needs to extract content from an entire website, checking for a sitemap first is usually the fastest starting point.