Sitemap Extraction: Crawl a Thousand Pages Without Getting Blocked
Sitemaps are the most underused tool in web crawling. While most tutorials start with recursive link-following from a seed URL — fetching page after page, extracting links, and queuing them for later — there is a much faster approach for sites that publish their structure: go straight to the sitemap.
A sitemap is a structured XML file that lists all (or most) of a website's publicly accessible pages, often with metadata like last modification date and update frequency. Most serious websites publish one, usually at /sitemap.xml or linked from robots.txt. For AI knowledge pipelines, crawling from a sitemap is faster, more complete, and requires less crawl logic than link extraction.
This tutorial walks through a complete TypeScript implementation: fetching a sitemap, extracting all URLs, crawling each page with rate limiting and error handling, and storing the results as clean markdown indexed for semantic search.
Why Start from the Sitemap
Link-following crawlers discover URLs by parsing pages and extracting <a> tags. This works, but has real drawbacks:
- Pages with no inbound links from other crawled pages are never discovered
- Depth limits mean deep pages get missed
- Crawl order is unpredictable — you might process category pages before the content they link to
- Pagination, filters, and faceted navigation create enormous URL spaces with thin content
Sitemaps solve these problems. The site operator has already done the work of listing their important pages. You get a complete URL list immediately, before making a single content request. You know exactly how many pages you are dealing with, which lets you plan rate limiting and estimate completion time.
The downside: not every site has a sitemap, some sitemaps are incomplete, and sitemap indexes (sitemaps of sitemaps) add a parsing layer. But for documentation sites, SaaS marketing sites, e-commerce catalogs, and news publications, sitemaps are usually the fastest path.
Finding the Sitemap
Before fetching content, find the sitemap. There are three common locations:
- /sitemap.xml (most common)
- /sitemap_index.xml (for sitemap indexes)
- Listed in robots.txt via a Sitemap: directive
async function findSitemapUrl(baseUrl: string): Promise<string> {
  // Check robots.txt first; the Sitemap: directive is case-insensitive
  const robotsResponse = await fetch(`${baseUrl}/robots.txt`);
  if (robotsResponse.ok) {
    const robotsTxt = await robotsResponse.text();
    const sitemapMatch = robotsTxt.match(/^sitemap:\s*(.+)$/im);
    if (sitemapMatch) return sitemapMatch[1].trim();
  }

  // Fall back to common locations
  const candidates = ['/sitemap.xml', '/sitemap_index.xml', '/sitemap/sitemap.xml'];
  for (const path of candidates) {
    const response = await fetch(`${baseUrl}${path}`);
    if (response.ok && response.headers.get('content-type')?.includes('xml')) {
      return `${baseUrl}${path}`;
    }
  }

  throw new Error(`No sitemap found for ${baseUrl}`);
}
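If you are not using an SDK for this step, pulling the URLs out of the sitemap itself takes only a little more code. Below is a minimal sketch (my own helper, not part of any library) that assumes well-formed XML and handles sitemap indexes by recursing into each child sitemap; a simple regex on the <loc> elements is usually enough for this format.

// Minimal sitemap parser: works for both <urlset> and <sitemapindex> documents,
// since both wrap their URLs in <loc> elements. Assumes well-formed XML.
async function fetchSitemapUrls(sitemapUrl: string): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();

  // Collect every <loc> value in the document
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);

  // Sitemap index: each <loc> points at another sitemap file, so recurse and merge
  if (xml.includes('<sitemapindex')) {
    const nested = await Promise.all(locs.map(loc => fetchSitemapUrls(loc)));
    return [...new Set(nested.flat())];
  }

  return [...new Set(locs)];
}

In practice you would pair it with findSitemapUrl above, for example: const urls = await fetchSitemapUrls(await findSitemapUrl('https://example.com'));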
Using KnowledgeSDK for Sitemap Discovery
KnowledgeSDK exposes a sitemap endpoint that handles this discovery automatically:
import { KnowledgeSDK } from '@knowledgesdk/node';
const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });
const { urls } = await ks.sitemap('https://docs.example.com');
console.log(`Discovered ${urls.length} URLs`);
// Output: Discovered 847 URLs
The API handles sitemap discovery, sitemap index parsing (fetching and merging multiple sitemap files), and returns a clean array of unique URLs ready for crawling.
Crawling with Rate Limiting
Once you have a URL list, the challenge is crawling every page without triggering rate limits or IP blocks. The key principles: batch requests, add delays between batches, and never hammer a single domain with unbounded parallel requests.
import { KnowledgeSDK } from '@knowledgesdk/node';

const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });

async function crawlSitemap(domain: string) {
  // Step 1: Discover all URLs
  console.log(`Fetching sitemap for ${domain}...`);
  const { urls } = await ks.sitemap(domain);
  console.log(`Found ${urls.length} pages to crawl`);

  // Step 2: Crawl in batches with rate limiting
  const BATCH_SIZE = 5;        // Concurrent requests per batch
  const BATCH_DELAY_MS = 2000; // Wait between batches (2 seconds)

  const results = {
    success: [] as string[],
    failed: [] as { url: string; error: string }[],
  };

  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    const batchNum = Math.floor(i / BATCH_SIZE) + 1;
    const totalBatches = Math.ceil(urls.length / BATCH_SIZE);
    console.log(`Processing batch ${batchNum}/${totalBatches}...`);

    const batchResults = await Promise.allSettled(
      batch.map(url => ks.extract({ url }))
    );

    batchResults.forEach((result, idx) => {
      const url = batch[idx];
      if (result.status === 'fulfilled') {
        results.success.push(url);
      } else {
        results.failed.push({ url, error: result.reason?.message || 'Unknown error' });
        console.warn(`  Failed: ${url} — ${result.reason?.message}`);
      }
    });

    // Rate limiting delay between batches
    if (i + BATCH_SIZE < urls.length) {
      await new Promise(resolve => setTimeout(resolve, BATCH_DELAY_MS));
    }
  }

  return results;
}

// Run the crawl
const results = await crawlSitemap('https://docs.stripe.com');
console.log(`\nCrawl complete:`);
console.log(`  Success: ${results.success.length} pages`);
console.log(`  Failed: ${results.failed.length} pages`);
Handling Errors and Retries
In any large crawl, some pages will fail. Transient errors (network timeouts, temporary server errors) should be retried. Permanent errors (404s, 403s) should be logged and skipped.
async function extractWithRetry(
  ks: KnowledgeSDK,
  url: string,
  maxRetries = 3
): Promise<{ url: string; success: boolean; error?: string }> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await ks.extract({ url });
      return { url, success: true };
    } catch (error: any) {
      const isRetryable = !error.message?.includes('404') &&
        !error.message?.includes('403') &&
        !error.message?.includes('410');

      if (!isRetryable || attempt === maxRetries) {
        return { url, success: false, error: error.message };
      }

      // Exponential backoff: 2s, 4s, 8s
      const delay = Math.pow(2, attempt) * 1000;
      console.warn(`  Retry ${attempt}/${maxRetries} for ${url} in ${delay}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  return { url, success: false, error: 'Max retries exceeded' };
}
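To plug this into the batch loop in crawlSitemap, swap the direct extract call for the retry wrapper. One caveat: because extractWithRetry catches its own errors, every promise fulfills, so you then check result.value.success instead of result.status (the resumable-crawl example below does exactly this). A sketch of just the changed lines:

// Inside crawlSitemap's batch loop, replacing the direct ks.extract call:
const batchResults = await Promise.allSettled(
  batch.map(url => extractWithRetry(ks, url))
);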
Filtering URLs Before Crawling
Sitemaps sometimes include URLs you do not want — archive pages, tag pages, pagination URLs, media files. Filtering before crawling saves requests.
function filterUrls(urls: string[], options: {
  excludePatterns?: RegExp[];
  includePatterns?: RegExp[];
  excludeExtensions?: string[];
}): string[] {
  const { excludePatterns = [], includePatterns = [], excludeExtensions = [] } = options;

  return urls.filter(url => {
    // Exclude specific extensions
    const urlLower = url.toLowerCase();
    if (excludeExtensions.some(ext => urlLower.endsWith(ext))) return false;

    // Exclude patterns
    if (excludePatterns.some(pattern => pattern.test(url))) return false;

    // Include patterns (if specified, URL must match at least one)
    if (includePatterns.length > 0 && !includePatterns.some(p => p.test(url))) return false;

    return true;
  });
}

// Example: crawl only documentation pages, exclude media
const { urls } = await ks.sitemap('https://example.com');
const filteredUrls = filterUrls(urls, {
  excludeExtensions: ['.pdf', '.jpg', '.png', '.svg'],
  excludePatterns: [
    /\/tag\//,
    /\/archive\//,
    /\?page=/,
    /\/wp-content\//,
  ],
  includePatterns: [/\/docs\//], // Only docs section
});

console.log(`Filtered from ${urls.length} to ${filteredUrls.length} URLs`);
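To combine this with the crawlSitemap function from earlier, the natural place for the filter is between Step 1 and Step 2, right after the sitemap call. A sketch of the changed lines; the patterns are placeholders for whatever your site needs:

// Inside crawlSitemap, immediately after: const { urls } = await ks.sitemap(domain);
const crawlable = filterUrls(urls, {
  excludeExtensions: ['.pdf', '.jpg', '.png', '.svg'],
  includePatterns: [/\/docs\//], // placeholder: keep only the docs section
});
// ...then iterate over `crawlable` instead of `urls` in the batch loop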
Tracking Progress for Long Crawls
For large sites (500+ pages), a crawl can take 30+ minutes. Track and persist progress so you can resume if something interrupts the process.
import * as fs from 'fs';
import { KnowledgeSDK } from '@knowledgesdk/node';

interface CrawlProgress {
  domain: string;
  totalUrls: number;
  processedUrls: string[];
  failedUrls: { url: string; error: string }[];
  startedAt: string;
  lastUpdatedAt: string;
}

function saveProgress(progressFile: string, progress: CrawlProgress) {
  fs.writeFileSync(progressFile, JSON.stringify(progress, null, 2));
}

function loadProgress(progressFile: string): CrawlProgress | null {
  if (!fs.existsSync(progressFile)) return null;
  return JSON.parse(fs.readFileSync(progressFile, 'utf-8'));
}

async function resumableCrawl(domain: string, progressFile: string) {
  const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
  const { urls } = await ks.sitemap(domain);

  // Skip anything already processed in a previous run
  const existingProgress = loadProgress(progressFile);
  const processedSet = new Set(existingProgress?.processedUrls || []);
  const remaining = urls.filter(url => !processedSet.has(url));

  console.log(`Total: ${urls.length} | Already done: ${processedSet.size} | Remaining: ${remaining.length}`);

  const progress: CrawlProgress = {
    domain,
    totalUrls: urls.length,
    processedUrls: [...processedSet],
    failedUrls: existingProgress?.failedUrls || [],
    startedAt: existingProgress?.startedAt || new Date().toISOString(),
    lastUpdatedAt: new Date().toISOString(),
  };

  const BATCH_SIZE = 5;
  for (let i = 0; i < remaining.length; i += BATCH_SIZE) {
    const batch = remaining.slice(i, i + BATCH_SIZE);
    const batchResults = await Promise.allSettled(
      batch.map(url => extractWithRetry(ks, url))
    );

    batchResults.forEach((result) => {
      if (result.status === 'fulfilled') {
        if (result.value.success) {
          progress.processedUrls.push(result.value.url);
        } else {
          progress.failedUrls.push({ url: result.value.url, error: result.value.error || '' });
        }
      }
    });

    // Persist progress after every batch so an interruption loses at most one batch
    progress.lastUpdatedAt = new Date().toISOString();
    saveProgress(progressFile, progress);

    if (i + BATCH_SIZE < remaining.length) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
  }

  return progress;
}
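Starting (or resuming) a crawl is then a single call. The progress file path is just an example; any writable location works, and re-running the same command after an interruption picks up where it left off:

// Resumes automatically if ./crawl-progress.json exists from an earlier run
const progress = await resumableCrawl('https://docs.example.com', './crawl-progress.json');
console.log(`Done: ${progress.processedUrls.length} succeeded, ${progress.failedUrls.length} failed`);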
Searching the Extracted Knowledge
Once all pages are extracted and indexed, KnowledgeSDK's semantic search makes the knowledge immediately queryable:
// Hybrid keyword + vector search across all extracted content
const results = await ks.search({
  query: 'how to handle webhook signature verification',
  limit: 5,
});

results.forEach(result => {
  console.log(`[${result.score.toFixed(2)}] ${result.title}`);
  console.log(`  URL: ${result.url}`);
  console.log(`  Excerpt: ${result.excerpt}`);
});
The search uses hybrid keyword + vector indexing, so it finds conceptually related content even when the exact query terms do not appear verbatim in the source pages.
Complete Example: Crawl Stripe Docs
import { KnowledgeSDK } from '@knowledgesdk/node';

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

async function indexDocSite(domain: string) {
  console.log(`Starting crawl of ${domain}`);

  // 1. Discover URLs
  const { urls } = await ks.sitemap(domain);
  const docsUrls = urls.filter(url => url.includes('/docs/'));
  console.log(`Found ${docsUrls.length} documentation pages`);

  // 2. Crawl with progress reporting
  let done = 0;
  const BATCH_SIZE = 5;
  for (let i = 0; i < docsUrls.length; i += BATCH_SIZE) {
    const batch = docsUrls.slice(i, i + BATCH_SIZE);
    await Promise.allSettled(batch.map(url => ks.extract({ url })));
    done += batch.length;
    process.stdout.write(`\rProgress: ${done}/${docsUrls.length}`);

    if (i + BATCH_SIZE < docsUrls.length) {
      await new Promise(r => setTimeout(r, 2000));
    }
  }

  console.log('\nCrawl complete. Testing search...');

  // 3. Search
  const results = await ks.search({ query: 'payment intent lifecycle', limit: 3 });
  results.forEach(r => console.log(`- ${r.title} (${r.url})`));
}

await indexDocSite('https://stripe.com');
Sitemap-first crawling with proper rate limiting, error handling, and progress tracking gives you a complete, reliable knowledge extraction pipeline. Combined with KnowledgeSDK's built-in semantic search, a full documentation site becomes a queryable knowledge base in a few hours of crawl time — without building any vector database infrastructure yourself.
Try this tutorial with KnowledgeSDK's free tier — 1,000 requests per month, no credit card required. Get started at knowledgesdk.com.