What Is Web Crawling?
Web crawling is the systematic process of traversing a website — or the entire web — by following hyperlinks from page to page. A crawler (also called a spider or bot) starts from one or more seed URLs, downloads each page, extracts all outbound links, and adds new, unseen URLs to a queue for future fetching.
Search engines like Google and Bing rely on web crawlers to build their indexes. The same technique powers competitive intelligence tools, AI training data pipelines, and site-wide content audits.
Crawling vs. Scraping
These two terms are often used interchangeably but describe different scopes:
| Concept | Scope | Goal |
|---|---|---|
| Web crawling | Many pages, whole sites | Discover and fetch URLs at scale |
| Web scraping | Specific pages or fields | Extract structured data from known URLs |
In practice, most large-scale data collection projects combine both: a crawler discovers URLs, and a scraper extracts content from each one.
How a Web Crawler Works
A typical crawl loop looks like this:
- Seed — load one or more starting URLs into a queue
- Fetch — download the next URL from the queue
- Parse — extract all <a href> links from the response
- Filter — remove already-visited URLs, off-domain links, and paths blocked by robots.txt
- Enqueue — add new URLs to the queue
- Repeat — continue until the queue is empty or a depth/page limit is reached
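The loop above can be sketched in a few dozen lines of Python. This is a minimal, single-threaded illustration, not a production crawler: link extraction uses a naive regex rather than a real HTML parser, and the `fetch` callable (e.g. a wrapper around `requests.get`) is passed in by the caller.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl restricted to the seed URL's domain."""
    domain = urlparse(seed).netloc
    queue = deque([seed])   # the crawl frontier
    visited = set()         # URL deduplication
    pages = []

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        html = fetch(url)   # caller supplies the HTTP layer
        pages.append(url)

        # Extract href targets and resolve them against the current page
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            # Scope control: stay on the seed's domain
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)

    return pages
```

Swapping `deque.popleft()` for a priority queue turns this breadth-first crawl into a prioritized one, which is how large crawlers schedule important pages first.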
URL Discovery with KnowledgeSDK
Rather than building your own crawler, KnowledgeSDK's POST /v1/sitemap endpoint returns every discoverable URL on a domain in seconds:
POST /v1/sitemap
Authorization: Bearer knowledgesdk_live_...
{
  "url": "https://docs.example.com"
}
The response is a flat list of URLs you can immediately pass to POST /v1/scrape or POST /v1/extract for content extraction — no crawl queue management required.
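In Python, the request above might look like the sketch below. Note the assumptions: the API host (`api.knowledgesdk.com`) is a placeholder, and the response is assumed to be a JSON array of URL strings; adjust both to match the actual API reference.

```python
import requests

API_KEY = "knowledgesdk_live_..."  # placeholder — use your own key


def list_urls(domain_url):
    """Ask the sitemap endpoint for every discoverable URL on a domain.

    Assumes the response body is a flat JSON array of URL strings.
    """
    resp = requests.post(
        "https://api.knowledgesdk.com/v1/sitemap",  # host assumed
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": domain_url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Each URL in the returned list can then be fed directly into the scrape or extract endpoint.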
Key Crawler Concepts
- Crawl depth — how many link hops from the seed URL the crawler will follow
- Crawl frontier — the queue of URLs waiting to be fetched
- URL deduplication — ensuring the same URL is not fetched twice
- Politeness delay — a pause between requests to avoid overloading the server
- Scope control — restricting the crawler to a specific domain, subdomain, or path prefix
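Several of these concepts take only a few lines each to implement. The helpers below are illustrative sketches (the function and parameter names are not from any library):

```python
import time
from urllib.parse import urlparse


def in_scope(url, domain, path_prefix="/"):
    """Scope control: keep only URLs on one domain under a path prefix."""
    parsed = urlparse(url)
    return parsed.netloc == domain and parsed.path.startswith(path_prefix)


def polite_fetch_all(urls, fetch, delay=1.0):
    """Fetch each unique URL with a politeness delay between requests."""
    seen = set()
    results = {}
    for url in urls:
        if url in seen:        # URL deduplication
            continue
        seen.add(url)
        results[url] = fetch(url)
        time.sleep(delay)      # politeness delay
    return results
```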
Common Use Cases
- Indexing all pages of a documentation site for search
- Discovering product URLs across a large e-commerce catalog
- Monitoring an entire news site for new article URLs
- Building training datasets by collecting pages across many domains
Respecting Crawl Rules
Before crawling any site, check its robots.txt file at https://example.com/robots.txt. This file specifies which paths crawlers may and may not access. Ignoring it is both unethical and, in some jurisdictions, legally risky.
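Python's standard library ships a robots.txt parser, `urllib.robotparser`, so a crawler can check each URL before fetching it. A sketch (the `User-agent` string and the example rules are illustrative):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_for(url):
    """Fetch and parse the robots.txt for a URL's host."""
    parsed = urlparse(url)
    rp = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # downloads and parses the file
    return rp


# Offline example: parse rules directly instead of fetching them
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin/"])
```

A crawler would call `rp.can_fetch("MyCrawler", url)` inside its fetch loop and skip any URL for which it returns `False`.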