Web Scraping & Extraction · Beginner

Also known as: ethical crawling, respectful crawling

Polite Crawling

Following web crawling best practices such as respecting robots.txt, adding crawl delays, and identifying your crawler in the User-Agent header.

What Is Polite Crawling?

Polite crawling (also called ethical or respectful crawling) is a set of best practices for operating a web crawler in a way that is considerate of the target server's resources, respects the site owner's stated crawling preferences, and avoids causing harm to the site or its users. It is the standard expected of any well-behaved automated client.

The core principle is simple: crawl as if you are a guest, not a siege engine.

The Core Rules of Polite Crawling

1. Respect robots.txt

Before crawling any domain, fetch and parse its robots.txt file. Honor every Disallow directive for your user agent and the * wildcard user agent. Do not access paths the site has explicitly excluded.

# If this appears in robots.txt for your User-agent:
Disallow: /private/

# Never fetch URLs matching:
https://example.com/private/*
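As a sketch, the disallow check can look like this (a deliberately simplified parser that ignores Allow rules, wildcards, and per-agent grouping; production crawlers should use a full robots.txt parser):

```javascript
// Collect the Disallow path prefixes from robots.txt text.
// Simplified: no Allow precedence, no wildcards, no per-agent groups.
function parseDisallows(robotsTxt) {
  return robotsTxt
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter((rule) => rule.length > 0);
}

// A path is allowed if it matches no Disallow prefix.
function isAllowed(path, disallows) {
  return !disallows.some((rule) => path.startsWith(rule));
}

console.log(isAllowed('/private/data', parseDisallows('Disallow: /private/'))); // false
console.log(isAllowed('/blog/post', parseDisallows('Disallow: /private/')));    // true
```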

2. Identify Yourself with a Descriptive User-Agent

Set a User-Agent string that identifies who you are and provides a contact method. This allows site owners to reach you if your crawler causes problems:

User-Agent: MyCompanyBot/1.0 (+https://mycompany.com/bot; contact@mycompany.com)

Anonymous scrapers that impersonate browsers make it impossible for site owners to differentiate legitimate research from malicious bots.
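In code, identifying yourself can be as simple as attaching the header to every request (the bot name, info URL, and contact address below are placeholders):

```javascript
// Placeholder identity: substitute your own bot name, docs URL, and contact.
const BOT_USER_AGENT =
  'MyCompanyBot/1.0 (+https://mycompany.com/bot; contact@mycompany.com)';

// Wrap fetch so every crawl request carries the identifying User-Agent.
function politeFetch(url) {
  return fetch(url, { headers: { 'User-Agent': BOT_USER_AGENT } });
}
```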

3. Observe Crawl-Delay Directives

If robots.txt specifies a Crawl-delay, honor it:

Crawl-delay: 5

This means wait at least 5 seconds between requests to that domain. Even if no delay is specified, adding a 1-2 second delay between requests is good practice.
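One way to enforce the delay is to track the last request time per domain and wait out the remainder before the next request (a minimal sketch; the function name is illustrative, and the default delay here assumes the Crawl-delay: 5 example above):

```javascript
// Last request timestamp per domain.
const lastRequestAt = new Map();

// Wait until at least crawlDelayMs has elapsed since the previous
// request to this domain, then claim the slot.
async function waitForTurn(domain, crawlDelayMs = 5000) {
  const last = lastRequestAt.get(domain) ?? 0;
  const waitMs = Math.max(0, last + crawlDelayMs - Date.now());
  if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  lastRequestAt.set(domain, Date.now());
}
```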

4. Limit Concurrency Per Domain

Do not hammer a single domain with hundreds of simultaneous requests. Even with delays between sequential requests, high concurrency can generate enough load to degrade the site for real users.

A safe default: no more than 1-2 concurrent connections per domain.
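A per-domain cap can be enforced with a small semaphore: acquire a slot before fetching, release it afterward, and queue when the domain is at its limit (a sketch; the class and method names are illustrative):

```javascript
// At most `limit` in-flight requests per domain.
class DomainLimiter {
  constructor(limit = 2) {
    this.limit = limit;
    this.active = new Map();  // domain -> in-flight count
    this.waiters = new Map(); // domain -> queued resolvers
  }

  // Resolve immediately if a slot is free, otherwise queue.
  async acquire(domain) {
    const count = this.active.get(domain) ?? 0;
    if (count < this.limit) {
      this.active.set(domain, count + 1);
      return;
    }
    await new Promise((resolve) => {
      const queue = this.waiters.get(domain) ?? [];
      queue.push(resolve);
      this.waiters.set(domain, queue);
    });
  }

  // Hand the slot to the next waiter, or free it.
  release(domain) {
    const queue = this.waiters.get(domain) ?? [];
    if (queue.length > 0) {
      queue.shift()();
    } else {
      this.active.set(domain, (this.active.get(domain) ?? 1) - 1);
    }
  }
}
```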

5. Cache Responses

Store fetched pages locally or in an object store. Avoid re-fetching a page that has not changed since your last crawl. Use Last-Modified and ETag HTTP headers to make conditional requests:

If-Modified-Since: Wed, 01 Jan 2025 00:00:00 GMT
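A conditional-request cache can be sketched like this (the cache shape and function name are assumptions for illustration; a 304 Not Modified response means the cached copy is still fresh):

```javascript
// cache maps URL -> { etag, lastModified, body } (shape assumed here).
async function fetchIfChanged(url, cache) {
  const cached = cache.get(url);
  const headers = {};
  if (cached?.etag) headers['If-None-Match'] = cached.etag;
  if (cached?.lastModified) headers['If-Modified-Since'] = cached.lastModified;

  const response = await fetch(url, { headers });
  if (response.status === 304) return cached.body; // unchanged: reuse cache

  const body = await response.text();
  cache.set(url, {
    etag: response.headers.get('ETag'),
    lastModified: response.headers.get('Last-Modified'),
    body,
  });
  return body;
}
```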

6. Avoid Crawling During Peak Hours

For large crawls of smaller sites, schedule your crawl during off-peak hours (nights, weekends) to minimize impact on the site's real users.

7. Handle Errors Gracefully

If the server returns a 429 Too Many Requests or 503 Service Unavailable, back off immediately and wait before retrying. Do not retry aggressively — this compounds the problem.

// Back off on 429/503, preferring the server's Retry-After header when present
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

if (response.status === 429 || response.status === 503) {
  const retryAfterMs = Number(response.headers.get('Retry-After')) * 1000;
  const backoffMs = Math.min(baseDelay * 2 ** attempt, maxDelay);
  await sleep(retryAfterMs || backoffMs);
}

Why Polite Crawling Matters

  • Legal protection — ignoring robots.txt and rate limits strengthens a plaintiff's case in CFAA-related litigation
  • Reputation — aggressive crawlers get IP-banned and may harm your organization's reputation
  • Access longevity — polite crawlers are less likely to trigger anti-bot defenses, keeping access open for longer
  • Ecosystem health — if everyone scraped aggressively, smaller sites could not survive the load

KnowledgeSDK's Approach

KnowledgeSDK's APIs (POST /v1/scrape, POST /v1/extract, POST /v1/sitemap) are built with polite crawling principles baked in. The platform manages request pacing, respects server signals, and operates within ethical crawling guidelines — so you can build powerful data pipelines without becoming a bad actor on the web.
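For example, a scrape request might look like the following (the base URL, auth scheme, and body shape are assumptions for illustration; consult KnowledgeSDK's API reference for the actual request format):

```javascript
// Hypothetical call to the POST /v1/scrape endpoint. The host,
// Bearer auth, and { url } body are assumed, not documented here.
async function scrape(url, apiKey) {
  const response = await fetch('https://api.knowledgesdk.com/v1/scrape', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url }),
  });
  return response.json();
}
```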

Related Terms

robots.txt (Web Scraping & Extraction · Beginner)
A text file at the root of a website that instructs web crawlers which pages or sections they are allowed or disallowed from accessing.

Web Crawling (Web Scraping & Extraction · Beginner)
The systematic traversal of websites by following links to discover and fetch pages at scale.

Rate Limiting (Infrastructure & DevOps · Beginner)
A control mechanism that restricts how many API requests a client can make within a given time window.