
Also known as: robots exclusion standard

robots.txt

A text file at the root of a website that instructs web crawlers which pages or sections they are allowed or disallowed from accessing.

What Is robots.txt?

robots.txt is a plain-text file placed at the root of a web server (e.g., https://example.com/robots.txt) that communicates crawling rules to web robots — search engine bots, scrapers, and other automated clients. It is the primary mechanism websites use to signal which parts of their content they do and do not want indexed or crawled.

The file is part of the Robots Exclusion Standard, which began as an informal convention in 1994 and was later formalized as the Robots Exclusion Protocol in RFC 9309 (2022). Well-behaved crawlers have followed it throughout.

Basic Syntax

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 2

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml

Key directives:

  • User-agent — specifies which crawler the rules apply to (* means all crawlers)
  • Disallow — paths the crawler must not access
  • Allow — exceptions to a broader Disallow rule
  • Crawl-delay — recommended seconds to wait between requests (nonstandard; some major crawlers, including Googlebot, ignore it)
  • Sitemap — the location of the site's XML sitemap
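These directives can be exercised with Python's standard-library `urllib.robotparser`. The sketch below parses the example rules above directly; a live crawler would call `set_url()` and `read()` instead, and the bot name "MyBot" is only illustrative:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # feed the rules without an HTTP fetch

print(rp.can_fetch("MyBot", "https://example.com/admin/settings"))   # False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html")) # True
print(rp.crawl_delay("MyBot"))                                       # 2
```

Note that `can_fetch` applies the first matching rule for the most specific `User-agent` group, which mirrors how most crawlers interpret the file.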

Common robots.txt Patterns

Block all crawlers from everything

User-agent: *
Disallow: /

Block a specific bot

User-agent: BadBot
Disallow: /

Allow search engines, block everyone else

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /

Protect admin and API paths

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /user/
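Because rules are grouped per user agent, different bots can receive different answers for the same URL. The "allow search engines, block everyone else" pattern above behaves like this under Python's `urllib.robotparser` (the non-Google bot name is hypothetical):

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The Googlebot group grants access; every other agent falls
# through to the catch-all group and is blocked.
print(rp.can_fetch("Googlebot", "https://example.com/page"))     # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))  # False
```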

What robots.txt Does NOT Do

  • It is not enforced — it is advisory, not a technical barrier. Any crawler can ignore it.
  • It does not prevent indexing — a disallowed page can still appear in search results if other pages link to it.
  • It is publicly readable — anyone can view it, including people looking for hidden paths to probe

For true access control, use server-side authentication or firewall rules.

robots.txt and KnowledgeSDK

When using KnowledgeSDK's crawling and extraction APIs (POST /v1/sitemap, POST /v1/scrape, POST /v1/extract), KnowledgeSDK operates within ethical crawling guidelines. You are responsible for ensuring that your use of these APIs complies with the target site's robots.txt and terms of service.

A quick check before starting any crawl:

curl https://example.com/robots.txt
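The same check can be automated before a crawl. This sketch (user agent and candidate paths are hypothetical) filters a URL list against the rules and reads any Crawl-delay; for a real site you would fetch the live file with `set_url()` and `read()` instead of `parse()`:

```python
from urllib import robotparser

AGENT = "MyBot/1.0"  # hypothetical crawler name

rp = robotparser.RobotFileParser()
# Real crawl: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

candidates = [
    "https://example.com/",
    "https://example.com/admin/users",
    "https://example.com/blog/post-1",
]
allowed = [u for u in candidates if rp.can_fetch(AGENT, u)]
delay = rp.crawl_delay(AGENT) or 1  # seconds to sleep between requests

print(allowed)  # ['https://example.com/', 'https://example.com/blog/post-1']
print(delay)    # 2
```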

Legal and Ethical Significance

While robots.txt has no binding legal force on its own, courts in multiple jurisdictions have referenced a site's robots.txt as evidence of intent when ruling on scraping-related cases. Ignoring explicit Disallow rules — especially after being notified — strengthens a plaintiff's case under laws such as the Computer Fraud and Abuse Act (US) or the Computer Misuse Act (UK).

Respecting robots.txt is both the ethical standard and the legally safer approach for any scraping project.

Related Terms

Web Crawling
The systematic traversal of websites by following links to discover and fetch pages at scale.

Sitemap
An XML or HTML file listing all discoverable URLs on a website, used by crawlers to efficiently find and index pages.

Polite Crawling
Following web crawling best practices such as respecting robots.txt, adding crawl delays, and identifying your crawler in the user agent.
