What Is robots.txt?
robots.txt is a plain-text file placed at the root of a web server (e.g., https://example.com/robots.txt) that communicates crawling rules to web robots — search engine bots, scrapers, and other automated clients. It is the primary mechanism websites use to signal which parts of their content automated clients may and may not crawl.
The file is part of the Robots Exclusion Standard, a convention followed by well-behaved crawlers since 1994 and formalized as RFC 9309 in 2022.
Basic Syntax
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 2
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
Key directives:
- User-agent — specifies which crawler the rules apply to (* means all crawlers)
- Disallow — paths the crawler must not access
- Allow — exceptions to a broader Disallow rule
- Crawl-delay — recommended seconds to wait between requests (a non-standard extension; some major crawlers ignore it)
- Sitemap — the location of the site's XML sitemap
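These directives can also be checked programmatically. As a sketch, Python's standard-library urllib.robotparser can parse the example file above and answer per-agent questions (the bot name "SomeBot" is made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the Basic Syntax section.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 2

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("SomeBot", "/admin/users"))    # False: matches Disallow: /admin/
print(rp.can_fetch("SomeBot", "/public/page"))    # True: matches Allow: /public/
print(rp.can_fetch("Googlebot", "/admin/users"))  # True: the Googlebot group allows everything
print(rp.crawl_delay("SomeBot"))                  # 2
print(rp.site_maps())                             # ['https://example.com/sitemap.xml']
```

Note that `site_maps()` requires Python 3.8 or later; on older versions the rest of the API behaves the same.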
Common robots.txt Patterns
Block all crawlers from everything
User-agent: *
Disallow: /
Block a specific bot
User-agent: BadBot
Disallow: /
Allow search engines, block everyone else
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Protect admin and API paths
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /user/
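The precedence in these patterns is easy to verify with a parser. A quick sketch using urllib.robotparser against the "allow Googlebot, block everyone else" pattern (the name "RandomScraper" is illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Allow: /",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Googlebot matches its own group; every other agent falls through to *.
print(rp.can_fetch("Googlebot", "/any/page"))      # True
print(rp.can_fetch("RandomScraper", "/any/page"))  # False
```

The key rule: a crawler follows only the group that matches its user-agent; the catch-all * group applies only when no more specific group matches, which is why the Disallow: / block does not also apply to Googlebot.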
What robots.txt Does NOT Do
- It is not enforced — it is advisory, not a technical barrier. Any crawler can ignore it.
- It does not prevent indexing — a page can be linked from elsewhere and still appear in search results even if it is in robots.txt
- It is publicly readable — anyone can view it, including people looking for hidden paths to probe
For true access control, use server-side authentication or firewall rules.
robots.txt and KnowledgeSDK
When using KnowledgeSDK's crawling and extraction APIs (POST /v1/sitemap, POST /v1/scrape, POST /v1/extract), KnowledgeSDK operates within ethical crawling guidelines. You are responsible for ensuring that your use of these APIs complies with the target site's robots.txt and terms of service.
A quick check before starting any crawl:
curl https://example.com/robots.txt
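The same check can be automated before each crawl. A minimal sketch using only the standard library; the user-agent string and target URLs below are placeholders, and note that a missing robots.txt (HTTP 404) conventionally means crawling is allowed, which this sketch does not handle:

```python
import urllib.request
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def robots_allows(robots_txt: str, url: str, user_agent: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def fetch_robots_txt(site: str) -> str:
    """Download the robots.txt for a site, e.g. fetch_robots_txt('https://example.com')."""
    parts = urlsplit(site)
    with urllib.request.urlopen(f"{parts.scheme}://{parts.netloc}/robots.txt") as resp:
        return resp.read().decode("utf-8", errors="replace")

# Usage sketch (requires network access; "MyCrawler" is a placeholder token):
# txt = fetch_robots_txt("https://example.com")
# if robots_allows(txt, "https://example.com/some/page", "MyCrawler"):
#     ...proceed with the POST /v1/scrape call...
```

Separating parsing (`robots_allows`) from fetching (`fetch_robots_txt`) keeps the permission check testable without network access.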
Legal and Ethical Significance
While robots.txt has no binding legal force on its own, courts in multiple jurisdictions have referenced a site's robots.txt as evidence of intent when ruling on scraping-related cases. Ignoring explicit Disallow rules — especially after being notified — strengthens a plaintiff's case under laws such as the Computer Fraud and Abuse Act (CFAA, US) or the Computer Misuse Act (UK).
Respecting robots.txt is both the ethical standard and the legally safer approach for any scraping project.