Web Scraping Legal Guide 2026: GDPR, robots.txt, and Terms of Service Explained
Web scraping sits in one of the most contested gray areas of internet law. Every developer building an AI application that touches external data eventually asks the same question: "Is this legal?" The honest answer in 2026 is still "it depends" — but the landscape has clarified considerably since the landmark hiQ v. LinkedIn ruling, and the EU AI Act's training data provisions are adding new wrinkles.
This guide is written for developers and technical teams who need a practical understanding of the legal environment around web scraping. It covers the main legal frameworks, the most relevant case law, and a practical checklist for staying on the right side of the line. It is educational only and does not constitute legal advice. For any real compliance question, talk to a lawyer who specializes in internet law.
The good news: the vast majority of common AI developer use cases — scraping publicly available documentation, product pages, news articles, and company information for knowledge bases — are in genuinely defensible territory. The risk zones are narrower and more specific than the headlines suggest.
Key Legal Frameworks
Three bodies of law are most relevant to web scraping in 2026, depending on where you and your target sites are located.
The Computer Fraud and Abuse Act (CFAA) — United States. The CFAA criminalizes unauthorized access to computer systems. For years, website operators argued that ignoring a Terms of Service prohibition on scraping constituted "unauthorized access" under the CFAA. The hiQ v. LinkedIn ruling effectively rejected this theory for publicly accessible data. However, bypassing authentication (scraping content behind a login without permission) remains a genuine CFAA risk. Accessing systems you are explicitly denied access to — after being served a cease-and-desist, for example — also raises exposure.
GDPR — European Union. The General Data Protection Regulation governs the collection, storage, and processing of personal data about EU residents, regardless of where the data collector is located. If you are scraping content that contains personal data — names, email addresses, profile information — GDPR applies. The key tests: do you have a lawful basis for processing? Is the purpose legitimate? Are you collecting only what you need (data minimization)? Scraping publicly posted business contact information for legitimate B2B purposes is generally considered to have a lawful basis under legitimate interest, but scraping personal profile data at scale for unclear purposes is high-risk.
CCPA — California. The California Consumer Privacy Act gives California residents rights over their personal data. If you are building a product for a US audience and scraping personal information, CCPA compliance is relevant — particularly if your dataset becomes large enough to constitute a "sale" of personal information under the Act's broad definition.
The hiQ Labs v. LinkedIn Landmark Case
The hiQ v. LinkedIn case is the most consequential US court ruling on web scraping to date, and its logic is central to how developers should think about the law.
hiQ scraped LinkedIn's public profiles to build workforce analytics products. LinkedIn sent a cease-and-desist and blocked hiQ's scrapers. hiQ sued for injunctive relief, arguing LinkedIn couldn't lock them out of publicly accessible data.
The Ninth Circuit ruled in hiQ's favor in 2022, finding that scraping publicly available data does not constitute unauthorized access under the CFAA. The reasoning: the CFAA targets unauthorized access to computer systems, and a website that makes data available to any visitor without authentication has implicitly authorized access to that data.
What this means in practice: scraping data that is publicly accessible without login is generally defensible under US federal law. What it does not protect: scraping behind logins, bypassing CAPTCHAs or explicit IP blocks after receiving a cease-and-desist, or violating state laws or GDPR through the data you collect.
robots.txt: Not Legally Binding, But It Matters
robots.txt is a technical convention, not a legal document. Ignoring it does not, by itself, make scraping illegal. Courts have generally treated robots.txt as a signal of intent rather than an enforceable access restriction.
However, robots.txt matters in two practical ways. First, it is relevant to whether you had notice that access was unwanted — which can factor into CFAA analysis and ToS-based arguments. Second, the EU AI Act (discussed below) is moving toward treating robots.txt signals as meaningful for training data rights.
Best practice for any scraper: check robots.txt, respect Crawl-delay directives, and avoid scraping paths explicitly disallowed for your user agent unless you have a clear legal basis to do so. This is also just good web citizenship — ignoring robots.txt is how you end up on blocklists and in cease-and-desist letters.
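Python's standard library ships a robots.txt parser that covers both the Disallow and Crawl-delay checks described above. A minimal sketch (the robots.txt content, bot token, and URLs below are hypothetical placeholders):

```python
from urllib import robotparser

def check_robots(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay) for a URL given raw robots.txt text."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)

# Hypothetical robots.txt for illustration
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

allowed, delay = check_robots(ROBOTS, "mybot", "https://example.com/docs/")
# allowed is True; delay is 5 (seconds)
blocked, _ = check_robots(ROBOTS, "mybot", "https://example.com/private/data")
# blocked is False
```

In production you would point `RobotFileParser.set_url` at the live robots.txt and call `read()` instead of `parse()`; the check itself is identical.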
Terms of Service Clauses
Almost every major website has ToS clauses that prohibit scraping, automated access, or commercial use of scraped data. These clauses are ubiquitous and largely unread.
The legal enforceability of ToS anti-scraping clauses is contested and jurisdiction-dependent. In the US, post-hiQ, violating a ToS prohibition alone is unlikely to create CFAA liability for publicly accessible data. However, ToS violations can still expose you to:
- Breach of contract claims if you clicked through an agreement to access the data
- Tortious interference claims if your scraping harms the site's business
- Copyright claims if the scraped content is copyrightable and you are reproducing it
The practical risk from ToS violations for AI developers scraping public data for knowledge bases is usually low — most operators do not litigate, and most ToS clauses are drafted broadly to cover everything. But receiving a cease-and-desist and continuing to scrape raises the legal temperature significantly. When you get one, stop scraping that source and talk to a lawyer.
What GDPR Means for Scraping Personal Data
GDPR is the most concrete legal risk for AI developers scraping at scale, and it deserves the most attention.
If you are scraping content that contains personal data about EU residents — full names, contact details, professional history, anything that identifies a specific individual — GDPR applies to your entire data lifecycle: collection, storage, processing, and any downstream use.
The critical requirements: you need a lawful basis (legitimate interest is most commonly relied on, but requires a balancing test), you must implement appropriate security, you cannot keep data longer than necessary, and you must be able to respond to subject access requests and deletion requests.
The riskiest use case: building a training dataset from scraped personal data without clear consent. The safer use cases: scraping publicly available business information (company names, product descriptions, published articles) where personal data is incidental and minimized.
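Data minimization can be partly mechanized at ingestion time. A rough sketch that redacts incidental email addresses from scraped text before storage (the regex is illustrative only; a real pipeline needs broader PII detection than one pattern):

```python
import re

# Illustrative email pattern; real PII detection should cover names,
# phone numbers, and other identifiers, not just emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def minimize(text: str) -> str:
    """Strip incidental personal data (here, emails) before storing."""
    return EMAIL_RE.sub("[redacted-email]", text)

page = "Contact our sales team at sales@example.com for pricing."
print(minimize(page))
# Contact our sales team at [redacted-email] for pricing.
```

Redacting at the collection boundary, rather than after storage, keeps the personal data out of your retention and subject-access-request scope entirely.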
Safe Practices for Compliant Scraping
Regardless of jurisdiction, these practices significantly reduce your legal exposure:
- Scrape only public data. That is, data accessible without authentication, with no login wall in front of it.
- Rate limit your requests. Aggressive scraping that degrades a site's performance is the fastest path to cease-and-desist letters and legal action.
- Respect robots.txt. Honor Crawl-delay and Disallow directives.
- Identify your bot. Use a descriptive User-Agent string that includes contact information. This is a basic courtesy and signals good faith.
- Minimize personal data. If you do not need it, do not collect it. If you collect it, delete it when you are done.
- Do not bypass authentication. Never scrape content behind a login without explicit permission.
- Do not circumvent technical measures after notice. If you receive a cease-and-desist or your IP is blocked following explicit notice, continuing to scrape with new IPs is high-risk.
- Comply with attribution and licensing terms. If you publish or train on scraped content, honor any attribution requirements and license terms attached to it.
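Several of the practices above — a descriptive User-Agent with contact details, and rate limiting between requests — can be combined in a small "polite fetcher". A sketch under assumed values (the bot name, contact address, and interval are placeholders you would set for your own crawler):

```python
import time
import urllib.request

# Placeholder identity; use your real bot name and a monitored contact.
USER_AGENT = "MyKnowledgeBot/1.0 (+mailto:ops@example.com)"

class PoliteFetcher:
    """Fetches URLs with a minimum delay between requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval  # seconds between requests
        self._last_request = 0.0

    def _throttle(self):
        # Sleep just long enough to keep min_interval between requests.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

    def fetch(self, url: str) -> bytes:
        self._throttle()
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()
```

A fixed interval is the simplest approach; production crawlers often add exponential backoff on 429 or 503 responses, which this sketch omits.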
EU AI Act Implications for Training Data
The EU AI Act is the newest significant development for AI developers who use scraped data for training. Its provisions on training data transparency and rights are being enforced starting August 2026.
The Act requires providers of general-purpose AI models to publish sufficiently detailed summaries of training data, including web-scraped content. It also requires compliance with copyright law and the EU's text and data mining exceptions — meaning opt-outs expressed via robots.txt or equivalent machine-readable means must be respected for training data.
If you are building AI models trained on web-scraped data for the EU market, you need to:
- Document your training data sources
- Respect copyright opt-outs (including AI-specific robots.txt directives)
- Implement a process for handling rights-holder opt-outs
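Opt-out checks can reuse the same standard-library robots.txt parser, pointed at the user-agent tokens that AI crawlers publish. A sketch assuming a few well-known tokens (GPTBot, CCBot, Google-Extended); the set you check should reflect the crawlers actually relevant to your pipeline:

```python
from urllib import robotparser

# Published AI-crawler tokens; extend this list for your own pipeline.
AI_TOKENS = ["GPTBot", "CCBot", "Google-Extended"]

def training_opt_outs(robots_txt: str, url: str) -> dict:
    """Map each AI token to True if the site opts it out for this URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {token: not rp.can_fetch(token, url) for token in AI_TOKENS}

# Hypothetical robots.txt: blocks GPTBot everywhere, allows everyone else.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

print(training_opt_outs(ROBOTS, "https://example.com/article"))
# {'GPTBot': True, 'CCBot': False, 'Google-Extended': False}
```

Note that robots.txt is only one opt-out channel; the Act's text and data mining regime also recognizes other machine-readable reservations, so a compliant pipeline should not assume robots.txt is the whole story.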
For knowledge base use cases (scraping to populate a RAG system, not to train a model), the AI Act's training data provisions are generally not directly applicable.
Practical Compliance Checklist
Use this checklist before launching any web scraping project:
- Is the target data publicly accessible without authentication?
- Have you reviewed the site's robots.txt and respected its directives?
- Have you reviewed the site's Terms of Service for scraping restrictions?
- Does your scraping involve personal data about EU residents? If yes, do you have a lawful basis?
- Are you rate-limiting requests to avoid service degradation?
- Are you using a descriptive User-Agent with contact information?
- Do you have a process for handling cease-and-desist notices?
- If using data for AI training in the EU, are you documenting sources and respecting opt-outs?
- Have you scoped data retention to what you actually need?
When to Get a Lawyer
For most common AI developer use cases — scraping public documentation, product information, news articles, company data — the legal risk is low and the checklist above covers you. You probably do not need a lawyer to scrape Hacker News.
You should consult a lawyer when: you are scraping personal data at scale, you have received a cease-and-desist letter, you are scraping a competitor's data in a way that could be commercially damaging to them, you are building training datasets for commercial AI products in the EU, or your use case involves financial, medical, or legal data where additional regulations apply.
The law around web scraping is maturing, and the direction is broadly favorable for developers working with publicly available data. The risk zones are concentrated around authentication bypass, personal data collection, and willful disregard of explicit notices. Stay out of those zones and you are operating in well-established territory.
KnowledgeSDK provides an API for extracting clean markdown from websites, with built-in JavaScript rendering and anti-bot handling. Start with 1,000 free requests at knowledgesdk.com.