Disclaimer: This article is for informational purposes only and does not constitute legal advice. Consult a qualified attorney for guidance on your specific situation.
Is web scraping legal? The honest answer is: it depends. It depends on what you're scraping, how you're doing it, what you're doing with the data, and where you're located. The nuanced answer is that most web scraping of publicly available data is legal in most jurisdictions — but there are meaningful landmines to avoid.
This guide covers the key legal concepts developers need to understand, the landmark cases that shaped current law, and the practical best practices that keep your scraping activities defensible.
The Landmark Case: hiQ v. LinkedIn
If there's one case that defines the legal landscape for web scraping in the United States, it's hiQ Labs, Inc. v. LinkedIn Corporation.
The facts are simple: hiQ scraped LinkedIn's public profiles (data visible without login) to build workforce analytics products. LinkedIn sent a cease-and-desist letter and invoked the Computer Fraud and Abuse Act (CFAA). hiQ sued for declaratory relief.
After years of litigation, the Ninth Circuit ruled in 2019 that scraping publicly accessible data likely does not violate the CFAA. The court's reasoning: the CFAA prohibits accessing a computer "without authorization," but if data is publicly accessible (no login required), there is no authorization barrier being bypassed. Accessing public data is, by definition, authorized.
The Supreme Court vacated that decision and remanded the case in light of Van Buren v. United States (2021), which narrowed the CFAA's scope. On remand in 2022, the Ninth Circuit reaffirmed its ruling. The parties later settled, with hiQ conceding it had breached LinkedIn's User Agreement, a reminder that the CFAA is only one of several legal theories in play.
Key takeaway from hiQ: Scraping publicly accessible data (accessible without login) is not a CFAA violation, at least within the Ninth Circuit, which covers California and eight other western states. The reasoning has been influential in other circuits.
What hiQ Did NOT Decide
hiQ is not a license to scrape anything. The ruling specifically addressed:
- Public data only (not data behind authentication)
- CFAA liability only (not contract law or copyright)
- One circuit's interpretation (not binding across all US jurisdictions)
Other legal theories — terms of service violation, copyright infringement, trespass to chattels, state computer crime laws — were not resolved by hiQ.
Public Data vs. Auth-Gated Data
The most important legal distinction in web scraping is whether data is publicly accessible or requires authentication.
Public Data (Generally Permissible)
Data that anyone can access without creating an account or logging in. Examples:
- News articles on public websites
- Product listings on e-commerce sites
- Government databases
- Business directory listings
- Job postings
- Public social media profiles
Scraping this type of data is generally permissible in the US, the EU, and most other jurisdictions, particularly if you're scraping for legitimate business purposes rather than harvesting personal data at scale.
Auth-Gated Data (High Risk)
Data that requires login credentials to access. Examples:
- LinkedIn connections data (requires login)
- Private social media posts
- Paywalled content
- Internal company systems
- Subscription databases
Scraping auth-gated data by bypassing authentication, sharing credentials, or using someone else's account is almost certainly illegal under the CFAA and equivalent laws worldwide. Don't do this.
The Gray Zone: Logged-In Public Profiles
Some data is "public" in the sense that any logged-in user can see it, but requires an account. This is legally contested. The safer interpretation: if you need to create an account to access the data, treat it as auth-gated.
Robots.txt: Not Legally Binding, But Important
robots.txt is a file at the root of a website (/robots.txt) that specifies which paths crawlers should or shouldn't access. A typical file:
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 2

User-agent: Googlebot
Allow: /
The legal status of robots.txt is ambiguous. No US federal law requires robots.txt compliance. Courts have not clearly held that violating robots.txt creates CFAA liability for public data.
However, ignoring robots.txt can be used as evidence of "bad faith" in litigation, can support claims of intentional misconduct, and in some jurisdictions may be relevant to unfair competition claims.
More practically: following robots.txt is the right thing to do. If a site's robots.txt says don't crawl /api/private/, there's a reason — probably that those endpoints are expensive, sensitive, or not intended for public consumption. Ignoring that signal while claiming to be a respectful scraper is hypocritical.
Best practice: Always fetch and parse robots.txt. Honor Disallow directives and Crawl-delay values.
import robotsParser from 'robots-parser';

// Check a URL against the site's robots.txt before fetching it.
async function isAllowed(url: string, userAgent = 'KnowledgeSDKBot'): Promise<boolean> {
  const robotsUrl = `${new URL(url).origin}/robots.txt`;
  try {
    const response = await fetch(robotsUrl);
    const text = await response.text();
    const robots = robotsParser(robotsUrl, text);
    // isAllowed() can return undefined for URLs outside the robots.txt's own
    // domain, so treat anything other than an explicit `false` as permitted.
    return robots.isAllowed(url, userAgent) !== false;
  } catch {
    return true; // If robots.txt is unreachable, assume allowed
  }
}
Terms of Service: Breach of Contract
This is where many scrapers get tripped up. Even if robots.txt and the CFAA don't apply, most websites have Terms of Service that prohibit automated scraping. Violating these terms is a breach of contract.
A typical ToS clause:
"You agree not to use automated tools, bots, scrapers, or spiders to access, collect, or copy content from this website without our express written permission."
Can ToS violations be enforced? Yes. Companies have successfully sued scrapers for breach of contract. The damages are typically actual damages (hard to quantify and often small for data you could have accessed anyway) plus potentially injunctive relief.
The more significant risk is that ToS violations can also be evidence of "exceeding authorized access" under the CFAA, depending on the court.
Practical guidance:
- Read the ToS for any site you're scraping at scale
- If it explicitly prohibits scraping, consider whether you need to scrape it or can get the data another way
- For commercial operations, get written permission or negotiate a data license with sites you depend on heavily
Copyright and Database Rights
The content you scrape may be protected by copyright. While facts (prices, addresses, specifications) are generally not copyrightable, creative works (articles, images, video transcripts, code) are.
What this means in practice:
- You can scrape factual data freely
- Re-publishing scraped articles verbatim likely infringes copyright
- Using scraped content to train LLMs is legally contested (see the ongoing NYT v. OpenAI litigation)
- For RAG applications — using scraped content to answer user questions — the fair use analysis is favorable but not settled law
In the EU, the Database Directive (96/9/EC) adds an additional layer: even collections of non-copyrightable data may be protected by a sui generis "database right" if substantial investment went into their creation. This primarily affects scraping of databases whose makers are established in the EU.
Best practice: Summarize and transform scraped content rather than reproduce it verbatim. This supports fair use arguments and reduces copyright exposure.
GDPR and Personal Data
If you're in the EU, or scraping data about EU residents, GDPR applies. The key question: is the data you're scraping "personal data" (information relating to an identifiable natural person)?
Examples of personal data that triggers GDPR concerns:
- Names and email addresses from business directories
- Social media profiles with real names
- Employee information from company websites
- Reviews with author names attached
What GDPR requires for processing personal data:
- A lawful basis (consent, legitimate interest, legal obligation, etc.)
- A stated purpose
- Data minimization (collect only what you need)
- Appropriate retention limits
- Rights management (deletion, correction, export)
For B2B data (business contact information scraped for lead generation), the "legitimate interest" basis may apply — but this requires a genuine balancing test and is not automatic.
Practical guidance: If your scraping involves personal data:
- Consult a GDPR-specialized lawyer
- Document your lawful basis
- Don't scrape more personal data than you need
- Have a process for handling deletion requests
- Consider whether you need the personal data at all (anonymization is often possible)
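Where anonymization is an option, even a crude redaction pass before storage reduces exposure. A minimal sketch; the regexes below are illustrative, not exhaustive, and real anonymization deserves a proper review:

```typescript
// Strip obvious PII (email addresses, phone-like numbers) from scraped
// text before it is stored. Patterns are deliberately simple examples.
function redactPII(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email redacted]')
    .replace(/\+?\d[\d\s().-]{7,}\d/g, '[phone redacted]');
}
```

Running this at ingestion time, before anything hits disk, also supports the data-minimization principle: personal data you never stored is personal data you never have to delete.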
Computer Fraud Laws Outside the US
Most developed countries have computer fraud laws analogous to the US CFAA:
- EU: Directive 2013/40/EU on attacks against information systems, implemented differently in each member state
- UK: Computer Misuse Act 1990
- Canada: Criminal Code Section 342.1
- Australia: Criminal Code Act 1995
These laws generally focus on unauthorized access, so the same analysis applies: accessing public data that sits behind no authentication barrier is generally lawful, while bypassing authentication mechanisms is not.
Best Practices for Defensible Scraping
Following these practices significantly reduces legal risk:
1. Scrape Public Data Only
Never bypass authentication to access data. If a page requires login, either get an official API or skip it.
2. Respect robots.txt
Always fetch and honor robots.txt, including Crawl-delay directives.
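Libraries such as robots-parser expose Crawl-delay directly, but for illustration, here is a self-contained sketch of extracting the Crawl-delay that applies to a given user agent. Real parsers apply the group-matching rules of RFC 9309 more strictly than this:

```typescript
// Extract the Crawl-delay (in seconds) that applies to `userAgent` from raw
// robots.txt text. Simplified: an exact user-agent match wins, otherwise the
// wildcard (*) group's value is used; returns undefined if neither sets one.
function crawlDelaySeconds(robotsTxt: string, userAgent: string): number | undefined {
  let currentAgents: string[] = [];
  let inGroup = false; // true once we've seen directives after the agent lines
  let matchDelay: number | undefined;
  let wildcardDelay: number | undefined;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.split('#')[0].trim(); // drop comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (!value) continue;
    if (key.trim().toLowerCase() === 'user-agent') {
      if (inGroup) currentAgents = []; // a directive ended the previous group
      currentAgents.push(value.toLowerCase());
      inGroup = false;
    } else {
      inGroup = true;
      if (key.trim().toLowerCase() === 'crawl-delay') {
        const delay = Number(value);
        if (currentAgents.includes(userAgent.toLowerCase())) matchDelay = delay;
        else if (currentAgents.includes('*')) wildcardDelay = delay;
      }
    }
  }
  return matchDelay ?? wildcardDelay;
}
```

Feed the returned value into your request scheduler as a minimum per-domain delay.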
3. Rate Limit Respectfully
Don't hammer servers. Aggressive scraping can be characterized as a denial of service attack in extreme cases.
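One polite-by-construction approach is a minimum interval between requests to the same domain. A minimal sketch; the interval is a tuning knob, not a legal threshold:

```typescript
// Per-domain rate limiter: enforce a minimum gap between requests to the
// same domain. Timestamps are in milliseconds since epoch.
class DomainRateLimiter {
  private lastRequest = new Map<string, number>();

  constructor(private minIntervalMs: number) {}

  // How long (ms) to sleep before the next request to `domain` at time `now`.
  waitTime(domain: string, now: number): number {
    const last = this.lastRequest.get(domain);
    if (last === undefined) return 0;
    return Math.max(0, last + this.minIntervalMs - now);
  }

  recordRequest(domain: string, now: number): void {
    this.lastRequest.set(domain, now);
  }
}
```

Before each fetch, sleep for `waitTime(...)` and then call `recordRequest(...)`. Tracking per domain matters because a crawl that is gentle on any one host can still be aggressive in aggregate.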
4. Identify Your Scraper
Use a descriptive user-agent string that includes a contact email:
User-Agent: KnowledgeSDKBot/1.0 (https://knowledgesdk.com; contact@knowledgesdk.com)
This demonstrates good faith and gives site owners a way to contact you.
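In practice this means attaching the header to every request. Browsers forbid overriding User-Agent from fetch, but server-side runtimes such as Node.js allow it; a sketch:

```typescript
// Descriptive User-Agent naming the bot, its version, homepage, and a
// contact address site owners can reach.
const BOT_UA =
  'KnowledgeSDKBot/1.0 (https://knowledgesdk.com; contact@knowledgesdk.com)';

// Attach the identifying header to every outgoing request (Node.js 18+).
async function politeFetch(url: string): Promise<Response> {
  return fetch(url, { headers: { 'User-Agent': BOT_UA } });
}
```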
5. Cache Aggressively
Avoid re-scraping content that hasn't changed. This reduces server load and request volume.
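A typical shape for this is a TTL plus HTTP revalidation: serve from cache while a time-to-live holds, then ask the server whether the copy is still current instead of re-downloading it. A sketch with illustrative field names:

```typescript
// A cached page plus the metadata needed to revalidate it cheaply.
interface CacheEntry {
  body: string;
  etag?: string;         // value of the ETag response header, if any
  lastModified?: string; // value of the Last-Modified response header, if any
  fetchedAt: number;     // ms since epoch
  ttlMs: number;         // how long to trust the copy without revalidating
}

// Serve from cache with no network round-trip while within the TTL.
function isFresh(entry: CacheEntry, now: number): boolean {
  return now - entry.fetchedAt < entry.ttlMs;
}

// After the TTL, revalidate instead of re-downloading: with these headers
// the server answers 304 Not Modified (an empty body) if nothing changed.
function revalidationHeaders(entry: CacheEntry): Record<string, string> {
  const headers: Record<string, string> = {};
  if (entry.etag) headers['If-None-Match'] = entry.etag;
  if (entry.lastModified) headers['If-Modified-Since'] = entry.lastModified;
  return headers;
}
```

A 304 response costs the origin almost nothing, so this pattern lets you keep data current without re-transferring pages that haven't changed.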
6. Don't Harvest Personal Data at Scale
PII scraping at scale is the fastest path to regulatory trouble in the EU. Minimize personal data collection.
7. Don't Re-Publish Verbatim
Transformative use of scraped content (summarization, analysis, RAG) is much more defensible than verbatim republication.
8. Have a Takedown Process
If a site owner asks you to stop scraping their site, stop. Maintaining a process for and honoring such requests demonstrates good faith.
What KnowledgeSDK Does Not Change About Legal Obligations
Using a web scraping API like KnowledgeSDK handles the technical infrastructure — proxies, browser execution, content extraction — but does not change your legal obligations. You are still responsible for:
- Evaluating whether scraping a specific site is permissible
- Complying with the site's ToS
- Ensuring GDPR compliance for personal data
- Not scraping auth-gated content
- Respecting robots.txt
KnowledgeSDK's infrastructure includes automatic robots.txt checking and rate limiting, which helps with technical compliance. But legal analysis is your responsibility.
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGE_API_KEY });

// KnowledgeSDK respects robots.txt automatically,
// but you should still review the site's ToS for your specific use case.
const result = await client.scrape({
  url: 'https://example.com/public-data',
  // respectRobotsTxt: true (default)
});
Frequently Asked Questions
Q: Is it legal to scrape Google search results?
Google's ToS explicitly prohibits automated access to search results, and Google actively blocks scrapers and has pursued legal action against large-scale automated access. Scraping Google search results is inadvisable. Use the official Custom Search API instead.
Q: Can I scrape social media platforms?
Most major social platforms (X/Twitter, Facebook, Instagram, LinkedIn) explicitly prohibit scraping in their ToS and actively defend against it. They have also filed successful lawsuits (hiQ notwithstanding). Officially: use their APIs. The APIs are rate-limited and sometimes paywalled, but they're the defensible path.
Q: What if I'm only scraping for personal use?
Personal, non-commercial scraping (saving articles for yourself, archiving content you created) is the most defensible use case. Commercial scraping, especially at scale, attracts more scrutiny.
Q: Does scraping for AI training create additional legal risk?
Yes, currently. The use of scraped content to train AI models is the subject of active litigation. The legal landscape is evolving rapidly, and outcomes are uncertain. This is an area where legal advice is especially important.
Q: Is using a scraping API (vs. building your own) legally different?
No. Using a third-party API doesn't change your legal analysis. You're still the party making the decision to scrape, and you're responsible for that decision's legality.
Conclusion
Web scraping public data is generally legal in the United States based on current case law, particularly after hiQ v. LinkedIn. The legal risks concentrate around: bypassing authentication, ignoring ToS restrictions, scraping personal data subject to GDPR, and using scraped content in ways that infringe copyright.
The practical path forward: scrape public data, respect robots.txt and rate limits, minimize personal data collection, transform rather than reproduce content, and have a clear process for responding to takedown requests.
KnowledgeSDK is designed for scraping publicly available web content — documentation, product pages, articles, and other public business information — which sits squarely in the defensible zone.
Get your API key at knowledgesdk.com/setup.
Reminder: This article is general information, not legal advice. Consult a qualified attorney for your specific situation.