Web Scraping for AI Training Data: Building High-Quality LLM Datasets
Web scraping is the foundational data collection method for LLM training. Common Crawl, a public petabyte-scale web crawl corpus, underlies most foundation model pre-training, and GPT-4, Llama 3, Claude, and Gemini were all trained on web-derived text. The web is where human language, reasoning, code, and knowledge live at scale, which makes it irreplaceable as a training data source.
But for teams building domain-specific AI — a medical coding assistant, a legal research tool, a coding copilot for a specific framework — Common Crawl isn't enough. You need targeted high-quality data from authoritative sources in your domain. That requires purposeful web scraping: selecting the right sources, extracting clean content, filtering for quality, deduplicating, and formatting into training-ready examples.
This guide covers the full pipeline for domain-specific LLM training data collection, from source selection through HuggingFace dataset formatting, including the legal and ethical considerations that became mandatory reading once the EU AI Act's training data provisions took effect in August 2025.
Pre-Training vs. Fine-Tuning: Different Data Needs
Before building a pipeline, be precise about what kind of training you're doing. The data requirements are fundamentally different.
Pre-training builds foundation model capabilities: language understanding, reasoning, factual knowledge, code generation. This requires enormous volume — hundreds of billions of tokens — and broad coverage. Quality filtering matters, but so does scale. Most teams doing pre-training aren't building their own data pipelines from scratch; they use Common Crawl derivatives like FineWeb, RedPajama, or Dolma.
Fine-tuning adapts a pre-trained model to a specific domain, task, or style. This requires quality over quantity — thousands to millions of high-quality examples, not billions of mediocre ones. Domain-specific fine-tuning datasets are where purposeful web scraping pays off. You want authoritative sources in your target domain, correctly formatted examples, and content that the foundation model hasn't already saturated on.
Instruction tuning (a subset of fine-tuning) trains models to follow instructions. Data format is specific: (instruction, input, output) triples. Web content needs to be converted into this format — raw scraped text isn't directly usable.
This guide focuses on fine-tuning and instruction tuning pipelines, where domain-specific web scraping is most valuable.
Legal and Ethical Considerations
This section is not optional reading. Training data legality has become a first-order concern.
EU AI Act (August 2025). The EU AI Act's transparency provisions for general-purpose AI models require providers to publish a sufficiently detailed summary of the content used for training, including web-crawled data, and to maintain a copyright policy that identifies and respects machine-readable opt-out signals such as robots.txt reservations. Training on data collected in violation of these provisions exposes you to regulatory risk in the EU market.
Robots.txt. The emerging industry norm, codified in some jurisdictions, is that crawler directives such as a Disallow: / rule under a User-agent: GPTBot block should be respected for training data collection. Check robots.txt before scraping. Many high-quality sources (news sites, academic publishers) now explicitly disallow AI training crawlers.
Copyright and ToS. Web content is generally copyrighted. Fair use and fair dealing doctrines in the US and UK may protect certain training uses, but this is actively litigated: the ongoing cases involving news publishers (NYT v. OpenAI) and code repositories (the GitHub Copilot litigation) will set precedent. For high-risk domains, have legal counsel review your source list.
Practical guidance:
- Prefer CC-licensed content (Creative Commons), open-access academic papers, government data, and public domain text
- Check robots.txt and honor AI training crawl restrictions (see the check below)
- Avoid scraping paywalled content or content with explicit ToS prohibitions
- Maintain a record of source URLs and access dates for provenance tracking
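For example, a minimal pre-flight check with the robots-parser package looks like this: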
import robotsParser from 'robots-parser';

async function isScrapingAllowed(url, userAgent = 'KnowledgeSDKBot') {
  const domain = new URL(url).origin;
  const robotsUrl = `${domain}/robots.txt`;

  try {
    const response = await fetch(robotsUrl);
    const robotsTxt = await response.text();
    const robots = robotsParser(robotsUrl, robotsTxt);
    return robots.isAllowed(url, userAgent);
  } catch {
    // Missing or unreachable robots.txt: treated as allowed here; tighten this if your risk tolerance requires it
    return true;
  }
}
Source Selection: Quality Over Quantity
The most important decision in building a training dataset is source selection. A dataset of 10M tokens from 20 authoritative sources in your domain will outperform 1B tokens from indiscriminate crawling for fine-tuning purposes.
Characteristics of high-quality training sources:
- Authoritative. Written by domain experts, not content mills. Academic journals, official documentation, reputable trade publications, professional organization sites.
- Consistent style. Sources with coherent writing standards produce cleaner training examples.
- Dense in domain knowledge. Not just mentions of domain concepts, but substantial treatment of them.
- High text-to-noise ratio. Minimal ads, navigation, boilerplate, and marketing filler.
- CC-licensed or training-permissive. See legal section above.
Finding sources systematically:
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// For a medical coding dataset, start with authoritative sources
const domainSources = [
  'https://www.cms.gov/medicare/coding-billing', // CMS (public domain)
  'https://www.ama-assn.org/practice-management/cpt', // AMA (check ToS)
  'https://www.ncbi.nlm.nih.gov/books/', // NCBI (open access)
  'https://www.who.int/classifications/', // WHO (CC BY-NC)
];

// Discover all content URLs at each source
for (const source of domainSources) {
  const sitemap = await client.sitemap({ url: source });
  console.log(`Found ${sitemap.urls.length} URLs at ${source}`);
  // Filter to relevant content pages, not index/nav pages, for example:
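  // (Illustrative only: these URL patterns are assumptions; tune them per source)
  const contentUrls = sitemap.urls.filter(u =>
    u.startsWith(source) && !/\/(search|tag|category|page)\//.test(u)
  );
  console.log(`${contentUrls.length} candidate content pages`);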
}
The Extraction Pipeline
With sources identified, the pipeline is: crawl → scrape → clean → filter → format.
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function buildTrainingCorpus(sourceUrls) {
  const corpus = [];

  for (const url of sourceUrls) {
    // Check robots.txt first
    if (!(await isScrapingAllowed(url))) {
      console.log(`Skipping ${url} — disallowed by robots.txt`);
      continue;
    }

    try {
      // Extract clean markdown
      const { markdown, title, url: canonicalUrl } = await client.scrape({ url });

      // Quality filter
      const quality = assessQuality(markdown);
      if (!quality.passes) {
        console.log(`Skipping ${url} — quality filter: ${quality.reason}`);
        continue;
      }

      corpus.push({
        url: canonicalUrl,
        title,
        content: markdown,
        wordCount: markdown.split(/\s+/).length,
        scrapedAt: new Date().toISOString(),
      });

      // Be polite: wait 2 seconds between requests (use per-domain queues if scraping domains concurrently)
      await new Promise(r => setTimeout(r, 2000));
    } catch (err) {
      console.error(`Failed to scrape ${url}:`, err.message);
    }
  }

  return corpus;
}
Quality Filtering
Quality filtering is what separates a good training dataset from a noisy one. Apply these filters before any content enters your training pipeline:
function assessQuality(markdown) {
  const wordCount = markdown.split(/\s+/).length;
  const lines = markdown.split('\n');

  // Minimum length filter
  if (wordCount < 200) {
    return { passes: false, reason: 'too_short' };
  }

  // Maximum length filter (avoid log dumps, auto-generated pages)
  if (wordCount > 50000) {
    return { passes: false, reason: 'too_long' };
  }

  // Repetitive content detection (line diversity ratio over substantive lines)
  const contentLines = lines.filter(l => l.trim().length > 10);
  const uniqueLines = new Set(contentLines);
  const diversityRatio = uniqueLines.size / Math.max(contentLines.length, 1);
  if (diversityRatio < 0.5) {
    return { passes: false, reason: 'low_diversity' };
  }

  // Boilerplate detection (high ratio of navigation-like short lines)
  const shortLines = lines.filter(l => l.trim().length > 0 && l.trim().length < 30);
  const shortLineRatio = shortLines.length / lines.length;
  if (shortLineRatio > 0.6) {
    return { passes: false, reason: 'high_boilerplate' };
  }

  // Code-to-prose balance: count prose words outside fenced code blocks
  // (for code-heavy domains, adjust these thresholds)
  const codeBlockCount = (markdown.match(/```/g) ?? []).length / 2;
  const proseWordCount = markdown.replace(/```[\s\S]*?```/g, '').split(/\s+/).length;
  if (proseWordCount < 100 && codeBlockCount < 3) {
    return { passes: false, reason: 'insufficient_content' };
  }

  return { passes: true };
}
For domain-specific filtering, add checks relevant to your domain — medical content should reference specific terminologies, legal content should have case citations, code documentation should have code examples.
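For instance, a medical coding corpus might require a minimum density of domain terminology before a page is accepted. The term list and threshold below are illustrative assumptions, not a standard:

// Hypothetical domain check: require a minimum density of medical coding terms
const MEDICAL_CODING_TERMS = /\b(ICD-10|CPT|HCPCS|modifier|E\/M|diagnosis code)\b/gi;

function passesDomainCheck(markdown) {
  const wordCount = markdown.split(/\s+/).length;
  const termHits = (markdown.match(MEDICAL_CODING_TERMS) ?? []).length;
  // Illustrative threshold: at least one domain term per 500 words
  return termHits / wordCount >= 1 / 500;
}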
Deduplication
Training on duplicate content is wasteful at best and harmful at worst — models overfit to repeated examples. Deduplication operates at two levels:
Exact deduplication — hash-based, cheap, handles syndicated content:
import crypto from 'crypto';

function contentHash(text) {
  // Normalize before hashing: lowercase, collapse whitespace
  const normalized = text.toLowerCase().replace(/\s+/g, ' ').trim();
  return crypto.createHash('sha256').update(normalized).digest('hex');
}

const seenHashes = new Set();

function isDuplicate(markdown) {
  const hash = contentHash(markdown);
  if (seenHashes.has(hash)) return true;
  seenHashes.add(hash);
  return false;
}
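Applied to the corpus from the extraction step, this is a one-liner (assuming the corpus array built above):

const deduped = corpus.filter(doc => !isDuplicate(doc.content));
console.log(`Removed ${corpus.length - deduped.length} exact duplicates`);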
Near-duplicate detection — MinHash-based, catches paraphrased or lightly edited duplicates:
// Shingle each document, then hash the shingles with a MinHash implementation
// (e.g. the 'minhash' npm package) to estimate Jaccard similarity at scale
function textToShingles(text, k = 5) {
  const words = text.split(/\s+/);
  const shingles = new Set();
  for (let i = 0; i <= words.length - k; i++) {
    shingles.add(words.slice(i, i + k).join(' '));
  }
  return shingles;
}
// Documents with Jaccard similarity > 0.8 are considered near-duplicates
// MinHash approximates this efficiently at scale
For datasets up to a few million documents, exact deduplication is usually sufficient. For larger corpora, MinHash LSH (Locality-Sensitive Hashing) is the standard approach used by large-scale web-corpus pipelines built on Common Crawl.
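For corpora small enough to compare candidate pairs directly, exact Jaccard similarity over the shingle sets is a simple baseline before reaching for MinHash LSH. A sketch using the textToShingles helper above:

function jaccardSimilarity(aText, bText) {
  const a = textToShingles(aText);
  const b = textToShingles(bText);
  let intersection = 0;
  for (const shingle of a) {
    if (b.has(shingle)) intersection++;
  }
  const union = a.size + b.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Flag near-duplicates with the 0.8 threshold mentioned above
const isNearDuplicate = (a, b) => jaccardSimilarity(a, b) > 0.8;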
Formatting for Training
Raw markdown needs to be converted into training examples. The format depends on the training objective.
For continued pre-training (domain adaptation), raw text is sufficient:
// HuggingFace datasets format for pre-training
const pretrainingExample = {
  text: document.content, // Raw markdown content
  source: document.url,
  token_count: estimateTokens(document.content),
};
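estimateTokens is not an SDK function; a characters-per-token heuristic is usually adequate for corpus accounting, and you can swap in a real tokenizer for exact counts:

// Rough heuristic: English prose averages roughly 4 characters per token
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}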
For instruction tuning, you need (instruction, input, output) triples. Generate these from your scraped content:
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateInstructionPairs(document) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Generate 3 high-quality instruction-following examples from this content.
Each example should be a realistic question a user might ask, with an answer grounded in the content.
Return a JSON object of the form { "examples": [{ "instruction": ..., "input": ..., "output": ... }] }.
- instruction: the user's question or task
- input: additional context if needed (empty string if not needed)
- output: the ideal response, citing specific facts from the content

Content:
${document.content.slice(0, 3000)}`,
    }],
    response_format: { type: 'json_object' },
  });

  return JSON.parse(completion.choices[0].message.content).examples;
}
// Alpaca format (compatible with most fine-tuning frameworks)
const instructionExample = {
  instruction: "What are the diagnostic criteria for Type 2 diabetes?",
  input: "",
  output: "According to the ADA guidelines, Type 2 diabetes is diagnosed when...",
};
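If your fine-tuning framework expects chat-style messages rather than Alpaca triples, the conversion is mechanical. A sketch using the common OpenAI-style messages layout:

function toChatFormat(example) {
  return {
    messages: [
      {
        role: 'user',
        content: example.input
          ? `${example.instruction}\n\n${example.input}`
          : example.instruction,
      },
      { role: 'assistant', content: example.output },
    ],
  };
}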
Building a Domain-Specific Coding Dataset
As a concrete example: building a fine-tuning dataset for a coding assistant focused on a specific framework (say, Next.js App Router).
import fs from 'fs/promises';

const nextjsSources = [
  'https://nextjs.org/docs', // Official docs (check ToS)
  'https://nextjs.org/blog', // Release notes and guides
];

async function buildNextJsDataset() {
  const allUrls = [];

  // Discover all documentation URLs
  for (const source of nextjsSources) {
    const sitemap = await client.sitemap({ url: source });
    allUrls.push(...sitemap.urls.filter(u => u.includes('/docs/') || u.includes('/blog/')));
  }

  const corpus = await buildTrainingCorpus(allUrls);

  // Generate instruction pairs from each document
  const instructionPairs = [];
  for (const doc of corpus) {
    const pairs = await generateInstructionPairs(doc);
    instructionPairs.push(...pairs.map(p => ({
      ...p,
      source_url: doc.url,
      domain: 'nextjs',
    })));
  }

  // 90/10 train/test split
  const splitIndex = Math.floor(instructionPairs.length * 0.9);
  const dataset = {
    train: instructionPairs.slice(0, splitIndex),
    test: instructionPairs.slice(splitIndex),
  };

  // Save as JSONL, loadable with the HuggingFace datasets JSON loader
  await fs.writeFile('./nextjs-train.jsonl',
    dataset.train.map(d => JSON.stringify(d)).join('\n')
  );
  await fs.writeFile('./nextjs-test.jsonl',
    dataset.test.map(d => JSON.stringify(d)).join('\n')
  );

  console.log(`Dataset built: ${dataset.train.length} train / ${dataset.test.length} test examples`);
  return dataset;
}
KnowledgeSDK's extraction handles the heavy lifting here: JavaScript-rendered documentation pages, clean markdown output without navigation noise, and consistent formatting across hundreds of pages from the same source.
Dataset Scale vs. Quality Tradeoff
| Dataset size | Training approach | Source strategy | Expected quality |
|---|---|---|---|
| 1K-10K examples | Full fine-tuning small model | Curated, manually reviewed | Very high |
| 10K-100K examples | LoRA/QLoRA fine-tuning | Semi-automated, filtered | High |
| 100K-1M examples | Full fine-tuning large model | Automated, quality-filtered | Medium-high |
| 1M+ examples | Pre-training / continued pre-training | Broad crawl + filters | Medium |
For most domain-specific fine-tuning tasks, 10K-100K high-quality examples produce better results than 1M+ low-quality ones. Prioritize source quality and filtering rigor over raw volume.
KnowledgeSDK's Starter plan ($29/month) provides 10,000 monthly requests — enough to build a meaningful domain-specific dataset during initial development. The Pro plan ($99/month) covers 100,000 requests for larger corpus builds. Start with the free tier (1,000 requests) to validate your pipeline at knowledgesdk.com/setup.