# Web Scraping for LLM Fine-Tuning: Building High-Quality Training Datasets
General-purpose LLMs are impressive, but they generalize across everything — which means they are mediocre at specific domains. A model fine-tuned on domain-specific content — legal filings, medical literature, internal engineering documentation, financial reports — will consistently outperform a general model on that domain's tasks, even when the general model is significantly larger.
The bottleneck for domain-specific fine-tuning is not compute or budget. It is data. You need thousands of examples of high-quality text from your target domain, in a clean format the fine-tuning framework can consume. Building that dataset from scratch is the work.
This article covers the full pipeline: identifying target sources, crawling at scale with KnowledgeSDK, filtering by quality signals, deduplicating, and exporting to the JSONL formats expected by OpenAI, Anthropic, and HuggingFace fine-tuning pipelines. We also walk through how to convert raw web content into instruction-tuning format (Q&A pairs) automatically, and include cost estimates for a 50,000-page corpus.
## Planning Your Dataset
Before writing any code, spend time on source selection. The quality of your fine-tuning data is the single biggest determinant of how good the fine-tuned model will be. Two principles:
**Signal density matters more than volume.** 10,000 pages of expert-written documentation will produce a better model than 100,000 pages of mixed-quality forum posts. For most domain-specific tasks, a carefully curated 5,000-10,000 example dataset will outperform a noisy 100,000 example dataset.
**Match the source format to your target use case.** If you want the model to answer technical questions, fine-tune on Q&A content. If you want it to write in a particular style, fine-tune on examples of that style. If you want it to follow instructions, fine-tune on instruction-following demonstrations.
### Source Types by Domain
| Domain | High-Quality Sources | What to Extract |
|---|---|---|
| Engineering | Official documentation, RFCs, PEPs | Conceptual explanations, code + description pairs |
| Legal | Court opinions, regulatory filings, law review articles | Case summaries, statute explanations |
| Medical | PubMed abstracts, clinical guidelines, textbooks | Abstract + conclusion pairs |
| Finance | SEC filings, analyst reports, earnings transcripts | Company descriptions, risk factor analysis |
| Customer support | Your own ticket history, FAQ pages | Question + resolution pairs |
## Step 1: URL Discovery
Start by enumerating the URLs you want to crawl. Most documentation sites and structured content sources have sitemaps.
```python
import os
import json
from pathlib import Path
from datetime import datetime

import knowledgesdk

ks = knowledgesdk.Client(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def discover_urls(sitemap_urls: list[str], output_file: str = "urls.jsonl") -> list[str]:
    """Enumerate all crawlable URLs from a list of sitemaps."""
    all_urls = []
    for sitemap_url in sitemap_urls:
        print(f"Fetching sitemap: {sitemap_url}")
        try:
            result = ks.sitemap(sitemap_url)
            urls = result["urls"]
            all_urls.extend(urls)
            print(f"  Found {len(urls)} URLs")
        except Exception as e:
            print(f"  Error fetching sitemap: {e}")

    # Deduplicate while preserving order
    all_urls = list(dict.fromkeys(all_urls))
    print(f"\nTotal unique URLs: {len(all_urls)}")

    # Save for resumability
    with open(output_file, "w") as f:
        for url in all_urls:
            f.write(json.dumps({"url": url, "status": "pending"}) + "\n")
    return all_urls

# Example: crawl multiple documentation sites
target_sitemaps = [
    "https://docs.python.org/3/sitemap.xml",
    "https://pytorch.org/docs/stable/sitemap.xml",
    "https://huggingface.co/docs/transformers/sitemap.xml",
]
urls = discover_urls(target_sitemaps)
```
## Step 2: Crawling the Corpus
Crawl the discovered URLs and store raw content. Design the crawler for resumability — large crawls will fail partway through, and you do not want to restart from scratch.
````python
import hashlib
import asyncio
import aiofiles
from dataclasses import dataclass, asdict

@dataclass
class RawPage:
    url: str
    title: str
    markdown: str
    word_count: int
    has_code_blocks: bool
    has_headers: bool
    crawled_at: str
    content_hash: str

def analyze_content(markdown: str) -> dict:
    """Compute quality signals for a page."""
    words = len(markdown.split())
    has_code = "```" in markdown or "    " in markdown[:500]  # fenced or indented code
    has_headers = markdown.count("\n#") > 1  # multiple headings
    has_lists = markdown.count("\n- ") > 3  # meaningful list content
    return {
        "word_count": words,
        "has_code_blocks": has_code,
        "has_headers": has_headers,
        "has_lists": has_lists,
    }

async def crawl_corpus(
    urls: list[str],
    output_dir: str = "raw_pages",
    max_concurrent: int = 5,
    skip_existing: bool = True,
) -> None:
    """Crawl all URLs and save raw content."""
    Path(output_dir).mkdir(exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_one(url: str) -> None:
        # Use URL hash as filename for resumability
        url_hash = hashlib.md5(url.encode()).hexdigest()
        output_path = f"{output_dir}/{url_hash}.json"
        if skip_existing and Path(output_path).exists():
            return  # Already crawled

        async with semaphore:
            try:
                # Run the synchronous SDK call in a thread pool
                loop = asyncio.get_event_loop()
                result = await loop.run_in_executor(None, ks.extract, url)
                signals = analyze_content(result["markdown"])
                page = RawPage(
                    url=url,
                    title=result.get("title", ""),
                    markdown=result["markdown"],
                    word_count=signals["word_count"],
                    has_code_blocks=signals["has_code_blocks"],
                    has_headers=signals["has_headers"],
                    crawled_at=datetime.utcnow().isoformat(),
                    content_hash=hashlib.sha256(result["markdown"].encode()).hexdigest(),
                )
                async with aiofiles.open(output_path, "w") as f:
                    await f.write(json.dumps(asdict(page), ensure_ascii=False))
                print(f"Crawled: {url} ({signals['word_count']} words)")
            except Exception as e:
                print(f"Failed: {url} — {e}")
                # Write an error marker so failures can be retried later
                async with aiofiles.open(f"{output_dir}/{url_hash}.error", "w") as f:
                    await f.write(json.dumps({"url": url, "error": str(e)}))

    await asyncio.gather(*[crawl_one(url) for url in urls])
    print(f"\nCrawl complete. Results in {output_dir}/")

# Run the crawl
asyncio.run(crawl_corpus(urls, output_dir="raw_pages"))
````
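The `.error` markers written above make retries cheap. As a sketch, a small helper (not part of the pipeline above; the name `collect_failed_urls` is ours) can gather the failed URLs so you can re-run `crawl_corpus` on just those:

```python
import glob
import json

def collect_failed_urls(output_dir: str = "raw_pages") -> list[str]:
    """Gather URLs whose crawl previously failed, based on .error marker files."""
    failed = []
    for error_file in glob.glob(f"{output_dir}/*.error"):
        with open(error_file) as f:
            failed.append(json.load(f)["url"])
    return failed

# Retry only the failures (delete the .error marker once a retry succeeds):
# asyncio.run(crawl_corpus(collect_failed_urls(), output_dir="raw_pages"))
```

Because `crawl_one` names output files by URL hash and skips existing files, retrying is idempotent: already-crawled pages are untouched.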
## Step 3: Quality Filtering
Raw web content is noisy. Before fine-tuning, filter out pages that will degrade the model rather than improve it.
```python
import glob
from typing import Iterator

def quality_filter(raw_pages_dir: str) -> Iterator[dict]:
    """
    Apply quality filters to raw pages.
    Yields pages that pass all filters.
    """
    files = glob.glob(f"{raw_pages_dir}/*.json")
    stats = {
        "total": 0,
        "passed": 0,
        "filtered_short": 0,
        "filtered_boilerplate": 0,
        "filtered_low_signal": 0,
    }

    for filepath in files:
        with open(filepath) as f:
            page = json.load(f)
        stats["total"] += 1
        markdown = page["markdown"]

        # Filter 1: Minimum length
        if page["word_count"] < 200:
            stats["filtered_short"] += 1
            continue

        # Filter 2: Maximum length (avoid full book dumps that pollute context)
        if page["word_count"] > 15_000:
            # Truncate rather than discard
            words = markdown.split()[:15_000]
            markdown = " ".join(words)
            page["markdown"] = markdown
            page["word_count"] = 15_000

        # Filter 3: Boilerplate detection
        boilerplate_signals = [
            "cookie policy",
            "privacy policy",
            "terms of service",
            "404 not found",
            "page not found",
            "access denied",
            "enable javascript",
        ]
        lower = markdown.lower()
        if any(signal in lower for signal in boilerplate_signals) and page["word_count"] < 500:
            stats["filtered_boilerplate"] += 1
            continue

        # Filter 4: Content quality score
        # At least one of: has code blocks, has headers, substantial length
        quality_signals = sum([
            page.get("has_code_blocks", False),
            page.get("has_headers", False),
            page["word_count"] > 400,
        ])
        if quality_signals < 1:
            stats["filtered_low_signal"] += 1
            continue

        stats["passed"] += 1
        yield page

    print("\nFiltering complete:")
    print(f"  Total: {stats['total']}")
    print(f"  Passed: {stats['passed']} ({100 * stats['passed'] // max(1, stats['total'])}%)")
    print(f"  Filtered (too short): {stats['filtered_short']}")
    print(f"  Filtered (boilerplate): {stats['filtered_boilerplate']}")
    print(f"  Filtered (low signal): {stats['filtered_low_signal']}")
```
## Step 4: Deduplication
The web has substantial content duplication — mirror sites, scraped republications, and pages with near-identical content. Deduplication is essential for fine-tuning data quality.
```python
from datasketch import MinHash, MinHashLSH

def deduplicate_pages(pages: list[dict], threshold: float = 0.85) -> list[dict]:
    """
    Near-duplicate detection using MinHash LSH.
    threshold: Jaccard similarity above which pages are considered duplicates.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_pages = []
    seen_hashes = set()

    for i, page in enumerate(pages):
        # Exact-duplicate check first (fast)
        content_hash = page["content_hash"]
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # MinHash for near-duplicate detection
        minhash = MinHash(num_perm=128)
        # Shingle the text at word level (3-word shingles)
        words = page["markdown"].lower().split()
        for j in range(len(words) - 2):
            shingle = " ".join(words[j:j + 3])
            minhash.update(shingle.encode())

        if lsh.query(minhash):
            # Near-duplicate of a page already kept — skip it
            continue
        lsh.insert(f"page_{i}", minhash)
        unique_pages.append(page)

    print(f"Deduplication: {len(pages)} → {len(unique_pages)} pages "
          f"({len(pages) - len(unique_pages)} duplicates removed)")
    return unique_pages
```
## Step 5: Exporting to Fine-Tuning Formats

### Completion Format (for Anthropic / HuggingFace pre-training style)
```python
def export_completion_format(pages: list[dict], output_file: str) -> None:
    """
    Simple completion format: each page is one training example.
    Used for language-modeling-style fine-tuning.
    Compatible with the HuggingFace Trainer and most open-source frameworks.
    """
    with open(output_file, "w") as f:
        for page in pages:
            example = {"text": f"# {page['title']}\n\n{page['markdown']}"}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    print(f"Exported {len(pages)} examples to {output_file}")
```
### Instruction-Tuning Format (Q&A pairs for OpenAI fine-tuning)
This format requires an LLM to generate question-answer pairs from each page. It is more expensive but produces models that are significantly better at following instructions.
```python
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_qa_pairs(page: dict, num_pairs: int = 3) -> list[dict]:
    """Generate Q&A pairs from a page using gpt-4o-mini."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": f"""Generate {num_pairs} high-quality question-answer pairs from the provided text.
Each pair should:
- Ask a specific, factual question that is fully answered by the text
- Have a complete, accurate answer based only on the provided content
- Be useful for someone learning this topic
Return JSON: {{"pairs": [{{"question": "...", "answer": "..."}}]}}""",
            },
            {
                "role": "user",
                "content": f"Title: {page['title']}\n\n{page['markdown'][:4000]}",
            },
        ],
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("pairs", [])
```
```python
def export_openai_chat_format(pages: list[dict], output_file: str, system_prompt: str) -> None:
    """
    OpenAI chat fine-tuning format.
    See: https://platform.openai.com/docs/guides/fine-tuning
    """
    with open(output_file, "w") as f:
        for page in pages:
            try:
                qa_pairs = generate_qa_pairs(page)
                for pair in qa_pairs:
                    example = {
                        "messages": [
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": pair["question"]},
                            {"role": "assistant", "content": pair["answer"]},
                        ]
                    }
                    f.write(json.dumps(example, ensure_ascii=False) + "\n")
            except Exception as e:
                print(f"QA generation failed for {page['url']}: {e}")

# Export both formats
filtered_pages = list(quality_filter("raw_pages"))
deduped_pages = deduplicate_pages(filtered_pages)

export_completion_format(deduped_pages, "training_completion.jsonl")
export_openai_chat_format(
    deduped_pages[:2000],  # limit for cost control during the initial run
    "training_chat.jsonl",
    system_prompt="You are a helpful assistant with deep expertise in Python and machine learning.",
)
```
## Step 6: Validation and Statistics
Before uploading to a fine-tuning API, validate the dataset and review statistics:
```python
def validate_dataset(jsonl_file: str, format_type: str = "chat") -> dict:
    """Validate a JSONL dataset file and return statistics."""
    stats = {
        "total_examples": 0,
        "total_tokens_estimate": 0,
        "avg_tokens_per_example": 0,
        "errors": [],
    }
    with open(jsonl_file) as f:
        for line_num, line in enumerate(f, 1):
            try:
                example = json.loads(line.strip())
                if format_type == "chat":
                    assert "messages" in example, "Missing 'messages' key"
                    assert len(example["messages"]) >= 2, "Need at least 2 messages"
                    text = " ".join(m["content"] for m in example["messages"])
                else:
                    assert "text" in example, "Missing 'text' key"
                    text = example["text"]
                # Rough token estimate: ~4 characters per token
                stats["total_tokens_estimate"] += len(text) // 4
                stats["total_examples"] += 1
            except (json.JSONDecodeError, AssertionError, KeyError) as e:
                stats["errors"].append(f"Line {line_num}: {e}")

    if stats["total_examples"] > 0:
        stats["avg_tokens_per_example"] = (
            stats["total_tokens_estimate"] // stats["total_examples"]
        )
    return stats

chat_stats = validate_dataset("training_chat.jsonl", format_type="chat")
print("Dataset validation:")
print(f"  Examples: {chat_stats['total_examples']:,}")
print(f"  Estimated tokens: {chat_stats['total_tokens_estimate']:,}")
print(f"  Avg tokens/example: {chat_stats['avg_tokens_per_example']}")
print(f"  Errors: {len(chat_stats['errors'])}")
```
## Cost Estimate: 50,000-Page Corpus
Here is a realistic cost breakdown for building a 50,000-page fine-tuning dataset:
### Crawling Costs
| Item | Calculation | Cost |
|---|---|---|
| KnowledgeSDK extraction (50,000 pages) | 50,000 × $0.004 | $200 |
| Sitemap discovery (50 sitemaps) | 50 × $0.001 | $0.05 |
| Total crawling | | $200 |
### Processing Costs
| Item | Calculation | Cost |
|---|---|---|
| GPT-4o-mini for QA generation (assume 30% pass quality filter: 15,000 pages × 3 pairs each) | 15,000 × ~1,500 tokens × $0.00015/1k | $3.38 |
| Deduplication (compute only) | Negligible | $0 |
| Total processing | | ~$3.50 |
### Fine-Tuning Costs (OpenAI gpt-4o-mini)
| Item | Calculation | Cost |
|---|---|---|
| 45,000 chat examples × 300 avg tokens | 13.5M tokens × $0.003/1k | $40.50 |
| Total fine-tuning | | ~$40 |
### Total Pipeline Cost
| Phase | Cost |
|---|---|
| Crawling (50,000 pages) | $200 |
| QA pair generation | $3.50 |
| Fine-tuning | $40 |
| Total | ~$245 |
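The arithmetic in these tables can be parameterized for your own corpus size. A minimal sketch — the unit prices and the 30% filter pass rate are this article's assumptions, not published rate cards; substitute your own:

```python
# Assumed unit prices from the cost tables above.
PRICE_PER_EXTRACTION = 0.004       # KnowledgeSDK extraction, $/page
PRICE_QA_PER_1K_TOKENS = 0.00015   # gpt-4o-mini input, $/1k tokens
PRICE_FT_PER_1K_TOKENS = 0.003     # gpt-4o-mini fine-tuning, $/1k tokens

def estimate_cost(pages: int, pass_rate: float = 0.30,
                  qa_tokens_per_page: int = 1_500,
                  pairs_per_page: int = 3,
                  tokens_per_example: int = 300) -> dict:
    """Estimate end-to-end pipeline cost for a given corpus size."""
    passed = int(pages * pass_rate)  # pages surviving the quality filter
    crawling = pages * PRICE_PER_EXTRACTION
    qa = passed * qa_tokens_per_page / 1_000 * PRICE_QA_PER_1K_TOKENS
    fine_tuning = passed * pairs_per_page * tokens_per_example / 1_000 * PRICE_FT_PER_1K_TOKENS
    return {
        "crawling": round(crawling, 2),
        "qa_generation": round(qa, 2),
        "fine_tuning": round(fine_tuning, 2),
        "total": round(crawling + qa + fine_tuning, 2),
    }

print(estimate_cost(50_000))
```

For the 50,000-page corpus this reproduces the roughly $200 / $3.50 / $40 split in the tables; re-run it with a different `pass_rate` to see how sensitive the total is to filter strictness.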
For $245, you get a domain-specific fine-tuned model trained on 50,000 pages of expert content. The equivalent in human annotation cost would run $50,000-$150,000. Web scraping + LLM-generated Q&A is the only economically viable path to large-scale domain-specific fine-tuning for most teams.
## Tips for Better Fine-Tuning Data
**Prefer sources with clear authorship and expertise.** Official documentation, peer-reviewed papers, and expert-authored technical content produce better models than forums or user-generated content. A 5,000-page dataset of official Python documentation will outperform 50,000 pages of Stack Overflow answers.
**Include negative examples for instruction tuning.** Not just "here is the right answer" — occasionally include examples of common mistakes and correct them. This teaches the model what not to do.
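One way to encode such a corrective example in the OpenAI chat format used earlier — the mistake and the correction here are illustrative, not drawn from a real crawl:

```python
import json

# A corrective example: the user presents a common mistake, the assistant
# explains why it is wrong and shows the fix.
negative_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful Python expert."},
        {"role": "user", "content": "Is `if x == None:` the right way to check for None?"},
        {"role": "assistant", "content": (
            "No — compare to None with identity, not equality: use `if x is None:`. "
            "`==` can be overridden by a custom `__eq__` method and give surprising results."
        )},
    ]
}

# Append to the same JSONL file as the generated Q&A pairs
with open("training_chat.jsonl", "a") as f:
    f.write(json.dumps(negative_example, ensure_ascii=False) + "\n")
```

A modest fraction of such examples (5-10%) is usually enough; they should share the same system prompt as the rest of the dataset.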
**Maintain format consistency.** If your target domain uses specific terminology, abbreviations, or conventions, ensure your training data reflects them. Inconsistent formatting in training data produces models that alternate randomly between conventions.
**Version your datasets.** Store dataset snapshots with metadata (source URLs, crawl date, quality thresholds used) so you can reproduce and iterate. As content sources update, your dataset drifts — maintaining versions lets you ablate the contribution of each source.
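A minimal sketch of such a snapshot manifest — `write_manifest` and its field layout are our own suggestion, not part of any framework:

```python
import hashlib
import json
from datetime import datetime, timezone

def write_manifest(dataset_file: str, source_sitemaps: list[str],
                   quality_thresholds: dict,
                   manifest_file: str = "manifest.json") -> dict:
    """Record everything needed to reproduce this dataset snapshot."""
    with open(dataset_file, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "dataset_file": dataset_file,
        "dataset_sha256": dataset_hash,      # detects silent edits to the snapshot
        "source_sitemaps": source_sitemaps,  # where the corpus came from
        "quality_thresholds": quality_thresholds,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest_file, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Committing the manifest (not the dataset itself) to version control gives you a cheap audit trail: when a fine-tuned model regresses, you can diff manifests to see which sources or thresholds changed between runs.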
## Get Started
KnowledgeSDK's `/v1/extract` endpoint returns clean, LLM-ready markdown from any URL — the format you need for fine-tuning pipelines. With JavaScript rendering, structured data extraction, and support for complex SPAs, it handles the pages that simple scrapers miss.
Start building your fine-tuning corpus today at knowledgesdk.com. The free tier gives you 100 extractions to validate your quality filters and pipeline before committing to a full corpus crawl.