# Web Scraping for LLM Fine-Tuning: Building High-Quality Training Datasets
General-purpose LLMs are impressive, but they generalize across everything — which means they are mediocre at specific domains. A model fine-tuned on domain-specific content — legal filings, medical literature, internal engineering documentation, financial reports — will consistently outperform a general model on that domain's tasks, even when the general model is significantly larger.
The bottleneck for domain-specific fine-tuning is not compute or budget. It is data. You need thousands of examples of high-quality text from your target domain, in a clean format the fine-tuning framework can consume. Building that dataset from scratch is the work.
This article covers the full pipeline: identifying target sources, crawling at scale with KnowledgeSDK, filtering by quality signals, deduplicating, and exporting to the JSONL formats expected by OpenAI, Anthropic, and HuggingFace fine-tuning pipelines. We also walk through how to convert raw web content into instruction-tuning format (Q&A pairs) automatically, and include cost estimates for a 50,000-page corpus.
## Planning Your Dataset
Before writing any code, spend time on source selection. The quality of your fine-tuning data is the single biggest determinant of how good the fine-tuned model will be. Two principles:
**Signal density matters more than volume.** 10,000 pages of expert-written documentation will produce a better model than 100,000 pages of mixed-quality forum posts. For most domain-specific tasks, a carefully curated 5,000-10,000 example dataset will outperform a noisy 100,000 example dataset.
**Match the source format to your target use case.** If you want the model to answer technical questions, fine-tune on Q&A content. If you want it to write in a particular style, fine-tune on examples of that style. If you want it to follow instructions, fine-tune on instruction-following demonstrations.
### Source Types by Domain
| Domain | High-Quality Sources | What to Extract |
|---|---|---|
| Engineering | Official documentation, RFCs, PEPs | Conceptual explanations, code + description pairs |
| Legal | Court opinions, regulatory filings, law review articles | Case summaries, statute explanations |
| Medical | PubMed abstracts, clinical guidelines, textbooks | Abstract + conclusion pairs |
| Finance | SEC filings, analyst reports, earnings transcripts | Company descriptions, risk factor analysis |
| Customer support | Your own ticket history, FAQ pages | Question + resolution pairs |
## Step 1: URL Discovery
Start by enumerating the URLs you want to crawl. Most documentation sites and structured content sources have sitemaps.
```python
import os
import json
from pathlib import Path
from datetime import datetime

import knowledgesdk

ks = knowledgesdk.Client(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def discover_urls(sitemap_urls: list[str], output_file: str = "urls.jsonl") -> list[str]:
    """Enumerate all crawlable URLs from a list of sitemaps."""
    all_urls = []
    for sitemap_url in sitemap_urls:
        print(f"Fetching sitemap: {sitemap_url}")
        try:
            result = ks.sitemap(sitemap_url)
            urls = result["urls"]
            all_urls.extend(urls)
            print(f"  Found {len(urls)} URLs")
        except Exception as e:
            print(f"  Error fetching sitemap: {e}")

    # Deduplicate while preserving order
    all_urls = list(dict.fromkeys(all_urls))
    print(f"\nTotal unique URLs: {len(all_urls)}")

    # Save for resumability
    with open(output_file, "w") as f:
        for url in all_urls:
            f.write(json.dumps({"url": url, "status": "pending"}) + "\n")
    return all_urls

# Example: crawl multiple documentation sites
target_sitemaps = [
    "https://docs.python.org/3/sitemap.xml",
    "https://pytorch.org/docs/stable/sitemap.xml",
    "https://huggingface.co/docs/transformers/sitemap.xml",
]
urls = discover_urls(target_sitemaps)
```
## Step 2: Crawling the Corpus
Crawl the discovered URLs and store raw content. Design the crawler for resumability — large crawls will fail partway through, and you do not want to restart from scratch.
````python
import hashlib
import asyncio
import aiofiles
from dataclasses import dataclass, asdict

@dataclass
class RawPage:
    url: str
    title: str
    markdown: str
    word_count: int
    has_code_blocks: bool
    has_headers: bool
    crawled_at: str
    content_hash: str

def analyze_content(markdown: str) -> dict:
    """Compute quality signals for a page."""
    words = len(markdown.split())
    has_code = "```" in markdown or "    " in markdown[:500]  # fenced or indented code
    has_headers = markdown.count("\n#") > 1  # multiple headings
    has_lists = markdown.count("\n- ") > 3  # meaningful list content
    return {
        "word_count": words,
        "has_code_blocks": has_code,
        "has_headers": has_headers,
        "has_lists": has_lists,
    }

async def crawl_corpus(
    urls: list[str],
    output_dir: str = "raw_pages",
    max_concurrent: int = 5,
    skip_existing: bool = True,
) -> None:
    """Crawl all URLs and save raw content."""
    Path(output_dir).mkdir(exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_one(url: str) -> None:
        # Use URL hash as filename for resumability
        url_hash = hashlib.md5(url.encode()).hexdigest()
        output_path = f"{output_dir}/{url_hash}.json"
        if skip_existing and Path(output_path).exists():
            return  # Already crawled

        async with semaphore:
            try:
                # Run the synchronous SDK call in a thread pool
                loop = asyncio.get_event_loop()
                result = await loop.run_in_executor(None, ks.extract, url)
                signals = analyze_content(result["markdown"])
                page = RawPage(
                    url=url,
                    title=result.get("title", ""),
                    markdown=result["markdown"],
                    word_count=signals["word_count"],
                    has_code_blocks=signals["has_code_blocks"],
                    has_headers=signals["has_headers"],
                    crawled_at=datetime.utcnow().isoformat(),
                    content_hash=hashlib.sha256(result["markdown"].encode()).hexdigest(),
                )
                async with aiofiles.open(output_path, "w") as f:
                    await f.write(json.dumps(asdict(page), ensure_ascii=False))
                print(f"Crawled: {url} ({signals['word_count']} words)")
            except Exception as e:
                print(f"Failed: {url} — {e}")
                # Write an error marker so failures can be retried later
                async with aiofiles.open(f"{output_dir}/{url_hash}.error", "w") as f:
                    await f.write(json.dumps({"url": url, "error": str(e)}))

    await asyncio.gather(*[crawl_one(url) for url in urls])
    print(f"\nCrawl complete. Results in {output_dir}/")

# Run the crawl
asyncio.run(crawl_corpus(urls, output_dir="raw_pages"))
````
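The `.error` markers written above make retries cheap. As a sketch, a small helper (not part of the pipeline above; the name `collect_failed_urls` is ours) can gather the failed URLs so you can re-run `crawl_corpus` on just those:

```python
import glob
import json

def collect_failed_urls(output_dir: str = "raw_pages") -> list[str]:
    """Gather URLs whose crawl previously failed, based on .error marker files."""
    failed = []
    for error_file in glob.glob(f"{output_dir}/*.error"):
        with open(error_file) as f:
            failed.append(json.load(f)["url"])
    return failed

# Retry only the failures (delete the .error marker once a retry succeeds):
# asyncio.run(crawl_corpus(collect_failed_urls(), output_dir="raw_pages"))
```

Because `crawl_one` names output files by URL hash and skips existing files, retrying is idempotent: already-crawled pages are untouched.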
## Step 3: Quality Filtering
Raw web content is noisy. Before fine-tuning, filter out pages that will degrade the model rather than improve it.
```python
import glob
from typing import Iterator

def quality_filter(raw_pages_dir: str) -> Iterator[dict]:
    """
    Apply quality filters to raw pages.
    Yields pages that pass all filters.
    """
    files = glob.glob(f"{raw_pages_dir}/*.json")
    stats = {
        "total": 0,
        "passed": 0,
        "filtered_short": 0,
        "filtered_boilerplate": 0,
        "filtered_low_signal": 0,
    }

    for filepath in files:
        with open(filepath) as f:
            page = json.load(f)
        stats["total"] += 1
        markdown = page["markdown"]

        # Filter 1: Minimum length
        if page["word_count"] < 200:
            stats["filtered_short"] += 1
            continue

        # Filter 2: Maximum length (avoid full book dumps that pollute context)
        if page["word_count"] > 15_000:
            # Truncate rather than discard
            words = markdown.split()[:15_000]
            markdown = " ".join(words)
            page["markdown"] = markdown
            page["word_count"] = 15_000

        # Filter 3: Boilerplate detection
        boilerplate_signals = [
            "cookie policy",
            "privacy policy",
            "terms of service",
            "404 not found",
            "page not found",
            "access denied",
            "enable javascript",
        ]
        lower = markdown.lower()
        if any(signal in lower for signal in boilerplate_signals) and page["word_count"] < 500:
            stats["filtered_boilerplate"] += 1
            continue

        # Filter 4: Content quality score
        # At least one of: has code blocks, has headers, substantial length
        quality_signals = sum([
            page.get("has_code_blocks", False),
            page.get("has_headers", False),
            page["word_count"] > 400,
        ])
        if quality_signals < 1:
            stats["filtered_low_signal"] += 1
            continue

        stats["passed"] += 1
        yield page

    print("\nFiltering complete:")
    print(f"  Total: {stats['total']}")
    print(f"  Passed: {stats['passed']} ({100 * stats['passed'] // max(1, stats['total'])}%)")
    print(f"  Filtered (too short): {stats['filtered_short']}")
    print(f"  Filtered (boilerplate): {stats['filtered_boilerplate']}")
    print(f"  Filtered (low signal): {stats['filtered_low_signal']}")
```
## Step 4: Deduplication
The web has substantial content duplication — mirror sites, scraped republications, and pages with near-identical content. Deduplication is essential for fine-tuning data quality.
```python
from datasketch import MinHash, MinHashLSH

def deduplicate_pages(pages: list[dict], threshold: float = 0.85) -> list[dict]:
    """
    Near-duplicate detection using MinHash LSH.
    threshold: Jaccard similarity above which pages are considered duplicates.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_pages = []
    seen_hashes = set()

    for i, page in enumerate(pages):
        # Exact-duplicate check first (fast)
        content_hash = page["content_hash"]
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # MinHash for near-duplicate detection
        minhash = MinHash(num_perm=128)
        # Shingle the text at word level (3-word shingles)
        words = page["markdown"].lower().split()
        for j in range(len(words) - 2):
            shingle = " ".join(words[j:j + 3])
            minhash.update(shingle.encode())

        if lsh.query(minhash):
            # Near-duplicate of a page already kept — skip it
            continue
        lsh.insert(f"page_{i}", minhash)
        unique_pages.append(page)

    print(f"Deduplication: {len(pages)} → {len(unique_pages)} pages "
          f"({len(pages) - len(unique_pages)} duplicates removed)")
    return unique_pages
```
## Step 5: Exporting to Fine-Tuning Formats

### Completion Format (for Anthropic / HuggingFace pre-training style)
```python
def export_completion_format(pages: list[dict], output_file: str) -> None:
    """
    Simple completion format: each page is one training example.
    Used for language-modeling-style fine-tuning.
    Compatible with the HuggingFace Trainer and most open-source frameworks.
    """
    with open(output_file, "w") as f:
        for page in pages:
            example = {"text": f"# {page['title']}\n\n{page['markdown']}"}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    print(f"Exported {len(pages)} examples to {output_file}")
```
### Instruction-Tuning Format (Q&A pairs for OpenAI fine-tuning)
This format requires an LLM to generate question-answer pairs from each page. It is more expensive but produces models that are significantly better at following instructions.
```python
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_qa_pairs(page: dict, num_pairs: int = 3) -> list[dict]:
    """Generate Q&A pairs from a page using gpt-4o-mini."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": f"""Generate {num_pairs} high-quality question-answer pairs from the provided text.
Each pair should:
- Ask a specific, factual question that is fully answered by the text
- Have a complete, accurate answer based only on the provided content
- Be useful for someone learning this topic
Return JSON: {{"pairs": [{{"question": "...", "answer": "..."}}]}}""",
            },
            {
                "role": "user",
                "content": f"Title: {page['title']}\n\n{page['markdown'][:4000]}",
            },
        ],
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("pairs", [])
```
```python
def export_openai_chat_format(pages: list[dict], output_file: str, system_prompt: str) -> None:
    """
    OpenAI chat fine-tuning format.
    See: https://platform.openai.com/docs/guides/fine-tuning
    """
    with open(output_file, "w") as f:
        for page in pages:
            try:
                qa_pairs = generate_qa_pairs(page)
                for pair in qa_pairs:
                    example = {
                        "messages": [
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": pair["question"]},
                            {"role": "assistant", "content": pair["answer"]},
                        ]
                    }
                    f.write(json.dumps(example, ensure_ascii=False) + "\n")
            except Exception as e:
                print(f"QA generation failed for {page['url']}: {e}")

# Export both formats
filtered_pages = list(quality_filter("raw_pages"))
deduped_pages = deduplicate_pages(filtered_pages)

export_completion_format(deduped_pages, "training_completion.jsonl")
export_openai_chat_format(
    deduped_pages[:2000],  # limit for cost control during the initial run
    "training_chat.jsonl",
    system_prompt="You are a helpful assistant with deep expertise in Python and machine learning.",
)
```
## Step 6: Validation and Statistics
Before uploading to a fine-tuning API, validate the dataset and review statistics:
```python
def validate_dataset(jsonl_file: str, format_type: str = "chat") -> dict:
    """Validate a JSONL dataset file and return statistics."""
    stats = {
        "total_examples": 0,
        "total_tokens_estimate": 0,
        "avg_tokens_per_example": 0,
        "errors": [],
    }
    with open(jsonl_file) as f:
        for line_num, line in enumerate(f, 1):
            try:
                example = json.loads(line.strip())
                if format_type == "chat":
                    assert "messages" in example, "Missing 'messages' key"
                    assert len(example["messages"]) >= 2, "Need at least 2 messages"
                    text = " ".join(m["content"] for m in example["messages"])
                else:
                    assert "text" in example, "Missing 'text' key"
                    text = example["text"]
                # Rough token estimate: ~4 characters per token
                stats["total_tokens_estimate"] += len(text) // 4
                stats["total_examples"] += 1
            except (json.JSONDecodeError, AssertionError, KeyError) as e:
                stats["errors"].append(f"Line {line_num}: {e}")

    if stats["total_examples"] > 0:
        stats["avg_tokens_per_example"] = (
            stats["total_tokens_estimate"] // stats["total_examples"]
        )
    return stats

chat_stats = validate_dataset("training_chat.jsonl", format_type="chat")
print("Dataset validation:")
print(f"  Examples: {chat_stats['total_examples']:,}")
print(f"  Estimated tokens: {chat_stats['total_tokens_estimate']:,}")
print(f"  Avg tokens/example: {chat_stats['avg_tokens_per_example']}")
print(f"  Errors: {len(chat_stats['errors'])}")
```
## Cost Estimate: 50,000-Page Corpus
Here is a realistic cost breakdown for building a 50,000-page fine-tuning dataset:
### Crawling Costs
| Item | Calculation | Cost |
|---|---|---|
| KnowledgeSDK extraction (50,000 pages) | 50,000 × $0.004 | $200 |
| Sitemap discovery (50 sitemaps) | 50 × $0.001 | $0.05 |
| Total crawling | | $200 |
### Processing Costs
| Item | Calculation | Cost |
|---|---|---|
| GPT-4o-mini for QA generation (assume 30% pass quality filter: 15,000 pages × 3 pairs each) | 15,000 × ~1,500 tokens × $0.00015/1k | $3.38 |
| Deduplication (compute only) | Negligible | $0 |
| Total processing | | ~$3.50 |
### Fine-Tuning Costs (OpenAI gpt-4o-mini)
| Item | Calculation | Cost |
|---|---|---|
| 45,000 chat examples × 300 avg tokens | 13.5M tokens × $0.003/1k | $40.50 |
| Total fine-tuning | | ~$40 |
### Total Pipeline Cost
| Phase | Cost |
|---|---|
| Crawling (50,000 pages) | $200 |
| QA pair generation | $3.50 |
| Fine-tuning | $40 |
| Total | ~$245 |
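The arithmetic in these tables can be parameterized for your own corpus size. A minimal sketch — the unit prices and the 30% filter pass rate are this article's assumptions, not published rate cards; substitute your own:

```python
# Assumed unit prices from the cost tables above.
PRICE_PER_EXTRACTION = 0.004       # KnowledgeSDK extraction, $/page
PRICE_QA_PER_1K_TOKENS = 0.00015   # gpt-4o-mini input, $/1k tokens
PRICE_FT_PER_1K_TOKENS = 0.003     # gpt-4o-mini fine-tuning, $/1k tokens

def estimate_cost(pages: int, pass_rate: float = 0.30,
                  qa_tokens_per_page: int = 1_500,
                  pairs_per_page: int = 3,
                  tokens_per_example: int = 300) -> dict:
    """Estimate end-to-end pipeline cost for a given corpus size."""
    passed = int(pages * pass_rate)  # pages surviving the quality filter
    crawling = pages * PRICE_PER_EXTRACTION
    qa = passed * qa_tokens_per_page / 1_000 * PRICE_QA_PER_1K_TOKENS
    fine_tuning = passed * pairs_per_page * tokens_per_example / 1_000 * PRICE_FT_PER_1K_TOKENS
    return {
        "crawling": round(crawling, 2),
        "qa_generation": round(qa, 2),
        "fine_tuning": round(fine_tuning, 2),
        "total": round(crawling + qa + fine_tuning, 2),
    }

print(estimate_cost(50_000))
```

For the 50,000-page corpus this reproduces the roughly $200 / $3.50 / $40 split in the tables; re-run it with a different `pass_rate` to see how sensitive the total is to filter strictness.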
For $245, you get a domain-specific fine-tuned model trained on 50,000 pages of expert content. The equivalent in human annotation cost would run $50,000-$150,000. Web scraping + LLM-generated Q&A is the only economically viable path to large-scale domain-specific fine-tuning for most teams.
## Tips for Better Fine-Tuning Data
**Prefer sources with clear authorship and expertise.** Official documentation, peer-reviewed papers, and expert-authored technical content produce better models than forums or user-generated content. A 5,000-page dataset of official Python documentation will outperform 50,000 pages of Stack Overflow answers.
**Include negative examples for instruction tuning.** Not just "here is the right answer" — occasionally include examples of common mistakes and correct them. This teaches the model what not to do.
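One way to encode such a corrective example in the OpenAI chat format used earlier — the mistake and the correction here are illustrative, not drawn from a real crawl:

```python
import json

# A corrective example: the user presents a common mistake, the assistant
# explains why it is wrong and shows the fix.
negative_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful Python expert."},
        {"role": "user", "content": "Is `if x == None:` the right way to check for None?"},
        {"role": "assistant", "content": (
            "No — compare to None with identity, not equality: use `if x is None:`. "
            "`==` can be overridden by a custom `__eq__` method and give surprising results."
        )},
    ]
}

# Append to the same JSONL file as the generated Q&A pairs
with open("training_chat.jsonl", "a") as f:
    f.write(json.dumps(negative_example, ensure_ascii=False) + "\n")
```

A modest fraction of such examples (5-10%) is usually enough; they should share the same system prompt as the rest of the dataset.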
**Maintain format consistency.** If your target domain uses specific terminology, abbreviations, or conventions, ensure your training data reflects them. Inconsistent formatting in training data produces models that alternate randomly between conventions.
**Version your datasets.** Store dataset snapshots with metadata (source URLs, crawl date, quality thresholds used) so you can reproduce and iterate. As content sources update, your dataset drifts — maintaining versions lets you ablate the contribution of each source.
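A minimal sketch of such a snapshot manifest — `write_manifest` and its field layout are our own suggestion, not part of any framework:

```python
import hashlib
import json
from datetime import datetime, timezone

def write_manifest(dataset_file: str, source_sitemaps: list[str],
                   quality_thresholds: dict,
                   manifest_file: str = "manifest.json") -> dict:
    """Record everything needed to reproduce this dataset snapshot."""
    with open(dataset_file, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "dataset_file": dataset_file,
        "dataset_sha256": dataset_hash,      # detects silent edits to the snapshot
        "source_sitemaps": source_sitemaps,  # where the corpus came from
        "quality_thresholds": quality_thresholds,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest_file, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Committing the manifest (not the dataset itself) to version control gives you a cheap audit trail: when a fine-tuned model regresses, you can diff manifests to see which sources or thresholds changed between runs.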
## Get Started
KnowledgeSDK's `/v1/extract` endpoint returns clean, LLM-ready markdown from any URL — the format you need for fine-tuning pipelines. With JavaScript rendering, structured data extraction, and support for complex SPAs, it handles the pages that simple scrapers miss.
Start building your fine-tuning corpus today at knowledgesdk.com. The free tier gives you 100 extractions to validate your quality filters and pipeline before committing to a full corpus crawl.