What Is Chunking?
Chunking is the process of dividing long documents into smaller segments before embedding and storing them in a vector database. It is one of the most impactful — and most underestimated — decisions in any RAG pipeline.
LLMs and embedding models both have token limits. A 50-page PDF cannot be embedded as a single unit, and even if it could, the resulting vector would be too diffuse to match specific queries accurately.
Why Chunk Size Matters
- Too small (e.g., 50 tokens) — individual chunks lose context; retrieved passages are hard for the LLM to use
- Too large (e.g., 2000 tokens) — vectors are diluted; a chunk that covers 10 different topics will match queries about any one of them only weakly
- Sweet spot — typically 256–512 tokens for most use cases, with overlap to prevent boundary artifacts
Common Chunking Strategies
Fixed-Size Chunking
Split every N tokens regardless of sentence or paragraph boundaries. Simple and fast, but can cut mid-sentence.
def chunk_fixed(text, size=400, overlap=50):
    # Whitespace split stands in for a real tokenizer here;
    # production code should use the embedding model's tokenizer.
    tokens = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
    return chunks
Sentence-Aware Chunking
Split on sentence boundaries, accumulating until a token budget is reached. Produces more coherent chunks.
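A minimal sketch of this accumulation loop, assuming whitespace tokenization and a naive regex sentence splitter (a production pipeline would use a proper sentence segmenter and the embedding model's tokenizer):

```python
import re

def chunk_sentences(text, budget=400):
    """Accumulate whole sentences until the token budget is reached."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n_tokens > budget:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Because sentences are never split, each chunk may run slightly under the budget, trading exact sizing for coherence.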
Recursive Character Splitting
This strategy first tries to split on paragraph breaks (\n\n), then sentences (. ), then words, falling back to individual characters. It is the default strategy behind LangChain's RecursiveCharacterTextSplitter.
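A simplified sketch of the recursive idea (unlike LangChain's implementation, this version drops the separators and does not merge small adjacent pieces back up to the size limit):

```python
def chunk_recursive(text, max_len=400, separators=("\n\n", ". ", " ")):
    """Recursively split on the coarsest separator that yields small-enough pieces."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard cut at max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator absent in this text; try the next, finer one.
        return chunk_recursive(text, max_len, rest)
    chunks = []
    for part in parts:
        chunks.extend(chunk_recursive(part, max_len, rest))
    return chunks
```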
Semantic Chunking
Embeds each sentence and splits when cosine similarity between adjacent sentences drops below a threshold. More expensive but topic-coherent.
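A sketch of this approach; `embed` is a placeholder for whatever embedding model you use, and the 0.7 threshold is illustrative, not a recommended value:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def chunk_semantic(sentences, embed, threshold=0.7):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            # Topic shift detected: close the current chunk.
            chunks.append(' '.join(current))
            current = []
        current.append(sentence)
    chunks.append(' '.join(current))
    return chunks
```

The extra cost comes from embedding every sentence up front, on top of embedding the resulting chunks.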
Structured Chunking
For documents with known structure (Markdown, HTML), split on headings (##, ###) or HTML section tags. Preserves natural logical units.
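For Markdown, a heading-based splitter can be a single regex split; this sketch keeps each ## or ### heading together with the body that follows it:

```python
import re

def chunk_markdown(text):
    """Split Markdown on ## / ### headings, keeping each heading with its body."""
    # Zero-width lookahead split: each section starts at a heading line.
    sections = re.split(r'(?m)^(?=#{2,3} )', text)
    return [s.strip() for s in sections if s.strip()]
```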
Overlap
Most strategies include an overlap — the last N tokens of chunk i are repeated at the start of chunk i+1. This prevents important context from falling into the gap between chunks.
Chunk 1: [tokens 1–400]
Chunk 2: [tokens 350–750] ← 50-token overlap
Chunk 3: [tokens 700–1100]
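The layout above (shown 0-indexed here) falls out of a simple start/step calculation:

```python
def chunk_ranges(n_tokens, size=400, overlap=50):
    """Return (start, end) token index pairs for overlapping chunks."""
    step = size - overlap  # each chunk starts `step` tokens after the previous one
    ranges = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        ranges.append((start, end))
        if end == n_tokens:
            break
        start += step
    return ranges
```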
What to Attach as Metadata
Every chunk should store:
- Source URL or document ID
- Page number or section heading
- Creation timestamp
- Any category or tag from the source
This metadata enables filtered retrieval and source attribution in LLM responses.
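As a sketch, a chunk record covering the fields above might look like this; the field names are illustrative, not a required schema:

```python
from datetime import datetime, timezone

def make_chunk_record(text, doc_id, section, tags=None):
    """Bundle a chunk with the metadata it should carry."""
    return {
        "text": text,
        "doc_id": doc_id,      # source URL or document ID
        "section": section,    # page number or section heading
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tags": tags or [],    # categories or tags from the source
    }
```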
Chunking with KnowledgeSDK
When you call POST /v1/extract, KnowledgeSDK automatically handles chunking optimized for web content — splitting on semantic boundaries (headings, paragraphs) while respecting token budgets. Each chunk is stored as a knowledge_item with source metadata attached.
curl -X POST https://api.knowledgesdk.com/v1/extract \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/guide"}'
The chunks are immediately searchable via POST /v1/search without any additional configuration.
Chunking Best Practices
- Start with 512 tokens, 50-token overlap, and iterate based on retrieval quality
- Use sentence-aware or heading-aware splitting for documentation
- Always store the parent document reference alongside each chunk
- Evaluate chunking quality by checking whether retrieved chunks are self-contained and relevant