What Is Chunking?
Chunking is the process of dividing long documents into smaller segments before embedding and storing them in a vector database. It is one of the most impactful — and most underestimated — decisions in any RAG pipeline.
LLMs and embedding models both have token limits. A 50-page PDF cannot be embedded as a single unit, and even if it could, the resulting vector would be too diffuse to match specific queries accurately.
Why Chunk Size Matters
- Too small (e.g., 50 tokens) — individual chunks lose context; retrieved passages are hard for the LLM to use
- Too large (e.g., 2000 tokens) — vectors are diluted; a chunk that covers 10 different topics will match queries about any one of them only weakly
- Sweet spot — typically 256–512 tokens for most use cases, with overlap to prevent boundary artifacts
Common Chunking Strategies
Fixed-Size Chunking
Split every N tokens regardless of sentence or paragraph boundaries. Simple and fast, but can cut mid-sentence.
def chunk_fixed(text, size=400, overlap=50):
    # Whitespace split stands in for a real tokenizer here;
    # production code should use the embedding model's tokenizer.
    tokens = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
    return chunks
Sentence-Aware Chunking
Split on sentence boundaries, accumulating until a token budget is reached. Produces more coherent chunks.
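A minimal sketch of this accumulation loop, assuming whitespace tokenization and a naive regex sentence splitter (a production pipeline would use a proper sentence segmenter and the embedding model's tokenizer):

```python
import re

def chunk_sentences(text, budget=400):
    """Accumulate whole sentences until the token budget is reached."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n_tokens > budget:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Because sentences are never split, each chunk may run slightly under the budget, trading exact sizing for coherence.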
Recursive Character Splitting
This strategy first tries to split on paragraph breaks (\n\n), then sentences (. ), then words, falling back to individual characters. It is the default strategy behind LangChain's RecursiveCharacterTextSplitter.
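A simplified sketch of the recursive idea (unlike LangChain's implementation, this version drops the separators and does not merge small adjacent pieces back up to the size limit):

```python
def chunk_recursive(text, max_len=400, separators=("\n\n", ". ", " ")):
    """Recursively split on the coarsest separator that yields small-enough pieces."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard cut at max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator absent in this text; try the next, finer one.
        return chunk_recursive(text, max_len, rest)
    chunks = []
    for part in parts:
        chunks.extend(chunk_recursive(part, max_len, rest))
    return chunks
```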
Semantic Chunking
Embeds each sentence and splits when cosine similarity between adjacent sentences drops below a threshold. More expensive but topic-coherent.
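A sketch of this approach; `embed` is a placeholder for whatever embedding model you use, and the 0.7 threshold is illustrative, not a recommended value:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def chunk_semantic(sentences, embed, threshold=0.7):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            # Topic shift detected: close the current chunk.
            chunks.append(' '.join(current))
            current = []
        current.append(sentence)
    chunks.append(' '.join(current))
    return chunks
```

The extra cost comes from embedding every sentence up front, on top of embedding the resulting chunks.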
Structured Chunking
For documents with known structure (Markdown, HTML), split on headings (##, ###) or HTML section tags. Preserves natural logical units.
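For Markdown, a heading-based splitter can be a single regex split; this sketch keeps each ## or ### heading together with the body that follows it:

```python
import re

def chunk_markdown(text):
    """Split Markdown on ## / ### headings, keeping each heading with its body."""
    # Zero-width lookahead split: each section starts at a heading line.
    sections = re.split(r'(?m)^(?=#{2,3} )', text)
    return [s.strip() for s in sections if s.strip()]
```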
Overlap
Most strategies include an overlap — the last N tokens of chunk i are repeated at the start of chunk i+1. This prevents important context from falling into the gap between chunks.
Chunk 1: [tokens 1–400]
Chunk 2: [tokens 350–750] ← 50-token overlap
Chunk 3: [tokens 700–1100]
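The layout above (shown 0-indexed here) falls out of a simple start/step calculation:

```python
def chunk_ranges(n_tokens, size=400, overlap=50):
    """Return (start, end) token index pairs for overlapping chunks."""
    step = size - overlap  # each chunk starts `step` tokens after the previous one
    ranges = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        ranges.append((start, end))
        if end == n_tokens:
            break
        start += step
    return ranges
```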
What to Attach as Metadata
Every chunk should store:
- Source URL or document ID
- Page number or section heading
- Creation timestamp
- Any category or tag from the source
This metadata enables filtered retrieval and source attribution in LLM responses.
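As a sketch, a chunk record covering the fields above might look like this; the field names are illustrative, not a required schema:

```python
from datetime import datetime, timezone

def make_chunk_record(text, doc_id, section, tags=None):
    """Bundle a chunk with the metadata it should carry."""
    return {
        "text": text,
        "doc_id": doc_id,      # source URL or document ID
        "section": section,    # page number or section heading
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tags": tags or [],    # categories or tags from the source
    }
```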
Chunking with KnowledgeSDK
When you call POST /v1/extract, KnowledgeSDK automatically handles chunking optimized for web content — splitting on semantic boundaries (headings, paragraphs) while respecting token budgets. Each chunk is stored as a knowledge_item with source metadata attached.
curl -X POST https://api.knowledgesdk.com/v1/extract \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/guide"}'
The chunks are immediately searchable via POST /v1/search without any additional configuration.
Chunking Best Practices
- Start with 512 tokens, 50-token overlap, and iterate based on retrieval quality
- Use sentence-aware or heading-aware splitting for documentation
- Always store the parent document reference alongside each chunk
- Evaluate chunking quality by checking whether retrieved chunks are self-contained and relevant