knowledgesdk.com/blog/ast-aware-chunking-code-rag
Technical · March 20, 2026 · 9 min read

AST-Aware Code Chunking for RAG: Why Text Splitting Fails on Code

Splitting code files at arbitrary token boundaries breaks functions in half and destroys semantic meaning. AST-aware chunking respects code structure — and dramatically improves retrieval.


Take a 200-line Python file. Run it through RecursiveCharacterTextSplitter with a 100-token chunk size. You will get chunks that start mid-function, reference variables defined twenty lines earlier, miss docstrings entirely, and split class methods from the class definition they belong to. Feed those chunks into a vector store and ask a coding agent to find "how the authentication middleware works." The agent will retrieve semantically broken fragments and produce broken answers.
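The failure is easy to reproduce with a stdlib-only sketch of fixed-size splitting (a stand-in for a token-window splitter; RecursiveCharacterTextSplitter itself is not reproduced here):

```python
# A tiny class whose meaning depends on its structure staying intact.
SOURCE = '''\
class AuthMiddleware:
    """Validates JWT tokens on incoming requests."""

    def validate_token(self, token):
        payload = decode(token, self.secret)
        return payload
'''

def naive_split(text: str, chunk_size: int) -> list[str]:
    # Cut at fixed character offsets, exactly like a token-window
    # splitter with no overlap and no structural awareness.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_split(SOURCE, 80)
# The class header lands in one chunk and the method signature in the
# next: neither fragment parses, and neither is meaningful on its own.
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

The exact cut points depend on the chunk size, but any structure-blind splitter produces the same class of damage.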

Text splitting was designed for prose. Code is not prose. Code has strict syntactic structure, and meaning is deeply entangled with that structure. A function body without its signature is meaningless. A method without its class context is ambiguous. A code block without the imports it depends on cannot be understood in isolation.

AST-aware chunking solves this by treating code as what it actually is: a structured tree, not a stream of tokens.

Why Naive Splitting Fails on Code

When you split prose, losing a sentence boundary is unfortunate but recoverable. The surrounding context usually makes the meaning clear. When you split code at an arbitrary token boundary, you get something much worse than a truncated sentence — you get syntactically invalid or semantically broken fragments.

Specific failure modes of naive text splitting on code:

Split mid-function. A 150-line function gets split at token 100. The first chunk has the signature and first half of the body. The second chunk has the rest of the body and no signature. Neither chunk is independently understandable.

Lost docstrings. A Python docstring immediately follows the function or class signature, and a JSDoc comment immediately precedes the declaration it documents. If the split falls between a definition and its documentation, you lose the semantic description of the function — which is exactly what embedding models rely on for retrieval.

Missing variable scope. Local variables defined in the first chunk are referenced in the second chunk. The second chunk is retrieved without the context needed to understand what those variables mean.

Broken class structure. Methods are split from their class. You retrieve def process_payment(self, amount) without knowing which class it belongs to, without the class-level attributes it reads, and without the other methods it calls.

The result is degraded retrieval at every level. Embedding models produce lower-quality vectors for syntactically broken code than for complete syntactic units. And even when retrieval surfaces the right file, the chunks returned are often incomplete enough to be misleading.

What AST-Aware Chunking Does Differently

Abstract Syntax Tree chunking parses the source file into a structured tree representation, then extracts complete syntactic units as chunks. No function is ever split mid-body. No class method is separated from its class definition. Docstrings stay with their definitions.

The resulting chunks are:

  • Always syntactically complete
  • Always semantically self-contained (within the chunk)
  • Enriched with scope context (file path, class name, parent function)
  • Accompanied by relevant imports

The tree-sitter Approach

tree-sitter is a parsing library that can parse source code in 40+ languages into concrete syntax trees with deterministic, error-tolerant parsing. It is the same library used by Neovim, GitHub, and Helix for syntax highlighting — battle-tested on real-world code.

Supermemory's code-chunk NPM package is built on tree-sitter and provides the cleanest open-source implementation of AST-aware chunking available today. It handles the full pipeline from source file to structured chunks.

The 5 Steps of AST-Aware Chunking

Step 1: Parse with tree-sitter. The source file is parsed into a concrete syntax tree. Every node in the tree corresponds to a syntactic element: function definition, class declaration, method, import statement, variable assignment.

Step 2: Extract semantic entities. Walk the tree and extract the nodes that represent complete semantic units: functions, classes, methods, exported constants, type declarations. These become your chunk candidates.
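Steps 1 and 2 can be sketched with Python's built-in `ast` module (the article's tooling uses tree-sitter, which works the same way across 40+ languages; this stdlib analogue shows the idea for Python only):

```python
import ast

SOURCE = '''\
import jwt

class AuthMiddleware:
    def validate_token(self, token):
        """Validate a JWT token."""
        return jwt.decode(token, self.secret)

def healthcheck():
    return "ok"
'''

def extract_entities(source: str) -> list[dict]:
    # Walk the parse tree and pull out complete semantic units:
    # classes, functions, and methods. ast.get_source_segment returns
    # the exact source text for a node, so no unit is ever cut.
    # Note: walking finds nested definitions too, so a method appears
    # both inside its class chunk and on its own; real chunkers
    # deduplicate these.
    tree = ast.parse(source)
    entities = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            entities.append({
                "name": node.name,
                "kind": type(node).__name__,
                "source": ast.get_source_segment(source, node),
            })
    return entities

entities = extract_entities(SOURCE)
for e in entities:
    print(e["kind"], e["name"])
```

Every extracted `source` field is a syntactically complete unit — it parses on its own, which is the property naive splitting destroys.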

Step 3: Build the scope tree. Establish the hierarchy of each entity: which file it lives in, which class it belongs to (if any), which parent function contains it (for nested functions). This scope chain is used for context injection later.

Step 4: Greedy window assignment. Pack complete semantic units into chunks that fit within your target token window. Unlike text splitters that cut at the token boundary, this approach always cuts between complete units — never inside one. If a single function exceeds the token limit, it is either kept as a single oversized chunk (preserving integrity) or split at the most semantically neutral boundary (between statements, not mid-expression).
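The greedy packing in step 4 reduces to a short loop. A minimal sketch, with token counts approximated by whitespace-separated words (a real pipeline would use the embedding model's tokenizer):

```python
def pack_units(units: list[str], max_tokens: int) -> list[str]:
    # Greedily pack complete units into windows, cutting only between
    # units. An oversized unit becomes its own chunk rather than being
    # split mid-body.
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        n = len(unit.split())  # crude token estimate
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(unit)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

units = [
    "def a():\n    return 1",
    "def b():\n    return 2",
    "def big():\n    " + "x = 0\n    " * 50,  # exceeds the window on its own
]
packed = pack_units(units, max_tokens=12)
```

Here the two small functions share a chunk, while the oversized one is kept whole as its own chunk — integrity over window size.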

Step 5: Contextualization. Prepend each chunk with its scope chain and relevant imports. A method chunk becomes:

# File: src/auth/middleware.py
# Class: AuthMiddleware
# Method: validate_token

import jwt
from .exceptions import InvalidTokenError

def validate_token(self, token: str) -> dict:
    """Validate a JWT token and return the decoded payload."""
    try:
        payload = jwt.decode(token, self.secret, algorithms=['HS256'])
        return payload
    except jwt.ExpiredSignatureError:
        raise InvalidTokenError("Token has expired")
    except jwt.InvalidTokenError as e:
        raise InvalidTokenError(f"Invalid token: {e}")

That chunk is fully understandable in isolation. The embedding model gets a meaningful vector. The retrieval result is useful.
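Under the hood, contextualization is plain string assembly. A minimal sketch — the field names here are illustrative, not the code-chunk schema:

```python
def contextualize(chunk: dict) -> str:
    # Prepend the scope chain and relevant imports as comment headers
    # so the chunk embeds (and reads) as a self-contained unit.
    header = [f"# File: {chunk['file']}"]
    if chunk.get("class_name"):
        header.append(f"# Class: {chunk['class_name']}")
    header.append(f"# {chunk['kind']}: {chunk['name']}")
    parts = ["\n".join(header)]
    if chunk.get("imports"):
        parts.append("\n".join(chunk["imports"]))
    parts.append(chunk["source"])
    return "\n\n".join(parts)

chunk = {
    "file": "src/auth/middleware.py",
    "class_name": "AuthMiddleware",
    "kind": "Method",
    "name": "validate_token",
    "imports": ["import jwt"],
    "source": "def validate_token(self, token):\n    ...",
}
print(contextualize(chunk))
```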

Performance Numbers

The improvement is measurable. Supermemory's own benchmarks comparing AST-aware chunking against naive text splitting on code retrieval tasks:

  • Recall@5: 70.1% (AST-aware) vs 49% (naive text splitting) — a 43% relative improvement
  • Token efficiency: 4,300 average tokens per workflow → 2,400 with AST-aware chunking, because you stop duplicating context that appears in multiple broken chunks
  • Reduction in hallucinated completions: agents referencing undefined variables or wrong method signatures drops significantly when chunks are syntactically complete

Implementation Options

code-chunk (NPM, open-source). The simplest path for Node.js/TypeScript projects. Install, point at a source file, get structured chunks out.

import fs from 'node:fs';
import { chunkCode } from 'code-chunk';

const chunks = await chunkCode({
  content: fs.readFileSync('./src/auth/middleware.ts', 'utf8'),
  language: 'typescript',
  maxTokens: 512,
});

// Each chunk: { content, language, startLine, endLine, scope, imports }
for (const chunk of chunks) {
  // ks: your KnowledgeSDK client instance
  await ks.index({
    content: chunk.content,
    metadata: {
      file: './src/auth/middleware.ts',
      scope: chunk.scope,
      startLine: chunk.startLine,
    },
  });
}

LlamaIndex CodeSplitter. Python-based, also tree-sitter backed, integrates with the broader LlamaIndex ecosystem. Good choice if you are already using LlamaIndex for your pipeline.

Manual tree-sitter implementation. For teams with specific requirements — custom languages, non-standard chunking strategies, or integration into existing pipelines — implementing directly against the tree-sitter Python or Node bindings gives full control.

When AST-Aware Chunking Matters

Use AST-aware chunking when:

  • Your RAG corpus includes source code files (any language)
  • You are building a coding agent that retrieves over a codebase
  • You are indexing API documentation that includes code examples
  • You are building code review AI or code search functionality
  • You are indexing SDK documentation where function signatures and types matter

For Web Documentation Containing Code

When you use KnowledgeSDK to extract technical documentation pages, the extraction automatically converts HTML to clean markdown. Code blocks in the page are preserved as fenced markdown code blocks — not as inline text. This means the output is already structured in a way that AST-aware chunkers can process: you can identify code blocks by their markdown fencing, extract them as candidates for AST chunking, and process the prose sections with standard text splitters.

A hybrid pipeline works well here: extract the page with POST /v1/extract, parse the markdown output, route prose paragraphs through RecursiveCharacterTextSplitter, and route code blocks through code-chunk. Each section gets the chunking strategy appropriate to its content type.
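The routing step of that hybrid pipeline can be sketched with a fenced-block regex over the extracted markdown (a simplification — real markdown parsing has edge cases like indented and tilde fences):

```python
import re

# Build the sample markdown without literal triple backticks in source.
fence = "`" * 3
MARKDOWN = (
    "The middleware validates tokens.\n\n"
    f"{fence}python\ndef validate(token):\n    return decode(token)\n{fence}\n\n"
    "Call it on every request.\n"
)

FENCE = re.compile(r"`{3}(\w*)\n(.*?)`{3}", re.DOTALL)

def route(markdown: str) -> tuple[list[str], list[tuple[str, str]]]:
    # Split the page into prose paragraphs and (language, code) blocks,
    # so each can go to the chunker appropriate for its content type:
    # prose to a text splitter, code to an AST-aware chunker.
    code_blocks = [(m.group(1) or "text", m.group(2)) for m in FENCE.finditer(markdown)]
    prose = [s.strip() for s in FENCE.sub("", markdown).split("\n\n") if s.strip()]
    return prose, code_blocks

prose, code = route(MARKDOWN)
```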

When Text Splitting Is Fine

Not every RAG use case needs AST-aware chunking. If your corpus is entirely prose — blog posts, product pages, news articles, legal documents, marketing copy — text splitting does the job well. The structural precision of AST chunking only pays off when the content has syntactic structure worth preserving.

The cost of unnecessary AST chunking is engineering complexity: you need tree-sitter bindings, language detection, and a more complex chunking pipeline. For prose content, that complexity delivers no retrieval improvement.

Evaluate your corpus. If more than 20% of your indexed content is code, AST-aware chunking is worth implementing. Below that threshold, invest the engineering time elsewhere.

