What Is Indexing?
Indexing is the offline preprocessing step that transforms raw source content — web pages, PDFs, markdown files, database records — into a structured form that can be searched quickly at query time. Without an index, every search would require scanning all documents from scratch, which is impractical at any meaningful scale.
In RAG systems, indexing typically produces two artifacts side by side: a vector index (for semantic search) and an inverted index (for keyword search).
The Indexing Pipeline
Raw Content
│
▼
1. Loading — fetch content from source (URL, file, API)
│
▼
2. Parsing — extract clean text (strip HTML, handle PDFs, OCR images)
│
▼
3. Chunking — split into segments of appropriate size
│
▼
4. Enrichment — attach metadata (source URL, category, timestamp)
│
▼
5. Embedding — convert each chunk to a dense vector
│
▼
6. Storage — write vectors to vector DB, text to inverted index
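The six steps above can be sketched as a chain of small functions. Everything here is an illustrative stand-in, not a real SDK's API: the chunk size, the hash-based "embedding", and the in-memory list standing in for a vector database are all placeholder assumptions.

```python
import hashlib

def load(source: str) -> str:
    """1. Loading - for simplicity the 'source' is already raw text."""
    return source

def parse(raw: str) -> str:
    """2. Parsing - collapse whitespace as a stand-in for HTML/PDF cleanup."""
    return " ".join(raw.split())

def chunk(text: str, size: int = 200) -> list[str]:
    """3. Chunking - naive fixed-size character windows (placeholder size)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def enrich(chunks: list[str], url: str) -> list[dict]:
    """4. Enrichment - attach source metadata to each chunk."""
    return [{"text": c, "source_url": url} for c in chunks]

def embed(item: dict) -> dict:
    """5. Embedding - fake 4-dim vector derived from a content hash.
    A real pipeline would call an embedding model here."""
    digest = hashlib.sha256(item["text"].encode()).digest()
    item["vector"] = [b / 255 for b in digest[:4]]
    return item

def store(items: list[dict], index: list) -> None:
    """6. Storage - append to an in-memory list standing in for a vector DB."""
    index.extend(items)

index: list[dict] = []
raw = load("Indexing turns raw content into a searchable structure.")
store([embed(i) for i in enrich(chunk(parse(raw)), "https://example.com")], index)
```

Each stage is independently replaceable, which is why production pipelines treat them as separate, swappable components.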
Index Types in RAG
Vector Index (Dense)
Stores embedding vectors and supports approximate nearest-neighbor queries. Implemented with HNSW, IVF, or similar algorithms. Enables semantic search.
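What HNSW and IVF approximate is an exact nearest-neighbor scan. A minimal brute-force version, using cosine similarity over a toy in-memory index (all vectors and IDs here are made up for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], index: list[dict], k: int = 2) -> list[dict]:
    """Exact k-nearest-neighbor scan: O(N) per query.
    ANN structures like HNSW/IVF trade a little recall for sublinear time."""
    return sorted(index, key=lambda item: cosine(query, item["vec"]), reverse=True)[:k]

index = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.9, 0.1]},
    {"id": "c", "vec": [0.0, 1.0]},
]
top = nearest([1.0, 0.05], index, k=2)
```

The brute-force scan is fine for a few thousand vectors; ANN indexes exist because this O(N) cost becomes the bottleneck at millions.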
Inverted Index (Sparse)
Maps each term to the list of documents containing it. Enables keyword search (BM25). Extremely fast for exact-term lookup.
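The core data structure is simple enough to sketch in a few lines: a map from each term to its postings list (the documents containing it). The toy documents below are invented for illustration:

```python
from collections import defaultdict

docs = {
    1: "indexing turns raw content into a searchable structure",
    2: "an inverted index maps each term to matching documents",
    3: "vector search complements keyword search",
}

# Build: term -> set of doc ids containing it (the "postings list").
inverted: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def lookup(term: str) -> list[int]:
    """Exact-term lookup is a single dict access - this constant-time
    access pattern is why BM25 retrieval is so fast."""
    return sorted(inverted.get(term.lower(), set()))
```

Real engines add tokenization, stemming, and per-term statistics (document frequency, term frequency) on top of this structure to compute BM25 scores.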
Graph Index
Represents documents and their relationships as graph nodes and edges. Used in graph RAG (e.g., Microsoft GraphRAG) for reasoning over entity relationships.
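The node-and-edge representation can be illustrated with a toy entity graph. The entities and relations below are invented; systems like GraphRAG extract much richer graphs automatically, but the underlying structure is the same:

```python
# Toy entity graph: tuples of (subject, relation, object).
edges = [
    ("AcmeDB", "developed_by", "AcmeCorp"),
    ("AcmeDB", "written_in", "Rust"),
    ("AcmeCorp", "founded_in", "2019"),
]

def neighbors(entity: str) -> list[tuple[str, str]]:
    """One-hop neighborhood query: (relation, other_entity) pairs touching
    `entity` in either direction. Multi-hop traversal of such neighborhoods
    is what enables reasoning over entity relationships."""
    outgoing = [(rel, dst) for src, rel, dst in edges if src == entity]
    incoming = [(rel, src) for src, rel, dst in edges if dst == entity]
    return outgoing + incoming
```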
Relational Index
A structured SQL table with full-text search extensions (e.g., PostgreSQL tsvector). Useful for structured data with natural language fields.

Freshness: Keeping the Index Current
An index is only as good as it is fresh: a stale index serves outdated answers.
Common strategies:
- Polling — re-index sources on a schedule (hourly, daily)
- Webhooks — re-index on content-change events
- On-demand — re-index when a user reports a bad answer
- Incremental — index only changed pages (diff-based)
Most production systems use a combination: scheduled re-indexing for bulk refresh + webhooks for critical content.
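The incremental (diff-based) strategy typically hinges on content hashing: fetch each page on a schedule, hash it, and re-index only pages whose hash changed. A minimal sketch, with `pages` standing in for freshly fetched content (the fetching itself is omitted):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def poll(pages: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return URLs whose content changed since the last poll.
    `pages` maps url -> freshly fetched content; `seen` maps url -> last hash
    and is updated in place so the next poll only flags new changes."""
    changed = []
    for url, text in pages.items():
        h = content_hash(text)
        if seen.get(url) != h:
            seen[url] = h
            changed.append(url)
    return changed

seen: dict[str, str] = {}
first = poll({"https://example.com/a": "v1", "https://example.com/b": "v1"}, seen)
second = poll({"https://example.com/a": "v2", "https://example.com/b": "v1"}, seen)
```

On the first poll everything is "changed" (a full index build); afterwards only genuinely modified pages trigger re-indexing, which is what makes scheduled polling affordable at scale.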
Indexing Cost
Indexing is significantly more expensive than search:
- Embedding N chunks requires N embedding model calls (GPU or API cost)
- Vector database writes incur higher latency than reads
- Large documents require many chunks (a 100-page PDF might produce 300+ chunks)
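The arithmetic behind these costs is worth making explicit. A back-of-envelope estimate, where the chunks-per-page, tokens-per-chunk, and per-token price are all placeholder assumptions to substitute with your own numbers:

```python
def embedding_cost(pages: int, chunks_per_page: int = 3,
                   tokens_per_chunk: int = 400,
                   usd_per_million_tokens: float = 0.10) -> tuple[int, float]:
    """Rough indexing-cost estimate. The default rate is a placeholder
    assumption, not any provider's actual embedding price."""
    chunks = pages * chunks_per_page       # e.g. 100 pages -> ~300 chunks
    tokens = chunks * tokens_per_chunk
    return chunks, tokens * usd_per_million_tokens / 1_000_000

chunks, usd = embedding_cost(pages=100)
```

A single 100-page PDF is cheap in isolation; the cost matters when re-indexing thousands of sources on a schedule, which is exactly what the optimizations below target.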
Optimize by:
- Caching — skip re-embedding chunks whose content hash has not changed
- Batching — send chunks in bulk to the embedding API (usually 100–2000 at a time)
- Async indexing — use a job queue for large sources; do not block the user
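Caching and batching compose naturally: hash each chunk, serve unchanged chunks from the cache, and send only the rest to the embedding API in bulk. A sketch with a fake embedding call standing in for the real API (the tiny batch size is for illustration only):

```python
import hashlib

def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for a bulk embedding API call.
    Real APIs typically accept 100-2000 inputs per request."""
    return [[len(t) / 100.0] for t in texts]

cache: dict[str, list[float]] = {}  # content hash -> vector

def embed_with_cache(chunks: list[str], batch_size: int = 2) -> dict:
    vectors = {}
    to_embed = []
    for text in chunks:
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in cache:
            vectors[text] = cache[h]          # caching: skip unchanged content
        else:
            to_embed.append((h, text))
    for i in range(0, len(to_embed), batch_size):   # batching: bulk API calls
        batch = to_embed[i:i + batch_size]
        embedded = fake_embed_batch([t for _, t in batch])
        for (h, text), vec in zip(batch, embedded):
            cache[h] = vec
            vectors[text] = vec
    return vectors

first = embed_with_cache(["alpha", "beta", "gamma"])
second = embed_with_cache(["alpha", "delta"])   # "alpha" served from cache
```

On a re-index where most content is unchanged, the cache turns an O(N) embedding bill into one proportional only to the diff.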
Indexing with KnowledgeSDK
POST /v1/extract handles the full indexing pipeline in a single API call:
curl -X POST https://api.knowledgesdk.com/v1/extract \
-H "x-api-key: knowledgesdk_live_..." \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.yourapp.com/api-reference"}'
KnowledgeSDK:
- Fetches and parses the URL
- Extracts clean content (stripping navigation, ads, boilerplate)
- Chunks the content with overlap
- Embeds each chunk
- Writes to your Typesense collection (vector + BM25 index)
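The same call can be issued from Python's standard library. The sketch below only builds the request (it is not sent, so it runs without credentials); the endpoint and header names are taken from the curl example above:

```python
import json
import urllib.request

def build_extract_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the POST /v1/extract call shown in the curl example.
    Constructed but not sent, so this sketch runs without a real key."""
    body = json.dumps({"url": url}).encode()
    return urllib.request.Request(
        "https://api.knowledgesdk.com/v1/extract",
        data=body,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_extract_request("https://docs.yourapp.com/api-reference",
                            "knowledgesdk_live_...")
# urllib.request.urlopen(req) would actually send it.
```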
For large sites, use POST /v1/extract/async to get a jobId and index asynchronously:
curl -X POST https://api.knowledgesdk.com/v1/extract/async \
-H "x-api-key: knowledgesdk_live_..." \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.yourapp.com/api-reference", "callbackUrl": "https://yourapp.com/webhooks/index-complete"}'
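Your callbackUrl endpoint needs a small handler to receive the completion event. A minimal stdlib sketch; note that the `{"jobId": ..., "status": ...}` payload shape is an assumption for illustration, not a documented contract:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_index_complete(payload: dict) -> str:
    """Process a completion callback. The jobId/status fields are an
    assumed payload shape - check the webhook docs for the real fields."""
    if payload.get("status") == "completed":
        return f"job {payload['jobId']} indexed"
    return f"job {payload.get('jobId')} failed"

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        handle_index_complete(payload)
        self.send_response(200)   # acknowledge promptly; do heavy work async
        self.end_headers()

# HTTPServer(("", 8080), CallbackHandler).serve_forever() would run it.
```

Acknowledge the webhook quickly and defer any follow-up work to a queue; slow handlers cause senders to retry and deliver duplicates.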
Indexing Best Practices
- Index only clean content — remove navigation menus, footers, cookie banners before chunking
- Attach metadata — always store source URL, category, and timestamp with each chunk
- Use content hashes — skip re-embedding if the source content has not changed
- Monitor index health — track the number of items indexed, their freshness, and retrieval quality metrics
- Separate test and production indexes — use separate API keys / collections for different environments