What Is Indexing?
Indexing is the offline preprocessing step that transforms raw source content — web pages, PDFs, markdown files, database records — into a structured form that can be searched quickly at query time. Without an index, every search would require scanning all documents from scratch, which is impractical at any meaningful scale.
In RAG systems, indexing typically produces two artifacts side by side: a vector index (for semantic search) and an inverted index (for keyword search).
The Indexing Pipeline
Raw Content
│
▼
1. Loading — fetch content from source (URL, file, API)
│
▼
2. Parsing — extract clean text (strip HTML, handle PDFs, OCR images)
│
▼
3. Chunking — split into segments of appropriate size
│
▼
4. Enrichment — attach metadata (source URL, category, timestamp)
│
▼
5. Embedding — convert each chunk to a dense vector
│
▼
6. Storage — write vectors to vector DB, text to inverted index
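The six steps above can be sketched as a chain of small functions. Everything here is an illustrative stand-in, not a real SDK's API: the chunk size, the hash-based "embedding", and the in-memory list standing in for a vector database are all placeholder assumptions.

```python
import hashlib

def load(source: str) -> str:
    """1. Loading - for simplicity the 'source' is already raw text."""
    return source

def parse(raw: str) -> str:
    """2. Parsing - collapse whitespace as a stand-in for HTML/PDF cleanup."""
    return " ".join(raw.split())

def chunk(text: str, size: int = 200) -> list[str]:
    """3. Chunking - naive fixed-size character windows (placeholder size)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def enrich(chunks: list[str], url: str) -> list[dict]:
    """4. Enrichment - attach source metadata to each chunk."""
    return [{"text": c, "source_url": url} for c in chunks]

def embed(item: dict) -> dict:
    """5. Embedding - fake 4-dim vector derived from a content hash.
    A real pipeline would call an embedding model here."""
    digest = hashlib.sha256(item["text"].encode()).digest()
    item["vector"] = [b / 255 for b in digest[:4]]
    return item

def store(items: list[dict], index: list) -> None:
    """6. Storage - append to an in-memory list standing in for a vector DB."""
    index.extend(items)

index: list[dict] = []
raw = load("Indexing turns raw content into a searchable structure.")
store([embed(i) for i in enrich(chunk(parse(raw)), "https://example.com")], index)
```

Each stage is independently replaceable, which is why production pipelines treat them as separate, swappable components.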
Index Types in RAG
Vector Index (Dense)
Stores embedding vectors and supports approximate nearest-neighbor queries. Implemented with HNSW, IVF, or similar algorithms. Enables semantic search.
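What HNSW and IVF approximate is an exact nearest-neighbor scan. A minimal brute-force version, using cosine similarity over a toy in-memory index (all vectors and IDs here are made up for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], index: list[dict], k: int = 2) -> list[dict]:
    """Exact k-nearest-neighbor scan: O(N) per query.
    ANN structures like HNSW/IVF trade a little recall for sublinear time."""
    return sorted(index, key=lambda item: cosine(query, item["vec"]), reverse=True)[:k]

index = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.9, 0.1]},
    {"id": "c", "vec": [0.0, 1.0]},
]
top = nearest([1.0, 0.05], index, k=2)
```

The brute-force scan is fine for a few thousand vectors; ANN indexes exist because this O(N) cost becomes the bottleneck at millions.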
Inverted Index (Sparse)
Maps each term to the list of documents containing it. Enables keyword search (BM25). Extremely fast for exact-term lookup.
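The core data structure is simple enough to sketch in a few lines: a map from each term to its postings list (the documents containing it). The toy documents below are invented for illustration:

```python
from collections import defaultdict

docs = {
    1: "indexing turns raw content into a searchable structure",
    2: "an inverted index maps each term to matching documents",
    3: "vector search complements keyword search",
}

# Build: term -> set of doc ids containing it (the "postings list").
inverted: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def lookup(term: str) -> list[int]:
    """Exact-term lookup is a single dict access - this constant-time
    access pattern is why BM25 retrieval is so fast."""
    return sorted(inverted.get(term.lower(), set()))
```

Real engines add tokenization, stemming, and per-term statistics (document frequency, term frequency) on top of this structure to compute BM25 scores.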
Graph Index
Represents documents and their relationships as graph nodes and edges. Used in graph RAG (e.g., Microsoft GraphRAG) for reasoning over entity relationships.
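The node-and-edge representation can be illustrated with a toy entity graph. The entities and relations below are invented; systems like GraphRAG extract much richer graphs automatically, but the underlying structure is the same:

```python
# Toy entity graph: tuples of (subject, relation, object).
edges = [
    ("AcmeDB", "developed_by", "AcmeCorp"),
    ("AcmeDB", "written_in", "Rust"),
    ("AcmeCorp", "founded_in", "2019"),
]

def neighbors(entity: str) -> list[tuple[str, str]]:
    """One-hop neighborhood query: (relation, other_entity) pairs touching
    `entity` in either direction. Multi-hop traversal of such neighborhoods
    is what enables reasoning over entity relationships."""
    outgoing = [(rel, dst) for src, rel, dst in edges if src == entity]
    incoming = [(rel, src) for src, rel, dst in edges if dst == entity]
    return outgoing + incoming
```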
Relational Index
A structured SQL table with full-text search extensions (e.g., PostgreSQL tsvector). Useful for structured data with natural language fields.

Freshness: Keeping the Index Current
An index is only as good as it is fresh: a stale index serves outdated answers.
Common strategies:
- Polling — re-index sources on a schedule (hourly, daily)
- Webhooks — re-index on content-change events
- On-demand — re-index when a user reports a bad answer
- Incremental — index only changed pages (diff-based)
Most production systems use a combination: scheduled re-indexing for bulk refresh + webhooks for critical content.
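The incremental (diff-based) strategy typically hinges on content hashing: fetch each page on a schedule, hash it, and re-index only pages whose hash changed. A minimal sketch, with `pages` standing in for freshly fetched content (the fetching itself is omitted):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def poll(pages: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return URLs whose content changed since the last poll.
    `pages` maps url -> freshly fetched content; `seen` maps url -> last hash
    and is updated in place so the next poll only flags new changes."""
    changed = []
    for url, text in pages.items():
        h = content_hash(text)
        if seen.get(url) != h:
            seen[url] = h
            changed.append(url)
    return changed

seen: dict[str, str] = {}
first = poll({"https://example.com/a": "v1", "https://example.com/b": "v1"}, seen)
second = poll({"https://example.com/a": "v2", "https://example.com/b": "v1"}, seen)
```

On the first poll everything is "changed" (a full index build); afterwards only genuinely modified pages trigger re-indexing, which is what makes scheduled polling affordable at scale.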
Indexing Cost
Indexing is significantly more expensive than search:
- Embedding N chunks requires N embedding model calls (GPU or API cost)
- Vector database writes incur higher latency than reads
- Large documents require many chunks (a 100-page PDF might produce 300+ chunks)
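The arithmetic behind these costs is worth making explicit. A back-of-envelope estimate, where the chunks-per-page, tokens-per-chunk, and per-token price are all placeholder assumptions to substitute with your own numbers:

```python
def embedding_cost(pages: int, chunks_per_page: int = 3,
                   tokens_per_chunk: int = 400,
                   usd_per_million_tokens: float = 0.10) -> tuple[int, float]:
    """Rough indexing-cost estimate. The default rate is a placeholder
    assumption, not any provider's actual embedding price."""
    chunks = pages * chunks_per_page       # e.g. 100 pages -> ~300 chunks
    tokens = chunks * tokens_per_chunk
    return chunks, tokens * usd_per_million_tokens / 1_000_000

chunks, usd = embedding_cost(pages=100)
```

A single 100-page PDF is cheap in isolation; the cost matters when re-indexing thousands of sources on a schedule, which is exactly what the optimizations below target.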
Optimize by:
- Caching — skip re-embedding chunks whose content hash has not changed
- Batching — send chunks in bulk to the embedding API (usually 100–2000 at a time)
- Async indexing — use a job queue for large sources; do not block the user
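Caching and batching compose naturally: hash each chunk, serve unchanged chunks from the cache, and send only the rest to the embedding API in bulk. A sketch with a fake embedding call standing in for the real API (the tiny batch size is for illustration only):

```python
import hashlib

def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for a bulk embedding API call.
    Real APIs typically accept 100-2000 inputs per request."""
    return [[len(t) / 100.0] for t in texts]

cache: dict[str, list[float]] = {}  # content hash -> vector

def embed_with_cache(chunks: list[str], batch_size: int = 2) -> dict:
    vectors = {}
    to_embed = []
    for text in chunks:
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in cache:
            vectors[text] = cache[h]          # caching: skip unchanged content
        else:
            to_embed.append((h, text))
    for i in range(0, len(to_embed), batch_size):   # batching: bulk API calls
        batch = to_embed[i:i + batch_size]
        embedded = fake_embed_batch([t for _, t in batch])
        for (h, text), vec in zip(batch, embedded):
            cache[h] = vec
            vectors[text] = vec
    return vectors

first = embed_with_cache(["alpha", "beta", "gamma"])
second = embed_with_cache(["alpha", "delta"])   # "alpha" served from cache
```

On a re-index where most content is unchanged, the cache turns an O(N) embedding bill into one proportional only to the diff.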
Indexing with KnowledgeSDK
POST /v1/extract handles the full indexing pipeline in a single API call:
curl -X POST https://api.knowledgesdk.com/v1/extract \
-H "x-api-key: knowledgesdk_live_..." \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.yourapp.com/api-reference"}'
KnowledgeSDK:
- Fetches and parses the URL
- Extracts clean content (stripping navigation, ads, boilerplate)
- Chunks the content with overlap
- Embeds each chunk
- Writes to your Typesense collection (vector + BM25 index)
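The same call can be issued from Python's standard library. The sketch below only builds the request (it is not sent, so it runs without credentials); the endpoint and header names are taken from the curl example above:

```python
import json
import urllib.request

def build_extract_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the POST /v1/extract call shown in the curl example.
    Constructed but not sent, so this sketch runs without a real key."""
    body = json.dumps({"url": url}).encode()
    return urllib.request.Request(
        "https://api.knowledgesdk.com/v1/extract",
        data=body,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_extract_request("https://docs.yourapp.com/api-reference",
                            "knowledgesdk_live_...")
# urllib.request.urlopen(req) would actually send it.
```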
For large sites, use POST /v1/extract/async to get a jobId and index asynchronously:
curl -X POST https://api.knowledgesdk.com/v1/extract/async \
-H "x-api-key: knowledgesdk_live_..." \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.yourapp.com/api-reference", "callbackUrl": "https://yourapp.com/webhooks/index-complete"}'
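Your callbackUrl endpoint needs a small handler to receive the completion event. A minimal stdlib sketch; note that the `{"jobId": ..., "status": ...}` payload shape is an assumption for illustration, not a documented contract:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_index_complete(payload: dict) -> str:
    """Process a completion callback. The jobId/status fields are an
    assumed payload shape - check the webhook docs for the real fields."""
    if payload.get("status") == "completed":
        return f"job {payload['jobId']} indexed"
    return f"job {payload.get('jobId')} failed"

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        handle_index_complete(payload)
        self.send_response(200)   # acknowledge promptly; do heavy work async
        self.end_headers()

# HTTPServer(("", 8080), CallbackHandler).serve_forever() would run it.
```

Acknowledge the webhook quickly and defer any follow-up work to a queue; slow handlers cause senders to retry and deliver duplicates.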
Indexing Best Practices
- Index only clean content — remove navigation menus, footers, cookie banners before chunking
- Attach metadata — always store source URL, category, and timestamp with each chunk
- Use content hashes — skip re-embedding if the source content has not changed
- Monitor index health — track the number of items indexed, their freshness, and retrieval quality metrics
- Separate test and production indexes — use separate API keys / collections for different environments