knowledgesdk.com/glossary/indexing
RAG & Retrieval · beginner

Also known as: document indexing, knowledge indexing

Indexing

The process of transforming raw content into a searchable structure — embeddings, inverted indexes, or graph nodes — that enables fast retrieval.

What Is Indexing?

Indexing is the offline preprocessing step that transforms raw source content — web pages, PDFs, markdown files, database records — into a structured form that can be searched quickly at query time. Without an index, every search would require scanning all documents from scratch, which is impractical at any meaningful scale.

In RAG systems, indexing often produces two artifacts side by side: a vector index (for semantic search) and, where hybrid retrieval is used, an inverted index (for keyword search).

The Indexing Pipeline

Raw Content
    │
    ▼
1. Loading      — fetch content from source (URL, file, API)
    │
    ▼
2. Parsing      — extract clean text (strip HTML, handle PDFs, OCR images)
    │
    ▼
3. Chunking     — split into segments of appropriate size
    │
    ▼
4. Enrichment   — attach metadata (source URL, category, timestamp)
    │
    ▼
5. Embedding    — convert each chunk to a dense vector
    │
    ▼
6. Storage      — write vectors to vector DB, text to inverted index
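
The stages above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not a production pipeline: `parse` strips tags naively, `toy_embed` is a hash-based stand-in for a real embedding model, loading is skipped (the raw HTML is passed in), and the "store" is just a list. All names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)
    vector: list = field(default_factory=list)


def parse(raw_html: str) -> str:
    """Step 2 (toy version): drop everything between '<' and '>'."""
    out, in_tag = [], False
    for ch in raw_html:
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
            out.append(" ")
        elif not in_tag:
            out.append(ch)
    return " ".join("".join(out).split())


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list:
    """Step 3: fixed-size word windows; neighbors share `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def toy_embed(text: str, dims: int = 8) -> list:
    """Step 5 placeholder: deterministic pseudo-vector derived from a hash.
    A real pipeline would call an embedding model here."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def index_document(url: str, raw_html: str, store: list) -> int:
    """Steps 2-6 for one document; returns the number of chunks written."""
    text = parse(raw_html)
    pieces = chunk_words(text)
    for i, piece in enumerate(pieces):
        store.append(Chunk(
            text=piece,
            # Step 4: enrichment with source URL, position, and timestamp.
            metadata={
                "source_url": url,
                "chunk_index": i,
                "indexed_at": datetime.now(timezone.utc).isoformat(),
            },
            vector=toy_embed(piece),
        ))
    return len(pieces)


store = []
written = index_document(
    "https://example.com/a",
    "<h1>Title</h1><p>" + "word " * 120 + "</p>",
    store,
)
```

With 121 words, a window of 50, and an overlap of 10, the document yields three chunks, each carrying its own metadata and vector.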

Index Types in RAG

Vector Index (Dense)

Stores embedding vectors and supports approximate nearest-neighbor queries. Implemented with HNSW, IVF, or similar algorithms. Enables semantic search.
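
To make the query side concrete, here is an exact (brute-force) nearest-neighbor search by cosine similarity. It is deliberately not HNSW or IVF: those algorithms approximate this result while avoiding a comparison against every stored vector. The vectors and chunk IDs are made up for illustration.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list, vectors: dict, k: int = 2) -> list:
    """Score every stored vector against the query and keep the top k."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
top = nearest([1.0, 0.0, 0.0], vectors)
top_ids = [chunk_id for chunk_id, _ in top]  # ["chunk-a", "chunk-b"]
```

The cost of this scan grows linearly with the number of stored vectors, which is the reason approximate indexes exist.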

Inverted Index (Sparse)

Maps each term to the list of documents containing it. Enables keyword search (BM25). Extremely fast for exact-term lookup.
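
A minimal inverted index fits in a dictionary that maps each term to the set of document IDs containing it. The sketch below covers only the lookup structure; BM25 scoring, which additionally needs term frequencies and document lengths, is left out. The tokenizer and document set are illustrative assumptions.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search_all_terms(index: dict, query: str) -> set:
    """Return documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    "d1": "Vector search uses embeddings",
    "d2": "Keyword search uses an inverted index",
    "d3": "Graph nodes and edges",
}
index = build_inverted_index(docs)
both = search_all_terms(index, "search uses")
only_d2 = search_all_terms(index, "inverted index")
```

Each query term costs one dictionary lookup regardless of corpus size, which is why exact-term search is so fast.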

Graph Index

Represents documents and their relationships as graph nodes and edges. Used in graph RAG (e.g., Microsoft GraphRAG) for reasoning over entity relationships.
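
As a rough picture of the data structure, a graph index can be held as an adjacency map of entities plus a map from each entity to the documents that mention it. This is a simplified sketch with hypothetical entity names and is not how Microsoft GraphRAG is implemented; it only shows how one-hop relationships widen retrieval.

```python
from collections import defaultdict


class GraphIndex:
    def __init__(self):
        self.edges = defaultdict(set)     # entity -> {(relation, entity)}
        self.mentions = defaultdict(set)  # entity -> {doc_id}

    def add_relation(self, subject: str, relation: str, obj: str, doc_id: str) -> None:
        """Record an edge in both directions and note the source document."""
        self.edges[subject].add((relation, obj))
        self.edges[obj].add((f"inverse:{relation}", subject))
        self.mentions[subject].add(doc_id)
        self.mentions[obj].add(doc_id)

    def docs_near(self, entity: str) -> set:
        """Documents mentioning the entity or any of its one-hop neighbors."""
        docs = set(self.mentions.get(entity, set()))
        for _, neighbor in self.edges.get(entity, set()):
            docs |= self.mentions.get(neighbor, set())
        return docs


graph = GraphIndex()
graph.add_relation("Acme API", "uses", "OAuth", "doc-1")
graph.add_relation("OAuth", "issues", "access tokens", "doc-2")
graph.add_relation("Billing", "depends_on", "Ledger", "doc-3")

related = graph.docs_near("Acme API")
```

A query about "Acme API" also surfaces doc-2, which never mentions it directly but describes a neighboring entity.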

Relational Index

A SQL table with full-text search support (e.g., a PostgreSQL tsvector column with a GIN index). Useful for structured data with natural-language fields.

Freshness: Keeping the Index Current

An index is only as useful as it is fresh. A stale index produces outdated answers.

Common strategies:

  • Polling — re-index sources on a schedule (hourly, daily)
  • Webhooks — re-index on content-change events
  • On-demand — re-index when a user reports a bad answer
  • Incremental — index only changed pages (diff-based)

Many production systems combine these: scheduled re-indexing for bulk refresh, plus webhooks for critical content.
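
The incremental strategy can be sketched as a diff between the content hashes recorded at the last indexing run and the content fetched now. The function name, the shape of the plan, and the example URLs are assumptions for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes: dict, current_pages: dict) -> dict:
    """Compare stored hashes with current content and classify each URL."""
    plan = {"add": [], "update": [], "delete": [], "skip": []}
    for url, text in current_pages.items():
        if url not in indexed_hashes:
            plan["add"].append(url)
        elif indexed_hashes[url] != content_hash(text):
            plan["update"].append(url)
        else:
            plan["skip"].append(url)
    # Pages that were indexed before but no longer exist at the source.
    plan["delete"] = [url for url in indexed_hashes if url not in current_pages]
    return plan


indexed = {
    "https://example.com/intro": content_hash("Welcome"),
    "https://example.com/auth": content_hash("Use API keys"),
    "https://example.com/old": content_hash("Deprecated"),
}
current = {
    "https://example.com/intro": "Welcome",
    "https://example.com/auth": "Use OAuth tokens",
    "https://example.com/new": "Fresh page",
}
plan = plan_reindex(indexed, current)
```

Only the pages under "add" and "update" need to go through chunking and embedding again; "delete" entries should be removed from the index so they stop appearing in results.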

Indexing Cost

Indexing a document costs significantly more than answering a single query:

  • Embedding N chunks requires N embedding computations (GPU or API cost), even when batching reduces the number of requests
  • Writes to a vector database are typically slower than reads, because the index structure must be updated
  • Large documents require many chunks (a 100-page PDF might produce 300+ chunks)
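
The chunk count follows from simple arithmetic once a document length and chunk size are fixed. The numbers below (500 tokens per page, 200-token chunks, 40-token overlap) are illustrative assumptions, not values from any particular system.

```python
def estimate_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover `total_tokens`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    # Integer ceiling of (total_tokens - overlap) / step.
    return -(-(total_tokens - overlap) // step)


pages = 100
tokens_per_page = 500  # assumed average
chunks = estimate_chunks(pages * tokens_per_page, chunk_size=200, overlap=40)
# chunks == 313
```

Each of those chunks needs its own embedding, so halving the chunk size roughly doubles the embedding cost for the same document.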

Optimize by:

  • Caching — skip re-embedding chunks whose content hash has not changed
  • Batching — send chunks to the embedding API in bulk (commonly 100–2,000 per request; limits vary by provider)
  • Async indexing — use a job queue for large sources; do not block the user
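
Caching and batching combine naturally: hash each chunk, drop the ones already in the cache, then send the remainder in fixed-size batches. In this sketch `embed_batch` is a hypothetical callable standing in for whichever embedding API is in use, and the cache is a plain dictionary.

```python
import hashlib


def batched(items: list, batch_size: int):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_with_cache(chunks: list, cache: dict, embed_batch, batch_size: int = 100) -> int:
    """Embed only chunks missing from `cache`; return the number of API calls made."""
    pending = {}
    for text in chunks:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            pending[key] = text  # identical chunks collapse to one entry
    calls = 0
    for key_batch in batched(list(pending), batch_size):
        vectors = embed_batch([pending[key] for key in key_batch])
        for key, vector in zip(key_batch, vectors):
            cache[key] = vector
        calls += 1
    return calls


def fake_embed_batch(texts: list) -> list:
    """Stand-in for a real embedding call: one tiny vector per input."""
    return [[float(len(t))] for t in texts]


cache = {}
chunks = [f"chunk {i}" for i in range(250)]
first_run = embed_with_cache(chunks, cache, fake_embed_batch)                    # 3 calls
second_run = embed_with_cache(chunks + ["new chunk"], cache, fake_embed_batch)   # 1 call
third_run = embed_with_cache(chunks, cache, fake_embed_batch)                    # 0 calls
```

In a real deployment the cache would live in persistent storage so that it survives between indexing runs.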

Indexing with KnowledgeSDK

POST /v1/extract handles the full indexing pipeline in a single API call:

curl -X POST https://api.knowledgesdk.com/v1/extract \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.yourapp.com/api-reference"}'

KnowledgeSDK:

  1. Fetches and parses the URL
  2. Extracts clean content (stripping navigation, ads, boilerplate)
  3. Chunks the content with overlap
  4. Embeds each chunk
  5. Writes to your Typesense collection (vector + BM25 index)

For large sites, use POST /v1/extract/async to get a jobId and index asynchronously:

curl -X POST https://api.knowledgesdk.com/v1/extract/async \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.yourapp.com/api-reference", "callbackUrl": "https://yourapp.com/webhooks/index-complete"}'
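
The same two requests can be built from Python with the standard library. This sketch mirrors only what the curl commands above show (endpoint paths, the x-api-key header, and the url and callbackUrl fields). The helper name is hypothetical, the response format is not assumed, and the request is constructed but not sent.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.knowledgesdk.com/v1"


def build_extract_request(
    api_key: str,
    url: str,
    use_async: bool = False,
    callback_url: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /extract or /extract/async."""
    payload = {"url": url}
    if callback_url is not None:
        payload["callbackUrl"] = callback_url
    endpoint = f"{API_BASE}/extract/async" if use_async else f"{API_BASE}/extract"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request(
    "test_key",
    "https://docs.yourapp.com/api-reference",
    use_async=True,
    callback_url="https://yourapp.com/webhooks/index-complete",
)
# To send it: urllib.request.urlopen(request), with a real API key.
```

Keeping request construction separate from sending makes the call easy to unit-test without network access.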

Indexing Best Practices

  • Index only clean content — remove navigation menus, footers, cookie banners before chunking
  • Attach metadata — always store source URL, category, and timestamp with each chunk
  • Use content hashes — skip re-embedding if the source content has not changed
  • Monitor index health — track the number of items indexed, their freshness, and retrieval quality metrics
  • Separate test and production indexes — use separate API keys / collections for different environments
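
For the monitoring point, a basic health summary needs only the per-item timestamps that the metadata practice above already stores. The field names and the one-day freshness threshold are assumptions; retrieval quality metrics would need a separate evaluation set and are not covered here.

```python
from datetime import datetime, timedelta, timezone


def index_health(indexed_at: dict, now: datetime, max_age: timedelta) -> dict:
    """Summarize item count, stale items, and the age of the oldest item."""
    stale = sorted(url for url, ts in indexed_at.items() if now - ts > max_age)
    oldest = min(indexed_at.values(), default=None)
    return {
        "items": len(indexed_at),
        "stale": stale,
        "oldest_age_hours": None if oldest is None else (now - oldest).total_seconds() / 3600,
    }


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
indexed_at = {
    "https://example.com/a": now - timedelta(hours=2),
    "https://example.com/b": now - timedelta(days=3),
}
report = index_health(indexed_at, now, max_age=timedelta(days=1))
```

Here the report lists one stale URL and an oldest item age of 72 hours, which could feed an alert or trigger a targeted re-index.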

Related Terms

Chunking (RAG & Retrieval, beginner)
The process of splitting long documents into smaller, overlapping or non-overlapping segments before embedding and indexing.

Vector Database (RAG & Retrieval, beginner)
A specialized database that stores high-dimensional embedding vectors and enables fast similarity search.

Retrieval Pipeline (RAG & Retrieval, intermediate)
The end-to-end sequence of steps — query processing, search, re-ranking, and context assembly — that retrieves relevant documents for an LLM.

Knowledge Base (RAG & Retrieval, beginner)
A structured or unstructured collection of information that an AI system can query to answer questions or complete tasks.