What Is a Document Store?
A document store (or document database) is a type of NoSQL database that stores data as self-describing documents — typically JSON, BSON, or XML — rather than as rows in fixed-schema tables. Each document can have a different structure, making document stores ideal for heterogeneous data like articles, product descriptions, knowledge items, or configuration records.
In AI and RAG pipelines, document stores play a dual role: as the primary storage backend for raw content before it is embedded, and as a retrieval layer that supports filtering by metadata (author, date, category, source URL) to narrow down candidates before or after vector search.
How Document Stores Work
Documents are stored as key-value collections where the key is a unique ID and the value is the document body (a JSON object or text blob). Retrieval works in several ways:
- By ID: Fetch a specific document directly, e.g. `GET /documents/{id}`.
- By query: Filter documents using field-based predicates, e.g. `{ category: "legal", date: { $gte: "2024-01-01" } }`.
- By full-text search: Many document stores include full-text indexing (an inverted index) for keyword matching.
- By vector similarity (hybrid stores): Modern document databases like MongoDB Atlas and Elasticsearch support storing embeddings alongside documents and running ANN queries.
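The first two retrieval modes can be sketched with a minimal in-memory store. The `DocumentStore` class below is a hypothetical illustration of the pattern, not any specific product's API: `get` is the by-ID lookup, and `query` applies field-based predicates.

```python
# Minimal in-memory sketch of document-store retrieval.
# DocumentStore is a hypothetical illustration, not a real product API.

class DocumentStore:
    def __init__(self):
        self._docs = {}  # key: unique ID, value: document body (a dict)

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id):
        # By ID: direct lookup, analogous to GET /documents/{id}
        return self._docs.get(doc_id)

    def query(self, **predicates):
        # By query: keep documents whose fields satisfy every predicate.
        # A predicate is either a literal value (equality) or a callable.
        def matches(doc):
            for field, pred in predicates.items():
                value = doc.get(field)
                if callable(pred):
                    if not pred(value):
                        return False
                elif value != pred:
                    return False
            return True
        return [d for d in self._docs.values() if matches(d)]


store = DocumentStore()
store.put("a1", {"category": "legal", "date": "2024-03-15", "title": "Terms"})
store.put("a2", {"category": "blog", "date": "2023-11-02", "title": "Launch"})

# Equivalent of { category: "legal", date: { $gte: "2024-01-01" } }
hits = store.query(category="legal", date=lambda d: d >= "2024-01-01")
print([h["title"] for h in hits])  # ['Terms']
```

Real document databases implement the same idea with persistent storage and secondary indexes, so predicate queries do not scan every document.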
Popular Document Stores
- MongoDB: The most widely used general-purpose document store. Supports rich queries, aggregation pipelines, and Atlas Vector Search.
- Elasticsearch / OpenSearch: Optimized for full-text search; widely used for log analytics and knowledge retrieval. Both now support vector search.
- Firestore: Google's managed document store, popular in mobile and web apps.
- CouchDB: Open-source, designed for offline-first and replication scenarios.
- DynamoDB: AWS's managed key-value and document store, optimized for high throughput at scale.
Document Stores vs. Vector Databases
| Feature | Document Store | Vector Database |
|---|---|---|
| Primary index | Metadata / full-text | Embedding vectors |
| Query type | Exact / range / keyword | Approximate nearest neighbor |
| Filtering | Rich metadata filters | Limited (varies by system) |
| Best for | Structured retrieval, CRUD | Semantic similarity search |
| Examples | MongoDB, Elasticsearch | Pinecone, Weaviate, Qdrant |
In production RAG systems, document stores and vector databases are often used together: the vector database finds semantically similar candidates, and the document store stores and serves the full content of those candidates.
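This division of labor can be sketched end to end. Everything below is a toy stand-in (dict-backed stores, a hand-rolled cosine similarity) rather than a real vector database client; the point is the two-step shape: the vector index ranks candidate IDs, and the document store hydrates those IDs into full documents.

```python
# Sketch of the hybrid pattern: a vector index returns candidate IDs,
# and the document store serves the full content. Both components are
# hypothetical stand-ins, not a specific product's API.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vector index: document ID -> embedding.
vector_index = {
    "doc-1": [0.9, 0.1],
    "doc-2": [0.1, 0.9],
}

# Toy document store: document ID -> full document.
doc_store = {
    "doc-1": {"title": "Contract law basics", "content": "..."},
    "doc-2": {"title": "Cooking with cast iron", "content": "..."},
}

def retrieve(query_embedding, k=1):
    # Step 1: vector search ranks candidate IDs by similarity.
    ranked = sorted(
        vector_index,
        key=lambda doc_id: cosine(query_embedding, vector_index[doc_id]),
        reverse=True,
    )
    # Step 2: the document store hydrates IDs into full documents.
    return [doc_store[doc_id] for doc_id in ranked[:k]]

print(retrieve([1.0, 0.0])[0]["title"])  # Contract law basics
```

In production, step 1 is an ANN query against a vector database and step 2 is a batched by-ID fetch, but the control flow is the same.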
Document Stores in Knowledge Pipelines
A typical knowledge pipeline might use a document store as follows:
- Raw content is scraped and parsed into structured documents with fields: `id`, `title`, `content`, `source_url`, `category`, `extracted_at`.
- Documents are stored in MongoDB or a similar system.
- Documents are also embedded and indexed in a vector store.
- At query time, vector search finds candidate document IDs, which are then fetched from the document store to get full content.
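The ingestion side of this pipeline can be sketched as follows. The `KnowledgeDoc` dataclass, the dict-backed stores, and the `embed` function are all hypothetical stand-ins (a real pipeline would call an embedding model and write to MongoDB and a vector database), but the fields match the list above and each document is written to both stores under the same ID.

```python
# Sketch of the ingestion steps above: parsed content becomes a structured
# document, which is written to both the document store and the vector store.
# All names here are hypothetical stand-ins, not a real SDK.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class KnowledgeDoc:
    id: str
    title: str
    content: str
    source_url: str
    category: str
    extracted_at: str

def embed(text):
    # Stand-in: real pipelines call an embedding model here.
    return [float(len(text) % 7), float(len(text) % 3)]

doc_store = {}     # document store: ID -> full document
vector_store = {}  # vector store: ID -> embedding

def ingest(title, content, source_url, category):
    doc = KnowledgeDoc(
        id=f"doc-{len(doc_store) + 1}",
        title=title,
        content=content,
        source_url=source_url,
        category=category,
        extracted_at=datetime.now(timezone.utc).isoformat(),
    )
    doc_store[doc.id] = asdict(doc)            # full content for serving
    vector_store[doc.id] = embed(doc.content)  # embedding for ANN search
    return doc.id

doc_id = ingest("GDPR overview", "The GDPR is...", "https://example.com/gdpr", "legal")
print(doc_store[doc_id]["category"])  # legal
```

Keeping the same ID in both stores is what makes the query-time step work: vector search returns IDs, and those IDs resolve directly to full documents.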
KnowledgeSDK abstracts this pattern — the /v1/search endpoint handles semantic retrieval, while the underlying knowledge item store persists the structured document data, so you do not need to manage the storage layer separately.
When to Choose a Document Store
- Your data is heterogeneous in structure.
- You need filtering by metadata fields alongside text search.
- You want the simplicity of storing and retrieving records without a strict schema.
- You are building a knowledge base that will be updated frequently with new documents.