What Is a Document Store?
A document store (or document database) is a type of NoSQL database that stores data as self-describing documents — typically JSON, BSON, or XML — rather than as rows in fixed-schema tables. Each document can have a different structure, making document stores ideal for heterogeneous data like articles, product descriptions, knowledge items, or configuration records.
In AI and RAG pipelines, document stores play a dual role: as the primary storage backend for raw content before it is embedded, and as a retrieval layer that supports filtering by metadata (author, date, category, source URL) to narrow down candidates before or after vector search.
How Document Stores Work
Documents are stored as key-value collections where the key is a unique ID and the value is the document body (a JSON object or text blob). Retrieval works in several ways:
- By ID: Fetch a specific document directly, e.g. `GET /documents/{id}`.
- By query: Filter documents using field-based predicates, e.g. `{ category: "legal", date: { $gte: "2024-01-01" } }`.
- By full-text search: Many document stores include full-text indexing (an inverted index) for keyword matching.
- By vector similarity (hybrid stores): Modern document databases like MongoDB Atlas and Elasticsearch support storing embeddings alongside documents and running ANN queries.
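The first two retrieval modes can be sketched with a minimal in-memory store. The `DocumentStore` class below is a hypothetical illustration of the pattern, not any specific product's API: `get` is the by-ID lookup, and `query` applies field-based predicates.

```python
# Minimal in-memory sketch of document-store retrieval.
# DocumentStore is a hypothetical illustration, not a real product API.

class DocumentStore:
    def __init__(self):
        self._docs = {}  # key: unique ID, value: document body (a dict)

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id):
        # By ID: direct lookup, analogous to GET /documents/{id}
        return self._docs.get(doc_id)

    def query(self, **predicates):
        # By query: keep documents whose fields satisfy every predicate.
        # A predicate is either a literal value (equality) or a callable.
        def matches(doc):
            for field, pred in predicates.items():
                value = doc.get(field)
                if callable(pred):
                    if not pred(value):
                        return False
                elif value != pred:
                    return False
            return True
        return [d for d in self._docs.values() if matches(d)]


store = DocumentStore()
store.put("a1", {"category": "legal", "date": "2024-03-15", "title": "Terms"})
store.put("a2", {"category": "blog", "date": "2023-11-02", "title": "Launch"})

# Equivalent of { category: "legal", date: { $gte: "2024-01-01" } }
hits = store.query(category="legal", date=lambda d: d >= "2024-01-01")
print([h["title"] for h in hits])  # ['Terms']
```

Real document databases implement the same idea with persistent storage and secondary indexes, so predicate queries do not scan every document.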
Popular Document Stores
- MongoDB: The most widely used general-purpose document store. Supports rich queries, aggregation pipelines, and Atlas Vector Search.
- Elasticsearch / OpenSearch: Optimized for full-text search; widely used for log analytics and knowledge retrieval. Both now support vector search.
- Firestore: Google's managed document store, popular in mobile and web apps.
- CouchDB: Open-source, designed for offline-first and replication scenarios.
- DynamoDB: AWS's managed key-value and document store, optimized for high throughput at scale.
Document Stores vs. Vector Databases
| Feature | Document Store | Vector Database |
|---|---|---|
| Primary index | Metadata / full-text | Embedding vectors |
| Query type | Exact / range / keyword | Approximate nearest neighbor |
| Filtering | Rich metadata filters | Limited (varies by system) |
| Best for | Structured retrieval, CRUD | Semantic similarity search |
| Examples | MongoDB, Elasticsearch | Pinecone, Weaviate, Qdrant |
In production RAG systems, document stores and vector databases are often used together: the vector database finds semantically similar candidates, and the document store stores and serves the full content of those candidates.
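This division of labor can be sketched end to end. Everything below is a toy stand-in (dict-backed stores, a hand-rolled cosine similarity) rather than a real vector database client; the point is the two-step shape: the vector index ranks candidate IDs, and the document store hydrates those IDs into full documents.

```python
# Sketch of the hybrid pattern: a vector index returns candidate IDs,
# and the document store serves the full content. Both components are
# hypothetical stand-ins, not a specific product's API.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vector index: document ID -> embedding.
vector_index = {
    "doc-1": [0.9, 0.1],
    "doc-2": [0.1, 0.9],
}

# Toy document store: document ID -> full document.
doc_store = {
    "doc-1": {"title": "Contract law basics", "content": "..."},
    "doc-2": {"title": "Cooking with cast iron", "content": "..."},
}

def retrieve(query_embedding, k=1):
    # Step 1: vector search ranks candidate IDs by similarity.
    ranked = sorted(
        vector_index,
        key=lambda doc_id: cosine(query_embedding, vector_index[doc_id]),
        reverse=True,
    )
    # Step 2: the document store hydrates IDs into full documents.
    return [doc_store[doc_id] for doc_id in ranked[:k]]

print(retrieve([1.0, 0.0])[0]["title"])  # Contract law basics
```

In production, step 1 is an ANN query against a vector database and step 2 is a batched by-ID fetch, but the control flow is the same.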
Document Stores in Knowledge Pipelines
A typical knowledge pipeline might use a document store as follows:
- Raw content is scraped and parsed into structured documents with fields: `id`, `title`, `content`, `source_url`, `category`, `extracted_at`.
- Documents are stored in MongoDB or a similar system.
- Documents are also embedded and indexed in a vector store.
- At query time, vector search finds candidate document IDs, which are then fetched from the document store to get full content.
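The ingestion side of this pipeline can be sketched as follows. The `KnowledgeDoc` dataclass, the dict-backed stores, and the `embed` function are all hypothetical stand-ins (a real pipeline would call an embedding model and write to MongoDB and a vector database), but the fields match the list above and each document is written to both stores under the same ID.

```python
# Sketch of the ingestion steps above: parsed content becomes a structured
# document, which is written to both the document store and the vector store.
# All names here are hypothetical stand-ins, not a real SDK.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class KnowledgeDoc:
    id: str
    title: str
    content: str
    source_url: str
    category: str
    extracted_at: str

def embed(text):
    # Stand-in: real pipelines call an embedding model here.
    return [float(len(text) % 7), float(len(text) % 3)]

doc_store = {}     # document store: ID -> full document
vector_store = {}  # vector store: ID -> embedding

def ingest(title, content, source_url, category):
    doc = KnowledgeDoc(
        id=f"doc-{len(doc_store) + 1}",
        title=title,
        content=content,
        source_url=source_url,
        category=category,
        extracted_at=datetime.now(timezone.utc).isoformat(),
    )
    doc_store[doc.id] = asdict(doc)            # full content for serving
    vector_store[doc.id] = embed(doc.content)  # embedding for ANN search
    return doc.id

doc_id = ingest("GDPR overview", "The GDPR is...", "https://example.com/gdpr", "legal")
print(doc_store[doc_id]["category"])  # legal
```

Keeping the same ID in both stores is what makes the query-time step work: vector search returns IDs, and those IDs resolve directly to full documents.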
KnowledgeSDK abstracts this pattern — the /v1/search endpoint handles semantic retrieval, while the underlying knowledge item store persists the structured document data, so you do not need to manage the storage layer separately.
When to Choose a Document Store
- Your data is heterogeneous in structure.
- You need filtering by metadata fields alongside text search.
- You want the simplicity of storing and retrieving records without a strict schema.
- You are building a knowledge base that will be updated frequently with new documents.