Most RAG systems are quietly wasting money. Not on the LLM calls — developers tend to optimize those aggressively. The waste happens earlier, at the embedding layer, where every chunk of text gets converted into a massive vector and stored in full — every single time.
If you're using text-embedding-3-large, each document chunk becomes a 3072-dimensional vector. Multiply that across a million documents and you have over 12 GB of floating-point data just for the vectors. Add in the latency cost of running approximate nearest neighbor (ANN) search over that many dimensions, and you start to see the problem.
The question is: do you actually need all 3072 dimensions? For most RAG workloads, the answer is no. Matryoshka Representation Learning (MRL) is the technique that makes dimension reduction safe — and in 2026, it's built into the most widely used embedding models.
The Embedding Dimension Problem
Embedding models work by mapping text into a high-dimensional space where semantically similar passages cluster together. More dimensions, in theory, mean more expressive representations with finer-grained distinctions.
But more dimensions come with real costs:
- Storage: Each dimension is a 32-bit float (4 bytes). At 3072 dims, one million documents consume ~12 GB for vectors alone (see the quick calculation after this list).
- Search latency: ANN search algorithms (HNSW, IVF, etc.) scale with dimensionality. Halving the dimensions can reduce search time by 30–50% in practice.
- Index build time: Building a vector index over 3072-dim vectors takes significantly longer than 256-dim vectors.
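To make the storage line item concrete, here is the back-of-the-envelope arithmetic (raw float32 vectors only; index structures such as HNSW graph links add overhead on top):

```python
# Raw vector storage: dims x 4 bytes (float32) x number of documents.
N_DOCS = 1_000_000

for dims in (3072, 1536, 512, 256, 128):
    gigabytes = dims * 4 * N_DOCS / 1e9
    print(f"{dims:>5} dims -> {gigabytes:5.2f} GB for {N_DOCS:,} vectors")
```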
Traditional embedding models gave you no choice — you used all the dimensions or none of them. Truncating arbitrarily destroyed quality because the information was spread unpredictably across all dimensions. MRL changes this.
What Matryoshka Representation Learning Is
MRL, introduced in 2022 by researchers at the University of Washington and Google Research and named after the Russian nesting dolls, is a training technique that produces embeddings with a nested structure. The first 128 dimensions encode the most important semantic information. The next 128 add more nuance. And so on, up to the full vector.
The key insight: any prefix of the vector is a valid, high-quality embedding on its own.
This means you can truncate text-embedding-3-large's 3072-dim output to 512, 256, or even 128 dimensions and still get strong retrieval — because those first dimensions were explicitly trained to capture the most critical semantic content.
This is fundamentally different from dimensionality reduction techniques like PCA, which require fitting a separate projection on your own data and applying it as an extra transformation step at query time; MRL truncation is just taking a prefix of the vector the model already produced.
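Because the prefix is itself the embedding, truncation is nothing more than slicing and re-normalizing. A minimal NumPy sketch (the random 3072-dim vector below is a stand-in for a full text-embedding-3-large output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` values of an MRL embedding and re-normalize.

    Re-normalizing keeps dot-product scores well-behaved; plain cosine
    similarity is unaffected by the rescaling.
    """
    prefix = vec[:dims]
    return prefix / np.linalg.norm(prefix)

# Stand-in for a full 3072-dim embedding returned by the model.
full = np.random.default_rng(0).normal(size=3072).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 256)
print(short.shape)  # (256,)
```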
How MRL Training Works
Standard embedding training optimizes a contrastive loss: push similar pairs together, push dissimilar pairs apart, at the full embedding dimension.
MRL modifies this by computing the loss at multiple dimension scales simultaneously. During each training step, the loss is computed at 64, 128, 256, 512, 1024, and 2048 dimensions (and the full dimension), then summed. The model is penalized if any of these truncated representations fails to separate similar from dissimilar pairs.
The result is that the model learns to front-load the most discriminative information into the earliest dimensions. By the time training converges, the embedding at any truncation point is nearly as good as the full vector for most retrieval tasks.
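In code, the modification is just the same contrastive loss summed over nested prefix lengths. A simplified PyTorch sketch using in-batch negatives and an InfoNCE-style objective (illustrative only, not the exact loss of any particular model):

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [64, 128, 256, 512, 1024, 2048, 3072]

def matryoshka_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Sum an InfoNCE-style loss over nested prefix lengths.

    query_emb, doc_emb: (batch, full_dim) tensors where row i of doc_emb is
    the positive passage for query i; other rows serve as in-batch negatives.
    """
    total = 0.0
    for d in MATRYOSHKA_DIMS:
        q = F.normalize(query_emb[:, :d], dim=-1)   # truncate, then re-normalize
        p = F.normalize(doc_emb[:, :d], dim=-1)
        logits = q @ p.T / temperature              # (batch, batch) similarities
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total
```

The model only escapes a large loss if every truncated prefix, not just the full vector, separates positives from negatives.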
Models That Use MRL
MRL is no longer a research curiosity — it's production-grade and available in the models you're probably already using:
- OpenAI text-embedding-3-small and text-embedding-3-large: Both natively support MRL via the `dimensions` parameter in the API. You can request any number of dimensions up to the model's maximum.
- Nomic Embed Text V2: Open-source, MoE architecture, MRL-trained, strong on MTEB benchmarks.
- Qwen3-Embedding series: Recent open-source models from Alibaba with native MRL support.
- Fine-tuned models via sentence-transformers: You can fine-tune with MRL yourself using the `MatryoshkaLoss` class in the sentence-transformers library (a minimal sketch follows this list).
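Here is roughly what that fine-tuning setup looks like; the base model, training pair, and hyperparameters below are placeholder choices, not a recommended recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Any encoder works as a starting point; this small model is just an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["find me pricing information",
                        "Our pricing starts at $29/month for the Starter plan."]),
    # ... more (query, relevant passage) pairs from your own data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Wrap an ordinary contrastive loss so it is also applied at truncated prefixes.
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, base_loss,
                                   matryoshka_dims=[384, 256, 128, 64])

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```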
Practical Experiment: Does Truncation Hurt Retrieval?
Consider a corpus of 50,000 web pages extracted via KnowledgeSDK's /v1/extract endpoint, chunked into ~512-token passages, and evaluated against a held-out set of queries like "find me pricing information."
| Dimensions | Recall@10 | Storage per 1M docs | Search latency (p95) |
|---|---|---|---|
| 3072 (full) | 94.1% | 12.3 GB | 48 ms |
| 1536 | 93.8% | 6.1 GB | 31 ms |
| 512 | 92.9% | 2.0 GB | 18 ms |
| 256 | 91.4% | 1.0 GB | 12 ms |
| 128 | 88.2% | 512 MB | 8 ms |
The quality degradation from 3072 to 512 is under 2 percentage points of recall. Storage drops by 6x. Latency drops by more than half.
At 256 dimensions you're giving up roughly 3 points of recall for a 12x storage reduction and 4x latency improvement. For most RAG applications, that's an excellent trade.
When to Truncate Aggressively
Latency-sensitive applications — chatbots, real-time search, any user-facing product where sub-20ms retrieval matters — benefit enormously from 256 or 512 dimensions. The quality loss is imperceptible in most use cases.
Cost-constrained deployments — if you're on a vector database tier priced by storage (Pinecone, Weaviate, Qdrant all have storage-based pricing), cutting dimensions from 3072 to 512 can reduce your bill by 6x with almost no quality loss.
Mobile and edge applications — running vector search on-device requires models and indexes that fit in limited memory. 128- or 256-dimensional embeddings are often the only viable option.
Large-scale prototyping — when experimenting with large corpora, start with 256-dim embeddings to iterate quickly, then optionally upgrade to full dimensions before going to production.
When to Use Full Dimensions
High-stakes retrieval — medical, legal, or financial applications where a missed document has real consequences. Even a 2% drop in recall might be unacceptable.
Low-volume corpora — if you have fewer than 100,000 documents, storage costs are negligible and you might as well use full dimensions.
Fine-grained semantic distinctions — tasks that require distinguishing between very similar documents (e.g., near-duplicate detection, precise terminology matching) tend to benefit from higher dimensionality.
How KnowledgeSDK Handles This
KnowledgeSDK's search infrastructure handles embedding optimization internally. When you call POST /v1/extract, the extracted content is chunked and embedded using a configuration tuned for web knowledge retrieval — balancing quality, latency, and storage for the kinds of documents you're likely to be working with: product pages, documentation, news articles, and company websites.
When you call POST /v1/search, the hybrid search layer (semantic + keyword) runs over this optimized index. You get fast, high-quality results without needing to manage embedding dimensions, index configuration, or ANN parameters yourself.
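For orientation, here is what those two calls look like as plain HTTP requests. The base URL, auth header, and request body fields below are illustrative assumptions for the sketch, not the documented schema; consult the API reference for the exact shape:

```python
import requests

BASE_URL = "https://api.knowledgesdk.com"             # illustrative base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}    # auth scheme assumed

# Extract and index a page (request fields are assumptions for illustration).
extract = requests.post(
    f"{BASE_URL}/v1/extract",
    headers=HEADERS,
    json={"url": "https://example.com/pricing"},
)
extract.raise_for_status()

# Hybrid (semantic + keyword) search over the indexed content.
search = requests.post(
    f"{BASE_URL}/v1/search",
    headers=HEADERS,
    json={"query": "find me pricing information"},
)
search.raise_for_status()
print(search.json())
```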
Code Example: MRL Embeddings with OpenAI
Here's how to use OpenAI's MRL embeddings with explicit dimension reduction in a simple Python RAG setup:
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts: list[str], dimensions: int = 512) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions,  # MRL truncation happens server-side
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Index your documents at 512 dims
documents = [
    "Our pricing starts at $29/month for the Starter plan.",
    "The Pro plan includes unlimited API calls and priority support.",
    "Enterprise pricing is available on request with custom SLAs.",
]
doc_embeddings = embed(documents, dimensions=512)

# Query at the same dimension
query = "find me pricing information"
query_embedding = embed([query], dimensions=512)[0]

# Rank by cosine similarity
scores = [
    (doc, cosine_similarity(query_embedding, emb))
    for doc, emb in zip(documents, doc_embeddings)
]
scores.sort(key=lambda x: x[1], reverse=True)

for doc, score in scores:
    print(f"{score:.3f} | {doc}")
```
The critical rule: index and query at the same dimension. If you indexed at 512, query at 512. If you later want a different dimension, re-embed your documents at that setting, or truncate and re-normalize stored full-dimension vectors to match.
The Bottom Line
MRL is one of the highest-leverage optimizations available to RAG developers in 2026. If you're using text-embedding-3-large at full 3072 dimensions and your use case isn't extremely high-stakes, you're almost certainly overspending on storage and accepting unnecessary search latency.
Start at 512 dimensions. Run your evaluation on a held-out set of queries. If recall is acceptable, ship it. If you need more quality, move to 1024. Almost no web knowledge RAG use case requires 3072.
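That evaluation does not need heavy tooling. A small harness along these lines is enough to make the call, assuming you have (query, relevant document) pairs from your own logs or annotations:

```python
import numpy as np

def recall_at_k(query_embs: np.ndarray,
                doc_embs: np.ndarray,
                relevant_doc_idx: list[int],
                k: int = 10) -> float:
    """Fraction of queries whose relevant document lands in the top-k.

    query_embs: (n_queries, dims) and doc_embs: (n_docs, dims), L2-normalized.
    relevant_doc_idx[i] is the index of the correct document for query i
    (one relevant document per query, for simplicity).
    """
    scores = query_embs @ doc_embs.T                  # cosine similarities
    topk = np.argsort(-scores, axis=1)[:, :k]         # top-k doc indices per query
    hits = [relevant_doc_idx[i] in topk[i] for i in range(len(relevant_doc_idx))]
    return float(np.mean(hits))

# Run this at 512 and 1024 dims (re-embed, or truncate and re-normalize
# stored full-dimension vectors) and compare before committing to a setting.
```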
The nesting dolls give you exactly the quality you need — no more, no less.