Every RAG tutorial jumps straight to chunking strategy, prompt templates, and LLM selection. The embedding model — the component that determines whether your retrieval even finds the right documents — gets a footnote, if that.
This is backwards. Your embedding model is the foundation of your retrieval system. A weak embedding model means weak retrieval. Weak retrieval means the LLM generates answers from the wrong context. No amount of prompt engineering fixes bad retrieval.
In 2026, the embedding model landscape has changed significantly. Open-source models now match or exceed closed-source alternatives on benchmarks. Multimodal embeddings are becoming production-ready. And new architectures like MoE embedding models are pushing quality while keeping inference costs manageable.
Here's the complete guide.
Why the MTEB Benchmark Is the Standard
The Massive Text Embedding Benchmark (MTEB) is the most comprehensive evaluation framework for text embedding models. It covers 58 datasets across 8 task types:
- Retrieval — the task most relevant for RAG (finding relevant documents)
- Classification — assigning labels to text
- Clustering — grouping similar texts
- Pair classification — determining if two texts are semantically equivalent
- Reranking — ordering a list of results by relevance
- STS (Semantic Textual Similarity) — scoring similarity between pairs
- Summarization — embedding quality for summaries
- Bitext mining — cross-lingual alignment
For RAG use cases, the Retrieval task scores are most predictive of real-world performance. The overall MTEB score averages all task types, which can be misleading — a model optimized for STS might rank high overall while underperforming on retrieval specifically.
When evaluating models, look at the Retrieval-specific scores alongside the overall MTEB score.
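To make that concrete, here's a toy comparison showing how upweighting Retrieval can flip a model ranking. The scores and model names are hypothetical, not real leaderboard numbers:

```python
# Illustrative per-task scores for two hypothetical candidate models
# (not real MTEB leaderboard values).
scores = {
    "model_a": {"Retrieval": 55.0, "STS": 82.0, "Classification": 75.0},
    "model_b": {"Retrieval": 60.0, "STS": 76.0, "Classification": 73.0},
}

def weighted_score(task_scores, weights):
    """Average task scores, weighting each task (default weight 1.0)."""
    total = sum(weights.get(t, 1.0) * s for t, s in task_scores.items())
    norm = sum(weights.get(t, 1.0) for t in task_scores)
    return total / norm

# Plain average favors model_a; a RAG-oriented weighting that triples
# Retrieval favors model_b instead.
rag_weights = {"Retrieval": 3.0}
for name, task_scores in scores.items():
    plain = weighted_score(task_scores, {})
    rag = weighted_score(task_scores, rag_weights)
    print(f"{name}: plain={plain:.2f} rag-weighted={rag:.2f}")
```

The exact weights don't matter; the point is that two averages over the same per-task scores can disagree about which model to ship.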
2026 Leaderboard: Top 10 Embedding Models
| Rank | Model | Type | Reported score | License |
|---|---|---|---|---|
| 1 | Gemini Embedding 2 | Closed API | ~1605 ELO | Proprietary |
| 2 | Qwen3-Embedding-8B | Open | ~70.58 | Apache 2.0 |
| 3 | text-embedding-3-large | Closed API | Strong general | Proprietary |
| 4 | BGE-M3 | Open | 63.0 | MIT |
| 5 | Nomic Embed Text V2 | Open | Strong | Apache 2.0 |
| 6 | e5-mistral-7b-instruct | Open | Strong | MIT |
| 7 | text-embedding-3-small | Closed API | Good | Proprietary |
| 8 | EmbeddingGemma-300M | Open | Solid | Apache 2.0 |
| 9 | gte-large | Open | 63.1 | MIT |
| 10 | UAE-Large-V1 | Open | 64.6 | MIT |
Gemini Embedding 2 sits at the top of the ELO rankings and is notably multimodal — it can embed text alongside images and structured data. For document-heavy pipelines that include screenshots, PDFs with charts, or mixed-media content, it's the strongest option. The downside is vendor lock-in and the fact that it's closed-source.
Qwen3-Embedding-8B from Alibaba is the open-source surprise of 2025-2026. It closes the gap with the best closed models significantly, supports MRL (Matryoshka Representation Learning, which lets you truncate embeddings to a smaller dimension at inference time), and has strong multilingual performance. If you're self-hosting, this is the model to start with.
text-embedding-3-large remains excellent for general-purpose RAG and has native MRL support via the dimensions API parameter. It's the safest default for teams already in the OpenAI ecosystem.
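A minimal sketch of what MRL truncation means in practice: keep the first k dimensions and L2-renormalize. The OpenAI `dimensions` parameter does this server-side; for a self-hosted MRL-trained model you can do it client-side. Note this only preserves quality for models trained with MRL — truncating an ordinary embedding this way degrades it badly.

```python
import numpy as np

def truncate_mrl(embedding, dims):
    """Keep the first `dims` dimensions and L2-renormalize.

    MRL-trained models are optimized so these prefixes remain usable
    embeddings; cosine similarities on the truncated vectors approximate
    those of the full vectors.
    """
    v = np.asarray(embedding, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

# Stand-in for a full-size embedding (3072 dims, the size of
# text-embedding-3-large's default output):
full = np.random.default_rng(0).normal(size=3072)
short = truncate_mrl(full, 256)  # 12x smaller index footprint
print(short.shape)               # (256,)
```

The practical payoff is vector-store cost: a 256-dim index is ~12x smaller and faster to search than a 3072-dim one, at a modest retrieval-quality cost.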
BGE-M3 (BAAI General Embedding) is the go-to for multilingual use cases. It was trained on 100+ languages, supports hybrid dense-sparse retrieval natively, and runs efficiently on a single A10 GPU.
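Hybrid dense-sparse retrieval means fusing two scores per document: a dense cosine similarity and a sparse lexical-weight dot product. One common fusion is a weighted sum; the weights and all scores below are illustrative, not tuned values from the BGE-M3 paper:

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product over shared tokens of two lexical-weight dicts,
    the kind of sparse representation BGE-M3 emits per text."""
    return sum(w * doc_weights.get(tok, 0.0) for tok, w in query_weights.items())

def hybrid_score(dense_score, sparse_score, w_dense=0.6, w_sparse=0.4):
    """Fuse dense (semantic) and sparse (lexical) relevance scores.
    The 0.6/0.4 split is illustrative — tune it on your own data."""
    return w_dense * dense_score + w_sparse * sparse_score

# Hypothetical scores for one query against one passage:
dense = 0.72                                # cosine similarity of dense vectors
q_sparse = {"embedding": 1.3, "rag": 0.9}   # token -> learned weight
d_sparse = {"embedding": 1.1, "model": 0.7}
print(round(hybrid_score(dense, sparse_dot(q_sparse, d_sparse)), 3))
```

The sparse side catches exact-term matches (product names, error codes) that dense vectors sometimes blur; the dense side catches paraphrases the sparse side misses.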
Nomic Embed Text V2 takes an unusual approach — it uses a Mixture-of-Experts (MoE) architecture for the embedding model itself, achieving high quality while keeping active parameter count low. Strong MTEB scores and fully open weights.
e5-mistral-7b-instruct uses a Mistral 7B base model fine-tuned for embedding. The instruction-following approach means it benefits from explicit task prefixes ("Represent this document for retrieval:"). Strong on out-of-domain retrieval.
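For the e5 family specifically, the convention documented on the model card is to prefix the *query* with a task instruction while embedding documents as-is. A tiny helper, assuming that convention:

```python
def format_e5_query(task: str, query: str) -> str:
    """Build the instruction-prefixed query string used by
    e5-mistral-7b-instruct (per the model-card convention);
    documents are embedded without any prefix."""
    return f"Instruct: {task}\nQuery: {query}"

q = format_e5_query(
    "Given a web search query, retrieve relevant passages that answer the query",
    "which embedding model supports 100+ languages?",
)
print(q)
```

Getting the prefix wrong (or omitting it) with instruction-tuned embedders silently costs retrieval quality, so it's worth wrapping in a helper rather than inlining strings.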
text-embedding-3-small is OpenAI's cost-optimized model. At $0.02 per million tokens, it's roughly 6.5x cheaper than text-embedding-3-large while achieving retrieval quality that's sufficient for the majority of RAG use cases. The first model to evaluate for cost-sensitive applications.
EmbeddingGemma-300M is Google's tiny, on-device embedding model. At 300M parameters it runs comfortably on mobile hardware and achieves respectable retrieval quality — not competitive with the top models, but remarkable for its size class.
Decision Framework
| Use Case | Recommended Model | Reason |
|---|---|---|
| General RAG (cost priority) | text-embedding-3-small | Cheap, reliable, OpenAI ecosystem |
| General RAG (quality priority) | Qwen3-Embedding-8B | Best open model, free to self-host |
| Multilingual knowledge base | BGE-M3 | 100+ language support, hybrid retrieval |
| Privacy / on-premises | Qwen3-8B or BGE-M3 | Open weights, run on your own infra |
| On-device / mobile | EmbeddingGemma-300M | 300M params, fast on device |
| Code search | gte-large or CodeBERT | Code-aware training |
| Long documents (>512 tokens) | e5-mistral-7b or BGE-M3 | Extended context windows |
| Multimodal content | Gemini Embedding 2 | Text + image embedding |
| High-stakes retrieval | text-embedding-3-large | Maximum quality, MRL support |
| Self-hosted multilingual | BGE-M3 | Best open multilingual model |
Pricing Comparison
For closed-source hosted models:
| Model | Price per 1M tokens |
|---|---|
| text-embedding-3-small | $0.02 |
| text-embedding-3-large | $0.13 |
| Gemini Embedding 2 | ~$0.025 (varies by tier) |
For open-source models, the cost is the infrastructure to run them. An open model of comparable quality to text-embedding-3-small (e.g., gte-large) runs on a single A10G GPU at roughly $1–2/hour on cloud providers, or about $700–1,500/month if kept always-on. Against text-embedding-3-large pricing, an always-on GPU only breaks even at several billion tokens per month — the economics shift quickly, though, if you run the GPU on demand for batch indexing jobs, share it with other workloads, or need self-hosting anyway for latency or privacy reasons.
At low volumes (<10M tokens/month), the OpenAI API is almost always cheaper once you account for the engineering time to deploy and maintain your own model server.
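A quick break-even sketch, using the API prices quoted above and an assumed $1.50/hour A10G kept always-on (~730 hours/month):

```python
def break_even_tokens(gpu_hourly_rate, price_per_million, hours_per_month=730):
    """Monthly token volume at which an always-on GPU costs the same
    as the hosted embedding API."""
    gpu_monthly = gpu_hourly_rate * hours_per_month
    return gpu_monthly / price_per_million * 1_000_000

# Against text-embedding-3-large ($0.13 per 1M tokens):
print(f"{break_even_tokens(1.50, 0.13):,.0f} tokens/month")
# Against text-embedding-3-small ($0.02 per 1M tokens):
print(f"{break_even_tokens(1.50, 0.02):,.0f} tokens/month")
```

Plug in your own GPU rate and utilization; an on-demand GPU used a few hours a day for batch indexing changes the answer by an order of magnitude.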
Latency Considerations
Hosted API latency is predictable but non-trivial — you're making a network call on every query:
- text-embedding-3-small: ~50–80ms p95 round trip
- text-embedding-3-large: ~80–120ms p95 round trip
Self-hosted on GPU:
- BGE-M3 (single A10G): ~5–15ms per batch of 32 passages
- Qwen3-8B: ~20–40ms per batch of 32 passages
For latency-sensitive retrieval (sub-100ms end-to-end), self-hosted small models or the OpenAI small model with connection pooling are the right choices.
Practical Recommendation for Web Knowledge Bases
For the KnowledgeSDK use case — extracting and searching web content (company pages, documentation, product sites, news) — the decision usually comes down to:
If you're building on top of KnowledgeSDK's API: the embedding layer is managed for you. POST /v1/search handles the retrieval; you don't need to pick or host an embedding model.
If you're building your own embedding layer:
- text-embedding-3-small for teams that want the simplest possible setup. It handles English-dominant web content well, and the OpenAI API requires no infrastructure.
- BGE-M3 for teams handling multi-language web content or with privacy constraints that prevent sending content to third-party APIs. Run it on a single GPU, expose an embedding endpoint, and you control everything.
The one scenario where you should use text-embedding-3-large or Qwen3-8B over small models: when your documents are dense with technical terminology, acronyms, or domain-specific language that smaller models may conflate. Product documentation for complex B2B software is the clearest example.
How to Evaluate Before You Commit
The MTEB benchmark is a proxy. Your actual retrieval quality depends on your specific domain, your query patterns, and your document types.
Before committing to an embedding model at scale, run a lightweight evaluation:
- Collect 100–200 representative queries from your target use case.
- Annotate which documents are relevant for each (manually or via LLM-assisted labeling).
- Embed your corpus with each candidate model.
- Measure Recall@5 and Recall@10 for each model on your query set.
- Pick the model that maximizes recall on your data, not on MTEB.
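Steps 4–5 reduce to a few lines of code. A sketch with toy rankings and relevance labels (both hypothetical — yours come from the candidate model and your annotations):

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_doc_ids))
    return hits / len(relevant_doc_ids)

# Per-query rankings produced by one candidate model (toy data):
rankings = {
    "q1": ["d3", "d7", "d1", "d9", "d6", "d8", "d2"],
    "q2": ["d4", "d8", "d0", "d6", "d5"],
}
# Hand-labeled relevant docs per query:
relevant = {"q1": ["d1", "d2"], "q2": ["d5"]}

for k in (5, 10):
    avg = sum(recall_at_k(rankings[q], relevant[q], k) for q in rankings) / len(rankings)
    print(f"Recall@{k}: {avg:.2f}")
```

Run the same loop once per candidate model over the same query set, and the comparison is apples-to-apples on your data.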
This takes a few hours and can save you from a painful migration later. The BEIR benchmark framework has ready-made tooling for this evaluation loop and works with any embedding model via a standard interface.
The best embedding model is the one that retrieves the right documents for your users — not the one with the highest number on a leaderboard.