What Is an Embedding?
An embedding is a dense vector — an ordered list of floating-point numbers — that represents a piece of content (text, image, audio) in a high-dimensional geometric space. Points that are close together in that space have similar meaning; points far apart are semantically different.
For example, the sentences "How do I cancel my plan?" and "I want to unsubscribe" will produce vectors that are very close to each other, even though they share no words.
How Embeddings Are Generated
An embedding model (typically a transformer) maps input text to a fixed-size vector:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I cancel my subscription?",
)
vector = response.data[0].embedding  # list of 1536 floats
```
Common embedding models:
| Model | Dimensions | Provider |
|---|---|---|
| text-embedding-3-small | 1536 | OpenAI |
| text-embedding-3-large | 3072 | OpenAI |
| embed-english-v3.0 | 1024 | Cohere |
| all-MiniLM-L6-v2 | 384 | Sentence Transformers |
Why Dimensionality Matters
Higher dimensions generally capture more nuance but cost more to store and query. Storing 1536-dimensional float32 embeddings for 1 million chunks takes roughly 6 GB. Many systems use quantization (int8 or binary) to cut this by 4–32x with minimal accuracy loss.
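As a sanity check on the storage figure above, a short sketch. The int8 step is a simplified symmetric quantization scheme, not what any particular vector database ships:

```python
import numpy as np

n_chunks = 1_000_000
dims = 1536

# float32 storage: 4 bytes per dimension
float32_bytes = n_chunks * dims * 4
print(f"{float32_bytes / 1e9:.1f} GB")  # 6.1 GB

# Naive symmetric int8 quantization of one vector: scale so the
# largest component maps to 127, then round. Production systems
# usually calibrate scales more carefully.
vec = np.random.default_rng(0).normal(size=dims).astype(np.float32)
scale = np.abs(vec).max() / 127
quantized = np.round(vec / scale).astype(np.int8)  # 4x smaller than float32
restored = quantized.astype(np.float32) * scale    # approximate original
```

Binary quantization (one bit per dimension) pushes the saving to 32x, at a larger accuracy cost that re-ranking can partially recover.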
Embedding Properties
- Directionality — the angle between vectors encodes semantic similarity (see cosine similarity)
- Compositionality — related concepts cluster together; analogies can sometimes be solved by vector arithmetic
- Model-specificity — vectors from different models are not comparable; always use the same model for indexing and querying
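The directionality property is what cosine similarity measures. A minimal illustration with hand-picked toy vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real model output.
cancel = np.array([0.9, 0.1, 0.2])
unsubscribe = np.array([0.8, 0.2, 0.25])
weather = np.array([-0.1, 0.9, -0.4])

print(cosine_similarity(cancel, unsubscribe))  # close to 1.0
print(cosine_similarity(cancel, weather))      # near 0 or negative
```

Because the angle, not the magnitude, carries the signal, many systems normalize vectors to unit length so cosine similarity reduces to a plain dot product.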
Bi-Encoders vs Cross-Encoders
- Bi-encoder — encodes query and document independently into vectors, then compares. Fast but less accurate. Used for initial retrieval.
- Cross-encoder — processes query and document together, producing a relevance score. Slower but more accurate. Used for re-ranking.
Most RAG pipelines use a bi-encoder for retrieval and a cross-encoder for re-ranking.
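The two-stage pattern can be sketched end to end. `bi_encode` and `cross_score` below are toy stand-ins (word hashing and word overlap) for real bi-encoder and cross-encoder models:

```python
import numpy as np

def _bucket(word: str) -> int:
    """Deterministic word-to-bucket mapping for the toy encoder."""
    return sum(ord(c) for c in word) % 64

def bi_encode(text: str) -> np.ndarray:
    """Toy bi-encoder: hash words into a 64-d bag-of-words vector."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[_bucket(word)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cross_score(query: str, doc: str) -> float:
    """Toy cross-encoder: Jaccard word overlap between query and doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

docs = [
    "How to cancel a subscription",
    "Resetting your password",
    "Billing and refunds",
]
query = "cancel my subscription"

# Stage 1: fast bi-encoder retrieval -- embed query and documents
# independently, compare with dot products, keep the top-k candidates.
doc_vecs = [bi_encode(d) for d in docs]
q_vec = bi_encode(query)
candidates = sorted(range(len(docs)), key=lambda i: -float(doc_vecs[i] @ q_vec))[:2]

# Stage 2: slower cross-encoder re-ranking over just the candidates.
reranked = sorted(candidates, key=lambda i: -cross_score(query, docs[i]))
print(docs[reranked[0]])
```

The key design point is that stage 1 lets document vectors be precomputed and indexed, while stage 2 only pays the expensive joint scoring for a handful of candidates.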
Embeddings in KnowledgeSDK
When you call POST /v1/extract on a URL, KnowledgeSDK automatically:
- Scrapes and cleans the page content
- Splits it into chunks
- Embeds each chunk using a high-quality embedding model
- Stores the vectors in your dedicated Typesense collection
When you call POST /v1/search, your query is embedded with the same model and compared against stored chunk vectors. You get back the most semantically relevant passages without writing a single line of embedding code.
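As a sketch of calling these endpoints: the paths `/v1/extract` and `/v1/search` come from this page, while the base URL, auth header, and payload field names are illustrative assumptions to check against the API reference:

```python
import json

# Hypothetical base URL -- replace with the real KnowledgeSDK endpoint.
BASE_URL = "https://api.example.com"

def extract_request(url: str) -> dict:
    """Build a request for POST /v1/extract (field names are illustrative)."""
    return {"method": "POST", "url": f"{BASE_URL}/v1/extract", "json": {"url": url}}

def search_request(query: str, limit: int = 5) -> dict:
    """Build a request for POST /v1/search (field names are illustrative)."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/v1/search",
        "json": {"query": query, "limit": limit},
    }

req = search_request("How do I cancel my subscription?")
print(json.dumps(req, indent=2))
# Send with e.g. requests.request(**req, headers={"Authorization": "Bearer <key>"})
```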
Practical Tips
- Always use the same model for indexing and querying — mixing models produces garbage results
- Embed at the right granularity — too short (single sentence) loses context; too long (full page) dilutes specificity
- Re-embed after model upgrades — new model versions produce incompatible vector spaces
- Cache embeddings — embedding is the most expensive step; cache results for repeated content