What Is Cosine Similarity?
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It is the most commonly used metric for comparing text embeddings in semantic search and RAG pipelines.
A cosine similarity of 1.0 means the vectors point in exactly the same direction (identical meaning). A value of 0.0 means they are orthogonal (unrelated). A value of -1.0 means they point in opposite directions (though in practice, text embeddings rarely produce negative similarities).
The Formula
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product of vectors A and B
- ||A|| and ||B|| are the L2 norms (magnitudes) of the vectors
In Python:
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
query_vec = embed("how do I cancel?")
doc_vec = embed("steps to unsubscribe from the plan")
score = cosine_similarity(query_vec, doc_vec)
# score ≈ 0.89 (highly similar)
```
Why Cosine Over Euclidean Distance?
Cosine similarity ignores vector magnitude and measures only direction. This matters because:
- Two documents about the same topic but different lengths will have embeddings in the same direction but different magnitudes
- Cosine similarity correctly identifies them as similar
- Euclidean distance penalizes the magnitude difference, so it can rate them as far apart even though their directions match
For normalized vectors (unit length), cosine similarity and dot product are equivalent.
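The contrast can be illustrated with a toy example: scaling a vector leaves its cosine score against the original unchanged, while the Euclidean distance grows with the scale factor. This is a minimal sketch reusing the `cosine_similarity` definition from above:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short = np.array([1.0, 2.0, 3.0])
scaled = 10 * short  # same direction, 10x the magnitude

print(cosine_similarity(short, scaled))  # ≈ 1.0: identical direction
print(np.linalg.norm(short - scaled))    # ≈ 33.67: large Euclidean distance
```

The two vectors are "about the same thing" directionally, yet Euclidean distance separates them purely because one is longer.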
Typical Score Ranges for Text Embeddings
| Score Range | Interpretation |
|---|---|
| 0.90 – 1.00 | Near-duplicate or paraphrase |
| 0.75 – 0.90 | Highly relevant, same topic |
| 0.60 – 0.75 | Related, some overlap |
| 0.40 – 0.60 | Weakly related |
| < 0.40 | Likely unrelated |
These thresholds vary by embedding model — always calibrate against your own data.
Cosine Distance
Cosine distance is defined as:
cosine_distance = 1 - cosine_similarity
It converts similarity (higher = better) into distance (lower = better), which some ANN libraries require. Be aware that many vector databases expose a parameter to choose between similarity and distance — make sure your configuration matches your expectations.
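The conversion itself is a one-liner; a minimal sketch building on the `cosine_similarity` function defined earlier:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_distance(a, b))  # 1.0 for orthogonal vectors
```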
Normalization and Dot Product
If you normalize all vectors to unit length before storing them:
```python
def normalize(v):
    return v / np.linalg.norm(v)

normalized = normalize(embed("some text"))
```
Then cosine similarity equals the plain dot product. This allows some vector databases to use SIMD-optimized dot product operations instead of the full cosine formula, improving query throughput significantly.
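A quick numerical check of this equivalence, as a sketch using random vectors in place of real embeddings (the 384-dimensional size is just illustrative):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Full cosine formula on the raw vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Plain dot product on the pre-normalized vectors
dot = np.dot(normalize(a), normalize(b))

print(np.isclose(cosine, dot))  # True
```

Normalizing once at index time moves the division out of the query path, which is why the dot-product fast path pays off.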
Cosine Similarity in KnowledgeSDK
When you call POST /v1/search, KnowledgeSDK embeds your query and computes cosine similarity between the query vector and indexed chunk vectors using an HNSW approximate nearest-neighbor index. The returned results include a relevance score that reflects this similarity. You can use the score to filter out low-confidence matches before passing chunks to your LLM:
```json
{
  "results": [
    { "content": "...", "score": 0.92, "source": "https://docs.example.com/billing" },
    { "content": "...", "score": 0.78, "source": "https://docs.example.com/faq" }
  ]
}
```
A common practice is to discard any result below a threshold (e.g., 0.70) to avoid injecting irrelevant context into LLM prompts.
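That filtering step can be sketched as a list comprehension. The `results` shape mirrors the example response above, and the 0.70 threshold and the third entry's source URL are illustrative, not prescribed by the API:

```python
# Illustrative threshold; calibrate per embedding model.
THRESHOLD = 0.70

results = [
    {"content": "...", "score": 0.92, "source": "https://docs.example.com/billing"},
    {"content": "...", "score": 0.78, "source": "https://docs.example.com/faq"},
    {"content": "...", "score": 0.55, "source": "https://docs.example.com/blog"},
]

# Keep only chunks confident enough to go into the LLM prompt
kept = [r for r in results if r["score"] >= THRESHOLD]
print([r["score"] for r in kept])  # [0.92, 0.78]
```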