Vector Retrieval: From Embedding to Vector Databases
The previous article covered document splitting — the first step of the RAG pipeline. The split text chunks need to be converted into numeric vectors before they can be used by the retrieval system. This article covers the second and third steps: embedding, and vector storage/retrieval.
Embedding: Converting Text to Vectors
What Is Embedding
An Embedding model maps text into a high-dimensional vector space. Semantically similar text maps to vectors that are close together in the space.
"cat" → [0.23, -0.15, 0.87, ..., 0.42] (1536 dimensions)
"dog" → [0.21, -0.13, 0.85, ..., 0.40] (1536 dimensions) ← close to "cat"
"quantum computing" → [-0.56, 0.72, 0.03, ..., -0.91] (1536 dimensions) ← far from "cat"
Vector dimensions are typically 384, 768, 1024, or 1536, with some models going up to 3072. Higher dimensions can express richer semantic information, but also increase computation and storage costs.
How close two vectors are (typically measured by cosine similarity) represents the semantic similarity between the corresponding texts. A cosine similarity closer to 1 means the texts are more semantically similar; closer to 0 means they are less related.
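As a minimal sketch of the similarity measure described above, the following computes cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins; real embeddings have hundreds or thousands of dimensions, and the values here are not real model output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (illustrative values, not real embeddings)
cat = [0.23, -0.15, 0.87]
dog = [0.21, -0.13, 0.85]
quantum = [-0.56, 0.72, 0.03]

print(cosine_similarity(cat, dog))      # close to 1: nearly the same direction
print(cosine_similarity(cat, quantum))  # much lower (negative for these toy values)
```

In practice a vector database computes this for you; the point of the sketch is only that similarity depends on the angle between vectors, not on their length.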
Mainstream Embedding Model Comparison
| Model | Dimensions | Parameters | Characteristics | Price |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | - | OpenAI's strongest embedding model, strong MTEB scores | $0.13/1M tokens |
| text-embedding-3-small | 1536 | - | OpenAI's value option, good performance | $0.02/1M tokens |
| embed-v3 | 1024 | - | Cohere, supports compressed (int8/binary) embeddings | $0.10/1M tokens |
| BGE-large | 1024 | 326M | Open-source top choice, strong Chinese/English | Free |
| GTE-large | 1024 | 326M | Alibaba open-source, excellent for Chinese | Free |
| E5-large | 1024 | 326M | Microsoft open-source, strong multilingual support | Free |
How to Choose an Embedding Model
Choosing an Embedding model requires considering three factors:
Performance: Score on MTEB (Massive Text Embedding Benchmark). This benchmark covers 58 datasets including classification, clustering, retrieval, and semantic similarity tasks. Higher scores mean better average performance across various tasks.
Speed: How long it takes to generate one vector. For local models, this depends on your GPU; for API models, it depends on the provider's response time.
Cost: API models charge per token; local models require GPU resources.
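To make the cost factor concrete, here is a rough back-of-the-envelope estimator. The corpus size (100,000 chunks of about 500 tokens each) is a hypothetical assumption, the prices are the ones listed in the table above, and the storage figure covers raw float32 vectors only, without index overhead or metadata.

```python
def api_embedding_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """API cost in USD for embedding `total_tokens` tokens once."""
    return total_tokens / 1_000_000 * usd_per_million_tokens

def raw_vector_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Storage for the vectors alone (float32 = 4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value

# Hypothetical corpus: 100,000 chunks of about 500 tokens each
num_chunks = 100_000
tokens = num_chunks * 500

small_cost = api_embedding_cost(tokens, 0.02)  # text-embedding-3-small
large_cost = api_embedding_cost(tokens, 0.13)  # text-embedding-3-large
small_bytes = raw_vector_storage_bytes(num_chunks, 1536)
large_bytes = raw_vector_storage_bytes(num_chunks, 3072)

print(f"small: ${small_cost:.2f} to embed, {small_bytes / 1e6:.0f} MB of vectors")
print(f"large: ${large_cost:.2f} to embed, {large_bytes / 1e6:.0f} MB of vectors")
```

Under these assumptions the one-time API cost is small either way; the doubled storage and search cost of the larger vectors is usually the more lasting difference.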
Selection recommendations:
- Rapid prototyping: Use text-embedding-3-small — API calls, zero ops
- Chinese scenarios: BGE-large or GTE-large — open-source and free, strong Chinese performance
- Multilingual scenarios: E5-large — strong multilingual support
- Pursuing maximum performance: text-embedding-3-large — but expensive
- Limited resources: BGE-small or GTE-small — fewer parameters, faster
Using Embedding Models
# Option 1: OpenAI API
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("What is retrieval-augmented generation?")
vectors = embeddings.embed_documents(["Document 1", "Document 2", "Document 3"])
# Option 2: Local model (Sentence Transformers)
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vector = embeddings.embed_query("What is retrieval-augmented generation?")
# Option 3: Cohere API
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-english-v3.0")
vector = embeddings.embed_query("What is retrieval-augmented generation?")
Vector Databases
Once you have vectors, you need a place to store and retrieve them. That's what vector databases do.
Mainstream Vector Database Comparison
| Database | Type | Max Scale | Characteristics | Ops Cost |
|---|---|---|---|---|
| Chroma | Embedded | Millions | Zero-config, Python-native, ideal for prototypes | Very low |
| Milvus | Distributed | Billions | High performance, supports large-scale data | High |
| Pinecone | Cloud Service | Billions | Fully managed, no ops needed | Low (but pay-per-use) |
| Weaviate | Open-source | Hundreds of millions | Hybrid search, GraphQL interface | Medium |
| Qdrant | Open-source | Billions | High performance, Rust implementation | Medium |
| pgvector | PG Extension | Hundreds of millions | Leverages existing PostgreSQL infra | Low (if PG already exists) |
How to Choose a Vector Database
Prototyping: Chroma. Zero config, pip install and go, data stored on local disk.
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
Small-to-medium scale production: Qdrant or Weaviate. Open-source and free, excellent performance, supports hybrid search.
Large-scale production: Milvus or Pinecone. Milvus for self-hosted large-scale scenarios; Pinecone for teams that don't want to manage infrastructure.
Already have PostgreSQL: pgvector. No additional deployment needed; it adds vector search capabilities directly to your existing PostgreSQL instance.
# pgvector example
from langchain_community.vectorstores import PGVector
vectorstore = PGVector.from_documents(
documents=chunks,
embedding=embeddings,
connection_string="postgresql://user:pass@localhost:5432/rag_db"
)
Index Strategy
Retrieval speed in a vector database depends on the index type. Common indexes:
| Index Type | Principle | Speed | Precision | Use Case |
|---|---|---|---|---|
| Flat | Brute force — computes distance to every vector | Slow | 100% | Small datasets (<100K) |
| IVF | Cluster first, then search nearest clusters | Fast | 95%+ | Medium scale |
| HNSW | Hierarchical Navigable Small World graph | Very fast | 99%+ | Large scale, preferred |
| PQ | Product quantization — compresses vectors | Very fast | 90%+ | Memory-constrained scenarios |
Recommendation: Use HNSW for most scenarios. It offers the best balance between speed and precision. Chroma uses HNSW by default; Milvus and Qdrant also recommend HNSW.
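To show what the approximate indexes are trading against, here is a minimal sketch of Flat (brute-force) search in plain Python. The document ids and two-dimensional vectors are made up for illustration. Every query scores every stored vector, so cost grows linearly with the collection size; IVF and HNSW avoid this by visiting only a subset of vectors, at the price of occasionally missing a true neighbor.

```python
import heapq
import math

def flat_search(query, vectors, k=3):
    """Exact nearest-neighbor search: score every stored vector, keep the k best."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)  # list of (similarity, doc_id), best first

# Toy 2-dimensional "index" with hypothetical ids and values
vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [0.0, 1.0],
    "doc_d": [-1.0, 0.0],
}
top = flat_search([0.9, 0.1], vectors, k=2)
print([doc_id for _, doc_id in top])
```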
Hybrid Search
Pure vector search has blind spots. Consider these scenarios:
- "What is LCEL in LangChain?": the user explicitly mentions the technical term "LCEL," but vector search might return content about other aspects of LangChain.
- "The March 2026 product launch": the query contains an exact date, which vector search tends to match only loosely.
Hybrid search combines vector search and keyword search (BM25), complementing each other.
Why Hybrid Search Is Needed
| Dimension | Vector Search | Keyword Search (BM25) |
|---|---|---|
| Strengths | Understands semantics — "car" matches "automobile" | Exact matching — "LangChain" only matches "LangChain" |
| Weaknesses | Poor at matching technical terms, dates, proper nouns | Doesn't understand semantics — "car" and "automobile" don't recognize each other |
| Best for | Fuzzy queries, semantic queries | Exact queries, technical term queries |
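The keyword side of the comparison can be sketched with a small Okapi BM25 scorer. This is an illustrative implementation, not what any particular library does internally: it tokenizes on whitespace only, and the parameters k1=1.5 and b=0.75 are commonly used defaults that I am assuming here. The sample documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

docs = [
    "LCEL is the LangChain expression language for composing chains",
    "LangChain agents call tools in a loop",
    "A car is also called an automobile",
]
print(bm25_scores("LCEL", docs))     # only the first document gets a non-zero score
print(bm25_scores("vehicle", docs))  # no literal match anywhere, so every score is 0.0
```

The second query shows the weakness from the table: without a shared token, BM25 cannot connect "vehicle" to "car" or "automobile", which is exactly the gap vector search fills.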
Implementing Hybrid Search
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Hybrid retriever
ensemble_retriever = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.6, 0.4] # Vector search weight 0.6, BM25 weight 0.4
)
results = ensemble_retriever.invoke("What is LCEL in LangChain?")
Fusion Strategies
Results from multiple retrieval paths need to be fused. Common fusion methods:
Weighted fusion: Weight and sum results from different retrievers. Simple and direct, but weights need tuning.
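RRF (Reciprocal Rank Fusion): Rank-based fusion. Each retriever contributes a score of 1/(c + rank) for every result it returns, where c is a smoothing constant (typically 60), and the final ranking is by summed score. Because it works on ranks rather than raw scores, it needs little tuning and usually performs well.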
from langchain.retrievers import EnsembleRetriever
# RRF fusion (default)
ensemble_retriever = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.5, 0.5],
c=60 # RRF constant, typically 60
)
My recommendation: Start with RRF. If performance is unsatisfactory, try weighted fusion.
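For intuition, RRF is small enough to write by hand. The sketch below is a standalone illustration rather than LangChain's implementation: each retriever adds weight / (c + rank) to a document's score, with ranks starting at 1. The function name, the document ids, and the two result lists are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse ranked lists of document ids; each hit adds weight / (c + rank)."""
    if weights is None:
        weights = [1.0] * len(rankings)
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical results from the two retrievers, best match first
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents returned by both retrievers (doc_a, doc_c) rise above those returned by only one, even though no raw similarity or BM25 score was ever compared directly.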
Reranking: A Powerful Tool for Retrieval Improvement
Why Reranking Is Needed
Vector retrieval results are ranked by similarity, but not necessarily by relevance. Especially when queries are complex and there are many chunks, the top-k results from initial retrieval may contain chunks that "look similar but are actually irrelevant."
Reranking uses a more precise model to re-sort the initial retrieval results, pushing truly relevant chunks to the top.
Cross-Encoder
Reranking typically uses a Cross-Encoder. Unlike a Bi-Encoder (i.e., an Embedding model), which encodes each text separately, a Cross-Encoder takes a pair of texts (query + document) as input and outputs a relevance score directly.
| Dimension | Bi-Encoder (Embedding) | Cross-Encoder (Reranker) |
|---|---|---|
| Input | Single text | Text pair (query + document) |
| Output | Vector | Relevance score |
| Speed | Fast (pre-computable) | Slow (inference required each time) |
| Precision | Lower | Higher |
| Use Case | Initial retrieval | Re-ranking |
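The division of labor in the table can be sketched as a two-stage pipeline: a cheap scorer narrows the corpus to a candidate set, then an expensive scorer re-sorts only those candidates. The two scoring functions below are toy stand-ins I made up for illustration (word overlap plays the Bi-Encoder, an exact-phrase check plays the Cross-Encoder); a real system would call actual models.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    docs: list[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    recall_k: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Stage 1: keep recall_k candidates by cheap score. Stage 2: re-sort them by precise score."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]

# Toy stand-ins (hypothetical, for illustration only)
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_match(query: str, doc: str) -> float:
    return 1.0 if query.lower() in doc.lower() else 0.0

docs = [
    "search vector index tuning",
    "notes on vector search latency",
    "cooking pasta at home",
]
result = retrieve_then_rerank("vector search", docs, word_overlap, phrase_match, recall_k=2, top_n=1)
print(result)
```

Both of the first two documents look equally good to the cheap scorer; only the second stage, which sees query and document together, can tell them apart. The expensive scorer runs on recall_k documents rather than the whole corpus, which is what keeps the added latency bounded.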
Implementing Reranking
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Base retriever (vector search)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20}) # Recall more first
# Reranker (HuggingFaceCrossEncoder wraps a sentence-transformers cross-encoder for LangChain)
cross_encoder = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)
# Compression retriever: retrieve 20, then rerank to top 5
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=base_retriever
)
results = compression_retriever.invoke("What is LCEL in LangChain?")
Reranking Effectiveness
Reranking can significantly improve retrieval performance. Based on my experience:
- Top-5 accuracy with pure vector search: assume 70%
- Top-5 accuracy after reranking: typically improves to 85-90%
- Cost: an additional 100-500ms of latency (depending on document count)
Recommendation: If you require high retrieval precision, reranking is a worthwhile investment. If it's just an internal tool, pure vector search might be sufficient.
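Rather than assuming a baseline, you can measure it on a small labelled set. The sketch below interprets "top-5 accuracy" as hit rate at k (the fraction of queries with at least one relevant chunk in the top k), which is my assumption about the metric. The chunk ids and relevance labels are hypothetical.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries whose top-k results contain at least one relevant chunk id."""
    if len(retrieved) != len(relevant):
        raise ValueError("need one relevance set per query")
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if any(doc_id in gold for doc_id in ranked[:k])
    )
    return hits / len(retrieved)

# Hypothetical evaluation set: ranked chunk ids per query, plus hand-labelled relevant ids
retrieved = [["c1", "c7", "c3"], ["c2", "c9", "c4"], ["c5", "c6", "c8"], ["c4", "c1", "c2"]]
relevant = [{"c3"}, {"c8"}, {"c5"}, {"c1", "c9"}]

print(hit_rate_at_k(retrieved, relevant, k=2))  # 0.5
print(hit_rate_at_k(retrieved, relevant, k=3))  # 0.75
```

Running the same function on results before and after reranking gives a direct, project-specific answer to whether the extra latency is paying for itself.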
Vector Retrieval in Practice: Claude Code MagmaAdapter
Consider the Claude Code memory system again. Its vector retrieval implementation is quite basic: pure vector search, no hybrid search, no reranking.
// MagmaAdapter core retrieval logic (simplified excerpt)
async readLanceDB(query: string, limit: number = 10, layer?: string) {
  const queryVector = await this.embedQuery(query, lancedb)
  const results = [] // merged hits across all tables
  for (const tableName of await db.tableNames()) {
    const table = await db.openTable(tableName)
    const rows = await table
      .vectorSearch(queryVector)
      .limit(limit)
      .distanceType('COSINE')
      .toArray()
    for (const entry of rows) {
      // key and value are derived from entry (extraction elided in this excerpt)
      results.push({
        key, value,
        score: 1 - entry._distance // Convert cosine distance to similarity
      })
    }
  }
  // Merge hits from every table, best score first, and keep the top `limit`
  return results.sort((a, b) => b.score - a.score).slice(0, limit)
}
Characteristics of this implementation:
- Pure vector search: Only uses LanceDB's vectorSearch
- COSINE distance: Standard cosine similarity
- Sort by score: Simplest ranking approach
- No hybrid search: No BM25
- No reranking: No Cross-Encoder
Why is it so basic? Because the memory system's data volume is small (tens to hundreds of entries), and each memory is human-written, well-formatted text. In this scenario, basic vector search is sufficient.
But for large-scale RAG systems (thousands to millions of chunks), pure vector search alone is not enough. You need hybrid search to improve recall, and reranking to improve precision.
Common Pitfalls in Vector Retrieval
Pitfall 1: Embedding Model Mismatch
Symptoms: Retrieved results have low semantic relevance to the question.
Cause: The Embedding model doesn't correctly understand your data type. For example, using an English model for Chinese documents, or a general-purpose model for domain-specific documents.
Solution: Switch to a more suitable Embedding model.
Pitfall 2: Chunks Too Large, Degrading Vector Quality
Symptoms: Retrieval results contain large amounts of irrelevant content.
Cause: Chunks are too large, containing multiple topics within one chunk. The vector is "diluted."
Solution: Reduce chunk_size, or switch to a more precise splitting strategy.
Pitfall 3: Using Only Vector Search
Symptoms: Exact keyword matching performance is poor.
Cause: Vector search is not good at matching technical terms, dates, or proper nouns.
Solution: Add BM25 keyword search and implement hybrid search.
Pitfall 4: Insufficient Retrieval Precision
Symptoms: Top-k results contain some less relevant chunks.
Cause: Initial retrieval precision is insufficient.
Solution: Add reranking. First recall more results (top-20 or top-50), then use a Cross-Encoder to re-rank and take the top 5.
Pitfall 5: Vector Database Performance Issues
Symptoms: Retrieval latency is too high.
Cause: Data volume is too large, or the index strategy is incorrect.
Solution:
- Check index type (HNSW recommended)
- Adjust index parameters (HNSW's M and efConstruction)
- Consider sharding or distributed deployment
Summary
Vector retrieval is the core step of the RAG pipeline.
Key takeaways:
- Choose Embedding models based on data type — BGE/GTE for Chinese, E5 for multilingual
- Choose vector databases based on scale — Chroma for prototyping, Qdrant/Milvus/Pinecone for production
- Hybrid search is a production standard — vector + BM25, RRF fusion
- Reranking is a powerful tool for quality improvement — recall more first, then re-rank
The next article covers post-retrieval optimization — after getting retrieval results, how to further improve RAG performance.
