Vector Retrieval: From Embedding to Vector Databases

The previous article covered document splitting — the first step of the RAG pipeline. The split text chunks need to be converted into numeric vectors before they can be used by the retrieval system. This article covers the second and third steps: embedding, and vector storage/retrieval.

Embedding: Converting Text to Vectors

What Is Embedding

An Embedding model maps text into a high-dimensional vector space. Semantically similar text maps to vectors that are close together in the space.

"cat" → [0.23, -0.15, 0.87, ..., 0.42]  (1536 dimensions)
"dog" → [0.21, -0.13, 0.85, ..., 0.40]  (1536 dimensions)  ← close to "cat"
"quantum computing" → [-0.56, 0.72, 0.03, ..., -0.91]  (1536 dimensions)  ← far from "cat"

Vector dimensions are typically 384, 768, 1024, or 1536. Higher dimensions can express richer semantic information, but also increase computation and storage costs.
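To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes vectors are stored as uncompressed float32 values (4 bytes each) and ignores index overhead and metadata; the function name `index_size_bytes` is made up for this illustration.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dims * bytes_per_value


# Compare the common dimensions for a corpus of one million chunks.
for dims in (384, 768, 1024, 1536):
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```

Real deployments will differ, since indexes such as HNSW add graph overhead and quantization can shrink the vectors, but the linear growth with dimension holds.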

The closeness of two vectors (typically measured with cosine similarity) represents the semantic similarity between the corresponding texts. Cosine similarity ranges from -1 to 1: a value close to 1 means the texts are semantically similar, and a value close to 0 means they are largely unrelated.
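As a minimal sketch of how that similarity is computed, the function below implements cosine similarity in plain Python. The 4-dimensional vectors are made-up toy values standing in for real embeddings, which would have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real model output).
cat = [0.23, -0.15, 0.87, 0.42]
dog = [0.21, -0.13, 0.85, 0.40]
quantum = [-0.56, 0.72, 0.03, -0.91]

print("cat vs dog:    ", round(cosine_similarity(cat, dog), 3))
print("cat vs quantum:", round(cosine_similarity(cat, quantum), 3))
```

In practice a vector database computes this for you; the sketch is only meant to show what the score means.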

Mainstream Embedding Model Comparison

Model	Dimensions	Parameters	Characteristics	Price
text-embedding-3-large	3072	-	OpenAI's strongest, MTEB benchmark leader	$0.13/1M tokens
text-embedding-3-small	1536	-	OpenAI's value option, good performance	$0.02/1M tokens
embed-v3	1024	-	Cohere, supports compressed dimensions	$0.10/1M tokens
BGE-large	1024	326M	Open-source top choice, strong Chinese/English	Free
GTE-large	1024	326M	Alibaba open-source, excellent for Chinese	Free
E5-large	1024	326M	Microsoft open-source, strong multilingual support	Free

How to Choose an Embedding Model

Choosing an Embedding model requires considering three factors:

Performance: Score on MTEB (Massive Text Embedding Benchmark). This benchmark covers 58 datasets including classification, clustering, retrieval, and semantic similarity tasks. Higher scores mean better average performance across various tasks.

Speed: How long it takes to generate one vector. For local models, this depends on your GPU; for API models, it depends on the provider's response time.

Cost: API models charge per token; local models require GPU resources.
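For the cost factor, a quick estimate is often enough to compare API models. The sketch below uses the per-million-token prices from the comparison table above; the chunk count and average chunk length are assumptions you would replace with your own numbers, and the function name is hypothetical.

```python
# Prices in USD per 1M tokens, copied from the comparison table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "embed-v3": 0.10,
}


def embedding_cost_usd(model: str, num_chunks: int, avg_tokens_per_chunk: int) -> float:
    """One-off cost of embedding a corpus with an API model."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Assumed corpus: 100,000 chunks of about 500 tokens each.
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${embedding_cost_usd(model, 100_000, 500):.2f}")
```

Remember that re-embedding after switching models or changing the chunking strategy incurs this cost again.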

Selection recommendations:

Rapid prototyping: Use text-embedding-3-small — API calls, zero ops
Chinese scenarios: BGE-large or GTE-large — open-source and free, strong Chinese performance
Multilingual scenarios: E5-large — strong multilingual support
Pursuing maximum performance: text-embedding-3-large — but expensive
Limited resources: BGE-small or GTE-small — fewer parameters, faster

Using Embedding Models

# Option 1: OpenAI API
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("What is retrieval-augmented generation?")
vectors = embeddings.embed_documents(["Document 1", "Document 2", "Document 3"])

# Option 2: Local model (Sentence Transformers)
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vector = embeddings.embed_query("What is retrieval-augmented generation?")

# Option 3: Cohere API
from langchain_cohere import CohereEmbeddings

embeddings = CohereEmbeddings(model="embed-english-v3.0")
vector = embeddings.embed_query("What is retrieval-augmented generation?")

Vector Databases

Once you have vectors, you need a place to store and retrieve them. That's what vector databases do.

Mainstream Vector Database Comparison

Database	Type	Max Scale	Characteristics	Ops Cost
Chroma	Embedded	Millions	Zero-config, Python-native, ideal for prototypes	Very low
Milvus	Distributed	Billions	High performance, supports large-scale data	High
Pinecone	Cloud Service	Billions	Fully managed, no ops needed	Low (but pay-per-use)
Weaviate	Open-source	Hundreds of millions	Hybrid search, GraphQL interface	Medium
Qdrant	Open-source	Billions	High performance, Rust implementation	Medium
pgvector	PG Extension	Hundreds of millions	Leverages existing PostgreSQL infra	Low (if PG already exists)

How to Choose a Vector Database

Prototyping: Chroma. Zero config, pip install and go, data stored on local disk.

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

Small-to-medium scale production: Qdrant or Weaviate. Open-source and free, excellent performance, supports hybrid search.

Large-scale production: Milvus or Pinecone. Milvus for self-hosted large-scale scenarios; Pinecone for teams that don't want to manage infrastructure.

Already have PostgreSQL: pgvector. No additional deployment is needed; you add vector search directly to your existing PostgreSQL instance.

# pgvector example
from langchain_community.vectorstores import PGVector

vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection_string="postgresql://user:pass@localhost:5432/rag_db"
)

Index Strategy

Retrieval speed in a vector database depends on the index type. Common indexes:

Index Type	Principle	Speed	Precision	Use Case
Flat	Brute force — computes distance to every vector	Slow	100%	Small datasets (<100K)
IVF	Cluster first, then search nearest clusters	Fast	95%+	Medium scale
HNSW	Hierarchical Navigable Small World graph	Very fast	99%+	Large scale, preferred
PQ	Product quantization — compresses vectors	Very fast	90%+	Memory-constrained scenarios

Recommendation: Use HNSW for most scenarios. It offers the best balance between speed and precision. Chroma uses HNSW by default; Milvus and Qdrant also recommend HNSW.
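To see what the approximate indexes are speeding up, here is a minimal sketch of a Flat (brute-force) search in plain Python. It scores the query against every stored vector, which is exact but grows linearly with the dataset. The names `flat_search` and `index` and the toy 2-dimensional vectors are assumptions for illustration, not any database's API.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flat_search(query, vectors, k=3):
    """Exact top-k search: compare the query with every stored vector.

    vectors maps a document id to its vector. Returns (id, score) pairs,
    best first.
    """
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    top = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in top]


# Toy index with 2-dimensional vectors.
index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
print(flat_search([1.0, 0.0], index, k=2))
```

IVF, HNSW, and PQ all avoid this full scan in different ways, trading a small amount of recall for speed or memory.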

Hybrid Search

Pure vector search has blind spots. Consider these scenarios:

"What is LCEL in LangChain?" — the user explicitly mentions the technical term "LCEL," but vector search might return content about other aspects of LangChain.
"The March 2026 product launch" contains an exact date, which vector search struggles to match reliably.

Hybrid search combines vector search and keyword search (BM25), complementing each other.

Why Hybrid Search Is Needed

Dimension	Vector Search	Keyword Search (BM25)
Strengths	Understands semantics — "car" matches "automobile"	Exact matching — "LangChain" only matches "LangChain"
Weaknesses	Poor at matching technical terms, dates, proper nouns	Doesn't understand semantics: "car" does not match "automobile"
Best for	Fuzzy queries, semantic queries	Exact queries, technical term queries
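The keyword side of the table can be illustrated with a small from-scratch BM25 scorer. This is a simplified sketch, not the implementation used by any particular library: it assumes a naive lowercase alphanumeric tokenizer, the common defaults k1=1.5 and b=0.75, and the non-negative IDF variant log((N - n + 0.5) / (n + 0.5) + 1).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # exact matching only: unseen terms add nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "LCEL is the LangChain Expression Language",
    "LangChain supports many vector stores",
    "A car is also called an automobile",
]
print(bm25_scores("LCEL", docs))     # only the first document matches
print(bm25_scores("vehicle", docs))  # no exact term match anywhere
```

The second query shows the weakness from the table: "vehicle" never appears literally, so BM25 finds nothing even though the third document is clearly related.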

Implementing Hybrid Search

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Hybrid retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Vector search weight 0.6, BM25 weight 0.4
)

results = ensemble_retriever.invoke("What is LCEL in LangChain?")

Fusion Strategies

Results from multiple retrieval paths need to be fused. Common fusion methods:

Weighted fusion: Weight and sum results from different retrievers. Simple and direct, but weights need tuning.
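A minimal sketch of score-based weighted fusion follows. Because cosine similarities and BM25 scores live on different scales, this version min-max normalizes each retriever's scores before summing; that normalization step, the function names, and the toy scores are assumptions made for the illustration.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, keyword_scores, w_vector=0.6, w_keyword=0.4):
    """Weighted sum of normalized scores; a missing document counts as 0."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        doc_id: w_vector * v.get(doc_id, 0.0) + w_keyword * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


vector_scores = {"a": 0.9, "b": 0.8, "c": 0.5}   # cosine similarities
keyword_scores = {"b": 7.0, "c": 3.0, "d": 1.0}  # BM25 scores
print(weighted_fusion(vector_scores, keyword_scores))
```

Document "b" wins here because both retrievers rank it highly, even though neither the weights nor the raw scores alone would put it first in the vector results.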

RRF (Reciprocal Rank Fusion): Rank-based fusion. Each retriever's results are scored by rank as 1/(c + rank), where c is a smoothing constant (typically 60), and the final ranking is by summed score. Because it uses ranks rather than raw scores, it needs little tuning and usually performs well.

from langchain.retrievers import EnsembleRetriever

# RRF fusion (default)
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.5, 0.5],
    c=60  # RRF constant, typically 60
)
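For intuition, RRF itself fits in a few lines. The sketch below is an unweighted version written from scratch, assuming each retriever returns an ordered list of document ids; it is meant to illustrate the scoring rule, not to mirror LangChain's internals.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document ids.

    Each document earns 1 / (c + rank) from every list it appears in,
    with rank starting at 1. Returns (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Ties are broken by document id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
for doc_id, score in reciprocal_rank_fusion([vector_ranking, bm25_ranking]):
    print(doc_id, round(score, 5))
```

Only ranks matter here, so the two retrievers' score scales never need to be reconciled.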

My recommendation: Start with RRF, which requires no tuning and usually delivers excellent results. If performance is unsatisfactory, try weighted fusion.

Reranking: A Powerful Tool for Retrieval Improvement

Why Reranking Is Needed

Vector retrieval results are ranked by similarity, but not necessarily by relevance. Especially when queries are complex and there are many chunks, the top-k results from initial retrieval may contain chunks that "look similar but are actually irrelevant."

Reranking uses a more precise model to re-sort the initial retrieval results, pushing truly relevant chunks to the top.

Cross-Encoder

Reranking typically uses a Cross-Encoder. Unlike a Bi-Encoder (i.e., an Embedding model), which encodes each text separately, a Cross-Encoder takes a pair of texts (query + document) as input and outputs a relevance score directly.

Dimension	Bi-Encoder (Embedding)	Cross-Encoder (Reranker)
Input	Single text	Text pair (query + document)
Output	Vector	Relevance score
Speed	Fast (pre-computable)	Slow (inference required each time)
Precision	Lower	Higher
Use Case	Initial retrieval	Re-ranking
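The division of labor in the table can be sketched as a two-stage function: a cheap scorer narrows the whole corpus down to a candidate set, and an expensive pair scorer is run only on those candidates. Both scorers below are toy stand-ins (word overlap and Jaccard similarity) chosen so the example runs without any model; in a real system they would be an Embedding similarity and a Cross-Encoder.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    recall_k: int = 20,
    top_n: int = 5,
) -> List[str]:
    """Two-stage retrieval: broad cheap recall, then precise reranking."""
    # Stage 1: score every document with the cheap scorer, keep recall_k.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:recall_k]
    # Stage 2: run the expensive scorer on the candidates only.
    reranked = sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:top_n]


def shared_words(query: str, doc: str) -> float:
    """Toy stand-in for a Bi-Encoder: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy stand-in for a Cross-Encoder: Jaccard overlap of word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


corpus = [
    "lcel composes runnables in langchain",
    "langchain has many integrations and tools and a large community",
    "cooking pasta at home",
]
print(retrieve_then_rerank("lcel in langchain", corpus, shared_words, jaccard, recall_k=2, top_n=1))
```

Note that a document the first stage fails to recall can never be recovered by the reranker, which is why the first stage retrieves more candidates than are finally needed.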

Implementing Reranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Base retriever (vector search)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})  # Recall more first

# Reranker: CrossEncoderReranker expects LangChain's cross-encoder wrapper,
# not a raw sentence_transformers.CrossEncoder
cross_encoder = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

# Compression retriever: retrieve 20, then rerank to top 5
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

results = compression_retriever.invoke("What is LCEL in LangChain?")

Reranking Effectiveness

Reranking can significantly improve retrieval performance. Based on my experience:

Top-5 accuracy with pure vector search: assume 70%
Top-5 accuracy after reranking: typically improves to 85-90%
Cost: an additional 100-500ms of latency (depending on document count)
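Numbers like these are only meaningful if you measure them on your own data. One possible definition of "Top-5 accuracy" is the hit rate: the fraction of test queries for which at least one known-relevant chunk appears in the top 5 results. The sketch below assumes you have a small labeled set mapping each query to its relevant chunk ids; the function name and data layout are hypothetical.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Fraction of queries with at least one relevant chunk in the top k.

    results:  {query: ranked list of retrieved chunk ids}
    relevant: {query: set of chunk ids judged relevant}
    """
    if not results:
        raise ValueError("results must contain at least one query")
    hits = sum(
        1
        for query, ranked in results.items()
        if set(ranked[:k]) & relevant.get(query, set())
    )
    return hits / len(results)


# Toy evaluation set with two queries.
results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
relevant = {"q1": {"b"}, "q2": {"x"}}
print(hit_rate_at_k(results, relevant, k=5))
```

Running the same evaluation before and after adding the reranker gives a like-for-like comparison on your own corpus.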

Recommendation: If you require high retrieval precision, reranking is a worthwhile investment. If it's just an internal tool, pure vector search might be sufficient.

Vector Retrieval in Practice: Claude Code MagmaAdapter

Consider the Claude Code memory system again. Its vector retrieval implementation is quite basic: pure vector search, with no hybrid search and no reranking.

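// MagmaAdapter core retrieval logic (simplified excerpt; `db` and `lancedb`
// come from the surrounding adapter and are not shown here)
async readLanceDB(query: string, limit: number = 10, layer?: string) {
  const queryVector = await this.embedQuery(query, lancedb)
  const results = []

  // Search every table and collect the hits into one list
  for (const tableName of await db.tableNames()) {
    const table = await db.openTable(tableName)
    const rows = await table
      .vectorSearch(queryVector)
      .limit(limit)
      .distanceType('COSINE')
      .toArray()

    for (const entry of rows) {
      results.push({
        key: entry.key,      // assumes each row has `key` and `value` columns
        value: entry.value,
        score: 1 - entry._distance  // Convert cosine distance to similarity
      })
    }
  }

  // Merge across tables: highest similarity first, keep the top `limit`
  return results.sort((a, b) => b.score - a.score).slice(0, limit)
}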

Characteristics of this implementation:

Pure vector search: Only uses LanceDB's vectorSearch
COSINE distance: Standard cosine distance, converted to a similarity score as 1 - distance
Sort by score: Simplest ranking approach
No hybrid search: No BM25
No reranking: No Cross-Encoder

Why is it so basic? Because the memory system's data volume is small (tens to hundreds of entries), and each memory is human-written, well-formatted text. In this scenario, basic vector search is sufficient.

But for large-scale RAG systems (thousands to millions of chunks), pure vector search alone is not enough. You need hybrid search to improve recall, and reranking to improve precision.

Common Pitfalls in Vector Retrieval

Pitfall 1: Embedding Model Mismatch

Symptoms: Retrieved results are only weakly related to the question.

Cause: The Embedding model doesn't correctly understand your data type. For example, using an English model for Chinese documents, or a general-purpose model for domain-specific documents.

Solution: Switch to a more suitable Embedding model.

Pitfall 2: Chunks Too Large, Degrading Vector Quality

Symptoms: Retrieval results contain large amounts of irrelevant content.

Cause: Chunks are too large, containing multiple topics within one chunk. The vector is "diluted."

Solution: Reduce chunk_size, or switch to a more precise splitting strategy.

Pitfall 3: Using Only Vector Search

Symptoms: Exact keyword matching performance is poor.

Cause: Vector search is not good at matching technical terms, dates, or proper nouns.

Solution: Add BM25 keyword search and implement hybrid search.

Pitfall 4: Insufficient Retrieval Precision

Symptoms: Top-k results contain some less relevant chunks.

Cause: Initial retrieval precision is insufficient.

Solution: Add reranking. First recall more results (top-20 or top-50), then use a Cross-Encoder to re-rank and take the top 5.

Pitfall 5: Vector Database Performance Issues

Symptoms: Retrieval latency is too high.

Cause: Data volume is too large, or the index strategy is incorrect.

Solution:

Check index type (HNSW recommended)
Adjust index parameters (HNSW's M and efConstruction)
Consider sharding or distributed deployment

Summary

Vector retrieval is the core step of the RAG pipeline.

Key takeaways:

Choose Embedding models based on data type — BGE/GTE for Chinese, E5 for multilingual
Choose vector databases based on scale — Chroma for prototyping, Qdrant/Milvus/Pinecone for production
Hybrid search is a production standard — vector + BM25, RRF fusion
Reranking is a powerful tool for quality improvement — recall more first, then re-rank

The next article covers post-retrieval optimization — after getting retrieval results, how to further improve RAG performance.
