Vector Retrieval: From Embedding to Vector Databases

The previous article covered document splitting — the first step of the RAG pipeline. The split text chunks need to be converted into numeric vectors before they can be used by the retrieval system. This article covers the second and third steps: embedding, and vector storage/retrieval.

Embedding: Converting Text to Vectors

What Is Embedding

An Embedding model maps text into a high-dimensional vector space. Semantically similar text maps to vectors that are close together in the space.

"cat" → [0.23, -0.15, 0.87, ..., 0.42]  (1536 dimensions)
"dog" → [0.21, -0.13, 0.85, ..., 0.40]  (1536 dimensions)  ← close to "cat"
"quantum computing" → [-0.56, 0.72, 0.03, ..., -0.91]  (1536 dimensions)  ← far from "cat"

Common vector dimensions are 384, 768, 1024, 1536, and 3072. Higher dimensions can encode richer semantic information, but they also increase computation and storage costs.
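
To make the storage cost concrete, here is a rough back-of-the-envelope sketch. It assumes each value is stored as a 4-byte float32 and ignores index overhead and metadata; the function name is illustrative only.

```python
def raw_vector_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value


# One million chunks at two common dimensions
for dims in (384, 1536):
    size_gb = raw_vector_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {size_gb:.2f} GB")
```

Under these assumptions, one million 1536-dimensional vectors take about 6.14 GB, four times the 1.54 GB needed at 384 dimensions.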

The similarity between two vectors (typically measured with cosine similarity) represents the semantic similarity between the corresponding texts. A cosine similarity closer to 1 means the texts are more similar; a value near 0 means they are unrelated.
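
A minimal sketch of the calculation, using only the standard library. The three-dimensional vectors are made up for illustration and are not real model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only)
cat = [0.23, -0.15, 0.87]
dog = [0.21, -0.13, 0.85]
quantum = [-0.56, 0.72, 0.03]

print(cosine_similarity(cat, dog))      # close to 1
print(cosine_similarity(cat, quantum))  # much lower
```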

Mainstream Embedding Model Comparison

| Model | Dimensions | Parameters | Characteristics | Price |
| --- | --- | --- | --- | --- |
| text-embedding-3-large | 3072 | - | OpenAI's strongest, strong MTEB scores | $0.13/1M tokens |
| text-embedding-3-small | 1536 | - | OpenAI's value option, good performance | $0.02/1M tokens |
| embed-v3 | 1024 | - | Cohere, supports compressed dimensions | $0.10/1M tokens |
| BGE-large | 1024 | 326M | Open-source top choice, strong Chinese/English | Free |
| GTE-large | 1024 | 326M | Alibaba open-source, excellent for Chinese | Free |
| E5-large | 1024 | 326M | Microsoft open-source, strong multilingual support | Free |

How to Choose an Embedding Model

Choosing an Embedding model requires considering three factors:

Performance: Score on MTEB (Massive Text Embedding Benchmark). This benchmark covers 58 datasets including classification, clustering, retrieval, and semantic similarity tasks. Higher scores mean better average performance across various tasks.

Speed: How long it takes to generate one vector. For local models, this depends on your GPU; for API models, it depends on the provider's response time.

Cost: API models charge per token; local models require GPU resources.

Selection recommendations:

  • Rapid prototyping: Use text-embedding-3-small — API calls, zero ops
  • Chinese scenarios: BGE-large or GTE-large — open-source and free, strong Chinese performance
  • Multilingual scenarios: E5-large — strong multilingual support
  • Pursuing maximum performance: text-embedding-3-large — but expensive
  • Limited resources: BGE-small or GTE-small — fewer parameters, faster
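
For the cost factor, a quick estimate is often enough to decide between API models. The sketch below uses the per-token prices from the comparison table (which may have changed since); the corpus size and the function name are hypothetical.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_tokens: float) -> float:
    """One-off cost of embedding a corpus with a per-token API price."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical corpus: 200,000 chunks averaging 300 tokens each
total_tokens = 200_000 * 300  # 60 million tokens

for model, price in [("text-embedding-3-small", 0.02), ("text-embedding-3-large", 0.13)]:
    print(f"{model}: ${embedding_cost_usd(total_tokens, price):.2f}")
```

For this corpus the one-off embedding cost is about $1.20 versus $7.80. Query-time embedding and any re-embedding after a model switch come on top.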

Using Embedding Models

# Option 1: OpenAI API
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("What is retrieval-augmented generation?")
vectors = embeddings.embed_documents(["Document 1", "Document 2", "Document 3"])

# Option 2: Local model (Sentence Transformers)
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vector = embeddings.embed_query("What is retrieval-augmented generation?")

# Option 3: Cohere API
from langchain_cohere import CohereEmbeddings

embeddings = CohereEmbeddings(model="embed-english-v3.0")
vector = embeddings.embed_query("What is retrieval-augmented generation?")

Vector Databases

Once you have vectors, you need a place to store and retrieve them. That's what vector databases do.

Mainstream Vector Database Comparison

| Database | Type | Max Scale | Characteristics | Ops Cost |
| --- | --- | --- | --- | --- |
| Chroma | Embedded | Millions | Zero-config, Python-native, ideal for prototypes | Very low |
| Milvus | Distributed | Billions | High performance, supports large-scale data | High |
| Pinecone | Cloud service | Billions | Fully managed, no ops needed | Low (but pay-per-use) |
| Weaviate | Open-source | Hundreds of millions | Hybrid search, GraphQL interface | Medium |
| Qdrant | Open-source | Billions | High performance, Rust implementation | Medium |
| pgvector | PG extension | Hundreds of millions | Leverages existing PostgreSQL infra | Low (if PG already exists) |

How to Choose a Vector Database

Prototyping: Chroma. Zero config, pip install and go, data stored on local disk.

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

Small-to-medium scale production: Qdrant or Weaviate. Open-source and free, excellent performance, supports hybrid search.

Large-scale production: Milvus or Pinecone. Milvus for self-hosted large-scale scenarios; Pinecone for teams that don't want to manage infrastructure.

Already have PostgreSQL: pgvector. No additional deployment needed; it adds vector search capabilities directly to PostgreSQL.

# pgvector example
from langchain_community.vectorstores import PGVector

vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection_string="postgresql://user:pass@localhost:5432/rag_db"
)

Index Strategy

Retrieval speed in a vector database depends on the index type. Common indexes:

| Index Type | Principle | Speed | Precision | Use Case |
| --- | --- | --- | --- | --- |
| Flat | Brute force: computes distance to every vector | Slow | 100% | Small datasets (<100K) |
| IVF | Cluster first, then search nearest clusters | Fast | 95%+ | Medium scale |
| HNSW | Hierarchical Navigable Small World graph | Very fast | 99%+ | Large scale, preferred |
| PQ | Product quantization: compresses vectors | Very fast | 90%+ | Memory-constrained scenarios |

Recommendation: Use HNSW for most scenarios. It offers the best balance between speed and precision. Chroma uses HNSW by default; Milvus and Qdrant also recommend HNSW.
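
To illustrate the difference between the Flat and IVF rows, here is a toy sketch in pure Python. It is not how any particular database implements these indexes: the centroids are assumed to be pre-computed (for example by k-means), and `nprobe` is the usual name for the number of clusters searched.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, k):
    """Brute force: rank every vector by distance to the query."""
    order = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return order[:k]


def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid (the inverted lists)."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe=1):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]


vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
centroids = [[0.1, 0.1], [5.0, 5.0]]  # assumed pre-computed, e.g. by k-means
lists = build_ivf(vectors, centroids)

query = [0.1, 0.1]
print(flat_search(query, vectors, k=2))
print(ivf_search(query, vectors, centroids, lists, k=2, nprobe=1))
```

With `nprobe=1` the IVF search compares the query against three vectors instead of six and still returns the same two neighbors. The precision loss in the table appears when a true neighbor sits in a cluster that was not probed.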

Hybrid Search

Pure vector search has blind spots. Consider these scenarios:

  • "What is LCEL in LangChain?" — the user explicitly mentions the technical term "LCEL," but vector search might return content about other aspects of LangChain.
  • "The March 2026 product launch" — contains a precise date that vector search can hardly match precisely.

Hybrid search combines vector search and keyword search (BM25), complementing each other.

Why Hybrid Search Is Needed

| Dimension | Vector Search | Keyword Search (BM25) |
| --- | --- | --- |
| Strengths | Understands semantics: "car" matches "automobile" | Exact matching: "LangChain" only matches "LangChain" |
| Weaknesses | Poor at matching technical terms, dates, proper nouns | Doesn't understand semantics: "car" and "automobile" don't match each other |
| Best for | Fuzzy queries, semantic queries | Exact queries, technical term queries |
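
To show why BM25 is strong on exact terms, here is a compact sketch of the Okapi BM25 scoring formula in pure Python. The whitespace tokenizer is a simplification, and `k1=1.5`, `b=0.75` are common default choices rather than values from this article.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "LCEL is the LangChain Expression Language for composing chains",
    "LangChain agents call tools in a loop",
    "Vector databases store embeddings",
]
print(bm25_scores("what is LCEL", documents))
```

Only the first document contains the literal token "lcel", so it is the only one with a non-zero score. That is the exact-match behavior vector search lacks.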

Implementing Hybrid Search

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Hybrid retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Vector search weight 0.6, BM25 weight 0.4
)

results = ensemble_retriever.invoke("What is LCEL in LangChain?")

Fusion Strategies

Results from multiple retrieval paths need to be fused. Common fusion methods:

Weighted fusion: Normalize each retriever's scores, multiply them by a per-retriever weight, and sum. Simple and direct, but the weights need tuning.
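
A minimal sketch of score-based weighted fusion. Min-max normalization is one possible choice, assumed here because cosine similarities and BM25 scores live on different scales; the function names, document ids, and scores are illustrative.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, keyword_scores, w_vector=0.6, w_keyword=0.4):
    """Weighted sum of normalized scores; a document missing from one list counts as 0 there."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        doc_id: w_vector * v.get(doc_id, 0.0) + w_keyword * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    # Sort by score descending, then by id so ties are deterministic
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


vector_scores = {"a": 0.82, "b": 0.80, "c": 0.40}  # cosine similarities
keyword_scores = {"b": 7.5, "d": 3.0}              # unbounded BM25 scores
print(weighted_fusion(vector_scores, keyword_scores))
```

Document "b" ends up first because both retrievers rank it highly, even though "a" has the best vector score.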

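RRF (Reciprocal Rank Fusion): Rank-based fusion. Each retriever contributes a score of 1/(c + rank) for every result it returns, where c is a smoothing constant (typically 60), and the final ranking is by total score. Because it works on ranks rather than raw scores, it needs no score normalization and little tuning, and it usually performs well. LangChain's EnsembleRetriever applies a weighted form of RRF, where the weights scale each retriever's contribution.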

from langchain.retrievers import EnsembleRetriever

# RRF fusion (default)
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.5, 0.5],
    c=60  # RRF constant, typically 60
)

My recommendation: Start with RRF, which requires little tuning and usually delivers excellent results. If performance is unsatisfactory, try weighted fusion.
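
For reference, RRF itself fits in a few lines. This is a standalone, unweighted sketch using the commonly cited form 1/(c + rank), not LangChain's internal implementation; the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (c + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by fused score descending, then by id so ties are deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("b" and "a") rise to the top, and no raw scores are needed, only the order each retriever returned.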

Reranking: A Powerful Tool for Retrieval Improvement

Why Reranking Is Needed

Vector retrieval results are ranked by similarity, but not necessarily by relevance. Especially when queries are complex and there are many chunks, the top-k results from initial retrieval may contain chunks that "look similar but are actually irrelevant."

Reranking uses a more precise model to re-sort the initial retrieval results, pushing truly relevant chunks to the top.

Cross-Encoder

Reranking typically uses a Cross-Encoder. Unlike a Bi-Encoder (i.e., an Embedding model), which encodes each text separately, a Cross-Encoder takes a pair of texts (query + document) as input and outputs a relevance score directly.

| Dimension | Bi-Encoder (Embedding) | Cross-Encoder (Reranker) |
| --- | --- | --- |
| Input | Single text | Text pair (query + document) |
| Output | Vector | Relevance score |
| Speed | Fast (pre-computable) | Slow (inference required each time) |
| Precision | Lower | Higher |
| Use Case | Initial retrieval | Re-ranking |
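
The two-stage pattern can be sketched independently of any model. In the example below, `word_overlap` and `exact_phrase` are toy stand-ins for a Bi-Encoder and a Cross-Encoder so the code runs without downloads; all names are hypothetical.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    documents: list[str],
    fast_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    recall_k: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Stage 1: cheap scoring over everything. Stage 2: expensive scoring over the shortlist."""
    shortlist = sorted(documents, key=lambda d: fast_score(query, d), reverse=True)[:recall_k]
    return sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)[:top_n]


# Stand-in scorers so the sketch runs without a model
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def exact_phrase(query: str, doc: str) -> float:
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "expression of language in langchain forums",
    "the langchain expression language",
    "cooking with cast iron",
]
result = retrieve_then_rerank(
    "langchain expression language", docs, word_overlap, exact_phrase, recall_k=2, top_n=1
)
print(result)
```

The cheap scorer cannot tell the first two documents apart, since both share three words with the query. The more expensive scorer is only run on the shortlist and promotes the one that actually matches.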

Implementing Reranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Base retriever (vector search)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})  # Recall more first

# Reranker: CrossEncoderReranker expects LangChain's cross-encoder wrapper,
# not a raw sentence_transformers.CrossEncoder
cross_encoder = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

# Compression retriever: retrieve 20, then rerank to top 5
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

results = compression_retriever.invoke("What is LCEL in LangChain?")

Reranking Effectiveness

Reranking can significantly improve retrieval performance. Based on my experience:

  • Top-5 accuracy with pure vector search: assume 70%
  • Top-5 accuracy after reranking: typically improves to 85-90%
  • Cost: an additional 100-500ms of latency (depending on document count)

Recommendation: If you require high retrieval precision, reranking is a worthwhile investment. If it's just an internal tool, pure vector search might be sufficient.
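
Before deciding, it helps to measure the effect on your own data. The sketch below computes hit rate@k, one common reading of "top-k accuracy": the fraction of queries with at least one relevant chunk in the top k. The chunk ids and result lists are made up for illustration.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk id in the top-k results."""
    if len(retrieved) != len(relevant):
        raise ValueError("need one relevant set per query")
    hits = sum(1 for ranked, gold in zip(retrieved, relevant) if gold & set(ranked[:k]))
    return hits / len(retrieved)


# Hypothetical results for 4 queries (chunk ids are made up)
relevant = [{"c1"}, {"c7"}, {"c3", "c4"}, {"c9"}]
before = [["c1", "c2"], ["c5", "c6"], ["c8", "c3"], ["c2", "c5"]]
after = [["c1", "c2"], ["c7", "c5"], ["c3", "c8"], ["c2", "c5"]]

print(hit_rate_at_k(before, relevant, k=2))  # 0.5
print(hit_rate_at_k(after, relevant, k=2))   # 0.75
```

Running the same labeled queries through the pipeline with and without the reranker gives a concrete number to weigh against the added latency.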

Vector Retrieval in Practice: Claude Code MagmaAdapter

Look back at the Claude Code memory system. Its vector retrieval implementation is quite basic: pure vector search, no hybrid search, no reranking.

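// MagmaAdapter core retrieval logic (simplified; `db` and `lancedb` come from the surrounding module)
async readLanceDB(query: string, limit: number = 10, layer?: string) {
  const queryVector = await this.embedQuery(query, lancedb)
  const results = []  // hits accumulated across all tables

  for (const tableName of await db.tableNames()) {
    const table = await db.openTable(tableName)
    const rows = await table
      .vectorSearch(queryVector)
      .limit(limit)
      .distanceType('COSINE')
      .toArray()

    for (const entry of rows) {
      results.push({
        key: entry.key,      // field names assumed from the original shorthand
        value: entry.value,
        score: 1 - entry._distance  // Convert cosine distance to similarity
      })
    }
  }

  // Merge hits from every table and keep the best `limit` overall
  return results.sort((a, b) => b.score - a.score).slice(0, limit)
}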

Characteristics of this implementation:

  • Pure vector search: Only uses LanceDB's vectorSearch
  • COSINE distance: Standard cosine similarity
  • Sort by score: Simplest ranking approach
  • No hybrid search: No BM25
  • No reranking: No Cross-Encoder

Why is it so basic? Because the memory system's data volume is small (tens to hundreds of entries), and each memory is human-written, well-formatted text. In this scenario, basic vector search is sufficient.

But for large-scale RAG systems (thousands to millions of chunks), pure vector search alone is not enough. You need hybrid search to improve recall, and reranking to improve precision.

Common Pitfalls in Vector Retrieval

Pitfall 1: Embedding Model Mismatch

Symptoms: The semantic relevance between retrieval results and questions is low.

Cause: The Embedding model doesn't correctly understand your data type. For example, using an English model for Chinese documents, or a general-purpose model for domain-specific documents.

Solution: Switch to a more suitable Embedding model.

Pitfall 2: Chunks Too Large, Degrading Vector Quality

Symptoms: Retrieval results contain large amounts of irrelevant content.

Cause: Chunks are too large, containing multiple topics within one chunk. The vector is "diluted."

Solution: Reduce chunk_size, or switch to a more precise splitting strategy.
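
The dilution effect can be illustrated with a deliberately crude model. It assumes a multi-topic chunk's vector behaves roughly like the average of its topic vectors, which real embedding models only approximate; the topic vectors are toy values.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Element-wise average, standing in for the embedding of a mixed-topic chunk."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


topic_a = [1.0, 0.0, 0.0]
topic_b = [0.0, 1.0, 0.0]
topic_c = [0.0, 0.0, 1.0]
query = topic_a  # the user asks about topic A only

print(cosine(query, topic_a))                                   # single-topic chunk
print(cosine(query, mean_vector([topic_a, topic_b])))           # chunk mixing two topics
print(cosine(query, mean_vector([topic_a, topic_b, topic_c])))  # chunk mixing three topics
```

In this toy model the similarity to the query drops from 1.0 to about 0.71 and then 0.58 as unrelated topics are added to the chunk, which is why a smaller, focused chunk tends to rank higher for a focused query.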

Pitfall 3: Using Only Vector Search

Symptoms: Exact keyword matching performance is poor.

Cause: Vector search is not good at matching technical terms, dates, or proper nouns.

Solution: Add BM25 keyword search and implement hybrid search.

Pitfall 4: Insufficient Retrieval Precision

Symptoms: Top-k results contain some less relevant chunks.

Cause: Initial retrieval precision is insufficient.

Solution: Add reranking. First recall more results (top-20 or top-50), then use a Cross-Encoder to re-rank and take the top 5.

Pitfall 5: Vector Database Performance Issues

Symptoms: Retrieval latency is too high.

Cause: Data volume is too large, or the index strategy is incorrect.

Solution:

  • Check index type (HNSW recommended)
  • Adjust index parameters (HNSW's M and efConstruction)
  • Consider sharding or distributed deployment
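
Before changing index parameters, measure the latency you actually have. This is a small standard-library sketch; `fake_search` is a stand-in for a real retriever call (for example `vectorstore.similarity_search`), and the p95 value uses a simple nearest-rank approximation.

```python
import statistics
import time


def measure_latency_ms(search, queries, warmup=1):
    """Time each call to search(query) and return p50 / p95 latency in milliseconds."""
    if not queries:
        raise ValueError("need at least one query")
    for q in queries[:warmup]:
        search(q)  # warm caches before measuring
    samples = []
    for q in queries:
        start = time.perf_counter()
        search(q)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


# Stand-in for a real retriever call
def fake_search(query):
    return sum(i * i for i in range(10_000))


print(measure_latency_ms(fake_search, ["q"] * 50))
```

Comparing p50 and p95 before and after an index change shows whether the change helped typical queries, tail latency, or neither.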

Summary

Vector retrieval is the core step of the RAG pipeline.

Key takeaways:

  • Choose Embedding models based on data type — BGE/GTE for Chinese, E5 for multilingual
  • Choose vector databases based on scale — Chroma for prototyping, Qdrant/Milvus/Pinecone for production
  • Hybrid search is a production standard — vector + BM25, RRF fusion
  • Reranking is a powerful tool for quality improvement — recall more first, then re-rank

The next article covers post-retrieval optimization — after getting retrieval results, how to further improve RAG performance.

