Post-Retrieval Optimization: Making RAG Results More Precise

The previous three articles covered document splitting, vector retrieval, and hybrid search. Suppose you've already split your documents into chunks, chosen your embedding model, and set up hybrid search. Now a user asks a question and you've retrieved the top-k results. What next?

Many people assume retrieval is the end of the job and stuff the results directly into a prompt for the model to answer. In reality, there is a series of optimization steps between "retrieval results" and "stuffing into the prompt." These steps often determine the final performance of the RAG system.

Query Rewriting

The user's question is often not the optimal retrieval query.

Vague phrasing: "How do you use that thing?" — what is "that thing"?
Overly broad: "Tell me about LangChain" — what aspect?
Contains multiple sub-questions: "What is the difference between LangChain and LlamaIndex, and what scenarios are they each suited for?" — this is actually two questions.
Wording doesn't match the documents: The user says "how to apply for time off," but the documents say "employee leave request process."

Query Rewriting optimizes the user's question before retrieval, making the search more precise.

Basic Rewriting: Let the LLM Optimize the Query

The simplest approach: let the LLM rewrite the user's question into a more precise, more specific search query.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

rewrite_prompt = ChatPromptTemplate.from_template(
    "You are a search optimization assistant. Rewrite the user's question into a more precise, more specific search query. "
    "Only output the rewritten query, nothing else.\n\nUser question: {query}"
)

rewrite_chain = rewrite_prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# Original query
original_query = "How does that framework handle external data?"

# Rewritten query
rewritten_query = rewrite_chain.invoke({"query": original_query})
# → "How does the LangChain framework integrate external data sources (PDF, databases, API) for retrieval-augmented generation (RAG)"

# Search with the rewritten query
results = retriever.invoke(rewritten_query)

Multi-Query Rewriting: Ask from Multiple Angles

The same question, rewritten from different angles, generating multiple queries and merging retrieval results.

from langchain.retrievers import MultiQueryRetriever

# Automatically generate queries from 3 different angles.
# MultiQueryRetriever passes the user's query to the prompt as {question}
# and splits the LLM output into one query per line.
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=ChatOpenAI(model="gpt-4o-mini"),
    prompt=ChatPromptTemplate.from_template(
        "The user's question is: {question}\n"
        "Please generate 3 search queries from different angles, one per line."
    )
)

# Original query: "What are the core modules of LangChain?"
# Might generate:
# 1. "LangChain six core modules LLM Chain LCEL RAG Agent Memory"
# 2. "LangChain module architecture and function overview"
# 3. "What problems does each LangChain module solve and use cases"

results = multi_query_retriever.invoke("What are the core modules of LangChain?")

Question Decomposition: Step-Back Prompting

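For complex questions, "step back" first: break them into sub-questions, retrieve for each one separately, then merge the results. (Strictly speaking, step-back prompting means first asking a more general, higher-level question; what is shown here is sub-question decomposition.)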

# Original question (complex)
query = "What is the performance difference between LangChain and LlamaIndex for large-scale document retrieval?"

# Decompose into sub-questions (hard-coded here; in practice an LLM would generate them)
sub_queries = [
    "How does LangChain handle large-scale document retrieval?",
    "How does LlamaIndex handle large-scale document retrieval?",
    "LangChain vs LlamaIndex performance benchmark comparison",
]

# Retrieve separately
all_results = []
for sub_query in sub_queries:
    results = retriever.invoke(sub_query)
    all_results.extend(results)

# Deduplicate and merge (deduplicate is a helper you define yourself, not a LangChain function)
final_results = deduplicate(all_results)
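The `deduplicate` call above is not defined in the snippet. Here is a minimal sketch of such a helper, assuming each result exposes `page_content` (as LangChain `Document` objects do) and that two results with identical text count as duplicates. The `Doc` class is a hypothetical stand-in so the example runs on its own:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def deduplicate(docs):
    """Drop results with identical text, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical results from two sub-queries that returned the same chunk
all_results = [Doc("LangChain loaders"), Doc("LlamaIndex indexes"), Doc("LangChain loaders")]
final_results = deduplicate(all_results)
print([d.page_content for d in final_results])
```

Exact-text matching is the simplest criterion. If your chunks carry a stable ID in their metadata, keying on that ID is usually more robust.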

When to Use Query Rewriting

Scenario	Rewriting Method	Effect
Vague phrasing	Basic rewriting	Makes the query more specific
Overly broad	Multi-query rewriting	Covers from multiple angles
Contains multiple sub-questions	Question decomposition	Retrieve separately, merge results
Wording mismatch	Synonym expansion	Matches wording in documents

My recommendation: Start with basic rewriting; if performance is insufficient, add multi-query rewriting. Question decomposition is suitable for complex queries but increases latency and cost.
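The table lists synonym expansion, which has no example above. One possible sketch, assuming a small hand-maintained mapping from user wording to document wording (the `SYNONYMS` dictionary and `expand_query` function are hypothetical names, not part of any library):

```python
# Hypothetical, hand-maintained mapping from user wording to document wording.
SYNONYMS = {
    "time off": ["leave", "leave request"],
    "sign up": ["register", "registration"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Append document-side terms for every user-side phrase found in the query."""
    lowered = query.lower()
    extra_terms = []
    for phrase, alternatives in synonyms.items():
        if phrase in lowered:
            extra_terms.extend(alternatives)
    if not extra_terms:
        return query
    return query + " " + " ".join(extra_terms)


expanded = expand_query("how to apply for time off")
print(expanded)
```

Appending terms like this mainly helps keyword retrievers such as BM25. For vector search, an LLM-based rewrite is usually the more natural way to bridge wording differences.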

HyDE: Hypothetical Document Embeddings

HyDE (Hypothetical Document Embeddings) is a clever retrieval optimization technique. Its core idea: don't match questions to chunks — match "hypothetical answers" to chunks.

Traditional Retrieval vs. HyDE

Traditional retrieval: Vectorize the user's question → find the most similar chunks in the vector database.

The problem: questions and chunks are often phrased differently. A user asks "how to apply for time off," but documents say "employee leave request procedure." The two texts are semantically related, but their vectors may not be very close.

HyDE:

1. User asks a question
2. Let the LLM generate a "hypothetical answer" (it doesn't need to be true, just semantically close to the real answer in vector space)
3. Vectorize the hypothetical answer
4. Find chunks most similar to the hypothetical answer in the vector database

Why does it work? Because hypothetical answers and document chunks are both "declarative text" — their phrasing is more similar, and their distances in vector space are also closer.
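The intuition can be illustrated with a toy example. The sketch below uses plain word counts as a crude stand-in for an embedding model, which is an assumption made only so the example runs without external services. Real embeddings capture far more semantics, so the gap will be less extreme in practice. The three texts are invented for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Toy 'embedding': lowercase word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


question = "How do I apply for time off?"
hypothetical_answer = "To take leave, an employee submits a leave request form to their manager."
chunk = "Employee leave request procedure: submit the leave request form to your manager."

question_sim = cosine(bow_vector(question), bow_vector(chunk))
hyde_sim = cosine(bow_vector(hypothetical_answer), bow_vector(chunk))
print(f"question vs chunk: {question_sim:.2f}")
print(f"hypothetical answer vs chunk: {hyde_sim:.2f}")
```

The question shares no words with the chunk, while the hypothetical answer shares several, so the second similarity is higher.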

Implementing HyDE

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# HyDE prompt
hyde_prompt = ChatPromptTemplate.from_template(
    "The user asked a question. Please generate a hypothetical answer paragraph that "
    "would contain the possible answer to the user's question. "
    "Only output the hypothetical answer, nothing else.\n\nQuestion: {query}"
)

# Generate hypothetical answer
hyde_chain = hyde_prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
hypothetical_answer = hyde_chain.invoke({"query": "What is LCEL in LangChain?"})
# → "LCEL (LangChain Expression Language) is LangChain's expression language, using pipe "
#    "symbols | to connect components to define AI processing pipelines. For example, "
#    "prompt | model | parser defines a complete AI processing chain. LCEL supports "
#    "streaming output, batch processing, and async execution."

# Search with the hypothetical answer
results = retriever.invoke(hypothetical_answer)

When HyDE Works Best

HyDE is particularly effective in these scenarios:

Large phrasing differences between user questions and documents
Short, vague queries
Documents in specialized domains where users ask questions in everyday language

But HyDE also has costs: an additional LLM call (increased latency and cost), and the quality of the hypothetical answer directly affects retrieval performance.

Multi-Path Recall

Different retrievers excel at different things. Multi-path recall uses multiple retrievers in parallel and merges the results.

Why Multi-Path Recall Is Needed

Retriever	Excels At	Struggles With
Vector search	Semantic matching	Exact keywords
BM25	Exact keywords	Semantic understanding
Graph search	Entity relationships	Free text
Knowledge graph	Structured knowledge	Unstructured text

Implementing Multi-Path Recall

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Path 1: Vector search
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Path 2: BM25 keyword search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

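# Path 3: HyDE search
# (Search with hypothetical answer, see previous section; not wired into the ensemble below)

# Merge paths 1 and 2 (EnsembleRetriever fuses them with weighted Reciprocal Rank Fusion)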
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.5, 0.5]
)

results = ensemble_retriever.invoke("What is LCEL in LangChain?")

Deduplication and Fusion

Multi-path recall results need deduplication: the same chunk may be returned by several retrievers, in which case its scores have to be merged into one.

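def merge_results(retriever_results_list):
    """Merge multi-path retrieval results, deduplicate, and re-rank.

    Assumes each chunk carries a comparable relevance score in
    metadata["score"]. LangChain retrievers do not set this by default,
    and raw scores from different retrievers (e.g., BM25 vs. cosine
    similarity) must be normalized before they can be compared.
    """
    chunk_map = {}  # {chunk_id: {"chunk": chunk, "scores": [...]}}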

    for results in retriever_results_list:
        for chunk in results:
            chunk_id = chunk.metadata.get("chunk_id", hash(chunk.page_content))
            if chunk_id not in chunk_map:
                chunk_map[chunk_id] = {"chunk": chunk, "scores": []}
            chunk_map[chunk_id]["scores"].append(chunk.metadata.get("score", 0))

    # Merge scores (max is used here; averaging is an alternative)
    merged = []
    for item in chunk_map.values():
        max_score = max(item["scores"])
        item["chunk"].metadata["score"] = max_score
        merged.append(item["chunk"])

    return sorted(merged, key=lambda x: x.metadata["score"], reverse=True)
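Because raw scores from different retrievers are rarely comparable, a common alternative to score merging is Reciprocal Rank Fusion (RRF), which the pipeline later in this article lists for the fusion step. RRF uses only each chunk's rank in each result list. The sketch below works on plain chunk IDs, which are hypothetical; `k=60` is a commonly used default, not something taken from this article.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["c1", "c2", "c3"]  # hypothetical chunk IDs from vector search
bm25_hits = ["c3", "c1", "c4"]    # hypothetical chunk IDs from BM25
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that appear in only one, without any need to normalize the underlying scores.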

Context Compression

Retrieved chunks may contain large amounts of content irrelevant to the query. Context compression extracts the parts most relevant to the query, reducing noise and saving tokens.

LLM Compression

Use an LLM to extract query-relevant parts from chunks:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Compression prompt; LLMChainExtractor fills in {question} and {context} for each chunk
compress_prompt = PromptTemplate.from_template(
    "Given the following document excerpt and user question, extract the parts most relevant to the user's question. "
    "Keep only the content that directly answers the question, removing irrelevant information.\n\n"
    "User question: {question}\n\nDocument excerpt: {context}\n\nExtraction result:"
)

compressor = LLMChainExtractor.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    prompt=compress_prompt
)

# Compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# Search and compress
compressed_results = compression_retriever.invoke("What is LCEL in LangChain?")
# Returns compressed chunks containing only query-relevant parts

Truncation Compression

A simpler approach: keep only the first N tokens of each chunk.

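def truncate_chunks(chunks, max_tokens_per_chunk=300):
    """Simple truncation: keep only the first N tokens of each chunk.

    Whitespace-separated words stand in for tokens here; use a real
    tokenizer if you need exact counts. Chunks are modified in place.
    """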
    truncated = []
    for chunk in chunks:
        tokens = chunk.page_content.split()[:max_tokens_per_chunk]
        chunk.page_content = " ".join(tokens)
        truncated.append(chunk)
    return truncated

Compression Trade-offs

Method	Pros	Cons
LLM compression	Precise, keeps only relevant content	Slow, expensive, adds latency
Truncation compression	Fast, cheap	May lose key information
No compression	Preserves complete context	More noise, wastes tokens

Recommendation: Consider LLM compression if average chunk length exceeds 500 tokens. If chunks are already short (200-300 tokens), compression is unnecessary.
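This rule of thumb can be turned into a simple check. The sketch below assumes whitespace-separated words as a rough proxy for tokens; a real tokenizer would give different counts. `average_length` and `should_compress` are hypothetical helper names.

```python
def average_length(chunk_texts):
    """Average length in whitespace-separated words (a rough proxy for tokens)."""
    if not chunk_texts:
        return 0.0
    return sum(len(text.split()) for text in chunk_texts) / len(chunk_texts)


def should_compress(chunk_texts, threshold=500):
    """Apply the rule of thumb: compress only when chunks are long on average."""
    return average_length(chunk_texts) > threshold


short_chunks = ["word " * 250, "word " * 300]  # averages 275 words
long_chunks = ["word " * 600, "word " * 700]   # averages 650 words
print(should_compress(short_chunks), should_compress(long_chunks))
```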

Hallucination Suppression

One of the biggest pain points in RAG is hallucination — the model doesn't answer based on retrieval results, but instead fabricates information.

Prompt Constraints

The simplest and most cost-effective method: in the prompt, explicitly require the model to answer based only on the retrieval results.

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a retrieval-based Q&A assistant.
    Rules:
    1. Answer based only on the provided context
    2. If the context doesn't contain relevant information, explicitly state "Based on the provided information, I cannot answer this question"
    3. Do not fabricate information and do not answer using your training data
    4. If information in the context is contradictory, point out the contradiction
    """),
    ("human", "Context: {context}\n\nQuestion: {question}")
])

Citation Tracing

Require the model to cite information sources in its answer, making it easy for users to verify:

prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer based on the provided context.
    At the end of your answer, cite sources using [1] [2] [3] format.
    Each citation corresponds to the document fragment number in the provided context."""),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])
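For the citation numbers to mean anything, the context passed to the prompt has to be numbered in the same way. Here is a minimal sketch of that step and of mapping citations back to chunks afterwards. It assumes chunks are plain strings; `format_context` and `resolve_citations` are hypothetical helper names, and the answer string is an invented model output.

```python
import re


def format_context(chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))


def resolve_citations(answer, chunks):
    """Map citation numbers found in the answer back to the cited chunks."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    # Ignore numbers that do not correspond to a provided chunk
    return {n: chunks[n - 1] for n in cited if 1 <= n <= len(chunks)}


chunks = [
    "LCEL connects components with the | operator.",
    "LCEL supports streaming output and batch processing.",
]
context = format_context(chunks)
answer = "LCEL chains components together with pipes [1]."  # hypothetical model output
sources = resolve_citations(answer, chunks)
print(context)
print(sources)
```

A citation number with no matching chunk (for example `[7]` when only two chunks were provided) is itself a useful hallucination signal.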

Consistency Verification

Use an independent LLM to verify whether the generated answer is consistent with the retrieval results:

def verify_answer(answer, context):
    """Verify if the answer is grounded in the context"""
    verify_prompt = ChatPromptTemplate.from_template(
        "Given the following context and answer, determine whether the answer is entirely based on the context. "
        "If the answer contains information not found in the context, flag it as 'hallucination'.\n\n"
        "Context: {context}\n\nAnswer: {answer}\n\n"
        "Judgment (consistent/hallucination):"
    )

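    # `llm` is the verifier chat model, defined elsewhere (ideally not the model that generated the answer)
    result = llm.invoke(verify_prompt.format(context=context, answer=answer))
    # Compare case-insensitively so a capitalized "Hallucination" is also caught
    return "hallucination" not in result.content.lower()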

Levels of Hallucination Suppression

Level	Method	Effectiveness	Cost
Prompt constraints	Explicitly require grounding in context	Basic protection	Zero
Citation tracing	Cite information sources	Easy to verify	Low
Consistency verification	Independent LLM verification	Strongest protection	High (extra LLM call)

Recommendation: At minimum, implement prompt constraints + citation tracing. If your scenario has zero tolerance for hallucination (e.g., medical or legal use cases), add consistency verification.

Complete Post-Retrieval Optimization Pipeline

Putting all steps together:

User asks a question
    │
    ▼
┌─────────────────────┐
│ 1. Query Rewriting   │  LLM optimizes the query
│   "How does that     │
│    framework handle   │
│    external data?"    │
│   → "How does        │
│    LangChain framework│
│    integrate external │
│    data sources"     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 2. Multi-Path Recall │  Vector + BM25 + HyDE
│   Search 3 paths in  │
│   parallel           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 3. Dedup & Fusion    │  RRF fusion ranking
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 4. Reranking         │  Cross-encoder re-ranking
│   top-20 → top-5     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 5. Context           │  LLM extracts relevant content
│   Compression        │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 6. Answer Generation │  Prompt constraints + citation
│   + Hallucination    │  tracing
│   Suppression        │
└─────────────────────┘

Not every step is needed. Choose based on your scenario:

Basic: Query rewriting + Reranking + Prompt constraints
Advanced: Multi-query rewriting + Multi-path recall + Reranking + Context compression + Citation tracing
Full: All steps + Consistency verification
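One way to keep these tiers switchable is to model each step as a function over a shared state and select the list of steps by tier name. The sketch below is only an illustration of that structure: every stage is a placeholder stub, and all names (`TIERS`, `run_pipeline`, the stage functions) are hypothetical.

```python
def rewrite_query(state):
    # Placeholder: a real stage would call the LLM rewrite chain
    return {**state, "query": state["query"].strip()}


def rerank(state):
    # Placeholder: a real stage would score chunks with a cross-encoder; here we keep the top 2
    return {**state, "chunks": state["chunks"][:2]}


def add_prompt_constraints(state):
    # Placeholder: attach the grounding rule that the final prompt will carry
    return {**state, "system_rule": "Answer based only on the provided context"}


# Only the basic tier is spelled out; the other tiers would list more stages
TIERS = {
    "basic": [rewrite_query, rerank, add_prompt_constraints],
}


def run_pipeline(state, tier):
    """Run the stages of the chosen tier in order; each takes and returns a state dict."""
    for stage in TIERS[tier]:
        state = stage(state)
    return state


result = run_pipeline({"query": "  What is LCEL?  ", "chunks": ["a", "b", "c"]}, "basic")
print(result)
```

With this shape, moving from the basic to the advanced tier means adding stages to a list rather than rewriting the call site.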

Quantifying Improvement

Based on actual experience, the improvement from each optimization step is roughly:

Optimization Step	Improvement	Latency Added	Cost Added
Query rewriting	+5-10%	+1-2s	Low
Multi-query rewriting	+5-15%	+2-4s	Medium
HyDE	+10-20%	+1-3s	Low
Hybrid search	+10-15%	+100ms	Zero
Reranking	+10-15%	+200-500ms	Low
Context compression	+5-10%	+1-3s	Medium
Hallucination suppression	+5-10%	+0-5s	Low-High

(Improvements are relative to the baseline without that step; specific values vary by data and scenario.)
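To get numbers for your own data, you need a fixed evaluation set and a metric. The table above does not say which metric it uses, so the sketch below assumes hit rate at k: the fraction of queries whose known-relevant chunk appears in the top-k results. The chunk IDs and result lists are invented for illustration.

```python
def hit_rate_at_k(retrieved_ids_per_query, relevant_id_per_query, k=5):
    """Fraction of queries whose known-relevant chunk appears in the top-k results."""
    if not relevant_id_per_query:
        return 0.0
    hits = sum(
        1
        for retrieved, relevant in zip(retrieved_ids_per_query, relevant_id_per_query)
        if relevant in retrieved[:k]
    )
    return hits / len(relevant_id_per_query)


# Hypothetical labeled set: one known-relevant chunk ID per query
relevant = ["c1", "c7", "c3", "c9"]
baseline_runs = [["c1", "c2"], ["c4", "c5"], ["c3", "c6"], ["c2", "c8"]]
rewritten_runs = [["c1", "c2"], ["c7", "c5"], ["c3", "c6"], ["c2", "c8"]]

baseline = hit_rate_at_k(baseline_runs, relevant, k=2)
improved = hit_rate_at_k(rewritten_runs, relevant, k=2)
print(f"baseline: {baseline:.2f}, with rewriting: {improved:.2f}")
```

Running the same labeled queries with and without a given step makes its contribution directly comparable across the steps in the table.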

The best-value optimizations are hybrid search, reranking, and query rewriting. These three steps offer significant performance gains with limited increases in cost and latency.
