Post-Retrieval Optimization: Making RAG Results More Precise
The previous three articles covered document splitting, vector retrieval, and hybrid search. Suppose you've already split your chunks, chosen your embedding model, and set up hybrid search. Now a user asks a question, and you've retrieved the top-k results. What next?
Many people assume the work is done at this point and stuff the results straight into a prompt for the model to answer. In reality, there is a series of optimization steps between "retrieval results" and "stuffing into the prompt," and these steps often determine the final performance of the RAG system.
Query Rewriting
The user's question is often not the optimal retrieval query.
- Vague phrasing: "How do you use that thing?" — what is "that thing"?
- Overly broad: "Tell me about LangChain" — what aspect?
- Contains multiple sub-questions: "What is the difference between LangChain and LlamaIndex, and what scenarios are they each suited for?" — this is actually two questions.
- Wording doesn't match the documents: The user says "how to apply for time off," but the documents say "employee leave request process."
Query Rewriting optimizes the user's question before retrieval, making the search more precise.
Basic Rewriting: Let the LLM Optimize the Query
The simplest approach: let the LLM rewrite the user's question into a more precise, more specific search query.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
rewrite_prompt = ChatPromptTemplate.from_template(
"You are a search optimization assistant. Rewrite the user's question into a more precise, more specific search query. "
"Only output the rewritten query, nothing else.\n\nUser question: {query}"
)
rewrite_chain = rewrite_prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
# Original query
original_query = "How does that framework handle external data?"
# Rewritten query
rewritten_query = rewrite_chain.invoke({"query": original_query})
# → "How does the LangChain framework integrate external data sources (PDF, databases, API) for retrieval-augmented generation (RAG)"
# Search with the rewritten query
results = retriever.invoke(rewritten_query)
Multi-Query Rewriting: Ask from Multiple Angles
The same question, rewritten from different angles, generating multiple queries and merging retrieval results.
from langchain.retrievers import MultiQueryRetriever
# Automatically generate queries from 3 different angles
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=ChatOpenAI(model="gpt-4o-mini"),
    # MultiQueryRetriever passes the user's query to the prompt as {question}
    prompt=ChatPromptTemplate.from_template(
        "The user's question is: {question}\n"
        "Please generate 3 search queries from different angles, one per line."
    ),
)
# Original query: "What are the core modules of LangChain?"
# Might generate:
# 1. "LangChain six core modules LLM Chain LCEL RAG Agent Memory"
# 2. "LangChain module architecture and function overview"
# 3. "What problems does each LangChain module solve and use cases"
results = multi_query_retriever.invoke("What are the core modules of LangChain?")
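Question Decomposition
For complex questions, break the question into sub-questions, retrieve for each one separately, then merge the results. (This is often confused with step-back prompting, which instead rewrites the question into a more general, higher-level one before retrieving.)
# Original question (complex)
query = "What is the performance difference between LangChain and LlamaIndex for large-scale document retrieval?"
# Decompose into sub-questions (hand-written here; in practice an LLM generates them)
sub_queries = [
    "How does LangChain handle large-scale document retrieval?",
    "How does LlamaIndex handle large-scale document retrieval?",
    "LangChain vs LlamaIndex performance benchmark comparison",
]
# Retrieve separately
all_results = []
for sub_query in sub_queries:
    results = retriever.invoke(sub_query)
    all_results.extend(results)

def deduplicate(docs):
    """Drop chunks with identical text, keeping the first occurrence"""
    seen, unique = set(), []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

# Deduplicate and merge
final_results = deduplicate(all_results)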
When to Use Query Rewriting
| Scenario | Rewriting Method | Effect |
|---|---|---|
| Vague phrasing | Basic rewriting | Makes the query more specific |
| Overly broad | Multi-query rewriting | Covers from multiple angles |
| Contains multiple sub-questions | Question decomposition | Retrieve separately, merge results |
| Wording mismatch | Synonym expansion | Matches wording in documents |
My recommendation: Start with basic rewriting; if performance is insufficient, add multi-query rewriting. Question decomposition is suitable for complex queries but increases latency and cost.
HyDE: Hypothetical Document Embeddings
HyDE (Hypothetical Document Embeddings) is a clever retrieval optimization technique. Its core idea: don't match questions to chunks — match "hypothetical answers" to chunks.
Traditional Retrieval vs. HyDE
Traditional retrieval: Vectorize the user's question → find the most similar chunks in the vector database.
The problem: questions and chunks are often phrased differently. A user asks "how to apply for time off," but documents say "employee leave request procedure." The two texts are semantically related, but their vectors may not be very close.
HyDE:
- User asks a question
- Let the LLM generate a "hypothetical answer" (it doesn't need to be true, just semantically close to the real answer in vector space)
- Vectorize the hypothetical answer
- Find chunks most similar to the hypothetical answer in the vector database
Why does it work? Because hypothetical answers and document chunks are both "declarative text" — their phrasing is more similar, and their distances in vector space are also closer.
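The intuition above can be illustrated with a toy sketch. It uses a bag-of-words vector as a crude stand-in for a real embedding model, so it only shows the direction of the effect, not realistic similarity values. The question, hypothetical answer, and chunk texts are made up for illustration, and the hypothetical answer is hand-written rather than produced by an LLM.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

question = "how to apply for time off"
# Hypothetical answer an LLM might write (hand-written here)
hypothetical_answer = (
    "To take time off, an employee submits a leave request form "
    "to their manager for approval."
)
document_chunk = (
    "Employee leave request process: submit the leave request form "
    "to your manager for approval."
)

doc_vec = bow_vector(document_chunk)
question_sim = cosine(bow_vector(question), doc_vec)
hyde_sim = cosine(bow_vector(hypothetical_answer), doc_vec)
print(f"question vs chunk:            {question_sim:.2f}")
print(f"hypothetical answer vs chunk: {hyde_sim:.2f}")
```

With these sample texts the hypothetical answer scores higher against the chunk than the raw question does. A real embedding model narrows the gap for paraphrases, but the same tendency is what HyDE relies on.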
Implementing HyDE
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
# HyDE prompt
hyde_prompt = ChatPromptTemplate.from_template(
"The user asked a question. Please generate a hypothetical answer paragraph that "
"would contain the possible answer to the user's question. "
"Only output the hypothetical answer, nothing else.\n\nQuestion: {query}"
)
# Generate hypothetical answer
hyde_chain = hyde_prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
hypothetical_answer = hyde_chain.invoke({"query": "What is LCEL in LangChain?"})
# → "LCEL (LangChain Expression Language) is LangChain's expression language, using pipe "
# "symbols | to connect components to define AI processing pipelines. For example, "
# "prompt | model | parser defines a complete AI processing chain. LCEL supports "
# "streaming output, batch processing, and async execution."
# Search with the hypothetical answer
results = retriever.invoke(hypothetical_answer)
When HyDE Works Best
HyDE is particularly effective in these scenarios:
- Large phrasing differences between user questions and documents
- Short, vague queries
- Documents in specialized domains where users ask questions in everyday language
But HyDE also has costs: an additional LLM call (increased latency and cost), and the quality of the hypothetical answer directly affects retrieval performance.
Multi-Path Recall
Different retrievers excel at different things. Multi-path recall uses multiple retrievers in parallel and merges the results.
Why Multi-Path Recall Is Needed
| Retriever | Excels At | Struggles With |
|---|---|---|
| Vector search | Semantic matching | Exact keywords |
| BM25 | Exact keywords | Semantic understanding |
| Graph search | Entity relationships | Free text |
| Knowledge graph | Structured knowledge | Unstructured text |
Implementing Multi-Path Recall
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Path 1: Vector search
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Path 2: BM25 keyword search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Path 3: HyDE search
# (Search with hypothetical answer, see previous section)
# Merge the vector and BM25 paths (the HyDE path above is not wired into this ensemble)
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.5, 0.5],
)
results = ensemble_retriever.invoke("What is LCEL in LangChain?")
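To make the "in parallel" part concrete, here is a minimal sketch that runs several retrieval paths concurrently with a thread pool. The retrievers are hypothetical stand-in functions that return ranked chunk ids. With LangChain retrievers you could plausibly pass each retriever's `invoke` method as the callable, but that wiring is an assumption and is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

def recall_in_parallel(query, retrievers):
    """Run every retrieval path on the same query concurrently.

    `retrievers` maps a path name to a callable that takes the query and
    returns a ranked list of chunk ids. Returns {path_name: ranked_ids}.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in retrievers.items()}
        return {name: future.result() for name, future in futures.items()}

# Hypothetical stand-ins for real retrievers
def fake_vector_search(query):
    return ["chunk_3", "chunk_1", "chunk_7"]

def fake_bm25_search(query):
    return ["chunk_1", "chunk_9"]

paths = recall_in_parallel(
    "What is LCEL in LangChain?",
    {"vector": fake_vector_search, "bm25": fake_bm25_search},
)
print(paths)
```

Threads fit here because each path mostly waits on I/O (a vector database, an LLM call for HyDE), so total latency is close to that of the slowest path rather than the sum of all paths.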
Deduplication and Fusion
Multi-path recall results need deduplication. The same chunk may be returned by multiple retrievers and needs score merging.
def merge_results(retriever_results_list):
    """Merge multi-path retrieval results, deduplicate, and re-rank"""
    # Assumes each retriever stored a score in metadata["score"] and that the
    # scores are on a comparable scale; a missing score counts as 0.
    chunk_map = {}  # {chunk_id: {"chunk": chunk, "scores": [...]}}
    for results in retriever_results_list:
        for chunk in results:
            chunk_id = chunk.metadata.get("chunk_id", hash(chunk.page_content))
            if chunk_id not in chunk_map:
                chunk_map[chunk_id] = {"chunk": chunk, "scores": []}
            chunk_map[chunk_id]["scores"].append(chunk.metadata.get("score", 0))
    # Merge scores: this version takes the max; averaging is the other common choice
    merged = []
    for item in chunk_map.values():
        max_score = max(item["scores"])
        item["chunk"].metadata["score"] = max_score
        merged.append(item["chunk"])
    return sorted(merged, key=lambda x: x.metadata["score"], reverse=True)
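Raw scores from different retrievers (cosine similarity, BM25) are usually not on the same scale, so a rank-based fusion is often safer than comparing scores directly. Below is a minimal sketch of Reciprocal Rank Fusion (RRF), the fusion method named in the pipeline diagram later in this article. It works on ranked lists of chunk ids, assumes each id appears at most once per list, and uses k=60, a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Chunks returned by both paths ("a" and "b") rise to the top, and deduplication comes for free because scores are accumulated per chunk id.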
Context Compression
Retrieved chunks may contain large amounts of content irrelevant to the query. Context compression extracts the parts most relevant to the query, reducing noise and saving tokens.
LLM Compression
Use an LLM to extract query-relevant parts from chunks:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
# Compression prompt; LLMChainExtractor fills in {question} and {context} for each chunk
compress_prompt = PromptTemplate.from_template(
    "Given the following document excerpt and user question, extract the parts most relevant to the user's question. "
    "Keep only the content that directly answers the question, removing irrelevant information.\n\n"
    "User question: {question}\n\nDocument excerpt: {context}\n\nExtraction result:"
)
compressor = LLMChainExtractor.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    prompt=compress_prompt,
)
# Compression retriever
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=base_retriever
)
# Search and compress
compressed_results = compression_retriever.invoke("What is LCEL in LangChain?")
# Returns compressed chunks containing only query-relevant parts
Truncation Compression
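A simpler approach: keep only the first N tokens of each chunk. The version below approximates tokens with whitespace-separated words, so the real token count will differ somewhat.
def truncate_chunks(chunks, max_tokens_per_chunk=300):
    """Simple truncation: keep roughly the first N tokens (counted as words) of each chunk"""
    truncated = []
    for chunk in chunks:
        # split() counts words, not model tokens, and collapses line breaks
        tokens = chunk.page_content.split()[:max_tokens_per_chunk]
        chunk.page_content = " ".join(tokens)  # note: modifies the chunk in place
        truncated.append(chunk)
    return truncated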
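A hard cut can end a chunk mid-sentence. One possible variant, sketched below, keeps whole sentences until the budget is used up. This is a small variation on the truncation idea and is not something the approach above requires. The sentence splitter is a naive regex, words again stand in for tokens, and the function name and sample text are made up for illustration.

```python
import re

def truncate_at_sentence(text, max_words=300):
    """Keep whole sentences from the start of `text` within a word budget.

    Words are whitespace-separated, an approximation of model tokens.
    If even the first sentence exceeds the budget, fall back to a hard cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_words:
            break
        kept.append(sentence)
        used += n
    if not kept:  # first sentence alone is over budget
        return " ".join(text.split()[:max_words])
    return " ".join(kept)

sample = (
    "LCEL connects components with pipes. It supports streaming. "
    "It also supports batch and async execution."
)
print(truncate_at_sentence(sample, max_words=8))
```

With a budget of 8 words the first two sentences are kept and the third is dropped entirely, so the compressed chunk still reads as complete text.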
Compression Trade-offs
| Method | Pros | Cons |
|---|---|---|
| LLM compression | Precise, keeps only relevant content | Slow, expensive, adds latency |
| Truncation compression | Fast, cheap | May lose key information |
| No compression | Preserves complete context | More noise, wastes tokens |
Recommendation: Consider LLM compression if average chunk length exceeds 500 tokens. If chunks are already short (200-300 tokens), compression is unnecessary.
Hallucination Suppression
One of the biggest pain points in RAG is hallucination — the model doesn't answer based on retrieval results, but instead fabricates information.
Prompt Constraints
The simplest method, and the first one to put in place: explicitly require in the prompt that the model answer based only on the retrieval results.
prompt = ChatPromptTemplate.from_messages([
("system", """You are a retrieval-based Q&A assistant.
Rules:
1. Answer based only on the provided context
2. If the context doesn't contain relevant information, explicitly state "Based on the provided information, I cannot answer this question"
3. Do not fabricate information and do not answer using your training data
4. If information in the context is contradictory, point out the contradiction
"""),
("human", "Context: {context}\n\nQuestion: {question}")
])
Citation Tracing
Require the model to cite information sources in its answer, making it easy for users to verify:
prompt = ChatPromptTemplate.from_messages([
("system", """Answer based on the provided context.
At the end of your answer, cite sources using [1] [2] [3] format.
Each citation corresponds to the document fragment number in the provided context."""),
("human", "Context:\n{context}\n\nQuestion: {question}")
])
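For the citation numbers to mean anything, the fragments in the context must be numbered before they are passed in, and it is worth checking that the answer only cites numbers that exist. A minimal sketch of both steps follows. The helper names are hypothetical, and the answer string is a hand-written stand-in for model output.

```python
import re

def format_numbered_context(fragments):
    """Prefix each fragment with [n] so the model has numbers to cite."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(fragments, start=1))

def check_citations(answer, num_fragments):
    """Return (cited, invalid): numbers cited in the answer, and those out of range."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_fragments]
    return cited, invalid

fragments = [
    "LCEL is LangChain's expression language.",
    "LCEL supports streaming and batch processing.",
]
context = format_numbered_context(fragments)

# Hand-written stand-in for a model answer; [5] does not exist in the context
answer = "LCEL is an expression language that supports streaming [1] [2] [5]."
cited, invalid = check_citations(answer, len(fragments))
print(context)
print(cited, invalid)  # [1, 2, 5] [5]
```

The `context` string is what would fill `{context}` in the prompt above. A non-empty `invalid` list is a cheap signal that the model invented a source, even before any LLM-based verification.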
Consistency Verification
Use an independent LLM to verify whether the generated answer is consistent with the retrieval results:
llm = ChatOpenAI(model="gpt-4o-mini")  # verifier model; ideally not the one that wrote the answer
def verify_answer(answer, context):
    """Verify if the answer is grounded in the context"""
    verify_prompt = ChatPromptTemplate.from_template(
        "Given the following context and answer, determine whether the answer is entirely based on the context. "
        "If the answer contains information not found in the context, flag it as 'hallucination'.\n\n"
        "Context: {context}\n\nAnswer: {answer}\n\n"
        "Judgment (consistent/hallucination):"
    )
    result = llm.invoke(verify_prompt.format_messages(context=context, answer=answer))
    # Case-insensitive check so a reply such as "Hallucination" is still caught
    return "hallucination" not in result.content.lower()
Levels of Hallucination Suppression
| Method | How It Works | Effectiveness | Cost |
|---|---|---|---|
| Prompt constraints | Explicitly require grounding in context | Basic protection | Zero |
| Citation tracing | Cite information sources | Easy to verify | Low |
| Consistency verification | Independent LLM verification | Strongest protection | High (extra LLM call) |
Recommendation: At minimum, implement prompt constraints + citation tracing. If your scenario has zero tolerance for hallucination (e.g., medical or legal applications), add consistency verification.
Complete Post-Retrieval Optimization Pipeline
Putting all steps together:
User asks a question
│
▼
┌─────────────────────┐
│ 1. Query Rewriting │ LLM optimizes the query
│ "How does that │
│ framework handle │
│ external data?" │
│ → "How does │
│ LangChain framework│
│ integrate external │
│ data sources" │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ 2. Multi-Path Recall │ Vector + BM25 + HyDE
│ Search 3 paths in │
│ parallel │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ 3. Dedup & Fusion │ RRF fusion ranking
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ 4. Reranking │ Cross-encoder re-ranking
│ top-20 → top-5 │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ 5. Context │ LLM extracts relevant content
│ Compression │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ 6. Answer Generation │ Prompt constraints + citation
│ + Hallucination │ tracing
│ Suppression │
└─────────────────────┘
Not every step is needed. Choose based on your scenario:
- Basic: Query rewriting + Reranking + Prompt constraints
- Advanced: Multi-query rewriting + Multi-path recall + Reranking + Context compression + Citation tracing
- Full: All steps + Consistency verification
Quantifying Improvement
Based on actual experience, the improvement from each optimization step is roughly:
| Optimization Step | Improvement | Latency Added | Cost Added |
|---|---|---|---|
| Query rewriting | +5-10% | +1-2s | Low |
| Multi-query rewriting | +5-15% | +2-4s | Medium |
| HyDE | +10-20% | +1-3s | Low |
| Hybrid search | +10-15% | +100ms | Zero |
| Reranking | +10-15% | +200-500ms | Low |
| Context compression | +5-10% | +1-3s | Medium |
| Hallucination suppression | +5-10% | +0-5s | Low-High |
(Improvements are relative to the baseline without that step; specific values vary by data and scenario.)
The best-value optimizations are hybrid search, reranking, and query rewriting. These three steps offer significant performance gains with limited cost and latency increases.
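Since the numbers above vary by data and scenario, it helps to measure each step on your own evaluation set. The sketch below assumes a simple hit-rate@k metric and reads "improvement" as a relative gain over the baseline. The article does not specify which metric its percentages refer to, so both choices are assumptions, and the query and chunk ids are made-up sample data.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k results contain at least one relevant chunk id.

    retrieved: {query: ranked list of chunk ids}
    relevant:  {query: set of chunk ids judged relevant}
    """
    if not relevant:
        return 0.0
    hits = sum(
        1 for query, gold in relevant.items()
        if gold & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)

def relative_improvement(baseline, variant):
    """Relative change of variant over baseline, e.g. 0.5 means +50%."""
    if baseline == 0:
        raise ValueError("baseline is 0; relative improvement is undefined")
    return (variant - baseline) / baseline

# Made-up evaluation data: which chunk answers each query
relevant = {"q1": {"a"}, "q2": {"b"}, "q3": {"c"}, "q4": {"d"}}
baseline_run = {"q1": ["a", "x"], "q2": ["y", "z"], "q3": ["c"], "q4": ["x"]}
reranked_run = {"q1": ["a"], "q2": ["b", "y"], "q3": ["c"], "q4": ["x"]}

base = hit_rate_at_k(baseline_run, relevant)
variant = hit_rate_at_k(reranked_run, relevant)
print(base, variant, relative_improvement(base, variant))  # 0.5 0.75 0.5
```

Running the same evaluation set with one step switched on at a time gives a per-step table like the one above, but for your own documents and queries.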