Evaluation and Production Deployment: Taking RAG from Prototype to Production
The previous four articles covered each step of the RAG pipeline. This article covers the final questions: how do you know whether your RAG system is good, and how do you deploy it to production?
This is the most easily overlooked but most decisive step in the RAG series. Many people build a RAG system and consider it done once it "runs." But "runs" and "works well" are separated by an entire system of evaluation and continuous optimization.
Why RAG Evaluation Is So Hard
RAG evaluation is difficult because it cannot be measured by a single metric.
Traditional software has clear inputs and outputs: after running the test cases, you know whether it is right or wrong. RAG is different:
- Are the retrieval results good? Relevant documents were retrieved, but is the ranking correct? Is the quantity sufficient?
- Is the generated answer good? Based on the retrieval results, is the generated answer accurate? Complete? Free of hallucinations?
- Is the end-to-end effect good? Did the user's question ultimately receive a satisfactory answer?
These three questions correspond to three different levels of evaluation, each with multiple metrics.
Three Levels of RAG Evaluation
Retrieval Evaluation
Measures the quality of results returned by the retriever.
| Metric | Definition | Target |
|---|---|---|
| Recall@k | Proportion of all relevant documents that appear in the top-k results | Higher is better, target >90% |
| Precision@k | Proportion of top-k results that are relevant documents | Higher is better |
| MRR (Mean Reciprocal Rank) | Mean, over all queries, of the reciprocal rank of the first relevant result | Closer to 1 is better |
| MAP (Mean Average Precision) | Mean, over all queries, of average precision, which rewards ranking relevant documents higher | Closer to 1 is better |
How to calculate: these metrics require annotated data. Each test question must be annotated with a "relevant document list."
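# Annotation data example
test_cases = [
    {
        "question": "What is LCEL in LangChain?",
        "relevant_docs": ["lcel-introduction.md", "chain-tutorial.md"]
    },
    {
        "question": "How to configure LangSmith?",
        "relevant_docs": ["langsmith-setup.md", "monitoring-guide.md"]
    }
]

# Calculate Recall@5: share of the relevant documents found in the top k
def recall_at_k(retrieved_docs, relevant_docs, k=5):
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0  # avoid division by zero for unannotated questions
    retrieved_top_k = set(retrieved_docs[:k])  # a set, so duplicates count once
    relevant_count = sum(1 for doc in relevant if doc in retrieved_top_k)
    return relevant_count / len(relevant)

# Reciprocal rank for a single query; MRR is the mean of this value over all queries
def mrr(retrieved_docs, relevant_docs):
    for i, doc in enumerate(retrieved_docs):
        if doc in relevant_docs:
            return 1.0 / (i + 1)  # rank of the first relevant document
    return 0.0  # no relevant document was retrieved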
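The table also lists Precision@k and MAP, which the snippet above does not compute. Below is a minimal sketch, assuming each retrieval result is a list of document IDs in rank order. The function names and the sample data are illustrative and do not come from any library.

```python
def precision_at_k(retrieved_docs, relevant_docs, k=5):
    """Share of the top-k results that are relevant."""
    top_k = retrieved_docs[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_docs)
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def average_precision(retrieved_docs, relevant_docs):
    """Mean of the precision values taken at each rank where a relevant doc appears."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP: average of average_precision over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical retrieval output for one annotated question
retrieved = ["chain-tutorial.md", "faq.md", "lcel-introduction.md"]
relevant = ["lcel-introduction.md", "chain-tutorial.md"]
print(precision_at_k(retrieved, relevant, k=3))
print(average_precision(retrieved, relevant))
```

In this example two of the three retrieved documents are relevant, and average precision drops below 1 because an irrelevant document sits at rank 2.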
Generation Evaluation
Measures the quality of answers generated by the model based on retrieval results.
| Metric | Definition | Evaluation Method |
|---|---|---|
| Faithfulness | Whether the answer is grounded in retrieval results, free of hallucination | LLM judgment |
| Relevance | Whether the answer directly addresses the user's question | LLM judgment |
| Completeness | Whether the answer covers all key points | LLM judgment |
| Correctness | Whether the factual content of the answer is correct | Human annotation |
LLM as Evaluator: Use another LLM to evaluate the quality of generated answers.
# Assumes `llm` is a LangChain chat model created elsewhere (for example ChatOpenAI)
def evaluate_faithfulness(answer, context):
    """Evaluate whether the answer is faithful to the context"""
    prompt = f"""Evaluate whether the following answer is entirely based on the provided context.
If the answer contains information not found in the context, label it as "unfaithful".
Context: {context}
Answer: {answer}
Evaluation result (faithful/unfaithful) and reason:"""
    result = llm.invoke(prompt)
    verdict = result.content.lower()
    # "unfaithful" contains "faithful", so rule out the negative label first
    return "unfaithful" not in verdict and "faithful" in verdict

def evaluate_relevance(answer, question):
    """Evaluate whether the answer is relevant"""
    prompt = f"""Evaluate whether the following answer directly addresses the user's question.
Question: {question}
Answer: {answer}
Evaluation result (relevant/irrelevant) and reason:"""
    result = llm.invoke(prompt)
    verdict = result.content.lower()
    # "irrelevant" contains "relevant", so rule out the negative label first
    return "irrelevant" not in verdict and "relevant" in verdict
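A label-based judge is only as reliable as the code that parses its reply. Because "unfaithful" contains "faithful" and "irrelevant" contains "relevant", plain substring checks can misfire. The sketch below shows a stricter whole-word parse. `parse_verdict` is a hypothetical helper, and it assumes the judge states its label before the reasoning.

```python
import re


def parse_verdict(text, positive, negative):
    """Return True/False for the first whole-word label found, or None if neither appears.

    `positive` and `negative` are expected in lowercase.
    """
    pattern = rf"\b({re.escape(positive)}|{re.escape(negative)})\b"
    match = re.search(pattern, text.lower())
    if match is None:
        return None  # unparseable reply: route to manual review instead of guessing
    return match.group(1) == positive


print(parse_verdict("Unfaithful: the answer adds a release date", "faithful", "unfaithful"))
print(parse_verdict("Relevant. It answers the question directly.", "relevant", "irrelevant"))
```

Returning `None` for unparseable replies keeps ambiguous judgments out of the aggregate score.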
End-to-End Evaluation
Measures the final effectiveness of the entire RAG pipeline.
| Metric | Definition | Notes |
|---|---|---|
| Answer Correctness | Correctness of the final answer | The most important metric |
| Answer Relevancy | Relevance of the final answer to the question | Is it answering the wrong question? |
| Context Recall | How well retrieval results cover the correct answer | Is retrieval sufficient? |
| Context Precision | Proportion of relevant content in retrieval results | Is retrieval precise? |
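Answer Correctness normally needs an LLM judge or a human, but a cheap lexical proxy is useful for smoke tests. The sketch below computes token-level F1 between a generated answer and the ground truth, in the style of extractive QA benchmarks. This is an assumption made for illustration, not how LLM-based evaluators score correctness, and it cannot recognize paraphrases.

```python
from collections import Counter


def token_f1(answer, ground_truth):
    """Token-overlap F1 between an answer and its reference, in [0, 1]."""
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("LCEL is an expression language",
               "LCEL is the LangChain expression language"))
```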
RAGAS Evaluation Framework
RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely used RAG evaluation frameworks. It provides standardized evaluation procedures and metrics.
Installation and Usage
pip install ragas
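# The imports and dataset columns below follow the ragas 0.1.x API; newer releases may differ
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # from the langchain-openai package
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness
)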
# Prepare evaluation data
data = {
    "question": [
        "What is LCEL in LangChain?",
        "How to configure LangSmith?",
        "What is the difference between RAG and fine-tuning?"
    ],
    "answer": [
        "LCEL is LangChain's expression language...",
        "Configuring LangSmith requires setting environment variables...",
        "RAG augments generation by retrieving external knowledge..."
    ],
    "contexts": [
        ["LCEL is LangChain's expression language, using pipe symbols..."],
        ["LangSmith is LangChain's monitoring platform..."],
        ["RAG augments generation by retrieving..."]
    ],
    "ground_truth": [
        "LCEL (LangChain Expression Language) is...",
        "Steps to configure LangSmith: 1. Register an account...",
        "RAG augments models by retrieving external knowledge..."
    ]
}
dataset = Dataset.from_dict(data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_correctness
    ],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings()
)
print(results)
# {
# 'faithfulness': 0.92,
# 'answer_relevancy': 0.88,
# 'context_recall': 0.85,
# 'context_precision': 0.79,
# 'answer_correctness': 0.83
# }
Interpreting Evaluation Results
| Metric | Score | Interpretation |
|---|---|---|
| faithfulness | 0.92 | Answers are highly faithful to context; very little hallucination |
| answer_relevancy | 0.88 | Answers are highly relevant to questions |
| context_recall | 0.85 | Retrieval results cover most relevant information |
| context_precision | 0.79 | Retrieval results contain some less relevant content |
| answer_correctness | 0.83 | Answers are mostly correct, with minor inaccuracies |
How to judge overall performance? There is no universal passing score. Recommendations:
- Run an evaluation first to establish a baseline
- Re-evaluate after each change to see if metrics improved
- Focus on trends rather than absolute values
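The baseline-and-compare routine above can be sketched as a small helper that labels each metric against a stored baseline. The `tolerance` value is an assumption: LLM-judged scores fluctuate between runs, so small deltas are treated as noise here.

```python
def compare_to_baseline(baseline, current, tolerance=0.02):
    """Classify each baseline metric as improved, regressed, unchanged, or missing."""
    report = {}
    for metric, base_score in baseline.items():
        if metric not in current:
            report[metric] = "missing"
            continue
        delta = current[metric] - base_score
        if delta > tolerance:
            report[metric] = "improved"
        elif delta < -tolerance:
            report[metric] = "regressed"
        else:
            report[metric] = "unchanged"
    return report


# Hypothetical scores from two evaluation runs
baseline = {"faithfulness": 0.92, "context_precision": 0.79}
current = {"faithfulness": 0.91, "context_precision": 0.86}
print(compare_to_baseline(baseline, current))
```

Storing each run's scores next to the commit or configuration that produced them makes the trend easy to trace later.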
Building Evaluation Datasets
The key to evaluation is the dataset. A good evaluation dataset should cover different types of queries.
Query Classification
| Type | Example | Proportion |
|---|---|---|
| Fact queries | "Who is the founder of LangChain?" | 30% |
| Concept understanding | "What is the difference between LCEL and traditional Chains?" | 25% |
| Process queries | "How to configure LangSmith?" | 20% |
| Comparison | "Which is better, RAG or fine-tuning?" | 15% |
| Boundary testing | "What can LangChain do and not do?" | 10% |
Dataset Size
- Minimum viable: 20-30 questions, covering main query types
- Standard evaluation: 100-200 questions, covering various scenarios
- Comprehensive evaluation: 500+ questions, covering edge cases
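To turn the proportions from the query classification table into concrete counts for a chosen dataset size, one option is largest-remainder rounding, sketched below. The short type names in `QUERY_MIX` are labels invented for this example.

```python
# Proportions taken from the query classification table
QUERY_MIX = {
    "fact": 0.30,
    "concept": 0.25,
    "process": 0.20,
    "comparison": 0.15,
    "boundary": 0.10,
}


def allocate_questions(total, mix=QUERY_MIX):
    """Split `total` questions across query types so the counts sum to `total`."""
    raw = {name: total * share for name, share in mix.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining questions to the types with the largest fractional parts
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(allocate_questions(100))
print(allocate_questions(30))
```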
Annotation Methods
Manual annotation: Most accurate but time-consuming. Each question must be annotated with:
- Relevant document list (for retrieval evaluation)
- Standard answer (for generation evaluation)
LLM-assisted annotation: Use an LLM to generate initial annotations, then have a human review and correct them. This is much more efficient than annotating from scratch.
# LLM-assisted ground truth generation
def generate_ground_truth(question, relevant_docs):
    prompt = f"""Based on the following reference documents, answer the user's question.
The answer should be accurate, complete, and concise.
Reference documents: {relevant_docs}
Question: {question}
Ground truth answer:"""
    return llm.invoke(prompt).content
Production Deployment Architecture
Basic Architecture
┌─────────────────────────────────────────────────────┐
│ RAG Production System │
│ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ User │────▶│ API Gateway │ │
│ │ Query │ └──────┬───────┘ │
│ └──────────┘ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ RAG Pipeline │ │
│ │ │ │
│ │ Query Rewrite → Multi-Path Recall → │ │
│ │ Dedup & Fusion → Reranking → │ │
│ │ Context Compression → Generation → │ │
│ │ Hallucination Check │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ ┌──────────┼──────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Vector │ │ BM25 │ │ LLM │ │
│ │ Database │ │ Index │ │ Service │ │
│ │(Qdrant) │ │(Elastic) │ │(API) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Monitoring & Logging │ │
│ │ Retrieval Latency / Generation │ │
│ │ Quality / User Feedback │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Key Components
API Gateway: Receives user requests, routes to the RAG pipeline, returns results.
RAG Pipeline: The complete RAG processing flow, including query rewriting, retrieval, reranking, and generation.
Vector Database: Stores and retrieves vectors. Qdrant or Milvus are recommended for production.
BM25 Index: Keyword search. Can be served by Elasticsearch or maintained as a separate index.
LLM Service: Generates answers. Can be OpenAI API, local models, or self-hosted services.
Monitoring System: Records latency, retrieval results, generated answers, and user feedback for each request.
Performance Optimization
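Caching: Identical queries return cached results directly. Matching merely similar queries requires an extra step, such as normalizing the query or comparing embeddings.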
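from collections import OrderedDict
import hashlib

class RAGCache:
    def __init__(self, max_size=10000):
        self.cache = OrderedDict()  # entry order doubles as recency order
        self.max_size = max_size

    def get_cache_key(self, query):
        # MD5 serves only as a compact cache key here, not as a security measure
        return hashlib.md5(query.encode()).hexdigest()

    def get(self, query):
        key = self.get_cache_key(query)
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def set(self, query, result):
        key = self.get_cache_key(query)
        if key not in self.cache and len(self.cache) >= self.max_size:
            # LRU eviction: drop the least recently used entry
            self.cache.popitem(last=False)
        self.cache[key] = result
        self.cache.move_to_end(key)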
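Hashing the raw query string means that "What is LCEL?" and "what is  lcel" occupy different cache slots. A light normalization step before hashing recovers some of these near-duplicates. True semantic matching would need embedding similarity, which this sketch does not attempt. `normalize_query` and `cache_key` are hypothetical helpers.

```python
import hashlib
import re


def normalize_query(query):
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", query.strip().lower())
    return text.rstrip("?!. ")


def cache_key(query):
    # Hash of the normalized text; any stable hash function would work here
    return hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()


print(normalize_query("  What is   LCEL? "))
print(cache_key("What is LCEL?") == cache_key("what is  lcel"))
```

Normalization that is too aggressive can merge queries that deserve different answers, so it is worth checking the rules against real traffic.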
Async Processing: Independent steps, such as vector search and BM25 search, can run concurrently instead of one after another.
import asyncio

async def rag_pipeline(query):
    # 1. Query rewriting (async)
    rewritten_query = await rewrite_query(query)
    # 2. Multi-path recall (parallel)
    vector_task = asyncio.create_task(vector_search(rewritten_query))
    bm25_task = asyncio.create_task(bm25_search(rewritten_query))
    vector_results, bm25_results = await asyncio.gather(vector_task, bm25_task)
    # 3. Dedup and fusion
    merged = merge_results(vector_results, bm25_results)
    # 4. Reranking
    reranked = await rerank(merged, rewritten_query)
    # 5. Answer generation
    answer = await generate_answer(reranked, query)
    return answer
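The pipeline above leaves `merge_results` undefined. One common choice for the dedup-and-fusion step is Reciprocal Rank Fusion (RRF), which needs only ranks and so works even when vector and BM25 scores are not comparable. This is a sketch of that technique, not necessarily the method intended by the pipeline above. It assumes each result list holds document IDs in rank order, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*result_lists, k=60):
    """Fuse ranked lists of document IDs; duplicates merge by summing their scores."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of the two retrievers
vector_results = ["lcel-intro.md", "chain-tutorial.md", "faq.md"]
bm25_results = ["chain-tutorial.md", "changelog.md"]
print(reciprocal_rank_fusion(vector_results, bm25_results))
```

The document found by both retrievers moves to the front, ahead of the document that only one retriever ranked first.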
Batching: Multiple queries can be processed in batches to increase throughput.
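At the application level, one way to read batching is to run a group of queries concurrently while capping the number in flight, so downstream services are not overwhelmed. Below is a minimal sketch under that assumption. `answer_batch` and `fake_pipeline` are illustrative names, and `fake_pipeline` stands in for the real pipeline so the example runs without external services.

```python
import asyncio


async def answer_batch(queries, pipeline, max_concurrency=4):
    """Run `pipeline` over many queries, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(query):
        async with semaphore:
            return await pipeline(query)

    # gather preserves input order, so answers line up with queries
    return await asyncio.gather(*(run_one(q) for q in queries))


async def fake_pipeline(query):
    # Stand-in for a real RAG pipeline call
    await asyncio.sleep(0.01)
    return f"answer to: {query}"


answers = asyncio.run(answer_batch(["q1", "q2", "q3"], fake_pipeline, max_concurrency=2))
print(answers)
```

Batching at the model level, where several prompts go out in one request, depends on the LLM provider and is not shown here.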
Degradation Strategies
Production environments need to account for various failure scenarios:
import logging

logger = logging.getLogger(__name__)

# VectorDBError and LLMError are placeholders for the exceptions raised by
# your vector database client and LLM client
async def rag_with_fallback(query):
    try:
        # Normal flow
        return await rag_pipeline(query)
    except VectorDBError:
        # Vector DB unavailable, degrade to keyword-only search
        logger.warning("Vector DB unavailable, falling back to BM25")
        return await bm25_only_pipeline(query)
    except LLMError:
        # LLM unavailable, return retrieved document summaries
        logger.warning("LLM unavailable, returning retrieved docs")
        return await retrieve_only(query)
    except Exception as e:
        # Other errors, return friendly message
        logger.error(f"RAG pipeline error: {e}")
        return "Sorry, the system is temporarily unable to answer this question. Please try again later."
Monitoring and Continuous Optimization
Monitoring Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Retrieval Latency | Time from query to returning retrieval results | >500ms |
| Generation Latency | Time from retrieval to generating answer | >3s |
| End-to-End Latency | Time from user question to receiving answer | >5s |
| Cache Hit Rate | Proportion of cache hits | <30% |
| User Satisfaction | User feedback (upvote/downvote) | Downvote rate >20% |
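The thresholds in the table can be encoded as data and checked against each metrics snapshot. This is a sketch. The latency key names mirror the per-request log fields used in this article, while `cache_hit_rate` and `downvote_rate` are key names assumed for this example.

```python
# Thresholds copied from the table above: (comparison, limit)
ALERT_RULES = {
    "retrieval_latency_ms": (">", 500),
    "generation_latency_ms": (">", 3000),
    "total_latency_ms": (">", 5000),
    "cache_hit_rate": ("<", 0.30),
    "downvote_rate": (">", 0.20),
}


def check_alerts(metrics, rules=ALERT_RULES):
    """Return the names of metrics that breach their alert threshold."""
    breached = []
    for name, (op, limit) in rules.items():
        if name not in metrics:
            continue  # metric not reported in this snapshot
        value = metrics[name]
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            breached.append(name)
    return breached


snapshot = {"retrieval_latency_ms": 620, "generation_latency_ms": 1800, "cache_hit_rate": 0.12}
print(check_alerts(snapshot))
```

In practice these checks usually run on aggregates such as p95 latency over a time window rather than on single requests.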
Logging
Every request should be logged:
log_entry = {
    "timestamp": "2026-06-05T10:30:00Z",
    "query": "What is LCEL in LangChain?",
    "rewritten_query": "LangChain LCEL expression language definition and usage",
    "retrieved_docs": ["lcel-intro.md", "chain-tutorial.md"],
    "reranked_docs": ["lcel-intro.md"],
    "answer": "LCEL is LangChain's expression language...",
    "metrics": {
        "retrieval_latency_ms": 120,
        "generation_latency_ms": 1800,
        "total_latency_ms": 1950,
        "context_precision": 0.85,
        "faithfulness": 0.92
    },
    "user_feedback": "upvote"  # or "downvote"
}
Continuous Optimization Loop
Deploy to Production
│
▼
Collect user queries and feedback
│
▼
Run evaluation periodically (weekly/monthly)
│
▼
Analyze weak points
│
▼
Targeted optimization (splitting/generation/retrieval)
│
▼
A/B test to verify
│
▼
Deploy new version
│
└──→ Back to step 1
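For the A/B test step in the loop above, each user should consistently see the same pipeline version. A common approach is deterministic hash-based bucketing, sketched below. The function name, the experiment label, and the 50/50 split are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, experiment="rag-v2", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same group for a given experiment
print(assign_variant("user-42") == assign_variant("user-42"))
```

Including the experiment name in the hash input keeps assignments independent across experiments.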
Leveraging User Feedback
User feedback is a goldmine for RAG optimization:
- Downvoted answers: Analyze whether it's a retrieval issue (failed to find relevant documents) or a generation issue (found them but answered poorly)
- High-frequency queries: Optimizing these yields the greatest benefit
- No-result queries: These indicate gaps in the knowledge base; documents need to be added
# Analyze downvote reasons
def analyze_downvote(query, retrieved_docs, answer, context):
    if not retrieved_docs:
        return "no_results"  # Knowledge base has no relevant docs
    # evaluate_relevance and evaluate_faithfulness return booleans, not scores
    if not evaluate_relevance(answer, query):
        return "irrelevant_answer"  # Retrieved but answer is irrelevant
    if not evaluate_faithfulness(answer, context):
        return "hallucination"  # Hallucination detected
    return "other"  # Other reasons
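The three feedback signals listed above can also be pulled straight from the request logs. This is a sketch, assuming log entries shaped like the `log_entry` example in the Logging section. `summarize_feedback` is a hypothetical helper, and grouping queries by lowercased text is a simplification.

```python
from collections import Counter


def summarize_feedback(log_entries, top_n=3):
    """Compute the downvote rate, most frequent queries, and no-result queries."""
    rated = [e for e in log_entries if e.get("user_feedback") in ("upvote", "downvote")]
    downvotes = sum(1 for e in rated if e["user_feedback"] == "downvote")
    query_counts = Counter(e["query"].strip().lower() for e in log_entries)
    return {
        "downvote_rate": downvotes / len(rated) if rated else 0.0,
        "top_queries": query_counts.most_common(top_n),
        "no_result_queries": [e["query"] for e in log_entries if not e.get("retrieved_docs")],
    }


# Hypothetical log sample
logs = [
    {"query": "What is LCEL?", "retrieved_docs": ["lcel-intro.md"], "user_feedback": "upvote"},
    {"query": "what is lcel?", "retrieved_docs": ["lcel-intro.md"], "user_feedback": "downvote"},
    {"query": "Pricing?", "retrieved_docs": [], "user_feedback": None},
]
print(summarize_feedback(logs))
```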
LangSmith Integration
LangSmith is LangChain's official monitoring platform that can trace every step of the RAG pipeline.
Basic Configuration
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-project"
# LangChain code is called normally; all calls are automatically traced
Viewing RAG Chains in LangSmith
LangSmith records the complete chain of each RAG request:
User Query → Query Rewrite → Vector Retrieval → BM25 Retrieval → Dedup & Fusion → Reranking → Generation
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ Output + Token count
│ │ │ │ │ └─ Reranked results
│ │ │ │ └─ Merged deduplicated results
│ │ │ └─ BM25 returned results
│ │ └─ Vector retrieval returned results
│ └─ Rewritten query
└─ Original query
The latency, input/output, and token consumption of each step are clearly visible.
Evaluation in LangSmith
LangSmith has built-in evaluation capabilities, allowing you to run evaluations directly on the platform:
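import asyncio

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Create evaluation dataset
dataset = client.create_dataset(
    dataset_name="rag-evaluation",
    description="RAG system evaluation dataset"
)

# Add test cases
client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is LCEL in LangChain?"},
        {"question": "How to configure LangSmith?"},
    ],
    outputs=[
        {"answer": "LCEL is LangChain's expression language..."},
        {"answer": "Configuring LangSmith requires..."},
    ]
)

# The target receives each example's inputs dict. rag_pipeline (defined earlier)
# is async, so it is wrapped in a synchronous function here.
def target(inputs):
    return {"answer": asyncio.run(rag_pipeline(inputs["question"]))}

# Evaluators are callables that receive the run and the dataset example.
# This one reuses evaluate_relevance from the generation evaluation section.
def relevance(run, example):
    verdict = evaluate_relevance(run.outputs["answer"], example.inputs["question"])
    return {"key": "relevance", "score": int(verdict)}

# Run evaluation
results = evaluate(
    target,
    data="rag-evaluation",
    evaluators=[relevance],
    experiment_prefix="rag-v1"
)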
Summary
RAG system evaluation and continuous optimization is an engineering system, not a one-time task.
Key takeaways:
- Three levels of evaluation: Retrieval evaluation, generation evaluation, and end-to-end evaluation
- RAGAS is a widely used evaluation framework that provides standardized metrics
- Production deployment requires considering: Caching, async processing, degradation, and monitoring
- Continuous optimization is the core: Collect feedback → Periodic evaluation → Targeted optimization → A/B testing → Deploy
- LangSmith is a common choice for RAG monitoring in the LangChain ecosystem, tracing the input/output and latency of every step
This completes all 5 articles in the RAG series. A recap of the entire series:
- RAG Pipeline Overview — Full pipeline, component selection, real system analysis, common pitfalls
- Document Splitting — Five splitting strategies, best practices for different document types, parameter selection
- Vector Retrieval — Embedding models, vector databases, hybrid search, reranking
- Post-Retrieval Optimization — Query rewriting, HyDE, multi-path recall, context compression, hallucination suppression
- Evaluation and Production Deployment — Evaluation metrics, RAGAS, production architecture, monitoring, continuous optimization