Evaluation and Production Deployment: Taking RAG from Prototype to Production

The previous four articles covered each step of the RAG pipeline. This article covers the final questions: how do you know whether your RAG system is good, and how do you deploy it to production?

This is the most easily overlooked but most decisive step in the RAG series. Many people build a RAG system and consider it done once it "runs." But "runs" and "works well" are separated by an entire system of evaluation and continuous optimization.

Why RAG Evaluation Is So Hard

RAG evaluation is difficult because it cannot be measured by a single metric.

Traditional software has clear inputs and outputs — after running test cases, you know whether it's right or wrong. RAG is different:

Are the retrieval results good? Relevant documents were retrieved, but is the ranking correct? Is the quantity sufficient?
Is the generated answer good? Based on the retrieval results, is the generated answer accurate? Complete? Free of hallucinations?
Is the end-to-end effect good? Did the user's question ultimately receive a satisfactory answer?

These three questions correspond to three different levels of evaluation, each with multiple metrics.

Three Levels of RAG Evaluation

Retrieval Evaluation

Measures the quality of results returned by the retriever.

Metric	Definition	Target
Recall@k	Proportion of all relevant documents that appear in the top-k results	Higher is better, target >90%
Precision@k	Proportion of top-k results that are relevant documents	Higher is better
MRR (Mean Reciprocal Rank)	Mean, across queries, of the reciprocal rank of the first relevant result	Closer to 1 is better
MAP (Mean Average Precision)	Mean, across queries, of average precision, which rewards ranking relevant documents higher	Closer to 1 is better

How to calculate: These metrics require annotated data. Each test question must be annotated with a "relevant document list."

# Annotation data example
test_cases = [
    {
        "question": "What is LCEL in LangChain?",
        "relevant_docs": ["lcel-introduction.md", "chain-tutorial.md"]
    },
    {
        "question": "How to configure LangSmith?",
        "relevant_docs": ["langsmith-setup.md", "monitoring-guide.md"]
    }
]

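# Calculate Recall@5
def recall_at_k(retrieved_docs, relevant_docs, k=5):
    if not relevant_docs:
        return 0.0  # avoid division by zero for unannotated questions
    retrieved_top_k = set(retrieved_docs[:k])  # set: count each document once
    relevant_count = sum(1 for doc in retrieved_top_k if doc in relevant_docs)
    return relevant_count / len(relevant_docs)

# Calculate the reciprocal rank for a single query
def reciprocal_rank(retrieved_docs, relevant_docs):
    for i, doc in enumerate(retrieved_docs):
        if doc in relevant_docs:
            return 1.0 / (i + 1)  # rank of the first relevant document
    return 0.0

# Calculate MRR: the mean reciprocal rank over all test queries
def mrr(runs):
    # runs: list of (retrieved_docs, relevant_docs) pairs, one per query
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)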
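
The metrics table also lists Precision@k and MAP, which the functions above do not cover. A minimal sketch of both follows, assuming documents are identified by file name as in the annotation example. The retriever outputs in `runs` are made up for illustration, and `mean_over_queries` is a hypothetical helper name.

```python
def precision_at_k(retrieved_docs, relevant_docs, k=5):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_docs[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_docs)
    # Divides by the number of results actually returned (at most k)
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def average_precision(retrieved_docs, relevant_docs):
    """Average of the precision values taken at each rank where a relevant doc appears."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_over_queries(metric_fn, runs):
    """Average a per-query metric over (retrieved_docs, relevant_docs) pairs."""
    if not runs:
        return 0.0
    return sum(metric_fn(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


# Hypothetical retriever output for the two annotated questions
runs = [
    (["lcel-introduction.md", "faq.md", "chain-tutorial.md"],
     ["lcel-introduction.md", "chain-tutorial.md"]),
    (["faq.md", "langsmith-setup.md"],
     ["langsmith-setup.md", "monitoring-guide.md"]),
]
map_score = mean_over_queries(average_precision, runs)
print(f"MAP: {map_score:.3f}")
```

MAP is simply the per-query average precision averaged over the whole test set, so the same helper can average any other per-query metric.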

Generation Evaluation

Measures the quality of answers generated by the model based on retrieval results.

Metric	Definition	Evaluation Method
Faithfulness	Whether the answer is grounded in retrieval results, free of hallucination	LLM judgment
Relevance	Whether the answer directly addresses the user's question	LLM judgment
Completeness	Whether the answer covers all key points	LLM judgment
Correctness	Whether the factual content of the answer is correct	Human annotation

LLM as Evaluator: Use another LLM to evaluate the quality of generated answers.

def evaluate_faithfulness(answer, context):
    """Evaluate whether the answer is faithful to the context"""
    prompt = f"""Evaluate whether the following answer is entirely based on the provided context.
    If the answer contains information not found in the context, label it as "unfaithful".

    Context: {context}

    Answer: {answer}

    Evaluation result (faithful/unfaithful) and reason:"""

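    # llm is assumed to be a LangChain chat model (its reply exposes .content)
    verdict = llm.invoke(prompt).content.lower()
    # "unfaithful" contains "faithful", so rule out the negative label explicitly
    return "faithful" in verdict and "unfaithful" not in verdict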

def evaluate_relevance(answer, question):
    """Evaluate whether the answer is relevant"""
    prompt = f"""Evaluate whether the following answer directly addresses the user's question.

    Question: {question}

    Answer: {answer}

    Evaluation result (relevant/irrelevant) and reason:"""

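    verdict = llm.invoke(prompt).content.lower()
    # "irrelevant" contains "relevant", so rule out the negative label explicitly
    return "relevant" in verdict and "irrelevant" not in verdict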

End-to-End Evaluation

Measures the final effectiveness of the entire RAG pipeline.

Metric	Definition	Notes
Answer Correctness	Correctness of the final answer	The core metric
Answer Relevancy	Relevance of the final answer to the question	Is it answering the wrong question?
Context Recall	How well retrieval results cover the correct answer	Is retrieval sufficient?
Context Precision	Proportion of relevant content in retrieval results	Is retrieval precise?

RAGAS Evaluation Framework

RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source RAG evaluation framework. It provides standardized evaluation procedures and metrics.

Installation and Usage

pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness
)
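from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Note: RAGAS imports and dataset column names have changed between releases;
# check them against the version you have installed.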

# Prepare evaluation data
data = {
    "question": [
        "What is LCEL in LangChain?",
        "How to configure LangSmith?",
        "What is the difference between RAG and fine-tuning?"
    ],
    "answer": [
        "LCEL is LangChain's expression language...",
        "Configuring LangSmith requires setting environment variables...",
        "RAG augments generation by retrieving external knowledge..."
    ],
    "contexts": [
        ["LCEL is LangChain's expression language, using pipe symbols..."],
        ["LangSmith is LangChain's monitoring platform..."],
        ["RAG augments generation by retrieving..."]
    ],
    "ground_truth": [
        "LCEL (LangChain Expression Language) is...",
        "Steps to configure LangSmith: 1. Register an account...",
        "RAG augments models by retrieving external knowledge..."
    ]
}

dataset = Dataset.from_dict(data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_correctness
    ],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings()
)

print(results)
# {
#   'faithfulness': 0.92,
#   'answer_relevancy': 0.88,
#   'context_recall': 0.85,
#   'context_precision': 0.79,
#   'answer_correctness': 0.83
# }

Interpreting Evaluation Results

Metric	Score	Interpretation
faithfulness	0.92	Answers are highly faithful to context; very little hallucination
answer_relevancy	0.88	Answers are highly relevant to questions
context_recall	0.85	Retrieval results cover most relevant information
context_precision	0.79	Retrieval results contain some less relevant content
answer_correctness	0.83	Answers are mostly correct, with minor inaccuracies

How to judge overall performance? There is no universal passing line. Recommendations:

Run an evaluation first to establish a baseline
Re-evaluate after each change to see if metrics improved
Focus on trends rather than absolute values
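
The baseline-and-compare routine above can be sketched as a small helper that diffs two score dictionaries. This is an illustration only: `compare_to_baseline`, the `tolerance` value, and the candidate scores are assumptions, not part of RAGAS.

```python
def compare_to_baseline(baseline, current, tolerance=0.01):
    """Label each metric as improved, regressed, or unchanged relative to a baseline run."""
    report = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue  # metric was not measured in the current run
        delta = current[name] - base_score
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[name] = (round(delta, 4), status)
    return report


# Baseline scores taken from the example run; candidate scores are hypothetical
baseline = {"faithfulness": 0.92, "context_precision": 0.79, "answer_correctness": 0.83}
candidate = {"faithfulness": 0.90, "context_precision": 0.86, "answer_correctness": 0.83}

report = compare_to_baseline(baseline, candidate)
for metric, (delta, status) in report.items():
    print(f"{metric}: {delta:+.2f} ({status})")
```

The tolerance keeps small run-to-run noise from LLM-based metrics from being reported as a real change. A suitable value depends on how stable your evaluator is.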

Building Evaluation Datasets

The key to evaluation is the dataset. A good evaluation dataset should cover different types of queries.

Query Classification

Type	Example	Proportion
Fact queries	"Who is the founder of LangChain?"	30%
Concept understanding	"What is the difference between LCEL and traditional Chains?"	25%
Process queries	"How to configure LangSmith?"	20%
Comparison	"Which is better, RAG or fine-tuning?"	15%
Boundary testing	"What can LangChain do and not do?"	10%

Dataset Size

Minimum viable: 20-30 questions, covering main query types
Standard evaluation: 100-200 questions, covering various scenarios
Comprehensive evaluation: 500+ questions, covering edge cases
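
To turn the proportions and sizes above into a concrete annotation plan, the sketch below splits a target dataset size across the five query types. The function name, the short type keys, and the rounding scheme (largest remainder) are assumptions made for illustration.

```python
# Query mix from the classification table, as whole percentages
QUERY_MIX_PERCENT = {
    "fact": 30,
    "concept": 25,
    "process": 20,
    "comparison": 15,
    "boundary": 10,
}


def allocate_questions(total, mix=QUERY_MIX_PERCENT):
    """Split `total` questions across query types using largest-remainder rounding."""
    # divmod keeps the arithmetic in integers: (whole questions, remainder out of 100)
    shares = {name: divmod(total * pct, 100) for name, pct in mix.items()}
    counts = {name: whole for name, (whole, _) in shares.items()}
    leftover = total - sum(counts.values())
    # Give the leftover questions to the types with the largest remainders
    by_remainder = sorted(shares, key=lambda name: shares[name][1], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


plan = allocate_questions(30)
print(plan)
```

The counts always sum to the requested total, so a 30-question minimum viable set still contains at least a few boundary-testing questions.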

Annotation Methods

Manual annotation: Most accurate but time-consuming. Each question must be annotated with:

Relevant document list (for retrieval evaluation)
Standard answer (for generation evaluation)

LLM-assisted annotation: Use an LLM to generate initial annotations, then have a human review and correct them. This is much more efficient than annotating from scratch.

# LLM-assisted ground truth generation
def generate_ground_truth(question, relevant_docs):
    prompt = f"""Based on the following reference documents, answer the user's question.
    The answer should be accurate, complete, and concise.

    Reference documents: {relevant_docs}

    Question: {question}

    Ground truth answer:"""

    return llm.invoke(prompt).content

Production Deployment Architecture

Basic Architecture

┌─────────────────────────────────────────────────────┐
│                  RAG Production System                │
│                                                     │
│  ┌──────────┐     ┌──────────────┐                  │
│  │  User    │────▶│  API Gateway │                  │
│  │  Query   │     └──────┬───────┘                  │
│  └──────────┘            │                          │
│                          ▼                          │
│  ┌──────────────────────────────────────┐           │
│  │          RAG Pipeline                │           │
│  │                                      │           │
│  │  Query Rewrite → Multi-Path Recall →  │           │
│  │  Dedup & Fusion → Reranking →        │           │
│  │  Context Compression → Generation →  │           │
│  │  Hallucination Check                 │           │
│  └──────────────────┬───────────────────┘           │
│                     │                               │
│          ┌──────────┼──────────┐                    │
│          ▼          ▼          ▼                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐            │
│  │  Vector   │ │  BM25     │ │  LLM     │            │
│  │  Database │ │  Index    │ │  Service │            │
│  │(Qdrant)  │ │(Elastic)  │ │(API)     │            │
│  └──────────┘ └──────────┘ └──────────┘            │
│                                                     │
│  ┌──────────────────────────────────────┐           │
│  │          Monitoring & Logging         │           │
│  │  Retrieval Latency / Generation       │           │
│  │  Quality / User Feedback              │           │
│  └──────────────────────────────────────┘           │
└─────────────────────────────────────────────────────┘

Key Components

API Gateway: Receives user requests, routes to the RAG pipeline, returns results.

RAG Pipeline: The complete RAG processing flow, including query rewriting, retrieval, reranking, and generation.

Vector Database: Stores and retrieves vectors. Qdrant or Milvus are recommended for production.

BM25 Index: Keyword search. Can use Elasticsearch or maintain separately.

LLM Service: Generates answers. Can be OpenAI API, local models, or self-hosted services.

Monitoring System: Records latency, retrieval results, generated answers, and user feedback for each request.

Performance Optimization

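Caching: Repeated queries return cached results directly. The example below matches exact query strings only; serving similar queries from the cache would require a semantic cache keyed on embeddings.

from collections import OrderedDict
import hashlib

class RAGCache:
    def __init__(self, max_size=10000):
        self.cache = OrderedDict()  # insertion order doubles as recency order
        self.max_size = max_size

    def get_cache_key(self, query):
        # MD5 is used only as a compact cache key, not for security
        return hashlib.md5(query.encode()).hexdigest()

    def get(self, query):
        key = self.get_cache_key(query)
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def set(self, query, result):
        key = self.get_cache_key(query)
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.max_size:
            # LRU eviction: drop the least recently used entry
            self.cache.popitem(last=False)
        self.cache[key] = result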

Async Processing: Retrieval and generation can be partially async.

import asyncio

async def rag_pipeline(query):
    # 1. Query rewriting (async)
    rewritten_query = await rewrite_query(query)

    # 2. Multi-path recall (parallel)
    vector_task = asyncio.create_task(vector_search(rewritten_query))
    bm25_task = asyncio.create_task(bm25_search(rewritten_query))
    vector_results, bm25_results = await asyncio.gather(vector_task, bm25_task)

    # 3. Dedup and fusion
    merged = merge_results(vector_results, bm25_results)

    # 4. Reranking
    reranked = await rerank(merged, rewritten_query)

    # 5. Answer generation
    answer = await generate_answer(reranked, query)

    return answer
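
The pipeline above calls `merge_results` without defining it. One common choice for the dedup-and-fusion step is Reciprocal Rank Fusion (RRF). The sketch below assumes each retriever returns a ranked list of document IDs; the function name and the constant `k=60` are conventional choices, not something this article prescribes.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document IDs into one deduplicated ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # A document earns more score the higher it ranks in each list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrieval paths
vector_results = ["lcel-intro.md", "chain-tutorial.md", "faq.md"]
bm25_results = ["chain-tutorial.md", "changelog.md", "lcel-intro.md"]

merged = reciprocal_rank_fusion([vector_results, bm25_results])
print(merged)
```

Because RRF uses only ranks, it avoids having to normalize vector similarity scores against BM25 scores, which live on different scales.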

Batching: Multiple queries can be processed in batches to increase throughput.
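
For an I/O-bound pipeline, one simple way to process many queries together is to run them concurrently with a cap on in-flight requests. The sketch below is one possible reading of batching, not the only one. `run_in_batches` and `fake_pipeline` are hypothetical names, and `fake_pipeline` stands in for the real pipeline so the example runs on its own.

```python
import asyncio


async def run_in_batches(queries, pipeline, max_concurrency=8):
    """Run `pipeline` over many queries with a cap on concurrent requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(query):
        async with semaphore:  # at most max_concurrency pipelines run at once
            return await pipeline(query)

    # gather returns results in the same order as the input queries
    return await asyncio.gather(*(run_one(q) for q in queries))


async def fake_pipeline(query):
    # Stand-in for the real RAG pipeline, used only to make the sketch runnable
    await asyncio.sleep(0.01)
    return f"answer to: {query}"


answers = asyncio.run(
    run_in_batches(["q1", "q2", "q3"], fake_pipeline, max_concurrency=2)
)
print(answers)
```

The concurrency cap protects downstream services such as the vector database and the LLM API from bursts. Batching at the model level (for example, embedding many texts in one call) is a separate optimization.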

Degradation Strategies

Production environments need to account for various failure scenarios:

import logging

logger = logging.getLogger(__name__)

# Application-defined exceptions, raised by your own retrieval and LLM wrappers
class VectorDBError(Exception):
    pass

class LLMError(Exception):
    pass

async def rag_with_fallback(query):
    try:
        # Normal flow
        return await rag_pipeline(query)
    except VectorDBError:
        # Vector DB unavailable, degrade to keyword-only search
        logger.warning("Vector DB unavailable, falling back to BM25")
        return await bm25_only_pipeline(query)
    except LLMError:
        # LLM unavailable, return retrieved document summaries
        logger.warning("LLM unavailable, returning retrieved docs")
        return await retrieve_only(query)
    except Exception:
        # Other errors: log the traceback and return a friendly message
        logger.exception("RAG pipeline error")
        return "Sorry, the system is temporarily unable to answer this question. Please try again later."

Monitoring and Continuous Optimization

Monitoring Metrics

Metric	Description	Alert Threshold
Retrieval Latency	Time from query to returning retrieval results	>500ms
Generation Latency	Time from retrieval to generating answer	>3s
End-to-End Latency	Time from user question to receiving answer	>5s
Cache Hit Rate	Proportion of cache hits	<30%
User Satisfaction	User feedback (upvote/downvote)	Downvote rate >20%
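
The thresholds in the table can be encoded as data and checked against each monitoring window. The sketch below is a minimal illustration; the metric key names, `ALERT_RULES`, and `check_alerts` are assumptions, and the sample window values are invented.

```python
# Alert thresholds from the monitoring table: (direction, threshold)
ALERT_RULES = {
    "retrieval_latency_ms": ("above", 500),
    "generation_latency_ms": ("above", 3000),
    "total_latency_ms": ("above", 5000),
    "cache_hit_rate": ("below", 0.30),
    "downvote_rate": ("above", 0.20),
}


def check_alerts(metrics, rules=ALERT_RULES):
    """Return a message for every metric that breaches its threshold."""
    alerts = []
    for name, (direction, threshold) in rules.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this window
        breached = value > threshold if direction == "above" else value < threshold
        if breached:
            alerts.append(f"{name}={value} is {direction} threshold {threshold}")
    return alerts


# Hypothetical aggregated metrics for one monitoring window
window = {
    "retrieval_latency_ms": 620,
    "generation_latency_ms": 1800,
    "total_latency_ms": 2500,
    "cache_hit_rate": 0.42,
    "downvote_rate": 0.25,
}
alerts = check_alerts(window)
for message in alerts:
    print(message)
```

In practice the input would usually be a percentile (such as p95 latency) over a time window rather than a single request, so one slow request does not trigger an alert.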

Logging

Every request should be logged:

log_entry = {
    "timestamp": "2026-06-05T10:30:00Z",
    "query": "What is LCEL in LangChain?",
    "rewritten_query": "LangChain LCEL expression language definition and usage",
    "retrieved_docs": ["lcel-intro.md", "chain-tutorial.md"],
    "reranked_docs": ["lcel-intro.md"],
    "answer": "LCEL is LangChain's expression language...",
    "metrics": {
        "retrieval_latency_ms": 120,
        "generation_latency_ms": 1800,
        "total_latency_ms": 1950,
        "context_precision": 0.85,
        "faithfulness": 0.92
    },
    "user_feedback": "upvote"  # or "downvote"
}
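
One lightweight way to persist entries like this is the JSON Lines format: one JSON object per line, which is easy to append to and to parse later for analysis. The helper names below are hypothetical, and the in-memory buffer stands in for a real log file or log shipper.

```python
import io
import json


def append_log_entry(stream, entry):
    """Write one request log as a single JSON line."""
    stream.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_log_entries(stream):
    """Parse a JSON Lines stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# In-memory stream for demonstration; production code would write to a file
buffer = io.StringIO()
append_log_entry(buffer, {
    "query": "What is LCEL in LangChain?",
    "metrics": {"total_latency_ms": 1950},
    "user_feedback": "upvote",
})
buffer.seek(0)
entries = read_log_entries(buffer)
print(entries[0]["metrics"]["total_latency_ms"])
```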

Continuous Optimization Loop

Deploy to Production
    │
    ▼
Collect user queries and feedback
    │
    ▼
Run evaluation periodically (weekly/monthly)
    │
    ▼
Analyze weak points
    │
    ▼
Targeted optimization (splitting/generation/retrieval)
    │
    ▼
A/B test to verify
    │
    ▼
Deploy new version
    │
    └──→ Back to step 1
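
For the A/B test step, each user should consistently see the same pipeline version. A common approach is to hash a stable user ID into a bucket. The sketch below assumes such an ID is available; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="rag-v2", treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control'."""
    # Including the experiment name gives each experiment an independent split
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


variant = assign_variant("user-42")
print(variant)
```

Because the assignment depends only on the user ID and the experiment name, no assignment table has to be stored, and the same user receives the same variant on every request.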

Leveraging User Feedback

User feedback is a goldmine for RAG optimization:

Downvoted answers: Analyze whether it's a retrieval issue (failed to find relevant documents) or a generation issue (found them but answered poorly)
High-frequency queries: Optimizing these yields the greatest benefit
No-result queries: Indicates knowledge gaps in the knowledge base; documents need to be added

# Analyze downvote reasons
def analyze_downvote(query, retrieved_docs, answer, context):
    if not retrieved_docs:
        return "no_results"  # Knowledge base has no relevant docs

    # evaluate_relevance and evaluate_faithfulness return booleans
    if not evaluate_relevance(answer, query):
        return "irrelevant_answer"  # Retrieved but answer is irrelevant

    if not evaluate_faithfulness(answer, context):
        return "hallucination"  # Hallucination detected

    return "other"  # Other reasons
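
The other two feedback signals, high-frequency queries and no-result queries, can be pulled from the request logs with a simple aggregation. The sketch below assumes log entries shaped like the logging example earlier; `summarize_feedback` is a hypothetical helper and the sample logs are invented.

```python
from collections import Counter


def summarize_feedback(log_entries, top_n=3):
    """Aggregate request logs into the signals used for optimization."""
    query_counts = Counter(entry["query"] for entry in log_entries)
    # Queries where retrieval returned nothing point at knowledge-base gaps
    no_result_queries = sorted(
        {e["query"] for e in log_entries if not e.get("retrieved_docs")}
    )
    rated = [e for e in log_entries if e.get("user_feedback") in ("upvote", "downvote")]
    downvotes = sum(1 for e in rated if e["user_feedback"] == "downvote")
    return {
        "top_queries": query_counts.most_common(top_n),
        "no_result_queries": no_result_queries,
        "downvote_rate": downvotes / len(rated) if rated else 0.0,
    }


# Hypothetical log entries
logs = [
    {"query": "What is LCEL in LangChain?", "retrieved_docs": ["lcel-intro.md"], "user_feedback": "upvote"},
    {"query": "What is LCEL in LangChain?", "retrieved_docs": ["lcel-intro.md"], "user_feedback": "downvote"},
    {"query": "How to configure LangSmith?", "retrieved_docs": ["langsmith-setup.md"], "user_feedback": "upvote"},
    {"query": "Does LangChain support COBOL?", "retrieved_docs": [], "user_feedback": "downvote"},
]
summary = summarize_feedback(logs, top_n=1)
print(summary)
```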

LangSmith Integration

LangSmith is LangChain's official monitoring platform that can trace every step of the RAG pipeline.

Basic Configuration

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-project"

# LangChain code is called normally; all calls are automatically traced

Viewing RAG Chains in LangSmith

LangSmith records the complete chain of each RAG request:

User Query → Query Rewrite → Vector Retrieval → BM25 Retrieval → Dedup & Fusion → Reranking → Generation
  │           │               │                │               │               │            │
  │           │               │                │               │               │            └─ Output + Token count
  │           │               │                │               │               └─ Reranked results
  │           │               │                │               └─ Merged deduplicated results
  │           │               │                └─ BM25 returned results
  │           │               └─ Vector retrieval returned results
  │           └─ Rewritten query
  └─ Original query

The latency, input/output, and token consumption of each step are clearly visible.

Evaluation in LangSmith

LangSmith has built-in evaluation capabilities, allowing you to run evaluations directly on the platform:

from langsmith import Client

client = Client()

# Create evaluation dataset
dataset = client.create_dataset(
    name="rag-evaluation",
    description="RAG system evaluation dataset"
)

# Add test cases
client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is LCEL in LangChain?"},
        {"question": "How to configure LangSmith?"},
    ],
    outputs=[
        {"answer": "LCEL is LangChain's expression language..."},
        {"answer": "Configuring LangSmith requires..."},
    ]
)

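# Run evaluation
from langsmith.evaluation import evaluate

# Notes:
# - The target must be a synchronous callable that takes an example's inputs dict.
# - Evaluators must be LangSmith-compatible callables (for example, functions that
#   take a run and an example and return a score). RAGAS metric objects cannot be
#   passed directly; the names below stand for wrappers you write around them.
results = evaluate(
    rag_pipeline,
    data="rag-evaluation",  # dataset name created above
    evaluators=[faithfulness, answer_relevancy, context_recall],
    experiment_prefix="rag-v1"
)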

Summary

RAG system evaluation and continuous optimization is an ongoing engineering practice, not a one-time task.

Key takeaways:

Three levels of evaluation: Retrieval evaluation, generation evaluation, and end-to-end evaluation
RAGAS is a widely used evaluation framework that provides standardized metrics
Production deployment requires considering: Caching, async processing, degradation, and monitoring
Continuous optimization is the core: Collect feedback → Periodic evaluation → Targeted optimization → A/B testing → Deploy
LangSmith, LangChain's official monitoring platform, traces the input/output and latency of every step

This completes all 5 articles in the RAG series. A recap of the entire series:

RAG Pipeline Overview — Full pipeline, component selection, real system analysis, common pitfalls
Document Splitting — Five splitting strategies, best practices for different document types, parameter selection
Vector Retrieval — Embedding models, vector databases, hybrid search, reranking
Post-Retrieval Optimization — Query rewriting, HyDE, multi-path recall, context compression, hallucination suppression
Evaluation and Production Deployment — Evaluation metrics, RAGAS, production architecture, monitoring, continuous optimization
