Evaluation and Production Deployment: Taking RAG from Prototype to Production

The previous four articles covered each step of the RAG pipeline. This article covers the final question: how do you know whether your RAG system is good, and how do you deploy it to production?

This is the most easily overlooked but most decisive step in the RAG series. Many people build a RAG system and consider it done once it "runs." But "runs" and "works well" are separated by an entire system of evaluation and continuous optimization.

Why RAG Evaluation Is So Hard

RAG evaluation is difficult because it cannot be measured by a single metric.

Traditional software has clear inputs and outputs — after running test cases, you know whether it's right or wrong. RAG is different:

  • Are the retrieval results good? Relevant documents were retrieved, but is the ranking correct? Is the quantity sufficient?
  • Is the generated answer good? Based on the retrieval results, is the generated answer accurate? Complete? Free of hallucinations?
  • Is the end-to-end effect good? Did the user's question ultimately receive a satisfactory answer?

These three questions correspond to three different levels of evaluation, each with multiple metrics.

Three Levels of RAG Evaluation

Retrieval Evaluation

Measures the quality of results returned by the retriever.

Metric | Definition | Target
Recall@k | Proportion of all relevant documents that appear in the top-k results | Higher is better, target >90%
Precision@k | Proportion of the top-k results that are relevant documents | Higher is better
MRR (Mean Reciprocal Rank) | Mean, over all queries, of the reciprocal rank of the first relevant result | Closer to 1 is better
MAP (Mean Average Precision) | Mean, over all queries, of average precision, which rewards ranking relevant documents higher | Closer to 1 is better

How to calculate: These metrics require annotated data. Each test question must be annotated with a list of relevant documents.

# Annotation data example
test_cases = [
    {
        "question": "What is LCEL in LangChain?",
        "relevant_docs": ["lcel-introduction.md", "chain-tutorial.md"]
    },
    {
        "question": "How to configure LangSmith?",
        "relevant_docs": ["langsmith-setup.md", "monitoring-guide.md"]
    }
]

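# Calculate Recall@k: share of the annotated relevant documents found in the top k
def recall_at_k(retrieved_docs, relevant_docs, k=5):
    if not relevant_docs:
        return 0.0  # nothing annotated, avoid division by zero
    retrieved_top_k = set(retrieved_docs[:k])
    relevant_count = sum(1 for doc in set(relevant_docs) if doc in retrieved_top_k)
    return relevant_count / len(set(relevant_docs))

# Reciprocal rank of the first relevant document for a single query
def reciprocal_rank(retrieved_docs, relevant_docs):
    for i, doc in enumerate(retrieved_docs):
        if doc in relevant_docs:
            return 1.0 / (i + 1)
    return 0.0

# MRR is the mean of the reciprocal ranks over all test questions
def mrr(all_retrieved, all_relevant):
    ranks = [reciprocal_rank(r, rel) for r, rel in zip(all_retrieved, all_relevant)]
    return sum(ranks) / len(ranks) if ranks else 0.0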
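The table above also lists Precision@k and MAP, which the snippet does not cover. The following is a minimal sketch of both, assuming the same inputs as before: a ranked list of retrieved document names and a list of annotated relevant documents. The file name faq.md and the helper names are hypothetical, chosen only for illustration.

```python
def precision_at_k(retrieved_docs, relevant_docs, k=5):
    """Fraction of the top-k slots that hold a relevant document."""
    relevant = set(relevant_docs)
    hits = sum(1 for doc in retrieved_docs[:k] if doc in relevant)
    return hits / k


def average_precision(retrieved_docs, relevant_docs):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP: average precision averaged over (retrieved, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical retrieval results for the two annotated questions
runs = [
    (["lcel-introduction.md", "faq.md", "chain-tutorial.md"],
     ["lcel-introduction.md", "chain-tutorial.md"]),
    (["faq.md", "langsmith-setup.md"],
     ["langsmith-setup.md", "monitoring-guide.md"]),
]
print(round(mean_average_precision(runs), 3))  # 0.542
```

Dividing by k in Precision@k penalizes a retriever that returns fewer than k results; divide by the number of returned results instead if that is not the behavior you want.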

Generation Evaluation

Measures the quality of answers generated by the model based on retrieval results.

Metric | Definition | Evaluation Method
Faithfulness | Whether the answer is grounded in retrieval results, free of hallucination | LLM judgment
Relevance | Whether the answer directly addresses the user's question | LLM judgment
Completeness | Whether the answer covers all key points | LLM judgment
Correctness | Whether the factual content of the answer is correct | Human annotation

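LLM as Evaluator: Use another LLM to evaluate the quality of generated answers. In the snippets below, llm is assumed to be a LangChain chat model (for example ChatOpenAI) created elsewhere.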

def evaluate_faithfulness(answer, context):
    """Evaluate whether the answer is faithful to the context"""
    prompt = f"""Evaluate whether the following answer is entirely based on the provided context.
    If the answer contains information not found in the context, label it as "unfaithful".

    Context: {context}

    Answer: {answer}

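    Reply with the label first (faithful/unfaithful), then the reason:"""

    result = llm.invoke(prompt)
    # Check the leading label: "unfaithful" contains "faithful", so a substring test would always pass
    return result.content.strip().lower().startswith("faithful")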

def evaluate_relevance(answer, question):
    """Evaluate whether the answer is relevant"""
    prompt = f"""Evaluate whether the following answer directly addresses the user's question.

    Question: {question}

    Answer: {answer}

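    Reply with the label first (relevant/irrelevant), then the reason:"""

    result = llm.invoke(prompt)
    # Check the leading label: "irrelevant" contains "relevant", so a substring test would always pass
    return result.content.strip().lower().startswith("relevant")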

End-to-End Evaluation

Measures the final effectiveness of the entire RAG pipeline.

Metric | Definition | Notes
Answer Correctness | Correctness of the final answer | The core metric
Answer Relevancy | Relevance of the final answer to the question | Is it answering the wrong question?
Context Recall | How well retrieval results cover the correct answer | Is retrieval sufficient?
Context Precision | Proportion of relevant content in retrieval results | Is retrieval precise?

RAGAS Evaluation Framework

RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely used RAG evaluation frameworks. It provides standardized evaluation procedures and metrics.

Installation and Usage

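# Install (run in a shell)
pip install ragas datasets langchain-openai

# Python imports; the metric names and dataset columns below follow the ragas 0.1.x API and may differ in newer releases
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness
)
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings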

# Prepare evaluation data
data = {
    "question": [
        "What is LCEL in LangChain?",
        "How to configure LangSmith?",
        "What is the difference between RAG and fine-tuning?"
    ],
    "answer": [
        "LCEL is LangChain's expression language...",
        "Configuring LangSmith requires setting environment variables...",
        "RAG augments generation by retrieving external knowledge..."
    ],
    "contexts": [
        ["LCEL is LangChain's expression language, using pipe symbols..."],
        ["LangSmith is LangChain's monitoring platform..."],
        ["RAG augments generation by retrieving..."]
    ],
    "ground_truth": [
        "LCEL (LangChain Expression Language) is...",
        "Steps to configure LangSmith: 1. Register an account...",
        "RAG augments models by retrieving external knowledge..."
    ]
}

dataset = Dataset.from_dict(data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_correctness
    ],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings()
)

print(results)
# {
#   'faithfulness': 0.92,
#   'answer_relevancy': 0.88,
#   'context_recall': 0.85,
#   'context_precision': 0.79,
#   'answer_correctness': 0.83
# }

Interpreting Evaluation Results

Metric | Score | Interpretation
faithfulness | 0.92 | Answers are highly faithful to context; very little hallucination
answer_relevancy | 0.88 | Answers are highly relevant to questions
context_recall | 0.85 | Retrieval results cover most relevant information
context_precision | 0.79 | Retrieval results contain some less relevant content
answer_correctness | 0.83 | Answers are mostly correct, with minor inaccuracies

How to judge overall performance? There is no universal passing score, because acceptable values depend on the domain and the dataset. Recommendations:

  • Run an evaluation first to establish a baseline
  • Re-evaluate after each change to see if metrics improved
  • Focus on trends rather than absolute values
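The baseline-and-trend workflow above can be sketched as a small comparison helper. This is a minimal illustration, not part of RAGAS: the function name, the tolerance value, and the "current" scores are assumptions; the baseline reuses the example scores shown earlier.

```python
def compare_to_baseline(baseline, current, tolerance=0.02):
    """Return per-metric deltas and the metrics that dropped by more than `tolerance`."""
    deltas = {
        name: round(current[name] - baseline[name], 4)
        for name in baseline
        if name in current
    }
    regressions = [name for name, delta in deltas.items() if delta < -tolerance]
    return deltas, regressions


baseline = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_precision": 0.79}
# Hypothetical scores after a change to the retriever
current = {"faithfulness": 0.94, "answer_relevancy": 0.88, "context_precision": 0.70}

deltas, regressions = compare_to_baseline(baseline, current)
print(deltas)
print(regressions)  # ['context_precision']
```

A tolerance is useful because LLM-judged metrics vary slightly between runs; a small drop is often noise rather than a real regression.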

Building Evaluation Datasets

The key to evaluation is the dataset. A good evaluation dataset should cover different types of queries.

Query Classification

Type | Example | Proportion
Fact queries | "Who is the founder of LangChain?" | 30%
Concept understanding | "What is the difference between LCEL and traditional Chains?" | 25%
Process queries | "How to configure LangSmith?" | 20%
Comparison | "Which is better, RAG or fine-tuning?" | 15%
Boundary testing | "What can LangChain do and not do?" | 10%

Dataset Size

  • Minimum viable: 20-30 questions, covering main query types
  • Standard evaluation: 100-200 questions, covering various scenarios
  • Comprehensive evaluation: 500+ questions, covering edge cases
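To turn the proportions and dataset sizes above into concrete question counts, the sketch below splits a target size across the five query types. The function and dictionary names are hypothetical. It uses integer percentages and the largest-remainder method so that the counts always sum to the requested total.

```python
# Proportions from the query classification table, as integer percentages
QUERY_MIX_PERCENT = {
    "fact": 30,
    "concept": 25,
    "process": 20,
    "comparison": 15,
    "boundary": 10,
}


def allocate_questions(total, mix=QUERY_MIX_PERCENT):
    """Split `total` questions across query types according to `mix`."""
    scaled = {name: total * pct for name, pct in mix.items()}
    counts = {name: value // 100 for name, value in scaled.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining questions to the types with the largest remainders
    by_remainder = sorted(scaled, key=lambda name: scaled[name] % 100, reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(allocate_questions(30))
# {'fact': 9, 'concept': 8, 'process': 6, 'comparison': 4, 'boundary': 3}
```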

Annotation Methods

Manual annotation: Most accurate but time-consuming. Each question must be annotated with:

  • Relevant document list (for retrieval evaluation)
  • Standard answer (for generation evaluation)

LLM-assisted annotation: Use an LLM to generate initial annotations, then have a human review and correct them. This is much faster than annotating from scratch.

# LLM-assisted ground truth generation
def generate_ground_truth(question, relevant_docs):
    prompt = f"""Based on the following reference documents, answer the user's question.
    The answer should be accurate, complete, and concise.

    Reference documents: {relevant_docs}

    Question: {question}

    Ground truth answer:"""

    return llm.invoke(prompt).content

Production Deployment Architecture

Basic Architecture

┌─────────────────────────────────────────────────────┐
│                RAG Production System                │
│                                                     │
│  ┌──────────┐     ┌──────────────┐                  │
│  │  User    │────▶│  API Gateway │                  │
│  │  Query   │     └──────┬───────┘                  │
│  └──────────┘            │                          │
│                          ▼                          │
│  ┌──────────────────────────────────────┐           │
│  │          RAG Pipeline                │           │
│  │                                      │           │
│  │  Query Rewrite → Multi-Path Recall → │           │
│  │  Dedup & Fusion → Reranking →        │           │
│  │  Context Compression → Generation →  │           │
│  │  Hallucination Check                 │           │
│  └──────────────────┬───────────────────┘           │
│                     │                               │
│          ┌──────────┼──────────┐                    │
│          ▼          ▼          ▼                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐             │
│  │  Vector  │ │  BM25    │ │  LLM     │             │
│  │ Database │ │  Index   │ │  Service │             │
│  │ (Qdrant) │ │ (Elastic)│ │  (API)   │             │
│  └──────────┘ └──────────┘ └──────────┘             │
│                                                     │
│  ┌──────────────────────────────────────┐           │
│  │         Monitoring & Logging         │           │
│  │  Retrieval Latency / Generation      │           │
│  │  Quality / User Feedback             │           │
│  └──────────────────────────────────────┘           │
└─────────────────────────────────────────────────────┘

Key Components

API Gateway: Receives user requests, routes to the RAG pipeline, returns results.

RAG Pipeline: The complete RAG processing flow, including query rewriting, retrieval, reranking, and generation.

Vector Database: Stores and retrieves vectors. Qdrant or Milvus are recommended for production.

BM25 Index: Keyword search. It can be served by Elasticsearch or by a separately maintained index.

LLM Service: Generates answers. Can be OpenAI API, local models, or self-hosted services.

Monitoring System: Records latency, retrieval results, generated answers, and user feedback for each request.

Performance Optimization

Caching: Identical queries return cached results directly. Matching similar queries as well requires normalizing or embedding the query before the lookup.

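from collections import OrderedDict
import hashlib

class RAGCache:
    def __init__(self, max_size=10000):
        # OrderedDict keeps entries in usage order, oldest first
        self.cache = OrderedDict()
        self.max_size = max_size

    def get_cache_key(self, query):
        # MD5 is used only as a fast key hash here, not for security
        return hashlib.md5(query.encode()).hexdigest()

    def get(self, query):
        key = self.get_cache_key(query)
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def set(self, query, result):
        key = self.get_cache_key(query)
        if key not in self.cache and len(self.cache) >= self.max_size:
            # LRU eviction: drop the least recently used entry
            self.cache.popitem(last=False)
        self.cache[key] = result
        self.cache.move_to_end(key)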
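The cache above keys on the exact query string, so "What is LCEL?" and "what is  lcel" are stored separately. One inexpensive way to catch near-identical queries is to normalize the text before hashing. The sketch below assumes that case, extra whitespace, and trailing punctuation do not change a query's meaning in your domain; the function names are hypothetical. Matching semantically similar queries would need embedding-based lookup, which is not shown here.

```python
import hashlib
import re


def normalize_query(query):
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", query.strip().lower())
    return text.rstrip("?!. ")


def normalized_cache_key(query):
    """Hash the normalized query so trivially different spellings share a key."""
    return hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()


print(normalize_query("  What   is LCEL? "))  # what is lcel
```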

Async Processing: Independent steps, such as the vector search and the BM25 search, can run concurrently instead of one after another.

import asyncio

async def rag_pipeline(query):
    # 1. Query rewriting (async)
    rewritten_query = await rewrite_query(query)

    # 2. Multi-path recall (both searches run concurrently)
    vector_results, bm25_results = await asyncio.gather(
        vector_search(rewritten_query),
        bm25_search(rewritten_query),
    )

    # 3. Dedup and fusion
    merged = merge_results(vector_results, bm25_results)

    # 4. Reranking
    reranked = await rerank(merged, rewritten_query)

    # 5. Answer generation
    answer = await generate_answer(reranked, query)

    return answer

Batching: Multiple queries can be processed in batches to increase throughput.
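A minimal sketch of batching with bounded concurrency follows. It assumes the pipeline is an async callable that takes one query; fake_pipeline is a stand-in so the example runs on its own, and the concurrency limit of 2 is arbitrary.

```python
import asyncio


async def run_batch(queries, pipeline, max_concurrency=4):
    """Run `pipeline` over all queries, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(query):
        async with semaphore:
            return await pipeline(query)

    # gather returns results in the same order as the input queries
    return await asyncio.gather(*(run_one(q) for q in queries))


async def fake_pipeline(query):
    # Stand-in for the real RAG pipeline (hypothetical)
    await asyncio.sleep(0.01)
    return f"answer to: {query}"


answers = asyncio.run(run_batch(["q1", "q2", "q3"], fake_pipeline, max_concurrency=2))
print(answers)  # ['answer to: q1', 'answer to: q2', 'answer to: q3']
```

The semaphore keeps a large batch from exceeding rate limits on the vector database or the LLM service.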

Degradation Strategies

Production environments need to account for various failure scenarios:

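import logging

logger = logging.getLogger(__name__)

# VectorDBError and LLMError are assumed to be application-defined exception classes
async def rag_with_fallback(query):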
    try:
        # Normal flow
        return await rag_pipeline(query)
    except VectorDBError:
        # Vector DB unavailable, degrade to keyword-only search
        logger.warning("Vector DB unavailable, falling back to BM25")
        return await bm25_only_pipeline(query)
    except LLMError:
        # LLM unavailable, return retrieved document summaries
        logger.warning("LLM unavailable, returning retrieved docs")
        return await retrieve_only(query)
    except Exception as e:
        # Other errors, return friendly message
        logger.error(f"RAG pipeline error: {e}")
        return "Sorry, the system is temporarily unable to answer this question. Please try again later."
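A dependency that hangs is as much a failure as one that raises an error. The sketch below wraps any awaitable in a timeout and returns a fallback value if the deadline passes. The helper name and the timeout values are assumptions; slow_generation is a stand-in for a stalled LLM call.

```python
import asyncio


async def with_timeout(awaitable, seconds, fallback):
    """Await `awaitable`, returning `fallback` if it takes longer than `seconds`."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_generation():
    # Stand-in for an LLM call that takes too long (hypothetical)
    await asyncio.sleep(1)
    return "full answer"


result = asyncio.run(with_timeout(slow_generation(), 0.05, "fallback answer"))
print(result)  # fallback answer
```

asyncio.wait_for cancels the wrapped task when the timeout expires, so the slow call does not keep running in the background.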

Monitoring and Continuous Optimization

Monitoring Metrics

Metric | Description | Alert Threshold
Retrieval Latency | Time from query to returning retrieval results | >500ms
Generation Latency | Time from retrieval to generating answer | >3s
End-to-End Latency | Time from user question to receiving answer | >5s
Cache Hit Rate | Proportion of cache hits | <30%
User Satisfaction | User feedback (upvote/downvote) | Downvote rate >20%
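The thresholds in the table can be checked mechanically. The sketch below encodes them as rules and reports which metrics breach their threshold. The metric key names and the rule format are assumptions made for illustration, with latencies in milliseconds and rates as fractions.

```python
# Alert rules taken from the table above: (comparison, threshold)
ALERT_RULES = {
    "retrieval_latency_ms": ("gt", 500),
    "generation_latency_ms": ("gt", 3000),
    "total_latency_ms": ("gt", 5000),
    "cache_hit_rate": ("lt", 0.30),
    "downvote_rate": ("gt", 0.20),
}


def check_alerts(metrics, rules=ALERT_RULES):
    """Return the names of metrics that breach their alert threshold."""
    alerts = []
    for name, (comparison, threshold) in rules.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this window
        if comparison == "gt" and value > threshold:
            alerts.append(name)
        elif comparison == "lt" and value < threshold:
            alerts.append(name)
    return alerts


window = {"retrieval_latency_ms": 120, "generation_latency_ms": 3400, "cache_hit_rate": 0.25}
print(check_alerts(window))  # ['generation_latency_ms', 'cache_hit_rate']
```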

Logging

Every request should be logged:

log_entry = {
    "timestamp": "2026-06-05T10:30:00Z",
    "query": "What is LCEL in LangChain?",
    "rewritten_query": "LangChain LCEL expression language definition and usage",
    "retrieved_docs": ["lcel-intro.md", "chain-tutorial.md"],
    "reranked_docs": ["lcel-intro.md"],
    "answer": "LCEL is LangChain's expression language...",
    "metrics": {
        "retrieval_latency_ms": 120,
        "generation_latency_ms": 1800,
        "total_latency_ms": 1950,
        "context_precision": 0.85,
        "faithfulness": 0.92
    },
    "user_feedback": "upvote"  # or "downvote"
}
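The latency fields in a log entry like the one above have to be measured somewhere. One way to do that, sketched below, is a small context manager that times each stage and writes the result into a metrics dictionary. The helper name is hypothetical, and time.sleep stands in for the real retrieval call.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def timed(metrics, name):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[name] = round((time.perf_counter() - start) * 1000, 2)


metrics = {}
with timed(metrics, "retrieval_latency_ms"):
    time.sleep(0.01)  # stand-in for the retrieval step

# One JSON object per line is easy to ship to most log pipelines
log_line = json.dumps({"query": "What is LCEL in LangChain?", "metrics": metrics})
print(log_line)
```

Because the duration is stored in a finally block, the latency is still recorded when the stage raises an exception.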

Continuous Optimization Loop

Deploy to Production
    │
    ▼
Collect user queries and feedback
    │
    ▼
Run evaluation periodically (weekly/monthly)
    │
    ▼
Analyze weak points
    │
    ▼
Targeted optimization (splitting/generation/retrieval)
    │
    ▼
A/B test to verify
    │
    ▼
Deploy new version
    │
    └──→ Back to step 1
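For the A/B test step, each user should see the same variant on every request. A common approach, sketched here under assumed names, hashes the user ID together with an experiment name to get a stable bucket. The experiment name "rag-v2" and the 50/50 split are illustrative.

```python
import hashlib


def assign_variant(user_id, experiment="rag-v2", treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same group
print(assign_variant("user-42") == assign_variant("user-42"))  # True
```

Including the experiment name in the hash means a user's group in one experiment does not determine their group in the next.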

Leveraging User Feedback

User feedback is a goldmine for RAG optimization:

  • Downvoted answers: Analyze whether it's a retrieval issue (failed to find relevant documents) or a generation issue (found them but answered poorly)
  • High-frequency queries: Optimizing these yields the greatest benefit
  • No-result queries: Indicates knowledge gaps in the knowledge base; documents need to be added

# Analyze downvote reasons
def analyze_downvote(query, retrieved_docs, answer, context):
    if not retrieved_docs:
        return "no_results"  # Knowledge base has no relevant docs

    # evaluate_relevance and evaluate_faithfulness return booleans, not scores
    if not evaluate_relevance(answer, query):
        return "irrelevant_answer"  # Retrieved but answer is irrelevant

    if not evaluate_faithfulness(answer, context):
        return "hallucination"  # Hallucination detected

    return "other"  # Other reasons
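High-frequency and no-result queries can be pulled straight from the request logs. The sketch below assumes each log entry is a dictionary with the query, retrieved_docs, and user_feedback fields from the logging example; the function name and the sample entries are hypothetical.

```python
from collections import Counter


def summarize_feedback(log_entries, top_n=3):
    """Summarize logs: most frequent queries, queries with no results, downvote rate."""
    query_counts = Counter(entry["query"].strip().lower() for entry in log_entries)
    no_results = sorted({e["query"] for e in log_entries if not e["retrieved_docs"]})
    downvotes = sum(1 for e in log_entries if e.get("user_feedback") == "downvote")
    return {
        "top_queries": query_counts.most_common(top_n),
        "no_result_queries": no_results,
        "downvote_rate": downvotes / len(log_entries) if log_entries else 0.0,
    }


entries = [
    {"query": "What is LCEL?", "retrieved_docs": ["lcel-intro.md"], "user_feedback": "upvote"},
    {"query": "what is lcel?", "retrieved_docs": ["lcel-intro.md"], "user_feedback": "downvote"},
    {"query": "How to deploy on Kubernetes?", "retrieved_docs": [], "user_feedback": "downvote"},
    {"query": "How to configure LangSmith?", "retrieved_docs": ["langsmith-setup.md"], "user_feedback": None},
]
summary = summarize_feedback(entries, top_n=1)
print(summary["top_queries"])        # [('what is lcel?', 2)]
print(summary["no_result_queries"])  # ['How to deploy on Kubernetes?']
print(summary["downvote_rate"])      # 0.5
```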

LangSmith Integration

LangSmith is LangChain's official monitoring platform that can trace every step of the RAG pipeline.

Basic Configuration

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-project"

# LangChain code is called normally; all calls are automatically traced

Viewing RAG Chains in LangSmith

LangSmith records the complete chain of each RAG request:

User Query → Query Rewrite → Vector Retrieval → BM25 Retrieval → Dedup & Fusion → Reranking → Generation
  │           │               │                │               │               │            │
  │           │               │                │               │               │            └─ Output + Token count
  │           │               │                │               │               └─ Reranked results
  │           │               │                │               └─ Merged deduplicated results
  │           │               │                └─ BM25 returned results
  │           │               └─ Vector retrieval returned results
  │           └─ Rewritten query
  └─ Original query

The latency, input/output, and token consumption of each step are clearly visible.

Evaluation in LangSmith

LangSmith has built-in evaluation capabilities, allowing you to run evaluations directly on the platform:

from langsmith import Client

client = Client()

# Create evaluation dataset
dataset = client.create_dataset(
    name="rag-evaluation",
    description="RAG system evaluation dataset"
)

# Add test cases
client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is LCEL in LangChain?"},
        {"question": "How to configure LangSmith?"},
    ],
    outputs=[
        {"answer": "LCEL is LangChain's expression language..."},
        {"answer": "Configuring LangSmith requires..."},
    ]
)

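# Run evaluation
import asyncio
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # LangSmith calls the target with each example's inputs dict
    return {"answer": asyncio.run(rag_pipeline(inputs["question"]))}

def answer_relevance(run, example) -> dict:
    # LangSmith evaluators receive (run, example); RAGAS metric objects cannot be passed in directly
    relevant = evaluate_relevance(run.outputs["answer"], example.inputs["question"])
    return {"key": "answer_relevance", "score": int(relevant)}

results = evaluate(
    target,
    data="rag-evaluation",  # name of the dataset created above
    evaluators=[answer_relevance],
    experiment_prefix="rag-v1"
)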

Summary

Evaluating and continuously optimizing a RAG system is an ongoing engineering practice, not a one-time task.

Key takeaways:

  • Three levels of evaluation: Retrieval evaluation, generation evaluation, and end-to-end evaluation
  • RAGAS is a widely used evaluation framework that provides standardized metrics
  • Production deployment requires considering: Caching, async processing, degradation, and monitoring
  • Continuous optimization is the core: Collect feedback → Periodic evaluation → Targeted optimization → A/B testing → Deploy
  • LangSmith is a common choice for RAG monitoring in the LangChain ecosystem, tracing the input/output and latency of every step

This completes all 5 articles in the RAG series. A recap of the entire series:

  1. RAG Pipeline Overview — Full pipeline, component selection, real system analysis, common pitfalls
  2. Document Splitting — Five splitting strategies, best practices for different document types, parameter selection
  3. Vector Retrieval — Embedding models, vector databases, hybrid search, reranking
  4. Post-Retrieval Optimization — Query rewriting, HyDE, multi-path recall, context compression, hallucination suppression
  5. Evaluation and Production Deployment — Evaluation metrics, RAGAS, production architecture, monitoring, continuous optimization
