RAG Pipeline Overview: From Theory to Engineering Practice

RAG (Retrieval-Augmented Generation) is one of the hottest technical directions in AI application development today.

RAG-related projects on GitHub number in the tens of thousands, virtually all AI frameworks treat RAG as a core capability, and Stack Overflow has an overwhelming number of RAG-related questions. If you are building applications that need AI to answer "questions not in the training data," RAG is almost an unavoidable technology choice.

But RAG is also a severely underestimated direction. Many people think RAG is just "slicing documents, throwing them into a vector database, and then retrieving." In reality, from prototype to production, every step of RAG involves significant engineering detail, and the quality gap between a naive pipeline and a tuned one can be very large.

This article is about the big picture rather than a full implementation: it first builds your global understanding of RAG.

Why RAG Exists

Large language models have two fatal flaws:

Knowledge has a cutoff date. GPT-4o's training data, for example, ends in October 2023. Ask it about events after that, and it will most likely fabricate.

No private data. Your company's internal documentation, product manuals, technical specifications, historical decision records — these are simply not in the model's training data. Ask ChatGPT "What is our company's leave process?" and it can only give you a generic answer.

RAG solves both of these problems. Its core idea is simple: first retrieve document fragments relevant to the user's question from a knowledge base, stuff those fragments into the prompt, then let the large model generate an answer based on that information.

Think of it as giving the large model an external hard drive. The model doesn't have this knowledge itself, but through retrieval, it can "temporarily read" this knowledge to answer questions.
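The retrieve, stuff, generate flow can be sketched in a few lines. This is a minimal illustration, not a production design: the word-overlap retriever, the sample knowledge base, and the `echo_llm` stand-in are all assumptions made for the example, and a real system would use embeddings and an actual model call.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank fragments by the number of words shared with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        knowledge_base,
        key=lambda fragment: len(q_words & tokenize(fragment)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, fragments: list[str]) -> str:
    """Stuff the retrieved fragments into the prompt ahead of the question."""
    context = "\n".join(f"- {fragment}" for fragment in fragments)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str, knowledge_base: list[str], llm: Callable[[str], str]) -> str:
    """Retrieve first, then let the model generate from the augmented prompt."""
    return llm(build_prompt(question, retrieve(question, knowledge_base)))


# Hypothetical sample data for illustration only.
knowledge_base = [
    "Annual leave requests are submitted in the HR portal.",
    "The cafeteria opens at 8 am.",
    "Leave requests need manager approval within three days.",
]


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: returns the prompt unchanged."""
    return prompt


print(answer("How do I submit a leave request?", knowledge_base, echo_llm))
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider; swapping `echo_llm` for a real client is the only change needed to try it against a live model.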

A direct comparison:

Scenario	Without RAG	With RAG
"What did the March 2026 product launch cover?"	Fabricates a plausible but completely wrong answer	Retrieves accurate information from launch notes
"What is our company's code style guide?"	Gives a generic code style suggestion	Retrieves the actual guide from internal docs
"Has anyone encountered this bug before?"	Has no idea	Finds similar cases and solutions from historical issues

The value of RAG is not in making the model smarter, but in enabling it to access information it wouldn't otherwise know.

The Full RAG Pipeline

A complete RAG pipeline is roughly divided into six steps:

┌──────────────────────────────────────────────────────┐
│                     RAG Pipeline                     │
│                                                      │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐          │
│  │ Document │ → │ Document │ → │Embedding │          │
│  │ Loading  │   │ Splitting│   │          │          │
│  └──────────┘   └──────────┘   └──────────┘          │
│                                     │                │
│                                     ▼                │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐          │
│  │  Answer  │ ← │ Retrieval│ ← │  Vector  │          │
│  │Generation│   │          │   │ Storage  │          │
│  └──────────┘   └──────────┘   └──────────┘          │
│                                                      │
│  Offline phase (indexing): loading through storage   │
│  Online phase: retrieval and answer generation       │
└──────────────────────────────────────────────────────┘

Offline Phase: Building the Index

Step 1: Document Loading

Read the original documents. PDFs, Word files, web pages, Markdown, database records, API response data... different sources require different loaders.

This step looks simple, but it's full of pitfalls. How do you handle tables and charts in PDFs? How do you filter navigation bars and ads from web pages? How do you ensure OCR quality for scanned documents? The quality of document loading directly determines the upper limit of the RAG pipeline.

Step 2: Document Splitting

Cut long documents into smaller chunks. This is one of the most impactful steps in RAG, and also one of the most easily overlooked.

Splitting too coarsely means a chunk contains too much irrelevant content, introducing noise during retrieval. Splitting too finely means a chunk lacks sufficient context, making it impossible for the model to understand its meaning.

Common splitting strategies:

Fixed-size splitting: By character count or token count — simple and crude
Recursive splitting: Split by paragraph first, then by sentence for overly long paragraphs, recursing layer by layer
Semantic splitting: Split by topic or semantic boundaries — each chunk is self-consistent
Parent-child chunks: Small chunks for retrieval, large chunks for generation — balancing precision and context
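To make the first two strategies concrete, here is a minimal sketch of fixed-size splitting with overlap and a simplified recursive splitter. The function names and the separator list are assumptions for illustration; library splitters such as RecursiveCharacterTextSplitter handle more edge cases (for example, this sketch drops the separator at chunk boundaries).

```python
def split_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # the last window already reaches the end
            break
        start += step
    return chunks


def split_recursive(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        return split_fixed(text, max_len)  # nothing left to split on: fall back
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p.strip()):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(split_recursive(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = ("Para one is short.\n\n"
       "Para two has a first sentence. It also has a second sentence.")
for chunk in split_recursive(doc, max_len=40):
    print(repr(chunk))
```

The recursive version only falls back to a finer separator when a piece is still over the limit, which is why short paragraphs survive intact while long ones are broken at sentence boundaries.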

Step 3: Embedding

Convert text chunks into numeric vectors. Embedding models map semantically similar text to nearby locations in vector space.

The core of this step is choosing the Embedding model. Different models vary greatly in dimensionality, speed, and performance. OpenAI's text-embedding-3-large (3072 dimensions), Cohere's embed-v3 (1024 dimensions), and open-source BGE-large (1024 dimensions) each have their strengths and weaknesses.
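What "nearby in vector space" means can be shown with cosine similarity. The sketch below is a toy: `toy_embed` counts words over a small, made-up vocabulary, whereas real embedding models are trained networks that also place paraphrases with no shared words close together. Only the similarity computation carries over unchanged.

```python
import math
import re
from collections import Counter


def toy_embed(text: str, vocab: list[str]) -> list[float]:
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [float(counts[word]) for word in vocab]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical vocabulary and texts for illustration only.
vocab = ["refund", "policy", "shipping", "time"]
query = toy_embed("refund policy", vocab)
related = toy_embed("Policy on refund requests", vocab)
unrelated = toy_embed("Shipping time estimates", vocab)

print(cosine(query, related))    # close to 1: same direction
print(cosine(query, unrelated))  # 0.0: no shared dimensions
```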

Step 4: Vector Storage

Store vectors in a vector database. Mainstream options include Chroma (lightweight, local), Milvus (high-performance, distributed), Pinecone (fully managed cloud service), Weaviate (open-source, hybrid search), and more.
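Conceptually, a vector store keeps (id, vector, text) rows and answers "which rows are closest to this query vector". The class below is a brute-force, in-memory sketch with hypothetical names; it is not how any of the databases above are implemented, since they typically rely on approximate nearest-neighbour indexes to stay fast at scale.

```python
import heapq
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class InMemoryVectorStore:
    """Brute-force store: every search scans all rows."""
    rows: list[tuple[str, list[float], str]] = field(default_factory=list)

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self.rows.append((doc_id, vector, text))

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str, str]]:
        """Return the k rows with the highest cosine similarity, best first."""
        scored = ((cosine(query, vec), doc_id, text) for doc_id, vec, text in self.rows)
        return heapq.nlargest(k, scored)


# Hypothetical two-dimensional vectors for illustration only.
store = InMemoryVectorStore()
store.add("a", [1.0, 0.0], "chunk about billing")
store.add("b", [0.0, 1.0], "chunk about holidays")
store.add("c", [0.9, 0.1], "chunk about invoices")

for score, doc_id, text in store.search([1.0, 0.0], k=2):
    print(f"{doc_id}  {score:.3f}  {text}")
```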

Online Phase: Answering Queries

Step 5: Retrieval

The user asks a question. First convert it to a vector, then find the k most similar chunks in the vector database.

Retrieval is not simply "find the most similar." In production, you often need:

Hybrid search: Vector retrieval + keyword search (BM25), complementing each other
Query rewriting: The user's question may be unclear — have the LLM rewrite or decompose it first
Reranking: Initial retrieval results may not be precise enough — re-rank with a cross-encoder
Multi-path recall: Retrieve from multiple angles and merge results
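One common way to merge the ranked lists that hybrid search or multi-path recall produces is Reciprocal Rank Fusion (RRF): each document earns 1 / (k + rank) from every list it appears in. The sketch below assumes each retriever returns document IDs best-first; the IDs are made up, and k = 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one list using RRF scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two retrievers for the same query.
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF uses only ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales and cannot be added directly.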

Step 6: Generation

Use the retrieval results as context, combine them with the user's question in a prompt, and let the large model generate an answer.

The core challenge of generation is hallucination control — the model might answer based on the retrieval results, or it might "make things up." You need to explicitly require the model in the prompt to answer only based on the provided context, and to indicate when it's uncertain.

Component Selection Map

Each step of RAG has multiple options. The following is not a complete selection guide, but helps you build awareness of "what options exist."

Document Loaders

Type	Tool	Use Case
File Parsing	PyPDFLoader, Docx2txtLoader, Unstructured	PDFs, Word, PPT, and other local files
Web Scraping	WebBaseLoader, CheerioLoader, Playwright	Web pages, pages requiring JS rendering
Database	SQLDatabaseLoader, MongoDBAtlasLoader	Structured data, NoSQL data
API	ApifyWrapper, GitHub API Loader	Third-party platform data
Universal	Unstructured (unified interface)	Mixed-format scenarios

Text Splitters

Strategy	Tool	Characteristics
Fixed-size	CharacterTextSplitter	Simple and fast, ignores semantic boundaries
Recursive	RecursiveCharacterTextSplitter	Splits by paragraph → sentence → word, preserving semantics
Semantic	SemanticChunker	Detects topic boundaries based on Embedding similarity
Parent-child	ParentDocumentRetriever	Small chunks for retrieval, large chunks for generation
Sliding window	SentenceWindowRetriever	Retrieves sentences, returns context window

Embedding Models

Model	Dimensions	Characteristics
text-embedding-3-large	3072	OpenAI's strongest embedding model, competitive on the MTEB benchmark
text-embedding-3-small	1536	OpenAI's value option, good performance
embed-v3	1024	Cohere, supports compressed dimensions
BGE-large	1024	Open-source top choice, strong Chinese/English results
GTE-large	1024	Alibaba open-source, excellent for Chinese
E5-large	1024	Microsoft open-source, strong multilingual support (multilingual-e5-large variant)

Vector Databases

Database	Type	Characteristics
Chroma	Embedded	Zero-config, Python-native, ideal for prototypes
Milvus	Distributed	High performance, supports large-scale data
Pinecone	Cloud Service	Fully managed, no ops needed, pay-per-use
Weaviate	Open-source	Hybrid search (vector + keyword), GraphQL interface
Qdrant	Open-source	High performance, Rust implementation
pgvector	PG Extension	Leverages existing PostgreSQL infrastructure

Retrieval Strategies

Strategy	Description	Use Case
Vector retrieval	Find most similar chunks via cosine similarity	Semantic matching
Keyword search (BM25)	Based on term frequency and inverse document frequency	Exact keyword matching
Hybrid search	Vector + keyword, RRF or weighted fusion	Most production scenarios
Reranking	Re-rank initial results with cross-encoder	High-precision requirements
Multi-path recall	Multiple retrievers in parallel, merge results	Complex queries
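For readers who have not seen the keyword side spelled out, here is a compact sketch of BM25 scoring. It uses one common form of the formula (the smoothed IDF popularized by Lucene) with the usual defaults k1 = 1.5 and b = 0.75; the tokenizer and sample documents are assumptions for illustration, and production systems would normally rely on a search engine or library instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; real systems add stemming, stop words, etc."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical documents for illustration only.
docs = [
    "Error code E42 in payment service",
    "Payment service overview",
    "Holiday schedule",
]
print(bm25_scores("E42 error", docs))
```

An exact identifier such as "E42" is the kind of query where keyword scoring tends to beat pure vector retrieval, which is the motivation for the hybrid row in the table.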

RAG in Practice: Claude Code Memory System

Theory aside, let's look at what a real RAG system looks like.

The MAGMA memory system in Claude Code is a typical RAG implementation. Its core storage is the LanceDB vector database, paired with Obsidian notes as a knowledge source.

System Architecture

User Query
    │
    ▼
┌─────────────────────┐
│ IntelligentRouter   │  ← Intent Analysis + Routing
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    ▼             ▼
┌───────┐   ┌──────────┐
│ CCB   │   │  MAGMA   │
│(File  │   │ (Vector  │
│System)│   │ Database)│
│       │   │          │
│Markdown│  │ LanceDB  │
│Files  │   │ Obsidian │
│       │   │ Knowledge│
│       │   │ Graph    │
└───┬───┘   └────┬─────┘
    │            │
    ▼            ▼
┌─────────────────────────┐
│   CrossSystemRetriever   │  ← Cross-system Retrieval + Dedup
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│    MemoryFusion         │  ← Semantic Dedup + Ranking + Token Budget
└──────────┬──────────────┘
           │
           ▼
    Injected into Claude Context
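The last stage of the diagram (dedup, ranking, token budget) can be illustrated with a small sketch. This is not MAGMA's actual code: the `Memory` type, the Jaccard word-overlap check standing in for semantic dedup, and the whitespace word count standing in for a tokenizer are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    score: float  # relevance score from whichever retriever found it
    source: str   # e.g. "CCB" or "MAGMA"


def jaccard(a: str, b: str) -> float:
    """Word-set overlap, used here as a crude proxy for semantic similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def fuse(memories: list[Memory], token_budget: int,
         dedup_threshold: float = 0.8) -> list[Memory]:
    """Rank by score, drop near-duplicates, and stop adding once the budget is spent."""
    kept: list[Memory] = []
    used = 0
    for memory in sorted(memories, key=lambda m: m.score, reverse=True):
        if any(jaccard(memory.text, k.text) >= dedup_threshold for k in kept):
            continue  # a higher-ranked near-duplicate is already included
        cost = len(memory.text.split())  # rough token estimate: one per word
        if used + cost > token_budget:
            continue  # does not fit; a shorter, lower-ranked memory still might
        kept.append(memory)
        used += cost
    return kept


# Hypothetical memories for illustration only.
candidates = [
    Memory("User prefers tabs over spaces", 0.9, "CCB"),
    Memory("user prefers tabs over spaces", 0.8, "MAGMA"),
    Memory("Project uses Python 3.12 and pytest", 0.7, "MAGMA"),
    Memory("The staging server is rebuilt every night at two in the morning", 0.6, "CCB"),
]
for memory in fuse(candidates, token_budget=11):
    print(memory.source, "|", memory.text)
```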

Why the Basic Implementation Is Sufficient

MAGMA's vector retrieval implementation is quite basic — pure vector search, no hybrid search, no reranking. Yet it works well. Why?

Because the memory system has a unique advantage: extremely high data quality.

Memories aren't documents scraped from the internet — they're written by users themselves: user preferences, project context, feedback and guidance. Each memory is human-written, well-formatted, semantically clear, and moderately sized.

This reveals a core RAG principle: data quality > retrieval algorithm. High-quality data paired with a basic retrieval algorithm usually outperforms low-quality data paired with a complex one.

Five Common Reasons RAG Performs Poorly

If you've built a RAG system but the results aren't good, you've likely hit one of these five pitfalls:

Pitfall 1: Document Splitting Too Coarse or Too Fine

Symptoms: Retrieval results contain large amounts of irrelevant content, or are too short for the model to understand.

Diagnosis: Print out the retrieved chunks and check if they are semantically complete. If a chunk mixes three different topics, splitting is too coarse. If a chunk is only half a sentence, splitting is too fine.

Solution: Adjust the splitting strategy based on document type. Use recursive splitting for technical docs, split by turn for conversations, split by function for code.
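As one example of matching the splitter to the document type, Python source can be split by function using the standard library's `ast` module. This is a sketch that only looks at top-level definitions; the sample source is made up, and module-level statements, nested functions, and decorators are deliberately left out of scope.

```python
import ast


def split_python_by_function(source: str) -> list[tuple[str, str]]:
    """Return (name, source text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


# Hypothetical module used only to demonstrate the splitter.
SOURCE = '''import os

def load(path):
    """Read a file and return its text."""
    with open(path) as handle:
        return handle.read()

def save(path, text):
    with open(path, "w") as handle:
        handle.write(text)
'''

for name, code in split_python_by_function(SOURCE):
    print(f"--- {name} ---")
    print(code)
```

Each chunk is a complete function, so a retrieved chunk never starts halfway through a body the way a fixed-size window would.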

Pitfall 2: Embedding Model Mismatch with Data

Symptoms: The semantic relevance between retrieval results and questions is low. Relevant documents exist but aren't being found.

Diagnosis: Manually check retrieval results with a few typical questions. If documents that a human would consider relevant do not show up in the results, the Embedding model isn't correctly understanding your data.

Solution: Switch to an Embedding model better suited to your data type. Try BGE or GTE for Chinese data, E5 for multilingual data.

Pitfall 3: Retrieval Results Lack Context

Symptoms: Relevant chunks are retrieved, but the model lacks necessary contextual information when answering.

Diagnosis: Check if the retrieved chunks are self-contained. If a chunk heavily uses phrases like "as mentioned above" or "this feature," the context has been severed by splitting.

Solution: Increase overlap, or use the parent-child chunk strategy — small chunks for retrieval, large chunks for generation.
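The parent-child idea can be sketched as an index of small chunks that each remember which larger passage they came from. Everything here is illustrative: the word-overlap matching stands in for vector search, and the function names and sample passages are assumptions, not the API of ParentDocumentRetriever.

```python
def build_child_index(parents: list[str], child_size: int) -> list[tuple[int, str]]:
    """Cut each parent into word windows and record (parent_id, child_text)."""
    index = []
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_size):
            index.append((parent_id, " ".join(words[start:start + child_size])))
    return index


def retrieve_parents(query: str, index: list[tuple[int, str]],
                     parents: list[str], k: int = 2) -> list[str]:
    """Match against small chunks, but return the deduplicated parent passages."""
    query_words = set(query.lower().split())
    ranked = sorted(
        index,
        key=lambda row: len(query_words & set(row[1].lower().split())),
        reverse=True,
    )
    seen, results = set(), []
    for parent_id, _child in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


# Hypothetical passages for illustration only.
parents = [
    "The export feature writes CSV files. It is enabled from the settings page under Data.",
    "Billing runs on the first day of each month.",
]
index = build_child_index(parents, child_size=6)
print(retrieve_parents("settings page", index, parents))
```

The matching child here is "It is enabled from the settings", which on its own never says what "it" is; returning the parent restores the sentence that names the export feature.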

Pitfall 4: Prompts Don't Constrain Hallucination

Symptoms: The model doesn't answer based on retrieval results, but instead combines its own "knowledge" to generate plausible but actually incorrect content.

Diagnosis: Change key information in the retrieval results to something obviously wrong, and check if the model still answers based on the retrieval results.
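This diagnosis can be automated as a small counterfactual check: plant an obviously wrong value in the context and see whether it shows up in the answer. The sketch below assumes the model is exposed as a plain callable; the two stub models and the rate-limit example are hypothetical and exist only so the check can run without a real LLM.

```python
from typing import Callable


def follows_context(llm: Callable[[str], str], question: str,
                    context: str, planted_value: str) -> bool:
    """True if the answer repeats the planted (deliberately wrong) value."""
    prompt = (
        "Answer based only on the following context.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return planted_value.lower() in llm(prompt).lower()


def grounded_stub(prompt: str) -> str:
    """Hypothetical model that simply repeats the first context line."""
    return prompt.split("Context:\n", 1)[1].split("\n", 1)[0]


def stubborn_stub(prompt: str) -> str:
    """Hypothetical model that ignores the context and answers from memory."""
    return "The API rate limit is 100 requests per minute."


# The context deliberately contradicts the "real" value of 100.
context = "The API rate limit is 7 requests per minute."
question = "What is the API rate limit?"

print(follows_context(grounded_stub, question, context, "7 requests"))
print(follows_context(stubborn_stub, question, context, "7 requests"))
```

A substring match is a blunt instrument; with a real model it is worth reading a sample of the answers by hand before trusting the pass rate.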

Solution: Explicitly constrain in the prompt: "Answer based only on the following context. If the context does not contain relevant information, state that you don't know. Do not fabricate information."
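A prompt template built around that instruction might look like the sketch below. The layout (rules first, numbered context chunks, then the question) is one reasonable arrangement rather than a required format, and the sample chunks are made up.

```python
GROUNDING_RULES = (
    "Answer based only on the following context. "
    "If the context does not contain relevant information, state that you don't know. "
    "Do not fabricate information."
)


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Put the constraint first, then numbered context chunks, then the question."""
    numbered = "\n".join(
        f"[{number}] {chunk.strip()}" for number, chunk in enumerate(chunks, start=1)
    )
    return f"{GROUNDING_RULES}\n\nContext:\n{numbered}\n\nQuestion: {question}\nAnswer:"


# Hypothetical retrieved chunks for illustration only.
prompt = build_grounded_prompt(
    "How many days of annual leave do new employees get?",
    ["New employees receive 15 days of annual leave.", "Leave is requested in the HR portal."],
)
print(prompt)
```

Numbering the chunks also makes it easy to ask the model to cite which chunk supports each statement, which helps when auditing answers later.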

Pitfall 5: No Evaluation, Optimization by Guesswork

Symptoms: You changed the splitting strategy, swapped the Embedding model, adjusted retrieval parameters — but have no idea which change actually helped.

Diagnosis: Do you have a standardized evaluation process to measure RAG performance?

Solution: Build an evaluation system. At minimum: a test question set, expected answers, and evaluation metrics (relevance, faithfulness, completeness). Run the evaluation after every change and let the data speak.
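A first version of such an evaluation loop can be very small. The sketch below measures only the retrieval side, using hit rate and mean reciprocal rank (MRR) as simple proxies for relevance; these two metrics, the test-set shape, and the fake retriever are assumptions for illustration, and faithfulness and completeness of the generated answer need separate checks.

```python
from typing import Callable


def evaluate_retrieval(test_set: list[tuple[str, str]],
                       retriever: Callable[[str], list[str]],
                       k: int = 3) -> dict[str, float]:
    """Compute hit rate@k and MRR@k over (question, expected_chunk_id) pairs."""
    if not test_set:
        return {"hit_rate": 0.0, "mrr": 0.0}
    hits, reciprocal_ranks = 0, 0.0
    for question, expected_id in test_set:
        results = retriever(question)[:k]
        if expected_id in results:
            hits += 1
            reciprocal_ranks += 1.0 / (results.index(expected_id) + 1)
    return {"hit_rate": hits / len(test_set), "mrr": reciprocal_ranks / len(test_set)}


# Hypothetical test set and canned retriever output for illustration only.
test_set = [("q1", "d1"), ("q2", "d2"), ("q3", "d9")]
canned_results = {"q1": ["d1", "d5"], "q2": ["d7", "d2"], "q3": ["d3", "d4"]}


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the real pipeline: looks up pre-recorded results."""
    return canned_results[question]


print(evaluate_retrieval(test_set, fake_retriever))
```

Because the retriever is passed in as a callable, the same test set can be replayed against each pipeline variant and the numbers compared directly.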

RAG System Complexity Map

Back to the original question: is RAG complex or not?

Prototypes are very simple. A few dozen lines of code can get a RAG pipeline running: load documents → split → embed → retrieve → generate. That's the path of most LangChain tutorials.

Production is very complex. Each step has multiple options, and the number of possible combinations multiplies with every step you add. Moreover, RAG performance is not determined by any single step, but by the coordination of the entire pipeline.

Simple ←───────────────────────────────────────────────────────────────────────────────→ Complex

Basic RAG       Tuned RAG                   Production RAG       Enterprise RAG
─────────────────────────────────────────────────────────────────────────────────────────────
Fixed split     Recursive                   Semantic split       Adaptive split
Single vector   Hybrid (vector + keyword)   Hybrid + reranking   Multi-path recall
Basic prompt    Constrained prompt          Dynamic prompt       Multi-turn dialogue + memory
No eval         Human eval + feedback       Automated eval       Continuous monitoring

This article helps you build a global understanding. The following articles in this series dive into each step, starting with document splitting, to systematically break down the engineering details of RAG.

Series:

Next: Document Splitting: The First Step of RAG, and the Easiest to Get Wrong
