RAG Pipeline Overview: From Theory to Engineering Practice

RAG (Retrieval-Augmented Generation) is one of the hottest technical directions in AI application development today.

RAG-related projects on GitHub number in the tens of thousands, virtually all AI frameworks treat RAG as a core capability, and Stack Overflow has an overwhelming number of RAG-related questions. If you are building applications that need AI to answer "questions not in the training data," RAG is almost an unavoidable technology choice.

But RAG is also a severely underestimated direction. Many people think RAG is just "slicing documents, throwing them into a vector database, and then retrieving." In reality, from prototype to production, every step of RAG involves significant engineering detail. The performance difference between a naive RAG pipeline and a tuned RAG pipeline can be tenfold.

This article is not a code walkthrough. Its goal is to build your global understanding of RAG first.

Why RAG Exists

Large language models have two fatal flaws:

Knowledge has a cutoff date. GPT-4o's documented knowledge cutoff, for example, is October 2023. Ask it about events after that, and it will most likely fabricate an answer.

No private data. Your company's internal documentation, product manuals, technical specifications, historical decision records — these are simply not in the model's training data. Ask ChatGPT "What is our company's leave process?" and it can only give you a generic answer.

RAG solves both of these problems. Its core idea is simple: first retrieve document fragments relevant to the user's question from a knowledge base, stuff those fragments into the prompt, then let the large model generate an answer based on that information.

Think of it as giving the large model an external hard drive. The model doesn't have this knowledge itself, but through retrieval, it can "temporarily read" this knowledge to answer questions.

A direct comparison:

| Scenario | Without RAG | With RAG |
|---|---|---|
| "What did the March 2026 product launch cover?" | Fabricates a plausible but completely wrong answer | Retrieves accurate information from launch notes |
| "What is our company's code style guide?" | Gives a generic code style suggestion | Retrieves the actual guide from internal docs |
| "Has anyone encountered this bug before?" | Has no idea | Finds similar cases and solutions from historical issues |

The value of RAG is not in making the model smarter, but in enabling it to access information it wouldn't otherwise know.

The Full RAG Pipeline

A complete RAG pipeline is roughly divided into six steps:

┌──────────────────────────────────────────────────────┐
│                     RAG Pipeline                     │
│                                                      │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐        │
│  │ Document │ →  │ Document │ →  │Embedding │        │
│  │ Loading  │    │Splitting │    │          │        │
│  └──────────┘    └──────────┘    └──────────┘        │
│                                       │              │
│                                       ▼              │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐        │
│  │  Answer  │ ←  │Retrieval │ ←  │  Vector  │        │
│  │Generation│    │          │    │ Storage  │        │
│  └──────────┘    └──────────┘    └──────────┘        │
│                                                      │
│  Offline (indexing): Loading through Vector Storage  │
│  Online (query): Retrieval and Answer Generation     │
└──────────────────────────────────────────────────────┘

Offline Phase: Building the Index

Step 1: Document Loading

Read the original documents. PDFs, Word files, web pages, Markdown, database records, API response data... different sources require different loaders.

This step looks simple, but it's full of pitfalls. How do you handle tables and charts in PDFs? How do you filter navigation bars and ads from web pages? How do you ensure OCR quality for scanned documents? The quality of document loading directly determines the upper limit of the RAG pipeline.

Step 2: Document Splitting

Cut long documents into smaller chunks. This is one of the most impactful steps in RAG, and also one of the most easily overlooked.

Splitting too coarsely means a chunk contains too much irrelevant content, introducing noise during retrieval. Splitting too finely means a chunk lacks sufficient context, making it impossible for the model to understand its meaning.

Common splitting strategies:

  • Fixed-size splitting: By character count or token count — simple and crude
  • Recursive splitting: Split by paragraph first, then by sentence for overly long paragraphs, recursing layer by layer
  • Semantic splitting: Split by topic or semantic boundaries — each chunk is self-consistent
  • Parent-child chunks: Small chunks for retrieval, large chunks for generation — balancing precision and context
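The first two strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements them: the function names, the default sizes, and the separator list are assumptions chosen for readability, and sizes are measured in characters rather than tokens.

```python
def split_fixed(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size splitting by character count, with overlap between neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def split_recursive(text: str, chunk_size: int = 200,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Recursive splitting: try the coarsest separator first, then finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return split_fixed(text, chunk_size, overlap=0)
    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with the finer separators.
            chunks.extend(split_recursive(piece, chunk_size, tuple(rest)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = "Para one. Still one.\n\nPara two is here."
print(split_recursive(doc, chunk_size=25))
```

Note that in this sketch a separator that falls exactly on a chunk boundary is dropped, which is one of the small details real splitters have to decide on.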

Step 3: Embedding

Convert text chunks into numeric vectors. Embedding models map semantically similar text to nearby locations in vector space.

The core of this step is choosing the Embedding model. Different models vary greatly in dimensionality, speed, and performance. OpenAI's text-embedding-3-large (3072 dimensions), Cohere's embed-v3 (1024 dimensions), and open-source BGE-large (1024 dimensions) each have their strengths and weaknesses.
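To make "nearby in vector space" concrete, here is a minimal sketch of the mechanics. `toy_embed` is a hypothetical stand-in for a real embedding model: it only hashes words into buckets, so it captures shared vocabulary, not meaning. The closeness measure, cosine similarity, is the same one commonly used with real models.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. NOT a real embedding model."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 keeps the bucket assignment stable across runs and machines.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


a = toy_embed("How do I reset my password?")
b = toy_embed("Password reset steps")
print(cosine(a, b) > 0)  # True: the two texts share the words "reset" and "password"
```

With a real model, the vector for "I forgot my login" would also land close to these two, even though it shares no words with them. That is the property the toy function cannot reproduce.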

Step 4: Vector Storage

Store vectors in a vector database. Mainstream options include Chroma (lightweight, local), Milvus (high-performance, distributed), Pinecone (fully managed cloud service), Weaviate (open-source, hybrid search), and more.

Online Phase: Answering Queries

Step 5: Retrieval

The user asks a question — first convert it to a vector, then find the most similar k chunks in the vector database.
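Storage and top-k lookup can be illustrated together with a brute-force, in-memory store. The class below is a hypothetical sketch, not the API of Chroma, Milvus, or any other product; real vector databases typically replace the linear scan with an approximate nearest-neighbor index. The 2-dimensional vectors are hand-picked stand-ins for real embeddings.

```python
import heapq
import math
from dataclasses import dataclass, field


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


@dataclass
class InMemoryVectorStore:
    """Hypothetical brute-force store that scans every row on each query."""
    rows: list[tuple[str, list[float]]] = field(default_factory=list)

    def add(self, chunk: str, vector: list[float]) -> None:
        self.rows.append((chunk, normalize(vector)))

    def search(self, query_vector: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k most similar chunks as (score, chunk) pairs, best first."""
        query = normalize(query_vector)
        scored = (
            (sum(q * v for q, v in zip(query, vector)), chunk)
            for chunk, vector in self.rows
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = InMemoryVectorStore()
store.add("Leave requests go through the HR portal.", [1.0, 0.0])
store.add("Expense reports are due monthly.", [0.0, 1.0])
store.add("Vacation days reset every January.", [0.9, 0.1])

for score, chunk in store.search([1.0, 0.0], k=2):
    print(f"{score:.3f}  {chunk}")
```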

Retrieval is not simply "find the most similar." In production, you often need:

  • Hybrid search: Vector retrieval + keyword search (BM25), complementing each other
  • Query rewriting: The user's question may be unclear — have the LLM rewrite or decompose it first
  • Reranking: Initial retrieval results may not be precise enough — re-rank with a cross-encoder
  • Multi-path recall: Retrieve from multiple angles and merge results
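One common way to merge the result lists produced by hybrid search or multi-path recall is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. The sketch below assumes each retriever returns an ordered list of chunk IDs; the IDs are made up, and `k = 60` is the conventional smoothing constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical output of vector retrieval
keyword_hits = ["b", "d", "a"]  # hypothetical output of BM25

for doc_id, score in reciprocal_rank_fusion([vector_hits, keyword_hits]):
    print(doc_id, round(score, 4))
```

Chunks that appear in both lists ("a" and "b" here) rise to the top, which is the "complementing each other" effect hybrid search is after.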

Step 6: Generation

Use the retrieval results as context, combine them with the user's question in a prompt, and let the large model generate an answer.

The core challenge of generation is hallucination control: the model might answer from the retrieval results, or it might "make things up." In the prompt, explicitly instruct the model to answer only from the provided context and to say so when it is uncertain.
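A grounded prompt can be assembled with plain string formatting. The sketch below is one possible layout, not a canonical template: the function name and the numbered-context format are assumptions, and the resulting string would be sent to whichever LLM client you use.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt: instructions, numbered context, then the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer based only on the following context. "
        "If the context does not contain relevant information, state that you don't know. "
        "Do not fabricate information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the leave process?",
    ["Submit a request in the HR portal.", "Managers approve within 3 days."],
)
print(prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which fragment supports each claim, which helps when checking answers by hand.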

Component Selection Map

Each step of RAG has multiple options. The following is not a complete selection guide, but helps you build awareness of "what options exist."

Document Loaders

| Type | Tool | Use Case |
|---|---|---|
| File Parsing | PyPDFLoader, Docx2txtLoader, Unstructured | PDFs, Word, PPT, and other local files |
| Web Scraping | WebBaseLoader, CheerioLoader, Playwright | Web pages, pages requiring JS rendering |
| Database | SQLDatabaseLoader, MongoDBAtlasLoader | Structured data, NoSQL data |
| API | ApifyWrapper, GitHub API Loader | Third-party platform data |
| Universal | Unstructured (unified interface) | Mixed-format scenarios |

Text Splitters

| Strategy | Tool | Characteristics |
|---|---|---|
| Fixed-size | CharacterTextSplitter | Simple and fast, ignores semantic boundaries |
| Recursive | RecursiveCharacterTextSplitter | Splits by paragraph → sentence → word, preserving semantics |
| Semantic | SemanticChunker | Detects topic boundaries based on Embedding similarity |
| Parent-child | ParentDocumentRetriever | Small chunks for retrieval, large chunks for generation |
| Sliding window | SentenceWindowRetriever | Retrieves sentences, returns context window |

Embedding Models

| Model | Dimensions | Characteristics |
|---|---|---|
| text-embedding-3-large | 3072 | OpenAI's largest embedding model; output can be shortened via the dimensions parameter |
| text-embedding-3-small | 1536 | OpenAI's value option, good performance |
| embed-v3 | 1024 | Cohere; supports compressed (int8/binary) embeddings |
| BGE-large | 1024 | Open-source top choice, strong Chinese/English results |
| GTE-large | 1024 | Alibaba open-source, excellent for Chinese |
| E5-large | 1024 | Microsoft open-source, strong multilingual support |

Vector Databases

| Database | Type | Characteristics |
|---|---|---|
| Chroma | Embedded | Zero-config, Python-native, ideal for prototypes |
| Milvus | Distributed | High performance, supports large-scale data |
| Pinecone | Cloud Service | Fully managed, no ops needed, pay-per-use |
| Weaviate | Open-source | Hybrid search (vector + keyword), GraphQL interface |
| Qdrant | Open-source | High performance, Rust implementation |
| pgvector | PG Extension | Leverages existing PostgreSQL infrastructure |

Retrieval Strategies

| Strategy | Description | Use Case |
|---|---|---|
| Vector retrieval | Find most similar chunks via cosine similarity | Semantic matching |
| Keyword search (BM25) | Based on term frequency and inverse document frequency | Exact keyword matching |
| Hybrid search | Vector + keyword, RRF or weighted fusion | Most production scenarios |
| Reranking | Re-rank initial results with cross-encoder | High-precision requirements |
| Multi-path recall | Multiple retrievers in parallel, merge results | Complex queries |
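To show what "term frequency and inverse document frequency" means in practice, here is a small sketch of Okapi BM25 scoring over pre-tokenized documents. It assumes a commonly used smoothed IDF variant and the typical defaults `k1 = 1.5` and `b = 0.75`; the example documents are made up, and production systems would normally rely on a search engine or library instead.

```python
import math
from collections import Counter


def bm25_scores(query_tokens: list[str], docs_tokens: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a high IDF, common terms a low one.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalisation: long documents need more hits to score the same.
            denom = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append(score)
    return scores


docs = [
    ["error", "code", "e1234", "timeout"],
    ["network", "timeout", "retry"],
    ["billing", "invoice", "faq"],
]
print(bm25_scores(["e1234", "timeout"], docs))
```

The exact identifier "e1234" is where keyword search tends to help: an embedding model may treat it as noise, while BM25 rewards the document that contains it verbatim.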

RAG in Practice: Claude Code Memory System

Theory aside, let's look at what a real RAG system looks like.

The MAGMA memory system in Claude Code is a typical RAG implementation. Its core storage is the LanceDB vector database, paired with Obsidian notes as a knowledge source.

System Architecture

User Query
    │
    ▼
┌─────────────────────┐
│ IntelligentRouter   │  ← Intent Analysis + Routing
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    ▼             ▼
┌────────┐  ┌──────────┐
│  CCB   │  │  MAGMA   │
│ (File  │  │ (Vector  │
│ System)│  │ Database)│
│        │  │          │
│Markdown│  │ LanceDB  │
│ Files  │  │ Obsidian │
│        │  │ Knowledge│
│        │  │ Graph    │
└───┬────┘  └────┬─────┘
    │            │
    ▼            ▼
┌─────────────────────────┐
│  CrossSystemRetriever   │  ← Cross-system Retrieval + Dedup
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│    MemoryFusion         │  ← Semantic Dedup + Ranking + Token Budget
└──────────┬──────────────┘
           │
           ▼
    Injected into Claude Context

Why the Basic Implementation Is Sufficient

MAGMA's vector retrieval implementation is quite basic — pure vector search, no hybrid search, no reranking. Yet it works well. Why?

Because the memory system has a unique advantage: extremely high data quality.

Memories aren't documents scraped from the internet — they're written by users themselves: user preferences, project context, feedback and guidance. Each memory is human-written, well-formatted, semantically clear, and moderately sized.

This reveals a core RAG principle: data quality > retrieval algorithm. High-quality data paired with a basic retrieval algorithm usually outperforms low-quality data paired with a complex one.

Five Common Reasons RAG Performs Poorly

If you've built a RAG system but the results aren't good, you've likely hit one of these five pitfalls:

Pitfall 1: Document Splitting Too Coarse or Too Fine

Symptoms: Retrieval results contain large amounts of irrelevant content, or are too short for the model to understand.

Diagnosis: Print out the retrieved chunks and check if they are semantically complete. If a chunk mixes three different topics, splitting is too coarse. If a chunk is only half a sentence, splitting is too fine.

Solution: Adjust the splitting strategy based on document type. Use recursive splitting for technical docs, split by turn for conversations, split by function for code.

Pitfall 2: Embedding Model Mismatch with Data

Symptoms: The semantic relevance between retrieval results and questions is low. Relevant documents exist but aren't being found.

Diagnosis: Manually check retrieval results with a few typical questions. If a human would judge a document relevant but retrieval does not return it, the Embedding model isn't capturing the meaning of your data.

Solution: Switch to an Embedding model better suited to your data type. Try BGE or GTE for Chinese data, E5 for multilingual data.

Pitfall 3: Retrieval Results Lack Context

Symptoms: Relevant chunks are retrieved, but the model lacks necessary contextual information when answering.

Diagnosis: Check if the retrieved chunks are self-contained. If a chunk heavily uses phrases like "as mentioned above" or "this feature," the context has been severed by splitting.

Solution: Increase overlap, or use the parent-child chunk strategy — small chunks for retrieval, large chunks for generation.
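The parent-child idea boils down to keeping a mapping from each small chunk back to the larger unit it came from. The sketch below is a hypothetical illustration: the function names are invented, children are cut by a fixed character count for simplicity, and the retrieved child IDs are assumed to come from whatever vector search you already have.

```python
def build_parent_child_index(parents: list[str], child_size: int = 80) -> tuple[list[str], list[int]]:
    """Cut each parent into small children and remember which parent each child belongs to."""
    children: list[str] = []
    parent_of: list[int] = []
    for parent_id, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            children.append(parent[start:start + child_size])
            parent_of.append(parent_id)
    return children, parent_of


def expand_to_parents(hit_child_ids: list[int], parent_of: list[int],
                      parents: list[str]) -> list[str]:
    """Map retrieved child IDs to their parents, de-duplicated, in hit order."""
    seen: set[int] = set()
    result: list[str] = []
    for child_id in hit_child_ids:
        parent_id = parent_of[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


parents = ["A" * 100, "B" * 50]
children, parent_of = build_parent_child_index(parents, child_size=40)
print(parent_of)  # [0, 0, 0, 1, 1]

# Suppose vector search over the children returned IDs 4, 1 and 0.
context = expand_to_parents([4, 1, 0], parent_of, parents)
print(len(context))  # 2
```

Only the children are embedded and searched, while the text placed in the prompt is the parent, so a hit on half a paragraph still brings the whole paragraph along.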

Pitfall 4: Prompts Don't Constrain Hallucination

Symptoms: The model doesn't answer based on retrieval results, but instead combines its own "knowledge" to generate plausible but actually incorrect content.

Diagnosis: Change key information in the retrieval results to something obviously wrong, and check if the model still answers based on the retrieval results.

Solution: Explicitly constrain in the prompt: "Answer based only on the following context. If the context does not contain relevant information, state that you don't know. Do not fabricate information."

Pitfall 5: No Evaluation, Optimization by Guesswork

Symptoms: You changed the splitting strategy, swapped the Embedding model, adjusted retrieval parameters — but have no idea which change actually helped.

Diagnosis: Do you have a standardized evaluation process to measure RAG performance?

Solution: Build an evaluation system. At minimum: a test question set, expected answers, and evaluation metrics (relevance, faithfulness, completeness). Run the evaluation after every change and let the data speak.
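A starting point for the retrieval side of such an evaluation is sketched below. It assumes each test question has exactly one expected chunk ID and computes two standard retrieval metrics, hit rate at k and mean reciprocal rank (MRR). The IDs are made up, and faithfulness and completeness of the generated answer are not covered here; they need human or model-based judging.

```python
def hit_rate_at_k(results: list[list[str]], expected: list[str], k: int = 3) -> float:
    """Fraction of questions whose expected chunk appears in the top k results."""
    hits = sum(1 for retrieved, gold in zip(results, expected) if gold in retrieved[:k])
    return hits / len(expected)


def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average of 1 / rank of the expected chunk (0 when it was not retrieved)."""
    total = 0.0
    for retrieved, gold in zip(results, expected):
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)
    return total / len(expected)


# One list of retrieved chunk IDs per test question, best match first.
results = [["d1", "d2", "d3"], ["d9", "d4", "d5"], ["d7", "d8", "d6"]]
expected = ["d1", "d4", "d0"]

print(round(hit_rate_at_k(results, expected, k=3), 3))  # 0.667
print(mean_reciprocal_rank(results, expected))          # 0.5
```

Re-running these two numbers after each change to splitting, embedding, or retrieval parameters is usually enough to tell whether the change helped retrieval or merely moved things around.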

RAG System Complexity Map

Back to the original question: is RAG complex or not?

Prototypes are very simple. A few dozen lines of code can get a RAG pipeline running: load documents → split → embed → retrieve → generate. That's the path of most LangChain tutorials.

Production is very complex. Each step has multiple options, and the combinations between choices grow exponentially. Moreover, RAG performance is not determined by any single step, but by the coordination of the entire pipeline.

Simple ←─────────────────────────────────────────────────────────→ Complex

Basic RAG       Tuned RAG        Production RAG      Enterprise RAG
──────────────────────────────────────────────────────────────────────────
Fixed split     Recursive        Semantic split      Adaptive split
Single vector   Hybrid           Hybrid + reranking  Multi-path recall
                (+ keyword)
Basic prompt    Constrained      Dynamic prompt      Multi-turn dialogue
                prompt                               + memory
No eval         Human eval       Automated eval      Continuous monitoring
                + feedback

This article builds the global picture. The following articles in this series dive into each step, starting with document splitting, and systematically break down the engineering details of RAG.

