RAG Pipeline Overview: From Theory to Engineering Practice
RAG (Retrieval-Augmented Generation) is one of the hottest technical directions in AI application development today.
RAG-related projects on GitHub number in the tens of thousands, virtually all AI frameworks treat RAG as a core capability, and Stack Overflow has an overwhelming number of RAG-related questions. If you are building applications that need AI to answer "questions not in the training data," RAG is almost an unavoidable technology choice.
But RAG is also a severely underestimated direction. Many people think RAG is just "slicing documents, throwing them into a vector database, and then retrieving." In reality, from prototype to production, every step of RAG involves significant engineering detail. The performance difference between a naive RAG pipeline and a tuned RAG pipeline can be tenfold.
This article is not an implementation guide; its goal is to build your global understanding of RAG first.
Why RAG Exists
Large language models have two fatal flaws:
Knowledge has a cutoff date. GPT-4o's knowledge, for example, cuts off in October 2023. Ask it about events after that, and it will most likely fabricate.
No private data. Your company's internal documentation, product manuals, technical specifications, historical decision records — these are simply not in the model's training data. Ask ChatGPT "What is our company's leave process?" and it can only give you a generic answer.
RAG solves both of these problems. Its core idea is simple: first retrieve document fragments relevant to the user's question from a knowledge base, stuff those fragments into the prompt, then let the large model generate an answer based on that information.
Think of it as giving the large model an external hard drive. The model doesn't have this knowledge itself, but through retrieval, it can "temporarily read" this knowledge to answer questions.
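The retrieve-then-generate idea can be sketched in a few lines. This is a toy under stated assumptions: `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names, plain word overlap stands in for real vector similarity, and the actual LLM call is left out.

```python
import re

# Hypothetical in-memory knowledge base; a real system would use a vector database.
KNOWLEDGE_BASE = [
    "Employees request leave through the HR portal at least three days in advance.",
    "The code style guide requires type hints on all public functions.",
    "Production deployments happen every Tuesday after the release review.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, k: int = 1) -> list[str]:
    """Rank fragments by word overlap with the question (a stand-in for vector search)."""
    q = tokenize(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, fragments: list[str]) -> str:
    """Stuff the retrieved fragments into the prompt ahead of the question."""
    context = "\n".join(f"- {frag}" for frag in fragments)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "How do I request leave?"
prompt = build_prompt(question, retrieve(question))
```

The resulting `prompt` string is what would be sent to the model; the model itself is unchanged, it only sees extra context.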
A direct comparison:
| Scenario | Without RAG | With RAG |
|---|---|---|
| "What did the March 2026 product launch cover?" | Fabricates a plausible but completely wrong answer | Retrieves accurate information from launch notes |
| "What is our company's code style guide?" | Gives a generic code style suggestion | Retrieves the actual guide from internal docs |
| "Has anyone encountered this bug before?" | Has no idea | Finds similar cases and solutions from historical issues |
The value of RAG is not in making the model smarter, but in enabling it to access information it wouldn't otherwise know.
The Full RAG Pipeline
A complete RAG pipeline is roughly divided into six steps:
┌──────────────────────────────────────────────────────┐
│                     RAG Pipeline                     │
│                                                      │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐    │
│  │  Document  │ → │  Document  │ → │ Embedding  │    │
│  │  Loading   │   │ Splitting  │   │            │    │
│  └────────────┘   └────────────┘   └────────────┘    │
│                                           │          │
│                                           ▼          │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐    │
│  │   Answer   │ ← │ Retrieval  │ ← │   Vector   │    │
│  │ Generation │   │            │   │  Storage   │    │
│  └────────────┘   └────────────┘   └────────────┘    │
│                                                      │
│  Offline (indexing): Loading → Vector Storage        │
│  Online (querying):  Retrieval → Answer Generation   │
└──────────────────────────────────────────────────────┘
Offline Phase: Building the Index
Step 1: Document Loading
Read the original documents. PDFs, Word files, web pages, Markdown, database records, API response data... different sources require different loaders.
This step looks simple, but it's full of pitfalls. How do you handle tables and charts in PDFs? How do you filter navigation bars and ads from web pages? How do you ensure OCR quality for scanned documents? The quality of document loading directly determines the upper limit of the RAG pipeline.
Step 2: Document Splitting
Cut long documents into smaller chunks. This is one of the most impactful steps in RAG, and also one of the most easily overlooked.
Splitting too coarsely means a chunk contains too much irrelevant content, introducing noise during retrieval. Splitting too finely means a chunk lacks sufficient context, making it impossible for the model to understand its meaning.
Common splitting strategies:
- Fixed-size splitting: By character count or token count — simple and crude
- Recursive splitting: Split by paragraph first, then by sentence for overly long paragraphs, recursing layer by layer
- Semantic splitting: Split by topic or semantic boundaries — each chunk is self-consistent
- Parent-child chunks: Small chunks for retrieval, large chunks for generation — balancing precision and context
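The first two strategies above can be sketched as follows. This is a minimal illustration, not a production splitter: the function names are hypothetical, sizes are counted in characters rather than tokens, and unlike library splitters the recursive version drops the separators and does not merge small pieces back together.

```python
def split_fixed(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Fixed-size splitting by character count, with optional overlap between chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def split_recursive(text: str, chunk_size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Recursive splitting: try the coarsest separator first, and fall back to
    finer ones only for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return split_fixed(text, chunk_size)  # last resort: hard cut
    head, *rest = separators
    chunks: list[str] = []
    for piece in text.split(head):
        chunks.extend(split_recursive(piece, chunk_size, tuple(rest)))
    return chunks

doc = "First para.\n\nSecond para is longer. It has two sentences."
chunks = split_recursive(doc, chunk_size=30)
```

Here the short first paragraph survives intact, while the second paragraph exceeds 30 characters and is split again at the sentence boundary.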
Step 3: Embedding
Convert text chunks into numeric vectors. Embedding models map semantically similar text to nearby locations in vector space.
The core of this step is choosing the Embedding model. Different models vary greatly in dimensionality, speed, and performance. OpenAI's text-embedding-3-large (3072 dimensions), Cohere's embed-v3 (1024 dimensions), and open-source BGE-large (1024 dimensions) each have their strengths and weaknesses.
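To make "nearby locations in vector space" concrete, here is a toy sketch. `toy_embed` is a hypothetical hashed bag-of-words function, not a real Embedding model: it only detects shared words, whereas a trained model also places paraphrases close together. The cosine similarity function is the standard one.

```python
import hashlib
import math
import re

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. NOT a real embedding model:
    it captures shared words only, not meaning."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Stable hash so the same token always lands in the same bucket.
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

similarity = cosine(toy_embed("reset password"), toy_embed("password reset steps"))
```

Swapping `toy_embed` for a call to a real model changes the quality of the vectors but not the shape of the rest of the pipeline.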
Step 4: Vector Storage
Store vectors in a vector database. Mainstream options include Chroma (lightweight, local), Milvus (high-performance, distributed), Pinecone (fully managed cloud service), Weaviate (open-source, hybrid search), and more.
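What a vector database does at its core can be sketched with a brute-force store. `InMemoryVectorStore` is a hypothetical class for illustration and does not mirror the API of any of the products named above; real databases replace the linear scan with approximate nearest-neighbor indexes and add persistence and filtering.

```python
import math
from dataclasses import dataclass, field

@dataclass
class InMemoryVectorStore:
    """Brute-force store: keeps (vector, text) pairs and scans all of them per query."""
    vectors: list[list[float]] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k stored texts with the highest cosine similarity to the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0
        scored = [(cosine(query, v), t) for v, t in zip(self.vectors, self.texts)]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

store = InMemoryVectorStore()
store.add([1.0, 0.0], "chunk about billing")
store.add([0.0, 1.0], "chunk about login")
store.add([0.7, 0.7], "chunk about account settings")
top = store.search([0.9, 0.1], k=2)
```

The two-dimensional vectors are made up for readability; real embeddings have hundreds or thousands of dimensions.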
Online Phase: Answering Queries
Step 5: Retrieval
When the user asks a question, first convert it to a vector, then find the k most similar chunks in the vector database.
Retrieval is not simply "find the most similar." In production, you often need:
- Hybrid search: Vector retrieval + keyword search (BM25), complementing each other
- Query rewriting: The user's question may be unclear — have the LLM rewrite or decompose it first
- Reranking: Initial retrieval results may not be precise enough — re-rank with a cross-encoder
- Multi-path recall: Retrieve from multiple angles and merge results
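Hybrid search and multi-path recall both need a way to merge several ranked lists. One common choice is Reciprocal Rank Fusion (RRF), sketched below. The chunk IDs and the two input rankings are made up for illustration; `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs. Each list contributes
    1 / (k + rank) per ID, with rank starting at 1; higher totals rank first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

vector_hits = ["c3", "c1", "c7"]   # hypothetical vector-search ranking
keyword_hits = ["c1", "c9", "c3"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only ranks, not raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever.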
Step 6: Generation
Use the retrieval results as context, combine them with the user's question in a prompt, and let the large model generate an answer.
The core challenge of generation is hallucination control: the model might answer from the retrieval results, or it might "make things up." The prompt needs to instruct the model explicitly to answer only from the provided context and to say so when it is uncertain.
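A minimal sketch of such a prompt, assuming a plain-text prompt format. `build_grounded_prompt` is a hypothetical helper, and the exact wording and the chunk-numbering convention are illustrative choices rather than a fixed standard.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that numbers each chunk and restricts the model to them."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer based only on the context below. "
        "If the context does not contain the answer, say you don't know. "
        "Cite the chunk numbers you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

example_prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 3 to 5 business days."],
)
```

Numbering the chunks makes it possible to ask for citations, which in turn makes unsupported claims easier to spot during review.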
Component Selection Map
Each step of RAG has multiple options. The following is not a complete selection guide, but helps you build awareness of "what options exist."
Document Loaders
| Type | Tool | Use Case |
|---|---|---|
| File Parsing | PyPDFLoader, Docx2txtLoader, Unstructured | PDFs, Word, PPT, and other local files |
| Web Scraping | WebBaseLoader, CheerioLoader, Playwright | Web pages, pages requiring JS rendering |
| Database | SQLDatabaseLoader, MongoDBAtlasLoader | Structured data, NoSQL data |
| API | ApifyWrapper, GitHub API Loader | Third-party platform data |
| Universal | Unstructured (unified interface) | Mixed-format scenarios |
Text Splitters
| Strategy | Tool | Characteristics |
|---|---|---|
| Fixed-size | CharacterTextSplitter | Simple and fast, ignores semantic boundaries |
| Recursive | RecursiveCharacterTextSplitter | Splits by paragraph → sentence → word, preserving semantics |
| Semantic | SemanticChunker | Detects topic boundaries based on Embedding similarity |
| Parent-child | ParentDocumentRetriever | Small chunks for retrieval, large chunks for generation |
| Sliding window | SentenceWindowNodeParser (LlamaIndex) | Retrieves sentences, returns context window |
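The sliding-window row can be illustrated with a short sketch. `sentence_windows` is a hypothetical function, not the API of any library in the table, and the regex sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g.").

```python
import re

def sentence_windows(text: str, window: int = 1) -> list[dict]:
    """Index each sentence on its own, but keep `window` neighbours on each
    side so the generator sees the surrounding context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        records.append({"sentence": sentence, "window": " ".join(sentences[lo:hi])})
    return records

records = sentence_windows("A one. B two. C three.", window=1)
```

The `sentence` field is what gets embedded and matched; the `window` field is what gets passed to the model once a match is found.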
Embedding Models
| Model | Dimensions | Characteristics |
|---|---|---|
| text-embedding-3-large | 3072 | OpenAI's strongest embedding model; output can be shortened via the `dimensions` parameter |
| text-embedding-3-small | 1536 | OpenAI's value option, good performance |
| embed-v3 | 1024 | Cohere, supports compressed (int8 and binary) embeddings |
| BGE-large | 1024 | Open-source top choice, strong Chinese/English results |
| GTE-large | 1024 | Alibaba open-source, excellent for Chinese |
| E5-large | 1024 | Microsoft open-source; the multilingual-e5 variants offer strong multilingual support |
Vector Databases
| Database | Type | Characteristics |
|---|---|---|
| Chroma | Embedded | Zero-config, Python-native, ideal for prototypes |
| Milvus | Distributed | High performance, supports large-scale data |
| Pinecone | Cloud Service | Fully managed, no ops needed, pay-per-use |
| Weaviate | Open-source | Hybrid search (vector + keyword), GraphQL interface |
| Qdrant | Open-source | High performance, Rust implementation |
| pgvector | PG Extension | Leverages existing PostgreSQL infrastructure |
Retrieval Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Vector retrieval | Find most similar chunks via cosine similarity | Semantic matching |
| Keyword search (BM25) | Based on term frequency and inverse document frequency | Exact keyword matching |
| Hybrid search | Vector + keyword, RRF or weighted fusion | Most production scenarios |
| Reranking | Re-rank initial results with cross-encoder | High-precision requirements |
| Multi-path recall | Multiple retrievers in parallel, merge results | Complex queries |
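The keyword-search row relies on BM25, which combines term frequency, inverse document frequency, and document-length normalization. Below is a compact sketch of the Okapi BM25 scoring formula with the usual default parameters `k1=1.5` and `b=0.75`; the tokenizer is a simplistic assumption, the sample documents are invented, and the function expects a non-empty document list.

```python
import math
import re
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in re.findall(r"[a-z0-9]+", query.lower()):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code E42 means disk full",
    "how to reset your password",
    "disk usage report",
]
scores = bm25_scores("E42 disk", docs)
```

The exact identifier `E42` is the kind of token vector retrieval tends to blur, which is why keyword search complements it in hybrid setups.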
RAG in Practice: Claude Code Memory System
Theory aside, let's look at what a real RAG system looks like.
The MAGMA memory system in Claude Code is a typical RAG implementation. Its core storage is the LanceDB vector database, paired with Obsidian notes as a knowledge source.
System Architecture
             User Query
                  │
                  ▼
       ┌─────────────────────┐
       │  IntelligentRouter  │  ← Intent Analysis + Routing
       └──────────┬──────────┘
                  │
         ┌────────┴────────┐
         ▼                 ▼
   ┌───────────┐     ┌───────────┐
   │    CCB    │     │   MAGMA   │
   │   (File   │     │  (Vector  │
   │  System)  │     │ Database) │
   │           │     │           │
   │ Markdown  │     │  LanceDB  │
   │   Files   │     │ Obsidian  │
   │           │     │ Knowledge │
   │           │     │   Graph   │
   └─────┬─────┘     └─────┬─────┘
         │                 │
         ▼                 ▼
   ┌─────────────────────────────┐
   │    CrossSystemRetriever     │  ← Cross-system Retrieval + Dedup
   └──────────────┬──────────────┘
                  │
                  ▼
   ┌─────────────────────────────┐
   │        MemoryFusion         │  ← Semantic Dedup + Ranking + Token Budget
   └──────────────┬──────────────┘
                  │
                  ▼
    Injected into Claude Context
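The final fusion stage (dedup, ranking, token budget) can be sketched generically. This is not the system's actual code: the `Memory` class and `fuse` function are hypothetical, word-level Jaccard overlap stands in for semantic dedup, and a whitespace word count stands in for a real tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    score: float  # relevance score from retrieval, higher is better

def fuse(memories: list[Memory], token_budget: int, dup_threshold: float = 0.8) -> list[Memory]:
    """Rank by score, drop near-duplicates, and skip memories that no longer fit the budget."""
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    kept: list[Memory] = []
    used = 0
    for mem in sorted(memories, key=lambda m: m.score, reverse=True):
        if any(jaccard(mem.text, k.text) >= dup_threshold for k in kept):
            continue  # near-duplicate of something already selected
        cost = len(mem.text.split())  # crude token estimate: whitespace word count
        if used + cost > token_budget:
            continue  # does not fit; a smaller, lower-ranked memory still might
        kept.append(mem)
        used += cost
    return kept

candidates = [
    Memory("user prefers tabs over spaces", 0.9),
    Memory("User prefers tabs over spaces", 0.8),
    Memory("project uses Python 3.12 and pytest for tests", 0.7),
    Memory("deploy on Fridays", 0.6),
]
selected = fuse(candidates, token_budget=9)
```

In this run the second memory is dropped as a duplicate, the third is skipped because it would exceed the budget, and the shorter fourth one still fits.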
Why the Basic Implementation Is Sufficient
MAGMA's vector retrieval implementation is quite basic — pure vector search, no hybrid search, no reranking. Yet it works well. Why?
Because the memory system has a unique advantage: extremely high data quality.
Memories aren't documents scraped from the internet — they're written by users themselves: user preferences, project context, feedback and guidance. Each memory is human-written, well-formatted, semantically clear, and moderately sized.
This reveals a core RAG principle: data quality > retrieval algorithm. High-quality data paired with a basic retrieval algorithm usually outperforms low-quality data paired with a complex one.
Five Common Reasons RAG Performs Poorly
If you've built a RAG system but the results aren't good, you've likely hit one of these five pitfalls:
Pitfall 1: Document Splitting Too Coarse or Too Fine
Symptoms: Retrieval results contain large amounts of irrelevant content, or are too short for the model to understand.
Diagnosis: Print out the retrieved chunks and check if they are semantically complete. If a chunk mixes three different topics, splitting is too coarse. If a chunk is only half a sentence, splitting is too fine.
Solution: Adjust the splitting strategy based on document type. Use recursive splitting for technical docs, split by turn for conversations, split by function for code.
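The manual inspection described in the diagnosis can be partly automated. The sketch below flags suspicious chunks by length and by whether they end on sentence-final punctuation; `audit_chunks` is a hypothetical helper and the thresholds are arbitrary assumptions to tune per corpus. It cannot detect mixed topics, which still needs a human read.

```python
def audit_chunks(chunks: list[str], min_chars: int = 50, max_chars: int = 1500) -> list[tuple[int, str]]:
    """Flag chunks that look too fine or too coarse, judged by length and ending only."""
    findings = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if not text or len(text) < min_chars:
            findings.append((i, "too short"))
        elif len(text) > max_chars:
            findings.append((i, "too long"))
        elif text[-1] not in ".!?:;\"')]`":
            findings.append((i, "ends mid-sentence"))
    return findings

sample_chunks = ["Half a", "A" * 60 + ".", "B" * 2000, "C" * 60]
report = audit_chunks(sample_chunks)
```

Running this over a random sample of the index gives a quick signal on whether the chunk size needs adjusting before any retrieval tuning.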
Pitfall 2: Embedding Model Mismatch with Data
Symptoms: The semantic relevance between retrieval results and questions is low. Relevant documents exist but aren't being found.
Diagnosis: Manually check retrieval results with a few typical questions. If a human would judge a document relevant but the retriever does not return it, the Embedding model isn't capturing the semantics of your data.
Solution: Switch to an Embedding model better suited to your data type. Try BGE or GTE for Chinese data, E5 for multilingual data.
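When comparing candidate Embedding models, a small labeled set makes the diagnosis repeatable. The sketch below computes hit rate at k, the fraction of questions whose known-relevant chunk shows up in the top k results. The `search` callable and the chunk IDs are assumptions; plug in whatever retrieval function your pipeline exposes.

```python
from typing import Callable

def hit_rate_at_k(
    labeled: list[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk ID appears in the top-k results."""
    if not labeled:
        return 0.0
    hits = sum(1 for question, expected_id in labeled if expected_id in search(question, k))
    return hits / len(labeled)

# Hypothetical retriever stub standing in for a real vector search.
FAKE_RESULTS = {"q1": ["a", "b"], "q2": ["c", "d"]}

def fake_search(question: str, k: int) -> list[str]:
    return FAKE_RESULTS[question][:k]

rate = hit_rate_at_k([("q1", "b"), ("q2", "x")], fake_search, k=2)
```

Computing the same number for each candidate model on the same labeled pairs turns "this model feels better" into a direct comparison.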
Pitfall 3: Retrieval Results Lack Context
Symptoms: Relevant chunks are retrieved, but the model lacks necessary contextual information when answering.
Diagnosis: Check if the retrieved chunks are self-contained. If a chunk heavily uses phrases like "as mentioned above" or "this feature," the context has been severed by splitting.
Solution: Increase overlap, or use the parent-child chunk strategy — small chunks for retrieval, large chunks for generation.
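The parent-child strategy boils down to a mapping from small chunks back to the large chunk they came from. A minimal sketch, assuming fixed-size character children and hypothetical function names; the similarity search that selects the matching children is omitted.

```python
def build_parent_child_index(parents: list[str], child_size: int = 40) -> list[tuple[str, int]]:
    """Cut each parent chunk into small children; each child remembers its parent's index."""
    index = []
    for parent_id, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            index.append((parent[start:start + child_size], parent_id))
    return index

def expand_to_parents(matched_children: list[tuple[str, int]], parents: list[str]) -> list[str]:
    """Swap matched children for their parents, keeping order and dropping repeats."""
    seen: set[int] = set()
    result = []
    for _child, parent_id in matched_children:
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result

parents = ["A" * 100, "B" * 50]
index = build_parent_child_index(parents, child_size=40)
# Pretend retrieval matched these three children, best match first.
context = expand_to_parents([index[4], index[0], index[1]], parents)
```

Deduplicating by parent matters: several children of the same parent often match one query, and sending the parent twice wastes context window.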
Pitfall 4: Prompts Don't Constrain Hallucination
Symptoms: The model doesn't answer based on retrieval results, but instead combines its own "knowledge" to generate plausible but actually incorrect content.
Diagnosis: Change key information in the retrieval results to something obviously wrong, and check whether the answer repeats the altered value. If it does not, the model is answering from its own knowledge rather than from the retrieved context.
Solution: Explicitly constrain in the prompt: "Answer based only on the following context. If the context does not contain relevant information, state that you don't know. Do not fabricate information."
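The tampering diagnosis can be wrapped in a small check. In this sketch `generate` is any callable that takes a prompt and returns the model's answer; it is an assumption standing in for your LLM client, and the two stub models exist only to show both outcomes. The substring check is crude and assumes the true value is not contained in the planted one.

```python
from typing import Callable

def follows_context(
    generate: Callable[[str], str],
    question: str,
    context: str,
    true_value: str,
    planted_value: str,
) -> bool:
    """Plant an obviously wrong value in the context and check that the answer repeats it.
    True means the model answered from the context rather than from its own knowledge."""
    if true_value not in context:
        raise ValueError("true_value must occur in the context")
    tampered = context.replace(true_value, planted_value)
    prompt = (
        "Answer based only on the following context. If the context does not "
        "contain relevant information, state that you don't know.\n\n"
        f"Context:\n{tampered}\n\nQuestion: {question}\nAnswer:"
    )
    answer = generate(prompt)
    return planted_value in answer and true_value not in answer

# Stub "models" for illustration only.
def grounded_model(prompt: str) -> str:
    return "The limit is 999 requests." if "999" in prompt else "The limit is 100 requests."

def stubborn_model(prompt: str) -> str:
    return "The limit is 100 requests."

ctx = "The rate limit is 100 requests per minute."
grounded_ok = follows_context(grounded_model, "What is the limit?", ctx, "100", "999")
stubborn_ok = follows_context(stubborn_model, "What is the limit?", ctx, "100", "999")
```

Running a handful of such probes after every prompt change shows whether the constraint wording actually changes the model's behavior.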
Pitfall 5: No Evaluation, Optimization by Guesswork
Symptoms: You changed the splitting strategy, swapped the Embedding model, adjusted retrieval parameters — but have no idea which change actually helped.
Diagnosis: Do you have a standardized evaluation process to measure RAG performance?
Solution: Build an evaluation system. At minimum: a test question set, expected answers, and evaluation metrics (relevance, faithfulness, completeness). Run the evaluation after every change and let the data speak.
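A minimal retrieval-side harness might look like the sketch below. It covers only one metric, mean reciprocal rank, as an assumed stand-in for the relevance dimension; faithfulness and completeness need answer-level judging that is out of scope here. The test set, configuration names, and search stubs are all invented for illustration.

```python
from typing import Callable

def mean_reciprocal_rank(
    test_set: list[tuple[str, str]],
    search: Callable[[str], list[str]],
) -> float:
    """Average of 1/rank of the expected chunk per question (0 when it is missing)."""
    total = 0.0
    for question, expected_id in test_set:
        results = search(question)
        if expected_id in results:
            total += 1.0 / (results.index(expected_id) + 1)
    return total / len(test_set) if test_set else 0.0

def compare(
    test_set: list[tuple[str, str]],
    configs: dict[str, Callable[[str], list[str]]],
) -> dict[str, float]:
    """Run the same test set against every pipeline variant."""
    return {name: mean_reciprocal_rank(test_set, search) for name, search in configs.items()}

test_set = [("q1", "a"), ("q2", "b")]
baseline = {"q1": ["x", "a"], "q2": ["y", "z"]}  # hypothetical results before a change
tuned = {"q1": ["a", "x"], "q2": ["c", "b"]}     # hypothetical results after a change
results = compare(test_set, {"baseline": baseline.__getitem__, "tuned": tuned.__getitem__})
```

Keeping the test set fixed and re-running `compare` after each change is what turns guesswork into a measurable before/after difference.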
RAG System Complexity Map
Back to the original question: is RAG complex or not?
Prototypes are very simple. A few dozen lines of code can get a RAG pipeline running: load documents → split → embed → retrieve → generate. That's the path of most LangChain tutorials.
Production is very complex. Each step has multiple options, and the combinations between choices grow exponentially. Moreover, RAG performance is not determined by any single step, but by the coordination of the entire pipeline.
Simple ←─────────────────────────────────────→ Complex
| Basic RAG | Tuned RAG | Production RAG | Enterprise RAG |
|---|---|---|---|
| Fixed split | Recursive | Semantic split | Adaptive split |
| Single vector | Hybrid (vector + keyword) | Hybrid + reranking | Multi-path recall |
| Basic prompt | Constrained prompt | Dynamic prompt | Multi-turn dialogue + memory |
| No eval | Human eval | Automated eval | Continuous monitoring + feedback |
This article builds the global picture. The following articles in this series dive into each step, starting with document splitting, and systematically break down the engineering details of RAG.