Document Splitting: The First Step of RAG, and the Easiest to Get Wrong

The previous article established a global understanding of RAG. Starting from this article, we dive into each step of the RAG pipeline, beginning with the first step — document splitting.

Document splitting is one of the most easily overlooked steps in RAG, yet one of the most impactful. Many people spend their energy choosing Embedding models and tuning retrieval algorithms but overlook one fact: if splitting is done poorly, no amount of optimization in later steps can compensate.

Garbage in, garbage out — this principle applies especially to RAG.

Why Splitting Matters So Much

The retrieval granularity is the chunk. A user asks a question, and the system returns the k most similar chunks. If the chunks themselves have problems, the retrieval results won't be good.

Splitting too coarsely: A chunk contains multiple unrelated pieces of content. During retrieval, only a small portion of the chunk is relevant to the query; the rest is noise. The Embedding vector is "diluted" by this noise, leading to imprecise retrieval.

Splitting too finely: A chunk contains only a few sentences, lacking context. When the model receives this chunk, it doesn't know what "as mentioned above" refers to, or what "this feature" is talking about. Broken context leads to degraded generation quality.

Think of it this way: you're trying to find a specific piece of knowledge in a book. If you split by page (too coarse), a page has ten knowledge points and you can barely locate the specific one. If you split by sentence (too fine), a single sentence is stripped of context and you have no idea what it's talking about. The best approach is to split by chapter or paragraph — each fragment is semantically complete without containing too much irrelevant content.

The core tension of splitting: precision vs. context. Smaller chunks give more precise retrieval but less complete context; larger chunks give more complete context but less precise retrieval. Every splitting strategy is an attempt to balance the two.

Five Splitting Strategies

Fixed-Size Splitting

The simplest strategy: split by a fixed number of characters or tokens, with a certain overlap between adjacent chunks.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,      # 1000 characters per chunk
    chunk_overlap=200,    # 200 characters overlap between adjacent chunks
    separator="\n\n"      # Cut only at blank lines; a paragraph longer than chunk_size stays as one oversized chunk
)

chunks = splitter.split_text(document)

Pros: Simple to implement, fast computation, no extra model calls needed.

Cons: Completely ignores semantic boundaries. A paragraph may be cut mid-way through; a chunk may contain two unrelated paragraphs.

Use cases:

  • Rapid prototyping
  • Scenarios where document structure doesn't matter and content is continuous (e.g., novels and other running prose)
  • Internal tools where high performance isn't critical

Parameter selection:

  • chunk_size: Common range 256-2048 characters. For technical docs, 500-1000 is a reasonable starting point; for papers, 1000-2000.
  • chunk_overlap: Typically 10%-25% of chunk_size. Too small loses boundary information at chunk edges; too large produces significant duplication.
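
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a character window that slides forward by chunk_size minus chunk_overlap. The function name fixed_size_chunks is made up for this illustration; it is not LangChain's implementation, which also takes the separator into account.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is exactly the boundary information the overlap is meant to preserve.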

Recursive Splitting

LangChain's default recommended strategy. It splits recursively by hierarchy: first at paragraph breaks (\n\n), then, for pieces that are still too long, at line breaks (\n), then at sentence ends (". "), then at spaces, until each chunk meets the size requirement.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # Tried in priority order
)

chunks = splitter.split_text(document)

Pros: Tries to preserve semantic boundaries. Natural separators like paragraphs and sentences are used preferentially, avoiding hard cuts mid-paragraph.

Cons: Still relies on separators. If a document has no clear separators (e.g., long text without line breaks), the result degrades to something close to fixed-size splitting.

Use cases:

  • Most structured documents (technical documentation, blog posts, papers)
  • The preferred choice for general scenarios

Parameter selection:

  • The order of separators matters. Separators earlier in the list have higher priority.
  • For Chinese documents, it's recommended to add full-width punctuation such as "。", "!" and "?" to the separator list.
  • chunk_size can be set slightly larger than with fixed-size splitting, since recursive splitting ends chunks at natural boundaries and many chunks come out shorter than the limit.
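
The recursion itself can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's code: the name recursive_split is made up, overlap is left out, and the separator consumed at a cut is dropped, whereas LangChain keeps it by default.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with the remaining ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Nothing left to split on: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Flush what has been merged so far, then split the big part more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, rest))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # small neighbours are merged back together
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph.\n\n"
    "Second paragraph is quite a bit longer. It has two sentences.\n\n"
    "End."
)
result = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", ". ", " ", ""])
print(result)
```

Only the second paragraph exceeds 40 characters, so only that paragraph is split again at the sentence level; the short paragraphs survive intact.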

Semantic Splitting

Detects topic boundaries based on Embedding similarity. First embeds each sentence, then calculates similarity between adjacent sentences — a sudden drop in similarity indicates a topic switch boundary.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95           # Split where the distance between adjacent sentences exceeds the 95th percentile
)

chunks = splitter.split_text(document)

Pros: Splits along semantic boundaries rather than character counts. Each chunk tends to be thematically self-consistent and rarely mixes two different topics.

Cons:

  • Requires calling an Embedding model, adding extra cost
  • High computational cost, not suitable for very large documents
  • Threshold selection requires parameter tuning

Use cases:

  • Long documents with multiple topics (e.g., meeting minutes, encyclopedias)
  • Production systems requiring high retrieval precision
  • Documents with irregular structure where traditional splitting strategies perform poorly

Parameter selection:

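  • breakpoint_threshold_type:
    • percentile: Splits where the distance between adjacent sentences exceeds the given percentile of all distances; the most intuitive option
    • standard_deviation: Splits where the distance exceeds the mean by the given number of standard deviations; suits distances that are roughly normally distributed
    • interquartile: Based on the interquartile range, more robust to outliers
  • breakpoint_threshold_amount: Interpreted according to the threshold type (a percentile from 0 to 100, a number of standard deviations, or an interquartile-range multiplier). Larger values mean fewer splits (larger chunks); smaller values mean more splits (smaller chunks).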
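
The percentile breakpoint can be illustrated without any model calls. In the sketch below, embed is a toy bag-of-words stand-in for a real Embedding model, and the names semantic_chunks and percentile are invented for this example; real implementations differ in details, so treat it only as a picture of the idea.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences: list[str], threshold_pct: float = 95) -> list[str]:
    """Start a new chunk wherever the distance to the previous sentence is unusually large."""
    if len(sentences) < 2:
        return list(sentences)
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]
    threshold = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats are small pets.",
    "Cats sleep most of the day.",
    "Pets like cats need food.",
    "Python is a programming language.",
    "Python code is easy to read.",
]
topic_chunks = semantic_chunks(sentences, threshold_pct=95)
print(topic_chunks)
```

The only distance above the 95th percentile is the jump from the cat sentences to the Python sentences, so the text is cut once, at the topic switch. Lowering threshold_pct produces more cuts.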

Parent-Child Chunks

The most direct answer to the "precision vs. context" tension: small chunks (child) are used for retrieval, and the large chunks (parent) they belong to are passed to the model for generation.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Vector store that indexes the child chunks (the collection name is an arbitrary example)
vectorstore = Chroma(collection_name="parent_child_demo", embedding_function=OpenAIEmbeddings())

# Small chunks for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Large chunks for generation
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index first: `docs` is assumed to be a list of Document objects loaded elsewhere
retriever.add_documents(docs)

# During retrieval: small chunks match (precise), large chunks returned to model (contextual)
results = retriever.invoke("What is the company's leave process?")

Pros: Achieves both precise retrieval and complete context simultaneously. Retrieval uses small chunks for matching precision; generation uses large chunks for sufficient context.

Cons:

  • Requires maintaining two sets of chunks (parent and child), which increases storage costs
  • Higher implementation complexity
  • Requires an additional docstore to store parent chunks

Use cases:

  • Production-grade RAG systems
  • Scenarios with strong inter-paragraph relevance (e.g., legal documents, technical specifications)
  • Scenarios requiring high retrieval precision and generation quality
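
The mechanism behind parent-child retrieval is just a mapping from each small chunk to the large chunk it came from. The sketch below shows that mapping in plain Python, assuming paragraphs as parents, sentences as children, and a toy word-overlap score in place of vector similarity; build_index and retrieve_parents are names invented for this illustration.

```python
def build_index(paragraphs: list[str]) -> list[tuple[str, int]]:
    """Return (child_sentence, parent_id) pairs; parent_id indexes into paragraphs."""
    children = []
    for parent_id, paragraph in enumerate(paragraphs):
        for sentence in paragraph.split(". "):
            if sentence.strip():
                children.append((sentence.strip(), parent_id))
    return children


def retrieve_parents(query: str, paragraphs: list[str], children: list[tuple[str, int]], k: int = 2) -> list[str]:
    """Match against children, then return the deduplicated parents of the top-k matches."""
    query_words = set(query.lower().split())
    ranked = sorted(
        children,
        key=lambda child: len(query_words & set(child[0].lower().split())),
        reverse=True,
    )
    seen, results = set(), []
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            results.append(paragraphs[parent_id])
    return results


paragraphs = [
    "Leave requests are submitted in the HR portal. A manager approves each request within three days.",
    "Expense reports are filed monthly. Receipts must be attached to every report.",
]
children = build_index(paragraphs)
parents = retrieve_parents("who approves a leave request", paragraphs, children, k=2)
print(parents)
```

Both top-scoring sentences belong to the first paragraph, so the model receives that paragraph once rather than two overlapping fragments.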

Sentence Window

Retrieves a single sentence but returns the sentence's context window. Similar to parent-child chunks, but finer-grained — at the sentence level.

# Illustrative pseudocode: SentenceWindowRetriever is a hypothetical class name used to show the idea;
# it is not a documented LangChain import. LlamaIndex offers a comparable building block,
# SentenceWindowNodeParser.
retriever = SentenceWindowRetriever(
    vectorstore=vectorstore,
    window_size=3,  # 3 sentences before and after the retrieved sentence as context
)

# During retrieval: match a single sentence, return the context window
results = retriever.invoke("What is the compensation standard for breach of contract?")

Pros: More flexible than parent-child chunks, with adjustable context window size.

Cons: More complex implementation, requiring sentence-level index maintenance.

Use cases:

  • Scenarios requiring sentence-level retrieval precision
  • Documents where key information is scattered across different sentences
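
The window expansion itself is simple enough to sketch in plain Python. The helpers split_sentences and sentence_window are invented for this illustration, the regex-based sentence splitter is deliberately naive, and hit_index stands for whatever position the retrieval step returned.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_window(sentences: list[str], hit_index: int, window_size: int = 3) -> str:
    """Return the matched sentence plus up to window_size sentences on each side."""
    start = max(0, hit_index - window_size)
    end = min(len(sentences), hit_index + window_size + 1)
    return " ".join(sentences[start:end])


sentences = split_sentences("One. Two. Three. Four. Five.")
print(sentence_window(sentences, hit_index=2, window_size=1))  # Two. Three. Four.
print(sentence_window(sentences, hit_index=0, window_size=1))  # One. Two.
```

The window is clipped at the start and end of the document, so a hit on the first sentence only brings in the sentences that follow it.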

Best Practices for Different Document Types

No splitting strategy is universal. Different document types require different splitting approaches.

Technical Documentation (API docs, tutorials, READMEs)

Characteristics: Clear structure with heading hierarchy; paragraphs are relatively independent.

Recommended strategy: Recursive splitting, split by heading hierarchy.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "]
)

Key points:

  • Prioritize splitting by headings (##, ###), with each chunk corresponding to a section
  • Cut in front of each heading so that the heading stays at the start of its chunk (RecursiveCharacterTextSplitter keeps separators by default); headings are important context
  • chunk_size shouldn't be too large; technical document paragraphs are typically short

Papers / Long-Form Articles

Characteristics: Long paragraphs with tight contextual connections; key information may span paragraphs.

Recommended strategy: Recursive splitting + larger overlap, or semantic splitting.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=400,  # Larger overlap to maintain contextual coherence
    separators=["\n\n", "\n", ". ", " "]
)

Key points:

  • Paper paragraphs are typically long, so chunk_size needs to be correspondingly larger
  • Overlap must be large enough to avoid severing key arguments
  • If the paper involves multiple topic shifts, consider semantic splitting

Conversation Records (chat logs, meeting minutes)

Characteristics: Organized by turn, each turn relatively independent, but context may span turns.

Recommended strategy: Split by turn, preserving speaker information.

# Split by conversation turn
def split_conversations(text, max_turns_per_chunk=5):
    # Assume each non-empty line is one turn, e.g. "Alice: ..."
    turns = [line for line in text.splitlines() if line.strip()]
    chunks = []
    for i in range(0, len(turns), max_turns_per_chunk):
        # Group whole turns only, so no turn is ever cut in the middle
        chunk = "\n".join(turns[i:i + max_turns_per_chunk])
        chunks.append(chunk)
    return chunks

Key points:

  • Split by turn, with each chunk containing several rounds of conversation
  • Preserve speaker information ("Alice: ..."), which is important for contextual understanding
  • Never cut in the middle of a single turn

Code Files

Characteristics: Clear syntactic structure (classes, functions, comments); context spans lines.

Recommended strategy: Split by function or class.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\ndef ", "\n\nclass ", "\n\n", "\n", " "]
)

Key points:

  • Prioritize splitting by function (def) and class (class)
  • Preserve function signatures and docstrings
  • Code has tight contextual connections, so overlap shouldn't be too small
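
Separator strings only approximate code structure. For Python sources, one alternative is to let the standard-library ast module find the real function and class boundaries. The sketch below assumes the file parses cleanly and only collects top-level definitions; split_python_source is a name invented for this example.

```python
import ast


def split_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class, decorators and docstrings included."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start from the earliest one.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[first - 1:node.end_lineno]))
    return chunks


source = '''import os

def greet(name):
    """Return a greeting."""
    return f"Hello, {name}"

class Counter:
    def __init__(self):
        self.n = 0
'''
code_chunks = split_python_source(source)
for chunk in code_chunks:
    print(chunk)
    print("---")
```

Module-level statements such as imports are skipped here; a fuller version would probably attach them to a separate chunk or prepend them to each definition.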

PDFs / Scanned Documents

Characteristics: Uncontrollable format; may contain tables, charts, headers, footers, OCR errors.

Recommended strategy: Clean first, then split.

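# 1. Extract text (extract_pdf_text is a placeholder for your PDF or OCR library of choice)
raw_text = extract_pdf_text("document.pdf")

# 2. Clean: remove headers, footers, page numbers, OCR noise (clean_pdf_text is a placeholder as well)
cleaned_text = clean_pdf_text(raw_text)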

# 3. Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(cleaned_text)

Key points:

  • PDF text extraction quality varies greatly; always clean before splitting
  • Table content needs special handling (convert to text description or preserve structured format)
  • OCR quality of scanned documents directly affects splitting quality
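
What the cleaning step might look like depends heavily on the documents. As one possible sketch, assuming the extracted text is available page by page, the hypothetical helper clean_pages below drops lines that repeat on most pages (likely headers and footers) and lines that look like page numbers. The 0.6 ratio and the page-number pattern are illustrative guesses, not tuned values.

```python
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s*(/|of)\s*\d+)?$", re.IGNORECASE)


def clean_pages(pages: list[str], repeat_ratio: float = 0.6) -> str:
    """Remove repeated header/footer lines and bare page numbers, keeping blank lines."""
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})
    min_repeats = max(2, int(len(pages) * repeat_ratio))
    repeated = {line for line, count in line_counts.items() if count >= min_repeats}

    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped in repeated or PAGE_NUMBER.match(stripped):
                continue
            kept.append(stripped)  # blank lines survive, so paragraph breaks are preserved
    return "\n".join(kept)


pages = [
    "ACME Handbook\nLeave policy applies to all staff.\nPage 1",
    "ACME Handbook\nRequests need manager approval.\nPage 2",
    "ACME Handbook\nUnused leave expires in March.\n3",
]
cleaned = clean_pages(pages)
print(cleaned)
```

A line consisting only of a number is treated as a page number, which would also remove a standalone year or quantity; real documents usually need the rules adjusted after looking at a few samples.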

How to Choose Overlap

Overlap (the overlapping region between adjacent chunks) is the most easily overlooked parameter in splitting.

Overlap too small: Boundary information is lost. The last sentence of a paragraph and the first sentence of the next paragraph may be related; without overlap, this connection is severed.

Overlap too large: Significant duplicate content. This wastes storage and computation, and may lead to many duplicate chunks in retrieval results, reducing diversity.

Rule of thumb:

chunk_size    Recommended overlap    Overlap ratio
256           50-64                  ~20%-25%
512           100-128                ~20%-25%
1024          150-250                ~15%-25%
2048          200-400                ~10%-20%

The larger the chunk_size, the smaller the overlap ratio can be. This is because large chunks already contain sufficient context and don't need much overlap to maintain coherence.
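
The cost of overlap is easy to estimate with a little arithmetic: every chunk after the first advances by chunk_size minus chunk_overlap characters. The helper chunk_count below is a made-up name for this back-of-the-envelope calculation and assumes plain fixed-size windows.

```python
import math


def chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of fixed-size windows needed to cover text_length characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


for overlap in (0, 200, 500):
    count = chunk_count(100_000, chunk_size=1000, chunk_overlap=overlap)
    print(f"overlap={overlap}: {count} chunks")
```

For a 100,000-character document with 1000-character chunks, this gives 100 chunks with no overlap, 125 with an overlap of 200, and 199 with an overlap of 500. A 50% overlap therefore roughly doubles the number of vectors to embed and store.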

Evaluating Splitting Quality

How do you know if your splitting strategy is good? Several practical evaluation methods:

Manual Inspection

The simplest and also most effective method: randomly sample 10 chunks and manually judge:

  • Is each chunk semantically complete? (Can it be understood without context?)
  • Does each chunk contain only one topic?
  • Is there important information severed between adjacent chunks?

Retrieval Hit Rate

Prepare a set of test questions, each annotated with the "document paragraph where the correct answer is located." Then check:

  • Do the top-k retrieval results include the correct paragraph?
  • Across all test questions, what proportion of the correct paragraphs are retrieved?

If the hit rate is low, it may be because splitting is too coarse (too much noise in the chunk) or too fine (key information is severed).
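
A hit-rate check can be a few lines of Python. The sketch below assumes each test case pairs a question with a snippet copied from the correct paragraph, and that retrieve is any function returning the top-k chunks for a question. keyword_retrieve is a toy word-overlap retriever used only to make the example runnable; in practice it would be replaced by the real retriever.

```python
from typing import Callable


def hit_rate(test_cases: list[tuple[str, str]], retrieve: Callable[[str, int], list[str]], k: int = 3) -> float:
    """Fraction of questions whose expected snippet appears in at least one top-k chunk."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_snippet in test_cases:
        retrieved = retrieve(question, k)
        if any(expected_snippet in chunk for chunk in retrieved):
            hits += 1
    return hits / len(test_cases)


indexed_chunks = [
    "Annual leave is 15 days per year.",
    "Expense reports are due monthly.",
    "Remote work requires manager approval.",
]


def keyword_retrieve(question: str, k: int) -> list[str]:
    """Toy retriever: rank chunks by the number of words shared with the question."""
    words = set(question.lower().split())
    ranked = sorted(indexed_chunks, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]


test_cases = [
    ("how many days of annual leave", "15 days"),
    ("when are expense reports due", "due monthly"),
    ("who signs off on travel", "travel desk"),
]
rate = hit_rate(test_cases, keyword_retrieve, k=1)
print(f"hit rate @1: {rate:.2f}")  # hit rate @1: 0.67
```

Running the same test set after every change to chunk_size, overlap, or strategy turns splitting decisions into a comparison of numbers rather than impressions.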

Chunk Size Distribution

import matplotlib.pyplot as plt

sizes = [len(chunk) for chunk in chunks]
plt.hist(sizes, bins=30)
plt.xlabel("Chunk Size (characters)")
plt.ylabel("Count")
plt.title("Chunk Size Distribution")
plt.show()

If the distribution is too scattered (some chunks are 50 characters, others 5000), it means the splitting strategy isn't controlling chunk size well and needs adjustment.

Common Misconceptions

Misconception 1: Larger chunk_size is better. Wrong. Larger chunks mean lower retrieval precision. A 5000-character chunk might contain ten different topics, making it hard to match precisely during retrieval.

Misconception 2: Smaller chunk_size is better. Wrong. Smaller chunks mean less complete context. A 50-character chunk might be just a single sentence, and the model cannot understand what it's talking about.

Misconception 3: Larger overlap is better. Wrong. Too much overlap produces significant duplicate content, wastes storage and computation, and reduces retrieval result diversity.

Misconception 4: One strategy fits all. Wrong. Different document types need different splitting strategies. Technical docs split by heading, conversations by turn, code by function.

Misconception 5: Splitting is a one-time task. Wrong. Splitting strategies need continuous adjustment based on actual performance. It's recommended to establish an evaluation process and periodically check splitting quality.

Summary

Document splitting is the first step of the RAG pipeline, and also one of the most impactful steps.

Key takeaways:

  • No universal splitting strategy: choose based on document type
  • The core tension is precision vs. context: parent-child chunks are the most direct way to ease it
  • Overlap matters: typically 10%-25% of chunk_size
  • Evaluate splitting quality: don't rely on guesswork

The next article dives into the second step — vector retrieval. How to choose an Embedding model? How to choose a vector database? How to implement hybrid search?

