Direct answer: For PDF-backed RAG, chunk by heading first and fall back to fixed-size windows of 500-800 tokens with 10-15% overlap. Heading-based chunking preserves semantic boundaries and, in our benchmark, improved recall@5 by 12-18 percentage points over naive fixed-size chunking on long-form documents. Pure semantic chunking (embedding-based boundary detection) adds another 2-4 points of recall but multiplies preprocessing time (6-18x in our runs) and is rarely worth it under 10,000 pages.
def chunk_pdf(extracted_markdown, target_tokens=600, overlap_pct=0.15):
    sections = split_by_heading(extracted_markdown)
    chunks = []
    for section in sections:
        if token_count(section) <= target_tokens:
            chunks.append(section)
        else:
            chunks.extend(sliding_window(section, target_tokens, overlap_pct))
    return chunks
Why chunking matters more than retrieval
For most RAG failures the cause is not the embedding model and not the vector database. It is that the chunk boundaries cut through the answer.
A typical example: a user asks “what is the warranty period?” and the document has a section titled “Warranty” with a one-sentence answer on line 3 and a five-paragraph elaboration after it. A fixed-size 1,000-token chunk starting at an arbitrary offset can split the one-sentence answer from its section heading, leaving the retriever to match against “the warranty period shall be calculated from the date of…” with no header context. Retrieval recall drops; reranking cannot save it.
The opposite mistake is also common: chunks so small that no single chunk contains enough context to answer the question, forcing the LLM to stitch together fragments that may or may not be consistent.
This post covers the four chunking strategies that come up in practice, with code for each and a benchmark on a 200-PDF corpus we run for our own RAG pipeline work.
Strategy 1: Fixed-size by tokens
The baseline. Split text into windows of N tokens with M% overlap.
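import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text):
    # Helper used by the later strategies (assumed to count cl100k_base tokens)
    return len(enc.encode(text))

def fixed_size_chunks(text, target_tokens=600, overlap_pct=0.15):
    tokens = enc.encode(text)
    overlap = int(target_tokens * overlap_pct)
    step = max(1, target_tokens - overlap)  # guard against overlap_pct >= 1
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[i:i + target_tokens]))
        if i + target_tokens >= len(tokens):
            # This window already reaches the end of the text;
            # another one would only repeat the overlap region.
            break
    return chunks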
This is essentially what token-window splitters such as LangChain’s TokenTextSplitter and LlamaIndex’s TokenTextSplitter produce. It is fast, deterministic, and predictable.
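To size an index before running the splitter, the window arithmetic can be worked out directly. The sketch below is illustrative only: `window_count` is a hypothetical helper, not part of any library, and it counts the minimum number of overlapping windows needed to cover a text of a given token length.

```python
import math

def window_count(n_tokens: int, target_tokens: int = 600, overlap_pct: float = 0.15) -> int:
    """Minimum number of overlapping windows needed to cover n_tokens."""
    if n_tokens <= 0:
        return 0
    overlap = int(target_tokens * overlap_pct)
    step = target_tokens - overlap
    if step <= 0:
        raise ValueError("overlap_pct must leave a positive step")
    if n_tokens <= target_tokens:
        return 1
    # The first window covers target_tokens; each further window adds `step` new tokens.
    return 1 + math.ceil((n_tokens - target_tokens) / step)
```

At the defaults the step is 510 tokens, so a 10,000-token document needs 20 windows. Raising the overlap shrinks the step and grows the chunk count, which is where the extra embedding cost comes from.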
When it works: uniform narrative text — articles, news, fiction, customer support transcripts. Any document where the prose is dense and continuous, with few hard structural breaks.
When it fails: structured documents like contracts, manuals, scientific papers, and product catalogs. The chunk boundary cuts through a clause definition, a table caption, or a numbered list mid-item.
In our benchmark, fixed-size chunking on a 50-page legal contract corpus reached only 64% retrieval recall at top-5. The same documents chunked by heading reached 81%.
Strategy 2: Heading-based
Split at every markdown heading. Each section becomes one chunk, unless the section exceeds the target size — in which case fall back to fixed-size windows within the section.
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)

def heading_chunks(markdown_text, target_tokens=600, overlap_pct=0.15):
    matches = list(HEADING_RE.finditer(markdown_text))
    if not matches:
        return fixed_size_chunks(markdown_text, target_tokens, overlap_pct)
    sections = []
    # Keep any text that precedes the first heading instead of dropping it
    preamble = markdown_text[:matches[0].start()]
    if preamble.strip():
        sections.append(preamble)
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        sections.append(markdown_text[start:end])
    chunks = []
    for section in sections:
        if token_count(section) <= target_tokens:
            chunks.append(section)
        elif not HEADING_RE.match(section):
            # Preamble without a heading: there is nothing to carry
            chunks.extend(fixed_size_chunks(section, target_tokens, overlap_pct))
        else:
            # Carry the heading line into each sub-chunk for context
            heading_line = section.split("\n", 1)[0]
            body = section[len(heading_line) + 1:]
            for sub in fixed_size_chunks(body, target_tokens, overlap_pct):
                chunks.append(f"{heading_line}\n\n{sub}")
    return chunks
The critical trick is the heading carry: when a section is too long and must be split, prepend the section heading to every sub-chunk. This costs a few extra tokens per chunk but keeps every chunk self-describing — a retriever matching on “warranty period” finds the chunk regardless of where in the warranty section the relevant sentence sits.
Prerequisites: you need an extractor that produces markdown with real headings. PyMuPDF returns headings as <h1> through <h6> only if the source PDF embedded structure tags, which most don’t. Tools like Marker, Docling, and pdfmux infer headings from font size and weight; the pdf-to-markdown-for-RAG post covers how that inference works.
When it works: every structured document type. Manuals, contracts, papers, financial reports, technical documentation.
When it fails: documents without meaningful heading structure (transcripts, novels, customer chat logs).
Strategy 3: Recursive structural
A middle ground used by LangChain’s RecursiveCharacterTextSplitter. Try to split on the largest structural unit that produces chunks under the target size, falling back through a hierarchy.
SEPARATORS = ["\n## ", "\n### ", "\n\n", "\n", ". ", " ", ""]

def recursive_chunks(text, target_tokens=600, sep_idx=0):
    if token_count(text) <= target_tokens:
        return [text]
    if sep_idx >= len(SEPARATORS):
        return [text]  # cannot split further
    sep = SEPARATORS[sep_idx]
    if sep == "":
        # Last resort: hard split on token boundary
        return fixed_size_chunks(text, target_tokens, 0.0)
    parts = text.split(sep)
    chunks, buffer = [], ""
    for part in parts:
        candidate = (buffer + sep + part) if buffer else part
        if token_count(candidate) <= target_tokens:
            buffer = candidate
        else:
            if buffer:
                chunks.append(buffer)
            if token_count(part) > target_tokens:
                chunks.extend(recursive_chunks(part, target_tokens, sep_idx + 1))
                buffer = ""
            else:
                buffer = part
    if buffer:
        chunks.append(buffer)
    return chunks
This is what most off-the-shelf RAG tutorials use because it works without requiring clean heading structure. Recall is consistently 4-7 percentage points below pure heading-based chunking on structured documents but 5-10 points above pure fixed-size on the same corpus. A reasonable default if you cannot guarantee heading quality.
Strategy 4: Semantic boundary detection
Use an embedding model to find natural topic boundaries. Compute embeddings for each sentence; flag boundaries where the cosine similarity between consecutive sentences drops below a threshold.
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def split_sentences(text):
    # Naive stand-in (an assumption, not the splitter used in the benchmark):
    # break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def semantic_chunks(text, similarity_threshold=0.5, max_tokens=800):
    sentences = split_sentences(text)
    if len(sentences) <= 1:
        return [text]
    # Normalized embeddings make the dot product equal to cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    boundaries = [0]
    for i in range(1, len(embeddings)):
        sim = float(np.dot(embeddings[i - 1], embeddings[i]))
        if sim < similarity_threshold:
            boundaries.append(i)
    boundaries.append(len(sentences))
    chunks = []
    for i in range(len(boundaries) - 1):
        section = " ".join(sentences[boundaries[i]:boundaries[i + 1]])
        if token_count(section) <= max_tokens:
            chunks.append(section)
        else:
            chunks.extend(fixed_size_chunks(section, max_tokens, 0.15))
    return chunks
This is what most “advanced RAG” guides recommend. It does produce slightly better chunks on documents without clean heading structure — interview transcripts, support conversations, unstructured notes.
The cost is large. You run an embedding model over every sentence at index time, which on a 200-PDF corpus added 23 minutes to preprocessing on CPU. The retrieval-recall gain over heading-based chunking on structured documents was 2.1 percentage points, which is rarely worth a 6x preprocessing slowdown.
We use semantic chunking only when (a) the corpus is small (<1,000 pages) and preprocessing time is irrelevant, or (b) the documents are genuinely unstructured.
Benchmark: retrieval recall on a mixed 200-PDF corpus
We ran 500 hand-written questions against four chunking strategies on our standard 200-PDF benchmark — a mix of legal contracts (40), product manuals (50), scientific papers (60), and news articles (50). Embeddings: text-embedding-3-small. Retriever: top-5 cosine similarity.
| Strategy | Avg recall@5 | Preprocessing time | Avg chunks per doc |
|---|---|---|---|
| Fixed-size (600 tok, 15% overlap) | 0.71 | 18s | 47 |
| Recursive (LangChain default) | 0.77 | 21s | 44 |
| Heading-based | 0.83 | 24s | 39 |
| Semantic boundary detection | 0.85 | 7m 12s | 51 |
Heading-based chunking beats fixed-size by 12 points on this mixed corpus. The gap is larger on the legal and manual subsets (18 points) and smaller on the news subset (4 points) because news articles are naturally short and uniform.
Semantic chunking eked out a 2-point gain over heading-based but at an 18x preprocessing cost. For most production pipelines that gain doesn’t justify the runtime.
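For readers who want to reproduce this kind of comparison, recall@k can be sketched as follows. This is a minimal version under stated assumptions: each question is labeled with the set of chunk ids that contain its answer, and the function names and the `(retrieved, relevant)` pair format are illustrative, not taken from our benchmark harness.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each question needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

def mean_recall_at_k(runs, k=5):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per question."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

Because chunk ids change with the chunking strategy, the relevance labels have to be re-derived per strategy, for example by checking which chunks contain the labeled answer span.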
Recommendation by document type
| Document type | Recommended strategy | Why |
|---|---|---|
| Legal contracts | Heading-based + clause splitting | Definitions and clauses are the natural unit |
| Product manuals | Heading-based | Sections answer questions directly |
| Scientific papers | Heading-based + figure caption joining | Abstract/intro/methods boundaries are real |
| Financial reports | Heading-based + table preservation | Tables must stay whole |
| News articles | Recursive (paragraph-based) | Headings are too sparse |
| Transcripts / support chats | Semantic | No headings exist |
| Code documentation | Heading-based with code-block preservation | API references map cleanly to headings |
The recurring rule: use the structure that already exists in the document. PDFs that come out of a good extractor carry the structure with them. Throwing that structure away to chunk by token count is the most common, most expensive RAG mistake.
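The table above can be turned into a small dispatch step at index time. The sketch below is one possible shape, not a fixed recipe: the document-type keys, the `min_headings` threshold, and the `choose_strategy` name are all assumptions made for illustration. When the document type is unknown it falls back on whether the extracted markdown has any usable heading structure.

```python
import re
from typing import Optional

# Matches a markdown heading line such as "## Limitations"
HEADING_LINE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)

# Hypothetical keys; the values mirror the recommendation table
STRATEGY_BY_DOC_TYPE = {
    "legal_contract": "heading",
    "product_manual": "heading",
    "scientific_paper": "heading",
    "financial_report": "heading",
    "code_documentation": "heading",
    "news_article": "recursive",
    "transcript": "semantic",
}

def choose_strategy(markdown: str, doc_type: Optional[str] = None, min_headings: int = 3) -> str:
    """Pick a chunking strategy from the document type, else from heading density."""
    if doc_type in STRATEGY_BY_DOC_TYPE:
        return STRATEGY_BY_DOC_TYPE[doc_type]
    n_headings = len(HEADING_LINE.findall(markdown))
    # Too few headings means the structure is not reliable enough to split on
    return "heading" if n_headings >= min_headings else "recursive"
```

The threshold of three headings is arbitrary and worth tuning against a sample of your own corpus.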
What to chunk: text only, or markdown?
A small but important choice: should chunks contain plain text or markdown?
Embedding models trained on web text handle markdown reasonably well. Keeping markdown preserves table structure, code-block boundaries, and heading context — all of which improve retrieval on technical content. The downside is a slight token overhead from the markup characters.
We use markdown for all chunks except where the embedding model is specifically known to handle plain text better (some older domain-specific models). For text-embedding-3-small, text-embedding-3-large, voyage-3, and cohere-embed-v3, markdown wins by 1-2 recall points.
Chunk metadata: do not skip this
Every chunk should carry metadata. At minimum:
chunk_metadata = {
    "source_file": "warranty_policy_2026.pdf",
    "source_page": 12,
    "section_path": "Warranty / Limitations / Geographic Exclusions",
    "heading_h1": "Warranty",
    "heading_h2": "Limitations",
    "heading_h3": "Geographic Exclusions",
    "char_offset_start": 14802,
    "char_offset_end": 15401,
}
The reason is recovery, not retrieval. When a user asks “where did you get this answer?” you need to point to a page. When you need to debug a bad answer you need to find the original chunk in the source document. When the source PDF is updated you need to know which chunks to invalidate.
The cost is a few extra fields per chunk. The benefit is the difference between a debuggable RAG system and an opaque one.
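One way to populate a field like `section_path` is to track a stack of open headings while walking the markdown. The sketch below assumes ATX-style headings (`#` through `######`); `section_paths` is a hypothetical helper, and it would misread `#` comment lines inside code blocks as headings, so treat it as a starting point.

```python
import re

HEADING_LINE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")

def section_paths(markdown: str):
    """Return (section_path, char_offset_start) for every heading line."""
    stack = []    # open headings as (level, title), outermost first
    results = []
    offset = 0
    for line in markdown.splitlines(keepends=True):
        m = HEADING_LINE.match(line.rstrip("\r\n"))
        if m:
            level = len(m.group(1))
            # Close every heading at the same or a deeper level
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2)))
            results.append((" / ".join(title for _, title in stack), offset))
        offset += len(line)
    return results
```

The same stack gives the per-level heading fields directly, and the offsets let each chunk be mapped back to its position in the extracted markdown.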
Common mistakes
A handful of things we have seen go wrong:
Splitting on \n\n only. PDF extractors that produce flat text without preserved structure put double-newlines everywhere, even between every line on a poorly extracted page. Recursive splitters then produce 4-token chunks. Verify the extractor preserves paragraph structure before tuning the splitter.
No overlap on fixed-size chunks. Zero overlap means an answer that straddles a boundary is unfindable. 10-15% is the standard range; below 10% recall starts dropping; above 20% you mostly waste embedding cost.
Stripping headings from chunks. Some pipelines normalize text by removing markdown markers, which destroys the heading information needed for downstream filtering and citation. Strip markup for display, not for indexing.
Treating tables as text. A table chunked as a sequence of pipe-delimited rows looks meaningless to an embedding model. Either preserve the entire table as a single chunk (and let the LLM read it) or extract row-level facts as natural-language sentences. The table extraction post covers the structured option.
Forgetting to deduplicate. Overlapping chunks produce near-duplicate retrieval results. Either deduplicate before retrieval (slow) or post-rerank with diversity (better).
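For the table mistake in particular, a pre-pass that separates tables from prose keeps the splitter from cutting a table in half. The following is a minimal sketch, assuming tables are written as markdown rows that begin with a pipe character; `split_keeping_tables` is a hypothetical helper, and tables without a leading pipe would not be detected.

```python
def split_keeping_tables(markdown: str):
    """Split markdown into ("text" | "table", body) blocks, keeping each table whole."""
    blocks, current, in_table = [], [], False

    def flush():
        body = "\n".join(current).strip("\n")
        if body.strip():
            blocks.append(("table" if in_table else "text", body))
        current.clear()

    for line in markdown.splitlines():
        is_row = line.lstrip().startswith("|")
        if is_row != in_table:
            # Crossing a table boundary: close the block collected so far
            flush()
            in_table = is_row
        current.append(line)
    flush()
    return blocks
```

Text blocks can then go through the normal chunker, while each table block is indexed as a single chunk or handed to a row-level fact extractor.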
Putting it together
A complete heading-first pipeline:
def index_pdf(pdf_path, embedder, vector_store):
    markdown = extract_to_markdown(pdf_path)  # pdfmux, marker, docling
    chunks = heading_chunks(markdown, target_tokens=600, overlap_pct=0.15)
    for chunk in chunks:
        # heading_chunks returns plain strings, so the chunk itself is the text
        meta = parse_metadata(chunk, source=pdf_path)
        embedding = embedder.encode(chunk)
        vector_store.upsert(
            id=meta["chunk_id"],
            embedding=embedding,
            metadata=meta,
            content=chunk,
        )
The structure is boring on purpose. The chunking strategy does most of the heavy lifting; the rest of the pipeline can be the default for whatever framework you use. We have a worked example with LlamaIndex in the LlamaIndex loader post and a parallel one with LangChain in the LangChain integration post.
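The pipeline above relies on a `chunk_id` in the metadata. One reasonable way to derive it, sketched here as an assumption rather than a description of any particular framework, is a content hash: re-indexing an unchanged chunk produces the same id, and an edited chunk gets a new one. Both function names are hypothetical.

```python
import hashlib

def make_chunk_id(source_file: str, char_offset_start: int, text: str) -> str:
    """Deterministic id built from the chunk's source, position and content."""
    digest = hashlib.sha256()
    for part in (source_file, str(char_offset_start), text):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.hexdigest()[:32]

def stale_chunk_ids(previous_ids, current_ids):
    """Ids that existed before re-indexing but are no longer produced."""
    return set(previous_ids) - set(current_ids)
```

After re-indexing an updated PDF, the ids returned by `stale_chunk_ids` are the entries to delete from the vector store.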
Summary
Heading-based chunking with a fixed-size fallback is the right default for PDF-backed RAG. In our benchmark it improved recall@5 by 12-18 percentage points over naive fixed-size chunking on structured documents, and it costs almost nothing extra to implement. Semantic boundary detection is a real technique, but its 2-point recall gain over heading-based chunking rarely justifies a 6-18x preprocessing slowdown. The two things that matter most are (1) using an extractor that preserves heading structure and (2) carrying section context into every sub-chunk you split out. Everything else is tuning.