Direct answer: Use SimpleDirectoryReader for digital PDFs where every page is text. Pass a file_extractor={".pdf": PyMuPDFReader()} override when you need speed on long documents. Write a custom BaseReader subclass when your corpus has scanned pages, complex tables, or any layout that LlamaIndex’s default loader will silently mangle. The pdfmux integration (pip install llama-index-readers-pdfmux) is the version of that custom reader that returns a confidence score per page, so a downstream filter can drop bad extractions before they reach your vector index. That last step is the difference between a RAG system that works and one that hallucinates citations.
What “loading a PDF” actually means in LlamaIndex
LlamaIndex’s ingestion pipeline turns files into Document objects. Each Document has a text field, a metadata dict, and an optional excluded_embed_metadata_keys list. Downstream, a node parser splits each Document into TextNode objects, an embedding model turns each node into a vector, and a vector store indexes the result.
A reader is the first step. Its job is:
- Open the file.
- Extract text per page (or per region).
- Wrap each unit in a Document with metadata like {"file_name": "report.pdf", "page_label": "3"}.
- Return the list.
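The wrapping and returning steps can be sketched without any PDF library. This is a minimal illustration, not LlamaIndex's implementation: PageDocument is a hypothetical stand-in for Document, and the page texts are assumed to have been extracted already.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageDocument:
    """Hypothetical stand-in for LlamaIndex's Document (text + metadata only)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def wrap_pages(file_name: str, page_texts: List[str]) -> List[PageDocument]:
    """Wrap already-extracted page texts in one document per page.

    Opening the file and extracting text are assumed to have happened
    upstream; this covers only the wrap-and-return steps.
    """
    return [
        PageDocument(
            text=text,
            metadata={"file_name": file_name, "page_label": str(number)},
        )
        for number, text in enumerate(page_texts, start=1)
    ]


docs = wrap_pages("report.pdf", ["Executive summary", "", "Risk factors"])
for doc in docs:
    print(doc.metadata["page_label"], repr(doc.text))
```

The empty second page is deliberate: the wrapper passes through whatever extraction produced, so a failed page becomes a perfectly valid-looking document.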
Step 2 determines whether the rest of the pipeline works. A node parser cannot fix garbled text. An embedding model will happily embed nonsense. The retrieval quality you get is bounded by the extraction quality you put in. This is why reader choice matters more than chunk size or top-k tuning.
We covered the chunking side in PDF to Markdown for RAG pipelines. This post is about getting the text right in the first place.
The three default paths
LlamaIndex ships several PDF-capable readers. Most teams pick one of three.
Path 1: SimpleDirectoryReader with no overrides
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
What happens under the hood: LlamaIndex looks at the file extension, sees .pdf, and dispatches to the bundled PDFReader class, which uses pypdf to extract text page by page.
Speed: fast. About 0.4 seconds per page on digital text on a 2024 MacBook Pro.
Failures: anything pypdf cannot read. That includes scanned pages (returns empty strings), multi-column layouts (text reading order goes left-to-right across columns, producing word salad), tables (cell boundaries are lost; rows merge into single lines), and any PDF where text is stored as paths rather than glyphs (returns garbage). If your corpus is RFP attachments, financial filings, or scanned contracts, you will see all four failure modes within the first 50 documents.
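A cheap way to see how often the empty-string failure occurs is to count pages whose text is empty or nearly so after loading. The sketch below assumes each loaded object exposes text and metadata attributes, as LlamaIndex Document objects do. LoadedPage is a stand-in so the example runs on its own, and the 20-character threshold is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class LoadedPage:
    """Stand-in for a loaded page document; not a LlamaIndex class."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def find_suspect_pages(
    pages: Iterable[LoadedPage], min_chars: int = 20
) -> List[Dict[str, str]]:
    """Return the metadata of pages whose stripped text is shorter than min_chars."""
    return [
        page.metadata
        for page in pages
        if len(page.text.strip()) < min_chars
    ]


pages = [
    LoadedPage("This agreement is entered into by the parties below.", {"page_label": "46"}),
    LoadedPage("", {"page_label": "47"}),        # scanned page: no text layer
    LoadedPage("  \n ", {"page_label": "48"}),   # scanned page: whitespace only
]
suspects = find_suspect_pages(pages)
print([m["page_label"] for m in suspects])
```

This only catches the empty-page case. Word salad from multi-column layouts and flattened tables produce plenty of characters and will pass a length check.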
Use when: every PDF in your corpus was generated by a modern word processor and does not contain tables you care about.
Path 2: file_extractor with PyMuPDFReader
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PyMuPDFReader
documents = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": PyMuPDFReader()},
).load_data()
What changes: PyMuPDF (a Python binding for MuPDF) replaces pypdf. Speed roughly doubles. Layout handling improves on multi-column documents because MuPDF preserves block structure. The license is AGPL-3.0, which matters if you redistribute your application or serve it to outside users over a network; for internal RAG pipelines this is usually fine.
Failures: scanned pages still return empty strings (no OCR). Tables are still extracted as flat text. Equations are still lost. PyMuPDF is faster and cleaner than pypdf but does the same job: it is a digital-text extractor with no awareness of document structure beyond text blocks.
Use when: your corpus is digital but long. The speed difference matters at 10,000+ documents.
Path 3: PDFReader with manual page handling
from llama_index.readers.file import PDFReader
from pathlib import Path
reader = PDFReader(return_full_document=False)
documents = reader.load_data(Path("./data/report.pdf"))
This gives you one Document per page rather than one per file. Useful when you want page-level metadata for citation, but it does not change the underlying extraction. You still get whatever pypdf produced.
What none of the defaults do
Three things break a RAG system, and none of the three default paths handle any of them:
Scanned pages. A 200-page contract where pages 47 and 48 are scanned amendments will load as empty strings on those pages. The retrieval system will never find the amendment text. The user will ask “what was the indemnification cap?” and your system will confidently quote an earlier section that contradicts the amendment.
Tables with merged cells. Most LlamaIndex defaults flatten tables into single-line text where row boundaries collapse. A 10-K with a debt maturity schedule becomes “2026 2027 2028 250 300 400 5.2% 5.5% 5.8%” with no structure. Embeddings of this string are useless.
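To make the contrast concrete, here is the same debt schedule rendered as a Markdown table rather than a flat string. This is a minimal sketch of the output format only; it does not detect tables in a PDF. The row labels ("Amount", "Rate") and the "Year" header are hypothetical, since the flattened string carries no labels at all.

```python
from typing import List


def rows_to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Render a header and rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row width does not match header width")
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Values come from the flattened example; the labels are assumptions.
header = ["Year", "2026", "2027", "2028"]
rows = [
    ["Amount", "250", "300", "400"],
    ["Rate", "5.2%", "5.5%", "5.8%"],
]
table = rows_to_markdown(header, rows)
print(table)
```

With the structure preserved, each value stays attached to its year and its row, which is the association the flat string loses.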
Per-page quality variance. A 500-page document can have 480 perfectly extracted pages and 20 catastrophically broken ones. The default readers tell you nothing about which is which. Every page gets embedded. Every page is retrievable. The 20 broken pages pollute your index forever.
The fix is a custom reader that does layout-aware extraction, OCRs scans, preserves table structure, and emits a confidence number per page so downstream filters can drop the failures.
Writing a custom BaseReader
Here is the skeleton for a custom reader in LlamaIndex 0.11+:
from llama_index.core.readers.base import BaseReader
from llama_index.core import Document
from pathlib import Path
from typing import List, Optional
class CustomPDFReader(BaseReader):
    """Reader that emits one Document per page with a confidence score."""

    def load_data(
        self,
        file: Path,
        extra_info: Optional[dict] = None,
    ) -> List[Document]:
        documents = []
        for page_num, page_text, confidence in self.extract_pages(file):
            metadata = {
                "file_name": file.name,
                "page_label": str(page_num),
                "extraction_confidence": confidence,
            }
            # Merge any metadata supplied by the caller.
            if extra_info:
                metadata.update(extra_info)
            documents.append(Document(text=page_text, metadata=metadata))
        return documents

    def extract_pages(self, file: Path):
        """Yield (page_num, page_text, confidence) tuples; implemented per corpus."""
        raise NotImplementedError
Plug it into SimpleDirectoryReader the same way as PyMuPDFReader:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": CustomPDFReader()},
).load_data()
The work is in extract_pages. You need to:
- Detect whether each page is digital text or scanned.
- For digital pages: use a layout-aware extractor (PyMuPDF, pdfplumber, or Docling).
- For scanned pages: route to OCR (Tesseract, RapidOCR, or a multimodal model).
- For tables: detect them and emit Markdown table syntax instead of flat text.
- For each page: produce a confidence score from 0 to 1 based on extractor signals.
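The detection and scoring steps can be sketched on plain strings. The heuristics here are assumptions chosen for illustration (a 20-character minimum for a usable text layer, a readable-character ratio, a penalty for Unicode replacement characters); they are not how pdfmux or any particular extractor computes confidence. A real implementation would draw on extractor signals such as OCR confidence or font-decoding errors.

```python
from typing import Tuple


def score_text(text: str) -> float:
    """Heuristic confidence in [0, 1] for one page of extracted text.

    Assumption: clean text is mostly letters, digits, and whitespace and
    contains no Unicode replacement characters.
    """
    stripped = text.strip()
    if not stripped:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    ratio = readable / len(stripped)
    # Penalize U+FFFD, which typically marks glyphs that could not be decoded.
    penalty = min(1.0, 10 * stripped.count("\ufffd") / len(stripped))
    return round(max(0.0, ratio * (1.0 - penalty)), 2)


def route_page(text_layer: str, min_chars: int = 20) -> Tuple[str, float]:
    """Pick a path for a page: keep the text layer or send the page to OCR."""
    if len(text_layer.strip()) < min_chars:
        return "ocr", 0.0
    return "digital", score_text(text_layer)


print(route_page(""))
print(route_page("Revenue grew 12 percent in fiscal 2024."))
print(route_page("\ufffd" * 30))
```

An extract_pages implementation would call something like route_page on each page's text layer, run OCR on the pages routed to "ocr", and then score the OCR output the same way.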
Building this from scratch is a 2,000-line project that you will maintain forever. We did it. It is now a Python package.
The pdfmux integration
pdfmux is an open-source Python library that routes each PDF page to the optimal extractor: PyMuPDF for digital text, Docling for tables, RapidOCR for scanned pages. It scores quality on every page and re-extracts failures automatically. The LlamaIndex integration wraps it as a BaseReader.
pip install llama-index-readers-pdfmux
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.pdfmux import PDFMuxReader
reader = PDFMuxReader(min_confidence=0.8, format="markdown")
documents = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": reader},
).load_data()
Each returned Document has:
document.metadata = {
    "file_name": "10-K-2024.pdf",
    "page_label": "47",
    "extraction_confidence": 0.92,
    "extractor_used": "pymupdf",
    "page_type": "digital_text",
}
The extraction_confidence field is the part that matters. You filter on it before indexing:
clean_docs = [d for d in documents if d.metadata["extraction_confidence"] >= 0.7]
low_quality = [d for d in documents if d.metadata["extraction_confidence"] < 0.7]
print(f"Indexing {len(clean_docs)} pages, holding {len(low_quality)} for review")
The low_quality list goes to a human queue, not your vector store. This is the single change that takes a RAG system from “works in demos” to “works on real customer documents.”
Comparison table
The table below compares the four reader options on a fixed 200-PDF benchmark drawn from financial filings, academic papers, and scanned contracts. Numbers are from the opendataloader-bench test suite, May 2026.
| Reader | Speed (sec/page) | Digital text | Scanned pages | Tables | Confidence score | License |
|---|---|---|---|---|---|---|
| SimpleDirectoryReader (default) | 0.4 | OK | empty string | flattened | none | MIT (pypdf) |
| PyMuPDFReader | 0.2 | good | empty string | flattened | none | AGPL-3 |
| PDFReader (per-page) | 0.4 | OK | empty string | flattened | none | MIT |
| PDFMuxReader | 1.1 | good | OCR’d | Markdown tables | 0 to 1 per page | MIT |
The 1.1 sec/page on PDFMuxReader looks slow next to 0.2 for PyMuPDFReader, but the PyMuPDFReader number assumes every page is digital text. On a corpus that is 30% scanned, PyMuPDFReader returns empty strings on 30% of pages and is functionally broken regardless of speed.
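The trade-off is easier to judge with the arithmetic written out. The sketch below uses the per-page speeds from the table and the 30% scanned share from the paragraph above; the 10,000-page corpus size is an assumption for illustration, as is the simplification that every OCR'd page comes back usable.

```python
from typing import Tuple


def corpus_estimate(
    pages: int, sec_per_page: float, usable_fraction: float
) -> Tuple[float, int]:
    """Return (total hours, usable pages) for one reader over a corpus."""
    hours = pages * sec_per_page / 3600
    usable = round(pages * usable_fraction)
    return round(hours, 2), usable


PAGES = 10_000  # assumed corpus size, for illustration only

# PyMuPDFReader: 0.2 sec/page, but the 30% scanned pages come back empty.
pymupdf_estimate = corpus_estimate(PAGES, 0.2, 0.70)
# PDFMuxReader: 1.1 sec/page; assumes every scanned page is OCR'd usably.
pdfmux_estimate = corpus_estimate(PAGES, 1.1, 1.00)

print("PyMuPDFReader:", pymupdf_estimate)
print("PDFMuxReader:", pdfmux_estimate)
```

Under these assumptions the faster reader finishes in roughly half an hour instead of three, but leaves 3,000 pages out of the index. Ingestion is a one-time cost per document; missing pages are a permanent one.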
Code: full RAG pipeline with confidence filtering
A complete LlamaIndex pipeline that ingests a folder of mixed-quality PDFs, drops low-confidence pages, and indexes the rest:
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.pdfmux import PDFMuxReader

# Global embedding and chunking settings.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Route every .pdf in the folder through the pdfmux reader.
reader = PDFMuxReader(min_confidence=0.0, format="markdown")
raw_documents = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": reader},
).load_data()

# Split pages into an indexable set and a human review queue.
clean_documents = []
review_queue = []
for doc in raw_documents:
    if doc.metadata["extraction_confidence"] >= 0.7:
        clean_documents.append(doc)
    else:
        review_queue.append({
            "file": doc.metadata["file_name"],
            "page": doc.metadata["page_label"],
            "confidence": doc.metadata["extraction_confidence"],
            "extractor_used": doc.metadata["extractor_used"],
        })
print(f"Indexing {len(clean_documents)} pages")
print(f"Review queue: {len(review_queue)} pages")

# Build the index from clean pages only, then query it.
index = VectorStoreIndex.from_documents(clean_documents)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What was the indemnification cap in the 2024 master services agreement?"
)
print(response)
The review_queue list is the part most teams skip. Skipping it is why retrieval works on a demo of three documents and fails on production.
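A review queue is only useful if someone can open it. One low-effort option is to write it out as CSV, worst pages first. This is a sketch using the standard library; the entries are shaped like the review_queue dicts in the pipeline above, and the sample file name, confidence values, and "rapidocr" extractor label are made up for the example.

```python
import csv
import io
from typing import Dict, Iterable

FIELDS = ["file", "page", "confidence", "extractor_used"]


def review_queue_to_csv(queue: Iterable[Dict[str, object]]) -> str:
    """Serialize review-queue entries to CSV text, lowest confidence first."""
    ordered = sorted(queue, key=lambda row: row["confidence"])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


# Made-up entries, shaped like the dicts appended to review_queue.
sample_queue = [
    {"file": "msa-2024.pdf", "page": "48", "confidence": 0.41, "extractor_used": "rapidocr"},
    {"file": "msa-2024.pdf", "page": "47", "confidence": 0.12, "extractor_used": "rapidocr"},
]
csv_text = review_queue_to_csv(sample_queue)
print(csv_text)
```

Writing csv_text to a file or a ticketing system is left out on purpose; the point is that the queue leaves the process in a form a reviewer can sort and act on.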
When SimpleDirectoryReader is the right answer
Not every project needs a custom reader. Use the default SimpleDirectoryReader when:
- Every PDF in your corpus is born-digital (no scans).
- Your documents have no tables, or you do not care about tables.
- Your retrieval quality target is 80%, not 95%.
- The corpus is small enough that a human can spot-check the index.
Most internal documentation projects fit this profile. A customer-facing product on top of legal contracts, financial reports, or medical records does not.
When to write your own custom reader instead of using pdfmux
Use a custom reader you write yourself when:
- You have a single PDF format with very specific structure (one issuer’s invoices, one government form). A targeted regex-based extractor will beat any general-purpose tool.
- You have a non-standard text encoding pdfmux does not support (rare; pdfmux handles Arabic, CJK, RTL).
- You need to extract layout-aware fields (signature blocks, header zones) that a general extractor will miss.
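For the single-format case, a targeted extractor can be very small. The sketch below assumes a hypothetical invoice layout with "Invoice No", "Date", and "Total Due" labels; the field names and patterns are illustrative and would need to be adapted to the real documents.

```python
import re
from typing import Dict, Optional

# Hypothetical layout for one issuer's invoices; adjust to the real format.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*:?\s*([A-Z0-9-]+)"),
    "issue_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total\s+Due\s*:?\s*\$?([\d,]+\.\d{2})"),
}


def extract_invoice_fields(page_text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when a field is missing."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(page_text)
        fields[name] = match.group(1) if match else None
    return fields


sample = "ACME Corp\nInvoice No: INV-2024-0193\nDate: 2024-03-15\nTotal Due: $1,250.00\n"
fields = extract_invoice_fields(sample)
print(fields)
```

A missing field comes back as None rather than raising, which gives the same kind of per-page signal as a confidence score: a page with any None field can be held for review instead of indexed.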
For everything else — mixed corpora, customer-uploaded documents, document classes you do not control — the orchestrator pattern wins. We covered why in Why an orchestrator beats a single extractor for RAG.
What about LangChain?
If you came here from a LangChain project, the equivalent loader is langchain-pdfmux. Same engine, same per-page confidence score, same filtering pattern. The full LangChain comparison is in PDF extraction with LangChain.
Summary
The three default LlamaIndex paths (SimpleDirectoryReader, PyMuPDFReader, PDFReader) all do the same thing: extract text from digital pages and silently fail on scans, tables, and complex layouts. For a serious RAG pipeline on real-world documents, you need either a custom BaseReader subclass with OCR fallback and confidence scoring, or the pdfmux LlamaIndex integration, which is that custom reader as a maintained package. The single most useful field in the result is the per-page confidence score, because it is the only signal that lets you filter bad extractions out of your index before they corrupt retrieval.
The right reader does not make your RAG system smart. It just stops the dumb mistakes that make smart-looking systems hallucinate.