PDF extraction with LangChain: loaders, splitters, and the pdfmux integration

TL;DR: How to load PDFs into LangChain. PyPDFLoader, UnstructuredPDFLoader, PyMuPDFLoader, and the pdfmux loader compared on speed, table fidelity, and cost.

Direct answer: Use PyMuPDFLoader for fast, digital-text-only PDFs. Use UnstructuredPDFLoader when you need basic table support and don’t mind the dependency footprint. Use the langchain-pdfmux loader (pip install langchain-pdfmux) when you have mixed PDFs (scanned, complex tables, multi-column) and want per-page confidence scores so a downstream filter can drop bad pages before they pollute your vector index. The other 3 loaders return raw text. pdfmux returns Markdown plus a confidence number per page, which is the difference between a RAG system that works and one that hallucinates.

What a “PDF loader” actually does in LangChain

LangChain’s ingestion pipeline turns documents into a list of Document objects, each with page_content (a string) and metadata (a dict). A loader’s job is:

Read the PDF file.
Extract text per page.
Wrap each page (or the whole document) in a Document with metadata like {"source": "report.pdf", "page": 3}.
Return the list.
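The four steps above can be sketched without any PDF library at all. The snippet below is a minimal illustration, not LangChain's implementation: `Document` is a stand-in dataclass with the same two fields, `wrap_pages` is a hypothetical helper, and the per-page text is assumed to have been extracted already (steps 1 and 2). Page numbers are 1-indexed here as an assumption; some loaders count from 0.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document: a string plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def wrap_pages(source: str, page_texts: list[str]) -> list[Document]:
    """Steps 3-4: wrap each page's text in a Document and return the list.

    Steps 1-2 (reading the file, extracting text) are assumed to have
    produced `page_texts`, one string per page.
    """
    return [
        Document(page_content=text, metadata={"source": source, "page": number})
        for number, text in enumerate(page_texts, start=1)  # 1-indexed by assumption
    ]


docs = wrap_pages("report.pdf", ["Cover page", "Executive summary", "Q3 results"])
print(docs[2].metadata)
```

Everything a splitter or retriever sees later is whatever ended up in `page_content` at this point.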

The text quality at step 2 determines whether the rest of your RAG pipeline works. A splitter cannot fix garbled text. An embedding model will happily embed nonsense. The retrieval quality you get is bounded by the extraction quality you put in.

This is why loader choice matters more than splitter choice or chunk size. We covered the chunking side in PDF to Markdown for RAG pipelines — this post is about getting the text right in the first place.

The 4 loaders compared

LangChain ships 8+ PDF loaders. Most are thin wrappers around the same 4 backends. Here is the honest comparison:

Loader	Backend	Speed (pages/sec)	Tables	OCR	Markdown	Confidence	License
`PyPDFLoader`	pypdf	50-80	None	No	No	No	BSD
`PyMuPDFLoader`	PyMuPDF	100-200	Limited	No	Partial	No	AGPL-3.0
`UnstructuredPDFLoader`	unstructured	5-15	Yes	Optional	No	No	Apache-2.0
`langchain-pdfmux`	pdfmux	30-100	Yes	Yes (auto)	Yes	Yes (per page)	MIT

Speed numbers are from a 1,422-page corpus of SEC filings, board reports, and scanned legal docs on a 4-core M1 Mac. Full methodology: real-world PDF benchmark.

The honest summary: PyMuPDFLoader wins on raw speed for clean PDFs but loses on tables and carries an AGPL-3.0 license that commercial users have to plan for. UnstructuredPDFLoader handles tables but runs at 5-15 pages/sec, an order of magnitude slower than PyMuPDFLoader, and pulls in 200+ MB of dependencies. The pdfmux loader is the only one of the four that returns Markdown with structure preserved and reports a confidence score so you can filter bad pages programmatically.

Setup: installing the pdfmux loader for LangChain

pip install langchain-pdfmux langchain langchain-community

If you need OCR for scanned PDFs (recommended for any real document collection):

pip install "langchain-pdfmux[ocr]"

OCR runs on CPU via Tesseract — no GPU, no API key, no rate limits. We benchmarked CPU OCR throughput in PDF extraction without GPU.

Basic usage: loading a single PDF

from langchain_pdfmux import PdfmuxLoader

loader = PdfmuxLoader("annual-report.pdf")
docs = loader.load()

print(f"Loaded {len(docs)} pages")
for doc in docs[:3]:
    print(f"Page {doc.metadata['page']}: confidence {doc.metadata['confidence']:.2f}")
    print(doc.page_content[:200])
    print("---")

Each Document has:

page_content — Markdown text with headings, tables, and lists preserved
metadata.source — the source filename
metadata.page — 1-indexed page number
metadata.confidence — float between 0.0 and 1.0
metadata.extractor_used — which engine handled this page (pymupdf, tesseract, docling, gemini)
metadata.warnings — list of issues detected

The confidence field is the differentiator. It tells you whether to trust this page in retrieval or flag it for review.

Filtering low-confidence pages

The single most common RAG failure mode is one bad page poisoning a query. A scanned page with garbled OCR will embed into a vector that is nominally close to clean text but factually wrong. The model will cite it confidently and produce a wrong answer.

from langchain_pdfmux import PdfmuxLoader

loader = PdfmuxLoader("legal-contract.pdf")
all_docs = loader.load()

# Filter pages below 0.7 confidence
clean_docs = [d for d in all_docs if d.metadata["confidence"] >= 0.7]
flagged_docs = [d for d in all_docs if d.metadata["confidence"] < 0.7]

print(f"Indexed {len(clean_docs)} clean pages")
print(f"Flagged {len(flagged_docs)} pages for review")

# Index only clean pages
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(clean_docs, OpenAIEmbeddings())

In our benchmark of 1,422 pages, ~8% scored below 0.7 confidence. Most were:

Scanned pages with poor OCR (image quality below 200 DPI)
Pages that were 90%+ images with no extractable text
Pages with rotated or non-standard text orientation
Pages with unsupported character encodings

Filtering these before embedding eliminated ~30% of hallucinations in the downstream RAG system.
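The 0.7 cutoff is a starting point rather than a law, so it can help to see how much of a corpus survives at different thresholds before committing to one. The sketch below is illustrative only: `kept_fraction` is a hypothetical helper and `sample_scores` is made-up data standing in for the `confidence` values collected from your own pages.

```python
def kept_fraction(confidences: list[float], threshold: float) -> float:
    """Fraction of pages whose confidence is at or above the threshold."""
    if not confidences:
        return 0.0
    kept = sum(1 for score in confidences if score >= threshold)
    return kept / len(confidences)


# Hypothetical per-page scores; in practice collect d.metadata["confidence"].
sample_scores = [0.98, 0.95, 0.91, 0.88, 0.72, 0.69, 0.55, 0.97, 0.93, 0.31]

for threshold in (0.5, 0.7, 0.9):
    share = kept_fraction(sample_scores, threshold)
    print(f"threshold {threshold:.1f}: keep {share:.0%} of pages")
```

If a small change in threshold drops a large share of pages, inspect a few of the borderline pages by hand before picking the cutoff.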

Splitting PDF Markdown into chunks

Once loaded, you split. Because pdfmux returns Markdown with heading structure, you can use LangChain’s MarkdownHeaderTextSplitter to chunk on semantic boundaries:

from langchain_pdfmux import PdfmuxLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter

loader = PdfmuxLoader("technical-spec.pdf")
docs = loader.load()

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

chunks = []
for doc in docs:
    sections = splitter.split_text(doc.page_content)
    for section in sections:
        section.metadata.update(doc.metadata)
        chunks.append(section)

print(f"{len(docs)} pages -> {len(chunks)} semantic chunks")

This produces chunks that map to actual sections in the document — far better for retrieval than fixed-size chunking. We documented why heading-based chunking outperforms character-based in the PDF to Markdown for RAG guide.

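For PDFs without strong heading structure, fall back to RecursiveCharacterTextSplitter. For token-aware sizing, construct it with RecursiveCharacterTextSplitter.from_tiktoken_encoder (requires tiktoken); the plain constructor below counts chunk_size and chunk_overlap in characters: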

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n## ", "\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)

The separators list above tells the splitter to try Markdown H2 boundaries first, then paragraph breaks, line breaks, sentence ends, and finally single spaces.
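To make the "try the coarsest separator first" idea concrete, here is a deliberately simplified sketch of priority-based splitting. It is not LangChain's implementation: `split_by_priority` is a hypothetical function, it has no chunk overlap, and it drops the separator text instead of keeping it. It only shows how a splitter falls through to finer separators when a piece is still too large.

```python
def split_by_priority(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator that occurs, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard-cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return split_by_priority(text, finer, chunk_size)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: retry with the finer separators.
            chunks.extend(split_by_priority(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph one.\n\nIntro paragraph two."
    "\n\n## Methods\n\nWe measured things. Then we measured more."
)
chunks = split_by_priority(text, ["\n\n## ", "\n\n", "\n", ". ", " "], chunk_size=45)
for chunk in chunks:
    print(repr(chunk))
```

With a 45-character limit, the two intro paragraphs stay together because the heading boundary is tried first, and only the oversized "Methods" section is broken down further.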

Handling tables in retrieval

Tables are the second most common RAG failure mode. A financial report’s Q3 revenue table extracted as plain text reads Revenue Q1 2025 12.3 Q2 2025 14.1 Q3 2025 15.8 — the relationships between cells are gone. Retrieval will return this chunk for a “Q3 revenue” query and the LLM will guess.

pdfmux preserves tables as Markdown pipe tables:

| Revenue   | Q1 2025 | Q2 2025 | Q3 2025 |
|-----------|---------|---------|---------|
| Product A | $12.3M  | $14.1M  | $15.8M  |
| Product B | $8.7M   | $9.2M   | $10.1M  |

LLMs trained on Markdown can read pipe tables natively. GPT-4, Claude, and Gemini all parse these correctly during inference.
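The same structure that helps an LLM also makes the table machine-readable. As a rough illustration, the sketch below parses the pipe table shown above back into row dictionaries using only the standard library. `parse_pipe_table` is a hypothetical helper that assumes a well-formed table with a single header row and no escaped pipes inside cells.

```python
def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple Markdown pipe table into one dict per data row."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| divider
    return [dict(zip(header, row)) for row in body]


table = """
| Revenue   | Q1 2025 | Q2 2025 | Q3 2025 |
|-----------|---------|---------|---------|
| Product A | $12.3M  | $14.1M  | $15.8M  |
| Product B | $8.7M   | $9.2M   | $10.1M  |
"""

records = parse_pipe_table(table)
q3_by_product = {row["Revenue"]: row["Q3 2025"] for row in records}
print(q3_by_product)
```

The flattened plain-text version of the same table cannot be parsed this way, because nothing marks where one cell ends and the next begins.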

For high-stakes table extraction (financial filings, regulatory submissions, scientific papers), set quality="standard" to route table pages through the Docling overlay:

loader = PdfmuxLoader("10-k.pdf", quality="standard")
docs = loader.load()

pdfmux scores 0.911 TEDS (table accuracy) on the opendataloader benchmark — matching Docling and ahead of every other open-source tool. Methodology and full numbers: benchmarking PDF extractors.

Batch ingestion: loading a directory of PDFs

For ingesting a folder of documents — knowledge bases, archives, contract folders — use PdfmuxDirectoryLoader:

from langchain_pdfmux import PdfmuxDirectoryLoader

loader = PdfmuxDirectoryLoader(
    "documents/",
    glob="**/*.pdf",
    quality="standard",
    workers=4,
    show_progress=True,
)
docs = loader.load()

This processes 4 PDFs in parallel. On a 4-core machine with the SEC filings benchmark corpus, throughput hits ~1,800 pages/hour in standard mode and ~9,000 pages/hour in fast mode. The full benchmark covers a directory of 47 PDFs (1,422 pages) processed end-to-end in 9 minutes.
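For capacity planning, the figures above reduce to simple arithmetic. The sketch below checks the 9-minute benchmark run against the quoted rates and estimates wall-clock time for a larger job. The 50,000-page corpus is a made-up example, and the helper names are hypothetical.

```python
def pages_per_hour(pages: int, minutes: float) -> float:
    """Convert a measured run into an hourly rate."""
    return pages / minutes * 60


def hours_needed(pages: int, rate_per_hour: float) -> float:
    """Estimated wall-clock hours to ingest `pages` at a given rate."""
    return pages / rate_per_hour


benchmark_rate = pages_per_hour(1422, 9)
print(f"benchmark run: {benchmark_rate:.0f} pages/hour")

corpus_pages = 50_000  # hypothetical corpus size
for mode, rate in (("standard", 1_800), ("fast", 9_000)):
    print(f"{mode}: about {hours_needed(corpus_pages, rate):.1f} hours")
```

The 9-minute run works out to 9,480 pages/hour, which lines up with the fast-mode figure, so budget several times longer when running in standard mode.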

For directories larger than ~10,000 pages, switch to streaming so memory doesn’t blow up:

from langchain_pdfmux import PdfmuxDirectoryLoader

loader = PdfmuxDirectoryLoader("large-corpus/", streaming=True)

for doc in loader.lazy_load():
    # Process and embed one page at a time
    embed_and_index(doc)

lazy_load() yields documents one at a time instead of building the full list in memory.
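Embedding one page per call is often slower than sending small batches, so a common companion to a lazy stream is a batching helper. The sketch below is one way to do it with the standard library. `batched` and `fake_lazy_load` are hypothetical names, and the generator stands in for `loader.lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the whole stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def fake_lazy_load(n_pages: int) -> Iterator[str]:
    """Stand-in for loader.lazy_load(): yields one page at a time."""
    for page in range(1, n_pages + 1):
        yield f"page {page}"


batch_sizes = [len(batch) for batch in batched(fake_lazy_load(10), size=4)]
print(batch_sizes)
```

Memory use stays bounded by the batch size, since at most `size` pages are held at once.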

Choosing the right loader for your use case

The decision tree we recommend:

All your PDFs are clean, digital-text-only, with no tables you care about. Use PyMuPDFLoader. It is 2-3x faster than alternatives for this case. Watch the AGPL-3.0 obligation if you ship commercial software: distributing or hosting an application built on it generally means releasing your source under AGPL terms, unless you buy a commercial PyMuPDF license from Artifex.
Your PDFs have tables but no scanned pages, and you don’t need confidence scores. Use UnstructuredPDFLoader with mode="elements" and strategy="hi_res". It will extract tables but is slow and heavyweight. Apache-2.0 licensed.
Your PDFs are mixed: some scanned, some with tables, some with multi-column layouts. You’re building production RAG. Use langchain-pdfmux. It auto-routes each page through the appropriate engine, returns Markdown, and gives you per-page confidence so you can filter. MIT licensed.
You’re a researcher prototyping on 5 PDFs and any text extraction is fine. Use PyPDFLoader. It is the simplest, has zero dependencies beyond pypdf, and is BSD licensed. Just don’t ship it.
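The decision tree above can be written down as a small function, which is handy if loader choice needs to live in configuration rather than in someone's head. This is only a sketch of the rules as stated: `choose_loader` and its flag names are hypothetical, and the order in which the rules are checked is an assumption.

```python
def choose_loader(
    *,
    prototype_only: bool = False,
    has_scans: bool = False,
    has_tables: bool = False,
    multi_column: bool = False,
    needs_confidence: bool = False,
) -> str:
    """Return the loader class name suggested by the decision tree."""
    if prototype_only:
        return "PyPDFLoader"  # any text extraction is fine
    if has_scans or multi_column or needs_confidence:
        return "PdfmuxLoader"  # mixed corpus or per-page filtering needed
    if has_tables:
        return "UnstructuredPDFLoader"  # tables, but no scans
    return "PyMuPDFLoader"  # clean, digital-text-only PDFs


print(choose_loader(has_tables=True, has_scans=True))
```

License compatibility still has to be checked separately; the function only encodes the document-shape rules.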

For the more general “which PDF library do I use” question — including non-LangChain contexts — see best PDF extraction library for Python and PDF extractor comparison 2026.

Common pitfalls

1. Loading the whole PDF as one document. The default PyPDFLoader.load() returns one Document per page. The default UnstructuredPDFLoader returns one Document for the whole file. Splitting per page first gives you better metadata for retrieval (you can cite which page an answer came from). The pdfmux loader returns one Document per page by default.

2. Skipping confidence checks. Most teams ingest blindly. Then, 3 weeks later, they find that 11% of their retrieval results are gibberish from poorly scanned pages. The fix is filtering at ingestion, not at query time.

3. Using the wrong splitter for the source format. MarkdownHeaderTextSplitter only works if the loader returns Markdown. If you use PyPDFLoader with a Markdown splitter, the splitter sees no heading markers and produces one giant chunk per page.

4. Not handling password-protected PDFs. Most loaders silently fail on encrypted PDFs. The pdfmux loader raises PdfmuxEncryptedError so you can catch it explicitly:

from langchain_pdfmux import PdfmuxLoader, PdfmuxEncryptedError

try:
    docs = PdfmuxLoader("encrypted.pdf").load()
except PdfmuxEncryptedError:
    docs = PdfmuxLoader("encrypted.pdf", password="your-password").load()

5. Re-running ingestion every time. Cache loader output. The pdfmux loader is deterministic for a given file — same input, same output. Hash the PDF bytes and cache the resulting Document list to disk or Redis.
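A minimal version of the caching advice in pitfall 5 might look like the sketch below. It hashes the file bytes with SHA-256 and stores the loader output as JSON on disk. The names are hypothetical: `load_cached` takes any `load_fn` that returns JSON-serializable page dicts, and `fake_loader` stands in for a real loader call so the example runs on its own.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def file_sha256(path: Path) -> str:
    """Hash the file in 1 MiB blocks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def load_cached(
    pdf_path: Path, cache_dir: Path, load_fn: Callable[[Path], list[dict]]
) -> list[dict]:
    """Return cached pages for this exact file content, loading only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_sha256(pdf_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    pages = load_fn(pdf_path)
    cache_file.write_text(json.dumps(pages), encoding="utf-8")
    return pages


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "sample.pdf"
    pdf.write_bytes(b"%PDF-1.7 placeholder bytes")
    calls = []

    def fake_loader(path: Path) -> list[dict]:
        calls.append(path.name)  # record each real extraction
        return [{"page_content": "hello", "metadata": {"page": 1}}]

    first = load_cached(pdf, Path(tmp) / "cache", fake_loader)
    second = load_cached(pdf, Path(tmp) / "cache", fake_loader)

print(len(calls), first == second)
```

If extraction settings such as the quality mode or the loader version can change, fold them into the cache key as well, otherwise stale output will be served after an upgrade.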

Production checklist

Pick a loader based on the decision tree above
Use quality="standard" for any document with tables you care about
Filter pages below 0.7 confidence before embedding
Use MarkdownHeaderTextSplitter if your loader returns Markdown
Cache loader output keyed on file hash
Log metadata.warnings to a monitoring system
Run a sample of your real PDFs through the loader before committing — benchmarks are averages, your corpus is specific
Verify your loader’s license is compatible with your distribution model
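For the warnings item in the checklist, one lightweight approach is to log each warning and keep a running tally by type, so a spike shows up in monitoring. The sketch below assumes each page's metadata may carry a `warnings` list of strings. `summarize_warnings` is a hypothetical helper, and the warning names in the sample data are invented for illustration, not pdfmux's actual messages.

```python
import logging
from collections import Counter

logger = logging.getLogger("pdf_ingest")


def summarize_warnings(pages_metadata: list[dict]) -> Counter:
    """Log every page-level warning and return counts per warning type."""
    counts: Counter = Counter()
    for meta in pages_metadata:
        for warning in meta.get("warnings", []):
            counts[warning] += 1
            logger.warning(
                "%s page %s: %s", meta.get("source"), meta.get("page"), warning
            )
    return counts


# Invented sample metadata; real warning strings will differ.
sample_metadata = [
    {"source": "a.pdf", "page": 1, "warnings": []},
    {"source": "a.pdf", "page": 2, "warnings": ["low_ocr_quality"]},
    {"source": "b.pdf", "page": 1, "warnings": ["low_ocr_quality", "rotated_text"]},
    {"source": "b.pdf", "page": 2},
]

warning_counts = summarize_warnings(sample_metadata)
print(warning_counts.most_common())
```

The returned counter can be pushed to whatever metrics system is already in place.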

FAQ

Does pdfmux work with LlamaIndex too? Yes. The llama-index-readers-pdfmux package provides a PdfmuxReader with the same per-page confidence scoring. The patterns above translate directly.

Can I run pdfmux extraction inside a LangServe deployment? Yes. The loader is thread-safe and works inside FastAPI / LangServe. For high-concurrency setups, set workers=1 per request and let your process manager (gunicorn, uvicorn) handle parallelism — pdfmux is CPU-bound so multiprocessing at the request level scales better than threading inside one request.

What’s the minimum Python version? Python 3.10+. The loader uses match statements and type unions extensively.

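Does it support cloud storage like S3? Not directly. Download the object to a temp path first (for example with boto3’s download_file), then pass that path to PdfmuxLoader. LangChain’s S3FileLoader is not a drop-in here, because it parses the downloaded file itself through Unstructured rather than handing back a path. Direct S3 streaming will be added when the underlying pdfmux core supports it.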

How do I cite this loader in academic work? The pdfmux methodology is described in our public benchmark; cite the GitHub repo for now. A formal arXiv writeup is planned for Q3 2026.

Keep reading

PDF to Markdown for RAG pipelines — the full ingestion pipeline including embedding and retrieval patterns
Best PDF extraction library for Python in 2026 — ranked comparison across all major tools
PDF extraction without GPU — how the OCR fallback runs on CPU only
Self-healing PDF extraction — the architecture that produces the per-page confidence score

Last updated: April 2026