Every PDF. Best result.
Your keys.

Universal orchestrator for PDF extraction. 5 rule-based backends + BYOK LLM fallback. Routes each page to the best extractor, audits the output, re-extracts failures. Zero config, zero lock-in.

get started → MIT licensed · v1.4.0
20+ stars · 500+ downloads/mo · Python 3.11 / 3.12 / 3.13 · MIT license
pip install pdfmux
[terminal demo: PDF extraction with confidence scoring and automatic OCR fallback]

Most RAG pipelines fail before they reach the model

The problem is not the LLM. The problem is document ingestion. Broken column ordering, missing pages, OCR failures, flattened tables, slide decks returning empty text. Most tools extract once and hope for the best.

typical single-tool output
Page 1: (ok)
Page 2: (ok)
Page 3: 
Page 4: Amoun  Dscriptin
        $450   Consltng    Widgt
        $1200  Setp
Page 5: (ok)
No quality info. No way to know
which pages are broken.
pdfmux output
Page 1: good 0.98
Page 2: good 0.96
Page 3: bad → OCR'd 0.91
Page 4: bad → OCR'd 0.87
Page 5: good 0.97
✓ 5 pages, 94% avg confidence
  2 re-extracted with OCR
- 200+ real-world PDFs benchmarked
- #2 reading-order accuracy
- 5 extraction backends
- $0 cost per page

How it works

pdfmux doesn't extract once and hope. It runs a self-healing pipeline that audits every page, detects failures, and re-extracts them automatically.

1. Extract — PyMuPDF on every page (instant)
2. Audit — 5 quality checks per page: good / bad / empty
3. Region OCR — surgical OCR on image regions in bad pages
4. Full OCR — re-extract remaining empty pages completely
5. Merge — combine good + fixed pages in order
6. Headings — font-size analysis detects h1/h2/h3 from relative sizes & bold weight
7. Structure — extract tables, key-values, normalize dates & amounts (JSON output)
All pages good? Zero OCR overhead. You only pay for what's broken. Structured extraction runs automatically on JSON output — pure rule-based, zero LLM cost.
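The extract → audit → re-extract loop above can be sketched in a few lines. The functions here are toy stand-ins, not the pdfmux API: the audit is a single garbled-character ratio rather than pdfmux's five checks, and the OCR step is simulated.

```python
# Minimal sketch of the self-healing loop: cheap pass first,
# audit every page, re-extract only the ones that fail.

def cheap_extract(page: str) -> str:
    # stand-in for the fast PyMuPDF pass
    return page

def audit(text: str) -> float:
    # toy confidence score: ratio of clean characters to total
    if not text.strip():
        return 0.0
    clean = sum(c.isalnum() or c.isspace() for c in text)
    return clean / len(text)

def ocr_extract(page: str) -> str:
    # stand-in for the OCR fallback (pretend it repairs garbling)
    return page.replace("\ufffd", "e")

def pipeline(pages: list[str], threshold: float = 0.9) -> list[tuple[str, float]]:
    results = []
    for page in pages:
        text = cheap_extract(page)
        score = audit(text)
        if score < threshold:          # bad page → re-extract
            text = ocr_extract(page)
            score = audit(text)
        results.append((text, score))
    return results
```

Note the cost profile this structure gives you: clean pages take only the cheap pass, so the expensive path runs exactly as often as pages fail the audit.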

Three lines of Python

python
import pdfmux

# extract as markdown — auto-audits every page
text = pdfmux.extract_text("report.pdf")

# structured json with locked schema
data = pdfmux.extract_json("report.pdf")

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# → [{title, text, page_start, page_end, tokens, confidence}]
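Because every chunk carries a confidence score, a downstream pipeline can gate what reaches the vector store. A sketch assuming the chunk shape shown above (only a subset of keys used):

```python
# Gate chunks on confidence before indexing; flag the rest for review.
# The chunk dicts here are hand-written stand-ins for load_llm_context output.
chunks = [
    {"title": "Intro",   "text": "...", "tokens": 120, "confidence": 0.97},
    {"title": "Scanned", "text": "...", "tokens": 90,  "confidence": 0.62},
]

trusted = [c for c in chunks if c["confidence"] >= 0.8]
flagged = [c["title"] for c in chunks if c["confidence"] < 0.8]
```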

Built for LLM pipelines

pdfmux outputs structured content designed for RAG pipelines, vector databases, agent workflows, and knowledge retrieval systems.

Structured extraction

Extract tables, key-value pairs, and map them to your own JSON schema — all rule-based, zero LLM cost. Dates, currencies, and rates are normalized automatically.

bash
# structured json with tables + key-values
$ pdfmux convert invoice.pdf --format json --stdout
# outputs:
{
  "schema_version": "1.2.0",
  "tables": [{
    "headers": ["Description", "Amount"],
    "rows": [["Consulting", "$450.00"],
             ["Setup fee", "$1,200.00"]]
  }],
  "key_values": [
    {"key": "Invoice Date", "value": "2026-02-28"},
    {"key": "Total Due", "value": {"amount": 1650.00, "currency": "USD"}}
  ]
}
bash
# schema-guided — map to your own JSON schema
$ pdfmux convert statement.pdf --format json --schema bank.schema.json
# outputs data mapped to your schema fields via fuzzy matching:
{
  "account_number": "1234-5678-9012",
  "statement_date": "2026-02-28",
  "closing_balance": {"amount": 12450.75, "currency": "AED"},
  "transactions": [
    {"date": "2026-02-01", "description": "ADNOC Fuel", "amount": 185.50}
  ]
}

Command line

bash
# convert — auto-audits every page
$ pdfmux invoice.pdf
✓ invoice.md (2 pages, 95% confidence)

# image-heavy pdf — bad pages re-extracted with OCR
$ pdfmux pitch-deck.pdf
✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

# quality presets — trade speed for accuracy
$ pdfmux convert report.pdf -q fast      # PyMuPDF only, ~0.01s/page
$ pdfmux convert report.pdf -q standard  # auto-audit + OCR fallback (default)
$ pdfmux convert report.pdf --mode premium  # BYOK LLM for scans & complex layouts

# quick triage — per-page quality without extraction
$ pdfmux analyze report.pdf

# structured json output
$ pdfmux convert report.pdf --format json

# LLM-ready format with token estimates
$ pdfmux convert report.pdf --format llm

# start MCP server for AI agents
$ pdfmux serve

Quality presets

One flag controls the speed/accuracy tradeoff. Default is standard — which handles 90% of real-world PDFs with zero cost.

| Preset | Best for | Speed | Cost | How it works |
|---|---|---|---|---|
| -q fast | Digital reports, clean text | 0.01s/pg | Free | PyMuPDF only, no auditing |
| -q standard | Mixed docs, tables, OCR pages | 0.3–3s/pg | Free | Auto-audit + OCR fallback + Docling tables |
| --mode premium | Scanned docs, complex layouts, handwriting | 2–5s/pg | ~$0.01/pg | BYOK LLM (Claude, Gemini, GPT-4o, Ollama) |

Scanned or handwritten PDFs? Use --mode premium with any LLM provider. The base install handles 90% of digital PDFs; your LLM of choice handles the rest — stamps, handwriting, mixed layouts, low-contrast scans.
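If you triage documents yourself before invoking pdfmux, a preset chooser can encode the table above. The signals and thresholds here are illustrative, not pdfmux defaults:

```python
# Toy preset chooser: map quick document signals to a pdfmux preset.
# scanned_ratio = fraction of pages with no text layer (hypothetical signal).

def choose_preset(scanned_ratio: float, has_tables: bool) -> str:
    if scanned_ratio > 0.5:
        return "premium"    # mostly scans → BYOK LLM
    if has_tables or scanned_ratio > 0.0:
        return "standard"   # mixed content → audit + OCR fallback
    return "fast"           # clean digital text, skip auditing
```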

Extractors

pdfmux picks the best extractor per page automatically. Install more backends for better results.

| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01s/pg | base |
| OpenDataLoader | Complex layouts, reading order | 0.05s/pg | pdfmux[opendataloader] |
| RapidOCR | Scanned / images | 0.5–2s/pg | pdfmux[ocr] |
| Docling | Tables | 0.3–3s/pg | pdfmux[tables] |
| BYOK LLM | Hardest cases (scans, handwriting) | 2–5s/pg | pdfmux[llm] / [llm-claude] / [llm-openai] |
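Per-page routing amounts to a decision over page signals. A toy router in that spirit, where the backend names mirror the table above but the rules and page fields are illustrative:

```python
# Toy per-page router: choose a backend from simple page signals.
# The signal dict (has_text_layer, has_tables, is_scanned) is hypothetical.

def route(page: dict) -> str:
    if page["has_text_layer"] and not page["has_tables"]:
        return "pymupdf"            # digital text: fastest path
    if page["has_tables"]:
        return "docling"            # table-heavy pages
    if page["is_scanned"]:
        return "rapidocr"           # image-only pages need OCR
    return "opendataloader"         # complex layouts / reading order
```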

Expected accuracy across document types

For ingestion systems like pdfmux, what matters most is semantic chunk accuracy — correct text in the right order with reliable boundaries for RAG.

| Document Type | Text Extraction | Layout Recovery | Table Extraction |
|---|---|---|---|
| Simple text PDFs | 99–100% | 95–98% | N/A |
| Academic papers | 97–99% | 90–95% | 80–90% |
| Business reports | 96–98% | 90–94% | 75–88% |
| Slide decks | 95–98% | 88–92% | 60–75% |
| Financial filings | 95–97% | 85–92% | 70–85% |
| Scanned PDFs | 85–95% | 75–88% | 60–75% |
| Legal contracts | 97–99% | 92–96% | 80–90% |
| Forms / gov docs | 90–96% | 80–90% | 65–80% |

Aggregate across a mixed dataset:

| Metric | Expected Range |
|---|---|
| Text extraction accuracy | 96–99% |
| Layout recovery accuracy | 88–95% |
| Table extraction accuracy | 70–88% |
| OCR document accuracy | 85–94% |

Benchmark results

Tested on opendataloader-bench — 200 real-world PDFs across academic papers, financial filings, legal contracts, and scanned documents.

- #2 overall ranking (0.900 composite score)
- #1 free-tool ranking (beats docling, marker, mineru & more)
- $0 cost per page (no AI calls, no GPU)
| Tool | Overall | Reading Order | Tables | Cost |
|---|---|---|---|---|
| hybrid (AI-assisted) | 0.909 | 0.935 | 0.928 | paid API |
| pdfmux | 0.900 | 0.918 | 0.887 | free |
| docling | 0.877 | 0.900 | 0.887 | local ML |
| marker | 0.861 | 0.890 | 0.808 | local ML |
| opendataloader local | 0.844 | 0.913 | 0.494 | local |
| mineru | 0.831 | 0.857 | 0.873 | local ML |

#2 overall, #1 among free tools. pdfmux now beats docling (+2.3%), marker (+3.9%), and every other open-source extractor. It achieves 99% of the paid #1 score at zero cost per page — with the best heading detection of any engine, paid or free.

Built-in

🔄
Self-healing pipeline
Bad pages detected and re-extracted automatically. Zero manual intervention.
📊
Confidence scoring
5 quality checks per page. Know exactly which pages to trust.
⚙️
5 extraction backends
PyMuPDF, OpenDataLoader, RapidOCR, Docling + BYOK LLM. Best extractor per page.
🧭
Per-page routing
Each page gets the right extractor based on content type and quality.
🤖
MCP server
Give AI agents PDF reading ability. Three tools, one command to start.
📋
Structured extraction
Tables, key-values, and schema mapping. Zero LLM cost — pure rule-based.
📑
Heading detection
Font-size analysis maps headings to h1/h2/h3 automatically. Handles bold-at-same-size variants.
🔍
Value normalization
Dates, currencies, and rates parsed automatically. "28 Feb 2026" → 2026-02-28.
🔓
MIT licensed
Open source with a frozen API. Your code won't break on updates.

MCP server

Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.

{
  "mcpServers": {
    "pdfmux": { "command": "pdfmux", "args": ["serve"] }
  }
}

Works with

LangChain LlamaIndex Claude Cursor Claude Code Gemini

LangChain & LlamaIndex

python
# LangChain
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()  # → list[Document]

# LlamaIndex
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")  # → list[Document]

Optional extras

bash
# add OCR for scanned pages (~200MB, CPU-only)
$ pip install "pdfmux[ocr]"

# add table extraction
$ pip install "pdfmux[tables]"

# add LLM vision for complex layouts (Gemini, Claude, GPT-4o, Ollama)
$ pip install "pdfmux[llm]"

# add LangChain or LlamaIndex loader
$ pip install "pdfmux[langchain]"
$ pip install "pdfmux[llamaindex]"

# install everything
$ pip install "pdfmux[all]"

What pdfmux is not

Frequently asked questions

How is pdfmux different from just using PyMuPDF?
PyMuPDF is one of pdfmux's backends. pdfmux adds a quality audit on top: it scores every page, detects failures (blank output, mojibake, broken columns), and automatically re-extracts bad pages with a better extractor. PyMuPDF alone gives you text. pdfmux gives you reliable text.
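As a flavor of what one such failure check can look like, here is a toy mojibake detector. The character set and threshold are illustrative, not pdfmux's actual audit:

```python
# Toy garbled-text check: flag blank pages and pages with too many
# replacement / mojibake characters. Threshold is illustrative.

def looks_garbled(text: str, max_bad_ratio: float = 0.05) -> bool:
    if not text:
        return True                  # blank output counts as a failure
    bad = text.count("\ufffd") + text.count("\u00c3")  # U+FFFD, common UTF-8 mojibake lead byte
    return bad / len(text) > max_bad_ratio
```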
Does it work offline?
Yes. The base install plus the OCR and tables extras all run locally with zero network calls. LLM providers (Gemini, Claude, GPT-4o) require API keys and internet. Ollama runs fully local. You can run an air-gapped pipeline with pip install "pdfmux[ocr,tables,llm-ollama]".
What about scanned PDFs, stamps, or handwriting?
Use --mode premium with any LLM provider: pip install "pdfmux[llm-claude]" for Claude, pdfmux[llm] for Gemini, or pdfmux[llm-openai] for GPT-4o. pdfmux routes scanned/handwritten pages to your LLM automatically. The base install is CPU-only and handles 90% of digital PDFs without any API key.
Is it production ready?
v1.4.0 is production stable with a frozen API. 487 tests passing. Agentic multi-pass extraction, cost-aware routing, budget caps, and configurable timeouts make it safe for processing untrusted documents at scale.
How do I add a new extractor?
Implement the BaseExtractor interface with an extract() method and register it. The router will include it in per-page quality comparisons automatically. See the docs for the full extractor development guide.
Can I use it with LangChain or LlamaIndex?
Built-in. Install pdfmux[langchain] or pdfmux[llamaindex] and use the native loader classes. They return standard Document objects with confidence metadata attached.
