Universal orchestrator for PDF extraction. 5 rule-based backends + BYOK LLM fallback. Routes each page to the best extractor, audits the output, re-extracts failures. Zero config, zero lock-in.
```shell
pip install pdfmux
```
The problem is not the LLM. The problem is document ingestion. Broken column ordering, missing pages, OCR failures, tables flattened, slide decks returning empty text. Most tools extract once and hope for the best.
```
Page 1: (ok)
Page 2: (ok)
Page 3:
Page 4: Amoun Dscriptin $450 Consltng Widgt $1200 Setp
Page 5: (ok)
```

No quality info. No way to know which pages are broken.
```
Page 1: good        0.98
Page 2: good        0.96
Page 3: bad → OCR'd 0.91
Page 4: bad → OCR'd 0.87
Page 5: good        0.97

✓ 5 pages, 94% avg confidence, 2 re-extracted with OCR
```
pdfmux doesn't extract once and hope. It runs a self-healing pipeline that audits every page, detects failures, and re-extracts them automatically.
```python
import pdfmux

# extract as markdown — auto-audits every page
text = pdfmux.extract_text("report.pdf")

# structured json with locked schema
data = pdfmux.extract_json("report.pdf")

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# → [{title, text, page_start, page_end, tokens, confidence}]
```
pdfmux outputs structured content designed for RAG pipelines, vector databases, agent workflows, and knowledge retrieval systems.
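As a sketch of how those chunks can feed a retrieval pipeline, the snippet below filters and packs chunks (in the shape `load_llm_context` returns) into a token budget. The `pack_context` helper and the sample data are illustrative, not part of the pdfmux API.

```python
# Pack chunks into a token budget, skipping pages whose extraction
# confidence is too low to trust.
def pack_context(chunks, token_budget=4000, min_confidence=0.85):
    selected, used = [], 0
    for chunk in chunks:
        if chunk["confidence"] < min_confidence:
            continue  # likely a broken page; better dropped than polluting context
        if used + chunk["tokens"] > token_budget:
            break
        selected.append(chunk)
        used += chunk["tokens"]
    return selected

# Illustrative chunks, not real pdfmux output:
chunks = [
    {"title": "Summary", "text": "...", "page_start": 1, "page_end": 2,
     "tokens": 1200, "confidence": 0.97},
    {"title": "Appendix (scan)", "text": "...", "page_start": 9, "page_end": 12,
     "tokens": 3000, "confidence": 0.62},
]
context = pack_context(chunks)  # keeps only the high-confidence chunk
```

The confidence field comes back on every chunk, so thresholds like this can be tuned per corpus.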
Extract tables, key-value pairs, and map them to your own JSON schema — all rule-based, zero LLM cost. Dates, currencies, and rates are normalized automatically.
```shell
# structured json with tables + key-values
$ pdfmux convert invoice.pdf --format json --stdout
{
  "schema_version": "1.2.0",
  "tables": [{
    "headers": ["Description", "Amount"],
    "rows": [["Consulting", "$450.00"], ["Setup fee", "$1,200.00"]]
  }],
  "key_values": [
    {"key": "Invoice Date", "value": "2026-02-28"},
    {"key": "Total Due", "value": {"amount": 1650.00, "currency": "USD"}}
  ]
}
```
```shell
# schema-guided — map to your own JSON schema
$ pdfmux convert statement.pdf --format json --schema bank.schema.json
# outputs data mapped to your schema fields via fuzzy matching:
{
  "account_number": "1234-5678-9012",
  "statement_date": "2026-02-28",
  "closing_balance": {"amount": 12450.75, "currency": "AED"},
  "transactions": [
    {"date": "2026-02-01", "description": "ADNOC Fuel", "amount": 185.50}
  ]
}
```
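For reference, a minimal `bank.schema.json` matching the fields above might look like the sketch below. This follows standard JSON Schema conventions; which keywords pdfmux actually honors is an assumption here, so check the docs for the supported schema format.

```json
{
  "type": "object",
  "properties": {
    "account_number": {"type": "string"},
    "statement_date": {"type": "string", "format": "date"},
    "closing_balance": {
      "type": "object",
      "properties": {
        "amount": {"type": "number"},
        "currency": {"type": "string"}
      }
    },
    "transactions": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": {"type": "string", "format": "date"},
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  }
}
```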
```shell
# convert — auto-audits every page
$ pdfmux invoice.pdf
✓ invoice.md (2 pages, 95% confidence)

# image-heavy pdf — bad pages re-extracted with OCR
$ pdfmux pitch-deck.pdf
✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

# quality presets — trade speed for accuracy
$ pdfmux convert report.pdf -q fast         # PyMuPDF only, ~0.01s/page
$ pdfmux convert report.pdf -q standard     # auto-audit + OCR fallback (default)
$ pdfmux convert report.pdf --mode premium  # BYOK LLM for scans & complex layouts

# quick triage — per-page quality without extraction
$ pdfmux analyze report.pdf

# structured json output
$ pdfmux convert report.pdf --format json

# LLM-ready format with token estimates
$ pdfmux convert report.pdf --format llm

# start MCP server for AI agents
$ pdfmux serve
```
One flag controls the speed/accuracy tradeoff. The default, standard, handles 90% of real-world PDFs at zero cost.
| Preset | Best for | Speed | Cost | How it works |
|---|---|---|---|---|
| -q fast | Digital reports, clean text | 0.01s/pg | Free | PyMuPDF only, no auditing |
| -q standard | Mixed docs, tables, OCR pages | 0.3-3s/pg | Free | Auto-audit + OCR fallback + Docling tables |
| --mode premium | Scanned docs, complex layouts, handwriting | 2-5s/pg | ~$0.01/pg | BYOK LLM (Claude, Gemini, GPT-4o, Ollama) |
Scanned or handwritten PDFs? Use --mode premium with any LLM provider. The base install handles 90% of digital PDFs; your LLM of choice handles the rest — stamps, handwriting, mixed layouts, low-contrast scans.
pdfmux picks the best extractor per page automatically. Install more backends for better results.
| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01s/pg | base |
| OpenDataLoader | Complex layouts, reading order | 0.05s/pg | pdfmux[opendataloader] |
| RapidOCR | Scanned / images | 0.5-2s/pg | pdfmux[ocr] |
| Docling | Tables | 0.3-3s/pg | pdfmux[tables] |
| BYOK LLM | Hardest cases (scans, handwriting) | 2-5s/pg | pdfmux[llm] / [llm-claude] / [llm-openai] |
For ingestion systems like pdfmux, what matters most is semantic chunk accuracy — correct text in the right order with reliable boundaries for RAG.
| Document Type | Text Extraction | Layout Recovery | Table Extraction |
|---|---|---|---|
| Simple text PDFs | 99–100% | 95–98% | N/A |
| Academic papers | 97–99% | 90–95% | 80–90% |
| Business reports | 96–98% | 90–94% | 75–88% |
| Slide decks | 95–98% | 88–92% | 60–75% |
| Financial filings | 95–97% | 85–92% | 70–85% |
| Scanned PDFs | 85–95% | 75–88% | 60–75% |
| Legal contracts | 97–99% | 92–96% | 80–90% |
| Forms / gov docs | 90–96% | 80–90% | 65–80% |
Aggregate across a mixed dataset:
| Metric | Expected Range |
|---|---|
| Text extraction accuracy | 96–99% |
| Layout recovery accuracy | 88–95% |
| Table extraction accuracy | 70–88% |
| OCR document accuracy | 85–94% |
Tested on opendataloader-bench — 200 real-world PDFs across academic papers, financial filings, legal contracts, and scanned documents.
| Tool | Overall | Reading Order | Tables | Cost |
|---|---|---|---|---|
| hybrid (AI-assisted) | 0.909 | 0.935 | 0.928 | paid API |
| pdfmux | 0.900 | 0.918 | 0.887 | free |
| docling | 0.877 | 0.900 | 0.887 | local ML |
| marker | 0.861 | 0.890 | 0.808 | local ML |
| opendataloader local | 0.844 | 0.913 | 0.494 | local |
| mineru | 0.831 | 0.857 | 0.873 | local ML |
#2 overall, #1 among free tools. pdfmux now beats docling (+2.3%), marker (+3.9%), and every other open-source extractor. It achieves 99% of the paid #1 score at zero cost per page — with the best heading detection of any engine, paid or free.
Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.
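To hook the server into an MCP-capable client, point the client at the `pdfmux serve` command. A typical entry follows the common `mcpServers` convention shown below; the exact config file and keys depend on your client, so treat this as a sketch.

```json
{
  "mcpServers": {
    "pdfmux": {
      "command": "pdfmux",
      "args": ["serve"]
    }
  }
}
```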
```python
# LangChain
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()  # → list[Document]

# LlamaIndex
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")  # → list[Document]
```
```shell
# add OCR for scanned pages (~200MB, CPU-only)
$ pip install "pdfmux[ocr]"

# add table extraction
$ pip install "pdfmux[tables]"

# add LLM vision for complex layouts (Gemini, Claude, GPT-4o, Ollama)
$ pip install "pdfmux[llm]"

# add LangChain or LlamaIndex loader
$ pip install "pdfmux[langchain]"
$ pip install "pdfmux[llamaindex]"

# install everything
$ pip install "pdfmux[all]"
```
- **Install:** pip install pdfmux. To pull in several extras at once: pip install "pdfmux[ocr,tables,llm-ollama]".
- **Scanned or handwritten documents:** use --mode premium with any LLM provider: pip install "pdfmux[llm-claude]" for Claude, pdfmux[llm] for Gemini, or pdfmux[llm-openai] for GPT-4o. pdfmux routes scanned/handwritten pages to your LLM automatically. The base install is CPU-only and handles 90% of digital PDFs without any API key.
- **Custom backends:** implement the BaseExtractor interface with an extract() method and register it. The router will include it in per-page quality comparisons automatically. See the docs for the full extractor development guide.
- **Framework loaders:** install pdfmux[langchain] or pdfmux[llamaindex] and use the native loader classes. They return standard Document objects with confidence metadata attached.
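To illustrate the custom-backend path, here is a minimal sketch. pdfmux's real BaseExtractor lives in the package, with its own import path and registration call (see the extractor development guide); the stand-in class below only mirrors the described interface, a single extract() method, so the example runs on its own.

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    """Stand-in mirroring the described interface; NOT pdfmux's actual class."""
    @abstractmethod
    def extract(self, page) -> str:
        ...

class UppercaseExtractor(BaseExtractor):
    """Toy backend: returns the page's raw text upper-cased."""
    def extract(self, page) -> str:
        return page["raw_text"].upper()

text = UppercaseExtractor().extract({"raw_text": "hello world"})
# → "HELLO WORLD"
```

A real backend would parse page bytes rather than a dict, but the shape is the same: one class, one extract() method, then registration so the router can score it against the other extractors.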