Every PDF. Best result.
Your keys.

Universal orchestrator for PDF extraction. 5 rule-based backends + BYOK LLM fallback. Routes each page to the best extractor, audits the output, re-extracts failures. Zero config, zero lock-in.

get started → MIT licensed · v1.4.0
20+ stars · 500+ downloads/mo · Python 3.11 / 3.12 / 3.13 · MIT license
pip install pdfmux
[terminal demo: PDF extraction with confidence scoring and automatic OCR fallback]

Most RAG pipelines fail before they reach the model

The problem is not the LLM. The problem is document ingestion. Broken column ordering, missing pages, OCR failures, flattened tables, slide decks returning empty text. Most tools extract once and hope for the best.

typical single-tool output
Page 1: (ok)
Page 2: (ok)
Page 3: 
Page 4: Amoun  Dscriptin
        $450   Consltng    Widgt
        $1200  Setp
Page 5: (ok)
No quality info. No way to know
which pages are broken.
pdfmux output
Page 1: good 0.98
Page 2: good 0.96
Page 3: bad → OCR'd 0.91
Page 4: bad → OCR'd 0.87
Page 5: good 0.97
✓ 5 pages, 94% avg confidence
  2 re-extracted with OCR
- 200+ real-world PDFs benchmarked
- #2 reading-order accuracy
- 5 extraction backends
- $0 cost per page

How it works

pdfmux doesn't extract once and hope. It runs a self-healing pipeline that audits every page, detects failures, and re-extracts them automatically.

1. Extract — PyMuPDF on every page (instant)
2. Audit — 5 quality checks per page: good / bad / empty
3. Region OCR — surgical OCR on image regions in bad pages
4. Full OCR — re-extract remaining empty pages completely
5. Merge — combine good + fixed pages in order
6. Headings — font-size analysis detects h1/h2/h3 from relative sizes & bold weight
7. Structure — extract tables, key-values, normalize dates & amounts (JSON output)
All pages good? Zero OCR overhead. You only pay for what's broken. Structured extraction runs automatically on JSON output — pure rule-based, zero LLM cost.
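The extract → audit → re-extract loop above can be sketched in a few lines. The functions here are toy stand-ins, not the pdfmux API: the audit is a single garbled-character ratio rather than pdfmux's five checks, and the OCR step is simulated.

```python
# Minimal sketch of the self-healing loop: cheap pass first,
# audit every page, re-extract only the ones that fail.

def cheap_extract(page: str) -> str:
    # stand-in for the fast PyMuPDF pass
    return page

def audit(text: str) -> float:
    # toy confidence score: ratio of clean characters to total
    if not text.strip():
        return 0.0
    clean = sum(c.isalnum() or c.isspace() for c in text)
    return clean / len(text)

def ocr_extract(page: str) -> str:
    # stand-in for the OCR fallback (pretend it repairs garbling)
    return page.replace("\ufffd", "e")

def pipeline(pages: list[str], threshold: float = 0.9) -> list[tuple[str, float]]:
    results = []
    for page in pages:
        text = cheap_extract(page)
        score = audit(text)
        if score < threshold:          # bad page → re-extract
            text = ocr_extract(page)
            score = audit(text)
        results.append((text, score))
    return results
```

Note the cost profile this structure gives you: clean pages take only the cheap pass, so the expensive path runs exactly as often as pages fail the audit.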

Three lines of Python

python
import pdfmux

# extract as markdown — auto-audits every page
text = pdfmux.extract_text("report.pdf")

# structured json with locked schema
data = pdfmux.extract_json("report.pdf")

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# → [{title, text, page_start, page_end, tokens, confidence}]
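Because every chunk carries a confidence score, a downstream pipeline can gate what reaches the vector store. A sketch assuming the chunk shape shown above (only a subset of keys used):

```python
# Gate chunks on confidence before indexing; flag the rest for review.
# The chunk dicts here are hand-written stand-ins for load_llm_context output.
chunks = [
    {"title": "Intro",   "text": "...", "tokens": 120, "confidence": 0.97},
    {"title": "Scanned", "text": "...", "tokens": 90,  "confidence": 0.62},
]

trusted = [c for c in chunks if c["confidence"] >= 0.8]
flagged = [c["title"] for c in chunks if c["confidence"] < 0.8]
```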

Built for LLM pipelines

pdfmux outputs structured content designed for RAG pipelines, vector databases, agent workflows, and knowledge retrieval systems.

Structured extraction

Extract tables, key-value pairs, and map them to your own JSON schema — all rule-based, zero LLM cost. Dates, currencies, and rates are normalized automatically.

bash
# structured json with tables + key-values
$ pdfmux convert invoice.pdf --format json --stdout
# outputs:
{
  "schema_version": "1.2.0",
  "tables": [{
    "headers": ["Description", "Amount"],
    "rows": [["Consulting", "$450.00"],
             ["Setup fee", "$1,200.00"]]
  }],
  "key_values": [
    {"key": "Invoice Date", "value": "2026-02-28"},
    {"key": "Total Due", "value": {"amount": 1650.00, "currency": "USD"}}
  ]
}
bash
# schema-guided — map to your own JSON schema
$ pdfmux convert statement.pdf --format json --schema bank.schema.json
# outputs data mapped to your schema fields via fuzzy matching:
{
  "account_number": "1234-5678-9012",
  "statement_date": "2026-02-28",
  "closing_balance": {"amount": 12450.75, "currency": "AED"},
  "transactions": [
    {"date": "2026-02-01", "description": "ADNOC Fuel", "amount": 185.50}
  ]
}

Command line

bash
# convert — auto-audits every page
$ pdfmux invoice.pdf
✓ invoice.md (2 pages, 95% confidence)

# image-heavy pdf — bad pages re-extracted with OCR
$ pdfmux pitch-deck.pdf
✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

# quality presets — trade speed for accuracy
$ pdfmux convert report.pdf -q fast      # PyMuPDF only, ~0.01s/page
$ pdfmux convert report.pdf -q standard  # auto-audit + OCR fallback (default)
$ pdfmux convert report.pdf --mode premium  # BYOK LLM for scans & complex layouts

# quick triage — per-page quality without extraction
$ pdfmux analyze report.pdf

# structured json output
$ pdfmux convert report.pdf --format json

# LLM-ready format with token estimates
$ pdfmux convert report.pdf --format llm

# start MCP server for AI agents
$ pdfmux serve

Quality presets

One flag controls the speed/accuracy tradeoff. Default is standard — which handles 90% of real-world PDFs with zero cost.

| Preset | Best for | Speed | Cost | How it works |
|---|---|---|---|---|
| -q fast | Digital reports, clean text | 0.01s/pg | Free | PyMuPDF only, no auditing |
| -q standard | Mixed docs, tables, OCR pages | 0.3–3s/pg | Free | Auto-audit + OCR fallback + Docling tables |
| --mode premium | Scanned docs, complex layouts, handwriting | 2–5s/pg | ~$0.01/pg | BYOK LLM (Claude, Gemini, GPT-4o, Ollama) |

Scanned or handwritten PDFs? Use --mode premium with any LLM provider. The base install handles 90% of digital PDFs; your LLM of choice handles the rest — stamps, handwriting, mixed layouts, low-contrast scans.
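If you triage documents yourself before invoking pdfmux, a preset chooser can encode the table above. The signals and thresholds here are illustrative, not pdfmux defaults:

```python
# Toy preset chooser: map quick document signals to a pdfmux preset.
# scanned_ratio = fraction of pages with no text layer (hypothetical signal).

def choose_preset(scanned_ratio: float, has_tables: bool) -> str:
    if scanned_ratio > 0.5:
        return "premium"    # mostly scans → BYOK LLM
    if has_tables or scanned_ratio > 0.0:
        return "standard"   # mixed content → audit + OCR fallback
    return "fast"           # clean digital text, skip auditing
```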

Extractors

pdfmux picks the best extractor per page automatically. Install more backends for better results.

| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01s/pg | base |
| OpenDataLoader | Complex layouts, reading order | 0.05s/pg | pdfmux[opendataloader] |
| RapidOCR | Scanned / images | 0.5–2s/pg | pdfmux[ocr] |
| Docling | Tables | 0.3–3s/pg | pdfmux[tables] |
| BYOK LLM | Hardest cases (scans, handwriting) | 2–5s/pg | pdfmux[llm] / [llm-claude] / [llm-openai] |
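Per-page routing amounts to a decision over page signals. A toy router in that spirit, where the backend names mirror the table above but the rules and page fields are illustrative:

```python
# Toy per-page router: choose a backend from simple page signals.
# The signal dict (has_text_layer, has_tables, is_scanned) is hypothetical.

def route(page: dict) -> str:
    if page["has_text_layer"] and not page["has_tables"]:
        return "pymupdf"            # digital text: fastest path
    if page["has_tables"]:
        return "docling"            # table-heavy pages
    if page["is_scanned"]:
        return "rapidocr"           # image-only pages need OCR
    return "opendataloader"         # complex layouts / reading order
```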

Expected accuracy across document types

For ingestion systems like pdfmux, what matters most is semantic chunk accuracy — correct text in the right order with reliable boundaries for RAG.

| Document Type | Text Extraction | Layout Recovery | Table Extraction |
|---|---|---|---|
| Simple text PDFs | 99–100% | 95–98% | N/A |
| Academic papers | 97–99% | 90–95% | 80–90% |
| Business reports | 96–98% | 90–94% | 75–88% |
| Slide decks | 95–98% | 88–92% | 60–75% |
| Financial filings | 95–97% | 85–92% | 70–85% |
| Scanned PDFs | 85–95% | 75–88% | 60–75% |
| Legal contracts | 97–99% | 92–96% | 80–90% |
| Forms / gov docs | 90–96% | 80–90% | 65–80% |

Aggregate across a mixed dataset:

| Metric | Expected Range |
|---|---|
| Text extraction accuracy | 96–99% |
| Layout recovery accuracy | 88–95% |
| Table extraction accuracy | 70–88% |
| OCR document accuracy | 85–94% |

Benchmark results

Tested on opendataloader-bench — 200 real-world PDFs across academic papers, financial filings, legal contracts, and scanned documents.

- #2 overall ranking (0.900 composite score)
- #1 free-tool ranking (beats docling, marker, mineru & more)
- $0 cost per page (no AI calls, no GPU)
| Tool | Overall | Reading Order | Tables | Cost |
|---|---|---|---|---|
| hybrid (AI-assisted) | 0.909 | 0.935 | 0.928 | paid API |
| pdfmux | 0.900 | 0.918 | 0.887 | free |
| docling | 0.877 | 0.900 | 0.887 | local ML |
| marker | 0.861 | 0.890 | 0.808 | local ML |
| opendataloader local | 0.844 | 0.913 | 0.494 | local |
| mineru | 0.831 | 0.857 | 0.873 | local ML |

#2 overall, #1 among free tools. pdfmux now beats docling (+2.3%), marker (+3.9%), and every other open-source extractor. It achieves 99% of the paid #1 score at zero cost per page — with the best heading detection of any engine, paid or free.

Built-in

🔄
Self-healing pipeline
Bad pages detected and re-extracted automatically. Zero manual intervention.
📊
Confidence scoring
5 quality checks per page. Know exactly which pages to trust.
⚙️
5 extraction backends
PyMuPDF, OpenDataLoader, RapidOCR, Docling + BYOK LLM. Best extractor per page.
🧭
Per-page routing
Each page gets the right extractor based on content type and quality.
🤖
MCP server
Give AI agents PDF reading ability. Three tools, one command to start.
📋
Structured extraction
Tables, key-values, and schema mapping. Zero LLM cost — pure rule-based.
📑
Heading detection
Font-size analysis maps headings to h1/h2/h3 automatically. Handles bold-at-same-size variants.
🔍
Value normalization
Dates, currencies, and rates parsed automatically. "28 Feb 2026" → 2026-02-28.
🔓
MIT licensed
Open source with a frozen API. Your code won't break on updates.

MCP server

Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.

{
  "mcpServers": {
    "pdfmux": { "command": "pdfmux", "args": ["serve"] }
  }
}

Works with

LangChain LlamaIndex Claude Cursor Claude Code Gemini

LangChain & LlamaIndex

python
# LangChain
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()  # → list[Document]

# LlamaIndex
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")  # → list[Document]

Optional extras

bash
# add OCR for scanned pages (~200MB, CPU-only)
$ pip install "pdfmux[ocr]"

# add table extraction
$ pip install "pdfmux[tables]"

# add LLM vision for complex layouts (Gemini, Claude, GPT-4o, Ollama)
$ pip install "pdfmux[llm]"

# add LangChain or LlamaIndex loader
$ pip install "pdfmux[langchain]"
$ pip install "pdfmux[llamaindex]"

# install everything
$ pip install "pdfmux[all]"

What pdfmux is not

Frequently asked questions

How is pdfmux different from just using PyMuPDF?
PyMuPDF is one of pdfmux's backends. pdfmux adds a quality audit on top: it scores every page, detects failures (blank output, mojibake, broken columns), and automatically re-extracts bad pages with a better extractor. PyMuPDF alone gives you text. pdfmux gives you reliable text.
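As a flavor of what one such failure check can look like, here is a toy mojibake detector. The character set and threshold are illustrative, not pdfmux's actual audit:

```python
# Toy garbled-text check: flag blank pages and pages with too many
# replacement / mojibake characters. Threshold is illustrative.

def looks_garbled(text: str, max_bad_ratio: float = 0.05) -> bool:
    if not text:
        return True                  # blank output counts as a failure
    bad = text.count("\ufffd") + text.count("\u00c3")  # U+FFFD, common UTF-8 mojibake lead byte
    return bad / len(text) > max_bad_ratio
```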
Does it work offline?
Yes. The base install plus the OCR and tables extras all run locally with zero network calls. LLM providers (Gemini, Claude, GPT-4o) require API keys and internet. Ollama runs fully local. You can run an air-gapped pipeline with pip install "pdfmux[ocr,tables,llm-ollama]".
What about scanned PDFs, stamps, or handwriting?
Use --mode premium with any LLM provider: pip install "pdfmux[llm-claude]" for Claude, pdfmux[llm] for Gemini, or pdfmux[llm-openai] for GPT-4o. pdfmux routes scanned/handwritten pages to your LLM automatically. The base install is CPU-only and handles 90% of digital PDFs without any API key.
Is it production ready?
v1.4.0 is production stable with a frozen API. 487 tests passing. Agentic multi-pass extraction, cost-aware routing, budget caps, and configurable timeouts make it safe for processing untrusted documents at scale.
How do I add a new extractor?
Implement the BaseExtractor interface with an extract() method and register it. The router will include it in per-page quality comparisons automatically. See the docs for the full extractor development guide.
Can I use it with LangChain or LlamaIndex?
Built-in. Install pdfmux[langchain] or pdfmux[llamaindex] and use the native loader classes. They return standard Document objects with confidence metadata attached.
