Direct answer: No single PDF extraction tool is best at every page type. PyMuPDF is roughly 100x faster than the ML and LLM extractors on digital PDFs. On the opendataloader-bench (200 real PDFs), pdfmux scores 0.905 overall (0.887 TEDS on tables); Docling reaches 0.887 TEDS; Marker reaches 0.808 TEDS; pymupdf4llm reaches 0.612 TEDS. Surya and Mistral OCR handle scans; Gemma 4 27B is the only open-weight extractor with strong Arabic OCR. So I built pdfmux, a self-healing pipeline that routes each page to the right extractor from a pool of seven, scores per-page quality, and re-extracts failures automatically. Install it with: pip install pdfmux

Last updated: 2026-05-04 — refreshed with pdfmux 1.6 router (7 backends), Mistral OCR, Marker, and Gemma 4 results.


The problem

Most RAG pipelines fail before they reach the model. The problem is not the LLM. The problem is document ingestion.

I was building an AI pipeline that needed to ingest PDFs at scale. There are roughly 15 tools that convert PDFs to text, and they all have different tradeoffs:

  • PyMuPDF — blazing fast (0.01s/page) but can’t handle scanned docs
  • Marker — great ML-powered extraction with strong table layout, but heavier than PyMuPDF and slower without a GPU
  • Docling — 97.9% table accuracy but slow on everything else
  • Surya OCR — handles scans but pointless for digital PDFs
  • Mistral OCR — best-in-class accuracy on academic and table-heavy documents, but it is a paid API
  • Gemma 4 27B (vision) — open-weight LLM with native Arabic OCR; runs locally on a 24 GB GPU or via OpenRouter
  • Gemini Flash — catches everything but costs money and is the slowest of the LLM options

The kicker: roughly 90% of pages in real-world corpora are digital — clean, extractable text. You don’t need ML, OCR, or an LLM for those. PyMuPDF does it in 10 milliseconds.

But the other 10%? Those need specialized tools. And you don’t know which 10% until you check each page.

The benchmark

I tested the major tools across four categories on a 200-PDF mix that includes invoices, financial filings, scanned contracts, academic papers, and Arabic logistics documents from the Gulf. (For the head-to-head with full per-tool scores, see the 200-PDF benchmark of pdfmux vs PyMuPDF vs marker vs docling and the 4-way comparison of pdfmux vs LlamaParse vs Docling vs Unstructured.)

Digital PDFs (clean text)

| Tool | Speed (per page) | Accuracy |
| --- | --- | --- |
| PyMuPDF | 0.01s | 98%+ |
| Marker (CPU) | 1.5–3s | 98%+ |
| Marker (GPU) | 0.3–0.8s | 98%+ |
| Docling | 0.3–1s | 95%+ |
| Mistral OCR | 0.4–1s (API) | 99%+ |
| Gemini Flash | 2–5s | 99%+ |

For digital PDFs, PyMuPDF is 30–500x faster than everything else in the table and essentially as accurate. Using anything else is burning time and money.
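To make the speed gap concrete, here is a small calculation that takes the per-page timings in the table above at face value and turns them into slowdown factors relative to PyMuPDF. The numbers are copied from the table; the function and variable names are just for illustration.

```python
# Per-page timings in seconds, copied from the digital-PDF table.
# Each entry is (fastest, slowest).
PYMUPDF_SECONDS = 0.01
OTHER_TOOLS = {
    "Marker (CPU)": (1.5, 3.0),
    "Marker (GPU)": (0.3, 0.8),
    "Docling": (0.3, 1.0),
    "Mistral OCR": (0.4, 1.0),
    "Gemini Flash": (2.0, 5.0),
}


def slowdown_range(low: float, high: float, baseline: float = PYMUPDF_SECONDS) -> tuple:
    """Return how many times slower a tool is than the baseline, as (min, max)."""
    return round(low / baseline), round(high / baseline)


slowdowns = {name: slowdown_range(low, high) for name, (low, high) in OTHER_TOOLS.items()}

for name, (best, worst) in slowdowns.items():
    print(f"{name:<13} {best}x to {worst}x slower than PyMuPDF")

# Smallest and largest factor across all tools in the table.
overall = (
    min(best for best, _ in slowdowns.values()),
    max(worst for _, worst in slowdowns.values()),
)
print(f"Overall: {overall[0]}x to {overall[1]}x")
```

The low end of the range comes from Docling and GPU-backed Marker at their fastest; the high end comes from Gemini Flash at its slowest.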

Table-heavy documents

Scores below are from the opendataloader-bench — 200 real-world PDFs, last re-run on 2026-04-22. Higher TEDS means better cell-level table reconstruction.

| Tool | TEDS (table accuracy) | Preserves structure |
| --- | --- | --- |
| pdfmux (router) | 0.887 | Yes |
| Docling | 0.887 | Yes |
| mineru | 0.873 | Yes |
| Marker | 0.808 | Yes |
| Unstructured (open) | 0.701 | Partial |
| pymupdf4llm | 0.612 | Partial |
| Mistral OCR | not benchmarked here (paid API) | Yes |
| Gemini Flash | not benchmarked here (paid API) | Yes |

Docling and pdfmux tie on free-tier table extraction (pdfmux uses Docling internally for table-heavy pages). Mistral OCR is widely reported to lead on table-heavy documents but isn’t in our open benchmark; we’ll add it when we have a clean run. We compare three open-source approaches in How to extract tables from PDF in Python, with the full ranked list in best PDF extraction library for Python.

Scanned PDFs

| Tool | Works? | Speed (per page) |
| --- | --- | --- |
| PyMuPDF | No (no text to extract) | n/a |
| Surya OCR | Yes | 1–5s (CPU) |
| Marker | Yes (much faster on GPU) | 2–6s (CPU) |
| Mistral OCR | Yes | 0.4–1s (API) |
| Gemma 4 27B | Yes (best Arabic) | 3–10s (GPU) |
| Gemini Flash | Yes | 2–5s |

You need OCR or a vision LLM. PyMuPDF gets nothing from a scanned doc. For Arabic and right-to-left text, Gemma 4 27B is the only open-weight option that handles diacritics and ligatures cleanly — see Arabic PDF extraction for GCC logistics for the full benchmark.

The insight

No tool wins everywhere. The best approach is to detect what kind of content each page has and route to the right tool.
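A rough sketch of what "detect and route" could look like is below. It is not pdfmux's actual router: the PageFeatures fields, the 50-character cutoff, and the backend labels are all assumptions chosen to mirror the benchmark tables (PyMuPDF for digital text, Docling for tables, Surya or Gemma 4 27B for scans).

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Cheap signals gathered from one page before choosing an extractor."""
    text_chars: int          # characters found in the embedded text layer
    has_table: bool          # True if the layout looks tabular
    is_arabic: bool = False  # True if the document is known to be Arabic


# Hypothetical cutoff: below this many characters, treat the page as scanned.
MIN_TEXT_CHARS = 50


def choose_backend(page: PageFeatures) -> str:
    """Return the label of the extractor to try first for this page."""
    if page.text_chars < MIN_TEXT_CHARS:
        # No usable text layer, so the page needs OCR or a vision model.
        return "gemma-4-27b" if page.is_arabic else "surya"
    if page.has_table:
        return "docling"
    return "pymupdf"


print(choose_backend(PageFeatures(text_chars=2400, has_table=False)))
print(choose_backend(PageFeatures(text_chars=1800, has_table=True)))
print(choose_backend(PageFeatures(text_chars=0, has_table=False, is_arabic=True)))
```

The point of the sketch is the shape of the decision: the expensive tools are only reached when a cheap check says the fast path cannot work.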

Most extractors run once and hope for the best. But what if you could check whether the extraction actually worked, and re-extract only the pages that failed?

The solution: pdfmux

pdfmux is a self-healing extraction pipeline. It doesn’t just extract — it audits every page, detects broken ones, and re-extracts them automatically with a stronger backend.

1. Extract     — PyMuPDF on every page (instant)
2. Audit       — 5 quality checks per page: good / bad / empty
3. Region OCR  — surgical OCR on image regions in bad pages
4. Full OCR    — re-extract empty pages with Surya, Marker, or Mistral OCR
5. Merge       — combine good + repaired pages in original order

The key differentiator: per-page confidence scoring. After extraction, pdfmux runs five quality checks on every page — character density, alphabetic ratio, word structure, whitespace patterns, and mojibake detection. Each page gets a confidence score from 0 to 1. Pages below the threshold get re-extracted with a better tool.
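To give a feel for how five cheap checks can become one number, here is a simplified sketch. It is my own toy version, not the code pdfmux ships: the individual formulas, the density calibration, the mojibake marker list, and the unweighted mean used to combine the signals are all assumptions.

```python
import re

# Page area of US Letter in square inches; pass the real area when known.
LETTER_AREA_SQ_IN = 8.5 * 11
# Hypothetical calibration: a page at or above this density scores 1.0.
FULL_DENSITY_CHARS_PER_SQ_IN = 10
# Debris typical of UTF-8 decoded as Latin-1, plus the replacement character.
MOJIBAKE_MARKERS = ("\ufffd", "Ã©", "Ã¨", "â€™", "â€œ")
EDGE_PUNCTUATION = ".,;:!?()[]\"'"


def char_density(text: str, area_sq_in: float) -> float:
    """Characters per square inch, scaled into 0..1."""
    per_sq_in = len(text.strip()) / area_sq_in
    return min(1.0, per_sq_in / FULL_DENSITY_CHARS_PER_SQ_IN)


def alphabetic_ratio(text: str) -> float:
    """Share of non-whitespace characters that are letters."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c.isalpha() for c in visible) / len(visible)


def word_structure(text: str) -> float:
    """Share of tokens that look like plain words or plain numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    cores = [t.strip(EDGE_PUNCTUATION) for t in tokens]
    ok = sum(1 for c in cores if (c.isalpha() and len(c) <= 25) or c.isdigit())
    return ok / len(tokens)


def whitespace_score(text: str) -> float:
    """Penalize runs of three or more spaces between visible characters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    gaps = len(re.findall(r"\S {3,}(?=\S)", text))
    return 1.0 - min(1.0, gaps / len(tokens))


def mojibake_score(text: str) -> float:
    """Share of tokens free of known encoding-corruption markers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    bad = sum(1 for t in tokens if any(m in t for m in MOJIBAKE_MARKERS))
    return 1.0 - bad / len(tokens)


def page_confidence(text: str, area_sq_in: float = LETTER_AREA_SQ_IN) -> float:
    """Combine the five signals into one 0..1 score (unweighted mean)."""
    if not text.strip():
        return 0.0
    signals = [
        char_density(text, area_sq_in),
        alphabetic_ratio(text),
        word_structure(text),
        whitespace_score(text),
        mojibake_score(text),
    ]
    return sum(signals) / len(signals)


clean_page = "Quarterly revenue rose twelve percent over the prior year. " * 30
garbled_page = "Ã©â€™ \ufffd\ufffd\ufffd    ##  1 2     $$ \ufffd"
print(round(page_confidence(clean_page), 2), round(page_confidence(garbled_page), 2))
```

A real scorer would need per-language tuning (a letter ratio punishes number-heavy financial pages, for example), which is why the thresholds here should be read as placeholders.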

All pages good? Zero OCR overhead. You only pay (in latency or dollars) for the pages that are actually broken.
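The extract, audit, re-extract, merge loop can be sketched in a few lines. This is a simplification under stated assumptions: steps 3 and 4 are collapsed into a single "strong" backend, backends and the audit function are passed in as plain callables, and the 0.6 threshold is borrowed from the default mentioned in the FAQ. None of these names are pdfmux's internal API.

```python
from typing import Callable


def self_healing_extract(
    n_pages: int,
    fast: Callable[[int], str],
    strong: Callable[[int], str],
    audit: Callable[[str], float],
    threshold: float = 0.6,
) -> list:
    """Extract every page with the fast backend, retry weak pages with the strong one."""
    results = []
    for index in range(n_pages):
        text = fast(index)
        confidence = audit(text)
        backend = "fast"
        if confidence < threshold:
            # Only pages that fail the audit pay for the slower backend.
            retry_text = strong(index)
            retry_confidence = audit(retry_text)
            if retry_confidence > confidence:
                text, confidence, backend = retry_text, retry_confidence, "strong"
        results.append(
            {"page": index + 1, "text": text, "confidence": confidence, "backend": backend}
        )
    # Pages are appended in order, so the merged output keeps the original order.
    return results


# Toy stand-ins for real extractors and a real audit, for demonstration only.
FAST_OUTPUT = ["Quarterly revenue rose 12 percent.", "", "Am\ufffdunt D\ufffdscription"]
STRONG_OUTPUT = ["Quarterly revenue rose 12 percent.", "Scanned page text.", "Amount Description"]


def toy_audit(text: str) -> float:
    if not text:
        return 0.0
    return 0.3 if "\ufffd" in text else 0.9


pages = self_healing_extract(3, FAST_OUTPUT.__getitem__, STRONG_OUTPUT.__getitem__, toy_audit)
for page in pages:
    print(page["page"], page["backend"], page["confidence"])
```

In the toy run, page 1 never touches the strong backend, which is the "zero OCR overhead" case described above.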

What this looks like in practice

Typical single-tool output:
  Page 1: (ok)
  Page 2: (ok)
  Page 3: [empty]
  Page 4: Amoun  Dscriptin  $450  Consltng
  Page 5: (ok)
   No quality info. No way to know which pages are broken.

pdfmux output:
  Page 1: good  0.98
  Page 2: good  0.96
  Page 3: bad  OCR'd  0.91
  Page 4: bad  OCR'd  0.87
  Page 5: good  0.97
    5 pages, 94% avg confidence, 2 re-extracted with OCR
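The summary line is just arithmetic over the per-page results. Here is how it could be derived from the five pages shown above; the record layout is a hypothetical one chosen for the example, not pdfmux's output schema.

```python
# Per-page results copied from the example output above.
pages = [
    {"page": 1, "status": "good", "confidence": 0.98},
    {"page": 2, "status": "good", "confidence": 0.96},
    {"page": 3, "status": "bad", "confidence": 0.91},
    {"page": 4, "status": "bad", "confidence": 0.87},
    {"page": 5, "status": "good", "confidence": 0.97},
]


def summarize(pages: list) -> str:
    """Build the one-line summary from per-page audit results."""
    average = sum(p["confidence"] for p in pages) / len(pages)
    repaired = sum(1 for p in pages if p["status"] != "good")
    return f"{len(pages)} pages, {average:.0%} avg confidence, {repaired} re-extracted with OCR"


summary = summarize(pages)
print(summary)
```

The average works out to 4.69 / 5 = 0.938, which rounds to the 94% shown in the output.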

Usage

Three calls in Python:

import pdfmux

# extract as markdown — auto-audits every page
text = pdfmux.extract_text("report.pdf")

# structured json with locked schema
data = pdfmux.extract_json("report.pdf")

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# → [{title, text, page_start, page_end, tokens, confidence}]

Or the CLI:

pip install pdfmux
pdfmux invoice.pdf          # → invoice.md
pdfmux pitch-deck.pdf       # auto-detects scanned pages, OCRs them
pdfmux analyze report.pdf   # quick quality triage
pdfmux serve                # MCP server for AI agents
pdfmux watch ./inbox        # auto-convert PDFs as they land
pdfmux estimate ./batch     # predict spend before running

The fallback chain

What makes pdfmux practical: if you don’t install the optional extractors, it falls back to whatever is installed instead of raising an error.

# just the base — handles 90% of PDFs
pip install pdfmux

# add table support when you need it
pip install "pdfmux[tables]"

# add OCR for scanned pages
pip install "pdfmux[ocr]"

# add Marker for academic/table-heavy docs
pip install "pdfmux[marker]"

# add everything (Mistral OCR, Marker, Gemma 4, all OCR engines)
pip install "pdfmux[all]"

No errors, no config. If Docling isn’t installed and you hit a table-heavy PDF, pdfmux falls back to Marker, then PyMuPDF, and reports the confidence drop. The base pipeline runs on CPU without a GPU or API keys, so you can ship it in CI, in a Lambda function, or on a Raspberry Pi.
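One common way to build this kind of fallback is to probe for optional packages at runtime and take the first one that is present. The sketch below shows the idea for the table chain described above (Docling, then Marker, then PyMuPDF). The import names and the helper are assumptions for illustration; I am not claiming this is how pdfmux resolves its backends internally.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Preference order for table-heavy pages, as import names.
# "fitz" is the long-standing import name for PyMuPDF; the others are assumed.
TABLE_CHAIN = ["docling", "marker", "fitz"]


def is_importable(module_name: str) -> bool:
    """Return True if the module can be found without importing it."""
    return find_spec(module_name) is not None


def first_available(
    chain: list,
    is_installed: Callable[[str], bool] = is_importable,
) -> Optional[str]:
    """Return the first backend in the chain that is installed, or None."""
    for module_name in chain:
        if is_installed(module_name):
            return module_name
    return None


print("Table backend on this machine:", first_available(TABLE_CHAIN))
```

Passing is_installed as a parameter keeps the function easy to test without installing or uninstalling anything.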

What changed in pdfmux 1.6

The benchmark above reflects the pdfmux 1.6 router, shipped on 2026-04-30. Three things matter for benchmarking work:

  1. Three new backends. The router went from 4 extractors to 7: Mistral OCR, Marker, and Gemma 4 27B joined PyMuPDF, Docling, Surya, and Gemini Flash. Mistral OCR is widely reported as the accuracy ceiling on table-heavy documents; Marker is the best free option for academic PDFs with complex layouts; Gemma 4 27B is the first open-weight vision LLM in the pool with native Arabic OCR.
  2. Hash-keyed result cache. Re-running an extraction that took 14 seconds the first time now takes 0.05 seconds. This matters for benchmarking because you can rerun a corpus dozens of times without re-paying for OCR.
  3. estimate and diff. pdfmux estimate ./batch tells you the dollar cost of running a directory through Mistral OCR or Gemini before you press go. pdfmux diff a.json b.json shows what changed between two runs of the same PDF, which makes A/B testing extractors actually tractable.
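For readers wondering what "hash-keyed" means in item 2, the general technique is to key the stored result on a digest of the file's bytes, so an unchanged PDF is never extracted twice even if it is renamed or moved. A minimal sketch follows; the cache layout, function names, and JSON storage are my assumptions, not a description of pdfmux's cache format.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def file_key(path: Path) -> str:
    """Return the SHA-256 hex digest of the file contents, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_extract(path: Path, extract: Callable[[Path], dict], cache_dir: Path) -> dict:
    """Return the cached result for this file content, extracting only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{file_key(path)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    result = extract(path)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result


# Demonstration with a stand-in extractor that records how often it runs.
calls = []


def slow_extract(path: Path) -> dict:
    calls.append(path.name)
    return {"pages": 1, "text": "hello"}


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "report.pdf"
    pdf.write_bytes(b"%PDF-1.7 stand-in bytes")
    first = cached_extract(pdf, slow_extract, Path(tmp) / "cache")
    second = cached_extract(pdf, slow_extract, Path(tmp) / "cache")

print(len(calls), first == second)
```

A production cache would also fold the extractor name and version into the key, so that upgrading a backend invalidates old results.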

If you read the original version of this post and tried the v1.0 router, the v1.6 results are materially better on scanned and Arabic documents. Worth re-running.

Built for LLM pipelines

pdfmux outputs structured content designed for RAG pipelines, vector databases, agent workflows, and knowledge retrieval systems.

Output includes:

  • Section boundaries with page references
  • Per-page confidence scoring
  • Structured chunks with token estimates
  • Locked JSON schema (the API is frozen for the 1.x line — your code won’t break on minor updates)
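As an illustration of how the token estimates and confidence scores might be used downstream, the sketch below packs chunks into a fixed prompt budget and skips anything below a confidence floor. The chunk keys match the fields listed for load_llm_context earlier in the post; the sample data, the 0.8 floor, and the select_chunks helper are hypothetical.

```python
def select_chunks(chunks: list, token_budget: int, min_confidence: float = 0.8) -> list:
    """Keep confident chunks, in document order, until the token budget is spent."""
    selected = []
    used = 0
    for chunk in chunks:
        if chunk["confidence"] < min_confidence:
            continue  # extraction quality too low to trust in a prompt
        if used + chunk["tokens"] > token_budget:
            continue  # would overflow the budget; a smaller later chunk may still fit
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Sample records shaped like the documented chunk fields.
chunks = [
    {"title": "Summary", "text": "...", "page_start": 1, "page_end": 1, "tokens": 400, "confidence": 0.97},
    {"title": "Scanned appendix", "text": "...", "page_start": 2, "page_end": 3, "tokens": 300, "confidence": 0.55},
    {"title": "Revenue", "text": "...", "page_start": 4, "page_end": 5, "tokens": 500, "confidence": 0.93},
    {"title": "Outlook", "text": "...", "page_start": 6, "page_end": 6, "tokens": 350, "confidence": 0.90},
]

picked = select_chunks(chunks, token_budget=1000)
print([c["title"] for c in picked])
```

Whether to drop low-confidence chunks or just rank them lower is a retrieval design choice; dropping is the simplest version to show.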

LangChain and LlamaIndex

# LangChain
from pdfmux.integrations.langchain import PDFMuxLoader
loader = PDFMuxLoader("report.pdf")
docs = loader.load()  # → list[Document]

# LlamaIndex
from pdfmux.integrations.llamaindex import PDFMuxReader
reader = PDFMuxReader()
docs = reader.load_data("report.pdf")  # → list[Document]

The full LangChain pattern — including chunking strategies, metadata propagation, and confidence-aware retrieval — is in PDF extraction with LangChain.

Batch processing

For directory-scale ingestion (10k+ PDFs), use the streaming CLI:

pdfmux convert ./pdfs --out ./extracted --workers 8 --stream

NDJSON streaming means you can pipe pages into a vector database as they’re extracted, instead of waiting for the whole batch. The full pattern — including resume-on-failure and cost prediction — is in Batch PDF processing in Python.
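On the consuming side, NDJSON is easy to handle because every line is a complete JSON document. Here is a minimal reader that groups records into batches, which is the usual shape for vector database upserts. The record fields in the sample are made up for the example, and in real use you would pass sys.stdin instead of the in-memory sample.

```python
import io
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(records: Iterable[dict], size: int) -> Iterator[list]:
    """Group records into lists of at most `size` items, preserving order."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Stand-in for a live stream; field names here are hypothetical.
sample = io.StringIO(
    '{"page": 1, "text": "Invoice 1042", "confidence": 0.98}\n'
    '{"page": 2, "text": "Line items", "confidence": 0.91}\n'
    "\n"
    '{"page": 3, "text": "Totals", "confidence": 0.96}\n'
)

batches = list(batched(iter_ndjson(sample), size=2))
for batch in batches:
    print([record["page"] for record in batch])
```

Because both functions are generators, nothing is held in memory beyond the current batch, which is what makes the streaming mode useful at 10k+ PDFs.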

MCP server

pdfmux includes a built-in MCP server. Add it to Claude Desktop, Cursor, or any other MCP client and your AI agent can read PDFs natively:

{
  "mcpServers": {
    "pdfmux": { "command": "pdfmux", "args": ["serve"] }
  }
}

Three tools: convert_pdf for extraction, analyze_pdf for quick quality triage, batch_convert for directories. Setup details in pdfmux MCP server for Claude and Cursor and pdfmux for Claude Code.

Frequently asked questions

Why not just use Gemini Flash or Mistral OCR for everything?

Three reasons. First, cost: at scale, sending every page to a vision LLM runs $1–5 per 1,000 pages. PyMuPDF runs zero-cost. Second, latency: a vision LLM averages 2–5 seconds per page; PyMuPDF averages 10 milliseconds. Third, control: when an LLM gets a digital PDF wrong, it usually hallucinates a plausible-but-incorrect rewrite. PyMuPDF returns the actual text or fails loudly. Routing means you only pay the LLM tax on pages that genuinely need it.
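The cost argument is easy to check with back-of-the-envelope numbers. The calculation below uses the $1–5 per 1,000 pages range from this answer and the roughly 90% digital share from earlier in the post. It assumes, pessimistically, that every non-digital page goes to a paid LLM rather than a free local OCR engine, and the 100,000-page corpus size is just an example.

```python
def llm_spend(pages: int, llm_fraction: float, usd_per_1k_pages: float) -> float:
    """Dollar cost of sending a fraction of the pages to a paid vision LLM."""
    return pages * llm_fraction * usd_per_1k_pages / 1000


PAGES = 100_000        # example corpus size
ROUTED_FRACTION = 0.10  # pages that are not clean digital text

for rate in (1.0, 5.0):
    everything = llm_spend(PAGES, 1.0, rate)
    routed = llm_spend(PAGES, ROUTED_FRACTION, rate)
    print(f"${rate:.0f}/1k pages: all-LLM ${everything:,.0f} vs routed ${routed:,.0f}")
```

Under these assumptions routing cuts the LLM bill by about 10x, and the saving grows if some of the hard pages can be handled by a local OCR backend instead.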

How does the per-page confidence score work?

Each page runs through five lightweight heuristics: character density (text per square inch), alphabetic ratio (letters vs garbage), word structure (proportion of dictionary-shaped tokens), whitespace patterns (suspicious gaps), and mojibake detection (UTF-8 corruption signatures). The five signals combine into a single 0–1 score. Pages under the threshold (default 0.6) trigger re-extraction with a stronger backend. Full algorithm in the self-healing PDF extraction deep dive.

Does pdfmux work on Arabic, Chinese, or right-to-left scripts?

Yes. The base pipeline preserves Unicode correctly, so Arabic and CJK digital PDFs extract cleanly with PyMuPDF. For scanned Arabic documents, the router falls through to Gemma 4 27B (best for diacritics and ligatures) or Mistral OCR. The benchmark on Gulf logistics documents is in Arabic PDF extraction for GCC logistics.

What about forms and invoices specifically?

Forms benefit from layout-aware extractors. Marker and Docling preserve form-field structure better than PyMuPDF; Mistral OCR handles handwritten fields. The full breakdown — including field extraction with locked JSON schemas — is in PDF form data extraction in Python and PDF invoice extraction in Python.

Can I run pdfmux without sending data to any external API?

Yes. Install the local-only profile (pip install "pdfmux[local]") and the router will only consider PyMuPDF, Docling, Surya, Marker, and Gemma 4. No data leaves your machine. This is the usual choice for compliance-sensitive deployments; see Local PDF extraction with Gemma 4 for the full local-only architecture.

Try it

pip install pdfmux
pdfmux your-file.pdf
  • GitHub: source code, docs, examples
  • PyPI: pip install pdfmux
  • Website: documentation and API reference

MIT licensed. Runs locally. No API keys required for the base install.



Built by Nameet Potnis. Contributions welcome on GitHub.