pdfmux vs PyMuPDF vs marker vs docling vs pdfplumber: 200-PDF benchmark

TL;DR: Head-to-head benchmark of 5 Python PDF extraction libraries on 200 real-world PDFs. Scores, runnable code for each tool, license traps, and which to pick for RAG.

Direct answer: pdfmux scores 0.903 overall on opendataloader-bench, ranking #2 of all tools and #1 among free ones. It beats docling (0.877), marker (0.861), mineru (0.831), and every other open-source extractor on reading order, tables, and heading detection. The only tool ahead is a hybrid AI engine (0.909, paid API calls), and pdfmux reaches 99% of that score at zero cost per page, on CPU, with no GPU. If you’re a Python developer building a RAG pipeline and you can’t predict what PDFs will arrive, pdfmux is the default. If you know your PDFs are clean digital text, PyMuPDF alone is simpler and 5-50x faster (0.01s/page versus 0.05-0.5s/page).

This post compares the five tools, using scores from the 200-document benchmark where a tool was formally included, gives you runnable extraction code for each, and tells you when each one actually wins.

The test

We benchmarked 6 leading PDF extraction tools on opendataloader-bench — 200 real-world PDFs spanning:

Academic papers with complex layouts and equations
Financial reports with dense tables and footnotes
Legal contracts with multi-column text
Scanned documents requiring OCR
Government filings with mixed content types

Three metrics, each measuring a different aspect of extraction quality:

NID (Reading Order) — Does the extracted text follow the document’s reading order? Measured via fuzzy string matching.
TEDS (Table Accuracy) — Do extracted tables match ground truth? Measured via tree edit distance on table HTML.
MHS (Heading Structure) — Are headings correctly identified and nested? Measured via tree edit distance on heading hierarchy.
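To make the reading-order metric more concrete, here is a minimal sketch of a fuzzy-match score between extracted text and ground truth. It uses Python's standard-library difflib as a stand-in. The benchmark's exact NID formula isn't given in this post, so reading_order_similarity is a hypothetical illustration, not the benchmark's scorer.

```python
from difflib import SequenceMatcher


def reading_order_similarity(extracted: str, truth: str) -> float:
    """Return a 0.0-1.0 similarity between extracted text and ground truth.

    Assumption: SequenceMatcher.ratio() stands in for the benchmark's
    fuzzy matcher, whose exact formula is not given in this post.
    """
    # Collapse whitespace so line-wrapping differences are not penalised.
    a = " ".join(extracted.split())
    b = " ".join(truth.split())
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


truth = "Introduction. Methods. Results. Discussion."
in_order = "Introduction.\nMethods.\nResults.\nDiscussion."
scrambled = "Results. Introduction. Discussion. Methods."

score_in_order = reading_order_similarity(in_order, truth)
score_scrambled = reading_order_similarity(scrambled, truth)
print(score_in_order, score_scrambled)
```

The in-order extraction scores 1.0 because only its line breaks differ from the ground truth. The scrambled one contains the same sentences but scores lower, which is the behaviour a reading-order metric needs.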

Results

Tool	Overall	Reading Order	Tables	Headings	Cost	GPU
hybrid (AI)	0.909	0.935	0.928	0.828	~$0.01/page	No
pdfmux	0.903	0.920	0.911	0.847	Free	No
docling	0.877	0.900	0.887	0.802	Free	No
marker	0.861	0.890	0.808	0.796	Free	Recommended
opendataloader	0.852	0.913	0.494	0.761	Free	No
mineru	0.831	0.857	0.873	0.743	Free	Recommended

Key findings:

pdfmux has the best reading order (NID) of any free tool at 0.920, beating Docling (0.900).
pdfmux now beats Docling’s table accuracy at 0.911 vs 0.887 TEDS — surpassing even a dedicated ML table extractor (see our table extraction deep dive for how).
pdfmux has the best heading detection (MHS) of any engine, paid or free, at 0.847.
marker requires GPU for reasonable speed; without GPU, extraction takes 5-10x longer.
pdfplumber isn’t included in the formal benchmark, but our testing shows it consistently scores below PyMuPDF on complex documents.

For a ranked breakdown of every library, see best PDF extraction library for Python in 2026. For decision flowcharts across seven tools, try our honest guide to which PDF extractor you should use.

The code: extracting a PDF with each tool

Same task — read a PDF, get Markdown out — five different APIs. These are minimal but real; each runs as-is after the matching pip install.

pdfmux

pip install pdfmux

from pdfmux import process

result = process("report.pdf", quality="standard")
print(result.text)            # clean Markdown
print(result.confidence)      # 0.0-1.0 per-document quality score
print(result.extractor_used)  # which backend handled most pages

quality="standard" classifies each page and routes it — PyMuPDF for digital text, Docling for tables (if pdfmux[tables] is installed), OCR for scans (if pdfmux[ocr] is installed). Pages that fail a quality check get re-extracted automatically.
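The per-page routing described above can be sketched roughly as follows. This is not pdfmux's actual implementation: PageStats, route_page, and the thresholds are hypothetical names and values chosen for illustration, and a real classifier would likely use more signals.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page features; not pdfmux's real data model."""
    char_count: int            # characters in the embedded text layer
    image_coverage: float      # fraction of page area covered by images, 0.0-1.0
    has_table_candidate: bool  # whether a table-like region was detected


def route_page(stats: PageStats, min_chars: int = 50) -> str:
    """Pick a backend name for one page using simple, assumed thresholds."""
    # Little or no text layer plus a large image suggests a scanned page.
    if stats.char_count < min_chars and stats.image_coverage > 0.5:
        return "ocr"
    # Pages with a table candidate go to the slower ML table extractor.
    if stats.has_table_candidate:
        return "docling"
    # Everything else is plain digital text: take the fast path.
    return "pymupdf"


pages = [
    PageStats(char_count=2400, image_coverage=0.0, has_table_candidate=False),
    PageStats(char_count=1800, image_coverage=0.1, has_table_candidate=True),
    PageStats(char_count=3, image_coverage=0.95, has_table_candidate=False),
]
routes = [route_page(p) for p in pages]
print(routes)  # ['pymupdf', 'docling', 'ocr']
```

The point of the design is that the expensive backends only run on pages that need them, so the common case stays on the fast path.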

PyMuPDF (via pymupdf4llm)

pip install pymupdf4llm

import pymupdf4llm

md = pymupdf4llm.to_markdown("report.pdf")
print(md)

One call, Markdown out, ~0.01s per page. No confidence score, no OCR. On a scanned page it returns an empty or near-empty string with no error — the silent-failure trap that bites RAG pipelines.
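If you do use PyMuPDF on its own, a small guard can surface that silent failure. The sketch below flags pages whose extracted text is empty or near-empty. find_suspect_pages and the 20-character threshold are hypothetical. The extract_page_texts helper assumes that pymupdf4llm.to_markdown(..., page_chunks=True) returns one dict per page with a "text" key, which you should check against your installed version.

```python
def find_suspect_pages(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is empty or near-empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def extract_page_texts(path: str) -> list[str]:
    """Per-page Markdown via pymupdf4llm.

    Assumption: to_markdown(..., page_chunks=True) returns one dict per page
    with a "text" key; verify this against your installed version.
    """
    import pymupdf4llm  # imported lazily so the guard above has no dependencies

    chunks = pymupdf4llm.to_markdown(path, page_chunks=True)
    return [chunk["text"] for chunk in chunks]


# Stand-in data: page 2 mimics what a scanned page tends to produce.
sample = [
    "# Annual report\n\nRevenue grew across all segments.",
    "  \n",
    "Notes to the accounts follow below.",
]
suspects = find_suspect_pages(sample)
print(suspects)  # [2]
```

In a pipeline you would call find_suspect_pages(extract_page_texts("report.pdf")) and send the flagged pages to OCR or manual review instead of indexing empty chunks.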

marker

pip install marker-pdf

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

models = create_model_dict()           # loads ML checkpoints (slow first run)
converter = PdfConverter(artifact_dict=models)
rendered = converter("complex_paper.pdf")
print(rendered.markdown)

create_model_dict() loads several PyTorch model checkpoints, downloading the weights on first run. On CPU, extraction takes 5-10s/page; with a GPU it drops to 0.3-1s.

docling

pip install docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")
print(result.document.export_to_markdown())

First run downloads ~500MB of transformer models, then 0.3-3s/page. Strong on tables (0.887 TEDS); wasteful on the 90% of pages that are plain digital text.

pdfplumber

pip install pdfplumber

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    tables = pdf.pages[0].extract_tables()  # list of tables, each a list of rows; line-based detection
print(text)

Pure Python, minimal dependencies, no OCR, no Markdown. Good on grid-line tables; struggles on borderless tables and scans.
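Since pdfplumber hands back tables as nested lists rather than Markdown, a small converter closes the gap for RAG use. This is a minimal sketch, not part of pdfplumber: rows_to_markdown is a hypothetical helper, and it assumes the first row is the header.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render one table (a list of rows of cell strings) as a Markdown table.

    The first row is treated as the header. None cells, which pdfplumber
    can emit for empty cells, become empty strings.
    """
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Newlines and pipes inside a cell would break the Markdown row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every line has the same number of columns.
    width = max(len(row) for row in rows)
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]

    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


table = [["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", None]]
markdown = rows_to_markdown(table)
print(markdown)
```

You would call it once per table returned by extract_tables(). It does nothing about merged cells or multi-row headers, which is where the ML-based tools earn their install size.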

pdfmux vs PyMuPDF

PyMuPDF (via pymupdf4llm) is the base extractor inside pdfmux. So what does pdfmux add?

Aspect	PyMuPDF alone	pdfmux
Speed	0.01s/page	0.05-0.5s/page
Table accuracy	Moderate (misses borderless tables)	High (Docling overlay)
OCR fallback	None	Automatic (RapidOCR)
Page quality audit	None	Per-page confidence scoring
Heading detection	Basic (pymupdf4llm)	Font-size analysis + bold promotion
Self-healing	None	Re-extracts bad pages automatically
License	AGPL-3.0	MIT

The license matters. PyMuPDF is licensed under AGPL-3.0, which obliges you to release the source of software that incorporates it when you distribute that software or offer it over a network, unless you buy a commercial license. pdfmux’s own code is MIT licensed, but it uses PyMuPDF internally, so the AGPL terms of that dependency still apply wherever PyMuPDF is installed. Whether those terms reach your application code depends on how you ship it, so check with your legal team before relying on the MIT label alone.

When to choose PyMuPDF directly: You’re processing millions of simple digital PDFs where speed is everything and table accuracy doesn’t matter. At 0.01s/page versus 0.05-0.5s/page, PyMuPDF is 5-50x faster than pdfmux’s standard pipeline.

pdfmux vs Docling

Docling (by IBM) is a transformer-based document understanding system.

Aspect	Docling	pdfmux
Table accuracy	0.887 TEDS	0.911 TEDS
Reading order	0.900 NID	0.920 NID
Headings	0.802 MHS	0.847 MHS
Install size	~500MB (ML models)	~20MB (no models)
First-run time	30-60s (model download)	Instant
Speed per page	0.3-3s	0.05-0.5s
GPU needed	No (but faster with)	No
Output format	Markdown, JSON	Markdown, JSON, CSV, LLM

pdfmux actually uses Docling internally — but only for pages that contain tables. For the other 90% of pages, pdfmux uses PyMuPDF (which is faster and has better reading order). This hybrid approach is why pdfmux beats Docling on reading order while edging past it on tables.
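A back-of-the-envelope estimate shows why routing only table pages to Docling pays off. The sketch below blends the per-page times quoted in this post (0.01s for PyMuPDF, 0.3-3s for Docling) using the 90/10 split. blended_seconds_per_page is a hypothetical helper, and the estimate ignores page classification, OCR, and re-extraction, so real numbers will be higher.

```python
def blended_seconds_per_page(table_share: float, fast_s: float, slow_s: float) -> float:
    """Expected per-page time when a share of pages goes to the slower backend."""
    return (1.0 - table_share) * fast_s + table_share * slow_s


TABLE_SHARE = 0.10            # the other 90% of pages go to PyMuPDF
PYMUPDF_S = 0.01              # s/page, from the PyMuPDF comparison table
DOCLING_RANGE_S = (0.3, 3.0)  # s/page, from the Docling comparison table

low = blended_seconds_per_page(TABLE_SHARE, PYMUPDF_S, DOCLING_RANGE_S[0])
high = blended_seconds_per_page(TABLE_SHARE, PYMUPDF_S, DOCLING_RANGE_S[1])
print(f"{low:.3f}s to {high:.3f}s per page")  # 0.039s to 0.309s per page
```

Running Docling on every page would cost the full 0.3-3s each, so under these assumptions the routed pipeline is roughly an order of magnitude cheaper on a typical document mix.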

When to choose Docling directly: You’re processing documents that are almost entirely tables (financial statements, spreadsheets) and you don’t need OCR, confidence scores, or per-page routing.

pdfmux vs marker

marker uses deep learning models for layout detection, OCR, and text extraction.

Aspect	marker	pdfmux
Overall score	0.861	0.903
Table accuracy	0.808 TEDS	0.911 TEDS
Reading order	0.890 NID	0.920 NID
GPU	Recommended	Not needed
Install	Complex (torch, model weights)	pip install pdfmux
Speed (CPU)	Slow (~5-10s/page)	Fast (~0.5s/page)

pdfmux beats marker on every benchmark metric while being simpler to install and faster to run. marker’s advantage was historically in handling complex academic layouts, but pdfmux’s multi-pass pipeline achieves better results with less complexity.

When to choose marker: You have a GPU and you need marker’s specific PDF cleaning features (header/footer removal, equation detection) that pdfmux doesn’t yet offer. On academic papers heavy with equations, marker’s Surya-based OCR is genuinely strong.

pdfmux vs pdfplumber

pdfplumber is a popular pure-Python PDF extraction library.

Aspect	pdfplumber	pdfmux
Table extraction	Heuristic (line-based)	Hybrid (PyMuPDF + Docling ML)
OCR support	None	Built-in (RapidOCR)
Quality auditing	None	Per-page confidence scoring
Output formats	Text, tables as nested lists	Markdown, JSON, CSV, LLM
Dependencies	Minimal	Moderate

pdfplumber is good for simple, well-structured PDFs with visible table borders. It struggles with borderless tables, scanned documents, and complex multi-column layouts where pdfmux holds its scores.

When to choose pdfplumber: You need minimal dependencies and you’re processing simple, well-formatted PDFs with grid-line tables. It’s also a good pairing if you want raw cell-coordinate access that Markdown-first tools don’t expose.

When to use each (one-line verdict)

pdfmux — Default for mixed or unknown PDFs and any RAG pipeline. Best free overall, only tool with per-page confidence and self-healing.
PyMuPDF — Known-clean digital PDFs at high volume. Fastest by far (0.01s/page), but AGPL and silent on scans.
docling — Table-only documents where you want the ML table model and nothing else. Slow on plain text.
marker — Equation-heavy academic papers when you have a GPU and don’t mind a 2GB+ install.
pdfplumber — Minimal-dependency jobs on simple grid-line tables where you want cell coordinates, not Markdown.

What changed in 2026

The benchmark was re-run on May 19, 2026 against current library versions (pdfmux 0.6, docling 2.10, marker 1.4, mineru 0.9). Two things moved since the original March table:

pdfmux’s table accuracy crossed Docling’s. pdfmux now scores 0.911 TEDS versus Docling’s 0.887 — the targeted table router (only sending table-candidate pages to the ML model) plus a borderless-table heuristic closed and then reversed the gap.
The free vs paid gap narrowed to noise. The hybrid AI engine still leads at 0.909, but pdfmux’s 0.903 is within 0.006 (about 0.7% relative), at zero per-page cost. For a 100K-page/month pipeline at ~$0.01/page, that hybrid lead costs roughly $1,000/month for a fractional accuracy gain.

Rankings among the free tools did not change: pdfmux #1, docling #2, marker #3.
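The cost-versus-accuracy trade-off above is simple arithmetic, and it is worth rerunning with your own volume. A minimal sketch using the figures from the results table (monthly_api_cost is a hypothetical helper; the per-page price is the approximate ~$0.01 quoted there):

```python
def monthly_api_cost(pages_per_month: int, cost_per_page_usd: float) -> float:
    """Per-page API pricing scaled to a monthly volume."""
    return pages_per_month * cost_per_page_usd


HYBRID_SCORE = 0.909  # overall score, hybrid AI engine
PDFMUX_SCORE = 0.903  # overall score, pdfmux

cost = monthly_api_cost(100_000, 0.01)      # ~$0.01/page from the results table
absolute_gap = HYBRID_SCORE - PDFMUX_SCORE  # difference in overall score
relative_gap = absolute_gap / HYBRID_SCORE  # gap as a share of the leader's score

print(f"${cost:,.0f}/month for +{absolute_gap:.3f} overall ({relative_gap:.1%} relative)")
# $1,000/month for +0.006 overall (0.7% relative)
```

Whether that gain is worth paying for depends on what a wrong extraction costs you downstream, which the benchmark can't answer.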

Quick start with pdfmux

pip install pdfmux

# Basic extraction
pdfmux convert report.pdf

# With table support
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard

# With OCR for scanned docs
pip install "pdfmux[ocr]"
pdfmux convert scanned.pdf

# Structured JSON output
pdfmux convert invoice.pdf -f json

from pdfmux import process

result = process("report.pdf", quality="standard")
print(result.text)            # clean Markdown
print(result.confidence)      # 0.0-1.0 quality score
print(result.extractor_used)  # which extractor was chosen

FAQ

Which PDF extraction library is best for RAG pipelines? pdfmux is designed specifically for RAG/LLM pipelines. It produces clean Markdown with per-page confidence scoring, so you know which pages to trust. It ranks #2 overall and #1 among free tools on opendataloader-bench.

Is pdfmux faster than marker? Yes. pdfmux processes most pages at 0.01-0.05s (PyMuPDF speed). Only pages with tables trigger Docling (0.3-3s). marker processes every page through its ML pipeline, at roughly 5-10s each on CPU or 0.3-1s with a GPU.

Does pdfmux support OCR for scanned PDFs? Yes. Install with pip install "pdfmux[ocr]" for automatic OCR fallback on scanned or image-heavy pages. It uses RapidOCR (CPU-only, no GPU needed). See how pdfmux runs without a GPU or API keys for the full architecture.

Why does PyMuPDF return empty text on some pages? Those pages are scanned images, not digital text. PyMuPDF’s default text extraction does not run OCR, so it returns nothing, and it doesn’t warn you. This is the most common silent failure in production extraction. pdfmux catches it with a per-page quality check and re-extracts via OCR.

Can I use pdfmux commercially? Yes. pdfmux is MIT licensed. Note that PyMuPDF (a dependency) is AGPL-3.0, and marker is GPL-3.0 — consult your legal team about copyleft implications for your specific use case.

Last updated: June 15, 2026 — bench re-run 2026-05-19, no rank changes among free tools; pdfmux still #1 free.