TL;DR: We ran pdfmux across 11 real-world public documents — SEC filings, academic papers, Supreme Court opinions, and government reports — totaling 1,422 pages. pdfmux processed them at 6.8 pages/sec, was 7% faster than pymupdf4llm overall, extracted more content on 8 out of 11 documents, hit 100% confidence on every single one, and recovered content where pymupdf4llm crashed entirely. Zero cloud calls. Zero cost. Every number is below.
## The problem nobody talks about
Most PDF extraction benchmarks use synthetic test cases — clean, well-formatted documents designed to make tools look good. But the PDFs that actually break your RAG pipeline at 2am are the messy ones:
- SEC filings with 500+ pages mixing legal text, financial tables, and cover-page graphics
- Academic papers where equations are embedded as images inside otherwise digital text
- Legal opinions with footnote hierarchies three levels deep
- Government reports where the cover is a scanned image and the rest is digital
We built pdfmux because we kept hitting these edge cases — the same ones that led to our original benchmark of every PDF-to-Markdown tool. (For the synthetic benchmark results on 200 PDFs, see our opendataloader-bench comparison.) So we decided to benchmark it against the documents people actually need to process — not the ones that make for easy demos.
## The test corpus: 1,422 pages of real documents
We assembled 11 publicly available documents covering the most common PDF extraction challenges. Every document is downloadable — you can reproduce this entire benchmark yourself.
| Document | Type | Pages | Size | Why it’s hard |
|---|---|---|---|---|
| Tesla 10-K (2024) | SEC annual filing | 144 | 1.6 MB | Financial tables, forward-looking statements |
| Apple 10-K | SEC annual filing | 121 | 1.0 MB | Dense financial tables, footnotes |
| Berkshire Hathaway AR | Annual report | 150 | 1.8 MB | Warren Buffett’s letter + complex financials |
| Uber S-1 (2019) | SEC IPO filing | 522 | 5.4 MB | Massive document, 15 graphical pages, complex tables |
| Airbnb Pitch Deck | Startup presentation | 12 | 842 KB | Fully graphical/image-based slides |
| Attention Is All You Need | Academic paper | 15 | 2.1 MB | Equations, architecture diagrams |
| BERT | Academic paper | 16 | 757 KB | Dense results tables |
| Trump v. United States | Supreme Court opinion | 119 | 519 KB | Legal citations, footnotes, syllabus |
| EY 10-K Guide | Professional services | 154 | 2.5 MB | Mixed layouts, branded graphics |
| FDA New Drug Therapy | Government report | 34 | 10 MB | Cover graphic, data tables |
| FDA PDUFA Performance | Government report | 135 | 1.4 MB | Tables, structured regulatory data |
Total: 1,422 pages across 28.4 MB of real-world PDFs.
## The tools we compared
We tested three approaches on every document:
- PyMuPDF raw — `page.get_text()` on every page. Zero formatting, zero structure. The baseline.
- pymupdf4llm — The most popular PDF-to-Markdown library for LLM pipelines. Adds headers, bold, lists, and table detection.
- pdfmux — Our self-healing pipeline that classifies each page, routes to the best extractor, audits quality, and re-extracts failures.
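For reference, the two open-source baselines are each a couple of lines of Python; both calls below are the libraries' standard entry points (pdfmux ran through its own pipeline, and its CLI is shown at the end of this post):

```python
import pymupdf  # PyMuPDF (older releases import as `fitz`)
import pymupdf4llm

# Baseline: raw text, page by page, no structure at all
doc = pymupdf.open("tesla-10k.pdf")
raw_text = "".join(page.get_text() for page in doc)

# pymupdf4llm: Markdown with headers, bold, lists, and detected tables
markdown = pymupdf4llm.to_markdown("tesla-10k.pdf")
```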
Methodology: All tests ran on an Apple Silicon Mac with Python 3.12. pdfmux used “fast” quality mode — no Docling, no OCR models, just intelligent routing and auditing on top of the same pymupdf4llm backend. Each document was processed 3 times; we report the median. Total cost: $0 — everything runs locally without a GPU.
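The timing harness is nothing exotic. A minimal sketch of the median-of-three measurement (simplified, not our exact benchmark script):

```python
import statistics
import time

import pymupdf4llm

def median_runtime(path: str, runs: int = 3) -> float:
    """Run a full extraction `runs` times and return the median wall-clock time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pymupdf4llm.to_markdown(path)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

print(f"pymupdf4llm on Tesla 10-K: {median_runtime('tesla-10k.pdf'):.2f}s")
```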
For context: LlamaParse charges per page, Unstructured has cloud-only features, and many “AI-powered” extractors require API keys. pdfmux wraps open-source extractors and adds intelligence on top — no API keys, no cloud calls, no per-page fees.
## Speed: every document, no cherry-picking
The common assumption: “a pipeline wrapper must be slower.” Here’s the full table — including the documents where pdfmux was slower:
| Document | pymupdf4llm | pdfmux | Delta |
|---|---|---|---|
| Airbnb Pitch Deck (12p) | 0.33s | 0.24s | 27% faster |
| Attention Paper (15p) | 2.00s | 2.09s | ~same |
| BERT Paper (16p) | 3.65s | 4.24s | 16% slower |
| FDA Drug Therapy (34p) | 5.81s | 5.87s | ~same |
| Supreme Court (119p) | 6.13s | 6.45s | ~same |
| Apple 10-K (121p) | 16.88s | 15.32s | 9% faster |
| FDA PDUFA (135p) | 47.20s | 38.91s | 18% faster |
| Tesla 10-K (144p) | 21.93s | 22.22s | ~same |
| Berkshire AR (150p) | 24.29s | 23.38s | 4% faster |
| EY 10-K Guide (154p) | 22.89s | 22.02s | 4% faster |
| Uber S-1 (522p) | 72.61s | 68.16s | 6% faster |
| Total (1,422 pages) | 223.72s | 208.90s | 7% faster |
That’s 6.8 pages/sec for pdfmux vs 6.4 pages/sec for pymupdf4llm.
The pattern is clear: on small files (15-16 pages), pdfmux’s classification overhead is noticeable — BERT is 16% slower in wall clock time (0.6 seconds). On larger documents where it matters, pdfmux is consistently faster because page-level routing skips unnecessary work. The Uber S-1 saved 4.5 seconds. The FDA PDUFA report saved 8.3 seconds.
The overhead is constant (~0.5s for classification + audit). The savings scale with document size.
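A toy cost model makes the trade-off concrete: total time is a fixed overhead plus a per-page cost that depends on the route. The per-route costs below are illustrative assumptions, not measurements:

```python
# Toy cost model: total = fixed overhead + per-page cost by route.
# The numbers here are illustrative assumptions, not measured values.
FIXED_OVERHEAD = 0.5  # seconds: classification + audit pass
PER_PAGE_COST = {"digital": 0.13, "graphical": 0.02, "empty": 0.001}

def estimated_time(page_mix: dict[str, int]) -> float:
    return FIXED_OVERHEAD + sum(PER_PAGE_COST[t] * n for t, n in page_mix.items())

# Uber S-1 page mix, taken from the per-document breakdown later in this post
print(f"{estimated_time({'digital': 507, 'graphical': 15, 'empty': 1}):.1f}s")
```

On a 15-page paper the 0.5s overhead dominates; on a 522-page filing it disappears into the noise, while the cheap routes for graphical and empty pages pay it back many times over.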
## Content quality: pdfmux wins on 8 out of 11 documents
This is where it gets interesting. We compared total extracted characters — more characters generally means more content recovered, fewer gaps in your RAG pipeline.
| Document | pymupdf4llm | pdfmux | Delta | Winner |
|---|---|---|---|---|
| Airbnb Pitch Deck | 0 (crashed) | 2,766 | +2,766 | pdfmux |
| Uber S-1 | 1,713,729 | 1,746,341 | +32,612 (+1.9%) | pdfmux |
| Attention Paper | 40,498 | 41,326 | +828 (+2.0%) | pdfmux |
| Supreme Court | 248,104 | 248,690 | +586 (+0.24%) | pdfmux |
| EY 10-K Guide | 528,984 | 529,390 | +406 (+0.08%) | pdfmux |
| Tesla 10-K | 485,141 | 485,534 | +393 (+0.08%) | pdfmux |
| Apple 10-K | 445,323 | 445,683 | +360 (+0.08%) | pdfmux |
| Berkshire AR | 498,793 | 498,986 | +193 (+0.04%) | pdfmux |
| BERT Paper | 66,230 | 66,200 | -30 (-0.05%) | pymupdf4llm |
| FDA Drug Therapy | 65,669 | 65,591 | -78 (-0.12%) | pymupdf4llm |
| FDA PDUFA | 238,245 | 237,877 | -368 (-0.15%) | pymupdf4llm |
pdfmux extracts more content on 8 of 11 documents. The 3 losses are trivial — 30, 78, and 368 characters respectively (whitespace normalization differences). The wins include a complete crash recovery and 32K characters of additional content from a 522-page SEC filing.
## When pymupdf4llm crashes, pdfmux keeps going
The Airbnb pitch deck is a 12-page presentation where half the pages are image-heavy — text rendered as graphics. This is the kind of document that breaks extraction pipelines in production.
| Tool | Output | What happened |
|---|---|---|
| PyMuPDF raw | 2,738 chars | Extracted text fragments, missed graphical content |
| pymupdf4llm | 0 chars | Threw an unhandled error and produced nothing |
| pdfmux | 2,766 chars | Detected 6 graphical pages, extracted available text, flagged pages for OCR |
In a production RAG pipeline, this is the difference between a silent failure and a graceful recovery. pymupdf4llm returns nothing — your pipeline silently indexes an empty document. pdfmux recovers what it can, tells you exactly which pages need OCR treatment, and never crashes.
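If you wanted to hand-roll the recovery behavior pdfmux automates, the core pattern looks roughly like this (a sketch of the fallback idea, not pdfmux's internals; the filename is hypothetical):

```python
import pymupdf
import pymupdf4llm

def extract_with_fallback(path: str) -> str:
    """Prefer structured Markdown; fall back to raw per-page text on a crash."""
    try:
        return pymupdf4llm.to_markdown(path)
    except Exception:
        # Structured extraction blew up (e.g. on image-heavy slides):
        # recover whatever selectable text each page still exposes.
        doc = pymupdf.open(path)
        return "\n\n".join(page.get_text() for page in doc)

text = extract_with_fallback("airbnb-pitch-deck.pdf")
```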
## The Uber S-1: 32,612 characters recovered
Uber’s 522-page IPO filing is the stress test. It has dense legal text, complex financial tables, and 15 pages with heavy graphical content (charts, infographics, full-page images).
- pymupdf4llm: 1,713,729 chars
- pdfmux: 1,746,341 chars (+32,612 chars, +1.9%)
That 32K character difference is roughly 8 pages' worth of content — likely from the 15 graphical pages that pdfmux identified and handled with its fallback extraction path. For a RAG pipeline ingesting SEC filings, that's the difference between having complete risk factors and missing them entirely.
## Page-level intelligence: what other tools don't tell you
What makes pdfmux different isn’t just extraction — it’s understanding what each page is before extracting it. Every other tool treats all pages the same. pdfmux classifies first, then routes.
Across all 11 documents, pdfmux identified:
- 27 graphical pages that would produce degraded output with text-only extraction
- 6 empty pages (title pages, dividers) that other tools silently skip or mishandle
- 4 documents with complex tables that benefit from specialized extraction
- 100% confidence on every document — the audit validated every single page
Here’s the per-document breakdown:
| Document | Digital | Graphical | Empty | Tables |
|---|---|---|---|---|
| Airbnb Pitch Deck | 6 | 6 | 0 | Yes |
| Uber S-1 | 507 | 15 | 1 | Yes |
| EY 10-K Guide | 152 | 2 | 0 | No |
| Tesla 10-K | 143 | 1 | 2 | No |
| Apple 10-K | 120 | 1 | 0 | No |
| FDA Drug Therapy | 33 | 1 | 0 | Yes |
| FDA PDUFA | 134 | 1 | 0 | No |
| Berkshire AR | 150 | 0 | 3 | No |
| Attention Paper | 15 | 0 | 0 | No |
| BERT Paper | 16 | 0 | 0 | Yes |
| Supreme Court | 119 | 0 | 0 | No |
This classification intelligence feeds into routing decisions. Digital pages go to the fast path. Graphical pages get flagged for OCR. Table pages get specialized extraction. Empty pages are reported, not silently dropped.
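To make the classify-then-route idea concrete, here is a first approximation built on page-level signals PyMuPDF already exposes. The thresholds are ours, purely for illustration; pdfmux's real classifier is more involved:

```python
import pymupdf

def classify_page(page: pymupdf.Page) -> str:
    """Crude three-way page classifier: empty / graphical / digital."""
    text = page.get_text().strip()
    images = page.get_images(full=True)
    if not text and not images:
        return "empty"      # report it rather than silently dropping it
    if images and len(text) < 200:
        return "graphical"  # mostly pictures: flag for OCR
    return "digital"        # fast path: plain text extraction

doc = pymupdf.open("uber-s1.pdf")  # hypothetical local filename
for number, page in enumerate(doc, start=1):
    if classify_page(page) == "graphical":
        print(f"page {number}: flag for OCR")
```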
## What does the output actually look like?
Here’s a snippet from the Supreme Court opinion extraction — clean, LLM-ready markdown with zero post-processing:
```markdown
# **SUPREME COURT OF THE UNITED STATES**
Syllabus
TRUMP _v_ . UNITED STATES
CERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR
THE DISTRICT OF COLUMBIA CIRCUIT
No. 23–939. Argued April 25, 2024—Decided July 1, 2024
A federal grand jury indicted former President Donald J. Trump on four
counts for conduct that occurred during his Presidency following the
November 2020 election...
```
Headers preserved. Italics for case citations. Paragraph structure intact. Feed this directly to GPT-4, Claude, or any LLM — no regex cleanup needed. You can also use pdfmux as an MCP server to give your AI agent native PDF reading ability.
## The full scorecard
| Metric | PyMuPDF raw | pymupdf4llm | pdfmux |
|---|---|---|---|
| Total time (1,422 pages) | 3.16s | 223.72s | 208.90s |
| Pages per second | 450.0 | 6.4 | 6.8 |
| Total chars extracted | 4,373,506 | 4,330,716 | 4,368,384 |
| Documents with more content | — | 3 of 11 | 8 of 11 |
| Avg confidence | — | — | 100% |
| Failures | 0 | 1 (Airbnb crash) | 0 |
| Graphical pages detected | 0 | 0 | 27 |
| Empty pages detected | 0 | 0 | 6 |
| Cloud calls required | 0 | 0 | 0 |
| Cost | $0 | $0 | $0 |
PyMuPDF raw is fast but produces unstructured text — no headers, no tables, no markdown. pymupdf4llm adds structure but crashes on edge cases and misses graphical content. pdfmux wraps the same extraction backends, adds classification and auditing intelligence, and comes out faster, more complete, and more reliable.
## What’s next
This benchmark ran in fast mode — pdfmux’s lightest configuration. We’re working on:
- Standard mode benchmarks with Docling table extraction and OCR for graphical pages
- RAG quality evaluation — not just character counts, but retrieval accuracy on questions about these documents
- Larger corpus — 100+ documents across more industries (healthcare, insurance, real estate)
- Head-to-head vs LlamaParse, Unstructured, and Marker on the same corpus
Follow the GitHub repo for updates.
## Try it yourself
Every PDF in this benchmark is publicly available. Reproduce the entire thing:
```bash
pip install pdfmux

# Quick conversion
pdfmux convert tesla-10k.pdf -o output.md

# See per-page classification
pdfmux analyze tesla-10k.pdf

# Benchmark all backends
pdfmux bench paper.pdf

# With confidence scores
pdfmux convert document.pdf -o out.md --confidence
```
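To script the whole corpus instead of converting documents one at a time, a loop over the `pdfmux convert` command shown above is enough (the `corpus/` directory is a placeholder for wherever you saved the PDFs):

```python
import pathlib
import subprocess

# Convert every PDF in the corpus via the documented CLI.
for pdf in sorted(pathlib.Path("corpus").glob("*.pdf")):
    subprocess.run(
        ["pdfmux", "convert", str(pdf), "-o", str(pdf.with_suffix(".md"))],
        check=True,
    )
```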
For the full benchmark script and raw results, see the benchmarks directory on GitHub.
## Keep reading
- Best PDF extraction library for Python in 2026 — the ranked comparison of every major tool
- pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — the formal opendataloader-bench results
- What “self-healing” PDF extraction actually looks like — how pdfmux’s audit-and-repair pipeline works under the hood
pdfmux is MIT licensed, runs entirely locally, and has zero cloud dependencies. GitHub · PyPI · Docs