Direct answer: pdfmux scores 0.905 overall on opendataloader-bench, ranking #2 overall and #1 among free tools. It beats docling (0.877), marker (0.861), mineru (0.831), and every other open-source extractor on reading order, tables, and heading detection. The only tool ahead is hybrid AI (0.909, requires paid API calls) — and pdfmux achieves 99% of that score at zero cost per page. For most Python developers building RAG pipelines, pdfmux offers the best accuracy-to-complexity ratio.
The test
We benchmarked 6 leading PDF extraction tools on opendataloader-bench — 200 real-world PDFs spanning:
- Academic papers with complex layouts and equations
- Financial reports with dense tables and footnotes
- Legal contracts with multi-column text
- Scanned documents requiring OCR
- Government filings with mixed content types
Three metrics, each measuring a different aspect of extraction quality:
- NID (Reading Order) — Does the extracted text follow the document’s reading order? Measured via fuzzy string matching.
- TEDS (Table Accuracy) — Do extracted tables match ground truth? Measured via tree edit distance on table HTML.
- MHS (Heading Structure) — Are headings correctly identified and nested? Measured via tree edit distance on heading hierarchy.
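To make the reading-order metric more concrete, here is a minimal sketch of a fuzzy-match similarity between extracted text and ground truth. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in. The benchmark’s actual NID implementation may normalize and match text differently, and the function name `reading_order_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def reading_order_similarity(extracted: str, ground_truth: str) -> float:
    """Return a 0.0-1.0 similarity between two texts.

    Illustrative stand-in for a fuzzy-matching reading-order score;
    this is not the benchmark's real NID implementation.
    """
    # Collapse whitespace so line wrapping does not affect the score.
    a = " ".join(extracted.split())
    b = " ".join(ground_truth.split())
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


truth = "Introduction. Left column text. Right column text. Conclusion."
in_order = "Introduction.  Left column text.\nRight column text. Conclusion."
scrambled = "Introduction. Right column text. Left column text. Conclusion."

print(reading_order_similarity(in_order, truth))   # identical after whitespace cleanup
print(reading_order_similarity(scrambled, truth))  # lower: columns were swapped
```

The two tree-based metrics follow the same idea on structured output. TEDS, for example, is commonly defined as one minus the tree edit distance between the two table trees divided by the size of the larger tree.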
Results
| Tool | Overall | Reading Order | Tables | Headings | Cost | GPU |
|---|---|---|---|---|---|---|
| hybrid (AI) | 0.909 | 0.935 | 0.928 | 0.828 | ~$0.01/page | No |
| pdfmux | 0.905 | 0.920 | 0.911 | 0.852 | Free | No |
| docling | 0.877 | 0.900 | 0.887 | 0.802 | Free | No |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | Free | Recommended |
| opendataloader | 0.852 | 0.913 | 0.494 | 0.761 | Free | No |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | Free | Recommended |
Key findings:
- pdfmux has the best reading order (NID) of any free tool at 0.920, beating Docling (0.900)
- pdfmux now beats Docling’s table accuracy at 0.911 vs 0.887 TEDS — surpassing even a dedicated ML table extractor (see our table extraction deep dive for how)
- pdfmux has the best heading detection (MHS) of any engine, paid or free, at 0.852
- marker requires GPU for reasonable speed; without GPU, extraction takes 5-10x longer
- pdfplumber isn’t included in the formal benchmark, but our testing shows it consistently scores below PyMuPDF on complex documents
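The ranking claims can be checked directly from the table. The short sketch below copies the overall scores and cost labels from the results above and recomputes pdfmux’s rank and its share of the top score. The variable names are illustrative only.

```python
# Overall score and cost label, copied from the results table.
results = {
    "hybrid (AI)": (0.909, "~$0.01/page"),
    "pdfmux": (0.905, "Free"),
    "docling": (0.877, "Free"),
    "marker": (0.861, "Free"),
    "opendataloader": (0.852, "Free"),
    "mineru": (0.831, "Free"),
}

# Rank all tools by overall score, then keep only the free ones.
ranked = sorted(results, key=lambda tool: results[tool][0], reverse=True)
free_ranked = [tool for tool in ranked if results[tool][1] == "Free"]

overall_rank = ranked.index("pdfmux") + 1
free_rank = free_ranked.index("pdfmux") + 1
share_of_top = results["pdfmux"][0] / results["hybrid (AI)"][0]

print(f"pdfmux: #{overall_rank} overall, #{free_rank} among free tools")
print(f"share of top score: {share_of_top:.1%}")  # 99.6%
```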
For a ranked breakdown of every library, see best PDF extraction library for Python in 2026. For help choosing between them, try our decision guide for which PDF extractor to use.
pdfmux vs PyMuPDF
PyMuPDF (via pymupdf4llm) is the base extractor inside pdfmux. So what does pdfmux add?
| Aspect | PyMuPDF alone | pdfmux |
|---|---|---|
| Speed | 0.01s/page | 0.05-0.5s/page |
| Table accuracy | Moderate (misses borderless tables) | High (Docling overlay) |
| OCR fallback | None | Automatic (RapidOCR) |
| Page quality audit | None | Per-page confidence scoring |
| Heading detection | Basic (pymupdf4llm) | Font-size analysis + bold promotion |
| Self-healing | None | Re-extracts bad pages automatically |
| License | AGPL-3.0 | MIT |
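The audit and self-healing rows describe a check-then-retry pattern: score each page, and re-extract the ones that score poorly. The sketch below illustrates that general pattern only. The scoring heuristic, the 0.8 threshold, and every function name are hypothetical and are not taken from pdfmux’s source.

```python
from typing import Callable


def text_confidence(text: str) -> float:
    """Crude 0.0-1.0 score: share of alphanumeric or whitespace characters.

    Hypothetical heuristic for illustration only.
    """
    if not text.strip():
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def extract_with_fallback(
    page: str,
    fast: Callable[[str], str],
    fallback: Callable[[str], str],
    threshold: float = 0.8,
) -> tuple[str, float, str]:
    """Try the fast extractor; re-extract with the fallback if the score is low."""
    text = fast(page)
    score = text_confidence(text)
    if score >= threshold:
        return text, score, "fast"
    retry = fallback(page)
    retry_score = text_confidence(retry)
    # Keep whichever attempt scored higher.
    if retry_score > score:
        return retry, retry_score, "fallback"
    return text, score, "fast"


def fast_stub(page: str) -> str:
    # Stand-in extractor: finds no text layer on a "scanned" page.
    return "" if page == "scanned" else "Quarterly revenue rose 12 percent"


def ocr_stub(page: str) -> str:
    # Stand-in OCR fallback that always recovers the text.
    return "Quarterly revenue rose 12 percent"


print(extract_with_fallback("digital", fast_stub, ocr_stub))
print(extract_with_fallback("scanned", fast_stub, ocr_stub))
```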
The license matters. PyMuPDF is licensed under AGPL-3.0, with a commercial license available as an alternative. AGPL’s copyleft terms can require you to release the source of software that incorporates it. pdfmux’s own code is MIT licensed, but pdfmux uses PyMuPDF internally, so the AGPL terms of that dependency still apply wherever it is installed. The MIT license on pdfmux does not remove them, so check what that means for your application before you ship.
When to choose PyMuPDF directly: You’re processing millions of simple digital PDFs where speed is everything and table accuracy doesn’t matter. Going by the per-page figures above (0.01s vs 0.05-0.5s), PyMuPDF is 5-50x faster than pdfmux’s standard pipeline.
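As a rough worked example of that trade-off, the sketch below converts the per-page figures from the table into wall-clock hours for one million pages. It assumes sequential, single-process extraction and ignores I/O, so treat the output as an order-of-magnitude estimate. The function name `estimate_hours` is illustrative.

```python
PAGES = 1_000_000

# (fastest, slowest) seconds per page, taken from the comparison table.
seconds_per_page = {
    "PyMuPDF alone": (0.01, 0.01),
    "pdfmux": (0.05, 0.5),
}


def estimate_hours(pages: int, seconds: float) -> float:
    """Total sequential processing time in hours."""
    return pages * seconds / 3600


for tool, (fastest, slowest) in seconds_per_page.items():
    low = estimate_hours(PAGES, fastest)
    high = estimate_hours(PAGES, slowest)
    if low == high:
        print(f"{tool}: about {low:.1f} hours")
    else:
        print(f"{tool}: about {low:.1f} to {high:.1f} hours")
```

Under these assumptions, PyMuPDF alone finishes in under 3 hours, while pdfmux needs somewhere between about 14 and 139 hours.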
pdfmux vs Docling
Docling (by IBM) is a transformer-based document understanding system.
| Aspect | Docling | pdfmux |
|---|---|---|
| Table accuracy | 0.887 TEDS | 0.911 TEDS |
| Reading order | 0.900 NID | 0.920 NID |
| Headings | 0.802 MHS | 0.852 MHS |
| Install size | ~500MB (ML models) | ~20MB (no models) |
| First-run time | 30-60s (model download) | Instant |
| Speed per page | 0.3-3s | 0.05-0.5s |
| GPU needed | No (but faster with) | No |
| Output format | Markdown, JSON | Markdown, JSON, CSV, LLM |
pdfmux actually uses Docling internally, but only for pages that contain tables. For the other 90% of pages, pdfmux uses PyMuPDF (which is faster and has better reading order). This hybrid approach is why pdfmux beats Docling on reading order while also edging ahead on tables (0.911 vs 0.887 TEDS in the results above).
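The routing idea can be sketched in a few lines. This is an illustration of per-page dispatch under simple assumptions, not pdfmux’s actual code: the `Page` class, the `has_table` flag, and both stub extractors are hypothetical stand-ins for a real table detector, PyMuPDF, and Docling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    number: int
    has_table: bool  # in a real pipeline this would come from a layout check


def route_pages(
    pages: list[Page],
    fast_extract: Callable[[Page], str],
    table_extract: Callable[[Page], str],
) -> list[tuple[int, str]]:
    """Send table pages to the slower table-aware extractor, the rest to the fast one."""
    output = []
    for page in pages:
        extractor = table_extract if page.has_table else fast_extract
        output.append((page.number, extractor(page)))
    return output


def fast_stub(page: Page) -> str:
    # Stand-in for the fast text extractor.
    return f"fast:{page.number}"


def table_stub(page: Page) -> str:
    # Stand-in for the table-aware ML extractor.
    return f"table:{page.number}"


pages = [Page(1, False), Page(2, True), Page(3, False)]
routed = route_pages(pages, fast_stub, table_stub)
print(routed)
```

Only page 2 pays the cost of the slower extractor, which is the point of the design: the expensive model runs where it helps and nowhere else.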
When to choose Docling directly: You’re processing documents that are almost entirely tables (financial statements, spreadsheets), where pdfmux would route most pages to Docling anyway.
pdfmux vs marker
marker uses deep learning models for layout detection, OCR, and text extraction.
| Aspect | marker | pdfmux |
|---|---|---|
| Overall score | 0.861 | 0.905 |
| Table accuracy | 0.808 TEDS | 0.911 TEDS |
| Reading order | 0.890 NID | 0.920 NID |
| GPU | Recommended | Not needed |
| Install | Complex (torch, etc.) | pip install pdfmux |
| Speed (CPU) | Slow (~5-10s/page) | Fast (~0.5s/page) |
pdfmux beats marker on every metric while being simpler to install and faster to run. marker’s advantage was historically in handling complex academic layouts, but pdfmux’s multi-pass pipeline achieves better results with less complexity.
When to choose marker: You need marker’s specific PDF cleaning features (header/footer removal, equation detection) that pdfmux doesn’t yet offer.
pdfmux vs pdfplumber
pdfplumber is a popular pure-Python PDF extraction library.
| Aspect | pdfplumber | pdfmux |
|---|---|---|
| Table extraction | Heuristic (line-based) | Hybrid (PyMuPDF + Docling ML) |
| OCR support | None | Built-in (RapidOCR) |
| Quality auditing | None | Per-page confidence scoring |
| Output formats | Text, tables as dicts | Markdown, JSON, CSV, LLM |
| Dependencies | Minimal | Moderate |
pdfplumber is good for simple, well-structured PDFs with visible table borders. It struggles with borderless tables, scanned documents, and complex multi-column layouts where pdfmux excels.
When to choose pdfplumber: You need minimal dependencies and are processing simple, well-formatted PDFs with grid-line tables.
Quick start with pdfmux
```bash
pip install pdfmux

# Basic extraction
pdfmux convert report.pdf

# With table support (quotes keep shells like zsh from expanding the brackets)
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard

# Structured JSON output
pdfmux convert invoice.pdf -f json
```

```python
from pdfmux import process

result = process("report.pdf", quality="standard")
print(result.text)            # Clean Markdown
print(result.confidence)      # 0.0-1.0 quality score
print(result.extractor_used)  # Which extractor was chosen
```
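In a RAG pipeline, the confidence score is most useful as a gate before indexing. The sketch below shows one way to do that. It relies only on the `text` and `confidence` attributes shown above and uses `SimpleNamespace` stand-ins instead of real pdfmux results so that it runs on its own. The 0.8 cutoff and the function name `split_by_confidence` are arbitrary assumptions.

```python
from types import SimpleNamespace


def split_by_confidence(results: dict, min_confidence: float = 0.8):
    """Partition documents into those safe to index and those needing review."""
    trusted, needs_review = [], []
    for name, result in results.items():
        bucket = trusted if result.confidence >= min_confidence else needs_review
        bucket.append(name)
    return trusted, needs_review


# Stand-ins for objects returned by process(); the values are made up.
results = {
    "report.pdf": SimpleNamespace(text="# Annual report", confidence=0.95),
    "scan.pdf": SimpleNamespace(text="Ann al rep rt", confidence=0.42),
}

trusted, needs_review = split_by_confidence(results)
print("index:", trusted)
print("review:", needs_review)
```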
FAQ
Which PDF extraction library is best for RAG pipelines? pdfmux is designed specifically for RAG/LLM pipelines. It produces clean Markdown with per-page confidence scoring, so you know which pages to trust. It ranks #2 overall and #1 among free tools on the opendataloader benchmark.
Is pdfmux faster than marker? Yes. pdfmux processes most pages at 0.01-0.05s (PyMuPDF speed). Only pages with tables trigger Docling (0.3-3s). marker processes every page through its ML pipeline at 1-10s each.
Does pdfmux support OCR for scanned PDFs? Yes. Install with pip install pdfmux[ocr] for automatic OCR fallback on scanned or image-heavy pages. It uses RapidOCR (CPU-only, no GPU needed). See how pdfmux runs without a GPU or API keys for the full architecture.
Can I use pdfmux commercially? Yes. pdfmux is MIT licensed. Note that PyMuPDF (a dependency) is AGPL-3.0 — consult your legal team about AGPL implications for your specific use case.
Last updated: March 2026