Direct answer: pdfmux scores 0.905 overall on opendataloader-bench, ranking #2 overall and #1 among free tools. It beats docling (0.877), marker (0.861), mineru (0.831), and every other open-source extractor on reading order, tables, and heading detection. The only tool ahead is hybrid AI (0.909, requires paid API calls) — and pdfmux achieves 99% of that score at zero cost per page. For most Python developers building RAG pipelines, pdfmux offers the best accuracy-to-complexity ratio.


The test

We benchmarked 6 leading PDF extraction tools on opendataloader-bench — 200 real-world PDFs spanning:

  • Academic papers with complex layouts and equations
  • Financial reports with dense tables and footnotes
  • Legal contracts with multi-column text
  • Scanned documents requiring OCR
  • Government filings with mixed content types

Three metrics, each measuring a different aspect of extraction quality:

  • NID (Reading Order) — Does the extracted text follow the document’s reading order? Measured via fuzzy string matching.
  • TEDS (Table Accuracy) — Do extracted tables match ground truth? Measured via tree edit distance on table HTML.
  • MHS (Heading Structure) — Are headings correctly identified and nested? Measured via tree edit distance on heading hierarchy.
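
To make the NID idea concrete, here is a minimal sketch of scoring reading order with fuzzy string matching. It uses Python's standard-library difflib as a stand-in. The benchmark's actual matcher and normalization are not described in this post, so the function name and the whitespace normalization are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def reading_order_similarity(extracted: str, reference: str) -> float:
    """Return a 0.0-1.0 similarity between extracted and reference text.

    Illustrative stand-in for an NID-style score, not the benchmark's code.
    """
    # Collapse whitespace so line wrapping does not count as a difference.
    a = " ".join(extracted.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


reference = "Intro paragraph. Left column. Right column."
in_order = "Intro paragraph.\nLeft column.  Right column."
shuffled = "Right column. Intro paragraph. Left column."

score_in_order = reading_order_similarity(in_order, reference)
score_shuffled = reading_order_similarity(shuffled, reference)
print(score_in_order, score_shuffled)
```

Text that keeps the reference order scores 1.0 after normalization, while the same sentences in the wrong order score lower, which is the behavior a reading-order metric needs.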

Results

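| Tool | Overall | Reading Order (NID) | Tables (TEDS) | Headings (MHS) | Cost | GPU |
|---|---|---|---|---|---|---|
| hybrid (AI) | 0.909 | 0.935 | 0.928 | 0.828 | ~$0.01/page | No |
| pdfmux | 0.905 | 0.920 | 0.911 | 0.852 | Free | No |
| docling | 0.877 | 0.900 | 0.887 | 0.802 | Free | No |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | Free | Recommended |
| opendataloader | 0.852 | 0.913 | 0.494 | 0.761 | Free | No |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | Free | Recommended |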

Key findings:

  • pdfmux has the best reading order (NID) of any free tool at 0.920, beating Docling (0.900)
  • pdfmux now beats Docling’s table accuracy at 0.911 vs 0.887 TEDS — surpassing even a dedicated ML table extractor (see our table extraction deep dive for how)
  • pdfmux has the best heading detection (MHS) of any engine, paid or free, at 0.852 (the paid hybrid pipeline scores 0.828)
  • marker requires GPU for reasonable speed; without GPU, extraction takes 5-10x longer
  • pdfplumber isn’t included in the formal benchmark, but our testing shows it consistently scores below PyMuPDF on complex documents
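
The findings above can be checked directly against the results table. The short script below copies the published scores into a dictionary and recomputes the per-metric leaders and pdfmux's share of the hybrid score. The data structure and helper name are ours, chosen only for this check.

```python
# Scores copied from the results table:
# (overall, reading order, tables, headings, is_free)
RESULTS = {
    "hybrid (AI)": (0.909, 0.935, 0.928, 0.828, False),
    "pdfmux": (0.905, 0.920, 0.911, 0.852, True),
    "docling": (0.877, 0.900, 0.887, 0.802, True),
    "marker": (0.861, 0.890, 0.808, 0.796, True),
    "opendataloader": (0.852, 0.913, 0.494, 0.761, True),
    "mineru": (0.831, 0.857, 0.873, 0.743, True),
}
METRICS = {"overall": 0, "reading order": 1, "tables": 2, "headings": 3}


def best_tool(metric: str, free_only: bool = False) -> str:
    """Return the tool with the highest score for the given metric."""
    col = METRICS[metric]
    candidates = {
        name: row for name, row in RESULTS.items() if row[4] or not free_only
    }
    return max(candidates, key=lambda name: candidates[name][col])


for metric in METRICS:
    print(
        f"{metric}: best = {best_tool(metric)}, "
        f"best free = {best_tool(metric, free_only=True)}"
    )

# Share of the paid hybrid pipeline's overall score that pdfmux reaches.
share = RESULTS["pdfmux"][0] / RESULTS["hybrid (AI)"][0]
print(f"pdfmux reaches {share:.1%} of the hybrid overall score")
```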

For a ranked breakdown of every library, see best PDF extraction library for Python in 2026. For help choosing between them, try our decision guide for which PDF extractor to use.

pdfmux vs PyMuPDF

PyMuPDF (via pymupdf4llm) is the base extractor inside pdfmux. So what does pdfmux add?

| Aspect | PyMuPDF alone | pdfmux |
|---|---|---|
| Speed | 0.01s/page | 0.05-0.5s/page |
| Table accuracy | Moderate (misses borderless tables) | High (Docling overlay) |
| OCR fallback | None | Automatic (RapidOCR) |
| Page quality audit | None | Per-page confidence scoring |
| Heading detection | Basic (pymupdf4llm) | Font-size analysis + bold promotion |
| Self-healing | None | Re-extracts bad pages automatically |
| License | AGPL-3.0 | MIT |
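
The audit and self-healing rows describe a score-then-retry loop: extract each page, score it, and re-extract pages that look bad. A minimal sketch of that idea follows. The confidence heuristic, the 0.6 threshold, and both extractor callables are hypothetical placeholders, not pdfmux's internals.

```python
from typing import Callable, List, Tuple

Extractor = Callable[[int], str]  # page index -> extracted text


def page_confidence(text: str, expected_chars: int = 200) -> float:
    """Toy heuristic (an assumption): pages with very little text score low."""
    return min(len(text.strip()) / expected_chars, 1.0)


def extract_with_retry(
    num_pages: int,
    fast: Extractor,
    fallback: Extractor,
    threshold: float = 0.6,
) -> List[Tuple[str, str, float]]:
    """Return (text, extractor_label, confidence) for every page."""
    pages = []
    for i in range(num_pages):
        text, used = fast(i), "fast"
        if page_confidence(text) < threshold:
            retry = fallback(i)
            # Keep the retry only if it actually scores better.
            if page_confidence(retry) > page_confidence(text):
                text, used = retry, "fallback"
        pages.append((text, used, page_confidence(text)))
    return pages


# Stub extractors standing in for a text-layer extractor and an OCR pass.
def fake_fast(i: int) -> str:
    return "" if i == 1 else "digital text " * 20  # page 1 has no text layer


def fake_ocr(i: int) -> str:
    return "recognized text " * 15


pages = extract_with_retry(3, fake_fast, fake_ocr)
for number, (_, used, confidence) in enumerate(pages):
    print(number, used, confidence)
```

Only the page with an empty text layer pays the cost of the slower fallback, which is the point of auditing per page rather than per document.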

The license matters. PyMuPDF is licensed under AGPL-3.0, which generally requires you to release the source of software that links to it when you distribute that software or offer it over a network, unless you buy a commercial license. pdfmux's own code is MIT licensed. Because pdfmux uses PyMuPDF internally, the AGPL terms still apply to that dependency, so check how they affect your application before you ship.

When to choose PyMuPDF directly: You’re processing millions of simple digital PDFs where speed is everything and table accuracy doesn’t matter. At 0.01s/page versus 0.05-0.5s/page, PyMuPDF is 5-50x faster than pdfmux’s standard pipeline.

pdfmux vs Docling

Docling (by IBM) is a transformer-based document understanding system.

| Aspect | Docling | pdfmux |
|---|---|---|
| Table accuracy | 0.887 TEDS | 0.911 TEDS |
| Reading order | 0.900 NID | 0.920 NID |
| Headings | 0.802 MHS | 0.852 MHS |
| Install size | ~500MB (ML models) | ~20MB (no models) |
| First-run time | 30-60s (model download) | Instant |
| Speed per page | 0.3-3s | 0.05-0.5s |
| GPU needed | No (but faster with) | No |
| Output format | Markdown, JSON | Markdown, JSON, CSV, LLM |

pdfmux uses Docling internally, but only for pages that contain tables. For the other 90% of pages, pdfmux uses PyMuPDF, which is faster and has better reading order. This hybrid approach is why pdfmux beats Docling on reading order (0.920 vs 0.900 NID); on tables it also comes out ahead in the benchmark (0.911 vs 0.887 TEDS).
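
The page-level routing described here can be sketched in a few lines. This is an illustration of the dispatch pattern only: the Page class, the has_table flag, and both extractor functions are hypothetical stand-ins, not pdfmux's or Docling's APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    number: int
    has_table: bool  # assumed to come from a cheap layout check


def fast_text_extractor(page: Page) -> str:
    """Placeholder for a fast text-layer extractor."""
    return f"[fast] page {page.number}"


def table_aware_extractor(page: Page) -> str:
    """Placeholder for a slower, table-aware ML extractor."""
    return f"[table] page {page.number}"


def route(page: Page) -> Callable[[Page], str]:
    """Send only table pages to the expensive extractor."""
    return table_aware_extractor if page.has_table else fast_text_extractor


def extract_document(pages: List[Page]) -> List[str]:
    return [route(page)(page) for page in pages]


# Ten pages, one of which contains a table.
document = [Page(number=n, has_table=(n == 4)) for n in range(1, 11)]
output = extract_document(document)
print("\n".join(output))
```

Because the decision is made per page, a single table in a long report costs one slow extraction instead of slowing down the whole document.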

When to choose Docling directly: You’re processing documents that are almost entirely tables (financial statements, spreadsheets), where nearly every page would be routed to Docling anyway.

pdfmux vs marker

marker uses deep learning models for layout detection, OCR, and text extraction.

| Aspect | marker | pdfmux |
|---|---|---|
| Overall score | 0.861 | 0.905 |
| Table accuracy | 0.808 TEDS | 0.911 TEDS |
| Reading order | 0.890 NID | 0.920 NID |
| GPU | Recommended | Not needed |
| Install | Complex (torch, etc.) | pip install pdfmux |
| Speed (CPU) | Slow (~5-10s/page) | Fast (~0.5s/page) |

pdfmux beats marker on every metric while being simpler to install and faster to run. marker’s advantage was historically in handling complex academic layouts, but pdfmux’s multi-pass pipeline achieves better results with less complexity.

When to choose marker: You need marker’s specific PDF cleaning features (header/footer removal, equation detection) that pdfmux doesn’t yet offer.

pdfmux vs pdfplumber

pdfplumber is a popular pure-Python PDF extraction library.

| Aspect | pdfplumber | pdfmux |
|---|---|---|
| Table extraction | Heuristic (line-based) | Hybrid (PyMuPDF + Docling ML) |
| OCR support | None | Built-in (RapidOCR) |
| Quality auditing | None | Per-page confidence scoring |
| Output formats | Text, tables as dicts | Markdown, JSON, CSV, LLM |
| Dependencies | Minimal | Moderate |

pdfplumber is good for simple, well-structured PDFs with visible table borders. It struggles with borderless tables, scanned documents, and complex multi-column layouts where pdfmux excels.

When to choose pdfplumber: You need minimal dependencies and are processing simple, well-formatted PDFs with grid-line tables.

Quick start with pdfmux

pip install pdfmux

# Basic extraction
pdfmux convert report.pdf

# With table support
# (quotes keep shells such as zsh from expanding the brackets)
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard

# Structured JSON output
pdfmux convert invoice.pdf -f json

From Python:

from pdfmux import process

result = process("report.pdf", quality="standard")
print(result.text)          # Clean Markdown
print(result.confidence)    # 0.0-1.0 quality score
print(result.extractor_used)  # Which extractor was chosen
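
In a RAG pipeline, the confidence score is most useful as a gate before indexing. The sketch below relies only on the .confidence attribute shown in the quick start. The 0.8 threshold and the helper name are assumptions, and the stub objects stand in for real process() results so the example runs without pdfmux installed.

```python
from types import SimpleNamespace


def split_by_confidence(results, min_confidence=0.8):
    """Separate documents to index directly from those needing review."""
    trusted, needs_review = [], []
    for name, result in results.items():
        bucket = trusted if result.confidence >= min_confidence else needs_review
        bucket.append(name)
    return trusted, needs_review


# Stand-in objects; in real use each value would be the object returned by
# process(path, quality="standard"). The scores here are made up.
results = {
    "report.pdf": SimpleNamespace(text="# Report ...", confidence=0.93),
    "scan.pdf": SimpleNamespace(text="", confidence=0.41),
}

trusted, needs_review = split_by_confidence(results)
print("index:", trusted)
print("review:", needs_review)
```

Tune the threshold on a sample of your own documents; the right cut-off depends on how costly a bad chunk is downstream.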

FAQ

Which PDF extraction library is best for RAG pipelines? pdfmux is designed specifically for RAG/LLM pipelines. It produces clean Markdown with per-page confidence scoring, so you know which pages to trust. It ranks #2 overall and #1 among free tools on the opendataloader benchmark.

Is pdfmux faster than marker? Yes. pdfmux processes most pages at 0.01-0.05s (PyMuPDF speed). Only pages with tables trigger Docling (0.3-3s). marker processes every page through its ML pipeline at 1-10s each.
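
As a back-of-the-envelope check, the per-page figures above can be blended into an average. The 10% table-page share is an assumption taken from the "other 90% of pages" remark in the Docling comparison; your documents may differ.

```python
def blended_seconds_per_page(fast_s: float, table_s: float,
                             table_share: float = 0.10) -> float:
    """Average seconds per page when only table pages take the slow path."""
    return (1 - table_share) * fast_s + table_share * table_s


# Best case: fastest figures quoted for both paths.
best = blended_seconds_per_page(fast_s=0.01, table_s=0.3)
# Worst case: slowest figures quoted for both paths.
worst = blended_seconds_per_page(fast_s=0.05, table_s=3.0)

print(f"pdfmux blended: {best:.3f}s to {worst:.3f}s per page")
print("marker (quoted): 1s to 10s per page")
```

Under these assumptions the blended range works out to roughly 0.04s to 0.35s per page, broadly in line with the 0.05-0.5s/page quoted in the comparison tables.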

Does pdfmux support OCR for scanned PDFs? Yes. Install with pip install "pdfmux[ocr]" for automatic OCR fallback on scanned or image-heavy pages. It uses RapidOCR (CPU-only, no GPU needed). See how pdfmux runs without a GPU or API keys for the full architecture.

Can I use pdfmux commercially? Yes. pdfmux is MIT licensed. Note that PyMuPDF (a dependency) is AGPL-3.0 — consult your legal team about AGPL implications for your specific use case.

Last updated: March 2026