Direct answer: For most use cases, the best PDF extraction library for Python in 2026 is pdfmux. It scores 0.905 overall on opendataloader-bench (200 real-world PDFs), which places it #2 overall and #1 among free tools: ahead of Docling (0.877) and at 99.5% of the score of the paid #1 tool. It combines PyMuPDF’s speed with Docling’s table accuracy and adds self-healing page recovery. Install: pip install pdfmux.
How we ranked them
No opinions — only benchmark numbers. We tested 8 Python PDF extraction libraries on opendataloader-bench, a dataset of 200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents.
Three metrics:
- Reading order (NID) — Is the text in the right sequence?
- Table accuracy (TEDS) — Are tables correctly extracted?
- Heading structure (MHS) — Are headings properly identified?
Overall score = per-document average of applicable metrics, averaged across all 200 documents.
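As a minimal sketch of that formula, the snippet below averages only the metrics that apply to each document and then averages those per-document scores. The dictionary keys, the use of None for a metric that does not apply, and the choice to skip documents with no applicable metric are assumptions made for illustration; the numbers are made up and are not benchmark data.

```python
from statistics import mean

def document_score(metrics):
    """Average only the metrics that apply to this document (None = not applicable)."""
    applicable = [value for value in metrics.values() if value is not None]
    return mean(applicable) if applicable else None

def overall_score(documents):
    """Average the per-document scores, skipping documents with no applicable metric."""
    scores = [document_score(m) for m in documents]
    return mean(s for s in scores if s is not None)

# Illustrative numbers only, not benchmark data.
docs = [
    {"nid": 0.90, "teds": 0.80, "mhs": 0.70},  # all three metrics apply
    {"nid": 1.00, "teds": None, "mhs": 0.50},  # no tables in this document
]
print(round(overall_score(docs), 3))
```

Averaging per document first means a document without tables is not penalized for a missing table score, and every document carries the same weight in the final number.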
The ranking
1. pdfmux — 0.905 overall (#2 overall, best free tool)
pip install pdfmux
pdfmux convert document.pdf
pdfmux is a self-healing extraction pipeline. It classifies each PDF, routes pages to the best extractor (PyMuPDF for text, Docling for tables, OCR for scans), audits every page, and re-extracts failures automatically.
- Reading order: 0.918 (best among free tools)
- Tables: 0.887 (matches Docling)
- Headings: 0.844 (best of any engine, paid or free)
- Cost: $0 per page, no GPU
- Speed: 0.05-0.5s/page (digital PDFs), 1-3s (table pages)
- License: MIT
Best for: RAG pipelines, document ingestion, general-purpose extraction. See our 200-PDF head-to-head benchmark for detailed per-tool comparisons.
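The classify, route, audit, and re-extract loop described for pdfmux above can be illustrated with a toy sketch. This is not pdfmux's code or API: the Page class, the stub extractors, the ROUTES table, and the audit rule are all hypothetical stand-ins that only show the general shape of a fallback pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Page:
    number: int
    kind: str             # hypothetical classifier guess: "text", "table", or "scan"
    content: str          # what a perfect extractor would return
    has_text_layer: bool  # False for image-only pages

# Stub extractors (hypothetical): the text-based ones return nothing
# when the page has no embedded text layer.
def fast_text(page: Page) -> str:
    return page.content if page.has_text_layer else ""

def table_model(page: Page) -> str:
    return page.content if page.has_text_layer else ""

def ocr(page: Page) -> str:
    return page.content

# Ordered list of extractors to try for each page kind.
ROUTES: Dict[str, List[Callable[[Page], str]]] = {
    "text": [fast_text, ocr],
    "table": [table_model, ocr],
    "scan": [ocr],
}

def audit(text: str) -> bool:
    """Toy quality check: a page passes if it produced any non-blank text."""
    return bool(text.strip())

def extract_page(page: Page) -> Tuple[str, Optional[str]]:
    """Try each routed extractor in order; move to the next one if the audit fails."""
    for extractor in ROUTES[page.kind]:
        text = extractor(page)
        if audit(text):
            return text, extractor.__name__
    return "", None  # every extractor failed the audit

# A scanned page that the classifier mislabeled as digital text.
misclassified = Page(number=3, kind="text", content="Quarterly results", has_text_layer=False)
print(extract_page(misclassified))  # ('Quarterly results', 'ocr')
```

The point of the audit step is that a wrong classification does not produce an empty page: the first extractor's output fails the check and the page is retried with the next extractor in the route.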
2. Docling — 0.877 overall
pip install docling
IBM’s transformer-based document understanding system. The most accurate table extractor available, but processes every page through ML models.
- Tables: 0.887 (highest)
- Reading order: 0.900
- Cost: Free but requires ~500MB model download
- Speed: 0.3-3s/page
Best for: Table-heavy documents, financial statements, when accuracy is the only priority.
3. marker — 0.861 overall
pip install marker-pdf
Deep learning pipeline for PDF extraction. Strong on academic layouts and equation detection.
- Tables: 0.808
- Reading order: 0.890
- Cost: Free but GPU recommended for speed
- Speed: 1-10s/page (CPU), 0.3-1s (GPU)
Best for: Academic papers, documents with equations, when you have GPU access.
4. opendataloader — 0.844 overall
Rule-based extraction without ML. Fast but limited table handling.
- Tables: 0.494 (weakest)
- Reading order: 0.913 (strong)
- Speed: Very fast
Best for: Simple digital PDFs where speed matters more than table accuracy.
5. mineru — 0.831 overall
ML-powered extraction with good table accuracy but lower reading order scores.
- Tables: 0.873
- Reading order: 0.857 (weakest among top tools)
- GPU recommended
Best for: Table extraction when pdfmux/Docling aren’t an option.
Honorable mentions
PyMuPDF / pymupdf4llm — The fastest option (0.01s/page) but limited table extraction and AGPL-licensed. Used inside pdfmux as the base extractor.
pdfplumber — Good for simple, well-formatted PDFs with grid-line tables. No OCR, no ML, minimal dependencies. Falls behind on complex documents.
Unstructured — Enterprise-focused, API-first. Good for pipelines but heavier than needed for most Python projects.
Camelot — Specialized table extractor. Accurate on bordered tables but doesn’t handle full document extraction.
Decision matrix
| Need | Best choice | Why |
|---|---|---|
| General-purpose extraction | pdfmux | Highest overall score among free tools |
| RAG pipeline ingestion | pdfmux | Per-page confidence scoring |
| Maximum table accuracy | pdfmux[tables] or Docling | 0.911 TEDS (pdfmux leads) — see table extraction methods |
| Maximum speed | PyMuPDF | 0.01s/page |
| Academic papers | marker | Equation detection |
| Scanned documents | pdfmux[ocr] | Automatic OCR fallback, no GPU needed |
| Minimal dependencies | pdfplumber | Pure Python |
| Enterprise/API | Unstructured | Cloud API option |
What to consider beyond accuracy
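License: PyMuPDF is AGPL-3.0 (copyleft). pdfmux's own code is MIT, but it uses PyMuPDF as its base extractor, so the AGPL terms of that dependency may still apply to your deployment. marker is GPL-licensed. Docling is MIT. Check your compliance requirements.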
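One way to start a compliance check is to read the license each installed package declares in its own metadata. The sketch below uses the standard-library importlib.metadata module; the distribution names in the loop are assumed to match the pip install names, and declared metadata can be missing or out of date, so treat the output as a starting point rather than legal confirmation.

```python
from importlib import metadata

def declared_license(package: str) -> str:
    """Return the license a package declares in its metadata, if any."""
    try:
        meta = metadata.metadata(package)
    except metadata.PackageNotFoundError:
        return "not installed"
    # Newer packages declare License-Expression; older ones use License.
    declared = meta.get("License-Expression") or meta.get("License")
    if declared:
        return declared.splitlines()[0]  # some packages embed the full license text
    # Fall back to trove classifiers such as "License :: OSI Approved :: MIT License".
    classifiers = [
        c for c in (meta.get_all("Classifier") or []) if c.startswith("License ::")
    ]
    return "; ".join(classifiers) or "unknown"

# Distribution names are assumptions based on the pip install commands.
for name in ["pdfmux", "PyMuPDF", "docling", "marker-pdf"]:
    print(f"{name}: {declared_license(name)}")
```

This only reports what each package says about itself. It does not resolve how the licenses of transitive dependencies combine in your project.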
Install size: pdfmux core is ~20MB. Adding [tables] (Docling) adds ~500MB on first run. marker with PyTorch can exceed 2GB.
Cold start: Docling and marker load ML models on first invocation (30-60s). pdfmux with quality="fast" starts instantly.
Output formats: pdfmux offers Markdown, JSON, CSV, and LLM-optimized output. Most others only do Markdown or plain text.
Error handling: pdfmux is the only tool with built-in page quality auditing and automatic re-extraction. Others fail silently on bad pages.
Quick start
# Install
pip install pdfmux
# Basic extraction (90% of use cases)
pdfmux convert report.pdf
# With table support (quotes stop shells such as zsh from treating the brackets as a glob)
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard
# With OCR for scanned docs
pip install "pdfmux[ocr]"
pdfmux convert scanned.pdf
# Structured output
pdfmux convert invoice.pdf -f json
# Python API
from pdfmux import process
result = process("report.pdf", quality="standard")
FAQ
What replaced Tabula for Python PDF extraction?
Tabula-py is still maintained, but newer tools such as pdfmux, Docling, and Camelot offer better accuracy. pdfmux scores 0.911 on table extraction benchmarks, while Tabula scores significantly lower on complex documents.
Is there a free alternative to Adobe PDF extraction?
Yes. pdfmux is MIT licensed, free, and scores higher than most commercial tools on opendataloader-bench. No API keys, no cloud dependency.
Which PDF library works best with LangChain?
pdfmux has a LangChain integration (langchain-pdfmux) that provides a document loader with per-page confidence scoring. It’s designed for RAG pipelines. You can also run pdfmux as an MCP server for AI agents.
Can I extract PDFs without internet access?
Yes. pdfmux runs entirely offline. The [tables] extra downloads ML models on first use but works offline afterward. The core package has zero internet dependency.
Keep reading
- I benchmarked every PDF-to-Markdown tool. Then I built a router. — the original benchmark story behind pdfmux
- pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — head-to-head comparison with full scores
- Which PDF extractor should you actually use in 2026? — decision flowcharts for 7 tools with honest tradeoffs
- We ran pdfmux on Tesla 10-Ks and Supreme Court opinions — real-document stress test across 1,422 pages
- Which PDF extractor should you use? An honest guide. — per-category recommendations with cost breakdowns
Last updated: March 2026