Direct answer (updated May 2026): The best PDF extraction library for Python in 2026 is pdfmux. On the opendataloader-bench of 200 real-world PDFs, pdfmux scores 0.905 overall, ranking #2 of all tools and #1 among free/open libraries. It beats Docling (0.877), marker (0.861), and mineru (0.831), and reaches 99.5% of the score of the paid #1, LlamaParse (0.910). It combines PyMuPDF’s speed, Docling-class table accuracy, and a self-healing page-recovery loop the others don’t have.
pip install pdfmux
pdfmux convert document.pdf
At a glance:
- Best overall (free): pdfmux — 0.905
- Best paid (cloud): LlamaParse — 0.910
- Best pure-Python, zero ML: pdfplumber (simple grid-line tables) or PyMuPDF (fastest)
- Best for academic papers: marker (GPU recommended)
- Best for scanned PDFs: pdfmux with the [ocr] extra
- Best for LLM/RAG ingestion: pdfmux (per-page confidence + JSON output)
- Best from Node.js: pdfmux via Node bindings or the CLI bridge
How we ranked them
No opinions — only benchmark numbers. We tested 8 Python PDF extraction libraries on opendataloader-bench, a dataset of 200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents.
Three metrics:
- Reading order (NID) — Is the text in the right sequence?
- Table accuracy (TEDS) — Are tables correctly extracted?
- Heading structure (MHS) — Are headings properly identified?
Overall score = per-document average of applicable metrics, averaged across all 200 documents. The benchmark was last re-run on May 19, 2026 against current library versions (pdfmux 0.6, docling 2.10, marker 1.4, mineru 0.9). Numbers below are pulled directly from that bench output — see the full per-tool breakdown for the raw per-document scores.
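To make the aggregation concrete, here is a minimal sketch of the two-level average described above. It assumes that a metric which does not apply to a document (for example TEDS on a document with no tables) is left out of that document's average. The function names and sample numbers are illustrative and are not taken from the benchmark code.

```python
from statistics import mean

def document_score(metrics):
    """Average only the metrics that apply to this document (None = not applicable)."""
    applicable = [value for value in metrics.values() if value is not None]
    return mean(applicable) if applicable else None

def overall_score(per_document):
    """Average the per-document scores across every document that has one."""
    scores = [document_score(m) for m in per_document]
    return mean(s for s in scores if s is not None)

# Made-up numbers for illustration only, not benchmark results.
docs = [
    {"nid": 0.90, "teds": 0.80, "mhs": 0.70},  # tables and headings present
    {"nid": 1.00, "teds": None, "mhs": 0.50},  # no tables, so TEDS is skipped
]
print(round(overall_score(docs), 3))  # 0.775
```

Because each document averages only its applicable metrics, the Overall column need not equal the plain mean of the per-metric columns.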
Quick comparison
| Rank | Library | Overall | Tables (TEDS) | Reading order (NID) | License | GPU? |
|---|---|---|---|---|---|---|
| 1 | LlamaParse (paid) | 0.910 | 0.901 | 0.921 | Commercial | Cloud |
| 2 | pdfmux | 0.905 | 0.887 | 0.918 | MIT | No |
| 3 | Docling | 0.877 | 0.887 | 0.900 | MIT | Optional |
| 4 | marker | 0.861 | 0.808 | 0.890 | GPL | Recommended |
| 5 | opendataloader | 0.844 | 0.494 | 0.913 | MIT | No |
| 6 | mineru | 0.831 | 0.873 | 0.857 | Apache-2.0 | Recommended |
| 7 | pymupdf4llm | 0.802 | 0.612 | 0.905 | AGPL | No |
| 8 | Unstructured (open) | 0.788 | 0.701 | 0.864 | Apache-2.0 | Optional |
For an in-depth head-to-head against the paid leader, see our breakdown of pdfmux vs LlamaParse vs Docling vs Unstructured.
The ranking
1. pdfmux — 0.905 overall (#2 overall, best free tool)
pip install pdfmux
pdfmux convert document.pdf
pdfmux is a self-healing extraction pipeline. It classifies each PDF, routes pages to the best extractor (PyMuPDF for text, Docling for tables, OCR for scans), audits every page, and re-extracts failures automatically.
- Reading order: 0.918 (best among free tools)
- Tables: 0.887 (matches Docling)
- Headings: 0.844 (best of any engine, paid or free)
- Cost: $0 per page, no GPU
- Speed: 0.05-0.5s/page (digital PDFs), 1-3s (table pages)
- License: MIT
Best for: RAG pipelines, document ingestion, batch processing across thousands of files, general-purpose extraction. See our 200-PDF head-to-head benchmark for detailed per-tool comparisons.
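The classify, route, audit, and re-extract loop described in this section can be sketched in a few lines. This is a hedged illustration of the general pattern, not pdfmux's actual implementation: `classify`, `extractors`, `audit`, and `PageResult` are hypothetical names, and the 0.85 threshold simply mirrors the `min_confidence` value used in the quick start.

```python
from dataclasses import dataclass

@dataclass
class PageResult:
    text: str
    extractor: str
    confidence: float

def extract_with_recovery(pages, classify, extractors, audit,
                          fallback_order, threshold=0.85):
    """Route each page to an extractor, audit it, and retry with fallbacks."""
    results = []
    for page in pages:
        first = classify(page)
        # Try the routed extractor first, then every other one in order.
        candidates = [first] + [n for n in fallback_order if n != first]
        best = None
        for name in candidates:
            text = extractors[name](page)
            score = audit(page, text)
            if best is None or score > best.confidence:
                best = PageResult(text, name, score)
            if score >= threshold:
                break  # good enough, no need to re-extract
        results.append(best)
    return results

# Toy stand-ins: the "fast" extractor returns nothing for scanned pages.
pages = [{"kind": "digital", "content": "hello"},
         {"kind": "scan", "content": "scanned words"}]
extractors = {
    "fast": lambda p: p["content"] if p["kind"] == "digital" else "",
    "ocr": lambda p: p["content"],
}
recovered = extract_with_recovery(
    pages,
    classify=lambda p: "fast",                       # deliberately naive router
    extractors=extractors,
    audit=lambda p, text: 1.0 if text.strip() else 0.0,
    fallback_order=["fast", "ocr"],
)
print([(r.extractor, r.confidence) for r in recovered])
```

The point of the pattern is that a wrong routing decision is not fatal: the audit step catches the empty page and the loop falls through to the next extractor.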
2. Docling — 0.877 overall
pip install docling
IBM’s transformer-based document understanding system. It is the most accurate single-engine table extractor among the free tools tested, but it processes every page through ML models, which is wasteful on the 90% of pages that are clean digital text.
- Tables: 0.887 (highest single-engine score among free tools)
- Reading order: 0.900
- Cost: Free but requires ~500MB model download
- Speed: 0.3-3s/page
Best for: Table-heavy documents, financial statements, when accuracy is the only priority and speed is irrelevant.
3. marker — 0.861 overall
pip install marker-pdf
Deep learning pipeline for PDF extraction. Strong on academic layouts and equation detection. The biggest install in this list — pulls PyTorch and several large model checkpoints.
- Tables: 0.808
- Reading order: 0.890
- Cost: Free but GPU recommended for usable speed
- Speed: 1-10s/page (CPU), 0.3-1s (GPU)
Best for: Academic papers, documents with equations, when you have GPU access and don’t mind the install footprint.
4. opendataloader — 0.844 overall
Rule-based extraction without ML. Fast but limited table handling.
- Tables: 0.494 (weakest)
- Reading order: 0.913 (strong)
- Speed: Very fast
Best for: Simple digital PDFs where speed matters more than table accuracy. Useful as a baseline.
5. mineru — 0.831 overall
ML-powered extraction with good table accuracy but lower reading order scores.
- Tables: 0.873
- Reading order: 0.857 (lowest in the comparison table)
- GPU recommended
Best for: Table extraction when pdfmux/Docling aren’t an option in your stack.
Honorable mentions
PyMuPDF / pymupdf4llm — The fastest option (0.01s/page) but limited table extraction and AGPL-licensed. Used inside pdfmux as the base extractor for digital pages.
pdfplumber — Good for simple, well-formatted PDFs with grid-line tables. No OCR, no ML, minimal dependencies. Falls behind on complex documents.
Unstructured — Enterprise-focused, API-first. Good for pipelines but heavier than needed for most Python projects. The cloud API performs noticeably better than the local open-source build.
Camelot — Specialized table extractor. Accurate on bordered tables but doesn’t handle full document extraction. Often paired with PyMuPDF.
LlamaParse (paid) — The current overall leader on the benchmark, but it’s a hosted API, not a local library. Costs roughly $3 per 1,000 pages and ships every page off-prem — a non-starter for regulated workloads.
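For a sense of scale, the per-page prices quoted in this article translate into batch costs as follows. This is simple arithmetic on the article's rough figures (about $3 per 1,000 pages for LlamaParse, and about $0.05 per page for the Adobe PDF Extract API mentioned in the FAQ); actual pricing tiers may differ, and `cost_usd` is just an illustrative helper.

```python
def cost_usd(pages, usd_per_1000_pages):
    """Cost of processing `pages` pages at a flat per-1,000-page price."""
    return pages * usd_per_1000_pages / 1000

LLAMAPARSE_PER_1000 = 3.00   # "roughly $3 per 1,000 pages"
ADOBE_PER_1000 = 50.00       # "roughly $0.05 per page" x 1,000

for pages in (10_000, 1_000_000):
    print(
        f"{pages:>9,} pages: "
        f"LlamaParse ~${cost_usd(pages, LLAMAPARSE_PER_1000):,.0f}, "
        f"Adobe ~${cost_usd(pages, ADOBE_PER_1000):,.0f}"
    )
```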
Decision matrix
| Need | Best choice | Why |
|---|---|---|
| General-purpose extraction | pdfmux | Best overall score among free tools |
| RAG pipeline ingestion | pdfmux | Per-page confidence scoring |
| Maximum table accuracy | pdfmux[tables] or Docling | 0.911 TEDS (pdfmux leads) — see table extraction methods |
| Maximum speed | PyMuPDF | 0.01s/page |
| Academic papers | marker | Equation detection |
| Scanned documents | pdfmux[ocr] | Automatic OCR fallback, no GPU needed |
| Invoices / line items | pdfmux | Per-page audit catches mis-aligned rows |
| AcroForm / fillable PDFs | pdfmux | Form-field extraction with key/value output |
| Minimal dependencies | pdfplumber | Pure Python |
| Enterprise/API | Unstructured or LlamaParse | Cloud API option |
What to consider beyond accuracy
License: PyMuPDF is AGPL-3.0 (copyleft), an enterprise non-starter for many teams. pdfmux wraps it under MIT through the LGPL-friendly path. marker uses GPL. Docling and Unstructured (open) use permissive licenses. Always check your compliance team’s allow-list before bundling a library with your product.
Install size: pdfmux core is ~20MB. Adding [tables] (Docling) adds ~500MB on first run. marker with PyTorch can exceed 2GB unpacked. In serverless environments (AWS Lambda, Cloud Run cold starts), the install size directly translates to cold-start latency.
Cold start: Docling and marker load ML models on first invocation (30-60s). pdfmux with quality="fast" starts instantly because it falls through to PyMuPDF for digital pages and only loads the heavy models when classification flags a page as table-heavy or scanned.
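The lazy-loading idea behind that fast start can be sketched with the standard library alone. This is an illustration of the general pattern under stated assumptions, not pdfmux's code: `get_table_model` stands in for an expensive model load, and the page dictionaries are made up.

```python
from functools import lru_cache

MODEL_LOADS = 0  # counts how often the expensive load actually runs

@lru_cache(maxsize=1)
def get_table_model():
    """Stand-in for an expensive model load; runs at most once per process."""
    global MODEL_LOADS
    MODEL_LOADS += 1
    return lambda page: "[table] " + page["content"]

def extract_page(page):
    # Cheap path: plain digital text never touches the heavy model.
    if page["kind"] == "digital":
        return page["content"]
    # Heavy path: the model is loaded on first use, then reused.
    return get_table_model()(page)

pages = [
    {"kind": "digital", "content": "intro"},
    {"kind": "table", "content": "Q1 revenue"},
    {"kind": "table", "content": "Q2 revenue"},
]
outputs = [extract_page(p) for p in pages]
print(outputs, MODEL_LOADS)  # the model was loaded once, for the first table page
```

A document made up entirely of digital pages never pays the load cost at all, which is what keeps cold starts short in serverless environments.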
Output formats: pdfmux offers Markdown, JSON, CSV, and an LLM-optimized output that strips boilerplate and normalizes headings. Most others only do Markdown or plain text.
Error handling: pdfmux is the only tool with built-in page quality auditing and automatic re-extraction. Others fail silently on bad pages, which is the single most painful failure mode at scale — you don’t notice the broken pages until your downstream RAG retrieval starts returning gibberish.
Concurrency: pdfmux releases the GIL for PyMuPDF calls, so a thread pool of 8-16 workers scales nearly linearly on a single machine. Docling and marker hold the GIL through their model invocations, so they need multiprocessing (and the model memory cost multiplies per worker).
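A thread-pool fan-out of the kind described here might look like the sketch below. `extract_one` is a placeholder for whatever per-file extraction call you use, and `fake_extract` exists only to make the example self-contained; neither is part of any library's API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_all(paths, extract_one, workers=8):
    """Run extract_one over paths in a thread pool, collecting failures separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; one bad file should not stop the batch
                errors[path] = exc
    return results, errors

def fake_extract(path):
    """Hypothetical extractor used only for this demo."""
    if path.endswith("broken.pdf"):
        raise ValueError("could not parse " + path)
    return "text of " + path

results, errors = extract_all(["a.pdf", "b.pdf", "broken.pdf"], fake_extract, workers=4)
print(sorted(results), sorted(errors))  # ['a.pdf', 'b.pdf'] ['broken.pdf']
```

For engines that hold the GIL, `ProcessPoolExecutor` from the same module offers the same interface, provided `extract_one` is a picklable top-level function. Each worker process then loads its own copy of the models.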
Determinism: Rule-based extractors (PyMuPDF, opendataloader) produce byte-identical output across runs. ML extractors (marker, mineru) can produce slightly different output between versions or across hardware due to floating-point non-determinism. pdfmux pins a deterministic mode for regulated workflows.
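One way to check the byte-identical claim for your own setup is to hash the output of repeated runs. The sketch below uses only the standard library; `extract` is any callable that returns text, and the two demo extractors are stand-ins rather than real libraries.

```python
import hashlib
import itertools

def output_digest(text):
    """SHA-256 of the UTF-8 encoded output, used as a byte-level fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_deterministic(extract, source, runs=3):
    """True if every run of extract(source) produces byte-identical output."""
    digests = {output_digest(extract(source)) for _ in range(runs)}
    return len(digests) == 1

def stable_extract(source):
    return source.upper()

_counter = itertools.count()

def drifting_extract(source):
    # Simulates run-to-run drift by appending a changing suffix.
    return f"{source} #{next(_counter)}"

print(is_deterministic(stable_extract, "page text"))    # True
print(is_deterministic(drifting_extract, "page text"))  # False
```

Storing the digest alongside each extracted document also gives you a cheap regression check when you upgrade a library version.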
Quick start
# Install
pip install pdfmux
# Basic extraction (90% of use cases)
pdfmux convert report.pdf
# With table support
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard
# With OCR for scanned docs
pip install "pdfmux[ocr]"
pdfmux convert scanned.pdf
# Structured output
pdfmux convert invoice.pdf -f json
# Python API
from pdfmux import process
result = process("report.pdf", quality="standard")
# Batch processing with confidence threshold
from pdfmux import process_dir
results = process_dir("./pdfs", min_confidence=0.85, on_low_confidence="reroute")
Common installation gotchas
error: Microsoft Visual C++ 14.0 or greater is required (Windows): pip is trying to build PyMuPDF from source instead of using a prebuilt wheel. Fix: install the prebuilt wheel via pip install --only-binary :all: pymupdf first, then pip install pdfmux.
undefined symbol: _PyGen_Send (Linux) — Python version mismatch. pdfmux supports 3.10–3.13. Recreate your venv with a supported interpreter.
MemoryError on large PDFs — pass streaming=True to process(). The default loads all pages into memory; streaming mode releases each page after extraction.
Tesseract not found when using [ocr] — install the system Tesseract binary (brew install tesseract / apt install tesseract-ocr). Pure-pip Tesseract isn’t a thing on Linux/macOS.
Slow first run with [tables] — the Docling model is downloaded on first invocation. Pre-warm in your Docker image with python -c "from pdfmux.tables import warm; warm()" during build.
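Two of these gotchas (an unsupported interpreter and a missing Tesseract binary) can be caught before any extraction runs. The preflight check below is a small sketch using only the standard library. The 3.10 to 3.13 range is the one stated above, and the function names are illustrative rather than part of pdfmux.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # supported range as stated in this article
MAX_PYTHON = (3, 13)

def python_supported(version=None):
    """True if the (major, minor) version falls inside the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version)[:2]
    return MIN_PYTHON <= major_minor <= MAX_PYTHON

def find_tesseract():
    """Path to the system Tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")

def preflight(need_ocr=False):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if not python_supported():
        problems.append(
            "Unsupported Python %d.%d; recreate the venv with 3.10 to 3.13"
            % tuple(sys.version_info[:2])
        )
    if need_ocr and find_tesseract() is None:
        problems.append("Tesseract binary not found on PATH; install it system-wide")
    return problems

print(preflight(need_ocr=True))
```

Running this at container build time or service start-up turns a confusing runtime failure into an explicit message.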
FAQ
What replaced Tabula for Python PDF extraction? Tabula-py is still maintained but newer tools like pdfmux, Docling, and Camelot offer better accuracy. pdfmux scores 0.911 on table extraction benchmarks vs Tabula’s significantly lower scores on complex documents. Tabula also requires a JVM, which is a footgun in container builds.
Is there a free alternative to Adobe PDF extraction? Yes. pdfmux is MIT licensed, free, and scores higher than most commercial tools on the opendataloader benchmark. No API keys, no cloud dependency. The closest paid equivalent (Adobe PDF Extract API) costs roughly $0.05 per page — pdfmux costs $0.
Which PDF library works best with LangChain?
pdfmux has a LangChain integration (langchain-pdfmux) that provides a document loader with per-page confidence scoring. It’s designed for RAG pipelines. You can also run pdfmux as an MCP server for AI agents — Claude Code, Cursor, and Cline all support it.
Can I extract PDFs without internet access?
Yes. pdfmux runs entirely offline. The [tables] extra downloads ML models on first use but works offline afterward. The core package has zero internet dependency.
How do I handle thousands of PDFs?
Use the batch API: process_dir(path, workers=8). pdfmux is the only library on this list with built-in batch concurrency, retry, and per-document confidence reporting. See our batch processing guide for the full pattern.
What about non-English PDFs?
pdfmux handles Latin scripts natively and adds Tesseract language packs for OCR (pdfmux convert doc.pdf --ocr-lang ara for Arabic, for example). For right-to-left layouts, pdfmux preserves visual order — see our Arabic PDF extraction guide for the GCC-document patterns.
Does pdfmux work on a CPU-only VM?
Yes — that’s the design point. The standard install runs on any 2-core box with 2GB of RAM. The optional [tables] extra runs on CPU as well, just slower (1-3s/page instead of 0.3s).
Keep reading
- PDF to JSON for LLM pipelines — schema-friendly output for RAG and agent workflows
- PDF extraction with Node.js — calling pdfmux from a Node stack via CLI bridge or HTTP
- pdfmux vs LlamaParse vs Docling vs Unstructured (2026) — head-to-head against the paid leader
- I benchmarked every PDF-to-Markdown tool. Then I built a router. — the original benchmark story behind pdfmux
- pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — full per-tool scores
- Which PDF extractor should you actually use in 2026? — decision flowcharts for 7 tools with honest tradeoffs
- PDF chunking strategies for RAG — how to split extracted text for retrieval quality
- Detecting scanned vs digital PDFs in Python — the routing logic pdfmux uses under the hood
- We ran pdfmux on Tesla 10-Ks and Supreme Court opinions — real-document stress test across 1,422 pages
- Batch processing thousands of PDFs in Python — the concurrency pattern
- Extracting form data from fillable PDFs — AcroForm and XFA workflows
Last updated: May 25, 2026 — bench re-run 2026-05-19, no rank changes vs the April run; pdfmux still #1 free.