Best PDF extraction library for Python in 2026 (benchmarked)

TL;DR: We benchmarked 8 Python PDF libraries on 200 real-world PDFs and ranked them by table accuracy, reading order, and heading structure. Updated May 2026.

Direct answer (updated May 2026): The best PDF extraction library for Python in 2026 is pdfmux. On the opendataloader-bench of 200 real-world PDFs, pdfmux scores 0.903 overall — ranking #2 of all tools and #1 among free/open libraries. It beats Docling (0.877), marker (0.861), and mineru (0.831), and reaches 99.3% of the paid #1 (the opendataloader-hybrid engine, 0.909). It combines PyMuPDF’s speed, Docling-beating table accuracy, and a self-healing page-recovery loop the others don’t have.

pip install pdfmux
pdfmux convert document.pdf

At a glance:

Best overall (free): pdfmux — 0.903
Best paid (hybrid API): opendataloader-hybrid — 0.909
Best pure-Python, zero ML: pdfplumber (simple grid-line tables) or PyMuPDF (fastest)
Best for academic papers: marker (GPU recommended)
Best for scanned PDFs: pdfmux with [ocr] extra
Best for LLM/RAG ingestion: pdfmux (per-page confidence + JSON output)
Best from Node.js: pdfmux via Node bindings or the CLI bridge

How we ranked them

No opinions — only benchmark numbers. We tested 8 Python PDF extraction libraries on opendataloader-bench, a dataset of 200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents.

Three metrics:

Reading order (NID) — Is the text in the right sequence?
Table accuracy (TEDS) — Are tables correctly extracted?
Heading structure (MHS) — Are headings properly identified?

Overall score = per-document average of applicable metrics, averaged across all 200 documents. The benchmark was last re-run on May 19, 2026 against current library versions (pdfmux 0.6, docling 2.10, marker 1.4, mineru 0.9). Numbers below are pulled directly from that bench output — see the full per-tool breakdown for the raw per-document scores.
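The aggregation rule above can be sketched in a few lines. The metric keys and the sample values are made up for illustration; only the averaging rule comes from the text. A metric is marked `None` when it does not apply, for example TEDS on a document with no tables.

```python
from statistics import mean

def document_score(metrics):
    """Average the metrics that apply to one document (None = not applicable)."""
    applicable = [value for value in metrics.values() if value is not None]
    if not applicable:
        raise ValueError("document has no applicable metrics")
    return mean(applicable)

def overall_score(documents):
    """Average the per-document scores across the whole benchmark set."""
    return mean(document_score(metrics) for metrics in documents)

# Illustrative values only, not real benchmark output.
sample = [
    {"nid": 0.90, "teds": 0.80, "mhs": 0.70},  # all three metrics apply
    {"nid": 1.00, "teds": None, "mhs": 0.50},  # no tables in this document
]
print(round(overall_score(sample), 3))
```

Because the average is taken per document first, a tool's overall score does not have to equal the plain mean of its three column averages.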

Quick comparison

Rank	Library	Overall	Tables (TEDS)	Reading order	License	GPU?
1	opendataloader-hybrid (paid)	0.909	0.928	0.935	Commercial	Cloud
2	pdfmux	0.903	0.911	0.920	MIT	No
3	Docling	0.877	0.887	0.900	MIT	Optional
4	marker	0.861	0.808	0.890	GPL	Recommended
5	opendataloader	0.844	0.494	0.913	MIT	No
6	mineru	0.831	0.873	0.857	Apache-2.0	Recommended
7	pymupdf4llm	0.802	0.612	0.905	AGPL	No
8	Unstructured (open)	0.788	0.701	0.864	Apache-2.0	Optional
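
If you want to re-derive the headline figures yourself, the sketch below re-ranks the overall scores from the table above and computes pdfmux's share of the top score. The numbers are copied from the table; the variable names are ours.

```python
# Overall scores copied from the comparison table.
overall = {
    "opendataloader-hybrid": 0.909,
    "pdfmux": 0.903,
    "Docling": 0.877,
    "marker": 0.861,
    "opendataloader": 0.844,
    "mineru": 0.831,
    "pymupdf4llm": 0.802,
    "Unstructured (open)": 0.788,
}

# Sort library names by score, highest first.
ranking = sorted(overall, key=overall.get, reverse=True)

# pdfmux's score as a fraction of the top (paid) score.
share = overall["pdfmux"] / overall["opendataloader-hybrid"]
print(ranking[:2], f"{share:.1%}")
```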

For an in-depth head-to-head against the paid options, see our breakdown of pdfmux vs LlamaParse vs Docling vs Unstructured.

The ranking

1. pdfmux — 0.903 overall (#2 overall, best free tool)

pip install pdfmux
pdfmux convert document.pdf

pdfmux is a self-healing extraction pipeline. It classifies each PDF, routes pages to the best extractor (PyMuPDF for text, Docling for tables, OCR for scans), audits every page, and re-extracts failures automatically.
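
The classify, route, audit, re-extract loop can be pictured roughly as below. This is not pdfmux's source code: the function name, the page-kind labels, the 0.85 threshold, and the fallback order are all assumptions made for illustration, and the extractors are passed in as plain callables so the sketch runs without any PDF library.

```python
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[bytes], str]

def extract_page(
    page: bytes,
    classify: Callable[[bytes], str],
    extractors: Dict[str, Extractor],
    audit: Callable[[str], float],
    fallback_order: List[str],
    threshold: float = 0.85,  # hypothetical confidence cut-off
) -> Tuple[str, float, str]:
    """Route one page, audit the result, and retry with other extractors.

    Returns (text, confidence, extractor_name) for the best attempt.
    """
    first = classify(page)
    # Try the routed extractor first, then the others in fallback order.
    order = [first] + [name for name in fallback_order if name != first]
    best: Tuple[str, float, str] = ("", -1.0, "")
    for name in order:
        text = extractors[name](page)
        confidence = audit(text)
        if confidence > best[1]:
            best = (text, confidence, name)
        if confidence >= threshold:
            break  # good enough, stop re-extracting
    return best

# Stub components: a "fast" extractor that returns nothing and an "ocr" one that works.
extractors = {"fast": lambda page: "", "ocr": lambda page: "Invoice total: 42"}
audit = lambda text: 1.0 if text.strip() else 0.0
result = extract_page(b"...", lambda page: "fast", extractors, audit, ["fast", "ocr"])
print(result)
```

Keeping the best attempt, rather than the last one, means a page that never clears the threshold still returns its most usable text along with a low confidence score.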

Reading order: 0.920 (best among free tools)
Tables: 0.911 (beats Docling’s 0.887)
Headings: 0.847 (best of any engine, paid or free)
Cost: $0 per page, no GPU
Speed: 0.05-0.5s/page (digital PDFs), 1-3s (table pages)
License: MIT

Best for: RAG pipelines, document ingestion, batch processing across thousands of files, general-purpose extraction. See our 200-PDF head-to-head benchmark for detailed per-tool comparisons.

2. Docling — 0.877 overall

pip install docling

IBM’s transformer-based document understanding system. It is the most accurate single-engine table extractor in this benchmark, but it runs every page through ML models, which is wasteful on the 90% of pages that are clean digital text.

Tables: 0.887 (highest single-engine score)
Reading order: 0.900
Cost: Free but requires ~500MB model download
Speed: 0.3-3s/page

Best for: Table-heavy documents, financial statements, when accuracy is the only priority and speed is irrelevant.

3. marker — 0.861 overall

pip install marker-pdf

Deep learning pipeline for PDF extraction. Strong on academic layouts and equation detection. The biggest install in this list — pulls PyTorch and several large model checkpoints.

Tables: 0.808
Reading order: 0.890
Cost: Free but GPU recommended for usable speed
Speed: 1-10s/page (CPU), 0.3-1s (GPU)

Best for: Academic papers, documents with equations, when you have GPU access and don’t mind the install footprint.

4. opendataloader — 0.844 overall

Rule-based extraction without ML. Fast but limited table handling.

Tables: 0.494 (weakest)
Reading order: 0.913 (strong)
Speed: Very fast

Best for: Simple digital PDFs where speed matters more than table accuracy. Useful as a baseline.

5. mineru — 0.831 overall

ML-powered extraction with good table accuracy but lower reading order scores.

Tables: 0.873
Reading order: 0.857 (weakest among top tools)
GPU recommended

Best for: Table extraction when pdfmux/Docling aren’t an option in your stack.

Honorable mentions

PyMuPDF / pymupdf4llm — The fastest option (0.01s/page) but limited table extraction and AGPL-licensed. Used inside pdfmux as the base extractor for digital pages.

pdfplumber — Good for simple, well-formatted PDFs with grid-line tables. No OCR, no ML, minimal dependencies. Falls behind on complex documents.
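
pdfplumber's table API is small enough to show in full. The sketch below relies on `pdfplumber.open` and `Page.extract_tables`, which return each table as a list of rows; the `rows_to_records` helper is our own addition for mapping data rows onto a header row, and it assumes the first row of each table is the header.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]

def rows_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as the header and map each later row onto it."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in table[1:]
    ]

def tables_from_pdf(path: str) -> List[List[Dict[str, str]]]:
    """Extract every table in the PDF as a list of records."""
    import pdfplumber  # imported lazily so the helper above has no dependencies

    records = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.append(rows_to_records(table))
    return records
```

Empty cells come back as `None` from pdfplumber, so the helper normalizes them to empty strings before building each record.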

Unstructured — Enterprise-focused, API-first. Good for pipelines but heavier than needed for most Python projects. The cloud API performs noticeably better than the local open-source build.

Camelot — Specialized table extractor. Accurate on bordered tables but doesn’t handle full document extraction. Often paired with PyMuPDF.

LlamaParse (paid) — A hosted parsing API, not a local library, and not part of opendataloader-bench. It’s the best-known paid option, but costs roughly $3 per 1,000 pages and ships every page off-prem — a non-starter for regulated workloads.

Decision matrix

Need	Best choice	Why
General-purpose extraction	pdfmux	Best overall for free
RAG pipeline ingestion	pdfmux	Per-page confidence scoring
Maximum table accuracy	pdfmux[tables] or Docling	0.911 TEDS for pdfmux, the highest among free tools; see table extraction methods
Maximum speed	PyMuPDF	0.01s/page
Academic papers	marker	Equation detection
Scanned documents	pdfmux[ocr]	Automatic OCR fallback, no GPU needed
Invoices / line items	pdfmux	Per-page audit catches mis-aligned rows
AcroForm / fillable PDFs	pdfmux	Form-field extraction with key/value output
Minimal dependencies	pdfplumber	Pure Python
Enterprise/API	Unstructured or LlamaParse	Cloud API option

What to consider beyond accuracy

License: PyMuPDF is AGPL-3.0 (copyleft), an enterprise non-starter for many teams. pdfmux's own code is MIT-licensed, but it uses PyMuPDF as its base extractor for digital pages, so confirm how the AGPL terms (or Artifex's commercial PyMuPDF license) apply to your deployment. marker uses GPL. Docling and Unstructured (open) use permissive licenses. Always check your compliance team’s allow-list before bundling a library with your product.

Install size: pdfmux core is ~20MB. Adding [tables] (Docling) adds ~500MB on first run. marker with PyTorch can exceed 2GB unpacked. In serverless environments (AWS Lambda, Cloud Run cold starts), the install size directly translates to cold-start latency.

Cold start: Docling and marker load ML models on first invocation (30-60s). pdfmux with quality="fast" starts instantly because it falls through to PyMuPDF for digital pages and only loads the heavy models when classification flags a page as table-heavy or scanned.
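
The lazy-loading behaviour described here is a general pattern and can be sketched with `functools.lru_cache`. The loader below is a stand-in (a real one would read model weights from disk), and the page-kind labels are assumptions; none of this is pdfmux's actual code.

```python
from functools import lru_cache

load_events = []  # records each time the expensive load actually runs

@lru_cache(maxsize=1)
def get_table_model() -> dict:
    """Load the heavy model once, on first use, and reuse it afterwards."""
    load_events.append("loaded")
    return {"name": "stand-in table model"}  # placeholder for real weights

def extract(page_kind: str, text: str) -> str:
    if page_kind == "digital":
        return text  # fast path: the model is never touched
    model = get_table_model()  # slow path: triggers the one-time load
    return f"[{model['name']}] {text}"

extract("digital", "plain paragraph")  # no load yet
extract("table", "Q1 | Q2")            # first table page pays the load cost
extract("table", "Q3 | Q4")            # later table pages reuse the cached model
print(len(load_events))
```

A process that only ever sees digital pages never pays the model-loading cost, which is the property that matters for serverless cold starts.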

Output formats: pdfmux offers Markdown, JSON, CSV, and an LLM-optimized output that strips boilerplate and normalizes headings. Most others only do Markdown or plain text.

Error handling: pdfmux is the only tool with built-in page quality auditing and automatic re-extraction. Others fail silently on bad pages, which is the single most painful failure mode at scale — you don’t notice the broken pages until your downstream RAG retrieval starts returning gibberish.
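
If your extractor has no built-in audit, a cheap text-quality check after extraction catches many silently broken pages. The heuristic below is an illustration, not pdfmux's audit: the 50-character minimum and the 0.7 ratio are arbitrary assumptions to tune on your own documents, and genuinely blank pages will also be flagged.

```python
def page_looks_broken(text: str, min_chars: int = 50, min_ratio: float = 0.7) -> bool:
    """Flag pages that are nearly empty or dominated by non-word characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # almost no text came back
    # Count letters, digits, and whitespace; everything else is suspicious.
    wordlike = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return wordlike / len(stripped) < min_ratio

good = "The quarterly revenue increased by twelve percent compared to last year."
garbled = "\ufffd" * 80  # a run of Unicode replacement characters
print(page_looks_broken(good), page_looks_broken(garbled))
```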

Concurrency: pdfmux releases the GIL for PyMuPDF calls, so a thread pool of 8-16 workers scales nearly linearly on a single machine. Docling and marker hold the GIL through their model invocations, so they need multiprocessing (and the model memory cost multiplies per worker).
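
The thread-pool pattern looks roughly like this. `extract_one` is a placeholder for whatever per-file call you use; threads only help when that call releases the GIL or waits on I/O, as the paragraph above says of pdfmux's PyMuPDF calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

def extract_many(
    paths: Iterable[str],
    extract_one: Callable[[str], str],
    workers: int = 8,
) -> Dict[str, str]:
    """Run extract_one over all paths on a thread pool, keyed by path."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(extract_one, paths)))

# Stand-in extractor so the sketch runs without any PDF files.
demo = extract_many(["a.pdf", "b.pdf"], lambda path: path.upper(), workers=2)
print(demo)
```

For extractors that hold the GIL, the same shape works with `ProcessPoolExecutor`, provided `extract_one` is a picklable top-level function rather than a lambda.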

Determinism: Rule-based extractors (PyMuPDF, opendataloader) produce byte-identical output across runs. ML extractors (marker, mineru) can produce slightly different output between versions or across hardware due to floating-point non-determinism. pdfmux pins a deterministic mode for regulated workflows.
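
A simple way to test whether an extractor is deterministic on your own hardware is to hash its output over several runs. The helper below is a generic sketch; `extract` stands for any zero-argument callable that returns the extracted text.

```python
import hashlib
from typing import Callable

def is_deterministic(extract: Callable[[], str], runs: int = 3) -> bool:
    """Return True if repeated extractions produce byte-identical output."""
    digests = {
        hashlib.sha256(extract().encode("utf-8")).hexdigest()
        for _ in range(runs)
    }
    return len(digests) == 1

print(is_deterministic(lambda: "same text every time"))
```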

Quick start

# Install
pip install pdfmux

# Basic extraction (90% of use cases)
pdfmux convert report.pdf

# With table support
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard

# With OCR for scanned docs
pip install "pdfmux[ocr]"
pdfmux convert scanned.pdf

# Structured output
pdfmux convert invoice.pdf -f json

# Python API
from pdfmux import process
result = process("report.pdf", quality="standard")

# Batch processing with confidence threshold
from pdfmux import process_dir
results = process_dir("./pdfs", min_confidence=0.85, on_low_confidence="reroute")

Common installation gotchas

error: Microsoft Visual C++ 14.0 or greater is required (Windows): pip could not find a prebuilt PyMuPDF wheel for your Python version and fell back to compiling from source. Fix: install the prebuilt wheel first with pip install --only-binary :all: pymupdf, then pip install pdfmux.

undefined symbol: _PyGen_Send (Linux) — Python version mismatch. pdfmux supports 3.10–3.13. Recreate your venv with a supported interpreter.
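
A quick guard at the top of a setup script can catch an unsupported interpreter before the install fails. The version bounds come from the range quoted above; the function name is ours.

```python
import sys

SUPPORTED_MIN = (3, 10)
SUPPORTED_MAX = (3, 13)

def interpreter_supported(version=None) -> bool:
    """Check a (major, minor) version against the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX

if not interpreter_supported():
    print(f"Unsupported Python {sys.version_info.major}.{sys.version_info.minor}")
```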

MemoryError on large PDFs — pass streaming=True to process(). The default loads all pages into memory; streaming mode releases each page after extraction.
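
The difference between the default and streaming modes is the familiar list-versus-generator trade-off. The sketch below is generic Python rather than pdfmux internals; `load_page` is a hypothetical callable that returns the text of one page.

```python
from typing import Callable, Iterator, List

def extract_all(load_page: Callable[[int], str], n_pages: int) -> List[str]:
    """Eager mode: every page's text is held in memory at once."""
    return [load_page(i) for i in range(n_pages)]

def extract_streaming(load_page: Callable[[int], str], n_pages: int) -> Iterator[str]:
    """Streaming mode: yield one page at a time so earlier pages can be freed."""
    for i in range(n_pages):
        yield load_page(i)

loaded = []  # tracks which pages have actually been read

def load_page(i: int) -> str:
    loaded.append(i)
    return f"page {i}"

stream = extract_streaming(load_page, 1000)
first = next(stream)  # only page 0 has been read at this point
print(first, loaded)
```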

Tesseract not found when using [ocr] — install the system Tesseract binary (brew install tesseract / apt install tesseract-ocr). Pure-pip Tesseract isn’t a thing on Linux/macOS.
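
You can check for the system binary up front instead of waiting for the first OCR page to fail. This uses only the standard library's `shutil.which`; the function name is ours.

```python
import shutil

def binary_available(name: str = "tesseract") -> bool:
    """Return True if the named executable is on PATH."""
    return shutil.which(name) is not None

if not binary_available():
    print("tesseract not found on PATH; install it before using the [ocr] extra")
```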

Slow first run with [tables] — the Docling model is downloaded on first invocation. Pre-warm in your Docker image with python -c "from pdfmux.tables import warm; warm()" during build.

FAQ

What replaced Tabula for Python PDF extraction? Tabula-py is still maintained but newer tools like pdfmux, Docling, and Camelot offer better accuracy. pdfmux scores 0.911 on table extraction benchmarks vs Tabula’s significantly lower scores on complex documents. Tabula also requires a JVM, which is a footgun in container builds.

Is there a free alternative to Adobe PDF extraction? Yes. pdfmux is MIT licensed, free, and scores within 1% of the one paid engine on opendataloader-bench (0.903 vs 0.909). No API keys, no cloud dependency. The closest paid equivalent (Adobe PDF Extract API) costs roughly $0.05 per page; pdfmux costs $0.

Which PDF library works best with LangChain? pdfmux has a LangChain integration (langchain-pdfmux) that provides a document loader with per-page confidence scoring. It’s designed for RAG pipelines. You can also run pdfmux as an MCP server for AI agents — Claude Code, Cursor, and Cline all support it.

Can I extract PDFs without internet access? Yes. pdfmux runs entirely offline. The [tables] extra downloads ML models on first use but works offline afterward. The core package has zero internet dependency.

How do I handle thousands of PDFs? Use the batch API: process_dir(path, workers=8). pdfmux is the only library on this list with built-in batch concurrency, retry, and per-document confidence reporting. See our batch processing guide for the full pattern.

What about non-English PDFs? pdfmux handles Latin scripts natively and adds Tesseract language packs for OCR (pdfmux convert doc.pdf --ocr-lang ara for Arabic, for example). For right-to-left layouts, pdfmux preserves visual order — see our Arabic PDF extraction guide for the GCC-document patterns.

Does pdfmux work on a CPU-only VM? Yes — that’s the design point. The standard install runs on any 2-core box with 2GB of RAM. The optional [tables] extra runs on CPU as well, just slower (1-3s/page instead of 0.3s).

Keep reading

PDF to JSON for LLM pipelines — schema-friendly output for RAG and agent workflows
PDF extraction with Node.js — calling pdfmux from a Node stack via CLI bridge or HTTP
pdfmux vs LlamaParse vs Docling vs Unstructured (2026) — head-to-head against the paid options
I benchmarked every PDF-to-Markdown tool. Then I built a router. — the original benchmark story behind pdfmux
pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — full per-tool scores
Which PDF extractor should you actually use in 2026? — decision flowcharts for 7 tools with honest tradeoffs
PDF chunking strategies for RAG — how to split extracted text for retrieval quality
Detecting scanned vs digital PDFs in Python — the routing logic pdfmux uses under the hood
We ran pdfmux on Tesla 10-Ks and Supreme Court opinions — real-document stress test across 1,422 pages
Batch processing thousands of PDFs in Python — the concurrency pattern
Extracting form data from fillable PDFs — AcroForm and XFA workflows

Last updated: May 25, 2026 — bench re-run 2026-05-19, no rank changes vs the April run; pdfmux still #1 free.