Direct answer (updated May 2026): The best PDF extraction library for Python in 2026 is pdfmux. On the opendataloader-bench of 200 real-world PDFs, pdfmux scores 0.905 overall — ranking #2 of all tools and #1 among free/open libraries. It beats Docling (0.877), marker (0.861), and mineru (0.831), and reaches 99.5% of the paid #1 (LlamaParse). It combines PyMuPDF’s speed, Docling-class table accuracy, and a self-healing page-recovery loop the others don’t have.

pip install pdfmux
pdfmux convert document.pdf

At a glance:

  • Best overall (free): pdfmux — 0.905
  • Best paid (cloud): LlamaParse — 0.910
  • Best lightweight, zero ML: pdfplumber (pure Python, simple grid-line tables) or PyMuPDF (fastest)
  • Best for academic papers: marker (GPU recommended)
  • Best for scanned PDFs: pdfmux with [ocr] extra
  • Best for LLM/RAG ingestion: pdfmux (per-page confidence + JSON output)
  • Best from Node.js: pdfmux via Node bindings or the CLI bridge

How we ranked them

No opinions — only benchmark numbers. We tested 8 Python PDF extraction libraries on opendataloader-bench, a dataset of 200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents.

Three metrics:

  • Reading order (NID) — Is the text in the right sequence?
  • Table accuracy (TEDS) — Are tables correctly extracted?
  • Heading structure (MHS) — Are headings properly identified?

Overall score = per-document average of applicable metrics, averaged across all 200 documents. The benchmark was last re-run on May 19, 2026 against current library versions (pdfmux 0.6, docling 2.10, marker 1.4, mineru 0.9). Numbers below are pulled directly from that bench output — see the full per-tool breakdown for the raw per-document scores.
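The scoring rule above (average the applicable metrics per document, then average across documents) can be sketched in a few lines. This is an illustration of the arithmetic only, not the benchmark's own code; the function names, the dictionary keys, and the sample values are assumptions made up for the example.

```python
from statistics import mean

def document_score(metrics):
    """Average the metrics that apply to one document (None = not applicable)."""
    applicable = [v for v in metrics.values() if v is not None]
    return mean(applicable) if applicable else None

def overall_score(documents):
    """Average the per-document scores, skipping documents with no applicable metric."""
    scores = [s for s in map(document_score, documents) if s is not None]
    return mean(scores)

# Made-up values for illustration, not benchmark data.
docs = [
    {"nid": 0.90, "teds": 0.80, "mhs": 0.70},  # text, tables, and headings present
    {"nid": 1.00, "teds": None, "mhs": 0.50},  # no tables, so TEDS does not apply
]
print(round(overall_score(docs), 3))
```

Because the average is taken per document first, a document without tables is not penalized for a missing table score, and every document carries the same weight in the final number.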

Quick comparison

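| Rank | Library | Overall | Tables (TEDS) | Reading order | License | GPU? |
|------|---------|---------|---------------|---------------|---------|------|
| 1 | LlamaParse (paid) | 0.910 | 0.901 | 0.921 | Commercial | Cloud |
| 2 | pdfmux | 0.905 | 0.887 | 0.918 | MIT | No |
| 3 | Docling | 0.877 | 0.887 | 0.900 | MIT | Optional |
| 4 | marker | 0.861 | 0.808 | 0.890 | GPL | Recommended |
| 5 | opendataloader | 0.844 | 0.494 | 0.913 | MIT | No |
| 6 | mineru | 0.831 | 0.873 | 0.857 | Apache-2.0 | Recommended |
| 7 | pymupdf4llm | 0.802 | 0.612 | 0.905 | AGPL | No |
| 8 | Unstructured (open) | 0.788 | 0.701 | 0.864 | Apache-2.0 | Optional |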

For an in-depth head-to-head against the paid leader, see our breakdown of pdfmux vs LlamaParse vs Docling vs Unstructured.

The ranking

1. pdfmux — 0.905 overall (#2 overall, best free tool)

pip install pdfmux
pdfmux convert document.pdf

pdfmux is a self-healing extraction pipeline. It classifies each PDF, routes pages to the best extractor (PyMuPDF for text, Docling for tables, OCR for scans), audits every page, and re-extracts failures automatically.

  • Reading order: 0.918 (best among free tools)
  • Tables: 0.887 (matches Docling)
  • Headings: 0.844 (best of any engine, paid or free)
  • Cost: $0 per page, no GPU
  • Speed: 0.05-0.5s/page (digital PDFs), 1-3s (table pages)
  • License: MIT

Best for: RAG pipelines, document ingestion, batch processing across thousands of files, general-purpose extraction. See our 200-PDF head-to-head benchmark for detailed per-tool comparisons.
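The route, audit, and re-extract idea described for pdfmux above can be illustrated with a simplified fallback chain. This is a sketch of the general pattern, not pdfmux's implementation: `PageResult`, `extract_page`, the stand-in extractors, and the toy audit are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PageResult:
    text: str
    extractor: str
    confidence: float

Extractor = Callable[[bytes], str]

def extract_page(page: bytes,
                 extractors: list[tuple[str, Extractor]],
                 audit: Callable[[str], float],
                 threshold: float = 0.8) -> PageResult:
    """Try extractors in order and stop at the first result that passes the audit.

    If none passes, return the highest-scoring attempt so the caller still
    gets text plus a confidence value it can act on.
    """
    best = None
    for name, extract in extractors:
        text = extract(page)
        score = audit(text)
        result = PageResult(text, name, score)
        if score >= threshold:
            return result
        if best is None or score > best.confidence:
            best = result
    if best is None:
        raise ValueError("no extractors configured")
    return best

# Stand-in extractors for illustration only.
def fast_text(page: bytes) -> str:
    return ""  # behaves like a scanned page with no text layer

def ocr(page: bytes) -> str:
    return "Invoice 2026-001 total 42"

def toy_audit(text: str) -> float:
    return 1.0 if text.strip() else 0.0  # empty text fails, anything else passes

result = extract_page(b"page-bytes", [("fast", fast_text), ("ocr", ocr)], toy_audit)
print(result.extractor, result.confidence)
```

Ordering the extractors from cheapest to most expensive means the slow path only runs on pages where the fast path fails its audit.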

2. Docling — 0.877 overall

pip install docling

IBM’s transformer-based document understanding system. The most accurate single-engine table extractor available, but processes every page through ML models — which is wasteful on the 90% of pages that are clean digital text.

  • Tables: 0.887 (highest single-engine score)
  • Reading order: 0.900
  • Cost: Free but requires ~500MB model download
  • Speed: 0.3-3s/page

Best for: Table-heavy documents, financial statements, when accuracy is the only priority and speed is irrelevant.

3. marker — 0.861 overall

pip install marker-pdf

Deep learning pipeline for PDF extraction. Strong on academic layouts and equation detection. The biggest install in this list — pulls PyTorch and several large model checkpoints.

  • Tables: 0.808
  • Reading order: 0.890
  • Cost: Free but GPU recommended for usable speed
  • Speed: 1-10s/page (CPU), 0.3-1s (GPU)

Best for: Academic papers, documents with equations, when you have GPU access and don’t mind the install footprint.

4. opendataloader — 0.844 overall

Rule-based extraction without ML. Fast but limited table handling.

  • Tables: 0.494 (weakest)
  • Reading order: 0.913 (strong)
  • Speed: Very fast

Best for: Simple digital PDFs where speed matters more than table accuracy. Useful as a baseline.

5. mineru — 0.831 overall

ML-powered extraction with good table accuracy but lower reading order scores.

  • Tables: 0.873
  • Reading order: 0.857 (weakest among top tools)
  • GPU recommended

Best for: Table extraction when pdfmux/Docling aren’t an option in your stack.

Honorable mentions

PyMuPDF / pymupdf4llm — The fastest option (0.01s/page) but limited table extraction and AGPL-licensed. Used inside pdfmux as the base extractor for digital pages.

pdfplumber — Good for simple, well-formatted PDFs with grid-line tables. No OCR, no ML, minimal dependencies. Falls behind on complex documents.

Unstructured — Enterprise-focused, API-first. Good for pipelines but heavier than needed for most Python projects. The cloud API performs noticeably better than the local open-source build.

Camelot — Specialized table extractor. Accurate on bordered tables but doesn’t handle full document extraction. Often paired with PyMuPDF.

LlamaParse (paid) — The current overall leader on the benchmark, but it’s a hosted API, not a local library. Costs roughly $3 per 1,000 pages and ships every page off-prem — a non-starter for regulated workloads.

Decision matrix

| Need | Best choice | Why |
|------|-------------|-----|
| General-purpose extraction | pdfmux | Best overall for free |
| RAG pipeline ingestion | pdfmux | Per-page confidence scoring |
| Maximum table accuracy | pdfmux[tables] or Docling | 0.911 TEDS (pdfmux leads); see table extraction methods |
| Maximum speed | PyMuPDF | 0.01s/page |
| Academic papers | marker | Equation detection |
| Scanned documents | pdfmux[ocr] | Automatic OCR fallback, no GPU needed |
| Invoices / line items | pdfmux | Per-page audit catches mis-aligned rows |
| AcroForm / fillable PDFs | pdfmux | Form-field extraction with key/value output |
| Minimal dependencies | pdfplumber | Pure Python |
| Enterprise/API | Unstructured or LlamaParse | Cloud API option |

What to consider beyond accuracy

License: PyMuPDF is AGPL-3.0 (copyleft), an enterprise non-starter for many teams. pdfmux's own code is MIT-licensed, but it uses PyMuPDF as its base extractor for digital pages, so check how the AGPL terms (or a commercial PyMuPDF license from Artifex) apply to the way you distribute it. marker uses GPL. Docling and Unstructured (open) use permissive licenses. Always check your compliance team's allow-list before bundling a library with your product.

Install size: pdfmux core is ~20MB. Adding [tables] (Docling) adds ~500MB on first run. marker with PyTorch can exceed 2GB unpacked. In serverless environments (AWS Lambda, Cloud Run cold starts), the install size directly translates to cold-start latency.

Cold start: Docling and marker load ML models on first invocation (30-60s). pdfmux with quality="fast" starts instantly because it falls through to PyMuPDF for digital pages and only loads the heavy models when classification flags a page as table-heavy or scanned.
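The cold-start behavior described above is an instance of lazy loading: pay for the heavy model only when a page needs it. Here is a minimal sketch of that pattern using the standard library. It makes no claim about how pdfmux is implemented; `get_table_model`, `extract`, and the placeholder model are hypothetical.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts expensive loads, for demonstration only

@lru_cache(maxsize=1)
def get_table_model():
    """Load the heavy model once, on first use, and reuse it afterwards."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Placeholder for an expensive step such as reading model weights from disk.
    return {"name": "table-model"}

def extract(page_kind: str, text: str) -> str:
    if page_kind == "table":
        model = get_table_model()  # the first table page pays the load cost
        return f"[{model['name']}] {text}"
    return text  # plain text pages never touch the model

extract("text", "hello")
loads_after_text = LOAD_COUNT      # still 0: nothing heavy was loaded
extract("table", "a | b")
extract("table", "c | d")
loads_after_tables = LOAD_COUNT    # 1: the model was loaded once and cached
print(loads_after_text, loads_after_tables)
```

In a serverless function the same idea keeps cold starts short for documents that contain only digital text, at the cost of a slower first table page.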

Output formats: pdfmux offers Markdown, JSON, CSV, and an LLM-optimized output that strips boilerplate and normalizes headings. Most others only do Markdown or plain text.

Error handling: pdfmux is the only tool with built-in page quality auditing and automatic re-extraction. Others fail silently on bad pages, which is the single most painful failure mode at scale — you don’t notice the broken pages until your downstream RAG retrieval starts returning gibberish.
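To make "page quality auditing" concrete, here is one possible heuristic for flagging pages that extracted badly. It is a rough sketch under stated assumptions, not the audit pdfmux runs: the function name, the 50-character minimum, and the scoring formula are all invented for illustration.

```python
import string

def page_confidence(text: str, min_chars: int = 50) -> float:
    """Score extracted page text from 0.0 (likely broken) to 1.0 (looks fine)."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    # Count letters, digits, whitespace, and ASCII punctuation as "clean".
    # Everything else lowers the score: U+FFFD replacement characters and
    # control characters, but also legitimate symbols such as currency signs.
    clean = sum(
        ch.isalnum() or ch.isspace() or ch in string.punctuation
        for ch in stripped
    )
    clean_ratio = clean / len(stripped)
    # Very short pages are suspicious, so scale the score until min_chars is reached.
    length_factor = min(1.0, len(stripped) / min_chars)
    return clean_ratio * length_factor

good_page = "The quick brown fox jumps over the lazy dog. " * 3
bad_page = "\ufffd" * 100
print(page_confidence(good_page), page_confidence(bad_page))
```

A score like this is only useful with a threshold and a fallback: pages below the threshold get retried with another extractor or surfaced for review instead of flowing silently into a retrieval index.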

Concurrency: pdfmux releases the GIL for PyMuPDF calls, so a thread pool of 8-16 workers scales nearly linearly on a single machine. Docling and marker hold the GIL through their model invocations, so they need multiprocessing (and the model memory cost multiplies per worker).
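The thread-versus-process trade-off above maps directly onto the two executors in Python's `concurrent.futures`. The sketch below shows the switch; `extract_one` is a placeholder standing in for a real extraction call, and `run_batch` is a hypothetical helper, not part of any library named in this article.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def extract_one(path: str) -> tuple[str, int]:
    """Placeholder for a real extraction call; returns (path, length of the path)."""
    return path, len(path)

def run_batch(paths, workers=8, use_processes=False):
    """Run extract_one over paths, preserving input order in the results.

    Threads are cheap and share memory, which suits extractors that release
    the GIL. Processes sidestep the GIL but each worker holds its own copy
    of any loaded model.
    """
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        return list(pool.map(extract_one, paths))

if __name__ == "__main__":
    print(run_batch(["a.pdf", "bb.pdf"], workers=2))
```

When `use_processes=True`, the worker function must be defined at module level so it can be pickled, and on platforms that spawn new interpreters the call needs to sit behind the `if __name__ == "__main__":` guard shown here.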

Determinism: Rule-based extractors (PyMuPDF, opendataloader) produce byte-identical output across runs. ML extractors (marker, mineru) can produce slightly different output between versions or across hardware due to floating-point non-determinism. pdfmux pins a deterministic mode for regulated workflows.
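Whether an extractor is byte-identical across runs is easy to verify yourself by hashing its output. A minimal sketch, assuming the extractor can be wrapped as a function that takes a source and returns text; `output_digest`, `is_deterministic`, and the two demo extractors are hypothetical names for the example.

```python
import hashlib
import itertools

def output_digest(text: str) -> str:
    """SHA-256 of the UTF-8 encoded output, so any byte difference changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_deterministic(extract, source, runs: int = 3) -> bool:
    """Run the extractor several times and check that every digest matches."""
    digests = {output_digest(extract(source)) for _ in range(runs)}
    return len(digests) == 1

# Demo extractors: one stable, one that changes on every call.
def stable(source: str) -> str:
    return source.upper()

_counter = itertools.count()

def flaky(source: str) -> str:
    return f"{source}-{next(_counter)}"

print(is_deterministic(stable, "abc"), is_deterministic(flaky, "abc"))
```

Storing the digest next to each extracted document also gives you a cheap way to detect output drift after a library upgrade.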

Quick start

# Install
pip install pdfmux

# Basic extraction (90% of use cases)
pdfmux convert report.pdf

# With table support
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard

# With OCR for scanned docs
pip install "pdfmux[ocr]"
pdfmux convert scanned.pdf

# Structured output
pdfmux convert invoice.pdf -f json

# Python API
from pdfmux import process
result = process("report.pdf", quality="standard")

# Batch processing with confidence threshold
from pdfmux import process_dir
results = process_dir("./pdfs", min_confidence=0.85, on_low_confidence="reroute")

Common installation gotchas

error: Microsoft Visual C++ 14.0 or greater is required (Windows). This appears when pip finds no prebuilt PyMuPDF wheel for your platform and Python version and falls back to compiling from source. Fix: install the prebuilt wheel first with pip install --only-binary :all: pymupdf, then run pip install pdfmux.

undefined symbol: _PyGen_Send (Linux) — Python version mismatch. pdfmux supports 3.10–3.13. Recreate your venv with a supported interpreter.

MemoryError on large PDFs — pass streaming=True to process(). The default loads all pages into memory; streaming mode releases each page after extraction.

Tesseract not found when using [ocr] — install the system Tesseract binary (brew install tesseract / apt install tesseract-ocr). Pure-pip Tesseract isn’t a thing on Linux/macOS.

Slow first run with [tables] — the Docling model is downloaded on first invocation. Pre-warm in your Docker image with python -c "from pdfmux.tables import warm; warm()" during build.
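Two of the gotchas above (an unsupported interpreter and a missing Tesseract binary) can be caught before installation or at service start-up with a small preflight check. The sketch below uses only the standard library; the function names are hypothetical, and the supported range is taken from the 3.10–3.13 figure stated above.

```python
import shutil
import sys

SUPPORTED = ((3, 10), (3, 13))  # Python range quoted in this article

def python_supported(version=sys.version_info[:2]) -> bool:
    """True if the (major, minor) version falls inside the supported range."""
    low, high = SUPPORTED
    return low <= tuple(version) <= high

def tesseract_available() -> bool:
    """True if a tesseract executable is on PATH."""
    return shutil.which("tesseract") is not None

def preflight(need_ocr: bool = False) -> list[str]:
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if not python_supported():
        found = ".".join(map(str, sys.version_info[:2]))
        problems.append(f"Python {found} is outside the supported 3.10-3.13 range")
    if need_ocr and not tesseract_available():
        problems.append("tesseract binary not found on PATH")
    return problems

if __name__ == "__main__":
    for problem in preflight(need_ocr=True):
        print("preflight:", problem)
```

Running this in a Docker build step or CI job surfaces the problem as a clear message instead of a linker error or a failure on the first scanned page.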

FAQ

What replaced Tabula for Python PDF extraction? Tabula-py is still maintained but newer tools like pdfmux, Docling, and Camelot offer better accuracy. pdfmux scores 0.911 on table extraction benchmarks vs Tabula’s significantly lower scores on complex documents. Tabula also requires a JVM, which is a footgun in container builds.

Is there a free alternative to Adobe PDF extraction? Yes. pdfmux is MIT licensed, free, and scores higher than most commercial tools on the opendataloader benchmark. No API keys, no cloud dependency. The closest paid equivalent (Adobe PDF Extract API) costs roughly $0.05 per page — pdfmux costs $0.

Which PDF library works best with LangChain? pdfmux has a LangChain integration (langchain-pdfmux) that provides a document loader with per-page confidence scoring. It’s designed for RAG pipelines. You can also run pdfmux as an MCP server for AI agents — Claude Code, Cursor, and Cline all support it.

Can I extract PDFs without internet access? Yes. pdfmux runs entirely offline. The [tables] extra downloads ML models on first use but works offline afterward. The core package has zero internet dependency.

How do I handle thousands of PDFs? Use the batch API: process_dir(path, workers=8). pdfmux is the only library on this list with built-in batch concurrency, retry, and per-document confidence reporting. See our batch processing guide for the full pattern.

What about non-English PDFs? pdfmux handles Latin scripts natively and adds Tesseract language packs for OCR (pdfmux convert doc.pdf --ocr-lang ara for Arabic, for example). For right-to-left layouts, pdfmux preserves visual order — see our Arabic PDF extraction guide for the GCC-document patterns.

Does pdfmux work on a CPU-only VM? Yes — that’s the design point. The standard install runs on any 2-core box with 2GB of RAM. The optional [tables] extra runs on CPU as well, just slower (1-3s/page instead of 0.3s).


Last updated: May 25, 2026 — bench re-run 2026-05-19, no rank changes vs the April run; pdfmux still #1 free.