Direct answer: The best PDF extraction library for Python in 2026 is pdfmux for most use cases. It scores 0.905 overall on opendataloader-bench (200 real-world PDFs), which ranks it #2 overall and #1 among free tools, ahead of Docling (0.877) and at 99.5% of the score of the paid #1 tool. It combines PyMuPDF’s speed with Docling’s table accuracy and adds self-healing page recovery. Install: pip install pdfmux.


How we ranked them

No opinions — only benchmark numbers. We tested 8 Python PDF extraction libraries on opendataloader-bench, a dataset of 200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents.

Three metrics:

  • Reading order (NID) — Is the text in the right sequence?
  • Table accuracy (TEDS) — Are tables correctly extracted?
  • Heading structure (MHS) — Are headings properly identified?

Overall score = per-document average of applicable metrics, averaged across all 200 documents.
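As a rough sketch of that aggregation, assuming a metric that does not apply to a document (for example TEDS on a PDF with no tables) is simply left out of that document's average. The function name, the metric keys, and the sample scores are illustrative and are not taken from the benchmark's own code.

```python
from statistics import mean
from typing import Optional

# One dict of metric scores per document; None marks a metric that does not apply.
Scores = dict[str, Optional[float]]


def overall_score(documents: list[Scores]) -> float:
    """Average the applicable metrics per document, then average across documents."""
    per_document = []
    for scores in documents:
        applicable = [value for value in scores.values() if value is not None]
        if applicable:  # assumption: a document with no applicable metric is skipped
            per_document.append(mean(applicable))
    return mean(per_document)


# Made-up scores for two documents, not benchmark data.
docs = [
    {"nid": 0.90, "teds": 0.80, "mhs": 0.70},  # all three metrics apply
    {"nid": 1.00, "teds": None, "mhs": 0.50},  # no tables, so TEDS is excluded
]
print(round(overall_score(docs), 3))
```

Averaging per document first means a table-free PDF is not penalized for a missing table score, and every document carries the same weight in the final number.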

The ranking

1. pdfmux — 0.905 overall (#2 overall, best free tool)

pip install pdfmux
pdfmux convert document.pdf

pdfmux is a self-healing extraction pipeline. It classifies each PDF, routes pages to the best extractor (PyMuPDF for text, Docling for tables, OCR for scans), audits every page, and re-extracts failures automatically.

  • Reading order: 0.918 (best among free tools)
  • Tables: 0.887 (matches Docling)
  • Headings: 0.844 (best of any engine, paid or free)
  • Cost: $0 per page, no GPU
  • Speed: 0.05-0.5s/page (digital PDFs), 1-3s (table pages)
  • License: MIT

Best for: RAG pipelines, document ingestion, general-purpose extraction. See our 200-PDF head-to-head benchmark for detailed per-tool comparisons.
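The classify, route, audit, and re-extract loop described for pdfmux can be pictured with a toy sketch like the one below. This is not pdfmux's code or API: the three extractors are stand-in functions, the page is a plain dict, and the fallback order and audit thresholds are assumptions made up for illustration.

```python
from typing import Callable

Extractor = Callable[[dict], str]


def fast_text(page: dict) -> str:
    # Stand-in for a fast text-layer extractor.
    return page.get("text_layer", "")


def table_model(page: dict) -> str:
    # Stand-in for a slower, table-aware extractor.
    return page.get("table_markdown", "") or page.get("text_layer", "")


def ocr(page: dict) -> str:
    # Stand-in for OCR on a scanned page image.
    return page.get("ocr_text", "")


# Hypothetical routing table: first choice per page type, then fallbacks.
ROUTES: dict[str, list[Extractor]] = {
    "text": [fast_text, ocr],
    "table": [table_model, fast_text],
    "scan": [ocr],
}


def classify(page: dict) -> str:
    # No text layer at all is treated as a scanned page.
    if not page.get("text_layer"):
        return "scan"
    return "table" if page.get("has_table") else "text"


def passes_audit(text: str, min_chars: int = 20) -> bool:
    # Toy audit: enough text, and mostly printable characters.
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    return printable / len(stripped) >= 0.9


def extract_page(page: dict) -> tuple[str, str]:
    """Try each extractor for the page type; return (text, extractor name)."""
    text, name = "", "none"
    for extractor in ROUTES[classify(page)]:
        text, name = extractor(page), extractor.__name__
        if passes_audit(text):
            break  # good enough, stop re-extracting
    return text, name


# A page whose text layer is nearly empty falls through to the OCR stand-in.
broken_page = {"text_layer": "x", "ocr_text": "Quarterly revenue rose four percent."}
print(extract_page(broken_page))
```

The point of the design is that cheap extraction runs first and the expensive path is only taken for pages that fail the audit, which is where the per-page speed ranges listed above come from.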

2. Docling — 0.877 overall

pip install docling

IBM’s transformer-based document understanding system. The most accurate table extractor available, but processes every page through ML models.

  • Tables: 0.887 (highest)
  • Reading order: 0.900
  • Cost: Free but requires ~500MB model download
  • Speed: 0.3-3s/page

Best for: Table-heavy documents, financial statements, when accuracy is the only priority.

3. marker — 0.861 overall

pip install marker-pdf

Deep learning pipeline for PDF extraction. Strong on academic layouts and equation detection.

  • Tables: 0.808
  • Reading order: 0.890
  • Cost: Free but GPU recommended for speed
  • Speed: 1-10s/page (CPU), 0.3-1s (GPU)

Best for: Academic papers, documents with equations, when you have GPU access.

4. opendataloader — 0.844 overall

Rule-based extraction without ML. Fast but limited table handling.

  • Tables: 0.494 (weakest)
  • Reading order: 0.913 (strong)
  • Speed: Very fast

Best for: Simple digital PDFs where speed matters more than table accuracy.

5. mineru — 0.831 overall

ML-powered extraction with good table accuracy but lower reading order scores.

  • Tables: 0.873
  • Reading order: 0.857 (weakest among top tools)
  • GPU recommended

Best for: Table extraction when pdfmux/Docling aren’t an option.

Honorable mentions

PyMuPDF / pymupdf4llm — The fastest option (0.01s/page) but limited table extraction and AGPL-licensed. Used inside pdfmux as the base extractor.

pdfplumber — Good for simple, well-formatted PDFs with grid-line tables. No OCR, no ML, minimal dependencies. Falls behind on complex documents.

Unstructured — Enterprise-focused, API-first. Good for pipelines but heavier than needed for most Python projects.

Camelot — Specialized table extractor. Accurate on bordered tables but doesn’t handle full document extraction.

Decision matrix

| Need | Best choice | Why |
|---|---|---|
| General-purpose extraction | pdfmux | Best overall for free |
| RAG pipeline ingestion | pdfmux | Per-page confidence scoring |
| Maximum table accuracy | pdfmux[tables] or Docling | 0.911 TEDS (pdfmux leads); see table extraction methods |
| Maximum speed | PyMuPDF | 0.01s/page |
| Academic papers | marker | Equation detection |
| Scanned documents | pdfmux[ocr] | Automatic OCR fallback, no GPU needed |
| Minimal dependencies | pdfplumber | Pure Python |
| Enterprise/API | Unstructured | Cloud API option |

What to consider beyond accuracy

License: PyMuPDF is AGPL-3.0 (copyleft). pdfmux itself is MIT-licensed but uses PyMuPDF as a dependency, so PyMuPDF’s AGPL terms can still apply to your application. marker uses GPL. Docling uses MIT. Check your compliance requirements.

Install size: pdfmux core is ~20MB. Adding [tables] (Docling) adds ~500MB on first run. marker with PyTorch can exceed 2GB.

Cold start: Docling and marker load ML models on first invocation (30-60s). pdfmux with quality="fast" starts instantly.

Output formats: pdfmux offers Markdown, JSON, CSV, and LLM-optimized output. Most others only do Markdown or plain text.

Error handling: pdfmux is the only tool in this comparison with built-in page quality auditing and automatic re-extraction. The others can fail silently on bad pages.

Quick start

# Install
pip install pdfmux

# Basic extraction (90% of use cases)
pdfmux convert report.pdf

# With table support
pip install "pdfmux[tables]"
pdfmux convert financial-report.pdf -q standard

# With OCR for scanned docs
pip install "pdfmux[ocr]"
pdfmux convert scanned.pdf

# Structured output
pdfmux convert invoice.pdf -f json

# Python API
from pdfmux import process
result = process("report.pdf", quality="standard")
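For more than one file, a small wrapper around the Python API can keep a batch running when a single PDF fails. The sketch below is an assumption-laden example: convert_folder is a hypothetical helper, it only relies on the process(path, quality=...) call shown above, and it stores whatever that call returns without assuming anything about the result's shape.

```python
from pathlib import Path
from typing import Any, Callable


def convert_folder(
    folder: str,
    process_fn: Callable[..., Any],
    quality: str = "standard",
) -> tuple[dict[str, Any], dict[str, str]]:
    """Run process_fn on every PDF in folder; collect results and errors by file name."""
    results: dict[str, Any] = {}
    errors: dict[str, str] = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf.name] = process_fn(str(pdf), quality=quality)
        except Exception as exc:  # record the failure and keep going
            errors[pdf.name] = f"{type(exc).__name__}: {exc}"
    return results, errors


# Intended use, assuming pdfmux is installed:
#   from pdfmux import process
#   results, errors = convert_folder("reports/", process)
```

Passing the extraction function in as an argument also makes it easy to swap in another library or a stub when testing the surrounding pipeline.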

FAQ

What replaced Tabula for Python PDF extraction? Tabula-py is still maintained, but newer tools like pdfmux, Docling, and Camelot offer better accuracy. pdfmux scores 0.911 on table extraction benchmarks, versus Tabula’s significantly lower scores on complex documents.

Is there a free alternative to Adobe PDF extraction? Yes. pdfmux is MIT licensed, free, and scores higher than most commercial tools on opendataloader-bench. No API keys, no cloud dependency.

Which PDF library works best with LangChain? pdfmux has a LangChain integration (langchain-pdfmux) that provides a document loader with per-page confidence scoring. It’s designed for RAG pipelines. You can also run pdfmux as an MCP server for AI agents.

Can I extract PDFs without internet access? Yes. pdfmux runs entirely offline. The [tables] extra downloads ML models on first use but works offline afterward. The core package has zero internet dependency.

Last updated: March 2026