TL;DR: No single PDF extractor wins at everything. PyMuPDF is 100x faster on digital PDFs. Docling has 97.9% table accuracy. RapidOCR handles scans on CPU in 200MB. Gemini Flash catches everything but costs money. The right tool depends on your documents — this guide helps you pick, with real numbers from maintaining pdfmux, which uses all of them internally.
Why this guide is different
Every PDF tool publishes benchmarks showing themselves winning. Unstructured’s blog says Unstructured is best. LlamaParse’s comparisons say LlamaParse is best. Docling’s papers highlight Docling’s table accuracy.
I’m in a different position. I maintain pdfmux, a self-healing PDF extraction pipeline that routes to the right extractor per page. pdfmux uses PyMuPDF, Docling, RapidOCR, Surya, and Gemini Flash as backends. I don’t compete with any of them — I use them. When one of them gets better, pdfmux gets better.
That means I have no incentive to lie about which tool wins where. Here’s what I’ve found after testing them across thousands of documents.
The landscape in numbers
First, let’s calibrate on what developers are actually using. Monthly PyPI downloads (March 2026):
| Tool | Monthly Downloads | What it is |
|---|---|---|
| PyMuPDF | 43M | C-based PDF engine, text + images + tables |
| pdfplumber | 18.5M | Pure Python, good table extraction |
| Unstructured | 4.9M | Enterprise document processing platform |
| Docling | 5.0M | IBM Research, transformer-based tables |
| pypdf | ~15M | Pure Python, basic operations |
| pdfminer.six | ~8M | Layout-aware text extraction |
| Marker | ~500K | ML-powered, GPU-preferred |
PyMuPDF dominates because it’s fast, reliable, and has zero external dependencies. Most developers start here. The question is what to use when PyMuPDF isn’t enough.
Category 1: Clean, digital PDFs
These are PDFs created by software — Word exports, LaTeX papers, programmatic reports. Text is embedded and extractable. This is 90% of PDFs you’ll encounter.
| Tool | Speed (per page) | Accuracy | Notes |
|---|---|---|---|
| PyMuPDF | 0.01s | 98%+ | Fastest by 50-500x. No dependencies. |
| pdfplumber | 0.05-0.1s | 97%+ | Slightly better on some layouts |
| Docling | 0.3-1s | 95%+ | Overkill, loads transformer models |
| Marker | 0.5-2s | 98%+ | Needs GPU for reasonable speed |
| Gemini Flash | 2-5s | 99%+ | Costs money, sends data to Google |
| Unstructured (OSS) | 0.1-0.5s | 96%+ | Complex setup, many dependencies |
| LlamaParse | 1-3s | 98%+ | Cloud only, $0.003/page |
Winner: PyMuPDF, and it isn't close. At 0.01 seconds per page, it processes a 100-page document in 1 second. It's been maintained for over a decade, handles edge cases well, and the C backend means Python overhead is negligible.
When PyMuPDF loses on digital PDFs: Multi-column layouts where reading order matters. PyMuPDF extracts text in the raw PDF stream order, which sometimes interleaves columns. For these cases, pdfmux detects multi-column layout (clustering text block x-coordinates with a 50-point gap threshold) and reorders into left-to-right, top-to-bottom reading order.
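The reordering idea is simple enough to sketch. Below is a minimal, hypothetical version (not pdfmux's actual implementation): cluster text blocks into columns by their left edge with a 50-point gap, then read each column top-to-bottom, columns left-to-right. The block tuples match the `(x0, y0, x1, y1, text, block_no, block_type)` shape that PyMuPDF's `page.get_text("blocks")` returns.

```python
COLUMN_GAP = 50  # gap threshold in points, per the heuristic described above

def reorder_columns(blocks, gap=COLUMN_GAP):
    """Group text blocks into columns by x0, then emit them in
    left-to-right, top-to-bottom reading order."""
    columns = []  # list of (anchor_x0, [blocks])
    for blk in sorted(blocks, key=lambda b: b[0]):  # sweep by left edge
        for col in columns:
            if abs(blk[0] - col[0]) < gap:  # close to an existing column
                col[1].append(blk)
                break
        else:
            columns.append((blk[0], [blk]))
    ordered = []
    for _, col_blocks in sorted(columns, key=lambda c: c[0]):  # columns L→R
        ordered.extend(sorted(col_blocks, key=lambda b: b[1]))  # rows top→bottom
    return ordered
```

In practice you would feed it `[b for b in page.get_text("blocks") if b[6] == 0]` (text blocks only) and join the `text` fields of the result.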
Recommendation
If your PDFs are digital, use PyMuPDF. Don't overthink it. `pip install pymupdf` and move on. (For a ranked list with benchmark scores, see the best PDF extraction libraries for Python in 2026.)
```python
import pymupdf4llm

text = pymupdf4llm.to_markdown("report.pdf")
```
Or if you want confidence scoring to verify extraction quality:
```python
import pdfmux

text = pdfmux.extract_text("report.pdf", quality="fast")
# quality="fast" uses PyMuPDF only, skips the audit — maximum speed
```
Category 2: Table-heavy documents
Financial reports, invoices, data sheets, regulatory filings. The text is digital, but the structure matters — you need to extract tables as actual tables, not as garbled text.
| Tool | Table Accuracy | Preserves Structure | How it works |
|---|---|---|---|
| Docling | 97.9% | Yes (markdown tables) | Transformer-based table detection |
| Gemini Flash | ~95% | Yes | Vision model, understands layout |
| Marker | ~85% | Yes | ML-based with GPU |
| pdfplumber | ~80% | Partial | Rule-based cell detection |
| Unstructured (OSS) | ~85% | Yes | Multiple strategies |
| PyMuPDF | ~60% | Partial | find_tables() heuristic |
| LlamaParse | ~93% | Yes | Cloud ML pipeline |
Winner: Docling. IBM Research built it specifically for structured document understanding. (We compare three methods for extracting tables from PDFs in Python in a dedicated guide.) The transformer models detect table boundaries, identify headers, and extract cell contents with near-human accuracy. The 97.9% figure comes from their benchmarks on DocLayNet, a diverse dataset of financial, technical, and legal documents.
The tradeoff: Docling is slow. First run loads transformer models (~5-10 seconds). After that, it processes at 0.3-1 second per page. For a 10-page invoice, that’s fine. For a 500-page annual report, you’re waiting minutes.
The smart approach: targeted table extraction
pdfmux solves the speed problem by not running Docling on every page. It first detects which pages likely contain tables using a fast heuristic (5 signals scored additively):
| Signal | Score | How it’s detected |
|---|---|---|
| Drawn grid lines | 2 | ≥3 horizontal + ≥2 vertical lines |
| Number-dense lines | 2 | ≥5 lines where ≥30% of chars are digits/currency |
| Column alignment | 2 | ≥3 columns with ≥4 aligned text blocks each |
| Whitespace patterns | 1 | ≥5 lines with ≥3 runs of 3+ spaces |
| PyMuPDF find_tables() | 2 | Built-in heuristic finds tables |
A page is flagged as table-candidate if total score ≥ 2. For documents over 50 pages, pdfmux only sends table-candidate pages to Docling, and processes the rest with PyMuPDF. This cuts processing time by 80-95% on most documents while still getting Docling’s accuracy where it matters.
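To make the scoring concrete, here is a simplified sketch of the two purely text-based signals (number-dense lines and whitespace patterns). It is illustrative, not pdfmux's actual code; the drawn-line and column-alignment signals need geometry from the PDF engine and are omitted.

```python
import re

def score_table_signals(page_text):
    """Additively score two text-based table signals from the table above."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    score = 0

    # Number-dense lines: >=5 lines where >=30% of chars are digits/currency
    def digit_ratio(ln):
        hits = sum(ch.isdigit() or ch in "$€£.,%" for ch in ln)
        return hits / max(len(ln), 1)
    if sum(digit_ratio(ln) >= 0.3 for ln in lines) >= 5:
        score += 2

    # Whitespace patterns: >=5 lines with >=3 runs of 3+ spaces
    if sum(len(re.findall(r" {3,}", ln)) >= 3 for ln in lines) >= 5:
        score += 1

    return score

def is_table_candidate(page_text, threshold=2):
    return score_table_signals(page_text) >= threshold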
Recommendation
```python
# If you know your doc has tables
import pdfmux

text = pdfmux.extract_text("financial_report.pdf")
# standard mode auto-detects tables and routes to Docling
```

Or just use Docling directly for small documents:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("invoice.pdf")
print(result.document.export_to_markdown())
```
Install table support: `pip install "pdfmux[tables]"` (adds Docling, ~500MB with transformer models).
Category 3: Scanned PDFs
Paper documents that were scanned — no embedded text. OCR is mandatory. This is where most pipelines fail silently.
| Tool | Works? | Speed | Footprint | GPU Required? | Quality |
|---|---|---|---|---|---|
| RapidOCR | Yes | 1-3s/page | ~200MB | No (CPU) | Good |
| Surya OCR | Yes | 1-5s/page | ~5GB | Recommended | Very good |
| Marker | Yes | 0.5-2s/page | ~5GB | Yes | Very good |
| Gemini Flash | Yes | 2-5s/page | None (cloud) | No | Excellent |
| Tesseract | Yes | 0.5-2s/page | ~100MB | No | Adequate |
| PyMuPDF | No | — | — | — | Returns empty |
| pdfplumber | No | — | — | — | Returns empty |
| Docling | Partial | Slow | ~500MB | No | Limited OCR |
Winner (quality): Gemini Flash. Vision models understand layout, context, and even handwriting. But it costs money ($0.01-0.05/document) and sends your data to Google’s servers.
Winner (practical): RapidOCR. PaddleOCR v4 models compiled to ONNX, runs on CPU, ~200MB footprint, no GPU required, no external API calls. For a deeper look at how pdfmux achieves near-AI accuracy without a GPU or API keys, see the architecture breakdown. pdfmux defaults to RapidOCR for OCR because it hits the best tradeoff of quality, speed, and deployability.
The silent failure problem: PyMuPDF, pdfplumber, and most basic extraction tools return empty text or near-empty text on scanned pages. They don’t error. They don’t warn. They just return nothing, and your RAG pipeline indexes empty documents, and your agent gives wrong answers, and nobody knows why until a human manually checks.
pdfmux’s approach: extract with PyMuPDF first (instant), then run 5 quality checks. Pages with <20 characters are classified as “empty.” Pages with <200 characters and images are classified as “bad.” Both get re-extracted with OCR automatically.
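The two thresholds can be sketched as a tiny classifier (a simplification of what's described above, not the actual pipeline code):

```python
def classify_page(text, has_images):
    """Classify a page from its fast PyMuPDF extraction, per the
    thresholds above: 'empty' and 'bad' pages get re-extracted with OCR."""
    n = len(text.strip())
    if n < 20:
        return "empty"  # almost certainly a scanned page
    if n < 200 and has_images:
        return "bad"    # image-heavy page with suspiciously little text
    return "ok"
```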
The confidence scoring checks
Each page starts at 1.0 and gets penalties:
| Check | Penalty | What it catches |
|---|---|---|
| Character density <50 chars | -0.3 | Near-empty pages from scanned docs |
| Alphabetic ratio <0.3 | -0.25 | Garbled OCR output, encoding errors |
| Average word length <2 or >25 | -0.15 | Broken word boundaries |
| Excessive whitespace runs | -0.1 | Layout extraction artifacts |
| Mojibake patterns (â€, �) | -0.2 | Unicode encoding failures |
A page scoring below 0.5 is re-extracted with OCR. A page scoring 0.0 (near-empty) gets full-page OCR. The re-extracted version replaces the original only if it produces more text.
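The penalty table translates almost directly into code. The sketch below is illustrative rather than pdfmux's exact implementation; the precise definition of "excessive whitespace runs" and the near-empty cutoff are assumptions.

```python
import re

def confidence_score(text):
    """Score one page's extracted text per the penalty table above."""
    stripped = text.strip()
    if len(stripped) < 20:      # near-empty (assumed cutoff): scores 0.0
        return 0.0
    score = 1.0
    if len(stripped) < 50:
        score -= 0.3            # near-empty pages from scanned docs
    if sum(ch.isalpha() for ch in stripped) / len(stripped) < 0.3:
        score -= 0.25           # garbled OCR output, encoding errors
    words = stripped.split()
    avg = sum(len(w) for w in words) / len(words)
    if avg < 2 or avg > 25:
        score -= 0.15           # broken word boundaries
    if len(re.findall(r" {4,}", text)) > 10:
        score -= 0.1            # excessive whitespace runs (assumed definition)
    if "â€" in text or "�" in text:
        score -= 0.2            # mojibake / Unicode encoding failures
    return max(score, 0.0)

def needs_ocr(text):
    return confidence_score(text) < 0.5
```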
Recommendation
For occasional scanned PDFs: `pip install "pdfmux[ocr]"` and let the auto-detection handle it.
For heavy OCR workloads with a GPU: consider Surya directly, or Gemini Flash if you’re OK with cloud processing.
```python
import pdfmux

# Automatically detects scanned pages and OCRs them
text = pdfmux.extract_text("scanned_contract.pdf")
# quality="standard" does smart routing — only OCRs pages that need it

# Force OCR on everything
text = pdfmux.extract_text("scanned_contract.pdf", quality="high")
# quality="high" uses Gemini Flash if available, max quality
```
Category 4: Mixed documents
The hardest category. Digital pages with some scanned pages. Table pages mixed with text pages. Forms with embedded images containing text. This is surprisingly common in real-world business documents — think a contract where the signature page was scanned, or a report with embedded screenshots of data.
| Tool | Handles mixed? | How? |
|---|---|---|
| pdfmux | Yes | Per-page classification + targeted extraction |
| Gemini Flash | Yes | Vision model processes every page |
| Unstructured | Partial | Document-level strategy selection |
| LlamaParse | Yes | Cloud ML pipeline |
| PyMuPDF | No | Digital pages fine, scanned pages empty |
| Marker | Partial | ML-based, but all-or-nothing approach |
The problem: Most tools make a document-level decision — either treat the whole thing as digital or treat it as scanned. For a 50-page document where pages 1-48 are digital and pages 49-50 are scanned, running OCR on all 50 pages wastes time and can actually degrade quality on the digital pages.
pdfmux’s approach: Classify every page independently. Extract digital pages with PyMuPDF (instant). OCR only the pages that need it. This means:
- 48 digital pages processed in ~0.5 seconds
- 2 scanned pages processed with OCR in ~4-6 seconds
- Total: ~6 seconds instead of ~150 seconds for full-document OCR
- Digital pages get perfect extraction, not OCR approximation
The dynamic OCR budget system keeps processing time proportional to the actual problem:
| Document type | OCR budget |
|---|---|
| Mostly digital (<25% graphical) | 30% of pages (enough for scattered scans) |
| Mixed (25-50% graphical) | graphical ratio + 10% |
| Mostly scanned (>50% graphical) | 100% — OCR everything |
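The budget table above maps to a small function. This is a sketch under stated assumptions: the handling of the exact 25% and 50% boundaries is my guess, not confirmed behavior.

```python
def ocr_budget(graphical_ratio):
    """Max fraction of pages eligible for OCR, per the budget table above."""
    if graphical_ratio < 0.25:       # mostly digital: scattered scans
        return 0.30
    if graphical_ratio <= 0.50:      # mixed: ratio + 10% headroom
        return graphical_ratio + 0.10
    return 1.0                       # mostly scanned: OCR everything
```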
Category 5: LLM pipeline / RAG use cases
If you’re building RAG pipelines, agent workflows, or LLM-powered applications, your extraction tool needs to produce LLM-friendly output. That means structured markdown, not raw text.
| Tool | Output format | Chunk-ready? | Token estimates? | Metadata? |
|---|---|---|---|---|
| pdfmux | Markdown | Yes (section-aware) | Yes | Confidence, pages, extractor |
| PyMuPDF4LLM | Markdown | No | No | No |
| Unstructured | Elements | Yes | No | Element types |
| LlamaParse | Markdown | Via API | No | Limited |
| Docling | Markdown | No | No | Table structure |
| Marker | Markdown | No | No | No |
Why markdown matters: Research shows markdown achieves 60.7% LLM accuracy versus 44.3% for CSV output. Markdown gives 20-35% better RAG accuracy compared to HTML or plain text, with 10-15% token savings versus JSON.
pdfmux’s load_llm_context() function returns section-aware chunks (see our complete guide to PDF-to-Markdown for RAG pipelines for the full ingestion workflow):
```python
import pdfmux

chunks = pdfmux.load_llm_context("report.pdf")
for chunk in chunks:
    print(f"Section: {chunk['title']}")
    print(f"Pages: {chunk['page_start']}-{chunk['page_end']}")
    print(f"Tokens: ~{chunk['tokens']}")
    print(f"Confidence: {chunk['confidence']}")
    print(f"Text: {chunk['text'][:100]}...")
```
Token estimation uses chars // 4 (GPT-family approximation). Section detection splits on ATX headings (# through ######). If no headings are found, it falls back to one chunk per page.
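Both pieces are easy to sketch. The version below is a hypothetical standalone take on the logic, not pdfmux's API; for brevity it falls back to a single chunk when no headings are found, where pdfmux falls back to one chunk per page.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.M)

def estimate_tokens(text):
    return len(text) // 4  # GPT-family rule of thumb

def split_by_headings(markdown):
    """Split markdown into section chunks at ATX headings (# .. ######)."""
    matches = list(HEADING.finditer(markdown))
    if not matches:  # no headings: one chunk for the whole text
        return [{"title": None, "text": markdown,
                 "tokens": estimate_tokens(markdown)}]
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[m.start():end]
        chunks.append({"title": m.group(2).strip(), "text": body,
                       "tokens": estimate_tokens(body)})
    return chunks
```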
Framework integrations
LangChain:
```python
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()
# Each Document has page_content + metadata (confidence, pages, extractor)
```
LlamaIndex:
```python
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")
```
The decision flowchart
```text
Is your PDF digital (created by software)?
├─ Yes → Is it table-heavy?
│   ├─ Yes → Docling (or pdfmux with the tables extra)
│   └─ No → PyMuPDF (fastest, 0.01s/page)
│
├─ No (scanned) → Do you have a GPU?
│   ├─ Yes → Marker or Surya
│   └─ No → RapidOCR (CPU, 200MB)
│
├─ Mixed / Not sure → pdfmux (auto-detects and routes)
│
└─ Building an LLM pipeline?
    └─ Yes → pdfmux (confidence scoring, chunking, framework integrations)
```
Or the one-line version: if you don’t know what your PDFs look like, use pdfmux. It routes to the right tool for each page and tells you when the extraction quality is low.
Cost comparison
| Tool | Cost | Best for |
|---|---|---|
| PyMuPDF | Free, MIT | Digital PDFs |
| pdfplumber | Free, MIT | Tables (simpler than Docling) |
| Docling | Free, MIT | High-accuracy tables |
| RapidOCR | Free, Apache 2.0 | Scanned PDFs, CPU-only |
| Surya | Free, GPL 3.0 | High-quality OCR with GPU |
| Marker | Free, GPL 3.0 | Full ML pipeline with GPU |
| pdfmux | Free, MIT | Smart routing between all of the above |
| Unstructured (OSS) | Free, Apache 2.0 | Enterprise document processing |
| LlamaParse | $0.003/page | Cloud OCR + tables |
| Gemini Flash | ~$0.01-0.05/doc | Best quality, cloud only |
| AWS Textract | $0.015/page | Enterprise, AWS ecosystem |
| Adobe PDF Services | $0.05/operation | Enterprise, Adobe ecosystem |
For most developers: PyMuPDF (free) handles 90% of cases. Add pdfmux (free) when you need confidence scoring, OCR routing, or table detection. Only pay for cloud services when you need the absolute best quality on difficult documents and can’t run models locally.
Try it
```shell
# Just PyMuPDF (90% of use cases)
pip install pymupdf pymupdf4llm

# Smart routing with confidence scoring
pip install pdfmux

# Add OCR for scanned docs
pip install "pdfmux[ocr]"

# Add table extraction
pip install "pdfmux[tables]"

# Everything
pip install "pdfmux[all]"
```
- GitHub — source code, docs, examples
- PyPI — `pip install pdfmux`
- pdfmux.com — documentation
Built by Nameet Potnis. Contributions welcome.