TL;DR: No single PDF extractor wins at everything. PyMuPDF is 100x faster on digital PDFs. Docling has 97.9% table accuracy. RapidOCR handles scans on CPU in 200MB. Gemini Flash catches everything but costs money. The right tool depends on your documents — this guide helps you pick, with real numbers from maintaining pdfmux, which uses all of them internally.
Why this guide is different
Every PDF tool publishes benchmarks showing themselves winning. Unstructured’s blog says Unstructured is best. LlamaParse’s comparisons say LlamaParse is best. Docling’s papers highlight Docling’s table accuracy.
I’m in a different position. I maintain pdfmux, a self-healing PDF extraction pipeline that routes to the right extractor per page. pdfmux uses PyMuPDF, Docling, RapidOCR, Surya, and Gemini Flash as backends. I don’t compete with any of them — I use them. When one of them gets better, pdfmux gets better.
That means I have no incentive to lie about which tool wins where. Here’s what I’ve found after testing them across thousands of documents.
The landscape in numbers
First, let’s calibrate on what developers are actually using. Monthly PyPI downloads (March 2026):
| Tool | Monthly Downloads | What it is |
|---|---|---|
| PyMuPDF | 43M | C-based PDF engine, text + images + tables |
| pdfplumber | 18.5M | Pure Python, good table extraction |
| Unstructured | 4.9M | Enterprise document processing platform |
| Docling | 5.0M | IBM Research, transformer-based tables |
| pypdf | ~15M | Pure Python, basic operations |
| pdfminer.six | ~8M | Layout-aware text extraction |
| Marker | ~500K | ML-powered, GPU-preferred |
PyMuPDF dominates because it’s fast, reliable, and has zero external dependencies. Most developers start here. The question is what to use when PyMuPDF isn’t enough.
Category 1: Clean, digital PDFs
These are PDFs created by software — Word exports, LaTeX papers, programmatic reports. Text is embedded and extractable. This is 90% of PDFs you’ll encounter.
| Tool | Speed (per page) | Accuracy | Notes |
|---|---|---|---|
| PyMuPDF | 0.01s | 98%+ | Fastest by 50-500x. No dependencies. |
| pdfplumber | 0.05-0.1s | 97%+ | Slightly better on some layouts |
| Docling | 0.3-1s | 95%+ | Overkill, loads transformer models |
| Marker | 0.5-2s | 98%+ | Needs GPU for reasonable speed |
| Gemini Flash | 2-5s | 99%+ | Costs money, sends data to Google |
| Unstructured (OSS) | 0.1-0.5s | 96%+ | Complex setup, many dependencies |
| LlamaParse | 1-3s | 98%+ | Cloud only, $0.003/page |
Winner: PyMuPDF. Not close. At 0.01 seconds per page, it processes a 100-page document in 1 second. It’s been maintained for over a decade, handles edge cases well, and the C backend means Python overhead is negligible.
When PyMuPDF loses on digital PDFs: Multi-column layouts where reading order matters. PyMuPDF extracts text in the raw PDF stream order, which sometimes interleaves columns. For these cases, pdfmux detects multi-column layout (clustering text block x-coordinates with a 50-point gap threshold) and reorders into left-to-right, top-to-bottom reading order.
Recommendation
If your PDFs are digital, use PyMuPDF. Don’t overthink it. pip install pymupdf and move on. (For a ranked list with benchmark scores, see the best PDF extraction libraries for Python in 2026.)
import pymupdf4llm
text = pymupdf4llm.to_markdown("report.pdf")
Or if you want confidence scoring to verify extraction quality:
import pdfmux
text = pdfmux.extract_text("report.pdf", quality="fast")
# quality="fast" uses PyMuPDF only, skips the audit — maximum speed
Category 2: Table-heavy documents
Financial reports, invoices, data sheets, regulatory filings. The text is digital, but the structure matters — you need to extract tables as actual tables, not as garbled text.
| Tool | Table Accuracy | Preserves Structure | How it works |
|---|---|---|---|
| Docling | 97.9% | Yes (markdown tables) | Transformer-based table detection |
| Gemini Flash | ~95% | Yes | Vision model, understands layout |
| Marker | ~85% | Yes | ML-based with GPU |
| pdfplumber | ~80% | Partial | Rule-based cell detection |
| Unstructured (OSS) | ~85% | Yes | Multiple strategies |
| PyMuPDF | ~60% | Partial | find_tables() heuristic |
| LlamaParse | ~93% | Yes | Cloud ML pipeline |
Winner: Docling. IBM Research built it specifically for structured document understanding. (We compare three methods for extracting tables from PDFs in Python in a dedicated guide.) The transformer models detect table boundaries, identify headers, and extract cell contents with near-human accuracy. The 97.9% figure comes from their benchmarks on DocLayNet, a diverse dataset of financial, technical, and legal documents.
The tradeoff: Docling is slow. First run loads transformer models (~5-10 seconds). After that, it processes at 0.3-1 second per page. For a 10-page invoice, that’s fine. For a 500-page annual report, you’re waiting minutes.
The smart approach: targeted table extraction
pdfmux solves the speed problem by not running Docling on every page. It first detects which pages likely contain tables using a fast heuristic (5 signals scored additively):
| Signal | Score | How it’s detected |
|---|---|---|
| Drawn grid lines | 2 | ≥3 horizontal + ≥2 vertical lines |
| Number-dense lines | 2 | ≥5 lines where ≥30% of chars are digits/currency |
| Column alignment | 2 | ≥3 columns with ≥4 aligned text blocks each |
| Whitespace patterns | 1 | ≥5 lines with ≥3 runs of 3+ spaces |
| PyMuPDF find_tables() | 2 | Built-in heuristic finds tables |
A page is flagged as table-candidate if total score ≥ 2. For documents over 50 pages, pdfmux only sends table-candidate pages to Docling, and processes the rest with PyMuPDF. This cuts processing time by 80-95% on most documents while still getting Docling’s accuracy where it matters.
Recommendation
# If you know your doc has tables
import pdfmux
text = pdfmux.extract_text("financial_report.pdf")
# standard mode auto-detects tables and routes to Docling
# Or just use Docling directly for small documents
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
print(result.document.export_to_markdown())
Install table support: pip install "pdfmux[tables]" (adds Docling, ~500MB with transformer models).
Category 3: Scanned PDFs
Paper documents that were scanned — no embedded text. OCR is mandatory. This is where most pipelines fail silently.
| Tool | Works? | Speed | Footprint | GPU Required? | Quality |
|---|---|---|---|---|---|
| RapidOCR | Yes | 1-3s/page | ~200MB | No (CPU) | Good |
| Surya OCR | Yes | 1-5s/page | ~5GB | Recommended | Very good |
| Marker | Yes | 0.5-2s/page | ~5GB | Yes | Very good |
| Gemini Flash | Yes | 2-5s/page | None (cloud) | No | Excellent |
| Tesseract | Yes | 0.5-2s/page | ~100MB | No | Adequate |
| PyMuPDF | No | — | — | — | Returns empty |
| pdfplumber | No | — | — | — | Returns empty |
| Docling | Partial | Slow | ~500MB | No | Limited OCR |
Winner (quality): Gemini Flash. Vision models understand layout, context, and even handwriting. But it costs money ($0.01-0.05/document) and sends your data to Google’s servers.
Winner (practical): RapidOCR. PaddleOCR v4 models compiled to ONNX, runs on CPU, ~200MB footprint, no GPU required, no external API calls. For a deeper look at how pdfmux achieves near-AI accuracy without a GPU or API keys, see the architecture breakdown. pdfmux defaults to RapidOCR for OCR because it hits the best tradeoff of quality, speed, and deployability.
The silent failure problem: PyMuPDF, pdfplumber, and most basic extraction tools return empty text or near-empty text on scanned pages. They don’t error. They don’t warn. They just return nothing, and your RAG pipeline indexes empty documents, and your agent gives wrong answers, and nobody knows why until a human manually checks.
pdfmux’s approach: extract with PyMuPDF first (instant), then run 5 quality checks. Pages with <20 characters are classified as “empty.” Pages with <200 characters and images are classified as “bad.” Both get re-extracted with OCR automatically.
The confidence scoring checks
Each page starts at 1.0 and gets penalties:
| Check | Penalty | What it catches |
|---|---|---|
| Character density <50 chars | -0.3 | Near-empty pages from scanned docs |
| Alphabetic ratio <0.3 | -0.25 | Garbled OCR output, encoding errors |
| Average word length <2 or >25 | -0.15 | Broken word boundaries |
| Excessive whitespace runs | -0.1 | Layout extraction artifacts |
| Mojibake patterns (â€, �) | -0.2 | Unicode encoding failures |
A page scoring below 0.5 is re-extracted with OCR. A page scoring 0.0 (near-empty) gets full-page OCR. The re-extracted version replaces the original only if it produces more text.
Recommendation
For occasional scanned PDFs: pip install "pdfmux[ocr]" and let the auto-detection handle it.
For heavy OCR workloads with a GPU: consider Surya directly, or Gemini Flash if you’re OK with cloud processing.
import pdfmux
# Automatically detects scanned pages and OCRs them
text = pdfmux.extract_text("scanned_contract.pdf")
# quality="standard" does smart routing — only OCRs pages that need it
# Force OCR on everything
text = pdfmux.extract_text("scanned_contract.pdf", quality="high")
# quality="high" uses Gemini Flash if available, max quality
Category 4: Mixed documents
The hardest category. Digital pages with some scanned pages. Table pages mixed with text pages. Forms with embedded images containing text. This is surprisingly common in real-world business documents — think a contract where the signature page was scanned, or a report with embedded screenshots of data.
| Tool | Handles mixed? | How? |
|---|---|---|
| pdfmux | Yes | Per-page classification + targeted extraction |
| Gemini Flash | Yes | Vision model processes every page |
| Unstructured | Partial | Document-level strategy selection |
| LlamaParse | Yes | Cloud ML pipeline |
| PyMuPDF | No | Digital pages fine, scanned pages empty |
| Marker | Partial | ML-based, but all-or-nothing approach |
The problem: Most tools make a document-level decision — either treat the whole thing as digital or treat it as scanned. For a 50-page document where pages 1-48 are digital and pages 49-50 are scanned, running OCR on all 50 pages wastes time and can actually degrade quality on the digital pages.
pdfmux’s approach: Classify every page independently. Extract digital pages with PyMuPDF (instant). OCR only the pages that need it. This means:
- 48 digital pages processed in ~0.5 seconds
- 2 scanned pages processed with OCR in ~4-6 seconds
- Total: ~6 seconds instead of ~150 seconds for full-document OCR
- Digital pages get perfect extraction, not OCR approximation
The dynamic OCR budget system keeps processing time proportional to the actual problem:
| Document type | OCR budget |
|---|---|
| Mostly digital (<25% graphical) | 30% of pages (enough for scattered scans) |
| Mixed (25-50% graphical) | graphical ratio + 10% |
| Mostly scanned (>50% graphical) | 100% — OCR everything |
Category 5: LLM pipeline / RAG use cases
If you’re building RAG pipelines, agent workflows, or LLM-powered applications, your extraction tool needs to produce LLM-friendly output. That means structured markdown, not raw text.
| Tool | Output format | Chunk-ready? | Token estimates? | Metadata? |
|---|---|---|---|---|
| pdfmux | Markdown | Yes (section-aware) | Yes | Confidence, pages, extractor |
| PyMuPDF4LLM | Markdown | No | No | No |
| Unstructured | Elements | Yes | No | Element types |
| LlamaParse | Markdown | Via API | No | Limited |
| Docling | Markdown | No | No | Table structure |
| Marker | Markdown | No | No | No |
Why markdown matters: Research shows markdown achieves 60.7% LLM accuracy versus 44.3% for CSV output. Markdown gives 20-35% better RAG accuracy compared to HTML or plain text, with 10-15% token savings versus JSON.
pdfmux’s load_llm_context() function returns section-aware chunks (see our complete guide to PDF-to-Markdown for RAG pipelines for the full ingestion workflow):
import pdfmux
chunks = pdfmux.load_llm_context("report.pdf")
for chunk in chunks:
print(f"Section: {chunk['title']}")
print(f"Pages: {chunk['page_start']}-{chunk['page_end']}")
print(f"Tokens: ~{chunk['tokens']}")
print(f"Confidence: {chunk['confidence']}")
print(f"Text: {chunk['text'][:100]}...")
Token estimation uses chars // 4 (GPT-family approximation). Section detection splits on ATX headings (# through ######). If no headings are found, it falls back to one chunk per page.
Framework integrations
LangChain:
from pdfmux.integrations.langchain import PDFMuxLoader
loader = PDFMuxLoader("report.pdf")
docs = loader.load()
# Each Document has page_content + metadata (confidence, pages, extractor)
LlamaIndex:
from pdfmux.integrations.llamaindex import PDFMuxReader
reader = PDFMuxReader()
docs = reader.load_data("report.pdf")
The decision flowchart
Is your PDF digital (created by software)?
├─ Yes → Is it table-heavy?
│ ├─ Yes → Docling (or pdfmux with tables extra)
│ └─ No → PyMuPDF (fastest, 0.01s/page)
│
├─ No (scanned) → Do you have a GPU?
│ ├─ Yes → Marker or Surya
│ └─ No → RapidOCR (CPU, 200MB)
│
├─ Mixed / Not sure → pdfmux (auto-detects and routes)
│
└─ Building an LLM pipeline?
└─ Yes → pdfmux (confidence scoring, chunking, framework integrations)
Or the one-line version: if you don’t know what your PDFs look like, use pdfmux. It routes to the right tool for each page and tells you when the extraction quality is low.
Cost comparison
| Tool | Cost | Best for |
|---|---|---|
| PyMuPDF | Free, MIT | Digital PDFs |
| pdfplumber | Free, MIT | Tables (simpler than Docling) |
| Docling | Free, MIT | High-accuracy tables |
| RapidOCR | Free, Apache 2.0 | Scanned PDFs, CPU-only |
| Surya | Free, GPL 3.0 | High-quality OCR with GPU |
| Marker | Free, GPL 3.0 | Full ML pipeline with GPU |
| pdfmux | Free, MIT | Smart routing between all of the above |
| Unstructured (OSS) | Free, Apache 2.0 | Enterprise document processing |
| LlamaParse | $0.003/page | Cloud OCR + tables |
| Gemini Flash | ~$0.01-0.05/doc | Best quality, cloud only |
| AWS Textract | $0.015/page | Enterprise, AWS ecosystem |
| Adobe PDF Services | $0.05/operation | Enterprise, Adobe ecosystem |
For most developers: PyMuPDF (free) handles 90% of cases. Add pdfmux (free) when you need confidence scoring, OCR routing, or table detection. Only pay for cloud services when you need the absolute best quality on difficult documents and can’t run models locally.
Try it
# Just PyMuPDF (90% of use cases)
pip install pymupdf pymupdf4llm
# Smart routing with confidence scoring
pip install pdfmux
# Add OCR for scanned docs
pip install "pdfmux[ocr]"
# Add table extraction
pip install "pdfmux[tables]"
# Everything
pip install "pdfmux[all]"
- GitHub — source code, docs, examples
- PyPI —
pip install pdfmux - pdfmux.com — documentation
Built by Nameet Potnis. Contributions welcome.