TL;DR: No single PDF extractor wins at everything. PyMuPDF is 100x faster on digital PDFs. Docling has 97.9% table accuracy. RapidOCR handles scans on CPU in 200MB. Gemini Flash catches everything but costs money. The right tool depends on your documents — this guide helps you pick, with real numbers from maintaining pdfmux, which uses all of them internally.


Why this guide is different

Every PDF tool publishes benchmarks showing themselves winning. Unstructured’s blog says Unstructured is best. LlamaParse’s comparisons say LlamaParse is best. Docling’s papers highlight Docling’s table accuracy.

I’m in a different position. I maintain pdfmux, a self-healing PDF extraction pipeline that routes to the right extractor per page. pdfmux uses PyMuPDF, Docling, RapidOCR, Surya, and Gemini Flash as backends. I don’t compete with any of them — I use them. When one of them gets better, pdfmux gets better.

That means I have no incentive to lie about which tool wins where. Here’s what I’ve found after testing them across thousands of documents.


The landscape in numbers

First, let’s calibrate on what developers are actually using. Monthly PyPI downloads (March 2026):

| Tool | Monthly Downloads | What it is |
|---|---|---|
| PyMuPDF | 43M | C-based PDF engine, text + images + tables |
| pdfplumber | 18.5M | Pure Python, good table extraction |
| Unstructured | 4.9M | Enterprise document processing platform |
| Docling | 5.0M | IBM Research, transformer-based tables |
| pypdf | ~15M | Pure Python, basic operations |
| pdfminer.six | ~8M | Layout-aware text extraction |
| Marker | ~500K | ML-powered, GPU-preferred |

PyMuPDF dominates because it’s fast, reliable, and has zero external dependencies. Most developers start here. The question is what to use when PyMuPDF isn’t enough.


Category 1: Clean, digital PDFs

These are PDFs created by software — Word exports, LaTeX papers, programmatic reports. Text is embedded and extractable. This is 90% of PDFs you’ll encounter.

| Tool | Speed (per page) | Accuracy | Notes |
|---|---|---|---|
| PyMuPDF | 0.01s | 98%+ | Fastest by 50-500x. No dependencies. |
| pdfplumber | 0.05-0.1s | 97%+ | Slightly better on some layouts |
| Docling | 0.3-1s | 95%+ | Overkill, loads transformer models |
| Marker | 0.5-2s | 98%+ | Needs GPU for reasonable speed |
| Gemini Flash | 2-5s | 99%+ | Costs money, sends data to Google |
| Unstructured (OSS) | 0.1-0.5s | 96%+ | Complex setup, many dependencies |
| LlamaParse | 1-3s | 98%+ | Cloud only, $0.003/page |

Winner: PyMuPDF. Not close. At 0.01 seconds per page, it processes a 100-page document in 1 second. It’s been maintained for over a decade, handles edge cases well, and the C backend means Python overhead is negligible.

When PyMuPDF loses on digital PDFs: Multi-column layouts where reading order matters. PyMuPDF extracts text in the raw PDF stream order, which sometimes interleaves columns. For these cases, pdfmux detects multi-column layout (clustering text block x-coordinates with a 50-point gap threshold) and reorders into left-to-right, top-to-bottom reading order.
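The reordering idea can be sketched in a few lines. This is an illustration of the column-clustering heuristic described above, not pdfmux's actual implementation; the block tuples mimic the `(x0, y0, x1, y1, text, ...)` shape that PyMuPDF's `page.get_text("blocks")` returns.

```python
# Sketch of multi-column reordering: cluster text blocks into columns by
# x-position (50-point gap threshold), then read each column top to bottom.
# Block tuples mimic PyMuPDF's page.get_text("blocks"): (x0, y0, x1, y1, text).

def reorder_columns(blocks, gap=50):
    # Sort by left edge so adjacent x0 values can be compared
    blocks = sorted(blocks, key=lambda b: b[0])
    columns = []
    for block in blocks:
        # Start a new column when the x-gap to the previous block exceeds the threshold
        if columns and block[0] - columns[-1][-1][0] <= gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    # Within each column, restore top-to-bottom order
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b[1]))
    return [b[4] for b in ordered]

# Two-column page: left column at x=50, right column at x=320
blocks = [
    (50, 100, 280, 120, "Left para 1"),
    (320, 100, 560, 120, "Right para 1"),
    (50, 140, 280, 160, "Left para 2"),
    (320, 140, 560, 160, "Right para 2"),
]
reorder_columns(blocks)
# → ["Left para 1", "Left para 2", "Right para 1", "Right para 2"]
```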

Recommendation

If your PDFs are digital, use PyMuPDF. Don’t overthink it. pip install pymupdf and move on. (For a ranked list with benchmark scores, see the best PDF extraction libraries for Python in 2026.)

```python
import pymupdf4llm

text = pymupdf4llm.to_markdown("report.pdf")
```

Or if you want confidence scoring to verify extraction quality:

```python
import pdfmux

text = pdfmux.extract_text("report.pdf", quality="fast")
# quality="fast" uses PyMuPDF only, skips the audit — maximum speed
```

Category 2: Table-heavy documents

Financial reports, invoices, data sheets, regulatory filings. The text is digital, but the structure matters — you need to extract tables as actual tables, not as garbled text.

| Tool | Table Accuracy | Preserves Structure | How it works |
|---|---|---|---|
| Docling | 97.9% | Yes (markdown tables) | Transformer-based table detection |
| Gemini Flash | ~95% | Yes | Vision model, understands layout |
| Marker | ~85% | Yes | ML-based with GPU |
| pdfplumber | ~80% | Partial | Rule-based cell detection |
| Unstructured (OSS) | ~85% | Yes | Multiple strategies |
| PyMuPDF | ~60% | Partial | find_tables() heuristic |
| LlamaParse | ~93% | Yes | Cloud ML pipeline |

Winner: Docling. IBM Research built it specifically for structured document understanding. (We compare three methods for extracting tables from PDFs in Python in a dedicated guide.) The transformer models detect table boundaries, identify headers, and extract cell contents with near-human accuracy. The 97.9% figure comes from their benchmarks on DocLayNet, a diverse dataset of financial, technical, and legal documents.

The tradeoff: Docling is slow. First run loads transformer models (~5-10 seconds). After that, it processes at 0.3-1 second per page. For a 10-page invoice, that’s fine. For a 500-page annual report, you’re waiting minutes.

The smart approach: targeted table extraction

pdfmux solves the speed problem by not running Docling on every page. It first detects which pages likely contain tables using a fast heuristic (5 signals scored additively):

| Signal | Score | How it's detected |
|---|---|---|
| Drawn grid lines | 2 | ≥3 horizontal + ≥2 vertical lines |
| Number-dense lines | 2 | ≥5 lines where ≥30% of chars are digits/currency |
| Column alignment | 2 | ≥3 columns with ≥4 aligned text blocks each |
| Whitespace patterns | 1 | ≥5 lines with ≥3 runs of 3+ spaces |
| PyMuPDF find_tables() | 2 | Built-in heuristic finds tables |

A page is flagged as table-candidate if total score ≥ 2. For documents over 50 pages, pdfmux only sends table-candidate pages to Docling, and processes the rest with PyMuPDF. This cuts processing time by 80-95% on most documents while still getting Docling’s accuracy where it matters.
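To make the scoring concrete, here is a sketch of the two signals that can be computed from extracted text alone (number-dense lines and whitespace runs). The function name is hypothetical and this is not pdfmux's internal code; the grid-line and alignment signals would additionally need page geometry.

```python
import re

# Sketch of two text-only table signals (hypothetical helper, not pdfmux
# internals): number-dense lines (+2) and whitespace-run lines (+1).

def table_signal_score(page_text):
    lines = [l for l in page_text.splitlines() if l.strip()]
    score = 0
    # Signal: >=5 lines where >=30% of characters are digits/currency symbols
    numeric = sum(
        1 for l in lines
        if sum(c.isdigit() or c in "$€£%.," for c in l) >= 0.3 * len(l)
    )
    if numeric >= 5:
        score += 2
    # Signal: >=5 lines containing >=3 runs of 3+ consecutive spaces
    gappy = sum(1 for l in lines if len(re.findall(r" {3,}", l)) >= 3)
    if gappy >= 5:
        score += 1
    return score  # table candidate when the total (across all 5 signals) >= 2
```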

Recommendation

```python
# If you know your doc has tables
import pdfmux
text = pdfmux.extract_text("financial_report.pdf")
# standard mode auto-detects tables and routes to Docling

# Or just use Docling directly for small documents
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
print(result.document.export_to_markdown())
```

Install table support: pip install "pdfmux[tables]" (adds Docling, ~500MB with transformer models).


Category 3: Scanned PDFs

Paper documents that were scanned — no embedded text. OCR is mandatory. This is where most pipelines fail silently.

| Tool | Works? | Speed | Footprint | GPU Required? | Quality |
|---|---|---|---|---|---|
| RapidOCR | Yes | 1-3s/page | ~200MB | No (CPU) | Good |
| Surya OCR | Yes | 1-5s/page | ~5GB | Recommended | Very good |
| Marker | Yes | 0.5-2s/page | ~5GB | Yes | Very good |
| Gemini Flash | Yes | 2-5s/page | None (cloud) | No | Excellent |
| Tesseract | Yes | 0.5-2s/page | ~100MB | No | Adequate |
| PyMuPDF | No | n/a | n/a | n/a | Returns empty |
| pdfplumber | No | n/a | n/a | n/a | Returns empty |
| Docling | Partial | Slow | ~500MB | No | Limited OCR |

Winner (quality): Gemini Flash. Vision models understand layout, context, and even handwriting. But it costs money ($0.01-0.05/document) and sends your data to Google’s servers.

Winner (practical): RapidOCR. PaddleOCR v4 models compiled to ONNX, runs on CPU, ~200MB footprint, no GPU required, no external API calls. For a deeper look at how pdfmux achieves near-AI accuracy without a GPU or API keys, see the architecture breakdown. pdfmux defaults to RapidOCR for OCR because it hits the best tradeoff of quality, speed, and deployability.

The silent failure problem: PyMuPDF, pdfplumber, and most basic extraction tools return empty text or near-empty text on scanned pages. They don’t error. They don’t warn. They just return nothing, and your RAG pipeline indexes empty documents, and your agent gives wrong answers, and nobody knows why until a human manually checks.

pdfmux’s approach: extract with PyMuPDF first (instant), then run 5 quality checks. Pages with <20 characters are classified as “empty.” Pages with <200 characters and images are classified as “bad.” Both get re-extracted with OCR automatically.

The confidence scoring checks

Each page starts at 1.0 and gets penalties:

| Check | Penalty | What it catches |
|---|---|---|
| Character density <50 chars | -0.3 | Near-empty pages from scanned docs |
| Alphabetic ratio <0.3 | -0.25 | Garbled OCR output, encoding errors |
| Average word length <2 or >25 | -0.15 | Broken word boundaries |
| Excessive whitespace runs | -0.1 | Layout extraction artifacts |
| Mojibake patterns (â€, �) | -0.2 | Unicode encoding failures |

A page scoring below 0.5 is re-extracted with OCR. A page scoring 0.0 (near-empty) gets full-page OCR. The re-extracted version replaces the original only if it produces more text.
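A simplified sketch of the penalty scheme, for intuition: the thresholds match the table above, but the implementation details (the whitespace-run cutoff in particular) are illustrative rather than pdfmux's exact code.

```python
import re

# Simplified sketch of penalty-based page scoring. Thresholds match the
# table above; the whitespace-run cutoff of 10 is an illustrative guess.

def page_confidence(text):
    score = 1.0
    if len(text) < 50:                         # near-empty page
        score -= 0.3
    letters = sum(c.isalpha() for c in text)
    if text and letters / len(text) < 0.3:     # garbled OCR / encoding errors
        score -= 0.25
    words = text.split()
    if words:
        avg = sum(len(w) for w in words) / len(words)
        if avg < 2 or avg > 25:                # broken word boundaries
            score -= 0.15
    if len(re.findall(r" {4,}", text)) > 10:   # layout extraction artifacts
        score -= 0.1
    if "â€" in text or "\ufffd" in text:       # mojibake / replacement chars
        score -= 0.2
    return max(score, 0.0)

# A healthy page keeps a high score; a mojibake-ridden fragment falls below 0.5
good = "This page contains a normal paragraph of extracted text, long enough to pass."
bad = "â€œ�� 1234 ---- 5678 â€�"
```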

Recommendation

For occasional scanned PDFs: pip install "pdfmux[ocr]" and let the auto-detection handle it.

For heavy OCR workloads with a GPU: consider Surya directly, or Gemini Flash if you’re OK with cloud processing.

```python
import pdfmux

# Automatically detects scanned pages and OCRs them
text = pdfmux.extract_text("scanned_contract.pdf")
# quality="standard" does smart routing — only OCRs pages that need it

# Force OCR on everything
text = pdfmux.extract_text("scanned_contract.pdf", quality="high")
# quality="high" uses Gemini Flash if available, max quality
```

Category 4: Mixed documents

The hardest category. Digital pages with some scanned pages. Table pages mixed with text pages. Forms with embedded images containing text. This is surprisingly common in real-world business documents — think a contract where the signature page was scanned, or a report with embedded screenshots of data.

| Tool | Handles mixed? | How? |
|---|---|---|
| pdfmux | Yes | Per-page classification + targeted extraction |
| Gemini Flash | Yes | Vision model processes every page |
| Unstructured | Partial | Document-level strategy selection |
| LlamaParse | Yes | Cloud ML pipeline |
| PyMuPDF | No | Digital pages fine, scanned pages empty |
| Marker | Partial | ML-based, but all-or-nothing approach |

The problem: Most tools make a document-level decision — either treat the whole thing as digital or treat it as scanned. For a 50-page document where pages 1-48 are digital and pages 49-50 are scanned, running OCR on all 50 pages wastes time and can actually degrade quality on the digital pages.

pdfmux’s approach: Classify every page independently. Extract digital pages with PyMuPDF (instant). OCR only the pages that need it. This means:

  • 48 digital pages processed in ~0.5 seconds
  • 2 scanned pages processed with OCR in ~4-6 seconds
  • Total: ~6 seconds instead of ~150 seconds for full-document OCR
  • Digital pages get perfect extraction, not OCR approximation

The dynamic OCR budget system keeps processing time proportional to the actual problem:

| Document type | OCR budget |
|---|---|
| Mostly digital (<25% graphical) | 30% of pages (enough for scattered scans) |
| Mixed (25-50% graphical) | graphical ratio + 10% |
| Mostly scanned (>50% graphical) | 100% — OCR everything |
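The budget rule above is simple enough to state as a function. A minimal sketch (illustrative function name, not pdfmux's API), taking the fraction of pages classified as graphical and returning the fraction of pages eligible for OCR:

```python
# Sketch of the OCR budget rule: input is the fraction of pages classified
# as graphical/scanned, output is the fraction of pages allowed to be OCRed.

def ocr_budget(graphical_ratio):
    if graphical_ratio < 0.25:    # mostly digital: enough for scattered scans
        return 0.30
    if graphical_ratio <= 0.50:   # mixed: the graphical ratio plus 10% headroom
        return graphical_ratio + 0.10
    return 1.0                    # mostly scanned: OCR everything
```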

Category 5: LLM pipeline / RAG use cases

If you’re building RAG pipelines, agent workflows, or LLM-powered applications, your extraction tool needs to produce LLM-friendly output. That means structured markdown, not raw text.

| Tool | Output format | Chunk-ready? | Token estimates? | Metadata? |
|---|---|---|---|---|
| pdfmux | Markdown | Yes (section-aware) | Yes | Confidence, pages, extractor |
| PyMuPDF4LLM | Markdown | No | No | No |
| Unstructured | Elements | Yes | No | Element types |
| LlamaParse | Markdown | Via API | No | Limited |
| Docling | Markdown | No | No | Table structure |
| Marker | Markdown | No | No | No |

Why markdown matters: published benchmarks report markdown reaching 60.7% LLM accuracy versus 44.3% for the same data as CSV, roughly 20-35% better RAG accuracy than HTML or plain text, and 10-15% token savings versus JSON.

pdfmux’s load_llm_context() function returns section-aware chunks (see our complete guide to PDF-to-Markdown for RAG pipelines for the full ingestion workflow):

```python
import pdfmux

chunks = pdfmux.load_llm_context("report.pdf")
for chunk in chunks:
    print(f"Section: {chunk['title']}")
    print(f"Pages: {chunk['page_start']}-{chunk['page_end']}")
    print(f"Tokens: ~{chunk['tokens']}")
    print(f"Confidence: {chunk['confidence']}")
    print(f"Text: {chunk['text'][:100]}...")
```

Token estimation uses chars // 4 (GPT-family approximation). Section detection splits on ATX headings (# through ######). If no headings are found, it falls back to one chunk per page.
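That heading-split-plus-estimate logic can be sketched in a few lines. This is an illustration of the approach (hypothetical function name), not pdfmux's real chunker, which also tracks page ranges and confidence:

```python
import re

# Sketch of heading-based chunking with chars // 4 token estimates.
# Illustrative only; the real chunker also carries pages and confidence.

def chunk_by_headings(markdown):
    chunks = []
    title, lines = "Document", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):   # an ATX heading starts a new section
            if lines:
                text = "\n".join(lines).strip()
                chunks.append({"title": title, "text": text, "tokens": len(text) // 4})
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if lines:                             # flush the final section
        text = "\n".join(lines).strip()
        chunks.append({"title": title, "text": text, "tokens": len(text) // 4})
    return chunks
```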

Framework integrations

LangChain:

```python
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()
# Each Document has page_content + metadata (confidence, pages, extractor)
```

LlamaIndex:

```python
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")
```

The decision flowchart

```
Is your PDF digital (created by software)?
├─ Yes → Is it table-heavy?
│        ├─ Yes → Docling (or pdfmux with tables extra)
│        └─ No  → PyMuPDF (fastest, 0.01s/page)
├─ No (scanned) → Do you have a GPU?
│                  ├─ Yes → Marker or Surya
│                  └─ No  → RapidOCR (CPU, 200MB)
├─ Mixed / Not sure → pdfmux (auto-detects and routes)
└─ Building an LLM pipeline?
   └─ Yes → pdfmux (confidence scoring, chunking, framework integrations)
```

Or the one-line version: if you don’t know what your PDFs look like, use pdfmux. It routes to the right tool for each page and tells you when the extraction quality is low.


Cost comparison

| Tool | Cost | Best for |
|---|---|---|
| PyMuPDF | Free, MIT | Digital PDFs |
| pdfplumber | Free, MIT | Tables (simpler than Docling) |
| Docling | Free, MIT | High-accuracy tables |
| RapidOCR | Free, Apache 2.0 | Scanned PDFs, CPU-only |
| Surya | Free, GPL 3.0 | High-quality OCR with GPU |
| Marker | Free, GPL 3.0 | Full ML pipeline with GPU |
| pdfmux | Free, MIT | Smart routing between all of the above |
| Unstructured (OSS) | Free, Apache 2.0 | Enterprise document processing |
| LlamaParse | $0.003/page | Cloud OCR + tables |
| Gemini Flash | ~$0.01-0.05/doc | Best quality, cloud only |
| AWS Textract | $0.015/page | Enterprise, AWS ecosystem |
| Adobe PDF Services | $0.05/operation | Enterprise, Adobe ecosystem |

For most developers: PyMuPDF (free) handles 90% of cases. Add pdfmux (free) when you need confidence scoring, OCR routing, or table detection. Only pay for cloud services when you need the absolute best quality on difficult documents and can’t run models locally.


Try it

```bash
# Just PyMuPDF (90% of use cases)
pip install pymupdf pymupdf4llm

# Smart routing with confidence scoring
pip install pdfmux

# Add OCR for scanned docs
pip install "pdfmux[ocr]"

# Add table extraction
pip install "pdfmux[tables]"

# Everything
pip install "pdfmux[all]"
```

Built by Nameet Potnis. Contributions welcome.