TL;DR: The PDF extraction landscape shifted hard in late 2025 and early 2026. OpenDataLoader (Hancom) took the #1 benchmark slot with a hybrid AI engine. Docling still owns table extraction. Marker still needs a GPU. pdfmux still routes between all of them. This guide covers 7 tools with honest assessments — what each is good at, what each is bad at, and which one you should pick based on your actual use case.
There are too many PDF extractors now
A year ago, the decision was simple: PyMuPDF for digital, Docling for tables, Tesseract or Surya for scans. Maybe LlamaParse if you wanted to pay.
In 2026, the field has exploded. OpenDataLoader entered with enterprise backing and top benchmark scores. MinerU built a full ML pipeline. MarkItDown brought Microsoft into the game. Every month there’s a new tool claiming to be the best PDF parser.
I maintain pdfmux, a self-healing extraction pipeline that routes to different backends per page. I’ve tested all of these tools extensively because I need to know which ones are worth integrating. This is what I’ve found — no spin, just data. (For the original benchmark story, see how I benchmarked every PDF-to-Markdown tool and then built a router.)
The decision flowchart
Before the deep dives, here’s the quick version:
```
What are your PDFs like?
│
├─ Digital (software-generated, clean text)
│  ├─ Simple text/paragraphs → pymupdf4llm (fastest, 0.01s/page)
│  ├─ Heavy tables → Docling (97.9% table accuracy)
│  └─ Need bounding boxes / accessibility → OpenDataLoader
│
├─ Scanned (paper → scanner → PDF)
│  ├─ Have a GPU → Marker or MinerU
│  ├─ No GPU → pdfmux with RapidOCR (CPU, ~200MB)
│  └─ Budget available → Gemini Flash (best quality)
│
├─ Mixed (some digital, some scanned)
│  └─ pdfmux (classifies each page, routes automatically)
│
├─ Not just PDFs (Word, PowerPoint, HTML, etc.)
│  └─ MarkItDown (wide format support) or OpenDataLoader
│
└─ Building a RAG / LLM pipeline?
   ├─ Need confidence scores → pdfmux
   ├─ Enterprise compliance → OpenDataLoader
   └─ Just need it to work → pdfmux or OpenDataLoader
```
Now let’s look at each tool properly.
1. OpenDataLoader (Hancom)
What it is: A hybrid AI document extraction engine from Hancom (Korean enterprise software company). Open-source with 3.8K GitHub stars as of March 2026. Uses a combination of rule-based and AI models to handle layout detection, table extraction, and OCR in a single pipeline.
GitHub stars: ~3,800 | License: Apache 2.0
When to use it
- You need bounding box coordinates for every extracted element
- Accessibility compliance matters (WCAG, Section 508)
- Your documents span multiple languages (CJK support is strong — Hancom’s heritage)
- Enterprise environment where “corporate-backed” matters for procurement
- You need SDKs beyond Python (Java, C++, REST API available)
When NOT to use it
- Simple digital PDFs where pymupdf4llm is 20-80x faster (0.01s/page vs 0.2-0.8s/page)
- You want a minimal dependency footprint — OpenDataLoader pulls in ML models
- You’re running in a constrained environment (Lambda, small containers)
- You need the fastest possible throughput on clean documents
Install and usage
```shell
pip install opendataloader
```

```python
from opendataloader import DocumentLoader

loader = DocumentLoader()
result = loader.load("report.pdf")
# Returns structured elements with bounding boxes, types, confidence
```
Benchmark numbers
| Metric | Score |
|---|---|
| Reading order accuracy | #1 on LayoutBench (as of Feb 2026) |
| Table extraction | ~94% (behind Docling’s 97.9%) |
| OCR quality | Very good, multi-engine |
| Speed (digital PDF) | 0.2-0.8s/page |
| Speed (scanned PDF) | 1-4s/page |
| Disk footprint | ~1-2GB with models |
Honest take
OpenDataLoader is the most well-rounded tool in the field right now. The hybrid approach — combining traditional PDF parsing with AI models — means it rarely fails catastrophically. The bounding box support is genuinely useful if you need to highlight or annotate source locations. The multi-language SDK story is the best in the space.
The downsides: it’s slower than pymupdf4llm on simple documents by 20-80x. The model download is heavy. And the “corporate-backed” angle cuts both ways — Hancom’s priorities may not always align with open-source community needs. The project is young and the API has changed between versions.
2. Docling (IBM)
What it is: IBM Research’s document understanding toolkit. Built specifically for structured document extraction — tables, figures, layout detection. Uses transformer models internally (DocLayNet-trained).
GitHub stars: ~18K | License: MIT | PyPI: ~5M monthly downloads
When to use it
- Your documents are table-heavy (financial reports, invoices, data sheets)
- Table accuracy is more important than speed
- You need structured markdown output with preserved table formatting
When NOT to use it
- Simple digital PDFs with no tables (pymupdf4llm is 30-100x faster)
- Scanned documents (Docling’s OCR is limited)
- Low-resource environments (loads transformer models on first run, ~500MB)
- You need sub-second latency
Install and usage
```shell
pip install docling
```

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")
markdown = result.document.export_to_markdown()
```
Benchmark numbers
| Metric | Score |
|---|---|
| Table extraction accuracy | 97.9% (DocLayNet benchmark) |
| Layout detection | 93%+ |
| Speed (digital PDF) | 0.3-1s/page |
| Speed (with tables) | 1-3s/page |
| First-run overhead | 5-10s (model loading) |
| Disk footprint | ~500MB (transformer models) |
Honest take
Docling is the best table extractor available. Period. The 97.9% accuracy on DocLayNet is not marketing — I’ve verified it on real financial documents. If you’re extracting invoices, SEC filings, or any document where table structure matters, Docling should be in your pipeline.
The problem is that Docling is mediocre at everything else. It’s slow on simple text documents. Its OCR support is an afterthought. And the 500MB model download means it’s not great for lightweight deployments. Use it for what it’s good at — tables — and use something else for the rest.
3. Marker
What it is: ML-powered PDF-to-markdown converter. Uses a full deep learning pipeline for layout detection, OCR, and text extraction. Built by VikParuchuri (also behind Surya OCR).
GitHub stars: ~20K | License: GPL 3.0 | PyPI: ~500K monthly downloads
When to use it
- You have a GPU available
- Your documents are complex (multi-column, mixed content, academic papers)
- You want high-quality markdown output from any PDF type
- Quality matters more than speed
When NOT to use it
- You don’t have a GPU (CPU inference is painfully slow — 10-30s/page)
- Simple digital PDFs (pymupdf4llm gives you the same quality in 1% of the time)
- Production environments where you need predictable latency
- GPL 3.0 license is a problem for your project
Install and usage
```shell
pip install marker-pdf
```

```python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

models = create_model_dict()
converter = PdfConverter(artifact_dict=models)
rendered = converter("complex_paper.pdf")
text = rendered.markdown
```
Benchmark numbers
| Metric | Score |
|---|---|
| Overall extraction quality | Very high (especially complex layouts) |
| Speed (GPU) | 0.5-2s/page |
| Speed (CPU) | 10-30s/page |
| OCR quality | Very good (Surya-based) |
| Disk footprint | ~5GB (multiple ML models) |
| GPU VRAM needed | 4-8GB recommended |
Honest take
Marker produces excellent output. On complex academic papers, multi-column layouts, and documents with mixed text/figures, it’s consistently better than rule-based tools. The Surya OCR integration means it handles scans well too.
The dealbreaker for most people is the GPU requirement. Without a GPU, Marker is unusable in production. With a GPU, it’s one of the best tools available — but you’re paying for that GPU. At $0.50-1.50/hour for a cloud GPU, the cost per document adds up fast compared to free CPU-based alternatives. Also, GPL 3.0 means you can’t use it in proprietary software without open-sourcing your code.
4. MinerU
What it is: A full ML document extraction pipeline from the OpenDataLab team. End-to-end: layout detection, formula recognition, OCR, table extraction, reading order — all using deep learning models.
GitHub stars: ~30K | License: AGPL 3.0
When to use it
- Academic papers with formulas (LaTeX output for equations)
- Complex multi-column layouts
- You need a complete ML pipeline and have the infrastructure for it
- Research environments where setup complexity is acceptable
When NOT to use it
- Production services (complex setup, heavy dependencies)
- Simple documents (massive overkill)
- Constrained environments (needs multiple GB of models)
- You need a stable API (still evolving rapidly)
- AGPL license is a problem
Install and usage
```shell
pip install magic-pdf
# Plus model downloads — see their docs for the full setup
```

```python
from magic_pdf.data.data_reader_writer import FileBasedDataReader
# Setup is more involved — see MinerU documentation
```
Benchmark numbers
| Metric | Score |
|---|---|
| Layout detection | Excellent (YOLO-based) |
| Formula extraction | Best in class (LaTeX output) |
| Table extraction | ~90% |
| Speed | 2-5s/page (GPU) |
| Disk footprint | 5-10GB (multiple model weights) |
| Setup complexity | High |
Honest take
MinerU is impressive engineering. The formula recognition alone makes it the best choice for academic and scientific documents. The layout detection is strong, and the full pipeline approach means fewer edge cases than cobbling tools together.
But MinerU is not a “pip install and go” tool. The setup involves downloading multiple model weights, configuring paths, and dealing with dependency conflicts. The AGPL license is restrictive. And for non-academic documents — business reports, contracts, invoices — it’s severe overkill. If you’re not extracting LaTeX formulas, you probably don’t need MinerU.
5. pymupdf4llm
What it is: A thin wrapper around PyMuPDF that outputs LLM-friendly markdown. The “just works” option for digital PDFs.
GitHub stars: Part of PyMuPDF (~30K) | License: AGPL 3.0 (PyMuPDF) | PyPI: ~43M monthly downloads (PyMuPDF)
When to use it
- Your PDFs are digital (software-generated, not scanned)
- Speed is critical (batch processing thousands of documents)
- You want zero external dependencies beyond PyMuPDF
- Simple API, minimal setup
When NOT to use it
- Scanned PDFs (returns empty text — silently)
- Table-heavy documents (basic table detection, ~60% accuracy)
- You need confidence scores or quality metrics
- Mixed documents where some pages are scanned
Install and usage
```shell
pip install pymupdf4llm
```

```python
import pymupdf4llm

md = pymupdf4llm.to_markdown("report.pdf")
```
Benchmark numbers
| Metric | Score |
|---|---|
| Speed (digital PDF) | 0.01s/page (fastest in class) |
| Digital text accuracy | 98%+ |
| Table extraction | ~60% |
| Scanned PDF support | None |
| Disk footprint | ~30MB |
Honest take
pymupdf4llm is the right answer for the majority of PDF extraction tasks. Most PDFs are digital. Most digital PDFs are straightforward. At 0.01 seconds per page, you can process 10,000 pages per minute on a single core with no GPU.
The problem is that pymupdf4llm doesn’t tell you when it fails. Feed it a scanned document and it returns empty strings or near-empty strings with no error, no warning, nothing. Your RAG pipeline indexes empty documents and nobody knows until a human checks. For reliable pipelines, you need something on top of pymupdf4llm that verifies extraction quality — which is exactly what pdfmux does. For a per-category breakdown with cost analysis, see our honest guide to which PDF extractor you should use.
6. MarkItDown (Microsoft)
What it is: Microsoft’s document-to-markdown converter. Supports a wide range of formats: PDF, Word, PowerPoint, Excel, HTML, images, audio transcription, and more.
GitHub stars: ~40K+ | License: MIT
When to use it
- You need to convert many document types, not just PDFs
- Microsoft Office formats are common in your pipeline (Word, PowerPoint, Excel)
- You want a single tool for all document types
- MIT license is important
When NOT to use it
- PDF quality matters (MarkItDown’s PDF handling is basic)
- Tables, scans, or complex layouts (not optimized for these)
- You need per-page confidence or quality metrics
- PDF is your primary format (use a dedicated PDF tool)
Install and usage
```shell
pip install markitdown
```

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```
Benchmark numbers
| Metric | Score |
|---|---|
| Format support | Widest (PDF, DOCX, PPTX, XLSX, HTML, images, audio) |
| PDF text accuracy | ~90% (basic extraction) |
| Table extraction | Basic |
| Scanned PDF support | Limited |
| Disk footprint | ~100MB |
Honest take
MarkItDown’s value is breadth, not depth. If your pipeline needs to ingest Word docs, PowerPoint decks, Excel sheets, AND PDFs, MarkItDown gives you one interface for all of them. The 40K+ GitHub stars reflect how useful that is.
But for PDFs specifically, MarkItDown is not competitive with dedicated tools. Its PDF extraction is essentially a basic text dump — no layout intelligence, no table structure, no OCR. If PDFs are your primary concern, use a dedicated PDF tool and use MarkItDown for the other formats.
7. pdfmux
What it is: A self-healing PDF extraction pipeline that routes each page to the best available extractor, scores quality, and re-extracts failures automatically. Full disclosure: I built it.
GitHub stars: ~700 | License: MIT | PyPI: growing
When to use it
- Mixed documents (some digital, some scanned, some tables)
- You need confidence scores to know which pages extracted well
- Building RAG pipelines (LangChain/LlamaIndex integrations, chunking, token estimates)
- You want smart routing without manually picking extractors
- You can’t predict what PDFs users will upload
When NOT to use it
- You know all your PDFs are clean digital text (pymupdf4llm is simpler and just as fast)
- You need bounding box coordinates (OpenDataLoader does this better)
- Enterprise compliance/accessibility requirements (OpenDataLoader)
- You need formula extraction (MinerU)
- You want the absolute highest extraction quality regardless of cost (Gemini Flash)
Install and usage
```shell
pip install pdfmux            # base — handles 90% of PDFs
pip install "pdfmux[ocr]"     # add OCR for scanned docs
pip install "pdfmux[tables]"  # add Docling for table extraction
pip install "pdfmux[all]"     # everything
```

```python
import pdfmux

text = pdfmux.extract_text("anything.pdf")
# Automatically: PyMuPDF → audit → OCR bad pages → Docling on tables
# Returns markdown with per-page confidence scores
```
Benchmark numbers
| Metric | Score |
|---|---|
| Reading order accuracy | #2 on LayoutBench (behind OpenDataLoader) |
| Digital PDF speed | 0.01s/page (PyMuPDF backend) |
| Scanned PDF speed | 1-3s/page (RapidOCR, CPU) |
| Table accuracy | 97.9% (Docling backend, when installed) |
| Confidence scoring | Yes (0.0-1.0 per page, 5 quality checks) |
| Disk footprint (base) | ~30MB |
| Disk footprint (all) | ~2GB |
Honest take
pdfmux is good at the orchestration problem — figuring out what kind of page you have and routing to the right tool. The confidence scoring is genuinely unique. No other tool tells you “page 7 scored 0.3, I re-extracted it with OCR and now it scores 0.87.”
Where pdfmux falls short:
- It’s a router, not an engine. pdfmux is only as good as its backends. If you don’t install the OCR or table extras, it falls back to pymupdf4llm — which means scanned pages return empty and tables are approximate.
- No bounding boxes. OpenDataLoader returns coordinate-level element positions. pdfmux returns text and markdown. If you need to highlight source locations in a UI, pdfmux can’t do that today.
- Smaller community. With ~700 stars versus OpenDataLoader’s 3.8K or Marker’s 20K, there are fewer Stack Overflow answers, fewer tutorials, fewer edge cases already reported and fixed.
- No formula support. Academic papers with equations should go through MinerU, not pdfmux.
Cost comparison
Real costs matter. Here’s what each tool actually requires:
| Tool | License | GPU Required? | Disk Space | API Cost | Best for |
|---|---|---|---|---|---|
| pymupdf4llm | AGPL 3.0 | No | 30MB | Free | Digital PDFs, speed |
| pdfmux (base) | MIT | No | 30MB | Free | Smart routing, confidence |
| pdfmux (all) | MIT | No | ~2GB | Free | Mixed documents |
| Docling | MIT | No | 500MB | Free | Tables |
| OpenDataLoader | Apache 2.0 | No (helps) | 1-2GB | Free | Enterprise, bounding boxes |
| Marker | GPL 3.0 | Yes | 5GB | Free + GPU cost | Complex layouts |
| MinerU | AGPL 3.0 | Yes | 5-10GB | Free + GPU cost | Academic papers, formulas |
| MarkItDown | MIT | No | 100MB | Free | Multi-format |
| Gemini Flash | Proprietary | No | None | ~$0.01-0.05/doc | Best quality, cloud |
| LlamaParse | Proprietary | No | None | $0.003/page | Cloud, managed |
| AWS Textract | Proprietary | No | None | $0.015/page | AWS ecosystem |
GPU cost note: A cloud GPU (T4/A10) runs $0.50-1.50/hour. If you’re processing 100 documents/hour with Marker, that’s $0.005-0.015 per document in GPU cost alone — comparable to LlamaParse’s per-page pricing but with the hassle of managing infrastructure.
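The arithmetic behind that note, as a quick sanity check (the rates and throughput are the figures quoted above, not measurements of mine):

```python
def gpu_cost_per_doc(hourly_rate: float, docs_per_hour: int) -> float:
    """GPU cost attributed to each document processed in that hour."""
    return hourly_rate / docs_per_hour

low = gpu_cost_per_doc(0.50, 100)   # cheap T4 rate
high = gpu_cost_per_doc(1.50, 100)  # pricier A10 rate
print(low, high)
```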
If you’re building RAG
This section is specifically for people building retrieval-augmented generation pipelines, AI agents, or LLM-powered applications. Your extraction tool has a direct impact on answer quality.
What matters for RAG
- Extraction accuracy — garbage in, garbage out. A hallucinating LLM on top of garbled extraction is a liability.
- Structured output — markdown outperforms plain text by 20-35% in RAG accuracy benchmarks. Tables preserved as markdown tables, not flattened text.
- Confidence signals — you need to know when extraction failed so you can flag low-confidence chunks rather than serving wrong answers confidently.
- Chunking quality — section-aware chunks beat fixed-size splits. A chunk that crosses section boundaries confuses the retriever.
- Cost at scale — if you’re processing thousands of documents, $0.003/page adds up. Free local tools matter.
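The chunking point is worth making concrete. Here is a minimal sketch of section-aware chunking over markdown output; the splitting rule (break at `#` headings) is illustrative, and real chunkers also enforce token budgets and overlap:

```python
def section_chunks(markdown: str) -> list[str]:
    """Split markdown into chunks at heading boundaries."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "# Intro\ntext\n## Methods\nmore text\n"
print(section_chunks(doc))  # ['# Intro\ntext', '## Methods\nmore text']
```

Each chunk now carries its own heading, so the retriever never sees a fragment that starts mid-section.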
My recommendation for RAG
Tier 1 — Start here:
```shell
pip install pdfmux
```
pdfmux with the base install handles 90% of documents. You get confidence scores, section-aware chunking, token estimates, and LangChain/LlamaIndex integrations out of the box. Cost: $0. (See our PDF-to-Markdown for RAG guide for the complete ingestion pipeline.)
Tier 2 — When you need more:
```shell
pip install "pdfmux[all]"
```
Adds Docling for tables and RapidOCR for scans. Handles mixed documents automatically. Still runs on CPU, still free. Disk cost: ~2GB.
Tier 3 — Enterprise or high-stakes:
Consider OpenDataLoader for bounding box support (useful for citation highlighting in UIs) or Gemini Flash for the absolute highest extraction quality on difficult documents. Budget $0.01-0.05 per document for the cloud API.
What I’d avoid for RAG:
- MarkItDown for PDFs — its PDF extraction is too basic. Use it for Office formats, not PDFs.
- Marker without a GPU — CPU inference is too slow for production RAG pipelines.
- MinerU for business documents — the formula extraction is great but the setup cost isn’t justified unless you have academic papers.
- pymupdf4llm alone — fast and accurate on digital PDFs, but the silent failure on scanned pages will bite you. At minimum, add a confidence check.
The real comparison: what do you actually need?
After testing all of these tools across hundreds of documents, here’s the pattern I see:
Most people need pymupdf4llm + a safety net. 90% of PDFs are digital. pymupdf4llm handles those perfectly in milliseconds. The remaining 10% — scans, complex tables, mixed documents — need specialized tools. The question is how you handle that 10%.
Option A: Ignore the 10% and accept occasional failures. This is what most production pipelines do, whether they admit it or not.
Option B: Run everything through an ML pipeline (Marker, MinerU, OpenDataLoader). This works but is 50-500x slower than necessary for the 90% of documents that don’t need it.
Option C: Detect which pages need help and route accordingly. This is what pdfmux does. Extract fast, audit quality, re-extract only what’s broken.
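Option C, sketched with stub extractors. This only shows the extract-audit-reroute shape; the page dicts and both extractor functions are stand-ins I made up, and the real routing in pdfmux is more involved:

```python
def fast_extract(page: dict) -> str:
    # Stand-in for a PyMuPDF text-layer read; scanned pages yield nothing
    return page.get("text_layer", "")

def slow_ocr(page: dict) -> str:
    # Stand-in for an expensive OCR backend
    return page.get("pixels_say", "")

def extract(pages: list[dict], threshold: int = 20) -> list[str]:
    """Extract fast, audit each page, re-extract only the failures."""
    out = []
    for page in pages:
        text = fast_extract(page)
        if len(text.strip()) < threshold:   # audit: suspiciously little text
            text = slow_ocr(page)           # re-extract only this page
        out.append(text)
    return out

pages = [
    {"text_layer": "A normal digital page with a proper text layer."},
    {"pixels_say": "Recovered by OCR from a scanned page."},
]
print(extract(pages))
```

Only the second page pays the OCR cost; the digital page goes through at full speed.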
I’m biased toward option C — I built the tool. But the engineering argument is sound regardless of which tool you use: don’t run expensive extraction on pages that don’t need it.
What changed since 2025
For anyone coming from the previous version of this guide:
- OpenDataLoader is new and legitimate. Corporate backing from Hancom, real benchmark results, multi-language SDKs. It’s not vaporware.
- MinerU matured significantly. Formula extraction is now production-quality. Setup is still complex.
- Docling hit 5M monthly downloads. The IBM team is actively maintaining it. Table accuracy remains the best available.
- MarkItDown crossed 40K stars. Microsoft’s backing gives it momentum, but the PDF extraction hasn’t improved much.
- The “best PDF parser” is still context-dependent. Anyone telling you one tool wins at everything is selling you something.
Try them
```shell
# The fast default (90% of cases)
pip install pymupdf4llm

# Smart routing with confidence scoring
pip install pdfmux

# Add OCR and tables
pip install "pdfmux[all]"

# Enterprise with bounding boxes
pip install opendataloader

# Best tables
pip install docling

# ML-powered (needs GPU)
pip install marker-pdf

# Multi-format (not just PDFs)
pip install markitdown
```
Pick the one that matches your documents and constraints. There’s no universal winner — but there is a right tool for your specific use case.
Keep reading
- Best PDF extraction library for Python in 2026 — the ranked benchmark results behind these recommendations
- pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — head-to-head numbers on opendataloader-bench
- We ran pdfmux on Tesla 10-Ks and Supreme Court opinions — 1,422-page stress test with real SEC filings and legal documents
- How to give your AI agent the ability to read any PDF — connect any of these tools to Claude or Cursor via MCP
Built by Nameet Potnis. Have a PDF extraction war story? Open an issue or find me at @nameetp.