TL;DR: The PDF extraction landscape shifted hard in late 2025 and early 2026. OpenDataLoader (Hancom) took the #1 benchmark slot with a hybrid AI engine. Docling still owns table extraction. Marker still needs a GPU. pdfmux still routes between all of them. This guide covers 7 tools with honest assessments — what each is good at, what each is bad at, and which one you should pick based on your actual use case.
There are too many PDF extractors now
A year ago, the decision was simple: PyMuPDF for digital, Docling for tables, Tesseract or Surya for scans. Maybe LlamaParse if you wanted to pay.
In 2026, the field has exploded. OpenDataLoader entered with enterprise backing and top benchmark scores. MinerU built a full ML pipeline. MarkItDown brought Microsoft into the game. Every month there’s a new tool claiming to be the best PDF parser.
I maintain pdfmux, a self-healing extraction pipeline that routes to different backends per page. I’ve tested all of these tools extensively because I need to know which ones are worth integrating. This is what I’ve found — no spin, just data. (For the original benchmark story, see how I benchmarked every PDF-to-Markdown tool and then built a router.)
The decision flowchart
Before the deep dives, here’s the quick version:
```
What are your PDFs like?
│
├─ Digital (software-generated, clean text)
│  ├─ Simple text/paragraphs → pymupdf4llm (fastest, 0.01s/page)
│  ├─ Heavy tables → Docling (97.9% table accuracy)
│  └─ Need bounding boxes / accessibility → OpenDataLoader
│
├─ Scanned (paper → scanner → PDF)
│  ├─ Have a GPU → Marker or MinerU
│  ├─ No GPU → pdfmux with RapidOCR (CPU, ~200MB)
│  └─ Budget available → Gemini Flash (best quality)
│
├─ Mixed (some digital, some scanned)
│  └─ pdfmux (classifies each page, routes automatically)
│
├─ Not just PDFs (Word, PowerPoint, HTML, etc.)
│  └─ MarkItDown (wide format support) or OpenDataLoader
│
└─ Building a RAG / LLM pipeline?
   ├─ Need confidence scores → pdfmux
   ├─ Enterprise compliance → OpenDataLoader
   └─ Just need it to work → pdfmux or OpenDataLoader
```
Now let’s look at each tool properly.
1. OpenDataLoader (Hancom)
What it is: A hybrid AI document extraction engine from Hancom (Korean enterprise software company). Open-source with 3.8K GitHub stars as of March 2026. Uses a combination of rule-based and AI models to handle layout detection, table extraction, and OCR in a single pipeline.
GitHub stars: ~3,800 | License: Apache 2.0
When to use it
- You need bounding box coordinates for every extracted element
- Accessibility compliance matters (WCAG, Section 508)
- Your documents span multiple languages (CJK support is strong — Hancom’s heritage)
- Enterprise environment where “corporate-backed” matters for procurement
- You need SDKs beyond Python (Java, C++, REST API available)
When NOT to use it
- Simple digital PDFs where pymupdf4llm is 20-80x faster (0.01s/page vs 0.2-0.8s/page)
- You want a minimal dependency footprint — OpenDataLoader pulls in ML models
- You’re running in a constrained environment (Lambda, small containers)
- You need the fastest possible throughput on clean documents
Install and usage
```shell
pip install opendataloader
```

```python
from opendataloader import DocumentLoader

loader = DocumentLoader()
result = loader.load("report.pdf")
# Returns structured elements with bounding boxes, types, confidence
```
Benchmark numbers
| Metric | Score |
|---|---|
| Reading order accuracy | #1 on LayoutBench (as of Feb 2026) |
| Table extraction | ~94% (behind Docling’s 97.9%) |
| OCR quality | Very good, multi-engine |
| Speed (digital PDF) | 0.2-0.8s/page |
| Speed (scanned PDF) | 1-4s/page |
| Disk footprint | ~1-2GB with models |
Honest take
OpenDataLoader is the most well-rounded tool in the field right now. The hybrid approach — combining traditional PDF parsing with AI models — means it rarely fails catastrophically. The bounding box support is genuinely useful if you need to highlight or annotate source locations. The multi-language SDK story is the best in the space.
The downsides: it’s slower than pymupdf4llm on simple documents by 20-80x. The model download is heavy. And the “corporate-backed” angle cuts both ways — Hancom’s priorities may not always align with open-source community needs. The project is young and the API has changed between versions.
2. Docling (IBM)
What it is: IBM Research’s document understanding toolkit. Built specifically for structured document extraction — tables, figures, layout detection. Uses transformer models internally (DocLayNet-trained).
GitHub stars: ~18K | License: MIT | PyPI: ~5M monthly downloads
When to use it
- Your documents are table-heavy (financial reports, invoices, data sheets)
- Table accuracy is more important than speed
- You need structured markdown output with preserved table formatting
When NOT to use it
- Simple digital PDFs with no tables (pymupdf4llm is 30-100x faster)
- Scanned documents (Docling’s OCR is limited)
- Low-resource environments (loads transformer models on first run, ~500MB)
- You need sub-second latency
Install and usage
```shell
pip install docling
```

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")
markdown = result.document.export_to_markdown()
```
Benchmark numbers
| Metric | Score |
|---|---|
| Table extraction accuracy | 97.9% (DocLayNet benchmark) |
| Layout detection | 93%+ |
| Speed (digital PDF) | 0.3-1s/page |
| Speed (with tables) | 1-3s/page |
| First-run overhead | 5-10s (model loading) |
| Disk footprint | ~500MB (transformer models) |
Honest take
Docling is the best table extractor available. Period. The 97.9% accuracy on DocLayNet is not marketing — I’ve verified it on real financial documents. If you’re extracting invoices, SEC filings, or any document where table structure matters, Docling should be in your pipeline.
The problem is that Docling is mediocre at everything else. It’s slow on simple text documents. Its OCR support is an afterthought. And the 500MB model download means it’s not great for lightweight deployments. Use it for what it’s good at — tables — and use something else for the rest.
3. Marker
What it is: ML-powered PDF-to-markdown converter. Uses a full deep learning pipeline for layout detection, OCR, and text extraction. Built by VikParuchuri (also behind Surya OCR).
GitHub stars: ~20K | License: GPL 3.0 | PyPI: ~500K monthly downloads
When to use it
- You have a GPU available
- Your documents are complex (multi-column, mixed content, academic papers)
- You want high-quality markdown output from any PDF type
- Quality matters more than speed
When NOT to use it
- You don’t have a GPU (CPU inference is painfully slow — 10-30s/page)
- Simple digital PDFs (pymupdf4llm gives you the same quality in 1% of the time)
- Production environments where you need predictable latency
- GPL 3.0 license is a problem for your project
Install and usage
```shell
pip install marker-pdf
```

```python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

models = create_model_dict()
converter = PdfConverter(artifact_dict=models)
rendered = converter("complex_paper.pdf")
text = rendered.markdown
```
Benchmark numbers
| Metric | Score |
|---|---|
| Overall extraction quality | Very high (especially complex layouts) |
| Speed (GPU) | 0.5-2s/page |
| Speed (CPU) | 10-30s/page |
| OCR quality | Very good (Surya-based) |
| Disk footprint | ~5GB (multiple ML models) |
| GPU VRAM needed | 4-8GB recommended |
Honest take
Marker produces excellent output. On complex academic papers, multi-column layouts, and documents with mixed text/figures, it’s consistently better than rule-based tools. The Surya OCR integration means it handles scans well too.
The dealbreaker for most people is the GPU requirement. Without a GPU, Marker is unusable in production. With a GPU, it’s one of the best tools available — but you’re paying for that GPU. At $0.50-1.50/hour for a cloud GPU, the cost per document adds up fast compared to free CPU-based alternatives. Also, GPL 3.0 means you can’t use it in proprietary software without open-sourcing your code.
4. MinerU
What it is: A full ML document extraction pipeline from the OpenDataLab team. End-to-end: layout detection, formula recognition, OCR, table extraction, reading order — all using deep learning models.
GitHub stars: ~30K | License: AGPL 3.0
When to use it
- Academic papers with formulas (LaTeX output for equations)
- Complex multi-column layouts
- You need a complete ML pipeline and have the infrastructure for it
- Research environments where setup complexity is acceptable
When NOT to use it
- Production services (complex setup, heavy dependencies)
- Simple documents (massive overkill)
- Constrained environments (needs multiple GB of models)
- You need a stable API (still evolving rapidly)
- AGPL license is a problem
Install and usage
```shell
pip install magic-pdf
# Plus model downloads — see their docs for the full setup
```

```python
from magic_pdf.data.data_reader_writer import FileBasedDataReader
# Setup is more involved — see MinerU documentation
```
Benchmark numbers
| Metric | Score |
|---|---|
| Layout detection | Excellent (YOLO-based) |
| Formula extraction | Best in class (LaTeX output) |
| Table extraction | ~90% |
| Speed | 2-5s/page (GPU) |
| Disk footprint | 5-10GB (multiple model weights) |
| Setup complexity | High |
Honest take
MinerU is impressive engineering. The formula recognition alone makes it the best choice for academic and scientific documents. The layout detection is strong, and the full pipeline approach means fewer edge cases than cobbling tools together.
But MinerU is not a “pip install and go” tool. The setup involves downloading multiple model weights, configuring paths, and dealing with dependency conflicts. The AGPL license is restrictive. And for non-academic documents — business reports, contracts, invoices — it’s severe overkill. If you’re not extracting LaTeX formulas, you probably don’t need MinerU.
5. pymupdf4llm
What it is: A thin wrapper around PyMuPDF that outputs LLM-friendly markdown. The “just works” option for digital PDFs.
GitHub stars: Part of PyMuPDF (~30K) | License: AGPL 3.0 (PyMuPDF) | PyPI: ~43M monthly downloads (PyMuPDF)
When to use it
- Your PDFs are digital (software-generated, not scanned)
- Speed is critical (batch processing thousands of documents)
- You want zero external dependencies beyond PyMuPDF
- Simple API, minimal setup
When NOT to use it
- Scanned PDFs (returns empty text — silently)
- Table-heavy documents (basic table detection, ~60% accuracy)
- You need confidence scores or quality metrics
- Mixed documents where some pages are scanned
Install and usage
```shell
pip install pymupdf4llm
```

```python
import pymupdf4llm

md = pymupdf4llm.to_markdown("report.pdf")
```
Benchmark numbers
| Metric | Score |
|---|---|
| Speed (digital PDF) | 0.01s/page (fastest in class) |
| Digital text accuracy | 98%+ |
| Table extraction | ~60% |
| Scanned PDF support | None |
| Disk footprint | ~30MB |
Honest take
pymupdf4llm is the right answer for the majority of PDF extraction tasks. Most PDFs are digital. Most digital PDFs are straightforward. At 0.01 seconds per page, you can process 10,000 pages per minute on a single core with no GPU.
The problem is that pymupdf4llm doesn’t tell you when it fails. Feed it a scanned document and it returns empty strings or near-empty strings with no error, no warning, nothing. Your RAG pipeline indexes empty documents and nobody knows until a human checks. For reliable pipelines, you need something on top of pymupdf4llm that verifies extraction quality — which is exactly what pdfmux does. For a per-category breakdown with cost analysis, see our honest guide to which PDF extractor you should use.
6. MarkItDown (Microsoft)
What it is: Microsoft’s document-to-markdown converter. Supports a wide range of formats: PDF, Word, PowerPoint, Excel, HTML, images, audio transcription, and more.
GitHub stars: ~40K+ | License: MIT
When to use it
- You need to convert many document types, not just PDFs
- Microsoft Office formats are common in your pipeline (Word, PowerPoint, Excel)
- You want a single tool for all document types
- MIT license is important
When NOT to use it
- PDF quality matters (MarkItDown’s PDF handling is basic)
- Tables, scans, or complex layouts (not optimized for these)
- You need per-page confidence or quality metrics
- PDF is your primary format (use a dedicated PDF tool)
Install and usage
```shell
pip install markitdown
```

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```
Benchmark numbers
| Metric | Score |
|---|---|
| Format support | Widest (PDF, DOCX, PPTX, XLSX, HTML, images, audio) |
| PDF text accuracy | ~90% (basic extraction) |
| Table extraction | Basic |
| Scanned PDF support | Limited |
| Disk footprint | ~100MB |
Honest take
MarkItDown’s value is breadth, not depth. If your pipeline needs to ingest Word docs, PowerPoint decks, Excel sheets, AND PDFs, MarkItDown gives you one interface for all of them. The 40K+ GitHub stars reflect how useful that is.
But for PDFs specifically, MarkItDown is not competitive with dedicated tools. Its PDF extraction is essentially a basic text dump — no layout intelligence, no table structure, no OCR. If PDFs are your primary concern, use a dedicated PDF tool and use MarkItDown for the other formats.
7. pdfmux
What it is: A self-healing PDF extraction pipeline that routes each page to the best available extractor, scores quality, and re-extracts failures automatically. Full disclosure: I built it.
GitHub stars: ~700 | License: MIT | PyPI: growing
When to use it
- Mixed documents (some digital, some scanned, some tables)
- You need confidence scores to know which pages extracted well
- Building RAG pipelines (LangChain/LlamaIndex integrations, chunking, token estimates)
- You want smart routing without manually picking extractors
- You can’t predict what PDFs users will upload
When NOT to use it
- You know all your PDFs are clean digital text (pymupdf4llm is simpler and just as fast)
- You need bounding box coordinates (OpenDataLoader does this better)
- Enterprise compliance/accessibility requirements (OpenDataLoader)
- You need formula extraction (MinerU)
- You want the absolute highest extraction quality regardless of cost (Gemini Flash)
Install and usage
```shell
pip install pdfmux            # base — handles 90% of PDFs
pip install "pdfmux[ocr]"     # add OCR for scanned docs
pip install "pdfmux[tables]"  # add Docling for table extraction
pip install "pdfmux[all]"     # everything
```

```python
import pdfmux

text = pdfmux.extract_text("anything.pdf")
# Automatically: PyMuPDF → audit → OCR bad pages → Docling on tables
# Returns markdown with per-page confidence scores
```
Benchmark numbers
| Metric | Score |
|---|---|
| Reading order accuracy | #2 on LayoutBench (behind OpenDataLoader) |
| Digital PDF speed | 0.01s/page (PyMuPDF backend) |
| Scanned PDF speed | 1-3s/page (RapidOCR, CPU) |
| Table accuracy | 97.9% (Docling backend, when installed) |
| Confidence scoring | Yes (0.0-1.0 per page, 5 quality checks) |
| Disk footprint (base) | ~30MB |
| Disk footprint (all) | ~2GB |
Honest take
pdfmux is good at the orchestration problem — figuring out what kind of page you have and routing to the right tool. The confidence scoring is genuinely unique. No other tool tells you “page 7 scored 0.3, I re-extracted it with OCR and now it scores 0.87.”
Where pdfmux falls short:
- It’s a router, not an engine. pdfmux is only as good as its backends. If you don’t install the OCR or table extras, it falls back to pymupdf4llm — which means scanned pages return empty and tables are approximate.
- No bounding boxes. OpenDataLoader returns coordinate-level element positions. pdfmux returns text and markdown. If you need to highlight source locations in a UI, pdfmux can’t do that today.
- Smaller community. With ~700 stars versus OpenDataLoader’s 3.8K or Marker’s 20K, there are fewer Stack Overflow answers, fewer tutorials, fewer edge cases already reported and fixed.
- No formula support. Academic papers with equations should go through MinerU, not pdfmux.
Cost comparison
Real costs matter. Here’s what each tool actually requires:
| Tool | License | GPU Required? | Disk Space | API Cost | Best for |
|---|---|---|---|---|---|
| pymupdf4llm | AGPL 3.0 | No | 30MB | Free | Digital PDFs, speed |
| pdfmux (base) | MIT | No | 30MB | Free | Smart routing, confidence |
| pdfmux (all) | MIT | No | ~2GB | Free | Mixed documents |
| Docling | MIT | No | 500MB | Free | Tables |
| OpenDataLoader | Apache 2.0 | No (helps) | 1-2GB | Free | Enterprise, bounding boxes |
| Marker | GPL 3.0 | Yes | 5GB | Free + GPU cost | Complex layouts |
| MinerU | AGPL 3.0 | Yes | 5-10GB | Free + GPU cost | Academic papers, formulas |
| MarkItDown | MIT | No | 100MB | Free | Multi-format |
| Gemini Flash | Proprietary | No | None | ~$0.01-0.05/doc | Best quality, cloud |
| LlamaParse | Proprietary | No | None | $0.003/page | Cloud, managed |
| AWS Textract | Proprietary | No | None | $0.015/page | AWS ecosystem |
GPU cost note: A cloud GPU (T4/A10) runs $0.50-1.50/hour. If you’re processing 100 documents/hour with Marker, that’s $0.005-0.015 per document in GPU cost alone — comparable to LlamaParse’s per-page pricing but with the hassle of managing infrastructure.
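The arithmetic behind that note, as a quick sanity check (the rates and throughput are the figures quoted above, not measurements of mine):

```python
def gpu_cost_per_doc(hourly_rate: float, docs_per_hour: int) -> float:
    """GPU cost attributed to each document processed in that hour."""
    return hourly_rate / docs_per_hour

low = gpu_cost_per_doc(0.50, 100)   # cheap T4 rate
high = gpu_cost_per_doc(1.50, 100)  # pricier A10 rate
print(low, high)
```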
If you’re building RAG
This section is specifically for people building retrieval-augmented generation pipelines, AI agents, or LLM-powered applications. Your extraction tool has a direct impact on answer quality.
What matters for RAG
- Extraction accuracy — garbage in, garbage out. A hallucinating LLM on top of garbled extraction is a liability.
- Structured output — markdown outperforms plain text by 20-35% in RAG accuracy benchmarks. Tables preserved as markdown tables, not flattened text.
- Confidence signals — you need to know when extraction failed so you can flag low-confidence chunks rather than serving wrong answers confidently.
- Chunking quality — section-aware chunks beat fixed-size splits. A chunk that crosses section boundaries confuses the retriever.
- Cost at scale — if you’re processing thousands of documents, $0.003/page adds up. Free local tools matter.
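The chunking point is worth making concrete. Here is a minimal sketch of section-aware chunking over markdown output; the splitting rule (break at `#` headings) is illustrative, and real chunkers also enforce token budgets and overlap:

```python
def section_chunks(markdown: str) -> list[str]:
    """Split markdown into chunks at heading boundaries."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "# Intro\ntext\n## Methods\nmore text\n"
print(section_chunks(doc))  # ['# Intro\ntext', '## Methods\nmore text']
```

Each chunk now carries its own heading, so the retriever never sees a fragment that starts mid-section.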
My recommendation for RAG
Tier 1 — Start here:
```shell
pip install pdfmux
```
pdfmux with the base install handles 90% of documents. You get confidence scores, section-aware chunking, token estimates, and LangChain/LlamaIndex integrations out of the box. Cost: $0. (See our PDF-to-Markdown for RAG guide for the complete ingestion pipeline.)
Tier 2 — When you need more:
```shell
pip install "pdfmux[all]"
```
Adds Docling for tables and RapidOCR for scans. Handles mixed documents automatically. Still runs on CPU, still free. Disk cost: ~2GB.
Tier 3 — Enterprise or high-stakes:
Consider OpenDataLoader for bounding box support (useful for citation highlighting in UIs) or Gemini Flash for the absolute highest extraction quality on difficult documents. Budget $0.01-0.05 per document for the cloud API.
What I’d avoid for RAG:
- MarkItDown for PDFs — its PDF extraction is too basic. Use it for Office formats, not PDFs.
- Marker without a GPU — CPU inference is too slow for production RAG pipelines.
- MinerU for business documents — the formula extraction is great but the setup cost isn’t justified unless you have academic papers.
- pymupdf4llm alone — fast and accurate on digital PDFs, but the silent failure on scanned pages will bite you. At minimum, add a confidence check.
The real comparison: what do you actually need?
After testing all of these tools across hundreds of documents, here’s the pattern I see:
Most people need pymupdf4llm + a safety net. 90% of PDFs are digital. pymupdf4llm handles those perfectly in milliseconds. The remaining 10% — scans, complex tables, mixed documents — need specialized tools. The question is how you handle that 10%.
Option A: Ignore the 10% and accept occasional failures. This is what most production pipelines do, whether they admit it or not.
Option B: Run everything through an ML pipeline (Marker, MinerU, OpenDataLoader). This works but is 50-500x slower than necessary for the 90% of documents that don’t need it.
Option C: Detect which pages need help and route accordingly. This is what pdfmux does. Extract fast, audit quality, re-extract only what’s broken.
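Option C, sketched with stub extractors. This only shows the extract-audit-reroute shape; the page dicts and both extractor functions are stand-ins I made up, and the real routing in pdfmux is more involved:

```python
def fast_extract(page: dict) -> str:
    # Stand-in for a PyMuPDF text-layer read; scanned pages yield nothing
    return page.get("text_layer", "")

def slow_ocr(page: dict) -> str:
    # Stand-in for an expensive OCR backend
    return page.get("pixels_say", "")

def extract(pages: list[dict], threshold: int = 20) -> list[str]:
    """Extract fast, audit each page, re-extract only the failures."""
    out = []
    for page in pages:
        text = fast_extract(page)
        if len(text.strip()) < threshold:   # audit: suspiciously little text
            text = slow_ocr(page)           # re-extract only this page
        out.append(text)
    return out

pages = [
    {"text_layer": "A normal digital page with a proper text layer."},
    {"pixels_say": "Recovered by OCR from a scanned page."},
]
print(extract(pages))
```

Only the second page pays the OCR cost; the digital page goes through at full speed.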
I’m biased toward option C — I built the tool. But the engineering argument is sound regardless of which tool you use: don’t run expensive extraction on pages that don’t need it.
What changed since 2025
For anyone coming from the previous version of this guide:
- OpenDataLoader is new and legitimate. Corporate backing from Hancom, real benchmark results, multi-language SDKs. It’s not vaporware.
- MinerU matured significantly. Formula extraction is now production-quality. Setup is still complex.
- Docling hit 5M monthly downloads. The IBM team is actively maintaining it. Table accuracy remains the best available.
- MarkItDown crossed 40K stars. Microsoft’s backing gives it momentum, but the PDF extraction hasn’t improved much.
- The “best PDF parser” is still context-dependent. Anyone telling you one tool wins at everything is selling you something.
Try them
```shell
# The fast default (90% of cases)
pip install pymupdf4llm

# Smart routing with confidence scoring
pip install pdfmux

# Add OCR and tables
pip install "pdfmux[all]"

# Enterprise with bounding boxes
pip install opendataloader

# Best tables
pip install docling

# ML-powered (needs GPU)
pip install marker-pdf

# Multi-format (not just PDFs)
pip install markitdown
```
Pick the one that matches your documents and constraints. There’s no universal winner — but there is a right tool for your specific use case.
Keep reading
- Best PDF extraction library for Python in 2026 — the ranked benchmark results behind these recommendations
- pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — head-to-head numbers on opendataloader-bench
- We ran pdfmux on Tesla 10-Ks and Supreme Court opinions — 1,422-page stress test with real SEC filings and legal documents
- How to give your AI agent the ability to read any PDF — connect any of these tools to Claude or Cursor via MCP
Built by Nameet Potnis. Have a PDF extraction war story? Open an issue or find me at @nameetp.