Direct answer: OpenDataLoader PDF v2.0 is excellent — 0.90 overall accuracy in hybrid mode, 0.94 reading order, fast, open-source, backed by Hancom. It deserves its 8K GitHub stars. But no single extractor handles every PDF you will encounter in production. Scanned invoices, borderless tables, garbled encodings, password-protected files, mixed-language documents — every tool has blind spots. pdfmux is an orchestration layer that uses OpenDataLoader alongside PyMuPDF, Docling, and RapidOCR, routing each page to the extractor most likely to succeed and re-extracting when one fails. The question is not “which extractor is best” — it is “what happens when your best extractor encounters a document it cannot handle.”
OpenDataLoader deserves the hype
Let’s start with what OpenDataLoader gets right, because there is a lot.
Hancom released OpenDataLoader PDF v2.0 in March 2026 and it immediately topped GitHub’s global trending chart. Within a week it crossed 7,000 stars. The benchmarks explain why:
| Metric | OpenDataLoader (hybrid) | OpenDataLoader (fast) |
|---|---|---|
| Overall | 0.90 | 0.72 |
| Reading Order (NID) | 0.94 | 0.91 |
| Table Accuracy (TEDS) | 0.93 | 0.49 |
| Heading Structure (MHS) | 0.83 | 0.76 |
| Speed (s/page) | 0.43 | 0.05 |
Those are strong numbers. The 0.94 reading order score is the highest of any open-source tool we have tested. Table accuracy in hybrid mode (0.93 TEDS) beats Docling (0.89), Marker (0.81), and MinerU (0.87). The fast mode processes pages at 0.05 seconds each — comparable to raw PyMuPDF.
OpenDataLoader also ships with features that matter for production use: JSON output with bounding boxes and semantic types, OCR for 80+ languages, LaTeX extraction for mathematical formulas, and an annotated PDF mode for visual debugging. It requires Java 11+ but no GPU.
If you are evaluating PDF extraction tools for the first time, OpenDataLoader is a strong default choice. We use it ourselves.
The problem with relying on a single extractor
Here is what benchmarks do not tell you: what happens on the documents outside the benchmark set.
Every PDF extraction tool — OpenDataLoader included — is optimized for a particular distribution of documents. Benchmark PDFs tend to be clean, well-structured, digitally created. Production document pipelines encounter a different reality.
The scanned page problem. OpenDataLoader’s fast mode scores 0.91 on reading order — on digital PDFs. Feed it a 50-page contract where pages 12-15 were scanned from a fax machine, and those four pages come back empty or garbled. The overall document score drops from 0.95 to 0.70, and your RAG pipeline indexes garbage for the most important section.
Hybrid mode handles scans better, but it requires routing to an AI backend. What if the backend is down? What if those scanned pages are in Japanese? What if the scan is 150 DPI instead of 300?
The silent failure problem. This is the dangerous one. A single extractor either works or it does not, and often it does not tell you which. PyMuPDF can extract text from a digitally-created PDF that renders as perfect Markdown — but if the PDF was created by a bad authoring tool that embedded characters in the wrong encoding, you get plausible-looking text where every number is wrong. OpenDataLoader’s fast mode can return a reading order score of 0.91 on the benchmark while silently merging two columns into one on your specific financial report.
When you use a single extractor, you have no reference point. If the output looks reasonable, you ship it. If it is wrong, you find out when a customer reports that your AI cited a number that does not exist in the source document.
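One cheap defense against this failure mode is to audit the extracted characters themselves. The sketch below is illustrative rather than pdfmux's actual implementation: it flags text whose printable-character ratio or character-level entropy falls outside the range typical of real prose.

```python
import math
from collections import Counter

def looks_garbled(text: str, min_printable: float = 0.9) -> bool:
    """Heuristic check for encoding garbage in extracted text.

    Flags text that is empty, mostly non-printable, or whose character
    distribution is far more uniform than natural language.
    """
    if not text.strip():
        return True  # an empty extraction is its own failure mode
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    if printable / len(text) < min_printable:
        return True
    # Shannon entropy per character: natural-language text typically sits
    # around 4-5 bits per character; random byte soup is much higher.
    counts = Counter(text)
    entropy = -sum((n / len(text)) * math.log2(n / len(text))
                   for n in counts.values())
    return entropy > 6.0

print(looks_garbled("Total revenue for Q3 was $4.2M."))  # normal prose: False
print(looks_garbled("\x02\x7f\x01" * 40))                # binary debris: True
```

The thresholds here are assumptions; a production check would calibrate them per language and per document corpus.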
The table edge case. OpenDataLoader hybrid scores 0.93 on tables — impressive. But that is an average across 200 documents. On the subset of borderless tables (no visible grid lines), accuracy drops. On nested tables (tables within tables), it drops further. On tables that span page breaks, every extractor struggles. Your production pipeline does not process “average” documents. It processes specific documents, and the question is whether your specific documents fall in the 93% that work or the 7% that do not.
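Many table failures are detectable before they reach an index, because merged or split columns change the cell counts. A minimal structural check (illustrative, not pdfmux internals) looks like this:

```python
def table_is_consistent(rows: list[list[str]]) -> bool:
    """Check that every extracted table row has the same number of cells.

    A cell-count mismatch is a strong signal that the extractor merged
    or split columns, which is exactly the failure mode that produces
    wrong numbers downstream.
    """
    if not rows:
        return False
    widths = {len(row) for row in rows}
    return len(widths) == 1

good = [["Metric", "Value"], ["Revenue", "4.2M"], ["Margin", "31%"]]
bad = [["Metric", "Value"], ["Revenue", "4.2M", "31%"]]  # columns merged into one row
print(table_is_consistent(good), table_is_consistent(bad))  # True False
```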
What an orchestrator does differently
pdfmux does not compete with OpenDataLoader. It uses OpenDataLoader — alongside PyMuPDF, Docling, and RapidOCR — as extraction backends. The value is in the layer above the extractors.
Here is the architecture:
```
PDF Input
    │
    ▼
┌─────────────────┐
│ Page Classifier │ ← Digital? Scanned? Table-heavy? Mixed?
└────────┬────────┘
         │
   ┌─────┴────┬──────────┬──────────┐
   ▼          ▼          ▼          ▼
PyMuPDF   OpenData    Docling   RapidOCR
(fast)     Loader    (tables)    (scans)
   │          │          │          │
   └─────┬────┴──────────┴──────────┘
         │
         ▼
┌─────────────────┐
│  Quality Audit  │ ← Per-page confidence scoring (0.0–1.0)
└────────┬────────┘
         │
  Pass ──┴── Fail
   │           │
   ▼           ▼
Output     Re-extract with fallback
```
This layer does three things that a single extractor cannot:
1. Page-level routing
Not every page in a document needs the same extractor. Page 1 might be a clean digital cover page (PyMuPDF, 0.01s). Page 2 might contain a complex borderless table (Docling, 1.2s). Page 3 might be a scanned appendix (RapidOCR, 0.9s).
pdfmux classifies each page in under 1 millisecond and routes it to the extractor with the highest expected accuracy for that page type. OpenDataLoader’s hybrid mode does something similar internally — it routes complex pages to an AI backend — but it is limited to its own extraction pipeline. pdfmux can use any extractor, including OpenDataLoader itself.
```python
from pdfmux import process

result = process("mixed-document.pdf", quality="standard")

# Each page was routed independently
for page in result.pages:
    print(f"Page {page.number}: {page.extractor} → confidence {page.confidence:.2f}")

# Page 1: pymupdf → confidence 0.97
# Page 2: docling → confidence 0.91
# Page 3: rapidocr → confidence 0.88
# Page 4: pymupdf → confidence 0.96
```
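The classification decision itself can be approximated with cheap layout signals. The heuristic below is a hedged sketch: the signal names and thresholds are assumptions for illustration, not pdfmux's actual classifier.

```python
def classify_page(text_chars: int, image_area_ratio: float, ruling_lines: int) -> str:
    """Route a page to an extractor based on cheap layout signals.

    Illustrative heuristic only; the thresholds are placeholder
    assumptions, not pdfmux's tuned classifier.
    """
    if text_chars < 50 and image_area_ratio > 0.5:
        return "rapidocr"  # almost no text layer, mostly image: likely a scan
    if ruling_lines > 10:
        return "docling"   # dense ruling lines suggest table-heavy layout
    return "pymupdf"       # clean digital text: the fastest path wins

print(classify_page(text_chars=2400, image_area_ratio=0.05, ruling_lines=0))  # pymupdf
```

Signals like these are nearly free to compute from the PDF's text layer and image objects, which is how a classifier can stay under a millisecond per page.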
2. Quality auditing with confidence scores
After extraction, pdfmux runs five quality checks on every page:
- Text density: Does the page contain a plausible amount of text relative to its visual content?
- Character entropy: Are the extracted characters real text or encoding garbage?
- Table structure: Do extracted tables have consistent row/column counts?
- Reading order: Does the text sequence match the expected reading flow?
- OCR confidence: For scanned pages, what is the OCR engine’s per-character confidence?
Each check produces a score. The composite confidence score (0.0 to 1.0) tells your pipeline whether to trust the extraction. This is the metadata that single extractors do not provide.
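A composite like this can be as simple as a weighted mean of the individual checks. The function below is a sketch with placeholder weights, not pdfmux's tuned scoring:

```python
def composite_confidence(checks: dict[str, float]) -> float:
    """Combine per-check scores (each 0.0-1.0) into one confidence value.

    Weights are illustrative; a real implementation would tune them per
    page type (e.g. OCR confidence only matters on scanned pages).
    Missing checks default to 1.0, i.e. "not applicable".
    """
    weights = {
        "text_density": 0.25,
        "char_entropy": 0.25,
        "table_structure": 0.20,
        "reading_order": 0.20,
        "ocr_confidence": 0.10,
    }
    total = sum(weights[name] * checks.get(name, 1.0) for name in weights)
    return round(total, 2)

scores = {"text_density": 0.9, "char_entropy": 1.0, "table_structure": 0.5,
          "reading_order": 0.95, "ocr_confidence": 1.0}
print(composite_confidence(scores))
```

A poor table-structure score drags the composite down even when the text checks pass, which is the behavior you want for a page whose prose extracted cleanly but whose table did not.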
```python
result = process("quarterly-report.pdf")

# Flag pages that need human review
for page in result.pages:
    if page.confidence < 0.80:
        print(f"Page {page.number}: confidence {page.confidence:.2f} — {page.warnings}")

# Page 14: confidence 0.62 — ["low text density, possible scan"]
# Page 23: confidence 0.71 — ["table cell count mismatch"]
```
In a production RAG pipeline, this confidence score is how you prevent hallucination at the ingestion layer. If you do not know whether the extraction is reliable, you cannot know whether the AI’s answer is grounded in real content or in extraction artifacts.
3. Self-healing on failures
When a quality check fails, pdfmux does not just flag the problem — it fixes it. A page extracted by PyMuPDF that returns garbled characters gets re-extracted by OpenDataLoader. A table that Docling misreads gets re-extracted with a different table detection strategy. A scanned page where RapidOCR returns low confidence gets re-processed at higher DPI.
This self-healing loop recovers 8-12% more usable content from degraded documents compared to any single-pass extractor. The recovery rate is highest on exactly the documents that matter most: the edge cases, the poorly-authored PDFs, the scanned-and-rescanned contracts that arrive in enterprise document pipelines every day.
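The loop itself is simple to express. In the sketch below, the extractor callables are stand-in stubs (real backends would wrap PyMuPDF, OpenDataLoader, and so on), and the threshold is an assumed default:

```python
def extract_with_fallback(page_id, extractors, threshold=0.80):
    """Try extractors in priority order until one clears the threshold.

    `extractors` maps a backend name to a callable returning
    (text, confidence). If nothing clears the bar, the best attempt
    is returned flagged for human review.
    """
    best = ("", 0.0, None)
    for name, extract in extractors.items():
        text, confidence = extract(page_id)
        if confidence >= threshold:
            return {"text": text, "confidence": confidence, "extractor": name}
        if confidence > best[1]:
            best = (text, confidence, name)
    text, confidence, name = best
    return {"text": text, "confidence": confidence, "extractor": name,
            "needs_review": True}

# Hypothetical stubs standing in for real engines
stubs = {
    "pymupdf": lambda page: ("\x02\x7f garbage", 0.30),      # garbled text layer
    "rapidocr": lambda page: ("Payment due in 30 days.", 0.88),
}
result = extract_with_fallback(page_id=14, extractors=stubs)
print(result["extractor"], result["confidence"])  # rapidocr 0.88
```

Ordering the dict from cheapest to most expensive backend means the costly re-extraction work only happens on the pages that actually fail the audit.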
Head-to-head benchmark
We ran both tools through opendataloader-bench — the same 200-PDF benchmark that OpenDataLoader uses for its own evaluation:
| Metric | pdfmux | OpenDataLoader (hybrid) | OpenDataLoader (fast) |
|---|---|---|---|
| Overall | 0.905 | 0.900 | 0.720 |
| Reading Order (NID) | 0.920 | 0.940 | 0.910 |
| Table Accuracy (TEDS) | 0.911 | 0.930 | 0.490 |
| Heading Structure (MHS) | 0.852 | 0.830 | 0.760 |
| Speed (digital, s/page) | 0.05 | 0.05 | 0.05 |
| Speed (mixed, s/page) | 0.50 | 0.43 | 0.05 |
| Confidence scoring | Yes | No | No |
| Self-healing | Yes | No | No |
| GPU required | No | No | No |
| Cost per page | Free | Free | Free |
Let’s be honest about these numbers: OpenDataLoader hybrid mode beats pdfmux on reading order (0.940 vs 0.920) and table accuracy (0.930 vs 0.911). On this specific benchmark, OpenDataLoader’s hybrid engine produces better raw extraction on most pages.
pdfmux wins on heading detection (0.852 vs 0.830) and provides capabilities that are not captured by benchmark scores: confidence scoring, self-healing, and multi-extractor routing.
The real difference shows up on documents that are not in the benchmark.
Where benchmarks end and production begins
We tested both tools on 500 additional documents sourced from real customer pipelines — documents that were not in opendataloader-bench. These included:
- 87 scanned documents (fax quality, mixed DPI, skewed pages)
- 63 documents with mixed digital and scanned pages
- 45 documents with borderless tables
- 31 password-protected PDFs
- 28 documents with non-Latin scripts (Arabic, Chinese, Japanese, Korean)
- 19 documents with embedded forms
- The remainder: standard digital PDFs of varying complexity
| Document Type | pdfmux | OpenDataLoader (hybrid) | Delta |
|---|---|---|---|
| Standard digital | 0.93 | 0.94 | -1.1% |
| Scanned (clean, 300 DPI) | 0.87 | 0.85 | +2.4% |
| Scanned (degraded, <200 DPI) | 0.79 | 0.71 | +11.3% |
| Mixed digital + scanned | 0.88 | 0.82 | +7.3% |
| Borderless tables | 0.85 | 0.89 | -4.5% |
| Multi-language | 0.84 | 0.80 | +5.0% |
| Standard digital (simple) | 0.96 | 0.96 | Tie |
The pattern follows what we have seen with every extractor comparison we have run: on clean digital documents, all good tools converge. The gap opens on degraded, mixed, and edge-case documents. Pdfmux’s advantage comes from having multiple extraction paths — when one engine struggles, another picks up the slack.
OpenDataLoader’s hybrid mode still wins on borderless tables. Its AI-backed table detection is genuinely impressive. If your pipeline is primarily digital documents with complex tables, OpenDataLoader hybrid may be the better standalone choice.
The cost of silent failures in RAG
Here is why this matters beyond benchmark points.
In a RAG pipeline, extraction errors compound. A misread table becomes a wrong number in the vector store. A garbled scanned page becomes a missing section that the retriever cannot find. A merged two-column layout becomes nonsense text that the LLM treats as authoritative.
The failure mode is not “the system breaks.” The failure mode is “the system confidently returns wrong answers sourced from bad extractions.” Users do not know the extraction was bad. They trust the AI’s response because it cites a document.
Single extractors give you no signal when this happens. The text looks plausible. The Markdown renders cleanly. There is no confidence score, no warning, no indication that page 14 came back empty because it was a scan.
pdfmux’s confidence scoring does not eliminate this risk — no tool can — but it quantifies it. A confidence score of 0.62 on page 14 tells your pipeline to either re-extract, route to human review, or exclude that page from indexing. That metadata is the difference between a RAG system that silently degrades and one that fails gracefully.
```python
from pdfmux import process

result = process("customer-contract.pdf", quality="high")

# Build a RAG index only from pages you can trust
for page in result.pages:
    if page.confidence >= 0.85:
        index.add(page.text, metadata={"page": page.number, "confidence": page.confidence})
    else:
        review_queue.add(page)  # Human reviews low-confidence pages
```
Using OpenDataLoader inside pdfmux
pdfmux treats extractors as pluggable backends. As of v0.8, OpenDataLoader is a supported backend alongside PyMuPDF, Docling, and RapidOCR. You can configure which extractors are available and how page routing prioritizes them:
```python
from pdfmux import process

# Use OpenDataLoader as the primary extractor, with fallbacks
result = process(
    "document.pdf",
    quality="standard",
    extractors=["opendataloader", "docling", "rapidocr"],
)

# OpenDataLoader handles most pages
# Docling picks up table-heavy pages where its ML model excels
# RapidOCR catches scanned pages
```
This is not an either/or choice. If OpenDataLoader is the best extractor for your document type — and for many document types, it is — pdfmux will route pages to it. The orchestrator adds value precisely because it can use the best tool for each page rather than committing to one tool for every page.
When to use OpenDataLoader directly
There are real scenarios where pdfmux’s overhead is not worth it:
Homogeneous digital documents. If your pipeline processes a known document type — say, digitally-created academic papers from a specific publisher — and you have validated that OpenDataLoader handles them well, the orchestrator layer adds latency without adding accuracy. Use OpenDataLoader directly.
Hybrid mode with API budget. OpenDataLoader’s hybrid mode routes complex pages to an AI backend, which is conceptually similar to what pdfmux does with multiple extractors. If you are comfortable with the AI backend dependency and your documents are primarily digital with occasional complex tables, hybrid mode may be sufficient.
Fast mode for speed-critical pipelines. At 0.05 seconds per page with no external dependencies, OpenDataLoader’s fast mode is hard to beat for bulk processing of simple PDFs where some accuracy loss is acceptable.
Java-native environments. OpenDataLoader has first-class Java and Node.js SDKs. If your stack is JVM-based, OpenDataLoader integrates more naturally than pdfmux (Python-only).
When you need the orchestrator
Mixed document pipelines. If your ingestion pipeline receives PDFs from multiple sources — customer uploads, email attachments, scanned archives, generated reports — no single extractor handles all of them well. This is where multi-extractor routing pays for itself.
RAG systems where accuracy compounds. Every extraction error becomes a potential hallucination in the AI’s responses. Confidence scoring and self-healing reduce the error rate at the ingestion layer, which propagates through the entire pipeline. See our guide to PDF extraction for RAG for the full architecture.
Enterprise compliance requirements. When you need to prove that extracted data is reliable — financial audits, legal discovery, medical records — per-page confidence scores provide the audit trail that regulators expect.
Scanned and degraded documents. Our tests show an 11.3% accuracy gap on degraded scans. If your pipeline encounters scanned documents regularly, pdfmux’s multi-pass OCR with quality auditing handles them materially better than any single-pass extractor.
FAQ
Is pdfmux a competitor to OpenDataLoader?
No. pdfmux is an orchestration layer that uses OpenDataLoader (and other extractors) as backends. We do not maintain our own extraction engine — we route to the best available engine for each page. When OpenDataLoader improves, pdfmux’s output improves too.
Does pdfmux require OpenDataLoader to be installed?
No. pdfmux uses PyMuPDF as its default fast extractor and Docling for table-heavy pages. OpenDataLoader is an optional backend that you can enable by installing it separately (pip install opendataloader-pdf). Pdfmux detects available extractors at runtime and routes accordingly.
Which is faster?
On simple digital PDFs, both tools process at roughly 0.05 seconds per page. On mixed documents, OpenDataLoader hybrid mode averages 0.43s/page versus pdfmux’s 0.50s/page. The speed difference is negligible for most pipelines. Where pdfmux is slower, it is doing additional work (quality auditing, potential re-extraction) that a single-pass tool skips.
Do I need a GPU for either tool?
No. Both pdfmux and OpenDataLoader run on CPU only. OpenDataLoader requires Java 11+. pdfmux requires Python 3.9+ and installs heavier dependencies (~350MB including Docling’s ML models versus OpenDataLoader’s lighter footprint).
Can I use OpenDataLoader’s hybrid mode through pdfmux?
Yes. When OpenDataLoader is configured as a backend in pdfmux, you can specify whether to use fast or hybrid mode. Pdfmux’s page classifier can route simple pages to OpenDataLoader fast mode and complex pages to hybrid mode — getting speed where you can afford it and accuracy where you need it.
How does this compare to commercial APIs like Reducto or LlamaParse?
Both pdfmux and OpenDataLoader are free and open-source. Commercial APIs achieve 0.91-0.93 overall accuracy at $0.01-0.05 per page. At 100K pages/month, that is $1,000-5,000. pdfmux and OpenDataLoader both achieve 0.90 at zero per-page cost. For most teams, the commercial premium is not justified. See our real-world benchmark for the full comparison.
Bottom line
OpenDataLoader PDF v2.0 is the best single extractor available in open source today. Its hybrid mode accuracy, speed in fast mode, and comprehensive output formats make it a deserved frontrunner.
But production document pipelines do not fail on average documents. They fail on the exceptions — the scanned page in an otherwise digital document, the table that does not match the expected format, the encoding that produces plausible-looking garbage. These failures are silent, and they compound through your RAG pipeline into wrong answers that users trust.
pdfmux exists for the gap between “works on the benchmark” and “works on every document your users upload.” It uses OpenDataLoader where OpenDataLoader excels, and routes to other extractors where they perform better. The confidence scoring tells you when to trust the output. The self-healing loop recovers content that single-pass extraction misses.
Use OpenDataLoader directly when your documents are predictable. Use pdfmux when they are not.
```shell
pip install pdfmux
pdfmux convert your-document.pdf
```