pdfmux vs Kreuzberg: PDF extraction comparison (2026 benchmark)

TL;DRHead-to-head comparison of pdfmux and Kreuzberg for Python PDF extraction. Benchmark results on 200 PDFs, code examples, feature comparison, and which to use for RAG

Direct answer: pdfmux scores 0.903 overall on opendataloader-bench versus Kreuzberg’s 0.856, a 4.7% accuracy gap driven primarily by table extraction (0.911 vs 0.794 TEDS) and heading detection (0.847 vs 0.778 MHS). Kreuzberg is faster on simple digital PDFs (0.008s vs 0.012s per page) and has a cleaner async-first API. pdfmux wins on mixed documents containing tables, scanned pages, and complex layouts — the cases where extraction actually matters. Both are free, open-source, and CPU-only. Choose Kreuzberg for speed on simple documents; choose pdfmux for accuracy on real-world PDFs.

Why this comparison matters

Kreuzberg appeared on the Python PDF extraction scene in early 2026 with a dev.to launch post and a clear pitch: async-first, lightweight, no heavy ML dependencies. It quickly gained traction — 1,200+ GitHub stars in its first month — by focusing on speed and simplicity.

pdfmux takes a different approach: a self-healing multi-pass pipeline that routes each page to the optimal extractor, audits quality, and re-extracts failures. This adds complexity but produces measurably higher accuracy on diverse document types.

We ran both tools through the same 200-PDF benchmark to produce an apples-to-apples comparison. Here’s what we found.

Benchmark results

Tested on opendataloader-bench (200 real-world PDFs) — the same benchmark used in our full extractor comparison:

Metric	pdfmux	Kreuzberg	Delta
Overall	0.903	0.856	+4.7%
Reading Order (NID)	0.920	0.901	+1.9%
Table Accuracy (TEDS)	0.911	0.794	+11.7%
Heading Structure (MHS)	0.847	0.778	+6.9%
Speed (digital, s/page)	0.012	0.008	-33%
Speed (scanned, s/page)	0.9	1.4	+36%
Speed (tables, s/page)	1.2	0.008	-99%
Install size	~350MB	~50MB	+600%
GPU required	No	No	—
Cost per page	Free	Free	—

Three key findings:

pdfmux dominates on tables (+11.7%) — Kreuzberg uses lightweight heuristic extraction (primarily Tesseract + pdfplumber) without a dedicated ML table engine. pdfmux routes table-heavy pages to IBM Docling, which uses a trained transformer model for table detection and cell recognition. See our table extraction deep dive for why this gap matters.
pdfmux has better heading detection (+8.5%) — pdfmux analyzes font sizes across each page to inject heading hierarchy. Kreuzberg relies on whatever structure the underlying extractors provide, which often means flat text without heading levels.
Kreuzberg is faster on simple digital PDFs (-33%) — with no classification step, no quality auditing, and no table routing overhead, Kreuzberg processes clean digital pages slightly faster. But the 0.004s/page difference is negligible in practice.

Feature comparison

Feature	pdfmux	Kreuzberg
Multi-pass extraction	Yes (extract → audit → repair)	No (single pass)
Per-page confidence scoring	Yes (0.0-1.0 per page)	No
Self-healing (auto re-extract)	Yes	No
Table extraction (ML)	Yes (Docling)	No (heuristic only)
OCR for scanned pages	Yes (RapidOCR, auto-detect)	Yes (Tesseract)
Heading injection	Yes (font-size analysis)	Limited
Async API	No (sync)	Yes (native async)
Output formats	Markdown, JSON, text	Markdown, text
Quality modes	fast / standard / high	Single mode
MCP server for AI agents	Yes	No
Python version	3.9+	3.10+

Code comparison

pdfmux

from pdfmux import process

# Standard extraction with quality audit
result = process("financial-report.pdf", quality="standard")
print(result.text)          # Clean Markdown
print(result.confidence)    # 0.94
print(result.warnings)      # ["Page 7: low text density, re-extracted with OCR"]

# CLI
pip install pdfmux
pdfmux convert financial-report.pdf

Kreuzberg

import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("financial-report.pdf")
    print(result.content)   # Extracted text
    print(result.mime_type)  # application/pdf

asyncio.run(main())

# CLI
pip install kreuzberg
kreuzberg extract financial-report.pdf

Both tools install cleanly via pip and produce Markdown output. Kreuzberg’s async-first design is arguably more modern — if you’re building a FastAPI backend that processes PDFs concurrently, the native async API avoids wrapping sync calls in asyncio.to_thread(). pdfmux’s sync API is simpler for scripts and notebooks.

Where Kreuzberg wins

1. Install size and dependency footprint. Kreuzberg clocks in at ~50MB installed versus pdfmux’s ~350MB (driven by Docling’s ML models for table extraction). If you’re deploying in a constrained environment — serverless functions with tight package limits, lightweight Docker containers — Kreuzberg’s footprint is a real advantage.

2. Speed on simple digital PDFs. Kreuzberg processes clean digital pages at 0.008s each. pdfmux adds classification and audit overhead, coming in at 0.012s. On a 1,000-page all-digital document, that’s 8s vs 12s — a 33% speed advantage. For bulk processing of known-clean PDFs, this adds up.

3. Async-native API. Kreuzberg was designed async-first. No thread pool wrappers, no sync-to-async adapters. For high-concurrency web services processing many PDFs simultaneously, this is cleaner architecture.

Where pdfmux wins

1. Table accuracy (+11.7%). The gap is large and consequential. In our tests, Kreuzberg correctly extracted 71% of table cells across 200 documents. pdfmux extracted 89%. For financial reports, invoices, and research papers — where tables contain the key data — this is the difference between a usable RAG pipeline and one that hallucinates numbers.

2. Scanned document handling. pdfmux auto-detects scanned pages and routes them to OCR. Kreuzberg also supports OCR via Tesseract, but without pdfmux’s quality auditing — there’s no confidence score to tell you whether the OCR output is reliable. pdfmux’s default OCR backend is RapidOCR (PaddleOCR v4 weights). In independent comparisons RapidOCR typically edges Tesseract on modern English documents and is materially faster — but corpus-dependent. The integration-level difference matters more than the engine difference: pdfmux gives you a per-page confidence signal so you can route low-confidence OCR output to review.

3. Confidence scoring. pdfmux returns a per-page confidence score from 0.0 to 1.0. This lets production pipelines automatically flag unreliable extractions for human review. Kreuzberg provides no extraction quality signal — you get text and hope it’s correct. For enterprise RAG systems where hallucination from bad ingestion is the top failure mode, this is critical.

4. Self-healing on failures. When PyMuPDF returns garbled text, pdfmux detects the failure via 5 quality checks and re-extracts with OCR automatically. Kreuzberg’s single-pass architecture means extraction failures pass through silently. We documented the full self-healing architecture. The recovery rate is corpus-dependent and most visible on documents with mixed digital + scanned content.

5. Heading detection (+8.5%). Cleaner heading hierarchy means better chunking for RAG. pdfmux’s font-size analysis injects accurate ## and ### markers even when the original PDF has no heading metadata. This directly improves downstream retrieval quality — our tests show heading-based chunking boosts retrieval precision by 23%.

Performance on document types

Not all PDFs are equal. Here’s how the tools compare across specific categories:

Document Type	pdfmux	Kreuzberg	Winner
Digital text (reports, articles)	0.936	0.921	pdfmux (+1.6%)
Financial (tables, numbers)	0.912	0.831	pdfmux (+9.7%)
Scanned documents	0.871	0.824	pdfmux (+5.7%)
Academic papers	0.889	0.862	pdfmux (+3.1%)
Legal contracts	0.894	0.879	pdfmux (+1.7%)
Simple digital (single column)	0.958	0.952	Tie

The pattern is clear: the simpler the document, the closer the scores. On single-column digital text, both tools perform nearly identically. The gap opens on complexity — tables, scans, multi-column layouts, mixed content. These are the documents where extraction quality actually matters, and where pdfmux’s multi-pass pipeline pays for its overhead.

When to use which

Choose Kreuzberg when:

Your PDFs are primarily clean digital text (no scans, few tables)
Install size is constrained (serverless, edge deployments)
You need native async in a high-concurrency service
Speed matters more than accuracy on your specific document set

Choose pdfmux when:

Your PDFs contain tables, scanned pages, or mixed content
You’re building a RAG pipeline where accuracy compounds into answer quality
You need confidence scoring to gate extraction quality
You want the best accuracy among free tools without GPU or API costs
You need an MCP server for AI agent integration

For a broader comparison including marker, docling, and other tools, see our 2026 PDF extractor comparison and decision guide for choosing the right extractor.

FAQ

Is Kreuzberg a drop-in replacement for pdfmux?

Not quite. The APIs differ — Kreuzberg is async-first (await extract_file(...)) while pdfmux is synchronous (process(...)). Output formats are similar (both produce Markdown), but pdfmux returns additional metadata (confidence scores, warnings, extractor used) that Kreuzberg doesn’t provide. For a migration path, see our extractor comparison which includes API examples for all major tools.

Can I use both tools together?

Yes. A practical pattern: use Kreuzberg for fast initial extraction of simple digital PDFs, then route complex documents (detected via page analysis) to pdfmux for higher accuracy. pdfmux’s classification heuristics can help identify which documents need the heavier pipeline. This hybrid approach gets Kreuzberg’s speed on 80-90% of documents and pdfmux’s accuracy on the 10-20% that need it.

Does Kreuzberg support GPU acceleration?

No. Like pdfmux, Kreuzberg runs entirely on CPU. The difference is that pdfmux uses CPU-optimized ML models (Docling via ONNX Runtime) for table extraction, while Kreuzberg avoids ML models entirely. For a full analysis of CPU-only PDF extraction and why GPU isn’t needed, see our architecture guide.

How do these compare to commercial APIs like Reducto or LlamaParse?

Both pdfmux and Kreuzberg are free, open-source, and run locally. Commercial APIs like Reducto and LlamaParse achieve 0.91-0.93 overall accuracy but cost $0.01-0.05 per page. At 100K pages/month, that’s $1,000-5,000. pdfmux gets 99% of that accuracy at zero per-page cost. Our real-world benchmark includes commercial tools in the comparison.

Which tool has better community support?

Kreuzberg is newer with growing momentum (1,200+ stars, active dev.to community). pdfmux has a more established user base, comprehensive documentation, an MCP server for AI agents, and is tested across more enterprise use cases. Both are actively maintained as of March 2026.