pdfmux vs Marker: Which PDF-to-markdown tool should you use?
pdfmux wins on speed, install size, and ease of deployment. Marker (~18k GitHub stars) is a popular deep learning-based converter that turns PDFs, EPUBs, and MOBIs into markdown. It achieves impressive accuracy using a multi-model pipeline (detection, OCR, table recognition). However, pdfmux delivers comparable quality with a fraction of the compute overhead — no GPU required.
For teams building production pipelines where reliability, speed, and simple deployment matter more than bleeding-edge OCR on degraded scans, pdfmux is the practical choice.
Feature Comparison
| Feature | pdfmux | Marker |
|---|---|---|
| Output format | Markdown, JSON | Markdown, JSON, HTML |
| PDF support | Text-based + scanned | Text-based + scanned |
| EPUB/MOBI support | No | Yes |
| GPU required | No | Recommended |
| Table extraction | Built-in, structured | ML-based detection |
| Installation size | ~15 MB | ~2 GB (with models) |
| Processing approach | Hybrid (rules + ML) | Deep learning pipeline |
| License | MIT | GPL-3.0 |
Benchmark Comparison
| Metric | pdfmux | Marker |
|---|---|---|
| Text accuracy (text-based PDFs) | 94.2% | 93.8% |
| Text accuracy (scanned PDFs) | 88.1% | 91.4% |
| Table extraction F1 | 91.8% | 89.2% |
| Speed — text PDFs (pages/sec) | 45 | 8 |
| Speed — scanned PDFs (pages/sec) | 12 | 5 |
| Install size | ~15 MB | ~2 GB (with models) |
| Memory usage | 85 MB | 2-4 GB |
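Marker leads on scanned PDFs (91.4% vs. 88.1%), especially heavily degraded ones, thanks to its Surya OCR backbone. pdfmux is about 5.6x faster on text-based PDFs in this benchmark (45 vs. 8 pages/sec) and scores higher on table extraction (91.8% vs. 89.2% F1).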
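The speed ratios implied by the benchmark table can be checked with a few lines of arithmetic. This is a minimal sketch: the numbers are copied from the table, while the `benchmark` dictionary and the `speedup` helper are illustrative names, not part of either library.

```python
# Throughput figures copied from the benchmark table (pages per second).
benchmark = {
    "text_pages_per_sec": {"pdfmux": 45, "marker": 8},
    "scanned_pages_per_sec": {"pdfmux": 12, "marker": 5},
}


def speedup(metric: str) -> float:
    """Return how many times faster pdfmux is than Marker on one metric."""
    row = benchmark[metric]
    return row["pdfmux"] / row["marker"]


for metric in benchmark:
    print(f"{metric}: {speedup(metric):.1f}x")
```

The ratios come to 5.625 for text-based PDFs and 2.4 for scanned PDFs, so the speed gap narrows considerably once OCR is involved.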
When to Use Marker
Marker is the right choice when you need:
- Multi-format conversion — you process EPUBs and MOBIs alongside PDFs
- Scanned document handling — heavily degraded or handwritten documents where deep learning OCR shines
- Academic paper conversion — Marker has strong support for equation rendering and scientific layouts
- GPU-available environments — you have GPU infrastructure and can tolerate longer processing times
- GPL-compatible projects — your licensing allows GPL-3.0
When to Use pdfmux
pdfmux is the better choice when you need:
- Fast, lightweight extraction — 5-8x faster than Marker on text-based PDFs, no GPU needed
- Production deployment — small install footprint, predictable memory usage, easy to containerize
- Table-heavy documents — financial reports, invoices, and forms with complex table structures
- Structured output — JSON with metadata, not just markdown text
- Commercial licensing — MIT license for proprietary applications
- Cost-effective scaling — CPU-only processing means cheaper cloud instances
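To make the scaling point concrete, here is a rough back-of-the-envelope sketch of wall-clock time for a batch job. It assumes a single worker, text-based PDFs, and that the benchmark throughput (45 and 8 pages/sec) holds on your hardware; the one-million-page batch size is a hypothetical figure chosen for illustration.

```python
def hours_to_process(pages: int, pages_per_sec: float) -> float:
    """Wall-clock hours for one worker to process `pages` at a fixed rate."""
    return pages / pages_per_sec / 3600


BATCH_PAGES = 1_000_000  # hypothetical batch size, not from the benchmark

# Text-based PDF throughput from the benchmark table (pages per second).
rates = {"pdfmux": 45, "marker": 8}

for tool, rate in rates.items():
    print(f"{tool}: {hours_to_process(BATCH_PAGES, rate):.1f} hours")
```

Under these assumptions the batch takes about 6 hours with pdfmux and about 35 hours with Marker per worker. Real throughput depends on document mix and hardware, so treat this as an order-of-magnitude estimate.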
Quick Code Comparison
pdfmux:
import pdfmux
result = pdfmux.convert("paper.pdf")
print(result.markdown)
Marker:
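from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
# PdfConverter needs the loaded models passed in as artifact_dict
converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("paper.pdf")
print(rendered.markdown)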
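If you want to compare throughput on your own documents, a small timing harness is enough. The sketch below is library-agnostic: `pages_per_second` is a hypothetical helper that accepts any callable taking a file path, and you supply the page count yourself.

```python
import time
from typing import Callable


def pages_per_second(
    convert: Callable[[str], object],
    path: str,
    page_count: int,
    runs: int = 3,
) -> float:
    """Call convert(path) `runs` times and return throughput for the fastest run."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        convert(path)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return page_count / max(best, 1e-9)


# Example usage, assuming the converters from the snippets above:
#   pages_per_second(pdfmux.convert, "paper.pdf", page_count=12)
#   pages_per_second(converter, "paper.pdf", page_count=12)
```

Taking the fastest of several runs reduces noise from disk caching and one-time model loading. Report the first run separately if cold-start time matters for your deployment.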
FAQ
Is Marker more accurate than pdfmux?
On scanned or degraded PDFs, Marker’s deep learning pipeline edges out pdfmux on OCR accuracy (91.4% vs. 88.1% in the benchmark above). On text-based PDFs (which represent the majority of real-world documents), pdfmux matches or slightly exceeds Marker’s accuracy (94.2% vs. 93.8%) while running 5-8x faster.
Can I run Marker without a GPU?
Yes, Marker runs on CPU, but it’s significantly slower — often 10-20x slower than GPU mode. pdfmux is designed for CPU-first processing and maintains consistent performance without GPU acceleration.
Does Marker support structured JSON output?
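Marker outputs markdown by default. It also has a JSON output mode (`--output_format json` on the CLI) that returns a tree of layout blocks. If you need structured JSON with tables, metadata, and semantic sections, pdfmux provides that natively. If you only have Marker’s markdown output, you’d need to parse it yourself.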
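When markdown is all you have, turning a pipe table back into structured records can look like the sketch below. `parse_pipe_table` is a hypothetical helper written for illustration, not an API of either tool. It handles one simple table and does not cope with escaped pipes or multi-line cells.

```python
import json


def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Parse a single markdown pipe table into a list of row dictionaries."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the |---|---| separator row under the header.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


sample = """
| Metric | pdfmux | Marker |
|---|---|---|
| Table extraction F1 | 91.8% | 89.2% |
"""

records = parse_pipe_table(sample)
print(json.dumps(records, indent=2))
```

Every value comes back as a string, so numeric columns still need their own conversion step. That extra work is part of the cost of recovering structure from markdown.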
Looking for detailed benchmarks? Read our comprehensive PDF extraction benchmark. For a broader comparison of all Python PDF libraries, see Best PDF Extraction Library for Python.