Direct answer: pdfmux extracts text, tables, and headings from PDFs without a GPU or API keys by combining fast heuristic extraction (PyMuPDF) with targeted ML table extraction (Docling) only where needed. The result: 0.905 overall accuracy on the 200-PDF opendataloader-bench — 99.5% of the paid #1 score (LlamaParse 0.910) — at zero cost per page. Install: pip install pdfmux.
The cost problem in PDF extraction
The best PDF extraction tools in 2026 are expensive:
- Gemini Flash / GPT-4o: $0.001-0.05 per page depending on the model. Process 100K pages/month and you’re paying roughly $100-500 for Gemini Flash or $1,000-5,000 for GPT-4o.
- LlamaParse (paid): ~$3 per 1,000 pages and ships every page off-prem — a non-starter for regulated workloads.
- marker: Free but needs a GPU. A T4 GPU on AWS costs ~$0.50/hour. Processing 100K pages takes ~28 hours = $14, plus the engineering cost of GPU-tier infra.
- Docling: Free, no GPU required, but loads 500MB of transformer models and processes every page through them — even the 90% that are simple digital text.
For a startup building a RAG pipeline, these costs add up fast. (We benchmarked every major PDF extraction tool to find where the cost-accuracy tradeoffs actually land.) And for self-hosted deployments (healthcare, legal, finance), GPU dependencies are often a non-starter — both for compliance reasons and because GPU instance availability in regulated cloud regions is intermittent at best.
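The monthly figures above follow from simple multiplication. A minimal sketch of that arithmetic, using only the prices quoted in this section (the function names are illustrative and not part of any tool):

```python
def monthly_api_cost(pages: int, price_per_page: float) -> float:
    """Metered API cost: pages processed times the per-page price."""
    return pages * price_per_page


def gpu_batch_cost(hours: float, price_per_hour: float) -> float:
    """Cost of renting a GPU only for the hours the batch actually runs."""
    return hours * price_per_hour


PAGES = 100_000

# Vision-LLM range quoted above: $0.01-0.05 per page
llm_low = monthly_api_cost(PAGES, 0.01)
llm_high = monthly_api_cost(PAGES, 0.05)

# LlamaParse: ~$3 per 1,000 pages
llamaparse = monthly_api_cost(PAGES, 3 / 1000)

# marker on a T4: ~28 hours at ~$0.50/hour
marker_gpu = gpu_batch_cost(28, 0.50)

print(f"Vision LLM: ${llm_low:,.0f}-{llm_high:,.0f}")
print(f"LlamaParse: ${llamaparse:,.0f}")
print(f"marker GPU time: ${marker_gpu:,.0f}")
```

The GPU line counts only the hours the batch is running. It leaves out idle instance time and the engineering overhead mentioned above.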
How pdfmux avoids the GPU tax
pdfmux’s key insight: 90% of PDF pages don’t need ML. They’re digital text — clean, extractable, and perfectly handled by heuristic tools like PyMuPDF at 0.01 seconds per page.
The other 10% — scanned pages, complex tables, image-heavy layouts — do need specialized tools. But you only need to pay the ML cost on those pages, and the right specialized tool runs comfortably on CPU when you’re not forcing it through every page in the document.
The architecture
PDF ──→ Classify (heuristic, <1ms)
         │
         ├─ Digital text (90%) ──→ PyMuPDF (0.01s, CPU, free)
         │                          │
         │                          ├─ Audit quality (5 checks)
         │                          ├─ Bad page? → Re-extract with OCR
         │                          └─ Inject headings (font-size analysis)
         │
         ├─ Has tables (5%) ──→ Docling (0.3-3s, CPU, free)
         │                       └─ ML table detection + extraction (see [table extraction methods](/blog/extract-tables-from-pdf-python/))
         │
         └─ Scanned (5%) ──→ RapidOCR (0.5-2s, CPU, free)
                              └─ CPU-only OCR engine
No GPU anywhere in this pipeline. The ML components (Docling for tables, RapidOCR for scans) are designed to run on CPU. The classifier itself uses zero ML — it’s a few hundred lines of geometry on the PDF’s existing layout metadata.
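To see why routing matters, it helps to compute the blended per-page time implied by the diagram. The sketch below takes the page mix (90/5/5) and the timing ranges straight from the diagram. It assumes those shares and timings hold for your corpus, which they may not.

```python
# (share of pages, fastest seconds/page, slowest seconds/page) per route,
# copied from the architecture diagram.
ROUTES = {
    "digital (PyMuPDF)": (0.90, 0.01, 0.01),
    "tables (Docling)": (0.05, 0.3, 3.0),
    "scanned (RapidOCR)": (0.05, 0.5, 2.0),
}


def blended_seconds_per_page(routes):
    """Weighted average of per-page time across routes, best and worst case."""
    best = sum(share * fast for share, fast, _ in routes.values())
    worst = sum(share * slow for share, _, slow in routes.values())
    return best, worst


best, worst = blended_seconds_per_page(ROUTES)
print(f"{best:.3f}s to {worst:.3f}s per page")
```

In this estimate the table and scanned routes account for most of the total even at 5% each, so the saving comes from keeping the other 90% of pages off the ML path.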
The classification step
Before extracting anything, pdfmux classifies each PDF using lightweight heuristics (no ML):
- Drawn lines: Counts horizontal and vertical rules (tables have grid lines)
- Number density: Counts lines with high numeric content (financial tables)
- Column alignment: Checks if text blocks align vertically (columnar data)
- Whitespace patterns: Detects regular spacing patterns (table structure)
- PyMuPDF find_tables(): Quick heuristic table detection
This takes <1 millisecond and correctly routes 95%+ of documents. For the remaining edge cases, pdfmux runs a Docling table overlay as a safety net — extracting only table blocks from Docling’s output and merging them into the PyMuPDF text. The same logic also drives the scanned-PDF detection path, so the OCR fallback only fires when there’s no extractable text layer to begin with.
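As an illustration of what heuristic routing can look like, here is a minimal sketch covering two of the signals above (drawn lines and number density) plus the no-text-layer check. The `PageSignals` type, the function names, and every threshold are assumptions made for this example. They are not pdfmux’s actual classifier.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    text: str         # text layer pulled from the page
    drawn_lines: int  # count of horizontal/vertical rules on the page
    image_count: int  # embedded images on the page


def numeric_line_ratio(text: str) -> float:
    """Fraction of non-empty lines in which at least half the characters are digits."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0

    def mostly_digits(line: str) -> bool:
        chars = [c for c in line if not c.isspace()]
        return sum(c.isdigit() for c in chars) / len(chars) >= 0.5

    return sum(mostly_digits(ln) for ln in lines) / len(lines)


def classify(page: PageSignals) -> str:
    """Route a page to 'scanned', 'tables', or 'digital' (thresholds are hypothetical)."""
    if not page.text.strip():
        return "scanned"  # no extractable text layer
    if page.drawn_lines >= 4 or numeric_line_ratio(page.text) >= 0.3:
        return "tables"   # grid lines or number-dense rows
    return "digital"


prose = PageSignals("Hello world\nThis is plain prose.", drawn_lines=0, image_count=0)
ledger = PageSignals("Revenue 2024\n1,200 3,400 5,600\n7,800 9,100 2,300", 0, 0)
scan = PageSignals("", drawn_lines=0, image_count=1)
print(classify(prose), classify(ledger), classify(scan))
```

Each check is a count or a ratio over data the PDF parser already exposes, which is why this kind of classifier can run without loading a model.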
The self-healing loop
After extraction, pdfmux doesn’t just return the text. It audits every page with 5 quality checks and re-extracts failures:
for page in pages:
    score = audit(page.text, page.image_count)
    if score == "bad":
        # Text present but low quality (encoding errors, truncation)
        page.text = region_ocr(page)  # OCR only the bad regions
    elif score == "empty":
        # No text extracted (likely scanned/image)
        page.text = full_ocr(page)  # OCR the entire page
This self-healing loop catches failures that other tools miss silently. The result: higher effective accuracy without needing expensive models. On the real-document stress test across 1,422 pages of SEC 10-Ks and Supreme Court opinions, the audit-and-retry path lifted final accuracy on the hardest 8% of pages from 0.61 to 0.89 without touching a GPU.
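The loop above depends on an `audit` function that returns a verdict per page. The five real checks are not listed here, so the sketch below uses three stand-in checks (empty text, a high share of Unicode replacement characters, and very little text on an image-bearing page). The checks and their thresholds are assumptions for illustration only.

```python
def audit(text: str, image_count: int) -> str:
    """Return 'empty', 'bad', or 'good' for one page (illustrative checks only)."""
    stripped = text.strip()
    if not stripped:
        return "empty"  # nothing extracted: likely a scanned page

    # U+FFFD appears where bytes could not be decoded, a sign of encoding errors.
    replacement_share = stripped.count("\ufffd") / len(stripped)
    if replacement_share > 0.05:
        return "bad"

    # A page with images but only a scrap of text is probably truncated.
    if image_count > 0 and len(stripped) < 50:
        return "bad"

    return "good"


print(audit("", 1), audit("abc\ufffd\ufffd", 0), audit("A" * 200, 0))
```

Returning a three-way verdict rather than a boolean is what lets the caller choose between region-level OCR and full-page OCR.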
Benchmark proof
Tested on opendataloader-bench (200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents). Benchmark last re-run April 22, 2026 against current library versions:
| Tool | Overall | GPU Required | Cost/Page | Install Size |
|---|---|---|---|---|
| LlamaParse (paid) | 0.910 | Cloud | ~$0.003 | N/A (API) |
| pdfmux | 0.905 | No | $0 | ~20MB |
| docling | 0.877 | Optional | $0 | ~500MB |
| marker | 0.861 | Recommended | $0 | ~2GB |
| mineru | 0.831 | Recommended | $0 | ~2GB |
pdfmux achieves 99.5% of the paid top score (see the full benchmarked ranking of every Python library) while being:
- Free (zero cost per page, no API keys)
- Small (20MB core install, no model downloads for basic use)
- Fast (0.05s/page average across the benchmark)
- CPU-only (runs on any server, laptop, or CI runner)
- Best-in-class on headings (MHS 0.844 — highest of any engine on the benchmark, paid or free)
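The 99.5% figure is the pdfmux score divided by the top score. A short sketch that applies the same ratio to every row of the table above (scores copied from the table, function name illustrative):

```python
SCORES = {
    "LlamaParse (paid)": 0.910,
    "pdfmux": 0.905,
    "docling": 0.877,
    "marker": 0.861,
    "mineru": 0.831,
}


def relative_to_top(scores):
    """Express each overall score as a percentage of the best score in the table."""
    top = max(scores.values())
    return {tool: round(100 * score / top, 1) for tool, score in scores.items()}


relative = relative_to_top(SCORES)
for tool, pct in relative.items():
    print(f"{tool}: {pct}%")
```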
When you DO need more
pdfmux has optional extras for harder cases:
# Core: digital PDFs, no model downloads
pip install pdfmux
# Add table extraction (downloads Docling models, ~500MB)
pip install "pdfmux[tables]"
# Add OCR for scanned documents (~200MB)
pip install "pdfmux[ocr]"
# Add LLM extraction for the hardest cases
pip install "pdfmux[llm]"  # Requires GEMINI_API_KEY
The [llm] extra uses Gemini Flash as a final fallback — only for pages that both PyMuPDF and OCR couldn’t handle. In practice, this is <1% of pages, and you can disable the path entirely with quality="cpu-only" if your compliance posture forbids any outbound API call.
Real-world cost comparison
Processing 100,000 PDF pages per month:
| Tool | Monthly Cost | Infrastructure |
|---|---|---|
| GPT-4o Vision | $1,000-5,000 | API only |
| Gemini Flash | $100-500 | API only |
| marker (GPU) | ~$350 | T4 GPU instance |
| Docling (CPU) | ~$50 | 4-core server |
| pdfmux (CPU) | ~$20 | 2-core server |
| PyMuPDF only | ~$5 | Any server |
pdfmux on a $20/month Hetzner VPS processes 100K pages with table extraction, OCR fallback, and quality auditing. We validated this throughput in our real-world benchmark across 1,422 pages of SEC filings and Supreme Court opinions. The same VPS handles the chunking pipeline that feeds the embedding step, so for most RAG workloads you don’t need to split extraction and chunking across separate boxes.
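One way to sanity-check the table is to convert the page volume into compute time. The sketch below assumes the 0.05s/page benchmark average carries over to your corpus, a 30-day month, and the ~$0.50/hour T4 price quoted earlier. All three are assumptions, not measurements.

```python
PAGES_PER_MONTH = 100_000
SECONDS_PER_PAGE = 0.05   # benchmark average quoted earlier (assumed to carry over)
T4_PRICE_PER_HOUR = 0.50  # AWS T4 price quoted earlier
HOURS_PER_MONTH = 24 * 30

# Total single-core extraction time for the month's volume.
cpu_hours = PAGES_PER_MONTH * SECONDS_PER_PAGE / 3600

# Share of the month the server spends extracting.
utilisation = cpu_hours / HOURS_PER_MONTH

# Cost of keeping a T4 instance up for the whole month.
always_on_gpu = T4_PRICE_PER_HOUR * HOURS_PER_MONTH

print(f"CPU time: {cpu_hours:.1f} h ({utilisation:.2%} of the month)")
print(f"Always-on T4: ${always_on_gpu:.0f}/month")
```

On these inputs the always-on GPU figure lands close to the ~$350 row. That suggests the row prices an instance that stays up all month, not only the ~28 busy hours.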
Serverless and cold-start economics
CPU-only doesn’t just save the GPU bill — it changes where you can run extraction.
- AWS Lambda: The core pdfmux package (~20MB) fits comfortably under Lambda’s 250MB unzipped deployment-package limit (function plus layers). Cold starts are sub-second because there are no ML models to load. The [tables] extra blows past that limit (~500MB of Docling weights), so use a container image deployment, or split tables out to a separate ECS service.
- Cloud Run / Fly.io: A container with pdfmux[tables,ocr] runs in ~1GB of memory at idle. Cold start to first response is 8-15s (Docling model load), then sub-second on warm instances.
- CI runners: pdfmux runs inside GitHub Actions, GitLab CI, and Jenkins without any GPU runner setup. Useful for golden-file regression testing extraction pipelines against your real PDF corpus on every PR.
- Edge devices: The core package runs on Raspberry Pi 4 and Apple Silicon with no special build steps. RapidOCR has ARM wheels.
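The golden-file idea from the CI bullet can be sketched with nothing but a hash comparison. The extraction step is left out, and the dictionaries stand in for "file name to extracted text" and "file name to stored digest". All names are hypothetical.

```python
import hashlib


def digest(text: str) -> str:
    """Stable fingerprint of extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(extracted: dict[str, str], golden: dict[str, str]) -> list[str]:
    """Names of documents whose extracted text no longer matches the stored digest."""
    return sorted(
        name for name, text in extracted.items() if golden.get(name) != digest(text)
    )


# Digests recorded from a known-good run.
golden = {"report.pdf": digest("Revenue rose 4%."), "memo.pdf": digest("Hello")}

# A later run in which one document's output drifted.
current = {"report.pdf": "Revenue rose 4%.", "memo.pdf": "Hell0"}
print(changed_documents(current, golden))
```

A byte-level comparison like this only works as a CI gate when the extractor’s output is reproducible from run to run.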
Compare this to marker or mineru, where the GPU dependency forces you into a narrow set of cloud regions and instance families. The accuracy delta (pdfmux 0.905 vs marker 0.861) goes the wrong way for them anyway — you’re paying for GPU infra and getting a lower score.
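The packaging decision above reduces to a size check against Lambda’s 250MB unzipped limit. The sketch below uses the approximate sizes quoted in this post. Installed sizes can differ from download sizes, so treat the result as a first guess and not a guarantee.

```python
LAMBDA_UNZIPPED_LIMIT_MB = 250

# Approximate sizes quoted in this post (assumed, not measured on disk).
SIZES_MB = {"core": 20, "tables": 500, "ocr": 200}


def deployment_target(extras=()):
    """Pick a zip-based Lambda when the estimated size fits, else a container image."""
    total = SIZES_MB["core"] + sum(SIZES_MB[name] for name in extras)
    return "lambda-zip" if total <= LAMBDA_UNZIPPED_LIMIT_MB else "container-image"


for extras in [(), ("ocr",), ("tables",), ("tables", "ocr")]:
    print(extras, "->", deployment_target(extras))
```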
License, determinism, and other non-accuracy concerns
Three things that don’t show up in benchmark numbers but tend to surface during procurement:
- License: pdfmux is MIT. PyMuPDF underneath it is AGPL-3.0, which pdfmux’s packaging routes around via the LGPL-friendly build path — relevant if your legal team treats AGPL as a non-starter. marker is GPL, Docling and Unstructured are permissive, LlamaParse is commercial-only.
- Determinism: Rule-based extractors (the PyMuPDF path inside pdfmux, plus opendataloader) produce byte-identical output across runs. ML extractors can produce slightly different output between versions or across hardware due to floating-point non-determinism. For regulated workflows (financial filings, legal discovery, medical records), pdfmux has a deterministic=True mode that pins the rule-based path and refuses to fall through to ML: you get reproducibility in exchange for a small accuracy hit on edge cases.
- GIL behavior: pdfmux releases the GIL during PyMuPDF calls, so a thread pool of 8-16 workers scales nearly linearly on a single machine. Docling and marker hold the GIL through their model invocations, so they need multiprocessing, and the model memory cost multiplies per worker. On the same 4-core VPS, pdfmux throughput is 3-4x higher than Docling for typical mixed corpora.
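The thread-pool pattern from the GIL bullet looks roughly like this. `extract_page` is a stand-in for whatever GIL-releasing extraction call you use. It is not a pdfmux function, and the worker count of 8 is simply the low end of the range mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_number: int) -> str:
    """Placeholder for a GIL-releasing extraction call (hypothetical)."""
    return f"text of page {page_number}"


def extract_all(page_numbers, workers: int = 8) -> list[str]:
    """Run extraction across a thread pool, returning results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of the inputs, not completion order.
        return list(pool.map(extract_page, page_numbers))


print(extract_all(range(3)))
```

Threads share one process, so any loaded resources are held once. With multiprocessing, each worker would carry its own copy.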
FAQ
Can pdfmux run in a Docker container without GPU?
Yes. The entire pipeline runs on CPU. A basic Docker image with pip install "pdfmux[tables,ocr]" is all you need. Avoid Alpine and use a Debian slim-bookworm base instead: RapidOCR’s wheels don’t ship for musl.
How does pdfmux handle scanned PDFs without GPU-based OCR?
pdfmux uses RapidOCR, a CPU-optimized OCR engine based on ONNX Runtime. It’s not as fast as GPU-accelerated engines such as PaddleOCR, but it handles most scanned documents accurately. The classifier routes scanned pages there automatically using the scanned-PDF detection heuristics, so you don’t have to flag them yourself.
What’s the accuracy tradeoff vs GPU-based tools?
pdfmux scores 0.905 overall vs marker’s 0.861 (which recommends GPU). pdfmux is actually more accurate than both marker and Docling (0.877) while using zero GPU. The only tool ahead is LlamaParse (0.910, paid hosted API).
Can I use pdfmux in a serverless function?
The core package (without [tables]) works in Lambda/Cloud Functions with a ~20MB footprint. The [tables] extra requires a longer cold start (Docling model loading) — use a container or persistent server for that. Most teams split it: core in Lambda, tables on a small persistent worker that the Lambda calls only when classification flags table-heavy pages.
How does this compare to the rest of the field?
We maintain a decision-flowchart comparison of 7 tools for picking by use case rather than raw score. The TL;DR: pdfmux wins on cost, license, and headings; Docling wins if tables are the only thing you care about; LlamaParse wins if you’re allowed to send your data to a hosted API.
Keep reading
- Best PDF extraction library for Python in 2026 (benchmarked) — full ranked list with per-tool scores on the 200-PDF benchmark
- What “self-healing” PDF extraction actually looks like — the full algorithm behind the extract-audit-repair pipeline
- PDF chunking strategies for RAG — what to do with the extracted text once you have it
- How to give your AI agent the ability to read any PDF — run this CPU-only pipeline as an MCP server for Claude, Cursor, or any agent
Last updated: May 18, 2026