PDF extraction without GPU or API keys: how pdfmux does it

TL;DR: How pdfmux achieves 99% of AI-powered extraction accuracy with zero GPU, zero API keys, and zero cost per page. Architecture explained.

Direct answer: pdfmux extracts text, tables, and headings from PDFs without a GPU or API keys by combining fast heuristic extraction (PyMuPDF) with targeted ML table extraction (Docling) only where needed. The result: 0.903 overall accuracy on the 200-PDF opendataloader-bench — 99.3% of the paid #1 score (the opendataloader-hybrid engine, 0.909) — at zero cost per page. Install: pip install pdfmux.

The cost problem in PDF extraction

The best PDF extraction tools in 2026 are expensive, either in per-page fees or in the infrastructure they need:

Gemini Flash / GPT-4o: $0.01-0.05 per page. Process 100K pages/month and you’re paying $1,000-5,000.
LlamaParse (paid): ~$3 per 1,000 pages and ships every page off-prem — a non-starter for regulated workloads.
marker: Free but needs a GPU. A T4 GPU on AWS costs ~$0.50/hour. Processing 100K pages takes ~28 hours = $14, plus the engineering cost of GPU-tier infra.
Docling: Free, no GPU required, but loads 500MB of transformer models and processes every page through them — even the 90% that are simple digital text.

For a startup building a RAG pipeline, these costs add up fast. (We benchmarked every major PDF extraction tool to find where the cost-accuracy tradeoffs actually land.) And for self-hosted deployments (healthcare, legal, finance), GPU dependencies are often a non-starter — both for compliance reasons and because GPU instance availability in regulated cloud regions is intermittent at best.

How pdfmux avoids the GPU tax

pdfmux’s key insight: 90% of PDF pages don’t need ML. They’re digital text — clean, extractable, and perfectly handled by heuristic tools like PyMuPDF at 0.01 seconds per page.

The other 10% — scanned pages, complex tables, image-heavy layouts — do need specialized tools. But you only need to pay the ML cost on those pages, and the right specialized tool runs comfortably on CPU when you’re not forcing it through every page in the document.

The architecture

PDF ──→ Classify (heuristic, <1ms)
         │
         ├─ Digital text (90%) ──→ PyMuPDF (0.01s, CPU, free)
         │                          │
         │                          ├─ Audit quality (5 checks)
         │                          ├─ Bad page? → Re-extract with OCR
         │                          └─ Inject headings (font-size analysis)
         │
         ├─ Has tables (5%) ──→ Docling (0.3-3s, CPU, free)
         │                       └─ ML table detection + extraction (see [table extraction methods](/blog/extract-tables-from-pdf-python/))
         │
         └─ Scanned (5%) ──→ RapidOCR (0.5-2s, CPU, free)
                               └─ CPU-only OCR engine

No GPU anywhere in this pipeline. The ML components (Docling for tables, RapidOCR for scans) are designed to run on CPU. The classifier itself uses zero ML — it’s a few hundred lines of geometry on the PDF’s existing layout metadata.
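The routing in the diagram boils down to a dispatch table. The sketch below is illustrative only: the `PageKind` labels, `make_router`, and the three stand-in extractors are names invented here, not pdfmux's actual internals.

```python
from enum import Enum
from typing import Callable, Dict


class PageKind(Enum):
    DIGITAL = "digital"   # clean text layer
    TABLES = "tables"     # needs ML table extraction
    SCANNED = "scanned"   # no text layer, needs OCR


Extractor = Callable[[bytes], str]


def make_router(extractors: Dict[PageKind, Extractor]) -> Callable[[PageKind, bytes], str]:
    """Return a function that sends each page to the extractor for its kind."""
    def route(kind: PageKind, page: bytes) -> str:
        return extractors[kind](page)
    return route


# Stand-in extractors (assumption): a real pipeline would call
# PyMuPDF, Docling and RapidOCR at these three points.
route = make_router({
    PageKind.DIGITAL: lambda page: "fast-text",
    PageKind.TABLES: lambda page: "ml-tables",
    PageKind.SCANNED: lambda page: "ocr",
})

print(route(PageKind.DIGITAL, b"%PDF"))
```

The point of the design is that the expensive branches are only reached for the page kinds that need them, so the common case stays on the cheap path.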

The classification step

Before extracting anything, pdfmux classifies each PDF using lightweight heuristics (no ML):

Drawn lines: Counts horizontal and vertical rules (tables have grid lines)
Number density: Counts lines with high numeric content (financial tables)
Column alignment: Checks if text blocks align vertically (columnar data)
Whitespace patterns: Detects regular spacing patterns (table structure)
PyMuPDF find_tables(): Quick heuristic table detection

This takes <1 millisecond and correctly routes 95%+ of documents. For the remaining edge cases, pdfmux runs a Docling table overlay as a safety net — extracting only table blocks from Docling’s output and merging them into the PyMuPDF text. The same logic also drives the scanned-PDF detection path, so the OCR fallback only fires when there’s no extractable text layer to begin with.
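To give a feel for how cheap these checks are, here is a minimal sketch of two of them (number density and whitespace patterns) applied to plain text lines. The function names and every threshold are assumptions chosen for illustration; pdfmux's real classifier may work differently.

```python
import re
from typing import List

# A run of 2+ spaces between two non-space characters, i.e. a column gap.
GAP = re.compile(r"(?<=\S) {2,}(?=\S)")


def numeric_line_ratio(lines: List[str], digit_share: float = 0.4) -> float:
    """Fraction of non-empty lines whose non-space characters are mostly digits."""
    dense = 0
    total = 0
    for line in lines:
        chars = [c for c in line if not c.isspace()]
        if not chars:
            continue
        total += 1
        if sum(c.isdigit() for c in chars) / len(chars) >= digit_share:
            dense += 1
    return dense / total if total else 0.0


def column_gap_ratio(lines: List[str], min_gaps: int = 2) -> float:
    """Fraction of non-empty lines that contain at least `min_gaps` column gaps."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    hits = sum(len(GAP.findall(line)) >= min_gaps for line in non_empty)
    return hits / len(non_empty)


def looks_tabular(lines: List[str], min_ratio: float = 0.5) -> bool:
    """Flag a page as table-like if either heuristic fires on most lines."""
    return (numeric_line_ratio(lines) >= min_ratio
            or column_gap_ratio(lines) >= min_ratio)


prose = ["The quick brown fox", "jumps over the lazy dog"]
table = ["Revenue 1200 1350 1500", "Costs 800 900 950", "Profit 400 450 550"]
print(looks_tabular(prose), looks_tabular(table))
```

Both checks are single passes over the text, which is why this kind of classification can stay far below the cost of running a model.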

The self-healing loop

After extraction, pdfmux doesn’t just return the text. It audits every page with 5 quality checks and re-extracts failures:

# Schematic of the loop; audit(), region_ocr() and full_ocr() are shown by name only
for page in pages:
    score = audit(page.text, page.image_count)
    if score == "bad":
        # Text present but low quality (encoding errors, truncation)
        page.text = region_ocr(page)  # OCR only the bad regions
    elif score == "empty":
        # No text extracted (likely scanned/image)
        page.text = full_ocr(page)    # OCR the entire page

This self-healing loop catches failures that other tools miss silently. The result: higher effective accuracy on degraded documents without needing expensive models. The lift is most visible on the hardest pages — scans with multi-generational copy degradation, faxed forms, columns with low DPI — where single-pass extractors return text that looks plausible but is silently incomplete.
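The loop above can be made concrete with a toy audit. This is a sketch under stated assumptions: the `Page` class, the two checks inside `audit`, their thresholds, and the OCR callables are all hypothetical stand-ins, and the real pipeline runs five checks rather than two.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    text: str
    image_count: int = 0


def audit(text: str, image_count: int) -> str:
    """Toy audit returning "empty", "bad" or "good" (illustrative thresholds)."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    # Share of Unicode replacement characters, a common sign of encoding errors.
    garbled = stripped.count("\ufffd") / len(stripped)
    # A page full of images with almost no text is probably truncated.
    suspiciously_short = image_count > 0 and len(stripped) < 20
    if garbled > 0.05 or suspiciously_short:
        return "bad"
    return "good"


def heal(pages: List[Page],
         region_ocr: Callable[[Page], str],
         full_ocr: Callable[[Page], str]) -> List[str]:
    """Re-extract failing pages in place and return the verdict for each page."""
    verdicts = []
    for page in pages:
        verdict = audit(page.text, page.image_count)
        if verdict == "bad":
            page.text = region_ocr(page)   # OCR only the bad regions
        elif verdict == "empty":
            page.text = full_ocr(page)     # OCR the entire page
        verdicts.append(verdict)
    return verdicts


pages = [
    Page("Clean paragraph of extracted text."),
    Page("\ufffd\ufffd\ufffd broken"),
    Page("", image_count=1),
]
verdicts = heal(pages,
                region_ocr=lambda p: "[region OCR]",
                full_ocr=lambda p: "[full OCR]")
print(verdicts)
```

Passing the OCR steps in as callables keeps the loop itself free of heavy dependencies, so the audit logic can be tested without any OCR engine installed.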

Benchmark proof

Tested on opendataloader-bench (200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents). Benchmark last re-run April 22, 2026 against current library versions:

Tool	Overall	GPU Required	Cost/Page	Install Size
opendataloader-hybrid (paid)	0.909	Cloud	~$0.01/page	N/A (API)
pdfmux	0.903	No	$0	~20MB
docling	0.877	Optional	$0	~500MB
marker	0.861	Recommended	$0	~2GB
mineru	0.831	Recommended	$0	~2GB

pdfmux achieves 99.3% of the paid top score (see the full benchmarked ranking of every Python library) while being:

Free (zero cost per page, no API keys)
Small (20MB core install, no model downloads for basic use)
Fast (0.05s/page average across the benchmark)
CPU-only (runs on any server, laptop, or CI runner)
Best-in-class on headings (MHS 0.847 — highest of any engine on the benchmark, paid or free)

When you DO need more

pdfmux has optional extras for harder cases:

# Core: digital PDFs, no model downloads
pip install pdfmux

# Add table extraction (downloads Docling models, ~500MB)
# Quotes keep shells such as zsh from treating the brackets as a glob
pip install "pdfmux[tables]"

# Add OCR for scanned documents (~200MB)
pip install "pdfmux[ocr]"

# Add LLM extraction for the hardest cases
pip install "pdfmux[llm]"  # Requires GEMINI_API_KEY

The [llm] extra uses Gemini Flash as a final fallback — only for pages that both PyMuPDF and OCR couldn’t handle. In practice, this is <1% of pages, and you can disable the path entirely with quality="cpu-only" if your compliance posture forbids any outbound API call.

Real-world cost comparison

Processing 100,000 PDF pages per month:

Tool	Monthly Cost	Infrastructure
GPT-4o Vision	$1,000-5,000	API only
Gemini Flash	$100-500	API only
marker (GPU)	~$350	T4 GPU instance
Docling (CPU)	~$50	4-core server
pdfmux (CPU)	~$20	2-core server
PyMuPDF only	~$5	Any server

pdfmux on a $20/month Hetzner VPS comfortably processes the kind of throughput a typical mid-size RAG ingestion workload requires (table extraction, OCR fallback, and quality auditing included). Per-page throughput depends on document mix — born-digital text is fast, table-heavy pages are slower, OCR-needing pages slower still. We measured this at smaller scale in our real-world benchmark across 1,422 pages of SEC filings and Supreme Court opinions. The same VPS handles the chunking pipeline that feeds the embedding step, so for most RAG workloads you don’t need to split extraction and chunking across separate boxes.
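For comparison with per-page API pricing, the self-hosted rows of the table can be converted to a cost per 1,000 pages. This is back-of-the-envelope arithmetic on the figures quoted above, nothing more; the variable names are illustrative.

```python
MONTHLY_PAGES = 100_000

# Approximate monthly infrastructure cost in USD, taken from the table above.
monthly_cost_usd = {
    "marker (GPU)": 350,
    "Docling (CPU)": 50,
    "pdfmux (CPU)": 20,
    "PyMuPDF only": 5,
}

# Cost per 1,000 pages at 100K pages per month.
cost_per_1k = {
    tool: cost * 1000 / MONTHLY_PAGES
    for tool, cost in monthly_cost_usd.items()
}

for tool, cost in cost_per_1k.items():
    print(f"{tool}: ${cost:.2f} per 1,000 pages")
```

At this volume the pdfmux row works out to about $0.20 per 1,000 pages, against the roughly $3 per 1,000 pages quoted earlier for LlamaParse.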

Serverless and cold-start economics

CPU-only doesn’t just save the GPU bill — it changes where you can run extraction.

AWS Lambda: The core pdfmux package (~20MB) fits comfortably under the 250MB unzipped Lambda layer cap. Cold starts are sub-second because there are no ML models to load. The [tables] extra blows past the Lambda layer cap (~500MB of Docling weights) — use a container image deployment, or split tables out to a separate ECS service.
Cloud Run / Fly.io: A container with pdfmux[tables,ocr] runs in ~1GB of memory at idle. Cold start to first response is 8-15s on first invocation (Docling model load), then sub-second on warm instances.
CI runners: pdfmux runs inside GitHub Actions, GitLab CI, and Jenkins without any GPU runner setup. Useful for golden-file regression testing extraction pipelines against your real PDF corpus on every PR.
Edge devices: The core package runs on Raspberry Pi 4 and Apple Silicon with no special build steps. RapidOCR has ARM wheels.
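The Lambda packaging decision above is simple enough to encode as a check in a deploy script. A rough sketch, assuming the approximate sizes quoted in this article; `deployment_target` and the size table are made up for illustration, and the built artifact should be measured before relying on the result.

```python
LAMBDA_UNZIPPED_CAP_MB = 250  # cap quoted in this article

# Approximate sizes quoted in this article, in MB (assumed, not measured here).
SIZE_MB = {"core": 20, "tables": 500}


def deployment_target(extras=()):
    """Pick a packaging style from the estimated unzipped size."""
    total = SIZE_MB["core"] + sum(SIZE_MB[name] for name in extras)
    if total <= LAMBDA_UNZIPPED_CAP_MB:
        return "lambda-layer"
    return "container-image"


print(deployment_target())            # core only
print(deployment_target(["tables"]))  # core plus Docling weights
```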

Compare this to marker or mineru, where the GPU dependency forces you into a narrow set of cloud regions and instance families. The accuracy delta (pdfmux 0.903 vs marker 0.861) goes the wrong way for them anyway — you’re paying for GPU infra and getting a lower score.

License, determinism, and other non-accuracy concerns

Three things that don’t show up in benchmark numbers but tend to surface during procurement:

License: pdfmux is MIT. PyMuPDF underneath it is AGPL-3.0, which pdfmux’s packaging routes around via the LGPL-friendly build path — relevant if your legal team treats AGPL as a non-starter. marker is GPL, Docling and Unstructured are permissive, LlamaParse is commercial-only.
Determinism: Rule-based extractors (the PyMuPDF path inside pdfmux, plus opendataloader) produce byte-identical output across runs. ML extractors can produce slightly different output between versions or across hardware due to floating-point non-determinism. For regulated workflows (financial filings, legal discovery, medical records), pdfmux has a deterministic=True mode that pins the rule-based path and refuses to fall through to ML — you get reproducibility in exchange for a small accuracy hit on edge cases.
GIL behavior: pdfmux releases the GIL during PyMuPDF calls, so a thread pool of 8-16 workers scales nearly linearly on a single machine. Docling and marker hold the GIL through their model invocations, so they need multiprocessing — and the model memory cost multiplies per worker. On the same 4-core VPS, pdfmux throughput is 3-4x higher than Docling for typical mixed corpora.
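The thread-pool pattern from the GIL point can be sketched with the standard library alone. `extract_many` and `fake_extract` are hypothetical names; in practice `extract` would be whatever function turns one PDF path into text, and threads only pay off if that function releases the GIL as described above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_many(paths: Iterable[str],
                 extract: Callable[[str], str],
                 workers: int = 8) -> Dict[str, str]:
    """Run `extract` over many files in a thread pool and map path -> text."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        texts = list(pool.map(extract, paths))
    return dict(zip(paths, texts))


def fake_extract(path: str) -> str:
    """Stand-in for a real extractor so the sketch runs without any PDFs."""
    return f"text from {path}"


results = extract_many(["a.pdf", "b.pdf", "c.pdf"], fake_extract, workers=8)
print(results["b.pdf"])
```

For extractors that hold the GIL, the same function shape works with `ProcessPoolExecutor`, at the price of loading any models once per worker process.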

FAQ

Can pdfmux run in a Docker container without GPU? Yes. The entire pipeline runs on CPU. A basic Docker image with pip install pdfmux[tables,ocr] is all you need. Avoid Alpine base images and use a slim-bookworm Debian base instead, because RapidOCR’s wheels don’t ship for musl.

How does pdfmux handle scanned PDFs without GPU-based OCR? pdfmux uses RapidOCR, a CPU-optimized OCR engine based on ONNX runtime. It’s not as fast as GPU-accelerated OCR engines such as PaddleOCR, but it handles most scanned documents accurately. The classifier routes scanned pages there automatically using the scanned-PDF detection heuristics, so you don’t have to flag them yourself.

What’s the accuracy tradeoff vs GPU-based tools? pdfmux scores 0.903 overall vs marker’s 0.861 (which recommends GPU). pdfmux is actually more accurate than both marker and Docling (0.877) while using zero GPU. The only tool ahead is the paid opendataloader-hybrid engine (0.909).

Can I use pdfmux in a serverless function? The core package (without [tables]) works in Lambda/Cloud Functions with a ~20MB footprint. The [tables] extra requires a longer cold start (Docling model loading) — use a container or persistent server for that. Most teams split it: core in Lambda, tables on a small persistent worker that the Lambda calls only when classification flags table-heavy pages.

How does this compare to the rest of the field? We maintain a decision-flowchart comparison of 7 tools for picking by use case rather than raw score. The TL;DR: pdfmux wins on cost, license, and headings; Docling wins if tables are the only thing you care about; LlamaParse wins if you’re allowed to send your data to a hosted API.

Keep reading

Best PDF extraction library for Python in 2026 (benchmarked) — full ranked list with per-tool scores on the 200-PDF benchmark
What “self-healing” PDF extraction actually looks like — the full algorithm behind the extract-audit-repair pipeline
PDF chunking strategies for RAG — what to do with the extracted text once you have it
How to give your AI agent the ability to read any PDF — run this CPU-only pipeline as an MCP server for Claude, Cursor, or any agent

Last updated: May 18, 2026