Direct answer: pdfmux extracts text, tables, and headings from PDFs without a GPU or API keys by combining fast heuristic extraction (PyMuPDF) with targeted ML table extraction (Docling) only where needed. The result: 0.905 overall accuracy on a 200-PDF benchmark — 99.5% of the paid #1 score — at zero cost per page. Install: pip install pdfmux.


The cost problem in PDF extraction

The best PDF extraction tools in 2026 are expensive:

  • Gemini Flash / GPT-4o: $0.01-0.05 per page. Process 100K pages/month and you’re paying $1,000-5,000.
  • marker: Free but needs a GPU. A T4 GPU on AWS costs ~$0.50/hour. Processing 100K pages takes ~28 hours = $14.
  • Docling: Free, no GPU required, but loads 500MB of transformer models and processes every page through them — even the 90% that are simple digital text.

For a startup building a RAG pipeline, these costs add up fast. (We benchmarked every major PDF extraction tool to find where the cost-accuracy tradeoffs actually land.) And for self-hosted deployments (healthcare, legal, finance), GPU dependencies are often a non-starter.

How pdfmux avoids the GPU tax

pdfmux’s key insight: 90% of PDF pages don’t need ML. They’re digital text — clean, extractable, and perfectly handled by heuristic tools like PyMuPDF at 0.01 seconds per page.

The other 10% — scanned pages, complex tables, image-heavy layouts — do need specialized tools. But you only need to pay the ML cost on those pages.

The architecture

PDF ──→ Classify (heuristic, <1ms)
         ├─ Digital text (90%) ──→ PyMuPDF (0.01s, CPU, free)
         │                          │
         │                          ├─ Audit quality (5 checks)
         │                          ├─ Bad page? → Re-extract with OCR
         │                          └─ Inject headings (font-size analysis)
         ├─ Has tables (5%) ──→ Docling (0.3-3s, CPU, free)
         │                       └─ ML table detection + extraction (see [table extraction methods](/blog/extract-tables-from-pdf-python/))
         └─ Scanned (5%) ──→ RapidOCR (0.5-2s, CPU, free)
                               └─ CPU-only OCR engine

No GPU anywhere in this pipeline. The ML components (Docling for tables, RapidOCR for scans) are designed to run on CPU.

The classification step

Before extracting anything, pdfmux classifies each PDF using lightweight heuristics (no ML):

  1. Drawn lines: Counts horizontal and vertical rules (tables have grid lines)
  2. Number density: Counts lines with high numeric content (financial tables)
  3. Column alignment: Checks if text blocks align vertically (columnar data)
  4. Whitespace patterns: Detects regular spacing patterns (table structure)
  5. PyMuPDF find_tables(): Quick heuristic table detection

This takes <1 millisecond and correctly routes 95%+ of documents. For the remaining edge cases, pdfmux runs a Docling table overlay as a safety net — extracting only table blocks from Docling’s output and merging them into the PyMuPDF text.

The self-healing loop

After extraction, pdfmux doesn’t just return the text. It audits every page with 5 quality checks and re-extracts failures:

for page in pages:
    score = audit(page.text, page.image_count)
    if score == "bad":
        # Text present but low quality (encoding errors, truncation)
        page.text = region_ocr(page)  # OCR only the bad regions
    elif score == "empty":
        # No text extracted (likely scanned/image)
        page.text = full_ocr(page)    # OCR the entire page

This self-healing loop catches failures that other tools miss silently. The result: higher effective accuracy without needing expensive models.
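The audit itself can be approximated with cheap text statistics. Here is a toy version — the checks and thresholds are our assumptions for illustration; the post doesn't specify pdfmux's actual five checks:

```python
def audit(text: str, image_count: int) -> str:
    """Toy quality audit: returns 'ok', 'bad', or 'empty' (assumed thresholds)."""
    if not text.strip():
        # No text at all: a scanned page if images are present, else truly blank
        return "empty" if image_count else "ok"
    # Check 1: U+FFFD replacement characters signal broken font encodings
    if text.count("\ufffd") / len(text) > 0.01:
        return "bad"
    # Check 2: a very low alphanumeric ratio usually means garbled extraction
    if sum(ch.isalnum() for ch in text) / len(text) < 0.3:
        return "bad"
    return "ok"
```

The point is that each check is O(page length) string arithmetic — auditing every page costs almost nothing compared to even one ML pass.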

Benchmark proof

Tested on opendataloader-bench (200 real-world PDFs):

| Tool | Overall | GPU required | Cost/page | Install size |
|---|---|---|---|---|
| hybrid (AI) | 0.909 | No | ~$0.01 | N/A (API) |
| pdfmux | 0.905 | No | $0 | ~20MB |
| docling | 0.877 | No | $0 | ~500MB |
| marker | 0.861 | Recommended | $0 | ~2GB |
| mineru | 0.831 | Recommended | $0 | ~2GB |

pdfmux achieves 99% of the AI-powered top score (see the full 200-PDF benchmark results) while being:

  • Free (zero cost per page, no API keys)
  • Small (20MB core install, no model downloads for basic use)
  • Fast (0.05s/page average across the benchmark)
  • CPU-only (runs on any server, laptop, or CI runner)

When you DO need more

pdfmux has optional extras for harder cases:

# Core: digital PDFs, no dependencies
pip install pdfmux

# Add table extraction (downloads Docling models, ~500MB)
pip install pdfmux[tables]

# Add OCR for scanned documents (~200MB)
pip install pdfmux[ocr]

# Add LLM extraction for the hardest cases
pip install pdfmux[llm]  # Requires GEMINI_API_KEY

The [llm] extra uses Gemini Flash as a final fallback — only for pages that both PyMuPDF and OCR couldn’t handle. In practice, this is <1% of pages.

Real-world cost comparison

Processing 100,000 PDF pages per month:

| Tool | Monthly cost | Infrastructure |
|---|---|---|
| GPT-4o Vision | $1,000-5,000 | API only |
| Gemini Flash | $100-500 | API only |
| marker (GPU) | ~$350 | T4 GPU instance |
| Docling (CPU) | ~$50 | 4-core server |
| pdfmux (CPU) | ~$20 | 2-core server |
| PyMuPDF only | ~$5 | Any server |

pdfmux on a $20/month Hetzner VPS processes 100K pages with table extraction, OCR fallback, and quality auditing. We validated this throughput in our real-world benchmark across 1,422 pages of SEC filings and Supreme Court opinions.

FAQ

Can pdfmux run in a Docker container without GPU? Yes. The entire pipeline runs on CPU. A basic Docker image with pip install pdfmux[tables,ocr] is all you need.

How does pdfmux handle scanned PDFs without GPU-based OCR? pdfmux uses RapidOCR, a CPU-optimized OCR engine. It’s slower than GPU-accelerated engines like PaddleOCR, but it handles most scanned documents accurately.

What’s the accuracy tradeoff vs GPU-based tools? pdfmux scores 0.905 overall vs marker’s 0.861 (which recommends GPU). pdfmux is actually more accurate than both marker and Docling (0.877) while using zero GPU. The only tool ahead is hybrid AI (0.909, paid API calls).

Can I use pdfmux in a serverless function? The core package (without [tables]) works in Lambda/Cloud Functions with a ~20MB footprint. The [tables] extra requires a longer cold start (Docling model loading) — use a container or persistent server for that.

Last updated: March 2026