# pdfmux

> pdfmux is an open-source Python library for reliable PDF-to-Markdown extraction, built for LLM pipelines. It classifies each PDF, routes to the best of 8 extraction backends (PyMuPDF, Docling, OpenDataLoader, RapidOCR, Surya, Marker, Mistral OCR, or a BYOK vision LLM), audits every page for quality, and auto-recovers failures. MIT licensed, pip installable, zero config needed. Latest version: 1.6.0.

pdfmux extracts text, tables, headings, and structured data from PDFs, with per-page confidence scoring. It handles scanned documents via OCR fallback, complex tables via IBM Docling and Mistral OCR, academic papers via Marker, image-heavy PDFs via vision LLMs, and right-to-left Arabic and Hebrew text via the Unicode Bidirectional Algorithm and Gemma 4. The self-healing pipeline re-extracts pages that fail the quality audit, so downstream AI systems receive clean, structured data. Results are cached by file hash, so repeated runs on an unchanged file are served from the cache.
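The audit-and-recover flow described above can be sketched roughly as follows. This is an illustration of the idea only, not pdfmux's implementation: the `Backend` signature, `PageResult`, `extract_with_recovery`, the stub backends, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical backend signature: page index -> (text, confidence in [0, 1]).
Backend = Callable[[int], Tuple[str, float]]


@dataclass
class PageResult:
    page: int
    text: str
    confidence: float
    backend: str


def extract_with_recovery(
    num_pages: int,
    backends: Sequence[Tuple[str, Backend]],
    threshold: float = 0.8,  # assumed cutoff, not a documented pdfmux value
) -> List[PageResult]:
    """Try backends in order per page; keep the highest-confidence result."""
    if not backends:
        raise ValueError("at least one backend is required")
    results = []
    for page in range(num_pages):
        best = None
        for name, backend in backends:
            text, confidence = backend(page)
            if best is None or confidence > best.confidence:
                best = PageResult(page, text, confidence, name)
            if best.confidence >= threshold:
                break  # page passed the audit; skip the slower backends
        results.append(best)
    return results


# Stub backends standing in for a fast text extractor and an OCR fallback.
def fast_text(page: int) -> Tuple[str, float]:
    return ("digital text", 0.95) if page == 0 else ("", 0.0)


def ocr_fallback(page: int) -> Tuple[str, float]:
    return ("scanned text", 0.85)


pages = extract_with_recovery(2, [("fast", fast_text), ("ocr", ocr_fallback)])
print([(p.page, p.backend) for p in pages])  # [(0, 'fast'), (1, 'ocr')]
```

Ordering the backends from cheapest to most expensive means a clean digital page never pays for OCR, while a page that fails the audit escalates one step at a time.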

- License: MIT (permissive, no AGPL restrictions)
- Install: `pip install pdfmux`
- Latest version: 1.6.0 (April 2026)
- Python: 3.11+
- Output formats: Markdown, JSON, CSV, LLM-optimized chunks
- Quality presets: fast (PyMuPDF only), standard (multi-pass + Docling), high (vision LLM)

## Key Features

- Multi-pass extraction: fast extract, audit, OCR fallback, LLM recovery
- 8 extraction backends — PyMuPDF, OpenDataLoader, RapidOCR, Surya, Docling, Marker, Mistral OCR, BYOK vision LLM
- Table extraction: Docling (97.9% TEDS) or Mistral OCR (96.6% TEDS, $0.002/page)
- BYOK vision LLM: Gemini, Gemma 4, Claude, GPT-4o, Mistral, Ollama, any OpenAI-compatible API
- Arabic / Hebrew RTL: Unicode BiDi reordering, Arabic-aware routing, Gemma 4 OCR
- Heading detection: font-size analysis + ML classifier + consecutive-heading merge
- Per-page confidence scoring (0-1) with quality grades (good/bad/empty)
- Structured data extraction: tables as JSON, key-value pairs, schema-guided extraction (5 presets)
- Smart result cache: SHA-256 file hashing, 30-day TTL, 1 GB LRU — re-runs are instant
- Streaming output: NDJSON page-by-page events (`pdfmux stream`, MCP `extract_streaming`)
- Configuration profiles: invoices, receipts, papers, contracts, bulk-rag (or save your own)
- Watch mode: `pdfmux watch <dir>` auto-converts new PDFs as they land
- Cost prediction: `pdfmux estimate` previews spend before running
- Diff command: `pdfmux diff a.pdf b.pdf` compares two extractions
- Auto-retry with exponential backoff on every LLM provider (Retry-After aware)
- Actionable error messages exposing `.user_message`, `.suggestion`, and `.reproduce_cmd`
- Batch processing with concurrent workers
- MCP server for Claude Desktop, Cursor, and other AI agents (6 tools)
- LangChain + LlamaIndex loaders shipped as separate packages
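As a rough illustration of the result cache listed above (SHA-256 file hashing, 30-day TTL, 1 GB LRU), here is a minimal in-memory sketch. It is not pdfmux's actual cache: `cache_key`, `ResultCache`, and its methods are hypothetical names, and a real cache would presumably persist to disk.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Optional

TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL, as stated in the feature list
MAX_BYTES = 1024 ** 3            # 1 GB cap, as stated in the feature list


def cache_key(pdf_bytes: bytes) -> str:
    """Key a result by the SHA-256 digest of the file contents."""
    return hashlib.sha256(pdf_bytes).hexdigest()


class ResultCache:
    """Hypothetical in-memory cache with TTL expiry and LRU eviction."""

    def __init__(self, ttl: float = TTL_SECONDS, max_bytes: int = MAX_BYTES):
        self.ttl = ttl
        self.max_bytes = max_bytes
        self.size = 0
        self._entries = OrderedDict()  # key -> (stored_at, text), oldest first

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, text = entry
        if now - stored_at > self.ttl:
            self._drop(key)  # expired entry
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return text

    def put(self, key: str, text: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if key in self._entries:
            self._drop(key)
        self._entries[key] = (now, text)
        self.size += len(text.encode("utf-8"))
        # Evict least recently used entries until the cache fits the cap.
        while self.size > self.max_bytes and self._entries:
            self._drop(next(iter(self._entries)))

    def _drop(self, key: str) -> None:
        _, text = self._entries.pop(key)
        self.size -= len(text.encode("utf-8"))
```

Keying on the content hash rather than the file path means a renamed or moved PDF still hits the cache, while any change to the bytes produces a new key.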

## CLI Commands

- `pdfmux convert <file>` — extract a PDF (auto-cached by file hash)
- `pdfmux estimate <file>` — predict cost before running
- `pdfmux stream <file>` — NDJSON event stream for long documents
- `pdfmux watch <dir>` — auto-convert as new PDFs land
- `pdfmux diff a.pdf b.pdf` — compare two extractions
- `pdfmux profiles list/show/save/delete` — manage saved configs
- `pdfmux benchmark <file>` — evaluate all installed extractors
- `pdfmux doctor` — show installed backends and coverage gaps
- `pdfmux serve` — start MCP server (stdio or HTTP)
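For consumers of the NDJSON stream, a small reader like the one below may be a useful starting point. NDJSON carries one JSON object per line, so the parser itself is schema-agnostic. The sample events and their `page` and `confidence` fields are hypothetical; the real event schema is not specified here and may differ.

```python
import io
import json
from typing import IO, Iterator


def iter_ndjson(stream: IO[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample events; the real field names may differ.
sample = io.StringIO(
    '{"page": 1, "confidence": 0.97}\n'
    '\n'
    '{"page": 2, "confidence": 0.41}\n'
)
events = list(iter_ndjson(sample))

# Flag pages whose (assumed) confidence field falls below an arbitrary 0.5.
low_confidence_pages = [e["page"] for e in events if e["confidence"] < 0.5]
print(low_confidence_pages)  # [2]
```

Because the generator processes one line at a time, it could be pointed at any line-oriented text stream, such as the piped output of a subprocess, assuming the command writes its events to standard output.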

## Benchmark Results

Ranked #2 on opendataloader-bench (200 real-world PDFs):
- Overall: 0.905 (vs #1 hybrid-AI at 0.909)
- Reading order (NID): 0.920
- Table accuracy (TEDS): 0.911
- Heading structure (MHS): 0.852

#1 among free / open-source tools at zero cost per page.

## Docs

- [Homepage](https://pdfmux.com): Product overview, features, quickstart
- [Blog](https://pdfmux.com/blog/): Benchmarks, comparisons, tutorials
- [PyPI](https://pypi.org/project/pdfmux/): Package page, installation
- [GitHub](https://github.com/NameetP/pdfmux): Source code, issues, contributing
- [Architecture](https://github.com/NameetP/pdfmux/blob/main/docs/ARCHITECTURE.md): Module layout, routing matrix, design decisions
- [Changelog](https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md): Version history
- [Benchmarking PDF Extractors](https://pdfmux.com/blog/benchmarking-pdf-extractors/): 200-PDF benchmark comparison

## Optional

- [Privacy Policy](https://pdfmux.com/privacy.html): Privacy information
- [Terms of Service](https://pdfmux.com/terms.html): Usage terms