# pdfmux

> pdfmux is an open-source Python library for reliable PDF-to-Markdown extraction, built for LLM pipelines. It classifies each PDF, routes to the best of 8 extraction backends (PyMuPDF, Docling, OpenDataLoader, RapidOCR, Surya, Marker, Mistral OCR, or a BYOK vision LLM), audits every page for quality, and auto-recovers failures. MIT licensed, pip installable, zero config needed. Latest version: 1.6.0.

pdfmux extracts text, tables, headings, and structured data from any PDF with per-page confidence scoring. It handles scanned documents via OCR fallback, complex tables via IBM Docling and Mistral OCR, academic papers via Marker, image-heavy PDFs via vision LLMs, and Arabic / Hebrew RTL via the Unicode Bidirectional Algorithm and Gemma 4. The self-healing pipeline means your AI systems always receive clean, structured data. Results are cached by file hash so re-runs are instant.

- License: MIT (permissive, no AGPL restrictions)
- Install: `pip install pdfmux`
- Latest version: 1.6.0 (April 2026)
- Python: 3.11+
- Output formats: Markdown, JSON, CSV, LLM-optimized chunks
- Quality presets: fast (PyMuPDF only), standard (multi-pass + Docling), high (vision LLM)

## Key Features

- Multi-pass extraction: fast extract, audit, OCR fallback, LLM recovery
- 8 extraction backends: PyMuPDF, OpenDataLoader, RapidOCR, Surya, Docling, Marker, Mistral OCR, BYOK vision LLM
- Table extraction: Docling (97.9% TEDS) or Mistral OCR (96.6% TEDS, $0.002/page)
- BYOK vision LLM: Gemini, Gemma 4, Claude, GPT-4o, Mistral, Ollama, any OpenAI-compatible API
- Arabic / Hebrew RTL: Unicode BiDi reordering, Arabic-aware routing, Gemma 4 OCR
- Heading detection: font-size analysis + ML classifier + consecutive-heading merge
- Per-page confidence scoring (0-1) with quality grades (good/bad/empty)
- Structured data extraction: tables as JSON, key-value pairs, schema-guided extraction (5 presets)
- Smart result cache: SHA-256 file hashing, 30-day TTL, 1 GB LRU, so re-runs are instant
- Streaming output: NDJSON page-by-page events (`pdfmux stream`, MCP `extract_streaming`)
- Configuration profiles: invoices, receipts, papers, contracts, bulk-rag (or save your own)
- Watch mode: `pdfmux watch` auto-converts new PDFs as they land
- Cost prediction: `pdfmux estimate` previews spend before running
- Diff command: `pdfmux diff a.pdf b.pdf` compares two extractions
- Auto-retry with exponential backoff on every LLM provider (Retry-After aware)
- Better error messages with `.user_message`, `.suggestion`, `.reproduce_cmd`
- Batch processing with concurrent workers
- MCP server for Claude Desktop, Cursor, and other AI agents (6 tools)
- LangChain + LlamaIndex loaders shipped as separate packages

## CLI Commands

- `pdfmux convert`: extract a PDF (auto-cached by file hash)
- `pdfmux estimate`: predict cost before running
- `pdfmux stream`: NDJSON event stream for long documents
- `pdfmux watch`: auto-convert as new PDFs land
- `pdfmux diff a.pdf b.pdf`: compare two extractions
- `pdfmux profiles list/show/save/delete`: manage saved configs
- `pdfmux benchmark`: eval all installed extractors
- `pdfmux doctor`: show installed backends and coverage gaps
- `pdfmux serve`: start MCP server (stdio or HTTP)

## Benchmark Results

Ranked #2 on opendataloader-bench (200 real-world PDFs):

- Overall: 0.905 (vs #1 hybrid-AI at 0.909)
- Reading order (NID): 0.920
- Table accuracy (TEDS): 0.911
- Heading structure (MHS): 0.852

#1 among free / open-source tools at zero cost per page.

## Docs

- [Homepage](https://pdfmux.com): Product overview, features, quickstart
- [Blog](https://pdfmux.com/blog/): Benchmarks, comparisons, tutorials
- [PyPI](https://pypi.org/project/pdfmux/): Package page, installation
- [GitHub](https://github.com/NameetP/pdfmux): Source code, issues, contributing
- [Architecture](https://github.com/NameetP/pdfmux/blob/main/docs/ARCHITECTURE.md): Module layout, routing matrix, design decisions
- [Changelog](https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md): Version history
- [Benchmarking PDF Extractors](https://pdfmux.com/blog/benchmarking-pdf-extractors/): 200-PDF benchmark comparison

## Optional

- [Privacy Policy](https://pdfmux.com/privacy.html): Privacy information
- [Terms of Service](https://pdfmux.com/terms.html): Usage terms