# pdfmux

> pdfmux is an open-source Python library for reliable PDF-to-Markdown extraction, built for LLM pipelines. It classifies each PDF, routes to the best of 8 extraction backends (PyMuPDF, Docling, OpenDataLoader, RapidOCR, Surya, Marker, Mistral OCR, or a BYOK vision LLM), audits every page for quality, and auto-recovers failures. MIT licensed, pip installable, zero config needed. Latest version: 1.6.0.

pdfmux extracts text, tables, headings, and structured data from any PDF with per-page confidence scoring. It handles scanned documents via OCR fallback, complex tables via IBM Docling and Mistral OCR, academic papers via Marker, image-heavy PDFs via vision LLMs, and Arabic / Hebrew RTL via the Unicode Bidirectional Algorithm and Gemma 4. The self-healing pipeline means your AI systems always receive clean, structured data. Results are cached by file hash so re-runs are instant.

- License: MIT (permissive, no AGPL restrictions)
- Install: `pip install pdfmux`
- Latest version: 1.6.0 (April 2026)
- Python: 3.11+
- Output formats: Markdown, JSON, CSV, LLM-optimized chunks
- Quality presets: fast (PyMuPDF only), standard (multi-pass + Docling), high (vision LLM)

## Key Features

- Multi-pass extraction: fast extract, audit, OCR fallback, LLM recovery
- 8 extraction backends: PyMuPDF, OpenDataLoader, RapidOCR, Surya, Docling, Marker, Mistral OCR, BYOK vision LLM
- Table extraction: Docling (97.9% TEDS) or Mistral OCR (96.6% TEDS, $0.002/page)
- BYOK vision LLM: Gemini, Gemma 4, Claude, GPT-4o, Mistral, Ollama, any OpenAI-compatible API
- Arabic / Hebrew RTL: Unicode BiDi reordering, Arabic-aware routing, Gemma 4 OCR
- Heading detection: font-size analysis + ML classifier + consecutive-heading merge
- Per-page confidence scoring (0-1) with quality grades (good/bad/empty)
- Structured data extraction: tables as JSON, key-value pairs, schema-guided extraction (5 presets)
- Smart result cache: SHA-256 file hashing, 30-day TTL, 1 GB LRU; re-runs are instant
- Streaming output: NDJSON page-by-page events (`pdfmux stream`, MCP `extract_streaming`)
- Configuration profiles: invoices, receipts, papers, contracts, bulk-rag (or save your own)
- Watch mode: `pdfmux watch