TL;DR: pdfmux 1.6.0 shipped today. Three new extraction backends (Mistral OCR, Marker, Gemma 4 27B), a hash-keyed result cache that turns 14-second re-runs into 0.05 seconds, NDJSON page-by-page streaming, configuration profiles with five built-ins, a watch mode that auto-converts PDFs as they land, estimate for predicting spend before running, diff for comparing two extractions, structured errors with .suggestion and .reproduce_cmd, and @with_retry on every LLM provider. Test count went from 481 to 659. Install with pip install -U pdfmux.
The shape of the release
1.6 is the largest pdfmux release since the v1.0 router rewrite. Three categories of work landed at once:
- More backends. The router now picks from seven extractors (was four) plus a BYOK vision LLM. Two of the new ones (Mistral OCR, Marker) are best-in-class for table-heavy and academic documents; the third (Gemma 4 27B) is the first vision LLM provider with native Arabic OCR.
- Performance you don’t have to think about. Smart result cache keyed by file hash. NDJSON streaming for long documents. Cost prediction before you spend anything.
- Operational ergonomics. Configuration profiles, watch mode, diff, structured errors, retry with backoff. The boring infrastructure work that makes pdfmux feel like a tool, not a script.
The rest of this post walks through every feature with the actual commands.
Three new extraction backends
The router picks per page based on classification. Before 1.6 it had PyMuPDF, RapidOCR, Surya, and Docling on the rule-based side. Now it has those plus three more.
Mistral OCR — paid backend, 96.6% TEDS on tables
Mistral’s OCR API is a cloud table extractor that scores 96.6% on TEDS — a hair below Docling’s 97.9% but with one significant advantage: it runs without local model weights. No 500 MB download. No GPU. Just an API call.
pip install "pdfmux[llm-mistral]"
export MISTRAL_API_KEY=...
pdfmux convert quarterly-report.pdf --llm-provider mistral
It’s $0.002 per page, which is in the same ballpark as running Gemini Flash and a fraction of LlamaParse. The router only routes to Mistral when a page is classified as table-heavy, so the bill stays predictable.
Marker — neural extractor for academic papers
Marker is the open-source neural extractor that’s been quietly winning on dense layouts since v1.10. It’s particularly strong on academic papers, multi-column scientific journals, and anything with footnotes that bleed into the next column.
pip install "pdfmux[marker]"
pdfmux convert attention-is-all-you-need.pdf --quality high
Marker pulls in roughly 2 GB of models on first run and is GPU-accelerated, but it works fine on CPU for low-volume use. The router classifies pages as academic based on layout signals (column count, footnote density, equation regions) and routes them to Marker before falling back to LLM extraction.
Gemma 4 27B — vision LLM with native Arabic OCR
Gemma 4 27B IT is Google’s open-weight model, and it has the best Arabic vision OCR of any open model we’ve benchmarked. Better still: it serves through the same OpenAI-compatible endpoint as Gemini, so it reuses your existing GEMINI_API_KEY.
pip install "pdfmux[llm-all]" # includes Gemma
export GEMINI_API_KEY=...
pdfmux convert bill-of-lading.pdf --llm-provider gemma
The pricing works out to roughly $0.005 per page — half of Gemini Flash, less than a quarter of Claude Sonnet. For Arabic documents specifically, it’s the new default recommendation.
Arabic and RTL, end to end
This release closes the loop on RTL support that started in 1.4.
python-bidi is now a base dependency. Every PyMuPDF, RapidOCR, and Docling output passes through markdown-aware bidi reordering — meaning heading prefixes, list markers, code fences, and pipe-table cells are preserved while inner text is reordered to correct logical reading order.
A new pdfmux.arabic module exposes:
from pdfmux.arabic import (
is_arabic_text,
arabic_ratio,
fix_bidi_order,
normalize_arabic,
)
text = "مرحبا بالعالم"
assert is_arabic_text(text)
# Fix glyph order from PyMuPDF / OCR engines
logical = fix_bidi_order(text)
# Canonicalize for indexing — strip Tatweel, unify Alef variants, drop diacritics
indexable = normalize_arabic("أَحْمَدْ") # → "احمد"
normalize_arabic is the right function to call before embedding Arabic chunks for retrieval. It removes Tatweel (the elongation character that breaks substring matching), unifies Alef and Yeh variants that look identical to readers but encode differently, and strips Tashkeel (vocalization marks) that almost never match between query and document.
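The three normalization steps can be sketched in a few lines of stdlib Python. This is an illustrative re-implementation of what the docs describe, not pdfmux's actual code; the function name `normalize_arabic_sketch` and the exact character ranges are assumptions.

```python
import re
import unicodedata

# Illustrative sketch of the normalization described above;
# pdfmux's own normalize_arabic may differ in detail.
TATWEEL = "\u0640"                       # elongation character
ALEF_VARIANTS = "\u0622\u0623\u0625"     # آ / أ / إ, unified to bare Alef ا
TASHKEEL = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun (vocalization)

def normalize_arabic_sketch(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")              # strip Tatweel
    for ch in ALEF_VARIANTS:                      # unify Alef variants
        text = text.replace(ch, "\u0627")
    text = text.replace("\u0649", "\u064A")       # Alef Maqsura -> Yeh
    return TASHKEEL.sub("", text)                 # drop Tashkeel marks

print(normalize_arabic_sketch("أَحْمَدْ"))  # prints احمد
```

Because every step is a deletion or a many-to-one mapping, the normalized form is stable: running it twice gives the same output, which is exactly what you want for an index key.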
The classifier samples the first 20 pages for Arabic content, sets DocumentResult.has_arabic, and routes pages with high Arabic ratios to Gemma 4 first. If you’ve been working around RTL in pdfmux up to now, the GCC logistics walkthrough is worth re-reading — most of the workarounds it documents are no longer necessary.
Smart result cache
This is the feature you’ll feel within five minutes of upgrading.
Every PDF that pdfmux extracts gets hashed (SHA-256) and the result is keyed on (file_hash, quality, format, schema). Subsequent runs hit the cache:
$ pdfmux convert big-report.pdf --quality high
# 14.2s
$ pdfmux convert big-report.pdf --quality high
# 0.05s
$ pdfmux convert big-report.pdf --quality high --format json
# 0.05s — separate cache entry, but the underlying extraction is reused
$ pdfmux convert big-report.pdf --no-cache
# 14.2s — bypass without disturbing the cache
$ pdfmux convert big-report.pdf --clear-cache
# purges this file's cache entries, then re-runs
Cache files live at ~/.cache/pdfmux/results/{hash}.json. TTL is 30 days. The cache is LRU-capped at 1 GB — when it fills up, the least-recently-touched results get evicted first. There’s no daemon, no background process, no setup. Just first-run latency, then instant.
The reason the cache key includes format and schema is so flipping --format markdown and --format json against the same document each get their own entry. Once you’ve extracted a document once, switching output shapes is free.
For batch processing in CI: this is roughly the difference between paying for compute every time someone touches the bucket and paying for compute exactly once per unique file. We’ve seen reruns that used to take twenty minutes on a fixture set finish in under a second after one warm pass.
Streaming output
For very long documents (think 600-page bond prospectuses) and live UIs, the new pdfmux stream command emits NDJSON events as pages complete:
$ pdfmux stream long-prospectus.pdf --quality high
{"event":"classified","page_count":624,"plan":"pymupdf+gemini-fallback"}
{"event":"page","page_num":0,"confidence":0.97,"chars":1842}
{"event":"page","page_num":1,"confidence":0.92,"chars":1611,"ocr":true}
{"event":"page","page_num":2,"confidence":0.99,"chars":2204}
...
{"event":"warning","page_num":47,"reason":"low_confidence","value":0.42}
{"event":"page","page_num":47,"confidence":0.84,"chars":1320,"recovered":true}
...
{"event":"complete","confidence":0.94,"cost_usd":0.0712}
The same machinery is exposed as a new MCP tool, extract_streaming, which means an MCP-aware agent (Claude Desktop, Cursor, or your own) can render pages progressively instead of waiting for the entire document. The MCP server now exposes six tools total: convert_pdf, analyze_pdf, extract_structured, extract_streaming, get_pdf_metadata, and batch_convert.
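Consuming the stream from Python is one subprocess and one `json.loads` per line. A minimal sketch (the `parse_events` / `stream_pages` helpers are illustrative, not part of pdfmux's API):

```python
import json
import subprocess
from typing import Iterable, Iterator

def parse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Parse NDJSON lines into event dicts, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def stream_pages(pdf_path: str) -> Iterator[dict]:
    """Yield 'page' events from `pdfmux stream` as they arrive."""
    proc = subprocess.Popen(
        ["pdfmux", "stream", pdf_path, "--quality", "high"],
        stdout=subprocess.PIPE, text=True,
    )
    try:
        for event in parse_events(proc.stdout):
            if event["event"] == "page":
                yield event
    finally:
        proc.wait()
```

Because each line is a complete JSON document, a consumer can render page 0 while page 600 is still extracting, and a dropped connection loses at most one partial line.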
Configuration profiles
The single biggest piece of pdfmux feedback over the last six months: “I have five different shapes of documents and I keep typing the same flag combinations.” Profiles solve that.
$ pdfmux profiles list
invoices quality=standard, schema=invoice, format=json
receipts quality=fast, schema=receipt, format=json
papers quality=high, chunk=true, max_tokens=500
contracts quality=high, schema=contract
bulk-rag quality=standard, format=llm, chunk=true
$ pdfmux convert vendor-invoice.pdf --profile invoices
# uses quality=standard, schema=invoice, format=json — no other flags needed
$ pdfmux profiles save my-default --quality high --format llm --chunk
# now you can `--profile my-default` anywhere
Profiles live at ~/.config/pdfmux/profiles.yaml. Built-ins ship for the five common shapes; explicit flags always win over profile values. There’s also a pdfmux profiles show <name> for printing one and pdfmux profiles delete <name> for removing your own (the built-ins are protected).
Watch mode
If you’ve ever wired a CI job or cron to pdfmux convert over a directory, the new watch command replaces all of it:
$ pdfmux watch ./inbox/ -o ./output/ --profile bulk-rag
watching ./inbox/ for new PDFs (Ctrl-C to stop)
It uses watchdog under the hood, picks up file creation and modification events, and runs the configured profile against each new PDF as it lands. Combined with profiles, it gives you a one-line “auto-convert any PDF that shows up in this folder” pipeline:
$ pdfmux profiles save partner-pipeline \
--quality high \
--schema contract \
--format json \
--chunk
$ pdfmux watch ./partner-uploads/ \
-o ./extracted/ \
--profile partner-pipeline
That’s the whole thing. No supervisor. No daemon. Ctrl-C to stop.
Cost prediction
pdfmux estimate runs the classifier and the router cost model without doing any actual extraction. It tells you what would happen and what it would cost.
$ pdfmux estimate q3-earnings.pdf --quality high --llm-provider gemini
Pages : 47
Plan : pymupdf4llm + gemini-2.5-flash on 9 pages
Estimated : $0.0234
Cache hit? : no (first run for this file)
This is most useful for two things: scripts that need to enforce a budget cap before kicking off a batch, and humans who are about to point pdfmux at a 1,200-page deposition and want to know how much that’s going to cost.
The estimator reads the same COST_PER_PAGE table the router uses for hard --budget enforcement, so the number is grounded in the same logic that controls actual spend.
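The arithmetic behind the estimate is just a per-page lookup summed over the routing plan. A hypothetical sketch using the per-page prices quoted earlier in this post; the table contents and `estimate_cost` name are illustrative, not the router's actual `COST_PER_PAGE`:

```python
# Illustrative cost model; local extractors are free, cloud ones are not.
COST_PER_PAGE = {
    "pymupdf": 0.0,         # local, no API cost
    "mistral-ocr": 0.002,   # per the pricing quoted above
    "gemma": 0.005,
}

def estimate_cost(plan: list[str]) -> float:
    """Sum per-page cost for a routing plan (one backend name per page)."""
    return sum(COST_PER_PAGE[backend] for backend in plan)

# 47 pages, 9 of them classified table-heavy and routed to Mistral OCR
plan = ["pymupdf"] * 38 + ["mistral-ocr"] * 9
```

A budget cap is then a one-line guard: refuse to run if `estimate_cost(plan)` exceeds the limit, which is the same check the router's hard --budget enforcement performs.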
Diff
When you’re tuning an extraction pipeline — comparing --quality standard to --quality high, or evaluating Mistral OCR against Docling on the same document — pdfmux diff gives you a Levenshtein-style comparison of two extractions:
$ pdfmux diff old.pdf new.pdf --quality standard
similarity : 0.94
chars (a/b) : 12,847 / 12,902
confidence (a/b): 0.91 / 0.93
cost (a/b) : $0.0000 / $0.0000
diff regions : 7
page 3: +2 lines (+102 chars)
page 7: -1 line (-44 chars)
...
It’s the missing tool for “did my router config change actually improve anything, and where?”
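The headline similarity number is the kind of thing stdlib difflib approximates well. A sketch, assuming a ratio over the two extracted texts; pdfmux's own metric is described as Levenshtein-style and may be weighted differently:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """0.0..1.0 similarity of two extracted texts (difflib approximation)."""
    return difflib.SequenceMatcher(None, a, b).ratio()
```

`SequenceMatcher.ratio()` returns 2*M/T, where M is the number of matched characters and T the combined length, so identical extractions score 1.0 and unrelated ones approach 0.0.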
Errors that tell you what to do
Every PdfmuxError now carries three additional attributes:
- .user_message — a one-line explanation aimed at the human, not the developer
- .suggestion — what to try next
- .reproduce_cmd — a copy-pasteable shell command that reproduces the failure
try:
pdfmux.extract_text("scanned.pdf", quality="high")
except pdfmux.ExtractorNotAvailable as e:
print(e.user_message)
# "OCR fallback needed but the rapidocr backend isn't installed."
print(e.suggestion)
# "Install with: pip install 'pdfmux[ocr]'"
print(e.reproduce_cmd)
# "pdfmux convert scanned.pdf --quality high --verbose"
The MCP server surfaces these directly so MCP-aware agents can recover (or ask the user to) without having to parse error strings.
Retry with backoff on every LLM provider
This is the smallest change and the one that will quietly save the most pain. Every LLM provider’s extract_page() method is now wrapped in a @with_retry(max_attempts=3, backoff_base=2.0) decorator:
- Exponential backoff (1s, 2s, 4s) with jitter on 5xx, 429, and network errors
- Honors Retry-After headers when servers send them
- Skips retry on 401 / 403 — a bad API key fails immediately instead of burning three attempts
This applies to all six providers: Gemini, Gemma 4, Claude, GPT-4o, Ollama, and Mistral. If you’ve ever had a 1,000-page batch fail at page 847 because Anthropic threw one transient 503, you’ll know why this matters.
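The decorator pattern itself is compact. A sketch along the lines of @with_retry, retrying on any exception for brevity; the real version also inspects status codes (skipping 401/403) and honors Retry-After, and the injectable `sleep` parameter here is purely for testability:

```python
import functools
import random
import time

def with_retry(max_attempts: int = 3, backoff_base: float = 2.0, sleep=time.sleep):
    """Retry decorator sketch: exponential backoff (1s, 2s, 4s) with jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise                       # out of attempts: re-raise
                    # backoff_base ** 0, 1, 2 -> 1s, 2s, 4s, plus jitter
                    sleep(backoff_base ** attempt + random.uniform(0, 0.5))
        return wrapper
    return decorator
```

The jitter matters in batch jobs: without it, a thousand pages that all hit the same transient 503 retry in lockstep and hit it again.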
Tests, docs, and the install matrix
The test suite went from 481 passing to 659 passing (3 skipped). The bulk of the new tests are around the cache, streaming, profiles, and the Arabic pipeline — areas where regressions would be silent and expensive.
The install matrix grew to match the new backends:
pip install "pdfmux[marker]" # Marker neural extractor
pip install "pdfmux[llm-mistral]" # Mistral OCR
pip install "pdfmux[llm-all]" # all LLM providers, including Gemma 4
pip install "pdfmux[watch]" # `pdfmux watch <dir>`
pip install "pdfmux[all]" # everything
Documentation is refreshed across the board: the README, the architecture doc, and the website. The self-healing extraction post still describes the core algorithm — none of that changed in 1.6.
Upgrade path
pip install -U pdfmux
That’s it. Every existing flag and import path still works. Existing extraction scripts pick up the new cache automatically (which means most of them will get faster on second run). To opt out, pass --no-cache.
If you’re using one of the LangChain or LlamaIndex loaders (langchain-pdfmux, llama-index-readers-pdfmux), they pin pdfmux>=1.2.0 and will pick up 1.6 features on their next install.
The full changelog is on GitHub. If you find anything that looks like a regression, open an issue — the test suite is at 659 but real-world PDFs find ways to surprise us regardless.
Update (2026-05-02): real-world PDFs did surprise us. We ran 1.6.0 on a 433-PDF customer batch and the first run silently lost 16 documents. The retro is live — and 1.6.1, 1.6.2, and 1.6.3 followed within 48 hours, closing the gap with --strict / --min-confidence, a manifest.json per batch, the public pdfmux.batch_extract() API, pdfmux doctor --check <dir>, and an audit-correctness fix surfaced by a new 50-fixture eval set.