TL;DR: “Self-healing extraction” means extract → audit → re-extract failures — automatically. pdfmux runs 5 quality checks on every page, scores each from 0.0 to 1.0, and re-extracts broken pages with OCR. Pages that score above threshold stay untouched. Pages that score below get surgical repair. 90% of PDFs need zero healing — the overhead is measured in milliseconds.


The extraction problem nobody measures

Here’s what happens when you run most PDF extractors on a real-world document:

import pymupdf4llm

text = pymupdf4llm.to_markdown("contract.pdf")
print(len(text))  # 12,847 characters

12,847 characters. Looks good. Ship it.

But how much of that text is actually correct? Which pages extracted cleanly? Which ones are garbled? Is page 7 empty because it’s a blank separator, or because it’s a scanned signature page that returned nothing?

You don’t know. The extractor doesn’t tell you. It ran once, returned text, and moved on. If page 7 was a scanned signature page, your RAG pipeline just indexed an empty page, your search will never surface that clause, and your agent will confidently tell a user the contract has no signature provisions.

This is the gap that confidence scoring fills. Not “did extraction run?” but “did extraction work?” (And it’s not just text — table extraction has the same blind-spot problem.)


The architecture: extract-audit-repair-merge

pdfmux’s pipeline has four phases:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   EXTRACT   │───▶│    AUDIT    │───▶│   REPAIR    │───▶│    MERGE    │
│  (PyMuPDF)  │    │  (5 checks) │    │  (OCR/LLM)  │    │  (combine)  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     10ms              ~1ms/page         only broken         instant
   per page            overhead            pages

Phase 1: Fast extract

Every page gets extracted with PyMuPDF via pymupdf4llm. This produces markdown output with headings, lists, and basic structure. (For why markdown is the ideal format for LLM pipelines, see our complete guide to PDF-to-Markdown for RAG.)

Speed: ~0.01 seconds per page. A 100-page document processes in about 1 second. (We verified this at scale in our real-world benchmark across 1,422 pages of SEC filings and legal documents.)

pdfmux has a fallback built in: if pymupdf4llm returns fewer than 50 characters for a page, it tries PyMuPDF's raw page.get_text("text") instead. This catches edge cases where the markdown conversion fails but the raw text is fine.

Phase 2: Audit

This is the differentiator. Every page gets scored from 0.0 to 1.0 using five independent quality checks.

Starting score: 1.0 (perfect). Each check can subtract from this score. Final score is clamped to [0.0, 1.0].


The 5 quality checks (in detail)

Check 1: Character density

The most basic signal: does this page have enough text?

Characters < 20  →  score = 0.0 (classified as "empty", skip remaining checks)
Characters < 50  →  penalty: -0.3
Characters < 200 AND has images  →  penalty: -0.2
Characters < 200 AND no images  →  penalty: -0.1

Why these thresholds? A typical PDF page has 2,000-4,000 characters. A page with fewer than 200 characters is suspiciously sparse — it might be a scanned page where only headers extracted, or a form where embedded text in images was missed.

The distinction between “has images” and “no images” matters: a page with 150 characters and no images is probably a short page (title page, divider). A page with 150 characters and 3 images is probably a scanned page where the images contain text that wasn’t extracted.

Threshold of 20 for “empty”: Catches fully scanned pages, blank pages, and pages where only page numbers or headers extracted. 20 characters is roughly “Page 7 of 50” — just metadata, no content.
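In code, the density check might look like this (an illustrative sketch, not pdfmux's actual implementation; the "empty" case is modeled as a full 1.0 penalty so the score bottoms out at 0.0):

```python
def density_penalty(text: str, has_images: bool) -> float:
    """Penalty from the character-density check."""
    n = len(text.strip())
    if n < 20:
        return 1.0  # classified as "empty"; remaining checks are skipped
    if n < 50:
        return 0.3
    if n < 200:
        # sparse page with images is more suspicious than a short text page
        return 0.2 if has_images else 0.1
    return 0.0
```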

Check 2: Alphabetic ratio

What fraction of non-whitespace characters are actual letters (a-z, A-Z)?

alpha_ratio = count(alpha chars) / count(non-space chars)

alpha_ratio < 0.3  →  penalty: -0.25 (mostly garbage/numbers/symbols)
alpha_ratio < 0.5  →  penalty: -0.10

What this catches: Garbled OCR output often produces high ratios of symbols and digits. A healthy English text page typically has an alpha ratio of 0.70-0.85. A ratio below 0.3 means more than 70% of the characters are non-alphabetic — that’s not text, that’s noise.

Exception: Financial documents and spreadsheet-style pages naturally have lower alpha ratios due to numbers, currency symbols, and formatting. A well-extracted financial table might score 0.4-0.5 on this check. That’s why the penalty for <0.5 is only -0.10, not a hard fail.
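A minimal sketch of this check (note that Python's str.isalpha() also counts non-ASCII letters, which is a slight generalization of the a-z/A-Z description above):

```python
def alpha_ratio_penalty(text: str) -> float:
    """Penalty from the alphabetic-ratio check."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    ratio = sum(c.isalpha() for c in chars) / len(chars)
    if ratio < 0.3:
        return 0.25  # mostly garbage/numbers/symbols
    if ratio < 0.5:
        return 0.10  # soft penalty: could be a financial table
    return 0.0
```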

Check 3: Word structure

Are the “words” on this page actual words?

avg_word_length = total_alpha_chars / word_count

avg_word_length < 2 OR > 25  →  penalty: -0.15

What this catches: Two failure modes.

Too short (< 2): Broken word boundaries. When OCR or extraction splits every character with a space, “contract” becomes “c o n t r a c t” — 8 “words” averaging 1 character each. pdfmux’s post-processing step catches this specific pattern (lines where >50% of words are single characters) and collapses them back.

Too long (> 25): Missing spaces. When extraction concatenates words, “thetermsofthisagreement” is 23 characters with no spaces. This happens with some PDF generators that store text without explicit space characters.
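Both failure modes fall out of one average, as this sketch shows:

```python
def word_structure_penalty(text: str) -> float:
    """Penalty from the word-structure check."""
    words = text.split()
    if not words:
        return 0.0
    total_alpha = sum(c.isalpha() for c in text)
    avg_word_length = total_alpha / len(words)
    # too short: broken word boundaries; too long: missing spaces
    if avg_word_length < 2 or avg_word_length > 25:
        return 0.15
    return 0.0
```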

Check 4: Whitespace sanity

How many runs of 5+ consecutive spaces exist on this page?

excessive_whitespace_runs > 10  →  penalty: -0.1

What this catches: Layout extraction artifacts. When a two-column PDF is extracted as a single column, the inter-column gap creates runs of spaces in the middle of lines. Ten or more such runs suggests the extraction is capturing layout whitespace rather than clean text.

Why only -0.1? This is a soft signal. Some documents legitimately have aligned columns with whitespace (like a table of contents with dot leaders). The penalty is light enough to flag the issue without condemning the page.
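The whole check is a one-line regex count (sketch, not the actual implementation):

```python
import re

def whitespace_penalty(text: str) -> float:
    """Penalty from the whitespace-sanity check."""
    runs = len(re.findall(r" {5,}", text))  # runs of 5+ consecutive spaces
    return 0.1 if runs > 10 else 0.0
```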

Check 5: Encoding quality (mojibake detection)

Regex search for common Unicode encoding failure patterns:

Pattern: â€|Ã©|Ã¨|â€™|�

matches > 5   →  penalty: -0.20
matches > 0   →  penalty: -0.05

What this catches: Mojibake — the garbled text that appears when a PDF’s text encoding is misinterpreted. “don’t” becomes “donâ€™t”. “résumé” becomes “rÃ©sumÃ©”. The replacement character � (U+FFFD) appears when bytes can’t be decoded at all.

Why these specific patterns? These are the five most common mojibake patterns in English-language PDFs, covering UTF-8 misinterpreted as Windows-1252 (the most common encoding error) and the Unicode replacement character.

More than 5 matches means the page has systematic encoding problems — penalty of -0.20. A few matches (1-5) might be isolated special characters that got mangled — lighter penalty of -0.05.
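A sketch of the check, assuming the pattern consists of the UTF-8-decoded-as-Windows-1252 byte sequences described above:

```python
import re

# common mojibake signatures plus the Unicode replacement character
MOJIBAKE = re.compile("â€|Ã©|Ã¨|â€™|\ufffd")

def encoding_penalty(text: str) -> float:
    """Penalty from the encoding-quality check."""
    matches = len(MOJIBAKE.findall(text))
    if matches > 5:
        return 0.20  # systematic encoding problems
    if matches > 0:
        return 0.05  # isolated mangled characters
    return 0.0
```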


Page classification

After scoring, each page is classified:

Score | Text Length | Images | Classification | Action
------|-------------|--------|----------------|---------------
Any   | < 20 chars  | Any    | empty          | Full-page OCR
Any   | < 200 chars | > 0    | bad            | Region OCR
≥ 0.5 | ≥ 200 chars | Any    | good           | Keep as-is
≥ 0.5 | ≥ 50 chars  | 0      | good           | Keep as-is

The key insight: “bad” and “empty” pages get different treatment.

  • Empty pages have essentially no text. The whole page needs OCR from scratch.
  • Bad pages have some text plus images. The text might be fine — it’s the image regions that need OCR. Doing full-page OCR would overwrite the good text with a (probably worse) OCR approximation.
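The classification rules can be sketched as a small function. (This is an illustration of the table above, not pdfmux's code; combinations the table leaves unspecified are treated as "bad" here, which is an assumption.)

```python
def classify(score: float, n_chars: int, n_images: int) -> str:
    """Map a page's score, text length, and image count to a class."""
    if n_chars < 20:
        return "empty"  # full-page OCR
    if n_chars < 200 and n_images > 0:
        return "bad"    # region OCR
    if score >= 0.5 and (n_chars >= 200 or (n_chars >= 50 and n_images == 0)):
        return "good"   # keep as-is
    return "bad"        # assumption: unspecified cases fall back to "bad"
```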

Phase 3: Repair

Region OCR (for “bad” pages)

This is the surgical approach. Instead of OCR-ing the entire page, pdfmux identifies image regions that lack text overlays and OCRs only those regions.

Algorithm:

  1. Get all image bounding boxes on the page
  2. Get all text block bounding boxes on the page
  3. For each image, calculate what percentage of its area overlaps with text blocks
  4. If < 15% of an image is covered by text → it’s a “weak region” that needs OCR
  5. Filter out images smaller than 50×50 points (icons, bullets, decorative elements)
  6. Render each weak region at 200 DPI
  7. Run RapidOCR on the cropped image
  8. Append OCR text to existing page text, ordered top-to-bottom by region position

Why 15%? An image with a text overlay (like a chart with axis labels) will have significant text coverage. An image containing text that wasn’t extracted (like a scanned table embedded as an image) will have near-zero text coverage. 15% is the threshold that separates “text overlaid on image” from “text inside image.”

Why 50×50 minimum? Small images are almost always decorative — icons, bullets, logos. OCR-ing them produces noise. A 50-point square is about 0.7 inches — roughly the minimum size where text content would be expected.
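The geometry behind steps 1-5 is simple rectangle arithmetic. Here is a pure-Python sketch using (x0, y0, x1, y1) tuples — not pdfmux's actual code, and it assumes text blocks don't overlap each other (otherwise summed coverage could be overcounted):

```python
def area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def overlap(a, b):
    """Area of intersection between two rectangles."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def weak_regions(image_rects, text_rects, min_size=50, max_cover=0.15):
    """Images with < 15% text coverage that are large enough to matter."""
    weak = []
    for img in image_rects:
        if img[2] - img[0] < min_size or img[3] - img[1] < min_size:
            continue  # skip icons, bullets, decorative elements
        covered = sum(overlap(img, t) for t in text_rects)
        if covered / area(img) < max_cover:
            weak.append(img)
    return sorted(weak, key=lambda r: r[1])  # top-to-bottom order
```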

Result: The existing good text stays untouched. Only the image-embedded text gets extracted and appended. This produces better results than full-page OCR because:

  1. Digital text extracted by PyMuPDF is exact (copy-paste from the PDF engine)
  2. OCR text is always approximate (limited by image quality and model accuracy)
  3. Mixing them with surgical precision — PyMuPDF for digital regions, OCR for image regions — gives the best of both worlds

Full-page OCR (for “empty” pages)

Empty pages have nothing to preserve. pdfmux renders the full page at 200 DPI and runs RapidOCR.

Extractor priority chain:

  1. RapidOCR (priority 20) — PaddleOCR v4 via ONNX, CPU-only, ~200MB. Default confidence: 0.85.
  2. Surya OCR (priority 30) — PyTorch-based, ~5GB, GPU recommended. Renders at 300 DPI (higher than RapidOCR). Default confidence: 0.80.
  3. Gemini 2.5 Flash (priority 50) — Vision model, cloud API, handles handwriting. Default confidence: 0.90.

RapidOCR, the default, runs entirely on CPU — no GPU required for the base install. For a deeper look at how pdfmux achieves near-AI accuracy without GPU or API keys, see the architecture breakdown.

pdfmux tries extractors in priority order. If RapidOCR isn’t installed, it tries Surya. If neither OCR engine is available, it tries Gemini Flash (if API key is configured). If nothing is available, the page stays unrecovered and the confidence score reflects this.

Dynamic OCR budget

Not every empty page is worth OCR-ing. In a 200-page document with 60 empty pages (intentional blank pages, separator pages), you don’t want to OCR all 60.

pdfmux computes a dynamic budget:

default_budget = total_pages * 0.30  # 30% of document

if graphical_ratio > 0.50:
    budget = total_pages          # OCR everything
elif graphical_ratio > 0.25:
    budget = total_pages * (graphical_ratio + 0.10)
else:
    budget = default_budget

“Graphical” pages are those with multiple images and little text (≥2 images and <500 chars, or ≥1 image and <100 chars). When a document is mostly graphical, it’s probably a scanned document and every page needs OCR. When only a few pages are graphical, a 30% budget is more than enough.

Priority ordering: When the budget is constrained, “bad” pages are OCR’d before “empty” pages. Rationale: bad pages have some extracted content that provides context about what’s on the page. Empty pages might actually be intentionally blank.

Parallel processing

OCR is CPU-intensive. pdfmux dispatches OCR jobs to a ThreadPoolExecutor with 4 workers by default (good for 4-8 core machines). Workers are clamped to the number of pages needing OCR.

Why threads, not processes? ONNX runtime (used by RapidOCR) releases the GIL during inference. Thread-based parallelism avoids the overhead of serializing data between processes while still achieving real parallelism during the compute-heavy OCR step.
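The dispatch pattern is standard ThreadPoolExecutor fan-out; because ONNX releases the GIL during inference, the threads genuinely run in parallel. A sketch (ocr_fn is a stand-in for whatever engine is installed):

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages(pages, ocr_fn, max_workers=4):
    """Run ocr_fn over pages in parallel, preserving page order."""
    # clamp workers to the number of pages actually needing OCR
    workers = max(1, min(max_workers, len(pages)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_fn, pages))
```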

LLM fallback

Pages that are still bad or empty after OCR get one more chance: Gemini 2.5 Flash vision extraction. This is the nuclear option — highest quality, but costs money and sends data to Google.

pdfmux renders the page at 200 DPI, base64-encodes the PNG, and sends it to Gemini with a structured prompt requesting markdown output. The prompt handles headings, lists, tables (pipe delimiters), captions, and even handwriting ([unclear: best guess] notation).

This only fires if google-genai is installed AND a GEMINI_API_KEY or GOOGLE_API_KEY environment variable is set. Otherwise, the page is marked as unrecovered.


Phase 4: Merge

Replace fast-extracted pages with OCR/LLM results only when the new text is longer than the original. This prevents a failed OCR attempt from overwriting partial extraction with nothing.

The final document is assembled in page order: good pages from Phase 1, repaired pages from Phase 3, and unrecovered pages marked with their low confidence scores.


Document-level confidence

Individual page scores are combined into a single document confidence number:

document_confidence = Σ(page_confidence × max(1, page_char_count)) / Σ max(1, page_char_count)

This is a content-weighted average — a 5,000-character page with 0.95 confidence contributes more than a 50-character page with 0.5 confidence. This prevents short problematic pages (like a title page) from dragging down the score of a well-extracted document.

Adjustments

Three adjustments on top of the weighted average:

OCR penalty: min(0.15, ocr_ratio × 0.2) where ocr_ratio = ocr_pages / total_pages. OCR text is inherently less reliable than digitally extracted text. A document where 50% of pages were OCR’d gets a -0.10 penalty even if individual OCR scores were good.

Unrecovered penalty: min(0.40, unrecovered_ratio × 0.5). Pages that couldn’t be fixed are a red flag. If 30% of pages are unrecovered, the document gets a -0.15 penalty. If 80% are unrecovered, the max penalty of -0.40 applies — the agent should not trust this extraction.

Structure bonus: +0.03 if any page contains markdown headings (^#+\s). This is a positive signal that the extractor captured document structure, not just raw text.
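Putting the weighted average and the three adjustments together (an illustrative sketch of the formulas above, not pdfmux's implementation):

```python
def document_confidence(pages, ocr_pages=0, unrecovered=0, has_headings=False):
    """pages: list of (page_confidence, char_count) tuples."""
    weights = [max(1, chars) for _, chars in pages]
    score = sum(conf * w for (conf, _), w in zip(pages, weights)) / sum(weights)
    total = len(pages)
    score -= min(0.15, ocr_pages / total * 0.2)      # OCR penalty
    score -= min(0.40, unrecovered / total * 0.5)    # unrecovered penalty
    if has_headings:
        score += 0.03                                # structure bonus
    return max(0.0, min(1.0, score))
```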

Warning generation

The audit generates specific warnings:

  • Empty pages (<20 chars): lists exact page numbers
  • Sparse pages (20-100 chars, if >25% of total pages): suggests possible extraction issues
  • Unrecovered pages: suggests installing missing extractors (pip install "pdfmux[ocr]")

What this looks like in practice

Example 1: Clean digital PDF (zero overhead)

A 30-page digital research paper:

Phase 1: Extract all 30 pages with PyMuPDF → 0.3s
Phase 2: Audit all 30 pages → 30/30 good (all scores > 0.90)
Phase 3: Skipped (nothing to repair)
Phase 4: Skipped (nothing to merge)

Document confidence: 0.96
Warnings: none
Total time: ~0.35s

The audit added ~50ms of overhead. For 90% of PDFs, this is the entire pipeline. The fast path is fast.

Example 2: Mixed document with scanned pages

A 47-page pitch deck with embedded screenshots and a few scanned pages:

Phase 1: Extract all 47 pages with PyMuPDF → 0.5s
Phase 2: Audit → 23 good, 18 bad, 6 empty
Phase 3:
  - 18 bad pages → region OCR (only image regions) → 14 recovered
  - 6 empty pages → full-page OCR → 5 recovered
  - 5 pages → LLM fallback → 3 recovered
  - 2 pages unrecovered (decorative full-bleed images, no text content)
Phase 4: Merge recovered pages into document

Document confidence: 0.87
Warnings: "2 pages unrecovered (pages 12, 29) — these appear to be decorative images"
Total time: ~18s

Standard mode: 0.87 confidence with 22 of 24 problem pages fixed automatically.

Example 3: Fully scanned document

A 15-page scanned contract:

Phase 1: Extract all 15 pages with PyMuPDF → 0.15s
Phase 2: Audit → 0 good, 0 bad, 15 empty
Phase 3:
  - OCR budget: 100% (>50% graphical)
  - 15 pages → full-page OCR (RapidOCR, 4 parallel workers) → 15 recovered
Phase 4: Replace all pages with OCR results

Document confidence: 0.82
OCR penalty applied: -0.03 (100% OCR'd)
Warnings: none
Total time: ~12s

All pages empty on fast extract, all recovered via OCR. The 0.82 confidence reflects that OCR output is inherently less exact than digital extraction — a realistic assessment, not inflated certainty.


Post-processing: cleaning up extraction artifacts

After extraction (whether from PyMuPDF or OCR), pdfmux runs a cleanup pipeline:

  1. Remove control characters (except newlines and tabs)
  2. Collapse excessive newlines (4+ consecutive → 3)
  3. Fix broken hyphenation (word-\nword)
  4. Fix spaced-out text: If >50% of “words” in a line are single characters, collapse them (W i t h  o v e r → With over)
  5. Remove trailing whitespace

Step 4 is the interesting one. Some PDF generators and OCR engines insert spaces between every character. The heuristic is simple but effective: count single-character “words” in a line. If they’re the majority, the line’s characters were probably extracted with spurious spaces. Collapse and rejoin.
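A naive version of step 4, assuming (as pdfmux may not) that the original word boundaries survive as runs of two or more spaces:

```python
import re

def fix_spaced_text(line: str) -> str:
    """Collapse lines where most 'words' are single characters."""
    words = line.split()
    if not words or sum(len(w) == 1 for w in words) / len(words) <= 0.5:
        return line  # majority of words are real words; leave untouched
    # assumption: word boundaries survive as runs of 2+ spaces
    return " ".join("".join(part.split())
                    for part in re.split(r" {2,}", line.strip()))
```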


The “self-healing” framing

Why call this “self-healing”? Because the pipeline:

  1. Detects its own failures (the audit catches what the extractor missed)
  2. Diagnoses the type of failure (empty vs. bad, which specific checks failed)
  3. Repairs using a different strategy (region OCR vs. full-page OCR vs. LLM)
  4. Verifies the repair worked (the repaired text must be longer than the original)

This is fundamentally different from “try harder” approaches like running a better model on the whole document. The precision matters:

  • A good page stays untouched (don’t degrade working extraction)
  • A bad page gets surgical repair (OCR only the image regions)
  • An empty page gets full replacement (nothing to preserve)
  • An unrecovered page is flagged honestly (don’t pretend it worked)

The confidence score at the end isn’t a vanity metric. It’s a contract: “here is how much you should trust this extraction, and here’s exactly where the problems are.” This approach is why pdfmux scores 0.905 overall on the 200-PDF benchmark — the best among free tools.


Using confidence scores in your application

Gate on confidence

import pdfmux

result = pdfmux.extract_json("contract.pdf")

if result["metadata"]["confidence"] < 0.80:
    # Flag for human review
    send_to_review_queue(result)
else:
    # Process automatically
    index_in_vector_db(result)

Per-page confidence for selective RAG

This pattern is especially powerful in RAG pipelines where extraction quality directly determines answer quality.

chunks = pdfmux.load_llm_context("report.pdf")

for chunk in chunks:
    if chunk["confidence"] > 0.85:
        # Index high-confidence chunks for RAG
        vector_db.upsert(chunk["text"], metadata=chunk)
    else:
        # Store but mark as low-confidence
        vector_db.upsert(chunk["text"], metadata={**chunk, "needs_review": True})

Confidence-aware agent behavior

When pdfmux runs as an MCP server, the confidence metadata flows directly to the agent. A well-prompted agent can:

  • Use analyze_pdf first (quick triage) before committing to full extraction
  • Request quality: "high" when standard mode returns low confidence
  • Tell the user when a document can’t be reliably processed
  • Skip low-confidence sections when answering questions

This is the real value of confidence scoring: it makes the agent’s behavior proportional to the quality of its inputs.


Try it

pip install pdfmux

# Extract with confidence scoring
python -c "
import pdfmux
result = pdfmux.extract_json('your-file.pdf')
print(f'Confidence: {result[\"metadata\"][\"confidence\"]:.0%}')
for page in result['pages']:
    print(f'  Page {page[\"page\"]}: {page[\"confidence\"]:.2f} ({page[\"extractor\"]})')
"

# Or use the CLI
pdfmux analyze your-file.pdf  # quick quality triage
pdfmux your-file.pdf          # full extraction with confidence
  • GitHub — source code with the full audit implementation
  • PyPI — pip install pdfmux
  • pdfmux.com — documentation

MIT licensed. Runs locally. No API keys needed for the base install.




Built by Nameet Potnis. Contributions welcome.