By Nameet Potnis, founder of pdfmux. @NameetP on GitHub · pdfmux is MIT-licensed and runs locally — no API keys, no SaaS.
If you run a RAG pipeline today, somewhere between 1 and 5 percent of the documents you think you indexed contain zero extracted text. Your users are asking questions about contracts that, as far as your vector store is concerned, are blank. We know — we shipped this exact failure on our own customer’s batch last week. Here’s the audit.
| Same 433 customer PDFs | Result |
|---|---|
| Subprocess + pypdf fallback (the script most engineers write) | 16 silent failures |
| pdfmux.batch_extract() with [ocr] extra | 0 failures |
The product solved the problem. The default invocation didn’t. This post is about closing that gap — and how to verify your current pipeline isn’t open to the same failure.
TL;DR: We pointed pdfmux at a real B2B chemical distribution customer’s 433-PDF batch — a mix of digital, scanned, and partially-corrupted product data sheets. The first script we wrote shelled out to pdfmux convert in a subprocess loop and used pypdf as a fallback. It reported “412 processed, 16 failed.” Eleven of those sixteen failures had no log line at all — they were missing rows in the output CSV that nobody noticed until we audited the manifest. The fix was a one-line API change (pdfmux.extract_text(quality="standard") + pdfmux[ocr]) and zero output failures on rerun. But the lesson was bigger: the audit machinery was right internally; the CLI exit-code contract didn’t propagate the signal. We shipped 1.6.1 and 1.6.2 over the next twenty-four hours to close that gap.
Last updated: 2026-05-02. Includes pdfmux 1.6.3 audit-correctness fix.
What we ran
A real customer’s Google Drive folder. 433 unique PDFs after deduplicating by filename stem. Mixed origin: some scanned on phone cameras, some printed and re-scanned, some downloaded as email attachments, some clean digital exports. Languages: English plus a handful of mixed Arabic-English documents. Document classes:
| Type | Count |
|---|---|
| TDS (technical data sheets) | 253 |
| PDS (product data sheets) | 80 |
| MSDS (material safety) | 66 |
| COA (certificates of analysis) | 11 |
| SDS (safety data sheets) | 11 |
| OTHER | 7 |
| PO (purchase orders) | 5 |
| Total | 433 |
We went through three script iterations across the day. v1 shelled out to pdfmux convert in a subprocess loop, with pypdf as a fallback for files where the CLI didn't return enough text. v3, a partial pivot to calling pymupdf4llm directly, failed on a missing dependency and was abandoned. v4 used the pdfmux Python API.
Test this against your own pipeline in 10 minutes:

```bash
pip install 'pdfmux[ocr]' && pdfmux doctor --check <your-inbox-dir>
```

doctor samples 10 PDFs and tells you what fraction of your batch is scanned, truncated, or non-Latin. If your current extractor doesn't ship OCR by default and your batch is more than 10 percent scanned, you are losing those documents silently right now.
What v1 got wrong
We built the v1 script with the structure most people would write: subprocess the CLI, fall back to a different library if the CLI doesn’t produce enough output, log the failures.
```python
# v1 — what we wrote first. Don't do this.
import subprocess
import pypdf

result = subprocess.run(
    ["pdfmux", "convert", str(pdf_path)],
    capture_output=True, text=True, timeout=60,
)
if result.returncode == 0 and len(result.stdout) > 100:
    text = result.stdout
else:
    # Fallback: pypdf (this fallback is itself a bug — see failure #2)
    reader = pypdf.PdfReader(str(pdf_path))
    text = "\n\n".join(p.extract_text() for p in reader.pages if p.extract_text())
```
When the script finished it printed:
```
Done: 412 processed, 5 skipped (cached), 16 failed
```
Sixteen failures, across four categories: the first three lost documents outright; the fourth is an observability gap we hit later in the v4 run.
1. Eleven scanned PDFs that returned zero characters (no log line)
Most directories of real-world PDFs include at least one scan. In this batch they were field reports — phone-camera photos of forms, faxed safety data sheets, printed-then-re-scanned MSDS pages. We had installed pdfmux without the [ocr] extra. PyMuPDF returns zero characters on a scanned page because there is no embedded text layer to read. The CLI exited 0 with nearly empty stdout; our len(stdout) > 100 guard rejected it; the pypdf fallback also returned zero. The script logged a [SKIP] marker that never made it into the production log, and moved on.
The script was technically correct: it tried, it failed, it logged. But the output CSV shipped with eleven products missing and the operator didn’t notice for days.
Eleven of 433 documents — 2.5 percent — left our pipeline with zero text and zero log line. In RAG terms: 2.5 percent of this customer’s product catalog became unanswerable, with no error to alert on. The internal audit data scored every one of those eleven pages at 0.0 confidence. The signal was there. The CLI surface buried it.
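If you want to catch this class of document before the batch runs, the pre-flight check is cheap. A minimal sketch using PyMuPDF directly (the fitz module, the same engine pdfmux wraps; this is not a pdfmux API):

```python
# Pre-flight: flag PDFs with no embedded text layer (likely scans).
import fitz  # pip install pymupdf
from pathlib import Path

for path in Path("./inbox").glob("*.pdf"):
    with fitz.open(path) as doc:
        chars = sum(len(page.get_text()) for page in doc)
    if chars == 0:
        # Zero extractable characters across every page: without OCR,
        # this document leaves the pipeline empty.
        print(f"NEEDS OCR: {path.name}")
```

pdfmux doctor --check (below) does a sampled version of the same triage for you.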
This is the failure mode pdfmux’s brand explicitly positions against — “you will never get silent garbage. pdfmux tells you what extracted, how confidently, and which pages need attention.”
2. Four “Stream has ended unexpectedly” failures from pypdf
```
[WARN] pypdf failed for INV-01_*.pdf: Stream has ended unexpectedly
[WARN] pypdf failed for PDS-14_*.pdf: Stream has ended unexpectedly
[WARN] pypdf failed for PO-01_*.pdf: Stream has ended unexpectedly
[WARN] pypdf failed for SDS-03_*.pdf: Stream has ended unexpectedly
```
These were truncated PDFs — email attachments saved partially, or shared through a tool that cut off the trailing bytes. pypdf rejects them because the xref table is corrupt. For these four files the pdfmux subprocess path had already failed silently, so the same guard from #1 routed them to the fallback, and the pypdf fallback then failed loudly. Both paths returned nothing, and the script shipped a CSV without those four products.
We tested the same four PDFs in v4 with pdfmux.extract_text directly. PyMuPDF’s xref repair handled all four cleanly — the bytes that were there got extracted, the missing bytes got dropped.
If your pipeline already falls back to pypdf when the primary extractor returns short output, you have this exact bug. PyMuPDF (what pdfmux runs by default) reads a strict superset of what pypdf can — so the fallback path you wrote to be safer is actually rejecting recoverable documents. We did it too. Delete the fallback; the sketch below shows the asymmetry on a single file.
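A minimal reproduction, assuming one damaged file named truncated.pdf on disk (lightly truncated, like the first benchmark row below; once the header is gone, PyMuPDF errors too):

```python
# Same truncated file, both libraries. pypdf rejects the corrupt xref;
# PyMuPDF rebuilds what it can and extracts the surviving bytes.
import fitz  # PyMuPDF
import pypdf

try:
    pypdf.PdfReader("truncated.pdf")
except Exception as exc:
    print(f"pypdf: {exc}")  # e.g. "Stream has ended unexpectedly"

with fitz.open("truncated.pdf") as doc:  # attempts xref repair on open
    text = "\n\n".join(page.get_text() for page in doc)
print(f"PyMuPDF recovered {len(text)} chars")
```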
3. One JSON parse error in the downstream extraction step
```
[ERROR] Structure extraction failed for *.pdf: Extra data: line 22 column 1 (char 464)
```
Our pipeline ran a downstream LLM step that took pdfmux’s markdown output and produced structured JSON. For one document, the LLM returned what looked like JSON but had trailing content after the closing brace. Not a pdfmux bug — but the v1 pipeline had no retry on JSON-parse failures, so a transient LLM artifact silently dropped the row. In v4 (cleaner markdown from pdfmux) the LLM’s output was consistently parseable.
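The generic hardening here is independent of the extractor: accept the first valid JSON value, ignore trailing bytes, and only retry the LLM call if even that fails. A sketch:

```python
# Tolerate trailing junk after a valid JSON object (the exact shape of
# the "Extra data: line 22 column 1" error above).
import json

def parse_first_json(raw: str) -> dict:
    # raw_decode parses one JSON value and reports where it stopped,
    # instead of demanding that the whole string be valid JSON.
    obj, end = json.JSONDecoder().raw_decode(raw.lstrip())
    return obj  # anything past `end` is the trailing artifact; log it if useful
```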
4. RapidOCR warnings without context
In the v4 run, RapidOCR printed eight lines that looked like this:
```
[WARNING] 2026-04-30 17:23:16,370 [RapidOCR] main.py:132: The text detection result is empty
```
No file. No page number. No way to know whether it was “this is a genuinely blank page” (fine) or “OCR gave up on a page that needed re-extraction” (bad). The signal was there in the upstream library; the surface was wrong.
(Also worth noting: pdfmux 1.5 printed Failed to load ML heading model: No module named 'sklearn' 24+ times during the run. It was harmless noise on every batch — the heuristic fallback ran and produced correct headings. Removing it in 1.6.1 deleted 250 lines of code and 24 stderr prints per run.)
Where this fails on other extractors too
The retro is easy to read as “pdfmux had a bug, pdfmux fixed it.” That’s local. The category-wide picture is worse.
We re-ran the same five failure modes against pypdf 6.10.2 on the public eval set we built for 1.6.2’s regression tests (50 labeled fixtures spanning the same failure categories — 0-byte files, HTML masquerading as PDF, truncated PDFs at two severity levels, image-only PDFs, blank pages). Reproducible at eval/benchmark_pypdf.py.
| Failure category | n | pypdf recovered | pdfmux (PyMuPDF) recovered |
|---|---|---|---|
| Truncated PDF, 70% bytes intact (xref damaged) | 5 | 0 of 5 | 5 of 5 (218 chars, conf 1.00) |
| Truncated PDF, 15% bytes intact (header gone) | 4 | 0 of 4 | 0 of 4 (errors) |
| Image-only PDF (no text layer, no OCR) | 5 | 0 of 5 | 0 of 5 (conf 0.00) |
| Blank PDF (one valid empty page) | 4 | 0 of 4 | 0 of 4 (conf 0.00) |
| HTML renamed to .pdf | 5 | 0 of 5 | 0 of 5 (37 chars, conf 0.73 — flagged) |
| 0-byte file | 5 | 0 of 5 (raises) | 0 of 5 (raises FileError) |
The first row is the load-bearing one. PyMuPDF recovers 5 of 5 lightly-truncated PDFs; pypdf recovers 0 of 5. Same files, opposite outcomes. That’s the “strict superset” claim made concrete: any pipeline that uses pypdf as a fallback for “primary extractor returned short output” turns recoverable documents into failures.
The other rows are the silent-failure baseline. Both extractors return zero characters on image-only, blank, and severely-truncated PDFs at quality=fast (no OCR). The difference is that pdfmux returns a 0.0 confidence score the caller can gate on; pypdf returns an empty string and raises nothing, no signal at all. With pdfmux[ocr] and quality=standard, the image-only row flips to 5 of 5 recovered for pdfmux. The pypdf column stays 0 of 5 — pypdf doesn't ship OCR.
We did not run cloud APIs (LlamaParse, Unstructured Cloud) on this eval set — that’s a follow-up benchmark — but those tools also do not emit per-document confidence in their public response, so the “silent failure” surface is the same architectural gap regardless of extraction quality.
What we changed
Everything below shipped in pdfmux 1.6.1, 1.6.2, and 1.6.3 over the next forty-eight hours. Every change is additive — no breaking defaults yet. The breaking-default change comes in 1.7 once we calibrate confidence thresholds against a labeled eval set.
--strict and --min-confidence flags
The most important change. The audit machinery already produced accurate per-document confidence; it just needed a CLI surface.
```bash
# Old behavior (still default in 1.6.x for backwards compat):
pdfmux convert ./inbox/ -o ./output/
# Done: 433 converted, 0 failed   ← exit code 0 even if 11 docs returned 0% confidence

# New behavior — fail loud in CI:
pdfmux convert ./inbox/ -o ./output/ --strict --min-confidence 0.20
# pdfmux WARNING: scan_001.pdf confidence 0.00 (below 0.50 threshold)
# pdfmux ERROR: --strict gate failed — 11 of 433 documents below --min-confidence 0.20
# exit 3
```
Any document with confidence under 0.50 now also emits a stderr WARNING regardless of --strict, so silent low-quality batches are visible in CI logs without changing the exit code.
Exit codes are now documented (a CI sketch follows the list):

- 0 — success
- 1 — extraction or runtime error
- 2 — usage error (bad arguments, file not found)
- 3 — strict gate failed
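Branching on those codes in CI takes a few lines. A sketch in Python, using only the exit-code contract above:

```python
# CI gate: treat a failed --strict run as a hard build failure.
import subprocess
import sys

proc = subprocess.run([
    "pdfmux", "convert", "./inbox/", "-o", "./output/",
    "--strict", "--min-confidence", "0.20",
])
if proc.returncode == 3:
    sys.exit("pdfmux strict gate failed: low-confidence documents in batch")
if proc.returncode != 0:
    sys.exit(f"pdfmux error (exit {proc.returncode})")
```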
manifest.json per batch
Every pdfmux convert <directory> run now writes a manifest.json to the output directory. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
```json
{
  "schema_version": "1.0",
  "summary": {
    "total": 433, "processed": 433, "failed": 0,
    "low_confidence": 0,
    "min_confidence_threshold": 0.20,
    "confidence_breakdown": {
      "high_ge_0.80": 421,
      "medium_0.50_to_0.80": 12,
      "low_lt_0.50": 0
    }
  },
  "documents": [
    {"path": "scan_001.pdf", "status": "ok",
     "confidence": 0.92, "extractor": "rapidocr", ...}
  ]
}
```
If v1 had had this, the eleven silent failures would have been visible in the very first run.
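It also makes the after-the-fact audit a five-line script. A sketch against the schema above (using .get() in case a failed entry omits a field):

```python
# Audit a finished batch straight from its manifest.
import json
from pathlib import Path

manifest = json.loads(Path("./output/manifest.json").read_text())
for doc in manifest["documents"]:
    if doc.get("status") != "ok" or doc.get("confidence", 0.0) < 0.50:
        print(f"REVIEW {doc['path']}: status={doc.get('status')} "
              f"confidence={doc.get('confidence', 0.0):.2f}")
```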
pdfmux.batch_extract() as a public API
The Python API is the right path for batch processing — it avoids three subprocess spawns per file, handles non-ASCII filenames without shell-quoting issues, and is significantly faster.
```python
import pdfmux
from pathlib import Path

pdfs = list(Path("./inbox").glob("*.pdf"))
for path, result in pdfmux.batch_extract(pdfs, quality="standard", workers=4):
    if isinstance(result, Exception):
        print(f"FAILED {path.name}: {result}")
        continue
    if result.confidence < 0.50:
        print(f"REVIEW {path.name} ({result.confidence:.2f})")
    else:
        print(f"OK {path.name} ({result.confidence:.2f})")
```
batch_extract was always there internally as pdfmux.pipeline.process_batch. We just hadn't exposed it. Surfacing it was a 15-line patch and a documentation change. It's the right thing to reach for in a Python script.
pdfmux doctor --check <directory>
The antidote to the "I forgot to install pdfmux[ocr]" failure mode. It samples the directory before you run the batch, classifies the PDFs, and tells you which extras are missing for this input:
```
$ pdfmux doctor --check ./inbox/
Batch check: 433 PDFs in ./inbox/ (sampling 10)

Page type    Sample count    Estimated batch share
─────────────────────────────────────────────────
digital      7               70% (~303 of 433)
scanned      3               30% (~130 of 433)

Recommendations for this batch
⚠ ~130 of 433 PDFs look scanned. Without OCR they will return empty text.
  Install: pip install 'pdfmux[ocr]'
```
doctor --check runs in seconds (it samples, doesn’t extract) and surfaces the install gap before you waste a 50-minute batch.
RapidOCR warnings translated
The bare upstream warnings are gone. They now route through pdfmux’s logger with file + page context attached:
```
pdfmux.extractors.rapid_ocr INFO: OCR found no text on page 4 of scan_017.pdf
```
You can now tell which document and page an empty OCR result came from.
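And because it is a named logger rather than a bare print, you can route or silence it like any other Python logging stream. A sketch, assuming the logger name shown in the line above:

```python
# Route pdfmux's OCR extractor messages into a dedicated audit log.
import logging

ocr_log = logging.getLogger("pdfmux.extractors.rapid_ocr")
ocr_log.setLevel(logging.INFO)
ocr_log.addHandler(logging.FileHandler("ocr_audit.log"))
```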
Eleven regression tests for the failure modes
Shipped in 1.6.2. tests/test_real_world_failures.py pins the failure categories from the eval set above as behavioral contracts:
```python
import pdfmux

class TestZeroBytePdf:
    def test_zero_byte_via_batch_yields_exception(self, zero_byte_pdf):
        """A bad file must yield an exception in batch_extract,
        not silently appear successful."""
        results = list(pdfmux.batch_extract([zero_byte_pdf], quality="fast"))
        assert isinstance(results[0][1], Exception)
```
670 tests passing total, up from 659 in 1.6.0. None of these tests required behavior changes — pdfmux handled every fixture correctly already. We just hadn’t pinned the contracts.
The lesson, in three lines
- Internal confidence scores are not a brand promise. The default exit code is the brand promise.
- If your extractor does not write a per-document manifest by default, you cannot audit what it shipped.
- --strict should not be a flag. In 1.7 it becomes the default for pdfmux convert <directory>. Until then: add it manually.
If you run a RAG-backed product, a silently-missing document is not a bug — it is a wrong answer to a customer in production, with no error to investigate and no log to grep. At a 2.5 percent silent-loss rate on 10,000 documents, that is 250 confidently-wrong answers waiting to surface as a support ticket, a churn event, or, in regulated verticals, a deposition exhibit.
Install and verify in 60 seconds
Take 100 PDFs your current pipeline already processed. Run pdfmux against the same files. Diff the manifests.
```bash
pip install -U 'pdfmux[ocr]'
pdfmux doctor --check <your-pdf-directory>
pdfmux convert <your-pdf-directory> -o ./pdfmux-out/ --strict --min-confidence 0.20
jq '.summary' ./pdfmux-out/manifest.json
```
If the manifest summary shows low_lt_0.50 > 0, your batch contained documents your previous tool was almost certainly losing — and pdfmux just told you which ones.
If pdfmux and your existing extractor agree on every document, your pipeline is fine and you don’t need us. If they disagree on more than 2 percent, those are documents your RAG is answering questions about with text that isn’t there. We will tell you which pages.
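If your previous pipeline wrote one text file per PDF, the diff itself is a few lines. A sketch under that layout assumption (./old-out/<stem>.txt next to the pdfmux manifest; adjust to your own output format):

```python
# Find documents pdfmux extracted confidently but the old pipeline
# shipped as empty or skipped entirely.
import json
from pathlib import Path

manifest = json.loads(Path("./pdfmux-out/manifest.json").read_text())
for doc in manifest["documents"]:
    old = Path("./old-out") / (Path(doc["path"]).stem + ".txt")
    if doc.get("confidence", 0.0) >= 0.50 and (
        not old.exists() or not old.read_text().strip()
    ):
        print(f"MISSED by the old pipeline: {doc['path']}")
```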
The full changelog and the v1.6.0 release notes are on GitHub. The 1.7 release with calibrated default-strict ships next week.