By Nameet Potnis, founder of pdfmux. @NameetP on GitHub · pdfmux is MIT-licensed and runs locally — no API keys, no SaaS.
If you run a RAG pipeline today, somewhere between 1 and 5 percent of the documents you think you indexed contain zero extracted text. Your users are asking questions about contracts that, as far as your vector store is concerned, are blank. We know — we shipped this exact failure on our own customer’s batch last week. Here’s the audit.
| Same 433 customer PDFs | Result |
|---|---|
| Subprocess + pypdf fallback (the script most engineers write) | 16 silent failures |
| pdfmux.batch_extract() with [ocr] extra | 0 failures |
The product solved the problem. The default invocation didn’t. This post is about closing that gap — and how to verify your current pipeline isn’t open to the same failure.
TL;DR: We pointed pdfmux at a real B2B chemical distribution customer’s 433-PDF batch — a mix of digital, scanned, and partially-corrupted product data sheets. The first script we wrote shelled out to pdfmux convert in a subprocess loop and used pypdf as a fallback. It reported “412 processed, 16 failed.” Eleven of those sixteen failures had no log line at all — they were missing rows in the output CSV that nobody noticed until we audited the manifest. The fix was a one-line API change (pdfmux.extract_text(quality="standard") + pdfmux[ocr]) and zero output failures on rerun. But the lesson was bigger: the audit machinery was right internally; the CLI exit-code contract didn’t propagate the signal. We shipped 1.6.1 and 1.6.2 over the next twenty-four hours to close that gap.
Last updated: 2026-05-02. Includes pdfmux 1.6.3 audit-correctness fix.
What we ran
A real customer’s Google Drive folder. 433 unique PDFs after deduplicating by filename stem. Mixed origin: some scanned on phone cameras, some printed and re-scanned, some downloaded as email attachments, some clean digital exports. Languages: English plus a handful of mixed Arabic-English documents. Document classes:
| Type | Count |
|---|---|
| TDS (technical data sheets) | 253 |
| PDS (product data sheets) | 80 |
| MSDS (material safety) | 66 |
| COA (certificates of analysis) | 11 |
| SDS (safety data sheets) | 11 |
| OTHER | 7 |
| PO (purchase orders) | 5 |
| Total | 433 |
We went through three script iterations across the day. v1 shelled out to pdfmux convert in a subprocess loop, with pypdf as a fallback for files where the CLI didn't return enough text. v3, a partial pivot to calling pymupdf4llm directly, failed on a missing dependency and was abandoned. v4 used the pdfmux Python API.
Test this against your own pipeline in 10 minutes:

```bash
pip install 'pdfmux[ocr]' && pdfmux doctor --check <your-inbox-dir>
```

doctor samples 10 PDFs and tells you what fraction of your batch is scanned, truncated, or non-Latin. If your current extractor doesn't ship OCR by default and your batch is more than 10 percent scanned, you are losing those documents silently right now.
What v1 got wrong
We built the v1 script with the structure most people would write: subprocess the CLI, fall back to a different library if the CLI doesn’t produce enough output, log the failures.
```python
# v1 — what we wrote first. Don't do this.
import subprocess
import pypdf

result = subprocess.run(
    ["pdfmux", "convert", str(pdf_path)],
    capture_output=True, text=True, timeout=60,
)
if result.returncode == 0 and len(result.stdout) > 100:
    text = result.stdout
else:
    # Fallback: pypdf (this fallback is itself a bug — see failure #2)
    reader = pypdf.PdfReader(str(pdf_path))
    text = "\n\n".join(p.extract_text() for p in reader.pages if p.extract_text())
```
When the script finished it printed:
```
Done: 412 processed, 5 skipped (cached), 16 failed
```
Sixteen failures, across four categories: the first three lost documents outright; the fourth is an observability gap we hit later in the v4 run.
1. Eleven scanned PDFs that returned zero characters (no log line)
Most directories of real-world PDFs include at least one scan. In this batch they were field reports — phone-camera photos of forms, faxed safety data sheets, printed-then-re-scanned MSDS pages. We had installed pdfmux without the [ocr] extra. PyMuPDF returns zero characters on a scanned page because there is no embedded text layer to read. The CLI exited 0 with nearly empty stdout; our len(stdout) > 100 guard rejected it; the pypdf fallback also returned zero. The script logged a [SKIP] marker that never made it into the production log, and moved on.
The script was technically correct: it tried, it failed, it logged. But the output CSV shipped with eleven products missing and the operator didn’t notice for days.
Eleven of 433 documents — 2.5 percent — left our pipeline with zero text and zero log line. In RAG terms: 2.5 percent of this customer’s product catalog became unanswerable, with no error to alert on. The internal audit data scored every one of those eleven pages at 0.0 confidence. The signal was there. The CLI surface buried it.
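If you want to catch this class of document before the batch runs, the pre-flight check is cheap. A minimal sketch using PyMuPDF directly (the fitz module, the same engine pdfmux wraps; this is not a pdfmux API):

```python
# Pre-flight: flag PDFs with no embedded text layer (likely scans).
import fitz  # pip install pymupdf
from pathlib import Path

for path in Path("./inbox").glob("*.pdf"):
    with fitz.open(path) as doc:
        chars = sum(len(page.get_text()) for page in doc)
    if chars == 0:
        # Zero extractable characters across every page: without OCR,
        # this document leaves the pipeline empty.
        print(f"NEEDS OCR: {path.name}")
```

pdfmux doctor --check (below) does a sampled version of the same triage for you.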
This is the failure mode pdfmux’s brand explicitly positions against — “you will never get silent garbage. pdfmux tells you what extracted, how confidently, and which pages need attention.”
2. Four “Stream has ended unexpectedly” failures from pypdf
```
[WARN] pypdf failed for INV-01_*.pdf: Stream has ended unexpectedly
[WARN] pypdf failed for PDS-14_*.pdf: Stream has ended unexpectedly
[WARN] pypdf failed for PO-01_*.pdf: Stream has ended unexpectedly
[WARN] pypdf failed for SDS-03_*.pdf: Stream has ended unexpectedly
```
These were truncated PDFs — email attachments saved partially, or shared through a tool that cut off the trailing bytes. pypdf rejects them because the xref table is corrupt. For these four files the pdfmux subprocess path had already failed silently, so the same guard from #1 routed them to the fallback, and the pypdf fallback then failed loudly. Both paths returned nothing, and the script shipped a CSV without those four products.
We tested the same four PDFs in v4 with pdfmux.extract_text directly. PyMuPDF’s xref repair handled all four cleanly — the bytes that were there got extracted, the missing bytes got dropped.
If your pipeline already falls back to pypdf when the primary extractor returns short output, you have this exact bug. PyMuPDF (what pdfmux runs by default) reads a strict superset of what pypdf can — so the fallback path you wrote to be safer is actually rejecting recoverable documents. We did it too. Delete the fallback; the sketch below shows the asymmetry on a single file.
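A minimal reproduction, assuming one damaged file named truncated.pdf on disk (lightly truncated, like the first benchmark row below; once the header is gone, PyMuPDF errors too):

```python
# Same truncated file, both libraries. pypdf rejects the corrupt xref;
# PyMuPDF rebuilds what it can and extracts the surviving bytes.
import fitz  # PyMuPDF
import pypdf

try:
    pypdf.PdfReader("truncated.pdf")
except Exception as exc:
    print(f"pypdf: {exc}")  # e.g. "Stream has ended unexpectedly"

with fitz.open("truncated.pdf") as doc:  # attempts xref repair on open
    text = "\n\n".join(page.get_text() for page in doc)
print(f"PyMuPDF recovered {len(text)} chars")
```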
3. One JSON parse error in the downstream extraction step
```
[ERROR] Structure extraction failed for *.pdf: Extra data: line 22 column 1 (char 464)
```
Our pipeline ran a downstream LLM step that took pdfmux’s markdown output and produced structured JSON. For one document, the LLM returned what looked like JSON but had trailing content after the closing brace. Not a pdfmux bug — but the v1 pipeline had no retry on JSON-parse failures, so a transient LLM artifact silently dropped the row. In v4 (cleaner markdown from pdfmux) the LLM’s output was consistently parseable.
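The generic hardening here is independent of the extractor: accept the first valid JSON value, ignore trailing bytes, and only retry the LLM call if even that fails. A sketch:

```python
# Tolerate trailing junk after a valid JSON object (the exact shape of
# the "Extra data: line 22 column 1" error above).
import json

def parse_first_json(raw: str) -> dict:
    # raw_decode parses one JSON value and reports where it stopped,
    # instead of demanding that the whole string be valid JSON.
    obj, end = json.JSONDecoder().raw_decode(raw.lstrip())
    return obj  # anything past `end` is the trailing artifact; log it if useful
```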
4. RapidOCR warnings without context
In the v4 run, RapidOCR printed eight lines that looked like this:
```
[WARNING] 2026-04-30 17:23:16,370 [RapidOCR] main.py:132: The text detection result is empty
```
No file. No page number. No way to know whether it was “this is a genuinely blank page” (fine) or “OCR gave up on a page that needed re-extraction” (bad). The signal was there in the upstream library; the surface was wrong.
(Also worth noting: pdfmux 1.5 printed Failed to load ML heading model: No module named 'sklearn' 24+ times during the run. It was harmless noise on every batch — the heuristic fallback ran and produced correct headings. Removing it in 1.6.1 deleted 250 lines of code and 24 stderr prints per run.)
Where this fails on other extractors too
The retro is easy to read as “pdfmux had a bug, pdfmux fixed it.” That’s local. The category-wide picture is worse.
We re-ran the same five failure modes against pypdf 6.10.2 on the public eval set we built for 1.6.2’s regression tests (50 labeled fixtures spanning the same failure categories — 0-byte files, HTML masquerading as PDF, truncated PDFs at two severity levels, image-only PDFs, blank pages). Reproducible at eval/benchmark_pypdf.py.
| Failure category | n | pypdf recovered | pdfmux (PyMuPDF) recovered |
|---|---|---|---|
| Truncated PDF, 70% bytes intact (xref damaged) | 5 | 0 of 5 | 5 of 5 (218 chars, conf 1.00) |
| Truncated PDF, 15% bytes intact (header gone) | 4 | 0 of 4 | 0 of 4 (errors) |
| Image-only PDF (no text layer, no OCR) | 5 | 0 of 5 | 0 of 5 (conf 0.00) |
| Blank PDF (one valid empty page) | 4 | 0 of 4 | 0 of 4 (conf 0.00) |
| HTML renamed to .pdf | 5 | 0 of 5 | 0 of 5 (37 chars, conf 0.73 — flagged) |
| 0-byte file | 5 | 0 of 5 (raises) | 0 of 5 (raises FileError) |
The first row is the load-bearing one. PyMuPDF recovers 5 of 5 lightly-truncated PDFs; pypdf recovers 0 of 5. Same files, opposite outcomes. That’s the “strict superset” claim made concrete: any pipeline that uses pypdf as a fallback for “primary extractor returned short output” turns recoverable documents into failures.
The other rows are the silent-failure baseline. Both extractors return zero characters on image-only, blank, and severely-truncated PDFs at quality=fast (no OCR). The difference is that pdfmux returns a 0.0 confidence score the caller can gate on; pypdf returns an empty string and raises nothing, no signal at all. With pdfmux[ocr] and quality=standard, the image-only row flips to 5 of 5 recovered for pdfmux. The pypdf column stays 0 of 5 — pypdf doesn't ship OCR.
We did not run cloud APIs (LlamaParse, Unstructured Cloud) on this eval set — that’s a follow-up benchmark — but those tools also do not emit per-document confidence in their public response, so the “silent failure” surface is the same architectural gap regardless of extraction quality.
What we changed
Everything below shipped in pdfmux 1.6.1, 1.6.2, and 1.6.3 over the next forty-eight hours. Every change is additive — no breaking defaults yet. The breaking-default change comes in 1.7 once we calibrate confidence thresholds against a labeled eval set.
--strict and --min-confidence flags
The most important change. The audit machinery already produced accurate per-document confidence; it just needed a CLI surface.
```bash
# Old behavior (still default in 1.6.x for backwards compat):
pdfmux convert ./inbox/ -o ./output/
# Done: 433 converted, 0 failed   ← exit code 0 even if 11 docs returned 0% confidence

# New behavior — fail loud in CI:
pdfmux convert ./inbox/ -o ./output/ --strict --min-confidence 0.20
# pdfmux WARNING: scan_001.pdf confidence 0.00 (below 0.50 threshold)
# pdfmux ERROR: --strict gate failed — 11 of 433 documents below --min-confidence 0.20
# exit 3
```
Any document with confidence under 0.50 now also emits a stderr WARNING regardless of --strict, so silent low-quality batches are visible in CI logs without changing the exit code.
Exit codes are now documented (a CI sketch follows the list):

- 0 — success
- 1 — extraction or runtime error
- 2 — usage error (bad arguments, file not found)
- 3 — strict gate failed
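Branching on those codes in CI takes a few lines. A sketch in Python, using only the exit-code contract above:

```python
# CI gate: treat a failed --strict run as a hard build failure.
import subprocess
import sys

proc = subprocess.run([
    "pdfmux", "convert", "./inbox/", "-o", "./output/",
    "--strict", "--min-confidence", "0.20",
])
if proc.returncode == 3:
    sys.exit("pdfmux strict gate failed: low-confidence documents in batch")
if proc.returncode != 0:
    sys.exit(f"pdfmux error (exit {proc.returncode})")
```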
manifest.json per batch
Every pdfmux convert <directory> run now writes a manifest.json to the output directory. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
```json
{
  "schema_version": "1.0",
  "summary": {
    "total": 433, "processed": 433, "failed": 0,
    "low_confidence": 0,
    "min_confidence_threshold": 0.20,
    "confidence_breakdown": {
      "high_ge_0.80": 421,
      "medium_0.50_to_0.80": 12,
      "low_lt_0.50": 0
    }
  },
  "documents": [
    {"path": "scan_001.pdf", "status": "ok",
     "confidence": 0.92, "extractor": "rapidocr", ...}
  ]
}
```
If v1 had had this, the eleven silent failures would have been visible in the very first run.
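It also makes the after-the-fact audit a five-line script. A sketch against the schema above (using .get() in case a failed entry omits a field):

```python
# Audit a finished batch straight from its manifest.
import json
from pathlib import Path

manifest = json.loads(Path("./output/manifest.json").read_text())
for doc in manifest["documents"]:
    if doc.get("status") != "ok" or doc.get("confidence", 0.0) < 0.50:
        print(f"REVIEW {doc['path']}: status={doc.get('status')} "
              f"confidence={doc.get('confidence', 0.0):.2f}")
```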
pdfmux.batch_extract() as a public API
The Python API is the right path for batch processing — it avoids three subprocess spawns per file, handles non-ASCII filenames without shell-quoting issues, and is significantly faster.
```python
import pdfmux
from pathlib import Path

pdfs = list(Path("./inbox").glob("*.pdf"))
for path, result in pdfmux.batch_extract(pdfs, quality="standard", workers=4):
    if isinstance(result, Exception):
        print(f"FAILED {path.name}: {result}")
        continue
    if result.confidence < 0.50:
        print(f"REVIEW {path.name} ({result.confidence:.2f})")
    else:
        print(f"OK {path.name} ({result.confidence:.2f})")
```
batch_extract was always there internally as pdfmux.pipeline.process_batch. We just hadn't exposed it. Surfacing it was a 15-line patch and a documentation change. It's the right thing to reach for in a Python script.
pdfmux doctor --check <directory>
The antidote to the "I forgot to install pdfmux[ocr]" failure mode. It samples the directory before you run the batch, classifies the PDFs, and tells you which extras are missing for this input:
```
$ pdfmux doctor --check ./inbox/
Batch check: 433 PDFs in ./inbox/ (sampling 10)

Page type    Sample count    Estimated batch share
─────────────────────────────────────────────────
digital      7               70% (~303 of 433)
scanned      3               30% (~130 of 433)

Recommendations for this batch
⚠ ~130 of 433 PDFs look scanned. Without OCR they will return empty text.
  Install: pip install 'pdfmux[ocr]'
```
doctor --check runs in seconds (it samples, doesn’t extract) and surfaces the install gap before you waste a 50-minute batch.
RapidOCR warnings translated
The bare upstream warnings are gone. They now route through pdfmux’s logger with file + page context attached:
```
pdfmux.extractors.rapid_ocr INFO: OCR found no text on page 4 of scan_017.pdf
```
You can now tell which document and page an empty OCR result came from.
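And because it is a named logger rather than a bare print, you can route or silence it like any other Python logging stream. A sketch, assuming the logger name shown in the line above:

```python
# Route pdfmux's OCR extractor messages into a dedicated audit log.
import logging

ocr_log = logging.getLogger("pdfmux.extractors.rapid_ocr")
ocr_log.setLevel(logging.INFO)
ocr_log.addHandler(logging.FileHandler("ocr_audit.log"))
```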
Eleven regression tests for the failure modes
Shipped in 1.6.2. tests/test_real_world_failures.py pins the failure categories from the eval set above as behavioral contracts:
```python
import pdfmux

class TestZeroBytePdf:
    def test_zero_byte_via_batch_yields_exception(self, zero_byte_pdf):
        """A bad file must yield an exception in batch_extract,
        not silently appear successful."""
        results = list(pdfmux.batch_extract([zero_byte_pdf], quality="fast"))
        assert isinstance(results[0][1], Exception)
```
670 tests passing total, up from 659 in 1.6.0. None of these tests required behavior changes — pdfmux handled every fixture correctly already. We just hadn’t pinned the contracts.
The lesson, in three lines
- Internal confidence scores are not a brand promise. The default exit code is the brand promise.
- If your extractor does not write a per-document manifest by default, you cannot audit what it shipped.
- --strict should not be a flag. In 1.7 it becomes the default for pdfmux convert <directory>. Until then: add it manually.
If you run a RAG-backed product, a silently-missing document is not a bug — it is a wrong answer to a customer in production, with no error to investigate and no log to grep. At a 2.5 percent silent-loss rate on 10,000 documents, that is 250 confidently-wrong answers waiting to surface as a support ticket, a churn event, or, in regulated verticals, a deposition exhibit.
Install and verify in 60 seconds
Take 100 PDFs your current pipeline already processed. Run pdfmux against the same files. Diff the manifests.
```bash
pip install -U 'pdfmux[ocr]'
pdfmux doctor --check <your-pdf-directory>
pdfmux convert <your-pdf-directory> -o ./pdfmux-out/ --strict --min-confidence 0.20
jq '.summary' ./pdfmux-out/manifest.json
```
If the manifest summary shows low_lt_0.50 > 0, your batch contained documents your previous tool was almost certainly losing — and pdfmux just told you which ones.
If pdfmux and your existing extractor agree on every document, your pipeline is fine and you don’t need us. If they disagree on more than 2 percent, those are documents your RAG is answering questions about with text that isn’t there. We will tell you which pages.
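If your previous pipeline wrote one text file per PDF, the diff itself is a few lines. A sketch under that layout assumption (./old-out/<stem>.txt next to the pdfmux manifest; adjust to your own output format):

```python
# Find documents pdfmux extracted confidently but the old pipeline
# shipped as empty or skipped entirely.
import json
from pathlib import Path

manifest = json.loads(Path("./pdfmux-out/manifest.json").read_text())
for doc in manifest["documents"]:
    old = Path("./old-out") / (Path(doc["path"]).stem + ".txt")
    if doc.get("confidence", 0.0) >= 0.50 and (
        not old.exists() or not old.read_text().strip()
    ):
        print(f"MISSED by the old pipeline: {doc['path']}")
```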
The full changelog and the v1.6.0 release notes are on GitHub. The 1.7 release with calibrated default-strict ships next week.