Direct answer: A scanned PDF is one where pages are images of text rather than encoded text characters. Detect it in Python with three lightweight checks per page — extract text length, count image regions, and inspect text-block coverage — then route only the scanned pages through OCR. The full check costs under 5 milliseconds per page and avoids running OCR on 90% of documents that don’t need it.
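    import fitz  # PyMuPDF

    def is_page_scanned(page, min_text_chars=50, min_text_area_ratio=0.05):
        # Enough extractable text: treat the page as digital
        text = page.get_text("text")
        if len(text.strip()) >= min_text_chars:
            return False
        # Otherwise measure how much of the page is covered by text blocks
        blocks = page.get_text("blocks")
        text_area = sum((b[2] - b[0]) * (b[3] - b[1]) for b in blocks if b[6] == 0)
        page_area = page.rect.width * page.rect.height
        if page_area == 0:
            return False  # degenerate page; nothing to OCR
        return (text_area / page_area) < min_text_area_ratio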
Why this is a real problem
Every PDF extraction pipeline runs into the same failure mode eventually: a user uploads a 40-page document, the parser returns 12 characters of text, and nobody can tell whether the PDF is corrupt, password-protected, or just scanned. In our 200-PDF real-world benchmark, 23% of files contained at least one scanned page. Documents from law firms, government archives, and shipping logistics in particular skew heavily toward scans.
The naive fix, running OCR on every page, is expensive. RapidOCR on CPU takes 0.5 to 2 seconds per page. Tesseract is slower. Cloud OCR APIs charge around $1.50 per 1,000 pages. On a 100-page document where only two pages are scanned, blanket OCR spends 98 unnecessary OCR passes (roughly 49 to 196 seconds at the RapidOCR speeds above) and changes accuracy by zero.
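The arithmetic behind that estimate is easy to reproduce. The sketch below uses only the per-page figures quoted above; the function names are made up for illustration.

```python
def ocr_seconds(pages_to_ocr: int, sec_per_page: float) -> float:
    """Total OCR time for a number of pages at a given per-page speed."""
    return pages_to_ocr * sec_per_page


def cloud_cost_usd(pages_to_ocr: int, usd_per_1000: float = 1.50) -> float:
    """Cloud OCR cost at a flat per-1,000-page rate."""
    return pages_to_ocr * usd_per_1000 / 1000


total_pages, scanned_pages = 100, 2

# Range quoted above for RapidOCR on CPU: 0.5 to 2 seconds per page
for sec_per_page in (0.5, 2.0):
    blanket = ocr_seconds(total_pages, sec_per_page)
    routed = ocr_seconds(scanned_pages, sec_per_page)
    print(
        f"{sec_per_page}s/page: blanket={blanket:.0f}s "
        f"routed={routed:.0f}s wasted={blanket - routed:.0f}s"
    )
```

At either end of the range the wasted time is 98 times the per-page cost, which is why the saving scales with document length rather than with OCR speed.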
The correct pattern is to detect first, route second. This post covers the three detection methods that actually work, the failure cases each one misses, and the routing pattern we use in pdfmux to keep OCR cost near zero without losing accuracy on the pages that need it.
What “scanned” actually means
A born-digital PDF stores characters as text objects with font, position, and encoding metadata. Selecting text in a PDF reader and copying it works because the characters are first-class data.
A scanned PDF stores pages as embedded raster images — usually JPEG or JPEG2000 at 200-300 DPI. There are no text objects. Selecting text returns nothing because there is no text to select; you are looking at pixels of text.
In practice you encounter four states per page:
| State | Text objects | Image objects | What works |
|---|---|---|---|
| Born-digital | Yes | No | Direct extraction |
| Born-digital + image | Yes | Yes | Direct extraction |
| Scanned + OCR layer | Yes (low-quality) | Yes (full page) | Either; OCR quality varies |
| Pure scan | No | Yes (full page) | OCR only |
The third state is the dangerous one. A pure scan is obvious — text extraction returns an empty string. A scan with an invisible OCR layer added at scan time looks like a digital PDF to most extractors, but the OCR layer was often generated by a 2010-era engine and contains character substitution errors (“rn” for “m”, “l” for “1”) that propagate downstream.
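The table can be read as a small lookup. The sketch below is only an illustration of that lookup: the enum and function names are hypothetical, and it assumes that any page with a full-page image plus text is a scan with an OCR layer, an assumption the later sections refine.

```python
from enum import Enum


class PageState(str, Enum):
    BORN_DIGITAL = "born-digital"
    BORN_DIGITAL_WITH_IMAGE = "born-digital + image"
    SCANNED_WITH_OCR_LAYER = "scanned + OCR layer"
    PURE_SCAN = "pure scan"


def page_state(has_text: bool, has_image: bool, image_is_full_page: bool) -> PageState:
    """Map three yes/no observations about a page to one of the four states."""
    if has_image and image_is_full_page:
        # Full-page image: a scan, with or without an OCR text layer
        return PageState.SCANNED_WITH_OCR_LAYER if has_text else PageState.PURE_SCAN
    if has_text and has_image:
        return PageState.BORN_DIGITAL_WITH_IMAGE
    # No full-page image: treat as born-digital, even when the page is empty
    return PageState.BORN_DIGITAL


print(page_state(has_text=True, has_image=True, image_is_full_page=True).value)
```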
Method 1: Text length per page
The simplest check. Extract text with PyMuPDF and measure the result.
    import fitz

    def detect_by_text_length(pdf_path, threshold=50):
        doc = fitz.open(pdf_path)
        results = []
        for i, page in enumerate(doc):
            text = page.get_text("text").strip()
            results.append({
                "page": i,  # zero-based page index
                "chars": len(text),
                "likely_scanned": len(text) < threshold,
            })
        return results
This catches pure scans reliably. A page that contains 600 words of text but extracts as 4 characters is almost certainly a scan.
Where it fails: scans with an invisible OCR layer return reasonable text length even though the underlying page is a raster image. Slide decks return short text per page even when fully digital. Title pages, dividers, and image-heavy figures pages return short text on born-digital PDFs.
Use this method as the first filter, not the final answer.
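Since this check is only a first filter, one option is to carry its candidates forward to the later checks instead of deciding right away. The sketch below collapses flagged page indices into contiguous runs; `pages_to_ranges` is a hypothetical helper, and the sample `results` list is made up in the shape that `detect_by_text_length` returns.

```python
def pages_to_ranges(page_numbers):
    """Collapse page indices into inclusive (start, end) runs."""
    ranges = []
    for n in sorted(set(page_numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current run
        else:
            ranges.append((n, n))  # start a new run
    return ranges


# Made-up sample in the shape returned by detect_by_text_length
results = [
    {"page": 0, "chars": 0, "likely_scanned": True},
    {"page": 1, "chars": 4, "likely_scanned": True},
    {"page": 2, "chars": 1840, "likely_scanned": False},
    {"page": 3, "chars": 12, "likely_scanned": True},
]
candidates = [r["page"] for r in results if r["likely_scanned"]]
print(pages_to_ranges(candidates))  # [(0, 1), (3, 3)]
```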
Method 2: Image coverage
A scanned page is almost always backed by a single full-page image. Born-digital pages either contain no images or contain small embedded figures.
    import fitz

    def detect_by_image_coverage(pdf_path, coverage_threshold=0.5):
        doc = fitz.open(pdf_path)
        results = []
        for i, page in enumerate(doc):
            page_area = page.rect.width * page.rect.height
            image_area = 0
            for img in page.get_images(full=True):
                xref = img[0]
                # One image can be placed on the page more than once
                for rect in page.get_image_rects(xref):
                    image_area += rect.width * rect.height
            coverage = image_area / page_area if page_area else 0
            results.append({
                "page": i,
                "image_coverage": round(coverage, 3),
                "likely_scanned": coverage > coverage_threshold,
            })
        return results
A page where an image covers more than half the page area is suspicious. Combined with Method 1, this catches the invisible-OCR-layer case that Method 1 alone misses: the extracted text length looks reasonable, but the visible content is still a raster image.
Where it fails: infographic-heavy pages and full-page charts trigger false positives. A magazine layout with a full-bleed photograph and a text caption looks identical to a scanned page by this measure.
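A separate caveat: summing rectangle areas counts overlapping images twice and includes any part of an image that extends past the page, so the ratio can exceed 1. The sketch below is one possible way to compute the covered fraction as a union instead. It works on plain `(x0, y0, x1, y1)` tuples rather than PyMuPDF objects, and `union_coverage` is a hypothetical helper, not part of any library.

```python
def union_coverage(rects, page_w, page_h):
    """Fraction of the page covered by the union of (x0, y0, x1, y1) rectangles."""
    page_area = page_w * page_h
    if page_area <= 0:
        return 0.0
    # Clip every rectangle to the page and drop empty ones
    clipped = []
    for x0, y0, x1, y1 in rects:
        x0, y0 = max(0.0, x0), max(0.0, y0)
        x1, y1 = min(page_w, x1), min(page_h, y1)
        if x1 > x0 and y1 > y0:
            clipped.append((x0, y0, x1, y1))
    # Coordinate compression: every grid cell is either fully covered or not
    xs = sorted({x for r in clipped for x in (r[0], r[2])})
    ys = sorted({y for r in clipped for y in (r[1], r[3])})
    covered = 0.0
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            if any(r[0] <= xa and xb <= r[2] and r[1] <= ya and yb <= r[3]
                   for r in clipped):
                covered += (xb - xa) * (yb - ya)
    return covered / page_area


# Two identical full-page images: a naive sum gives 2.0, the union gives 1.0
full_page = (0, 0, 612, 792)
print(union_coverage([full_page, full_page], 612, 792))
```

The grid scan is quadratic in the number of rectangles per axis, which is fine for the handful of images a typical page carries.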
Method 3: Text-block area ratio
The most reliable single signal. Born-digital text is stored as positioned text blocks. Measure what fraction of the page area is covered by actual text blocks, not images.
    import fitz

    def detect_by_text_blocks(pdf_path, area_threshold=0.05):
        doc = fitz.open(pdf_path)
        results = []
        for i, page in enumerate(doc):
            page_area = page.rect.width * page.rect.height
            text_area = 0
            for block in page.get_text("blocks"):
                x0, y0, x1, y1, _, _, block_type = block
                if block_type == 0:  # 0 = text block, 1 = image block
                    text_area += (x1 - x0) * (y1 - y0)
            ratio = text_area / page_area if page_area else 0
            results.append({
                "page": i,
                "text_area_ratio": round(ratio, 3),
                "likely_scanned": ratio < area_threshold,
            })
        return results
A digital page typically has 20-60% of its area covered by text blocks. A scanned page has under 5% — usually zero, occasionally trace amounts from a scanner-added OCR layer in the page margins.
This method correctly classifies the cases that defeat Methods 1 and 2:
- Scanned PDF with hidden OCR layer: text-block area is still tiny because the OCR layer sits behind the image, not as proper text blocks
- Born-digital with full-page image background: text-block area is large because real text sits in front of the image
- Sparse digital page (a title slide): text-block area is small but non-zero, distinguishable from a true scan
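To get a feel for these numbers without opening a PDF, the same ratio can be computed on hand-made block tuples. The sketch below assumes the 7-field block layout unpacked in the Method 3 code; the sample blocks are illustrative values, not measurements from real documents.

```python
def text_area_ratio(blocks, page_w, page_h):
    """Share of the page covered by text blocks.

    Each block is a 7-tuple (x0, y0, x1, y1, content, block_no, block_type),
    where block_type 0 is text and 1 is image.
    """
    page_area = page_w * page_h
    if page_area <= 0:
        return 0.0
    text_area = sum(
        (x1 - x0) * (y1 - y0)
        for x0, y0, x1, y1, _content, _no, block_type in blocks
        if block_type == 0
    )
    return text_area / page_area


PAGE_W, PAGE_H = 612, 792  # US Letter in points

# Hand-made blocks (illustrative, not taken from a real PDF)
dense_page = [(72, 72, 540, 600, "body text ...", 0, 0)]
title_slide = [(156, 350, 456, 400, "Quarterly Review", 0, 0)]
pure_scan = [(0, 0, 612, 792, "<image>", 0, 1)]

for name, blocks in [("dense", dense_page), ("title", title_slide), ("scan", pure_scan)]:
    print(name, round(text_area_ratio(blocks, PAGE_W, PAGE_H), 3))
```

In this made-up sample the title slide lands at roughly 3%, which is non-zero but still below the 0.05 threshold, so on its own the ratio would flag it. The absence of a large image is what separates it from a scan.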
Combining the three: a routing function
The robust answer is to combine all three signals and route on the combination, not any single check.
    import fitz
    from enum import Enum

    class PageType(str, Enum):
        DIGITAL = "digital"
        SCANNED = "scanned"
        HYBRID = "hybrid"  # OCR layer over image; quality unknown

    def classify_page(page):
        text = page.get_text("text").strip()
        char_count = len(text)
        page_area = page.rect.width * page.rect.height
        if page_area == 0:
            return PageType.DIGITAL
        text_area = sum(
            (b[2] - b[0]) * (b[3] - b[1])
            for b in page.get_text("blocks")
            if b[6] == 0
        )
        image_area = 0
        for img in page.get_images(full=True):
            for rect in page.get_image_rects(img[0]):
                image_area += rect.width * rect.height
        text_ratio = text_area / page_area
        image_ratio = image_area / page_area
        # Pure scan: no real text blocks, page mostly covered by an image
        if text_ratio < 0.02 and image_ratio > 0.5:
            return PageType.SCANNED
        # Scan with OCR layer: some text characters present but text blocks are tiny
        # and an image covers most of the page
        if char_count > 50 and text_ratio < 0.05 and image_ratio > 0.5:
            return PageType.HYBRID
        return PageType.DIGITAL

    def route(pdf_path):
        doc = fitz.open(pdf_path)
        plan = []
        for i, page in enumerate(doc):
            kind = classify_page(page)
            plan.append((i, kind))
        return plan
The HYBRID case is the one most pipelines miss. The page has extractable text, so naive pipelines accept it. But the text came from a low-quality OCR layer baked in at scan time, so accuracy is poor and you cannot tell from the text alone. Re-running OCR on these pages with a modern engine often improves character accuracy by 10-15 percentage points.
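The thresholds are easier to reason about, and to unit-test, when the decision is separated from the PDF parsing. The sketch below copies the thresholds and rule order from `classify_page` into a function that takes plain numbers; `classify_from_signals` and the sample signal values are hypothetical.

```python
def classify_from_signals(char_count, text_ratio, image_ratio):
    """Same thresholds and rule order as classify_page, on plain numbers."""
    if text_ratio < 0.02 and image_ratio > 0.5:
        return "scanned"
    if char_count > 50 and text_ratio < 0.05 and image_ratio > 0.5:
        return "hybrid"
    return "digital"


# Made-up signal values: (char_count, text_ratio, image_ratio)
cases = {
    "pure scan": (0, 0.0, 0.98),
    "scan + thin OCR layer": (900, 0.03, 0.98),
    "digital text over full-page background": (1500, 0.45, 1.0),
    "title slide, no image": (20, 0.03, 0.0),
}
for name, signals in cases.items():
    print(name, "->", classify_from_signals(*signals))
```

Because the first rule wins, a page with OCR text whose text ratio falls below 0.02 comes back as scanned rather than hybrid. Both labels lead to OCR, so this only matters if the two are reported separately.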
Benchmark: detection accuracy on 200 real PDFs
We labelled every page in our 200-PDF benchmark by hand (3,847 pages total) and measured each detection method against the ground truth:
| Method | Precision | Recall | F1 | False positives |
|---|---|---|---|---|
| Text length only | 0.94 | 0.71 | 0.81 | 19 |
| Image coverage only | 0.62 | 0.96 | 0.75 | 187 |
| Text-block ratio only | 0.97 | 0.93 | 0.95 | 11 |
| Combined (all three) | 0.99 | 0.97 | 0.98 | 4 |
Image coverage alone produces too many false positives: infographic-heavy slide decks get flagged as scans. Text length alone misses the hybrid case. The combined check is the only one with both precision and recall at or above 0.97.
The four remaining false positives in the combined check were all the same pathological case: a one-page receipt scanned with a fixed-position OCR layer in the corner. These are rare enough that we route them through OCR anyway as a safety net — the extra cost is one OCR pass per ~1,000 pages.
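The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does that for the table above; `precision_recall` is included for anyone starting from raw counts of true positives, false positives, and false negatives, which the table does not list in full.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the benchmark table
table = {
    "Text length only": (0.94, 0.71),
    "Image coverage only": (0.62, 0.96),
    "Text-block ratio only": (0.97, 0.93),
    "Combined (all three)": (0.99, 0.97),
}
for method, (p, r) in table.items():
    print(f"{method}: F1 = {f1_score(p, r):.2f}")
```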
What to do once a page is classified
The whole point of detection is routing. Once you know which pages are scanned, run OCR on those and leave the digital pages alone.
    import fitz
    from rapidocr_onnxruntime import RapidOCR

    # classify_page and PageType come from the routing code above
    ocr = RapidOCR()

    def extract(pdf_path):
        doc = fitz.open(pdf_path)
        output = []
        for i, page in enumerate(doc):
            kind = classify_page(page)
            if kind == PageType.DIGITAL:
                text = page.get_text("text")
            else:
                # Render the page to an image, then OCR
                pix = page.get_pixmap(dpi=300)
                result, _ = ocr(pix.tobytes("png"))
                # Each result entry is (box, text, score); result is None when nothing is found
                text = "\n".join(r[1] for r in result) if result else ""
            output.append({"page": i, "type": kind, "text": text})
        return output
On the 200-PDF benchmark this approach runs in 27 seconds total versus 89 seconds for blanket OCR — a 3.3x speedup with identical extraction quality. The full breakdown is in our PDF extractor benchmark post.
Pitfalls to watch for
A few edge cases that cost us debugging time:
- Cropped pages with CropBox: PyMuPDF’s page.rect returns the cropped area, not the media box. If a scanned page has a heavily cropped CropBox, your image-coverage ratio gets distorted. Use page.mediabox if you need the full physical page area.
- Rotated pages: A page with 90 or 270 degree rotation has swapped width and height. PyMuPDF handles this correctly in page.rect, but if you read raw rectangle coordinates from page.get_text("dict") you need to apply the rotation matrix.
- Form XObjects: Some PDFs store page content inside a referenced Form XObject rather than directly on the page. Text-block extraction follows the reference correctly; image-area calculation can miss it. If your image-coverage ratio comes out to zero on a page that visually looks scanned, this is usually why.
- Very large image dimensions: A few scanners produce 8000x10000 pixel images downsampled at display time. Iterating images with page.get_images(full=True) is fine, but if you ever load the pixmap to inspect it, do so at reduced DPI to avoid memory blow-up.
- Encrypted PDFs: A password-protected PDF returns empty text on every page. The text-length check flags every page as scanned, which is wrong. Check doc.is_encrypted before classification.
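The encrypted-PDF guard can be sketched as a small pre-check. The function below is an assumption about how such a guard might look, not code from pdfmux. It expects an object that behaves like a PyMuPDF Document, where `is_encrypted` stays true until `authenticate()` succeeds and `authenticate()` returns a non-zero value on success. The stub class exists only so the example runs without a real PDF.

```python
def ready_for_classification(doc, password=None):
    """Return True if page text can be read, False if the document is still locked."""
    if not doc.is_encrypted:
        return True
    if password is None:
        return False
    # authenticate() returns 0 on failure and a positive value on success
    return bool(doc.authenticate(password))


class LockedDocStub:
    """Stand-in for a password-protected document (illustration only)."""

    def __init__(self, password):
        self._password = password
        self.is_encrypted = True

    def authenticate(self, password):
        ok = password == self._password
        if ok:
            self.is_encrypted = False
        return 1 if ok else 0


doc = LockedDocStub("s3cret")
print(ready_for_classification(doc))            # locked, no password supplied
print(ready_for_classification(doc, "s3cret"))  # unlocked after authentication
```

Pages from a document that fails this check are better reported as unreadable than classified as scanned.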
Self-healing as a safety net
Detection is good but not perfect. A robust pipeline runs a self-healing quality audit after extraction: if a page classified as digital extracts to garbled or near-empty text, fall back to OCR on that page only. This catches the small percentage of misclassifications without paying the OCR cost on every page.
    # quality_score and ocr_page are helpers that are not defined in this post
    def extract_with_fallback(page):
        if classify_page(page) == PageType.DIGITAL:
            text = page.get_text("text")
            if quality_score(text) >= 0.7:
                return text
        # Either classified as scanned, or digital extraction was low quality
        return ocr_page(page)
The combination of upfront classification and post-extraction audit is what brings effective accuracy to within 0.5% of running OCR on everything, at less than a third of the runtime.
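The post does not define `quality_score`, so the version below is a guess at a minimal heuristic, not the audit pdfmux uses. It scores text by the share of characters that are letters, digits, common punctuation, or whitespace, and returns 0.0 for near-empty text. The `min_chars` parameter is an assumption.

```python
import string

# Characters expected in ordinary extracted text (ASCII subset; letters in
# other scripts are accepted through str.isalpha below)
_EXPECTED = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")


def quality_score(text, min_chars=20):
    """Rough 0..1 score: share of expected characters, 0.0 for near-empty text."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return 0.0
    expected = sum(1 for ch in stripped if ch in _EXPECTED or ch.isalpha())
    return expected / len(stripped)


print(quality_score("The quick brown fox jumps over the lazy dog."))
print(quality_score("\ufffd" * 30 + "abc"))  # mostly replacement characters
```

With this heuristic a legitimately sparse digital page, such as a title page, also scores 0.0 and falls through to OCR. That costs one extra OCR pass but does not lose text.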
How pdfmux handles this
pdfmux implements exactly this pattern: classify each page in under 1 millisecond using the three-signal combined check, route digital pages through PyMuPDF and scanned pages through RapidOCR, then audit every extraction and re-run OCR on any page that fails the audit. The pipeline runs entirely on CPU with no GPU and no API keys — see our GPU-free architecture post for the full design.
Install with pip install pdfmux and call pdfmux.extract("file.pdf"). The classifier, router, and self-healing loop are all internal. If you want the lower-level building blocks, the code above is a complete reference implementation; pdfmux adds only the audit loop and handling for a handful of pathological-document edge cases that took us a year to find.
Summary
Detecting scanned PDFs is a routing problem, not a yes-or-no problem. Three checks — extracted text length, image coverage, and text-block area ratio — combined into a single classifier reach 98% F1 on real-world documents. Once you can classify pages reliably, OCR cost drops by 3-5x because you only run OCR on the pages that need it. The remaining edge cases (hybrid OCR layers, cropped pages, encrypted documents) are predictable enough to handle with a small set of guard clauses plus a post-extraction quality audit. That is the entire detection-and-routing pattern, and it is what separates a pipeline that handles 90% of PDFs from one that handles 99.5%.