How to detect if a PDF is scanned (and what to do about it)

TL;DR: Three reliable ways to detect a scanned PDF in Python, plus a routing pattern that runs OCR only on the pages that need it. With code and benchmarks.

Direct answer: A scanned PDF is one where pages are images of text rather than encoded text characters. Detect it in Python with three lightweight checks per page — extract text length, count image regions, and inspect text-block coverage — then route only the scanned pages through OCR. The full check costs under 5 milliseconds per page and avoids running OCR on 90% of documents that don’t need it.

import fitz  # PyMuPDF

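def is_page_scanned(page, min_text_chars=50, min_text_area_ratio=0.05):
    # Check 1: enough extractable characters means the page is not a pure scan
    text = page.get_text("text")
    if len(text.strip()) >= min_text_chars:
        return False
    # Check 2: fraction of the page covered by text blocks (block type 0)
    blocks = page.get_text("blocks")
    text_area = sum((b[2]-b[0]) * (b[3]-b[1]) for b in blocks if b[6] == 0)
    page_area = page.rect.width * page.rect.height
    if page_area == 0:
        return False  # degenerate page; nothing to OCR
    return (text_area / page_area) < min_text_area_ratio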

Why this is a real problem

Every PDF extraction pipeline runs into the same failure mode eventually: a user uploads a 40-page document, the parser returns 12 characters of text, and nobody can tell whether the PDF is corrupt, password-protected, or just scanned. In our 200-PDF real-world benchmark, 23% of files contained at least one scanned page. Documents from law firms, government archives, and shipping logistics in particular skew heavily toward scans.

The naive fix, running OCR on every page, is expensive. RapidOCR on CPU takes 0.5 to 2 seconds per page. Tesseract is slower. Cloud OCR APIs charge $1.50 per 1,000 pages. On a 100-page document where only two pages are scanned, blanket OCR spends 98 unnecessary OCR passes (roughly 49 to 196 seconds at RapidOCR speeds) and changes accuracy by zero.
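
The arithmetic behind that comparison can be sketched in a few lines. This is an illustrative calculation only: the helper name `wasted_ocr_seconds` is hypothetical, and the per-page timings are the RapidOCR CPU range quoted above, not new measurements.

```python
def wasted_ocr_seconds(total_pages, scanned_pages, seconds_per_page):
    """Seconds spent running OCR on pages that did not need it."""
    digital_pages = total_pages - scanned_pages
    return digital_pages * seconds_per_page

# 100-page document, 2 scanned pages, RapidOCR CPU range from the text above
low = wasted_ocr_seconds(100, 2, 0.5)
high = wasted_ocr_seconds(100, 2, 2.0)
print(f"Blanket OCR wastes between {low:.0f} and {high:.0f} seconds")
```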

The correct pattern is to detect first, route second. This post covers the three detection methods that actually work, the failure cases each one misses, and the routing pattern we use in pdfmux to keep OCR cost near zero without losing accuracy on the pages that need it.

What “scanned” actually means

A born-digital PDF stores characters as text objects with font, position, and encoding metadata. Selecting text in a PDF reader and copying it works because the characters are first-class data.

A scanned PDF stores pages as embedded raster images — usually JPEG or JPEG2000 at 200-300 DPI. There are no text objects. Selecting text returns nothing because there is no text to select; you are looking at pixels of text.

In practice you encounter four states per page:

State	Text objects	Image objects	What works
Born-digital	Yes	No	Direct extraction
Born-digital + image	Yes	Yes	Direct extraction
Scanned + OCR layer	Yes (low-quality)	Yes (full page)	Either; OCR quality varies
Pure scan	No	Yes (full page)	OCR only

The third state is the dangerous one. A pure scan is obvious — text extraction returns an empty string. A scan with an invisible OCR layer added at scan time looks like a digital PDF to most extractors, but the OCR layer was often generated by a 2010-era engine and contains character substitution errors (“rn” for “m”, “l” for “1”) that propagate downstream.

Method 1: Text length per page

The simplest check. Extract text with PyMuPDF and measure the result.

import fitz

def detect_by_text_length(pdf_path, threshold=50):
    doc = fitz.open(pdf_path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text("text").strip()
        results.append({
            "page": i,
            "chars": len(text),
            "likely_scanned": len(text) < threshold,
        })
    return results

This catches pure scans reliably. A page that contains 600 words of text but extracts as 4 characters is almost certainly a scan.

Where it fails: scans with an invisible OCR layer return reasonable text length even though the underlying page is a raster image. Slide decks return short text per page even when fully digital. Title pages, dividers, and image-heavy figure pages also return short text on born-digital PDFs.

Use this method as the first filter, not the final answer.
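
One way to consume the per-page output is to summarise it before deciding what to do next. The sketch below assumes the list-of-dicts shape returned by `detect_by_text_length` above; the helper name `summarize_text_length` and the sample data are hypothetical.

```python
def summarize_text_length(results):
    """Summarise per-page results from a text-length check.

    `results` is a list of dicts with "page", "chars", and "likely_scanned"
    keys, matching the shape produced by detect_by_text_length.
    """
    flagged = [r["page"] for r in results if r["likely_scanned"]]
    total = len(results)
    return {
        "total_pages": total,
        "flagged_pages": flagged,
        "flagged_fraction": len(flagged) / total if total else 0.0,
    }

# Hypothetical output for a four-page document
sample = [
    {"page": 0, "chars": 12, "likely_scanned": True},    # title page or scan?
    {"page": 1, "chars": 2140, "likely_scanned": False},
    {"page": 2, "chars": 0, "likely_scanned": True},
    {"page": 3, "chars": 1875, "likely_scanned": False},
]
summary = summarize_text_length(sample)
print(summary["flagged_pages"], summary["flagged_fraction"])
```

Flagged pages are candidates only; the other two methods decide which of them are real scans.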

Method 2: Image coverage

A scanned page is almost always backed by a single full-page image. Born-digital pages either contain no images or contain small embedded figures.

import fitz

def detect_by_image_coverage(pdf_path, coverage_threshold=0.5):
    doc = fitz.open(pdf_path)
    results = []
    for i, page in enumerate(doc):
        page_area = page.rect.width * page.rect.height
        image_area = 0
        for img in page.get_images(full=True):
            xref = img[0]
            for rect in page.get_image_rects(xref):
                image_area += rect.width * rect.height
        coverage = image_area / page_area if page_area else 0
        results.append({
            "page": i,
            "image_coverage": round(coverage, 3),
            "likely_scanned": coverage > coverage_threshold,
        })
    return results

A page where an image covers more than half the page area is suspicious. Combined with the character count from Method 1, this catches the invisible-OCR-layer case that Method 1 alone misses: the extracted text length looks reasonable, but a raster image still covers most of the page.

Where it fails: infographic-heavy pages and full-page charts trigger false positives. A magazine layout with a full-bleed photograph and a text caption looks identical to a scanned page by this measure.
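
A related detail: summing image rectangles can overcount when images overlap or extend past the page edge, so the ratio can exceed 1.0. The sketch below clips each rectangle to the page and caps the result. It works on plain `(x0, y0, x1, y1)` tuples so it runs without PyMuPDF; the function names are hypothetical, and capping does not remove double-counting of overlapping images, it only bounds the ratio.

```python
def clipped_area(rect, page):
    """Area of `rect` that lies inside `page`; both are (x0, y0, x1, y1)."""
    x0 = max(rect[0], page[0])
    y0 = max(rect[1], page[1])
    x1 = min(rect[2], page[2])
    y1 = min(rect[3], page[3])
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def image_coverage(image_rects, page):
    """Fraction of the page covered by images, bounded to [0, 1]."""
    page_area = (page[2] - page[0]) * (page[3] - page[1])
    if page_area <= 0:
        return 0.0
    total = sum(clipped_area(r, page) for r in image_rects)
    return min(1.0, total / page_area)

page = (0, 0, 612, 792)  # US Letter in points
# A scan image that bleeds 10 points past every edge of the page
bleed = [(-10, -10, 622, 802)]
print(image_coverage(bleed, page))
```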

Method 3: Text-block area ratio

The most reliable single signal. Born-digital text is stored as positioned text blocks. Measure what fraction of the page area is covered by actual text blocks, not images.

import fitz

def detect_by_text_blocks(pdf_path, area_threshold=0.05):
    doc = fitz.open(pdf_path)
    results = []
    for i, page in enumerate(doc):
        page_area = page.rect.width * page.rect.height
        text_area = 0
        for block in page.get_text("blocks"):
            x0, y0, x1, y1, _, _, block_type = block
            if block_type == 0:  # 0 = text block, 1 = image block
                text_area += (x1 - x0) * (y1 - y0)
        ratio = text_area / page_area if page_area else 0
        results.append({
            "page": i,
            "text_area_ratio": round(ratio, 3),
            "likely_scanned": ratio < area_threshold,
        })
    return results

A digital page typically has 20-60% of its area covered by text blocks. A scanned page has under 5% — usually zero, occasionally trace amounts from a scanner-added OCR layer in the page margins.

This method correctly classifies the cases that defeat Methods 1 and 2:

Scanned PDF with hidden OCR layer: text-block area is often still tiny when the OCR layer is sparse or confined to part of the page. A dense full-page OCR layer can report large text blocks, so confirm this on your own documents
Born-digital with full-page image background: text-block area is large because real text sits in front of the image
Sparse digital page (a title slide): text-block area is small but non-zero, distinguishable from a true scan
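
To make the ratio concrete, the calculation can be run on hand-written block tuples. The layout `(x0, y0, x1, y1, text, block_no, block_type)` mirrors the tuple unpacking in the code above; the sample coordinates and the helper name `text_area_ratio` are made up for illustration.

```python
def text_area_ratio(blocks, page_width, page_height):
    """Fraction of the page area covered by text blocks (block_type == 0)."""
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    text_area = sum(
        (x1 - x0) * (y1 - y0)
        for x0, y0, x1, y1, _text, _block_no, block_type in blocks
        if block_type == 0
    )
    return text_area / page_area

# Hypothetical digital page: two paragraphs plus one embedded figure
digital_blocks = [
    (72, 72, 540, 300, "First paragraph ...", 0, 0),
    (72, 320, 540, 500, "Second paragraph ...", 1, 0),
    (72, 520, 540, 720, "<image>", 2, 1),  # image block, ignored
]
# Hypothetical scanned page: one full-page image block and no text
scanned_blocks = [(0, 0, 612, 792, "<image>", 0, 1)]

print(round(text_area_ratio(digital_blocks, 612, 792), 3))
print(round(text_area_ratio(scanned_blocks, 612, 792), 3))
```

The digital sample lands inside the 20-60% range quoted above, while the scanned sample comes out at zero.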

Combining the three: a routing function

The robust answer is to combine all three signals and route on the combination, not any single check.

import fitz
from enum import Enum

class PageType(str, Enum):
    DIGITAL = "digital"
    SCANNED = "scanned"
    HYBRID = "hybrid"  # OCR layer over image; quality unknown

def classify_page(page):
    text = page.get_text("text").strip()
    char_count = len(text)

    page_area = page.rect.width * page.rect.height
    if page_area == 0:
        return PageType.DIGITAL

    text_area = sum(
        (b[2] - b[0]) * (b[3] - b[1])
        for b in page.get_text("blocks")
        if b[6] == 0
    )
    image_area = 0
    for img in page.get_images(full=True):
        for rect in page.get_image_rects(img[0]):
            image_area += rect.width * rect.height

    text_ratio = text_area / page_area
    image_ratio = image_area / page_area

    # Scan with OCR layer: extractable characters exist, but text blocks cover
    # little of the page and an image covers most of it. Checked first so these
    # pages are not swallowed by the pure-scan rule below.
    if char_count > 50 and text_ratio < 0.05 and image_ratio > 0.5:
        return PageType.HYBRID

    # Pure scan: no real text blocks, page mostly covered by an image
    if text_ratio < 0.02 and image_ratio > 0.5:
        return PageType.SCANNED

    return PageType.DIGITAL

def route(pdf_path):
    doc = fitz.open(pdf_path)
    plan = []
    for i, page in enumerate(doc):
        kind = classify_page(page)
        plan.append((i, kind))
    return plan

The HYBRID case is the one most pipelines miss. The page has extractable text, so naive pipelines accept it. But the text came from a low-quality OCR layer baked in at scan time, so accuracy is poor and you cannot tell from the text alone. Re-running OCR on these pages with a modern engine often improves character accuracy by 10-15 percentage points.
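
Given the `(page_index, kind)` tuples that `route` returns, selecting the pages to send to OCR is a simple filter. The sketch below uses plain strings that match the `PageType` values so it runs on its own; `pages_needing_ocr` is a hypothetical helper name.

```python
OCR_KINDS = ("scanned", "hybrid")  # string values of PageType.SCANNED / HYBRID

def pages_needing_ocr(plan):
    """Return the page indices whose kind requires an OCR pass."""
    return [index for index, kind in plan if kind in OCR_KINDS]

# Hypothetical routing plan for a five-page document
plan = [
    (0, "digital"),
    (1, "scanned"),
    (2, "digital"),
    (3, "hybrid"),
    (4, "digital"),
]
ocr_pages = pages_needing_ocr(plan)
print(f"OCR {len(ocr_pages)} of {len(plan)} pages: {ocr_pages}")
```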

Benchmark: detection accuracy on 200 real PDFs

We labelled every page in our 200-PDF benchmark by hand (3,847 pages total) and measured each detection method against the ground truth:

Method	Precision	Recall	F1	False positives
Text length only	0.94	0.71	0.81	19
Image coverage only	0.62	0.96	0.75	187
Text-block ratio only	0.97	0.93	0.95	11
Combined (all three)	0.99	0.97	0.98	4

Image coverage alone produces too many false positives: infographic-heavy slide decks get flagged as scans. Text length alone misses the hybrid case. The combined check reaches 0.99 precision and 0.97 recall.
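
The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns as a sanity check. The helper below is the generic formula, not part of pdfmux.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from the table above
table = {
    "Text length only": (0.94, 0.71),
    "Image coverage only": (0.62, 0.96),
    "Text-block ratio only": (0.97, 0.93),
    "Combined (all three)": (0.99, 0.97),
}
for method, (p, r) in table.items():
    print(f"{method}: F1 = {f1_score(p, r):.2f}")
```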

The four remaining false positives in the combined check were all the same pathological case: a one-page receipt scanned with a fixed-position OCR layer in the corner. These are rare enough that we route them through OCR anyway as a safety net — the extra cost is one OCR pass per ~1,000 pages.

What to do once a page is classified

The whole point of detection is routing. Once you know which pages are scanned, run OCR on those and leave the digital pages alone.

import fitz  # PyMuPDF
from rapidocr_onnxruntime import RapidOCR

ocr = RapidOCR()

def extract(pdf_path):
    doc = fitz.open(pdf_path)
    output = []
    for i, page in enumerate(doc):
        kind = classify_page(page)
        if kind == PageType.DIGITAL:
            text = page.get_text("text")
        else:
            # Render the page to an image, then OCR
            pix = page.get_pixmap(dpi=300)
            result, _ = ocr(pix.tobytes("png"))
            text = "\n".join(r[1] for r in result) if result else ""
        output.append({"page": i, "type": kind, "text": text})
    return output

On the 200-PDF benchmark this approach runs in 27 seconds total versus 89 seconds for blanket OCR — a 3.3x speedup with identical extraction quality. The full breakdown is in our PDF extractor benchmark post.

Pitfalls to watch for

A few edge cases that cost us debugging time:

Cropped pages with CropBox: PyMuPDF’s page.rect returns the cropped area, not the media box. If a scanned page has a heavily cropped CropBox, your image-coverage ratio gets distorted. Use page.mediabox if you need the full physical page area.
Rotated pages: A page with 90 or 270 degree rotation has swapped width and height. PyMuPDF handles this correctly in page.rect, but if you read raw rectangle coordinates from page.get_text("dict") you need to apply the rotation matrix.
Form XObjects: Some PDFs store page content inside a referenced Form XObject rather than directly on the page. Text-block extraction follows the reference correctly; image-area calculation can miss it. If your image-coverage ratio comes out to zero on a page that visually looks scanned, this is usually why.
Very large image dimensions: A few scanners produce 8000x10000 pixel images downsampled at display time. Iterating images with page.get_images(full=True) is fine, but if you ever load the pixmap to inspect it, do so at reduced DPI to avoid memory blow-up.
Encrypted PDFs: A password-protected PDF returns empty text on every page. The text-length check flags every page as scanned, which is wrong. Check doc.is_encrypted before classification.
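
The encrypted-document guard can be sketched as a small pre-flight check. `doc.is_encrypted` is the attribute named above; the function name `check_classifiable`, the exception class, and the `FakeDoc` stand-in are hypothetical, used here so the example runs without a real PDF.

```python
from dataclasses import dataclass

class EncryptedPDFError(Exception):
    """Raised when a document must be unlocked before classification."""

def check_classifiable(doc):
    """Refuse to classify a document that is still encrypted.

    `doc` only needs an `is_encrypted` attribute, as a PyMuPDF document has.
    """
    if doc.is_encrypted:
        raise EncryptedPDFError("unlock the PDF before classifying pages")
    return doc

@dataclass
class FakeDoc:
    """Minimal stand-in for a PyMuPDF document, for demonstration only."""
    is_encrypted: bool

check_classifiable(FakeDoc(is_encrypted=False))  # passes through
try:
    check_classifiable(FakeDoc(is_encrypted=True))
except EncryptedPDFError as exc:
    print(f"skipped: {exc}")
```

With a real document, the call would go right after `fitz.open(pdf_path)` and before the loop over pages.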

Self-healing as a safety net

Detection is good but not perfect. A robust pipeline runs a self-healing quality audit after extraction: if a page classified as digital extracts to garbled or near-empty text, fall back to OCR on that page only. This catches the small percentage of misclassifications without paying the OCR cost on every page.

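def extract_with_fallback(page):
    # quality_score and ocr_page are not defined in this post;
    # supply your own audit and OCR helpers.
    if classify_page(page) == PageType.DIGITAL: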
        text = page.get_text("text")
        if quality_score(text) >= 0.7:
            return text
    # Either classified as scanned, or digital extraction was low quality
    return ocr_page(page)
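
`quality_score` is not defined in the snippet above. As one possible reading, a quality score could measure how much of the extracted text looks like ordinary writing. The heuristic below is an assumption for illustration, not pdfmux's actual audit, and the 0.7 cutoff only makes sense relative to whatever scoring function you choose.

```python
def quality_score(text, min_chars=20):
    """Hypothetical heuristic: share of characters that are letters, digits,
    whitespace, or common punctuation. Returns a value between 0.0 and 1.0.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return 0.0  # near-empty extraction counts as a failure
    ordinary = sum(
        1 for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-%$/"
    )
    return ordinary / len(stripped)

clean = "The quarterly report shows revenue of $4.2 million, up 12% year on year."
garbled = "\ufffd\ufffd\ufffd#@\ufffd~~^^\ufffd\ufffd||\ufffd\ufffd{}\ufffd\ufffd<>\ufffd\ufffd"

print(quality_score(clean))
print(quality_score(garbled))
print(quality_score("Page 3"))
```

Short but valid pages such as a title slide score 0.0 under this heuristic and would be sent to OCR, which errs on the safe side but costs time.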

The combination of upfront classification and post-extraction audit is what brings effective accuracy to within 0.5% of running OCR on everything, at less than a third of the runtime.

How pdfmux handles this

pdfmux implements exactly this pattern: classify each page in under 1 millisecond using the three-signal combined check, route digital pages through PyMuPDF and scanned pages through RapidOCR, then audit every extraction and re-run OCR on any page that fails the audit. The pipeline runs entirely on CPU with no GPU and no API keys — see our GPU-free architecture post for the full design.

Install with pip install pdfmux and call pdfmux.extract("file.pdf"); the classifier, router, and self-healing loop are all internal. If you want the lower-level building blocks, the code above is a complete reference implementation; the only thing pdfmux adds is the audit loop and a handful of pathological-document edge cases that took us a year to find.

Summary

Detecting scanned PDFs is a routing problem, not a yes-or-no problem. Three checks (extracted text length, image coverage, and text-block area ratio) combined into a single classifier reach 98% F1 on our 200-PDF benchmark. Once you can classify pages reliably, OCR cost drops sharply because you only run OCR on the pages that need it: total runtime fell 3.3x on that benchmark. The remaining edge cases (hybrid OCR layers, cropped pages, encrypted documents) are predictable enough to handle with a small set of guard clauses plus a post-extraction quality audit. That is the entire detection-and-routing pattern, and it is what separates a pipeline that handles 90% of PDFs from one that handles 99.5%.