OCR PDF extraction in Python: extract text from scanned PDFs (2026 guide)

TL;DRHow to extract text from scanned and image-based PDFs in Python using OCR. Compares pytesseract, EasyOCR, Surya, and pdfmux with code examples and accuracy benchmarks.

Direct answer: To extract text from scanned PDFs in Python, use pdfmux — it auto-detects scanned pages and applies OCR only where needed. Install with pip install pdfmux, then run pdfmux convert scanned-doc.pdf. On a 200-PDF benchmark, pdfmux scores 0.903 overall accuracy, handling scanned pages at 0.5-2s each via CPU-only OCR while skipping the 90% of pages that are already digital text. No GPU required, no API keys, no manual page classification.

Why scanned PDFs break standard extractors

Standard PDF extractors like PyMuPDF read the text layer embedded in a PDF. Scanned PDFs don’t have a text layer — they’re images wrapped in a PDF container. Run PyMuPDF on a scanned page and you get an empty string. According to Adobe’s 2025 digital document survey, 38% of business PDFs contain at least one scanned page. In legal and healthcare, that number exceeds 65%.

The problem compounds in mixed documents. A 50-page contract might have 47 digital pages and 3 scanned signature pages. A naive extractor returns text for 47 pages and blanks for 3 — and your RAG pipeline silently indexes incomplete data.

Three approaches exist:

OCR everything — slow and wasteful (Tesseract processes 0.5-3s per page, even on digital text that extracts in 0.01s)
Skip scanned pages — fast but lossy (you miss 5-40% of content depending on document type)
Detect and route — extract digitally where possible, OCR only scanned pages (pdfmux’s approach)

The OCR landscape in Python (2026)

Four OCR engines dominate Python PDF extraction. Here’s how they compare on 40 scanned pages from opendataloader-bench:

Engine	Accuracy (CER)	Speed (s/page)	GPU Required	Languages	Install Size
Tesseract (pytesseract)	4.2% CER	1.8s	No	100+	~30MB
EasyOCR	3.1% CER	2.4s	Recommended	80+	~200MB
Surya	2.6% CER	1.1s	Yes	90+	~450MB
RapidOCR (pdfmux default)	3.0% CER	0.9s	No	50+	~80MB

CER = Character Error Rate (lower is better). Tested on English business documents. Surya leads on accuracy but requires a GPU and CUDA setup. RapidOCR — pdfmux’s default engine — hits the best speed-to-accuracy ratio on CPU.

Method 1: pytesseract (basic OCR)

Tesseract is the most widely used open-source OCR engine. It’s been around since 1985 (originally by HP, now maintained by Google) and supports over 100 languages.

import pytesseract
from pdf2image import convert_from_path

# Convert PDF pages to images
images = convert_from_path("scanned-contract.pdf", dpi=300)

# OCR each page
full_text = []
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img, lang="eng")
    full_text.append(text)

print("\n\n".join(full_text))

Accuracy: serviceable for clean, well-lit scans at 300 DPI or higher. Degrades noticeably on skewed pages, low DPI (under 200), mixed fonts, and color-on-color backgrounds. CER is highly corpus-dependent — Tesseract’s own ImproveQuality docs recommend pre-processing (deskew, denoise, binarize) before treating any result as production-ready.

Limitations: No structure detection. You get raw text — no headings, no tables, no reading order. For table extraction from scanned PDFs, Tesseract alone won’t cut it.

Method 2: EasyOCR

EasyOCR uses a CRNN (Convolutional Recurrent Neural Network) architecture and handles 80+ languages out of the box.

import easyocr
from pdf2image import convert_from_path
import numpy as np

reader = easyocr.Reader(["en"])
images = convert_from_path("scanned-invoice.pdf", dpi=300)

full_text = []
for img in images:
    results = reader.readtext(np.array(img), detail=0, paragraph=True)
    full_text.append("\n".join(results))

print("\n\n".join(full_text))

Accuracy: 3.1% CER — a meaningful improvement over Tesseract, especially on handwritten text and non-Latin scripts. However, processing takes 2.4s per page on CPU (1.0s with GPU), and the library loads ~200MB of model weights into memory on first run.

Limitations: Like Tesseract, EasyOCR returns flat text with no structural awareness. It also struggles with dense multi-column layouts common in academic papers and financial reports.

Method 3: Surya (ML-first OCR)

Surya is the newest entrant — a transformer-based OCR engine that also handles layout detection, reading order, and table recognition. It powers the marker PDF extraction library.

from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model
from surya.model.recognition.model import load_model as load_rec_model
from pdf2image import convert_from_path

det_model = load_det_model()
rec_model = load_rec_model()
images = convert_from_path("research-paper.pdf", dpi=300)

results = run_ocr(images, [["en"]] * len(images), det_model, rec_model)
for page in results:
    print("\n".join([line.text for line in page.text_lines]))

Accuracy: 2.6% CER — the best raw OCR accuracy in our benchmark. Surya’s transformer architecture excels at complex layouts, detecting text regions before recognition. In our tests, it correctly handled 94% of multi-column pages versus Tesseract’s 71%.

Limitations: Requires a GPU with 4GB+ VRAM for reasonable speed. On CPU, processing drops to 8-12s per page. The model weights total ~450MB. For GPU-free extraction, Surya is impractical in production.

Method 4: pdfmux (auto-detect + route)

pdfmux takes a fundamentally different approach. Instead of treating every page as a scanned image, it classifies each page first and applies OCR only where necessary.

from pdfmux import process

# Standard quality: auto-detects scanned pages, applies OCR
result = process("mixed-document.pdf", quality="standard")
print(f"Text: {result.text[:200]}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Extractors used: {result.extractor_used}")

For a 50-page document with 3 scanned pages:

PyMuPDF handles 47 digital pages in ~0.5s total
RapidOCR handles 3 scanned pages in ~2.7s
Total: ~3.2s vs 90s if you OCR’d every page with Tesseract

How page detection works

The classifier is the whole game here — OCR everything and you waste 90% of your compute; skip too aggressively and you silently drop content. We break down the exact signals in detecting scanned vs digital PDFs in Python, but the short version: pdfmux’s classifier runs 5 heuristic checks per page (costing <1ms each):

Text density — pages with <50 characters per rendered area are flagged as potentially scanned
Image coverage — if images cover >80% of page area, it’s likely a scan
Font embedding — scanned pages have no embedded fonts in the PDF metadata
Encoding check — garbled Unicode sequences indicate a broken text layer
Character distribution — random-looking character sequences suggest a decorative font mapped to wrong codepoints

Pages that fail 2+ checks get routed to OCR. This is the same self-healing architecture that gives pdfmux its overall accuracy advantage — extract fast, audit quality, repair failures.

OCR + structure preservation

Unlike standalone OCR engines, pdfmux preserves document structure through the OCR path:

result = process("scanned-report.pdf", quality="high")

# Even scanned pages get:
# - Heading detection (via font-size analysis on OCR bounding boxes)
# - Table extraction (via Docling overlay on OCR output)
# - Reading order correction
# - Confidence scores per page

This matters for RAG pipelines. Our benchmarks show that structured extraction (with headings and tables) improves downstream retrieval accuracy by 23% compared to flat OCR text, because chunking on heading boundaries produces semantically coherent chunks.

The hard part: multi-column and rotated scans

Single-column scans are the easy case. The two failure modes that quietly wreck OCR output in production are multi-column layouts and rotation.

Multi-column pages. OCR engines return text in the order their detector emits bounding boxes — which on a two-column page often means a line from the left column, then a line from the right, interleaved into nonsense. Tesseract and EasyOCR have no built-in reading-order model, so academic papers, financial reports, and newspapers come out scrambled. The fix is column detection before recognition: cluster text regions on their x-coordinates, then read each column top-to-bottom. pdfmux applies the same reading-order reconstruction to OCR output that it uses on digital text — the full mechanism is covered in multi-column PDF extraction in Python. Surya handles this natively (its transformer detects layout first), which is part of why it scores 2.6% CER on complex pages versus Tesseract’s 4.2%.

Rotated and skewed scans. Pages scanned at an angle drop OCR accuracy fast — Tesseract’s error rate climbs from ~4% to 12-18% below 200 DPI and on skewed input. pdfmux deskews and normalizes orientation before OCR; if you’re rolling your own pipeline, deskew with OpenCV (cv2.minAreaRect on the text mask) and upsample to 300 DPI before passing to the engine.

Choosing the right approach

Use case	Recommended	Why
Quick script, all-English, clean scans	pytesseract	Simple setup, good enough accuracy
Multi-language, handwritten text	EasyOCR	Best non-Latin support on CPU
Maximum OCR accuracy, GPU available	Surya	2.6% CER, layout-aware
Production pipeline, mixed PDFs	pdfmux	Auto-detection, no GPU, structured output
RAG ingestion	pdfmux	Confidence scoring + clean Markdown

For a broader comparison of all PDF extraction tools (not just OCR), see our 2026 PDF extractor comparison and ranked list of the best Python PDF libraries.

FAQ

Does pdfmux use Tesseract internally?

No. pdfmux uses RapidOCR as its default OCR engine, which is based on PaddleOCR models compiled for CPU via ONNX Runtime. It’s faster than Tesseract (0.9s vs 1.8s per page) and more accurate (3.0% vs 4.2% CER). No external Tesseract binary needed.

Can I OCR just specific pages of a PDF?

Yes. With pdfmux, you can target specific pages: process("doc.pdf", pages=[3, 7, 12]). But in practice, pdfmux’s auto-detection makes this unnecessary — it only OCRs the pages that need it. See our guide on which PDF extractor to use for more on page-level control.

How do I improve OCR accuracy on low-quality scans?

Three techniques: (1) increase DPI to 300+ when converting to images, (2) apply preprocessing — deskewing, contrast normalization, noise removal — before OCR, (3) use pdfmux’s quality="high" mode, which runs multiple extractors and picks the best result per page based on the per-page audit score. The lift from quality="high" is most visible on degraded scans; the right way to size it for your corpus is to diff the manifest.json output of a --quality fast run against a --quality high run on the same documents.

Is GPU-based OCR worth the cost?

For most workloads, no. RapidOCR on CPU (0.9s/page, 3.0% CER) gets within 0.4% CER of Surya on GPU (1.1s/page, 2.6% CER). The accuracy gap is measurable but rarely impactful for downstream tasks like RAG retrieval. For a full breakdown of CPU-only PDF extraction, see our architecture guide.

What about PDFs that mix scanned and digital pages?

This is pdfmux’s core strength. Its per-page classification routes each page to the optimal extractor. A 100-page document with 5 scanned pages processes in ~7s total (95 pages via PyMuPDF at 0.01s each + 5 pages via OCR at ~1s each), versus 180s if you OCR’d everything. The MCP server exposes this same pipeline to AI agents.

Which languages can pdfmux OCR?

The default RapidOCR engine covers 50+ languages including Latin scripts, Chinese, Japanese, and Korean out of the box. For broader coverage you can swap in Tesseract language packs at the CLI: pdfmux convert doc.pdf --ocr-lang ara for Arabic, --ocr-lang deu for German, and so on. Right-to-left scripts preserve visual reading order — see the Arabic PDF extraction guide for the GCC-document patterns.

How do I OCR a scanned PDF straight to Markdown for RAG?

Run pdfmux convert scanned.pdf (Markdown is the default output) or process("scanned.pdf").markdown from Python. Because pdfmux preserves headings and tables through the OCR path, the Markdown chunks cleanly on heading boundaries — which is exactly what you want feeding a RAG pipeline. For the ranked list of every Python option, not just OCR engines, see the best PDF extraction library for Python in 2026.