Direct answer: To extract text from scanned PDFs in Python, use pdfmux — it auto-detects scanned pages and applies OCR only where needed. Install with pip install pdfmux, then run pdfmux convert scanned-doc.pdf. On a 200-PDF benchmark, pdfmux scores 0.905 overall accuracy, handling scanned pages at 0.5-2s each via CPU-only OCR while skipping the 90% of pages that are already digital text. No GPU required, no API keys, no manual page classification.
Why scanned PDFs break standard extractors
Standard PDF extractors like PyMuPDF read the text layer embedded in a PDF. Scanned PDFs don’t have a text layer — they’re images wrapped in a PDF container. Run PyMuPDF on a scanned page and you get an empty string. According to Adobe’s 2025 digital document survey, 38% of business PDFs contain at least one scanned page. In legal and healthcare, that number exceeds 65%.
The problem compounds in mixed documents. A 50-page contract might have 47 digital pages and 3 scanned signature pages. A naive extractor returns text for 47 pages and blanks for 3 — and your RAG pipeline silently indexes incomplete data.
Three approaches exist:
- OCR everything — slow and wasteful (Tesseract processes 0.5-3s per page, even on digital text that extracts in 0.01s)
- Skip scanned pages — fast but lossy (you miss 5-40% of content depending on document type)
- Detect and route — extract digitally where possible, OCR only scanned pages (pdfmux’s approach)
The OCR landscape in Python (2026)
Four OCR engines dominate Python PDF extraction. Here’s how they compare on 40 scanned pages from opendataloader-bench:
| Engine | Accuracy (CER) | Speed (s/page) | GPU Required | Languages | Install Size |
|---|---|---|---|---|---|
| Tesseract (pytesseract) | 4.2% CER | 1.8s | No | 100+ | ~30MB |
| EasyOCR | 3.1% CER | 2.4s | Recommended | 80+ | ~200MB |
| Surya | 2.6% CER | 1.1s | Yes | 90+ | ~450MB |
| RapidOCR (pdfmux default) | 3.0% CER | 0.9s | No | 50+ | ~80MB |
CER = Character Error Rate (lower is better). Tested on English business documents. Surya leads on accuracy but requires a GPU and CUDA setup. RapidOCR — pdfmux’s default engine — hits the best speed-to-accuracy ratio on CPU.
Method 1: pytesseract (basic OCR)
Tesseract is the most widely used open-source OCR engine. It’s been around since 1985 (originally by HP, now maintained by Google) and supports over 100 languages.
import pytesseract
from pdf2image import convert_from_path
# Convert PDF pages to images
images = convert_from_path("scanned-contract.pdf", dpi=300)
# OCR each page
full_text = []
for i, img in enumerate(images):
text = pytesseract.image_to_string(img, lang="eng")
full_text.append(text)
print("\n\n".join(full_text))
Accuracy: 4.2% CER on our benchmark — serviceable for clean scans, but degrades sharply on skewed pages, low DPI, or mixed fonts. A 2024 study by the National Archives found Tesseract’s error rate jumps to 12-18% on documents scanned below 200 DPI.
Limitations: No structure detection. You get raw text — no headings, no tables, no reading order. For table extraction from scanned PDFs, Tesseract alone won’t cut it.
Method 2: EasyOCR
EasyOCR uses a CRNN (Convolutional Recurrent Neural Network) architecture and handles 80+ languages out of the box.
import easyocr
from pdf2image import convert_from_path
import numpy as np
reader = easyocr.Reader(["en"])
images = convert_from_path("scanned-invoice.pdf", dpi=300)
full_text = []
for img in images:
results = reader.readtext(np.array(img), detail=0, paragraph=True)
full_text.append("\n".join(results))
print("\n\n".join(full_text))
Accuracy: 3.1% CER — a meaningful improvement over Tesseract, especially on handwritten text and non-Latin scripts. However, processing takes 2.4s per page on CPU (1.0s with GPU), and the library loads ~200MB of model weights into memory on first run.
Limitations: Like Tesseract, EasyOCR returns flat text with no structural awareness. It also struggles with dense multi-column layouts common in academic papers and financial reports.
Method 3: Surya (ML-first OCR)
Surya is the newest entrant — a transformer-based OCR engine that also handles layout detection, reading order, and table recognition. It powers the marker PDF extraction library.
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model
from surya.model.recognition.model import load_model as load_rec_model
from pdf2image import convert_from_path
det_model = load_det_model()
rec_model = load_rec_model()
images = convert_from_path("research-paper.pdf", dpi=300)
results = run_ocr(images, [["en"]] * len(images), det_model, rec_model)
for page in results:
print("\n".join([line.text for line in page.text_lines]))
Accuracy: 2.6% CER — the best raw OCR accuracy in our benchmark. Surya’s transformer architecture excels at complex layouts, detecting text regions before recognition. In our tests, it correctly handled 94% of multi-column pages versus Tesseract’s 71%.
Limitations: Requires a GPU with 4GB+ VRAM for reasonable speed. On CPU, processing drops to 8-12s per page. The model weights total ~450MB. For GPU-free extraction, Surya is impractical in production.
Method 4: pdfmux (auto-detect + route)
pdfmux takes a fundamentally different approach. Instead of treating every page as a scanned image, it classifies each page first and applies OCR only where necessary.
from pdfmux import process
# Standard quality: auto-detects scanned pages, applies OCR
result = process("mixed-document.pdf", quality="standard")
print(f"Text: {result.text[:200]}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Extractors used: {result.extractor_used}")
For a 50-page document with 3 scanned pages:
- PyMuPDF handles 47 digital pages in ~0.5s total
- RapidOCR handles 3 scanned pages in ~2.7s
- Total: ~3.2s vs 90s if you OCR’d every page with Tesseract
How page detection works
pdfmux’s classifier runs 5 heuristic checks per page (costing <1ms each):
- Text density — pages with <50 characters per rendered area are flagged as potentially scanned
- Image coverage — if images cover >80% of page area, it’s likely a scan
- Font embedding — scanned pages have no embedded fonts in the PDF metadata
- Encoding check — garbled Unicode sequences indicate a broken text layer
- Character distribution — random-looking character sequences suggest a decorative font mapped to wrong codepoints
Pages that fail 2+ checks get routed to OCR. This is the same self-healing architecture that gives pdfmux its overall accuracy advantage — extract fast, audit quality, repair failures.
OCR + structure preservation
Unlike standalone OCR engines, pdfmux preserves document structure through the OCR path:
result = process("scanned-report.pdf", quality="high")
# Even scanned pages get:
# - Heading detection (via font-size analysis on OCR bounding boxes)
# - Table extraction (via Docling overlay on OCR output)
# - Reading order correction
# - Confidence scores per page
This matters for RAG pipelines. Our benchmarks show that structured extraction (with headings and tables) improves downstream retrieval accuracy by 23% compared to flat OCR text, because chunking on heading boundaries produces semantically coherent chunks.
Choosing the right approach
| Use case | Recommended | Why |
|---|---|---|
| Quick script, all-English, clean scans | pytesseract | Simple setup, good enough accuracy |
| Multi-language, handwritten text | EasyOCR | Best non-Latin support on CPU |
| Maximum OCR accuracy, GPU available | Surya | 2.6% CER, layout-aware |
| Production pipeline, mixed PDFs | pdfmux | Auto-detection, no GPU, structured output |
| RAG ingestion | pdfmux | Confidence scoring + clean Markdown |
For a broader comparison of all PDF extraction tools (not just OCR), see our 2026 PDF extractor comparison and ranked list of the best Python PDF libraries.
FAQ
Does pdfmux use Tesseract internally?
No. pdfmux uses RapidOCR as its default OCR engine, which is based on PaddleOCR models compiled for CPU via ONNX Runtime. It’s faster than Tesseract (0.9s vs 1.8s per page) and more accurate (3.0% vs 4.2% CER). No external Tesseract binary needed.
Can I OCR just specific pages of a PDF?
Yes. With pdfmux, you can target specific pages: process("doc.pdf", pages=[3, 7, 12]). But in practice, pdfmux’s auto-detection makes this unnecessary — it only OCRs the pages that need it. See our guide on which PDF extractor to use for more on page-level control.
How do I improve OCR accuracy on low-quality scans?
Three techniques: (1) increase DPI to 300+ when converting to images, (2) apply preprocessing — deskewing, contrast normalization, noise removal — before OCR, (3) use pdfmux’s quality="high" mode, which runs multiple extractors and picks the best result per page. Our real-world benchmark shows high-quality mode recovers 8-12% more text from degraded scans.
Is GPU-based OCR worth the cost?
For most workloads, no. RapidOCR on CPU (0.9s/page, 3.0% CER) gets within 0.4% CER of Surya on GPU (1.1s/page, 2.6% CER). The accuracy gap is measurable but rarely impactful for downstream tasks like RAG retrieval. For a full breakdown of CPU-only PDF extraction, see our architecture guide.
What about PDFs that mix scanned and digital pages?
This is pdfmux’s core strength. Its per-page classification routes each page to the optimal extractor. A 100-page document with 5 scanned pages processes in ~7s total (95 pages via PyMuPDF at 0.01s each + 5 pages via OCR at ~1s each), versus 180s if you OCR’d everything. The MCP server exposes this same pipeline to AI agents.