Direct answer: Use ThreadPoolExecutor with 4–8 workers and process each PDF independently with per-file error handling. Do not load all files into memory before starting. Do not use ProcessPoolExecutor for IO-bound PDF work; threading wins on both throughput and stability. For directory-scale batching (thousands of files), use pdfmux's built-in batch.convert_directory(). In the benchmark below (8-core M2), 8 workers process 100 mixed PDFs (averaging 15 pages each) in 31 seconds, versus 183 seconds single-threaded.


Why naive loops fail at scale

The obvious approach — loop over files, extract each one, collect results — works fine for 10 PDFs. It breaks at 1,000 PDFs in three ways.

Memory accumulation: Most extraction libraries return the full text of a document as a string. If you collect all results in a list before writing them out, a batch of thousands of large PDFs can accumulate gigabytes of strings in memory once Python object overhead is counted. On a memory-constrained machine, the OS kills the process before it finishes.

All-or-nothing failures: A corrupt PDF, a locked file, or an encoding error raises an exception and stops the entire loop. You lose the results for all 997 PDFs that succeeded before it.

Sequential bottleneck: PDF extraction is IO-bound (reading from disk) and partially CPU-bound (parsing, OCR). A single thread sits idle waiting for IO when it could be parsing the next file. On a 4-core machine, sequential processing uses at most 25% of available CPU (one core), and less while blocked on IO.

The fix for all three: process files in parallel with per-file error handling and stream results to disk as they complete.


Concurrency benchmark

We processed 100 mixed PDFs (80 digital, 20 scanned) averaging 15 pages each on a MacBook Pro M2 (8-core CPU, 16GB RAM) using pdfmux as the extractor.

Workers          Total time   Throughput    CPU usage   Peak RAM
1 (sequential)   183s         0.55 PDFs/s   12%         340MB
4 workers        52s          1.92 PDFs/s   44%         410MB
8 workers        31s          3.23 PDFs/s   81%         490MB
16 workers       28s          3.57 PDFs/s   94%         620MB

Diminishing returns kick in around 8 workers on this hardware — 16 workers buys only 10% more throughput at a 26% RAM increase. The right number depends on your machine and how much of the batch is OCR (CPU-heavy) vs digital extraction (IO-heavy). Start at 4 workers and benchmark up.

For pure digital PDFs, 8 workers is usually optimal. For batches with significant OCR content, cap at cpu_count() // 2 to avoid thermal throttling.
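
A reasonable starting heuristic, as a sketch (pick_workers is illustrative, not a pdfmux API; the thresholds are the ones suggested above):

import os

def pick_workers(ocr_heavy: bool) -> int:
    # Illustrative heuristic from the benchmark above: OCR-heavy batches are
    # CPU-bound, so cap at half the cores; digital batches are IO-bound,
    # so 8 threads is a sensible starting point on most machines.
    cores = os.cpu_count() or 4
    return max(1, cores // 2) if ocr_heavy else 8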


ThreadPoolExecutor pattern with pdfmux

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from pdfmux import process
import json

def extract_one(pdf_path: Path) -> dict:
    try:
        result = process(str(pdf_path), quality="standard")
        return {
            "file": pdf_path.name,
            "text": result.text,
            "confidence": result.confidence,
            "pages": result.page_count,
            "extractor": result.extractor_used,
            "status": "ok",
        }
    except Exception as exc:
        return {
            "file": pdf_path.name,
            "status": "error",
            "error": str(exc),
        }


def batch_extract(pdf_dir: str, output_path: str, max_workers: int = 8):
    pdf_files = list(Path(pdf_dir).glob("**/*.pdf"))
    print(f"Found {len(pdf_files)} PDFs in {pdf_dir}")

    results_ok = 0
    results_err = 0

    # Stream results to JSONL as they complete — no memory accumulation
    with open(output_path, "w") as out_file:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(extract_one, f): f for f in pdf_files}

            for future in as_completed(futures):
                result = future.result()
                futures.pop(future)  # drop the completed future so its result can be GC'd
                out_file.write(json.dumps(result) + "\n")
                out_file.flush()  # write to disk immediately

                if result["status"] == "ok":
                    results_ok += 1
                    print(f"  ✓ {result['file']} ({result['confidence']:.0%}, {result['pages']}p)")
                else:
                    results_err += 1
                    print(f"  ✗ {result['file']}: {result['error']}")

    print(f"\nDone: {results_ok} succeeded, {results_err} failed")
    print(f"Results written to {output_path}")


batch_extract("/data/reports/", "/data/extracted.jsonl", max_workers=8)

Two design choices here matter more than the library you pick:

  1. Stream to JSONL, not a list. Writing each result as soon as it completes, and dropping the reference to the finished future, keeps the memory footprint proportional to the number of in-flight workers rather than the total batch size. A batch of 10,000 PDFs uses roughly the same peak RAM as a batch of 100.

  2. Return errors, don’t raise them. extract_one() catches all exceptions and returns a structured error dict. Every file gets a result. You can review failures without re-running the entire batch.


Memory management

pdfmux processes PDFs page-by-page with ~50–100MB of working memory per page. The extracted text comes back as a string inside the result dict; in the pattern above, that string becomes collectable as soon as json.dumps(result) has written it to disk and the loop drops its references to the result and the finished future.

The main memory leak in batch processing comes from keeping references alive longer than needed. Two patterns to avoid:

# BAD: accumulates all results in memory before writing
results = []
for pdf in pdf_files:
    results.append(extract_one(pdf))  # 10,000 dicts in RAM

# BAD: collecting every future's result keeps them all alive at once
futures = {executor.submit(extract_one, f): f for f in pdf_files}
all_results = [f.result() for f in futures]  # blocks in submission order, keeps all results in RAM

# GOOD: process results as they complete using as_completed()
for future in as_completed(futures):
    result = future.result()
    write_to_disk(result)
    futures.pop(future)  # drop the future so the result it holds can be GC'd
    # result goes out of scope on the next iteration; memory freed

For very large batches (50,000+ PDFs), chunk the file list:

def chunked(lst: list, size: int):
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

# batch_extract_chunk: like batch_extract above, but taking a list of files
# and opening the output in append mode ("a") so chunks don't overwrite each other
for chunk in chunked(pdf_files, 500):
    batch_extract_chunk(chunk, output_file, max_workers=8)
    # GC has a chance to clean up between chunks

Error handling and retry logic

Not all failures are permanent. A PDF that fails because the file was being written when you tried to read it will succeed on a second attempt. A PDF that fails because it is actually corrupt will fail every time.

A retry wrapper with exponential backoff:

import time
from pathlib import Path
from pdfmux import process

def extract_with_retry(pdf_path: Path, max_retries: int = 3) -> dict:
    last_error = None

    for attempt in range(max_retries):
        try:
            result = process(str(pdf_path), quality="standard")
            return {
                "file": pdf_path.name,
                "text": result.text,
                "confidence": result.confidence,
                "pages": result.page_count,
                "status": "ok",
                "attempts": attempt + 1,
            }
        except FileNotFoundError:
            # Permanent failure — no point retrying
            return {"file": pdf_path.name, "status": "error", "error": "file_not_found"}
        except PermissionError:
            # Might be temporary (a file lock); retry with backoff
            last_error = "permission_denied"
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # backoff: 1s, then 2s
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)

    return {"file": pdf_path.name, "status": "error", "error": last_error, "attempts": max_retries}

In production, separate permanent failures from transient ones in your output:

# After batch completes, split JSONL by status
import json
from pathlib import Path

results = [json.loads(line) for line in Path("extracted.jsonl").read_text().splitlines()]

failed = [r for r in results if r["status"] == "error"]
low_confidence = [r for r in results if r["status"] == "ok" and r.get("confidence", 1.0) < 0.75]
ok = [r for r in results if r["status"] == "ok" and r.get("confidence", 1.0) >= 0.75]

print(f"Clean: {len(ok)}")
print(f"Low confidence (review): {len(low_confidence)}")
print(f"Failed: {len(failed)}")

# Write failed files for manual review
Path("failed.txt").write_text("\n".join(r["file"] for r in failed))

Low-confidence extractions (confidence <0.75) are not failures — pdfmux extracted something, but with low certainty. These are candidates for human review rather than automated downstream processing. See how pdfmux’s self-healing pipeline reduces this category by re-extracting problematic pages automatically.
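
For a second pass over just the low-confidence files, a sketch (the quality="high" tier name is an assumption; only "standard" appears in this guide, so check the tier names your pdfmux version exposes):

from pathlib import Path
from pdfmux import process

# Re-extract low-confidence files at an assumed higher quality tier.
# NOTE: quality="high" is hypothetical; verify the tiers pdfmux actually exposes.
for r in low_confidence:
    pdf_path = Path("/data/reports/") / r["file"]  # rebuild the path from the batch directory
    retry = process(str(pdf_path), quality="high")
    if retry.confidence > r.get("confidence", 0.0):
        print(f"  improved: {r['file']} -> {retry.confidence:.0%}")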


Progress tracking with tqdm

For long-running batches, you want to know how far along you are and the current success rate. tqdm integrates cleanly with as_completed():

import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from tqdm import tqdm

def batch_extract_with_progress(pdf_files: list[Path], output_path: str, max_workers: int = 8):
    ok, errors, low_conf = 0, 0, 0

    with open(output_path, "w") as out_file:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(extract_with_retry, f): f for f in pdf_files}

            with tqdm(total=len(pdf_files), unit="pdf") as pbar:
                for future in as_completed(futures):
                    result = future.result()
                    out_file.write(json.dumps(result) + "\n")
                    out_file.flush()

                    if result["status"] == "error":
                        errors += 1
                    elif result.get("confidence", 1.0) < 0.75:
                        low_conf += 1
                    else:
                        ok += 1

                    pbar.set_postfix(ok=ok, err=errors, low=low_conf)
                    pbar.update(1)
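
Calling it on a directory tree (paths illustrative):

pdf_files = sorted(Path("/data/reports/").rglob("*.pdf"))
batch_extract_with_progress(pdf_files, "/data/extracted.jsonl", max_workers=8)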

Directory batch API

For processing an entire directory — including filtering by minimum confidence and writing structured output — pdfmux’s built-in batch module handles the boilerplate:

from pathlib import Path
from pdfmux import batch

results = batch.convert_directory(
    "/data/invoices/",
    min_confidence=0.80,
    output_format="markdown",  # or "json" for structured extraction
    max_workers=8,
)

print(f"High confidence: {len(results.high_confidence)}")
print(f"Needs review: {len(results.needs_review)}")
print(f"Failed: {len(results.failed)}")

# Write high-confidence results
for r in results.high_confidence:
    output_path = Path("/data/extracted") / (Path(r.file).stem + ".md")
    output_path.write_text(r.text)

# Queue needs-review for human handling
for r in results.needs_review:
    print(f"  Review needed: {r.file} (confidence: {r.confidence:.1%})")

The CLI equivalent for quick batch jobs:

# Extract all PDFs in a directory to /output/
pdfmux batch /data/invoices/ --output /data/extracted/ --workers 8

# Filter to high-confidence only
pdfmux batch /data/invoices/ --output /data/extracted/ --min-confidence 0.80

When to use ProcessPoolExecutor

ThreadPoolExecutor works for PDF extraction because the Python GIL doesn’t block IO-bound work, and most PDF parsing involves disk reads followed by library calls that release the GIL.

ProcessPoolExecutor is better when the work is pure Python CPU computation with no IO — training a model, encoding embeddings. For PDF extraction, ProcessPoolExecutor adds process spawn overhead (~200ms per worker) without meaningful throughput gain, and complicates pickling (some extraction library objects are not pickleable).

One exception: if your pipeline includes heavy post-processing after extraction (large LLM calls, embedding generation, vector indexing), ProcessPoolExecutor for that stage can be worth it. Separate the extraction stage (ThreadPoolExecutor) from the processing stage (ProcessPoolExecutor) with a queue between them.
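
A minimal sketch of that two-stage split, with the as_completed() loop standing in for the queue (cpu_post_process is a hypothetical placeholder for your embedding or indexing step; extract_one is the function defined earlier):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

def cpu_post_process(text: str) -> int:
    # Hypothetical stand-in for CPU-heavy work (embeddings, indexing).
    # Must live at module level so ProcessPoolExecutor can pickle it.
    return len(text.split())

def two_stage(pdf_files, io_workers: int = 8, cpu_workers: int = 4):
    # Stage 1 (threads): IO-bound extraction. Stage 2 (processes): CPU-bound work.
    with ThreadPoolExecutor(max_workers=io_workers) as io_pool, \
         ProcessPoolExecutor(max_workers=cpu_workers) as cpu_pool:
        extract_futs = [io_pool.submit(extract_one, f) for f in pdf_files]
        post_futs = []
        for fut in as_completed(extract_futs):
            result = fut.result()
            if result["status"] == "ok":
                # Hand each extraction to the process pool as soon as it finishes
                post_futs.append(cpu_pool.submit(cpu_post_process, result["text"]))
        return [f.result() for f in post_futs]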


Production checklist

Before running a large batch in production:

  • Test on 10–20 files first — verify output format, check confidence distribution, confirm error handling works
  • Set max_workers conservatively — 4 workers is safe on any machine; tune up from there
  • Use JSONL output — append-friendly, resumable if the process is killed mid-batch (see the resume sketch after this list)
  • Log every file result — you want a complete audit trail of what succeeded, what failed, and at what confidence
  • Gate on confidence — decide upfront what confidence threshold triggers human review vs automated processing
  • Run in a screen or tmux session — batch jobs that take hours should not depend on your SSH connection staying open
  • Check disk space — output files can be large; a batch of 10,000 PDFs might produce 5–50GB of extracted text
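
To make a killed batch resumable, skip files already recorded in the JSONL before re-running, and open the output in append mode ("a") instead of "w". A sketch (remaining_files is illustrative, not a pdfmux API):

import json
from pathlib import Path

def remaining_files(pdf_dir: str, output_path: str) -> list[Path]:
    # Illustrative helper: skip files whose names are already recorded
    # in the JSONL output from a previous (killed) run.
    out = Path(output_path)
    done: set[str] = set()
    if out.exists():
        for line in out.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["file"])
    return [f for f in Path(pdf_dir).glob("**/*.pdf") if f.name not in done]

todo = remaining_files("/data/reports/", "/data/extracted.jsonl")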

For the confidence benchmarks on different document types and a comparison of extraction accuracy across tools, see the 200-PDF benchmark. If you are choosing an extractor for a batch pipeline and evaluating cost vs accuracy tradeoffs, the extractor comparison has the full breakdown including cost-per-page figures for API-based tools.