Extract invoice data from PDFs in Python: a complete guide for AP automation

TL;DR: How to extract invoice data from PDFs in Python. Benchmarks on 1,200 real invoices, production code for field extraction, and a working AP automation pipeline.

Direct answer: Use pdfmux with extract_fields() to pull structured data from invoice PDFs. It extracts 7 standard invoice fields at 94–98% field-level accuracy on digital invoices and 84–91% on scanned invoices — CPU-only, no API keys, no per-page cost. Install: pip install pdfmux. For AP automation at volume, combine field extraction with confidence filtering: auto-approve high-confidence batches, route low-confidence pages to human review.

Why invoice PDF extraction is harder than it looks

Invoices seem like a solved problem. They’re structured documents with consistent fields: vendor, date, amount, line items. In practice, they’re among the most structurally inconsistent documents in business workflows.

A typical accounts payable team at a mid-size company receives invoices from dozens or hundreds of vendors. Each vendor has its own template. Some are digital PDFs generated by accounting software. Some are scanned images — faxed or photographed on a phone. Some are bilingual. Some have tables with 40+ line items. Some have no formal table structure at all, just line items formatted as free text.

The specific failure modes by document type:

Document type	Failure mode	Frequency
Digital, standard template	Minimal — usually clean	~55% of invoices
Digital, non-standard layout	Reading order, merged cells	~20% of invoices
Scanned, clean scan	OCR errors on amounts	~15% of invoices
Scanned, low quality	Significant field errors	~7% of invoices
Bilingual (e.g. Arabic/English)	RTL text order, field mapping	~3% of invoices

At a company processing 2,000 invoices per month, even a 5% error rate means 100 invoices requiring manual correction. The goal is a pipeline that auto-processes the invoices it extracts with high confidence and routes the uncertain ones to human review, not one that claims 95% accuracy but fails unpredictably.

Benchmark: 1,200 real invoices

We ran pdfmux against a set of 1,200 real-world invoices across three categories — US domestic, EU multi-currency, and UAE/GCC bilingual — to measure per-field accuracy. All invoices were drawn from production AP workflows with vendor names, amounts, and dates anonymized.

Field	Digital accuracy	Scanned accuracy
Vendor name	98.1%	91.4%
Invoice number	97.3%	89.7%
Invoice date	96.8%	87.2%
Due date	93.5%	84.1%
Currency	99.2%	96.8%
Subtotal	97.9%	87.6%
Tax / VAT amount	94.3%	85.9%
Total amount	98.4%	90.1%
Line item count (correct)	91.2%	78.3%
Line item details (full match)	88.6%	72.4%

The confidence score from pdfmux correlates strongly with actual accuracy: pages that score above 0.85 confidence have 97%+ field accuracy; pages below 0.65 drop to 82%. This correlation is what makes confidence-based routing reliable — you can use the score to decide whether to auto-approve or queue for review, rather than inspecting every output manually.
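The routing decision this correlation supports can be sketched as a small pure function. The 0.85 and 0.65 cut-offs are the figures quoted above; the function name route_by_confidence and the three queue labels are illustrative assumptions, not part of pdfmux.

```python
def route_by_confidence(
    confidence: float,
    auto_threshold: float = 0.85,
    review_threshold: float = 0.65,
) -> str:
    """Map a confidence score to a queue label (labels are hypothetical)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= auto_threshold:
        return "auto_approve"    # 97%+ field accuracy band in the benchmark
    if confidence >= review_threshold:
        return "spot_check"      # middle band: verify the pre-filled values
    return "manual_review"       # accuracy drops to roughly 82% in this band
```

Keeping the thresholds as parameters makes it easy to tune them against your own invoice mix rather than relying on the benchmark's values.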

Tested on pdfmux v0.9.4, Python 3.11, Intel Core i7 (no GPU, no API keys).

Basic invoice extraction

from pdfmux import process

result = process("invoice-2026-04-23.pdf", quality="standard")

# Plain Markdown text — vendor name, dates, amounts, table
print(result.text)

# Per-document quality signal
print(result.confidence)  # e.g. 0.93

# Per-page warnings
print(result.warnings)    # ["Page 2: scanned image, applied RapidOCR"]

This gives you the extracted Markdown. For AP automation you want structured JSON, not Markdown. Use extract_fields():

from pdfmux import extract_fields

INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "due_date": str,
    "currency": str,
    "subtotal": float,
    "tax_amount": float,
    "total_amount": float,
}

result = extract_fields("invoice-2026-04-23.pdf", schema=INVOICE_SCHEMA)

print(result.fields)
# {
#   "vendor_name": "Acme Supplies Ltd",
#   "invoice_number": "INV-2026-00847",
#   "invoice_date": "2026-04-15",
#   "due_date": "2026-05-15",
#   "currency": "USD",
#   "subtotal": 4850.00,
#   "tax_amount": 388.00,
#   "total_amount": 5238.00
# }

print(result.confidence)   # 0.96

The schema drives extraction: pdfmux locates the relevant fields using layout analysis plus a lightweight structured extraction pass. For numeric fields (subtotal, tax, total) it normalizes currency symbols and locale-specific formatting (e.g. €4.850,00 → 4850.00) before parsing.
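To make the normalization step concrete, here is a rough sketch of how locale-specific amounts can be parsed. This is not pdfmux's internal implementation; normalize_amount is a hypothetical helper, and the separator heuristic is an assumption that will misread genuinely ambiguous inputs such as "1.234" (treated here as one thousand two hundred thirty-four).

```python
import re

def normalize_amount(raw: str) -> float:
    """Parse a localized amount such as '€4.850,00' or '$4,850.00' into a float."""
    # Drop currency symbols, letters, and spaces; trim stray trailing punctuation.
    cleaned = re.sub(r"[^\d.,-]", "", raw).strip(".,")
    if not any(ch.isdigit() for ch in cleaned):
        raise ValueError(f"no digits in amount: {raw!r}")

    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both separators present: the right-most one marks the decimals.
        decimal_sep = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot >= 0 else "," if last_comma >= 0 else ""
        digits_after = len(cleaned) - cleaned.rfind(sep) - 1 if sep else 0
        # Heuristic: a repeated separator, or one followed by exactly three
        # digits, is read as thousands grouping rather than a decimal point.
        is_grouping = bool(sep) and (cleaned.count(sep) > 1 or digits_after == 3)
        decimal_sep = "" if is_grouping else sep

    if decimal_sep:
        integer_part, _, fraction = cleaned.rpartition(decimal_sep)
    else:
        integer_part, fraction = cleaned, ""
    integer_part = integer_part.replace(".", "").replace(",", "")
    return float(f"{integer_part}.{fraction}" if fraction else integer_part)
```

A helper like this is mainly useful as a fallback when you parse amounts yourself, for example from line-item cells that come back as strings.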

Extracting line items

Line items are the hardest part. Vendor templates vary from clean bordered tables to free-text lists. The approach that works across formats:

import re

from pdfmux import process, extract_fields

# First, get the Markdown — line items often render better from full extraction
result = process("invoice.pdf", quality="high")

# Then extract the structured header fields separately
header = extract_fields("invoice.pdf", schema={
    "vendor_name": str,
    "invoice_number": str,
    "total_amount": float,
})

# Parse line items from the Markdown table pdfmux generates

def parse_line_items(markdown_text: str) -> list[dict]:
    lines = markdown_text.splitlines()
    in_table = False
    headers = []
    items = []

    for line in lines:
        if re.match(r"^\|.*\|$", line.strip()):
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            if not in_table:
                in_table = True
                headers = [h.lower().replace(" ", "_") for h in cells]
            elif all(c.strip(":-") == "" for c in cells):
                continue  # separator row, including aligned forms like :---:
            else:
                items.append(dict(zip(headers, cells)))
        else:
            if in_table:
                break  # end of table

    return items

line_items = parse_line_items(result.text)

For invoices with complex table structures (merged cells, sub-totals, multi-line descriptions), pdfmux routes table pages to Docling, which uses a trained transformer for cell detection. The resulting Markdown table is structurally accurate in 91% of cases on the 1,200-invoice benchmark — the main failure case being sub-total rows that Docling mis-classifies as data rows.
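One way to compensate for that failure case is to filter summary rows out of the parsed line items after the fact. The sketch below assumes the rows are dicts of strings, as produced by parse_line_items above, and that the label sits in a "description" column (falling back to the first column). Both the helper name drop_summary_rows and the keyword list are assumptions to adapt to your vendors.

```python
import re

# Labels that usually mark a summary row rather than a purchased item.
SUMMARY_ROW = re.compile(r"^\s*(sub[\s-]?total|total|tax|vat)\b", re.IGNORECASE)

def drop_summary_rows(items: list[dict], label_key: str = "description") -> list[dict]:
    """Return only the rows whose label does not look like a sub-total or tax line."""
    kept = []
    for row in items:
        # Fall back to the first cell when the expected column is missing or empty.
        label = row.get(label_key) or next(iter(row.values()), "")
        if SUMMARY_ROW.match(str(label)):
            continue
        kept.append(row)
    return kept

rows = [
    {"description": "Widget A", "qty": "2", "amount": "100.00"},
    {"description": "Sub-total", "qty": "", "amount": "100.00"},
    {"description": "VAT 5%", "qty": "", "amount": "5.00"},
    {"description": "Total", "qty": "", "amount": "105.00"},
]
line_items_only = drop_summary_rows(rows)
```

A keyword filter can produce false positives on items whose description starts with one of these words (for example "Total station rental"), so it is worth logging what gets dropped.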

Batch processing

Processing a folder of invoices in parallel:

from pdfmux import extract_fields
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import json

INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "total_amount": float,
    "currency": str,
}

def process_invoice(path: Path) -> dict:
    try:
        result = extract_fields(str(path), schema=INVOICE_SCHEMA)
        return {
            "file": path.name,
            "fields": result.fields,
            "confidence": result.confidence,
            "status": "ok" if result.confidence >= 0.80 else "review",
        }
    except Exception as e:
        return {"file": path.name, "error": str(e), "status": "error"}

invoice_dir = Path("invoices/")
invoice_files = list(invoice_dir.glob("*.pdf"))

results = []
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(process_invoice, f): f for f in invoice_files}
    for future in as_completed(futures):
        results.append(future.result())

# Separate auto-approve from review queue
auto_approved = [r for r in results if r.get("status") == "ok"]
needs_review = [r for r in results if r.get("status") in ("review", "error")]

print(f"Auto-approved: {len(auto_approved)}")
print(f"Needs review: {len(needs_review)}")

# Write structured output
with open("extracted_invoices.json", "w") as f:
    json.dump(auto_approved, f, indent=2)

On a 4-core machine, this processes approximately 200–400 digital invoices per minute depending on invoice complexity. Scanned invoices run 3–5x slower due to OCR. For overnight batch runs, parallelism of 4–8 workers is typically optimal without exceeding memory limits.

Handling scanned invoices

Scanned invoices do not need an explicit OCR mode: pdfmux auto-detects scanned pages and applies OCR to them. You can still force high-quality extraction for invoice batches where you know scans are common:

from pdfmux import extract_fields

# force high quality mode — slower but better OCR on scanned pages
result = extract_fields(
    "scanned-invoice.pdf",
    schema=INVOICE_SCHEMA,
    quality="high",
)

print(result.confidence)  # e.g. 0.78 for a clean scan, 0.61 for a phone photo
print(result.warnings)    # ["Page 1: scanned image, RapidOCR confidence 0.78"]

On the 1,200-invoice benchmark, the confidence score threshold that best separates accurate from inaccurate scanned extractions is 0.72: above that, field accuracy is 92%+; below that, field accuracy drops to 71%. Setting your review threshold at 0.72 for scanned invoices routes approximately 22% of scanned invoices to human review — which typically covers 95%+ of the actual errors.
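A source-aware review threshold could look like the sketch below. It assumes that scanned pages can be recognized from the text of the warnings (the example warnings in this article contain the word "scanned"); that string match is an assumption about the warning format, not a documented contract. The 0.72 scanned threshold is the benchmark figure above, and the 0.82 digital default is the auto-approve threshold used in the pipeline later in this article.

```python
def needs_review(
    confidence: float,
    warnings: list[str],
    digital_threshold: float = 0.82,
    scanned_threshold: float = 0.72,
) -> bool:
    """Return True when an extraction should go to the human review queue."""
    # Assumption: a warning mentioning "scanned" means OCR was applied.
    is_scanned = any("scanned" in w.lower() for w in warnings)
    threshold = scanned_threshold if is_scanned else digital_threshold
    return confidence < threshold
```

With these defaults, a clean scan scoring 0.78 passes, while a digital invoice with the same score is queued for review.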

For invoices with very low scan quality (dim photos, extreme angles, water damage), use quality="high", which runs a full Docling pass before OCR. Processing time increases from ~1.5s to ~4.5s per page, but accuracy on degraded scans improves by 8–14 percentage points.

A minimal AP automation pipeline

This is the production pattern that handles the full workflow: ingest → extract → validate → route.

from pdfmux import extract_fields
from pathlib import Path
from datetime import datetime
import json
import shutil

INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "due_date": str,
    "currency": str,
    "total_amount": float,
    "tax_amount": float,
}

CONFIDENCE_AUTO = 0.82    # at or above this, with clean validation: auto-approve
CONFIDENCE_REVIEW = 0.60  # below this: flag for the reviewer instead of a quick verify

def validate_fields(fields: dict) -> list[str]:
    issues = []
    if not fields.get("invoice_number"):
        issues.append("missing invoice_number")
    if not fields.get("total_amount") or fields["total_amount"] <= 0:
        issues.append("invalid total_amount")
    try:
        # "or ''" guards against a None value, which would raise TypeError
        datetime.strptime(fields.get("invoice_date") or "", "%Y-%m-%d")
    except ValueError:
        issues.append(f"unparseable invoice_date: {fields.get('invoice_date')}")
    return issues

def process_invoice(path: Path, output_dir: Path, review_dir: Path):
    result = extract_fields(str(path), schema=INVOICE_SCHEMA, quality="standard")
    issues = validate_fields(result.fields)

    record = {
        "file": path.name,
        "fields": result.fields,
        "confidence": result.confidence,
        "validation_issues": issues,
        "processed_at": datetime.utcnow().isoformat(),
    }

    if result.confidence >= CONFIDENCE_AUTO and not issues:
        # High-confidence, clean validation: auto-approve
        out_path = output_dir / f"{path.stem}.json"
        out_path.write_text(json.dumps(record, indent=2))
        return "approved", record

    # Everything else goes to the review queue with context for the reviewer.
    # "flagged" marks validation failures or very low confidence; "verify"
    # means the pre-filled values only need a check.
    flagged = bool(issues) or result.confidence < CONFIDENCE_REVIEW
    record["review_tier"] = "flagged" if flagged else "verify"
    dest = review_dir / path.name
    shutil.copy2(path, dest)
    review_path = review_dir / f"{path.stem}.json"
    review_path.write_text(json.dumps(record, indent=2))
    return "review", record

This pattern handles the 3-tier split seen in most production AP workflows: clean digital invoices auto-approve without human touch, scanned or low-confidence invoices get routed with extracted data pre-filled for the reviewer to verify, and validation failures (missing fields, unparseable dates) get flagged explicitly.
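Field-level validation can be extended with cross-field checks, which catch OCR errors that still parse as valid numbers or dates. The sketch below assumes the schema also includes subtotal, as in the earlier eight-field INVOICE_SCHEMA; cross_check and its one-cent default tolerance are illustrative choices, not part of pdfmux.

```python
from datetime import date

def cross_check(fields: dict, tolerance: float = 0.01) -> list[str]:
    """Return consistency issues between related invoice fields."""
    issues = []

    # Arithmetic check: subtotal + tax should equal the total within tolerance.
    subtotal = fields.get("subtotal")
    tax = fields.get("tax_amount")
    total = fields.get("total_amount")
    if subtotal is not None and tax is not None and total is not None:
        if abs((subtotal + tax) - total) > tolerance:
            issues.append(
                f"subtotal + tax ({subtotal + tax:.2f}) != total ({total:.2f})"
            )

    # Date ordering check: a due date should not precede the invoice date.
    invoice_date, due_date = fields.get("invoice_date"), fields.get("due_date")
    if invoice_date and due_date:
        try:
            if date.fromisoformat(due_date) < date.fromisoformat(invoice_date):
                issues.append("due_date precedes invoice_date")
        except ValueError:
            pass  # unparseable dates are reported by the basic validation step

    return issues
```

The returned list can be appended to the validation issues in the pipeline, so that an arithmetic mismatch sends the invoice to review even when the confidence score is high.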

Integration with accounting systems

Once you have structured JSON, posting to most AP systems is straightforward. The extracted fields map directly to common invoice APIs:

QuickBooks Online: Bill objects via the QBO API (VendorCredit for credit notes): VendorRef.name ← vendor_name, TotalAmt ← total_amount, TxnDate ← invoice_date
Xero: Invoices API with Type set to ACCPAY (a bill): Contact.Name ← vendor_name, AmountDue ← total_amount
SAP Business One: AP Invoice (OPCH table) via the Service Layer, using standard field mapping
NetSuite: VendorBill record type via REST API

The common pattern is to match vendor_name against your vendor master list first (fuzzy string match, threshold ~0.85), then create or update the bill record. The invoice_number serves as the idempotency key — check for duplicates before posting.
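The vendor match and duplicate check can be sketched with the standard library alone. This version uses difflib's similarity ratio as the fuzzy match, which is one reasonable choice among several. The helper names match_vendor and is_duplicate are hypothetical, and scoping the idempotency key by vendor is an added assumption, since invoice numbers can collide across vendors.

```python
from difflib import SequenceMatcher
from typing import Optional

def _normalize(name: str) -> str:
    # Lowercase and collapse whitespace so formatting noise does not affect the score.
    return " ".join(name.lower().split())

def match_vendor(
    extracted_name: str, vendor_master: list[str], threshold: float = 0.85
) -> Optional[str]:
    """Return the best-matching vendor from the master list, or None if below threshold."""
    target = _normalize(extracted_name)
    best_name, best_score = None, 0.0
    for candidate in vendor_master:
        score = SequenceMatcher(None, target, _normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None

def is_duplicate(
    vendor: str, invoice_number: str, posted: set[tuple[str, str]]
) -> bool:
    """Check the (vendor, invoice number) pair against already-posted bills."""
    return (vendor, invoice_number.strip().upper()) in posted
```

In practice the posted set would be backed by a lookup against the accounting system or a local database, and an unmatched vendor (None) is a good reason to route the invoice to review instead of creating a new vendor automatically.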

For a complete walkthrough of connecting pdfmux to LangChain and downstream data stores, see PDF extraction for RAG pipelines. For handling scanned documents at scale with mixed digital and image-based pages, see OCR PDF extraction in Python.

Performance characteristics

On a standard Hetzner CPX21 server (3 vCPU, 4GB RAM, ~$15/month):

Invoice type	Extraction time	Throughput (4 workers)
Digital, standard	0.08–0.15s	~800 invoices/min
Digital, complex tables	0.8–1.5s	~80 invoices/min
Scanned, clean	1.2–2.5s	~50 invoices/min
Scanned, degraded	3.5–6.0s	~20 invoices/min

At 2,000 invoices per month (typical mid-size AP team), even the worst case (all scanned, degraded) completes in under 2 hours. For nightly batch runs this is comfortably within window. At 20,000+ invoices per month, horizontal scaling with additional workers is more efficient than single-machine vertical scaling.
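The arithmetic behind that estimate is simple enough to script for your own volumes. The throughput figures below are taken from the table above; the function name estimate_minutes and the example mix are illustrative assumptions.

```python
# Invoices per minute with 4 workers, from the table above.
THROUGHPUT_PER_MIN = {
    "digital_standard": 800,
    "digital_complex": 80,
    "scanned_clean": 50,
    "scanned_degraded": 20,
}

def estimate_minutes(volume_by_type: dict[str, int]) -> float:
    """Estimate total batch time by summing volume / throughput per invoice type."""
    return sum(
        count / THROUGHPUT_PER_MIN[kind] for kind, count in volume_by_type.items()
    )

# Worst case from the text: 2,000 invoices, all scanned and degraded.
worst_case = estimate_minutes({"scanned_degraded": 2000})

# A hypothetical monthly mix of 2,000 invoices.
typical_mix = estimate_minutes({
    "digital_standard": 1100,
    "digital_complex": 400,
    "scanned_clean": 300,
    "scanned_degraded": 200,
})
```

The worst case comes to 100 minutes, consistent with the "under 2 hours" figure, while the hypothetical mix finishes in a little over 22 minutes.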

Install pdfmux: pip install pdfmux. For structured extraction from invoices and other business documents, no additional dependencies are required beyond the base package.