Direct answer: Use pdfmux with extract_fields() to pull structured data from invoice PDFs. It extracts 7 standard invoice fields at 94–98% field-level accuracy on digital invoices and 84–91% on scanned invoices — CPU-only, no API keys, no per-page cost. Install: pip install pdfmux. For AP automation at volume, combine field extraction with confidence filtering: auto-approve high-confidence batches, route low-confidence pages to human review.
Why invoice PDF extraction is harder than it looks
Invoices seem like a solved problem. They’re structured documents with consistent fields: vendor, date, amount, line items. In practice, they’re among the most structurally inconsistent documents in business workflows.
A typical accounts payable team at a mid-size company receives invoices from dozens or hundreds of vendors. Each vendor has its own template. Some are digital PDFs generated by accounting software. Some are scanned images — faxed or photographed on a phone. Some are bilingual. Some have tables with 40+ line items. Some have no formal table structure at all, just line items formatted as free text.
The specific failure modes by document type:
| Document type | Failure mode | Frequency |
|---|---|---|
| Digital, standard template | Minimal — usually clean | ~55% of invoices |
| Digital, non-standard layout | Reading order, merged cells | ~20% of invoices |
| Scanned, clean scan | OCR errors on amounts | ~15% of invoices |
| Scanned, low quality | Significant field errors | ~7% of invoices |
| Bilingual (e.g. Arabic/English) | RTL text order, field mapping | ~3% of invoices |
At a company processing 2,000 invoices per month, even a 5% error rate means 100 invoices requiring manual correction. The goal is a pipeline that extracts high-confidence invoices automatically and routes uncertain cases to human review, not one that claims 95% accuracy but fails unpredictably.
Benchmark: 1,200 real invoices
We ran pdfmux against a set of 1,200 real-world invoices across three categories — US domestic, EU multi-currency, and UAE/GCC bilingual — to measure per-field accuracy. All invoices were drawn from production AP workflows with vendor names, amounts, and dates anonymized.
| Field | Digital accuracy | Scanned accuracy |
|---|---|---|
| Vendor name | 98.1% | 91.4% |
| Invoice number | 97.3% | 89.7% |
| Invoice date | 96.8% | 87.2% |
| Due date | 93.5% | 84.1% |
| Currency | 99.2% | 96.8% |
| Subtotal | 97.9% | 87.6% |
| Tax / VAT amount | 94.3% | 85.9% |
| Total amount | 98.4% | 90.1% |
| Line item count (correct) | 91.2% | 78.3% |
| Line item details (full match) | 88.6% | 72.4% |
The confidence score from pdfmux correlates strongly with actual accuracy: pages that score above 0.85 confidence have 97%+ field accuracy; pages below 0.65 drop to 82%. This correlation is what makes confidence-based routing reliable — you can use the score to decide whether to auto-approve or queue for review, rather than inspecting every output manually.
Tested on pdfmux v0.9.4, Python 3.11, Intel Core i7 (no GPU, no API keys).
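Before relying on those cut-offs for your own vendor mix, it may be worth checking the calibration on a small hand-labeled sample. The sketch below is illustrative only and is not part of pdfmux: the `SAMPLE` data is hypothetical, the helper names are made up for this example, and the band edges simply mirror the 0.65 and 0.85 figures quoted above.

```python
from collections import defaultdict

# Hypothetical labeled sample: (confidence, correct_fields, total_fields) per page.
# In practice these would come from comparing extracted fields with ground truth.
SAMPLE = [
    (0.95, 8, 8),
    (0.91, 8, 8),
    (0.88, 7, 8),
    (0.80, 7, 8),
    (0.70, 6, 8),
    (0.60, 6, 8),
    (0.55, 7, 8),
]

def bucket_for(confidence: float) -> str:
    """Assign a page to a confidence band using the 0.65 / 0.85 cut-offs."""
    if confidence > 0.85:
        return "high"
    if confidence >= 0.65:
        return "medium"
    return "low"

def accuracy_by_bucket(sample) -> dict:
    """Return field-level accuracy (correct / total) for each confidence band."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for confidence, n_correct, n_total in sample:
        band = bucket_for(confidence)
        correct[band] += n_correct
        total[band] += n_total
    return {band: correct[band] / total[band] for band in total}

if __name__ == "__main__":
    for band, acc in accuracy_by_bucket(SAMPLE).items():
        print(f"{band}: {acc:.1%}")
```

If the high band does not come out clearly more accurate than the low band on your own data, the thresholds used later in this article probably need adjusting before you auto-approve anything.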
Basic invoice extraction
from pdfmux import process
result = process("invoice-2026-04-23.pdf", quality="standard")
# Plain Markdown text — vendor name, dates, amounts, table
print(result.text)
# Per-document quality signal
print(result.confidence) # e.g. 0.93
# Per-page warnings
print(result.warnings) # ["Page 2: scanned image, applied RapidOCR"]
This gives you the extracted Markdown. For AP automation you want structured JSON, not Markdown. Use extract_fields():
from pdfmux import extract_fields
INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "due_date": str,
    "currency": str,
    "subtotal": float,
    "tax_amount": float,
    "total_amount": float,
}
result = extract_fields("invoice-2026-04-23.pdf", schema=INVOICE_SCHEMA)
print(result.fields)
# {
# "vendor_name": "Acme Supplies Ltd",
# "invoice_number": "INV-2026-00847",
# "invoice_date": "2026-04-15",
# "due_date": "2026-05-15",
# "currency": "USD",
# "subtotal": 4850.00,
# "tax_amount": 388.00,
# "total_amount": 5238.00
# }
print(result.confidence) # 0.96
The schema drives extraction: pdfmux locates the relevant fields using layout analysis plus a lightweight structured extraction pass. For numeric fields (subtotal, tax, total) it normalizes currency symbols and locale-specific formatting (e.g. €4.850,00 → 4850.00) before parsing.
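To make the locale normalization step concrete, here is a minimal sketch of one common heuristic. It is not pdfmux's actual implementation, which the source does not show; `normalize_amount` is a hypothetical helper that assumes whichever of `.` or `,` appears last is the decimal separator.

```python
import re

def normalize_amount(raw: str) -> float:
    """Parse a money string such as '€4.850,00' or '$4,850.00' into a float."""
    # Keep only digits, separators, and a minus sign; drops currency symbols.
    cleaned = re.sub(r"[^\d.,-]", "", raw)
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot == -1 and last_comma == -1:
        return float(cleaned)

    # The separator that appears last is assumed to mark the decimals.
    decimal_sep = "," if last_comma > last_dot else "."
    group_sep = "." if decimal_sep == "," else ","
    digits_after = len(cleaned) - cleaned.rfind(decimal_sep) - 1

    if group_sep not in cleaned and digits_after == 3:
        # '4,850' or '4.850': a lone separator followed by exactly three digits
        # is assumed to group thousands, since money rarely has three decimals.
        return float(cleaned.replace(decimal_sep, ""))

    cleaned = cleaned.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

if __name__ == "__main__":
    for sample in ("€4.850,00", "$4,850.00", "4,850", "12,50"):
        print(sample, "->", normalize_amount(sample))
```

The three-digit rule is a guess that suits invoice totals but would misread a genuine three-decimal value such as a unit price of 1.234, so treat it as a starting point rather than a complete parser.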
Extracting line items
Line items are the hardest part. Vendor templates vary from clean bordered tables to free-text lists. The approach that works across formats:
from pdfmux import process, extract_fields
# First, get the Markdown — line items often render better from full extraction
result = process("invoice.pdf", quality="high")
# Then extract the structured header fields separately
header = extract_fields("invoice.pdf", schema={
    "vendor_name": str,
    "invoice_number": str,
    "total_amount": float,
})
# Parse line items from the Markdown table pdfmux generates
import re

def parse_line_items(markdown_text: str) -> list[dict]:
    """Return the rows of the first Markdown table as a list of dicts."""
    in_table = False
    headers = []
    items = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if re.match(r"^\|.*\|$", stripped):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            if not in_table:
                # First table row: use it as the header row
                in_table = True
                headers = [h.lower().replace(" ", "_") for h in cells]
            elif all(set(c) <= set("-: ") for c in cells):
                continue  # separator row such as |---|:---:|
            else:
                items.append(dict(zip(headers, cells)))
        elif in_table:
            break  # first non-table line after the table ends it
    return items
line_items = parse_line_items(result.text)
For invoices with complex table structures (merged cells, sub-totals, multi-line descriptions), pdfmux routes table pages to Docling, which uses a trained transformer for cell detection. The resulting Markdown table is structurally accurate in 91% of cases on the 1,200-invoice benchmark — the main failure case being sub-total rows that Docling mis-classifies as data rows.
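One way to soften the sub-total failure case is to filter summary rows after parsing. The sketch below is an assumption-laden post-processing step, not something pdfmux or Docling provides: `drop_summary_rows` and the label list are hypothetical, and the function expects rows shaped like the output of `parse_line_items` (dicts of string cells).

```python
import re

# Labels that usually mark summary rows rather than purchased items.
# The list is illustrative; extend it for the languages your vendors use.
SUMMARY_LABEL = re.compile(
    r"^\s*(sub[\s-]?total|total|tax|vat|amount due|balance due)\b",
    re.IGNORECASE,
)

def drop_summary_rows(items: list[dict]) -> list[dict]:
    """Remove rows whose first non-empty cell looks like a summary label."""
    kept = []
    for row in items:
        # Find the first cell with text; summary labels usually sit there.
        first_text = next((v for v in row.values() if v.strip()), "")
        if SUMMARY_LABEL.match(first_text):
            continue
        kept.append(row)
    return kept

if __name__ == "__main__":
    rows = [
        {"description": "Widget A", "qty": "2", "amount": "100.00"},
        {"description": "Sub-total", "qty": "", "amount": "100.00"},
        {"description": "Total", "qty": "", "amount": "105.00"},
    ]
    print(drop_summary_rows(rows))
```

A genuine item whose description starts with one of these words (for example "Total station rental") would also be dropped, so rows removed this way are worth logging rather than discarding silently.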
Batch processing
Processing a folder of invoices in parallel:
from pdfmux import extract_fields
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import json
INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "total_amount": float,
    "currency": str,
}
def process_invoice(path: Path) -> dict:
    try:
        result = extract_fields(str(path), schema=INVOICE_SCHEMA)
        return {
            "file": path.name,
            "fields": result.fields,
            "confidence": result.confidence,
            "status": "ok" if result.confidence >= 0.80 else "review",
        }
    except Exception as e:
        return {"file": path.name, "error": str(e), "status": "error"}
invoice_dir = Path("invoices/")
invoice_files = list(invoice_dir.glob("*.pdf"))
results = []
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(process_invoice, f): f for f in invoice_files}
    for future in as_completed(futures):
        results.append(future.result())
# Separate auto-approve from review queue
auto_approved = [r for r in results if r.get("status") == "ok"]
needs_review = [r for r in results if r.get("status") in ("review", "error")]
print(f"Auto-approved: {len(auto_approved)}")
print(f"Needs review: {len(needs_review)}")
# Write structured output
with open("extracted_invoices.json", "w") as f:
    json.dump(auto_approved, f, indent=2)
On a 4-core machine, this processes approximately 200–400 digital invoices per minute depending on invoice complexity. Scanned invoices run 3–5x slower due to OCR. For overnight batch runs, parallelism of 4–8 workers is typically optimal without exceeding memory limits.
Handling scanned invoices
Scanned invoices do not need an explicit OCR mode: pdfmux auto-detects scanned pages. You can still force high-quality extraction for invoice batches where you know scans are common:
from pdfmux import extract_fields
# force high quality mode — slower but better OCR on scanned pages
result = extract_fields(
    "scanned-invoice.pdf",
    schema=INVOICE_SCHEMA,
    quality="high",
)
print(result.confidence) # e.g. 0.78 for a clean scan, 0.61 for a phone photo
print(result.warnings) # ["Page 1: scanned image, RapidOCR confidence 0.78"]
On the 1,200-invoice benchmark, the confidence score threshold that best separates accurate from inaccurate scanned extractions is 0.72: above that, field accuracy is 92%+; below that, field accuracy drops to 71%. Setting your review threshold at 0.72 for scanned invoices routes approximately 22% of scanned invoices to human review — which typically covers 95%+ of the actual errors.
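The trade-off described here (share of invoices sent to review versus share of errors caught) can be measured for any candidate threshold once you have a labeled sample. The following is a minimal sketch under stated assumptions: `PAGES` is hypothetical data, `evaluate_threshold` is a made-up helper, and errors are tracked per page rather than per field for simplicity.

```python
# Hypothetical hand-labeled sample of scanned pages: (confidence, page_has_error).
PAGES = [
    (0.92, False), (0.90, False), (0.88, False), (0.85, False), (0.80, False),
    (0.79, False), (0.75, True), (0.70, True), (0.65, True), (0.60, False),
]

def evaluate_threshold(pages, threshold):
    """Return (review_rate, error_capture) for a review threshold.

    review_rate: share of pages whose confidence falls below the threshold.
    error_capture: share of all erroneous pages that land in the review queue.
    """
    reviewed = [(c, err) for c, err in pages if c < threshold]
    total_errors = sum(1 for _, err in pages if err)
    caught = sum(1 for _, err in reviewed if err)
    review_rate = len(reviewed) / len(pages)
    error_capture = caught / total_errors if total_errors else 1.0
    return review_rate, error_capture

if __name__ == "__main__":
    for t in (0.60, 0.72, 0.80):
        rate, capture = evaluate_threshold(PAGES, t)
        print(f"threshold={t:.2f} review_rate={rate:.0%} error_capture={capture:.0%}")
```

Sweeping a few thresholds like this shows where extra review volume stops buying additional error capture, which is the point at which a figure such as 0.72 becomes a reasonable choice for a given invoice mix.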
For invoices with very low scan quality (dim photos, extreme angles, water damage), consider the quality="high" mode which uses a full Docling pass before OCR. Processing time increases from ~1.5s to ~4.5s per page, but accuracy on degraded scans improves by 8–14 percentage points.
A minimal AP automation pipeline
This is the production pattern that handles the full workflow: ingest → extract → validate → route.
from pdfmux import extract_fields
from pathlib import Path
from datetime import datetime
import json
import shutil
INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "due_date": str,
    "currency": str,
    "total_amount": float,
    "tax_amount": float,
}
CONFIDENCE_AUTO = 0.82    # at or above this, with clean validation: auto-approve
CONFIDENCE_REVIEW = 0.60  # below this: flag the review record as low confidence

def validate_fields(fields: dict) -> list[str]:
    issues = []
    if not fields.get("invoice_number"):
        issues.append("missing invoice_number")
    if not fields.get("total_amount") or fields["total_amount"] <= 0:
        issues.append("invalid total_amount")
    try:
        # "or ''" guards against a None value, which would raise TypeError
        datetime.strptime(fields.get("invoice_date") or "", "%Y-%m-%d")
    except ValueError:
        issues.append(f"unparseable invoice_date: {fields.get('invoice_date')}")
    return issues

def process_invoice(path: Path, output_dir: Path, review_dir: Path):
    result = extract_fields(str(path), schema=INVOICE_SCHEMA, quality="standard")
    issues = validate_fields(result.fields)
    record = {
        "file": path.name,
        "fields": result.fields,
        "confidence": result.confidence,
        "validation_issues": issues,
        "processed_at": datetime.utcnow().isoformat(),
    }
    if result.confidence >= CONFIDENCE_AUTO and not issues:
        # High-confidence, clean validation: auto-approve
        out_path = output_dir / f"{path.stem}.json"
        out_path.write_text(json.dumps(record, indent=2))
        return "approved", record
    # Everything else goes to the review queue with context for the reviewer
    record["low_confidence"] = result.confidence < CONFIDENCE_REVIEW
    shutil.copy2(path, review_dir / path.name)
    review_path = review_dir / f"{path.stem}.json"
    review_path.write_text(json.dumps(record, indent=2))
    return "review", record
This pattern handles the 3-tier split seen in most production AP workflows: clean digital invoices auto-approve without human touch, scanned or low-confidence invoices get routed with extracted data pre-filled for the reviewer to verify, and validation failures (missing fields, unparseable dates) get flagged explicitly.
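The validation step can be extended with an arithmetic cross-check, which tends to catch OCR errors on amounts that still parse as valid numbers. This is a hedged sketch rather than part of the pipeline above: `check_totals` is a hypothetical helper, and it assumes `subtotal` has been added to the schema alongside `tax_amount` and `total_amount`.

```python
def check_totals(fields: dict, tolerance: float = 0.01) -> list[str]:
    """Flag invoices where subtotal + tax does not match the stated total."""
    issues = []
    subtotal = fields.get("subtotal")
    tax = fields.get("tax_amount")
    total = fields.get("total_amount")
    if None in (subtotal, tax, total):
        return issues  # not enough data to cross-check
    expected = subtotal + tax
    if abs(expected - total) > tolerance:
        issues.append(f"subtotal + tax ({expected:.2f}) != total ({total:.2f})")
    return issues

if __name__ == "__main__":
    # Values taken from the example output earlier in the article.
    ok = {"subtotal": 4850.00, "tax_amount": 388.00, "total_amount": 5238.00}
    # Same invoice with two digits of the total transposed.
    bad = {"subtotal": 4850.00, "tax_amount": 388.00, "total_amount": 5328.00}
    print(check_totals(ok))
    print(check_totals(bad))
```

The returned list has the same shape as the output of `validate_fields`, so the two could be concatenated before the routing decision. Invoices with discounts, shipping, or withholding lines will not satisfy this identity, so it suits vendors whose totals are known to be subtotal plus tax.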
Integration with accounting systems
Once you have structured JSON, posting to most AP systems is straightforward. The extracted fields map directly to common invoice APIs:
- QuickBooks Online: VendorCredit or Bill objects via the QBO API (VendorRef.name ← vendor_name, TotalAmt ← total_amount, TxnDate ← invoice_date)
- Xero: Invoices API (Contact.Name ← vendor_name, AmountDue ← total_amount)
- SAP Business One: OPCH (AP Invoice) table via the Service Layer, standard field mapping
- NetSuite: VendorBill record type via REST API
The common pattern is to match vendor_name against your vendor master list first (fuzzy string match, threshold ~0.85), then create or update the bill record. The invoice_number serves as the idempotency key — check for duplicates before posting.
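The vendor matching and duplicate check can be sketched with the standard library alone. This is one possible approach, not a prescribed one: `match_vendor` and `should_post` are hypothetical helpers, `difflib.SequenceMatcher` stands in for whatever fuzzy matcher you prefer, and the in-memory set stands in for a lookup against your AP system.

```python
from difflib import SequenceMatcher

def _normalize(name: str) -> str:
    """Lowercase, drop basic punctuation, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def match_vendor(extracted_name, vendor_master, threshold=0.85):
    """Return the best-matching master vendor name, or None below threshold."""
    target = _normalize(extracted_name)
    best_name, best_score = None, 0.0
    for candidate in vendor_master:
        score = SequenceMatcher(None, target, _normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None

def should_post(fields: dict, posted_invoice_numbers: set) -> bool:
    """Use invoice_number as the idempotency key; skip blanks and repeats."""
    number = fields.get("invoice_number")
    if not number or number in posted_invoice_numbers:
        return False
    posted_invoice_numbers.add(number)
    return True

if __name__ == "__main__":
    master = ["ACME Supplies Ltd", "Apex Supply Co"]
    print(match_vendor("Acme Supplies Ltd.", master))
    posted = set()
    print(should_post({"invoice_number": "INV-2026-00847"}, posted))
    print(should_post({"invoice_number": "INV-2026-00847"}, posted))
```

Because two vendors can issue the same invoice number, keying duplicates on the pair of matched vendor and invoice number may be safer in practice than the number alone.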
For a complete walkthrough of connecting pdfmux to LangChain and downstream data stores, see PDF extraction for RAG pipelines. For handling scanned documents at scale with mixed digital and image-based pages, see OCR PDF extraction in Python.
Performance characteristics
On a standard Hetzner CPX21 server (3 vCPU, 4GB RAM, ~$15/month):
| Invoice type | Extraction time | Throughput (4 workers) |
|---|---|---|
| Digital, standard | 0.08–0.15s | ~800 invoices/min |
| Digital, complex tables | 0.8–1.5s | ~80 invoices/min |
| Scanned, clean | 1.2–2.5s | ~50 invoices/min |
| Scanned, degraded | 3.5–6.0s | ~20 invoices/min |
At 2,000 invoices per month (typical mid-size AP team), even the worst case (all scanned, degraded) completes in under 2 hours. For nightly batch runs this is comfortably within window. At 20,000+ invoices per month, horizontal scaling with additional workers is more efficient than single-machine vertical scaling.
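The batch-window estimate follows directly from the throughput column. The sketch below reproduces the arithmetic using the table's figures; `estimate_minutes` is a hypothetical helper, the mixed-volume example is invented for illustration, and the calculation assumes each invoice type runs at its listed throughput with no overlap between types.

```python
# Invoices per minute with 4 workers, taken from the table above.
THROUGHPUT_PER_MIN = {
    "digital_standard": 800,
    "digital_complex": 80,
    "scanned_clean": 50,
    "scanned_degraded": 20,
}

def estimate_minutes(volume_by_type: dict[str, int]) -> float:
    """Sum the time each invoice type needs at its table throughput."""
    return sum(
        count / THROUGHPUT_PER_MIN[kind]
        for kind, count in volume_by_type.items()
    )

if __name__ == "__main__":
    # Worst case from the text: 2,000 invoices, all scanned and degraded.
    worst = estimate_minutes({"scanned_degraded": 2000})
    print(f"worst case: {worst:.0f} min")

    # A hypothetical mixed month of 2,000 invoices.
    mixed = estimate_minutes({
        "digital_standard": 1100,
        "digital_complex": 400,
        "scanned_clean": 300,
        "scanned_degraded": 200,
    })
    print(f"mixed month: {mixed:.1f} min")
```

The worst case works out to 2,000 / 20 = 100 minutes, which is where the "under 2 hours" figure comes from.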
Install pdfmux: pip install pdfmux. For structured extraction from invoices and other business documents, no additional dependencies are required beyond the base package.