Direct answer: For digital fillable forms (AcroForms), use pypdf — it reads field names and values directly from the PDF structure in under 10 lines of code. For scanned forms, XFA forms (used by the IRS and insurance companies), or hybrid documents with both form fields and prose, use pdfmux — it routes to the right extractor automatically and returns structured Markdown. Using pypdf alone on a scanned or XFA form returns an empty dict with no error. That silent failure is the most common production bug in PDF form pipelines.
## Why form extraction fails silently
PDF forms come in three fundamentally different types, and most tutorials only cover one.
**AcroForm (ISO 32000)** is the standard. Created by Adobe Acrobat, used in most fillable PDFs — job applications, W-9s, intake questionnaires. Field names and values live in the PDF’s `/AcroForm` dictionary. pypdf reads these directly in one call.

**XFA (XML Forms Architecture)** is Adobe’s XML-based format. Used heavily in government forms (IRS 1040, W-2, state tax filings), legacy insurance applications, and large financial institutions. XFA stores data in embedded XML, not in the standard PDF field structure. pypdf returns `{}` on XFA. So does pdfplumber. So does PyMuPDF.

**Scanned forms** have no digital fields at all — just image data. They require OCR; no Python PDF library extracts field values from scanned forms without an OCR layer.
The failure mode for XFA and scanned forms is the same: you call the extraction function, get an empty dict or empty string, and your pipeline proceeds with no data and no error raised. In production, this often surfaces days later when downstream systems report missing fields.
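A cheap defence is to make the empty result loud. A minimal guard to call on whatever your extractor returns (the helper name is illustrative, not part of any library):

```python
def require_fields(fields: dict, pdf_path: str) -> dict:
    # An empty result usually means XFA or a scanned form,
    # not a blank-but-valid AcroForm — fail loudly instead of proceeding
    if not fields:
        raise ValueError(
            f"No form fields extracted from {pdf_path}; "
            "document may be XFA or scanned"
        )
    return fields
```

Calling `require_fields(reader.get_fields() or {}, path)` right after extraction turns the silent failure into an immediate, attributable error.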
Here’s how to handle all three correctly.
## AcroForms with pypdf

```bash
pip install pypdf
```
```python
from pypdf import PdfReader

def extract_form_fields(pdf_path: str) -> dict:
    reader = PdfReader(pdf_path)
    fields = reader.get_fields()
    if not fields:
        return {}
    return {name: field.value for name, field in fields.items()}

# Example: W-9 form
data = extract_form_fields("w9-signed.pdf")
print(data)
# {'f1_1[0]': 'Acme Corp', 'f1_2[0]': '', 'f1_4[0]': '12-3456789', ...}
```
pypdf returns raw field names as they appear in the PDF spec — often mangled strings like `f1_1[0]` or `topmostSubform[0].Page1[0].f1_1[0]`. You need a field map for any specific form schema:
```python
W9_FIELD_MAP = {
    'f1_1[0]': 'legal_name',
    'f1_2[0]': 'business_name',
    'f1_3[0]': 'tax_classification',
    'f1_4[0]': 'ein',
    'f1_5[0]': 'street_address',
    'f1_6[0]': 'city_state_zip',
}

def extract_w9(pdf_path: str) -> dict:
    raw = extract_form_fields(pdf_path)
    return {
        human_name: raw.get(field_id, '')
        for field_id, human_name in W9_FIELD_MAP.items()
    }

REQUIRED_W9_FIELDS = ['legal_name', 'ein']

def validate_w9(data: dict) -> list[str]:
    return [f for f in REQUIRED_W9_FIELDS if not data.get(f)]
```
This pattern works reliably for known form schemas. pypdf 4.x has a field extraction accuracy of 94% on clean AcroForms — the 6% failure rate is mostly corrupted field value encodings in older Acrobat versions.
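For instance, a mapped W-9 whose EIN box was left blank fails validation (the validator is repeated here so the snippet runs standalone):

```python
REQUIRED_W9_FIELDS = ['legal_name', 'ein']

def validate_w9(data: dict) -> list[str]:
    # Returns the names of required fields that are missing or empty
    return [f for f in REQUIRED_W9_FIELDS if not data.get(f)]

complete = {'legal_name': 'Acme Corp', 'ein': '12-3456789'}
blank_ein = {'legal_name': 'Acme Corp', 'ein': ''}

print(validate_w9(complete))   # []
print(validate_w9(blank_ein))  # ['ein']
```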
## Comparison: pypdf vs pdfrw vs pdfmux
| Tool | AcroForms | XFA | Scanned | Mixed doc | Output format |
|---|---|---|---|---|---|
| pypdf 4.x | ✓ 94% | ✗ (empty) | ✗ | Partial (fields only) | Field dict |
| pdfrw | ✓ 91% | ✗ (empty) | ✗ | Partial (fields only) | Field dict |
| PyMuPDF (fitz) | ✓ 93% | ✗ (empty) | Via plugin | Partial | Field dict |
| pdfmux | ✓ 97% | ✓ (text fallback) | ✓ (OCR) | ✓ (full text) | Structured Markdown |
pdfmux does not extract field names as structured key-value pairs for AcroForms — it extracts all document content as Markdown. For AcroForms with a known schema, pypdf is the right tool. pdfmux’s value is in the cases where pypdf returns empty.
## XFA forms: the hard case
XFA is used in IRS forms (1040-EZ, W-2, W-4), many state government forms, and legacy insurance applications. If you process tax documents or compliance paperwork at volume, you will encounter XFA.
Python’s pypdf exposes the raw XFA XML through the PDF object tree. A minimal XFA extractor:
```python
from pypdf import PdfReader
import xml.etree.ElementTree as ET

def extract_xfa_data(pdf_path: str) -> dict | None:
    reader = PdfReader(pdf_path)
    try:
        acroform = reader.trailer['/Root']['/AcroForm']
        xfa = acroform.get('/XFA')
    except (KeyError, TypeError):
        return None  # not XFA
    if xfa is None:
        return None

    # XFA is either a single stream or a name-stream array;
    # collect all XML chunks
    xml_chunks = []
    if hasattr(xfa, 'get_data'):
        xml_chunks.append(xfa.get_data())
    elif hasattr(xfa, '__iter__'):
        for item in xfa:
            try:
                item = item.get_object()  # resolve indirect references
                if hasattr(item, 'get_data'):
                    xml_chunks.append(item.get_data())
            except Exception:
                continue
    if not xml_chunks:
        return None

    # Parse looking for the datasets namespace
    ns = {'xfa': 'http://www.xfa.org/schema/xfa-data/1.0/'}
    for chunk in reversed(xml_chunks):
        try:
            root = ET.fromstring(chunk)
        except ET.ParseError:
            continue
        data_el = root.find('.//xfa:data', ns)
        if data_el is not None:
            return {
                child.tag.split('}')[-1]: child.text
                for child in data_el.iter()
                if child.text and child.text.strip()
            }
    return None
```
This handles IRS-format XFA with ~78% field accuracy. The 22% failure rate is mostly non-standard namespace declarations from third-party form creators. For those failures, fall back to pdfmux:
```python
from pdfmux import process

def extract_xfa_with_fallback(pdf_path: str) -> dict:
    xfa_data = extract_xfa_data(pdf_path)
    if xfa_data:
        return {'source': 'xfa_native', 'data': xfa_data}
    # pdfmux extracts full document text for LLM parsing
    result = process(pdf_path, quality="standard")
    return {
        'source': 'text_fallback',
        'text': result.text,
        'confidence': result.confidence,
        'extractor': result.extractor_used,
    }
```
## Scanned forms: OCR is the only path

A scanned form has no digital fields — just image data of a filled-out paper form. pypdf, pdfrw, and PyMuPDF all return empty on scanned forms. pdfmux routes to an OCR extractor automatically (requires `pip install "pdfmux[ocr]"`):
```python
from pdfmux import process

result = process("scanned-intake-form.pdf", quality="standard")
print(f"Extractor: {result.extractor_used}")   # 'ocr' or 'marker'
print(f"Confidence: {result.confidence:.1%}")  # e.g. 88.3%
print(f"Pages: {result.page_count}")

# Check confidence before trusting the output
if result.confidence < 0.75:
    raise ValueError(
        f"OCR confidence {result.confidence:.1%} too low. "
        f"Warnings: {result.warnings}"
    )

text = result.text  # structured Markdown with the form content
```
For structured field extraction from scanned forms, pair pdfmux with an LLM. pdfmux handles the hard part (OCR quality, confidence scoring, self-healing re-extraction); the LLM handles the easy part (field name matching):
```python
import json
import re

import anthropic
from pdfmux import process

def extract_intake_form(pdf_path: str, schema: dict) -> dict:
    result = process(pdf_path, quality="standard")
    if result.confidence < 0.75:
        return {'error': 'low_confidence', 'confidence': result.confidence}

    client = anthropic.Anthropic()
    fields_list = '\n'.join(f'- {k}: {v}' for k, v in schema.items())
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract these fields from the form. Return valid JSON only.

Fields (name: description):
{fields_list}

Use null for missing or illegible fields.

Form text:
{result.text}"""
        }]
    )
    content = response.content[0].text.strip()
    # Strip markdown code block if wrapped
    if content.startswith('```'):
        content = re.sub(r'^```\w*\n', '', content)
        content = re.sub(r'\n```$', '', content)
    return json.loads(content)

# Usage
INTAKE_SCHEMA = {
    'patient_name': 'Full legal name',
    'date_of_birth': 'DOB in MM/DD/YYYY',
    'insurance_id': 'Insurance member ID',
    'chief_complaint': 'Primary reason for visit',
}
data = extract_intake_form("intake-2026-04-26.pdf", INTAKE_SCHEMA)
```
## Mixed documents: form fields plus prose
The most common real-world case is a document that is partly a form and partly a contract, disclosure, or report. A loan application might have fillable signature fields plus a 30-page credit disclosure statement. An insurance claim form might have structured fields plus free-text sections describing the incident.
pypdf’s `get_fields()` returns only the field values — not the surrounding prose. pdfmux extracts everything. The pattern that covers both:
```python
from pypdf import PdfReader
from pdfmux import process

def extract_mixed_document(pdf_path: str) -> dict:
    # Get structured form fields (AcroForm)
    reader = PdfReader(pdf_path)
    fields = reader.get_fields() or {}
    form_data = {name: field.value for name, field in fields.items()}

    # Get full document text (prose, tables, embedded content)
    result = process(pdf_path, quality="standard")
    return {
        'form_fields': form_data,
        'full_text': result.text,
        'page_count': result.page_count,
        'confidence': result.confidence,
        'extractor': result.extractor_used,
    }
```
## Detect form type before extracting
Rather than trying all extraction methods and seeing which returns data, detect the form type upfront and route:
```python
from pypdf import PdfReader
from pdfmux import process

def detect_form_type(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    try:
        root = reader.trailer['/Root']
        acroform = root.get('/AcroForm', {})
    except (KeyError, TypeError):
        return 'text_or_scanned'
    if '/XFA' in acroform:
        return 'xfa'
    elif '/Fields' in acroform:
        return 'acroform'
    else:
        return 'text_or_scanned'

def smart_extract(pdf_path: str) -> dict:
    form_type = detect_form_type(pdf_path)
    if form_type == 'acroform':
        fields = extract_form_fields(pdf_path)
        if fields:
            return {'type': 'acroform', 'data': fields}
        # Empty fields → possibly a flat AcroForm with no values set → fall through
    elif form_type == 'xfa':
        xfa_data = extract_xfa_data(pdf_path)
        if xfa_data:
            return {'type': 'xfa', 'data': xfa_data}
        # XFA parse failed → fall through to pdfmux
    # For scanned, unknown, or failed extraction → pdfmux
    result = process(pdf_path, quality="standard")
    return {
        'type': 'text_fallback',
        'text': result.text,
        'confidence': result.confidence,
        'extractor': result.extractor_used,
    }
```
## Accuracy comparison on real forms
We tested pypdf, the native XFA extractor, and pdfmux on 80 real-world forms: 30 AcroForms (W-9, I-9, medical intake), 20 XFA forms (IRS 1040-EZ, state tax forms), and 30 scanned forms (handwritten clinic forms, signed contracts).
| Document type | pypdf | Native XFA extractor | pdfmux (text mode) |
|---|---|---|---|
| AcroForms (digital, filled) | 94% | n/a | 97% |
| XFA forms | 0% | 78% | 71% (text only) |
| Scanned forms | 0% | n/a | 88% (OCR) |
| Mixed doc (fields + prose) | 61% (fields only) | n/a | 95% (full text) |
“Accuracy” = percentage of documents where all required fields had correct non-empty values (AcroForm/XFA), or where extracted text had >90% Levenshtein similarity to ground truth (pdfmux text mode).
For XFA, native extraction beats pdfmux because it preserves the field-name structure. For scanned forms, pdfmux is the only viable Python option without a separate OCR service.
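For reference, the text-mode similarity check can be reproduced with a plain normalized edit distance. The helper names below are ours, and the benchmark’s exact normalization is not specified, so treat this as a sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(extracted: str, ground_truth: str) -> float:
    # 1.0 = identical; a document counts as correct above 0.90
    if not extracted and not ground_truth:
        return 1.0
    return 1 - levenshtein(extracted, ground_truth) / max(len(extracted), len(ground_truth))
```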
## Production routing
For a production pipeline that handles unknown form types from users:
- Detect form type with `detect_form_type()`.
- For AcroForms: extract with pypdf, validate required fields, log any empty-field failures.
- For XFA: try native XFA extraction first, fall back to pdfmux + LLM parsing on failure.
- For scanned/unknown: pdfmux OCR → LLM structured extraction. Gate on `confidence >= 0.80`; route anything below that to human review.
- For mixed documents: extract both pypdf fields and pdfmux full text, merge.
- Log every extraction with form type, confidence, and extractor used. You’ll find patterns in where failures cluster — usually a specific form version or scanner model.
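The logging and gating step can be sketched as follows, assuming the `smart_extract()` result shape from earlier (the route labels and helper name are illustrative):

```python
import logging

logger = logging.getLogger("pdf_forms")

def route_result(pdf_path: str, result: dict, min_confidence: float = 0.80) -> str:
    """Log one extraction and decide where it goes next.

    `result` is the dict returned by smart_extract(); native AcroForm/XFA
    extraction carries no confidence score, so it defaults to 1.0.
    """
    confidence = result.get('confidence', 1.0)
    logger.info("path=%s type=%s extractor=%s confidence=%.2f",
                pdf_path, result['type'],
                result.get('extractor', 'native'), confidence)
    if confidence < min_confidence:
        return 'human_review'
    return 'downstream'
```

Keeping the routing decision in one small function makes the failure clusters mentioned above easy to query out of your logs later.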
The OCR accuracy benchmarks in our extractor comparison cover how pdfmux’s OCR quality compares to standalone tools like Tesseract and EasyOCR. For pure AcroForm pipelines, pypdf’s source is on GitHub and the maintainers have good documentation on edge cases like encrypted fields and signature fields.