AWS Textract is Amazon’s managed document extraction service. pdfmux is an open-source Python library that runs locally. They take opposite approaches to the same problem: getting structured text out of PDFs.
This comparison covers accuracy, cost, latency, operational complexity, and the tradeoffs that matter in production — not the marketing surface.
Feature Comparison
| Feature | pdfmux | AWS Textract |
|---|---|---|
| Installation | pip install pdfmux | AWS account + IAM + boto3 setup |
| Deployment | Local, self-hosted | AWS cloud only |
| Pricing | Free (MIT) | $0.0015–$0.05 per page |
| Extraction approach | Multi-engine router | ML-based OCR + layout analysis |
| Output formats | Markdown, JSON | JSON (Block-based structure) |
| Table extraction | Built-in (89.1% F1 in our benchmark) | ML-powered (91.2% F1 in our benchmark) |
| Form extraction | Limited | Strong (key-value pairs) |
| OCR capability | Basic (text-based PDFs) | Advanced ML-based OCR |
| Data privacy | Documents stay local | Documents sent to AWS |
| Offline support | Yes | No |
| AWS integration | None (framework-agnostic) | Native (S3, Lambda, SNS) |
| MCP server | Built-in | No |
| Vendor lock-in | None | AWS ecosystem |
Benchmark Results
We tested both on our 200-document benchmark suite. Full methodology lives in our PDF extractor benchmark post.
| Metric | pdfmux | AWS Textract |
|---|---|---|
| Text accuracy (text-based PDFs) | 94.2% | 92.5% |
| Text accuracy (scanned PDFs) | 88.1% | 95.2% |
| Table extraction F1 | 89.1% | 91.2% |
| Median latency per page | ~1.2s local | ~2.9s (includes network) |
| Failed documents | 2 / 200 | 3 / 200 |
| Cost per 10,000 pages | $0 | $15–$150 |
Textract’s ML-based OCR outperforms pdfmux on scanned documents (95.2% vs 88.1%); that is the one place where the cloud round-trip is worth paying for. On text-based PDFs (the majority of modern documents), pdfmux is slightly more accurate (94.2% vs 92.5%) with no per-page fee, and its median latency is lower (~1.2s vs ~2.9s per page). Textract has a slight edge on table extraction (91.2% vs 89.1% F1); pdfmux leads on text accuracy for text-based PDFs.
Pricing at Scale
pdfmux is free and open-source under MIT. Run it on a $5/month VPS or your laptop. The only cost is compute.
AWS Textract (US East):
- DetectDocumentText: $1.50 per 1,000 pages
- AnalyzeDocument (Tables): $15.00 per 1,000 pages
- AnalyzeDocument (Forms): $50.00 per 1,000 pages
Because Textract bills per page, the gap grows linearly with volume:
| Volume (pages/month) | pdfmux | Textract (Tables) | Textract (Forms) |
|---|---|---|---|
| 1,000 | $0 | $15 | $50 |
| 10,000 | $0 | $150 | $500 |
| 100,000 | $0 | $1,500 | $5,000 |
| 1,000,000 | $0 | $15,000 | $50,000 |
For an LLM/RAG pipeline ingesting hundreds of thousands of pages, Textract becomes a meaningful line item. For pdfmux, the only cost is compute, which is a rounding error by comparison.
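The table above is simple per-page arithmetic, and it can be reproduced for any volume. The sketch below assumes the US East list prices quoted in this post and ignores any free tier or volume discount; the function and dictionary names are illustrative, not part of any AWS SDK.

```python
# Per-1,000-page prices quoted in this post (US East). Check current AWS
# pricing before relying on these numbers.
TEXTRACT_USD_PER_1000_PAGES = {
    "detect_text": 1.50,
    "tables": 15.00,
    "forms": 50.00,
}


def textract_monthly_cost(pages: int, feature: str) -> float:
    """Return the estimated monthly Textract bill in USD for one feature tier."""
    rate = TEXTRACT_USD_PER_1000_PAGES[feature]
    return pages / 1000 * rate


for pages in (1_000, 10_000, 100_000, 1_000_000):
    tables = textract_monthly_cost(pages, "tables")
    forms = textract_monthly_cost(pages, "forms")
    print(f"{pages:>9,} pages: tables ${tables:,.0f}, forms ${forms:,.0f}")
```

Plugging in your own monthly page count gives a quick estimate of where Textract sits in your budget.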
Operational Complexity
Textract requires an AWS account, IAM roles with the right permissions, region selection, and boto3 configuration. Documents larger than 10 MB or longer than one page require asynchronous processing through S3, with job completion signaled via SNS notifications or polling. Error handling means managing throttling, service quotas, and region-specific availability.
pdfmux requires pip install pdfmux. That is the entire setup.
For teams already deep in AWS, adding Textract is incremental. For everyone else, the operational overhead is significant for what is fundamentally a PDF parsing task.
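To make the asynchronous path concrete, here is a minimal sketch that starts a Textract job on a document already uploaded to S3 and polls until it finishes. It assumes `client` behaves like `boto3.client("textract")`; the function name, polling interval, and poll limit are illustrative choices, and a production pipeline would more likely react to the SNS notification than poll.

```python
import time


def run_async_analysis(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Start an asynchronous Textract job on an S3 object and collect all Blocks.

    `client` is expected to behave like boto3.client("textract").
    """
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS (an SNS notification could
    # replace this loop).
    for _ in range(max_polls):
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # This sketch treats anything other than SUCCEEDED as an error.
    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results are paginated: follow NextToken until it is absent.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks
```

A call would look like `run_async_analysis(boto3.client("textract"), "my-bucket", "reports/q3.pdf")`, where the bucket and key are placeholders. Throttling retries and partial-success handling are left out to keep the sketch short.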
Code Comparison
pdfmux:
from pdfmux import convert
result = convert("invoice.pdf")
print(result.markdown)
AWS Textract:
import boto3

client = boto3.client("textract", region_name="us-east-1")

# Synchronous call: suitable for a single-page document up to 10 MB.
with open("invoice.pdf", "rb") as f:
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )

# Print each detected line of text in the order Textract returns it.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
pdfmux returns a clean Markdown document. Textract returns a flat JSON list of Block objects (pages, lines, words, tables, cells, key-value sets) linked to one another by relationship IDs, which you must parse and reassemble. That is non-trivial work for any multi-page document, and it is the reason most teams end up writing a Textract-to-text adapter on top.
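The adapter mentioned above can start small. The following sketch assumes the `Blocks` list from a Textract response and rebuilds plain text per page by sorting `LINE` blocks by their bounding-box position. The function name is hypothetical, and the top-to-bottom ordering is a naive choice that will interleave columns in multi-column layouts.

```python
from collections import defaultdict


def lines_by_page(blocks):
    """Group LINE blocks by page and return {page_number: text}."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Fall back to page 1 when the Page key is missing.
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))

    # Sort each page top-to-bottom, then left-to-right.
    return {
        page: "\n".join(text for _, _, text in sorted(items))
        for page, items in sorted(pages.items())
    }


# Hand-written sample in the shape of a Textract response (not real output).
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Total: $42.00",
     "Geometry": {"BoundingBox": {"Top": 0.80, "Left": 0.10}}},
    {"BlockType": "LINE", "Page": 1, "Text": "Invoice #1001",
     "Geometry": {"BoundingBox": {"Top": 0.05, "Left": 0.10}}},
    {"BlockType": "LINE", "Page": 2, "Text": "Terms and conditions",
     "Geometry": {"BoundingBox": {"Top": 0.05, "Left": 0.10}}},
]
print(lines_by_page(sample_blocks))
```

Tables and key-value pairs need more work than this, because their cells and values are reached by following each block's relationship IDs rather than by position alone.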
When to Use AWS Textract
Textract is the right call when you need:
- Scanned document OCR: physical documents, faxes, and low-quality scans where ML-based OCR materially outperforms heuristics
- Form extraction: key-value pairs from standardized forms (W-2s, insurance claims, invoices with a consistent layout)
- Handwriting recognition: Textract reads handwritten text that local tools generally cannot
- AWS-native pipelines: you already have S3 triggers, Lambda processing, and SNS notifications wired up
- Enterprise compliance: Textract is HIPAA-eligible (with a BAA) and in scope for AWS’s SOC and FedRAMP compliance programs
When to Use pdfmux
pdfmux is the better fit when you need:
- Cost control: no per-page fee at any scale; you pay only for compute
- Low latency: ~1.2s median per page locally vs ~2.9s for Textract in our benchmark; matters for interactive UIs and real-time pipelines
- Data privacy: documents never leave your infrastructure, which simplifies the compliance story for HIPAA/GDPR
- Text-based PDF extraction: comparable or better accuracy without cloud overhead
- Cloud-agnostic deployment: runs on any infrastructure, whether a laptop, a Docker container, or the edge
- RAG pipelines: clean Markdown output drops directly into your chunker
- Batch processing: process millions of PDFs without API rate limits or per-page fees
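For the batch-processing case in the list above, a minimal sketch of a folder-level loop follows. It assumes `convert_fn` is pdfmux's `convert` as used in the code comparison (a callable that takes a path and returns an object with a `.markdown` attribute); the helper name and the output layout are illustrative.

```python
from pathlib import Path


def convert_folder(pdf_dir, out_dir, convert_fn):
    """Convert every PDF in pdf_dir to Markdown; return (written, failed) lists."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            result = convert_fn(str(pdf))
        except Exception as exc:
            # Record the failure and keep going; one bad file should not
            # stop the whole batch.
            failed.append((pdf.name, str(exc)))
            continue
        target = out / f"{pdf.stem}.md"
        target.write_text(result.markdown, encoding="utf-8")
        written.append(target.name)
    return written, failed
```

Usage would be along the lines of `from pdfmux import convert` followed by `convert_folder("inbox", "markdown", convert)`, with both directory names as placeholders. Collecting failures separately makes it easy to retry them or send them to a different extractor.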
Verdict
For general-purpose PDF extraction, pdfmux delivers better accuracy on text-based PDFs with no per-page cost and a much simpler setup. Textract earns its keep on form extraction, scanned-document OCR, and tight AWS integration.
The honest split: if your documents are mostly text-based PDFs and your goal is clean output for an LLM, pdfmux is the practical choice. If you’re processing structured forms at scale inside an AWS environment, Textract’s purpose-built features pay for themselves.
For a broader survey, see our roundup of the best PDF extraction libraries for Python or the full extractor benchmark.
FAQ
How much does AWS Textract cost at scale?
DetectDocumentText is $1.50 per 1,000 pages; AnalyzeDocument (Tables) is $15 per 1,000; AnalyzeDocument (Forms) is $50 per 1,000. Processing 100,000 pages costs $150–$5,000 depending on which features you enable. pdfmux processes the same volume for the cost of compute.
Can pdfmux handle scanned PDFs as well as Textract?
For heavily scanned or degraded documents, Textract’s ML-based OCR is superior (95.2% vs 88.1% in our benchmark). pdfmux handles text-based PDFs and lightly scanned documents well. A common production pattern: route text-based PDFs to pdfmux and scanned ones to Textract.
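The routing pattern can be sketched as a small decision function. This version assumes you already have a per-page character count from any text-layer extractor; the function name and both thresholds are illustrative assumptions that you would tune on your own documents.

```python
def choose_engine(chars_per_page, min_chars=50, min_text_ratio=0.8):
    """Return "pdfmux" for mostly text-based PDFs, "textract" for mostly scanned ones.

    chars_per_page: number of characters a text-layer extractor found on each page.
    """
    if not chars_per_page:
        raise ValueError("document has no pages")
    # A page counts as text-based when it carries at least min_chars characters.
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = text_pages / len(chars_per_page)
    return "pdfmux" if ratio >= min_text_ratio else "textract"


print(choose_engine([1840, 2210, 1975]))  # every page has a text layer
print(choose_engine([0, 12, 0, 3]))       # image-only pages, likely a scan
```

Routing per document keeps the Textract bill limited to the files that actually benefit from ML-based OCR.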
Is AWS Textract HIPAA compliant?
Textract is HIPAA-eligible when used inside a properly configured AWS environment with a BAA in place. pdfmux processes documents locally, so HIPAA compliance follows your own infrastructure controls — keeping data local is often the simpler compliance path.
Is there a free tier for Textract?
AWS offers a free tier of 1,000 pages/month for the first 3 months. After that, standard pricing applies. pdfmux is free in perpetuity.
Does pdfmux support form extraction?
pdfmux focuses on text, tables, and document structure rendered as Markdown. For dedicated key-value form extraction (W-2s, structured invoices), Textract has purpose-built features that pdfmux does not match. The pragmatic answer is to use both: Textract for forms, pdfmux for everything else.
What about document size limits?
Textract’s synchronous API accepts documents of up to 10 MB and a single page. Larger or multi-page documents require async processing through S3. pdfmux has no such limits; it processes documents of any size locally. See our self-healing extraction post for handling very large or fragile documents.
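A pre-flight check for those limits can be sketched in a few lines. The constants mirror the limits quoted above; reading 10 MB as 10 × 1024 × 1024 bytes is an assumption, and the function name is hypothetical.

```python
SYNC_MAX_BYTES = 10 * 1024 * 1024  # the 10 MB limit quoted above (assumed binary MB)
SYNC_MAX_PAGES = 1


def textract_mode(size_bytes, page_count):
    """Return "sync" if the document fits the synchronous limits, else "async"."""
    if size_bytes <= SYNC_MAX_BYTES and page_count <= SYNC_MAX_PAGES:
        return "sync"
    return "async"


print(textract_mode(2_000_000, 1))   # small single-page file
print(textract_mode(2_000_000, 12))  # multi-page file
```

Running this check before calling Textract avoids a failed synchronous request on documents that were always going to need the S3 route.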