Direct answer: Use AWS Textract when your stack is already deeply on AWS, your documents are forms and tables that match the Textract feature set (AnalyzeDocument with FORMS and TABLES), and you have an enterprise budget with IAM-level governance requirements that justify the per-page cost. Use pdfmux when any of these are true: your documents cannot be sent to AWS (for regulatory or contractual reasons, or because you are not on AWS), your monthly volume is over 6,000 pages, you process documents that contain personal data you do not want to ship to a third party, or you need confidence scores per page rather than confidence scores per detected field. The cost crossover is around 6,000 pages per month at standard Textract pricing. Below that, Textract is cheaper than the engineer time to operate pdfmux. Above it, the math flips fast.
What each tool actually is
AWS Textract is a managed document analysis API published by Amazon Web Services. You upload a PDF or image to S3 (or send the bytes directly in the request), call one of its operations (DetectDocumentText for raw OCR, AnalyzeDocument for forms and tables, or AnalyzeExpense and AnalyzeID for specialized document types), and Textract returns a JSON response with extracted text, key-value pairs, table cells, and bounding boxes. Pricing is per-page: $1.50 per 1,000 pages for OCR, $50 per 1,000 pages for AnalyzeDocument with FORMS, $65 per 1,000 pages with FORMS and TABLES. There is no self-hosted option. The model is closed.
pdfmux is an open-source Python library that runs entirely on your machine. It routes each PDF page to the optimal extractor: PyMuPDF for digital text, Docling for tables, RapidOCR for scanned pages. It scores quality on every page and re-extracts failures automatically. No API keys. No per-page cost. No documents leave your environment. Install: pip install pdfmux. License is MIT.
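The route-score-retry idea described above can be illustrated with a small sketch. This is not pdfmux's implementation: `PageInfo`, `choose_extractor`, `FALLBACK_ORDER`, the thresholds, and the extractor callables are all hypothetical names invented here to show the shape of the logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class PageInfo:
    """Hypothetical per-page summary; not a pdfmux type."""
    number: int
    text_chars: int   # characters available from the PDF text layer
    has_table: bool   # whether a layout pass flagged a table


def choose_extractor(page: PageInfo, min_chars: int = 50) -> str:
    """Pick a preferred extractor name for one page (illustrative thresholds)."""
    if page.text_chars < min_chars:
        return "ocr"    # no usable text layer: treat as a scanned page
    if page.has_table:
        return "table"  # digital page containing tables
    return "text"       # plain digital text


# Assumed fallback chains: which extractors to try, in order, per starting choice.
FALLBACK_ORDER = {
    "text": ["text", "table", "ocr"],
    "table": ["table", "ocr"],
    "ocr": ["ocr"],
}

# An extractor returns (extracted text, quality score in [0, 1]).
Extractor = Callable[[PageInfo], Tuple[str, float]]


def extract_page(
    page: PageInfo,
    extractors: Dict[str, Extractor],
    min_quality: float = 0.8,
) -> Tuple[str, float, str]:
    """Try the preferred extractor, then fall back while quality stays too low.

    Returns (text, quality, extractor name) for the best attempt seen.
    """
    best = ("", 0.0, "none")
    for name in FALLBACK_ORDER[choose_extractor(page)]:
        text, quality = extractors[name](page)
        if quality > best[1]:
            best = (text, quality, name)
        if quality >= min_quality:
            break  # good enough, stop re-extracting
    return best
```

In a real pipeline the three callables would wrap PyMuPDF, Docling, and RapidOCR, and the quality score would come from whatever heuristic the orchestrator uses; both are left abstract here.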
Both tools target the same set of use cases — invoice and form extraction, document indexing for RAG, KYC pipelines, contract analysis, claims processing. The architectural choice is fundamentally about where the data lives at the moment of extraction: in your VPC versus in Amazon’s.
Accuracy
This is where honest comparison gets uncomfortable. Textract is a closed API, and the AWS team does not publish results on third-party benchmarks like opendataloader-bench. pdfmux is benchmarked publicly. Comparing them requires running Textract against the same test set ourselves, which is what we did.
Test set: 200 PDFs from financial filings (10-Ks, S-1s), academic papers (arXiv preprints), legal contracts (EDGAR exhibits), and government documents (US Federal Register, EU Official Journal). Mix of digital and scanned. Mix of single-column, multi-column, and heavy tables. Total pages: 4,180.
Both tools were evaluated on the same metrics: text accuracy (1 minus the character error rate against ground truth), table cell accuracy (matched cells over total cells), and reading order accuracy (Kendall tau against human-annotated reading order).
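For readers who want to reproduce this kind of scoring, here is a minimal sketch of the three metrics using only the standard library. The function names, the exact-match rule for table cells, and the tie-free Kendall tau variant are assumptions for illustration, not the benchmark's actual harness.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def text_accuracy(predicted: str, truth: str) -> float:
    """1 - CER, where CER = edit distance / len(truth), floored at 0."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, truth) / len(truth))


Cell = Tuple[int, int]  # (row index, column index)


def table_cell_accuracy(predicted: Dict[Cell, str], truth: Dict[Cell, str]) -> float:
    """Share of ground-truth cells whose text matches exactly at the same position."""
    if not truth:
        return 1.0
    matched = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return matched / len(truth)


def kendall_tau(predicted_order: Sequence[str], true_order: Sequence[str]) -> float:
    """Kendall tau over block IDs present in both orders (assumes no ties)."""
    rank = {block: i for i, block in enumerate(predicted_order)}
    common = [block for block in true_order if block in rank]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 1.0
    concordant = sum(1 for a, b in pairs if rank[a] < rank[b])
    return (2 * concordant - len(pairs)) / len(pairs)
```

A single OCR confusion such as "INV-2O24" for "INV-2024" is one edit over eight characters, so `text_accuracy` returns 0.875 for that string.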
| Metric | pdfmux 1.6 | AWS Textract (May 2026) |
|---|---|---|
| Text accuracy on digital pages | 99.4% | 99.6% |
| Text accuracy on scanned pages | 96.1% | 97.8% |
| Table cell accuracy | 91.2% | 93.5% |
| Reading order (Kendall tau) | 0.94 | 0.92 |
| Forms key-value extraction | not native (custom rules) | 89.4% |
| Confidence score granularity | per page | per detected field |
| Multi-column reading order | layout-aware via Docling | column-aware |
Textract wins on raw text accuracy, particularly on degraded scans, by 1.7 percentage points. It wins on table cell accuracy by 2.3 points. pdfmux wins narrowly on reading order on multi-column academic layouts, mostly because the orchestrator routes those pages to Docling, which handles columns more reliably than Textract on three-column layouts.
The forms extraction line is the one that matters for many readers. Textract has a FORMS feature that returns labeled key-value pairs (“Invoice Number” → “INV-2024-0847”) with bounding boxes. pdfmux does not do this natively — you write extraction rules on top of the Markdown output, or you pair pdfmux with a downstream LLM call. If your use case is “give me the line items from any invoice,” Textract’s AnalyzeExpense is faster to integrate than building it yourself.
The reverse is also true. If your use case is “give me clean text and tables I can feed to a custom LLM agent,” pdfmux’s confidence-scored Markdown is more useful than Textract’s typed JSON, because you can pipe it directly into any LLM context window without writing a translation layer.
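As a rough illustration of what "extraction rules on top of the Markdown output" can look like, the sketch below applies two regular expressions to a Markdown string. The patterns and field names are hypothetical and tuned to one made-up invoice layout; real documents would need their own rules.

```python
import re
from typing import Dict, Optional

# Hypothetical rules for a single invoice layout (assumptions, not a general solution).
RULES = {
    # Matches "Invoice Number: X", "**Invoice No.:** X", "Invoice # X", ...
    "invoice_number": re.compile(
        r"invoice\s*(?:number|no\.?|#)[\s:*|]*([A-Z0-9][A-Z0-9-]+)", re.I
    ),
    # Matches a line starting with "Total" (not "Subtotal"), in prose or a pipe table.
    "total": re.compile(
        r"^\W*total(?:\s+due)?[\s:*|$]*([\d,]+\.\d{2})", re.I | re.M
    ),
}


def apply_rules(markdown: str) -> Dict[str, Optional[str]]:
    """Run every rule against the Markdown; fields without a match come back as None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in RULES.items():
        match = pattern.search(markdown)
        fields[name] = match.group(1) if match else None
    return fields


sample = (
    "# Invoice\n\n"
    "**Invoice Number:** INV-2024-0847\n\n"
    "| Item | Amount |\n"
    "|---|---|\n"
    "| Widgets | $1,000.00 |\n"
    "| Subtotal | $1,000.00 |\n"
    "| Total | $1,234.50 |\n"
)
print(apply_rules(sample))
# {'invoice_number': 'INV-2024-0847', 'total': '1,234.50'}
```

Rules like these are cheap to run but brittle across layouts, which is the trade-off against a purpose-built endpoint or a downstream LLM call.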
Cost
This is where the analysis breaks Textract’s way until volume changes.
Textract pricing as of May 2026:
| Operation | Price per 1,000 pages |
|---|---|
| DetectDocumentText (OCR only) | $1.50 |
| AnalyzeDocument with FORMS | $50 |
| AnalyzeDocument with FORMS and TABLES | $65 |
| AnalyzeDocument with QUERIES | $15 per 1,000 queries |
| AnalyzeExpense | $10 |
For a typical RAG pipeline that needs text plus tables, you are looking at $65 per 1,000 pages on Textract. On pdfmux, the cost is the compute you already pay for, plus engineer time to operate it.
Worked example: 50,000 pages per month.
- Textract AnalyzeDocument with FORMS and TABLES: 50,000 × $0.065 = $3,250 per month, or $39,000 per year.
- pdfmux on a single c7g.xlarge EC2 instance (4 vCPU, 8 GB RAM, ARM Graviton, $0.145 per hour on-demand or about $50 per month reserved): $50 to $105 per month, or $600 to $1,260 per year.
The AWS bill drops by approximately $38,000 per year for the same throughput. That assumes you do not need the FORMS feature; if you do, you replace the cost with either custom extraction rules on top of pdfmux output (engineer cost: one-time) or a downstream LLM call per page ($0.005 to $0.02 per page on a small model, which can still beat Textract’s $50 per 1,000 pages, depending on the model).
Below 6,000 pages per month, Textract is cheaper in total cost (including operations time). Above 6,000, pdfmux wins by a margin that grows linearly with volume.
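The arithmetic behind the worked example and the crossover can be checked with a few lines of Python. The prices are the ones quoted above; the function names are made up for this sketch, and the monthly operations allowance is derived from the article's 6,000-page crossover, not a measured figure.

```python
PRICE_PER_1000_PAGES = 65.0  # Textract AnalyzeDocument with FORMS and TABLES, in USD


def textract_monthly(pages: int) -> float:
    """Monthly Textract bill in USD for a given page volume."""
    return pages * PRICE_PER_1000_PAGES / 1000


def annual_savings(pages_per_month: int, pdfmux_compute_usd: float) -> float:
    """Yearly difference between Textract and pdfmux compute (ignores engineer time)."""
    return (textract_monthly(pages_per_month) - pdfmux_compute_usd) * 12


def break_even_pages(pdfmux_fixed_monthly_usd: float) -> float:
    """Monthly volume at which Textract costs as much as pdfmux's fixed monthly cost."""
    return pdfmux_fixed_monthly_usd * 1000 / PRICE_PER_1000_PAGES


print(textract_monthly(50_000))        # 3250.0
print(annual_savings(50_000, 105.0))   # 37740.0 (on-demand instance)
print(annual_savings(50_000, 50.0))    # 38400.0 (reserved instance)

# A 6,000-page crossover means pdfmux's all-in monthly cost is about $390.
# With $105 of on-demand compute, that leaves roughly $285 per month for operations.
print(textract_monthly(6_000) - 105.0)  # 285.0
print(break_even_pages(390.0))          # 6000.0
```

If your own estimate of monthly operations time is higher or lower than that implied figure, `break_even_pages` shows how far the crossover moves.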
Privacy
This is where the analysis breaks pdfmux’s way regardless of volume.
Textract processes documents on AWS infrastructure. AWS publishes a SOC 2 report, a HIPAA BAA is available on enrolled accounts, and Textract has a commitment that customer content is not used to train AWS models. The data still leaves your VPC during processing, even if it returns immediately, and even if the processing happens in your region. For many regulated industries, this is fine; for some, it is not.
Specifically, the privacy advantage is real for:
- Healthcare records under HIPAA where the BAA is not signed (smaller hospital systems, international providers).
- Legal contracts under attorney-client privilege where a third-party processor would arguably break the privilege.
- Personally identifiable information (PII) under EU GDPR, where adding a processor triggers Article 28 obligations or where the Chapter V transfer restrictions rule out processing outside the EU.
- Government documents under FedRAMP High where Textract’s authorization may not cover the use case.
- Any internal data where the security review of a new SaaS vendor would take six months and a self-hosted alternative ships next week.
pdfmux runs in your VPC, on your machines, with no network calls. You can run it air-gapped. The privacy story is binary: documents do not leave the system unless you put them somewhere else.
This is not a theoretical advantage. We have customers running pdfmux specifically because their security team would not approve Textract for documents covered by their data processing agreements with their own customers. For some teams, “we use AWS for compute” and “we send customer documents through a managed AWS service” are different security postures, even though both are technically AWS.
Integration
Textract integrates trivially into AWS-native pipelines.
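import boto3

# Uses the default AWS credential chain and region.
textract = boto3.client("textract")

# The synchronous call takes raw bytes; multi-page PDFs need the asynchronous API.
with open("invoice.pdf", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS", "TABLES"],
    )

# Print every detected line of text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])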
There is no model to download, no Python dependency to manage beyond boto3, no GPU to provision, and the IAM model is the same one your application already uses. For a team already on AWS, Textract is the path of least resistance.
pdfmux is also straightforward, but the operational model is different.
from pdfmux import extract

result = extract("invoice.pdf", format="markdown")

for page in result.pages:
    print(f"Page {page.number} (confidence {page.confidence:.2f}):")
    print(page.text)
You install the package, pin a model file or two for OCR, and run it inside whatever compute environment you already have. There is no API key, no rate limit, no per-call latency from a network round trip, and no monthly bill from a third party. There is also no AWS support contract to call when something breaks; you read the source, file an issue on GitHub, or fix it yourself.
For an AWS-native team building a quick MVP, Textract ships faster. For a team that is going to operate this pipeline for years and process millions of pages, the operational simplicity of pdfmux compounds.
Form and table extraction: the feature gap
Textract’s strongest feature is AnalyzeDocument with FORMS, which returns key-value pairs detected on the document, plus TABLES, which returns cell-level structured tables with row and column indices.
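Getting from the raw response to usable fields takes a little graph walking, because Textract links KEY blocks to VALUE blocks and both to their WORD children through `Relationships`. The sketch below follows that documented block structure; the helper names are made up, and `table_cells` assumes a single table in the response.

```python
from typing import Any, Dict, Tuple

Block = Dict[str, Any]


def _child_text(block: Block, by_id: Dict[str, Block]) -> str:
    """Join the text of a block's CHILD words (and checkbox states) in order."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED / NOT_SELECTED
    return " ".join(parts)


def key_value_pairs(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each FORMS key to its value text."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs: Dict[str, str] = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        values = [
            _child_text(by_id[value_id], by_id)
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        pairs[_child_text(block, by_id)] = " ".join(values)
    return pairs


def table_cells(response: Dict[str, Any]) -> Dict[Tuple[int, int], str]:
    """Map (row, column) to cell text; merges all tables, so assumes only one."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    return {
        (block["RowIndex"], block["ColumnIndex"]): _child_text(block, by_id)
        for block in response["Blocks"]
        if block["BlockType"] == "CELL"
    }
```

Both functions take the dictionary returned by `analyze_document`, so they can be applied directly to the `response` from the Integration example.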
pdfmux does not return key-value pairs natively. It returns clean Markdown with tables in standard pipe syntax. For RAG, this is usually what you want — the LLM you pair with the index can extract the key-value pairs from the Markdown when the user asks. For automation pipelines that need structured fields without an LLM in the loop (invoice line items going straight into an ERP), Textract’s AnalyzeExpense is the more direct path.
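When no LLM is in the loop, tables in pipe syntax are still straightforward to turn into records. The following is a minimal sketch of a parser for the first pipe table in a Markdown string; `parse_pipe_table` is a name invented here, and it does not handle escaped pipes inside cells.

```python
from typing import Dict, List, Optional


def parse_pipe_table(markdown: str) -> List[Dict[str, str]]:
    """Parse the first Markdown pipe table into a list of {header: cell} dicts."""

    def split_row(line: str) -> List[str]:
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    header: Optional[List[str]] = None
    rows: List[Dict[str, str]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            if header is not None:
                break      # first non-table line after the table ends it
            continue       # still looking for the table
        cells = split_row(stripped)
        if header is None:
            header = cells
        elif all(cell and set(cell) <= set("-: ") for cell in cells):
            continue       # separator row such as |---|:--:|
        else:
            rows.append(dict(zip(header, cells)))
    return rows


md = (
    "Line items:\n\n"
    "| Item | Qty | Amount |\n"
    "|---|---|---|\n"
    "| Widgets | 4 | $1,000.00 |\n"
    "| Shipping | 1 | $234.50 |\n\n"
    "Thanks."
)
for row in parse_pipe_table(md):
    print(row["Item"], row["Amount"])
```

This covers line items that already sit in a table; free-floating fields such as an invoice number still need rules or a model.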
The hybrid pattern that works well: pdfmux for ingestion (cheap, private, high recall), Textract for the small set of documents that require typed-field extraction without an LLM (rare, high-precision). Most teams find they only need 5 to 10% of their volume on Textract once pdfmux handles the rest.
When Textract wins
You should pick AWS Textract over pdfmux when:
- Your stack is fully AWS-native and the IAM model is already in place. Adding pdfmux means adding a new compute environment to manage; adding Textract is a boto3 client.
- You need typed key-value extraction without an LLM in the loop. Invoice line items going directly into NetSuite. Driver license fields going into a KYC database. Textract’s AnalyzeExpense and AnalyzeID are purpose-built and faster to integrate than rolling your own.
- Your monthly volume is below 6,000 pages and your engineer time is expensive. The cost crossover has not flipped yet.
- You need AWS support contracts and SLAs. pdfmux is open source. There is no SLA. If your procurement requires a vendor contract, Textract has one.
- Your documents are already in S3 and you need async processing of 1,000-page jobs. Textract’s async API and S3 integration are very good for batch workloads with documents that already live in AWS.
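For the S3 batch case in the last bullet, the asynchronous flow is: start a job, poll until it leaves `IN_PROGRESS`, then page through results with `NextToken`. The sketch below uses the boto3 `start_document_analysis` and `get_document_analysis` calls; the function name, polling interval, and timeout are assumptions, and it treats anything other than `SUCCEEDED` (including partial success) as an error.

```python
import time
from typing import Any, Dict, List


def analyze_s3_document(
    client: Any,
    bucket: str,
    key: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 900.0,
) -> List[Dict[str, Any]]:
    """Run an async FORMS + TABLES job on an S3 object and return all blocks.

    `client` is expected to be boto3.client("textract"); it is passed in so the
    function can be exercised with a stub.
    """
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job finishes or the deadline passes.
    deadline = time.monotonic() + timeout_seconds
    while True:
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status != "IN_PROGRESS":
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"Textract job {job_id} still running after timeout")
        time.sleep(poll_seconds)

    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Large documents return results in several pages.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks


# Usage (requires AWS credentials; bucket and key are placeholders):
#   import boto3
#   blocks = analyze_s3_document(boto3.client("textract"), "my-bucket", "filings/10k.pdf")
```

Polling keeps the example short; for high volumes, a completion notification through the job's notification channel is usually a better fit than a sleep loop.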
When pdfmux wins
You should pick pdfmux over AWS Textract when:
- Privacy or data residency rules forbid third-party processing. This is non-negotiable for some regulated workloads.
- Your monthly volume is over 6,000 pages and growing. The cost gap widens linearly.
- You are not on AWS, or you are multi-cloud. Running Textract from GCP or Azure adds egress cost and architectural awkwardness; pdfmux runs anywhere.
- You need confidence scores per page, not per detected field. This is what lets you filter bad extractions out of your RAG index before they corrupt retrieval. We covered the pattern in PDF extraction for RAG pipelines.
- You want to avoid vendor lock-in on the extraction layer. Textract’s JSON schema is proprietary; pdfmux returns Markdown that any tool can consume.
- You need to extract documents in languages Textract supports poorly. Textract’s Arabic and Hindi support has improved but still trails the open-source layout models pdfmux routes to. We covered Arabic specifically in How to extract data from Arabic PDFs.
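The per-page confidence filter mentioned in the list above can be as simple as a threshold split before indexing. In this sketch, `Page` is a stand-in for the page objects shown in the Integration example (number, confidence, text), and the 0.85 threshold is an arbitrary assumption to tune on your own corpus.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Page:
    """Stand-in for an extracted page; mirrors the attributes used earlier."""
    number: int
    confidence: float
    text: str


def split_by_confidence(
    pages: Iterable[Page], threshold: float = 0.85
) -> Tuple[List[Page], List[Page]]:
    """Return (pages safe to index, pages to review or re-extract)."""
    keep: List[Page] = []
    review: List[Page] = []
    for page in pages:
        (keep if page.confidence >= threshold else review).append(page)
    return keep, review


pages = [
    Page(1, 0.97, "Clean digital text"),
    Page(2, 0.41, "G4rbl3d sc4n"),
    Page(3, 0.88, "Mostly fine table page"),
]
keep, review = split_by_confidence(pages)
print([p.number for p in keep])    # [1, 3]
print([p.number for p in review])  # [2]
```

Pages in the review list can be re-extracted with a different engine, routed to a paid service, or left out of the index, depending on how much recall matters.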
The hybrid approach most teams settle on
The teams that have been running both for a year tend to converge on a split: pdfmux for the bulk of the corpus (cheap, private, high recall), and Textract called selectively on the documents where typed-field extraction is required and the per-document cost is justified.
from pdfmux import extract
import boto3

textract = boto3.client("textract")

def hybrid_extract(pdf_path, needs_typed_fields=False):
    # Typed fields required: pay for Textract FORMS and TABLES.
    if needs_typed_fields:
        with open(pdf_path, "rb") as f:
            return textract.analyze_document(
                Document={"Bytes": f.read()},
                FeatureTypes=["FORMS", "TABLES"],
            )
    # Everything else stays local.
    return extract(pdf_path, format="markdown")
The router decision — which 5 to 10% of documents go to Textract — is usually based on document type (invoices and government forms go to Textract; everything else goes to pdfmux). The cost reduction from this single change tends to be 80 to 90% of the original Textract bill.
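A minimal sketch of that router, together with the bill arithmetic, is shown below. The document type labels and function names are hypothetical; the price and compute figures are the ones used in the Cost section, and engineer time is left out of the estimate.

```python
# Hypothetical document type labels; use whatever your classifier or metadata provides.
TEXTRACT_TYPES = {"invoice", "government_form"}


def route(doc_type: str) -> str:
    """Send typed-field documents to Textract and everything else to pdfmux."""
    return "textract" if doc_type in TEXTRACT_TYPES else "pdfmux"


def hybrid_bill_reduction(
    pages: int,
    textract_share: float,
    compute_usd: float,
    price_per_1000: float = 65.0,
) -> float:
    """Fraction of the all-Textract bill saved by the hybrid split.

    Compares (Textract on a share of pages + fixed pdfmux compute) against
    Textract on every page.
    """
    original = pages * price_per_1000 / 1000
    hybrid = original * textract_share + compute_usd
    return 1 - hybrid / original


print(route("invoice"), route("contract"))  # textract pdfmux

# 50,000 pages per month, 10% still on Textract, $105 of on-demand compute.
print(round(hybrid_bill_reduction(50_000, 0.10, 105.0), 3))  # 0.868
```

Because the compute cost is fixed, the reduction shrinks at lower volumes: the same split at 10,000 pages per month saves a noticeably smaller fraction of the bill.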
Summary
AWS Textract is the right answer for AWS-native teams with low to medium volume that need typed-field extraction without an LLM downstream. pdfmux is the right answer for everything else: cost-sensitive teams, privacy-bound workloads, multi-cloud or non-AWS stacks, and any pipeline where confidence scoring per page is the difference between a working RAG system and a hallucinating one. The cost crossover at standard Textract pricing is around 6,000 pages per month for AnalyzeDocument with FORMS and TABLES; below that, Textract is cheaper than the engineer time to run pdfmux, and above it, the math runs the other direction at roughly $1 saved per 15 pages processed.
For a deeper comparison against open-source alternatives, see pdfmux vs LlamaParse, pdfmux vs Kreuzberg, and the four-way benchmark in pdfmux vs LlamaParse vs Docling vs Unstructured.