pdfmux vs AWS Textract: when self-hosted PDF extraction beats the cloud

TL;DR: Honest 2026 comparison of pdfmux and AWS Textract on accuracy, cost, privacy, and integration. Where Textract still wins, and where the cost math has flipped against it.

Direct answer: Use AWS Textract when your stack is already deeply on AWS, your documents are forms and tables that match the Textract feature set (AnalyzeDocument with FORMS and TABLES), and you have an enterprise budget with IAM-level governance requirements that justify the per-page cost. Use pdfmux when any of these are true: your documents leave AWS for a reason (regulatory, contractual, or because you are not on AWS), your monthly volume is over 6,000 pages, you process documents that contain personal data you do not want to ship to a third party, or you need confidence scores per page rather than confidence scores per detected field. The cost crossover is around 6,000 pages per month at standard Textract pricing. Below that, Textract is cheaper than the engineer time to operate pdfmux. Above it, the math flips fast.

What each tool actually is

AWS Textract is a managed document analysis API published by Amazon Web Services. You upload a PDF or image to S3 (or send the bytes directly in the request), call one of four operations (DetectDocumentText for raw OCR, AnalyzeDocument for forms and tables, or AnalyzeExpense and AnalyzeID for specialized document types), and Textract returns a JSON response with extracted text, key-value pairs, table cells, and bounding boxes. Pricing is per-page: $1.50 per 1,000 pages for OCR, $50 per 1,000 pages for AnalyzeDocument with FORMS, $65 per 1,000 pages with FORMS and TABLES. There is no self-hosted option. The model is closed.

pdfmux is an open-source Python library that runs entirely on your machine. It routes each PDF page to the optimal extractor: PyMuPDF for digital text, Docling for tables, RapidOCR for scanned pages. It scores quality on every page and re-extracts failures automatically. No API keys. No per-page cost. No documents leave your environment. Install: pip install pdfmux. License is MIT.

Both tools target the same set of use cases — invoice and form extraction, document indexing for RAG, KYC pipelines, contract analysis, claims processing. The architectural choice is fundamentally about where the data lives at the moment of extraction: in your VPC versus in Amazon’s.

Accuracy

This is where honest comparison gets uncomfortable. Textract is a closed API, and the AWS team does not publish results on third-party benchmarks like opendataloader-bench. pdfmux is benchmarked publicly — its current Overall score on opendataloader-bench is 0.905 (200 PDFs), with Docling at 0.877 and Marker at 0.861 as nearest open-weight comparisons. There is no equivalent third-party number for Textract that we can responsibly cite.

We have not run a controlled head-to-head benchmark of pdfmux against Textract on a shared corpus, and we are not going to fabricate one for the purposes of this comparison. The few public head-to-head writeups we’ve seen (mostly community blog posts on small corpora) report Textract winning narrowly on character error rate for degraded scans, with pdfmux competitive on born-digital pages — but small corpora are not a basis for confident claims.

The honest framing: if accuracy on degraded scans is the binding constraint and you can tolerate the AWS lock-in, run a 50-document pilot of both against your own corpus and pick on the result. The two architectures differ enough that “which is more accurate” is genuinely document-dependent, not a single number.

What we will say with confidence:

Forms extraction: Textract has a FORMS feature that returns labeled key-value pairs (“Invoice Number” → “INV-2024-0847”) with bounding boxes. pdfmux does not do this natively — you write extraction rules on top of the Markdown output, or you pair pdfmux with a downstream LLM call. If your use case is “give me the line items from any invoice,” Textract’s AnalyzeExpense is faster to integrate than building it yourself.
Confidence semantics: pdfmux returns a per-page confidence score derived from text density, layout coherence, and OCR confidence — useful for routing low-confidence pages to a review queue. Textract returns per-field confidence on forms output but does not surface a single per-page audit signal of the same shape.
Reading order on multi-column academic layouts: anecdotally pdfmux is competitive here because the orchestrator routes those pages to Docling, which handles columns reliably. Whether that holds on your specific layouts is, again, a question to settle with a pilot, not a marketing claim.

The reverse is also true. If your use case is “give me clean text and tables I can feed to a custom LLM agent,” pdfmux’s confidence-scored Markdown is more useful than Textract’s typed JSON, because you can pipe it directly into any LLM context window without writing a translation layer.
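The review-queue routing mentioned above can be sketched in a few lines. This is a minimal illustration rather than pdfmux's own API: the `Page` dataclass is a hypothetical stand-in that mirrors the `number`, `confidence`, and `text` attributes used in the integration snippet later in this article, and the 0.80 threshold is an assumed value you would tune on your own corpus.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for an extracted page (not the pdfmux type)."""
    number: int
    confidence: float
    text: str


def route_pages(pages, threshold=0.80):
    """Split pages into (accepted, needs_review) by confidence score."""
    accepted, needs_review = [], []
    for page in pages:
        if page.confidence >= threshold:
            accepted.append(page)
        else:
            needs_review.append(page)
    return accepted, needs_review


# Made-up sample pages, for illustration only.
pages = [
    Page(1, 0.97, "Born-digital cover page"),
    Page(2, 0.62, "Faint scan with skewed text"),
    Page(3, 0.88, "Table of line items"),
]
accepted, needs_review = route_pages(pages)
print([p.number for p in accepted])      # pages that go on to the index
print([p.number for p in needs_review])  # pages held for human review
```

A per-field confidence, as Textract returns on forms output, would need an aggregation step before it could drive the same page-level decision.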

Cost

This is where the analysis breaks Textract’s way until volume changes.

Textract pricing as of May 2026:

Operation	Price per 1,000 pages
`DetectDocumentText` (OCR only)	$1.50
`AnalyzeDocument` with FORMS	$50
`AnalyzeDocument` with FORMS and TABLES	$65
`AnalyzeDocument` with QUERIES	$15 per 1,000 queries
`AnalyzeExpense`	$10

For a typical RAG pipeline that needs text plus tables, you are looking at $65 per 1,000 pages on Textract. On pdfmux, the cost is the compute you already pay for, plus engineer time to operate it.

Worked example: 50,000 pages per month.

Textract AnalyzeDocument with FORMS and TABLES: 50,000 × $0.065 = $3,250 per month, or $39,000 per year.
pdfmux on a single c7g.xlarge EC2 instance (4 vCPU, 8 GB RAM, ARM Graviton, $0.145 per hour on-demand or about $50 per month reserved): $50 to $105 per month, or $600 to $1,260 per year.

The AWS bill drops by approximately $38,000 per year for the same throughput. That assumes you do not need the FORMS feature; if you do, you replace the cost with either custom extraction rules on top of pdfmux output (engineer cost: one-time) or a downstream LLM call per page ($0.005 to $0.02 per page on a small model, which works out to $5 to $20 per 1,000 pages and still undercuts Textract’s $50 per 1,000 pages for FORMS).

Below 6,000 pages per month, Textract is cheaper in total cost (including operations time). Above 6,000, pdfmux wins by a margin that grows linearly with volume.
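The arithmetic behind the crossover is easy to reproduce. The sketch below uses only figures from this section ($65 per 1,000 pages, $50 to $105 per month of compute). The function names are illustrative, and the operations cost is not a measured number: it is backed out from the stated 6,000-page crossover.

```python
TEXTRACT_PER_1000 = 65.0                 # AnalyzeDocument with FORMS and TABLES, USD
COMPUTE_LOW, COMPUTE_HIGH = 50.0, 105.0  # c7g.xlarge reserved vs on-demand, USD/month


def textract_monthly(pages, per_1000=TEXTRACT_PER_1000):
    """Monthly Textract bill in USD for a given page volume."""
    return pages * per_1000 / 1000


def crossover_pages(fixed_monthly, per_1000=TEXTRACT_PER_1000):
    """Page volume at which a fixed monthly cost equals the Textract bill."""
    return fixed_monthly * 1000 / per_1000


bill_50k = textract_monthly(50_000)        # the worked example above
breakeven_fixed = textract_monthly(6_000)  # fixed cost implied by the crossover
implied_ops_low = breakeven_fixed - COMPUTE_HIGH
implied_ops_high = breakeven_fixed - COMPUTE_LOW

print(bill_50k)                           # 3250.0
print(breakeven_fixed)                    # 390.0
print(implied_ops_low, implied_ops_high)  # 285.0 340.0
```

Read this way, the 6,000-page figure prices in roughly $285 to $340 per month of engineer time on top of compute. If your operations overhead is lower than that, the crossover arrives earlier; if it is higher, later.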

Privacy

This is where the analysis breaks pdfmux’s way regardless of volume.

Textract processes documents on AWS infrastructure. AWS publishes a SOC 2 report, a HIPAA BAA is available on enrolled accounts, and AWS lets you opt out of having your content used to improve its AI services through an AWS Organizations opt-out policy. The data still leaves your VPC during processing, even if it returns immediately, and even if the processing happens in your region. For many regulated industries, this is fine; for some, it is not.

Specifically, the privacy advantage is real for:

Healthcare records under HIPAA where the BAA is not signed (smaller hospital systems, international providers).
Legal contracts under attorney-client privilege where a third-party processor would arguably break the privilege.
Personally identifiable information (PII) under EU GDPR, where Article 28 processor requirements or the Chapter V rules on cross-border transfers rule out sending data to a third-party processor.
Government documents under FedRAMP High where Textract’s authorization may not cover the use case.
Any internal data where the security review of a new SaaS vendor would take six months and a self-hosted alternative ships next week.

pdfmux runs in your VPC, on your machines, with no network calls. You can run it air-gapped. The privacy story is binary: documents do not leave the system unless you put them somewhere else.

This is not a theoretical advantage. We have customers running pdfmux specifically because their security team would not approve Textract for documents covered by their data processing agreements with their own customers. For some teams, “we use AWS for compute” and “we send customer documents through a managed AWS service” are different security postures, even though both are technically AWS.

Integration

Textract integrates trivially into AWS-native pipelines.

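import boto3

# Uses the default AWS credential chain and region.
textract = boto3.client("textract")

# The synchronous API accepts a PDF only if it is a single page;
# multi-page PDFs go through the asynchronous StartDocumentAnalysis API.
with open("invoice.pdf", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS", "TABLES"],
    )

# Print each detected line of text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])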

There is no model to download, no Python dependency to manage beyond boto3, no GPU to provision, and the IAM model is the same one your application already uses. For a team already on AWS, Textract is the path of least resistance.
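One caveat on the synchronous call shown above: it accepts a PDF only when the file is a single page. Multi-page PDFs go through the asynchronous API, which reads from S3 and returns results in pages. The sketch below shows that flow with the boto3 `start_document_analysis` and `get_document_analysis` calls. The function name, the polling interval, and the injected `client` and `sleep` parameters are illustrative choices, and a production pipeline would more likely use an SNS notification than a polling loop.

```python
import time


def analyze_pdf_async(client, bucket, key, poll_seconds=5.0, sleep=time.sleep):
    """Run Textract FORMS+TABLES analysis on an S3 object and return all Blocks.

    `client` is expected to be a boto3 Textract client.
    """
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        page = client.get_document_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    # Anything other than SUCCEEDED (FAILED, PARTIAL_SUCCESS) is treated as an error here.
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(page["Blocks"])
    while "NextToken" in page:
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page["Blocks"])
    return blocks
```

Typical usage would be `analyze_pdf_async(boto3.client("textract"), "my-bucket", "contracts/big.pdf")`, where the bucket and key are placeholders.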

pdfmux is also straightforward, but the operational model is different.

from pdfmux import extract

result = extract("invoice.pdf", format="markdown")

for page in result.pages:
    print(f"Page {page.number} (confidence {page.confidence:.2f}):")
    print(page.text)

You install the package, pin a model file or two for OCR, and run it inside whatever compute environment you already have. There is no API key, no rate limit, no per-call latency from a network round trip, and no monthly bill from a third party. There is also no AWS support contract to call when something breaks; you read the source, file an issue on GitHub, or fix it yourself.

For an AWS-native team building a quick MVP, Textract ships faster. For a team that is going to operate this pipeline for years and process millions of pages, the operational simplicity of pdfmux compounds.

Form and table extraction: the feature gap

Textract’s strongest feature is AnalyzeDocument with FORMS, which returns key-value pairs detected on the document, plus TABLES, which returns cell-level structured tables with row and column indices.
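The key-value pairs do not arrive as a ready-made dictionary. They come back as `KEY_VALUE_SET` blocks linked to each other and to `WORD` blocks through `Relationships`. The sketch below shows one way to flatten them. The helper names are illustrative, the sample response is hand-written and heavily abbreviated (real blocks also carry geometry and confidence), and selection elements such as checkboxes are ignored.

```python
def _child_text(block, by_id):
    """Join the text of the WORD blocks that a block points to via CHILD."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response):
    """Flatten Textract FORMS output into a {key text: value text} dict."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_child_text(by_id[value_id], by_id))
        pairs[_child_text(block, by_id)] = " ".join(value_parts)
    return pairs


# Hand-written, abbreviated sample response for illustration.
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number"},
    {"Id": "w3", "BlockType": "WORD", "Text": "INV-2024-0847"},
]}
print(key_value_pairs(sample))
```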

pdfmux does not return key-value pairs natively. It returns clean Markdown with tables in standard pipe syntax. For RAG, this is usually what you want — the LLM you pair with the index can extract the key-value pairs from the Markdown when the user asks. For automation pipelines that need structured fields without an LLM in the loop (invoice line items going straight into an ERP), Textract’s AnalyzeExpense is the more direct path.
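If you do want structured rows without an LLM, pipe-syntax tables are simple to parse. The sketch below is a minimal parser for a single well-formed table. The sample table is made up for illustration and is not actual pdfmux output, and the parser does not handle escaped pipes inside cells.

```python
import re

# Matches a separator cell such as "---", ":---", or ":---:".
SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def parse_pipe_table(markdown):
    """Parse one Markdown pipe table into a list of {header: cell} dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if all(SEPARATOR_CELL.match(cell) for cell in cells):
            continue  # header/body separator row
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Made-up sample table for illustration.
sample_markdown = """
| Item   | Qty | Price |
| ------ | --- | ----- |
| Widget | 2   | 9.50  |
| Gadget | 1   | 24.00 |
"""
line_items = parse_pipe_table(sample_markdown)
print(line_items)
```

This gets you rows and columns, but not the semantic labeling that AnalyzeExpense provides; deciding which column is the unit price is still up to your own rules.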

The hybrid pattern that works well: pdfmux for ingestion (cheap, private, high recall), Textract for the small set of documents that require typed-field extraction without an LLM (rare, high-precision). Most teams find they only need 5 to 10% of their volume on Textract once pdfmux handles the rest.

When Textract wins

You should pick AWS Textract over pdfmux when:

Your stack is fully AWS-native and the IAM model is already in place. Adding pdfmux means adding a new compute environment to manage; adding Textract is a boto3 client.
You need typed key-value extraction without an LLM in the loop. Invoice line items going directly into NetSuite. Driver’s license fields going into a KYC database. Textract’s AnalyzeExpense and AnalyzeID are purpose-built and faster to integrate than rolling your own.
Your monthly volume is below 6,000 pages and your engineer time is expensive. The cost crossover has not flipped yet.
You need AWS support contracts and SLAs. pdfmux is open source. There is no SLA. If your procurement requires a vendor contract, Textract has one.
Your documents are already in S3 and you need async processing of 1,000-page jobs. Textract’s async API and S3 integration are very good for batch workloads with documents that already live in AWS.

When pdfmux wins

You should pick pdfmux over AWS Textract when:

Privacy or data residency rules forbid third-party processing. This is non-negotiable for some regulated workloads.
Your monthly volume is over 6,000 pages and growing. The cost gap widens linearly.
You are not on AWS, or you are multi-cloud. Running Textract from GCP or Azure adds egress cost and architectural awkwardness; pdfmux runs anywhere.
You need confidence scores per page, not per detected field. This is what lets you filter bad extractions out of your RAG index before they corrupt retrieval. We covered the pattern in PDF extraction for RAG pipelines.
You want to avoid vendor lock-in on the extraction layer. Textract’s JSON schema is proprietary; pdfmux returns Markdown that any tool can consume.
You need to extract documents in languages Textract supports poorly. Textract’s Arabic and Hindi support has improved but still trails the open-source layout models pdfmux routes to. We covered Arabic specifically in How to extract data from Arabic PDFs.

The hybrid approach most teams settle on

The teams that have been running both for a year tend to converge on a split: pdfmux for the bulk of the corpus (cheap, private, high recall), and Textract called selectively on the documents where typed-field extraction is required and the per-document cost is justified.

from pdfmux import extract
import boto3

textract = boto3.client("textract")

def hybrid_extract(pdf_path, needs_typed_fields=False):
    # Typed-field documents go to Textract, which returns a JSON dict of Blocks.
    if needs_typed_fields:
        with open(pdf_path, "rb") as f:
            return textract.analyze_document(
                Document={"Bytes": f.read()},
                FeatureTypes=["FORMS", "TABLES"],
            )
    # Everything else stays local; pdfmux returns a Markdown result object.
    return extract(pdf_path, format="markdown")

The router decision — which 5 to 10% of documents go to Textract — is usually based on document type (invoices and government forms go to Textract; everything else goes to pdfmux). The cost reduction from this single change tends to be 80 to 90% of the original Textract bill.
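A sketch of that routing rule and what it does to the bill, under stated assumptions: the document-type labels and the sample batch are hypothetical, the $65 per 1,000 pages and $105 per month compute figures come from the cost section above, and engineer time is left out.

```python
TEXTRACT_TYPES = {"invoice", "government_form"}  # assumed document-type labels
TEXTRACT_PER_1000 = 65.0                         # FORMS and TABLES, USD
PDFMUX_COMPUTE_MONTHLY = 105.0                   # on-demand c7g.xlarge, USD


def pick_extractor(doc_type):
    """Route typed-field document types to Textract, the rest to pdfmux."""
    return "textract" if doc_type in TEXTRACT_TYPES else "pdfmux"


def textract_share(doc_types):
    """Fraction of a batch that the router sends to Textract."""
    routed = [pick_extractor(t) for t in doc_types]
    return routed.count("textract") / len(routed)


def hybrid_monthly(pages, share):
    """Monthly cost of the hybrid setup: Textract pages plus pdfmux compute."""
    textract_pages = round(pages * share)
    return textract_pages * TEXTRACT_PER_1000 / 1000 + PDFMUX_COMPUTE_MONTHLY


# Hypothetical batch of 20 documents, 2 of which need typed fields.
batch = ["invoice", "government_form"] + ["contract"] * 8 + ["report"] * 10
share = textract_share(batch)

pages = 50_000
all_textract = pages * TEXTRACT_PER_1000 / 1000
hybrid = hybrid_monthly(pages, share)
saving = 1 - hybrid / all_textract

print(share)
print(hybrid)
print(round(saving * 100))  # percent saved against an all-Textract bill
```

With 10% of a 50,000-page month routed to Textract, the hybrid bill comes to $430 against $3,250, a saving of about 87%, which sits inside the 80 to 90% range quoted above.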

Summary

AWS Textract is the right answer for AWS-native teams with low to medium volume that need typed-field extraction without an LLM downstream. pdfmux is the right answer for everything else: cost-sensitive teams, privacy-bound workloads, multi-cloud or non-AWS stacks, and any pipeline where confidence scoring per page is the difference between a working RAG system and a hallucinating one. The cost crossover at standard Textract pricing is around 6,000 pages per month for AnalyzeDocument with FORMS and TABLES; below that, Textract is cheaper than the engineer time to run pdfmux, and above it, the math runs the other direction at roughly $1 saved per 15 pages processed.

For a deeper comparison against open-source alternatives, see pdfmux vs LlamaParse, pdfmux vs Kreuzberg, and the four-way benchmark in pdfmux vs LlamaParse vs Docling vs Unstructured.