Direct answer: Use LlamaParse if you process under 1,000 pages per day (within its free tier), need maximum accuracy on complex multi-column layouts, and never handle privacy-sensitive documents. Use pdfmux in every other case: it's free, runs locally, scores 0.905 on opendataloader-bench (99.5% of the top paid tool's score), and processes private documents without sending them to any API. The cost crossover sits around 15,000–20,000 pages per month on LlamaParse's standard tier.
What each tool actually is
LlamaParse is a cloud API published by LlamaIndex. You upload PDFs to their servers, they parse them using a proprietary pipeline — multimodal LLM inference plus layout analysis — and return structured Markdown or JSON. Pricing is consumption-based: free for 1,000 pages per day, $0.003 per page on the standard tier, and $0.01 per page on the premium tier with enhanced table parsing. There is no self-hosted option. Source code is closed.
pdfmux is an open-source Python library that runs entirely on your machine. It routes each PDF page to the optimal extractor: PyMuPDF for digital text, Docling for tables, RapidOCR for scanned pages. It scores quality on every page and re-extracts failures automatically. No API keys. No per-page cost. No documents leave your environment. Install: pip install pdfmux.
Both tools target the same use case: reliable PDF extraction for RAG pipelines, AI agents, and document automation. The tradeoffs are fundamentally about infrastructure philosophy — cloud versus local — not about extraction quality at the top end.
Accuracy
This is where honest comparison gets difficult. pdfmux is benchmarked on opendataloader-bench, a public dataset of 200 real-world PDFs from financial filings, academic papers, legal contracts, and government documents. LlamaParse is a cloud API: it cannot be run through the same benchmark without paying per-page costs at test scale, and no third-party opendataloader-bench scores have been published for it.
What we can compare directly:
| Metric | pdfmux | LlamaParse |
|---|---|---|
| opendataloader-bench overall | 0.905 | not published |
| Reading Order (NID) | 0.920 | not published |
| Table Accuracy (TEDS) | 0.911 | not published |
| Heading Structure (MHS) | 0.852 | not published |
| LlamaIndex internal eval | not published | ~92% (claimed) |
| Complex multi-column layouts | Good | Better |
| Scanned documents (OCR) | 0.87 overall | comparable |
| GPU required | No | No (cloud-side) |
| Max pages per call | No hard limit | 10,000 pages |
LlamaIndex claims approximately 92% accuracy on their internal evaluation mix of financial, academic, and legal documents. Without a shared benchmark, direct verification is not possible. LlamaParse's accuracy edge on complex layouts comes from running GPT-4V or equivalent multimodal inference on every page in premium mode: a model that sees the page visually recovers reading order from layout rather than from coordinate positions alone.
pdfmux uses ML selectively — Docling for table-heavy pages, RapidOCR for scanned pages — and skips LLM inference on the 85–90% of pages that are clean digital text. This is more efficient and cheaper, but means complex multi-column layouts (academic preprints, financial prospectuses with flowing multi-column text) may have worse reading-order recovery than a model that sees the page visually.
In practice, the accuracy gap is small on most business documents. Financial reports, legal contracts, government documents, and standard invoices all extract at comparable quality between the two tools.
Cost analysis
The cost difference compounds quickly at scale.
| Volume (pages/month) | LlamaParse standard ($0.003/page) | LlamaParse premium ($0.01/page) | pdfmux |
|---|---|---|---|
| 30,000 (the 1,000/day free-tier cap) | $0 (free tier) | — | $0 |
| 50,000 | $150 | $500 | $0 |
| 100,000 | $300 | $1,000 | $0 |
| 250,000 | $750 | $2,500 | $0 |
| 500,000 | $1,500 | $5,000 | $0 |
| 1,000,000 | $3,000 | $10,000 | $0 |
The infrastructure cost for pdfmux at these volumes: a Hetzner CPX21 server (3 vCPU, 4GB RAM, $15/month) processes approximately 3,000–6,000 pages per day. At 100,000 pages per month, that’s $15 in compute versus $300 in LlamaParse API fees — a 20x difference. At 500,000 pages per month you’d need 3–4 servers ($60/month total) versus $1,500 in API fees, still a 25x gap.
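To project the crossover for your own volume, a back-of-the-envelope sketch is enough. This uses the assumptions above: a $15/month server handling roughly 5,000 pages per day (the midpoint of 3,000–6,000), with every page billed at LlamaParse's standard rate as in the table:

```python
import math

def llamaparse_monthly(pages: int, per_page: float = 0.003) -> float:
    # Matches the table above: every page billed at the standard rate
    return pages * per_page

def pdfmux_monthly(pages: int, pages_per_server: int = 150_000,
                   server_price: float = 15.0) -> float:
    # Assumes ~5,000 pages/day per $15 server (midpoint of 3,000-6,000)
    return math.ceil(pages / pages_per_server) * server_price

for volume in (50_000, 100_000, 500_000):
    print(f"{volume:>9,} pages: LlamaParse ${llamaparse_monthly(volume):,.0f} "
          f"vs pdfmux infrastructure ${pdfmux_monthly(volume):,.0f}")
```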
The free tier math is different. If you stay under 1,000 pages per day, LlamaParse genuinely costs nothing and requires zero infrastructure. That is the specific scenario where it wins on economics.
Privacy and data residency
This is the decisive factor for a large fraction of pdfmux deployments.
LlamaParse sends your documents to LlamaIndex’s servers for processing. For most developer applications — public-domain research corpora, internal product documentation, open datasets — this is acceptable. For regulated environments, it is not:
- Legal documents: Client contracts, case files, and privileged memos sent to a third-party API create attorney-client privilege risk in most common law jurisdictions.
- Healthcare: PHI under HIPAA requires a Business Associate Agreement. LlamaIndex offers BAAs at enterprise tier, but the data still travels to their infrastructure.
- Financial services: Data residency rules in UAE (PDPL), Saudi Arabia (PDPL), Switzerland (FADP), and the EU (GDPR Article 46) restrict cross-border document processing without specific contractual safeguards that most teams prefer to avoid entirely.
- Government and defense: cloud APIs are off the table entirely.
pdfmux has no network calls during extraction. Documents are processed on your hardware, stay in your environment, and never touch an external server. This is the primary reason regulated industries default to pdfmux — not extraction accuracy, but risk elimination.
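If you want to verify the no-network claim on your own sample rather than take it on faith, a crude smoke test is to stub out socket creation before extraction. This is a generic Python trick, not a pdfmux feature:

```python
import socket
from pdfmux import process

def _blocked(*args, **kwargs):
    raise RuntimeError("outbound connection attempted during extraction")

socket.socket = _blocked  # crude: any network use now raises immediately

result = process("confidential-contract.pdf", quality="standard")
print(result.confidence)  # completed without a single network call
```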
Feature comparison
| Feature | pdfmux | LlamaParse |
|---|---|---|
| Open source | Yes (MIT) | No |
| Self-hosted | Yes | No |
| Per-page cost | $0 | $0.003–$0.01 |
| Local processing | Yes | No |
| Per-page confidence scoring | Yes (0.0–1.0) | No |
| Multi-pass extraction | Yes (extract → audit → repair) | Single pass |
| Self-healing re-extraction | Yes | No |
| Table extraction method | ML-based (Docling) | Proprietary LLM |
| OCR for scanned pages | RapidOCR, auto-detect | Cloud OCR |
| MCP server for AI agents | Yes | No |
| Complex multi-column layouts | Good | Better |
| Output formats | Markdown, JSON, text | Markdown, JSON, text |
| Async API | No (sync) | Yes (REST) |
| Free tier | Unlimited (self-hosted) | 1,000 pages/day |
| Python version | 3.9+ | 3.8+ |
Code comparison
pdfmux
```python
from pdfmux import process, extract_fields

# Standard extraction with quality audit
result = process("financial-report.pdf", quality="standard")
print(result.text)        # Clean Markdown
print(result.confidence)  # 0.94 — per-document average
print(result.warnings)    # ["Page 7: low text density, re-extracted with OCR"]

# Structured field extraction
fields = extract_fields("invoice.pdf", schema={
    "vendor": str,
    "total": float,
    "date": str,
    "invoice_number": str,
})
```
LlamaParse
```python
from llama_parse import LlamaParse

# Standard extraction
parser = LlamaParse(api_key="llx-...", result_type="markdown")
documents = parser.load_data("financial-report.pdf")
text = documents[0].text

# Premium mode for complex layouts
parser_premium = LlamaParse(
    api_key="llx-...",
    result_type="markdown",
    premium_mode=True,
)
docs = parser_premium.load_data("multi-column-annual-report.pdf")
```
LlamaParse integrates natively with LlamaIndex’s RAG framework, which makes it convenient if you’re already using that stack. pdfmux integrates with any framework — LangChain, LlamaIndex, raw ChromaDB, custom pipelines — since it returns plain text and Markdown. See PDF extraction for RAG pipelines for integration patterns.
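As an illustration of that framework-agnostic path, here is a minimal sketch that chunks pdfmux output into a ChromaDB collection. The splitter settings and collection name are arbitrary choices, not recommendations from either project:

```python
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pdfmux import process

# Extract locally, then chunk and index with any vector store
result = process("financial-report.pdf", quality="standard")
splitter = RecursiveCharacterTextSplitter(chunk_size=1_000, chunk_overlap=100)
chunks = splitter.split_text(result.text)

client = chromadb.Client()
collection = client.create_collection("reports")  # name is illustrative
collection.add(
    documents=chunks,
    ids=[f"financial-report-{i}" for i in range(len(chunks))],
)
```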
The confidence scoring gap
One practical difference that matters for production systems: pdfmux returns a confidence score (0.0–1.0) for every page. LlamaParse does not.
Why this matters: in a production RAG pipeline, you need to know which pages to trust. A financial report with 3 scanned signature pages and 47 digital pages should trigger review on the scanned pages, not silently index garbled OCR. pdfmux gives you the signal to make that decision automatically:
```python
from pdfmux import process

result = process("annual-report.pdf", quality="standard")

low_confidence_pages = [p for p in result.pages if p.confidence < 0.7]
if low_confidence_pages:
    # Flag for human review, or escalate the document to a higher quality mode
    result_high = process("annual-report.pdf", quality="high")
```
LlamaParse returns text. Whether that text is from a clean digital page or a low-quality scanned image is opaque. You have to inspect the output manually to detect extraction failures.
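If you need a trust signal on LlamaParse output anyway, the usual workaround is a post-hoc heuristic over the returned text. The check below is a generic sketch with arbitrary thresholds, not anything LlamaIndex ships:

```python
def looks_garbled(text: str, min_alpha_ratio: float = 0.6) -> bool:
    """Rough proxy for extraction failure: too few alphanumeric characters,
    or Unicode replacement characters left behind by bad decoding."""
    if not text.strip():
        return True
    if "\ufffd" in text:  # U+FFFD appears when bytes could not be decoded
        return True
    alpha = sum(ch.isalnum() or ch.isspace() for ch in text)
    return alpha / len(text) < min_alpha_ratio

# `documents` as returned by parser.load_data() in the snippet above
suspect_docs = [d for d in documents if looks_garbled(d.text)]
```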
When to use each
Use LlamaParse when:
- Your volume stays under 1,000 pages per day (free tier, zero cost, zero infrastructure)
- You’re processing complex multi-column academic papers or dense financial prospectuses where LLM-based reading order recovery improves output quality
- You’re already building on LlamaIndex’s ecosystem and want native integration
- No privacy, confidentiality, or data residency constraints apply
- You want zero infrastructure management — no servers, no deployments, no maintenance
Use pdfmux when:
- Monthly volume exceeds 20,000–50,000 pages (cost math shifts decisively at scale)
- Any documents are privileged, confidential, regulated, or subject to data residency rules
- You need per-page quality signals to build conditional downstream logic (flag, escalate, retry)
- Your pipeline must run offline, on private hardware, or in air-gapped environments
- You want self-healing extraction — automatic recovery when a page fails
- You need an MCP server that gives AI agents direct local PDF access
- You’re processing high-volume batch jobs where API latency and rate limits add friction
The practical recommendation: benchmark pdfmux on a representative sample of your actual documents first. For most business PDFs — invoices, contracts, reports, forms — pdfmux’s 0.905 overall score is indistinguishable from LlamaParse in downstream RAG quality. The 2–4% accuracy gap on complex layouts rarely moves retrieval metrics. If it does on your specific document type, LlamaParse premium is a targeted upgrade, not a full migration.
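A minimal version of that sample benchmark, assuming a samples/ folder of representative PDFs (the folder name and the number of documents surfaced are placeholders):

```python
from pathlib import Path
from statistics import mean
from pdfmux import process

# Process every sample and collect per-document confidence
scores = {
    pdf.name: process(str(pdf), quality="standard").confidence
    for pdf in Path("samples").glob("*.pdf")
}

print(f"mean confidence: {mean(scores.values()):.3f}")
for name, score in sorted(scores.items(), key=lambda kv: kv[1])[:5]:
    print(f"  weakest: {name} ({score:.2f})")  # candidates for a premium spot check
```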
Summary
Both tools are serious. LlamaParse wins on complex layout accuracy and simplicity at low volumes. pdfmux wins on cost (free at any scale), privacy (fully local), observability (per-page confidence scores), and reliability (self-healing pipeline).
For the full context of where pdfmux sits across 7 tools including OpenDataLoader, Docling, marker, MinerU, and MarkItDown, see the 2026 PDF extractor comparison.