These are head-to-head comparisons between pdfmux and every other PDF extraction tool we benchmark against. Each post is written from one perspective: which tool produces the cleanest output for an LLM pipeline, measured on the same set of real PDFs.
Why not LiteParse, OpenDataLoader-PDF, or Chandra?
These three come up most often in the question “why not just use X?” The honest answers, with links to the full posts:
- LiteParse is a library shipped by LlamaIndex: a Rust core with clean Python, Node, and WASM bindings, and an 8.4K-star community. If you’re a single engineer dropping a parser into your own RAG pipeline and you don’t need a per-batch audit manifest, install LiteParse and stop reading. Full post: pdfmux vs LiteParse.
- OpenDataLoader-PDF holds the #1 score on its own hybrid benchmark (0.909, against 0.905 for pdfmux, which is free). If your workload is dominated by complex multi-column reading order and you’re willing to pay for their API to get the 0.004 edge (0.4 percentage points), that’s the honest buy. Full post: pdfmux vs OpenDataLoader-PDF.
- Chandra is Datalab’s newest model — a single VLM with state-of-the-art accuracy on complex tables, forms, and handwriting, available via Datalab’s hosted service with SOC 2 Type 2 and custom BAAs out of the box. If you’re a healthcare or legal team that needs ONE specialized model with vendor compliance attached — not an orchestrator across many — Chandra is the right buy. Full post: pdfmux vs Chandra.
pdfmux is the right call when your document mix is heterogeneous, when a silent failure costs real money downstream, when you need an MIT license with no field-of-use restrictions, or when you want one of those three to sit underneath pdfmux as a backend with an audit manifest on top. The composition is real.
What we measure
Every comparison post in this section uses the same evaluation harness:
- Reading order accuracy — does the extracted text appear in the order a human would read it? PDFs with multi-column layouts, sidebars, and footnotes break most tools here. We measure with the opendataloader benchmark, a 200-PDF corpus of real-world documents.
- Table structure (TEDS) — does a table come out with the right rows and columns? Borderless tables, merged cells, and nested headers separate the good tools from the bad. TEDS (Tree-Edit-Distance-based Similarity) is the industry-standard metric.
- OCR fallback — does the tool correctly detect scanned pages and route them to OCR? Pure text-extraction tools silently produce empty output on image-PDFs, which is the worst failure mode for downstream pipelines.
- Speed per page — wall-clock seconds on a 4-core x86 machine, no GPU. We don’t compare GPU times because most production pipelines can’t justify GPUs for ingest.
- Cost per 1,000 pages — for hosted tools, this includes API charges. For self-hosted tools, it includes the typical infra cost at the throughput we measure.
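For readers unfamiliar with TEDS, the definition as it is commonly stated is reproduced below. The exact edit costs and tree construction used by the harness behind these posts are not specified here, so treat this as the general form rather than a description of that harness.

```latex
\[
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
\]
```

Here $T_a$ and $T_b$ are the extracted table and the ground-truth table, each represented as a tree (typically the table's HTML structure), $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in a tree. A score of 1 means the two trees are identical; lower scores mean more rows, columns, or cells had to be inserted, deleted, or changed to make them match.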
Why we publish these
Every PDF extraction tool’s docs claim to “handle complex documents.” Most of them don’t. The only way to know which one to use is to run them against your actual document mix.
We run them against ours. The posts below are the result.
Each post in this section ends with a recommendation: which tool fits which workload best. Sometimes that recommendation is “use pdfmux.” Sometimes it’s “use the other one — here’s why.” We publish both kinds.
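If you want to run a similar comparison on your own document mix, the per-page numbers described under "What we measure" can be collected with a small harness. The sketch below is a minimal illustration, not the harness used for these posts: `extract_pages` is a hypothetical adapter you would write for each tool (it takes a path and returns one string per page), and the price and infra-cost inputs are assumptions you would supply yourself.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunStats:
    pages: int
    seconds_per_page: float
    empty_pages: List[int]  # 0-based indices of pages that came back with no text


def measure(extract_pages: Callable[[str], List[str]], pdf_path: str) -> RunStats:
    """Time one extractor on one PDF and flag pages with empty output.

    `extract_pages` is a hypothetical per-tool adapter: path in, one string
    per page out. It is not part of any real library's API.
    """
    start = time.perf_counter()
    pages = extract_pages(pdf_path)
    elapsed = time.perf_counter() - start
    # Whitespace-only output counts as empty: a likely silent failure on a scan.
    empty = [i for i, text in enumerate(pages) if not text.strip()]
    per_page = elapsed / len(pages) if pages else 0.0
    return RunStats(len(pages), per_page, empty)


def cost_per_1000_pages(seconds_per_page: float,
                        api_price_per_page: float = 0.0,
                        infra_cost_per_hour: float = 0.0) -> float:
    """API charges plus infra time for 1,000 pages (both prices are assumed inputs)."""
    infra_hours = seconds_per_page * 1000 / 3600
    return api_price_per_page * 1000 + infra_cost_per_hour * infra_hours


# Stand-in extractor so the sketch runs without any PDF library installed.
def fake_extractor(pdf_path: str) -> List[str]:
    return ["Page one text", "   ", "Page three text"]


stats = measure(fake_extractor, "sample.pdf")
print(stats.empty_pages)  # [1]: the whitespace-only page is flagged
print(cost_per_1000_pages(0.36, infra_cost_per_hour=0.10))
```

An empty page is only a signal, since some pages are legitimately blank, so flagged pages are worth checking by hand before you count them as failures.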
Reading order
If you’re choosing a PDF extractor for the first time, start with:
- pdfmux vs LlamaParse — the most common “should I use the hosted commercial option” question
- pdfmux vs Docling — IBM’s open-source extractor, the strongest competitor on table accuracy
- pdfmux vs PyMuPDF — PyMuPDF is the de facto default in Python; this post answers the “do I need anything more than PyMuPDF?” question
- pdfmux vs AWS Textract — the most common AWS-shop question
If you already know which tool you’re comparing against, the list below has every comparison we’ve published.