These are head-to-head comparisons between pdfmux and every other PDF extraction tool we benchmark against. Each post is written from one perspective: which tool produces the cleanest output for an LLM pipeline, measured on the same set of real PDFs.

Why not LiteParse, OpenDataLoader-PDF, or Chandra?

These three come up most often in the question “why not just use X?” The honest answers, with links to the full posts:

pdfmux is the right call when your document mix is heterogeneous, when a silent failure costs real money downstream, when you need an MIT license with no field-of-use restrictions, or when you want one of those three to sit underneath pdfmux as a backend with an audit manifest on top. That last option is not hypothetical: any of the three can run as a backend beneath pdfmux.

What we measure

Every comparison post in this section uses the same evaluation harness:

  1. Reading order accuracy — does the extracted text appear in the order a human would read it? PDFs with multi-column layouts, sidebars, and footnotes break most tools here. We measure with the opendataloader benchmark, a 200-PDF corpus of real-world documents.
  2. Table structure (TEDS) — does a table come out with the right rows and columns? Borderless tables, merged cells, and nested headers separate the good tools from the bad. TEDS (Tree-Edit-Distance-based Similarity) is the industry-standard metric.
  3. OCR fallback — does the tool correctly detect scanned pages and route them to OCR? Pure text-extraction tools silently produce empty output on image-PDFs, which is the worst failure mode for downstream pipelines.
  4. Speed per page — wall-clock seconds on a 4-core x86 machine, no GPU. We don’t compare GPU times because most production pipelines can’t justify GPUs for ingest.
  5. Cost per 1,000 pages — for hosted tools, this includes API charges. For self-hosted tools, it includes the typical infra cost at the throughput we measure.
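
To make the arithmetic behind these metrics concrete, here is a minimal sketch in plain Python. It is not the harness we run, and every name in it is hypothetical. The 20-character threshold for detecting a scanned page and the "one fully utilized machine" cost model are assumptions chosen for illustration. The TEDS helper only applies the standard normalization (one minus the tree edit distance divided by the size of the larger tree); computing the tree edit distance itself is out of scope here.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): a page whose text layer is nearly
    empty is probably a scan and should be routed to OCR."""
    return len(page_text.strip()) < min_chars


def teds_score(edit_distance: float, nodes_a: int, nodes_b: int) -> float:
    """TEDS = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|).

    edit_distance is assumed to be computed elsewhere on the two table trees;
    nodes_a and nodes_b are the node counts of those trees.
    """
    largest = max(nodes_a, nodes_b)
    if largest == 0:
        return 1.0  # two empty tables are trivially identical
    return 1.0 - edit_distance / largest


def seconds_per_page(wall_clock_seconds: float, pages: int) -> float:
    """Average wall-clock time per page for one run."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return wall_clock_seconds / pages


def self_hosted_cost_per_1000_pages(hourly_infra_cost: float,
                                    sec_per_page: float) -> float:
    """Infra cost per 1,000 pages, assuming one machine kept fully busy."""
    if sec_per_page <= 0:
        raise ValueError("sec_per_page must be positive")
    pages_per_hour = 3600.0 / sec_per_page
    return hourly_infra_cost / pages_per_hour * 1000.0
```

For example, a tool that takes 30 seconds for a 60-page document runs at 0.5 seconds per page. On a machine that costs 0.20 per hour, that works out to roughly 0.028 per 1,000 pages under the assumptions above. For hosted tools, the API charge per page would replace the infra term.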

Why we publish these

Every PDF extraction tool’s docs claim to “handle complex documents.” Most of them don’t. The only way to know which one to use is to run them against your actual document mix.

We run them against ours. The posts below are the result.

Each post in this section ends with a recommendation: which tool fits which workload best. Sometimes that recommendation is “use pdfmux.” Sometimes it’s “use the other one, and here’s why.” We publish both kinds.

Reading order

If you’re choosing a PDF extractor for the first time, start with:

If you already know which tool you’re comparing against, the list below has every comparison we’ve published.