pdfmux vs LiteParse: library or product?

LiteParse is a library. pdfmux is a product. That is the honest tradeoff, and the rest of this page makes that distinction concrete instead of pretending one is “better” than the other.

LiteParse is the open-source PDF parser shipped by LlamaIndex: Apache-2.0 licensed, a Rust core with Python/Node/WASM bindings, and around 8.4K GitHub stars, roughly 3,000 of them added in the last week as it trends. It is small, fast, and well-engineered. If you need a parser to drop into a personal RAG project, LiteParse is a sensible default.

pdfmux is what happens when you take an OSS PDF extractor and wrap it in the operational scaffolding a team actually runs in production — a CLI with strict-mode batch jobs, an MCP server for agentic pipelines, a LangChain adapter, an audit-correctness eval harness that scores per-page confidence, and a hosted Cloud tier at app.pdfmux.com with BYOK and per-key quotas. pdfmux 1.7.0 shipped to PyPI on 2026-05-22 with 670 tests passing.

You can run pdfmux purely as a Python library and never touch the Cloud tier. You can also run LiteParse as a backend under pdfmux’s MCP server and keep the audit manifest on top. Both are real.

This page is an honest side-by-side comparison. No “X is faster” claims without measurement methodology. No accuracy numbers that pretend to settle which one is “best” on PDFs in general.

Feature Comparison

Feature | pdfmux | LiteParse
Type | Product (OSS library + hosted Cloud) | Library (OSS only)
License | MIT | Apache-2.0
GitHub stars | ~600 | ~8,400 (+~3,000 this week)
Shipped by | Nameet Potnis (Drumworks) | LlamaIndex (run-llama)
Latest release | 1.7.0 (2026-05-22) | Active, ~50 releases on the repo
Test count | 670 passing | Per language binding
Implementation | Python | Rust core + Python/Node/WASM
CLI | pdfmux convert with strict mode, manifest | Library calls only
MCP server | Yes (composable backends) | No
LangChain integration | Native | Community wrapper
Audit-correctness harness | Yes (per-page confidence + manifest.json) | No
Per-document confidence signal | Yes | No
Hosted endpoint | app.pdfmux.com (BYOK + quotas) | None
Cloud pricing | $49/mo Pro, $199/mo Enterprise | Free (run it yourself)
Best fit | Teams running batch jobs where silent failures cost money | Single engineer dropping a parser into a project

The shape of the table itself is the point: LiteParse is a parser; pdfmux is everything around a parser plus a parser.

When LiteParse is the right call

LiteParse wins clearly when:

  • You’re a single engineer building a personal RAG project or a side project. The operational scaffolding pdfmux adds is overhead you don’t need.
  • You already operate the rest of the pipeline yourself. You have your own queue, your own retry logic, your own observability, your own ledger of which document was indexed when. A parser is the missing piece, not the missing system.
  • You want Rust performance without paying for a hosted runtime. LiteParse’s core is Rust, which should give it a speed advantage for high-throughput single-machine workloads where Python overhead matters. We have not benchmarked this, so treat it as an expectation rather than a measurement.
  • You want to fork it. Apache-2.0 means you can vendor it, modify the core, and ship a derivative. That’s a different relationship than depending on a vendor’s product roadmap.
  • You’re inside the LlamaIndex ecosystem already. LiteParse is shipped by LlamaIndex; the conventions match the rest of your stack.
  • The 8.4K-star community is a feature. More eyes on the codebase, more bindings (Node, WASM), more community fixes per week.

This is not a backhanded list. These are real reasons. If any three of them describe your situation, install LiteParse and stop reading.

When pdfmux is the right call

pdfmux is the right call when:

  • A silent failure costs real money. A RAG system that indexes 11 near-empty strings as if they were real content will hallucinate answers grounded in nothing. We know this because it happened to us on a 433-PDF customer batch — the CLI returned exit code 0; the manifest later showed 16 silent failures. That retro drove the audit-correctness harness, which drove the 1.6.3 audit fix.
  • You run batches, not one-offs. The manifest.json that pdfmux emits per batch is the artifact that lets you diff today’s run against yesterday’s, find the documents that regressed, and rerun only those. A library that returns markdown gives you no such artifact.
  • You need per-key quotas because you BYOK. When your team is calling an LLM-backed parser with OpenAI or Anthropic keys, a runaway script can burn $400 of provider budget in 20 minutes. pdfmux Cloud enforces a quota per BYOK key. LiteParse, being a library, has no concept of a key.
  • You want an MCP server, not a function call. Agentic pipelines that call extractors over MCP need a server. pdfmux ships one. LiteParse is invoked from inside your own runtime.
  • You want a $49/month price tag instead of an SRE. The Cloud tier at $49/mo (Pro) or $199/mo (Enterprise) is cheaper than the on-call rotation that a self-operated extraction service eventually requires.
  • You want LangChain to be the integration, not a project. pdfmux is a native LangChain document loader.

If two of these match your situation, pdfmux Cloud is probably the cheaper option even though LiteParse is free.
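To make the per-key quota idea concrete, here is a minimal in-memory sketch. How pdfmux Cloud actually enforces quotas is not described on this page, so `KeyQuota`, `QuotaExceeded`, and the dollar figures are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


class QuotaExceeded(Exception):
    """Raised when a charge would push a key past its budget (hypothetical)."""


@dataclass
class KeyQuota:
    """Hypothetical per-key spend tracker; not the pdfmux Cloud implementation."""
    limit_usd: float
    spent_usd: Dict[str, float] = field(default_factory=dict)

    def charge(self, key_id: str, cost_usd: float) -> float:
        """Record a charge and return the remaining budget for that key.

        The check happens before the spend is recorded, so a rejected
        call never counts against the key.
        """
        spent = self.spent_usd.get(key_id, 0.0)
        if spent + cost_usd > self.limit_usd:
            raise QuotaExceeded(f"{key_id} would exceed ${self.limit_usd:.2f}")
        self.spent_usd[key_id] = spent + cost_usd
        return self.limit_usd - self.spent_usd[key_id]


quota = KeyQuota(limit_usd=50.0)
remaining = quota.charge("team-key", 30.0)

try:
    quota.charge("team-key", 25.0)  # 30 + 25 is over the 50 limit, so this is refused
    blocked = False
except QuotaExceeded:
    blocked = True
```

The design point is that the refusal happens before the provider call is made, which is what stops a runaway script from spending first and reporting later.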

What’s actually shared (composition pattern)

Both pdfmux and LiteParse use LLMs under the hood for the hard parts — layout detection, table reconstruction, OCR on degraded scans. The choice of model differs and is configurable in both. The fundamental approach (rules-first, LLM-assist on the residue) is similar.

The interesting consequence: pdfmux’s MCP server can wrap LiteParse as a backend. pdfmux’s extraction pipeline is backend-agnostic by design. You can configure it to call LiteParse for the parse step and keep pdfmux’s audit manifest, retry logic, and BYOK quota wrapping on top.

This is not a marketing line. It is a real composition pattern that we use ourselves when we want to compare backends on a customer’s specific document set. The eval harness scores LiteParse’s output the same way it scores pdfmux’s, so we can answer “which backend is better on YOUR PDFs” with numbers instead of vibes.
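A rough sketch of that composition pattern, under stated assumptions: the `Backend` signature, `PageAudit`, `audited_parse`, and the stub backends are all hypothetical, and the scoring function is a crude text-density stand-in rather than pdfmux's actual confidence score. The point is only that one scoring step can sit on top of any parser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical backend signature: takes a PDF path, returns one text string per page.
Backend = Callable[[str], List[str]]


@dataclass
class PageAudit:
    page: int
    chars: int
    confidence: float


def text_density_score(text: str, full_page_chars: int = 1500) -> float:
    """Crude stand-in for a confidence score: stripped character count scaled to 0.0-1.0."""
    return min(len(text.strip()) / full_page_chars, 1.0)


def audited_parse(backend: Backend, path: str) -> List[PageAudit]:
    """Run any backend, then apply the same scoring to every page it returns."""
    pages = backend(path)
    return [
        PageAudit(page=i + 1, chars=len(p.strip()), confidence=text_density_score(p))
        for i, p in enumerate(pages)
    ]


# Two stub backends standing in for real parsers; neither calls a real library.
def backend_a(path: str) -> List[str]:
    return ["x" * 1500, "x" * 750]


def backend_b(path: str) -> List[str]:
    return ["x" * 1500, "   "]  # second page is effectively empty


report: Dict[str, List[PageAudit]] = {
    name: audited_parse(fn, "report.pdf")
    for name, fn in {"backend_a": backend_a, "backend_b": backend_b}.items()
}
```

Because both backends pass through the same `audited_parse`, their scores are directly comparable on the same document, which is the property the eval harness relies on.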

The honest framing is: LiteParse is a parser. pdfmux is a parser AND the audit layer around a parser. You can use them together. The relationship is closer to “FastAPI uses uvicorn” than to “Postgres vs MySQL.”

How we measure quality

This is the section every extractor comparison page should have and almost none do. Here is how pdfmux measures quality, and the same methodology applies if you point it at LiteParse:

  1. Per-page confidence score. Each extracted page gets a 0.0–1.0 score derived from text density, layout coherence, and OCR confidence (when OCR runs). Pages below 0.50 land in the low_lt_0.50 bucket in the manifest. Pages below 0.20 land in critical_lt_0.20 and trigger a strict-mode failure.
  2. Manifest diff. Running pdfmux convert <dir> --strict --min-confidence 0.20 -o ./out/ produces a manifest.json with one row per document, one block per page. You diff today’s manifest against the previous run to find regressions.
  3. Regression tests. 670 tests passing as of 1.7.0, including 11 added in 1.6.2 covering the five specific v1 failure modes from the 433-PDF retro. Each test is a real PDF, not a synthetic fixture.
  4. The doctor preflight. pdfmux doctor --check <dir> runs before extraction and tells you which documents will need OCR, which are encrypted, and which are truncated. The point is to surface failures BEFORE the batch starts.
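The bucketing and strict-mode rule from step 1 can be sketched as follows. The thresholds and bucket names come from the text above; everything else is an assumption, including the function names, the input shape (document name mapped to a list of page scores), and the choice to treat buckets as disjoint so that a page below 0.20 counts as critical only.

```python
from typing import Dict, List

LOW = 0.50       # threshold named in the text
CRITICAL = 0.20  # threshold named in the text


def bucket_pages(scores: List[float]) -> Dict[str, int]:
    """Count pages per bucket.

    Assumption: buckets are disjoint, so a page below 0.20 is counted
    as critical and not also as low.
    """
    buckets = {"ok": 0, "low_lt_0.50": 0, "critical_lt_0.20": 0}
    for s in scores:
        if s < CRITICAL:
            buckets["critical_lt_0.20"] += 1
        elif s < LOW:
            buckets["low_lt_0.50"] += 1
        else:
            buckets["ok"] += 1
    return buckets


def strict_exit_code(docs: Dict[str, List[float]], min_confidence: float = CRITICAL) -> int:
    """Return 1 if any page in any document scores below min_confidence, else 0."""
    failed = any(s < min_confidence for pages in docs.values() for s in pages)
    return 1 if failed else 0


# Hypothetical batch: document name -> per-page confidence scores.
batch = {"a.pdf": [0.91, 0.47], "b.pdf": [0.12, 0.88]}
summary = {name: bucket_pages(pages) for name, pages in batch.items()}
exit_code = strict_exit_code(batch)
```

Here `b.pdf` has one page below 0.20, so the batch as a whole exits non-zero instead of reporting success.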

If you point pdfmux at a LiteParse-extracted directory, the eval harness will score those outputs too. That is the honest way to settle “which extractor wins on my PDFs” — measure on your PDFs, not someone else’s.
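Comparing two sets of scores, whether from two runs or from two backends, comes down to a diff over per-document confidence. The sketch below assumes a simplified manifest shape (document name mapped to per-page scores) and a hypothetical `find_regressions` helper; the real manifest.json schema is not documented on this page.

```python
from typing import Dict, List

# Hypothetical simplified manifest: document name -> per-page confidence scores.
Manifest = Dict[str, List[float]]


def find_regressions(previous: Manifest, current: Manifest, tolerance: float = 0.05) -> List[str]:
    """Return documents whose lowest page confidence dropped by more than tolerance."""
    regressed = []
    for name, pages in current.items():
        if name not in previous or not pages or not previous[name]:
            continue  # new or empty documents have no baseline to compare against
        if min(previous[name]) - min(pages) > tolerance:
            regressed.append(name)
    return sorted(regressed)


yesterday = {"a.pdf": [0.90, 0.85], "b.pdf": [0.80, 0.75]}
today = {"a.pdf": [0.90, 0.84], "b.pdf": [0.80, 0.30], "c.pdf": [0.95]}

# Only documents in this list need to be rerun or inspected.
rerun = find_regressions(yesterday, today)
```

Using the worst page per document as the comparison key is a deliberate choice in this sketch: a single bad page is enough to poison a RAG index, so an average would hide exactly the failures worth finding.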

Quick code comparison

LiteParse (Python binding):

from liteparse import LiteParse

parser = LiteParse()
result = parser.parse("report.pdf")
print(result.markdown)

pdfmux (library mode):

import pdfmux

result = pdfmux.convert("report.pdf")
print(result.markdown)
print(result.confidence)  # per-page confidence scores

pdfmux (batch mode with audit manifest):

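# Install with the OCR extra, then convert the whole directory in strict mode
pip install -U 'pdfmux[ocr]'
pdfmux convert ./customer-pdfs/ -o ./out/ --strict --min-confidence 0.20
# Inspect the batch summary in the manifest
jq '.summary' ./out/manifest.json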

pdfmux Cloud (BYOK, hosted):

from pdfmux import Client

client = Client(api_key="pmx_...", byok={"openai": "sk-..."})
result = client.convert("report.pdf")  # quota-enforced, audit-logged

The library calls are within a few characters of each other. The difference shows up the moment you go from one PDF to a thousand — that’s where the manifest, the doctor, the strict mode, and the quotas earn their keep.

FAQ

Is LiteParse a product or a library?

LiteParse is a library — an open-source Rust core with Python, Node.js and WASM bindings, Apache-2.0 licensed, shipped by LlamaIndex. There is no hosted control plane, no quota system, no per-batch audit manifest, no billing. You install it in your own runtime and own the pipeline around it. pdfmux is a product — the OSS package ships with a CLI, MCP server, LangChain adapter, audit-correctness harness, and a hosted Cloud tier at app.pdfmux.com with BYOK and per-key quotas.

Is LiteParse more accurate than pdfmux?

On clean text-based PDFs, both produce comparable markdown. The harder question — which most extractor comparisons avoid — is what happens on the documents that don’t parse cleanly. pdfmux ships an audit-correctness eval harness that scores per-page confidence and flags low-confidence outputs in a manifest.json. LiteParse returns markdown without a per-document confidence signal, which is fine until you find out, three weeks later, that 11 documents in your RAG index were near-empty strings.

Can I use LiteParse with pdfmux?

Yes. pdfmux is composable — its MCP server and convert pipeline are backend-agnostic by design, so LiteParse can sit underneath pdfmux as one of the extraction backends while pdfmux supplies the audit manifest, retry logic, and BYOK quota wrapping on top. The composition is real, not a marketing slogan.

What does pdfmux Cloud give me that LiteParse doesn’t?

Hosted endpoint at app.pdfmux.com, BYOK so your OpenAI/Anthropic/Voyage keys stay on your account, per-key quotas (so a runaway script doesn’t burn your provider budget), a hosted audit manifest UI, and tier pricing — Pro at $49/month and Enterprise at $199/month. LiteParse is a library; there is no hosted endpoint, no quota system, and no billing surface to begin with.

Which one should I pick if I’m starting from scratch?

If you’re a single engineer building a personal RAG project and you already manage your own infra, LiteParse is the lighter call — install, parse, ship, done. If you’re a team running batch jobs against customer PDFs where a silent failure costs real money downstream, pdfmux’s audit harness pays for itself the first time it flags a low-confidence batch before it hits production.


Related reading: pdfmux vs LlamaParse (the hosted sibling of LiteParse, also from LlamaIndex), the 433-PDF silent-failure retro that drove pdfmux’s audit-correctness harness, and the broader PDF extractor comparison for 2026.