pdfmux vs Chandra: model or orchestrator?
Chandra is a model. pdfmux is an orchestrator. That is the honest tradeoff, and the rest of this page makes that distinction concrete instead of pretending one is “better” than the other.
Chandra is Datalab’s newest OCR model — an end-to-end vision-language model that handles complex tables, forms, handwriting, and full-layout extraction in a single pass. The latest is Chandra 2. It supersedes Marker (Datalab’s previous flagship, which used a pipeline of specialized models). The code is Apache-2.0; the model weights ship under a Modified OpenRAIL-M license — free for research, personal use, and startups under $2M in funding or revenue, with a non-compete clause against Datalab’s hosted API. Datalab’s hosted service offers zero data retention by default, SOC 2 Type 2, and custom BAAs.
pdfmux is what happens when you take seven extraction backends and put a routing layer on top — PyMuPDF for clean digital text, RapidOCR for scanned pages, Docling for tables, Marker for academic papers, OpenDataLoader for complex reading order, Mistral OCR for cloud table OCR, and any vision LLM (Gemini, Claude, GPT-4o, Ollama, Mistral) for the hardest pages. Each page is classified, routed to the right backend, audited for confidence, and re-extracted with a different backend if it fails. pdfmux 1.7.0 shipped to PyPI on 2026-05-22 with 670+ tests passing. The library and CLI are MIT licensed with no field-of-use restrictions; the hosted Cloud tier at app.pdfmux.com adds BYOK and per-key quotas.
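The classify, route, audit, re-extract loop described above can be sketched in a few lines. This is a minimal illustration, not pdfmux's actual implementation: `ROUTES`, `PageResult`, `extract_page`, the backend names, and the 0.50 default threshold are all hypothetical, and each backend is assumed to be a plain callable that returns `(text, confidence)`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a backend is any callable mapping a page payload to (text, confidence).
Backend = Callable[[str], Tuple[str, float]]

# Hypothetical routing table: page kind -> backends to try, cheapest first.
ROUTES: Dict[str, List[str]] = {
    "digital": ["fast_text", "ocr", "vlm"],
    "scanned": ["ocr", "vlm"],
    "table": ["table_parser", "vlm"],
}
FALLBACK_ROUTE: List[str] = ["vlm"]


@dataclass
class PageResult:
    text: str
    confidence: float  # 0.0 to 1.0
    backend: str       # which backend produced this text


def extract_page(
    page: str,
    kind: str,
    backends: Dict[str, Backend],
    min_confidence: float = 0.50,
) -> PageResult:
    """Try the routed backends in order and stop at the first confident result.

    If no backend clears the threshold, the highest-confidence attempt is
    returned so the caller can still record it as a low-confidence page.
    """
    best: Optional[PageResult] = None
    for name in ROUTES.get(kind, FALLBACK_ROUTE):
        text, confidence = backends[name](page)
        candidate = PageResult(text, confidence, name)
        if best is None or candidate.confidence > best.confidence:
            best = candidate
        if confidence >= min_confidence:
            break  # audited and accepted; no need to re-extract
    if best is None:
        raise ValueError(f"no backends routed for page kind {kind!r}")
    return best
```

The design point is that the result carries the backend name and the confidence, so a failed page is recorded rather than silently returned as empty text.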
You can run Chandra as an open-weight model and never pay Datalab anything (within the OpenRAIL-M limits). You can also configure pdfmux to call Chandra as one of its backends and keep the audit manifest on top. Both are real.
This page is an honest side-by-side comparison. No “X is more accurate” claims without a measurement methodology, and no license fine print hidden three scrolls down.
Feature Comparison
| Feature | pdfmux | Chandra |
|---|---|---|
| Type | Orchestrator (OSS library + hosted Cloud) | Model (open weights + hosted Datalab service) |
| Code license | MIT | Apache-2.0 |
| Model/weights license | N/A — routes to backends you choose | Modified OpenRAIL-M (no commercial competition with Datalab) |
| Field-of-use restrictions | None | Cannot compete with Datalab’s hosted API |
| Shipped by | Nameet Potnis (Drumworks) | Datalab (same shop as Marker and Surya) |
| Latest release | 1.7.0 (2026-05-22) | Chandra 2 |
| Test count | 670+ passing | Per Datalab’s internal eval suite |
| Implementation | Python orchestrator + 7 backends + BYOK LLMs | Single end-to-end vision-language model |
| CLI | pdfmux convert with strict mode, manifest, watch, diff | Library calls / Datalab API |
| MCP server | Yes (composable backends) | No |
| LangChain integration | Native | Community wrapper |
| Audit-correctness harness | Yes (per-page confidence + manifest.json) | No |
| Per-document confidence signal | Yes | No |
| Compliance posture | None bundled (BYOK keeps your stack) | SOC 2 Type 2, BAAs, zero data retention on Datalab Cloud |
| GPU requirement | No (CPU-first; LLM backends optional) | Recommended for local inference |
| Self-hosted price | Free (pip install) | Free under OpenRAIL-M revenue thresholds |
| Hosted price | $49/mo Pro, $199/mo Enterprise (BYOK) | Per Datalab pricing (per-page) |
| Best fit | Teams running batch jobs across mixed document types where silent failures cost money | Healthcare/legal teams needing ONE compliant VLM, or solo engineers under the OpenRAIL-M threshold |
The shape of the table itself is the point: Chandra is a model; pdfmux is the layer that picks which model to run per page plus the audit manifest around it.
When Chandra is the right call
Chandra wins clearly when:
- You need one specialized VLM with compliance attached. Healthcare and legal teams who need SOC 2 Type 2, custom BAAs, and zero data retention out of the box get all of that from Datalab’s hosted Chandra service in one purchase. Assembling those certifications yourself across a multi-backend orchestrator is more work than buying Chandra.
- Your document mix is uniform. If 95% of your batch is the same kind of document — invoices, lab reports, claims forms — and Chandra handles it well, the orchestration overhead pdfmux adds is overhead you don’t need.
- You want SOTA single-model accuracy on hard pages. Chandra is designed for complex tables, forms, and handwriting. On those specific failure modes, expect a purpose-built VLM to beat an orchestrator that is routing to non-specialized backends, and confirm it on your own documents before committing.
- You’re under the OpenRAIL-M threshold. Research projects, personal projects, and startups under $2M in funding or revenue can run Chandra’s open weights for free. That’s a real commercial advantage at the seed stage.
- You’re inside Datalab’s ecosystem already. If you’re shipping Marker or Surya in production, Chandra is the in-family upgrade.
This is not a backhanded list. These are real reasons. If any three of them describe your situation, install Chandra and stop reading.
When pdfmux is the right call
pdfmux is the right call when:
- A silent failure costs real money. A RAG system that indexes 11 near-empty strings as if they were real content will hallucinate answers grounded in nothing. We know this because it happened to us on a 433-PDF customer batch — the CLI returned exit code 0; the manifest later showed 16 silent failures. That retro drove the audit-correctness harness. A single model returning markdown — Chandra, Mistral OCR, anything — gives you no per-document signal that something went wrong.
- Your document mix is heterogeneous. Mixed batches of digital PDFs, scanned PDFs, academic papers, tables, and forms benefit from per-page routing. PyMuPDF in 10ms for a digital page is faster and cheaper than a VLM call on the same page. pdfmux makes that routing decision automatically.
- You need an MIT license with no field-of-use restrictions. If you’re building a product that competes, or might compete, with Datalab’s hosted extraction service, the non-compete clause in Chandra’s OpenRAIL-M license rules Chandra out. pdfmux is MIT.
- You need per-key quotas because you BYOK. When your team is calling LLM-backed extractors with OpenAI, Anthropic, or Gemini keys, a runaway script can burn $400 of provider budget in 20 minutes. pdfmux Cloud enforces a quota per BYOK key.
- You want an MCP server, not a function call. Agentic pipelines that call extractors over MCP need a server. pdfmux ships one. Chandra is invoked from inside your own runtime.
- You want to compose Chandra and Marker and Mistral OCR and a custom LLM behind one interface. That’s what pdfmux is for. Picking one model and committing to it is fine; picking the right model per page is better.
If two of these match your situation, pdfmux is probably the cheaper option even when Chandra is free under OpenRAIL-M.
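The per-key quota point is easy to picture as code. The sketch below is an assumption-laden illustration of the idea, not pdfmux Cloud's enforcement logic: `QuotaGuard`, `KeyBudget`, `QuotaExceeded`, and the dollar-based accounting are all hypothetical names and choices.

```python
from dataclasses import dataclass, field
from typing import Dict


class QuotaExceeded(RuntimeError):
    """Raised when a call would push a provider key past its budget."""


@dataclass
class KeyBudget:
    limit_usd: float
    spent_usd: float = 0.0


@dataclass
class QuotaGuard:
    # Hypothetical registry: key identifier -> budget for that key.
    budgets: Dict[str, KeyBudget] = field(default_factory=dict)

    def charge(self, key_id: str, cost_usd: float) -> float:
        """Record a call's cost and return the remaining budget.

        The call is refused before any spend is recorded if it would
        exceed the key's limit, so a runaway loop fails fast.
        """
        budget = self.budgets[key_id]
        if budget.spent_usd + cost_usd > budget.limit_usd:
            raise QuotaExceeded(
                f"{key_id}: {budget.spent_usd:.2f} spent, "
                f"{cost_usd:.2f} requested, limit {budget.limit_usd:.2f}"
            )
        budget.spent_usd += cost_usd
        return budget.limit_usd - budget.spent_usd
```

Checking the budget before the provider call, rather than reconciling afterwards, is what turns a $400 surprise into an exception on the first over-budget request.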
What’s actually shared (composition pattern)
Both Chandra and pdfmux use vision-language models for the hard parts. Chandra IS a vision-language model; pdfmux ROUTES to vision-language models (Gemini, Claude, GPT-4o, Mistral OCR, Ollama, BYOK custom) when the rule-based and OCR backends can’t handle a page. The underlying technique is similar; the architectural choice is different.
The interesting consequence: pdfmux can wrap Chandra as a backend. pdfmux’s extraction pipeline is backend-agnostic by design. You can configure it to call Chandra for the parse step on pages that need a VLM and keep pdfmux’s audit manifest, retry logic, and per-key quota wrapping on top. The OpenRAIL-M license permits this for non-competing uses.
This is not a marketing line. It is a real composition pattern we use ourselves when we want to compare backends on a customer’s specific document set. The eval harness scores Chandra’s output the same way it scores PyMuPDF’s, so we can answer “which backend is better on YOUR PDFs” with numbers instead of vibes.
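To make the “same scoring for every backend” idea concrete, here is a minimal sketch. It assumes each extractor exposes a single `extract(path) -> str` method; the `Extractor` protocol, `compare_backends`, and `density_score` are hypothetical names, and the scorer is a crude stand-in for illustration, not the metric pdfmux’s eval harness uses.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Protocol


class Extractor(Protocol):
    """Anything that can turn a document path into text."""

    def extract(self, path: str) -> str: ...


def density_score(text: str, expected_chars: int = 200) -> float:
    """Stand-in scorer: non-whitespace character count, capped at 1.0."""
    non_whitespace = len("".join(text.split()))
    return min(1.0, non_whitespace / expected_chars)


def compare_backends(
    extractors: Dict[str, Extractor],
    paths: Iterable[str],
    score: Callable[[str], float] = density_score,
) -> Dict[str, float]:
    """Run every extractor over the same documents and average one shared score."""
    docs = list(paths)  # materialize once so each backend sees the same set
    return {
        name: mean(score(extractor.extract(path)) for path in docs)
        for name, extractor in extractors.items()
    }
```

Because the scorer only sees the extracted text, a thin adapter around Chandra and an adapter around PyMuPDF are judged identically; swapping in a better scorer does not change the comparison loop.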
The honest framing is: Chandra is a single specialized model. pdfmux is the audit layer and manifest around a fleet of models. You can use them together.
How we measure quality
This is the section every extractor comparison page should have and almost none do. Here is how pdfmux measures quality, and the same methodology applies if you point it at Chandra:
- Per-page confidence score. Each extracted page gets a 0.0–1.0 score derived from text density, layout coherence, and OCR confidence (when OCR runs). Pages below 0.50 land in the low_lt_0.50 bucket in the manifest. Pages below 0.20 land in critical_lt_0.20 and trigger a strict-mode failure.
- Manifest diff. pdfmux convert --strict --min-confidence 0.20 -o ./out/ produces a manifest.json with one row per document, one block per page. You diff today’s manifest against the previous run to find regressions.
- Regression tests. 670+ tests passing as of 1.7.0, including 11 added in 1.6.2 covering the five specific v1 failure modes from the 433-PDF retro. Each test is a real PDF, not a synthetic fixture.
- The doctor preflight. pdfmux doctor --check <dir> runs before extraction and tells you which documents will need OCR, which are encrypted, and which are truncated. The point is to surface failures BEFORE the batch starts.
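The two thresholds in the first bullet can be sketched as a small bucketing function. This is an illustration under assumptions, not pdfmux’s code: the `ok` bucket name, the choice to keep the buckets mutually exclusive, and the exit code of 1 are hypothetical; only the 0.50 and 0.20 cut-offs and the two bucket names come from the text above.

```python
from typing import Dict, List

LOW_THRESHOLD = 0.50       # pages below this are "low"
CRITICAL_THRESHOLD = 0.20  # pages below this fail strict mode


def bucket_pages(confidences: List[float]) -> Dict[str, List[int]]:
    """Group 1-based page numbers by confidence bucket.

    Assumption: a critical page is reported only in the critical bucket,
    not in the low bucket as well.
    """
    buckets: Dict[str, List[int]] = {
        "ok": [],
        "low_lt_0.50": [],
        "critical_lt_0.20": [],
    }
    for page_no, confidence in enumerate(confidences, start=1):
        if confidence < CRITICAL_THRESHOLD:
            buckets["critical_lt_0.20"].append(page_no)
        elif confidence < LOW_THRESHOLD:
            buckets["low_lt_0.50"].append(page_no)
        else:
            buckets["ok"].append(page_no)
    return buckets


def strict_exit_code(buckets: Dict[str, List[int]]) -> int:
    """Return a non-zero code when any page is in the critical bucket."""
    return 1 if buckets["critical_lt_0.20"] else 0
```

The point of a non-zero exit is that a batch script stops instead of handing near-empty pages to the next stage with exit code 0.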
If you point pdfmux at a Chandra-extracted directory, the eval harness will score those outputs too. That is the honest way to settle “which extractor wins on my PDFs” — measure on your PDFs, not someone else’s.
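Measuring on your own PDFs comes down to comparing two sets of per-page scores, whether that is yesterday’s run against today’s or one backend against another. The sketch below assumes a simplified manifest shape, a mapping from document name to a list of page confidences; the real manifest.json schema is not shown on this page, so `Manifest`, `find_regressions`, and the 0.05 tolerance are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed shape: {"report.pdf": [0.93, 0.88, ...]}, one confidence per page.
Manifest = Dict[str, List[float]]
Regression = Tuple[str, int, float, float]  # (document, page, old, new)


def find_regressions(
    previous: Manifest,
    current: Manifest,
    tolerance: float = 0.05,
) -> List[Regression]:
    """List pages whose confidence dropped by more than `tolerance`.

    Documents missing from `current` are skipped here; a real check
    would probably report them separately.
    """
    regressions: List[Regression] = []
    for doc, old_pages in sorted(previous.items()):
        new_pages = current.get(doc)
        if new_pages is None:
            continue
        for page_no, (old, new) in enumerate(zip(old_pages, new_pages), start=1):
            if old - new > tolerance:
                regressions.append((doc, page_no, old, new))
    return regressions
```

In practice both manifests would be loaded with `json.load` and reduced to this shape first; the tolerance keeps small score jitter from showing up as a regression.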
Quick code comparison
Chandra (open weights, local inference):
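from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
# Load the processor and the weights once, then reuse them for every page.
processor = AutoProcessor.from_pretrained("datalab-to/chandra-ocr-2")
model = AutoModelForVision2Seq.from_pretrained("datalab-to/chandra-ocr-2")
# One page image in, one markdown string out.
image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
# The default generation length can cut a full page short, so set an explicit cap.
output = model.generate(**inputs, max_new_tokens=4096)
markdown = processor.decode(output[0], skip_special_tokens=True)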
pdfmux (library mode):
import pdfmux
result = pdfmux.convert("report.pdf")
print(result.markdown)
print(result.confidence) # per-page confidence scores
pdfmux (batch mode with audit manifest):
pip install -U 'pdfmux[ocr]'
pdfmux convert ./customer-pdfs/ -o ./out/ --strict --min-confidence 0.20
jq '.summary' ./out/manifest.json
pdfmux Cloud (BYOK, hosted):
from pdfmux import Client
client = Client(api_key="pmx_...", byok={"openai": "sk-..."})
result = client.convert("report.pdf") # quota-enforced, audit-logged
The library calls aren’t directly comparable — Chandra is a model loaded into memory, pdfmux is a pipeline you point at a file. The difference shows up the moment you go from one PDF to a thousand — that’s where the manifest, the doctor, the strict mode, and the multi-backend routing earn their keep.
FAQ
Is Chandra a product or a model?
Chandra is a model. The Apache-2.0 code in datalab-to/chandra wraps a vision-language model whose weights ship under a Modified OpenRAIL-M license — free for research, personal use, and startups under $2M in funding or revenue, but you cannot use it to compete with Datalab’s hosted API. The hosted product is Datalab’s document conversion service, which offers zero data retention by default, SOC 2 Type 2, and custom BAAs. pdfmux is an orchestrator — the OSS package (MIT, no field-of-use restrictions) ships with a CLI, MCP server, LangChain adapter, audit-correctness harness, and a hosted Cloud tier at app.pdfmux.com with BYOK and per-key quotas.
Is Chandra more accurate than pdfmux?
On the documents Chandra was designed for — complex tables, forms, handwriting, full-layout VLM tasks — Chandra is excellent. Datalab claims higher accuracy than Marker. pdfmux scored 0.905 on opendataloader-bench, #1 among free/open-source tools — but pdfmux doesn’t compete with Chandra on raw single-model accuracy. The honest question isn’t “which is more accurate”; it’s “what happens when accuracy drops to 60% on a document neither tool was trained for, and do you find out before or after that document lands in your RAG index.” Chandra returns markdown. pdfmux ships an audit-correctness eval harness that scores per-page confidence and flags low-confidence outputs in manifest.json.
Can I use Chandra with pdfmux?
Yes. pdfmux’s extraction pipeline is backend-agnostic — its MCP server and convert pipeline are designed so any extractor can sit underneath as a backend, with pdfmux supplying the audit manifest, retry logic, and BYOK quota wrapping on top. The OpenRAIL-M license on Chandra’s weights permits this for non-competing uses. The composition is real.
What does pdfmux give me that Chandra doesn’t?
Three things. First, per-page confidence scoring and a per-batch manifest.json — Chandra returns markdown; pdfmux returns markdown plus a structured record of which pages were near-empty, below the strict-mode threshold, or need re-extraction. Second, an MIT license with zero field-of-use restrictions — pdfmux can be used in any product, including a competing extraction product. Chandra’s model weights cannot. Third, multi-backend routing — pdfmux picks PyMuPDF for clean text pages, Docling for tables, RapidOCR for scanned pages, and any vision LLM for the hardest pages, on a per-page basis. Chandra is one model that runs on every page regardless.
Which one should I pick if I’m starting from scratch?
If you’re a healthcare or legal team that needs ONE specialized VLM with vendor compliance (SOC 2 Type 2, BAAs, zero data retention) attached, Chandra via Datalab’s hosted service is the cleanest call. If you’re a single engineer under the OpenRAIL-M revenue threshold, run Chandra’s open weights locally. If you’re a team running batches across mixed document types where a silent failure costs real money downstream, pdfmux’s audit harness pays for itself the first time it flags a low-confidence batch before it hits production. If you’re building a product that competes with Datalab’s extraction service, the OpenRAIL-M non-compete forecloses Chandra and pdfmux’s MIT license becomes the only viable choice.
Related reading: pdfmux vs LiteParse (a library from LlamaIndex), pdfmux vs Marker (Chandra’s predecessor at Datalab), pdfmux vs LlamaParse, the 433-PDF silent-failure retro that drove pdfmux’s audit-correctness harness, and the broader comparison hub. For LLM/agent consumption, see llms.txt.