pdfmux vs MarkItDown — specialist or universal?

MarkItDown is a universal-to-markdown converter. pdfmux is a PDF specialist. That is the honest tradeoff, and the rest of this page makes that distinction concrete instead of pretending one is “better” than the other.

MarkItDown is Microsoft’s open-source library: MIT-licensed, Python, ~153K GitHub stars and adding roughly 6,200 stars per week as it trends. It converts PDFs, Word docs, PowerPoint decks, Excel sheets, HTML pages, images (with OCR), and audio (with transcription) into markdown for LLM ingestion. It is broad, easy to install, and backed by Microsoft. If you have a folder of mixed file types and you want one library to turn all of them into markdown, MarkItDown is the right default.

pdfmux is what happens when you narrow that breadth to one file type and go deep instead: a CLI with strict-mode batch jobs, an MCP server for agentic pipelines, a LangChain adapter, an audit-correctness eval harness that scores per-page confidence, and a hosted Cloud tier at app.pdfmux.com with BYOK and per-key quotas. pdfmux 1.7.0 shipped to PyPI on 2026-05-22 with 670 tests passing.

You can run pdfmux purely as a Python library and never touch the Cloud tier. You can also route PDFs to pdfmux from a MarkItDown-fronted pipeline and keep the audit manifest on top. Both are real compositions.

This page is a side-by-side honest tradeoff. No “X is faster” claims without measurement methodology. No accuracy numbers that pretend to settle which one is “best” on documents in general.

Feature Comparison

| Feature | pdfmux | MarkItDown |
| --- | --- | --- |
| Scope | PDF specialist | Universal converter (PDF, DOCX, PPTX, XLSX, HTML, images, audio) |
| License | MIT | MIT |
| GitHub stars | ~600 | ~153,000 (+~6,200 this week) |
| Shipped by | Nameet Potnis (Drumworks) | Microsoft |
| Latest release | 1.7.0 (2026-05-22) | Active, frequent releases |
| Test count | 670 passing | Per-converter |
| Implementation | Python | Python |
| CLI | `pdfmux convert` with strict mode, manifest | `markitdown` CLI for single-file conversion |
| MCP server | Yes (composable backends) | MCP server available |
| LangChain integration | Native | Community wrappers |
| Audit-correctness harness | Yes (per-page confidence + manifest.json) | No |
| Per-page confidence signal | Yes | No |
| Hosted endpoint | `app.pdfmux.com` (BYOK + quotas) | None (run it yourself) |
| Cloud pricing | $49/mo Pro, $199/mo Enterprise | Free (run it yourself) |
| Best fit | Teams running PDF batches where silent failures cost money | Engineers converting mixed file types to markdown |

The shape of the table itself is the point: MarkItDown is a universal converter that happens to do PDFs; pdfmux is a PDF parser plus everything that surrounds one.

When MarkItDown is the right call

MarkItDown wins clearly when:

Your input set is mixed file types. You have a folder of PDFs AND DOCX AND PPTX AND XLSX. MarkItDown handles all of them in one library call. pdfmux only handles PDFs; you’d need another tool for the rest, and a routing layer to decide which goes where.
You want zero-config simplicity. `pip install 'markitdown[pdf]' && markitdown file.pdf > file.md` (recent MarkItDown releases ship PDF support as an optional extra). That’s it. No strict mode, no confidence thresholds, no manifest to read. For a single document or a small one-shot batch, that’s exactly the right shape.
You’re building a “drop folder into RAG” feature. End-user-facing apps that let people upload arbitrary files and ingest them benefit from a single converter that handles the long tail of file types. MarkItDown is built for that job.
The Microsoft brand matters. Enterprise procurement, internal Microsoft shops, and Azure-native pipelines benefit from a Microsoft-published, MIT-licensed library with a 153K-star community behind it.
You don’t need a per-page confidence signal. If your downstream consumer is a chat session where a human reads the output, an occasional bad extraction is recoverable in the loop. The audit manifest pdfmux ships is overhead you don’t need.
You want audio + image transcription in the same library. MarkItDown ships audio transcription and image OCR as first-class converters. pdfmux is PDF-only.

This is not a backhanded list. These are real reasons. If any three of them describe your situation, install MarkItDown and stop reading.

When pdfmux is the right call

pdfmux is the right call when:

Your input is PDFs at scale and silent failures cost real money. A RAG system that indexes 11 near-empty strings as if they were real content will hallucinate answers grounded in nothing. We know this because it happened to us on a 433-PDF customer batch — the CLI returned exit code 0; the manifest later showed 16 silent failures. That retro drove the audit-correctness harness, which drove the 1.6.3 audit fix.
You run batches, not one-offs. The manifest.json that pdfmux emits per batch is the artifact that lets you diff today’s run against yesterday’s, find the documents that regressed, and rerun only those. A universal converter that returns markdown gives you no such artifact.
You need per-key quotas because you BYOK. When your team is calling an LLM-backed parser with OpenAI or Anthropic keys, a runaway script can burn $400 of provider budget in 20 minutes. pdfmux Cloud enforces a quota per BYOK key. MarkItDown, being a library, has no concept of a key.
You want an MCP server tuned to PDF orchestration. Agentic pipelines that call extractors over MCP need a server with PDF-aware affordances — backend selection, per-page confidence, strict-mode failures. pdfmux ships one.
You want a $49/month price tag instead of an SRE. Cloud tier at $49/mo (Pro) or $199/mo (Enterprise) is cheaper than the on-call rotation that a self-operated extraction service eventually requires.
You want LangChain to be the integration, not a project. pdfmux is a native LangChain document loader.

If two of these match your situation, pdfmux is probably the right choice for the PDF leg of your pipeline even if MarkItDown handles the rest.
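The batch workflow above hinges on diffing one run’s manifest against the previous one. Here is a minimal sketch of that diff, assuming a hypothetical manifest layout: a `documents` list whose entries carry a `path` and per-page `confidence` values. The real `manifest.json` schema is not shown on this page and may differ, and `min_confidence`, `find_regressions`, and `load_manifest` are illustrative helpers, not part of pdfmux.

```python
import json
from pathlib import Path


def min_confidence(doc: dict) -> float:
    """Lowest per-page confidence in one document entry (1.0 if it has no pages)."""
    return min((page["confidence"] for page in doc.get("pages", [])), default=1.0)


def find_regressions(previous: dict, current: dict, tolerance: float = 0.05) -> list[str]:
    """Return paths whose worst page dropped by more than `tolerance` between runs."""
    before = {doc["path"]: min_confidence(doc) for doc in previous["documents"]}
    regressed = []
    for doc in current["documents"]:
        old = before.get(doc["path"])
        # Documents that are new in this run have no baseline and are skipped.
        if old is not None and old - min_confidence(doc) > tolerance:
            regressed.append(doc["path"])
    return sorted(regressed)


def load_manifest(path: str) -> dict:
    """Read a manifest file from disk (hypothetical schema, see above)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Inline example data in the assumed layout; real runs would use load_manifest().
yesterday = {"documents": [
    {"path": "a.pdf", "pages": [{"confidence": 0.92}]},
    {"path": "b.pdf", "pages": [{"confidence": 0.88}]},
]}
today = {"documents": [
    {"path": "a.pdf", "pages": [{"confidence": 0.91}]},
    {"path": "b.pdf", "pages": [{"confidence": 0.15}]},
]}
print(find_regressions(yesterday, today))  # ['b.pdf']
```

Comparing the worst page rather than the average is a design choice in this sketch: one unreadable page in an otherwise clean document is exactly the kind of failure an average would hide.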

What’s actually shared (composition pattern)

What the two share is the approach to the easier parts of PDF extraction: both delegate text-based PDFs to an established open-source parser. pdfmux leans on PyMuPDF for text-based PDFs and on Tesseract for OCR fallback on scanned pages; MarkItDown’s PDF converter is built on pdfminer.six. The choice of router and the layout-detection logic differ. MarkItDown also calls LLMs (optionally) for image-to-text descriptions; pdfmux uses LLMs for layout-routing decisions and re-extraction of low-confidence pages.

The interesting consequence: you can route file types between them. A pipeline that uses MarkItDown for DOCX/PPTX/XLSX/audio and pdfmux for PDFs is a sensible split — each tool stays in its strongest lane, and the downstream consumer sees uniform markdown either way. The MCP servers compose: a router can call either one based on the file extension.
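The split described above can be sketched as a small extension-based router. The backend names, the `pick_backend` and `plan_batch` functions, and the exact extension set are assumptions for illustration; neither tool ships this code, and the extensions only mirror the formats named on this page.

```python
from pathlib import Path

# Non-PDF formats this page attributes to MarkItDown (illustrative, not exhaustive).
MARKITDOWN_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".html", ".htm", ".mp3", ".wav"}


def pick_backend(filename: str) -> str:
    """Return the name of the converter that should handle `filename`."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdfmux"
    if suffix in MARKITDOWN_EXTENSIONS:
        return "markitdown"
    raise ValueError(f"no converter configured for {suffix or 'files without an extension'}")


def plan_batch(filenames: list[str]) -> dict[str, list[str]]:
    """Group filenames by the backend that should convert them."""
    plan: dict[str, list[str]] = {"pdfmux": [], "markitdown": []}
    for name in filenames:
        plan[pick_backend(name)].append(name)
    return plan


print(plan_batch(["q3-report.pdf", "deck.pptx", "call.mp3"]))
```

Raising on unknown extensions, rather than guessing a default backend, keeps the routing layer from quietly sending a file to a converter that cannot handle it.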

The honest framing is: MarkItDown is a universal converter. pdfmux is a PDF specialist plus the audit tooling built around it. You can use them together. The relationship is closer to “use the right tool for each file type” than to “pdfmux vs MarkItDown.”

How we measure quality

This is the section every extractor comparison page should have and almost none do. Here is how pdfmux measures quality on PDFs, and the same methodology applies if you point it at MarkItDown’s PDF output:

Per-page confidence score. Each extracted page gets a 0.0–1.0 score derived from text density, layout coherence, and OCR confidence (when OCR runs). Pages below 0.50 land in the `low_lt_0.50` bucket in the manifest. Pages below 0.20 land in `critical_lt_0.20` and trigger a strict-mode failure.
Manifest diff. Running `pdfmux convert --strict --min-confidence 0.20 -o ./out/` produces a `manifest.json` with one row per document and one block per page. You diff today’s manifest against the previous run to find regressions.
Regression tests. 670 tests passing as of 1.7.0, including 11 added in 1.6.2 covering the five specific v1 failure modes from the 433-PDF retro. Each test runs against a real PDF, not a synthetic fixture.
The doctor preflight. `pdfmux doctor --check <dir>` runs before extraction and tells you which documents will need OCR, which are encrypted, and which are truncated. The point is to surface failures before the batch starts.
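The thresholds in the list above can be read as a simple bucketing rule. This sketches that rule only and is not pdfmux’s implementation: the bucket names `low_lt_0.50` and `critical_lt_0.20` come from the text, while `bucket_pages`, `strict_mode_passes`, the `ok` bucket, and the decision to count a critical page in the critical bucket only (rather than in both) are assumptions.

```python
LOW_THRESHOLD = 0.50
CRITICAL_THRESHOLD = 0.20


def bucket_pages(confidences: list[float]) -> dict[str, list[int]]:
    """Map bucket name -> 1-based page numbers, given one score per page."""
    buckets: dict[str, list[int]] = {"ok": [], "low_lt_0.50": [], "critical_lt_0.20": []}
    for page_number, score in enumerate(confidences, start=1):
        if score < CRITICAL_THRESHOLD:
            buckets["critical_lt_0.20"].append(page_number)
        elif score < LOW_THRESHOLD:
            buckets["low_lt_0.50"].append(page_number)
        else:
            buckets["ok"].append(page_number)
    return buckets


def strict_mode_passes(confidences: list[float]) -> bool:
    """Strict mode, as described above, fails when any page is critical."""
    return not bucket_pages(confidences)["critical_lt_0.20"]


scores = [0.97, 0.41, 0.08, 0.50]
print(bucket_pages(scores))        # {'ok': [1, 4], 'low_lt_0.50': [2], 'critical_lt_0.20': [3]}
print(strict_mode_passes(scores))  # False
```

In a batch job, the boolean would typically become the process exit code, so a run containing critical pages cannot report success the way the exit-code-0 batch described earlier did.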

If you point pdfmux at a MarkItDown-extracted directory of PDF outputs, the eval harness will score those outputs too. That is the honest way to settle “which extractor wins on my PDFs” — measure on your PDFs, not someone else’s.

Quick code comparison

MarkItDown (Python library):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.pdf")
print(result.text_content)

MarkItDown (CLI, single file):

pip install 'markitdown[pdf]'  # PDF support is an optional extra in recent releases
markitdown report.pdf > report.md

pdfmux (library mode):

import pdfmux

result = pdfmux.convert("report.pdf")
print(result.markdown)
print(result.confidence)  # per-page confidence scores

pdfmux (batch mode with audit manifest):

pip install -U 'pdfmux[ocr]'
pdfmux convert ./customer-pdfs/ -o ./out/ --strict --min-confidence 0.20
cat ./out/manifest.json | jq '.summary'

pdfmux Cloud (BYOK, hosted):

from pdfmux import Client

client = Client(api_key="pmx_...", byok={"openai": "sk-..."})
result = client.convert("report.pdf")  # quota-enforced, audit-logged

The single-file calls are nearly identical. The difference shows up the moment you go from one PDF to a thousand: that’s where the manifest, the doctor, the strict mode, and the quotas earn their keep. It shows up in the other direction the moment you have a DOCX or an MP3 to handle: that’s where MarkItDown’s breadth earns its keep.

FAQ

Is MarkItDown a PDF tool or a general-purpose converter?

MarkItDown is a general-purpose converter. Microsoft ships it as a single Python library that turns PDFs, Word docs, PowerPoint decks, Excel sheets, HTML pages, images (with OCR) and audio (with transcription) into markdown for LLM ingest. The PDF path is one converter inside a much larger surface. pdfmux is the opposite: a PDF specialist that goes deep on per-page confidence, batch audit manifests, OCR routing, and table extraction quality.

Is MarkItDown more accurate than pdfmux on PDFs?

On clean text-based PDFs, MarkItDown’s output is solid markdown: it leans on pdfminer.six under the hood for text extraction and gets the easy cases right. The harder question is what happens on the documents that don’t parse cleanly: scanned-pixel PDFs, multi-column legal filings, financial statements with merged-cell tables. MarkItDown returns markdown without a per-page confidence signal, so a low-quality extraction looks identical to a high-quality one. pdfmux scores each page 0.0–1.0 and writes a `manifest.json` that flags the failures before they hit your RAG index.

Can I use MarkItDown and pdfmux together?

Yes, and it’s a sensible composition for mixed-input pipelines. Use MarkItDown for the non-PDF file types where it shines — DOCX, PPTX, XLSX, HTML, audio — and route PDFs to pdfmux for the audit manifest, confidence scoring, and strict-mode batch jobs.

What does pdfmux give me that MarkItDown doesn’t?

Per-page confidence scoring, a batch audit manifest you can diff between runs, strict mode that fails the job when low-confidence pages exceed a threshold, an MCP server for agentic pipelines, a LangChain document loader, BYOK Cloud quotas at app.pdfmux.com, and a regression test suite of 670 tests grounded in real customer-PDF failures.

Which one should I pick if I’m starting from scratch?

If your input set is mixed file types and you want one library to handle all of them with minimum configuration, MarkItDown is the right call. If your input set is PDFs at scale and you’re running batch jobs where a silently-bad extraction means a wrong RAG answer hitting a paying customer, pdfmux’s audit harness pays for itself the first time it flags a low-confidence batch before it lands in production.

Related reading: pdfmux vs PyMuPDF, the 433-PDF silent-failure retro that drove pdfmux’s audit-correctness harness, and the broader PDF extractor comparison for 2026. For LLM/agent consumption, see llms.txt.