pdfmux vs pymupdf4llm: PDF Extraction Compared

TL;DR: Compare pdfmux and pymupdf4llm for PDF text extraction. Features, benchmarks, pricing, and when to use each.

pdfmux vs pymupdf4llm: Which LLM-optimized PDF extractor should you use?

pdfmux wins on output quality, license flexibility, and table handling. pymupdf4llm (~2k GitHub stars) is a wrapper around PyMuPDF optimized for producing LLM-ready markdown. It targets the same problem pdfmux solves, but it inherits PyMuPDF’s AGPL license and produces less refined output on complex layouts.

Both tools target the same use case: turning PDFs into clean markdown for RAG pipelines. In the benchmarks below, pdfmux scores higher on every accuracy metric, and it ships under an MIT license.

Feature Comparison

Feature	pdfmux	pymupdf4llm
Primary output	Markdown, JSON	Markdown
Table extraction	Built-in, structured	Basic markdown tables
Multi-column handling	Automatic reading order	Improved over raw PyMuPDF
Image extraction	With descriptions	References only
Chunking support	Native, configurable	Basic page-level
LlamaIndex integration	Standard loader	Official adapter
License	MIT	AGPL-3.0
Standalone install	Yes	Requires PyMuPDF

Benchmark Comparison

Metric	pdfmux	pymupdf4llm
Text accuracy (mixed layouts)	94.2%	90.1%
Table extraction F1	91.8%	78.3%
Multi-column accuracy	92.5%	84.7%
Processing speed (pages/sec)	45	55
Install size	15 MB	30 MB (PyMuPDF dep)
Memory usage (100-page PDF)	85 MB	110 MB

pymupdf4llm is faster (55 vs. 45 pages/sec, leveraging PyMuPDF’s C core), but pdfmux scores higher on structured output, most visibly on tables (91.8% vs. 78.3% F1) and multi-column layouts (92.5% vs. 84.7% accuracy).
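To put the table in concrete terms, the short sketch below computes the gap on each metric and the time each tool would need for a batch of pages. The figures are copied from the table above; the 1,000-page batch and the assumption that throughput stays constant across a batch are illustrative only.

```python
# Figures copied from the benchmark table above: (pdfmux, pymupdf4llm).
BENCHMARKS = {
    "text_accuracy_pct": (94.2, 90.1),
    "table_f1_pct": (91.8, 78.3),
    "multi_column_accuracy_pct": (92.5, 84.7),
    "pages_per_sec": (45.0, 55.0),
}


def gap(metric: str) -> float:
    """Difference on one metric, pdfmux minus pymupdf4llm."""
    pdfmux_value, pymupdf4llm_value = BENCHMARKS[metric]
    return round(pdfmux_value - pymupdf4llm_value, 1)


def seconds_for(pages: int, pages_per_sec: float) -> float:
    """Processing time, assuming throughput stays constant (an assumption)."""
    return pages / pages_per_sec


gaps = {metric: gap(metric) for metric in BENCHMARKS}

# Hypothetical 1,000-page batch, chosen only for illustration.
pdfmux_rate, pymupdf4llm_rate = BENCHMARKS["pages_per_sec"]
t_pdfmux = seconds_for(1000, pdfmux_rate)
t_pymupdf4llm = seconds_for(1000, pymupdf4llm_rate)

print(gaps)
print(f"pdfmux: {t_pdfmux:.2f} s, pymupdf4llm: {t_pymupdf4llm:.2f} s")
```

On these numbers the speed difference is about four seconds per 1,000 pages, while the table-extraction gap is 13.5 percentage points of F1.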

When to Use pymupdf4llm

pymupdf4llm is the right choice when you need:

PyMuPDF ecosystem — you’re already using PyMuPDF and want LLM-ready output as an add-on
LlamaIndex native — pymupdf4llm has an official LlamaIndex document loader
Raw speed — slightly faster processing for simple single-column documents
Minimal migration — you can add pymupdf4llm to an existing PyMuPDF project without changing your stack
AGPL-compatible projects — your licensing allows AGPL-3.0

When to Use pdfmux

pdfmux is the better choice when you need:

Higher accuracy output — especially on tables, multi-column layouts, and complex documents
Commercial licensing — MIT license with no copyleft restrictions
Structured JSON output — tables as structured data, not just markdown approximations
Better chunking — native semantic chunking for embedding pipelines
Framework-agnostic — works equally well with LangChain, LlamaIndex, Haystack, or custom pipelines
Standalone tool — no PyMuPDF dependency required
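Whichever tool produces the markdown, it usually has to be split into chunks before embedding. The sketch below is a generic heading-based splitter written for illustration; it is not pdfmux’s native chunker or any pymupdf4llm API, and the function name and the `max_chars` parameter are hypothetical.

```python
import re
from typing import List


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> List[str]:
    """Split markdown at ATX headings, then cap each chunk at max_chars."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Crude fallback: hard-cut sections that exceed the size limit.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


sample = "# Title\nintro\n\n## Section\nbody\n"
for chunk in chunk_by_heading(sample):
    print(repr(chunk))
```

The hard character cut can split a sentence or a table row in half, which is the kind of problem a semantic chunker is meant to avoid.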

Quick Code Comparison

pdfmux:

import pdfmux
result = pdfmux.convert("report.pdf")
print(result.markdown)

pymupdf4llm:

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("report.pdf")
print(md_text)

FAQ

What’s the difference between PyMuPDF and pymupdf4llm?

PyMuPDF is a general-purpose PDF library for extraction, manipulation, and rendering. pymupdf4llm is a thin wrapper that converts PyMuPDF’s output into LLM-friendly markdown. It does not add new extraction capabilities — it reformats PyMuPDF’s existing output.

Does pymupdf4llm improve PyMuPDF’s table extraction?

Minimally. pymupdf4llm reformats detected tables as markdown tables, but it relies on PyMuPDF’s underlying table detection, which struggles with complex layouts. pdfmux uses a dedicated table extraction pipeline, which scores 91.8% table extraction F1 against 78.3% for pymupdf4llm in the benchmark above.

Can I switch from pymupdf4llm to pdfmux easily?

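Yes. Both produce markdown output from PDFs. In most cases, replacing the import and function call is all that’s needed. One difference to account for: pymupdf4llm.to_markdown returns the markdown string directly, while pdfmux.convert returns a result object whose markdown attribute holds the text. pdfmux’s output is typically cleaner, so any downstream parsing you’ve built should keep working and may work better.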
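One way to make the switch reversible is to hide both libraries behind a single function. This is a minimal sketch: `load_converter` is a hypothetical helper, and the two calls it wraps are taken from the snippets in Quick Code Comparison, including the assumption that `pdfmux.convert` returns an object with a `markdown` attribute.

```python
from typing import Callable


def load_converter(backend: str) -> Callable[[str], str]:
    """Return a function mapping a PDF path to markdown for the given backend."""
    if backend == "pymupdf4llm":
        # Imported lazily so only the chosen backend needs to be installed.
        import pymupdf4llm
        return lambda path: pymupdf4llm.to_markdown(path)
    if backend == "pdfmux":
        import pdfmux
        # Assumes convert() returns an object exposing .markdown, as shown earlier.
        return lambda path: pdfmux.convert(path).markdown
    raise ValueError(f"unknown backend: {backend!r}")
```

Calling `load_converter("pdfmux")("report.pdf")` then returns the markdown string, and reverting to pymupdf4llm means changing only the backend name.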

Looking for detailed benchmarks? Read our comprehensive PDF extraction benchmark. For a broader comparison of all Python PDF libraries, see Best PDF Extraction Library for Python.