pdfmux vs pymupdf4llm: Which LLM-optimized PDF extractor should you use?

pdfmux wins on output quality, license flexibility, and table handling. pymupdf4llm (~2k GitHub stars) is a wrapper around PyMuPDF built to produce LLM-ready markdown output. It addresses the same problem pdfmux solves, but it inherits PyMuPDF’s AGPL-3.0 license and, in the benchmark below, scores lower on tables and multi-column layouts.

Both tools target the same use case: turning PDFs into clean markdown for RAG pipelines. pdfmux does it better, with an MIT license.

Feature Comparison

| Feature | pdfmux | pymupdf4llm |
| --- | --- | --- |
| Primary output | Markdown, JSON | Markdown |
| Table extraction | Built-in, structured | Basic markdown tables |
| Multi-column handling | Automatic reading order | Improved over raw PyMuPDF |
| Image extraction | With descriptions | References only |
| Chunking support | Native, configurable | Basic page-level |
| LlamaIndex integration | Standard loader | Official adapter |
| License | MIT | AGPL-3.0 |
| Standalone install | Yes | Requires PyMuPDF |

Benchmark Comparison

| Metric | pdfmux | pymupdf4llm |
| --- | --- | --- |
| Text accuracy (mixed layouts) | 94.2% | 90.1% |
| Table extraction F1 | 91.8% | 78.3% |
| Multi-column accuracy | 92.5% | 84.7% |
| Processing speed (pages/sec) | 45 | 55 |
| Install size | 15 MB | 30 MB (PyMuPDF dep) |
| Memory usage (100-page PDF) | 85 MB | 110 MB |

pymupdf4llm is faster (55 vs. 45 pages/sec, leveraging PyMuPDF’s C core), but pdfmux produces better structured output: it scores 13.5 points higher on table extraction F1 (91.8% vs. 78.3%) and 7.8 points higher on multi-column accuracy (92.5% vs. 84.7%).
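Throughput figures like these vary with hardware and document mix, so it can be worth timing both tools on your own files. The following is a minimal sketch using only the standard library. `pages_per_second` is a hypothetical helper, not part of either library, and the extractor is passed in as a callable so both tools can be timed the same way.

```python
import time
from typing import Callable


def pages_per_second(
    extract: Callable[[str], object],
    pdf_path: str,
    page_count: int,
    runs: int = 3,
) -> float:
    """Return best-of-N throughput in pages per second.

    `extract` is any function that takes a PDF path and performs the
    conversion; its return value is ignored. `page_count` must be
    supplied by the caller (this helper does not open the PDF itself).
    """
    if page_count <= 0 or runs <= 0:
        raise ValueError("page_count and runs must be positive")

    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        extract(pdf_path)
        best = min(best, time.perf_counter() - start)

    # Guard against a zero-length measurement on very coarse timers.
    return page_count / best if best > 0 else float("inf")
```

For example, `pages_per_second(pymupdf4llm.to_markdown, "report.pdf", page_count=100)` would time pymupdf4llm on a 100-page file. Taking the best of several runs reduces noise from caching and background load.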

When to Use pymupdf4llm

pymupdf4llm is the right choice when you need:

  • PyMuPDF ecosystem — you’re already using PyMuPDF and want LLM-ready output as an add-on
  • LlamaIndex native — pymupdf4llm has an official LlamaIndex document loader
  • Raw speed — slightly faster processing for simple single-column documents
  • Minimal migration — you can add pymupdf4llm to an existing PyMuPDF project without changing your stack
  • AGPL-compatible projects: your licensing allows AGPL-3.0, or you hold a commercial PyMuPDF license from Artifex

When to Use pdfmux

pdfmux is the better choice when you need:

  • Higher accuracy output — especially on tables, multi-column layouts, and complex documents
  • Commercial licensing — MIT license with no copyleft restrictions
  • Structured JSON output — tables as structured data, not just markdown approximations
  • Better chunking — native semantic chunking for embedding pipelines
  • Framework-agnostic — works equally well with LangChain, LlamaIndex, Haystack, or custom pipelines
  • Standalone tool — no PyMuPDF dependency required
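To make the chunking point concrete, here is a minimal sketch of heading-based chunking applied to markdown output from either tool. It is a generic, stdlib-only approach, not pdfmux’s native chunker. `chunk_by_heading` and its `max_chars` parameter are hypothetical names chosen for illustration.

```python
import re
from typing import List

# Matches ATX-style markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> List[str]:
    """Split markdown at headings, then cap each chunk at max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    # Start a new section every time a heading line is seen.
    sections: List[List[str]] = [[]]
    for line in markdown.splitlines():
        if HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: List[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        # Oversized sections are cut on a hard character boundary;
        # empty sections produce no chunk at all.
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks
```

Splitting at headings keeps each chunk on one topic, which generally helps retrieval. The hard character cut is the crudest possible fallback, and a production pipeline would more likely split on sentence or paragraph boundaries.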

Quick Code Comparison

pdfmux:

import pdfmux
result = pdfmux.convert("report.pdf")
print(result.markdown)

pymupdf4llm:

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("report.pdf")
print(md_text)

FAQ

What’s the difference between PyMuPDF and pymupdf4llm?

PyMuPDF is a general-purpose PDF library for extraction, manipulation, and rendering. pymupdf4llm is a thin wrapper that converts PyMuPDF’s output into LLM-friendly markdown. It adds little new extraction capability: it mostly reformats PyMuPDF’s existing output, with some improvement in multi-column reading order.

Does pymupdf4llm improve PyMuPDF’s table extraction?

Minimally. pymupdf4llm reformats detected tables as markdown tables, but it relies on PyMuPDF’s underlying table detection, which struggles with complex layouts. pdfmux uses a dedicated table extraction pipeline, which scores 91.8% table extraction F1 against 78.3% for pymupdf4llm in the benchmark above.

Can I switch from pymupdf4llm to pdfmux easily?

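Yes. Both produce markdown output from PDFs. In most cases you only need to replace the import and the function call, and read the markdown from the result object: pymupdf4llm.to_markdown returns a string, while pdfmux.convert returns a result whose .markdown attribute holds the text. pdfmux’s output is typically cleaner, so downstream parsing you have already built should keep working, though it is worth re-checking it against the new output.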
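One way to keep the switch to a single changed argument is to hide both libraries behind a small function. The following is a minimal sketch. `get_extractor` is a hypothetical helper, and the pdfmux call is assumed to be the `pdfmux.convert(...).markdown` form shown in the Quick Code Comparison.

```python
from typing import Callable


def get_extractor(backend: str) -> Callable[[str], str]:
    """Return a function mapping a PDF path to markdown text."""
    if backend == "pymupdf4llm":
        # Imported lazily so only the chosen backend must be installed.
        import pymupdf4llm

        return lambda path: pymupdf4llm.to_markdown(path)

    if backend == "pdfmux":
        import pdfmux

        # Assumption: convert() returns an object with a .markdown
        # attribute, as in the article's own example.
        return lambda path: pdfmux.convert(path).markdown

    raise ValueError(f"unknown backend: {backend!r}")
```

Downstream code then calls `get_extractor("pdfmux")("report.pdf")` and always receives a plain string, so moving between the two libraries does not touch the rest of the pipeline.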


Looking for detailed benchmarks? Read our comprehensive PDF extraction benchmark. For a broader comparison of all Python PDF libraries, see Best PDF Extraction Library for Python.