pdfmux vs PyMuPDF: Which PDF extraction library should you use?

pdfmux wins on simplicity, output quality, and license flexibility. PyMuPDF (historically imported as fitz) is one of the most established PDF libraries in the Python ecosystem, with ~8.7k GitHub stars and deep low-level PDF access. However, pdfmux delivers cleaner structured output optimized for modern AI/LLM workflows, without the AGPL licensing headache.

If you’re building a RAG pipeline, document ingestion system, or any application where you need reliable markdown or structured JSON from PDFs, pdfmux gets you there with fewer lines of code and better results on complex layouts.

Feature Comparison

| Feature              | pdfmux                   | PyMuPDF                        |
|----------------------|--------------------------|--------------------------------|
| Text extraction      | Structured markdown/JSON | Raw text with coordinates      |
| Table extraction     | Built-in, high accuracy  | Basic, requires post-processing |
| Layout preservation  | Automatic reading order  | Manual layout analysis needed  |
| Output formats       | Markdown, JSON, structured | Text, HTML, XML, dict        |
| LLM/RAG optimization | Native chunking support  | Requires pymupdf4llm wrapper   |
| Installation size    | Lightweight (~15 MB)     | Heavy (~30 MB, C bindings)     |
| License              | MIT                      | AGPL-3.0                       |
| API complexity       | 3-line extraction        | Verbose, low-level API         |

Benchmark Comparison

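| Metric                        | pdfmux | PyMuPDF |
|-------------------------------|--------|---------|
| Text accuracy (mixed layouts) | 94.2%  | 87.6%   |
| Table extraction F1           | 91.8%  | 72.4%   |
| Processing speed (pages/sec)  | 45     | 62      |
| Install size                  | 15 MB  | 30 MB   |
| Memory usage (100-page PDF)   | 85 MB  | 120 MB  |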
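
For readers unfamiliar with the F1 metric in the table above: it is the harmonic mean of precision and recall. The sketch below shows one way a table-extraction F1 could be computed, assuming cells are compared as (row, column, text) triples. That matching rule and the function name table_f1 are illustrative assumptions, not the methodology behind the reported numbers.

```python
def table_f1(predicted, expected):
    """Cell-level F1 between two tables given as lists of rows.

    Hypothetical scoring rule: a predicted cell counts as correct only if
    the same text appears at the same (row, column) position in the
    expected table.
    """
    def cells(table):
        return {
            (r, c, text.strip())
            for r, row in enumerate(table)
            for c, text in enumerate(row)
        }

    pred, gold = cells(predicted), cells(expected)
    if not pred or not gold:
        return 0.0
    true_positives = len(pred & gold)
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: one of six cells was extracted incorrectly.
expected = [["Item", "Qty"], ["Widget", "4"], ["Gadget", "7"]]
predicted = [["Item", "Qty"], ["Widget", "4"], ["Gadget", "1"]]
score = table_f1(predicted, expected)
print(f"{score:.3f}")
```

With five of six cells correct on both sides, precision and recall are both 5/6, so the F1 is also 5/6. Stricter or looser matching rules (for example, ignoring cell position) would give different scores for the same output.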

PyMuPDF edges out pdfmux on raw speed (62 vs. 45 pages/sec) thanks to its C-based MuPDF core, but pdfmux achieves higher accuracy on structured content: 94.2% vs. 87.6% text accuracy on mixed layouts, and 91.8% vs. 72.4% F1 on table extraction.
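
To put the speed gap in concrete terms, the following back-of-the-envelope sketch converts the pages/sec figures from the benchmark table into wall-clock estimates. The corpus size of 10,000 pages is a hypothetical value, and the calculation assumes a single process running at a constant rate.

```python
def seconds_to_process(pages, pages_per_second):
    """Estimated wall-clock seconds at a constant throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second


# Throughput figures taken from the benchmark table (pages/sec).
THROUGHPUT = {"pdfmux": 45, "PyMuPDF": 62}

corpus_pages = 10_000  # hypothetical corpus size
estimates = {
    name: seconds_to_process(corpus_pages, rate)
    for name, rate in THROUGHPUT.items()
}
for name, seconds in estimates.items():
    print(f"{name}: {seconds:.0f} s")
```

At these rates the estimate is roughly 222 seconds for pdfmux versus 161 seconds for PyMuPDF, a difference of about one minute per 10,000 pages. Whether that matters depends on how much post-processing time the extra accuracy saves.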

When to Use PyMuPDF

PyMuPDF is the right choice when you need:

  • Low-level PDF manipulation — merging, splitting, annotating, redacting, or watermarking PDFs
  • Raw speed over accuracy — processing millions of simple, single-column documents
  • Pixel-perfect rendering — converting PDF pages to images at high fidelity
  • Complete PDF access — inspecting fonts, metadata, embedded files, and PDF internals
  • AGPL-compatible projects — your project already uses AGPL or you can comply with its terms

When to Use pdfmux

pdfmux is the better choice when you need:

  • AI/LLM-ready output — clean markdown or structured JSON for RAG pipelines, embeddings, or document Q&A
  • Accurate table extraction — financial reports, invoices, or any document with complex tables
  • Multi-column layout handling — research papers, newspapers, or documents with complex reading order
  • Commercial-friendly licensing — MIT license with no copyleft restrictions
  • Minimal integration effort — extract structured data in 3 lines of code, no post-processing needed
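
As an illustration of why markdown output suits RAG pipelines, here is a minimal sketch of heading-based chunking. It uses only the standard library and works on any markdown string. pdfmux's own chunking API is not shown in this article, so chunk_by_heading and its max_chars parameter are hypothetical names, not part of either library.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown, max_chars=1000):
    """Split markdown into chunks that start at headings.

    Each chunk is a dict with the heading it falls under and its text.
    Sections longer than max_chars are split further on paragraph breaks;
    a single paragraph longer than max_chars is kept whole.
    """
    sections = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if any(ln.strip() for ln in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(ln.strip() for ln in lines):
        sections.append((title, "\n".join(lines).strip()))

    chunks = []
    for title, body in sections:
        current = ""
        for paragraph in re.split(r"\n\s*\n", body):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chars:
                chunks.append({"heading": title, "text": current})
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append({"heading": title, "text": current})
    return chunks


sample = "# Revenue\n\nQ1 was strong.\n\nQ2 was flat.\n\n## Costs\n\nCosts fell 3%.\n"
chunks = chunk_by_heading(sample, max_chars=20)
for chunk in chunks:
    print(chunk)
```

Keeping the heading alongside each chunk gives the retriever some context about where the text came from. The sketch does not handle "#" lines inside fenced code blocks, which a production chunker would need to skip.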

Quick Code Comparison

pdfmux:

import pdfmux
result = pdfmux.convert("report.pdf")
print(result.markdown)

PyMuPDF:

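import fitz  # legacy import name; recent PyMuPDF releases also accept "import pymupdf"

doc = fitz.open("report.pdf")
for page in doc:
    # "text" mode returns the page's plain text with no layout structure
    text = page.get_text("text")
    print(text)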
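
The feature table notes that PyMuPDF leaves reading order to the caller. The sketch below shows the kind of post-processing that implies, using a deliberately simple two-column heuristic. It operates on tuples in the layout that page.get_text("blocks") returns, (x0, y0, x1, y1, text, block_no, block_type), so it runs without a PDF. The function name order_blocks, the midline split, and the sample coordinates are assumptions made for illustration.

```python
def order_blocks(blocks, page_width, gutter=None):
    """Order text blocks for a two-column page: left column first, then right.

    `blocks` uses the tuple layout PyMuPDF returns from
    page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type).
    """
    split = page_width / 2 if gutter is None else gutter
    # block_type 0 is text, 1 is an image; keep text only.
    text_blocks = [b for b in blocks if b[6] == 0]
    left = [b for b in text_blocks if b[0] < split]
    right = [b for b in text_blocks if b[0] >= split]

    def top_then_left(block):
        # y grows downward in PyMuPDF page coordinates.
        return (block[1], block[0])

    return sorted(left, key=top_then_left) + sorted(right, key=top_then_left)


# Made-up blocks on a 612-point-wide page, listed out of reading order.
blocks = [
    (310, 72, 540, 120, "right top", 0, 0),
    (72, 72, 290, 120, "left top", 1, 0),
    (72, 130, 290, 200, "left bottom", 2, 0),
    (310, 130, 540, 200, "right bottom", 3, 0),
    (72, 210, 290, 300, "<image>", 4, 1),
]
ordered = [b[4] for b in order_blocks(blocks, page_width=612)]
print(ordered)
```

With a real document, the call would look like order_blocks(page.get_text("blocks"), page.rect.width). The heuristic breaks on full-width headings, three-column pages, and tables, which is the gap that layout-aware extractors aim to close.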

FAQ

Can I use PyMuPDF in a commercial product?

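PyMuPDF is dual-licensed: it is available under AGPL-3.0 or under a commercial license from Artifex (the company behind PyMuPDF). Under the AGPL, if you distribute an application that incorporates PyMuPDF, or let users interact with it over a network, you must make the application's source available under the AGPL as well. For proprietary use, you need the commercial license. pdfmux uses the MIT license, which permits commercial use and requires only that the copyright and license notice be preserved.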

Is pdfmux faster than PyMuPDF?

PyMuPDF is faster for raw text extraction due to its C-based core (62 vs. 45 pages/sec in the benchmark above). However, pdfmux can be faster end to end when you need structured output, because it produces that output directly and skips the post-processing steps that PyMuPDF's raw text requires.

Can pdfmux replace PyMuPDF for PDF manipulation?

No. pdfmux is focused on extraction and conversion. If you need to merge, split, annotate, or modify PDFs, PyMuPDF or pikepdf are better choices. Many developers use pdfmux for extraction alongside PyMuPDF for manipulation.


Looking for detailed benchmarks? Read our comprehensive PDF extraction benchmark. For a broader comparison of all Python PDF libraries, see Best PDF Extraction Library for Python.