pdfmux vs PyMuPDF: Which PDF extraction library should you use?
pdfmux wins on simplicity, output quality, and license flexibility. PyMuPDF (also known as fitz) is one of the most established PDF libraries in the Python ecosystem with ~8.7k GitHub stars and deep low-level PDF access. However, pdfmux delivers cleaner structured output optimized for modern AI/LLM workflows — without the AGPL licensing headache.
If you’re building a RAG pipeline, document ingestion system, or any application where you need reliable markdown or structured JSON from PDFs, pdfmux gets you there with fewer lines of code and better results on complex layouts.
Feature Comparison
| Feature | pdfmux | PyMuPDF |
|---|---|---|
| Text extraction | Structured markdown/JSON | Raw text with coordinates |
| Table extraction | Built-in, high accuracy | Basic, requires post-processing |
| Layout preservation | Automatic reading order | Manual layout analysis needed |
| Output formats | Markdown, JSON, structured | Text, HTML, XML, dict |
| LLM/RAG optimization | Native chunking support | Requires pymupdf4llm wrapper |
| Installation size | Lightweight (~15 MB) | Heavy (~30 MB, C bindings) |
| License | MIT | AGPL-3.0 |
| API complexity | 3-line extraction | Verbose, low-level API |
Benchmark Comparison
| Metric | pdfmux | PyMuPDF |
|---|---|---|
| Text accuracy (mixed layouts) | 94.2% | 87.6% |
| Table extraction F1 | 91.8% | 72.4% |
| Processing speed (pages/sec) | 45 | 62 |
| Install size | 15 MB | 30 MB |
| Memory usage (100-page PDF) | 85 MB | 120 MB |
PyMuPDF edges out pdfmux on raw speed thanks to its C-based MuPDF core, but pdfmux achieves significantly higher accuracy on structured content — especially tables and multi-column layouts.
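The article does not say how the table extraction F1 score was computed. One common definition, shown here purely as an assumption, treats each table as a set of (row, column, text) cells and scores the extracted cells against a hand-labelled reference. The function name `table_cell_f1` and the sample data are illustrative only.

```python
def table_cell_f1(predicted, reference):
    """Cell-level F1 between two tables given as lists of rows."""
    # Represent each table as a set of (row, col, normalized text) cells.
    def cells(table):
        return {
            (r, c, str(value).strip())
            for r, row in enumerate(table)
            for c, value in enumerate(row)
        }

    pred, ref = cells(predicted), cells(reference)
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative data: the extracted table gets one cell wrong.
reference = [["Item", "Qty"], ["Widget", "4"], ["Gadget", "2"]]
predicted = [["Item", "Qty"], ["Widget", "4"], ["Gadget", "7"]]
score = table_cell_f1(predicted, reference)
print(round(score, 3))
```

A stricter or looser matching rule (for example, ignoring cell position or allowing fuzzy text matches) would change the score, so F1 figures are only comparable when computed the same way.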
When to Use PyMuPDF
PyMuPDF is the right choice when you need:
- Low-level PDF manipulation — merging, splitting, annotating, redacting, or watermarking PDFs
- Raw speed over accuracy — processing millions of simple, single-column documents
- Pixel-perfect rendering — converting PDF pages to images at high fidelity
- Complete PDF access — inspecting fonts, metadata, embedded files, and PDF internals
- AGPL-compatible projects — your project already uses AGPL or you can comply with its terms
When to Use pdfmux
pdfmux is the better choice when you need:
- AI/LLM-ready output — clean markdown or structured JSON for RAG pipelines, embeddings, or document Q&A
- Accurate table extraction — financial reports, invoices, or any document with complex tables
- Multi-column layout handling — research papers, newspapers, or documents with complex reading order
- Commercial-friendly licensing — MIT license with no copyleft restrictions
- Minimal integration effort — extract structured data in 3 lines of code, no post-processing needed
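For the RAG use case in the list above, markdown output is typically split into chunks before embedding. The sketch below splits a markdown string on headings using only the standard library. It does not use pdfmux's own chunking support, which this article does not show, and `split_by_heading` is a hypothetical helper.

```python
import re


def split_by_heading(markdown):
    """Split markdown into (heading, body) chunks at ATX headings."""
    chunks = []
    heading, body = None, []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            # Close the previous chunk before starting a new one.
            if heading is not None or any(l.strip() for l in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    # Flush whatever remains after the last heading.
    if heading is not None or any(l.strip() for l in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


# Illustrative markdown standing in for an extraction result.
sample = "# Revenue\nQ1 rose 4%.\n\n## By region\nEMEA led growth.\n"
chunks = split_by_heading(sample)
print(chunks)
```

This simple version would misread `#` comment lines inside fenced code blocks as headings, so real documents may need a proper markdown parser.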
Quick Code Comparison
pdfmux:
import pdfmux
result = pdfmux.convert("report.pdf")
print(result.markdown)
PyMuPDF:
import fitz
doc = fitz.open("report.pdf")
for page in doc:
    text = page.get_text("text")
    print(text)
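The PyMuPDF snippet above prints plain text in the order the library emits it. For multi-column pages, callers often sort the text blocks themselves. The sketch below works on tuples shaped like those returned by `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text, block_no, block_type)`. The two-column split at the page midline is a simplifying assumption, and `reading_order` is a hypothetical helper, not part of either library.

```python
def reading_order(blocks, page_width):
    """Order text blocks left column first, then right, top to bottom.

    Assumes a simple two-column layout split at the page midline.
    """
    midline = page_width / 2

    def sort_key(block):
        x0, y0 = block[0], block[1]
        column = 0 if x0 < midline else 1
        return (column, y0, x0)

    # block_type 0 is text in PyMuPDF's block tuples; 1 is an image block.
    text_blocks = [b for b in blocks if b[6] == 0]
    return [b[4].strip() for b in sorted(text_blocks, key=sort_key)]


# Synthetic blocks standing in for page.get_text("blocks") output.
sample_blocks = [
    (310, 50, 560, 90, "Right column, first paragraph\n", 0, 0),
    (40, 50, 290, 90, "Left column, first paragraph\n", 1, 0),
    (40, 100, 290, 140, "Left column, second paragraph\n", 2, 0),
    (310, 100, 560, 140, "<image>", 3, 1),
]
ordered = reading_order(sample_blocks, page_width=600)
print(ordered)
```

Real layouts with full-width headings, sidebars, or three columns need more than a midline split, which is the kind of post-processing the comparison refers to.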
FAQ
Can I use PyMuPDF in a commercial product?
PyMuPDF is licensed under AGPL-3.0. If you distribute your application, or let users interact with it over a network, the AGPL requires you to make the complete source of the combined work available under the same license. Artifex (the company behind PyMuPDF) offers commercial licenses for proprietary use. pdfmux uses the MIT license, which allows commercial use without copyleft obligations.
Is pdfmux faster than PyMuPDF?
PyMuPDF is faster for raw text extraction due to its C-based core (62 vs. 45 pages/sec in the benchmark above). However, pdfmux can be faster end to end when your workflow needs structured output, because it produces that output directly and skips the post-processing that PyMuPDF's raw text usually needs.
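Using the throughput figures from the benchmark table (45 and 62 pages/sec), the end-to-end claim can be checked with simple arithmetic. The per-page post-processing cost is not given in this article, so the sketch below solves for the break-even value instead of assuming one. It also assumes pdfmux output needs no post-processing at all.

```python
PDFMUX_PAGES_PER_SEC = 45   # from the benchmark table
PYMUPDF_PAGES_PER_SEC = 62  # from the benchmark table


def total_seconds(pages, pages_per_sec, post_seconds_per_page=0.0):
    """Extraction time plus any per-page post-processing time."""
    return pages / pages_per_sec + pages * post_seconds_per_page


# Post-processing cost per page at which both pipelines take equally long.
break_even = 1 / PDFMUX_PAGES_PER_SEC - 1 / PYMUPDF_PAGES_PER_SEC
print(f"break-even: {break_even * 1000:.1f} ms per page")
```

Under these assumptions the break-even is roughly 6 ms per page: if cleaning up PyMuPDF output costs more than that per page, the pdfmux pipeline finishes first; if it costs less, PyMuPDF stays ahead.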
Can pdfmux replace PyMuPDF for PDF manipulation?
No. pdfmux is focused on extraction and conversion. If you need to merge, split, annotate, or modify PDFs, PyMuPDF or pikepdf are better choices. Many developers use pdfmux for extraction alongside PyMuPDF for manipulation.
Looking for detailed benchmarks? Read our comprehensive PDF extraction benchmark. For a broader comparison of all Python PDF libraries, see Best PDF Extraction Library for Python.