Direct answer: PyMuPDF is 8-12x faster than pdfplumber on plain text extraction (180 pages/sec vs 18 pages/sec on our benchmark) but ships under AGPL-3.0, which forces commercial users to either buy a paid license or open-source their own code. pdfplumber is MIT-licensed and produces visibly better tables on financial documents, but it is far slower and has no built-in OCR. Pick PyMuPDF for speed-critical pipelines if the license is acceptable. Pick pdfplumber for table-heavy work on small batches. If you need fast text, good tables, and a permissive license all at once, neither one is the answer.
What each library is, in one paragraph
PyMuPDF (also imported as fitz) is the Python binding for MuPDF, a C-based PDF rendering and parsing library from Artifex. Because the heavy lifting happens in compiled C rather than in Python, it is one of the fastest PDF libraries you can call from Python. It handles a wide range of PDF spec edge cases and exposes low-level access to the PDF object model: pages, text blocks, images, annotations, form fields, and bookmarks.
pdfplumber is a pure-Python library built on top of pdfminer.six. It focuses on visual layout: it represents each character with its position on the page and provides high-level abstractions for finding tables based on visual lines, whitespace, and text alignment. It is slower because it does more layout analysis per page.
The difference in design intent explains most of the other differences between them.
Speed: PyMuPDF wins by a wide margin
We ran both libraries on a corpus of 1,422 pages: SEC 10-K filings, board meeting minutes, technical specifications, and a sample of scanned legal documents. Hardware: M1 MacBook Pro, 4 cores allocated, no GPU. Methodology: cold start, single-process, average of 3 runs.
| Operation | PyMuPDF | pdfplumber | PyMuPDF advantage |
|---|---|---|---|
| Plain text extraction | 180 pages/sec | 18 pages/sec | 10x faster |
| Text + bounding boxes | 95 pages/sec | 22 pages/sec | 4.3x faster |
| Table extraction | 45 pages/sec | 8 pages/sec | 5.6x faster |
| Cold-start import time | 0.08s | 0.31s | 3.9x faster |
| Memory per 100 pages | 45 MB | 180 MB | 4x less |
For a 1,000-page batch, PyMuPDF finishes plain text extraction in ~5.5 seconds. pdfplumber takes ~55 seconds. At 100,000 pages a day this is 9 minutes vs 1.5 hours.
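The batch estimates above are plain division of page count by throughput. A minimal sketch of that arithmetic, assuming a steady single-process rate and ignoring start-up cost (the helper name `batch_seconds` is ours, not part of either library):

```python
def batch_seconds(pages: int, pages_per_sec: float) -> float:
    """Wall-clock seconds to process `pages` at a steady single-process rate."""
    return pages / pages_per_sec


# Plain text extraction rates taken from the benchmark table.
RATES = {"PyMuPDF": 180, "pdfplumber": 18}

for name, rate in RATES.items():
    small = batch_seconds(1_000, rate)
    daily = batch_seconds(100_000, rate)
    print(f"{name}: 1,000 pages in {small:.1f}s, 100,000 pages in {daily / 60:.1f} min")
```

Swap in your own measured rates before using this for capacity planning; throughput varies a lot with document type.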
Full benchmark methodology and per-document numbers: real-world PDF benchmark.
The speed gap closes somewhat when you ask both libraries for layout-aware output. PyMuPDF can give bounding boxes for every text block, but its table detection is rudimentary — it gives you the cells but doesn’t always group them correctly into rows and columns. pdfplumber inverts the trade: more time per page, better table fidelity.
Tables: pdfplumber wins on accuracy, but the gap is smaller than you’d think
This is where intent shows up. pdfplumber was designed for tables. PyMuPDF added table support later as a secondary feature.
We measured table accuracy using the TEDS metric (Tree-Edit-Distance-based Similarity, a standard metric for table extraction) on a 200-table subset:
| Library | TEDS Score | Notes |
|---|---|---|
| pdfplumber | 0.847 | Best on bordered tables. Struggles with merged cells. |
| PyMuPDF | 0.692 | Fast but loses ~30% of cell relationships on complex tables. |
| Docling | 0.911 | Slower than pdfplumber but the open-source state of the art. |
| pdfmux (standard) | 0.911 | Routes table pages through Docling overlay. |
pdfplumber’s table extraction works by finding visual lines (or aligned text in tables without lines) and using them as cell boundaries. This is the right approach for documents like financial filings where tables are bordered. It struggles when tables have:
- Merged cells across rows or columns
- Headers that span multiple rows
- Cells with line breaks inside them
- Implicit table structure (no visible lines, just whitespace alignment)
PyMuPDF gives you raw text positions and lets you build the table yourself. This is more work but gives you more control on edge cases. For most users, neither is the right answer for production table extraction: both score below the 0.911 that Docling reaches in the table above.
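To make "build the table yourself" concrete, here is a rough sketch of the first step: grouping positioned words into rows. It assumes word tuples shaped like the first five fields PyMuPDF returns from page.get_text("words"), that is (x0, y0, x1, y1, text). The function name `words_to_rows` and the `y_tol` tolerance are illustrative choices, not library API.

```python
from typing import List, Sequence, Tuple

Word = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text


def words_to_rows(words: Sequence[Word], y_tol: float = 3.0) -> List[List[str]]:
    """Group words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # Stay on the current row while the top edge is within y_tol of the row's first word.
        if rows and abs(word[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[4] for w in sorted(row, key=lambda w: w[0])] for row in rows]


sample = [
    (10, 100, 60, 110, "Revenue"),
    (200, 101, 230, 111, "1,200"),
    (10, 120, 50, 130, "Costs"),
    (200, 119, 230, 129, "800"),
]
print(words_to_rows(sample))
```

This only recovers rows. Splitting each row into cells still needs column boundaries (for example, clustering x-coordinates), and merged cells or wrapped text will break the simple tolerance check.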
Full methodology for the TEDS comparison: benchmarking PDF extractors.
License: this is the deciding factor for many teams
PyMuPDF is AGPL-3.0. pdfplumber is MIT.
Most engineers underestimate this difference until legal flags it during a launch review.
AGPL-3.0 requires that if you distribute the software, or run it as a network service that users interact with, you must release the complete source code of your application under AGPL terms. For SaaS companies this is the trigger that hurts: any user who hits your API has triggered the source-disclosure obligation. Artifex offers a commercial license for a fee (typically $10,000-$50,000/year depending on use case) that exempts you from AGPL. This is fine for funded startups; it is a problem for indie developers and small teams.
MIT has no such obligation. Use, modify, distribute, sell — only requirement is keeping the copyright notice.
For more on the AGPL trap and how teams have handled it, pdf extraction without GPU covers some of the same alternatives.
The practical impact:
- If you’re building a closed-source SaaS product on top of PyMuPDF, you need a commercial license. Budget for it.
- If you’re building open-source tooling, AGPL is fine but limits who can adopt your work.
- If you’re a Fortune 500 corporate IT team, AGPL software requires legal review before deployment. Many shops blanket-ban AGPL.
- pdfplumber is fine in all these contexts.
This single factor pushes a lot of teams to pdfplumber even when speed matters. It also drives demand for MIT/Apache-2.0 PDF stacks, which is part of why pdfmux (MIT) and Docling (MIT) exist.
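One way to catch an AGPL dependency before legal does is to scan installed package metadata in CI. The sketch below uses only the standard library's importlib.metadata. The helper names and the marker list are our own assumptions, and license metadata is self-reported by package authors, so treat a clean result as a hint rather than a legal finding.

```python
from importlib import metadata
from typing import Iterable, List, Tuple

COPYLEFT_MARKERS = ("agpl", "affero")  # assumption: substrings worth flagging


def flag_copyleft(dists: Iterable[Tuple[str, str]]) -> List[str]:
    """Return names of distributions whose license text mentions AGPL/Affero."""
    return sorted(
        name
        for name, license_text in dists
        if any(marker in license_text.lower() for marker in COPYLEFT_MARKERS)
    )


def installed_licenses() -> List[Tuple[str, str]]:
    """Collect (name, license info) for every distribution in the current environment."""
    found = []
    for dist in metadata.distributions():
        meta = dist.metadata
        classifiers = meta.get_all("Classifier") or []
        parts = [meta.get("License") or "", meta.get("License-Expression") or ""]
        parts += [c for c in classifiers if c.startswith("License ::")]
        found.append((meta.get("Name") or "unknown", " ".join(parts)))
    return found


# Typical use in a release check:
#     offenders = flag_copyleft(installed_licenses())
#     if offenders: raise SystemExit(f"AGPL dependencies found: {offenders}")
```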
API ergonomics: PyMuPDF is more powerful, pdfplumber is more pleasant
PyMuPDF exposes the full PDF object model. You can read annotations, fill forms, redact text, manipulate bookmarks, render pages to images, modify pages, and write the result back. This is overkill for “extract text” but invaluable when your task is more than extraction.
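import fitz  # PyMuPDF
doc = fitz.open("contract.pdf")
for page in doc:
    text = page.get_text()        # plain text of the page
    tables = page.find_tables()   # detected tables
    images = page.get_images()    # images referenced by the page
    annotations = page.annots()   # annotations on the page
doc.close()  # release the C-level resources held by the document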
pdfplumber’s API is simpler and more consistent for layout-aware tasks:
import pdfplumber
with pdfplumber.open("contract.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        tables = page.extract_tables()
        words = page.extract_words()
        chars = page.chars  # every char with x, y, font, size
pdfplumber’s chars property — a list of every character with its position, font, and size — is genuinely useful when you’re writing custom layout logic. PyMuPDF gives you similar information but it requires more boilerplate to extract.
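As an illustration of custom layout logic on top of chars, the sketch below pulls out text set in a large font, which is a common heuristic for headings. It assumes character dicts with the "text", "size", "top", and "x0" keys that pdfplumber provides. The function name, the size threshold, and the line-bucketing tolerance are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def large_text_lines(chars: List[Dict], min_size: float, y_tol: float = 2.0) -> List[str]:
    """Join characters of at least `min_size` points into lines, top to bottom."""
    buckets = defaultdict(list)
    for ch in chars:
        if ch["size"] >= min_size:
            # Characters on one visual line share roughly the same `top` value.
            buckets[round(ch["top"] / y_tol)].append(ch)
    lines = []
    for key in sorted(buckets):
        ordered = sorted(buckets[key], key=lambda c: c["x0"])
        lines.append("".join(c["text"] for c in ordered).strip())
    return [line for line in lines if line]


sample_chars = [
    {"text": "H", "size": 18, "top": 50.0, "x0": 10},
    {"text": "i", "size": 18, "top": 50.2, "x0": 20},
    {"text": "x", "size": 10, "top": 80.0, "x0": 10},
]
print(large_text_lines(sample_chars, min_size=14))
```

With a real document you would pass page.chars directly. Bucketing by rounded `top` can split a line whose characters straddle a bucket boundary, so a production version would want a proper clustering step.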
For tasks like form filling, annotation editing, or PDF generation, PyMuPDF wins by default — pdfplumber doesn’t do those.
OCR: neither one does it well
Both libraries punt on scanned PDFs.
PyMuPDF can integrate with Tesseract (page.get_textpage_ocr()) but the API is clunky and it does not auto-detect when a page needs OCR. You have to detect text density yourself and call OCR explicitly.
pdfplumber has no OCR support at all. You read the text layer or you get nothing.
In a typical document collection, 5-15% of pages are scanned or image-only. Without auto-OCR, both libraries silently return empty strings for those pages. Your downstream pipeline has no way to tell whether a page is actually blank or is a scanned image with no text layer. This is the most common silent failure mode in production document pipelines.
The fix is to add a separate OCR step (Tesseract via pytesseract, or a dedicated tool) and route image-heavy pages through it. We covered the patterns for OCR PDF extraction in Python and extracting tables from PDF in Python including the auto-routing approach.
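A minimal sketch of the detection step, assuming you already have each page's extracted text and a count of images on the page. The function name, labels, and the 25-character threshold are illustrative guesses to tune on your own corpus.

```python
def classify_page(text: str, image_count: int, min_chars: int = 25) -> str:
    """Label a page as 'text', 'scanned', or 'blank_or_sparse' from its text layer.

    A page with almost no visible characters but at least one image is treated
    as a likely scan that should be routed to OCR.
    """
    visible = "".join(text.split())  # drop all whitespace before counting
    if len(visible) >= min_chars:
        return "text"
    if image_count > 0:
        return "scanned"
    return "blank_or_sparse"


print(classify_page("", image_count=1))
print(classify_page("   \n", image_count=0))
```

With PyMuPDF the inputs could come from page.get_text() and len(page.get_images()). Recording the label alongside the text gives downstream code a way to tell an empty page from a missed scan.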
Multi-column layouts: PyMuPDF wins on speed, both can fail
Academic papers, magazines, and old technical manuals often use 2- or 3-column layouts. The challenge is reading order: a naive extractor reads left-to-right across the whole page width, mixing text from different columns.
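PyMuPDF’s page.get_text("blocks") mode returns text blocks with bounding boxes. To recover reading order, assign each block to a column by the x-coordinate of its left edge, then sort the blocks within each column by y-coordinate. A plain sort on x then y is not enough, because blocks in the same column rarely share exactly the same left edge. This works well for clean 2-column layouts and breaks on more complex grids.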
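A rough sketch of column-aware ordering, assuming block tuples shaped like the first five fields of page.get_text("blocks"), that is (x0, y0, x1, y1, text), and assuming equal-width columns. The function name and the `n_columns` parameter are ours, not PyMuPDF API.

```python
from typing import List, Sequence, Tuple

Block = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text


def reading_order(blocks: Sequence[Block], page_width: float, n_columns: int = 2) -> List[str]:
    """Order blocks column by column (left to right), top to bottom within a column."""
    col_width = page_width / n_columns

    def sort_key(block: Block):
        # Bucket the block into a column by its left edge, clamped to valid columns.
        column = max(0, min(int(block[0] // col_width), n_columns - 1))
        return (column, block[1], block[0])

    return [block[4] for block in sorted(blocks, key=sort_key)]


page_blocks = [
    (320, 50, 580, 100, "C"),
    (20, 300, 280, 350, "B"),
    (20, 50, 280, 100, "A"),
    (320, 300, 580, 350, "D"),
]
print(reading_order(page_blocks, page_width=600))
```

Full-width elements such as titles or figures land in the first column under this scheme, which is one of the ways the approach breaks on complex grids.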
pdfplumber’s page.extract_text(layout=True) makes a similar attempt. It is a bit more accurate on irregular layouts because the underlying pdfminer.six does more sophisticated layout analysis, but it is also 5-10x slower in this mode.
Neither library “just works” on multi-column. Both require manual reading-order logic if accuracy matters. For RAG pipelines on academic papers this is a blocker — incorrectly ordered text destroys retrieval quality.
When PyMuPDF is the right answer
- You’re processing high volumes of digital-text PDFs (tax forms, regulatory filings without tables, generated reports)
- You need form filling, annotation editing, redaction, or page manipulation
- You can pay for a commercial license, or you’re working in a context where AGPL is fine
- You’re rendering PDFs to images (PyMuPDF’s rendering is excellent)
- Your pipeline needs to be fast and you can live with imperfect tables
When pdfplumber is the right answer
- You’re extracting tables from financial filings, invoices, or other bordered tables
- You need MIT-licensed code with no commercial gotchas
- You’re prototyping or doing one-off analysis where 18 pages/sec is fine
- You want pleasant, well-documented APIs and don’t need PDF manipulation
- Your batch sizes are small (under a few thousand pages)
When the answer is neither
If your PDF corpus is mixed — some digital, some scanned, some with complex tables, some multi-column — neither library handles all of it well. You end up writing routing logic: detect which engine fits this page, run it, fall back to OCR if empty, score the result, retry on failure.
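The loop described above can be sketched in a few lines. Everything here is hypothetical scaffolding: the engines are plain callables you would wrap around PyMuPDF, pdfplumber, or an OCR tool, and the character-count score is a deliberately crude stand-in for a real quality audit.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[bytes], str]  # hypothetical: takes page bytes, returns text


def score(text: str, min_chars: int = 25) -> float:
    """Crude quality score in [0, 1] based on the number of visible characters."""
    visible = len("".join(text.split()))
    return min(visible / min_chars, 1.0)


def extract_with_fallback(
    page: bytes,
    engines: List[Tuple[str, Extractor]],
    threshold: float = 1.0,
) -> Tuple[Optional[str], str, float]:
    """Try engines in order; return (engine, text, score) for the first good result.

    If no engine reaches the threshold, return the best result seen.
    """
    best: Tuple[Optional[str], str, float] = (None, "", 0.0)
    for name, run in engines:
        try:
            text = run(page)
        except Exception:
            continue  # one failing engine should not abort the page
        current = score(text)
        if current >= threshold:
            return name, text, current
        if current > best[2]:
            best = (name, text, current)
    return best
```

Ordering the engine list from cheapest to most expensive keeps the fast path fast: the OCR engine only runs on pages where the text layer came back empty or sparse.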
Building this routing is most of what tools like pdfmux and Docling exist to do. The pdfmux pipeline runs PyMuPDF first for speed, audits each page’s quality, falls back to OCR when text is sparse, overlays Docling for tables, and returns a confidence score per page. It is MIT licensed and CPU-only. The full architecture: self-healing PDF extraction.
Common pitfalls when migrating between them
1. Different page indexing.
PyMuPDF uses 0-indexed pages (doc[0], page.number). In pdfplumber, the pdf.pages list is 0-indexed but page.page_number starts at 1. Always check which one you are holding.
2. Whitespace handling.
PyMuPDF preserves more whitespace from the source PDF. pdfplumber normalizes more aggressively. Diffs in extracted text are often just whitespace, not content, so adjust your equality checks accordingly.
3. Encoding.
Both libraries return Unicode strings, but PyMuPDF is more aggressive about decoding embedded fonts to readable Unicode. pdfplumber sometimes returns CID-encoded text (literal (cid:123) strings) for documents with custom font encodings. If you see CIDs, switch libraries or pre-process the PDF.
4. Memory leaks on large batches.
PyMuPDF holds C-level memory until the document is closed, so call doc.close() or open the document in a with block. In long-running processes, forgetting this leaks ~5-10 MB per document. pdfplumber code typically uses a context manager (with pdfplumber.open(...)), which releases resources automatically.
5. License accidents.
Adding PyMuPDF to a project that was previously MIT/Apache changes its effective license to AGPL. Many teams have shipped without realizing this. Check pip-licenses or pip show pymupdf before each release.
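Pitfalls 2 and 3 are easy to guard against in a migration test suite. The helpers below are a small sketch: one compares two extractions while ignoring whitespace, the other measures how much of a string consists of literal (cid:N) placeholders. The function names are ours, and the regular expression assumes the placeholder format shown in pitfall 3.

```python
import re

CID_PATTERN = re.compile(r"\(cid:\d+\)")


def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace to a single space and trim the ends."""
    return " ".join(text.split())


def same_content(a: str, b: str) -> bool:
    """True when two extractions differ only in whitespace."""
    return normalize_ws(a) == normalize_ws(b)


def cid_ratio(text: str) -> float:
    """Fraction of characters that belong to literal (cid:N) placeholders."""
    if not text:
        return 0.0
    cid_chars = sum(len(match.group()) for match in CID_PATTERN.finditer(text))
    return cid_chars / len(text)
```

A page whose cid_ratio is well above zero is a candidate for re-extraction with the other library. Where to set that cutoff is a judgment call for your corpus.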
A note on PyMuPDF4LLM
Artifex shipped pymupdf4llm in 2024, a higher-level wrapper that returns Markdown-formatted output and has built-in chunking helpers for LLM use cases. It carries the same AGPL license as PyMuPDF (and depends on it). Speed is comparable to base PyMuPDF. Table accuracy is slightly better (around 0.78 TEDS in our test) but still well below dedicated table tools.
If you’ve decided AGPL is fine and you want Markdown output, pymupdf4llm is more convenient than rolling your own conversion on top of PyMuPDF. If you’ve decided AGPL is not fine, neither PyMuPDF nor pymupdf4llm helps you.
FAQ
Is PyMuPDF really 10x faster than pdfplumber? On plain text, yes. The gap narrows on table extraction (5-6x) and on OCR-required pages (where neither is fast). For most workloads, PyMuPDF is meaningfully faster.
Can I use both in the same project? Yes — and many teams do, routing simple pages to PyMuPDF for speed and table-heavy pages to pdfplumber. You inherit AGPL the moment you add PyMuPDF, regardless of how little code calls it.
What about pypdf or PyPDF2? Both are slower than PyMuPDF and have weaker table support than pdfplumber. They are simpler dependencies (pure Python, no compiled C). Fine for trivial extraction; not competitive for production.
Is there a commercial alternative to PyMuPDF that’s faster? For pure speed, the C-level libraries (Poppler, MuPDF directly) are similar. For accuracy, paid services (LlamaParse, AWS Textract, Google Document AI) score higher on tables and scanned pages but cost $10-30 per 1,000 pages and add network latency.
Does PyMuPDF support PDF/A or PDF/UA? Yes. Both libraries can read these conforming PDFs without issue. Writing standards-conforming output is a different matter: neither library guarantees PDF/A output.
Keep reading
- Best PDF extraction library for Python in 2026 — ranked comparison covering pdfplumber, PyMuPDF, Marker, Docling, and pdfmux
- PDF extractor comparison 2026 — head-to-head benchmark across 7 tools
- Extract tables from PDF in Python — three table extraction methods with TEDS scores
- Real-world PDF benchmark — full methodology for the speed numbers in this post
Last updated: April 2026