Direct answer: Use pdfmux for reliable table extraction from PDFs in Python. Install with `pip install pdfmux[tables]`, then run `pdfmux convert invoice.pdf -f json`. It auto-detects tables, routes each page to the best extractor, and outputs clean Markdown or structured JSON. It scores 0.911 TEDS (table accuracy) on opendataloader-bench, beating Docling (0.887), mineru (0.873), marker (0.808), and every other free tool.
## Why PDF table extraction is hard
PDF files don’t store tables as structured data. A PDF is a canvas of positioned text and lines — there’s no `<table>` tag. What looks like a table to a human is just text placed at specific coordinates. Extracting that into rows and columns requires heuristics, ML, or both.
The three main challenges:
- Borderless tables — no drawn lines, just aligned text. Most heuristic tools miss these entirely.
- Merged cells — spanning headers and multi-row cells break simple grid detection.
- Mixed content — a page with both paragraphs and tables. You need to identify where the table starts and the text resumes.
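To make the borderless-table problem concrete, here is a minimal sketch of the kind of heuristic such tools rely on: given words with x-coordinates, cluster them into columns by horizontal gaps. The word list and gap threshold are illustrative assumptions, not any library's actual algorithm.

```python
def cluster_columns(words, gap=30):
    """Group (x, text) words into columns by clustering x-coordinates."""
    columns = []  # list of (x_anchor, [texts])
    for x, text in sorted(words):
        if columns and x - columns[-1][0] <= gap:
            columns[-1][1].append(text)  # close to last column: same column
        else:
            columns.append((x, [text]))  # big horizontal gap: new column
    return [texts for _, texts in columns]

# Two rows of a borderless table, as a PDF lays them out:
# aligned text at coordinates, no drawn lines.
words = [(50, "Item"), (222, "Qty"), (400, "Price"),
         (50, "Widget"), (220, "4"), (402, "19.99")]
print(cluster_columns(words))
# [['Item', 'Widget'], ['4', 'Qty'], ['Price', '19.99']]
```

Real extractors work from full bounding boxes and must also recover row order, which is where the heuristics start to break down on merged cells and mixed content.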
Here’s how the leading Python libraries handle it, with code and benchmark numbers. (For the full story of how we benchmarked every PDF extractor and why routing matters, see our 200-PDF benchmark comparison.)
## Method 1: pdfmux (recommended)
pdfmux is a self-healing extraction pipeline that classifies each PDF page and routes it to the best extractor. For tables, it uses IBM Docling (97.9% accuracy) when table signals are detected, and PyMuPDF for everything else.
```bash
pip install pdfmux[tables]
```
```python
from pdfmux import process

# Extract with table detection (standard quality)
result = process("financial-report.pdf", quality="standard")
print(result.text)  # Markdown with pipe tables
```
For structured JSON output with headers and rows:
```python
result = process("financial-report.pdf", output_format="json")
# Returns tables as [{"headers": [...], "rows": [[...]]}]
```
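Because the JSON table shape is plain headers-plus-rows, downstream processing needs nothing beyond the standard library. The sample dict below is a stand-in for one table from pdfmux's JSON output, in the shape documented above.

```python
import csv
import io

# Stand-in for one table from pdfmux's JSON output
# (shape as documented above: {"headers": [...], "rows": [[...]]}).
table = {
    "headers": ["Item", "Qty", "Amount"],
    "rows": [["Widget", "4", "19.99"], ["Gadget", "2", "7.50"]],
}

# Write the table out as CSV using only the standard library.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(table["headers"])
writer.writerows(table["rows"])
print(buf.getvalue())
```

The same loop works per table if a document yields several; swap `io.StringIO` for a real file handle to write to disk.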
How it works under the hood:
- Classifies the PDF (drawn lines, number density, column alignment, whitespace patterns)
- Routes table pages to Docling, text pages to PyMuPDF
- For pages the classifier misses, runs a Docling table overlay — extracting only table blocks and merging them into the PyMuPDF text
- Audits every page for quality and re-extracts failures
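The routing step above can be sketched as a small scoring function. The feature names and thresholds here are illustrative guesses to show the idea, not pdfmux's actual classifier.

```python
def route_page(features):
    """Illustrative router: send table-like pages to Docling, the rest
    to PyMuPDF. `features` is a dict of per-page signals like those
    listed above (drawn lines, number density, column alignment)."""
    score = 0
    if features.get("drawn_lines", 0) > 4:         # grid lines suggest a table
        score += 2
    if features.get("number_density", 0.0) > 0.3:  # lots of numerals
        score += 1
    if features.get("aligned_columns", 0) >= 3:    # repeated x-alignment
        score += 2
    return "docling" if score >= 3 else "pymupdf"

print(route_page({"drawn_lines": 12, "number_density": 0.5}))  # docling
print(route_page({"drawn_lines": 0, "number_density": 0.05}))  # pymupdf
```

The point of a cheap classifier like this is that the expensive ML extractor only runs on pages that look like they need it, which is why the pipeline stays fast on text-heavy documents.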
Benchmark score: 0.911 TEDS on opendataloader-bench (200 real-world PDFs).
Pros: Best accuracy among free tools. Handles borderless tables. No GPU needed. Self-healing pipeline catches failures.
Cons: Slower than raw PyMuPDF on simple digital PDFs. Docling adds ~0.3-3s per table page.
## Method 2: PyMuPDF `find_tables()`
PyMuPDF (via pymupdf4llm) provides built-in table detection using heuristic line analysis.
```bash
pip install pymupdf4llm
```
```python
import pymupdf4llm

# Extract with table conversion to Markdown
text = pymupdf4llm.to_markdown("report.pdf")
print(text)  # Tables rendered as Markdown pipe tables
```
For raw table data:
```python
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
page = doc[0]
tables = page.find_tables()
for table in tables:
    cells = table.extract()
    for row in cells:
        print(row)
```
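`extract()` gives you raw rows; a common next step is zipping the first row (the headers) onto the rest. The sample rows below are a hypothetical `extract()` result; real cells can contain `None` for merged spans, which this sketch fills with empty strings.

```python
def rows_to_records(cells):
    """Turn extract()-style rows (first row = headers) into dicts,
    replacing None cells (merged spans) with empty strings."""
    headers, *body = cells
    return [
        {h: ("" if v is None else v) for h, v in zip(headers, row)}
        for row in body
    ]

# Hypothetical output of table.extract() on a small table
cells = [["Item", "Qty", "Amount"], ["Widget", "4", None]]
print(rows_to_records(cells))
# [{'Item': 'Widget', 'Qty': '4', 'Amount': ''}]
```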
Benchmark score: benchmarked as part of pdfmux’s pipeline rather than standalone. On its own, PyMuPDF scores lower on complex tables: it misses borderless tables and struggles with merged cells.
Pros: Extremely fast (0.01s/page). No ML dependencies. Works on any system.
Cons: Misses borderless tables. Struggles with complex layouts. PyMuPDF 1.27+ has regressions in find_tables() that produce near-empty results on some documents.
## Method 3: Docling
IBM’s Docling uses transformer models trained specifically on document understanding.
```bash
pip install docling
```
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
markdown = result.document.export_to_markdown()
print(markdown)  # Full document with tables in Markdown
```
Benchmark score: 0.887 TEDS on opendataloader-bench when run standalone. The strongest single-model table extractor, but it processes the entire document (not just tables).
Pros: Excellent table accuracy. Handles borderless tables, merged cells, complex layouts.
Cons: Requires ~500MB of ML models downloaded on first run. Slower (0.3-3s per page). Overkill when you just need the text.
## Benchmark comparison
Tested on opendataloader-bench — 200 real-world PDFs from academic papers, financial filings, legal contracts, and scanned documents. (For the full benchmark methodology and per-tool deep dives, see our 200-PDF benchmark comparison.)
| Tool | Table Accuracy (TEDS) | Reading Order (NID) | Cost | Speed |
|---|---|---|---|---|
| hybrid (AI) | 0.928 | 0.935 | Paid API | Slow |
| pdfmux | 0.911 | 0.920 | Free (CPU) | ~0.5s/page |
| docling | 0.887 | 0.900 | Free (ML) | ~1s/page |
| mineru | 0.873 | 0.857 | Free (ML) | ~2s/page |
| marker | 0.808 | 0.890 | Free (ML) | ~1s/page |
pdfmux now leads free tools on table accuracy (0.911 TEDS) while also delivering the best reading order at 0.920 NID, and it does both without requiring a GPU or ML model downloads. We also stress-tested this on 1,422 pages of real SEC filings and legal opinions to confirm these numbers hold outside synthetic benchmarks.
## Which method should you use?
Use pdfmux when:
- You need reliable extraction across diverse PDFs (the default choice — see our ranked comparison of every Python PDF library)
- You’re building a RAG pipeline that ingests unknown documents
- You want table detection + text extraction in one pipeline
- You don’t want to manage ML model dependencies
Use PyMuPDF directly when:
- Speed is the only priority (batch processing millions of PDFs)
- All your PDFs are simple digital documents with grid-line tables
- You need zero external dependencies
Use Docling directly when:
- You only care about tables (not reading order or headings)
- You’re processing a known set of table-heavy documents
- You already have ML infrastructure
## FAQ
What’s the most accurate way to extract tables from PDF in Python?
pdfmux leads the opendataloader benchmark at 0.911 TEDS, with IBM Docling next at 0.887. pdfmux combines Docling for tables with PyMuPDF for text, giving you the best of both plus better overall accuracy (0.900 vs 0.877).
Can I extract tables from scanned PDFs?
Yes. pdfmux auto-detects scanned pages and falls back to OCR (RapidOCR or Surya). Install with `pip install pdfmux[ocr]` for scanned document support. The entire pipeline runs without a GPU.
How do I get table data as JSON instead of Markdown?
Use `pdfmux convert report.pdf -f json`. The JSON output includes tables as structured arrays with headers and rows, plus auto-normalized dates and amounts.
Does this work without a GPU?
pdfmux runs entirely on CPU. No GPU, no API keys, no cloud dependency. The `[tables]` extra installs Docling, which downloads ~500MB of models on first run, but inference is CPU-only.
## Keep reading
- What “self-healing” PDF extraction actually looks like — the audit-and-repair pipeline that makes table extraction reliable
- PDF to Markdown for RAG pipelines — how to feed extracted tables into LLM pipelines
- pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — full benchmark scores across all major tools
Last updated: March 2026