Direct answer: Use pdfmux for reliable table extraction from PDFs in Python. Install with pip install pdfmux[tables], then run pdfmux convert invoice.pdf -f json. It auto-detects tables, routes to the best extractor per page, and outputs clean Markdown or structured JSON. It scores 0.911 TEDS (table accuracy) on opendataloader-bench, beating Docling (0.887), mineru (0.873), marker (0.808), and every other free tool we tested.


Why PDF table extraction is hard

PDF files don’t store tables as structured data. A PDF is a canvas of positioned text and lines — there’s no <table> tag. What looks like a table to a human is just text placed at specific coordinates. Extracting that into rows and columns requires heuristics, ML, or both.

The three main challenges:

  1. Borderless tables — no drawn lines, just aligned text. Most heuristic tools miss these entirely.
  2. Merged cells — spanning headers and multi-row cells break simple grid detection.
  3. Mixed content — a page with both paragraphs and tables. You need to identify where the table starts and the text resumes.
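To make the coordinate problem concrete, here is a minimal, library-free sketch of the core heuristic: cluster text spans into rows by y-coordinate, then order cells by x. Real extractors layer much more on top (line detection, ML layout models), but this is the starting point:

```python
from collections import defaultdict

def spans_to_rows(spans, y_tol=2.0):
    """Group (x, y, text) spans into rows by y-coordinate, then sort cells by x.

    spans: list of (x, y, text) tuples, which is all a PDF actually stores:
    a position and a string, with no table structure at all.
    """
    rows = defaultdict(list)
    for x, y, text in spans:
        # Snap y to an existing row baseline within tolerance, else start a new row.
        key = next((k for k in rows if abs(k - y) <= y_tol), y)
        rows[key].append((x, text))
    # Emit rows top-to-bottom, cells left-to-right.
    return [[t for _, t in sorted(cells)] for _, cells in sorted(rows.items())]

# A borderless 2x2 "table": nothing marks it as tabular except alignment.
spans = [(72, 100, "Item"), (200, 100.5, "Price"),
         (72, 120, "Widget"), (200, 119.8, "9.99")]
print(spans_to_rows(spans))
# → [['Item', 'Price'], ['Widget', '9.99']]
```

Note that the baselines (100 vs 100.5) don't match exactly; real PDF text rarely does, which is why every heuristic needs a tolerance and why borderless tables are so easy to miss.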

Here’s how the leading Python libraries handle it, with code and benchmark numbers. (For the full story of how we benchmarked every PDF extractor and why routing matters, see the benchmark comparison below.)

Method 1: pdfmux

pdfmux is a self-healing extraction pipeline that classifies each PDF page and routes it to the best extractor. For tables, it uses IBM Docling (97.9% accuracy) when table signals are detected, and PyMuPDF for everything else.

pip install pdfmux[tables]

from pdfmux import process

# Extract with table detection (standard quality)
result = process("financial-report.pdf", quality="standard")
print(result.text)  # Markdown with pipe tables
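If you need to post-process those pipe tables downstream (say, in a RAG chunker), a small parser is enough. This sketch assumes the plain GitHub-style pipe format with a single separator row; it is not part of pdfmux:

```python
def parse_pipe_table(md):
    """Parse a Markdown pipe table into (headers, rows).

    Assumes the common shape: header row, separator row of dashes, data rows,
    and no escaped pipes inside cells.
    """
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    split = lambda line: [c.strip() for c in line.strip("|").split("|")]
    headers = split(lines[0])
    rows = [split(l) for l in lines[2:]]  # skip the |---|---| separator row
    return headers, rows

md = """
| Item | Price |
|------|-------|
| Widget | 9.99 |
"""
headers, rows = parse_pipe_table(md)
print(headers, rows)
# → ['Item', 'Price'] [['Widget', '9.99']]
```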

For structured JSON output with headers and rows:

result = process("financial-report.pdf", output_format="json")
# Returns: tables as [{headers: [...], rows: [[...]]}]
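Assuming that shape (the headers/rows keys come from the snippet above, not a verified schema), converting each table into a list of dicts takes one zip per row:

```python
# Hypothetical payload matching the shape shown above.
tables = [{"headers": ["Item", "Price"],
           "rows": [["Widget", "9.99"], ["Gadget", "4.50"]]}]

def table_to_records(table):
    """Zip each row with the header list to get one dict per row."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]

records = table_to_records(tables[0])
print(records)
# → [{'Item': 'Widget', 'Price': '9.99'}, {'Item': 'Gadget', 'Price': '4.50'}]
```

From there the records drop straight into pandas, a database insert, or an LLM prompt.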

How it works under the hood:

  1. Classifies the PDF (drawn lines, number density, column alignment, whitespace patterns)
  2. Routes table pages to Docling, text pages to PyMuPDF
  3. For pages the classifier misses, runs a Docling table overlay — extracting only table blocks and merging them into the PyMuPDF text
  4. Audits every page for quality and re-extracts failures
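The routing step above can be sketched as a plain scoring function. The signal names and threshold here are illustrative, not pdfmux's actual classifier:

```python
def route_page(signals, threshold=2):
    """Pick an extractor for a page based on table-ish features.

    signals: dict of booleans such as drawn grid lines, high digit density,
    or column-aligned text. Names and threshold are illustrative only.
    """
    table_score = sum(signals.get(k, False) for k in
                      ("drawn_lines", "number_dense", "column_aligned"))
    return "docling" if table_score >= threshold else "pymupdf"

print(route_page({"drawn_lines": True, "number_dense": True}))  # docling
print(route_page({"column_aligned": True}))                     # pymupdf
```

The payoff of routing is that the expensive ML model only runs on pages that look tabular, which is why the pipeline stays fast on mostly-text documents.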

Benchmark score: 0.911 TEDS on opendataloader-bench (200 real-world PDFs).

Pros: Best accuracy among free tools. Handles borderless tables. No GPU needed. Self-healing pipeline catches failures.

Cons: Slower than raw PyMuPDF on simple digital PDFs. Docling adds ~0.3-3s per table page.

Method 2: PyMuPDF find_tables()

PyMuPDF (via pymupdf4llm) provides built-in table detection using heuristic line analysis.

pip install pymupdf4llm

import pymupdf4llm

# Extract with table conversion to Markdown
text = pymupdf4llm.to_markdown("report.pdf")
print(text)  # Tables rendered as Markdown pipe tables

For raw table data:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
tables = page.find_tables()

for table in tables:
    cells = table.extract()
    for row in cells:
        print(row)
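If you want those raw rows rendered the way pymupdf4llm renders them (as pipe tables), a minimal formatter looks like this. It assumes rectangular rows with the first row as the header, and tolerates None cells, which can appear for empty positions:

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown pipe table."""
    fmt = lambda cells: "| " + " | ".join(
        "" if c is None else str(c) for c in cells) + " |"
    header, *body = rows
    sep = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), sep] + [fmt(r) for r in body])

print(rows_to_markdown([["Item", "Price"], ["Widget", "9.99"]]))
# → | Item | Price |
#   | --- | --- |
#   | Widget | 9.99 |
```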

Benchmark score: not scored standalone; PyMuPDF runs as part of pdfmux’s pipeline. On its own it scores lower on complex tables: it misses borderless tables entirely and struggles with merged cells.

Pros: Extremely fast (0.01s/page). No ML dependencies. Works on any system.

Cons: Misses borderless tables. Struggles with complex layouts. PyMuPDF 1.27+ has regressions in find_tables() that produce near-empty results on some documents.

Method 3: Docling

IBM’s Docling uses transformer models trained specifically on document understanding.

pip install docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
markdown = result.document.export_to_markdown()
print(markdown)  # Full document with tables in Markdown

Benchmark score: 0.887 TEDS on opendataloader-bench. The most accurate standalone extractor we tested, but it processes the entire document (not just tables).

Pros: Excellent accuracy on hard tables. Handles borderless tables, merged cells, and complex layouts.

Cons: Requires ~500MB of ML models downloaded on first run. Slower (0.3-3s per page). Overkill when you just need the text.

Benchmark comparison

Tested on opendataloader-bench — 200 real-world PDFs from academic papers, financial filings, legal contracts, and scanned documents. (For the full benchmark methodology and per-tool deep dives, see our 200-PDF benchmark comparison.)

| Tool | Table Accuracy (TEDS) | Reading Order (NID) | Cost | Speed |
|------|----------------------|---------------------|------|-------|
| hybrid (AI) | 0.928 | 0.935 | Paid API | Slow |
| pdfmux | 0.911 | 0.920 | Free (CPU) | ~0.5s/page |
| docling | 0.887 | 0.900 | Free (ML) | ~1s/page |
| mineru | 0.873 | 0.857 | Free (ML) | ~2s/page |
| marker | 0.808 | 0.890 | Free (ML) | ~1s/page |

pdfmux now beats Docling on table accuracy (0.911 vs 0.887 TEDS) while also delivering the best reading order among free tools at 0.920 NID, and it does both without requiring a GPU or ML model downloads. We also stress-tested this on 1,422 pages of real SEC filings and legal opinions to confirm these numbers hold outside synthetic benchmarks.

Which method should you use?

Use pdfmux when:

  • You need reliable extraction across diverse PDFs (the default choice — see our ranked comparison of every Python PDF library)
  • You’re building a RAG pipeline that ingests unknown documents
  • You want table detection + text extraction in one pipeline
  • You don’t want to manage ML model dependencies

Use PyMuPDF directly when:

  • Speed is the only priority (batch processing millions of PDFs)
  • All your PDFs are simple digital documents with grid-line tables
  • You need zero external dependencies

Use Docling directly when:

  • You only care about tables (not reading order or headings)
  • You’re processing a known set of table-heavy documents
  • You already have ML infrastructure

FAQ

What’s the most accurate way to extract tables from PDF in Python? pdfmux, which now leads at 0.911 TEDS vs Docling’s 0.887 on the opendataloader benchmark. pdfmux combines Docling for tables with PyMuPDF for text, giving you the best of both and better overall accuracy (0.900 vs 0.877).

Can I extract tables from scanned PDFs? Yes. pdfmux auto-detects scanned pages and falls back to OCR (RapidOCR or Surya). Install with pip install pdfmux[ocr] for scanned document support. The entire pipeline runs without a GPU.

How do I get table data as JSON instead of Markdown? Use pdfmux convert report.pdf -f json. The JSON output includes tables as structured arrays with headers and rows, plus auto-normalized dates and amounts.

Does this work without a GPU? pdfmux runs entirely on CPU. No GPU, no API keys, no cloud dependency. The [tables] extra installs Docling which downloads ~500MB of models on first run, but inference is CPU-only.


Last updated: March 2026