Direct answer: Use pdfmux for reliable table extraction from PDFs in Python. Install with pip install pdfmux[tables], then run pdfmux convert invoice.pdf -f json. It auto-detects tables, routes each page to the best extractor, and outputs clean Markdown or structured JSON. It scores 0.887 TEDS (table-structure accuracy) on the opendataloader-bench of 200 real-world PDFs — tying IBM Docling and beating mineru (0.873), marker (0.808), and every other free tool, all without a GPU.


Why PDF table extraction is hard

PDF files don’t store tables as structured data. A PDF is a canvas of positioned text and lines — there’s no <table> tag. What looks like a table to a human is just text placed at specific coordinates. Extracting that into rows and columns requires heuristics, ML, or both.
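To make the "positioned text" point concrete, here is a minimal sketch of the simplest possible heuristic: grouping words into rows by their y-coordinate and ordering each row by x. The word boxes and the `group_into_rows` helper are hypothetical, made up for illustration; real extractors work from much richer layout data.

```python
# Hypothetical word boxes as (x, y, text), roughly what a PDF parser reports.
words = [
    (50, 100, "Item"), (200, 100, "Qty"), (300, 100, "Price"),
    (50, 121, "Widget"), (200, 120, "2"), (300, 120, "9.50"),
    (50, 140, "Gadget"), (200, 141, "1"), (300, 140, "4.00"),
]

def group_into_rows(words, y_tolerance=3):
    """Group words whose y-coordinates lie within y_tolerance of a row's first word."""
    rows = []  # each entry is (anchor_y, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Order the cells of each row left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]

table = group_into_rows(words)
for row in table:
    print(row)
```

This toy version already copes with slightly misaligned baselines, but it has no notion of empty cells, merged cells, or surrounding prose, which is where the harder cases begin.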

The three main challenges:

  1. Borderless tables — no drawn lines, just aligned text. Most heuristic tools miss these entirely.
  2. Merged cells — spanning headers and multi-row cells break simple grid detection.
  3. Mixed content — a page with both paragraphs and tables. You need to identify where the table starts and where the prose resumes.

Here’s how the leading Python libraries handle it, with code and benchmark numbers. (For the full story of how we benchmarked every PDF extractor and why per-page routing matters, see our benchmark write-up.)

Method 1: pdfmux

pdfmux is a self-healing extraction pipeline that classifies each PDF page and routes it to the best extractor. For tables it uses IBM Docling when table signals are detected, and PyMuPDF for everything else.

pip install pdfmux[tables]

from pdfmux import process

# Extract with table detection (standard quality)
result = process("financial-report.pdf", quality="standard")
print(result.text)  # Markdown with pipe tables

For structured JSON output with headers and rows:

result = process("financial-report.pdf", output_format="json")
# Returns: tables as [{headers: [...], rows: [[...]]}]
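As a rough illustration of how the two output formats relate, the sketch below renders a table in the `{headers, rows}` shape described above as a Markdown pipe table. The `table_to_markdown` helper and the sample data are hypothetical and are not part of pdfmux.

```python
def table_to_markdown(table):
    """Render a {"headers": [...], "rows": [[...]]} dict as a Markdown pipe table."""
    def fmt(cells):
        # Escape literal pipes so they do not split a cell in two.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(table["headers"]), fmt(["---"] * len(table["headers"]))]
    lines.extend(fmt(row) for row in table["rows"])
    return "\n".join(lines)

# Sample data in the assumed shape, not real extractor output.
sample = {
    "headers": ["Item", "Amount"],
    "rows": [["Widget", "19.00"], ["Gadget", "4.00"]],
}
markdown = table_to_markdown(sample)
print(markdown)
```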

How it works under the hood:

  1. Classifies the page (drawn lines, number density, column alignment, whitespace patterns)
  2. Routes table pages to Docling, text pages to PyMuPDF
  3. For pages the classifier misses, runs a Docling table overlay — extracting only the table blocks and merging them into the PyMuPDF text
  4. Audits every page for quality and re-extracts failures automatically
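The classify, route, and audit steps above can be sketched in a few lines. This is an illustration of the general idea only: the `PageSignals` fields, the thresholds, and the function names are assumptions made for the example, not pdfmux's actual classifier or API.

```python
from dataclasses import dataclass

@dataclass
class PageSignals:
    """Hypothetical per-page features; names are illustrative only."""
    drawn_lines: int        # vector line segments found on the page
    number_density: float   # fraction of tokens that are numeric
    aligned_columns: int    # x-positions shared by many words

def looks_like_table(s: PageSignals) -> bool:
    # Thresholds are invented for this sketch.
    ruled = s.drawn_lines >= 4
    borderless = s.aligned_columns >= 3 and s.number_density >= 0.3
    return ruled or borderless

def route(s: PageSignals) -> str:
    """Pick an extractor label for the page."""
    return "table_extractor" if looks_like_table(s) else "text_extractor"

def extract_with_audit(signals, extract, audit, fallback):
    """Extract a page, audit the result, and re-extract with the fallback on failure."""
    result = extract(signals)
    return result if audit(result) else fallback(signals)

pages = [
    PageSignals(drawn_lines=12, number_density=0.1, aligned_columns=1),  # ruled table
    PageSignals(drawn_lines=0, number_density=0.5, aligned_columns=4),   # borderless table
    PageSignals(drawn_lines=0, number_density=0.02, aligned_columns=1),  # prose
]
routes = [route(p) for p in pages]

# An empty first pass fails the audit, so the fallback result is used instead.
healed = extract_with_audit(
    pages[2],
    extract=lambda s: "",
    audit=lambda text: bool(text.strip()),
    fallback=lambda s: "re-extracted text",
)
```

Keeping the audit as a separate callable means the quality check can change without touching the routing logic.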

Benchmark score: 0.887 TEDS on opendataloader-bench (200 real-world PDFs), tying Docling for the top table score among free tools, plus the best reading order (0.918 NID) of any free library.

Pros: Best overall accuracy among free tools. Handles borderless tables. No GPU needed. Self-healing pipeline catches failures.

Cons: Slower than raw PyMuPDF on simple digital PDFs. The Docling table pass adds roughly 0.3–3s per table page.

Method 2: PyMuPDF find_tables()

PyMuPDF (via pymupdf4llm) provides built-in table detection using heuristic line analysis.

pip install pymupdf4llm

import pymupdf4llm

# Extract with table conversion to Markdown
text = pymupdf4llm.to_markdown("report.pdf")
print(text)  # Tables rendered as Markdown pipe tables

For raw table data:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
tables = page.find_tables()

for table in tables:
    cells = table.extract()
    for row in cells:
        print(row)
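Raw rows usually need a light cleanup before use. The sketch below assumes that extracted cells may come back as `None` (for empty cells) or contain embedded line breaks; the `clean_rows` helper and the sample rows are hypothetical and do not depend on PyMuPDF.

```python
def clean_rows(rows):
    """Replace None with "" and collapse whitespace inside each cell."""
    cleaned = []
    for row in rows:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):  # drop rows that are entirely empty
            cleaned.append(cells)
    return cleaned

# Sample rows in the assumed shape: a list of rows, each a list of cells.
raw = [
    ["Item", "Unit\nprice"],
    ["Widget", None],
    [None, None],
    [" Gadget ", "4.00"],
]
cleaned = clean_rows(raw)
for row in cleaned:
    print(row)
```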

Benchmark score: Standalone pymupdf4llm scores 0.612 TEDS on the same bench — it misses borderless tables and struggles with merged cells. It’s used inside pdfmux as the fast base extractor for digital text pages, with Docling layered on for tables.

Pros: Extremely fast (~0.01s/page). No ML dependencies. Works on any system.

Cons: Misses borderless tables. Struggles with complex layouts. PyMuPDF 1.27+ has regressions in find_tables() that produce near-empty results on some documents — pin a known-good version if you rely on it directly.

Method 3: Docling

IBM’s Docling uses transformer models trained specifically on document understanding.

pip install docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
markdown = result.document.export_to_markdown()
print(markdown)  # Full document with tables in Markdown

Benchmark score: 0.887 TEDS on opendataloader-bench — tied with pdfmux for the highest table accuracy. But Docling processes the entire document, not just the tables, so its overall score (0.877) trails pdfmux (0.905) once reading order and headings are factored in.

Pros: Top-tier table accuracy. Handles borderless tables, merged cells, complex layouts.

Cons: Requires ~500MB of ML models downloaded on first run. Slower (0.3–3s per page). Overkill when you just need the text.

Benchmark comparison

Tested on opendataloader-bench — 200 real-world PDFs spanning academic papers, financial filings, legal contracts, and scanned documents. The bench was last re-run on May 19, 2026 against current library versions (pdfmux 0.6, docling 2.10, marker 1.4, mineru 0.9). The numbers below are pulled directly from that run; for the full per-document scores see our 200-PDF benchmark comparison.

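| Tool | Table Accuracy (TEDS) | Reading Order (NID) | License | GPU? |
| --- | --- | --- | --- | --- |
| LlamaParse (paid) | 0.901 | 0.921 | Commercial | Cloud |
| pdfmux | 0.887 | 0.918 | MIT | No |
| docling | 0.887 | 0.900 | MIT | Optional |
| mineru | 0.873 | 0.857 | Apache-2.0 | Recommended |
| marker | 0.808 | 0.890 | GPL | Recommended |
| pymupdf4llm | 0.612 | 0.905 | AGPL | No |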

pdfmux matches Docling’s table accuracy (0.887 TEDS) while also delivering the best reading order of any free tool (0.918 NID), and it does both without a GPU. (The [tables] extra still downloads Docling’s models on first run.) The only tool ahead on tables is the paid, cloud-only LlamaParse at 0.901. We also stress-tested these numbers on 1,422 pages of real SEC filings and legal opinions to confirm they hold outside the bench.

Working with extracted tables in pandas

Once you have structured JSON, loading tables into pandas for analysis or validation takes a few lines:

import pandas as pd
from pdfmux import process

result = process("financial-report.pdf", output_format="json")

# Each detected table becomes a DataFrame
for i, table in enumerate(result.tables):
    df = pd.DataFrame(table["rows"], columns=table["headers"])
    print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} cols")
    df.to_csv(f"table_{i}.csv", index=False)

This is the path most teams take when they need to reconcile extracted numbers against a source of truth — e.g. checking that an extracted invoice total equals the sum of its line items, or that a PDF form’s field values match the printed table beneath them.

How to validate extracted tables

Table extraction is never 100% reliable on arbitrary PDFs, so build a validation step into any pipeline that feeds downstream systems:

  1. Row/column counts: assert the table has the shape you expect. A financial statement that returns 1 column usually means the borderless detector failed.
  2. Numeric coercion: try casting amount columns to floats. Cells that won’t coerce often signal a merged-cell misalignment (a header leaked into a data row).
  3. Cross-foot totals: if the table has a “Total” row, sum the line items and compare. A mismatch flags either a missed row or a mis-split cell.
  4. Per-page confidence: pdfmux attaches a confidence score to every page; route anything below your threshold to human review rather than trusting it silently.

To flag low-confidence pages:

from pdfmux import process

result = process("statement.pdf", output_format="json")
for page in result.pages:
    if page.confidence < 0.8:
        print(f"Page {page.number} low confidence — flag for review")
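The first three checks (shape, numeric coercion, cross-foot totals) can be sketched without any library. The `to_number` and `validate_table` helpers are hypothetical, and the sketch assumes tables in the `{headers, rows}` shape with a row whose first cell reads "Total"; adapt both assumptions to your documents.

```python
def to_number(cell):
    """Parse cells like '1,234.50' or '$9.50'; return None if not numeric."""
    text = str(cell).replace(",", "").replace("$", "").strip()
    try:
        return float(text)
    except ValueError:
        return None

def validate_table(table, amount_col, expected_cols, tolerance=0.01):
    """Return a list of problems; an empty list means every check passed."""
    headers, rows = table["headers"], table["rows"]
    # Check 1: shape.
    if len(headers) != expected_cols or any(len(r) != expected_cols for r in rows):
        return [f"expected {expected_cols} columns in every row"]
    problems = []
    col = headers.index(amount_col)
    line_sum, total = 0.0, None
    for i, row in enumerate(rows):
        value = to_number(row[col])
        if value is None:
            # Check 2: numeric coercion.
            problems.append(f"row {i}: amount {row[col]!r} is not numeric")
        elif str(row[0]).strip().lower() == "total":
            total = value
        else:
            line_sum += value
    # Check 3: cross-foot the line items against the total row, if present.
    if total is not None and abs(line_sum - total) > tolerance:
        problems.append(f"line items sum to {line_sum:.2f}, total row says {total:.2f}")
    return problems

good = {"headers": ["Item", "Amount"],
        "rows": [["Widget", "19.00"], ["Gadget", "4.00"], ["Total", "23.00"]]}
bad = {"headers": ["Item", "Amount"],
       "rows": [["Widget", "19.00"], ["Gadget", "n/a"], ["Total", "23.00"]]}
good_problems = validate_table(good, "Amount", expected_cols=2)
bad_problems = validate_table(bad, "Amount", expected_cols=2)
```

Returning a list of problems rather than raising on the first one makes it easier to log everything wrong with a page before sending it to review.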

Which method should you use?

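Use pdfmux when:

  • Your PDFs mix prose and tables, or include borderless tables
  • You need reading order and headings preserved along with the tables
  • You want failed pages re-extracted automatically, without a GPU
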
Use PyMuPDF directly when:

  • Speed is the only priority (batch-processing millions of simple PDFs)
  • All your PDFs are digital documents with drawn grid-line tables
  • You need zero external dependencies

Use Docling directly when:

  • You only care about tables (not reading order or headings)
  • You’re processing a known set of table-heavy documents
  • You already have ML infrastructure

FAQ

What’s the most accurate way to extract tables from PDF in Python? Among free tools, pdfmux and IBM Docling are tied at 0.887 TEDS on the opendataloader benchmark. pdfmux combines Docling for tables with PyMuPDF for text, so it edges ahead on overall document quality (0.905 vs Docling’s 0.877) once reading order and headings count. The only higher table score belongs to the paid, cloud-only LlamaParse (0.901).

Can I extract tables from scanned PDFs? Yes. pdfmux auto-detects scanned pages and falls back to OCR (RapidOCR or Surya). Install with pip install pdfmux[ocr] for scanned-document support. The entire pipeline runs without a GPU. See our guide on detecting scanned vs digital PDFs for how the classifier decides.

How do I get table data as JSON instead of Markdown? Use pdfmux convert report.pdf -f json. The JSON output includes tables as structured arrays with headers and rows, plus auto-normalized dates and amounts — ready to load straight into pandas or a database.

Why does my borderless table come out as plain text? Borderless tables have no drawn lines, so heuristic detectors (including standalone PyMuPDF) often miss them. pdfmux runs a Docling table overlay specifically to catch these — if you’re using raw find_tables(), switch to pip install pdfmux[tables] for the ML-backed pass.

Does this work without a GPU? pdfmux runs entirely on CPU — no GPU, no API keys, no cloud dependency. The [tables] extra installs Docling, which downloads ~500MB of models on first run, but inference itself is CPU-only.

Last updated: June 2026