How to extract tables from PDF in Python (3 methods compared)

TL;DR: Extract tables from PDF files in Python using pdfmux, PyMuPDF, and Docling. Code examples, accuracy benchmarks, and the best approach for each use case.

Direct answer: Use pdfmux for reliable table extraction from PDFs in Python. Install with pip install pdfmux[tables], then run pdfmux convert invoice.pdf -f json. It auto-detects tables, routes each page to the best extractor, and outputs clean Markdown or structured JSON. It scores 0.911 TEDS (table-structure accuracy) on the opendataloader-bench of 200 real-world PDFs — beating IBM Docling (0.887), mineru (0.873), marker (0.808), and every other free tool, all without a GPU.

Why PDF table extraction is hard

PDF files don’t store tables as structured data. A PDF is a canvas of positioned text and lines — there’s no <table> tag. What looks like a table to a human is just text placed at specific coordinates. Extracting that into rows and columns requires heuristics, ML, or both.

The three main challenges:

Borderless tables — no drawn lines, just aligned text. Most heuristic tools miss these entirely.
Merged cells — spanning headers and multi-row cells break simple grid detection.
Mixed content — a page with both paragraphs and tables. You need to identify where the table starts and where the prose resumes.
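To make the "text placed at coordinates" point concrete, here is a minimal sketch of the simplest possible heuristic: cluster words into rows by their y-coordinate, then order each row by x. The word tuples and the 2-point tolerance are invented for illustration; none of the libraries below is claimed to work exactly this way.

```python
# Hypothetical input: (x, y, text) tuples, roughly what a PDF parser
# reports for each word. Values are made up for this example.
words = [
    (72.0, 100.2, "Item"), (200.0, 100.0, "Qty"), (300.0, 100.1, "Price"),
    (72.0, 115.1, "Widget"), (200.0, 115.0, "2"), (300.0, 114.9, "9.99"),
    (72.0, 130.0, "Gadget"), (200.0, 130.2, "1"), (300.0, 129.8, "24.50"),
]


def group_into_rows(words, y_tolerance=2.0):
    """Cluster words into rows by y-coordinate, then order each row by x."""
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        # A word joins the current row if it sits within y_tolerance
        # of that row's first word; otherwise it starts a new row.
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


grid = group_into_rows(words)
for row in grid:
    print(row)
```

This works on a tidy three-column grid and falls apart as soon as a cell wraps onto two lines, a cell is empty, or a paragraph sits next to the table, which is why real extractors layer line detection or ML on top.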

Here’s how the leading Python libraries handle it, with code and benchmark numbers. (The full story of how we benchmarked every PDF extractor, and why per-page routing matters, is covered in our benchmark write-up.)

Method 1: pdfmux (recommended)

pdfmux is a self-healing extraction pipeline that classifies each PDF page and routes it to the best extractor. For tables it uses IBM Docling when table signals are detected, and PyMuPDF for everything else.

pip install pdfmux[tables]

from pdfmux import process

# Extract with table detection (standard quality)
result = process("financial-report.pdf", quality="standard")
print(result.text)  # Markdown with pipe tables

For structured JSON output with headers and rows:

result = process("financial-report.pdf", output_format="json")
# Returns: tables as [{headers: [...], rows: [[...]]}]

How it works under the hood:

Classifies the page (drawn lines, number density, column alignment, whitespace patterns)
Routes table pages to Docling, text pages to PyMuPDF
For pages the classifier misses, runs a Docling table overlay — extracting only the table blocks and merging them into the PyMuPDF text
Audits every page for quality and re-extracts failures automatically
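The classify, route, and audit steps above can be sketched in a few lines. This is an illustration of the general idea only, not pdfmux's actual code: the `PageSignals` fields, the thresholds, and the "output too short" audit rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page features; the names are illustrative only."""
    drawn_lines: int        # vector line segments found on the page
    numeric_ratio: float    # share of tokens that look like numbers
    aligned_columns: int    # columns of vertically aligned text runs


def choose_extractor(signals: PageSignals) -> str:
    """Return 'table' for pages that look tabular, else 'text'."""
    if signals.drawn_lines >= 8:  # assumed threshold: a drawn grid
        return "table"
    if signals.aligned_columns >= 3 and signals.numeric_ratio >= 0.3:
        return "table"            # assumed rule for borderless tables
    return "text"


def extract_with_audit(signals, extractors, min_chars=20):
    """Run the routed extractor; retry with the other one if output looks empty."""
    first = choose_extractor(signals)
    output = extractors[first]()
    if len(output.strip()) >= min_chars:
        return first, output
    fallback = "text" if first == "table" else "table"
    return fallback, extractors[fallback]()


# Stub extractors stand in for the real table and text engines.
stubs = {
    "table": lambda: "",  # simulate a failed table pass
    "text": lambda: "Plain text recovered from the page body.",
}
used, text = extract_with_audit(PageSignals(12, 0.1, 1), stubs)
print(used, "->", text)
```

The design point is that the routing decision is cheap and per page, so a 200-page report with three table pages only pays the slow extractor's cost three times.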

Benchmark score: 0.911 TEDS on opendataloader-bench (200 real-world PDFs), the top table score among free tools — ahead of Docling (0.887) — plus the best reading order (0.920 NID) of any free library.

Pros: Best overall accuracy among free tools. Handles borderless tables. No GPU needed. Self-healing pipeline catches failures.

Cons: Slower than raw PyMuPDF on simple digital PDFs. The Docling table pass adds roughly 0.3–3s per table page.

Method 2: PyMuPDF find_tables()

PyMuPDF provides built-in table detection using heuristic line analysis; pymupdf4llm wraps it and renders the result as Markdown.

pip install pymupdf4llm

import pymupdf4llm

# Extract with table conversion to Markdown
text = pymupdf4llm.to_markdown("report.pdf")
print(text)  # Tables rendered as Markdown pipe tables

For raw table data:

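import fitz  # PyMuPDF's legacy import name; newer releases also accept "import pymupdf"

doc = fitz.open("report.pdf")
page = doc[0]  # first page only
finder = page.find_tables()  # returns a TableFinder

# The detected tables live on the finder's .tables list
for table in finder.tables:
    cells = table.extract()  # list of rows, each a list of cell values
    for row in cells:
        print(row)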
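Raw rows usually need a cleanup pass before they are useful. The helper below is a minimal sketch, assuming the extracted rows are a list of lists whose first row is the header and whose cells may be `None` or contain line breaks; `rows_to_records` is a hypothetical name, not part of PyMuPDF.

```python
def rows_to_records(rows):
    """Turn a list-of-lists table (first row = header) into a list of dicts."""

    def clean(cell):
        # None becomes an empty string; runs of whitespace collapse to one space.
        return "" if cell is None else " ".join(str(cell).split())

    if not rows:
        return []
    # Blank header cells get a positional placeholder name.
    header = [clean(c) or f"col_{i}" for i, c in enumerate(rows[0])]
    records = []
    for row in rows[1:]:
        cells = [clean(c) for c in row]
        if not any(cells):  # skip rows that are empty after cleaning
            continue
        records.append(dict(zip(header, cells)))
    return records


sample = [
    ["Item", "Qty", None],
    ["Widget\nLarge", "2", "9.99"],
    [None, None, None],
    ["Gadget", "1", "24.50"],
]
records = rows_to_records(sample)
print(records)
```

In real use you would pass `table.extract()` in place of `sample`.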

Benchmark score: Standalone pymupdf4llm scores 0.612 TEDS on the same bench — it misses borderless tables and struggles with merged cells. It’s used inside pdfmux as the fast base extractor for digital text pages, with Docling layered on for tables.

Pros: Extremely fast (~0.01s/page). No ML dependencies. Works on any system.

Cons: Misses borderless tables. Struggles with complex layouts. PyMuPDF 1.27+ has regressions in find_tables() that produce near-empty results on some documents — pin a known-good version if you rely on it directly.
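If you do call find_tables() directly, a startup check can flag the affected release line. This is a minimal sketch: the 1.27 cutoff comes from the note above, while the helper names are hypothetical and the check only compares major and minor version numbers.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Return (major, minor) from a version string such as '1.26.4'."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


def find_tables_may_regress(version_string, first_affected=(1, 27)):
    """True when the version is at or above the release line flagged above."""
    return parse_major_minor(version_string) >= first_affected


def check_installed_pymupdf():
    """Print a warning for affected versions; return None if PyMuPDF is absent."""
    try:
        installed = version("PyMuPDF")
    except PackageNotFoundError:
        return None
    if find_tables_may_regress(installed):
        print(f"PyMuPDF {installed}: verify find_tables() output before relying on it")
    return installed


check_installed_pymupdf()
```

Pinning in your dependency file is the sturdier fix; the runtime check is just a safety net for environments you don't control.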

Method 3: Docling

IBM’s Docling uses transformer models trained specifically on document understanding.

pip install docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
markdown = result.document.export_to_markdown()
print(markdown)  # Full document with tables in Markdown

Benchmark score: 0.887 TEDS on opendataloader-bench — the highest single-engine table score, just behind pdfmux’s routed 0.911. And Docling processes the entire document, not just the tables, so its overall score (0.877) trails pdfmux (0.903) once reading order and headings are factored in.

Pros: Top-tier table accuracy. Handles borderless tables, merged cells, complex layouts.

Cons: Requires ~500MB of ML models downloaded on first run. Slower (0.3–3s per page). Overkill when you just need the text.

Benchmark comparison

Tested on opendataloader-bench — 200 real-world PDFs spanning academic papers, financial filings, legal contracts, and scanned documents. The bench was last re-run on May 19, 2026 against current library versions (pdfmux 0.6, docling 2.10, marker 1.4, mineru 0.9). The numbers below are pulled directly from that run; for the full per-document scores see our 200-PDF benchmark comparison.

Tool	Table Accuracy (TEDS)	Reading Order (NID)	License	GPU?
opendataloader-hybrid (paid)	0.928	0.935	Commercial	Cloud
pdfmux	0.911	0.920	MIT	No
docling	0.887	0.900	MIT	Optional
mineru	0.873	0.857	Apache-2.0	Recommended
marker	0.808	0.890	GPL	Recommended
pymupdf4llm	0.612	0.905	AGPL	No

pdfmux leads free tools on tables at 0.911 TEDS, ahead of Docling’s 0.887, while also delivering the best reading order of any free tool (0.920 NID), and it does both without a GPU. (The table pass still relies on Docling’s models, which are downloaded on first run.) The only tool ahead on tables is the paid opendataloader-hybrid engine at 0.928. We also stress-tested these numbers on 1,422 pages of real SEC filings and legal opinions to confirm they hold outside the 200-PDF bench.
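For readers new to the metric: TEDS (Tree-Edit-Distance-based Similarity) compares the predicted table and the ground-truth table as HTML trees. The formula below is the standard definition from the table-recognition literature; we are assuming the bench applies it unmodified.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Here $T_a$ and $T_b$ are the two table trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in a tree. A score of 1.0 means the structure and cell contents match exactly, so 0.911 versus 0.887 reflects fewer insertions, deletions, and relabelings needed to turn the extracted table into the reference.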

Working with extracted tables in pandas

Once you have structured JSON, loading tables into pandas for analysis or validation takes a few lines. If your end goal is a CSV file rather than a DataFrame, see the dedicated guide on converting PDF tables to CSV in Python — it covers pdfplumber, camelot, tabula-py, and pdfmux side by side with pitfalls like merged cells and UTF-8 BOM handling.

import pandas as pd
from pdfmux import process

result = process("financial-report.pdf", output_format="json")

# Each detected table becomes a DataFrame
for i, table in enumerate(result.tables):
    df = pd.DataFrame(table["rows"], columns=table["headers"])
    print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} cols")
    df.to_csv(f"table_{i}.csv", index=False)

This is the path most teams take when they need to reconcile extracted numbers against a source of truth — e.g. checking that an extracted invoice total equals the sum of its line items, or that a PDF form’s field values match the printed table beneath them.

How to validate extracted tables

Table extraction is never 100% reliable on arbitrary PDFs, so build a validation step into any pipeline that feeds downstream systems:

Row/column counts — assert the table has the shape you expect. A financial statement that returns 1 column usually means the borderless detector failed.
Numeric coercion — try casting amount columns to floats. Cells that won’t coerce often signal a merged-cell misalignment (a header leaked into a data row).
Cross-foot totals — if the table has a “Total” row, sum the line items and compare. A mismatch flags either a missed row or a mis-split cell.
Per-page confidence — pdfmux attaches a confidence score to every page; route anything below your threshold to human review rather than trusting it silently.

from pdfmux import process

result = process("statement.pdf", output_format="json")
for page in result.pages:
    # Anything under the threshold goes to human review
    if page.confidence < 0.8:
        print(f"Page {page.number} low confidence: flag for review")
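The first three checks in the list can be sketched without any dependencies. This is an illustrative helper, not part of pdfmux: it assumes the `{headers, rows}` table shape described earlier, and it assumes the total row is labeled "Total" in its first cell.

```python
def to_number(cell):
    """Parse cells like '1,234.50' or '$99'; return None if not numeric."""
    cleaned = str(cell).replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def validate_table(table, amount_column, expected_columns, tolerance=0.01):
    """Return a list of problems; an empty list means every check passed."""
    headers, rows = table["headers"], table["rows"]

    # Check 1: shape. Later checks are meaningless if this fails.
    if len(headers) != expected_columns:
        return [f"expected {expected_columns} columns, got {len(headers)}"]
    if amount_column not in headers:
        return [f"missing column {amount_column!r}"]
    col = headers.index(amount_column)

    problems = []
    line_total, stated_total = 0.0, None
    for i, row in enumerate(rows):
        if len(row) != len(headers):
            problems.append(f"row {i} has {len(row)} cells")
            continue
        # Check 2: numeric coercion of the amount column.
        value = to_number(row[col])
        if value is None:
            problems.append(f"row {i}: {row[col]!r} is not numeric")
        elif str(row[0]).strip().lower() == "total":
            stated_total = value
        else:
            line_total += value

    # Check 3: cross-foot the line items against the Total row.
    if stated_total is not None and abs(stated_total - line_total) > tolerance:
        problems.append(
            f"line items sum to {line_total:.2f}, Total row says {stated_total:.2f}"
        )
    return problems


good = {
    "headers": ["Item", "Qty", "Amount"],
    "rows": [["Widget", "2", "19.98"], ["Gadget", "1", "24.50"], ["Total", "", "44.48"]],
}
print(validate_table(good, "Amount", expected_columns=3))
```

Returning a list of problems rather than raising on the first one lets you log everything wrong with a table in a single pass, which makes triage faster when a whole batch fails.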

Which method should you use?

Use pdfmux when:

You need reliable extraction across diverse PDFs (the default choice — see our ranked comparison of every Python PDF library)
You’re building a RAG pipeline or feeding tables into an LLM as JSON
You want table detection + text extraction in one pipeline
You don’t want to wire up and manage ML models yourself (the [tables] extra installs Docling for you)

Use PyMuPDF directly when:

Speed is the only priority (batch-processing millions of simple PDFs)
All your PDFs are digital documents with drawn grid-line tables
You need zero ML dependencies

Use Docling directly when:

You only care about tables (not reading order or headings)
You’re processing a known set of table-heavy documents
You already have ML infrastructure

FAQ

What’s the most accurate way to extract tables from PDF in Python? Among free tools, pdfmux leads table accuracy at 0.911 TEDS on the opendataloader benchmark, ahead of IBM Docling’s 0.887. pdfmux combines Docling for tables with PyMuPDF for text, so it also leads on overall document quality (0.903 vs Docling’s 0.877) once reading order and headings count. The only higher table score belongs to the paid opendataloader-hybrid engine (0.928).

Can I extract tables from scanned PDFs? Yes. pdfmux auto-detects scanned pages and falls back to OCR (RapidOCR or Surya). Install with pip install pdfmux[ocr] for scanned-document support. The entire pipeline runs without a GPU. See our guide on detecting scanned vs digital PDFs for how the classifier decides.

How do I get table data as JSON instead of Markdown? Use pdfmux convert report.pdf -f json. The JSON output includes tables as structured arrays with headers and rows, plus auto-normalized dates and amounts — ready to load straight into pandas or a database.

Why does my borderless table come out as plain text? Borderless tables have no drawn lines, so heuristic detectors (including standalone PyMuPDF) often miss them. pdfmux runs a Docling table overlay specifically to catch these — if you’re using raw find_tables(), switch to pip install pdfmux[tables] for the ML-backed pass.

Does this work without a GPU? pdfmux runs entirely on CPU — no GPU, no API keys, no cloud dependency. The [tables] extra installs Docling, which downloads ~500MB of models on first run, but inference itself is CPU-only.

Keep reading

What “self-healing” PDF extraction actually looks like — the audit-and-repair pipeline that makes table extraction reliable
PDF to Markdown for RAG pipelines — how to feed extracted tables into LLM pipelines
Extract invoice data from PDFs in Python — a table-heavy worked example with totals validation
pdfmux vs PyMuPDF vs marker vs docling: 200-PDF benchmark — full benchmark scores across all major tools

Last updated: June 2026