Direct answer: Convert PDFs to Markdown for RAG using pdfmux: `pip install pdfmux && pdfmux convert document.pdf`. It outputs clean Markdown with tables, headings, and per-page confidence scores. The confidence scoring tells your pipeline which pages to trust and which to flag for review — critical for production RAG systems, where hallucination from bad ingestion is the #1 failure mode.
Why Markdown for RAG?
LLMs consume text. RAG pipelines retrieve text chunks and feed them to models as context. The quality of that context determines whether the model gives a good answer or hallucinates.
Markdown is the ideal intermediate format because:
- Structure is preserved — headings, tables, lists, bold text carry semantic meaning
- Chunking is natural — split on `## Heading` boundaries for semantically coherent chunks
- LLMs understand it — every modern LLM is trained on Markdown. It’s their native structured format.
- Tables are readable — pipe tables (`| A | B |`) are parseable by both humans and models
- No HTML overhead — clean, lightweight, no rendering dependencies
The challenge: converting PDFs to good Markdown. Most tools produce text with wrong reading order, missed tables, lost headings, or garbled scanned pages. We benchmarked every major PDF-to-Markdown tool to find which ones actually produce usable output.
The ingestion pipeline
A production RAG pipeline needs more than just PDF-to-text conversion. Here’s the full flow:
```
PDF → Extract → Audit → Recover → Chunk → Embed → Index → Retrieve → Generate
      ^^^^^^^^^^^^^^^^^^^^^^^^^
      pdfmux handles this part
```
Step 1: Extract with quality scoring
```python
from pdfmux import process

result = process("annual-report.pdf", quality="standard")

# Per-page confidence tells you which pages to trust
print(f"Confidence: {result.confidence:.0%}")
print(f"Extractor: {result.extractor_used}")
print(f"Warnings: {result.warnings}")
```
pdfmux’s standard quality mode runs a self-healing extraction pipeline that:
- Fast-extracts every page with PyMuPDF (0.01s/page)
- Audits each page for quality (text density, encoding errors, image ratio)
- Re-extracts bad pages with OCR
- Detects and extracts tables via Docling overlay
- Injects headings via font-size analysis
- Returns confidence score per page
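The audit step above can be approximated in plain Python. This is a minimal sketch with illustrative thresholds, not pdfmux's actual implementation — the function name `audit_page` and the cutoff values are assumptions for demonstration:

```python
import re

# Replacement characters and (cid:NN) artifacts signal broken font encoding
ENCODING_DEBRIS = re.compile(r'\ufffd|\(cid:\d+\)')

def audit_page(text: str, page_area: float, image_area: float) -> dict:
    """Score one extracted page on text density, encoding errors, image ratio."""
    chars = len(text.strip())
    density = chars / page_area if page_area else 0.0   # chars per pt^2
    debris = len(ENCODING_DEBRIS.findall(text))
    image_ratio = image_area / page_area if page_area else 0.0

    needs_ocr = (
        density < 0.02          # almost no text: likely a scanned page
        or debris > 5           # encoding damage: bad font maps
        or image_ratio > 0.8    # mostly images: OCR candidate
    )
    return {"density": density, "debris": debris,
            "image_ratio": image_ratio, "needs_ocr": needs_ocr}
```

Pages the audit flags get re-extracted; everything else keeps the fast path.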
Step 2: Handle tables properly
Tables are the #1 source of RAG failures. A financial report table extracted as garbled text will produce wrong answers every time. (We cover three extraction methods in detail in how to extract tables from PDF in Python.)
```python
# JSON output gives you structured table data
result = process("financial.pdf", output_format="json")

# Tables come as: [{headers: [...], rows: [[...]], page: 1}]
```
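If you consume the JSON output downstream, re-rendering a table dict as a pipe table is straightforward. A sketch assuming the `{headers, rows}` shape shown above (the helper name is ours, not a pdfmux API):

```python
def table_to_markdown(table: dict) -> str:
    """Render a {headers: [...], rows: [[...]]} dict as a Markdown pipe table."""
    headers = [str(h) for h in table["headers"]]
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in table["rows"]:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```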
For Markdown output, tables are rendered as pipe tables:
| Revenue | Q1 2025 | Q2 2025 | Q3 2025 |
|---------|---------|---------|---------|
| Product A | $12.3M | $14.1M | $15.8M |
| Product B | $8.7M | $9.2M | $10.1M |
pdfmux scores 0.911 TEDS (table accuracy) on the opendataloader benchmark — matching Docling and the highest among free tools.
Step 3: Chunk by heading structure
The best chunking strategy for RAG is heading-based splitting. Each `## Section` becomes a chunk with its content:
```python
import re

def chunk_by_headings(markdown: str, max_chunk_size: int = 2000) -> list[dict]:
    """Split Markdown into chunks at heading boundaries."""
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if len(section) < 10:
            continue
        # Extract heading as metadata
        lines = section.split('\n')
        heading = lines[0].lstrip('#').strip() if lines[0].startswith('#') else None
        # Oversized sections are split at paragraph breaks to respect max_chunk_size
        while len(section) > max_chunk_size:
            cut = section.rfind('\n\n', 0, max_chunk_size)
            if cut <= 0:
                cut = max_chunk_size  # no paragraph break found: hard cut
            piece = section[:cut].strip()
            chunks.append({'text': piece, 'heading': heading, 'char_count': len(piece)})
            section = section[cut:].strip()
        if section:
            chunks.append({'text': section, 'heading': heading, 'char_count': len(section)})
    return chunks
```
```python
result = process("report.pdf", quality="standard")
chunks = chunk_by_headings(result.text)
```
pdfmux’s heading detection (font-size analysis + bold promotion) ensures headings are correctly identified, giving you reliable chunk boundaries. It scores 0.852 MHS on the benchmark — the best heading detection of any engine, paid or free.
Step 4: Filter low-confidence pages
This is where pdfmux’s quality scoring becomes critical for production:
```python
from pdfmux import process

result = process("document.pdf", quality="standard", output_format="json")

# The JSON output includes per-page quality
# Use confidence to filter or flag pages
if result.confidence < 0.7:
    print(f"Warning: low confidence extraction ({result.confidence:.0%})")
    print("Consider manual review or using quality='high' (LLM extraction)")
```
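The same gate can be expressed as a small routing helper. A sketch with illustrative thresholds — 0.7 and 0.9 are not pdfmux defaults, tune them on your own documents:

```python
def triage_page(confidence: float, review_below: float = 0.7,
                flag_below: float = 0.9) -> str:
    """Decide what to do with a page based on extraction confidence."""
    if confidence < review_below:
        return "manual_review"    # or re-run with quality="high"
    if confidence < flag_below:
        return "index_with_flag"  # searchable, but marked for spot checks
    return "index"
```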
Step 5: Batch processing
For ingesting large document collections:
```python
from pdfmux import process_batch
from pathlib import Path

pdfs = list(Path("documents/").glob("*.pdf"))

for path, result in process_batch(pdfs, quality="standard", workers=4):
    if isinstance(result, Exception):
        print(f"Failed: {path} — {result}")
        continue
    # Write to your vector store
    chunks = chunk_by_headings(result.text)
    for chunk in chunks:
        embed_and_index(chunk, source=str(path))
```
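`embed_and_index` is your vector-store hook. For dry-running the ingestion path before wiring up an embedding model, an in-memory stand-in is enough — a sketch of the chunk/source contract, not a real indexer:

```python
# In-memory stand-in for a vector store write, useful for dry runs.
INDEX: list[dict] = []

def embed_and_index(chunk: dict, source: str) -> None:
    """Record a chunk with its provenance; swap in real embedding later."""
    INDEX.append({
        "text": chunk["text"],
        "heading": chunk.get("heading"),
        "source": source,
    })
```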
Common pitfalls
1. **Ignoring extraction quality** — Most pipelines just dump text into the vector store without checking quality. pdfmux’s confidence scoring catches bad pages before they corrupt your index.
2. **Flat text chunking** — Splitting on character count (every 500 chars) breaks mid-sentence and mid-table. Always chunk on semantic boundaries (headings, paragraphs).
3. **Losing table structure** — Tables converted to plain text (“Revenue Q1 2025 $12.3M Q2 2025 $14.1M”) are unusable. Keep them as Markdown tables — LLMs can read pipe tables accurately.
4. **No OCR fallback** — 10% of pages in a typical document collection are scanned or image-heavy. Without OCR, those pages produce empty chunks. pdfmux handles this automatically with `pip install pdfmux[ocr]` — and it does it entirely on CPU, with no GPU or API keys required.
5. **Single extractor for everything** — PyMuPDF is great for digital text but terrible at tables. Docling is great at tables but slower on plain text. pdfmux routes each page to the best tool — see our head-to-head benchmark of 7 PDF extractors for why this routing approach wins.
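Pitfall 2 is easy to demonstrate: a fixed-size splitter happily cuts mid-word, because it ignores all structure (a sketch; the chunk size is deliberately tiny to show the effect):

```python
def naive_chunks(text: str, size: int = 16) -> list[str]:
    """Fixed-size chunking: ignores headings, sentences, and words."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "## Revenue\nProduct A grew 15% in Q2."
pieces = naive_chunks(doc)
# pieces[0] == "## Revenue\nProdu" — the cut lands mid-word
```

Heading-based splitting, as in Step 3, keeps each chunk semantically whole.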
Production checklist
- Install pdfmux with table and OCR support: `pip install pdfmux[tables,ocr]`
- Use `quality="standard"` for document ingestion (not “fast”)
- Check `result.confidence` — flag pages below 0.7 for review
- Chunk by headings, not character count
- Keep table structure in Markdown format
- Use `process_batch` with workers for large collections
- Monitor `result.warnings` for extraction issues
- Test with your actual documents — benchmarks are averages
FAQ
What’s the best output format for RAG?
Markdown. It preserves structure (headings, tables, lists) that LLMs understand natively. Use JSON output when you need structured table data for database insertion.
How many PDFs can pdfmux process per hour?
In fast mode: ~10,000+ pages/hour (PyMuPDF speed). In standard mode with tables: ~1,000-3,000 pages/hour depending on table density. See our real-world benchmark across 1,422 pages of SEC filings and legal documents for actual throughput numbers.
Does pdfmux work with LangChain?
Yes. The langchain-pdfmux package provides a PdfmuxLoader that integrates directly with LangChain’s document loading pipeline.
What about PDF images and charts?
pdfmux extracts text and tables but doesn’t interpret images or charts. For image understanding, use `quality="high"`, which routes to Gemini Flash for visual content extraction.
Keep reading
- What “self-healing” PDF extraction actually looks like — the full architecture behind the extract-audit-repair pipeline that powers this guide
- How to extract tables from PDF in Python — three methods compared, with benchmark scores
- Best PDF extraction library for Python in 2026 — ranked comparison of every major tool for RAG pipelines
- How to give your AI agent the ability to read any PDF — use pdfmux as an MCP server so your agent can process PDFs directly
Last updated: March 2026