Direct answer: Convert PDFs to Markdown for RAG using pdfmux: pip install pdfmux && pdfmux convert document.pdf. It outputs clean Markdown with tables, headings, and per-page confidence scores. The confidence scoring tells your pipeline which pages to trust and which to flag for review — critical for production RAG systems where hallucination from bad ingestion is the #1 failure mode.


Why Markdown for RAG?

LLMs consume text. RAG pipelines retrieve text chunks and feed them to models as context. The quality of that context determines whether the model gives a good answer or hallucinates.

Markdown is the ideal intermediate format because:

  1. Structure is preserved — headings, tables, lists, bold text carry semantic meaning
  2. Chunking is natural — split on ## Heading boundaries for semantically coherent chunks
  3. LLMs understand it — every modern LLM is trained on Markdown. It’s their native structured format.
  4. Tables are readable — pipe tables (| A | B |) are parseable by both humans and models
  5. No HTML overhead — clean, lightweight, no rendering dependencies

The challenge: converting PDFs to good Markdown. Most tools produce text with wrong reading order, missed tables, lost headings, or garbled scanned pages. We benchmarked every major PDF-to-Markdown tool to find which ones actually produce usable output.

The ingestion pipeline

A production RAG pipeline needs more than just PDF-to-text conversion. Here’s the full flow:

PDF → Extract → Audit → Recover → Chunk → Embed → Index → Retrieve → Generate
      ^^^^^^^^^^^^^^^^^^^^^^^^
      pdfmux handles this part

Step 1: Extract with quality scoring

from pdfmux import process

result = process("annual-report.pdf", quality="standard")

# Per-page confidence tells you which pages to trust
print(f"Confidence: {result.confidence:.0%}")
print(f"Extractor: {result.extractor_used}")
print(f"Warnings: {result.warnings}")

pdfmux’s standard quality mode runs a self-healing extraction pipeline that:

  1. Fast-extracts every page with PyMuPDF (0.01s/page)
  2. Audits each page for quality (text density, encoding errors, image ratio)
  3. Re-extracts bad pages with OCR
  4. Detects and extracts tables via Docling overlay
  5. Injects headings via font-size analysis
  6. Returns confidence score per page
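The audit-and-retry loop above can be sketched in a few lines. Everything here is a hypothetical stand-in for pdfmux's internals, not its actual API: the extractors are passed in as callables, and the quality audit is a toy printable-character score.

```python
# Sketch of a self-healing extraction loop (hypothetical internals,
# not pdfmux's real API): fast-extract every page, audit the result,
# and re-extract low-scoring pages with a slower OCR path.

def audit_page(text: str) -> float:
    """Toy quality score: fraction of printable, non-replacement chars."""
    if not text:
        return 0.0
    good = sum(c.isprintable() and c != '\ufffd' for c in text)
    return good / len(text)

def extract_with_fallback(pages, fast_extract, ocr_extract, threshold=0.7):
    """Run the fast extractor, then retry bad pages with OCR."""
    results = []
    for page in pages:
        text = fast_extract(page)
        score = audit_page(text)
        if score < threshold:
            text = ocr_extract(page)       # slow path for scanned pages
            score = audit_page(text)
        results.append({'text': text, 'confidence': score})
    return results

# Demo with stand-in extractors: the "scanned" page comes back as
# replacement characters on the fast path, so OCR recovers it.
fast = lambda p: "\ufffd\ufffd" if "scanned" in p else p
ocr = lambda p: p.replace("image", "recovered text")
out = extract_with_fallback(["digital text", "scanned image"], fast, ocr)
print([r['confidence'] for r in out])
```

The key design point is that the audit runs twice, so even OCR output gets a confidence score rather than being trusted blindly.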

Step 2: Handle tables properly

Tables are the #1 source of RAG failures. A financial report table extracted as garbled text will produce wrong answers every time. (We cover three extraction methods in detail in how to extract tables from PDF in Python.)

# JSON output gives you structured table data
result = process("financial.pdf", output_format="json")
# Tables come as: [{headers: [...], rows: [[...]], page: 1}]

For Markdown output, tables are rendered as pipe tables:

| Revenue | Q1 2025 | Q2 2025 | Q3 2025 |
|---------|---------|---------|---------|
| Product A | $12.3M | $14.1M | $15.8M |
| Product B | $8.7M | $9.2M | $10.1M |

pdfmux scores 0.911 TEDS (table accuracy) on the opendataloader benchmark — matching Docling and the highest among free tools.
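If you work from the JSON output instead, converting the table dicts shown above into pipe tables is a few lines. This sketch assumes only the `{headers, rows}` shape described earlier:

```python
def table_to_markdown(table: dict) -> str:
    """Render a {headers, rows} table dict as a Markdown pipe table."""
    headers = table['headers']
    lines = [
        '| ' + ' | '.join(str(h) for h in headers) + ' |',
        '|' + '|'.join('---' for _ in headers) + '|',
    ]
    for row in table['rows']:
        lines.append('| ' + ' | '.join(str(cell) for cell in row) + ' |')
    return '\n'.join(lines)

table = {'headers': ['Revenue', 'Q1 2025'],
         'rows': [['Product A', '$12.3M']], 'page': 1}
print(table_to_markdown(table))
```

This is useful when you want the JSON for database insertion but still need Markdown-formatted tables inside your chunks.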

Step 3: Chunk by heading structure

The best chunking strategy for RAG is heading-based splitting. Each ## Section becomes a chunk with its content:

import re

def chunk_by_headings(markdown: str, max_chunk_size: int = 2000) -> list[dict]:
    """Split Markdown into chunks at heading boundaries."""
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []

    def emit(text, heading):
        chunks.append({'text': text, 'heading': heading, 'char_count': len(text)})

    for section in sections:
        section = section.strip()
        if len(section) < 10:
            continue

        # Extract heading as metadata
        lines = section.split('\n')
        heading = lines[0].lstrip('#').strip() if lines[0].startswith('#') else None

        # Split oversized sections on paragraph boundaries so no chunk
        # exceeds max_chunk_size
        current = ''
        for para in section.split('\n\n'):
            if current and len(current) + len(para) + 2 > max_chunk_size:
                emit(current, heading)
                current = para
            else:
                current = f'{current}\n\n{para}' if current else para
        if current:
            emit(current, heading)

    return chunks

result = process("report.pdf", quality="standard")
chunks = chunk_by_headings(result.text)

pdfmux’s heading detection (font-size analysis + bold promotion) ensures headings are correctly identified, giving you reliable chunk boundaries. It scores 0.852 MHS on the benchmark — the best heading detection of any engine, paid or free.
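A common follow-on step (general RAG practice, not a pdfmux feature) is to prefix each chunk's text with its heading before embedding, so retrieval still sees the section context when a chunk body doesn't repeat it:

```python
def contextualize(chunk: dict) -> str:
    """Prefix chunk text with its heading so the embedding carries
    section context. Chunks without headings pass through unchanged."""
    if chunk.get('heading'):
        return f"Section: {chunk['heading']}\n\n{chunk['text']}"
    return chunk['text']

chunk = {'text': 'Q3 revenue grew 12%.', 'heading': 'Revenue'}
print(contextualize(chunk))   # Section: Revenue, then the body
```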

Step 4: Filter low-confidence pages

This is where pdfmux’s quality scoring becomes critical for production:

from pdfmux import process

result = process("document.pdf", quality="standard", output_format="json")

# The JSON output includes per-page quality
# Use confidence to filter or flag pages
if result.confidence < 0.7:
    print(f"Warning: low confidence extraction ({result.confidence:.0%})")
    print(f"Consider manual review or using quality='high' (LLM extraction)")

Step 5: Batch processing

For ingesting large document collections:

from pdfmux import process_batch
from pathlib import Path

pdfs = list(Path("documents/").glob("*.pdf"))

for path, result in process_batch(pdfs, quality="standard", workers=4):
    if isinstance(result, Exception):
        print(f"Failed: {path}{result}")
        continue

    # Write to your vector store
    chunks = chunk_by_headings(result.text)
    for chunk in chunks:
        embed_and_index(chunk, source=str(path))
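The `embed_and_index` call above is left to your stack. As a minimal stand-in, here is a sketch with a toy deterministic hash-based embedding and an in-memory list as the "vector store"; swap in your real embedding model and index in production:

```python
import hashlib

INDEX: list[dict] = []   # stand-in for a real vector store

def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Deterministic toy embedding from a hash. Replace with a real
    embedding model (e.g. a sentence-transformer) in production."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dims]]

def embed_and_index(chunk: dict, source: str) -> None:
    """Embed a chunk and append it to the index with its metadata."""
    INDEX.append({
        'vector': toy_embed(chunk['text']),
        'text': chunk['text'],
        'heading': chunk.get('heading'),
        'source': source,
    })

embed_and_index({'text': 'Q3 revenue grew 12%.', 'heading': 'Revenue'},
                'report.pdf')
print(len(INDEX))
```

Keeping `source` and `heading` alongside the vector matters: at generation time you can cite the originating file and section rather than a bare chunk.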

Common pitfalls

1. Ignoring extraction quality. Most pipelines just dump text into the vector store without checking quality. pdfmux’s confidence scoring catches bad pages before they corrupt your index.

2. Flat text chunking. Splitting on character count (every 500 chars) breaks mid-sentence and mid-table. Always chunk on semantic boundaries (headings, paragraphs).

3. Losing table structure. Tables converted to plain text (“Revenue Q1 2025 $12.3M Q2 2025 $14.1M”) are unusable. Keep them as Markdown tables — LLMs can read pipe tables accurately.

4. No OCR fallback. 10% of pages in a typical document collection are scanned or image-heavy. Without OCR, those pages produce empty chunks. pdfmux handles this automatically with pip install pdfmux[ocr] — and it does it entirely on CPU, with no GPU or API keys required.

5. Single extractor for everything. PyMuPDF is great for digital text but terrible at tables. Docling is great at tables but slower on plain text. pdfmux routes each page to the best tool — see our head-to-head benchmark of 7 PDF extractors for why this routing approach wins.
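Pitfall 2 is easy to demonstrate concretely. This toy comparison (illustrative only) contrasts fixed-size splitting, whose cuts land mid-table, with splitting on blank-line boundaries:

```python
# Toy comparison of fixed-size vs boundary-aware chunking.
text = (
    "## Revenue\n\n"
    "| Product | Q1 |\n|---|---|\n| A | $12.3M |\n\n"
    "Growth was driven by Product A."
)
table = "| Product | Q1 |\n|---|---|\n| A | $12.3M |"

# Fixed-size chunks: cuts land wherever the counter says, mid-table included.
fixed = [text[i:i + 40] for i in range(0, len(text), 40)]

# Boundary-aware chunks: split on blank lines, so the table stays intact.
semantic = [block for block in text.split('\n\n') if block]

print(any(table in c for c in fixed))      # False -- the table was cut apart
print(any(table in b for b in semantic))   # True -- the table survived whole
```

No 40-character window can contain the 41-character table, so every fixed-size chunk carries a fragment that no retriever can answer from; the boundary-aware split keeps it whole.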

Production checklist

  • Install pdfmux with table and OCR support: pip install pdfmux[tables,ocr]
  • Use quality="standard" for document ingestion (not “fast”)
  • Check result.confidence — flag pages below 0.7 for review
  • Chunk by headings, not character count
  • Keep table structure in Markdown format
  • Use process_batch with workers for large collections
  • Monitor result.warnings for extraction issues
  • Test with your actual documents — benchmarks are averages

FAQ

What’s the best output format for RAG? Markdown. It preserves structure (headings, tables, lists) that LLMs understand natively. Use JSON output when you need structured table data for database insertion.

How many PDFs can pdfmux process per hour? In fast mode: ~10,000+ pages/hour (PyMuPDF speed). In standard mode with tables: ~1,000-3,000 pages/hour depending on table density. See our real-world benchmark across 1,422 pages of SEC filings and legal documents for actual throughput numbers.

Does pdfmux work with LangChain? Yes. The langchain-pdfmux package provides a PdfmuxLoader that integrates directly with LangChain’s document loading pipeline.

What about PDF images and charts? pdfmux extracts text and tables but doesn’t interpret images or charts. For image understanding, use quality="high" which routes to Gemini Flash for visual content extraction.


Last updated: March 2026