Direct answer: Convert PDFs to Markdown for RAG using pdfmux: pip install pdfmux && pdfmux convert document.pdf. It outputs clean Markdown with tables, headings, and per-page confidence scores. The confidence scoring tells your pipeline which pages to trust and which to flag for review — critical for production RAG systems where hallucination from bad ingestion is the #1 failure mode.


Why Markdown for RAG?

LLMs consume text. RAG pipelines retrieve text chunks and feed them to models as context. The quality of that context determines whether the model gives a good answer or hallucinates.

Markdown is the ideal intermediate format because:

  1. Structure is preserved — headings, tables, lists, bold text carry semantic meaning
  2. Chunking is natural — split on ## Heading boundaries for semantically coherent chunks
  3. LLMs understand it — every modern LLM is trained on Markdown. It’s their native structured format.
  4. Tables are readable — pipe tables (| A | B |) are parseable by both humans and models
  5. No HTML overhead — clean, lightweight, no rendering dependencies

The challenge: converting PDFs to good Markdown. Most tools produce text with wrong reading order, missed tables, lost headings, or garbled scanned pages. We benchmarked every major PDF-to-Markdown tool to find which ones actually produce usable output.

The ingestion pipeline

A production RAG pipeline needs more than just PDF-to-text conversion. Here’s the full flow:

PDF → Extract → Audit → Recover → Chunk → Embed → Index → Retrieve → Generate
      ^^^^^^^^^^^^^^^^^^^^^^^^
      pdfmux handles this part

Step 1: Extract with quality scoring

from pdfmux import process

result = process("annual-report.pdf", quality="standard")

# Per-page confidence tells you which pages to trust
print(f"Confidence: {result.confidence:.0%}")
print(f"Extractor: {result.extractor_used}")
print(f"Warnings: {result.warnings}")

pdfmux’s standard quality mode runs a self-healing extraction pipeline that:

  1. Fast-extracts every page with PyMuPDF (0.01s/page)
  2. Audits each page for quality (text density, encoding errors, image ratio)
  3. Re-extracts bad pages with OCR
  4. Detects and extracts tables via Docling overlay
  5. Injects headings via font-size analysis
  6. Returns confidence score per page
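The audit-and-retry loop above can be sketched in a few lines. Everything here is a hypothetical stand-in for pdfmux's internals, not its actual API: the extractors are passed in as callables, and the quality audit is a toy printable-character score.

```python
# Sketch of a self-healing extraction loop (hypothetical internals,
# not pdfmux's real API): fast-extract every page, audit the result,
# and re-extract low-scoring pages with a slower OCR path.

def audit_page(text: str) -> float:
    """Toy quality score: fraction of printable, non-replacement chars."""
    if not text:
        return 0.0
    good = sum(c.isprintable() and c != '\ufffd' for c in text)
    return good / len(text)

def extract_with_fallback(pages, fast_extract, ocr_extract, threshold=0.7):
    """Run the fast extractor, then retry bad pages with OCR."""
    results = []
    for page in pages:
        text = fast_extract(page)
        score = audit_page(text)
        if score < threshold:
            text = ocr_extract(page)       # slow path for scanned pages
            score = audit_page(text)
        results.append({'text': text, 'confidence': score})
    return results

# Demo with stand-in extractors: the "scanned" page comes back as
# replacement characters on the fast path, so OCR recovers it.
fast = lambda p: "\ufffd\ufffd" if "scanned" in p else p
ocr = lambda p: p.replace("image", "recovered text")
out = extract_with_fallback(["digital text", "scanned image"], fast, ocr)
print([r['confidence'] for r in out])
```

The key design point is that the audit runs twice, so even OCR output gets a confidence score rather than being trusted blindly.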

Step 2: Handle tables properly

Tables are the #1 source of RAG failures. A financial report table extracted as garbled text will produce wrong answers every time. (We cover three extraction methods in detail in how to extract tables from PDF in Python.)

# JSON output gives you structured table data
result = process("financial.pdf", output_format="json")
# Tables come as: [{headers: [...], rows: [[...]], page: 1}]

For Markdown output, tables are rendered as pipe tables:

| Revenue | Q1 2025 | Q2 2025 | Q3 2025 |
|---------|---------|---------|---------|
| Product A | $12.3M | $14.1M | $15.8M |
| Product B | $8.7M | $9.2M | $10.1M |

pdfmux scores 0.911 TEDS (table accuracy) on the opendataloader benchmark — matching Docling and the highest among free tools.
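If you work from the JSON output instead, converting the table dicts shown above into pipe tables is a few lines. This sketch assumes only the `{headers, rows}` shape described earlier:

```python
def table_to_markdown(table: dict) -> str:
    """Render a {headers, rows} table dict as a Markdown pipe table."""
    headers = table['headers']
    lines = [
        '| ' + ' | '.join(str(h) for h in headers) + ' |',
        '|' + '|'.join('---' for _ in headers) + '|',
    ]
    for row in table['rows']:
        lines.append('| ' + ' | '.join(str(cell) for cell in row) + ' |')
    return '\n'.join(lines)

table = {'headers': ['Revenue', 'Q1 2025'],
         'rows': [['Product A', '$12.3M']], 'page': 1}
print(table_to_markdown(table))
```

This is useful when you want the JSON for database insertion but still need Markdown-formatted tables inside your chunks.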

Step 3: Chunk by heading structure

The best chunking strategy for RAG is heading-based splitting. Each ## Section becomes a chunk with its content:

import re

def chunk_by_headings(markdown: str, max_chunk_size: int = 2000) -> list[dict]:
    """Split Markdown into chunks at heading boundaries."""
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []

    def emit(text, heading):
        chunks.append({'text': text, 'heading': heading, 'char_count': len(text)})

    for section in sections:
        section = section.strip()
        if len(section) < 10:
            continue

        # Extract heading as metadata
        lines = section.split('\n')
        heading = lines[0].lstrip('#').strip() if lines[0].startswith('#') else None

        # Split oversized sections on paragraph boundaries so no chunk
        # exceeds max_chunk_size
        current = ''
        for para in section.split('\n\n'):
            if current and len(current) + len(para) + 2 > max_chunk_size:
                emit(current, heading)
                current = para
            else:
                current = f'{current}\n\n{para}' if current else para
        if current:
            emit(current, heading)

    return chunks

result = process("report.pdf", quality="standard")
chunks = chunk_by_headings(result.text)

pdfmux’s heading detection (font-size analysis + bold promotion) ensures headings are correctly identified, giving you reliable chunk boundaries. It scores 0.852 MHS on the benchmark — the best heading detection of any engine, paid or free.
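A common follow-on step (general RAG practice, not a pdfmux feature) is to prefix each chunk's text with its heading before embedding, so retrieval still sees the section context when a chunk body doesn't repeat it:

```python
def contextualize(chunk: dict) -> str:
    """Prefix chunk text with its heading so the embedding carries
    section context. Chunks without headings pass through unchanged."""
    if chunk.get('heading'):
        return f"Section: {chunk['heading']}\n\n{chunk['text']}"
    return chunk['text']

chunk = {'text': 'Q3 revenue grew 12%.', 'heading': 'Revenue'}
print(contextualize(chunk))   # Section: Revenue, then the body
```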

Step 4: Filter low-confidence pages

This is where pdfmux’s quality scoring becomes critical for production:

from pdfmux import process

result = process("document.pdf", quality="standard", output_format="json")

# The JSON output includes per-page quality
# Use confidence to filter or flag pages
if result.confidence < 0.7:
    print(f"Warning: low confidence extraction ({result.confidence:.0%})")
    print(f"Consider manual review or using quality='high' (LLM extraction)")

Step 5: Batch processing

For ingesting large document collections:

from pdfmux import process_batch
from pathlib import Path

pdfs = list(Path("documents/").glob("*.pdf"))

for path, result in process_batch(pdfs, quality="standard", workers=4):
    if isinstance(result, Exception):
        print(f"Failed: {path}{result}")
        continue

    # Write to your vector store
    chunks = chunk_by_headings(result.text)
    for chunk in chunks:
        embed_and_index(chunk, source=str(path))
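The `embed_and_index` call above is left to your stack. As a minimal stand-in, here is a sketch with a toy deterministic hash-based embedding and an in-memory list as the "vector store"; swap in your real embedding model and index in production:

```python
import hashlib

INDEX: list[dict] = []   # stand-in for a real vector store

def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Deterministic toy embedding from a hash. Replace with a real
    embedding model (e.g. a sentence-transformer) in production."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dims]]

def embed_and_index(chunk: dict, source: str) -> None:
    """Embed a chunk and append it to the index with its metadata."""
    INDEX.append({
        'vector': toy_embed(chunk['text']),
        'text': chunk['text'],
        'heading': chunk.get('heading'),
        'source': source,
    })

embed_and_index({'text': 'Q3 revenue grew 12%.', 'heading': 'Revenue'},
                'report.pdf')
print(len(INDEX))
```

Keeping `source` and `heading` alongside the vector matters: at generation time you can cite the originating file and section rather than a bare chunk.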

Common pitfalls

1. Ignoring extraction quality. Most pipelines just dump text into the vector store without checking quality. pdfmux’s confidence scoring catches bad pages before they corrupt your index.

2. Flat text chunking. Splitting on character count (every 500 chars) breaks mid-sentence and mid-table. Always chunk on semantic boundaries (headings, paragraphs).

3. Losing table structure. Tables converted to plain text (“Revenue Q1 2025 $12.3M Q2 2025 $14.1M”) are unusable. Keep them as Markdown tables — LLMs can read pipe tables accurately.

4. No OCR fallback. 10% of pages in a typical document collection are scanned or image-heavy. Without OCR, those pages produce empty chunks. pdfmux handles this automatically with pip install pdfmux[ocr] — and it does it entirely on CPU, with no GPU or API keys required.

5. Single extractor for everything. PyMuPDF is great for digital text but terrible at tables. Docling is great at tables but slower on plain text. pdfmux routes each page to the best tool — see our head-to-head benchmark of 7 PDF extractors for why this routing approach wins.
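Pitfall 2 is easy to demonstrate concretely. This toy comparison (illustrative only) contrasts fixed-size splitting, whose cuts land mid-table, with splitting on blank-line boundaries:

```python
# Toy comparison of fixed-size vs boundary-aware chunking.
text = (
    "## Revenue\n\n"
    "| Product | Q1 |\n|---|---|\n| A | $12.3M |\n\n"
    "Growth was driven by Product A."
)
table = "| Product | Q1 |\n|---|---|\n| A | $12.3M |"

# Fixed-size chunks: cuts land wherever the counter says, mid-table included.
fixed = [text[i:i + 40] for i in range(0, len(text), 40)]

# Boundary-aware chunks: split on blank lines, so the table stays intact.
semantic = [block for block in text.split('\n\n') if block]

print(any(table in c for c in fixed))      # False -- the table was cut apart
print(any(table in b for b in semantic))   # True -- the table survived whole
```

No 40-character window can contain the 41-character table, so every fixed-size chunk carries a fragment that no retriever can answer from; the boundary-aware split keeps it whole.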

Production checklist

  • Install pdfmux with table and OCR support: pip install pdfmux[tables,ocr]
  • Use quality="standard" for document ingestion (not “fast”)
  • Check result.confidence — flag pages below 0.7 for review
  • Chunk by headings, not character count
  • Keep table structure in Markdown format
  • Use process_batch with workers for large collections
  • Monitor result.warnings for extraction issues
  • Test with your actual documents — benchmarks are averages

FAQ

What’s the best output format for RAG? Markdown. It preserves structure (headings, tables, lists) that LLMs understand natively. Use JSON output when you need structured table data for database insertion.

How many PDFs can pdfmux process per hour? In fast mode: ~10,000+ pages/hour (PyMuPDF speed). In standard mode with tables: ~1,000-3,000 pages/hour depending on table density. See our real-world benchmark across 1,422 pages of SEC filings and legal documents for actual throughput numbers.

Does pdfmux work with LangChain? Yes. The langchain-pdfmux package provides a PdfmuxLoader that integrates directly with LangChain’s document loading pipeline.

What about PDF images and charts? pdfmux extracts text and tables but doesn’t interpret images or charts. For image understanding, use quality="high" which routes to Gemini Flash for visual content extraction.


Last updated: March 2026