TL;DR: AI agents can’t read PDFs natively. MCP (Model Context Protocol) fixes this by letting agents call tools through a standard interface. pdfmux ships with a built-in MCP server that gives any agent self-healing PDF extraction — confidence scoring, automatic OCR, table detection — in one command. pip install "pdfmux[serve]" and you’re done.
The problem nobody talks about
Your AI agent is smart. It can write code, search the web, query databases. But hand it a PDF and it’s blind.
This isn’t a minor limitation. PDFs are everywhere in business — contracts, invoices, research papers, financial reports, compliance documents. If your agent can’t read them, it can’t do half the work you need it to.
The typical workaround? Copy-paste text into the prompt. Upload the file to a web UI. Run a separate script and feed the output back. All manual. All fragile. All defeating the purpose of having an agent.
There are now 33 PDF-related MCP servers listed on mcp.so alone. Most of them wrap pdf-parse (Node.js) or basic PyMuPDF text extraction. They’ll work fine on a clean, digital PDF. But hand them a scanned contract, a table-heavy financial report, or a mixed document with digital and scanned pages? They return garbage — and your agent doesn’t even know it.
The missing piece isn’t extraction. It’s knowing whether the extraction actually worked. (For context on why most extractors fail silently, see our benchmark of every PDF-to-Markdown tool.)
What is MCP and why should you care?
Model Context Protocol (MCP) is an open standard created by Anthropic in November 2024 for connecting AI models to external tools. Think of it as USB-C for AI — one standard interface that works everywhere.
Before MCP, every integration was custom. OpenAI has function calling, Anthropic has tool use, LangChain has its own tool abstraction. Same concept, different schemas, different APIs. Switching providers meant rewriting every integration.
MCP changes this. You build a tool server once, and it works with any MCP-compatible client:
| Client | Status |
|---|---|
| Claude Desktop | Native support |
| Claude Code (CLI) | Native support |
| Cursor | Native support |
| Windsurf | Native support |
| Cline (VS Code) | Native support |
| Continue (VS Code) | Native support |
| Zed Editor | Native support |
| OpenAI Agents SDK | Via MCP adapter |
The ecosystem is growing fast. mcp.so tracks over 18,000 MCP servers — databases, APIs, file systems, browsers, and yes, PDF processors.
How MCP works (30-second version)
AI Agent (Client)
↓ JSON-RPC over stdio
MCP Server
↓ calls local libraries
Your tools, files, APIs
The agent discovers available tools at startup, sees their names and parameter schemas, and calls them during conversation. No API keys passed through the model. No data leaving your machine (unless the tool explicitly does that). The server runs locally as a subprocess.
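On the wire, each tool invocation is a single JSON-RPC 2.0 message written to the server's stdin. A minimal sketch of what a client sends for a `convert_pdf` call — the `tools/call` method and params shape follow the MCP spec, while the `id` and argument values here are illustrative:

```python
import json

# One JSON-RPC 2.0 request, as an MCP client would write it to the
# server's stdin over the stdio transport.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "convert_pdf",
        "arguments": {"file_path": "/path/to/document.pdf", "format": "markdown"},
    },
}
print(json.dumps(request))
```

The server replies with a matching-`id` response on stdout, which is how the agent correlates results with calls.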
Three primitives:
- Tools: Functions the agent can call (like convert_pdf)
- Resources: Data the agent can read (like file contents)
- Prompts: Reusable prompt templates
For PDF processing, tools are what matter.
The PDF MCP landscape: what’s out there
I reviewed the 33 PDF MCP servers on mcp.so. Here’s the breakdown:
| Category | Count | What they do |
|---|---|---|
| Basic text extraction | ~15 | Wrap pdf-parse or PyMuPDF, return raw text |
| PDF manipulation | ~8 | Merge, split, rotate, encrypt PDFs |
| PDF generation | ~5 | Convert markdown/HTML to PDF |
| Academic/RAG | ~3 | Paper search, semantic indexing |
| OCR-capable | ~2 | Actual OCR for scanned documents |
Most PDF MCP servers fall into the “basic text extraction” bucket. They solve the easy case — the roughly 90% of PDFs that are clean and digital, where PyMuPDF extracts text perfectly in about 10 milliseconds.
The problem is the other 10%. Scanned documents. Table-heavy reports. Mixed PDFs with digital and scanned pages. Forms with embedded images. These are the PDFs that actually matter in business workflows, and basic extraction fails silently on them.
“Fails silently” is the key phrase. The server returns text (or empty text), the agent uses it, and nobody knows the extraction was garbage until the agent produces wrong answers.
What pdfmux’s MCP server does differently
pdfmux doesn’t just extract text. It audits every page using a self-healing extraction pipeline and tells the agent whether to trust the result.
The 4 tools
When an MCP client connects to pdfmux, it discovers four tools:
1. convert_pdf — Full extraction with quality verification
{
"file_path": "/path/to/document.pdf",
"format": "markdown",
"quality": "standard"
}
Returns extracted text plus a metadata header when confidence is below 80% or warnings exist. The agent sees exactly which pages had issues and can act on it — ask the user for clarification, flag the document for human review, or retry with higher quality.
2. analyze_pdf — Quick triage without full extraction
{
"file_path": "/path/to/document.pdf"
}
Returns page count, document type (digital/scanned/mixed), per-page quality breakdown, and estimated extraction difficulty. Takes milliseconds. Use this when the agent needs to decide whether to process a document before committing to full extraction.
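This triage result is what lets an agent (or an orchestration script) pick an extraction strategy cheaply. A hypothetical sketch — the field names `doc_type` and `confidence` are illustrative, not pdfmux's exact response schema:

```python
def choose_quality(analysis: dict) -> str:
    """Pick an extraction quality tier from a triage result.

    `analysis` is assumed to look roughly like:
    {"doc_type": "scanned", "confidence": 0.71, "page_count": 12}
    """
    if analysis["doc_type"] == "digital" and analysis["confidence"] >= 0.9:
        return "standard"  # fast path: the text layer is trustworthy
    return "high"          # scanned/mixed or low confidence: pay for OCR

print(choose_quality({"doc_type": "scanned", "confidence": 0.71, "page_count": 12}))  # high
```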
3. batch_convert — Process entire directories
{
"directory": "/path/to/documents/",
"quality": "standard"
}
Processes all PDFs in a directory with 4 concurrent workers. Returns per-file results with confidence scores. Useful for knowledge base ingestion or bulk document processing.
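The fan-out pattern behind a batch tool can be sketched in a few lines. This is illustrative only — pdfmux handles the concurrency internally, and `convert_one` here is a stand-in for the real pipeline:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_one(pdf: Path) -> tuple[str, float]:
    """Stand-in for per-file conversion; returns (filename, confidence)."""
    # A real implementation would run the extraction pipeline here.
    return (pdf.name, 0.95)

def batch_convert(directory: str, workers: int = 4) -> list[tuple[str, float]]:
    """Convert every PDF in a directory with a small worker pool."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, pdfs))
```

A thread pool (rather than processes) fits here because the heavy lifting in extraction libraries typically releases the GIL or shells out.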
4. extract_structured — Tables and key-value pairs
{
"file_path": "/path/to/invoice.pdf",
"schema": "invoice",
"quality": "standard"
}
Returns structured data: tables as JSON (headers + rows) using the same engine described in our PDF table extraction guide, key-value pairs with automatic normalization (dates to ISO 8601, amounts with currency and direction, rates with period detection), and optional schema-guided extraction using fuzzy matching (0.6 threshold).
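To make the normalization step concrete, here is a toy version of date coercion — a simplified sketch, not pdfmux's actual normalizer, and the format list is illustrative:

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Coerce common invoice date spellings to ISO 8601 (sketch only)."""
    for fmt in ("%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return raw  # leave anything unrecognized untouched

print(normalize_date("February 15, 2026"))  # 2026-02-15
```

The same try-formats-in-order idea extends to amounts (currency symbol, sign) and rates (per month vs. per year), as described above.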
The confidence scoring difference
Here’s what a typical PDF MCP server returns:
Page 1: [extracted text]
Page 2: [extracted text]
Page 3: [empty]
Page 4: Amoun Dscriptin $450 Consltng
The agent sees text. It uses it. Pages 3 and 4 are broken but there’s no signal.
Here’s what pdfmux returns:
Document confidence: 0.87 (87%)
Warnings: Pages 3, 4 had low quality — re-extracted with OCR
Page 1: good 0.98
Page 2: good 0.96
Page 3: bad → OCR'd 0.91
Page 4: bad → OCR'd 0.87
The agent knows the overall confidence. It knows which pages were problematic. It knows they were re-extracted. It can make informed decisions.
What happens under the hood
When the agent calls convert_pdf with quality: "standard":
- Fast extract — PyMuPDF pulls the text layer of every page (~10ms per page on digital PDFs)
- Audit — 5 quality checks per page: character density, alphabetic ratio, word structure, whitespace sanity, encoding quality (mojibake detection)
- Classify — Each page marked as good, bad, or empty based on scores
- Region OCR — For “bad” pages (some text + images), surgical OCR on image regions only — preserving existing good text
- Full OCR — For “empty” pages, full-page OCR via RapidOCR (PaddleOCR v4, CPU-only, ~200MB)
- LLM fallback — Pages still broken after OCR? Gemini 2.5 Flash vision extraction (if API key configured)
- Merge — Combine good pages + fixed pages in document order
90% of PDFs are fully digital. For those, step 1 is all that runs — zero overhead, 10ms per page. You only pay for OCR on pages that actually need it. The entire pipeline runs on CPU without a GPU or API keys.
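The classification step can be pictured with two of the audit heuristics. A minimal sketch — the thresholds here are illustrative, not pdfmux's actual values:

```python
def alphabetic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0

def classify_page(text: str) -> str:
    """Label a page good/bad/empty using crude, fast heuristics."""
    if len(text.strip()) < 10:
        return "empty"  # no usable text layer -> full-page OCR
    if alphabetic_ratio(text) < 0.5:
        return "bad"    # likely mojibake or garbage -> region OCR
    return "good"

print(classify_page("The quick brown fox jumps over the lazy dog."))  # good
```

Because these checks are pure string arithmetic, auditing adds negligible cost to the fast path — the expensive OCR steps only fire for pages that fail them.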
Setup: 5 minutes to production
Step 1: Install pdfmux with MCP support
pip install "pdfmux[serve]"
This installs pdfmux + the MCP protocol library. Base install handles 90% of PDFs. Want OCR for scanned documents?
pip install "pdfmux[serve,ocr]" # adds RapidOCR (~200MB, CPU-only)
pip install "pdfmux[serve,tables]" # adds Docling for 97.9% table accuracy
pip install "pdfmux[serve,all]" # everything including Gemini Flash
No errors if you skip the optional extras. pdfmux falls back gracefully — if Docling isn’t installed and you hit a table-heavy PDF, PyMuPDF does its best and the confidence score reflects the quality honestly.
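That kind of graceful degradation is usually just a capability probe at import time. A sketch of the pattern (illustrative, not pdfmux's source; the fallback label is made up):

```python
import importlib.util

# Probe for the optional table engine without importing it eagerly.
HAVE_DOCLING = importlib.util.find_spec("docling") is not None

def table_engine() -> str:
    """Report which table-extraction path a run would take."""
    return "docling" if HAVE_DOCLING else "pymupdf-fallback"

print(table_engine())
```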
Step 2: Configure your AI client
Claude Desktop — edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS; on Windows the file lives at %APPDATA%\Claude\claude_desktop_config.json):
{
"mcpServers": {
"pdfmux": {
"command": "pdfmux",
"args": ["serve"],
"env": {
"PDFMUX_ALLOWED_DIRS": "/Users/you/Documents"
}
}
}
}
Cursor — add to MCP settings:
{
"mcpServers": {
"pdfmux": {
"command": "pdfmux",
"args": ["serve"],
"env": {
"PDFMUX_ALLOWED_DIRS": "/Users/you/projects"
}
}
}
}
Claude Code (CLI) — create .mcp.json in your project root:
{
"mcpServers": {
"pdfmux": {
"command": "/path/to/venv/bin/pdfmux",
"args": ["serve"],
"env": {
"PDFMUX_ALLOWED_DIRS": "/Users/you"
}
}
}
}
Step 3: Restart your client
MCP configs are loaded at startup. Restart Claude/Cursor/your editor and you’ll see the pdfmux tools available.
Step 4: Verify
Ask your agent: “Analyze this PDF: /path/to/any/file.pdf”
The agent will call analyze_pdf and return the document classification, page count, and quality assessment. If that works, you’re set.
Security: what can the server access?
The PDFMUX_ALLOWED_DIRS environment variable controls which directories pdfmux can read from. This is critical — without it, your AI agent could read any file on your system.
# Single directory
PDFMUX_ALLOWED_DIRS="/Users/you/Documents"
# Multiple directories (colon-separated)
PDFMUX_ALLOWED_DIRS="/Users/you/Documents:/Users/you/Downloads"
Every tool call checks _is_path_allowed() — file paths are resolved to absolute paths and verified against the allowed directories. Symlink tricks don’t work; the resolved path must be inside an allowed directory.
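A simplified version of that containment check looks like this — a sketch of the idea, not pdfmux's exact code:

```python
from pathlib import Path

def is_path_allowed(file_path: str, allowed_dirs: list[str]) -> bool:
    """Resolve symlinks and '..' first, then check directory containment.

    Simplified sketch of the check; pdfmux's real helper is the
    internal _is_path_allowed().
    """
    resolved = Path(file_path).resolve()  # collapses ".." and follows symlinks
    return any(
        resolved.is_relative_to(Path(d).resolve()) for d in allowed_dirs
    )

print(is_path_allowed("/Users/you/Documents/../.ssh/key", ["/Users/you/Documents"]))  # False
```

The order matters: resolving before comparing is what defeats `../` traversal and symlink tricks, because the check runs on the real filesystem location rather than the path as written.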
Additional limits (configurable via environment variables):
- PDFMUX_MAX_FILE_SIZE_MB: default 500 MB
- PDFMUX_MAX_PAGES: default 10,000 pages
- PDFMUX_TIMEOUT: default 300 seconds
All processing happens locally. No data leaves your machine unless you’ve configured the Gemini Flash LLM extractor (optional, requires explicit API key setup).
Real-world workflow examples
1. Research paper analysis
You: "Read this paper and summarize the methodology"
Agent: [calls convert_pdf → gets markdown with 0.96 confidence]
Agent: "The paper presents a three-phase approach..."
2. Invoice data extraction
You: "Extract the line items from this invoice"
Agent: [calls extract_structured → gets tables + key-values]
Agent: "Found 12 line items totaling $4,250. Invoice date: 2026-02-15..."
3. Bulk document ingestion
You: "Process all PDFs in /docs/ and flag any with quality issues"
Agent: [calls batch_convert → processes 47 files, 4 concurrent]
Agent: "47 PDFs processed. 43 at >90% confidence. 4 flagged:
- scan_2019.pdf (67% — scanned, no OCR installed)
- form_arabic.pdf (72% — mixed language OCR artifacts)
..."
4. Contract review with quality gate
You: "Review this contract for termination clauses"
Agent: [calls analyze_pdf first → sees 0.71 confidence, scanned doc]
Agent: "This document is scanned and my extraction confidence is 71%.
I can re-process with high quality for better results.
Should I proceed?"
You: "Yes"
Agent: [calls convert_pdf with quality: "high" → 0.93 confidence]
Agent: "Found 3 termination clauses: Section 8.1 states..."
The agent uses analyze_pdf as a cheap preflight check (milliseconds, no full extraction) and only commits to expensive OCR when needed. This is the kind of intelligent behavior you get when the agent has quality metadata to work with.
Comparison: pdfmux MCP vs. alternatives
| Feature | Basic PDF MCP servers | LlamaParse (cloud) | pdfmux MCP |
|---|---|---|---|
| Digital PDF extraction | Yes | Yes | Yes |
| Scanned PDF / OCR | No | Yes | Yes (RapidOCR / Surya / Gemini) |
| Table extraction | No | Yes | Yes (Docling, 97.9% accuracy) |
| Confidence scoring | No | No | Yes (per-page, 5 quality checks) |
| Self-healing re-extraction | No | No | Yes (auto-OCR on bad pages) |
| Structured data output | No | Yes | Yes (tables + KV + schema mapping) |
| Quick triage tool | No | No | Yes (analyze_pdf) |
| Runs locally | Yes | No (cloud API) | Yes |
| Cost | Free | $0.003/page | Free (base) |
| Setup time | ~5 min | ~5 min | ~5 min |
Troubleshooting
Tools not appearing after config?
MCP configs load at client startup. Restart Claude/Cursor. If using a virtualenv, use the full path to the pdfmux binary (e.g., /path/to/venv/bin/pdfmux).
“Path not allowed” errors?
Check PDFMUX_ALLOWED_DIRS. One of the allowed directories must be an ancestor of the file you’re trying to read.
Empty pages on scanned PDFs?
Install OCR support: pip install "pdfmux[ocr]". The confidence score will reflect unrecovered pages with a warning.
Slow on large documents?
For documents >50 pages, pdfmux automatically targets table extraction to only table-candidate pages instead of processing the whole document through Docling.
Want to verify the server works standalone?
echo '{"jsonrpc":"2.0","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"id":1}' | pdfmux serve
If you get a JSON response with server capabilities, the server is working.
Try it
pip install "pdfmux[serve]"
pdfmux serve # starts MCP server on stdio
Add it to your Claude/Cursor/editor config, restart, and your agent can read any PDF — with confidence scores that tell it when to trust the result and when to flag it for human review.
- GitHub — source, docs, examples
- PyPI — pip install pdfmux
- pdfmux.com — documentation
MIT licensed. Runs locally. No API keys needed for the base install.
Built by Nameet Potnis. Contributions welcome.