TL;DR: AI agents can’t read PDFs natively. MCP (Model Context Protocol) fixes this by letting agents call tools through a standard interface. pdfmux ships with a built-in MCP server that gives any agent self-healing PDF extraction — confidence scoring, automatic OCR, table detection — in one command. pip install "pdfmux[serve]" and you’re done.
The problem nobody talks about
Your AI agent is smart. It can write code, search the web, query databases. But hand it a PDF and it’s blind.
This isn’t a minor limitation. PDFs are everywhere in business — contracts, invoices, research papers, financial reports, compliance documents. If your agent can’t read them, it can’t do half the work you need it to.
The typical workaround? Copy-paste text into the prompt. Upload the file to a web UI. Run a separate script and feed the output back. All manual. All fragile. All defeating the purpose of having an agent.
There are now 33 PDF-related MCP servers listed on mcp.so alone. Most of them wrap pdf-parse (Node.js) or basic PyMuPDF text extraction. They’ll work fine on a clean, digital PDF. But hand them a scanned contract, a table-heavy financial report, or a mixed document with digital and scanned pages? They return garbage — and your agent doesn’t even know it.
The missing piece isn’t extraction. It’s knowing whether the extraction actually worked. (For context on why most extractors fail silently, see our benchmark of every PDF-to-Markdown tool.)
What is MCP and why should you care?
Model Context Protocol (MCP) is an open standard created by Anthropic in November 2024 for connecting AI models to external tools. Think of it as USB-C for AI — one standard interface that works everywhere.
Before MCP, every integration was custom. OpenAI has function calling, Anthropic has tool use, LangChain has its own tool abstraction. Same concept, different schemas, different APIs. Switching providers meant rewriting every integration.
MCP changes this. You build a tool server once, and it works with any MCP-compatible client:
| Client | Status |
|---|---|
| Claude Desktop | Native support |
| Claude Code (CLI) | Native support |
| Cursor | Native support |
| Windsurf | Native support |
| Cline (VS Code) | Native support |
| Continue (VS Code) | Native support |
| Zed Editor | Native support |
| OpenAI Agents SDK | Via MCP adapter |
The ecosystem is growing fast. mcp.so tracks over 18,000 MCP servers — databases, APIs, file systems, browsers, and yes, PDF processors.
How MCP works (30-second version)
AI Agent (Client)
↓ JSON-RPC over stdio
MCP Server
↓ calls local libraries
Your tools, files, APIs
The agent discovers available tools at startup, sees their names and parameter schemas, and calls them during conversation. No API keys passed through the model. No data leaving your machine (unless the tool explicitly does that). The server runs locally as a subprocess.
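On the wire, each tool invocation is a single JSON-RPC 2.0 message written to the server's stdin. A minimal sketch of what a client sends for a `convert_pdf` call — the `tools/call` method and params shape follow the MCP spec, while the `id` and argument values here are illustrative:

```python
import json

# One JSON-RPC 2.0 request, as an MCP client would write it to the
# server's stdin over the stdio transport.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "convert_pdf",
        "arguments": {"file_path": "/path/to/document.pdf", "format": "markdown"},
    },
}
print(json.dumps(request))
```

The server replies with a matching-`id` response on stdout, which is how the agent correlates results with calls.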
Three primitives:
- Tools: Functions the agent can call (like convert_pdf)
- Resources: Data the agent can read (like file contents)
- Prompts: Reusable prompt templates
For PDF processing, tools are what matter.
The PDF MCP landscape: what’s out there
I reviewed the 33 PDF MCP servers on mcp.so. Here’s the breakdown:
| Category | Count | What they do |
|---|---|---|
| Basic text extraction | ~15 | Wrap pdf-parse or PyMuPDF, return raw text |
| PDF manipulation | ~8 | Merge, split, rotate, encrypt PDFs |
| PDF generation | ~5 | Convert markdown/HTML to PDF |
| Academic/RAG | ~3 | Paper search, semantic indexing |
| OCR-capable | ~2 | Actual OCR for scanned documents |
Most PDF MCP servers fall into the “basic text extraction” bucket. They solve the easy case — the roughly 90% of PDFs that are clean and digital, where PyMuPDF extracts text perfectly in about 10 milliseconds.
The problem is the other 10%. Scanned documents. Table-heavy reports. Mixed PDFs with digital and scanned pages. Forms with embedded images. These are the PDFs that actually matter in business workflows, and basic extraction fails silently on them.
“Fails silently” is the key phrase. The server returns text (or empty text), the agent uses it, and nobody knows the extraction was garbage until the agent produces wrong answers.
What pdfmux’s MCP server does differently
pdfmux doesn’t just extract text. It audits every page using a self-healing extraction pipeline and tells the agent whether to trust the result.
The 4 tools
When an MCP client connects to pdfmux, it discovers four tools:
1. convert_pdf — Full extraction with quality verification
{
"file_path": "/path/to/document.pdf",
"format": "markdown",
"quality": "standard"
}
Returns extracted text plus a metadata header when confidence is below 80% or warnings exist. The agent sees exactly which pages had issues and can act on it — ask the user for clarification, flag the document for human review, or retry with higher quality.
2. analyze_pdf — Quick triage without full extraction
{
"file_path": "/path/to/document.pdf"
}
Returns page count, document type (digital/scanned/mixed), per-page quality breakdown, and estimated extraction difficulty. Takes milliseconds. Use this when the agent needs to decide whether to process a document before committing to full extraction.
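This triage result is what lets an agent (or an orchestration script) pick an extraction strategy cheaply. A hypothetical sketch — the field names `doc_type` and `confidence` are illustrative, not pdfmux's exact response schema:

```python
def choose_quality(analysis: dict) -> str:
    """Pick an extraction quality tier from a triage result.

    `analysis` is assumed to look roughly like:
    {"doc_type": "scanned", "confidence": 0.71, "page_count": 12}
    """
    if analysis["doc_type"] == "digital" and analysis["confidence"] >= 0.9:
        return "standard"  # fast path: the text layer is trustworthy
    return "high"          # scanned/mixed or low confidence: pay for OCR

print(choose_quality({"doc_type": "scanned", "confidence": 0.71, "page_count": 12}))  # high
```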
3. batch_convert — Process entire directories
{
"directory": "/path/to/documents/",
"quality": "standard"
}
Processes all PDFs in a directory with 4 concurrent workers. Returns per-file results with confidence scores. Useful for knowledge base ingestion or bulk document processing.
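The fan-out pattern behind a batch tool can be sketched in a few lines. This is illustrative only — pdfmux handles the concurrency internally, and `convert_one` here is a stand-in for the real pipeline:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_one(pdf: Path) -> tuple[str, float]:
    """Stand-in for per-file conversion; returns (filename, confidence)."""
    # A real implementation would run the extraction pipeline here.
    return (pdf.name, 0.95)

def batch_convert(directory: str, workers: int = 4) -> list[tuple[str, float]]:
    """Convert every PDF in a directory with a small worker pool."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, pdfs))
```

A thread pool (rather than processes) fits here because the heavy lifting in extraction libraries typically releases the GIL or shells out.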
4. extract_structured — Tables and key-value pairs
{
"file_path": "/path/to/invoice.pdf",
"schema": "invoice",
"quality": "standard"
}
Returns structured data: tables as JSON (headers + rows) using the same engine described in our PDF table extraction guide, key-value pairs with automatic normalization (dates to ISO 8601, amounts with currency and direction, rates with period detection), and optional schema-guided extraction using fuzzy matching (0.6 threshold).
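To make the normalization step concrete, here is a toy version of date coercion — a simplified sketch, not pdfmux's actual normalizer, and the format list is illustrative:

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Coerce common invoice date spellings to ISO 8601 (sketch only)."""
    for fmt in ("%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return raw  # leave anything unrecognized untouched

print(normalize_date("February 15, 2026"))  # 2026-02-15
```

The same try-formats-in-order idea extends to amounts (currency symbol, sign) and rates (per month vs. per year), as described above.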
The confidence scoring difference
Here’s what a typical PDF MCP server returns:
Page 1: [extracted text]
Page 2: [extracted text]
Page 3: [empty]
Page 4: Amoun Dscriptin $450 Consltng
The agent sees text. It uses it. Pages 3 and 4 are broken but there’s no signal.
Here’s what pdfmux returns:
Document confidence: 0.87 (87%)
Warnings: Pages 3, 4 had low quality — re-extracted with OCR
Page 1: good 0.98
Page 2: good 0.96
Page 3: bad → OCR'd 0.91
Page 4: bad → OCR'd 0.87
The agent knows the overall confidence. It knows which pages were problematic. It knows they were re-extracted. It can make informed decisions.
What happens under the hood
When the agent calls convert_pdf with quality: "standard":
- Fast extract — PyMuPDF pulls the text layer of every page (~10ms per page on digital PDFs)
- Audit — 5 quality checks per page: character density, alphabetic ratio, word structure, whitespace sanity, encoding quality (mojibake detection)
- Classify — Each page marked as good, bad, or empty based on scores
- Region OCR — For “bad” pages (some text + images), surgical OCR on image regions only — preserving existing good text
- Full OCR — For “empty” pages, full-page OCR via RapidOCR (PaddleOCR v4, CPU-only, ~200MB)
- LLM fallback — Pages still broken after OCR? Gemini 2.5 Flash vision extraction (if API key configured)
- Merge — Combine good pages + fixed pages in document order
90% of PDFs are fully digital. For those, step 1 is all that runs — zero overhead, 10ms per page. You only pay for OCR on pages that actually need it. The entire pipeline runs on CPU without a GPU or API keys.
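The classification step can be pictured with two of the audit heuristics. A minimal sketch — the thresholds here are illustrative, not pdfmux's actual values:

```python
def alphabetic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0

def classify_page(text: str) -> str:
    """Label a page good/bad/empty using crude, fast heuristics."""
    if len(text.strip()) < 10:
        return "empty"  # no usable text layer -> full-page OCR
    if alphabetic_ratio(text) < 0.5:
        return "bad"    # likely mojibake or garbage -> region OCR
    return "good"

print(classify_page("The quick brown fox jumps over the lazy dog."))  # good
```

Because these checks are pure string arithmetic, auditing adds negligible cost to the fast path — the expensive OCR steps only fire for pages that fail them.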
Setup: 5 minutes to production
Step 1: Install pdfmux with MCP support
pip install "pdfmux[serve]"
This installs pdfmux + the MCP protocol library. Base install handles 90% of PDFs. Want OCR for scanned documents?
pip install "pdfmux[serve,ocr]" # adds RapidOCR (~200MB, CPU-only)
pip install "pdfmux[serve,tables]" # adds Docling for 97.9% table accuracy
pip install "pdfmux[serve,all]" # everything including Gemini Flash
No errors if you skip the optional extras. pdfmux falls back gracefully — if Docling isn’t installed and you hit a table-heavy PDF, PyMuPDF does its best and the confidence score reflects the quality honestly.
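That kind of graceful degradation is usually just a capability probe at import time. A sketch of the pattern (illustrative, not pdfmux's source; the fallback label is made up):

```python
import importlib.util

# Probe for the optional table engine without importing it eagerly.
HAVE_DOCLING = importlib.util.find_spec("docling") is not None

def table_engine() -> str:
    """Report which table-extraction path a run would take."""
    return "docling" if HAVE_DOCLING else "pymupdf-fallback"

print(table_engine())
```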
Step 2: Configure your AI client
Claude Desktop — edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS; on Windows the file lives at %APPDATA%\Claude\claude_desktop_config.json):
{
"mcpServers": {
"pdfmux": {
"command": "pdfmux",
"args": ["serve"],
"env": {
"PDFMUX_ALLOWED_DIRS": "/Users/you/Documents"
}
}
}
}
Cursor — add to MCP settings:
{
"mcpServers": {
"pdfmux": {
"command": "pdfmux",
"args": ["serve"],
"env": {
"PDFMUX_ALLOWED_DIRS": "/Users/you/projects"
}
}
}
}
Claude Code (CLI) — create .mcp.json in your project root:
{
"mcpServers": {
"pdfmux": {
"command": "/path/to/venv/bin/pdfmux",
"args": ["serve"],
"env": {
"PDFMUX_ALLOWED_DIRS": "/Users/you"
}
}
}
}
Step 3: Restart your client
MCP configs are loaded at startup. Restart Claude/Cursor/your editor and you’ll see the pdfmux tools available.
Step 4: Verify
Ask your agent: “Analyze this PDF: /path/to/any/file.pdf”
The agent will call analyze_pdf and return the document classification, page count, and quality assessment. If that works, you’re set.
Security: what can the server access?
The PDFMUX_ALLOWED_DIRS environment variable controls which directories pdfmux can read from. This is critical — without it, your AI agent could read any file on your system.
# Single directory
PDFMUX_ALLOWED_DIRS="/Users/you/Documents"
# Multiple directories (colon-separated)
PDFMUX_ALLOWED_DIRS="/Users/you/Documents:/Users/you/Downloads"
Every tool call checks _is_path_allowed() — file paths are resolved to absolute paths and verified against the allowed directories. Symlink tricks don’t work; the resolved path must be inside an allowed directory.
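A simplified version of that containment check looks like this — a sketch of the idea, not pdfmux's exact code:

```python
from pathlib import Path

def is_path_allowed(file_path: str, allowed_dirs: list[str]) -> bool:
    """Resolve symlinks and '..' first, then check directory containment.

    Simplified sketch of the check; pdfmux's real helper is the
    internal _is_path_allowed().
    """
    resolved = Path(file_path).resolve()  # collapses ".." and follows symlinks
    return any(
        resolved.is_relative_to(Path(d).resolve()) for d in allowed_dirs
    )

print(is_path_allowed("/Users/you/Documents/../.ssh/key", ["/Users/you/Documents"]))  # False
```

The order matters: resolving before comparing is what defeats `../` traversal and symlink tricks, because the check runs on the real filesystem location rather than the path as written.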
Additional limits (configurable via environment variables):
- PDFMUX_MAX_FILE_SIZE_MB: default 500 MB
- PDFMUX_MAX_PAGES: default 10,000 pages
- PDFMUX_TIMEOUT: default 300 seconds
All processing happens locally. No data leaves your machine unless you’ve configured the Gemini Flash LLM extractor (optional, requires explicit API key setup).
Real-world workflow examples
1. Research paper analysis
You: "Read this paper and summarize the methodology"
Agent: [calls convert_pdf → gets markdown with 0.96 confidence]
Agent: "The paper presents a three-phase approach..."
2. Invoice data extraction
You: "Extract the line items from this invoice"
Agent: [calls extract_structured → gets tables + key-values]
Agent: "Found 12 line items totaling $4,250. Invoice date: 2026-02-15..."
3. Bulk document ingestion
You: "Process all PDFs in /docs/ and flag any with quality issues"
Agent: [calls batch_convert → processes 47 files, 4 concurrent]
Agent: "47 PDFs processed. 43 at >90% confidence. 4 flagged:
- scan_2019.pdf (67% — scanned, no OCR installed)
- form_arabic.pdf (72% — mixed language OCR artifacts)
..."
4. Contract review with quality gate
You: "Review this contract for termination clauses"
Agent: [calls analyze_pdf first → sees 0.71 confidence, scanned doc]
Agent: "This document is scanned and my extraction confidence is 71%.
I can re-process with high quality for better results.
Should I proceed?"
You: "Yes"
Agent: [calls convert_pdf with quality: "high" → 0.93 confidence]
Agent: "Found 3 termination clauses: Section 8.1 states..."
The agent uses analyze_pdf as a cheap preflight check (milliseconds, no full extraction) and only commits to expensive OCR when needed. This is the kind of intelligent behavior you get when the agent has quality metadata to work with.
Comparison: pdfmux MCP vs. alternatives
| Feature | Basic PDF MCP servers | LlamaParse (cloud) | pdfmux MCP |
|---|---|---|---|
| Digital PDF extraction | Yes | Yes | Yes |
| Scanned PDF / OCR | No | Yes | Yes (RapidOCR / Surya / Gemini) |
| Table extraction | No | Yes | Yes (Docling, 97.9% accuracy) |
| Confidence scoring | No | No | Yes (per-page, 5 quality checks) |
| Self-healing re-extraction | No | No | Yes (auto-OCR on bad pages) |
| Structured data output | No | Yes | Yes (tables + KV + schema mapping) |
| Quick triage tool | No | No | Yes (analyze_pdf) |
| Runs locally | Yes | No (cloud API) | Yes |
| Cost | Free | $0.003/page | Free (base) |
| Setup time | ~5 min | ~5 min | ~5 min |
Troubleshooting
Tools not appearing after config?
MCP configs load at client startup. Restart Claude/Cursor. If using a virtualenv, use the full path to the pdfmux binary (e.g., /path/to/venv/bin/pdfmux).
“Path not allowed” errors?
Check PDFMUX_ALLOWED_DIRS. One of the allowed directories must be an ancestor of the file you’re trying to read.
Empty pages on scanned PDFs?
Install OCR support: pip install "pdfmux[ocr]". The confidence score will reflect unrecovered pages with a warning.
Slow on large documents?
For documents >50 pages, pdfmux automatically targets table extraction to only table-candidate pages instead of processing the whole document through Docling.
Want to verify the server works standalone?
echo '{"jsonrpc":"2.0","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"id":1}' | pdfmux serve
If you get a JSON response with server capabilities, the server is working.
Try it
pip install "pdfmux[serve]"
pdfmux serve # starts MCP server on stdio
Add it to your Claude/Cursor/editor config, restart, and your agent can read any PDF — with confidence scores that tell it when to trust the result and when to flag it for human review.
- GitHub — source, docs, examples
- PyPI — pip install pdfmux
- pdfmux.com — documentation
MIT licensed. Runs locally. No API keys needed for the base install.
Built by Nameet Potnis. Contributions welcome.