PDF to JSON for LLM pipelines: structured extraction patterns in 2026

TL;DR: PDF to JSON for LLM tool use, function calling, and structured outputs. 3-layer schema, bounding boxes, confidence scoring, and the pdfmux pattern.

Direct answer: Modern LLM pipelines consume JSON, not Markdown — every tool-use call, every function-calling response, every structured-output schema is JSON (see the Anthropic tool-use spec and the MCP specification). The right way to convert a PDF to JSON in 2026 is not to dump the whole document as one blob, but to produce a 3-layer object: { document: {...}, pages: [...], blocks: [...] }, where each block carries text, type (heading/table/paragraph/figure), bounding box, and a confidence score. That is exactly what pdfmux returns from pdfmux convert --format json, and the rest of this post walks through how to use it.

Why JSON, not Markdown, for LLM pipelines

A year ago the prevailing pattern was PDF → Markdown → chunk → embed. Markdown is human-readable, easy to chunk on headings, and every LLM parses it natively. We covered that pipeline in detail in PDF to Markdown for RAG pipelines, and it is still the right answer for naive RAG.

But the production AI ecosystem in 2026 has shifted. Most serious LLM applications now use one of three patterns that need JSON:

Tool use / function calling — Claude, GPT-4o, and Gemini all consume tool definitions in JSON Schema and return tool calls as JSON objects. If your extraction layer outputs Markdown, you write a glue layer to parse it back into JSON. Skip the round trip.
Structured outputs / response_format — The OpenAI response_format parameter and Anthropic’s tool-use loop both expect a JSON schema. The model is conditioned to emit valid JSON matching that schema. If your retrieval layer already returns JSON, the model has less work to do and hallucinates less.
Citations — When the user asks “where in the document does it say X?”, you need to point back to a specific page, paragraph, and ideally bounding box. Markdown destroys that information. JSON preserves it natively.

There is also a fourth, quieter reason: agents that read PDFs benefit from typed blocks. An agent looking for the total on an invoice should be able to filter for block.type === 'table' before parsing, instead of regex-grepping the entire document. Type information is structural; Markdown loses it.
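As a minimal sketch of that filter in Python, assume each block is a dict with a `type` field and that table blocks carry a `rows` list of lists (the shape described in the next section). The helper names `table_blocks` and `find_total_row` and the sample invoice data are illustrative only; they are not part of pdfmux.

```python
from typing import Optional


def table_blocks(blocks: list[dict]) -> list[dict]:
    """Keep only the blocks typed as tables."""
    return [b for b in blocks if b.get("type") == "table"]


def find_total_row(blocks: list[dict]) -> Optional[list[str]]:
    """Return the first table row whose first cell is 'total' (case-insensitive)."""
    for block in table_blocks(blocks):
        for row in block.get("rows", []):
            if row and row[0].strip().lower() == "total":
                return row
    return None


# Hypothetical invoice blocks, for illustration only.
sample_blocks = [
    {"id": "p1-b0", "type": "paragraph", "text": "Total due within 30 days."},
    {"id": "p1-b1", "type": "table", "rows": [
        ["Item", "Amount"],
        ["Widgets", "$120.00"],
        ["Total", "$120.00"],
    ]},
]

total_row = find_total_row(sample_blocks)
```

The paragraph that happens to contain the word "Total" is never inspected, which is the point: the type filter narrows the search before any string matching happens.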

The 3-layer JSON schema

Every PDF-to-JSON output we have seen converges on roughly the same shape. Here is the canonical version, as emitted by pdfmux convert --format json:

{
  "document": {
    "source": "annual-report.pdf",
    "page_count": 84,
    "languages": ["en"],
    "overall_confidence": 0.92,
    "extractor_versions": {
      "pdfmux": "1.6.2",
      "pymupdf": "1.24.5",
      "docling": "2.4.0"
    },
    "warnings": [
      "Page 73 used OCR fallback; verify table on row 4."
    ]
  },
  "pages": [
    {
      "page": 1,
      "width": 612,
      "height": 792,
      "confidence": 0.98,
      "method": "pymupdf",
      "language": "en"
    }
  ],
  "blocks": [
    {
      "id": "p1-b0",
      "page": 1,
      "type": "heading",
      "level": 1,
      "text": "FY2025 Annual Report",
      "bbox": [72, 72, 540, 110],
      "confidence": 0.99
    },
    {
      "id": "p1-b1",
      "page": 1,
      "type": "paragraph",
      "text": "Net revenue grew 14% year over year to $4.2 billion...",
      "bbox": [72, 130, 540, 280],
      "confidence": 0.97
    },
    {
      "id": "p3-b4",
      "page": 3,
      "type": "table",
      "rows": [
        ["Region", "Revenue", "YoY"],
        ["North America", "$2.1B", "+12%"],
        ["EMEA", "$1.3B", "+18%"],
        ["APAC", "$0.8B", "+15%"]
      ],
      "bbox": [72, 200, 540, 380],
      "confidence": 0.94
    }
  ]
}

Three layers, each doing a distinct job:

document — Whole-document metadata. Use this for cache keys, audit logs, and confidence-based routing decisions.
pages — Per-page metadata. Used for citation rendering (rebuild the page rectangle) and for filtering low-confidence pages out of the retrieval index.
blocks — The actual content. Each block has a stable id, a type, the text or table data, a bounding box, and its own confidence score.

The block-level confidence score is the lever you operate on. Index high-confidence blocks aggressively, drop low-confidence blocks below a threshold, and either re-extract or human-review the borderline ones. See self-healing PDF extraction for the per-page version of this argument; the block-level version is just a finer-grained instance of the same idea.
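One way to turn that lever into code is a three-way split. This is a sketch, not a pdfmux feature: the function name `triage_blocks` is hypothetical, the 0.85 indexing threshold matches the one used elsewhere in this post, and the 0.50 drop threshold is an assumption you would tune on your own documents.

```python
def triage_blocks(blocks, index_at=0.85, drop_below=0.50):
    """Split blocks into index / review / drop buckets by confidence score."""
    buckets = {"index": [], "review": [], "drop": []}
    for block in blocks:
        # A block with no score is treated as zero confidence.
        score = block.get("confidence", 0.0)
        if score >= index_at:
            buckets["index"].append(block)
        elif score >= drop_below:
            # Borderline: candidates for re-extraction or human review.
            buckets["review"].append(block)
        else:
            buckets["drop"].append(block)
    return buckets


# Hypothetical blocks, for illustration only.
sample = [
    {"id": "p1-b0", "confidence": 0.99},
    {"id": "p1-b1", "confidence": 0.70},
    {"id": "p2-b0", "confidence": 0.30},
    {"id": "p2-b1"},
]
buckets = triage_blocks(sample)
```

Keeping the review bucket separate from the drop bucket means borderline blocks stay recoverable instead of silently disappearing from the index.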

Generating this JSON from a PDF

The shortest path is the pdfmux CLI:

pip install pdfmux
pdfmux convert annual-report.pdf --format json > extracted.json

That writes the full 3-layer structure above. From Python:

import pandas as pd
from pdfmux import process

result = process("annual-report.pdf", output="json")

# Filter to high-confidence text blocks for the retrieval index
indexable = [
    b for b in result["blocks"]
    if b["type"] in {"heading", "paragraph"} and b["confidence"] >= 0.85
]

# Pull every table as a DataFrame, treating the first row as the header
tables = [
    pd.DataFrame(b["rows"][1:], columns=b["rows"][0])
    for b in result["blocks"]
    if b["type"] == "table"
]

From Node, call the same binary via subprocess — the full pattern is in PDF extraction with Node.js.

For why pdfmux specifically produces this output well (vs PyMuPDF directly, marker, or docling), see the 200-PDF benchmark — short version: 0.903 overall, #2 globally, #1 among free tools, and the only one with built-in confidence scoring per page.

Pattern 1: feeding the JSON to a tool-use loop

This is where the JSON pays off most directly. You define a tool that takes a block_id and returns the block content. The model decides which blocks to read based on the document metadata and the user’s question. No retrieval-step embedding required for small documents — let the model navigate the JSON directly.

import json

import anthropic
from pdfmux import process

doc = process("contract.pdf", output="json")

# Compact view: just block ids, types, and the first 80 chars of text.
# Table blocks carry "rows" instead of "text", so they get a placeholder.
toc = [
    {"id": b["id"], "type": b["type"], "preview": b["text"][:80] if "text" in b else "[table]"}
    for b in doc["blocks"]
]

tools = [{
    "name": "read_block",
    "description": "Read the full content of a block by id.",
    "input_schema": {
        "type": "object",
        "properties": {"block_id": {"type": "string"}},
        "required": ["block_id"],
    },
}]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    tools=tools,
    messages=[{
        "role": "user",
        # Serialize the TOC as real JSON; interpolating the list directly
        # would emit a Python repr with single quotes.
        "content": (
            "Document table of contents (JSON): "
            f"{json.dumps(toc)}\n\n"
            "Find the indemnification clause and quote it verbatim."
        ),
    }],
)

The model gets a tiny table of contents, picks the blocks it wants to read, and the tool fetches each block’s full text on demand. For a 100-page contract, the model sees maybe 4 KB of structure up-front and reads 10 KB of selected blocks — vs. dumping the whole 400 KB document into context.
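The request above only starts the conversation; something still has to answer the model's read_block calls. Here is a sketch of that side, working on plain dicts so it runs without a network call. The helper names (`build_block_index`, `run_read_block`, `tool_results_for`) are hypothetical, and the sample document is made up. With the SDK's response objects you would read the same fields by attribute access rather than dict lookup.

```python
import json


def build_block_index(doc: dict) -> dict:
    """Map block id -> block for constant-time lookup."""
    return {b["id"]: b for b in doc["blocks"]}


def run_read_block(index: dict, tool_input: dict) -> str:
    """Execute one read_block call and return the block content as a string."""
    block = index.get(tool_input.get("block_id"))
    if block is None:
        return "Unknown block_id."
    if "rows" in block:
        # Tables are returned as JSON so the row/column structure survives.
        return json.dumps(block["rows"])
    return block.get("text", "")


def tool_results_for(content_blocks: list[dict], index: dict) -> list[dict]:
    """Build one tool_result entry per read_block tool_use in an assistant turn."""
    results = []
    for item in content_blocks:
        if item.get("type") == "tool_use" and item.get("name") == "read_block":
            results.append({
                "type": "tool_result",
                "tool_use_id": item["id"],
                "content": run_read_block(index, item["input"]),
            })
    return results


# Hypothetical document and assistant turn, for illustration only.
sample_doc = {"blocks": [
    {"id": "p7-b2", "type": "paragraph", "text": "Each party shall indemnify the other."},
    {"id": "p9-b0", "type": "table", "rows": [["Fee", "Amount"], ["Setup", "$500"]]},
]}
assistant_content = [
    {"type": "text", "text": "Let me read the clause."},
    {"type": "tool_use", "id": "toolu_01", "name": "read_block", "input": {"block_id": "p7-b2"}},
]
results = tool_results_for(assistant_content, build_block_index(sample_doc))
```

The results list goes back to the model as the content of the next user message, and the loop repeats for as long as the response's stop reason is tool_use.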

This pattern is in production at several agent companies and is the basis of the pdfmux MCP server’s design — see pdfmux MCP server for Claude, Cursor, Windsurf for the user-facing version.

Pattern 2: structured outputs from a PDF

The other common pattern is “read this PDF and emit JSON matching this schema.” Invoices, lab reports, financial statements, contracts with predictable clauses. The model’s job is to populate a schema, not to write prose.

from pdfmux import process

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "vendor_name": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "total": {"type": "number"},
                },
                "required": ["description", "total"],
            },
        },
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

doc = process("invoice-q1-2026.pdf", output="json")
# Pull just the blocks the model needs to see
relevant = [b for b in doc["blocks"] if b["type"] in {"table", "paragraph", "heading"}]

Feed relevant and schema to the model with a tool-use call constrained to that schema. The model returns a valid JSON object you can validate, store, and audit.
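A sketch of what "constrained to that schema" can look like with the Anthropic Messages API: the schema becomes the tool's input_schema, and tool_choice forces the model to call that tool rather than answer in prose. The tool name `record_invoice`, the helper names, and the prompt wording are assumptions for illustration. The model id is left as a parameter for the caller to supply.

```python
import json


def build_extraction_request(schema, blocks, model, tool_name="record_invoice"):
    """Assemble keyword arguments for a Messages API call that must return `schema`."""
    return {
        "model": model,
        "max_tokens": 2048,
        "tools": [{
            "name": tool_name,
            "description": "Record the fields extracted from the invoice.",
            "input_schema": schema,
        }],
        # Force the model to call this tool instead of replying in prose.
        "tool_choice": {"type": "tool", "name": tool_name},
        "messages": [{
            "role": "user",
            "content": "Extract the invoice fields from these blocks (JSON):\n"
                       + json.dumps(blocks),
        }],
    }


def missing_required(schema, payload):
    """Return required top-level keys absent from payload.

    This is a shallow check, not full JSON Schema validation.
    """
    return [key for key in schema.get("required", []) if key not in payload]


# Hypothetical, trimmed-down schema and blocks, for illustration only.
demo_schema = {
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}, "total": {"type": "number"}},
    "required": ["invoice_number", "total"],
}
demo_blocks = [{"id": "p1-b0", "type": "heading", "text": "Invoice INV-001"}]
request = build_extraction_request(demo_schema, demo_blocks, model="your-model-id")
```

You would pass the result as client.messages.create(**request) and read the extracted object from the input field of the tool_use block in the response. Running `missing_required` on that object is a cheap first gate before a full schema validator.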

The reason the pdfmux JSON layer matters here: if you feed the model raw Markdown, it has to parse the document, identify the table, and read columns left-to-right before it can emit JSON. That is three steps where it can fail. If you feed it the JSON with the table already structured as rows and columns, it has one job: match rows to schema fields. In our experience the failure rate drops noticeably. We have not benchmarked this rigorously, but the pattern is consistent across customer deployments.

For the deep dive on invoice schemas specifically, see PDF invoice extraction in Python, which uses an identical structured-output approach.

Pattern 3: citation-aware retrieval

The third pattern is the one that justifies the bounding box data. When the user asks “where does the contract say I can terminate for convenience?”, you need to:

Retrieve the answer text.
Tell the UI which page and rectangle to highlight.
Give the user a clickable citation back to the source PDF.

You cannot do step 2 from Markdown. You need the bbox field from the block JSON.

# Citation-aware retrieval. `retrieve` and `generate_answer` stand in for
# your own embedding search and LLM call; they are passed in as callables.
def answer_with_citation(question, doc_json, retrieve, generate_answer):
    # 1. Embed each block, find best match
    matches = retrieve(question, doc_json["blocks"], top_k=3)

    # 2. Generate answer with citations as block ids
    answer = generate_answer(question, matches)

    # 3. Map block ids to page + bbox for the UI to render.
    # Look pages up by number rather than list position, in case the
    # pages array has been filtered or reordered.
    pages = {p["page"]: p for p in doc_json["pages"]}
    citations = [
        {
            "block_id": m["id"],
            "page": m["page"],
            "bbox": m["bbox"],
            "page_width": pages[m["page"]]["width"],
            "page_height": pages[m["page"]]["height"],
        }
        for m in matches
    ]
    return {"answer": answer, "citations": citations}

The frontend renders the source PDF page, overlays a yellow rectangle at the bbox coordinates, and the user sees exactly where the answer came from. This is the difference between “trust me, the AI said so” and an auditable system. For regulated industries, it is non-negotiable.
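For the overlay itself, the frontend usually wants coordinates relative to the rendered page, not raw PDF units. The sketch below converts a bbox into fractions of the page size. It assumes the bbox is [x0, y0, x1, y1] in the same units as the page width and height, with the origin at the top-left corner; check that against your extractor's output before relying on it. The function name `bbox_to_overlay` is hypothetical.

```python
def bbox_to_overlay(bbox, page_width, page_height):
    """Convert [x0, y0, x1, y1] into fractional left/top/width/height.

    Assumes a top-left origin and that bbox shares units with the page size.
    """
    x0, y0, x1, y1 = bbox
    return {
        "left": x0 / page_width,
        "top": y0 / page_height,
        "width": (x1 - x0) / page_width,
        "height": (y1 - y0) / page_height,
    }


# The table block from the example schema, on a 612 x 792 page.
overlay = bbox_to_overlay([72, 200, 540, 380], 612, 792)
```

Multiplying each fraction by 100 gives CSS percentages, so the highlight stays aligned at any zoom level.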

When NOT to use this 3-layer JSON

A few cases where Markdown is still the right intermediate:

Tiny documents, single-shot summarization. A 2-page memo going into a single LLM call — dump it as Markdown, done. The JSON overhead is wasted.
Naive RAG with semantic chunking on heading boundaries. If your pipeline is chunk on H2 → embed → top-k retrieve → stuff into context, Markdown is more compact and the JSON metadata is unused.
Pure summarization with no citations needed. If the deliverable is “write a one-page summary of this report” and the user does not need to verify, Markdown is faster end-to-end.

The flip happens the moment you add tools, structured outputs, or citations. Once any of those three is in the system, the JSON layer pays for itself.

A note on token cost

A common worry: “won’t JSON be 2-3x more tokens than Markdown?” Yes — the verbose version is. But you almost never feed the entire JSON to the model. You feed:

The document block (one paragraph, maybe 200 tokens)
A compact block table-of-contents (id + type + 80-char preview = ~30 tokens per block)
Only the blocks the model selects via tools, on demand

For a 100-page document with ~500 blocks, that is 200 + (30 × 500) = ~15,000 tokens of TOC, and the model typically reads 5-15 blocks of full content = ~3,000 additional tokens. Total ~18,000 tokens vs. ~70,000 tokens if you dump the whole Markdown document.
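The same arithmetic can be parameterised to find where the two approaches cross over. This is a back-of-envelope sketch: the per-page figures (700 Markdown tokens and 5 blocks per page) are simply the 100-page example divided by 100, and holding the on-demand read cost constant at 3,000 tokens regardless of document length is a simplifying assumption.

```python
def navigated_json_tokens(blocks, doc_tokens=200, toc_tokens_per_block=30,
                          read_tokens=3000):
    """Tokens for: document summary + compact TOC + blocks read on demand."""
    return doc_tokens + toc_tokens_per_block * blocks + read_tokens


def break_even_pages(markdown_tokens_per_page, blocks_per_page, doc_tokens=200,
                     toc_tokens_per_block=30, read_tokens=3000):
    """Page count above which navigated JSON costs fewer tokens than full Markdown.

    Returns None when the TOC alone costs as much per page as the Markdown.
    """
    saving_per_page = markdown_tokens_per_page - toc_tokens_per_block * blocks_per_page
    if saving_per_page <= 0:
        return None
    return (doc_tokens + read_tokens) / saving_per_page


total = navigated_json_tokens(500)   # the 100-page, 500-block example
pages = break_even_pages(700, 5)     # per-page figures derived from that example
```

Under these assumptions the crossover lands at roughly 6 pages, comfortably inside a 20-page rule of thumb. Denser layouts with more blocks per page push it higher.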

The JSON pattern is cheaper, not more expensive, for any document above 20 pages. This is why we recommend it for production. For more on cost-aware extraction pipeline design, see PDF data extraction for AI agents.

The shortest path

If you are designing a new pipeline today:

# 1. Extract once, write JSON
pdfmux convert document.pdf --format json > document.json

# 2. From your app — Python
python -c "
import json
doc = json.load(open('document.json'))
# Index blocks where confidence >= 0.85
# Feed the document.warnings + pages summary to the model
# Use tool-use to let the model navigate blocks on demand
"

Two steps. The first replaces a custom Python script. The second replaces ad-hoc Markdown chunking. The result is a pipeline that supports tool use, structured outputs, and citations without writing any of those layers from scratch.

For the wider design context — when to extract synchronously vs. queue, how to cache extractions, how to handle reprocessing when the extractor improves — see PDF data extraction for AI agents and PDF extraction for RAG pipelines.