PDF extraction with Node.js in 2026: libraries, benchmarks, and the pdfmux MCP path

TL;DR: Extract text, tables, and structured data from PDFs in Node.js. Library comparison (pdf-parse, unpdf, pdf2json), benchmarks, code, and the pdfmux MCP path.

Direct answer: In 2026, Node has three credible PDF extractors — pdf-parse, unpdf, and pdf2json — and all three score below 0.70 on opendataloader-bench because none handle reading order, tables, or OCR. The pdfmux MCP server (npx -y pdfmux) gets you a 0.903 benchmark score from a Node app without leaving Node. The rest of this guide explains when each Node library is fine, when it is dangerous, and how to call pdfmux from Node when you need higher-quality output.

The state of PDF extraction in Node.js

Most Node libraries that claim to “extract PDFs” wrap one of two things: the PDF.js parser from Mozilla (used by every browser-side PDF reader), or a bespoke stream parser that walks the PDF byte structure directly. Neither approach handles the hard parts of extraction — reading order on multi-column pages, table cell reconstruction, OCR for scanned pages, or heading detection.

That is fine for the easy case. If you have a single-column, digital, English-language PDF and you just want a text dump, pdf-parse does the job in under ten lines of code. The trouble starts the moment a document has tables, two columns, a scanned page, or a non-Latin script. The text comes back in the wrong order, the table becomes a soup of numbers, and your downstream LLM hallucinates a plausible-looking answer from the corrupt input.

This guide covers what each Node library actually does, where it breaks, and the cleanest way to escape the Node ecosystem when you need higher-quality output without rewriting your service in Python.

The three libraries that matter

1. pdf-parse (the default choice)

pdf-parse is the most-downloaded PDF extractor on npm — roughly 600,000 weekly downloads as of May 2026. It is a thin wrapper around the PDF.js parser, returns a single string of concatenated text, and has zero native dependencies.

import fs from 'node:fs/promises';
import pdfParse from 'pdf-parse';

const buffer = await fs.readFile('./contract.pdf');
const data = await pdfParse(buffer);

console.log(data.text);        // Full text as one big string
console.log(data.numpages);    // Page count
console.log(data.info.Title);  // PDF metadata

Where it works: Single-column English documents, blog posts saved as PDF, simple reports. Extraction is fast — about 0.05 seconds per page on a modern Mac.

Where it breaks: Multi-column layouts return text in the wrong reading order (left column line 1, right column line 1, left column line 2 — interleaved). Tables become a flat space-separated dump. Scanned PDFs return an empty string with no error.

The last failure mode is the most dangerous one. A scanned contract goes in, data.text comes back as "", the calling code assumes the document is blank, and the user thinks the upload worked. No exception, no log, no warning. We wrote about exactly this class of bug in our post "When our own PDF extractor failed silently".

2. unpdf (modern, edge-runtime safe)

unpdf is a 2024 fork of PDF.js maintained by the unjs collective. It removes the Node.js-specific dependencies that prevented PDF.js from running on Cloudflare Workers, Vercel Edge, and Deno Deploy. If you are building serverless, this is the library you reach for.

import { extractText } from 'unpdf';

const pdf = await fetch('https://example.com/doc.pdf').then(r => r.arrayBuffer());
const { text, totalPages } = await extractText(new Uint8Array(pdf), {
  mergePages: true,
});

unpdf also exposes a getMeta() helper for metadata extraction and a renderPageAsImage() helper to rasterize a page, which is useful if you want to send images to a multimodal model like Claude or GPT-4o.

Where it works: Edge runtimes (Workers, Lambda@Edge, Vercel Edge, Deno Deploy). Anywhere you cannot run a Python sidecar or install native dependencies.

Where it breaks: Same reading-order and table problems as pdf-parse — they share the PDF.js core. OCR is not included; you would have to pipe rasterized pages through a separate vision model.

3. pdf2json (legacy, but parses forms)

pdf2json is the oldest of the three. Its claim to fame is that it returns a structured object representation of every text run with x/y coordinates, which lets you reconstruct layout if you write the geometry code yourself. It is also one of the only Node libraries that parses AcroForm field values.

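import PDFParser from 'pdf2json';

const parser = new PDFParser();
// Without an error listener, a corrupt or missing file fails silently.
parser.on('pdfParser_dataError', (err) => {
  console.error('pdf2json failed:', err.parserError);
});
parser.on('pdfParser_dataReady', (pdfData) => {
  // Collect the form fields from every page.
  const formFields = pdfData.Pages.flatMap(p => p.Fields || []);
  console.log(formFields);
});
parser.loadPDF('./application-form.pdf');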

Where it works: PDF form data extraction (the only Node library that does this without paying for a commercial SDK). Use cases involving tax forms, government applications, signed contracts where the field metadata matters.

Where it breaks: Slow — about 0.3 seconds per page, 6x slower than pdf-parse. Output is verbose nested JSON, not text. Tables and headings: same story as the others, you write the geometry code yourself or you do not get them.

For a focused guide on extracting form fields specifically, see PDF form data extraction in Python — the patterns translate directly to pdf2json even though the syntax differs.

Benchmark: how Node libraries compare

We ran the three Node libraries against pdfmux on a 50-document subset of opendataloader-bench, focusing on the three metrics that matter for downstream LLM use:

| Library | Reading order (NID) | Tables (TEDS) | Headings (MHS) | Overall | Speed (s/page) |
|---|---|---|---|---|---|
| pdfmux (via MCP) | 0.920 | 0.911 | 0.847 | 0.903 | 0.05-0.5 |
| pdf-parse | 0.680 | 0.180 | 0.310 | 0.485 | 0.05 |
| unpdf | 0.685 | 0.180 | 0.310 | 0.488 | 0.06 |
| pdf2json | 0.640 | 0.220 | 0.290 | 0.488 | 0.30 |

The Node libraries all cluster around 0.49 because they share the same basic approach: extract text runs, sort by y-coordinate, concatenate. None of them does column detection, table cell merging, or font-size heading inference. The gap between 0.49 and 0.903 is what that layout analysis is worth, and it is the work pdfmux layers on top of pymupdf4llm.
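To make the "sort by y-coordinate, concatenate" failure concrete, here is a minimal sketch on synthetic text runs. The `{ x, y, str }` run shape, the gutter position, and both function names are assumptions made for illustration. They are not the internals of pdf-parse, unpdf, or pdf2json.

```javascript
// Synthetic text runs for a two-column page: x is the left edge, y is the
// distance from the top of the page. The shape is an assumption for this sketch.
const runs = [
  { x: 50, y: 100, str: 'Left line 1' },
  { x: 50, y: 120, str: 'Left line 2' },
  { x: 320, y: 100, str: 'Right line 1' },
  { x: 320, y: 120, str: 'Right line 2' },
];

// Naive approach: sort top-to-bottom, then left-to-right, and concatenate.
function naiveReadingOrder(items) {
  return [...items]
    .sort((a, b) => a.y - b.y || a.x - b.x)
    .map((r) => r.str);
}

// Column-aware approach (hypothetical helper): split at a known gutter x,
// read the whole left column first, then the whole right column.
function columnAwareReadingOrder(items, gutterX) {
  const byY = (a, b) => a.y - b.y;
  const left = items.filter((r) => r.x < gutterX).sort(byY);
  const right = items.filter((r) => r.x >= gutterX).sort(byY);
  return [...left, ...right].map((r) => r.str);
}

console.log(naiveReadingOrder(runs).join(' | '));
console.log(columnAwareReadingOrder(runs, 200).join(' | '));
```

The naive sort interleaves the two columns line by line, which is the failure described above. The column-aware version only works here because the gutter position is handed in. Finding that gutter reliably on real pages is the hard part.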

For the full Python-side comparison, see pdfmux vs PyMuPDF vs marker vs docling and the underlying PDF extractor benchmark methodology.

When pdfmux is the answer (and how to call it from Node)

If your service is in Node but the extraction quality you need is in Python, you have three clean paths. None require porting your service.

Path 1: pdfmux MCP server via stdio

This is the cleanest option as of May 2026. The pdfmux MCP server runs via npx, exposes four tools, and speaks the Model Context Protocol over stdio. You can call it from any Node app that can spawn a child process.

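import { spawn } from 'node:child_process';
import readline from 'node:readline';

const mcp = spawn('npx', ['-y', 'pdfmux'], {
  stdio: ['pipe', 'pipe', 'inherit'],
});

// MCP stdio messages are newline-delimited JSON-RPC.
// Full protocol: https://modelcontextprotocol.io/specification
const send = (message) => mcp.stdin.write(JSON.stringify(message) + '\n');

// Read whole lines: a raw 'data' chunk can hold a partial message or several.
const lines = readline.createInterface({ input: mcp.stdout });
lines.on('line', (line) => {
  if (!line.trim()) return;
  const response = JSON.parse(line);
  if (response.id === 1) {
    // Handshake acknowledged: confirm it, then call the tool.
    send({ jsonrpc: '2.0', method: 'notifications/initialized' });
    send({
      jsonrpc: '2.0',
      id: 2,
      method: 'tools/call',
      params: { name: 'convert_pdf', arguments: { path: './contract.pdf' } },
    });
  } else if (response.id === 2) {
    // If these fields are undefined, inspect result.content, where the MCP spec places tool output.
    console.log(response.result.markdown);            // Clean Markdown
    console.log(response.result.overall_confidence);  // 0.0-1.0
  }
});

// The server must be initialized before it accepts tools/call.
send({
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2025-06-18', // use the version your server supports
    capabilities: {},
    clientInfo: { name: 'node-pdf-client', version: '1.0.0' }, // illustrative name
  },
});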

For the full client-side recipe see pdfmux MCP server for Claude, Cursor, and Windsurf — the same server is what your Node code talks to.

The advantage of MCP over a custom subprocess protocol is that the same server works inside Claude Desktop, Cursor, your CI pipeline, and your production Node service. One server, four contexts. The disadvantage is that MCP startup is about 8 seconds the first time (Python virtualenv bootstrap) and 200 ms afterward, so you keep the process alive across requests rather than spawning per-PDF.
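Keeping one process alive across requests means several calls can be in flight at once, so responses have to be matched to requests by id. The following is a minimal sketch of that bookkeeping. `createLineRpcClient` is a hypothetical helper, not part of pdfmux or any MCP SDK, and it assumes newline-delimited JSON-RPC on stdio.

```javascript
// Transport-agnostic JSON-RPC client for a newline-delimited stdio server.
// `write` is whatever sends a string to the server, e.g. (s) => mcp.stdin.write(s).
function createLineRpcClient(write) {
  let nextId = 1;
  let buffer = '';
  const pending = new Map(); // request id -> { resolve, reject }

  return {
    // Send a request; the promise settles when the response with the same id arrives.
    request(method, params) {
      const id = nextId++;
      return new Promise((resolve, reject) => {
        pending.set(id, { resolve, reject });
        write(JSON.stringify({ jsonrpc: '2.0', id, method, params }) + '\n');
      });
    },

    // Feed raw stdout text; a chunk may contain a partial message or several.
    feed(chunk) {
      buffer += chunk;
      let newline;
      while ((newline = buffer.indexOf('\n')) !== -1) {
        const line = buffer.slice(0, newline).trim();
        buffer = buffer.slice(newline + 1);
        if (!line) continue;
        let message;
        try {
          message = JSON.parse(line);
        } catch {
          continue; // skip any non-JSON output
        }
        const waiter = pending.get(message.id);
        if (!waiter) continue; // notification or unknown id
        pending.delete(message.id);
        if (message.error) waiter.reject(new Error(message.error.message));
        else waiter.resolve(message.result);
      }
    },
  };
}

// Wiring, assuming `mcp` is the spawned child process from the example above:
//   const client = createLineRpcClient((s) => mcp.stdin.write(s));
//   mcp.stdout.on('data', (c) => client.feed(c.toString()));
```

Spawn the server once at service startup, complete the initialize handshake through the same client, and reuse it for every tools/call so only the first request pays the bootstrap cost.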

Path 2: pdfmux as a subprocess CLI

If you do not want MCP and just need a one-shot extraction, the pdfmux CLI prints JSON to stdout.

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);

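async function extractPDF(path) {
  // execFile buffers stdout; the 1 MiB default maxBuffer is easy to exceed on long PDFs.
  const { stdout } = await exec('pdfmux', ['convert', path, '--format', 'json'], {
    maxBuffer: 64 * 1024 * 1024,
  });
  return JSON.parse(stdout);
}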

const result = await extractPDF('./annual-report.pdf');
console.log(result.markdown);
console.log(result.confidence);

This works in any environment where pip install pdfmux succeeds. On serverless platforms that do not allow Python sidecars (Cloudflare Workers, Vercel Edge), this path is closed. Fall back to Path 3 by calling a hosted pdfmux endpoint, or use unpdf for low-quality extraction at the edge and re-extract higher-value PDFs on a Node/Python worker behind a queue.

Path 3: HTTP service in front of pdfmux

For higher-volume production usage, run pdfmux as a long-lived HTTP service and call it like any other internal microservice. The pdfmux[serve] extra ships with a FastAPI wrapper.

pip install 'pdfmux[serve]'
pdfmux serve --port 8765

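import fs from 'node:fs/promises';
// Build the upload; the field name 'file' is an assumption, check your pdfmux version.
const formData = new FormData();
formData.append('file', new Blob([await fs.readFile('./annual-report.pdf')]), 'annual-report.pdf');
const response = await fetch('http://localhost:8765/convert', {
  method: 'POST',
  body: formData,
});
if (!response.ok) throw new Error(`pdfmux returned HTTP ${response.status}`);
const { markdown, confidence } = await response.json();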

This is what you want for any workload above 100 PDFs per day. The Python interpreter stays warm, the OCR model stays loaded, and your Node service treats extraction as a remote call with normal HTTP retries and timeouts. See PDF data extraction for AI agents for the broader agent-architecture pattern this fits into.
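"Normal HTTP retries and timeouts" can be sketched in a few lines. `fetchWithRetry` and its option names are hypothetical, and the defaults (3 retries, 30-second timeout, 500 ms base backoff) are placeholders to tune against your own page counts. It relies on the global `fetch` and `AbortSignal.timeout` available in current Node releases.

```javascript
// Retry on network errors, timeouts, and 5xx responses with exponential backoff.
// `fetchImpl` is injectable so the helper can be exercised without a server.
async function fetchWithRetry(url, init = {}, opts = {}) {
  const { retries = 3, timeoutMs = 30_000, backoffMs = 500, fetchImpl = fetch } = opts;
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const response = await fetchImpl(url, {
        ...init,
        signal: AbortSignal.timeout(timeoutMs), // give up on a stuck extraction
      });
      if (response.status < 500) return response; // 2xx-4xx: retrying will not help
      lastError = new Error(`Server error ${response.status}`);
    } catch (err) {
      lastError = err; // network failure or timeout
    }
    if (attempt < retries) {
      // Wait backoffMs, 2x backoffMs, 4x backoffMs, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, backoffMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Calling `fetchWithRetry('http://localhost:8765/convert', { method: 'POST', body: formData })` would then replace the bare `fetch` call. A `FormData` body is serialized again on each attempt, so it can be resent safely. A streamed body could not.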

Decision table

| Scenario | Recommended path |
|---|---|
| Simple text dump, single-column PDFs, low volume | `pdf-parse` |
| Same as above but on Vercel Edge / Cloudflare Workers | `unpdf` |
| PDF form field extraction (AcroForm), pure Node | `pdf2json` |
| Tables, OCR, or multi-column layouts | pdfmux via MCP or subprocess |
| LLM agent inside Claude / Cursor / Windsurf | pdfmux MCP (`npx -y pdfmux`) |
| Production volume > 100 PDFs / day | pdfmux HTTP service (Path 3) |
| Pure-Node edge runtime, no Python allowed | `unpdf` for triage, queue high-value PDFs to a pdfmux worker |

A note on `pdfjs-dist` and `pdfreader`

You will see these two on npm. We do not recommend them in 2026.

pdfjs-dist is the raw PDF.js library. It works, but you get no high-level text extraction helper — you have to walk getTextContent() page by page and assemble the result yourself. pdf-parse and unpdf are thin wrappers around exactly this. Use them instead of writing the wrapper yourself.
pdfreader is an older project that ships native bindings. It compiles slowly, breaks on Apple Silicon, and has no maintenance commits since 2024. The maintainers have effectively moved their effort to pdf2json.

What you should not do

A few patterns we see in production code that we strongly recommend against:

Do not concatenate text without checking page count. If pdf-parse returns { text: "", numpages: 14 }, that is a 100% extraction failure, not a blank document. Always check the ratio of text length to page count and flag the document if it falls below a reasonable threshold (~200 chars/page for English).
Do not run extraction in your HTTP request handler. Even at 0.05 seconds per page, a 200-page PDF blocks the event loop for 10 seconds. Push extraction to a queue worker: BullMQ, Inngest, Trigger.dev, or a plain worker_threads pool. The user sees an immediate 202 Accepted, and the worker emails them when extraction finishes.
Do not feed raw extracted text directly to an LLM without a confidence check. This is the single most expensive class of bug in production RAG pipelines. The PDF parser returns gibberish from a scanned page, the LLM dutifully summarizes the gibberish, and the user gets a confidently-wrong answer. Use pdfmux’s self-healing pipeline and confidence scores, or implement your own equivalent — but do not skip the check.
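The first check in the list above can be sketched as a small guard. `assessExtraction` is a hypothetical helper. It takes the `{ text, numpages }` shape that pdf-parse returns, and the 200 chars/page default is the rough English-language figure quoted above, not a calibrated threshold.

```javascript
// Flag extractions whose text is suspiciously short for the page count.
// Tune minCharsPerPage per corpus; slide decks and forms run much lower.
function assessExtraction({ text, numpages }, minCharsPerPage = 200) {
  const pages = Math.max(numpages || 0, 0);
  const chars = (text || '').replace(/\s+/g, '').length; // ignore whitespace

  if (pages === 0) {
    return { ok: false, charsPerPage: 0, reason: 'no pages reported' };
  }
  const charsPerPage = chars / pages;
  if (chars === 0) {
    return { ok: false, charsPerPage, reason: 'empty text: likely scanned, needs OCR' };
  }
  if (charsPerPage < minCharsPerPage) {
    return { ok: false, charsPerPage, reason: 'low text density: possibly a partial scan' };
  }
  return { ok: true, charsPerPage, reason: null };
}
```

Run it on every pdf-parse or unpdf result before the text goes anywhere else. When `ok` is false, route the document to a higher-quality extractor or surface the failure to the user instead of storing an empty string.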

The shortest path

If you are starting a new Node project today and need to extract PDFs:

// 1. For simple cases — pdf-parse
import pdfParse from 'pdf-parse';
const { text } = await pdfParse(buffer);

// 2. For everything else — pdfmux subprocess
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
const exec = promisify(execFile);
const { stdout } = await exec('pdfmux', ['convert', path, '--format', 'json']);
const { markdown, confidence, warnings } = JSON.parse(stdout);
if (confidence < 0.85) console.warn('Low-confidence extraction:', warnings);

That is the whole pattern. Use pdf-parse (or unpdf if you are on the edge) for cheap text dumps. For anything where the LLM will read the output and act on it, call pdfmux and check the confidence score before you trust the result.

For the broader architectural picture — queue patterns, retries, when to use the MCP server vs the HTTP API — see PDF data extraction for AI agents.