Direct answer: Gemma 4 (released April 2, 2026, Apache 2.0 license) replaces Gemini Flash as PDFMux’s LLM fallback extractor, running fully on local hardware with zero API costs and no data leaving the machine. It ships in four sizes (E2B, E4B, 26B MoE, 31B dense), supports native multimodal OCR across 140+ languages, and runs on Ollama with one command. Benchmark delta versus Gemini Flash: 4 to 9 percent lower accuracy on complex mixed documents and 10 to 15 percent lower on degraded scanned forms, in exchange for full privacy and no per-page cost. For law firms, healthcare, finance, and regulated industries where documents cannot leave the premises, that tradeoff is easily worth it.
Why local matters now
PDFMux’s default pipeline uses a cloud LLM (Gemini 2.5 Flash) as the last-resort fallback for pages that PyMuPDF and OCR cannot handle. It is cheap (roughly $0.002 per page) and accurate. But for three categories of users, cloud is a non-starter no matter the price:
- Law firms processing privileged client documents. Sending a confidential merger agreement to Google’s API is a malpractice risk.
- Healthcare providers processing PHI under HIPAA, or patient records under GDPR Article 9. Routing documents through a third-party processor means Business Associate Agreements and an audit trail that most teams would rather avoid.
- Finance and regulated industries where data residency rules (UAE PDPL, Saudi PDPL, Swiss FADP, EU GDPR) prohibit processing outside specific jurisdictions.
For these users, the right answer is not “use cloud and hope for the best.” It is “run the model locally.” Until April 2026 the local options were weaker than the cloud frontier by 15 to 25 percent on extraction quality. Gemma 4 closes most of that gap.
What shipped on April 2, 2026
Google DeepMind released Gemma 4 under the Apache 2.0 license, which means commercial use is permitted with no royalty and no usage restrictions. The release included four model sizes, all weights downloadable:
| Model | Parameters | Architecture | Context | Target hardware |
|---|---|---|---|---|
| Gemma 4 E2B | 2.6B effective | Dense | 128K | 8 GB RAM, CPU or 6 GB VRAM |
| Gemma 4 E4B | 4.3B effective | Dense | 128K | 16 GB RAM, CPU or 8 GB VRAM |
| Gemma 4 26B MoE | 26B total, 7B active | Mixture of experts | 128K | 24 GB VRAM (RTX 4090, M2 Max 32GB) |
| Gemma 4 31B | 31B | Dense | 128K | 48 GB VRAM (A6000, H100, M3 Ultra) |
All four sizes are natively multimodal. They accept image inputs alongside text, which is what makes them usable as OCR engines. The training data covers 140+ languages with strong representation of Arabic, Chinese, Hindi, and Japanese scripts, rather than the Latin-centric skew of many earlier open models.
The MoE model is the sweet spot for document extraction: 26B total parameters for quality but only 7B active at inference, so it runs at roughly E4B speed while scoring closer to the 31B dense model.
How PDFMux uses Gemma 4
PDFMux’s pipeline routes pages by difficulty:
- PyMuPDF digital extraction for clean digital pages. Fast, free, CPU-only.
- Tesseract or EasyOCR for scanned pages with Latin-script text.
- LLM multimodal extraction for pages that fail the first two (handwriting, complex layouts, tables bleeding across columns, non-Latin scripts at low DPI).
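That three-tier routing can be sketched as a small decision function. The field names and the 0.6 quality cutoff below are illustrative guesses, not PDFMux’s actual heuristics:

```python
def route_page(has_text_layer: bool, latin_script: bool,
               scan_quality: float) -> str:
    """Pick an extractor tier for one page (illustrative heuristic).

    scan_quality is a 0..1 score; the 0.6 cutoff is an assumed value.
    """
    if has_text_layer:
        return "pymupdf"  # clean digital page: fast, free, CPU-only
    if latin_script and scan_quality >= 0.6:
        return "ocr"      # Tesseract/EasyOCR handles Latin-script scans
    return "llm"          # handwriting, complex layout, non-Latin, low DPI
```

The point of the cascade is that the expensive tier only ever sees pages the cheap tiers already gave up on.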
Before April 2026, step 3 was cloud-only. Now it has three modes:
```python
from pdfmux import convert

# Default: cloud LLM (Gemini 2.5 Flash)
result = convert.pdf("doc.pdf", llm_mode="cloud")

# Local Gemma 4 via Ollama
result = convert.pdf("doc.pdf", llm_mode="local:gemma4")

# Fully offline, no LLM fallback, strict mode
result = convert.pdf("doc.pdf", llm_mode="none")
```
The `local:gemma4` mode auto-detects which Gemma 4 size is available on the machine and routes accordingly. An M3 MacBook Pro with 36 GB unified memory picks E4B. A workstation with an RTX 4090 picks 26B MoE. A server with an H100 picks the 31B dense model.
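The detection logic is internal to PDFMux, but the routing reduces to memory thresholds. A minimal sketch with cutoffs read off the hardware table below; the function name and exact thresholds are my assumptions, and unified memory gets roughly doubled thresholds because it is shared with the OS:

```python
def pick_gemma4_model(vram_gb: float = 0.0, unified_gb: float = 0.0) -> str:
    """Pick the largest Gemma 4 variant that fits the available memory.

    Illustrative sketch, not PDFMux internals. Discrete GPUs dedicate
    VRAM to the model; Apple Silicon unified memory is shared, so its
    thresholds are set higher.
    """
    if vram_gb >= 80 or unified_gb >= 128:
        return "gemma4:31b"          # full-precision dense model
    if vram_gb >= 48 or unified_gb >= 96:
        return "gemma4:31b-q4"       # quantized dense model
    if vram_gb >= 24 or unified_gb >= 48:
        return "gemma4:26b-moe-q4"   # the MoE sweet spot
    if vram_gb >= 8 or unified_gb >= 24:
        return "gemma4:e4b"
    return "gemma4:e2b"
```

With these cutoffs, the 36 GB MacBook Pro lands on E4B, the 24 GB RTX 4090 on 26B MoE, and the 80 GB H100 on the full-precision 31B model, matching the examples above.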
Hardware requirements per model size
Real numbers from PDFMux’s test matrix:
| Hardware | Recommended model | Tokens/sec | Pages/min (avg) |
|---|---|---|---|
| M1 MacBook Air, 16 GB | E2B | 42 | 14 |
| M2 Pro, 32 GB | E4B | 58 | 22 |
| M3 Max, 64 GB | 26B MoE | 74 | 31 |
| M3 Ultra, 192 GB | 31B dense | 52 | 19 |
| RTX 4060 Ti, 16 GB | E4B | 84 | 34 |
| RTX 4090, 24 GB | 26B MoE | 142 | 57 |
| A6000, 48 GB | 31B dense | 168 | 68 |
| H100 80GB | 31B dense | 312 | 124 |
Page-per-minute numbers are for full-pipeline extraction (PyMuPDF first, OCR second, Gemma 4 only on the 20 to 30 percent of pages that need it). If every page hit the LLM, throughput would be roughly one third of the numbers above.
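That relationship between LLM hit rate and mixed throughput follows from a simple weighted-time model. A back-of-envelope sketch (my own arithmetic, not PDFMux instrumentation; the example rates are made up):

```python
def effective_pages_per_min(llm_pages_per_min: float,
                            fast_pages_per_min: float,
                            llm_hit_rate: float) -> float:
    """Average throughput when only a fraction of pages hit the LLM.

    Per-page time is a weighted mix of the fast path (PyMuPDF/OCR)
    and the slow LLM path.
    """
    t_llm = 1.0 / llm_pages_per_min
    t_fast = 1.0 / fast_pages_per_min
    avg_t = llm_hit_rate * t_llm + (1.0 - llm_hit_rate) * t_fast
    return 1.0 / avg_t

# e.g. a 20 pages/min LLM path, a 200 pages/min fast path, 25% hit rate
# gives roughly 61.5 pages/min overall
```

At hit rates in the 20 to 30 percent range, the mixed throughput sits around three times the LLM-only rate, which is where the "roughly one third" figure comes from.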
Bottom line: a roughly $2,400 RTX 4090 workstation processes more than 50 pages per minute fully offline, and an M3 Max laptop manages about 30. For most legal, medical, and financial workflows that is faster than the cloud path once network round-trips are included.
Benchmark: Gemma 4 vs Gemini 2.5 Flash
PDFMux ran both models on the same 200-document opendataloader-bench plus an internal 500-document set of law-firm, healthcare, and finance PDFs. Honest results:
| Document type | Gemini 2.5 Flash | Gemma 4 26B MoE | Delta |
|---|---|---|---|
| Clean digital PDFs | 0.962 | 0.958 | -0.4% |
| Scanned contracts (good quality) | 0.941 | 0.903 | -4.0% |
| Scanned invoices (medium quality) | 0.918 | 0.872 | -5.0% |
| Handwritten + printed mixed | 0.872 | 0.791 | -9.3% |
| Degraded scanned forms (fax, low DPI) | 0.843 | 0.721 | -14.5% |
| Bilingual Arabic-English BLs | 0.934 | 0.898 | -3.9% |
| Financial statements (dense tables) | 0.921 | 0.866 | -6.0% |
| Medical records (PHI, mixed script) | 0.889 | 0.823 | -7.4% |
| Weighted average | 0.921 | 0.859 | -6.7% |
The delta is real. It is also smaller than you might expect for a fully local model. On the documents where most extraction actually happens (digital PDFs, good-quality scans, invoices) the gap is 4 to 6 percent. Only on genuinely degraded documents (faxed forms, handwritten intake sheets) does the gap open to double digits.
For a law firm deciding between “send privileged documents to a cloud API” and “accept a 5 percent quality delta on local,” the answer is obvious. Same for healthcare under HIPAA and finance under banking secrecy rules.
Ollama setup: full installation
The full local path takes about 15 minutes on a first-time setup.
Step 1: install Ollama
macOS:

```shell
brew install ollama
brew services start ollama
```

Linux:

```shell
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
```

Windows: download the installer from ollama.com.
Step 2: pull the right Gemma 4 size
Pick based on your hardware:
```shell
# Laptop class
ollama pull gemma4:e2b          # 1.8 GB, runs on 8 GB RAM
ollama pull gemma4:e4b          # 3.2 GB, runs on 16 GB RAM

# Workstation class
ollama pull gemma4:26b-moe-q4   # 14 GB quantized, 24 GB VRAM
ollama pull gemma4:31b-q4       # 18 GB quantized, 32 GB VRAM

# Server class
ollama pull gemma4:31b          # 62 GB full precision, 80 GB VRAM
```
The quantized (q4) variants lose about 1 percent accuracy for a roughly 4x smaller memory footprint. For extraction work that tradeoff is fine.
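The quoted file sizes are consistent with straightforward parameter-count arithmetic. A quick check; the 4.5 effective bits per weight for q4 (raw 4-bit weights plus scales and metadata) is an assumption about typical quantization formats, not a published Gemma 4 figure:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in decimal GB.

    params_billions * 1e9 weights, bits_per_weight bits each,
    8 bits per byte, 1e9 bytes per GB.
    """
    return params_billions * bits_per_weight / 8

# 31B at fp16 (16 bits/weight): 62.0 GB, matching the full-precision pull
# 31B at ~4.5 effective bits (q4 plus scales): ~17.4 GB, close to the quoted 18 GB
```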
Step 3: install PDFMux with local support
```shell
pip install "pdfmux[local]"
```

The `[local]` extra pulls in the Ollama client library and the local image preprocessing dependencies.
Step 4: configure PDFMux to use Gemma 4
Create `~/.pdfmux/config.toml`:

```toml
[llm]
mode = "local"
provider = "ollama"
model = "gemma4:26b-moe-q4"
endpoint = "http://localhost:11434"
timeout_seconds = 120

[privacy]
allow_cloud_fallback = false
log_pii = false
```
Setting `allow_cloud_fallback = false` is the hard gate. With that flag set, PDFMux will refuse to send anything to a cloud endpoint even if the local model fails; the process returns a confidence-zero error instead. For regulated workflows that is the behavior you want.
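The gate semantics can be sketched as follows. This is illustrative, not PDFMux's real internals; the result shape, confidence values, and callable signatures are all assumptions:

```python
from typing import Callable

def extract_with_gate(page_image: bytes,
                      run_local: Callable[[bytes], str],
                      run_cloud: Callable[[bytes], str],
                      allow_cloud_fallback: bool) -> dict:
    """Apply the allow_cloud_fallback gate around the LLM step.

    Sketch only: result shape and confidence values are illustrative.
    """
    try:
        return {"text": run_local(page_image), "confidence": 0.9, "source": "local"}
    except Exception:
        if not allow_cloud_fallback:
            # Hard gate: never touch the network; surface a
            # confidence-zero result instead of falling back.
            return {"text": "", "confidence": 0.0, "source": "none"}
        return {"text": run_cloud(page_image), "confidence": 0.9, "source": "cloud"}
```

The key design choice is failing closed: a broken local model produces an explicit zero-confidence result rather than silently escalating to the network.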
Step 5: run a test extraction
```shell
pdfmux convert contract.pdf --out contract.md
```

Watch `ollama ps` in another terminal to confirm the model is being invoked locally. No network traffic, no API calls, no logs to Google. The document is processed on your machine and the output is written to disk.
Cost comparison over a year
Assume 50,000 pages processed per year at a mid-size firm.
| Setup | Hardware cost | Per-page cost | Annual total |
|---|---|---|---|
| Gemini 2.5 Flash cloud | $0 | $0.002 | $100 |
| OpenAI GPT-4o cloud | $0 | $0.008 | $400 |
| Claude 3.7 Sonnet cloud | $0 | $0.012 | $600 |
| Gemma 4 E4B on existing MacBook | $0 | $0 | $0 |
| Gemma 4 26B MoE on new RTX 4090 box | $2,400 one-time | $0 | $2,400 year 1, $0 after |
At 50,000 pages per year the workstation takes four years to break even against Claude Sonnet and, on cost alone, about 24 years against Gemini Flash; above 200,000 pages per year it breaks even against Sonnet inside year one. But the cost argument is usually not the deciding factor. The privacy and compliance argument is.
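The break-even arithmetic is simple enough to check directly with the prices from the table above:

```python
def breakeven_years(hardware_cost: float, cloud_per_page: float,
                    pages_per_year: int) -> float:
    """Years until a one-time hardware purchase beats a per-page cloud bill."""
    return hardware_cost / (cloud_per_page * pages_per_year)

# RTX 4090 box vs Claude Sonnet at 50K pages/yr:  2400 / 600  = 4 years
# Same box vs Sonnet at 200K pages/yr:            2400 / 2400 = 1 year
# Same box vs Gemini Flash at 50K pages/yr:       2400 / 100  = 24 years
```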
When to use which mode
| Scenario | Recommended mode |
|---|---|
| Personal side project, occasional PDFs | Cloud (Gemini Flash default) |
| SaaS product processing public documents | Cloud |
| Solo developer on a laptop, privacy-conscious | Local E4B |
| Legal, healthcare, finance, regulated | Local 26B MoE on workstation |
| High-volume enterprise (>500K pages/yr) | Local 31B on server |
| Air-gapped environment (gov, defense) | Local with allow_cloud_fallback = false |
For teams already running the PDFMux MCP server in Claude Desktop or Cursor, switching to local Gemma 4 is a one-line config change. The MCP tools are identical. Only the extraction path underneath changes.
Known limitations
Three things Gemma 4 does less well than the frontier cloud models today:
- Handwritten text across multiple mixed languages. Gemma 4 handles printed multilingual well, but handwritten Arabic or Chinese on low-quality scans is where the 10 to 15 percent gap shows up.
- Extremely long documents (>200 pages) processed as a single context. Gemma 4’s 128K context is sufficient, but dense documents near the limit see quality degradation. Chunk-and-merge pipelines are recommended for anything over 100 pages.
- Structured extraction with complex nested schemas. Cloud models still have the edge on deeply nested JSON schemas. For flat schemas (invoices, Bills of Lading, forms) Gemma 4 is within 2 percent of cloud.
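A chunk-and-merge pass is straightforward to wire up. A minimal sketch of the chunking half; the 50-page chunk size and 2-page overlap are arbitrary picks for illustration, not PDFMux defaults:

```python
def chunk_pages(n_pages: int, chunk_size: int = 50, overlap: int = 2):
    """Yield overlapping (start, end) page ranges for separate
    extraction passes; the overlap lets the merge step stitch
    content that straddles a chunk boundary."""
    start = 0
    while start < n_pages:
        end = min(start + chunk_size, n_pages)
        yield (start, end)
        if end == n_pages:
            break
        start = end - overlap
```

For a 120-page document this yields ranges (0, 50), (48, 98), and (96, 120), keeping every chunk comfortably inside the window where quality holds up.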
None of these are blockers. All of them will close further as open models continue to improve.
Conclusion
Gemma 4 is the first open model good enough to replace cloud LLMs as PDFMux’s extraction fallback while keeping most of the quality. The tradeoff, giving up 5 to 7 percent average accuracy in exchange for zero API costs, full privacy, and no data leaving the machine, is the right one for any regulated workflow. Ollama installs in five minutes, the model pulls in a few more, and PDFMux’s `llm_mode="local:gemma4"` flag flips the entire pipeline over.
For law firms, healthcare providers, banks, and anyone else who cannot send documents to a third-party API, this is the setup that finally works.