Direct answer: Gemma 4 (released April 2, 2026, Apache 2.0 license) replaces Gemini Flash as PDFMux’s LLM fallback extractor, running fully on local hardware with zero API costs and no data leaving the machine. It ships in four sizes (E2B, E4B, 26B MoE, 31B dense), supports native multimodal OCR across 140+ languages, and runs on Ollama with one command. Benchmark delta versus Gemini Flash: 4 to 9 percent lower accuracy on complex mixed documents and 10 to 15 percent lower on degraded scanned forms, in exchange for full privacy and no per-page cost. For law firms, healthcare, finance, and regulated industries where documents cannot leave the premises, that tradeoff is easily worth it.
Why local matters now
PDFMux’s default pipeline uses a cloud LLM (Gemini 2.5 Flash) as the last-resort fallback for pages that PyMuPDF and OCR cannot handle. It is cheap (roughly $0.002 per page) and accurate. But for three categories of users, cloud is a non-starter no matter the price:
- Law firms processing privileged client documents. Sending a confidential merger agreement to Google’s API is a malpractice risk.
- Healthcare providers processing PHI under HIPAA, or patient records under GDPR Article 9. Routing documents through a third-party processor means Business Associate Agreements and an audit trail that most teams would rather avoid.
- Finance and regulated industries where data residency rules (UAE PDPL, Saudi PDPL, Swiss FADP, EU GDPR) prohibit processing outside specific jurisdictions.
For these users, the right answer is not “use cloud and hope for the best.” It is “run the model locally.” Until April 2026 the local options were weaker than the cloud frontier by 15 to 25 percent on extraction quality. Gemma 4 closes most of that gap.
What shipped on April 2, 2026
Google DeepMind released Gemma 4 under the Apache 2.0 license, which means commercial use is permitted with no royalty and no usage restrictions. The release included four model sizes, all weights downloadable:
| Model | Parameters | Architecture | Context | Target hardware |
|---|---|---|---|---|
| Gemma 4 E2B | 2.6B effective | Dense | 128K | 8 GB RAM, CPU or 6 GB VRAM |
| Gemma 4 E4B | 4.3B effective | Dense | 128K | 16 GB RAM, CPU or 8 GB VRAM |
| Gemma 4 26B MoE | 26B total, 7B active | Mixture of experts | 128K | 24 GB VRAM (RTX 4090, M2 Max 32GB) |
| Gemma 4 31B | 31B | Dense | 128K | 48 GB VRAM (A6000, H100, M3 Ultra) |
All four sizes are natively multimodal. They accept image inputs alongside text, which is what makes them usable as OCR engines. The training data covers 140+ languages with strong representation of Arabic, Chinese, Hindi, and Japanese scripts, rather than the Latin-centric skew of many earlier open models.
The MoE model is the sweet spot for document extraction: 26B total parameters for quality but only 7B active at inference, so it runs at roughly E4B speed while scoring closer to the 31B dense model.
How PDFMux uses Gemma 4
PDFMux’s pipeline routes pages by difficulty:
- PyMuPDF digital extraction for clean digital pages. Fast, free, CPU-only.
- Tesseract or EasyOCR for scanned pages with Latin-script text.
- LLM multimodal extraction for pages that fail the first two (handwriting, complex layouts, tables bleeding across columns, non-Latin scripts at low DPI).
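That three-tier routing can be sketched as a small decision function. The field names and the 0.6 quality cutoff below are illustrative guesses, not PDFMux’s actual heuristics:

```python
def route_page(has_text_layer: bool, latin_script: bool,
               scan_quality: float) -> str:
    """Pick an extractor tier for one page (illustrative heuristic).

    scan_quality is a 0..1 score; the 0.6 cutoff is an assumed value.
    """
    if has_text_layer:
        return "pymupdf"  # clean digital page: fast, free, CPU-only
    if latin_script and scan_quality >= 0.6:
        return "ocr"      # Tesseract/EasyOCR handles Latin-script scans
    return "llm"          # handwriting, complex layout, non-Latin, low DPI
```

The point of the cascade is that the expensive tier only ever sees pages the cheap tiers already gave up on.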
Before April 2026, step 3 was cloud-only. Now it has three modes:
```python
from pdfmux import convert

# Default: cloud LLM (Gemini 2.5 Flash)
result = convert.pdf("doc.pdf", llm_mode="cloud")

# Local Gemma 4 via Ollama
result = convert.pdf("doc.pdf", llm_mode="local:gemma4")

# Fully offline, no LLM fallback, strict mode
result = convert.pdf("doc.pdf", llm_mode="none")
```
The `local:gemma4` mode auto-detects which Gemma 4 size is available on the machine and routes accordingly. An M3 MacBook Pro with 36 GB unified memory picks E4B. A workstation with an RTX 4090 picks 26B MoE. A server with an H100 picks the 31B dense model.
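The detection logic is internal to PDFMux, but the routing reduces to memory thresholds. A minimal sketch with cutoffs read off the hardware table below; the function name and exact thresholds are my assumptions, and unified memory gets roughly doubled thresholds because it is shared with the OS:

```python
def pick_gemma4_model(vram_gb: float = 0.0, unified_gb: float = 0.0) -> str:
    """Pick the largest Gemma 4 variant that fits the available memory.

    Illustrative sketch, not PDFMux internals. Discrete GPUs dedicate
    VRAM to the model; Apple Silicon unified memory is shared, so its
    thresholds are set higher.
    """
    if vram_gb >= 80 or unified_gb >= 128:
        return "gemma4:31b"          # full-precision dense model
    if vram_gb >= 48 or unified_gb >= 96:
        return "gemma4:31b-q4"       # quantized dense model
    if vram_gb >= 24 or unified_gb >= 48:
        return "gemma4:26b-moe-q4"   # the MoE sweet spot
    if vram_gb >= 8 or unified_gb >= 24:
        return "gemma4:e4b"
    return "gemma4:e2b"
```

With these cutoffs, the 36 GB MacBook Pro lands on E4B, the 24 GB RTX 4090 on 26B MoE, and the 80 GB H100 on the full-precision 31B model, matching the examples above.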
Hardware requirements per model size
Real numbers from PDFMux’s test matrix:
| Hardware | Recommended model | Tokens/sec | Pages/min (avg) |
|---|---|---|---|
| M1 MacBook Air, 16 GB | E2B | 42 | 14 |
| M2 Pro, 32 GB | E4B | 58 | 22 |
| M3 Max, 64 GB | 26B MoE | 74 | 31 |
| M3 Ultra, 192 GB | 31B dense | 52 | 19 |
| RTX 4060 Ti, 16 GB | E4B | 84 | 34 |
| RTX 4090, 24 GB | 26B MoE | 142 | 57 |
| A6000, 48 GB | 31B dense | 168 | 68 |
| H100 80GB | 31B dense | 312 | 124 |
Page-per-minute numbers are for full-pipeline extraction (PyMuPDF first, OCR second, Gemma 4 only on the 20 to 30 percent of pages that need it). If every page hit the LLM, throughput would be roughly one third of the numbers above.
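That relationship between LLM hit rate and mixed throughput follows from a simple weighted-time model. A back-of-envelope sketch (my own arithmetic, not PDFMux instrumentation; the example rates are made up):

```python
def effective_pages_per_min(llm_pages_per_min: float,
                            fast_pages_per_min: float,
                            llm_hit_rate: float) -> float:
    """Average throughput when only a fraction of pages hit the LLM.

    Per-page time is a weighted mix of the fast path (PyMuPDF/OCR)
    and the slow LLM path.
    """
    t_llm = 1.0 / llm_pages_per_min
    t_fast = 1.0 / fast_pages_per_min
    avg_t = llm_hit_rate * t_llm + (1.0 - llm_hit_rate) * t_fast
    return 1.0 / avg_t

# e.g. a 20 pages/min LLM path, a 200 pages/min fast path, 25% hit rate
# gives roughly 61.5 pages/min overall
```

At hit rates in the 20 to 30 percent range, the mixed throughput sits around three times the LLM-only rate, which is where the "roughly one third" figure comes from.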
Bottom line: a roughly $2,400 RTX 4090 workstation processes more than 50 pages per minute fully offline, and an M3 Max laptop manages about 30. For most legal, medical, and financial workflows that is faster than the cloud path once network round-trips are included.
Benchmark: Gemma 4 vs Gemini 2.5 Flash
PDFMux ran both models on the same 200-document opendataloader-bench plus an internal 500-document set of law-firm, healthcare, and finance PDFs. Honest results:
| Document type | Gemini 2.5 Flash | Gemma 4 26B MoE | Delta |
|---|---|---|---|
| Clean digital PDFs | 0.962 | 0.958 | -0.4% |
| Scanned contracts (good quality) | 0.941 | 0.903 | -4.0% |
| Scanned invoices (medium quality) | 0.918 | 0.872 | -5.0% |
| Handwritten + printed mixed | 0.872 | 0.791 | -9.3% |
| Degraded scanned forms (fax, low DPI) | 0.843 | 0.721 | -14.5% |
| Bilingual Arabic-English BLs | 0.934 | 0.898 | -3.9% |
| Financial statements (dense tables) | 0.921 | 0.866 | -6.0% |
| Medical records (PHI, mixed script) | 0.889 | 0.823 | -7.4% |
| Weighted average | 0.921 | 0.859 | -6.7% |
The delta is real. It is also smaller than you might expect for a fully local model. On the documents where most extraction actually happens (digital PDFs, good-quality scans, invoices) the gap is 4 to 6 percent. Only on genuinely degraded documents (faxed forms, handwritten intake sheets) does the gap open to double digits.
For a law firm deciding between “send privileged documents to a cloud API” and “accept a 5 percent quality delta on local,” the answer is obvious. Same for healthcare under HIPAA and finance under banking secrecy rules.
Ollama setup: full installation
The full local path takes about 15 minutes on a first-time setup.
Step 1: install Ollama
macOS:

```shell
brew install ollama
brew services start ollama
```

Linux:

```shell
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
```

Windows: download the installer from ollama.com.
Step 2: pull the right Gemma 4 size
Pick based on your hardware:
```shell
# Laptop class
ollama pull gemma4:e2b          # 1.8 GB, runs on 8 GB RAM
ollama pull gemma4:e4b          # 3.2 GB, runs on 16 GB RAM

# Workstation class
ollama pull gemma4:26b-moe-q4   # 14 GB quantized, 24 GB VRAM
ollama pull gemma4:31b-q4       # 18 GB quantized, 32 GB VRAM

# Server class
ollama pull gemma4:31b          # 62 GB full precision, 80 GB VRAM
```
The quantized (q4) variants lose about 1 percent accuracy for a roughly 4x smaller memory footprint. For extraction work that tradeoff is fine.
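The quoted file sizes are consistent with straightforward parameter-count arithmetic. A quick check; the 4.5 effective bits per weight for q4 (raw 4-bit weights plus scales and metadata) is an assumption about typical quantization formats, not a published Gemma 4 figure:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in decimal GB.

    params_billions * 1e9 weights, bits_per_weight bits each,
    8 bits per byte, 1e9 bytes per GB.
    """
    return params_billions * bits_per_weight / 8

# 31B at fp16 (16 bits/weight): 62.0 GB, matching the full-precision pull
# 31B at ~4.5 effective bits (q4 plus scales): ~17.4 GB, close to the quoted 18 GB
```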
Step 3: install PDFMux with local support
```shell
pip install "pdfmux[local]"
```

The `[local]` extra pulls in the Ollama client library and the local image preprocessing dependencies.
Step 4: configure PDFMux to use Gemma 4
Create `~/.pdfmux/config.toml`:

```toml
[llm]
mode = "local"
provider = "ollama"
model = "gemma4:26b-moe-q4"
endpoint = "http://localhost:11434"
timeout_seconds = 120

[privacy]
allow_cloud_fallback = false
log_pii = false
```
Setting `allow_cloud_fallback = false` is the hard gate. With that flag set, PDFMux will refuse to send anything to a cloud endpoint even if the local model fails; the process returns a confidence-zero error instead. For regulated workflows that is the behavior you want.
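The gate semantics can be sketched as follows. This is illustrative, not PDFMux's real internals; the result shape, confidence values, and callable signatures are all assumptions:

```python
from typing import Callable

def extract_with_gate(page_image: bytes,
                      run_local: Callable[[bytes], str],
                      run_cloud: Callable[[bytes], str],
                      allow_cloud_fallback: bool) -> dict:
    """Apply the allow_cloud_fallback gate around the LLM step.

    Sketch only: result shape and confidence values are illustrative.
    """
    try:
        return {"text": run_local(page_image), "confidence": 0.9, "source": "local"}
    except Exception:
        if not allow_cloud_fallback:
            # Hard gate: never touch the network; surface a
            # confidence-zero result instead of falling back.
            return {"text": "", "confidence": 0.0, "source": "none"}
        return {"text": run_cloud(page_image), "confidence": 0.9, "source": "cloud"}
```

The key design choice is failing closed: a broken local model produces an explicit zero-confidence result rather than silently escalating to the network.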
Step 5: run a test extraction
```shell
pdfmux convert contract.pdf --out contract.md
```

Watch `ollama ps` in another terminal to confirm the model is being invoked locally. No network traffic, no API calls, no logs to Google. The document is processed on your machine and the output is written to disk.
Cost comparison over a year
Assume 50,000 pages processed per year at a mid-size firm.
| Setup | Hardware cost | Per-page cost | Annual total |
|---|---|---|---|
| Gemini 2.5 Flash cloud | $0 | $0.002 | $100 |
| OpenAI GPT-4o cloud | $0 | $0.008 | $400 |
| Claude 3.7 Sonnet cloud | $0 | $0.012 | $600 |
| Gemma 4 E4B on existing MacBook | $0 | $0 | $0 |
| Gemma 4 26B MoE on new RTX 4090 box | $2,400 one-time | $0 | $2,400 year 1, $0 after |
At 50,000 pages per year the workstation takes four years to break even against Claude Sonnet and, on cost alone, about 24 years against Gemini Flash; above 200,000 pages per year it breaks even against Sonnet inside year one. But the cost argument is usually not the deciding factor. The privacy and compliance argument is.
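The break-even arithmetic is simple enough to check directly with the prices from the table above:

```python
def breakeven_years(hardware_cost: float, cloud_per_page: float,
                    pages_per_year: int) -> float:
    """Years until a one-time hardware purchase beats a per-page cloud bill."""
    return hardware_cost / (cloud_per_page * pages_per_year)

# RTX 4090 box vs Claude Sonnet at 50K pages/yr:  2400 / 600  = 4 years
# Same box vs Sonnet at 200K pages/yr:            2400 / 2400 = 1 year
# Same box vs Gemini Flash at 50K pages/yr:       2400 / 100  = 24 years
```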
When to use which mode
| Scenario | Recommended mode |
|---|---|
| Personal side project, occasional PDFs | Cloud (Gemini Flash default) |
| SaaS product processing public documents | Cloud |
| Solo developer on a laptop, privacy-conscious | Local E4B |
| Legal, healthcare, finance, regulated | Local 26B MoE on workstation |
| High-volume enterprise (>500K pages/yr) | Local 31B on server |
| Air-gapped environment (gov, defense) | Local with allow_cloud_fallback = false |
For teams already running the PDFMux MCP server in Claude Desktop or Cursor, switching to local Gemma 4 is a one-line config change. The MCP tools are identical. Only the extraction path underneath changes.
Known limitations
Three things Gemma 4 does less well than the frontier cloud models today:
- Handwritten text across multiple mixed languages. Gemma 4 handles printed multilingual well, but handwritten Arabic or Chinese on low-quality scans is where the 10 to 15 percent gap shows up.
- Extremely long documents (>200 pages) processed as a single context. Gemma 4’s 128K context is sufficient, but dense documents near the limit see quality degradation. Chunk-and-merge pipelines are recommended for anything over 100 pages.
- Structured extraction with complex nested schemas. Cloud models still have the edge on deeply nested JSON schemas. For flat schemas (invoices, Bills of Lading, forms) Gemma 4 is within 2 percent of cloud.
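A chunk-and-merge pass is straightforward to wire up. A minimal sketch of the chunking half; the 50-page chunk size and 2-page overlap are arbitrary picks for illustration, not PDFMux defaults:

```python
def chunk_pages(n_pages: int, chunk_size: int = 50, overlap: int = 2):
    """Yield overlapping (start, end) page ranges for separate
    extraction passes; the overlap lets the merge step stitch
    content that straddles a chunk boundary."""
    start = 0
    while start < n_pages:
        end = min(start + chunk_size, n_pages)
        yield (start, end)
        if end == n_pages:
            break
        start = end - overlap
```

For a 120-page document this yields ranges (0, 50), (48, 98), and (96, 120), keeping every chunk comfortably inside the window where quality holds up.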
None of these are blockers. All of them will close further as open models continue to improve.
Conclusion
Gemma 4 is the first open model good enough to replace cloud LLMs as PDFMux’s extraction fallback while keeping most of the quality. The tradeoff, giving up 5 to 7 percent average accuracy in exchange for zero API costs, full privacy, and no data leaving the machine, is the right one for any regulated workflow. Ollama installs in five minutes, the model pulls in a few more, and PDFMux’s `llm_mode="local:gemma4"` flag flips the entire pipeline over.
For law firms, healthcare providers, banks, and anyone else who cannot send documents to a third-party API, this is the setup that finally works.