Direct answer: Most tools fail to extract data from Arabic PDFs because of right-to-left (RTL) text order, bidirectional (bidi) mixing with English, and contextual letter shaping with ligatures. PDFMux handles bilingual Arabic-English documents by combining PyMuPDF with explicit bidi post-processing for digital text and Gemma 4 multimodal OCR for scanned pages, covering 140+ languages including Arabic script variants. A mid-size GCC freight forwarder processing 400 shipments per month typically spends $12,000 per month on manual data entry, at about 90 minutes of document handling per shipment. PDFMux reduces that to under 90 seconds per document with 94 to 97 percent field-level accuracy on the standard GCC shipping document set.


The GCC logistics document pain

Freight forwarders in Dubai, Jeddah, Doha, and Riyadh handle documents that are structurally hostile to automated extraction. A single shipment produces 8 to 14 PDFs: a Bill of Lading, commercial invoice, packing list, certificate of origin, customs declaration, insurance certificate, delivery order, and various free-zone or ministry forms. Roughly 70 percent of those documents are bilingual Arabic and English, and about 40 percent are scanned rather than digital (photographed on a phone, faxed, or printed and re-scanned).

At a mid-size forwarder handling 400 shipments per month, the data entry workload looks like this:

| Activity | Time per shipment | Monthly hours | Monthly cost |
|---|---|---|---|
| Document sorting and classification | 10 min | 67 | $1,340 |
| Manual data entry (BL, invoice, packing) | 45 min | 300 | $6,000 |
| Bilingual field verification | 20 min | 133 | $2,660 |
| Error correction and customs re-submission | 15 min | 100 | $2,000 |
| Total | 90 min | 600 | $12,000 |

That is 600 hours a month, roughly three and a half full-time equivalents, spent just typing numbers off PDFs into the Dubai Trade and Fasah portals. And 2026 makes this worse, not better.

Two hard deadlines

UAE e-invoicing, July 2026. The UAE begins phasing in mandatory structured e-invoices (PINT AE format, a UAE specialization of the Peppol PINT model) for VAT-registered businesses, starting with a pilot in July 2026. Freight forwarders will have to issue and receive invoices with machine-readable Arabic and English fields (tax identifier, line items, HS codes, currency, bilingual descriptions).

Saudi ZATCA Wave 24, June 30 2026. The Zakat, Tax and Customs Authority’s Phase 2 e-invoicing integration goes live for businesses with annual revenue above SAR 375,000. Every tax invoice must be cleared through the ZATCA Fatoora platform in real time, in XML format, with bilingual Arabic-English content.

Companies still running manual data entry hit both deadlines with a wall of rejected submissions. The fix is automated bilingual extraction that actually handles Arabic script correctly.


Why Arabic PDFs break most extractors

Three structural problems compound:

1. Right-to-left (RTL) reading order

Arabic reads right to left, but PDF content streams often store glyphs in visual order rather than logical order. A naive text extractor then returns characters in the wrong sequence, and words appear reversed or fragmented.

Example. The Arabic word شحنة (shipment) is stored in the content stream as four glyphs in visual right-to-left order. PyMuPDF’s default get_text() returns them in the order they appear in the stream, which may be ة ن ح ش (reversed). A downstream tool that splits on whitespace then sees gibberish.
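A minimal sketch of the reordering problem, using only the Python standard library. It assumes the simplest case, a word made up entirely of RTL characters that arrived in visual order. The helper name visual_to_logical_word is illustrative and is not part of PyMuPDF or PDFMux, and real documents need a full UAX #9 implementation rather than a plain reversal.

```python
import unicodedata

def is_rtl(ch: str) -> bool:
    # "R" covers Hebrew-style RTL letters, "AL" covers Arabic letters
    return unicodedata.bidirectional(ch) in ("R", "AL")

def visual_to_logical_word(word: str) -> str:
    """Reverse a word only if every character in it is strongly RTL."""
    if word and all(is_rtl(ch) for ch in word):
        return word[::-1]
    return word

# teh marbuta, noon, hah, sheen: the reversed order described above
visual = "\u0629\u0646\u062d\u0634"
logical = visual_to_logical_word(visual)

print(logical == "\u0634\u062d\u0646\u0629")  # True: sheen, hah, noon, teh marbuta
print(visual_to_logical_word("MSKU"))          # Latin text is left untouched
```

Escape sequences are used instead of literal Arabic so the stored character order is unambiguous regardless of how an editor renders the line.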

2. Bidirectional (bidi) text mixing

A typical GCC Bill of Lading mixes Arabic, English, numbers, and punctuation on the same line:

Port of Loading: ميناء جبل علي (Jebel Ali) - Container MSKU1234567

The Unicode Bidirectional Algorithm (UAX #9) defines how this should render visually, but storage order in the PDF does not always follow it. Naive extractors produce output that looks correct on screen and parses as nonsense.
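To make the mixing concrete, the sketch below splits that sample line into directional runs using the bidi classes from Python's unicodedata module. It is a simplification and not the full Unicode Bidirectional Algorithm: each whitespace-separated token takes the direction of its first strong character, and neutral tokens (digits, punctuation) are attached to the preceding run. The function names are illustrative.

```python
import unicodedata

def token_direction(token: str) -> str:
    """Direction of the first strongly typed character in the token."""
    for ch in token:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "neutral"  # only digits, punctuation, or symbols

def directional_runs(line: str):
    """Group consecutive tokens that share a direction into runs."""
    runs = []
    for token in line.split():
        direction = token_direction(token)
        if runs and direction in ("neutral", runs[-1][0]):
            runs[-1][1].append(token)  # same direction, or neutral: extend the run
        else:
            runs.append([direction, [token]])
    return [(direction, " ".join(tokens)) for direction, tokens in runs]

line = "Port of Loading: ميناء جبل علي (Jebel Ali) - Container MSKU1234567"
runs = directional_runs(line)
for direction, text in runs:
    print(direction, "|", text)
```

On this line the result is three runs: an LTR label, an RTL port name of three words, and an LTR tail containing the transliteration and container number. A parser that knows where the run boundaries are can keep the Arabic name and the container number as separate fields.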

3. Arabic ligatures and contextual shaping

Arabic letters change form based on position (initial, medial, final, isolated) and frequently combine into ligatures. In extracted text, a single semantic letter may therefore show up as one of four presentation-form code points, or as part of a multi-letter ligature (such as lam-alef) that must be decomposed. Extractors that skip Unicode compatibility normalization, for example unicodedata.normalize('NFKC', ...) in Python, return strings that break exact matching, search, and database lookups.
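A short standard-library illustration of the effect. The shaped string below spells the same word, shipment, using code points from the Arabic Presentation Forms-B block, which is one way shaped glyphs can surface in extracted text. The helper has_presentation_forms is an illustrative name, not a PDFMux function.

```python
import unicodedata

def has_presentation_forms(text: str) -> bool:
    # Arabic Presentation Forms-A (U+FB50..U+FDFF) and -B (U+FE70..U+FEFC)
    return any("\ufb50" <= ch <= "\ufdff" or "\ufe70" <= ch <= "\ufefc" for ch in text)

# sheen (initial), hah (medial), noon (medial), teh marbuta (final)
shaped = "\ufeb7\ufea4\ufee8\ufe94"
# the same word in base letters, as a database or search index would store it
base = "\u0634\u062d\u0646\u0629"

normalized = unicodedata.normalize("NFKC", shaped)

print(shaped == base)                      # False: exact matching fails
print(normalized == base)                  # True after NFKC
print(has_presentation_forms(shaped))      # True
print(has_presentation_forms(normalized))  # False

# The lam-alef ligature decomposes into its two base letters
lam_alef = unicodedata.normalize("NFKC", "\ufefb")
print([hex(ord(ch)) for ch in lam_alef])   # ['0x644', '0x627']
```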

On top of these three, scanned Arabic documents add the usual OCR problems: dots on the wrong letter (ب, ت, and ث differ only by dot count and position), broken baselines, and fonts that were not designed for legibility at low DPI.


How PDFMux handles bilingual extraction

PDFMux routes each page through a three-stage pipeline:

  1. PyMuPDF with explicit bidi processing. For digital pages, extract characters with get_text("dict"), reconstruct logical reading order using the python-bidi library, and normalize with NFKC so that Arabic ligatures and presentation forms are decomposed back to their base letters.
  2. Quality audit. Score the extracted text against language detection (fastText), character distribution, and layout consistency. Pages below a confidence threshold are flagged.
  3. Gemma 4 multimodal OCR fallback. Scanned or low-confidence pages get re-extracted by Gemma 4’s vision model, which natively handles 140+ languages including Arabic, Farsi, Urdu, and Pashto. No separate Tesseract pipeline, no manual language flag. See running PDF extraction locally with Gemma 4 for the full architecture.
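The routing decision between stages 2 and 3 can be sketched as follows. This is a deliberately crude stand-in and not PDFMux's implementation: quality_score, route_page, and the 0.85 threshold are hypothetical, and the score is simply the share of characters that look like text, in place of the fastText and layout checks described above.

```python
ALLOWED_PUNCTUATION = set(".,:;-()/")

def quality_score(text: str) -> float:
    """Share of non-space characters that are letters, digits, or common punctuation."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0  # an empty text layer usually means a scanned page
    good = sum(1 for ch in chars if ch.isalnum() or ch in ALLOWED_PUNCTUATION)
    return good / len(chars)

def route_page(text: str, threshold: float = 0.85) -> str:
    """Keep the digital extraction if it scores well, otherwise fall back to OCR."""
    return "digital" if quality_score(text) >= threshold else "ocr_fallback"

print(route_page("Port of Loading: ميناء جبل علي"))  # clean bilingual text
print(route_page(""))                                 # no text layer
print(route_page("\ufffd\ufffd\ufffd ab"))            # mostly replacement characters
```

The point of the design is that the expensive OCR stage only runs on pages where the cheap extraction is demonstrably poor.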

The combined pipeline scores 0.94 to 0.97 field-level accuracy on the 18 standard GCC shipping document types PDFMux tests against, including:

  • Bill of Lading (master and house)
  • Commercial invoice (bilingual)
  • Packing list
  • Certificate of origin (Chamber of Commerce formats for UAE, Saudi Arabia, Qatar, Kuwait, Oman, Bahrain)
  • Dubai Trade customs declaration
  • Saudi Fasah import manifest
  • ZATCA Phase 2 tax invoice (XML embedded)
  • Free-zone entry and exit permits

Extracting a bilingual Bill of Lading: working code

Here is a full example that processes a bilingual BL and produces structured JSON ready for customs submission.

from pdfmux import convert, extract_structured
from pdfmux.models import BillOfLading
import json

pdf_path = "data/bl-msku1234567.pdf"

# Step 1. Fast triage
analysis = convert.analyze(pdf_path)
print(f"Languages detected: {analysis.languages}")
print(f"Pages digital: {analysis.pages_digital}, scanned: {analysis.pages_scanned}")
# Output: Languages detected: ['ar', 'en']
# Output: Pages digital: 1, scanned: 1

# Step 2. Full extraction with bilingual mode enabled
result = convert.pdf(
    pdf_path,
    languages=["ar", "en"],
    bidi_mode="logical",        # reorder RTL to logical
    normalize_unicode="NFKC",   # decompose ligatures
    ocr_fallback=True,
)

print(f"Overall confidence: {result.overall_confidence:.3f}")
# Output: Overall confidence: 0.952

# Step 3. Structured extraction into a typed BL model
structured = extract_structured.as_model(
    pdf_path,
    schema=BillOfLading,
    languages=["ar", "en"],
)

print(json.dumps(structured.model_dump(), indent=2, ensure_ascii=False))

Sample output:

{
  "bl_number": "MSKU1234567",
  "bl_number_arabic": "ام اس كيو ١٢٣٤٥٦٧",
  "shipper": {
    "name_en": "Al Futtaim Logistics LLC",
    "name_ar": "الفطيم للخدمات اللوجستية ش.ذ.م.م",
    "address_en": "Jebel Ali Free Zone, Dubai, UAE",
    "address_ar": "المنطقة الحرة بجبل علي، دبي، الإمارات"
  },
  "consignee": {
    "name_en": "Saudi Industrial Export Co.",
    "name_ar": "الشركة السعودية للصادرات الصناعية"
  },
  "port_of_loading": {
    "code": "AEJEA",
    "name_en": "Jebel Ali",
    "name_ar": "جبل علي"
  },
  "port_of_discharge": {
    "code": "SADMM",
    "name_en": "Dammam",
    "name_ar": "الدمام"
  },
  "container_numbers": ["MSKU1234567", "MSKU7654321"],
  "gross_weight_kg": 18420.5,
  "freight_terms": "PREPAID",
  "issue_date": "2026-04-12",
  "confidence": 0.952
}

Both Arabic and English values are populated. The model returns the Arabic strings in their correct logical (not visual) order, already normalized, ready to insert into your database or submit to Dubai Trade.
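One cheap consistency check on output like this is to compare the digits in the Arabic and English variants of the same field. The sketch below uses only the standard library to map Arabic-Indic digits to ASCII. The helper name ascii_digits is illustrative, and the check is a suggestion layered on top of the extraction, not something the sample above shows PDFMux doing.

```python
import unicodedata

def ascii_digits(text: str) -> str:
    """Return every decimal digit in text as ASCII, in order of appearance."""
    digits = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        if value is not None:
            digits.append(str(value))
    return "".join(digits)

bl_number = "MSKU1234567"
# Arabic-Indic digits one through seven, as in the bl_number_arabic field
bl_number_arabic_digits = "\u0661\u0662\u0663\u0664\u0665\u0666\u0667"

print(ascii_digits(bl_number))                # 1234567
print(ascii_digits(bl_number_arabic_digits))  # 1234567
print(ascii_digits(bl_number) == ascii_digits(bl_number_arabic_digits))  # True
```

A mismatch between the two digit strings is a strong hint that one side was misread and the document should go to review.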


Mapping extracted fields to the 2026 compliance deadlines

UAE e-invoicing (PINT AE)

The UAE's PINT AE format requires the following invoice-level and line-level fields, with descriptive text in both Arabic and English:

| PINT AE field | PDFMux extraction |
|---|---|
| cbc:ID (invoice number) | structured.invoice_number |
| cbc:IssueDate | structured.issue_date |
| cac:AccountingSupplierParty | structured.supplier |
| cac:InvoiceLine/cbc:Note (bilingual description) | structured.line_items[].description_en + description_ar |
| cac:CommodityClassification (HS code) | structured.line_items[].hs_code |
| cbc:TaxAmount | structured.tax_amount |

Map the extraction result fields to the PINT AE XML template, validate against the FTA schema, and submit. PDFMux’s confidence score tells you which documents need manual review before submission.
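As a rough illustration of the mapping step, the sketch below builds a small UBL-style fragment with Python's xml.etree.ElementTree. The namespaces are the standard UBL 2.1 ones, but the fragment is far from a complete or schema-valid PINT AE invoice, the input dictionary is a hypothetical stand-in for the extraction result, and the values are made up for the example.

```python
import xml.etree.ElementTree as ET

INV = "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
CBC = "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
CAC = "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"

def build_invoice_fragment(fields: dict) -> str:
    """Map a few extracted fields onto UBL elements (illustrative subset only)."""
    ET.register_namespace("", INV)
    ET.register_namespace("cbc", CBC)
    ET.register_namespace("cac", CAC)

    invoice = ET.Element(f"{{{INV}}}Invoice")
    ET.SubElement(invoice, f"{{{CBC}}}ID").text = fields["invoice_number"]
    ET.SubElement(invoice, f"{{{CBC}}}IssueDate").text = fields["issue_date"]

    # In UBL, TaxAmount sits inside the TaxTotal aggregate and carries a currency
    tax_total = ET.SubElement(invoice, f"{{{CAC}}}TaxTotal")
    tax_amount = ET.SubElement(
        tax_total, f"{{{CBC}}}TaxAmount", currencyID=fields["currency"]
    )
    tax_amount.text = fields["tax_amount"]
    return ET.tostring(invoice, encoding="unicode")

# Made-up values standing in for an extraction result
example = {
    "invoice_number": "INV-0001",
    "issue_date": "2026-04-12",
    "currency": "AED",
    "tax_amount": "105.00",
}
xml_text = build_invoice_fragment(example)
print(xml_text)
```

A production mapping would fill the full template and validate it against the official schema before submission, as described above.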

Saudi ZATCA Phase 2

ZATCA requires a signed XML envelope cleared through Fatoora in real time. The schema is UBL 2.1 with Saudi extensions, and every human-readable field must be bilingual.

from pdfmux import extract_structured
from pdfmux.compliance import to_zatca_xml

result = extract_structured.invoice(
    "data/invoice-sa-2026-041.pdf",
    locale="sa",
    languages=["ar", "en"],
)

if result.confidence >= 0.90:
    xml = to_zatca_xml(result, seller_vat="300000000000003")
    # submit xml to Fatoora clearance endpoint
else:
    # route to human review queue
    print(f"Manual review required: confidence {result.confidence:.2f}")

The confidence gate is the part most teams skip. Submitting a low-confidence invoice to ZATCA produces a clearance rejection, which the portal counts against your compliance record. Gating by confidence catches these before they become compliance incidents.


Benchmarks on the GCC document set

PDFMux was tested against 18 document types, 200 documents per type, collected from real freight forwarder archives (PII redacted). A selection of the results:

| Document type | Field-level accuracy | Avg processing time |
|---|---|---|
| Bill of Lading (digital) | 0.972 | 1.8 s |
| Bill of Lading (scanned) | 0.943 | 4.2 s |
| Commercial invoice (bilingual) | 0.961 | 2.1 s |
| Packing list | 0.954 | 1.5 s |
| Certificate of origin (UAE) | 0.958 | 2.0 s |
| Certificate of origin (Saudi) | 0.949 | 2.3 s |
| Dubai Trade customs declaration | 0.967 | 1.9 s |
| Fasah import manifest | 0.941 | 2.4 s |
| ZATCA Phase 2 tax invoice | 0.978 | 1.7 s |
| Free-zone exit permit | 0.946 | 2.2 s |

Compared to Tesseract with ara+eng language packs (the typical open source baseline): PDFMux is 12 to 18 percent more accurate on digital pages and 22 to 28 percent more accurate on scanned pages, driven by Gemma 4’s multimodal OCR outperforming Tesseract on Arabic dot-count confusions and dense ligatures.
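For readers who want to reproduce this kind of number on their own documents, here is one plausible way to compute field-level accuracy. The article does not spell out the exact metric, so this definition (share of ground-truth fields matched exactly after NFKC normalization and whitespace trimming) is an assumption, and the function names and sample values are illustrative.

```python
import unicodedata

def _norm(value) -> str:
    # Normalize so presentation forms and stray whitespace do not count as errors
    return unicodedata.normalize("NFKC", str(value)).strip()

def field_accuracy(predicted: dict, expected: dict) -> float:
    """Share of expected fields whose predicted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(
        1
        for key, value in expected.items()
        if key in predicted and _norm(predicted[key]) == _norm(value)
    )
    return correct / len(expected)

expected = {
    "bl_number": "MSKU1234567",
    "freight_terms": "PREPAID",
    "issue_date": "2026-04-12",
    "port_of_loading_code": "AEJEA",
}
# One transposed date, everything else correct
predicted = dict(expected, issue_date="2026-04-21")

print(field_accuracy(predicted, expected))  # 0.75
```

Averaging this score over every document of a given type yields one figure per type, comparable in shape to the table above.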

Bottom line: on the GCC document set, PDFMux’s bilingual pipeline is the only open pipeline that produces submission-ready output for both Dubai Trade and ZATCA Fatoora without a human reviewer on the hot path.


Integration patterns for freight forwarders

Pattern 1: inbox to Dubai Trade

Watch a shared inbox, extract every PDF attachment, validate, and push to Dubai Trade.

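from pdfmux import batch
from dubai_trade import submit_declaration  # assumed to be your own Dubai Trade client module

def queue_for_human(file, confidence, warnings):
    # Placeholder: replace with your ticketing or review-queue integration
    print(f"Review needed: {file} (confidence {confidence:.2f}): {warnings}")

def process_new_mail(mail_folder):
    # Extract every PDF in the folder against the customs declaration schema
    results = batch.convert_directory(
        mail_folder,
        schema="customs_declaration",
        min_confidence=0.90,
    )
    # Submit only the results that cleared the confidence threshold
    for r in results.high_confidence:
        submit_declaration(r.structured)
    # Everything else goes to a person before it reaches the portal
    for r in results.needs_review:
        queue_for_human(r.file, r.confidence, r.warnings)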

Pattern 2: WhatsApp to spreadsheet

Freight forwarders in the GCC receive a lot of BLs by WhatsApp. Hook PDFMux into a WhatsApp Business API webhook and write extracted fields directly to a shared Google Sheet or Airtable.
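The spreadsheet half of this pattern mostly comes down to flattening the nested extraction result into one row per document. The sketch below does that with the standard csv module. The webhook and the Google Sheets or Airtable upload are left out, and flatten and to_csv are illustrative helper names rather than PDFMux functions.

```python
import csv
import io

def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested dicts into dotted column names and lists into joined strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat

def to_csv(records: list) -> str:
    """Write one row per record, with the union of all columns as the header."""
    rows = [flatten(record) for record in records]
    fieldnames = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# A subset of the sample Bill of Lading output shown earlier
sample = {
    "bl_number": "MSKU1234567",
    "port_of_loading": {"code": "AEJEA", "name_en": "Jebel Ali", "name_ar": "جبل علي"},
    "container_numbers": ["MSKU1234567", "MSKU7654321"],
}
print(to_csv([sample]))
```

Keeping name_en and name_ar as separate columns lets the ops team filter and sort on either language.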

Pattern 3: PDFMux MCP inside Claude Desktop

For ad-hoc operational work, the PDFMux MCP server lets an ops manager ask Claude:

“Open every PDF in ~/Downloads/shipments-week-16/ and tell me which ones have a consignee mismatch between the BL and the commercial invoice.”

Claude calls batch_convert, cross-checks the fields, and returns a list. No code, no pipeline, just a question.


What this changes for a 400-shipment forwarder

Taking the $12,000 per month baseline from the opening:

| Activity | Manual | With PDFMux | Savings |
|---|---|---|---|
| Document sorting and classification | 10 min | 1 min | 9 min |
| Data entry | 45 min | 1 min (review only) | 44 min |
| Bilingual verification | 20 min | 2 min | 18 min |
| Error correction | 15 min | 3 min | 12 min |
| Total per shipment | 90 min | 7 min | 83 min |
| Monthly cost | $12,000 | $930 | $11,070 |

About 550 hours a month, more than three full-time data entry roles, are redirected to customer operations. Submission rejection rates drop from the 4 to 7 percent range we see in manual pipelines to under 1 percent on PDFMux’s confidence-gated output.
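The arithmetic behind the monthly figures is easy to rerun with your own volumes. The sketch below assumes the $20 hourly rate implied by the baseline ($12,000 for 600 hours); swap in your own shipment count and labour cost.

```python
SHIPMENTS_PER_MONTH = 400
HOURLY_RATE_USD = 20  # implied by the baseline: $12,000 / 600 hours

def monthly_hours(minutes_per_shipment: float) -> float:
    return minutes_per_shipment * SHIPMENTS_PER_MONTH / 60

def monthly_cost(minutes_per_shipment: float) -> float:
    return monthly_hours(minutes_per_shipment) * HOURLY_RATE_USD

manual = monthly_cost(90)    # 90 min per shipment, manual
automated = monthly_cost(7)  # 7 min per shipment, with PDFMux

print(f"Manual:      ${manual:,.0f}")                  # $12,000
print(f"With PDFMux: ${automated:,.0f}")               # $933, shown as $930 in the table
print(f"Hours freed: {monthly_hours(90 - 7):.0f}")     # 553
```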


Conclusion

Bilingual Arabic-English PDFs are not an edge case in GCC logistics. They are the default. The 2026 e-invoicing deadlines in the UAE and Saudi Arabia turn this from an operational nuisance into a compliance risk. PDFMux handles the three structural problems (RTL order, bidi mixing, ligatures) with explicit bidi processing and Gemma 4 multimodal OCR, and exposes a confidence score so you know which documents are submission-ready and which need a human.

Install PDFMux, point it at a folder of BLs, and compare the accuracy numbers above with what you get on your own documents. That is the test that matters.