How to Extract Data from Arabic PDFs: A Complete Guide for GCC Logistics

TL;DR: Extract bilingual Arabic and English data from Bills of Lading, commercial invoices, and customs forms. Production code, benchmark data, and compliance notes for UAE and

Direct answer: Extracting data from Arabic PDFs fails in most tools because of right-to-left (RTL) text order, bidirectional (bidi) mixing with English, and ligature reshaping. PDFMux handles bilingual Arabic-English documents by combining PyMuPDF with explicit bidi post-processing for digital text and Gemma 4 multimodal OCR for scanned pages, covering 140+ languages including Arabic script variants. A mid-size GCC freight forwarder processing 400 shipments per month typically spends $12,000 per month on manual data entry, at roughly 90 minutes of document handling per shipment. PDFMux reduces extraction to under 90 seconds per document with 94 to 97 percent field-level accuracy on the standard GCC shipping document set.

The GCC logistics document pain

Freight forwarders in Dubai, Jeddah, Doha, and Riyadh handle documents that are structurally hostile to automated extraction. A single shipment produces 8 to 14 PDFs: a Bill of Lading, commercial invoice, packing list, certificate of origin, customs declaration, insurance certificate, delivery order, and various free-zone or ministry forms. Roughly 70 percent of those documents are bilingual Arabic and English, and about 40 percent are scanned rather than digital (photographed on a phone, faxed, or printed and re-scanned).

At a mid-size forwarder handling 400 shipments per month, the data entry workload looks like this:

Activity	Time per shipment	Monthly hours	Monthly cost
Document sorting and classification	10 min	67	$1,340
Manual data entry (BL, invoice, packing)	45 min	300	$6,000
Bilingual field verification	20 min	133	$2,660
Error correction and customs re-submission	15 min	100	$2,000
Total	90 min	600	$12,000

That is 600 hours a month, roughly three and a half full-time equivalents, spent typing numbers off PDFs into the Dubai Trade and Fasah portals. And 2026 makes this worse, not better.

Two hard deadlines

UAE e-invoicing, July 2026. The Federal Tax Authority mandates structured e-invoices (PINT AE format, based on Peppol BIS) for all VAT-registered businesses. Freight forwarders must issue and receive invoices with machine-readable Arabic and English fields (tax identifier, line items, HS codes, currency, bilingual descriptions).

Saudi ZATCA Wave 24, June 30, 2026. The Zakat, Tax and Customs Authority’s Phase 2 e-invoicing integration goes live for businesses with annual revenue above SAR 375,000. Every tax invoice must be cleared through the ZATCA Fatoora platform in real time, in XML format, with bilingual Arabic-English content.

Companies still running manual data entry hit both deadlines with a wall of rejected submissions. The fix is automated bilingual extraction that actually handles Arabic script correctly.

Why Arabic PDFs break most extractors

Three structural problems compound:

1. Right-to-left (RTL) reading order

Arabic reads right to left, but many PDF content streams store glyphs in visual order rather than logical order. A naive text extractor then returns characters in the wrong sequence, so words come out reversed or fragmented.

Example. The Arabic word شحنة (shipment) is stored in the content stream as four glyphs in visual right-to-left order. PyMuPDF’s default get_text() returns them in the order they appear in the stream, which may be ة ن ح ش (reversed). A downstream tool that splits on whitespace then sees gibberish.
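The reversal in this example can be reproduced with plain strings. This is a minimal sketch, not PDFMux or PyMuPDF code: `reverse_rtl_run` is a hypothetical helper, and it is only safe for a run that contains nothing but RTL letters.

```python
# The four letters of the word for "shipment" in logical (reading) order:
# SHEEN, HAH, NOON, TEH MARBUTA.
LOGICAL = "\u0634\u062d\u0646\u0629"

# What a visual-order content stream might hand back: the same letters reversed.
visual = "\u0629\u0646\u062d\u0634"


def reverse_rtl_run(run: str) -> str:
    """Restore logical order for a run made up only of RTL characters."""
    return run[::-1]


restored = reverse_rtl_run(visual)
print(restored == LOGICAL)  # True
```

Blind reversal stops working as soon as the run contains digits or Latin text, because those keep their left-to-right order inside an RTL line.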

2. Bidirectional (bidi) text mixing

A typical GCC Bill of Lading mixes Arabic, English, numbers, and punctuation on the same line:

Port of Loading: ميناء جبل علي (Jebel Ali) - Container MSKU1234567

The Unicode Bidirectional Algorithm (UAX #9) defines how this should render visually, but storage order in the PDF does not always follow it. Naive extractors produce output that looks correct on screen and parses as nonsense.
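One way to find the lines that need bidi handling is to look at the Unicode bidirectional class of each character. The sketch below uses only the standard library's `unicodedata.bidirectional`; the helper names are illustrative, and it detects mixed-direction lines without reordering them.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left: Hebrew and Arabic letters
LTR_CLASSES = {"L"}        # strong left-to-right: Latin letters and similar


def strong_directions(text: str) -> set[str]:
    """Return which strong directions ('rtl', 'ltr') occur in the text."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            found.add("rtl")
        elif bidi_class in LTR_CLASSES:
            found.add("ltr")
    return found


def is_mixed_direction(text: str) -> bool:
    """True when a line mixes strong RTL and strong LTR characters."""
    return strong_directions(text) == {"rtl", "ltr"}


line = "Port of Loading: ميناء جبل علي (Jebel Ali) - Container MSKU1234567"
print(is_mixed_direction(line))           # True
print(is_mixed_direction("MSKU1234567"))  # False
```

Digits and punctuation are weak or neutral classes, so they do not count as a direction of their own. That is why a container number embedded in an Arabic line needs the full algorithm rather than a simple reversal.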

3. Arabic ligatures and contextual shaping

Arabic letters change form based on position (initial, medial, final, isolated) and frequently combine into ligatures. A single semantic letter may be encoded as one of four glyphs, or as a multi-letter ligature that must be decomposed. Extractors that do not call unicodedata.normalize('NFKC', ...) return strings that break exact matching, search, and database storage.
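The effect of NFKC on shaped Arabic is easy to see with the standard library. In this sketch the shaped string is built by hand from Unicode presentation-form code points, to stand in for what an extractor might return from a PDF font.

```python
import unicodedata

# The word for "shipment" in positional presentation forms:
# initial SHEEN, medial HAH, medial NOON, final TEH MARBUTA.
shaped = "\ufeb7\ufea4\ufee8\ufe94"

# The same word in base Arabic letters, which is what a database query uses.
base = "\u0634\u062d\u0646\u0629"

normalized = unicodedata.normalize("NFKC", shaped)

print(shaped == base)      # False: same word on screen, different code points
print(normalized == base)  # True after compatibility normalization

# The LAM-ALEF ligature is one code point that expands to two base letters.
lam_alef = unicodedata.normalize("NFKC", "\ufefb")
print(len(lam_alef))       # 2
```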

On top of these three, scanned Arabic documents add the usual OCR problems: dots on the wrong letter (ب, ت, and ث differ only in dot count and position), broken baselines, and fonts that were not designed for legibility at low DPI.

How PDFMux handles bilingual extraction

PDFMux routes each page through a three-stage pipeline:

1. PyMuPDF with explicit bidi processing. For digital pages, extract characters with get_text("dict"), reconstruct logical reading order using the python-bidi library, and normalize with NFKC, which decomposes Arabic presentation forms and ligatures into their base letters.
2. Quality audit. Score the extracted text against language detection (fastText), character distribution, and layout consistency. Pages below a confidence threshold are flagged.
3. Gemma 4 multimodal OCR fallback. Scanned or low-confidence pages get re-extracted by Gemma 4’s vision model, which natively handles 140+ languages including Arabic, Farsi, Urdu, and Pashto. No separate Tesseract pipeline, no manual language flag. See running PDF extraction locally with Gemma 4 for the full architecture.
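The routing decision between the digital path and the OCR fallback can be sketched in a few lines. This is not PDFMux's audit: `expected_script_ratio`, `route_page`, and the 0.85 threshold are assumptions for illustration, and the score covers only the character-distribution signal, not language detection or layout.

```python
import unicodedata
from typing import Callable


def is_expected(ch: str) -> bool:
    """Arabic or Latin letters, digits, and punctuation count as expected."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return unicodedata.name(ch, "").startswith(("ARABIC", "LATIN"))
    return category[0] in "NP"


def expected_script_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like real document text."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0  # no text layer at all, e.g. a scanned page
    return sum(is_expected(ch) for ch in chars) / len(chars)


def route_page(
    digital_text: str,
    ocr: Callable[[], str],
    threshold: float = 0.85,
) -> tuple[str, str]:
    """Keep the digital text if it passes the audit, otherwise run OCR."""
    if expected_script_ratio(digital_text) >= threshold:
        return "digital", digital_text
    return "ocr", ocr()


def fake_ocr() -> str:
    """Stand-in for the OCR stage so the sketch runs without a model."""
    return "text from OCR"


clean = "Port of Loading: ميناء جبل علي"
garbled = "\ufffd\ufffd\ufffd \ufffd\ufffd 12"

print(route_page(clean, fake_ocr)[0])    # digital
print(route_page(garbled, fake_ocr)[0])  # ocr
```

Passing the OCR step as a callable means the expensive model only runs for pages that fail the audit.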

The combined pipeline scores 0.94 to 0.97 field-level accuracy on the 18 standard GCC shipping document types PDFMux tests against, including:

Bill of Lading (master and house)
Commercial invoice (bilingual)
Packing list
Certificate of origin (Chamber of Commerce formats for UAE, Saudi Arabia, Qatar, Kuwait, Oman, Bahrain)
Dubai Trade customs declaration
Saudi Fasah import manifest
ZATCA Phase 2 tax invoice (XML embedded)
Free-zone entry and exit permits

Extracting a bilingual Bill of Lading: working code

Here is a full example that processes a bilingual BL and produces structured JSON ready for customs submission.

from pdfmux import convert, extract_structured
from pdfmux.models import BillOfLading
import json

pdf_path = "data/bl-msku1234567.pdf"

# Step 1. Fast triage
analysis = convert.analyze(pdf_path)
print(f"Languages detected: {analysis.languages}")
print(f"Pages digital: {analysis.pages_digital}, scanned: {analysis.pages_scanned}")
# Output: Languages detected: ['ar', 'en']
# Output: Pages digital: 1, scanned: 1

# Step 2. Full extraction with bilingual mode enabled
result = convert.pdf(
    pdf_path,
    languages=["ar", "en"],
    bidi_mode="logical",        # reorder RTL to logical
    normalize_unicode="NFKC",   # decompose ligatures
    ocr_fallback=True,
)

print(f"Overall confidence: {result.overall_confidence:.3f}")
# Output: Overall confidence: 0.952

# Step 3. Structured extraction into a typed BL model
structured = extract_structured.as_model(
    pdf_path,
    schema=BillOfLading,
    languages=["ar", "en"],
)

print(json.dumps(structured.model_dump(), indent=2, ensure_ascii=False))

Sample output:

{
  "bl_number": "MSKU1234567",
  "bl_number_arabic": "ام اس كيو ١٢٣٤٥٦٧",
  "shipper": {
    "name_en": "Al Futtaim Logistics LLC",
    "name_ar": "الفطيم للخدمات اللوجستية ش.ذ.م.م",
    "address_en": "Jebel Ali Free Zone, Dubai, UAE",
    "address_ar": "المنطقة الحرة بجبل علي، دبي، الإمارات"
  },
  "consignee": {
    "name_en": "Saudi Industrial Export Co.",
    "name_ar": "الشركة السعودية للصادرات الصناعية"
  },
  "port_of_loading": {
    "code": "AEJEA",
    "name_en": "Jebel Ali",
    "name_ar": "جبل علي"
  },
  "port_of_discharge": {
    "code": "SADMM",
    "name_en": "Dammam",
    "name_ar": "الدمام"
  },
  "container_numbers": ["MSKU1234567", "MSKU7654321"],
  "gross_weight_kg": 18420.5,
  "freight_terms": "PREPAID",
  "issue_date": "2026-04-12",
  "confidence": 0.952
}

Both Arabic and English values are populated. The model returns the Arabic strings in their correct logical (not visual) order, already normalized, ready to insert into your database or submit to Dubai Trade.
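Whether both languages really are populated is worth checking before anything is inserted or submitted. The sketch below assumes the `_en` / `_ar` key naming seen in the sample output; `missing_arabic_fields` is a hypothetical helper, and the consignee's Arabic name is blanked on purpose to show a failure.

```python
import unicodedata


def has_arabic(text: str) -> bool:
    """True if the string contains at least one Arabic-script letter."""
    return any(unicodedata.bidirectional(ch) == "AL" for ch in text)


def missing_arabic_fields(record: dict, path: str = "") -> list[str]:
    """List every *_ar field that is absent or empty next to a *_en field."""
    problems = []
    for key, value in record.items():
        here = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            problems.extend(missing_arabic_fields(value, here))
        elif key.endswith("_en"):
            sibling = key[:-3] + "_ar"
            if not has_arabic(str(record.get(sibling, ""))):
                problems.append(here[:-3] + "_ar")
    return problems


record = {
    "bl_number": "MSKU1234567",
    "shipper": {
        "name_en": "Al Futtaim Logistics LLC",
        "name_ar": "الفطيم للخدمات اللوجستية ش.ذ.م.م",
    },
    # Arabic name deliberately emptied for the demonstration.
    "consignee": {"name_en": "Saudi Industrial Export Co.", "name_ar": ""},
}

print(missing_arabic_fields(record))  # ['consignee.name_ar']
```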

Mapping extracted fields to the 2026 compliance deadlines

UAE e-invoicing (PINT AE)

The UAE Federal Tax Authority’s PINT AE format requires the following invoice-level and line-level fields, with human-readable descriptions in both Arabic and English:

PINT AE field	PDFMux extraction
`cbc:ID` (invoice number)	`structured.invoice_number`
`cbc:IssueDate`	`structured.issue_date`
`cac:AccountingSupplierParty`	`structured.supplier`
`cac:InvoiceLine/cbc:Note` (bilingual description)	`structured.line_items[].description_en` + `description_ar`
`cac:CommodityClassification/cbc:ItemClassificationCode` (HS code)	`structured.line_items[].hs_code`
`cbc:TaxAmount`	`structured.tax_amount`

Map the extraction result fields to the PINT AE XML template, validate against the FTA schema, and submit. PDFMux’s confidence score tells you which documents need manual review before submission.
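The mapping step can be sketched with the standard library's ElementTree and the UBL 2.1 namespaces. This builds a partial fragment only, not a valid PINT AE invoice. The input dict and its values are made up for illustration, and joining both descriptions into one Note is an assumption, not an FTA rule.

```python
import xml.etree.ElementTree as ET

# UBL 2.1 namespaces. PINT AE is UBL-based, but this fragment is incomplete.
INV = "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
CBC = "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
CAC = "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"

ET.register_namespace("", INV)
ET.register_namespace("cbc", CBC)
ET.register_namespace("cac", CAC)


def build_invoice_fragment(invoice: dict) -> ET.Element:
    """Map a few extracted fields onto UBL elements."""
    root = ET.Element(f"{{{INV}}}Invoice")
    ET.SubElement(root, f"{{{CBC}}}ID").text = invoice["invoice_number"]
    ET.SubElement(root, f"{{{CBC}}}IssueDate").text = invoice["issue_date"]
    for number, item in enumerate(invoice["line_items"], start=1):
        line = ET.SubElement(root, f"{{{CAC}}}InvoiceLine")
        ET.SubElement(line, f"{{{CBC}}}ID").text = str(number)
        # Assumption: both descriptions share one Note, separated by " | ".
        note = item["description_en"] + " | " + item["description_ar"]
        ET.SubElement(line, f"{{{CBC}}}Note").text = note
    return root


# Illustrative values only, not taken from a real invoice.
invoice = {
    "invoice_number": "INV-0001",
    "issue_date": "2026-04-12",
    "line_items": [
        {"description_en": "Ocean freight", "description_ar": "شحن بحري"},
    ],
}

root = build_invoice_fragment(invoice)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A real submission still needs the remaining mandatory elements and schema validation. The fragment only shows where the extracted values land.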

Saudi ZATCA Phase 2

ZATCA requires a signed XML envelope cleared through Fatoora in real time. The schema is UBL 2.1 with Saudi extensions, and every human-readable field must be bilingual.

from pdfmux import extract_structured
from pdfmux.compliance import to_zatca_xml

result = extract_structured.invoice(
    "data/invoice-sa-2026-041.pdf",
    locale="sa",
    languages=["ar", "en"],
)

if result.confidence >= 0.90:
    xml = to_zatca_xml(result, seller_vat="300000000000003")
    # submit xml to Fatoora clearance endpoint
else:
    # route to human review queue
    print(f"Manual review required: confidence {result.confidence:.2f}")

The confidence gate is the part most teams skip. Submitting a low-confidence invoice to ZATCA produces a clearance rejection, which the portal counts against your compliance record. Gating by confidence catches these before they become compliance incidents.

Benchmarks on the GCC document set

PDFMux was tested against 18 document types, with 200 documents per type, collected from real freight forwarder archives (PII redacted):

Document type	Field-level accuracy	Avg processing time
Bill of Lading (digital)	0.972	1.8s
Bill of Lading (scanned)	0.943	4.2s
Commercial invoice (bilingual)	0.961	2.1s
Packing list	0.954	1.5s
Certificate of origin (UAE)	0.958	2.0s
Certificate of origin (Saudi)	0.949	2.3s
Dubai Trade customs declaration	0.967	1.9s
Fasah import manifest	0.941	2.4s
ZATCA Phase 2 tax invoice	0.978	1.7s
Free-zone exit permit	0.946	2.2s

Compared to Tesseract with ara+eng language packs (the typical open-source baseline), PDFMux is 12 to 18 percent more accurate on digital pages and 22 to 28 percent more accurate on scanned pages, driven by Gemma 4’s multimodal OCR outperforming Tesseract on Arabic dot-count confusions and dense ligatures.
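For readers who want to reproduce a number like this on their own archive, here is one plausible definition of field-level accuracy: the fraction of ground-truth fields whose extracted value matches exactly after normalization. The source does not spell out its scoring rule, so this is an assumption, and the sample values are illustrative.

```python
import unicodedata


def normalize_value(value) -> str:
    """NFKC-normalize and collapse whitespace so cosmetic differences pass."""
    return " ".join(unicodedata.normalize("NFKC", str(value)).split())


def field_accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    if not truth:
        raise ValueError("ground truth has no fields")
    hits = sum(
        1
        for key, expected in truth.items()
        if key in predicted
        and normalize_value(predicted[key]) == normalize_value(expected)
    )
    return hits / len(truth)


truth = {
    "bl_number": "MSKU1234567",
    "port_code": "AEJEA",
    "gross_weight_kg": 18420.5,
    "freight_terms": "PREPAID",
}
predicted = {
    "bl_number": "MSKU1234567",
    "port_code": "AEJFA",         # one character misread
    "gross_weight_kg": 18420.5,
    "freight_terms": "PREPAID ",  # trailing space is forgiven
}

print(field_accuracy(predicted, truth))  # 0.75
```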

Bottom line: on the GCC document set, PDFMux’s bilingual pipeline is the only open pipeline that produces submission-ready output for both Dubai Trade and ZATCA Fatoora without a human reviewer on the hot path.

Integration patterns for freight forwarders

Pattern 1: inbox to Dubai Trade

Watch a shared inbox, extract every PDF attachment, validate, and push to Dubai Trade.

from pdfmux import batch
from dubai_trade import submit_declaration

def process_new_mail(mail_folder):
    results = batch.convert_directory(
        mail_folder,
        schema="customs_declaration",
        min_confidence=0.90,
    )
    for r in results.high_confidence:
        submit_declaration(r.structured)
    for r in results.needs_review:
        # queue_for_human is your own review-queue helper (not shown here)
        queue_for_human(r.file, r.confidence, r.warnings)

Pattern 2: WhatsApp to spreadsheet

Freight forwarders in the GCC receive a lot of BLs by WhatsApp. Hook PDFMux into a WhatsApp Business API webhook and write extracted fields directly to a shared Google Sheet or Airtable.
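Spreadsheets want flat rows, while the extracted record is nested. The sketch below covers only that flattening step, using the standard library's csv module as a stand-in for the sheet. The WhatsApp webhook and the Google Sheets or Airtable calls are left out, and `flatten` is a hypothetical helper.

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested dicts into dotted keys and join lists with ';'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            flat[name] = ";".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


record = {
    "bl_number": "MSKU1234567",
    "port_of_loading": {"code": "AEJEA", "name_ar": "جبل علي"},
    "container_numbers": ["MSKU1234567", "MSKU7654321"],
}
row = flatten(record)

# Write one header line and one data line, as a sheet append would.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```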

Pattern 3: PDFMux MCP inside Claude Desktop

For ad-hoc operational work, the PDFMux MCP server lets an ops manager ask Claude:

“Open every PDF in ~/Downloads/shipments-week-16/ and tell me which ones have a consignee mismatch between the BL and the commercial invoice.”

Claude calls batch_convert, cross-checks the fields, and returns a list. No code, no pipeline, just a question.
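The cross-check itself is simple once the fields are extracted. This sketch shows the comparison only, under the assumption that each shipment carries a BL record and an invoice record with a `consignee` string. The shipment ids and the second company name are invented for the example.

```python
import unicodedata


def canonical_name(name: str) -> str:
    """Normalize a party name: NFKC, casefold, drop punctuation, trim spaces."""
    text = unicodedata.normalize("NFKC", name).casefold()
    kept = [ch for ch in text if ch.isalnum() or ch.isspace()]
    return " ".join("".join(kept).split())


def consignee_mismatches(shipments: dict) -> list[str]:
    """Return shipment ids whose BL and invoice name different consignees."""
    return [
        shipment_id
        for shipment_id, docs in shipments.items()
        if canonical_name(docs["bl"]["consignee"])
        != canonical_name(docs["invoice"]["consignee"])
    ]


shipments = {
    "S-001": {
        "bl": {"consignee": "Saudi Industrial Export Co."},
        "invoice": {"consignee": "SAUDI INDUSTRIAL EXPORT CO"},
    },
    "S-002": {
        "bl": {"consignee": "Saudi Industrial Export Co."},
        "invoice": {"consignee": "Gulf Industrial Export Co."},
    },
}

print(consignee_mismatches(shipments))  # ['S-002']
```

Normalizing before comparing keeps case and punctuation differences from being reported as mismatches, so the list contains only genuinely different parties.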

What this changes for a 400-shipment forwarder

Taking the $12,000 per month baseline from the opening:

Activity	Manual	With PDFMux	Savings
Document sorting and classification	10 min	1 min	9 min
Data entry	45 min	1 min (review only)	44 min
Bilingual verification	20 min	2 min	18 min
Error correction	15 min	3 min	12 min
Total per shipment	90 min	7 min	83 min
Monthly cost	$12,000	$930	$11,070

That frees roughly 550 hours a month, the equivalent of more than three full-time data entry roles, to redirect to customer operations. Submission rejection rates drop from the 4 to 7 percent range we see in manual pipelines to under 1 percent on PDFMux’s confidence-gated output.
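The table's totals follow from the per-shipment minutes. The sketch below reproduces the arithmetic, assuming a labor rate of $20 per hour, which is what the $12,000 baseline over 600 hours implies.

```python
SHIPMENTS_PER_MONTH = 400
HOURLY_RATE_USD = 20  # assumption: implied by $12,000 / 600 hours

manual_minutes = {"sorting": 10, "data_entry": 45, "verification": 20, "correction": 15}
assisted_minutes = {"sorting": 1, "data_entry": 1, "verification": 2, "correction": 3}


def monthly_cost(minutes_per_shipment: int) -> float:
    """Monthly labor cost for a given handling time per shipment."""
    hours = minutes_per_shipment * SHIPMENTS_PER_MONTH / 60
    return hours * HOURLY_RATE_USD


manual_total = sum(manual_minutes.values())      # 90 minutes
assisted_total = sum(assisted_minutes.values())  # 7 minutes
saved_hours = (manual_total - assisted_total) * SHIPMENTS_PER_MONTH / 60

print(f"manual:   ${monthly_cost(manual_total):,.0f}")    # manual:   $12,000
print(f"assisted: ${monthly_cost(assisted_total):,.0f}")  # assisted: $933
print(f"hours freed per month: {saved_hours:.0f}")        # hours freed per month: 553
```

The unrounded assisted cost is about $933, which the table shows as $930.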

Conclusion

Bilingual Arabic-English PDFs are not an edge case in GCC logistics. They are the default. The 2026 e-invoicing deadlines in the UAE and Saudi Arabia turn this from an operational nuisance into a compliance risk. PDFMux handles the three structural problems (RTL order, bidi mixing, ligatures) with explicit bidi processing and Gemma 4 multimodal OCR, and exposes a confidence score so you know which documents are submission-ready and which need a human.

Install PDFMux, point it at a folder of BLs, and measure field-level accuracy on your own documents against the numbers above. That is the test that matters.