Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain.

DOMAIN
BLEEDING EDGE → ESTABLISHED

Document & diagram understanding

LEADING EDGE
ALSO IN: 👁️ Computer Vision & Sensing · 🔄 Operations & Process Automation

TRAJECTORY

Stalled

AI that understands complex documents, diagrams, handwriting, and degraded or historical texts using vision-language models and specialised OCR. Includes architectural drawing interpretation and historical manuscript digitisation; distinct from standard document processing which handles structured forms and clear printed text.

OVERVIEW

Document and diagram understanding remains bifurcated between proven adoption in specialised contexts and unresolved limitations in horizontal AI approaches. Institutional deployments across cultural heritage, finance, and government continue delivering measurable ROI, but the fundamental constraint that blocked horizontal adoption in 2025 persists into June 2026: vision-language models systematically fail at diagram comprehension and rely on textual priors rather than visual grounding. This structural gap explains both the leading-edge status and the stalled trajectory. Forward-leaning organisations (archives transcribing medieval manuscripts at a 9.7% character error rate, governments digitising millions of handwritten records, financial services processing loans in under 2 minutes) are scaling document understanding to institutional production use. But most organisations remain on manual workflows, and the general-purpose AI that could unlock horizontal adoption does not yet exist.

VLMs have crossed the threshold on text extraction in complex documents (90-99% accuracy on invoices and forms is now routine), but diagram understanding remains categorically broken. Recent research (May-June 2026) confirms the problem: frontier models achieve 51% accuracy on architectural object counting while maintaining 95% on text extraction, a 44-point gap indicating that symbol-centric reasoning is unreliable. On engineering diagrams, models recall 61-87% of parts, yet Token F1 on the corresponding descriptions reaches only 3-18%. VLM failure on diagrams is not a tuning problem but a fundamental misalignment between how language models encode visual information and how diagrams encode structural relationships. Seventy percent of manufacturers still extract engineering tolerance data by hand. Until this visual-language gap closes, the practice will remain segmented: specialised tools (Transkribus, fine-tuned models, hybrid approaches) deliver production value, while manual workflows persist wherever diagram understanding is required.

CURRENT LANDSCAPE

Transkribus dominates the cultural heritage segment at scale: 90 million images processed across 227 cooperative members in 30 countries. June 2026 deployments reinforce sustained adoption. Inria's CoMMa project transcribed 32,763 medieval manuscripts in 4 months at a 9.7% character error rate, producing a 3-billion-word corpus and confirming production-scale HTR for low-resource historical scripts. The University of Georgia's Hargrett Library used Transkribus plus a custom Python workflow to transcribe 20,000+ Colonial-era pages in under two months with a 2-person team, establishing a reusable institutional model. University of South Carolina Libraries processed 100,000+ handwritten pages with JSTOR Seeklight AI at 97% accuracy, demonstrating mainstream adoption in academic institutions. The Vatican Library deployed ResNet-18 and Swin Transformer models on medieval manuscripts, achieving >80% accuracy on scribe identification under the explainability requirements of humanistic scholarship. U of T/UCL researchers applied Transkribus to 13th-century Latin legal manuscripts, overcoming medieval abbreviations through collaborative retraining. Government deployments continue: India's Gyan Bharatam Mission documented 4.4M+ manuscripts with ₹491.66 crore in funding through 2031; King County, WA cut document redaction time from 30 minutes to under five seconds at 96% accuracy.
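Character error rate (CER), the metric behind the 9.7% figure above, is conventionally computed as the edit distance between a reference transcription and the model output, divided by the reference length. The sketch below illustrates that convention only; it is not the evaluation code of any project named here, and the function names and sample strings are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: a ground-truth line and an imperfect transcription.
reference = "in nomine domini amen"
hypothesis = "in nomme domini amen"
print(f"{cer(reference, hypothesis):.3f}")  # 0.095 (2 edits over 21 characters)
```

Read this way, a 9.7% CER corresponds to roughly one character edit per ten reference characters.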

The enterprise market accelerated through June 2026. Gartner's inaugural Magic Quadrant for Intelligent Document Processing (September 2025) identified 5 Leaders (ABBYY, Hyperscience, Infrrd, Tungsten, UiPath), with accuracy converged at 90-99% and audit trail emerging as the primary differentiator. IDP adoption reached 63% of the Fortune 250, with the market sized at $4.31B (33% CAGR). Gartner data shows 67% of enterprises now evaluating agentic approaches, versus 23% two years ago. VLM-based invoice processing achieves 85-94% accuracy at $1.20 per document. Production scale-ups: ArcelorMittal processes 300,000+ invoices annually at 90% accuracy, with processing time reduced from 7-10 days to 1 day; M2P Fintech deployed a Document Intelligence Agent that cut processing time from 18-24 hours to under 2 minutes, raised accuracy from 85-90% to 95%+, and lowered per-application cost from ₹800-1,200 to ₹80-150 while handling 150,000 pages/hour; Nevada County deployed the Chandra model on 200K+ Gold Rush documents at a 150X speedup (3 weeks → 2 hours), with 95-98% accuracy on modern and 90% on complex historical handwriting.
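The before and after ranges quoted above imply bounds rather than single figures. The sketch below works through that arithmetic using only the M2P Fintech numbers from this section. The helper names are hypothetical, and the speedup bound assumes "under 2 minutes" is treated as exactly 2 minutes.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from a 'before' value to an 'after' value."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster 'after' is than 'before' (same units)."""
    return before / after


# Per-application cost: ₹800-1,200 before, ₹80-150 after.
smallest_cut = reduction_pct(before=800, after=150)   # cheapest old vs dearest new
largest_cut = reduction_pct(before=1200, after=80)    # dearest old vs cheapest new
print(f"cost reduction: {smallest_cut:.1f}% to {largest_cut:.1f}%")

# Turnaround: 18-24 hours before, under 2 minutes after.
# Treating the new figure as 2 minutes gives a lower bound on the speedup.
low = speedup(before=18 * 60, after=2)
high = speedup(before=24 * 60, after=2)
print(f"speedup: at least {low:.0f}x to {high:.0f}x")
```

Stating both ends of the range avoids overstating the gain when only the most favourable pairing is quoted.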

Cloud platform maturation continued through Q2 2026. Microsoft Azure Content Understanding reached GA (March 2026) with a 40% accuracy improvement via labeled examples, and named customers (DataSnipper, FinHero, Wolters Kluwer) confirm deployment value. Databricks released the ai_extract and ai_classify functions as native GA capabilities (June 2026), integrating document understanding into core data platform workflows. The UiPath Helix model family reached GA (May 2026) with improved extraction and classification. The vendor ecosystem also expanded: ABBYY FineReader added layout analysis, handwritten and Chinese recognition, and LLM integration; Google Cloud Document AI released quality scoring, digital PDF support, and model versioning, with named deployments (Jack Henry, PwC, Mr. Cooper). Model capability expanded as well: June 2026 research shows a frontier model (Gemini 3.1 Pro) achieving 97.91-98.51% character accuracy on classical Arabic scripts (naskh, ruq'ah, ta'liq), indicating HTR advances for specialized non-Latin paleography.

Critical limitations persist, and recent research crystallizes them. Diagram understanding remains fundamentally broken for general-purpose VLMs: AECV-bench (May 2026) shows the best model (Gemini 3 Pro) achieving 51% accuracy on architectural object counting versus 95% on text extraction, a 44-point gap that exposes symbol recognition as unreliable. The Enginuity benchmark (June 2026) confirms the pattern: frontier models reach Recall@all of 0.61-0.87 on engineering diagram parts but Token F1 of only 0.03-0.18 on descriptions, quantifying the relationship-reasoning failure. A June 2026 study documents that VLMs systematically rely on textual priors over visual grounding: proprietary models show 27-38% gaps between Vision-Grounded and baseline variants, signaling fundamental visual-language misalignment. Handwriting OCR accuracy varies 63-99% across platforms (block ~95%, cursive ~45%), depending heavily on writing style and document type. Layout analysis is emerging as a critical bottleneck: DFG/AHRC-funded research on Tibetan newspapers documents Transkribus failing on dense multi-script layouts, requiring a custom TransYolo solution. Non-Latin script accuracy remains dependent on fine-tuning; specialized deployments on Tamil, Arabic, and Urdu scripts confirm that HTR maturity is concentrated in high-resource domains. Hybrid human-in-the-loop workflows remain the production standard. Azure Document Intelligence reliability issues persist (May 2026 outages, extraction service hangs), constraining enterprise adoption. A systematic review of OCR evaluation (2006-2025) documents structural bias: evaluation frameworks center on modern Western documents, leaving historical and marginalized materials systematically underrepresented in maturity assessments.
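To see how a model can score well on part recall yet poorly on description Token F1, the sketch below uses generic definitions of both metrics: set recall over part names, and bag-of-tokens F1 over free text. These definitions are assumptions for illustration and may differ from the Enginuity benchmark's own; the part names and sentences are hypothetical.

```python
from collections import Counter


def recall(predicted: set[str], reference: set[str]) -> float:
    """Share of reference items that appear among the predictions."""
    if not reference:
        raise ValueError("reference set must be non-empty")
    return len(predicted & reference) / len(reference)


def token_f1(predicted: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting each
    token at most as often as it occurs in both texts."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    rec = overlap / len(ref_tokens)
    return 2 * precision * rec / (precision + rec)


# Hypothetical diagram: the model names most parts correctly...
ref_parts = {"bolt", "washer", "bracket", "shaft"}
pred_parts = {"bolt", "washer", "bracket", "gear"}
print(f"part recall: {recall(pred_parts, ref_parts):.2f}")  # 0.75

# ...but describes how they relate almost entirely wrong.
ref_desc = "bracket fastened to shaft by two bolts with washers"
pred_desc = "the bracket holds a gear beside the motor housing"
print(f"description token F1: {token_f1(pred_desc, ref_desc):.2f}")  # 0.11
```

In this toy case the model finds three of four parts but shares only one token with the reference description, which mirrors the shape of the gap reported above: entities are recognised while the relationships between them are not.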

TIER HISTORY

Research: Jan-2017 → Jan-2018
Bleeding Edge: Jan-2018 → Jan-2020
Leading Edge: Jan-2020 → present

EVIDENCE (167)

— Peer-reviewed evaluation found Gemini 3.1 Pro achieving 97.91%-98.51% character accuracy on Arabic classical scripts (naskh/ruq'ah/ta'liq), demonstrating frontier VLM effectiveness for specialized non-Latin paleography.

— University of South Carolina deployed JSTOR Seeklight AI on 100,000+ handwritten pages with 97% accuracy; production integration with student workflow demonstrates sustainable institutional adoption.

— Inria ALMAnaCH deployed CoMMa project transcribing 32,763 medieval manuscripts in 4 months with 9.7% character error rate; 3B+ word corpus confirms production-scale HTR for low-resource historical scripts.

— Central Institute of Classical Tamil digitized 48% of Thirukkural manuscripts with corpus explicitly developed as training data for handwritten Tamil text recognition; shows document digitization as foundational for specialized script HTR advancement.

— Government of India Gyan Bharatam Mission deployed AI for handwritten manuscript digitization at scale: 4.4M+ manuscripts documented, 800K+ digitized, 129K+ public access; ₹491.66 crore funding through 2031 demonstrates government backing.

— M2P Fintech deployed Document Intelligence Agent in loan origination: TAT 18–24 hours → <2 minutes; accuracy 85–90% → 95%+; per-application cost ₹800–1,200 → ₹80–150; handles classification, extraction, authenticity verification, fraud detection at 150,000 pages/hour.

— Archion deployed Transkribus API at production scale (200K+ books, 32M images) with text overlay in research platform; outcomes: improved user experience, faster paleographic research, automatic model improvement adoption.

— Peer-reviewed benchmark showing that VLMs systematically rely on textual priors over visual grounding: all models degrade on the Vision-Grounded variant; proprietary models show wider grounding gaps (27–38%), signaling a fundamental failure mode for document/diagram understanding.

HISTORY

  • 2017: Transkribus platform consolidated academic research in HTR and document layout analysis; European and German archival projects drove development; ICDAR competitions validated technical progress; persistent accuracy challenges on degraded texts limited wider adoption.
  • 2018: Transkribus matured to operational platform with real deployments; HTR+ technology achieved measurable accuracy gains (9% CER on difficult handwriting); user community grew to 100+ with documented case studies; platform transitioned to cooperative business model; critical adoption barrier identified: page-level accuracy does not translate to field-level accuracy for data extraction.
  • 2019: Cloud vendors entered market with production-ready solutions (Google Document Understanding AI, AWS Document Understanding Solution); Transkribus validated at 97% F1-measure on historical newspapers, outperforming commercial alternatives; commercial deployment evidence emerged (Interface Financial Group invoice processing at 99% accuracy); research advanced multilingual and low-resource script recognition; adoption remained specialized (archives, high-value financial documents) with ecosystem fragmented across platforms.
  • 2020: Cloud vendors pursued vertical specialization (Google Lending DocAI for mortgage lending); commercial deployments expanded beyond archives (Pernambucanas retail fraud detection at 10K documents/day); research initiatives funded (MultiHTR multilingual project, Fraunhofer DocuLib); practitioner evidence revealed persistent accuracy and generalization challenges (UiPath low-accuracy reports); adoption remained niche despite capability maturity.
  • 2021: Google Document AI reached general availability as unified platform; Transkribus secured major institutional deployment (Trinity College Dublin Beyond 2022 archival digitization, government-funded); Workday adopted Procurement DocAI for cross-lingual receipt/invoice automation; research consolidated gains in writer-adaptive HTR (MetaHTR) and multi-modal architectures (LayoutLMv2); document visual understanding advanced via ICDAR competitions; ecosystem expanded from specialized verticals toward enterprise adoption but remained bounded by training cost and domain-specific customization demands.
  • 2022-H1: National Archives of the Netherlands initiated 3 million-page digitization project with Transkribus (7% CER), planned 100M+ page rollout over 15 years; research advanced open-source HTR for medieval manuscripts (1.65% CER after finetuning) and attention-based architectures; Google expanded Document AI into tax form processing and enterprise automation; industry reports documented widespread adoption (UiPath 52% error reduction, Gartner predicting $900K annual savings per finance team); deployment remained vertically segmented (archives, financial services, tax processing) with cross-domain generalization barriers persisting.
  • 2022-H2: Transkribus reached 100,000 users milestone; Google Cloud released Document AI Workbench GA enabling rapid custom model training with named enterprise deployments (BBVA, Searce, Libeo) reporting 80% time-to-market reduction and 75.6%→83.9% accuracy gains; Donut (ECCV paper) introduced OCR-free visual document understanding architecture with open-source release; systematic academic review found Transkribus rapidly integrating into archival and library digitization workflows; Lexion achieved 94% accuracy on complex document extraction in one week, demonstrating custom model development maturity; however, adoption remained vertically concentrated (archives, finance, tax), with horizontal scaling constrained by training requirements and domain-specific customization demands.
  • 2023-H1: Research documented enduring challenges in scientific document processing (discourse structure, layout complexity, multimodality); independent academic deployments showed mixed results—successful transcriptions of historical documents but requiring substantial manual effort and model customization; practitioner assessments highlighted persistent production barriers: document quality variability, OCR limitations, and inherent accuracy ceilings. Transkribus maintained leadership in heritage digitization with institutional adoption evident, while broader enterprise automation remained bottlenecked by domain-specific training costs and cross-document-type generalization failures.
  • 2023-H2: Google Cloud Document AI expanded with generative AI features (Custom Extractor, Summarizer) reaching GA with named enterprise deployments (Deutsche Bank, BBVA); research advanced diagram understanding (ChartT5 achieved 8% gains on chart visual language pre-training); Wikimedia Foundation integrated Transkribus across 13 wikis for multilingual handwritten manuscripts, demonstrating adoption breadth in underrepresented languages. However, Azure Document Intelligence encountered SDK migration friction and API compatibility issues, highlighting fragmentation and technical debt in platform evolution. Vertical specialization and institutional use (archives, finance) remained dominant adoption patterns; horizontal scaling constrained by training costs and cross-domain generalization barriers.
  • 2024-Q1: Symbol recognition in engineering drawings proved insensitive to symbol density but sensitive to noise; large vision-language models showed limitations on chart comprehension (hallucinations, factual errors), prompting continued reliance on specialized models; Azure Document Intelligence faced SDK compatibility and regional availability issues; Google and UiPath expanded production deployments (Custom Extractors, 70K+ hour automation gains) while research documented persistent training cost and cross-domain generalization barriers.
  • 2024-Q2: Transkribus matured for production-level academic and archival adoption, with University of Edinburgh deploying platform for automated scholarly edition creation and German archives (Saarland, Braunschweig) launching transcription workflows on historical collections; production deployment evidence highlighted persistent architectural barriers in cloud platforms—Azure Document Intelligence documented rate limiting (15 TPS), lack of webhook support, and regional degradation (latency spikes to 60+ seconds), constraining horizontal scaling despite generative AI feature expansion.
  • 2024-Q3: Market inflection point with LLMs outperforming traditional HTR (Transkribus) on handwritten documents—achieving 1.8% CER at 1/50th cost—signaling technological disruption; vision-language models proved systematically inadequate for diagram understanding (58-65% accuracy vs. 82%+ human baseline) and low-level vision tasks, cementing diagram understanding as unsolved category; Transkribus remained production-dominant in cultural heritage (National Library of Norway NorHand model, 4% CER), while cloud platforms (UiPath, Google) advanced but faced reliability setbacks (Google Custom Extractor training failures September 2024); ecosystem research expanded (Docmatix dataset 240x larger) but specialized custom training remained non-negotiable for production; architectural scaling barriers (Azure rate limits, regional degradation) persisted.
  • 2024-Q4: Platform consolidation and ecosystem maturation despite persistent technical gaps; Transkribus expanded product portfolio (Sites platform for searchable digital editions, adoption in 20+ countries) and scholarship adoption (1M+ credits awarded, academic deployments in astronomy/medieval/architectural history), validating continued dominance in cultural heritage. LVLMs continued failing on diagram understanding—EMNLP 2024 research confirmed hallucinations and data bias in chart analysis; DesignQA benchmark showed GPT-4o, Claude, and Gemini cannot reliably interpret engineering drawings and CAD images, reinforcing diagram understanding as unsolved at production scale. Azure Document Intelligence faced intermittent production failures (API errors on identical requests), highlighting reliability barriers; practitioner analysis detailed specific AI failure modes in P&ID interpretation (symbol ambiguity, OCR errors, LLM hallucinations), confirming hybrid human-in-the-loop necessity. Specialized training remained mandatory for document accuracy despite continued VLM pressure on HTR economics.
  • 2025-Q1: Platform evolution and continued VLM limitations; Google Document AI continued production deployments (Spendbase check scanning automation with high-accuracy field extraction); AIA research documented early-stage architectural adoption (6% regular AI use among architects, 8% of firms implementing, concerns about accuracy/security), highlighting barriers and opportunities in vertical segments. VLM diagram understanding remained fundamentally unsolved—research proposed text-driven XML extraction approach as workaround to direct vision methods, and critical assessments documented VLMs achieving only 40% accuracy on relational reasoning tasks, confirming continued unsuitability for production diagram understanding. Practitioner assessments across platforms (UiPath, archival services) documented persistent handwriting recognition challenges (style variability, low accuracy on non-Latin scripts) requiring human proofing, reaffirming hybrid workflows as production necessity. Specialized document and diagram understanding remained distinct technology categories with no horizontal VLM convergence.
  • 2025-Q2: Specialized diagram understanding advanced while general-purpose VLM limitations persisted; ICML 2025 research confirmed LVLMs rely on background knowledge shortcuts rather than genuine diagram comprehension; engineering drawing parsing achieved 97.3% F1 via hybrid YOLOv11 + Donut framework, demonstrating specialized diagram understanding capability for manufacturing. Transkribus expansion continued with platform serving 500k+ users across 100+ languages and 300+ community models, covering diverse endangered scripts (Irish, Ottoman Turkish, Balinese) and continuing cultural heritage dominance. Government sector showed emerging document understanding adoption: King County, WA deployed AI redaction with 96% success (30min→<5sec processing), Covered California achieved 84% Google Document AI verification rate. Platform reliability concerns emerged: Azure Document Intelligence production outages (June US East region) documented, constraining enterprise adoption. Specialized custom training remained non-negotiable for production accuracy; horizontal VLM scaling remained blocked by diagram understanding unsuitability and platform reliability barriers.
  • 2025-Q3: Platform reliability crises and crystallizing VLM diagram understanding failure; Azure Document Intelligence experienced September 2025 outages with prolonged processing times (20+ minutes) across multiple regions, blocking production deployments. Research consensus hardened: peer-reviewed IEEE VIS 2025 study found VLMs struggle with chart encoding types despite accurate dimensionality/purpose recognition; ICML 2025 research confirmed LVLMs cannot reliably understand diagram relationships; CHART NOISe dataset demonstrated sharp performance degradation on degraded/occluded visualizations with hallucinations and overconfidence. Comparative benchmark evaluated 13 AI models on tables and engineering drawings, providing evidence of trade-offs between accuracy, latency, and cost across platforms. Industry practitioners documented production necessity of hybrid approaches—Reducto CEO analysis emphasized VLM failures on complex documents (table misreading, hallucination, information loss), advocating multi-pass hybrid workflows as requirement for reliability. HTR engine research highlighted continued specialization necessity: Titan and TrOCR-f superior for out-of-the-box Latin scripts, but non-Latin script accuracy remained dependent on fine-tuning. Diagram understanding remained categorically unsolved for horizontal VLM approaches despite continued research progress in specialized domains (engineering drawing parsing via YOLOv11+Donut hybrid, reaching 97.3% F1). Transkribus maintained 500k+ user base and ecosystem expansion (Sites platform adoption, 300+ community models). Market bifurcation deepened: specialized solutions demonstrating ROI and reliability; horizontal approaches facing platform reliability barriers and fundamental capability limitations.
  • 2025-Q4: Definitive research evidence of VLM relationship-reasoning failure; ICML 2025 peer-reviewed study provided conclusive findings that LVLMs achieve strong entity recognition (85%+) but cannot understand relationships (40-54% on relational reasoning), with impressive performance being "an illusion" from background knowledge rather than genuine visual comprehension. Transkribus consolidated market leadership with October 2025 research documentation of 90 million images processed, 235k registered users, and 227 cooperative members in 30 countries, validating sustained adoption scale in the cultural heritage sector. UiPath released a major IXP platform update (November 2025) with generative AI features and agentic extraction, signaling continued vendor investment despite VLM capability limitations. Azure Document Intelligence continued experiencing reliability issues throughout the quarter. Market structure solidified: specialized approaches (Transkribus, custom-trained models) maintained production dominance with demonstrated ROI; horizontal VLM approaches were shown to be unsuitable for relationship reasoning in diagrams; platform reliability remained a constraint on enterprise scaling; and diagram understanding remained categorically unsolved for general vision-language model applications.
  • 2026-Jan: Platform reliability crises deepened across vendors; UiPath Document Understanding experienced service incidents (East US extraction failures, Canada classification issues) indicating ongoing operational challenges in January 2026, while Azure Document Intelligence reported recurring extraction service hangs causing application downtime. VLM deployment guides documented production implementations achieving 85-94% accuracy on invoices/contracts with measurable ROI ($12→$1.20 per invoice), suggesting continued horizontal VLM scaling despite theoretical limitations. Platform documentation confirmed ongoing GA status and feature evolution (UiPath v2024.10 January release). Critical assessment: Transkribus production deployment at NIOD archives revealed quality concerns—automated text recognition fabricating entire lines—and ethical issues around model versioning/error transparency, highlighting accuracy maintenance challenges in real-world digitization workflows. Market bifurcation persisted: reliability issues constrained cloud platform scaling; specialized vendors maintained production dominance; VLM horizontal approaches continued scaling in cost-sensitive segments despite acknowledged limitations.
  • 2026-Feb: Research and deployment evidence solidified bifurcation thesis: comprehensive document parsing survey synthesized modular-vs-VLM approaches; peer-reviewed Czech study demonstrated generative AI feasibility for handwritten transcription but with critical expert-verification requirements; Transkribus volunteer deployment on New France manuscripts achieved 3-4% CER, confirming continued cultural heritage adoption; VISTA-Bench research revealed fundamental VLM modality gap on visualized text; Transkribus roadmap emphasis on LLM integration and data sovereignty; industry survey documented persistent manual extraction barriers (70% GD&T still manual), suggesting specialized technical drawing understanding remains unmet market need.
  • 2026-Mar/Apr: Deployment stage shift confirmed; industry analysis documented document AI transitioning from "credibility problem" to "production infrastructure", with 95% field-level accuracy as the production threshold. Specialized academic deployments continued: U of T/UCL trained Transkribus on 13th-century Latin legal manuscripts, overcoming medieval abbreviations and hyphens through collaborative retraining. Research hardened diagram understanding constraints: a VLM benchmark (IKEA-Bench) of 1,623 assembly diagram questions documented a fundamental visual encoding bottleneck; VLM-RobustBench confirmed that geometric distortions (resampling, elastic transforms) cause a 34pp accuracy loss, which is critical for scanned documents. Layout analysis was identified as an underappreciated bottleneck: DFG/AHRC-funded Tibetan newspaper research documented Transkribus limitations on dense multi-script layouts, requiring a custom TransYolo solution. A systematic review of OCR evaluation (2006-2025) documented structural bias: historical and marginalized documents are underrepresented in training/benchmarking. Cloud platforms matured: Microsoft Azure Content Understanding reached GA with a 40% accuracy improvement via labeled examples; orchestrated multi-model pipelines reduced manual processing from 30-45min to <5min. Handwriting OCR accuracy varies widely: 63-99% across platforms with pronounced style variance (block ~95%, cursive ~45%); an independent analyst ranked Microsoft second among leaders in the IDP market (93% faster invoice processing at scale). Market bifurcation firmly established: specialized solutions (Transkribus, fine-tuned models) demonstrating sustained ROI; horizontal VLM approaches continuing in cost-sensitive segments despite documented limitations.
  • 2026-Apr/May: Frontier model benchmarks and production deployments confirm bifurcation trajectory. Peer-reviewed research on handwritten form digitization shows frontier models (Gemini 3.1, GPT-5.4, Claude Sonnet 4.6) achieving ~85% field-level accuracy; prompt optimization yields 60%+ macro improvements but only 2-5% weighted gains, a signal of an optimization plateau. Benchmark of Old Church Slavonic OCR across 11 systems documents persistent challenges: Transkribus best at diacritical marks (CER ~0.3–0.4) but LLMs fail (CER 0.88–0.95); agentic correction pipelines achieved combined CER as low as 0.011 on best pages. Multi-script OCR analysis (GlotOCR-Bench, 100+ Unicode scripts) confirms critical limitation: near-zero accuracy on low-resource scripts (Tifinagh, Bamum), revealing maturity heavily concentrated in high-resource languages. Google Cloud Document AI expanded with three new OCR features (Intelligent Document Quality scoring, digital PDF support, model versioning) reaching public preview with named customer deployments (Jack Henry, PwC, Mr. Cooper). Production benchmarking (TokenMix, 25,000 documents) shows Claude Sonnet 4.6 leading at 97.6% field extraction accuracy, with a 3-7% performance cliff when documents exceed context windows and require chunking. Amazon Science released Document Haystack benchmark for long-context VLM evaluation. ThoughtWorks Technology Radar positioned unified VLM document parsing in "Assess" tier with trade-off analysis: simplicity vs hallucination risk. TOPPAN Group announced specialized AI-OCR for medieval Greek manuscripts (Vatican Apostolic Library collaboration), signaling ecosystem investment in niche historical scripts. LlamaIndex released ParseBench (2,000 enterprise document pages, 167,000 test rules) evaluating parsers on production-critical dimensions (tables, charts, visual grounding). Critical analyst commentary documented why benchmark performance (97%+) does not translate to production: clean printed text 96.5–99%, academic papers ~60%, handwritten ~80%, degraded scans highly variable. Evidence maintains bifurcation thesis: specialized solutions demonstrating clear ROI; frontier VLM models continuing horizontal scaling despite persistent limitations in multi-script support, geometric robustness, and diagram understanding.
  • 2026-May: Market inversion and production validation confirmed. OmniDocBench v1.6 documents structural shift: specialist sub-1B VLMs (MinerU 2.5 at 95.75, GLM-OCR at 95.22) decisively outperform frontier models (Gemini 3 Pro at 90.33, Qwen3-VL-235B at 89.15) with dramatically better cost-efficiency. Production case study validated at IEEE-CAI 2026: Kubernetes pipeline processing 14,000+ PDFs (378,000+ pages), extracting 30+ complex fields per document at $100 total cost. Vendor ecosystem maturation accelerated: Snowflake announced 25% OCR accuracy improvement and 20% multilingual gains (May 4 release); ABBYY enhanced FineReader with layout analysis, handwritten and Chinese recognition, and LLM integration. DocScope benchmark revealed critical maturity gap: only 29% of correct answers have complete evidence chains and region grounding is the weakest capability across models — quantifying the practice's distance from trustworthy long-document reasoning. Azure Document Intelligence's OCR architectural constraint confirmed: Draw Region cannot recover text missed at the OCR layer, setting a hard ceiling on extraction accuracy for certain document classes. Market bifurcation crystallized: specialized solutions maintaining production dominance with demonstrated ROI; horizontal VLM approaches facing fundamental limitations in diagram understanding and reliability.
  • 2026-Jun: Enterprise IDP market reached measurable scale: Gartner's inaugural Magic Quadrant for IDP identified 5 Leaders with vendor accuracy converging at 90-99% and audit trail emerging as the primary differentiator; market reached $4.31B (33% CAGR), with 63% of Fortune 250 adopting IDP and 67% of enterprises now evaluating agentic document processing (up from 23% two years ago). Platform expansions: UiPath Helix model family reached GA with improved extraction; Databricks released ai_extract and ai_classify functions as native GA capabilities (June 11); Azure Content Understanding expanded with LLM-based unstructured extraction and three named enterprise customers; Microsoft, Google, and other vendors continued GA feature releases. Production deployments accelerated across both enterprise and cultural-heritage segments: University of Georgia Hargrett Library transcribed 20,000+ Colonial-era pages in under two months with a 2-person team; M2P Fintech deployed Document Intelligence Agent achieving <2 minute processing, 95%+ accuracy, ₹80-150 per-application cost; Nevada County deployed historical document digitization at 150X speedup with 95-98% accuracy; Archion deployed Transkribus API at 32M image scale (200K+ books). Heritage HTR deployments confirmed production scale: Inria ALMAnaCH's CoMMa project transcribed 32,763 medieval manuscripts in 4 months at 9.7% CER, producing a 3B-word corpus; University of South Carolina Libraries processed 100,000+ handwritten pages via JSTOR Seeklight AI at 97% accuracy; India's Gyan Bharatam Mission documented 4.4M+ manuscripts with ₹491.66 crore government funding through 2031. Frontier model HTR capability extended to specialized scripts: Gemini 3.1 Pro achieved 97.91–98.51% character accuracy on classical Arabic paleography (naskh, ruq'ah, ta'liq), signaling VLM viability for high-resource non-Latin scripts. Research hardened VLM constraints: Enginuity benchmark documented engineering diagram failures (Token F1 0.03–0.18 on descriptions); phrasing-controlled study confirmed VLMs systematically rely on textual priors over visual grounding (proprietary models 27–38% gap on Vision-Grounded variants). Diagram understanding remained categorically unsolved for general-purpose VLM approaches; hybrid human-in-the-loop workflows remained the production standard; specialized solutions (Transkribus, custom-trained models) continued demonstrating ROI in cultural heritage, finance, and government sectors.
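The history above notes (2018) that page-level accuracy does not translate to field-level accuracy, and the 2026-Mar/Apr entry cites 95% field-level accuracy as the production threshold. One way to see why the two diverge is a deliberately simplified model, sketched below: if character errors were independent (an assumption made here for illustration, not a claim from any cited study), a field is only fully correct when every one of its characters is. The function names and the 20-character field length are hypothetical.

```python
def field_accuracy(cer: float, field_length: int) -> float:
    """Probability that a field of `field_length` characters has no errors,
    assuming each character is wrong independently with probability `cer`."""
    if not 0.0 <= cer <= 1.0:
        raise ValueError("cer must be between 0 and 1")
    if field_length < 1:
        raise ValueError("field_length must be at least 1")
    return (1.0 - cer) ** field_length


def max_cer_for_target(target_field_accuracy: float, field_length: int) -> float:
    """Largest per-character error rate that still meets the field-level
    target under the same independence assumption."""
    if not 0.0 < target_field_accuracy <= 1.0:
        raise ValueError("target_field_accuracy must be in (0, 1]")
    if field_length < 1:
        raise ValueError("field_length must be at least 1")
    return 1.0 - target_field_accuracy ** (1.0 / field_length)


# A 3% CER looks strong at the character level, yet under this model
# only about 54% of 20-character fields come out fully correct.
print(f"{field_accuracy(cer=0.03, field_length=20):.3f}")

# Meeting a 95% field-level threshold on 20-character fields would need
# a CER of roughly 0.26% under the same assumption.
print(f"{max_cer_for_target(0.95, field_length=20):.4%}")
```

Real errors cluster on hard regions of a page rather than falling independently, so this is only a rough guide to the gap, but it shows why data-extraction use cases demand far lower character error rates than transcription for reading.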

TOOLS