The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI that understands documents containing mixed content — tables, diagrams, images, and text — extracting meaning from each. Includes chart reading, diagram interpretation, and table extraction; distinct from standard OCR which processes text rather than mixed visual elements.
Multimodal document understanding (AI that extracts meaning from documents combining tables, diagrams, charts, and text) has proven its value at forward-leaning organisations but remains far from mainstream adoption. The technology works: production deployments at scale show strong ROI. Named examples include Goldman Sachs (KYC and entity extraction), Sun Finance (identity verification, with a 91% cost reduction), Doczy.ai (a contract system reaching 99% accuracy against a 55% rules-based baseline), and a major international bank that cut invoice processing time by 85% across 50K monthly documents. Major vendors shipped new capabilities in May and June 2026: Google released Gemini API multimodal RAG with unified embeddings, and at Build 2026 Microsoft shipped Azure Content Understanding, which combines Document Intelligence with LLM reasoning and has named production users such as DataSnipper and Wolters Kluwer. Yet recent research surfaces persistent limitations that complicate adoption. New benchmarks (ReceiptBench, CiteVQA, FinDocMRE, MMM-Bench) document the gap between demo accuracy and field performance: Gemini-3.1-Pro exhibits "attribution hallucination" (76% strict accuracy even when its answers are correct), financial document understanding tops out at 65% across 11 models (FinDocMRE), and text-only extraction paradoxically outperforms multimodal approaches on real-world documents (MMM-Bench). June 2026 research identifies two further failure modes: multimodal models exhibit hallucination cascading in interactive multi-turn document scenarios (MM-Snowball, ICML 2026), and layout detection models fail to generalize from academic benchmarks to real institutional documents. The binding constraint has shifted from raw model capability to architectural composition and interactive reliability. Multi-model verification architectures demonstrate a 61% hallucination reduction (across 480M enterprise outputs), while traditional approaches (specialized transformers on layout-intensive tasks) still outperform LLM-based approaches on visually rich document types.
For organisations with strong document infrastructure, tolerance for hybrid OCR+vision architectures, and domain-specific model tuning, the returns are real. For most, adoption remains gated by interactive reliability, architectural composition, and the infrastructure readiness to support hybrid pipelines.
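The hybrid OCR+vision architecture mentioned above can be pictured as a simple routing rule: run inexpensive OCR first and escalate a page to a vision-language model only when OCR confidence is low. The following is a minimal sketch under stated assumptions, not a description of any vendor's pipeline. The names `PageResult`, `route_page`, `fake_ocr`, and `fake_vlm`, and the 0.90 threshold, are all hypothetical; the two engines are stand-ins so the example runs without external services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageResult:
    text: str
    confidence: float  # engine-reported confidence in [0.0, 1.0]
    engine: str        # which engine produced the text


Engine = Callable[[bytes], PageResult]


def route_page(page: bytes, ocr: Engine, vlm: Engine,
               min_confidence: float = 0.90) -> PageResult:
    """Run OCR first; escalate to the vision model only on low confidence."""
    ocr_result = ocr(page)
    if ocr_result.confidence >= min_confidence:
        return ocr_result
    return vlm(page)


# Stand-in engines (assumptions) so the sketch runs on its own.
def fake_ocr(page: bytes) -> PageResult:
    # Pretend that pages containing a chart are hard for plain OCR.
    confidence = 0.55 if b"chart" in page else 0.97
    return PageResult(page.decode("utf-8", "replace"), confidence, "ocr")


def fake_vlm(page: bytes) -> PageResult:
    return PageResult(page.decode("utf-8", "replace"), 0.92, "vlm")


if __name__ == "__main__":
    for page in (b"plain invoice text", b"quarterly chart page"):
        print(route_page(page, fake_ocr, fake_vlm).engine)
```

In a real deployment the threshold would need tuning per document type, and the escalation path is where the higher multimodal token cost is incurred.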
As of June 2026, the vendor ecosystem has solidified into three distinct layers, each with complementary strengths and constraints. Cloud platforms (AWS Textract, Azure AI Search, Google Document AI, Oracle) dominate on market reach and feature maturity: Textract serves 1,700+ customers and processes billions of pages annually, Google shipped Gemini API multimodal RAG with unified embeddings on May 5, 2026, and Microsoft followed in June 2026 with Azure Content Understanding, which combines Document Intelligence with LLM reasoning. Specialist agentic platforms (Indico, Convr, LandingAI) are winning on hard document types: Indico processes 1M+ pay stubs daily at 90%+ accuracy, LandingAI scaled a healthcare RCM deployment from 120K to 240K prior authorization pages daily (accuracy rising from 60% to 90%+), and Convr reports 97% accuracy on commercial insurance. Open-source ecosystems are maturing: Docling (now offered by IBM as a managed service) hosts 24 models, with community-developed visual inspection tooling (Docling Studio) that supports production RAG workflows; IBM released Granite Vision 4.1 4B, a specialized VLM achieving 94.2% zero-shot accuracy on VAREX; and self-hosted deployment patterns on GPU infrastructure are now well documented and production-ready, with Marker and MinerU as alternatives. Architectural stratification is now evident: specialized multimodal Transformers (LayoutLMv3, Donut) outperform LLM-based approaches on layout-intensive documents; independent OCRBench testing shows VLMs matching or exceeding traditional OCR on real documents, especially charts, handwriting, and complex fields; and hybrid OCR+vision pipelines are emerging as the production standard for balancing cost, speed, and accuracy. Agentic parsing platforms show a clear advantage: the independent RealDoc-Bench benchmark shows agentic document infrastructure (Extend Parse 2.0, LlamaParse) achieving 91-96% Q&A accuracy on production documents versus 70.5% for a traditional OCR platform (AWS Textract), supporting the architectural preference for layout-preserving approaches.
Model adoption patterns show a shift in enterprise preference. The KPMG AI Adoption Index shows Claude climbing from seventh to second place as "primary LLM in production" (Q1 2025 to Q1 2026), driven by superior document handling on messy real-world inputs. Legal sector adoption is accelerating: 68% of law firms and in-house teams are deploying multimodal AI agents (Harvey, RSGI survey, June 2026), with 11 hours per week of time savings and a 44% revenue increase among firms tracking outcomes. June 2026 research surfaces critical limitations alongside validated architectural solutions. On the limitation side: hallucination cascading is documented in interactive multi-turn document scenarios (MM-Snowball, ICML 2026); layout detection models fail to generalize from academic benchmarks to real institutional documents; no model exceeds 65% accuracy on financial document understanding against a 79% human baseline (FinDocMRE); receipt extraction exhibits an "Analyst-Calculator Dichotomy" (a gap between semantic reasoning and numerical precision); and attribution hallucination persists (Gemini-3.1-Pro at 76% strict accuracy). Yet peer-reviewed mitigation architectures are advancing: multi-model verification reduces hallucination rates by 61% across 480M enterprise outputs in regulated sectors; retrieval-augmented reliability-aware inference improves accepted prediction accuracy from 85.84% to 88.88%; typed hallucination auditing on legal contracts shows multi-agent debate reducing fabrications by 45% while 4B-parameter models reach parity with commercial APIs; and fine-grained hallucination detection systems (ZINA, CVPR 2026) enable phrase-level error localization and actionable correction.
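The text does not describe how the multi-model verification systems work internally. One plausible reading, shown here purely as an illustrative sketch, is cross-model agreement: a field value is accepted only when enough independent extractors return the same normalized value, and is otherwise sent for human review. The function name `verify_field`, the normalization rule, and the `min_agree` threshold are assumptions made for this example.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def verify_field(candidates: Sequence[str],
                 min_agree: int = 2) -> Tuple[Optional[str], bool]:
    """Accept a value only if at least `min_agree` extractors agree on it.

    Returns (value, accepted). When no value reaches the agreement
    threshold, returns (None, False) so the field can be flagged for review.
    """
    # Normalize lightly so trivial formatting differences do not block agreement.
    normalized = [c.strip().lower() for c in candidates if c and c.strip()]
    if not normalized:
        return None, False
    value, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_agree:
        return value, True
    return None, False


if __name__ == "__main__":
    # Two of three hypothetical models agree on the invoice number.
    print(verify_field(["INV-1042", "inv-1042 ", "INV-1024"]))
    # No agreement: the field is routed to manual review.
    print(verify_field(["12 March", "13 March", "21 March"]))
```

Agreement voting trades coverage for reliability: fields where models disagree fall through to manual handling, which is why the exception-handling cost matters as much as the per-call API cost.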
Production constraints are shifting from model capability to architectural composition: inference-time graph-based evidence routing (the TIGER framework) enables fact-level repair; hallucination-to-action exploitation is mitigated only in specialized Evidence-Carrying Agent architectures (0% unsafe-action rate versus 100% for naive agents); manual exception-handling costs (~$4.83/page) dominate API costs; and multimodal token costs (2-5× text) push architectural choices toward hybrid pipelines. Organizational readiness remains critical: 61% of organizations still rely on paper despite automation claims, and only 38% rate their document data as suitable for AI use. Enterprise software multimodal adoption is projected to reach 80% by 2030 (Gartner), and insurers using multimodal first-notice-of-loss processing report a 35% reduction in claim cycle time and improved fraud-flagging accuracy.
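To see why manual exception handling can dominate API spend, a rough expected-cost model helps: cost per page is the API cost plus the exception rate multiplied by the manual handling cost. The sketch below uses the ~$4.83/page figure from the text; the API prices and exception rates passed in are hypothetical inputs chosen only for illustration, and the function names are assumptions.

```python
MANUAL_EXCEPTION_COST = 4.83  # USD per exception page (figure quoted in the text)


def cost_per_page(api_cost: float, exception_rate: float,
                  manual_cost: float = MANUAL_EXCEPTION_COST) -> float:
    """Expected cost per page: API cost plus expected manual handling cost."""
    if not 0.0 <= exception_rate <= 1.0:
        raise ValueError("exception_rate must be between 0 and 1")
    return api_cost + exception_rate * manual_cost


def break_even_exception_rate(api_cost: float,
                              manual_cost: float = MANUAL_EXCEPTION_COST) -> float:
    """Exception rate at which manual handling costs as much as the API call."""
    return api_cost / manual_cost


if __name__ == "__main__":
    # Hypothetical: a text pipeline at $0.01/page with 8% exceptions versus
    # a multimodal pipeline at 5x the API price with 4% exceptions.
    text_only = cost_per_page(api_cost=0.01, exception_rate=0.08)
    multimodal = cost_per_page(api_cost=0.05, exception_rate=0.04)
    print(f"text-only:  ${text_only:.4f} per page")
    print(f"multimodal: ${multimodal:.4f} per page")
    print(f"break-even exception rate at $0.05/page: "
          f"{break_even_exception_rate(0.05):.2%}")
```

Under these made-up inputs the pricier multimodal call is still cheaper overall because it halves the exception rate; with different inputs the conclusion can flip, which is the trade-off that hybrid pipelines try to manage.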
— AWS Bedrock product GA for Claude multimodal vision capabilities: extracts insights from documents, processes diagrams, reads charts. Named enterprise use cases in legal (parse documents and answer questions), insurance (claims/policy analysis), operations (extract from emails/business documents). 1M token context window.
— Production-document benchmark across logistics, healthcare, finance, real estate (1,500 layout samples, 1,359 Q&A prompts): demonstrates agentic parsers (Extend, LlamaParse) achieve 91-96% Q&A accuracy vs traditional OCR platforms 70.5% (AWS Textract), revealing infrastructure preference for layout-preserving agentic approaches.
— Multimodal document understanding deployment on legal contracts (CUAD, 510 contracts, 249K instances): reveals 51–56% aggregate hallucination masking typed gaps (numeric/obligation claims 65–74%, temporal 29–35%); multi-agent debate reduces fabrications 45%, achieves parity with commercial APIs at 4B parameters.
— RSGI independent study (87 respondents, 60 firms, April-June 2026): 68% deploying multimodal AI agents for legal documents; 21% running 50+ agents; 11 hours/week time savings; 44% revenue increase, 53% profitability gain among firms tracking outcomes—signals production-scale adoption in regulated vertical.
— IBM general availability of Docling as managed enterprise service (40M total downloads, 500k daily), with hardening work for production resilience and independent user validation: Singapore financial institution reported improved accuracy and 2× parsing speed over open-source baseline.
— Peer-reviewed hallucination mitigation framework for MLLMs using external visual evidence retrieval and reliability scoring. Empirical results: improves accepted prediction accuracy 85.84%→88.88% (89% coverage) and reduces hallucination rate 14.16%→11.12% on ImageNet-100.
— CVPR 2026 fine-grained hallucination detection system with VisionHall dataset (6.9k manually annotated MLLM outputs, 20k synthetic samples) classifying errors across six categories at phrase-level. Outperforms GPT-4o and Llama-3.2 on detection/editing, enabling actionable error correction in production document pipelines.
— Production study of 480M verified AI outputs across legal, financial, healthcare: multi-model verification reduced factual errors from 8.3% to 3.2% (61% reduction), demonstrating concrete reliability improvement for document-centric workflows in regulated sectors.
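The percentages quoted in the evidence items above mix absolute rates and relative reductions, so a quick arithmetic check is useful. The sketch below recomputes the relative reduction for the two error-rate pairs given in the text (8.3% to 3.2%, and 14.16% to 11.12%) and confirms that the hallucination rates in the second study are the complements of the reported accuracies. The helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Multi-model verification study: factual errors 8.3% -> 3.2%.
    print(f"{relative_reduction(8.3, 3.2):.1f}% relative reduction")

    # Reliability-aware inference study: hallucination rate 14.16% -> 11.12%.
    print(f"{relative_reduction(14.16, 11.12):.1f}% relative reduction")

    # The hallucination rates are the complements of the accepted accuracies.
    print(round(100 - 85.84, 2), round(100 - 88.88, 2))
```

The first figure matches the 61% reduction reported in the text. The second study's 3.04-point absolute drop corresponds to a relative reduction of roughly 21%, a smaller effect than the headline accuracy figures might suggest at first glance.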
2024-Q1: AWS Textract and Adobe announce major platform expansions with production deployments and GA launches. Multimodal document understanding moves beyond OCR into vision-language model territory. Academic benchmarks proliferate (MMDocIR, M-LongDoc), signaling research maturity. Cloud provider SDK reliability issues emerge as a deployment bottleneck (Azure 404 errors, latency). Market adoption reported at 1,700+ companies using Textract; survey data suggests 78% of organizations planning multimodal document processing implementation within the year.
2024-Q2: Production deployments accelerate across vendors with named organizational wins (NYSE, Ryanair, Netsmart on Bedrock; UiPath CoE reporting 70K+ hours freed). Academic research shifts from capability benchmarks to rigorous limitation assessment: WONDERBREAD (Stanford) reveals validation gaps in workflow documentation; ICLR 2024 confirms hallucinations and compositionality flaws in open-source LMMs. Industry R&D (JPMorgan DocLLM) explores alternative architectures to improve scaling and reliability. Azure Document Intelligence surfaces architectural scaling limits (rate-capping at 15 TPS POST) and regional reliability issues, signaling vendor platform maturity challenges.
2024-Q3: Named enterprise deployments continue (Deltek's multimodal RAG on AWS, Stride's document automation on UiPath), but operational friction increases. Open-source ecosystem gains traction (Docling 55.8k stars). Academic research identifies domain-specific limitations: GPT-4V fails on medical documents despite apparent expertise. UiPath on-premises deployments encounter cascading failures with dataset scale; Azure Document Intelligence exhibits latency (5-6 seconds per document) and API version instability. Market shows adoption expanding beyond finance/insurance into education and healthcare, but production reliability remains the primary constraint.
2024-Q4: Bloomberg and UNC researchers demonstrate scalable multimodal RAG handling 40K pages with sub-2-second latency, confirming technical capability advancement in financial sector R&D. Baidu releases PP-DocBee with SOTA benchmark results, signaling international vendor competition. Federal sector deployments expand (Precise Software Solutions with 4x productivity gains on Textract). However, Azure Document Intelligence surfaces operational maturity constraints: intermittent validation errors (InvalidContentDimensions) in production, SDK versioning confusion delaying GA adoption, and unresolved developer experience friction. ACL 2024 research establishes consensus on enterprise adoption barriers: data limitations, model calibration issues, and evaluation gaps in field-level performance. Market maturity increases but production reliability and integration friction remain primary adoption constraints.
2025-Q1: Research ecosystem accelerates with BigDocs (7.5M multimodal samples, 15.14% benchmark gains), Docopilot (native document VLM outperforming Gemini 1.5-Pro), and Document Haystack (long-context needle-in-haystack benchmarks). Academic surveys (ACL 2025 Multimodal RAG) signal community maturity. Production deployments show strong ROI: major international bank achieves 85% invoice processing time reduction and $2.1M annual savings with 50K monthly volume. However, Azure Document Intelligence exhibits new reliability issues (intermittent freezes in production); field evidence confirms model limitations in layout sensitivity and hallucinations persist. Market splits between generalist foundation models and specialist document-focused architectures.
2025-Q2: AWS Textract announces June updates for improved low-resolution and complex document handling; vendor momentum sustained across platform improvements. Base64.ai reports multimodal deployments achieving 99.7% accuracy in insurance/pharma applications, confirming enterprise adoption ROI. Academic research intensifies: IJCAI 2025 tutorial formalizes MLLM-driven document understanding curriculum; PLOS ONE and independent RMIT evaluation advance both performance benchmarks (0.88 NDCG multimodal retrieval) and critical assessment of real-world limitations (vendor name/date extraction failures in expense processing). However, Azure Document Intelligence suffers major production outage in June (US East region, InternalServerError on multiple models), reinforcing vendor platform maturity as primary adoption constraint. Specialist models (PP-DocBee-2) demonstrate significant performance gains (11.4% improvement, 73% latency reduction) on domain-specific document types. Adoption velocity driven less by capability advancement and more by vendor operational reliability and domain-specific model selection.
2025-Q3: Ecosystem expansion with Oracle Cloud releasing Document Understanding 2.0 (multilingual support, Label Studio integration), signaling third-tier vendor commitment. Research consensus solidifies: two authoritative surveys (arXiv, ACL 2025) formalize MLLM-based document understanding field maturity and identify efficiency, generalizability, robustness as advancement frontiers. Adoption paradox deepens: SER Group survey reports 78% of organizations claiming document processing implementation, yet critical assessment reveals 61% still rely on paper, 48% expect paper volumes to increase, and most remain rule-based (not true AI). Azure Document Intelligence reliability deteriorates further: September outages in East US (30+ minute processing delays) and West Europe (indefinite hanging on prebuilt models, 100% failure rates in healthcare applications), reinforcing vendor platform maturity as critical bottleneck. Production reliability and organizational integration friction remain primary adoption constraints, not capability.
2025-Q4: Agentic automation emerges as market differentiator: Indico Data processes over 1M pay stubs daily with 90%+ accuracy, replacing hyperscaler solutions at 70%, demonstrating vertical specialization advantage. However, infrastructure readiness gap widens: Apryse survey (December, 465 organizations) reveals 64.5% have AI in production but only 38.1% rate document data as excellent, with 76.6% storing 25-75% of data in documents and 82.8% planning document automation investment. Azure Document Intelligence continues reliability failures: November 2025 production outage with hanging errors on custom models, reinforcing vendor platform constraints. Implementation friction persists: 78% survey adoption claims conflict with reality (61% still paper-dependent, 48% expecting paper volume growth). Market maturity reflected in three-tier vendor ecosystem and specialist model gains, but adoption velocity constrained by infrastructure investment gaps and vendor platform reliability, not capability advancement.
2026-Jan: Deployment momentum accelerates with named enterprise wins and international vendor innovation: Allegis Global Solutions achieves 80-90% success rates using UiPath Document Understanding with agentic automation for constantly changing invoice formats; Oracle integrates Document Understanding into Oracle Integration Cloud with support for invoices, receipts, passports, and custom documents; Document Force releases enhanced models with 200%+ GPQA improvement and 1/3 cost reduction. Independent benchmarking (AIMultiple, 300-document study) confirms Azure Document Intelligence at 96% accuracy on printed text, validating SOTA multimodal LLMs as OCR alternative. Academic research continues validation: SANER 2026 conference paper from Fujitsu demonstrates multimodal LLM success on structural recognition but identifies persistent challenges in semantic interpretation of complex diagrams. However, vendor reliability remains critical constraint: Azure Document Intelligence continues January 2026 production failures with hanging extraction requests causing application downtime, signaling unresolved platform maturity issues despite ecosystem expansion. Market maturity reflected in deployment velocity and international vendor entry, but organizational infrastructure readiness and vendor platform reliability remain primary adoption bottlenecks over capability.
2026-Feb: Research consensus on maturity combined with practical deployment acceleration: Two new comprehensive academic surveys (arXiv) and an industry report document the MLLM-driven document understanding landscape, synthesis of VDR methodologies, and agentic automation market expansion (LlamaIndex 500M+ documents, 90%+ automation; Convr 97% accuracy on insurance). UiPath advances product maturity with Field Groups and Document Understanding API v2; Textract demonstrates production ROI (Associa: 48M documents, unknown-document accuracy 50%→85%). However, February brings continued evidence of Azure platform brittleness: HTTP 500 failures on copied custom models affecting multi-environment deployments; concurrent evidence of capability limitations (French financial documents show 34-62% chart accuracy, multi-turn dialogue failures at 50%, mismatched decoder problem constraining modality understanding). Market bifurcates: agentic specialist vendors (Convr, LlamaIndex, Indico) with vertical focus outperforming generalist platforms; architectural and deployment reliability remain primary adoption constraints over raw model capability.
2026-Mar: Deployment velocity continues with strong ROI validation and methodological advances, while capability limitations become increasingly explicit. New evidence: Fullerton Health (9-market healthcare deployment, 87% field accuracy, 300× efficiency); industry ROI analysis confirms 60-80% cost reduction and 6-18 month payback across financial services and lending (mortgage processing $8.5-28K monthly savings). Academic peer-reviewed benchmarks (CVPR, EACL, IJCAI tracks) formalise maturity: SEA-Vision benchmark reveals critical adoption barrier in low-resource languages (11 languages, pronounced degradation); VAREX (1,777 docs, 20 models) shows schema compliance is below-4B bottleneck; EACL study validates image-only MLLMs match OCR+MLLM pipelines on business documents, reducing infrastructure requirements. However, quality constraints sharpen: FREAK and Reading-Not-Thinking papers document severe hallucination issues and text-as-pixels modality gap (60+ point degradation on math tasks, fixed with self-distillation). ViG-LLM (Amazon Science) addresses privacy-constrained extraction without external OCR. LandingAI achieves 99.16% accuracy on DocVQA with agentic visual grounding. Market signal: agentic architectures winning on complex documents; image-only MLLM pipelines viable for business use; multilingual adoption limited by language resource gaps; infrastructure readiness and vendor reliability remain primary constraints over raw capability.
2026-Apr: Adoption-reality gap widens despite platform maturity. New evidence: AIIM survey (600 enterprises) shows 61% still paper-dependent despite 78% operational AI, 66% IDP tool replacement rate reflecting prior failures. Multi-industry case studies (Quantiva) confirm "95% demo, 60% reality" pattern across financial/media/music—document AI requires composed architecture (classification, extraction, tables, validation) not monolithic tools. Independent benchmarking (BenchLM, April 2026) elevates document understanding (OfficeQA Pro) to essential enterprise capability with Qwen3.6-35B leading at 89.9%, signalling evaluation maturity. Ecosystem standardisation signals emerge: Abbyy's VLM-focused strategy and the DocLang working group (IBM, Red Hat under the Linux Foundation) indicate consolidation toward agentic architectures. Real-world financial-document testing (120 documents, three frontier models) shows 87-89% headline accuracy masking field-level error patterns and confidence calibration failures, confirming production deployment requires model-specific tuning. Platform expansion: Azure AI Search GA for multimodal documents, LandingAI scales healthcare 120K→240K pages daily (60%→90%+ accuracy). Production constraints sharpening: multimodal LLM costs 2-5× text tokens; hybrid OCR+vision pipelines and vision-guided chunking (84.4% retrieval accuracy) emerging as architecture response. Specialist vendors and open-source (Docling 24 models, Studio tooling) outperforming cloud platforms on hard document types; infrastructure readiness and organisational data quality remain primary adoption constraints.
2026-May: Evaluation maturity and deployment velocity advance alongside explicit acknowledgment of limitations. New evidence: (1) Docling's donation to the Linux Foundation (March 2026) and Red Hat Summit presence (May 11–13) demonstrate production adoption at scale in aviation, banking, and insurance; IBM's 258M-parameter Granite-Docling-258M model signals a preference for specialized architectures. (2) Major vendor product releases: Google Gemini API multimodal RAG (May 5, 2026) ships unified embeddings across text/image/audio/video/PDF modalities, eliminating preprocessing; Gemini Omni, a native multimodal model with named enterprise adopters (Accenture, AirAsia, Deloitte), was announced May 20. (3) Academic research confirms production reality: a peer-reviewed KYC study (arXiv) shows a multistage pipeline achieving 87.27% on 120 real financial documents (3000+ pages); Amazon Science's Document Haystack and ReceiptBench benchmarks formalize document understanding evaluation maturity. However, simultaneous research surfaces persistent capability gaps: ReceiptBench (10K receipts) documents an "Analyst-Calculator Dichotomy" (semantic reasoning vs numerical extraction precision); CiteVQA reveals "attribution hallucination" (Gemini-3.1-Pro at 76% strict accuracy despite correct answers on multi-page PDFs); MMM-Bench (5,990 real Alibaba documents) shows text-only extraction paradoxically outperforming multimodal approaches; and FinDocMRE (2,878 financial PDFs) caps all 11 models at 65% accuracy. DocAtlas (82-language framework) addresses low-resource gaps via DPO adaptation. (4) Practitioner assessment solidifies: critical analysis documents why LLM-only extraction fails (hallucinations, layout collapse, inconsistent results) and advocates an explicit OCR-first/LLM-second hybrid pattern, evidence of architectural consensus post-pilot. Evidence-Carrying Multimodal Agents research proposes hallucination-to-action safety via deterministic verification (0% unsafe-action rate vs 100% for naive agents). (5) Concrete deployment wins: Goldman Sachs autonomous agents for KYC/entity extraction and judgment-driven classification; a Proxet case study shows a global investment firm reducing week-long portfolio analysis to hours using Claude agentic multimodal agents; Sun Finance achieved 79.7%→90.8% accuracy, 91% cost reduction, 20 hours→<5 seconds. Market signal: deployment ROI is proven in financial services and vendor product maturity is advancing, but field research reveals that frontier-model demo accuracy (87-89%) masks field-level error patterns; multimodal approaches must compete against simpler text-only baselines on real data; and attribution/hallucination gaps prevent production safety without supplementary verification architectures.
2026-Jun: Platform capability, ecosystem maturity, and production mitigation architectures advance. Microsoft Build 2026 shipped Azure Content Understanding, combining Document Intelligence with LLM reasoning, with named production deployments (DataSnipper, FinHero, Wolters Kluwer tax workflows); the Doczy.ai production contract system achieved 99% accuracy versus a 55% rules-based baseline. Ecosystem maturity signal: IBM launched Docling for IBM watsonx as a fully managed service (June 15, 2026), backing the open-source toolkit (40M total downloads, 500k daily) with production hardening and independent validation from a Singapore financial institution (2× parsing speed, improved accuracy). AWS Bedrock Claude GA expands enterprise multimodal reach with a 1M token context across legal document parsing, insurance claims analysis, and operations document extraction. Legal sector adoption scales rapidly: the RSGI independent survey (87 respondents, April-June 2026) reports 68% of law firms and in-house teams deploying multimodal AI agents, 21% running 50+ agents, 11 hours/week time savings, and a 44% revenue increase among tracking firms. The RealDoc-Bench independent benchmark (1,500 layout samples, 1,359 Q&A prompts across logistics, healthcare, finance, real estate) confirms agentic parsers achieve 91-96% Q&A accuracy versus AWS Textract at 70.5%, establishing an architectural preference for layout-preserving approaches. Hallucination mitigation research advances: retrieval-augmented reliability-aware inference (RARI) improves accepted prediction accuracy 85.84%→88.88%; typed hallucination auditing on legal contracts (LegalHalluLens, CUAD 510 contracts) shows multi-agent debate reducing fabrications 45% while 4B-parameter models achieve commercial API parity; fine-grained detection systems (ZINA, CVPR 2026) enable phrase-level error localization across six categories. Layout detection models still fail to generalize from academic benchmarks to real institutional documents; multi-turn hallucination cascading (MM-Snowball, ICML 2026) is now peer-reviewed. Market signal: deployment ROI is proven at scale with vendor backing and field-ready mitigation patterns are emerging, but attribution gaps and interactive reliability remain adoption gates for most organizations.