The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
AI that summarises individual documents and synthesises information across multiple sources into coherent outputs. Includes executive summary generation and cross-document theme extraction; distinct from deep research, which autonomously gathers sources rather than summarising provided ones.
Document summarisation has reached an awkward plateau. Every major productivity platform now ships summarisation as a native feature, and forward-leaning organisations in finance, legal, and insurance are using it in production workflows. Yet the practice remains leading-edge rather than good-practice because reliability has not kept pace with availability. Independent benchmarks consistently find error rates between 10 and 50 percent depending on domain, with multi-document synthesis and high-stakes contexts still failing basic accuracy tests. The result is a sharp two-tier pattern: bounded, low-risk summarisation (meeting notes, internal documents, customer reviews) works well enough with mandatory human post-editing, while scientific, legal, regulatory, and financial use cases remain blocked by factual consistency failures and validation costs that erode the speed gains. The defining tension is not whether AI can summarise documents (it can, fluently) but whether organisations can trust those summaries without re-reading the source material; for most complex domains, the answer is still no.
Vendor platforms now treat summarisation as commodity infrastructure. Microsoft and Google have embedded it deeply across their productivity suites (Copilot with agentic document editing in Word/Excel/PowerPoint, Gemini summaries with audio narration in Docs/Drive/enterprise search), and AWS Bedrock documents it as a GA use case for financial analysis, legal review, healthcare, and operations. The May 2026 scan shows vendor infrastructure maturing rapidly: Claude Opus 4.7 (April 2026) sustains 78.3% accuracy at 1M-token scale over 14.5-hour workflows; Claude Cowork Desktop (GA April 2026) enables simultaneous cross-analysis of 10–100 documents; and Gemini Enterprise Agent Designer synthesises multi-source data (CRM, email, video, database) into actionable briefs.

Deployment evidence is increasingly concrete. At Freshfields, a 5,700-employee law firm, adoption of Claude contract summarisation grew by more than 500% in six weeks (April 2026). A global law firm reports a 60% reduction in legal research time via a production GPT-5.1 deployment. Patent landscape analysis (April 2026) confirms an active commercial field, with 94.3% summarisation accuracy benchmarks and dense 2022–2026 patent filing activity indicating non-research deployments.

Yet reliability barriers have sharpened rather than softened. Macquarie Bank reported 130K hours saved across Gemini Enterprise in seven months, but the deployment scope obscures how much of that is summarisation versus other capabilities. Microsoft Research's DELEGATE-52 benchmark (May 2026) shows every frontier model corrupting roughly 25% of content in extended document workflows, and models omit content more often than they hallucinate (85.5% accuracy but 0.40 completeness on EnterpriseDocBench). Hallucination rates remain structural: roughly 3% for Claude 4.6 Sonnet, 8–12% for GPT-5.2, and 10–15% for Gemini 2.5 Pro on April 2026 benchmarks, and even source-grounded summarisation still hallucinates at a floor of roughly 0.7%. Citation accuracy is the worst-performing task (6.8–19.1% error rates). A May 2026 opinion analysis quantifies the business impact: $67.4B in annual losses from AI hallucinations, with 47% of executives admitting they make major decisions on unverified AI output. Mitigations exist (RAG grounding, retrieval verification, and semantic task specification improve accuracy by 17–23 percentage points; extended thinking and structured prompts reduce hallucination by 30–50%), but they require engineering overhead that erodes the speed gains.

The bifurcation has crystallised into a stable equilibrium. Bounded, internal-use summarisation (documents, emails, reviews) is a commodity, with mandatory post-editing standard across all major platforms; high-stakes domains (legal, medical, financial, multi-document synthesis, regulatory, scientific) remain blocked by factual consistency barriers and validation costs. Legal-specific tools (CoCounsel, Harvey) achieve 77–95% accuracy on controlled legal tasks, but general-purpose models remain unreliable for the domain. Healthcare and insurance continue piloting production deployments despite accuracy risks, driven by document volume and labour constraints. The practical reality: organisations deploy summarisation widely for volume reduction in bounded contexts, and they defer high-stakes use cases pending reliability improvements that are not yet materialising at scale.
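The mitigation the scan keeps returning to, grounding the model in retrieved source excerpts and forcing it to cite them, is straightforward to express in code. Below is a minimal sketch of that pattern, assuming the openai Python SDK and a placeholder model name; the lexical chunk-scoring is illustrative only, where a production retriever would use embeddings.

```python
# Minimal sketch of source-grounded summarisation: select the most relevant
# chunks of the source, then instruct the model to summarise only from them
# and to cite the excerpt behind every claim.
# Assumes the `openai` Python SDK is installed; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chunk(text: str, size: int = 2000) -> list[str]:
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overlap_score(query: str, passage: str) -> int:
    """Crude lexical-overlap score; a production retriever would use embeddings."""
    query_terms = set(query.lower().split())
    return sum(1 for word in passage.lower().split() if word in query_terms)


def grounded_summary(document: str, focus: str, top_k: int = 5) -> str:
    """Summarise `document` with respect to `focus`, grounded in cited excerpts."""
    passages = sorted(chunk(document), key=lambda p: overlap_score(focus, p), reverse=True)[:top_k]
    excerpts = "\n\n".join(f"[Excerpt {i + 1}]\n{p}" for i, p in enumerate(passages))
    prompt = (
        f"Summarise the excerpts below with respect to: {focus}\n\n{excerpts}\n\n"
        "Rules: use only the excerpts above; cite excerpt numbers for every claim; "
        "if something is not covered by the excerpts, write 'not stated in source'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; substitute whatever model your platform exposes
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

This is the shape of the RAG-grounding mitigation cited above: the reviewer checks each claim against a numbered excerpt rather than re-reading the whole source. The post-editing requirement does not go away.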
— DELEGATE-52 benchmark shows every frontier model corrupts ~25% of content in long document workflows; critical negative signal for synthesis reliability at scale despite capability maturity.
— Claude Opus 4.7 (GA April 16, 2026) enables 1M-token context for processing entire contract libraries and multi-year reports; 78.3% accuracy sustained at scale with 14.5-hour workflows.
— Systematic benchmark for hallucination detection across 72 configurations; NLI verification achieves 0.88 AUROC for validating factual claims, applicable as a post-processing check on summarization output (see the sketch after this list).
— Research digest shows hallucination rates of 3–19% across frontier models, an eightfold reduction since 2024 but still persistent; reasoning-RL trade-offs and accuracy-warmth tensions constrain high-stakes deployment.
— Patent landscape analysis documents 94.3% summarization accuracy (Gemini 1.5) and active commercial deployment across a 16-year technical evolution; the field shows a transition from research to production.
— Major law firm (Freshfields, 5,700 employees) saw adoption of Claude for contract analysis and document summarization grow by more than 500% in six weeks under a multi-year co-development agreement.
— AWS Bedrock GA documentation covers financial analysis, legal review, healthcare, and operations summarization use cases, with 1M-token context capability for large-scale document synthesis.
— Empirical benchmark of 22 models' context fidelity across 1,000–100,000 token windows; establishes summarization reliability constraints and performance baselines by model and document length.
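The NLI-verification result above amounts to an entailment check over each summary claim against its source. As a minimal sketch, assuming the Hugging Face transformers library and the public roberta-large-mnli checkpoint (the benchmark itself does not name a model), a post-processing pass might look like this:

```python
# Post-processing check: flag summary sentences that the source does not entail.
# Assumes `transformers` and `torch` are installed. roberta-large-mnli is a
# public NLI checkpoint used here for illustration, not the benchmark's model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
ENTAILMENT_ID = model.config.label2id.get("ENTAILMENT", 2)


def entailment_probability(premise: str, claim: str) -> float:
    """Probability that the premise (source passage) entails the claim."""
    inputs = tokenizer(premise, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, ENTAILMENT_ID].item()


def flag_unsupported(source_passage: str, summary: str, threshold: float = 0.5) -> list[str]:
    """Return summary sentences whose support from the source falls below threshold.

    In practice each claim should be checked against the specific passage that
    is supposed to support it, not a whole document truncated to 512 tokens.
    """
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return [s for s in sentences if entailment_probability(source_passage, s) < threshold]

# Anything returned here goes back to a human editor rather than out the door.
```

At the reported 0.88 AUROC, a check like this is a filter rather than a guarantee: a meaningful share of unsupported claims still passes, which is why the post-editing requirement described above does not go away.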
2022-H1: Major vendors (Google, AWS, Microsoft) released or matured summarisation capabilities within cloud NLP platforms. Benchmark-based research confirmed LLM human-parity on standard news datasets, but critical gaps emerged: evaluation metric unreliability, inconsistent human-AI collaboration outcomes, and lack of real-world deployment case studies beyond early adopter pilots like CarMax customer review aggregation.
2022-H2: Vendor ecosystem expanded with new product releases (Microsoft Azure multi-genre support, Google Cloud Document AI). Research published mid-late 2022 sharply exposed adoption barriers: multi-document synthesis failed in medical/academic domains (GPT-3), factual consistency metrics were unreliable (models scored false statements highly), benchmark datasets had validity issues. ChatGPT's November release demonstrated capability but revealed widespread hallucination and incomplete outputs. Adoption remained limited to bounded, low-risk domains where error tolerance existed or post-editing was feasible.
2023-H1: Real deployments accelerated (Parabol meeting summaries, Azure pipeline tutorials), yet evaluation problem persisted. Early 2023 research showed ChatGPT could improve factual inconsistency detection but exhibited reasoning flaws and instruction comprehension issues. User reports confirmed hallucination when summarizing external sources (YouTube, specific documents), indicating that capability breadth had expanded faster than reliability assurance. Adoption pattern stabilized: vendors releasing tools, startups integrating summaries in low-risk contexts (meeting notes, review aggregation), with human post-editing as standard practice. High-stakes domains and multi-document synthesis remained exploration-phase rather than production-deployed.
2023-H2: Vendor momentum accelerated with GA releases: Google Cloud (Document AI Summarizer with Deutsche Bank/BBVA customers), Microsoft (Azure AI Language task-optimized features with Beiersdorf and Arthur D Little), and mainstream adoption surveys (67% of enterprises, O'Reilly November 2023). However, hard boundaries emerged simultaneously: JPMorgan and Deutsche Bank banned ChatGPT due to accuracy/liability concerns; legal practitioner documented complete failure of TextRank + ChatGPT for case-law summaries; bioRxiv's scientific preprint pilot showed mixed results (some summaries were "gibberish"). Practice bifurcated: low-risk bounded summarization (meetings, reviews, internal docs) became standard with human post-editing; high-stakes domains (legal, financial, scientific) remained blocked by unresolved hallucination and factual consistency issues.
2024-Q1: Vendor infrastructure consolidated: Google Cloud's Vertex AI deployed Gemini 1.0 Pro (GA February 2024) with named customers Samsung and Palo Alto Networks; Azure and SageMaker maintained positions. Research validated both capability progress (medical abstract summarization: 92.5% accuracy, 90/100 quality ratings) and persistent limitations (ChatGPT modest at relevance classification; hallucination/bias open challenges per academic survey). Critical domains deepened validation barriers: hospitals reported "needle in haystack" manual review burden for clinical summaries; risk sensitivity remained high. Large-scale positive deployment: Digital Science integrated AI summarization across 350 million research documents (March 2024), indicating research domain confidence. Two-tier adoption hardened: low-risk bounded summarization standard with post-editing; high-stakes domains (medical, legal, financial, scientific) remained adoption-constrained by validation, liability, and factual consistency risks.
2024-Q2: Vendor GA milestones expanded distribution: Google's Gemini in Workspace (June 2024) made summarization a native feature in Docs, Sheets, Slides, Drive for millions of business users; Microsoft advanced Azure AI Language with conversation recap GA and native document support preview. Research exposed specific capability gaps: NAACL 2024 benchmarking found GPT-4 covers only 40% of diverse information in multi-document news summarization, establishing limits in multi-source synthesis. Independent assessments documented practical failures: ChatGPT omitted main proposal in 50-page pension policy summary, validating persistent gaps in understanding and synthesis. Deployment bifurcation sharpened: commodity availability in low-risk contexts (meetings, reviews, internal docs) vs. high-stakes domains (legal, medical, financial, research) blocked by unresolved diversity coverage, factual consistency, and validation cost barriers.
2024-Q3: Major vendors delivered platform-native summarization: Google Drive native PDF summarization (July 30, GA), Microsoft Word Copilot summaries (GA September, 80,000-word limit with documented failures on longer documents). Real-world deployment evidence emerged: Factal deployed production summarization for risk intelligence with validated source material, requiring careful prompt engineering and post-editing. Critical negative signal: Australian government trial (ASIC, September) tested Llama2-70B summarization and found AI scoring 47% on accuracy rubric vs. 81% human baseline, struggling with basic tasks (page references, relevance). Peer research confirmed multi-document synthesis remains fragile (TACL finding models fail on sentiment synthesis across reviews). Practitioner assessments documented ChatGPT omitting key content and fabricating information on long documents. Pattern held firm: commodity summarization in bounded, low-risk contexts now standard across enterprise platforms; high-stakes domains (legal, financial, scientific, medical) remained blocked by synthesis gaps, factual consistency risks, and validation costs.
2024-Q4: Vendor momentum sustained: Google (December 2024) continued Gemini availability in Workspace; Microsoft's Azure fine-tuned Phi-3.5-mini for summarization (previewed February 2025). Critical reliability findings emerged: Tow Center study found ChatGPT Search returned incorrect responses in 153/200 test cases, fabricating citations and misattributing sources—exposing synthesis accuracy failures in real-world applications. Research deepened multi-document concerns: academic analysis identified LLM bias in summarization, systematically overrepresenting viewpoints. Adoption signals mixed: regulatory professionals survey (100 respondents) showed 96% see AI as essential for document summarization in submissions, yet barriers persisted (outdated IT 45%, perceived risks 44%, data quality 42%). New product deployments: Nutrient launched AI Assistant for document management with summarization, Q&A, and redaction; Allvue survey showed 82% private equity AI adoption but only 58% active use due to regulatory/data quality gaps. High-stakes domain concerns solidified: legal practitioners explicitly recommended against using generative AI for document comparison due to hallucinations; user reports documented ChatGPT PDF handling failures. Bifurcation hardened: low-risk bounded summarization (meetings, reviews, internal docs) standard with post-editing; high-stakes domains (legal, regulatory, financial, research) remained blocked by fabrication risks, synthesis fairness issues, and validation cost barriers.
2025-Q1: Vendor product momentum sustained: Google (January 2025) updated Vertex AI document tuning capabilities; Microsoft released Azure Language summarization updates (March 2025) with Phi fine-tuning. Critical reliability findings deepened: BBC independent benchmark (February 2025) tested major AI chatbots on 100 news article summaries, finding 51% error rate with 19% introducing specific factual errors (dates, numbers, source misattribution); Apple suspended news summarization feature due to accuracy failures. Positive signal from legal domain: VLAIR benchmark (February 2025) showed legal-specific tools (CoCounsel, Harvey) achieved 77–95% document summarization accuracy with 6–80x speed improvement in controlled legal tasks. Industry analysis documented dual signals: enterprise adoption advancing (40% time savings reported) but constrained by 10–15% persistent error rates and hallucinatory risks, with documented cases of mis-summarization triggering regulatory fines. Bifurcation deepened: low-risk bounded summarization commodity with post-editing; news synthesis, regulatory/financial documents, and multi-document theme extraction remained blocked by accuracy barriers and validation cost barriers.
2025-Q2: Vendor platform momentum accelerated with GA releases: Google Gemini 2.5 auto-PDF summarization in Drive (June 2025, 120-page → 500-word summaries, 20 languages, Workspace Business/Enterprise availability); Microsoft Azure Language transparency note (June 2025) with explicit high-stakes warnings and responsible AI guardrails. Critical capability regression signal emerged: Royal Society peer-reviewed study (May 2025) of 5000 summaries across 10 LLMs found 73% omission probability and 5x error rates vs. human abstracts, with regression across model updates (ChatGPT-4o 9x worse than predecessor). Technical verification confirmed production issues: Azure Abstractive API mixing languages on long inputs (May 2025, confirmed by Microsoft). Bifurcation sharpened: low-risk bounded summarization (internal docs, reviews) commodity with mandatory post-editing; multi-document synthesis, scientific/news/regulatory documents remained blocked by quality regression, factual consistency barriers, and validation costs. Platform ubiquity masked underlying reliability degradation.
2025-Q3: Vendor GA expansion continued: Google integrated proactive summarization into Google Forms (GA September 15) and expanded Gemini summarization to Drive folders/documents (GA September 30, 35x YoY growth in Cloud Gemini usage). Adoption analysis revealed scaling barriers: ISG report (September 2025) found only 31% of use cases in production with 1 in 4 achieving expected ROI; copilots top use case but only one-third in production. Field evidence on capability limitations: study at Society of Science Writers (September) documented ChatGPT hallucinating and inverting causality in scientific paper summaries. Analyst projections optimistic (Forrester TEI model 25–63 hours meeting savings, 122–408% ROI) but deployment reality constrained: low-risk bounded summarization commodity across platforms with post-editing standard; high-stakes contexts (multi-document synthesis, news, scientific, regulatory) remained blocked by documented hallucination, synthesis quality, and validation cost barriers. Bifurcation sustained at higher absolute volumes: ubiquity in bounded contexts, hard capability limits in complex domains.
2025-Q4: Vendor consolidation persisted: no major GA announcements in summarization (Microsoft Phi fine-tuning preview continued from Q1). Enterprise ROI measurement barrier sharpened: October survey found 50% of technology leaders unable to quantify productivity savings from Copilot and summarization features despite 12+ months of adoption, highlighting that the deployment difficulty is measurement opacity, not capability ambiguity. Domain-specific adoption emerged: insurance industry piloting AI summarization for high-complexity claims files (thousands of pages per claim), signaling real-world deployment in specialized contexts where document volume drives adoption despite accuracy risks. Reliability signals consolidated: November BBC independent analysis confirmed ongoing hallucination and inaccuracy rates (error prevalence noted in multiple vendor implementations). Bifurcation hardened into a stable equilibrium: commodity-status bounded summarization (meetings, forms, internal docs, customer reviews) with post-editing standard across all major platforms; high-stakes domains (legal, medical, financial, multi-document synthesis, scientific literature) remained blocked by accuracy barriers, validation costs, and documented failure cases preventing broad adoption.
2026-Jan: Vendor product momentum continued: Microsoft advanced Copilot with agentic Agent Mode for document editing and summarization (Word December 2025, Excel/PowerPoint January 2026), signaling feature maturity in mainstream productivity. Adoption signal: Fuse Research Network survey of 23 asset managers found 91% used document summarization in 2025, up from 56% in 2024—rapid adoption in financial services. Deployment reality: Microsoft Copilot tested at ~40,000 words/150 slides with quality varying by PDF structure; ChatGPT long-document implementation reports show structured workflows mitigating context-chunking errors. Critical reliability signal: ChatGPT hallucination analysis documented 60%+ fabricated citations, with specific cases of misaligned numbers and invented studies in summarization tasks; legal document assessment confirmed tools missing critical nuances. Bifurcation sustained: low-risk bounded summarization commodity-status with mandatory post-editing; high-stakes domains remained blocked by factual consistency barriers and validation costs.
2026-Feb: Major vendor platform maturity milestones: Google launched audio summarization in Docs (GA February 12), expanding modality diversity across Workspace; Microsoft released Copilot Tuning Document Summary agent template (GA February 28) enabling organizational customization; Google Cloud documented Gemini Enterprise search summarization API (February 19). Real-world domain adoption confirmed: Nextpoint eDiscovery survey of 559 practitioners found 65.8% use AI summarization in actual projects, indicating production deployment in regulated legal domain despite accuracy/defensibility concerns. Critical limitation signal: Eindhoven University assessment (February 24) reported summarization accuracy 68.8% (Gemini 3 Pro), 61.8% (ChatGPT 5), 51.3% (Claude 4.5 Opus) with AI summaries 5x more prone to overgeneralization than human summaries—institutional critique establishing educational domain unsuitability. Peer-reviewed evidence: JMIR biomedical study found ChatGPT faster and more consistent than humans but with significantly higher error odds (OR 0.10), confirming reliability trade-offs in scientific domain. Bifurcation hardened at scale: commodity summarization (bounded documents, internal use) achieved production status across major platforms with post-editing standard; high-stakes domains (academic research, regulated legal, scientific, financial) faced persistent accuracy barriers, validation costs, and institutional resistance despite availability.
2026-Mar: Vendor product expansion and deployment acceleration: Google announced Gemini Workspace integration (March 10) with 70.48% accuracy benchmark on SpreadsheetBench; Microsoft continued Copilot expansion with organizational customization; PwC scaled Copilot to 230,000 employees with document/email summarization as core value driver. Critical domain-specific findings: FINRA 2026 report identified document summarization as #1 GenAI use case in financial services (production deployment in compliance workflows), while simultaneously flagging hallucination and autonomous storage risks. B3Networks deployed Gemini Enterprise across JIRA/Confluence/Docs to synthesize unstructured data, generating 1,800 answers from 3,500 queries in one month with 20+ minute per-query time savings. Peer-reviewed research published: SciZoom benchmark of 44,946 papers (Pre/Post-LLM eras) documented linguistic impact—up to 10x increase in formulaic expressions and 23% decline in hedging language after ChatGPT release. UC San Diego study confirmed behavioral impact: despite 60% hallucination rate on product reviews, AI summaries increased purchase intent, with 26.42% of summaries shifting sentiment—demonstrating real-world adoption but also systematic distortion. Hallucination benchmarking consolidated: authoritative meta-analysis showed 0.7% hallucination baseline (basic summarization), rising to 18.7% (legal questions) and 15.6% (medical queries); hallucination established as inherent structural property. Critical failure case documented: systematic caveat omission in scientific abstracts led to misdeployment of clinical algorithm, resulting in 22% increase in unnecessary antibiotics. Bifurcation persists at scale: commodity summarization (documents, emails, meetings) standard across enterprises at 230K+ deployment scale; high-stakes domains (legal, medical, regulatory, multi-document synthesis) remain blocked by documented hallucination, caveat loss, and validation cost barriers.
2026-Apr: Research and market analysis sharpens deployment reality: EACL 2026 peer-reviewed papers documented systematic capability gaps—ARC benchmark showed LLMs frequently omit salient arguments in legal and scientific documents due to context window positional bias and role-specific preferences; SumRank research achieved 42x speedup for long-document ranking via query-aware summarization, addressing production scalability. Real-world healthcare deployment confirmed: German hospital integrated clinical summarization for discharge summary generation with expert validation. Critical market signal emerged: AIIM survey of 600 enterprises revealed 61% of intelligent document processing workflows still involve manual intervention, with 66% of new tool deployments replacing failed prior implementations—signalling fundamental trust barrier rather than capability gap. Gartner market analysis identified data quality as single biggest adoption constraint and observed inflection from experimentation-phase general-purpose pilots toward verticalized production deployments in domains where document volume justifies validation overhead (insurance claims, healthcare, compliance). Vendor product deployments continue: V7 Labs demonstrated multi-sector customer success with quantified outcomes (insurance: 33% daily claim processing increase); real-world tests of Gemini Drive auto-summarization condensed a 120-page grant application to an 8-bullet summary with follow-up prompts, confirming the feature as production-ready commodity. Independent satisfaction surveys show 82% of Google Workspace users perceive genuine AI value in summarization vs 66% for Microsoft 365 Copilot. Hallucination benchmarking reveals persistent structural limitations: Vectara data shows hallucination rates jump 3-10x on enterprise-length documents, with grounding in source documents achieving 30-50% reduction; comparative empirical testing across 500+ factual queries places Claude ~4%, GPT ~6%, Gemini ~9%—all exceeding acceptable thresholds for high-stakes domains; analysis across 37 LLMs confirms >15% hallucination rates on factual tasks, demonstrating larger context windows do not guarantee accuracy. Multimodal research (Amazon REFINESUMM) addresses dataset quality bottleneck for text-visual synthesis. Bifurcation persists: commodity summarization in bounded contexts (meetings, forms, reviews) now standard; healthcare and insurance domains emerging with production deployments; high-stakes legal, scientific, and regulatory contexts remain constrained by accuracy barriers and validation costs despite technical capability maturity.
2026-May: Vendor infrastructure maturity and deployment evidence crystallize platform-level saturation: Claude Opus 4.7 (GA April 2026) sustains 78.3% accuracy at 1M-token scale; Claude Cowork Desktop (GA April 2026) enables simultaneous cross-analysis of 10–100 documents; Gemini Enterprise Agent Designer synthesizes multi-source data streams. Production adoption accelerates: Freshfields law firm (5,700 employees) saw adoption of Claude contract summarization grow by more than 500% in six weeks (April 2026); global law firm reports 60% legal research time reduction via GPT-5.1 deployment; patent landscape analysis documents 94.3% accuracy benchmarks with dense 2022–2026 commercial filing activity. Yet reliability barriers sharpen: Microsoft Research DELEGATE-52 benchmark shows every frontier model corrupts ~25% of content in extended workflows; EnterpriseDocBench reveals 85.5% accuracy but 0.40 completeness gap, with weak cross-stage correlation (r=0.02–0.17). Hallucination metrics show structural persistence: Claude 4.6 Sonnet ~3%, GPT-5.2 ~8-12%, Gemini 2.5 Pro ~10-15%; citation accuracy worst-performing (6.8–19.1% error). LLM-ReSum framework achieves +33% factual accuracy and +39% coverage via self-reflective evaluation. Mitigations quantified: RAG grounding, semantic task context, extended thinking improve accuracy +17–23pp; hallucination detection (NLI verification 0.88 AUROC) enables post-processing validation. Business impact documented: $67.4B annual loss from AI hallucinations, 47% of executives making major decisions on unverified output. Bifurcation stabilized: commodity bounded summarization (10–15% sustained error, post-editing mandatory) standard across all major platforms; legal-specific tools (CoCounsel, Harvey) achieve 77–95% on controlled tasks but general-purpose models remain unreliable; high-stakes domains (medical, financial, multi-document, regulatory, scientific) remain adoption-constrained by validation costs and factual consistency barriers despite infrastructure capability maturity.