Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

[Figure: AI Maturity by Domain. Each dot marks the weighted maturity of practices within a domain, plotted per domain on a scale from Bleeding Edge to Established.]

Call summarisation & disposition

TIER: Good Practice

TRAJECTORY: Stalled

AI that automatically summarises support calls and generates disposition codes and structured notes for CRM entry. Includes after-call work automation and key moment extraction; distinct from call transcription in sales, which focuses on sales conversations rather than support calls.

OVERVIEW

Call summarisation and disposition has reached mainstream platform maturity, with ecosystem-wide GA adoption and documented ROI from early production deployments. Every major vendor (Microsoft, AWS, Zendesk, Genesys, Talkdesk, ServiceNow, Oracle, Webex, Dialpad, Five9) ships AI-generated post-call summaries and automated disposition as core GA features, and proven deployments report 25-50% AHT reductions with sustained agent productivity gains (40-90 seconds saved per call in field implementations). The practice replaces manual after-call work (typing summaries, selecting disposition codes, updating the CRM) with LLM-based automation that extracts the issue, resolution, action items, and classification codes during the call or immediately afterwards. The technology clearly works at scale: Genesys, Vapi, and AWS implementations confirm sub-100-second automation cycles, and third-party platforms such as Lindy.ai are shipping plug-and-play solutions. However, production deployments universally require deliberate architectural choices, quality validation protocols, and acceptance of persistent limitations. Hallucination, fabricated customer statements, and diarisation failures on hybrid calls remain unsolved technical barriers that separate capability availability from mid-market adoption. Organisations willing to invest in RAG-grounded architectures, fine-tuning, and human-in-the-loop validation achieve genuine efficiency gains; those deploying out-of-the-box models continue to experience agent distrust and quality failures.
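To make the extracted fields concrete, the structured output described above (issue, resolution, action items, classification code) could be held in a small record that is checked before it reaches the CRM. This is a minimal sketch only: the class name, field names, and the set of disposition codes are hypothetical and do not come from any vendor's schema.

```python
from dataclasses import dataclass, field

# Hypothetical disposition codes; a real deployment would load its own list.
ALLOWED_DISPOSITIONS = {"RESOLVED", "ESCALATED", "CALLBACK_REQUIRED"}


@dataclass
class CallSummary:
    """Structured post-call record (illustrative schema, not a vendor format)."""
    issue: str
    resolution: str
    action_items: list[str] = field(default_factory=list)
    disposition_code: str = ""
    reviewed_by_agent: bool = False  # human-in-the-loop confirmation flag


def validation_errors(summary: CallSummary) -> list[str]:
    """Return the reasons this record should not yet be written to the CRM."""
    errors = []
    if not summary.issue.strip():
        errors.append("missing issue")
    if not summary.resolution.strip():
        errors.append("missing resolution")
    if summary.disposition_code not in ALLOWED_DISPOSITIONS:
        errors.append("unknown disposition code")
    if not summary.reviewed_by_agent:
        errors.append("agent review pending")
    return errors
```

Keeping agent review as an explicit field reflects the human-in-the-loop validation the overview describes; an empty error list is the only state in which the record would be filed.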

CURRENT LANDSCAPE

Platform ecosystem standardisation is complete: all major contact centre vendors (AWS, Microsoft, Genesys, Zendesk, ServiceNow, Oracle, Webex, Talkdesk, Dialpad, Five9, CloudTalk) offer GA call summarisation bundled into base platform pricing. Geographic expansion continues (in June 2026 AWS Contact Lens added support for Portuguese, French, Italian, German, Spanish, Chinese, Japanese, and Korean), confirming sustained investment in global production readiness. Third-party vendors (Vapi, Lindy.ai, Aircall, Nextiva) are shipping independent AI summarisation and disposition products, demonstrating ecosystem depth beyond platform incumbents.

Real-world deployments confirm ROI at scale: Genesys implementations document 45-90 seconds saved per call; Lindy.ai reports 40-60 second reductions and a 40% increase in contact centre capacity; the Five9 case study with TruConnect (healthcare) achieved a 40% ACW reduction; and Verint baseline research (1,000 agents) establishes that 54% of calls require after-call work including summarisation, confirming the scale of demand. Field implementations validate a 25-50% AHT reduction from combined front-of-call and back-of-call AI automation, establishing a credible mid-range ROI.

However, production deployments reveal persistent technical limitations that constrain mid-market adoption. Hallucination remains endemic: independent comparative testing of five AI medical scribes documents systematic fabrication of medications and dosages, phantom exam findings, and confabulated patient statements. These error types map directly to call summarisation risks (invented customer statements, misrepresented agreements, fabricated action items). A production AI evaluation framework sets threshold requirements (faithfulness >95%, coverage >85%) that most raw summaries fail; organisations deploying out-of-the-box models see 63-89% raw accuracy, rising to 94-96% only with structured human-in-the-loop validation. Speaker diarisation accuracy drops by roughly 30 percentage points on hybrid calls; domain jargon blindness requires custom vocabulary tuning; and context reconstruction on escalations costs $200-500 per incident. Genesys implementations explicitly document mandatory human review of AI outputs before finalisation (agents cannot skip the validation step), and UJET research confirms that 93% of agents feel the need to double-check AI outputs before customer use, despite crediting summarisation with ACW reductions. The practice tier has stabilised at good practice: mainstream feature availability coexists with explicit technical barriers (hallucination, quality validation costs, tuning complexity) that separate capability from confident autonomous deployment.
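The faithfulness and coverage thresholds quoted above can be read as a simple release gate. The sketch below assumes faithfulness is the share of summary claims supported by the transcript and coverage is the share of key transcript points that appear in the summary; that reading, the claim counts supplied by a reviewer or automated checker, and every name in the code are illustrative assumptions rather than the cited framework's actual implementation.

```python
from dataclasses import dataclass

FAITHFULNESS_MIN = 0.95  # threshold quoted in the text
COVERAGE_MIN = 0.85      # threshold quoted in the text


@dataclass
class SummaryAudit:
    """Claim-level review of one AI-generated call summary (hypothetical schema)."""
    supported_claims: int    # summary claims backed by the transcript
    total_claims: int        # all claims made in the summary
    covered_key_points: int  # key transcript points present in the summary
    total_key_points: int    # key points identified in the transcript

    @property
    def faithfulness(self) -> float:
        # Treat an empty summary as failing rather than dividing by zero.
        return self.supported_claims / self.total_claims if self.total_claims else 0.0

    @property
    def coverage(self) -> float:
        return self.covered_key_points / self.total_key_points if self.total_key_points else 0.0


def passes_gate(audit: SummaryAudit) -> bool:
    """True only when both scores strictly exceed the quoted thresholds."""
    return audit.faithfulness > FAITHFULNESS_MIN and audit.coverage > COVERAGE_MIN


def next_step(audit: SummaryAudit) -> str:
    """Summaries that miss the gate are reworked; the rest still go to the agent."""
    return "ready_for_agent_review" if passes_gate(audit) else "needs_rework"
```

In this sketch a summary with one unsupported claim out of twenty scores exactly 0.95 and therefore does not pass, since the thresholds are treated as strict. Passing the gate routes the summary to agent review rather than straight to the CRM, in line with the mandatory human validation described above.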

TIER HISTORY

Research: Jan-2021 → Jan-2021
Bleeding Edge: Jan-2021 → Jan-2023
Leading Edge: Jan-2023 → Apr-2025
Good Practice: Apr-2025 → present

EVIDENCE (128)

  • Peer-reviewed empirical study of LLM-based call summarization (JMIR); rated summaries across accuracy, thoroughness, and hallucination freedom; documents both competence (useful 4.8/5, consistent 4.9/5) and gaps (hallucination-free 4.4/5), validating careful quality assessment as required practice.
  • Directly studies hallucination in LLM-based summaries across major models (ChatGPT 0.62, GPT-4 0.84, Claude 2 1.55 hallucinations per summary); demonstrates a 35% hallucination reduction via factored verification, a critical consideration for disposition reliability.
  • GA call summarization platform with reasoning-first architecture generating structured post-call summaries (intent, resolution, sentiment, next steps) that flow into CRM/helpdesk; the vendor's 98% accuracy claim and 48-hour deployment timeline indicate market maturity.
  • Verint Wrap Up Bot uses generative AI to automate after-call summarization; named case study (Utilita Energy) documents a 35-second reduction in summary time and a 10% agent capacity increase from automated disposition.
  • Benchmarks hallucination risk on 2,075 real production contact center calls (Feb-May 2026); finds GPT-5.5 achieves an 84.8% non-hallucination rate in production, implying a 15.2% failure rate even for frontier models, which is directly relevant to disposition quality reliability.
  • Named fintech deployed Claude models via Amazon Bedrock for automated call summarization in production, achieving 250,000+ annual hours saved, an 18-second handling-time reduction, a 5-point NPS lift, and a $700k annual efficiency gain.
  • Peer-reviewed JMIR study introduces a multi-dimensional evaluation framework (fabrication, accuracy, comprehensiveness, usefulness) for LLM-generated summaries; demonstrates a systematic quality assessment methodology applicable to call summarization validation.
  • Production evaluation framework rejecting ROUGE metrics in favor of faithfulness >0.95 (one fabrication per 20 summaries) and coverage >0.85; recommends FActScore atomic-claim decomposition and RAGAS hallucination scoring for continuous monitoring, establishing a quality baseline for contact center deployments.

HISTORY

  • 2021: IBM Research releases TWEETSUMM dataset for customer service dialog summarization; AWS Contact Lens launches production machine learning call summarization; Zendesk and competitors begin early access programs; standalone vendors like Noota claim commercial traction.
  • 2022-H1: AWS expands with Transcribe Call Analytics GA (March); Microsoft releases Context IQ AI-generated summaries in Dynamics 365 (April); ASAPP launches AutoSummary with 10%+ handle time reduction claims (May); agent surveys show 41% prioritize call summarization automation as top workflow improvement (June); academic research surfaces LLM position bias and gaps in responsible AI consideration in summarization systems.
  • 2023-H1: Technology shifts to LLM-based approaches across all major platforms; Zendesk and Google expand geographic rollout of generative AI summarization (March); research demonstrates fine-tuning techniques for smaller LLMs with controlled summary length (April); open-source community continues active development (CallSum, June); technical focus moves from capability maturation toward safe, fair, and responsible deployment patterns.
  • 2023-H2: Secondary vendors (CallMiner, Talkdesk) launch generative AI summarization capabilities (July-September); AWS publishes production deployment patterns for LLM-based summarization (November); Balto AI survey documents actual adoption momentum and ROI realities in contact centers (October); practice approaches commodification with cost and tuning barriers replacing capability barriers.
  • 2024-Q1: Microsoft automatically enables Copilot summarization for all Dynamics 365 Enterprise customers (January), signaling mainstream production-ready status; AWS enhances Contact Lens with generative AI post-contact summaries (March); ClickUp ships native call summarization in workflow platform; Qualtrics survey shows only 20% agent AI adoption despite platform availability; research and vendor analysis document persistent challenges: hallucination risks, accuracy issues across languages, 120-300 seconds still spent per call on dispositioning, less than 25% of notes meeting quality standards. Practice moves into mainstream with availability but deployment success requires significant customer-specific tuning.
  • 2024-Q2: Microsoft announces Dynamics 365 Contact Center as Copilot-first CCaaS platform with general availability July 1, 2024, establishing call summarization as core capability; AWS markets generative AI summarization in Transcribe Call Analytics for post-call efficiency; ServiceNow releases post-call summarization in Q2 2024 CSM update; Microsoft Wave 1 enhancements expand Copilot across omnichannel. Vendor consensus confirms market readiness, but Deloitte survey finds only innovator segment (minority) actively deploying, indicating broad platform availability without proportional adoption uptake. Persistent accuracy and tuning barriers remain despite universal vendor support.
  • 2024-Q3: Platform standardization completes: Microsoft, AWS, and ServiceNow all deliver production summarization capabilities with general availability. Microsoft Dynamics 365 Contact Center launches July 1, 2024 (Copilot-first CCaaS); AWS extends summaries to agents in Contact Lens (July); ServiceNow formalizes Now Assist call summarization (August). Microsoft's September guide emphasizes pilot-first rollout with measurement criteria for tuning success. However, technical barriers persist: a peer-reviewed empirical study documents fine-tuned BART models achieving 71% recall but degrading by more than 50% in zero-shot scenarios, and an Australian government evaluation shows LLMs producing verbose, hallucinated summaries inferior to human effort. Practice commodified but constrained by deployment tuning and accuracy barriers.
  • 2024-Q4: Early production deployments validate ROI at scale. Lenovo case study (December) documents 15% productivity gains and 20% handle time reduction with Copilot summarization. Amazon reports tens of thousands of Connect customers (10M daily interactions) with named adopters across retail, logistics, education, and travel. AWS releases fresh generative AI analytics (December) and detailed secure-summarization technical tutorial (October); research advances fine-tuning of smaller cost-efficient LLMs with length control (October). Gartner recognizes Microsoft as CRM leader, validating strategic Copilot-first architecture. Broadscale adoption remains constrained by accuracy, tuning, and business case barriers despite universal feature availability and proven Fortune 500 deployments.
  • 2025-Q1: Full platform standardization and feature expansion signal vendor confidence. AWS restructures Connect pricing to bundle post-contact summaries (March 2025), Microsoft extends summaries into quality management and compliance workflows (February 2025), and Zendesk achieves GA for agent workspace summaries (March 2025). Gartner reports 60%+ adoption across contact centers with 87% projected by year-end. Specialized vendors optimize for domain: AssemblyAI releases Conversational summarization models targeting support calls (February 2025). Practice moves from platform availability to deployment friction: organizational adoption, summary quality tuning, and ROI validation for mid-market remain the binding constraint on broadscale growth.
  • 2025-Q2: Production deployments validate ROI at scale; technical limitations become explicit in vendor transparency. Wisconsin DOR achieves 66% cost reduction and 60% hold time improvement across 500 agents with Amazon Connect Contact Lens (May 2025); Metrigy research documents 35% call time savings (May 2025). Simultaneously, Microsoft's official Azure AI documentation (June 2025) acknowledges quality degradation across language dialects and hallucination risks in production systems, and industry analysis (Dialpad, May 2025) details ASR errors and lack of reliable evaluation metrics for factual consistency. Practice reaches inflection point: mainstream platform adoption and early adopter ROI validation coexist with explicit documentation of technical barriers for broader deployment.
  • 2025-Q3: Platform standardization completes with quality focus shift. Empower (financial services) scales Amazon Connect Contact Lens + Bedrock for QA automation with 5,000 daily transcriptions and 20x QA efficiency (August 2025); global company deploys Dynamics 365 Contact Center with Copilot post-call summaries across regions (July 2025). ServiceNow updates Now Assist documentation (July 2025); Zendesk expands GA internationally (September 2025). Observe.AI research documents critical quality limitation: all 20 major LLMs (OpenAI, Claude, Llama, Nova) exhibit measurable operational bias on real call transcripts—shifting narrative from deployment speed to bias mitigation (August 2025). Practitioner ROI claims remain (25-40% handle time reduction) but adoption constraints shift from availability to accuracy and organizational change management. Practice consolidates in good-practice tier as early adopters prove ROI while broader mid-market adoption waits for bias and fine-tuning solutions.
  • 2025-Q4: Vendor feature parity reaches completion with enterprise-context capabilities. AWS launches AI-powered case summaries supporting multi-interaction and cross-team context (November 2025); Oracle ships automatic summarization with agent review workflows (October 2025); Zendesk enhances ticket summary capture with expanded word limits and improved context inclusion (October 2025). Platform commodification stabilizes with all major vendors offering GA features; remaining barriers are implementation friction (fine-tuning for dialect/vocabulary), organizational adoption (agent retraining), and quality limitations (persistent LLM bias from Q3 research). Early-adopter ROI documented (25-40% handle time reduction, 30% productivity gains) is primarily driven by customer-specific tuning, not platform feature quality alone. Practice remains at good-practice tier—proven ROI for innovators, but platform availability has decoupled from mid-market adoption; success now depends on solving implementation and quality barriers rather than feature development.
  • 2026-Jan: Platform vendor consolidation continues, with Microsoft, Talkdesk, and other major CCaaS providers confirming GA summarization capabilities by the end of January. Practitioner analysis shifts focus from capability availability to implementation economics and accuracy validation: documented evidence shows successful deployments require structured validation workflows (93-96% accuracy vs 63-89% for raw AI output); economic analysis reveals $200-500 per ticket in context reconstruction costs, with targeted solutions achieving 20-40% improvement; and technical failure modes (diarization accuracy drops 29 points on hybrid calls, jargon blindness, conditional logic omission) remain unresolved in out-of-box deployments. Early-adopter case studies continue to report 25-40% handle time gains, but analysis reveals these depend on customer-specific tuning rather than platform maturity. Practice tier stable at good-practice; mid-market adoption blocked by economic validation requirements and accuracy-tuning friction rather than feature gaps.
  • 2026-Feb: Vendor feature standardization and transparency reach new maturity. Microsoft extends Copilot with row summarization capability in Customer Service (Feb 25, 2026); Webex adds AI-enhanced post-call and mid-call summaries with 24-hour API access (Feb 17); CloudTalk updates product with AI tagging and CRM auto-entry (Feb 27); Dialpad maintains GA for AI Call Summary with sentiment and category support. Critically, Microsoft publishes official Azure AI documentation (Feb 28, 2026) explicitly detailing summarization quality limitations: dialectal variance causing degradation, abstract hallucination risks, poor performance on under-represented conversation types—marking shift from marketing claims to vendor acknowledgment of deployment barriers. Industry metrics from Thunai (Feb 12, 2026) document 60%+ contact center adoption of AI summarization tools with 35% operational cost reduction claims, confirming ecosystem momentum. However, adoption metrics reflect feature deployment rather than ROI realization; the documented validation and tuning requirements from Q1 2026 remain binding constraints. Practice consolidates at good-practice: universal platform GA status coexists with explicit vendor documentation of reliability limitations and persistent deployment friction that separate capability availability from organizational adoption at scale.
  • 2026-Q1 (Mar-Apr): Deployment evidence validates scale and ROI with vendor feature consolidation. AWS Contact Lens (Mar 31) confirms conversational analytics GA with three named customer deployments: Neo Financial (90-second ACW savings per call, 40 hours/month leadership efficiency); Fujitsu (60% QA automation efficiency); Frontdoor (50x sampling increase). Microsoft Dynamics 365 Contact Center (Mar 18) confirms 2026 Wave 1 release with one-click case summaries across chat, email, and notes. Amazon Connect Health (Mar 5) case study: UC San Diego Health deployment with quantified clinical note summarization benefits. Simultaneously, critical quality limitation research documents systematic hallucination and bias risks: SupportLogic production framework reveals 94% to 73% quality variance across models; Suprmind's Vectara HHEM leaderboard benchmarks all 20 major LLMs showing measurable hallucination rates. Practitioner evidence (InflectionCX operator assessment) documents Contact Lens implementation reality: 2-5s latency, manual vocabulary configuration, pattern-matching limitations, and a steep implementation barrier. Real deployment case studies (Utilita: 35-second ACW reduction with Verint; UC San Diego Health) confirm ROI is achievable but requires structured validation workflows and customer-specific tuning. Ecosystem pattern: feature parity complete across AWS, Microsoft, ServiceNow, and secondary vendors; adoption friction has shifted definitively from whether the technology works to whether organizations can cost-justify the tuning and validation burden—a question that remains unsolved for the mid-market. Practice tier stable at good-practice; near-term growth blocked by implementation economics and quality validation requirements rather than capability gaps.
  • 2026-May (early): Vendor differentiation intensifies with domain-specific optimization and governance frameworks. Deepgram releases domain-specific language model for contact center summarization, fine-tuned on 200K conversations, signaling specialized vendor optimization (Apr 27). Cisco Webex adds multi-scenario AI summarization (dropped calls, transfers, consults) as GA feature (Apr 27). Independent third-party testing (Brilo, Apr 29) evaluates 10 platforms across 400+ real test calls, confirming broad vendor ecosystem maturity but revealing quality variance across implementations. Adoption gap persists as documented barrier: European bank case study (Apr 30) shows only 26% of 47,000 quarterly calls captured in CRM summaries, with 35,000 unanalyzed calls containing 2,800 upsell signals and 340 compliance gaps; UK contact center analysis (Apr 20) documents 67% record 100% of calls but 90% lack time/capability to analyze—revealing analysis bottleneck as adoption constraint. Agent trust barriers documented: UJET survey (Apr 22) shows 93% of agents feel need to double-check AI outputs before customer use despite ~70% crediting AI with ACW reduction, indicating quality reliability concerns persist. Governance frameworks emerge as binding requirement: NIST AI 600-1 (Apr 24) establishes pre-deployment testing and compliance requirements directly applicable to regulated summarization deployments. Economic analysis (Apr 20) documents 86/100 viability score for summarization + CRM automation with 3.5 FTE capacity recovery and £4,200 implementation cost. Practice consolidates at good-practice tier: vendors delivering full feature parity, deployments proving ROI, but adoption remains constrained by implementation economics, quality validation requirements, and organizational change barriers rather than technology capability.
  • 2026-May (mid): Platform reach expands and quality risk evidence deepens. Zendesk rolls out Copilot AI ticket summaries to all Professional+ plans at no additional cost (5 uses/agent/month included), marking summarisation as table-stakes infrastructure across mid-market. Genesys Agent Copilot GA confirms AI-generated summaries and wrap-up code prediction as core features at a major CCaaS vendor. Microsoft Copilot extends GA summarisation to non-Microsoft CRM systems via Dynamics 365 Contact Center. Zensar/Databricks production deployment documents 150% efficiency improvement and 40% processing time reduction across 25,000 calls per day—strongest scale case study this cycle. Verint agent experience survey (1,000 agents) confirms ACW automation reduces manual after-call work by 2.7 minutes per call. Technical quality risk sharpens: two independent analyses document compound hallucination amplification in multi-stage AI pipelines (transcription → summarisation → disposition) and the summarisation validity problem where compression discards task-critical information—directly applicable to disposition accuracy in regulated support environments.
  • 2026-May (late): Geographic expansion and architecture validation complete first cycle. AWS Contact Lens expands generative AI post-contact summaries to eight new languages (Portuguese, French, Italian, German, Spanish, Chinese, Japanese, Korean, plus regional English variants), signaling sustained vendor investment in global deployment maturity. Independent technical benchmarking clarifies architecture-outcome linkage: RAG-based approaches achieve 30-70% hallucination reduction vs fine-tuning alone; hybrid RAG+fine-tuning reaches 86% accuracy vs 81% fine-tuning only, demonstrating that deployment reliability depends on architectural choices rather than model quality alone. Five9 case study (TruConnect healthcare) documents 40% ACW reduction through automated summary generation and CRM auto-sync. Independent synthesis of 2026 adoption data (50+ verified metrics) confirms 25-50% AHT reduction from combined front/back-of-call automation, establishing mid-range ROI baseline. Practice consolidates at good-practice: worldwide platform GA status coexists with explicit documentation that production deployment success depends on RAG-grounded architectures, tuning investment, and validation workflows—not feature availability alone.
  • 2026-Jun: Quality validation frameworks, hallucination benchmarks, and production scale evidence converge. A peer-reviewed JMIR study introduces a multi-dimensional evaluation framework (fabrication, accuracy, comprehensiveness, usefulness) for LLM summaries, and production AI eval frameworks establish faithfulness >0.95 and coverage >0.85 as production thresholds—benchmarks most raw out-of-box deployments still miss. A second JMIR study (factored verification) quantifies model-level hallucination rates across major LLMs (ChatGPT 0.62, GPT-4 0.84, Claude 2 1.55 hallucinations per summary) and demonstrates 35% reduction through factored verification, providing the clearest per-model quality signal to date. The vCX-Hard benchmark on 2,075 real production contact centre calls establishes that even GPT-5.5 achieves only 84.8% non-hallucination rate in production—a 15.2% failure ceiling for frontier models on live call data. Business failure cases (Air Canada, law firms sanctioned for fabricated citations, Deloitte contract loss) establish the liability profile where summarisation hallucinations drive customer disputes and compliance breaches, reinforcing mandatory human-in-loop review before CRM entry. On the capability side, Vapi GA confirms Claude Sonnet and GPT-4o as viable summarisation backends; Lindy.ai reports 40-60 seconds saved per call and +40% centre capacity. The standout production case is Chime Financial (Amazon Bedrock/Claude): 250,000+ agent hours saved annually, 18-second handling-time reduction, 5-point NPS lift, and $700K efficiency gain—the strongest single-named ROI case in the practice to date. Verint's 2026 agent survey (1,000 agents) confirms 54% of calls require after-call work, reaffirming ACW automation as the highest-impact AI opportunity in the contact centre.
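The compound amplification noted in the mid-May entry (transcription → summarisation → disposition) can be illustrated with simple arithmetic. The sketch below assumes each stage fails independently, which is a simplification, and the per-stage accuracies are made-up illustrative values, not figures from any cited study.

```python
from math import prod


def end_to_end_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent stages."""
    return prod(stage_accuracies)


# Illustrative per-stage accuracies (assumed values, not measured results).
stages = {
    "transcription": 0.95,
    "summarisation": 0.90,
    "disposition": 0.95,
}

overall = end_to_end_accuracy(stages.values())
print(f"End-to-end accuracy: {overall:.1%}")
print(f"Calls with at least one stage error: {1 - overall:.1%}")
```

Under these assumed numbers, three stages that each look acceptable in isolation combine to an end-to-end accuracy of roughly 81%, which is why validation is usually placed on the final CRM record rather than on any single stage.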

TOOLS