The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
AI for supporting, retaining, and understanding customers after the sale. This domain has the highest concentration of good-practice tiers: chatbots, ticket routing, sentiment analysis, and voice-of-customer are deployed at scale in most industries. Bleeding-edge frontiers include autonomous resolution without human escalation and real-time emotion detection. Momentum is steady, but churn prediction and proactive outreach remain stalled.
Customer operations is the most deployed AI domain in the enterprise, and this scan is the clearest evidence yet that deployment and value have come apart. McKinsey's June survey of 1,847 C-suite leaders puts 45 percent of the Fortune 500 in production with AI agents, up from 8 percent in 2024 — the fastest enterprise technology adoption curve in twenty-five years — and customer service accounts for 78 percent of those deployments. The headline economics are seductive: 42 percent lower cost per interaction, 35 percent better first-contact resolution, a claimed 340 percent average return with a 7.2-month payback. Every major platform — Zendesk, Intercom, Salesforce, Microsoft, AWS, Five9 — now ships autonomous agents, post-call summarisation, quality monitoring, and intent routing as standard, bundled capability. On paper, the domain has crossed into the mainstream.
The reality underneath is harsher. A Sinch survey of 2,527 enterprise decision-makers found that 74 percent of organisations that put autonomous AI customer agents into production rolled them back or shut them down after launch — and the rate climbs to 81 percent among the organisations with the most mature governance. That counter-intuitive finding is the most important structural fact in the domain right now: better monitoring does not prevent failures, it surfaces failures that less-instrumented organisations never notice. The same picture recurs across independent datasets. Thread Transfer's tracking of 1,200-plus agentic projects found only 4 percent reach production with positive ROI at six months. Better Business Bureau analysis of 100,000 complaints found 90 percent of the 20,000 that mention AI are negative. Five9's survey of 3,000 consumers found 83 percent still have to repeat themselves after an AI-to-human handoff, even though every CX leader surveyed claimed their systems preserve context. The gap between vendor metrics and customer-experienced reality — Decagon claims 80 percent deflection; the median real-world Zendesk deployment lands at 41 percent — is now the defining feature of the domain, not a footnote.
What distinguishes customer operations from younger AI domains is that the binding constraint is no longer the model. Across nearly every practice, the evidence points the same way: the technology works for narrow, high-confidence, low-ambiguity tasks, and the barrier to scaling beyond that boundary is organisational — knowledge-base quality, system integration depth, escalation architecture, data governance, and change management. Confluent's data-streaming survey captured it cleanly: 72 percent of leaders now cite real-time data infrastructure, not model capability, as the primary barrier to scaling agentic AI. The domain has bifurcated into a small vanguard extracting genuine returns through deployment discipline and a large majority stuck between executive ambition and operational reality. The winners are not the ones who bought the best platform; they are the ones who kept a human in the loop for 60 to 90 days, wired scores to playbooks, and refused to remove the review gate under ROI pressure.
The dominant movement this fortnight is a single, decisive reframing: auto-draft with human review — the safest, most proven pattern in the domain — shifted from advancing to stalled. This is not a downgrade of the practice's value; it remains the dominant production architecture, with Cresta reporting that 78 percent of customer conversations are now handled by human and AI together. The shift reflects that the practice has hit its ceiling. The new evidence (Intercom Fin across 180 customers at 34 percent handling-time reduction; Resolx across 17,170 businesses at 38.7 percent faster resolution) confirms the value but adds no new headroom, while the surrounding governance data — Gartner's prediction that 40 percent of enterprises will be demoted from higher autonomy tiers by 2027, the 74 percent rollback figure — makes clear that the human gate is now permanent rather than transitional. The practice has matured into its final shape.
Beyond that, this cycle was less about new capability than about reality-check evidence arriving in volume. McKinsey's 45-percent adoption figure and Richpanel's first-party benchmark (70-80 percent autonomous resolution at $0.30 per ticket across 2,000-plus brands) landed alongside a wall of failure documentation: Stanford's AI Index logging 362 AI incidents in 2025 (up 55 percent year-on-year), the catalogue of named disasters (Air Canada, Klarna's $2.3M in unauthorised refunds, a Meta support bot that hijacked 20,000 Instagram accounts through an account-recovery bypass), and Microsoft's GA launch of fully autonomous email resolution in Dynamics 365 colliding with a SumatoSoft survey in which zero of 72 executives reported running fully autonomous customer-facing AI. Two practices continued genuinely advancing against the grain: returns and claims automation (Loop at 55M returns processed, Travelers at 1.5M AI-assisted claims a year, insurance AI adoption jumping from 8 to 34 percent) and real-time voice translation (simultaneous GA launches from Google, Krisp, and Gradium, though accuracy gaps and code-switching keep it experimental). Stability across the rest of the domain is itself the signal: the technology is settled, and the work has moved decisively to operations.
The quieter story this fortnight is in the back office, where the proven, lower-risk practices kept consolidating. Call summarisation produced the strongest single ROI case in its history: Chime Financial saved more than 250,000 agent hours a year and lifted its net promoter score by five points using Claude via Amazon Bedrock. At the same time, a new benchmark on 2,075 real contact-centre calls found that even the best frontier model tops out at roughly 85 percent non-hallucination, leaving an error rate of about 15 percent and keeping a human review step mandatory before notes reach the CRM. Agent quality monitoring saw genuine market validation: Verint's $2 billion acquisition of Calabrio merged two of the three largest vendors, and named deployments (a $10M agent-capacity saving at one bank, a 61-point NPS lift at NOS Portugal) firmed up the case, though primary research found that 85 percent of contact centres have deployed quality tooling and only 29 percent use it effectively. Customer health scoring told the same two-speed story: a mid-market case study documented churn falling from 34 to 11 percent and $8.4M in retained revenue, while an adversarial review warned that most "AI" health-scoring tools are repackaged rule-based scoring that adds only 10-20 percent accuracy at ten to fifty times the cost. The throughline across all three is identical to the headline domain story: the capability is settled, and the returns are real for those who do the operational work and absent for those who buy the tool and stop there.
The rollback rate is the story, and governance maturity makes it worse, not better. Sinch's finding that 74 percent of production autonomous agents are pulled — rising to 81 percent at the best-governed organisations — inverts the usual assumption that maturity reduces failure. Mature teams have the observability to detect privilege escalation, cascading actions, and silent drift; less-instrumented teams ship the same failures and never see them. This means the visible failure rate understates the true one, and that buying governance tooling surfaces problems faster than it solves them.
Vendor metrics and customer experience have decoupled, and the measurement gap is now the product risk. Deflection (conversations ended) is not resolution (problems solved), yet vendor benchmarks conflate them. A telecom case study showed 78 percent reported containment but only 41 percent actual resolution; a bank voice bot rated 4.6/5 CSAT saw 91 percent of customers hang up, request an agent, or call back within 24 hours. Fini Labs found 71 percent of support leaders cite inflated automation metrics as their top blocker to trusting AI vendors. The honest metric — true resolution, customer effort, repeat-contact rate — is where the real ceiling sits, and most organisations are still measuring the wrong thing.
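The difference between deflection and resolution is easiest to see when the metrics are computed side by side from the same interaction log. The following is a minimal sketch, not any vendor's actual methodology: the `Interaction` fields, the 24-hour re-contact window, and the sample data are all illustrative assumptions, and the numbers are invented to show the arithmetic rather than drawn from the telecom case study.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    # All three fields are hypothetical; a real log would need them derived
    # from ticket, telephony, and follow-up data.
    ended_without_agent: bool     # bot closed the conversation, no human handoff
    problem_solved: bool          # outcome verified independently of the bot
    recontacted_within_24h: bool  # customer came back about the same issue


def containment_rate(items: list[Interaction]) -> float:
    """Share of all interactions the bot ended without a human (deflection)."""
    if not items:
        raise ValueError("no interactions")
    return sum(i.ended_without_agent for i in items) / len(items)


def true_resolution_rate(items: list[Interaction]) -> float:
    """Share of all interactions contained, verified solved, and not repeated."""
    if not items:
        raise ValueError("no interactions")
    resolved = [
        i for i in items
        if i.ended_without_agent and i.problem_solved and not i.recontacted_within_24h
    ]
    return len(resolved) / len(items)


def repeat_contact_rate(items: list[Interaction]) -> float:
    """Share of contained interactions where the customer came back."""
    contained = [i for i in items if i.ended_without_agent]
    if not contained:
        return 0.0
    return sum(i.recontacted_within_24h for i in contained) / len(contained)


# Invented sample of ten interactions, for illustration only.
sample = (
    [Interaction(True, True, False)] * 4    # contained and genuinely resolved
    + [Interaction(True, False, True)] * 3  # contained, unsolved, customer returned
    + [Interaction(True, True, True)] * 1   # marked solved, but customer returned
    + [Interaction(False, True, False)] * 2 # escalated to a human
)

print(f"containment:     {containment_rate(sample):.0%}")
print(f"true resolution: {true_resolution_rate(sample):.0%}")
print(f"repeat contact:  {repeat_contact_rate(sample):.0%}")
```

In this invented sample the bot "contains" eight of ten conversations but truly resolves only four, and half of the contained conversations come back. A dashboard that reports only the first number would hide the other two.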
The constraint has moved from model to infrastructure, and most organisations have not. Across autonomous resolution, voice AI, summarisation, and health scoring, the same root cause recurs: integration failure, fragmented data, and stale knowledge bases, not LLM quality. Confluent found 72 percent of leaders cite real-time data infrastructure as the primary scaling barrier. A single Zendesk customer went from 24 to 80 percent autonomous resolution by changing nothing but its knowledge architecture, and Edel Optics improved AI resolution from 25 to 79 percent on the same Zendesk model purely by restructuring its knowledge base. The lesson is consistent and uncomfortable: the gap from 20 to 80 percent automation is almost always upstream content and integration work, which is expensive, unglamorous, and rarely budgeted.
The human-in-the-loop gate is now a permanent architectural fact, not a transitional safeguard. The legal precedent (Air Canada held liable for its chatbot's fabricated policy), the regulatory tightening (EU AI Act enforcement from August 2026, CMA guidance carrying fines of 10 percent of global turnover, CAN-SPAM penalties at $53,088 per email), and the failure economics all converge on the same answer. Hallucination rates of 22-94 percent across models, with even frontier models capping at roughly 85 percent non-hallucination on live call data, mean the practices that win are the ones that keep humans on the consequential 20 percent. Gartner has formalised this as the "Advise" autonomy level. The domain's competitive advantage now lies in the review gate, not in removing it.
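A review gate of this kind can be expressed as a small routing rule. The sketch below is one possible shape under stated assumptions, not Gartner's "Advise" level or any platform's implementation: the action labels in `CONSEQUENTIAL_ACTIONS`, the `confidence` score, and the 0.9 threshold are all hypothetical and would need to be set by each organisation.

```python
from dataclasses import dataclass

# Hypothetical labels for actions with legal or financial consequences.
CONSEQUENTIAL_ACTIONS = {"refund", "account_recovery", "policy_statement"}


@dataclass
class Draft:
    action: str        # what the AI proposes to do
    confidence: float  # model-reported score in [0, 1]; assumed to be available
    text: str          # the drafted reply


def route(draft: Draft, min_confidence: float = 0.9) -> str:
    """Return 'human_review' or 'auto_send' for a drafted reply."""
    # Consequential actions always go to a person, whatever the score.
    if draft.action in CONSEQUENTIAL_ACTIONS:
        return "human_review"
    # Low-confidence drafts are also held back for review.
    if draft.confidence < min_confidence:
        return "human_review"
    return "auto_send"


print(route(Draft("refund", 0.99, "Your refund has been approved.")))
print(route(Draft("order_status", 0.95, "Your order ships tomorrow.")))
print(route(Draft("order_status", 0.50, "Your order may have shipped.")))
```

The design choice worth noting is that the consequential-action check comes first and ignores confidence entirely, so a high score can never bypass the gate for the actions that carry liability.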
Two-speed adoption is hardening into a structural divide. A small vanguard — typically large enterprises with dedicated CS operations, unified data, and disciplined phased rollout — extracts documented returns: 31 percent churn reduction in health scoring, $12.5M annual QA savings, 250,000 agent hours saved at Chime through call summarisation. The majority remain trapped: only 10-12 percent of organisations report mature, optimised deployment across most practices, even as 80-87 percent express investment intent. The US Census Bureau's behavioural data found just 19.8 percent of businesses actually using AI, against 70-percent-plus claims in commercial surveys. The constraint separating the two groups is not capital or tooling — both are now cheap and accessible — but operational discipline and organisational readiness, which do not arrive with a procurement contract.
CCW Vegas 2026: The AI Honeymoon Is Over (industry-report) — The Sinch finding that 74 percent of autonomous AI agent deployments were rolled back after go-live — rising to 81 percent at the best-governed organisations — is the most counter-intuitive structural fact in the domain, and this CCW coverage is its primary public source. https://www.cxtoday.com/contact-center/ccw-vegas-2026-10-cx-trends-that-prove-the-industrys-ai-honeymoon-is-well-and-truly-over/
McKinsey Reports 45% of Fortune 500 Now Deploy Production AI Agents (industry-report) — The headline adoption number (up from 8% in 2024) combined with the 78% customer-service concentration and claimed 340% ROI supplies the "seductive surface" half of the domain's central tension, which the rollback and failure data then systematically dismantles. https://callsphere.ai/blog/mckinsey-45-percent-fortune-500-deploy-production-ai-agents-2026
Chime Financial Saves 250,000+ Agent Hours Annually with Claude via Amazon Bedrock (case-study) — The strongest single ROI case in call summarisation history — 18-second handling-time reduction, 5-point NPS lift, $700K annual gain — demonstrates what the vanguard extracts when operational discipline matches technology, and sets the benchmark the rest of the domain fails to replicate. https://aws.amazon.com/solutions/case-studies/chime-financial-case-study/
vCX-Hard: Benchmarking Leading AI Models on Real Contact Center Calls (research-paper) — Empirical measurement of 2,075 real production calls finds even the best frontier model (GPT-5.5) hits only 84.8% non-hallucination, establishing the roughly 15% error floor that makes human review of call summaries and autonomous responses a permanent architectural requirement rather than a transitional one. https://www.retellai.com/blog/vcx-hard-benchmark
Agentic AI in 2026: What Actually Made It to Production (case-study) — Thread Transfer's tracking of 1,200+ agentic projects finding only 4% reach ROI-positive production is the most damning single data point on the gap between the McKinsey adoption headline and what actually survives contact with operations; the surviving pattern (narrow scope, finite action space, unambiguous done signal) is the playbook the summary prescribes. https://thread-transfer.com/blog/2026-06-17-agentic-ai-2026-state-of-production/
New Five9 Research: 83% of Consumers Must Repeat Themselves After AI Handoff (adoption-metric) — Every CX leader surveyed claimed their systems preserve context across AI-to-human handoffs; 83% of consumers reported the opposite — the clearest single illustration of the measurement decoupling between vendor-reported and customer-experienced reality that the summary identifies as the domain's defining feature. https://markets.ft.com/data/announce/detail?dockey=600-202606241500BIZWIRE_USPRX____20260624_BW086700-1
AI Chatbots in Customer Service: The Containment Rate That's Lying to You (case-study) — The telecom deployment with 78% reported containment but only 41% actual resolution is the cleanest documented instance of the deflection/resolution conflation the summary flags as the dominant measurement failure in the domain; it makes the abstract metric gap concrete and named. https://enderturing.com/blog/ai-chatbots-in-customer-service-the-containment-rate-thats-lying-to-you
What Resolution Rate Can AI Customer Support Actually Achieve? (2026 Benchmarks) (industry-report) — Independent aggregation showing vendors claim 67-80% automation while the Zendesk aggregate median sits at 41.2% — with only 14% of interactions reaching verified end-to-end resolution — puts the vendor-metric inflation pattern on a quantitative footing and explains why Fini Labs found 71% of support leaders cite inflated metrics as their top trust blocker. https://www.getmyai.ai/blog/ai-customer-support-automation-rates/
Self-Service Knowledge Base Design: 2026 IA Playbook (industry-report) — The finding that the same Intercom Fin agent delivers 25% resolution on a poorly structured KB and 80% on a well-structured one — without any model change — is the clearest empirical proof that the binding constraint has moved from model capability to content architecture, the summary's third key tension. https://www.digitalapplied.com/blog/self-service-knowledge-base-design-2026-information-architecture-playbook
AI Agent Failures: The 10 Biggest Agentic AI Disasters of Early 2026 (case-study) — The named-disaster catalogue (Air Canada's fabricated refund policy establishing chatbot legal liability, Klarna's $2.3M in unauthorised refunds, Meta's account-recovery bypass hijacking 20,000 Instagram accounts) provides the specific failure evidence that grounds the summary's argument that the human-in-the-loop gate is now a permanent architectural fact, not a transitional safeguard. https://callsphere.ai/blog/ai-agent-failures-biggest-agentic-ai-disasters-early-2026