The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
AI that classifies, routes, and prioritises incidents while automating root cause diagnosis. Includes intelligent ticket routing and automated fault tree analysis; distinct from automated remediation, which takes corrective action rather than diagnosing.
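To make the classify, route, and prioritise steps concrete, here is a minimal rule-based sketch. It is an illustration only: the routing table, team names, keyword lists, and impact thresholds are all hypothetical, and the products discussed in this entry rely on trained models or LLMs rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical routing table: category -> (owning team, trigger keywords).
ROUTES = {
    "database": ("db-oncall", ("deadlock", "replication", "sql")),
    "network": ("netops", ("packet loss", "dns", "timeout")),
}
DEFAULT_TEAM = "service-desk"  # assumed fallback queue


@dataclass
class Incident:
    title: str
    affected_users: int


@dataclass
class TriageResult:
    category: str
    team: str
    priority: str


def triage(incident: Incident) -> TriageResult:
    """Classify by keyword, route to the owning team, prioritise by impact."""
    text = incident.title.lower()
    category, team = "unclassified", DEFAULT_TEAM
    for name, (owner, keywords) in ROUTES.items():
        if any(keyword in text for keyword in keywords):
            category, team = name, owner
            break
    # Illustrative impact thresholds, not taken from any vendor.
    if incident.affected_users >= 1000:
        priority = "P1"
    elif incident.affected_users >= 50:
        priority = "P2"
    else:
        priority = "P3"
    return TriageResult(category, team, priority)


result = triage(Incident("Replication lag on orders DB", affected_users=2500))
print(result)
```

Note that the sketch stops at diagnosis-side decisions (what it is, who owns it, how urgent it is) and takes no corrective action, which is the boundary with automated remediation drawn above.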
AI-driven incident triage and root cause analysis is a proven practice with a mature vendor ecosystem, GA tooling from multiple platforms, and documented ROI at enterprise scale. The core value proposition — classifying and routing incidents automatically while diagnosing why failures occur, not merely that they occurred — has been validated through years of production deployments at organisations ranging from Fortune 500 to Tier-1 service providers. The question for most teams is no longer whether these tools work, but how to implement them without drowning in integration complexity and alert noise. That distinction matters: despite broad vendor capability and strong MTTR reduction evidence, a persistent adoption paradox has emerged. Organisations overwhelmingly invest in AIOps platforms yet struggle to operationalise AI-assisted triage beyond initial pilots, particularly in the mid-market. LLM-assisted diagnosis is accelerating vendor roadmaps, but research reveals systematic reliability gaps that prompt engineering alone cannot resolve. The practice is mature and accessible; the barrier is execution, not technology.
The vendor ecosystem is broad and GA-ready. Splunk ITSI, BigPanda, Moogsoft (Dell), IBM Instana, and Logz.io all ship LLM-assisted triage and root cause suggestion as production features; BigPanda's AI Incident Assistant and Microsoft's RCA Agent via Copilot Studio represent the latest wave of generative-AI-native releases. Named deployments continue to demonstrate measurable impact: ServiceNow's NBA Workplace Service Delivery achieved 51% annual ROI with 30-50% MTTR reduction and 99.2% noise suppression; Thoughtworks reports RCA cycles compressed from hours to minutes across 16+ client engagements, with L1/L2 ticket volume down 35-40%. Incident.io documents 37% faster MTTR through AI-automated post-mortems, saving teams roughly 75 minutes per incident. Türk Telekom achieved a 49% improvement in service outages, and Vodafone reduced alarm noise by over 70%.
As of June 2026, agentic RCA has reached a new maturity milestone. Splunk ITSI 5.0 GA includes Event iQ Diagnose, an LLM-powered RCA system that identifies likely root causes with confidence-scored recommendations and integrates CMDB and change context; Splunk's broader Agentic Observability GA deploys AI SRE agents that automatically detect issues, identify root causes, and provide step-by-step remediation guidance. AWS and New Relic demonstrate production agentic triage assistants that shorten the evidence-gathering phase and standardise investigation methodology across teams. Research advances continue: peer-reviewed frameworks propose graph-agnostic RCA (StableRCA) for systems lacking a known topology, and multi-agent orchestration (ORCA) that gives domain experts access to causal analysis without deep statistical knowledge. Production deployment evidence is diversifying beyond cloud platforms: Elastic's work with Cisco ISE demonstrates ML-assisted RCA compressing diagnostic time on network infrastructure from 20 minutes of manual investigation to seconds.
These gains coexist with a striking adoption gap and persistent limitations. A Sumo Logic survey of 500+ security leaders found that 90% consider AI important for security purchases, yet only 9% have deployed it for incident triage. Operational toil rose 30% in 2025 despite 51% of companies deploying AI tools, and 73% of organisations experienced outages from ignored alerts. Causal AI adoption grew 40% in 2026, but the pilot failure rate stands at 95%, driven by data quality issues. The gap is not capability but integration: mid-market RCA projects face severe cost overruns (one insurance deployment reached $4.7M against a $1.2M budget), and 94% of IT leaders cite vendor lock-in as a concern.
On the technical side, LLM-based RCA shows systematic failures: a February 2026 study found that hallucinated data interpretation and incomplete exploration persist across all model tiers regardless of capability level, so human review remains necessary. Five critical data quality barriers block deployment: incomplete work order history, inconsistent asset naming, missing failure classifications, data silos between observability systems, and inconsistent technician data entry, costing organisations $12.9M annually. RCA approaches aimed at alert fatigue face a fundamental limitation: alert-only correlation inherits the blind spots of existing alerting rules and cannot recover weak signals that never trigger a threshold, so it plateaus unless evidence-rich foundations (metrics, traces, logs) serve as the diagnostic source. Security governance adds friction: 98% of CISOs in one survey report delaying AI agent deployments due to insufficient controls. Trust barriers remain acute as well: a survey of 1,000+ IT leaders found 61% adoption of AI for accelerated RCA, yet 71% still manually verify outputs and 62% struggle to trust recommendations. Teams receive AI insights but lack the confidence or governance frameworks to act on them directly.
The tooling works; the organisational, technical, and governance scaffolding to deploy it reliably is what most teams still lack.
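The limitation of alert-only correlation noted above can be illustrated with a small sketch. The latency samples and the 500 ms threshold are invented for this example, and the z-score check is just one simple stand-in for analysing raw metrics; it is not the method of any vendor named here.

```python
from statistics import mean, stdev

ALERT_THRESHOLD_MS = 500.0  # hypothetical static alerting rule

# Illustrative latency samples in milliseconds (made up for this sketch).
baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
recent = [130.0, 135.0, 140.0, 138.0]


def fired_alerts(samples, threshold):
    """Return the samples that would trip a static threshold rule."""
    return [s for s in samples if s > threshold]


def z_scores(samples, reference):
    """Score each sample by its distance from the reference mean, in standard deviations."""
    mu = mean(reference)
    sigma = stdev(reference)
    return [(s - mu) / sigma for s in samples]


alerts = fired_alerts(recent, ALERT_THRESHOLD_MS)
drift = z_scores(recent, baseline)

print("alerts fired:", len(alerts))
print("samples more than 3 sigma from baseline:", sum(z > 3 for z in drift))
```

In this toy data the latency has clearly shifted away from its baseline, yet no alert fires because every sample stays under the static threshold. A correlation engine fed only by alerts never sees the shift; one that reads the underlying metric does.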
— New Relic Autopilot GA: an out-of-the-box SRE agent for incident triage and remediation proposals; exemplifies major-vendor adoption of agentic RCA, with guidance on phased pilot deployment for risk measurement.
— Japanese fintech LayerX deployed Datadog Bits Investigation to production for autonomous incident investigation correlating metrics/logs/traces/change data; reduced on-call cognitive load via AI-driven initial context gathering.
— Compliance-focused enterprise (legal/regulatory) deployed Evoke autonomous triage agent; 97% MTTR reduction (2 hours→40 seconds), deterministic execution with full auditability, demonstrating governance-ready agentic RCA in regulated environments.
— CRITICAL NEGATIVE EVIDENCE: Documents 10 specific agentic AI failure modes in incident response (speed outpacing human control, cascading failures, skill degradation, confident wrong summaries anchoring teams in misdiagnosis) from real post-mortems.
— Synthesized benchmark data (Forrester, Research Square, Fini Labs, BT Group) documents 40-60% MTTR reduction, per-phase metrics (triage <1 sec vs 3-8 min), 95% cost reduction per ticket, 9-18 month ROI payback; identifies mid-market gap as defining 2026 challenge.
— Named deployments (Krafton: 107→24 incidents, MTTD 8.8→1.6 min, MTTR 53.5→10.3 min; Getswish: post-2024 outage recovery) demonstrate that effective AI-assisted triage requires mature platform engineering foundations for RCA.
— Deployed incident triage system reduces time-to-first-hypothesis from 15-25 minutes to 1-3 minutes with 60-70% status page lag reduction; demonstrates real-time LLM-assisted RCA using structured output for PagerDuty/Slack integration.
— Independent engineer account of production agentic triage via Lambda/Bedrock achieving 88-95% reduction in diagnosis time (30+ min→<5 min); includes operational design lesson on agent health monitoring and CloudWatch fallback coverage.
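Several of the entries above mention LLM-assisted triage that emits structured output for downstream tools. A minimal sketch of that pattern follows, under stated assumptions: the field names (`hypothesis`, `confidence`, `evidence`), the validation rules, and the message format are hypothetical, and no real PagerDuty or Slack API is called. The idea is to validate the model's JSON before anything is posted, and to fall back to human triage when validation fails.

```python
import json
from typing import Optional


def parse_hypothesis(raw: str) -> Optional[dict]:
    """Validate model output; return None so the caller can fall back to a human."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    hypothesis = data.get("hypothesis")
    confidence = data.get("confidence")
    evidence = data.get("evidence")
    if not isinstance(hypothesis, str) or not hypothesis.strip():
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    # Require at least one piece of cited evidence.
    if not isinstance(evidence, list) or not evidence:
        return None
    return {
        "hypothesis": hypothesis,
        "confidence": float(confidence),
        "evidence": [str(item) for item in evidence],
    }


def format_message(h: dict) -> str:
    """Render a validated hypothesis as a one-line chat or pager note."""
    return f"[{h['confidence']:.0%}] {h['hypothesis']} (evidence: {'; '.join(h['evidence'])})"


raw_output = (
    '{"hypothesis": "Connection pool exhausted after deploy", '
    '"confidence": 0.82, '
    '"evidence": ["pool wait time spike", "deploy at 14:02"]}'
)
parsed = parse_hypothesis(raw_output)
print(format_message(parsed) if parsed else "No valid hypothesis; escalate to on-call.")
```

Rejecting output that lacks evidence or carries an out-of-range confidence is a deliberate choice in this sketch: given the hallucination and trust findings reported above, a malformed hypothesis is treated as no hypothesis.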