The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI that classifies, routes, and prioritises incidents while automating root cause diagnosis. Includes intelligent ticket routing and automated fault tree analysis; distinct from automated remediation which takes corrective action rather than diagnosing.
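The triage half of this pattern can be sketched in a few lines. What follows is a minimal illustration, not any vendor's implementation: the team names, keyword patterns, and severity labels are all invented for the example, and production systems would use trained classifiers or LLMs rather than regexes.

```python
import re
from dataclasses import dataclass

# Illustrative routing table -- team names and keyword patterns are assumptions
# for this sketch, not drawn from any product mentioned in the text.
ROUTES = {
    "database": re.compile(r"\b(deadlock|replication|query timeout|oom)\b", re.I),
    "network":  re.compile(r"\b(packet loss|bgp|latency spike|dns)\b", re.I),
    "platform": re.compile(r"\b(pod crashloop|node notready|disk pressure)\b", re.I),
}

SEVERITY_HINTS = {
    "sev1": re.compile(r"\b(outage|down|data loss)\b", re.I),
    "sev2": re.compile(r"\b(degraded|elevated errors|timeout)\b", re.I),
}

@dataclass
class Triage:
    team: str
    severity: str

def triage(ticket_text: str) -> Triage:
    """Classify and route a raw incident description."""
    team = next((t for t, pat in ROUTES.items() if pat.search(ticket_text)),
                "general-oncall")   # fallback queue for unmatched tickets
    severity = next((s for s, pat in SEVERITY_HINTS.items() if pat.search(ticket_text)),
                    "sev3")         # default to lowest priority
    return Triage(team=team, severity=severity)

print(triage("Replication lag causing query timeout; service degraded"))
# routes to the database team at sev2
```

The fallback queue is the point worth noting: misrouted tickets are a major source of distrust in automated triage, so unmatched incidents should land with a human dispatcher rather than a guessed team.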
AI-driven incident triage and root cause analysis is a proven practice with a mature vendor ecosystem, GA tooling from multiple platforms, and documented ROI at enterprise scale. The core value proposition — classifying and routing incidents automatically while diagnosing why failures occur, not merely that they occur — has been validated through years of production deployments at organisations ranging from the Fortune 500 to Tier-1 service providers. The question for most teams is no longer whether these tools work, but how to implement them without drowning in integration complexity and alert noise. That distinction matters: despite broad vendor capability and strong MTTR-reduction evidence, a persistent adoption paradox has emerged. Organisations overwhelmingly invest in AIOps platforms yet struggle to operationalise AI-assisted triage beyond initial pilots, particularly in the mid-market. LLM-assisted diagnosis is accelerating vendor roadmaps, but research reveals systematic reliability gaps that prompt engineering alone cannot resolve. The practice is mature and accessible; the barrier is execution, not technology.
The vendor ecosystem is broad and GA-ready. Splunk ITSI, BigPanda, Moogsoft (Dell), IBM Instana, and Logz.io all ship LLM-assisted triage and root cause suggestion as production features; BigPanda's AI Incident Assistant and Microsoft's RCA Agent via Copilot Studio represent the latest wave of generative-AI-native releases. Named deployments continue to demonstrate measurable impact: ServiceNow's NBA Workplace Service Delivery achieved 51% annual ROI with 30-50% MTTR reduction and 99.2% noise suppression; Thoughtworks reports RCA cycles compressed from hours to minutes across 16+ client engagements, with L1/L2 ticket volume down 35-40%. Incident.io documents 37% faster MTTR through AI-automated post-mortems, saving teams roughly 75 minutes per incident. Türk Telekom achieved 49% improvement in service outages and Vodafone reduced alarm noise by over 70%.
These results coexist with a striking adoption gap. A Sumo Logic survey of 500+ security leaders found that 90% consider AI important for security purchases, yet only 9% have deployed it for incident triage. Operational toil rose 30% in 2025 despite 51% of companies deploying AI tools, and 73% of organisations experienced outages from ignored alerts. Causal AI adoption grew 40% in 2026, but the pilot failure rate stands at 95%, driven by data quality issues. The gap is not capability but integration: mid-market RCA projects face severe cost overruns — one insurance deployment reached $4.7M against a $1.2M budget — and 94% of IT leaders cite vendor lock-in as a concern. On the technical side, LLM-based RCA shows systematic failures: a February 2026 study found that hallucinated data interpretation and incomplete exploration persist across all model tiers, so outputs still require human review. Five critical data quality barriers block deployment — incomplete work order history, inconsistent asset naming, missing failure classifications, data silos between observability systems, and inconsistent technician data entry — costing organisations $12.9M annually. Security governance adds friction: 98% of CISOs in one survey report delaying AI agent deployments due to insufficient controls. The tooling works; the organisational and technical scaffolding to deploy it reliably is what most teams still lack.
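Those data quality barriers are checkable before an RCA pipeline ever runs. A minimal pre-flight gate along the lines below catches three of the five categories; the field names, the uppercase naming convention, and the record shape are all illustrative assumptions rather than any vendor's schema.

```python
# Pre-flight data-quality gate for incident records. Field names and the
# naming convention are invented for this sketch.
REQUIRED_FIELDS = ("asset_id", "failure_class", "work_order_history")

def quality_issues(record: dict) -> list[str]:
    """Return data-quality problems that would undermine automated RCA."""
    issues = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):          # absent, None, or empty all count
            issues.append(f"missing {field}")
    # Inconsistent asset naming: flag ids that break the (assumed) uppercase
    # convention, since entity resolution fails silently on free-text names.
    asset = record.get("asset_id", "")
    if asset and not asset.isupper():
        issues.append("non-canonical asset name")
    return issues

record = {"asset_id": "db primary", "failure_class": None, "work_order_history": []}
print(quality_issues(record))
# flags the missing failure class, the empty history, and the free-text name
```

Gating on checks like these is cheaper than discovering the same gaps as a $4.7M overrun mid-deployment; records that fail the gate get remediated or excluded rather than fed to the model.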
As of May 2026, agentic RCA platforms have reached production scale with multi-vendor GA. Splunk Observability Cloud's AI Troubleshooting Agent correlates metrics, logs, and traces into evidence-backed root cause summaries, shifting on-call engineers from data gathering to decision-making; BigPanda's AI Detection and Response (ADR) with AI Incident Assistant delivers real-time triage and root cause analysis integrated with ServiceNow workflows; and Kentik's Network Intelligence Platform introduces an AI Advisor with automated diagnostics (on-demand connectivity, config context, device access) that reduces manual troubleshooting steps. On the research side, peer-reviewed work at ICML 2026 and on arXiv validates causal approaches — LATS-RCA achieving high diagnostic accuracy on production microservices despite polyglot stacks; Bayesian Root Cause Discovery with statistical consistency guarantees; PRIM meta-learning achieving zero-shot inference in 17ms for systems with 100+ variables — while identifying real-world implementation constraints. Deployment evidence continues to strengthen: American Express reduced MTTR 32% and achieved 82% RCA accuracy using Traversal; SquareOps, managing 50+ production Kubernetes clusters, reports 40-70% MTTR reduction via LLM triage and RCA; and Microsoft's Triangle demonstrated a 91% reduction in Time-to-Engage in Azure production. Yet the governance and trust barriers remain acute: a survey of 1,000+ IT leaders found 61% adoption of AI for accelerated RCA, but 71% still manually verify outputs and 62% struggle to trust recommendations — an implementation gap in which teams receive AI insights but lack the confidence or governance frameworks to act on them directly.
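The dependency-graph reasoning behind these agents can be shown in miniature. The sketch below is a toy with an invented topology and anomaly scores, not any platform's algorithm: it ranks as root cause candidates the anomalous services whose own anomalous state is not explained by anything they depend on — the shared dependency, rather than the services alerting above it.

```python
# Toy graph-assisted root cause ranking. The service graph and anomaly
# scores are invented; production agents derive both from traces and metrics.
DEPS = {                      # service -> services it calls
    "frontend": ["checkout", "search"],
    "checkout": ["payments", "db"],
    "payments": ["db"],
    "search": [],
    "db": [],
}

ANOMALY = {"frontend": 0.9, "checkout": 0.8, "payments": 0.2, "db": 0.95, "search": 0.1}

def root_cause_candidates(threshold: float = 0.5) -> list[str]:
    anomalous = {s for s, score in ANOMALY.items() if score >= threshold}
    # A candidate is anomalous with no anomalous downstream dependency:
    # its own failure explains the cascade of alerts above it.
    candidates = [s for s in anomalous
                  if not any(d in anomalous for d in DEPS[s])]
    return sorted(candidates, key=ANOMALY.get, reverse=True)

print(root_cause_candidates())   # ['db'] -- the shared dependency, not the alerting frontend
```

Even this toy shows why evidence-backed summaries beat raw alert feeds: three services are firing, but only one is a plausible cause, and surfacing that one is precisely the data-gathering work the GA agents above take off the on-call engineer.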
— SRE consulting firm managing 50+ production Kubernetes clusters reports 40-70% MTTR reduction via LLM triage and RCA; demonstrates safe agentic remediation architecture with policy engine guardrails.
— Comprehensive analysis proposing 6-tier agentic RCA capability ladder (L0-L5) with named vendor examples; Traversal achieves 32% MTTR reduction and 82% RCA accuracy at American Express.
— BigPanda AI Detection and Response GA includes AI Incident Assistant with real-time incident triage, root cause analysis generation, and ServiceNow integration for L4-L5 agentic investigation.
— Recent peer-reviewed research on causal meta-learning for RCA; achieves zero-shot inference in 17ms for systems with 100+ variables, advancing methodological boundaries of RCA under structural uncertainty.
— Splunk Observability Cloud AI Troubleshooting Agent GA correlates metrics, logs, traces for evidence-backed root cause summaries; shifts on-call role from data gathering to decision-making.
— Operational analysis quantifying MTTR composition (15min context, 20min troubleshooting, 13min docs); Datadog Bits AI SRE achieves 95% MTTR reduction by eliminating context-assembly coordination tax.
— Peer-reviewed LLM multi-agent RCA framework (LATS-RCA) achieving high diagnostic accuracy on production microservices while identifying real-world challenges like polyglot stacks and inconsistent logging.
— ICML 2026 research on causal inference RCA with statistical consistency guarantees, evaluated across three microservice systems with state-of-the-art top-k accuracy in failure diagnosis.