The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI systems that detect infrastructure failures and automatically execute remediation actions without human intervention. Includes auto-scaling, auto-restart, and configuration self-repair; distinct from runbook generation which documents procedures rather than executing them.
Self-healing infrastructure has moved from bleeding-edge to leading-edge, shifting the debate from "does it work?" to "why isn't everyone deploying it?" Organisations running production remediation loops report downtime reductions of 40-72%, vendor tooling is production-grade, and the structural case for automation is now undeniable: time-to-exploit compressed from 771 days (2018) to <1 hour (projected end-2026), while enterprise patch timelines stretch to 43+ days and remediation capacity has hit a ceiling (only 26% of CISA KEVs fully remediated in 2025, down from 38%). The defining tension has shifted from capability to absorption and governance. Legacy architectures lack the semantic telemetry, event-driven patterns, and metadata layers autonomous remediation requires. Alert fatigue and justified caution about unintended consequences keep most organisations in guided-automation mode with human approval gates. Getting from "works in a controlled environment" to "runs autonomously at scale" demands an infrastructure overhaul and organisational change discipline, not a product purchase. Gartner predicts 70% enterprise agentic AI adoption for infrastructure operations by 2029 (vs <5% in 2025), yet practitioner reality shows a 35-point gap between C-suite belief and deployment readiness, and fewer than 1% of organisations score above 50/100 on automation maturity. May 2026 evidence sharpens the adoption picture: tier-1 vendors (AMD, NVIDIA) ship automated remediation as core GA features; independent research documents 88.9% remediation success in constrained environments; practitioner frameworks identify high-confidence zones (pod restarts, cache flushes, known runbooks) and boundary conditions (ambiguous root causes, irreversible actions); stateless remediation without state persistence causes repeat-incident thrashing and symptom masking, illustrating why safe automation requires governance discipline and observability-as-control-plane architecture.
The structural case for automated remediation has become inescapable: time-to-exploit compressed to <4 hours (2024) with trajectory toward <1 hour by end-2026, yet median enterprise patch time is 43 days and CISA KEV remediation capacity sits at 26% full remediation (down from 38% prior year). FAIR Institute analysis and Qualys research on 1B+ remediation records both conclude that manual workflows cannot keep pace with weaponized vulnerabilities; the bottleneck has fundamentally shifted from vulnerability discovery (now autonomous at scale) to remediation execution. This has triggered vendor consolidation around automated remediation features: AMD GPU Operator v1.5.0 (May 2026) ships Auto Node Remediation as core GA, NVIDIA NVSentinel (production-ready, 297 GitHub stars) provides automated fault recovery for GPU Kubernetes, Harness AIDA demonstrates 68.50% MTTR reduction in fintech deployments, and independent peer-reviewed research (SCARA framework, May 2026) validates 88.9% autonomous remediation success rates even on opaque industrial software (firmware, proprietary handlers, ICS code without source).
Dynatrace and AWS anchor the enterprise vendor field. AWS showcases six named deployments (Banco BMG with 350+ daily autonomously-investigated incidents and 87% MTTR reduction; Commonwealth Bank resolving network/identity issues in <15 minutes vs. hours for manual engineers; Deriv with 40% MTTR reduction; Clariant, Dhan, Granola) using DevOps Agent for autonomous investigation. Dynatrace released hypermodal AI (predictive + causal + generative) where causal AI specifically grounds autonomous remediation decisions, avoiding hallucination. Dynatrace AutomationEngine (GA) with causal AI delivered federal deployments with 80% reduction in manual remediation effort. AWS Support Automation Workflows GA with 50+ curated scenarios, New Relic Workflow Automation with auto-rollback gates for deployment errors, and Red Hat's Ansible Automation Orchestrator (Q3 2026 preview) separating AI recommendation from deterministic production execution represent operationally mature ecosystems.
However, governance barriers and control collapse are now visible at scale. Gartner predicts 40% of enterprises will demote or decommission autonomous agents by 2027 due to governance failures—the core issue is indiscriminate control application causing either over-restriction that drives shadow adoption or under-restriction that expands attack surface. IBM's CIO study found two-thirds of technology executives are legally responsible for autonomous systems they don't oversee, with only 11% feeling prepared for autonomous deployments; 70% of teams deploy faster than central IT can track or evaluate. Broad agent adoption data shows only 41% of enterprise AI agent rollouts achieve positive ROI within 12 months, with 40% of pilots expected scrapped by 2027 (primary drivers: data quality issues, unclear output ownership, workflow redesign failures). Practitioner evidence highlights approval fatigue, auto-approve habit drift, and the control collapse when users must approve hundreds or thousands of daily actions—converting meaningful oversight into reflex clicking. Red Hat's Ansible Orchestrator and AWS DevOps Agent architectures separate investigation (fully automated, read-only) from remediation (human-approved state changes), reflecting the operational discipline now standard in deployed systems. The practice has crossed from capability maturity to absorption and governance maturity: the tension is no longer "does autonomous remediation work?" but "can our organization safely govern and operate it?"
Practitioners document critical failure modes and boundary conditions: stateless auto-remediation causes repeat-incident thrashing and symptom masking; real deployments succeed by establishing high-confidence zones (pod restarts, cache flushes, known runbooks) with strict boundary conditions (no ambiguous root causes, no irreversible actions). Kubernetes-native platforms like OpsAI demonstrate graduated autonomy models—auto-fix for staging, human-reviewed RCA for production. Governance gaps remain substantial: only 39% maintain fully automated audit trails, only 2% of organisations operate fully automated vulnerability workflows. The self-healing networks market (USD 2.61B in 2026, projected USD 9.32B by 2032 at 22.09% CAGR) reflects validated commercial momentum alongside persistent work required to scale beyond leading-edge early adopters.
— GA autonomous SRE agent (June 2026) that monitors telemetry, classifies incidents, executes or suggests remediation runbooks with configurable human control. Paired with Ground Truth API for high-fidelity telemetry, FedRAMP roadmap.
— CRITICAL GOVERNANCE SIGNAL: 93% experienced AI-caused infrastructure incidents; 86% confident in governance but only 30% have formal policy (AI Governance Paradox); 78% use AI-generated IaC without review. Pioneers 6x more likely to have fully automated infrastructure.
— Kubernetes platform feature (RestartAllContainers, beta/default v1.36) enabling in-place pod recovery without full recreation, preserving IP/GPU bindings. Reduces MTTR from minutes to seconds via JobSet adoption, solving control-plane churn and scheduling overhead.
— Enterprise remediation platform with multi-vendor orchestration (70+ security controls), pre-enforcement validation, auto-rollback on verification failure. Deployed metrics: 504 safe remediations/month, 150+ integrations, 99% takedown success rate.
— Independent engineering team (ZopDev) fully automated 4 runbooks and deleted (not archived) them after validating autonomous execution. Demonstrates complete automation with proof: runbook library is hidden automation backlog; deletion validates full logic encoding.
— Benchmark: 40-50% MTTR reduction (Forrester/Research Square 2025); cost per ticket $85→$2-5; BT Group MTTR 2 hours→85 seconds (97% improvement); Level-4 orgs achieve 300% ROI in 18 months. Microsoft Azure 97% triage accuracy, 91% time-to-engage reduction.
— Azure AKS auto-healing feature: automatically creates PDBs for unprotected deployments and reactively scales replicas to unblock node drains during upgrades. Eliminates manual PDB configuration and upgrade failures.
— Production-tested architecture with event noise filtering (count≥3), LLM classification before action, and safe automation boundaries (stateless only, >2 replicas). Honest assessment: full autonomy confined to narrow domains; middle ground (observe→diagnose→act conditionally) production-ready.
2019: Market research showed 85% of organizations engaged with self-healing IT initiatives, mostly at PoC stage. AWS released auto-remediation for policy compliance. Vendors and open-source projects began integrating automated remediation with monitoring and orchestration tools.
2020: Major cloud vendors released GA products: AWS Security Hub automated response, Microsoft Defender auto-IR, Red Hat OKD Compliance Operator. GitHub demonstrated ecosystem-scale automated remediation (1,964 projects). Analyst reports recognized self-healing IT as an established market segment. Reliability challenges persisted in practice, with documented failures in patch management and endpoint security remediation.
2021: Research and technical documentation advanced the practice through EU-funded academic research on AI-based Infrastructure as Code self-healing, AWS expanded Config remediation with approval workflows, and Red Hat published architectural blueprints for closed-loop automation. Community tooling matured with open-source Ansible-Dynatrace integration playbooks. Adoption expanded from compliance remediation to broader infrastructure automation patterns.
2022-H1: Cloud vendors continued maturing infrastructure-as-code support for remediation: AWS CDK added CfnRemediationConfiguration construct, and comprehensive tutorials demonstrated healthcare compliance automation. Market analysis valued self-healing networks at $615M with 33.7% CAGR through 2032, confirming commercial traction. However, product limitations emerged in practice—Azure Policy remediation ran only once, requiring manual re-triggering—while cloud economists highlighted hidden lock-in risks in adopting cloud-native remediation services, indicating adoption barriers persisted despite tooling maturity.
2022-H2: Vendor tooling expanded across security and compliance domains: Dynatrace released Extensions 2.0 for custom metric automation, Qualys launched Flow for no-code vulnerability remediation workflows, AWS published tutorials on organization-wide Config remediation. However, practitioner experience documented critical adoption gaps: change management discipline required to avoid automation-induced outages, and full auto-remediation adoption remained limited in favor of guided/human-in-the-loop approaches that balanced safety with efficiency.
2023-H1: Major vendors continued platform innovation: Dynatrace launched AutomationEngine with Davis causal AI-driven remediation and SLO-based triggering, signaling convergence toward intelligent, context-aware automation. Analyst reports (NelsonHall NEAT) projected self-healing IT infrastructure market growth to $98.5B by 2026 (13.6% CAGR). Market research valued self-healing networks at $500M+ with 25% projected growth. However, production failures continued: ServiceNow users reported automated remediation tasks closing unexpectedly without human verification, and Azure Policy remediation remained limited to initial resource creation, failing on deployed infrastructure. These gaps reinforced the tension: vendor platforms offered sophisticated automation, but operational reality demanded careful gates and human oversight due to unintended consequences and platform limitations.
2024-Q1: Major vendors expanded geographic and capability footprints. AWS extended Config auto-remediation to additional regions (Canada West), Azure GA'd machine-level configuration self-healing via ApplyAndAutoCorrect mode, and Dynatrace released AutomationEngine supporting low-code/no-code workflow automation with causal AI triggering. Academic research advanced theoretical foundations for autonomous repair systems and AI-driven IaC self-healing. However, operational challenges persisted: practitioners documented AWS Config infinite-loop failure modes when remediation actions left resources non-compliant, highlighting the ongoing need for careful parameter validation and staged rollout discipline.
2024-Q2: Vendor ecosystem expanded across domains: Varonis launched automated data remediation for AWS (S3, identity management), AWS published Well-Architected Framework best practices for non-compliant resource remediation, and Red Hat/xMatters published practical integrations (Ansible automation, blue-green deployment rollback). Real-world deployments demonstrated scaling success: City Storage Systems cut Kubernetes support toil by 50% with AKS self-healing framework; Lakeside Software reported $22.50 per-ticket remediation costs and 70% IT staff adoption preference. Platform maturity gaps persisted: Azure Policy and AWS Config both exposed limitations in continuous, progressive remediation of deployed resources, reinforcing practitioner preference for guided automation over fully autonomous execution.
2024-Q3: Cross-cloud vendor and ecosystem maturation continued. Red Hat and Dynatrace published joint case studies showing production deployments at Porsche Informatik and TMBThanachart Bank (Thailand) achieving 90% time-to-market reduction and proactive issue resolution via OpenShift-integrated self-healing. AWS expanded tooling with Systems Manager and Amazon Q Developer integration for automated EBS remediation. Microsoft published Azure Well-Architected Framework reliability guidance emphasizing self-healing design patterns and automated failure detection. Independent deployments, including a large US manufacturer, demonstrated shift from reactive monitoring to predictive, automated remediation. However, practitioner analyses continued highlighting integration challenges with ITSM systems, recurring false positive management, and organizational need for disciplined change management to prevent automation-induced outages.
2024-Q4: Vendor tooling maturity and market expansion continued. Dynatrace reinforced AutomationEngine's production readiness with emphasis on closed-loop remediation; AWS published Well-Architected Framework supply-chain security guidance advocating proactive Config remediation; AWS expanded container vulnerability remediation patterns (EventBridge + CodeBuild + CodeDeploy) for automated CVE patching. Practitioner deployments successfully executed AWS Config Conformance Pack remediation despite documented configuration challenges. Market analyst forecasts updated: self-healing networks projected to reach USD 8.89B by 2032 (28.6% CAGR), utility-sector self-healing grids reaching USD 5.71B by 2031 (9.83% CAGR), confirming sustained commercial momentum. Operational reliability challenges remained visible: Windows remediation service failures documented in Q4 reinforced that platform maturity had not eliminated failure modes, especially in edge case handling and error recovery. Adoption pattern remained consistent: guided/human-approved automation preferred over fully autonomous execution in large-scale production deployments.
2025-Q1: Vendor ecosystem continued advancing with Dynatrace AutomationEngine GA (March 2025) enabling low-code/no-code workflow modeling with closed-loop integration and custom extensibility via App Toolkit. Market research (Research and Markets) projected system infrastructure software market at USD 187.7B in 2025 growing to USD 475.5B by 2034 (10.9% CAGR), with AI-powered automation and self-healing highlighted as key growth drivers. However, adoption gap persisted: Mondoo's 2025 vulnerability remediation survey (125 IT/security professionals) revealed only 2% with fully automated workflows, 62% still manual, and 53% experiencing alert fatigue—exposing disconnect between vendor capability and organizational readiness. Cloud platforms (AWS, Azure) continued documenting 10+ production remediation patterns, but large-scale deployments remained cautious, preferring guided automation over fully autonomous execution due to unintended consequence risks and organizational change management discipline requirements.
2025-Q2: Vendor platform expansion continued: AWS ECS launched automated failure detection and rollback via circuit breaker with CloudWatch Alarms, enabling zero-touch remediation of deployment failures. Academic research validated ML-driven automated recovery, showing 70%+ downtime reduction with DQN-based scheduler in Kubernetes. Self-healing infrastructure adoption expanded beyond IT operations into utility grids, with market analysis reporting 40-70% outage reduction and $918.5M–$11.25B market growth projections. Production deployments remained cautious, preferring guided automation despite vendor capability maturity; organizational adoption barriers persisted around integration complexity, false positive management, and risk of unintended consequences.
2025-Q3: Vendor platform maturity advanced with Dynatrace 3rd-gen agentic AI for complex multi-team remediation scenarios; TELUS case study showed 45-minute-to-2-minute debug time reduction. Market adoption breadth expanded: self-healing network market projected to reach $10.2B by 2033 (21.6% CAGR). Dynatrace's own SaaS platform demonstrated production-grade resilience surviving AWS EC2 outage with automated traffic redirection and zero SLO impact. Balanced signals persisted: adoption breadth across enterprise cloud and utility grids vs. organizational caution favoring guided automation over fully autonomous execution; vendor lock-in risks highlighted by Change Healthcare incident reinforced importance of distributed remediation architectures; implementation challenges (false positives, ITSM integration, parameter validation) remained ongoing.
2025-Q4: Telecom and utility sectors advanced self-healing deployments with market validation: self-healing grids valued at $7.1B–$12.8B with 35% efficiency gains from AI-powered automated fault restoration; telecom analysis highlighted closed-loop automation evolution from equipment reboots to complex multi-domain remediation requiring federated intelligence. Infrastructure-as-code practices matured with Ansible and Bedrock integration tutorials demonstrating repeatable AI operations. However, critical infrastructure gaps surfaced: Ansible network automation collections abandoned, and October 2025 AWS outage traced to automated DNS system race condition causing cascading failures—illustrating how autonomous remediation at scale remains fragile without deterministic safeguards and careful state management.
2026-Jan: Vendor platforms continued investment in agentic AI-driven remediation (Dynatrace emphasized "massive automation"); enterprise deployments at financial services showed 47% faster AI-enabled migration with automated anomaly detection. Market research updated self-healing networks projection to $2.61B (2026) tracking toward $9.32B (2032, 22.09% CAGR). Analyst consensus identified critical limitations: observability-as-control-plane requires deterministic analytics and trust-based automation maturity. Practitioner analysis highlighted fundamental barriers: 2025 AI-agent production failures traced to infrastructure incompatibilities; true autonomous remediation requires semantic telemetry, async event-driven architectures, and metadata layers—not just sophisticated AI. Adoption signals remained mixed: guided automation preferred over fully autonomous execution; 2% fully automated vulnerability workflows vs. 62% manual, exposing persistent organizational readiness gaps despite vendor platform capability.
2026-Feb: Vendor ecosystem advanced with AWS Config and Dynatrace AutomationEngine documentation confirming mature, production-grade remediation capabilities. Academic research proposed multi-agent reinforcement learning framework for closed-loop self-healing service operations. Practitioner analysis underscored persistent maturity gap: organizations achieving self-healing report 72% downtime reduction, yet fewer than 1% score above 50/100 on automation maturity index. Risk assessments emphasized safety challenges: cloud environments heighten business disruption risk, requiring risk-ledger approaches and strict human-on-loop policies despite AI platform capability. GitOps-based approaches (ArgoCD with AI agents) emerged as operationally sovereign alternative driven by regulatory requirements (NIS-2). Adoption pattern remained consistent: benefits clear but organizational adoption barriers (integration complexity, false positives, governance discipline) persisted.
2026-Apr: Named enterprise deployments, peer-reviewed research, and large-scale empirical analysis advanced the production case for autonomous remediation while sharpening the urgency argument.
2026-May: Vendor GA releases and named deployments extended the production evidence base: Dynatrace Davis AI shipped predictive remediation with auto-generated Kubernetes deployment fixes, validated by NEQUI digital bank; New Relic Workflow Automation GA launched auto-rollback on deployment errors with approval gates for critical actions, demonstrating bounded autonomy patterns; AWS Support Automation Workflows expanded to 50+ curated remediation scenarios; Telstra PoC demonstrated live-production outage resolution in minutes via AI agents executing Ansible playbooks. Practitioner frameworks sharpened the capability boundary, distinguishing safe automation zones (pod restarts, cache flushes, known runbooks) from failure conditions (ambiguous root causes, irreversible actions), while Gartner's CEO survey (469 respondents) found 80% expect AI-driven operational overhauls and 27% anticipate primarily autonomous operations by 2028 — sustaining the gap between executive expectation and organizational deployment maturity. AWS DevOps Agent showcased five named enterprises—Commonwealth Bank (complex issues resolved in under 15 minutes vs. hours for manual engineers), Deriv (40% MTTR reduction), Clariant, Granola, and Infor—using agentic AI for autonomous infrastructure investigation. A peer-reviewed study co-authored with JPMorgan Chase demonstrated a multi-agent LLM framework achieving 96.8% IaC drift detection and 95.2% security misconfiguration detection with 6.9-minute MTTR. A separate peer-reviewed paper showed automated remediation in CI/CD pipelines reduces MTTR by 76%, increases deployment frequency 24.2x, and achieves 99.96% reliability in Kubernetes environments. Qualys analysis of 1 billion remediation records across 10,000 enterprises found manual processes failed 88% of the time for weaponized CVEs, and a companion Qualys research report documented exploitation hitting negative-one-day windows—making manual workflows structurally incapable of keeping pace and framing autonomous remediation as a necessity rather than an optimization. Gartner reiterated its forecast of 70% enterprise agentic AI adoption for infrastructure operations by 2029 (vs under 5% in 2025), while organizational governance gaps—only 39% with fully automated audit trails—and the adoption floor (2% fully automated vulnerability workflows) confirmed the gap between capability and deployment maturity. NVIDIA released NVSentinel v1.0.0 (open-source, production-grade fault remediation for GPU-accelerated Kubernetes with automated cordon/drain and break-fix workflows) and AMD GPU Operator v1.5.0 shipped Auto Node Remediation as a core GA feature — two tier-1 hardware vendors standardising automated remediation at the infrastructure layer. Peer-reviewed SCARA framework achieved 88.9% autonomous remediation success on opaque industrial software (firmware, ICS/PLC code without source), extending the practice to critical infrastructure. FAIR Institute analysis documented time-from-disclosure-to-exploitation compressed toward <1 hour by end-2026 with only 26% of CISA KEVs fully remediated, establishing automation as a structural necessity; independent analysis noted fewer than 1% of AI-found vulnerabilities have been patched, confirming the bottleneck has shifted from discovery to remediation execution.
2026-Jun: Governance barriers emerged as the defining constraint, displacing technical capability as the primary concern. AWS DevOps Agent showcased six named enterprises (Banco BMG, Commonwealth Bank, Deriv, Clariant, Dhan, Granola) with 350+ daily incidents autonomously investigated, 87% MTTR reduction, and sub-15-minute RCA versus hours for manual engineers — production maturity now documented at breadth. Red Hat's Ansible Automation Orchestrator preview demonstrated the separation of AI recommendation from deterministic execution, with a live CVE-2024-6387 remediation across 12 hosts in under 10 seconds automation and 38 seconds human review. New Relic launched Autopilot as a GA autonomous SRE agent that monitors telemetry, classifies incidents, and executes or suggests remediation runbooks with configurable human control, paired with a Ground Truth API for high-fidelity telemetry. Against this capability expansion, IBM's CIO study of 2,000 executives found two-thirds legally responsible for autonomous systems they don't oversee, only 11% feel prepared, and 70% of teams deploy faster than central IT can track — while Gartner maintains its forecast that 40% of enterprises will demote or decommission autonomous agents by 2027 due to governance failures.
Platform-layer automation advanced: Kubernetes v1.35 shipped in-place pod restarts (graduated to beta/default in v1.36), enabling recovery without full pod recreation and reducing MTTR from minutes to seconds via JobSet adoption. Azure AKS released automatic Pod Disruption Budget management, autonomously creating PDBs for unprotected deployments and reactively scaling replicas during node drains — eliminating manual configuration and upgrade failures. Enterprise-scale remediation platforms matured: Check Point Safe Remediation deployed across 70+ security controls with 504 safe remediations monthly, 150+ integrations, and 99% takedown success rate, reflecting multi-vendor orchestration at production scale.
ROI benchmarks solidified: industry analysis documented 40-50% MTTR reduction with AIOps (Forrester/Research Square 2025); cost per ticket down from $85 to $2-5; BT Group achieved 97% MTTR improvement (2 hours → 85 seconds); Level-4 organizations report 300% ROI in 18 months. Independent practitioners validated safety patterns: OpsWorker's production-tested framework relies on event noise filtering (count≥3 before action), LLM classification before execution, and strict automation boundaries (stateless deployments only, >2 replicas, no network config changes). ZopDev engineering fully automated and deleted 4 runbooks after verifying autonomous execution, treating runbook deletion as proof of complete automation — a concrete signal that organizations are capturing toil systematically. A Japan-market AIOps analysis documented the broader transition from anomaly detection to agentic self-healing, with a specific MTTR improvement case from 2 hours to 28 minutes alongside persistent adoption barriers in markets with fewer GenAI governance policies.
The AI Governance Paradox emerged as the critical crisis: 93% of organizations experienced AI-caused infrastructure incidents; 86% expressed confidence in AI governance but only 30% maintained formal policy (Spacelift Q2 2026 survey of 406 IT leaders). This mirrors IBM CIO findings and signals that rapid deployment outpaces governance institutionalization — organizations deploying fastest are experiencing the highest incident rates, reinforcing that technical maturity has far outpaced organizational readiness to manage autonomous systems safely.