Automated remediation & self-healing infrastructure

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN

BLEEDING EDGEESTABLISHED

LEADING EDGE

TRAJECTORY— Stalled

AI systems that detect infrastructure failures and automatically execute remediation actions without human intervention. Includes auto-scaling, auto-restart, and configuration self-repair; distinct from runbook generation which documents procedures rather than executing them.

OVERVIEW

Self-healing infrastructure has proven it works -- where it has been deployed. Organisations running production remediation loops report downtime reductions of 40-72%, and the vendor tooling is genuinely capable. But most organisations have not started. The practice sits squarely at leading-edge: a small vanguard of forward-leaning teams is extracting real value while the broader market remains stuck in manual or semi-automated workflows. The defining tension is not capability but absorption. Legacy architectures lack the semantic telemetry, event-driven patterns, and metadata layers that autonomous remediation requires. Alert fatigue and justified caution about unintended consequences keep the majority in guided-automation mode with human approval gates. Getting from "works in a controlled environment" to "runs autonomously at scale" demands an infrastructure overhaul, not a product purchase. Gartner's prediction of 70% enterprise adoption by 2029 suggests mainstream transition is imminent, but adoption barriers—governance, audit trails, integration complexity—remain substantial. May 2026 evidence documents critical limitations: stateless remediation systems cause repeat-incident thrashing and symptom masking; practitioners identify safe automation zones (pod restarts, cache flushes) and failure conditions (ambiguous root causes, irreversible actions); a 35-point gap persists between executive expectation and practitioner deployment readiness.

CURRENT LANDSCAPE

Dynatrace and AWS anchor the vendor field with production-grade platforms. AWS recently showcased five named enterprise deployments—Commonwealth Bank, Deriv, Clariant, Granola, Infor—using its DevOps Agent for autonomous infrastructure investigation and remediation. Commonwealth Bank resolved complex network/identity issues in under 15 minutes compared to hours for manual engineers; Deriv achieved 40% MTTR reduction; Clariant reports autonomous pinpointing of infrastructure changes and configuration issues. Dynatrace's AutomationEngine continues maturing; a federal deployment achieved 80% reduction in manual remediation effort with alert volume dropping from 70,000 to 7,000 actionable incidents. AWS Config conformance packs and Systems Manager handle organisation-wide policy remediation, while ECS circuit breakers enable zero-touch deployment rollback. AWS Builders' Library documents its production deployment automation pattern with auto-rollback triggered by metrics. The tooling is mature. The bottleneck is downstream.

May 2026 data sharpens the adoption picture. Neubird AI surveyed 1,000+ SRE/DevOps professionals and found a 35-point gap between C-suite belief in autonomous remediation and practitioner deployment readiness; 44% of organizations experienced production incidents directly caused by alert fatigue and suppressed alerts. Stonebranch's 402-person IT automation survey (across North America, EMEA, LATAM, APAC) revealed 88% operate hybrid IT but only 21% have enterprise-wide AI workflow production—integration complexity and governance readiness remain structural blockers. Splunk and Cisco showcased their production-ready integration at Red Hat Summit 2026: Splunk ITSI correlates events, Event-Driven Ansible evaluates and instantly executes remediation without manual triage. New SRE tool ecosystem comparisons document 10+ GA AI SRE platforms (OpsAI, Datadog Bits AI, Resolve AI) with auto-remediation focus; OpsAI reported 80% auto-resolution rate in beta testing with 90% detection-to-resolution accuracy. However, critical failure modes persist: stateless auto-remediation causes repeat-incident thrashing (40 remediations per day on identical root causes for 11 weeks), symptom masking, and cascading failures. Practitioners identify hard boundaries: high-confidence zones (pod restarts, cache flushes, known runbooks) versus failure conditions (ambiguous root causes, irreversible actions). Halodoc's named data-pipeline deployment demonstrates the inverse: 6-layer self-healing system with eligibility gates and source-vs-lake consistency checks proves safety-gated automation works at scale. MTTR decomposition analysis reveals coordination overhead dominates (23 minutes vs. 90 seconds execution); real deployments pairing agents with runbooks and SLOs achieved 45→5-18 minute MTTR reductions. Gartner's 2026 CEO survey still shows 80% expect AI to drive operational overhauls, yet Fortune 500 CVE remediation at scale (20x faster, 95% automation) and observed governance gaps (only 39% maintain fully automated audit trails) confirm the asymmetry: vendor capability advanced, executive demand high, but organizational absorption constrained by integration complexity, false positive management, and audit discipline. The self-healing networks market, valued at USD 2.61B in 2026 and projected to reach USD 9.32B by 2032, reflects both the promise and the ongoing distance.

TIER HISTORY

ResearchJan-2019 → Jan-2019

Bleeding EdgeJan-2019 → Apr-2024

Leading EdgeApr-2024 → present

EVIDENCE (122)

2026 State of Production Reliability and AI AdoptionAdoption Metrics2026-05-14

— Survey of 1,000+ SRE/DevOps professionals: 44% experienced incidents from suppressed alerts, 35-point gap between exec belief and practitioner reality on autonomous remediation adoption.

Stonebranch Releases 2026 Global State of IT Automation Report, Revealing Orchestration as the Missing Link for AI Adoption and TrustAdoption Metrics2026-05-14

— Survey of 402 IT automation professionals across four regions: 88% hybrid IT, 64% investing in cloud automation, only 21% have enterprise-wide AI workflow production—signals bottleneck in scale.

Splunk Observability Cloud: Six Months That Changed the GameOpinion2026-05-08

— Splunk's evolution from AI assistant to autonomous troubleshooting agents with root-cause analysis and remediation recommendation capability; on-call engineer role shifts from data gathering to decision-making.

Visit The Splunk And Cisco... (Splunk and Cisco at Red Hat Summit 2026: Turning Alerts into Action)Product Launches2026-05-07

— Splunk ITSI + Red Hat Event-Driven Ansible production-ready closed-loop integration: anomaly detection → correlation → automated remediation without manual triage, delivered at enterprise vendor conference.

Automate CVE remediation at scale - OnaCase Studies2026-05-07

— Fortune 500 deployment: agent fleets patch, test, and auto-merge PRs in parallel; 20x faster remediation, 95% automation, 20% engineering capacity freed; audit trails mandatory.

10 Best AI SRE Tools & Agents in 2026 - Middleware.ioProduct Launches2026-05-07

— Comparison of 10 mature AI SRE tools with automated remediation focus: OpsAI (80% auto-resolve in beta, 90% detection-to-resolution), Datadog Bits AI, Resolve AI; documents ecosystem GA maturity.

Agentic Self-Healing in Production — Jack McNicol at AI Engineer Melbourne 2026Conference Talks2026-05-06

— Practitioner framework for safe agentic self-healing: visibility/diagnostics, constrained action spaces, guardrails, escalation policies, canary actions, staged rollout, observability-first design.

Why Auto-Remediation Without Memory FailsOpinion2026-05-05

— Critical negative signal: stateless auto-remediation causes repeat-incident thrashing (40 remediations/day on same root cause for 11 weeks), symptom masking, cascading failures, alert fatigue recreation.

HISTORY

2019: Market research showed 85% of organizations engaged with self-healing IT initiatives, mostly at PoC stage. AWS released auto-remediation for policy compliance. Vendors and open-source projects began integrating automated remediation with monitoring and orchestration tools.
2020: Major cloud vendors released GA products: AWS Security Hub automated response, Microsoft Defender auto-IR, Red Hat OKD Compliance Operator. GitHub demonstrated ecosystem-scale automated remediation (1,964 projects). Analyst reports recognized self-healing IT as an established market segment. Reliability challenges persisted in practice, with documented failures in patch management and endpoint security remediation.
2021: Research and technical documentation advanced the practice through EU-funded academic research on AI-based Infrastructure as Code self-healing, AWS expanded Config remediation with approval workflows, and Red Hat published architectural blueprints for closed-loop automation. Community tooling matured with open-source Ansible-Dynatrace integration playbooks. Adoption expanded from compliance remediation to broader infrastructure automation patterns.
2022-H1: Cloud vendors continued maturing infrastructure-as-code support for remediation: AWS CDK added CfnRemediationConfiguration construct, and comprehensive tutorials demonstrated healthcare compliance automation. Market analysis valued self-healing networks at $615M with 33.7% CAGR through 2032, confirming commercial traction. However, product limitations emerged in practice—Azure Policy remediation ran only once, requiring manual re-triggering—while cloud economists highlighted hidden lock-in risks in adopting cloud-native remediation services, indicating adoption barriers persisted despite tooling maturity.
2022-H2: Vendor tooling expanded across security and compliance domains: Dynatrace released Extensions 2.0 for custom metric automation, Qualys launched Flow for no-code vulnerability remediation workflows, AWS published tutorials on organization-wide Config remediation. However, practitioner experience documented critical adoption gaps: change management discipline required to avoid automation-induced outages, and full auto-remediation adoption remained limited in favor of guided/human-in-the-loop approaches that balanced safety with efficiency.
2023-H1: Major vendors continued platform innovation: Dynatrace launched AutomationEngine with Davis causal AI-driven remediation and SLO-based triggering, signaling convergence toward intelligent, context-aware automation. Analyst reports (NelsonHall NEAT) projected self-healing IT infrastructure market growth to $98.5B by 2026 (13.6% CAGR). Market research valued self-healing networks at $500M+ with 25% projected growth. However, production failures continued: ServiceNow users reported automated remediation tasks closing unexpectedly without human verification, and Azure Policy remediation remained limited to initial resource creation, failing on deployed infrastructure. These gaps reinforced the tension: vendor platforms offered sophisticated automation, but operational reality demanded careful gates and human oversight due to unintended consequences and platform limitations.
2024-Q1: Major vendors expanded geographic and capability footprints. AWS extended Config auto-remediation to additional regions (Canada West), Azure GA'd machine-level configuration self-healing via ApplyAndAutoCorrect mode, and Dynatrace released AutomationEngine supporting low-code/no-code workflow automation with causal AI triggering. Academic research advanced theoretical foundations for autonomous repair systems and AI-driven IaC self-healing. However, operational challenges persisted: practitioners documented AWS Config infinite-loop failure modes when remediation actions left resources non-compliant, highlighting the ongoing need for careful parameter validation and staged rollout discipline.
2024-Q2: Vendor ecosystem expanded across domains: Varonis launched automated data remediation for AWS (S3, identity management), AWS published Well-Architected Framework best practices for non-compliant resource remediation, and Red Hat/xMatters published practical integrations (Ansible automation, blue-green deployment rollback). Real-world deployments demonstrated scaling success: City Storage Systems cut Kubernetes support toil by 50% with AKS self-healing framework; Lakeside Software reported $22.50 per-ticket remediation costs and 70% IT staff adoption preference. Platform maturity gaps persisted: Azure Policy and AWS Config both exposed limitations in continuous, progressive remediation of deployed resources, reinforcing practitioner preference for guided automation over fully autonomous execution.
2024-Q3: Cross-cloud vendor and ecosystem maturation continued. Red Hat and Dynatrace published joint case studies showing production deployments at Porsche Informatik and TMBThanachart Bank (Thailand) achieving 90% time-to-market reduction and proactive issue resolution via OpenShift-integrated self-healing. AWS expanded tooling with Systems Manager and Amazon Q Developer integration for automated EBS remediation. Microsoft published Azure Well-Architected Framework reliability guidance emphasizing self-healing design patterns and automated failure detection. Independent deployments, including a large US manufacturer, demonstrated shift from reactive monitoring to predictive, automated remediation. However, practitioner analyses continued highlighting integration challenges with ITSM systems, recurring false positive management, and organizational need for disciplined change management to prevent automation-induced outages.
2024-Q4: Vendor tooling maturity and market expansion continued. Dynatrace reinforced AutomationEngine's production readiness with emphasis on closed-loop remediation; AWS published Well-Architected Framework supply-chain security guidance advocating proactive Config remediation; AWS expanded container vulnerability remediation patterns (EventBridge + CodeBuild + CodeDeploy) for automated CVE patching. Practitioner deployments successfully executed AWS Config Conformance Pack remediation despite documented configuration challenges. Market analyst forecasts updated: self-healing networks projected to reach USD 8.89B by 2032 (28.6% CAGR), utility-sector self-healing grids reaching USD 5.71B by 2031 (9.83% CAGR), confirming sustained commercial momentum. Operational reliability challenges remained visible: Windows remediation service failures documented in Q4 reinforced that platform maturity had not eliminated failure modes, especially in edge case handling and error recovery. Adoption pattern remained consistent: guided/human-approved automation preferred over fully autonomous execution in large-scale production deployments.
2025-Q1: Vendor ecosystem continued advancing with Dynatrace AutomationEngine GA (March 2025) enabling low-code/no-code workflow modeling with closed-loop integration and custom extensibility via App Toolkit. Market research (Research and Markets) projected system infrastructure software market at USD 187.7B in 2025 growing to USD 475.5B by 2034 (10.9% CAGR), with AI-powered automation and self-healing highlighted as key growth drivers. However, adoption gap persisted: Mondoo's 2025 vulnerability remediation survey (125 IT/security professionals) revealed only 2% with fully automated workflows, 62% still manual, and 53% experiencing alert fatigue—exposing disconnect between vendor capability and organizational readiness. Cloud platforms (AWS, Azure) continued documenting 10+ production remediation patterns, but large-scale deployments remained cautious, preferring guided automation over fully autonomous execution due to unintended consequence risks and organizational change management discipline requirements.
2025-Q2: Vendor platform expansion continued: AWS ECS launched automated failure detection and rollback via circuit breaker with CloudWatch Alarms, enabling zero-touch remediation of deployment failures. Academic research validated ML-driven automated recovery, showing 70%+ downtime reduction with DQN-based scheduler in Kubernetes. Self-healing infrastructure adoption expanded beyond IT operations into utility grids, with market analysis reporting 40-70% outage reduction and $918.5M–$11.25B market growth projections. Production deployments remained cautious, preferring guided automation despite vendor capability maturity; organizational adoption barriers persisted around integration complexity, false positive management, and risk of unintended consequences.
2025-Q3: Vendor platform maturity advanced with Dynatrace 3rd-gen agentic AI for complex multi-team remediation scenarios; TELUS case study showed 45-minute-to-2-minute debug time reduction. Market adoption breadth expanded: self-healing network market projected to reach $10.2B by 2033 (21.6% CAGR). Dynatrace's own SaaS platform demonstrated production-grade resilience surviving AWS EC2 outage with automated traffic redirection and zero SLO impact. Balanced signals persisted: adoption breadth across enterprise cloud and utility grids vs. organizational caution favoring guided automation over fully autonomous execution; vendor lock-in risks highlighted by Change Healthcare incident reinforced importance of distributed remediation architectures; implementation challenges (false positives, ITSM integration, parameter validation) remained ongoing.
2025-Q4: Telecom and utility sectors advanced self-healing deployments with market validation: self-healing grids valued at $7.1B–$12.8B with 35% efficiency gains from AI-powered automated fault restoration; telecom analysis highlighted closed-loop automation evolution from equipment reboots to complex multi-domain remediation requiring federated intelligence. Infrastructure-as-code practices matured with Ansible and Bedrock integration tutorials demonstrating repeatable AI operations. However, critical infrastructure gaps surfaced: Ansible network automation collections abandoned, and October 2025 AWS outage traced to automated DNS system race condition causing cascading failures—illustrating how autonomous remediation at scale remains fragile without deterministic safeguards and careful state management.
2026-Jan: Vendor platforms continued investment in agentic AI-driven remediation (Dynatrace emphasized "massive automation"); enterprise deployments at financial services showed 47% faster AI-enabled migration with automated anomaly detection. Market research updated self-healing networks projection to $2.61B (2026) tracking toward $9.32B (2032, 22.09% CAGR). Analyst consensus identified critical limitations: observability-as-control-plane requires deterministic analytics and trust-based automation maturity. Practitioner analysis highlighted fundamental barriers: 2025 AI-agent production failures traced to infrastructure incompatibilities; true autonomous remediation requires semantic telemetry, async event-driven architectures, and metadata layers—not just sophisticated AI. Adoption signals remained mixed: guided automation preferred over fully autonomous execution; 2% fully automated vulnerability workflows vs. 62% manual, exposing persistent organizational readiness gaps despite vendor platform capability.
2026-Feb: Vendor ecosystem advanced with AWS Config and Dynatrace AutomationEngine documentation confirming mature, production-grade remediation capabilities. Academic research proposed multi-agent reinforcement learning framework for closed-loop self-healing service operations. Practitioner analysis underscored persistent maturity gap: organizations achieving self-healing report 72% downtime reduction, yet fewer than 1% score above 50/100 on automation maturity index. Risk assessments emphasized safety challenges: cloud environments heighten business disruption risk, requiring risk-ledger approaches and strict human-on-loop policies despite AI platform capability. GitOps-based approaches (ArgoCD with AI agents) emerged as operationally sovereign alternative driven by regulatory requirements (NIS-2). Adoption pattern remained consistent: benefits clear but organizational adoption barriers (integration complexity, false positives, governance discipline) persisted.
2026-Apr: Named enterprise deployments, peer-reviewed research, and large-scale empirical analysis advanced the production case for autonomous remediation while sharpening the urgency argument.
2026-May: Vendor GA releases and named deployments extended the production evidence base: Dynatrace Davis AI shipped predictive remediation with auto-generated Kubernetes deployment fixes, validated by NEQUI digital bank; New Relic Workflow Automation GA launched auto-rollback on deployment errors with approval gates for critical actions, demonstrating bounded autonomy patterns; AWS Support Automation Workflows expanded to 50+ curated remediation scenarios; Telstra PoC demonstrated live-production outage resolution in minutes via AI agents executing Ansible playbooks. Practitioner frameworks sharpened the capability boundary, distinguishing safe automation zones (pod restarts, cache flushes, known runbooks) from failure conditions (ambiguous root causes, irreversible actions), while Gartner's CEO survey (469 respondents) found 80% expect AI-driven operational overhauls and 27% anticipate primarily autonomous operations by 2028 — sustaining the gap between executive expectation and organizational deployment maturity. AWS DevOps Agent showcased five named enterprises—Commonwealth Bank (complex issues resolved in under 15 minutes vs. hours for manual engineers), Deriv (40% MTTR reduction), Clariant, Granola, and Infor—using agentic AI for autonomous infrastructure investigation. A peer-reviewed study co-authored with JPMorgan Chase demonstrated a multi-agent LLM framework achieving 96.8% IaC drift detection and 95.2% security misconfiguration detection with 6.9-minute MTTR. A separate peer-reviewed paper showed automated remediation in CI/CD pipelines reduces MTTR by 76%, increases deployment frequency 24.2x, and achieves 99.96% reliability in Kubernetes environments. Qualys analysis of 1 billion remediation records across 10,000 enterprises found manual processes failed 88% of the time for weaponized CVEs, and a companion Qualys research report documented exploitation hitting negative-one-day windows—making manual workflows structurally incapable of keeping pace and framing autonomous remediation as a necessity rather than an optimization. Gartner reiterated its forecast of 70% enterprise agentic AI adoption for infrastructure operations by 2029 (vs under 5% in 2025), while organizational governance gaps—only 39% with fully automated audit trails—and the adoption floor (2% fully automated vulnerability workflows) confirmed the gap between capability and deployment maturity.

TOOLS

Dynatrace AWS Lambda AWS Service Catalog Ansible AWS IoT Device Defender Red Hat OpenShift AI New Relic AWS Support Automation Workflows