The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI that identifies refactoring opportunities, surfaces technical debt, and suggests prioritised improvements for code quality and maintainability. Includes dead code detection, complexity reduction, and maintenance cost estimation; distinct from code review, which evaluates new changes rather than existing code.
AI-assisted refactoring remains bifurcated: bounded, supervised refactoring for legacy migrations and framework upgrades now delivers quantified business value, while general-purpose automation continues to generate technical debt faster than it retires it. The practice's bleeding-edge status reflects this split and a hardening governance requirement. Salesforce, Airbnb, and monday.com have compressed multi-year migrations into weeks or months, delivering 25-50% cost reductions and 30-40% velocity gains. Yet production evidence in April 2026 confirmed the risk: Amazon's March outage caused $6.3M in losses from AI-generated code, and 43% of SRE-managed AI code requires manual debugging in production despite passing QA. AI-generated code carries 1.7x higher defect density, code churn has increased 861% industry-wide, and 88% of developers report negative technical debt impacts. The trust-adoption paradox deepens: 90% adoption is paired with 96% developer distrust, and only 15% of organizations achieve business value. Governance has become non-negotiable. High-performing organizations escape the paradox through disciplined debt management: KPMG data shows the top 5% achieve 4.5x ROI by implementing AI code caps (25-40% per feature), mandatory review gates, 20% sprint refactoring budgets, and real-time code health checking via tools like CodeScene (6x more accurate than SonarQube). This practice requires senior engineering judgment, governance discipline, and architectural guardrails, not just better models.
Deployment evidence splits sharply between governed and ungoverned adoption. Bounded refactoring with oversight succeeds consistently: Salesforce compressed a two-year legacy migration to four months; Cognizant delivered a 35% cost reduction on Java upgrades; Holger's Code completed a 2-year Delphi-to-TypeScript rewrite in one week via pattern-based execution. May 2026 case studies validate continued success: Blue Pearl's Java 11→21 modernization achieved 90% timeline compression (3 days vs 30+ days) with 92% test coverage and zero security vulnerabilities; a .NET 8 modernization achieved a 35% timeline reduction and 60% infrastructure cost savings. These successes share strict governance: experienced engineers steering the process, deterministic test-driven constraints, and human review gates. Yet frontier research establishes hard capability limits: Scale AI's SWE Atlas benchmark (May 2026, 70 production refactoring tasks across 6 languages) shows the frontier model Claude Opus achieving only 48.57% success on agent-driven refactoring, with open models lagging significantly and introducing regressions. Research posted to arXiv (May 2026) quantifies the core problem: the "Volume-Quality Inverse Law" finds that code volume predicts structural degradation, and AI systems produce a "machine signature of defects" invisible to functional correctness testing, reframing refactoring as an architectural complexity problem that better models alone do not solve. New vendor tooling supports bounded refactoring: SonarQube's Remediation Agent reached GA with sandbox verification, and CodeScene's MCP integration (6x more accurate than SonarQube for maintainability prediction) enables real-time code health checking.
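The governed pattern these deployments share is mechanically simple: constrain the model to one bounded unit at a time, gate every change behind the existing test suite, and roll back deterministically on any failure, holding accepted changes for human review rather than auto-merging. A minimal Python sketch of that loop, assuming a hypothetical propose_refactor callable wrapping the model and a pytest suite as the acceptance gate (an illustration of the pattern, not any named vendor's pipeline):

```python
import subprocess

def tests_pass() -> bool:
    # The project's existing test suite is the deterministic acceptance gate.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def apply_patch(diff: str) -> bool:
    # Apply the model's proposed diff via git; reject it if it does not apply cleanly.
    return subprocess.run(["git", "apply", "-"], input=diff.encode()).returncode == 0

def revert() -> None:
    # Discard working-tree changes left by a failed attempt.
    subprocess.run(["git", "checkout", "--", "."], check=True)

def bounded_refactor(targets: list[str], propose_refactor) -> list[str]:
    """Refactor one bounded unit at a time; never whole-repo sweeps."""
    if not tests_pass():
        raise RuntimeError("baseline tests must pass before refactoring starts")
    accepted = []
    for path in targets:
        diff = propose_refactor(path)  # hypothetical LLM call, scoped to one file
        if diff and apply_patch(diff) and tests_pass():
            accepted.append(path)      # queue for human review; do not auto-merge
        else:
            revert()                   # deterministic rollback on any failure
    return accepted
```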
Ungoverned deployment triggers cascading debt and production incidents. May 2026 evidence documents widespread risk: 89% of enterprise engineering teams (Censuswide survey, N=500) have experienced AI-generated code incidents; 25% suffered complete system outages; 41% report increased manual review time post-AI adoption despite 95% awareness of scrutiny best practices. April 2026 evidence documented Amazon's March 2026 incidents causing $6.3M in losses from AI-generated code, and Lightrun's SRE survey revealed that 43% of AI code requires manual debugging in production after passing QA. METR's landmark randomized trial (May 2026) contradicts perceived productivity: developers perceive a 20% speedup but measure a 19% slowdown on complex systems. GitClear documents 8x code duplication and 1.57x more security vulnerabilities in AI samples. Faros telemetry shows AI as the primary code author, with code churn +861%, bugs +54%, and incidents +242.7%. The architectural capability ceiling is equally exposed: SmellBench (May 2026, arXiv) reveals a 63% false-positive rate on architectural code smell repair, showing that autonomous architectural refactoring remains beyond current LLM agent capability. The net problem: AI-generated code creates 1.7x more issues than human code, with a 30-41% technical debt increase within 90 days, while PR review times spike 441% year-over-year, creating a verification bottleneck that governance must address.
Governance frameworks demonstrably work but remain sparsely deployed. A documented case study resolved a Year 2 crisis (3.8x maintenance costs after initial 40% velocity gains) by implementing a governance structure: a 35% AI code cap, a 20% sprint refactoring budget, tiered review gates, and commit audit trails. KPMG reports the top 5% of organizations achieve 4.5x ROI versus a 2x average through disciplined tech debt management, identifying debt governance as the key competitive differentiator. Yet governance adoption remains sparse: 90% AI adoption but only 15% achieving business value; 88% report negative debt impacts even as 93% report positive productivity; 96% of developers distrust AI code for deployment. The limiting factor is organizational readiness to establish and enforce governance discipline, not tooling. Systemic governance failure is measurable: 38% of teams see deployment frequency rise while change failure rate increases in parallel, and 41% of AI-generated commits correlate with higher rework rates, quantifying the "acceleration with debt" paradox. Organizations lacking code-level visibility into AI contribution face average additional breach costs of $670K.
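Several of these controls are straightforward to automate in CI. As an illustration only, a Python sketch of a pipeline gate enforcing the 35% AI code cap, assuming a hypothetical team convention in which AI-assisted commits carry an "AI-Assisted: true" git trailer (the audit trail the case study describes; the trailer name and threshold here are invented for the example):

```python
import subprocess
import sys

AI_CODE_CAP = 0.35  # governance cap: at most 35% of added lines AI-authored

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

def added_lines(commit: str) -> int:
    # Sum the "added" column of --numstat for one commit (binary files show "-").
    total = 0
    for line in git("show", "--numstat", "--format=", commit).splitlines():
        first = line.split("\t")[0]
        if first.isdigit():
            total += int(first)
    return total

def is_ai_assisted(commit: str) -> bool:
    # Hypothetical convention: AI-assisted commits carry an "AI-Assisted: true" trailer.
    value = git("show", "-s", "--format=%(trailers:key=AI-Assisted,valueonly)", commit)
    return value.strip().lower() == "true"

def main(base: str, head: str) -> int:
    commits = git("rev-list", f"{base}..{head}").split()
    total = sum(added_lines(c) for c in commits)
    ai = sum(added_lines(c) for c in commits if is_ai_assisted(c))
    share = ai / total if total else 0.0
    print(f"AI-authored share: {share:.0%} of {total} added lines (cap {AI_CODE_CAP:.0%})")
    return 1 if share > AI_CODE_CAP else 0  # non-zero exit fails the pipeline

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

In practice such a gate would run per feature branch or PR range, with the review tier (and the 20% refactoring budget) keyed to the measured share.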
— Quantifies technical debt accumulation: 38% of teams see deployment frequency rise while change failure rate increases in parallel; 41% of AI-generated commits correlate with higher rework; PR review times spike 441% YoY.
— Frontier benchmark: Claude Opus achieves 48.57% success on 70 production refactoring tasks; open models lag with regressions, establishing hard capability limits for autonomous code restructuring.
— Empirical study: 63% false-positive rate on architectural code smell repair; exposes the autonomy-accuracy trade-off, confirming that architectural refactoring remains beyond current capability.
— METR randomized trial synthesis: developers perceive 20% speedup but measure 19% slowdown on complex systems; GitClear documents 8x duplication and 1.57x more security vulnerabilities.
— Research (arXiv, May 2026): the 'Volume-Quality Inverse Law' finds that code volume predicts structural degradation; AI produces a 'machine signature of defects' invisible to functional testing.
— Named vendor case study: .NET Framework→.NET 8 modernization achieved 35% timeline reduction, 60% infrastructure cost savings, 50% API response time improvement.
— Censuswide survey (N=500): 89% experienced AI-generated code incidents; 25% suffered complete outages; 41% report increased manual review time post-AI adoption, documenting the verification bottleneck.
— Named deployment: Blue Pearl's Java 11→21 refactoring achieved 90% timeline compression (3 days vs 30+), 92% test coverage, 127 deprecated APIs resolved, zero CVEs post-migration.
2023-H1: Academic research on AI for technical debt management emerging (literature review, safety models); vendors (Moderne) positioning automated remediation; community tooling for dead code detection visible but limited adoption. Key blocker: tool interpretation and organizational understanding of debt as a structural rather than metric-driven problem.
2023-H2: Strategic adoption in defense sector formalized (DoD programs establish management practices per CMU SEI report); academic research maturation shows evolution to transformer-based SATD detection; industry metrics quantify financial impact ($306K per million LOC per year); tool maturity assessments reveal production readiness gaps even for specialized refactoring. Organizational barriers persist: static analysis tools show only 15% overlap with developer-identified debt, revealing need for complementary detection approaches.
2024-Q1: Generative AI adoption in software development reaches 78% (up from 23% in 2023). Vendor landscape accelerates: Moderne integrates LLMs with OpenRewrite for multi-repo refactoring. Simultaneously, evidence emerges that AI-generated code creates new technical debt (reduced readability, poor test coverage, code churn). Research warns ML systems themselves accumulate structural debt faster than manual development. Tool fragmentation persists; organizational tension between metrics-driven and developer-perceived debt remains unresolved.
2024-Q2: AI-assisted refactoring correctness crisis exposed—industry analysis reveals only 37% of AI-generated refactorings are functionally correct in production contexts, setting back automation confidence. Domain-specific studies confirm existing tools inadequately meet practitioner needs (e.g., deep learning projects). Security dimension of technical debt clarifies: SATD detection research links debt comments to MITRE Top-25 vulnerabilities, establishing debt as a security concern. Measurement research shows technical debt's impact on velocity is inconsistent and context-dependent; identification and measurement emerge as the most sought-after automation activities, yet tool adoption faces barriers from explainability gaps.
2024-Q3: Correctness crisis deepens—extended analysis reveals Copilot 46.3%, ChatGPT 65.2%, CodeWhisperer 31.1% correctness rates; code churn doubling by 2024; 80% of enterprises report technical debt stifles innovation. Security debt quantified at scale: Veracode's 13M scans show 42% of apps have >1-year-old unresolved vulnerabilities (Java 46%, Python 23%). Developer trust in AI coding tools remains low (42% trust output, 45% report inadequacy for complex tasks despite 76% adoption). Positive signal: Moderne demonstrates production-scale AI semantic code search across 1,218 repos; practitioner successes with bounded refactoring tasks (null-safety migrations, 5000+ changes). BERT models emerge as most effective for technical debt detection (arXiv September 2024). Organizational and technical barriers persist as tier-limiting factors.
2024-Q4: AI tool adoption reaches 84% of developers (up from 76% in Q3), but distrust climbs to 46% (from 31%), signaling adoption-confidence divergence. Refactoring correctness ceiling remains: ChatGPT achieves 63.6% expert-parity rates, Gemini 56.2%; legacy tools below 50%. New concern: AI-generated code creates technical debt faster than manual development (CAST November 2024). Vendor consolidation toward "compliance-grade" refactoring (Byteable, CAST Highlight) reflects market shift to risk management. Academic research (Journal of Systems and Software, December 2024) surveys technical debt in AI-enabled systems; arXiv papers on IDE trust safeguards signal focus on adoption barriers. Practitioner consensus emerges: LLMs excel in bounded domains (null-safety, standard patterns, young codebases) but worsen debt in legacy/novel code. Correctness and organizational trust remain tier-limiting factors.
2025-Q1: AI tool adoption reaches 98% of developers (near-universal), but verification practices stagnate—96% don't fully trust output, only 48% verify before commit. Technical debt created by AI-generated code accelerates sharply: 8-fold increase in code duplication, 10x redundancy since 2022, 7.2% delivery stability decline. Vendor tooling pivots to remediation: SonarQube releases AI Code Assurance and AI CodeFix to validate and auto-fix AI-generated code. Accenture quantifies US technical debt at $2.41 trillion annually. Practitioner analysis (GitLab, Tomassetti, RecodeX) documents persistent limitations: AI excels at standard patterns but fails on symbol resolution, context-dependent logic, and legacy systems. The practice remains bleeding-edge: verification gap, correctness ceiling, and organizational trust barriers prevent general-purpose automated refactoring from reaching production maturity despite pockets of success (bounded task automation, young codebases).
2025-Q2: Web developers report 61% of AI-generated code requires refactoring due to readability and duplication issues, confirming technical debt creation at deployment scale. SonarQube's AI Code Assurance adoption grows among federal agencies seeking compliant AI validation workflows. Practitioner experiments document critical refactoring failures: type mismatches breaking production code, architectural misunderstandings in autonomous transformations, silent state-handling bugs. Survey data shows 82% of developers use AI assistants daily, but two-thirds report AI missing critical context for large refactoring tasks. Enterprise tooling demonstrates bounded-task success (40% lead time improvements from complexity reduction via targeted debt fixes), but AI-generated code carries 1.7x higher defect density than human code. Verification gap and contextual blindness remain tier-limiting: deployment-scale refactoring automation achieves success only in constrained domains, while general-purpose production refactoring remains unresolved.
2025-Q3: Real-world deployment evidence emerges with qualified success: Cognizant demonstrates 35% cost and 25% effort reduction on major Java migrations; European telecom leader successfully modernizes monolithic legacy architecture via phased, engineer-guided AI refactoring. Simultaneously, GitClear's analysis of 211M LOC confirms refactoring has become the weakest link in the development cycle—2025 marks the first year code duplication introduction exceeds refactoring activity, indicating AI-generated code creates technical debt faster than automated remediation can address it. Stack Overflow's 49K-developer survey (July 2025) and Google DORA research (~5K professionals) both show persistent trust deficit: adoption remains near-universal but practitioners remain reluctant. Fastly's developer survey documents that 33% of senior developers ship >50% AI-generated code (vs. 13% junior), signaling differentiated deployment by experience level. Academic research (IEEE) on distributed system refactoring reveals failure propagation risks across service boundaries, underlining the context-sensitivity of safe refactoring. The tier-limiting tension persists: bounded refactoring tasks (major framework upgrades, null-safety migrations, standardized patterns) show measurable success in controlled deployments, but the paradox deepens—automation addresses only the small subset of refactoring work where AI correctness is highest, while the bulk of technical debt remains in legacy, novel, and domain-specific code where AI-assisted approaches continue to generate new debt.
2025-Q4: Significant production deployment evidence crystallizes the bifurcated landscape. Salesforce publicly documents AI-driven refactoring cutting a 2-year legacy migration to 4 months—a major signal of bounded-task mastery. Yet the adoption-distrust paradox deepens: Stack Overflow's year-end survey (49K+ developers) shows 80% adoption paired with trust collapsing to 29%, with 66% reporting extra time fixing AI-generated code. JetBrains' global ecosystem survey (24.5K developers) confirms 85% regular AI use and 62% reliance on AI coding assistants, but developers express skepticism about productivity metrics in technical debt contexts. Practitioner case studies reveal nuanced deployment: Altom's test automation refactoring achieved acceleration with AI but required human expertise for complex inheritance patterns. Developer communities document critical refactoring failures: over-application of DRY principles creating "Conditional Monsters," wrong abstractions, and silent state-handling bugs. Security dimension intensifies: real breaches (Toyota, Decathlon) trace to legacy migration and misconfiguration debt from rapid AI-assisted deployments, underscoring data risk consequences. The practice remains bleeding-edge: bounded refactoring (legacy migrations, test suite modernization, pattern-based transformations) achieves repeatable success with engineer oversight, but unguided large-scale automation continues to generate technical debt, and the trust-adoption gap signals organizational barriers persist unresolved. Correctness, architectural understanding, and contextual sensitivity remain tier-limiting factors preventing production-ready general-purpose AI refactoring.
2026-Jan: GenAI-induced technical debt becomes measurable and formalized. TechDebt 2026 conference peer-reviewed research quantifies 81 documented cases of GenAI-Induced Self-Admitted Technical Debt (GIST) with specific patterns: verification gaps, incomplete AI-code adaptation, and developer comprehension failures. Moderne's production multi-repo refactoring platform gains enterprise adoption (MEDHOST, Interactions, Allstate, Intel Capital, Choice Hotels), validating vendor maturity. Positive deployment signal: D3 Alpha case study documents successful AI agent refactoring of complex enterprise data systems in weeks vs. prior 76-day manual efforts. Consulting firms (Thoughtworks, BCG, EY, IBM, Xebia) formalize GenAI modernization methodologies with 30-60% efficiency gains on legacy systems. Critical countervailing evidence: Baytech analysis frames rapid AI code generation as creating an "Efficiency Paradox" with high-interest debt in the final 30% of projects; practitioner guidance emphasizes proactive debt inventory and allocation strategies (20% Rule, Debt Log). The bifurcated landscape persists: bounded refactoring (framework upgrades, legacy migrations) with oversight demonstrates production viability, while unguided large-scale automation continues to risk new debt creation. Verification gap, architectural blindness, and contextual insensitivity remain tier-limiting factors.
2026-Feb: Tool ecosystem expansion for managing AI-induced debt accelerates. Moderne extends automated refactoring to Python, signaling market evolution toward practical debt governance as 50% of code changes become AI-generated. Vendor and practitioner evidence validates bounded refactoring success: Holger's Code demonstrates 2-year Delphi migration in 1 week via pattern-based AI execution with human oversight; GitHub Debt Insights deployment shows 45% reduction in debt-related incidents with 3-week advance prediction. Yet the trust-adoption paradox persists and intensifies: SonarSource survey finds 88% of developers report at least one negative AI impact (53% cite unreliable code, 40% duplication), while 93% report at least one positive impact, revealing AI's dual nature. Stack Overflow's ongoing survey shows 84% adoption but only 29% trust, defining trust as willingness to deploy with minimal review—a critical signal that low confidence hinders deployment at scale. Practitioner case studies document nuanced outcomes: Tech Stratos's 40% codebase refactoring showed 30-40% feature velocity gains and 18% test coverage improvement but also critical failures (async error handling, performance regressions, architectural drift), reinforcing that AI amplifies senior judgment but cannot replace it. Verification gap, contextual blindness, and correctness ceiling remain tier-limiting blockers. The practice remains bleeding-edge because while vendor tooling has matured and pockets of successful bounded refactoring exist, the trust-adoption gap signals that organizational readiness for general-purpose AI-driven refactoring has not advanced commensurate with tool capability.
2026-Mar: Bifurcation deepens with refined measurement evidence. CodeTaste benchmark (arXiv March 2026) quantifies the autonomy-accuracy gap: LLM agents achieve 70% accuracy on specified refactoring tasks but <8% success on autonomous discovery, establishing hard limits on unguided automation. Macro-scale analysis of 8.1M PRs from 4,800 teams documents that AI-generated code carries 1.7x higher defect density, 30-41% more technical debt, and 19% slower delivery despite perceived productivity gains; separately, US technical debt costs are quantified at $2.4T annually, with high-debt orgs spending 40% more on maintenance. Large-scale deployment cases prove the win-loss split: Airbnb refactored 3.5k React tests in 6 weeks (vs 1.5-year baseline, 75%→97% success); monday.com completed its JS monolith breakup in 6 months (vs 8-year baseline) via a hybrid AI+engineering approach. CodeScene's agentic refactoring benchmark (Claude Code + CodeHealth MCP) achieved 2-5x code health improvement across 25k files, with Extract Method refactorings increasing 3x under structured guidance, albeit against an industry baseline of 5.15/10 health versus the 9.4+ threshold required for AI-safe code. Veracode's 150-model security analysis reveals a persistent security ceiling: a 55% pass rate regardless of model scale. The practice remains firmly bleeding-edge: structured, supervised refactoring demonstrates concrete productivity gains with oversight, but security debt, correctness failures, and architectural blindness continue to exceed remediation capacity.
2026-Apr: Debt accumulation evidence intensified alongside a governance signal. Faros telemetry (22K developers, 4K+ teams) confirmed AI as primary code author with code churn +861%, bugs +54%, and incidents +242.7%; Baytech meta-analysis (211M LOC) documented refactoring activity collapsing below 10% of commits while code cloning quadrupled, quantifying the debt creation rate that refactoring practices must now address. KPMG's survey of 2,500 executives found the top 5% of organizations achieve 4.5x ROI through disciplined tech debt governance versus 2x for the average, identifying governance discipline as the key competitive differentiator. A practitioner case study documented Year-2 crisis mechanics precisely: 40% velocity gains in Year 1 reversed into 3.8x maintenance costs by Year 2, resolved only by implementing a governance framework with a 35% AI code cap, 20% sprint debt budget, and tiered review gates. Enterprise case study evidence documented the "throughput trap": teams accumulate surface area faster than they can validate it, with AI building on dead code and ignoring legacy patterns, a distinct organizational governance failure mode separate from individual code quality issues. SonarQube's AI CodeFix reached GA alongside a technical review confirming AI Code Assurance and Quality Gates as key mechanisms for maintaining code quality amid rapid AI-assisted development. OpenRewrite's LST-based recipe engine was validated as foundational tooling for large-scale bounded refactoring (migrations, framework updates, consistency fixes) at scale. SonarQube Remediation Agent and CodeScene's MCP integration (6x more accurate than SonarQube for maintainability prediction) continued to mature. The phase hardened the practice's defining tension: tooling to manage AI-induced debt has matured, but adoption of governance discipline remains sparse: 91% of teams use AI coding but only 15% achieve business value.
2026-May: Frontier research establishes hard capability limits for autonomous refactoring. Scale AI's SWE Atlas benchmark (70 production refactoring tasks across 10 repos, 6 languages) shows frontier model Claude Opus achieving 48.57% success, with open models lagging significantly and introducing regressions, establishing the lower bound of current agent capability. Research posted to arXiv (May 2026) quantifies architectural decay: the "Volume-Quality Inverse Law" finds that code volume predicts structural degradation, and AI systems produce a "machine signature of defects" invisible to functional correctness testing, shifting the problem from code generation to architectural complexity management. SmellBench (May 2026) reveals the architectural capability ceiling: a 63% false-positive rate on code smell repair, confirming that autonomous architectural refactoring remains beyond current LLM agent capability. Production incident prevalence is documented: a Censuswide survey (N=500 enterprise IT engineers) shows 89% experienced AI-generated code incidents and 25% suffered complete system outages; 41% report increased manual review time post-AI adoption. METR's landmark randomized trial contradicts productivity perceptions: developers feel a 20% speedup but measure a 19% slowdown on complex systems; GitClear documents 8x code duplication and 1.57x more security vulnerabilities in AI samples. Named deployment successes persist but remain bounded: Blue Pearl's Java 11→21 refactoring achieved 90% timeline compression (3 days vs 30+ days) with 92% test coverage; a .NET 8 modernization achieved a 35% timeline reduction and 60% infrastructure savings. Systemic governance failure is measured at scale: PR review times spike 441% year-over-year; 38% of teams see deployment frequency rise while change failure rate increases in parallel; 41% of AI-generated commits correlate with higher rework rates. The practice remains firmly bleeding-edge: frontier research confirms bounded refactoring (framework upgrades, pattern-based migrations) with engineer oversight succeeds at scale, while architectural refactoring, autonomous discovery, and unguided deployment continue creating debt faster than governance frameworks can manage. Architectural complexity management and organizational adoption of governance discipline remain the tier-limiting factors.