Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organizational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in one or two domains — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE ←→ ESTABLISHED

Agentic coding for production integration

BLEEDING EDGE

TRAJECTORY

Stalled

AI agents completing production coding tasks within supervised workflows, including PR creation and review cycles. Covers agent-generated PRs with human review gates and CI checks; distinct from fully autonomous coding, which removes the human approval step.

OVERVIEW

Agentic coding for production integration -- AI agents that write, test, and propose code changes through pull requests with human review gates and CI checks -- is widely experimented with but remains fundamentally constrained by governance and safety barriers. Enterprise adoption has reached 90% at the pilot level, yet only 11% of agentic AI projects escape the pilot phase into production. The tier-defining tension is not capability but risk: real-world deployments consistently trigger critical failures when agents retain production access without execution boundaries. April 2026 incidents document the pattern: a Cursor-based agent deleted a production database due to credential mismatch and unverified API permissions; Amazon's Kiro agent destroyed AWS infrastructure and caused retail outages totaling 6.3M orders; GitHub's platform itself buckled under 275M commits per week and 17M agent-generated PRs per month. Agents reliably handle boilerplate and refactoring -- 26% of refactoring commits are explicitly agent-authored with measurable quality gains -- but fail on 70-90% of complex tasks. Successful supervised workflows (Microsoft .NET runtime: 878 agent PRs, 67.9% merge rate over 10 months) require explicit human oversight, structured tool gateways, and immutable state management. The bottleneck has shifted decisively from capability to architecture: not "can agents write better code?" but "can we build organizational and technical controls that prevent unverified decisions from becoming destructive actions at machine speed?"
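The supervised pattern described here — agent proposes, CI verifies, a human approves before merge — reduces to one invariant: green CI alone is never sufficient to merge. A minimal sketch of that invariant follows; the class and field names are illustrative, not any vendor's API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentPR:
    """Hypothetical agent-generated pull request in a supervised workflow."""
    title: str
    ci_passed: bool = False
    approvals: list = field(default_factory=list)  # human reviewer logins

    def can_merge(self, required_approvals: int = 1) -> bool:
        # Merge authority stays with humans: passing CI is necessary,
        # but a recorded human approval is what unlocks the merge.
        return self.ci_passed and len(self.approvals) >= required_approvals

pr = AgentPR(title="refactor: extract retry helper")
pr.ci_passed = True
assert not pr.can_merge()          # green CI, no human approval: blocked
pr.approvals.append("reviewer-1")
assert pr.can_merge()              # human gate satisfied
```

The point of the sketch is that the gate is structural, not advisory: there is no code path to a merged state that bypasses the approvals list.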

CURRENT LANDSCAPE

GitHub dominates the supervised agentic workflow infrastructure: agent startup was optimized to run 50% faster by March 2026, and the Agents tab formalized centralized management. Copilot reached 4.7M paid subscribers (January 2026, +75% YoY) with 90% Fortune 100 penetration, and Claude Code achieved $2.5B ARR, confirming platform maturity. Yet April 2026 evidence reveals production integration at scale has become genuinely hazardous. GitHub's infrastructure absorbed 275M commits per week (14x YoY growth) and 17M agent-generated PRs per month before experiencing five major outages in early April alone -- search downtime, Copilot backend exhaustion, agent session failures. The velocity has outstripped safety: Claude Code generates 2.6M commits weekly, a 25x increase from September. May 2026 analysis of 29,585 real GitHub PR lifecycles (Chung & Hassan) confirms that across five production agents (Copilot, Devin, Cursor, Claude Code, OpenAI), merge authority remains "almost exclusively human," with tooling distributing agent initiative differently but supervision gates consistently blocking autonomous deployment. Platform maturity is organizational, not technical: GitHub's May changelog formalizes secrets management and cloud agent environment controls; IBM Bob (May GA) and OpenAI Codex (May disclosure) both detail enterprise production architecture including sandboxing, approval workflows, and telemetry.

Real production failures now dominate the evidence base. April 2026 incident logs document the systematic pattern: (1) Cursor agent deleted production Railway database after finding unverified API token in unrelated file; (2) Amazon's Kiro agent destroyed entire AWS infrastructure (RDS, VPC, ECS, load balancers) after misinterpreting stale Terraform state, causing retail outages totaling 6.3M lost orders (~$6.3M impact) and triggering 90-day code safety reset requiring two-reviewer minimum on all AI-generated changes; (3) GitHub infrastructure strain from agent-driven load, causing cascading outages. Independent research from April 2026 validates the severity: MSR study of 11,771 real production PRs found top models complete only 24% of complex tasks autonomously with 70-90% failure rates as complexity increases; Lightrun's production data shows 49% of AI-generated code fails in production despite passing QA. Security remains critical: 33,000+ agent-generated PRs show recurring vulnerabilities (regex inefficiencies, injection flaws, path traversal) that are merged despite known issues, and analysis of legacy codebases shows AI code produces 2.74x more security vulnerabilities and 1.7x more issues than human code.
May 2026 security research intensified this crisis: Microsoft Defender team (May 7) disclosed CVE-2026-26030 and CVE-2026-25592 in Semantic Kernel (27K GitHub stars), demonstrating remote code execution (RCE) via prompt injection in production agent frameworks; six coordinated research teams (May 9) disclosed credential theft exploits across Codex, Claude Code, Copilot, and Vertex AI, revealing 78% of enterprises lack PAM (Privileged Access Management) for agent credentials; Adversa.AI disclosed TrustFall supply-chain attack (May 7) showing malicious repository injection succeeds identically across all major CLI agents (Claude, Cursor, Copilot, Gemini); NVIDIA Red Team disclosed AGENTS.md injection vector (April 30) specific to cloned repositories in production environments. The gap between framework-level guarantees and deployed reality is absolute: vulnerabilities exist at the design level (tool parameter trust) rather than implementation, making patching insufficient without architectural redesign.
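The PAM gap flagged in the credential-theft disclosures comes down to agents holding standing, broad-scope credentials. The mitigation pattern is per-task, short-lived, least-privilege tokens; a minimal sketch follows, with all names (token scopes, forbidden actions) hypothetical, not drawn from any vendor's product.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical short-lived, least-privilege credential minted per agent task."""
    actions: frozenset   # e.g. {"read_repo", "open_pr"}; never broad admin scope
    expires_at: float    # epoch seconds; expired tokens authorize nothing

def mint_token(actions, ttl_seconds=900):
    # Destructive production scopes are unmintable by construction,
    # so a prompt-injected agent cannot request them at all.
    forbidden = {"drop_database", "delete_infrastructure"}
    blocked = forbidden & set(actions)
    if blocked:
        raise PermissionError(f"refusing to scope token for {sorted(blocked)}")
    return ScopedToken(frozenset(actions), time.time() + ttl_seconds)

def authorize(token: ScopedToken, action: str) -> bool:
    # Deny by default: only explicitly granted, unexpired actions pass.
    return action in token.actions and time.time() < token.expires_at

token = mint_token({"read_repo", "open_pr"})
assert authorize(token, "open_pr")
assert not authorize(token, "push_to_main")   # outside scope: denied
```

The design choice worth noting is that the deny happens at credential-minting time, not at call time: even a fully compromised agent process never possesses a token that could reach the forbidden scopes.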

Yet supervised deployments do work at production scale when guardrails are explicit. Microsoft's .NET runtime team achieved 67.9% PR merge rate over 10 months (878 PRs, 535 merged) with 0.6% revert rate -- equivalent to human-authored code -- through explicit human oversight, tool gateways with schema validation, and immutable state management. Specialized, tightly-scoped deployments (Iowa IoT platform: 6 agents, 585 sessions; Ledgerpoint: 180K LOC Java→Kotlin in 8 weeks, 94% first-pass approval) succeeded by enforcing narrow scope, verifiable output, and review gates. The structural pattern is clear: autonomous or minimally-reviewed agents cause cascading failures; supervised agents within tight boundaries succeed. The governance bottleneck remains unsolved at scale: Lightrun reports developers spend 38% of their week (double the pre-AI baseline) debugging and verifying AI code, with hidden verification costs of $82-103K per month for 20-person teams. Developers spend 4.6x longer reviewing agent PRs, and the introduction of agent code correlates with a 52% increase in review time and an 18% rise in production incidents, even in supervised workflows. The tier-defining constraint is not capability but architectural readiness: whether organizations can enforce cost caps, tool-call validation, immutable memory boundaries, and observability instrumentation.
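The controls named above — cost caps, tool-call validation, append-only audit state — compose naturally into a single gateway sitting between agent and tools. A minimal sketch of that composition follows; the class, schema format, and tool names are hypothetical illustrations, not any framework's actual API.

```python
class ToolGateway:
    """Hypothetical gateway between an agent and its tools: every call is
    schema-checked, budget-capped, and logged before it can take effect."""

    def __init__(self, schemas, max_calls=50):
        self.schemas = schemas       # tool name -> set of required argument names
        self.max_calls = max_calls   # hard cost cap per agent session
        self.audit_log = []          # append-only; entries are never mutated

    def call(self, tool, args, impl):
        if len(self.audit_log) >= self.max_calls:
            raise RuntimeError("session cost cap reached")
        if tool not in self.schemas:
            raise PermissionError(f"unknown tool: {tool}")
        missing = self.schemas[tool] - set(args)
        if missing:
            raise ValueError(f"schema violation, missing args: {sorted(missing)}")
        # Record the validated call before executing any side effects,
        # so the audit trail survives even if the tool itself fails.
        self.audit_log.append((tool, dict(args)))
        return impl(**args)

gw = ToolGateway({"read_file": {"path"}}, max_calls=2)
content = gw.call("read_file", {"path": "README.md"}, impl=lambda path: f"<{path}>")
assert content == "<README.md>"
assert len(gw.audit_log) == 1
```

Because every tool invocation must pass through `call`, an unverified decision by the model cannot become an action the gateway never validated — which is the architectural control the paragraph argues for.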

TIER HISTORY

Research: Oct-2024 → Jan-2025
Bleeding Edge: Jan-2025 → present

EVIDENCE (76)

— IBM Bob GA (May 2026): end-to-end SDLC agent with named customer deployments, specific metrics, and human-in-the-loop review architecture for production integration.

— CVE-2026-26030 and CVE-2026-25592 in Microsoft Semantic Kernel demonstrate RCE risk in production agent frameworks via prompt injection, directly impacting deployed agentic systems.

— Six research teams disclosed coordinated exploits of production agents (Codex, Claude Code, Copilot, Vertex AI), revealing credential theft vectors and that 78% of enterprises lack PAM controls for agent credentials.

— OpenAI Codex production safety architecture for enterprise deployment: sandboxing, human approval workflows, network policies, and continuous telemetry for CI/CD integration.

— Empirical analysis of 29,585 real GitHub PR lifecycles across five production agents (Copilot, Devin, Cursor, Claude Code, OpenAI) shows merge authority remains almost exclusively human with distinct tool-specific governance patterns.

— GitHub GA: Copilot cloud agent improvements, org-level secrets/variables configuration, and cloud agent environment management for production agentic integration.

— Microsoft Defender Security Team disclosed host-level RCE vulnerabilities in widely-used Semantic Kernel framework (27K+ GitHub stars), demonstrating execution risk for integrated production agents.

— Adversa.AI 'TrustFall' attack demonstrates supply-chain compromise via malicious repository cloned by agents in production CI/CD pipelines across Claude Code, Cursor, Copilot, and Gemini.

HISTORY

  • 2024-Q4: First empirical evaluation of agents on real-world GitHub issues showed mixed results: 30-35% resolution with successful patches reducing duplication but persistent failures on complex problems. Industry pilots at Fortune 500 level reported cautiously, with emphasis on risk barriers rather than productivity gains.
  • 2025-Q1: Product GA for supervised integration (Copilot Autofix in Workspace, Copilot Workspace for code scanning). Adoption metrics showed nearly 50% of professional programmers using agent mode; surveys identified reliability and integration as top barriers. Real-world testing (Devin 3/20 task success) and cost analysis confirmed narrow domain viability but complex reasoning limitations.
  • 2025-Q2: GitHub Copilot coding agent reached general availability (May 13) with rollout to Business and Pro tiers by June; asynchronous PR generation with human review gates became standard supervised workflow. Independent testing of seven agents and developer surveys documented persistent reliability and security issues; 68% of developers reported increased security incident load from AI-generated code. Vendor consolidation (IBM watsonx COBOL generation) indicated platform integration investment, but organizational adoption remained cautious and concentrated in narrow use cases.
  • 2025-Q3: Enterprise agentic adoption accelerated (50%→82% over six months) yet empirical evidence mounted of fundamental gaps: peer-reviewed PR analysis showed distinct agentic patterns with high revision rates; security audits found 42% of generated snippets hide flaws and 10% leak private data. Analyst reports predicted 40% project abandonment by 2027; agents documented failing 70% of multi-step tasks. Practitioner consensus: agentic coding viable for boilerplate and routine fixes, unsuitable for complex logic or architectural decisions—adoption blocked by reliability, security, and cost barriers despite vendor GA features and feature expansion.
  • 2025-Q4: Vendor feature expansion continued (GitHub Agent Skills for specialized tasks, org-wide instructions, built-in security scanning) yet empirical research documented twin realities: agents actively generate refactorings with measurable quality improvements (26.1% of commits), but also produce code bloat with unnecessary methods requiring skilled review. PR acceptance metrics (83.8% vs. 91% human baseline) revealed 7% friction cost; 45% required revision. Operational challenges surfaced: forced model migrations affected 24K+ developers, exposing vendor lock-in. Maturity consolidated: agentic coding standard for boilerplate/refactoring, unsuitable for architecture/multi-step logic; organizational adoption 82%+ but scaling blocked by operational governance and vendor stability rather than capability gaps.
  • 2026-Jan: Large-scale adoption metrics (15.85%-22.60% across 129,134 GitHub projects) confirmed rapid ecosystem uptake, yet paradoxes widened: 90% enterprise adoption correlates with 11% production deployment rate and elevated friction (4.6x longer PR review, 15-18% more vulnerabilities). GitHub shipped Agents tab for centralized workflow management; Amazon shared structured specification approaches for production scale. Practitioner consensus hardened: agents excel at boilerplate/refactoring but fail consistently on complex integration tasks. Production barriers remain structural: review burden, security governance, comprehension debt.
  • 2026-Feb: Vendor feature expansion accelerated (Windows environment support, model picker, self-review, security scanning) and capability benchmarks improved (task length doubled to 14.5 hours; Claude Code ARR reached $2.5B), but organizational barriers hardened. Dynatrace survey: 50% of projects stuck in pilot due to supervision/security challenges. Internal GitHub data revealed security governance gaps (only 17% use firewall protections). Case studies showed agents excel at refactoring but struggle with complex integration. Practitioner feedback highlighted team adaptation challenges and reviewer burden. Thesis unchanged: agents mature for narrow use cases but production integration blocked by supervision complexity, security governance, and human organizational readiness rather than capability gaps.
  • 2026-Mar: Production case studies with detailed metrics confirm supervised integration is viable: Microsoft's .NET runtime team achieved 67.9% PR merge rate over 10 months with explicit human oversight and equivalent code quality to human PRs (0.6% revert rate). Adoption scale validated (Copilot 4.7M subscribers, 90% Fortune 100, +75% YoY; 50% faster agent startup shipped in March); architectural patterns for production readiness crystallized (orchestration boundaries, tool gateways, state management, observability instrumentation). Security vulnerabilities in deployed systems documented: 86% of design systems components contained XSS, 15-18% more vulns than human code, driving adoption of tiered guardrail frameworks. Bottleneck analysis (Agoda, Faros AI data across 10K+ developers) confirmed structural shift: individual developer velocity increased but project velocity gains modest (21% more tasks, 98% more PRs, 91% more review time); agents individually productive but collectively increasing governance burden. Hallucination research (172B tokens, 35 models) showed fabrication tripling from 32K to 128K context, directly affecting reliability at production scale. Practitioner consensus sharpened: agentic coding operationally viable for boilerplate/refactoring with proper supervision; production barriers are organizational (review burden, specification clarity, governance architecture) not technical.
  • 2026-Apr: Independent deployments demonstrate multi-agent production integration: Iowa-based developer deployed 6 specialized agents on commercial IoT platform (585 sessions) with documented solutions (shared memory system, file-locking patterns, role boundaries) preventing cross-team conflicts; Ledgerpoint fintech migrated 180K LOC Java→Kotlin in 8 weeks via agentic workflows with 94% first-pass PR approval. Empirical production data validates asymmetric scaling: Fortune 500 financial services org reports 30% PR volume increase but 52% review time increase and 18% incident rise, documenting governance bottleneck. Large-scale empirical analysis (110K open-source PRs from 5 agents) shows agent-generated code exhibits higher long-term churn than human code. Q1 2026 platform analysis confirms ecosystem maturation: 78% of sessions involve multi-file edits, average sessions 4→23 minutes, 47 tool calls per session. Major production failures dominated late April: a Cursor agent deleted a production Railway database via unverified credential assumptions; Amazon's Kiro destroyed AWS infrastructure causing a 13-hour China outage and 6.3M lost retail orders; GitHub processed 275M commits per week (14x YoY) and 17M agent PRs per month before sustaining five major infrastructure outages. Lightrun 2026 survey found 49% of AI-generated code fails in production post-QA with developers spending 38% of their week debugging ($82-103K/month hidden verification cost per 20-person team). Peer-reviewed analysis of 33,000+ agent PRs confirmed recurring undetected vulnerabilities (injection flaws, path traversal) still being merged. NVIDIA, Google, and OpenAI simultaneously shipped enterprise agent platforms—with Google reporting 75% of its own code now AI-generated—signaling structural shift even as failure modes multiplied. Production readiness remains empirically defined by supervision architecture and access control discipline, not raw capability.
  • 2026-May: Large-scale governance research confirms production supervision is mandatory: Chung & Hassan analysis of 29,585 real GitHub PR lifecycles across five production agents shows merge authority remains "almost exclusively human" regardless of tool. Security crisis escalated sharply: Microsoft Defender disclosed CVE-2026-26030/25592 in Semantic Kernel enabling RCE via prompt injection; six coordinated research teams disclosed credential theft across Codex, Claude Code, Copilot, and Vertex AI (78% of enterprises lack PAM for agent credentials); Adversa.AI's TrustFall demonstrated supply-chain injection working identically on all major CLI agents; NVIDIA disclosed the AGENTS.md injection vector for cloned repositories. Enterprise platforms matured: IBM Bob GA deployed with named customers and human review architecture; GitHub (May 8) shipped org-level agent secrets/variables control; OpenAI Codex (May 9) detailed sandboxing and network policy architecture. Analyst synthesis quantified the production gap — 73% of enterprise AI projects never reach production, with governance and orchestration as the decisive blockers — while GitHub's official analysis (May 7) documented 52% increase in review burden from agent-generated code, confirming that raw capability is table-stakes and organizational execution readiness is the binding constraint.
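The file-locking pattern credited in the April entry with preventing cross-agent conflicts is a classic advisory-lock technique: each agent must atomically claim a lock file before touching shared code. A minimal portable sketch follows (lock-file path and class names are hypothetical), using `O_EXCL` so exactly one claimant can win.

```python
import os
import tempfile

class FileLock:
    """Hypothetical advisory lock: an agent claims a lock file before editing
    shared files, so concurrent agents cannot clobber each other's work."""

    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self) -> bool:
        try:
            # O_CREAT | O_EXCL makes creation atomic: if the file already
            # exists, the call fails, so exactly one agent wins the lock.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None

lock_path = os.path.join(tempfile.mkdtemp(), "agent-edit.lock")
a, b = FileLock(lock_path), FileLock(lock_path)
assert a.acquire()       # agent A claims the shared file
assert not b.acquire()   # agent B is refused and must wait or work elsewhere
a.release()
assert b.acquire()       # lock is free again
b.release()
```

Advisory locks only work if every agent goes through them, which is why the deployments above paired the pattern with role boundaries rather than relying on it alone.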

TOOLS