Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement across one or two domains — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE ↔ ESTABLISHED
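The "weighted maturity" dot described above boils down to a weighted mean over the practices in a domain. The index's actual weights and scale are not published, so the following is only an illustrative sketch under assumed conventions: each practice gets a maturity value (0 = bleeding edge, 1 = established) and a weight (for example, evidence volume), and the dot sits at their weighted average.

```python
def weighted_maturity(practices: list[tuple[float, float]]) -> float:
    """Weighted mean of (maturity, weight) pairs for one domain.

    Maturity is on an assumed 0..1 scale (0 = bleeding edge,
    1 = established); weights are arbitrary positive numbers.
    """
    total_weight = sum(w for _, w in practices)
    return sum(m * w for m, w in practices) / total_weight

# Hypothetical domain with three practices of differing maturity and weight.
software_dev = [(0.9, 5.0), (0.2, 3.0), (0.5, 2.0)]
print(round(weighted_maturity(software_dev), 2))  # 0.61
```

A heavily weighted established practice pulls the dot toward the right of the scale even when bleeding-edge practices outnumber it, which matches how a single dot can summarise a mixed domain.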

Agentic coding with full autonomy

BLEEDING EDGE

TRAJECTORY

Stalled

AI agents independently completing development tasks end-to-end with minimal human oversight or intervention. Includes autonomous issue-to-merge workflows and self-directed multi-file changes; distinct from supervised production integration, which retains human review gates.

OVERVIEW

Fully autonomous coding agents -- AI systems that take a task from issue to merged code with minimal human involvement -- remain firmly experimental. The promise is transformative: end-to-end development without review gates or approval workflows. The reality is that reliability constraints keep pushing practitioners back toward human oversight. State-of-the-art agents achieve only 11% success on complex, multi-commit feature work, even as they score well on narrower benchmarks. That gap between demo and production is the defining tension of the practice. Two-thirds of organisations are experimenting with agentic AI, yet only about one in ten has reached production deployment. The pattern actually emerging is not full autonomy but orchestrated autonomy -- developers scoping focused tasks for agents, then validating outputs before integration. Hybrid human-AI teams consistently outperform unsupervised agents, and autonomously generated code carries measurably higher defect rates. The tooling ecosystem is maturing fast, with tier-one IDE vendors now shipping native agent support, but the operational model has converged on bounded delegation rather than hands-off execution. Full autonomy works for narrow, well-defined tasks. For production codebases, human judgment at integration gates remains essential.

CURRENT LANDSCAPE

The vendor ecosystem has converged on agent-native IDEs. GitHub ships Copilot agent mode with Model Context Protocol support and sandboxed draft PRs; Apple integrated autonomous coding into Xcode 26.3 via Claude Agent and Codex; Devin, Cursor, and Claude Code compete as standalone agent platforms. Configuration conventions are solidifying -- an analysis of 2,926 GitHub repositories found AGENTS.md emerging as an interoperable standard -- though advanced features like sub-agents see shallow adoption so far.

Enterprise deployments are real but bounded. Goldman Sachs deployed Devin across 12,000 technologists; Rakuten reported 99.9% accuracy on a 12.5-million-line codebase; Zapier runs 800+ internal agents with 89% developer adoption. These cases share a common trait: heavy human-in-the-loop guardrails. Anthropic's own 2026 trends data confirms the pattern -- engineers use AI in 60% of their work but fully delegate only 0-20% of tasks. A Dynatrace survey of 919 enterprise leaders found 69% of agent decisions still verified by humans.

Empirical research explains why. FeatureBench (ICLR 2026) showed the best agents solve 74% of isolated SWE-bench patches but only 11% of complex multi-file features. Code generated autonomously carries 1.7x more bugs and 75% more logic errors than human-written code, according to a Stack Overflow study of 470 repositories. A Stanford-Carnegie Mellon study found hybrid human-AI teams outperform fully autonomous agents by 68.7%. Practitioners report that unsupervised agents create reviewer burden and erode team trust when they modify code without deep contextual understanding. The result: production deployments overwhelmingly default to orchestrated autonomy with human validation gates, not the end-to-end hands-off model the term implies.

TIER HISTORY

Research: Jan-2025 → Jan-2025
Bleeding Edge: Jan-2025 → present

EVIDENCE (68)

— Product announcement and enterprise case study of IBM Bob—agentic SDLC system—deployed to 80,000+ employees with 45% productivity gain at scale.

— Authoritative reporting on GitHub pausing Copilot sign-ups due to autonomous agentic workflows exceeding monthly compute budgets, with named company and specific economics impact.

— Specific to AI coding agents: 88% of enterprise pilots never reach production. Names major coding agents. Framework for seven non-negotiable enterprise controls.

Rise of the Overnight Agents (Research Papers)

— Data-driven analysis from code review platform showing 27.6% of merged PRs (April 2026) are fully AI-authored, with revert rates and code churn metrics per agent type—unique production evidence of autonomous adoption.

— Gartner hype cycle: 40% adoption vs 40% cancellation. Names successful large-scale deployments (Citi 180k employees, Microsoft, Google). Governance infrastructure framework separating successful from failed deployments.

— Authoritative practitioner analysis citing Andrej Karpathy's explicit deprecation of 'vibe coding' (full autonomy without review) in favor of 'agentic engineering' (orchestrated with spec/test/architecture). Includes METR study showing 37-point productivity swing with proper tooling in supervised model.

— Cites Anthropic's 2026 Agentic Coding Trends Report finding that developers use AI in 60% of work but fully delegate only 0–20% to autonomous agents. Provides key evidence that full autonomy remains limited despite high overall adoption, and explains why (trust gap, comprehension debt).

— Independent testing of 6 agents on 10 production tasks (30K-line Node/React app). Claude Code outperformed flagship IDEs; Devin at $500/mo underperformed. Real-world validation of autonomy vs cost tradeoff.

HISTORY

  • 2025-Q1: Early evidence of autonomous coding agents completing real development tasks (Devin contributing to open-source projects with 8 PRs), but independent testing exposed significant reliability gaps (70% failure rate on real-world tasks). Market sentiment showed widespread exploration (90% of developers) with low trust (3% confidence in code quality). Enterprise experts questioned whether current "agents" represent true autonomy or merely sophisticated tool-calling.

  • 2025-Q2: Major vendors (GitHub, OpenAI, Cognition) launched autonomous agent products with GA releases and broad rollouts; GitHub agent mode deployed with MCP support for tool extensibility. Independent evaluation of 15 agents identified strong performers (24/25 points). Practitioner case studies documented success (Python refactoring in 15 min for $2.25, 20% perf gain). However, production readiness remained constrained: 59% of engineering leaders report AI code introduces errors at least half the time; 67% spend more debugging time on AI code; intensive agent use exhausts premium model quotas in days. Cost barriers and error rates still block sustained full autonomy in production workflows.

  • 2025-Q3: Rigorous empirical evidence tempered hype: randomized trials showed experienced developers slowed by 19% when using AI agents, and Fortune 500 deployment (Goldman Sachs) signaled enterprise adoption but also revealed market divergence—only 31% of 49,000+ surveyed developers use full agents despite 80% using AI tools, with trust in accuracy at 29%. Academic research documented persistent failure modes (hallucinations, overeagerness, inadequate human communication), solidifying consensus that full autonomy is viable for narrow tasks but supervised autonomy (draft-validate-integrate) is the emerging production practice pattern.

  • 2025-Q4: Platform vendors accelerated feature deployment (GitHub Agent Skills for customized autonomy; 50+ Copilot updates including agent enhancements across JetBrains, Eclipse, Xcode). Market consolidation showed economic viability (Devin pricing dropped to $20/month; Goldman Sachs piloting at 12,000-developer scale). However, rigorous comparative research proved decisive: the Stanford-Carnegie Mellon study showed hybrid human-AI teams outperformed fully autonomous agents by 68.7%, fundamentally undermining the full-autonomy thesis. Critical analyses documented error compounding (95% per-step accuracy → 36% over 20 steps), architectural weakness, and persistent production-readiness gaps. Independent evaluation of Devin showed 13.86% SWE-bench success but only 15% real-world task completion. Industry analyst consensus crystallized: the paradigm had shifted from reactive assistants to autonomous IDEs, but the deployment model was converging on bounded autonomy—developers delegating focused workflows to agents rather than end-to-end autonomy. Full autonomy remains viable only for narrow, well-scoped tasks; production codebases still require human validation gates.

  • 2026-Jan: Market matured past early hype into careful production deployment. Dynatrace survey of 919 leaders showed 50% of projects in POC/pilot, only 13% using fully autonomous agents, with 69% of decisions verified by humans—an inflection point constrained by reliability and governance gates. Anthropic's 2026 trends report documented the adoption reality: engineers use AI in 60% of work but fully delegate only 0-20% of tasks; case studies from Rakuten, TELUS, and Zapier showed orchestrated adoption (800+ internal agents at Zapier) but with human-in-loop models. The experiment-to-production gap widened: Byteiota analysis showed 66% experimenting but only 11% in production, with Gartner predicting 40% project cancellations by 2027. Quality concerns deepened: Stack Overflow research of 470 repos found AI-created code has 1.7x more bugs and 75% more logic errors, driving continued emphasis on validation. Platform vendors continued maturing agent capabilities (GitHub Copilot CLI with parallel agents), signaling ecosystem expansion. Consensus solidified: full autonomy is a high-risk, narrow-use model; the production practice is bounded, orchestrated autonomy with persistent human validation gates.

  • 2026-Feb: Vendor platform maturation accelerated with Apple Xcode 26.3 adding integrated autonomous coding support (Claude Agent, Codex), signaling tier-1 IDE convergence toward agentic tools. Rigorous empirical research sharpened understanding of real-world limits: FeatureBench (ICLR 2026) showed state-of-the-art agents achieving only 11% success on complex multi-commit feature development compared to 74.4% on isolated SWE-bench tasks, quantifying the gap between benchmarks and production complexity. Configuration analysis of 2,926 repos revealed AGENTS.md emerging as an interoperable standard but advanced autonomy features (Skills, Subagents) seeing shallow adoption—practitioners defaulting to minimal configuration. Industry deployment reports cited concrete ROI: Telus saving 40 minutes per AI interaction across 57,000 employees, Suzano achieving 95% query-time reduction, Danfoss cutting response times from 42 hours to real-time. However, organizational adoption challenges surfaced: practitioner reports of excessive reviewer burden and damaged trust when agents modify unfamiliar code without deep codebase understanding. By month-end, the narrative remained consistent: full autonomy is technically viable for narrow, well-scoped features but production deployment mandates strong organizational practices (precise specifications, configuration standards, human validation gates).

  • 2026-Mar: Platform vendor convergence accelerated on autonomous agent capabilities while governance gaps became acute. GitHub Copilot coding agent rolled out as "fully autonomous background worker" with 50% faster startup optimization enabling iterative autonomous refinement; Devin released "Devin Manages Devins" multi-agent orchestration enabling autonomous coordination across parallel agents without human intervention. Technical reverse-engineering (Disassembling AI Agents) revealed Copilot's explicit autonomy mandate in system prompts: agents explicitly configured to "implement the change" rather than propose it. However, production incident evidence surfaced critical risks: Amazon's Kiro agent outage caused autonomous deletion of production environment (over-permissioning flaw); OpenClaw autonomously deleted director's inbox. Gartner prediction reinforced governance constraint: 40% of agent projects will be canceled by 2027 due to autonomous failures and unclear ROI. Anthropic's 2026 trends report confirmed persistent reality: engineers use AI in 60% of work but can fully delegate only 0-20% of tasks, with productivity gains (30% faster shipping, 4-8 month projects compressed to 2 weeks) offset by need for strong human-in-loop guardrails; Fortune reports that reliability has improved at half the rate of capability growth. Real-world deployment data (6-month production Wiz agent) showed material outcomes (15-20h/week savings) but significant operational burden (30% browser failures, 25% rate limits, 20% state corruption). By month-end consensus remained: vendor platforms enable technical autonomy, but governance frameworks and organizational practices lag behind tooling maturity. Full autonomy is achievable for well-scoped tasks with proper guardrails, but broader production deployment requires clarity on accountability, boundaries, and human review gates.

  • 2026-Apr (early): Production inflection point crossed but with explicit industry shift away from full autonomy. Zylos Research marked Q1 2026 as the inflection point where autonomous agents entered mainstream production tooling—Claude Code sessions grew to 78% multi-file edits (up from 34%) with 47 tool calls and 23-minute average duration—but critically, telemetry shows developers maintain 80-100% oversight on all delegated tasks. Empirical comparative testing (Ethan Cole, 3 months, 12 tasks) proved no tool achieves unconstrained full autonomy; Devin scored lowest on multi-file debugging (broke existing features). Production scale achieved at Stripe (1,000+ autonomously merged PRs/week) requires rich "harness engineering"—deterministic verification loops, not autonomous validation. MSR 2026 study of 110,000 real open-source agent PRs from Claude Code, Copilot, Devin, Jules, and Codex documented a quality concern: agent-contributed code exhibits significantly higher churn and maintenance burden over time. Industry explicitly rejected full autonomy: 92% of developers use AI tools but only 33% trust accuracy; 45% of pure AI-generated code contains security flaws or architectural debt; productive teams shifted to "Vibe & Verify" (generate + human verification) rather than hands-off execution. Governance crisis emerged: 88% of organizations reported confirmed/suspected agent incidents; only 14.4% had full security approval while 81% deployed to testing/production (6x mismatch); documented incidents: cost explosions ($847K runaway costs), database deletions, supply chain attacks (postmark-mcp affecting ~300 orgs). Reliability fundamentals unresolved: mathematical analysis shows 85% per-step reliability yields 20% end-to-end success on 10-step workflows; real incidents documented (Google Antigravity wiped a user's D: drive). By month-end, consensus crystallized: autonomous agents are production-viable only for narrow, tightly scoped tasks with extensive scaffolding and guardrails; the industry has actively moved away from the full-autonomy thesis toward bounded, orchestrated autonomy with human-in-the-loop governance.

  • 2026-Apr (late): Platform ecosystem matured toward hybrid local-remote autonomous execution. Windsurf 2.0 introduced Devin Cloud integration (cloud-hosted autonomous agents callable from local IDE), and GitHub rolled out inline agent mode with global auto-approve for JetBrains (enabling unattended agent execution in editor context). However, infrastructure challenges surfaced: GitHub VP disclosed agentic workflows consuming far more compute than planned, forcing temporary Copilot signup pause—evidence of real-world autonomous deployment at scale. Security vulnerabilities emerged: Johns Hopkins peer-reviewed research documented prompt injection flaws in Claude Code, Gemini, and Copilot agents integrated with GitHub Actions, revealing a real attack surface. Cognition engineers (at Google Cloud Next) disclosed critical production incidents: multi-agent sessions interfering via shared build caches, agents inheriting developer credentials without permission scoping. Solutions emerging include Firecracker VM isolation, identity-aware access control, and task-scoped permissions. Named enterprise deployment continued: Kikagaku 10-month Devin integration documented 209 sessions across production design-to-deployment workflows. Anthropic's strategic 2026 report reaffirmed the dominant pattern: 60% AI integration but only 0–20% full delegation, with "active human participation" required. Mathematical analysis solidified reliability constraints: 95% per-step success yields only 60% at 10 steps and roughly 0.6% at 100 steps—compound failure modes (context drift, silent errors, specification drift) prevent long-horizon autonomous task completion. Industry consensus stable: full autonomy remains technically viable for narrow, well-scoped tasks but requires extensive platform engineering, security scaffolding, and governance frameworks. The production deployment reality is bounded, orchestrated autonomy with human oversight at integration gates, not hands-off execution.

  • 2026-May: Full-autonomy deployment reached measurable scale but governance constraints hardened. Greptile's analysis of 650K+ merged PRs documented 27.6% fully AI-authored by April 2026 (32x growth from February 2025); IBM Bob deployed to 80,000+ employees with 45% productivity gain. Yet Northflank found 88% of enterprise autonomous coding pilots never reach production; GitHub paused Copilot sign-ups after agentic workflows blew past monthly compute budgets; Andrej Karpathy publicly deprecated "vibe coding" in favor of supervised agentic engineering; and ICLR 2026 research showed enhanced reasoning amplifies tool hallucination rather than suppressing it — a foundational reliability ceiling for autonomous agents. Only 20% of developers fully delegate to agents despite 60% tool adoption, with trust collapse and governance gaps (88% of organisations with confirmed incidents) remaining the binding constraints.
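The compounding-reliability arithmetic cited repeatedly in the history above follows from a simple model: if each step of an agent workflow succeeds independently with probability p, end-to-end success over n steps is p^n. A minimal check of the figures quoted in the entries (95% over 10 and 20 steps, 85% over 10 steps):

```python
def end_to_end_success(p: float, steps: int) -> float:
    """End-to-end success probability for a workflow of independent steps,
    each succeeding with probability p (the compounding model cited above)."""
    return p ** steps

# Figures quoted in the history entries:
print(f"{end_to_end_success(0.95, 10):.0%}")  # ~60% at 10 steps
print(f"{end_to_end_success(0.95, 20):.0%}")  # ~36% at 20 steps
print(f"{end_to_end_success(0.85, 10):.0%}")  # ~20% at 10 steps
```

The independence assumption is the model's, not a measured property of real agents, but it explains why small per-step reliability gaps dominate long-horizon autonomy.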

TOOLS