Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE ←→ ESTABLISHED

Agentic coding for exploration & prototyping

LEADING EDGE

TRAJECTORY

Advancing

AI agents that autonomously write, run, and iterate on code for proofs of concept and exploratory development. Includes tools like Claude Code, Cursor agent mode, and Devin used for throwaway prototypes and spikes; distinct from production agentic coding, which requires CI/CD integration and review workflows.

OVERVIEW

Agentic coding for exploration and prototyping has crossed from experimental novelty into measurable deployment at forward-leaning organisations, but most teams have not yet started. The practice uses AI agents to autonomously write, run, and iterate on throwaway code — proofs of concept, architecture spikes, and rapid prototypes where the learning matters more than the output. Unlike production agentic coding, which demands CI/CD integration and formal review workflows, exploration-mode agents operate in tight human-agent loops with continuous inspection and course correction. That constraint is the key insight. Named deployments at organisations like Rakuten, TELUS, and CODERCOPS show substantial productivity gains when agents work within guardrails: small tasks, clear scope, human oversight. Yet independent research consistently documents that agents complete only a fraction of open-ended tasks autonomously, and AI-generated code carries measurably higher bug and security-defect rates than human-written code. The defining tension is not whether the tools work, but how narrow the envelope of reliable use remains.
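
That loop pattern is simple enough to sketch. Below is a minimal Python illustration; propose_change is a hypothetical stand-in for whichever agent client a team uses (Claude Code, Cursor, or otherwise), so the shape of the loop (small task, inspect, apply or correct) is the point, not any particular API.

    import subprocess

    def propose_change(task: str, feedback: str | None) -> str:
        # Hypothetical agent call: returns a small unified-diff patch for `task`.
        raise NotImplementedError("wire up your agent client here")

    def checks_pass() -> bool:
        # Fast project checks; even throwaway spikes need an objective signal.
        return subprocess.run(["pytest", "-q", "--maxfail=1"]).returncode == 0

    def explore(task: str, max_rounds: int = 5) -> None:
        feedback = None
        for _ in range(max_rounds):
            patch = propose_change(task, feedback)       # agent proposes
            print(patch)                                 # human inspects
            verdict = input("apply / correct / stop? ").strip()
            if verdict == "stop":
                return
            if verdict == "apply":
                subprocess.run(["git", "apply"], input=patch, text=True)
                if checks_pass():
                    return                               # small win banked
                feedback = "tests failed after applying the patch"
            else:
                feedback = input("course correction: ")  # human steers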

CURRENT LANDSCAPE

Population-level adoption is confirmed: 55% of professional developers now regularly use agentic coding tools, with Claude Code achieving #1 adoption status in just eight months (May–December 2025). Named organisational deployments show measurable gains within guardrails: TELUS deployed 13,000+ custom AI solutions, saving 500,000+ hours across 57,000 employees and shipping code 30% faster; Rakuten reduced feature delivery time 79% (24 days → 5 days); CODERCOPS reported 45% sprint velocity increase and 56% faster bug resolution. A Norwegian research institute (SINTEF) completed multi-file refactoring of a 500K-LOC production reservoir simulator with 99.9% accuracy—demonstrating that well-scoped exploration tasks can be automated at scale. The tooling ecosystem has hardened: Claude Code's context management and file-system navigation now outperform semantic search on long-context tasks (17.3% improvement over SOTA on reasoning/RAG benchmarks). Multi-agent orchestration patterns emerged as standard practice (LangGraph, AutoGen), with specialised agents handling implementation, validation, architecture, and documentation in coordinated workflows.
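
The orchestration pattern described above can be made concrete without committing to a framework. A rough Python sketch follows; the role list and the call_agent function are illustrative assumptions, not the actual APIs of LangGraph or AutoGen, each of which has its own abstractions.

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        spec: str                                    # what we are exploring
        artifacts: dict = field(default_factory=dict)

    # Specialised roles run in a fixed order; each sees prior output.
    ROLES = ["architect", "implementer", "validator", "documenter"]

    def call_agent(role: str, task: Task) -> str:
        # Hypothetical single-agent call scoped to one role and one task.
        raise NotImplementedError

    def run_pipeline(task: Task) -> Task:
        for role in ROLES:
            # Later roles are grounded in earlier artifacts, so validation
            # and documentation reflect the code actually produced.
            task.artifacts[role] = call_agent(role, task)
        return task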

Yet independent evidence documents persistent reliability gaps. A Princeton study shows reliability improved at only half the rate of capability growth—critical gaps remain in calibration (52% on newer models) and safety-critical decisions. Real-world testing on production codebases shows task-dependent success: well-defined bug fixes achieve 78%, test generation 82%, but refactoring drops to 45% and architecture to 15%. Deloitte projects 40% of agentic AI projects will fail by 2027 due to cost escalation and governance gaps; empirical analysis of 847 deployments found 76% fail in production. The defining pattern: agents excel at bounded, well-specified exploration tasks (variant testing, rapid prototyping, infrastructure setup) but struggle with architectural judgment and ambiguous requirements. Quality degradation in extended sessions (context rot) remains a structural problem, though teams using atomic-task decomposition and fresh-context patterns report sustained output quality. Organisational readiness—not tool capability—remains the binding constraint; fewer than 20% of deploying organisations have redesigned workflows around AI autonomy.
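
The atomic-task and fresh-context countermeasures are mechanical rather than clever, which is why disciplined teams can apply them consistently. A minimal sketch, assuming a hypothetical AgentSession interface standing in for whatever SDK is in use:

    class AgentSession:
        # Hypothetical minimal session interface; real SDKs differ.
        def run(self, task: str) -> str:
            raise NotImplementedError
        def close(self) -> None:
            pass

    def run_atomic(tasks: list[str]) -> list[str]:
        results = []
        for task in tasks:
            session = AgentSession()      # one fresh context per task: no rot
            try:
                results.append(session.run(task))
            finally:
                session.close()           # nothing leaks into the next task
        return results

    # Decomposition happens before any agent runs: one file, one behaviour,
    # one acceptance check per task.
    tasks = [
        "add input validation to parse_config(); update its unit tests",
        "extract retry logic from fetch_data() into a helper with backoff",
    ]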

TIER HISTORY

Research: Mar-2024 → Jul-2024
Bleeding Edge: Jul-2024 → Jan-2026
Leading Edge: Jan-2026 → present

EVIDENCE (95)

— Claude Platform on AWS now GA with managed agents, code execution, MCP connector—first-party infrastructure for enterprise agentic exploration at scale.

— Detailed postmortem audit of 6,852 sessions documenting measurable quality degradation (March–April 2026) and recovery; provides evidence of adoption scale and reliability gaps requiring governance.

— Critical assessment with named sources (AMD Sr Director, TrustedSec CEO): specific failure metrics (47% quality drop, 52% vulnerability rate) and failure modes (laziness, incomplete reasoning).

— Large-scale adoption (84% developer use, 41% of code authored by AI) with wide variance: 3.6 hrs/week median savings but 41% see 'little effect' and 19% report slowdown—signals practice maturity with limits.

— Systematic preparation methodology for agentic coding: contextual grounding, collaborative specification, task decomposition applied to rapid parallel exploration in hackathon environment.

— Investment analyst assessment: Claude Code $1B+ ARR, GitHub Copilot 20M users, 55.8% faster completion, quality trade-offs—signals market maturity at significant scale.

— Doubled Claude Code rate limits and removed peak-hour throttling (May 2026), backed by 300+ MW SpaceX compute: removes mid-session interruption constraint that blocked multi-file exploration tasks.

— Named enterprise (Amazon) formally deploys Claude Code company-wide May 2026; signals exploration-stage maturity at FAANG scale with both external tools and internal Kiro infrastructure.

HISTORY

  • 2024-Q1: Research papers on agentic coding (code aesthetics evaluation, code-LLM agency theory) published alongside product launches (Devin). Tools demonstrated 13-15% autonomous code task completion. Practitioner adoption accelerated experimentally; skepticism about production readiness grew in parallel.

  • 2024-Q2: Market reality correction: open-source tools (SWE-Agent, AutoCodeRover) matched or exceeded Devin's benchmarks; critical analysis exposed marketing overstatement. Academic rigor increased with SWE-Compass and other evaluation frameworks revealing persistent capability gaps. Developer surveys showed 76% adoption but only 43% accuracy trust. Semi-agentic architectures (AppMap Navie, Claude Code) emerged as practical exploration-stage solutions with cost advantages.

  • 2024-Q3: Tool ecosystem matured: Devin reported 80% speed improvements, Cursor gained community analytics tooling, Claude Artifacts spawned deployment facilitators. Quality metrics hardened—ChatGPT 65.2% accuracy, Copilot 46.3%, CodeWhisperer 31.1%; surveys documented 38% developer accuracy concerns and 50%+ company security issues. Microsoft ISE team warned agentic frameworks were brittle and unsuitable for production. Consensus shifted: agentic coding viable for exploration when constrained by human oversight and guardrails; risk management became the defining concern.

  • 2024-Q4: Product ecosystem matured: Claude 3.5 became available to GitHub Copilot's 100M+ developers; SailPoint deployed Claude for TypeScript generation on Amazon Bedrock; Devin reached GA at $500/month. Reality-check evidence consolidated: DA-Code showed 30.5% accuracy on data science tasks; Google engineer Addy Osmani documented the "70% problem" (70% quick, 30% wall); infrastructure assessments detailed security and deployment risks. Practitioner consensus hardened: agentic coding effective for exploration within guardrails (tight loops, human oversight, clear scope), but brittleness and accuracy gaps remained unresolved. The practice matured from research to pragmatism, not to a solved problem.

  • 2025-Q1: Tool maturity concerns emerged: Claude Code auto-update bug bricked systems; independent evaluations (Answer.AI) showed Devin at 15% success rate (3/20 tasks); technical analysis identified core failure patterns (spatial mismatch, temporal forgetfulness, code duplication). Yet constrained workflows succeeded: Thoughtworks saved 97% effort with Claude Code when adding language support (then faced reliability issues); TaskFlow SaaS built production MVP in 3 weeks with 95% test coverage. Practitioner guidance emphasized codebase adaptation and human oversight as prerequisites. The paradox hardened: agentic coding produced real results within guardrails but remained brittle at scale.

  • 2025-Q2: Practitioner adoption accelerated; Devin 2.0 launched with pricing drop (to $20/month) addressing cost barriers. Comparative testing (seven agents on prototyping task) ranked Claude Code first but showed all tools required trial-and-error iteration. Tool stability issues persisted: Claude Code exhibited instruction-following drift and session-destroying compaction bugs on Windows. Critical assessment documented why autonomy remained unreachable: debugging failures, hallucinations, multi-file task limitations. Market inflection: messaging shifted from "autonomous engineers" to rapid iteration within guardrails. Agentic exploration now positioned as time-saving accelerator proven to work under tight human supervision, but tool maturity gaps remained blockers for enterprise adoption.

  • 2025-Q3: Enterprise adoption reached critical mass (82% across 400+ companies, up from 50% in Dec 2024). Named deployments with measurable ROI emerged (TELUS 500k+ hours saved, Brex 75% auto-processing, CRED 2× velocity). Yet independent RCT (METR) contradicted productivity claims, finding experienced developers 19% slower with AI tools despite subjective belief in gains. Tool evolution continued (Cursor 1.4 improvements, Devin Sonnet rebuild with 2× speed, 12% better evals). Adoption–trust gap widened: only 33% trusted accuracy; 66% frustrated by "almost right" code. Landscape shifted from capability debate to ROI uncertainty and operational reliability concerns.

  • 2025-Q4: Platform expansion and consolidation: Claude Code reached web availability (October), enabling exploration from any browser; Devin reached 25% of Cognition's internal PRs (December). Yet a credibility crisis emerged: a state-sponsored threat actor weaponized Claude Code in a cyberattack campaign that was 80%+ automated, spanning 30+ organizations (November, Zenity disclosure). The adoption-readiness gap hardened: 57% of companies in production but only 6% with the required infrastructure; 60% of multi-agent systems failing. Practitioner sentiment bifurcated into four camps (vibecoders, sweet spot, dubious, artisans) based on an adoption–skepticism matrix. Consensus shifted: exploration-mode agentic coding works under constraints (small tasks, tight loops, human oversight), but security governance and readiness gaps remain critical blockers.

  • 2026-Jan: Population-level adoption reached empirical confirmation: 15.85%-22.60% of 129k GitHub projects employ agentic tools (arXiv); Microsoft reports 18.57% adoption internally. Tool ecosystem matured with infrastructure improvements: Claude Code introduced MCP Tool Search, reducing context from 134k to 5k tokens (a ~96% reduction) and improving accuracy from 49% to 74%; Amazon released the Kiro agentic IDE promoting spec-driven development over vibe coding. Enterprise deployment accelerated: Infosys rolled out Devin company-wide for COBOL migration and modernization (January), reporting material productivity gains. Credibility tension hardened: parallel GitHub analysis of 470 repos documented AI code creating 1.7x more bugs than human-written code, with 75% higher logic-error rates and 1.5-2x the security issues, undercutting headline productivity claims. Practitioner exploration and prototyping workflows are mainstream, but quality concerns prevent broader production adoption.

  • 2026-Feb: Real-world deployment metrics emerged validating exploration-stage ROI: CODERCOPS reported 90-day production gains (45% sprint velocity, 50% faster PR merges, 56% faster bug resolution), while Rakuten cut time-to-market 79% and TELUS shipped 30% faster via Claude Code. Platform evolution continued: Devin 2.2 released with 3x faster startup and desktop testing support. Yet the credibility crisis deepened: Carnegie Mellon research showed agents complete only 8.6-30% of workplace tasks; Gartner predicted 40% of agentic AI projects will be canceled by 2027; analysis of 847 deployments showed 76% fail in production. The window captured the field's fundamental tension: named orgs achieved measurable exploration gains within guardrails, while independent research documented persistent autonomy failures and governance gaps. Practitioner sentiment remained bifurcated: exploration-stage agentic coding proven viable for constrained tasks but unreliable for autonomous operation.

  • 2026-Mar: Population-level adoption confirmed at 55% of professional engineers regularly using agents, with Claude Code reaching #1 status in just 8 months; task-specific success rates documented: bug fixes 78%, test writing 82%, refactoring 45%, architecture 15%. Named deployments validated the constrained-use model: SINTEF completed autonomous multi-file refactoring of a 500K-LOC reservoir simulator with 99.9% accuracy; TELUS and Rakuten case studies (Anthropic trends report) reconfirmed 79% delivery-time reduction and 500K+ hours saved under human oversight; Brookings researchers used Claude Code to build full R packages and 20-page analyses in hours. A Princeton study found reliability improvements lagging capability growth at half the rate, with critical calibration gaps (52%); context-rot patterns were quantified with architectural countermeasures emerging (fresh subagents, atomic tasks, token budgeting). Deloitte's 2025 reality check confirmed 11% production deployment and projected 40% project failure by 2027. The defining tension held: exploration-mode agents deliver real productivity within guardrails, but autonomy gaps and reliability lags prevent broader deployment.

  • 2026-Apr (Early): Market inflection confirmed: Claude Code achieved a 46% 'most loved' ranking (vs. Copilot 9%, Cursor 19%), writing 4% of all GitHub commits (135K/day across 1M+ repos, 8% week-over-week growth). JetBrains survey shows 41% market share vs. Copilot's 38%, with 91% satisfaction and 54 NPS. Yet critical reliability degradation was documented: an AMD Sr Director's forensic analysis of 6,852 sessions revealed thinking depth collapsed 67-75% (late February–March), reads per edit dropped 70%, full-file rewrites doubled, and monthly Bedrock costs exploded 122x ($345→$42K)—traced to an Opus 4.6 adaptive-thinking misconfiguration. Peer-reviewed research on 110K OSS PRs confirms agent activity scaling but documents higher code churn vs. human-authored code. Leading-edge capability (Ultraplan cloud planning, Monitor tool for reactive agents) shipped alongside hidden reliability regression, exemplifying the practice's defining tension: exploration-mode agentic coding delivers measurable productivity gains within guardrails, but tool maturity gaps (thinking depth, cost control, code quality) create operational hazards that constrain adoption to disciplined teams with oversight. New accessibility signal: domain experts (computational historians) successfully prototype without coding, expanding the addressable population beyond software engineers.

  • 2026-Apr (Mid-Late): Competitive capability inflection and reliability crisis converged. OpenAI GPT-5.5 released April 23 (82.7% Terminal-Bench SOTA, 88.7% SWE-bench, 20-hour autonomous engineering runs), pushing agentic coding from research toward a production default—yet the same window revealed critical tool-maturity gaps. Claude Opus 4.7 released April 16 with major benchmarks (SWE-Bench Pro 64.3%, CursorBench 70%), a new xhigh reasoning-effort tier, and task budgets for cost management. Yet an April 23 postmortem documented that three production bugs had degraded Claude Code quality March 4–April 20: the reasoning-effort default flipped (34 days undetected), a cache bug caused context loss and repetition, and a verbosity constraint cut coding-eval scores by 3%; user costs spiked $345→$42K/month. A real-world empirical study (MSR 2026) on 11,771 PRs found Claude Opus 4.7 leads at 87.6% SWE-Bench but achieves only 24% autonomous completion on complex tasks (70-90% failure as complexity rises), establishing a critical gap between benchmark performance and real-world autonomy. The population-level adoption signal solidified: 275M weekly AI-authored commits on GitHub (the agentic baseline moved past novelty into infrastructure), with the bottleneck shifting from agent output velocity to human review and orchestration pipeline capacity. Anthropic's 2026 industry report documented TELUS 500K+ hour savings and Augment Code's 4-to-2-week project compression, yet showed developers use AI in 60% of work but fully delegate only 0-20% (a collaborative model, not an autonomous one). The period crystallized the practice's maturity: capability crossed into production-ready (GPT-5.5 architectural reasoning, Opus 4.7 reasoning control), deployment reached infrastructure scale (275M commits/week), yet reliability gaps (tool bugs, the benchmark-reality gap, limited autonomy on complex tasks) remain binding constraints on adoption beyond disciplined teams with strong oversight cultures.

  • 2026-May: Infrastructure expansion and enterprise normalization continued. Anthropic doubled Claude Code rate limits and removed peak-hour throttling (May 6), backed by a 300+ MW SpaceX compute deal; Claude Platform on AWS reached GA (May 11) with managed agents, code execution, and MCP integration. Amazon formally deployed Claude Code company-wide (May 4) and investment analysts valued it at $1B+ ARR alongside GitHub Copilot at 20M users. Adoption paradoxes deepened: HCL GUVI documented 84% developer adoption (41% of code AI-authored) but with median savings of 3.6 hrs/week and 19% of developers reporting slowdowns; Alibaba SWE-CI found 75%+ of agents showing accelerating regression despite per-task success. An Anthropic postmortem of 6,852 sessions documented quality degradation in March–April 2026 (partially recovered via versioning and task budgets), and a "mise en place" methodology—deliberate context engineering before agent runs—emerged from research as the key preparation pattern for reliable exploration workflows (see the sketch below).
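
As a sketch of that "mise en place" pattern: assemble the agent's grounding material deliberately before the run, rather than letting it forage mid-session. Everything here (file paths, scope, acceptance criteria) is an illustrative assumption, not a prescribed layout.

    from pathlib import Path

    def mise_en_place(task: str, repo: Path) -> str:
        # Build a self-contained briefing the agent reads before touching code.
        sections = {
            "TASK": task,
            "CONVENTIONS": (repo / "CONTRIBUTING.md").read_text(),
            # Scope is chosen by the human up front, not discovered mid-run.
            "RELEVANT CODE": "\n\n".join(
                p.read_text() for p in sorted(repo.glob("src/config/*.py"))
            ),
            "ACCEPTANCE": "pytest tests/test_config.py must pass; no new deps",
        }
        return "\n\n".join(f"## {name}\n{body}" for name, body in sections.items())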