Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain, on a scale running from BLEEDING EDGE to ESTABLISHED.

Multi-agent development pipelines

BLEEDING EDGE

TRAJECTORY

Stalled

Multiple AI agents collaborating across development tasks such as planning, coding, reviewing, and testing in coordinated workflows. Includes orchestrated agent teams with specialised roles; distinct from single-agent agentic coding, which uses one agent across the lifecycle.

OVERVIEW

Multi-agent development pipelines coordinate multiple specialised AI agents (planner, coder, reviewer, tester) across software engineering workflows, distributing tasks that single-agent systems handle monolithically. The premise is compelling: decompose complex development work the way human teams do, with role-specific agents handing off artifacts through an orchestrated pipeline.

By May 2026, infrastructure has consolidated around five standardised orchestration patterns: supervisor-workers (the most common), pipeline (sequential), fan-out (parallel), hierarchical, and peer-to-peer. The framework ecosystem has stabilized: LangGraph leads with 38% of production deployments (vs. 28% custom orchestration, 12% CrewAI, 9% AutoGen), having overtaken LangChain by late 2025 as the preferred choice for new production systems. All major cloud platforms have shipped managed multi-agent orchestration (Azure AI Foundry, AWS Bedrock AgentCore, Google Vertex AI Agent Builder, Anthropic Claude Managed Agents public beta May 2026). Deployments demonstrating measurable value exist at organizational scale: Stripe Minions (1000+ unattended PRs/week), Google Agent Smith (25%+ of production code), OpenAI Harness (~1M agent-written LOC), and Klarna's LangGraph system (2.3M conversations/month, $60M savings).

Yet enterprise adoption remains bifurcated: 83% of enterprises funded agentic projects but only 41% reached production (a 42-point gap, larger than any enterprise software adoption curve on record). The core bottleneck is not tooling but orchestration architecture and governance: coordination overhead degrades sequential reasoning by 39-70% (DeepMind/MIT 2026); state schema evolution breaks checkpointing (12 days of silent degradation observed); cascading failures spread a single false claim through all agents within three rounds (genealogy graph mitigation raises defense only from 32% to 89%); and error attribution across agent boundaries is 14.2% accurate at best. Verified production failures include silent context poisoning (4k tokens growing to 90k, with the failure invisible until deployed), cost explosions (multi-agent at 5-20× single-agent cost for only a 2.1pp accuracy gain), and governance gaps (58% of CTOs cite governance as the #1 adoption blocker).

Tightly scoped, disciplined deployments with explicit orchestration topology, typed inter-agent schemas, and human-in-the-loop checkpoints deliver measurable value; general-purpose enterprise-scale adoption remains constrained by unresolved architectural brittleness.

CURRENT LANDSCAPE

Production deployments span both organizational-scale and process-scale implementations. At organizational scale: Stripe Minions produce 1000+ unattended PRs per week, Google Agent Smith handles >25% of new production code, and OpenAI Harness shipped ~1M lines of agent-written code with zero manual writing. At process scale: 1inch's ticket-to-PR CI pipeline runs an implement stage, seven parallel review agents, and a synthesizer autonomously, subject to human approval; Klarna's LangGraph system processes 2.3M conversations/month, achieving $60M in savings; Deriv operates 50+ agents via a registry-based Operations Center architecture; and ZaloCRM completed a payment module refactor (estimated at 5 days) in 1.5 days with 6 orchestrated Claude Code agents. These successes share common patterns: tightly scoped domain definitions, explicit orchestration topology (supervisor+workers most common), typed inter-agent schemas that prevent silent format degradation, human-in-the-loop approval gates, and disciplined tool management via MCP (97M downloads/month, 9,400+ servers, +58% QoQ growth). Framework selection now reflects production requirements: LangGraph is chosen, despite competition, for native token streaming and custom checkpointing; CrewAI excels at rapid prototyping, but its token overhead ($4.3k/month cost gap at 10k executions) limits production deployments; and vanilla SDK use gains traction where framework overhead is unjustified (Octomind ripped out LangChain after 6 months and shipped faster with the direct Anthropic SDK).
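The "typed inter-agent schemas" and "human-in-the-loop approval gates" named above can be illustrated with a minimal sketch. It assumes a hypothetical `ReviewResult` message passed from a reviewer agent to a synthesizer; the class, field, and function names are invented for illustration and do not come from any of the deployments cited.

```python
from dataclasses import dataclass, fields


class ContractError(ValueError):
    """Raised when an inter-agent message violates its declared schema."""


@dataclass(frozen=True)
class ReviewResult:
    # Hypothetical message a reviewer agent hands to a synthesizer agent.
    agent: str
    approved: bool
    comments: list


def parse_message(raw: dict, schema: type):
    """Validate a raw dict against a dataclass schema before passing it on.

    Rejecting malformed messages at the boundary is what keeps a format
    drift from degrading downstream output silently.
    """
    expected = {f.name: f.type for f in fields(schema)}
    missing = sorted(expected.keys() - raw.keys())
    extra = sorted(raw.keys() - expected.keys())
    if missing or extra:
        raise ContractError(f"missing={missing} extra={extra}")
    for name, typ in expected.items():
        if not isinstance(raw[name], typ):
            raise ContractError(
                f"{name}: expected {typ.__name__}, got {type(raw[name]).__name__}"
            )
    return schema(**raw)


def needs_human_approval(results: list) -> bool:
    """Illustrative gate: escalate to a human unless every reviewer approved."""
    return not all(r.approved for r in results)


good = parse_message(
    {"agent": "reviewer-1", "approved": True, "comments": []}, ReviewResult
)
bad = parse_message(
    {"agent": "reviewer-2", "approved": False, "comments": ["missing test"]},
    ReviewResult,
)
gate_open = needs_human_approval([good, bad])
```

The point of the design is that a message with a missing field or a wrong type fails loudly at the hand-off rather than flowing on to the next agent.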

Failure modes are now well-documented and structural. Production case studies from May 2026 document five critical issues in LangGraph deployments: state schema changes break checkpoint deserialization mid-flight (17 workflows stuck; 12-day silent degradation observed), checkpointing costs balloon (10GB in 3 weeks), the supervisor pattern burns 40% of the token budget on routing alone, missing message contracts cause silent output format degradation, and convergence failures create cost overruns ($2.16 actual vs $0.72 planned). May 2026 research shows in-context prompting significantly outperforms LangGraph orchestration on procedural tasks (failure rates of 24% for orchestration vs 11.5% for in-context on travel support; 17% vs 5% on insurance). The UC Berkeley MAST study, analyzing 1,600+ production traces, documents 41–86.7% failure rates across seven frameworks; cascade failures (for example, an inventory agent hallucinating a SKU that then appears in purchase orders, shipping manifests, and customer pages within hours) are systems failures, not model failures. Cascade analysis backed by ZenML's 1,200-deployment study shows that the supervisor+specialist pattern works in production (Abemon: 96.3% hands-off success at $0.08/request and 12s p95 latency), whereas generalized peer-to-peer agent orchestration requires explicit cascade mitigation. Governance is cited by 58% of CTOs as the #1 blocker (up from 23% in 2025), exceeding model performance, integration, and talent constraints. Organizations with documented escalation pathways, audit infrastructure that produces decision traces within 10 minutes, and change management that treats prompts as code shipped at 2x the rate.
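The checkpoint failure described above (a state schema change breaking deserialization of in-flight workflows) is commonly addressed by versioning the persisted state and migrating old versions on load. The sketch below shows that general technique using only the standard library. It is not LangGraph's checkpointing API, and the field names (`msgs`, `messages`, `retries`) and the v1-to-v2 change are assumptions made up for illustration.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(state: dict) -> dict:
    # Hypothetical schema change: v2 renamed "msgs" to "messages"
    # and introduced a "retries" counter.
    new = dict(state)
    new["messages"] = new.pop("msgs", [])
    new.setdefault("retries", 0)
    return new


# One migration per version step, keyed by the version it upgrades from.
MIGRATIONS = {1: _v1_to_v2}


def save_checkpoint(state: dict) -> str:
    """Serialize state inside an envelope that records the schema version."""
    return json.dumps({"version": CURRENT_VERSION, "state": state})


def load_checkpoint(blob: str) -> dict:
    """Deserialize a checkpoint, upgrading older schema versions step by step."""
    envelope = json.loads(blob)
    version = envelope.get("version", 1)  # assumption: unversioned means v1
    state = envelope["state"]
    if version > CURRENT_VERSION:
        # Fail loudly instead of resuming with state we cannot interpret.
        raise RuntimeError(f"checkpoint version {version} is newer than code")
    while version < CURRENT_VERSION:
        if version not in MIGRATIONS:
            raise RuntimeError(f"no migration from version {version}")
        state = MIGRATIONS[version](state)
        version += 1
    return state


old_blob = json.dumps({"version": 1, "state": {"msgs": ["plan ready"]}})
resumed = load_checkpoint(old_blob)
round_trip = load_checkpoint(save_checkpoint(resumed))
```

A workflow checkpointed under the old schema resumes with the new field names, and a checkpoint that cannot be migrated raises an error immediately rather than degrading silently.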

Vendor platform maturity continues to accelerate (Claude Managed Agents public beta, May 2026; all cloud platforms shipped managed orchestration by Q2). The MCP SDK reached 97M downloads/month; its 2026 roadmap addresses stateful sessions, agent-to-agent communication, and enterprise auth. Yet enterprise adoption remains deeply constrained: 83% of enterprises funded agentic projects in 2026, but only 41% reached production (a 42-point gap); 58% are stalled on governance reviews or paused; 38% of Fintech projects hit compliance boundary disputes (SEC/FINRA jurisdiction). The gap between pilot success (72% report gains) and production viability (only 28% sustain post-deployment) reflects unresolved infrastructure challenges: state management under concurrent execution, retry semantics (the policy for a failed step: retry the same step, re-plan, skip, or escalate), observability for probabilistic systems (decision-level logging, not just I/O logging), cost modeling under non-linear scaling, and human-in-the-loop gate design. Successful production deployments limit coordination to a maximum of 3 agents, implement deterministic routing where possible (95% cost savings on the supervisor pattern when probabilistic delegation is replaced with typed state machines), and use structural prevention patterns: scope hierarchy (the orchestrator delegates only well-defined subtasks), authority attenuation (subagents cannot make decisions above their scope), and typed protocols (inter-agent message contracts enforced at runtime). The practice demonstrates that tightly scoped, disciplined orchestration delivers measurable value; general-purpose enterprise-scale adoption remains blocked by unresolved governance gaps, cost modeling failures, and the absence of proven scaling patterns.
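As a rough sketch of what "deterministic routing" with explicit retry semantics might look like, the example below replaces an LLM-driven supervisor decision with a lookup table. The stage names, the per-stage failure policies, and the retry limit are all assumptions chosen for illustration; the four policy options mirror the retry semantics listed above.

```python
from enum import Enum


class Stage(Enum):
    PLAN = "plan"
    CODE = "code"
    REVIEW = "review"
    DONE = "done"
    ESCALATED = "escalated"  # handed to a human


class OnFailure(Enum):
    RETRY_SAME = "retry_same"
    REPLAN = "replan"
    SKIP = "skip"
    ESCALATE = "escalate"


# Fixed topology: the next stage is a table lookup, not a model call.
NEXT_STAGE = {
    Stage.PLAN: Stage.CODE,
    Stage.CODE: Stage.REVIEW,
    Stage.REVIEW: Stage.DONE,
}

# Hypothetical per-stage failure policy.
FAILURE_POLICY = {
    Stage.PLAN: OnFailure.ESCALATE,
    Stage.CODE: OnFailure.RETRY_SAME,
    Stage.REVIEW: OnFailure.REPLAN,
}

MAX_RETRIES = 2  # assumed budget before a human is pulled in


def route(stage: Stage, succeeded: bool, retries: int):
    """Return (next_stage, retries) for a non-terminal stage."""
    if succeeded:
        return NEXT_STAGE[stage], 0
    policy = FAILURE_POLICY[stage]
    if policy is OnFailure.RETRY_SAME and retries < MAX_RETRIES:
        return stage, retries + 1
    if policy is OnFailure.REPLAN and retries < MAX_RETRIES:
        return Stage.PLAN, retries + 1
    if policy is OnFailure.SKIP:
        return NEXT_STAGE[stage], 0
    # ESCALATE policy, or retry budget exhausted.
    return Stage.ESCALATED, retries


happy_path = route(Stage.CODE, True, 0)
first_retry = route(Stage.CODE, False, 0)
exhausted = route(Stage.CODE, False, MAX_RETRIES)
```

Because every transition is a table lookup, routing consumes no tokens and each decision can be logged and replayed exactly, which also serves the decision-level observability requirement mentioned above.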

TIER HISTORY

Research: Sep-2024 → Oct-2024
Bleeding Edge: Oct-2024 → present

EVIDENCE (101)

When AI builds itself | Case Studies

— Anthropic's production deployment: 80% of merged code authored by Claude (up from single digits in 2024), 8x code output per engineer, autonomous agents delegating to sub-agents, multi-hour task autonomy.

— UC Berkeley MAST taxonomy (1,600+ traces across 7 frameworks): 41-86.7% failure rates, 14 failure modes in 3 categories, cost multipliers 2-10x per category; cascade failures are systems-level, not model failures.

— Critical negative signal: most multi-agent systems underperform single-agent baseline under normalized protocol; only 1 of 6 tested MAS exceeds anchor, trailing by 2.56-11.29 accuracy points with higher cost.

— Longitudinal field validation across 17 repositories: 8,589 commits, 1,822 tasks, 13,866 tests (99.87% pass); wave-based topological dispatch, dual validation gates, human-as-agent integration.

— Red Hat 500+ person org production deployment: seven agentic streams (requirements through release), 30% flagged for review, security false positives improved 58% → 22%, guardrails-as-features model.

— 750k-line Zig-to-Rust codebase port: Jarred Sumner (Bun CEO) orchestrated hundreds of agents with generator-validator GAN pattern, 11 days, 99.8% tests passing, zero human intervention post-prompt.

— Fundamental failure mode documented: 'Reasoning Trap' where reasoning-heavy orchestrators fail due to context squeezing as tasks flow downstream; predicts performance collapse via entropy dynamics model.

— Multi-agent orchestration as GA platform primitive: coordinator delegates to specialist sub-agents with isolated session threads, persistent history, and named parallelization/specialization/escalation patterns.

HISTORY

  • 2024-Q3: Early research phase with benchmark-driven validation. HyperAgent achieves SOTA on SWE-Bench and Defects4J. Enterprise case studies reveal $127M in failed deployments; critical blockers include inadequate testing and legacy system integration. Token efficiency concerns emerge from ChatDev analysis. Field consensus: feasible in research, barriers block production adoption.

  • 2024-Q4: Production deployments emerge at scale: LinkedIn SQL Bot, Uber code migration, AppFolio copilot (10+ hrs/week savings), Elastic and Replit multi-agent systems all live in production. Framework infrastructure (LangGraph) matures. Adoption breadth grows (68% of companies deployed agents) but ROI gap widens (only 32% see significant value). Industry shift toward tightly scaffolded "intelligent workflows" signals recognition of autonomy limits. Conference engagement increases but practitioner analysis remains cautious: applications scarce, systems not yet human-assistant equivalents.

  • 2025-Q1: Research now documents systematic failure modes: UC Berkeley peer-reviewed study identifies 18 failure patterns across 5 frameworks on 150+ tasks, with performance gains remaining minimal vs. single agents. New production case study: Build.inc's 25-agent LangGraph system reduces land diligence from 4 weeks to 75 minutes. Cloud vendors (AWS, Microsoft) release native multi-agent tutorials and orchestration capabilities. Critical gap emerges: practitioner analysis informed by Microsoft Research interviews surfaces underdeveloped debugging infrastructure, missing security/compliance standards, and tool immaturity as primary adoption barriers. Deployment breadth unchanged (68% companies) but ROI realization stalls (32% threshold). Bifurcation signal: domain-specific systems (land diligence, SQL conversion) demonstrate viability; general-purpose orchestration faces reliability and debuggability challenges.

  • 2025-Q2: Framework infrastructure reaches maturity: LangGraph Platform reaches GA with 400 companies deploying to production. Anthropic releases production multi-agent system achieving 90.2% performance improvement over single-agent (though at 15x token cost). Enterprise adoption accelerates: KPMG survey shows 33% of organizations deployed agents, up from 11% in prior quarters. Simultaneously, research and practitioner evidence documents persistent technical barriers: failure attribution models achieve only 14.2% accuracy in pinpointing failure steps; Gartner forecasts 50% error rates in multi-agent systems; production failures documented (32% conversion drops in e-commerce). GitHub signals platform evolution with agentic workflow capabilities. Pattern emerges: rapid adoption momentum (infrastructure GA, framework maturation, vendor platform integration) coexists with unresolved technical fragility (failures, debugging gaps, token inefficiency), expanding deployment breadth while reliability concerns remain.

  • 2025-Q3: Vendor platform expansion and concurrent risk signal convergence: AWS ships Strands Agents 1.0 with 2,000+ stars and multi-provider backing (Anthropic, Meta, OpenAI, Cohere, Mistral); Microsoft positions multi-agent systems as enterprise strategic imperative with architecture guides. Academic frameworks advance (Yale/Chicago/Oxford freephdlabor system for dynamic workflows). Deployment reports claim growth: 51% of teams in production (ZenML), millions of queries via Deutsche Telekom LMOS and Cognizant. Critical countervailing signals intensify: Gartner predicts 40% project cancellation by 2027; Carnegie Mellon benchmark shows 70% agent failure rate on standard tasks (Claude 3.7 Sonnet 26.3%, Gemini 2.5 Pro 30.3%, GPT-4o 8.6% success); HP infrastructure analysis documents 88% prototype failure cascade and unresolved cost/privacy/security roadblocks. Bifurcation sharpens: tooling maturity and platform integration accelerate while production viability signals worsen, suggesting the "adoption" metric reflects experiment breadth rather than production value realization.

  • 2025-Q4: Framework maturation continued with LangGraph Platform confirming hundreds of production deployments and GitHub shipping Custom Agents for Copilot (October 2025). Analyst consensus hardened on adoption limits: Gartner reports less than 5% of enterprise applications deployed "real agents" by year-end (IntuitionLabs, November 2025). Deloitte warns 40% of agentic projects face abandonment by 2027. Production deployment evidence: JPMorgan Chase, NuvoBank, LinkedIn, and food manufacturing firms using LangGraph with human oversight; Cognizant and Deutsche Telekom processing millions of queries. Critical negative signals: Parallel AI documents $47,000 loss from coordination failures (November 2025); analysis attributes 95% deployment failures to architectural flaws in state management. Market forecasts project $35B autonomous agent market by 2030 yet production penetration remains confined to tightly scoped workflows. Orchestration complexity, failure recovery, and governance gaps persist as primary blockers to general-purpose enterprise adoption.

  • 2026-Jan: Ecosystem maturity and failure mode documentation intensify in parallel. Large-scale analysis of 42K commits across 8 multi-agent systems (LangChain, CrewAI, AutoGen) reveals 40.8% perfective maintenance and 10% issues attributed to agent coordination. Production case: FRE|Nxt's InterviewLM with 8 specialized agents achieving 100+ concurrent sessions and 40% cost optimization. Adoption metrics show persistent gap: 66% experimenting, 38% in pilots, 14% ready to deploy, but only 11% live—with Gartner forecasting 40% project cancellation by 2027. Critical negative signals consolidate: 35% performance degradation in production systems from coordination overhead; MAST taxonomy documents 14 failure modes (41.8% specification, 36.9% misalignment, 21.3% verification); OWASP 2026 analysis identifies inter-agent trust and cascading failure vulnerabilities. Bifurcation persists: frameworks mature while deployment reliability and architectural robustness remain unresolved core challenges constraining viability to tightly scoped, heavily guarded workflows.

  • 2026-Feb: Production infrastructure maturation coexists with persistent adoption-to-deployment gap. Dotzlaw case study demonstrates viable LangGraph scaling to 10k concurrent users with 60% cost savings. Durable execution emerges as critical infrastructure (Temporal $5B valuation). Ecosystem expands rapidly (97M MCP SDK downloads/month) but quality concerns intensify (average tool score 44.7/100). Adoption metrics reveal stalled enterprise transition: 11% in production vs 39% experimenting; Gartner forecasts 40% project cancellation by 2027. GitHub and practitioner analysis identify core failure patterns in orchestration and state management. Bifurcation sharpens: framework maturity and vendor backing accelerate while deployment reliability and operational overhead remain central blockers to enterprise-scale adoption.

  • 2026-Mar: Vendor platform convergence confirmed: all major platforms (OpenAI Codex Subagents GA with manager-worker architecture, Dapr Agents v1.0 GA, Claude Code, GitHub, Devin, Grok) shipped multi-agent parallel execution by mid-March, and adoption metrics show high-adoption teams achieving 2.2x PR throughput (Jellyfish, 700+ companies). Concurrent failure evidence hardened: peer-reviewed study documents two-agent coordination accuracy drops from 58% to 25% under reduced specification detail (25-39pp gap independent of model capability); CooperBench confirms 50% lower collaboration success vs solo agents; Princeton research shows chained-agent reliability degrades to 74% combined even when components exceed 90%; a real 4-agent design system deployment documented 86% XSS vulnerabilities and $0.88–$146 cost variance per component. Enterprise adoption reality remains bifurcated: Deloitte confirms only 11% of companies use agents in production, with Klarna's structured LangGraph system (2.3M conversations/month, $60M savings) demonstrating that tightly scoped deployments with explicit orchestration topology deliver measurable ROI while general-purpose scaling remains constrained by coordination overhead and unresolved failure attribution.

  • 2026-Apr: Organizational-scale production evidence accumulated: Stripe Minions (1000+ unattended PRs/week), Google Agent Smith (25%+ of production code), OpenAI Harness (~1M agent-written LOC), and 1inch's ticket-to-PR CI pipeline (implement + seven parallel review agents + synthesizer) documented as distinct deployment patterns; framework ecosystem consolidated around CrewAI (12M daily executions), LangGraph (Klarna/JPMorgan), and Microsoft Agent Framework 1.0 (AutoGen merger). Anthropic launched Claude Managed Agents public beta (April 8, 2026) as the first major vendor production-grade agent runtime with built-in orchestration, sandboxing, MCP integration, and state persistence. Enterprise orchestration evidence sharpened: Salesforce documented Alcon reaching 900+ agents in uncoordinated silos (security and governance crisis) versus RBC Advisor deploying 12+ specialized agents with an orchestrator-supervisor achieving 50% reduction in advisor prep time, illustrating that orchestration architecture, not agent count, determines enterprise viability; Gartner logged a 1445% inquiry surge for multi-agent topics while separate analysis shows 40% of multi-agent pilots fail in production. Production deployment economics reinforced coordination overhead as central viability constraint ($47k/month multi-agent vs $22.7k single-agent with only 2.1pp accuracy gain); Deloitte data confirmed 89% of multi-agent pilots fail at production deployment, with failures attributed to governance and integration maturity rather than technical limitations. Concurrent failure research hardened: peer-reviewed study shows single false claim spreads to all agents within three rounds across six frameworks (genealogy graph mitigation raises defense from 32% to 89%); structural limits synthesis documents 17.2x error amplification without coordination and 39-70% sequential reasoning degradation across all multi-agent variants. Research taxonomy established Wave 1 (viability) vs Wave 2 (measurement) framing, with key insight that single-agent systems with well-designed interfaces (10.7pp improvement from interface design alone in SWE-agent) often outperform multi-agent architectures for narrowly scoped tasks. Anthropic published five coordination patterns (generator-verifier, orchestrator-subagent, agent teams, message bus, shared state) with documented failure modes, establishing architectural standards for production deployment.

  • 2026-May: Framework maturity sharpens with production failure analysis and governance consolidation. LangGraph 1.0 migration guide documented critical breaking changes—state schema changes break checkpoint deserialization mid-flight (17 workflows stuck, 12-day silent degradation), checkpointing costs balloon (10GB in 3 weeks), supervisor pattern burns 40% of token budget on routing—with deterministic typed state machines achieving 95% cost savings over LLM-driven delegation. Cross-framework comparison (Octomind postmortem, Reditus, Ditto production postmortems) confirms trend toward vanilla SDK simplicity; LangGraph maintains 38% of production deployments but actual production evidence increasingly favors simpler orchestration. Cascade problem research (ZenML 1,200-deployment study; Abemon: $0.08/request, 12s p95 at 96.3% hands-off success) documents that generalized peer-to-peer orchestration drives systems failures (inventory agent hallucination propagating to purchase orders, manifests, and customer pages within hours), not model failures—requiring structural cascade mitigation. Governance emerged as the dominant adoption blocker: 58% of CTOs cite it as #1 constraint (up from 23% in Q4 2025), exceeding model performance and integration barriers. Enterprise adoption gap hardened: 83% of enterprises funded agentic projects but only 41% reached production; 38% of Fintech projects stalled on regulatory boundary disputes. Anthropic Claude Managed Agents shipped public beta (May 6) with multiagent orchestration supporting up to 20 specialists with shared filesystem and recursive decomposition; all major cloud platforms now GA on managed orchestration, indicating infrastructure maturity while structural reliability gaps remain unresolved at enterprise scale.

  • 2026-Jun: Infrastructure GA and organizational-scale validation converge with critical failure research. Anthropic released Managed Agents multi-agent orchestration (May 30) as a GA primitive, with a coordinator delegating to specialist sub-agents on isolated session threads. Anthropic's own development reached 80% of merged code authored by Claude (up from single digits in 2024) with 8x code output per engineer and agents delegating multi-hour work to sub-agents; Jarred Sumner (Bun CEO) used Claude Dynamic Workflows to port a 750k-line Zig codebase to Rust autonomously in 11 days, using a generator-validator GAN pattern with 99.8% of tests passing; Red Hat's 500+ person organization deployed seven agentic SDLC streams on a guardrails-as-features model (30% flagged for review, security false positives reduced from 58% to 22%); and SPOQ field validation across 17 repositories (8,589 commits, 99.87% test pass) showed that wave-based topological dispatch with dual validation gates works. However, rigorous June 2026 research sharpened the architectural ceiling: the UC Berkeley MAST study (1,600+ traces, 7 frameworks) confirmed 41-86.7% failure rates, with 14 documented failure modes and cascade cost multipliers of 2-10x per failure category; a peer-reviewed evaluation found that only 1 of 6 multi-agent architectures outperforms a single-agent baseline (trailing by 2.56-11.29pp at higher cost); entropy dynamics research documented the "Reasoning Trap", in which orchestrators' context is squeezed as tasks flow downstream, causing performance collapse; and architectural elaboration (adding planner/researcher/tester/verifier) was found to inflate complexity without accuracy gain. The practice remains bifurcated: infrastructure and deployment scale both advanced, and tightly scoped deployments with explicit topology, guardrails, and validation gates deliver measurable value, yet general-purpose enterprise adoption remains blocked by state schema fragility, cascade amplification, and orchestrator bottleneck effects. Trend remains 'stalled' at bleeding edge: infrastructure maturity is sufficient for specialized deployments; architectural limitations prevent broader enterprise scaling.
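The chained-agent reliability figure in the 2026-Mar entry (about 74% combined even when each component exceeds 90%) follows from simple multiplication if stage failures are treated as independent. The sketch below works that arithmetic. The three-stage chain at 90.5% per stage is an assumed example that happens to land near the cited figure; it is not the configuration used in the research, and the independence assumption is a simplification.

```python
import math


def chain_reliability(stage_success_rates):
    """End-to-end success probability of a sequential pipeline.

    Assumes each stage fails independently, so the combined
    probability is the product of the per-stage probabilities.
    """
    return math.prod(stage_success_rates)


# Assumed example: three chained agents, each succeeding 90.5% of the time.
three_stage = chain_reliability([0.905, 0.905, 0.905])
print(f"three stages at 90.5%: {three_stage:.1%}")

# Adding stages keeps eroding the total even though no single stage got worse.
five_stage = chain_reliability([0.905] * 5)
print(f"five stages at 90.5%: {five_stage:.1%}")
```

This is the arithmetic behind the preference for short pipelines noted earlier: each additional hand-off multiplies in another chance of failure unless a validation gate catches errors between stages.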