Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain.

[Chart: domains plotted along a maturity axis running from BLEEDING EDGE to ESTABLISHED]

Multi-step autonomous deep research

BLEEDING EDGE

TRAJECTORY

Stalled

AI agents that conduct multi-step research autonomously — formulating queries, reading sources, following leads, and synthesising findings. Includes tools like Gemini Deep Research and Perplexity Pro; distinct from single-query retrieval which answers from a single search round.

OVERVIEW

Multi-step autonomous deep research (AI agents that formulate queries, read sources, follow leads across multiple rounds, and synthesise findings without human intervention) has crossed from consumer experimentation into bounded production deployment. Three major platforms (Google, OpenAI, Perplexity) now offer enterprise-grade deep research features deployed at scale, with Perplexity Computer reaching $450M ARR and Google's Deep Research Max launching native multi-model orchestration for asynchronous enterprise workflows. Yet the practice remains constrained by two unresolved gaps.

The first is reliability. Accuracy gaps in multi-turn research persist across all platforms: the Stanford HAI 2026 AI Index shows agents achieving only ~50% of PhD specialist performance on complex workflows; on WildClawBench, best-in-class models reach only 62.2% across 60 realistic long-horizon tasks; AutoExperiment shows frontier agents collapsing from 30-37% accuracy to 6-10% when research tasks have cross-function dependencies. Multi-model consensus also breaks down: 99.1% of real-world turns show contradictions across frontier models, and Gemini's high-confidence answers are contradicted by peer models 51.3% of the time. A Princeton-backed analysis documented that eighteen months of model capability gains yielded zero reliability improvement for production agents.

The second is organisational scaling. Only 23% of enterprises scale agentic systems enterprise-wide despite 88% using AI somewhere, and Gartner forecasts that 40%+ of agentic projects will be cancelled by 2027 because of governance problems and unclear ROI. Deep research agents are fully available and deployed in early-adopter teams, but orchestration matters as much as models (multi-agent systems achieve a 90.2% improvement over single-agent baselines), and the majority of organisations have not yet built the verification frameworks, context governance, and process redesign that autonomous research at scale requires.

For exploratory research and bounded decision support with human review, deep research delivers measurable acceleration. For mission-critical analysis, autonomous research remains a supervised tool requiring verification boundaries.
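
One way to build intuition for why long research chains are fragile is a back-of-the-envelope compounding model. The sketch below assumes, purely for illustration, that every step of a research run succeeds independently and with the same probability; neither assumption comes from the benchmarks cited above, and the function names `chain_success` and `steps_until_below` are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def steps_until_below(per_step: float, floor: float) -> int:
    """Smallest number of steps at which whole-chain success drops below `floor`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < floor <= 1.0:
        raise ValueError("floor must be in (0, 1]")
    steps = 0
    while chain_success(per_step, steps) >= floor:
        steps += 1
    return steps


if __name__ == "__main__":
    # Illustrative per-step reliabilities, not measured values.
    for p in (0.99, 0.95, 0.90):
        row = [round(chain_success(p, n), 3) for n in (1, 5, 10, 20)]
        print(f"per-step {p:.2f}: {row}")
```

Real pipelines violate the independence assumption (errors can cascade, and agents sometimes recover from a bad step), so the output is best read as intuition for why verification boundaries matter rather than as a prediction.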

CURRENT LANDSCAPE

By June 2026, deep research has consolidated around three major vendor platforms (Google, OpenAI, Perplexity) with production deployments across enterprise and consumer markets.

Perplexity Computer, launched in February 2026, orchestrates 20 AI models for multi-step autonomous research tasks, with enterprise deployment at $200/month. It has since expanded to desktop (Mac Mini with audit trails), enterprise integrations (Snowflake/Salesforce), and the Comet Enterprise browser, and added four new APIs (Search, Agent, Embeddings, Sandbox). It reached $450M ARR (50% monthly growth) with 100M+ monthly active users and tens of thousands of enterprise clients executing multi-step workflows (document review, campaign planning, tax filing automation, investment analysis). A separate February 2026 figure put Perplexity's user base at 33M monthly active users, with 20.8% of queries targeting research and learning.

Google's Deep Research Max (April 2026) adds Model Context Protocol (MCP) support for proprietary data integration, native chart generation, and API access for asynchronous enterprise workflows. Gemini Deep Research achieved Workspace GA, enabling teams to synthesise internal documents (Gmail, Drive, Chat) with public web sources; a technical review shows Deep Research Max synthesising 30-60 sources over 4-7 search iterations and completing in 3-10 minutes.

OpenAI's Deep Research continues scaling across ChatGPT's 700M weekly users.

Early-stage production deployments achieve measurable results in specific use cases: legal teams conducting M&A due diligence autonomously across centuries of corporate records, cutting months of work to a single afternoon (90% time reduction); a sales research pipeline (Gemini Deep Research → NotebookLM → Google Sheets) producing qualification scores on a 0-16 scale with cited evidence in under 20 minutes per company; financial analysis (Skywork case study: 93% citation accuracy, 15-25% speed gains); media research (GDELT Project: unattended autonomous analysis of 500+ TV transcripts, generating think tank reports at $2.14 per run); and academic research with frontier models (AlphaLab: multi-phase research with a 4.4x GPU kernel speedup). Infrastructure is maturing alongside these deployments: the LangChain Deep Agents framework, standardised multi-model orchestration patterns, and verification boundaries enable production control.

Yet critical reliability barriers persist. On WildClawBench, best-in-class models reach only 62.2% across 60 realistic long-horizon tasks, and orchestration architecture alone shifts performance by up to 18 percentage points, showing that architectural factors matter as much as model capability. AutoResearchBench quantified closure and state-tracking defects: on multi-step research tasks, Claude Opus 4.6 scores only 9.39%, GPT-5.4 7.44%, and Gemini 3.1 Pro 7.93%. AutoExperiment shows frontier agents collapsing from 30-37% accuracy on single-function tasks to 6-10% when functions have cross-function dependencies, because agents fail on implicit data-flow reasoning.

Multi-model consensus breaks down: 99.1% of production turns show contradictions across frontier models, and Gemini's high-confidence answers are contradicted by peer models 51.3% of the time, which makes single-model confidence an unreliable signal. Source credibility failures are systematic: treating citation volume as credibility leaves agents vulnerable to poisoned data (AI Slop Loop: fabricated articles cited as fact within 24 hours), and 50-90% of autonomous research citations remain unsupported. A Washington State University study shows ChatGPT at 76.5% accuracy but only 41% consistency across runs.

Governance constrains adoption: only 23% of enterprises scale agentic systems enterprise-wide despite 88% using AI somewhere; Deloitte reports 11% production deployment; Cisco's survey shows 85% pilot vs 5% production trust; Dynatrace reports that 88% never reach production. Deep research is production-feasible for bounded use cases with human review and verification boundaries, but autonomous research at enterprise scale remains blocked by architectural reliability gaps, organisational-process barriers, and governance immaturity.
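
The contradiction and consistency figures above depend on how agreement is measured, and the source does not give the cited studies' definitions. As a minimal sketch, assuming answers can be compared by exact match after case and whitespace normalisation (a crude stand-in for semantic comparison), the two kinds of metric could be computed as follows. The function names and the model labels in the usage example are hypothetical.

```python
from collections import Counter


def normalise(answer: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def turn_has_contradiction(answers: dict[str, str]) -> bool:
    """True if at least two models give different normalised answers for one turn."""
    return len({normalise(a) for a in answers.values()}) > 1


def contradiction_rate(turns: list[dict[str, str]]) -> float:
    """Share of turns in which the models do not all agree."""
    if not turns:
        raise ValueError("need at least one turn")
    return sum(turn_has_contradiction(t) for t in turns) / len(turns)


def run_consistency(runs: list[str]) -> float:
    """Share of repeated runs of one prompt that match the most common answer."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(normalise(r) for r in runs)
    return counts.most_common(1)[0][1] / len(runs)


if __name__ == "__main__":
    # Hypothetical model labels and answers, for illustration only.
    turns = [
        {"model_a": "Yes", "model_b": "yes ", "model_c": "YES"},
        {"model_a": "yes", "model_b": "no", "model_c": "yes"},
    ]
    print(contradiction_rate(turns))
    print(run_consistency(["yes", "Yes", "no", "yes"]))
```

Exact matching overstates disagreement on free-text answers, so a production check would more plausibly compare extracted claims or use a judged equivalence test. The point of the sketch is only that accuracy and consistency are separate measurements: a system can be right on average and still give different answers on repeated runs.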

TIER HISTORY

Research: Dec-2024 → Dec-2024
Bleeding Edge: Dec-2024 → present

EVIDENCE (101)

— Perplexity Computer integration: BrowseComp +43pp (40.7%→83.8%), Humanity's Last Exam +14pp; 'Search as Code' paradigm with parallel retrieval/filtering; rolling to Max tier and Agent API—multi-model orchestration reaching production scale with measured benchmark improvements.

— SciAgentArena benchmark (Stanford/MIT/Harvard): agents effective on well-specified data workflows but struggle with multi-constraint optimization, novel insight generation, and unsupported claim detection—defining boundaries of autonomous research capability.

— Zhao et al. audited 2.5M arXiv/bioRxiv/PubMed papers: 146,932 hallucinated citations identified in 2025; rate rose from 1/2,828 papers (2023) to 1/277 (early 2026); demonstrates widespread deployment of deep research tools at scale with endemic citation failure modes.

— First rigorous benchmark (ResearchClawBench): Claude Code achieved only 21.5% on 40 real scientific re-discovery tasks; errors concentrated in three modes (experimental mismatch, evidence gaps, missing core); reveals a critical gap between market adoption and measured research agent reliability.

— Production study (Feb-May 2026): Perplexity Computer achieved 26-minute autonomous execution per session vs 33 seconds for Search (48× increase); 87% task time reduction, 94% cost savings on matched 10k sessions; 23% novel task expansion showing scope amplification.

— Adoption-to-production gap: 79% claim agents, 11% in production; 88% of pilots fail; Gartner forecasts 40%+ cancellation by 2027; successful deployments (Klarna $60M, Salesforce 380k interactions) follow narrow scope + measurable output—critical scaling barrier for deep research.

— CHARM framework (arXiv June 3, 2026) formalizes cascading hallucinations in multi-step RAG pipelines; existing detectors catch only 12.8–41.7% of failures; LLM self-correction counterproductive (12.8% detection); identifies fundamental pipeline reliability barrier for deep research.

— Deep Research Max API benchmarks: 93.3% on DeepSearchQA (vs 66.1% Dec 2025, a +27.2pp gain, roughly 41% relative); MCP support for proprietary data; async background workflows up to 60 minutes; API GA (April 21, 2026). Technical capability maturation documented.
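
Several evidence items quote percentage-point deltas, relative gains, and ratios side by side, and these are easy to conflate. The sketch below is a quick arithmetic check on figures taken from the items above; the helper names are hypothetical, and small differences from the quoted multiples may come from rounding in the published inputs.

```python
def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points (pp)."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting value."""
    if before_pct == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after_pct - before_pct) / before_pct * 100.0


def rate_multiplier(old_one_in: float, new_one_in: float) -> float:
    """How many times more frequent a '1 in N' event became."""
    return old_one_in / new_one_in


def duration_ratio(long_seconds: float, short_seconds: float) -> float:
    """Ratio between two durations given in seconds."""
    return long_seconds / short_seconds


if __name__ == "__main__":
    # Benchmark scores quoted in the evidence list.
    for name, before, after in (("BrowseComp", 40.7, 83.8), ("DeepSearchQA", 66.1, 93.3)):
        print(f"{name}: {point_gain(before, after):+.1f}pp, "
              f"{relative_gain(before, after):+.1f}% relative")
    # Hallucinated-citation rate: 1 per 2,828 papers to 1 per 277 papers.
    print(f"citation rate multiplier: {rate_multiplier(2828, 277):.1f}x")
    # 26 minutes of autonomous execution vs 33 seconds for Search.
    print(f"session length ratio: {duration_ratio(26 * 60, 33):.1f}x")
```

A percentage-point figure answers "how far did the score move", while a relative figure answers "how large was the move compared with the baseline"; reporting which one is meant avoids overstating or understating a benchmark change.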

HISTORY

  • 2024-Q4: Google launches Gemini Deep Research in Gemini Advanced across 150+ countries as a flagship agentic feature; Perplexity Pro Search demonstrates 50% adoption lift via multi-step reasoning architecture. Industry adoption surveys show 68% of organizations have deployed AI agents, though ROI realization remains below 50%. Accuracy and hallucination challenges identified as key adoption barriers.
  • 2025-Q1: OpenAI launches Deep Research (Feb 2025) as a dedicated deep research agent in ChatGPT Pro, using the o3 reasoning model for autonomous multi-step investigation; Google extends Gemini Deep Research to Workspace users. Perplexity benchmarks at 93.9% on SimpleQA. Critical analyses emerge describing agents as "fallible tools" rather than expert-level researchers; agentic RAG becomes the category's enabling architecture. The category transitions from experimental to mainstream availability across three major platforms.
  • 2025-Q2: Perplexity reaches 15M active users (50% growth in 3 months); pursues $500M–$1B funding at $18B valuation target. Google I/O announces Flash 2.5 experimental support in Deep Research. Production-ready patterns emerge across platforms (OpenAI, Google, Perplexity, Claude/Anthropic); enterprise deployments adopt steerable workflows for controlled autonomy. Academic surveys document category advances; knowledge cutoff bias and information lag emerge as persistent reliability gaps at scale.
  • 2025-Q3: Perplexity grows to 30M monthly active users (780M monthly queries) with 66% YoY growth; enterprises across banking, pharma, law adopt for mission-critical research (60% of Pro customers). Google's Gemini Deep Research achieves production status with Workspace integration and usage quotas. However, critical assessments surface: peer-reviewed medical research examines risks to citation integrity and research quality; user reports document hallucinations in current affairs research; MIT study shows 95% of GenAI pilots fail to reach production due to reliability, data quality, and governance barriers.
  • 2025-Q4: Deep research consolidates around three major platforms (Perplexity, Gemini, OpenAI) with evidence of bounded production use (Skywork case study: 93% citation accuracy, 15-25% speed gains on market research reports). Perplexity's 100M+ interactions show 57% targeting research/learning. However, adoption ceiling persists: Gartner finds only 15% of IT leaders deploying fully autonomous agents (Oct), Deloitte reports 11% production deployment (Dec), and Gemini 3 Pro maintains 88% hallucination rate despite 53% accuracy lead. Domain-specific scientific research shows promise (energy materials agents), but governance, security, and reliability gaps constrain enterprise scaling. Practice matured from experimentation to selective production use but faces unresolved trustworthiness barriers.
  • 2026-Jan: Mainstream business adoption reaches 67% of enterprises using AI research tools (Gartner); Perplexity achieves 370% YoY user growth and 14.1% market share. However, scaling barriers persist: 62% of organizations experimented with agentic workflows but 70-80% struggle to scale with only 5% achieving ROI. EBU/BBC study reveals 45% of AI research responses contain errors; Gemini exhibits 72% sourcing problems. LangChain releases Deep Agents framework enabling multi-step task decomposition through subagents. Deep research remains viable for exploratory, non-critical use but unsuitable for mission-critical workflows requiring reliability and governance.
  • 2026-Feb: Platform consolidation continues with Perplexity reaching 33M monthly active users (20.8% research-focused queries) and Google releasing Gemini 3.1 Pro with upgraded Deep Think model. Specialized research agents emerge: DeepMind's Aletheia achieves 91.9% on mathematical reasoning benchmarks and autonomously co-authors published papers. However, HalluHard benchmark reveals state-of-the-art models still hallucinate ~30% in multi-turn conversations even with web search. Princeton-backed analysis shows 18 months of model capability gains have not improved production agent reliability, widening the gap between capability and trustworthiness.
  • 2026-Mar: Gemini Deep Research reached Workspace GA integrating Gmail, Drive, and Chat with web sources in unified report-generation workflows. Perplexity Computer expanded to desktop (Mac Mini with audit trails), enterprise (Snowflake/Salesforce integration), and Comet Enterprise browser, adding four new APIs (Search, Agent, Embeddings, Sandbox) orchestrating 20 models. Reliability benchmarks remained sobering: a Washington State University study of 700+ scientific hypotheses found ChatGPT at 76.5% accuracy but only 41% consistency across runs; Google's DeepFact paper showed PhD-level experts improve factuality evaluation from 60.8% to 81%+ only when benchmarks are iteratively refined, highlighting benchmark brittleness as a compounding barrier. CrewAI's enterprise survey found 81% claim to be scaling agentic AI but only 11% have agents in production, with 38% stuck in pilots and Gartner predicting 40% of agentic projects cancelled by 2027 — underscoring that deep research capability continues to outpace organisational readiness to deploy and govern it.
  • 2026-Apr: Product launches and reliability failures defined the month in parallel. Google launched Deep Research Max with Gemini 3.1 Pro, adding MCP support for proprietary data integration and native chart generation for asynchronous enterprise workflows; Perplexity Computer launched multi-model orchestration (Claude Opus reasoning, Gemini Deep Research, GPT-4 drafting) with sub-agent parallelisation and background workflows running hours or days unattended, reaching $450M ARR. Stanford HAI 2026 AI Index documented agents achieving only ~50% of PhD specialist performance on complex research workflows; AlphaLab demonstrated GPT-5.2 and Claude Opus 4.6 autonomously conducting multi-phase research with 4.4x GPU kernel speedup at $150-200 per campaign. The AI Slop Loop case surfaced a systemic failure: fabricated SEO articles were cited as fact by Perplexity within 24 hours, revealing model collapse through poisoned sources. An independent survey of 2,400 enterprises found 97% have deployed AI agents but only 29% see ROI, with 67% suffering data breaches via unapproved tools and 36% lacking governance plans — directly explaining why deep research agents remain at the bleeding edge despite product maturity. Google's Gemini Enterprise named deployments (Macquarie Bank 38% engagement lift, JCOM analysing 100k+ conversations monthly) and an M&A due diligence case study (90% time reduction, months to single afternoon) showed bounded autonomous research delivering measurable value in governed contexts.
  • 2026-Jun: Production evidence and new benchmarks jointly deepened the capability picture. A Harvard/Perplexity study (10,000 matched sessions, Feb-May 2026) documented Perplexity Computer achieving 26-minute autonomous execution per session vs 33 seconds for Search (48× autonomy increase) with 87% task time reduction and 94% cost savings. Over the same period, Perplexity's multi-model routing (20+ models) showed a +43 percentage point improvement on BrowseComp (40.7%→83.8%) and +14pp on Humanity's Last Exam, validating orchestration and model diversity as reliability amplifiers. Rigorous benchmarking clarified fundamental limits: ResearchClawBench found Claude Code achieving only 21.5% on 40 real scientific re-discovery tasks; SciAgentArena (Stanford/MIT/Harvard) showed agents effective on well-specified data workflows but failing on multi-constraint optimisation and novel insight generation; UC Berkeley research confirmed that orchestration, memory, and context governance matter as much as model capability, with multi-agent systems achieving a 90.2% improvement over single-agent baselines. The CHARM framework formalised cascading hallucination failures in multi-step RAG pipelines: existing detectors catch only 12.8–41.7% of propagation errors, while LLM self-correction proves counterproductive. The citation integrity crisis deepened: the Zhao et al. audit of 2.5M papers found 146,932 hallucinated citations (fabrication rate rising from 1 per 2,828 papers in 2023 to 1 per 277 in early 2026), and arXiv enacted one-year submission bans for unchecked LLM content. The adoption gap persisted: 79% of enterprises claim AI agents but only 11% reach production, and Gartner forecasts 40%+ cancellation by 2027. Forrester explicitly validated frontier capability ('Anthropic has demonstrated multiday research agents'), distinguishing long-horizon operation from shorter task agents.
Verdict: measurable production ROI in bounded use cases (due diligence, marketing), yet endemic citation failures, pipeline reliability gaps, and governance barriers block autonomous deployment in mission-critical workflows.

TOOLS