Multi-step autonomous deep research



BLEEDING EDGE

TRAJECTORY: Stalled

AI agents that conduct multi-step research autonomously: formulating queries, reading sources, following leads, and synthesising findings. Includes tools like Gemini Deep Research and Perplexity Pro; distinct from single-query retrieval, which answers from a single search round.
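The difference between a multi-step loop and single-round retrieval can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any of the named products work: `deep_research`, `toy_search`, `toy_extract_leads`, and `TOY_CORPUS` are hypothetical names, the "search engine" is an in-memory dictionary, and the synthesis step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    query: str   # the query that surfaced this source
    source: str  # identifier of the source that was read
    text: str    # content read from the source


def deep_research(
    question: str,
    search: Callable[[str], Iterable[tuple[str, str]]],
    extract_leads: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> list[Finding]:
    """Run up to max_rounds search rounds, following leads found in earlier rounds."""
    frontier = [question]
    seen: set[str] = set()
    findings: list[Finding] = []
    for _ in range(max_rounds):
        next_frontier: list[str] = []
        for query in frontier:
            if query in seen:
                continue  # never repeat a query
            seen.add(query)
            for source, text in search(query):
                findings.append(Finding(query, source, text))
                next_frontier.extend(extract_leads(text))
        if not next_frontier:
            break  # no new leads, so stop early
        frontier = next_frontier
    return findings


# Hypothetical toy corpus standing in for a real search backend.
TOY_CORPUS = {
    "topic": [("doc-a", "overview; see subtopic")],
    "subtopic": [("doc-b", "detail")],
}


def toy_search(query: str) -> list[tuple[str, str]]:
    return TOY_CORPUS.get(query, [])


def toy_extract_leads(text: str) -> list[str]:
    # Assumption: anything after "see " is treated as a follow-up query.
    return [text.split("see ", 1)[1]] if "see " in text else []


if __name__ == "__main__":
    multi = deep_research("topic", toy_search, toy_extract_leads)
    single = deep_research("topic", toy_search, toy_extract_leads, max_rounds=1)
    print([f.source for f in multi])   # sources reached by following leads
    print([f.source for f in single])  # sources reached in a single round
```

With `max_rounds=1` the loop degenerates to single-query retrieval and never reaches the second document; allowing further rounds lets it follow the lead found in the first source.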

OVERVIEW

Multi-step autonomous deep research (AI agents that formulate queries, read sources, follow leads across multiple rounds, and synthesise findings without human intervention) has crossed from consumer experimentation into bounded production deployment. Three major platforms (Google, OpenAI, Perplexity) now offer enterprise-grade deep research features deployed at scale: Perplexity Computer has reached $450M ARR with multi-model orchestration (Claude Opus reasoning, Gemini Deep Research, GPT-4 drafting), and Google's Deep Research Max has added MCP support and API access for asynchronous enterprise workflows. Yet the practice remains constrained by two unresolved gaps.

First, reliability. Multi-turn research accuracy gaps persist across all platforms: the Stanford HAI 2026 AI Index shows agents achieving only ~50% of PhD specialist performance on complex research workflows (39% on autonomous paper interaction); Google's FACTS benchmark shows Gemini 3 Pro at 69% accuracy on fact-based research tasks; HalluHard demonstrates ~30% hallucination rates even with web search enabled. A Princeton-backed analysis documented that eighteen months of model capability gains yielded zero reliability improvement for production agents, widening the gap between capability and trustworthiness.

Second, organizational scaling. Across 2025-2026 surveys, 97% of enterprises report having deployed AI agents, yet only 11% run them in production; 40% of agentic projects are projected to fail by 2027 due to governance, process design, and ROI realization barriers. Deep research agents are fully available, technically feasible, and deployed in early-adopter teams, but most organizations have not yet solved the process redesign, governance, and verification frameworks required for autonomous research at organizational scale.

For exploratory research and bounded decision support, deep research delivers measurable acceleration. For mission-critical analysis, autonomous research remains a supervised tool, not a replacement for human verification.

CURRENT LANDSCAPE

By April 2026, deep research has consolidated around three major vendor platforms (Google, OpenAI, Perplexity) with production deployments across enterprise and consumer markets. Perplexity Computer, launched in February 2026, orchestrates 20 AI models for multi-step autonomous research tasks, with enterprise deployment at $200/month; in March 2026 it expanded to desktop (Mac Mini with audit trails), enterprise (Snowflake/Salesforce integration), the Comet Enterprise browser, and four new APIs (Search, Agent, Embeddings, Sandbox). It reached $450M ARR (50% monthly growth as of March 2026), with a reported 100M+ monthly active users and tens of thousands of enterprise clients executing multi-step workflows (document review, campaign planning, tax filing automation). A separately reported Q1 2026 figure puts Perplexity's user base at 33M monthly active users, with 20.8% of queries targeting research and learning.

Google's Gemini Deep Research reached Workspace GA in March 2026, enabling teams to blend internal documents (Gmail, Drive, Chat) with public web sources in unified report-generation workflows. April 2026 brought Google's Deep Research Max, adding Model Context Protocol (MCP) support for proprietary data integration, native visualization generation, and API access for asynchronous enterprise workflows (e.g., nightly automated due diligence). OpenAI's Deep Research continues scaling across ChatGPT's 700M weekly users.

Early-stage production deployments are emerging in specific use cases with documented ROI: legal teams conducting M&A due diligence autonomously across centuries of corporate records, reducing months of work to a single afternoon (90% time reduction); internal tests showing $225K of annual marketing spend replaced in one weekend via Perplexity Computer; financial analysis (Skywork case study: 93% citation accuracy, 15-25% speed gains); media research (GDELT Project: autonomous analysis of 500+ TV transcripts generating synthetic think tank reports at $2.14 per run, 100% unattended); and academic research with frontier models (AlphaLab: GPT-5.2 and Claude Opus 4.6 autonomously conducting multi-phase research with a 4.4x GPU kernel speedup and a 22% validation loss reduction). Infrastructure maturation continues: the LangChain Deep Agents framework, Perplexity's Agentic Research API, and standardized multi-model orchestration patterns lower barriers for custom workflows.

Yet critical reliability barriers persist at the architectural level. AutoResearchBench (May 2026) quantified fundamental defects in frontier models: Claude Opus 4.6 achieves only 9.39% accuracy on multi-step research tasks, GPT-5.4 7.44%, and Gemini 3.1 Pro 7.93%, revealing core failures in closure (state tracking, constraint verification) and evidence aggregation rather than in retrieval access. Practitioner testing across multiple domains confirms the pattern: all five major tools (Perplexity, ChatGPT, Gemini, Elicit, SPARKIT) identified the correct answer on a hard biomedical question, yet ChatGPT and Gemini lack production APIs and show 100x latency variance; nuanced academic research (Egyptian history, British constitutional law) produces citation hallucinations, false source attribution, and imprecision across all three major platforms. Multi-turn research failures compound across steps: the HORIZON benchmark reveals cumulative error degradation, context loss after 20+ actions, and faulty error recovery, demonstrating that short-horizon benchmarks do not predict long-horizon reliability. Source credibility failures are systematic: deep research agents treat citation volume as credibility, making them vulnerable to poisoned data (AI Slop Loop: fabricated articles cited as fact within 24 hours), and source credibility degrades across synthesis (50-90% of autonomous research citations remain unsupported). Pipeline desynchronization failures emerge in production (Gemini Deep Research: safety refusals triggered by escaped Markdown, topic drift in recovery, fabricated infographics with synthetic data unrelated to report content). A Washington State University study of 700+ scientific hypotheses shows ChatGPT at 76.5% accuracy but only 41% consistency across runs.

Governance and organizational readiness, not technology capability, now constrain adoption: Deloitte Tech Trends 2026 reports that only 11% of enterprises run agentic AI in production despite widespread pilots; Cisco's May 2026 survey shows that 85% of enterprises pilot agents yet only 5% trust them for production; Dynatrace reports that 88% of agents never reach production, with 80% of integration work spent on legacy system connectors and evaluation frameworks rather than model capability. An independent survey (Writer/Workplace Intelligence, April 2026) shows that 97% of enterprises have deployed agents but only 29% see ROI; 67% suffered data breaches through unapproved AI tools; 36% lack governance plans. Deep research is production-feasible for bounded use cases with human review, but autonomous research at enterprise scale remains blocked by reliability, governance, verification, and organizational-process barriers.
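The compounding of errors across steps noted above can be made concrete with a back-of-the-envelope model. The sketch below assumes every step succeeds independently with the same probability, which is a simplification and not the HORIZON benchmark's methodology; `chain_success` and the 95% per-step figure are illustrative assumptions, not numbers from any cited study.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Illustrative only: 95% per-step reliability is an assumed figure.
    for steps in (1, 5, 20, 50):
        print(f"{steps:>3} steps: {chain_success(0.95, steps):.1%}")
```

Under these assumptions, an agent that is right 95% of the time on each individual action completes a 20-action chain without error only about 36% of the time, which is one way to see why short-horizon scores say little about long-horizon reliability.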

TIER HISTORY

Research: Dec-2024 → Dec-2024

Bleeding Edge: Dec-2024 → present

EVIDENCE (78)

The Research Reality Gap: Why AI Agents are Scraping the Floor (Research Papers, 2026-05-04)

AutoResearchBench quantifies frontier model failures on 1,000 multi-step research tasks: Claude Opus 4.6 achieves 9.39% accuracy, GPT-5.4 7.44%, and Gemini 3.1 Pro 7.93%, pointing to fundamental closure and state-tracking defects in current research agents.

45 AI Agent Statistics You Need to Know in 2026 (Adoption Metrics, 2026-05-02)

Market snapshot: 51% of enterprises have agents in production; 40%+ of agentic projects are projected to be cancelled by 2027; deep research consolidates on 3 platforms, with Perplexity at $450M ARR.

A Strategic OSINT Assessment of AI Agent Deployment, Workforce Economics and Geopolitical Differentiation in 2026 (Industry Reports, 2026-05-01)

Strategic analysis: R&D/research agents are 'transformational at long time horizons'; documents real deployment challenges (integration complexity, autonomy failures in long-horizon reasoning) and the workforce economics of autonomous research scaling.

The Duel: ChatGPT Deep Research vs Google Gemini Deep Search (Case Studies, 2026-04-29)

Practitioner side-by-side: Gemini retrieved 211 sources vs ChatGPT's 48; Gemini completed in minutes vs ChatGPT's 14 minutes; Gemini showed a transparent reasoning plan. Real deployment metrics validating capability differentials.

Equip your agent with deep research in one line of code | SPARKIT (Case Studies, 2026-04-28)

All 5 deep research tools (Perplexity, ChatGPT, Gemini, Elicit, SPARKIT) identified the correct biomedical answer. Critical gaps: ChatGPT and Gemini lack production APIs, latency varies by 100x, and Gemini returns 6,000+ words vs Perplexity's 40.

85% of Enterprises Run AI Agents, Only 5% Trust Them for Production (Adoption Metrics, 2026-04-26)

Cisco RSA 2026: 85% pilot AI agents, 5% have production deployments; 88% experienced AI security incidents; real CEO examples show agents autonomously overriding policy and committing code without approval. These governance barriers constrain deep research at scale.

Enterprise AI Agents: 5 Proven Reasons 88% Fail Production in 2026 (Industry Reports, 2026-04-25)

Dynatrace 2026 Pulse: 88% of enterprise agents never reach production; 80% of integration time is spent on legacy system connectors, evaluation frameworks, and governance design, the core barriers for production deep research deployment.

AI deep research agents struggle with nuanced academic subjects despite bold claims (Opinion, 2026-04-24)

A historian's independent testing: all three agents (ChatGPT, Perplexity, Gemini) exhibit citation hallucinations, false source attribution, and imprecision on specialized academic questions, showing critical limits of autonomous research on nuanced topics.

HISTORY

2024-Q4: Google launches Gemini Deep Research in Gemini Advanced across 150+ countries as a flagship agentic feature; Perplexity Pro Search demonstrates 50% adoption lift via multi-step reasoning architecture. Industry adoption surveys show 68% of organizations have deployed AI agents, though ROI realization remains below 50%. Accuracy and hallucination challenges identified as key adoption barriers.
2025-Q1: OpenAI launches Deep Research (Feb 2025) as a dedicated deep research agent in ChatGPT Pro, using the o3 reasoning model for autonomous multi-step investigation; Google extends Gemini Deep Research to Workspace users. Perplexity benchmarks at 93.9% on SimpleQA. Critical analyses emerge describing agents as "fallible tools" rather than expert-level researchers; agentic RAG becomes the category's enabling architecture. The category transitions from experimental to mainstream availability across three major platforms.
2025-Q2: Perplexity reaches 15M active users (50% growth in 3 months); pursues $500M–$1B funding at $18B valuation target. Google I/O announces Flash 2.5 experimental support in Deep Research. Production-ready patterns emerge across platforms (OpenAI, Google, Perplexity, Claude/Anthropic); enterprise deployments adopt steerable workflows for controlled autonomy. Academic surveys document category advances; knowledge cutoff bias and information lag emerge as persistent reliability gaps at scale.
2025-Q3: Perplexity grows to 30M monthly active users (780M monthly queries) with 66% YoY growth; enterprises across banking, pharma, law adopt for mission-critical research (60% of Pro customers). Google's Gemini Deep Research achieves production status with Workspace integration and usage quotas. However, critical assessments surface: peer-reviewed medical research examines risks to citation integrity and research quality; user reports document hallucinations in current affairs research; MIT study shows 95% of GenAI pilots fail to reach production due to reliability, data quality, and governance barriers.
2025-Q4: Deep research consolidates around three major platforms (Perplexity, Gemini, OpenAI) with evidence of bounded production use (Skywork case study: 93% citation accuracy, 15-25% speed gains on market research reports). Perplexity's 100M+ interactions show 57% targeting research/learning. However, adoption ceiling persists: Gartner finds only 15% of IT leaders deploying fully autonomous agents (Oct), Deloitte reports 11% production deployment (Dec), and Gemini 3 Pro maintains 88% hallucination rate despite 53% accuracy lead. Domain-specific scientific research shows promise (energy materials agents), but governance, security, and reliability gaps constrain enterprise scaling. Practice matured from experimentation to selective production use but faces unresolved trustworthiness barriers.
2026-Jan: Mainstream business adoption reaches 67% of enterprises using AI research tools (Gartner); Perplexity achieves 370% YoY user growth and 14.1% market share. However, scaling barriers persist: 62% of organizations have experimented with agentic workflows, but 70-80% struggle to scale and only 5% achieve ROI. An EBU/BBC study reveals that 45% of AI research responses contain errors; Gemini exhibits sourcing problems in 72% of responses. LangChain releases the Deep Agents framework, enabling multi-step task decomposition through subagents. Deep research remains viable for exploratory, non-critical use but unsuitable for mission-critical workflows requiring reliability and governance.
2026-Feb: Platform consolidation continues with Perplexity reaching 33M monthly active users (20.8% research-focused queries) and Google releasing Gemini 3.1 Pro with upgraded Deep Think model. Specialized research agents emerge: DeepMind's Aletheia achieves 91.9% on mathematical reasoning benchmarks and autonomously co-authors published papers. However, HalluHard benchmark reveals state-of-the-art models still hallucinate ~30% in multi-turn conversations even with web search. Princeton-backed analysis shows 18 months of model capability gains have not improved production agent reliability, widening the gap between capability and trustworthiness.
2026-Mar: Gemini Deep Research reached Workspace GA, integrating Gmail, Drive, and Chat with web sources in unified report-generation workflows. Perplexity Computer expanded to desktop (Mac Mini with audit trails), enterprise (Snowflake/Salesforce integration), and the Comet Enterprise browser, adding four new APIs (Search, Agent, Embeddings, Sandbox) and orchestrating 20 models. Reliability benchmarks remained sobering: a Washington State University study of 700+ scientific hypotheses found ChatGPT at 76.5% accuracy but only 41% consistency across runs; Google's DeepFact paper showed PhD-level experts improve factuality evaluation from 60.8% to 81%+ only when benchmarks are iteratively refined, highlighting benchmark brittleness as a compounding barrier. CrewAI's enterprise survey found 81% claim to be scaling agentic AI but only 11% have agents in production, with 38% stuck in pilots and Gartner predicting 40% of agentic projects cancelled by 2027, underscoring that deep research capability continues to outpace organisational readiness to deploy and govern it.
2026-Apr: Product launches and reliability failures defined the month in parallel. Google launched Deep Research Max with Gemini 3.1 Pro, adding MCP support for proprietary data integration and native chart generation for asynchronous enterprise workflows; Perplexity Computer launched multi-model orchestration (Claude Opus reasoning, Gemini Deep Research, GPT-4 drafting) with sub-agent parallelisation and background workflows running hours or days unattended, reaching $450M ARR. The Stanford HAI 2026 AI Index documented agents achieving only ~50% of PhD specialist performance on complex research workflows; AlphaLab demonstrated GPT-5.2 and Claude Opus 4.6 autonomously conducting multi-phase research with a 4.4x GPU kernel speedup at $150-200 per campaign. The AI Slop Loop case surfaced a systemic failure: fabricated SEO articles were cited as fact by Perplexity within 24 hours, revealing model collapse through poisoned sources. An independent survey of 2,400 enterprises found 97% have deployed AI agents but only 29% see ROI, with 67% suffering data breaches via unapproved tools and 36% lacking governance plans, which helps explain why deep research agents remain at the bleeding edge despite product maturity. Named Google Gemini Enterprise deployments (Macquarie Bank 38% engagement lift, JCOM analysing 100k+ conversations monthly) and an M&A due diligence case study (90% time reduction, months to a single afternoon) showed bounded autonomous research delivering measurable value in governed contexts.
2026-May: Empirical benchmarking and deployment analysis consolidated the May picture. AutoResearchBench revealed frontier model failures at the core architectural level: Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro all scored 7-9% on 1,000 multi-step research tasks due to closure defects (state tracking, constraint verification). Practitioner testing in biomedical research confirmed that all major tools identified correct answers but revealed critical production gaps: ChatGPT and Gemini lack production APIs and show 100x latency variance. A historian's independent testing found that all three major agents (ChatGPT, Perplexity, Gemini) exhibited citation hallucinations and false source attribution on specialized academic research (Egyptology, constitutional law), showing critical limits on nuanced research tasks. Governance barriers were quantified: the Cisco RSA survey showed 85% piloting vs 5% in production deployment, with 88% of organizations experiencing AI security incidents; Dynatrace reported that 88% of agents never reach production, with 80% of integration time consumed by legacy system connectors and evaluation frameworks. Market adoption: 51% of enterprises have agents in production (up from ~11% in 2025-Q4); 40%+ of agentic projects are projected to be cancelled by 2027 due to escalating costs and unclear value. Perplexity reached $450M ARR, validating commercial viability, yet broader enterprise adoption is constrained by architecture-level reliability gaps and organizational process redesign barriers. The trend remains: production-viable for specific, bounded use cases (legal due diligence, marketing optimization); unsuitable for mission-critical autonomous research without human verification.

TOOLS

Gemini Deep Research, Perplexity Pro