The AI landscape doesn't move in one direction; it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organizational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain: classified by maturity, evidenced by real-world adoption, and updated daily, so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in one or two domains, delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI agents that conduct multi-step research autonomously: formulating queries, reading sources, following leads, and synthesizing findings. Includes tools such as Gemini Deep Research and Perplexity Pro; distinct from single-query retrieval, which answers from a single search round.
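Mechanically, the pattern is an iterative loop: plan a query, search, read, note findings, then let the model decide the next query. A minimal Python sketch of that loop follows; `llm`, `search`, and `fetch` are hypothetical stand-ins for a model call, a search API, and a page fetcher, not any vendor's actual interface.

```python
# Minimal multi-step research loop: plan -> search -> read -> follow leads -> synthesize.
# `llm`, `search`, and `fetch` are hypothetical stand-ins; no vendor API is implied.

def llm(prompt: str) -> str:
    """Stand-in for a call to any instruction-following model."""
    raise NotImplementedError

def search(query: str, k: int = 5) -> list[str]:
    """Stand-in for a web-search API returning result URLs."""
    raise NotImplementedError

def fetch(url: str) -> str:
    """Stand-in for fetching a page and extracting its text."""
    raise NotImplementedError

def deep_research(question: str, max_rounds: int = 4) -> str:
    notes: list[str] = []
    query = llm(f"Formulate a web search query for: {question}")
    for _ in range(max_rounds):
        for url in search(query):
            page = fetch(url)
            notes.append(llm(
                f"Extract facts relevant to '{question}', tagging each fact "
                f"with its source URL {url}:\n{page[:4000]}"
            ))
        # Follow leads: the model decides what is still unknown and re-queries.
        query = llm(
            "Given these notes, what single follow-up search query would close "
            f"the biggest remaining gap for '{question}'?\n" + "\n".join(notes)
        )
    return llm(f"Synthesize a cited report answering '{question}' from:\n" + "\n".join(notes))
```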
Multi-step autonomous deep research (AI agents that formulate queries, read sources, follow leads across multiple rounds, and synthesize findings without human intervention) has crossed from consumer experimentation into bounded production deployment. Three major platforms (Google, OpenAI, Perplexity) now offer enterprise-grade deep research features deployed at scale: Perplexity Computer has reached $450M ARR, and Google's Deep Research Max has launched native multi-model orchestration (Claude Opus reasoning, Gemini Deep Research, GPT-4 drafting) for asynchronous enterprise workflows.

Yet the practice remains fundamentally constrained by two unresolved gaps. First, reliability. Accuracy gaps in multi-turn research persist across all platforms: the Stanford HAI 2026 AI Index shows agents achieving only ~50% of PhD-specialist performance on complex research workflows (39% on autonomous paper interaction); Google's FACTS benchmark puts Gemini 3 Pro at 69% accuracy on fact-based research tasks; and HalluHard demonstrates ~30% hallucination rates even with web search enabled. A Princeton-backed analysis documented that eighteen months of model-capability gains yielded zero reliability improvement for production agents, widening the gap between capability and trustworthiness. Second, organizational scaling. Across 2025-2026, 97% of enterprises deployed AI agents, yet only 11% run them in production, and 40% of agentic projects are projected to fail by 2027 on governance, process-design, and ROI-realization barriers.

Deep research agents are fully available, technically feasible, and deployed in early-adopter teams, but most organizations have not yet built the process redesign, governance, and verification frameworks required for autonomous research at organizational scale. For exploratory research and bounded decision support, deep research delivers measurable acceleration. For mission-critical analysis, autonomous research remains a supervised tool, not a replacement for human verification.
By April 2026, deep research has consolidated around three major vendor platforms (Google, OpenAI, Perplexity), with production deployments across enterprise and consumer markets. Perplexity Computer, launched February 2026, orchestrates 20 AI models for multi-step autonomous research tasks with enterprise deployment at $200/month; it expanded in March 2026 to desktop (Mac Mini with audit trails and Snowflake/Salesforce integration), the Comet Enterprise browser, and four new APIs (Search, Agent, Embeddings, Sandbox). April 2026 brought Google's Deep Research Max, adding Model Context Protocol (MCP) support for proprietary data integration, native visualization generation, and API access for asynchronous enterprise workflows such as nightly automated due diligence. Perplexity Computer reached $450M ARR (50% monthly growth in March 2026), with 100M+ monthly active users and tens of thousands of enterprise clients executing multi-step workflows (document review, campaign planning, tax-filing automation). Google's Gemini Deep Research achieved Workspace general availability in March 2026, letting teams blend internal documents (Gmail, Drive, Chat) with public web sources in unified report-generation workflows. OpenAI's Deep Research continues scaling across ChatGPT's 700M weekly users, while Perplexity's user base reached 33M monthly active users in Q1 2026, with 20.8% of queries targeting research and learning.
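The asynchronous enterprise pattern described above (submit a long-running research brief, collect the report later) reduces to a submit-and-poll loop. The sketch below assumes a hypothetical REST API; the endpoint, payload fields, and auth scheme are placeholders, not Google's or Perplexity's actual surface.

```python
# Sketch of the asynchronous "nightly due diligence" pattern.
# Endpoint, payload fields, and auth header are hypothetical placeholders.
import os
import time
import requests

API = "https://api.example-research-vendor.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['RESEARCH_API_KEY']}"}

def submit_job(brief: str) -> str:
    """Kick off a long-running research task; returns a job id."""
    resp = requests.post(f"{API}/research-jobs", headers=HEADERS,
                         json={"brief": brief, "depth": "deep"}, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]

def await_report(job_id: str, poll_s: int = 60) -> dict:
    """Poll until the job finishes; deep research jobs can run for hours."""
    while True:
        resp = requests.get(f"{API}/research-jobs/{job_id}",
                            headers=HEADERS, timeout=30)
        resp.raise_for_status()
        job = resp.json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_s)

if __name__ == "__main__":
    # e.g. launched from cron at 02:00 so the report is ready each morning
    job_id = submit_job("Overnight due-diligence scan of Acme Corp filings and press.")
    print(await_report(job_id)["report"])
```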
Early-stage production deployments are emerging in specific use cases with documented ROI: legal teams conducting M&A due diligence autonomously across centuries of corporate records, compressing months of work into single afternoons (90% time reduction); an internal test replacing $225K of annual marketing spend in one weekend via Perplexity Computer; financial analysis (Skywork case study: 93% citation accuracy, 15-25% speed gains); media research (GDELT Project: autonomous analysis of 500+ TV transcripts generating synthetic think-tank reports at $2.14 per run, 100% unattended); and academic research with frontier models (AlphaLab: GPT-5.2 and Claude Opus 4.6 autonomously conducting multi-phase research, with a 4.4x GPU-kernel speedup and a 22% validation-loss reduction). Infrastructure maturation continues: the LangChain Deep Agents framework, Perplexity's Agentic Research API, and standardized multi-model orchestration patterns lower barriers for custom workflows.
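For illustration, the multi-model orchestration pattern those frameworks standardize typically splits roles across models: a reasoning-strong planner, retrieval-tuned workers, and a drafting-strong synthesizer. A minimal sketch, with `call_model` as a hypothetical provider-agnostic wrapper and purely illustrative role names:

```python
# Planner/worker/synthesizer orchestration sketch. `call_model` is a
# hypothetical uniform wrapper over several providers; role names are
# illustrative, not real model identifiers.

def call_model(role: str, prompt: str) -> str:
    """Stand-in for a provider-agnostic completion call routed by role."""
    raise NotImplementedError

def orchestrated_research(question: str) -> str:
    # 1. A reasoning-strong model decomposes the question into sub-tasks.
    plan = call_model(
        "planner",
        f"Break '{question}' into 3-5 independent research sub-questions, one per line.")
    # 2. A retrieval-tuned model answers each sub-question in isolation.
    findings = [
        call_model("researcher", f"Research and answer, citing sources: {sub}")
        for sub in plan.splitlines() if sub.strip()
    ]
    # 3. A drafting-strong model merges the findings into one cited report.
    return call_model(
        "drafter",
        f"Write a unified, cited report answering '{question}' from:\n\n"
        + "\n\n".join(findings))
```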
Yet critical reliability barriers persist at the architectural level. AutoResearchBench (May 2026) quantified fundamental defects in frontier models: Claude Opus 4.6 achieves only 9.39% accuracy on multi-step research tasks, GPT-5.4 7.44%, and Gemini 3.1 Pro 7.93%, revealing core failures in closure (state tracking, constraint verification) and evidence aggregation rather than in retrieval access. Practitioner testing across multiple domains confirms the pattern: all five major tools (Perplexity, ChatGPT, Gemini, Elicit, SPARKIT) identified the correct answer on a hard biomedical question, yet ChatGPT and Gemini lack production APIs and show 100x latency variance, and nuanced academic research (Egyptian history, British constitutional law) produces citation hallucinations, false source attribution, and imprecision across all three major platforms. Multi-turn research failures compound across steps: the HORIZON benchmark reveals cumulative error degradation, context loss after 20+ actions, and faulty error recovery, demonstrating that short-horizon benchmarks do not predict long-horizon reliability.

Source-credibility failures are systematic: deep research agents treat citation volume as credibility, making them vulnerable to poisoned data (the AI Slop Loop: fabricated articles cited as fact within 24 hours), and credibility degrades across synthesis, with 50-90% of autonomous research citations remaining unsupported. Pipeline-desynchronization failures emerge in production (Gemini Deep Research: safety refusals triggered by escaped Markdown, topic drift during recovery, fabricated infographics with synthetic data unrelated to report content). A Washington State University study of 700+ scientific hypotheses shows ChatGPT at 76.5% accuracy but only 41% consistency across runs.

Governance and organizational readiness, not technology capability, now constrain adoption. Deloitte Tech Trends 2026 reports only 11% of enterprises run agentic AI in production despite widespread pilots; Cisco's May 2026 survey shows 85% of enterprises piloting agents yet only 5% trusting them in production; Dynatrace reports 88% of agents never reach production, with 80% of integration work spent on legacy-system connectors and evaluation frameworks rather than model capability. An independent survey (Writer/Workplace Intelligence, April 2026) shows 97% of enterprises deployed agents but only 29% see ROI; 67% suffered data breaches through unapproved AI tools; 36% lack governance plans. Deep research is production-feasible for bounded use cases with human review, but autonomous research at enterprise scale remains blocked by reliability, governance, verification, and organizational-process barriers.
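Given that 50-90% of citations can go unsupported, one piece of the verification framework organizations keep missing is mechanical: check every citation before a report ships, and route failures to a human reviewer. A naive sketch follows (real systems need HTML text extraction, fuzzy matching, and archival fallbacks; `requests` is assumed installed):

```python
# Naive citation-verification gate: confirm each cited URL resolves and
# actually contains the quoted supporting text; collect failures for review.
import requests

def verify_citation(url: str, quoted_support: str) -> bool:
    """True only if the page loads and the quoted text appears in it.

    Substring matching against raw HTML is deliberately strict; production
    verification would extract visible text and fuzzy-match instead.
    """
    try:
        resp = requests.get(url, timeout=15)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    return quoted_support.casefold() in resp.text.casefold()

def gate_report(citations: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (url, quote) pairs that failed, for human review."""
    return [(url, q) for url, q in citations if not verify_citation(url, q)]
```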
— AutoResearchBench quantifies frontier-model failures on 1,000 multi-step research tasks: Claude Opus 4.6 achieves 9.39% accuracy, GPT-5.4 7.44%, and Gemini 3.1 Pro 7.93%, pointing to fundamental closure and state-tracking defects in current research agents.
— Market snapshot: 51% of enterprises have agents in production; 40%+ of agentic projects are projected to be cancelled by 2027; deep research consolidates on three platforms, with Perplexity at $450M ARR.
— Strategic analysis: rates R&D/research agents 'transformational at long time horizons'; documents real deployment challenges (integration complexity, autonomy failures in long-horizon reasoning) and the workforce economics of scaling autonomous research.
— Practitioner side-by-side: Gemini retrieved 211 sources vs ChatGPT's 48, completed in minutes vs ChatGPT's 14, and showed a transparent reasoning plan; real deployment metrics validating capability differentials.
— All five deep research tools (Perplexity, ChatGPT, Gemini, Elicit, SPARKIT) identified the correct biomedical answer; critical gaps: ChatGPT and Gemini lack production APIs, latency varies 100x, and Gemini returns 6,000+ words against Perplexity's 40.
— Cisco RSA 2026: 85% of enterprises pilot AI agents but only 5% deploy them to production; 88% experienced AI security incidents; real CEO examples show agents autonomously overriding policy and committing code without approval; governance barriers constraining deep research at scale.
— Dynatrace 2026 Pulse: 88% of enterprise agents never reach production; 80% of integration time goes to legacy-system connectors, evaluation frameworks, and governance design; core barriers for production deep research deployment.
— Historian's independent testing: all three agents (ChatGPT, Perplexity, Gemini) exhibit citation hallucinations, false source attribution, and imprecision on specialized academic questions; critical limits of autonomous research on nuanced topics.