The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI agents that conduct multi-step research autonomously: formulating queries, reading sources, following leads, and synthesising findings. Includes tools like Gemini Deep Research and Perplexity Pro; distinct from single-query retrieval, which answers from a single search round.
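The difference between multi-round research and single-query retrieval can be illustrated with a minimal sketch. Everything here is hypothetical: `TOY_CORPUS`, `search`, and `deep_research` are illustrative stand-ins, not any vendor's API, and a real agent would call a search engine and a language model where this sketch looks up a dictionary.

```python
# Hypothetical in-memory "web": each query maps to (finding, follow-up leads).
TOY_CORPUS = {
    "deep research agents": ("Agents run multiple search rounds.",
                             ["agent reliability", "agent orchestration"]),
    "agent reliability": ("Long-horizon accuracy remains limited.",
                          ["citation accuracy"]),
    "agent orchestration": ("Orchestration design shifts benchmark results.", []),
    "citation accuracy": ("Many generated citations are unsupported.", []),
}


def search(query):
    """Stand-in for a real search-and-read step; returns (finding, leads)."""
    return TOY_CORPUS.get(query, (None, []))


def deep_research(seed_query, max_rounds=3):
    """Multi-round loop: query, read, follow new leads, collect findings."""
    seen = {seed_query}
    frontier = [seed_query]
    findings = []
    for round_no in range(1, max_rounds + 1):
        next_frontier = []
        for query in frontier:
            finding, leads = search(query)
            if finding is not None:
                findings.append((round_no, query, finding))
            for lead in leads:
                if lead not in seen:  # never re-run a query already issued
                    seen.add(lead)
                    next_frontier.append(lead)
        if not next_frontier:  # no new leads: research has converged
            break
        frontier = next_frontier
    return findings


def single_query(seed_query):
    """Single-query retrieval is the same loop capped at one round."""
    return deep_research(seed_query, max_rounds=1)
```

With the toy corpus, `deep_research("deep research agents")` collects findings over three rounds, while `single_query` stops after the first. The synthesis step is omitted; in practice it would summarise `findings` with citations back to each query.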
Multi-step autonomous deep research, in which AI agents formulate queries, read sources, follow leads across multiple rounds, and synthesise findings without human intervention, has crossed from consumer experimentation into bounded production deployment. Three major platforms (Google, OpenAI, Perplexity) now offer enterprise-grade deep research features deployed at scale, with Perplexity Computer reaching $450M ARR and Google's Deep Research Max launching native multi-model orchestration for asynchronous enterprise workflows. Yet the practice remains fundamentally constrained by two unresolved gaps.
The first is reliability. Multi-turn research accuracy gaps persist across all platforms: the Stanford HAI 2026 AI Index shows agents achieving only ~50% of PhD specialist performance on complex workflows; WildClawBench shows best-in-class models reaching only 62.2% on 60 realistic long-horizon tasks; AutoExperiment shows frontier agents collapsing from 30-37% accuracy to 6-10% when research tasks have cross-function dependencies; and multi-model consensus breaks down, with 99.1% of real-world turns showing contradictions across frontier models and Gemini's confident single-model answers drawing a 51.3% contradiction rate from peers. A Princeton-backed analysis documented that eighteen months of model capability gains yielded zero reliability improvement for production agents.
The second is organizational scaling. Only 23% of enterprises scale agentic systems enterprise-wide despite 88% using AI somewhere, and Gartner forecasts that 40%+ of agentic projects will be cancelled by 2027 due to governance problems and unclear ROI. Deep research agents are fully available and deployed in early-adopter teams, but orchestration matters as much as models (multi-agent systems achieve a 90.2% improvement over single-agent systems), and the majority of organizations have not yet solved the verification frameworks, context governance, and process redesign required for autonomous research at scale.
For exploratory research and bounded decision support with human review, deep research delivers measurable acceleration. For mission-critical analysis, autonomous research remains a supervised tool requiring verification boundaries.
By June 2026, deep research has consolidated around three major vendor platforms (Google, OpenAI, Perplexity) with production deployments across enterprise and consumer markets. Perplexity Computer, launched in February 2026, orchestrates 20 AI models for multi-step autonomous research tasks, with enterprise deployment at $200/month; it has since expanded to desktop (Mac Mini with audit trails, Snowflake/Salesforce integration), the Comet Enterprise browser, and four new APIs (Search, Agent, Embeddings, Sandbox). Perplexity Computer reached $450M ARR (50% monthly growth) with 100M+ monthly active users and tens of thousands of enterprise clients executing multi-step workflows (document review, campaign planning, tax filing automation, investment analysis). Google's Deep Research Max (April 2026) adds Model Context Protocol (MCP) support for proprietary data integration, native chart generation, and API access for asynchronous enterprise workflows. Google's Gemini Deep Research achieved Workspace GA, enabling teams to synthesize internal documents (Gmail, Drive, Chat) with public web sources; a technical review shows Deep Research Max synthesizing 30-60 sources through iterative reasoning (4-7 search iterations) and completing in 3-10 minutes. OpenAI's Deep Research continues scaling across ChatGPT's 700M weekly users. Perplexity's user base reached 33M monthly active users, with 20.8% of queries targeting research and learning.
Early-stage production deployments achieve measurable results in specific use cases: legal teams conducting M&A due diligence autonomously across centuries of corporate records, reducing months of work to single afternoons (90% time reduction); a sales research pipeline (Gemini Deep Research → NotebookLM → Google Sheets) producing 0-16 qualification scores with cited evidence in under 20 minutes per company; financial analysis (Skywork case study: 93% citation accuracy, 15-25% speed gains); media research (GDELT Project: unattended autonomous analysis of 500+ TV transcripts, generating think tank reports at $2.14 per run); and academic research with frontier models (AlphaLab: multi-phase research with a 4.4x GPU kernel speedup). Infrastructure is maturing alongside these deployments: the LangChain Deep Agents framework, standardized multi-model orchestration patterns, and verification boundaries now enable production control.
Yet critical reliability barriers persist. WildClawBench shows best-in-class models reaching only 62.2% on 60 realistic long-horizon tasks; orchestration architecture alone shifts performance by up to 18 percentage points, so architectural factors matter as much as model capability. AutoResearchBench quantified closure and state-tracking defects: on multi-step research tasks, Claude Opus 4.6 achieves only 9.39%, GPT-5.4 7.44%, and Gemini 3.1 Pro 7.93%. AutoExperiment shows frontier agents collapsing from 30-37% accuracy on single-function tasks to 6-10% when functions have cross-function dependencies, because agents fail on implicit data-flow reasoning. Multi-model consensus breaks down: 99.1% of production turns show contradictions across frontier models, and Gemini's high-confidence answers draw a 51.3% contradiction rate from peers, which shows that single-model confidence is unreliable. Source credibility failures are systematic: treating citation volume as credibility leaves agents vulnerable to poisoned data (AI Slop Loop: fabricated articles cited as fact within 24 hours), and 50-90% of autonomous research citations remain unsupported. A Washington State University study shows ChatGPT at 76.5% accuracy but only 41% consistency across runs. Governance constrains adoption: only 23% of enterprises scale agentic systems enterprise-wide despite 88% using AI somewhere; Deloitte reports 11% production deployment; Cisco's survey shows 85% pilot vs 5% production trust; Dynatrace reports that 88% never reach production. Deep research is production-feasible for bounded use cases with human review and verification boundaries, but autonomous research at enterprise scale remains blocked by architectural reliability gaps, organizational-process barriers, and governance immaturity.
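The two consensus figures above measure different things: the share of turns where any models disagree, and the share of one model's answers that a peer contradicts. The sketch below makes that distinction concrete under a strong simplifying assumption, namely that two answers "contradict" whenever their normalised strings differ. The cited studies' actual methodology is not described in this section, and the data and function names are hypothetical.

```python
def turn_has_contradiction(answers):
    """True if at least two models gave different answers on this turn."""
    return len(set(answers.values())) > 1


def contradiction_rate(turns):
    """Share of turns on which any pair of models disagrees."""
    return sum(turn_has_contradiction(t) for t in turns) / len(turns)


def peer_contradiction_rate(turns, model):
    """Share of turns on which at least one peer disagrees with `model`."""
    contradicted = 0
    for answers in turns:
        own = answers[model]
        if any(a != own for name, a in answers.items() if name != model):
            contradicted += 1
    return contradicted / len(turns)


# Hypothetical toy data: one dict per turn, mapping model name to its answer.
TOY_TURNS = [
    {"model_a": "yes", "model_b": "yes", "model_c": "yes"},
    {"model_a": "yes", "model_b": "no", "model_c": "yes"},
    {"model_a": "42", "model_b": "42", "model_c": "41"},
    {"model_a": "paris", "model_b": "paris", "model_c": "paris"},
]
```

A verification boundary could use a check like `turn_has_contradiction` as a trigger for human review rather than trusting any single model's confidence. Real answers are free text, so a production version would need semantic comparison instead of string equality.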
— Perplexity Computer integration: BrowseComp +43pp (40.7%→83.8%), Humanity's Last Exam +14pp; 'Search as Code' paradigm with parallel retrieval/filtering; rolling to Max tier and Agent API—multi-model orchestration reaching production scale with measured benchmark improvements.
— SciAgentArena benchmark (Stanford/MIT/Harvard): agents effective on well-specified data workflows but struggle with multi-constraint optimization, novel insight generation, and unsupported claim detection—defining boundaries of autonomous research capability.
— Zhao et al. audited 2.5M arXiv/bioRxiv/PubMed papers: 146,932 hallucinated citations identified in 2025; rate rose from 1/2,828 papers (2023) to 1/277 (early 2026); demonstrates widespread deployment of deep research tools at scale with endemic citation failure modes.
— First rigorous benchmark: Claude Code achieved only 21.5% on 40 real scientific re-discovery tasks; error modes (experimental mismatch, evidence gaps, missing core) concentrated; reveals critical gap between market adoption and measured research agent reliability.
— Production study (Feb-May 2026): Perplexity Computer achieved 26-minute autonomous execution per session vs 33 seconds for Search (48× increase); 87% task time reduction, 94% cost savings on matched 10k sessions; 23% novel task expansion showing scope amplification.
— Adoption-to-production gap: 79% claim agents, 11% in production; 88% of pilots fail; Gartner forecasts 40%+ cancellation by 2027; successful deployments (Klarna $60M, Salesforce 380k interactions) follow narrow scope + measurable output—critical scaling barrier for deep research.
— CHARM framework (arXiv June 3, 2026) formalizes cascading hallucinations in multi-step RAG pipelines; existing detectors catch only 12.8–41.7% of failures; LLM self-correction counterproductive (12.8% detection); identifies fundamental pipeline reliability barrier for deep research.
— Deep Research Max API benchmarks: 93.3% on DeepSearchQA (vs 66.1% in Dec 2025, a +27.2pp gain); MCP support for proprietary data; async background workflows up to 60 minutes; API GA (April 21, 2026). Technical capability maturation documented.
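Several of the items above report benchmark movement as a pair of scores plus a delta. A percentage-point (pp) gain is the plain difference between two percentages, which is not the same as the relative gain. The helper below is a small sketch, with hypothetical function names, that recomputes both from the before and after scores quoted for BrowseComp (40.7% to 83.8%) and DeepSearchQA (66.1% to 93.3%).

```python
def pp_gain(old_pct, new_pct):
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(new_pct - old_pct, 1)


def relative_gain(old_pct, new_pct):
    """Gain relative to the starting score, as a percentage, one decimal."""
    return round((new_pct - old_pct) / old_pct * 100, 1)


# Before and after scores as quoted in the items above.
QUOTED_SCORES = {
    "BrowseComp": (40.7, 83.8),
    "DeepSearchQA": (66.1, 93.3),
}

for name, (old, new) in QUOTED_SCORES.items():
    print(f"{name}: +{pp_gain(old, new)}pp absolute, "
          f"+{relative_gain(old, new)}% relative")
```

On the quoted scores, BrowseComp moves by 43.1 points (about 105.9% relative) and DeepSearchQA by 27.2 points (about 41.1% relative). Stating which of the two is meant avoids reading a relative gain as a point gain.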