Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain.

[Chart: domains plotted along an axis running from Bleeding Edge to Established]

Single-query research retrieval & summary

LEADING EDGE

TRAJECTORY

Stalled

AI that retrieves relevant information from a single search query and synthesises a coherent answer with source attribution. Includes search-augmented generation and cited responses; distinct from deep research which conducts multi-step autonomous investigation.

OVERVIEW

Single-query research retrieval has crossed into production at forward-leaning organisations, but a persistent reliability gap keeps it from becoming standard infrastructure. The practice combines information retrieval with generative AI: a system executes one or a few searches, retrieves relevant sources, and synthesises a cited answer in a single pass. Perplexity, You.com, ChatGPT with browsing, and enterprise RAG deployments all embody this pattern. Adoption is real and growing fast: Perplexity alone has surpassed 50 million monthly active users, and two-thirds of B2B buyers report using AI search tools. Yet the same deployments surface a stubborn accuracy ceiling: hallucinations, misattributed citations, and query-specific failures that aggregate metrics routinely mask. Architectural advances such as hybrid retrieval and adaptive depth are narrowing the gap, not closing it. The defining tension of the leading-edge tier is exactly this: organisations are deploying at scale while knowingly accepting systematic factuality risks that no shipping product has resolved.
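The single-pass pattern described above (one retrieval step, then one synthesis step with source attribution) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not any vendor's implementation: the `Source` type, the term-overlap scoring, and the quoting-based "synthesis" are all hypothetical stand-ins for a real search index and a generative model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    url: str   # hypothetical identifier for the source
    text: str  # passage content


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, corpus: list[Source], k: int = 2) -> list[Source]:
    """Rank sources by term overlap with the query; keep the top k that overlap at all."""
    q = tokenize(query)
    scored = [(len(q & tokenize(s.text)), s) for s in corpus]
    scored = [(score, s) for score, s in scored if score > 0]
    # list.sort is stable, so ties keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


def answer_with_citations(query: str, corpus: list[Source], k: int = 2) -> str:
    """Single pass: one retrieval, then one synthesis step that cites every source used."""
    hits = retrieve(query, corpus, k)
    if not hits:
        return "No supporting sources found."
    # Stand-in for the generative step: quote each passage with a numbered marker.
    body = " ".join(f"{s.text} [{i}]" for i, s in enumerate(hits, start=1))
    refs = "\n".join(f"[{i}] {s.url}" for i, s in enumerate(hits, start=1))
    return f"{body}\n{refs}"
```

Because the sketch only quotes retrieved passages, every sentence is trivially attributable to its marker. A real system paraphrases, which is where the misattribution and hallucination risks discussed in this entry arise.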

CURRENT LANDSCAPE

The vendor ecosystem is broad and hardening toward enterprise infrastructure. Perplexity has reached 45-50 million monthly active users and $454M ARR (50% monthly growth), and has secured a $750M, three-year Microsoft Azure partnership for production-scale deployment; You.com processes over one billion API queries per month for enterprise customers including Alibaba and DuckDuckGo; Databricks has launched Instructed Retrieval, a hybrid deterministic-probabilistic search approach yielding 35-50% recall gains; and Google Cloud now offers a production RAG platform on Vertex AI with hybrid search and re-rankers.
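For readers unfamiliar with hybrid search, one widely used way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique only; it is an assumption for illustration and is not claimed to be how Databricks Instructed Retrieval or Vertex AI fuse results. The document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a keyword index and a vector index.
keyword_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists (here `d1` and `d3`) rise above documents found by only one retriever, which is the intuition behind hybrid retrieval. A re-ranker would typically then rescore the fused top results.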

Enterprise deployments confirm measurable productivity value, with named evidence. Ontop, a global payroll company, deployed enterprise AI search that cut response time for legal compliance questions from 20 minutes to 20 seconds, saving its legal team 130 hours a month with a 60% query acceptance rate. A 300-person manufacturing firm switched from Google to Perplexity Enterprise and, using enterprise index integration, shortened competitive-analysis research cycles from 7 days to 2 days. The Cleveland Cavaliers use Perplexity across 15+ teams and report 10+ hours saved per employee per week; Databricks documents savings of 5,000 working hours a month. Yet a wide gap separates usage from bottom-line impact: 71% of organisations use generative AI regularly, but only 17% attribute more than 5% of earnings to it. Conversion analysis shows AI-cited traffic converting at 14.2% versus a 2.8% organic baseline: high intent, but constrained by reliability.

Reliability remains the binding constraint, with May 2026 evidence sharpening the gap between deployment scale and citation accuracy. Critical failures are documented across medical, legal, and peer-review contexts: the Royal College of Surgeons (April 2026) found 25-34% of medical references fabricated in chatbot responses; a Lancet study (May 2026) identified a 12-fold rise in fake citations across 2.5 million biomedical papers since 2023; and attorneys in four US jurisdictions (Pennsylvania, Northern California, Georgia, California) were sanctioned for using AI-generated false citations in court filings. Platform divergence matters: Perplexity shows 78% citation coverage on complex research queries versus ChatGPT's 62%, but there is only 11% citation overlap between the two systems on identical queries, suggesting that results depend heavily on platform-specific architecture. Quality improvements show frontier models reaching 1.0-2.5% hallucination on summarization tasks (down from 3-8% in 2023), yet hallucination rates vary 5-15x by topic class. The CRAG benchmark (Meta/HKUST, 4,409 QA pairs) documents the capability ceiling: advanced LLMs achieve ≤34% accuracy, basic RAG 44%, and state-of-the-art solutions 63% without hallucination. Architectural research (SIRA, May 2026) shows single-pass retrieval can be optimized through LLM-guided corpus discrimination, compressing multi-round search into single queries while outperforming dense retrievers. However, practitioner case studies remain clear: audit findings show that 29% of citations reach conclusions misaligned with the source content despite nominally correct references. Standard evaluation metrics continue to mask query-specific catastrophes: a system can score 78% precision on average while failing systematically on the queries users care most about. No published breakthrough has closed this gap between deployment momentum and operational reliability.
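The point about aggregate metrics masking query-specific failures can be made concrete with a small calculation. The sketch below assumes each evaluated query carries a slice label (such as a topic class) and a binary judgment of whether its citation was correct. The function names and the sample judgments are invented for illustration and do not come from any study cited here.

```python
from collections import defaultdict


def overall_precision(results: list[tuple[str, bool]]) -> float:
    """Fraction of all judged citations that were correct."""
    return sum(correct for _, correct in results) / len(results)


def precision_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Precision computed separately for each slice label."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for label, correct in results:
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Invented judgments: 80 general queries (74 correct) and 20 legal queries (4 correct).
sample = (
    [("general", True)] * 74
    + [("general", False)] * 6
    + [("legal", True)] * 4
    + [("legal", False)] * 16
)

aggregate = overall_precision(sample)   # 78 correct out of 100
per_slice = precision_by_slice(sample)  # general: 74/80, legal: 4/20
```

In this invented sample the aggregate is 78%, yet the legal slice is correct only one time in five. Reporting per-slice results alongside the average is one simple way to expose that kind of failure.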

TIER HISTORY

Research: Nov-2022 → Nov-2022
Bleeding Edge: Nov-2022 → Jan-2025
Leading Edge: Jan-2025 → present

EVIDENCE (116)

— Empirical analysis of 28,870 source events reveals 71% of sources exclusive to single model; 16–59% pairwise overlap across engines. Documents structural divergence in single-query retrieval architecture rather than convergence toward standard.

— 2026 ACM Web Conference study: high-quality synthetic content snowballs into 80%+ of top results while accuracy metrics stay reassuring—retrieval systems drift onto synthetic evidence invisibly. Critical systemic failure mode of single-query systems at scale.

— Anthropic web search API enables Claude to autonomously decide when to search, refine queries, and return cited results. Customizable domain allowlists, web search integrated into Claude Code beta. Shows major vendor expanding into single-query research space.

— Empirical study documents vector search dilution failure: Wyoming DOT corpus scaling 54→1,128 documents reduced accuracy 75%→below 40%. Proposes MASDR-RAG and identifies precision-faithfulness paradox—demonstrates adoption barriers when retrieval scales to large, noisy collections.

— Author's testing demonstrates hallucination reduction from 19% to 2% error rate with Citations API; legal domain goes 88%→52%, healthcare 56%→21%. Deployed on Anthropic API, Bedrock, Vertex AI with measured ROI (4 hours → 35 min audit trail).

— Ecosystem maturity signal: brands now running quarterly hallucination audits, optimizing entity schema for AI citation. Wikipedia overweights at 47.9% of ChatGPT top-10 sources. Organizations have normalized single-query retrieval as business infrastructure requiring active management.

— Reka AI released 374-question benchmark replacing saturated SimpleQA, achieving performance discrimination across 26.7–59.1% accuracy range. Signals ecosystem recognition that single-query search-augmented LLMs warrant dedicated rigorous evaluation.

— Hands-on testing across six AI tools (Gemini, ChatGPT, Copilot, Claude, Perplexity, DeepSeek) shows massive variation in citation coverage and UX; Gemini in-text complete, ChatGPT/Perplexity inconsistent, Claude lacks sources pane. Reveals citation attribution is non-standardized across platforms.
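Several evidence items above report how little the sources cited by different engines overlap. A rough sketch of how such figures could be computed from per-engine sets of cited URLs follows. Jaccard similarity is an assumption here, since the cited studies may define overlap differently, and the engine names and URLs are hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(cited: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard overlap for every pair of engines, keyed by the sorted pair of names."""
    return {
        (x, y): jaccard(cited[x], cited[y])
        for x, y in combinations(sorted(cited), 2)
    }


def exclusive_share(cited: dict[str, set[str]]) -> float:
    """Fraction of distinct sources that are cited by exactly one engine."""
    all_sources = set().union(*cited.values())
    if not all_sources:
        return 0.0
    exclusive = [s for s in all_sources if sum(s in v for v in cited.values()) == 1]
    return len(exclusive) / len(all_sources)


# Hypothetical citations returned by three engines for the same query.
cited = {
    "engine_a": {"u1", "u2", "u3"},
    "engine_b": {"u3", "u4"},
    "engine_c": {"u5"},
}
overlaps = pairwise_overlap(cited)
exclusive = exclusive_share(cited)
```

Low pairwise overlap combined with a high exclusive share is the pattern the evidence describes as structural divergence between engines rather than convergence toward a shared set of sources.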

HISTORY

  • 2022-H2: Dense retrieval foundations mature (300+ paper survey). QUILL system deployed at billion scale using retrieval augmentation for query understanding. You.com launches enterprise AI product with source attribution. Early adoption accelerating (ChatGPT 1M users by Dec 4). Fact hallucination documented as key reliability challenge; data protection concerns cited as adoption barrier.
  • 2023-H1: Perplexity AI reaches 10M monthly visits with 100% MoM growth; secures Series A and explores partnerships with Instacart, Klarna. You.com adds multimodal chat. Academic research confirms widespread hallucination and inaccurate citation in production engines; large-scale MIT study (12K+ queries) shows users distrust AI search but false citations increase perceived trustworthiness. Critical reliability gap persists despite explosive adoption.
  • 2023-H2: You.com launches web search APIs for LLM integration ($100/month) with enterprise adoption by LlamaIndex, Anthropic, Cohere—API-first strategy expands beyond consumer products. Professional adoption accelerates in healthcare and content workflows despite documented hallucination failures (arithmetic errors, citation fabrication). Practitioner guides compare citation quality to ChatGPT; international adoption signals emerge. Developer APIs and tooling infrastructure mature while reliability concerns remain unresolved.
  • 2024-Q1: You.com scales production infrastructure to handle 1B+ monthly API calls across Search, Content, News, and Images endpoints. Perplexity continues expanding adoption despite critical failures: documented medical errors (wrong post-surgery guidance) reveal persistent accuracy gaps in real-world deployment, highlighting the reliability-adoption paradox where users embrace single-query systems despite known hallucination risks.
  • 2024-Q2: Perplexity secures enterprise contracts with Zoom, HP, Stripe, and Cleveland Cavaliers, demonstrating willingness to deploy at scale despite reliability concerns. Academic research (NAACL, Google Cloud AI) identifies persistent technical limitations: retrieval augmentation inconsistently helps LLMs and can hurt performance, imperfect retrieval is widespread (70% of passages don't contain true answers), and enterprise deployment faces unresolved barriers around accuracy validation. Independent studies reveal systematic trustworthiness failures: citation of AI-generated sources and second-hand hallucinations detected in production use, defining the core adoption tension.
  • 2024-Q3: Vendor ecosystem hardens: Coveo launches production-grade Relevance-Augmented Passage Retrieval API (September GA) to address precision and hallucination. Real-world deployment evidence emerges (HP salesforce adopting Perplexity for prospect research). However, peer-reviewed and independent evidence of quality failures intensifies: JMIR study shows 61.6% citation irrelevancy in medical chatbot evaluation (July); Cornell/UW/Waterloo benchmarking shows top models achieve only 35% hallucination-free responses (August); technical research highlights single-query limitations (RQ-RAG, July; EuroPython 2024 talk on RAG failure modes). Maturation visible but fundamental reliability gaps persist—systems scaling operationally while failing visibly in production.
  • 2024-Q4: Enterprise deployment accelerates at scale: Cleveland Cavaliers adopt Perplexity across 15+ teams saving 10+ hours/week per employee; Amplitude deploys for market research; Perplexity launches Election Information Hub for fact-checked real-time election information. However, critical reliability evidence hardens: Columbia University Tow Center finds ChatGPT Search misattributes sources 76.5% of the time; peer-reviewed analysis identifies 16 design failures (bias, hallucination, misattribution); consultant case studies document seven failure modes in production RAG systems. Operational maturity confirmed but reliability-adoption paradox crystallizes: enterprises deploy at scale while accepting systematic factuality and attribution failures.
  • 2025-Q1: Infrastructure consolidation accelerates: Perplexity hardens pplx-api on NVIDIA infrastructure for production throughput; enterprise risk tolerance normalizes despite API outages (January 23, Perplexity). Technical landscape stalls—practitioner analysis shows most production systems remain at stage 1-2 (chatbots/reasoners) rather than advancing to autonomous agents. Reliability paradox deepens: 70% of enterprises depend on LLM-based research tools, yet persistent gaps in context understanding, sentiment analysis, and service availability define adoption ceiling. No breakthrough solutions emerge; the practice consolidates operationally while remaining technically constrained.
  • 2025-Q2: Quality degradation and investor skepticism emerge. Databricks reports 5,000 working hours monthly savings with Perplexity, validating enterprise ROI, but counterbalanced by tech journalism reporting model collapse in AI search tools, documented finance accuracy failures (97% error rates), WhatsApp bot scaling outages, and investor skepticism (conference vote: Perplexity "most likely to flop"). Market reports 42% of users encounter misleading content. The practice shifts from "proven but unreliable" to "deployed but questioned"—deployment momentum persists but quality concerns begin eroding organizational confidence.
  • 2025-Q4: Vendor RAG ecosystem accelerates at scale—Google Cloud ships production RAG platform with Vertex AI and hybrid search (Dec); RAGAS evaluation framework reaches 4,000+ GitHub stars and 5M+ monthly evaluations. Research adoption expands (Qualtrics: 72% of AI-using teams report increased organizational dependence). However, systematic evidence of limitations hardens: PRISMA systematic review of 128 RAG studies documents persistent data quality failures across pipeline stages; ICIS practitioner study identifies 15 data quality dimensions and failure modes; legal domain research reveals Document-Level Retrieval Mismatch failure in production systems. Quality degradation signals from Q2 persist with no published breakthrough solutions. Deployment breadth has hardened as incumbent infrastructure—single-query retrieval now standard in enterprise research workflows—but technical maturation has plateaued.
  • 2026-Jan: Vendor innovation accelerates in single-query architecture: Databricks launches Instructed Retrieval combining deterministic and probabilistic search for 35-50% recall gains; Perplexity acquires Carbon to enable enterprise grounding across internal sources with mid-market rollout; Perplexity expands public-sector adoption (200-seat law enforcement deployment). Analysis of production failures deepens: practitioner research identifies persistent evaluation blind spots where metrics mask query-specific catastrophes; advocates for adaptive retrieval (query-aware dynamic depth) showing 40-60% latency improvements in production. Market survey shows 71% of orgs use GenAI regularly but only 17% realize 5%+ earnings impact—broad deployment with productivity-to-ROI gap persists. Single-query systems now incumbent infrastructure but reliability constraints remain unresolved despite architectural innovations addressing static retrieval limitations.
  • 2026-Feb: Enterprise adoption metrics harden: 67% of B2B buyers use AI search tools (Perplexity 29% preference), with AI search shortening research cycles by 34%; Perplexity reaches 50+ million monthly active users (up from 15M in early 2025); You.com operates at billion-scale infrastructure (1B+ API queries/month) with enterprise customers (Alibaba, Amazon, DuckDuckGo). Vendor ecosystem matures: Perplexity releases Agent API and Embeddings API to general availability, enabling production-grade custom applications. Technical depth clarified: arXiv research confirms retrieval essential for accuracy (0% without retrieval vs. 79% with in SQL/API generation), validating core architectural premise. However, production challenges persist: practitioner analysis identifies systematic failure modes where standard metrics mask query-specific catastrophes—reliable single-query retrieval remains constrained by evaluation blind spots and deployment complexity despite market-scale adoption.
  • 2026-Mar: Deployment evidence deepens across sectors. Conversion metrics accelerate: AI-referred B2B traffic converts 49-63% across industries (vs. organic 28-42%); 796% AI traffic growth YoY with 6,432% conversion growth and 87.4% of AI referral traffic from ChatGPT, signalling platform concentration. Healthcare deployment case study: a 65-person SaaS deployed RAG + ensemble + verifier pipeline reducing operational hallucination rate from 4.2% to 3.4% (FACTS benchmark, 90-day timeline). Critical limitations surface: Apple/Duke research identifies over-searching as a systematic failure mode (first search provides 0.874% accuracy ROI, subsequent searches show diminishing returns and hallucinations); Washington State University study finds ChatGPT achieves only 73% consistency across identical prompts and 16.4% accuracy identifying false hypotheses. 451 Research confirms vector database maturation now underpins RAG infrastructure at enterprise scale. Signal persists: broad adoption at scale with unresolved reliability constraints and evaluation blind spots that aggregate metrics continue to mask.
  • 2026-Apr: Ecosystem maturation accelerates while reliability gaps harden. Major partnerships demonstrate infrastructure confidence: Microsoft commits $750M 3-year cloud partnership with Perplexity providing multi-model access (OpenAI, Anthropic, xAI); Samsung integrates Perplexity at OS level on Galaxy S26 (1B+ device ecosystem), signaling major vendor endorsement. Adoption metrics deepen: 73% of B2B buyers use AI tools for purchase research (multi-source meta-study across 680M citations); Perplexity enterprise product launched March 2026 with 100+ customers acquired first weekend. The CRAG benchmark (Meta/HKUST, 4,409 QA pairs) quantified the capability ceiling: advanced LLMs achieve ≤34% accuracy, basic RAG 44%, state-of-the-art solutions 63% without hallucination. A Nature-published Bixonimania experiment demonstrated citation laundering at scale: AI systems elaborated a fake disease into false statistics that contaminated peer review, illustrating how single-query systems amplify fabricated sources. Perplexity's Amazon Bedrock deployment documented production quality improvements (Claude 3 reducing hallucinations by half vs. Claude 2.1), while independent brand accuracy research found AI answers wrong about brands 40% of the time. Reliability constraints persist and sharpen: EACL 2026 peer-reviewed research formalizes error taxonomy for realistic RAG deployments; independent citation accuracy testing shows 78% precision across 847 queries; only 30% of AI-generated answer sources reappear in an identical follow-up query. Research confirms hybrid retrieval substantially outperforms single-stage methods; architectural advances address known failure modes but production systems remain constrained by aggregation blind spots that mask query-specific catastrophes. Broad adoption accelerates (900M weekly ChatGPT users, 1.2-1.5B Perplexity monthly queries) with 23x conversion advantage over organic search, but technical progress has plateaued around reliability ceilings no shipping product has systematically resolved.
  • 2026-May: Adoption metrics hardened further (ChatGPT at 900M weekly users and 17% of all digital queries) while reliability evidence continued to accumulate against it. Stanford AI Index documented models failing at 86% when users assert false premises, and a 5,000-prompt benchmark found citation accuracy averaging just 12.4% across frontier models. Tow Center for Digital Journalism found Perplexity at 37% wrong and ChatGPT at 40% wrong overall, reinforcing the deployment paradox. Perplexity reached $454M ARR (45M+ MAU, 1B+ monthly queries) with a $750M 3-year Microsoft Azure partnership, while named enterprise deployments confirmed productivity value: Ontop reduced legal compliance response time from 20 minutes to 20 seconds saving 130 hours/month; a manufacturing firm cut competitive research cycles from 7 days to 2 days. GEO-Bench (10,000 queries, 25 domains) documented AI-cited traffic converting at 14.2% versus 2.8% organic, but only 11% citation overlap between ChatGPT and Perplexity on identical queries, with platform-specific architecture variance growing. Critical failure evidence sharpened: Royal College of Surgeons found 25-34% of medical references fabricated; Lancet documented a 12-fold rise in fake citations across 2.5M biomedical papers since 2023; legal sanctions continued across US jurisdictions. Architectural research (SIRA, arXiv) showed LLM-guided corpus discrimination can compress multi-round search into single queries while outperforming dense retrievers, and frontier models reached 1.0-2.5% hallucination on summarization (down from 3-8% in 2023), though hallucination rates still vary 5-15x by topic class. Single-query AI search is mainstream infrastructure with adoption growing rapidly, but production reliability in retrieval remains structurally unresolved.
  • 2026-Jun: Vendor ecosystem expansion and deepening systemic reliability evidence jointly defined the month. Anthropic launched a Web Search API enabling Claude to autonomously decide when to search, refine queries, and return cited results — a major vendor entry into single-query retrieval. Reka AI released Research-Eval (374 questions, replacing saturated SimpleQA), signaling ecosystem recognition that dedicated rigorous evaluation is required. Claude Citations API demonstrated measurable technical progress: testing showed error rates dropping 19%→2% overall, with domain-specific gains (legal 88%→52%, healthcare 56%→21%). Empirical citation architecture analysis (Machine Relations, 28,870 source events) confirmed structural platform divergence rather than convergence — 71% of sources exclusive to a single engine. Critical negative signals surfaced: ACM Web Conference research documented retrieval collapse when synthetic content contaminates indexes, with accuracy metrics staying reassuring while systems invisibly drift onto AI-generated evidence (synthetic content snowballing to 80%+ of top results); Wyoming DOT corpus tests showed vector search dilution causing accuracy collapse from 75% to below 40% as documents scaled from 54 to 1,128. Citation attribution remains non-standardized across platforms, with Wikipedia overweighting ChatGPT citations at 47.9% of top 10. Brands have now operationalized single-query retrieval as managed business infrastructure: quarterly hallucination audits, entity schema optimization, and AI visibility management as a distinct channel. Single-query research retrieval has crossed from technology adoption into organizational practice; maturity is measured not by platform convergence but by infrastructure specialization and risk management sophistication.