The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI that translates natural language questions into SQL queries and performs semantic search across structured and unstructured data. Includes text-to-SQL tools and embedding-based retrieval; distinct from enterprise RAG, which retrieves from document collections rather than databases.
Natural language data querying has reached critical adoption momentum, yet May 2026 research quantifies persistent production barriers. Snowflake's earnings inflection signals mainstream traction for the vendor platform: 50% of its customer base (5,200+ weekly active Cortex AI accounts) and 200% growth in AI-related workloads. The same month's research sharpens the reality gap. The BEAVER benchmark (MIT/Intel/Harvard), built from real corporate query logs (9,128 pairs across 19 domains and 19 database systems), shows GPT-4o achieving 82% on sterile academic benchmarks but collapsing to 10.8% on actual enterprise data with undocumented schemas, a 71-point accuracy cliff. PolySQL research documents systemic evaluation bias: most benchmarks support only SQLite, masking 10.1% accuracy drops on production databases (PostgreSQL, BigQuery, Snowflake). Success hinges on upfront data engineering: context quality (semantic layer, metadata governance, schema curation), not model capability, determines whether text-to-SQL achieves production accuracy. Only enterprises investing heavily in semantic layer development and schema preparation achieve reliable deployments.
Production deployments demonstrate viability at scale in well-governed data environments: Uber's QueryGPT at 1.2M queries per month, Tapestry's feedback analysis, finance teams using multi-turn conversational analytics, AmpUp sales agents generating customer briefs with Cortex Code, and an AWS AML case study that cut investigation time from 30-90 minutes to under 5 minutes. The limiting factor, however, is consistently enterprise context: 70% of real SQL queries follow just 13% of templates (Cornell), yet 50% of frontier model failures stem from context and domain gaps rather than model capability (Berkeley Data Agent Benchmark). May 2026 research shows architectural momentum shifting toward agentic, iterative approaches: Amazon's SQL-Trail achieves state-of-the-art results on BIRD-SQL through multi-turn reinforcement learning with feedback, and June 2026's FlexSQL reaches 65.4% on Spider2-Snow, a more realistic benchmark than base Spider. Yet June 2026 JP Morgan research reveals a critical multi-turn limitation: all five frontier models collapse to 0% execution accuracy by Turn 3 without working memory, indicating fundamental conversational barriers. May-June 2026 security research quantifies the deployment risks: generated SQL can violate permissions, leak sensitive fields, or return semantically wrong results despite being syntactically correct. Multi-agent text-to-SQL systems show security detection rates of 30-78%, so production deployments require deterministic validation and schema inspection.
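The call for deterministic validation can be illustrated with a small sketch. It is an illustration under stated assumptions, not any vendor's implementation: Python's standard `sqlite3` module and its `set_authorizer` hook stand in for a production database's permission layer, and the `customers` table, the `ssn` column, and the helper names `read_only_authorizer`, `make_guarded_connection`, and `run_generated_sql` are all hypothetical.

```python
import sqlite3

# Hypothetical sensitive field that generated SQL must never read.
BLOCKED_COLUMNS = {("customers", "ssn")}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow plain SELECTs and column reads; deny everything else."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ:
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if (arg1, arg2) in BLOCKED_COLUMNS:
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK
    # Writes, DDL, PRAGMA, ATTACH, SQL functions, etc. are all rejected here.
    # A real deployment would widen this allowlist deliberately.
    return sqlite3.SQLITE_DENY


def make_guarded_connection():
    """Build an in-memory demo database, then lock it down."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (name TEXT, ssn TEXT);
        INSERT INTO customers VALUES ('Ada', '000-00-0000');
        """
    )
    # Installed after setup so the setup statements themselves are not blocked.
    conn.set_authorizer(read_only_authorizer)
    return conn


def run_generated_sql(conn, sql):
    """Return (True, rows) on success or (False, reason) if the guard rejects."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.DatabaseError as exc:
        return False, str(exc)
```

With this guard in place, `run_generated_sql(conn, "SELECT name FROM customers")` succeeds, while a query that touches `ssn` or a `DROP TABLE` statement is refused when the statement is compiled, before anything executes. The check does not depend on the model behaving well, which is the property the security research asks for.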
Vendors have consolidated around agentic architecture and mandatory semantic layers, an acknowledgment that pure text-to-SQL is insufficient. Production case studies demonstrate operational maturity: one conversational SQL platform improved its success rate from 60% to 98% via agent-based Python code generation, and Cortex Code deployment success correlates with data freshness and semantic grounding. Production learnings from OpenAI, Google Cloud, Vercel, and Hex converge on the same conclusion: enterprise text-to-SQL requires context pruning, rigorous validation, and governance, and engineering discipline matters more than prompt optimization. Cost-efficient fine-tuning patterns are emerging ($0.80/month for 22,000 queries with LoRA), and research momentum persists (ACL 2026 papers show 70%+ accuracy on specialized benchmarks via agentic approaches). Yet mainstream adoption without substantial implementation investment remains elusive. The practice has reached leading-edge maturity: it is production-ready for organizations with the resources to invest in data governance and schema engineering, and the early-adopter advantage is shifting from capability innovation to operational execution.
The vendor ecosystem has consolidated around agentic architecture and semantic layers as non-negotiable requirements. Google Cloud's Database Center (GA May 2026, with a Gemini-powered conversational interface for queries across Cloud SQL, Spanner, and Bigtable), Snowflake Cortex Analyst, AWS Quick Suite, ThoughtSpot Spotter for Industries, and emerging players signal ecosystem maturity. June 2026 Snowflake earnings show an accelerating adoption inflection: 13,600+ accounts using Snowflake AI capabilities (up from 9,100 in May), 7,100+ using Cortex Code, with Snowflake Intelligence accounts doubling QoQ, and 34% YoY product revenue growth. Scale AI deployed TextQL's Ana at scale: 1,900 requests/week across Finance/Ops/HR on 1.9T rows, with 74.9% monthly adoption growth. Uber's QueryGPT handles 1.2M queries monthly and cuts query authoring time from 10 minutes to 3; Dream11's platform achieved 98.4% execution accuracy with fine-tuned 8B models on 250M users. AWS deployed production AML alert triage using Cortex Analyst (structured) and Cortex Search (unstructured), reducing investigation time from 30-90 minutes to under 5 minutes. AmpUp's production comparison of Cortex Code agent effectiveness demonstrates that conversational SQL generation quality depends critically on data freshness and semantic grounding, not just model capability. These successes require substantial upfront investment in semantic layers, schema curation, and business context systems. Semantic Layer Summit 2026 (6,000+ attendees) showcased production deployments: Carrefour France migrated 3,000 metrics across 40 countries; Vodafone Portugal reduced metric refresh times from hours to minutes by migrating to semantic layers on BigQuery.
However, May-June 2026 research sharpens the benchmark-to-production reality gap and surfaces additional robustness barriers. The BEAVER benchmark (MIT/Intel/Harvard; 9,128 real corporate query pairs across 19 domains and 19 database systems) reveals the adoption cliff: GPT-4o achieves 82% accuracy on the Spider/BIRD academic benchmarks but collapses to 10.8% on actual enterprise data with undocumented schemas, a 71-point drop that reflects a fundamental inability to grasp complex business logic rather than memory limitations. PolySQL research exposes systematic evaluation bias: most text-to-SQL benchmarks test only SQLite, creating false confidence. Cross-dialect evaluation reveals a 10.1% average accuracy degradation on production databases (PostgreSQL, BigQuery, Snowflake), due primarily to logical errors (61%) rather than syntax. SpotIt (ICLR 2026) uses formal verification to show that top methods lose 11-14% when evaluated for semantic correctness rather than output matching, evidence that benchmarks systematically overestimate capability. June 2026 research identifies new robustness limitations: models produce inconsistent SQL across equivalent database schemas (Gemini 81.6% agreement vs DeepSeek 33.85% on the same data with different schemas), and multi-turn conversational SQL systems collapse without working memory management. Practitioners document silent failures, in which queries execute without error but return wrong results because of semantic errors (fan-out traps in joins, NULL inconsistencies, ambiguous business terminology), alongside performance failures and SQL injection vulnerabilities. The Spider benchmark uses 146 clean databases with 5-30 tables; production systems have 400+ tables with opaque naming conventions. Multi-agent text-to-SQL systems show 30-78% security detection rates, with schema metadata flowing uninspected.
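The fan-out trap mentioned among the silent failures is easy to reproduce. The sketch below is a minimal illustration using Python's standard `sqlite3` module; the `orders` and `shipments` tables, their values, and the function name `fan_out_demo` are invented for the example and do not come from any of the cited studies.

```python
import sqlite3


def fan_out_demo():
    """Return (naive_total, correct_total) for a revenue question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
        INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
        -- Order 1 shipped in two parcels, so it matches two shipment rows.
        INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
        """
    )
    # Plausible generated SQL: joins to shipments, then sums order amounts.
    # It runs without error, but order 1 is counted once per shipment.
    naive = conn.execute(
        "SELECT SUM(o.amount) FROM orders o "
        "JOIN shipments s ON s.order_id = o.order_id"
    ).fetchone()[0]
    # Summing at the grain of the orders table gives the true revenue.
    correct = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
    conn.close()
    return naive, correct
```

Here `fan_out_demo()` returns `(250.0, 150.0)`: both queries are syntactically valid and execute cleanly, yet the joined version overstates revenue by the duplicated order. Nothing at the database level flags the error, which is why output-level or semantic checks are needed.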
Architecture and governance maturity continue to diverge from raw model capability. dbt Labs' April 2026 benchmark quantifies this divergence: the semantic layer (deterministic) achieves 98.2-100% accuracy versus 84-90% for text-to-SQL (probabilistic) on identical business questions, with the key finding that "the Semantic Layer's deterministic query generation means the LLM can't produce subtly wrong results", which addresses the core production failure mode of silent semantic errors. Snowflake's Cortex Sense context-layer benchmark confirms the pattern: agents achieve ~24% accuracy in isolation but ~86% with assembled business context (query history, metadata, BI definitions, semantic views), demonstrating that context rather than model capability is the limiting factor. ByteDance and Georgia Tech's TAHOE system demonstrates this architectural evolution in production: its Spider 2.0-Snow pass rate rises from 61.95% to 79.42% via learned hints, and cross-model transferability (+19.7pp on weaker models) shows the architecture's robustness. However, governance lags deployment: production incidents (Replit, Vibe) show AI agents destroying databases because role-based access controls and pre-flight validation were missing, evidence that adoption is occurring but operational security controls remain immature. Practitioners favour agentic function-calling over direct text-to-SQL because of SQL dialect complexity and model limitations. Production learnings from OpenAI, Google Cloud, Vercel, and Hex converge: most bad queries don't fail during execution, so success requires context pruning, rigorous validation (query shape, missing filters, suspicious joins), deterministic compilation via semantic layers, and governance matching database user access patterns.
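The contrast between probabilistic text-to-SQL and deterministic compilation can be sketched in a few lines. This is a toy illustration of the general idea only, not dbt's or Snowflake's implementation: the `METRICS` and `DIMENSIONS` registries, the `orders` table, and the `compile_query` function are hypothetical names. The assumption is that the model's job shrinks to choosing a metric and dimensions by name, while the SQL text itself comes from governed definitions.

```python
# Hypothetical governed definitions, maintained by data engineers.
METRICS = {
    "total_revenue": {"table": "orders", "expr": "SUM(amount)"},
    "order_count": {"table": "orders", "expr": "COUNT(*)"},
}
DIMENSIONS = {"region": "region", "order_month": "order_month"}


def compile_query(metric, dimensions=()):
    """Build SQL from registered definitions only; reject anything unknown."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    unknown = [d for d in dimensions if d not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown!r}")

    definition = METRICS[metric]
    columns = [DIMENSIONS[d] for d in dimensions]
    select_list = ", ".join(columns + [f"{definition['expr']} AS {metric}"])
    sql = f"SELECT {select_list} FROM {definition['table']}"
    if columns:
        sql += " GROUP BY " + ", ".join(columns)
    return sql
```

Because the same request always compiles to the same SQL, a question the layer cannot express fails loudly with a `ValueError` instead of producing a subtly wrong query. The trade-off is coverage: only questions that map onto registered metrics and dimensions can be answered, which is why the upfront modelling investment matters.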
Cost-efficient fine-tuning patterns are emerging (AWS LoRA approach: $0.80/month for 22,000 queries), but fundamental deployment barriers persist: only 10-20% of AI-generated answers meet business decision thresholds on heterogeneous enterprise systems that lack semantic layer governance and extensive schema curation.
— Concrete deployment ROI: PepsiCo 12x faster root-cause analysis (2025); Top-10 pharma $12M opportunity with 2,200% ROI; documents market maturity with vendor consolidation around NL query capabilities.
— Critical negative signal: rigorously documents 70-point accuracy collapse (86-91% benchmarks → 10-21% enterprise); identifies three failure modes (scale, missing semantics, no verification) with empirically validated solutions (context layers recover 70-96%).
— dbt Labs April 2026 benchmark: semantic layer (deterministic) 98.2-100% vs text-to-SQL (probabilistic) 84-90% on same 11 insurance questions; demonstrates architectural solution to silent-wrong-answer failure mode.
— ByteDance + Georgia Tech production system: Spider 2.0-Snow pass rate 61.95%→79.42%; 100% Snowflake syntax; cross-model transferability (+19.7pp on Doubao-2.0-lite); demonstrates hint-learning architecture for real deployments.
— Cortex Sense context layer benchmark: agents alone ~24% accuracy, with context ~86%; demonstrates context assembly—not model capability—as bottleneck; validates semantic layer infrastructure as critical.
— Amazon Science framework outperforms LLM self-eval by 25.78% F1; improves execution accuracy up to 20pp on deployed systems—addresses critical production failure mode of semantically incorrect SQL passing syntax validation.
— Negative evidence: production incidents (Replit, Vibe) where AI agents destroyed databases due to governance gaps; reveals adoption reality—agents deployed in production but operational controls lag, indicating immature security posture.
— CoWork GA with adoption inflection: 13,600+ weekly active accounts with 2x QoQ growth; paired with context layers achieves 5x accuracy improvement (47% baseline, 5x with Atlan context), demonstrating context-not-model as limiting factor.