Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organizational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in one or two domains — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE ←→ ESTABLISHED

Natural language data querying & semantic search

LEADING EDGE

TRAJECTORY

Stalled

AI that translates natural language questions into SQL queries and performs semantic search across structured and unstructured data. Includes text-to-SQL tools and embedding-based retrieval; distinct from enterprise RAG, which retrieves from document collections rather than databases.
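The embedding-based retrieval half of this practice can be illustrated with a minimal sketch: documents and a query are compared as vectors by cosine similarity rather than by keyword overlap. The three-dimensional vectors and document names below are toy stand-ins; a real system would obtain embeddings from a model and store them in a vector index.

```python
import math

# Toy 3-d vectors stand in for a real embedding model's output.
DOCS = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "privacy notice": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, k=1):
    """Rank documents by cosine similarity to the query embedding."""
    ranked = sorted(DOCS.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# A query about "returns" embeds near "refund policy" despite sharing no keywords.
print(semantic_search([0.8, 0.2, 0.1]))  # ['refund policy']
```

The same mechanism underlies text-to-SQL schema linking: candidate tables and columns are retrieved by embedding similarity to the question before any SQL is generated.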

OVERVIEW

Natural language data querying has reached critical adoption momentum alongside persistent production barriers. Snowflake's Q1 FY2026 earnings inflection — 50% of its customer base (5,200+ weekly active Cortex AI accounts) and 200% growth in AI-related workloads — signals mainstream traction for the vendor platform. Yet success hinges entirely on upfront data engineering: context quality (semantic layer, metadata governance, schema curation), not model capability, determines whether text-to-SQL achieves production accuracy. The field consensus has crystallized: the benchmark-to-reality gap remains unbridged despite four years of research; only enterprises that invest heavily in semantic layer development and schema preparation achieve reliable deployments.

Production deployments (Uber 1.2M/month, Tapestry feedback analysis, finance teams using multi-turn conversational analytics for modeling and forecasting) demonstrate viability at scale for well-governed data environments. However, the limiting factor is consistently enterprise context: 70% of real SQL queries follow just 13% of templates (Cornell), yet 50% of frontier model failures stem from context/domain gaps rather than model capability (Berkeley Data Agent Benchmark). Academic text-to-SQL benchmarks report 79-87% accuracy; frontier models achieve 86.6% on Spider 1.0 but collapse to 10% on complex, real-world enterprise schemas. April 2026 security research quantifies deployment risks: generated SQL can violate permissions, leak sensitive fields, and return semantically wrong results despite syntactic correctness, requiring deterministic validation layers in production.
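The deterministic validation layer that this research calls for can be as simple as a set of post-generation checks run before any generated SQL executes. The table allowlist, denied-column set, and regex-based table extraction below are illustrative assumptions, not any vendor's API; a production gate would use a real SQL parser rather than regular expressions.

```python
import re

ALLOWED_TABLES = {"orders", "customers"}   # hypothetical allowlist from the semantic layer
DENIED_COLUMNS = {"ssn", "salary"}         # hypothetical sensitive fields

def validate_sql(sql: str) -> list:
    """Deterministic checks applied after LLM generation, before execution.

    Returns a list of problems; an empty list means the query may run.
    """
    problems = []
    first_word = sql.lstrip().split(None, 1)[0].upper()
    if first_word != "SELECT":
        problems.append("only SELECT is allowed, got " + first_word)
    # Crude table extraction; a real gate would parse the SQL properly.
    for table in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
        if table.lower() not in ALLOWED_TABLES:
            problems.append("table not on allowlist: " + table)
    for col in DENIED_COLUMNS:
        if re.search(r"\b" + col + r"\b", sql, re.IGNORECASE):
            problems.append("sensitive column referenced: " + col)
    return problems

print(validate_sql("SELECT ssn FROM employees"))
print(validate_sql("SELECT id FROM orders"))
```

Checks like these catch permission violations and sensitive-field leakage mechanically; they cannot catch semantically wrong answers, which is why the evaluation and semantic-layer work described elsewhere in this entry remains necessary.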

Vendors have consolidated around agentic architecture and mandatory semantic layers—an acknowledgment that pure text-to-SQL is insufficient. Cost-efficient fine-tuning is emerging ($0.80/month for 22,000 queries with LoRA), and research momentum persists (ACL 2026 papers show 70%+ accuracy on specialized benchmarks via agentic approaches). Yet mainstream adoption without substantial implementation investment remains elusive. The practice has reached leading-edge maturity: production-ready for organizations with the resources to invest in data governance and schema engineering, with the early-adopter advantage shifting from capability innovation to operational execution.

CURRENT LANDSCAPE

The vendor ecosystem has consolidated around agentic architecture and semantic layers as non-negotiable requirements. Google Cloud's QueryData (GA April 2026, #1 BIRD benchmark ranking, with a Hughes Network Systems deployment), Snowflake Cortex Analyst, AWS Quick Suite, ThoughtSpot Spotter for Industries, and emerging players signal ecosystem maturity. The Snowflake Q1 FY2026 earnings inflection — 50% of the customer base (5,200+ weekly active users) using Cortex AI, with 27% YoY growth in $1M+ spenders — validates mainstream enterprise traction. Scale AI deployed TextQL's Ana at scale: 1,900 requests/week across Finance, Ops, and HR on 1.9T rows, with 74.9% monthly adoption growth. Uber's QueryGPT handles 1.2M queries monthly with a 10→3 minute authoring speedup; Dream11's platform achieved 98.4% execution accuracy with fine-tuned 8B models serving 250M users. All of these successes required substantial upfront investment in semantic layers, schema curation, and business context systems.

However, April 2026 evidence sharpens the benchmark-to-production reality gap. Practitioners document silent failures: queries execute without error but return wrong results due to semantic errors (fan-out traps in joins, NULL inconsistencies, ambiguous business terminology), performance failures, and SQL injection vulnerabilities. The gap is quantified: the Spider benchmark uses 146 clean databases with 5-30 tables, while production systems have 400+ tables with opaque naming conventions. GPT-4o achieves 90%+ accuracy on synthetic benchmarks but drops to 51% on real enterprise BI questions—a 39-point collapse. Amazon Science's PRACTIQ dataset addresses a core gap: production chatbots receive ambiguous and unanswerable questions that existing benchmarks never test. Practitioners favor agentic function-calling over direct text-to-SQL due to SQL dialect complexity and model limitations. Cost-efficient fine-tuning patterns are emerging (AWS LoRA approach: $0.80/month for 22,000 queries), but fundamental deployment barriers persist: without semantic layer governance and extensive schema curation, only 10-20% of AI-generated answers meet business decision thresholds on heterogeneous enterprise systems.
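One of the silent failures named above, the fan-out trap, is easy to reproduce: joining a one-to-many table before aggregating duplicates the parent rows, so the query runs without error yet double-counts. A minimal demonstration with a toy schema (the tables and values are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE shipments (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (1, 100.0);              -- one order worth 100
INSERT INTO shipments VALUES (1, 1), (2, 1);       -- shipped in two parts
""")

# Naive join: each shipment row duplicates the order amount (fan-out),
# so the SUM silently doubles. The SQL is syntactically valid throughout.
naive = cur.execute(
    "SELECT SUM(o.amount) FROM orders o JOIN shipments s ON s.order_id = o.id"
).fetchone()[0]

# Correct: aggregate the orders table directly (or pre-aggregate before joining).
correct = cur.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(naive, correct)  # 200.0 100.0
```

A generated query falling into this trap passes every syntactic check, which is exactly why semantic validation and human review of join paths remain necessary.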

TIER HISTORY

Research        Jan-2022 → Jan-2022
Bleeding Edge   Jan-2022 → Jul-2024
Leading Edge    Jul-2024 → present

EVIDENCE (117)

— Independent analysis reporting Snowflake's 9,100+ weekly active Cortex AI accounts with 200% growth in AI workloads; 50% customer adoption of Cortex Code since November 2025 launch.

— Finance organizations using multi-turn NLQ for conversational financial modeling, dynamic scenario planning, and variance analysis—showing matured NLQ practice beyond simple Q&A.

— Production security assessment: text-to-SQL risks extend beyond SQL injection—generated SQL can violate permissions and access controls, leak sensitive fields, or answer the wrong question; deterministic validation is essential after LLM generation.

— Production evaluation framework enabling continuous monitoring without schema access—addresses critical gap: current evaluations require ground-truth queries and schemas, rarely satisfied in deployment.

— Tapestry (Coach/Kate Spade parent) deployed NLQ feedback analysis on AWS Bedrock, collecting 30,000 feedback pieces and achieving 10x faster AI application development with faster business decisions.


— ACL 2026 research aggregation: semantic layers boost accuracy 17-23 percentage points across frontier models (Opus 4.7, Sonnet 4.6, GPT-5.4); R³-SQL reaches 75% BIRD-dev execution accuracy.

— Technical architecture analysis: NL-BI requires four-layer design (intent parsing, semantic layer, SQL generation, validation). Vendor consolidation around semantic layers and deterministic validation as production requirements.

— Production deployment guide from Snowflake partner documenting three semantic layer architectures—modularity vs. accuracy vs. scalability tradeoffs in multi-tenant Cortex Analyst rollouts.
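The four-layer NL-BI design noted in the evidence above (intent parsing, semantic layer, SQL generation, validation) can be sketched end to end. Everything here is a toy stand-in: the keyword-based intent parser, the two-entry semantic layer, and the template that substitutes for an actual LLM generation call.

```python
# Layer 2: semantic layer -- resolve business terms to physical columns.
SEMANTIC_LAYER = {"revenue": "orders.amount", "region": "orders.region"}

def parse_intent(question: str) -> str:
    """Layer 1: map the question to a known analytical intent (toy heuristic)."""
    return "aggregate" if "total" in question.lower() else "lookup"

def generate_sql(intent: str, metric: str) -> str:
    """Layer 3: SQL generation -- a template stands in for an LLM call here."""
    column = SEMANTIC_LAYER[metric]        # business term -> governed column
    table = column.split(".")[0]
    if intent == "aggregate":
        return "SELECT SUM({}) FROM {}".format(column, table)
    return "SELECT {} FROM {}".format(column, table)

def validate(sql: str) -> str:
    """Layer 4: deterministic validation before execution."""
    assert sql.upper().startswith("SELECT"), "read-only queries only"
    return sql

sql = validate(generate_sql(parse_intent("What is total revenue?"), "revenue"))
print(sql)  # SELECT SUM(orders.amount) FROM orders
```

The point of the layering is that only layer 3 involves a model; the layers around it are deterministic, which is where the accuracy gains attributed to semantic layers in the evidence above come from.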

HISTORY

  • 2022-H1: Early research breakthroughs in text-to-SQL accuracy (DIN-SQL: 85.3% Spider execution), major vendor product launches (AWS QuickSight Q GA), and critical independent assessments questioning real-world efficacy on unseen data. Adoption emerging in BI tools but constrained by accuracy limitations on complex queries.
  • 2022-H2: Two major surveys (COLING, Foundations and Trends) document field maturity and ongoing challenges; AWS expands QuickSight Q to data lakes; IBM research reveals severe generalization gap (75% on Spider vs. <20% on unseen databases); industry papers show incremental progress on semantic parsing. Consensus emerges: capability is real but production reliability remains the primary blocker.
  • 2023-H1: LLM-based methods (SQL-PaLM, RESDSQL) drive research state-of-the-art on text-to-SQL benchmarks; vendors move to product iteration with LLM integration (ThoughtSpot Sage, QuickSight Q expansion). Independent third-party evaluation by 30+ consultants flags usability barriers and training needs despite efficiency gains. Critical analyses emerge questioning semantic search viability for production, highlighting brittleness on unseen schemas and domain drift. Field transitions from research-driven to vendor-driven, but adoption remains limited to vendor internal deployments and niche use cases.
  • 2023-H2: DAIL-SQL advances text-to-SQL SOTA to 86.6% on Spider; BIRD benchmark reveals critical gap (GPT-4 at 54.89% vs. human 92.96% on real databases), signaling production reliability remains the core blocker. Vendor product expansion continues (AWS Q reaching 80+ capabilities, Gartner Challenger status); rare production deployment (DataQue) documented. Critical practitioner assessments reaffirm adoption barriers: poor ROI, data prep burden, limited scope. Research focuses on production reliability (schema scaling, validation-augmented parsing).
  • 2024-Q1: SuperSQL (NL2SQL360 framework) achieves 87% Spider accuracy; new vendor adoption (ServiceNow NLQ GA); research attention shifts to error taxonomy and user recovery strategies, but industry skepticism deepens with Yellowfin CEO documenting structural ambiguity failures in search-based NLQ. Real-world testing continues (Petrobras production database), yet the field consensus turns pragmatic: benchmark progress has decoupled from production viability, and adoption barriers remain fundamentally unresolved.
  • 2024-Q2: Major cloud vendors reach general availability (Amazon Q in QuickSight, AWS CloudWatch natural language querying), signaling product maturity and expanded scope into observability. Named enterprise deployments emerge (CBRE, BCG with Scale AI), validating production viability beyond vendor marketing. Simultaneously, research reveals pervasive annotation errors in benchmark datasets, undermining confidence in reported accuracy improvements. Field continues technical advancement (LLM-based method surveys) while adoption remains constrained by schema complexity, linguistic ambiguity, and poor ROI versus traditional BI tools. Vendor expansion accelerating; mainstream breakthrough absent.
  • 2024-Q3: Amazon Q QuickSight and ThoughtSpot demonstrate production deployments at scale (Cox2M with 1.5M+ IoT events/hour; Docebo serving 3,800+ customers with a 5x adoption increase). Research advances focus on reducing hallucinations (TA-SQL: 21% GPT-4 improvement) and domain knowledge integration; surveys document continued LLM method maturity. Practitioner feedback reveals persistent reality gaps: pure semantic search remains inadequate for structured queries; guided NLQ and hybrid SQL approaches more viable than search-only. Vendor expansion accelerating across cloud platforms; adoption remains limited to early investors despite production viability.
  • 2024-Q4: Spider 2.0 benchmark (632 real-world enterprise tasks) reveals critical gap: LLMs achieve only 17% success vs. 91.2% on Spider 1.0, making the reality gap explicit and data-driven. Vendor expansion continues (Amazon Q unstructured integration, Oracle EBS NLQ GA, Docebo at 4,000+ customers). LLM-based text-to-SQL surveys document method maturity and persistent production challenges. Practitioner reports (Hal9) confirm benchmark-to-reality decoupling. Adoption remains limited to organizations with substantial schema preparation and implementation investment; mainstream breakthrough absent.
  • 2025-Q1: Research advances in text-to-SQL methodology (HES-SQL hybrid reasoning: 79.14% BIRD accuracy with efficiency gains; Pi-SQL pivot-language method showing a 3.20-point accuracy improvement; systematic NLIDB review documenting persistent challenges). Real-world deployments demonstrate production viability for early adopters (ThoughtSpot/Snowflake retail case study with latency reduction hours-to-minutes; AWS QuickSight scenario analysis agentic features GA). However, industry skepticism deepens: practitioner opinion shifts toward agentic function calling as superior to text-to-SQL; critical assessments highlight SQL dialect complexity and model limitations (Gemini-2.0-turbo 54% failure rate). Comprehensive LLM-based text-to-SQL survey confirms academic recognition and ongoing research investment. Adoption barriers remain: complex schemas, linguistic ambiguity, extensive data preparation. Mainstream breakthrough absent; field remains production-ready for early investors only.
  • 2025-Q2: Vendor product expansion continues (Amazon Q embedded GA in QuickSight across 7 AWS regions, April 2025). Real-world deployments grow (HP Inc. case study: ThoughtSpot on Snowflake with 350 users, 155k queries in 6 months, turnaround days-to-<24hrs). However, domain-specific benchmarks expose persistent LLM limitations: Exaone 3.5 shows 4% accuracy on arithmetic reasoning and 31% on grouped ranking despite 93% on simple aggregation. Academic research broadens (comprehensive systematic reviews of text-to-SQL landscape and techniques). Practitioner assessments document evaluation complexity and production integration challenges (semantic layer, access controls, enterprise data integration). Industry consensus firms: benchmark-to-reality gap now quantified, adoption barriers remain structural, mainstream breakthrough absent. Field remains production-viable for organizations with substantial implementation investment.
  • 2025-Q3: Vendor ecosystem expansion continues with agentic capabilities (ThoughtSpot Gartner 2025 Leader, Verivox 70% adoption, ServiceNow NLQ GA). However, academic research explicitly documents fundamental production barriers: CORGI benchmark reveals LLM performance drops on high-level business questions (21% harder than BIRD); Text-to-Big SQL research identifies critical failures at scale (cost, latency). Research confirms text-to-SQL unsuitable for complex queries; practitioners increasingly favor agentic function calling over semantic search due to SQL dialect complexity and model limitations. Industry consensus unchanged: production viability limited to enterprises with extensive schema curation, semantic layer development, and data engineering resources; mainstream adoption remains absent despite vendor proliferation.
  • 2025-Q4: Vendor ecosystem continues to mature with SQL Server 2025 GA, Oracle EBS NLQ expansion, Elasticsearch hybrid search GA, and agentic AWS Quick Suite launch (Quick Research, Quick chat, Quick scan). Enterprise deployments demonstrate production scale (Odido €1M savings, Thrive Learning 20k+ customers in 6 weeks). However, Q4 research sharpens critique: Promethium's enterprise benchmark analysis quantifies the accuracy cliff (85-90% academic vs. 10-30% enterprise reality; GPT-4o 86% Spider 1.0 vs. 6% Spider 2.0); new hallucination detection research (SQLHD) and critical assessments document schema awareness, hallucination, and performance barriers as fundamental constraints. Research explicitly confirms: traditional text-to-SQL metrics fail at scale (cost, latency); only 10-20% of AI-generated answers meet business decision thresholds on heterogeneous enterprise systems. Industry position firms: mainstream adoption remains absent; deployments succeed only with extensive preparation (semantic layers, schema curation, access control integration); vendors increasingly pivot to agentic interfaces and guided NLQ rather than pure semantic search.
  • 2026-Jan: Research continues on specialized NLQ domains (NL4ST for spatio-temporal querying) while benchmarks expose persistent LLM limitations: CORGI reveals ~50% accuracy gap on complex business logic (GPT-4o ~50% vs. human expectation for production-grade queries). Vendor maturity consolidates (Snowflake Cortex Analyst, Sisense Simply Ask, Querio) with focus on production accuracy evaluation and Intent-First architectures. Developer adoption patterns shift toward self-learning agents (Agno, DuckDB-NSQL-7B) with knowledge-based query generation. Critical assessments document RAG/NLQ failures in production (72% enterprise search queries fail first attempt), reinforcing need for semantic layers and extensive schema preparation. Field shows no mainstream breakthrough; adoption remains limited to enterprises with substantial implementation investment.
  • 2026-Feb: Research advances in robustness (DIVER improves text-to-SQL resilience by 10.82%) while vendor product evolution continues (ThoughtSpot Analyst Studio with agentic data prep, Amazon Q Tokyo region GA). However, practitioner and analyst assessments sharpen critique: independent testing documents silent failures and security gaps in naive implementations; Gartner positions NLQ at Peak of Inflated Expectations; market analysis of 15+ tools highlights demo-to-production gap; cost transparency issues emerge (AWS Quick Suite billing complexity). Critical finding: LLMs remain >80% incorrect on raw data without semantic layer governance. Field consolidates around need for extensive schema preparation, semantic layers, and agentic architectures; pure text-to-SQL continues to fail in production despite research progress.
  • 2026-Q1: Production deployments scale with hybrid architectures: QSR chain achieves 2-3x speedup on Snowflake Cortex semantic model migration; 28-table MySQL case study documents Router Pattern combining text-to-SQL and function-calling; energy efficiency agency enables NLQ across fragmented data (assessment systems, rebates, real estate). Vendor evolution continues (ThoughtSpot Spotter for Industries with semantic layer + connectors). Research identifies fundamental barriers: Ambrosia+ benchmark reveals 36.5% relative improvement needed on ambiguous queries; dense retrieval semantic collapse on negation (MRR 0.023) documented in production biomedical QA. LLM-based relevance judgment shows no advantage over embedding retrieval. Industry consensus: production viability confirmed for organizations with investment in schema curation and semantic layers; mainstream adoption remains absent despite technology advancement; hybrid function-calling preferred over pure text-to-SQL.
  • 2026-Q2: Mainstream adoption inflection signals emerge: Snowflake Q1 FY2026 earnings report 50% of customer base (5,200+ weekly active users) using Cortex AI including Cortex Analyst, with 27% YoY growth in $1M+ customers; represents inflection from niche to mainstream. Production deployments demonstrate scale and architectural maturity: Uber's QueryGPT processes 1.2M interactive queries monthly across Operations with intent-agent + table-agent architecture reducing authoring time 10→3 minutes on 200+ column schemas; Dream11 fine-tuned text-to-SQL (8B params, 250M users) achieves 98.4% execution and 92.5% semantic accuracy outperforming GPT; finance deployments (FrankieOne, Northmill, Austin Capital Bank) enable 232 hrs/week time savings and 30% conversion improvements; Kalvium Labs finance system progresses 40%→91% accuracy through schema context and few-shot learning with validation layer. However, fundamental barriers persist: Cornell research shows 70% of SQL queries covered by 13% of templates, questioning LLM necessity; Berkeley Data Agent Benchmark exposes frontier models (Opus-4.6 43%, Gemini-3-Pro 38%) failing primarily due to context/domain gaps, not capability; market analysis documents semantic modeling complexity and cost unpredictability as adoption barriers across 7 platforms. Industry position: mainstream adoption now emerging for enterprises with substantial semantic layer and schema curation investment; deployment succeeds at scale with agentic, hybrid architectures; production reliability barriers persist despite technology advancement.
  • 2026-May: Snowflake's momentum continued with independent reporting of 9,100+ weekly active Cortex AI accounts and 200% growth in AI workloads; ACL 2026 research showed semantic layers boosting text-to-SQL accuracy 17-23 percentage points across frontier models, and an agentic parallel-exploration approach (PExA) reached 70.2% execution accuracy on the difficult Spider 2.0 benchmark. Production security concerns sharpened: analysis documented that generated SQL can violate permissions and access controls, leak sensitive fields, and return semantically wrong answers despite syntactic correctness, establishing deterministic validation as a non-negotiable production requirement. Finance teams demonstrate matured NLQ practice using multi-turn conversational analytics for scenario planning and variance analysis, moving well beyond simple Q&A.
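The Router Pattern documented in the 2026-Q1 history entry, combining vetted queries with a free-form text-to-SQL fallback, can be sketched as follows. The canned-query table and the unimplemented fallback are illustrative assumptions; the Cornell finding that a small set of templates covers most real queries is what makes the deterministic path worthwhile.

```python
# Router Pattern: send frequent, well-understood questions to pre-reviewed
# SQL; fall back to free-form text-to-SQL only for the long tail.

CANNED_QUERIES = {  # hypothetical vetted templates, keyed by phrase
    "monthly revenue": "SELECT month, SUM(amount) FROM orders GROUP BY month",
}

def text_to_sql(question: str) -> str:
    # Placeholder for an LLM call; a real system would prompt a model with
    # schema context and run the result through a validation layer.
    raise NotImplementedError("free-form path not wired up in this sketch")

def route(question: str) -> str:
    for phrase, sql in CANNED_QUERIES.items():
        if phrase in question.lower():
            return sql                      # deterministic, pre-reviewed SQL
    return text_to_sql(question)            # long-tail fallback

print(route("Show me monthly revenue by region"))
```

Routing the common templates deterministically bounds both cost and risk: the model is only consulted where no vetted query applies, and its output can be gated by the same validation layer.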

TOOLS