Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain.

[Chart: one row per domain, plotted on an axis from Bleeding Edge to Established]

Natural language data querying & semantic search

LEADING EDGE

TRAJECTORY

Stalled

AI that translates natural language questions into SQL queries and performs semantic search across structured and unstructured data. Includes text-to-SQL tools and embedding-based retrieval; distinct from enterprise RAG which retrieves from document collections rather than databases.

OVERVIEW

Natural language data querying has reached critical adoption momentum alongside persistent production barriers quantified by May 2026 research. Snowflake's earnings inflection—50% of customer base (5,200+ weekly active Cortex AI accounts) with 200% growth in AI-related workloads—signals mainstream traction for the vendor platform. Yet May 2026 scholarship sharpens the reality gap: BEAVER benchmark (MIT/Intel/Harvard) tested on real corporate query logs (9,128 pairs across 19 domains, 19 database systems) reveals GPT-4o achieves 82% on sterile academic benchmarks but collapses to 10.8% on actual enterprise data with undocumented schemas—a 71-point accuracy cliff. PolySQL research documents systemic evaluation bias: most benchmarks support only SQLite, masking 10.1% accuracy drops on production databases (PostgreSQL, BigQuery, Snowflake). Success hinges entirely on upfront data engineering: context quality (semantic layer, metadata governance, schema curation), not model capability, determines whether text-to-SQL achieves production accuracy. Only enterprises investing heavily in semantic layer development and schema preparation achieve reliable deployments.

Production deployments (Uber at 1.2M queries/month, Tapestry feedback analysis, finance teams using multi-turn conversational analytics, AmpUp sales agents generating customer briefs with Cortex Code, and an AWS AML case study reducing investigation time from 30-90 minutes to under 5 minutes) demonstrate viability at scale for well-governed data environments. However, the limiting factor is consistently enterprise context: 70% of real SQL queries follow just 13% of templates (Cornell), yet 50% of frontier model failures stem from context/domain gaps rather than model capability (Berkeley Data Agent Benchmark). May 2026 research shows architectural momentum shifting toward agentic, iterative approaches: Amazon's SQL-Trail achieves SOTA on BIRD-SQL through multi-turn reinforcement learning with feedback, and June 2026's FlexSQL reaches 65.4% on the realistic Spider2-Snow benchmark (more representative than base Spider). Yet June 2026 JP Morgan research reveals a critical multi-turn limitation: all five frontier models collapse to 0% execution accuracy by Turn 3 without working memory, indicating fundamental conversational barriers. May-June 2026 security research quantifies deployment risks: generated SQL can violate permissions, leak sensitive fields, or return semantically wrong results despite syntactic correctness, and multi-agent text-to-SQL systems show 30-78% security detection rates, so production deployments require deterministic validation and schema inspection.
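To make "deterministic validation and schema inspection" concrete, here is a minimal sketch of a pre-flight check, assuming a SQLite-compatible dialect. It compiles a candidate query against an empty copy of the schema under a deny-by-default authorizer, so permission problems surface before anything touches real data. The policy sets, the schema, and the `preflight` function are hypothetical names chosen for illustration; `set_authorizer` and the `SQLITE_*` constants come from Python's standard `sqlite3` module. A production warehouse would need its own equivalent of this hook.

```python
import sqlite3

# Hypothetical policy for illustration: readable tables and sensitive columns.
ALLOWED_TABLES = {"orders"}
BLOCKED_COLUMNS = {("orders", "card_number")}

# Hypothetical schema; only the structure is copied, never the data.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, card_number TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
"""


def preflight(sql):
    """Compile `sql` against an empty shadow database; return a list of violations."""
    violations = []

    def authorizer(action, arg1, arg2, db_name, source):
        # Plain SELECT statements and SQL function calls are allowed.
        if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_FUNCTION):
            return sqlite3.SQLITE_OK
        # For column reads, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ:
            if arg1 not in ALLOWED_TABLES:
                violations.append(f"table not permitted: {arg1}")
                return sqlite3.SQLITE_DENY
            if (arg1, arg2) in BLOCKED_COLUMNS:
                violations.append(f"sensitive column: {arg1}.{arg2}")
                return sqlite3.SQLITE_DENY
            return sqlite3.SQLITE_OK
        # Everything else (INSERT, DELETE, DROP, PRAGMA, ...) is denied by default.
        violations.append(f"non-read action denied (SQLite action code {action})")
        return sqlite3.SQLITE_DENY

    shadow = sqlite3.connect(":memory:")
    try:
        shadow.executescript(SCHEMA)       # build the schema before locking down
        shadow.set_authorizer(authorizer)
        shadow.execute(sql)                # denied statements fail at compile time
    except (sqlite3.Error, sqlite3.Warning) as exc:
        if not violations:
            violations.append(f"rejected: {exc}")
    finally:
        shadow.close()
    return violations


print(preflight("SELECT id, total FROM orders WHERE total > 10"))
print(preflight("SELECT card_number FROM orders"))
print(preflight("DROP TABLE orders"))
```

The check is deterministic because it relies on the database engine's own parser rather than on a second model judging the first. It catches permission and field-leak problems only; it cannot tell whether a permitted query answers the question that was asked.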

Vendors have consolidated around agentic architecture and mandatory semantic layers, an acknowledgment that pure text-to-SQL is insufficient. Production case studies demonstrate operational maturity: one conversational SQL platform improved its success rate from 60% to 98% via agent-based Python code generation, and Cortex Code deployment success correlates with data freshness and semantic grounding. Production learnings from OpenAI, Google Cloud, Vercel, and Hex converge: enterprise text-to-SQL requires context pruning, rigorous validation, and governance, and engineering discipline matters more than prompt optimization. Cost-efficient fine-tuning patterns are emerging ($0.80/month for 22,000 queries with LoRA), and research momentum persists (ACL 2026 papers show 70%+ accuracy on specialized benchmarks via agentic approaches). Yet mainstream adoption without substantial implementation investment remains elusive. The practice sits at leading-edge maturity: it is production-ready for organizations with the resources to invest in data governance and schema engineering, and early-adopter advantage is shifting from capability innovation to operational execution.

CURRENT LANDSCAPE

The vendor ecosystem has consolidated around agentic architecture and semantic layers as non-negotiable requirements. Google Cloud's Database Center (GA May 2026, Gemini-powered conversational interface enabling queries across Cloud SQL, Spanner, and Bigtable), Snowflake Cortex Analyst, AWS Quick Suite, ThoughtSpot Spotter for Industries, and emerging players signal ecosystem maturity. June 2026 Snowflake earnings show accelerated adoption inflection: 13,600+ accounts using Snowflake AI capabilities (up from 9,100 in May), 7,100+ using Cortex Code with Snowflake Intelligence accounts doubling QoQ, 34% YoY product revenue growth. Scale AI deployed TextQL's Ana at scale: 1,900 requests/week across Finance/Ops/HR on 1.9T rows with 74.9% monthly adoption growth. Uber's QueryGPT handles 1.2M queries monthly with 10→3 minute authoring speedup; Dream11's platform achieved 98.4% execution accuracy with fine-tuned 8B models on 250M users. AWS deployed production AML alert triage using Cortex Analyst (structured) and Cortex Search (unstructured) reducing investigation time 30-90 minutes to <5 minutes. AmpUp's production comparison of Cortex Code agent effectiveness demonstrates conversational SQL generation quality critically depends on data freshness and semantic grounding, not just model capability. These successes require substantial upfront investment in semantic layers, schema curation, and business context systems. Semantic Layer Summit 2026 (6,000+ attendees) showcased production deployments: Carrefour France migrated 3,000 metrics across 40 countries; Vodafone Portugal reduced metric refresh times from hours to minutes by migrating to semantic layers on BigQuery.

However, May-June 2026 research sharpens the benchmark-to-production reality gap and surfaces additional robustness barriers. The BEAVER benchmark (MIT/Intel/Harvard; 9,128 real corporate query pairs across 19 domains and database systems) reveals the adoption cliff: GPT-4o achieves 82% accuracy on Spider/BIRD academic benchmarks but collapses to 10.8% on actual enterprise data with undocumented schemas, a 71-point drop that reflects a fundamental inability to grasp complex business logic rather than memory limitations. PolySQL research exposes systematic evaluation bias: most text-to-SQL benchmarks test only SQLite, creating false confidence, and cross-dialect evaluation reveals 10.1% average accuracy degradation on production databases (PostgreSQL, BigQuery, Snowflake), due primarily to logical errors (61%) rather than syntax. SpotIt (ICLR 2026) uses formal verification to show that top methods lose 11-14% when evaluated for semantic correctness rather than output matching, demonstrating that benchmarks systematically overestimate capability. June 2026 research reveals new robustness limitations: models produce inconsistent SQL across equivalent database schemas (Gemini 81.6% agreement vs DeepSeek 33.85% on the same data with different schemas), and multi-turn conversational SQL systems collapse without working memory management. Practitioners document silent failures, where queries execute without error but return wrong results because of semantic errors (fan-out traps in joins, NULL inconsistencies, ambiguous business terminology), alongside performance failures and SQL injection vulnerabilities. The Spider benchmark uses 146 clean databases with 5-30 tables; production systems have 400+ tables with opaque naming conventions. Multi-agent text-to-SQL systems show 30-78% security detection rates, with schema metadata flowing uninspected.
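The fan-out trap mentioned above is easy to reproduce. The following sketch uses Python's built-in `sqlite3` module with a hypothetical two-table schema (`orders` and `shipments`, invented for illustration) to show how a join that looks reasonable can silently inflate a total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
-- Order 1 was split into two shipments, so it matches two rows in the join.
INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Plausible generated SQL for "total value of shipped orders": runs without error,
# but order 1 is counted once per shipment.
naive_sql = """
SELECT SUM(o.amount)
FROM orders o JOIN shipments s ON s.order_id = o.order_id
"""

# Same question answered without multiplying order rows.
safe_sql = """
SELECT SUM(amount)
FROM orders
WHERE order_id IN (SELECT order_id FROM shipments)
"""

naive_total = conn.execute(naive_sql).fetchone()[0]
safe_total = conn.execute(safe_sql).fetchone()[0]
print(naive_total)  # 250.0 (order 1 counted twice)
print(safe_total)   # 150.0
```

Both queries are syntactically valid and both return a single number, which is why execution-based checks alone do not catch this class of error.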

Architecture and governance maturity continue to diverge from raw model capability. The dbt Labs April 2026 benchmark quantifies this divergence: the semantic layer (deterministic) achieves 98.2-100% accuracy versus 84-90% for text-to-SQL (probabilistic) on identical business questions, with the key finding that "the Semantic Layer's deterministic query generation means the LLM can't produce subtly wrong results", which addresses the core production failure mode of silent semantic errors. Snowflake's Cortex Sense context-layer benchmark confirms the pattern: agents achieve ~24% accuracy in isolation but ~86% with assembled business context (query history, metadata, BI definitions, semantic views), demonstrating that context rather than model capability is the limiting factor. ByteDance and Georgia Tech's TAHOE system demonstrates this architectural evolution in production: its Spider 2.0-Snow pass rate rises from 61.95% to 79.42% via learned hints, with cross-model transferability (+19.7pp on weaker models) showing the architecture's robustness. However, governance lags deployment: production incidents (Replit, Vibe) show AI agents destroying databases because role-based access controls and pre-flight validation were missing, evidence that adoption is occurring while operational security controls remain immature. Practitioners favour agentic function-calling over direct text-to-SQL because of SQL dialect complexity and model limitations. Production learnings from OpenAI, Google Cloud, Vercel, and Hex converge: most bad queries don't fail during execution, so success requires context pruning, rigorous validation (query shape, missing filters, suspicious joins), deterministic compilation via semantic layers, and governance matching database user access patterns. Cost-efficient fine-tuning patterns are emerging (AWS LoRA approach: $0.80/month for 22,000 queries), but fundamental deployment barriers persist: only 10-20% of AI-generated answers meet business decision thresholds on heterogeneous enterprise systems without semantic layer governance and extensive schema curation.
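The idea of deterministic compilation via a semantic layer can be sketched in a few lines. This is an illustrative toy, not dbt's or Snowflake's implementation: the metric and dimension names, the `policies` table, and the `compile_query` function are all hypothetical. The point it shows is that the model only chooses from curated names, while the SQL itself is assembled from fixed definitions.

```python
import sqlite3

# Hypothetical semantic layer: every metric and dimension is defined once, by hand.
METRICS = {
    "total_premium": ("policies", "SUM(premium)"),
    "policy_count": ("policies", "COUNT(*)"),
}
DIMENSIONS = {"region": "region", "product": "product_line"}


def compile_query(metric, dimensions=()):
    """Turn a (metric, dimensions) request into SQL using only curated definitions."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    unknown = [d for d in dimensions if d not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")
    table, expression = METRICS[metric]
    columns = [DIMENSIONS[d] for d in dimensions]
    select_list = ", ".join(columns + [f"{expression} AS {metric}"])
    sql = f"SELECT {select_list} FROM {table}"
    if columns:
        grouping = ", ".join(columns)
        sql += f" GROUP BY {grouping} ORDER BY {grouping}"
    return sql


# Toy data to show the compiled query running end to end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policies (id INTEGER PRIMARY KEY, region TEXT, product_line TEXT, premium REAL);
INSERT INTO policies VALUES
    (1, 'north', 'auto', 100.0),
    (2, 'north', 'home', 250.0),
    (3, 'south', 'auto', 80.0);
""")

sql = compile_query("total_premium", ["region"])
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)  # [('north', 350.0), ('south', 80.0)]
```

In this arrangement a language model's job shrinks to mapping a question onto a metric name and a list of dimensions. A wrong choice raises an error or returns a clearly labelled metric, rather than producing SQL that is subtly incorrect.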

TIER HISTORY

Research: Jan-2022 → Jan-2022
Bleeding Edge: Jan-2022 → Jul-2024
Leading Edge: Jul-2024 → present

EVIDENCE (142)

— Concrete deployment ROI: PepsiCo 12x faster root-cause analysis (2025); Top-10 pharma $12M opportunity with 2,200% ROI; documents market maturity with vendor consolidation around NL query capabilities.

— Critical negative signal: rigorously documents 70-point accuracy collapse (86-91% benchmarks → 10-21% enterprise); identifies three failure modes (scale, missing semantics, no verification) with empirically validated solutions (context layers recover 70-96%).

— dbt Labs April 2026 benchmark: semantic layer (deterministic) 98.2-100% vs text-to-SQL (probabilistic) 84-90% on same 11 insurance questions; demonstrates architectural solution to silent-wrong-answer failure mode.

— ByteDance + Georgia Tech production system: Spider 2.0-Snow pass rate 61.95%→79.42%; 100% Snowflake syntax; cross-model transferability (+19.7pp on Doubao-2.0-lite); demonstrates hint-learning architecture for real deployments.

— Cortex Sense context layer benchmark: agents alone ~24% accuracy, with context ~86%; demonstrates context assembly—not model capability—as bottleneck; validates semantic layer infrastructure as critical.

— Amazon Science framework outperforms LLM self-eval by 25.78% F1; improves execution accuracy up to 20pp on deployed systems—addresses critical production failure mode of semantically incorrect SQL passing syntax validation.

— Negative evidence: production incidents (Replit, Vibe) where AI agents destroyed databases due to governance gaps; reveals adoption reality—agents deployed in production but operational controls lag, indicating immature security posture.

— CoWork GA with adoption inflection: 13,600+ weekly active accounts with 2x QoQ growth; paired with context layers achieves 5x accuracy improvement (47% baseline, 5x with Atlan context), demonstrating context-not-model as limiting factor.

HISTORY

  • 2022-H1: Early research breakthroughs in text-to-SQL accuracy (DIN-SQL: 85.3% Spider execution), major vendor product launches (AWS QuickSight Q GA), and critical independent assessments questioning real-world efficacy on unseen data. Adoption emerging in BI tools but constrained by accuracy limitations on complex queries.
  • 2022-H2: Two major surveys (COLING, Foundations and Trends) document field maturity and ongoing challenges; AWS expands QuickSight Q to data lakes; IBM research reveals severe generalization gap (75% on Spider vs. <20% on unseen databases); industry papers show incremental progress on semantic parsing. Consensus emerges: capability is real but production reliability remains the primary blocker.
  • 2023-H1: LLM-based methods (SQL-PaLM, RESDSQL) drive research state-of-the-art on text-to-SQL benchmarks; vendors move to product iteration with LLM integration (ThoughtSpot Sage, QuickSight Q expansion). Independent third-party evaluation by 30+ consultants flags usability barriers and training needs despite efficiency gains. Critical analyses emerge questioning semantic search viability for production, highlighting brittleness on unseen schemas and domain drift. Field transitions from research-driven to vendor-driven, but adoption remains limited to vendor internal deployments and niche use cases.
  • 2023-H2: DAIL-SQL advances text-to-SQL SOTA to 86.6% on Spider; BIRD benchmark reveals critical gap (GPT-4 at 54.89% vs. human 92.96% on real databases), signaling production reliability remains the core blocker. Vendor product expansion continues (AWS Q reaching 80+ capabilities, Gartner Challenger status); rare production deployment (DataQue) documented. Critical practitioner assessments reaffirm adoption barriers: poor ROI, data prep burden, limited scope. Research focuses on production reliability (schema scaling, validation-augmented parsing).
  • 2024-Q1: SuperSQL (NL2SQL360 framework) achieves 87% Spider accuracy; new vendor adoption (ServiceNow NLQ GA); research attention shifts to error taxonomy and user recovery strategies, but industry skepticism deepens with Yellowfin CEO documenting structural ambiguity failures in search-based NLQ. Real-world testing continues (Petrobras production database), yet the field consensus turns pragmatic: benchmark progress has decoupled from production viability, and adoption barriers remain fundamentally unresolved.
  • 2024-Q2: Major cloud vendors reach general availability (Amazon Q in QuickSight, AWS CloudWatch natural language querying), signaling product maturity and expanded scope into observability. Named enterprise deployments emerge (CBRE, BCG with Scale AI), validating production viability beyond vendor marketing. Simultaneously, research reveals pervasive annotation errors in benchmark datasets, undermining confidence in reported accuracy improvements. Field continues technical advancement (LLM-based method surveys) while adoption remains constrained by schema complexity, linguistic ambiguity, and poor ROI versus traditional BI tools. Vendor expansion accelerating; mainstream breakthrough absent.
  • 2024-Q3: Amazon Q QuickSight and ThoughtSpot demonstrate production deployments at scale (Cox 2M with 1.5M+ IoT events/hour; Docebo serving 3,800+ customers with 5x adoption increase). Research advances focus on reducing hallucinations (TA-SQL: 21% GPT-4 improvement) and domain knowledge integration; surveys document continued LLM method maturity. Practitioner feedback reveals persistent reality gaps: pure semantic search remains inadequate for structured queries; guided NLQ and hybrid SQL approaches more viable than search-only. Vendor expansion accelerating across cloud platforms; adoption remains limited to early investors despite production viability.
  • 2024-Q4: Spider 2.0 benchmark (632 real-world enterprise tasks) reveals critical gap: LLMs achieve only 17% success vs. 91.2% on Spider 1.0, making the reality gap explicit and data-driven. Vendor expansion continues (Amazon Q unstructured integration, Oracle EBS NLQ GA, Docebo at 4,000+ customers). LLM-based text-to-SQL surveys document method maturity and persistent production challenges. Practitioner reports (Hal9) confirm benchmark-to-reality decoupling. Adoption remains limited to organizations with substantial schema preparation and implementation investment; mainstream breakthrough absent.
  • 2025-Q1: Research advances in text-to-SQL methodology (HES-SQL hybrid reasoning: 79.14% BIRD accuracy with efficiency gains; Pi-SQL pivot-language method showing 3.20 accuracy improvement; systematic NLIDB review documenting persistent challenges). Real-world deployments demonstrate production viability for early adopters (ThoughtSpot/Snowflake retail case study with latency reduction hours-to-minutes; AWS QuickSight scenario analysis agentic features GA). However, industry skepticism deepens: practitioner opinion shifts toward agentic function calling as superior to text-to-SQL; critical assessments highlight SQL dialect complexity and model limitations (Gemini-2.0-turbo 54% failure rate). Comprehensive LLM-based text-to-SQL survey confirms academic recognition and ongoing research investment. Adoption barriers remain: complex schemas, linguistic ambiguity, extensive data preparation. Mainstream breakthrough absent; field remains production-ready for early investors only.
  • 2025-Q2: Vendor product expansion continues (Amazon Q embedded GA in QuickSight across 7 AWS regions, April 2025). Real-world deployments grow (HP Inc. case study: ThoughtSpot on Snowflake with 350 users, 155k queries in 6 months, turnaround days-to-<24hrs). However, domain-specific benchmarks expose persistent LLM limitations: Exaone 3.5 shows 4% accuracy on arithmetic reasoning and 31% on grouped ranking despite 93% on simple aggregation. Academic research broadens (comprehensive systematic reviews of text-to-SQL landscape and techniques). Practitioner assessments document evaluation complexity and production integration challenges (semantic layer, access controls, enterprise data integration). Industry consensus firms: benchmark-to-reality gap now quantified, adoption barriers remain structural, mainstream breakthrough absent. Field remains production-viable for organizations with substantial implementation investment.
  • 2025-Q3: Vendor ecosystem expansion continues with agentic capabilities (ThoughtSpot Gartner 2025 Leader, Verivox 70% adoption, ServiceNow NLQ GA). However, academic research explicitly documents fundamental production barriers: CORGI benchmark reveals LLM performance drops on high-level business questions (21% harder than BIRD); Text-to-Big SQL research identifies critical failures at scale (cost, latency). Research confirms text-to-SQL unsuitable for complex queries; practitioners increasingly favor agentic function calling over semantic search due to SQL dialect complexity and model limitations. Industry consensus unchanged: production viability limited to enterprises with extensive schema curation, semantic layer development, and data engineering resources; mainstream adoption remains absent despite vendor proliferation.
  • 2025-Q4: Vendor ecosystem continues to mature with SQL Server 2025 GA, Oracle EBS NLQ expansion, Elasticsearch hybrid search GA, and agentic AWS Quick Suite launch (Quick Research, Quick chat, Quick scan). Enterprise deployments demonstrate production scale (Odido €1M savings, Thrive Learning 20k+ customers in 6 weeks). However, Q4 research sharpens critique: Promethium's enterprise benchmark analysis quantifies the accuracy cliff (85-90% academic vs. 10-30% enterprise reality; GPT-4o 86% Spider 1.0 vs. 6% Spider 2.0); new hallucination detection research (SQLHD) and critical assessments document schema awareness, hallucination, and performance barriers as fundamental constraints. Research explicitly confirms: traditional text-to-SQL metrics fail at scale (cost, latency); only 10-20% of AI-generated answers meet business decision thresholds on heterogeneous enterprise systems. Industry position firms: mainstream adoption remains absent; deployments succeed only with extensive preparation (semantic layers, schema curation, access control integration); vendors increasingly pivot to agentic interfaces and guided NLQ rather than pure semantic search.
  • 2026-Jan: Research continues on specialized NLQ domains (NL4ST for spatio-temporal querying) while benchmarks expose persistent LLM limitations: CORGI reveals ~50% accuracy gap on complex business logic (GPT-4o ~50% vs. human expectation for production-grade queries). Vendor maturity consolidates (Snowflake Cortex Analyst, Sisense Simply Ask, Querio) with focus on production accuracy evaluation and Intent-First architectures. Developer adoption patterns shift toward self-learning agents (Agno, DuckDB-NSQL-7B) with knowledge-based query generation. Critical assessments document RAG/NLQ failures in production (72% enterprise search queries fail first attempt), reinforcing need for semantic layers and extensive schema preparation. Field shows no mainstream breakthrough; adoption remains limited to enterprises with substantial implementation investment.
  • 2026-Feb: Research advances in robustness (DIVER improves text-to-SQL resilience by 10.82%) while vendor product evolution continues (ThoughtSpot Analyst Studio with agentic data prep, Amazon Q Tokyo region GA). However, practitioner and analyst assessments sharpen critique: independent testing documents silent failures and security gaps in naive implementations; Gartner positions NLQ at Peak of Inflated Expectations; market analysis of 15+ tools highlights demo-to-production gap; cost transparency issues emerge (AWS Quick Suite billing complexity). Critical finding: LLMs remain >80% incorrect on raw data without semantic layer governance. Field consolidates around need for extensive schema preparation, semantic layers, and agentic architectures; pure text-to-SQL continues to fail in production despite research progress.
  • 2026-Q1: Production deployments scale with hybrid architectures: QSR chain achieves 2-3x speedup on Snowflake Cortex semantic model migration; 28-table MySQL case study documents Router Pattern combining text-to-SQL and function-calling; energy efficiency agency enables NLQ across fragmented data (assessment systems, rebates, real estate). Vendor evolution continues (ThoughtSpot Spotter for Industries with semantic layer + connectors). Research identifies fundamental barriers: Ambrosia+ benchmark reveals 36.5% relative improvement needed on ambiguous queries; dense retrieval semantic collapse on negation (MRR 0.023) documented in production biomedical QA. LLM-based relevance judgment shows no advantage over embedding retrieval. Industry consensus: production viability confirmed for organizations with investment in schema curation and semantic layers; mainstream adoption remains absent despite technology advancement; hybrid function-calling preferred over pure text-to-SQL.
  • 2026-Q2: Mainstream adoption inflection signals emerge: Snowflake Q1 FY2026 earnings report 50% of customer base (5,200+ weekly active users) using Cortex AI including Cortex Analyst, with 27% YoY growth in $1M+ customers; represents inflection from niche to mainstream. Production deployments demonstrate scale and architectural maturity: Uber's QueryGPT processes 1.2M interactive queries monthly across Operations with intent-agent + table-agent architecture reducing authoring time 10→3 minutes on 200+ column schemas; Dream11 fine-tuned text-to-SQL (8B params, 250M users) achieves 98.4% execution and 92.5% semantic accuracy outperforming GPT; finance deployments (FrankieOne, Northmill, Austin Capital Bank) enable 232 hrs/week time savings and 30% conversion improvements; Kalvium Labs finance system progresses 40%→91% accuracy through schema context and few-shot learning with validation layer. However, fundamental barriers persist: Cornell research shows 70% of SQL queries covered by 13% of templates, questioning LLM necessity; Berkeley Data Agent Benchmark exposes frontier models (Opus-4.6 43%, Gemini-3-Pro 38%) failing primarily due to context/domain gaps, not capability; market analysis documents semantic modeling complexity and cost unpredictability as adoption barriers across 7 platforms. Industry position: mainstream adoption now emerging for enterprises with substantial semantic layer and schema curation investment; deployment succeeds at scale with agentic, hybrid architectures; production reliability barriers persist despite technology advancement.
  • 2026-May: Two critical research papers hardened the benchmark-to-reality gap: BEAVER (MIT/Intel/Harvard, 9,128 real corporate query pairs) shows GPT-4o collapsing from 82% on academic benchmarks to 10.8% on actual enterprise data with undocumented schemas; PolySQL reveals most benchmarks test only SQLite, masking a 10.1% accuracy drop on production databases (PostgreSQL, BigQuery, Snowflake) driven by logical errors, not syntax. Amazon's SQL-Trail demonstrated that multi-turn RL with feedback substantially outperforms single-pass generation on BIRD-SQL, shifting architectural momentum toward iterative agentic approaches. Google Cloud GA'd Database Center with Gemini-powered conversational querying across Cloud SQL, Spanner, and Bigtable. Production validation: Cortex Code agent quality depends on data freshness and semantic grounding; agent-based code generation improved one platform from 60% to 98% accuracy. ACL 2026 showed semantic layers add 17-23 percentage points across frontier models. Security research established deterministic validation as a non-negotiable production requirement—generated SQL can violate access controls and return semantically wrong answers despite syntactic correctness. The practice's limiting factor remains data engineering investment, not model capability.
  • 2026-Jun: Snowflake Q1 FY2027 earnings confirmed accelerating adoption: 13,600+ accounts using Snowflake AI capabilities (up from 9,100), Snowflake Intelligence accounts doubling QoQ, 34% YoY product revenue growth; Cortex CoWork's 13,600+ weekly active accounts showed 2x QoQ growth with 5x accuracy improvement using context layers (47% baseline). Semantic Layer Summit 2026 (6,000+ attendees) showcased enterprise-scale deployments: Carrefour France migrated 3,000 metrics across 40 countries; Vodafone Portugal cut metric refresh from hours to minutes. AWS published a production AML triage case study combining Cortex Analyst and Cortex Search reducing investigation time from 30-90 minutes to under 5 minutes; Scale AI's TextQL Ana processed 1,900 requests/week across 1.9T rows with 74.9% monthly adoption growth. Multi-agent text-to-SQL security research (373 queries) revealed 30-78% detection rates and architectural blind spots where schema metadata flows uninspected—establishing deterministic validation as non-negotiable. JP Morgan research documented a critical conversational barrier: all five frontier models collapse to 0% execution accuracy by Turn 3 without working memory, and FlexSQL (65.4% on Spider2-Snow) confirmed agentic iterative architectures as the most viable path. The production accuracy cliff sharpened: rigorous industry testing documents a 70-point gap (86-91% benchmarks → 10-21% enterprise), but dbt Labs April 2026 benchmark showed deterministic semantic layers achieving 98.2-100% vs. probabilistic text-to-SQL at 84-90% on identical business questions—with context assembly, not model capability, as the limiting factor confirmed by Snowflake's own Cortex Sense benchmark (~24% without context, ~86% with).

TOOLS